# Too Long, Didn’t Read

*How to summarize papers*

Top row: Cohan, Weld. Bottom: Cachola, Lo. |

Isabel Cachola, Kyle Lo, Arman Cohan, Daniel Weld are the authors of a recent paper on summarizing papers. They are all connected in various way to the Allen Institute for Artificial Intelligence.

Today we take a moment to try out the webtool that comes with their paper. It is noted also in a news item last week in *Nature*.

They have created a program that reads a science article and outputs a single sentence that summarizes its content. Their goal is to help researchers search through the huge number of published papers faster than looking at abstracts. The software utilizes neural networks trained on many examples.

For example: Their own paper has for its abstract:

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations.

And this becomes:

“We introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers.”

Not so impressive, but more on their program shortly.

## Another Tryout

Perhaps the big news this week is an advance on protein folding by Google DeepMind, the same team whose work on AlphaGo and AlphaZero we have covered. Sure enough, their new system is called AlphaFold—actually, AlphaFold2. See here for basic information:

The detailed announcement on the DeepMind blog says:

Until we’ve published a paper on this work, please cite: “High Accuracy Protein Structure Prediction Using Deep Learning,” John Jumper [et al.]. In Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), 30 November – 4 December 2020. Retrieved from here.

The link under “here” goes to a long book of abstracts. Among them, there is a paper last January in *Nature* with many of the same authors, though Jumper not as first author. Using its abstract, we got:

“We train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions.”

The abstracts book has the abstract for the current paper:

In the CASP14 experiment, we deployed AlphaFold 2. This new system uses a different deep learning method than CASP13 AlphaFold, and it produces much more accurate protein structures and estimates of model accuracy. The training data for the system is publicly available and similar to that used for CASP13 AlphaFold.

This is already short, but with SCITLDR it becomes:

“In the CASP14 experiment, we deployed AlphaFold 2.0, a new deep learning system that produces much more accurate protein structures and”

The dangling “and” is a little mystifying. SCITLDR has optional fields for *Introduction* and *Conclusions*. The entry in the abstracts book has a body titled “Methods,” whose last subsection “T1064” reads like a conclusion, so we added them as such. We cleaned up the paste from PDF to have normal line breaks. After a few seconds, we obtained:

“In the CASP14 experiment, we deployed AlphaFold 2.0, a novel attention-based deep learning architecture that produces accurate protein structures”

The DeepMind announcement poses a further challenge: how to glean an important paper that for now exists only as a blog post. The first paragraph of their post is boldfaced and reads like an abstract:

Proteins are essential to life, supporting practically all its functions. They are large complex molecules, made up of chains of amino acids, and what a protein does largely depends on its unique 3D structure. Figuring out what shapes proteins fold into is known as the “protein folding problem,” and has stood as a grand challenge in biology for the past 50 years. In a major scientific advance, the latest version of our AI system AlphaFold has been recognised as a solution to this grand challenge by the organisers of the biennial Critical Assessment of protein Structure Prediction (CASP). This breakthrough demonstrates the impact AI can have on scientific discovery and its potential to dramatically accelerate progress in some of the most fundamental fields that explain and shape our world.

It becomes:

“AlphaFold has been recognized as a solution to this grand challenge by the organizers of the Critical Assessment of protein Structure Prediction (CASP)”

Pretty good—do you agree?

## Even Faster

Both Ken and I are fans of the book *The Mathematical Experience* by Philip Davis and Reuben Hersch. The following passage, from the chapter “The Ideal Mathematician,” describes how one writes a paper, but Ken has always taken it to describe the swiftness needed to read a paper:

The intended readers (all twelve of them) can decode the formal presentation, detect the new idea hidden in lemma 4, ignore the routine and uninteresting calculations of lemmas 1, 2, 3, 5, 6, 7, and see what the author is doing and why he does it.

Nowadays we pore through papers as they appear on arXiv. For most, we just scan the titles. For others, we go to the main page with the abstract. For some, we click on the PDF to read the paper. This may be the greatest exercise in freedom of intellect we have our professional lives, but it is also an exercise in speed. We who do research in time complexity need to minimize our own time.

## Some CS Examples

Another way to say this: we all need to read through the literature faster and more precisely. Here are a few examples of their one sentence abstracts for theory papers from the SCITLDR program. Rate them by seeing if you can match the summarizes with the papers.

Here are the one sentence summaries:

- “Finite automata are considered in this paper as instruments for classifying finite tapes.”
- “The number of steps required to compute a function depends, in general, on the type of computer that is used, on the choice of computer program”
- “In this paper, it is proven that when both randomization and interaction are allowed, the proofs that can be verified in polynomial time are”
- “In this paper a computational complexity theory of the “knowledge” contained in a proof is developed.”
- “It is shown that any recognition problem solved by a polynomial time-bounded nondeterministic Turing machine can be “reduced”
- “We prove that there are arbitrarily long arithmetic progressions of primes.”
- “Nonuniform Upper Bounds: The Converse Direction of the Nonuniform Complexity Bounds .”

Match them to the paper titles:

- The complexity of theorem-proving procedures
- Some connections between nonuniform and uniform complexity classes
- Finite Automata and Their Decision Problem
- A Machine-Independent theory of the Complexity of Recursive Functions
- Some connections between nonuniform and uniform complexity classes
- The Knowledge Complexity Of Interactive Proof Systems
- The primes contain arbitrarily long arithmetic progressions

## Open Problems

The papers:

- “Finite automata are considered in this paper as instruments for classifying finite tapes.” 1
- “The number of steps required to compute a function depends, in general, on the type of computer that is used, on the choice of computer program” 2
- “In this paper, it is proven that when both randomization and interaction are allowed, the proofs that can be verified in polynomial time are” 3
- “In this paper a computational complexity theory of the “knowledge” contained in a proof is developed.” 4
- “It is shown that any recognition problem solved by a polynomial time-bounded nondeterministic Turing machine can be “reduced” 5
- “We prove that there are arbitrarily long arithmetic progressions of primes.” 6
- “Nonuniform Upper Bounds: The Converse Direction of the Nonuniform Complexity Bounds .” 7

The answers in brief: **a-5, b-7,c-1,d-2,e-3,f-4,g-6**.

[fixed blog formatting issues, fixed paper ascription in the intro]

IRTNOG

Is it a good sign when SCITDLR simply takes the first sentence of the abstract as summary?

I’m asking this question since I was at first disappointed by the webtool, trying with my own papers. It seemed to simply cut the abstract after the first sentence. But then I thought I may turn this into a positive comment on my abstracts: They have a very clear first sentence, that efficiently summarize the content of the paper!

Reblogged this on Pink Iguana.

😳 ok, am aware you guys dont regard yourselves as journalists exactly, but am disappointed by your/ this ultimately superficial “coverage” of the deepmind alphafold breakthru, it really deserves better. you say you “covered” deepmind in the past, are you “covering” it again here? look this is a ½ century breakthru and youre analyzing the semantics of the abstract with computer programs.

sigh!😥ok, its totally voluntary what to write on ones blog, its the epitome of an unpaid gig, but c’mon guys youre top computer scientists and so are they, how about some higher professional esteem here… ❗

dont be afraid of physics!

youve written deeply on bioinformatics etc… for 1 voting that you revisit the topic. if not,

whatever!ps like davies + hirsch a lot too, great book! they talk about a mathematical community decades before cyberspace/ blogs/ social media… (pre 21st century book! (c) 1999!) thx for now providing some of that service! 😀