Our condolences on the loss of a colleague

 Cropped from TCS journal source

Alberto Apostolico was a Professor in the Georgia Tech College of Computing. He passed away on Monday after a long battle with cancer.

Today Ken and I offer our condolences to his family and friends, and our appreciation for his beautiful work.

Alberto was still active. He had a joint paper at the recent 2015 RECOMB (Research in Computational Molecular Biology) conference. It was written with Srinivas Aluru and Sharma Thankachan and titled, “Efficient Alignment Free Sequence Comparison with Bounded Mismatches.” Srinivas is also here at Tech and wrote some words of appreciation:

[We] submitted two journal papers from this joint work this month, one on last Thursday night. … When I empathized with his situation, he would remind me that all of our lives are temporary and his situation is no different.

A full session of that conference was devoted to fighting cancer. One can only hope that some of the results of this and other theory conferences contribute to finally solving that problem.

## Words and Work

Alberto’s work was on how much work one needs to do to identify notable properties of words. We mean very long words such as the textual representation of the human genome. Many “obvious” methods for processing strings do too much work. In a Georgia Tech feature several years ago, Alberto put it this way:

How do you compare things that are essentially too big to compare, meaning that the old ways of computing are no longer feasible, meaningful, or both? It’s one thing to compare and classify 30 proteins that are a thousand characters long; it’s another to compare a million species by their entire genomes, and then come up with a classification system for those species.

The theme of a special issue of the journal Theoretical Computer Science for Alberto’s 60th birthday in 2008 was:

Work is for people who do not know how to SAIL — String Algorithms, Information and Learning.

The foreword by Raffaele Giancarlo and Stefano Lonardi lists Alberto’s many contributions.

Giancarlo, who did a Master’s with Alberto in Salerno and then a PhD with Zvi Galil—our Dean at Georgia Tech—told some stories yesterday to a mailing list of their field. As an undergraduate feeling jitters attending a summer school taught by Apostolico on the island of Lipari off the north coast of Sicily, he was greatly heartened to see the leader arriving in his sailboat, named Obliqua (“Oblique”). A month ago Ken was in Sardinia—further north in the same waters—for a meeting of the World Chess Federation’s Anti-Cheating Committee, and offers this peaceful picture of the 41-foot yacht in which they took an excursion.

## Stringology and Zvi

Ken and I have already talked about stringology—see here. Stringology is the study of the most basic objects in computing, linear finite sequences of letters, and is filled with deep and often surprising results. In the post we mentioned the surprise that an ordinary multitape Turing machine working in real time can print a 1 each time the first ${n}$ letters it has read form a palindrome. One can also trace roots to the discovery that string matching—telling whether a string ${y}$ occurs as a subword of ${x}$ and finding it if so—can be done in linear time.
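To ground the linear-time claim, here is a minimal sketch of string matching via the classic failure-function (Knuth–Morris–Pratt) technique; the function names are our own illustration, not from any paper discussed here:

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of
    pattern[:i+1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back along the failure links
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def find(y, x):
    """Index of the first occurrence of y in x, or -1 if none.
    Runs in O(|x| + |y|) time overall."""
    if not y:
        return 0
    fail = build_failure(y)
    k = 0                            # characters of y currently matched
    for i, c in enumerate(x):
        while k > 0 and c != y[k]:
            k = fail[k - 1]
        if c == y[k]:
            k += 1
        if k == len(y):
            return i - len(y) + 1
    return -1
```

Each character of `x` is examined once, and the fallback steps are amortized against earlier matches, which is where the linear bound comes from.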

Alberto and Raffaele wrote a paper about Zvi before either wrote a paper with Zvi: “The Boyer-Moore-Galil String Searching Strategies Revisited.” This paper has a Wikipedia page under the name, “Apostolico-Giancarlo algorithm.” By employing a compact database of certain substrings of the pattern string ${y}$ with room to record their matches and non-matches to parts of ${x}$, they showed how to reduce the overall number of character comparisons to ${2|x| - |y| + 1.}$ The “${2}$” had previously been ${4}$, ${6}$, and ${14}$ in this and related measures, and matched a conjectured lower bound by Leo Guibas.

Before the paper, however, Alberto and Zvi edited and contributed to a highly influential collection of papers from a workshop in Italy sponsored by NATO’s Advanced Sciences Institute in 1984, titled Combinatorial Algorithms on Words. Many great people we know took part and wrote for the volume: Michael Rabin, Andy Yao, Andrew Odlyzko, Bob Sedgewick with Philippe Flajolet and Mireille Régnier, Andrei Broder, Joel Seiferas—and others such as Maxime Crochemore, Shimon Even, the aforementioned Guibas, Michael Main, Dominique Perrin, Wojciech Rytter, and James Storer. A contribution from Victor Miller and Mark Wegman (of universal hashing fame) titled “Variations on a Theme by Ziv and Lempel” is followed by one from Lempel and Ziv themselves.

Alberto and Zvi teamed on another volume in 1997, viz. the book Pattern Matching Algorithms. And yes they did co-write papers, including several on parallel algorithms for string problems—even the basic palindrome-finding problem. Their latest collaboration was a multi-author survey titled “Forty Years of Text Indexing,” which was a keynote presentation at the 2013 Combinatorial Pattern Matching symposium.

## Discovery…

Alberto proved many surprising theorems during his long career. A recent example of his “out-of-the-box” approach is his 2008 paper with Olgert Denas titled, “Fast algorithms for computing sequence distances by exhaustive substring composition.” The abstract notes that the standard edit-distance measure

…hardly fulfills the growing needs for methods of sequence analysis and comparison on a genomic scale […] due to a mixture of epistemological and computational problems.

Well, we have compressed the abstract with a little stringology ourselves. Recall our recent post on new evidence that the time to compute edit distance really is quadratic. Edit distance is so basic, it takes chutzpah to imagine that “alternative measures, based on the subword composition of sequences” could be both quicker and useful. The main theorem is a linear-time algorithm for a distance measure that seems to depend on quadratically-many pairs of substrings. This is a real feat of leveraging the ways words can combine. The paper goes on to show how this is programmed and applied—note also that the link to the paper is on the NIH website.
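For flavor only, here is a toy alignment-free comparison in the same spirit: summarize each sequence by its k-mer frequencies, computable in linear time, and compare the summaries. This generic composition distance is our illustration, not the measure from the Apostolico–Denas paper:

```python
import math
from collections import Counter

def kmer_profile(s, k):
    """Normalized k-mer frequency vector of s, built in O(|s|) time
    (for fixed k)."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def composition_distance(s, t, k=3):
    """Euclidean distance between k-mer profiles: a simple,
    illustrative alignment-free dissimilarity."""
    p, q = kmer_profile(s, k), kmer_profile(t, k)
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2
                         for w in keys))
```

Identical sequences score 0, and no alignment is ever computed; the contrast with quadratic-time edit distance is the point.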

In the Georgia Tech feature we quoted above, Alberto said this about the genome:

It’s the closest thing we have to a message from outer space. We do not know where it comes from, understand very little of what it means, and have no clue about where it is going.

He went on to note the shift from pattern matching to pattern discovery—without saying how far he was in the vanguard on this.

## …and Surprise

Alberto captured discovery by the element of surprise, which can be quantified statistically. One of my favorite papers of his is titled, “Monotony of Surprise and Large-Scale Quest for Unusual Words,” with Mary Ellen Bock and the above-mentioned Lonardi. It is another victory in Alberto’s perpetual battle of linear over quadratic.

Here is the idea. Given a long string ${x}$ and substring ${y}$, let ${f_x(y)}$ be the number of occurrences of ${y}$ in ${x.}$ Now suppose ${x}$ is drawn from a probability distribution in which each character ${x_k}$ depends only on a fixed finite number of previous characters ${x_{k-r}\dots x_{k-1}}$, in a manner that is also independent of ${k.}$ Let ${Z_y}$ be the random variable of ${f_{x'}(y)}$ over ${x'}$ drawn from this distribution. Then

$\displaystyle z_x(y) = \frac{f_x(y) - E[Z_y]}{\sigma(Z_y)}$

is a normal statistical z-score, where ${E[\cdot]}$ stands for expectation and ${\sigma}$ for standard deviation. If ${z_x(y) \geq T}$ for some high positive threshold ${T}$, then ${y}$ occurs unusually frequently in ${x}$ and so is a surprising substring. Substrings ${y}$ with ${z_x(y) \leq -T}$ are surprising by their lack of expected frequency in the given long text ${x.}$
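To make the quantities concrete, here is a minimal sketch under the simplest possible model: characters drawn i.i.d. with probabilities estimated from ${x}$ itself, and a binomial approximation for ${E[Z_y]}$ and ${\sigma(Z_y)}$ that ignores self-overlap of ${y}$. The paper's Markov setting is more general; this is only an illustration:

```python
import math
from collections import Counter

def surprise_z(x, y):
    """Approximate z-score of y's occurrence count in x under an
    i.i.d. character model, using a binomial approximation that
    ignores overlap correlations between occurrences of y."""
    n, m = len(x), len(y)
    prob = {c: v / n for c, v in Counter(x).items()}
    # Probability that y starts at any one fixed position.
    p = math.prod(prob.get(c, 0.0) for c in y)
    positions = n - m + 1
    mean = positions * p
    sd = math.sqrt(positions * p * (1 - p))
    observed = sum(x.startswith(y, i) for i in range(positions))
    return (observed - mean) / sd if sd > 0 else 0.0
```

On a text like `"ab" * 8`, the substring `"ab"` gets a large positive score while `"aa"` (which never occurs despite both letters being common) scores negative, matching the over-/under-represented distinction above.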

Telling which substrings are surprising still seems to face the wall that there are quadratically many substrings overall. But now the authors work a little structural magic. Suppose we have a partition ${[C_1,C_2,\dots,C_\ell]}$ of the substrings into ${\ell}$-many classes such that each ${C_i}$ has a unique longest and shortest member. Given ${y \in C_i}$, put ${y}$ into ${{\cal O}^T_z}$ if ${z_x(y) \geq T}$ and no other string in ${C_i}$ has a higher z-score, and ${y \in {\cal U}^T_z}$ if ${z_x(y) \leq -T}$ and no other member is more negative. They prove:

Theorem 1 Given any ${x}$ of length ${n}$ and an ${\ell}$-partition of its substrings as above, and any fixed displacement ${T > 0}$, the sets ${{\cal O}^T_z}$ and ${{\cal U}^T_z}$ can be computed in ${O(n+\ell)}$ time.

We have skimmed some fine print about complexity of the distribution and access to the longest and shortest strings in the ${C_i}$, plus how they make everything work also for other score functions ${z = z_x(\cdot).}$ The way they combine the monotonicity of ${f(y)}$ with regard to sub- and super-strings of ${y}$, properties of convexity and concavity, numerical analysis, and graph-theoretic diagrams of the substring structures is a tour de force—the paper rewards attention to its details.
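The selection step alone can be sketched naively. This toy version takes the partition and score function as given and runs in time proportional to the total number of substrings listed, nowhere near the theorem's ${O(n+\ell)}$ bound, which relies on the suffix-structure machinery we have only gestured at:

```python
def extreme_representatives(classes, z, T):
    """Given a partition of substrings into classes and a score
    function z, keep each class's top scorer if it clears T and
    its bottom scorer if it falls below -T. Illustrative only:
    the paper avoids enumerating the classes' members explicitly."""
    over, under = [], []
    for cls in classes:
        top = max(cls, key=z)
        bot = min(cls, key=z)
        if z(top) >= T:
            over.append(top)
        if z(bot) <= -T:
            under.append(bot)
    return over, under
```

The monotonicity results are what let the real algorithm find each class's extreme scorers without touching the quadratically many substrings in between.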

## Open Problems

We could go on… But let’s stop now and repeat once more our condolences to his family and friends. Georgia Tech will be putting together a memorial in his honor—perhaps you will be able to attend.