How linear algebra can make databases go really fast

Mike Stonebraker is one of the world’s leading experts on database technology. He started in academia at Berkeley, is now back in academia at MIT, and has launched a raft of successful companies. He is currently a co-founder and chief technology officer of at least three startups in the database area. One is called “Data Tamer” and is a joint venture with researchers from QCRI, the Qatar Computing Research Institute; see this release.

Today I would like to talk about his presentation at the TTI Vanguard meeting on “Ginormous Systems” in DC. His talk was on “Seven Big Data Megatrends.”

By the way, the word “Ginormous” is a real word; see here for the formal definition. I initially thought the Vanguard organizers had made it up, but it is real. It should be obvious that ginormous means large, actually really large. This Vanguard meeting was dedicated to ginormous systems of all kinds: from huge data centers, to city-wide systems, to supercomputers, and much more.

In Mike’s wonderful talk he made seven points about the past, present, and future of database technology. He has a great track record, so his guesses are likely mostly right. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

• It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually run—it is fifty times faster than classical database implementations. As theorists we like speedups, especially asymptotic ones, but 50x is pretty cool. That is enough to change a query from an hour to a minute.
• It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.
• It is an idea whose inventor no one seems to know. I asked Mike, I asked other experts at the conference, and all shrugged and said effectively: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

## Some Background

Modern databases store records—lots of them, usually on disks. A record is a fixed-size vector of information. The vector is divided into fields, where each field stores one type of information. An example is:

$\displaystyle [name, address, home\text{-}number, cell\text{-}number, age, \dots ].$

Thus the first field is the person’s name, the next the address, and so on.

In a sense the data is really stored in a table—or an array if you wish to be mathematical—call it ${D}$ for data. Each row contains a record, and each column stores a field.

The issue is how the array is stored on the disk. In the classical layout, the records are stored one after the other, as

$\displaystyle R_{1},R_{2},\dots,R_{N}.$

Here each ${R_{i}}$ is the ${i^{th}}$ row.

This is a reasonable method: it keeps each record together and allows fast access to all of the records. Thus, a query can scan over all the records by reading the disk one track at a time. This is not a bad way to use a disk-like device.

Mike points out that all the classic database systems—well, at least most—store records in this manner. Their code, which is also huge (if not ginormous), is tuned to handle data stored in this manner. Let’s call it the “record ordered method” (ROM). As a mathematical idea it is just storing the array in row-major order. Not only is this a perfectly fine way to organize the data and to store the array, it respects principles that go back to COBOL in the 1950s: each data object should be conceptually and physically together.
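
As a concrete sketch of the ROM, here is row-major storage in Python. The field names and byte widths are made up for illustration; the point is that each record occupies one contiguous slice of the byte buffer, exactly as it would on a disk track:

```python
import struct

# Row-major (ROM) layout sketch with hypothetical fixed-size fields:
# name (16 bytes), age (4-byte int), balance (8-byte float).
RECORD = struct.Struct("<16sid")

def pack_rom(records):
    """Store records one after the other: R_1, R_2, ..., R_N."""
    return b"".join(RECORD.pack(name.encode(), age, bal)
                    for name, age, bal in records)

def read_record(buf, i):
    """Random access to record i: one contiguous slice of the buffer."""
    name, age, bal = RECORD.unpack_from(buf, i * RECORD.size)
    return name.rstrip(b"\x00").decode(), age, bal

rows = [("alice", 30, 10.0), ("bob", 25, 20.0), ("carol", 41, 30.0)]
buf = pack_rom(rows)
```

Reading any single record is one contiguous read, which is the layout's strength; the weakness, as we will see, is that reading one *field* of every record still drags the whole buffer past the disk head.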

But there is a better way.

## The Trick

The trick to the 50x speedup is based on the deep, advanced, complex operation that we in math call the transpose of a matrix. Just kidding. It is based on the simple idea that instead of storing the matrix ${D}$ we store the matrix ${D^{\intercal}}$. Recall ${D^{\intercal}}$ is just the matrix defined by

$\displaystyle D^{\intercal}(j,i) = D(i,j).$

Let’s call this the column ordered method: COM. Now the data on the disk contains

$\displaystyle C_{1},C_{2},\dots,C_{M}.$

Here each ${C_{j}}$ is the ${j^{th}}$ column.
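
In code, the switch from ${D}$ to ${D^{\intercal}}$ just means one contiguous array per field instead of one buffer of records. A minimal Python sketch, again with hypothetical fields:

```python
from array import array

# Column ordered method (COM) sketch: store the transpose D^T, i.e.
# one contiguous array per field (hypothetical: name, age, balance).
rows = [("alice", 30, 10.0), ("bob", 25, 20.0), ("carol", 41, 30.0)]

names    = [r[0] for r in rows]               # column C_1
ages     = array("i", (r[1] for r in rows))   # column C_2, packed ints
balances = array("d", (r[2] for r in rows))   # column C_3, packed floats

# A query over ages touches only the ages array, never names or balances.
young = [i for i, a in enumerate(ages) if a < 35]

# Reconstructing row i means gathering the i-th entry of every column.
def record(i):
    return (names[i], ages[i], balances[i])
```

Note the trade: a query on one field scans one small array, but reassembling a whole record now requires a gather across all the columns.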

So why is this method so much faster than the ROM? The answer lies in how the data is accessed by the queries. The data is read much more often than it is written, so the key is to speed up reading. But the critical insight is this:

A query is likely to use only a few columns.

For example, suppose the query is:

Select all the records with age in the range [21,31] and cell phones with area code 404.

Then the query needs to look at only two columns. All the other fields are completely unneeded.

Now suppose the records have a hundred fields. Since the query only looks at two of them, the speedup is roughly ${100:2}$, or 50x. In the COM the database algorithm reads only the data it needs to answer the query. In the ROM it reads all the data, which tremendously slows down the query. Note that things can be even worse, since the sizes of fields can vary widely. So the true speedup depends on the ratio of

$\displaystyle \frac{\text{number of bits in the whole record}}{\text{number of bits used in the query}}.$

Clearly if a record has even one large field that is not used in the query, the speedup could be very large.
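
That ratio is easy to play with. A toy calculation, where the field widths are invented for illustration:

```python
# Toy estimate of the COM speedup, with hypothetical byte widths per field.
fields = {"name": 16, "address": 64, "home-number": 8,
          "cell-number": 8, "age": 4}

def com_speedup(query_fields):
    """ROM must read the whole record; COM reads only the queried columns,
    so the speedup is (bits in whole record) / (bits used by the query)."""
    whole = sum(fields.values())
    used = sum(fields[f] for f in query_fields)
    return whole / used

# The age/area-code query touches only 12 of 100 bytes per record:
speedup = com_speedup(["age", "cell-number"])
```

Here the large unused `address` field dominates the numerator, which is exactly the "one large field not used in the query" effect noted above.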

How did people not realize this simple idea, replacing the table ${D}$ by its transpose ${D^{\intercal}}$? Well, they did not actually miss it, but its power was not appreciated until relatively recently.

## Whose Trick?

As I stated earlier, no one seems to be able to say exactly who discovered the COM. Maybe as a default we could call it the Gauss Database Method, since most things are named for him. I did track down a system called TAXIR, from 1969, that was essentially a COM storage system focused on information retrieval in biology. The paper describing it is by George Estabrook and Robert Brill. Maybe they invented it. Perhaps their focus on biology made it hard for those in databases to notice their work, especially years ago, before powerful on-line search engines. Perhaps.

Ken adds that in a textbook used years ago for Buffalo’s course on programming-language concepts, the COM idea was called “parallel arrays” and was frowned upon. The main reason given was that this structure was hard to maintain, as a single off-by-one indexing error in one array could damage the entire set of records. However, a high-level system can keep the data in sync, while modern machine architectures increase the reward for keeping just the data you need in caches.
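
The off-by-one worry about parallel arrays can indeed be handled a level up. A sketch of a toy wrapper that routes every mutation through one method, so the columns cannot drift out of sync (the class and field names are invented for illustration):

```python
class ColumnStore:
    """Toy 'parallel arrays' store: one list per field, but all mutations
    go through insert(), so the columns always stay the same length."""

    def __init__(self, *field_names):
        self.cols = {f: [] for f in field_names}

    def insert(self, **record):
        # Reject partial records; this is what keeps the arrays in sync.
        if set(record) != set(self.cols):
            raise ValueError("record must supply every field exactly once")
        for f, v in record.items():
            self.cols[f].append(v)

    def row(self, i):
        """Gather the i-th entry of every column back into a record."""
        return {f: col[i] for f, col in self.cols.items()}

db = ColumnStore("name", "age")
db.insert(name="alice", age=30)
db.insert(name="bob", age=25)
```

The point is that the fragility lay in hand-maintained indices, not in the layout itself.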

## Open Problems

Okay, maybe the trick is not worth a trillion dollars. But the total amount invested yearly in data systems suggests that the column idea could over the next few years be worth quite a few dollars.

A simple thought: Is the column method the best way to store records? Can we in theory prove it is best in some sense, or is there an even better method? So forget the million-dollar Clay prizes and go after the real money.

May 2, 2013 2:37 pm

Hi Dick

I also don’t know who invented this idea, but it probably goes back into the 50s. “Thinking in terms of columns” was a heuristic that was around when I started working with computers — originally on punch card accounting machines (PCAM) ca 1961.

The most used “column idea” was “radix sorting”—which I think goes back to Hollerith—but I recall “searching the columns” in card data bases in which the most searched columns were put on the same cards of multi-card records. What to do when magtape started to be used was a pretty hot debate in the Air Force computer center where I worked.

Cheers,

Alan

2. May 2, 2013 3:00 pm

Thanks for the thoughtful writeup.

You do know there are a bunch of column-oriented databases already in the marketplace, right? Beyond the proofs-of-concept Alan mentions, there are commercial products: SybaseIQ for many years, more recently Vertica, MonetDB, etc.

3. May 2, 2013 3:17 pm

there’s a lot of innovation lately in this space, esp with new approaches like large RAM caches and hashkey databases. and it’s being used in the latest/cutting-edge generation of companies like facebook.

imho there is a lot of room for optimizing databases further, partly because system boundaries tend to make the interactions opaque. here is a related idea I have been toying with recently. most applications have only a small set of queries. I have worked with very large apps that have, say, only a few dozen queries. which is a slightly staggering thought, because the database can support extremely many queries via SQL, but in practice only a tiny fixed subset of the possibilities is used.

the database is generally “ignorant” of those queries until they are passed in realtime from the app to the database. the database may react by compiling the sql or caching the queries to some degree, but here is a simple idea that could really be a gamechanger: a more adaptive/intelligent approach:

suppose the database were informed of the structure of all queries the app was going to make ahead of time. then the database self-optimizes its structure to accommodate these queries, as far as automatically creating applicable indexes and even access methods/strategies etc (eg imagine that it decides to make one table row ordered, one column ordered, denormalize other stuff, normalize yet other etc—wow!). another, less invasive method is for the system to recognize the same queries over time, compile a list of the frequent ones, and optimize on that (instead of the initial “declaration” by the app).

in many of the systems I’ve worked with, there is poor maintenance of indexes also. this is a DBA-type job that requires a lot of care/attention but which is skipped by many. a simple index can convert a query from linear $O(n)$ to logarithmic $O(\log n)$ time, and yet indexes are not applied in many cases. note that tracking indexes is a bit of a maintenance hassle & really ought to be more systematized anyway.
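
The linear-to-logarithmic point in the comment above can be sketched with a sorted index over one column. Real databases use B-trees, but the complexity story is the same; the data here is invented for illustration:

```python
import bisect

ages = [41, 25, 30, 25, 52]           # unindexed column: query is an O(n) scan
index = sorted((a, i) for i, a in enumerate(ages))   # build once, O(n log n)
keys = [a for a, _ in index]          # sorted age values for bisection

def rows_with_age(a):
    """All row ids with the given age, found by two O(log n) bisections."""
    lo = bisect.bisect_left(keys, a)
    hi = bisect.bisect_right(keys, a)
    return sorted(index[k][1] for k in range(lo, hi))
```

The index must of course be maintained on every insert and delete, which is exactly the DBA-side hassle the comment mentions.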

May 2, 2013 4:11 pm

Columnar databases are commonplace now. The “NoSQL” tabular data stores are typically columnar. HP Vertica, Teradata Aster, EMC Greenplum, ParAccel (used by Amazon in its data warehouse in the cloud – Redshift) are all columnar. Column oriented databases also allow for more efficient compression of data, because columns typically have less variation in values than rows.

Since the global Information Technology spend is almost \$4 trillion a year, it is plausible that this is indeed a trillion dollar math trick.

May 2, 2013 4:19 pm

Just to clarify a bit further from the systems perspective: column-oriented databases are a huge win for data-warehousing-style applications where data is loaded in bulk and queries do complex analysis. There is the advantage you mentioned of reading fewer bytes. Another interesting phenomenon is that column-oriented data is almost always easier to compress, as there tends to be less entropy within the compression window if data comes exclusively from a single column. I wonder if there are any interesting theory questions related to that observation.

For other applications, such as those that tend to read or update a couple of rows at a time, row-oriented databases are often more appropriate as the necessary data is less scattered around.
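
The compression observation above is easy to demonstrate: interleaving a low-entropy column with a high-entropy one spoils the matches a compressor can find, while a per-column layout keeps them intact. A toy demonstration (not a benchmark) with synthetic, invented data:

```python
import random
import zlib

random.seed(0)
# Two hypothetical columns: a low-entropy country code, a high-entropy token.
countries = [random.choice(["US", "DE", "JP"]) for _ in range(2000)]
tokens = ["%08x" % random.getrandbits(32) for _ in range(2000)]

# Same bytes overall, arranged row-wise vs. column-wise.
row_layout = "".join(c + t for c, t in zip(countries, tokens)).encode()
col_layout = ("".join(countries) + "".join(tokens)).encode()

# The repetitive country column compresses to almost nothing when it is
# contiguous, so the column layout wins.
assert len(zlib.compress(col_layout)) < len(zlib.compress(row_layout))
```

Real column stores push this further with column-specific encodings such as run-length and dictionary encoding.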

May 2, 2013 6:16 pm

> “So forget the million-dollar Clay prizes and go after the real money.”

“Mathematical fame, if you have the cash to pay for it, is one of the soundest and steadiest of investments.” (Godfrey Hardy)

May 2, 2013 8:29 pm

In high-energy physics we call them column-wise ntuples (CWN). I first encountered them around 1992, when they were added to the CERN HBOOK package. All the many (many) petabytes of LHC data are stored in column orientation, with a lot of tuning. If we had used a row oriented format, it would have been prohibitively costly to scale the computing facilities to analyze the volume of data.

8. May 2, 2013 10:09 pm

The APL world has been doing this forever. We call them inverted databases. I am currently working on a C# system that accesses inverted data and it easily beats record style layouts.

9. May 2, 2013 10:32 pm

Question? what does it happen with linear algebra computations. if you surpass the speed of the light? lets say the Hertz speed wave has as equal the speed of the light. will the data be lost if you have the speed of the light C as a constant = 1 square? will you be able to pair Hendricks’ binaries with data recorded faster than the speed of light. Lasers will do it? if the particle accelerator travels almost to the speed of light?

10. May 3, 2013 12:02 am

The Geospatial Information System (GIS) world would probably love to make use of this, as they are all dealing with the exponential increase in spatial data (which just keeps growing) and figuring out not only how to store it, but how to retrieve it easily when needed.

May 3, 2013 3:55 am

I think that the COM way of storing data was the standard way of organizing data in the FORTRAN world, before the structured-data paradigm was invented in the ’70s. In those old languages, structured data (i.e., records) could not be declared. To store a set of records made of N different fields, you needed N tables of some basic data type. The ROM way of organizing data was an improvement over languages like FORTRAN.
In a certain sense, the COM way of organizing data was the first proposed, and the ROM is an improvement, at least for programming languages.
However, for databases, the story is different.

May 3, 2013 5:48 am

Reblogged this on Pink Iguana and commented:
Nice summary of column stores

May 3, 2013 7:21 am

Don’t want to be “that guy”, but statements like

“He has a great track record, so likely he is mostly right on his guesses.”

should probably not be uttered by anyone with a firm understanding of probability.

May 3, 2013 9:33 am

If there were truly random access memory, where each access (in a sequence of accesses) to different memory cells could be treated as equal, there would be no difference between ROM and COM at all. Consider fixed-size fields: then there is a simple bijective relationship from the row and column index to the linear memory cell index. So, in theory, you would not have to scan the complete table to do a query with ROM but could just skip to the right bits inside each row.
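
The bijection this comment describes is simple arithmetic for fixed-size fields. A sketch, with hypothetical byte widths:

```python
# Map (row i, field j) to a flat byte offset in either layout.
sizes = [16, 4, 8]                             # widths of a 3-field record
row_size = sum(sizes)                          # 28 bytes per record
col_start = [sum(sizes[:j]) for j in range(len(sizes))]

def rom_offset(i, j):
    """Row-major: record i is one contiguous block; field j sits at a
    fixed offset inside it."""
    return i * row_size + col_start[j]

def com_offset(i, j, n_rows):
    """Column-major: column j starts after all earlier (full) columns,
    then entry i sits at a stride of that field's width."""
    return n_rows * col_start[j] + i * sizes[j]
```

With uniform-cost memory either formula would do; the comment's point is that caches and disk blocks make the cost of these two access patterns very different.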

The problem is that the cost function for getting a sequence of memory cells with today’s architectures is quite complicated because you possibly and realistically have a layered system of caches:

* hard disk access
* hard disk cache
* RAM level cache
* cpu caches

each with its own characteristics for how fast it is to access two adjacent (or distant; adjacency and distance can differ between the layers) memory cells in sequence. This means that in practice you may very well be scanning through the complete table with ROM, because e.g. the hard disk has an access granularity of only 512 bytes, which are then cached for a while, etc.

Therefore the general advice holds: putting together data that is accessed together is usually a good idea.

There’s another trade-off to consider when talking about database queries. A query may consist of two parts: the filtering part and the result-selection part. For the filtering part a COM usually makes more sense, because you query only a few of the fields. For the selection part, however, you usually browse the filtered results by rows (even if only part of a row is actually needed), so those accesses may be faster with ROM.

So this is less about linear algebra and more about finding good compromises that optimize common use cases against the cost function.

15. May 3, 2013 9:34 am

“Maybe as a default we could call it the Gauss Database Method”

Ha! Love it. It has to go with a story about how he discovered it while doing his hobby of manually calculating ginormous tables of logarithms, and found it “not important enough to publish.”

16. May 3, 2013 7:14 pm

Text retrieval systems have been using inverted indexes since the early 1960’s, if not before. Those are column stores that also take advantage of sparsity.
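
A minimal sketch of such an inverted index, over a toy corpus invented for illustration. Each posting list records only the documents that contain a term, which is where the sparsity advantage comes from:

```python
from collections import defaultdict

docs = {0: "column stores are fast",
        1: "row stores",
        2: "fast stores win"}

# term -> sorted list of ids of documents containing it (a posting list).
index = defaultdict(list)
for doc_id in sorted(docs):          # sorted so posting lists come out sorted
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def search(term):
    """Documents containing the term; terms absent from a document cost
    nothing to store, unlike a dense document-by-term table."""
    return index.get(term, [])
```

Viewed as a matrix of documents by terms, this is precisely a column store over an extremely sparse table.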

17. May 4, 2013 10:22 am

I think this may be ignoring the difference between indexed and non-indexed relations. Doing queries on columns doesn’t tell you much unless one of those columns is a unique ID of some sort, so it takes at least 3 columns to discover the relation between 2 other dimensions.

18. May 4, 2013 11:08 am

If we are thinking about records of a fixed finite length $k$ and a fixed signature $X_1, \ldots, X_k$ then a relational data base is a finite subset $D$ of a $k$-dimensional coordinate space $X = X_1 \times \ldots \times X_k = \prod_{j=1}^k X_j.$

Given a non-empty subset $J$ of the indices $K = [1, k]$, we can take the projection $\text{proj}_J$ of $D$ onto the subspace $X_J = \prod_{j \in J} X_j.$

Saying that “a query is likely to use only a few columns” amounts to saying that most of the time we can get by with the help of our small dimension projections. This is akin to a very old idea, having its ancestor in Descartes’s suggestion that “we should never attend to more than one or two” dimensions at a time.
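
The projection $\text{proj}_J$ in this comment is a one-liner over a relation represented as a set of tuples (indices 0-based here, data invented for illustration):

```python
# proj_J: project a relation D ⊆ X_1 × ... × X_k onto the columns in J.
def proj(D, J):
    J = sorted(J)                   # fix a column order for the subspace X_J
    return {tuple(t[j] for j in J) for t in D}

D = {("alice", 30, "404"), ("bob", 25, "404"), ("carol", 30, "212")}
# A query on {age, area code} needs only the 2-dimensional projection:
ages_codes = proj(D, {1, 2})
```

The result is a set because a projection can collapse distinct records onto the same lower-dimensional point.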

May 4, 2013 2:11 pm

>“[the row-oriented storage method] respects principles that go back to COBOL in the 1950′s: Each data object should be conceptually and physically together.”

I’d like to point out that one of the goals of database technology, at least since the development of the relational model, has always been to decouple the physical representation from the logical (even more, from the conceptual) representation. This is the so-called _physical independence_ principle. Implementations have seldom taken the principle to heart. This may be due to many reasons, but in my opinion one is just: “things work just fine, most of the time, if we simply implement relations as sequences of records—so why bother?” (btw, this implementation was hinted at by Codd himself in his original paper on the relational model). Given today’s shift towards storing and processing bigger and bigger amounts of data, software vendors are now _forced_ to take physical independence seriously.

20. May 5, 2013 12:12 am

Some datasets are just as complex as they appear at first sight; others are subject to laws whose discovery permits radical simplification. We pray with Saint Augustine for the wisdom to know the difference. Some call this the problem of induction, others abduction, and the promise of declarative, deductive, or logical database systems is that representing empirical datasets in a fluid enough logical system would permit the natural contours of the data to come to light in the normal process of logical simplification. The potential is yet to be actualized as much as we might hope.

May 5, 2013 2:42 am

This was all over the news two years ago when Stonebraker made some controversial statements about NoSQL databases. [1] http://www.theregister.co.uk/2011/07/13/mike_stonebraker_versus_facebook/ [2] http://gigaom.com/2011/07/11/amazons-werner-vogels-on-the-stonebrakernewsql-debate/

22. May 5, 2013 7:12 am

Back around 1980, I worked briefly alongside Mike Stonebraker in the INGRES Corporation – and I still treasure his papers from over the years.

Even back in the 1970s, we were aware that one ‘record’ might usefully be split into two records in a 1-1 relationship (each held in its own table on separate disc-drives), that tables could be vertically-partitioned and, obviously, that keys might be usefully held on fields … all examples of the technique proposed.

Guy H

23. May 9, 2013 11:59 am

i’ve been making my living programming an APL-descended column db (http://kx.com/) for almost seven years, and yes, it’s a huge improvement in efficiency. it easily handles a >10 billion row, ~.5 trillion cell database (on a single server, not sharded), and that’s just the one db i work with on a daily basis. when it’s used for stock market data, 2 billion rows a day is not uncommon.

24. June 1, 2013 8:34 am

Reblogged this on justanotherhumanoid.