
What Is Big? What Is Small?

December 16, 2010


How does the growth in computer memory change applications?

Andrei Broder is one of the top experts in computational theory, and is especially famous for his work on algorithms. A recurrent theme that makes his work so wonderful is that some of his best theoretical work is also extremely practical. No galactic algorithms for Andrei.

Today I want to talk about computer memory, and how applications’ need for memory can shape them.

Broder’s work on the “power of two choices” phenomenon with Yossi Azar, Anna Karlin, and Eli Upfal is some of the best algorithmic work ever. Mathematicians and theorists have been studying the throwing of {n} balls into {n} bins for decades. It was a huge surprise that after such intensive study, Broder and his colleagues could discover a method which exponentially improved the previous best results. With one random choice per ball, the maximum number of balls in a bin grows like {\ln n / \ln \ln n}; they showed that allowing each ball two choices reduces this to {\ln \ln n /\ln 2 + \Theta(1)}. A beautiful result—take a look at their paper or the many versions of it on the Web. It is a classic.
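
To see the phenomenon concretely, here is a minimal simulation sketch in Python (my illustration, not code from their paper):

    import math
    import random

    def max_load(n, choices, rng=random.Random(0)):
        # Throw n balls into n bins; each ball goes to the least-loaded
        # of `choices` bins picked uniformly at random.
        bins = [0] * n
        for _ in range(n):
            candidates = [rng.randrange(n) for _ in range(choices)]
            bins[min(candidates, key=bins.__getitem__)] += 1
        return max(bins)

    n = 1_000_000
    print("one choice :", max_load(n, 1))   # grows like ln n / ln ln n
    print("two choices:", max_load(n, 2))   # grows like ln ln n / ln 2 + O(1)
    print("ln ln n / ln 2 =", math.log(math.log(n)) / math.log(2))

The exponential drop from the first printed load to the second is exactly their phenomenon.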

Broder has also worked on the question of large data sets—at least indirectly. He was a central figure in one of the best early search engines, AltaVista, and for a while was its chief scientist. While Google is now the dominant search engine, AltaVista introduced many important features into searching, features that we now take for granted, and Broder helped make it the leading search engine for several years. He is now at Yahoo, where he works on search and computational advertising.

Andrei once told me that when AltaVista was first implemented it was available only internally to DEC employees. In order to update the data structures that stored the search information, they rebuilt the system during the night. This meant that the search engine was unavailable for about 30 minutes late each night.

Andrei said that they quickly began to get heated complaints that said

“what’s up, I need the search engine now.”

And far worse complaints. Broder said the AltaVista team realized that DEC employees all over the world were using the engine. There was no good time to stop the search engine so that they could update the tables. None. Late night on the US east coast was a perfectly fine working time somewhere else in the world.

So the team re-implemented the entire AltaVista engine. They had to rebuild it completely so that the updates could happen while the engine was running. Andrei told me this made a big impression on him—successful tools like AltaVista had to run 24/7. Users expected it to always be available: every day, every hour, every minute.

Three Points About Memory

Let me make three points—you did not expect four? They concern the growth of, and the need for, computer memory.

  1. Need for memory will continue to grow forever:

    \displaystyle \text{memory} \ \rightarrow \aleph_1.

  2. Ability to deliver cheap memory will continue to grow for many years:

    \displaystyle \text{memory} \ \rightarrow \aleph_0.

  3. Some requirements of memory will plateau:

    \displaystyle \text{memory} \ \rightarrow \ \text{limit}.

(1) The memory needed will continue to grow. I doubt this will be a shock to anyone. In the history of computing, the one quantity that we have consistently underestimated is memory—or space, as we say in complexity theory. There will always be applications that need more memory than is currently available.

(2) The memory available will continue to grow, but there may be limits. The key word is may. As I have said many times, it is very hard to prove physical limits. Perhaps twenty years from now people will be shocked that we now use electrons to store information; perhaps they will use some exotic quantum effect. Who knows.

I used the math symbol {\aleph_1} and later {\aleph_0} to make the point that the need for memory is going to continue to grow toward a potentially larger “infinity” than the amount of memory available. Of course this is only an approximate notion: it is quite an abuse of these symbols {\aleph_0,\aleph_1}, since there is not really a different type of infinity involved here. But it is fun to pretend.

The cost of memory is still going down rapidly. Moore’s Law on the decreasing cost of cycles seems to be slowing, but there are indications that flash memory is still getting cheaper at a phenomenal rate: at least 50 percent per year.

(3) The memory needed for some applications will plateau. Let me discuss that next.

Plateau?

Anyone who bets against the need for more memory in the future is probably wrong. Anyone who bets against the industry’s ability to continue to make more and cheaper memory available is also probably wrong. The latter point is of course a bit of a paradox: there must be some physical limit on the number of bits that we will be able to buy cheaply. But again, seeing where that limit will be, and when we will hit it, seems very difficult.

However, I do believe that the memory requirements of some applications will plateau. Let me be clear: I think there will continue to be new applications that require more and more memory. Years ago no one would have guessed that we would all have videos on our machines, or even just audio. In the near future I could imagine that we may start to have immersive 3-D environments. These could require astronomical amounts of memory—amounts way beyond anything available today.

So what is the point? I think the key insight is that some applications’ memory requirements will stop growing, or at least not grow as fast as the available memory. They will change from being at the edge of what can be stored, needing huge amounts of memory, to being routinely storable. The reason is that some memory requirements are based on social limits, and others on physical limits.

Call an application small if its data set can easily be stored in memory today, and large if it cannot. Here is an example of a data set that changed from large to small:

{\bullet } The Dictionary: There was a time when storing any dictionary on your computer, let alone the OED, was out of the question. Now you get realtime spell correction, which I could not possibly do without. The reason this application’s memory needs reached a plateau is that the number of words we use is not increasing fast, if at all.
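
To make the point about scale concrete, here is a tiny Python sketch (mine; the word-list path is an assumption, a common Unix location, so substitute any list you have):

    # An entire English word list is only a few megabytes, and once loaded
    # it supports the constant-time lookups behind realtime spell checking.
    with open("/usr/share/dict/words") as f:
        words = {line.strip().lower() for line in f}

    print(len(words), "words loaded")        # typically on the order of 10^5
    print("is a word:", "plateau" in words)  # set membership is O(1)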

Here are two examples of data sets that will flip, in my opinion, to be small in the near future.

{\bullet } Technical Papers: Today there are applications like Citeseer that contain about one million papers. The total storage for this is beyond what most of us can hold on our laptops. But this should change in the near future. The number of papers will continue to grow, but it is unlikely to grow as fast as memory does. If this is the case, then one implication is that in the future we could have all the technical papers from an area on our own devices. Just as realtime spelling is useful, realtime access to technical papers seems to be a potentially exciting development.

This prediction could be wrong in at least two ways that I can think of: papers could start to contain much richer data types. They might start to contain videos, or simulations, or animations. I might argue that they would then not be papers, but would be a new type of object, with new applications.

Another way I could be wrong is this: what if papers start to be written by automatic agents? This could lead to a huge explosion in papers. Imagine a world where “robot” authors create papers that summarize other papers, or re-write papers for a less expert audience, or who knows what. Perhaps.

{\bullet } Books: Right now there are too many books, even restricted to a subfield like mathematics, to have all of them stored on a single cheap device. But this large—huge—amount of memory could easily become a small one in the future.

The reason I think this could happen is again related to the above point about papers. As long as books are written by people, there is an inherent limit to how many there can be. Even if every mathematician wrote a book a month, the number of bytes would not remain large relative to available memory. At some point it would become small.

In this future we could have a device that already has every book on your favorite topic. Then you would not need a connection to search, browse, and read them. There might be interesting questions about costs. How would authors be paid in this world: would they get money as their book is read, or when it is included in a device? There seem to be a number of interesting questions here.

Open Problems

I think that there will be fundamental shifts in many applications as their memory requirements change from large to small. What do you think will happen? Are there interesting research questions that we should be working on?

16 Comments
  1. December 16, 2010 10:54 am

    “there must be some physical limit on the number of bits that we will be able to buy cheaply” — I think this limit will not be a problem in the near future, but today we have a more important limit: the cost of buying books and articles via the Internet. Paying about US$40 to read a new paper, only to find that it is not interesting to you, seems like a serious problem for our colleagues around the world.

  2. December 16, 2010 11:13 am

    The current state of technology is often a limiting factor. I’ve seen organizations collect far more data than they can mine. Collecting the data is comparatively easy; mining it with our modern (crude) software construction methods is the hard part.

    As for books and papers and such, I could see them eventually plateauing. If we learn how to extract the knowledge out of them, then that entire ‘unique’ knowledge base will eventually slow down in growth (but still be unbounded). The end-products themselves would be mere presentation layers (in specific voices) applied to the raw data. They would continue to grow, but again at a reduced rate.

    On the other hand, I think simulations/models would continue to need an endless amount of memory. That is, to get an increasingly accurate model, we would need more and more depth and smaller and smaller time steps. At the limit, we might be interested in simulating the entire universe at microsecond resolution, which would require a machine with more particles than the universe itself actually contains (since time acts as a multiplier).

    Paul.

  3. December 16, 2010 12:46 pm

    This sort of gets into Ray Kurzweil’s augmented-reality land:

    2018

    * 10^13 bits of computer memory—roughly the equivalent of the memory space in a single human brain—will cost $1000.

    2020

    * One personal computer will have the same processing power as a human brain.

    A computer passes the Turing test by the last year of the decade (2029), meaning that it is a Strong AI and can think like a human (though the first A.I. is likely to be the equivalent of a very stupid human). This first A.I. is built around a computer simulation of a human brain, which was made possible by previous, nanotech-guided brainscanning.

    2045: The Singularity

    * $1000 buys a computer a billion times more intelligent than every human combined. This means that average and even low-end computers are vastly smarter than even highly intelligent, unenhanced humans.
    * The technological singularity occurs as artificial intelligences surpass human beings as the smartest and most capable life forms on the Earth.

  4. Proaonuiq
    December 16, 2010 1:12 pm

    Nice post.

    You say: “Another way I could be wrong is this: what if papers start to be written by automatic agents? This could lead to a huge explosion in papers. Imagine a world where ‘robot’ authors create papers that summarize other papers, or re-write papers for a less expert audience, or who knows what. Perhaps.”

    According to this New Scientist article, this is already happening: theorems à la carte!
    Waiter, one order of P!=NC, please!

    http://www.newscientist.com/article/dn19809-mathematical-immortality-give-a-theorem-your-name.html

  5. Michael Mitzenmacher
    December 16, 2010 3:08 pm

    Since I literally would not have had a PhD thesis without the power-of-two-choices paper, and since Andrei generously mentored me for several years, I’m glad to see him and his work acknowledged so nicely in the blog. He is a truly inspirational algorithmist. Among his other notable accomplishments are co-developing some of the first CAPTCHAs, and demonstrating the “bow-tie” structure of the Internet graph. A course on Andrei’s papers would be a fascinating exploration of how the growth of the Web has affected algorithms (up to and including today).

    • rjlipton
      December 22, 2010 4:14 pm

      Mike,

      Thanks. I agree Andrei is special.

  6. December 16, 2010 3:30 pm

    Books may indeed be getting close to being considered ‘small’. The following paper, appearing today, has analyzed data estimated to contain “4% of all books ever printed”:

    http://www.sciencemag.org/content/early/2010/12/15/science.1199644

    • December 16, 2010 7:16 pm

      Oz’s post is terrific — the article he links to includes Steven Pinker, Martin A. Nowak, Peter Norvig, and the entire “Google Books Team” as its authors. It is perfectly feasible to search their immense database—just visit the site Google has set up for this purpose: http://ngrams.googlelabs.com

      Here are a few quick impressions of Google’s amazing database, associated with five specific lexicographic searches:

      Human concerns that never fade: “war,peace,food,children,home,health,happiness,liberty”.

      Sobering lexicography for engineers: “mechanical engineering,electrical engineering,chemical engineering,systems engineering”.

      Sobering lexicography for mathematicians: “naturality,complexity class,algebraic variety,diffeomorphic,graph isomorphism”.

      Sobering lexicography for financiers: “computer trading,automated trading,commodities trading,options trading”.

      Exponentially-growing career options: “vampire,zombie,troll,werewolf,tattoo”🙂

      The same most-recent issue of Science profiles physics grad student Aaron O’Connell in an article titled Breakthrough of the Year: Bridging the Quantum and the Classical Worlds.

      “If more people shared my enthusiasm for what I’m doing, I would more likely continue on in the sciences,” says O’Connell, who defended his thesis last week.

      But “it’s hard to keep telling yourself that what you’re doing is important when everybody you talk to seems to think what you’re doing is not important.”

      Besides, “all the rewards you get as a grad student and a postdoc … are [in] knowing that you’ve done something that no one else has done, or … a select few in the physics community telling you you did a good job, and that doesn’t happen very often,” he adds.

      So “I think I am going to go to finance, because at least they compensate you monetarily there.”

      Hmmm … perhaps no one has told Aaron that today’s fastest-expanding career options for young people—by far—are “vampire,zombie,troll,werewolf,tattoo”.

      It appears that now that humanity is gaining free access (thanks, Google!) to the contents of our own near-infinite cultural memory … the contents of that memory are neither as reassuring nor as evidently useful as we might wish. Yikes!

  7. johne
    December 17, 2010 8:27 pm

    One thing that I have found fascinating is a question that can be roughly stated as:

    “What is the bandwidth in bits per second for your senses?”

    Vision is probably the highest, followed by hearing, with everything else falling off pretty quickly. But there clearly is some number, and I think few would dispute that it has some kind of upper bound. This raises an interesting possibility:

    “There is a limit to the amount of bandwidth you can consume, and at some point more bits per second / fidelity exceeds what you are capable of experiencing.”

    Then you can start to do entropy analysis on the bit stream. It seems like a safe bet that such a bit stream can be losslessly compressed to some degree. How about when you add lossy compression? To put this in context, consider how much lossy compression has taken place on the experience of you eating lunch yesterday.

    With H.264 video compression and MP3 audio compression, we can get a fairly reasonable approximation of what it would take to capture an “essentially lossless” representation of what are probably the two highest-bit-rate senses. Now take that result and extrapolate. How many gigabytes of data does it take to represent your experience for a 24-hour period? And just keep going: how much data does it take to represent your entire life experience up to this point? And this is for an “essentially lossless” representation.
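
    As a back-of-envelope sketch in Python (the bit rates are my rough assumptions, not measurements):

        # Assumed rates: ~8 Mbit/s for decent H.264 video and ~0.25 Mbit/s
        # for MP3-quality audio, covering the two highest-bandwidth senses.
        video_bps = 8e6
        audio_bps = 0.25e6
        seconds_per_day = 24 * 60 * 60

        bytes_per_day = (video_bps + audio_bps) * seconds_per_day / 8
        print(f"per day : {bytes_per_day / 1e9:.0f} GB")    # ~90 GB

        lifetime_bytes = bytes_per_day * 365 * 80           # an 80-year life
        print(f"lifetime: {lifetime_bytes / 1e15:.1f} PB")  # ~2.6 PB

    So a whole life at this fidelity is on the order of a few petabytes: large today, but plausibly the kind of requirement that flips to small.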

    Which brings us back to the topic of the original post: at what point can you carry around on an iPod an “essentially lossless” representation of your life’s experience? Now throw in a Google-like search engine for your experiences. Read an article 20 or 30 years ago? You’ll have a perfect copy of it.

    I find this fascinating. Once you get it down to numbers like this, you almost have to ask “Since my life will only be able to represent x bits, how do I make the most of it?” Another interesting question is “How much bandwidth do I generate as a consequence of my life? What fraction of that is ‘captured’ by others as bits that are worth keeping?”

    • December 18, 2010 2:20 pm

      There are people, currently living, who remember everything that has ever happened to them. They can give a perfect recounting of information about any second of any day in their lives that you choose.

      Scientists are studying the half a dozen folks who have this ability in order to try to understand whether there are structural, physical differences in their brains.

      The conventional wisdom was that perfect recall would leave a person mentally or socially incapacitated in some way. These people are perfectly functioning, perfectly sane.

      In order to address your question, one would probably try to push the limits of their capacity in a structured environment. Which may well be on the list of tests to perform.

      The kind of data you are addressing is context dependent. I outgrew my need for training wheels when I physically understood the effects of gravity on a two-wheeled conveyance with a shifting mass balanced between the wheels. Do I need to retain the information on training wheels and how to utilize them when riding a bicycle or motorcycle or unicycle?
      Probably not.

      However, if I am to teach someone how to use training wheels, it is helpful to access that information in order to make the learning process clear to that other person.

      But it would be interesting to see if memory hoarders would develop in the same way material hoarders have in our current society.

  8. December 18, 2010 4:37 am

    This was an interesting post to read. Thank you. I think your point about technical papers going the way of the dictionary is interesting. Currently, it seems like the trend in systems is to push more and more data storage “into the cloud.” There are plenty of good security and reliability reasons for doing this. However, your post makes it clear that it may become increasingly cheap to simply store all of your data locally. I wonder if and how these two trends interact.

  9. Jamie Oglethorpe
    December 26, 2010 2:43 am

    All the examples centered on human experience (our books, all the papers ever published, our life experience, the output of all security cameras) seem to be ultimately small. With Moore’s Law operating on disk drives and flash memory, the time scale for these things to become small is not too long, certainly within this century.

    Things that are large seem to be natural observations. The most obvious example is the huge dataset being built up by astronomers of the observable universe. I cannot imagine them being satisfied with any resolution they can achieve, any time scale, or any portion of the electromagnetic spectrum, nor with other observables such as neutrinos or gravitational waves when they become observable. Imagine we find a way of observing dark matter directly. Other examples would include ever finer observation of the weather.

    As for simulations, remember that there is always the tradeoff between time and space. The point of a simulation is that you can repeat it, so you don’t need to keep all the intermediate steps. This strikes me as aleph null (we only keep the final result), not aleph one (we keep all the intermediate results as well).
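
    A small illustration of that point (my sketch): if the simulation is deterministic given a seed, any intermediate state can be recomputed on demand instead of stored:

        import random

        def simulate(seed, steps):
            # Deterministic given the seed, so intermediate states need not
            # be kept: re-run to step k to recover state k exactly.
            rng = random.Random(seed)
            state = 0.0
            for _ in range(steps):
                state += rng.uniform(-1, 1)    # stand-in for real dynamics
            return state

        final = simulate(seed=42, steps=10_000)     # keep only this
        halfway = simulate(seed=42, steps=5_000)    # recompute when needed
        print(final, halfway)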

  10. December 31, 2010 1:21 am

    You can actually calculate the approximate date at which all books can be stored uncompressed on your laptop. It’s 2025. About when next year’s kindergarteners enter college.

    The Google corpus contains 500B words, representing approximately 4% of all books, so that’s a total of 12.5 trillion words in all books printed to date. At an average of 6 bytes per word, that’s about 70 TB, growing to about 400 TB by 2100, assuming 6x growth per century (Google’s paper noted that the number of words per year grew by a factor of 5.71 from 1900 to 2000). Meanwhile, laptops currently ship with about 0.5 TB and double every two years. These two exponential curves cross in 2025.
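
    Here is that calculation as a quick Python sketch (using the rough figures above):

        import math

        # Books: ~70 TB in 2010, growing 6x per century.
        # Laptops: ~0.5 TB in 2010, doubling every two years.
        # Solve 0.5 * 2**(t / 2) == 70 * 6**(t / 100) for t (years after 2010):
        t = math.log(70 / 0.5) / (math.log(2) / 2 - math.log(6) / 100)
        print(round(2010 + t))   # -> 2025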

    I think the actual year will be much sooner than that, 2020 at the latest. For one thing, even just GZIP compression gets you at least 3x (http://www.maximumcompression.com/data/text.php), and that alone pulls the year in to 2020 or 2021. For another, by then most of the world’s printed text will be unavailable due to copyright/DRM issues.

    In any case, it’s not far off.

    • rjlipton
      December 31, 2010 1:43 pm

      Michael,

      Nice calculation. That is less than ten years… great.

