Defining A Data Deluge
IRA FLATOW, host:
This is SCIENCE FRIDAY. I'm Ira Flatow.
Events are still unfolding in Egypt following the resignation of President Mubarak, and you can be sure that NPR News is following the situation and will update you with the latest throughout this program, should we need to, and we will have some occasion to do that later.
And a little bit later in the hour, we'll talk about a computer competing on the "Jeopardy!" game show, but first, from your stuffed email inbox to that iPod filled with gigabytes of music and movies to the countless number of Web pages, the world is now filled with data. But just how much?
A study published in the journal "Science" this week takes a shot at the answer. It puts the worldwide total at over 295 exabytes. An exabyte - that's a one followed by 18 zeros, so the worldwide total, in bytes, is a number followed by 20 zeros.
Joining me now to talk about it is one of the authors of that study, Martin Hilbert. He's in the Annenberg School of Communication at the University of Southern California. He joins me from the studios of member station KUSC in Los Angeles. Welcome to SCIENCE FRIDAY.
Dr. MARTIN HILBERT (Annenberg School of Communication, University of Southern California): Thank you very much for having me.
FLATOW: And if you'd like to talk with Dr. Hilbert, you can phone us. Our number is 1-800-989-8255. And tweet us - considering the news today, a good shot of tweets coming through. Our tweet is @scifri, @-S-C-I-F-R-I.
Dr. Hilbert, you looked at how much information there is in the world. What did you find? Did I give out that number right?
Dr. HILBERT: Yeah, that's right. The number was for 2007. So in 2007, it was 295 exabytes. We can imagine that by now, it has already doubled again. So now we should be around, like, 600, 600 exabytes. The amount of information that we store doubles around every three years and four months.
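Hilbert's doubling rate is easy to check with a quick extrapolation. The sketch below assumes exactly 295 exabytes in 2007 and a 40-month doubling time, as he states; the 2011 projection is an illustration, not a figure from the study.

```python
# Extrapolate the world's stored information from Hilbert's figures:
# 295 exabytes in 2007, doubling every 3 years and 4 months (40 months).

BASE_EXABYTES = 295      # estimate for 2007
BASE_YEAR = 2007
DOUBLING_MONTHS = 40     # 3 years and 4 months

def projected_exabytes(year):
    """Project stored information for a given year from the 2007 baseline."""
    months_elapsed = (year - BASE_YEAR) * 12
    return BASE_EXABYTES * 2 ** (months_elapsed / DOUBLING_MONTHS)

print(round(projected_exabytes(2011)))  # about 678 - the "around 600" Hilbert cites
```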
FLATOW: I've never heard of an exabyte before.
Dr. HILBERT: Yeah.
(Soundbite of laughter)
FLATOW: It sounds like a dental term, you know.
Dr. HILBERT: Right. Yeah, it goes kilo-, mega-, giga-, we know that. And right, and then comes tera-, peta-, exa-. So, that's how you...
FLATOW: Wow. How do you count? How do you count all the information on the Internet?
Dr. HILBERT: Well, basically - well, first of all, we have to see that there are three kinds of things that we do with information. One is we transmit information through space. We call that communication. There are two kinds of communication. One is broadcasting. That's one way, diffusion of information. The other one is telecommunication. That's two-way, where you can actually also respond. So that's the first thing.
The second thing is we can transmit information through time. That's what we call storage. And the third one, we can manipulate information, transform its meaning. We call that computation.
So basically, you have three big numbers: how much information can be communicated; how much can be stored; and how much can be computed. So these are the three big numbers.
And then basically what we did is we estimated the technological capacity, so all the information that is technologically mediated. That is basically because, you know, the biological information processing that people do - we could suppose that that is pretty constant. It would be quite arrogant to say that people nowadays think more than they did in the '90s or in the '80s, or that they speak more.
What changed is that we intermediate much more of this information through these technologies. So that is very exciting.
FLATOW: But how do you actually get the number that you arrived at, 295 exabytes?
Dr. HILBERT: Yeah, so basically you take the number of devices that are out there. Then you look at the informational performance of each of the devices. And then you multiply it. With that you get the hardware performance.
Now, after that, you have to standardize it. You have to normalize it on compression rates. That is basically because information is a very slippery thing to measure.
Think about if you have a Word document, for example, and you save it on your hard disk. It might be 100 kilobytes. And then you compress it into a zip file, and it's only 50 kilobytes. So how much information is in that, 100 or 50? So you have to normalize it on the uttermost compression rate. Like, you zip it and zip it and zip it, and what is left there is then real information, and the rest was just redundant data.
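The zip-file point can be tried directly with Python's standard zlib module - a toy illustration of why raw file size overstates information, not the study's actual normalization procedure:

```python
import os
import zlib

# Highly redundant data compresses down to almost nothing...
redundant = b"spam " * 1000    # 5,000 bytes of repetition
# ...while random data is already near its "uttermost" compression.
noise = os.urandom(5000)       # 5,000 bytes of randomness

print(len(zlib.compress(redundant, 9)))  # a few dozen bytes
print(len(zlib.compress(noise, 9)))      # barely shrinks at all
```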
So we took all of that out, and these are the three statistics you need: the number of devices, the hardware performance and then the normalization of compression rates.
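Those three statistics combine into a simple sum of products. The sketch below uses invented device counts and compression ratios purely to show the shape of the calculation; none of the figures come from the study.

```python
# Inventory: (device, units in use, capacity in GB, optimal compression ratio).
# All numbers here are hypothetical illustrations, not the study's inputs.
devices = [
    ("hard_disk", 1_000_000, 500.0, 0.5),  # text and data compress well
    ("dvd",       5_000_000,   4.7, 0.9),  # video is already compressed
]

total_gb = 0.0
for name, count, capacity_gb, ratio in devices:
    # Hardware capacity (count x per-device GB), scaled down to the
    # "real" information left after optimal compression.
    total_gb += count * capacity_gb * ratio

print(f"{total_gb:,.0f} optimally compressed gigabytes")
```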
FLATOW: So if you have an e-book, and you have a paperback book, do you count just one of them, or do you count both of them?
Dr. HILBERT: We count both of them. So - be - (unintelligible) information. So we estimate the technological capacity.
FLATOW: And how long do you think all of this is going to be stored? Forever? I mean, now that something is now digitized, will it last forever and just fill things up until there's no more room?
Dr. HILBERT: Well, all technologies have a certain shelf life, a certain utility life. So the technological stock in 2007, it's a stock of diverse technologies from a couple of years before.
We use cell phones, in general, three years. We use a computer on average five years. VHS video cassettes last a little bit longer; we have them around an average of eight years. Paper, photos, pictures, even longer; we save them for a very long time.
So the stock of information is a process of accumulation. However, no technology stays forever. They get discarded eventually.
FLATOW: You say that while we might be impressed with ourselves for our information totals, the really impressive info storage is nature. Why do you - why is that so impressive?
Dr. HILBERT: Yeah. Well because, like, these numbers are very big, when you say 295 exabytes. That's what we store. You could cover the entire area of the United States in 13 layers of books if you put all of this information into books. That is a lot.
If you put all of this information in CD-ROMs, you could make a pile that goes from here to the moon and a quarter of this distance beyond. So that's a huge amount of information that we have.
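That moon comparison survives a back-of-envelope check. The sketch assumes standard disc figures - about 730 MB and 1.2 millimeters per CD-ROM - and the mean Earth-Moon distance; these constants are standard reference values, not taken from the study.

```python
# Stack 295 exabytes' worth of CD-ROMs and compare to the Earth-Moon distance.
EXABYTE = 10**18                 # bytes
total_bytes = 295 * EXABYTE
CD_BYTES = 730 * 10**6           # ~730 MB per disc
CD_THICKNESS_M = 1.2e-3          # ~1.2 mm per disc
MOON_KM = 384_400                # mean Earth-Moon distance in km

n_discs = total_bytes / CD_BYTES
stack_km = n_discs * CD_THICKNESS_M / 1000

print(f"{stack_km / MOON_KM:.2f} Earth-Moon distances")  # 1.26 - to the moon and a quarter beyond
```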
Also, in historical terms, if you look at it: if you take all this information and put it into books, that's more than 80 times the famous, historic Library of Alexandria per person on planet Earth. That means everybody has 80 times more information available right now.
So that's a huge - these are huge numbers, and huge progress. But if you compare it, for example, to what the human DNA is storing - the human DNA of one single person, in all the 60 trillion cells that you have, can store around 300 times more information.
So compared to that, compared to Mother Nature, we are but humble apprentices. I mean, she's still way ahead.
(Soundbite of laughter)
FLATOW: Let's see if I can get a call in here from Jeff(ph) in Rapid City, South Dakota. Hi, Jeff.
JEFF (Caller): Hi, just curious. Is there any way to tell how much is original information and how much is echoes of the original information?
Dr. HILBERT: That's a very interesting question. Some other researchers from Berkeley tried to do that before. Now, that is actually, I would say, not really possible because what is original information, and what is new?
Is a Beatles song really new information, or did they take the chords from somebody else? And a remix of a Beatles song - is that new, or is that, like, just a remix of the Beatles song?
Now, you take a couple of songs and mix them in a different order in a playlist on your iPod. Is that now new information, or is that just old information? So what of that is new, just the reordering of this?
So information basically, from a theoretical perspective, is always nothing else than the recombination of previously existing symbols. Think of the information on a page of text as nothing else than the reordering of words that existed before. So how much of that is new? How much of that is original? It's very slippery.
And so from a scientific, really solid perspective, we cannot do that. In this study, we wanted to have a solid baseline and just say: as long as somebody copies it, it's useful for this person, at least. If the neighbor already knows it, that's not so important to this individual. So as long as it's there, it's useful information.
FLATOW: All right, thanks for calling.
JEFF: Thank you.
FLATOW: 1-800-989-8255 is our number. Do you think we're going to eventually hit a wall where it's no longer possible? Is the data multiplying, you know - what's the word I'm looking for - geometrically, so that, you know, we're not going to be able to keep track of it all?
Dr. HILBERT: I wouldn't worry about that too much. First of all, there are two things we have to consider. On the one hand, okay, this information's becoming bigger and bigger. But what we found out, what is actually growing much faster than the amount of information that we store is our computation capacity.
Information storage grew around four times faster than the economy. However, our computational capacity, the amount of instructions per second that we do with our computers, grew about nine times faster than the economy.
So computation is growing even faster, and these computers basically help us to make meaning of this information. So they help us to filter through it.
Think about five years back. I'm sure five years back, your email inbox was filled with spam every day, and now you get a couple of spams a week, if you're really unlucky. So these technologies themselves help us. We kind of like fight fire with fire.
So these technologies bring us a lot more information, but then we come up with clever, artificial ways to filter through this information. So that's on the one hand why I wouldn't worry about that too much.
On the other hand, the human brain is very plastic. If you think about maybe our great-grandfathers, 150 years ago, if they were lucky, they read some, maybe 50 books in their lifetimes. Now my little cousin already saw a couple of hundred movies. Every child sees a couple of hundred movies before going to elementary school. So we process a lot more information. The human brain is pretty plastic.
What one of our numbers shows is that if you combine all the computational power of our general-purpose computers - how many instructions they can make per second - that's right about the same number as the number of nerve impulses the human brain can make in one second. That means the (unintelligible) of one single human brain is as much as all the instructions that all computers can make.
So the human brain is the most impressive information processing machine of them all. And with these two things, using our technology to help us and with our powerful human brain, I wouldn't worry too much about the information, real information overload problem, at least for a couple of...
FLATOW: Let me take a quick question from Reggie(ph) in Cleveland. Hi, Reggie.
REGGIE (Caller): Hello.
FLATOW: Hi, quickly.
REGGIE: Yes. I just wanted to ask the guest: does he think that the two things that he talked about - how the human brain is, like, the best information-processing device that we have, and that nature, you know, exceeds our capacity to create information and store it - could be evidence for creationism and the fact that, you know, God created the world? And, you know, somewhere...
FLATOW: I've only got a minute, so let me get an answer for you. Thanks for calling. Dr. Hilbert, how can you answer that one?
Dr. HILBERT: No, I wouldn't see that as evidence for creationism.
FLATOW: No, the fact that nature has far outdone us in the ability to store data.
Dr. HILBERT: Right. Evolution is the most powerful of all mechanisms. It just relentlessly tries out new combinations and comes up with things like that. So it's completely possible that evolution came up with the brain, yes.
FLATOW: Well, Dr. Hilbert, we've run out of time, but this is fascinating, and we hope you keep crunching those numbers for us.
Dr. HILBERT: Okay, well, thank you very much for having us.
FLATOW: You're welcome. Martin Hilbert is in the Annenberg School of Communication at the University of Southern California. And he was joining us from KUSC in Los Angeles.
We're going to take a break. When we come back, we're going to try to win a game show. How "Jeopardy!" is playing IBM's new computer, Watson, and the two best "Jeopardy!" contestants ever are going to play the computer, and we don't know how that's going to turn out, but we'll explain how it's going to all happen. And you can actually play it online and try it out for yourself. So we'll talk about it. Don't go away. We'll be right back.
(Soundbite of music)
FLATOW: I'm Ira Flatow. This is SCIENCE FRIDAY, from NPR.
NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.