"Big Data" hasn't made any of the words-of-the-year lists I've seen so far. That's probably because it didn't get the wide public exposure given to items like "frankenstorm," "fiscal cliff" and YOLO. But it had a huge surge in venues like Wired and The Economist, and it was the buzz of Silicon Valley and Davos. And if the phrase wasn't as familiar to many people as "Etch A Sketch" and "47 percent," Big Data had just as much to do with President Obama's victory as they did.
Whether it's explicitly mentioned or not, the Big Data phenomenon has been all over the news. It's responsible for a lot of our anxieties about intrusions on our privacy, whether from the government's anti-terrorist data sweeps or the ads that track us as we wander around the Web. It has even turned statistics into a sexy major. So if you haven't heard the phrase yet, there's still time — it will be around a lot longer than "gangnam style."
It's a good name, too. I always want to pronounce it with a plosive B, the way Carl Sagan would have. Bbig Data, bbillions and bbillions of bbits. That's about the right magnitude for what people are calling the age of exabytes. Exa- is the prefix for 1 followed by 18 zeroes. Exabytes come after petabytes, which come after terabytes, which come after gigabytes. If the numbers are hard to grasp, think of it this way: If you started stacking up those old 64k floppies until you got to 500 exabytes, you'd have a pile that stretched from here to someplace really far away.
But Big Data is no more exact a notion than Big Hair. Nothing magic happens when you get to the 18th or 19th zero. After all, digital data has been accumulating for decades in quantities that always seemed unimaginably vast at the time, whether they were followed by a K or an M or a G. The fact is that an exponential curve looks just as overwhelming wherever you get onboard. And anyway, nobody really knows how to quantify this stuff precisely. Whatever the sticklers say, data isn't a plural noun like "pebbles." It's a mass noun like "dust."
What's new is the way data is generated and processed. It's like dust in that regard, too. We kick up clouds of it wherever we go. Cellphones and cable boxes; Google and Amazon, Facebook and Twitter; the cameras at stoplights; the bar codes on milk cartons; and the RFID chip that whips you through the toll plaza. Each of them captures a sliver of what we're doing, and nowadays they're all calling home.
It's only when all those little chunks are aggregated that they turn into Big Data; then the software called analytics can scour it for patterns. Epidemiologists watch for blips in Google queries to localize flu outbreaks; economists use them to spot shifts in consumer confidence. Police analysts comb over crime data looking for hot zones; security agencies comb over travel and credit card records looking for possible terrorists.
This was the year we held the first Big Data election, too. The Republicans may have had more money, but the Obama campaign had better voter data and analytics. That gave them an edge in identifying likely supporters and finding the best ways to reach what they call "low-information" independents — which turned out to include running ads on Jimmy Kimmel and the rerun cable network TV Land. And it was Big Data analytics that Nate Silver used to correctly predict the election outcome in all 50 states, skunking the pundits in the process.
It's the amalgamation of all that personal data that makes it possible for businesses to target their customers online and tailor their sales pitches to individual consumers. You idly click on an ad for a pair of red sneakers one morning, and they'll stalk you to the end of your days. It makes me nostalgic for the age when cyberspace promised a liberating anonymity. I think of that famous 1993 New Yorker cartoon by Peter Steiner: "On the Internet, nobody knows you're a dog." Now it's more like, "On the Internet, everybody knows what brand of dog food you buy."
Though actually, it's more worrisome when they get your brand of dog food wrong. In some circles, Big Data has spawned a cult of infallibility — a vision of prediction obviating explanation and math trumping science. In a manifesto in Wired, Chris Anderson wrote, "With enough data, the numbers speak for themselves."
The trouble is that you can't always believe what they're saying. When you've got algorithms weighing hundreds of factors over a huge data set, you can't really know why they come to a particular decision or whether it really makes sense.
When I was working with systems like these some years ago at the Xerox Palo Alto Research Center, we used to talk about a 95 percent solution. So what if Amazon's algorithms conclude that I'd be interested in Celine Dion's greatest hits, as long as they get 19 out of 20 recommendations right? But those odds are less reassuring when the algorithms are selecting candidates for the no-fly list.
I don't know if the phrase Big Data itself will be around 20 years from now, when we'll probably be measuring information in humongobytes. People will be amused to recall that a couple of exabytes were once considered big data, the way we laugh to think of a time when $15,000 a year sounded like big money. But 19 out of 20 is probably still going to be a good hit rate for those algorithms, and people will still feel the need to sort out the causes from the correlations — still asking the old question, what are patterns for?