Researchers have built a database of more than 500 billion words, culled from a collection of 5 million books. Parsing the words provides a unique insight into cultural changes, they say.
Perhaps the biggest collection of words ever assembled has just gone online: 500 billion of them, from 5 million books published over the past four centuries.
The words make up a searchable database that researchers at Harvard say is a new and powerful tool to study cultural change.
The words are a product of Google's book-scanning project. The company has converted approximately 15 million books into electronic documents so far. That's about 12 percent of all books ever published. The collection includes books published in English, Spanish, French, German, Chinese, Russian and Hebrew.
The study used 5,195,769 digitized books, or about 4 percent of all books ever published. That resulted in more than 500 billion words. By language:
- English: 361 billion
- French: 45 billion
- Spanish: 45 billion
- German: 37 billion
- Russian: 35 billion
- Chinese: 13 billion
- Hebrew: 2 billion
At a reasonable rate of 200 words per minute, reading just the entries from the year 2000 would take 80 years, without interruptions for food or sleep.
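As a rough check on that figure, the arithmetic can be sketched in a few lines of Python. The 200-words-per-minute rate and the 80-year result come from the text; the 365.25-day year and the function names are assumptions made for illustration.

```python
# Assumes nonstop reading (no food or sleep) and a 365.25-day year.
MINUTES_PER_YEAR = 60 * 24 * 365.25


def words_read(years, words_per_minute=200):
    """Number of words covered by reading nonstop for `years`."""
    return years * MINUTES_PER_YEAR * words_per_minute


def years_to_read(word_count, words_per_minute=200):
    """Years of nonstop reading needed to get through `word_count` words."""
    return word_count / (words_per_minute * MINUTES_PER_YEAR)


implied_words = words_read(80)
print(f"{implied_words:,.0f} words")
```

Under these assumptions, 80 years of nonstop reading covers roughly 8.4 billion words, which is the size the claim implies for the year-2000 entries alone.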
"Occupational choices affect the rise to fame. ... Actors tend to become famous earliest, at around 30. The writers became famous about a decade after the actors, but rose for longer and to a much higher peak. ... Politicians did not become famous until their 50s, when, upon being elected President of the United States, they rapidly rose to become the most famous of the groups. ... Science is a poor route to fame. Physicists and biologists eventually reached a similar level of fame as actors, but it took them far longer. Alas, even at their peak, mathematicians tend not to be appreciated by the public."
"Suppression — of a person, or an idea — leaves quantifiable fingerprints. For instance, Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of 'Marc Chagall' in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936-1944, when his full name appears only once. (In contrast, from 1946-1954, 'Marc Chagall' appears nearly 100 times in the German corpus.) Such examples are found in many countries, including Russia (e.g. Trotsky), China (Tiananmen Square) and the U.S. (the Hollywood Ten, blacklisted in 1947)."
Many of these books are covered by copyright, and publishers aren't letting people read them online. But the new database gets around that problem: It's just a collection of words and phrases, stripped of all context except the year in which they appeared.
Yet Erez Lieberman Aiden, a mathematician and bioengineer at Harvard and co-creator of this new database, says it opens the door to a whole new style of literary scholarship.
"Instead of saying, 'What insight can I glean if I have one short text in front of me?' — it's, 'What insight can I glean if I have 500 billion words in front of me; if I have such a large collection of texts that you could never read it in a thousand lifetimes?' "
A 'Fantastically Addictive' Tool
You can, for instance, type in a word or a short phrase, and the database produces a graph: a curve that traces how often authors used those words in each year since 1800.
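The kind of curve described here can be approximated in miniature. What follows is a minimal sketch, not the researchers' actual pipeline: it assumes a toy list of (year, text) pairs, splits text on whitespace, and divides each year's matches by the total number of same-length phrases seen that year. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_frequency(records, phrase):
    """Map each year to the share of its n-grams that match `phrase`.

    `records` is an iterable of (year, text) pairs. Matching is
    case-insensitive and uses plain whitespace tokenization, which is
    an assumption made to keep the sketch short.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in records:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            totals[year] += 1
            if gram == target:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample data, invented for illustration only.
records = [
    (1800, "god is great and god is good"),
    (1900, "the park is great"),
    (1900, "nature is divine"),
]

for year, share in yearly_frequency(records, "god").items():
    print(year, round(share, 3))
```

Dividing by each year's total matters because far more books are published in some years than in others; raw counts alone would mostly track the size of the collection rather than how common a phrase was.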
"And you realize that it's fantastically addictive," says Jean-Baptiste Michel, a mathematician and biologist at Harvard who created the new database together with Aiden. "You can just spend hours and hours typing in the names of people you know, places you like, or just random stuff. And so you end up discovering quite a lot of things that way."
The researchers discovered, for instance, that the trajectory of fame — the curve that shows how often a very famous person is mentioned in books — has changed over the centuries. Today, fame is more fleeting.
"You become famous earlier in life; so fame knocks on your door earlier than before. And then you rise to fame even faster than before. The flip side of this is that you become forgotten also somewhat faster than before," says Michel.
Specific years — 1973, for instance — also seem to fade from the literary record more quickly nowadays. And God got a lot of print in the early 19th century, but not today.
Windows Into Evolving Cultures
Aiden and Michel argue that these graphs are windows into evolving cultures. All those words represent a chunk of our cultural DNA; not a genome, they say, but a "culturome." The website where anybody can search their database is called culturomics.org. The project has just been unveiled in the journal Science.
Aiden is, however, quick to point out that the collection is limited.
"Books are just one form of cultural exchange," he says. "It's a biased form of cultural exchange. Only certain types of people write books, and only certain types of people manage to get their books published."
[Chart: Estimated number of words in the English lexicon]
But at least books have survived, and it's possible to catalog the words in them, unlike casual conversations or lovers' quarrels.
Some scholars may be horrified by this approach to literature, but Stanford historian Caroline Winterer is not. She says such new tools give historians more comprehensive information about the words that people used in the past to describe their world.
"Before, you had to sit there and, well, you actually had to read the whole text, God forbid! And you'd find two or three examples, and nobody could really check up on it. For better or for worse, it does give you a more accurate sense of some things in the humanities."
But some things require knowledge of a word's context. Take the decline of the word "God," Winterer says. Over the past century or two, some writers started describing the wonders of the natural world as divine. Their books don't always use the word God, "but they are talking about nature, or the environment, or Yosemite, or Yellowstone; these are all codes for God."