Google Book Tool Tracks Cultural Change With Words
A searchable database of more than 500 billion words from millions of books published over the past four centuries is now online. Researchers say the tool is a powerful way to study cultural change.
Researchers have built a database of more than 500 billion words, culled from a collection of 5 million books. Parsing the words provides a unique insight into cultural changes, they say.
Justin Sullivan/Getty Images
Perhaps the biggest collection of words ever assembled has just gone online: 500 billion of them, from 5 million books published over the past four centuries.
The words make up a searchable database that researchers at Harvard say is a new and powerful tool to study cultural change.
The words are a product of Google's book-scanning project. The company has converted approximately 15 million books so far into electronic documents. That's about 12 percent of all books ever published. The collection includes books published in English, Spanish, French, German, Chinese, Russian and Hebrew.
Facts From The Study
The study used 5,195,769 digitized books, or about 4 percent of all books ever published. That resulted in more than 500 billion words. By language:
English: 361 billion
French: 45 billion
Spanish: 45 billion
Russian: 35 billion
Chinese: 13 billion
Hebrew: 2 billion
At a reasonable rate of 200 words per minute, reading just the entries from the year 2000 would take 80 years, without interruptions for food or sleep.
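The reading-time figure implies a rough size for the year-2000 slice of the database. A back-of-the-envelope check in Python, assuming nonstop reading and a 365.25-day year (the year length is an assumption, not something stated in the study):

```python
# Rough check of the reading-time figure quoted above.
WORDS_PER_MINUTE = 200   # reading rate quoted in the study
YEARS_OF_READING = 80    # reading time quoted in the study
DAYS_PER_YEAR = 365.25   # assumption: average length of a year

minutes_per_year = 60 * 24 * DAYS_PER_YEAR
implied_words = WORDS_PER_MINUTE * minutes_per_year * YEARS_OF_READING

print(f"Implied words for the year 2000: {implied_words / 1e9:.1f} billion")
```

Under those assumptions, the entries from the year 2000 alone work out to roughly 8.4 billion words.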
"Occupational choices affect the rise to fame. ... Actors tend to become famous earliest, at around 30. The writers became famous about a decade after the actors, but rose for longer and to a much higher peak. ... Politicians did not become famous until their 50s, when, upon being elected President of the United States, they rapidly rose to become the most famous of the groups. ... Science is a poor route to fame. Physicists and biologists eventually reached a similar level of fame as actors, but it took them far longer. Alas, even at their peak, mathematicians tend not to be appreciated by the public."
"Suppression -- of a person, or an idea -- leaves quantifiable fingerprints. For instance, Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of 'Marc Chagall' in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936-1944, when his full name appears only once. (In contrast, from 1946-1954, 'Marc Chagall' appears nearly 100 times in the German corpus.) Such examples are found in many countries, including Russia (e.g. Trotsky), China (Tiananmen Square) and the U.S. (the Hollywood Ten, blacklisted in 1947)."
Many of these books are covered by copyright, and publishers aren't letting people read them online. But the new database gets around that problem: It's just a collection of words and phrases, stripped of all context except the year in which they appeared.
Yet Erez Lieberman Aiden, a mathematician and bioengineer at Harvard and co-creator of this new database, says it opens the door to a whole new style of literary scholarship.
"Instead of saying, 'What insight can I glean if I have one short text in front of me?' -- it's, 'What insight can I glean if I have 500 billion words in front of me; if I have such a large collection of texts that you could never read it in a thousand lifetimes?' "
A 'Fantastically Addictive' Tool
You can, for instance, type in a word or a short phrase, and the database produces a graph -- a curve that traces how often that word or phrase appeared in books each year since 1800.
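The article does not spell out how such a curve is computed. One plausible sketch, in Python: divide the number of times a phrase appears in a given year by the total number of words printed that year. The function name `yearly_frequency` and the tiny counts below are invented for illustration and are not taken from the database.

```python
def yearly_frequency(phrase_counts, total_counts):
    """Return {year: share of that year's words taken up by the phrase}."""
    freqs = {}
    for year in sorted(total_counts):
        total = total_counts[year]
        if total == 0:
            continue  # no words recorded for this year, so no data point
        freqs[year] = phrase_counts.get(year, 0) / total
    return freqs


# Made-up counts for illustration only.
phrase_counts = {1800: 2, 1801: 3, 1802: 1}
total_counts = {1800: 1000, 1801: 1000, 1802: 2000}

for year, share in yearly_frequency(phrase_counts, total_counts).items():
    print(year, f"{share:.4%}")
```

Dividing by each year's total matters because far more books are published today than in 1800; raw counts alone would make almost every word look as if it were on the rise.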
"And you realize that it's fantastically addictive," says Jean-Baptiste Michel, a mathematician and biologist at Harvard who created the new database together with Aiden. "You can just spend hours and hours typing in the names of people you know, places you like, or just random stuff. And so you end up discovering quite a lot of things that way."
The researchers discovered, for instance, that the trajectory of fame -- the curve that shows how often a very famous person is mentioned in books -- has changed over the centuries. Today, fame is more fleeting.
"You become famous earlier in life; so fame knocks on your door earlier than before. And then you rise to fame even faster than before. The flip side of this is that you become forgotten also somewhat faster than before," says Michel.
Specific years -- 1973, for instance -- also seem to fade from the literary record more quickly nowadays. And God got a lot of print in the early 19th century, but not today.
Windows Into Evolving Cultures
Aiden and Michel argue that these graphs are windows into evolving cultures. All those words represent a chunk of our cultural DNA; not a genome, they say, but a "culturome." They've named the website where anybody can search their database culturomics.org. It's just been unveiled in the journal Science.
Aiden is, however, quick to point out that the collection is limited.
"Books are just one form of cultural exchange," he says. "It's a biased form of cultural exchange. Only certain types of people write books, and only certain types of people manage to get their books published."
Estimated Number Of Words In The English Lexicon
Source: Jean-Baptiste Michel/AAAS/Science
But at least books have survived, and it's possible to catalog the words in them, unlike casual conversations or lovers' quarrels.
Some scholars may be horrified by this approach to literature, but Stanford historian Caroline Winterer is not. She says such new tools give historians more comprehensive information about the words that people used in the past to describe their world.
"Before, you had to sit there and, well, you actually had to read the whole text, God forbid! And you'd find two or three examples, and nobody could really check up on it. For better or for worse, it does give you a more accurate sense of some things in the humanities."
But some things require knowledge of a word's context. Take the decline of the word "God," Winterer says. Over the past century or two, some writers started describing the wonders of the natural world as divine. Their books don't always use the word God, "but they are talking about nature, or the environment, or Yosemite, or Yellowstone; these are all codes for God."
Findings From The Study
Using analysis of more than 500 billion words scanned as part of the Google Books project, researchers tracked themes and phrases through time. Below is a sampling of their findings from the study.
Known Events Exhibit Sharp Peaks At Date Of Occurrence
Researchers selected groups of events that occurred at known dates, then analyzed the relevant words and data around those dates. The top chart focuses on a list of 124 treaties; the second chart was made from a list of 43 heads of state (U.S. presidents and U.K. monarchs), centered around the year when they were elected or became king or queen; and the third from a list of 28 country name changes, centered around the year of name change.
Michel et al.
Irregular verbs are used as a model of grammatical evolution. For each verb, researchers plotted the usage frequency of its irregular form in red ("throve/thriven"), and the usage frequency of its regular past-tense form in blue ("thrived"). Virtually all irregular verbs turn up from time to time in a regular form, but the more often a verb is used, the more rarely it appears in that regular form.
(A) The usage frequency of various diseases: "fever" (blue), "cancer" (green), "asthma" (red), "tuberculosis" (cyan), "diabetes" (purple), "obesity" (yellow) and "heart attack" (black). (B) Cultural prevalence of AIDS and HIV. Researchers highlight the year 1983, when the viral agent was discovered. (C) Usage of the term "cholera" peaks during the cholera epidemics that affected Europe and the United States (blue shading). (D) Usage of the term "infantile paralysis" (blue) exhibits one peak during the 1916 polio epidemic (blue shading), and a second around the time of a series of polio epidemics that took place during the early 1950s. But the second peak is anomalously broad. Discussion of polio during that time may have been fueled by the 1936 election (green shading) of "Franklin Delano Roosevelt" (green), who had been paralyzed by polio, as well as by the development of the "polio vaccine" (red) in 1952. The vaccine ultimately eradicated "infantile paralysis" in the United States.