Web Security Words Help Digitize Old Books

Screenshot from reCAPTCHA

The reCAPTCHA test offers two distorted words. One is a known "control word," which lets users access a Web site. The other is an "unknown word." If several people agree on what it is, their answer will be incorporated in digitized books or newspapers.

Comparisons of text read by OCR programs.

Optical character-recognition (OCR) software can make mistakes in converting images to text. Science/AAAS

Optical character recognition programs are unable to correctly read distorted text like this.

Some documents, especially older ones with yellowed pages and fading ink, challenge the computer programs designed to decipher them. Creative Commons

People who use the Internet to talk to friends, set up free e-mail accounts or buy concert tickets are often unknowingly helping to digitize vast libraries of old books and newspapers.

That's because more than 40,000 Web sites — including popular ones such as Ticketmaster, Facebook and Craigslist — are using a new kind of security program called reCAPTCHA.

It's the brainchild of Luis von Ahn, a computer scientist at Carnegie Mellon University in Pittsburgh, who helped develop another commonly used Web security system. That one, called CAPTCHA, will allow people to access a Web site only if they prove they are human — and not a spammer's computer — by typing in a sequence of letters or numbers that appear on the screen in a distorted or garbled image.

"Each time you type one of these, your brain is doing something amazing," von Ahn says. "Your brain is performing a task that, despite 50 years of research in computer science, we cannot yet get computers to do."

The trouble is, each time you type in one of these garbled words, you're also wasting time. Von Ahn recently realized exactly how much time was being wasted, and he found it demoralizing.

"Approximately 200 million of these are typed every day by people around the world. Each time you type one of these, essentially you waste about 10 seconds of your time," he says. "If you multiply that by 200 million, you get that humanity as a whole is wasting around 500,000 hours every day, typing these annoying squiggly characters."
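Von Ahn's estimate is easy to check. The short sketch below simply multiplies the two figures he quotes, 200 million CAPTCHAs a day at about 10 seconds each; the variable and function names are illustrative, and both inputs are his rough estimates rather than measured values.

```python
# Figures quoted by von Ahn in the article; both are rough estimates.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def wasted_hours_per_day(captchas=CAPTCHAS_PER_DAY,
                         seconds_each=SECONDS_PER_CAPTCHA):
    """Return the total hours spent typing CAPTCHAs in one day."""
    return captchas * seconds_each / SECONDS_PER_HOUR


print(round(wasted_hours_per_day()))  # prints 555556
```

That works out to roughly 556,000 hours a day, in line with the "around 500,000" figure von Ahn cites.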

But with reCAPTCHA, von Ahn has come up with an idea for harnessing all that human brain power.

He knew that lots of libraries have huge efforts under way to digitize their collections. These projects first scan books or newspapers by basically taking a picture of each page. Then a computer takes the image of each word and converts it into text, using optical character-recognition software.

But computers often come across printed words they just can't recognize. "Especially for older documents, things that were written before 1900, where the ink has faded and the pages have yellowed out, the computer makes a lot of mistakes," says von Ahn.

A human being has to look at those words and decipher them. It occurred to von Ahn that he could link this kind of activity to security devices used on the Internet. Instead of asking people to prove they're human by copying random sequences of distorted letters and numbers, he could ask them to decipher mystery words from scanned books and newspapers.

So he got together with The New York Times, which is digitizing newspapers going back to 1851, and a nonprofit called the Internet Archive, which is digitizing thousands of books.

And now, if you go to someplace like Ticketmaster to buy, say, Jimmy Buffett tickets, you'll be shown images of not one but two distorted words.

One of these is the real security word: Type this one correctly and you're in. The other image is something that has mystified the digitizing software.

If people recognize that word, they type it in. This image will actually be shown to several people. If they all agree on what the word is, it will be considered accurately transcribed. And von Ahn says it will be incorporated into the digitized copy of the book or the newspaper that it came from.
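The two-word scheme described above can be sketched in a few lines of code. This is only an illustration of the idea, not reCAPTCHA's actual implementation: the function names, the case-insensitive matching and the threshold of three matching answers are all assumptions, since the article says only that "several people" must agree.

```python
# Hypothetical threshold: the article says only that "several people" must agree.
MIN_AGREEING_ANSWERS = 3


def normalize(word):
    """Ignore surrounding spaces and letter case when comparing answers (an assumption)."""
    return word.strip().lower()


def passes_control(control_word, typed_control):
    """The user gets in only if the known control word is typed correctly."""
    return normalize(typed_control) == normalize(control_word)


def record_guess(guesses, control_word, typed_control, typed_unknown):
    """Keep the guess for the unknown word only if the control word was right."""
    if passes_control(control_word, typed_control):
        guesses.append(typed_unknown)
        return True
    return False


def resolve_unknown_word(guesses, min_agree=MIN_AGREEING_ANSWERS):
    """Return the agreed transcription, or None if there is no consensus yet."""
    distinct = {normalize(g) for g in guesses}
    if len(guesses) >= min_agree and len(distinct) == 1:
        return distinct.pop()
    return None


guesses = []
record_guess(guesses, "morning", "morning", "upon")
record_guess(guesses, "overlooks", "Overlooks", "Upon")
record_guess(guesses, "inquiry", "inqiry", "open")   # control word wrong: guess ignored
record_guess(guesses, "between", "between", "upon ")
print(resolve_unknown_word(guesses))
```

Counting a guess only when the control word is typed correctly is the design choice that matters here: it means the answers feeding the digitized text come from people who have already shown they can read distorted words.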

"And the number of words that we've been able to digitize like this is insanely large, it's like over a billion. It's like 1.3 billion by now," von Ahn says.

In the journal Science, he and his colleagues report that over the last year Web users have transcribed enough text to fill up more than 17,600 books, with better than 99 percent accuracy.

Marc Frons, chief technology officer of digital operations for The Times, says the pace is astonishing. Each month, the project digitizes about two years' worth of newspapers.

"Next year, if all goes well, we can do as many as 70 years, which would be almost the entire rest of the archive that is not digitized," says Frons. "It's just pretty cool when you're signing up for a Web site and you see the reCAPTCHA sign. You sort of know, 'Gee, I'm helping digitize part of The New York Times.' "

People might wonder if this new system is wasting even more of their time than the traditional CAPTCHA setup, since it requires them to type in two different things instead of just one. But von Ahn says it's actually faster to type English words than to type random letters and numbers.

There is one problem. Sometimes, the book scanners offer up something that people can't read at all. "Like, for example, some sort of ink blot on the page," says von Ahn. "We might think it's a word and we present it, and you know, it says, 'Type the two words,' and sometimes one of the things is a word and one of the things is just a blob there. So sometimes people can be annoyed."

And here's another thing: "When you pull two random words from books, you can get some very random combinations," says Brian Pike, chief technology officer for Ticketmaster.

The two words can occasionally form juxtapositions that could be weird or offensive. "And there's certain phrases and words we've asked them to make sure don't show up," Pike says.

He declined to cite an example. Still, Pike says the system works great from a security standpoint. And if customers find it somewhat annoying, at least now they can know their time isn't being totally wasted.
