Web Security Words Help Digitize Old Books

Every day, millions of people are asked to retype sequences of squiggly letters so Web sites can tell they are human. A scientist has figured out how to harness that manpower to digitize old books.


I'm Melissa Block.


And I'm Robert Siegel. This is ALL THINGS CONSIDERED from NPR News.

BLOCK: When you're cruising around on the Internet, there's something you do all the time and maybe never really think about. Let's say I wanted to get tickets to a baseball game tonight. So, I'm at the Web site for the Washington Nationals. I'm asking for two tickets, the best available seats. And here's what I have to do.

A bunch of numbers have popped up on the screen in a grid. They're a little scrambled up, and I have to type them into a box, 1067278. Why? Why do I have to do that? Nell Greenfieldboyce, our technology reporter, is here to explain.

NELL GREENFIELDBOYCE: Well, you've just proved that you're human. It's sort of a silly little thing that you have to do and it seems like a waste of time, but that's how the Web site knows that you're a human being trying to get tickets and not some sort of malevolent computer.

BLOCK: Well, it's sort of a nice confirmation of my humanity, but it's also annoying. I mean, it takes time. It's one more step.

GREENFIELDBOYCE: Yeah. A lot of people feel that way, including this computer scientist who actually developed this kind of security measure. His name is Luis von Ahn. And he's a computer scientist at Carnegie Mellon University in Pittsburgh. And he recently sort of added up all the time that we spend typing in this stuff.

Mr. LUIS VON AHN (Computer Scientist, Carnegie Mellon University): Approximately 200 million of these are typed every day by people around the world. Each time you type one of these, essentially, you waste about 10 seconds of your time. And if you multiply that by 200 million, you get that humanity as a whole is wasting about 500,000 hours every day typing these annoying squiggly characters.
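
As a rough check on that figure, the arithmetic can be sketched as below. The two inputs are the numbers von Ahn quotes in the interview; the variable names are illustrative only.

```python
# Figures quoted by von Ahn in the interview.
captchas_per_day = 200_000_000   # challenges typed worldwide each day
seconds_per_captcha = 10         # approximate time spent on each one

total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600  # 3600 seconds in one hour

print(f"{total_hours:,.0f} hours per day")  # 555,556 hours per day
```

The result, a little over 550,000 hours, is consistent with the "about 500,000 hours" he cites.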

GREENFIELDBOYCE: What a waste. But then, he had an idea for harnessing all that human brain power. He knew that lots of libraries have these huge efforts underway to digitize their collections. They first scan books or newspapers, basically take a picture of each page. Then, a computer takes the image of each word and converts it into text. Now, von Ahn knew that computers often come across printed words that they just can't recognize.

Mr. VON AHN: Especially for older documents, things that were written before like 1900 where the ink has faded and the pages have yellowed out, the computer makes a lot of mistakes.

GREENFIELDBOYCE: And so, a human being has to look at those words and decipher them. It occurred to von Ahn that he could link this kind of activity to security devices used on the Internet. Instead of asking people to copy random letters and numbers, why not ask them to decipher those mystery words from scanned books and newspapers?

So, he got together with The New York Times, which is digitizing newspapers going back to 1851, and a nonprofit called the Internet Archive, which is digitizing thousands of books. He developed a new security device called reCAPTCHA. It gets people to identify words that gave the scanning software trouble. His security system is now being used by over 40,000 Web sites.

Mr. VON AHN: And that includes a few big Web sites that most everybody's heard of like Ticketmaster, Facebook, Craigslist.

GREENFIELDBOYCE: This is how it works. When you go someplace like Ticketmaster to buy, say, Jimmy Buffett tickets, you're shown images of not one but two distorted words. One of these is the real security word. Type this one correctly and you're in. The other image is something that has mystified the scanning software.

If people recognize the word, they type it in. This image will actually be shown to several people. If they all agree on what the word is, the system will figure, it's pretty accurate, they must be right. And von Ahn says it will be incorporated into the digital copy of the book or newspaper that it came from.
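
The two-word scheme described above might be sketched roughly as follows. This is a minimal illustration and not reCAPTCHA's actual implementation: the function names, the in-memory store, the case-insensitive comparison, and the threshold of three matching answers are all assumptions (the interview only says the image is shown to "several people").

```python
from collections import defaultdict

MIN_VOTES = 3  # assumption: the interview only says "several people"

# Hypothetical in-memory store: unknown-word image id -> list of typed guesses.
guesses = defaultdict(list)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def check_challenge(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the user passes the security check.

    Only the known (control) word decides pass or fail. The guess for the
    unknown word is recorded only when the control word was typed correctly.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    guesses[unknown_id].append(normalize(typed_unknown))
    return True


def agreed_word(unknown_id):
    """Return the word once enough people have all typed the same thing."""
    votes = guesses.get(unknown_id, [])
    if len(votes) >= MIN_VOTES and len(set(votes)) == 1:
        return votes[0]
    return None


# Three users pass the control word "upon" and agree on the unknown word.
for typed in ["Morning", "morning", "morning "]:
    check_challenge("upon", "upon", "scan-42", typed)

print(agreed_word("scan-42"))  # morning
```

In this sketch a guess only counts if the same person also got the control word right, which is what lets the system trust answers for a word it cannot check directly.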

Mr. VON AHN: And the number of words that we've been able to digitize like this is insanely large, it's like over a billion.

GREENFIELDBOYCE: In the journal Science, he and his colleagues report that over the last year, Web users have transcribed enough text to fill more than 17,000 books. Marc Frons, chief technology officer of digital operations for The New York Times, says the pace is astonishing. Each month, the project completes about two years' worth of newspapers.

Mr. MARC FRONS (Chief Technology Officer of Digital Operations, New York Times): Next year, if all goes well, we can do as many as 70 years, which would be almost the entire rest of the archive that is not digitized.

GREENFIELDBOYCE: You might think that this system is taking people even more time than the old way since they have to type in two different things instead of just one. But von Ahn says it's actually faster to type English words than it is to type random letters and numbers.

There is one problem though. Sometimes, the book scanners offer up something that people can't read at all.

Mr. FRONS: Like, for example, just some sort of ink blot on the page.

GREENFIELDBOYCE: Web users can find that confusing. And, here's another thing.

Mr. BRIAN PIKE (Chief Technology Officer, Ticketmaster): When you pull two random words from books, you can get some very random combinations.

GREENFIELDBOYCE: Brian Pike is chief technology officer for Ticketmaster. He says the two words can occasionally form juxtapositions that could be weird or even offensive.

Mr. PIKE: And there are certain phrases and words we've asked them to make sure don't show up.

GREENFIELDBOYCE: He didn't want to give me an example. He says mainly, the system works great from a security standpoint. And if customers don't like it, at least now, they know that their time isn't wasted.

Nell Greenfieldboyce, NPR News.

Copyright © 2008 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.

NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.