Turning Verification Codes into Books? While entering information on a Web page, you may have run into a CAPTCHA — that string of letters you type to prove you're a real person. Luis von Ahn, an assistant professor of computer science at Carnegie Mellon University, wants to harness all that extra typing to streamline the process of digitizing books.

If you've ever bought something online, chances are you've come across something known as a CAPTCHA. A CAPTCHA is that string of letters and numbers you have to read and then type back in. They often look twisted, as if they were typed on a woozy keyboard.


By typing them in, you prove that you are a real human being trying to make an online purchase. Sixty million CAPTCHAs are plugged in every day around the world. That's according to Luis von Ahn. He's an assistant professor of computer science at Carnegie Mellon University. He developed the technology behind CAPTCHAs and is now working on a project called reCAPTCHA.

COHEN: I spoke with Professor von Ahn about this new way to use CAPTCHAs.

Dr. LUIS VON AHN (Carnegie Mellon University): Each time you solve a CAPTCHA, you basically waste 10 seconds of your time. See, the thing is, during those 10 seconds you're doing something amazing. You're doing something that computers cannot yet do. What we're going to do now is we're going to get people to help us digitize books while they're solving CAPTCHAs.
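To put the "wasted" time in perspective, the two figures quoted in this piece (sixty million CAPTCHAs a day, about 10 seconds each) can be multiplied out. This is only back-of-the-envelope arithmetic on those spoken estimates, not a measured result; the constant names are illustrative.

```python
# Figures as quoted in the interview; both are rough spoken estimates.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10

total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
total_hours = total_seconds / 3600  # 3600 seconds in an hour

print(f"{total_hours:,.0f} hours of human effort per day")
```

Under those two assumptions, the daily total comes to roughly 167,000 hours.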

COHEN: Can you explain how this works?

Dr. VON AHN: Sure. So there's a lot of projects out there trying to digitize books. And basically what they're doing is they're taking books and they're scanning them. Now, in order to make these images searchable, you need to transform the images into text. The technology that does that is called OCR for Optical Character Recognition. The problem is that OCR is not perfect, so what we're going to do is we're going to take all the words that the computer cannot recognize and we're going to send them to the Web so that people can solve them for us while they're solving a CAPTCHA.
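The step of picking out the words the computer cannot recognize might look roughly like the sketch below. It assumes the OCR engine reports a confidence score per word, which the interview does not state; the `OcrWord` type, the `words_needing_humans` function, the 0.8 threshold, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    image_id: str      # identifier of the cropped word image (hypothetical)
    text: str          # the OCR engine's best guess
    confidence: float  # assumed score in [0, 1]; higher means more certain


def words_needing_humans(ocr_words, threshold=0.8):
    """Return the words whose OCR confidence falls below the threshold."""
    return [w for w in ocr_words if w.confidence < threshold]


# Made-up sample data for illustration only.
page = [
    OcrWord("w1", "society", 0.97),
    OcrWord("w2", "niss", 0.31),
    OcrWord("w3", "were", 0.93),
]
to_send = words_needing_humans(page)
print([w.image_id for w in to_send])
```

Only the low-confidence word images would then be handed out as CAPTCHAs.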

COHEN: And right now I'm looking at your Web site and I see here you've got an example of a sentence that was scanned in, and the actual sentence says, this aging portion of society were distinguished from - and the computer reads it as something that looks basically like it's maybe Norwegian or something, niss ajud tankem(ph), a society were distinguished frau.

Dr. VON AHN: All those words that you see there were incorrectly recognized. We're going to take each one of them and we're going to send them to some Web site around the world that's using a CAPTCHA.

COHEN: All right. So let's give it a shot here. I'm at the reCAPTCHA Web site and it's asking me to type in two words. Oh my goodness, and this first one is a little bit tough. I think it says P-R-A-E-T-O-R but I don't know what that would be.

Dr. VON AHN: So a lot of them are not actual words, you know. Sometimes, say, things - they might be proper names or other things.

COHEN: Okay. The next one I recognize. It says warpath. I'm going to give that a shot, hit submit. And it says my solution was correct.

Dr. VON AHN: Excellent.

COHEN: So these are words from actual books?

Dr. VON AHN: Yup.

COHEN: What will happen after I type in the solution? What happens to those words?

Dr. VON AHN: If a lot of people, when given the same images, type the same word, we're going to assume that those were the words that were in the image and then we're going to feed it back to the Internet Archive.
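The agreement rule described here can be sketched as a simple vote. The interview does not say how many people must agree, so the `min_votes` and `min_share` values below are assumptions, as is the `consensus` function name and the choice to ignore case and surrounding spaces.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.75):
    """Return the agreed transcription, or None if agreement is too weak.

    min_votes and min_share are illustrative thresholds, not reCAPTCHA's.
    """
    # Normalize so "Warpath " and "warpath" count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_share:
        return word
    return None


print(consensus(["warpath", "Warpath", "warpath "]))
print(consensus(["praetor", "proctor", "praetor"]))
```

With these thresholds the first call accepts "warpath", while the second returns None because only two of three answers agree.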

COHEN: Now, if this reCAPTCHA site tells me that my solution was correct, doesn't that mean somewhere in there they already know the answer, what these words actually are?

Dr. VON AHN: Aha. That's why we give you two words. One of those words we already know the answer for. The other one is a new one. And we don't tell you which one we know the answer for and which one we don't. We, you know, we flip them at random.
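A minimal sketch of that two-word scheme follows: one word image with a known answer, one without, shown in random order, with the user graded only on the known one. The function names, the image identifiers, and the `known_answers` lookup are hypothetical; this is not reCAPTCHA's actual code.

```python
import random


def make_challenge(control_id, unknown_id, rng=random):
    """Return the two image ids in random order, so the user cannot tell
    which one is the control word."""
    slots = [control_id, unknown_id]
    rng.shuffle(slots)
    return slots


def grade(slots, typed, known_answers):
    """Check the control word only.

    Returns (passed, guess), where guess is (image_id, text) for the
    unknown word, or None if the control word was typed incorrectly.
    """
    passed = False
    guess = None
    for image_id, answer in zip(slots, typed):
        answer = answer.strip().lower()
        if image_id in known_answers:
            passed = answer == known_answers[image_id]
        else:
            guess = (image_id, answer)
    return passed, (guess if passed else None)


known_answers = {"img-17": "warpath"}  # hypothetical control word
slots = make_challenge("img-17", "img-42", random.Random(0))
typed = ["warpath" if s == "img-17" else "praetor" for s in slots]
print(grade(slots, typed, known_answers))
```

A guess for the unknown word is kept only when the control word is right, which is why the site can report a correct solution without knowing both answers.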

COHEN: Now, just to play devil's advocate here for a moment, couldn't you say that all the time that you spent getting these books scanned, getting these images up there to get people to type them in, couldn't you just use that same amount of time to have someone type in all these scanned books and to save a couple of steps?

Dr. VON AHN: Oh, no way. Here we're talking about, you know, hundreds of millions of books literally. And each book has, I don't know, maybe 200,000 words. So typing that many words, it's an impossible task. A better way, which is what we're trying to do, is just distribute it among all humanity so that everybody will have to type a few words, but you'll only have to type three or four yourself.

COHEN: Luis von Ahn of Carnegie Mellon University's computer science department. Thank you so much.

Dr. VON AHN: Thank you.

Copyright © 2007 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.

NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.