Kindle's New Voice Is Almost Human

The latest version of Amazon's electronic book reader features the latest in text-to-speech technology. Could a dystopian future where NPR hosts are replaced by soulless robots soon be upon us?

Unidentified Voice #1: From NPR News, this is ALL THINGS CONSIDERED. I'm Jacki Lyden.


JACKI LYDEN, host: Wait a minute. I'm Jacki Lyden. That plastic imposter you just heard was generated by a computer program, just so many gears and widgets.

No, I'm not being hostile. That's called text to speech or speech synthesis, and…

Unidentified Voice #1: It's Science Out of the Box.

(Soundbite of music)

LYDEN: Text to speech has been in the news lately, thanks to the latest version of Amazon's electronic book reader. It's called the Kindle 2. And it has a feature that will read the text out loud to you.

That worries some authors, including Roy Blount Jr., who's president of the Authors Guild and a panelist on NPR's WAIT WAIT ... DON'T TELL ME. Blount wrote an op-ed piece in the New York Times recently about the Kindle 2. He thinks the device could eventually take a bite out of the audio book market. He wants authors to be paid for those audio rights.

We wondered how text to speech actually works, so we turned to Andy Aaron. He's a speech researcher at IBM, and he specializes in this sort of thing. He says computer voices sound more human these days because, well, guess what? They are more human.

Mr. ANDY AARON (Speech Researcher, IBM): So the idea is, we've given up trying to model the human speech-production system. It's too complicated. There are too many parameters. Nobody understands it.

So instead, what we'll do is, we'll audition lots and lots of voice actors, pick one we like, and record them reading untold amounts of sentences over a period of maybe a month, until they're exhausted.

And then we take those sentences, chop them up into individual pieces called phonemes, and build a library of that person's voice.
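(Editor's illustration: the library Mr. Aaron describes could be pictured roughly as follows. This is a minimal sketch, not IBM's system. The `Unit` record, the `build_library` function and the sample timings are all hypothetical, and a real system would store audio and get its timings from an alignment tool.)

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    """One recorded instance of a phoneme (hypothetical structure)."""
    phoneme: str
    sentence_id: int
    start_ms: int
    end_ms: int
    next_phoneme: Optional[str]  # what the speaker said right after it


def build_library(aligned_sentences):
    """Group every recorded phoneme instance under its phoneme label.

    aligned_sentences: list of sentences, each a list of
    (phoneme, start_ms, end_ms) tuples. The timings here are invented.
    """
    library = defaultdict(list)
    for sentence_id, segments in enumerate(aligned_sentences):
        for i, (phoneme, start_ms, end_ms) in enumerate(segments):
            following = segments[i + 1][0] if i + 1 < len(segments) else None
            library[phoneme].append(
                Unit(phoneme, sentence_id, start_ms, end_ms, following)
            )
    return dict(library)


# Two invented "recordings", already chopped into phonemes with timings.
recordings = [
    [("j", 0, 80), ("a", 80, 190), ("k", 190, 250), ("e", 250, 400)],
    [("j", 0, 70), ("o", 70, 210), ("k", 210, 280)],
]
library = build_library(recordings)
print(len(library["j"]))  # number of recorded "j" instances
```

(Keeping a note of what followed each sound in the original recording is an assumption made here so that a later step could match sounds to their context.)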

LYDEN: So I was sitting at my computer with a text-to-speech program open, and I type in a sentence - let's say, hello, I'm Jacki Lyden. What happens then?

Mr. AARON: The words that you typed are turned into a stream of phonemes, a list of phonemes rather than a list of words. The word Jacki, for example, has four phonemes in it.

LYDEN: Four phonemes. What are they?

Mr. AARON: They are j, a, k and e. We may have 10,000 j sounds. We want to pick one that in its original context, when the speaker said it, was followed by an a sound because it's much more likely to fit correctly.

We try to pick one that's as closely matched to where it was originally recorded in the sentence to where it's going to be used.
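(Editor's illustration: the two steps Mr. Aaron describes, turning typed words into phonemes and then picking recordings whose original context matches, might be sketched like this. The pronunciation table, the tiny library and the function names are hypothetical. Production systems weigh many more factors than the single following sound used here.)

```python
# Hypothetical pronunciation lookup; the phoneme labels follow the interview.
PRONUNCIATIONS = {"jacki": ["j", "a", "k", "e"]}

# Hypothetical library: phoneme -> list of (unit_id, phoneme that followed it
# in the original recording, or None at the end of a sentence).
LIBRARY = {
    "j": [("j_001", "o"), ("j_002", "a"), ("j_003", "u")],
    "a": [("a_001", "t"), ("a_002", "k")],
    "k": [("k_001", None), ("k_002", "e")],
    "e": [("e_001", "n"), ("e_002", None)],
}


def to_phonemes(text):
    """Turn typed words into one flat list of phonemes."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PRONUNCIATIONS[word.strip(".,!?")])
    return phonemes


def select_units(phonemes):
    """For each phoneme, prefer a recording that was originally followed by
    the same phoneme that follows it now; otherwise use the first one."""
    chosen = []
    for i, phoneme in enumerate(phonemes):
        wanted_next = phonemes[i + 1] if i + 1 < len(phonemes) else None
        candidates = LIBRARY[phoneme]
        match = next((c for c in candidates if c[1] == wanted_next), candidates[0])
        chosen.append(match[0])
    return chosen


print(select_units(to_phonemes("Jacki")))
# prints ['j_002', 'a_002', 'k_002', 'e_002']
```

(In this toy library the chosen "j" is the one that was recorded just before an "a" sound, which is the preference described above.)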

LYDEN: So last week, as we noted, Roy Blount and the Authors Guild complained about the text to speech on the Kindle 2, and Amazon, in turn, agreed to let individual authors decide whether or not they wanted that feature to be used for their books, the text to speech.

We got our hands on a Kindle 2, and we had the machine read to us a little bit. This is from the book "The Best and the Brightest," by David Halberstam.

Unidentified Voice #2: (Reading) Then they spoke of defense, a glandular thing, Lovett said, a monstrosity. Even talking about it damaged a man's stomach.

LYDEN: Doesn't exactly have a whole lot of range, that human dramatic emotion, does it?

Mr. AARON: No, it doesn't. I'm a big believer in text to speech. I think it's an amazing technology. But even so, I don't think that we're really anywhere near reading a novel out loud in a meaningful way.

And the reason is, it requires a deep text understanding. And the technology isn't even close to that right now, and I don't see it happening five years from now, either.

LYDEN: I understand that you are working, Andy, on text to speech systems that can express different emotions. You sent us two examples of computer speech, old and new. Let's play them and then talk about the differences.

Unidentified Voice #3: These cookies are delicious.

Unidentified Voice #4: These cookies are delicious.

LYDEN: How did you do that? How did that change?

Mr. AARON: What we did is, in the first sample, that's just the straight text-to-speech system you're hearing. In the second sample, we brought the actor into the studio and told her to read 1,000 sentences that were recorded with an upbeat voice.

And just the fact that her mood was different, her tone was different in recording those phonemes translates into a completely different text to speech sample.
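(Editor's illustration: one way to picture the effect of a second, upbeat recording session is a library whose units carry a style tag. The unit names, the `pick` function and the fallback to neutral recordings are assumptions made for this sketch, not details from the interview.)

```python
# Hypothetical units: (unit_id, phoneme, style of the recording session).
UNITS = [
    ("k_neutral_01", "k", "neutral"),
    ("k_upbeat_01", "k", "upbeat"),
    ("oo_neutral_01", "oo", "neutral"),
]


def pick(phoneme, style):
    """Prefer a unit recorded in the requested style, else fall back to neutral."""
    for wanted in (style, "neutral"):
        for unit_id, unit_phoneme, unit_style in UNITS:
            if unit_phoneme == phoneme and unit_style == wanted:
                return unit_id
    raise KeyError(f"no recording of {phoneme!r}")


print(pick("k", "upbeat"))   # an upbeat recording exists, so it is used
print(pick("oo", "upbeat"))  # none exists, so the neutral one is used
```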

LYDEN: I guess, Andy, pretty soon we're not going to need any more radio hosts.

Mr. AARON: I don't know if I'd go that far.

LYDEN: Well, I'll be hopeful with you.

(Soundbite of laughter)

LYDEN: And on that note, I'm going to have my computer doppelganger say goodbye to you.

Unidentified Voice #1: Andy Aaron is a speech researcher working on text to speech systems with IBM. Thanks for joining us.

Mr. AARON: Thank you.

Copyright © 2009 NPR. All rights reserved. Visit our website terms of use and permissions pages for further information.

NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.