Software Mimics Person's Voice
Film critic Roger Ebert had his larynx removed through surgery, but a company called CereProc in Edinburgh, Scotland, has created a beta version of his voice. Dr. Matthew Aylett, chief technical officer of CereProc, offers his insight.
ROBERT SIEGEL, host:
For our next item, I'm joined now by some guests who will now introduce themselves.
Synthesized Voice "Katherine": Hello, Robert. I'm Katherine. No last name, sorry.
SIEGEL: Well, pleased to meet you just the same, Katherine. You have a friend with you, I see.
Synthesized Male Voice: Hi, Mr. Siegel.
SIEGEL: Welcome to the program. And our next guest?
Synthesized Voice of George W. Bush: Hello, Robert. Folks know me as W. Friends call me George. You can call me Mr. President.
SIEGEL: Well, those people exist only in a software program. Actually, President Bush exists in real life, but his voice only exists in a program. Matthew Aylett is also real, and he joins us from Edinburgh, Scotland, right now. Welcome to the program.
Dr. MATTHEW AYLETT (Chief Technical Officer, CereProc): Thank you.
SIEGEL: And I should say that you are the technical officer of CereProc, a company in Scotland. We've been hearing synthetic computer voices for years, but these take it to a new level. What are you doing?
Dr. AYLETT: Well, what has happened over the years is that the industry's developed, and technology's developed quite extensively. So before, synthetic voices used to sound quite robotic.
Synthesized Voice "Perfect Paul": I am Perfect Paul, the standard male voice.
Dr. AYLETT: Now, they can sound extremely natural.
Synthesized Female Voice: I'm in a bad mood. So don't come anywhere near me.
SIEGEL: How do you go about doing this? How do you create a voice?
Dr. AYLETT: To a certain extent, the methodology is fairly straightforward. You take a lot of audio from a speaker. You then cut that up into tiny little pieces. Each piece is a little sound. So for example, cat would be made up of three sounds, /k/, /a/, and /t/.
In order to then produce a new sentence, you then take those sounds, you rearrange them, and you stick them back together again.
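The cut-and-rearrange idea Dr. Aylett describes can be sketched in a few lines of Python. This is only a toy illustration, not CereProc's software: the names UNIT_INVENTORY, LEXICON, and synthesize are hypothetical, the "audio" is a handful of made-up sample values, and a real system would presumably choose among many recorded instances of each sound and smooth the joins between them.

```python
from typing import Dict, List

# Hypothetical inventory: each sound maps to a short run of audio samples
# that would have been cut out of a speaker's recordings (values are made up).
UNIT_INVENTORY: Dict[str, List[float]] = {
    "k": [0.1, 0.3],
    "a": [0.5, 0.6, 0.5],
    "t": [0.2, 0.0],
}

# Hypothetical pronunciation dictionary: word -> sequence of sounds.
LEXICON: Dict[str, List[str]] = {
    "cat": ["k", "a", "t"],
    "act": ["a", "k", "t"],
    "tack": ["t", "a", "k"],
}


def synthesize(word: str) -> List[float]:
    """Look up the word's sounds and stick their recorded pieces together."""
    samples: List[float] = []
    for sound in LEXICON[word]:
        samples.extend(UNIT_INVENTORY[sound])
    return samples


if __name__ == "__main__":
    # The same three recorded pieces yield different words when rearranged.
    for word in ("cat", "act", "tack"):
        print(word, synthesize(word))
```

The point of the sketch is that "cat," "act," and "tack" all come out of the same three stored pieces, which is why a modest amount of recorded audio can cover a large vocabulary.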
SIEGEL: But you would need an awful lot of sound of one voice to do that.
Dr. AYLETT: Not as much sound as you might think, because although there are hundreds and thousands of words, within English there are only about 45 different sounds.
SIEGEL: Now, we heard about you and your company when we learned the story of the film critic Roger Ebert, who after surgery lost his voice, lost the ability to speak. And your company is providing him with a synthetic voice of himself.
Dr. AYLETT: That's correct. So Roger's lost his voice, but he has, of course, an awful lot of audio that he's recorded in the past. So we've been able to mine this audio data and to produce a prototype of his voice for him.
SIEGEL: He enters text, and the software finds those phonemes of Roger Ebert from things he actually said in recordings that you've assembled, and out comes a plausible Roger Ebert.
Dr. AYLETT: That's right.
SIEGEL: How much harder is it to recreate the voice of someone, given a record of their speech that's been recorded, as opposed to just creating "Sue from the north of England" or whatever?
Synthesized Voice of "Sue": This is Sue from up north, 'round Birmingham.
Dr. AYLETT: It's quite a lot more difficult. To record her, we got her into the studio. We gave her a script. Everything is recorded in the same environment with the same microphone. The material for Roger was a lot more difficult to deal with because, of course, it's recorded at different times with different microphones, different environments.
SIEGEL: Now, you have brought some examples for us of other things that you've done.
Dr. AYLETT: Yes. So one of the things that we're interested in is putting emotion to voices.
Synthesized Female Voice: What a lovely day.
Dr. AYLETT: We record the voice, and then we can tweak it to give a little bit of a sense of emotion, not as much as you'd be able to get out of a normal person speaking, but a little bit.
Synthesized Female Voice: You never listen to anything I say.
SIEGEL: The first one, I think, you played, could you just play that again?
Synthesized Female Voice: What a lovely day.
SIEGEL: There's something - I was trying to make President Bush earlier say gray.
Synthesized Voice of George W. Bush: Our director today is Melissa Gray.
SIEGEL: Something about those vowels doesn't seem quite right, especially at the end of a phrase. There should be more oomph there behind it, and we don't seem to speak that way when we just give you lots of phonemes.
Dr. AYLETT: That's very observant. The - one of the biggest problems duplicating voices is getting the intonation right as well.
SIEGEL: The application we heard about for Roger Ebert is remarkable. Is there much of that, or are we still at the level of creating a fun thing to do on your computer as opposed to some new service that can really enhance people's lives?
Dr. AYLETT: No, I think this is the beginning of something very important. It's not just a cool piece of software. It really represents what we are as people. There's no question that we can reproduce people's voices if we have sufficient audio. And for them, that's really quite important.
SIEGEL: Matthew Aylett, chief technical officer for CereProc in Edinburgh, Scotland. Thanks so much for talking with us.
Dr. AYLETT: No problem.
(Soundbite of music)
SIEGEL: This is NPR.