Computers And Babies, Listening Carefully
IRA FLATOW, Host:
You're listening to SCIENCE FRIDAY. I'm Ira Flatow.
Used to be, if you wanted to talk to your computer, first you had to speak to it for hours at a time, pausing between each word to train this box to understand you. And then what? Well, you use it to write your novel, like speaking instead of typing. In those days, speech recognition required, you know, maybe it might flub one word in 10. It was almost faster to type it if you wanted to get it really accurate. Those days, we're talking about the long time ago of maybe 10 years.
But now, speech recognition is really starting to come into its own. You can talk to your car's GPS. You can get automatic captions to YouTube videos, or maybe you've tried doing a Google Voice Search on your smartphone. I've tried this many times and I am constantly amazed about what this understands, even foreign language words, with all that - without all that painstaking one-on-one training.
Our number is 1-800-989-8255, 1-800-989-TALK. And you can tweet us @scifri, @S-C-I-F-R-I.
Let me introduce my guests. Mike Cohen heads up Google's research in speech technology. He's based at the Googleplex in Mountain View, California. He joins us from a studio there. Welcome to SCIENCE FRIDAY, Dr. Cohen.
MIKE COHEN: Thank you. It's a pleasure to be here.
FLATOW: You're welcome. Sheila Blumstein is a professor of cognitive, linguistic and psychological sciences at Brown University in Providence, Rhode Island. She joins us from a campus there. Welcome to SCIENCE FRIDAY, Dr. Blumstein.
SHEILA BLUMSTEIN: Thanks so much. Welcome to be - it's great to be here.
FLATOW: Thank you very much. Mike, just how does Google speech recognition technology work? How do you get this to recognize the speech so well?
COHEN: Well, the most fundamental idea behind the approach that we take is that it's data-driven. And by that, what I mean is, we feed our machines many, many, many examples of people speaking - many different people speaking. And based on that, we have algorithms that build a model. And it's fundamentally a statistical model. So we build a statistical model of the language. That model has information about the basic acoustics or sound units of the language, like how the mms(ph) and ahhs(ph) and the buzz(ph) and the puhs(ph) and so on in the language are produced.
COHEN: It has information about the words and how they're pronounced, like economics might be spoken economics or ee-conomics(ph). And it has information about the word sequences, the phrases and sentences that tend to happen in the language.
FLATOW: So you're not building rules of grammar in English. You're using basically crowd-sourcing, listening to people say them and sort of self-correcting over time. The more people talk about it, the more you learn how it will work.
COHEN: Yeah. That's fundamentally correct. It would be probably extremely difficult if not impossible to try to explicitly program all this knowledge. And we don't really understand all of the minute details, so instead, we get lots and lots and lots of data and we build these big models, statistical models, of the...
FLATOW: Right. So give me...
COHEN: ...speech based on all that data.
FLATOW: So give me an idea - when I ask Google to find me something in Google Search, what goes on - give - walk us through some of the steps that go on in the process of recognition there.
COHEN: Okay. So first, the speech comes in. We figure out where speech started and ended, and we do something called feature extraction. Basically, it's something like a spectral analysis or trying to find what are all of the frequency components, the basic features that make up the sound. Then we take that and we feed it to the model that I just described.
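A minimal sketch of the kind of spectral analysis Cohen mentions, not a description of Google's actual feature extraction. It assumes a short frame of audio samples and uses a naive discrete Fourier transform to find the strongest frequency component. The function names, the 8000 Hz sample rate, and the test tone are all illustrative assumptions.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform: magnitudes for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return peak_bin * sample_rate / len(frame)


# Hypothetical input: a 64-sample frame of a pure 1000 Hz tone at 8000 Hz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(64)]
print(dominant_frequency(frame, SAMPLE_RATE))
```

A real recognizer would compute many such frames per second and pass a vector of features per frame, rather than a single peak, to the statistical model.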
COHEN: So it's the statistical model of the acoustics, the words, the grammar, in terms of how words are put together, and fundamentally, speech recognition is finding a best path, best matching path, through that model. And that's what the speech recognizer guesses was said.
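One common way to find a best-matching path through a statistical model is the Viterbi algorithm over a hidden Markov model. The interview does not say which search Google uses, so the sketch below is only an illustration of the idea. The two sound units, the coarse "hiss" and "tone" observations, and every probability are made-up assumptions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # Each column maps state -> (best log probability, previous state).
    first = observations[0]
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: two sound units and two coarse acoustic labels.
STATES = ["s", "ee"]
START = {"s": 0.8, "ee": 0.2}
TRANS = {"s": {"s": 0.6, "ee": 0.4}, "ee": {"s": 0.1, "ee": 0.9}}
EMIT = {"s": {"hiss": 0.9, "tone": 0.1}, "ee": {"hiss": 0.2, "tone": 0.8}}

print(viterbi(["hiss", "hiss", "tone", "tone"], STATES, START, TRANS, EMIT))
```

Log probabilities are used so that long utterances do not underflow to zero when many small probabilities are multiplied.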
FLATOW: Mm-hmm. Is there a dictionary it goes to, to find words that best fit the question I'm asking?
COHEN: Yeah, so one of the components of that big statistical model is a lexicon. And it's a list of all of the words in the language and a specification of how they are pronounced.
COHEN: So, for example, economics and ee-conomics(ph), there are two pronunciations.
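A lexicon of this kind can be pictured as a mapping from each word to one or more pronunciations. The sketch below is an assumption about its shape, not Google's data format. The phoneme symbols are illustrative, ARPAbet-style labels, and `words_for` is a hypothetical helper.

```python
# Hypothetical lexicon: each word maps to a list of pronunciations,
# each pronunciation being a tuple of phoneme labels.
LEXICON = {
    "economics": [
        ("EH", "K", "AH", "N", "AA", "M", "IH", "K", "S"),
        ("IY", "K", "AH", "N", "AA", "M", "IH", "K", "S"),
    ],
    "dog": [("D", "AO", "G")],
}


def words_for(phonemes):
    """Return every word that lists this phoneme sequence as a pronunciation."""
    target = tuple(phonemes)
    return [word for word, prons in LEXICON.items() if target in prons]


print(words_for("IY K AH N AA M IH K S".split()))
print(words_for("EH K AH N AA M IH K S".split()))
```

Both phoneme sequences resolve to the same word, which is how a single entry can cover two ways of saying it.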
FLATOW: Right. Sheila Blumstein, tell us what happens in the brain when we hear someone speaking. Is it similar to what Google is doing in voice recognition?
BLUMSTEIN: Well, in many ways, it is. Our idea of - the notion is that the acoustic waveform comes in and, again, just as Mike said, it's transformed into a frequency, temporal, amplitude domain. And that, in turn, maps on to another level of abstraction, individual sound segments like puh-tuh-cuh(ph), buh-duh-guh(ph), ee-aa-ooh(ph).
Those, in turn, map on to words like cat, dog, a mental lexicon as we would call it, and that, in turn, maps onto meaning. But broadly speaking, what's really important is that the system is always making probability guesses, using all the information it can at all points in time that it can.
FLATOW: You mean, when I - No, I'm sorry. Go ahead. Are you going to give an example?
BLUMSTEIN: No, no, no. Go ahead.
FLATOW: No, give me an example of what you mean.
BLUMSTEIN: Well, an example would be, for example, if I say - let's say, I say see(ph) versus sue(ph), as an example. When you say sue, you round your lips. When you round your lips, that lengthens your vocal tract, which lowers the frequency. So it turns out that we don't perceive individual sounds, but we use all the information to tell us what's coming. So suh(ph), followed by ee(ph) versus suh(ph), followed by ooh(ph) is very different. If you just say ss(ph) and ss(ph), like that, can you hear the difference in the frequency?
BLUMSTEIN: We use that. Another example would be that when we - the sounds map on to words. We know what are the possible consonants and the possible vowels that can fit. So, for example, truck. T-R starts the word, but it doesn't end any word. So we know where the beginning of a word is.
So all the time we're making hypotheses about what the possible word may be and then we, slowly but surely, as information comes in, we break it down into smaller pieces.
FLATOW: Is this how babies are learning when they learn vocabulary?
BLUMSTEIN: Well, we think so. I mean, one advantage the baby has is that - we - the baby appears to be born with essentially the building blocks for the sounds of language. So infants as young as a few days old perceive all the sounds, potential sounds of all languages in the beginning. And then slowly, as they're exposed to their native language, they actually reduce their ability to map on to the language.
However, they're also very much aware and are sensitive to the probabilities of what sounds can go with what. So again, the baby might hear truck and in English, T-R can begin a word but never ends a word. So that gives him information about where the beginning of a word might be.
BLUMSTEIN: Or milk. L-K ends the word, but there's no information - or we don't use L-K at the beginning of a word. So that the baby uses those kinds of probabilities in order to be able to map it onto what is the word truck, what is the word milk, and is it a word? Highlight...
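The cue Blumstein describes can be sketched by counting which clusters appear at the starts and ends of words. This is an illustration only: it uses spelling as a stand-in for sounds, and the tiny word list and function names are assumptions, not data from the research she describes.

```python
from collections import Counter

# Hypothetical toy vocabulary; letters stand in for sounds.
WORDS = ["truck", "tree", "train", "trip", "milk", "silk", "walk", "talk", "dog", "cat"]


def edge_counts(words, size=2):
    """Count clusters of the given size at word starts and word ends."""
    starts = Counter(w[:size] for w in words if len(w) >= size)
    ends = Counter(w[-size:] for w in words if len(w) >= size)
    return starts, ends


def start_probability(cluster, starts, ends):
    """Estimate how often a cluster seen at a word edge is a word start."""
    total = starts[cluster] + ends[cluster]
    return starts[cluster] / total if total else None


starts, ends = edge_counts(WORDS)
print(start_probability("tr", starts, ends))
print(start_probability("lk", starts, ends))
```

In this toy list, "tr" only ever opens a word and "lk" only ever closes one, so hearing either cluster is evidence about where a word boundary falls.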
FLATOW: Yeah, Mike, does Google use sort of the same kind of recognition?
COHEN: In a certain sense, yes, but in a certain sense, no. So it's not like we explicitly give our machine rules about where an L-K might happen. But those aspects of the language are learned in a statistical sense from all of the data that we feed it.
FLATOW: Mm-hmm. Do the algorithms sometimes disagree with each other? For example, the lower level processor says, I know I heard this sound and the higher level processor says, but that sentence doesn't make sense, if you put those together.
COHEN: Yes. Yeah, very much so. In fact, one of the core principles of how we do this is, all of the knowledge sources exert their influence simultaneously. So our knowledge about the acoustics or the basic sounds, like the ums, ahs, and uhs, our knowledge about the pronunciation of words, our knowledge about grammar in the statistical sense, what words tend to follow what other words, all gets brought to bear simultaneously.
So if somebody said something - let's say the recognizer just recognized the dog and maybe it's a little unsure. Did the person say the dog ran or did the person say the dog can? If there is a lot of acoustic ambiguity there, then the language model, the piece that says, hey, it's way more likely after the dog to have the word ran than can, having learned that from lots of examples, that will influence the recognition search.
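The dog ran versus dog can example can be sketched as adding an acoustic score and a language-model score in log space. This is a simplified illustration under stated assumptions: the bigram counts, the acoustic probabilities, and the `lm_weight` parameter are all invented for the example.

```python
import math

# Hypothetical counts of word pairs seen in training text.
BIGRAM_COUNTS = {("dog", "ran"): 90, ("dog", "can"): 10}


def bigram_prob(prev, word):
    """Estimate P(word | prev) from the toy counts."""
    total = sum(count for (p, _), count in BIGRAM_COUNTS.items() if p == prev)
    return BIGRAM_COUNTS.get((prev, word), 0) / total


def pick_word(prev, acoustic_probs, lm_weight=1.0):
    """Choose the candidate with the best combined log score."""
    def score(word):
        acoustic = math.log(acoustic_probs[word])
        language = math.log(bigram_prob(prev, word))
        return acoustic + lm_weight * language
    return max(acoustic_probs, key=score)


# Hypothetical acoustic scores: the sound alone slightly favors "can".
acoustic = {"ran": 0.45, "can": 0.55}
print(pick_word("dog", acoustic))                 # language model included
print(pick_word("dog", acoustic, lm_weight=0.0))  # acoustics only
```

With the language model switched off, the acoustically closer word wins. With it on, the strong preference for ran after dog outweighs the small acoustic difference.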
FLATOW: Do you folks in the computer world actually look at how the brain works and try to design a computer around that, or do these speech recognition processes just resemble each other because we've all arrived at that - that's the best solution independently?
COHEN: Right. It's a little of each. You know, we're certainly influenced by things that we've learned from linguists or neuroscientists, but fundamentally our goals are engineering goals. We're trying to figure out how, with modern day computers, to build the very best, most accurate, most useful systems for end users and so fundamentally, what we're guided by is new algorithms that gradually work better and better.
FLATOW: Let me see if I can get a quick question in before the break. Mike Ford(ph) in North Carolina. Hi, Mike.
MIKE FORD: Hey, how you doing?
FLATOW: Hi, there. Go ahead.
FORD: Yeah, question. As you can tell, I'm not originally from North Carolina so I was wondering how they handle dialect.
FLATOW: Yeah. How do accents and dialects - good question. How do you do that?
COHEN: Yep. Okay. So that goes back to the fact that we learn from examples. So when we train our recognition system or feed it examples, we try to get speech from literally millions of people speaking. And it has to include, you know, people like me from Brooklyn who do, you know, ca, ca, ca, and people from other parts of the country.
And so it has to cover dialects, it has to cover voice types, it has to cover, you know, obviously both genders. It has to cover all the different kinds of noise conditions and everything else so that as long as it's represented in our training set, therefore it gets into the statistics of our statistical model, we should be able to handle it.
FLATOW: Mm-hmm. And is there anything you need to do to work out, any future that we should see, better voice recognition?
COHEN: Well, there's a lot of active research. A lot of the research has to do with getting more data, finding ways to use more data to train these systems. And as we get more data, the question becomes: How do we make the model bigger or richer in an appropriate way so it becomes a better model of the language?
FLATOW: Right. All right.
COHEN: And that research is at many levels.
FLATOW: Thank you both for taking time to be with us. Mike Cohen heads up Google's research in speech technology at the Googleplex, and Sheila Blumstein is a professor of cognitive, linguistic and psychological sciences at Brown University. Thank you both for taking time to be with us today. We're going to take a...
COHEN: Thank you.
FLATOW: You're welcome. We're going to take a short break. We're going to come back and change gears. And we're going to still talk about technology: the Xbox Kinect. How does it know what you're doing? You're not holding onto any of its tools. So we'll try to dissect that one also. Stay with us. We'll be right back after this break.
I'm Ira Flatow. This is SCIENCE FRIDAY from NPR.