(Soundbite of phone dialing)
Unidentified Woman: 1-800-Fandango. Do you want to look in and around New York? You can also enter a theater express code followed by pound. Or for restaurants and other services, you can say call 411.
MIKE PESCA, host:
I forgot, what was the original question? Am I supposed to be looking around New York?
Unidentified Woman: I'm sorry, I still didn't get that. If you want to look in New York, New York, say yes, or press one. Otherwise, say no, or press two to pick another location.
PESCA: Yeah. I'll look in New York. That sounds good to me. By the way, we're here with John Seabrook...
Unidentified Woman: I'm sorry. I'm having trouble understanding. To look in New York, New York, press one.
PESCA: Yeah. New York. I'll look in New York. Do you get that? We are here with John Seabrook of the New Yorker who just wrote about voice-recognition software. How are you doing there, Fandango?
Unidentified Woman: Please try your call again later. Thanks for calling 1-800-Fandango. Good-bye.
PESCA: Yeah. Thanks a lot, Fandango. Well, John, I guess I had - what? A typical experience when using this voice-recognition software? What didn't she get about me?
Mr. JOHN SEABROOK (Staff Writer, The New Yorker): She doesn't get your kind of vibe. She wants you to be like a straitlaced guy.
PESCA: I guess.
Mr. SEABROOK: You're a loosey-goosey guy.
PESCA: I'm on a pretty clean phone line. If I was on a bad cell phone and was just saying yes or no, would they be able to get that?
Mr. SEABROOK: Doubt it.
PESCA: Really? So what are the big weaknesses of voice-recognition software, in general?
Mr. SEABROOK: Well, you know, the big weakness is that people say things in all kinds of different ways. They say yes in any of 20 or 30 different ways. You know, or they say yes, ma'am, or yes, sir, if they're from the South. And the system is - I mean, they try to anticipate all different ways people are going to say something. But language always evolves, always changes. People kind of want to say things in new, fun ways, and computers just can't really keep up.
(Soundbite of laughter)
PESCA: So it's like an arms race between the slang for yes and what computers understand.
Mr. SEABROOK: Yeah, I mean, another example of that is the whole Valley Girl way of raising your voice at the end of a sentence.
PESCA: Up talk?
Mr. SEABROOK: Yeah, when you're actually making a declarative sentence. That's something computers can't really get at all, because they think you're asking a question.
(Soundbite of laughter)
PESCA: Well, that would be useful in real life. Maybe we can drum that out of people. But so, will computers understand the slang of two months ago? Is it constantly inputted, or once the voice-recognition software is in place, can't you update it?
Mr. SEABROOK: You can update it, but it's not something that's real simple to do. I mean, basically, this is a cost. It's all a cost. From the corporate point of view, they don't want to spend money on this because the whole reason we're doing it is to save money. Talking to a real person costs a lot more than talking to a machine. So at a certain point, if you're updating it every few days, it's probably going to cost a lot more than they really want to spend.
PESCA: And what kind of - you quoted I think a five dollar per call cost in your article. That seemed high to me. It really costs them five dollars to deal with a customer?
Mr. SEABROOK: That's on the low side of the numbers that I saw. It was between five and 15 are the numbers that were given. Although when you think about how much they are paying the people out in Bangalore, the call centers, which is like a couple dollars at that, I don't really know what costs that much money, but that's what they say.
PESCA: Another number in your article is 43 billion. That's the number of minutes North Americans lost last year to, you know, wasting their time with trying to deal with customer service on the phone. I did the math, 43 billion minutes, that's 80,000 combined years. That's more than 80,000 years spent on the phone last year. That was mindboggling to me. How about to you?
Mr. SEABROOK: Yeah, I know, it's a nightmare. When you think about just how much frustration you can get from five minutes, you know, or even just what you just had, one minute, and then you make it 43 billion minutes. That's like a lot of - if you could bottle all of that frustration, you could like move mountains with it, probably.
PESCA: Have you found that people are more frustrated because they are being asked to use voice-recognition software, and the software is just not up to the task?
Mr. SEABROOK: Well, I don't know. I mean, there are situations where it works. The thing about it is if you have a very specific task for which there are only a few possible ways of saying it, it works pretty well.
PESCA: But then again, if you do that, you can just press one or two. If they are simple tasks with binary choices, I always opt for the keypad.
Mr. SEABROOK: Yeah, I mean, if you have a keypad in front of you, you might want to do that. But maybe you are sort of kicked back in your chair or, you know, whatever, you don't feel like taking the phone away from your ear and pressing the one. What you were working with there was a kind of what they called natural-language-understanding program. They don't tell you, say yes or say no. You're supposed to be able to speak as you would speak, and that's usually where things go wrong.
PESCA: Yeah. And Amtrak's Judy.
Mr. SEABROOK: Julie. Julie is her name. Yeah, she's the Amtrak voice. She sounds kind of like that Fandango voice. I mean, there are a few voice talents that you hear quite a lot. It's kind of a lucrative business for people who have the right voice. To be the voice of whatever.
PESCA: Always women? Do any companies want to go with an authoritative man?
Mr. SEABROOK: Google has a guy. He's not that authoritative. Google now has a 411 number, so you can - it's 800-GOOG-411. And that has a guy, but he's kind of like - he's not really a macho guy. He's more of a kind of sort of computer-nerdy type.
PESCA: OK, I was going to think, maybe, the drinking buddy.
Mr. SEABROOK: He's not a drinking buddy.
PESCA: Sort of like if Google came to life, he'd be this guy.
Mr. SEABROOK: Like if you called Bud, maybe you'd get like a drinking buddy.
(Soundbite of laughter)
PESCA: So what were the most surprising things that you found out that were hurdles for computers to recognize human voices?
Mr. SEABROOK: Well, one of the interesting things about the whole world of voice recognition is that it's all probability-based. I mean, there's no actual understanding of what you are saying. It's all based on probability that what you said might be this, or might be this. And that was a decision made many years ago by the researchers, because they realized they couldn't figure out the rules of language, you know, how we actually understand and use language. So they made this one choice, and you know, now we're seeing some of the fruits of that.
Like translation, for example, works pretty well with speech recognition. I saw these translators - they were actually kind of amazing, they translated from English into Arabic and Arabic back into English - that they are now using at checkpoints in Iraq. Because a lot of times, you don't have a translator, and it's dangerous to be a translator. And so, if you had something that you can at least introduce a little understanding into the situation, you know, I think that can help defuse tensions.
But then there's the emotional component of speech, which it turns out is a lot more important to speech than we thought. We didn't know that until we started making speech recognizers that couldn't recognize emotion, and it turns out a lot of what goes on in our voice is actually emotional, not just logical. It isn't just about the words. It's about the way we say them and the emphasis we give them. And all of that is just completely dark in terms of what computers can understand.
PESCA: Are these companies trying to get into the realm of if they have an angry customer on a line, deal with them a different way than with some guy who's just giving yes, no answers?
Mr. SEABROOK: Yeah, because from a company's point of view, if somebody is angry, and particularly, if it's like a sale is at stake, they want to transfer that person to an agent. But even there, the software is very primitive. I mean, it's basically just like the guy is shouting obscenities, then he's probably angry, you know? But it's not like they can pick up, you know, sort of nuanced anger. In the piece, I talk about how there's cold anger, where people speak very slowly, and that could be perceived as...
PESCA: I want to see "The Incredible Hulk," Fandango.
Mr. SEABROOK: Right. And also the other problem over there is if you make a mistake, and you think someone's angry, and in fact, they are not, or you think they are sad and you start sort of - then the voice turns very sort of sympathetic in an inappropriate context, that would just be a nightmare.
But another interesting thing is that speech recognizers are all trained with speech, and up until recently, they've only used acted speech, the actors speaking lines. And so by collecting all these angry voices, real angry voices, and feeding those into the recognizers, there is actually a chance they can improve and respond to the way people really speak.
PESCA: Because you write that even an actor approximating anger doesn't get to real anger, which a very finely calibrated machine perhaps could sense. Acted anger is different from real anger. If you have a great machine, it can sense that.
Mr. SEABROOK: Yeah, the vocal cords apparently vibrate in a different way when you are truly enraged. And that's damaging, or can be damaging, to your voice, so actors tend to not hit that point. Also, actors tend to emote all at once. They turn on anger all at once, whereas, in fact, anger usually builds over several utterances.
PESCA: Right. When you wrote that in the article, my mind flashed to, this is what I want to use a machine to find out if Bill O'Reilly is just acting angry, or if he is really driven over the edge. So you mentioned the translation. I did come away thinking that as messed up as voice-recognition software and customer service is, there seems to be some pretty bright spots. Not only in terms of translation, but this Dutch train station kind of has this sound recognition software that helps cut down on vandalism. Tell me about that.
Mr. SEABROOK: Yeah, well, you have these cameras installed in a lot of public places, but cameras can only pick up a visual sort of angry action. And usually by the time people are actually hitting someone, something that a camera can see, it's too late. Whereas if they could pick up the voice, they can sense that a violent confrontation might be occurring in about 40 seconds or a minute, and they can then try to respond or at least train the cameras on that. And yeah, actually, that worked really well, as I said in the article, because it doesn't pretend to be any more intelligent than it is. It just does one thing, but for what it does, it works pretty well.
PESCA: The article is called "Hello, HAL." And throughout it, you talk about how taken you were by "2001," the Stanley Kubrick film, and when the computer HAL gets smart and speaks soothingly, "I'm sorry, Dave." But did the impetus for this article come from fond memories of the "2001" film? Or did it come from frustration at having to deal with an online call center?
Mr. SEABROOK: Well, for me, it came from HAL, because it's 40 years since the movie came out. I saw it when I was nine, and it really stayed with me. Just talking to a computer - as I say in the article, if you could marry language and tool-making, you would bring together the two greatest human technologies. And that would be extraordinary. When you think about it, all the things that could be done with your voice, you'd never have to use a keyboard. I mean, that would be amazing, but it's really still Hollywood stuff. And in a way, Hollywood has kind of misled us, as it often does, I guess, into expecting more than technology can actually offer.
PESCA: Of all the voice-recognition experts, computer experts, that you spoke with, did anyone give you a really good tip that you can share with our audience, in terms of how to deal with the voice recognition guy on the other end of the line?
Mr. SEABROOK: Well, definitely don't say, yes, ma'am, and yes, sir. Don't be polite. Don't treat them like they are humans, because if you do, they're not going to understand, you know, where you are coming from.
PESCA: And have you changed your behavior when given the option between punching one or saying one, do you still punch one on your keypad?
Mr. SEABROOK: No, I say it. I mean, my one seems to be pretty well understood. So, I just - it's still kind of a kick when it works. You know, so...
PESCA: To me, the craziest thing is when they say punch in or say your 18-digit account number. Right, I'm going to give you 18 chances to mishear me. I'll just be punching it in, thank you.
Mr. SEABROOK: Well, the other thing there is somebody can be sort of standing behind you and listening to your 18-digit account number, whereas if you're punching it in, they might not see you doing that. So...
PESCA: It's both a fascinating article and practical tips from John Seabrook of the New Yorker. He wrote "Hello, HAL: Will we ever get a computer we can really talk to?" in the newest edition of the New Yorker. Thanks a lot, John.
Mr. SEABROOK: Thanks. Good to be here.
PESCA: And that is it for this hour of the BPP. We are always online at npr.org/bryantpark. Let's chill out to these sounds for a second, and then I'll come back and tell you a little bit about the people who help us with the show.
The Bryant Park Project is directed by Jacob Ganz and edited by Trish McKinney. I'm sorry. Did we try to say Ish McTrinney? No. Trish McKinney. Our technical director is Manoli Wetherell. I'm sorry. Did you say, Amoli Witherall? No. Stop it, voice-recognition person.
Our staff includes Dan Pashman, Ian Chillag, Win Rosenfeld, Angela Ellis, Lauren Spohrer, Caitlin Kenney, Paul Hechinger - got to ask Paul about that later - Zena Barakat and Laura Silver. Laura Conaway edits our website and blog. Our newscaster is Mark Garrison, unless sometimes when it's Korva Coleman. You never know by how I introduce that segment. Our senior producer is Matt Martinez. Sharon Hoffman is our executive producer.
I'm Mike Pesca. We're online all the time at npr.org/bryantpark. Are you trying to say Ryan Dark? No, Bryant Park. I hate you! This is the Bryant Park Project from NPR News.
NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.