Bridging the Online Language Barrier: Translating the Internet : All Tech Considered Not long ago, the internet was all in English. But how do you bring people together when they can't understand each other?

Bridging the Online Language Barrier: Translating the Internet


From NPR News, this is ALL THINGS CONSIDERED. I'm Melissa Block.

English is the Internet's dominant language. It's used by more than a quarter of those who go online. But Chinese seems certain to surpass English within a few years, and dozens of other languages are experiencing huge growth.

Mark Phillips of WNYC's On The Media set out to explore the language barriers that are forming in cyberspace and the tools that are being developed to overcome them.

MARK PHILLIPS: The Internet started as a project by the U.S. military and then grew with the help of American universities. Many of the earliest, big Internet startups were also U.S.-based, and that's why...

Mr. ETHAN ZUCKERMAN (Co-Founder, Global Voices): There's been this sense that English is the default language on the Internet.

PHILLIPS: Ethan Zuckerman, co-founder of the multilingual blog network Global Voices, says Internet users all over the world used English and that at first it had a big upside.

Mr. ZUCKERMAN: Things were easier. It was easier to have the sense that we're all in the same conversation. That we were all laughing at the same jokes. A common language is a first step towards communication across cultural boundaries.

PHILLIPS: But non-English growth on the Web is skyrocketing. Better access to computers and Internet connections around the globe have allowed hundreds of millions of new users to get online. And with so many signing on, communities of bloggers, message boards and news sites can write in their native languages.

Arabic Internet users, for example, have increased by over 2,000 percent over the past decade. Ethan Zuckerman says this is mostly very good news. But...

Mr. ZUCKERMAN: The one big negative is that it introduces this huge source of fragmentation, lots of separate Internets. And they're separated by a couple of different things, but language is the big one. And language is the one that turns out to be hardest to overcome.

PHILLIPS: Many Internet developers hope that translation software can overcome that age-old language barrier and make all the content on the Web available to everyone. The idea dates back to World War II. And while there are many machine translating systems, one of the most promising right now is Google Translate.

Michael Galvez, a project manager at Google Translate, says Google's advantage comes from harnessing unprecedented amounts of data.

Mr. MICHAEL GALVEZ (Project Manager, Google Translate): What we do is we actually use hundreds of billions of words that Google infrastructure has access to. This is actually scoured from the Web.

PHILLIPS: It's a two-step process. First, Google pulls in all the text, recognizes the language and creates what it calls a language model. There's one for each of the 52 languages available on the service.

The language model gives Google Translate a feel for the language. For example, it knows the sentence "the boy are sad" is very rare, just as a five-year-old knows that sounds weird. But the language model only teaches the computer how to speak each language by itself.
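
(Editor's illustration, not part of the broadcast.) The idea that a language model can tell that "the boy are sad" is rare can be sketched with a tiny bigram model. This is a minimal sketch under stated assumptions: the five-sentence corpus, the add-one smoothing, and the names `CORPUS`, `train_bigram_model`, and `log_prob` are all hypothetical and chosen for illustration. The broadcast does not describe how Google's language models are actually built.

```python
from collections import Counter
import math

# Hypothetical toy corpus standing in for text gathered from the Web.
CORPUS = [
    "the boy is sad",
    "the girl is sad",
    "the boys are sad",
    "the boy is happy",
    "the girls are happy",
]

def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    context_counts = Counter()  # how often a word appears as a left context
    bigram_counts = Counter()   # how often a (previous, next) pair appears
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)

def log_prob(sentence, model):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

model = train_bigram_model(CORPUS)
good = log_prob("the boy is sad", model)
bad = log_prob("the boy are sad", model)
print(f"the boy is sad:  {good:.3f}")
print(f"the boy are sad: {bad:.3f}")
```

Because "boy are" never appears in the toy corpus, the second sentence receives the lower score, which is the sense in which the model finds it "rare."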

The next step is to translate from one language to another. Google's Michael Galvez says for that...

Mr. GALVEZ: We also build what's called a translation model, using previous human translation that we have access to - documents from the E.U., the United Nations, very high-quality translation corpora.

PHILLIPS: This translation model allows Google's computers to move between multiple languages. Paired with the language model, it produces startlingly accurate results. Plug in an article from a Spanish language newspaper, and it reads like an English article that just needs a trip to the copy editor. And because it's based on what people actually write on the Web, it can deal with mistakes.
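
(Editor's illustration, not part of the broadcast.) Pairing a translation model with a language model can be sketched as scoring each candidate translation twice and keeping the candidate with the best combined score. Everything below is an assumption made for illustration: the word-for-word table `TRANSLATION_TABLE`, the hand-written `BIGRAM_PROBS`, the `UNSEEN` fallback, and the function names are hypothetical, and the sketch ignores word reordering. It is not a description of Google's system.

```python
import itertools
import math

# Hypothetical translation model: P(English word | Spanish word).
# In practice such numbers would be estimated from human translations.
TRANSLATION_TABLE = {
    "el": {"the": 1.0},
    "niño": {"boy": 0.7, "child": 0.3},
    "está": {"is": 0.5, "are": 0.5},
    "triste": {"sad": 1.0},
}

# Hypothetical language model: probability of a word given the previous word.
BIGRAM_PROBS = {
    ("<s>", "the"): 0.9,
    ("the", "boy"): 0.4,
    ("the", "child"): 0.3,
    ("boy", "is"): 0.6,
    ("boy", "are"): 0.01,
    ("child", "is"): 0.6,
    ("child", "are"): 0.01,
    ("is", "sad"): 0.3,
    ("are", "sad"): 0.3,
    ("sad", "</s>"): 0.8,
}
UNSEEN = 1e-4  # fallback probability for word pairs missing from the table

def lm_log_prob(words):
    """Language-model score: how natural the English word sequence is."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(BIGRAM_PROBS.get(pair, UNSEEN))
               for pair in zip(tokens, tokens[1:]))

def translate(source):
    """Return the candidate with the highest combined score.

    Raises KeyError for source words missing from TRANSLATION_TABLE.
    """
    options = [TRANSLATION_TABLE[word].items() for word in source.lower().split()]
    best_words, best_score = None, -math.inf
    for combo in itertools.product(*options):
        words = [english for english, _ in combo]
        tm_score = sum(math.log(prob) for _, prob in combo)
        score = tm_score + lm_log_prob(words)
        if score > best_score:
            best_words, best_score = words, score
    return " ".join(best_words)

print(translate("el niño está triste"))
```

In this toy setup the translation table alone cannot choose between "is" and "are," since both have probability 0.5. The language model breaks the tie by rating "boy is" far above "boy are."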

Mr. GALVEZ: So for example, instead of spelling receive with an E before I, transpose the I and the E. And you translate this into Spanish, it will actually translate this correctly.

PHILLIPS: Yeah. It even knows what to do with slang, like the abbreviation for laugh out loud. I just typed in LOL, and that's ja, ja, ja with a J.

(Soundbite of laughter)

Mr. GALVEZ: Yes, so it found LOL. Yeah, in Spanish, it's ja, ja, ja with a J, yes.

PHILLIPS: But Google Translate is limited. Often it communicates the gist without the all-important context. Sometimes it makes no sense at all.

Mr. ZUCKERMAN: The solution isn't machine translation just getting better or human translators just getting more pervasive. The solution is some combination of the two.

PHILLIPS: Blogger Ethan Zuckerman.

Mr. ZUCKERMAN: We're doing a much, much better job of figuring out how to organize these communities of human translators, take advantage of people's willingness to volunteer, willingness to work for small sums of money through a system like Mechanical Turk.

PHILLIPS: A new website called Meedan is an example of this kind of project. It translates stories about the Arab world from both English and Arab media. The short posts on its home page are always translated by a person.

Mr. ED BICE (Founder, Meedan): The idea is a Wikipedia-style approach to translation.

PHILLIPS: Meedan founder Ed Bice.

Mr. BICE: Any registered translator on our system, and we now have about 1,000 people who are capable of generating translations on Meedan, can contribute to improving a translation.

PHILLIPS: But Meedan's signature innovation is how it presents the translations. The usual way is to put a toggle button at the top of the webpage where you choose your language. Click on English, and all the Arabic on the page disappears. This is how Google Translate works. But Ed Bice says Meedan puts both languages side by side.

Mr. BICE: Page right is Arabic, which conveniently it's a right-to-left language, and we have page left English. And kind of having that visual cue, you can actually see this cross-language conversation happening on the website.

PHILLIPS: You immediately see a ping-pong back and forth between the two languages. On Meedan, a two-sentence story about a Syrian man throwing his shoe at the Turkish prime minister generated 26 comments. The first two were originally in English, then two from Arabic, then five that were originally in English and so on.

At first, the two languages have separate conversations. But as the thread continues, the conversations merge to create a cross-cultural discussion about the meaning of shoe-throwing.

Mr. BICE: There are two narratives that describe almost any emerging situation or policy decision regarding the U.S. and the Middle East, and the most obvious dividing line for those narratives is linguistic.

PHILLIPS: So if sites like Meedan put these two narratives side by side and translate them, will a third narrative emerge? Will our differences dissolve and world peace reign? Maybe in the long run, says Ethan Zuckerman.

Mr. ZUCKERMAN: In the short run, actually it could be very, very difficult. When you can read what people say in their own languages, it's often a lot less diplomatic and a lot more nationalistic.

As we get better and better and better at translating, I think what it's really going to do is force us to address each other's preconceptions, prejudices, biases, but unless we can actually hear what people are saying, it's very hard to start on that process.

PHILLIPS: The process has begun. But without better translation tools, supporters of a unified Internet worry about separate Internets that encourage entirely different narratives and ways of seeing the world.

For NPR News, I'm Mark Phillips.

BLOCK: And you can hear a longer version of Mark's story at

(Soundbite of music)

BLOCK: This is NPR.

Copyright © 2010 NPR. All rights reserved. Visit our website terms of use and permissions pages for further information.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.