Can Google Build A Typeface To Support Every Written Language?
Google has taken on its fair share of ambitious projects — digitizing millions and millions of books, mapping the whole world, pioneering self-driving cars. It's a company that doesn't shy away from grand plans.
But one recent effort, despite its rather lofty scope, has escaped much notice. The company is working on a font that aims to include "all the world's languages" — every written language on Earth.
"Tofu" is what the pros call those tiny, empty rectangles that show up when a script isn't supported. This is where Google's new font family, "Noto," gets its name: "No Tofu."
Right now, Noto includes a wide breadth of language scripts from all around the world — specifically, 100 scripts with 100,000 characters. That includes over 600 written languages, says Jungshik Shin, an engineer on Google's text and font team. The first fonts were released in 2012. But this month, Google (in partnership with Adobe) has released a new set of Chinese-Japanese-Korean fonts — the latest in their effort to make the Internet more inclusive.
But as with any product intended to be universal, the implementation gets complicated — and not everyone for whom the product is intended is happy.
'Internationalizing' The Internet
It all started with the Unicode Consortium — a nonprofit for the "internationalization" of the Internet — which kicked off the research into language fonts in 1987. It started work on what was called the Unicode Standard — a "character coding system designed to support the worldwide interchange, processing and display of the written texts of the diverse languages and technical disciplines of the modern world."
Basically, what Unicode was trying to do was create a unique combination of numbers (known as a "code point") for every character, in every written language, ever.
So it kind of makes sense that it took a while. In fact, the wider adoption of this standard for Web browsers happened only around 2008, says Finn Brunton, professor of media, culture and technology at New York University.
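For readers who want to see a code point, here is a minimal Python sketch. It relies only on the language's built-in `ord` function and its standard `unicodedata` module. The four sample characters (a Latin letter, an Arabic letter, an Inuktitut syllabic and a Han character) are illustrative picks, not examples taken from Google or Unicode, and the `describe` helper is a name made up for this sketch.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


# Illustrative samples, written as escapes so each code point is explicit:
# Latin A, Arabic alef, a Canadian syllabic used for Inuktitut, a Han character.
samples = ["A", "\u0627", "\u1403", "\u9aa8"]

for ch in samples:
    print(ch, describe(ch))
```

Each character, whatever its script, comes back as a single number. Whether a screen shows the right shape or an empty "tofu" box then depends on whether the installed fonts have a glyph for that number.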
"It's an enormous shame — kind of a calamity — that we're using systems designed so narrowly around the needs and languages of a very narrow group of people," Brunton says. It's ironic, he adds, because the Internet was conceived as a global network of computers.
Updates to Unicode's standard for languages and characters are made incrementally. What Google is trying with Noto is a large-scale rollout of fonts consistent with the Unicode Standard — so that now people all over the world can make Web pages, apps and URLs, all in their own languages.
Universality Vs. Individuality
But critics like Pakistani-American writer Ali Eteraz are suspicious about grand plans by any of these big companies.
"I tend to go back and forth," Eteraz says. "Is it sort of a benign — possibly even helpful — universalism that Google is bringing to the table? Or is it something like technological imperialism?"
What he means is that when one group of people (in this case, Google) decides what to code for and what not to — and in what way — people who are not a part of that decision-making process, those who actually use these fonts and these languages, can feel ill-served.
It's understandable for linguistic communities to feel like this, Brunton says, because the record of big companies aiming for linguistic diversity on the Web isn't a sparkling one.
One bitter chapter was written when Unicode tried to implement Han unification — an effort to unify characters shared by written Chinese, Japanese and Korean languages into a single "character set." Because the technology wasn't there yet, Unicode ran out of unique code points, which, remember, are required for every language character.
"So they were like, 'Hey, you know, Chinese, Japanese, Korean — they're pretty close. Can we just mash big chunks of them together?'" explains Brunton.
"There's all these different, sort of, approaches, which are fundamentally, obviously reflecting cultural models — cultural biases," Brunton adds. "But when they get substantiated into software, they turn into exclusionary systems."
The Han characters are shared among the three languages but aren't exactly the same, so clubbing them all together as one character set caused discrepancies in spellings in each of these languages.
"Imagine a version of the Wikipedia wars about whether or not to spell color with a 'u,' you know, and then ... turn up the dial to 11," says Brunton.
Technology has since advanced. Now we don't have to use the same character set for these variations. In fact, Noto is the first open-source font to support the different Chinese, Japanese and Korean variations.
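One way to picture how a font family can serve regional variations: a unified Han character keeps a single code point, and the language of the text decides which regional font draws it. The Python sketch below is a simplification under stated assumptions. The family names are assumed to match the regional builds in the Chinese-Japanese-Korean release, the fallback is arbitrary, the character U+76F4 is an illustrative pick often cited as being drawn differently by region, and `pick_font` is a hypothetical helper, not a Google API.

```python
# Assumed mapping from a language tag to a regional font family name.
REGIONAL_FONTS = {
    "ja": "Noto Sans CJK JP",
    "ko": "Noto Sans CJK KR",
    "zh-Hans": "Noto Sans CJK SC",
    "zh-Hant": "Noto Sans CJK TC",
}


def pick_font(lang: str, fallback: str = "Noto Sans CJK SC") -> str:
    """Hypothetical helper: choose a regional font from a language tag."""
    # The fallback is an arbitrary choice made for this sketch.
    return REGIONAL_FONTS.get(lang, fallback)


# One unified Han character: the code point is identical in every language.
shared = "\u76f4"

for lang in ("ja", "zh-Hans"):
    print(f"{lang}: U+{ord(shared):04X} drawn with {pick_font(lang)}")
```

The number stored in the text never changes. Only the font chosen to draw it does, which is why text that carries no language information can still end up with the wrong regional shape.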
The Lowest Common Denominator
The array of language fonts and the detail work in the Noto family is impressive, Brunton says.
Even some very small endangered linguistic communities that are used to not being represented digitally — such as Inuktitut, one of the principal Inuit languages in Canada — are present in the Google font family.
James Crippen is from one such endangered linguistic community. He belongs to one of the Tlingit tribes — tribes of indigenous people from the Pacific Northwest whose language has only 200 native speakers. He is also a linguist who studies language revitalization at the University of British Columbia.
Crippen, who developed a love for languages when his grandmother taught him calligraphy as a child, tries new fonts out on different programs to test them. He's often disappointed. He says using Noto to type Tlingit doesn't pass the test. Tlingit speakers are lucky because the language uses the Latin alphabet, he says, but even then, only some of the Noto Latin fonts support the accents (called "diacritics") that bring out various sounds.
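The diacritics gap Crippen describes can be pictured as a coverage check: a font handles a word only if it has a glyph for every code point in it, accents included. The Python sketch below uses the standard `unicodedata` module to split accented letters into a base letter plus a combining mark. The two coverage sets and the `missing_code_points` helper are hypothetical stand-ins for illustration, not data from any actual Noto font, and real fonts may also carry ready-made accented letters that this simplification ignores.

```python
import unicodedata


def missing_code_points(text: str, coverage: set) -> list:
    """Hypothetical helper: list code points in text that a font lacks."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return sorted({f"U+{ord(ch):04X}" for ch in decomposed if ord(ch) not in coverage})


# Made-up coverage sets: plain ASCII, then ASCII plus two combining marks
# (U+0301 acute accent, U+0331 macron below).
basic_latin = set(range(0x20, 0x7F))
with_marks = basic_latin | {0x0301, 0x0331}

word = "Ling\u00edt"  # the language's name, written with an accented i
print(missing_code_points(word, basic_latin))
print(missing_code_points(word, with_marks))
```

A font limited to the first set would leave the accent unsupported, while the second set covers the whole word.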
He has made his peace with the fact that Tlingit is often neglected.
"You don't count as different from some other groups ... and it's kind of a put-down," he says. "It's frustrating too because everybody is supporting the least common denominator, so, yeah, it kind of stinks."
Aside from fonts for endangered languages like Inuktitut and Tlingit, Noto also includes fonts for scripts that are whimsical and fun, such as the Shavian alphabet, named after Irish playwright George Bernard Shaw.
But, Brunton points out, while all "these wacky, eccentric, utopian projects" are supported, languages such as Oriya — an Indian language spoken by millions of people — are still not.
How The Internet Flattened Urdu
Even when more widely spoken languages are supported, their scripts may not accurately reflect the culture within which they're used. Urdu is one example.
Being from the South, Ali Eteraz loves William Faulkner's work, but as a Pakistani-American, he is also a fan of Mirza Ghalib, whose influence on the Urdu language is often compared to Shakespeare's influence on English. But between Faulkner and Ghalib, Eteraz could only share Faulkner's works online.
The problem is that the nastaliq Urdu used for Ghalib's verses — ornate and calligraphic with distinctive hanging characters — is not supported. So Eteraz and others who want to share poetry written in the gorgeous script have to upload snapshots instead of being able to merely copy and paste. "People even email entire books to each other, in individual images," Eteraz wrote in an October 2013 essay.
"Constantly uploading image files to communicate may be romantic (or it can make you feel like a second-class digital citizen), but it is not practical," he wrote.
The naskh script of the Arabic alphabet is more angular, linear — and incidentally, easier to code — than the nastaliq script. So that's what is currently present in Noto for the Urdu language, even though Persian and Urdu language communities say nastaliq is a more accurate representation.
It's sort of the opposite of the Han unification. There, they were trying to pass off one Unicode character for visually different language characters. Here, although the building blocks that make up the phrase are the same, they are stylistically different — and have to be coded by more than one Unicode character.
So nastaliq and other scripts, such as Tibetan, require extensive research and development. Although representatives from Google say they plan to add these in the future to Noto, they say it will take time.
Eteraz says he has heard this answer before, and for him, it's frustrating. In his essay, he writes about calling up tech companies including Apple, Twitter and Microsoft to inquire whether there was any effort to support nastaliq script. And he's still working on Google.
The Beauty And Agony Of The Internet
Google, with its new font family, is just trying to make the best out of a bad situation, Brunton says.
"Part of the beauty and agony of the Internet is that we're constantly building systems on top of infrastructure and technologies that were never meant to do what they're doing now," he says. But as we speak, programmers are at work building those systems of language that punch through the restrictions.
At Google, "it is a balancing act between these different factions to build our fonts that are hopefully enduring and useful in a broad range of uses," writes Jungshik Shin in an email.
That promise comes with a lot of responsibility, Eteraz says.
"Language is the building block of people's identities all around the world, and Google is basically saying that, 'We got this,' " he says.
"Whether that strikes you as hubris or whether it's noble depends on whether they pull it off."
Correction Aug. 4, 2014
An earlier version of this story mistakenly said that the Unicode Consortium releases fonts. The consortium maintains the standard on which fonts are based.