As part of this year's Linguistics Colloquium Series at Carleton, Math/Computer Science Professor Jeff Ondich spoke about one of his areas of expertise: automatic translation by computer. Ondich's own software company, Ultralingua, provides digital language reference tools, and he explained the inner workings of such programs, focusing on how Babel Fish and Google Translate render a sentence from one language into another.
The main difference between Babel Fish and Google Translate is that Babel Fish uses a rule-based model, while Google uses statistics. Rule-based systems are “built by hand. Lots of Babel Fish is based on intensive labor and expert knowledge,” says Ondich. He also notes that “the devil is in the details.” For example, to translate the word “bright” into Hungarian, the program must disambiguate between the sense of light and the sense of intelligence, and that choice has to be built into the rules by hand.
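To make the rule-based idea concrete, here is a minimal sketch of hand-written lexical disambiguation for “bright.” The context cues and the Hungarian glosses (“fényes” for the light sense, “okos” for the intelligence sense) are illustrative assumptions for this sketch, not output from Babel Fish itself.

```python
# Illustrative rule-based lexical choice: pick a Hungarian gloss for "bright"
# by looking for hand-listed context cues. The cue lists and glosses are
# assumptions made for this sketch.

LIGHT_CUES = {"light", "lamp", "sun", "star", "room", "color"}
MIND_CUES = {"student", "child", "person", "idea", "scientist"}

def translate_bright(sentence: str) -> str:
    """Choose a gloss for 'bright' based on nearby context words."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    if words & MIND_CUES:
        return "okos"      # bright as in intelligent
    if words & LIGHT_CUES:
        return "fényes"    # bright as in giving off light
    return "fényes"        # default sense when no cue is found

print(translate_bright("She is a bright student"))       # okos
print(translate_bright("The bright lamp hurt my eyes"))  # fényes
```

Every such distinction has to be anticipated and encoded by a human, which is why Ondich describes the approach as labor-intensive.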
Google's statistical model instead uses a vast collection of monolingual texts as well as parallel texts, such as UN and EU documents, which present the same content in many languages. Using conditional probability to estimate how likely a word is to appear given the word before it, Google Translate “computes based on all the digitized English we have available.”
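The core statistical idea can be shown in a few lines: count how often each word follows another and turn the counts into conditional probabilities. The tiny “corpus” below is invented for illustration; a real system estimates these probabilities from the enormous text collections Ondich described.

```python
# Toy bigram model: estimate P(current word | previous word) from counts.
from collections import Counter, defaultdict

corpus = "i want to eat lunch . i want to eat dinner . i want chinese food .".split()

bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev: str, curr: str) -> float:
    """Maximum-likelihood estimate of P(curr | prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("i", "want"))   # 1.0 -- "want" always follows "i" in this corpus
print(bigram_prob("to", "eat"))   # 1.0
print(bigram_prob("i", "lunch"))  # 0.0 -- never observed, though grammatical
```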
One example Ondich showed was a table of word-pairing probabilities from an automated restaurant-search telephone service. Pairings, also called bigrams (as opposed to unigrams, or single words), like “I want” and “to eat” appeared very often, while others such as “I to” and “I food” never occurred. Certain bigrams, like “I lunch,” also never appeared even though they are grammatical in English. These gaps can throw off language translation tools, which, from such limited data, would conclude that the bigram is impossible. “Data sparseness leads to the need for ‘smoothing,’… which is a big area of research that has been going on for many years,” to figure out how to account for such missing data.
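One simple smoothing scheme is add-one (Laplace) smoothing, which reserves a little probability mass for unseen but plausible bigrams like “I lunch.” The sketch below is just one of many smoothing methods in the research area Ondich referred to, not necessarily the one used in any particular system, and it reuses the invented toy corpus from above.

```python
# Add-one (Laplace) smoothing: every possible bigram gets a pseudo-count of 1,
# so unseen pairs like ("i", "lunch") receive a small nonzero probability.
from collections import Counter, defaultdict

corpus = "i want to eat lunch . i want to eat dinner . i want chinese food .".split()
vocab = set(corpus)

bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def smoothed_prob(prev: str, curr: str) -> float:
    """Add-one estimate of P(curr | prev)."""
    total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][curr] + 1) / (total + len(vocab))

print(smoothed_prob("i", "want"))   # about 0.33 -- seen often
print(smoothed_prob("i", "lunch"))  # about 0.08 -- never seen, but no longer zero
```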
Other challenges involve the parallel texts themselves, such as government documents that provide the same content in multiple languages for the translation tool to draw on. “A big issue of parallel texts is getting them aligned… it’s a tremendous amount of work,” with many researchers studying how to improve the methodology, says Ondich. The source of a tool’s data is also a significant factor. For instance, a speech-to-text tool Ondich encountered could not understand an everyday sentence like “I walked home for dinner,” but it handled stock-market jargon with ease; its data came mainly from financial newspapers.
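A rough sketch of what “getting them aligned” can involve: pairing up sentences in two versions of a document before any translation statistics are computed. The code below is a simplified stand-in for length-based alignment algorithms such as Gale-Church, using only sentence lengths and a made-up pair of example sentences; real alignment systems are considerably more sophisticated.

```python
# Simplified length-based sentence alignment via dynamic programming.
# Allows 1-1 matches plus skipping a sentence on either side.

def align(src, tgt, skip_penalty=25):
    """Return a list of (src_index or None, tgt_index or None) pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match: penalize length mismatch
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:            # leave a source sentence unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip_src")
            if j < m:            # leave a target sentence unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip_tgt")
    # Trace back from the end to recover the alignment.
    pairs, i, j = [], n, m
    while i or j:
        pi, pj, move = back[i][j]
        if move == "match":
            pairs.append((pi, pj))
        elif move == "skip_src":
            pairs.append((pi, None))
        else:
            pairs.append((None, pj))
        i, j = pi, pj
    return list(reversed(pairs))

english = ["The committee met on Tuesday.", "It approved the budget."]
french = ["Le comité s'est réuni mardi.", "Il a approuvé le budget."]
print(align(english, french))  # [(0, 0), (1, 1)]
```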
Putting users to good use is one of Google Translate’s innovations. Ondich remarks that Google has “collected an immense amount of data, and have lots of monolingual data,” but it still needs parallel texts that pair up languages. Intelligently, Google Translate lets visitors log suggested translations, and “now [Google] gets parallel texts from users.”
“Usability is a big deal, with lots of money and attention to not just translation but its context,” says Ondich. In a user-driven market, convenience is key. Ondich raised further questions, such as “how are these approaches to translation different from what our brains do?”, which open up diverse lines of research and scholarship. As he commented during his talk, machine translation and the interplay between software and language processing is “complicated” and offers many avenues for innovation.