CLEAR (Computational Language and EducAtion Research)

End-to-End Systems

STAGES - Machine Translation (MT)

***STAGES is a joint machine translation project involving Brandeis University, Columbia University, the Information Sciences Institute at the University of Southern California, the University of Colorado at Boulder, and the University of Rochester. It is funded by the NSF.***

Statistical machine translation (MT) systems have improved greatly in the past several years and have reached a point where they are widely used, at least for getting the gist of foreign-language documents and web pages. However, reading the output of even the best Chinese-English machine translation systems remains a painful experience. Furthermore, current systems perform well only on the type of text on which they have been trained (most often newswire text), and they require very large amounts of text from that domain. This project proposes a new approach to MT that features four key departures from state-of-the-art approaches:

• Semantically based analysis: We infuse semantic analysis into all stages of our proposed MT system, abstracting away from the surface representation of source and target languages in order to provide a semantically based representation that can guide the translation process.

• Semantic statistical processing: Statistical processing is trained using semantic structures in addition to syntactic ones and produces a novel form of output representing a range of decisions, from phrases where possible to semantic concepts where not.

• Multiple pathways for translation: We experiment with several semantic-statistical approaches to MT, along with a concept-to-concept alignment between source and target languages.

• Language generation: Our system uses a language generation component to fuse the output of the different translation paths, using text-to-text generation along with linguistic realization to produce the target language.

Our new approach to MT, combining semantic analysis, new forms of statistical MT, and language generation, allows us to handle fundamental differences in how Chinese and English encode information. Our research addresses differences in the Chinese and English realization of tense, grammatical function words, constituent ordering (particularly when long-distance dependencies are involved), and discourse relations. At each level of processing in this system, STAGES (Statistical Translation And GEneration using Semantics), we develop novel methods to handle these problems.