CLEAR (Computational Language and EducAtion Research)

Linguistic Annotation

TreeBank

The first level of OntoNotes analysis will capture the syntactic structure of the text, following the approach taken in the Penn Treebank:

The Penn Treebank project, which began in 1989, has produced over three million words of skeletally parsed text from various genres. Among many other uses, the one million word corpus of English Wall Street Journal text included in Treebank-2 has fueled widespread and productive research efforts to improve the performance of statistical parsing engines. Treebanking efforts following the same general approach have also more recently been applied to other languages, including Chinese and Arabic.

While statistical parsers have often been evaluated on a reduced version of the Penn Treebank's structure, the OntoNotes goal of capturing literal semantics provides exactly the kind of context for which the full version of Treebank was initially designed. The function tags and trace information that are part of a full Treebank analysis will provide a crucial first step toward the OntoNotes analysis.
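To make the difference between the full and the reduced structure concrete, the following is a minimal sketch of how function tags and trace nodes might be stripped from a bracketed Treebank-style tree. The function names (`parse_tree`, `reduce_tree`, `strip_label`, `to_string`) and the example sentence are hypothetical, chosen for illustration only. This is not the preprocessing used by any particular parser or by the OntoNotes project.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def strip_label(label):
    """Drop function tags and indices: NP-SBJ-1 becomes NP. Labels like -NONE- are kept."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def reduce_tree(tree):
    """Return the reduced tree, or None if the subtree contains only trace material."""
    label, children = tree
    if label == "-NONE-":             # empty element (trace): remove it
        return None
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)
        else:
            reduced = reduce_tree(child)
            if reduced is not None:
                kept.append(reduced)
    if not kept:                      # constituent dominated only traces
        return None
    return (strip_label(label), kept)

def to_string(tree):
    """Serialize a (label, children) tree back to bracketed form."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(to_string(c) for c in children) + ")"

# Illustrative full analysis: the trace *-1 records that "John" is also
# the understood subject of "to leave".
full = ("(S (NP-SBJ-1 (NNP John)) (VP (VBD tried) "
        "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB leave))))))")
reduced = to_string(reduce_tree(parse_tree(full)))
print(reduced)
# (S (NP (NNP John)) (VP (VBD tried) (S (VP (TO to) (VP (VB leave))))))
```

In the reduced output, the information that "John" is the subject of both verbs is gone, which is the kind of detail the full analysis keeps.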

Within the OntoNotes project, the University of Pennsylvania will provide the Treebank annotation for new genres of English text and will also contribute to improving statistical parsing technology. The University of Colorado and the Linguistic Data Consortium will contribute Treebank data in Chinese and Arabic.

The Chinese Treebank is being developed at the University of Colorado under the supervision of Prof. Martha Palmer.
The English Treebank is being developed at the University of Pennsylvania under the supervision of Prof. Mitchell Marcus.