CLEAR (Computational Language and EducAtion Research)

Linguistic Annotation

In PropBank, we identify the arguments of predicates (e.g. verbs, eventive nouns) and label them with semantic roles that show their relationship to the predicate. Semantic arguments are labeled on a verb-by-verb basis: each verb has its own frame file, which defines verb-specific semantic roles that account for each subcategorization frame of the verb. Supervised systems trained on PropBank's semantic roles have yielded good results for shallow semantic analysis (see CoNLL 2005 and 2008). PropBank currently includes four language projects: English, Chinese, Hindi/Urdu, and Arabic.
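As a rough illustration of verb-specific roles, the sketch below models one roleset and uses it to label the arguments of a single sentence. The `Roleset` class, the `label_arguments` helper, and the role descriptions are assumptions made for this example; they are not the actual frame file format or the data model of our annotation tools.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Roleset:
    """One sense of a predicate with its numbered, verb-specific roles."""
    roleset_id: str
    roles: Dict[str, str]  # role label -> verb-specific description


# Illustrative roleset for "give"; the descriptions are assumptions for this sketch.
GIVE_01 = Roleset(
    roleset_id="give.01",
    roles={"ARG0": "giver", "ARG1": "thing given", "ARG2": "entity given to"},
)


def label_arguments(
    roleset: Roleset, spans: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Pair each (role, text) span with the roleset's description of that role."""
    labeled = []
    for role, text in spans:
        if role not in roleset.roles:
            raise ValueError(f"{role} is not defined for {roleset.roleset_id}")
        labeled.append((role, roleset.roles[role], text))
    return labeled


# "John gave Mary a book."
annotation = label_arguments(
    GIVE_01,
    [("ARG0", "John"), ("ARG2", "Mary"), ("ARG1", "a book")],
)
for role, description, text in annotation:
    print(f"{role} ({description}): {text}")
```

Because the role descriptions live with the roleset rather than with a global label set, the same label (for example ARG2) can mean something different for another verb.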

We currently have two annotation tools that have been used at several universities: Jubilee, a PropBank annotation tool, and Cornerstone, a PropBank frame file editor. Both tools are available through Google Code as open source projects.

English PropBank Project   Funded by GALE, NIH, and HHS
Chinese PropBank Project   Funded by GALE
Hindi/Urdu PropBank Project   Funded by the NSF
Arabic PropBank Project   Funded by GALE

Word Sense (OntoNotes Sense Groups):
Funded by GALE and NSF

Word sense ambiguity is a continuing major obstacle to accurate information extraction, summarization and machine translation. While WordNet has been an important resource in this area, the subtle fine-grained sense distinctions in it have not lent themselves to high agreement between human annotators or high automatic tagging performance. Building on results in grouping fine-grained WordNet senses into more coarse-grained senses that led to improved inter-annotator agreement (ITA) and system performance (Palmer et al., 2004; Palmer et al., 2006), we have developed a process for rapid sense inventory creation and annotation that also provides critical links between the grouped word senses and the Omega ontology.
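The effect of sense grouping on agreement can be sketched as follows. This is a minimal example that assumes ITA is measured as simple percent agreement between two annotators; the sense labels, the grouping, and the tags are invented for illustration and are not project data or results.

```python
from typing import Dict, List


def observed_agreement(tags_a: List[str], tags_b: List[str]) -> float:
    """Fraction of instances on which two annotators chose the same label."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotators must tag the same instances")
    if not tags_a:
        raise ValueError("no instances to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def to_groups(tags: List[str], grouping: Dict[str, str]) -> List[str]:
    """Map each fine-grained sense tag to its coarse-grained sense group."""
    return [grouping[tag] for tag in tags]


# Hypothetical fine-grained senses of one word, merged into two groups.
GROUPING = {
    "call.1": "call.A",
    "call.2": "call.A",
    "call.3": "call.B",
    "call.4": "call.B",
}

# Hypothetical tags from two annotators for the same four instances.
annotator_1 = ["call.1", "call.2", "call.3", "call.4"]
annotator_2 = ["call.2", "call.2", "call.3", "call.3"]

fine_ita = observed_agreement(annotator_1, annotator_2)
coarse_ita = observed_agreement(
    to_groups(annotator_1, GROUPING), to_groups(annotator_2, GROUPING)
)
print(f"fine-grained: {fine_ita:.2f}, grouped: {coarse_ita:.2f}")
```

In this toy data the annotators disagree only between senses that fall into the same group, so agreement rises once the tags are mapped to groups. A chance-corrected measure could be substituted for `observed_agreement` without changing the rest of the sketch.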

English Word Sense Annotation Project Funded by GALE

TreeBank:

The first level of OntoNotes analysis will capture the syntactic structure of the text, following the approach taken in the Penn Treebank. The Penn Treebank project, which began in 1989, has produced over three million words of skeletally parsed text from various genres. Among many other uses, the one-million-word corpus of English Wall Street Journal text included in Treebank-2 has fueled widespread and productive research efforts to improve the performance of statistical parsing engines. More recently, treebanking efforts following the same general approach have also been applied to other languages, including Chinese and Arabic.
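Penn Treebank-style parses are commonly written as labeled brackets. The sketch below reads one such bracketed string into nested tuples and recovers its words. The function names and the example sentence are assumptions for illustration, and the sketch assumes every opening bracket is followed by a label, so it does not handle the extra unlabeled outer bracket found in some distributed files.

```python
from typing import List, Tuple, Union

Tree = Tuple[str, list]  # (label, children); children are Trees or word strings


def tokenize(text: str) -> List[str]:
    """Split a bracketed parse into parentheses, labels, and words."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Tree:
    """Parse one bracketed constituent into a (label, children) tuple."""
    pos = 0

    def node() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of a tree in left-to-right order."""
    words: List[str] = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented example sentence, not taken from any corpus.
BRACKETED = "(S (NP (DT The) (NN parser)) (VP (VBZ works)))"
tree = parse(tokenize(BRACKETED))
print(tree[0], leaves(tree))
```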

The Penn treebanking approach has been ported to Colorado, where we have recently finished treebanking Bioinformatics journal papers and are currently treebanking clinical notes for the Medical Informatics projects.

Clinical annotation (SHARP and THYME):

Incorporating the findings of the above efforts, the SHARP and THYME projects are developing semantic annotations in the clinical domain for materials such as radiology and pathology notes. The following annotation guidelines are being developed in these projects:
Syntactic tree (TreeBank) annotation guidelines
Semantic role (PropBank) annotation guidelines
Unified Medical Language System (UMLS) entity annotation guidelines
Clinical coreference annotation guidelines