NLP Components


This project provides several NLP tools, including a dependency parser, a semantic role labeler, a Penn-to-dependency converter, a prop-to-dependency converter, and a morphological analyzer. All tools are written in Java and developed by the Computational Language and EducAtion Research (CLEAR) group at the University of Colorado at Boulder.

Word Sense Disambiguation and Efficient Annotation:

Supervised machine learning is widely used in natural language processing. Based on the extensive OntoNotes sense-tagged data, we have built a state-of-the-art WSD system for English verbs that approaches human accuracy. Check back to this site soon for a link to a downloadable version.

However, porting this approach to other domains and other languages requires additional annotated training data, which is expensive to obtain. How does one choose the data for annotation? Random sampling is a common approach but not the most efficient one. Various types of selective sampling can achieve the same level of performance as random sampling with less data. Active learning is one type of selective sampling, but in many situations it is not practical (e.g., in a multi-annotator, double-annotation environment). Dmitry Dligach's dissertation focuses on developing selective sampling algorithms that are similar in spirit to active learning but more practical; they utilize his state-of-the-art automatic word sense disambiguation system. He has also evaluated various popular annotation practices such as single annotation, double annotation, and batch active learning.
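The core idea behind uncertainty-based selective sampling can be sketched briefly. The following is a minimal illustration, not code from the dissertation: it assumes a classifier has already produced a posterior distribution over senses for each unlabeled instance, and picks the instances whose top-sense probability is lowest (the least-confidence criterion), since those are the ones most worth sending to annotators.

```java
import java.util.Arrays;
import java.util.Comparator;

public class UncertaintySampling {

    // Select the k instances whose highest sense probability is lowest
    // (least-confidence selective sampling). probs[i] is a hypothetical
    // posterior distribution over senses for unlabeled instance i.
    static int[] selectLeastConfident(double[][] probs, int k) {
        Integer[] idx = new Integer[probs.length];
        for (int i = 0; i < probs.length; i++) idx[i] = i;
        // Sort indices by the classifier's confidence, ascending.
        Arrays.sort(idx, Comparator.comparingDouble(i -> maxProb(probs[i])));
        int[] out = new int[k];
        for (int i = 0; i < k; i++) out[i] = idx[i];
        return out;
    }

    static double maxProb(double[] dist) {
        double m = 0.0;
        for (double p : dist) m = Math.max(m, p);
        return m;
    }

    public static void main(String[] args) {
        // Made-up posteriors over two senses for four instances.
        double[][] probs = {
            {0.90, 0.10},  // confident
            {0.55, 0.45},  // uncertain
            {0.80, 0.20},
            {0.50, 0.50}   // most uncertain
        };
        // Instances 3 and 1 are least confident, so they are selected.
        System.out.println(Arrays.toString(selectLeastConfident(probs, 2)));
    }
}
```

Random sampling would instead draw instances uniformly; the point of selective sampling is that spending annotation effort on low-confidence instances tends to reach a given accuracy with fewer labeled examples.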

VerbNet Class Disambiguator:

Understanding verbs is central to deep semantic parsing, requiring the identification of not only a verb's meaning but also how it connects the participants in the sentence. Disambiguating verbs using a lexicon that has already been enriched with syntactic and semantic information would bring end systems a step closer to accurate knowledge representation and reasoning than a more traditional lexicon. VerbNet has already been shown to be a good resource for identifying deep semantics, having been used for semantic role labeling (Swier and Stevenson, 2004), the creation of conceptual graphs (Hensman and Dunion, 2004), and semantic parsing (Shi and Mihalcea, 2005). However, many verbs are members of multiple VerbNet classes, with each class membership corresponding roughly to different senses of the verbs. Therefore, application of VerbNet's semantic and syntactic information to specific text requires first identifying the appropriate VerbNet class of each verb in the text.

Currently in development is the VerbNet Class Disambiguator, which uses a supervised machine learning approach to classify verb tokens with VerbNet classes. It has been trained and tested with 30 verbs to date. With this initial sample, it achieves 90% accuracy, which represents a 61% error reduction over the most-frequent-class baseline. Work is underway to increase its coverage to all the multiclass verbs in the SemLink corpus.
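For readers unfamiliar with the error-reduction metric, the reported figures pin down the baseline: a relative error reduction of 61% at 90% system accuracy implies a most-frequent-class baseline of roughly 74% accuracy. A small sketch of the arithmetic (the baseline value is derived here, not quoted from the project):

```java
public class ErrorReduction {

    // Relative error reduction of a system over a baseline,
    // computed from their accuracies.
    static double errorReduction(double baselineAcc, double systemAcc) {
        double baselineErr = 1.0 - baselineAcc;
        double systemErr = 1.0 - systemAcc;
        return (baselineErr - systemErr) / baselineErr;
    }

    public static void main(String[] args) {
        // Solving 0.61 = (e_b - 0.10) / e_b gives baseline error
        // e_b = 0.10 / 0.39, i.e. baseline accuracy ~ 0.744.
        double baselineAcc = 1.0 - 0.10 / 0.39;
        System.out.printf("implied baseline accuracy: %.3f%n", baselineAcc);
        System.out.printf("error reduction: %.2f%n",
                errorReduction(baselineAcc, 0.90));
    }
}
```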