This project provides several NLP tools such as a dependency parser, a semantic role labeler,
a penn-to-dependency converter, a prop-to-dependency converter, and a morphological analyzer.
All tools are written in Java and developed by the Computational Language and EducAtion Research
(CLEAR) group at the University of Colorado at Boulder.
- Word Sense Disambiguation and Efficient Annotation:
Supervised machine learning is widely used in natural language processing and, based on the extensive
OntoNotes sense-tagged data, we have a state-of-the-art WSD system for English verbs that approaches
human accuracy. Check back to this site soon for a link to a downloadable version.
However, porting this approach to other domains and other languages requires additional annotated training data,
which is expensive to obtain. How does one choose the data for annotation? Random sampling is a common
approach but not the most efficient one. Various types of selective sampling can be used to achieve
the same level of performance as random sampling but with less data. Active learning is one type of
selective sampling, but in many situations it is not practical (e.g. a multi-annotator, double-annotation environment).
Dmitry Dligach's dissertation focuses on developing selective sampling algorithms that are similar in spirit
to active learning but more practical. They utilize his state-of-the-art automatic word sense disambiguation system.
He has also looked into evaluating various popular annotation practices such as single annotation, double annotation,
and batch active learning.
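To make the contrast with random sampling concrete, here is a minimal sketch of margin-based selective sampling, one common form of uncertainty selection. It is illustrative only (the class, method, and instance names are assumptions, not the dissertation's actual algorithm): from a pool of unlabeled instances, it picks for annotation those whose top two sense probabilities are closest, i.e. the ones the current classifier is least sure about.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SelectiveSampler {

    // A candidate instance with the classifier's probability for each sense.
    record Candidate(String id, double[] senseProbs) {
        // Margin = gap between the highest and second-highest probability;
        // a small margin means the classifier is uncertain.
        double margin() {
            double[] p = senseProbs.clone();
            Arrays.sort(p);
            return p[p.length - 1] - p[p.length - 2];
        }
    }

    // Select the k candidates with the smallest margins for annotation.
    static List<String> select(List<Candidate> pool, int k) {
        return pool.stream()
                .sorted(Comparator.comparingDouble(Candidate::margin))
                .limit(k)
                .map(Candidate::id)
                .toList();
    }

    public static void main(String[] args) {
        List<Candidate> pool = List.of(
                new Candidate("call.01", new double[]{0.48, 0.47, 0.05}), // uncertain
                new Candidate("call.02", new double[]{0.90, 0.08, 0.02}), // confident
                new Candidate("call.03", new double[]{0.40, 0.35, 0.25})); // uncertain
        System.out.println(select(pool, 2)); // the two lowest-margin instances
    }
}
```

Under this selection rule, confidently classified instances never reach the annotators, which is how the same performance can be reached with less labeled data than random sampling.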
- VerbNet Class Disambiguator:
Understanding verbs is central to deep semantic parsing, which requires identifying not only a verb's meaning
but also how it connects the participants in the sentence. Disambiguating verbs with a lexicon that has already
been enriched with syntactic and semantic information would bring end systems a step closer to accurate knowledge
representation and reasoning than a more traditional lexicon would. VerbNet has already been shown to be a good resource
for identifying deep semantics, having been used for semantic role labeling (Swier and Stevenson, 2004), the creation
of conceptual graphs (Hensman and Dunion, 2004), and semantic parsing (Shi and Mihalcea, 2005). However, many verbs
are members of multiple VerbNet classes, with each class membership corresponding roughly to different senses of the
verbs. Therefore, application of VerbNet's semantic and syntactic information to specific text requires first identifying
the appropriate VerbNet class of each verb in the text.
Currently in development is the VerbNet Class Disambiguator,
which uses a supervised machine learning approach to classify verb tokens with VerbNet classes. It has been trained and
tested with 30 verbs to date. With this initial sample, it achieves 90% accuracy, which represents a 61% error reduction over
the most-frequent-class baseline. Work is underway to increase its coverage to all the multiclass verbs in the Semlink corpus.
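The two reported figures jointly imply the baseline's accuracy: error reduction is (baselineError − systemError) / baselineError, so with 90% system accuracy and a 61% reduction, the most-frequent-class baseline must sit near 74%. A small sketch of that arithmetic (illustrative only, not part of the toolkit):

```java
public class ErrorReduction {
    // Recover the baseline accuracy implied by a system's accuracy and its
    // reported error reduction over that baseline.
    static double baselineAccuracy(double systemAccuracy, double errorReduction) {
        double systemError = 1.0 - systemAccuracy;
        // From reduction = (baselineError - systemError) / baselineError:
        // systemError = baselineError * (1 - reduction)
        double baselineError = systemError / (1.0 - errorReduction);
        return 1.0 - baselineError;
    }

    public static void main(String[] args) {
        // 0.10 / 0.39 ≈ 0.256 baseline error, i.e. roughly 74% baseline accuracy.
        System.out.println(baselineAccuracy(0.90, 0.61));
    }
}
```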