CLEAR (Computational Language and EducAtion Research)

Welcome

Our goal is richer and more accurate representations of utterances in English, Chinese, Hindi/Urdu, and Arabic. Our principal approach applies supervised machine learning to linguistically annotated data. There are several different layers of annotation, and correspondingly several individual NLP components, many of which are trained on a single layer. We begin by describing several end-to-end systems we are building that incorporate these components, then describe the individual components. Next, we describe the lexical resources that inform the linguistic annotation, and then the individual layers of annotation and the domains and genres they have been applied to. Finally, we describe ClearTK, the CLEAR NLP Toolkit, which is used by some of the end-to-end systems.

Corpora

Corpora produced and used by the Computational Semantics group.

End-to-End Systems:

Our end-to-end systems include Question Answering, Machine Translation and Information Extraction, for application areas that include Medical Informatics, Crisis Informatics and general news.

NLP Components:

We have state-of-the-art components for parsing, semantic role labeling, sense tagging and relation extraction.

Lexical Resources:

Our sense tagging and semantic role labeling annotation is informed by lexical resources which include PropBank Frame Files for English, Arabic and Hindi, as well as English VerbNet.

Annotation:

We have annotated corpora in several genres (general news, conversational speech, clinical notes, tweets, etc.) with layers that can include POS tags, syntactic structure, semantic roles, sense tags, entity tags, relations, coreference and temporal relations.

ClearTK:

We have an open-source NLP toolkit that is UIMA-compliant.

CompSemWiki:

We maintain a wiki page for the Computational Semantics group at CU Boulder.