CLEAR (Computational Language and EducAtion Research)

Lexical Resources

PropBank

The propositional level of analysis is layered on top of the parse trees and identifies predicate constituents and their arguments in OntoNotes. This level of analysis is supplied by PropBank, which is described below.
Robust syntactic parsers, made possible by new statistical techniques (Ratnaparkhi, 1997; Collins, 1999; Collins, 2000; Bangalore and Joshi, 1999; Charniak, 2000) and by the availability of large, hand-annotated training corpora (Marcus, Santorini, and Marcinkiewicz, 1993; Abeille, 2003), have had a major impact on the field of natural language processing in recent years. However, the syntactic analyses produced by these parsers are a long way from representing the full meaning of the sentence. As a simple example, in the sentences:

John broke the window.
The window broke.

A syntactic analysis will represent the window as the verb's direct object in the first sentence and its subject in the second, but does not indicate that it plays the same underlying semantic role in both cases. Note that both sentences are in the active voice, and that this alternation between transitive and intransitive uses of the verb does not always occur. For example, in the sentences:

The sergeant played taps.
The sergeant played.

The subject has the same semantic role in both uses. The same verb can also undergo syntactic alternation, as in:

Taps played quietly in the background.

and even in transitive uses, the role of the verb's direct object can differ:

The sergeant played taps.
The sergeant played a beat-up old bugle.

Alternation in the syntactic realization of semantic arguments is widespread, affecting most English verbs in some way, and the patterns exhibited by specific verbs vary widely (Levin, 1993). The syntactic annotation of the Penn Treebank makes it possible to identify the subjects and objects of verbs in sentences such as the above examples. While the Treebank provides semantic function tags such as temporal and locative for certain constituents (generally syntactic adjuncts), it does not distinguish the different roles played by a verb's grammatical subject or object in the above examples. Because the same verb used with the same syntactic subcategorization can assign different semantic roles, roles cannot be deterministically added to the Treebank by an automatic conversion process with 100% accuracy. Our semantic role annotation process begins with a rule-based automatic tagger, the output of which is then hand-corrected (see Section 4 for details).
The Proposition Bank aims to provide a broad-coverage, hand-annotated corpus of such phenomena, enabling the development of better domain-independent language understanding systems and the quantitative study of how and why these syntactic alternations take place. We define a set of underlying semantic roles for each verb, and annotate each occurrence in the text of the original Penn Treebank. Each verb's roles are numbered, as in the following occurrences of the verb offer from our data:

... [Arg0 the company] to ... offer [Arg1 a 15% to 20% stake] [Arg2 to the public]. (wsj 0345)
... [Arg0 Sotheby's] ... offered [Arg2 the Dorrance heirs] [Arg1 a money-back guarantee] (wsj 1928)
... [Arg1 an amendment] offered [Arg0 by Rep. Peter DeFazio] ... (wsj 0107)
... [Arg2 Subcontractors] will be offered [Arg1 a settlement] ... (wsj 0187)
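
To make the numbered-role notation concrete, the following is a minimal Python sketch that reads bracketed examples like those above and collects the text filling each role. It is an illustration only, not PropBank's file format or tooling: the names ROLE_PATTERN and extract_roles are hypothetical, and the bracket syntax is assumed to be exactly the inline notation used in the examples (an optional underscore before the label is tolerated).

```python
import re

# Hypothetical pattern for one bracketed argument such as "[Arg1 a settlement]".
# The optional underscore also accepts the "[_Arg1 ...]" spelling.
ROLE_PATTERN = re.compile(r"\[_?(Arg\d+)\s+([^\]]+)\]")


def extract_roles(annotated):
    """Return a dict mapping each numbered role label to the text it brackets."""
    return {label: span.strip() for label, span in ROLE_PATTERN.findall(annotated)}


# The four occurrences of "offer" quoted in the text above.
examples = [
    "[Arg0 the company] to ... offer [Arg1 a 15% to 20% stake] [Arg2 to the public]",
    "[Arg0 Sotheby's] ... offered [Arg2 the Dorrance heirs] [Arg1 a money-back guarantee]",
    "[Arg1 an amendment] offered [Arg0 by Rep. Peter DeFazio]",
    "[Arg2 Subcontractors] will be offered [Arg1 a settlement]",
]

for sentence in examples:
    # Each sentence yields the same role labels regardless of word order.
    print(extract_roles(sentence))
```

Grouping by label rather than by position shows the point of the annotation: Arg1 (the thing offered) is found after the verb in the first two sentences and before it in the third, yet it carries the same label throughout.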

We believe that providing this level of semantic representation is important for applications including information extraction, question answering, and machine translation. Over the past decade, most work in the field of information extraction has shifted from complex rule-based systems designed to handle a wide variety of semantic phenomena including quantification, anaphora, aspect and modality (e.g. Alshawi (1992)), to more robust finite-state or statistical systems (Hobbs et al., 1997; Miller et al., 1998).
These newer systems rely on a shallower level of semantic representation, similar to the level we adopt for the Proposition Bank, but have also tended to be very domain specific. The systems are trained and evaluated on corpora annotated for semantic relations pertaining to, for example, corporate acquisitions or terrorist events. The Proposition Bank (PropBank) takes a similar approach in that we annotate predicates' semantic roles, while steering clear of the issues involved in quantification and discourse-level structure. By annotating semantic roles for every verb in our corpus, we provide a more domain-independent resource, which we hope will lead to more robust and broad-coverage natural language understanding systems.
The Proposition Bank focuses on the argument structure of verbs, and provides a complete corpus annotated with semantic roles, including roles traditionally viewed as arguments and as adjuncts. The Proposition Bank allows us for the first time to determine the frequency of syntactic variations in practice, the problems they pose for natural language understanding, and the strategies to which they may be susceptible.

Arabic

The Arabic PropBank frame files and guidelines are available.

Hindi

The Hindi PropBank is being developed at the University of Colorado, under the supervision of Prof. Martha Palmer and Prof. Bhuvana Narasimhan.

Chinese

The Chinese PropBank has moved to Brandeis University, under the supervision of Prof. Nianwen Xue.

English

The English PropBank is being developed at the University of Colorado under Prof. Martha Palmer's supervision.