----------------------------------------------------
FILE: README.txt
DATE: 2010-11-17
----------------------------------------------------

(c) Copyright 2009-2010, J.D. Power and Associates, 
All rights reserved, no re-distribution.

This is the J.D. Power and Associates mention, co-reference, meronymy,
and sentiment corpus.

Cite this corpus as:

Jason S. Kessler, Miriam Eckert, Lyndsie Clark, and Nicolas Nicolov. The 2010 ICWSM JDPA Sentiment Corpus for the Automotive Domain. In the 4th International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW 2010), 2010. Washington, D.C.

@inproceedings{KesslerEtAl2010,
  author = {Jason S. Kessler and Miriam Eckert and Lyndsie Clark and Nicolas Nicolov},
  title = {The 2010 ICWSM JDPA Sentiment Corpus for the Automotive Domain},
  booktitle = {4th International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW 2010)},
  year = {2010},
  url = {http://www.cs.indiana.edu/\~{}jaskessl/icwsm10.pdf}
}


====================================================
=============== OVERVIEW ===========================
====================================================

The JDPA Corpus consists of user-generated content (blog posts)
containing opinions about automobiles and digital cameras.  They have
been manually annotated for named, nominal, and pronominal mentions of
entities.  Entities are marked with the aggregate sentiment expressed
toward them in the document.  Mentions of each entity are marked as
co-referential.  Mentions are assigned semantic types consisting of
the Automatic Content Extraction (ACE) mention types and additional
domain-specific types.  Meronymy (part-of and feature-of) and instance
relations are also annotated.  Expressions which convey sentiment
toward an entity are annotated with the polarity of their prior and
contextual sentiments as well as the mentions they target.  The
following modifiers, which may target sentiment expressions or other
modifiers, are also annotated:

 - negators (expressions which invert the polarity of a sentiment
   expression or modifier)
 - neutralizers (expressions that do not commit the speaker to the
   truth of the target sentiment expression or modifier)
 - committers (expressions which shift the commitment of the speaker
   toward the truth of a sentiment expression or modifier)
 - intensifiers (expressions which shift the intensity of a sentiment
   expression or modifier)

Additionally, we have annotated when the opinion holder of a sentiment
expression is someone other than the author of the blog by linking the
expression to the holder.  We also annotate when two entities are
compared on a particular dimension.

The data, organized into training and testing sets, consists of 515
documents (blog posts) covering 330,762 tokens which make up 19,322
sentences.  87,532 mentions and 15,637 sentiment expressions are
annotated.

====================================================
=============== DIRECTORY STRUCTURE ================
====================================================

doc/JDPA-Sentiment-Corpus-Annotation-Guidelines-ver-2009-12-17.pdf
    Description of the annotations.

doc/JDPA-Sentiment-Corpus-Licence-ver-2009-12-17.doc
    The licence in MS-Word format.

doc/README.txt
    This file.

Below, DOMAIN may be "car" or "camera".

    The annotation files in XML format are in:
    $DOMAIN/batch*/annotation/*.xml

    The corresponding text files are in:
    $DOMAIN/batch*/txt/*.txt

    Some files have accompanying metadata, which includes the URL of
    the file's text. 
    $DOMAIN/batch*/meta/*-meta.xml
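
As a rough sketch of how this layout might be walked, the Python below
pairs each annotation file with its text file.  It assumes (this is not
stated above) that the text file sits in the sibling txt/ directory and
is named by the "textSource" attribute of the <annotations> root tag.
The function name pair_annotation_and_text and the corpus_root argument
are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def pair_annotation_and_text(corpus_root, domain):
    """Yield (annotation_path, text_path) pairs for one domain.

    Assumption: the text file lives in the sibling txt/ directory and is
    named by the textSource attribute of the <annotations> root tag.
    """
    root = Path(corpus_root)
    pattern = f"{domain}/batch*/annotation/*.xml"
    for xml_path in sorted(root.glob(pattern)):
        text_source = ET.parse(str(xml_path)).getroot().get("textSource")
        if text_source is None:
            # No textSource attribute, so the text file cannot be located.
            continue
        # $DOMAIN/batchNNN/annotation/x.xml -> $DOMAIN/batchNNN/txt/<textSource>
        yield xml_path, xml_path.parent.parent / "txt" / text_source
```

Callers may still want to check that the returned text path exists
before opening it, since the naming assumption may not hold for every
file.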

====================================================
=============== FILE STRUCTURE =====================
====================================================

The XML files provide stand-off annotations for their corresponding
text files.  The scheme, which follows, is based on the XML format
used by the Protege plug-in Knowtator
(http://knowtator.sourceforge.net/).

Each annotation is described by two or more tags inside the
<annotations> tag.

The first tag is <annotation>.  Its <mention> subtag gives the id of
the annotation in its "id" attribute.  Next is the <annotator> subtag,
giving an anonymized annotator's id and pseudonym.  <span> specifies
the start and end byte-offsets of the annotation, while <spannedText>
contains the text covered by the annotation.  <spannedText> is
optional and may omit some leading, trailing, or repeated whitespace.
See the <annotation> tag in the example below.
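
Because <spannedText> may differ from the source text in whitespace, a
consistency check has to compare the two loosely.  The sketch below
slices the raw bytes of a text file by the <span> offsets and compares
the result with <spannedText> after collapsing whitespace.  The UTF-8
default is an assumption, as this file does not state the encoding of
the text files, and the function name span_matches is hypothetical.

```python
def span_matches(raw_bytes, start, end, spanned_text, encoding="utf-8"):
    """Check a <span> against its <spannedText>, ignoring whitespace layout.

    raw_bytes is the text file read in binary mode; start and end are the
    byte offsets from the <span> tag.  The encoding is an assumption.
    """
    covered = raw_bytes[start:end].decode(encoding, errors="replace")
    # Collapse whitespace runs and trim both ends before comparing.
    return " ".join(covered.split()) == " ".join(spanned_text.split())
```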

The second tag is <classMention>, linked to the annotation tag's id by
the "id" attribute.  The only required subtag is <mentionClass>, whose
content and "id" attribute are the semantic type of the annotation.  A
<classMention> tag may have zero or more <hasSlotMention> subtags.
Each of these corresponds to a property of the annotation, detailed in
either a <stringSlotMention> tag or a <complexSlotMention> tag.  The
*SlotMention tags are linked via the "id" attribute in
<hasSlotMention>.

<stringSlotMention> is used for slots that have properties which are
nominal, numeric or textual.  The slot's name is in the "id" attribute
of the subtag <mentionSlot> while the value of the slot is in the
"value" attribute of the <stringSlotMentionValue> subtag.

Some slots are used to refer to other annotations.  These "complex"
slots are specified through the <complexSlotMention> tag.  Like
<stringSlotMention>, this tag requires the <mentionSlot> subtag, whose
"id" attribute specifies the name of the slot.  However, its value is
specified through the "value" attribute of the
<complexSlotMentionValue> subtag.  The value is always the id of the
annotation the slot refers to.  Some <complexSlotMention> tags have
multiple <complexSlotMentionValue> subtags, each containing an
annotation id.

Here is an example:

<annotations textSource="car-001-xxx.txt">

...

<annotation>
  <mention id="car-001--xxx-20755" /> 
  <annotator id="A3">Annotator 3</annotator> 
  <span start="0" end="6" /> 
  <spannedText>Nissan</spannedText> 
</annotation>
	
<classMention id="car-001--xxx-20755">
  <mentionClass id="Mention.Organization">Mention.Organization</mentionClass> 
  <hasSlotMention id="car-001--xxx-20759" />
  <hasSlotMention id="car-001--xxx-21156" />
</classMention>

<stringSlotMention id="car-001--xxx-20759">
  <mentionSlot id="EMLevel" /> 
  <stringSlotMentionValue value="Named" /> 
</stringSlotMention>

<complexSlotMention id="car-001--xxx-21156">
  <mentionSlot id="RefersTo" /> 
  <complexSlotMentionValue value="car-001--xxx-21145" /> 
</complexSlotMention>

...

</annotations>



The semantic types of the annotations and their slots (in the XML
files in the annotation/ directories) are explained in the annotation
guidelines.
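
The linking scheme described in this section can be sketched in Python
with the standard xml.etree.ElementTree module.  This is only an
illustration under stated assumptions: each <hasSlotMention> id is
assumed to match the "id" of a slot tag exactly, only the first <span>
of an annotation is read, and the function name load_annotations and
the record layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_annotations(xml_source):
    """Return {annotation id: record} from the XML text of one file."""
    root = ET.fromstring(xml_source)

    # Slot id -> (slot name, values).  For complex slots the values are
    # ids of other annotations.
    slots = {}
    for tag, value_tag in (("stringSlotMention", "stringSlotMentionValue"),
                           ("complexSlotMention", "complexSlotMentionValue")):
        for slot in root.iter(tag):
            name = slot.find("mentionSlot").get("id")
            values = [v.get("value") for v in slot.findall(value_tag)]
            slots[slot.get("id")] = (name, values)

    records = {}
    for ann in root.iter("annotation"):
        span = ann.find("span")  # assumption: only the first span is used
        records[ann.find("mention").get("id")] = {
            "start": int(span.get("start")) if span is not None else None,
            "end": int(span.get("end")) if span is not None else None,
            "text": ann.findtext("spannedText"),  # None when omitted
            "type": None,
            "slots": {},
        }

    # Attach the semantic type and the resolved slots to each annotation.
    for cm in root.iter("classMention"):
        record = records.get(cm.get("id"))
        if record is None:
            continue
        record["type"] = cm.find("mentionClass").get("id")
        for has_slot in cm.findall("hasSlotMention"):
            name, values = slots.get(has_slot.get("id"), (None, []))
            if name is not None:
                record["slots"].setdefault(name, []).extend(values)
    return records
```

A caller would pass the contents of one annotation file, for example
load_annotations(open(path, encoding="utf-8").read()), where the
encoding is again an assumption.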

====================================================
=============== BATCH INFORMATION ==================
====================================================

Car section:
Batch 001: First batch. Size: 78,604 tokens.

Batch 004: Addition of Mention.CarFeature to distinguish concrete, removable or purchasable CarParts from more abstract CarFeatures such as power, acceleration and drive.  Size: 7,643 tokens.

Batch 005: Batch consists of JDPower car review files. No changes made to annotation schema. Start of preannotation. Size: 42,019 tokens.

Batch 006: Addition of Mention.Descriptor for adjectives preceding mention nouns, such as *heated*, *power* seats; MemberOf slot added to link individual mentions to a plural mention. Size: 95,864 tokens.

Batch 007: Removal of Mention.Descriptor and addition of Descriptor class to reflect the fact that descriptors do not refer to discourse entities. Size: 11,221 tokens.

Batch 008: Same format as Batch 007. Size: 30,612 tokens.

====================================================
=============== CONTRIBUTORS =======================
====================================================

Claire Bonial
Lyndsie Clark
Miriam Eckert
Meredith Green
George Figgs
Steliana Ivanova
Hanna Lind
Jason Kessler
Nicolas Nicolov
Ronald Woodward
Whitney Zimmer


====================================================
=============== ACKNOWLEDGEMENTS ===================
====================================================

We would like to thank Prof. Martha Palmer, Prof. James Martin, and
Prof. Michael Mozer of The University of Colorado at Boulder for
insightful discussions on the corpus and Dr. Richard Wolniewicz,
Chance Parker, and Rich Belanger of J.D. Power and Associates for
supporting the project.


====================================================
=============== CONTACT ============================
====================================================

ICWSM.JDPA.Corpus@gmail.com


====================================================
=============== NOTE ===============================
====================================================

The opinions and claims expressed in the corpus documents are those of
their authors, as assessed by human annotators.  If one brand or model
receives more positive or negative sentiment than another in this
corpus, that does not indicate a corresponding difference of opinion
in the blogosphere.

--- END: README.txt --------------------------------