Difference between revisions of "ClearTK Workshop I"

From CompSemWiki

Revision as of 10:55, 2 February 2010

Introduction

Abstraction Paradise / Configuration Hell

  • We are going to avoid a lot of the complexities of UIMA by using UUTUC, which simplifies the use of UIMA and ClearTK for this workshop.

Use of Twitter Data

We have been given permission to use Twitter data from the EPIC project under the following conditions (quoting Leysia Palen):

  1. Confirm with me that all attendees are CU people. If not, tell me who else is there, and how many.
  2. The data are not to be made public, and are therefore to put behind a password-protected site. The data must be taken down when workshop is over, and you must confirm to me that this has been done.
  3. you give full credit to Project EPIC for the data source.

The data we have been given is subject to IRB protocols. Here's another quote from Leysia:

"...we are not prepared to have it be public. This is data that has been collected by many students on our project, and they and we have the rights to it.... [posting the data] will be violating IRB protocols as well, protocols that we have to adhere to that are federal guidelines. Not only can grant funds be yanked for such a move, a whole university can go under audit for inappropriate posting of human subjects data."

You are expected to respect the above concerns.

Setup

  1. Download and install Eclipse from here. Select "Eclipse IDE for Java Developers"; this should send you to a page that has a large green arrow on it. Click on that green arrow. Here are some installation instructions.
  2. Start Eclipse and choose a new workspace directory. If you get the welcome screen, click on the image with a silver and gold swooshing arrow (labeled "Workbench" when you hover your mouse over it).
  3. Install the UIMA Eclipse plugins
    • Select Help -> Install New Software
    • Enter the URL
      http://www.apache.org/dist/incubator/uima/eclipse-update-site/
      into the "Work with" text box
    • Restart Eclipse
  4. Obtain cleartk-workshop-feb-2010.zip, an exported Eclipse project that contains source code, data, and dependencies, directly from me (Philip Ogren). It is also available on mime at /raid/ogren/cleartk-workshop-feb-2010.zip.
  5. Import the Eclipse project from cleartk-workshop-feb-2010.zip:
    • Menu -> File -> Import -> General -> Existing Projects into Workspace -> Next
    • Check "Select archive file" and browse to cleartk-workshop-feb-2010.zip. Select project CleartkWorkshop and click Finish.

Example 1

  • Open Example1ReadData by pressing CTRL+SHIFT+T and typing Example1
  • Run this program by pressing ALT+SHIFT+X followed by J. The program will immediately throw an exception and exit, because no program arguments have been entered yet.
  • Open the Run Configurations window: Menu -> Run -> Run Configurations. Select "Example1ReadData" and edit the program arguments under the Arguments tab. Enter the following arguments:
data/100-examples.csv
data/100-examples-xmi
  • Run the CAS Visual Debugger.
    • Open the Run Configurations window (see above) and create a new launch configuration for a "Java Application". Enter "CVD" for the name and enter "org.apache.uima.tools.cvd.CVD" as the main class. Run it.
    • In the CVD read in a type system: Menu -> File -> Read Type System File. Select <workspace-dir>/CleartkWorkshop/src/org/cleartk/workshop/feb2010/type/TypeSystem.xml.
    • Now read in an XMI file: Menu -> File -> Read XMI CAS File. Select <workspace-dir>/CleartkWorkshop/data/100-examples-xmi/1474720291.xmi (or any other xmi file in that directory)
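
The two program arguments entered above can be thought of as an input CSV file and an output directory for XMI files. The following is only a minimal sketch of how a program might check such arguments before doing any work; the class WorkshopArgs and its field names are hypothetical and are not part of the workshop project, and the actual Example1ReadData source may handle its arguments differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of CleartkWorkshop.
class WorkshopArgs {
    final Path inputCsv;   // assumed meaning of the first argument
    final Path outputDir;  // assumed meaning of the second argument

    private WorkshopArgs(Path inputCsv, Path outputDir) {
        this.inputCsv = inputCsv;
        this.outputDir = outputDir;
    }

    // Fails early with a usage message when the argument count is wrong.
    static WorkshopArgs parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "usage: <input-csv> <output-xmi-dir>");
        }
        return new WorkshopArgs(Paths.get(args[0]), Paths.get(args[1]));
    }

    public static void main(String[] args) throws IOException {
        WorkshopArgs parsed = parse(args);
        if (!Files.isRegularFile(parsed.inputCsv)) {
            throw new IllegalArgumentException(
                "input file not found: " + parsed.inputCsv);
        }
        // Create the output directory if it does not exist yet.
        Files.createDirectories(parsed.outputDir);
        System.out.println("Reading " + parsed.inputCsv
            + ", writing XMI files to " + parsed.outputDir);
    }
}
```

Relative paths such as data/100-examples.csv are resolved against the working directory, which for an Eclipse launch configuration is the project directory by default.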

Arguments for Examples

  • Example1ReadData
data/100-examples.csv
data/100-examples-xmi
  • Example2SimpleAnalysis
data/100-examples.csv
data/100-examples-xmi
  • Example3TrainingData
data/training-data.csv
data/experiment/maxent
  • Example4TrainModel
data/experiment/maxent
100
5
  • Example5Classify
data/testing-data.csv
data/experiment/maxent/model.jar
  • Example6Complete

Running the experiment with SVMlight will require downloading and installing SVMlight.

data/training-data.csv
data/experiment/svmlight
org.cleartk.classifier.svmlight.DefaultSVMlightDataWriterFactory
data/testing-data.csv

or

data/training-data.csv
data/experiment/libsvm
org.cleartk.classifier.libsvm.DefaultBinaryLIBSVMDataWriterFactory
data/testing-data.csv
-t
0

or

data/training-data.csv
data/experiment/maxent
org.cleartk.classifier.opennlp.DefaultBinaryMaxentDataWriterFactory
data/testing-data.csv
200
3
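
The three argument sets for Example6Complete share the same first four positions and differ only in what follows. As a rough sketch, and assuming (from the values alone, not from the workshop source) that the positions mean training data, output directory, data writer factory class name, and testing data, with everything after that passed on to the trainer, the split could look like this. The class Example6Args and its field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; the position meanings are assumptions.
class Example6Args {
    final String trainingData;
    final String outputDirectory;
    final String dataWriterFactoryClass;
    final String testingData;
    final List<String> trainingArguments; // may be empty

    private Example6Args(String[] args) {
        trainingData = args[0];
        outputDirectory = args[1];
        dataWriterFactoryClass = args[2];
        testingData = args[3];
        // Everything after the fourth argument is kept as-is for the trainer.
        trainingArguments = Collections.unmodifiableList(
            Arrays.asList(Arrays.copyOfRange(args, 4, args.length)));
    }

    static Example6Args parse(String... args) {
        if (args.length < 4) {
            throw new IllegalArgumentException(
                "expected at least 4 arguments, got " + args.length);
        }
        return new Example6Args(args);
    }

    public static void main(String[] args) {
        Example6Args parsed = parse(args);
        System.out.println("factory: " + parsed.dataWriterFactoryClass);
        System.out.println("trainer arguments: " + parsed.trainingArguments);
    }
}
```

With the LIBSVM argument set above, the trailing arguments would be -t and 0; with the SVMlight set, the list would be empty.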