ClearTK Workshop I
Introduction
Abstraction Paradise / Configuration Hell
- We are going to avoid a lot of the complexities of UIMA by using UUTUC. This will serve to simplify use of UIMA and ClearTK for this workshop.
Use of Twitter Data
We have been given permission to use Twitter data from the EPIC project under the following conditions (quoting Leysia Palen):
- Confirm with me that all attendees are CU people. If not, tell me who else is there, and how many.
- The data are not to be made public, and are therefore to put behind a password-protected site. The data must be taken down when workshop is over, and you must confirm to me that this has been done.
- you give full credit to Project EPIC for the data source.
The data we have been given is subject to IRB protocols. Here's another quote from Leysia:
"...we are not prepared to have it be public. This is data that has been collected by many students on our project, and they and we have the rights to it.... [posting the data] will be violating IRB protocols as well, protocols that we have to adhere to that are federal guidelines. Not only can grant funds be yanked for such a move, a whole university can go under audit for inappropriate posting of human subjects data."
You are expected to respect the above concerns.
Setup
- Download and install Eclipse from here. Select "Eclipse IDE for Java Developers"; it should send you to a page that has a large green arrow on it. Click that green arrow. Here are some installation instructions.
- Start Eclipse and choose a new workspace directory. If you get the welcome screen, click on the image with a silver and gold swooshing arrow (labeled "Workbench" when you hover your mouse over it).
- Install UIMA Eclipse plugins
- Select Help -> Install New Software
- Enter the URL http://www.apache.org/dist/incubator/uima/eclipse-update-site/ into the "Work with" text box
- Restart Eclipse
- Obtain cleartk-workshop-feb-2010.zip directly from me (Philip Ogren). It is an exported Eclipse project that contains source code, data, and dependencies.
- Import the Eclipse project from cleartk-workshop-feb-2010.zip:
- Menu -> File -> Import -> General -> Existing Projects into Workspace -> Next
- Check "Select archive file" and browse to cleartk-workshop-feb-2010.zip. Select the project CleartkWorkshop and click Finish.
Example 1
- Open Example1ReadData by typing CTRL+SHIFT+T and typing Example1
- Run this program by typing ALT+SHIFT+X followed by J. This first run will immediately throw an exception and exit because no program arguments have been set yet.
- Open the Run Configurations window: Menu -> Run -> Run Configurations. Select "Example1ReadData" and edit the program arguments under the Arguments tab. Enter the following arguments and click Run:
data/100-examples.csv data/100-examples-xmi
- Run the CAS Visual Debugger.
- Open the Run Configurations window (see above) and create a new launch configuration for a "Java Application". Enter "CVD" for the name and enter "org.apache.uima.tools.cvd.CVD" as the main class. Run it.
- In the CVD, read in a type system: Menu -> File -> Read Type System File. Select <workspace-dir>/CleartkWorkshop/src/org/cleartk/workshop/feb2010/type/TypeSystem.xml.
- Now read in an XMI file: Menu -> File -> Read XMI CAS File. Select <workspace-dir>/CleartkWorkshop/data/100-examples-xmi/1474720291.xmi (or any other XMI file in that directory).
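Once Example1ReadData has run with the arguments above, the output directory should contain XMI files such as 1474720291.xmi. The following is a minimal Java sketch for listing those files so you can pick one to open in the CVD. It is not part of the workshop project: the class name ListXmiFiles and the default directory are assumptions made for illustration, and the path is relative to the project directory.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

// Illustrative helper only; not part of the CleartkWorkshop project.
class ListXmiFiles {

    // True when a file name ends in ".xmi".
    static boolean isXmi(String name) {
        return name.endsWith(".xmi");
    }

    // Returns the sorted names of the .xmi files in dir, or an empty array
    // if dir does not exist or is not a directory.
    static String[] xmiNames(File dir) {
        String[] names = dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return isXmi(name);
            }
        });
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Assumed default: the output directory passed to Example1ReadData.
        File dir = new File(args.length > 0 ? args[0] : "data/100-examples-xmi");
        String[] names = xmiNames(dir);
        System.out.println(names.length + " XMI file(s) in " + dir.getPath());
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

If the count printed is zero, the most likely causes are that Example1ReadData has not yet been run with its arguments, or that the sketch was started from a different working directory.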
Arguments for Examples
- Example1ReadData
data/100-examples.csv data/100-examples-xmi
- Example2SimpleAnalysis
data/100-examples.csv data/100-examples-xmi
- Example3TrainingData
data/training-data.csv data/experiment/maxent
- Example4TrainModel
data/experiment/maxent 100 5
- Example5Classify
data/testing-data.csv data/experiment/maxent/model.jar
- Example6Complete
Use one of the following three argument sets. Running the experiment with SVMlight will require downloading and installing SVMlight first.
data/training-data.csv data/experiment/svmlight org.cleartk.classifier.svmlight.DefaultSVMlightDataWriterFactory data/testing-data.csv
or
data/training-data.csv data/experiment/libsvm org.cleartk.classifier.libsvm.DefaultBinaryLIBSVMDataWriterFactory data/testing-data.csv -t 0
or
data/training-data.csv data/experiment/maxent org.cleartk.classifier.opennlp.DefaultBinaryMaxentDataWriterFactory data/testing-data.csv 200 3
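The Example6Complete argument lines above all appear to share the same shape: four fixed positions followed by zero or more classifier-specific options. The Java sketch below splits such a line into named parts. The meaning given to each position (training data, output directory, data writer factory class, testing data, remaining trainer options) is an assumption inferred from the file and class names in the lines above, not taken from the Example6Complete source, and the class Example6ArgsSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical holder for the Example6Complete arguments; the field
// meanings are assumptions inferred from the argument lines above.
class Example6ArgsSketch {
    final String trainingData;       // e.g. data/training-data.csv
    final String outputDirectory;    // e.g. data/experiment/maxent
    final String dataWriterFactory;  // fully qualified class name
    final String testingData;        // e.g. data/testing-data.csv
    final String[] trainerArgs;      // everything after the fourth argument

    Example6ArgsSketch(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException(
                "expected at least 4 arguments but got " + args.length);
        }
        trainingData = args[0];
        outputDirectory = args[1];
        dataWriterFactory = args[2];
        testingData = args[3];
        // May be empty, as in the SVMlight line above.
        trainerArgs = Arrays.copyOfRange(args, 4, args.length);
    }

    public static void main(String[] args) {
        // The maxent argument line from above, split on spaces.
        String line = "data/training-data.csv data/experiment/maxent "
            + "org.cleartk.classifier.opennlp.DefaultBinaryMaxentDataWriterFactory "
            + "data/testing-data.csv 200 3";
        Example6ArgsSketch parsed = new Example6ArgsSketch(line.split(" "));
        System.out.println("training data:   " + parsed.trainingData);
        System.out.println("output dir:      " + parsed.outputDirectory);
        System.out.println("writer factory:  " + parsed.dataWriterFactory);
        System.out.println("testing data:    " + parsed.testingData);
        System.out.println("trainer options: " + Arrays.toString(parsed.trainerArgs));
    }
}
```

Read this way, the three lines differ only in the output directory, the data writer factory class, and the trailing options (none for SVMlight, "-t 0" for LIBSVM, "200 3" for maxent).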