Difference between revisions of "GIZA++"

From CompSemWiki

Revision as of 15:30, 10 September 2010

Install

For more information about how to install GIZA++, please refer to Install GIZA++

Information

  • Directory on verbs: /home/verbs/share/stage/tools/gizapp

Input Format

Basic Files

  • To run GIZA++, we need at least three input files:
    1. Vocabulary Files
      • GIZA++ needs one vocabulary file for the source language and one for the target language. The format of a vocabulary file is as follows:
1	UNK	0
2	the	58947
3	,	52221
4	.	48789
5	of	30427
6	to	26185
..........

The first column is a unique id number. The second column is the word itself. The last column is the number of times the word appears in the corpus. The columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.

    2. Sentences File
      • Using the unique ids from the vocabulary files, each sentence is written as a sequence of id numbers. Here is an example of a sentences file:
1
8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
1
45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
1
1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
1
19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
..........

Each sentence pair takes three lines. The first line is the number of times this sentence pair occurs. The second line is the source sentence, with each word replaced by its unique id. The third line is the target sentence in the same encoding.

    3. Dictionary File (Optional)
      • We can also provide a dictionary file for GIZA++. The format of the dictionary file is as follows:
5279 8007
3877 226
..........

The first column is the unique id of a target word, and the second column is the unique id of the corresponding source word.
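
The three file formats above can be illustrated with a short Python sketch. It is not part of GIZA++, and the real files are normally produced by the GIZA++ scripts. The function names build_vocab, encode_pair and dictionary_line are hypothetical. The id assignment (UNK as id 1, then words numbered from 2 in descending frequency) is an assumption read off the vocabulary listing, and mapping unseen words to id 1 is likewise an assumption.

```python
from collections import Counter


def build_vocab(sentences):
    """Return (word -> id, list of vocabulary-file lines).

    Assumption: id 1 is reserved for UNK and the other words are numbered
    from 2 in descending order of frequency, as in the listing above.
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Sort by descending count, then alphabetically, so the ids are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    word_to_id = {"UNK": 1}
    lines = ["1\tUNK\t0"]
    for word_id, (word, count) in enumerate(ordered, start=2):
        word_to_id[word] = word_id
        lines.append(f"{word_id}\t{word}\t{count}")
    return word_to_id, lines


def encode_pair(source, target, source_ids, target_ids, count=1):
    """Return the three sentences-file lines for one sentence pair."""
    # Unseen words fall back to id 1 (UNK); this fallback is an assumption.
    src = " ".join(str(source_ids.get(word, 1)) for word in source.split())
    tgt = " ".join(str(target_ids.get(word, 1)) for word in target.split())
    return [str(count), src, tgt]


def dictionary_line(target_word, source_word, source_ids, target_ids):
    """Return one dictionary-file line: target id first, then source id."""
    return f"{target_ids[target_word]} {source_ids[source_word]}"


# Toy corpus, invented for illustration only.
source_sentences = ["the house is small", "the house"]
target_sentences = ["das Haus ist klein", "das Haus"]
source_ids, source_vcb = build_vocab(source_sentences)
target_ids, target_vcb = build_vocab(target_sentences)
snt_lines = encode_pair(source_sentences[0], target_sentences[0], source_ids, target_ids)
dict_line = dictionary_line("Haus", "house", source_ids, target_ids)
```

Writing "\n".join(source_vcb) to source.vcb and the lines from encode_pair to a .snt file would give files in the layout described above.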

Advanced Files

To use IBM-Model 4 or the HMM model in GIZA++, we also need to generate word class files. We use mkcls to do this.

  • Type these commands:
mkcls -psource -Vsource.vcb.classes
mkcls -ptarget -Vtarget.vcb.classes

mkcls then produces four class files: source.vcb.classes, source.vcb.classes.cats, target.vcb.classes and target.vcb.classes.cats. Make sure that the prefix of these generated files is the same as the name of the original vocabulary files.
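
Before starting a run, it can help to confirm that the class files sit next to the vocabulary files under the expected names. The following Python sketch is one way to do that. The helper names are hypothetical, and it assumes the naming pattern `<vocabulary file>.classes` and `<vocabulary file>.classes.cats`.

```python
import os
import tempfile


def expected_class_files(vocab_path):
    """Class files assumed to accompany one vocabulary file."""
    return [vocab_path + ".classes", vocab_path + ".classes.cats"]


def missing_class_files(vocab_paths):
    """Return every expected class file that is not on disk."""
    return [
        path
        for vocab_path in vocab_paths
        for path in expected_class_files(vocab_path)
        if not os.path.exists(path)
    ]


# Demonstration in a temporary directory: only the source class files are
# created, so the two target class files are reported as missing.
with tempfile.TemporaryDirectory() as workdir:
    source_vcb = os.path.join(workdir, "source.vcb")
    target_vcb = os.path.join(workdir, "target.vcb")
    for path in expected_class_files(source_vcb):
        open(path, "w").close()
    missing = [
        os.path.basename(path)
        for path in missing_class_files([source_vcb, target_vcb])
    ]
```

An empty result from missing_class_files means all four class files were found.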

Processing Input File with GIZA++ Script

  • GIZA++ provides some scripts for processing text files. In our case, we can use one of them to convert the corpus into the GIZA++ input format.
    • Use plain2snt.out
      1. Prepare plain text files for both the source and the target language. In each file, put one sentence per line. The two sentences of a source-target pair must be on the same line number in their respective files.
      2. Type the command
plain2snt.out source target

The script then generates the vocabulary files and sentences files, named source.vcb, target.vcb, source-target.snt and target-source.snt.
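
Because the sentence pairs are matched by line number, a quick check of the two plain text files before running plain2snt.out can catch misaligned input. The Python sketch below is one possible check. The function name check_parallel and the toy sentences are hypothetical, and treating blank lines as worth inspecting is an assumption, not a documented GIZA++ requirement.

```python
import os
import tempfile


def check_parallel(source_path, target_path):
    """Return (source line count, target line count, blank line numbers).

    Line numbers are 1-based and list positions where either side is blank.
    """
    with open(source_path, encoding="utf-8") as src:
        source_lines = src.read().splitlines()
    with open(target_path, encoding="utf-8") as tgt:
        target_lines = tgt.read().splitlines()
    blank = [
        number
        for number, (s, t) in enumerate(zip(source_lines, target_lines), start=1)
        if not s.strip() or not t.strip()
    ]
    return len(source_lines), len(target_lines), blank


# Demonstration with two tiny files; the second target line is left blank.
with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "source")
    target_path = os.path.join(workdir, "target")
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write("the house is small\nthe house\n")
    with open(target_path, "w", encoding="utf-8") as handle:
        handle.write("das Haus ist klein\n\n")
    n_source, n_target, blank_lines = check_parallel(source_path, target_path)
```

If the two counts differ, the files are not line-aligned and should be fixed before they are converted.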

Parameter

Use these parameter settings when running GIZA++:

-c /home/verbs/student/wech5560/testMGIZA/giza/corpus/de-en-int-train.snt  (Location of the sentences file)
-d  (Location of the dictionary file)
-l 110-09-03.145823.wech5560.log  (Log file name)
-log 0  (Whether to write a log file; 0 for no, 1 for yes)
m1 5  (Number of iterations for IBM-Model 1)
m2 0  (Number of iterations for IBM-Model 2)
m3 3  (Number of iterations for IBM-Model 3)
m4 3  (Number of iterations for IBM-Model 4)
m5 0  (Number of iterations for IBM-Model 5)
m5p0 -1  (Fixed value of the parameter p0 in IBM-Model 5; a negative value means it is determined in training)
m6 0  (Number of iterations for IBM-Model 6?)
mh 5  (Number of iterations for the HMM model)
nbestalignments 0  (Number of best alignments to print)
-o /home/verbs/student/wech5560/testMGIZA/giza/result/de-en  (Output file location and prefix)
s /home/verbs/student/wech5560/testMGIZA/giza/corpus/en.vcb  (The source vocabulary file)
t /home/verbs/student/wech5560/testMGIZA/giza/corpus/de.vcb  (The target vocabulary file)
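
As a rough illustration, settings like these can be kept in one place and turned into an argument list. The Python sketch below does only that and does not start GIZA++. The function name giza_arguments is hypothetical, the relative paths are shortened placeholders, and it assumes that every option is passed on the command line as a leading dash followed by its value and that an option without a value is simply left out.

```python
def giza_arguments(settings, binary="GIZA++"):
    """Build an argument list from a mapping of option name to value.

    Assumptions: each option is written as "-name value", and options
    whose value is empty or None (for example an unused dictionary) are skipped.
    """
    arguments = [binary]
    for name, value in settings.items():
        if value is None or value == "":
            continue
        arguments.extend(["-" + name.lstrip("-"), str(value)])
    return arguments


# Placeholder paths; replace them with the real corpus locations.
settings = {
    "c": "corpus/de-en-int-train.snt",
    "d": "",
    "m1": 5,
    "m2": 0,
    "m3": 3,
    "m4": 3,
    "m5": 0,
    "m5p0": -1,
    "mh": 5,
    "o": "result/de-en",
    "s": "corpus/en.vcb",
    "t": "corpus/de.vcb",
}
command = giza_arguments(settings)
```

The resulting list can be printed for inspection or handed to subprocess.run once the executable name and paths match the local installation.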

Run

Reference

  • http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html
  • http://www.mail-archive.com/moses-support@mit.edu/msg01143.html