GIZA++

From CompSemWiki

Latest revision as of 12:34, 27 October 2016

Install

For more information about how to install GIZA++, please refer to Install GIZA++

Information

  • Directory on verbs: /home/verbs/shared/stages/tools/giza-pp

Input Format

Basic Files

  • To run GIZA++, we need at least three input files:
    1. Vocabulary Files
      • GIZA++ needs one vocabulary file for the source language and one for the target language. The format of a vocabulary file is as follows:
1	UNK	0
2	the	58947
3	,	52221
4	.	48789
5	of	30427
6	to	26185
..........

The first column is a unique id number. The second column is the word. The last column is the frequency of the word in the corpus. Columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
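As an illustration of this layout, here is a minimal Python sketch that reads vocabulary lines into lookup tables. The function name read_vcb is hypothetical and not part of GIZA++. The sketch splits on any whitespace, so it accepts tab-separated lines as described above and would also accept space-separated ones.

```python
def read_vcb(lines):
    """Parse vocabulary lines of the form: id, word, frequency."""
    word_to_id = {}
    frequency = {}
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 3:
            continue  # skip blank or malformed lines (an assumption of this sketch)
        word_id, word, count = fields
        word_to_id[word] = int(word_id)
        frequency[word] = int(count)
    return word_to_id, frequency


# The first rows of the example vocabulary shown above.
sample = ["1\tUNK\t0", "2\tthe\t58947", "3\t,\t52221"]
word_to_id, frequency = read_vcb(sample)
print(word_to_id["the"], frequency["the"])
```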

    2. Sentence File
      • The sentence file represents each sentence as a sequence of the unique ids from the vocabulary files. Here is an example of a sentence file:
1
8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
1
45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
1
1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
1
19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
..........

Each sentence pair takes three lines. The first line is the number of times the sentence pair occurs. The second line is the source sentence, with each word replaced by its unique id. The third line is the target sentence, encoded the same way.
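The three-line layout can be sketched in Python as follows. This is only an illustration of the format, not how plain2snt.out is implemented. The function name encode_pair and the toy vocabularies are hypothetical, and mapping unknown words to id 1 is an assumption based on the UNK row in the vocabulary example.

```python
UNK_ID = 1  # assumption: id 1 is the UNK entry, as in the vocabulary example


def encode_pair(source_words, target_words, source_ids, target_ids, count=1):
    """Return one three-line sentence-file entry for a tokenized sentence pair."""
    source_line = " ".join(str(source_ids.get(w, UNK_ID)) for w in source_words)
    target_line = " ".join(str(target_ids.get(w, UNK_ID)) for w in target_words)
    return f"{count}\n{source_line}\n{target_line}\n"


# Toy vocabularies, invented for this example only.
source_ids = {"the": 2, "house": 3}
target_ids = {"das": 2, "haus": 3}
entry = encode_pair(["the", "house"], ["das", "haus"], source_ids, target_ids)
print(entry, end="")
```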

    3. Dictionary File (Optional)
      • We can also provide a dictionary file to GIZA++. The format of the dictionary file is as follows:
5279 8007
3877 226
..........

The first column is the unique id of a target word, and the second column is the unique id of a source word.

Advanced Files

To use IBM Model 4 or the HMM model, GIZA++ also needs word class files. We generate them with mkcls.

  • Run these commands:
giza-pp/mkcls -psource -Vsource.vcb.classes
giza-pp/mkcls -ptarget -Vtarget.vcb.classes

mkcls then produces four class files: source.vcb.classes, source.vcb.classes.cats, target.vcb.classes, and target.vcb.classes.cats. Make sure that the prefix of each generated file matches the name of the corresponding vocabulary file.
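A small Python sketch can confirm that the class files are in place before GIZA++ is started. The function name missing_class_files is hypothetical. The sketch assumes that each class file is accompanied by a file of the same name with .cats appended, and that the vocabulary prefixes are source and target.

```python
from pathlib import Path


def missing_class_files(directory, prefixes=("source", "target")):
    """Return the expected mkcls output files that are not present in directory."""
    expected = []
    for prefix in prefixes:
        expected.append(f"{prefix}.vcb.classes")
        expected.append(f"{prefix}.vcb.classes.cats")
    return [name for name in expected if not (Path(directory) / name).exists()]


# Usage: an empty list means all four files were found.
# print(missing_class_files("."))
```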

Processing Input File with GIZA++ Script

  • GIZA++ provides some scripts for processing text files. In our case, we can use one of them to convert the corpus into the GIZA++ input format.
    • Use plain2snt.out
      1. Prepare one plain text file for the source language and one for the target language. Each file contains one sentence per line, and the two sentences of a sentence pair must be on the same line number in both files.
      2. Run the command
giza-pp/plain2snt.out source target

The script then generates the vocabulary files and sentence files, named source.vcb, target.vcb, source-target.snt, and target-source.snt.
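Because sentence pairs are matched by line number, it can help to check that both plain text files have the same number of lines before running plain2snt.out. The following Python sketch does this. The function names are hypothetical, and UTF-8 encoding is an assumption about the corpus files.

```python
def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            f"{source_path} has {n_source} lines but {target_path} has {n_target}"
        )
    return n_source


# Usage, with the file names from the command above:
# check_parallel("source", "target")
```

Note that this only catches length mismatches. It cannot detect sentences that are paired with the wrong line.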

Parameters

Use the following parameter settings when running GIZA++:

-c /home/verbs/student/wech5560/testMGIZA/giza/corpus/de-en-int-train.snt		(Location of the sentence file)
-d											(Location of the dictionary file)
-l 110-09-03.145823.wech5560.log							(Log file name)
-log 0											(Whether to use a log file; 0 for no, 1 for yes)
-m1 5											(Number of iterations for IBM Model 1)
-m2 0											(Number of iterations for IBM Model 2)
-m3 3											(Number of iterations for IBM Model 3)
-m4 3											(Number of iterations for IBM Model 4)
-m5 0											(Number of iterations for IBM Model 5)
-m5p0 -1										(Fixed value of the parameter p0 in IBM Model 5; if negative, it is determined in training)
-m6 0											(Number of iterations for IBM Model 6)
-mh 5											(Number of iterations for the HMM model)
-nbestalignments 0									(Number of n-best alignments to output)
-o /home/verbs/student/wech5560/testMGIZA/giza/result/de-en				(The output file location and prefix)
-s /home/verbs/student/wech5560/testMGIZA/giza/corpus/en.vcb				(The source vocabulary file)
-t /home/verbs/student/wech5560/testMGIZA/giza/corpus/de.vcb				(The target vocabulary file)
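Settings like these can be kept in one place and turned into a command line. The Python sketch below builds an argument list from a dictionary of parameters. The function name giza_command is hypothetical, the binary path giza-pp/GIZA++ is taken from this page, and skipping parameters with an empty value is a choice made for this sketch.

```python
def giza_command(params, binary="giza-pp/GIZA++"):
    """Build an argument list such as ['giza-pp/GIZA++', '-s', 'en.vcb', ...]."""
    argv = [binary]
    for name, value in params.items():
        if value is None or value == "":
            continue  # leave out parameters that have no value, such as an unused -d
        argv.extend([f"-{name}", str(value)])
    return argv


# Shortened file names, for illustration only.
params = {"s": "en.vcb", "t": "de.vcb", "c": "de-en-int-train.snt", "d": "", "m1": 5, "mh": 5}
argv = giza_command(params)
print(" ".join(argv))
```

The resulting list could be passed to subprocess.run to start the aligner.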

Run

giza-pp/GIZA++ -s sourceVocabularyFile -t targetVocabularyFile -c sentenceFile

Alternatively, you can save all the parameters in a .gizacfg file. Here is an example .gizacfg file:

adbackoff 0
c unfactored/corpus/de-en-int-train.snt
compactadtable 1
compactalignmentformat 0
corpusfile unfactored/corpus/de-en-int-train.snt
countcutoff 1e-06
countcutoffal 1e-05
countincreasecutoff 1e-06
countincreasecutoffal 1e-05
d
deficientdistortionforemptyword 0
depm4 76
depm5 68
dictionary
dopeggingyn 0
emalignmentdependencies 2
emalsmooth 0.2
emprobforempty 0.4
emsmoothhmm 2
hmmdumpfrequency 0
hmmiterations 5
l 110-09-03.145823.wech5560.log
log 0
logfile 110-09-03.145823.wech5560.log
m1 5
m2 0
m3 3
m4 3
m5 0
m5p0 -1
m6 0
manlexfactor1 0
manlexfactor2 0
manlexmaxmultiplicity 20
maxfertility 10
maxsentencelength 101
mh 5
mincountincrease 1e-07
ml 101
model1dumpfrequency 1
model1iterations 5
model23smoothfactor 0
model2dumpfrequency 0
model2iterations 0
model345dumpfrequency 0
model3dumpfrequency 0
model3iterations 3
model4iterations 3
model4smoothfactor 0.4
model5iterations 0
model5smoothfactor 0.1
model6iterations 0
nbestalignments 0
nodumps 1
nofiledumpsyn 1
noiterationsmodel1 5
noiterationsmodel2 0
noiterationsmodel3 3
noiterationsmodel4 3
noiterationsmodel5 0 
noiterationsmodel6 0
nsmooth 4
nsmoothgeneral 0
numberofiterationsforhmmalignmentmodel 5
o unfactored/giza.de-en/de-en
onlyaldumps 1
outputfileprefix unfactored/giza.de-en/de-en
outputpath
p 0
p0 0.999
peggedcutoff 0.03
pegging 0
probcutoff 1e-07
probsmooth 1e-07
readtableprefix
s unfactored/corpus/en.vcb
sourcevocabularyfile unfactored/corpus/en.vcb
t unfactored/corpus/de.vcb
t1 1
t2 0
t2to3 0
t3 0
t345 0
targetvocabularyfile unfactored/corpus/de.vcb
tc
testcorpusfile
th 0
transferdumpfrequency 0
v 0
verbose 0
verbosesentence -10

It is recommended that you store the .gizacfg file and reuse it next time, so that you do not have to type any other parameters.

To run GIZA++ with a .gizacfg file, use this command:

giza-pp/GIZA++ configure.gizacfg
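To inspect or compare stored settings, a .gizacfg file can be read into a dictionary. The Python sketch below assumes the layout of the example on this page: one setting per line, the name first, then an optional value separated by whitespace. The function name read_gizacfg is hypothetical.

```python
def read_gizacfg(lines):
    """Map each setting name to its value; settings without a value map to ''."""
    settings = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split into name and the rest
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        settings[name] = value
    return settings


# A few lines taken from the example configuration above.
sample = ["adbackoff 0", "c unfactored/corpus/de-en-int-train.snt", "d", "m1 5", "p0 0.999"]
settings = read_gizacfg(sample)
print(settings["m1"], repr(settings["d"]))
```

Values are kept as strings, so convert them with int or float where a number is needed.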

Path Information on Verbs

  • GIZA++
    • /home/verbs/shared/stages/tools/giza-pp

Reference

http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html

http://www.mail-archive.com/moses-support@mit.edu/msg01143.html