GIZA++

From CompSemWiki

Latest revision as of 12:34, 27 October 2016

Install

For more information about how to install GIZA++, please refer to Install GIZA++

Information

  • Directory on verbs: /home/verbs/shared/stages/tools/giza-pp

Input Format

Basic Files

  • To run GIZA++, we need at least three input files:
    1. Vocabulary Files
      • GIZA++ needs one vocabulary file for the source language and one for the target language. The format of a vocabulary file is as follows:
1	UNK	0
2	the	58947
3	,	52221
4	.	48789
5	of	30427
6	to	26185
..........

The first column is a unique id number, the second column is the word, and the last column is the frequency of that word in the corpus. The columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
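
As an illustration of the vocabulary format described above, here is a minimal Python sketch that builds the rows from whitespace-tokenized sentences. The function names are hypothetical, and the id assignment (id 1 reserved for UNK, remaining ids in descending frequency) is an assumption modelled on the sample above; the GIZA++ tools may assign ids differently.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Return vocabulary rows (id, word, frequency) for tokenized sentences."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    rows = [(1, "UNK", 0)]  # id 1 is reserved for UNK, as in the sample above
    # Assumption: ids follow descending frequency; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for word_id, (word, freq) in enumerate(ordered, start=2):
        rows.append((word_id, word, freq))
    return rows


def format_vocabulary(rows):
    """Render the rows as tab-separated lines, one vocabulary entry per line."""
    return "\n".join(f"{word_id}\t{word}\t{freq}" for word_id, word, freq in rows)
```

Writing the result of format_vocabulary to source.vcb or target.vcb would give a file in the layout shown above.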

    2. Sentence File
      • The sentence file represents each sentence as a sequence of the unique ids from the vocabulary files. Here is an example of a sentence file:
1
8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
1
45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
1
1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
1
19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
..........

Each sentence pair takes three lines. The first line is the number of times this sentence pair occurs in the corpus. The second line is the source sentence, with each word replaced by its unique id. The third line is the target sentence, encoded the same way.
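
The three-line layout can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the word-to-id mappings are assumed to come from the vocabulary files, and merging identical sentence pairs into a single entry with a count is an assumption about how the first line is used.

```python
from collections import Counter


def encode_sentence(sentence, word_to_id):
    """Replace each whitespace-separated word by its vocabulary id."""
    # Words missing from the vocabulary raise KeyError instead of being guessed.
    return " ".join(str(word_to_id[word]) for word in sentence.split())


def build_sentence_file(pairs, source_ids, target_ids):
    """Render (source, target) sentence pairs as count/source/target line triples."""
    counts = Counter(pairs)  # identical pairs are merged; insertion order is kept
    lines = []
    for (source, target), occurrences in counts.items():
        lines.append(str(occurrences))                     # line 1: occurrence count
        lines.append(encode_sentence(source, source_ids))  # line 2: source ids
        lines.append(encode_sentence(target, target_ids))  # line 3: target ids
    return "\n".join(lines)
```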

    3. Dictionary File (Optional)
      • We can optionally provide a dictionary file for GIZA++. The format of the dictionary file is as follows:
5279 8007
3877 226
..........

The first column is the unique id of the target word, and the second column is the unique id of the source word.
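
Because the column order (target id first, then source id) is easy to get backwards, a small Python sketch may help. The function name and the example words are hypothetical, and the sorted output is an assumption used only to keep the result deterministic.

```python
def dictionary_lines(word_pairs, source_ids, target_ids):
    """Turn (source_word, target_word) pairs into 'target_id source_id' lines."""
    entries = set()
    for source_word, target_word in word_pairs:
        # Column order follows the description above: target id, then source id.
        entries.add((target_ids[target_word], source_ids[source_word]))
    # Assumption: sorting is not stated in the text above; it only makes the
    # output reproducible.
    return [f"{target_id} {source_id}" for target_id, source_id in sorted(entries)]
```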

Advanced Files

To use IBM Model 4 or the HMM model in GIZA++, we also need to generate word class files. We use mkcls to do this.

  • Run these commands:
giza-pp/mkcls -psource -Vsource.vcb.classes
giza-pp/mkcls -ptarget -Vtarget.vcb.classes

mkcls then produces four class files: source.vcb.classes, source.vcb.classes.cats, target.vcb.classes, and target.vcb.classes.cats. Make sure that the prefix of these generated files matches the name of the corresponding vocabulary file.
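
The naming rule can be checked with a short Python sketch. The helper names are hypothetical, and the sketch assumes the class files are expected next to the vocabulary file under the names `<vocabulary file>.classes` and `<vocabulary file>.classes.cats`.

```python
import os


def expected_class_files(vocabulary_path):
    """Return the class-file names that share the vocabulary file's prefix."""
    return [vocabulary_path + ".classes", vocabulary_path + ".classes.cats"]


def missing_class_files(vocabulary_path, exists=os.path.exists):
    """List expected class files that are not present.

    `exists` is injectable so the check can be tried without touching the disk.
    """
    return [path for path in expected_class_files(vocabulary_path) if not exists(path)]
```

Calling missing_class_files("source.vcb") and missing_class_files("target.vcb") before starting GIZA++ returns an empty list when all four files are in place.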

Processing Input File with GIZA++ Script

  • GIZA++ provides some tools for processing text files. In our case, we can use one of them to convert a plain-text corpus into the GIZA++ input format.
    • Use plain2snt.out
      1. Prepare plain text files for both the source and the target language. Each file holds one sentence per line, and the two sentences of a pair must be on the same line number in both files.
      2. Run the command
giza-pp/plain2snt.out source target

The tool then generates the vocabulary files and sentence files, named source.vcb, target.vcb, source-target.snt, and target-source.snt.
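
Since the two plain text files must line up sentence by sentence, a quick check before running plain2snt.out can catch misaligned input. The Python sketch below is a hypothetical helper, not part of GIZA++; treating an empty line as a problem is an assumption.

```python
def check_parallel(source_lines, target_lines):
    """Return a list of problems found in two line-aligned sentence lists."""
    problems = []
    if len(source_lines) != len(target_lines):
        problems.append(
            f"line count differs: {len(source_lines)} source vs {len(target_lines)} target"
        )
    for number, (source, target) in enumerate(zip(source_lines, target_lines), start=1):
        # Assumption: a sentence pair with an empty side is reported as suspicious.
        if not source.strip() or not target.strip():
            problems.append(f"line {number}: empty sentence on one side")
    return problems
```

The lists can be obtained by reading each file and calling splitlines() on its contents; an empty result means the files look aligned.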

Parameters

Use the following parameter settings when running GIZA++:

-c /home/verbs/student/wech5560/testMGIZA/giza/corpus/de-en-int-train.snt    (Location of the sentence file)
-d                                                                            (Location of the dictionary file)
-l 110-09-03.145823.wech5560.log                                              (Log file name)
-log 0                                                                        (Whether to write a log file; 0 for no, 1 for yes)
-m1 5                                                                         (Number of iterations for IBM Model 1)
-m2 0                                                                         (Number of iterations for IBM Model 2)
-m3 3                                                                         (Number of iterations for IBM Model 3)
-m4 3                                                                         (Number of iterations for IBM Model 4)
-m5 0                                                                         (Number of iterations for IBM Model 5)
-m5p0 -1                                                                      (Fixed value of the parameter p0 in IBM Model 5; a negative value means it is determined during training)
-m6 0                                                                         (Number of iterations for Model 6)
-mh 5                                                                         (Number of iterations for the HMM model)
-nbestalignments 0                                                            (Number of best alignments to output)
-o /home/verbs/student/wech5560/testMGIZA/giza/result/de-en                   (Output file location and prefix)
-s /home/verbs/student/wech5560/testMGIZA/giza/corpus/en.vcb                  (Source vocabulary file)
-t /home/verbs/student/wech5560/testMGIZA/giza/corpus/de.vcb                  (Target vocabulary file)
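
When the settings live in a script, the option list above can be assembled into an argument list. The Python sketch below is a hypothetical helper; it only prefixes each option name with a dash, uses the flags listed above, and skips options without a value (such as an unused -d).

```python
def build_command(binary, options):
    """Build an argument list such as ['giza-pp/GIZA++', '-s', 'en.vcb', ...]."""
    command = [binary]
    for name, value in options.items():
        if value is None or value == "":
            continue  # options without a value are left out
        command.extend([f"-{name}", str(value)])
    return command
```

The resulting list can be passed to subprocess.run, for example with binary set to giza-pp/GIZA++.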

Run

giza-pp/GIZA++ -s sourceVocabularyFile -t targetVocabularyFile -c sentenceFile

Alternatively, you can save all the parameters in a .gizacfg file. An example .gizacfg:

adbackoff 0
c unfactored/corpus/de-en-int-train.snt
compactadtable 1
compactalignmentformat 0
corpusfile unfactored/corpus/de-en-int-train.snt
countcutoff 1e-06
countcutoffal 1e-05
countincreasecutoff 1e-06
countincreasecutoffal 1e-05
d
deficientdistortionforemptyword 0
depm4 76
depm5 68
dictionary
dopeggingyn 0
emalignmentdependencies 2
emalsmooth 0.2
emprobforempty 0.4
emsmoothhmm 2
hmmdumpfrequency 0
hmmiterations 5
l 110-09-03.145823.wech5560.log
log 0
logfile 110-09-03.145823.wech5560.log
m1 5
m2 0
m3 3
m4 3
m5 0
m5p0 -1
m6 0
manlexfactor1 0
manlexfactor2 0
manlexmaxmultiplicity 20
maxfertility 10
maxsentencelength 101
mh 5
mincountincrease 1e-07
ml 101
model1dumpfrequency 1
model1iterations 5
model23smoothfactor 0
model2dumpfrequency 0
model2iterations 0
model345dumpfrequency 0
model3dumpfrequency 0
model3iterations 3
model4iterations 3
model4smoothfactor 0.4
model5iterations 0
model5smoothfactor 0.1
model6iterations 0
nbestalignments 0
nodumps 1
nofiledumpsyn 1
noiterationsmodel1 5
noiterationsmodel2 0
noiterationsmodel3 3
noiterationsmodel4 3
noiterationsmodel5 0 
noiterationsmodel6 0
nsmooth 4
nsmoothgeneral 0
numberofiterationsforhmmalignmentmodel 5
o unfactored/giza.de-en/de-en
onlyaldumps 1
outputfileprefix unfactored/giza.de-en/de-en
outputpath
p 0
p0 0.999
peggedcutoff 0.03
pegging 0
probcutoff 1e-07
probsmooth 1e-07
readtableprefix
s unfactored/corpus/en.vcb
sourcevocabularyfile unfactored/corpus/en.vcb
t unfactored/corpus/de.vcb
t1 1
t2 0
t2to3 0
t3 0
t345 0
targetvocabularyfile unfactored/corpus/de.vcb
tc
testcorpusfile
th 0
transferdumpfrequency 0
v 0
verbose 0
verbosesentence -10

We recommend storing the .gizacfg file so that you can reuse it next time without typing any other parameters.

To run GIZA++ with a .gizacfg file, use this command:

giza-pp/GIZA++ configure.gizacfg
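
The .gizacfg example above lists one setting per line, with the name first and an optional value after it, and several settings appear under both a short and a long name. A minimal Python sketch for reading such a file and spotting disagreeing names is shown below. The function names are hypothetical, and the alias pairs are an assumption inferred from names that carry identical values in the example.

```python
# Assumption: these short/long names refer to the same setting, inferred from
# the example above where each pair carries the same value.
ALIAS_PAIRS = [
    ("c", "corpusfile"),
    ("s", "sourcevocabularyfile"),
    ("t", "targetvocabularyfile"),
    ("o", "outputfileprefix"),
    ("l", "logfile"),
]


def parse_gizacfg(text):
    """Parse 'name value' lines into a dict; a name without a value maps to ''."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def conflicting_aliases(settings):
    """Return the alias pairs whose two names hold different values."""
    return [
        (short, long_name)
        for short, long_name in ALIAS_PAIRS
        if short in settings and long_name in settings
        and settings[short] != settings[long_name]
    ]
```

Running conflicting_aliases on a parsed file before reuse is a cheap way to notice a path that was updated under one name but not the other.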

Path Information on Verbs

  • GIZA++
    • /home/verbs/shared/stages/tools/giza-pp

Reference

http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html

http://www.mail-archive.com/moses-support@mit.edu/msg01143.html