GIZA++
CompSemUser (talk | contribs) |
Latest revision as of 12:34, 27 October 2016
Install
For more information about how to install GIZA++, please refer to Install GIZA++
Information
- Directory on verbs: /home/verbs/shared/stages/tools/giza-pp
Input Format
Basic Files
- To run GIZA++, we need to provide at least three input files:
- Vocabulary Files
  - GIZA++ needs one vocabulary file for the source language and one for the target language. The format of a vocabulary file is as follows:

 1 UNK 0
 2 the 58947
 3 , 52221
 4 . 48789
 5 of 30427
 6 to 26185
 ..........

The first column is a unique ID number. The second column is the word itself. The last column is the number of times the word appears in the corpus. The columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
- Sentences File
  - The sentences file encodes each sentence using the unique IDs from the vocabulary files. Here is an example of a sentences file:

 1
 8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
 5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
 1
 45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
 4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
 1
 1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
 4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
 1
 19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
 35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
 ..........

Each sentence pair takes three lines. The first line is the number of times this sentence pair occurs in the corpus. The second line is the source sentence, with each word replaced by its unique ID. The third line is the target sentence, encoded in the same way.
- Dictionary File (Optional)
  - We can also provide a dictionary file for GIZA++. The format of the dictionary file is as follows:

 5279 8007
 3877 226
 ..........

The first column is the unique ID of a target word, and the second column is the unique ID of the corresponding source word.
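As a rough illustration of how the three file formats fit together, the following Python sketch parses vocabulary, sentences, and dictionary data from in-memory lines. The function names (`parse_vcb`, `parse_snt`, `parse_dictionary`) are hypothetical helpers and not part of GIZA++. The sketch splits on any whitespace, so it accepts the tab-separated layout described above, and it assumes that no sentence in the sentences file is empty.

```python
from typing import Dict, Iterable, List, Tuple


def parse_vcb(lines: Iterable[str]) -> Dict[int, Tuple[str, int]]:
    """Map each word ID to (word, frequency) from vocabulary-file lines."""
    vocab = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word_id, word, freq = fields
        vocab[int(word_id)] = (word, int(freq))
    return vocab


def parse_snt(lines: Iterable[str]) -> List[Tuple[int, List[int], List[int]]]:
    """Group sentences-file lines into (count, source_ids, target_ids)."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    pairs = []
    for i in range(0, len(rows), 3):
        count = int(rows[i][0])
        source_ids = [int(x) for x in rows[i + 1]]
        target_ids = [int(x) for x in rows[i + 2]]
        pairs.append((count, source_ids, target_ids))
    return pairs


def parse_dictionary(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (target_id, source_id) pairs, in the column order used above."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            entries.append((int(fields[0]), int(fields[1])))
    return entries
```

To read real files, pass an open file object to any of these functions, for example `parse_vcb(open("source.vcb", encoding="utf-8"))`.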
Advanced Files
To use IBM Model 4 or the HMM model in GIZA++, we need to generate additional word class files. We use mkcls for this.
- Type these commands:

 giza-pp/mkcls -psource -Vsource.vcb.classes
 giza-pp/mkcls -ptarget -Vtarget.vcb.classes

mkcls then produces four class files: source.vcb.classes, source.vcb.classes.cats, target.vcb.classes, and target.vcb.classes.cats. Make sure that the prefix of these generated files matches the name of the corresponding vocabulary file.
Processing Input File with GIZA++ Script
- GIZA++ provides some tools for processing text files. In our case, we can use one of them to convert a plain-text corpus into GIZA++ input files.
  - Use plain2snt.out
    - Prepare plain text files for both the source and the target language. In each file, put one sentence per line. The two sentences of a sentence pair must be on the same line number in the source and target files.
    - Type the command:

 giza-pp/plain2snt.out source target

The tool then generates the vocabulary files and sentence files, named source.vcb, target.vcb, source-target.snt, and target-source.snt.
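To make the conversion concrete, here is a minimal Python sketch of the kind of transformation described above: it turns two line-aligned lists of sentences into vocabulary lines and sentences-file lines. It is an illustration only, not the plain2snt.out implementation. It assumes that IDs are assigned in order of first appearance starting at 2 (with ID 1 reserved for UNK, as in the vocabulary example earlier), that words are separated by whitespace, and that every sentence pair is written with a count of 1. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocab(sentences: List[str]) -> Tuple[Dict[str, int], Counter]:
    """Assign IDs from 2 upward in order of first appearance and count words."""
    ids: Dict[str, int] = {}
    counts: Counter = Counter()
    for sentence in sentences:
        for word in sentence.split():
            counts[word] += 1
            if word not in ids:
                ids[word] = len(ids) + 2  # ID 1 is kept for UNK
    return ids, counts


def format_vcb(ids: Dict[str, int], counts: Counter) -> List[str]:
    """Produce tab-separated 'id word frequency' lines, UNK first."""
    lines = ["1\tUNK\t0"]
    for word, word_id in sorted(ids.items(), key=lambda item: item[1]):
        lines.append(f"{word_id}\t{word}\t{counts[word]}")
    return lines


def format_snt(source: List[str], target: List[str],
               source_ids: Dict[str, int],
               target_ids: Dict[str, int]) -> List[str]:
    """Produce three lines per sentence pair: count, source IDs, target IDs."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of lines")
    lines = []
    for src_sentence, tgt_sentence in zip(source, target):
        lines.append("1")  # assumption: each pair is written once
        lines.append(" ".join(str(source_ids[w]) for w in src_sentence.split()))
        lines.append(" ".join(str(target_ids[w]) for w in tgt_sentence.split()))
    return lines
```

Checking that both inputs have the same number of lines before encoding catches the most common preparation mistake, a misaligned corpus.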
Parameters
Use the following parameter settings when running GIZA++:

 -c /home/verbs/student/wech5560/testMGIZA/giza/corpus/de-en-int-train.snt (Location of the sentences file)
 -d (Location of the dictionary file)
 -l 110-09-03.145823.wech5560.log (Log file name)
 -log 0 (Whether to write a log file; 0 for no, 1 for yes)
 -m1 5 (Number of iterations for IBM Model 1)
 -m2 0 (Number of iterations for IBM Model 2)
 -m3 3 (Number of iterations for IBM Model 3)
 -m4 3 (Number of iterations for IBM Model 4)
 -m5 0 (Number of iterations for IBM Model 5)
 -m5p0 -1 (Fixed value for the parameter p_0 in IBM Model 5; a negative value means it is determined in training)
 -m6 0 (Number of iterations for Model 6)
 -mh 5 (Number of iterations for the HMM model)
 -nbestalignments 0 (Number of best alignments to output)
 -o /home/verbs/student/wech5560/testMGIZA/giza/result/de-en (Output file location and prefix)
 -s /home/verbs/student/wech5560/testMGIZA/giza/corpus/en.vcb (Source vocabulary file)
 -t /home/verbs/student/wech5560/testMGIZA/giza/corpus/de.vcb (Target vocabulary file)
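Typing this many options by hand is error-prone, so it can help to assemble the command line from a table of settings. The Python sketch below shows one way to do that. The helper `build_giza_command` is hypothetical; it only builds an argument list in the `-name value` form shown above and does not run GIZA++. Settings with an empty value (such as `-d` when no dictionary is used) are skipped, which is an assumption of this sketch.

```python
import shlex
from typing import Dict, List, Union


def build_giza_command(executable: str,
                       params: Dict[str, Union[str, int, float]]) -> List[str]:
    """Turn {'m1': 5, ...} into ['GIZA++', '-m1', '5', ...]."""
    argv = [executable]
    for name, value in params.items():
        if value == "":
            continue  # assumption: omit options that have no value
        argv.append(f"-{name}")
        argv.append(str(value))
    return argv


command = build_giza_command(
    "giza-pp/GIZA++",
    {
        "s": "source.vcb",
        "t": "target.vcb",
        "c": "source-target.snt",
        "d": "",
        "m1": 5,
        "mh": 5,
    },
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` if you prefer to launch GIZA++ from Python.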
Run
 giza-pp/GIZA++ -s sourceVocabularyFile -t targetVocabularyFile -c sentenceFile

Alternatively, you can save all the parameters in a .gizacfg file. Here is an example .gizacfg file:

 adbackoff 0
 c unfactored/corpus/de-en-int-train.snt
 compactadtable 1
 compactalignmentformat 0
 corpusfile unfactored/corpus/de-en-int-train.snt
 countcutoff 1e-06
 countcutoffal 1e-05
 countincreasecutoff 1e-06
 countincreasecutoffal 1e-05
 d
 deficientdistortionforemptyword 0
 depm4 76
 depm5 68
 dictionary
 dopeggingyn 0
 emalignmentdependencies 2
 emalsmooth 0.2
 emprobforempty 0.4
 emsmoothhmm 2
 hmmdumpfrequency 0
 hmmiterations 5
 l 110-09-03.145823.wech5560.log
 log 0
 logfile 110-09-03.145823.wech5560.log
 m1 5
 m2 0
 m3 3
 m4 3
 m5 0
 m5p0 -1
 m6 0
 manlexfactor1 0
 manlexfactor2 0
 manlexmaxmultiplicity 20
 maxfertility 10
 maxsentencelength 101
 mh 5
 mincountincrease 1e-07
 ml 101
 model1dumpfrequency 1
 model1iterations 5
 model23smoothfactor 0
 model2dumpfrequency 0
 model2iterations 0
 model345dumpfrequency 0
 model3dumpfrequency 0
 model3iterations 3
 model4iterations 3
 model4smoothfactor 0.4
 model5iterations 0
 model5smoothfactor 0.1
 model6iterations 0
 nbestalignments 0
 nodumps 1
 nofiledumpsyn 1
 noiterationsmodel1 5
 noiterationsmodel2 0
 noiterationsmodel3 3
 noiterationsmodel4 3
 noiterationsmodel5 0
 noiterationsmodel6 0
 nsmooth 4
 nsmoothgeneral 0
 numberofiterationsforhmmalignmentmodel 5
 o unfactored/giza.de-en/de-en
 onlyaldumps 1
 outputfileprefix unfactored/giza.de-en/de-en
 outputpath
 p 0
 p0 0.999
 peggedcutoff 0.03
 pegging 0
 probcutoff 1e-07
 probsmooth 1e-07
 readtableprefix
 s unfactored/corpus/en.vcb
 sourcevocabularyfile unfactored/corpus/en.vcb
 t unfactored/corpus/de.vcb
 t1 1
 t2 0
 t2to3 0
 t3 0
 t345 0
 targetvocabularyfile unfactored/corpus/de.vcb
 tc
 testcorpusfile
 th 0
 transferdumpfrequency 0
 v 0
 verbose 0
 verbosesentence -10

We recommend keeping the .gizacfg file so that you can reuse it next time without typing any other parameters.

To run GIZA++ with a .gizacfg file, use this command:

 giza-pp/GIZA++ configure.gizacfg
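Before reusing a saved configuration, it can be useful to inspect it programmatically. The following Python sketch reads the `name value` lines of a .gizacfg file into a dictionary, treating a name with nothing after it (such as `d` or `outputpath` in the example) as an empty setting. It also checks that settings which appear under more than one name in the example (for instance `m1`, `model1iterations`, and `noiterationsmodel1`) carry the same value. Both helpers are hypothetical, and the alias groups are simply read off the example above rather than taken from GIZA++ documentation.

```python
from typing import Dict, Iterable, List, Tuple


def parse_gizacfg(lines: Iterable[str]) -> Dict[str, str]:
    """Map each setting name to its value ('' when no value is given)."""
    settings = {}
    for line in lines:
        parts = line.split(None, 1)  # split name from the rest of the line
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[name] = value
    return settings


def find_alias_conflicts(settings: Dict[str, str],
                         alias_groups: List[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Return the alias groups whose present members disagree in value."""
    conflicts = []
    for group in alias_groups:
        values = {settings[name] for name in group if name in settings}
        if len(values) > 1:
            conflicts.append(group)
    return conflicts


# Alias groups as they appear in the example configuration (an assumption).
EXAMPLE_ALIASES = [
    ("c", "corpusfile"),
    ("l", "logfile"),
    ("o", "outputfileprefix"),
    ("m1", "model1iterations", "noiterationsmodel1"),
    ("mh", "hmmiterations", "numberofiterationsforhmmalignmentmodel"),
]
```

For a real file, call `parse_gizacfg(open("configure.gizacfg", encoding="utf-8"))` and pass the result to `find_alias_conflicts` together with `EXAMPLE_ALIASES`.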
Path Information on Verbs
- GIZA++
- /home/verbs/shared/stages/tools/giza-pp
Reference
http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html
http://www.mail-archive.com/moses-support@mit.edu/msg01143.html