GIZA++
From CompSemWiki
Revision as of 11:50, 10 September 2010
Install
For more information about how to install GIZA++, please refer to Install GIZA++
Information
- Directory on verbs: /home/verbs/share/stage/tools/gizapp
Input Format
Basic File Format
- To run GIZA++, we need to provide 3 or more input files. These files are:
- Vocabulary Files
- GIZA++ needs one vocabulary file for the source language and one for the target language. The format of a vocabulary file is as follows:
 1 UNK 0
 2 the 58947
 3 , 52221
 4 . 48789
 5 of 30427
 6 to 26185
 ..........
- The first column is a unique id number, the second column is the word, and the last column is the frequency of that word in the corpus. The columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
- Sentences File
- The sentences file encodes each sentence using the unique ids from the vocabulary files. Here is an example of the sentences file:
 1
 8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
 5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
 1
 45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
 4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
 1
 1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
 4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
 1
 19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
 35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
- Each sentence pair takes three lines: the number of times the pair occurs in the corpus, the source sentence as ids from source.vcb, and the target sentence as ids from target.vcb.
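The two file formats above can be produced from a tokenized parallel corpus with a short script. The following is a minimal Python sketch, not the tooling used on verbs. The function names (build_vocab, format_vcb, format_snt) are hypothetical, and assigning ids in order of descending frequency is an assumption based on how the sample vocabulary looks; this page does not state the rule. Words missing from a vocabulary are mapped to id 1 (UNK).

```python
from collections import Counter


def build_vocab(sentences):
    """Map each word to an id (hypothetical helper).

    Id 1 is kept for UNK; real words start at 2 and are numbered by
    descending frequency (an assumption), with ties broken alphabetically.
    """
    counts = Counter(word for sent in sentences for word in sent.split())
    word_to_id = {"UNK": 1}
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for next_id, (word, _) in enumerate(ordered, start=2):
        word_to_id[word] = next_id
    return word_to_id, counts


def format_vcb(word_to_id, counts):
    """Render a vocabulary as tab-separated lines: id, word, frequency."""
    rows = sorted(word_to_id.items(), key=lambda kv: kv[1])
    lines = [f"{idx}\t{word}\t{counts.get(word, 0)}" for word, idx in rows]
    return "\n".join(lines) + "\n"


def format_snt(pairs, src_ids, tgt_ids):
    """Render sentence pairs as three lines each: count, source ids, target ids."""
    pair_counts = Counter(pairs)  # identical pairs are merged and counted
    out = []
    for (src, tgt), n in pair_counts.items():
        out.append(str(n))
        out.append(" ".join(str(src_ids.get(w, 1)) for w in src.split()))
        out.append(" ".join(str(tgt_ids.get(w, 1)) for w in tgt.split()))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Toy corpus, for illustration only.
    source = ["the house", "the book"]
    target = ["das haus", "das buch"]
    src_ids, src_counts = build_vocab(source)
    tgt_ids, tgt_counts = build_vocab(target)
    print(format_vcb(src_ids, src_counts), end="")
    print(format_snt(list(zip(source, target)), src_ids, tgt_ids), end="")
```

Writing the output of format_vcb to source.vcb and target.vcb, and the output of format_snt to a sentences file, gives the three input files described above.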
Parameter
Run
Reference
http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html