Difference between revisions of "GIZA++"

From CompSemWiki

Revision as of 11:50, 10 September 2010

Install

For more information about how to install GIZA++, please refer to Install GIZA++

Information

  • Directory on verbs: /home/verbs/share/stage/tools/gizapp

Input Format

Basic File Format

  • GIZA++ requires at least three input files:
    1. Vocabulary Files
      • GIZA++ needs one vocabulary file for the source language and one for the target language. Each file has the following format:
1	UNK	0
2	the	58947
3	,	52221
4	.	48789
5	of	30427
6	to	26185
..........
      • The first column is a unique ID number, the second is the word, and the third is the number of times that word appears in the corpus. The columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
    2. Sentences File
      • The sentences file represents each sentence as a sequence of the word IDs from the vocabulary files. Each sentence pair occupies three lines: the number of times the pair occurs, the source sentence, and the target sentence. Here is an example of the sentences file:
1
8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
1
45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
1
1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
1
19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
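
The two formats above can be produced from a tokenized parallel corpus with a short script. The following Python is only a minimal sketch, not the tool shipped with GIZA++. The function names, the toy corpus, and the choice to number words by descending frequency starting at 2 are assumptions modeled on the sample listing. It also assumes every sentence pair occurs once, so the count line is always 1.

```python
from collections import Counter


def build_vocab(sentences):
    """Map each word to (id, frequency); id 1 is kept for UNK as in the sample."""
    counts = Counter(word for sent in sentences for word in sent.split())
    vocab = {"UNK": (1, 0)}
    # Assumption: most frequent words get the lowest ids, ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for word_id, (word, freq) in enumerate(ordered, start=2):
        vocab[word] = (word_id, freq)
    return vocab


def vocab_lines(vocab):
    """Format the vocabulary as tab-separated 'id<TAB>word<TAB>frequency' lines."""
    rows = sorted(vocab.items(), key=lambda kv: kv[1][0])
    return [f"{word_id}\t{word}\t{freq}" for word, (word_id, freq) in rows]


def encode(sentence, vocab):
    """Replace each word with its id; unknown words fall back to the UNK id."""
    unk_id = vocab["UNK"][0]
    return " ".join(str(vocab.get(w, (unk_id, 0))[0]) for w in sentence.split())


def sentence_lines(pairs, src_vocab, tgt_vocab):
    """Emit three lines per sentence pair: count, source ids, target ids."""
    lines = []
    for src, tgt in pairs:
        lines.append("1")  # assumption: each pair occurs exactly once
        lines.append(encode(src, src_vocab))
        lines.append(encode(tgt, tgt_vocab))
    return lines


# Hypothetical toy corpus, already tokenized and whitespace-separated.
pairs = [("the house .", "das Haus ."), ("the car .", "das Auto .")]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)

source_vcb = "\n".join(vocab_lines(src_vocab))
target_vcb = "\n".join(vocab_lines(tgt_vocab))
sentences_file = "\n".join(sentence_lines(pairs, src_vocab, tgt_vocab))
```

The three strings can then be written to source.vcb, target.vcb, and a sentences file of your choosing.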


Parameter

Run

Reference

http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html