GIZA++

From CompSemWiki

Install

For more information about how to install GIZA++, please refer to Install GIZA++

Information

  • Directory on verbs: /home/verbs/share/stage/tools/gizapp

Input Format

Basic Files

  • To run GIZA++, we need at least three input files: two vocabulary files, one sentences file, and optionally a dictionary file. They are described below:
    1. Vocabulary Files
      • GIZA++ needs a vocabulary file for both the source and the target language. The format of a vocabulary file is as follows:
1	UNK	0
2	the	58947
3	,	52221
4	.	48789
5	of	30427
6	to	26185
..........

The first column is a unique ID number. The second column is the word. The last column is the frequency of that word in the corpus. All columns are separated by tabs. We name the source and target vocabulary files source.vcb and target.vcb.
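As an illustration of the format, here is a minimal sketch that builds vocabulary lines from a tokenized corpus. The function name build_vocab_lines is hypothetical, and reserving ID 1 for UNK and ordering the remaining IDs by descending frequency are assumptions read off the example above, not something GIZA++ is known to require.

```python
from collections import Counter

def build_vocab_lines(sentences):
    """Return .vcb lines (id<TAB>word<TAB>frequency) for whitespace-tokenized sentences."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # ID 1 is given to UNK with frequency 0, mirroring the example above.
    lines = ["1\tUNK\t0"]
    # Assumption: remaining IDs follow descending frequency, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for word_id, (word, freq) in enumerate(ranked, start=2):
        lines.append(f"{word_id}\t{word}\t{freq}")
    return lines

corpus = ["the cat sat on the mat .", "the dog sat ."]
vocab_lines = build_vocab_lines(corpus)
print("\n".join(vocab_lines))
```

Writing the returned lines to source.vcb (one per line) would give a file in the layout described above.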

    2. Sentences File
      • The sentences file encodes each sentence using the unique IDs from the vocabulary files. Here is an example of a sentences file:
1
8007 36 577 226 25 97 6 1987 106 2 2560 7 315 4
5729 39 539 29 12 3877 87 2469 6 4323 1273 568 3
1
45 2 5090 10 152 534 18 2 625 58 2277 107 2 59 222 1893 54 230 5 31 144 12 2 195 4
4 2358 10 162 326 40 12109 68 4 5456 104 55 4 2252 5 145 985 2031 625 118 264 15 4 319 3
1
1519 11 16680 3 7 2 6570 5 31 112 3 307 6 16 2753 8 9 424 1718 4
4 17439 4689 6 5 1924 126 2138 194 23 18502 1962 6493 41 3
1
19 153 10 63 346 36 423 99 15 10 2 880 983 112 8 2 43 4
35 381 10 14 508 39 4 428 138 16 10 4 17734 24079 5 76 3
..........

Each sentence pair takes three lines. The first line is the number of times this sentence pair occurs in the corpus. The second line is the source sentence, with each word replaced by its unique ID. The third line is the target sentence, encoded the same way.
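The three-line layout can be sketched as follows. The helper names encode and build_snt_lines, the toy ID tables, and the choice to map out-of-vocabulary words to ID 1 are all assumptions made for illustration.

```python
from collections import Counter

def encode(sentence, word_to_id, unk_id=1):
    """Replace each whitespace-separated word by its ID (unknown words get unk_id)."""
    return " ".join(str(word_to_id.get(word, unk_id)) for word in sentence.split())

def build_snt_lines(pairs, src_ids, tgt_ids):
    """Return sentences-file lines: count, encoded source, encoded target."""
    # Identical encoded pairs are merged and their occurrences counted.
    occurrences = Counter((encode(s, src_ids), encode(t, tgt_ids)) for s, t in pairs)
    lines = []
    for (src, tgt), count in occurrences.items():
        lines.extend([str(count), src, tgt])
    return lines

# Toy vocabularies; real IDs would come from source.vcb and target.vcb.
src_ids = {"the": 2, "house": 3}
tgt_ids = {"das": 2, "haus": 3}
pairs = [("the house", "das haus"), ("the house", "das haus"), ("house", "haus")]
snt_lines = build_snt_lines(pairs, src_ids, tgt_ids)
print("\n".join(snt_lines))
```

Here the repeated pair is written once with a count of 2, which is one way to read the occurrence line; writing every pair separately with a count of 1, as in the example above, is equally consistent with the format.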

    3. Dictionary File (Optional)
      • We can provide a dictionary file to GIZA++. The format of the dictionary file is as follows:
5279 8007
3877 226
..........

The first column is the unique ID of a target word, and the second column is the unique ID of the corresponding source word.
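A small sketch of producing dictionary lines in this column order. The function build_dictionary_lines and the toy ID tables are hypothetical, and silently skipping entries whose words are missing from either vocabulary is a design assumption, not documented GIZA++ behavior.

```python
def build_dictionary_lines(entries, src_ids, tgt_ids):
    """Return 'target_id source_id' lines for (source_word, target_word) entries."""
    lines = []
    for src_word, tgt_word in entries:
        # Assumption: entries with a word missing from either vocabulary are skipped.
        if src_word in src_ids and tgt_word in tgt_ids:
            lines.append(f"{tgt_ids[tgt_word]} {src_ids[src_word]}")
    return lines

# Toy vocabularies; real IDs would come from source.vcb and target.vcb.
src_ids = {"house": 3}
tgt_ids = {"haus": 7}
entries = [("house", "haus"), ("cat", "katze")]
dict_lines = build_dictionary_lines(entries, src_ids, tgt_ids)
print("\n".join(dict_lines))
```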

Advanced Files

To use IBM Model 4 or the HMM model in GIZA++, we also need to generate word class files. We use mkcls to do this.

  • Type these commands:
mkcls -psource -Vsource.vcb.classes
mkcls -ptarget -Vtarget.vcb.classes

mkcls then produces four class files: source.vcb.classes, source.vcb.classes.cats, target.vcb.classes, and target.vcb.classes.cats. Make sure that the prefix of these generated files is the same as the name of the corresponding vocabulary file.
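The naming convention can be made explicit with a short sketch that only builds the command lines and the expected output names; it does not run mkcls. The helper names are hypothetical, only the -p and -V options shown above are used, and the assumption is that each run writes the -V file plus a companion file with .cats appended, as the source-language file names indicate.

```python
def mkcls_args(corpus_name):
    """Build the mkcls argument list for one language, using only -p and -V."""
    return ["mkcls", f"-p{corpus_name}", f"-V{corpus_name}.vcb.classes"]

def expected_class_files(corpus_name):
    """Names of the files assumed to be written for one language."""
    classes = f"{corpus_name}.vcb.classes"
    return [classes, classes + ".cats"]

for name in ("source", "target"):
    print(" ".join(mkcls_args(name)), "->", ", ".join(expected_class_files(name)))
```

The argument lists could be passed to subprocess.run once mkcls is on the PATH, and the expected names checked with os.path.exists afterwards.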

Processing Input File with GIZA++ Script

  • GIZA++ provides some scripts for processing text files. In our case, we can use one of them to convert the corpus into the GIZA++ input format.
    • Use plain2snt.out
      1. Prepare the plain text files for both the source and target languages. In each file, each sentence is on its own line, and the two sentences of a source-target pair must be on the same line number in both files.
      2. Type the command:
plain2snt.out source target

This generates the vocabulary files and sentences files, named source.vcb, target.vcb, source-target.snt, and target-source.snt.
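Because the two plain text files must line up sentence by sentence, it can help to check them before running plain2snt.out. The following is a minimal sketch; the function check_parallel is hypothetical, and treating empty lines as errors is an assumption rather than a documented requirement.

```python
def check_parallel(src_lines, tgt_lines):
    """Return the number of sentence pairs, or raise ValueError if the files do not align."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # Assumption: an empty sentence on either side is treated as an error.
        if not src.strip() or not tgt.strip():
            raise ValueError(f"empty sentence on line {number}")
    return len(src_lines)

pair_count = check_parallel(["the house", "a cat"], ["das haus", "eine katze"])
print(pair_count)
```

For real files, the line lists can be read with open("source", encoding="utf-8").read().splitlines() and likewise for target.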

Parameter

Run

Reference

http://kwang.blogdns.com/research/how-to-compile-install-run-giza.html