GMA HOWTO ####################################################################### #COPYRIGHT (C) 1996-2004 by I. Dan Melamed ####################################################################### GMA stands for Geometric Mapping and Alignment, and it includes the Smooth Injective Map Recognizer (SIMR) algorithm for mapping bitext correspondence, as well as the Geometric Segment Alignment (GSA) post-processor for converting general bitext maps to monotonic segment alignments. To get the most benefit from this software, you are encouraged to read the relevant publications, most of which are downloadable from Dan's home page and referenced at the end of this HOWTO file. CONTENTS 0. how to contribute I. how to install GMA II. how to run GMA III. how to interpret GMA's output IV. more information on config files V. how to choose a matching predicate for your language pair VI. how to create bitext axes for GMA VII. why you must generate end-of-segment markers VIII. how to reoptimize SIMR's numerical parameters IX. troubleshooting X. how to submit bug reports XI. references ======================================================================== 0. GMA is open source. You can change it to your liking. If you make significant improvements from which others might benefit, then of course you should share. Patches, modules, language-specific resources, and additional documentation are all welcome. Please submit your contribution to the bugzilla GMA project at http://nlp.cs.nyu.edu/bugzilla. We will try to include it in the next release, giving you full credit, and everlasting glory shall be yours. I. how to install GMA 1. Download the package from the website. 2. At this point, you might want to confirm that your GMA package is genuine by checking the md5sum. See the FAQ for instructions. 3. Linux: Run gunzip and tar -xvf on the package (or just tar -xzvf if your tar supports that switch) Windows: Run any util that can unzip the .zip package. 4. This will create a directory called gma-[version], cd into it. 5. Set the environment variable $GMApath to the directory where GMA now lives. You will need to pass this environment variable to GMA with -DGMApath=$GMApath or GMA will not be able to see it. It is much easier to define an alias that passes this to java by default. In my .bashrc i have the following definitions: export GMApath=/home/argyle/gma-2.0 alias java='java -DGMApath=$GMApath' export JAVA_HOME_BIN=/usr/java/bin/ export JAVA_HOME=/usr/java/j2sdk1.4.2_04 6. You will need to have Apache Ant installed on your machine. Your installation of Ant should contain both the core tasks and optional tasks. You might have to add ant's /bin directory to your PATH in order to avoid typing the complete path each time. In addition you will need to setup several environment variables. In particular the environment variable ANT_HOME should point to the install dir of your ant (on linux it is /usr/share/ant): export ANT_HOME=/usr/share/ant 7. Copy the junit.jar file from [GMA_HOME]/lib, where [GMA_HOME] is the directory you have unjarred the distribution program, to [ANT_HOME]/lib directory. In your environment variables make sure the gma.jar gets put in your CLASSPATH: export CLASSPATH=lib/gma.jar 8. Under the directory [GMA_HOME], type "ant". This will install the program by building the source code and conducting unit tests. Please be patient, this may take up to 5 minutes. When it is done you should get a message that says BUILD SUCCESSFUL. 9. Create a config file, possibly starting with one of the samples in the config/ subdirectory (see below for more info). Make sure the path names to any #INCLUDE'd config files are correct and either explicit or are in reference to the directory you have chosen for GMA. You will probably also need to update the paths to the various resource files in your SIMR.config.* file(s). Grep for "" in the sample config files to identify where the path might need to be changed. 10. Once you have installed the program, you are able to use it. For this howto we will run GMA directly with java. Keep in mind that the default behavior is to run both SIMR and GSA, but they can be run separately. As an alternative you can run the batch file in the bin directory or the shell script depending on your environment. Note: /tmp will be used for temporary files and unspecified simr output files. ======================================================================== II. how to run GMA The GMA program takes between two and ten command-line arguments: Usage: java gma.GMA [arguments] where [arguments] are: -properties properties required argument; e.g., -properties ./GMA.properties -xAxisFile xAxisFile optional argument; e.g., -xAxisFile ./french.txt -yAxisFile yAxisFile optional argument; e.g., -yAxisFile ./english.txt -simr.outputFile simr.outputFile optional argument; e.g., -simr.outputFile ./simrOutput.txt -gsa.outputFile gsa.outputFile optional argument; e.g., -gsa.outputFile ./gsaOutput.txt Description of each of the arguments: (1 & 2) the name of your config file. (3 & 4) xAxisFile and (5 & 6) yAxisFile the names of the two files that contain the two halves of your bitext (Make sure they're in the right order!) These can be either text files or pre-generated axis files. These can also be set in the config files instead of at the command line, but must be in one of the two places, not both. (7 & 8) simr.outputFile the name of the file where you want the SIMR output stored. (9 & 10) gsa.outputFile the name of the file where you want the GSA output stored. The xAxisFile will be passed as the first language , the yAxisFile as the second. The order you give your axis files in should correspond to the order of your dictionary entries. For example, if you are in the directory where GMA is installed, and you have a French/English bitext in files E.text and F.text, you might run GMA as follows: java gma.GMA -properties config/GMA.config.default -xAxisFile E.text / -yAxisFile F.text -simr.outputFile E.F.simr -gsa.outputFile E.F.align If you do not specify a simr output file the map will be written to /tmp/temp.map If you do not specify a gsa output file the result will go to std err. Note to users familiar with older GMA versions: Remember that in previous versions of GMA special steps were required to save the resulting bitext map produced by SIMR you can now do it straight from the command line. WARNING: When using large translation lexicons you may find that java runs out of memory. You can fix this by allocating more memory to the stack and the heap with the switches -Xms128m -Xmx512m. Adding these to the example above gives: EXAMPLE (with resources that come with the package): java -Xms128m -Xmx512m gma.GMA -properties config/GMA.config.F.E -xAxisFile validation/french-test1.axis -yAxisFile validation/english-test1.axis -simr.outputFile F.E.simr -gsa.outputFile F.E.align ======================================================================== III. how to interpret GMA's output GMA outputs aligned blocks, one per line. The two sides of each aligned block are separated by the five ASCII characters " <=> ". When multiple segments are involved on one side of an aligned block, they are separated by commas. When one side of an aligned block is empty, it is represented by the word "omitted." E.g., suppose you try to align a 6-line text A and a 5-line text B, and GMA outputs 1 <=> 1 2 <=> 2,3 3,4,5 <=> omitted 6 <=> 4 omitted <=> 5 This means that segment A1 is aligned with segment B1; segment A2 is aligned with segments B2 and B3; segments A3, A4, and A5 are not aligned with anything; segment A6 is aligned with segment B4; and segment B5 is not aligned with anything. ======================================================================== IV. more information on config files The config files contain some documentation on each parameter, in the hope that these files will be self-explanatory. More realistically, this part of the HOWTO will likely be expanded in the future. The one thing that needs immediate clarification is that GMA's config files can be nested (and the default ones are). There is one top-level file called GMA.config.default. This file uses #INCLUDE statements to make the config file reader load two other config files, one specific to SIMR, the other specific to the GSA post-processor. Parameters that are common to both processes, such as the end-of-segment marker, are set in the top-level config file. ======================================================================== V. how to choose a matching predicate for your language pair A matching predicate is a heuristic for deciding whether two given tokens might be mutual translations. Different matching predicates are appropriate for different language pairs. The matching predicate is one of the required parameters in SIMR's config file. The matchingPredicate parameter must be set to one of the matching predicates that are available in the current implementation. These are: (1) gma.simr.ExactMatching the two tokens are identical (2) gma.simr.LcsrMatching the two tokens exceed the given LCSR threshold (3) gma.simr.DictMatching the two tokens appear as an entry in the supplied translation lexicon (4) gma.simr.DictExactMatching disjunction of (1) and (3) (3) gma.simr.LscrLexMatching disjunction of (2) and (3) Note that (1) is an extreme case of (2) As the available predicates suggest, the two kinds of information that a matching predicate can rely on most often are cognates and translation lexicons. Two tokens in a bitext are orthographic cognates if they have the same meaning and similar spellings. Similarity of spelling can be measured in more or less complicated ways. The matching predicates in SIMR's current implementation threshold the Longest Common Subsequence Ratio (LCSR). The LCSR of two tokens is the ratio of the length of their longest (not necessarily contiguous) common subsequence (LCS) and the length of the longer token. For example, "gouvernement," which is 12 characters long, has 10 characters that appear in the same order in "government." So, the LCSR for these two words is 10/12. On the other hand, the LCSR for "conseil" and "conservative" is only 6/12. When dealing with language pairs that have different alphabets, the matching predicate can employ "phonetic cognates." When language L1 borrows a word from language L2, the word is usually written in L1 similarly to the way it sounds in L2. For many languages, it is not difficult to construct an approximate mapping from the orthography to its underlying phonological form. Given such a mapping for L1 and L2, it is possible to identify cognates despite incomparable orthographies. If you have such a grapheme to phoneme converter, you should embed it in your axis generator (see section V), so that the axes list the phonetic forms. The phonetic forms can then be matched by the lcsrcogp matching predicate. Cognates are more common in bitexts from more similar language pairs, and from text genres where more word borrowing occurs, such as technical texts. Even distantly related languages like English and Czech will share a large number of orthographic cognates in the form of proper nouns, numerals and punctuation. When one or both of the languages involved is written in pictographs, cognates can still be found among punctuation and numerals. However, these kinds of cognates are usually too sparse to build an accurate bitext map from. When the matching predicate cannot generate enough candidate correspondence points based on cognates, its signal can be strengthened by a seed translation lexicon -- a simple list of word pairs that are believed to be mutual translations. Seed translation lexicons can be extracted from machine-readable bilingual dictionaries (MRBDs) in the rare cases where MRBDs are available. In other cases, they can be constructed automatically or semi-automatically using any of several published methods. A matching predicate based on a seed translation lexicon deems two candidate tokens to be mutual translations if the token pair appears in the lexicon. Since the matching predicate need not be perfectly accurate, the seed translation lexicons need not be perfectly accurate either. You can supply SIMR with a translation lexicon by setting the translationLexicon parameter to its full name (including explicit path). The translation lexicon should have one equivalence per line. Each equivalence should have the four characters " <> " (space, less than, greater than, space) separating its two halves. Either half can contain spaces, Unicode, or anything else except the center marker " <> ". All the matching predicates described above can be fine-tuned with stop-lists for one or both languages. For example, closed-class words are unlikely to have cognates. Indeed, French/English words like "a," "an," "on," and "par" often produce spurious points of correspondence. The same problem is caused by "false friends," words with similar spellings but different meanings in different languages. For example, the French word "librarie" means "bookstore," not "library," and "actuel" means "current," not "actual." SIMR can use a list of closed class words and/or a list of pairs of "false friends" to filter out spurious matches. To pass a list of stopwords to SIMR, put them in a file, one word per line, and set the parameters xStopWordFile and/or yStopWordFile to the file's full name (including explicit pathname) in SIMR's config file. ======================================================================== VI. how to create bitext axes for GMA This subject is dealt with at length in the gma-axis package available for download on the main website. Simply put, a text axis consists of all the units of meaning in a text with their positions. ======================================================================== VII. why you must generate end-of-segment markers This is an important topic that plays a major role in the process that converts SIMR's bitext maps to segment alignments. Please refer to the gma-axis package for more information. ======================================================================== VIII. how to reoptimize SIMR's numerical parameters This is an involved topic that deserves separate treatment. Please visit our website to get the information and resources that will help you reoptimize: http://nlp.cs.nyu.edu/GMA/. Why should you reoptimize? The GMA package provides a number of default configuration files for different language pairs. However, the numerical parameters in these files were optimized with language-specific resources that are not included in the package (because we don't own them). Therefore, you would be well advised to reoptimize for your own text type(s). ======================================================================== IX. troubleshooting Problem: GMA's accuracy is significantly lower than what has been reported for a certain language pair. Diagnosis 1: You didn't reoptimize GMA's parameters for your particular text type and language-specific resources. Solution: Follow the directions in VIII, above. Diagnosis 2: Your language specific resources are not in the right format, and are consequently being underutilized. Solution: See resource descriptions in V above and elsewhere. ------------------------------------------------------------------------ Problem: SIMR runs OK and produces a bitext map, but GSA produces no output. Diagnosis: It's possible that your config file has carriage returns in addition to newlines. In this case, GSA thinks that the EOSMarker parameter includes the trailing carriage return. If there are no carriage returns in the bitext axis, then GSA never sees any EOSMarkers. So it thinks there's nothing to align. Solution: Run your config file(s) through the utility crnl2nl (available online), to remove the carriage returns and leave just the newlines. ======================================================================== X. how to submit bug reports First, package up the following in a tarball: * your two input bitext axes * all your config files (there are probably 3 of them) * the simr output file, if your run managed to create it Then go to our bugzilla server at: http://nlp.cs.nyu.edu/bugzilla You will need to give your email address in order to get a password, but it is all automated and relatively painless. Once you are logged in, click 'Enter a new Bug', Proteus Project and then for the component of the Proteus Project, pick 'GMA'. Then fill out the form with appropriate values. In the 'Description' block you should include: * the output of `uname -a` and `echo $path` * the output of `ls -alt` in the directory where GMA in installed * the output of `ls -alt` in the directory where you ran GMA * a copy of the exact command you used to run GMA, with all parameters You may leave the 'Assigned To' field blank. Then click the 'Commit' button. You can now create a new attachment to the bug including: * the uuencoded compressed tarball Then click 'Commit' again, please be patient. ======================================================================== XI. references Except for the "Porting" TR, everything is in Dan's book: I. Dan Melamed (2001). Empirical Methods for Exploiting Parallel Texts. MIT Press. Or you can grab individual papers from http://www.cs.nyu.edu/~melamed/pubs.html I. Dan Melamed (1996). A Geometric Approach to Mapping Bitext Correspondence, IRCS Technical Report #96-22, a revised version of the paper presented at the First Conference on Empirical Methods in Natural Language Processing (EMNLP'96), Philadelphia, PA, May. I. Dan Melamed (1996). Automatic Detection of Omissions in Translations, IRCS Technical Report #96-23, a revised version of the paper presented at the 16th International Conference on Computational Linguistics (COLING'96), Copenhagen, Denmark. I. Dan Melamed (1996). Porting SIMR to New Language Pairs, IRCS Technical Report #96-26. I. Dan Melamed (1997). A Portable Algorithm for Mapping Bitext Correspondence, 35th Conference of the Association for Computational Linguistics (ACL'97), Madrid, Spain. ========================================================================