/*
   William A. Gale and Kenneth W. Church, "A Program for Aligning
   Sentences in Bilingual Corpora", in Susan Armstrong, ed., "Using Large
   Corpora", MIT Press, 1994, pp. 91-102.

   with Michael D. Riley

   The following code is the core of align. It is a C language program
   that inputs two files, with one token (word) per line. The text files
   contain a number of delimiter tokens of two kinds: "hard" and "soft".
   The hard regions (e.g. paragraphs) may not be changed, and there must
   be equal numbers of them in the two input files. The soft regions
   (e.g. sentences) may be deleted (1-0), inserted (0-1), contracted
   (2-1), expanded (1-2) or merged (2-2) as necessary so that the output
   ends up with the same number of soft regions. The program generates
   two output files, which contain an equal number of soft regions, each
   on a line. If the -v command line option is included, each soft
   region is preceded by its probability score.
*/

/*
   Return -100 * log probability that an English sentence of length len1
   is a translation of a foreign sentence of length len2. The probability
   is based on two parameters, the mean (c) and variance (s2) of the
   number of foreign characters per English character.
*/

  mean = (len1 + len2/c)/2;
  z = (c * len1 - len2)/sqrt(s2 * mean);

  /* Need to deal with both sides of the normal distribution */
  if (z < 0) z = -z;
  pd = 2 * (1 - pnorm(z));

  if (pd > 0) return((int)(-100 * log(pd)));