KULLBACK's Cross-Product Sum Statistic
To decide whether two texts a and b, not necessarily of the same length, belong to the same language, we count the letters in each of them and compare the two distributions. A good measure for this comparison is KULLBACK's cross-product sum statistic χ (chi):
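Let m_s(a) and m_s(b) be the number of occurrences of the letter s in a and in b, respectively. Then form the "cross-products" m_s(a)·m_s(b), sum them over all letters s of the alphabet, and take the mean value:
χ(a, b) = 1/(r·q) · Σ_s m_s(a)·m_s(b)
where r is the length of a and q is the length of b. The sum counts the pairs of positions (i, j) at which a and b carry the same letter, so χ(a, b) is the proportion of such coincidences among all r·q pairs.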
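The definition can be sketched in a few lines of Python. This is not the Perl program mentioned on this page, only a minimal illustration; it assumes that just the 26 letters A to Z are counted and that case is ignored, and the function name `chi` is chosen here for convenience.

```python
from collections import Counter


def chi(a: str, b: str) -> float:
    """Cross-product sum statistic of two texts (letters A-Z only)."""
    # Keep only the letters A..Z; spaces and punctuation are ignored.
    letters_a = [c for c in a.upper() if "A" <= c <= "Z"]
    letters_b = [c for c in b.upper() if "A" <= c <= "Z"]
    if not letters_a or not letters_b:
        raise ValueError("both texts must contain at least one letter")

    m_a, m_b = Counter(letters_a), Counter(letters_b)
    r, q = len(letters_a), len(letters_b)

    # Sum of the cross-products m_s(a) * m_s(b); a Counter returns 0
    # for letters that do not occur in the other text.
    cross_sum = sum(m_a[s] * m_b[s] for s in m_a)
    return cross_sum / (r * q)
```

For instance, `chi("AAB", "AB")` has cross-products 2·1 for A and 1·1 for B, so the value is 3/(3·2) = 0.5, while two texts with no common letter give 0.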
More on this statistic is in the mathematical background part.
Here is a Perl program.
Consider the texts
a = IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR SANDB LAMIN GITON YOU
b = YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT
Count the letters:
Text | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | Length |
a | 7 | 2 | 1 | 2 | 6 | 1 | 2 | 3 | 5 | 0 | 1 | 4 | 1 | 6 | 8 | 1 | 0 | 3 | 2 | 3 | 5 | 0 | 1 | 0 | 4 | 0 | 68 |
b | 3 | 0 | 0 | 1 | 4 | 0 | 1 | 4 | 4 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 3 | 3 | 6 | 1 | 1 | 0 | 0 | 2 | 0 | 37 |
Σ = 21 + 0 + 0 + 2 + 24 + 0 + 2 + 12 + 20 + 0 + 0 + 0 + 0 + 18 + 8 + 0 + 0 + 9 + 6 + 18 + 5 + 0 + 0 + 0 + 8 + 0 = 153
χ(a, b) = 153 / (68 · 37) = 153/2516 ≈ 0.0608
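The worked example can be checked mechanically. The following Python sketch recomputes the letter counts, the cross-products, and their sum for the two texts; the split of the example into a 68-letter text and a 37-letter text follows the lengths given in the table, and the variable names are chosen here for illustration.

```python
import string

TEXT_A = ("IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING "
          "THEIR SANDB LAMIN GITON YOU")
TEXT_B = "YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT"


def letter_counts(text):
    """Return the 26 counts for A..Z; spaces are not counted."""
    return [text.count(letter) for letter in string.ascii_uppercase]


counts_a = letter_counts(TEXT_A)
counts_b = letter_counts(TEXT_B)
r, q = sum(counts_a), sum(counts_b)  # text lengths without spaces

# One cross-product per letter, in alphabetical order.
cross_products = [x * y for x, y in zip(counts_a, counts_b)]
total = sum(cross_products)
chi_ab = total / (r * q)

print(r, q, total, round(chi_ab, 4))  # 68 37 153 0.0608
```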
We gathered empirical values for English, German, and random texts of length 100 using a Perl program.
From these we see that χ, in contrast with the coincidence index κ, performs extremely well: in our experiments it even completely separates English and German texts from random texts of length 100.
This gives a test with power near 100% and error probability near 0%. The χ test even distinguishes between English and German texts at the 5% error level with a power of almost 75%. For this assertion compare the 95% quantile for English with the first quartile for German.
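The random-text part of such an experiment is easy to imitate. The sketch below is not the Perl program used for the results above; it assumes that "random" means independent, uniformly distributed letters A to Z in both texts, and all function names are hypothetical. Under this assumption the expected value of χ is 1/26 ≈ 0.0385, whatever the text lengths. No English or German sample is included, since that would require a corpus.

```python
import random
import statistics
import string
from collections import Counter


def chi(a, b):
    """Cross-product sum statistic; inputs contain only letters A-Z."""
    m_a, m_b = Counter(a), Counter(b)
    return sum(m_a[s] * m_b[s] for s in m_a) / (len(a) * len(b))


def random_text(length, rng):
    """Text of independent, uniformly distributed letters."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def simulate_random_chi(trials, length, seed=0):
    """chi values for `trials` pairs of random texts of equal length."""
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    return [chi(random_text(length, rng), random_text(length, rng))
            for _ in range(trials)]


values = simulate_random_chi(trials=2000, length=100)
q1, median, q3 = statistics.quantiles(values, n=4)
q95 = statistics.quantiles(values, n=20)[-1]  # last of 19 cut points

print(f"mean     {statistics.mean(values):.4f} (theory: {1 / 26:.4f})")
print(f"quartiles {q1:.4f} {median:.4f} {q3:.4f}")
print(f"95% quantile {q95:.4f}")
```

The 95% quantile of the random sample is the natural candidate for a decision threshold at the 5% error level.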
The results for 100-letter texts encourage us to try 26-letter texts for English, German, and random.
The χ test is quite strong even for 26 letters: at the 5% error level its power is around 91% for English and 98% for German.
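How an error level and a power figure of this kind can be read off two empirical samples is sketched below. The sketch assumes a one-sided test that rejects "random" when χ is large; the function name and the toy numbers in the usage lines are hypothetical and are not measured values.

```python
def threshold_and_power(null_values, alt_values, alpha=0.05):
    """Empirical threshold at error level alpha and the resulting power.

    null_values: chi values observed for random texts
    alt_values:  chi values observed for texts of the language under test
    """
    ordered = sorted(null_values)
    n = len(ordered)
    # At most this many null values may lie above the threshold.
    allowed = int(alpha * n)
    threshold = ordered[n - 1 - allowed]
    # Power: proportion of language values that exceed the threshold.
    power = sum(v > threshold for v in alt_values) / len(alt_values)
    return threshold, power


# Toy numbers only, to show the calling convention.
toy_null = [1, 2, 3, 4]
toy_alt = [2, 4, 5, 6]
print(threshold_and_power(toy_null, toy_alt, alpha=0.25))  # (3, 0.75)
```

With real data, `null_values` would hold the χ values of the random texts and `alt_values` those of the English or German texts of the same length.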