Cryptology

KULLBACK's Cross-Product Sum Statistic

Definition

To decide whether two texts a and b, not necessarily of the same length, belong to the same language, we count the letters in each of them and compare the distributions. A good measure for this comparison is KULLBACK's cross-product sum statistic, or Chi:

Let m_s(a) and m_s(b) be the absolute frequencies (counts) of the letter s in a and in b, respectively. Then form the »cross-products« m_s(a)·m_s(b), sum them over all letters s of the alphabet, and divide by the product of the text lengths:

χ(a, b) = (1 / (r·q)) · Σ_s m_s(a)·m_s(b)

where r is the length of a, and q is the length of b.
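The definition translates almost literally into code. The following Python sketch is not the Perl program referred to on this page; the function names letter_counts and chi are chosen here for illustration, and the sketch assumes the 26-letter alphabet A..Z, ignoring case and all non-letter characters.

```python
from collections import Counter


def letter_counts(text):
    """Count the letters A..Z in text, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def chi(a, b):
    """Cross-product sum of the letter counts of a and b, divided by r*q."""
    m_a, m_b = letter_counts(a), letter_counts(b)
    r, q = sum(m_a.values()), sum(m_b.values())
    if r == 0 or q == 0:
        raise ValueError("both texts must contain at least one letter")
    # A Counter returns 0 for letters that do not occur, so letters
    # missing from b simply contribute nothing to the sum.
    cross_sum = sum(m_a[s] * m_b[s] for s in m_a)
    return cross_sum / (r * q)


# Toy input: the cross-products are A: 2*1 and B: 2*1, so chi = 4 / (4*3).
print(chi("ABBA", "BAT"))
```

Since only the letter counts enter the formula, the order of the letters within each text plays no role.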

More on this statistic is in the mathematical background part.

Here is a Perl program.


Example

Consider the texts

  IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR SANDB LAMIN GITON YOU
  YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT

Count the letters:

     A B C D E F G  H I J K L M N  O P Q R S T U  V W X Y Z  length
  a  7 2 1 2 6 1 2  3 5 0 1 4 1 6  8 1 0 3 2 3 5  0 1 0 4 0  r = 68
  b  3 0 0 1 4 0 1  4 4 0 0 0 0 3  1 0 0 3 3 6 1  1 0 0 2 0  q = 37

Σ = 21 + 0 + 0 + 2 + 24 + 0 + 2 + 12 + 20 + 0 + 0 + 0 + 0 + 18 + 8 + 0 + 0 + 9 + 6 + 18 + 5 + 0 + 0 + 0 + 8 + 0 = 153

χ(a, b) = 153 / (68 · 37) = 153 / 2516 ≈ 0.0608
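The hand computation above can be checked mechanically. The following Python sketch (an illustration, not the Perl program mentioned on this page; the variable names are chosen here) counts the letters of the two example texts, forms the 26 cross-products in alphabetical order, and divides their sum by r·q.

```python
from collections import Counter
from string import ascii_uppercase

TEXT_A = ("IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO "
          "OSING THEIR SANDB LAMIN GITON YOU")
TEXT_B = "YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT"

# Letter counts; the blanks only separate the five-letter groups.
counts_a = Counter(TEXT_A.replace(" ", ""))
counts_b = Counter(TEXT_B.replace(" ", ""))
r = sum(counts_a.values())  # length of a
q = sum(counts_b.values())  # length of b

# One cross-product per letter A..Z, in the order used in the sum above.
products = [counts_a[s] * counts_b[s] for s in ascii_uppercase]
cross_sum = sum(products)
chi_ab = cross_sum / (r * q)

print("r =", r, "and q =", q)             # r = 68 and q = 37
print("cross-product sum =", cross_sum)   # 153
print("chi(a, b) = %.4f" % chi_ab)        # 0.0608
```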


Properties


Empirical values

We gathered empirical values for English, German, and random texts of length 100 using a Perl program.

From these we see that χ, in contrast with the coincidence index κ, performs extremely well. In fact, in our experiments it even completely separates English and German texts from random texts of length 100.

This makes a test with power near 100% and error probability near 0%. The χ test even distinguishes between English and German texts at the 5% error level with a power of almost 75%. For this assertion compare the 95% quantile for English with the first quartile for German.

The results for 100-letter texts encourage us to try 26-letter texts for English, German, and random.

The χ-test is quite strong even for 26 letters: at the 5% error level its power is around 91% for English and 98% for German.
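The random-text side of such an experiment is easy to imitate. The Python sketch below is not the Perl program used for the values reported here, and it does not reproduce them: it only draws pairs of uniformly random texts over A..Z and summarizes their χ values. The sample size of 2000 pairs, the seed, and the crude quantile estimate are assumptions made for illustration; English and German samples would have to be supplied separately. For independent, uniformly distributed letters the expected value of χ is 1/26 ≈ 0.0385, which gives a plausibility check for the simulated mean.

```python
import random
from collections import Counter
from string import ascii_uppercase


def chi(a, b):
    """Cross-product sum of letter counts, divided by len(a) * len(b).

    Assumes both texts consist of letters only.
    """
    m_a, m_b = Counter(a), Counter(b)
    return sum(m_a[s] * m_b[s] for s in m_a) / (len(a) * len(b))


def random_text(length, rng):
    """Text of independent, uniformly distributed letters A..Z."""
    return "".join(rng.choice(ascii_uppercase) for _ in range(length))


def simulate(length, pairs, rng):
    """Sorted chi values for `pairs` pairs of random texts of one length."""
    return sorted(chi(random_text(length, rng), random_text(length, rng))
                  for _ in range(pairs))


rng = random.Random(2013)  # arbitrary seed, for repeatability only
for length in (100, 26):
    values = simulate(length, 2000, rng)
    mean = sum(values) / len(values)
    q95 = values[int(0.95 * len(values))]  # crude empirical 95% quantile
    print("length %3d: mean %.4f, 95%% quantile %.4f" % (length, mean, q95))
```

Comparing the χ value of a pair of real texts with the upper quantile obtained this way mirrors the kind of test described above, under the stated assumptions.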


Author: Klaus Pommerening, 2013-Dec-23; last change: 2014-Feb-16.