Cryptology: KULLBACK's Cross-Product Sum Statistic

Definition

To decide whether two texts a and b, not necessarily of the same length, belong to the same language, we count the letters in each of them and compare the distributions. A good measure for this comparison is KULLBACK's cross-product sum statistic, or Chi:

Let m_s(a) and m_s(b) be the absolute frequencies (numbers of occurrences) of the letter s in a and in b, respectively. Then form the »cross-products« m_s(a)m_s(b), sum them over all letters s of the alphabet, and divide by the product of the two text lengths:

χ(a, b) = (Σ_s m_s(a) m_s(b)) / (r q)

where r is the length of a and q is the length of b.
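
The definition translates almost directly into code. The following is a minimal Python sketch, not the Perl program referred to on this page; the function name `chi` and the decision to ignore case and all non-letter characters are assumptions made for illustration.

```python
from collections import Counter


def chi(a: str, b: str) -> float:
    """Cross-product sum statistic of two texts (letters only, case-insensitive)."""
    # Assumption: spaces, digits and punctuation are not part of the text.
    a = [c for c in a.upper() if c.isalpha()]
    b = [c for c in b.upper() if c.isalpha()]
    if not a or not b:
        raise ValueError("both texts must contain at least one letter")
    m_a, m_b = Counter(a), Counter(b)
    # Sum of the cross-products m_s(a) * m_s(b); letters missing from b count as 0.
    cross = sum(count * m_b[s] for s, count in m_a.items())
    return cross / (len(a) * len(b))


print(chi("ABBA", "BAB"))  # (2*1 + 2*2) / (4*3) = 0.5
```

Only letters that occur in both texts contribute to the sum, so it is enough to iterate over the letters of one text.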

More on this statistic is in the mathematical background part.

Here is a Perl program.

Example

Consider the texts

  IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR SANDB LAMIN GITON YOU
  YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT

Count the letters:

	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S	T	U	V	W	X	Y	Z	length
a	7	2	1	2	6	1	2	3	5	0	1	4	1	6	8	1	0	3	2	3	5	0	1	0	4	0	68
b	3	0	0	1	4	0	1	4	4	0	0	0	0	3	1	0	0	3	3	6	1	1	0	0	2	0	37

Σ = 21 + 0 + 0 + 2 + 24 + 0 + 2 + 12 + 20 + 0 + 0 + 0 + 0 + 18 + 8 + 0 + 0 + 9 + 6 + 18 + 5 + 0 + 0 + 0 + 8 + 0 = 153

χ(a, b) = 153 / (68 × 37) = 153 / 2516 ≈ 0.0608
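
The worked example can be checked mechanically. The sketch below recomputes the letter counts, the 26 cross-products and the final value from the two texts as printed above; the variable names are chosen here for illustration only.

```python
from collections import Counter
from string import ascii_uppercase

text_a = ("IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING "
          "THEIR SANDB LAMIN GITON YOU")
text_b = "YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT"

# Drop the spaces that only separate the five-letter groups.
a = text_a.replace(" ", "")
b = text_b.replace(" ", "")
m_a, m_b = Counter(a), Counter(b)

# One cross-product per letter A..Z, in the same order as the table.
products = [m_a[s] * m_b[s] for s in ascii_uppercase]
total = sum(products)
chi_ab = total / (len(a) * len(b))

print(len(a), len(b))      # 68 37
print(total)               # 153
print(round(chi_ab, 4))    # 0.0608
```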

Properties

For a given text a and a »random« text b we have χ(a, b) ≈ 1/n, where n is the number of letters in the alphabet (n = 26 in the example above, so 1/n ≈ 0.0385).
For »random« texts a and b we have χ(a, b) ≈ 1/n.
For given texts a and b and a »random« monoalphabetic substitution f_σ we have χ(a, f_σ(b)) ≈ 1/n. This remark justifies treating a nontrivially monoalphabetically encrypted text as random with respect to χ and plaintext.
For given texts a and b and two »random« monoalphabetic substitutions f_σ, f_τ we have χ(f_σ(a), f_τ(b)) ≈ 1/n.
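
The first and third properties can be illustrated by simulation. The following Python sketch assumes that »random« means uniformly distributed over the 26-letter alphabet (for texts) or over all permutations of the alphabet (for substitutions); the seed, the number of trials and the helper names are arbitrary choices made here. It averages χ over many random draws and compares the result with 1/n.

```python
import random
from collections import Counter
from string import ascii_uppercase

ALPHABET = ascii_uppercase          # n = 26
rng = random.Random(2013)           # fixed seed, arbitrary choice


def chi(a: str, b: str) -> float:
    m_a, m_b = Counter(a), Counter(b)
    return sum(m_a[s] * m_b[s] for s in ALPHABET) / (len(a) * len(b))


def random_text(length: int) -> str:
    # Assumption: letters drawn independently and uniformly.
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def random_substitution() -> dict:
    # A uniformly random permutation of the alphabet as a translation table.
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return str.maketrans(ALPHABET, "".join(shuffled))


a = ("IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING "
     "THEIR SANDB LAMIN GITON YOU").replace(" ", "")
b = "YOURS ISTHE EARTH ANDEV ERYTH INGTH ATSIN IT".replace(" ", "")
trials = 5000

# Property 1: fixed text a against random texts b of length 100.
mean_random = sum(chi(a, random_text(100)) for _ in range(trials)) / trials
# Property 3: fixed texts a and b, with b under random substitutions.
mean_subst = sum(chi(a, b.translate(random_substitution()))
                 for _ in range(trials)) / trials

print(1 / len(ALPHABET), mean_random, mean_subst)
```

Individual values of χ scatter around 1/n, more widely for short texts; only their average is expected to settle close to 1/n.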

Empirical values

We gathered empirical values for English, German, and random texts of length 100 using a Perl program.

From these we see that χ, in contrast with the coincidence index κ, performs extremely well: in our experiments it completely separates English and German texts from random texts of length 100.

This gives a test with power near 100% and error probability near 0%. The χ test even distinguishes between English and German texts at the 5% error level with a power of almost 75%. To check this assertion, compare the 95% quantile for English with the first quartile for German.
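
Quantiles of this kind can be estimated by simulation. The following Python sketch covers only the random-text case, since no language corpus is given here; it is not the Perl program mentioned above, and the sample size of 2000 pairs, the seed and the uniform letter model are assumptions. It computes χ for pairs of random texts of length 100 and reads off the median and the 95% quantile.

```python
import random
import statistics
from collections import Counter
from string import ascii_uppercase

rng = random.Random(100)  # fixed seed for reproducibility, arbitrary choice


def chi(a: str, b: str) -> float:
    m_a, m_b = Counter(a), Counter(b)
    return sum(m_a[s] * m_b[s] for s in ascii_uppercase) / (len(a) * len(b))


def random_text(length: int) -> str:
    # Assumption: letters drawn independently and uniformly from A..Z.
    return "".join(rng.choice(ascii_uppercase) for _ in range(length))


# chi values for 2000 pairs of random texts, each of length 100
values = sorted(chi(random_text(100), random_text(100)) for _ in range(2000))

# statistics.quantiles with n=20 returns 19 cut points: 5%, 10%, ..., 95%
cuts = statistics.quantiles(values, n=20)
print("median      :", statistics.median(values))
print("95% quantile:", cuts[-1])
```

Running the same procedure on pairs of texts sampled from an English or German corpus would give the corresponding quantiles for those languages, which can then be compared with the random case.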

The results for 100-letter texts encourage us to try 26-letter texts for English, German, and random.

The χ-test is quite strong even for 26 letters: At the 5% error level its power is around 91% for English, 98% for German.

Author: Klaus Pommerening, 2013-Dec-23; last change: 2014-Feb-16.
