Cryptology: Empirical Results on MFL Scores

The power calculations for the tests (not the tests themselves!) relied on the independence of the letters in a string. This assumption is clearly false for natural languages. It therefore makes sense to obtain experimental results for the distributions of the MFL scores.

English

For English we take a text of 20000 letters, an extract from the Project Gutenberg etext of Kim, by Rudyard Kipling. The 20000-letter extract is here.

We divide this text into 2000 substrings of 10 letters each. To this set of substrings we apply the Perl script fritestE.pl. The results are collected and evaluated in a spreadsheet.
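
As a rough illustration of this step, here is a Python sketch (not the author's Perl script fritestE.pl). It cuts a letter string into blocks of 10 and counts, for each block, how many letters belong to a set of most-frequent letters. The set `MFL_ENGLISH` and all function names are assumptions made for this example; the set actually used by fritestE.pl is not given in this text.

```python
from collections import Counter

# Hypothetical set of ten frequent English letters (an assumption);
# the set actually used by fritestE.pl is not shown in this text.
MFL_ENGLISH = frozenset("ETAOINSRHD")

def mfl_score(block, mfl=MFL_ENGLISH):
    """Count the letters of `block` that belong to the set `mfl`."""
    return sum(1 for ch in block.upper() if ch in mfl)

def split_blocks(text, size=10):
    """Keep the letters A-Z only and cut them into full blocks of `size`."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    usable = len(letters) - len(letters) % size  # drop an incomplete tail
    return ["".join(letters[i:i + size]) for i in range(0, usable, size)]

def score_histogram(text, size=10, mfl=MFL_ENGLISH):
    """Return a list h where h[k] is the number of blocks with MFL score k."""
    counts = Counter(mfl_score(b, mfl) for b in split_blocks(text, size))
    return [counts.get(k, 0) for k in range(size + 1)]
```

Applied to a 20000-letter text, `score_histogram` would yield one "observed" column with 2000 entries in total, which could then be compared with the expected values.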

We do the same for random text, constructed by taking 20000 random numbers between 0 and 25 from random.org, see here.
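
The construction of a random text can be sketched as follows. The mapping of the numbers 0 to 25 to the letters A to Z follows the description above; the use of Python's `secrets` module as a local random source is a substitute chosen for this example, not the random.org data used in the experiment, and the function names are hypothetical.

```python
import secrets
import string

def numbers_to_text(numbers):
    """Map integers 0..25 to the letters A..Z."""
    return "".join(string.ascii_uppercase[n] for n in numbers)

def random_text(length=20000):
    """Draw `length` uniform letters from the OS random source (not random.org)."""
    return numbers_to_text(secrets.randbelow(26) for _ in range(length))
```

A list of numbers downloaded from another source could be passed directly to `numbers_to_text`.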

The following table compares the expected and observed distributions. For random texts they match well, allowing for sampling variation. For English, too, the observations seem to match the predicted values. The empirical values amount to a power of 68% (theory: 67%) and a predictive value of 75% (theory: 75%).

	Random		English
score	expected	observed	expected	observed
0	16	12	0	0
1	98	102	0	0
2	274	256	2	2
3	456	491	8	11
4	500	494	40	52
5	374	380	134	132
6	194	182	318	316
7	70	66	514	513
8	16	15	546	587
9	2	1	344	304
10	0	1	98	83
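
The "expected" column for random text in the table above is consistent with a binomial model. The sketch below assumes a set of 10 most-frequent letters in a 26-letter alphabet, so that each random letter hits the set with probability 10/26. It also assumes that the probabilities were rounded to three decimals before scaling to 2000 blocks. Both are inferences from the table (all expected counts are even), not statements from the text. The function name is hypothetical.

```python
from math import comb

def expected_counts(p, n_letters=10, n_blocks=2000, digits=3):
    """Binomial probabilities, rounded to `digits` places, scaled to block counts."""
    counts = []
    for k in range(n_letters + 1):
        prob = comb(n_letters, k) * p**k * (1 - p)**(n_letters - k)
        # Assumed rounding: probability first, then scaling to n_blocks
        counts.append(round(round(prob, digits) * n_blocks))
    return counts

# Random text: each letter lies in a 10-letter set with probability 10/26
random_expected = expected_counts(10 / 26)
```

Under these assumptions `random_expected` should reproduce the expected counts for random text listed in the table, from 16 for score 0 up to 0 for score 10.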

German and French

The 20000-letter extracts are Schach20K.txt and Lune20K.txt. We generate independent random texts, see rnd10D.txt and rnd10F.txt. (Because the random texts are generated independently, their observed values differ from table to table.)

The Perl scripts, adapted to the differing collections of most-frequent letters, are fritestD.pl and fritestF.pl.

	Random		German
score	expected	observed	expected	observed
0	16	22	0	0
1	98	111	0	0
2	274	287	0	3
3	456	443	6	4
4	500	493	32	31
5	374	363	116	110
6	194	184	290	277
7	70	78	500	553
8	16	18	564	632
9	2	1	378	314
10	0	0	114	76

	Random		French
score	expected	observed	expected	observed
0	16	17	0	0
1	98	102	0	0
2	274	290	0	0
3	456	463	2	1
4	500	491	14	5
5	374	376	62	18
6	194	188	196	160
7	70	61	424	472
8	16	11	602	719
9	2	1	506	484
10	0	0	192	141

The empirical values amount to a power of 63% (theory: 67%) and a predictive value of 75% (theory: 75%) for German, and to a power of 87% (theory: 86%) and a predictive value of 88% (theory: 87%) for French.

The values for random texts are in good accord with the theoretical values. The distributions for the natural languages seem to be somewhat narrower than theory predicts. Still, it is not unreasonable to regard this as a confirmation of the theoretical results.
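
One simple way to quantify the width of a distribution is to compare standard deviations. The following sketch does this for the German counts copied from the table above. The choice of the standard deviation as a measure of width, and the function name, are assumptions of this example and not part of the original evaluation.

```python
from math import sqrt

# Counts for scores 0..10, copied from the German table above
german_expected = [0, 0, 0, 6, 32, 116, 290, 500, 564, 378, 114]
german_observed = [0, 0, 3, 4, 31, 110, 277, 553, 632, 314, 76]

def mean_and_sd(counts):
    """Mean and standard deviation of the score, given counts per score."""
    total = sum(counts)
    mean = sum(k * c for k, c in enumerate(counts)) / total
    var = sum(c * (k - mean) ** 2 for k, c in enumerate(counts)) / total
    return mean, sqrt(var)

exp_mean, exp_sd = mean_and_sd(german_expected)
obs_mean, obs_sd = mean_and_sd(german_observed)
```

For the German counts this gives a standard deviation of about 1.36 for the expected and about 1.29 for the observed distribution, with means of about 7.51 and 7.43.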