Cryptology

Empirical Results on MFL Scores

The power calculations for the tests (not the tests themselves!) relied on the independence of the letters in a string. This assumption is clearly false for natural languages. It therefore makes sense to obtain experimental results for the distributions of the MFL scores.

English

For English we take a text of 20000 letters, an extract from the Project Gutenberg etext of Kim, by Rudyard Kipling. The 20000-letter extract is here.

We divide this text into 2000 substrings of 10 letters each. To this set of substrings we apply the Perl script fritestE.pl. The results are collected and evaluated in a spreadsheet.
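The chunking and scoring step can be sketched in a few lines of Python. This is only an illustration, not the script fritestE.pl: the letter set `MFL_ENGLISH` and the function names are assumptions made for this example, and the sketch simply counts, for each full chunk of 10 letters, how many letters belong to the most-frequent-letter set.

```python
from collections import Counter

# Assumption: a 10-letter most-frequent-letter set for English.
# The set actually used by fritestE.pl is not shown in this text.
MFL_ENGLISH = "ETOANIRSHD"


def mfl_scores(text, mfl=MFL_ENGLISH, size=10):
    """Return the MFL score of each full chunk of `size` letters."""
    # Keep only the letters A..Z; spaces, digits and punctuation are dropped.
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    mfl_set = set(mfl)
    scores = []
    # A trailing chunk shorter than `size` is ignored.
    for start in range(0, len(letters) - size + 1, size):
        chunk = letters[start:start + size]
        scores.append(sum(1 for c in chunk if c in mfl_set))
    return scores


def score_histogram(scores, size=10):
    """Count how many chunks obtained each score 0..size."""
    counts = Counter(scores)
    return [counts.get(s, 0) for s in range(size + 1)]
```

Applied to a 20000-letter text, `score_histogram(mfl_scores(text))` yields one count per score from 0 to 10, which is the shape of the "observed" columns in the tables of this section.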

We do the same for random text, constructed by taking 20000 random numbers between 0 and 25 from random.org, see here.

The Perl script RandOrg.pl transforms the random numbers into text.
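A minimal Python sketch of such a transformation is given here for illustration. It is not RandOrg.pl itself: the mapping 0 → A, ..., 25 → Z and the whitespace-separated input format are assumptions, and the function name is hypothetical.

```python
def numbers_to_text(raw):
    """Map whitespace-separated integers 0..25 to the letters A..Z.

    Assumption: 0 stands for A and 25 for Z; the mapping used by
    RandOrg.pl is not shown in this text.
    """
    letters = []
    for token in raw.split():
        n = int(token)
        if not 0 <= n <= 25:
            raise ValueError(f"value out of range: {n}")
        letters.append(chr(ord("A") + n))
    return "".join(letters)
```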

The graphic shows some characteristics of the distribution:

[Figure: MFL scores for 2000 English and random text chunks of 10 letters each]

The following table compares the expected and observed distributions. For random texts they match well, allowing for the variation caused by drawing a sample. For English, too, the observations seem to match the predicted values. The empirical values for English amount to a power of 68% (theory: 67%) and a predictive value of 75% (theory: 75%).

            Random              English
score  expected  observed  expected  observed
    0        16        12         0         0
    1        98       102         0         0
    2       274       256         2         2
    3       456       491         8        11
    4       500       494        40        52
    5       374       380       134       132
    6       194       182       318       316
    7        70        66       514       513
    8        16        15       546       587
    9         2         1       344       304
   10         0         1        98        83
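The "expected" column for random text is consistent with a simple binomial model, which the following Python sketch illustrates. The model is an inference from the table rather than something stated in this section: it assumes that the MFL set contains 10 of the 26 letters and that each letter of a random chunk falls into it independently with probability 10/26. The function name and parameter names are hypothetical.

```python
from math import comb


def expected_random_counts(chunks=2000, size=10, mfl_size=10, alphabet=26):
    """Expected number of chunks per MFL score under a binomial model.

    Assumption: each letter is in the MFL set independently with
    probability mfl_size / alphabet.
    """
    p = mfl_size / alphabet
    return [
        chunks * comb(size, k) * p**k * (1 - p) ** (size - k)
        for k in range(size + 1)
    ]


# "expected" column for random text, copied from the table above.
table_expected = [16, 98, 274, 456, 500, 374, 194, 70, 16, 2, 0]

for score, (model, table) in enumerate(
    zip(expected_random_counts(), table_expected)
):
    print(f"{score:2d}  model {model:7.2f}  table {table:4d}")
```

Under these assumptions every model value lies within one chunk of the tabulated value; the remaining differences are rounding.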

German and French

We repeat this procedure for German and French. As texts we take Schachnovelle by Stefan Zweig, and De la Terre à la Lune by Jules Verne.

The 20000-letter extracts are Schach20K.txt and Lune20K.txt. We generate independent random texts, see rnd10D.txt and rnd10F.txt. (Because the random texts are generated independently, their observed values differ from table to table.)

The Perl scripts, adapted to the differing collections of most-frequent letters, are fritestD.pl and fritestF.pl.

The comprehensive evaluation is in the spreadsheets statFriD.xls and statFriF.xls.

The results for German are in the following figure and table:

[Figure: MFL scores for 2000 German and random text chunks of 10 letters each]

            Random              German
score  expected  observed  expected  observed
    0        16        22         0         0
    1        98       111         0         0
    2       274       287         0         3
    3       456       443         6         4
    4       500       493        32        31
    5       374       363       116       110
    6       194       184       290       277
    7        70        78       500       553
    8        16        18       564       632
    9         2         1       378       314
   10         0         0       114        76

The results for French are in the following figure and table:

[Figure: MFL scores for 2000 French and random text chunks of 10 letters each]

            Random              French
score  expected  observed  expected  observed
    0        16        17         0         0
    1        98       102         0         0
    2       274       290         0         0
    3       456       463         2         1
    4       500       491        14         5
    5       374       376        62        18
    6       194       188       196       160
    7        70        61       424       472
    8        16        11       602       719
    9         2         1       506       484
   10         0         0       192       141

The empirical values amount to a power of 63% (theory: 67%) and a predictive value of 75% (theory: 75%) for German, and to a power of 87% (theory: 86%) and a predictive value of 88% (theory: 87%) for French.

The values for random texts are in good accord with the theoretical values. The distributions for the natural languages seem to be somewhat narrower than the theory predicts. Nevertheless, it is not unreasonable to regard these results as a confirmation of the theoretical results.
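One way to make "somewhat narrow" concrete is to compare the standard deviations of the expected and observed score distributions. The following Python sketch does this for the German and French columns of the tables above; the choice of the standard deviation as a measure of width and the function name are assumptions of this illustration, not part of the original evaluation in the spreadsheets.

```python
from math import sqrt


def mean_and_sd(counts):
    """Mean and standard deviation of a score histogram.

    counts[s] is the number of chunks with MFL score s.
    """
    total = sum(counts)
    mean = sum(s * c for s, c in enumerate(counts)) / total
    var = sum(c * (s - mean) ** 2 for s, c in enumerate(counts)) / total
    return mean, sqrt(var)


# (expected, observed) columns copied from the tables in this section.
TABLES = {
    "German": (
        [0, 0, 0, 6, 32, 116, 290, 500, 564, 378, 114],
        [0, 0, 3, 4, 31, 110, 277, 553, 632, 314, 76],
    ),
    "French": (
        [0, 0, 0, 2, 14, 62, 196, 424, 602, 506, 192],
        [0, 0, 0, 1, 5, 18, 160, 472, 719, 484, 141],
    ),
}

for language, (expected, observed) in TABLES.items():
    _, sd_expected = mean_and_sd(expected)
    _, sd_observed = mean_and_sd(observed)
    print(f"{language}: sd expected {sd_expected:.2f}, observed {sd_observed:.2f}")
```

With the table values, the observed standard deviation comes out below the expected one for both languages, which agrees with the impression of a narrower distribution.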


Author: Klaus Pommerening, 2014-Jun-10; last change: 2014-Jun-10.