Cryptology

Bigram Scores

Empirical Values for German

For German we take two texts of 40000 letters each, extracted from the Project Gutenberg etext of Effi Briest by Theodor Fontane: one from Chapters 1 to 5, the other from Chapters 6 to 10. The two 40000-letter extracts are here and here.

We form 20000 random bigrams that keep the expected single-letter frequencies: we partition the first text into two halves of 20000 letters each and pair each letter of the first half with the corresponding letter of the second half. Then we form 2000 sequences of 10 bigrams each and calculate the bigram rate of each of these 2000 sequences. For this we use the Perl script statD_1.pl.
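The pairing step can be sketched in a few lines of Python. This is only an illustration, not the statD_1.pl script itself; the function names `pair_halves` and `chunk` and the short sample string are assumptions made for the example. Because every letter of the text is used exactly once, the single-letter frequencies are unchanged, while the two letters of each pair sit half a text apart and are therefore practically independent.

```python
def pair_halves(letters):
    """Pair the i-th letter of the first half with the i-th letter of the second half."""
    half = len(letters) // 2
    return [a + b for a, b in zip(letters[:half], letters[half:2 * half])]


def chunk(items, size):
    """Split items into consecutive groups of `size`, dropping an incomplete tail."""
    full = len(items) - len(items) % size
    return [items[i:i + size] for i in range(0, full, size)]


# A 20-letter toy string stands in for the 40000-letter text (assumption).
sample = "DERHUNDLAEUFTSCHNELL"
random_bigrams = pair_halves(sample)   # 10 pseudo-random bigrams
sequences = chunk(random_bigrams, 5)   # 2 sequences of 5 bigrams each
```

With a 40000-letter input and `chunk(..., 10)`, the same two calls would yield the 20000 bigrams and 2000 sequences described above.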

Next we form 20000 true bigrams: we take all the non-overlapping bigrams from the second text, form 2000 sequences of 10 bigrams each, and calculate the bigram rate of each of these 2000 sequences. For this we use the Perl script statD_2.pl.
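A rough Python sketch of this step follows. It is not the statD_2.pl script. The exact definition of the bigram rate is not repeated in this section, so the sketch assumes it is the average of per-bigram weights taken from a lookup table; the function names and the toy weights are hypothetical and chosen only for illustration.

```python
def true_bigrams(letters):
    """Non-overlapping bigrams: positions (0,1), (2,3), ...; a trailing single letter is dropped."""
    return [letters[i:i + 2] for i in range(0, len(letters) - 1, 2)]


def mean_weight(sequence, weights, default=0.0):
    """Average weight per bigram; bigrams missing from the table count as `default`.

    ASSUMPTION: the bigram rate is taken to be such an average of table weights.
    """
    return sum(weights.get(b, default) for b in sequence) / len(sequence)


# Hypothetical toy weights, NOT the empirical German values.
toy_weights = {"EN": 2.0, "ER": 1.5}

bigrams = true_bigrams("ENERGIEN")        # ['EN', 'ER', 'GI', 'EN']
rate = mean_weight(bigrams, toy_weights)  # (2.0 + 1.5 + 0.0 + 2.0) / 4
```

In the real experiment the weight table would come from German bigram statistics, and `mean_weight` would be applied to each of the 2000 sequences of 10 bigrams.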

The results of these two experiments are collected and evaluated in a spreadsheet.

The graphic shows the distributions; each observed value corresponds to a sequence of 10 bigrams:

[Figure: cBLW scores for 2000 random and true German bigram sequences]

The following table shows characteristics of these distributions.

                      Random   True
Minimum:              0.81     1.40
Maximum:              2.15     2.44
Median:               1.52     1.96
5% Quantile:          1.14     1.69
25% Quantile:         1.38     1.86
75% Quantile:         1.65     2.06
95% Quantile:         1.85     2.20
Mean value:           1.51     1.96
Standard deviation:   0.21     0.15
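
The same characteristics can be computed without a spreadsheet. The following Python sketch uses only the standard `statistics` module; the function name `summarize`, the choice of the inclusive quantile method, and the use of the sample standard deviation are assumptions, since the text does not say which conventions the spreadsheet used.

```python
import statistics


def summarize(scores):
    """Return the characteristics listed in the table for a list of scores."""
    # 19 cut points at 5%, 10%, ..., 95% and 3 cut points at 25%, 50%, 75%.
    q20 = statistics.quantiles(scores, n=20, method="inclusive")
    q4 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "max": max(scores),
        "median": statistics.median(scores),
        "q05": q20[0],
        "q25": q4[0],
        "q75": q4[2],
        "q95": q20[18],
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation (assumption)
    }


# Toy data standing in for the 2000 observed scores (assumption).
toy_scores = [float(x) for x in range(1, 102)]
summary = summarize(toy_scores)
```

Calling `summarize` once on the 2000 random scores and once on the 2000 true scores would give the two columns of the table, up to the conventions noted above.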

The rates are a bit larger than for English, and the random and true distributions are also a bit more sharply separated.


Author: Klaus Pommerening, 2014-Jul-22; last change: 2014-Jul-27.