Cryptology

Bigram Scores

Empirical Values for English

For English we take two texts of 40000 letters each, extracted from the Project Gutenberg etext of The Children of the New Forest, by Frederick Marryat: the first from Chapters I to III, the second from Chapters IV to VI. Both 40000-letter extracts are provided as separate files.

As a first experiment we form 20000 random bigrams that keep the expected single-letter frequencies: we partition the first text into two halves of 20000 letters each and pair every letter of the first half with the letter at the same position in the second half. Then we form 2000 sequences of 10 bigrams each and calculate the bigram rate of each of these 2000 sequences. For this we use the Perl script statE_1.pl.
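
The pairing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script statE_1.pl: the function names pair_halves and group are hypothetical, the sample text is a tiny stand-in for the 40000-letter extract, and the scoring of each sequence is left out because the bigram rate is defined elsewhere.

```python
def pair_halves(text):
    """Pair the i-th letter of the first half with the i-th letter of the second half."""
    half = len(text) // 2
    return [a + b for a, b in zip(text[:half], text[half:2 * half])]


def group(bigrams, size=10):
    """Split a list of bigrams into consecutive sequences of `size` bigrams.

    A trailing incomplete sequence is dropped.
    """
    full = len(bigrams) - len(bigrams) % size
    return [bigrams[i:i + size] for i in range(0, full, size)]


# Tiny stand-in text (assumption); the experiment uses 40000 letters.
sample = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOGSTHEQ"
pairs = pair_halves(sample)          # 20 "random" bigrams
sequences = group(pairs, size=10)    # 2 sequences of 10 bigrams
```

Because the two letters of each pair are 20000 positions apart in the real experiment, they are practically independent, while each position still follows the single-letter distribution of English.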

As a second experiment we form 20000 true bigrams by partitioning the second text into 20000 non-overlapping bigrams, form 2000 sequences of 10 bigrams each, and calculate the bigram rate of each of these 2000 sequences. For this we use the Perl script statE_2.pl.
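
A comparable Python sketch for the second experiment follows. Again this is not the script statE_2.pl: the names true_bigrams and rate_sequences are hypothetical, and the score argument is a placeholder for the bigram rate, which is not defined in this section. The demo function below merely counts the share of "TH" bigrams and is not the bigram rate.

```python
def true_bigrams(text):
    """Cut the text into consecutive, non-overlapping bigrams."""
    usable = len(text) - len(text) % 2  # ignore a trailing single letter
    return [text[i:i + 2] for i in range(0, usable, 2)]


def rate_sequences(bigrams, score, size=10):
    """Apply `score` to each consecutive sequence of `size` bigrams."""
    full = len(bigrams) - len(bigrams) % size
    return [score(bigrams[i:i + size]) for i in range(0, full, size)]


def demo_score(sequence):
    """Placeholder score (assumption): share of "TH" bigrams in the sequence."""
    return sum(b == "TH" for b in sequence) / len(sequence)


bigrams = true_bigrams("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOGSTHEQ")
rates = rate_sequences(bigrams, demo_score)  # one value per 10 bigrams
```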

The results of these two experiments are collected and evaluated in a spreadsheet.

The graphic shows the distributions, each observed value corresponding to a sequence of 10 bigrams:

[cBLW scores for 2000 random and true English bigram sequences]

The following table shows characteristics of these distributions.

                     Random   True
Minimum:              0.58    1.26
Maximum:              2.18    2.31
Median:               1.46    1.89
5% Quantile:          1.08    1.59
25% Quantile:         1.31    1.77
75% Quantile:         1.61    2.00
95% Quantile:         1.80    2.14
Mean value:           1.45    1.88
Standard deviation:   0.22    0.17
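
Characteristics of this kind can be computed from a list of 2000 scores with Python's standard statistics module, as sketched below. The function name summarize is hypothetical. The quantile interpolation method and the use of the sample (rather than population) standard deviation are assumptions; the spreadsheet may use different conventions, which could shift the last digit.

```python
import statistics


def summarize(scores):
    """Return the characteristics listed in the table for a list of scores."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(scores, n=20, method="inclusive")
    return {
        "Minimum": min(scores),
        "Maximum": max(scores),
        "Median": statistics.median(scores),
        "5% Quantile": cuts[0],
        "25% Quantile": cuts[4],
        "75% Quantile": cuts[14],
        "95% Quantile": cuts[18],
        "Mean value": statistics.mean(scores),
        "Standard deviation": statistics.stdev(scores),  # sample std. dev.
    }


# Made-up demo data (assumption), not the experimental scores.
summary = summarize([1.2, 1.5, 1.4, 1.9, 1.6, 1.3, 1.7, 1.8])
```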

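From these observations we can estimate the power and the predictive value of the bigram rate (for sequences of 10 bigrams). Here it suffices to note that 1.80 is the 95% quantile of the random distribution and lies between the 25% quantile (1.77) and the median (1.89) of the true distribution. A threshold value of 1.80 will therefore eliminate 95% of all false combinations and miss about 30% of all true combinations.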
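
As a rough plausibility check, the two rates can be approximated from the means and standard deviations in the table. The sketch below assumes that both score distributions are approximately normal. That is an assumption made only for this illustration; the figures in the text are read off the empirical distributions.

```python
from statistics import NormalDist

THRESHOLD = 1.80

# Mean and standard deviation taken from the table above;
# the normal shape is an assumption.
random_scores = NormalDist(mu=1.45, sigma=0.22)
true_scores = NormalDist(mu=1.88, sigma=0.17)

# Share of random sequences scoring below the threshold (eliminated).
eliminated_false = random_scores.cdf(THRESHOLD)
# Share of true sequences scoring below the threshold (missed).
missed_true = true_scores.cdf(THRESHOLD)

print(f"false combinations eliminated: {eliminated_false:.0%}")
print(f"true combinations missed:      {missed_true:.0%}")
```

Under this approximation the shares come out at about 94% and 32%, in line with the empirical 95% and roughly 30%.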


Author: Klaus Pommerening, 2014-Jul-22; last change: 2014-Jul-27.