Cryptology

The Kappa Distribution for English Texts

We want to learn more about the distribution of the coincidence indices κ(a, b) of English texts (or text chunks) a and b. To this end we take a large English text, in this case the book The Poisoned Pen by Arthur B. Reeve from Project Gutenberg (which, by the way, contains a cryptogram), and chop it into chunks a, b, c, d, ... of 100 letters each. Then we compute κ(a, b), κ(c, d), ... and list the values in the first column of a spreadsheet for easy evaluation. Here is the text after removing organizational addenda.
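The chunking and counting procedure can be sketched in a few lines of Python. This is only a minimal sketch, not the spreadsheet workflow used for the figures on this page. It assumes that κ(a, b) is the fraction of positions at which two equally long texts carry the same letter, and that preprocessing means folding to lower case and dropping everything except the letters a to z. The function names (`letters_only`, `chunks`, `coincidences`, `kappa`, `pair_counts`) are made up for illustration.

```python
import re


def letters_only(text):
    """Keep only the letters a-z, folded to lower case (assumed preprocessing)."""
    return re.sub(r"[^a-z]", "", text.lower())


def chunks(letters, r=100):
    """Split into consecutive chunks of exactly r letters; an incomplete tail is dropped."""
    return [letters[i:i + r] for i in range(0, len(letters) - r + 1, r)]


def coincidences(a, b):
    """Number of positions at which two equally long texts carry the same letter."""
    if len(a) != len(b):
        raise ValueError("texts must have equal length")
    return sum(x == y for x, y in zip(a, b))


def kappa(a, b):
    """Coincidence index: coincidence count divided by the common length."""
    return coincidences(a, b) / len(a)


def pair_counts(text, r=100):
    """Coincidence counts for the disjoint chunk pairs (a, b), (c, d), ..."""
    cs = chunks(letters_only(text), r)
    return [coincidences(cs[i], cs[i + 1]) for i in range(0, len(cs) - 1, 2)]
```

Applied to the cleaned book text, `pair_counts(text)` would yield one integer per chunk pair; dividing each by 100 gives the corresponding κ value.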

In fact we also record the raw coincidence counts as integers. This makes it easier to draw a histogram without discretization artefacts; to get coincidence indices, divide the x-values by 100.
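Why integer counts help can be seen from a small tally: each count is its own histogram bin, so no bin boundaries have to be chosen. The following sketch assumes the counts are already available as a list of integers. The sample values are hypothetical and are not data from the book, and the name `histogram` is made up for illustration.

```python
from collections import Counter


def histogram(counts):
    """Tally how often each integer coincidence count occurs, sorted by count."""
    tally = Counter(counts)
    return [(k, tally[k]) for k in sorted(tally)]


# Hypothetical sample counts for chunks of 100 letters, not data from the book.
sample = [6, 7, 6, 5, 9, 6, 7]

for k, freq in histogram(sample):
    # Dividing the x-value k by 100 gives the coincidence index.
    print(f"{k:3d}  kappa = {k / 100:.2f}  {'*' * freq}")
```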

The text has 449163 letters. This gives 2245 text pairs of length 2 × 100, of which we take the first 2000. The following figure and table show some characteristics of the distribution.
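The number of pairs follows from integer division of the letter count by the pair length. A short check, using only the figures stated above:

```python
letters = 449163   # total number of letters, as stated in the text
r = 100            # chunk length

# Each pair consumes 2 * r letters; the remainder is too short for another pair.
pairs, leftover = divmod(letters, 2 * r)
print(pairs, leftover)  # 2245 163
```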

[Frequency of coincidences]

Distribution of κ for 2000 English text pairs of 100 letters

Minimum:       0.00
Median:        0.06    Mean value:    0.0669
Maximum:       0.25    Standard dev:  0.0272
1st quartile:  0.05    5% quantile:   0.0300
3rd quartile:  0.08    95% quantile:  0.1200
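Characteristics of this kind can be computed from the integer coincidence counts with Python's standard `statistics` module. This is a sketch under assumptions: the source does not say which quantile convention or which standard deviation (sample or population) the spreadsheet used, so the choices below (`method="inclusive"` and the sample standard deviation) are guesses, and results on the real data may differ slightly from the table. The name `summarize` is made up for illustration.

```python
import statistics


def summarize(counts, r=100):
    """Summary statistics of kappa = count / r for a list of integer counts."""
    kappas = [c / r for c in counts]
    # Quartiles: three cut points for n=4 (assumed interpolation convention).
    q1, _, q3 = statistics.quantiles(kappas, n=4, method="inclusive")
    # 19 cut points for n=20; the first is the 5% and the last the 95% quantile.
    twentieths = statistics.quantiles(kappas, n=20, method="inclusive")
    return {
        "minimum": min(kappas),
        "median": statistics.median(kappas),
        "maximum": max(kappas),
        "1st quartile": q1,
        "3rd quartile": q3,
        "mean": statistics.mean(kappas),
        "standard dev": statistics.stdev(kappas),  # sample version, an assumption
        "5% quantile": twentieths[0],
        "95% quantile": twentieths[-1],
    }
```

Calling `summarize(counts)` on the list of 2000 counts would return a dictionary with the same nine characteristics as the table.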

Author: Klaus Pommerening, 2013-Dec-20; last change: 2014-Jan-23.