Cryptology
The Kappa Distribution for English Texts
We want to learn more about the distribution of coincidence indices κ(a, b) for English texts (or text chunks) a and b. To this end we take a large English text, in this case the book The Poisoned Pen by Arthur B. Reeve from Project Gutenberg (which, by the way, contains a cryptogram), and chop it into chunks a, b, c, d, ... of 100 letters each. Then we compute κ(a, b), κ(c, d), ... and list the values in the first column of a spreadsheet for easy evaluation. The text is used after clearing organizational addenda.
In fact we also record the raw coincidence counts as integers. This makes it easier to draw a histogram without generating discretization artefacts; to get coincidence indices, divide the x-values by 100.
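The chunking and counting procedure described above can be sketched as follows. This is a minimal illustration, not the spreadsheet workflow actually used. It assumes that κ(a, b) is the number of positions at which two equally long texts agree, divided by their length, and that only the letters A to Z are kept, with case ignored. The function names `letters_only`, `coincidences`, `kappa`, and `pair_counts` are hypothetical.

```python
import re


def letters_only(text):
    """Keep only the letters A-Z, upper-cased (assumed preprocessing)."""
    return re.sub(r"[^A-Z]", "", text.upper())


def coincidences(a, b):
    """Count the positions at which two equally long texts agree."""
    if len(a) != len(b):
        raise ValueError("texts must have the same length")
    return sum(x == y for x, y in zip(a, b))


def kappa(a, b):
    """Coincidence index: coincidence count divided by the text length."""
    return coincidences(a, b) / len(a)


def pair_counts(text, chunk_len=100):
    """Chop the text into chunks of chunk_len letters and return the
    integer coincidence counts for the pairs (a, b), (c, d), ...

    An incomplete final chunk and an unpaired final chunk are dropped.
    """
    s = letters_only(text)
    chunks = [s[i:i + chunk_len]
              for i in range(0, len(s) - chunk_len + 1, chunk_len)]
    return [coincidences(chunks[i], chunks[i + 1])
            for i in range(0, len(chunks) - 1, 2)]
```

Under these assumptions, a text of 449163 letters yields 4491 full chunks and therefore 2245 pairs. Keeping the integer counts from `pair_counts` and dividing by the chunk length only at the end avoids binning floating-point values when drawing a histogram.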
The text has 449163 letters. We get 2245 text pairs of length 2 × 100. We take the first 2000 of them. The following figure and table show some characteristics of the distribution.
Distribution of κ for 2000 English text pairs of 100 letters
| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Minimum | 0.00 | | |
| Median | 0.06 | Mean value | 0.0669 |
| Maximum | 0.25 | Standard deviation | 0.0272 |
| 1st quartile | 0.05 | 5% quantile | 0.0300 |
| 3rd quartile | 0.08 | 95% quantile | 0.1200 |
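Characteristics of this kind can be computed from the integer coincidence counts with Python's standard `statistics` module, as in the hedged sketch below. The helper name `summarize` is hypothetical. The quantile convention (`method="inclusive"`) and the use of the sample standard deviation are assumptions, since the text does not say which conventions the spreadsheet used, so results on the real data may differ slightly from the table.

```python
import statistics


def summarize(counts, chunk_len=100):
    """Summary statistics of kappa values derived from integer counts.

    counts    -- coincidence counts, one per text pair (at least two)
    chunk_len -- chunk length used to turn counts into kappa values
    """
    kappas = [c / chunk_len for c in counts]
    # Cut points for quartiles (n=4) and for 5% steps (n=20).
    quartiles = statistics.quantiles(kappas, n=4, method="inclusive")
    twentieths = statistics.quantiles(kappas, n=20, method="inclusive")
    return {
        "minimum": min(kappas),
        "1st quartile": quartiles[0],
        "median": statistics.median(kappas),
        "3rd quartile": quartiles[2],
        "maximum": max(kappas),
        "mean": statistics.mean(kappas),
        # Sample standard deviation (assumption; pstdev is the alternative).
        "standard deviation": statistics.stdev(kappas),
        "5% quantile": twentieths[0],
        "95% quantile": twentieths[-1],
    }
```

Calling `summarize(counts[:2000])` on the list of counts for the first 2000 pairs would give the analogue of the table above.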