
Cryptology

Application of the Kappa Distribution


We want to apply the (theoretical and empirical) results on the distribution of κ-values to the questions asked in the introduction.

To test whether two texts a, b belong to the same language we calculate their coincidence index κ(a, b) and compare it with the value for random texts. For texts of length 100 over the alphabet A...Z we decide for »same language« as soon as κ(a, b) ≥ 0.070, the 95% quantile for random texts. This gives a »5% error probability of the first kind«: in the long run, in 5% of all cases we would erroneously declare a random constellation as non-random. This is a common error level, tolerated even in tests of medical treatments, and it matches the common notion of »significance« of a test result.
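The decision rule can be sketched in a few lines of Python. This is a minimal sketch, assuming that κ(a, b) is the proportion of positions at which two equally long texts carry the same letter; the function names `kappa` and `same_language` are illustrative and not taken from the text, and the default threshold 0.070 is only appropriate for texts of length 100 over A...Z.

```python
def kappa(a: str, b: str) -> float:
    """Proportion of positions at which two equally long texts coincide.

    Assumption: this is the definition of the coincidence index of two texts.
    """
    if len(a) != len(b):
        raise ValueError("texts must have the same length")
    if not a:
        raise ValueError("texts must be non-empty")
    coincidences = sum(x == y for x, y in zip(a, b))
    return coincidences / len(a)


def same_language(a: str, b: str, threshold: float = 0.070) -> bool:
    """Decide for 'same language' as soon as kappa reaches the threshold.

    The default 0.070 is the quantile quoted for random texts of length 100;
    other lengths or alphabets need a different threshold.
    """
    return kappa(a, b) >= threshold
```

For two texts of length 100 the threshold corresponds to at least 7 coincidences.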

On the other hand we want to avoid »errors of the second kind«, that is, erroneous decisions for »random« although the texts belong to the same language. To assess the probability of this kind of false decision we have to look at the distribution of κ-values for the language (and text length) in question.

Our empirical observations tell us that for English 65% of text pairs of length 100 fall below the limit of 0.070, and for German 48%. These are the error probabilities of the second kind. The complementary proportions, the probabilities of correctly deciding for »same language«, are called the power of the test. Therefore our κ-test has power 35% for English and 52% for German at text length 100.
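The relation between the empirical distribution and the power can be made explicit in a small sketch. It assumes that a list of κ-values for pairs of texts from the same language has already been measured; the function name `empirical_power` is hypothetical.

```python
from typing import Iterable


def empirical_power(kappas: Iterable[float], threshold: float = 0.070) -> float:
    """Fraction of same-language text pairs whose kappa reaches the threshold.

    The complement, 1 - power, is the empirical error probability
    of the second kind.
    """
    values = list(kappas)
    if not values:
        raise ValueError("need at least one kappa value")
    detected = sum(k >= threshold for k in values)
    return detected / len(values)
```

If 65% of the measured values fall below the threshold, this function returns 0.35.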

To test whether a text a belongs to a certain language we would take one (or maybe several) fixed texts of the language and test a against them. Because the κ-values of different natural languages are quite similar, this test only makes sense for distinguishing a language from random text, not for distinguishing one language from another.

The adjustment of the columns of a disk cipher could also be tested this way: If two alphabets are shifted relative to each other, the corresponding columns behave like random texts with respect to each other. If the alphabets are properly adjusted, the columns represent meaningful texts encrypted by the same monoalphabetic substitution; therefore they belong to the same language and show the typical coincidence index up to statistical noise. Note that we need quite long columns for this test to work in a sensible way!
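A minimal sketch of this column test, under the assumption that the primary alphabet of the disk is known (here simply A...Z) so that one column can be shifted along it: for each possible shift we compute κ against a reference column and look for a value that stands out from the random level. The names `shift_text` and `kappa_by_shift` are illustrative only.

```python
import string


def kappa(a: str, b: str) -> float:
    """Proportion of positions at which two equally long texts coincide."""
    if len(a) != len(b) or not a:
        raise ValueError("texts must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shift_text(text: str, s: int, alphabet: str = string.ascii_uppercase) -> str:
    """Shift every letter of text by s places along the given alphabet."""
    n = len(alphabet)
    index = {c: i for i, c in enumerate(alphabet)}
    return "".join(alphabet[(index[c] + s) % n] for c in text)


def kappa_by_shift(reference: str, column: str,
                   alphabet: str = string.ascii_uppercase) -> list[float]:
    """kappa(reference, column shifted by s) for every shift s = 0, 1, ..."""
    return [kappa(reference, shift_text(column, s, alphabet))
            for s in range(len(alphabet))]
```

With real ciphertext columns the correct shift shows only the language-typical κ, not a value near 1, so short columns give no reliable decision.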

In the following sections we'll see some better tests for these decision problems. The main application of the coincidence index in its pure form is detecting identically encrypted polyalphabetic ciphertexts. Moreover it is the basis of some refined methods.


Author: Klaus Pommerening, 2013-Dec-20; last change: 2014-Jan-23.