
Cryptology

Recognizing Plaintext: FRIEDMAN's Most-Frequent-Letters Test

We begin with the first question: Does a given text belong to a certain language? FRIEDMAN gave a quite simple procedure for distinguishing valid text from random noise that works surprisingly well, even for short texts. Besides, it makes a smooth introduction to statistical test theory.


FRIEDMAN's Procedure

Assume we are given a string of letters and want to decide whether it is a part of a meaningful text (in a given language, say English), or whether it is random gibberish. Our first contact with this problem was the exhaustion attack against the simple shift cipher that produced 26 strings, exactly one of which represented the correct solution. Cherry-picking it was easy by visual inspection. But for automating this decision procedure we would prefer a quantitative criterion.

Such a criterion was proposed by FRIEDMAN in Riverbank Publication No. 16 from 1918. The procedure is:

  1. Identify a set of most frequent letters from the target language. For English take ETOANIRSHD; these letters make up about 74% of an average English text but only 10/26 ≈ 38.5% of a random string.
  2. Count the cumulative frequencies of these most-frequent letters for each of the candidate strings.
  3. Pick the string with the highest score. If this doesn't work, also consider the next highest scores.
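The three steps can be sketched in a few lines of Python (a sketch; the function names are mine, only the letter set ETOANIRSHD is from the text):

```python
# Step 1: the ten most frequent English letters (from the procedure above).
MFL_ENGLISH = set("ETOANIRSHD")

def mfl_score(text: str) -> int:
    """Step 2: cumulative count of most-frequent letters in the string."""
    return sum(1 for c in text.upper() if c in MFL_ENGLISH)

def rank_candidates(candidates: list[str]) -> list[str]:
    """Step 3: candidates ordered by MFL score, best first."""
    return sorted(candidates, key=mfl_score, reverse=True)
```

For instance, mfl_score("CAESAR") yields 5 while mfl_score("FDHVDU") yields 3, in agreement with the example below.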
Example.
For the CAESAR example in Section 1.3 the scores are shown in the following table. We immediately see that the correct solution CAESAR has the highest score (even though it is not a genuine English word).
       FDHVDU  3       OMQEMD  3         XVZNVM  1
       GEIWEV  3       PNRFNE  4 <---    YWAOWN  3
       HFJXFW  1       QOSGOF  3         ZXBPXO  1
       IGKYGX  1       RPTHPG  3         AYCQYP  1
       JHLZHY  2       SQUIQH  3         BZDRZQ  2
       KIMAIZ  3       TRVJRI  4 <---    CAESAR  5 <===
       LJNBJA  2       USWKSJ  2         DBFTBS  3
       MKOCKB  1       VTXLTK  2         ECGUCT  2
       NLPDLC  2       WUYMUL  0

The example shows that FRIEDMAN's procedure seems to work well even for quite short strings.
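The whole table can be reproduced by scoring all 26 shifts of the ciphertext FDHVDU (a sketch; the helper shift and the variable names are mine):

```python
MFL = set("ETOANIRSHD")  # the ten most frequent English letters

def shift(text: str, k: int) -> str:
    """Shift each letter of an A-Z string cyclically by k positions."""
    return "".join(chr((ord(c) - ord("A") + k) % 26 + ord("A")) for c in text)

# All 26 candidate decryptions of the CAESAR example and their MFL scores.
candidates = [shift("FDHVDU", k) for k in range(26)]
scores = {cand: sum(c in MFL for c in cand) for cand in candidates}

best = max(scores, key=scores.get)
print(best, scores[best])  # CAESAR 5
```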

To confirm this observation we analyze the distribution of the Most-Frequent-Letters scores (MFL scores for short) for strings of natural languages and for random strings. In the mathematical version of this section we consider this task from a theoretic viewpoint; in the next section we also perform some empirical evaluations.


Results of the Theoretic Analysis

English, 10 Letter Texts

We set the threshold value at 4; that means we classify a 10 letter text as »random« if its MFL score is at most 4, and as »English« if its MFL score is at least 5. Then (in the long run) we'll miss 2.4% of all true plaintexts, because the probability of an English 10 letter text string having an MFL score ≤ 4 is 0.024. On the other hand we'll discard only 67% of all random strings and erroneously keep the other 33%.

In other words, FRIEDMAN's MFL method, interpreted as a statistical test (for the »null hypothesis« of English text against the »alternative hypothesis« of random text), has a power of ≈ 67% for English text strings of length 10 with an error rate of 2.4%.
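These figures can be reproduced with a simple model that I assume to be behind the theoretic analysis: treat the MFL score of a 10 letter string as binomially distributed, with per-letter hit probability 0.74 for English (the fraction quoted above) and 10/26 for random strings. Treating the letters as independent draws is of course a simplification, but it matches the quoted values:

```python
from math import comb

def binom_cdf(n: int, p: float, t: int) -> float:
    """P(X <= t) for a binomially distributed score X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

# Length n = 10, threshold T = 4:
print(round(binom_cdf(10, 0.74, 4), 3))   # 0.024  error rate: English misclassified
print(round(binom_cdf(10, 10/26, 4), 3))  # 0.671  power: random correctly discarded
```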

There is another (»Bayesian«) way to look at the decision problem. The predictive values give the probabilities that texts are actually what we decide them to be.

If we decide »random« for texts with MFL score ≤ 4, we'll be correct for about 671 of 1000 random texts and err for 24 of 1000 English texts. This makes 695 decisions for random of which 671 are correct. The predictive value of our »random« decision is 96.5% ≈ 671/695.

The decision »English« for an MFL score > 4 will be correct for 976 of 1000 English texts and false for 329 of 1000 random texts. Hence the predictive value of the decision »English« is about 75% ≈ 976/1305. That means that if we pick up texts (of length 10) with a score of at least 5, then (in the long run) one out of four selected texts will be random.
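The bookkeeping behind the two predictive values is plain arithmetic (the counts per 1000 texts are the ones above; equal proportions of English and random texts are assumed):

```python
# Counts per 1000 English and 1000 random texts of length 10, threshold T = 4.
english_kept, english_missed = 976, 24    # MFL score > 4 vs. <= 4
random_discarded, random_kept = 671, 329

# Fraction of each decision that is correct.
pv_random = random_discarded / (random_discarded + english_missed)
pv_english = english_kept / (english_kept + random_kept)
print(round(pv_random, 3), round(pv_english, 3))  # 0.965 0.748
```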

German, 10 Letter Texts

The ten most frequent letters are ENIRSATDUH. They make up 75% of an average German text.

With a threshold of T = 4 and an error rate of 1.9% the MFL-test has a power of 67%. The predictive value for »German« is 75%.

French, 10 Letter Texts

The ten most frequent letters are EASNTIRULO. They make up 79% of an average French text.

With a threshold of T = 5 and an error rate of 3.9% the MFL-test has a power of 86%. The predictive value for »French« is 87%.

English, 20 Letter Texts

With a threshold of T = 10 and an error rate of 1.9% the MFL-test has a power of 90% and a predictive value of 91%.

German, 20 Letter Texts

With a threshold of T = 11 and an error rate of 4.0% the MFL-test has a power of 96% and a predictive value of 96%.

French, 20 Letter Texts

With a threshold of T = 12 and an error rate of 4.1% the MFL-test has a power of 98.5% and a predictive value of 98.5%.
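All six parameter sets can be checked against a simple binomial sketch (assuming, as a simplification, that letters are drawn independently, with the MFL fractions 74%, 75% and 79% quoted above for English, German and French):

```python
from math import comb

def binom_cdf(n: int, p: float, t: int) -> float:
    """P(X <= t) for a binomially distributed score X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

P_RANDOM = 10 / 26  # hit probability of a ten-letter MFL set in random text

# (language, MFL fraction, text length n, threshold T) as given in the text.
CONFIGS = [
    ("English", 0.74, 10, 4), ("German", 0.75, 10, 4), ("French", 0.79, 10, 5),
    ("English", 0.74, 20, 10), ("German", 0.75, 20, 11), ("French", 0.79, 20, 12),
]

for lang, p, n, t in CONFIGS:
    error = binom_cdf(n, p, t)          # plaintext misclassified as random
    power = binom_cdf(n, P_RANDOM, t)   # random correctly discarded
    print(f"{lang:7} n={n:2} T={t:2}  error {error:.3f}  power {power:.3f}")
```

The printed values agree with the quoted error rates and powers up to rounding and the crudeness of the model.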

Remark

The threshold values are chosen so that the error rate is at most 5%. We want to bound the number of missed true plaintexts at a very low level, since a missed plaintext renders the complete cryptanalysis worthless.

On the other hand, keeping too many random strings increases the effort of the analysis, but this is of somewhat less concern. A predictive value of 75% is perfectly acceptable.

As a summary we note:

The MFL score performs reasonably well in distinguishing English (or German or French) texts from random gibberish for 10 letter texts.
It performs excellently for texts of 20 letters.

In the next section we support these results with some empirical data.


Author: Klaus Pommerening, 2014-Jun-10; last change: 2014-Jun-10.