Cryptology

Recognizing Plaintext: The Log-Weight Method for Bigrams

Definition

In the last four sections we used only the single-letter frequencies of a natural language. In other words, we treated texts as sequences of independent letters. But a characteristic feature of every natural language is how letters combine into bigrams (letter pairs). We may therefore hope to get good criteria for recognizing a language by evaluating the bigrams in a text. Of course this applies to contiguous text only; in particular, it is useless for the polyalphabetic example of Sections 3 and 4.

We start from the frequencies of the bigrams in natural languages. Matrices that contain the relative bigram frequencies for English, German, and French are available as comma-separated tables in the files eng_rel.csv, ger_rel.csv, and fra_rel.csv.

The mathematical version of this section shows how to derive corresponding bigram log-weights. These are in the files eng_blw.csv, ger_blw.csv, fra_blw.csv.
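
The following Python sketch shows one way such a comma-separated matrix could be read into a lookup table. The layout is an assumption, since the text does not describe it: a plain 26 x 26 matrix without header row or column, where the row gives the first letter and the column the second letter of the bigram, both in A to Z order. The function name load_bigram_table is hypothetical, and the demo uses a synthetic matrix rather than the real file contents.

```python
import csv
import io

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def load_bigram_table(stream):
    """Read a 26 x 26 comma-separated matrix into a dict keyed by bigram.

    Assumed layout (not confirmed by the text): no header row or column,
    row i = first letter, column j = second letter, both in A..Z order.
    """
    rows = [row for row in csv.reader(stream) if row]
    if len(rows) != 26 or any(len(row) != 26 for row in rows):
        raise ValueError("expected a 26 x 26 matrix")
    return {
        first + second: float(rows[i][j])
        for i, first in enumerate(ALPHABET)
        for j, second in enumerate(ALPHABET)
    }

# Synthetic stand-in for a real file: all zeros except the entry for "ER".
matrix = [["0"] * 26 for _ in range(26)]
matrix[ALPHABET.index("E")][ALPHABET.index("R")] = "2.2"
demo = io.StringIO("\n".join(",".join(row) for row in matrix))
table = load_bigram_table(demo)
print(table["ER"], table["RE"])
```

With a real file, the stream would come from something like open("eng_blw.csv", newline=""). If the file turns out to have header labels, the size check fails loudly instead of silently misreading the data.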

To calculate the Bigram Log-Weight (BLW) score of a string, we go through its bigrams and add up the log-weight of each one. This approach is somewhat naive because it implicitly treats the bigrams, even the overlapping ones, as independent. This criticism does not mean that we are doing something mathematically wrong, only that the score might be less useful than expected.

Here is the score for the English string LETTER:

1.9 (LE) + 1.9 (ET) + 1.7 (TT) + 2.0 (TE) + 2.2 (ER) = 9.7
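
The summation can be sketched in a few lines of Python. This is an illustration only, not the code of the Perl programs mentioned in this section. The table SAMPLE_BLW is a hypothetical excerpt that contains just the five English log-weights quoted in the LETTER example, and giving unknown bigrams the weight 0 is an assumption.

```python
# Hypothetical excerpt of an English bigram log-weight table: only the five
# values quoted in the LETTER example are included.
SAMPLE_BLW = {"LE": 1.9, "ET": 1.9, "TT": 1.7, "TE": 2.0, "ER": 2.2}

def blw_score(text, weights, default=0.0):
    """Add up the log-weights of all overlapping bigrams of text.

    Bigrams missing from the table count as `default` (an assumption).
    """
    text = text.upper()
    # zip(text, text[1:]) pairs each letter with its successor.
    return sum(weights.get(a + b, default) for a, b in zip(text, text[1:]))

print(round(blw_score("LETTER", SAMPLE_BLW), 1))  # prints 9.7
```

A string of fewer than two letters has no bigrams and therefore gets the score 0.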

Programs that compute BLW scores for English, German, or French are BLWscE.pl, BLWscD.pl, BLWscF.pl.


The CAESAR Example

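As an example we compute the scores for all 26 shifts of the Caesar example; see the following table. The correct solution, CAESAR, gets the highest score in all three languages, although in German the margin over GEIWEV is narrow (7.5 versus 7.3).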

        BLW scores English    German      French
          FDHVDU    1.4        3.1         2.2
          GEIWEV    5.8 <---   7.3 <===    4.3
          HFJXFW    0.9        0.3         0.0
          IGKYGX    2.2        2.1         1.3
          JHLZHY    0.5        1.9         0.3
          KIMAIZ    5.9 <---   5.2         4.9
          LJNBJA    1.1        2.4         0.9
          MKOCKB    2.7        4.2         0.8
          NLPDLC    3.0        2.8         1.4
          OMQEMD    3.5        3.8         3.6
          PNRFNE    3.6        4.7         3.6
          QOSGOF    5.8 <---   4.0         3.4
          RPTHPG    4.5        2.6         2.7
          SQUIQH    2.3        0.6         6.3 <---
          TRVJRI    4.1        4.3         4.9
          USWKSJ    3.3        3.7         2.0
          VTXLTK    1.3        2.0         1.1
          WUYMUL    3.1        2.9         2.7
          XVZNVM    0.6        1.3         1.0
          YWAOWN    5.5        2.3         0.0
          ZXBPXO    0.0        0.0         0.0
          AYCQYP    3.2        0.0         0.3
          BZDRZQ    1.0        2.1         1.1
          CAESAR    7.7 <===   7.5 <===    8.4 <===
          DBFTBS    4.7        3.5         0.6
          ECGUCT    5.5        3.6         5.5
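
The selection step behind the table can be sketched in Python as follows. The functions caesar_shifts and best_candidate are hypothetical names chosen for this illustration. Because the log-weight files are not reproduced here, the sketch does not recompute the scores: the dictionary ENGLISH_SCORES simply copies the English column of the table above.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_shifts(text):
    """Return the 26 Caesar shifts of text, starting with the text itself."""
    return [
        "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in text)
        for k in range(26)
    ]

def best_candidate(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

# English BLW scores, copied from the table above (not recomputed here).
ENGLISH_SCORES = {
    "FDHVDU": 1.4, "GEIWEV": 5.8, "HFJXFW": 0.9, "IGKYGX": 2.2,
    "JHLZHY": 0.5, "KIMAIZ": 5.9, "LJNBJA": 1.1, "MKOCKB": 2.7,
    "NLPDLC": 3.0, "OMQEMD": 3.5, "PNRFNE": 3.6, "QOSGOF": 5.8,
    "RPTHPG": 4.5, "SQUIQH": 2.3, "TRVJRI": 4.1, "USWKSJ": 3.3,
    "VTXLTK": 1.3, "WUYMUL": 3.1, "XVZNVM": 0.6, "YWAOWN": 5.5,
    "ZXBPXO": 0.0, "AYCQYP": 3.2, "BZDRZQ": 1.0, "CAESAR": 7.7,
    "DBFTBS": 4.7, "ECGUCT": 5.5,
}

candidates = caesar_shifts("FDHVDU")
print(best_candidate(candidates, lambda c: ENGLISH_SCORES[c]))  # prints CAESAR
```

In practice the lambda would be replaced by a function that computes the BLW score from a full log-weight table.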

Author: Klaus Pommerening, 2014-Jun-10; last change: 2014-Jun-10.