Cryptology
I.3 Some Statistical Properties of Languages
Contents
- Recognizing plaintext: Friedman's Most-Frequent-Letters test
[mathematical description as PDF]
- Empirical results on MFL scores
- Application to the cryptanalysis of the Bellaso cipher
- Recognizing plaintext: Sinkov's Log-Weight test
[mathematical description as PDF]
- Recognizing plaintext: The Log-Weight method for bigrams
[mathematical description as PDF]
- Empirical results on BLW scores
- Coincidences of two texts [mathematical description as PDF]
with Perl program
- Empirical values for natural languages
- Autocoincidence of a text
[mathematical description as PDF]
with Perl program
[online call]
- The inner coincidence index of a text
[mathematical description as PDF], Perl program
- The distribution of the inner coincidence index [PDF]
with Perl program
- Sinkov's formula [PDF]
- Sinkov's test for the period [PDF],
Perl program,
testing a short ciphertext
- Kullback's Cross-Product Sum statistic [PDF]
(with a side remark on Cohen's kappa), Perl program
- Adjusting the columns of a disk cipher
[PDF],
example
- Modeling language by a stochastic process [PDF]
- Message sources and their coincidence indices and cross-product sums
- Stochastic languages [PDF]
Here is the mathematical part as a single PDF.
Summary
In this section we study certain statistical properties of texts and languages.
These help to answer questions such as:
- Does a given text belong to a certain language?
Can we derive an algorithm for automatically distinguishing valid plaintext from
random noise? This is one of the central problems of cryptanalysis.
- Do two given texts belong to the same language?
- Can we also decide these questions for encrypted texts? Which properties
of texts are invariant under certain encryption procedures? Can we distinguish
encrypted plaintext from random noise?
- Is a given ciphertext monoalphabetically encrypted? Or polyalphabetically
with periodic repetition of alphabets? If so, what is the period?
- How can we adjust the alphabets in the columns of a periodic cipher,
or of several ciphertexts that are encrypted with the same key and
correctly aligned?
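As a small illustration of a property that is invariant under an encryption procedure, the following sketch computes the inner coincidence index of a text, using the standard definition (the sum of n(n-1) over all letter counts n, divided by r(r-1) for a text of r letters), and checks that a Caesar shift, as one example of a monoalphabetic substitution, leaves it unchanged. The function names and the sample text are assumptions made for this illustration only; they are not taken from the Perl programs listed above.

```python
from collections import Counter
import string


def inner_coincidence_index(text: str) -> float:
    """Fraction of letter pairs (at different positions) that coincide."""
    # Keep only the 26 letters A..Z; everything else is ignored.
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    r = len(letters)
    if r < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(n * (n - 1) for n in counts.values()) / (r * (r - 1))


def caesar_shift(text: str, k: int) -> str:
    """Monoalphabetic substitution: shift every letter by k positions."""
    alphabet = string.ascii_uppercase
    table = str.maketrans(alphabet, alphabet[k:] + alphabet[:k])
    return text.upper().translate(table)


# Illustrative sample text (an assumption, far too short for real statistics).
sample = ("IN THIS SECTION WE STUDY CERTAIN STATISTICAL PROPERTIES "
          "OF TEXTS AND LANGUAGES")
plain_index = inner_coincidence_index(sample)
cipher_index = inner_coincidence_index(caesar_shift(sample, 3))
print(plain_index, cipher_index)  # the two values agree
```

The substitution only renames the letters, so the multiset of letter counts, and with it the index, stays the same. For uniformly random letters over a 26-letter alphabet the expected value is 1/26, about 0.038; natural languages typically score noticeably higher.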
To get useful information on these questions we define some statistical measures
and analyze their distributions. The main methods for determining
reference values are:
- Exact calculation. This works for languages with exact descriptions,
but is hopeless for natural languages.
- Modelling. We try to build a simplified model of a language, based on
letter frequencies etc. and hope that the model on the one hand approximates
the statistical properties of the language closely enough, and on the other
hand is simple enough that it allows the calculation of the relevant statistics.
The two most important models are:
- the computer-science model that regards a language as a fixed set of
strings with certain statistical properties,
- the stochastic model that regards a language as a finite stationary Markov
process. This model essentially goes back to Claude
Shannon in the 1940s, after at least 20 years of naive but successful use by the
Friedman school.
- Simulation. We take a large sample of texts from a language and determine
the characteristic reference numbers by counting. In this way we find empirical
approximations to the distributions and their characteristic properties.
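The simulation approach, combined with the simplest frequency-based model, can be sketched as follows: count the letters in a sample, turn the counts into relative frequencies, and derive a reference value from them, here the sum of the squared frequencies, which is the probability that two independently drawn letters coincide. This is a minimal sketch, not a definitive implementation. The names `letter_frequencies` and `expected_coincidence` are hypothetical, and the tiny `corpus` string is only a stand-in for the large text sample that a real estimate requires.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase


def letter_frequencies(corpus: str) -> dict:
    """Relative frequency of each letter A..Z, determined by counting."""
    letters = [c for c in corpus.upper() if c in ALPHABET]
    if not letters:
        raise ValueError("corpus contains no letters")
    counts = Counter(letters)
    total = len(letters)
    return {c: counts[c] / total for c in ALPHABET}


def expected_coincidence(freqs: dict) -> float:
    """Probability that two letters drawn independently from freqs agree."""
    return sum(p * p for p in freqs.values())


# Stand-in sample (an assumption); use a large corpus for usable estimates.
corpus = ("We take a large sample of texts from a language and determine "
          "the characteristic reference numbers by counting.")
freqs = letter_frequencies(corpus)
kappa = expected_coincidence(freqs)
print(round(kappa, 4))
```

For a uniform distribution over 26 letters the value is exactly 1/26; any non-uniform distribution gives a larger value, which is what makes the statistic useful for distinguishing language from random noise.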
The systematic statistical approach to cryptanalysis was started by
William F. Friedman (1891–1969)
around 1920. It was extended by his assistants
Solomon Kullback (1907–1994) and
Abraham Sinkov (1907–1998)
in the 1920s and 1930s.
The statistical methodology has developed further since then and now provides a uniform conceptual
framework for statistical tests and decisions.
For a systematic treatment of the first question above and for a comparison of
several tests, a good reference is the following pair of papers:
Ravi Ganesan, Alan T. Sherman,
Statistical Techniques for Language Recognition:
An Introduction and Guide for Cryptanalysts. Cryptologia 17 (1993), 321–366.
Statistical Techniques for Language Recognition:
An Empirical Study Using Real and Simulated English. Cryptologia 18 (1994), 289–331.
An elementary but mathematically sound introduction to probability and statistics is
A. M. Gleason:
Elementary Course in Probability for the Cryptanalyst.
Laguna Hills: Aegean Park Press 1985.
Author: Klaus Pommerening, 1999-Nov-27;
last change: 2019-Oct-29.