Cryptology

Coincidences of Two Texts

Definition

Consider two texts of equal length r and count the number k of positions where they coincide, that is, where both texts show the same letter. Dividing k by r gives the coincidence index of the two texts:

κ(a, b) = k/r
if the texts are denoted by a and b.
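
The definition translates almost directly into code. The following is a minimal sketch in Python, not the Perl program referred to on this page; the function name `kappa` and the choice to reject texts of unequal length are assumptions made for illustration.

```python
def kappa(a: str, b: str) -> float:
    """Coincidence index of two texts of equal length: kappa(a, b) = k / r."""
    if len(a) != len(b):
        raise ValueError("texts must have equal length")
    if not a:
        raise ValueError("texts must be non-empty")
    # k = number of positions where both texts show the same letter
    k = sum(1 for x, y in zip(a, b) if x == y)
    return k / len(a)


# The first ten letters of the two texts in Example 1: 8 of 10 positions coincide.
print(kappa("IFYOUCANKE", "IFYOUCANMA"))  # 0.8
```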

The following examples illustrate this concept. For a mathematical treatment see here. For a Perl program see here.


Example 1: Two English Texts

We compare the first four verses (text 1) of the poem »If ...« by Rudyard Kipling and the next four verses (text 2).

[The lengths differ, so we cut the longer one.]
IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR
IFYOU CANMA KEONE HEAPO FALLY OURWI NNING SANDR ISKIT ONONE
||||| |||                                        |           
SANDB LAMIN GITON YOUIF YOUCA NTRUS TYOUR SELFW HENAL LMEND
TURNO FPITC HANDT OSSAN DLOOS EANDS TARTA GAINA TYOUR BEGIN
                                  | |                      
OUBTY OUBUT MAKEA LLOWA NCEFO RTHEI RDOUB TINGT OOIFY OUCAN
NINGS ANDNE VERBR EATHE AWORD ABOUT YOURL OSSIF YOUCA NFORC
                                                 |         
WAITA NDNOT BETIR EDBYW AITIN GORBE INGLI EDABO UTDON TDEAL
EYOUR HEART ANDNE RVEAN DSINE WTOSE RVEYO URTUR NLONG AFTER
          |                       |                        
INLIE SORBE INGHA TEDDO NTGIV EWAYT OHATI NGAND YETDO NTLOO
THEYA REGON EANDS OHOLD ONWHE NTHER EISNO THING INYOU EXCEP
                                             |             
KTOOG OODNO RTALK TOOWI SEIFY OUCAN DREAM ANDNO TMAKE DREAM
TTHEW ILLWH ICHSA YSTOT HEMHO LDONI FYOUC ANTAL KWITH CROWD
 |                       |                ||           |   
SYOUR MASTE RIFYO UCANT HINKA NDNOT MAKET HOUGH TSYOU RAIMI
SANDK EEPYO URVIR TUEOR WALKW ITHKI NGSNO RLOOS ETHEC OMMON
|                          |                               
FYOUC ANMEE TWITH TRIUM PHAND DISAS TERAN DTREA TTHOS ETWOI
TOUCH IFNEI THERF OESNO RLOVI NGFRI ENDSC ANHUR TYOUI FALLM
         |  |                                   |          
MPOST ORSAS THESA MEIFY OUCAN BEART OHEAR THETR UTHYO UVESP
ENCOU NTWOR THYOU BUTNO NETOO MUCHI FYOUC ANFIL LTHEU NFORG
            ||                                   ||        
OKENT WISTE DBYKN AVEST OMAKE ATRAP FORFO OLSOR WATCH THETH
IVING MINUT EWITH SIXTY SECON DSWOR THOFD ISTAN CERUN YOURS
   |   |                               |                   
INGSY OUGAV EYOUR LIFEF ORBRO KENAN DSTOO PANDB UILDE MUPWI
ISTHE EARTH ANDEV ERYTH INGTH ATSIN ITAND WHICH ISMOR EYOUL
|                                 |                        
THWOR NOUTT OOLS
LBEAM ANMYS ON
            | 
Text length: 562 (554 if we omit the first 8 letters)
Coincidences: 35 (27)
Coincidence index: 35/562 = 0.0623 (0.0487)
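
The preparation steps used in this example (keep only the letters A...Z, cut the longer text, count matching positions) can be sketched as follows. The helper names `normalize` and `count_coincidences` are hypothetical, and only the first row of the comparison above is used as input, so this does not recount the whole example.

```python
def normalize(text: str) -> str:
    """Upper-case the text and keep only the letters A...Z."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def count_coincidences(text1: str, text2: str) -> tuple[int, int]:
    """Return (k, r): number of coincidences and the common length used."""
    a, b = normalize(text1), normalize(text2)
    r = min(len(a), len(b))  # cut the longer text
    k = sum(1 for x, y in zip(a[:r], b[:r]) if x == y)
    return k, r


# First row of the comparison in Example 1.
row1 = "IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR"
row2 = "IFYOU CANMA KEONE HEAPO FALLY OURWI NNING SANDR ISKIT ONONE"
k, r = count_coincidences(row1, row2)
print(k, r)  # 9 50

# The totals quoted above, recomputed from the stated counts.
print(round(35 / 562, 4), round(27 / 554, 4))  # 0.0623 0.0487
```

The 9 coincidences found in the first row agree with the 9 marks under that row.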

Expected Values

In the mathematical part we show that the expected value of the coincidence index is 1/n as soon as at least one of the two texts is »random«. This value is 0.0385 for our standard alphabet with n = 26. For example:

Values that differ significantly from these mean values are suspicious for the cryptanalyst: they could have a non-random cause. For more precise statements we should assess the variances (or standard deviations) or, more generally, the distribution of κ-values in certain »populations« of texts. This is done in the mathematical part.
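
The expected value 1/n can be checked empirically. The sketch below compares one fixed text with many random texts and averages the resulting κ-values. It is an illustration only: »random« is taken here to mean letters drawn independently and uniformly from A...Z, and the seed, the number of trials, and the function names are arbitrary choices.

```python
import random
import string


def kappa(a: str, b: str) -> float:
    """Coincidence index of two texts of equal length."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def random_text(r: int, rng: random.Random) -> str:
    """r letters drawn independently and uniformly from A...Z."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(r))


def mean_kappa(fixed: str, trials: int, rng: random.Random) -> float:
    """Average kappa between a fixed text and `trials` random texts."""
    total = 0.0
    for _ in range(trials):
        total += kappa(fixed, random_text(len(fixed), rng))
    return total / trials


rng = random.Random(2002)  # fixed seed so that the run is repeatable
fixed = "IFYOUCANKEEPYOURHEADWHENALLABOUTYOU"  # start of text 1 in Example 1
estimate = mean_kappa(fixed, 5000, rng)
print(f"estimate {estimate:.4f}, expected {1 / 26:.4f}")
```

Only one of the two texts is random here; the other is ordinary English. The estimate should nevertheless come out close to 1/26.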

For the 26-letter alphabet A...Z the variance is ≈ 0.03698/r for texts of length r, and the standard deviation is ≈ 0.19231/√r. From this we get the second row of the following table.

r               10       40       100      400      1000     10000
Std dev         0.0608   0.0304   0.0192   0.0096   0.0061   0.0019
95% quantile    0.1385   0.0885   0.0700   0.0543   0.0485   0.0416

For statistical tests (one-sided in this case) we would like to know the 95% quantiles. Taking the values for a normal distribution as approximations, that is, »mean value + 1.645 times standard deviation«, we get the values in the third row of the table. These rough estimates show that the κ-statistic in this form is weak at distinguishing »meaningful« texts from random texts, even for text lengths of 100 letters, and strong only for texts of several thousand letters. For the cryptanalyst, however, the only thing that counts is success, and even a weak test can give the decisive hint in a concrete situation.
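
The two computed rows of the table can be reproduced with a few lines of code. This sketch assumes a simple binomial model in which each position coincides independently with probability 1/n; that model yields the standard deviation 0.19231/√r quoted above for n = 26, but the derivation actually used is the one in the mathematical part. The function names are hypothetical.

```python
import math

N = 26        # size of the alphabet A...Z
Z_95 = 1.645  # one-sided 95% point of the standard normal distribution


def std_dev(r: int, n: int = N) -> float:
    """Standard deviation of kappa for text length r (binomial model, assumed)."""
    p = 1.0 / n
    return math.sqrt(p * (1.0 - p) / r)


def quantile_95(r: int, n: int = N) -> float:
    """Normal approximation: mean value + 1.645 times standard deviation."""
    return 1.0 / n + Z_95 * std_dev(r, n)


for r in (10, 40, 100, 400, 1000, 10000):
    print(f"{r:6d}  {std_dev(r):.4f}  {quantile_95(r):.4f}")

# The same estimate for the text length of Example 1.
print(f"{quantile_95(562):.4f}")
```

The printed values agree with the table to within one unit in the fourth decimal place. For r = 562 the estimated quantile is about 0.052, so under this rough approximation the index 0.0623 from Example 1 lies above it, whereas the index 0.0487 obtained after omitting the first 8 letters lies below the corresponding value for r = 554.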

Nevertheless, looking for a better test makes sense.


Author: Klaus Pommerening, 2002-May-20; last change: 2014-Jan-23.