Consider two texts of equal length and count the number k of positions where they coincide, that is, show the same letters. Divide k by the length r of the texts. This gives you the coincidence index of the two texts:
κ(a, b) = k/r, where a and b denote the two texts.
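The definition translates almost directly into code. The following is a minimal Python sketch, not the Perl program mentioned in the text; the function name `coincidence_index` is chosen here for illustration, and the texts are assumed to be already reduced to plain letters of equal length.

```python
def coincidence_index(a: str, b: str) -> float:
    """Return kappa(a, b) = k / r for two texts of equal length r."""
    if len(a) != len(b):
        raise ValueError("texts must have equal length")
    if not a:
        raise ValueError("texts must be non-empty")
    # k = number of positions where both texts show the same letter
    k = sum(1 for x, y in zip(a, b) if x == y)
    return k / len(a)


# First 20 letters of the two Kipling texts used in the example:
# 8 coincidences in 20 positions.
print(coincidence_index("IFYOUCANKEEPYOURHEAD", "IFYOUCANMAKEONEHEAPO"))  # 0.4
```

The function refuses texts of unequal length instead of truncating silently, so the cut has to be made explicitly by the caller.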
The following examples illustrate this concept. For a mathematical treatment see here. For a Perl program see here.
We compare the first four verses (text 1) of the poem »If ...« by Rudyard Kipling and the next four verses (text 2).
[The lengths differ, so we cut the longer one.]
IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR
IFYOU CANMA KEONE HEAPO FALLY OURWI NNING SANDR ISKIT ONONE
||||| |||                                        |

SANDB LAMIN GITON YOUIF YOUCA NTRUS TYOUR SELFW HENAL LMEND
TURNO FPITC HANDT OSSAN DLOOS EANDS TARTA GAINA TYOUR BEGIN
                                  | |

OUBTY OUBUT MAKEA LLOWA NCEFO RTHEI RDOUB TINGT OOIFY OUCAN
NINGS ANDNE VERBR EATHE AWORD ABOUT YOURL OSSIF YOUCA NFORC
                                                 |

WAITA NDNOT BETIR EDBYW AITIN GORBE INGLI EDABO UTDON TDEAL
EYOUR HEART ANDNE RVEAN DSINE WTOSE RVEYO URTUR NLONG AFTER
          |                       |

INLIE SORBE INGHA TEDDO NTGIV EWAYT OHATI NGAND YETDO NTLOO
THEYA REGON EANDS OHOLD ONWHE NTHER EISNO THING INYOU EXCEP
                                             |

KTOOG OODNO RTALK TOOWI SEIFY OUCAN DREAM ANDNO TMAKE DREAM
TTHEW ILLWH ICHSA YSTOT HEMHO LDONI FYOUC ANTAL KWITH CROWD
 |                       |                ||           |

SYOUR MASTE RIFYO UCANT HINKA NDNOT MAKET HOUGH TSYOU RAIMI
SANDK EEPYO URVIR TUEOR WALKW ITHKI NGSNO RLOOS ETHEC OMMON
|                          |

FYOUC ANMEE TWITH TRIUM PHAND DISAS TERAN DTREA TTHOS ETWOI
TOUCH IFNEI THERF OESNO RLOVI NGFRI ENDSC ANHUR TYOUI FALLM
         |  |                                   |

MPOST ORSAS THESA MEIFY OUCAN BEART OHEAR THETR UTHYO UVESP
ENCOU NTWOR THYOU BUTNO NETOO MUCHI FYOUC ANFIL LTHEU NFORG
            ||                                   ||

OKENT WISTE DBYKN AVEST OMAKE ATRAP FORFO OLSOR WATCH THETH
IVING MINUT EWITH SIXTY SECON DSWOR THOFD ISTAN CERUN YOURS
   |   |                               |

INGSY OUGAV EYOUR LIFEF ORBRO KENAN DSTOO PANDB UILDE MUPWI
ISTHE EARTH ANDEV ERYTH INGTH ATSIN ITAND WHICH ISMOR EYOUL
|                                 |

THWOR NOUTT OOLS
LBEAM ANMYS ON
            |
Text length: | 562 (554 if we omit the first 8 letters) |
Coincidences: | 35 (27) |
Coincidence index: | 35/562 = 0.0623 (0.0487) |
In the mathematical part we show that the expected value of the coincidence index is 1/n as soon as at least one of the two texts is »random«. This value is 0.0385 for our standard alphabet with n = 26.
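The expected value 1/n can be checked empirically. The sketch below is a small Monte Carlo experiment, not part of the mathematical treatment: it compares a fixed meaningful text with many random texts and averages the coincidence indices. The function name `mean_kappa_vs_random`, the number of trials, and the seed are assumptions made for this illustration, and »random« is taken to mean independent, uniformly distributed letters.

```python
import random
import string

ALPHABET = string.ascii_uppercase  # n = 26 letters A...Z


def mean_kappa_vs_random(text: str, trials: int = 2000, seed: int = 1) -> float:
    """Average coincidence index of `text` against `trials` random texts."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    r = len(text)
    total = 0.0
    for _ in range(trials):
        # Random text of the same length, letters drawn uniformly
        rand = [rng.choice(ALPHABET) for _ in range(r)]
        k = sum(1 for x, y in zip(text, rand) if x == y)
        total += k / r
    return total / trials


# First 50 letters of text 1 from the Kipling example
sample = "".join("IFYOU CANKE EPYOU RHEAD WHENA LLABO UTYOU ARELO OSING THEIR".split())
print(round(mean_kappa_vs_random(sample), 4), "vs 1/26 =", round(1 / 26, 4))
```

The average should come out close to 0.0385 although the first text is far from random; only one of the two texts needs to be random.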
For the 26-letter alphabet A...Z the variance is ≈ 0.03698/r for texts of length r (that is, (1/n)(1 − 1/n)/r with n = 26), and the standard deviation is ≈ 0.19231/√r. From this we get the second row of the following table.
r | 10 | 40 | 100 | 400 | 1000 | 10000 |
Std dev | 0.0608 | 0.0304 | 0.0192 | 0.0096 | 0.0061 | 0.0019 |
95% quantile | 0.1385 | 0.0885 | 0.0700 | 0.0543 | 0.0485 | 0.0416 |
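The second row of the table can be recomputed in a few lines. The sketch below assumes the simple model in which the coincidences at the r positions are independent events of probability 1/n, so that the standard deviation of κ is √((1/n)(1 − 1/n)/r); for n = 26 this equals 0.19231/√r. The function name `kappa_std_dev` is hypothetical.

```python
import math


def kappa_std_dev(r: int, n: int = 26) -> float:
    """Standard deviation of kappa for text length r and alphabet size n,
    assuming independent coincidences of probability 1/n per position."""
    p = 1.0 / n
    return math.sqrt(p * (1.0 - p) / r)


for r in (10, 40, 100, 400, 1000, 10000):
    print(r, round(kappa_std_dev(r), 4))
```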
For statistical tests (one-sided in this case) we would like to know the 95% quantiles. Approximating by a normal distribution, that is, taking »mean value + 1.645 times standard deviation«, gives the values in the third row of the table. These rough estimates show that the κ-statistic in this form is weak at distinguishing »meaningful« texts from random texts, even for text lengths of 100 letters, and strong only for texts of several thousand letters. For the cryptanalyst, however, only success counts, and even a weak test can give the decisive hint in a concrete situation.
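As an illustration, the one-sided test can be applied to the Kipling example. The following sketch uses the normal approximation described above with the factor 1.645; the function names `kappa_quantile_95` and `exceeds_quantile` are hypothetical, and the binomial-type standard deviation √((1/n)(1 − 1/n)/r) is an assumption consistent with the constant 0.19231/√r.

```python
import math

Z_95 = 1.645  # one-sided 95% point of the standard normal distribution


def kappa_quantile_95(r: int, n: int = 26) -> float:
    """Approximate 95% quantile of kappa: mean + 1.645 * standard deviation."""
    p = 1.0 / n
    return p + Z_95 * math.sqrt(p * (1.0 - p) / r)


def exceeds_quantile(k: int, r: int, n: int = 26) -> bool:
    """True if the observed coincidence index k/r lies above the 95% quantile."""
    return k / r > kappa_quantile_95(r, n)


print(exceeds_quantile(35, 562))  # full texts: True
print(exceeds_quantile(27, 554))  # first 8 letters omitted: False
```

With all 562 letters the observed value 0.0623 lies above the approximate quantile for this length, but after dropping the eight identical opening letters the value 0.0487 no longer does. This is a concrete instance of how weak the test is at a length of a few hundred letters.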
Nevertheless, looking for a better test makes sense.