CSc 8370 fall '02


Practicing Statistics without a Licsence.

or

Statistical Measures of Languages



We need to be able to identify when a string is in a (human) language. It is also important to be able to recognize when two strings have been encrypted by the same algorithm in the same key. We've seen that doing this by hand is rather tiresome. Therefore we need to examine computable measures that we can use. We assume that if two strings are characterized by identical distributions, then they come from the same source. The measures listed below describe several ways of characterizing the distributions.


  1. Incidence of Coincidence or Ú

    where ^ is 1 when t equals t' and 0 otherwise.

    k R1 and is only equal to 1 when the two texts are identical. If the sources are characterized by a probability distribution p, then the expected value for k is the sum of p2. If the source is random (or if the sources are random with respect to each other) then the expected value is just .

  2. Measures of smoothness for the distribution (Î,Ï)

    Î,Ï measure the smoothness of the distribution. Uniformly distributed symbols are smooth, and meaningfully distributed symbols are not (to first approximation). For both of these functions one calculates the empirical distribution mi which is just the number of times the symbol is symbol i divided by the total number of symbols in the string (i.e. the size of the string).

    (some authors normalize in this formula, in which case it is m/M where M is the size of the string).

    (again this is for normalized m, for un-normalized the mi M would be just mi)

    These formulae appear to be "magic", but are actually based in approximations to the entropy of a distribution (information theory). The information contained in a distribution can be defined as -log(p). This is "simply" a useful way of describing how many "bits" are needed to represent the distribution at a given fidelity. It is also a very useful way to handle joint distributions because the logarithms add when the probabilities multiply. The expected value of the information (-p ln p) is identified as the entropy and this is maximized when the distribution is mostly flat. Chernoff noted that and therefore x ln(x) < x(x-1).

  3. Measures of errors in the distribution (æ2)

An empirical distribution m can always be compared to a theoretical distribution p by a æ2 test.

    (unscaled version would replace p by p*M).

This will be minimal at a correct decryption.

  1. How much text is enough? (unicity distance).

The unicity distance is the size of the message which has a unique key. It can be estimated by using the entropy of the distribution p. It depends both on the underlying clear language and the cipher system. For single substitution in English it is about 25 characters.


where Z is the number of possible cipher combinations. 3.5 is a constant chosen to make U= 25 when Z = 26!. With Ceaser ciphers Z = 26 (for N=26 letters), for monoalphabetic Z = 26!, for dialphabetic Z = (262!). With block keyed ciphers like DES where there are 2n keys then Z is 2n.

Problem Set, In class and out.

Write programs to derive monomer probabilities and pair probabilities. It would also be good to write a program to calculate Phi. The file 3boat10.txt is a relevant source of plaintext.


  1. (the file classcipher) is this a mono alphabetic substitution? Prove it.

    vvgfxxdfaaavaxdfdvdgvvdgfxdvdxvxgaavfgdfvaagfvfggdgavfdfafgxdvafgfgaxdagfggxddvxfxvvaxavvvggxdgvggvxvxvgdgvxdxgdvxgxdvvvxgdaaagavxddaxgdafdvvdaxgxdggavvfxgxdaddadxggvfavxfaxvgxgaavxvgvgffxdgagdgaxxggaaadxvvdvdgvgdadxvdfddafgdaadvvvdvgdfdxvxdvfavxgadvdfxxgxaavvaffgxxfdvggfvgdvgvvavvgadvavvgavfavvvgagddvxvagxdvdffgaavxvxgxgxddxdgfagggaxavdxaadgxdddgxxffgdggaadavvffxgfvaaavavaxgvavxxaxgfgddxavxgdavvggxdddaavfddddvagagggvgggaxdfffdgadagvvadaffggfagxavfgxdfvxfvadaggfdvgxaggfaxdavxvvaaaaaafdvvxxadxfdvxdvxgfagagafvgdvdgddvvagggdddvdfgagaaadadgadffvaaafxdvgadgdggffffadgfdggvvdadagadxgafaagfadgaavgvfaffvavagfvvfdvaxgvdaggaagfxaxdgxvxfxgaaffgfxgxgxvgxvvvgfvggfgdgavavaadxvvxaxxvfvdvaxdvdgaaadgxgvadavdaafdvaxdaaggxafdvxxgavvggvfxfavxxvvafvgvvgfdvfddgfvaaggfgvvvvfdvfvddvxgdgggvvggggdfxfagfffvgvdxvvfgddxaaadvavgaaaafdvgfdvaxdfgvaxgxafavvadvggdfgxvfgavvagggavxxfgdgfvvavvdvfddavaaxxdaxxfvxagfdvvvvgxgvgfaaxvfgfxgddvdgavxfadavvgvvafaxgvgxvvdfvgvadavfdaaggxavdffvadvvgddxgvfvafddaxdvadvgdxdvdxdddavvdadvaadvxvgdvavvvvagdagfvxdvfgaxaxgffffv









  1. It could be a variant of adfgvx. Is it permuted or not? Prove it.

  2. Side information tells you that if it is a transposition then it is an incomplete columnar transposition. Use the Phi test to find which period (i.e. what size column or row) was used.




















4) It may have come from 3boat10.txt. How would you check this without solving the cipher? Write a program and try it. (There are several fast algorithm that will identify too many places and a slower algorithm that will only find correct answers).