

Character Frequency Graph Problem I

During World War II, the German use of Enigma to encrypt sensitive transmissions posed a major security problem for the British.

One step that helps in developing decryption algorithms is to determine how often each letter of the alphabet occurs in a sufficiently large corpus (body) of texts. The results of this analysis can then be used to make best guesses when trying to decrypt a coded message.

In this problem and its companion, you will tally the frequencies of the letters in two texts of radically different lengths: O Canada and Alice's Adventures in Wonderland. The former file is only 335 characters long, while the latter is 153KB.

You will use the text of O Canada to develop a simple version of your program. It should produce the following output.

Pointers

Use the graphing technique that was developed when simulating 300 dice rolls.
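
As a reminder of that technique, here is a minimal sketch of an asterisk graph. The program name, the variable names and the tallies themselves are made up for illustration (they are not the results of any real dice rolls), and a Turbo Pascal or Free Pascal style compiler is assumed.

```pascal
program GraphSketch;

const
  Faces = 6;

var
  Counts: array[1..Faces] of Integer;
  Face, Star: Integer;

begin
  { Made-up tallies standing in for the results of real dice rolls. }
  Counts[1] := 4;
  Counts[2] := 7;
  Counts[3] := 5;
  Counts[4] := 0;
  Counts[5] := 6;
  Counts[6] := 3;

  { One row per face: a label, then one asterisk per tallied roll. }
  for Face := 1 to Faces do
  begin
    Write(Face:1, ' ');
    for Star := 1 to Counts[Face] do
      Write('*');
    WriteLn;
  end;
end.
```

A face with a tally of zero simply gets an empty row, because the inner loop does not run at all when its upper bound is below 1.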

To make the O Canada text available to your program, set up a 10-element array of strings, one element for each line. Hard set the song in the array.
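
One possible way to set up such an array is sketched below. To keep the sketch short it holds only the first two lines of the song, so LineCount is 2 rather than 10. The names Song, LineCount and Total are illustrative, and the string type assumes a Turbo Pascal or Free Pascal style compiler.

```pascal
program SongArraySketch;

const
  LineCount = 2;  { the assignment calls for 10; shortened for this sketch }

var
  Song: array[1..LineCount] of string;
  I, Total: Integer;

begin
  { Only the opening two lines are filled in here. }
  Song[1] := 'O Canada!';
  Song[2] := 'Our home and native land!';

  { Echo each line and add up the number of characters stored. }
  Total := 0;
  for I := 1 to LineCount do
  begin
    WriteLn(Song[I]);
    Total := Total + Length(Song[I]);
  end;
  WriteLn('Characters stored: ', Total);
end.
```

Printing the total gives a quick check that the whole text was typed in. The count for the full song may differ somewhat from the size of the original file, since line endings are not stored in the strings.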

To tally the frequencies of the various letters, set up a 26-element array using capital letters as the subscripts.
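
The tallying step might look something like the sketch below, which counts the letters in a single sample string rather than the whole song. The names Freq, Sample, Letter and Ch are illustrative. UpCase, which is available in Turbo Pascal and Free Pascal, is used so that lower-case letters are counted with their capitals; whether the assignment wants the two cases combined is an assumption here.

```pascal
program TallySketch;

var
  Freq: array['A'..'Z'] of Integer;  { capital letters as subscripts }
  Sample: string;
  Letter, Ch: Char;
  I: Integer;

begin
  { Start every tally at zero. }
  for Letter := 'A' to 'Z' do
    Freq[Letter] := 0;

  Sample := 'O Canada!';

  { Look at each character; skip anything that is not a letter. }
  for I := 1 to Length(Sample) do
  begin
    Ch := UpCase(Sample[I]);
    if (Ch >= 'A') and (Ch <= 'Z') then
      Freq[Ch] := Freq[Ch] + 1;
  end;

  { Report only the letters that actually occurred. }
  for Letter := 'A' to 'Z' do
    if Freq[Letter] > 0 then
      WriteLn(Letter, ': ', Freq[Letter]);
end.
```

For the sample string the tallies come to three for A and one each for C, D, N and O. The space and the exclamation mark are ignored by the range test.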

Compare these frequencies with the points on the tiles used in the game Scrabble.

  1. What surprises do you find?
  2. Why do you think there are discrepancies?

In the output above, one asterisk was printed for each occurrence of each letter. This works well when a relatively small text is used. However, a text this short distorts the data, so we cannot generalize from it to the English language as a whole. For one thing, five of the letters do not occur at all.

N.B. To deal with a much larger text, you will need to learn some new programming techniques. You obviously do not want to hard set 153KB of text into your code. Handling files will be dealt with on another day.


© DFStermole 2008
Created 24 Mar 08
Modified 30 Mar 08