

Problem: Word Frequencies

This problem will introduce you to text processing.

Consider these questions:

  1. How many times does a particular word, e.g., rabbit, appear in Alice's Adventures in Wonderland?
  2. What word appears most frequently?
  3. What are the ten most frequently used words?
  4. How many words are used only once?
  5. Can we tell anything about the subject matter of the text by investigating words with a certain range of frequencies?

As an introduction, we will take a sentence from the old days of typing classes and list its words in the order in which each first occurs. Here is the code, written to be run from the command line.

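<?php
$line = "Now is the time for all good men to come to the aid of their party.";
// Split the sentence into words at each space.
$lineofwords = explode(" ", $line);
// $ooo collects each distinct word in order of occurrence.
$ooo = array();
foreach($lineofwords as $word )
{
   // Add the word only the first time it is seen.
   if( !in_array($word, $ooo) )
      $ooo[] = $word;
}
echo " # Word\n";
foreach($ooo as $n => $w)
{
   // Subscripts start at 0, so add 1 for display.
   printf("%2d %s\n", $n + 1, $w);
}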

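The array subscripts, which automatically start at 0, record the order in which the words are first encountered in the string; the code adds 1 to each subscript when printing. Repeated words ("to" and "the") are listed only once. The output generated will be the following.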

 # Word
 1 Now
 2 is
 3 the
 4 time
 5 for
 6 all
 7 good
 8 men
 9 to
10 come
11 aid
12 of
13 their
14 party.
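
Notice that the last entry above keeps its period, and that a comparison of raw strings would treat "Now" and "now" as different words. The following is a minimal sketch of one way to clean up the words before counting them. The helper name tokenize() and the choice to keep only letters and apostrophes are assumptions made for illustration, not part of the workshop code.

```php
<?php
// Sketch only: tokenize() is a hypothetical helper, not part of the workshop code.
function tokenize(string $text): array
{
    // Lowercase so that "The" and "the" count as the same word.
    $text = strtolower($text);
    // Split on runs of characters that are not letters or apostrophes;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces left at either end.
    return preg_split("/[^a-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);
}

$line = "Now is the time for all good men to come to the aid of their party.";
$words = tokenize($line);
// The final word no longer carries the period.
echo count($words), " words, last one is '", end($words), "'\n";
```

Whether to fold case or strip punctuation depends on the questions being asked, so treat this as one option among several.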

To turn this into a useful program, we would like to get information about a file such as

You are now ready to use the computer to help you analyze text files.

The first four of the original questions can easily be answered if we sort the data, i.e., put it in alphabetical or numerical order. The last question requires analysis and would probably entail comparing information gleaned from using your program on a number of different texts.
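
As a rough illustration of how sorting answers the first four questions, here is a short sketch using PHP's built-in array functions. The sample text and the variable names are assumptions chosen for the example; a real run would use the words read from a file.

```php
<?php
// Sketch only: the sample text and variable names are illustrative assumptions.
$text = "the cat and the dog and the bird";
$words = explode(" ", $text);

// Map each distinct word to the number of times it occurs.
$freq = array_count_values($words);

// Question 1: how often does one particular word appear?
$theCount = $freq["the"] ?? 0;

// Questions 2 and 3: sort by count, largest first, keeping the word keys.
arsort($freq);
$mostFrequent = array_key_first($freq);      // requires PHP 7.3 or later
$topTen = array_slice($freq, 0, 10, true);   // at most ten entries

// Question 4: keep only the words that occur exactly once.
$usedOnce = array_filter($freq, function ($n) { return $n === 1; });

printf("'the' appears %d times\n", $theCount);
printf("most frequent word: %s\n", $mostFrequent);
printf("%d words are used only once\n", count($usedOnce));
```

For an alphabetical listing, ksort($freq) could be used in place of arsort($freq).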

This then is your task: Give the user the opportunity to selectively view organized data about a text file.

  1. Finding the File
  2. Specifying the Data Display to Be Done

To see how this could work, view this possible solution.
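
Separately from the solution linked above, the two steps of the task can be sketched for the command line as follows. This is only one possible arrangement: the function name report(), the argument order (file name first, then a display mode), and the mode names "alpha" and "count" are all assumptions made for this example.

```php
<?php
// Sketch only: report() and the argument layout are assumptions,
// not the workshop's solution.
function report(string $text, string $mode): array
{
    // Lowercase the text and split it into words of letters and apostrophes.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $freq = array_count_values($words);
    if ($mode === "alpha") {
        ksort($freq);    // alphabetical order by word
    } else {
        arsort($freq);   // numerical order, most frequent first
    }
    return $freq;
}

// Step 1: find the file. Step 2: specify the display to be done.
if (PHP_SAPI === "cli" && isset($argv[1])) {
    $file = $argv[1];
    $mode = $argv[2] ?? "count";
    if (!is_readable($file)) {
        fwrite(STDERR, "Cannot read $file\n");
        exit(1);
    }
    foreach (report(file_get_contents($file), $mode) as $word => $n) {
        printf("%6d %s\n", $n, $word);
    }
}
```

Saved under a name of your choosing, the script would take a text file name and an optional mode as its command-line arguments.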

Notes


© 2005 DFStermole
Created 14 Dec 05
Modified 14 Dec 05