Analyzing a Large Text I

Using Parallel Arrays

The programs you wrote to investigate the frequencies of occurrence of numerical digits and then letters taught you to process a file and learn something about its content. They may have seemed trivial, but they laid the foundation for this project, which will acquaint you with one of the tasks performed by linguists and text analysts: investigating the frequencies of words used in a text.

Your previous programs used the data items (digits or letters) being investigated as subscripts for the frequency counters. In a frequency analysis of the words (instead of characters) in a text, this is not possible in Pascal, because array subscripts must be of an ordinal type and strings are not. Instead, you will initially use parallel arrays to keep track of the words and how often they appear in the text.

Your programming task will be divided into three stages.

  1. Store each unique word as an element of an array of words, and use the corresponding element of a frequency array to count how many times that word occurs in the text.
  2. Use sorting to make the data useful (see Sorting Parallel Arrays).
  3. Use the programmer-defined record data structure to make explicit the close relationship between a word and its frequency (see Sorting Records).

The remainder of this page deals only with the first stage.

For the arrays to be passed as parameters, you are required to declare programmer-defined types. The following const and type sections make this possible:

const
   LO = 1;
   MAX = 2000;
type
   freqarrayType = array [LO..MAX] of integer;
   wordarrayType = array [LO..MAX] of string[20];

These declarations allow you to declare variables using your own special variable types for the arrays freq and words. Among other variables, you will want to declare the following:

var
   freq : freqarrayType;  {Array of all unique word frequencies}
   words : wordarrayType; {Array of all unique words}
   numch : longint;       {Number of chars in file}
   wordsunique : integer; {Number of unique words in file}
   wordcount : integer;   {Number of words in file}
   ratio : real;          {Ratio for bars in graph}
   biggest : integer;     {Biggest frequency}
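
As a rough illustration of how the two parallel arrays work together, here is a minimal sketch that counts a handful of words. It is not the required solution. The structure chart below gives only the name and return type of WordAlready, so its parameter list here is a guess, and wordType, RecordWord, NSAMPLE and SAMPLE are hypothetical names invented for this example. The sketch simply ignores new words once the arrays are full; note that the sample statistics further down report 2704 unique words, which is more than MAX = 2000 can hold, so you may need to adjust MAX for a large text.

```pascal
program ParallelArraysSketch;

const
   LO = 1;
   MAX = 2000;
   NSAMPLE = 6;             {hypothetical: size of the test data below}
type
   wordType = string[20];   {hypothetical helper type for a single word}
   freqarrayType = array [LO..MAX] of integer;
   wordarrayType = array [LO..MAX] of string[20];
const
   {hypothetical test data standing in for words read from a file}
   SAMPLE : array [1..NSAMPLE] of wordType =
      ('the', 'cat', 'saw', 'the', 'other', 'cat');
var
   freq : freqarrayType;
   words : wordarrayType;
   wordsunique : integer;
   wordcount : integer;
   i : integer;

{Return the subscript of w among the first unique entries of list,
 or 0 (which is below LO) if w has not been stored yet.}
function WordAlready (var list : wordarrayType; unique : integer;
                      w : wordType) : integer;
var
   i : integer;
begin
   WordAlready := 0;
   for i := LO to unique do
      if list[i] = w then
         WordAlready := i
end;

{Count one occurrence of w; store it first if it is a new word.}
procedure RecordWord (var list : wordarrayType; var counts : freqarrayType;
                      var unique : integer; w : wordType);
var
   spot : integer;
begin
   spot := WordAlready(list, unique, w);
   if spot > 0 then
      counts[spot] := counts[spot] + 1   {same subscript in both arrays}
   else if unique < MAX then             {guard against running off the end}
   begin
      unique := unique + 1;
      list[unique] := w;
      counts[unique] := 1
   end
end;

begin
   wordsunique := 0;
   wordcount := 0;
   for i := 1 to NSAMPLE do
   begin
      RecordWord(words, freq, wordsunique, SAMPLE[i]);
      wordcount := wordcount + 1
   end;
   writeln('Total words: ', wordcount);
   writeln('Unique words: ', wordsunique);
   for i := LO to wordsunique do
      writeln(words[i], ' : ', freq[i])
end.
```

With the six sample words, this reports 6 total words and 4 unique words, listing "the" and "cat" with a count of 2 and "saw" and "other" with a count of 1. The arrays are passed as var parameters using the named types, which is why the type section is needed.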

The overall structure of your program will be as follows. The subprograms marked with f are to be functions, while the others are to be procedures.

  main
      |Initialize
      |Introduction
      |GetFilename
      |ProcessFile
                  |f WordAlready : integer
      |f LargestFreq : integer
      |f BarRatio : real
      |PrintStats
      |Menu
           |PrintOrigOrder
                          |PrintTable
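
The two functions that support the bar graph might be sketched as follows. This is only one plausible reading of the chart: the parameter lists are guesses, and BARWIDTH (the longest bar you are willing to print) is an assumed constant that does not appear in the assignment. The idea is that the most frequent word gets a bar of BARWIDTH stars and every other word gets a proportionally shorter one.

```pascal
program BarRatioSketch;

const
   LO = 1;
   MAX = 2000;
   BARWIDTH = 48;   {assumed maximum bar length; choose what fits your screen}
type
   freqarrayType = array [LO..MAX] of integer;
var
   freq : freqarrayType;
   biggest : integer;
   ratio : real;
   stars : integer;
   i, j : integer;

{Return the largest value among the first unique counters.}
function LargestFreq (var counts : freqarrayType; unique : integer) : integer;
var
   i, best : integer;
begin
   best := 0;
   for i := LO to unique do
      if counts[i] > best then
         best := counts[i];
   LargestFreq := best
end;

{Return the number of stars that one occurrence is worth.}
function BarRatio (largest : integer) : real;
begin
   if largest > 0 then
      BarRatio := BARWIDTH / largest
   else
      BarRatio := 0.0        {empty file: avoid dividing by zero}
end;

begin
   {three made-up frequencies standing in for a processed file}
   freq[1] := 3;
   freq[2] := 202;
   freq[3] := 1634;

   biggest := LargestFreq(freq, 3);
   ratio := BarRatio(biggest);
   writeln('The greatest frequency for any word is ', biggest, '.');
   writeln('The ratio to be used for the bar graph is ', ratio:0:3, '.');

   for i := 1 to 3 do
   begin
      stars := round(freq[i] * ratio);   {scale the count to a bar length}
      for j := 1 to stars do
         write('*');
      writeln(' ', freq[i])
   end
end.
```

Whether you round or truncate when converting a scaled frequency to a number of stars is a design choice; rounding is used here only for illustration.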

For this first stage, the menu will only permit two choices: (1) print the frequency bar graph using the order in which the words were originally encountered and (2) quit the program.

Your output should look like the following, which I created by processing Alice's Adventures in Wonderland. I obtained this text from the Project Gutenberg web site. You can use any text of 100K or more that you like. To avoid a lot of needless typing, you may wish to find a text on the Internet.

The file being processed is a:\alice30.txt.
There are 155741 characters.
There are 27346 total words.
There are 2704 unique words.
The greatest frequency for any word is 1634.
The ratio to be used for the bar graph is 0.029.

How would you like the data displayed?
1: Order of first occurrence
Q: Quit

This is a frequency chart for the words in a:\alice30.txt,
listed in order of first occurrence.

   1 ALICE               :                                                   3
   2 S                   :******                                           202
   3 ADVENTURES          :                                                   1
   4 IN                  :                                                   2
   5 WONDERLAND          :                                                   1
   6 Lewis               :                                                   1
   7 Carroll             :                                                   1
   8 THE                 :                                                   9
   9 MILLENNIUM          :                                                   1
  10 FULCRUM             :                                                   1
  11 EDITION             :                                                   1
  12 3                   :                                                   1
  13 0                   :                                                   1
  14 CHAPTER             :                                                  12
  15 I                   :****************                                 543
  16 Down                :***                                              102
  17 The                 :*********************************************** 1634
  18 Rabbit              :*                                                 50
  19 Hole                :                                                   5
  20 Alice               :************                                     395
  21 Was                 :**********                                       353
  22 Beginning           :                                                  14
  23 To                  :*********************                            726
  24 Get                 :*                                                 46
  25 Very                :****                                             131
  26 Tired               :                                                   7
  27 Of                  :***************                                  511
  28 Sitting             :                                                  10
  29 By                  :**                                                58
  30 Her                 :*******                                          246
  31 Sister              :                                                   9
  32 On                  :******                                           193
  33 Bank                :                                                   3
  34 And                 :**************************                       869
  35 Having              :                                                  10
  36 Nothing             :*                                                 34
  37 Do                  :**                                                81
  38 Once                :*                                                 34
  39 Or                  :**                                                77
  40 Twice               :                                                   5
  41 She                 :****************                                 548
  42 Had                 :*****                                            177
  43 Peeped              :                                                   3
  44 Into                :**                                                67
  45 Book                :                                                  11
Do you want to continue? (Y/N)

Hint: You should start this project using the KISS principle by producing and trying to process a 15-word, 2-sentence file. Simply work on reading in individual words and printing them out on separate lines. What does the existence of the S word indicate about the algorithm used to break up the text into "words"?
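
As a starting point for the hint, here is a minimal sketch that breaks one line of text into words and prints each on its own line. To stay self-contained it works on a string variable rather than a file, and the variable names are made up for the example. It assumes that only letters and digits belong to a word and that every other character ends one; which characters you treat as part of a word is a decision you will need to make for your own program.

```pascal
program SplitWordsSketch;

var
   sample : string;    {hypothetical stand-in for a line read from the file}
   current : string;   {the word being built up}
   i : integer;
begin
   sample := 'The cat sat. It purred, twice!';
   current := '';
   for i := 1 to length(sample) do
      if sample[i] in ['A'..'Z', 'a'..'z', '0'..'9'] then
         current := current + sample[i]   {still inside a word}
      else
      begin
         if current <> '' then            {a separator ends the word}
            writeln(current);
         current := ''
      end;
   if current <> '' then                  {word that runs to the end of the line}
      writeln(current)
end.
```

For the sample line this prints The, cat, sat, It, purred and twice, one per line. Once this works, the writeln calls are the natural place to hand each word to the counting code instead.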


© DFStermole 2002-2005
Created 24 Feb 02
Modified 8 July 05