

Quick & Dirty Concordance Problem

Researchers of literature who wish to study an author's style, and grammarians who want to investigate current or past usage, have traditionally turned to books called concordances. A concordance is an alphabetical listing of all the words in a text, with the neighboring context given for each occurrence. Because a concordance is a book, it is a static presentation of data, and its original design and production considerations impose various limitations.

Some concordances are sentence-based. This means that even if the cited word is the first word in its sentence, no words from the previous sentence are provided. Can you think of anything that is missed using this method?

You are to provide access to an English-language book in a similar fashion. You can download the data for free from Project Gutenberg. For this problem, you should use Alice's Adventures in Wonderland by Lewis Carroll (1832-1898). The book is available in zipped format: alice30h.zip. However, the text is already unzipped here as alice30h.html.

Have a look at the beseda text corpus (a body of word data) at the Institute of Slovenian Language. Type in the word visokost to see what is found. You could use this web site as a model if you had enough time.

However, for this workshop, you will create a "quick and dirty" concordance generator. The text is broken into lines. Your web page program should have the following characteristics:

  1. You will provide only one text for searching.
  2. It will allow the user to specify the string to search for.
  3. Your output should resemble the following, which is a selection of the results from searching for "Alice".
    Alice started to her feet, for it flashed across her mind that
    cats eat bats, I wonder?' And here Alice began to get rather
    It was all very well to say 'Drink me,' but the wise little Alice

    The first line presents an occurrence where "Alice" is the first word on the line; the second has an occurrence of "Alice" in the middle; and the third has it in line-final position. The item searched for is highlighted in red.

  4. Your search should be case-sensitive, i.e., a search for "the" will find only "the", not "The".
  5. Your search should not be concerned with word boundaries. Thus, a search for "the" will also find "them".
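
The characteristics above can be sketched in a few lines of Python. This is only one possible approach, not the workshop's reference solution: the function name find_matches, the sample list, and the inline red style are all assumptions made for illustration, and the sketch assumes the book has already been reduced to a list of plain-text lines.

```python
import html


def find_matches(lines, query):
    """Return the lines that contain query, with each occurrence marked in red.

    Matching is a plain substring test, so it is case-sensitive and
    ignores word boundaries: "the" matches inside "them" but not "The".
    """
    if not query:
        return []  # an empty query would match everywhere; treat it as no search
    # Hypothetical markup choice: an inline style that colors the hit red.
    marked = '<span style="color: red">' + html.escape(query, quote=False) + "</span>"
    results = []
    for line in lines:
        if query not in line:
            continue
        # Split on the query, escape the text between hits, then rejoin
        # the pieces with the highlighted query in between.
        pieces = [html.escape(piece, quote=False) for piece in line.split(query)]
        results.append(marked.join(pieces))
    return results


# Sample lines taken from the expected output shown above.
sample = [
    "Alice started to her feet, for it flashed across her mind that",
    "cats eat bats, I wonder?' And here Alice began to get rather",
    "It was all very well to say 'Drink me,' but the wise little Alice",
]
for hit in find_matches(sample, "Alice"):
    print(hit)
```

Each returned string is an HTML fragment, so a web page would still need to wrap the results in its own markup, for example one line per result inside a preformatted block. Escaping the text between hits keeps characters such as "<" or "&" in the book from being interpreted as markup.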

To make your programming task easier, you should

Your end product should perform somewhat like my Quick & Dirty Concordance.

Something to think about

How is a web-based concordance program better than a concordance in a book?

Is the book concordance still useful? Does it fulfill a need for a researcher better than a computer program?

Extra Credit

Do one or more of the following:


© 2002-2009 DFStermole
Created 3 Oct 02
Updated 4 Dec 2009