KWIC Functional Specification

Basic service

The KWIC (KeyWord In Context) system provides a convenient search mechanism for information in a long list of lines, such as book titles or online documentation entries.

KWIC reads a list of lines, generates all the circular shifts of each line, sorts the shifted lines, and writes the resulting lines to stdout. For a given line, a circular shift is generated by removing the first word in the line and appending it to the end of the line. Thus, a line with N words has N circular shifts (including the original line).
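
The shift rule above can be sketched in a few lines of Python. This is only an illustration, not the system's actual implementation; the function name circular_shifts is hypothetical.

```python
def circular_shifts(line):
    """Return every circular shift of a line, each as a list of words."""
    words = line.split()  # split on runs of whitespace
    # Shift k moves the first k words to the end; k = 0 is the original line.
    return [words[k:] + words[:k] for k in range(len(words))]


# A line with N words yields N shifts, including the original line.
for shift in circular_shifts("The Cat in the Hat"):
    print(" ".join(shift))
```

An empty line has no words and therefore produces no shifts.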

The KWIC output is useful because it supports a simple lookup scheme. Suppose that the input is a long list of book titles, and that the user wants all the books on "programming". In the KWIC output, the title (perhaps shifted) of every one of those books will have an entry under "p".

For some words, e.g., "the" and "of", there is no point in generating a shift; no user will look up all lines beginning with "the". Therefore, KWIC provides a facility for ignoring such "noise words" in the output. The user provides a noise words file, and any shift beginning with a noise word is suppressed.

Command line invocation

KWIC is invoked from the Unix shell as follows:
	kwic [-n noiseWordsFile] linesFile ...
If the -n argument is present, then the noise words are read from noiseWordsFile. Otherwise, if the KWICNOISEWORDS environment variable exists, then its value is taken to be the name of the noise words file. If neither the -n argument nor the KWICNOISEWORDS environment variable is present, then the noise words are read from the file noiseWords in the current directory.
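
The precedence among the three sources of the noise words file name might be sketched as follows. The function name resolve_noise_file and its return shape are assumptions made for illustration. Treating a lone -n with no file name as an error is also an assumption, since the specification does not say what happens in that case.

```python
import os

DEFAULT_NOISE_FILE = "noiseWords"  # looked up in the current directory


def resolve_noise_file(argv, environ=None):
    """Return (noise_file_name, lines_files) for the arguments after 'kwic'."""
    if environ is None:
        environ = os.environ
    args = list(argv)
    if args and args[0] == "-n":
        if len(args) < 2:
            # Assumption: the specification does not cover this case.
            raise ValueError("-n requires a noise words file name")
        return args[1], args[2:]           # 1. explicit -n argument
    if "KWICNOISEWORDS" in environ:
        return environ["KWICNOISEWORDS"], args  # 2. environment variable
    return DEFAULT_NOISE_FILE, args        # 3. default file name
```

For example, resolve_noise_file(["-n", "stop.txt", "titles.txt"]) selects stop.txt regardless of whether KWICNOISEWORDS is set.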

If one or more linesFiles are present, then the lines to be shifted are read from these files. Otherwise the lines are read from stdin.
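
A minimal sketch of this input selection, assuming the linesFile arguments have already been separated from any -n option. The name read_lines and the stdin parameter are hypothetical; the parameter exists only so that the function can be exercised without a real terminal.

```python
import sys


def read_lines(lines_files, stdin=None):
    """Read input lines from the named files, or from stdin if none are given."""
    if stdin is None:
        stdin = sys.stdin
    if not lines_files:
        return [line.rstrip("\n") for line in stdin]
    lines = []
    for name in lines_files:  # files are read in the order given
        with open(name) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines
```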

Input format

In the input file(s), words are delimited by whitespace: blank, tab, or newline.

In the noise words file, each word must appear on a line of its own. The words must be sorted in ascending order. Case is irrelevant. Thus, if "the" is in the noise words file, then a shift beginning with "The" or "the" will be suppressed.
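
The case-insensitive test might look like the sketch below. The names load_noise_words and is_suppressed are hypothetical. The sketch stores the words in a set and so does not rely on the file being sorted; a sorted file would also allow a binary search, but nothing here assumes that KWIC works that way.

```python
def load_noise_words(text):
    """Parse the contents of a noise words file: one word per line."""
    # Lowercase every word so that later comparisons ignore case.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_suppressed(shift_words, noise_words):
    """Return True if the shift begins with a noise word, ignoring case."""
    return bool(shift_words) and shift_words[0].lower() in noise_words


noise = load_noise_words("and\nin\nthe\n")
print(is_suppressed(["The", "Cat", "in", "the", "Hat"], noise))  # begins with "The"
print(is_suppressed(["Cat", "in", "the", "Hat", "The"], noise))  # begins with "Cat"
```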

Example

If the input file contains the following lines:
        The C Programming Language
        The Cat in the Hat
and the noise words file contains the following words:
        and
        in
        the
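then KWIC will write the following lines to stdout. In each shifted line, a comma separates the words that were moved to the end from the rest of the line: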
        C Programming Language, The
        Cat in the Hat, The
        Hat, The Cat in the
        Language, The C Programming
        Programming Language, The C
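
The example can be reproduced with the short end-to-end sketch below. It rests on three assumptions that the specification does not state: lines are sorted by plain string comparison, an unshifted line is written without a comma, and the words moved to the end follow ", " as in the output above. The function name kwic is hypothetical.

```python
def kwic(lines, noise_words):
    """Return the sorted KWIC entries for the given lines and noise words."""
    noise = {w.lower() for w in noise_words}
    entries = []
    for line in lines:
        words = line.split()
        for k in range(len(words)):
            if words[k].lower() in noise:
                continue  # suppress shifts that begin with a noise word
            head = " ".join(words[k:])  # words that stay in front
            tail = " ".join(words[:k])  # words that were moved to the end
            entries.append(f"{head}, {tail}" if tail else head)
    return sorted(entries)  # assumption: plain string ordering


titles = ["The C Programming Language", "The Cat in the Hat"]
for entry in kwic(titles, ["and", "in", "the"]):
    print(entry)
```

Run on the two titles and three noise words above, this prints the same five lines in the same order.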