Searching Structured Documents with the Enhanced Retrieval Functionality of freeWAIS-sf and SFgate

Ulrich Pfeifer, Universität Dortmund, Lehrstuhl Informatik VI, D-44221 Dortmund

pfeifer@ls6.informatik.uni-dortmund.de http://ls6-www.informatik.uni-dortmund.de/WhoIsWhoAtLS6.html#Pfeifer

Norbert Fuhr

fuhr@ls6.informatik.uni-dortmund.de http://ls6-www.informatik.uni-dortmund.de/WhoIsWhoAtLS6.html#Fuhr

Tung Huynh

huynh1@ls6.informatik.uni-dortmund.de

Abstract:

The original WAIS implementation by Thinking Machines et al. treats documents as uniform bags of terms. Since most documents exhibit some internal structure, it is desirable to provide the user means to exploit this structure in his queries. In this paper, we present extensions to the [freeWAIS][1] indexer and server, which allow access to documents structure using the original WAIS protocol.

Major extensions include:

arbitrary document formats,
search in individual structure elements,
stemming and phonetic search,
support of 8-bit character sets,
numeric concepts and operators,
combination of Boolean and linear retrieval.

We also present an WWW-WAIS gateway specially tailored for usage with [freeWAIS-sf], which transforms filled out HTML forms to the new query syntax.

Keywords:

WAIS, Form, User-Interface, Query Language, Retrieval, Search, Multi-Server Search, Network Information Retrieval

Introduction

The enormous increase of local secondary storage and the growing availability of internet access leads to vast amounts of information available "at your finger tips". Searching for needed information has become a nontrivial task. Tools like man, find and grep for local search and archie for searching in the net are not sufficient at all, since they simply rely on searching for single words or regular expressions.

This has been overcome by the advent of WAIS in 1988. It implements free text search and ranking using a client-server architecture. Basically the client sends a couple of keywords to the server. The server searches the specified database for documents most similar to the query. For the best matching documents it delivers the headlines and document identifiers to the client, which displays them to the user. By selecting a headline, the user can request the corresponding document. The ease of use has made the system very popular.

However both the original WAIS implementation [wais-8-b5] by Thinking Machines et al. and the successor freeWAIS suffer from different drawbacks. For large databases, plain free text queries showed to be not selective enough. Most documents exhibit some structure which should be exploited by the query language to overcome this drawback.

FreeWAIS knows a number of builtin document types, which define the separation of files in documents and the contents of the headlines. Creating a database with new types requires modification of the indexing module.

Most WWW-WAIS interfaces resemble the flat query stucture by allowing the user to enter keywords in a small input field. Refining the query structure must lead to more sophisticated interfaces, namely forms.

In the next section we define extensions to the query functionality and develope an appropriate query language. Then we define the document format language of our freeWAIS-sf, which supercedes the builtin document types. Some information about the implementation of this language follows. After that we introduce our WWW-WAIS gateway [SFgate], which allows form based access to the enhanced query semantics while protecting the user from the new syntax. as an example application, we demonstrate our solution to the "How do I index http pages" question. Finally, we give an outlook on desirable extensions.

Extending the Query Functionality

Before we started the project in the fall of 1993, a number of implementors added new functionality to the search engine. Namely Don Gilbert at Indiana University introduced new features like wildcard search and boolean operators. But most extensions produced missleading results e.g. when encountering index block borders or when used with relevance feedback. Especially equivalent Boolean expressions did not yield the same results.

Clearly it is not easy to handle the Boolean operators and nesting correctly if one inspects the keywords one by one, as the original query engine does. So we decided to define a query language and implement a parser which checks queries for validy and converts them to postfix operations. The search engine was then modified to maintain a stack of document sets, which can be combined with boolean operations.

One main objective of the project was to allow unmodified clients to access the new features. So we designed the query language such that it contains the free text queries and prior extensions as subsets. Another consequence was that we did not change the protocol to reflect the new features. As a drawback of this decision, checking of the query must be done at the server side.

As indicated above, we aimed to exploit the document structure. The way we choosed is to assign one or more concept categories to the document terms depending on where they appear within the document. Consider senders and recipients of electronic mail messages as an example. In the query the categories to be searched should be selectable for each term. To leave the original queries valid (and to support casual users) we provided a default category, which is used if no category is specified in the query. Now we give an outline of the query language:

The atomic search expressions of the language are terms, term wild-cards and phrases (e.g. information, inform*, "information retrieval").

Terms are made up of 8-bit-characters of a configurable character set. Stemming is handled transparently for the client. Terms searched in a stemmed category are searched using their word stem automatically. For a wildcard (only tail truncation is implemented), all matching words from the dictionary are used as search terms. Phrase search looks up the first word which must be an index term and then scans the documents[2] containing this word for string matches for the complete phrase.

In addition the prefix operators soundex and phonix are allowed for converting the following query term to its Soundex/Phonix ([Gadd88],[Gadd90]) code. This is e.g. very useful when searching in phonebooks if the exact spelling of a name is not known ([Pfeifer-etal95]).

Arbitrary Boolean combination of these atomic expressions with the operators and, or and not (not means and not in Boolean logic and is therefore a binary operation. See the examples below.) are allowed. Parentheses can be used for grouping. For compatibility with the original syntax, or may be omitted.

For each expression, a semantic category can be defined using the category pred operator, where pred is = for text categories and one of ==, <,> for numeric categories.

Here are some examples:

information retrieval              free text query
information OR retrieval           same as above
ti=information retrieval           'information' must be in the title
ti=(information retrieval)         one of them in title
ti=(information OR retrieval)      same as above
ti=(information AND retrieval)     both of them in title
ti=(information NOT retrieval)     'information' in title and
                                   'retrieval' not in title
py==1990                           numerically equal
py<1990                            numerically less
py>1990                            numerically greater
au=(soundex salatan)               soundex search, matches eg. 'Salton'
ti=("information retrieval")       phrase search
ti=(information system*)           wild-card search

The complete syntax can be found in appendix A.

Defining Document Formats and Indexing Methods

The original implementation has the document formats hard-coded in the indexer. The builtin document formats are defined in terms of a separator_function, which separates a file in a set of documents and a set of headline functions. The latter build headlines for each document. Indexing documents with a new format requires adding the separator and headline functions to the code and recompiling the indexer. In order to support the mapping of arbitrary regions to semantic categories, a lot of additional and more sophisticated functions would be necessary.

To avoid this, we developed a specification language based on regular expressions which allows definition of formats without writing C code and recompiling the system. This greatly reduces the turn-around times when building new databases.

We first describe which language constructs replace the separator and headline functions and then outline how the mapping of text regions to query categories can be specified.

The separator_function is replaced by a <record-end> /regexp/ directive. Each line matching the regular expression regexp separates documents in a file.

The headline functions are replaced by a <layout> ... <end> group, which is a list of <headline> directives and an optional <date> definition. Each <headline> definition ties a part of the headline to a region of the document.

<headline> start end width [skip]

The above line advises the indexer to copy the text between the matches for start and end, optionally skipping the text matched by skip to the next width characters of the headline.

Below is a complete example, where characters 1 to 5 of the headline are reserved for the publication year, next 21 characters are reserved for the author name. The <date> directive allows to define a date for the document under inspection, which is displayed by some clients.

<layout>
<headline> /^PY: / /^[A-Z][A-Z]:/ 5 /^PY: */
<headline> /^AU: / /^[A-Z][A-Z]:/ 21 /^AU: */
<headline> /^TI: / /^[A-Z][A-Z]:/ 41 /^TI: */
<date> /^ED: / /%d-%3s-%d/ day month string year /^ED: [^ ]/
<end>

To define the mapping to the categories, the rest of the specification file should be made up of <field> ... <end> groups. Each is mapping a region of text to a set of query categories.

<field> start [skip]
fieldlist options indexspecs
<end> end

The start, skip and end regular expressions define the region of the document under consideration. Fieldlist is a list of category names. Options include the directive <numeric> skip width and the directive stemming. The first option advises the indexer to allow numeric values only and makes sure that numeric atomic search expressions will work with the categories.

Indexspecs is a list of index types along with a keyword indicating if the region should be mapped to the designated categories (LOCAL) or the default category (GLOBAL) only or to both (BOTH).

Index types currently supported are TEXT, SOUNDEX and PHONIX.

Consider the following example:

 /^AU: /
au SOUNDEX LOCAL TEXT BOTH
 /^[A-Z][A-Z]:/

To the indexer this means:

For all words starting with 'AU: ' at the beginning of a line up to a line which starts with two capital letters followed by a colon and a blank, put the word in the default and the au category and its soundex code only in the au category.

Thus an author name can be found in the created database in the default category or the au category if the exact spelling is known. If the name is misspelled, it might be found using the query 'au=(soundex misspelled-name)'.

The complete syntax can be found in appendix B.

The Weighting Scheme

The original indexer produces one index file for each database. This global index now is used for searching the default category. For each defined category an additional index is produced and attributes are associated with it. E.g. the indexer sets a flag, if the category contains stemmed words. At retrieval time, query terms are stemmed automatically depending on this flag. So stemming is transparent to the user.

While the original indexer stores only the frequency by which the individual terms occur in each document, the new one stores a floating point weight, which is computed by using the following information:

        tf     = frequency of the term in the document
        max_tf = maximum frequency of a term in the document
        N      = number of documents in the database
        n      = number of documents containing the term
        idf    = log(N/n)

The for each term a preliminary weight w' is computed as follows:

        w'     = (tf/max_tf+1)/2

The final weights are computed by normalizing the documents' weights vector to a length of 1. This ensures, that large documents do not dominate the answer sets[3].

At retrieval time a score is computed for each set of atomic search terms (not connected by Boolean and or not) and each document in the collection by multiplying for each matching term the corresponding document weight with the inverse document frequency (idf) and summing up the results.

A boolean combination of two term sets is weighted by using the product of the corresponding scores for the and resp. the product of the first score and one minus the second score for the not operation.

The WWW-WAIS Gateway SFgate

The definition of the default category allows the database creator to select regions within the documents, which might be useful for unexperienced users. Plain free text queries will be answered by the server using this default category. To use other categories, their names must be known to the user [4], who has to submit syntactically correct queries, too. This may be too difficult for casual users[5].

To support the searchers in composing complex queries, fill-out forms are of great value since users do not have to remember category names and the correct syntax can be derived from the form automatically. Since the original WAIS clients cannot handle forms, we implemented a CGI-Script for this purpose.

To make the script useful for as many http-server maintainers as possible, we decided not to rely on an existing client (waisq). The script - written in [perl] - communicates directly with the WAIS servers using the [Z39.50] protocoll.

Selection of databases

The script uses the field name database in order to select a set of databases to query. For example, check-boxes like the following may be used [6]:

Note: The following form is active!

BIBDB: information retrieval, databases
HCI: information retrieval, human-computer-interaction (references from the HCI project)
IRDISS: Selected Dissertation Abstracts from IR-LIST
JTOCs: TOCs of journals in information retrieval and databases.

Query Categories

For each possible query category there may be a pair of input fields, one taking the search terms and one accepting a search type. The latter might be empty or contain the word soundex or plain or both of them:

Author/Editor

The gateway translates the supplied contents of the input fields into a query expression for the category with the same name as the term input field. Parentheses and soundex operators are added as needed.

The predefined field name text refers to the default category. In this field experienced users can enter arbitrary complex queries obeying the query syntax:

Text: au=(fuhr or soundex pfeifer) and ti=probabilistic

All fields are connected implicitly by or unless an optional field tie contains the word and:

Connect fields by: and or

Options

There are some other options:

directget: Setting this field advises the gateway to skip the second phase of the search - displaying the headlines of the matching documents - and concatenate all documents retrieved.
maxhits: Allows to set the maximum size of the answer set (default: 40).
convert: Specifies an external or internal converter be applied to a resulting document before delivering it.
verbose: Directs the verbosity of the headline display.

       How many hits do you want ?
        Skip headline menu ?  Short headlines?
                 Convert: No BibTeX pretty

The current implementation contacts the selected servers sequentially. Clearly parallel access would be faster. But since the servers are quite fast even on large databases (several dozens of MBs) we postponed the implementation of a parallel processing for the time being.

The single ranking lists are merged according to the individual score of the hits. The original servers normalized their score before delivering. So scores from different servers were not comparable. Our server leaves the scores unchanged, so that the merging can be done correctly.

Ideally the weighting should rely on the inverse document frequency(idf) computed throughout all queried databases. But this idf must be known by the individual servers when they generate their ranking lists. So an additional protocol phase would be necessary to communicate the idfs. This protocol change would exclude original clients from using our servers.

Example: Searching http pages

A slight modification of the code for the builtin document type URL allows to build and query databases of remote documents, namely HTML pages on other servers. There are numerous such databases in the net. Most suffer from one or more of the following deficiencies:

No free text query possible. Instead e.g. regular expressions are used.
No structured query possible. E.g. search only the titles or headlines.
Relative links do not work. This is mostly the case, when the document is delivered by a wais server.
Hits display the URLs of the documents, not the title.

FreeWAIS-sf and SFgate together can be used in a way, that noone of the above problems occur.

Below is the interface we use for searching our own server as well as a database of documents generated by a proxy gateway called SFproxy which comes with SFgate. It transparently indexes all pages first seen after delivering it to the requesting client. So we can be sure to find each page we have ever seen without maintaining a long hotlist. To keep the indexes small we decided to index only the titles, headlines, definition terms, adresses and bolded/italic words.

Note: The following form is active!

1. Select database(s)

Pages on the http server
Pages on the ftp server
Pages seen by the proxy server

2. Formulate Query

Title: conference
Headlines: author guideline
Description terms
Addresses
Text (All of the above and bolded/italic words): darmstadt

3. Control Search

       How many hits do you want?
        Short headlines?

Outlook

The original WAIS implementation supports only a crude relevance feedback mechanism. Marked (regions of) documents are extracted from the database and all words found there are added to the query. This can also be achieved by cut and paste operations in any client.

Real relevance feedback would require to modify the retrieval function used in the previous search in order to achieve a better score for the documents marked as relevant ([Salton-Buckley90]). This sort of relevance feedback has to be implemented in the search engine of freeWAIS-sf and the gateway must provide means to mark relevant documents.

Currently, splitting a database, querying the parts and merging the results yields different results than querying the complete database. The removal of the per server normalization of the scores did not help here because the word frequencies [7] naturally differ between the whole database and its parts. One should not remove these word frequencies from the weighting scheme since including them has been shown to improve the resulting ranking [Salton-Buckley90]. One way of doing it right might be to do the final scoring on the client side. Therefore the protocol must be extended, to give the clients access to the information needed.

Another related nice extension would be protocol support for dictionary lookup. A user could then check the dictionary of a database in preparation for searching it.

Further, the (original) WAIS protocol does not support the added semantic categories. One can trick the server in sending the database description which contains (since version 1.1) a list of used categories. But infomation about the presence of soundex codes or numeric concepts is still missing. Using this information, the gateway could examine a set of selected databases and generate a form for the largest common category set. Also propagating queries down a category hierarchy would be possible.

Currently, query categories are mapped one to one to index files. Thus assigning a region of text to more than one query category results in duplication in the indexes. Therefore, we are thinking about a mechanism for mapping the query categories to index files, which allows more flexibility.

Some more technical improvements needed can be found on the freeWAIS-sf TODO list.

Footnotes:

[1]: The code is known to be ported to AUX, AIX, HP-UX, DGUX, BSD/386, FreeBSD, IRIX, Linux, NetBSD, SUNOS 4.1.*, SUNOS 5.3, ULTRIX, UNIXWARE, .... The distribution was downloaded to more than 1800 different hosts (including a number of mirror sites). To date, more than 450 hosts have had active servers on The Net.
[2]: This is not possible for documents on remote sites since the server is unable to retrieve these via http of ftp.
[3]: When indexing the UNIX manual pages with the original system, almost every answer set contains the perl manual which is more than 200 kilobytes long.
[4]: The WAIS protocol provides no support for querying the server about the available categories.
[5]: Now there is also a X11 WAIS Client called ezxwais [EZXWAIS], which builds query forms from the server descriptions generated by the indexer. Currently, numeric concepts are not supported.
[6]: SFgate also provides selection of databases from the answer set of a directory database. See e.g. http://ls6-www.informatik.uni-dortmund.de/SFgate/
[7]: The frequencies of the words in the database, not in a single document. The latter is also exploited for the scoring.

References

[Crossley94]: Crossley, D. WAIS through the Web - Discovering Environmental Information, Proceedings of the Second International WWW Conference '94, http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/crossley/paper.html
[freeWAIS]: ftp://ftp.bio.indiana.edu/util/wais/freeWAIS-0.3.tar.gz
[freeWAIS-sf]: Pfeifer, U: FreeWAIS-sf, http://ls6-www.informatik.uni-dortmund.de/freeWAIS-sf/, http://www.cis.ohio-state.edu/hypertext/faq/usenet/wais-faq/freeWAIS-sf/faq.html
[Gadd88]: Gadd, T.N: 'Fisching for Werds'. Phonetic Retrieval of written text in Information Retrieval Systems, Program, 22/3, pp. 222-237, 1988
[Gadd90]: Gadd, T.N: PHONIX: The Algorithm, Program, 24/4, pp. 363-366, 1990
[Huynh95]: Huynh, Q. T.: Extension of the FreeWAIS Server, Diploma thesis, University of Dortmund (in preparation).
[perl]: Pointer page to Perl programming language resources, http://kaos.erin.gov.au/gis/develop/perl_library/about_perl.html
[Pfeifer-etal95]: Pfeifer, U.; Poersch, T.; Fuhr, N.:Searching Proper Names in Databases, to appear in: HIM95: Proc. of the Conference on Hypertext - Information Retrieval - Multimedia., ftp://ls6-www.informatik.uni-dortmund.de/pub/doc/reports/95/Pfeifer-etal-95a.html
[Salton-Buckley88]: Salton, G.; Buckley, C.: On the Use of Spreading Activation Methods in Automatic Information Retrieval, In: Chiaramella, Y. (ed.): 11th International Conference on Research & Development in Information Retriev al, pages 147-160. Presses Universitaires de Grenoble, Grenoble, France, 1988.
[Salton-Buckley90]: Salton, G.; Buckley, C.: Improving Retrieval Performance by Relevance Feedback, Journal of the American Society for Information Science, 41/4, pp. 288-297, 1990
[SFgate]: Pfeifer, U.: SFgate - CGI Interface to WAIS servers, http://ls6-www.informatik.uni-dortmund.de/SFgate/SFgate.html
[wais-8-b5]: ftp://wais.com/pub/freeware/unix-src/wais-8-b5.1.tar.Z
[Z39.50]: Waldstein, R.: A pointer page about Z39.50 Resources http://www.research.att.com/~wald/z3950.html
[EZXWAIS]: Weathers, R. S.: EZXWAIS, ftp://ls6-www.informatik.uni-dortmund.de/pub/wais/client-toolkit-EZ-2.2.1.tar.gz

Appendix A: Query Syntax

Appendix B: Format Definition