pfeifer@ls6.informatik.uni-dortmund.de http://ls6-www.informatik.uni-dortmund.de/WhoIsWhoAtLS6.html#Pfeifer
Norbert Fuhr
fuhr@ls6.informatik.uni-dortmund.de http://ls6-www.informatik.uni-dortmund.de/WhoIsWhoAtLS6.html#Fuhr
Tung Huynh
huynh1@ls6.informatik.uni-dortmund.de
Major extensions include:
We also present an WWW-WAIS gateway specially tailored for usage with [freeWAIS-sf], which transforms filled out HTML forms to the new query syntax.
The enormous increase of local secondary storage and the growing
availability of internet access leads to vast amounts of
information available "at your finger tips". Searching for needed
information has become a nontrivial task. Tools like
man
, find
and grep
for
local search and archie
for searching in the net are
not sufficient at all, since they simply rely on searching for
single words or regular expressions.
This has been overcome by the advent of WAIS in 1988. It implements free text search and ranking using a client-server architecture. Basically the client sends a couple of keywords to the server. The server searches the specified database for documents most similar to the query. For the best matching documents it delivers the headlines and document identifiers to the client, which displays them to the user. By selecting a headline, the user can request the corresponding document. The ease of use has made the system very popular.
However both the original WAIS implementation [wais-8-b5] by Thinking Machines et al. and the successor freeWAIS suffer from different drawbacks. For large databases, plain free text queries showed to be not selective enough. Most documents exhibit some structure which should be exploited by the query language to overcome this drawback.
FreeWAIS knows a number of builtin document types, which define the separation of files in documents and the contents of the headlines. Creating a database with new types requires modification of the indexing module.
Most WWW-WAIS interfaces resemble the flat query stucture by allowing the user to enter keywords in a small input field. Refining the query structure must lead to more sophisticated interfaces, namely forms.
In the next section we define extensions to the query functionality and develope an appropriate query language. Then we define the document format language of our freeWAIS-sf, which supercedes the builtin document types. Some information about the implementation of this language follows. After that we introduce our WWW-WAIS gateway [SFgate], which allows form based access to the enhanced query semantics while protecting the user from the new syntax. as an example application, we demonstrate our solution to the "How do I index http pages" question. Finally, we give an outlook on desirable extensions.
Before we started the project in the fall of 1993, a number of implementors added new functionality to the search engine. Namely Don Gilbert at Indiana University introduced new features like wildcard search and boolean operators. But most extensions produced missleading results e.g. when encountering index block borders or when used with relevance feedback. Especially equivalent Boolean expressions did not yield the same results.
Clearly it is not easy to handle the Boolean operators and nesting correctly if one inspects the keywords one by one, as the original query engine does. So we decided to define a query language and implement a parser which checks queries for validy and converts them to postfix operations. The search engine was then modified to maintain a stack of document sets, which can be combined with boolean operations.
One main objective of the project was to allow unmodified clients to access the new features. So we designed the query language such that it contains the free text queries and prior extensions as subsets. Another consequence was that we did not change the protocol to reflect the new features. As a drawback of this decision, checking of the query must be done at the server side.
As indicated above, we aimed to exploit the document structure. The way we choosed is to assign one or more concept categories to the document terms depending on where they appear within the document. Consider senders and recipients of electronic mail messages as an example. In the query the categories to be searched should be selectable for each term. To leave the original queries valid (and to support casual users) we provided a default category, which is used if no category is specified in the query. Now we give an outline of the query language:
The atomic search expressions of the language are terms,
term wild-cards and phrases
(e.g. information
, inform*
,
"information retrieval"
).
Terms are made up of 8-bit-characters of a configurable character set. Stemming is handled transparently for the client. Terms searched in a stemmed category are searched using their word stem automatically. For a wildcard (only tail truncation is implemented), all matching words from the dictionary are used as search terms. Phrase search looks up the first word which must be an index term and then scans the documents[2] containing this word for string matches for the complete phrase.
In addition the prefix operators soundex
and
phonix
are allowed for converting the
following query term to its Soundex/Phonix ([Gadd88],[Gadd90])
code. This is e.g. very useful when searching in phonebooks if
the exact spelling of a name is not known ([Pfeifer-etal95]).
Arbitrary Boolean combination of these atomic expressions with the
operators and
, or
and
not
(not
means
and
not
in Boolean logic and is
therefore a binary operation. See the examples
below.) are allowed. Parentheses can be used for grouping. For
compatibility with the original syntax, or
may be
omitted.
For each expression, a semantic category can be defined using the
category pred
operator, where
pred is =
for text categories and
one of ==
, <,
>
for numeric categories.
Here are some examples:
information retrieval free text query information OR retrieval same as above ti=information retrieval 'information' must be in the title ti=(information retrieval) one of them in title ti=(information OR retrieval) same as above ti=(information AND retrieval) both of them in title ti=(information NOT retrieval) 'information' in title and 'retrieval' not in title py==1990 numerically equal py<1990 numerically less py>1990 numerically greater au=(soundex salatan) soundex search, matches eg. 'Salton' ti=("information retrieval") phrase search ti=(information system*) wild-card search
The complete syntax can be found in appendix A.
The original implementation has the document formats hard-coded
in the indexer. The builtin document formats are defined in
terms of a separator_function
, which separates a
file in a set of documents and a set of headline functions. The
latter build headlines for each document. Indexing
documents with a new format requires adding the separator and
headline functions to the code and recompiling the indexer. In
order to support the mapping of arbitrary regions to semantic
categories, a lot of additional and more sophisticated functions
would be necessary.
To avoid this, we developed a specification language based on regular expressions which allows definition of formats without writing C code and recompiling the system. This greatly reduces the turn-around times when building new databases.
We first describe which language constructs replace the separator and headline functions and then outline how the mapping of text regions to query categories can be specified.
The separator_function
is replaced by a
<record-end> /regexp/
directive. Each line matching the regular expression
regexp separates documents in a file.
The headline functions are replaced by a
<layout> ... <end>
group,
which is a list of <headline>
directives and an optional <date>
definition. Each <headline>
definition
ties a part of the headline to a region of the document.
<headline> start end width [skip]
The above line advises the indexer to copy the text between the matches for start and end, optionally skipping the text matched by skip to the next width characters of the headline.
Below is a complete example, where characters 1 to 5 of the
headline are reserved for the publication year, next 21
characters are reserved for the author name. The
<date>
directive allows to define a
date for the document under inspection, which is displayed by
some clients.
<layout> <headline> /^PY: / /^[A-Z][A-Z]:/ 5 /^PY: */ <headline> /^AU: / /^[A-Z][A-Z]:/ 21 /^AU: */ <headline> /^TI: / /^[A-Z][A-Z]:/ 41 /^TI: */ <date> /^ED: / /%d-%3s-%d/ day month string year /^ED: [^ ]/ <end>
To define the mapping to the categories, the rest of the
specification file should be made up of
<field>
... <end>
groups. Each is mapping a
region of text to a set of query categories.
<field> start [skip] fieldlist options indexspecs <end> end
The start, skip and end regular
expressions define the region of the document under
consideration.
Fieldlist is a list of category names. Options
include the directive <numeric> skip
width
and the directive
stemming
. The first option advises the
indexer to allow numeric values only and makes sure that numeric
atomic search expressions will work with the categories.
Indexspecs is a list of index types along with a
keyword indicating if the region should be mapped to the
designated categories (LOCAL
) or the default
category (GLOBAL
) only or to both
(BOTH
).
Index types currently supported are TEXT
,
SOUNDEX
and PHONIX
.
Consider the following example:
/^AU: / au SOUNDEX LOCAL TEXT BOTH /^[A-Z][A-Z]:/
To the indexer this means:
'AU: '
at the beginning
of a line up to a line which starts with two capital letters
followed by a colon and a blank, put the word in the default and
the au category and its soundex code only in the au
category.
Thus an author name can be found in the created database in the
default category or the au category if the exact spelling
is known. If the name is misspelled, it might be found using the
query 'au=(soundex misspelled-name)
'.
The complete syntax can be found in appendix B.
The original indexer produces one index file for each database. This global index now is used for searching the default category. For each defined category an additional index is produced and attributes are associated with it. E.g. the indexer sets a flag, if the category contains stemmed words. At retrieval time, query terms are stemmed automatically depending on this flag. So stemming is transparent to the user.
While the original indexer stores only the frequency by which the individual terms occur in each document, the new one stores a floating point weight, which is computed by using the following information:
tf = frequency of the term in the document max_tf = maximum frequency of a term in the document N = number of documents in the database n = number of documents containing the term idf = log(N/n)
The for each term a preliminary weight w' is computed as follows:
w' = (tf/max_tf+1)/2
The final weights are computed by normalizing the documents' weights vector to a length of 1. This ensures, that large documents do not dominate the answer sets[3].
At retrieval time a score is computed for each set of atomic
search terms (not connected by Boolean and
or
not
) and each document in the collection by
multiplying for each matching term the corresponding document
weight with the inverse document frequency
(idf) and summing up the results.
A boolean combination of two term sets is weighted by using the
product of the corresponding scores for the
and
resp. the product of the first score and
one minus the second score for the not
operation.
The definition of the default category allows the database creator to select regions within the documents, which might be useful for unexperienced users. Plain free text queries will be answered by the server using this default category. To use other categories, their names must be known to the user [4], who has to submit syntactically correct queries, too. This may be too difficult for casual users[5].
To support the searchers in composing complex queries, fill-out forms are of great value since users do not have to remember category names and the correct syntax can be derived from the form automatically. Since the original WAIS clients cannot handle forms, we implemented a CGI-Script for this purpose.
To make the script useful for as many http
-server
maintainers as possible, we decided not to rely on an existing
client (waisq
). The script - written in [perl] - communicates directly with the WAIS
servers using the [Z39.50] protocoll.
The script uses the field name database
in
order to select a set of databases to query. For example,
check-boxes like the following may be used [6]:
Note: The following form is active!
The current implementation contacts the selected servers sequentially. Clearly parallel access would be faster. But since the servers are quite fast even on large databases (several dozens of MBs) we postponed the implementation of a parallel processing for the time being.
The single ranking lists are merged according to the individual score of the hits. The original servers normalized their score before delivering. So scores from different servers were not comparable. Our server leaves the scores unchanged, so that the merging can be done correctly.
Ideally the weighting should rely on the inverse document frequency(idf) computed throughout all queried databases. But this idf must be known by the individual servers when they generate their ranking lists. So an additional protocol phase would be necessary to communicate the idfs. This protocol change would exclude original clients from using our servers.
A slight modification of the code for the builtin document type
URL
allows to build and query databases of
remote documents, namely HTML
pages on other
servers. There are numerous such databases in the net. Most suffer
from one or more of the following deficiencies:
FreeWAIS-sf and SFgate together can be used in a way, that noone of the above problems occur.
Below is the interface we use for searching our own server as well as a database of documents generated by a proxy gateway called SFproxy which comes with SFgate. It transparently indexes all pages first seen after delivering it to the requesting client. So we can be sure to find each page we have ever seen without maintaining a long hotlist. To keep the indexes small we decided to index only the titles, headlines, definition terms, adresses and bolded/italic words.
Note: The following form is active!
Pages on the http server Pages on the ftp server Pages seen by the proxy server
How many hits do you want? Short headlines?
The original WAIS implementation supports only a crude relevance feedback mechanism. Marked (regions of) documents are extracted from the database and all words found there are added to the query. This can also be achieved by cut and paste operations in any client.
Real relevance feedback would require to modify the retrieval function used in the previous search in order to achieve a better score for the documents marked as relevant ([Salton-Buckley90]). This sort of relevance feedback has to be implemented in the search engine of freeWAIS-sf and the gateway must provide means to mark relevant documents.
Currently, splitting a database, querying the parts and merging the results yields different results than querying the complete database. The removal of the per server normalization of the scores did not help here because the word frequencies [7] naturally differ between the whole database and its parts. One should not remove these word frequencies from the weighting scheme since including them has been shown to improve the resulting ranking [Salton-Buckley90]. One way of doing it right might be to do the final scoring on the client side. Therefore the protocol must be extended, to give the clients access to the information needed.
Another related nice extension would be protocol support for dictionary lookup. A user could then check the dictionary of a database in preparation for searching it.
Further, the (original) WAIS protocol does not support the added semantic categories. One can trick the server in sending the database description which contains (since version 1.1) a list of used categories. But infomation about the presence of soundex codes or numeric concepts is still missing. Using this information, the gateway could examine a set of selected databases and generate a form for the largest common category set. Also propagating queries down a category hierarchy would be possible.
Currently, query categories are mapped one to one to index files. Thus assigning a region of text to more than one query category results in duplication in the indexes. Therefore, we are thinking about a mechanism for mapping the query categories to index files, which allows more flexibility.
Some more technical improvements needed can be found on the freeWAIS-sf TODO list.