topical media & game development
Even in the presence of audiovisual media,
text will remain an important vehicel for human
In this section, we will look at the issues that
arise in querying a text or document database.
First we will characterize more precisely what
we mean by effective search,
and then we will study techniques to realize effective
search for document databases.
Basically, answering a query to a document database
comes down to string matching.
However, some problems may occur such as synonymy and polysemy.
- document database + string matching
- synonymy -- topic T does not occur literally in document D
- polysemy -- some words may have many meanings
As an example, church and house of prayer
have more or less the same meaning.
So documents about churches and cathedrals
should be returned when you ask for information about
'houses of prayer'.
As an exampleof polysemy, think of the word drum,
which has quite a different meaning when taken from a musical
perspective than from a transport logistics perspective.
precision and recall
Suppose that, when you pose a query,
everything that is in the database is returned.
You would probably not be satisfied,
although every relevant document will be included,
that is for sure.
On the other hand, when nothing is returned,
at least you cannot complain about non-relevant
documents that are returned, or can you?
In [MMDBMS], the notions of precision and
recall are proposed to measure the effectiveness
of search over a document database.
In general, precision and recall can be defined as follows.
- precision -- how many answers are correct
- recall -- how many of the right documents are returned
For your intuition,
just imagine that you have a database of documents.
With full knowledge of the database you can delineate
a set of documentsthatare of relevance to a particular query.
Also, you can delineate a set that will be returned
by some given search algorithm.
Then, precision is the intersection of the two sets
in relation to whatthe search algorithm returns,
and recall that same intersection in relation to what is
In pseudo-formulas, we can express this
precision and recall
precision = ( returned and relevant ) / returned
recall = ( returned and relevant ) / relevant
Now, as indicated in the beginning,
it is not to difficult to get either perfect recall
(by returning all documents)
or perfect precision (by returning almost nothing).
- return all documents: perfect recall, low precision
- return 'nothing': 'perfect' precision, low recall
But these must be considered anomalies
(that is, sick cases),
and so the problem is to find an algorithm
that performs optimally with respect to both
precision and recall.
For the total database we can extend these measures
by taking the averages of precision and recall
for all topics that the database may be queried about.
Can these measures only be applied
to document databases?
Of course not,
these are general measures that can be applied
to search over any media type!
A frequency table is an example
of a way to improve search.
Frequency tables, as discussed in [MMDBMS],
are useful for documents only.
Let's look at an example first.
Basically, what a frequency table does is, as the name implies,
give a frequency count for particular words or phrases
for a number of documents.
In effect, a complete document database may be summarized
in a frequency table.
In other words, the frequency table may be considered as an
index to facilitate the search for similar documents.
To find a similar document, we can simply make
a word frequency count for the query,
and compare that with the colums in the table.
As with images, we can apply a simpledistance metric to find the
nearest (matching) documents.
(In effect, we may take the square root for the
sum of the squared differences between the entries
in the frequence count as our distance measure.)
The complexity of this algorithm may be characterized
compare term frequencies per document -- O(M*N)
where M is the number of terms and N is the number
Since both M and N can become very large we need
to make an effort to reduce the size of
the frequency table.
We can, for example, introduce a stop list
to prevent irrelevant words to enter the table,
and we may restrict ourselves to
including word stems only,
to bring back multiple entries to
one canonical form. With some additional effort
we could even deal with synonymy and polysemy
by introducing, respectively equivalence classes,
and alternatives (although we then need a suitable
way for ambiguation).
By the way, did you notice that frequency tables
may be regarded as feature vectors for documents?
- stop list -- irrelevant words
- word stems -- reduce different words to relevant part
research directions -- user-oriented measures
Even though the reductions proposed may result in
limiting the size of the frequency tables,
we may still be faced with frequency tables
of considerable size.
One way to reduce the size further, as discussed
in [MMDBMS], is to apply latent sematic indexing
which comes down to clustering the document database,
and limiting ourselves to the most relevant words only,
where relevance is determined by the ratio of occurrence
over the total number of words.
In effect, the less the word occurs, the more discriminating
it might be.
Alternatively,the choice of what words are
considered relevant may be determined by
taking into account the area of application
or the interest of a particular group of users.
Observe that, when evaluating a particular information
retrieval system, the notions of precision and recall
as introduced before are rather system-oriented measures,
based on the assumption of a user-independent notion of
However, as stated in [IR],
different users might have a different interpretation
on which document is relevant.
In [IR], some user-oriented measures are briefly discussed,
that to some extent cope with this problem.
- coverage ratio -- fraction of known documents
- novelty ratio -- fraction of new (relevant) documents
- relative recall -- fraction of expected documents
- recall effort -- fraction of examined documents
Consider a reference collection,
an example information request
and a retrieval strategy to be evaluated.
Then the coverage ratio
may be defined as the fraction of the documents
known to be relevant, or more precisely the number
of (known) relevant documents retrieved divided by the
total number of documents known to be relevant by the user.
The novelty ratio may then be defined as the
fraction of the documents retrieved which were not known
to be relevant by the user, or more precisely
the number of relevent documents that were not known
by the user divided by the total number of relevant documents
The relative recall is obtained by dividing
the number of relevant documents found by the number
of relevant documents the user expected to be found.
Finally, recall effortmay be characterized as
the ratio of the number of relevant documents
expected and the total number of documents that
has to be examined to retrieve these documents.
Notice that these measures all have a clearly 'subjective'
element, in that, although they may be
generalized to a particular group of users,
they will very likely not generalize to all
groups of users.
In effect, this may lead to different retrieval
strategies for different categories of users,
taking into account levelof expertise and familiarity
with the information repository.
You may not copy or print any of this material without explicit permission of the author or the publisher.
In case of other copyright issues, contact the author.