topical media & game development





Even in the presence of audiovisual media, text will remain an important vehicle for human communication. In this section, we will look at the issues that arise in querying a text or document database. First, we will characterize more precisely what we mean by effective search, and then we will study techniques to realize effective search for document databases.

Basically, answering a query to a document database comes down to string matching.


However, problems such as synonymy (different words with the same meaning) and polysemy (one word with different meanings) may occur.


As an example of synonymy, 'church' and 'house of prayer' have more or less the same meaning, so documents about churches and cathedrals should be returned when you ask for information about 'houses of prayer'. As an example of polysemy, think of the word 'drum', which has quite a different meaning when taken from a musical perspective than from a transport logistics perspective.


precision and recall

Suppose that, when you pose a query, everything that is in the database is returned. You would probably not be satisfied, although every relevant document will be included, that is for sure. On the other hand, when nothing is returned, at least you cannot complain about non-relevant documents that are returned, or can you?

In  [MMDBMS], the notions of precision and recall are proposed to measure the effectiveness of search over a document database. In general, precision and recall can be defined as follows.

effective search

For your intuition, just imagine that you have a database of documents. With full knowledge of the database you can delineate a set of documents that are of relevance to a particular query. Also, you can delineate the set that will be returned by some given search algorithm. Then, precision is the intersection of the two sets in relation to what the search algorithm returns, and recall is that same intersection in relation to what is relevant. In pseudo-formulas, we can express this as follows:

precision and recall

  precision = ( returned and relevant ) / returned  
  recall = ( returned and relevant ) / relevant 
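The pseudo-formulas above can be computed directly over sets of document identifiers. A minimal sketch, where the document ids and both sets are purely illustrative assumptions:

```python
# Precision and recall over a toy document database.
# The document ids and both sets below are hypothetical, for illustration only.
relevant = {"d1", "d2", "d3", "d5"}    # documents truly relevant to the query
returned = {"d2", "d3", "d4"}          # documents the search algorithm returns

hits = returned & relevant             # returned and relevant

precision = len(hits) / len(returned)  # 2/3
recall = len(hits) / len(relevant)     # 2/4

print(precision, recall)
```

Note that both measures are undefined when their denominator is empty, which is why the degenerate case of returning nothing at all needs separate treatment.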
Now, as indicated in the beginning, it is not too difficult to get either perfect recall (by returning all documents) or perfect precision (by returning almost nothing).


But these must be considered anomalies (that is, sick cases), and so the problem is to find an algorithm that performs optimally with respect to both precision and recall.

For the total database we can extend these measures by taking the averages of precision and recall for all topics that the database may be queried about.

Can these measures only be applied to document databases? Of course not; these are general measures that can be applied to search over any media type!

frequency tables

A frequency table is an example of a way to improve search. Frequency tables, as discussed in  [MMDBMS], are useful for documents only. Let's look at an example first.


Basically, what a frequency table does is, as the name implies, give a frequency count for particular words or phrases for a number of documents. In effect, a complete document database may be summarized in a frequency table. In other words, the frequency table may be considered as an index to facilitate the search for similar documents.

To find a similar document, we can simply make a word frequency count for the query, and compare that with the columns in the table. As with images, we can apply a simple distance metric to find the nearest (matching) documents. (In effect, we may take the square root of the sum of the squared differences between the entries in the frequency counts as our distance measure.)
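A minimal sketch of this matching scheme, with a two-document toy table; the documents, their contents, and the query are illustrative assumptions:

```python
import math
from collections import Counter

# A toy frequency table: one column (a Counter of term frequencies) per document.
docs = {
    "doc1": "the drum is a musical instrument the drum",
    "doc2": "a drum is a container used in transport logistics",
}
table = {name: Counter(text.split()) for name, text in docs.items()}

def distance(a, b):
    """Euclidean distance between two term-frequency vectors
    (square root of the sum of squared differences per term)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))

# Treat the query itself as a small document and find the nearest column.
query = Counter("musical drum".split())
best = min(table, key=lambda name: distance(table[name], query))
print(best)  # doc1, the musical sense of 'drum'
```

Missing terms simply count as zero, since `Counter` returns 0 for absent keys.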

The complexity of this algorithm may be characterized as follows:


compare term frequencies per document -- O(M*N)

where M is the number of terms and N is the number of documents. Since both M and N can become very large we need to make an effort to reduce the size of the frequency table.


  • stop list -- irrelevant words
  • word stems -- reduce different words to relevant part
We can, for example, introduce a stop list to prevent irrelevant words from entering the table, and we may restrict ourselves to including word stems only, to reduce multiple entries to one canonical form. With some additional effort we could even deal with synonymy and polysemy by introducing, respectively, equivalence classes and alternatives (although we then need a suitable way of disambiguation). By the way, did you notice that frequency tables may be regarded as feature vectors for documents?



research directions -- user-oriented measures

Even though the reductions proposed may result in limiting the size of the frequency tables, we may still be faced with frequency tables of considerable size. One way to reduce the size further, as discussed in  [MMDBMS], is to apply latent semantic indexing, which comes down to clustering the document database and limiting ourselves to the most relevant words only, where relevance is determined by the ratio of occurrence over the total number of words. In effect, the less a word occurs, the more discriminating it might be. Alternatively, the choice of which words are considered relevant may be determined by taking into account the area of application or the interest of a particular group of users.
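The relevance-by-ratio heuristic can be sketched as follows; the toy corpus and the cut-off threshold are arbitrary illustrative assumptions:

```python
from collections import Counter

# A toy corpus, flattened to a single word list for counting.
corpus = (
    "the drum is a drum the church is a church "
    "the the the the of of in in a a"
).split()

counts = Counter(corpus)
total = len(corpus)

# Keep only words whose ratio of occurrence over the total number of
# words stays below a threshold: rare words discriminate better.
THRESHOLD = 0.15
relevant_words = {w for w, c in counts.items() if c / total < THRESHOLD}
print(sorted(relevant_words))
```

Note that this heuristic alone still retains frequent-but-uninformative words that happen to fall under the threshold; in practice it is combined with the stop list discussed earlier.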



user-oriented measures

Observe that, when evaluating a particular information retrieval system, the notions of precision and recall as introduced before are rather system-oriented measures, based on the assumption of a user-independent notion of relevance. However, as stated in  [IR], different users might have a different interpretation of which documents are relevant. In  [IR], some user-oriented measures are briefly discussed that to some extent cope with this problem.

user-oriented measures

  • coverage ratio -- fraction of known documents
  • novelty ratio -- fraction of new (relevant) documents
  • relative recall -- fraction of expected documents
  • recall effort -- fraction of examined documents
Consider a reference collection, an example information request and a retrieval strategy to be evaluated. Then the coverage ratio may be defined as the fraction of the documents known to be relevant, or more precisely the number of (known) relevant documents retrieved divided by the total number of documents known to be relevant by the user.

The novelty ratio may then be defined as the fraction of the retrieved documents which were not known to be relevant by the user, or more precisely the number of relevant documents that were not known by the user divided by the total number of relevant documents retrieved.

The relative recall is obtained by dividing the number of relevant documents found by the number of relevant documents the user expected to be found.

Finally, recall effort may be characterized as the ratio of the number of relevant documents expected and the total number of documents that have to be examined to retrieve these documents.
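The four definitions above translate directly into ratios over sets and counts. In the sketch below, the document ids and all numbers are hypothetical, purely for illustration:

```python
# User-oriented measures for a single information request.
relevant_retrieved = {"d1", "d2", "d3", "d4"}  # relevant docs retrieved
known_relevant = {"d1", "d2", "d7"}            # relevant docs the user already knew
expected = 5                                   # relevant docs the user expected to find
examined = 20                                  # docs examined to retrieve them

# coverage ratio: known relevant docs retrieved / total known relevant docs
coverage = len(relevant_retrieved & known_relevant) / len(known_relevant)

# novelty ratio: retrieved relevant docs not previously known / relevant retrieved
novelty = len(relevant_retrieved - known_relevant) / len(relevant_retrieved)

# relative recall: relevant docs found / relevant docs expected
relative_recall = len(relevant_retrieved) / expected

# recall effort: relevant docs expected / docs examined
recall_effort = expected / examined

print(coverage, novelty, relative_recall, recall_effort)
```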

Notice that these measures all have a clearly 'subjective' element, in that, although they may be generalized to a particular group of users, they will very likely not generalize to all groups of users. In effect, this may lead to different retrieval strategies for different categories of users, taking into account level of expertise and familiarity with the information repository.

(C) Æliens 04/09/2009

You may not copy or print any of this material without explicit permission of the author or the publisher. In case of other copyright issues, contact the author.