Case study -- multimedia feature detection

Instructor's Guide


In this section, we will look at the indexing and retrieval of musical fragments. This study is primarily aimed at establishing the architectural requirements for the detection of musical features, and at indicating directions for exploring the inherently difficult problem of finding proper discriminating features and similarity measures in the musical domain. In this study we have limited ourselves to the analysis of music encoded in MIDI, to avoid the technical difficulties involved in extracting basic musical properties from raw sound material. Currently, we have a simple running prototype for extracting higher-level features from MIDI files. In our approach to musical feature detection, we extended the basic grammar-based ACOI framework with an embedded logic component to facilitate the formulation of predicates and constraints over the musical structure obtained from the input.

slide: The extended ACOI architecture

The ACOI framework

The ACOI framework is intended to accommodate a broad spectrum of classification schemes, manual as well as (semi-)automatic, for the indexing and retrieval of multimedia objects [ACOI].

What is stored are not the actual multimedia objects themselves, but structural descriptions of these objects (including their location), which may be used for retrieval.

The ACOI model is based on the assumption that indexing an arbitrary multimedia object is equivalent to deriving a grammatical structure that provides a namespace to reason about the object and to access its components. There is, however, an important difference with ordinary parsing: the lexical and grammatical items corresponding to the components of the multimedia object must be created dynamically by inspecting the actual object. Moreover, in general there is not a fixed sequence of lexicals, as in the case of natural or formal languages. To allow for the dynamic creation of lexical and grammatical items, the ACOI framework supports both black-box and white-box (feature) detectors. Black-box detectors are algorithms, usually developed by a specialist in the media domain, that extract properties from the media object by some form of analysis. White-box detectors, on the other hand, are created by defining logical or mathematical expressions over the grammar itself. In this section we will focus on black-box detectors only.
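At the code level, a black-box detector is simply a function attached to a non-terminal. The sketch below captures the detector interface as it may be inferred from the detector listings later in this section; the typedef name and the FAILURE code are our assumptions (only SUCCESS appears in the listings).

  /* Detector interface as inferred from the listings below;
     the typedef name, FAILURE, and the numeric values are
     assumptions for illustration only. */
  typedef int (*detector_fn)(tree *pt, list *tks);

  #define SUCCESS 0   /* returned by the detectors below; value assumed */
  #define FAILURE 1   /* assumed counterpart */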

As an example, look at the (simple) feature grammar below, specifying the structure of a hypothetical community.


  detector world;     -- finds the name of the world
  detector people;    -- checks name, eliminates institutes
  detector company;   -- looks if there are at least two persons

  atom str name;

  community: world people company;
  world: name;
  people: person*;
  person: name;

slide: A sample grammar

A community consists of people, and is a community only if it allows for the people to be in each other's company.

A community has a name. The actual purpose of this grammar is to select the persons that belong to a particular community from the input, which consists of names of potential community members. Note that the grammar specifies three detectors. These detectors correspond to functions that are invoked when expanding the corresponding non-terminal in the grammar. An example of a detector function is the personDetector function partially specified below.


  int personDetector(tree *pt, list *tks) {
    ...
    q = query_query("kit=pl src=check.pl");

    while ((t = next_token(tks))) {
      sprintf(buf, "person(%s)", t);
      query_eval(q, buf);
      if (query_result(q, 0))
        putAtom(tks, "name", t);   /* put name(person) on tokenstream */
    }
    ...
  }

slide: A person detector

The personDetector function checks, for each token on the input tokenstream tks, whether the token corresponds to the name of a person belonging to the community. The check is performed by an embedded logic component that contains the information needed to establish whether a person is a member of the community. Note that the query for a single token may result in adding multiple names to the tokenstream.

The companyDetector differs from the personDetector in that it needs to inspect the complete parse tree to see whether the (implicit) company predicate is satisfied.
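A minimal sketch of what such a detector could look like is given below; the helper tree_count, which counts the nodes with a given tag in the parse tree, is hypothetical, since the actual tree-inspection API is not shown here.

  /* Sketch of a company detector; tree_count is a hypothetical
     helper that counts nodes with the given tag in the parse tree. */
  int companyDetector(tree *pt, list *tks) {
    /* the (implicit) company predicate: at least two persons must
       be present for the people to be in each other's company */
    return (tree_count(pt, "person") >= 2) ? SUCCESS : FAILURE;
  }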

When parsing succeeds and the company predicate is satisfied, a given input may result in a sequence of updates of the underlying database, as illustrated below.



  V0 := newoid();
  V1 := newoid();
    community_world.insert(oid(V0),oid(V1));
      world_name.insert(oid(V1),"casa");
    community_people.insert(oid(V0),oid(V1));
  V2 := newoid();
      people_person.insert(oid(V1),oid(V2));
        person_name.insert(oid(V2),"alice");
  V3 := newoid();
      people_person.insert(oid(V1),oid(V3));
        person_name.insert(oid(V3),"sebastiaan");
      ...
  

slide: Database updates

Evidently, the updates correspond to assigning appropriate values to the attributes of a structured object, reflecting the properties of the given community.

The overall architecture of the ACOI framework is depicted in slide acoi. Taking a feature grammar specification, such as the simple community grammar, as a point of reference, we see that it is related to an actual feature detector (possibly containing an embedded logic component) that is invoked by the Feature Detector Engine (FDE) when an appropriate media object is presented for indexing. The feature grammar and its associated detector further result in updating, respectively, the data schemas and the actual information stored in the (Monet) database. The Monet database, which underlies the ACOI framework, is a customizable, high-performance, main-memory database developed at CWI and the University of Amsterdam, see [MONET].

At the user end, a feature grammar is related to View, Query and Report components that, respectively, allow for inspecting a feature grammar, expressing a query, and delivering a response to a query. Some examples of these components are currently implemented as applets in Java 1.1 with Swing, as described in [ACOI].

Formal specification

Formally, a feature grammar G may be defined as G = (V, T, P, S), where V is a collection of variables or non-terminals, T a collection of terminals, P a collection of productions of the form V -> (V ∪ T)*, and S a start symbol. A token sequence ts belongs to the language L(G) if S -*-> ts. Sentential token sequences, those belonging to L(G) or to one of its sublanguages L(G_v) = (V_v, T_v, P_v, v) for v ∈ (T ∪ V), correspond to a complex object C_v, which is the object corresponding to the parse tree for v, as illustrated in the community example. The parse tree defines a hierarchical structure that may be used to access and manipulate the components of the multimedia object subjected to the detector.
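As a concrete illustration (this instantiation is ours, for exposition), the community grammar given earlier may be read as such a tuple:

  V = { community, world, people, person, company }
  T = { name }
  S = community
  P = { community -> world people company,
        world -> name,
        people -> person*,
        person -> name }

where company has no production of its own and is expanded by its detector. A token sequence such as "casa alice sebastiaan" then belongs to L(G) exactly when it parses as a community.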

The anatomy of a MIDI feature detector

Automatic indexing of musical data is an inherently difficult problem. Existing systems rely on hand-crafted solutions, geared towards a particular group of users, such as composers of film music, see [MM]. In this section, we will look at a simple feature detector for MIDI-encoded musical data. It provides a skeleton for future experimentation.

slide: MIDI features

The hierarchical information structure that we consider is depicted in slide midi-structure. It contains only a limited number of basic properties and must be extended with information along the lines of a musical ontology, including genre, mood and the like. However, the detector presented here provides a skeleton solution that accommodates an extension with arbitrary predicates over the musical structure in a transparent manner.

The grammar given below corresponds in an obvious way with the structure depicted in slide midi-structure.



  
  detector song;     -- to get the filename
  detector lyrics;   -- extracts lyrics
  detector melody;   -- extracts melody

  atom str name;
  atom str text;
  atom str note;

  song: file lyrics melody;
  file: name;
  lyrics: text*;
  melody: note*;

slide: A simple feature grammar for MIDI files

The start symbol is song. The detector associated with song reads in a MIDI file. The musical information contained in the MIDI file is then stored as a collection of Prolog facts. This translation is very direct: in effect, the MIDI file header information is stored, and events are recorded as facts, as illustrated below for a note_on and a note_off event.

  event('kortjakje',2,time=384, note_on:[chan=2,pitch=72,vol=111]).
  event('kortjakje',2,time=768, note_off:[chan=2,pitch=72,vol=100]).
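A minimal sketch of how such facts might be emitted is given below; the callback-style interface to the MIDI reader and the facts output stream are assumptions, as the adapted MIDI software is not shown here.

  #include <stdio.h>

  static FILE *facts;   /* assumed output stream for the Prolog facts */

  /* Hypothetical callback, invoked by the MIDI reader for each
     note_on event; the signature is an assumption. */
  void on_note_on(const char *song, int track, long time,
                  int chan, int pitch, int vol) {
    fprintf(facts, "event('%s',%d,time=%ld, note_on:[chan=%d,pitch=%d,vol=%d]).\n",
            song, track, time, chan, pitch, vol);
  }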
  
After translating the MIDI file into Prolog facts, the other detectors, that is the lyrics and melody detectors, are invoked to extract the information related to these properties.

slide: Processing MIDI file

The actual processing is depicted in slide midi-processing. The input is a MIDI file. As indicated in the top line, the MIDI file itself may be generated from a lilypond file. Lilypond is a LaTeX-like formatting language for musical scores that also supports the generation of MIDI, described in [Lily]. As indicated on the bottom line, processing a MIDI file results in a collection of features as well as in a MIDI file and a lilypond file. The (result) MIDI file contains an extract of the original (input) MIDI file, and the lilypond file contains a score for this extract, which may be presented to the (end) user as the result of a query. This setup allows us to verify whether our extract or abstraction of the original musical structure is effective, simply by comparing the input (MIDI or lilypond) musical structure with the output (MIDI or lilypond) extract.

To extract relevant fragments of the melody we use the melody detector, of which a partial listing is given below.


  int melodyDetector(tree *pt, list *tks) {
    char buf[1024]; char* _result;
    void* q = _query;
    int idq = 0;

    /* ask the embedded logic component for the melody */
    idq = query_eval(q, "X:melody(X)");
    while ((_result = query_result(q, idq))) {
      printf("note: %s\n", _result);    /* report each note found */
      putAtom(tks, "note", _result);    /* add the note to the tokenstream */
    }
    return SUCCESS;
  }
  

slide: The melody detector

The embedded logic component is given the query X:melody(X), which results in the notes that constitute the (relevant fragment of the) melody. These notes are then added to the tokenstream. A similar detector is available for the lyrics.
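Patterned after the melodyDetector listing above, a lyrics detector might look as follows; this sketch is ours, with the query and the atom name taken from the feature grammar (lyrics: text*).

  /* Sketch of a lyrics detector, patterned after melodyDetector;
     the query and atom name follow the grammar rule lyrics: text*. */
  int lyricsDetector(tree *pt, list *tks) {
    char* _result;
    void* q = _query;
    int idq = query_eval(q, "X:lyrics(X)");
    while ((_result = query_result(q, idq))) {
      putAtom(tks, "text", _result);   /* add each fragment of the lyrics */
    }
    return SUCCESS;
  }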

Parsing a given MIDI file, for example kortjakje.mid, results in updating the Monet database. The updates reflect the structure of the musical information object that corresponds to the properties defined in the grammar.
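By analogy with the community example, these updates might take the following form; the table names follow the parent-child naming pattern used there, but this listing is illustrative, not actual output.

  V0 := newoid();
  V1 := newoid();
    song_file.insert(oid(V0),oid(V1));
      file_name.insert(oid(V1),"kortjakje.mid");
    ...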

Implementation status

Currently, we have a running prototype of the MIDI feature detector. It uses an adapted version of public domain MIDI processing software. The embedded logic component is part of the hush framework. It uses an object extension of Prolog that allows for the definition of native objects to interface with the MIDI processing software. The logic component allows for the definition of arbitrary predicates to extract musical information, such as the melody and the lyrics.

Queries -- the user interface

Assuming that we have an adequate solution for indexing musical data, we need to define how end users may access these data, that is, search for musical objects in the information space represented by the database, which for the ACOI project is the World Wide Web.

slide: Keyboard interface

For a limited category of users, those with some musical skills, a direct interface such as a keyboard or a score editor, as provided by the hush framework, might be suitable for querying the musical database. Yet for many others, a textual description or a form-based query will be more appropriate.

slide: User Query Processing

In processing a query, we may in some cases derive a partial melody or rhythmic structure from the query, as well as some additional features or criteria. As explained, the output of indexing MIDI files consists of both information concerning features and a musical rendering of some of these features. The features can be matched against the criteria formulated in the query. The musical renderings, which include a partial score, may be presented to the user in response to a query, to establish whether the result is acceptable.