From here to automatic content description is, admittedly,
still a long way.
We will indicate some research directions at the end
of this section.
We need not necessarily know what an image (or segment of it)
depicts to establish whether there are other images that
contain that same thing, or something similar to it.
We may, following [MMDBMS], formulate the problem of similarity-based
retrieval as follows:
How do we determine whether the content of a segment
(of a segmented image) is similar to another image (or set of images)?
Think of, for example, the problem of finding all photos
that match a particular face.
According to [MMDBMS], there are two solutions:
- metric approach -- distance between two image objects
- transformation approach -- relative to specification
As we will see later, the transformation approach in
some way subsumes the metric approach, since we can formulate
a distance measure for the transformation approach as well.
What does it mean when we say that the distance between
two images is less than the distance between this
image and that one?
What we want to express is that the first two
images (or faces) are more alike, or maybe even identical.
Abstractly, something is a distance measure
if it satisfies certain criteria.
d is a distance measure if:
- d(x,y) = d(y,x)
- d(x,y) <= d(x,z) + d(z,y)
- d(x,x) = 0
For your intuition, it is enough to limit
yourself to what you are familiar with,
that is, measuring distance in ordinary (Euclidean) space.
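As a small illustration (a sketch, not part of the original text), the ordinary Euclidean distance satisfies all three axioms. The helper below, with a hypothetical name `is_metric`, simply checks them on a sample of points:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def is_metric(d, points):
    """Check the three distance-measure axioms on a sample of points."""
    for x in points:
        if d(x, x) != 0:                               # d(x,x) = 0
            return False
        for y in points:
            if d(x, y) != d(y, x):                     # d(x,y) = d(y,x)
                return False
            for z in points:
                # triangle inequality, with a tolerance for rounding
                if d(x, y) > d(x, z) + d(z, y) + 1e-9:
                    return False
    return True

sample = [(0, 0), (3, 4), (1, 1), (-2, 5)]
print(is_metric(euclidean, sample))  # True
```

Of course, a check on sample points does not prove the axioms in general; it only illustrates what they require.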
Now, in measuring the distance between two
images, or segments of images,
we may go back to the level of pixels,
and establish a distance metric on pixel properties,
by comparing all properties pixel-wise and aggregating the differences.
Leaving the details for your further research, it
is not hard to see that even if the absolute value
of a distance has no meaning, relative distances do.
So, when an image contains a face with dark sunglasses,
it will be closer to (an image of) a face with
dark sunglasses than a face without sunglasses,
other things being equal.
It is also not hard to see that a pixel-wise
approach is, computationally, quite complex.
An image object is then considered as
- an object with pixel properties
- an object containing w x h (n+2)-tuples, that is,
a set of points in k-dimensional space, for k = n + 2
In other words, establishing similarity between two images
(that is, calculating the distance)
requires (n + 2) times as many comparisons as there are pixels.
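A pixel-wise distance can be sketched as follows, assuming (this representation is not from the original text) that images are equal-sized grids of property tuples, here with n = 3 color properties per pixel:

```python
def pixel_distance(img1, img2):
    """Pixel-wise distance: compare every property of every pixel.

    Images are assumed to be equal-sized grids (lists of rows) of
    property tuples, e.g. (r, g, b) values per pixel.
    """
    total = 0
    for row1, row2 in zip(img1, img2):
        for p1, p2 in zip(row1, row2):
            total += sum((a - b) ** 2 for a, b in zip(p1, p2))
    return total ** 0.5

# two tiny 2x2 "images" with 3 properties (n = 3) per pixel
a = [[(0, 0, 0), (10, 10, 10)], [(20, 20, 20), (30, 30, 30)]]
b = [[(0, 0, 0), (10, 10, 10)], [(20, 20, 20), (30, 30, 30)]]
print(pixel_distance(a, b))  # 0.0 -- identical images
```

Note that the inner loops touch every property of every pixel, which makes the cost grow with image size exactly as described above.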
Obviously, we can do better than that by restricting
ourselves to a pre-defined set of properties or features.
For example, one of the features could indicate
whether or not it was a face with dark sunglasses.
So, instead of calculating the distance by
establishing color differences between regions
of the images where sunglasses may be found,
we may limit ourselves to considering a binary value,
yes or no, to see whether the face has sunglasses.
Once we have determined a suitable set of features
that allow us to establish similarity between images,
we no longer need to store the images themselves,
and can build an index based on feature vectors only,
that is, the combined values of the selected properties.
Feature vectors and extensive comparison are not exclusive,
and may be combined to get more precise results.
Whatever way we choose, when we present an image
we may search in our image database
and present all those objects that fall within
a suitable similarity range,
that is the images (or segments of images)
that are close enough according to the
distance metric we have chosen.
- feature extraction -- maps an object into s-dimensional feature space
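Retrieval within a similarity range can be sketched as below; the feature layout (a hypothetical `has_sunglasses` flag plus two numeric properties) and the database contents are invented for illustration:

```python
def feature_distance(f1, f2):
    """Distance between two feature vectors (numeric components)."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def range_query(db, query, threshold):
    """Return all objects whose feature vector lies within the range."""
    return [name for name, features in db.items()
            if feature_distance(features, query) <= threshold]

# hypothetical feature vectors: (has_sunglasses, brightness, face_width)
db = {
    "face1": (1, 0.8, 0.5),
    "face2": (0, 0.7, 0.5),
    "face3": (1, 0.9, 0.6),
}
print(range_query(db, (1, 0.8, 0.5), 0.2))  # ['face1', 'face3']
```

Only the faces with dark sunglasses fall within the chosen range, since the binary feature alone already contributes a distance of 1 for the others.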
Instead of measuring the distance between two
images (objects) directly,
we can take one image and start modifying that
until it exactly
equals the target image.
In other words, as phrased in [MMDBMS],
the principle underlying the transformation approach is:
given two objects o1 and o2,
the level of dissimilarity is proportional
to the (minimum) cost of transforming object o1
into object o2, or vice versa.
Now, this principle might be applied to any representation
of an object or image, including feature vectors.
Yet, on the level of images, we may think of geometric operations
-- translation, rotation, scaling
Moreover, we can attach a cost to each of these
operations and calculate the cost of
a transformation sequence TS by summing the costs
of the individual operations.
Based on the cost function we can define a distance metric,
which we call for obvious reasons the edit distance,
to establish similarity between objects.
An obvious advantage of the edit distance
over the pixel-wise distance metric is that we may
have a rich choice of transformation operators
that we can attach (user-defined) cost to at will.
- user-defined similarity -- choice of transformation operators
- user-defined cost-function
For example, we could define low costs for normalization operations,
such as scaling and rotation,
and attach more weight to operations that modify color values
or add shapes.
For face recognition, for example,
we could attribute low cost to adding sunglasses
but high cost to changing the sex.
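The edit distance with user-defined costs can be sketched as below; the operation names and cost values are invented for illustration, and the candidate transformation sequences are assumed to be given rather than searched for:

```python
# hypothetical operation costs: normalization is cheap, content changes costly
COSTS = {
    "scale": 1, "rotate": 1,          # normalization operations
    "recolor": 10, "add_shape": 10,   # content modification
    "add_sunglasses": 2,              # cheap, for face recognition
    "change_sex": 100,                # effectively ruled out
}

def sequence_cost(ops):
    """Cost of a transformation sequence TS = sum of operation costs."""
    return sum(COSTS[op] for op in ops)

def edit_distance(candidate_sequences):
    """Edit distance: the minimum cost over all given transformation
    sequences that turn object o1 into object o2."""
    return min(sequence_cost(ts) for ts in candidate_sequences)

# two ways to transform one face image into another
print(edit_distance([
    ["scale", "add_sunglasses"],   # cost 1 + 2 = 3
    ["recolor", "rotate"],         # cost 10 + 1 = 11
]))  # 3
```

Changing the cost table changes the notion of similarity, which is precisely the user-defined flexibility mentioned above.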
To support the transformation approach
at the image level,
our image database needs to include suitable operations.
We might even think of storing images,
not as a collection of pixels,
but as a sequence of operations
on any one of a given set of base images.
This is not such a strange idea
as it may seem.
For storing information about faces, for example,
we may take a base collection of prototype faces
and define an individual face
by selecting a suitable prototype
and a limited number of operations or additional features.
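Such a representation might be sketched as follows; the prototype id, operator names, and the replay helper are all hypothetical:

```python
# a face stored not as pixels, but as a prototype plus operations
stored_face = {
    "prototype": "prototype_07",              # id in the base collection
    "operations": [("add_sunglasses", {}),    # hypothetical operator names
                   ("scale", {"factor": 1.1})],
}

def reconstruct(face, load_prototype, apply_op):
    """Rebuild the image by replaying the operations on the prototype."""
    image = load_prototype(face["prototype"])
    for name, args in face["operations"]:
        image = apply_op(image, name, args)
    return image
```

The callers supply `load_prototype` and `apply_op`, so the same replay scheme works for any concrete image representation.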
example(s) -- match of the day
The images in this section present
a match of the day, which is
part of the project split representation by the Dutch media
artist Geert Mul.
As explained in the email accompanying the images, about once a week,
television images are recorded at random from satellite television and compared with each other. Some 1.000.000.000 (one billion) comparisons are done every day.
The split representation project
uses the image analysis and image composition software
which was developed by Geert Mul (concept)
and Carlo Preize (programming & software design).
research directions -- multimedia repositories
What would be the proper format to store multimedia objects?
In other words, what shape should multimedia repositories take?
Some of the issues involved are discussed in the chapter
which deals with
information system architectures.
With respect to image repositories, we may rephrase the question as:
what support must an image repository provide, minimally,
to allow for efficient access and search?
In [MMDBMS], we find the following answer:
- storage -- unsegmented images
- description -- limited set of features
- index -- feature-based index
- retrieval -- distance between feature vectors
And, indeed, this seems to be what most image databases provide.
Note that the actual encoding is not of importance.
The same type of information can be encoded using
either XML, relational tables or object databases.
What is of importance is the functionality that is
offered to the user, in terms of storage and retrieval
as well as presentation facilities.
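The storage, description, index, and retrieval ingredients listed above can be sketched in a few lines; the class and its interface are invented for illustration, not an actual repository API:

```python
class ImageRepository:
    """Minimal sketch: images are stored unsegmented, described by a
    fixed feature vector, indexed on those features, and retrieved
    by feature-vector distance."""

    def __init__(self):
        self.images = {}   # storage: name -> raw (unsegmented) image data
        self.index = {}    # index: name -> feature vector

    def store(self, name, data, features):
        self.images[name] = data
        self.index[name] = features

    def retrieve(self, query, threshold):
        def dist(f, g):
            return sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5
        return [name for name, f in self.index.items()
                if dist(f, query) <= threshold]

repo = ImageRepository()
repo.store("sunset", b"...", (0.9, 0.1))
repo.store("face", b"...", (0.2, 0.8))
print(repo.retrieve((0.2, 0.8), 0.1))  # ['face']
```

Whether this sits on top of XML, relational tables, or an object database is, as noted, an encoding detail; the functionality is what counts.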
What is the relation between presentation facilities
and the functionality of multimedia repositories?
Consider the following mission statement,
which is taken from my research and projects page.
Our goal is to study aspects of the deployment and architecture of virtual environments as an interface to (intelligent) multimedia information systems ...
Obviously, the underlying multimedia repository
must provide adequate retrieval facilities
and must also be able to deliver the desired objects
in a format suitable for the representation and
possibly incorporation in such an environment.
Actually, at this stage, I have only some vague ideas
about how to make this vision come true.
Look, however, at chapter
and appendix [platform]
for some initial ideas.