topical media & game development



A framework for mixed media -- emotive dialogs, rich media and virtual environments

Anton Eliëns, Claire Dormann, Zhisheng Huang and Cees Visser

Keywords: presentation technology, persuasion, mixed media, dialogs, virtual worlds, rich media.



We present a framework for merging mixed media in a unified fashion. Our framework supports 3D virtual environments, rich media (such as digital video) and (emotive) dialogs presented by humanoid characters or simply by text balloons. We speak of mixed media since each of these media may have a clearly distinct narrative structure. In this paper, we discuss the background and motivation of our approach, which is the presentation of instructional material using (rich media) 3D slides, enhanced with (possibly ironic) comments presented by humanoid characters. We also explore the design space of such mixed media presentations, that is, the issues involved in merging material with potentially conflicting narrative structures. In addition, we look at the style parameters needed to author such presentations effectively, and we briefly describe the implementation platform used to realize them.



Introduction

Some phenomena in the media are older than is generally recognized. As observed in [Briggs and Burke (2001)], some of the conventions of 20th-century comic books draw directly or indirectly on an even longer visual tradition: speech balloons can be found in eighteenth-century prints, which are in turn an adaptation of the 'text scrolls' coming from the mouths of the Virgin and other figures in medieval religious art.

In other words, mixed media have their origin in the medieval era, where both visual and textual material were part of the rhetoric of institutionalized religion.

Our interest in mixed media, that is the use of speech balloons with text together with rich media and 3D virtual environments, stems from developing instructional material using 3D technology. We enhanced the material, which was organized in sequentially ordered slides, with comments presented by humanoid characters. These comments would sometimes contain additional information, and were sometimes plainly ironic, anticipating students' comments. The dialogs were meant to draw attention to particular aspects, and more generally to increase the emotional involvement of the students with the material. Experimenting with the way comments could be added to a slide, we found that the humanoids were often not necessary. Also, to avoid interference with the material presented in the slides, it was often necessary to place the speech balloons somewhat off-center, for example in the lower right corner.

In this paper, we explore the design space of mixed media presentations, which involve the juxtaposition of emotive dialogs, rich media such as digital video and virtual environments. We will discuss the issues involved in merging material with distinct, potentially conflicting, narrative structures, and we will look at the style parameters needed to author mixed media presentations in an effective way.


The structure of the paper is as follows. In section 2, we briefly present the background and motivation of our approach, with an example. In section 3, we discuss narrative structure from the perspective of persuasion and emotional involvement, and compare our approach to related work. In section 4, we explore the design space of mixed media and study a variety of combinations by means of representative examples. In section 5, we look at authoring issues and discuss the style parameters we used. In section 6, we briefly describe the implementation platform, and in section 7 we discuss research directions and open issues.

Background and motivation

Desktop VR is an excellent medium for presenting information, for example in class, in particular when rich media or 3D content is involved.

At VU, we have been using presentational VR for quite some time, in a course on Web3D technology and also in our introductory multimedia course.

Recently we included dialogs using speech balloons (and possibly avatars) to display the text commenting on a particular presentation.

A dialog is (simply) a sequence of phrases for two (virtual) speakers. Each speaker, alternately, may deliver a phrase. When delivering a phrase the speaker may step forward, depending on the style of presentation chosen. The dialog text (and avatars) are programmed as annotations to a particular scene, as described in more detail in section 5.

Each presentation is organized as a sequence of slides, and depending on the slide (or the level within the slide) a dialog may be selected and displayed. See the appendix for a description of the slides format. To be more precise, when the presenter goes to another slide, an observer object checks whether a dialog is available, and if so the dialog is started. See section 6 for the platform used to realize the presentations.
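The selection step described above can be sketched as a lookup keyed by slide and level. This is only an illustration in Python, not the DLP code used on the platform; the names `DialogRegistry`, `annotate`, `lookup` and `on_slide_change` are hypothetical, and the fallback from a level-specific dialog to a slide-wide one is an assumption.

```python
from typing import Dict, List, Optional, Tuple

Phrase = Tuple[str, str]               # (speaker, text)
DialogKey = Tuple[str, Optional[int]]  # (slide id, level), level None = whole slide


class DialogRegistry:
    """Hypothetical store that maps a slide (or a level within it) to a dialog."""

    def __init__(self) -> None:
        self._dialogs: Dict[DialogKey, List[Phrase]] = {}

    def annotate(self, slide: str, dialog: List[Phrase],
                 level: Optional[int] = None) -> None:
        self._dialogs[(slide, level)] = dialog

    def lookup(self, slide: str, level: int) -> Optional[List[Phrase]]:
        # Assumption: a dialog attached to the exact level wins over one
        # attached to the slide as a whole.
        exact = self._dialogs.get((slide, level))
        if exact is not None:
            return exact
        return self._dialogs.get((slide, None))


def on_slide_change(registry: DialogRegistry, slide: str, level: int) -> List[str]:
    """What an observer might do on a slide change: start the dialog, if any."""
    dialog = registry.lookup(slide, level)
    if dialog is None:
        return []
    return [f"{speaker}: {text}" for speaker, text in dialog]


registry = DialogRegistry()
registry.annotate("1", [("right", "how are you"), ("left", "fine thank you")])
print(on_slide_change(registry, "1", 0))
print(on_slide_change(registry, "2", 0))
```

Slides without an annotation simply yield no dialog, which matches the behaviour described above: the presentation proceeds as usual.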

example -- promotion video in virtual environment

In this example, we used a promotion video produced to attract students to our university. In figure (a) the promotion video is embedded in a virtual environment of our university campus. In figure (b) only the video is shown. The dialog used is, in both examples, a somewhat ironic comment on the contents of the movie displayed.

In figure (a), you see the left avatar (named cutie) step forward and deliver her phrase. This dialog continues until cutie remarks that she always wanted to be an agent. In figure (b), you see the right avatar (named red) step forward and ask 'what do you think of studying ...', opening a discussion of what it takes to be a student.

Note that in example (a) the avatars are more or less natural inhabitants of the virtual environment, whereas in example (b) they are simply laid on top of the movie. In the latter case, the positioning of the avatars may easily seem unnatural, although it sometimes works out surprisingly well.

Figure: (a) dialog in context, (b) dialog on video

General perspectives -- narrative structure and persuasion

Mixed media, that is the combination of dialogs with rich media and virtual environments, may endanger the narrative structure of a presentation. Each of these media entities may have a distinct narrative structure. Slides are sequentially organized, and each slide may have levels that are displayed in a sequential fashion. The narrative structure of digital video may be arbitrarily complex, and may make full use of cinematographic rhetoric. Navigation and interaction in virtual environments may be seen as a weak narrative structure, which may however be strengthened by guided tours or viewpoint transformations, taking the user to a variety of viewpoints in a controlled manner. Finally, dialogs have a well-defined temporal structure, with alternating turns for the two (virtual) speakers.

In the following, we compare our approach to related research and investigate what advantages we may obtain from mixed media presentations.

consonant or dissonant comments

Our approach is clearly reminiscent of the notorious Agneta & Frida characters developed in the Persona project. The Persona project aims at investigating a new approach to navigation through information spaces, based on a personalised and social navigational paradigm, [Munro et al. (1999)]. The novel idea pursued in this project is to have agents (Agneta and Frida) that are not helpful, but instead just give comments, sometimes humorous, sometimes ironic or even sarcastic, on the user's activities, in particular navigating an information space or (plain) web browsing.

In contrast with the Persona project, our (dialog) comments are part of a presentation which itself has a definite narrative structure, as opposed to the 'random' navigation that occurs when browsing 'information spaces'.

As a consequence, our comments may be designed taking the expected reaction of the audience into consideration. An interesting question is whether comments should be consonant with the information presented (drawing attention to particular aspects) or dissonant (as with ironic or sarcastic comments).

engagement versus immersion

The characters and dialog text may be used to enliven the material.

In this way, the students' engagement with the material may be increased,  [Engagement].

Clearly, there is a tension between engagement and immersion. Immersion, understood as the absorption within a familiar narrative scheme (in our case the lecturer's presentation), may be disrupted by the presence of (possibly annoying) comments, whereas the same comments may lead the attention back to the material, or provide a foothold for affective reactions to the material,  [Dijkstra et al. (1994)]. Also, the audience might start to anticipate the occurrence of a dialog and possibly identify themselves with one of the characters.

emotional enhancement

In one of her talks, Kristina Höök observed that some users get really fed up with the comments delivered by Agneta and Frida. Nevertheless, it also appeared that annoyance and irritation increased the emotional involvement with the task.

For our presentations, we may ask how mixed media may help in increasing the emotional involvement of the audience or, phrased differently, how dialogs may lead to emotional enhancement of the material, [Astleitner (2000)].

An important difference with the Persona project is that our platform supports the actual merging of dialogs, and the humanoid characters that deliver them, into a unified presentation format, that is a rich media 3D graphics format based on X3D/VRML.

As a consequence, the tension between immersion and engagement may be partially resolved, since the characters delivering the dialog may be placed in their 'natural' context, that is a virtual environment as in example (a).

Design space -- the juxtaposition of mixed media

In this section, we explore in a more systematic way what options we have in creating mixed media presentations. We divide these options according to levels of complexity, with the basic material at level 0, that is dialog text, rich media objects such as digital video, and virtual environments. The other levels arise by adding dialogs to either the media object or the virtual environment. We also allow the virtual speakers of the dialog to change attributes of the presentation, for example by depositing objects in the (virtual) environment.

In summary, we distinguish between the following levels of complexity:

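levels of complexity

  • level 0 -- basic material: dialog text, rich media objects, virtual environments
  • level 1 -- combined: a dialog superimposed on, or merged with, a media object
  • level 2 -- with avatars: dialogs delivered by humanoid characters
  • level 3 -- with attributes: avatars that act on the presentation or the environment
  • level 4 -- with context: dialogs and media placed in a virtual environment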

The ordering of these levels is not unique, but admittedly depends on our intuition of complexity.

For each of these levels we have a collection of representative examples.

Figure: (c) context in dialog, (d) dialog in context

level 0: basic material

level 1 -- combined:

When combining the media object with the dialog, we may simply superimpose the dialog on the movie, in a similar way as shown in figure (b), or we may project the video onto the surface of the speech balloon, as illustrated in figure (c), which has a rather surprising effect.

level 2 -- with avatars:

It makes quite a difference whether humanoid characters are used to deliver the dialog or only plain text balloons. The use of avatars seems to enhance the recognition of a particular role, that is the kind of comments the character makes. Once the role is established, the avatars may disappear and the color and position of the speech balloons suffice. Another difference is that the balloons should be positioned differently when no avatar is present.

level 3 -- with attributes

Apart from speaking their dialog text, the avatars may undertake autonomous actions. They might get bored and display ambient behavior, like looking at their watch. In addition, before or after speaking their phrase, they may change their position and modify the environment, for example by depositing (3D) objects to illustrate their comments. These actions may be arbitrarily complex, and may for example result in going to the next (level of the) slide.

level 4 -- with context:

The ultimate context of a slide may be a virtual environment, as in our examples (a) and (d), which show a virtual environment of our university campus developed by one of our students. With such a context, the difference between a presentation and an information space becomes blurry, since the user may interact with the environment and, for example, start a guided tour. In such cases the developer must decide whether the dialog takes place at a fixed position in the virtual environment or is fixed relative to the viewpoint of the user/audience.
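The choice between a dialog fixed in the world and a dialog fixed relative to the viewpoint can be made concrete with a little vector arithmetic. The sketch below is an illustration only: the function names and the `mode` values are hypothetical, and the viewpoint is simplified to a position plus a rotation about the vertical (y) axis, following the VRML convention that y points up.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def rotate_y(v: Vec3, angle: float) -> Vec3:
    """Rotate v about the vertical (y) axis by angle radians (right-handed)."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def dialog_position(anchor: Vec3, mode: str, view_pos: Vec3, view_yaw: float) -> Vec3:
    """Return the world position at which the dialog should be drawn.

    mode "world":     anchor is a fixed position in the virtual environment.
    mode "viewpoint": anchor is an offset in the viewer's own frame, so the
                      dialog follows the viewer, for example on a guided tour.
    """
    if mode == "world":
        return anchor
    if mode == "viewpoint":
        ox, oy, oz = rotate_y(anchor, view_yaw)
        return (view_pos[0] + ox, view_pos[1] + oy, view_pos[2] + oz)
    raise ValueError(f"unknown mode: {mode}")


# The same anchor, interpreted in both ways, for a viewer standing at z = 10.
offset = (0.5, 0.0, -2.0)
print(dialog_position(offset, "world", (0.0, 1.6, 10.0), 0.0))
print(dialog_position(offset, "viewpoint", (0.0, 1.6, 10.0), 0.0))
```

In the world-fixed case the dialog may drift out of view during a tour; in the viewpoint-relative case it stays in front of the viewer but no longer belongs to a place in the environment.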

Merging the dialogs with the virtual environment and the media object might, by the way, lead to overly complex presentations such as the one depicted in figure (a).

Authoring issues -- style parameters

In the next two sections, we will look at the implementation of the dialogs, respectively from an authoring perspective and a system perspective.

Authoring a dialog for a particular scene or slide should be easy. The encoding of the dialog used in the examples discussed is illustrated below.


  <phrase right="how~are~you"/>
  <phrase left="fine~thank~you"/>
  <phrase right="what do~you think~of studying ..."/>
  <phrase left="So,~what~are you?"/>
  <phrase right="an ~agent" style="[a(e)=1]"/>
  <phrase left="I always~wanted to be~an agent" style="[a(e)=1]"/>

The phrases are (textually) included in a slide, which is itself indicated by appropriate begin and end tags. The alternation between speakers is indicated by the attributes left and right. More detailed indications are possible, for example of when a phrase should be uttered, but these advanced options are hardly ever used, except for defining complex actions.
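As an illustration of how such annotations could be read, the following Python sketch extracts the speaker, text and style of each phrase. It is not the compiler used on the platform, which produces DLP code; the wrapping `<dialog>` element and the function name `parse_phrases` are assumptions, and the phrase text is returned as written, without interpreting the `~` characters.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

FRAGMENT = """
<phrase right="how~are~you"/>
<phrase left="fine~thank~you"/>
<phrase right="an ~agent" style="[a(e)=1]"/>
"""


def parse_phrases(fragment: str) -> List[Tuple[str, str, str]]:
    """Return (speaker, text, style) for each phrase tag, in document order."""
    # The fragment has no single root element, so it is wrapped in a
    # hypothetical <dialog> element before parsing.
    root = ET.fromstring(f"<dialog>{fragment}</dialog>")
    turns = []
    for phrase in root.iter("phrase"):
        for speaker in ("left", "right"):
            if speaker in phrase.attrib:
                turns.append((speaker, phrase.attrib[speaker], phrase.get("style", "")))
    return turns


for speaker, text, style in parse_phrases(FRAGMENT):
    print(speaker, repr(text), style)
```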

Furthermore, there are a number of style parameters that may be used to decide, for example, whether the avatars or personas are visible, where to place the dialog balloons on the display, and what color and transparency the balloons have. To this end, we have included a style attribute in the phrase tag, which allows for setting any of the style parameters.

style parameters

  <phrase right="red" style="[p=(0.5,0,0),persona=0,balloon=0]"/>
  <phrase left="cutie" style="[p=(-0.5,0,0),persona=0,balloon=0]"/>
  <gesture right="1" style="default"/>
  <gesture left="1" style="default"/>

Apart from phrases, we also allow for gestures, taken from the built-in repertoire of the avatars. In [Huang et al. (2003)], we discuss how to extend the repertoire of gestures, using a gesture specification language.

Both phrases and gestures are compiled into DLP code and loaded when the annotated version of the VR presentation is started. See section 6.
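To make the shape of the style attribute concrete, the sketch below reads a string such as `[p=(0.5,0,0),persona=0,balloon=0]` into a dictionary. This is a Python illustration under stated assumptions, not the platform's own parser: the function names are hypothetical, values are assumed to be either numbers or parenthesized tuples of numbers (as in the examples above), and no meaning is attached to the individual parameters.

```python
from typing import Dict, List, Tuple, Union

Value = Union[float, Tuple[float, ...]]


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not enclosed in parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_style(style: str) -> Dict[str, Value]:
    """Parse '[key=value,...]' into a dict; numeric values only (an assumption)."""
    body = style.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    params: Dict[str, Value] = {}
    for item in split_top_level(body):
        key, _, raw = item.partition("=")
        raw = raw.strip()
        if raw.startswith("(") and raw.endswith(")"):
            params[key.strip()] = tuple(float(x) for x in raw[1:-1].split(","))
        else:
            params[key.strip()] = float(raw)
    return params


print(parse_style("[p=(0.5,0,0),persona=0,balloon=0]"))
print(parse_style("[a(e)=1]"))
```

Splitting only on top-level commas keeps the position tuple intact, and partitioning on the first `=` leaves a key such as `a(e)` untouched.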

Implementation -- the DLP+X3D platform

In our group we have developed a platform for intelligent multimedia, that is a platform for virtual environments based on agent technology, supporting embodied conversational agents,  [Eliens et al. (2002)]. Our platform merges X3D/VRML with the distributed logic programming language DLP.

To effect an interaction between the 3D content and the behavioral component written in DLP, we need to deal with control points, and (asynchronous) event-handling.


  • control points: get/set -- position, rotation, viewpoint
  • event-handling -- asynchronous accept

The control points are actually nodes in the VRML scenegraph that act as handles which may be used to manipulate the scenegraph.
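The idea of control points as named handles with get/set access can be sketched as follows. The sketch is in Python rather than DLP, and the class and method names (`Scene`, `ControlPoint`, `get_position`, `set_position`, and so on) are hypothetical stand-ins, not the platform's API. Rotations are written VRML-style as an axis followed by an angle.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]
Rotation = Tuple[float, float, float, float]  # axis (x, y, z) followed by angle


@dataclass
class ControlPoint:
    """Stand-in for a named scene-graph node that serves as a handle."""
    position: Vec3 = (0.0, 0.0, 0.0)
    rotation: Rotation = (0.0, 0.0, 1.0, 0.0)


class Scene:
    """Hypothetical scene that exposes its control points by name."""

    def __init__(self) -> None:
        self._nodes: Dict[str, ControlPoint] = {}

    def define(self, name: str) -> None:
        self._nodes[name] = ControlPoint()

    def get_position(self, name: str) -> Vec3:
        return self._nodes[name].position

    def set_position(self, name: str, position: Vec3) -> None:
        self._nodes[name].position = position

    def get_rotation(self, name: str) -> Rotation:
        return self._nodes[name].rotation

    def set_rotation(self, name: str, rotation: Rotation) -> None:
        self._nodes[name].rotation = rotation


scene = Scene()
scene.define("cutie")
scene.set_position("cutie", (-0.5, 0.0, 0.0))  # for example, let the avatar step aside
print(scene.get_position("cutie"))
```

The behavioral component only needs the names of the control points; everything else in the scene graph stays opaque to it.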

Our approach also allows for changes in the scene that are not a direct result of setting attributes from the logic component, as for example the transition to a new slide. An event observer is actually used to detect the transition to another slide. A dialog may then be started to comment on the contents of that particular slide.

The DLP+X3D platform may also be used to realize multi-user virtual environments, [Huang et al. (2002)]. How a multi-user environment might be used to explore narrative structures is, for now, a matter of speculation.

Research directions -- conversational agents

In the examples discussed, the agent avatars entered a dialog with one another to comment on a particular scene or slide, to augment a presentation. As a next step, we would like to extend our approach to allow for interaction in a virtual environment, that is to augment information spaces.

virtual museums

As an example, in a virtual gallery for Escher, we can use dialogs to enhance visitors' understanding and enjoyment of the work of Escher on display. We can increase viewers' involvement by directing their attention to salient features of the painting or the exhibition. By giving information on the ideas that motivate the painting, its creation or the life of Escher, we can draw viewers more effectively into the world of Escher.

In addition, conversational agents can assist the user in finding a particular work and, in particular for the 'impossible spaces' of Escher, offer means to experiment with the graphical attributes of the world.

cultural heritage

Another application where we can profit from the rich presentation facilities of desktop VR is the construction of a virtual environment for cultural heritage. As an example, think of the database of INCCA (International Network for the Conservation of Contemporary Art), which contains interviews (auditory material), photos and drawings (images), and documents and other written material. We could well imagine having (conversational) presentation agents present the information to the user. Dialogs may be used to present various viewpoints on a particular oeuvre.


Conclusions

We have described a framework for mixed media that allows for the superposition of (text) dialogs delivered by humanoid avatars and/or speech balloons, on arbitrary rich media objects and virtual environments.

We have looked at the design space of mixed media presentations by discussing a number of representative examples, each illustrating a particular level of complexity. We also discussed authoring issues and indicated the style parameters needed to develop effective presentations.

We have further described the implementation platform used to realize the mixed media presentations and explored what new applications and extensions are feasible.

Appendix: The slides format

In this technical appendix, a simplified description will be given of the slides format. Slides are fragments of a document that may be presented to an audience. Our approach allows for displaying slides in either dynamic HTML or VRML. Slides are encoded using XML.

slides in XML

  <slide id="1">
  <line>What about the slide format?</line>
  <line>yeh, what about it?</line>
  <vrml>Sphere { radius 0.5 }</vrml>
  </slide>
  <slide id="2">
  <vrml>Sphere { radius 0.5 }</vrml>
  </slide>

The first slide contains some text and a 3D object. The second slide contains only a 3D object. In between the slides there may be arbitrary text. The slides are converted to VRML using XSLT, the XML transformation language.
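The structure of the slides can be made explicit by reading them into a simple data structure. The sketch below uses Python's ElementTree instead of XSLT, so it illustrates the format and not the actual conversion; the enclosing `<slides>` element and the function name `read_slides` are assumptions introduced to obtain a well-formed document.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DOCUMENT = """
<slides>
  <slide id="1">
    <line>What about the slide format?</line>
    <line>yeh, what about it?</line>
    <vrml>Sphere { radius 0.5 }</vrml>
  </slide>
  Arbitrary text between slides is skipped by this sketch.
  <slide id="2">
    <vrml>Sphere { radius 0.5 }</vrml>
  </slide>
</slides>
"""


def read_slides(document: str) -> List[Dict[str, object]]:
    """Collect, per slide, its id, its lines of text and its embedded VRML."""
    root = ET.fromstring(document.strip())
    slides = []
    for slide in root.iter("slide"):
        slides.append({
            "id": slide.get("id"),
            "lines": [line.text or "" for line in slide.findall("line")],
            "objects": [(node.text or "").strip() for node in slide.findall("vrml")],
        })
    return slides


for slide in read_slides(DOCUMENT):
    print(slide["id"], len(slide["lines"]), "lines,", len(slide["objects"]), "objects")
```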


To support slides in VRML a small collection of PROTO definitions is used.


  • slideset -- container for slides
  • slide -- container for text and objects
  • text -- container for lines of text
  • line -- container for text
  • break -- empty text

The slides contained in a document constitute a slide set. A slide set is a collection of slides that may contain lines of text and possibly 3D objects. For displaying 3D objects in a slide we need no specific PROTO.

The slide PROTO defines an interface which may be used to perform spatial transformations on the slide, like translation, rotation and scaling. The interface also includes a field to declare the content of the slide, that is text or (arbitrary) 3D objects.

The slideset contains a collection of slides, and allows for proceeding to the next slide. A text may contain a sequence of lines and breaks. It supports a simple layout algorithm.

annotated slides

Slides may be annotated with dialogs, as described in section 5. The annotation is compiled to DLP code, which is activated whenever the slide (or level within a slide) to which the annotation belongs is displayed.

To intercept the occurrence of a particular event, such as the display of a slide, we use an observer object which is specified as in the code fragment below.


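  :- object observer : [actions].
  var slide = anonymous, level = 0, projector = nil.

  % constructor: remember the projector, then keep accepting events
  % (the repeat/fail loop is an assumed completion; the fragment as
  % printed breaks off after the accept goal)
  observer(X) :-
     projector := X,
     repeat,
       accept( id, level, update, touched),
     fail.

  id(V) :- slide := V.
  level(V) :- level := V.
  touched(V) :- projector<-touched(V).
  update(V) :- act(V,slide,level).
  :- end_object observer.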

The observer object has knowledge of, that is inherits from, an object that contains particular actions.

As indicated before, events come from the 3D scene. For example, the touched event results from mouse clicks on a particular object in the scene. On accepting an event, the corresponding method or clause is activated, resulting in either changing the value of a non-logical instance variable, invoking a method, or delegating the call to another object.
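The three kinds of reaction mentioned above (changing an instance variable, invoking a method, delegating to another object) can be mirrored in a small Python sketch. It is a loose analogue of the DLP observer, not a translation: the `Projector` class and the `act` callback are hypothetical stand-ins for the projector object and the inherited actions.

```python
from typing import Any, Callable, Dict, List, Tuple


class Projector:
    """Hypothetical stand-in for the object that receives touch events."""

    def __init__(self) -> None:
        self.touches: List[Any] = []

    def touched(self, value: Any) -> None:
        self.touches.append(value)


class Observer:
    """Python analogue of the observer: one handler per accepted event."""

    def __init__(self, projector: Projector,
                 act: Callable[[Any, Any, int], None]) -> None:
        self.slide: Any = "anonymous"
        self.level = 0
        self.projector = projector
        self._act = act  # stands in for the inherited actions
        self._handlers: Dict[str, Callable[[Any], None]] = {
            "id": self._id,
            "level": self._level,
            "touched": self._touched,
            "update": self._update,
        }

    def accept(self, event: str, value: Any) -> bool:
        """Dispatch an event from the scene; unknown events are ignored."""
        handler = self._handlers.get(event)
        if handler is None:
            return False
        handler(value)
        return True

    def _id(self, value: Any) -> None:       # change an instance variable
        self.slide = value

    def _level(self, value: int) -> None:    # change an instance variable
        self.level = value

    def _touched(self, value: Any) -> None:  # delegate to another object
        self.projector.touched(value)

    def _update(self, value: Any) -> None:   # invoke a method
        self._act(value, self.slide, self.level)


log: List[Tuple[Any, Any, int]] = []
projector = Projector()
observer = Observer(projector, lambda v, slide, level: log.append((v, slide, level)))
for event, value in [("id", "2"), ("level", 1), ("update", "now"), ("touched", "sphere")]:
    observer.accept(event, value)
print(log, projector.touches)
```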

(C) Æliens 27/08/2009

You may not copy or print any of this material without explicit permission of the author or the publisher. In case of other copyright issues, contact the author.