Electronic Proceedings of the
ACM Workshop on Effective Abstractions in Multimedia
November 4, 1995
San Francisco, California
Addressing the Contents of Video in a Digital Library
Michael G. Christel
Software Engineering Institute
Carnegie Mellon University
Pittsburgh, PA 15213-3890
412-268-7799
mac@sei.cmu.edu
http://www.cs.cmu.edu/afs/andrew/usr/mc6u/www/
Abstract
A digital video library must be efficient at giving users precisely the
material they need, because video has unique characteristics compared to
text. To make the retrieval of bits faster, and to enable faster viewing and
information assimilation, the digital video library will need to support
partitioning video into small clips and alternate representations of the
video. For a general purpose digital video library, precision may have to be
sacrificed in order to ensure that the material the user is interested in
will be recalled in the result set for a query. The result set may then
become quite large, so the user may need to filter the set and decide what is
important. This can be accomplished by collapsing the playback rate of video
objects in the result set as well as by adjusting the size of the objects in
the result set. The Informedia Digital Video Library Project at Carnegie
Mellon University deals with these issues and is introduced here, with
pointers to additional information.
A library cannot be very effective if it is merely a collection of information
without some understanding of what is contained in that collection. Without
that understanding it could take hundreds of hours of viewing to determine if
an item of interest is in a 1000-hour video library. Obviously, such a library
would not be used very often. Marchionini and Maurer reflect on information
accessible via the Internet [Marchionini95, p. 72]:
It has often been said that the Internet is starting to provide the
largest library humankind has ever had. As true as this may be, the
Internet is also the messiest library that ever has existed.
Information is found best on the Internet when the providers augment the
information with rich keywords and descriptors, provide links to related
information, and allow the contents of their pages to be searched and indexed.
There is a long history of sophisticated parsing and indexing for text
processing in various structured forms, from ASCII to PostScript to SGML and
HTML. However, how does one represent video content to support content-based
retrieval and manipulation?
An hour-long motion video segment clearly contains some information suitable
for indexing, so that a user can find an item of interest within it. The
problem is not the lack of information in video, but rather the
inaccessibility of that information to our primarily text-based information
retrieval mechanisms today. In fact, the video likely contains an
overabundance of information, conveyed in both the video signal (camera
motion, scene changes, colors) and the audio signal (noises, silence,
dialogue). A common practice today is to log or tag the video with keywords
and other forms of structured text to identify its contents. Such text
descriptors have the following limitations:
- Manual processes are tedious and time consuming.
- Manual processes are seriously incomplete. Even if full transcripts
of the audio track are entered, other information about the video
will almost surely be left out, such as the identity of persons and
objects in each scene.
- Transcripts are inaccurate, with mistypings and incorrect
classifications often introduced.
- Text descriptors are biased by whatever predetermined structures are
used to classify the video contents.
- Cinematic information is complex and difficult to describe,
especially for non-experts.
- Text descriptors are biased by the ambiguity of natural language.
The Informedia Digital Video
Library (IDVL) Project at Carnegie Mellon
University is an ongoing research project, begun in 1994, that leverages two
decades of related CMU research [Stevens94,
Hauptmann95, Smith95].
Central to the project is the establishment
of a large, online digital video library that goes beyond keyword-only
approaches to indexing video content. Several other techniques are surveyed
below, followed by a concluding outline of how the IDVL Project addresses
this task.
Anyone who has retrieved video from the Internet realizes that because of its
size a video clip can take a long time to move from one location to another,
such as from the digital video library to the user. Likewise, if a library
consists only of 30-minute clips, a user who checks one out may need 30
minutes to determine whether the clip meets their needs. Returning a full
half-hour video when only one minute is relevant is much worse than
returning a complete book when only one chapter is needed. With a book,
electronic or paper, tables of contents, indices, skimming, and variable
reading rates permit users to quickly find the chunks they need. Since the
time to scan a
video cannot be dramatically shorter than the real time of the video, a
digital video library must be efficient at giving users the material they
need. To make the retrieval of bits faster, and to enable faster viewing or
information assimilation, the digital video library will need to support
partitioning video into small-sized clips and alternate representations of
the video.
Just as text books can be decomposed into paragraphs embodying topics of
discourse, the video library can be partitioned into video paragraphs. The
difficulties arise in how this partitioning is to be carried out. Does the
author of the video information supply paragraph tags marking how a larger
video should be divided into smaller clips? This is routinely accomplished
in text through chapters, sections, subheadings, and similar conventions.
Analogous structure is contained in video through scenes, shots, camera
motions, and transitions. Manually describing this structure in a machine
readable form would place a tremendous burden on the video author, and in any
case would not solve the partitioning problem for pre-existing video material
created without paragraph markings.
Perhaps the paragraph boundaries can be inferred from whatever parsing and
indexing is done on the video segment. Some video, such as news broadcasts,
has a well-defined structure which could be parsed into short video
paragraphs for different news stories, sports, and weather. Techniques
monitoring the video signal can break the video into sequences sharing the
same spatial location, and these scenes could be used as paragraphs.
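As a rough illustration of the video-signal approach, adjacent frames can be
compared by their color histograms, with a spike in the difference suggesting
a break between scenes. The following Python sketch assumes frames arrive as
H x W x 3 RGB arrays; the function names and the threshold value are
hypothetical, not the project's actual implementation.

    import numpy as np

    def color_histogram(frame, bins=8):
        """Quantized RGB color histogram, normalized to sum to 1."""
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist.ravel() / hist.sum()

    def scene_boundaries(frames, threshold=0.3):
        """Return frame indices where the histogram difference between
        adjacent frames spikes, suggesting a cut between scenes.
        The threshold is illustrative; a real system would tune it."""
        boundaries = []
        prev = color_histogram(frames[0])
        for i in range(1, len(frames)):
            curr = color_histogram(frames[i])
            if np.abs(curr - prev).sum() > threshold:  # L1 distance
                boundaries.append(i)
            prev = curr
        return boundaries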
Davis cautions, however, that physically segmenting a video library into clips
imposes a fixed segmentation on the video data [Davis94].
The library is
decomposed into a fixed number of clips, i.e., a fixed number of small video
files, which are separated from their original context and may not meet the
future needs of the library user. A more flexible alternative is to logically
segment the library by adding sets of video paragraph markers and indices, but
keeping the video data intact in its original context. A basic tenet of MIT's
Media Streams is that what we need are "representations which make clips, not
representations of clips" [Davis94, p. 121].
In order for a digital video library to be logically segmented as such, the
system must be capable of delivering a subset of a movie (rather than having
that subset stored as its own movie) quickly and efficiently to the user.
Video compression schemes will have to be chosen carefully for the library to
retain the necessary random access within a video to allow it to be logically
segmented.
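As a minimal sketch of logical segmentation (all names here are
hypothetical), a video paragraph can be represented as a marker referencing
time offsets into an intact source video, so that many overlapping clip sets
can coexist without copying any video data:

    from dataclasses import dataclass

    @dataclass
    class ClipMarker:
        """A logical video paragraph: a reference into an intact
        source video rather than a separately stored clip file."""
        video_id: str       # identifies the untouched source video
        start_sec: float    # paragraph start, in seconds
        end_sec: float      # paragraph end, in seconds
        index_terms: tuple  # terms under which the clip is retrievable

    # Re-segmenting the library means replacing marker sets, not
    # re-cutting video files; the original context is preserved.
    story = ClipMarker("newscast-042", 312.0, 428.5,
                       ("digital library", "video indexing"))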
In addition to trying to size the video clips appropriately, the digital video
library can provide the users alternate representations for the video, or
layers of information. Users could then cheaply (in terms of data transfer
time, possible economic cost, and user viewing time) review a given layer of
information before deciding upon whether to incur the cost of richer layers of
information or the complete video clip. For example, a given half hour video
may have a text title, a text abstract, a full text transcript, a
representative single image, and a representative one minute "skim"
video, all in addition to the full video itself. The user could quickly review
the title and perhaps the representative image, decide on whether to view the
abstract and perhaps full transcript, and finally make the decision on whether
to retrieve and view the full video.
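One way to picture these layers is as a single library entry carrying
progressively more expensive representations, reviewed cheapest first; the
field names below are illustrative only:

    from dataclasses import dataclass

    @dataclass
    class VideoEntry:
        """Layered representations of one video, ordered roughly by
        the cost to transfer and review."""
        title: str         # a few words; the cheapest layer
        poster_frame: str  # path to one representative image
        abstract: str      # short text summary
        transcript: str    # full text of the audio track
        skim_path: str     # path to a one-minute "skim" video
        full_path: str     # path to the complete video

A user might inspect only title and poster_frame before deciding whether
skim_path or full_path is worth retrieving.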
These layered approaches to describing video are implemented in a number of
systems [Hauptmann95, Zhang95,
Rao95]. The problems are similar to the indexing
problem: how should the alternate representations or descriptors be generated?
How can they be as complete and accurate as possible, and can tools alleviate
the labor and tediousness involved in their creation?
The utility of the digital video library can be judged on the ability of the
users to get the information they need from the library easily and
efficiently. The two standard measures of performance in information retrieval
are recall and precision. Recall is the proportion of relevant documents that
are actually retrieved, and precision is the proportion of retrieved documents
that are actually relevant. These two measures may be traded off against
each other: returning one document that is a known match to a query
guarantees 100% precision, but fails at recall if a number of other documents
were relevant as well. Returning all of the library's contents for a query
guarantees 100% recall, but fails miserably at precision and filtering the
information. The goal of information retrieval is to maximize both recall and
precision.
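In set terms, with retrieved and relevant as sets of document identifiers,
the two measures can be computed as follows (a minimal sketch):

    def recall(retrieved, relevant):
        """Proportion of relevant documents actually retrieved."""
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved, relevant):
        """Proportion of retrieved documents actually relevant."""
        return len(retrieved & relevant) / len(retrieved)

    # Returning one known match: precision is 1.0, but recall is only
    # 1/len(relevant). Returning the whole library: recall is 1.0,
    # but precision collapses toward zero.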
In many information systems, precision is maximized by narrowing the domain
considerably, extensively indexing the data according to the parameters of the
domain, and allowing queries only via those parameters. This approach is taken
by many CD-ROM data sets, but has the following limitations:
- Data can be added only if it falls within the boundaries of
the domain established by the predefined indices.
- Access to the data is limited by the predefined indices.
Researchers of multimedia information systems have raised concerns over the
difficulties in adequately indexing a video database so that it can be used as
a general purpose library, rather than, say, a narrower domain such as a
network news archive [Davis94,
Zhang95]. For general purpose use, there may
not be enough domain knowledge to apply to the user's query and to the
library index in order to return only a very small subset of the library
matching just the given query. For example, in a soccer-only library,
a query about "goal" can be interpreted to mean a score, and just those
appropriate materials can be retrieved accordingly. In a more open context,
"goal" could mean a score in hockey or a general aim or objective. A larger
set of results will need to be returned to the user, given less domain
knowledge to leverage.
In attempting to create a general purpose digital video library, precision may
have to be sacrificed in order to ensure that the material the user is
interested in will be recalled in the result set. The result set may then
become quite large, so the user may need to filter the set and decide what is
important. Three principal issues with respect to searching for information
are how to let the user
- quickly skim the video objects to locate sections of interest
- adjust the size of the video objects returned
- identify desired video clips when multiple objects are returned
Browsing can help users quickly and intelligently filter a number of results
to the precise information they are seeking. However, browsing video is not as
easy as browsing text. Scanning by jumping a set number of frames may skip
the target information completely. On the other hand, accelerating the
playback of motion video to, for instance, twenty times normal rate presents
the information at an incomprehensible speed.
The difference between video or audio and text or images is that video and
audio have constant-rate outputs that cannot be changed without significantly
and negatively impacting the user's ability to extract information. Video and
audio are constant-rate, continuous-time media. Their temporal nature is
fixed by the requirements of the viewer/listener. Text is a variable-rate
continuous medium. Its temporal nature is manifest in its users, who read and
process text at different rates.
While video and audio data types are constant-rate and continuous-time, the
information contained in them is not. In fact, the granularity of the
information content is such that a one-half hour video may easily have one
hundred semantically separate chunks. The chunks may be linguistic or visual
in nature. They may range from sentences to paragraphs and from images to
scenes. If the important information from a video can be retrieved and the
less important information collapsed, the resulting "skim" video could
be browsed quickly by the user and still give him or her a great deal of
understanding about the contents of the complete video clip. This introduces
the issue of deciding what is important within a video clip and worthy of
preservation in a "skim" video.
Another approach to letting the user browse and filter through search results
more efficiently is to return smaller video clips in the result set. There are
about 150 spoken words per minute of "talking head" video. One hour of
video contains 9,000 words, which is about 15 pages of text. Even if a high
playback rate of 3 to 4 times normal speed were comprehensible, continuous play
of audio and video is a totally unacceptable browsing mechanism. For example,
assume that a desired piece of information is halfway through a one hour video
file. Fast forwarding at 4 times normal speed would take 7.5 minutes to find
it. Returning the optimally sized chunk of digital video is one aspect of the
solution to this problem.
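The arithmetic behind these figures, spelled out (the 600 words-per-page
figure is an assumption implied by the text):

    WORDS_PER_MINUTE = 150                   # typical "talking head" rate
    words_per_hour = WORDS_PER_MINUTE * 60   # 9,000 words
    pages = words_per_hour / 600             # about 15 pages of text
    # Desired item halfway through a one-hour video, fast forward at 4x:
    seek_minutes = 30 / 4                    # 7.5 minutes just to reach it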
If the user issues a query and receives ten half-hour video clips, it could
take them hours to review the results to determine their relevance, especially
given the difficulties in collapsing video playback as mentioned above. If the
result set were instead ten two-minute clips, then the review time by the
user is reduced considerably. In order to return small, relevant clips, the
video contents need to be indexed well and sized appropriately, tasks
discussed earlier in this abstract.
Users often wish to peruse video much as they flip through the pages of a
book. Unfortunately, today's mechanisms for this are inadequate. The results
from a query to a video library may be too large to be effectively handled
with conventional presentations such as a scrollable list. To enable better
filtering and browsing, the features deemed important by the user should be
emphasized and made visible. What are these features, though, and how can they
be made visible, especially if the digital video library is general purpose
rather than specialized to a particular domain? These questions return us
to the problem of identifying the content within the video data and
representing it in forms that facilitate browsing, visualization, and
retrieval. Researchers at Xerox PARC's Intelligent Information Access and
Information Visualization projects note that the information in digital
libraries should not just be retrieved but should allow for rich interaction,
so that users can tailor the information into effective and memorable
renderings appropriate to their needs [Rao95].
If such rich interaction can be achieved, it can be used to browse not only
query result sets but the contents of the full library itself, allowing for
another access mechanism to the information.
The IDVL Project builds on the assumption that a video's contents are conveyed
in both the narrative (speech and language) and the image. Only by the
collaborative interaction of image, speech and natural language understanding
technology can diverse video collections be successfully populated, segmented,
indexed, and searched with satisfactory recall and precision. This approach
compensates for problems of interpretation and search in error-full and
ambiguous data environments.
Using a high-quality speech recognizer, the sound track of each video is
converted to a textual transcript. A language understanding system analyzes
and organizes the transcript, stores it in a full-text information retrieval
system, and generates brief text abstracts for the videos.
Image understanding techniques are used for segmenting video sequences by
automatically locating boundaries of shots, scenes, and conversations.
Integration of these techniques provides for richer indexing and segmentation
of the video library. For example, text displayed in the video can be located
via image
processing and then added to the body of text for natural language processing.
As another example, having both a visual scene change and a
change in the narrative increases the likelihood of a segment boundary.
Figure 1. Techniques underlying the segmentation of video into smaller
paragraphs.
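A hedged sketch of that evidence-combining idea follows; the cue names and
the equal weights are illustrative, not the project's actual parameters:

    def boundary_confidence(visual_scene_change, narrative_change):
        """Confidence that a video paragraph boundary occurs here.
        Agreement between the image and language cues raises the
        score more than either cue alone. Weights are illustrative."""
        score = 0.0
        if visual_scene_change:  # e.g., a color histogram spike
            score += 0.5
        if narrative_change:     # e.g., a topic shift in the transcript
            score += 0.5
        return score

    # Both cues firing (score 1.0) marks a boundary far more reliably
    # than either cue alone (score 0.5).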
Library exploration is based on these same techniques. The user can browse
through parallel presentations of alternate representations of video clips,
from titles to single image "poster frames" to skims. In creating a skim,
image understanding techniques are used to select important, high interest
segments of video. Scene changes (as marked by color histogram spikes
characterizing big differences in adjacent frames), camera motion, object
detection (e.g., the entrance and exit of a human face in the scene), and text
detection (e.g., a title or name of a person being interviewed overlaid on the
video) are used in the heuristics determining which video should be included
in the skim. Using parallel criteria for linguistic information, natural
language processing selects appropriate audio. For example, the term
frequency-inverse document frequency weighting scheme can be used to determine
word relevance, with other heuristics employed to further filter which audio
to use, such as not repeating the same word within a certain time limit.
Selected audio and video are then integrated into a skim of the original
video.
Figure 2. Portion of a skim created from significant audio and video data.
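For the linguistic side, a minimal TF-IDF sketch in Python (assuming each
transcript is already tokenized into a word list, and that the transcript
being skimmed is itself in the corpus); the repetition filter mirrors the
time-limit heuristic mentioned above, with a hypothetical 30-second window:

    import math
    from collections import Counter

    def tfidf_scores(doc_words, corpus):
        """Score each word in one transcript by term frequency times
        inverse document frequency across the library's transcripts."""
        tf = Counter(doc_words)
        n_docs = len(corpus)
        scores = {}
        for word, count in tf.items():
            df = sum(1 for doc in corpus if word in doc)
            scores[word] = (count / len(doc_words)) * math.log(n_docs / df)
        return scores

    def select_audio_words(timed_words, scores, window_sec=30.0, k=20):
        """Pick up to k high-scoring (time, word) pairs for the skim,
        skipping repeats of a word within window_sec seconds."""
        last_used, chosen = {}, []
        for t, word in sorted(timed_words):
            if t - last_used.get(word, float("-inf")) >= window_sec:
                chosen.append((t, word))
                last_used[word] = t
        chosen.sort(key=lambda tw: scores.get(tw[1], 0.0), reverse=True)
        return sorted(chosen[:k])   # return in time order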
For more details on the IDVL interface in a news-on-demand application, consult
the on-line walkthrough found in [Hauptmann95].
This work is partially funded by the National Science Foundation, the
National Aeronautics and Space Administration, and the Advanced Research
Projects Agency. For a complete list of sponsors and partners for the
Informedia Digital Video Library Project, consult the IDVL sponsor list.
- [Davis94]
Davis, M. "Knowledge Representation for Video."
Proceedings of AAAI '94, Seattle, WA, 1994, pp. 120-127.
- [Hauptmann95]
Hauptmann, A.G., Witbrock, M.J., and Christel, M.G.
"News-on-Demand: An Application of Informedia Technology."
D-Lib Magazine, September 1995. Online document available at URL
http://www.cnri.reston.va.us/home/dlib/september95/nod/09hauptmann1.html.
- [Marchionini95]
Marchionini, G. and Maurer, H.
"The Roles of Digital Libraries in Teaching and Learning."
Communications of the ACM, 38, April 1995, pp. 67-75.
- [Rao95]
Rao, R., Pedersen, J., Hearst, M., Mackinlay, J., Card, S., Masinter, L.,
Halvorsen, P.-K., and Robertson, G.
"Rich Interaction in the Digital Library."
Communications of the ACM, 38, April 1995, pp. 29-39.
- [Smith95]
Smith, M.A. and Christel, M.G.
"Automating the Creation of a Digital Video Library."
Proceedings of the ACM Multimedia '95 Conference, San Francisco,
November 1995. Online document available at URL
http://www.ius.cs.cmu.edu/afs/cs.cmu.edu/Web/People/msmith/mm_95_msmith.html.
- [Stevens94]
Stevens, S., Christel, M., and Wactlar, H.
"Informedia: Improving Access to Digital Video."
interactions, 1, October 1994, pp. 67-71.
- [Zhang95]
Zhang, H., Tan, S., Smoliar, S., and Gong, Y.
"Automatic Parsing and Indexing of News Video."
Multimedia Systems, 2, 1995, pp. 256-266.