Electronic Proceedings of the
ACM Workshop on Effective Abstractions in Multimedia
November 4, 1995
San Francisco, California
Addressing the Contents of Video in a Digital Library
Michael G. Christel
Software Engineering Institute
Carnegie Mellon University
Pittsburgh, PA 15213-3890
412-268-7799
mac@sei.cmu.edu
http://www.cs.cmu.edu/afs/andrew/usr/mc6u/www/
Abstract
A digital video library must be efficient at giving users precisely the
material they need, because video has unique characteristics compared to
text. To make the retrieval of bits faster, and to enable faster viewing and
information assimilation, the digital video library will need to support
partitioning video into small clips and alternate representations of the
video. For a general purpose digital video library, precision may have to be
sacrificed in order to ensure that the material the user is interested in
will be recalled in the result set for a query. The result set may then
become quite large, so the user may need to filter the set and decide what is
important. This can be accomplished by collapsing the playback rate of video
objects in the result set as well as by adjusting the size of the objects in
the result set. The Informedia Digital Video Library Project at Carnegie
Mellon University deals with these issues and is introduced here, with
pointers to additional information.
A library cannot be very effective if it is merely a collection of information
without some understanding of what is contained in that collection. Without
that understanding it could take hundreds of hours of viewing to determine if
an item of interest is in a 1000-hour video library. Obviously, such a library
would not be used very often. Marchionini and Maurer reflect on information
accessible via the Internet [Marchionini95, p. 72]:
It has often been said that the Internet is starting to provide the
largest library humankind has ever had. As true as this may be, the
Internet is also the messiest library that ever has existed.
Information is found best on the Internet when the providers augment the
information with rich keywords and descriptors, provide links to related
information, and allow the contents of their pages to be searched and indexed.
There is a long history of sophisticated parsing and indexing for text
processing in various structured forms, from ASCII to PostScript to SGML and
HTML. However, how does one represent video content to support content-based
retrieval and manipulation?
An hour-long motion video segment clearly contains some information suitable
for indexing, so that a user can find an item of interest within it. The
problem is not the lack of information in video, but rather the
inaccessibility of that information to our primarily text-based information
retrieval mechanisms today. In fact, the video likely contains an
overabundance of information, conveyed in both the video signal (camera
motion, scene changes, colors) and the audio signal (noises, silence,
dialogue). A common practice today is to log or tag the video with keywords
and other forms of structured text to identify its contents. Such text
descriptors have the following limitations:
- Manual processes are tedious and time consuming.
- Manual processes are seriously incomplete. Even if full transcripts
of the audio track are entered, other information about the video
will almost surely be left out, such as the identity of persons and
objects in each scene.
- Transcripts are inaccurate, with mistypings and incorrect
classifications often introduced.
- Text descriptors are biased by whatever predetermined structures are
used to classify the video contents.
- Cinematic information is complex and difficult to describe,
especially for non-experts.
- Text descriptors are biased by the ambiguity of natural language.
The Informedia Digital Video
Library (IDVL) Project at Carnegie Mellon
University is an ongoing research project, begun in 1994, that leverages two
decades of related CMU research [Stevens94,
Hauptmann95, Smith95].
Central to the project is the establishment
of a large, online digital video library that goes beyond keyword-only
approaches to indexing video content. Several other techniques are surveyed
below, followed by a concluding outline of how the IDVL Project addresses
this task.
Anyone who has retrieved video from the Internet realizes that because of its
size a video clip can take a long time to move from one location to another,
such as from the digital video library to the user. Likewise, if a library
consists only of 30-minute clips, a user who checks one out may need 30
minutes to determine whether the clip meets their needs. Returning a full
half-hour video when only one minute is relevant is much worse than
returning a complete book when only one chapter is needed. With a book,
electronic or paper, tables of contents, indices, skimming, and variable
reading rates permit users to quickly find the chunks they need. Since the
time to scan a
video cannot be dramatically shorter than the real time of the video, a
digital video library must be efficient at giving users the material they
need. To make the retrieval of bits faster, and to enable faster viewing or
information assimilation, the digital video library will need to support
partitioning video into small-sized clips and alternate representations of
the video.
Just as text books can be decomposed into paragraphs embodying topics of
discourse, the video library can be partitioned into video paragraphs. The
difficulties arise in how this partitioning is to be carried out. Does the
author of the video information supply paragraph tags marking how a larger
video should be divided into smaller clips? This is routinely accomplished
in text through chapters, sections, subheadings, and similar conventions.
Analogous structure is contained in video through scenes, shots, camera
motions, and transitions. Manually describing this structure in a machine
readable form would place a tremendous burden on the video author, and in any
case would not solve the partitioning problem for pre-existing video material
created without paragraph markings.
Perhaps the paragraph boundaries can be inferred from whatever parsing and
indexing is done on the video segment. Some video, such as news broadcasts,
has a well-defined structure which could be parsed into short video
paragraphs for different news stories, sports, and weather. Techniques
monitoring the video signal can break the video into sequences sharing the
same spatial location, and these scenes could be used as paragraphs.
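As a rough illustration of the video-signal approach, adjacent frames can be
compared by their color histograms, with a spike in the difference suggesting
a break between scenes. The following Python sketch assumes frames arrive as
H x W x 3 RGB arrays; the function names and the threshold value are
hypothetical, not the project's actual implementation.

    import numpy as np

    def color_histogram(frame, bins=8):
        """Quantized RGB color histogram, normalized to sum to 1."""
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist.ravel() / hist.sum()

    def scene_boundaries(frames, threshold=0.3):
        """Return frame indices where the histogram difference between
        adjacent frames spikes, suggesting a cut between scenes.
        The threshold is illustrative; a real system would tune it."""
        boundaries = []
        prev = color_histogram(frames[0])
        for i in range(1, len(frames)):
            curr = color_histogram(frames[i])
            if np.abs(curr - prev).sum() > threshold:  # L1 distance
                boundaries.append(i)
            prev = curr
        return boundaries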
Davis cautions, however, that physically segmenting a video library into clips
imposes a fixed segmentation on the video data [Davis94].
The library is
decomposed into a fixed number of clips, i.e., a fixed number of small video
files, which are separated from their original context and may not meet the
future needs of the library user. A more flexible alternative is to logically
segment the library by adding sets of video paragraph markers and indices, but
keeping the video data intact in its original context. A basic tenet of MIT's
Media Streams is that what we need are "representations which make clips, not
representations of clips" [Davis94, p. 121].
In order for a digital video library to be logically segmented as such, the
system must be capable of delivering a subset of a movie (rather than having
that subset stored as its own movie) quickly and efficiently to the user.
Video compression schemes will have to be chosen carefully for the library to
retain the necessary random access within a video to allow it to be logically
segmented.
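As a minimal sketch of logical segmentation (all names here are
hypothetical), a video paragraph can be represented as a marker referencing
time offsets into an intact source video, so that many overlapping clip sets
can coexist without copying any video data:

    from dataclasses import dataclass

    @dataclass
    class ClipMarker:
        """A logical video paragraph: a reference into an intact
        source video rather than a separately stored clip file."""
        video_id: str       # identifies the untouched source video
        start_sec: float    # paragraph start, in seconds
        end_sec: float      # paragraph end, in seconds
        index_terms: tuple  # terms under which the clip is retrievable

    # Re-segmenting the library means replacing marker sets, not
    # re-cutting video files; the original context is preserved.
    story = ClipMarker("newscast-042", 312.0, 428.5,
                       ("digital library", "video indexing"))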
In addition to trying to size the video clips appropriately, the digital video
library can provide the users alternate representations for the video, or
layers of information. Users could then cheaply (in terms of data transfer
time, possible economic cost, and user viewing time) review a given layer of
information before deciding upon whether to incur the cost of richer layers of
information or the complete video clip. For example, a given half hour video
may have a text title, a text abstract, a full text transcript, a
representative single image, and a representative one minute "skim"
video, all in addition to the full video itself. The user could quickly review
the title and perhaps the representative image, decide on whether to view the
abstract and perhaps full transcript, and finally make the decision on whether
to retrieve and view the full video.
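One way to picture these layers is as a single library entry carrying
progressively more expensive representations, reviewed cheapest first; the
field names below are illustrative only:

    from dataclasses import dataclass

    @dataclass
    class VideoEntry:
        """Layered representations of one video, ordered roughly by
        the cost to transfer and review."""
        title: str         # a few words; the cheapest layer
        poster_frame: str  # path to one representative image
        abstract: str      # short text summary
        transcript: str    # full text of the audio track
        skim_path: str     # path to a one-minute "skim" video
        full_path: str     # path to the complete video

A user might inspect only title and poster_frame before deciding whether
skim_path or full_path is worth retrieving.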
These layered approaches to describing video are implemented in a number of
systems [Hauptmann95, Zhang95,
Rao95]. The problems are similar to the indexing
problem: how should the alternate representations or descriptors be generated?
How can they be as complete and accurate as possible, and can tools alleviate
the labor and tediousness involved in their creation?
The utility of the digital video library can be judged on the ability of the
users to get the information they need from the library easily and
efficiently. The two standard measures of performance in information retrieval
are recall and precision. Recall is the proportion of relevant documents that
are actually retrieved, and precision is the proportion of retrieved documents
that are actually relevant. These two measures may be traded off against
each other: returning one document that is a known match to a query
guarantees 100% precision, but fails at recall if a number of other documents
were relevant as well. Returning all of the library's contents for a query
guarantees 100% recall, but fails miserably at precision and filtering the
information. The goal of information retrieval is to maximize both recall and
precision.
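In set terms, with retrieved and relevant as sets of document identifiers,
the two measures can be computed as follows (a minimal sketch):

    def recall(retrieved, relevant):
        """Proportion of relevant documents actually retrieved."""
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved, relevant):
        """Proportion of retrieved documents actually relevant."""
        return len(retrieved & relevant) / len(retrieved)

    # Returning one known match: precision is 1.0, but recall is only
    # 1/len(relevant). Returning the whole library: recall is 1.0,
    # but precision collapses toward zero.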
In many information systems, precision is maximized by narrowing the domain
considerably, extensively indexing the data according to the parameters of the
domain, and allowing queries only via those parameters. This approach is taken
by many CD-ROM data sets, but has the following limitations:
- Data can be added only if it falls within the boundaries of
the domain established by the predefined indices.
- Access to the data is limited by the predefined indices.
Researchers of multimedia information systems have raised concerns over the
difficulties in adequately indexing a video database so that it can be used as
a general purpose library, rather than, say, a narrower domain such as a
network news archive [Davis94,
Zhang95]. For general purpose use, there may
not be enough domain knowledge to apply to the user's query and to the
library index in order to return only a very small subset of the library
matching just the given query. For example, in a soccer-only library,
a query about "goal" can be interpreted to mean a score, and just those
appropriate materials can be retrieved accordingly. In a more open context,
"goal" could mean a score in hockey or a general aim or objective. A larger
set of results will need to be returned to the user, given less domain
knowledge to leverage.
In attempting to create a general purpose digital video library, precision may
have to be sacrificed in order to ensure that the material the user is
interested in will be recalled in the result set. The result set may then
become quite large, so the user may need to filter the set and decide what is
important. Three principal issues with respect to searching for information
are how to let the user
- quickly skim the video objects to locate sections of interest
- adjust the size of the video objects returned
- identify desired video clips when multiple objects are returned
Browsing can help users quickly and intelligently filter a number of results
to the precise information they are seeking. However, browsing video is not as
easy as browsing text. Scanning by jumping a set number of frames may skip
the target information completely. On the other hand, accelerating the
playback of motion video to, for instance, twenty times normal rate presents
the information at an incomprehensible speed.
The difference between video or audio and text or images is that video and
audio have constant-rate outputs that cannot be changed without significantly
and negatively impacting the user's ability to extract information. Video and
audio are constant-rate, continuous-time media. Their temporal nature is
fixed by the requirements of the viewer/listener. Text is a variable-rate
continuous medium. Its temporal nature is manifest in its users, who read and
process text at different rates.
While video and audio data types are constant-rate and continuous-time, the
information contained in them is not. In fact, the granularity of the
information content is such that a one-half hour video may easily have one
hundred semantically separate chunks. The chunks may be linguistic or visual
in nature. They may range from sentences to paragraphs and from images to
scenes. If the important information from a video can be retrieved and the
less important information collapsed, the resulting "skim" video could
be browsed quickly by the user and still give him or her a great deal of
understanding about the contents of the complete video clip. This introduces
the issue of deciding what is important within a video clip and worthy of
preservation in a "skim" video.
Another approach to letting the user browse and filter through search results
more efficiently is to return smaller video clips in the result set. There are
about 150 spoken words per minute of "talking head" video. One hour of
video contains 9,000 words, which is about 15 pages of text. Even if a high
playback rate of 3 to 4 times normal speed were comprehensible, continuous play
of audio and video is a totally unacceptable browsing mechanism. For example,
assume that a desired piece of information is halfway through a one hour video
file. Fast forwarding at 4 times normal speed would take 7.5 minutes to find
it. Returning the optimally sized chunk of digital video is one aspect of the
solution to this problem.
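The arithmetic behind these figures, spelled out (the 600 words-per-page
figure is an assumption implied by the text):

    WORDS_PER_MINUTE = 150                   # typical "talking head" rate
    words_per_hour = WORDS_PER_MINUTE * 60   # 9,000 words
    pages = words_per_hour / 600             # about 15 pages of text
    # Desired item halfway through a one-hour video, fast forward at 4x:
    seek_minutes = 30 / 4                    # 7.5 minutes just to reach it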
If the user issues a query and receives ten half-hour video clips, it could
take them hours to review the results to determine their relevance, especially
given the difficulties in collapsing video playback as mentioned above. If the
result set were instead ten two-minute clips, then the review time by the
user is reduced considerably. In order to return small, relevant clips, the
video contents need to be indexed well and sized appropriately, tasks
discussed earlier in this abstract.
Users often wish to peruse video much as they flip through the pages of a
book. Unfortunately, today's mechanisms for this are inadequate. The results
from a query to a video library may be too large to be effectively handled
with conventional presentations such as a scrollable list. To enable better
filtering and browsing, the features deemed important by the user should be
emphasized and made visible. What are these features, though, and how can they
be made visible, especially if the digital video library is general purpose
rather than specialized to a particular domain? These questions return us
to the problem of identifying the content within the video data and
representing it in forms that facilitate browsing, visualization, and
retrieval. Researchers at Xerox PARC's Intelligent Information Access and
Information Visualization projects note that the information in digital
libraries should not just be retrieved but should allow for rich interaction,
so that users can tailor the information into effective and memorable
renderings appropriate to their needs [Rao95].
If such rich interaction can be achieved, it can be used to browse not only
query result sets but the contents of the full library itself, allowing for
another access mechanism to the information.
The IDVL Project builds on the assumption that a video's contents are conveyed
in both the narrative (speech and language) and the image. Only by the
collaborative interaction of image, speech and natural language understanding
technology can diverse video collections be successfully populated, segmented,
indexed, and searched with satisfactory recall and precision. This approach
compensates for problems of interpretation and search in error-full and
ambiguous data environments.
Using a high-quality speech recognizer, the sound track of each video is
converted to a textual transcript. A language understanding system analyzes
and organizes the transcript, stores it in a full-text information retrieval
system, and generates brief text abstracts for the videos.
Image understanding techniques are used for segmenting video sequences by
automatically locating boundaries of shots, scenes, and conversations.
Integration of these techniques provides for richer indexing and segmentation
of the video library. For example, text displayed in the video can be located
via image
processing and then added to the body of text for natural language processing.
As another example, having both a visual scene change and a
change in the narrative increases the likelihood of a segment boundary.
Figure 1. Techniques underlying the segmentation of video into smaller
paragraphs.
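A hedged sketch of that evidence-combining idea follows; the cue names and
the equal weights are illustrative, not the project's actual parameters:

    def boundary_confidence(visual_scene_change, narrative_change):
        """Confidence that a video paragraph boundary occurs here.
        Agreement between the image and language cues raises the
        score more than either cue alone. Weights are illustrative."""
        score = 0.0
        if visual_scene_change:  # e.g., a color histogram spike
            score += 0.5
        if narrative_change:     # e.g., a topic shift in the transcript
            score += 0.5
        return score

    # Both cues firing (score 1.0) marks a boundary far more reliably
    # than either cue alone (score 0.5).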
Library exploration is based on these same techniques. The user can browse
through parallel presentations of alternate representations of video clips,
from titles to single image "poster frames" to skims. In creating a skim,
image understanding techniques are used to select important, high interest
segments of video. Scene changes (as marked by color histogram spikes
characterizing big differences in adjacent frames), camera motion, object
detection (e.g., the entrance and exit of a human face in the scene), and text
detection (e.g., a title or name of a person being interviewed overlaid on the
video) are used in the heuristics determining which video should be included
in the skim. Using parallel criteria for linguistic information, natural
language processing selects appropriate audio. For example, the term
frequency-inverse document frequency weighting scheme can be used to determine
word relevance, with other heuristics employed to further filter which audio
to use, such as not repeating the same word within a certain time limit.
Selected audio and video are then integrated into a skim of the original
video.
Figure 2. Portion of a skim created from significant audio and video data.
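For the linguistic side, a minimal TF-IDF sketch in Python (assuming each
transcript is already tokenized into a word list, and that the transcript
being skimmed is itself in the corpus); the repetition filter mirrors the
time-limit heuristic mentioned above, with a hypothetical 30-second window:

    import math
    from collections import Counter

    def tfidf_scores(doc_words, corpus):
        """Score each word in one transcript by term frequency times
        inverse document frequency across the library's transcripts."""
        tf = Counter(doc_words)
        n_docs = len(corpus)
        scores = {}
        for word, count in tf.items():
            df = sum(1 for doc in corpus if word in doc)
            scores[word] = (count / len(doc_words)) * math.log(n_docs / df)
        return scores

    def select_audio_words(timed_words, scores, window_sec=30.0, k=20):
        """Pick up to k high-scoring (time, word) pairs for the skim,
        skipping repeats of a word within window_sec seconds."""
        last_used, chosen = {}, []
        for t, word in sorted(timed_words):
            if t - last_used.get(word, float("-inf")) >= window_sec:
                chosen.append((t, word))
                last_used[word] = t
        chosen.sort(key=lambda tw: scores.get(tw[1], 0.0), reverse=True)
        return sorted(chosen[:k])   # return in time order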
For more details on the IDVL interface in a news-on-demand application, consult
the on-line walkthrough found in [Hauptmann95].
This work is partially funded by the National Science Foundation, the
National Aeronautics and Space Administration, and the Advanced Research
Projects Agency. For a complete list of sponsors and partners for the
Informedia Digital Video Library Project, consult the IDVL sponsor list.
- [Davis94]
Davis, M. "Knowledge Representation for Video."
Proceedings of AAAI '94, Seattle, WA, 1994, pp. 120-127.
- [Hauptmann95]
Hauptmann, A.G., Witbrock, M.J., and Christel, M.G.
"News-on-Demand: An Application of Informedia Technology."
D-Lib Magazine, September 1995. Online document available at URL
http://www.cnri.reston.va.us/home/dlib/september95/nod/09hauptmann1.html.
- [Marchionini95]
Marchionini, G. and Maurer, H.
"The Roles of Digital Libraries in Teaching and Learning."
Communications of the ACM, 38, April 1995, pp. 67-75.
- [Rao95]
Rao, R., Pedersen, J., Hearst, M., Mackinlay, J., Card, S., Masinter, L.,
Halvorsen, P.-K., and Robertson, G.
"Rich Interaction in the Digital Library."
Communications of the ACM, 38, April 1995, pp. 29-39.
- [Smith95]
Smith, M.A. and Christel, M.G.
"Automating the Creation of a Digital Video Library."
Proceedings of the ACM Multimedia '95 Conference, San Francisco,
November 1995. Online document available at URL
http://www.ius.cs.cmu.edu/afs/cs.cmu.edu/Web/People/msmith/mm_95_msmith.html.
- [Stevens94]
Stevens, S., Christel, M., and Wactlar, H.
"Informedia: Improving Access to Digital Video."
interactions, 1, October 1994, pp. 67-71.
- [Zhang95]
Zhang, H., Tan, S., Smoliar, S., and Gong, Y.
"Automatic Parsing and Indexing of News Video."
Multimedia Systems, 2, 1995, pp. 256-266.