Summarization of presentation records: Introducing the TIB dataset

Illustration: The Wheel, Mad Men season 1, episode 13

“TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records” was published as a conference paper at CBMI 2023.

Introduction

As of 2023, Large Language Models are very effective at many Natural Language Processing tasks, such as document summarization, and thanks to efficient transformer architectures, they even hold up to very long inputs. On the other hand, Multimodal Language-Vision models are effective in a handful of applications such as image manipulation, visual description or text-based video retrieval. However, most current Language-Vision models are not designed to handle very long inputs.

Most summarization datasets are composed of either mono-modal documents or short multimodal documents. In order to develop models designed for understanding and summarizing real-world videoconference records, which are typically around 1 hour long, we propose a dataset of 9,103 videoconference records extracted from the German National Library of Science and Technology (TIB) archive, along with their abstracts. Additionally, we process the content with automatic tools in order to provide transcripts and key frames. Finally, we present experiments for abstractive summarization, to serve as baselines for future research work on multimodal approaches.

The dataset is openly available on the 🤗 Hugging Face dataset hub under the repository gigant/tib, and is described in further detail in the conference paper “TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records”.
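It can be loaded directly with the 🤗 datasets library (a minimal sketch, assuming the datasets package is installed):

```python
from datasets import load_dataset

# Load the TIB dataset from the Hugging Face hub
dataset = load_dataset("gigant/tib")
print(dataset)  # shows the available splits and their features
```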

Dataset Composition

Each record consists of the following attributes:

  • doi: digital object identifier (DOI) of the record or the associated paper
  • title: title of the presentation
  • url: URL of the record in the TIB archive
  • video_url: URL of the video file
  • license: license of the record
  • subject: academic field (e.g. Computer Science, Mathematics, ...)
  • genre: type of presentation (e.g. Lecture, Conference, ...)
  • release_year: year the record was released
  • author: name of the author
  • contributors: name of the contributors
  • abstract: the abstract of the presentation, which serves as the target summary
  • transcript: the automatically extracted transcript
  • transcript_segments: the automatically extracted transcript with time codes, output of the speech recognition system
  • keyframes: the automatically extracted key frames time codes

doi, title, url, video_url, license, subject, genre, release_year, author, contributors and abstract are provided as found in the TIB archive. The length, style, quality and content of the abstracts can differ from video to video, as each was likely provided by its author. For instance, some abstracts are very short, title-like summaries; others introduce the conference, the lecture or the speaker; and others give longer descriptions of the content. We provide examples of transcripts and summaries in the appendix of the paper.
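As an illustration, a single record and its attributes can be inspected as follows (a minimal sketch; the "train" split name is an assumption, check the output of load_dataset for the actual splits):

```python
from datasets import load_dataset

dataset = load_dataset("gigant/tib")

# The "train" split name is an assumption; print(dataset) to see the actual splits.
record = dataset["train"][0]

print(record["title"])
print(record["subject"], record["genre"], record["release_year"])
print(record["abstract"][:300])             # target summary
print(len(record["transcript"].split()))    # rough transcript length, in words
```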

Modalities

In oral presentations, such as scientific conferences, the slide show is often used to carry information that illustrates the main topics, helps understand the outline and is not always redundant with the speech. Arguably, an automatic summarization method could use the visual modality as additional input features and its structure as a prior to improve the summarization of the transcript content. The main motivation behind the creation of this dataset is to make data available for testing these assumptions.

In the video files there are two modalities: a visual stream and an audio stream.

The visual stream varies from record to record, but more often than not it consists of a slide show in one form or another.

The audio stream mostly consists of speech, with occasional music or noise.

Processing the videos

The video is processed in order to derive other modalities from the audio and visual streams, such as the transcript of the speech, and the key frames that approximate the slide show.

The transcription was done automatically using the openai/whisper-small multilingual speech recognition model.
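A transcription step of this kind can be sketched with the transformers automatic-speech-recognition pipeline (the file name is a placeholder, and this is not necessarily the exact setup used to build the dataset):

```python
from transformers import pipeline

# Same openai/whisper-small checkpoint, run through the transformers ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,        # chunked long-form decoding
    return_timestamps=True,   # segment-level time codes, as in transcript_segments
)

result = asr("talk_audio.wav")  # audio track extracted from the video beforehand
print(result["text"])           # full transcript
print(result["chunks"][:3])     # first timestamped segments
```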

In order to extract the slide show from the video stream, we opted for a heuristic based on perceptual hashing.

A perceptual hash is a binary code designed so that images with similar features get similar codes. The Hamming distance between the hashes of consecutive frames can therefore be used to detect a change of slide in the video stream.

This method is very fast to compute on massive amounts of data (9,103 videos of 40 minutes on average) and is a good heuristic for extracting the slide show from the video stream.
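A minimal sketch of this heuristic, using OpenCV to read frames and the imagehash library for perceptual hashing (the video path, sampling rate and Hamming-distance threshold are illustrative assumptions, not the exact values used to build the dataset):

```python
import cv2
import imagehash
from PIL import Image

# Hash one frame per second and keep the time code whenever the hash
# changes enough to suggest a new slide.
VIDEO_PATH = "talk.mp4"
SAMPLE_EVERY_S = 1.0
THRESHOLD = 10  # Hamming distance that counts as a slide change

cap = cv2.VideoCapture(VIDEO_PATH)
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
step = max(1, int(fps * SAMPLE_EVERY_S))

keyframe_times = []
prev_hash = None
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % step == 0:
        # OpenCV gives BGR arrays; convert to an RGB PIL image for hashing
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frame_hash = imagehash.phash(image)
        # Hamming distance between consecutive hashes signals a slide change
        if prev_hash is None or frame_hash - prev_hash > THRESHOLD:
            keyframe_times.append(frame_idx / fps)  # key frame time code, in seconds
        prev_hash = frame_hash
    frame_idx += 1
cap.release()

print(keyframe_times)
```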

Filtering

We filtered out entries with missing or too-short abstracts (fewer than 30 characters), as well as records whose abstract or transcript is not in English.

In order to get rid of abstracts that were written for a whole set of records, such as a conference, rather than specifically for a single presentation, we filtered out documents whose abstract is identical to that of another video.
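The filtering rules can be sketched as follows (langdetect is used here as a stand-in language identifier; this is an illustration, not the exact implementation used for the dataset):

```python
from collections import Counter

from langdetect import detect  # language identification; an assumed stand-in tool

def filter_records(records):
    """Sketch of the filtering rules described above (not the exact implementation)."""
    # Abstracts shared by several records (e.g. a conference-level description)
    # are dropped entirely, so count them in a first pass.
    abstract_counts = Counter((r["abstract"] or "").strip() for r in records)

    kept = []
    for r in records:
        abstract = (r["abstract"] or "").strip()
        if len(abstract) < 30:              # missing or too-short abstract
            continue
        if abstract_counts[abstract] > 1:   # abstract shared with another record
            continue
        try:
            if detect(abstract) != "en" or detect(r["transcript"]) != "en":
                continue                    # non-English abstract or transcript
        except Exception:
            continue                        # language detection failed
        kept.append(r)
    return kept
```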

Statistics and metadata

Some statistics can be used to describe and compare abstractive summarization datasets, such as the average number of tokens in the source document and in the target summary, and statistics computed using the fragments of the target summary that are extracted from the source document, such as coverage and density.

The coverage can be interpreted as the vocabulary overlap between the summary and the document, and the density as the average length of the extractive fragments. These statistics are lower for summaries that are less extractive and that consequently use more novel vocabulary or reformulate ideas differently from the source document. The compression ratio refers to the average ratio between the number of tokens in the source document and in the target summary.
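For reference, these statistics are commonly computed from the set of extractive fragments F(A, S) shared between a source document A and its summary S; one usual formulation (the paper may phrase the exact definitions slightly differently) is:

```latex
\mathrm{Coverage}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|,
\qquad
\mathrm{Density}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|^{2},
\qquad
\mathrm{Compression}(A, S) = \frac{|A|}{|S|}
```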

The TIB dataset consists of documents of comparable length to other long-document abstractive summarization datasets such as ArXiv, PubMed or BookSum Chapter, and longer than How2 300h, the other multimodal abstractive summarization dataset.

We can also look at the metadata in order to see what type of documents compose the dataset.

The average record is a conference talk about Computer Science released between 2013 and 2022; it lasts 37.4 minutes and contains 45.9 slides.

Evaluation

To the best of our knowledge, no end-to-end multimodal model can perform abstractive summarization on documents of this size. For this reason, we tested various heuristics and models for abstractive summarization using the textual modality only. Lead-3 is a baseline method that outputs the first 3 sentences of the input as a summary, and the Extractive Oracle (implemented in the extoracle_summarization library) is the upper bound of extractive methods as measured by the ROUGE-2 score. PEGASUS-X and Longformer Encoder-Decoder are “efficient transformer” encoder-decoder models designed for long-document abstractive summarization.
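For instance, the Lead-3 baseline is straightforward to reproduce (a minimal sketch; the naive sentence split is an assumption, not necessarily the tokenizer used for the reported scores):

```python
import re

def lead_3(document: str) -> str:
    """Return the first three sentences of the document as the summary."""
    # Naive sentence split on ., ! and ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:3])

# e.g. lead_3(record["transcript"]) with a record loaded as shown earlier
print(lead_3("First sentence. Second one! Third sentence? Fourth is dropped."))
```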

Future work

We hope this dataset will contribute to the study of automatic videoconference summarization by giving access to multimodal data for abstractive summarization of long documents.

The dataset is available on the Hugging Face dataset hub: gigant/tib.

The methods we have benchmarked are text-only baselines. However, slides usually provide information that is not always redundant with the speech, as well as relational inductive biases that help to understand the structure of a presentation. Future work on this dataset includes studying summarization models that take the whole multimodal input to predict the abstract. We plan to do this using multimodal graphs, for reasons discussed in a previous blog post.

If you have any questions or insights about this dataset, feel free to contact me!