Research and Projects

I am currently conducting PhD research on the summarization of multimodal presentations at CentraleSupélec / Université Paris-Saclay, under the supervision of Frédéric Dufaux and Camille Guinaudeau.

Tools

During my PhD, I developed multiple tools to process and study the summarization of multimodal presentations by leveraging their structure. I packaged some of those tools in a GitHub repository, including methods for slide extraction from presentation recordings, speech recognition, and the construction of a structured multimodal representation that serves as a cost-effective multimodal input for vision-language models (VLMs).
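As an illustration of the slide-extraction idea, one simple approach is threshold-based frame differencing: a new slide is flagged whenever consecutive video frames differ by more than a chosen amount. This is a minimal sketch on toy grayscale frames, not necessarily the method implemented in the repository; the `threshold` value and the flat pixel-list representation are assumptions for the example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def detect_slide_changes(frames, threshold=10.0):
    """Return indices where a new slide likely starts.

    frames: list of flat grayscale pixel lists, all the same length.
    A slide change is flagged when two consecutive frames differ by
    more than `threshold` on average (an assumed, tunable value).
    """
    changes = [0]  # the first frame always starts a slide
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            changes.append(i)
    return changes

# Toy example: three identical frames, then a sudden change.
frames = [[0] * 16, [0] * 16, [0] * 16, [200] * 16, [200] * 16]
print(detect_slide_changes(frames))  # → [0, 3]
```

In practice one would decode frames from the video (e.g. with OpenCV) and may add debouncing so gradual animations are not flagged as new slides.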

Publications

  • Théo Gigant, Camille Guinaudeau, Frédéric Dufaux. Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure. Preprint, 2025, [PDF]
  • Théo Gigant, Camille Guinaudeau, Marc Décombas, Frédéric Dufaux. Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, 2024, [PDF] [blog] [code]
  • Théo Gigant, Frédéric Dufaux, Camille Guinaudeau, Marc Décombas. TIB: A Dataset for Abstractive Summarization of Long Multimodal Videoconference Records. In Proceedings of the 20th International Conference on Content-based Multimedia Indexing, Orléans, 2023, [PDF] [blog]

In 2022 I participated in the preprocessing of several datasets as part of the BigScience Research Workshop, which resulted in two publications:

  • Teven Le Scao, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. 2022, [PDF]
  • Jason Fries, et al. BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022, [PDF]

Other Projects

  • My blog: where I share insights from my own research, along with reflections and opinions on Natural Language Processing, Computer Vision, and Artificial Intelligence.
  • Whisper Medium Romanian (December 2022): State-of-the-art speech recognition model for Romanian, trained for the Whisper Fine-Tuning Event. Ranked first on the leaderboard for Romanian ASR.
  • Old Book Illustrations Dataset (July 2022): Collection of 4172 public domain illustrations scanned from old books, collected from the Old Book Illustrations website with their agreement.
  • WikiArt diffusion mini (May 2022): As part of HuggingFace’s HugGAN challenge, I worked on a distilled latent diffusion model for text-conditioned image generation, trained on the WikiArt dataset of visual artworks.
  • Romanian Wav2Vec2 (February 2022): Speech recognition model for Romanian, trained for the Robust Speech Recognition Challenge. Ranked first on the leaderboard for Romanian ASR.
  • T5-VAE (July 2021): Project for the Flax/JAX community week, combining a T5 transformer model with a variational autoencoder to learn smooth latent spaces for text.