Exploiting new modalities with CLIP

Author: Aitor Sánchez Abellán

Virtual Room: https://eu.bbcollab.com/guest/bc7f3cf320cc4760be9e7699dee37b76

Date & time: 22/09/2021 – 10:15 h

Session name: Multimodal Applications

Supervisors: Pau Riba, Pau Rodriguez


Typical vision datasets are labor-intensive and costly to create, yet they teach only a narrow set of visual concepts. CLIP addresses this problem with a model that learns efficiently from natural language supervision using image-text pairs. In this work we extend CLIP to operate on new modalities such as audio and video. Our proposal generalizes to n streams of data and uses the VGG-Sound audio-visual dataset to train the new branches. This enables new cross-modal retrieval tasks, such as text-to-video and audio-to-image retrieval. Moreover, we test and analyze the expected improvement in retrieval performance after aligning the new latent space.
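As a rough illustration of what aligning n streams in a shared latent space can look like, the sketch below applies a CLIP-style symmetric contrastive loss to every pair of modalities and then retrieves across modalities by cosine similarity. It is not taken from the thesis: the function names (pairwise_contrastive_loss, multi_stream_loss, retrieve), the temperature of 0.07, and the choice to average the loss over all modality pairs are assumptions made for the example, and the embeddings are assumed to come from already-existing encoders.

```python
import numpy as np
from itertools import combinations


def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def pairwise_contrastive_loss(a, b, temperature=0.07):
    # Symmetric cross-entropy over the similarity matrix of two modalities.
    # Row i of `a` and row i of `b` are assumed to describe the same sample.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def multi_stream_loss(embeddings, temperature=0.07):
    # `embeddings` maps a modality name to a (batch, dim) array.
    # Assumption: the n-stream objective is the mean over all modality pairs.
    pairs = list(combinations(embeddings, 2))
    total = sum(
        pairwise_contrastive_loss(embeddings[m], embeddings[n], temperature)
        for m, n in pairs
    )
    return total / len(pairs)


def retrieve(query, gallery, k=1):
    # Cross-modal retrieval: indices of the k gallery rows most similar
    # (by cosine similarity) to each query row, best match first.
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :k]
```

With embeddings stored as, for example, {"text": ..., "video": ..., "audio": ...}, calling retrieve(embeddings["text"], embeddings["video"]) would correspond to text-to-video retrieval, and swapping the arguments gives the reverse direction. Averaging over pairs keeps the loss scale independent of the number of streams, which is one simple option among several.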


– President: Joost van de Weijer (UAB)
– Secretary: Antonio Agudo (UPF)
– Vocal: Coloma Ballester (UPF)