Author: Aitor Sánchez Abellán
Date & time: 22/09/2021 – 10:15 h
Session name: Multimodal Applications
Supervisors: Pau Riba, Pau Rodriguez
Typical vision datasets are labor-intensive and costly to create, while teaching only a narrow set of visual concepts. CLIP addresses this problem with a model that learns efficiently from natural language supervision using image-text pairs. In this work we extend CLIP to operate on new modalities such as audio and video. Our proposal generalizes to n streams of data and uses the VGG-Sound audio-visual dataset to train the new branches. Cross-modal retrieval tasks such as text-to-video and audio-to-image are incorporated. Moreover, the expected improvement in retrieval performance after aligning the new latent spaces is tested and analyzed.
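The n-stream extension described above can be illustrated with a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) objective averaged over all modality pairs. This is an assumption about the training objective, not the thesis's exact implementation; the function names (`clip_pair_loss`, `multi_stream_loss`) and the temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere, as CLIP does before
    # computing cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_pair_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE loss between two batches of aligned embeddings
    # (row i of `a` corresponds to row i of `b`).
    logits = (l2_normalize(a) @ l2_normalize(b).T) / temperature
    labels = np.arange(len(a))

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the two retrieval directions (a -> b and b -> a).
    return 0.5 * (xent(logits) + xent(logits.T))

def multi_stream_loss(streams, temperature=0.07):
    # Generalization to n modalities: average the pairwise CLIP loss
    # over every unordered pair of streams (image, text, audio, video, ...).
    n = len(streams)
    losses = [clip_pair_loss(streams[i], streams[j], temperature)
              for i in range(n) for j in range(i + 1, n)]
    return sum(losses) / len(losses)
```

With three streams (e.g. image, text, and audio embeddings of the same batch), `multi_stream_loss` pulls corresponding items together across all pairs of modalities in a single shared latent space, which is what enables retrieval directions such as audio-to-image.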
– President: Joost van de Weijer (UAB)
– Secretary: Antonio Agudo (UPF)
– Vocal: Coloma Ballester (UPF)