Scene Text Visual Question Answering and Visual Question Generation: A multilingual approach

Author: Josep Brugués i Pujolràs

Date & time: 22/09/2021 – 9:30 h

Session name: Multimodal Applications

Supervisors: Lluís Gómez i Bigorda, Dimosthenis Karatzas


During the last decade, deep learning architectures for Visual Question Answering (VQA) and Visual Question Generation (VQG) have become trending topics in the Computer Vision community. Such models have great potential for many types of applications, but they do not perform well on more than one language at a time, owing to the scarcity of bilingual and multilingual data and to the use of monolingual word embeddings in training. In this work, we hypothesise that it is possible to obtain bilingual and multilingual VQA and VQG models. To that end, we take established models that use monolingual word embeddings as part of their pipeline and replace those embeddings with FastText and BPEmb multilingual word embeddings that have been aligned to English. We employ the EST-VQA dataset, which is in Chinese and English, and the ST-VQA dataset, which has been translated from English into other languages. On the one hand, we demonstrate that it is possible to obtain bilingual and multilingual VQA models with a minimal loss in performance in languages not used during training. On the other hand, we show that we can generate questions in multiple languages with a single VQG model.


– President: Xavier Giró i Nieto (UPC)
– Secretary: Josep Llados Canet (UAB)
– Vocal: Ernest Valveny Llobet (UAB)