Multi-modal Image Captioning in Wikipedia

Author: Khanh Nguyen Van

Virtual Room:

Date & time: 22/09/2021 – 11:00 h

Session name: Multimodal Applications

Supervisors: Ali Furkan Biten, Andres Mafla Delgado, Dimosthenis Karatzas


Typical image captioning systems operate solely on visual object features and their relationships, while contextual information and prior world knowledge are ignored entirely. Incorporating such information is crucial, as it gives the model a rich source of additional signal to exploit in order to produce higher-quality descriptions. With this in mind, we aim to build a captioning model that interprets the scene while taking contextual information, such as Named Entities, into account. More specifically, in this work we focus on the problem of generating captions for images contained in articles. We propose a novel model that extends Show, Attend and Tell, the classic sequence-to-sequence attention model for image captioning, in two aspects: (1) enriching the model’s input with data from different modalities and analyzing their semantic correlation to improve its generative capability, and (2) employing a hard-attention mechanism that adaptively copies words from the source text via a pointer network, which allows the network to handle out-of-vocabulary words. Furthermore, we introduce Wiki Dataset, a dataset for the multi-modal image captioning task, and report experimental results obtained by applying our model to it.
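
To make the copy mechanism in aspect (2) more concrete, the following is a minimal sketch of a single decoding step that either generates a word from a fixed vocabulary or copies one from the source text. It is not the model presented in this work: the function name `decode_step`, the `copy_gate` value, the 0.5 threshold, and all scores are hypothetical, and in a trained network they would be produced by learned layers rather than passed in by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(vocab, vocab_scores, source_tokens, copy_scores, copy_gate):
    """Choose the next caption word by generating or copying.

    copy_gate is an assumed probability in [0, 1] that the decoder should
    copy; thresholding it at 0.5 stands in for a hard (discrete) decision.
    """
    if copy_gate > 0.5:
        # Pointer step: attend over source positions and copy the best one.
        attn = softmax(copy_scores)
        idx = max(range(len(attn)), key=attn.__getitem__)
        return source_tokens[idx], "copy"
    # Generation step: pick the most likely word from the fixed vocabulary.
    probs = softmax(vocab_scores)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return vocab[idx], "generate"


# Illustrative inputs: "Gaudí" is absent from the vocabulary, so it can
# only appear in the caption if it is copied from the article text.
vocab = ["a", "photo", "of", "<unk>"]
source_tokens = ["Antoni", "Gaudí", "designed", "the", "basilica"]

copied = decode_step(vocab, [0.1, 0.2, 0.3, 2.0],
                     source_tokens, [0.5, 3.0, 0.1, 0.0, 0.2], copy_gate=0.9)
generated = decode_step(vocab, [0.1, 2.5, 0.3, 0.0],
                        source_tokens, [0.5, 3.0, 0.1, 0.0, 0.2], copy_gate=0.2)
print(copied, generated)
```

Because the copied word is taken by position from the source text, a named entity that never occurs in the training vocabulary can still appear in the output, which is the out-of-vocabulary behaviour the abstract refers to.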


– President: Xavier Baró (UOC)
– Secretary: Coloma Ballester (UPF)
– Vocal: Ernest Valveny Llobet (UAB)