VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, Vered Shwartz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

29 Scopus citations

Abstract

There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don't, from contextualized commonsense knowledge from COMET. Code: https://github.com/aditya10/VLC-BERT
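The abstract describes a generate-select-encode pipeline: COMET produces candidate commonsense inferences, and the most relevant ones are selected before being encoded alongside visual and textual cues. The sketch below illustrates only the selection idea under loose assumptions — the COMET inferences are stubbed as plain strings, and relevance is scored with a simple bag-of-words cosine similarity as a stand-in for whatever embedding-based ranking the paper actually uses; function names here are illustrative, not from the released code.

```python
# Hypothetical sketch of the "generate, then select" step: rank stubbed
# COMET-style inferences by bag-of-words cosine similarity to the question.
# This is NOT the paper's implementation, only an illustration of selection.
from collections import Counter
import math


def bow_vector(text):
    # Bag-of-words term counts (a crude stand-in for sentence embeddings).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_knowledge(question, candidates, top_k=2):
    # Keep the top_k candidate inferences most similar to the question.
    q = bow_vector(question)
    ranked = sorted(candidates, key=lambda c: cosine(q, bow_vector(c)),
                    reverse=True)
    return ranked[:top_k]


# Stubbed COMET-style inferences for a question about an image.
candidates = [
    "umbrellas are used to stay dry in the rain",
    "people want to go home",
    "the beach is a place to relax",
]
print(select_knowledge("why are the people carrying umbrellas",
                       candidates, top_k=1))
```

The selected inferences would then be encoded together with the question and image features inside the transformer; that fusion step is specific to VLC-BERT and is not sketched here.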

Original language: English
Title of host publication: Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1155-1165
Number of pages: 11
ISBN (Electronic): 9781665493468
DOIs
State: Published - 2023
Externally published: Yes
Event: 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 - Waikoloa, United States
Duration: 3 Jan 2023 – 7 Jan 2023

Publication series

Name: Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023

Conference

Conference: 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023
Country/Territory: United States
City: Waikoloa
Period: 3/01/23 – 7/01/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Funding

This work was funded, in part, by the Vector Institute for AI, a Canada CIFAR AI Chair, an NSERC CRC, NSERC Discovery and Accelerator Grants, and a research gift from AI2. Hardware resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under the Resource Allocation Competition award. Finally, we sincerely thank Prof. Giuseppe Carenini for valuable feedback and discussions.

Funders:
- John R. Evans Leaders Fund CFI
- Canadian Institute for Advanced Research
- Compute Canada
- Government of Ontario
- Natural Sciences and Engineering Research Council of Canada
- Vector Institute

Keywords

- Algorithms: Vision + language and/or other modalities
- Image recognition and understanding (object detection, categorization, segmentation, scene modeling, visual reasoning)
