Abstract
There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transformer (COMET), an existing knowledge model trained on human-curated knowledge bases. We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues in a new pre-trained Vision-Language-Commonsense transformer model, VLC-BERT. Through our evaluation on the knowledge-intensive OK-VQA and A-OKVQA datasets, we show that VLC-BERT is capable of outperforming existing models that utilize static knowledge bases. Furthermore, through a detailed analysis, we explain which questions benefit, and which don't, from contextualized commonsense knowledge from COMET. Code: https://github.com/aditya10/VLC-BERT
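The abstract describes a generate, select, and encode pipeline for commonsense knowledge. As a rough illustration of the "select" step only, the sketch below ranks candidate knowledge expansions by similarity to the question. This is a toy bag-of-words cosine stand-in, not VLC-BERT's actual method (the paper's pipeline generates expansions with COMET and selects them with learned sentence embeddings); the function names and example strings are hypothetical.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity between two strings (hypothetical stand-in
    for the embedding-based similarity used in the actual paper)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    den = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0

def select_expansions(question: str, expansions: list[str], k: int = 2) -> list[str]:
    """Keep the k candidate commonsense expansions most similar to the question."""
    return sorted(expansions, key=lambda e: cosine(question, e), reverse=True)[:k]

# Hypothetical usage: pick the expansion most relevant to the question.
question = "What do you use an umbrella for"
candidates = [
    "an umbrella keeps you dry in the rain",
    "a person wants to stay dry",
    "the sky is blue today",
]
best = select_expansions(question, candidates, k=1)
```

The selected expansions would then be encoded alongside the visual and textual inputs; that fusion step is the transformer model itself and is not sketched here.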
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 1155-1165 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781665493468 |
| DOIs | |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 - Waikoloa, United States; 3 Jan 2023 → 7 Jan 2023 |
Publication series
| Name | Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023 |
Conference
| Conference | 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023 |
| --- | --- |
| Country/Territory | United States |
| City | Waikoloa |
| Period | 3/01/23 → 7/01/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Funding
This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC CRC, NSERC DG and Accelerator Grants, and a research gift from AI2. Hardware resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Additional hardware support was provided by a John R. Evans Leaders Fund CFI grant and by Compute Canada under the Resource Allocation Competition award. Finally, we sincerely thank Prof. Giuseppe Carenini for valuable feedback and discussions.
| Funders | Funder number |
| --- | --- |
| John R. Evans Leaders Fund CFI | |
| Canadian Institute for Advanced Research | |
| Compute Canada | |
| Government of Ontario | |
| Natural Sciences and Engineering Research Council of Canada | |
| Vector Institute | |
Keywords
- Algorithms: Vision + language and/or other modalities
- Image recognition and understanding (object detection, categorization, segmentation, scene modeling, visual reasoning)