Contrastive Learning for Weakly Supervised Phrase Grounding

Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

65 Scopus citations

Abstract

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a ~10% absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground.
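The objective described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the scaled dot-product attention, and the dot-product compatibility score are assumptions made for the example, and only image negatives (non-corresponding images) are shown. The language model guided word-substitution negatives are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attended_regions(words, regions):
    """Attend from each word over the regions of one image.

    words:   (W, D) word features (hypothetical encoder output)
    regions: (R, D) region features (hypothetical detector output)
    Returns the attention-weighted region feature per word, shape (W, D),
    and the attention matrix, shape (W, R).
    """
    # Assumed scoring: scaled dot product between word and region features.
    scores = words @ regions.T / np.sqrt(words.shape[1])
    attn = softmax(scores, axis=1)
    return attn @ regions, attn


def infonce_loss(captions, images):
    """InfoNCE-style loss where caption i corresponds to image i.

    captions: list of (W_i, D) arrays; images: list of (R_k, D) arrays.
    Each word is scored against every image; the matching image is the
    positive and all other images act as negatives.
    """
    losses = []
    for i, words in enumerate(captions):
        # scores[k, j]: compatibility of word j with its attended feature in image k
        scores = np.stack(
            [np.sum(words * attended_regions(words, regs)[0], axis=1) for regs in images]
        )
        # Log-softmax over the image axis, computed stably.
        m = scores.max(axis=0, keepdims=True)
        lse = m + np.log(np.exp(scores - m).sum(axis=0, keepdims=True))
        log_probs = scores - lse
        losses.append(-log_probs[i].mean())
    return float(np.mean(losses))


def ground_words(words, regions):
    """Predict one region index per word by taking the attention argmax."""
    _, attn = attended_regions(words, regions)
    return attn.argmax(axis=1)
```

Under these assumptions, minimizing `infonce_loss` over a batch raises the score of each word with its own image relative to the other images in the batch, and `ground_words` reads the grounding off the learned attention at test time.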

Original language: English
Title of host publication: Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings
Editors: Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 752-768
Number of pages: 17
ISBN (Print): 9783030585792
State: Published - 2020
Event: 16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration: 23 Aug 2020 to 28 Aug 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12348 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th European Conference on Computer Vision, ECCV 2020
Country/Territory: United Kingdom
City: Glasgow
Period: 23/08/20 to 28/08/20

Bibliographical note

Publisher Copyright:
© 2020, Springer Nature Switzerland AG.

Funding

Acknowledgement. This work was done partly at NVIDIA and is partly supported by ONR MURI Award N00014-16-1-2007.

Funders and funder numbers:
ONR MURI
Multidisciplinary University Research Initiative: N00014-16-1-2007

    Keywords

    • Attention
    • Grounding
    • InfoNCE
    • Mutual information
