Abstract
Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a ∼10% absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground.
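The contrastive objective summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attend`, `infonce_loss`), the scaled dot-product attention, and the dot-product compatibility score are all assumptions made for illustration, and the sketch contrasts one caption against candidate images only, leaving out the negative captions built by word substitution. With B candidates, the InfoNCE loss L gives the standard lower bound log B − L on mutual information.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(words, regions):
    """Attention-weighted region feature for each word (hypothetical helper).

    words:   (T, d) word features for one caption
    regions: (R, d) region features for one image
    Returns the (T, d) attended features and the (T, R) attention weights.
    """
    scores = words @ regions.T / np.sqrt(words.shape[-1])  # assumed scaled dot product
    attn = softmax(scores, axis=-1)
    return attn @ regions, attn


def infonce_loss(words, regions_batch, pos_index):
    """InfoNCE loss for one caption against B candidate images.

    words:         (T, d) word features
    regions_batch: (B, R, d) region features, one entry per candidate image
    pos_index:     index of the image that actually matches the caption
    """
    compat = []
    for regions in regions_batch:
        attended, _ = attend(words, regions)
        # Assumed compatibility: dot product of each word with its attended region.
        compat.append((words * attended).sum(axis=-1))
    compat = np.stack(compat, axis=0)  # (B, T)

    # Log-softmax over the B candidates, computed per word.
    m = compat.max(axis=0, keepdims=True)
    lse = m + np.log(np.exp(compat - m).sum(axis=0, keepdims=True))
    log_probs = compat - lse

    # Negative log-probability of the matching image, averaged over words.
    return -log_probs[pos_index].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    words = rng.normal(size=(6, 16))       # 6 words, 16-dim features
    images = rng.normal(size=(4, 10, 16))  # 4 images, 10 regions each
    loss = infonce_loss(words, images, pos_index=0)
    print("loss:", loss, "MI lower bound:", np.log(len(images)) - loss)
```

In this sketch the attention weights returned by `attend` play the role of the word-to-region grounding. When all candidate images are identical the loss equals log B, so the bound is zero, which matches the intuition that indistinguishable candidates carry no information about the caption.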
Original language | English |
---|---|
Title of host publication | Computer Vision – ECCV 2020 - 16th European Conference 2020, Proceedings |
Editors | Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 752-768 |
Number of pages | 17 |
ISBN (Print) | 9783030585792 |
DOIs | |
State | Published - 2020 |
Event | 16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom; Duration: 23 Aug 2020 → 28 Aug 2020 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12348 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 16th European Conference on Computer Vision, ECCV 2020 |
---|---|
Country/Territory | United Kingdom |
City | Glasgow |
Period | 23/08/20 → 28/08/20 |
Bibliographical note
Publisher Copyright: © 2020, Springer Nature Switzerland AG.
Funding
Acknowledgement. This work was done partly at NVIDIA and is partly supported by ONR MURI Award N00014-16-1-2007.
Funders | Funder number |
---|---|
ONR MURI | |
Multidisciplinary University Research Initiative | N00014-16-1-2007 |
Keywords
- Attention
- Grounding
- InfoNCE
- Mutual information