Abstract
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated image-sentence pairs. While such models can provide a powerful matching score for downstream zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic, in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
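To make the idea of combining a visual-semantic model with a language model at inference time concrete, here is a minimal, illustrative sketch and not the authors' exact method: at each decoding step, candidate next tokens proposed by GPT-2 are re-ranked by how well the extended caption matches the image under CLIP. The actual approach in the paper steers generation by optimizing the language model's cached context with a CLIP-based loss; the model names and the greedy re-ranking loop below are assumptions chosen for brevity.

```python
# Simplified CLIP-guided decoding sketch (assumed setup, not the paper's exact
# gradient-based method): GPT-2 proposes fluent candidates, CLIP picks the one
# that best matches the image.
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def caption(image: Image.Image, prompt: str = "Image of a", steps: int = 10, top_k: int = 20) -> str:
    # Embed the image once with CLIP and normalize for cosine similarity.
    img_feat = clip.get_image_features(**clip_proc(images=image, return_tensors="pt").to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    ids = lm_tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(steps):
        logits = lm(ids).logits[0, -1]     # next-token distribution from GPT-2
        cand = logits.topk(top_k).indices  # fluent candidates according to the LM
        texts = [lm_tok.decode(torch.cat([ids[0], c.view(1)])) for c in cand]
        txt_in = clip_proc(text=texts, return_tensors="pt", padding=True).to(device)
        txt_feat = clip.get_text_features(**txt_in)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = (txt_feat @ img_feat.T).squeeze(-1)   # CLIP image-text match per candidate
        best = cand[sims.argmax()]                   # keep the candidate CLIP prefers
        ids = torch.cat([ids, best.view(1, 1)], dim=1)
    return lm_tok.decode(ids[0])
```

In this simplified variant the language model supplies fluency and the visual-semantic model supplies grounding; the same division of labor underlies the paper's method, which additionally back-propagates the CLIP matching loss into the LM's context rather than re-ranking discrete candidates.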
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 |
| Publisher | IEEE Computer Society |
| Pages | 17897-17907 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781665469463 |
| DOIs | |
| State | Published - 2022 |
| Externally published | Yes |
| Event | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States<br>Duration: 19 Jun 2022 → 24 Jun 2022 |
Publication series
| Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
|---|---|
| Volume | 2022-June |
| ISSN (Print) | 1063-6919 |
Conference
| Conference | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 |
|---|---|
| Country/Territory | United States |
| City | New Orleans |
| Period | 19/06/22 → 24/06/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.
Keywords
- Transfer/low-shot/long-tail learning
- Vision + language