Abstract
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.
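The bridging idea described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration and is not taken from the released code: the names `layer_norm`, `bridge`, `cross_layer_stub`, and `bridgetower_forward` are hypothetical, the bridge is assumed to be a simple add-then-normalize step, each modality is reduced to a single feature vector, and the cross-modal layer is a stand-in for real cross-attention.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bridge(cross_state, unimodal_state):
    """Hypothetical bridge layer: add the two inputs, then normalize (assumed form)."""
    return layer_norm([c + u for c, u in zip(cross_state, unimodal_state)])


def cross_layer_stub(visual, textual):
    """Stand-in for one cross-modal layer; a real model would use cross-attention.

    Each modality simply mixes in the mean of the other modality.
    """
    v_mean = sum(visual) / len(visual)
    t_mean = sum(textual) / len(textual)
    return [v + t_mean for v in visual], [t + v_mean for t in textual]


def bridgetower_forward(visual_layers, textual_layers, num_cross_layers):
    """Connect the top uni-modal layers to successive cross-modal layers.

    visual_layers / textual_layers: per-layer outputs of the uni-modal
    encoders, ordered bottom to top, each a list of floats of equal length.
    """
    if num_cross_layers < 1:
        raise ValueError("num_cross_layers must be at least 1")
    top_v = visual_layers[-num_cross_layers:]
    top_t = textual_layers[-num_cross_layers:]

    # The first cross-modal layer consumes the lowest of the selected top layers.
    z_v, z_t = cross_layer_stub(top_v[0], top_t[0])

    # Each later cross-modal layer receives a bridged input: the previous
    # cross-modal output combined with the next uni-modal layer's output.
    for v_rep, t_rep in zip(top_v[1:], top_t[1:]):
        z_v = bridge(z_v, v_rep)
        z_t = bridge(z_t, t_rep)
        z_v, z_t = cross_layer_stub(z_v, z_t)
    return z_v, z_t
```

By contrast, a conventional two-tower model in this toy setup would pass only `visual_layers[-1]` and `textual_layers[-1]` into the cross-modal stack, so lower-level uni-modal representations would never reach the cross-modal layers directly.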
| Original language | English |
|---|---|
| Title of host publication | AAAI-23 Technical Tracks 9 |
| Editors | Brian Williams, Yiling Chen, Jennifer Neville |
| Publisher | AAAI press |
| Pages | 10637-10647 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781577358800 |
| State | Published - 27 Jun 2023 |
| Externally published | Yes |
| Event | 37th AAAI Conference on Artificial Intelligence, AAAI 2023 - Washington, United States. Duration: 7 Feb 2023 → 14 Feb 2023 |
Publication series
| Name | Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 |
|---|---|
| Volume | 37 |
Conference
| Conference | 37th AAAI Conference on Artificial Intelligence, AAAI 2023 |
|---|---|
| Country/Territory | United States |
| City | Washington |
| Period | 7/02/23 → 14/02/23 |
Bibliographical note
Publisher Copyright: Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.