BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning

Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

9 Scopus citations

Abstract

Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.

Original languageEnglish
Title of host publicationAAAI-23 Technical Tracks 9
EditorsBrian Williams, Yiling Chen, Jennifer Neville
PublisherAAAI press
Pages10637-10647
Number of pages11
ISBN (Electronic)9781577358800
StatePublished - 27 Jun 2023
Externally publishedYes
Event37th AAAI Conference on Artificial Intelligence, AAAI 2023 - Washington, United States
Duration: 7 Feb 202314 Feb 2023

Publication series

NameProceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023
Volume37

Conference

Conference37th AAAI Conference on Artificial Intelligence, AAAI 2023
Country/TerritoryUnited States
CityWashington
Period7/02/2314/02/23

Bibliographical note

Publisher Copyright:
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Fingerprint

Dive into the research topics of 'BridgeTower: Building Bridges between Encoders in Vision-Language Representation Learning'. Together they form a unique fingerprint.

Cite this