Pull It Together: Reducing the Modality Gap in Contrastive Learning

Amit Sofer, Yoav Goldman, Shlomo E. Chazan

Research output: Contribution to journal › Conference article › peer-review

Abstract

Contrastive learning has become a powerful strategy for aligning different modalities in a shared embedding space. Contrastive Language-Image Pre-training (CLIP) has achieved remarkable performance across various downstream tasks. This methodology has been extended to the audio-text domain through Contrastive Language-Audio Pre-training (CLAP), demonstrating strong performance in related tasks. However, recent work highlights a modality gap in CLIP's embedding space, where embeddings from different modalities remain partially separated rather than fully integrated. In this paper, we begin by analyzing the CLAP embedding space and identify a similar modality gap. We then propose a novel solution that combines a modality classifier with a Gradient Reversal Layer (GRL) to reduce this gap. Our experiments on CLIP and CLAP confirm that our approach reduces the modality gap while improving performance, even achieving new state-of-the-art (SOTA) results in text-audio retrieval.
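The core mechanism named in the abstract, a Gradient Reversal Layer feeding a modality classifier, acts as the identity in the forward pass and negates (optionally scales) the gradient in the backward pass, pushing the encoder toward modality-invariant embeddings. The two-function sketch below is only an illustration of that behavior; the function names and the `lambda_` scaling parameter are assumptions, not details taken from the paper.

```python
def grl_forward(x):
    """Gradient Reversal Layer, forward pass: the identity.

    Embeddings pass through unchanged on their way to the
    modality classifier.
    """
    return x


def grl_backward(upstream_grad, lambda_=1.0):
    """Backward pass: negate and scale the gradient flowing back
    from the modality classifier into the encoder.

    Because the sign is flipped, the encoder is updated to *fool*
    the modality classifier, which encourages embeddings from
    different modalities to become indistinguishable.
    `lambda_` (an assumed knob, common in GRL setups) controls the
    strength of this adversarial signal.
    """
    return [-lambda_ * g for g in upstream_grad]


# Toy check: forward is identity, backward flips the sign.
emb = [0.2, -0.5, 1.0]
assert grl_forward(emb) == emb
assert grl_backward([1.0, 1.0, 1.0], lambda_=0.5) == [-0.5, -0.5, -0.5]
```

In a full training setup the GRL would sit between the (frozen or fine-tuned) encoder and a small modality classifier, with the reversed gradient added to the usual contrastive loss; frameworks with autograd (e.g. a custom `torch.autograd.Function`) implement the same forward-identity / backward-negation pair.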

Original language: English
Pages (from-to): 196-200
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2025
Externally published: Yes
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 - 21 Aug 2025

Bibliographical note

Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.

Keywords

  • CLAP
  • CLIP
  • modality gap
  • multi-modal

