Abstract
Contrastive learning has become a powerful strategy for aligning different modalities in a shared embedding space. Contrastive Language-Image Pre-training (CLIP) has achieved remarkable performance across various downstream tasks. This methodology has been extended to the audio-text domain through Contrastive Language-Audio Pre-training (CLAP), demonstrating strong performance in related tasks. However, recent work highlights a modality gap in CLIP's embedding space, where embeddings from different modalities remain partially separated rather than fully integrated. In this paper, we begin by analyzing the CLAP embedding space and identify a similar modality gap. We then propose a novel solution that combines a modality classifier with a Gradient Reverse Layer (GRL) to reduce this gap. Our experiments on CLIP and CLAP confirm that our approach reduces the modality gap while improving performance, and even achieves new state-of-the-art (SOTA) results in text-audio retrieval.
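The core mechanism mentioned in the abstract, a gradient reversal layer feeding a modality classifier, can be illustrated with a minimal sketch. This is not the paper's implementation; it is a generic NumPy illustration of the standard GRL idea (identity in the forward pass, gradient scaled by a negative factor in the backward pass), with the scaling coefficient `lam` as an assumed hyperparameter name.

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer (GRL) sketch.

    Forward: identity, so the modality classifier sees the embeddings
    unchanged. Backward: multiplies the incoming gradient by -lam, so the
    encoder is pushed to make embeddings the classifier CANNOT separate
    by modality, i.e. modality-invariant embeddings.
    """

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Pass embeddings through untouched.
        return x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        # Flip and scale the gradient flowing back to the encoder.
        return -self.lam * grad_output

# Usage: an embedding goes through unchanged, while the modality
# classifier's gradient is reversed before reaching the encoder.
grl = GradientReversal(lam=0.5)
emb = np.array([1.0, -2.0, 0.5])
out = grl.forward(emb)                      # identical to emb
grad_to_encoder = grl.backward(np.ones(3))  # [-0.5, -0.5, -0.5]
```

In frameworks with autograd (e.g. PyTorch), the same idea is typically implemented as a custom autograd function so the reversal happens automatically during backpropagation.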
| Original language | English |
|---|---|
| Pages (from-to) | 196-200 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands (17–21 Aug 2025) |
Bibliographical note
Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
Keywords
- CLAP
- CLIP
- modality gap
- multi-modal