C-CLAPA: IMPROVING TEXT-AUDIO CROSS DOMAIN RETRIEVAL WITH CAPTIONING AND AUGMENTATIONS

Amit Sofer, Shlomo E. Chazan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we introduce Captioning decoder Contrastive Language-Audio Pretraining with data Augmentation (C-CLAPA), a new audio-text model for the Cross Domain Retrieval (CDR) task. The model's backbone comprises two encoders, one for text and one for audio. As is common, the embedding vectors from the two modalities are trained with a contrastive loss. In our approach, a captioning decoder is also used to generate a text description from the embedding vector of the audio sample. This decoder is used to ensure that the audio embedding encapsulates text information, and it is used only during training. Data preparation steps, including filtering, augmentation, and text generation with Large Language Models (LLMs), are used to extend the current training dataset. Finally, the proposed model is trained with a curriculum procedure, in which the model is trained on datasets of increasing quality. Our empirical investigation shows that the model significantly surpasses the current State Of The Art (SOTA) models on the available benchmarks. An ablation analysis demonstrates the advantages of the proposed architectural design as well as the efficacy of the data processing methodology.
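The training objective described in the abstract, a contrastive loss between audio and text embeddings plus a captioning loss on the audio embedding, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the temperature of 0.07, the symmetric cross-entropy form of the contrastive loss, and the weighted-sum combination with `caption_weight` are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the audio-text similarity matrix.

    Row i of audio_emb and row i of text_emb are assumed to be a matched pair.
    The temperature value is an assumption, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    sim_t = [list(col) for col in zip(*sim)]  # transpose: text-to-audio logits
    audio_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


def captioning_loss(token_log_probs):
    """Mean negative log-likelihood of the reference caption tokens.

    token_log_probs stands in for the log-probabilities a (hypothetical)
    captioning decoder assigns to each reference token.
    """
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(audio_emb, text_emb, token_log_probs, caption_weight=1.0):
    """Assumed combination: contrastive term plus a weighted captioning term."""
    return contrastive_loss(audio_emb, text_emb) + caption_weight * captioning_loss(
        token_log_probs
    )
```

In this sketch the captioning term only affects training; at retrieval time only the two encoders and the similarity matrix would be needed, which matches the abstract's statement that the decoder is used only during training.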

Original language: English
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 8040-8044
Number of pages: 5
ISBN (Electronic): 9798350344851
DOIs
State: Published - 2024
Externally published: Yes
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory: Korea, Republic of
City: Seoul
Period: 14/04/24 to 19/04/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • multi modality
  • retrieval
  • Text-audio
