Abstract
In this paper, we introduce Captioning decoder Contrastive Language-Audio Pretraining with data Augmentation (C-CLAPA), a new audio-text model for the Cross Domain Retrieval (CDR) task. The model's backbone comprises two encoders, one for text and one for audio. As is common, the embedding vectors from the two modalities are trained with a contrastive loss. In our approach, a captioning decoder is also used to generate a text description from the embedding vector of the audio sample. This decoder, which is used only during training, is intended to ensure that the audio embedding encapsulates text information. Data preparation, including filtering, augmentation, and text generation using Large Language Models (LLMs), is used to extend the current training dataset. Finally, the proposed model is trained with a curriculum training procedure, in which the model is trained on datasets of increasing quality. Our empirical investigation shows that the model significantly surpasses the current State Of The Art (SOTA) models on the available benchmarks. An ablation analysis provides empirical evidence of the advantages of the proposed architectural design and of the efficacy of the employed data processing methodology.
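The training objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the contrastive and captioning terms are combined, so the weighted sum, the `temperature` and `caption_weight` values, and all function names below are assumptions made for illustration. Embeddings are plain Python lists standing in for encoder outputs, and `token_log_probs` stands in for the log-probabilities that a captioning decoder would assign to the reference caption tokens.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Pair i of audio_emb and text_emb is the matching (positive) pair;
    # every other combination in the batch is treated as a negative.
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same idea over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

def captioning_loss(token_log_probs):
    # Mean negative log-likelihood of the reference caption tokens.
    return -sum(token_log_probs) / len(token_log_probs)

def total_loss(audio_emb, text_emb, token_log_probs, caption_weight=1.0):
    # Assumed combination: contrastive term plus a weighted captioning term.
    return (symmetric_contrastive_loss(audio_emb, text_emb)
            + caption_weight * captioning_loss(token_log_probs))
```

In this sketch the captioning term only enters `total_loss`, which mirrors the statement that the decoder is used during training only; retrieval at inference time would rely on the similarity between the two encoders' embeddings.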
Original language | English |
---|---|
Title of host publication | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 8040-8044 |
Number of pages | 5 |
ISBN (Electronic) | 9798350344851 |
State | Published - 2024 |
Externally published | Yes |
Event | 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of. Duration: 14 Apr 2024 → 19 Apr 2024 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
ISSN (Print) | 1520-6149 |
Conference
Conference | 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 |
---|---|
Country/Territory | Korea, Republic of |
City | Seoul |
Period | 14/04/24 → 19/04/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- Text-audio
- multi modality
- retrieval