Generative Spoken Language Model based on continuous word-sized audio tokens

Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs is 20 ms or 40 ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete-unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200 ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
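To make the three substitutions in the abstract concrete, the sketch below shows, in PyTorch-like code, how a lookup table can be replaced by a Lexical Embedder over continuous word-sized audio embeddings, the token-level cross-entropy by an InfoNCE-style contrastive loss, and vocabulary multinomial sampling by k-NN sampling over a bank of candidate embeddings. This is a minimal illustration based only on the abstract's description; the module names, dimensions, similarity measure, and hyperparameters are assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only; all names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LexicalEmbedder(nn.Module):
    """Replaces the lookup table of lexical types: maps continuous
    word-sized audio embeddings into the LM input space (here, a small MLP)."""
    def __init__(self, audio_dim=256, model_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, model_dim), nn.ReLU(),
            nn.Linear(model_dim, model_dim),
        )

    def forward(self, audio_embeddings):        # (batch, seq, audio_dim)
        return self.net(audio_embeddings)       # (batch, seq, model_dim)

def contrastive_loss(predicted, target, negatives, temperature=0.1):
    """Replaces cross entropy over a discrete vocabulary: score the LM's
    predicted continuous embedding against the true next-word embedding
    (positive) and a set of negative embeddings (InfoNCE-style)."""
    pos = F.cosine_similarity(predicted, target, dim=-1) / temperature                  # (batch,)
    neg = F.cosine_similarity(predicted.unsqueeze(1), negatives, dim=-1) / temperature  # (batch, n_neg)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)        # positive is index 0
    return F.cross_entropy(logits, labels)

def knn_sampling(predicted, candidate_bank, k=10):
    """Replaces multinomial sampling over a vocabulary: sample one of the
    k nearest candidate embeddings to the LM's continuous prediction."""
    sims = F.cosine_similarity(predicted.unsqueeze(0), candidate_bank, dim=-1)  # (n_candidates,)
    topk = sims.topk(k)
    choice = torch.multinomial(F.softmax(topk.values, dim=-1), 1)
    return candidate_bank[topk.indices[choice]].squeeze(0)
```

At generation time, the sampled embedding is fed back as the next input, so the whole loop stays in the continuous word-sized embedding space rather than in a discrete unit vocabulary.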

Original language: English
Title of host publication: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics (ACL)
Pages: 3008-3023
Number of pages: 16
ISBN (Electronic): 9798891760608
State: Published - 2023
Externally published: Yes
Event: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore
Duration: 6 Dec 2023 - 10 Dec 2023

Publication series

Name: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
Country/Territory: Singapore
City: Hybrid, Singapore
Period: 6/12/23 - 10/12/23

Bibliographical note

Publisher Copyright:
©2023 Association for Computational Linguistics.

Funding

This work was funded in part, to the authors in their academic capacities, by the Agence Nationale pour la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute), CIFAR (Learning in Machines and Brains) and Meta AI Research (Research Grant). This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-[AD011011217]).

Funders and funder numbers:
Agence Nationale pour la Recherche: ANR-17-EURE-0017, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001
GENCI-IDRIS: 2021-[AD011011217]
Meta AI Research
Canadian Institute for Advanced Research (CIFAR)
