EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Tu Anh Nguyen, Wei Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent work has shown that high-quality speech can be resynthesized not from text but from low-bitrate discrete units that are learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). Adoption of these methods is still limited because most speech synthesis datasets consist of read speech, which severely limits spontaneity and expressivity. Here, we introduce EXPRESSO, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potential of this dataset with an expressive resynthesis benchmark in which the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate, and invariance to speaker and style. The dataset, evaluation metrics, and baseline models will be open-sourced.

Original language: English
Pages (from-to): 4823-4827
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
DOIs
State: Published - 2023
Externally published: Yes
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 → 24 Aug 2023

Bibliographical note

Publisher Copyright:
© 2023 International Speech Communication Association. All rights reserved.

Keywords

  • Speech synthesis evaluation
  • expressive synthesis
  • self-supervised speech representations
