Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training?

Ali Elkahky, Wei-Ning Hsu, Paden Tomasello, Tu Anh Nguyen, Robin Algayres, Yossi Adi, Jade Copet, Emmanuel Dupoux, Abdelrahman Mohamed

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation


The research community has produced many successful self-supervised speech representation learning methods over the past few years. Discrete units have been utilized in various self-supervised learning frameworks, such as VQ-VAE [1], wav2vec 2.0 [2], HuBERT [3], and Wav2Seq [4]. This paper studies the impact of altering the granularity and improving the quality of these discrete acoustic units for pre-training encoder-only and encoder-decoder models. We systematically study the current proposals of using Byte-Pair Encoding (BPE) and new extensions that use cluster smoothing and Brown clustering. The quality of the learned units is studied intrinsically using ZeroSpeech metrics and extrinsically on the downstream speech recognition (ASR) task. Our results suggest that longer-range units are helpful for encoder-decoder pre-training; however, encoder-only masked-prediction models cannot yet benefit from self-supervised word-like targets.
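The paper's actual pipeline is not reproduced here, but the core idea of coarsening discrete units with BPE can be sketched in a few lines: repeatedly find the most frequent adjacent pair of unit IDs and merge it into a new, longer-range unit. The function names and ID scheme below are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most frequent adjacent pair of unit IDs, or None."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_coarsen(units, num_merges, next_id):
    """Apply `num_merges` BPE merges; each merge mints one coarser unit."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(units)
        if pair is None:
            break
        units = merge_pair(units, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return units, merges

# One merge over a toy cluster-ID stream: (1, 2) occurs most often,
# so it is replaced by a new unit 100.
units, merges = bpe_coarsen([1, 2, 1, 2, 3, 1, 2], num_merges=1, next_id=100)
# units  -> [100, 100, 3, 100]
# merges -> [((1, 2), 100)]
```

Each merge halves the length of the matched spans, so a few thousand merges turn frame-level cluster IDs into much coarser, word-like targets — the granularity axis the paper varies.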

Bibliographical note

Publisher Copyright:
© 2023 IEEE.


  • representation learning
  • self-supervision
  • unit discovery

