Word emphasis prediction for expressive text to speech

Yosi Mass, Slava Shechtman, Moran Mordechay, Ron Hoory, Oren Sar Shalom, Guy Lev, David Konopnicki

Research output: Contribution to journal > Conference article > peer-review

15 Scopus citations

Abstract

Word emphasis prediction is an important part of expressive prosody generation in modern Text-To-Speech (TTS) systems. We present a method for predicting emphasized words for expressive TTS, based on a Deep Neural Network (DNN). We show that the presented method outperforms machine learning methods based on hand-crafted features in terms of objective metrics such as precision and recall. Using a listening test, we further demonstrate that the contribution of the predicted emphasized words to the expressiveness of the synthesized speech is subjectively perceivable.

Original language: English
Pages (from-to): 2868-2872
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
State: Published - 2018
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sep 2018 to 6 Sep 2018

Bibliographical note

Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.

Keywords

  • Deep learning
  • Expressive text to speech
  • Prosody
  • Speech synthesis
  • Word emphasis
