Abstract
Word emphasis prediction is an important part of expressive prosody generation in modern Text-To-Speech (TTS) systems. We present a method for predicting emphasized words for expressive TTS, based on a Deep Neural Network (DNN). We show that the presented method outperforms machine learning methods based on hand-crafted features in terms of objective metrics such as precision and recall. Using a listening test, we further demonstrate that the contribution of the predicted emphasized words to the expressiveness of the synthesized speech is subjectively perceivable.
Original language | English |
---|---|
Pages (from-to) | 2868-2872 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2018-September |
DOIs | |
State | Published - 2018 |
Externally published | Yes |
Event | 19th Annual Conference of the International Speech Communication, INTERSPEECH 2018 - Hyderabad, India Duration: 2 Sep 2018 → 6 Sep 2018 |
Bibliographical note
Publisher Copyright:© 2018 International Speech Communication Association. All rights reserved.
Keywords
- Deep learning
- Expressive text to speech
- Prosody
- Speech synthesis
- Word emphasis