Abstract
Formant frequency estimation and tracking are among the most fundamental problems in speech processing. In the estimation task, the input is a stationary speech segment, such as the middle of a vowel, and the goal is to estimate the formant frequencies; in the tracking task, the input is a series of speech frames, and the goal is to track the trajectory of the formant frequencies throughout the signal. The use of supervised machine learning techniques, trained on an annotated corpus of read speech, is proposed for these tasks. Two deep network architectures were evaluated for estimation, feed-forward multilayer perceptrons and convolutional neural networks, and, correspondingly, two architectures for tracking, recurrent and convolutional recurrent networks. The inputs to the former are linear predictive coding (LPC)-based cepstral coefficients computed with a range of model orders, together with pitch-synchronous cepstral coefficients, whereas the inputs to the latter are raw spectrograms. The performance of the methods compares favorably with that of alternative methods for formant estimation and tracking. A network architecture is further proposed that allows the models to adapt to formant frequency ranges not seen at training time. The adapted networks were evaluated on three datasets, and adaptation further improved their performance.
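As a rough illustration of the estimation front end, the sketch below computes LPC-based cepstral coefficients over a range of model orders and stacks them into one feature vector, as the abstract describes. This is a minimal sketch, not the paper's implementation: the LPC-to-cepstrum recursion is the standard one, but the specific model orders, frame length, number of coefficients, and the use of `librosa.lpc` are assumptions made here for concreteness.

```python
import numpy as np
import librosa

def lpc_to_cepstrum(lpc, n_ceps):
    """Convert LPC coefficients (librosa convention: [1, a_1, ..., a_p])
    to LPC-cepstral coefficients via the standard recursion."""
    a = -lpc[1:]                     # predictor coefficients a_1..a_p
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

def lpc_cepstral_features(frame, orders=range(8, 18), n_ceps=30):
    """Stack cepstra from several LPC model orders; the orders and
    n_ceps here are illustrative, not the paper's settings."""
    feats = []
    for p in orders:
        lpc = librosa.lpc(frame, order=p)
        feats.append(lpc_to_cepstrum(lpc, n_ceps))
    return np.concatenate(feats)

# Example: features for one stationary voiced frame (assumed 16 kHz audio,
# hypothetical file name, 30 ms Hamming-windowed analysis frame).
y, sr = librosa.load("vowel.wav", sr=16000)
frame = y[:480] * np.hamming(480)
x = lpc_cepstral_features(frame)
print(x.shape)                       # (len(orders) * n_ceps,)
```

The resulting vector would serve as the input to an estimation network (MLP or CNN), which regresses the formant frequencies for the frame.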
Original language | English |
---|---|
Pages (from-to) | 642-653 |
Number of pages | 12 |
Journal | Journal of the Acoustical Society of America |
Volume | 145 |
Issue number | 2 |
DOIs | |
State | Published - 1 Feb 2019 |
Bibliographical note
Publisher Copyright: © 2019 Acoustical Society of America.
Funding
This research was supported by the MAGNET program of the Israeli Innovation Authority. We would like to thank Cynthia Clopper for allowing us to use her dataset.
Funders | Funder number |
---|---|
Israeli Innovation Authority | |