Automatic measurement of voice onset time using discriminative structured prediction

Morgan Sonderegger, Joseph Keshet

Research output: Contribution to journal > Article > peer-review

29 Scopus citations


A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50–250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
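The general idea in the abstract, scoring candidate (burst onset, voicing onset) pairs with a linear function over feature functions and training it to reduce the VOT error, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the input is a toy per-frame energy contour, the three features in `features` are hypothetical stand-ins for the paper's acoustic feature functions, and the update in `train` is a generic structured-perceptron-style step with loss-augmented inference, not the authors' training procedure.

```python
def features(x, tb, tv):
    """Hypothetical feature map phi(x, tb, tv) for burst onset tb and voicing onset tv."""
    burst_jump = x[tb] - x[tb - 1]      # energy rise at the candidate burst onset
    voicing_jump = x[tv] - x[tv - 1]    # energy rise at the candidate voicing onset
    duration = tv - tb                  # candidate VOT, in frames
    return [burst_jump, voicing_jump, duration]


def candidates(n):
    """All onset pairs (tb, tv) with 1 <= tb < tv < n."""
    for tb in range(1, n):
        for tv in range(tb + 1, n):
            yield tb, tv


def vot_loss(pred, truth):
    """Absolute difference between predicted and labeled VOT, in frames."""
    return abs((pred[1] - pred[0]) - (truth[1] - truth[0]))


def score(w, x, tb, tv):
    """Linear score w . phi(x, tb, tv)."""
    return sum(wi * fi for wi, fi in zip(w, features(x, tb, tv)))


def predict(w, x, truth=None):
    """Highest-scoring onset pair; if truth is given, add the loss to each score
    (loss-augmented inference, used only during training)."""
    def key(c):
        s = score(w, x, *c)
        if truth is not None:
            s += vot_loss(c, truth)
        return s
    return max(candidates(len(x)), key=key)


def train(data, epochs=10, lr=0.1):
    """Perceptron-style updates toward the labeled pair and away from the
    most-violating candidate (an illustrative stand-in for large-margin training)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, truth in data:
            guess = predict(w, x, truth)
            if vot_loss(guess, truth) > 0:
                f_true = features(x, *truth)
                f_guess = features(x, *guess)
                w = [wi + lr * (t - g) for wi, t, g in zip(w, f_true, f_guess)]
    return w


# Toy contour: silence, burst at frame 2, voicing at frame 5.
toy_x = [0, 0, 5, 5, 5, 9, 9]
toy_truth = (2, 5)
```

With hand-set weights that reward both energy jumps, `predict([1.0, 1.0, 0.0], toy_x)` selects the pair `(2, 5)`, a VOT of 3 frames. Calling `train([(toy_x, toy_truth)])` returns a learned weight vector for the same three features.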

Original language: English
Pages (from-to): 3965-3979
Number of pages: 15
Journal: Journal of the Acoustical Society of America
Issue number: 6
State: Published - Dec 2012
Externally published: Yes

Bibliographical note

Funding Information:
We thank Matt Goldrick and Nattalia Paterson for providing the PGWORDS data, Hugo van Hamme for providing the manual annotations used in Stouten and van Hamme (2009), and Chi-Yueh Lin for providing the list of stops used in Lin and Wang (2011). We also thank Karen Livescu and Matt Goldrick for helpful feedback, and Natalie Rothfels and Max Bane for VOT annotation. The first author was supported in part by a Department of Education GAANN grant.

