Abstract
Spoken term detection (STD) is the task of determining whether and where a given word or phrase appears in a given segment of speech. Algorithms for STD often aim to maximize the gap between the scores of positive and negative examples. As such, they focus on ensuring that utterances where the term appears are ranked higher than utterances where it does not. However, they do not determine a detection threshold between the two. In this paper, we propose a new approach for setting an absolute detection threshold for all terms by introducing a new calibrated loss function. Minimizing this loss function during training not only maximizes the relative ranking scores but also adjusts the system to use a fixed threshold, and thus maximizes the detection accuracy rates. We use the new loss function in the structured prediction setting and extend the discriminative keyword spotting algorithm to learn a spoken term detector with a single threshold for all terms. We further demonstrate the effectiveness of the new loss function by training a deep neural Siamese network in a weakly supervised setting for template-based STD, again with a single fixed threshold. Experiments with the TIMIT, Wall Street Journal (WSJ), and Switchboard corpora showed that our approach not only improved the accuracy rates when a fixed threshold was used but also obtained a higher area under the curve (AUC).
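
The contrast between a purely ranking-based objective and one tied to a fixed threshold can be illustrated with a minimal sketch. The abstract does not give the exact form of the calibrated loss, so the hinge-based formulation below, along with the names `ranking_loss`, `calibrated_loss`, `detect`, `threshold`, and `margin`, is an assumption made for illustration and may differ from the loss defined in the paper.

```python
# Illustrative sketch only: the function names and the hinge formulation
# are assumptions, not taken from the paper.

def hinge(z):
    """Return max(0, z)."""
    return max(0.0, z)

def ranking_loss(pos_score, neg_score, margin=1.0):
    """Penalize a positive example that does not outscore a negative one
    by at least `margin`. Only the relative order of the scores matters."""
    return hinge(margin - pos_score + neg_score)

def calibrated_loss(pos_score, neg_score, threshold=0.0, margin=1.0):
    """Penalize a positive score below `threshold + margin` and a negative
    score above `threshold - margin`. When this loss is zero, the ranking
    loss is zero as well, and both scores sit on the correct side of the
    fixed threshold."""
    return (hinge(margin + threshold - pos_score)
            + hinge(margin - threshold + neg_score))

def detect(score, threshold=0.0):
    """Declare the term present when its score reaches the fixed threshold."""
    return score >= threshold

# A pair that is ranked correctly but is not calibrated to threshold 0:
# the negative example scores 3.0 and would be wrongly detected.
print(ranking_loss(5.0, 3.0))
print(calibrated_loss(5.0, 3.0))
print(detect(3.0))
```

In this sketch the pair (5.0, 3.0) incurs no ranking loss, yet the calibrated loss is positive because the negative score lies above the threshold. This is the kind of case a ranking-only objective leaves unaddressed.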
Original language | English |
---|---|
Article number | 8070931 |
Pages (from-to) | 1310-1317 |
Number of pages | 8 |
Journal | IEEE Journal on Selected Topics in Signal Processing |
Volume | 11 |
Issue number | 8 |
DOIs | 10.1109/JSTSP.2017.2764268 |
State | Published - Dec 2017 |
Bibliographical note
Publisher Copyright: © 2007-2012 IEEE.
Funding
Manuscript received March 31, 2017; revised August 10, 2017 and October 2, 2017; accepted October 5, 2017. Date of publication October 18, 2017; date of current version November 16, 2017. This work was supported by the MAGNET program of the Israeli Innovation Authority. The guest editor coordinating the review of this paper and approving it for publication was Dr. Nancy F. Chen. (Corresponding author: Joseph Keshet.) The authors are with the Department of Computer Science, Bar-Ilan University, Ramat-Gan 52900, Israel. Digital Object Identifier 10.1109/JSTSP.2017.2764268
Funders | Funder number |
---|---|
Israeli Innovation Authority | |
Keywords
- AUC maximization
- Spoken term detection
- deep neural networks
- keyword spotting
- structured prediction