TY - JOUR
T1 - Personalizing ASR for dysarthric and accented speech with limited data
AU - Shor, Joel
AU - Emanuel, Dotan
AU - Lang, Oran
AU - Tuval, Omry
AU - Brenner, Michael
AU - Cattiau, Julie
AU - Vieira, Fernando
AU - McNally, Maeve
AU - Charbonneau, Taylor
AU - Nollstadt, Melissa
AU - Hassidim, Avinatan
AU - Matias, Yossi
N1 - Publisher Copyright:
© 2019 ISCA
PY - 2019
Y1 - 2019
AB - Automatic speech recognition (ASR) systems have dramatically improved over the last few years. ASR systems are most often trained from 'typical' speech, which means that underrepresented groups don't experience the same level of improvement. In this paper, we present and evaluate finetuning techniques to improve ASR for users with non-standard speech. We focus on two types of non-standard speech: speech from people with amyotrophic lateral sclerosis (ALS) and accented speech. We train personalized models that achieve 62% and 35% relative WER improvement on these two groups, bringing the absolute WER for ALS speakers, on a test set of message bank phrases, down to 10% for mild dysarthria and 20% for more serious dysarthria. We show that 71% of the improvement comes from only 5 minutes of training data. Finetuning a particular subset of layers (with many fewer parameters) often gives better results than finetuning the entire model. This is the first step towards building state of the art ASR models for dysarthric speech.
KW - Accessibility
KW - Personalization
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85074727857&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-1427
DO - 10.21437/Interspeech.2019-1427
M3 - Conference article
AN - SCOPUS:85074727857
SN - 2308-457X
VL - 2019-September
SP - 784
EP - 788
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -