Speech quality is most accurately assessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was recently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation and show that the stereo SDR (SSDR) correlates poorly with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of the desired speech and the presence of residual echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, we show that the SDSML and SRESL metrics correlate highly with the AECMOS across various setups. We also investigate how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently-changing user demands in practical cases.
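As a point of reference for the abstract's discussion, the conventional SDR metric it mentions can be sketched in Python as below. This is a minimal illustration of the standard single-channel definition (reference energy over estimation-error energy, in dB); how the paper aggregates it into the stereo SSDR is not specified here, so the per-channel helper `sdr_db` is only an assumed building block, not the authors' exact formulation.

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB: energy of the reference
    signal over the energy of the estimation error (higher is better).
    Assumed single-channel form; not the paper's exact SSDR."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

# Illustration: a clean tone versus a slightly perturbed "enhanced" output.
t = np.linspace(0.0, 1.0, 16000)
clean = np.sin(2.0 * np.pi * 440.0 * t)
enhanced = clean + 0.01 * np.random.default_rng(0).standard_normal(t.size)
score = sdr_db(clean, enhanced)
```

A uniform 10% amplitude error, for instance, yields exactly 20 dB, since the error energy is 1% of the reference energy.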
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Published - 2022
23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 18 Sep 2022 → 22 Sep 2022
Bibliographical note: Publisher Copyright © 2022 ISCA.
- residual-echo suppression
- deep learning
- objective metrics
- perceptual speech quality
- stereo echo cancellation