Abstract
The performance of most emotion recognition systems degrades in real-life situations (“in the wild” scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of Speech Emotion Recognition (SER) algorithms and to develop a system that is more robust to adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the Hierarchical Token-semantic Audio Transformer (HTS-AT), to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
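The two channel-fusion strategies named in the abstract can be sketched as follows. This is a minimal illustration in NumPy, not the paper's actual HTS-AT pipeline: the patch size, embedding dimension, and the simple linear projection standing in for the model's patch-embedding layer are all assumptions made for the example.

```python
import numpy as np

def average_spectrograms(mel_specs):
    """Strategy 1: average mel-spectrograms across microphone channels.

    mel_specs: array of shape (channels, n_mels, frames).
    Returns a single (n_mels, frames) spectrogram fed to the model.
    """
    return mel_specs.mean(axis=0)

def sum_patch_embeddings(mel_specs, weight, patch=4):
    """Strategy 2: patch-embed each channel, then sum the embeddings.

    A plain linear projection (`weight`) stands in for the model's
    patch-embedding layer here; this is an illustrative assumption.
    mel_specs: (channels, n_mels, frames); weight: (patch*patch, dim).
    """
    c, n_mels, frames = mel_specs.shape
    # Split each channel into non-overlapping patch x patch tiles
    # and flatten every tile into a vector of length patch*patch.
    tiles = mel_specs.reshape(c, n_mels // patch, patch, frames // patch, patch)
    tiles = tiles.transpose(0, 1, 3, 2, 4).reshape(c, -1, patch * patch)
    tokens = tiles @ weight          # (channels, n_tokens, dim)
    return tokens.sum(axis=0)        # (n_tokens, dim): one token sequence

rng = np.random.default_rng(0)
specs = rng.standard_normal((4, 64, 128))   # 4 mics, 64 mel bins, 128 frames
w = rng.standard_normal((16, 96))           # hypothetical projection matrix
avg = average_spectrograms(specs)           # shape (64, 128)
emb = sum_patch_embeddings(specs, w)        # shape (512, 96)
```

Strategy 1 fuses the channels before the network sees them, so the rest of the model is unchanged from the single-channel case; strategy 2 keeps per-channel information until after patch embedding and fuses it by summation inside the model.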
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings |
| Editors | Bhaskar D. Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| ISBN (Electronic) | 9798350368741 |
| DOIs | |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India. Duration: 6 Apr 2025 → 11 Apr 2025 |
Publication series
| Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
|---|---|
| ISSN (Print) | 1520-6149 |
Conference
| Conference | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 |
|---|---|
| Country/Territory | India |
| City | Hyderabad |
| Period | 6/04/25 → 11/04/25 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Keywords
- human-robot interaction
- speech emotion recognition