Abstract
We present a deep-learning approach to Concurrent Speaker Detection (CSD) based on a modified transformer model. The model is designed for multi-microphone data but also works in the single-microphone case. It classifies audio segments into one of three classes: 1) no speech activity (noise only), 2) a single active speaker, and 3) more than one active speaker. We incorporate a Cost-Sensitive (CS) loss and confidence calibration into the training procedure. The approach is evaluated on three real-world databases: AMI, AliMeeting, and CHiME-5, demonstrating an improvement over existing approaches.
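The cost-sensitive loss mentioned in the abstract reweights the cross-entropy penalty per class so that errors on rarer or more costly classes (e.g. overlapped speech) are penalised more heavily. A minimal sketch of such a class-weighted cross-entropy follows; the class indices and weight values are illustrative assumptions, not the paper's actual configuration:

```python
import math

# Assumed class indexing for the three CSD classes described in the abstract:
# 0 = noise only, 1 = single active speaker, 2 = concurrent speakers.
# Illustrative weights: missed overlap is penalised twice as hard (not the paper's values).
CLASS_WEIGHTS = [1.0, 1.0, 2.0]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cost_sensitive_ce(logits, target):
    """Class-weighted cross-entropy: -w[target] * log p(target)."""
    probs = softmax(logits)
    return -CLASS_WEIGHTS[target] * math.log(probs[target])
```

With uniform logits, a ground-truth "concurrent speakers" segment incurs twice the loss of a "noise only" segment, which biases training toward catching overlapped speech.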
Original language | English |
---|---|
Title of host publication | 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings |
Publisher | European Signal Processing Conference, EUSIPCO |
Pages | 897-901 |
Number of pages | 5 |
ISBN (Electronic) | 9789464593617 |
DOIs | |
State | Published - 2024 |
Event | 32nd European Signal Processing Conference, EUSIPCO 2024 - Lyon, France. Duration: 26 Aug 2024 → 30 Aug 2024 |
Publication series
Name | European Signal Processing Conference |
---|---|
ISSN (Print) | 2219-5491 |
Conference
Conference | 32nd European Signal Processing Conference, EUSIPCO 2024 |
---|---|
Country/Territory | France |
City | Lyon |
Period | 26/08/24 → 30/08/24 |
Bibliographical note
Publisher Copyright: © 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.