Abstract
In this study, we present an online multi-speaker localization algorithm that uses a deep neural network and a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained on instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we show that the proposed method can also separate moving speakers by applying the obtained TF masks.
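As a rough illustration of the last two ideas in the abstract (per-bin DOA classification and TF masks for separation), the sketch below shows how a map of per-bin DOA class labels could be turned into a per-frame DOA histogram and into binary TF masks. It is not the authors' implementation: the network itself is omitted, and `doa_map`, `doa_histogram`, `masks_from_doa_map`, the number of DOA classes, and the `tolerance` parameter are all hypothetical names and values chosen for this example.

```python
import numpy as np


def doa_histogram(doa_map, num_classes):
    """Count, per time frame, how many frequency bins vote for each DOA class."""
    num_frames = doa_map.shape[1]
    hist = np.zeros((num_classes, num_frames), dtype=int)
    for t in range(num_frames):
        hist[:, t] = np.bincount(doa_map[:, t], minlength=num_classes)
    return hist


def masks_from_doa_map(doa_map, speaker_classes, tolerance=1):
    """Build one binary TF mask per speaker.

    A bin belongs to a speaker if its DOA class lies within `tolerance`
    classes of that speaker's class. Angular wrap-around is ignored here
    (a simplifying assumption).
    """
    return np.stack([np.abs(doa_map - c) <= tolerance for c in speaker_classes])


# Toy input standing in for the network output: (frequencies, frames) = (3, 3),
# each entry a DOA class index in [0, 8). Assumed, not taken from the paper.
doa_map = np.array([[2, 2, 7],
                    [2, 7, 7],
                    [3, 7, 0]])

hist = doa_histogram(doa_map, num_classes=8)
dominant = hist.argmax(axis=0)  # most-voted DOA class in each frame

# Assume two speakers were detected at classes 2 and 7.
masks = masks_from_doa_map(doa_map, speaker_classes=[2, 7])

# Apply the masks to a placeholder mixture STFT of the same TF shape.
mixture_stft = np.ones((3, 3), dtype=complex)
separated = masks * mixture_stft[None, :, :]  # shape: (speakers, freqs, frames)
```

Because the masks are recomputed from the label map of each frame, the same masking step applies unchanged when the speaker classes vary over time, which is the sense in which separation of moving speakers follows from the per-bin estimates.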
Original language | English |
---|---|
Article number | 16 |
Journal | EURASIP Journal on Audio, Speech, and Music Processing |
Volume | 2021 |
Issue number | 1 |
DOIs | |
State | Published - Dec 2021 |
Externally published | Yes |
Bibliographical note
Funding Information: This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 871245. The project was also supported by the Israeli Ministry of Science & Technology.
Publisher Copyright:
© 2021, The Author(s).
Keywords
- DOA
- Tracking
- UNET