Dynamically localizing multiple speakers based on the time-frequency domain

Hodaya Hammer, Shlomo E. Chazan, Jacob Goldberger, Sharon Gannot

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

In this study, we present a deep neural network-based online multi-speaker localization algorithm for a multi-microphone array. Following the W-disjoint orthogonality principle in the spectral domain, each time-frequency (TF) bin is dominated by a single speaker and hence by a single direction of arrival (DOA). A fully convolutional network is trained with instantaneous spatial features to estimate the DOA for each TF bin. The high-resolution classification enables the network to accurately and simultaneously localize and track multiple speakers, both static and dynamic. An elaborate experimental study using simulated and real-life recordings in static and dynamic scenarios demonstrates that the proposed algorithm significantly outperforms both classic and recent deep-learning-based algorithms. Finally, as a byproduct, we further show that the proposed method is also capable of separating moving speakers by applying the obtained TF masks.
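
To illustrate how per-bin DOA estimates can yield TF masks, here is a minimal sketch in plain Python. It is not the authors' implementation: the network is replaced by a hand-written toy DOA map, the function names `dominant_doas` and `tf_masks` are hypothetical, and speakers are picked from a histogram over the whole map, which is a simplification that would not track moving speakers frame by frame.

```python
from collections import Counter


def dominant_doas(doa_map, num_speakers):
    """Return the most frequent DOA classes over all TF bins (hypothetical helper)."""
    counts = Counter(doa for frame in doa_map for doa in frame)
    return [doa for doa, _ in counts.most_common(num_speakers)]


def tf_masks(doa_map, speaker_doas, tolerance=0):
    """Build one binary TF mask per speaker.

    A bin is set to 1 when its DOA class lies within `tolerance`
    classes of the speaker's DOA class, and to 0 otherwise.
    """
    return [
        [[1 if abs(doa - target) <= tolerance else 0 for doa in frame]
         for frame in doa_map]
        for target in speaker_doas
    ]


# Toy DOA map (assumed data): 3 time frames x 4 frequency bins.
# Each entry is the DOA class index assigned to that TF bin.
doa_map = [
    [2, 2, 7, 7],
    [2, 7, 7, 2],
    [2, 2, 2, 7],
]

speakers = dominant_doas(doa_map, num_speakers=2)
masks = tf_masks(doa_map, speakers)
```

Because each bin carries exactly one DOA class, the two masks here are disjoint, which mirrors the W-disjoint orthogonality assumption. Multiplying a mask element-wise with the mixture spectrogram would give a rough estimate of the corresponding speaker's signal.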

Original language: English
Article number: 16
Journal: Eurasip Journal on Audio, Speech, and Music Processing
Volume: 2021
Issue number: 1
State: Published - Dec 2021
Externally published: Yes

Bibliographical note

Funding Information:
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 871245. The project was also supported by the Israeli Ministry of Science & Technology.

Publisher Copyright:
© 2021, The Author(s).

Keywords

  • DOA
  • Tracking
  • UNET
