TY - JOUR
T1 - Single-microphone speaker separation and voice activity detection in noisy and reverberant environments
AU - Opochinsky, Renana
AU - Moradi, Mordehay
AU - Gannot, Sharon
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with time-frequency (TF) attention, designed for noisy and reverberant environments. We dub this new architecture the Separation TF Attention Network (Sep-TFAnet). Additionally, we introduce Sep-TFAnetVAD, a variant that incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture, with several modifications. Instead of using a learned encoder and decoder, we employ the short-time Fourier transform (STFT) and the inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specifically developed for human-robot interaction and supports a block-processing mode. While considerable progress has been made in separating overlapping speech signals, most studies have focused primarily on mixtures of simulated, reverberated speech signals rather than real-world scenarios. To address this limitation, we introduce the ARI multi-mic dataset, which comprises recordings from real-world experiments. These recordings were carried out in the acoustic laboratory at Bar-Ilan University and captured by a humanoid robot. Throughout this paper, we focus on a single-microphone setting. An extensive evaluation of the proposed methods on this dataset and on carefully simulated data demonstrates advantages over competing approaches. The ARI multi-mic dataset is available at DataPort, and examples of our algorithm applied to this dataset can be found on the project page: https://Sep-TFAnet.github.io.
KW - Speaker separation
KW - Temporal convolutional networks
KW - Voice activity detection
UR - http://www.scopus.com/inward/record.url?scp=105003795953&partnerID=8YFLogxK
U2 - 10.1186/s13636-025-00404-7
DO - 10.1186/s13636-025-00404-7
M3 - Article
AN - SCOPUS:105003795953
SN - 1687-4714
VL - 2025
JO - EURASIP Journal on Audio, Speech, and Music Processing
JF - EURASIP Journal on Audio, Speech, and Music Processing
IS - 1
M1 - 18
ER -