TY - JOUR
T1 - A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations
AU - Goldstein, Ariel
AU - Wang, Haocheng
AU - Niekerken, Leonard
AU - Schain, Mariano
AU - Zada, Zaid
AU - Aubrey, Bobbi
AU - Sheffer, Tom
AU - Nastase, Samuel A.
AU - Gazula, Harshvardhan
AU - Singh, Aditi
AU - Rao, Aditi
AU - Choe, Gina
AU - Kim, Catherine
AU - Doyle, Werner
AU - Friedman, Daniel
AU - Devore, Sasha
AU - Dugan, Patricia
AU - Hassidim, Avinatan
AU - Brenner, Michael
AU - Matias, Yossi
AU - Devinsky, Orrin
AU - Flinker, Adeen
AU - Hasson, Uri
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/5
Y1 - 2025/5
N2 - This study introduces a unified computational framework connecting acoustic, speech and word-level linguistic structures to study the neural basis of everyday conversations in the human brain. We used electrocorticography to record neural signals across 100 h of speech production and comprehension as participants engaged in open-ended real-life conversations. We extracted low-level acoustic, mid-level speech and contextual word embeddings from a multimodal speech-to-text model (Whisper). We developed encoding models that linearly map these embeddings onto brain activity during speech production and comprehension. Remarkably, this model accurately predicts neural activity at each level of the language processing hierarchy across hours of new conversations not used in training the model. The internal processing hierarchy in the model is aligned with the cortical hierarchy for speech and language processing, where sensory and motor regions better align with the model’s speech embeddings, and higher-level language areas better align with the model’s language embeddings. The Whisper model captures the temporal sequence of language-to-speech encoding before word articulation (speech production) and speech-to-language encoding post articulation (speech comprehension). The embeddings learned by this model outperform symbolic models in capturing neural activity supporting natural speech and language. These findings support a paradigm shift towards unified computational models that capture the entire processing hierarchy for speech comprehension and production in real-world conversations.
UR - http://www.scopus.com/inward/record.url?scp=86000340737&partnerID=8YFLogxK
U2 - 10.1038/s41562-025-02105-9
DO - 10.1038/s41562-025-02105-9
M3 - Article
C2 - 40055549
AN - SCOPUS:86000340737
SN - 2397-3374
VL - 9
SP - 1041
EP - 1055
JO - Nature Human Behaviour
JF - Nature Human Behaviour
IS - 5
ER -