Simple, interpretable and stable method for detecting words with usage change across corpora

Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

55 Scopus citations

Abstract

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well and, as we show in this work, produce unstable and hence less reliable results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).
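The neighbor-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function, so the sketch assumes that usage change is scored by the overlap between a word's top-k nearest neighbors in two separately trained embedding spaces, with a smaller overlap suggesting a larger change. The function names, the value of k, and the toy two-dimensional vectors are all hypothetical.

```python
import numpy as np

def top_k_neighbors(word, vocab, vectors, k):
    """Return the set of the k words closest to `word` by cosine similarity."""
    index = {w: i for i, w in enumerate(vocab)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[index[word]]
    sims[index[word]] = -np.inf  # never count the word as its own neighbor
    nearest = np.argsort(-sims)[:k]
    return {vocab[i] for i in nearest}

def neighbor_overlap(word, vocab, vectors_a, vectors_b, k=2):
    """Count neighbors shared by both spaces (assumed scoring, not from the source)."""
    neighbors_a = top_k_neighbors(word, vocab, vectors_a, k)
    neighbors_b = top_k_neighbors(word, vocab, vectors_b, k)
    return len(neighbors_a & neighbors_b)

# Toy data (hypothetical): the same vocabulary embedded from two corpora.
# No alignment between the two spaces is needed, only within-space neighbors.
vocab = ["bank", "money", "loan", "river", "shore"]
vectors_a = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]])
vectors_b = np.array([[0.1, 1.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.9]])

# Rank words from least to most neighbor overlap (most changed first).
ranking = sorted(vocab, key=lambda w: neighbor_overlap(w, vocab, vectors_a, vectors_b))
```

In this toy setup "bank" sits next to "money" and "loan" in the first space and next to "river" and "shore" in the second, so it shares no neighbors across spaces and ranks first. The two neighbor lists are also what would make such a result interpretable: they show how the word is used in each corpus.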

Original language: English
Title of host publication: ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 538-555
Number of pages: 18
ISBN (Electronic): 9781952148255
State: Published - 2020
Event: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Virtual, Online, United States
Duration: 5 Jul 2020 to 10 Jul 2020

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020
Country/Territory: United States
City: Virtual, Online
Period: 5/07/20 to 10/07/20

Bibliographical note

Publisher Copyright:
© 2020 Association for Computational Linguistics

Funding

We thank Marianna Apidianiaki for her insightful comments on an earlier version of this work. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT), and from the Israeli Ministry of Science, Technology and Space through the Israeli-French Maimonide Cooperation programme. The second and third authors were partially funded by the French Research Agency projects ParSiTi (ANR-16-CE33-0021) and SoSweet (ANR15-CE38-0011-01), and by the French Ministry of Industry and Ministry of Foreign Affairs via the PHC Maimonide France-Israel cooperation programme.

Funders (with funder number where given)
French Ministry of Industry
Israeli-French Maimonide Cooperation programme
Horizon 2020 Framework Programme
Providence Health Care
European Commission
Agence Nationale de la Recherche: ANR-16-CE33-0021, ANR15-CE38-0011-01
Ministry of Science, Technology and Space
Ministry of Foreign Affairs
Horizon 2020: 802774
