Abstract
We present a word-sense induction method based on pre-trained masked language models (MLMs), which can cheaply scale to large vocabularies and large corpora. The result is a corpus which is sense-tagged according to a corpus-derived sense inventory and where each sense is associated with indicative words. Evaluation on English Wikipedia that was sense-tagged using our method shows that both the induced senses, and the per-instance sense assignment, are of high quality even compared to WSD methods, such as Babelfy. Furthermore, by training a static word embeddings algorithm on the sense-tagged corpus, we obtain high-quality static senseful embeddings. These outperform existing senseful embeddings methods on the WiC dataset and on a new outlier detection dataset we developed. The data driven nature of the algorithm allows to induce corpora-specific senses, which may not appear in standard sense inventories, as we demonstrate using a case study on the scientific domain.
Original language | English |
---|---|
Title of host publication | ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) |
Editors | Smaranda Muresan, Preslav Nakov, Aline Villavicencio |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 4738-4752 |
Number of pages | 15 |
ISBN (Electronic) | 9781955917216 |
State | Published - 2022 |
Event | 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland Duration: 22 May 2022 → 27 May 2022 |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
Volume | 1 |
ISSN (Print) | 0736-587X |
Conference
Conference | 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 22/05/22 → 27/05/22 |
Bibliographical note
Publisher Copyright:© 2022 Association for Computational Linguistics.