Abstract
Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of log-linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we demonstrate that a multiclass log-linear model can be constructed that indirectly recovers the concept in some cases, pointing to the inherent limitations of log-linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.
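As a rough illustration of the kind of linear erasure the abstract refers to, the following is a minimal sketch and not the paper's method. It assumes a binary concept and removes a single direction, the sample cross-covariance between the representation and the concept, by orthogonal projection. The function names (`erase_direction`, `cross_cov`) and the synthetic data are hypothetical. Zeroing this cross-covariance is only a simplified stand-in for the log-linear guardedness the paper defines.

```python
import numpy as np


def cross_cov(X, z):
    """Sample cross-covariance between the columns of X and the labels z."""
    return (X - X.mean(axis=0)).T @ (z - z.mean()) / len(z)


def erase_direction(X, z):
    """Project X onto the orthogonal complement of its cross-covariance with z.

    Assumption: a single rank-one projection, chosen for illustration only.
    """
    w = cross_cov(X, z)                      # direction correlated with z
    w_unit = w / np.linalg.norm(w)
    P = np.eye(X.shape[1]) - np.outer(w_unit, w_unit)  # symmetric projector
    return X @ P


# Synthetic data (hypothetical): the concept is planted along coordinate 0.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500).astype(float)
X = rng.normal(size=(500, 8))
X[:, 0] += 2.0 * z

X_erased = erase_direction(X, z)

# Class-conditional means before and after the projection.
gap_before = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
gap_after = X_erased[z == 1].mean(axis=0) - X_erased[z == 0].mean(axis=0)
print(np.linalg.norm(gap_before), np.linalg.norm(gap_after))
```

After the projection the two class means coincide up to floating-point error, so no linear function of `X_erased` separates the classes by their means. The abstract's point is that a downstream multiclass log-linear model may still recover the concept indirectly in some cases, which this sketch does not test.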
Original language | English |
---|---|
Title of host publication | Long Papers |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 9413-9431 |
Number of pages | 19 |
ISBN (Electronic) | 9781959429722 |
State | Published - 2023 |
Event | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 - Toronto, Canada. Duration: 9 Jul 2023 → 14 Jul 2023 |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
Volume | 1 |
ISSN (Print) | 0736-587X |
Conference
Conference | 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023 |
---|---|
Country/Territory | Canada |
City | Toronto |
Period | 9/07/23 → 14/07/23 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
Funding
We thank Afra Amini, Clément Guerner, David Schneider-Joseph, Nora Belrose and Stella Biderman for their thoughtful comments and revision of this paper. This project received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program, grant agreement No. 802774 (iEXTRACT). Shauli Ravfogel is grateful to be supported by the Bloomberg Data Science Ph.D. Fellowship. Ryan Cotterell acknowledges the Google Research Scholar program for supporting the proposal “Controlling and Understanding Representations through Concept Erasure.”
Funders | Funder number |
---|---|
European Union's Horizon 2020 research and innovation program | 802774 |
European Commission | |