Abstract
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop RUSTY-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release RUSTY-DAWG to facilitate further pretraining data research.
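To make the n-novelty definition above concrete, here is a minimal sketch that computes it with a plain in-memory set. This is not RUSTY-DAWG and does not reflect how the authors implement their search: the function names `ngrams` and `n_novelty`, the whitespace tokenization, and the choice to count every n-gram occurrence in the generated text (rather than distinct n-grams) are all assumptions made for illustration. A set-based lookup needs a separate index for each n and memory that grows with the corpus, which is the kind of limitation an arbitrary-length search tool is meant to avoid.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def n_novelty(generated, corpus, n):
    """Fraction of n-gram occurrences in `generated` that never appear in `corpus`.

    Assumption: occurrences are counted, so a repeated n-gram counts each time.
    """
    seen = set(ngrams(corpus, n))       # naive index, rebuilt for each n
    candidates = ngrams(generated, n)
    if not candidates:                  # text shorter than n has no n-grams
        return 0.0
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data (hypothetical), tokenized on whitespace.
corpus = "the cat sat on the mat".split()
generated = "the cat sat on the rug".split()

for n in (1, 2, 6):
    print(n, n_novelty(generated, corpus, n))
```

On this toy pair, novelty rises with n: one of the five generated bigrams is unseen, while the single 6-gram is entirely novel.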
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
| Editors | Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 14459-14473 |
| Number of pages | 15 |
| ISBN (Electronic) | 9798891761643 |
| State | Published - 2024 |
| Externally published | Yes |
| Event | 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States; Duration: 12 Nov 2024 → 16 Nov 2024 |
Publication series
| Name | EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
|---|---|
Conference
| Conference | 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 |
|---|---|
| Country/Territory | United States |
| City | Hybrid, Miami |
| Period | 12 Nov 2024 → 16 Nov 2024 |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.