Abstract
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
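The token-dropout observation in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper names (`text_to_bytes`, `one_hot`, `dropout`) and the dropout rate of 0.2 are assumptions chosen for illustration, and the model itself is omitted. The sketch shows that a one-hot vector has a single nonzero entry, so element-wise dropout either leaves the token intact (rescaled) or erases it entirely.

```python
import random

NUM_BYTE_TYPES = 256  # every possible byte value is one token type


def text_to_bytes(text):
    """Represent text as a sequence of UTF-8 byte values (0..255)."""
    return list(text.encode("utf-8"))


def one_hot(byte_value):
    """Fixed one-hot vector used in place of a learned embedding (illustrative)."""
    vec = [0.0] * NUM_BYTE_TYPES
    vec[byte_value] = 1.0
    return vec


def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * keep_scale for x in vec]


rng = random.Random(0)  # seeded only to make the sketch repeatable
byte_ids = text_to_bytes("héllo")  # "é" takes two bytes in UTF-8
inputs = [dropout(one_hot(b), p=0.2, rng=rng) for b in byte_ids]

# A token counts as dropped when its only nonzero entry was zeroed out.
dropped = [all(x == 0.0 for x in vec) for vec in inputs]
print(byte_ids)
print(dropped)
```

With a learned dense embedding, the same dropout call would zero only some of a token's dimensions, leaving the token partially visible. With one-hot inputs, each token is removed as a whole with probability `p`.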
| Original language | English |
|---|---|
| Title of host publication | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics |
| Subtitle of host publication | Human Language Technologies, Proceedings of the Conference |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 181-186 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781954085466 |
| State | Published - 2021 |
| Event | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online. Duration: 6 Jun 2021 → 11 Jun 2021 |
Publication series
| Name | NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference |
|---|
Conference
| Conference | 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 |
|---|---|
| City | Virtual, Online |
| Period | 6/06/21 → 11/06/21 |
Bibliographical note
Publisher Copyright: © 2021 Association for Computational Linguistics.
Funding
This work was supported in part by Len Blavatnik and the Blavatnik Family Foundation, the Alon Scholarship, and the Tel Aviv University Data Science Center.
| Funders |
|---|
| Blavatnik Family Foundation |
| Tel Aviv University |