Abstract
Multi-label text classification is a critical task in the industry. It helps to extract structured information from large amount of textual data. We propose Text to Topic (Text2Topic), which achieves high multi-label classification performance by employing a Bi-Encoder Transformer architecture that utilizes concatenation, subtraction, and multiplication of embeddings on both text and topic. Text2Topic also supports zero-shot predictions, produces domain-specific text embeddings, and enables production-scale batch-inference with high throughput. The final model achieves accurate and comprehensive results compared to state-of-the-art baselines, including large language models (LLMs). In this study, a total of 239 topics are defined, and around 1.6 million text-topic pairs annotations (in which 200K are positive) are collected on approximately 120K texts from 3 main data sources on Booking.com. The data is collected with optimized smart sampling and partial labeling. The final Text2Topic model is deployed on a real-world stream processing platform, and it outperforms other models with 92.9% micro mAP, as well as a 75.8% macro mAP score. We summarize the modeling choices which are extensively tested through ablation studies, and share detailed in-production decision-making steps.
Original language | English |
---|---|
Pages | 93-103 |
Number of pages | 11 |
DOIs | |
State | Published - 2023 |
Externally published | Yes |
Event | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore Duration: 6 Dec 2023 → 10 Dec 2023 |
Conference
Conference | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 |
---|---|
Country/Territory | Singapore |
City | Hybrid, Singapore |
Period | 6/12/23 → 10/12/23 |
Bibliographical note
Publisher Copyright:© 2023 Association for Computational Linguistics.
Funding
This work is supported by Booking.com. We would like to thank Satendra Kumar, Selena Wang, Michael Alo, and Guy Nadav for the paper review. We would also like to thank Ilya Gusev on contributing some GPT-3.5 prompting ideas.
Funders | Funder number |
---|---|
Guy Nadav |