Abstract
The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
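The two-level construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the embeddings, topics, and documents are hypothetical toy data, the names `sinkhorn_cost` and `hott` are chosen only for illustration, and an entropy-regularized Sinkhorn iteration stands in for an exact optimal transport solver so that the example needs only the standard library. The sketch assumes that the ground cost between two topics is itself a transport cost between their word distributions.

```python
import math

def sinkhorn_cost(a, b, cost, reg=0.1, n_iter=200):
    """Approximate optimal transport cost between histograms a and b.

    Entropy-regularized Sinkhorn iterations; an approximation that
    stands in for an exact solver (assumption made for brevity).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * K[i][j] * v[j]; return <P, cost>.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Hypothetical 2-D word embeddings for a four-word vocabulary.
embeddings = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
word_cost = [[math.dist(p, q) for q in embeddings] for p in embeddings]

# Hypothetical topics: each is a distribution over the vocabulary.
topics = [[0.6, 0.4, 0.0, 0.0],
          [0.0, 0.0, 0.3, 0.7]]

# Lower level: cost between topics is a transport cost over words.
topic_cost = [[sinkhorn_cost(s, t, word_cost) for t in topics] for s in topics]

def hott(doc_a, doc_b):
    """Upper level: transport between documents on the small topic space."""
    return sinkhorn_cost(doc_a, doc_b, topic_cost)

# Hypothetical documents: each is a distribution over the two topics.
d1, d2, d3 = [0.9, 0.1], [0.1, 0.9], [0.8, 0.2]
print(hott(d1, d3), hott(d1, d2))
```

In this toy setup the topic-to-topic costs are computed once and reused for every document pair, so each pairwise comparison solves a transport problem whose size is the number of topics rather than the vocabulary size. Because of the regularization, the score of a document against itself is small but not exactly zero.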
Original language | English |
---|---|
Journal | Advances in Neural Information Processing Systems |
Volume | 32 |
State | Published - 2019 |
Externally published | Yes |
Event | 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019, Vancouver, Canada (8 Dec 2019 → 14 Dec 2019) |
Bibliographical note
Publisher Copyright: © 2019 Neural Information Processing Systems Foundation. All rights reserved.
Funding
Acknowledgements. J. Solomon acknowledges the generous support of Army Research Office grant W911NF1710068, Air Force Office of Scientific Research award FA9550-19-1-031, National Science Foundation grant IIS-1838071, an Amazon Research Award, the MIT-IBM Watson AI Laboratory, the Toyota-CSAIL Joint Research Center, the QCRI–CSAIL Computer Science Research Program, and a gift from Adobe Systems. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these organizations.
Funders | Funder number |
---|---|
QCRI | |
Toyota-CSAIL Joint Research Center | |
National Science Foundation | IIS-1838071 |
Air Force Office of Scientific Research | FA9550-19-1-031 |
Army Research Office | W911NF1710068 |