Hierarchical optimal transport for document representation

Mikhail Yurochkin, Farzaneh Mirzazadeh, Sebastian Claici, Edward Chien, Justin Solomon

Research output: Contribution to journalConference articlepeer-review

34 Scopus citations

Abstract

The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.

Original languageEnglish
JournalAdvances in Neural Information Processing Systems
Volume32
StatePublished - 2019
Externally publishedYes
Event33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: 8 Dec 201914 Dec 2019

Bibliographical note

Publisher Copyright:
© 2019 Neural information processing systems foundation. All rights reserved.

Fingerprint

Dive into the research topics of 'Hierarchical optimal transport for document representation'. Together they form a unique fingerprint.

Cite this