Scalable evaluation and improvement of document set expansion via neural positive-unlabeled learning

Alon Jacovi, Gang Niu, Yoav Goldberg, Masashi Sugiyama

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

3 Scopus citations

Abstract

We consider the situation in which a user has collected a small set of documents on a cohesive topic, and they want to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query, and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning-i.e., learning binary classifiers from only positive (the query documents) and unlabeled (the results of the IR engine) data. Utilizing PU learning for text with big neural networks is a largely unexplored field. We discuss various challenges in applying PU learning to the setting, showing that the standard implementations of state-of-the-art PU solutions fail. We propose solutions for each of the challenges and empirically validate them with ablation tests. We demonstrate the effectiveness of the new method using a series of experiments of retrieving PubMed abstracts adhering to fine-grained topics, showing improvements over the common IR solution and other baselines.

Original languageEnglish
Title of host publicationEACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference
PublisherAssociation for Computational Linguistics (ACL)
Pages581-592
Number of pages12
ISBN (Electronic)9781954085022
StatePublished - 2021
Event16th Conference of the European Chapter of the Associationfor Computational Linguistics, EACL 2021 - Virtual, Online
Duration: 19 Apr 202123 Apr 2021

Publication series

NameEACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference16th Conference of the European Chapter of the Associationfor Computational Linguistics, EACL 2021
CityVirtual, Online
Period19/04/2123/04/21

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics

Funding

AJ and YG have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT). GN and MS were supported by JST AIP Acceleration Research Grant Number JP-MJCR20U3, Japan.

FundersFunder number
JST AIPJP-MJCR20U3
Horizon 2020 Framework Programme802774
European Commission

    Fingerprint

    Dive into the research topics of 'Scalable evaluation and improvement of document set expansion via neural positive-unlabeled learning'. Together they form a unique fingerprint.

    Cite this