Abstract
Named Entity Recognition (NER) has mostly been studied in the context of written text. In particular, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans containing personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written text. The application of NER in the context of audio de-identification has yet to be fully investigated. To this end, we define the task of audio de-ID, in which audio spans with entity mentions should be detected. We then present our pipeline for this task, which involves Automatic Speech Recognition (ASR), NER on the transcript text, and text-to-audio alignment. Finally, we introduce a novel metric for audio de-ID and a new evaluation benchmark consisting of a large labeled segment of the Switchboard and Fisher audio datasets, and detail our pipeline's results on it.
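The last stage of the pipeline described in the abstract, text-to-audio alignment, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an ASR system that returns word-level timestamps and an NER model that returns character spans over a transcript formed by joining the words with single spaces. The names `Word` and `char_spans_to_audio_spans` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One recognized word with its timing (assumed ASR output format)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def char_spans_to_audio_spans(
    words: List[Word],
    entity_char_spans: List[Tuple[int, int]],
) -> List[Tuple[float, float]]:
    """Map entity character spans in the transcript to audio time spans.

    The transcript is assumed to be the words joined by single spaces.
    Character spans are half-open intervals [start, end).
    """
    # Compute the character offsets of every word in the joined transcript.
    offsets = []
    pos = 0
    for w in words:
        offsets.append((pos, pos + len(w.text)))
        pos += len(w.text) + 1  # +1 for the separating space

    audio_spans = []
    for c_start, c_end in entity_char_spans:
        # Collect every word that overlaps the entity's character span.
        hit = [
            w for w, (w_start, w_end) in zip(words, offsets)
            if w_start < c_end and w_end > c_start
        ]
        if hit:
            # The audio span runs from the first overlapping word's start
            # to the last overlapping word's end.
            audio_spans.append((hit[0].start, hit[-1].end))
    return audio_spans
```

For the transcript "my name is john smith", an entity covering characters 11 to 21 ("john smith") would map to the interval from the start of "john" to the end of "smith". A real system would also have to handle ASR errors and imprecise word boundaries, which this sketch ignores.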
Original language | English |
---|---|
Title of host publication | Industry Papers |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 197-204 |
Number of pages | 8 |
ISBN (Electronic) | 9781950737147 |
State | Published - 2019 |
Externally published | Yes |
Event | 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019 - Minneapolis, United States. Duration: 2 Jun 2019 → 7 Jun 2019 |
Publication series
Name | NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference |
---|---|
Volume | 2 |
Conference
Conference | 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019 |
---|---|
Country/Territory | United States |
City | Minneapolis |
Period | 2/06/19 → 7/06/19 |
Bibliographical note
Publisher Copyright: © 2019 Association for Computational Linguistics.