Abstract
This paper extends ideas from deduplication-based data compression to bioinformatics. We show the approach to be useful for two specific problems: clustering a large set of DNA strings and searching for approximate matches of long substrings. Both rely on the design of what we call an approximate hashing function. The new procedure yields clustering and search results very similar to those obtained by accurate tools, but in much less time and with less memory.
Original language | English |
---|---|
Title of host publication | Implementation and Application of Automata - 25th International Conference, CIAA 2021, Proceedings |
Editors | Sebastian Maneth |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 178-189 |
Number of pages | 12 |
ISBN (Print) | 9783030791209 |
State | Published - 2021 |
Event | 25th International Conference on Implementation and Application of Automata, CIAA 2021 - Virtual, Online. Duration: 19 Jul 2021 → 22 Jul 2021 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12803 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 25th International Conference on Implementation and Application of Automata, CIAA 2021 |
---|---|
City | Virtual, Online |
Period | 19/07/21 → 22/07/21 |
Bibliographical note
Publisher Copyright: © 2021, Springer Nature Switzerland AG.