Abstract
High-throughput sequencing techniques generate large volumes of DNA sequencing data at ultra-fast speed and extremely low cost. As a consequence, sequencing techniques have become ubiquitous in biomedical research and are used in hundreds of genomic applications. Efficient data structures and algorithms have been developed to handle the large datasets produced by these techniques. The prevailing method to index DNA sequences in those data structures and algorithms is by using k-mers (k-long substrings) known as minimizers. Minimizers are the smallest k-mers selected in every consecutive window of a fixed length in a sequence, where the smallest is determined according to a predefined order, e.g., lexicographic. Recently, a new k-mer order based on a universal hitting set (UHS) was suggested. While several studies have shown that orders based on a small UHS have improved properties, the utility of using them in high-throughput sequencing analysis tasks has been demonstrated to date in only one application, k-mer counting. Here, we demonstrate the practical benefit of UHSs in the genome assembly task. Reconstructing a genome from billions of short reads is a fundamental task in high-throughput sequencing analyses. De Bruijn graph construction is a key step in genome assembly, which often requires very large amounts of memory and long computation time. A critical bottleneck lies in the partitioning of DNA sequences into bins. The sequences in each bin are assembled separately, and the final de Bruijn graph is constructed by merging the bin-specific subgraphs. We incorporated a UHS-based order in the bin partition step of the Minimum Substring Partitioning algorithm. Using a UHS-based order instead of lexicographically or randomly ordered minimizers produced lower-density minimizers with more balanced bin partitioning, which led to a reduction in both runtime and memory usage.
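The minimizer selection and bin partitioning described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the Minimum Substring Partitioning code. The function names (`window_minimizers`, `partition_by_minimizer`, `density`, `priority_set_order`) and the parameters `k` (k-mer length) and `w` (window length in bases) are assumptions made for illustration. The `priority_set_order` helper only mimics the idea of a UHS-based order by ranking k-mers from a given set ahead of all others; the set used in the demo is a toy set, not a real universal hitting set.

```python
from collections import defaultdict


def window_minimizers(seq, k, w, key=lambda kmer: kmer):
    """For every window of w consecutive bases, pick its smallest k-mer.

    Returns a list of (window_start, minimizer_position) pairs.
    `key` defines the k-mer order (default: lexicographic);
    ties are broken by taking the leftmost k-mer.
    """
    if not (0 < k <= w <= len(seq)):
        raise ValueError("require 0 < k <= w <= len(seq)")
    picks = []
    for start in range(len(seq) - w + 1):
        # Start positions of all k-mers that lie fully inside this window.
        candidates = range(start, start + w - k + 1)
        pos = min(candidates, key=lambda i: (key(seq[i:i + k]), i))
        picks.append((start, pos))
    return picks


def partition_by_minimizer(seq, k, w, key=lambda kmer: kmer):
    """Group windows into bins keyed by their minimizer k-mer (simplified)."""
    bins = defaultdict(list)
    for start, pos in window_minimizers(seq, k, w, key):
        bins[seq[pos:pos + k]].append(seq[start:start + w])
    return dict(bins)


def density(seq, k, w, key=lambda kmer: kmer):
    """Fraction of k-mer positions in seq that are selected as minimizers."""
    selected = {pos for _, pos in window_minimizers(seq, k, w, key)}
    return len(selected) / (len(seq) - k + 1)


def priority_set_order(priority_kmers):
    """Order that ranks k-mers in `priority_kmers` first, then lexicographically.

    Hypothetical stand-in for a UHS-based order; the caller supplies the set.
    """
    return lambda kmer: (kmer not in priority_kmers, kmer)


if __name__ == "__main__":
    seq, k, w = "ACGTACGT", 2, 4
    print(partition_by_minimizer(seq, k, w))
    # Toy priority set, assumed for illustration only (not a real UHS).
    print(partition_by_minimizer(seq, k, w, priority_set_order({"GT"})))
    print(density(seq, k, w))
```

Passing the order as a `key` function keeps the selection logic unchanged while swapping lexicographic, random, or set-based orders, which is the comparison the abstract describes. Bin balance can then be inspected by comparing the sizes of the lists returned by `partition_by_minimizer`.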
Original language | English |
---|---|
Title of host publication | Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 |
Publisher | Association for Computing Machinery, Inc |
ISBN (Electronic) | 9781450384506 |
State | Published - 18 Jan 2021 |
Externally published | Yes |
Event | 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States Duration: 1 Aug 2021 → 4 Aug 2021 |
Conference
Conference | 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 1/08/21 → 4/08/21 |
Bibliographical note
Publisher Copyright: © 2021 Owner/Author.
Keywords
- assembly
- de Bruijn graph
- minimum substring partitioning
- universal hitting set