Abstract
In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $p$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $p$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space (hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor) and $\tilde{O}(k^8)$ amortized time per character of the text, providing answers correct with high probability. This answers a decade-old question: since the discovery of a $\mathrm{poly}(k \log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k \log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $p$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. Our central technical contribution is a new space-efficient deterministic encoding of two strings, called the greedy encoding, which encodes a set of all alignments of cost at most $k$ with a certain property (we call such alignments greedy). On strings of length at most $n$, the encoding occupies $\tilde{O}(k^2)$ space. We use the encoding to compress substrings of the text that are close to the pattern. In order to do so, we compute the encoding for substrings of the text and of the pattern, which requires read-only access to the latter. In order to develop the fully streaming algorithm, we further introduce a new edit distance sketch parameterized by integers $n > k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$, and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhang [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
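To make the problem statement concrete, the following is a minimal sketch of the classical dynamic-programming baseline (commonly attributed to Sellers), not the streaming algorithm or the greedy encoding described in the abstract. It reads the text one character at a time but keeps a column of $m + 1$ values and needs random access to the whole pattern, so it uses $O(m)$ space and $O(m)$ time per character rather than $\mathrm{poly}(k \log n)$. The function name and the convention of reporting 1-based end positions are assumptions made for illustration.

```python
from typing import List


def approx_match_end_positions(pattern: str, text: str, k: int) -> List[int]:
    """Return 1-based positions j such that some substring of text ending
    at j is within edit distance k of pattern (classical O(m)-space DP)."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring of
    # the text ending at the current position. Before reading any text,
    # only the empty substring is available, so col[i] = i.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value for (i - 1, j - 1); col[0] is always 0
        for i in range(1, m + 1):
            prev_row = col[i]  # value for (i, j - 1)
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                prev_row + 1,                       # extra text character
                col[i - 1] + 1,                     # skipped pattern character
            )
            prev_diag = prev_row
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_match_end_positions("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With `k = 0` the same call reports only exact occurrences; with `k = 1` it also reports the end positions of `ab` and `abcx`, each of which is one edit away from the pattern.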
Original language | English |
---|---|
Title of host publication | Proceedings - 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science, FOCS 2021 |
Publisher | IEEE Computer Society |
Pages | 885-896 |
Number of pages | 12 |
ISBN (Electronic) | 9781665420556 |
DOIs | |
State | Published - 2022 |
Event | 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021 - Virtual, Online, United States. Duration: 7 Feb 2022 → 10 Feb 2022 |
Publication series
Name | Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS |
---|---|
Volume | 2022-February |
ISSN (Print) | 0272-5428 |
Conference
Conference | 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021 |
---|---|
Country/Territory | United States |
City | Virtual, Online |
Period | 7/02/22 → 10/02/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.
Funding
Tomasz Kociumaka was partially supported by NSF 1652303, 1909046, and HDR TRIPODS 1934846 grants, and an Alfred P. Sloan Fellowship. Ely Porat was partially supported by ISF grants no. 1278/16 and 1926/19, by a BSF grant no. 2018364, and by an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Program (grant no. 683064). Tatiana Starikovskaya was partially supported by the ANR-20-CE48-0001 grant from the French National Research Agency (ANR).
Funders | Funder number |
---|---|
National Science Foundation | 1652303, 1909046, 1934846 (HDR TRIPODS) |
Alfred P. Sloan Fellowship | |
Israel Science Foundation | 1278/16, 1926/19 |
United States-Israel Binational Science Foundation (BSF) | 2018364 |
European Research Council (ERC), EU’s Horizon 2020 Research and Innovation Program | 683064 |
Agence Nationale de la Recherche (ANR) | ANR-20-CE48-0001 |
Keywords
- edit distance
- pattern matching
- streaming