TY - UNPB

T1 - Gapped String Indexing in Subquadratic Space and Sublinear Query Time

AU - Bille, Philip

AU - Gørtz, Inge Li

AU - Lewenstein, Moshe

AU - Pissis, Solon P.

AU - Rotenberg, Eva

AU - Steiner, Teresa Anna

N1 - 21 pages, 5 figures

PY - 2022/11/30

Y1 - 2022/11/30

N2 - In Gapped String Indexing, the goal is to compactly represent a string $S$ of length $n$ such that given queries consisting of two strings $P_1$ and $P_2$, called patterns, and an integer interval $[\alpha, \beta]$, called gap range, we can quickly find occurrences of $P_1$ and $P_2$ in $S$ with distance in $[\alpha, \beta]$. Due to the many applications of this fundamental problem in computational biology and elsewhere, there is a great body of work for restricted or parameterised variants of the problem. However, for the general problem statement, no improvements upon the trivial $\mathcal{O}(n)$-space $\mathcal{O}(n)$-query time or $\Omega(n^2)$-space $\mathcal{\tilde{O}}(|P_1| + |P_2| + \mathrm{occ})$-query time solutions were known so far. We break this barrier obtaining interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every $0\leq \delta \leq 1$, there is a data structure for Gapped String Indexing with either $\mathcal{\tilde{O}}(n^{2-\delta/3})$ or $\mathcal{\tilde{O}}(n^{3-2\delta})$ space and $\mathcal{\tilde{O}}(|P_1| + |P_2| + n^{\delta}\cdot (\mathrm{occ}+1))$ query time, where $\mathrm{occ}$ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets $S_1, \ldots, S_k$ of integers such that given queries consisting of three integers $i,j,s$, we can quickly output YES if and only if there exist $a \in S_i$ and $b \in S_j$ with $a+s = b$. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. Via several steps of reduction we then show that the Gapped String Indexing problem reduces to polylogarithmically many instances of the Shifted Set Intersection problem.

KW - cs.DS

U2 - 10.48550/arXiv.2211.16860

DO - 10.48550/arXiv.2211.16860

M3 - Preprint

SP - 21

BT - Gapped String Indexing in Subquadratic Space and Sublinear Query Time

PB - arXiv preprint

ER -