TY - JOUR
T1 - Set Intersection and Sequence Matching with mismatch counting
AU - Shiftan, Ariel
AU - Porat, Ely
N1 - Publisher Copyright:
© 2016 Elsevier B.V.
PY - 2016/7/25
Y1 - 2016/7/25
N2 - In the classical pattern-matching problem, one is given a text and a pattern, both of which are sequences of letters. The requirement is to find all occurrences of the pattern in the text. We studied two modifications of the classical problem, where each letter in the text and pattern is a set (Set Intersection Matching problem) or a sequence (Sequence Matching problem). Two "letters" match if the intersection of the corresponding sets is not empty or, respectively, if the two sequences have a common element at the same index. We first show that the two problems are similar by establishing a linear-time reduction between them. We then show the first known non-trivial and efficient algorithms for these problems when the maximum set/sequence size $d$ is small. The first is a Monte Carlo randomized algorithm for Set Intersection Matching that takes $\Theta(4^d n \log n \log m)$ time, where $n$ and $m$ are the lengths of the text and the pattern, respectively; the failure probability is less than $1/n^2$. This algorithm can also be used, with slight modifications, when up to $k$ mismatches are allowed. In addition, it can be used to maintain an approximation of factor $1 \pm \varepsilon$ of the mismatch count in $\Theta(\frac{1}{\varepsilon^2} 4^d n \log n \log m)$ time; the failure probability is bounded by $1/n$. The second is a deterministic algorithm for Set Intersection Matching that can be used to count the number of matches at each index of the text in a total running time of $\Theta(\sum_{i=1}^{d} \binom{\sigma}{i} n \log m) = O(\sigma^d n \log m)$, where $\sigma$ is the size of the alphabet. The third algorithm, also deterministic, solves the Sequence Matching problem in $\Theta(4^d n \log m)$ time.
AB - In the classical pattern-matching problem, one is given a text and a pattern, both of which are sequences of letters. The requirement is to find all occurrences of the pattern in the text. We studied two modifications of the classical problem, where each letter in the text and pattern is a set (Set Intersection Matching problem) or a sequence (Sequence Matching problem). Two "letters" match if the intersection of the corresponding sets is not empty or, respectively, if the two sequences have a common element at the same index. We first show that the two problems are similar by establishing a linear-time reduction between them. We then show the first known non-trivial and efficient algorithms for these problems when the maximum set/sequence size $d$ is small. The first is a Monte Carlo randomized algorithm for Set Intersection Matching that takes $\Theta(4^d n \log n \log m)$ time, where $n$ and $m$ are the lengths of the text and the pattern, respectively; the failure probability is less than $1/n^2$. This algorithm can also be used, with slight modifications, when up to $k$ mismatches are allowed. In addition, it can be used to maintain an approximation of factor $1 \pm \varepsilon$ of the mismatch count in $\Theta(\frac{1}{\varepsilon^2} 4^d n \log n \log m)$ time; the failure probability is bounded by $1/n$. The second is a deterministic algorithm for Set Intersection Matching that can be used to count the number of matches at each index of the text in a total running time of $\Theta(\sum_{i=1}^{d} \binom{\sigma}{i} n \log m) = O(\sigma^d n \log m)$, where $\sigma$ is the size of the alphabet. The third algorithm, also deterministic, solves the Sequence Matching problem in $\Theta(4^d n \log m)$ time.
KW - Generalized strings
KW - Pattern matching
KW - Sequence Matching
KW - Set Intersection Matching
UR - http://www.scopus.com/inward/record.url?scp=84954306098&partnerID=8YFLogxK
U2 - 10.1016/j.tcs.2016.01.003
DO - 10.1016/j.tcs.2016.01.003
M3 - Article
AN - SCOPUS:84954306098
SN - 0304-3975
VL - 638
SP - 3
EP - 10
JO - Theoretical Computer Science
JF - Theoretical Computer Science
ER -