TY - GEN

T1 - Approximate matching in weighted sequences

AU - Amir, Amihood

AU - Iliopoulos, Costas

AU - Kapah, Oren

AU - Porat, Ely

PY - 2006

Y1 - 2006

N2 - Weighted sequences have been recently introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, where the probability of every symbol's occurrence at every text location is given. We address the problem of approximately matching a pattern in such a weighted sequence. The pattern is a given string and we seek all locations in the set where the pattern occurs with a high enough probability. We define the notion of Hamming distance and edit distance in weighted sequences and give efficient algorithms for computing them. We compute two versions of the Hamming distance in time O(n√m log m), where n is the length of the weighted text and m is the pattern length. The edit distance is computed in time O(nm) and O(nm^2), depending on the edit distance definition used. Unfortunately, due to space considerations, the edit distance details are left to the journal version. We also define the notion of weighted matching in infinite alphabets and show that exact weighted matching can be computed in time O(s log^2 s), where s is the number of text symbols having non-zero probability. The weighted Hamming distance over infinite alphabets can be computed in time min(O(kn√s + s^(3/2) log^2 s), O(s^(4/3) m^(1/3) log s)).

AB - Weighted sequences have been recently introduced as a tool to handle a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, where the probability of every symbol's occurrence at every text location is given. We address the problem of approximately matching a pattern in such a weighted sequence. The pattern is a given string and we seek all locations in the set where the pattern occurs with a high enough probability. We define the notion of Hamming distance and edit distance in weighted sequences and give efficient algorithms for computing them. We compute two versions of the Hamming distance in time O(n√m log m), where n is the length of the weighted text and m is the pattern length. The edit distance is computed in time O(nm) and O(nm^2), depending on the edit distance definition used. Unfortunately, due to space considerations, the edit distance details are left to the journal version. We also define the notion of weighted matching in infinite alphabets and show that exact weighted matching can be computed in time O(s log^2 s), where s is the number of text symbols having non-zero probability. The weighted Hamming distance over infinite alphabets can be computed in time min(O(kn√s + s^(3/2) log^2 s), O(s^(4/3) m^(1/3) log s)).

UR - http://www.scopus.com/inward/record.url?scp=33746090934&partnerID=8YFLogxK

U2 - 10.1007/11780441_33

DO - 10.1007/11780441_33

M3 - Conference contribution

AN - SCOPUS:33746090934

SN - 3540354557

SN - 9783540354550

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 365

EP - 376

BT - Combinatorial Pattern Matching - 17th Annual Symposium, CPM 2006, Proceedings

PB - Springer Verlag

T2 - 17th Annual Symposium on Combinatorial Pattern Matching, CPM 2006

Y2 - 5 July 2006 through 7 July 2006

ER -