Pattern Matching in the Rearrangement Model
Amihood Amir
BAD 2010

Motivation

In the “old” days: pattern and text are given in correct sequential order. The content may be erroneous, hence edit distance.

New paradigm: the content is exact, but the order of the pattern symbols may be scrambled.

Why? Asynchronous transmission (BitTorrent)? The nature of the application?

Example: Swaps

Tehse knids of typing mistakes are very common. So when searching for the pattern “These”, we are seeking the symbols of the pattern, but in an order changed by swaps.

Surprisingly, pattern matching with swaps is easier than pattern matching with mismatches (ACHLP:01).
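To make the swap model concrete, here is a minimal sketch of a naive checker. It assumes the usual reading of "swap": two adjacent pattern symbols exchange places, and each symbol takes part in at most one swap. The function name `swap_match_positions` is hypothetical. This is a brute-force O(nm) scan for illustration, not the efficient algorithm cited above.

```python
def swap_match_positions(text, pattern):
    """Return every index where pattern matches text up to disjoint adjacent swaps."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        i = 0
        while i < m:
            if pattern[i] == text[start + i]:
                i += 1  # symbol is already in place
            elif (i + 1 < m
                  and pattern[i] == text[start + i + 1]
                  and pattern[i + 1] == text[start + i]):
                i += 2  # one swap of neighbours fixes both positions
            else:
                break   # neither an exact match nor a swap works here
        if i == m:
            hits.append(start)
    return hits


# The typo example from the slide: "These" and "kinds" both occur with swaps.
assert swap_match_positions("Tehse knids", "These") == [0]
assert swap_match_positions("Tehse knids", "kinds") == [6]
```

The greedy left-to-right scan is enough here because a swap is only ever useful at a position where the unswapped symbol mismatches.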

Example: Reversals

AAAGGCCCTTTGAGCCC
AAAGAGTTTCCCGGCCC

Given a DNA substring, a piece of it can detach and reverse. This process is still computationally tough.

Question: What is the minimum number of reversals necessary to sort a permutation of 1,…,n?

Global Rearrangements?

Berman & Hannenhalli (1996) called this Global Rearrangement, as opposed to Local Rearrangement (edit distance). Showed it is NP-hard.

Our thesis: this is a special case of errors in the address rather than in the content.

Example: Transpositions

AAAGGCCCTTTGAGCCC
AATTTGAGGCCCAGCCC

Given a DNA substring, a piece of it can be transposed to another area.

Question: What is the minimum number of transpositions necessary to sort a permutation of 1,…,n?

Complexity?

Bafna & Pevzner (1998), Christie (1998), Hartman (2001): polynomial-time 1.5-approximation. It is not known whether the exact minimum is efficiently computable.

This is another special case of errors in the address rather than in the content.

Example: Block Interchanges

AAAGGCCCTTTGAGCCC
AAGTTTAGGCCCAGCCC

Given a DNA substring, two non-empty subsequences can be interchanged.

Question: What is the minimum number of block interchanges necessary to sort a permutation of 1,…,n?

Christie (1996): O(n^2)
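The three operators from the DNA examples (reversal, transposition, block interchange) can be sketched as plain string operations. The function names and the block boundaries below are assumptions: the slides show only the strings before and after each operation, so the indices are simply one choice that reproduces them.

```python
def reversal(s, i, j):
    """Reverse the block s[i:j] in place within the string."""
    return s[:i] + s[i:j][::-1] + s[j:]


def transposition(s, i, j, k):
    """Exchange the two adjacent blocks s[i:j] and s[j:k]."""
    return s[:i] + s[j:k] + s[i:j] + s[k:]


def block_interchange(s, i, j, k, l):
    """Exchange the blocks s[i:j] and s[k:l] (j <= k); s[j:k] stays between them."""
    return s[:i] + s[k:l] + s[j:k] + s[i:j] + s[l:]


dna = "AAAGGCCCTTTGAGCCC"

# Each call reproduces the second string of the corresponding slide.
assert reversal(dna, 4, 13) == "AAAGAGTTTCCCGGCCC"
assert transposition(dna, 2, 8, 12) == "AATTTGAGGCCCAGCCC"
assert block_interchange(dna, 2, 8, 11, 12) == "AAGTTTAGGCCCAGCCC"
```

A transposition is the special case of a block interchange in which the two blocks are adjacent (j equals k).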

Motivation: Architecture. Assume distributed memory. Our processor has text and requests pattern of length m. Pattern arrives in m asynchronous packets, of the form: <symbol, addr> Example: , , , , Pattern: BCBAA
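As a rough illustration of the packet model, the sketch below assembles a pattern from <symbol, addr> packets that arrive in arbitrary order. The packet list, the 0-based addresses, and the function name `assemble` are all hypothetical, chosen only so that the packets assemble to the pattern BCBAA.

```python
def assemble(packets, m):
    """Place each symbol at its address; arrival order does not matter."""
    pattern = [None] * m
    for symbol, addr in packets:
        pattern[addr] = symbol
    return "".join(pattern)


# Hypothetical packets, listed in an arbitrary arrival order.
packets = [("A", 3), ("B", 0), ("A", 4), ("C", 1), ("B", 2)]
assert assemble(packets, 5) == "BCBAA"

# If two addresses are corrupted so that they trade places, the content
# is still exact but the symbols land in the wrong order.
corrupted = [("A", 3), ("B", 0), ("A", 4), ("C", 2), ("B", 1)]
assert assemble(corrupted, 5) == "BBCAA"
```

The second call shows the situation the talk is about: no symbol is wrong, only its address.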

What Happens if Address Bits Have Errors?

In Architecture:
1. Checksums.
2. Error Correcting Codes.
3. Retransmits.

We would like… To avoid extra transmissions.

For every text location compute the minimum number of address errors that can cause a mismatch in this location.

Error in Content: [figure: Bristol University, Bar-Ilan University]

Error in Address: [figure: Bristol University, Bar-Ilan University]

Questions

1. What causes the error? Independent moves, external process, internal process.
2. What is the error cost? LP, UCM, LCM, ECM.
3. How to efficiently find matches? Standard PM techniques are inadequate.

Models map

Pattern Matching: slide pattern along text.
Nearest Neighbor: pattern and text same size.
Permutation (Ulam): no repeating symbols.

Error Cause: External Process

• interchanges
• parallel-interchanges
• transposition of single elements

Error Cost: UCM

Unit-cost model: minimum number of operations.

P = 3 4 1 6 5 2 8 7
T = 1 2 3 4 5 6 7 8
Cost = 4

Error Cost: UCM

Interchange distance of π is m - c(π), where m is the length of π and c(π) is its number of permutation cycles.

Example: 3 6 4 1 7 2 5 has 3 permutation cycles: (1 4 3) (2 6) (5 7). Distance is 7 - 3 = 4.

3641725 → 3614725 → 1634725 → 1234765 → 1234567
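The formula m - c(π) is easy to check by counting cycles directly. The following is a minimal sketch for permutations of 1,…,m given as Python lists; the function names are hypothetical, and fixed points are counted as cycles of length one.

```python
def cycle_count(perm):
    """Count the cycles of a permutation of 1..m (fixed points included)."""
    m = len(perm)
    seen = [False] * m
    cycles = 0
    for start in range(m):
        if not seen[start]:
            cycles += 1
            i = start
            while not seen[i]:  # walk the cycle that contains `start`
                seen[i] = True
                i = perm[i] - 1  # values are 1-based, list indices 0-based
    return cycles


def interchange_distance(perm):
    """Minimum number of interchanges that sort perm: m - c(perm)."""
    return len(perm) - cycle_count(perm)


# The 7-element example: 3 cycles, distance 7 - 3 = 4.
assert cycle_count([3, 6, 4, 1, 7, 2, 5]) == 3
assert interchange_distance([3, 6, 4, 1, 7, 2, 5]) == 4

# The 8-element example P = 3 4 1 6 5 2 8 7: 4 cycles, distance 4.
assert interchange_distance([3, 4, 1, 6, 5, 2, 8, 7]) == 4
```

Each interchange can raise the cycle count by at most one, and the sorted permutation has m cycles, which is where m - c(π) comes from.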

Error Cost: UCM

General strings? Open since 1849!

The graph G_{P,T} = (V, E):
V = {v ∈ Σ : v ∈ P}
E = {(t_i, p_i) : 1 ≤ i ≤ m}

    1 2 3 4 5 6 7 8
T = a d d b a b c c
P = b c b c a d d a

[Figure: the graph G_{P,T} on vertices a, b, c, d, with edges numbered 1 to 8]
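The graph definition can be sketched directly from the two strings. The code below builds the vertex set and the edge list (t_i, p_i) for the example T and P, and checks that every vertex has equal in-degree and out-degree, which holds whenever P is a rearrangement of T. The function names are hypothetical.

```python
from collections import Counter


def build_graph(text, pattern):
    """Return (V, E) with one directed edge (t_i, p_i) per position i."""
    if len(text) != len(pattern):
        raise ValueError("text and pattern must have the same length")
    vertices = set(pattern)
    edges = list(zip(text, pattern))  # edge i (1-based) is edges[i - 1]
    return vertices, edges


def is_balanced(edges):
    """True if every vertex has as many outgoing as incoming edges."""
    out_deg = Counter(t for t, _ in edges)
    in_deg = Counter(p for _, p in edges)
    return out_deg == in_deg


V, E = build_graph("addbabcc", "bcbcadda")
assert V == {"a", "b", "c", "d"}
assert E[0] == ("a", "b") and E[4] == ("a", "a") and len(E) == 8
assert is_balanced(E)
```

Because the graph is balanced, its edges can always be split into edge-disjoint cycles; with repeated symbols there can be more than one way to do so.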

Error Cost: UCM

    1 2 3 4 5 6 7 8
T = a d d b a b c c
P = b c b c a d d a

[Figure: cycle decompositions of the graph for T and P]

NOT UNIQUE

Error Cost: UCM

Results: NP-hard. Fixed-parameter tractable. Linear time 1.5-approximation.

Open: Complexity when symbols appear at most twice?

Error Cost: LCM

Length-cost model: the cost of an operation is its length.

P = 3 4 1 6 5 2 8 7
T = 1 2 3 4 5 6 7 8
Operation lengths: 2, 2, 4, 1
Cost = 9
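Under the length-cost model, the total depends on which interchanges are chosen, not only on how many. The sketch below charges each interchange of positions i and j a cost of |i - j|. The specific interchange sequences are assumptions: the first is merely one sequence that sorts P with lengths 2, 1, 2, 4 and total 9, and may not be the one drawn on the slide.

```python
def apply_interchanges(seq, swaps):
    """Apply interchanges (1-based position pairs); return result and length cost."""
    seq = list(seq)
    total = 0
    for i, j in swaps:
        seq[i - 1], seq[j - 1] = seq[j - 1], seq[i - 1]
        total += abs(i - j)  # length of the interchange
    return seq, total


P = [3, 4, 1, 6, 5, 2, 8, 7]
T = [1, 2, 3, 4, 5, 6, 7, 8]

# Assumed sequence with lengths 2, 1, 2, 4.
result, cost = apply_interchanges(P, [(1, 3), (7, 8), (2, 4), (2, 6)])
assert result == T and cost == 9

# A different four-interchange sequence sorts the same input more cheaply.
result, cost = apply_interchanges(P, [(1, 3), (7, 8), (4, 6), (2, 4)])
assert result == T and cost == 7
```

Both sequences use four interchanges, so they tie under the unit-cost model but not under the length-cost model.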

Error Cost: LCM

Classify length cost models by the law of increasing marginal cost: w ∈ I-type if for a1 […] O(|S|mlog 3 /log c-1m).
• |T| = n, |P| = m = 2^k, faulty bits problem: deterministically O(|S| n m log m).

Conclusion

Error cause: different operators.
Error cost: different cost models.
Overcoming error: different techniques.

Different story.

Credits…

Pattern Matching with Address Errors: Rearrangement Distances [SODA, 2006]: A, Aumann, Benson, Lipsky, Porat, Skiena, Vishne.
On the Cost of Interchange Rearrangement in Strings [ESA, 2007]: A, Hartman, Kapah, Porat.
Efficient Computations of L1 and L∞ Rearrangement Distances [SPIRE, 2007]: A, Aumann, Indyk, Porat.
Approximate String Matching with Address Bit Errors [CPM, 2008]: A, Aumann, Kapah, Porat.
Interchange Rearrangement: the Element Cost Model [SPIRE, 2008]: Landau, Kapah, Oz.
Approximate String Matching with Stuck Bit Errors [manuscript]: A, Eisenberg, Keller, Porat.

Thank You