Consequences of Faster Alignment of Sequences

CONSEQUENCES OF FASTER ALIGNMENT OF SEQUENCES

AMIR ABBOUD

STANFORD

VIRGINIA VASSILEVSKA WILLIAMS

STANFORD

OREN WEIMANN

UNIVERSITY OF HAIFA

Thursday, July 10th, 2014

FASTER ALGORITHMS?
Some classic problems on sequences have near-linear, $\tilde{O}(n)$, algorithms:
✓ Exact Pattern Matching
✓ Pattern Matching with don't cares
✓ Longest Common Substring
While other classic problems don't have $O(n^{2-\varepsilon})$ algorithms:
➢ Local Alignment
➢ Edit Distance
➢ Longest Common Subsequence (LCS)
Isn't quadratic time efficient enough?

LOCAL ALIGNMENT
$O(n^2)$ is not that efficient…
Input: Two (DNA) sequences of length $n$.
AGCCCGTCTACGTGCAACCGGGGAAAGTATA
AAACGTGACGAGAGAGAGAACCCATTACGAA
Output: The optimal alignment of two substrings.
C     C     G     -     T     C     T     A     C     G
C     C     C     A     T     -     T     A     C     G
+1    +1    -0.5  -1    +1    -1    +1    +1    +1    +1    = +4.5

Scoring matrix:
      A      C      G      T      -
A     +1     -1.4   -1.8   -0.7   -1
C     -1.4   +1     -0.5   -1     -1
G     -1.8   -0.5   +1     -1.9   -1
T     -0.7   -1     -1.9   +1     -1
-     -1     -1     -1     -1     -∞

Solved daily on huge sequences: $n = 3 \cdot 10^9$ for the human genome.
Algorithms:
Smith-Waterman dynamic programming: $O(n^2)$.
Compression tricks: $O(n^2 / \log n)$.
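The quadratic dynamic program named above can be sketched as follows. This is a minimal illustration, assuming a linear gap penalty read off the gap row and column of the slide's scoring matrix; the names `SCORE`, `score_alignment`, and `smith_waterman` are hypothetical and chosen only for this example.

```python
# Scoring matrix from the slide; "-" is the gap symbol.
ALPHABET = "ACGT-"
ROWS = [
    [1.0, -1.4, -1.8, -0.7, -1.0],
    [-1.4, 1.0, -0.5, -1.0, -1.0],
    [-1.8, -0.5, 1.0, -1.9, -1.0],
    [-0.7, -1.0, -1.9, 1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0, float("-inf")],
]
SCORE = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(ALPHABET)
    for j, b in enumerate(ALPHABET)
}


def score_alignment(top, bottom):
    """Sum the column scores of an explicit alignment (equal-length rows)."""
    return sum(SCORE[a, b] for a, b in zip(top, bottom))


def smith_waterman(s, t):
    """Best local alignment score of s and t in O(len(s) * len(t)) time."""
    best = 0.0
    prev = [0.0] * (len(t) + 1)
    for a in s:
        cur = [0.0]
        for j, b in enumerate(t, start=1):
            cell = max(
                0.0,                         # start a fresh alignment here
                prev[j - 1] + SCORE[a, b],   # align a with b
                prev[j] + SCORE[a, "-"],     # a against a gap
                cur[j - 1] + SCORE["-", b],  # b against a gap
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Calling `score_alignment("CCG-TCTACG", "CCCAT-TACG")` reproduces the +4.5 of the slide's example alignment. The table has one cell per pair of positions, which is where the $n^2$ comes from.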

LOCAL ALIGNMENT
When $n = 3 \cdot 10^9$, $O(n^2 / \log n)$ is too slow!
In practice? Heuristics.
Most cited paper in the 90s: BLAST (Basic Local Alignment Search Tool), a heuristic algorithm for Local Alignment.
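A rough back-of-the-envelope check of "too slow" is sketched below. It assumes a machine doing $10^9$ elementary operations per second and ignores constant factors; both are assumptions made for illustration only, and `estimated_seconds` is a hypothetical helper.

```python
import math


def estimated_seconds(operations, ops_per_second=1e9):
    """Convert an operation count into seconds at an assumed machine speed."""
    return operations / ops_per_second


n = 3e9  # length of the human genome, as on the slide

quadratic_with_log_saving = n**2 / math.log2(n)  # ~ n^2 / log n operations
near_linear = n * math.log2(n)                   # ~ n log n operations

SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_quadratic = estimated_seconds(quadratic_with_log_saving) / SECONDS_PER_YEAR
seconds_near_linear = estimated_seconds(near_linear)
```

Under these assumptions the $n^2 / \log n$ bound corresponds to roughly nine years of computation, while an $n \log n$ algorithm would finish in under two minutes.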

Can we find an $O(n \log n)$ algorithm? (That would probably be efficient…) How about $O(n^{1.5})$ or even $O(n^{1.8})$?

Today: Theoretical evidence that the answer is “no”!

HARDNESS FOR EASY PROBLEMS
How can we prove that a problem requires ~$n^2$ time?

Prove NP-hardness? Too coarse: it cannot separate $O(n^4)$ from $O(n \log n)$.

Unconditional lower bounds? No superlinear bounds are known.

Lower bounds for classes of algorithms? Not a complete answer.

IDEA: REDUCTIONS
Theorem: Problem X is NP-hard.

X is in P ⟹ every NP-complete problem is in P.

Conclusion: X is probably not in P…

OUR APPROACH
A surprising algorithm for problem Y ⟹ unexpected breakthroughs in different areas.

Conclusion: Such an algorithm is unlikely…
A refined version of NP-hardness…

MAIN RESULT
“Theorem”: If Local Alignment can be solved in $n^{1.99}$ time, then:
➢ 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture
➢ CNF-SAT can be solved in $1.99^n$ time!
➢ Max-4-Clique can be solved in $n^{3.99}$ time!

3SUM
Most famous example of this approach.
Input: A list of $n$ numbers: -15, -6, 33, 8, 1, -21, 4, -30, 7, …, 107
Output: Are there 3 numbers that sum to 0?
Trivial: $O(n^3)$, Simple: $O(n^2)$, Best: $O(n^2 / \log^2 n)$
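One standard way to reach the "simple" $O(n^2)$ bound is to sort and then scan with two pointers. The sketch below is an illustration of that textbook technique, not necessarily the algorithm the slide has in mind; `three_sum` is a hypothetical name.

```python
def three_sum(nums):
    """Return three entries (at distinct positions) summing to 0, or None.

    Sorting costs O(n log n); the two-pointer scan for each fixed first
    element is O(n), so the total is O(n^2).
    """
    a = sorted(nums)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = a[i] + a[lo] + a[hi]
            if total == 0:
                return (a[i], a[lo], a[hi])
            if total < 0:
                lo += 1   # sum too small: move to a larger middle element
            else:
                hi -= 1   # sum too large: move to a smaller last element
    return None
```

On the visible numbers of the slide's list, `three_sum([-15, -6, 33, 8, 1, -21, 4, -30, 7, 107])` finds a triple, for instance $-15 + 7 + 8 = 0$.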

[STOC '10: Patrascu] The 3SUM Conjecture: 3SUM cannot be solved in $O(n^{2-\varepsilon})$ time for any $\varepsilon > 0$.
[Gajentaan – Overmars '95] and many others:
➢ A long list of 3SUM-hard problems.

3SUM-HARD PROBLEMS
The 3SUM Conjecture: 3SUM cannot be solved in $O(n^{2-\varepsilon})$ time for any $\varepsilon > 0$.

The 3SUM conjecture implies the following lower bounds:

Computational Geometry
[C.G. '95: Gajentaan – Overmars]
➢ 3-Points-On-A-Line requires $n^{2-o(1)}$ time.
[SODA '01: Barequet – Har-Peled]
➢ Polygon Containment requires $n^{2-o(1)}$ time.

Graph Algorithms
[STOC '10: Patrascu] and [STOC '09: Vassilevska – Williams]
➢ Zero-Triangle requires $n^{3-o(1)}$ time.
[ICALP '13: A. – Lewi]
➢ Zero-4-Path requires $n^{3-o(1)}$ time.

Stringology
[ICALP '14: Amir – Chan – Lewenstein – Lewenstein]
➢ A lower bound for Jumbled Pattern Matching.

MAIN RESULT
“Theorem”: If Local Alignment can be solved in $n^{1.99}$ time, then:
➢ 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture
➢ CNF-SAT can be solved in $1.99^n$ time! Refuting the Strong Exponential Time Hypothesis (SETH)
➢ Max-4-Clique can be solved in $n^{3.99}$ time!

THE STRONG EXPONENTIAL TIME HYPOTHESIS
Very useful for proving lower bounds…

CNF-SAT: Given a CNF formula on $n$ variables and $m$ clauses, is it satisfiable?

['01: Impagliazzo – Paturi – Zane] The Strong Exponential Time Hypothesis (SETH): “CNF-SAT cannot be solved in $(2-\varepsilon)^n \, \mathrm{poly}(m)$ time.”

There are faster algorithms for k-SAT, but their running times approach ~$2^n$ as k grows.

SETH HARDNESS
The Strong Exponential Time Hypothesis (SETH): “CNF-SAT cannot be solved in $2^{(1-\varepsilon)n} \, \mathrm{poly}(m)$ time.”

Theorem(s): The SETH implies the following lower bounds:
[SODA '10: Patrascu – Williams]
➢ k-Dominating-Set requires $n^{k-o(1)}$ time.
[STOC '13: Roditty – Vassilevska Williams]
➢ A $(3/2 - \varepsilon)$-approximation for the diameter requires $(mn)^{1-o(1)}$ time.
[FOCS '14: A. – Vassilevska Williams]
➢ Dynamic Reachability requires $m^{1-o(1)}$ amortized update time.
[FOCS '14: Bringmann]
➢ Computing the Fréchet distance requires $n^{2-o(1)}$ time.

MAIN RESULT
“Theorem”: If Local Alignment can be solved in $n^{1.99}$ time, then:
➢ 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture (Computational Geometry)
➢ CNF-SAT can be solved in $1.99^n$ time! Refuting the Strong Exponential Time Hypothesis (SETH) (Satisfiability Algorithms)
➢ Max-4-Clique can be solved in $n^{3.99}$ time! A longstanding open problem (Graph Algorithms)

Bottom line: Local Alignment probably requires ~$n^2$ time to solve optimally, and we should settle for heuristics in practice…

PLAN
• Motivation
• Main Results
• Other Results
• Proof examples:
  • CNF-SAT to LCS*
  • Sketch: 3-SUM to Local Alignment
• Open problems

MORE RESULTS
The conjectures imply tight lower bounds for:
➢ Edit Distance with gap penalties
➢ Normalized LCS
➢ Multiple Local Alignment
➢ Partial Match
➢ LCS*
The simplest problem that requires $n^{2-o(1)}$ time?

LCS*
The Longest Common Substring with don't cares problem (LCS*)
Input: Two strings of length $n$, containing don't care characters * (each * matches any single character).
S = RESEARCH_P*P*RS_ARE_*OOL
T = GO*GLE_SE*R*H_IS_U*EFUL
Output: The longest common substring.
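For reference, the straightforward quadratic algorithm for LCS* can be sketched as a diagonal dynamic program. This is a minimal illustration that assumes a don't care in either string matches any character; `chars_match` and `lcs_star` are hypothetical names.

```python
def chars_match(a, b):
    """Two characters match if they are equal or either is a don't care."""
    return a == b or a == "*" or b == "*"


def lcs_star(s, t):
    """Length of the longest common substring with don't cares.

    cur[j] is the length of the longest matching run that ends at the
    current character of s and at t[j - 1]; O(len(s) * len(t)) time.
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0] * (len(t) + 1)
        for j, b in enumerate(t, start=1):
            if chars_match(a, b):
                cur[j] = prev[j - 1] + 1  # extend the run on this diagonal
                best = max(best, cur[j])
        prev = cur
    return best
```

On the slide's strings, `SEARCH_` in S lines up with `SE*R*H_` in T, so `lcs_star` reports a length of at least 7 there.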

Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!

CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!

Proof: an $O(n^{2-\varepsilon})$ algorithm for LCS* ⟹ a $2^{(1-\varepsilon/2)n}$ algorithm for CNF-SAT.

Given a CNF formula with $m$ clauses $C_1, \ldots, C_m$:
$\varphi(x_1, \ldots, x_n) = (\neg x_1 \vee x_{17} \vee \cdots \vee x_{10}) \wedge \cdots \wedge (x_2 \vee x_5 \vee x_{21})$

Split the variables and enumerate over partial assignments:
$U_1 = \{x_1, \ldots, x_{n/2}\}$, with $\alpha = (x_1 = T,\ x_2 = F,\ \ldots,\ x_{n/2} = T)$
$U_2 = \{x_{n/2+1}, \ldots, x_n\}$, with $\beta = (x_{n/2+1} = F,\ x_{n/2+2} = F,\ \ldots,\ x_n = T)$

There are $N = 2^{n/2}$ such $\alpha$'s and $\beta$'s.
Goal of the algorithm: find a pair such that $(\alpha \cdot \beta)$ satisfies $\varphi$.

CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!

Proof: an $O(n^{2-\varepsilon})$ algorithm for LCS* ⟹ a $2^{(1-\varepsilon/2)n}$ algorithm for CNF-SAT.

$\varphi$ is satisfiable ⟺ $\exists \alpha, \beta : \forall C_i : \alpha \cdot \beta \text{ sat } C_i$

Idea: construct strings $S, T$ of length ~$2^{n/2} m$ such that
LCS*$(S, T) = m$ ⟺ $\exists \alpha, \beta : \forall C_i : \alpha \cdot \beta \text{ sat } C_i$

Done: running the LCS* algorithm on these strings takes $(2^{n/2} m)^{2-\varepsilon} = 2^{(1-\varepsilon/2)n} \, m^{2-\varepsilon} = 2^{(1-\varepsilon/2)n} \, \mathrm{poly}(m)$ time, which gives an algorithm for CNF-SAT.

CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!
Proof: Construct strings $S, T$ of length $O(2^{n/2} m)$ such that
LCS*$(S, T) = m$ ⟺ $\exists \alpha, \beta : \forall C_i : \alpha \cdot \beta \text{ sat } C_i$

Define strings of length $m$:
$T_\alpha = 0 * * 0 * \cdots 0$, where $T_\alpha[i] = *$ if $\alpha$ sat $C_i$, and $T_\alpha[i] = 0$ otherwise.
$S_\beta = 0\,0\,1\,0\,1 \cdots 0$, where $S_\beta[i] = 0$ if $\beta$ sat $C_i$, and $S_\beta[i] = 1$ otherwise.

Then: $T_\alpha \equiv S_\beta$ ⟺ $\forall C_i : \alpha \cdot \beta \text{ sat } C_i$
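The gadget equivalence can be checked on small formulas with the sketch below. It builds $T_\alpha$ and $S_\beta$ exactly as defined on the slide, but compares all pairs by brute force rather than through an LCS* algorithm, and it does not build the concatenated strings $S$ and $T$. The clause encoding (signed integers, as in DIMACS files) and all function names are assumptions made for this illustration.

```python
from itertools import product


def satisfies(assignment, clause):
    """True if the partial assignment makes some literal of the clause true.

    A clause is a list of nonzero ints: k means x_k, -k means NOT x_k.
    Variables missing from the assignment never satisfy a literal.
    """
    return any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)


def half_assignments(variables):
    """Enumerate all truth assignments to the given variables."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))


def t_gadget(alpha, clauses):
    """T_alpha[i] = '*' if alpha satisfies C_i, else '0'."""
    return "".join("*" if satisfies(alpha, c) else "0" for c in clauses)


def s_gadget(beta, clauses):
    """S_beta[i] = '0' if beta satisfies C_i, else '1'."""
    return "".join("0" if satisfies(beta, c) else "1" for c in clauses)


def matches(t, s):
    """Position-wise match where '*' in t matches any character."""
    return all(a == "*" or a == b for a, b in zip(t, s))


def satisfiable_via_gadgets(n, clauses):
    """Decide satisfiability by looking for a matching gadget pair."""
    u1 = list(range(1, n // 2 + 1))
    u2 = list(range(n // 2 + 1, n + 1))
    t_list = [t_gadget(a, clauses) for a in half_assignments(u1)]
    s_list = [s_gadget(b, clauses) for b in half_assignments(u2)]
    return any(matches(t, s) for t in t_list for s in s_list)
```

The all-pairs loop at the end takes about $N^2 = 2^n$ comparisons; the point of the reduction is that a subquadratic LCS* algorithm would replace exactly this step.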

Construct $S, T$ in $O(2^{n/2} m)$ time:

$T = \cdots T_{\alpha_1} \cdots \$ \cdots T_{\alpha_2} \cdots \$ \cdots \$ \cdots T_{\alpha_N} \cdots$
$S = \cdots S_{\beta_1} \cdots \# \cdots S_{\beta_2} \cdots \# \cdots \# \cdots S_{\beta_N} \cdots$

∎

3-SUM TO LOCAL ALIGNMENT
3-SUM on $n$ numbers: -15, -6, 33, -30, 7, …, 107, with each $x \in [-n^3, n^3]$. ($\Sigma \sim n^3$?)

[ESA '14: A. – Lewi – Williams]
$n^{o(1)}$ instances of 3-Vector-SUM on $n$ vectors $v_x = (x_1, \ldots, x_d)$, with $x_i \in [-\log n, \log n]$ and $d = O(\log n / \log\log n)$. ($\Sigma \sim \log n$)
$\exists v_a, v_b, v_c : v_a + v_b + v_c = (0, \ldots, 0)$?

Hashing… ($\Sigma \sim n^{\varepsilon} \log n$)
Define substrings of length $d$: $S_x = [\ldots, (h(x), x_i), \ldots]$
$\Sigma$ contains pairs $(h(x), x_i)$.

3-SUM TO LOCAL ALIGNMENT
Define substrings of length $d$: $S_x = [\ldots, (h(x), x_i), \ldots]$
$\Sigma$ contains pairs $(h(x), x_i)$.

Our scoring matrix enforces that:
$(h(x), x_i)$ and $(h(y), y_i)$ will “match” iff $x_i + y_i + z_i = 0$, where $z$ is determined by $h(x), h(y)$.

CONCLUSION
The reductions explain the lack of progress and prove that new ideas are required for faster algorithms.
“An opportunity to solve many famous open problems while working on your favorite problem!”

• Subquadratic Edit Distance?
• Subquadratic LCS?
• Subcubic Protein Folding?
• Subcubic Tree Edit Distance?

Thank You! Questions?