CONSEQUENCES OF FASTER ALIGNMENT OF SEQUENCES
AMIR ABBOUD
STANFORD
VIRGINIA VASSILEVSKA WILLIAMS
STANFORD
OREN WEIMANN
UNIVERSITY OF HAIFA
Thursday, July 10th, 2014
FASTER ALGORITHMS?
Some classic problems on sequences have $O(n)$ algorithms:
✓ Exact Pattern Matching
✓ Pattern Matching with don't cares
✓ Longest Common Substring
While other classic problems don't have $O(n^{2-\varepsilon})$ algorithms:
✗ Local Alignment
✗ Edit Distance
✗ Longest Common Subsequence (LCS)
Isn't quadratic time efficient enough?
LOCAL ALIGNMENT
$O(n^2)$ is not that efficient…
Input: Two (DNA) sequences of length $n$.
  AGCCCGTCTACGTGCAACCGGGGAAAGTATA
  AAACGTGACGAGAGAGAGAACCCATTACGAA
Output: The optimal alignment of two substrings.
  C    C    G    -    T    C    T    A    C    G
  C    C    C    A    T    -    T    A    C    G
  +1   +1  -0.5  -1   +1   -1   +1   +1   +1   +1   = +4.5
Scoring matrix:
        A     C     G     T     -
  A    +1   -1.4  -1.8  -0.7   -1
  C   -1.4   +1   -0.5   -1    -1
  G   -1.8  -0.5   +1   -1.9   -1
  T   -0.7   -1   -1.9   +1    -1
  -    -1    -1    -1    -1    -∞
Solved daily on huge sequences: $n = 3 \cdot 10^9$ for the human genome.
Algorithms: Smith-Waterman dynamic programming, $O(n^2)$. Compression tricks, $O(n^2 / \log n)$.
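The quadratic Smith-Waterman recurrence can be sketched roughly as follows, using the scoring matrix from the slide. This is only an illustration: the names `SCORE` and `local_alignment_score` are assumptions introduced here, and scoring every gap column independently at -1 is one reading of the matrix, not something the talk spells out.

```python
import math

ALPHABET = "ACGT-"  # '-' is the gap symbol
# Scoring matrix copied from the slide (rows and columns in ALPHABET order).
ROWS = [
    [+1.0, -1.4, -1.8, -0.7, -1.0],
    [-1.4, +1.0, -0.5, -1.0, -1.0],
    [-1.8, -0.5, +1.0, -1.9, -1.0],
    [-0.7, -1.0, -1.9, +1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0, -math.inf],
]
SCORE = {
    (a, b): ROWS[i][j]
    for i, a in enumerate(ALPHABET)
    for j, b in enumerate(ALPHABET)
}


def local_alignment_score(s, t):
    """Best score of aligning some substring of s with some substring of t.

    Runs in O(len(s) * len(t)) time and O(len(t)) space.
    """
    prev = [0.0] * (len(t) + 1)
    best = 0.0
    for a in s:
        cur = [0.0]
        for j, b in enumerate(t, start=1):
            cell = max(
                0.0,                         # start a fresh alignment here
                prev[j - 1] + SCORE[a, b],   # align a with b
                prev[j] + SCORE[a, "-"],     # a against a gap
                cur[j - 1] + SCORE["-", b],  # b against a gap
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

For example, `local_alignment_score("CCGTCTACG", "CCCATTACG")` is at least 4.5, since the alignment shown on the slide is one of the candidates the recurrence considers.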
LOCAL ALIGNMENT
When $n = 3 \cdot 10^9$, $O(n^2 / \log n)$ is too slow!
In practice? Heuristics.
Most cited paper in the 90s: BLAST (Basic Local Alignment Search Tool), a heuristic algorithm for Local Alignment.
Can we find an $O(n \log n)$ algorithm? (That would probably be efficient…)
How about $O(n^{1.5})$ or even $O(n^{1.8})$?
Today: Theoretical evidence that the answer is "no"!
HARDNESS FOR EASY PROBLEMS
How can we prove that a problem requires ~$n^2$ time?
• Prove NP-hardness? It does not separate $O(n^2)$ from $O(n \log n)$.
• Unconditional lower bounds? No superlinear bounds.
• Lower bounds for classes of algorithms? Not a complete answer.
IDEA: REDUCTIONS
Theorem: Problem X is NP-hard, that is: X is in P $\Rightarrow$ every NP-complete problem is in P.
Conclusion: X is probably not in P…
OUR APPROACH
A surprising algorithm for problem Y $\Rightarrow$ unexpected breakthroughs in different areas.
Conclusion: Such an algorithm is unlikely…
A refined version of NP-hardness…
MAIN RESULT
"Theorem": If Local Alignment can be solved in $n^{1.99}$ time, then:
• 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture.
• CNF-SAT can be solved in $1.99^n$ time!
• Max-4-Clique can be solved in $n^{3.99}$ time!
3SUM
Most famous example of this approach.
Input: A list of $n$ numbers: -15, -6, 33, 8, 1, -21, 4, -30, 7, …, 107
Output: Are there 3 numbers that sum to 0?
Trivial: $O(n^3)$, Simple: $O(n^2)$, Best: $O(n^2 / \log^2 n)$
[STOC '10: Patrascu] The 3SUM Conjecture: 3SUM cannot be solved in $O(n^{2-\varepsilon})$ time for any $\varepsilon > 0$.
[Gajentaan, Overmars '95] and many others: a long list of 3SUM-hard problems.
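The "simple" $O(n^2)$ bound can be reached by sorting and then scanning with two pointers. The sketch below is one standard way to do this; the function name `three_sum` is an assumption for illustration, and the sample input uses only the numbers visible on the slide (the elided ones are left out).

```python
def three_sum(nums):
    """Return three entries (at distinct positions) that sum to 0, or None.

    Sorting costs O(n log n); the two-pointer scan is O(n) per choice of the
    smallest element, so the total running time is O(n^2).
    """
    a = sorted(nums)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = a[i] + a[lo] + a[hi]
            if total == 0:
                return (a[i], a[lo], a[hi])
            if total < 0:
                lo += 1   # sum too small: move the left pointer right
            else:
                hi -= 1   # sum too large: move the right pointer left
    return None


visible_numbers = [-15, -6, 33, 8, 1, -21, 4, -30, 7, 107]
print(three_sum(visible_numbers))  # (-15, 7, 8)
```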
3SUM-HARD PROBLEMS
The 3SUM Conjecture: 3SUM cannot be solved in $O(n^{2-\varepsilon})$ time for any $\varepsilon > 0$.
The 3SUM conjecture implies the following lower bounds:
Computational Geometry
• [C.G. '95: Gajentaan, Overmars] 3-Points-On-A-Line requires $n^{2-o(1)}$ time.
• [SODA '01: Barequet, Har-Peled] Polygon Containment requires $n^{2-o(1)}$ time.
Graph Algorithms
• [STOC '10: Patrascu] and [STOC '09: Vassilevska, Williams] Zero-Triangle requires $n^{3-o(1)}$ time.
• [ICALP '13: A., Lewi] Zero-4-Path requires $n^{3-o(1)}$ time.
Stringology
• [ICALP '14: Amir, Chan, Lewenstein, Lewenstein] A lower bound for Jumbled Pattern Matching.
MAIN RESULT
"Theorem": If Local Alignment can be solved in $n^{1.99}$ time, then:
• 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture.
• CNF-SAT can be solved in $1.99^n$ time! Refuting the Strong Exponential Time Hypothesis (SETH).
• Max-4-Clique can be solved in $n^{3.99}$ time!
THE STRONG EXPONENTIAL TIME HYPOTHESIS
Very useful for proving lower bounds…
CNF-SAT: Given a CNF formula on $n$ variables and $m$ clauses, is it satisfiable?
['01: Impagliazzo, Paturi, Zane] The Strong Exponential Time Hypothesis (SETH): "CNF-SAT cannot be solved in $(2-\varepsilon)^n \,\mathrm{poly}(m)$ time."
There are faster algorithms for k-SAT, but their running times approach ~$2^n$ as $k$ grows.
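For reference, the exhaustive-search baseline that SETH says cannot be improved to $(2-\varepsilon)^n \,\mathrm{poly}(m)$ looks roughly like this. The clause encoding (signed integers, as in the DIMACS convention) and the function name `brute_force_sat` are assumptions made for this sketch.

```python
from itertools import product


def brute_force_sat(n, clauses):
    """Try all 2^n assignments; return a satisfying one or None.

    clauses: list of clauses, each a list of nonzero ints, where literal v
    means x_v and -v means NOT x_v (variables are numbered 1..n).
    Each assignment is checked in time polynomial in the formula size.
    """
    for bits in product([False, True], repeat=n):
        if all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return bits
    return None


# (x1 OR x2) AND (NOT x1) is satisfied by x1 = False, x2 = True.
print(brute_force_sat(2, [[1, 2], [-1]]))  # (False, True)
```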
SETH HARDNESS
The Strong Exponential Time Hypothesis (SETH): "CNF-SAT cannot be solved in $2^{(1-\varepsilon)n} \,\mathrm{poly}(m)$ time."
Theorem(s): The SETH implies the following lower bounds:
• [SODA '10: Patrascu, Williams] k-Dominating-Set requires $n^{k-o(1)}$ time.
• [STOC '13: Roditty, Vassilevska Williams] A $(3/2 - \varepsilon)$-approximation for the diameter requires $(mn)^{1-o(1)}$ time.
• [FOCS '14: A., Vassilevska Williams] Dynamic Reachability requires $m^{1-o(1)}$ amortized update time.
• [FOCS '14: Bringmann] Computing the Fréchet distance requires $n^{2-o(1)}$ time.
MAIN RESULT
"Theorem": If Local Alignment can be solved in $n^{1.99}$ time, then:
• 3-SUM can be solved in $n^{1.99}$ time! Refuting the 3-SUM conjecture. (Computational Geometry)
• CNF-SAT can be solved in $1.99^n$ time! Refuting the Strong Exponential Time Hypothesis (SETH). (Satisfiability Algorithms)
• Max-4-Clique can be solved in $n^{3.99}$ time! A longstanding open problem. (Graph Algorithms)
Bottom line: Local Alignment probably requires ~$n^2$ time to solve optimally, and we should settle for heuristics in practice…
PLAN
• Motivation
• Main Results
• Other Results
• Proof examples:
  • CNF-SAT to LCS*
  • Sketch: 3-SUM to Local Alignment
• Open problems
MORE RESULTS
The conjectures imply tight lower bounds for:
• Edit Distance with gap penalties
• Normalized LCS
• Multiple Local Alignment
• Partial Match
• LCS*
The simplest problem that requires $n^{2-o(1)}$ time?
LCS*
The Longest Common Substring with don't cares problem (LCS*)
Input: Two strings of length $n$, containing don't care characters *.
  S = RESEARCH_P*P*RS_ARE_*OOL
  T = GO*GLE_SE*R*H_IS_U*EFUL
Output: The longest common substring.
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!
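The quadratic upper bound that this theorem matches can be sketched as a scan over all relative shifts of the two strings. A common substring with don't cares is a contiguous run of positions on one diagonal where the characters agree or at least one is `*`. The function name `lcs_star` and the `wildcard` parameter are assumptions for illustration.

```python
def lcs_star(s, t, wildcard="*"):
    """Length of the longest common substring, where `wildcard` matches anything.

    Each shift fixes a diagonal i - j; the longest run of matching positions
    on that diagonal is found in one pass, for O(len(s) * len(t)) time overall.
    """
    best = 0
    for shift in range(-(len(t) - 1), len(s)):
        i, j = max(shift, 0), max(-shift, 0)
        run = 0
        while i < len(s) and j < len(t):
            if s[i] == t[j] or s[i] == wildcard or t[j] == wildcard:
                run += 1
                best = max(best, run)
            else:
                run = 0
            i += 1
            j += 1
    return best


print(lcs_star("ab*d", "xbcd"))  # 3: "b*d" matches "bcd"
```

The same function can be applied to the strings S and T from the slide.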
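CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!
Proof: an $O(n^{2-\varepsilon})$ algorithm for LCS* $\Rightarrow$ a $2^{(1-\varepsilon/2)n}$ algorithm for CNF-SAT (on the left $n$ is the string length; on the right it is the number of variables).
Given a CNF formula with $m$ clauses $C_1, \ldots, C_m$:
  $\varphi(x_1, \ldots, x_n) = (\neg x_1 \vee x_{17} \vee \cdots \vee x_{10}) \wedge \cdots \wedge (x_2 \vee x_5 \vee x_{21})$
Split the variables and enumerate over partial assignments:
  $U_1 = \{x_1, \ldots, x_{n/2}\}$, $\alpha = (x_1 = T,\ x_2 = F,\ \ldots,\ x_{n/2} = T)$
  $U_2 = \{x_{n/2+1}, \ldots, x_n\}$, $\beta = (x_{n/2+1} = F,\ x_{n/2+2} = F,\ \ldots,\ x_n = T)$
There are $N = 2^{n/2}$ such $\alpha$'s and $\beta$'s.
Goal of the algorithm: find a pair such that $(\alpha \circ \beta)$ satisfies $\varphi$.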
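CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!
Proof: an $O(n^{2-\varepsilon})$ algorithm for LCS* $\Rightarrow$ a $2^{(1-\varepsilon/2)n}$ algorithm for CNF-SAT.
$\varphi$ is satisfiable $\iff \exists \alpha, \beta : \forall C_i : \alpha \circ \beta$ sat $C_i$
Idea: construct strings $S, T$ of length ~$2^{n/2} m$ such that
  $\mathrm{LCS}^*(S, T) = m \iff \exists \alpha, \beta : \forall C_i : \alpha \circ \beta$ sat $C_i$
Done: we get a $(2^{n/2} m)^{2-\varepsilon} = 2^{(1-\varepsilon/2)n} \,\mathrm{poly}(m)$ algorithm for CNF-SAT.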
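CNF-SAT TO LCS*
Theorem: The SETH implies that LCS* on binary strings requires $n^{2-o(1)}$ time!
Proof: Construct strings $S, T$ of length $O(2^{n/2} m)$ such that
  $\mathrm{LCS}^*(S, T) = m \iff \exists \alpha, \beta : \forall C_i : \alpha \circ \beta$ sat $C_i$
Define strings of length $m$:
  $T_\alpha = 0 * * 0 * \cdots 0$, where $T_\alpha[i] = *$ if $\alpha$ sat $C_i$, and $0$ otherwise.
  $T_\beta = 0\,0\,1\,0\,1 \cdots 0$, where $T_\beta[i] = 0$ if $\beta$ sat $C_i$, and $1$ otherwise.
Then: $T_\alpha \equiv T_\beta \iff \forall C_i : \alpha \circ \beta$ sat $C_i$
Construct $S, T$ in $O(2^{n/2} m)$ time:
  $S = \cdots T_{\alpha_1} \cdots \$ \cdots T_{\alpha_2} \cdots \$ \cdots \$ \cdots T_{\alpha_N} \cdots$
  $T = \cdots T_{\beta_1} \cdots \# \cdots T_{\beta_2} \cdots \# \cdots \# \cdots T_{\beta_N} \cdots \#$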
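The gadget strings can be built and checked with a few lines of Python. This is only a sketch of the per-assignment gadgets: the padding and separators that combine them into a single LCS* instance are elided on the slide, so the code compares gadgets pairwise instead of calling an LCS* solver. All function names and the signed-integer clause encoding are assumptions introduced here.

```python
from itertools import product


def satisfied_clauses(assignment, clauses):
    """For each clause, does the partial assignment already make a literal true?

    assignment: dict mapping variable index -> bool.
    clauses: lists of nonzero ints; v means x_v, -v means NOT x_v.
    """
    return [
        any(abs(l) in assignment and assignment[abs(l)] == (l > 0) for l in c)
        for c in clauses
    ]


def gadgets(n, clauses):
    """Build the strings for all assignments to each half of the variables."""
    first = list(range(1, n // 2 + 1))
    second = list(range(n // 2 + 1, n + 1))
    t_alpha, t_beta = [], []
    for vals in product([False, True], repeat=len(first)):
        sat = satisfied_clauses(dict(zip(first, vals)), clauses)
        t_alpha.append("".join("*" if ok else "0" for ok in sat))
    for vals in product([False, True], repeat=len(second)):
        sat = satisfied_clauses(dict(zip(second, vals)), clauses)
        t_beta.append("".join("0" if ok else "1" for ok in sat))
    return t_alpha, t_beta


def matches(a, b):
    """Position-wise match, where '*' in the alpha gadget matches anything."""
    return all(x == "*" or x == y for x, y in zip(a, b))


def sat_via_gadgets(n, clauses):
    """Satisfiable iff some alpha gadget matches some beta gadget in full."""
    t_alpha, t_beta = gadgets(n, clauses)
    return any(matches(a, b) for a in t_alpha for b in t_beta)
```

With (x1 OR x2) AND (NOT x1), encoded as `[[1, 2], [-1]]`, the gadget for x1 = False is `0*` and the gadget for x2 = True is `01`, which match in both positions, so `sat_via_gadgets(2, [[1, 2], [-1]])` returns `True`.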
3-SUM TO LOCAL ALIGNMENT
3-SUM on $n$ numbers: -15, -6, 33, …, -30, 7, …, 107, with $x \in \pm n^3$. $\Sigma \sim n^3$?
[ESA '14: A., Lewi, Williams]
$n^{o(1)}$ instances of 3-Vector-SUM on $n$ vectors $v_i = (x_1, \ldots, x_d)$, with $x_j \in \pm \log n$ and $d = O(\log n / \log\log n)$. $\Sigma \sim \log n$.
$\exists v_i, v_j, v_k : v_i + v_j + v_k = (0, \ldots, 0)$?
ฮฃ ~๐๐ log ๐
Hashing…
Define substrings of length $d$: $S_x = [\ldots, (h(x), x_i), \ldots]$
$\Sigma$ contains pairs $(h(x), x_i)$
3-SUM TO LOCAL ALIGNMENT
Define substrings of length $d$: $S_x = [\ldots, (h(x), x_i), \ldots]$
$\Sigma$ contains pairs $(h(x), x_i)$
Our scoring matrix enforces that: $(h(x), x_i)$ and $(h(y), y_i)$ will "match" iff $x_i + y_i + z_i = 0$, where $z$ is determined by $h(x), h(y)$.
CONCLUSION
The reductions explain the lack of progress and prove that new ideas are required for faster algorithms.
"An opportunity to solve many famous open problems while working on your favorite problem!"
• Subquadratic Edit Distance?
• Subquadratic LCS?
• Subcubic Protein Folding?
• Subcubic Tree Edit Distance?
Thank You! Questions?