Resource-efficient regular expression matching architecture for text analytics Kubilay Atasu IBM Research - Zurich Presented at ASAP 2014
© 2014 IBM Corporation
SystemT: an algebraic approach to declarative information extraction
- distill structured data from unstructured and semi-structured text
- exploit the extracted data in your applications

Example text (from Cohen’s IE tutorial, 2003):
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations:
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
A simple SystemT information extraction rule
Find the names (regex) that are at most 20 chars after a title (dict.)
[Figure: input "Founder..............Bill Gates" with the dict. match, the regex match, and the result, each marked with its start offset and end offset; the gap between the two matches is at most 20 chars]
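The rule above can be sketched in a few lines of Python to show why both the start and the end offset of every match are needed. This is only an illustration using the standard `re` module, not SystemT's own rule language; the title dictionary `TITLES`, the name pattern `NAME_RE`, and the function names are assumptions made for this sketch.

```python
import re

# Hypothetical title dictionary and name pattern (not taken from the slides).
TITLES = ["Founder", "CEO", "VP"]
NAME_RE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")


def dict_matches(text, entries):
    """Return sorted (start, end) spans of all dictionary entries in text."""
    spans = []
    for entry in entries:
        for m in re.finditer(re.escape(entry), text):
            spans.append((m.start(), m.end()))
    return sorted(spans)


def names_after_titles(text, max_gap=20):
    """Pair each title with every name that starts at most max_gap chars after it.

    Each result is (title, name, start offset of title, end offset of name).
    """
    titles = dict_matches(text, TITLES)
    names = [(m.start(), m.end()) for m in NAME_RE.finditer(text)]
    results = []
    for t_start, t_end in titles:
        for n_start, n_end in names:
            # The gap test needs the END offset of the dictionary match
            # and the START offset of the regex match.
            if 0 <= n_start - t_end <= max_gap:
                results.append((text[t_start:t_end], text[n_start:n_end],
                                t_start, n_end))
    return results
```

For the input from the figure, `names_after_titles("Founder" + "." * 14 + "Bill Gates")` returns `[("Founder", "Bill Gates", 0, 31)]`; with 21 dots between the two matches the gap exceeds 20 chars and the result is empty.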
Finding leftmost regular expression matches
- Assume that we are searching for the regex .*(a|aa|aaaa) in the input string “aaaa”
- Find the regex match with the smallest start offset value at each end offset position
- The leftmost matches are marked using solid lines
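The leftmost matches for this example can be checked with a brute-force sketch. It handles only a plain list of literal alternatives, which is enough for (a|aa|aaaa); the function name and the convention that end offsets are exclusive are assumptions of this sketch, not something the slides specify.

```python
def leftmost_matches(text, alternatives):
    """For each end offset, report the match with the smallest start offset.

    Returns a list of (start, end) pairs with an exclusive end offset.
    Only literal alternatives are supported in this sketch.
    """
    results = []
    for end in range(1, len(text) + 1):
        # Start offsets of all alternatives that end exactly at this position.
        starts = [end - len(alt) for alt in alternatives
                  if len(alt) <= end and text[end - len(alt):end] == alt]
        if starts:
            results.append((min(starts), end))
    return results
```

`leftmost_matches("aaaa", ["a", "aa", "aaaa"])` returns `[(0, 1), (0, 2), (1, 3), (0, 4)]`: at end offset 3 the candidates are "a" (start 2) and "aa" (start 1), so the leftmost match starts at 1.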
Regex matching: background
- Consider the regex .*(a|b|aa|aba)
- Can be transformed into NFA/DFA
[Figure: NFA and DFA for the regex .*(a|b|aa|aba)]
Traditional architectures do not support start offset reporting & leftmost matching:
- Reconfigurable NFAs (Sidhu FCCM 2001, Bispo FPT 2006, Yang ANCS 2008)
- Programmable DFAs (Smith SIGCOMM 2008, Van Lunteren MICRO 2012)
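As a rough software analogue of a one-bit-per-state NFA matcher, the sketch below tracks only which states are active, so it can report where a match ends but not where it starts. The 5-state transition list is a hypothetical NFA for .*(a|b|aa|aba), with state numbers chosen to agree with the transitions named on the later slides (0 → 1, 0 → 2 and 1 → 2 on ‘a’); it is not copied from the original figure.

```python
# Hypothetical 5-state NFA for .*(a|b|aa|aba). State 0 is the start state and
# stays active on every input character because of the leading ".*".
TRANSITIONS = [
    (0, "a", 1), (0, "a", 2), (0, "b", 2), (1, "a", 2),
    (0, "a", 3), (3, "b", 4), (4, "a", 2),
]
ACCEPTING = {2}


def match_end_offsets(text):
    """Return the (exclusive) end offsets at which some match ends."""
    active = {0}
    ends = []
    for pos, ch in enumerate(text):
        # Every active state with a transition on ch activates its target.
        active = {dst for src, c, dst in TRANSITIONS
                  if c == ch and src in active}
        active.add(0)  # the start state is always active
        if active & ACCEPTING:
            ends.append(pos + 1)
    return ends
```

`match_end_offsets("aaba")` returns `[1, 2, 3, 4]`. Nothing in the active set records which of the overlapping matches started first, which is the missing piece the slides address.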
Previous solution: network of state machines
[Figure: network of state machines; labels: active_reg, state_reg, start_offset_reg]
- Dimension of the network (# state machines) is statically computed for each regex
- DRAWBACK: replication of the state transition logic
K. Atasu, R. Polig, C. Hagleitner, F. R. Reiss: Hardware-accelerated regular expression matching for high-throughput text analytics. FPL 2013: 1-7
Contributions of this work
1. Extending Sidhu and Prasanna’s NFA architecture to support start offset reporting
2. A graph coloring based register clustering method to minimize the register usage
3. An efficient leftmost match computation method without using offset comparisons
[Figure: NFA and Sidhu and Prasanna’s NFA Architecture]
A straightforward extension of Sidhu & Prasanna’s architecture
- Add a start offset register to each NFA state
- offset_reg[0] = value of current offset position
- DRAWBACK: redundant start offset registers
[Figure: NFA and Baseline Architecture with offset_reg[0] to offset_reg[4]]
Clustering offset registers
[Figure: NFA and DFA]
- Build a conflict graph and apply graph coloring
- States with the same color can share registers
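One way to picture the clustering step is the sketch below. The slide does not define when two states conflict, so this sketch makes a conservative assumption: two NFA states conflict whenever they appear together in some reachable DFA state (i.e., they can be active at the same time). The 5-state NFA, the restriction of the alphabet to "ab", and the greedy coloring order are likewise assumptions for illustration, not necessarily the method used in the paper.

```python
from itertools import combinations

# Hypothetical 5-state NFA for .*(a|b|aa|aba); state 0 is always active.
TRANSITIONS = [
    (0, "a", 1), (0, "a", 2), (0, "b", 2), (1, "a", 2),
    (0, "a", 3), (3, "b", 4), (4, "a", 2),
]


def reachable_subsets(alphabet="ab"):
    """Subset construction: all sets of simultaneously active NFA states."""
    start = frozenset({0})
    seen, work = {start}, [start]
    while work:
        cur = work.pop()
        for ch in alphabet:
            nxt = frozenset({0} | {dst for src, c, dst in TRANSITIONS
                                   if c == ch and src in cur})
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen


def conflict_graph(subsets):
    """Assumed conflict rule: states that co-occur in a subset conflict."""
    edges = set()
    for subset in subsets:
        edges.update(combinations(sorted(subset), 2))  # pairs (u, v), u < v
    return edges


def greedy_coloring(num_states, edges):
    """Give each state the smallest color unused by its lower-numbered neighbors."""
    color = {}
    for s in range(num_states):
        used = {color[t] for t in range(s) if (t, s) in edges}
        color[s] = next(c for c in range(num_states) if c not in used)
    return color
```

With these assumptions, `greedy_coloring(5, conflict_graph(reachable_subsets()))` assigns four colors to the five states, with states 1 and 4 sharing a color (and hence a register), because they never appear in the same reachable subset.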
Leftmost match computation
- Assume that state 0 and state 1 are active and the current input is “a”
- We have to compute offset_reg[2] = MIN(offset_reg[0], offset_reg[1])
[Figure: NFA and Baseline Architecture with offset_reg[0] to offset_reg[4]]
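A software sketch of this baseline behavior: every active state carries a start offset, and a state with several firing incoming transitions takes the minimum of the incoming offsets. The 5-state NFA below is a hypothetical one for .*(a|b|aa|aba), numbered so that state 2 is reachable from states 0 and 1 on ‘a’ as in the example above; the function name and the exclusive end offsets are assumptions of the sketch.

```python
# Hypothetical 5-state NFA for .*(a|b|aa|aba); state 0 is always active.
TRANSITIONS = [
    (0, "a", 1), (0, "a", 2), (0, "b", 2), (1, "a", 2),
    (0, "a", 3), (3, "b", 4), (4, "a", 2),
]
ACCEPTING = {2}


def leftmost_matches(text):
    """Return (start, end) of the leftmost match at each end offset."""
    offset_reg = {}  # active state -> start offset it currently holds
    results = []
    for pos, ch in enumerate(text):
        offset_reg[0] = pos  # offset_reg[0] = current offset position
        nxt = {}
        for src, c, dst in TRANSITIONS:
            if c == ch and src in offset_reg:
                # Explicit offset comparison: keep the smallest start offset.
                nxt[dst] = min(nxt.get(dst, offset_reg[src]), offset_reg[src])
        offset_reg = nxt
        starts = [offset_reg[s] for s in ACCEPTING if s in offset_reg]
        if starts:
            results.append((min(starts), pos + 1))
    return results
```

`leftmost_matches("aaba")` returns `[(0, 1), (0, 2), (2, 3), (1, 4)]`. On the second ‘a’, state 2 is reached from state 0 (offset 1) and from state 1 (offset 0), and the MIN selects offset 0.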
Leftmost match computation without offset comparisons (1)
- Assume that state K has M incoming state transitions
- We have to compute the minimum of M offset values
[Figure: offset_reg[0] to offset_reg[M-1] and offset_reg[K]; tree-based implementation (long latency) vs. fully parallel implementation (expensive)]
Better solution: each state keeps track of the states that are activated earlier
- Define an N×N bit matrix: earlier[i,j]=1 if state j is activated before state i
Leftmost match computation without offset comparisons (2)
- offset_reg[0] = value of the current offset pointer, earlier[0, :] = active_reg
- Assume that state 1 is active (i.e., earlier[0,1] = 1), and the input is ‘a’
  – we have two transitions into state 2: from state 0 and from state 1
  – since earlier[0,1] = 1, we choose the start offset provided by state 1
  – due to transitions 0 → 1 and 1 → 2, earlier[1,2]=1 in the next cycle
[Figure: offset_reg[2] is selected from offset_reg[0] and offset_reg[1]; select signals tr[0,2] ∧ ¬(tr[1,2] ∧ earlier[0,1]) and tr[1,2] ∧ ¬(tr[0,2] ∧ earlier[1,0])]
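The selection scheme can be simulated in software as a sanity check. In the sketch below, each destination state picks a firing source that no other firing source precedes according to the earlier matrix, and the matrix is carried forward through the chosen sources. The 5-state NFA, the update rule `new_earlier[i][k] = earlier[source of i][source of k]`, and all names are assumptions of this sketch rather than a description of the actual hardware; the `assert` that compares against `min` is there only to cross-check the selection.

```python
# Hypothetical 5-state NFA for .*(a|b|aa|aba); state 0 is always active.
TRANSITIONS = [
    (0, "a", 1), (0, "a", 2), (0, "b", 2), (1, "a", 2),
    (0, "a", 3), (3, "b", 4), (4, "a", 2),
]
ACCEPTING = {2}
N = 5


def pick(candidates, earlier):
    """Pick a candidate state that no other candidate precedes."""
    return next(s for s in candidates
                if not any(earlier[s][t] for t in candidates))


def leftmost_matches(text):
    active = [False] * N
    offset = [0] * N
    # earlier[i][j] is True if state j was activated before state i.
    earlier = [[False] * N for _ in range(N)]
    results = []
    for pos, ch in enumerate(text):
        active[0], offset[0] = True, pos
        for j in range(1, N):
            earlier[0][j] = active[j]  # earlier[0, :] = active_reg
            earlier[j][0] = False      # nothing is activated after state 0
        sources = {}
        for src, c, dst in TRANSITIONS:
            if c == ch and active[src]:
                sources.setdefault(dst, []).append(src)
        chosen = {dst: pick(srcs, earlier) for dst, srcs in sources.items()}
        for dst, srcs in sources.items():
            # Cross-check only: the comparison-free choice equals the MIN.
            assert offset[chosen[dst]] == min(offset[s] for s in srcs)
        new_active = [False] * N
        new_offset = [0] * N
        new_earlier = [[False] * N for _ in range(N)]
        for i, si in chosen.items():
            new_active[i], new_offset[i] = True, offset[si]
            for k, sk in chosen.items():
                # Order between new states follows the order of their sources.
                new_earlier[i][k] = earlier[si][sk]
        active, offset, earlier = new_active, new_offset, new_earlier
        hits = [s for s in ACCEPTING if active[s]]
        if hits:
            results.append((offset[pick(hits, earlier)], pos + 1))
    return results
```

`leftmost_matches("aaba")` returns `[(0, 1), (0, 2), (2, 3), (1, 4)]`, the same result as taking explicit minima, and after the second ‘a’ the simulated matrix has earlier[1][2] set, matching the example on the slide.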
Experiments (text analytics regexes)
- Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools
- 32-bit start offset registers, 250 MHz target clock frequency
- NFA representation: Follow Automata with character classes
Experiments (L7 filter regexes)
- Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools
- 32-bit start offset registers, 250 MHz target clock frequency
- NFA representation: Follow Automata with character classes
Summary & future work
Support for start offset reporting and leftmost matching
– without replicating the state transition logic
– without using redundant offset registers
– without using expensive offset comparisons
More than threefold reduction in the logic resource usage
More than 1.25-fold improvement in the clock frequency
Less than 8.6-fold overhead w.r.t. Sidhu & Prasanna’s architecture
– while using 32-bit offset registers
Our current and future work includes
– design of intermediate fabrics for fast compilation
– analysis and optimization of the energy consumption