Presentation - IBM Research | Zurich

Resource-efficient regular expression matching architecture for text analytics
Kubilay Atasu
IBM Research - Zurich
Presented at ASAP 2014

© 2014 IBM Corporation

SystemT: an algebraic approach to declarative information extraction
• Distill structured data from unstructured and semi-structured text
• Exploit the extracted data in your applications

Example text (from Cohen’s IE tutorial, 2003):
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

A simple SystemT information extraction rule
• Find the names (regex) that are at most 20 chars after a title (dict.)

[Figure: the text "Founder..............Bill Gates" with the dictionary match, the regex match, and the result each marked by a start offset and an end offset; the gap between the dictionary match and the regex match is at most 20 chars]
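
The rule above can be sketched in plain Python (this is not SystemT's own rule language). The title dictionary `TITLES`, the name pattern `NAME_REGEX`, and the function `title_then_name` are hypothetical names chosen for illustration; the name pattern is a deliberately naive "two capitalized words" assumption.

```python
import re

TITLES = ["CEO", "VP", "Founder"]  # hypothetical title dictionary
NAME_REGEX = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")  # naive name pattern (assumption)
TITLE_REGEX = re.compile(r"\b(?:" + "|".join(map(re.escape, TITLES)) + r")\b")

def title_then_name(text, max_gap=20):
    """Return (start, end) spans covering a title followed by a name
    that starts at most max_gap characters after the title ends."""
    titles = [m.span() for m in TITLE_REGEX.finditer(text)]
    names = [m.span() for m in NAME_REGEX.finditer(text)]
    results = []
    for t_start, t_end in titles:
        for n_start, n_end in names:
            if 0 <= n_start - t_end <= max_gap:
                results.append((t_start, n_end))
    return results

text = "For years, Microsoft Corporation CEO Bill Gates was against open source."
print([text[s:e] for s, e in title_then_name(text)])
```

Note that the join needs both the start and the end offset of every regex match, which is why start offset reporting matters for this workload.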

Finding leftmost regular expression matches
• Assume that we are searching for the regex .*(a|aa|aaaa) in the input string “aaaa”
• Find the regex match with the smallest start offset value at each end offset position
• The leftmost matches are marked using solid lines
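
As a software reference for the definition above, a brute-force sketch can enumerate, for each end offset, the smallest start offset at which the parenthesized part of the regex matches. The function name `leftmost_matches` is hypothetical, and the quadratic scan is only meant to pin down the expected results, not to mirror the hardware.

```python
import re

def leftmost_matches(pattern, text):
    """For every end offset, report the match with the smallest start offset.
    `pattern` is the part of the regex after the leading .* prefix."""
    compiled = re.compile(pattern)
    matches = []
    for end in range(1, len(text) + 1):
        for start in range(end):  # try the smallest start offsets first
            if compiled.fullmatch(text, start, end):
                matches.append((start, end))
                break
    return matches

print(leftmost_matches("a|aa|aaaa", "aaaa"))
# → [(0, 1), (0, 2), (1, 3), (0, 4)]
```

At end offset 3 the leftmost match starts at offset 1, because "aaa" is not in the language while "aa" is.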

Regex matching: background
• Consider the regex .*(a|b|aa|aba)
• Can be transformed into NFA/DFA

[Figure: NFA and DFA for the regex]

Traditional architectures do not support start offset reporting & leftmost matching:
• Reconfigurable NFAs (Sidhu FCCM 2001, Bispo FPT 2006, Yang ANCS 2008)
• Programmable DFAs (Smith SIGCOMM 2008, Van Lunteren MICRO 2012)
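
To make the limitation concrete, here is a minimal software sketch of plain NFA simulation for .*(a|b|aa|aba). The transition table is a hand-built position automaton and is an assumption; it is not necessarily the NFA drawn on the slide. The simulation knows which states are active, so it can report where a match ends, but it keeps no information about where the match started.

```python
# Hand-built NFA for .*(a|b|aa|aba) (assumption, not taken from the slide).
# State 0 is the start state; the .* prefix keeps it active at every position.
TRANSITIONS = {
    (0, "a"): {1, 3, 5},
    (0, "b"): {2},
    (3, "a"): {4},
    (5, "b"): {6},
    (6, "a"): {7},
}
ACCEPTING = {1, 2, 4, 7}

def match_end_offsets(text):
    """Return the end offsets at which some match ends (no start offsets)."""
    active = set()
    ends = []
    for pos, ch in enumerate(text):
        active.add(0)  # a new match attempt can begin at every position
        nxt = set()
        for state in active:
            nxt |= TRANSITIONS.get((state, ch), set())
        active = nxt
        if active & ACCEPTING:
            ends.append(pos + 1)
    return ends

print(match_end_offsets("xaba"))  # → [2, 3, 4]
```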

Previous solution: network of state machines

[Figure: network of state machines, with registers active_reg, state_reg, and start_offset_reg]

- Dimension of the network (# state machines) is statically computed for each regex
- DRAWBACK: replication of the state transition logic

K. Atasu, R. Polig, C. Hagleitner, F. R. Reiss: Hardware-accelerated regular expression matching for high-throughput text analytics. FPL 2013: 1-7

Contributions of this work
1. Extending Sidhu and Prasanna’s NFA architecture to support start offset reporting
2. A graph coloring based register clustering method to minimize the register usage
3. An efficient leftmost match computation method without using offset comparisons

[Figure: an NFA and Sidhu and Prasanna’s NFA architecture]

A straightforward extension of Sidhu & Prasanna’s architecture
• Add a start offset register to each NFA state
• offset_reg[0] = value of current offset position
• DRAWBACK: redundant start offset registers

[Figure: NFA and baseline architecture, with start offset registers offset_reg[0] to offset_reg[4]]

Clustering offset registers
• Build a conflict graph and apply graph coloring
• States with the same color can share registers

[Figure: NFA and DFA]
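
The clustering idea can be sketched as follows. The slide does not spell out the conflict criterion, so this sketch assumes that two NFA states conflict when some reachable DFA state (a set of simultaneously active NFA states) contains both. The regex .*(ab|cd), its NFA, and all function names are hypothetical, and the greedy coloring is a simple stand-in for whatever coloring heuristic the actual flow uses.

```python
from itertools import combinations

# Hypothetical NFA for .*(ab|cd); state 0 is the always-active start state.
NFA = {(0, "a"): {1}, (1, "b"): {2}, (0, "c"): {3}, (3, "d"): {4}}
ALPHABET = "abcd"
NUM_STATES = 5

def reachable_subsets():
    """Subset construction: every set of NFA states that can be active together."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        for ch in ALPHABET:
            nxt = {0}  # the .* prefix keeps state 0 active
            for state in subset:
                nxt |= NFA.get((state, ch), set())
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def conflict_graph(subsets):
    """Assumed criterion: states that can be active together conflict."""
    graph = {s: set() for s in range(NUM_STATES)}
    for subset in subsets:
        for u, v in combinations(subset, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def greedy_coloring(graph):
    """Color high-degree nodes first; each node takes the smallest free color."""
    colors = {}
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors

colors = greedy_coloring(conflict_graph(reachable_subsets()))
print(colors, "registers needed:", len(set(colors.values())))
```

For this small example, states 1 to 4 are never active together, so under the stated assumption they share one register and two registers suffice instead of five.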

Leftmost match computation
• Assume that state 0 and state 1 are active and the current input is “a”
• We have to compute offset_reg[2] = MIN(offset_reg[0], offset_reg[1])

[Figure: NFA and baseline architecture, with start offset registers offset_reg[0] to offset_reg[4]]
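
A software sketch of this baseline: every active NFA state carries a start offset, and a state with several firing incoming transitions takes the minimum of the incoming offsets. The 5-state NFA for .*(a|aa|aaaa) below is hand-built and hypothetical; its state numbering does not follow the slide's figure.

```python
# Hypothetical NFA for .*(a|aa|aaaa): state 0 is the always-active start
# state and state 4 is the only accepting state.
TRANSITIONS = {  # (source state, input char) -> target states
    (0, "a"): [1, 3, 4],
    (1, "a"): [2],
    (2, "a"): [3],
    (3, "a"): [4],
}
ACCEPT = 4

def leftmost_matches_min(text):
    """Report (start, end) of the leftmost match at each end offset."""
    offset_reg = {}  # active state -> start offset carried by that state
    matches = []
    for pos, ch in enumerate(text):
        offset_reg[0] = pos  # state 0 always holds the current offset
        next_reg = {}
        for state, start in offset_reg.items():
            for target in TRANSITIONS.get((state, ch), []):
                # several sources may reach the same target: keep the MIN
                next_reg[target] = min(start, next_reg.get(target, start))
        offset_reg = next_reg
        if ACCEPT in offset_reg:
            matches.append((offset_reg[ACCEPT], pos + 1))
    return matches

print(leftmost_matches_min("aaaa"))
# → [(0, 1), (0, 2), (1, 3), (0, 4)]
```

The `min` call is the part that becomes costly in hardware: it corresponds to a comparison of full-width offset values for every state with more than one incoming transition.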

Leftmost match computation without offset comparisons (1)
• Assume that state K has M incoming state transitions
• We have to compute the minimum of M offset values
• tree-based implementation (long latency)
• fully parallel implementation (expensive)

[Figure: offset_reg[0], offset_reg[1], offset_reg[2], …, offset_reg[M-1] feeding offset_reg[K] of state K]

Better solution: each state keeps track of the states that are activated earlier
- Define an N×N bit matrix: earlier[i,j]=1 if state j is activated before state i

Leftmost match computation without offset comparisons (2)
• offset_reg[0] = value of the current offset pointer, earlier[0, :] = active_reg
• Assume that state 1 is active (i.e., earlier[0,1] = 1), and the input is ‘a’
  - we have two transitions into state 2: from state 0 and from state 1
  - since earlier[0,1] = 1, we choose the start offset provided by state 1
  - due to transitions 0 → 1 and 1 → 2, earlier[1,2]=1 in the next cycle

[Figure: offset_reg[0] and offset_reg[1] feed offset_reg[2] of state 2 through two select signals:
  select offset_reg[0]: tr[0,2] ∧ ¬(tr[1,2] ∧ earlier[0,1])
  select offset_reg[1]: tr[1,2] ∧ ¬(tr[0,2] ∧ earlier[1,0])]
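
The earlier-matrix idea can be sketched in software as below. The slides give the reset rule for state 0 and one update example; the general update rule used here (the new earlier[i,j] is the old earlier bit of the two winning source states) is an assumption inferred from that example, not something the slides state. The 5-state NFA for .*(a|aa|aaaa) is hand-built and hypothetical. Offsets are only copied, never compared.

```python
N = 5
# Hypothetical NFA for .*(a|aa|aaaa): (source, target) pairs per input char.
EDGES = {"a": [(0, 1), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]}
ACCEPT = 4

def leftmost_matches_earlier(text):
    active = [False] * N
    offset_reg = [0] * N
    # earlier[i][j] is True if state j was activated before state i
    earlier = [[False] * N for _ in range(N)]
    matches = []
    for pos, ch in enumerate(text):
        # State 0 restarts at every position, so every other active
        # state was activated before it.
        active[0] = True
        offset_reg[0] = pos
        for j in range(N):
            earlier[0][j] = active[j] and j != 0
            earlier[j][0] = False
        # tr[s][t]: the transition s -> t fires in this cycle
        tr = [[False] * N for _ in range(N)]
        for s, t in EDGES.get(ch, []):
            tr[s][t] = active[s]
        # winner[t]: a firing source that no other firing source precedes
        winner = [None] * N
        for t in range(N):
            for s in range(N):
                if tr[s][t] and not any(
                    tr[q][t] and earlier[s][q] for q in range(N)
                ):
                    winner[t] = s
                    break
        # Assumed update rule: targets inherit the ordering of their winners.
        next_active = [w is not None for w in winner]
        next_offset = [offset_reg[w] if w is not None else 0 for w in winner]
        next_earlier = [
            [
                next_active[i] and next_active[j]
                and earlier[winner[i]][winner[j]]
                for j in range(N)
            ]
            for i in range(N)
        ]
        active, offset_reg, earlier = next_active, next_offset, next_earlier
        if active[ACCEPT]:
            matches.append((offset_reg[ACCEPT], pos + 1))
    return matches

print(leftmost_matches_earlier("aaaa"))
# → [(0, 1), (0, 2), (1, 3), (0, 4)]
```

In this sketch each offset comparison is replaced by single-bit lookups in the earlier matrix, which is the motivation the slides give for the method.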

Experiments (text analytics regexes)
• Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools
• 32-bit start offset registers, 250 MHz target clock frequency
• NFA representation: Follow Automata with character classes

Experiments (L7 filter regexes)
• Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools
• 32-bit start offset registers, 250 MHz target clock frequency
• NFA representation: Follow Automata with character classes

Summary & future work
• Support for start offset reporting and leftmost matching
  - without replicating the state transition logic
  - without using redundant offset registers
  - without using expensive offset comparison
• > threefold reduction in the logic resource usage
• > 1.25-fold improvement in the clock frequency
• < 8.6-fold overhead w.r.t. Sidhu & Prasanna’s architecture
  - while using 32-bit offset registers
• Our current and future work includes
  - design of intermediate fabrics for fast compilation
  - analysis and optimization of the energy consumption
