An Efficient Probability Framework for Error Propagation and Correlation Estimation

Liang Chen
Karlsruhe Institute of Technology, Karlsruhe, Germany
Email: [email protected]

Mehdi B. Tahoori
Karlsruhe Institute of Technology, Karlsruhe, Germany
Email: [email protected]

Abstract—Soft errors are becoming one of the major reliability concerns as transistor sizes continue to shrink. Low-level transient events may result in multiple correlated bit flips at higher levels. Considering this correlation effect is essential for accurate error rate estimation and efficient error mitigation. This paper proposes a novel framework to address this correlation issue at the logic level. Based on the concept of an error propagation function, graph transformation techniques are used to convert the error probability and correlation problem into the computation of signal probability and correlation. The experimental results show that, compared with Monte-Carlo simulation, our approach is 72x faster, while the average inaccuracy of error probability estimation is below 0.006.

I. INTRODUCTION

Radiation-induced soft errors are becoming a significant reliability issue in the nano era [1]. The critical charge that a particle strike must deposit to cause a transient fault decreases with shrinking transistor size. Additionally, the density of transistors on a chip increases exponentially according to Moore's law, causing more area of the chip to be sensitive to soft errors. As a result, not only is the rate of single transient faults in logic circuits approaching that of memories [2], but the occurrence probability of multiple transient faults is also no longer negligible [3]. As the extent and criticality of radiation-induced soft errors are increasing, it is essential to consider this reliability impact for cost-effective error mitigation.

Soft errors are modeled differently at different abstraction levels in the design cycle. For circuit and logic level modeling, soft errors are treated as transient current pulses, and three masking effects (logical, electrical and timing masking) are considered to obtain an accurate estimation of the soft error rate (SER) [4], [5]. At register-transfer level and higher (e.g. architectural), soft errors are modeled as bit flips, and reliability metrics are estimated by either fault injection simulations [6], [7], [8] or analytical methods [9], [10].

As sequential circuit elements and even combinational logic gates are becoming as vulnerable to soft errors as memories, the intrinsic structural irregularity and functional complexity of logic circuits make the modeling of error propagation and error probability estimation more sophisticated than that of memory cells. Even a single event transient (SET) caused by a particle strike at circuit level can propagate to multiple outputs, and at higher abstraction levels it can manifest as multiple correlated errors (bit flips).

978-1-4673-2085-6/12/$31.00 © 2012 IEEE

Modeling this correlation effect is essential for accurate estimation of the soft error rate. Furthermore, since the functionality of a logic block is represented as a gate-level netlist (e.g. a Boolean network of basic gates), investigating the error propagation mechanism in this network can provide valuable insights into how errors are correlated.

This paper proposes a novel method based on the concepts of error propagation function and super-gate to unify the treatment of signal probability and error probability, which efficiently addresses the internal correlation among different wires. It not only calculates both probability values in one pass, but also takes the error-free signal correlations, the error signal correlations and their cross-correlation into account. By investigating the error propagation mechanisms, semi-super gate simplification is used to speed up the computation. The method therefore provides a concise and efficient solution with much potential for further extensions (e.g. analyzing multiple upsets and block-level error propagation). The simulation results show that our method is 72x faster than Monte-Carlo simulation, while the average inaccuracy of error probability estimation is below 0.006. Our proposed method focuses on the logical masking effect of soft errors in combinational circuits; however, it can also be combined with other existing techniques such as [2], [11] to handle timing and electrical masking effects and provide the full perspective.

The rest of this paper is organized as follows. Section II discusses our proposed error propagation model and the adopted correlation model. Section III introduces the procedures to calculate the gate error probabilities using a graph-based gate replacement method, and Section IV describes the experimental results. Section V compares related work with our proposed technique, and finally Section VI concludes this paper.

II. PROPOSED ERROR PROPAGATION MODEL

A. Error Propagation in Combinational Network

Soft errors are modeled as bit flips in this paper. Efficient modeling of error propagation in a combinational network is essential for accurate error probability estimation. To illustrate our idea, the AND operation $l = ij$ is taken as the running example. In total there are four error-free input combinations. For the case $ij = 10$, the output $l$ is faulty (0→1) only when $j$ is faulty (0→1) and $i$ is error-free. The other three cases can be analyzed in a similar way. If the notation $x_f$ is introduced to indicate whether the signal $x$ is faulty, i.e. $x_f = 1$ means a bit flip occurs on signal $x$, the above four cases can be combined and expressed by a single Boolean function, called the Error Propagation Function (EPF):

$$l_f = i\,\bar{j}\,\bar{i}_f\,j_f + \bar{i}\,\bar{j}\,i_f\,j_f + \bar{i}\,j\,i_f\,\bar{j}_f + i\,j\,(i_f + j_f) \qquad (1)$$

Fig. 1. Super-gate concept (Super AND with inputs $i$, $i_f$, $j$, $j_f$ and outputs $l$, $l_f$).
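Equation (1) can be checked exhaustively. The following Python sketch compares the EPF with the direct definition of an output bit flip, i.e. the XOR of the error-free output and the output computed from the possibly flipped inputs, over all 16 input combinations. The function names `epf_and`, `flip_on_output` and `epf_matches_definition` are illustrative and are not taken from the paper's implementation.

```python
from itertools import product

def epf_and(i, j, i_f, j_f):
    """EPF of l = i AND j, written term by term as in Equation (1)."""
    ni, nj, nif, njf = 1 - i, 1 - j, 1 - i_f, 1 - j_f
    return ((i & nj & nif & j_f)      # case ij = 10
            | (ni & nj & i_f & j_f)   # case ij = 00
            | (ni & j & i_f & njf)    # case ij = 01
            | (i & j & (i_f | j_f)))  # case ij = 11

def flip_on_output(i, j, i_f, j_f):
    """Direct definition: l is faulty iff the erroneous output differs
    from the error-free output."""
    return (i & j) ^ ((i ^ i_f) & (j ^ j_f))

def epf_matches_definition():
    """Compare both formulations on all 16 input combinations."""
    return all(epf_and(*v) == flip_on_output(*v)
               for v in product((0, 1), repeat=4))

print(epf_matches_definition())
```

The same XOR construction gives the EPF of any other gate, which is one way to read the remark that the idea applies to arbitrary cells.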

If the error-free function $l = ij$ is combined with the EPF, a four-input, two-output super-gate can be constructed, as illustrated in Figure 1. The representation of the virtual error signals $i_f$, $j_f$, $l_f$ has several important features:
• It significantly simplifies the propagation and correlation modeling of bit flips in a combinational network: it does not differentiate between the two kinds of bit flips, 0→1 and 1→0, as in [12]. The virtual error signals are treated as normal ones with properties such as logic values, signal probabilities, etc.
• The super-gate concept is independent of any specific algorithm: it is just an additional Boolean function and can be analyzed using the well-researched methods in the areas of signal probability, switching activity, etc.
• This concept can be easily extended to more complicated error modeling: it can be adjusted for single bit flips, multiple bit flips, transient faults, permanent faults, etc.

In addition, the interpretation of their properties differs from the traditional meaning:
• Logic value: '1' is interpreted as a bit flip occurring on the corresponding signal and '0' means error-free;
• Signal probability: interpreted as the error probability of the corresponding signal.

Since any Boolean function can be decomposed into a network of basic operations, and multiple-input gates can be transformed into a cascade of two-input ones, two-input AND, OR and INV gates are chosen to constitute arbitrary combinational networks for the sake of modeling simplicity, similar to the treatment in [12]. However, the general idea can be applied to any other combinational gate in a cell library.

The main limitation of this modeling technique is that it doubles the number of signals that need to be considered.
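To illustrate the reading of a signal probability as an error probability, the following Python sketch enumerates all input combinations of a super AND gate. It assumes that the four inputs are statistically independent, which is an assumption made only for this illustration, and the names `super_and` and `output_probs` are hypothetical.

```python
from itertools import product

def super_and(i, j, i_f, j_f):
    """Four-input, two-output super AND: returns (l, l_f)."""
    l = i & j
    # l_f is 1 iff the output computed from the flipped inputs differs
    # from the error-free output.
    l_f = l ^ ((i ^ i_f) & (j ^ j_f))
    return l, l_f

def output_probs(p_i, p_j, p_if, p_jf):
    """Return (p(l), p(l_f)) for independent inputs (illustrative assumption)."""
    probs = (p_i, p_j, p_if, p_jf)
    p_l = p_lf = 0.0
    for bits in product((0, 1), repeat=4):
        weight = 1.0
        for bit, p in zip(bits, probs):
            weight *= p if bit else 1.0 - p
        l, l_f = super_and(*bits)
        p_l += weight * l
        p_lf += weight * l_f
    return p_l, p_lf

# Error injected on input i (p(i_f) = 1), input j error-free (p(j_f) = 0).
print(output_probs(0.5, 0.5, 1.0, 0.0))
```

With the error on $i$ and $j$ error-free, $l$ flips exactly when $j = 1$, so the sketch yields $p(l_f) = p(j) = 0.5$ alongside $p(l) = 0.25$. This is the logical masking of the AND gate expressed as a signal probability.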
However, on the one hand, this additional complexity is unavoidable when modeling the propagation of bit-flip errors and their complicated correlations; on the other hand, the complexity problem can be partially alleviated by exploring the error propagation mechanisms with graph algorithms, as discussed later.

B. Signal Correlation

In typical digital circuits, there are two kinds of correlation: temporal correlation and spatial correlation [13]. Temporal correlation is related to the historical trends of bit streams and is beyond the scope of this paper, because our main concern is error propagation in combinational networks. Therefore, only spatial correlation is considered here. Generally speaking, there are two main sources of spatial correlation in a combinational network:
• Structural dependence: due to reconvergent fanout, where two or more signals originate from the same gate, propagate on different paths and converge again at the inputs of another gate;
• Primary input dependence: resulting from correlated input vectors and workload dependencies.

Fig. 2. Typical structures for correlation calculation: (a) Fanout node, (b) AND gate.

1) Correlation Model: To model the spatial correlation, we adopt the Correlation Coefficient Method (CCM) [14]. There are two main reasons for this choice. First, CCM provides accurate probability estimation while scaling better than the Bayesian network approach [15] and the probabilistic transfer matrix (PTM) approach [16], which use probabilistic graph models and matrix operations, respectively, to calculate error probabilities. Second, CCM offers good extendability to address both structural dependencies and primary input dependencies [13]. In CCM, with the notation of signal probability $P(i = 1) = p(i)$, the correlation coefficient of signals $i$, $j$ is defined as

$$C_{i,j} = C_{j,i} = \frac{p(ij)}{p(i)\,p(j)} = \frac{p(i|j)}{p(i)} = \frac{p(j|i)}{p(j)} \qquad (2)$$

where $p(ij)$ is the joint probability $P(i = 1, j = 1)$, and $p(i|j)$ is the conditional probability $P(i = 1 | j = 1)$. From this definition, it can be derived that $C_{i,j} = 1$ when the two signals are uncorrelated. For the primitive two-input AND gate, given $p(i)$, $p(j)$ and $C_{i,j}$, we can exactly calculate the signal probability of the gate output $l$ using the following formula:

$$p(l) = p(i)\,p(j)\,C_{i,j}, \qquad 0 \le C_{i,j} \le \frac{1}{p(i)\,p(j)} \qquad (3)$$

Besides the signal probability calculation, the correlation coefficients between signals can be analytically computed, based on several basic propagation rules, for all structural cases in the combinational network. Two typical cases are illustrated in Figure 2, and the corresponding correlation formulas are $C_{l,m} = 1/p(i)$ and $C_{l,m} = C_{i,h}\,C_{j,h}$, respectively.

2012 IEEE 18th International On-Line Testing Symposium (IOLTS)

2) Accuracy Issue: In Figure 2(b) it is assumed that

$$C_{ij,h} \approx C_{i,h}\,C_{j,h} \qquad (4)$$

so the dependence of two signals on a third one is neglected. Hence, second and higher order correlations among multiple signals are not taken into account in CCM. The signal probability estimation in [14] and the switching activity estimation in [13] show that this first-order approximation can provide accurate results in practice. However, neglecting high order correlations may push the gate output probability outside the [0, 1] bound. Therefore, Inequality (3) is used to limit $C_{i,j}$ and avoid probability overflow. From our observation, however, the upper bound of the correlation coefficient in Inequality (3) is rather loose. Revisiting the definition of the correlation coefficient in Equation (2), we have:

$$C_{i,j} = \frac{p(i|j)}{p(i)} = \frac{p(j|i)}{p(j)} \le \min\left\{\frac{1}{p(i)}, \frac{1}{p(j)}\right\} \qquad (5)$$
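The effect of the two bounds can be seen on a small numeric example. The Python sketch below evaluates Equation (3) for an AND gate and clamps the coefficient to the bound of Inequality (5). Clamping is one plausible way to apply the bound and is an assumption of this sketch, as are the helper names and the example probabilities.

```python
def loose_bound(p_i, p_j):
    """Upper bound on C_ij from Inequality (3); requires p_i, p_j > 0."""
    return 1.0 / (p_i * p_j)

def tight_bound(p_i, p_j):
    """Upper bound on C_ij from Inequality (5); requires p_i, p_j > 0."""
    return min(1.0 / p_i, 1.0 / p_j)

def and_probability(p_i, p_j, c_ij):
    """p(l) = p(i) p(j) C_ij for l = i AND j, with C_ij clamped to
    [0, tight bound] (the clamping policy is an assumption of this sketch)."""
    c = min(max(c_ij, 0.0), tight_bound(p_i, p_j))
    return p_i * p_j * c

p_i, p_j = 0.5, 0.25                      # made-up example probabilities
print(loose_bound(p_i, p_j), tight_bound(p_i, p_j))
print(and_probability(p_i, p_j, 1.0))     # uncorrelated inputs
print(and_probability(p_i, p_j, 8.0))     # over-estimated coefficient, clamped
```

For these values the loose bound is 8 and the tight bound is 2. A coefficient of 8 would give $p(l) = 1$, which stays inside [0, 1] but exceeds $p(j) = 0.25$ and is therefore impossible for an AND output. With the tight bound, $p(l)$ is capped at $\min\{p(i), p(j)\} = 0.25$.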

Inequality (5) gives a tighter upper bound than Inequality (3) and therefore provides better error bounding in the correlation propagation, especially for signals with high order correlation.

3) Complexity Issue: According to [14], the computational complexity of CCM is linear in the number of topological levels $L$ of the netlist and pseudo-quadratic in the number of gates per level $N_L$, which can be expressed as:

$$\text{Complexity of CCM} \le \sum_{L} \frac{N_L\,(N_L - 1)}{2} \qquad (6)$$
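As a small worked example of Inequality (6), the following Python sketch counts the worst-case number of signal pairs per level for a hypothetical netlist. The level sizes are made up for illustration and the function name is not from the paper.

```python
def ccm_pair_bound(gates_per_level):
    """Worst-case number of correlation coefficients:
    sum over levels of N_L * (N_L - 1) / 2."""
    return sum(n * (n - 1) // 2 for n in gates_per_level)

# Hypothetical netlist with three levels of 4, 3 and 2 gates.
print(ccm_pair_bound([4, 3, 2]))  # 6 + 3 + 1 pairs
```

Because the bound grows quadratically with the number of gates in a level, wide levels dominate the cost.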

Note that the right-hand side of Inequality (6) is only the worst case, because for accurate estimation it is only necessary to calculate the correlation coefficients for pairs of signals that depend on each other. Preprocessing the netlist to identify uncorrelated signals is beneficial for reducing the CCM runtime.

III. ERROR PROBABILITY ESTIMATION

A. Uniform Super-gate Replacement

As discussed in Section II-A, the error propagation in combinational networks can be modeled by the EPFs corresponding to each gate, and each virtual error signal $l_f$ is treated as a normal one. Combined with CCM, the error probability estimation flow can be performed as follows:
1) The gate-level netlist is parsed, and the corresponding graph is constructed;
2) Each basic gate is replaced with the netlist of the corresponding super-gate and then flattened;
3) According to the location of the error site, the probability of each gate's output error signal $l_f$ is calculated as follows:
• Error site: $p(l_f) = 1.0$;
• Other gates whose topological level is smaller than or equal to that of the error site: $p(l_f) = 0.0$, i.e. error-free;
• Gates whose topological level is larger than that of the error site: $p(l_f)$ is calculated with CCM level by level;
4) The error probabilities of the primary outputs are obtained as the signal probabilities of the corresponding error signals, i.e. $p_e(PO_i) = p(PO_{if})$.

The advantage of this flow is that the super-gate replacement needs to be done only once and the transformed graph is applicable to any error location. However, it suffers from the high runtime of error probability estimation for each error site. Recalling the runtime Inequality (6) and the intuitive gate-level implementation of the super AND in Figure 3(a), it is obvious that the super-gate replacement increases not only the number of circuit levels $L$ by 4 times, but also the number of gates at each level $N_L$ by around 4 times on average, therefore introducing very high runtime overhead. Although some logic optimization could be applied to the EPF to reduce the number of gates in the super-gate implementation, the benefit obtained with regard to runtime is very limited.

Fig. 3. Super and semi-super AND gate-level implementation: (a) Super AND, (b) Semi-super AND.

B. Fanout Cone Extraction

In fact, uniform super-gate replacement is not necessary in real error estimation, especially in the scope of soft errors. If the single fault assumption is used, only the gates in the fanout cone of the error site are influenced and contribute to the error probability of the primary outputs, as illustrated in Figure 4. Therefore, an alternative to the above uniform replacement scheme arises: non-uniform replacement, where only the gates in the fanout cone of the error site are replaced with super-gates and all the others remain untouched. The drawback of this approach is that a different super-gate netlist must be generated for each error site, but it brings a large runtime reduction, especially for error sites near the primary outputs, because their fanout cones are rather small compared with the entire circuit.

C. Semi-super Gate Simplification

By investigating this new scheme further, we discover that for the gates at the boundary of the fanout cone of the error site, only one of the two inputs of an AND or OR gate can be erroneous, i.e. the other input is definitely error-free. This observation is very useful as it contributes further to the reduction of complexity and runtime. Recalling the EPF of the AND gate in Equation (1), and assuming the input $i$ is error-free, i.e. $i_f = 0$, this EPF can be simplified as follows:

$$l_f = i\,\bar{j}\,j_f + i\,j\,j_f = i\,j_f \qquad (7)$$
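The simplification in Equation (7) can be confirmed by brute force. The following Python sketch fixes $i_f = 0$ in the EPF of Equation (1) and checks that the result equals $i\,j_f$ for all remaining input combinations. The helper names are illustrative and are not part of the original implementation.

```python
from itertools import product

def epf_and(i, j, i_f, j_f):
    """Full EPF of l = i AND j, term by term as in Equation (1)."""
    terms = (
        i & (1 - j) & (1 - i_f) & j_f,
        (1 - i) & (1 - j) & i_f & j_f,
        (1 - i) & j & i_f & (1 - j_f),
        i & j & (i_f | j_f),
    )
    return int(any(terms))

def semi_epf_and(i, j_f):
    """Simplified EPF of Equation (7); valid only when input i is error-free."""
    return i & j_f

def simplification_holds():
    """Check Equation (7) for every combination of i, j and j_f with i_f = 0."""
    return all(
        epf_and(i, j, 0, j_f) == semi_epf_and(i, j_f)
        for i, j, j_f in product((0, 1), repeat=3)
    )

print(simplification_holds())
```

The simplified function no longer depends on the error-free value of $j$: an error on $j$ reaches the output exactly when the error-free side input $i$ is 1.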

As Figure 3(b) illustrates, the complex super-gate implementation of the EPF collapses to only two primitive AND gates, called a semi-super gate: one for the error-free function, the other for error propagation. Obviously, the longer the propagation path from the error site to the primary outputs, the more benefit we get from this semi-super gate simplification. Therefore, a better replacement scheme is illustrated in Figure 4:
• Original gate: not touched, as gates $U_{gi}$, $i = 0, 1, 2, 3$;
• Semi-super gate: replaced with the netlist of the semi-super gate, as gates $U_{si}$, $i = 0, 1, 2$;
• Super gate: replaced with the netlist of the full super-gate, as gates $U_{fi}$, $i = 0, 1, 2$.

Fig. 4. Error propagation path (gate labels Ue, Ug0 to Ug3, Us0 to Us2, Uf0 to Uf2; primary outputs PO1 to PO5).

The corresponding error probability estimation flow is described as follows:
1) Graph setup: the original netlist is parsed, then the corresponding graph is generated and topologically levelized;
2) Signal probability calculation: assuming independent and random primary inputs ($C_{PI_i,PI_j} = 1$ and $p(PI_i) = 0.5$), the signal probability of each gate and the correlation coefficients are calculated level by level;
3) Gate replacement type determination: for a specific error site, a graph algorithm is used to obtain the fanout cone, then all the gates at the same or larger levels than the error site are extracted as a new subgraph and tagged as one of the three replacement types (original, semi-super or super gate), depending on whether none, some, or all of the inputs of the gate reside in the fanout cone;
4) Graph transformation: according to the replacement type, the basic gates in the new subgraph are either left untouched or replaced by super/semi-super gates;
5) Error probability calculation: CCM is used to traverse this new subgraph: the error probability of the error site is set to 1.0 and the correlation coefficients of its error signal with all the other error-free signals at the same level are set to 1.0 (independent error occurrence).

In this way, the error-free signal probabilities and correlation coefficients at different levels obtained in Step 2 can be reused for different error sites.
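Steps 1 and 3 of this flow can be sketched in a few lines of Python. The example below uses a plain dictionary instead of the igraph library, and the netlist, the gate names and the helper functions are hypothetical. Excluding the error site itself from the tagging is an assumption of this sketch, since its error probability is simply set to 1.0.

```python
from collections import deque

# Hypothetical netlist: gate output name -> list of input signal names.
# Names that never appear as keys (a, b, c, d) are primary inputs.
NETLIST = {
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "g3": ["g1", "g2"],
    "g4": ["g1", "d"],
    "g5": ["g3", "g4"],
}

def levelize(netlist):
    """Topological level of every signal; primary inputs are at level 0."""
    level = {}
    def visit(sig):
        if sig not in level:
            ins = netlist.get(sig, [])
            level[sig] = 1 + max(map(visit, ins)) if ins else 0
        return level[sig]
    for gate in netlist:
        visit(gate)
    return level

def fanout_cone(netlist, site):
    """All gates reachable from the error site, including the site itself."""
    fanout = {}
    for gate, ins in netlist.items():
        for sig in ins:
            fanout.setdefault(sig, []).append(gate)
    cone, queue = {site}, deque([site])
    while queue:
        for gate in fanout.get(queue.popleft(), []):
            if gate not in cone:
                cone.add(gate)
                queue.append(gate)
    return cone

def classify(netlist, site):
    """Tag gates at the same or larger level than the error site."""
    level, cone = levelize(netlist), fanout_cone(netlist, site)
    tags = {}
    for gate, ins in netlist.items():
        if gate == site or level[gate] < level[site]:
            continue  # error site handled separately (assumption of this sketch)
        hit = sum(sig in cone for sig in ins)
        tags[gate] = ("original" if hit == 0
                      else "super" if hit == len(ins) else "semi-super")
    return tags

print(classify(NETLIST, "g1"))
```

For the error site `g1`, the gate `g2` stays original, `g3` and `g4` each have one input in the fanout cone and become semi-super gates, and `g5` reconverges both erroneous paths and needs the full super-gate.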
The error probability of primary output $p_e(PO_i)$ is the signal probability of the corresponding error signal $PO_{if}$, i.e. $p_e(PO_i) = p(PO_{if})$.

D. Possible Extensions

Although fanout cone extraction and semi-super gate identification can partially alleviate the complexity problem, the gate replacement scheme alters the original graph structure, thereby introducing the overhead of graph transformation and levelization for each error site. A better approach is to preserve the original netlist structure and derive probability and correlation formulas for the super, semi-super and original gate pairs. In addition, a preliminary analysis of the EPF in Equation (1) reveals that the four terms in the formula are mutually exclusive, i.e. no two of them can be true at the same time, which can be used to simplify the probability and correlation computation for error signals. This kind of logical exploration helps to avoid unnecessary runtime overhead and provides more potential for speedup. In summary, employing super-gate formulas and exploring the logical properties of the EPF are two promising ways to reduce the computational complexity and provide better scalability.

IV. EXPERIMENTAL RESULTS

The proposed approach was implemented in C++ using the igraph library [17]. Experiments were performed for several 74 series and ISCAS'85 combinational benchmarks on a workstation with an Intel Xeon E5540 2.53GHz and 16GB RAM. The benchmark circuits are synthesized using the primitive AND, OR and INV gates. For benchmark circuits with small number of primary inputs (