Improving SNR for DSM Linear Systems Using Probabilistic Error Correction and State Restoration: A Comparative Study

Maryam Ashouei, Soumendu Bhattacharya, and Abhijit Chatterjee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332
{ashouei, soumendu, chat}@ece.gatech.edu

Abstract
Smaller feature sizes and lower supply voltages make DSM devices more susceptible to soft errors generated by alpha particles and neutrons, as well as to other sources of environmental noise. In this scenario, soft-error/noise-tolerant techniques are necessary for maintaining the SNR of critical DSP applications. This paper studies linear DSP circuits and discusses two low-cost techniques for improving the SNR of DSP filters. Both techniques use a single checksum variable for error detection. This gives a distance-two code that is traditionally good for error detection but not correction. In this paper, such a code is used to improve SNR rather than to remove the error entirely. The first technique, 'checksum-based probabilistic error correction', uses the value indicated by the checksum variable to probabilistically correct the error and achieves up to 5 dB improvement in the SNR value. The second technique, 'state restoration', works well when the length of burst errors is small and the error magnitude is large. General error statistics are defined as a random process, and the distribution of SNR is compared for the two proposed techniques.
1. Introduction
In the last decade, technology scaling has spurred the growth of high-performance digital circuits while making them more susceptible to soft errors. Technology scaling increases the vulnerability of a circuit to soft errors for several reasons. First, feature size reduction lowers the average node capacitance, so the voltage fluctuation at a node due to a particle strike is larger. Second, supply voltage reduction in every technology generation shrinks noise margins and aggravates the soft error problem. Third, the increase in clock frequency raises the chance of a soft error being latched and propagated to a primary output. Moreover, due to shorter pipeline stages, the number of gates through which a soft error propagates (and hence attenuates) is smaller. Therefore, the probability of a soft error being masked in a modern high-performance digital system is becoming increasingly small compared to earlier technologies. Combinational circuits can mask soft errors through logical masking, electrical masking, and latching-window masking [1], [2]. But all three masking phenomena lose their effectiveness as technology scales down, resulting in an increase in the soft error rate (SER) of combinational circuits. The SER of logic circuits is expected to rise by nine orders of magnitude between 1992 and 2011, when it will
Proceedings of the Eleventh IEEE European Test Symposium (ETS'06) 0-7695-2566-0/06 $20.00 © 2006 IEEE
equal the SER of unprotected memory elements. Currently, the flip-flops (registers) of a digital system are the components most vulnerable to soft errors. Hardware and software redundancy, such as duplication/triplication or concurrent error detection, is used in critical applications to alleviate or eliminate soft errors [3]. However, the high area and power cost associated with these techniques makes them impractical for more general applications. More cost-effective techniques such as time redundancy [4], partial duplication [5], and algorithm-based fault tolerance methods ([6], [7], [8]) have been proposed in the past. However, these techniques incur extra delay overhead due to the use of checking circuitry, as well as system-level delay overhead due to pipeline flushing. A circuit-level technique proposed in [9] makes a circuit more resilient to soft errors by adding capacitive loading to the primary outputs of a circuit. A more recent technique, proposed in [10], dynamically controls the soft error tolerance of a digital circuit through adaptive supply and threshold voltage modulation as well as the use of variable capacitance banks. In [11], algorithmic noise-tolerance techniques for reliable digital signal processing are studied. This paper aims at improving the SNR of linear systems by probabilistically correcting the error, instead of completely eliminating it, using minimum hardware and delay overhead. A distance-two checksum code is used to detect and probabilistically correct transient errors latched into the circuit flip-flops. Such codes are traditionally good only for single error detection, not correction. In this paper, the above probabilistic checksum-based error correction scheme and another scheme based on "state restoration" are discussed and compared. In the latter, the checksum code is used to first detect the error.
After the error is detected, the current (erroneous) state value is restored to the previous (error-free) state value and computation is resumed. The criteria under which one of these two techniques performs better than the other are also discussed. The rest of the paper is organized as follows. In the next section, a discussion of linear digital state variable systems is presented. Next, the real-number checksum coding technique for detection and probabilistic correction is presented. The state restoration technique is described
next. Then experimental results, comparing checksum-based probabilistic correction with state restoration and no correction, are presented. In the last section, conclusions and future work are discussed.

2. Linear Digital State Variable Systems
Linear digital state variable systems can be used to represent linear time-invariant systems such as digital filters. The general form of a state variable system is similar to the Huffman representation of a sequential circuit, with the combinational block replaced by a module that computes a linear matrix transformation. This module is a network of basic computational elements, such as adders, multipliers, and shifters, and feeds the system primary outputs and flip-flops. The processing is purely arithmetic, and therefore inputs, outputs, and states represent numerical values. Let (u1…um) and (y1…yw) be the primary inputs and primary outputs of the linear state variable system, respectively. If s(t) = [s1(t), s2(t), …, sn(t)]T is the state vector and u(t) = [u1(t), u2(t), …, um(t)]T is the input vector at time t, then the system function can be represented by the following equations:

s(t+1) = A·s(t) + B·u(t+1)
y(t+1) = C·s(t) + D·u(t+1)    (1)

where the A, B, C, D matrices represent arithmetic operations performed on the current state variables, s(t), and the m primary inputs, to generate the next system states, s(t+1), and the w primary outputs, y(t+1). Soft errors are assumed to cause bit flips on the outputs of components in the computational block or in the system states. If an error occurs in the computational block, some states and outputs may become erroneous. In the case of an error in the system states, the error may stay in the system for multiple clock cycles and can propagate to other states as well as to the primary outputs. An error in a primary output does not propagate to other outputs or states and disappears after one clock cycle.
For this reason, the paper focuses only on detecting and correcting errors in the system states.

3. Checksum-Based Error Detection
Real-number codes can be used for error detection and error correction in linear digital state variable systems [13]. The state vector, s(t), is encoded using one or more check variables. The idea is briefly described below. A coding vector, CV = [α1, α2, …, αn], is used to encode the A and B matrices such that X = CV·A and Y = CV·B. A check variable c, corresponding to each coding vector, is computed as: c(t+1) = X·s(t) + Y·u(t+1). If there is no error in the system, c(t+1) = CV·s(t+1). An error signal, e, can be computed as: e(t+1) = CV·s(t+1) − c(t+1), and is zero in the absence of any error. A non-zero value of e(t+1) can be caused by an error either in the state computation, s(t+1), in the check
variable computation, c(t+1), or in the error signal computation, e(t+1). Without loss of generality, it is assumed that an error can only occur in the state computation. Given the complexity of the state computation relative to that of the check variable or error signal computation, this is a reasonable assumption. Furthermore, it is assumed that an error manifests itself in only one state of the system. If an error occurs in the time step (t, t+1), then the vector s(t+1) has the wrong value for one of its constituent state variables and e(t+1) is non-zero. The error signal is non-zero only for a single time step and returns to zero in the next time step. In the next section, we describe how the error signal value is used to probabilistically correct the system states such that the overall output SNR in the presence of injected soft errors is improved.

4. Checksum-Based Error Correction
A single check variable provides a distance-2 code and can detect a single error in the system, but is not sufficient for identifying which state variable is erroneous. In [12], the authors provide a method that can identify the erroneous state by using two check variables and by carefully selecting the coding vector for each check variable such that the ratio of the error magnitudes in the two checksums identifies the erroneous state. In this paper, the concept of probabilistic correction is used. Here, even though a distance-2 code is used, a single check variable can be used to probabilistically correct the error. Before describing the technique, a few notations and definitions are introduced. Let wi be the probability of the ith state being erroneous, where Σ_{i=1}^{n} wi = 1. If the error signal has the value e and the ith state is faulty, then the error value of the ith state is e/αi (αi is defined in Section 3). Let Δi be the error vector when the ith state is faulty. Then Δi is an n×1 vector whose ith element is e/αi and whose other elements are zero.
Let EV be an n×1 vector which indicates the errors in the state variable values, i.e., EV(i) is the error value of the ith state variable. If an error is detected, then the error vector (EV) is Δi with probability wi, for i = 1…n. Let ygood be the output signal when there is no error and yerr be the output when there is an error. The output noise signal is noise = ygood − yerr. The output noise power and the output signal-to-noise ratio (SNR) are defined as follows:

NoisePower = Σ_{i=1}^{T} noise(i)²    (2)

SNR = 10·log( var(ygood) / var(noise) )    (3)

where T is the duration of the measurement of the output signal and noise(i)² is the noise power component at time i. In the following, the output noise power is used as a metric to find the best correction vector for the
checksum-based probabilistic error correction technique. A correction scheme called the state restoration technique is also described.

4.1. No Error Correction
An error occurring during the time interval (t, t+1) in one of the system states results in a deviation of the state value, s(t+1), from its correct value, as given in equation (4). This deviation is represented by the error vector EV, as described before.

serr(t+1) = sgood(t+1) + EV    (4)

If there is no error correction, the error in the system states at time t+k+1, k cycles after its occurrence, assuming no other errors happen in between, can be computed as follows:

serr(t+k+1) = A^k·EV + sgood(t+k+1)    (5)

The error vector k cycles after the error occurrence is A^k·EV. Thus, if the system is stable, the errors in the system state variables disappear after m cycles, where A^m → 0.

4.2. Checksum-Based Probabilistic Error Correction at the System States
This scheme aims at probabilistically correcting the system states such that the output noise power is minimized. The architecture of the scheme is shown in Figure 1. Because of the delay overhead associated with the scheme, the clock period must accommodate the error detection module (ED) and the error correction module (EC) delays.
Figure 1. Checksum-based probabilistic state correction (ED is the error detection module and St+1 is erroneous)
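The detection path (ED) of this architecture computes exactly the Section 3 error signal. A minimal numerical sketch, using illustrative (hypothetical) system matrices and an all-ones coding vector:

```python
import numpy as np

# Checksum encoding per Section 3, on an illustrative (hypothetical) system;
# CV = [alpha_1, alpha_2] is the coding vector, all ones here.
A = np.array([[0.5, 0.1],
              [0.0, 0.4]])
B = np.array([[1.0],
              [0.5]])
CV = np.array([[1.0, 1.0]])

X = CV @ A                      # encoded state-update row: X = CV.A
Y = CV @ B                      # encoded input row:        Y = CV.B

s = np.array([[0.2], [0.3]])    # s(t)
u = np.array([[1.0]])           # u(t+1)

s_next = A @ s + B @ u          # state computation s(t+1)
c_next = X @ s + Y @ u          # check variable c(t+1), predicted from s(t)
e = (CV @ s_next - c_next).item()
print(e)                        # error signal: ~0 in the fault-free case

s_next[0, 0] += 0.1             # inject a soft error into s1
e = (CV @ s_next - c_next).item()
print(e)                        # now ~0.1 = alpha_1 * (state error)
```

Because the check variable is predicted from s(t) and u(t+1) through the encoded rows X and Y, any single-state deviation of magnitude d in state i shows up in the error signal as αi·d.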
The goal is to find a correction vector, V, derived from the error signal e(t+1), to be subtracted from the state vector at the time when an error is detected, before computing the outputs and the next states of the system. After the correction, the error in the system states is EV−V. The error in the states and the output k cycles after the correction are A^k·(EV−V) and C·A^k·(EV−V), respectively. The goal is to find V such that the average output noise power, as computed below, is minimized.

AverageNoise = Σ_{i=1}^{n} Σ_{k=0}^{m} wi·(C·A^k·(Δi − V))²    (6)

The solution to the minimization problem, assuming C·A^k ≠ 0 for k = 0…m, is given by:
V = Σ_{i=1}^{n} wi·Δi    (7)
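A sketch of this correction rule on the Table 1 filter, with equal weights wi = 1/n and αi = 1; the error-signal value e = 0.1 and the truncation horizon m = 60 are illustrative choices, not taken from the paper. By Equation (5), the output noise of a state error Δ is Σk (C·A^k·Δ)², so the noise before and after subtracting V can be compared directly:

```python
import numpy as np

# Equation (7) on the Table 1 filter, with equal weights w_i = 1/n and
# alpha_i = 1; e = 0.1 and the horizon m = 60 are illustrative choices.
A = np.array([[0.35, 0.00, 0.00],
              [0.62, -0.33, -0.86],
              [0.80, 0.86, -0.12]])
C = np.array([[0.11, 0.06, 0.08]])
n = 3
w = np.ones(n) / n
alpha = np.ones(n)

def output_noise(err, m=60):
    """Output noise power of a decaying state error: sum_k (C A^k err)^2."""
    total, v = 0.0, err
    for _ in range(m):
        total += (C @ v).item() ** 2
        v = A @ v
    return total

e = 0.1
deltas = [(e / alpha[i]) * np.eye(n)[:, [i]] for i in range(n)]  # Delta_i
V = sum(w[i] * deltas[i] for i in range(n))                      # Equation (7)

for i in range(n):
    print(f"s{i+1}: noise {output_noise(deltas[i]):.2e} -> "
          f"{output_noise(deltas[i] - V):.2e}")
```

Individual states may get worse after correction, but the w-weighted average noise never increases, since V is the probability-weighted mean of the candidate error vectors.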
In the special case when all state variables have the same probability of being erroneous, i.e., wi = 1/n for i = 1…n, and the coding vector elements are αi = 1 for i = 1…n, the correction vector is V = [1/n 1/n … 1/n]T. When minimizing the average noise power using the correction vector of Equation (7), it is possible that in some cases the noise power increases after correction. To prevent this negative impact of correction, two approaches are studied. The first approach adds boundary constraints to Equation (6) to guarantee that the noise power after correction is always below the noise power with no correction. With these constraints, the minimization problem can be stated as follows:

minimize AverageNoise = Σ_{i=1}^{n} Σ_{k=0}^{m} wi·(C·A^k·(Δi − V))²    (8)

subject to:

Σ_{k=0}^{m} (C·A^k·(Δi − V))² ≤ Σ_{k=0}^{m} (C·A^k·Δi)²,  i = 1…n    (9)
The second approach minimizes the average noise using the correction vector given by (7), followed by a post-processing step. The post-processing step finds the states for which the correction is not beneficial in terms of reducing output noise power. Errors in such states should not be detected by the checksum variable; the checksum variable is used to detect errors only in the remaining states. The checksum can be defined on a subset of states by putting zeros in the locations of CV corresponding to the non-monitored states. The approach is summarized below. When (6) is used only on a subset of states, the erroneous probability of the non-monitored states is divided among the remaining states, proportionally to their original erroneous probability. For instance, if the kth state, sk, is not being monitored, the erroneous probability of each remaining state, si, becomes: wi = wi + (wi·wk)/(Σ_{j=1…n} wj − wk).

S = {s1, s2, …, sn}
Repeat
    Stop = 0
    Use (6) on the set of states S to find the correction vector V
    For all si in S
        If Σ_{k=0}^{m} (C·A^k·(Δi − V))² ≥ Σ_{k=0}^{m} (C·A^k·Δi)²
            S = S − {si}
            Stop = 1
            Update wk for each sk in S
        End
    End
Until Stop == 0
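The pruning loop above can be sketched as follows, again on the Table 1 filter with equal initial weights and αi = 1 (the error value e = 0.1 and truncation horizon m = 60 are illustrative). On exit, every state still in S strictly benefits from the correction of (7):

```python
import numpy as np

# Subset-pruning sketch: states whose correction would not reduce output
# noise are dropped from the monitored set S and the weights renormalized.
A = np.array([[0.35, 0.00, 0.00],
              [0.62, -0.33, -0.86],
              [0.80, 0.86, -0.12]])
C = np.array([[0.11, 0.06, 0.08]])
n, e, m = 3, 0.1, 60

def cost(v):
    """Output noise power sum_k (C A^k v)^2, truncated at m terms."""
    total = 0.0
    for _ in range(m):
        total += (C @ v).item() ** 2
        v = A @ v
    return total

S = set(range(n))
w = {i: 1.0 / n for i in S}
while True:
    deltas = {i: e * np.eye(n)[:, [i]] for i in S}
    V = sum(w[i] * deltas[i] for i in S)          # Equation (7) on subset S
    bad = [i for i in S if cost(deltas[i] - V) >= cost(deltas[i])]
    if not bad:
        break                                     # every remaining state benefits
    S -= set(bad)                                 # stop monitoring these states
    total = sum(w[i] for i in S)
    w = {i: w[i] / total for i in S}              # redistribute probabilities
print("monitored states:", sorted(i + 1 for i in S))
```

The loop terminates because a singleton set is never pruned (its own correction cancels its error exactly), so at least one monitored state always survives.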
A third-order low-pass filter (Table 1) is used to show the noise power and the SNR improvement for the correction vector of (7) and the two approaches above. The input is a sinusoid with maximum amplitude 1 and frequency 10 Hz, with 1024 samples containing two periods of the waveform. An error of magnitude 0.1 is injected into each state (one at a time) at a fixed time.
Table 2 presents the SNR improvement and the noise power in each case.

Table 1. Low-pass filter state representation

A = [ 0.35   0      0
      0.62  −0.33  −0.86
      0.80   0.86  −0.12 ]

B = [2.6 1.2 1.6]T
C = [0.11 0.06 0.08]
D = [0.22]
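As a quick numerical check, the Table 1 matrices can be written out and the stability condition of Section 4.1 (A^m → 0) verified via the spectral radius of A; the 100-cycle horizon below is an illustrative choice.

```python
import numpy as np

# The Table 1 filter written out. A spectral radius of A below 1 is what
# makes an uncorrected state error die out (A^m -> 0, Section 4.1).
A = np.array([[0.35, 0.00, 0.00],
              [0.62, -0.33, -0.86],
              [0.80, 0.86, -0.12]])
B = np.array([[2.6], [1.2], [1.6]])
C = np.array([[0.11, 0.06, 0.08]])
D = np.array([[0.22]])

rho = float(np.max(np.abs(np.linalg.eigvals(A))))
print(f"spectral radius of A: {rho:.3f}")   # below 1, so the filter is stable

err = np.array([[0.1], [0.0], [0.0]])       # injected error of magnitude 0.1
for _ in range(100):
    err = A @ err                           # decay of Equation (5)
print(float(np.max(np.abs(err))))           # numerically negligible
```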
4.3. State Restoration Error Correction
This approach simply resets the state latches to their previous values whenever an error is detected. The technique requires an extra set of latches to hold the previous state values, so that they can be restored to the system latches when an error is detected. Experimental results presented in the next section compare the output SNR for the case of no correction, the case of checksum-based probabilistic correction of states, and the case of state restoration.

5. Experimental Results
The experimental results of this section are generated using the 3rd-order low-pass filter of Table 1. The error is considered to be a random process of four random variables, defined as follows and illustrated in Figure 2: 1) error magnitude (EM); 2) burst length (BL), the number of errors in a burst; 3) burst-to-burst time (BBT), i.e., the time interval between two bursts; and 4) error-to-error time (EET), the time interval between two consecutive errors in a burst. Additionally, the time of occurrence of the first burst is another random variable, called the error position.
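A sampler for this error model might look as follows. The μ(·, ·) entries of Table 3 are read here as normal(mean, std) distributions (an assumption), the other parameter choices follow Table 3, and `sample_error_times` is a hypothetical helper name:

```python
import numpy as np

# Sketch of the burst-error model: bursts of BL errors spaced EET cycles
# apart, with BBT cycles between bursts. Reading mu(., .) as a normal
# distribution is an assumption made for this sketch.
rng = np.random.default_rng(0)

def sample_error_times(T=1024):
    """Return (cycle, magnitude) pairs of injected errors over T cycles."""
    events = []
    t = int(rng.integers(1, T // 2))                         # error position
    while t < T:
        bl = max(1, int(round(float(rng.normal(10, 1)))))    # burst length
        eet = int(rng.uniform(1, 3))                         # error-to-error time
        for _ in range(bl):
            if t >= T:
                break
            events.append((t, float(rng.uniform(0, 0.1))))   # error magnitude
            t += eet
        t += max(1, int(round(float(rng.normal(150, 1)))))   # burst-to-burst time
    return events

events = sample_error_times()
print(len(events), events[:2])
```

Each sampled event would then be injected by adding the magnitude to a chosen state at the given cycle, as in the paper's MATLAB experiments.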
Figure 2. Error model

The third-order filter and the error model are used to compare the output SNR of the no-correction case, the checksum-based probabilistic correction case, and the state restoration case. Different error magnitudes, error positions, and burst lengths were studied. The SNR distributions under the error model are also presented. Finally, the experimental result for the case where only a subset of states is monitored by the checksum variable is also presented. It is assumed that the erroneous probability of all states is the same, i.e., wi = 1/n, where n is the number of states. The coding vectors are also assumed to be all ones, i.e., αi = 1 ∀i. For the case of checksum-based correction, the correction vector of equation (7) was used. In all experiments, the input is a sinusoid with maximum amplitude 1 and frequency 10 Hz, with 1024 samples containing two periods of the waveform. Also, errors are assumed to be in s1 unless specified otherwise. The filter, the error detection module, and the error correction module were implemented in MATLAB, and errors were injected into the system by modifying the magnitude of a state of the system.

To analyze the effect of error position, a single error (i.e., BL = 1 and BBT = ∞) with EM = 0.1 was introduced at all possible positions within a single period of the input. Figure 3 shows the results. The checksum-based correction SNR does not depend on the error position and provides a constant 3 dB improvement over the no-correction case. The state restoration SNR strongly depends on the error position: at those positions where the states have the smallest derivative, state restoration performs best.

Figure 3. SNR as a function of error position

Figure 4 shows the SNR values for EM = 0.1 and EM = 0.01. The plots for the SNR of the state restoration technique are identical and overlapping. The figure shows that the improvement of the checksum-based probabilistic correction over the no-correction case stays constant. It also shows that for a single error, regardless of the error position and error magnitude, checksum-based probabilistic correction results in a constant SNR improvement.

Figure 4. Effect of error position on SNR for various error magnitudes
Table 2: Noise power and SNR improvement for different correction vectors

Faulty | Noise power     | Correction vector at (7)           | First approach                     | Second approach
state  | (no correction) | V=[0.33, 0.33, 0.33]T, CV=[1,1,1]  | V=[0.33, 0.32, 0.36]T, CV=[1,1,1]  | V=[0.5, 0.5, 0]T, CV=[1,1,0]
       | (dBm)           | Noise (dBm) | SNR impr. (dB)       | Noise (dBm) | SNR impr. (dB)       | Noise (dBm) | SNR impr. (dB)
s1     | -80.8           | -83.5       | 2.7                  | -83.3       | 2.5                  | -87.8       | 7.0
s2     | -84.4           | -90.2       | 5.9                  | -90.0       | 5.7                  | -87.8       | 3.5
s3     | -83.9           | -83.7       | -0.2                 | -83.9       | 0.0                  | -83.9       | 0.0

Figure 5 shows SNR as a function of error magnitude for a single error. For the state restoration case, since the SNR is error-position dependent, the maximum and minimum possible SNR were obtained by injecting the errors at positions where the input signal derivative is minimum and maximum, respectively. The figure shows that for a single error, if the error magnitude is greater than 0.07, the worst-case SNR using state restoration is better than that of checksum-based probabilistic correction. Next, it is shown how the results change if the error is a burst error instead of a single error.

Figure 5. SNR as a function of error magnitude

Figure 6 shows the SNR of the different schemes as a function of burst length, with EM = 0.1. A single burst is assumed, i.e., BBT = ∞. Also, errors occur in consecutive cycles, i.e., EET = 1. Again, in the case of state restoration, maximum, minimum, and average plots are shown, since the SNR is error-position dependent. In Figure 6, all errors are assumed to be in s1. The figure shows that the SNR of state restoration drops drastically as the burst length increases. Although the SNR of the checksum-based probabilistic error correction also decreases with the burst length, the reduction is not as drastic as in state restoration, and it always achieves a positive SNR improvement over the no-correction case. Figure 6 also shows that the SNR improvement of checksum-based probabilistic correction over no correction initially increases with burst length, but stays almost constant at 5 dB for burst lengths greater than 20. Considering the worst-case SNR of the state restoration scheme, for burst lengths greater than four errors it is not useful to use the state restoration technique.

Figure 6. SNR as a function of the number of errors

In the analysis presented up to this point, the effects of error magnitude, burst length, and error position were studied individually, while the rest of the parameters defining the error characteristics were held constant. Here, a distribution is assumed for each of the random variables on which the error statistics depend, i.e., EM, BL, BBT, EET, and error position. For these error statistics (shown in Table 3), a distribution of the SNR of each scheme was obtained (Figure 7). Each histogram has 1000 data points. The top subfigure is for EM uniformly distributed in [0, 0.1] and the bottom for EM uniformly distributed in [0, 1]. The plots show that when the error magnitude sweeps a range of smaller values, the SNR distribution of checksum-based error correction has a higher mean (53 dB) than state restoration (48 dB) and the no-correction case (50.7 dB). But as EM covers a range of larger values, i.e., [0, 1], the SNR of the state restoration scheme outperforms the checksum-based probabilistic correction by a large margin of 15 dB in terms of their means. In addition, the state restoration SNR distribution has the largest standard deviation (1.7 dB), compared with 1.3 dB for the checksum-based scheme and 1.2 dB for the no-correction case.

Table 3: Different error parameter distributions (T is the duration of the output)

EM:             Uniform(0, 0.1) / Uniform(0, 1)
BL:             μ(10, 1)
BBT:            μ(150, 1)
EET:            ⌊Uniform([1, 3])⌋
Error position: ⌊Uniform([1, T/2])⌋
Figure 7. SNR distribution of the three different techniques

When analyzing the checksum-based probabilistic scheme, the checksum variable was used to detect errors on all three states of the filter. However, it is possible to use the checksum variable to detect errors only on a subset of states. The SNR using checksum-based probabilistic correction on all possible subsets is shown in Figure 8, assuming all states have the same probability of being erroneous. It shows that there are cases where only a subset of states is corrected and the overall SNR value is higher than when errors in all states are detected and corrected. For instance, the subsets {s1, s2} and {s1} increase the SNR by 2 dB and 3 dB, respectively, over the case where errors in all three states are corrected by the checksum variable. It was also seen that a hybrid approach of state restoration and checksum-based probabilistic correction may be used, depending on several factors, namely the error magnitude, the error burst length, and the error position.

6. Conclusion and Future Work
This paper presents two techniques, state restoration and checksum-based probabilistic correction. The effectiveness of the techniques was analyzed using a 3rd-order low-pass filter. It was shown that, depending on the error characteristics, one method outperforms the other: checksum-based probabilistic correction works better for smaller error magnitudes and longer burst errors, while state restoration works better for larger error magnitudes and shorter burst errors. The SNR distributions for each technique were also presented. It was shown that the state restoration SNR distribution is slightly wider (has a larger standard deviation) than that of the probabilistic checksum-based correction. Also, it was shown that in the case of checksum-based probabilistic correction, it can be beneficial to have the checksum variable detect and correct only a subset of the system states. The problem of finding the optimum subset is the subject of future research.
Figure 8. Checksum-based probabilistic correction when different subsets of states are being detected/corrected using the checksum
References
[1] P. E. Dodd and L. W. Massengill, "Basic mechanisms and modeling of single-event upset in digital microelectronics," IEEE Transactions on Nuclear Science, vol. 50, 2003, pp. 583-602.
[2] P. Shivakumar, et al., "Modeling the effect of technology trends on the soft error rate of combinational logic," Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 389-398.
[3] M. Nicolaidis and Y. Zorian, "On-line testing for VLSI - a compendium of approaches," Journal of Electronic Testing: Theory and Applications (JETTA), vol. 12, 1998, pp. 7-20.
[4] M. Nicolaidis, "Time redundancy based soft-error tolerance to rescue nanometer technologies," Proceedings of the VLSI Test Symposium, 1999, pp. 86-94.
[5] K. Mohanram and N. A. Touba, "Cost-effective approach for reducing soft error failure rate in logic circuits," Proceedings of the International Test Conference, 2003, pp. 893-901.
[6] K. H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. C-33, pp. 518-528, June 1984.
[7] J. Y. Jou and J. A. Abraham, "Fault-tolerant FFT networks," IEEE Transactions on Computers, vol. 37, pp. 548-561, May 1988.
[8] L. N. Reddy and P. Banerjee, "Algorithm-based fault detection for signal processing applications," IEEE Transactions on Computers, vol. 39, no. 10, pp. 1304-1308, October 1990.
[9] Y. S. Dhillon, et al., "Soft-error tolerance analysis and optimization of nanometer circuits," Proceedings of Design, Automation and Test in Europe, 2005, pp. 288-293.
[10] U. Diril, et al., "Design of adaptive nanometer digital systems for effective control of soft error tolerance," Proceedings of the VLSI Test Symposium, pp. 298-303.
[11] N. R. Shanbhag, "Reliable and energy-efficient digital signal processing," Proceedings of the Design Automation Conference, 2002, pp. 830-835.
[12] A. Chatterjee and M. A. d'Abreu, "The design of fault-tolerant linear digital state variable systems: theory and techniques," IEEE Transactions on Computers, vol. 42, 1993, pp. 794-808.
[13] V. S. Nair and J. A. Abraham, "Real-number codes for fault-tolerant matrix operations on processor arrays," IEEE Transactions on Computers, vol. 39, 1990, pp. 426-435.