A PERTURBATION THEORY ON STATISTICAL QUANTIZATION EFFECTS IN FIXED-POINT DSP WITH NON-STATIONARY INPUTS

Changchun Shi and Robert W. Brodersen
Berkeley Wireless Research Center, Department of EECS, University of California, Berkeley

ABSTRACT

A quantitative characterization of statistical quantization effects in a discrete-time digital system is derived as a function of fixed-point data types. The analysis is based on a perturbation theory approach which assumes that the quantization noise is a small deviation from the ideal characteristics and that the system does not contain decision-error propagation. The theory works for both linear and nonlinear digital systems with either stationary or non-stationary inputs. The approach is applied to an LMS adaptive filter in both the transient and steady-state periods.

1. INTRODUCTION

In order to reduce hardware costs, most implementations of digital systems rely on binary fixed-point (FP) number systems, either 2’s complement or unsigned magnitude, with roundoff or truncation quantization [1-7]. Existing studies attack the effect of this quantization on a case-by-case basis and normally handle input signals with simple statistical distributions [1-6]. Using three assumptions, we propose a technique to determine the effect of word-length quantization in a general system under possibly non-stationary input statistics.

A quantizer in our discussion is uniquely described by its fractional word-length, $W_{Fr}$, and its quantization mode, which can be simple truncation or some form of roundoff. Such a quantizer, acting on an input signal $x$, adds a quantization noise $e$ to the signal,

$$ e = Q[x] - x. \tag{1} $$

Exact analyses of the statistics of $e$ under a random input $x$ have been done, usually limited to simple linear systems such as a single multiplier or an FIR filter (see, e.g., [2]). Other work has extended the exact analysis to the LMS algorithm [3], based on a simple one-quantizer model.
Though these techniques are successful in the limited cases addressed, both analyses become difficult to apply for systems with long data paths and complicated feedback loops. A more general strategy has been to use a statistical approach where the following assumption is made:
Assumption A.1. Constant signals have constant quantization noise. Otherwise, the quantization noise is uniformly distributed over its possible range and is independent of (or uncorrelated with) other data, other quantization noise, and itself over time.

Based on A.1, statistical quantization effects of linear time-invariant systems (see, e.g., [1][4]) and of specific nonlinear systems such as adaptive filters (see, e.g., [5]) and CORDIC [6] have been studied, with impressively accurate results. However, a general approach applying this assumption to general nonlinear systems is not available. Also, quantization effects with non-stationary inputs have been difficult to analyze using past techniques. A more general solution is presented here using two additional assumptions that are widely satisfied in practical systems. Based on these results, it is possible to reduce the exponential computational complexity [7] associated with this problem to a polynomial one. Examples are given to demonstrate our theory, including the quantization effects in the transient period of an LMS adaptive filter with correlated inputs.

2. PERTURBATION THEORY

2.1. Categorizing signals and blocks

Digital signal processing systems are constructed by interconnecting functional operators such as adders and multiplexers. Quantizers in a fixed-point (FP) system can be used to reduce the accuracy of some signals associated with these functional units from infinite precision (IP) to limited precision. The signals that are allowed to have reduced accuracy will be called arithmetic signals. Signals that are already discrete and are not modified by quantizers will be termed logical signals. Assuming one output for each operator, operators in an IP system can be separated into three types:

1. Arithmetic operator: the output is an arithmetic signal, such as an adder or a delay in an FIR or LMS filter.
2. Logical operator: all the inputs and outputs are logical, such as an AND gate in control logic.
3. Decision-making operator: some of the inputs are arithmetic and the output is logical, such as the final slicer in a communication system or a comparator in a CORDIC that decides the angle-shift direction.
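Assumption A.1 can be illustrated numerically. The following Python sketch is a minimal illustration, not part of the original analysis: the helper names `quantize` and `noise_stats`, the Gaussian test input, and the choice of 8 fractional bits are all assumptions made here. It measures e = Q[x] − x from (1) for both quantization modes and compares the sample mean and variance with the uniform-noise values, which for a step Δ = 2^(−WFr) are a mean of 0 (roundoff) or −Δ/2 (truncation) and a variance of Δ²/12.

```python
import math
import random


def quantize(x, wfr, mode="round"):
    """Quantize x onto a grid of step 2**-wfr (wfr fractional bits)."""
    scaled = x * (1 << wfr)
    if mode == "round":
        q = math.floor(scaled + 0.5)   # roundoff to the nearest grid point
    elif mode == "truncate":
        q = math.floor(scaled)         # 2's-complement truncation (toward -inf)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return q / (1 << wfr)


def noise_stats(wfr, mode, n=100_000, seed=0):
    """Sample mean and variance of e = Q[x] - x for a Gaussian input."""
    rng = random.Random(seed)
    errs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # illustrative input distribution
        errs.append(quantize(x, wfr, mode) - x)
    mean = sum(errs) / n
    var = sum((e - mean) ** 2 for e in errs) / n
    return mean, var


WFR = 8
STEP = 2.0 ** -WFR
results = {mode: noise_stats(WFR, mode) for mode in ("round", "truncate")}

for mode, (mean, var) in results.items():
    # Report in units of the step so the model values are 0 or -0.5, and 1/12.
    print(f"{mode:9s} mean/step = {mean / STEP:+.4f}  "
          f"var/step^2 = {var / STEP ** 2:.4f}  (model: {1 / 12:.4f})")
```

The input here varies over many quantization steps, which is the regime in which the uniform model is expected to hold; a constant input would instead give a constant error, as A.1 states.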
2.2. Definitions

In a statistical model, the inputs of a system are random processes, whereas system operators are deterministic operations that process this possibly non-stationary random data. An operator, denoted by $F$, has a finite number, $K$, of inputs that form a random vector $(x_1(t), x_2(t), \dots, x_K(t))$ or, in a simpler notation, $(x_1, x_2, \dots, x_K)_t$, at time $t$. Assuming the system starts at time 0, the output of a causal operator $F$ depends on all its previous and current inputs, $\{(x_1, x_2, \dots, x_K)_t, (x_1, x_2, \dots, x_K)_{t-1}, \dots, (x_1, x_2, \dots, x_K)_0\}$, more compactly denoted as $(\xi_1, \dots, \xi_M)$ with 1-1 correspondence in order, where $M = K \cdot (t+1)$. We call $(\xi_1, \dots, \xi_M)$ the expanded variables of $(x_1, x_2, \dots, x_K)$ at time $t$. Now, the output of $F$ at time $t$, defined by the transfer function $f_F(x_1, x_2, \dots, x_K, t)$, becomes

$$ f_F(x_1, x_2, \dots, x_K, t) = \phi_{F,t}(\xi_1, \dots, \xi_M), \tag{2} $$

where the function $\phi_{F,t}$ uniquely characterizes the operator. For example, a timing operator $G$, such as a down-sampler, has its output at $t$ equal to its single input at another time $g(t)$. Then $f_G(x, t) = \phi_{G,t}(\xi_1, \dots, \xi_M) = \xi_{t-g(t)+1}$. In (2), $\phi_{F,t}$ with subscript $t$ is a deterministic function of the random expanded variables $(\xi_1, \dots, \xi_M)$.

2.3. Smooth operators

An operator $F$ is called smooth at time $t$ over its arithmetic inputs in an open set, or simply smooth, if the function $\phi_{F,t}(\xi_1, \dots, \xi_M)$ is continuous and differentiable to any desired degree over an open set of its arithmetic signals, regardless of the realization of its logical signals. Arithmetic and decision-making operators may be smooth over their arithmetic inputs. Logical operators are not smooth because their logical inputs have only discrete levels. Basic arithmetic operators such as the multiplier are smooth over all regions, while some operators such as the reciprocal and the absolute value are smooth only in the restricted regions $(-\infty, 0) \cup (0, +\infty)$. A smooth operator following a smooth operator forms a combined operator that is also smooth over the input region in which both are smooth.

2.4. Additional assumptions

Assumption A.2. The fractional word-length of each quantizer is sufficiently large that the quantization noise caused by each quantizer is sufficiently small.

We will explain what “sufficient” means after A.3. Here, the quantizers are those after arithmetic operators in the FP system. Since quantization noise is strictly bounded by the quantization step size, A.2 is easily satisfied. If the quantization is so aggressive that the FP system behaves significantly differently from the IP system, it is basically a new algorithm.

Quantization effects from quantizers that modify only arithmetic signals may accumulate and alter the decision of a decision-making block in the FP system, which causes further quantization effects that are difficult to analyze. In this paper, we rule them out with Assumption A.3.

Assumption A.3. Every arithmetic and decision-making operator in a causal discrete-time system has its arithmetic inputs in smooth regions of the operator.

One inference of A.2 and A.3 is that, when every $W_{Fr}$ is infinitely large, the signals with quantization noise are still in the smooth region of all the operators and produce small perturbations at their outputs. Thus, no decision-error propagation will occur and the total quantization effect is only a small perturbation of the basic IP system. In practice, all $W_{Fr}$'s need only be large enough to make this true, which defines “sufficient” in A.2. Determining whether this condition is satisfied requires careful consideration of the system. Almost all existing analytical work on quantization effects studies systems that satisfy A.3. Most of them, such as LTI systems and most adaptive filters, have no logical signals (hence, no decision-making or logical blocks). Others, e.g. [6], implicitly assume no decision errors.

2.5. Quantitative perturbations
[Figure 1 (block diagram). Top: $S_{IP}$, an IP system $S$ with arithmetic operators, is equivalent to $\bar{S}$, an IP system with all operators in $S$ and additional arithmetic adders, when the added q-error inputs are zero. Bottom: $S_{FP}$, the FP version of $S$ with arithmetic operators and quantizers, is equivalent to the same $\bar{S}$ driven by the input and the q-error inputs.]

Fig. 1. With Assumption A.1, an FP system $S_{FP}$ can be treated as an IP system with changes on some error input signals.
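The equivalence in Fig. 1 can be made concrete with a toy system. The sketch below is illustrative only: the two-operator system (a multiplier followed by a square-and-add stage), the roundoff quantizer `q_round`, and all function names are assumptions introduced here, not taken from the paper. It checks that the FP system gives the same output as the adder-augmented IP system when the error inputs equal e_i = Q[v_i] − v_i, and that zero error inputs recover the IP system.

```python
import math


def q_round(x, wfr):
    """Roundoff quantizer with wfr fractional bits."""
    step = 2.0 ** -wfr
    return math.floor(x / step + 0.5) * step


def s_ip(x1, x2, x3):
    """Infinite-precision system: y = (x1 * x2)^2 + x3."""
    p = x1 * x2
    return p * p + x3


def s_fp(x1, x2, x3, wfr):
    """Fixed-point version: a quantizer follows each arithmetic stage."""
    p = q_round(x1 * x2, wfr)
    return q_round(p * p + x3, wfr)


def s_bar(x1, x2, x3, e1, e2):
    """IP system with each quantizer replaced by an adder '+ e_i'."""
    p = x1 * x2 + e1
    return p * p + x3 + e2


def error_inputs(x1, x2, x3, wfr):
    """Error signals e_i = Q[v_i] - v_i seen by each quantizer in s_fp."""
    v1 = x1 * x2
    p = q_round(v1, wfr)
    e1 = p - v1
    v2 = p * p + x3
    e2 = q_round(v2, wfr) - v2
    return e1, e2


x1, x2, x3, wfr = 0.7, -1.3, 0.25, 8
e1, e2 = error_inputs(x1, x2, x3, wfr)

y_fp = s_fp(x1, x2, x3, wfr)
y_bar = s_bar(x1, x2, x3, e1, e2)       # same as y_fp up to float rounding
y_zero = s_bar(x1, x2, x3, 0.0, 0.0)    # zero error inputs
y_ip = s_ip(x1, x2, x3)

print(f"S_FP = {y_fp:.10f}, S_bar(e) = {y_bar:.10f}")
print(f"S_IP = {y_ip:.10f}, S_bar(0) = {y_zero:.10f}")
```

Note that the second error input depends on the first through the quantized intermediate value; the statistical treatment that follows relies on A.1 to model these error inputs as independent noise rather than computing them from the signals.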
We can now treat a system as a joint smooth operator on its arithmetic signals since, if all the internal operators operate in their smooth regions, the combined operator is also smooth. Let the original IP system be $S_{IP}$ and the final FP system be $S_{FP}$. From A.1, we can replace all quantizers in the FP system with adders that introduce the quantization noise, to get a new system $\bar{S}$. Let the bold letters $\mathbf{\bar{S}}$, $\mathbf{S}_{IP}$ and $\mathbf{S}_{FP}$ be the operators associated with $\bar{S}$, $S_{IP}$ and $S_{FP}$, respectively. With only adders inserted and the noise inputs at 0, $\bar{S}$ has its internal signals identical to those in $S_{IP}$. So, $\mathbf{\bar{S}}$ satisfies A.3 and is smooth on both the original arithmetic inputs of $S_{IP}$ and the quantization noise. Denote the transfer function of $\mathbf{\bar{S}}$ as $f_{\bar{S}}$, its signal inputs as $(x_1, x_2, \dots, x_K)$, and the error inputs as $(e_1, e_2, \dots, e_L)$, and call their expanded variables $(\xi_1, \dots, \xi_M)$ and $(\varepsilon_1, \dots, \varepsilon_N)$, respectively. Fig. 1 shows that, under A.1-3,

$$ f_{S_{FP}}(x_1, x_2, \dots, x_K, t) = \phi_{\bar{S},t}(\xi_1, \dots, \xi_M, \varepsilon_1, \dots, \varepsilon_N), $$

and

$$ f_{S_{IP}}(x_1, x_2, \dots, x_K, t) = \phi_{\bar{S},t}(\xi_1, \dots, \xi_M, 0, \dots, 0). \tag{3} $$

Doing a Taylor expansion of the smooth function $\phi_{\bar{S},t}(\xi_1, \dots, \xi_M, \varepsilon_1, \dots, \varepsilon_N)$ over its arithmetic input signals $(\varepsilon_1, \dots, \varepsilon_N)$ around their IP values $(0, \dots, 0)$, we get the following expansion up to its 2nd-order terms,

$$ \phi_{\bar{S},t}(\xi_1, \dots, \xi_M, \varepsilon_1, \dots, \varepsilon_N) = \phi_{\bar{S},t}(\xi_1, \dots, \xi_M, 0, \dots, 0) + \sum_{i=1}^{N} \frac{\partial \phi_{\bar{S},t}}{\partial \varepsilon_i} \cdot \varepsilon_i + \frac{1}{2} \sum_{i,j=1}^{N} \frac{\partial^2 \phi_{\bar{S},t}}{\partial \varepsilon_i \, \partial \varepsilon_j} \cdot \varepsilon_i \varepsilon_j, \tag{4} $$

where all the partial derivatives are evaluated at $\varepsilon_1 = 0, \dots, \varepsilon_N = 0$ (same below). Applying (3) to both sides of (4), we get

$$ f_{S_{FP}}(x_1, x_2, \dots, x_K, t) = f_{S_{IP}}(x_1, x_2, \dots, x_K, t) + \sum_{i=1}^{N} \frac{\partial \phi_{\bar{S},t}}{\partial \varepsilon_i} \cdot \varepsilon_i + \frac{1}{2} \sum_{i,j=1}^{N} \frac{\partial^2 \phi_{\bar{S},t}}{\partial \varepsilon_i \, \partial \varepsilon_j} \cdot \varepsilon_i \varepsilon_j. \tag{5} $$

With A.1, the entries of $(\varepsilon_1, \dots, \varepsilon_N)$ are mutually independent and are independent of $(\xi_1, \dots, \xi_M)$; so, taking the expectation of both sides of (5), and using the identity $E[a \cdot b] = E[a] \cdot E[b]$ when $a$ and $b$ are statistically independent, we get

$$ E[f_{S_{FP}}(x_1, x_2, \dots, x_K, t)] = E[f_{S_{IP}}(x_1, x_2, \dots, x_K, t)] + \sum_{i=1}^{N} E\!\left[\frac{\partial \phi_{\bar{S},t}}{\partial \varepsilon_i}\right] \cdot \mu_i + \frac{1}{2} \sum_{i,j=1}^{N} E\!\left[\frac{\partial^2 \phi_{\bar{S},t}}{\partial \varepsilon_i \, \partial \varepsilon_j}\right] \cdot \mu_i \mu_j + \frac{1}{2} \sum_{i=1}^{N} E\!\left[\frac{\partial^2 \phi_{\bar{S},t}}{\partial \varepsilon_i^2}\right] \cdot \sigma_i^2, \tag{6} $$

where only the first two orders of the expansion are kept, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $\varepsilon_i$. Now we can switch back to the mean $u_i$ and standard deviation $s_i$ of the quantization noise $e_i$ directly. This is done by replacing all $\mu$ and $\sigma$ in (6) with $u$ and $s$ according to the correspondence in the definition of the expanded variables, and then collecting all the coefficients of the same $u_i$, $s_i^2$ and $u_i u_j$. The result is

$$ E[f_{S_{FP}}(x_1, x_2, \dots, x_K, t) - f_{S_{IP}}(x_1, x_2, \dots, x_K, t)] = \sum_{i=1}^{L} m_i(t) \cdot u_i + \sum_{i=1}^{L} h_i(t) \cdot s_i^2 + \sum_{i,j=1}^{L} n_{i,j}(t) \cdot u_i u_j. \tag{7} $$

References [4] and [8] give simple and detailed expressions for $u_i$ and $s_i$ in terms of the FP parameters $W_{Fr}$ and quantization mode using A.1. Equation (7) gives a deterministic relationship of the effect of quantization noise on an arbitrary digital system, under Assumptions A.1-3. Therefore, the result and the procedure work even with non-stationary inputs with general statistical distributions, as well as for transient analysis of a system under stationary inputs.

Starting from (5), we also find that the mean-square error (MSE) of $(f_{S_{FP}} - f_{S_{IP}})$ is simply

$$ E[(f_{S_{FP}}(x_1, x_2, \dots, x_K, t) - f_{S_{IP}}(x_1, x_2, \dots, x_K, t))^2] = \mathbf{u}^T B(t)\, \mathbf{u} + \sum_{i=1}^{L} c_i(t) \cdot s_i^2, \tag{8} $$

where $B(t)$ is an $L$-by-$L$ symmetric matrix and $\mathbf{u}$ is the column vector formed by $(u_1, \dots, u_L)^T$. The fact that the MSE has to be non-negative means that the matrix $B(t)$ has to be positive semi-definite and each $c_i(t)$ has to be non-negative. In general, it may be necessary to study

$$ E\big[\, g\big( f_{S_{FP}}(x_1, \dots, x_K, t) - f_{S_{IP}}(x_1, \dots, x_K, t),\ t \big) \big], \tag{9} $$

where $g$, as a smooth function, may have memory in order to study output correlation over time. This can be done by treating a compound system, ($S_{IP} - S_{FP}$) followed by the smooth system built using the function $g$, as our new FP system, with its IP version simply being $g$ processing all-zero inputs.
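The forms (7) and (8) can be checked on a toy case. The sketch below is a minimal illustration under assumptions made here, not an example from the paper: the system is a single multiplier whose two inputs are each truncated (so L = 2 and the error means are nonzero), the inputs are correlated Gaussians, and the names `truncate` and `uniform_model` are hypothetical. For this system the first derivatives with respect to the two error inputs are x2 and x1, and the only nonzero second derivative is the cross term, so the coefficients in (7) and (8) reduce to input moments that can be compared against a Monte Carlo estimate.

```python
import math
import random


def truncate(x, wfr):
    """Truncate x toward minus infinity onto a grid of step 2**-wfr."""
    scale = 2.0 ** wfr
    return math.floor(x * scale) / scale


def uniform_model(wfr):
    """Mean u and variance s^2 of truncation noise under A.1."""
    step = 2.0 ** -wfr
    return -step / 2.0, step * step / 12.0


WFR1, WFR2, N = 8, 6, 100_000
rng = random.Random(0)

sum_d = sum_d2 = 0.0
sx1 = sx2 = sx11 = sx22 = sx12 = 0.0
for _ in range(N):
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = 1.0 + g1                      # illustrative correlated inputs
    x2 = 0.5 + 0.5 * g1 + g2
    d = truncate(x1, WFR1) * truncate(x2, WFR2) - x1 * x2   # f_FP - f_IP
    sum_d += d
    sum_d2 += d * d
    sx1 += x1
    sx2 += x2
    sx11 += x1 * x1
    sx22 += x2 * x2
    sx12 += x1 * x2

mean_sim, mse_sim = sum_d / N, sum_d2 / N
m1, m2 = sx1 / N, sx2 / N
m11, m22, m12 = sx11 / N, sx22 / N, sx12 / N

u1, v1 = uniform_model(WFR1)           # v_i stands for s_i^2
u2, v2 = uniform_model(WFR2)

# Form (7): first-order terms plus the second-order cross term u1*u2.
mean_pred = m2 * u1 + m1 * u2 + u1 * u2
# Form (8): u^T B u + c1*s1^2 + c2*s2^2 with
# B = [[E[x2^2], E[x1 x2]], [E[x1 x2], E[x1^2]]], c1 = E[x2^2], c2 = E[x1^2].
mse_pred = m22 * (u1 * u1 + v1) + m11 * (u2 * u2 + v2) + 2.0 * m12 * u1 * u2

print(f"mean: simulated {mean_sim:.3e}, predicted {mean_pred:.3e}")
print(f"MSE : simulated {mse_sim:.3e}, predicted {mse_pred:.3e}")
```

The small residual difference between simulation and prediction is expected: (8) neglects terms of third and higher order in the noise, and A.1 is itself an approximation.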
It is now necessary to find the values of the coefficients in Equations (7) and (8). Analytically carrying out the detailed procedure in this section provides the coefficients as explicit functions of system parameters and input statistics. Though theoretically feasible for all systems, the task can be mathematically difficult. As an alternative, we suggest an efficient computational approach based on digital simulations. By carefully setting up the FP parameters (hence all $u_i$ and $s_i$ are known) in a system and using digital simulations to estimate the corresponding quantization effects on the left side of (7) or (8), one numerical equation with these coefficients as unknown variables is obtained. This procedure is repeated with different setups of FP parameters until the number of equations exceeds the number of unknown coefficients, roughly on the order of $L^2$, where $L$ is the number of quantizers in the system; then, the coefficients can be solved numerically. A more accurate formulation of the problem as function data fitting gives the same conclusion: only about $L^2$ estimations are sufficient. Moreover, this computational procedure can be done automatically [8]. Without the theoretical results achieved in this paper, however, the number of digital estimations needed would be $(2l)^L$, where $l$ is the number of possible fractional word-lengths for each quantizer and 2 indicates two quantization modes [7].

3. LMS EXAMPLE

Reference [5] treats the transient analysis of an LMS algorithm with an uncorrelated input, and we will use its notation, except that we use a moving-average model of the input to include correlated-input effects,

$$ x(n) = \upsilon(n) + a \cdot \upsilon(n-1), \tag{10} $$

where $\upsilon(n)$ is an i.i.d. zero-mean innovation process with variance $\sigma_\upsilon^2$. Without loss of generality, we choose a one-tap LMS for clarity and set the optimal filter weight $w^* = 1$. Similarly, we assume only one quantizer, in roundoff mode, in the system, namely the one that quantizes the weight update $\alpha \cdot e(n) x(n)$, because it dominates when the adaptation coefficient $\alpha$ is small. Then, the only noise defined in [5] that is left in our example is
$\eta_w$, with zero mean and variance $s_w^2$. Following the analytical procedure in Section 2, we calculate that, under the small update coefficient $\alpha$ model, the expected value of the filter weight misadjustment, defined as $\rho(n)$, is

$$ \rho(n) = \frac{1 - \left(1 - 2\alpha \cdot (1 + a^2) \cdot \sigma_\upsilon^2\right)^{2n}}{\left(1 - 2\alpha \cdot (1 + a^2) \cdot \sigma_\upsilon^2\right) \cdot 2\alpha \cdot (1 + a^2) \cdot \sigma_\upsilon^2} \, s_w^2. \tag{11} $$
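The dependence of (11) on the word length can be explored numerically. The sketch below is a minimal simulation under stated assumptions, not the setup used for Fig. 2: it takes the misadjustment to be the ensemble mean of (w_FP − w_IP)², uses Gaussian innovations, a desired signal d(n) = w*·x(n) without measurement noise, a zero initial weight, and hypothetical helper names. Since (11) depends on the word length only through the noise variance, which under A.1 is 2^(−2·WFr)/12 for roundoff, adding one fractional bit should reduce the simulated misadjustment by roughly a factor of 4.

```python
import math
import random


def q_round(x, wfr):
    """Roundoff quantizer with wfr fractional bits."""
    step = 2.0 ** -wfr
    return math.floor(x / step + 0.5) * step


def misadjustment(wfr, n_steps=200, n_runs=1000, a=2.0, alpha=0.001,
                  w_opt=1.0, seed=1):
    """Ensemble estimate of E[(w_FP(n) - w_IP(n))^2] at n = n_steps."""
    rng = random.Random(seed)          # same seed: same inputs for every wfr
    total = 0.0
    for _ in range(n_runs):
        w_ip = w_fp = 0.0
        v_prev = rng.gauss(0.0, 1.0)
        for _ in range(n_steps):
            v = rng.gauss(0.0, 1.0)
            x = v + a * v_prev         # moving-average input, as in (10)
            v_prev = v
            d = w_opt * x              # desired signal (assumed noise-free)
            w_ip += alpha * (d - w_ip * x) * x
            # Single quantizer: only the weight update is rounded.
            w_fp += q_round(alpha * (d - w_fp * x) * x, wfr)
        total += (w_fp - w_ip) ** 2
    return total / n_runs


rho_15 = misadjustment(15)
rho_16 = misadjustment(16)
ratio = rho_15 / rho_16

print(f"rho(WFr=15) = {rho_15:.3e}")
print(f"rho(WFr=16) = {rho_16:.3e}")
print(f"ratio       = {ratio:.2f}  (uniform-noise model suggests about 4)")
```

The run length is kept short so that the weight update stays well above the quantization step; close to convergence in this noise-free setup the update becomes comparable to the step and the uniform-noise model of A.1 is no longer a good description.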
The result in (11) agrees with the format of (8), noting that $L = 1$ and the noise mean $u_w = 0$, and it reveals the only unknown coefficient in (8). Fig. 2 shows (11) in dashed lines, which agree very well with the simulated ensemble-average estimations in solid lines. On the other hand, with $L = 1$ and $u_w = 0$, the computational approach at the end of Section 2 suggests that only one computational estimation, based on one realization of the quantizer fractional word-length $W_{Fr}$, is sufficient to characterize $\rho(n)$ for all other possible realizations. Thus, the estimation when $W_{Fr} = 15$ gives the only unknown coefficient left in (8) as a function of time and predicts $\rho(n)^{1/2}$ for the $W_{Fr} = 16$ case. Fig. 2 shows this curve with “+” signs, which again agrees very well with simulation.

Moreover, when $a = 0$, (11) degenerates to the uncorrelated-input case and agrees with (4.8) of [7]. Our approach is more general in that we can solve for quantization effects during the LMS transient period when the inputs are correlated.

[Figure 2 (plot): curves for $W_{Fr} = 15$ and $W_{Fr} = 16$.]

Fig. 2. Square root of weight misadjustment in the LMS transient period. Theoretical results are from (11). Here, $a = 2$ (correlated input), $\alpha = 0.001$, and $\sigma_\upsilon^2 = 1$.

4. CONCLUSION

While only one simple example was shown, vastly more complex linear and nonlinear systems can be analyzed automatically using the approach described here. In summary, we have developed a theory that is widely applicable and provides a general understanding of the critical dependencies of quantization noise effects.

5. ACKNOWLEDGEMENTS

This work was sponsored by DARPA and the SIA under the MARCO focus centers program, as well as by the sponsors of the Berkeley Wireless Research Center.

6. REFERENCES

[1] L. B. Jackson, Digital Filters and Signal Processing: With MATLAB Exercises, 3rd ed. Boston: Kluwer Academic Publishers, 1996.
[2] P. W. Wong, “Quantization and roundoff noises in fixed-point FIR digital filters,” IEEE Trans. Signal Processing, vol. 39, pp. 1552-1563, July 1991.
[3] N. J. Bershad and J. C. M. Bermudez, “A nonlinear analytical model for the quantized LMS algorithm-the power-of-two step size case,” IEEE Trans. Signal Processing, vol. 44, pp. 2895-2900, Nov. 1996.
[4] C. Shi, “Statistical method for floating-point conversion,” Master's thesis, Department of EECS, Univ. of California, Berkeley, 2002. (Advisor: Robert W. Brodersen.)
[5] S. T. Alexander, “Transient weight misadjustment properties for the finite precision LMS algorithm,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, pp. 1250-1258.
[6] S. Y. Park and N. I. Cho, “Fixed point error analysis of CORDIC processor based on the variance propagation,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 2, pp. 565-568, Apr. 2003.
[7] S. Kim, K. Kum and W. Sung, “Fixed-point optimization utility for C and C++ based digital signal processing programs,” IEEE Trans. Circuits Syst. II: Analog and Digital Signal Processing, vol. 45, pp. 1455-1464, 1998.
[8] C. Shi and R. W. Brodersen, “An automated floating-point to fixed-point conversion methodology,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 2, pp. 529-532, Apr. 2003.