IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 10, OCTOBER 2001
Difference Metric Soft-Output Detection: Architecture and Implementation

Warren J. Gross, Student Member, IEEE, Vincent C. Gaudet, Student Member, IEEE, and P. Glenn Gulak, Senior Member, IEEE

Abstract—The forward–backward (FB, also known as the MAP or BCJR) detection algorithm provides "soft" reliability estimates for each bit that it decodes. This paper presents a VLSI architecture for soft-output forward–backward detection of Class-IV partial response (PR4) signaling used in magnetic recording. A difference metric version of the FB algorithm is derived. A novel low-complexity architecture implements the computational kernel as a limiter. A 0.35-µm three-level-metal CMOS ASIC was implemented and verified to operate at 20 MHz (20 Mbps), the highest speed of our IC tester. Simulations predict operation at up to 150 Mbps.

Index Terms—CMOS digital integrated circuits, digital communication, MAP estimation, partial response signaling, very large scale integration.
I. INTRODUCTION

THE forward–backward (FB) detection algorithm (also known as the maximum a posteriori (MAP) algorithm) provides "soft" reliability estimates for each bit that it decodes [1], [2]. Although the algorithm was ignored for many years because of its perceived complexity, interest in it has been reignited in recent years in the wake of exciting new discoveries such as turbo codes [3]. Class-IV partial response (PR4) signaling is commonly used as a model for intersymbol interference in hard drive read channels. We have developed a low-complexity VLSI architecture for soft-output PR4 detection using the FB algorithm.

The FB algorithm differs from the Viterbi algorithm [4], [5] in that it performs symbol-by-symbol MAP detection instead of maximum-likelihood (ML) sequence detection. In applications where only the "hard" bit decisions are needed, this distinction does not justify the increased computational complexity of the FB algorithm. The simulated bit error rate (BER) curves shown in Fig. 1 are a motivation for developing a soft-output detector for the PR4 channel. We will consider a classical serially concatenated system [6]. The information sequence is first encoded with a 4-state convolutional code, dispersed by a random interleaver, PR4 encoded, and sent across an additive white Gaussian noise (AWGN) channel. The first step in decoding would be to apply a Viterbi algorithm matched to the PR4 signaling to the received sequence. Since the Viterbi algorithm can only produce hard outputs, some vital information has been thrown away.

Manuscript received September 20, 2000; revised September 9, 2001. This work was supported by the Natural Sciences and Engineering Research Council of Canada. This paper was recommended by Associate Editor G. Cauwenberghs. The authors are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, M5S 3G4 Canada (e-mail: [email protected]).
Fig. 1. BER performance of a serial concatenation of a 4-state convolutional code with PR4 signaling in an AWGN channel. The interleaver length is 1024 bits.
Performance could be improved by providing soft information about the PR4 decisions to the Viterbi decoder associated with the convolutional code. The simulation results of Fig. 1 show that a soft-output PR4 detector using the FB algorithm provides a 1.8-dB improvement over a hard-output PR4 detector at equal BER.

Section II describes the FB algorithm. Section III derives a difference metric version of the FB algorithm for Class-IV partial response channels. Section IV presents a novel low-complexity VLSI architecture based on a simplification of the algorithm in Section III. An IC implementation is presented in Section V. Section VI concludes the paper.

II. THE FB ALGORITHM

A. Background

It can be useful for a decoding algorithm to provide an estimate of the reliability of the decoded bits. The "soft" reliability values can be used to adjust the decoding algorithm to provide better performance. As far back as 1973, Forney proposed the use of "augmented outputs" from the Viterbi algorithm as a measure of the reliability of the decoding process [5]. Forney's heuristic idea of using the difference in state metrics between the best path and the next shortest path led to the soft-output Viterbi algorithm (SOVA) described by Battail [7] and Hagenauer and Hoeher [8]. It is important to define what is meant by soft information. The
reliability of a decoded bit is best described by the a posteriori probability (APP). For an estimate of the bit u_k (which can take on the value +1 or −1), having received the observation y, we define the optimum soft output as

    λ_k = ln [ P(u_k = +1 | y) / P(u_k = −1 | y) ]                                  (1)

which is called the log-likelihood ratio (LLR). The LLR is a convenient measure since it encapsulates both soft and hard bit information in one number. The sign of the number corresponds to the hard decision while the magnitude gives a reliability estimate. The LLR can be easily computed by noticing that a decoding algorithm that uses the maximum a posteriori (MAP) rule inherently calculates the required APPs. MAP algorithms have been proposed by several authors [1], [2], [9] but were generally ignored because the Viterbi algorithm can provide nearly identical hard outputs with less computational effort.

B. Description of the FB Algorithm

This description of the algorithm is based on [2] and [10], to which the reader is referred for a detailed derivation. The algorithm is based on the same trellis as the Viterbi algorithm. It is performed on a block of received symbols, which corresponds to a trellis with a finite number of stages N. We will choose the transmitted bits u_k from the set {+1, −1}. Upon receiving the symbol y_k from the AWGN channel with noise variance σ², we calculate the branch probability of the transition from state s′ to state s as

    γ_k(s′, s) = exp( −(y_k − x_k(s′, s))² / (2σ²) )                                 (2)

where x_k(s′, s) is the expected symbol along the branch from state s′ to state s. The algorithm consists of three steps.

• Forward recursion. The forward state probability of being in each state of the trellis at each time k, given the knowledge of all the previous received symbols, is recursively calculated and stored

    α_k(s) = Σ_{s′} α_{k−1}(s′) γ_k(s′, s).                                          (3)

The recursion is initialized by forcing the starting state to state 0 and setting

    α_0(0) = 1,   α_0(s) = 0 for s ≠ 0.                                              (4)

• Backward recursion. The backward state probability of being in each state of the trellis at each time k, given the knowledge of all the future received symbols, is recursively calculated and stored

    β_{k−1}(s′) = Σ_{s} β_k(s) γ_k(s′, s).                                           (5)

The recursion is initialized by forcing the ending state to state 0 and setting

    β_N(0) = 1,   β_N(s) = 0 for s ≠ 0.                                              (6)

The trellis termination condition requires the entire block to be received before the backward recursion can begin.

• Log-likelihood ratio calculation. The output LLR for each symbol at time k is calculated as

    λ_k = ln [ Σ_{u_k = +1} α_{k−1}(s′) γ_k(s′, s) β_k(s) ]
          − ln [ Σ_{u_k = −1} α_{k−1}(s′) γ_k(s′, s) β_k(s) ]                        (7)

where the upper summation is over all branches with input label "+1" and the lower summation is over all branches with input label "−1."

C. The FB Algorithm in the Logarithmic Domain

The FB algorithm was virtually ignored for many years in part because of the difficulty of implementing efficient exponentiation and multiplication. If the algorithm is implemented in the logarithmic domain, like the Viterbi algorithm, then the multiplications become additions and the exponentials disappear. Addition is transformed according to the rule described in [11] and [12]. The additions are replaced using the Jacobi logarithm

    max*(x, y) ≜ ln(e^x + e^y) = max(x, y) + ln(1 + e^{−|x − y|})                    (8)

which is called the max* operation, to denote that it is essentially a maximum operator adjusted by a correction factor. The second term, a function of the single value |x − y|, can be precalculated and stored in a small lookup table with negligible effect on performance [12].

The FB algorithm will now be restated in the logarithmic domain. As with the Viterbi algorithm, logarithms of probabilities are referred to as metrics. Define the new quantities:

• Branch metrics

    Γ_k(s′, s) = ln γ_k(s′, s)                                                        (9)

• Forward state metrics

    A_k(s) = ln α_k(s)                                                                (10)

• Backward state metrics

    B_k(s) = ln β_k(s).                                                               (11)

The branch metric calculation eliminates the exponential

    Γ_k(s′, s) = −(y_k − x_k(s′, s))² / (2σ²).                                        (12)

The forward state metric recursion becomes

    A_k(s) = max*_{s′} ( A_{k−1}(s′) + Γ_k(s′, s) )                                   (13)

with initial conditions

    A_0(0) = 0,   A_0(s) = −∞ for s ≠ 0.                                              (14)

The backward state metric recursion becomes

    B_{k−1}(s′) = max*_{s} ( B_k(s) + Γ_k(s′, s) )                                    (15)
with initial conditions

    B_N(0) = 0,   B_N(s) = −∞ for s ≠ 0.                                              (16)

The dynamic range of the metrics is much smaller than that of the associated probabilities. The computational kernel of the algorithm is analogous to the add–compare–select (ACS) operation in the Viterbi algorithm, adjusted by an offset. The log-likelihood ratio calculation becomes

    λ_k = max*_{u_k = +1} ( A_{k−1}(s′) + Γ_k(s′, s) + B_k(s) )
          − max*_{u_k = −1} ( A_{k−1}(s′) + Γ_k(s′, s) + B_k(s) )                     (17)

where each max* is taken over all branches with the indicated input label.

Fig. 2. PR4 detection using two time-interleaved 1 − D detectors.

Fig. 3. A two-state trellis.
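To make the log-domain recursions concrete, the following Python sketch implements the metric computation (12), the recursions (13)–(16), and the output calculation (17) for a small terminated trellis. It is only an illustrative model of the algorithm as restated above; the helper names (max_star, fb_llrs) and the branch-list trellis description are our own assumptions, not notation from the paper.

    import numpy as np

    def max_star(a, b):
        # Jacobi logarithm, eq. (8): ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a - b|).
        return max(a, b) + np.log1p(np.exp(-abs(a - b)))

    def fb_llrs(y, trellis, sigma2):
        # Log-domain forward-backward detection over a terminated trellis.
        # `trellis` is a list of branches (s_prev, s_next, x, u), where x is the
        # expected channel symbol and u in {+1, -1} is the input label.  States
        # are numbered 0..S-1 and state 0 is the forced starting/ending state.
        S = 1 + max(max(b[0], b[1]) for b in trellis)
        N = len(y)
        NEG = -1e30                                   # stands in for -infinity

        # Branch metrics, eq. (12).
        G = [[-(y[k] - x) ** 2 / (2.0 * sigma2) for (_, _, x, _) in trellis]
             for k in range(N)]

        # Forward recursion, eqs. (13)-(14).
        A = np.full((N + 1, S), NEG)
        A[0, 0] = 0.0
        for k in range(N):
            for i, (sp, sn, _, _) in enumerate(trellis):
                A[k + 1, sn] = max_star(A[k + 1, sn], A[k, sp] + G[k][i])

        # Backward recursion, eqs. (15)-(16).
        B = np.full((N + 1, S), NEG)
        B[N, 0] = 0.0
        for k in range(N - 1, -1, -1):
            for i, (sp, sn, _, _) in enumerate(trellis):
                B[k, sp] = max_star(B[k, sp], B[k + 1, sn] + G[k][i])

        # Log-likelihood ratios, eq. (17).
        llrs = []
        for k in range(N):
            num, den = NEG, NEG
            for i, (sp, sn, _, u) in enumerate(trellis):
                m = A[k, sp] + G[k][i] + B[k + 1, sn]
                if u > 0:
                    num = max_star(num, m)
                else:
                    den = max_star(den, m)
            llrs.append(num - den)
        return llrs

For the two-state 1 − D trellis used later in the paper, one possible branch list (with state 0 standing for the −1 state and state 1 for the +1 state) is trellis = [(0, 0, 0, -1), (0, 1, +2, +1), (1, 0, -2, -1), (1, 1, 0, +1)].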
III. THE DIFFERENCE METRIC FB ALGORITHM

It is well known that the Viterbi algorithm for the PR4 channel can be implemented using a difference metric, where only one state metric per trellis stage is calculated [13]. A difference metric FB algorithm (we will call it the DMFB), analogous to the difference metric Viterbi algorithm, is developed here. The DMFB is a soft-output detection algorithm for two-state partial response systems such as PR4 signaling.

A. Class-IV Partial Response

As hard disk recording densities increase, the pulses detected by the magnetic read head interfere with each other, introducing intersymbol interference (ISI). Modern disk drive read channels use a known model for the ISI to combat it in the receiver with advanced signal processing. Partial response signaling is a technique where a controlled amount of ISI is introduced into a communications system to shape the spectrum of the transmitted signal [13]–[15]. For example, partial response systems can be realized to eliminate frequency components that the channel cannot efficiently transmit, such as dc or high frequencies. The ISI can be modeled by a finite impulse response (FIR) filter (a tapped feedforward shift register), much like a convolutional encoder with real-valued instead of modulo-2 addition. The output of the encoder is therefore multilevel. Partial response FIR filters have the general transfer function

    h(D) = h_0 + h_1 D + h_2 D² + ··· + h_L D^L                                      (18)

where D is a unit delay. Class-IV partial response (PR4) signaling for magnetic recording was proposed in [14] and has the transfer function

    h(D) = 1 − D².                                                                    (19)

The transfer function can be broken down into two independent time-interleaved 1 − D functions. The detection problem is therefore reduced to detecting unit-memory ISI, which can easily be done with two-state Viterbi detectors as shown in Fig. 2.
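The decomposition of (19) into two interleaved 1 − D channels can be checked directly. The short Python sketch below models a noiseless PR4 channel and verifies that its even- and odd-indexed outputs are exactly the outputs of two independent 1 − D channels; the function names are illustrative and the zero initial channel memory is an assumption.

    import numpy as np

    def pr4_channel(a):
        # Noiseless PR4, eq. (19): x_k = a_k - a_{k-2} (zero initial memory).
        a = np.asarray(a, dtype=float)
        x = a.copy()
        x[2:] -= a[:-2]
        return x

    def dicode_channel(a):
        # Noiseless 1 - D: x_k = a_k - a_{k-1} (zero initial memory).
        a = np.asarray(a, dtype=float)
        x = a.copy()
        x[1:] -= a[:-1]
        return x

    a = np.random.choice([-1.0, +1.0], size=20)      # random bipolar input
    x = pr4_channel(a)

    # The even and odd subsequences of the PR4 output are the outputs of two
    # independent 1 - D channels driven by the even and odd input subsequences,
    # which is why two-state detectors can be time-interleaved as in Fig. 2.
    assert np.allclose(x[0::2], dicode_channel(a[0::2]))
    assert np.allclose(x[1::2], dicode_channel(a[1::2]))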
B. Derivation of the Algorithm

Consider the partial response system with the transfer function

    h(D) = 1 − cD                                                                     (20)

where c is a channel-dependent constant. This corresponds to a two-state trellis as shown in Fig. 3. We label the states "+1" and "−1." The input symbols can take the values +1 and −1, and the time index is k. The branch weights are given by (21) and correspond to the probability that the branch between states s′ and s has been taken, given that the symbol y_k has been received. The forward recursion can be expressed as

    α_k(+1) = α_{k−1}(+1) γ_k(+1, +1) + α_{k−1}(−1) γ_k(−1, +1)
    α_k(−1) = α_{k−1}(+1) γ_k(+1, −1) + α_{k−1}(−1) γ_k(−1, −1)                       (22)

where α_k(s) corresponds to the probability of being in state s at time k given knowledge of all the previous received symbols. The backward recursion can be expressed as

    β_{k−1}(+1) = β_k(+1) γ_k(+1, +1) + β_k(−1) γ_k(+1, −1)
    β_{k−1}(−1) = β_k(+1) γ_k(−1, +1) + β_k(−1) γ_k(−1, −1)                           (23)

where β_k(s) corresponds to the probability of being in state s at time k given knowledge of all the future received symbols. We write the expression for the soft output explicitly as (24). Factoring out the backward metrics from (24) and substituting in (22), we get (25), which can be written as

    λ_k = ΔA_k + ΔB_k                                                                 (26)

where ΔA_k = ln [ α_k(+1) / α_k(−1) ] and ΔB_k = ln [ β_k(+1) / β_k(−1) ]. We see that the algorithm is reduced to recursively calculating a single forward metric ΔA_k and a single backward metric ΔB_k, and adding them together.

To derive the forward difference metric recursion, use (22) to write the ratio α_k(+1)/α_k(−1) in terms of α_{k−1}(±1) and the branch weights (27). Substituting the expressions for the branch weights from (21), followed by dividing through and simplifying, we get (28). In terms of the difference metric and max* notation

    ΔA_k = max*( ΔA_{k−1} + Γ_k(+1, +1), Γ_k(−1, +1) )
           − max*( ΔA_{k−1} + Γ_k(+1, −1), Γ_k(−1, −1) ).                             (29)

Similarly, the backward recursion is

    ΔB_{k−1} = max*( ΔB_k + Γ_k(+1, +1), Γ_k(+1, −1) )
               − max*( ΔB_k + Γ_k(−1, +1), Γ_k(−1, −1) ).                             (30)

For a block-based decoder, the starting and ending states are forced to the same known state, so one state metric in each recursion starts at −∞ and the initial difference metrics ΔA_0 and ΔB_N are infinite in magnitude. Note that these infinite initial values are never actually used in the calculation of the soft output, since the trellis is terminated and only the symbols within the block need to be decoded. Substituting ΔA_0 and ΔB_N into the expressions for ΔA_1 and ΔB_{N−1}, we get (31), which eliminates the need for infinite values in the calculations. If the data is received in a continuous stream rather than in blocks, the ending state is not known a priori. In this case, the ending states are equiprobable and ΔB_N = 0.
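As a behavioral check of the recursions (29), (30) and the combination (26), the following Python sketch runs the DMFB over one 1 − D interleave (c = 1). The boundary treatment shown (both difference metrics initialized to zero, i.e., equiprobable boundary states as in the streaming case just described) and the helper names are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    def max_star(a, b):
        # Jacobi logarithm, eq. (8).
        return max(a, b) + np.log1p(np.exp(-abs(a - b)))

    def branch_metric(y_k, a_prev, a_k, sigma2):
        # Gamma_k(a_prev, a_k) for a 1 - D interleave: expected symbol x = a_k - a_prev.
        x = a_k - a_prev
        return -(y_k - x) ** 2 / (2.0 * sigma2)

    def dmfb_llrs(y, sigma2):
        # Difference-metric FB detection for one 1 - D interleave of PR4.
        N = len(y)
        dA = np.zeros(N + 1)        # Delta A_0 = 0: equiprobable start (assumption)
        dB = np.zeros(N + 1)        # Delta B_N = 0: equiprobable end (streaming case)

        # Forward difference-metric recursion, eq. (29).
        for k in range(1, N + 1):
            gpp = branch_metric(y[k - 1], +1, +1, sigma2)   # x = 0
            gmp = branch_metric(y[k - 1], -1, +1, sigma2)   # x = +2
            gpm = branch_metric(y[k - 1], +1, -1, sigma2)   # x = -2
            gmm = branch_metric(y[k - 1], -1, -1, sigma2)   # x = 0
            dA[k] = (max_star(dA[k - 1] + gpp, gmp)
                     - max_star(dA[k - 1] + gpm, gmm))

        # Backward difference-metric recursion, eq. (30).
        for k in range(N, 0, -1):
            gpp = branch_metric(y[k - 1], +1, +1, sigma2)
            gmp = branch_metric(y[k - 1], -1, +1, sigma2)
            gpm = branch_metric(y[k - 1], +1, -1, sigma2)
            gmm = branch_metric(y[k - 1], -1, -1, sigma2)
            dB[k - 1] = (max_star(dB[k] + gpp, gpm)
                         - max_star(dB[k] + gmp, gmm))

        # Soft outputs, eq. (26): lambda_k = Delta A_k + Delta B_k.
        return dA[1:] + dB[1:]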
The DMFB is equivalent to explicitly calculating both state metrics and then subtracting the smaller of the two from both to normalize them. The redundant value of zero does not need to be propagated; the DMFB is therefore self-normalizing. Overflow can occur at large values of the SNR, since there is a division by a very small noise variance in the branch metrics. This issue will be considered, and a solution proposed, in Section IV.

IV. VLSI ARCHITECTURE

A. The MAX-DMFB Simplification

We have described a difference metric formulation of the FB algorithm. Below we develop a low-complexity VLSI architecture for the difference metric FB algorithm (we will call this the MAX-DMFB architecture). We can approximate the max* operator with the max operator and ignore the correction factor from (8).
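The correction term in (8) is bounded by ln 2 ≈ 0.69 and shrinks quickly as the two arguments separate, which is why dropping it costs so little. A quick numerical check (purely illustrative, not from the paper):

    import math

    def max_star(a, b):
        return max(a, b) + math.log1p(math.exp(-abs(a - b)))

    for a, b in [(0.0, 0.0), (1.0, -2.0), (5.0, 4.5)]:
        # The gap between max* and max is at most ln(2), reached only when a == b.
        print(a, b, max(a, b), max_star(a, b))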
Fig. 4. Bit-error rate performance of a serial concatenation of a 4-state convolutional code with PR4 signaling in an additive white Gaussian noise channel. The PR4 decoder is an FB decoder using the MAX approximation to the MAX* operator.

Fig. 4 shows the results of simulating the concatenated system described in Section I using both the max* operator and the proposed approximation using the max operator. A max-based detector performs nearly as well as a max*-based detector. If the max approximation is used, the forward recursion becomes (for PR4, c = 1)

    ΔA_k = max( ΔA_{k−1} + Γ_k(+1, +1), Γ_k(−1, +1) )
           − max( ΔA_{k−1} + Γ_k(+1, −1), Γ_k(−1, −1) )                               (32)

for which there are four distinct cases, depending on which argument each of the two max operations selects. Since the inputs to the two max operations are related through ΔA_{k−1} and the branch metrics, this imposes a constraint that makes the fourth case impossible. By considering the three possible cases, the forward recursion becomes

    ΔA_k = t_k^+        if ΔA_{k−1} > t_k^+
         = ΔA_{k−1}     if t_k^- ≤ ΔA_{k−1} ≤ t_k^+                                   (33)
         = t_k^-        if ΔA_{k−1} < t_k^-

and the backward recursion (34) takes the same limiter form with ΔB_k as its input, where the threshold levels t_k^+ and t_k^- are differences of the branch metrics at time k, as given in (35).

Fig. 5. The limiter used in the recursion equations.

Fig. 6. Using a limiter saves an adder in the critical path of the recursion hardware.

Equation (33) can be interpreted as the limiter in Fig. 5. At each recursion step, either the input or one of the two threshold levels is copied to the output, eliminating any numerical error propagation. The limiter enables us to build fast circuits. To see this more clearly, we can rederive the same result from a hardware viewpoint. Fig. 6 shows the hardware implementation of (32) as well as the limiter of (33). The "naive" circuit in Fig. 6(a) uses redundant hardware to implement the impossible fourth case above. The final subtractor is needed only when the point marked in Fig. 6(a) is nonzero, i.e., when (36) holds. This implies that the corresponding node will always take the value given in (37), which in turn implies (38). Therefore the output reduces to (39), which is just one of the inputs, eliminating the need for the final subtractor. The critical path of the resulting limiter is just one adder and two multiplexors. Since only the MSBs of the subtractions are used in the limiter, the adders can be replaced by comparator circuits, resulting in a compare–select–select (CSS) operation.

The presence of the noise variance in the branch metric expressions complicates the implementation of the MAX-DMFB: the word length of the metrics needs to accommodate the smallest value of the noise variance. This problem can be eliminated by removing the noise-variance term from the branch metrics, so that each branch metric takes the form in (40). This is a valid solution since the new state metric at each stage is either one of the current branch metrics or the previous state metric. Therefore, each state metric will take on the value of either the initial state metric or one of the branch metrics, each of the form in (40). There is therefore no cumulative error in this approximation, and the new soft output will simply be the desired value scaled by the removed noise-variance factor.
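The limiter form of the forward recursion can be sketched directly in software. In the snippet below, the two threshold levels are computed as differences of branch metrics, which is one way to realize the clipping behavior of (33); the specific threshold expressions and names are our assumptions, derived for the 1 − D interleave, and any common positive scaling of the branch metrics (such as dropping the noise-variance factor, as argued above) only rescales the outputs.

    def limiter(value, lo, hi):
        # The computational kernel of Fig. 5: copy the input or clip it to a threshold.
        return min(max(value, lo), hi)

    def max_dmfb_forward(y, branch_metric):
        # Forward MAX-DMFB recursion as a chain of limiters, in the spirit of (33).
        # branch_metric(y_k, a_prev, a_k) returns the branch metric Gamma_k(a_prev, a_k).
        dA = [0.0]                                    # equiprobable start (assumption)
        for y_k in y:
            t_hi = branch_metric(y_k, +1, +1) - branch_metric(y_k, +1, -1)
            t_lo = branch_metric(y_k, -1, +1) - branch_metric(y_k, +1, +1)
            dA.append(limiter(dA[-1], t_lo, t_hi))
        return dA

For example, one possible simplified branch metric (with the noise-variance scaling removed, as argued above) is branch_metric = lambda y_k, a_prev, a_k: y_k * (a_k - a_prev) - (a_k - a_prev) ** 2 / 2.0.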
B. Sliding-Window Architectures

The sliding-window architecture [16] can be applied to the MAX-DMFB. If the backward recursion is started from an arbitrary point in the sequence with the "all-zero" vector as the initial value, then after W steps the state metrics will converge to the correct values [16]. Simulations show that a value of W = 9 is required for our system. The pipelined version of this architecture is shown in Fig. 7.
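A behavioral sketch of the sliding-window idea (not of the pipelined hardware in Fig. 7): for each symbol, the backward difference metric is recomputed over a window of W future samples starting from zero and then added to the running forward metric. It reuses limiter and max_dmfb_forward from the previous sketch, and the backward threshold expressions are our assumptions, mirroring the forward ones.

    def sliding_window_soft_out(y, branch_metric, W=9):
        # Sliding-window MAX-DMFB: the soft output for symbol k uses a depth-W
        # backward limiter recursion initialized with Delta B = 0 (the "all-zero" start).
        N = len(y)
        dA = max_dmfb_forward(y, branch_metric)
        out = []
        for k in range(1, N + 1):
            dB = 0.0                                  # neutral starting value
            for j in range(min(k + W, N), k, -1):     # walk back over the window
                y_j = y[j - 1]
                t_hi = branch_metric(y_j, +1, +1) - branch_metric(y_j, -1, +1)
                t_lo = branch_metric(y_j, +1, -1) - branch_metric(y_j, +1, +1)
                dB = limiter(dB, t_lo, t_hi)
            out.append(dA[k] + dB)                    # combine as in eq. (26)
        return out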
Fig. 7. The sliding-window MAX-DMFB architecture.

Fig. 8. Parallel processing of the backward recursion using (a) a lookup table (LUT) or (b) a tree.
Fig. 9. Replacing the action of multiple interval limiters with a single equivalent interval limiter.

An interesting computational challenge will now be considered. The backward recursion is not really necessary in one sense, because all of the inputs to the calculation (the branch metrics and the initial difference metric) are always available in the shift register at once. Therefore, in theory at least, all possible inputs and outputs could be tabulated in a very large lookup table [see Fig. 8(a)] [17]. Of course, a lookup table that large could never be practically built, and if it could, the access time would still be a function of the table size. A compromise is to use a tree-like structure, with a number of stages that grows as log₂ W, as proposed in [18] for the Viterbi algorithm [see Fig. 8(b)]. The question remains of how to implement the new type of processing element required. Fortunately, the limiter concept we introduced provides an interesting and simple solution.

The concept is derived by a simple analogy with a high-rise building that has developed a leak in its roof (see Fig. 9). If each floor also has a hole in it, then, assuming the floor dips toward the hole, the water will run along the floor until it falls into the hole. If the holes on two adjacent floors overlap, then the water can possibly pass right through both holes without ever creating a puddle on the floor! The question arises as to where the water falls in the basement. The chain of limiters in our problem is like the stack of floors in the building. Each floor represents the number line, with the linear part of the limiter represented by the hole and the cutoff regions represented by the concrete floor. The relative location of the hole is determined by the value of the input symbols by way of the branch metrics. If all the floors in the building were torn out and replaced by a single floor with one hole in the correct place, then the puddle in the basement would still be in the same place. In other words, the chain of limiters in the backward recursion can be replaced by a single limiter with the equivalent action.

Fig. 10. The interval adjustment unit (IAU). (a) Symbol. (b) Example of IAU operation. (c) Algorithm. (d) Circuit.

The lookup table approach calculates the equivalent overall limiter in one step, while the tree approach uses on the order of log₂ W steps by considering pairs of adjacent limiters. The new limiter functions are then paired up and the calculation is repeated until the overall limiter function is known. We will call one of these new processors an interval adjustment unit (IAU), as shown in Fig. 10.
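Functionally, an IAU composes two interval limiters into the single equivalent limiter, which is what allows the chain of backward-recursion limiters to be collapsed by a tree as in Fig. 8(b). The sketch below represents each limiter by its interval and checks the composition against step-by-step evaluation; the representation and the reduce-based flattening are illustrative assumptions, not the circuit of Fig. 10(d).

    from functools import reduce

    def apply_limiter(interval, x):
        # Clip x to the closed interval [lo, hi].
        lo, hi = interval
        return min(max(x, lo), hi)

    def iau(first, second):
        # Interval adjustment: the single interval equivalent to applying
        # `first` and then `second` -- the operation of one IAU.
        lo1, hi1 = first
        return (apply_limiter(second, lo1), apply_limiter(second, hi1))

    # A chain of limiters (applied left to right, illustrative threshold values)
    # collapses to one equivalent limiter; a tree of IAUs computes the same
    # composition with logarithmic depth.
    chain = [(-1.5, 2.0), (0.5, 3.0), (-4.0, 1.0)]
    equivalent = reduce(iau, chain)

    for x in (-10.0, 0.0, 0.7, 10.0):
        step_by_step = x
        for interval in chain:
            step_by_step = apply_limiter(interval, step_by_step)
        assert step_by_step == apply_limiter(equivalent, x)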
Fig. 11. The tree architecture for the sliding-window MAX-DMFB using IAUs and limiters. (a) The full tree. Shaded arrows indicate intervals defined by two numbers. The shaded IAUs are redundant and can be pruned. (b) The pruned tree, using memory to replace logic.

The IAU uses one more MUX than two limiters, and its delay is one MUX less than that of two limiters in series. The tree architecture offers a trade-off between complexity and latency. We note that significant savings in latency are only realized for large values of W. The length of the critical path of the tree architecture can be controlled by inserting pipeline registers where desired. A trade-off between memory and hardware is also possible within the tree architecture, by recognizing that some IAUs perform redundant calculations since the same inputs were available to other IAUs at an earlier point in the shift register [19]. An example is shown in Fig. 11.

V. AN ASIC IMPLEMENTATION

We have implemented the MAX-DMFB architecture of Fig. 7, with a window length of 9, in a 0.35-µm CMOS three-level-metal ASIC. The input and output soft values are quantized to 6 bits. The simulated performance is shown in Fig. 12. The design was synthesized to standard cells and has a core area of 0.49 mm² and a total silicon area of 7.8 mm². Fig. 13 is a die micrograph. The chip was verified to operate at 20 MHz (20 Mbps), the highest speed of our IC tester, with results equivalent to our simulations. Simulations predict operation at up to 150 Mbps. This compares favorably with published results for a 200-Mbps 0.8-µm analog BiCMOS hard-output Viterbi detector [20].

Fig. 12. Simulated performance of the MAX-DMFB ASIC with 6-bit quantization and a window length of 9, compared to the optimum floating-point block-based FB algorithm with a block size of 1024.

Fig. 13. Die micrograph.

VI. CONCLUSION

In this paper we have presented a soft-output detector based on the FB algorithm for PR4 signaling. The FB algorithm can be implemented using a difference metric (DMFB), and a simplified version of this algorithm (the MAX-DMFB) can be efficiently implemented in VLSI. A CMOS VLSI ASIC was developed that is expected to decode at speeds of up to 150 Mbps.

ACKNOWLEDGMENT

The authors would like to thank the Canadian Microelectronics Corporation for fabrication support. The authors also wish to thank Prof. F. R. Kschischang, Prof. E. Boutillon, and Prof. S. Gazor for enlightening discussions.
REFERENCES

[1] R. W. Chang and J. C. Hancock, "On receiver structures for channels having memory," IEEE Trans. Inform. Theory, vol. 12, pp. 463–468, Oct. 1966.
[2] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, pp. 284–287, Mar. 1974.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: turbo codes," in Proc. ICC'93, Geneva, Switzerland, May 23–25, 1993, pp. 1064–1070.
[4] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, vol. 13, pp. 260–269, Apr. 1967.
[5] G. D. Forney, Jr., "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268–278, Mar. 1973.
[6] G. D. Forney, Jr., Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[7] G. Battail, "Pondération des symboles décodés par l'algorithme de Viterbi," Ann. Télécommun., vol. 42, pp. 31–38, Jan. 1987.
[8] J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. IEEE Globecom Conf., Dallas, TX, Nov. 1989, pp. 1680–1686.
[9] K. Abend and B. Fritchman, "Statistical detection for communication channels with intersymbol interference," Proc. IEEE, vol. 58, pp. 779–785, May 1970.
[10] J. Hagenauer, E. Offer, and L. Papke, "Iterative decoding of binary block and convolutional codes," IEEE Trans. Inform. Theory, vol. 42, pp. 429–445, Mar. 1996.
[11] N. G. Kingsbury and P. J. W. Rayner, "Digital filtering using logarithmic arithmetic," Electron. Lett., vol. 7, no. 2, pp. 56–58, Jan. 1971.
[12] J. A. Erfanian and S. Pasupathy, "Low-complexity parallel-structure symbol-by-symbol detection for ISI channels," in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Processing, June 1–2, 1989, pp. 350–353.
[13] M. J. Ferguson, "Optimal reception for binary partial response channels," Bell Syst. Tech. J., vol. 51, no. 2, pp. 493–505, Feb. 1972.
[14] H. Kobayashi, "Application of probabilistic decoding to digital magnetic recording systems," IBM J. Res. Develop., vol. 15, pp. 64–74, Jan. 1971.
[15] P. Kabal and S. Pasupathy, "Partial-response signaling," IEEE Trans. Commun., vol. 23, Sept. 1975.
[16] H. Dawid and H. Meyr, "Real-time algorithms and VLSI architectures for soft output MAP convolutional decoding," in Proc. PIMRC'95, New York, 1995, pp. 193–197.
[17] K. Tzou and J. G. Dunham, "Sliding block decoding of convolutional codes," IEEE Trans. Commun., vol. 29, pp. 1401–1403, 1981.
[18] G. Fettweis and H. Meyr, "High-speed parallel Viterbi decoding: Algorithm and VLSI-architecture," IEEE Commun. Mag., vol. 29, pp. 46–55, May 1991.
[19] B. Farhang-Boroujeny and S. Gazor, "Generalized sliding FFT and its application to implementation of block LMS adaptive filters," IEEE Trans. Signal Processing, vol. 42, pp. 532–538, Mar. 1994.
[20] M. H. Shakiba, D. A. Johns, and K. W. Martin, "An integrated 200-MHz 3.3-V BiCMOS class-IV partial response analog Viterbi decoder," IEEE J. Solid-State Circuits, vol. 33, pp. 61–75, Jan. 1998.
Warren J. Gross (S’92) was born in Montreal, QC, Canada, in 1972. He received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Canada, in 1996 and the M.A.Sc. degree from the University of Toronto, Canada, in 1999. He is currently pursuing the Ph.D. degree at the University of Toronto. From 1993 to 1996, while studying for the B.A.Sc. degree, he worked in the area of space-based machine vision at Neptec Design Group, Ottawa, Canada. His research interests are VLSI architectures for digital communications algorithms and digital signal processing, coding theory, and computer architecture. Mr. Gross received the Natural Sciences and Engineering Research Council of Canada postgraduate scholarship, the Walter Sumner fellowship and the Government of Ontario/Ricoh Canada Graduate Scholarship in Science and Technology.
Vincent C. Gaudet (S’97) was born in Sherbrooke, QC, Canada, in 1974. He received the Bachelor of Science (Computer Engineering) degree from the University of Manitoba, Winnipeg, MB, Canada, and was awarded the University Gold Medal, both in 1995. He received the Master of Applied Science degree from the Department of Electrical and Computer Engineering, University of Toronto, ON, Canada, in 1997, where he is currently pursuing the Ph.D. degree. His research interests include mixed-signal IC design for error control coding, field-programmable devices, and genetic programming. He has received funding from the Natural Sciences and Engineering Research Council (Canada), the Ontario Graduate Scholarship in Science and Technology, and the Walter Sumner Memorial Fund.
P. Glenn Gulak (S’82–M’83–SM’96) received the Ph.D. degree from the University of Manitoba, Winnipeg, MB, Canada. From 1985 to 1988, he was a Research Associate with the Information Systems Laboratory and the Computer Systems Laboratory, Stanford University, Stanford, CA. Currently, he is a Professor with the Department of Electrical and Computer Engineering, at the University of Toronto, Toronto, ON, Canada, and holds the L. Lau Chair in Electrical and Computer Engineering. His research interests are in the areas of memory design, circuits, algorithms, and VLSI architectures for digital communications. Dr. Gulak received a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship and several teaching awards for undergraduate courses taught in both the Department of Computer Science and the Department of Electrical and Computer Engineering of the University of Toronto, Toronto, ON, Canada. He has served on the ISSCC Signal Processing Technical Subcommittee since 1990 and served as the Technical Program Chair for ISSCC 2001. He is a registered professional engineer in the province of Ontario.