IEEE 2007 Custom Integrated Circuits Conference (CICC)

A High-Throughput Maximum a posteriori Probability Detector

Ruwan Ratnayake1, Aleksandar Kavcic2 and Gu-Yeon Wei1
1 School of Engineering and Applied Sciences, Harvard University, Cambridge, MA
2 Department of Electrical Engineering, University of Hawaii, Honolulu, HI

Abstract – This paper presents a maximum a posteriori probability (MAP) detector based on a forward-only algorithm that can achieve high throughputs. The MAP algorithm is optimal in terms of bit error rate (BER) performance and, with Turbo decoding, can approach the channel capacity limit. The proposed detector uses a deeply pipelined architecture implemented in skew-tolerant domino logic; experimentally measured results verify that the detector achieves throughputs greater than 750 MHz while consuming 2.4 W. The detector is implemented in a 0.13 µm CMOS technology and has a die area of 9.9 mm².

I. INTRODUCTION

High-speed detectors that combat inter-symbol interference are of interest for a variety of communications applications, such as magnetic recording systems. Typically, high-speed detectors use less computationally intensive algorithms such as the Viterbi algorithm. Unfortunately, such algorithms generate hard outputs, making them less attractive for use in iterative systems where soft outputs are required. On the other hand, algorithms such as the soft-output Viterbi algorithm (SOVA) can exploit iterative detection for better performance [1], but SOVA is suboptimal in terms of BER performance [2]. While MAP algorithms are known to offer optimal BER performance, they have not been considered for high-speed detectors due to their prohibitively high computational complexity.

This paper presents the design, implementation, and experimental verification of a MAP detector that operates at very high throughputs. The implementation benefits from optimizations at several levels of system design. First, we chose an algorithm that has several advantages over the traditional MAP algorithm; one of its key features is an inherently pipelined structure, which can be exploited to increase throughput. Second, we leverage circuit-level techniques to increase throughput. The design is implemented with skew-tolerant dual-rail domino logic triggered by multiphase clocks. In addition to reducing gate delay, skew-tolerant domino obviates the dedicated latches required by traditional pipelining schemes. Removing these latches significantly reduces hardware overhead, eliminates latch delay, and facilitates time borrowing. To further increase throughput, computations that impose worst-case bottlenecks, constraining throughput, are addressed at the algorithmic level.
A modification to the algorithm is proposed that has minimal effect on BER performance, verified by system-level BER simulations, but speeds up the critical path.

1-4244-1623-X/07/$25.00 ©2007 IEEE

The well-known algorithm by Bahl, Cocke, Jelinek and Raviv (BCJR), which traditionally is considered for most MAP detection/decoding applications, requires forward and backward computations (FB-BCJR) [3]. This is in contrast to the Viterbi and SOVA algorithms, which perform computations only in the forward direction. Once the input stream is fed into Viterbi/SOVA detectors, the outputs are generated after a fixed latency and retain the same order. In contrast, the a posteriori probability (APP) outputs of the FB-BCJR algorithm can only be evaluated after both forward and backward metrics have been computed. Inevitably, the outgoing symbols appear in a permuted order relative to the incoming symbols.

To overcome the complexities inherent to FB-BCJR, we explore a recently developed algorithm that performs MAP detection with computations only in the forward direction [4]. We call this algorithm forward-only MAP (FOMAP). The FOMAP algorithm has similarities to both the Viterbi and FB-BCJR algorithms. It keeps soft survivors (probabilities), which are saved in a fixed-length sliding-window survivor memory. A prominent feature of FOMAP is its ability to update all of the (soft) survivors in parallel, similar to the Viterbi structure, where the (hard) survivors are also updated in parallel. Moreover, the FOMAP algorithm is a sum-of-products algorithm that, with soft survivors, generates APPs. This is in contrast to SOVA, which only computes an approximation of the APPs.

One of the key drawbacks of traditional FB-BCJR is its backward computation, which only allows sequential state metric updates. After receiving a symbol, the FB-BCJR algorithm takes up to four times the latency pertaining to the window length to compute the corresponding APPs. In contrast, the FOMAP algorithm performs parallel updates to generate APPs after a fixed latency equal to that of a single window length, resulting in ordered outputs. Latency and ordering of outputs are similar to the Viterbi algorithm.
By retaining key attractive features from both the Viterbi and MAP algorithms, namely parallel survivor updating and the ability to compute APPs, the FOMAP algorithm can be implemented as a deeply pipelined structure that offers superior performance in terms of BER, throughput, and latency. The remainder of this paper is organized as follows. Section II provides a brief overview of the FOMAP algorithm. Section III introduces the proposed detector architecture and describes the design mechanisms employed to achieve very high throughputs. Section IV presents measurement results of a test-chip prototype that verify high-throughput performance. Finally, Section V concludes the paper.


II. FORWARD-ONLY MAP ALGORITHM

FOMAP is a path-partitioning algorithm that computes APPs by processing probabilities of paths. For brevity, we present only the key equations and refer the reader to [4] for a detailed discussion of the FOMAP algorithm. A soft survivor in the FOMAP algorithm, denoted by α_{t,i}(s,u), is defined as the sum of the a posteriori probabilities of all paths that terminate at state s at time t and include a branch at time i with input u. In essence, it is the joint probability of S_t and U_i conditioned on Y_1^t, i.e.,

α_{t,i}(s,u) = Pr(S_t = s, U_i = u | Y_1^t),

where S_t and U_i are random variables denoting the state at time t and the input at time i, respectively, and Y_1^t is the sequence of received symbols up to time t. Fig. 1 clarifies this concept with a trellis diagram. The trellis shown consists of two states (0 and 1) and binary inputs. The figure shows all the paths that contribute to the soft survivor for state 1 at time t, where each path has a branch at time t−3 with corresponding input 0.

[Fig. 1: The trellis paths (from t−5 to t) that contribute to the soft survivor α_{t,t−3}(1,0) for state 1 at time t, where each path has a branch with input 0 at time t−3. The input corresponding to each branch is shown adjacent to the branch.]
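As a concrete check on this definition, the soft survivor illustrated in Fig. 1 can be computed by brute-force path enumeration. The Python sketch below models the two-state trellis as a hypothetical memory-1 channel (next state equals the input) with illustrative Gaussian branch metrics; these specifics are assumptions for illustration, not taken from the paper.

```python
import itertools
import math
import random

# Brute-force check of the soft-survivor definition: alpha_{t,i}(s, u) is
# the probability-weighted fraction of paths ending in state s at time t
# whose branch at time i carries input u. The two-state trellis here is a
# hypothetical memory-1 channel (next state = input) and the branch
# metrics are illustrative Gaussian likelihoods -- both assumptions.

STATES, INPUTS = (0, 1), (0, 1)
T = 6                      # path length, mirroring t-5 .. t in Fig. 1

random.seed(1)
ys = [random.uniform(0, 2) for _ in range(T)]   # toy received symbols

def gamma(t, s_prev, s):
    """Branch metric ~ p(y_t | s_prev -> s) for a toy 1 + D style channel."""
    return math.exp(-(ys[t] - (s_prev + s)) ** 2)

def alpha(t, i, s, u, start=0):
    """Sum path probabilities over all input sequences, per the definition."""
    num = den = 0.0
    for path in itertools.product(INPUTS, repeat=t + 1):
        sp, w = start, 1.0
        for k in range(t + 1):
            w *= gamma(k, sp, path[k])    # next state equals the input
            sp = path[k]
        den += w                          # every path up to time t
        if sp == s and path[i] == u:
            num += w                      # paths ending in s with U_i = u
    return num / den                      # Pr(S_t = s, U_i = u | Y_1^t)

# alpha_{t, t-3}(1, 0): paths ending in state 1 whose input at t-3 was 0.
print(alpha(T - 1, T - 4, 1, 0))
```

Because each path has exactly one final state and one input at time i, the α values marginalize to one over all (s, u) pairs, which makes this exhaustive version a useful reference against which the recursive algorithm below can be checked.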

FOMAP is a recursive algorithm consisting of three steps, called extend, update, and collect. Assuming all soft survivors within the window of length L from the previous iteration are available, i.e., α_{t−1,i}(s,u) for t−L ≤ i ≤ t−1, the three steps are evaluated as follows.

A. Extend

The extend operation, which extends the sliding window by one step, computes state metrics for a new time instant based on the previous state metrics and the latest received symbol:

Pr(S_{t−1} = s | Y_1^{t−1}) = α_{t−1,·}(s,·) = Σ_{u∈U} α_{t−1,t−1}(s,u)    (1)

α_{t,t}(s,u) = Σ_{s'∈S, l(s',s)=u} α_{t−1,·}(s',·) γ_t(s',s)    (2)

Here, S and U are the sets of all possible states and inputs, respectively, and s and u are elements of these sets. γ_t(s',s) is the branch metric from state s' to state s at time t, and l(s',s) is a function that indicates the input u pertaining to the branch connecting s' to s. Since α_{t,t} is based on α_{t−1,t−1}, it is evident that this operation contains a feedback loop that can limit performance. Section III shows how the extend step can be simplified to alleviate this performance bottleneck.

B. Update

The update operation updates the soft survivors for the remaining length of the window, i.e., t−L+1 ≤ i ≤ t−1:

α_{t,i}(s,u) = Σ_{s'∈S} α_{t−1,i}(s',u) γ_t(s',s)    (3)

All of the soft survivors are updated in parallel across the length of the window. Thus, within one cycle, the latest received symbol is incorporated into all of the soft survivors. The data flow for this operation occurs in a feed-forward manner from the front of the window towards the back; hence, this operation is amenable to pipelining.

C. Collect

Since soft survivors are joint a posteriori probabilities pertaining to states and inputs, summing all of the survivors at the end of the window for a given input, across the states, gives the APP of that input:

Pr(U_{t−L+1} = u | Y_1^t) = APP(u) = Σ_{s∈S} α_{t,t−L+1}(s,u)    (4)
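The three steps above can be sketched in software. The following Python sketch runs the extend/update/collect recursion on the two-state trellis of Fig. 1, modeled here as a hypothetical memory-1 channel; the trellis, branch metrics, and window length L = 4 are illustrative assumptions (the chip described later uses L = 25 on a 16-state trellis).

```python
import math
import random

# Sketch of the FOMAP extend/update/collect recursion, Eqs. (1)-(4), on a
# hypothetical two-state, binary-input trellis (a memory-1 channel where
# the next state equals the input). Trellis, branch metrics, and window
# length are illustrative assumptions, not the paper's EEPR4 configuration.

STATES, INPUTS = (0, 1), (0, 1)
L = 4  # sliding-window length (the chip described here uses L = 25)

# next_state[(s_prev, u)] -> s; the branch label is l(s_prev, s) = u
next_state = {(sp, u): u for sp in STATES for u in INPUTS}

def branches_into(s):
    """All (s_prev, u) pairs whose branch terminates in state s."""
    return [(sp, u) for (sp, u), sn in next_state.items() if sn == s]

def fomap_step(window, gamma):
    """One recursion. window[i][(s, u)] holds alpha_{t-1, i}(s, u), oldest
    first; gamma[(s_prev, s)] is the branch metric at time t. Returns the
    new window and, once the window is full, the APPs of the oldest symbol."""
    # Extend, Eq. (1): marginalize the newest survivors into state metrics.
    sm = {s: sum(window[-1][(s, u)] for u in INPUTS) for s in STATES}
    # Extend, Eq. (2): survivors for the new time index t (the feedback loop).
    newest = {(s, u): sum(sm[sp] * gamma[(sp, s)]
                          for (sp, uu) in branches_into(s) if uu == u)
              for s in STATES for u in INPUTS}
    # The oldest column leaves the window once the window is full.
    cols = window[1:] if len(window) == L else window
    # Update, Eq. (3): every remaining column is updated in parallel.
    cols = [{(s, u): sum(col[(sp, u)] * gamma[(sp, s)] for sp in STATES)
             for s in STATES for u in INPUTS} for col in cols]
    cols.append(newest)
    app = None
    if len(cols) == L:
        # Collect, Eq. (4): sum the oldest survivors over states, per input.
        raw = {u: sum(cols[0][(s, u)] for s in STATES) for u in INPUTS}
        z = sum(raw.values())
        app = {u: raw[u] / z for u in INPUTS}  # normalized APPs
    return cols, app

# Toy run: noisy dicode-like observations y ~ s_prev + u, Gaussian metric.
random.seed(0)
window = [{(s, u): 0.25 for s in STATES for u in INPUTS}]  # uniform start
outputs = []
for _ in range(10):
    y = random.uniform(0, 2)
    gamma = {(sp, s): math.exp(-(y - (sp + s)) ** 2)
             for sp in STATES for s in STATES}
    window, app = fomap_step(window, gamma)
    if app is not None:
        outputs.append(app)

print(outputs[0])  # APPs of the first detected symbol
```

Note how each received symbol touches every column of the window in the same step, which is exactly the property that maps onto one hardware processor per (state, input, window index) in the architecture of Section III. Floating-point arithmetic here stands in for the hardware's fixed-point normalization.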

A block diagram of FOMAP for the simple two-state trellis system of Fig. 1 is shown in Fig. 2. Each soft survivor (α) is computed by a dedicated processor; each extend and update column contains a number of processing units equal to the number of states times the number of inputs.

The BER performance of FOMAP simulated with turbo decoding is shown in Fig. 3; for comparison, the performance of SOVA is also shown. The turbo decoder simulated here comprises two component decoders connected in parallel. The performance results are for an AWGN channel with a code rate of 1/3 and a trellis with 16 states. The interleaver size was set to 4096 bits and the maximum number of iterations was 18. The results indicate that FOMAP has a performance gain of 0.6 dB over SOVA at a BER of about 10^-5.

III. DEEP-PIPELINED PARALLEL FOMAP

The FOMAP algorithm was implemented as a detector targeting an extended enhanced partial response class-4 (EEPR4) channel, which has 16 states. The detector takes 6-bit quantized channel symbols from an A/D at the receiver input and 6-bit quantized prior log-likelihood ratios from previous iterations, and generates 8-bit posterior log-likelihood ratios for the next iteration. The window length of the detector was set to 25, balancing hardware requirements against performance. Branch metrics are represented by 6-bit values, while state metrics and soft survivors are represented by 7-bit binary values.

The block diagram of the FOMAP architecture for EEPR4 is an expanded version of the block diagram shown in Fig. 2. The detector has 32 survivor-updating processors per column, corresponding to the 16 states and two input levels; the total number of processors is 32x25 = 800. Fig. 4 presents a die microphotograph of the test chip with the detector's floor plan superimposed onto the same figure. Extend, update and collect blocks are laid out as columns.
The branch metric unit shown on the left computes the relevant branch metrics for each received symbol using a look-up table. The detector relies on high-speed, dual-rail domino logic to minimize gate delays and maximize throughput. Keepers inserted into each domino stage prevent discharge due to leakage and allow low-speed testing. The domino gates are triggered by three equally spaced, overlapping clocks (Φ1, Φ2, Φ3), shown in Fig. 5(b), which make the circuit skew-tolerant and facilitate time borrowing [5]. Soft survivors are temporarily stored in the domino stages themselves, obviating explicit latches, eliminating latch delay, and significantly reducing hardware. Otherwise, the prohibitively large number of latches that would be required (~7x32x25 ≈ 5600) would make such a memory-intensive detector design impractical to implement.
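As an illustration of such a branch-metric table, the sketch below precomputes the noiseless EEPR4 outputs for all 16 states and two inputs, using the conventional EEPR4 target (1 − D)(1 + D)^3 on ±1 inputs. The Gaussian log-metric, prior handling, and floating-point form are assumptions for illustration, not the chip's 6-bit fixed-point implementation.

```python
import math

# Sketch of a branch-metric look-up for the 16-state EEPR4 trellis. The
# EEPR4 target (1 - D)(1 + D)^3, i.e. taps [1, 2, 0, -2, -1] on +/-1
# inputs, is the conventional definition; the noise variance, prior
# handling, and metric form below are illustrative assumptions.

TAPS = (1, 2, 0, -2, -1)          # h(D) = 1 + 2D - 2D^3 - D^4

def bits_of(state):
    """State = the last four +/-1 inputs; the most recent one is the LSB."""
    return [1 if (state >> k) & 1 else -1 for k in (0, 1, 2, 3)]

def ideal_output(state, u):
    """Noiseless channel output for input u arriving at 'state'."""
    past = bits_of(state)          # [x_{t-1}, x_{t-2}, x_{t-3}, x_{t-4}]
    x = [u] + past
    return sum(h * xi for h, xi in zip(TAPS, x))

# Precompute the table once: 16 states x 2 inputs, as in a hardware LUT.
IDEAL = {(s, u): ideal_output(s, u) for s in range(16) for u in (-1, 1)}

def log_branch_metric(y, state, u, prior_llr=0.0, sigma2=1.0):
    """log gamma_t: Gaussian log-likelihood plus the prior LLR term."""
    llh = -(y - IDEAL[(state, u)]) ** 2 / (2.0 * sigma2)
    return llh + (prior_llr / 2.0 if u == 1 else -prior_llr / 2.0)

def next_state(state, u):
    """Shift the new input into the 4-bit channel memory."""
    bit = 1 if u == 1 else 0
    return ((state << 1) | bit) & 0xF
```

With only 32 (state, input) pairs, tabulating the ideal outputs once and reusing them per received symbol mirrors the LUT-based branch metric unit in the floor plan.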


To avoid multiplications in the sum-of-products operations, the FOMAP algorithm is implemented in the log domain, where multiplication simply becomes addition. Summation in the log domain involves finding a maximum and supplementing it with a correction term. All operations in extend and update can therefore be implemented with add/compare/select/look-up-table/add (ACSLA) units, shown in Fig. 5(a), where a look-up table (LUT) provides the aforementioned correction term. Fig. 5(c) shows the ACSLA unit used in the update operation, divided into multiple pipeline stages. Simple Manchester carry-chain adders are used for the 7-bit additions. Logic for propagate, generate, and the carry chain evaluates during Φ1, domino XOR gates that compute the sum evaluate on Φ2, and the MUXes and LUT evaluate during Φ3. To prevent overflow, normalization circuitry is needed at the end of each update operation. The update operations are pipelined into two pipe stages.

As explained previously, the extend operation has a feedback loop that cannot be pipelined and limits detector throughput. To reduce its delay, the extend computation can be simplified by ignoring the correction term and concatenating the add/compare/select operations. This simplification has minimal impact on overall BER performance, as verified by simulation results [6]. Delay is further reduced by performing the add and compare in parallel such that the critical path delay consists of only a single addition and selection (MUX) [6].

One complication of this architecture is the large number of wires connecting each column. In general, the dynamic circuits need monotonically rising signals and their monotonically rising complements; however, propagating the complements doubles the number of wires.
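The log-domain summation described above, a maximum plus a LUT correction term, is the max* (Jacobian logarithm) operation. The sketch below shows it with a small quantized LUT; the LUT step size and depth are illustrative assumptions, not the chip's 7-bit fixed-point tables.

```python
import math

# Sketch of log-domain summation: ln(e^a + e^b) computed as max(a, b)
# plus a correction term from a small LUT -- the max* operation behind
# the ACSLA units. LUT granularity is an illustrative assumption.

# LUT for f(d) = ln(1 + e^-d), indexed by |a - b| quantized in steps of 0.25.
STEP, ENTRIES = 0.25, 32
LUT = [math.log(1.0 + math.exp(-i * STEP)) for i in range(ENTRIES)]

def max_star(a, b):
    """Approximate ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|)."""
    m, d = max(a, b), abs(a - b)
    idx = min(int(d / STEP), ENTRIES - 1)   # large differences saturate to ~0
    return m + LUT[idx]

def max_star_exact(a, b):
    """Exact Jacobian logarithm, for comparison."""
    return max(a, b) + math.log(1.0 + math.exp(-abs(a - b)))

# The extend step's simplification drops the correction term entirely,
# reducing max* to a plain max; the LUT version is used in update.
print(abs(max_star(1.0, 1.3) - max_star_exact(1.0, 1.3)))  # small error
```

Dropping the LUT term in the extend loop, as the paper does, turns each max* into a bare compare/select, which is what shortens the non-pipelinable feedback path.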
To alleviate the need to propagate the complements, Complementary Signal Generator (CSG) circuits [7] are used, which locally generate monotonically rising complementary signals from the monotonically rising true signals.

IV. MEASUREMENT RESULTS

The test-chip prototype was implemented in a 0.13 µm CMOS logic process and was experimentally verified in two stages. First, intermediate values generated by the extend, update and collect blocks were tested for correctness at low operating frequencies. The inputs at each time instant, namely 6-bit channel symbols and 6-bit a priori log-likelihood ratios, were fed to the detector through a scan chain, and the outputs at each intermediate stage were compared with the expected vectors. The detector was then tested for maximum achievable throughput. Given the difficulty of externally feeding two sets of 6-bit inputs serially at high speeds, pseudorandom inputs generated by an on-chip linear feedback shift register were used to drive the detector, and the corresponding outputs were verified. Lastly, to alleviate any speed constraints imposed by the output buffers, the outputs operate at half rate.

Fig. 6 presents the average frequency and power performance experimentally measured for the prototype chips, plotted with respect to supply voltage. For each supply voltage, the clock frequency was increased until the outputs became invalid. A maximum throughput rate greater than 750 MHz was achieved, at which point the chip consumed 2.4 W of power. While the design initially targeted 1-GHz operation, duty-cycle distortion in the clock distribution network reduced clock overlap and limited the maximum achievable frequency. The key characteristics of the test-chip prototype are summarized in Table 1.

V. CONCLUSION

In pursuit of higher performance for future-generation communications applications, iterative detection and decoding methods are being considered, where MAP detection and decoding ought to be used given their superior BER. To fulfill this need, a maximum a posteriori probability detector based on a recently developed algorithm that achieves throughputs greater than 750 MHz has been described. The algorithm inherits attractive features from both the Viterbi and traditional BCJR algorithms, namely parallel survivor updating and the ability to compute a posteriori probabilities. The high throughput rate is achieved by exploiting key aspects of this forward-only MAP algorithm and leveraging high-speed circuit techniques. The detector has been implemented in a 0.13 µm CMOS technology and experimentally verified. This is the first published implementation of the forward-only MAP algorithm and the highest throughput demonstrated for a MAP algorithm in VLSI.

ACKNOWLEDGMENTS

The authors thank E. F. Haratsch, Z. Keirn and Agere Systems for their generous support of this work and chip fabrication.

REFERENCES

[1] E. Yeo et al., "A 500-Mb/s Soft-Output Viterbi Decoder," IEEE J. Solid-State Circuits, vol. 38, Jul. 2003.
[2] B. Vucetic and J. Yuan, Turbo Codes: Principles and Applications, Kluwer Academic, 2000.
[3] L. R. Bahl, J. Cocke, F. Jelinek and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Trans. Inform. Theory, vol. 20, Mar. 1974.
[4] X. Ma and A. Kavcic, "Path Partitions and Forward-Only Trellis Algorithms," IEEE Trans. Inform. Theory, vol. 49, Jan. 2003.
[5] D. Harris, Skew-Tolerant Circuit Design, Morgan Kaufmann, 2001.
[6] R. Ratnayake, G.-Y. Wei and A. Kavcic, "Pipelined parallel architectures for high throughput MAP detectors," IEEE ISCAS, May 2004.
[7] N. H. E. Weste and D. Harris, CMOS VLSI Design, Addison Wesley, 2004.

Fig. 3. Bit error rate performance of FOMAP and SOVA with turbo decoding. The results are for a parallel turbo decoder, code rate 1/3, 16 states, interleaver size 4096, and a maximum of 18 iterations.


[Fig. 2: Block diagram of the FOMAP architecture for the two-state system defined by the trellis in Fig. 1, comprising the branch metric computation, a register, the Extend block, Update blocks 1 through L−1, and the Collect block. Soft survivors α_{t,i}(s,u) for each index i, t−L+1 ≤ i ≤ t, are shown.]

[Fig. 4: Die microphotograph and floor plan overlay.]

[Fig. 5: Update processor. (a) Add/compare/select/LUT/add (ACSLA) unit. (b) Overlapping clocks Φ1, Φ2 and Φ3. (c) Deep-pipelined ACSLA update processing circuit; the pipe stages with respect to Φ1 are also shown.]

[Fig. 6: Experimentally measured throughput and power dissipation.]

TABLE 1: SUMMARY OF CHIP CHARACTERISTICS
Max throughput:      > 750 MHz
Power dissipation:   2.4 W (at 750 Mb/s)
Transistor count:    2.054M (w/o I/O buffers)
Dimensions (area):   2633 x 3793 µm (9.9 mm²)
Technology:          0.13 µm CMOS, 7 metal, 1.2 V nom.
