
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011

2943

Construction and Hardware-Efficient Decoding of Raptor Codes

Hady Zeineddine, Mohammad M. Mansour, Senior Member, IEEE, and Ranjit Puri

Abstract—Raptor codes are a class of concatenated codes composed of a fixed-rate precode and a Luby-transform (LT) code that can be used as rateless error-correcting codes over communication channels. These codes have the atypical features of dynamic code rate, highly irregular Tanner-graph check-degree distribution, random LT-code structure, and LT-precode concatenation, which render a hardware-efficient decoder implementation achieving good error-correcting performance a challenging task. In this paper, the design of hardware-efficient Raptor decoders with good performance is addressed through joint optimizations targeting 1) the code construction, 2) the decoding schedule, and 3) the decoder architecture. First, random encoding is decoupled by developing a two-stage LT-code construction scheme that embeds structural features in the LT-graph that are amenable to efficient implementation while guaranteeing good performance. An LT-aware LDPC precode construction methodology that ensures architectural compatibility with the structured LT code is also proposed. Second, a decoding schedule is optimized to reduce memory cost and account for the processing-workload variability caused by the varying code rate. Third, to address the problems of check-degree irregularity and hardware underutilization, a novel reconfigurable check unit that attains a constant throughput while processing a varying number of LT and LDPC nodes is presented. These design steps are collectively employed to generate serial and partially-parallel decoder architectures. A Raptor code instance constructed using the proposed method, having an LT data-block length of 1210, is shown to outperform or closely match the performance of conventional LDPC codes over the code-rate range [0.4, 3/4]. The corresponding serial hardware decoder is synthesized using 65-nm CMOS technology and achieves a throughput of 22 Mb/s at rate 0.4 for a BER of 10^-6, dissipates an average power of 222 mW at 1.2 V, and occupies an area of 1.77 mm².

Index Terms—Algorithms, code construction, decoder architecture, iterative decoding, low-density parity-check codes, Raptor codes, rateless.

I. INTRODUCTION

A Raptor code is constructed by concatenating a fixed-rate precode with a rateless LT-code [1], [2]. The code is therefore rateless and its "rate" is determined on a frame-by-frame


Manuscript received June 23, 2010; revised December 15, 2010; accepted February 02, 2011. Date of publication February 14, 2011; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Warren J. Gross. H. Zeineddine and M. M. Mansour are with the Department of Electrical and Computer Engineering, American University of Beirut, Beirut 1107 2020, Lebanon (e-mail: [email protected]; [email protected]). R. Puri is with the Microsoft Corporation, Redmond, WA 98052 USA (e-mail: [email protected]; website: http://www.aub.edu.lb/~mm14/). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSP.2011.2114655

basis, or even changed for the same frame upon a decoding failure. Similar to other rateless codes, Raptor codes were initially designed to operate on binary erasure channels (BECs). In a BEC [3], a code symbol is either erased with some erasure probability or received correctly by the receiver. This behavior is an adequate model for packet transmission over computer networks, where a corrupted or unreceived packet is considered an erased symbol. Through coding, the erased packets in a frame can be recovered by applying erasure-correcting decoding on its received/unerased symbols. Coding is thus advantageous compared to acknowledgement-based protocols, especially under poor channel conditions or upon transmission from one server to multiple recipients. By using rateless codes over such varying channels, the common problem in fixed-rate erasure coding of over- or underestimating the channel loss rate is avoided [2]. LT-codes [1] constitute a class of rateless codes, in which output symbols (their number being potentially any value) are produced randomly and independently from the input symbols according to a degree distribution. A decoding algorithm is then applied at the receiver side to recover the input symbols from the output symbols. The LT-decoder is efficient in terms of the block size required to achieve a decoding success with high probability. The main drawback of LT codes is that, for vanishing error probability over erasure channels, the average degree of the output symbols grows logarithmically with the number of input symbols (when the number of output symbols is close to the number of input symbols). This makes it hard to design a linear-time encoder and decoder for LT codes [2], [4]. Raptor codes [2] are an extension of LT-codes specifically targeted at solving the problem of nonlinear-time encoding and decoding. In Raptor codes, the input symbols are encoded using a fixed-rate code prior to LT encoding. Raptor codes solve the transmission problem over an unknown erasure channel in an almost optimal manner [2].
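As a toy illustration of the LT encoding process just described, the following sketch draws an output degree from a degree distribution and XORs that many randomly chosen input bits. The function name and the distribution values are illustrative only; they are not the optimized distributions of [1], [4].

```python
import random

def lt_encode(frame_bits, degree_dist, n_out, seed=0):
    """Produce n_out LT output symbols from a frame of input bits.

    degree_dist is a list of (degree, probability) pairs; each output
    symbol is the XOR of a randomly chosen set of input bits.
    """
    rng = random.Random(seed)
    degrees = [d for d, _ in degree_dist]
    weights = [w for _, w in degree_dist]
    out = []
    for _ in range(n_out):
        d = rng.choices(degrees, weights=weights)[0]       # sample the output degree
        neighbors = rng.sample(range(len(frame_bits)), d)  # choose d input bits at random
        bit = 0
        for j in neighbors:
            bit ^= frame_bits[j]                           # modulo-2 sum (XOR)
        out.append((bit, neighbors))
    return out

frame = [1, 0, 1, 1, 0, 0, 1, 0]
dist = [(1, 0.1), (2, 0.5), (3, 0.4)]   # toy distribution
symbols = lt_encode(frame, dist, n_out=12)
```

Because the number of output symbols is a free parameter of the loop, the encoder is rateless: more symbols can be generated on demand after a decoding failure.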
The application of LT/Raptor codes over binary-input memoryless symmetric channels was studied in [4] and [5], with emphasis on the theoretical behavior and on the design and analysis of check-degree distributions in [4]. In these codes, the symbols are bits, not packets. The rateless nature of Raptor codes leads to a dynamic coding performance, which can be viewed as an application metric rather than simply as a design parameter. It provides flexibility in making time-dependent optimal tradeoffs among transmission time, bandwidth, and power. In addition, it allows the development of efficient methods/protocols to deal with a decoding failure, where overcoming such a failure is achieved simply by sending additional bits (i.e., by instantaneously decreasing the rate). These characteristics can be advantageous in rapidly changing environments, such as ad hoc networks, multiuser communication channels and channels

1053-587X/$26.00 © 2011 IEEE


where noise levels are not known a priori to the sender. The precode is needed to attain a high code minimum distance and thus avoid error floors at relatively high bit-error rates (BERs).

LT codes can be efficiently decoded using the iterative two-phase message-passing (TPMP) algorithm typically used in LDPC decoding [6]. If the fixed-rate precode is an LDPC code, applying the TPMP algorithm on the concatenated code (LT and LDPC) yields significantly better decoding performance than applying a two-stage decoding process [7]. In addition, joint decoding results in a lower average number of LT-decoding iterations due to better stopping criteria, and enables utilizing the same hardware resources for LT and LDPC decoding. This motivates the need for a hardware-efficient decoder architecture for Raptor codes having an LDPC precode.

The peculiar features of Raptor codes impose serious challenges when it comes to a hardware-efficient decoder implementation. These features include a varying code rate, random LT-encoding, a variable check-degree distribution, and joint decoding of the LT code and LDPC precode. These irregularity and randomness features lead to low resource utilization, high control overhead, complex data-movement patterns, and stringent memory requirements, resulting in a highly inefficient decoder implementation.

In this paper, a class of decoders for Raptor codes that are efficient in terms of both hardware complexity and algorithmic error-correcting performance is proposed. To address the problems of random encoding, nondeterministic rate, variable degree distribution and LDPC-LT architectural compatibility, the proposed method involves decoder design and optimizations at three different levels, namely, code construction, decoding procedure and decoder architecture. First, a two-stage LT-code construction method is proposed that yields short-cycle-free LT codes, decouples code structuring from random encoding, and allows an LT-LDPC-compatible design.
Moreover, a replication technique to construct girth-8 structured LT-graphs is developed. A method to construct a class of 4-cycle-free LDPC codes that are LT-compatible is also presented, and a subset of this class, obtained by row merging, is shown to yield 4-cycle-free Raptor codes. Second, serial and partially parallel memory-aware decoding schedules to jointly decode both LDPC and LT codes are proposed along with the corresponding architectures. Third, three novel reconfigurable check-node unit designs, corresponding to three different message-update algorithms, are developed to process messages corresponding to irregular check-degrees at a constant throughput. Making use of the above optimizations, the decoding procedure can then be mapped into row processing of a regular matrix. The decoding schedule is hence made simple, regular and identical across both LT and LDPC codes. The problems related to hardware utilization, stringent memory requirements and interconnect complexity are largely resolved. The constructed codes preserve good error-correcting performance over low to medium rates. An instance code constructed using the proposed method outperforms randomly built rate-0.4 (3,5) LDPC codes and closely matches LDPC codes defined in the IEEE 802.16 PHY-layer [8] at comparable rates, for BERs as low as 10^-6. Hardware simulations show that at such BERs, the serial decoder implementation achieves a throughput of 20 Mb/s, has an area of 1.7 mm² and dissipates an average power of 222 mW at a 1.2-V supply voltage.


The remainder of the paper is organized as follows. Section II describes Raptor codes and their decoding algorithm. Moreover, it presents an overview of the challenges for hardware-efficient decoder implementations and the proposed solutions to tackle them. Architecture-aware LT-code and LDPC-precode construction techniques are presented in Section III. In Section IV, serial and partially-parallel decoder architectures, in addition to the reconfigurable check function unit design, are proposed. Section V presents hardware simulation results for the resulting serial architecture, and Section VI concludes the paper.

II. RAPTOR CODES

A Raptor code is composed of a fixed-rate precode concatenated with a rateless LT-code [2]. The LT-code has a minimum distance that is bounded by the minimum bit-node degree and, hence, exhibits an error floor at relatively high BERs. The precode is thus needed to attain a high code minimum distance and avoid high-BER error floors.

A. Encoding

Given a frame of data bits, the encoding process is done in two stages. First, the frame is encoded using a fixed precode into a new, longer frame. Next, the precoded frame is re-encoded with an LT-code to generate the output bits, for some dynamically determined code rate. Each output bit in the LT-code frame is generated independently as follows. A predesigned degree distribution on the positive integers is sampled to obtain an integer d, called the output degree; then d bits of the precoded frame are chosen at random and their modulo-2 sum (XOR) is transmitted. The design of the degree distribution is crucial for the resulting code to yield good error-correcting performance. Similar to LDPC codes [6], an LT-code can be represented by a bipartite (Tanner) graph [9] with variable (or bit) nodes representing the input bits on one side and check nodes representing the output bits on the other (see Fig. 1). An edge exists between check-node i and bit-node j if bit j is an input to the XOR whose output is check-node i.
In this case, nodes i and j are said to be neighbors. The degree of a node is the number of edges connected to it. If all nodes in both partitions of the graph have the same degree, the graph (code) is called regular; otherwise it is irregular. The girth is defined to be the minimum cycle length in the graph. An LT-code can be equivalently represented by a matrix H, where H(i, j) = 1 if bit-node j is connected to check-node i and 0 otherwise. The Hamming weight of a row or column vector in H is defined as the number of nonzero entries in the vector. If the precode is an LDPC code, the Raptor code can be represented by a bipartite graph with the check-node partition composed of LT and LDPC check nodes, as shown in Fig. 1.

B. Decoding

A Raptor code can be decoded using Gallager's TPMP algorithm used for LDPC codes [6], where two types of messages are exchanged between bit-nodes and check-nodes in an iterative manner. Let λ_i be the intrinsic channel reliability value of the i-th check-node, CTB_{i→j}^{(l)} the check-to-bit message from check-node i to bit-node j at iteration l, and BTC_{j→i}^{(l)} the bit-to-check message from bit-node j to check-node i at iteration l.
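The matrix view of the graph can be illustrated with a small helper (the function names are illustrative, not the paper's): H(i, j) = 1 exactly when bit-node j neighbors check-node i, so row weights are check-node degrees and column weights are bit-node degrees.

```python
def to_matrix(n_bits, checks):
    """Build the 0/1 matrix H of an LT code.

    checks[i] lists the bit-node neighbors of check-node i, i.e., the
    input bits feeding the XOR whose output is check-node i.
    """
    H = [[0] * n_bits for _ in checks]
    for i, nbrs in enumerate(checks):
        for j in nbrs:
            H[i][j] = 1
    return H

def degrees(H):
    """Check-node degrees are row weights; bit-node degrees are column weights."""
    row = [sum(r) for r in H]
    col = [sum(r[j] for r in H) for j in range(len(H[0]))]
    return row, col
```

For instance, two check nodes with neighbor sets {0, 1} and {1, 2, 3} over four bit-nodes give row weights (2, 3) and column weights (1, 2, 1, 1).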


Fig. 1. Bipartite graph G and parity-check matrix H of a Raptor code. A length-4 cycle is highlighted on G and H.

We denote by C(j) the index set of the check-node neighbors of bit-node j (so deg(j) = |C(j)|), and by B(i) the index set of the bit-node neighbors of check-node i (so deg(i) = |B(i)|). We use the notation CTB_j^{(l)} to denote the vector of check messages to bit-node j at iteration l, and CTB^{(l)} to denote all vectors of check messages at iteration l; BTC_i^{(l)} and BTC^{(l)} are similarly defined. For simplicity, we drop the subscripts and the iteration index when the context is clear or when arbitrary nodes are considered. The decoding algorithm is described as follows:
1) At iteration l:
Phase 1: Compute for each bit-node j the message BTC_{j→i}^{(l)} to every check-node i ∈ C(j) according to

BTC_{j→i}^{(l)} = Σ_{i′ ∈ C(j)\{i}} CTB_{i′→j}^{(l−1)}   (1)

with initial conditions BTC_{j→i}^{(0)} = 0 for all j and i ∈ C(j).
Phase 2: Compute for each check-node i the message CTB_{i→j}^{(l)} to every bit-node j ∈ B(i):
• If node i is an LT check-node, then [see (2), shown at the bottom of the page].
• If node i is an LDPC check-node, then

CTB_{i→j}^{(l)} = ( Π_{j′ ∈ B(i)\{j}} s_{j′} ) Φ( Σ_{j′ ∈ B(i)\{j}} Φ(|BTC_{j′→i}^{(l)}|) )   (3)

where Φ(x) = log((e^x + 1)/(e^x − 1)) and s_{j′} is the sign of BTC_{j′→i}^{(l)}.
2) Decision phase: At the final iteration L, bit j is set to 0 or 1 according to the sign of Σ_{i ∈ C(j)} CTB_{i→j}^{(L)}.
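A minimal floating-point sketch of the two-phase updates, assuming log-likelihood-ratio messages and using the equivalent tanh form of the check update; the function and variable names are illustrative, not the paper's hardware datapath.

```python
import math

def bit_to_check(ctb_in):
    """Phase 1: the message to check-node i is the sum of the incoming
    check-to-bit messages from all of the bit-node's neighbors except i."""
    total = sum(ctb_in.values())
    return {i: total - m for i, m in ctb_in.items()}

def check_to_bit(lam, btc_in, is_lt):
    """Phase 2 (tanh rule): the intrinsic channel value lam enters the
    product only at LT check-nodes; LDPC check-nodes have no channel term."""
    out = {}
    for j in btc_in:
        prod = math.tanh(lam / 2.0) if is_lt else 1.0
        for jp, m in btc_in.items():
            if jp != j:
                prod *= math.tanh(m / 2.0)
        prod = max(min(prod, 0.9999999), -0.9999999)  # clamp for atanh
        out[j] = 2.0 * math.atanh(prod)
    return out
```

For example, `bit_to_check({0: 1.0, 1: 2.0})` returns `{0: 2.0, 1: 1.0}`: each outgoing message excludes the message received on the same edge.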

Decoding is terminated either when the maximum number of decoding iterations is reached or when the decoded bits satisfy all the check constraints of the LDPC precode. Using the latter stopping criterion in joint decoding results in a lower average number of iterations per frame than in two-stage decoding, where the LT-decoding stage needs a constant number of iterations.

The code construction method, from a coding point of view, must generate codes that have high girth and minimum distance or, more generally, minimize the number of short cycles and low-weight codewords. Short cycles in a bipartite graph create high correlations between supposedly independent messages, which prevent suboptimal iterative decoding from converging to maximum-likelihood decoding (MLD) or near-MLD performance. The existence of low-weight codewords, on the other hand, degrades the code's algorithmic performance and causes error floors at relatively high signal-to-noise ratios (SNRs).

C. Overview of Decoder Architectures

In principle, LT and LDPC codes have similar decoding algorithms and hence their decoders share many similarities. The iterative decoder architecture, shown in Fig. 2, is composed of three main components: 1) a check-node processor composed of a number of check function units (CFUs) that compute check-to-bit messages (serially or in parallel); 2) a bit-node processor composed of a number of bit function units (BFUs) that compute bit-to-check messages; and 3) a network that communicates messages between the check- and bit-node processors. Message communication can be done through a complex interconnect that mimics the graph topology in a parallel architecture and/or through memory in a serial or partially-parallel architecture. The design and implementation of LDPC decoders have been studied extensively in the literature (e.g., [10]–[17]). Several challenges were resolved to obtain efficient decoders. One of

CTB_{i→j}^{(l)} = 2 tanh^{-1}( tanh(λ_i / 2) Π_{j′ ∈ B(i)\{j}} tanh(BTC_{j′→i}^{(l)} / 2) )   (2)


Fig. 2. An iterative decoder architecture overview. The interconnect network communicates t messages per clock cycle in either direction.

these challenges is the randomness of the LDPC code structure, leading to a very complex interconnect in the case of a parallel architecture and high memory overhead in serial and partially parallel architectures. Other challenges include the high complexity of the check-to-bit message update, the performance degradation due to quantization, and the decoding convergence speed.

To solve the code-randomness problem, several classes of structured codes, such as quasi-cyclic (QC) [18] and architecture-aware (AA) [10] codes, were introduced. In general, the parity-check matrices of such codes are composed of zero and permutation matrices. While these codes have algorithmic performance comparable to that of randomly built codes for short to moderate block lengths, their structure simplifies memory partitioning and access and significantly reduces the interconnect complexity. Consequently, instances of quasi-cyclic codes were deployed in several systems, including the IEEE 802.16e standard [8]. A convergence speedup by a factor of nearly 2, in addition to significant memory savings, was obtained by applying the turbo-decoding message-passing (TDMP) algorithm [19]. In the TDMP algorithm, the neighboring bit posterior probabilities are updated upon the processing of each check node, and the updated values are used in the following check-to-bit message computations of the same iteration. The layered structure of the deployed AA and QC codes makes the application of TDMP hardware-efficient. Two main check-message update algorithms were proposed to simplify the check-node unit and avoid the need for the costly nonlinear check-update computation. One is based on a simplified form of the BCJR algorithm [20] tailored to the trellis of a single parity-check code [10]. The other is the Min-Sum algorithm [21], [22], which degrades the performance but substantially reduces the check-node memory requirements [13], [14].
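For reference, the Min-Sum approximation replaces the nonlinear check-to-bit computation with a minimum of magnitudes and a product of signs. A sketch, with illustrative names:

```python
def min_sum_check(btc_in, j_out):
    """Min-Sum check-to-bit update: the outgoing magnitude is the minimum
    |BTC| over the other neighbors, and the sign is the product of their
    signs. This avoids the nonlinear tanh/log computations entirely."""
    sign = 1.0
    mag = float("inf")
    for j, m in btc_in.items():
        if j == j_out:
            continue
        if m < 0:
            sign = -sign
        mag = min(mag, abs(m))
    return sign * mag
```

Because only the two smallest magnitudes and the aggregate sign need to be stored per check node, the check-node memory shrinks substantially, at some cost in decoding performance.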
Collectively, these optimizations allowed for efficient high-throughput multi-mode/rate LDPC decoder implementations (e.g., [13]–[15], [23], and [24]). While similar to LDPC codes, Raptor codes have inherent features that render a direct application of the aforementioned LDPC optimizations a nontrivial task and ultimately prohibit an efficient decoder implementation. These features, and the proposed steps to tackle them, are listed below.
1. Dynamic code rate: The hardware resources of a parallel architecture will be highly underutilized when the code rate exceeds the minimum rate. This fact favors the implementation of serial or partially-parallel decoder architectures. The proposed decoding procedure stores only the check-to-bit messages and implements (1) in a serial manner similar to [19] and [25], cutting down memory cost and eliminating the complications associated with irregular and variable bit-node degrees, which are caused, though not exclusively, by the code's rateless nature.

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011

2. Random selection of input bits: This means that t random locations of the bit-node memory (t being the number of bit-to-check messages communicated each cycle) are accessed simultaneously. Such a stringent requirement would lead to prohibitively complex memory design and read/write networks. Similar to LDPC codes, irregular quasi-cyclic LT codes can be constructed to resolve the problem. Three issues, however, must be resolved to make this solution practical. The first, a design-related issue, is that the LT-matrix has to be redesigned with every variation of the rate and check-degree distribution, possibly occurring in real time. The second, a storage-related issue, is that the shift offsets and the locations of the nonzero submatrices of the LT-matrix must be stored for every code instance. In addition, due to the code structure, the degree-distribution polynomial coefficients are quantized, with the quantization step equal to the ratio of the permutation-submatrix size to the block length. The proposed code construction method utilizes one quasi-cyclic source matrix to pseudo-randomly generate all code instances via row-splitting, according to any given check-set distribution, as will be explained in Section III. This resolves, to a great extent, the design-related and storage issues. In the proposed method, the optimal check-degree distribution is approximated by a check-set distribution; such an approximation does not have the same dependency on the permutation-matrix size as in a custom quasi-cyclic code.
3. Check-degree randomness and variable check-degree distribution: To attain a constant message throughput t under variable check degrees, reconfigurable CFUs capable of processing a constant number of messages per cycle are utilized in the Raptor decoder architecture. Serial CFUs [13], [14], [23] are more power-efficient, since they involve less data movement and multiplexing, and possess the appropriate flexibility to process variable-degree check nodes.
However, their latency is proportional to the corresponding check-node degree, which is considerably high in the high-rate LDPC precode. In addition, the row-splitting transformation involves a pseudorandom permutation of the communicated messages, which, in serial CFU processing, is either constrained or incurs extra latency and complexity compared to parallel CFU processing. In partially parallel decoder architectures, this high latency results in idle time between subsequent iterations that is comparable to the decoding time. Therefore, reconfigurable parallel CFUs are utilized in the architecture, and three novel CFU designs implementing the conventional, BCJR, and Min-Sum algorithms are developed.
4. LT-code and LDPC-precode architectural compatibility: Hardware reuse (i.e., memory, function units and interconnect) across LT and LDPC decoding is made possible in this work by constructing high-rate LDPC codes that share the same structure with the LT-generating source matrix and have check nodes whose degrees are multiples of the maximum possible LT-node degree and whose bit-neighbors are evenly distributed, as discussed in Section III. The reconfigurable CFUs are designed to process the LDPC check nodes in a quasi-serial manner, attaining a throughput of t messages per cycle.
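As a simplified illustration of constant-throughput processing, the sketch below packs the edge messages of variable-degree check nodes into fixed groups of t per clock cycle. It models only the schedule, not the CFU datapath, and the names are hypothetical.

```python
def schedule_cycles(check_degrees, t):
    """Pack the edge messages of variable-degree check nodes into
    fixed-size groups of t messages, the number a constant-throughput
    reconfigurable CFU consumes per clock cycle.

    Returns a list of cycles, each a list of (check_index, edge_index)."""
    edges = [(c, e) for c, d in enumerate(check_degrees) for e in range(d)]
    return [edges[i:i + t] for i in range(0, len(edges), t)]
```

With degrees (2, 3, 1, 2) and t = 4, the 8 edge messages fill exactly 2 cycles of 4 messages each, even though no single check node has degree 4; the number of cycles depends only on the total edge count, not on the degree distribution.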



Fig. 3. Random splitting of a row is done in two steps: 1) Random permutation of the nonzero entries and 2) sampling from the partition set to obtain the resulting row weights. For simplicity, sampling is performed here by reading the partition set in a circular fashion.
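The two-step random splitting shown in Fig. 3 can be sketched as follows. The names are illustrative, and a real implementation would operate on the quasi-cyclic structure rather than on dense rows.

```python
import random

def split_row(row, partition_sets, rng):
    """Split one row into several rows, per Fig. 3's two steps:
    1) randomly permute the row's nonzero entries, and
    2) sample a partition set (positive integers summing to the row
       weight) and cut the permuted entries into rows with those
       Hamming weights."""
    ones = [j for j, v in enumerate(row) if v]
    rng.shuffle(ones)                          # step 1: random permutation
    weights = rng.choice(partition_sets)       # step 2: sample a partition set
    assert sum(weights) == len(ones)           # every set must sum to the row weight
    rows, k = [], 0
    for w in weights:
        new_row = [0] * len(row)
        for j in ones[k:k + w]:
            new_row[j] = 1
        rows.append(new_row)
        k += w
    return rows
```

By construction, the split rows partition the nonzero entries of the original row, so their entrywise OR recovers it; this is why the structure and girth of the source graph are preserved.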

The design steps mentioned above, targeted at resolving the challenges facing an efficient Raptor decoder implementation, are discussed in the subsequent sections.

III. ARCHITECTURE-AWARE RAPTOR CODE CONSTRUCTION

The proposed method to construct architecture-aware Raptor codes is summarized by the following three steps:
1) A structured source matrix H_s is constructed to have high girth. This embeds favorable regularity features and guarantees good girth properties in the LT-code formed by the random encoding process in Step 2.
2) Pseudo-random row-splitting is applied on H_s, yielding the matrix H_LT. An LT-code of the desired rate is obtained by selecting the first rows of H_LT.
3) A code-construction methodology is applied to form a class of matrices which are structurally identical to H_s and describe 4-cycle-free LDPC graphs. A regular code from this class can be obtained by applying row-merging on H_s. The resulting Raptor graph is 4-cycle-free.
The following subsections describe the above three steps and present BER simulation curves to demonstrate the effectiveness of the proposed codes.

A. Row Splitting Transformation

Random row splitting transforms a bipartite graph having constant check-degree d into an LT-graph having a check-degree distribution that is close to optimal, while still preserving the underlying structure and girth properties of the initial graph. It is described by the following procedure:
Input: A matrix H_s having uniform row weight d.
Output: A matrix H_LT describing an LT-code.
Procedure:
1. Design an LT check-degree distribution that is optimal with respect to a preselected metric, such as error-correcting performance or convergence speed.
2. Approximate the check-degree distribution with a distribution over a space of sets, having the property that the elements of each set are positive integers that sum to d.

3. For every row of H_s, a set S is sampled at random according to the set distribution. The row vector is then split into |S| rows such that their Hamming weights are the elements of S. Let the resulting matrix be H_full; H_LT is formed by selecting the first rows of H_full.
Step 1 is beyond the scope of this work. One way to perform the approximation in Step 2 is to minimize the weighted quadratic distance between the target check-degree distribution, written as a vector Ω, and the check-degree distribution resulting from the set distribution, written as a vector x over the candidate sets. Let M be a matrix such that M(i, k) is the number of elements of value i in the kth set divided by the size of that set; the check-degree distribution resulting from x is then Mx. Finding the vector x can be formulated as the following quadratic optimization problem:

minimize (Ω − Mx)ᵀ W (Ω − Mx)  subject to  x ≥ 0 and 1ᵀx = 1

where W is the diagonal matrix of weights such that W_ii is set proportional to the sensitivity of the preselected metric in Step 1 to a change in the probability of degree i. This minimizes the deviation of this metric from the optimal one due to the approximation error. Steps 1 and 2 can involve an iterative co-design of the check-degree and set distributions to reach the best approximation, while keeping the quantization of the distribution parameters compatible with the number of split rows. One possible simple way to do the sampling in Step 3 is to read sequentially from a circular array of sets. These sets are chosen to approximate the ideal distribution and thus can be easily reconfigured in hardware for an arbitrary distribution. The technique of row splitting is illustrated in Fig. 3. Row-splitting is implemented in hardware by randomly permuting the bit-to-check messages and then applying check-node processing by reconfiguring the CFU, which has a constant throughput of t messages per cycle, according to the chosen set. Random sampling of the bit-node neighbors of a check node is thus distributed over two phases: the construction phase of H_s and



the row-splitting phase. The stringent bit-node memory requirements can thus be relaxed by imposing a desired structure on H_s. At the algorithmic level, the girth of the graph described by H_s is preserved after row-splitting, as Lemma 1 states.
Lemma 1: If the graph obtained by row splitting has a cycle of length L, then the graph prior to row splitting has a cycle of length at most L.
Proof: Let (v_1, v_2, ..., v_L, v_1) be a cycle in the graph described by H_LT. Let v′ be the node in the graph described by H_s which gives node v after row-splitting. If the edge (v_i, v_{i+1}) exists in the split graph, then the edge (v′_i, v′_{i+1}) exists in the original graph. Therefore, (v′_1, v′_2, ..., v′_L, v′_1) is a closed path in the original graph and contains a cycle whose length is at most the length of this path.

B. Construction of the Source Matrix

The structure and girth properties of the LT-graph are determined by constructing the graph described by H_s. Matrix H_s is constructed to have a quasi-cyclic structure composed of zero and shifted identity matrices. As is the case with LDPC decoders, this allows for partitioning the bit-node memory into one- or two-port memory arrays where, in serial decoding, one message is read/written from each memory array per cycle, and it simplifies memory address generation. An example of such a matrix is the regular matrix

H_s = [I^(u·v mod p)], 0 ≤ u, v ≤ d − 1   (4)

where p is a design parameter, the (u, v)th block of H_s is I^(u·v mod p), and I^(s) denotes the p × p identity matrix cyclically shifted to the right by s positions. When p is prime, H_s describes a 4-cycle-free graph. Similar to the case of quasi-cyclic codes, H_s can be equivalently represented by a smaller base matrix, where a zero submatrix in H_s is replaced by a '−' entry and a shifted identity matrix is replaced by its shift value. The graph described by H_s above is regular, since both check- and bit-nodes have degree d. Irregular graphs, with irregular bit-degrees and constant check-degree d, can be designed using the following procedure:
1) Choose a set of pairwise prime integers p_1, ..., p_d.
2) For i = 1, ..., d, construct a matrix by vertically stacking identity matrices of size p_i.
3) Form a matrix by horizontally appending these matrices and then retaining only the first n rows of the resulting matrix, where n is the desired number of rows.
Fig. 4 illustrates an example of an irregular matrix constructed using the above procedure for p_1 = 3, p_2 = 4, p_3 = 5.
Lemma 2: The resulting matrix has regular check-node degree d and irregular bit-node degrees of ⌊n/p_i⌋ or ⌈n/p_i⌉ for i = 1, ..., d. Moreover, it describes a 4-cycle-free bipartite graph.
Proof: Suppose a 4-cycle exists, and let r_1 and r_2 be the row indexes corresponding to the two check-nodes in the cycle. Then r_1 ≡ r_2 (mod p_i) and r_1 ≡ r_2 (mod p_j) for two distinct column groups i ≠ j, which is impossible by the pairwise primality of p_i and p_j.
1) Girth-Oriented Replication/Girth-8 LT-Code Construction: The idea is to apply an additional replication step to a 4-cycle-free quasi-cyclic matrix, targeted at avoiding length-6 cycles. Each node of the base graph is replicated into a number of child nodes,

Fig. 4. Irregular matrix H for p_1 = 3, p_2 = 4, p_3 = 5.

forming a new "child" graph, and each edge between a bit-node and a check-node in the base graph is replaced by edges between the child nodes of those two nodes in the child graph. Formally, the matrix describing the child graph is constructed using the following procedure.
Input: A prime p and a regular matrix identical to the matrix described in (4).
Output: A matrix describing a girth-8 graph.
Procedure: Replace every scalar entry of the input matrix by a shifted identity matrix whose shift offset is a prescribed function of the entry's block position. The resulting matrix is the LT source matrix H_s.
Theorem 1: The graph obtained from the graph-replication method has girth 8.
Proof: See Appendix I.
Replicating the number of block rows and columns allows for the flexible construction of LT-aware LDPC codes having high check-degrees while still being 4-cycle-free. The choice of the shift offsets is targeted, along with avoiding 6-cycles, at allowing the construction of a subclass of 4-cycle-free Raptor codes. Partially parallel decoding can be applied by processing several rows of H_s simultaneously, while incurring no extra control or scheduling overhead. In the rest of this paper, the matrix H_s described here is used as the source matrix for LT-code construction.

C. LT-Compatible Precode Construction

The constructed LT-code has a minimum distance that is upper-bounded by the minimum bit-node degree and is therefore small. The concatenation of a precode with an LT code has an approximately multiplicative effect on the codeword weights: a codeword of a given weight in the precode is mapped, with high probability, to a codeword whose weight is larger by a factor of roughly the bit-node degree in the LT code. From an algorithmic-performance perspective, the LDPC precode design is motivated by the need to avoid short cycles and low-weight codewords. At the architectural level, both LT


Fig. 5. Row merging for p = 11. H_base is the p × p base matrix of the matrix defined in (4).

Fig. 6. Regular-to-irregular graph transformation in submatrix Q for the set {0, 1, 3}, with p = 7. The nonzero entries corresponding to I^(0) in the first three rows are replaced; note that the column weight remains unchanged.


and LDPC decoding utilize the same hardware resources. Therefore, the LT-compatible precode is formed by the replication of a base matrix, using p-sized submatrices, as follows:
Input: A vector of target row weights.
Output: A matrix described by a base matrix in which each row has its target weight.
Procedure (Graph-Replication Construction Technique):
Step 1: For each target row weight:
i. Construct a matrix such that 1) it describes a 4-cycle-free bipartite graph (with an equal number of vertices on each side), 2) each of its rows has the target weight, and 3) the code it describes has no very-low-weight codewords.
ii. Replace every 0-entry with a '−' and every 1-entry with a shift value.
Step 2: Form a base matrix by vertically concatenating the matrices obtained in Step 1.
Step 3: If the code described by the resulting matrix has a very low minimum distance, return to Step 1.i and permute the rows.
Theorem 2: The graph described by the resulting matrix is 4-cycle-free.
Proof: See Appendix II.
Besides being composed of p-size submatrices, the constructed precodes have another architecture-aware property.

The Hamming weight of each row of the precode matrix is a multiple of the number of bit-node blocks, and the 1-entries of the row are distributed evenly between the p-wide blocks of the bit-node side. Hence, the 1-entries of each row can be partitioned into equal-size sets, each having exactly one element in every bit-node block. The bit-to-check messages corresponding to each of these sets can then be accessed from bit-node memory in one clock cycle; the bit-memory accesses corresponding to one LDPC check-node therefore take as many clock cycles as there are sets. Thus, no overhead exists in terms of the bit-node memory organization, which remains partitioned into blocks corresponding to the bit-node blocks, or the interconnect network. The reconfigurable CFU is designed to process messages of varying check-degree in a quasi-serial manner, with a constant throughput of messages per cycle. Since both the LT and LDPC matrices are composed of p-size submatrices, this scheme can be easily generalized to the partially-parallel decoding case.

The LDPC code generated by the graph-replication technique is 4-cycle-free, but the resulting Raptor graph formed from the LT and LDPC graphs need not be. A subclass of these LDPC codes that yields a 4-cycle-free Raptor graph can be obtained directly via a row-merging transformation, illustrated in Fig. 5, as follows:

Procedure: Row-Merging Transformation
Step 1: Form a set of shift integers with maximum cardinality such that: i. the set lies in the allowed shift range; and ii. no pair of elements yields the same difference as another pair.
Step 2: For each merged row: i. form a matrix by adding, modulo 2, the shifted identity matrices corresponding to the elements of the set; ii. replace every 0-entry with an all-zero block and every 1-entry with the corresponding shift value.
Step 3: Form the LDPC base-matrix by concatenating these matrices.
Step 4: Omit from the source matrix the rows whose indexes were merged, then omit the first columns; the resulting matrix describes an LT source matrix.

Theorem 3: The resulting graph is 4-cycle-free.
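A minimal sketch of the modulo-2 merging idea, under assumed example values (the prime, the shift set, and the helper names are ours, not the paper's): with distinct shifts, each merged row has weight equal to the number of merged identities, and distinct pairwise shift differences keep any two merged rows from sharing more than one column.

```python
# Hedged sketch of row merging: XOR several p x p shifted identities
# into one block. Shifts and p are hypothetical example values.

def shifted_identity(p, s):
    """p x p identity cyclically shifted by s columns."""
    return [[1 if (i + s) % p == j else 0 for j in range(p)] for i in range(p)]

def merge_rows(p, shifts):
    """Add (mod 2) the shifted identities I_s for s in shifts."""
    M = [[0] * p for _ in range(p)]
    for s in shifts:
        I = shifted_identity(p, s)
        for i in range(p):
            for j in range(p):
                M[i][j] ^= I[i][j]
    return M

p, S = 11, [0, 1, 3]   # pairwise differences 1, 2, 3 are distinct mod 11
M = merge_rows(p, S)
# Distinct shifts guarantee no cancellation, so each row has weight |S|.
assert all(sum(row) == len(S) for row in M)
```

The assertion checks the no-cancellation property; the pairwise-difference condition of Step 1.ii additionally limits any two merged rows to at most one shared column, which is what keeps the merged block 4-cycle-free.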


IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011

Fig. 7. FER and BER versus SNR curves for the rate-0.4 LDPC and Raptor codes.

Proof: See Appendix III.

The Raptor code designed above has the following properties. Its precode has a fixed rate, check-degree, and bit-degree; its LT code has a maximum check-degree and a maximum bit-degree.

1) Regular-to-Irregular Graph Transformation: The base matrix of the regular LDPC code obtained by row merging can be used as a starting point to design the base submatrices of irregular 4-cycle-free LDPC precodes. This can be done by simply omitting positive entries from the regular base matrix. However, if the base matrix is small, the effect on the code's minimum distance will be significant. Alternatively, the row-weights of the submatrices obtained from Step 2.i of the row-merging method can be changed via 1-entry substitutions across the matrix. This is done by picking an integer shift satisfying a compatibility property with the existing shift offsets. Then, within a submatrix, the 1-entries corresponding to one shifted identity can be interchangeably replaced by those of another in a column-per-column or row-per-row fashion. The result is an irregular 4-cycle-free LDPC graph. This procedure is illustrated in Fig. 6. By applying identical row shuffling in the precode submatrices and in the submatrices of the LT source matrix [Section III-B-2)] that share the same bit-nodes, a 4-cycle-free Raptor graph with an irregular LDPC subgraph is obtained. However, the resulting LT graph is no longer 6-cycle-free. Only regular LDPC precodes are considered in the decoder architecture discussed in the next section; however, only minor modifications to the control, memory address generation, and reconfigurable CFU operation are needed to make the architecture suitable for irregular LDPC decoding as well. In addition, the LT source matrix will be assumed to have its first columns omitted, as indicated in the row-merging transformation.

D. Code Design Examples and Simulation Results

In this section, the coding performance of Raptor codes constructed using the proposed techniques is compared to that of conventional LDPC codes at rates 0.4, 0.5, and 0.66. An LT code instance is first constructed. For each rate, LT codes obtained by several partition-set arrays are compared, and the array yielding the best error-correcting performance is chosen. Table I shows the details of the codes used. Both the regular and irregular LDPC precodes are 4-cycle-free and have a minimum distance of 6. Only the Raptor codes containing regular precodes are 4-cycle-free. The irregular precode outperforms the regular precode at rate 0.66 and therefore was used there. Three metrics were used to evaluate the code performance: the bit-error rate (BER), the frame error rate (FER), and the average number of iterations required for successful decoding, with a maximum of 100 iterations. The results are plotted in Figs. 7–9. The Raptor code outperforms the rate-0.4 LDPC code, compares favorably to the rate-0.5 LDPC code, and unfavorably to the rate-2/3 LDPC codes. However, the relatively low minimum distance of the precode implies that an error floor would appear at low error rates. To push the error floor lower, the precode has to be redesigned, or alternatively, p is set to 13 (i.e., the next


Fig. 8. FER and BER versus SNR curves for the rate-0.5 and rate-2/3 LDPC and Raptor codes.

Fig. 9. Average number of iterations required until decoding convergence versus SNR for all codes in Table I.



TABLE I PARAMETERS OF THE SIMULATED RAPTOR CODES AND LDPC CODES

TABLE II NUMBER OF EDGES IN THE MATCHING RAPTOR AND LDPC TANNER GRAPHS

prime after 11), thereby increasing the block length so that regular codes of minimum distance 8 can be obtained.

The code structure affects the decoding throughput through two metrics, namely, the number of decoding iterations needed to converge to a valid codeword and the message-processing workload required per iteration. The latter metric is directly related to the number of edges in the code graph (i.e., to the matrix sparsity). While comparable for rates 0.4 and 0.5, the average number of iterations in Raptor decoding is significantly higher than in LDPC decoding at rate 2/3. This is due to the nature of the LT code, where the transmitted bits are modulo-2 sums of the bit values instead of the bit values themselves. For rate 2/3, the designed Raptor code requires 9 iterations to converge even in favorable channel conditions. The numbers of edges in the corresponding LDPC and Raptor graphs, compared in Table II, indicate that the LDPC graphs are sparser than their Raptor counterparts.

IV. RAPTOR DECODER ARCHITECTURES

The LT source matrix can be viewed as a concatenation of p-deep layers; the precode, likewise, consists of layers. The underlying layered structure of the code is convenient for applying the turbo decoding algorithm. If two layers are connected via nonzero entries to a common bit-node, the posterior reliability value of that bit-node has to be updated upon processing the former of the two layers before being forwarded to the latter layer for processing. This scheduling order causes idle time, which significantly affects the decoder throughput and constrains the pipelining of the interconnect and computational units. This inconvenience is resolved, in LDPC decoding, by reordering layers, block-columns, or messages within the layers [26], or by processing all layers in parallel, serially within each layer, and adding offsets to the submatrix shift values [15].
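To make the scheduling discussion concrete, here is a hedged sketch of one row subiteration of the message-passing update used throughout this section: forward the posteriors, subtract the old check-to-bit messages, run the check update, and accumulate the difference back. A min-sum check update stands in for the paper's equations (2)-(3), and all names and numbers are illustrative, not the paper's.

```python
# Hedged sketch of one row subiteration (forward, subtract, check
# update, accumulate). Min-sum replaces the paper's (2)-(3); the
# variable names and example values are ours.
import math

def check_update(btc):
    """Min-sum check node: for each output, the sign product and the
    smallest magnitude over the *other* inputs."""
    out = []
    for i in range(len(btc)):
        others = btc[:i] + btc[i + 1:]
        sign = math.prod(1 if m >= 0 else -1 for m in others)
        out.append(sign * min(abs(m) for m in others))
    return out

def process_row(posterior, ctb_old, row):
    """One subiteration: refine the posteriors of the bit-nodes in
    `row` by the difference between new and old check-to-bit messages."""
    btc = [posterior[b] - c for b, c in zip(row, ctb_old)]  # subtract
    ctb_new = check_update(btc)                             # check update
    for b, new, old in zip(row, ctb_new, ctb_old):          # accumulate
        posterior[b] += new - old
    return ctb_new

posterior = [0.5, -1.2, 2.0, 0.8]   # example bit-node posteriors
ctb_old = [0.0, 0.0, 0.0]           # first visit: no old messages
row = [0, 2, 3]                     # bit-nodes checked by this row
ctb_new = process_row(posterior, ctb_old, row)
```

The dependency the text describes is visible here: `posterior[b]` must be up to date before another row touching bit-node `b` is processed, which is what forces idle time in a naive layered schedule.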
For serial LT decoding, the layer reordering considered in row merging can be done efficiently without affecting the bit-node degree distribution, assuming the latency time is less than the layer depth. The problem is more involved in precode decoding due to the small depth of the layers compared to the latency time and the high number of nonzero submatrices within each layer. The algorithmic modification needed in the precode construction, and its hardware cost as opposed to the performance degradation caused by ignoring the adjacent-layer interdependence, is beyond the scope of this work. The problem becomes more relevant in partially-parallel decoding. The proposed architectures in this paper, therefore, apply the TPMP decoding schedule.

Fig. 10. A serial Raptor decoder architecture. Parameter f in the intrinsic memory is < 1 and is chosen so that f·p·(p − 1) is the maximum block-size (N), or equivalently, so that the minimum achievable LT-code rate is attained.

A. Serial Decoder Architecture

Fig. 10 illustrates the architecture of a serial decoder for Raptor codes. The decoder processes the check-to-bit or bit-to-check messages corresponding to one matrix row per cycle. For a Raptor code composed of an LDPC precode and an LT code formed by row-splitting, the serial decoder performs one decoding iteration as a sequence of subiterations, as described next. For clarity of exposition, consider the submatrix composed of the rows used to generate the LT graph, horizontally concatenated with the rows corresponding to the LDPC precode; the rows corresponding to each LDPC check-node are grouped into one block row. At each subiteration of an iteration, the decoder performs the following steps (refer to Fig. 10):
1) Forwarding: A message vector is read from bit-node memory. Each entry corresponds to a

nonzero entry of the processed row; the entry is the posterior extrinsic reliability value, obtained at the end of the previous iteration, of the bit-node connected to the corresponding edge.
2) Permuting: The vector is sent to a pseudo-random permuter.
3) Bit-to-check operation: The vector of previous check-to-bit messages is read from the check-node memory, and the new bit-to-check messages are computed by subtracting the old check-to-bit messages from the permuted posterior values.
4) Check operation: New check-to-bit messages are generated by performing LT decoding using (2), or LDPC decoding using (3), according to the processed row.
5) Inverse permuting and check-memory write-back: The check-to-bit vector is written back to check-node memory and simultaneously inverse-permuted.
6) Posterior reliability update: The extrinsic posterior messages are updated by adding the difference between the new and old check-to-bit messages.

7) Accumulation: The updated vector is written back to the bit-node memory.

Steps 3), 6), and 7) compute the bit-to-check messages in a serial manner, thereby avoiding the underutilization and scheduling problems that may arise due to the bit-node degree variability, which, in turn, is caused by the rate and check-degree variability. This allows the bit-node block to store a constant number of messages regardless of the code rate, thus saving memory. The main components of the serial decoder are described in detail below.

Bit-Node Blocks 1 and 2: These blocks independently perform the forwarding [Step 1)] and accumulation [Step 7)], and they interchange their roles every iteration. Using two blocks instead of one allows simultaneous communication of check-to-bit and bit-to-check messages, hence doubling the throughput. Each memory block is composed of banks, where each bank holds multibit words and has one read and one write port. The banks are accessed using a two-part address: the first part, updated once per block of cycles, corresponds to the index of the accessed bit-node in the parent matrix [see Section III-B-2)]; the second, updated every cycle, corresponds to the bit index in the child matrix resulting from the girth-oriented replication. Since the source matrix is composed solely of shifted identity matrices, both parts are computed during LT decoding by either computing the shift-offset of the corresponding submatrix or incrementing the previous value. During LDPC decoding, the first part alternates between different values, updated once per block of cycles; the second is updated as in LT decoding.

Communication Network: It consists mainly of a permuter and an inverse permuter, as well as adders and subtractors. In the permuter design, a tradeoff exists between the number of possible permutations and the hardware complexity. One possible permuter applies a linear permutation generating


possible permutations. It can be implemented using linear-feedback shift registers to generate the permutation coefficients pseudo-randomly, a wide cyclic shifter, and multiplexers. The subtractor blocks form the bit-to-check messages [Step 3)], while the adders accumulate the posterior extrinsic reliability values [Step 6)].

Check-Node Block: The check-node block is composed of four main components.
1) Check-Node Memory: It consists of banks, each storing a bounded number of words and performing one read and one write per cycle. The memory banks are accessed sequentially.
2) Partition Table: It is a buffer that stores the check-node degrees that result from the different row-splitting scenarios (i.e., the partition sets in Fig. 3). In the simplest case, it can be accessed sequentially as a circular buffer. For each splitting scenario, a partition vector showing the degrees of the formed check nodes is stored. For example, if a row with Hamming weight 10 is split into four check nodes of degrees 3, 2, 4, 1, then the partition vector is 0001100001. Equality of two consecutive bits of the vector indicates that the corresponding messages belong to the same check-node.
3) Intrinsic Values Memory (IVM): It stores the intrinsic channel reliability values of the check-nodes. Since the number of words read every subiteration is variable but bounded, the IVM block is divided into banks, and the reliability value of each check-node is written at a location determined by its index. For correct operation, the values read from the memory banks are reorganized before being forwarded to the reconfigurable CFU, in such a way that wherever two consecutive bits of the partition vector differ, the corresponding entry of the IVM output vector equals the intrinsic reliability value of the check-node whose messages start at that position. This is done through a decompressor network, similar to that shown in Fig. 12(a), composed of a wide cyclic shifter followed by multiplexers that distribute the intrinsic values over the correct output locations.
4) Reconfigurable CFU: It receives the bit-to-check messages from the subtractor block, the intrinsic values from the IVM, and the partition vector, and it computes the check-to-bit messages corresponding to one row. The detailed design of the CFU is presented in Section IV-C.

B. Partially-Parallel Decoder Architecture

To increase throughput, the messages corresponding to several rows, or equivalently one block row, are processed simultaneously. The resulting partially-parallel architecture has the same organization as the serial architecture, with the following modifications. The bit-node memories have their banks organized as wide partitions, and access is done row-wise. The check-node (and IVM) memories are partitioned into blocks, each sized down by the parallelism factor but retaining the access rate and pattern of its counterpart in the serial architecture. For permutation, the output of each memory bank is cyclically shifted prior to applying independent random permutations on the message vectors corresponding to the processed rows. The number of CFUs and adders/subtractors is replicated by the parallelism factor.



TABLE III HARDWARE RESOURCES AND THROUGHPUT OF SERIAL AND PARTIALLY PARALLEL ARCHITECTURES. THROUGHPUT IS MEASURED AS THE NUMBER OF COMPLETED ITERATIONS/s; freq IS THE CLOCK FREQUENCY. THE MEMORY ADDRESS DECODERS ARE NOT INCLUDED
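The partition-vector bookkeeping that underlies these resource counts (the Partition Table of Section IV-A) can be sketched as follows; the toggle-at-boundary encoding is inferred from the paper's 0001100001 example, and the function names are ours, not the paper's.

```python
# Sketch of the partition-vector convention of Section IV-A: one bit per
# message, toggled at every check-node boundary, so that equal
# consecutive bits mean "same check node". Function names are ours.

def degrees_to_partition(degrees):
    bits, cur = [], 0
    for d in degrees:
        bits.extend([cur] * d)
        cur ^= 1                      # toggle at each check-node boundary
    return "".join(map(str, bits))

def partition_to_degrees(vector):
    degrees, run = [], 1
    for a, b in zip(vector, vector[1:]):
        if a == b:
            run += 1
        else:
            degrees.append(run)
            run = 1
    degrees.append(run)
    return degrees

# The paper's example: a weight-10 row split into degrees 3, 2, 4, 1.
print(degrees_to_partition([3, 2, 4, 1]))   # 0001100001
print(partition_to_degrees("0001100001"))   # [3, 2, 4, 1]
```

The round-trip shows why a single bit per message suffices: the CFU only ever needs to compare adjacent bits to decide whether two messages belong to the same check node.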

The hardware resources and throughput for the serial and partially-parallel architectures are summarized in Table III, given per message bit of precision. In total, processing one row requires reading a number of values from memory proportional to the average number of nodes formed by one row-split, writing the corresponding values back, and applying one CFU operation together with the associated additions/subtractions.

C. Reconfigurable Check Function Unit Design

The design of the reconfigurable CFUs presented in this section is based on their fixed-degree counterpart implementations. Three basic algorithms for the CFU operation exist, namely, the conventional one implementing (2) and (3), the BCJR-based [10], [19], and the Min-Sum [21], [27] algorithms. The latter two give reduced-complexity approximations of the check-update equations and thus need no LUTs to compute the kernel function. The Min-Sum algorithm, in particular, reduces the check-memory requirements. However, this comes at the expense of a varying degradation in performance for the Min-Sum update, and of an increase in the complexity of the computational units and in the CFU latency for the BCJR-based update. In this section, BTC_i denotes the bit-to-check message of index i in the processed row, and the intrinsic reliability value associated with an edge is that of the check node, derived from the row, to which the edge is connected.

1) Accumulator-Based CFU Architecture: The kernel function f and its inverse in (2) and (3) are implemented using LUTs. The function is also applied on the intrinsic reliability values only once, and the generated values are stored in the IVM. The proposed reconfigurable CFU is based on the design of a fixed-degree CFU that implements the following transformed equation:

CTB_i = f^{-1}( Σ_j f(BTC_j) − f(BTC_i) ).

This equation can be implemented using a tree of adders of logarithmic depth to compute the total sum Σ_j f(BTC_j), followed by subtractors to extract the individual messages from the total sum. At each level of the adder tree, the bit-to-check messages are partitioned into sets such that 1) each set includes messages with consecutive indices, 2) the sum of the operations on the messages in a set is computed at some stage (call it the output value of the set), and 3) the output values of these sets are fed to the adder subtree starting from the following stage, which, in turn, computes their sum. The left (right) boundary of a set is defined as the minimum (maximum) index of the messages in the set. In LT-decoding mode, intermediate values are maintained at every level of the adder tree. The intermediate value associated with an edge stores

the sum of the operations on messages that 1) belong to the same partition as the edge's message at that level and 2) have their corresponding LT-graph edges connected to the same check node as that edge. The algorithm for updating the intermediate values is shown below, and the corresponding architecture is shown in Fig. 11(a).

Algorithm 1: Accumulator-Based Reconfigurable CFU Algorithm
for each stage of the adder tree do
  for every adder that, in fixed-rate mode, adds the outputs of two adjacent sets at this stage do
    if the right boundary of the left set and the left boundary of the right set correspond to the same check-node then
      add the two set outputs and propagate the result as the intermediate value of the boundary edges
    else
      keep the two set outputs as separate intermediate values
    end if
    if the outer boundaries of the combined set share a check-node with neighboring sets, update their intermediate values accordingly
  end for
end for

In LDPC-decoding mode, the bit-to-check messages corresponding to one check-node are fed to the CFU in consecutive cycles. By using an extra accumulator to add the per-cycle output values into the running sum of the current check-node, and by delaying the subtraction for the extra cycles required to obtain that sum, the CFU attains a constant throughput of messages per cycle. The registers used for storing the intermediate vector during LT processing are re-used to store the running sum for the additional cycles. The extra latency cycles are due to the quasi-serial mode of CFU operation in LDPC decoding; this is unlike the LT-decoding latency, which is caused by pipelining targeted at enhancing the operating frequency.

2) Forward–Backward BCJR-Based CFU Architecture: In [10] and [19], a SISO message-processing unit (MPU) was proposed to implement check-message processing. The


check-to-bit messages are computed using a simplified form of the BCJR algorithm [20]. Let ⊞ denote an operator which performs the check-node (boxplus) combination of two reliability values. Then the message from a degree-d check-node to a bit-node is the boxplus combination of the bit-to-check messages on the node's other d − 1 edges: CTB_i = ⊞_{j≠i} BTC_j [10]. Moreover, the following simple yet fairly accurate approximation of the boxplus operation was proposed, up to a small additive correction constant:

a ⊞ b ≈ max(0, a + b) − max(a, b).

The unit implementing this approximation is called the Max-Quartet MPU. The resulting CFU implements the BCJR algorithm on the syndrome trellis of a single-parity-check (SPC) code, where intermediate forward and backward metrics are propagated each cycle. At stage k, a forward metric α_k and a backward metric β_k are computed according to the recursions

α_k = α_{k−1} ⊞ BTC_k and β_k = β_{k+1} ⊞ BTC_k,

respectively. Then, the check-to-bit messages are generated as CTB_k = α_{k−1} ⊞ β_{k+1}. A reconfigurable version of the SISO MPU is proposed in Fig. 11(b). LT-node processing capability is achieved by multiplexing the inputs to the Max-Quartet units implementing the α- and β-recursions. The computation of the forward and backward metrics is done as follows:

α_k = α_{k−1} ⊞ BTC_k if edges k − 1 and k have the same check node, and α_k = BTC_k otherwise;
β_k = β_{k+1} ⊞ BTC_k if edges k + 1 and k have the same check node, and β_k = BTC_k otherwise.

In LDPC-decoding mode, the boxplus combination of the bit-to-check messages of each row merged to form the check-node is computed and then forwarded to a serial CFU; the CFU output is then forwarded to Max-Quartet units to compute the check-to-bit messages.

3) Min-Sum CFU Architecture: A reduced-complexity approximation of the check-to-bit message computation is given in [21] as

CTB_i = ( ∏_{j≠i} sign(BTC_j) ) · min_{j≠i} |BTC_j|.   (5)

The Min-Sum approximation results in a degradation in the error-correcting performance. To reduce the performance gap, a correction step, consisting of subtracting an offset or multiplying by a normalization factor [27], is applied on the resulting minimum value. The reconfigurable CFU implementing the Min-Sum approximation is based on the constant-degree implementation given in [16]. The architecture has a tree structure similar to the accumulator-based CFU, where the addition operation is substituted by a 4-input 2-output partial-sorting function that computes the minimum and the second minimum of two sorted input pairs. The final outputs of the tree are the minimum value, its index, and the second minimum value of the inputs. During LT-decoding, two intermediate values instead of one are needed


for every edge per level. These are the minimum and second minimum values of the messages that 1) belong to the same partition as the edge's message at that level and 2) have their corresponding LT-graph edges connected to the same check node. To reduce the number of intermediate values per edge to approximately one, the intermediate value of each degree-1 LT-node edge is set to the corresponding intrinsic channel reliability value. For a degree-2 LT-node edge, the intermediate value holds the corresponding minimum value if the edge index is odd, or the second minimum value if it is even. The indexes of the minimum values, corresponding to the row check nodes, are tracked across the CFU by updating a minimum-index vector after each stage. In the final stage, the correction step is applied on the intermediate values, and the CTB messages are generated using the partition and minimum-index vectors. In LDPC-decoding mode, the minimum-index vector and the odd- and even-indexed outputs of the correction step are forwarded to an additional partial sorter. The minimum, second minimum, and minimum index over the rows of every corresponding check node are thus computed, with extra latency cycles, and forwarded to the CTB block.

The Min-Sum algorithm changes the check-node memory requirements, since up to three values are sufficient to regenerate the absolute values of the messages corresponding to one node. This results in significant savings in the LDPC memory. Three LT-memory organization schemes are considered: 1) the LT-memory retains its conventional organization; 2) the LT-memory stores two values per node (the minimum and second minimum values) and a minimum-index vector per row; and 3) the LT-memory stores three values per LT-node. In the two latter schemes, the LT-memory storing the absolute values of the CTB messages is composed of two blocks, one of which stores the odd/even-indexed values output by the correction step, prior to the CTB generation.
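The compression at the heart of these memory savings, regenerating all outgoing Min-Sum magnitudes of a check node from just the minimum, second minimum, and minimum index, then applying the offset correction, can be sketched as follows; the offset value and the function names are illustrative, not the paper's.

```python
# Sketch of why three values per check node suffice under Min-Sum:
# the outgoing magnitude toward edge i is min2 if edge i holds the
# overall minimum, else min1; an offset correction [27] follows.
# The offset 0.5 is an arbitrary example value, not the paper's.

def minsum_compress(btc):
    """Reduce a node's inputs to (min1, min2, argmin, sign product)."""
    mags = [abs(m) for m in btc]
    idx = mags.index(min(mags))
    min1 = mags[idx]
    min2 = min(m for j, m in enumerate(mags) if j != idx)
    total_sign = 1
    for m in btc:
        total_sign *= 1 if m >= 0 else -1
    return min1, min2, idx, total_sign

def minsum_regenerate(state, btc, offset=0.5):
    """Rebuild every outgoing message of (5) from the compressed state."""
    min1, min2, idx, total_sign = state
    out = []
    for i, m in enumerate(btc):
        mag = min2 if i == idx else min1
        sign = total_sign * (1 if m >= 0 else -1)  # divide out edge i's sign
        out.append(sign * max(mag - offset, 0.0))  # offset correction
    return out

btc = [3.0, -1.0, 4.0, 2.0]
state = minsum_compress(btc)
print(minsum_regenerate(state, btc))   # [-0.5, 1.5, -0.5, -0.5]
```

Storing `state` instead of one magnitude per edge is exactly the saving the three LT-memory schemes above trade against the cost of the extra compressor/decompressor logic.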
The odd/even-indexed memory block has access patterns, and thus an organization, similar to those of the IVM memory, but with a different number of memory banks. A wide compressor is needed to store the CFU output in memory, and a wide decompressor is needed to regenerate the corresponding messages in the next iteration. Fig. 12(a) shows the CFU interconnect under this scheme. Fig. 12(b) compares the reduction in check-node memory size in the three respective schemes versus the average LT-node degree. As the average LT-node degree increases, the memory-size reduction brought by the Min-Sum algorithm becomes more significant.

Table IV compares the hardware complexity of the fixed-degree and reconfigurable implementations of the three message-update algorithms. The main advantage of the BCJR-based and Min-Sum implementations is the elimination of lookup tables. In the BCJR-based CFU, the numbers of registers and of relatively complex function units are higher than in the other two algorithms. On the other hand, the accumulator-based and Min-Sum implementations involve intensive multiplexing and have their register count replicated.

V. PERFORMANCE EVALUATIONS AND SIMULATION RESULTS

1) Quantization Analysis: A bit-accurate C++ simulator of the serial Raptor decoder was developed to analyze the quantization effect on the coding performance. The decoding procedure of the rate-0.4 Raptor code (RAP in Table I) was simulated



Fig. 11. (a) Accumulator-based CFU architecture. The gray-shaded muxes are needed for LDPC-decoding. Registers used to propagate BTC messages from the accumulator to the subtractor are omitted. (b) BCJR-based CFU. Blocks labeled f perform the Max-Quartet operation.

for message quantizations of 5, 6, 7, and 28 (almost ideal) bits, assuming a white-Gaussian-noise channel with BPSK modulation. Three check-message update equations were considered: the conventional [using (2) and (3)], the BCJR-based, and the Min-Sum with offset [22]. In the latter algorithm, the offset was obtained, per check node, by reading from a lookup table according to the corresponding check degree and the average bit posterior reliability value resulting from the previous iteration. As illustrated in Fig. 13, quantization results in less steep curves compared to the almost-ideal case. The performance gap between the conventional and BCJR-based algorithms, already small in the near-ideal case, disappears for the other quantization levels. The Min-Sum algorithm incurs a modest performance loss for message widths greater than 5. In all algorithms, a large performance loss results from going to 5-bit messages; therefore, the message quantization is set to 6 bits in the synthesis step discussed next.

2) Synthesis Results: The serial decoder datapath was modeled in Verilog, with its bit-width set accordingly, and synthesized using a 65-nm, 1.2-V customized CMOS library. The Design Compiler tool from Synopsys [28] was used to synthesize the logic components. Area and power estimates of the various memory blocks of the decoder were computed using the CACTI software tool [29], after some customizations to fit the decoder requirements. The resulting area and power figures of the basic decoder components are summarized in Table V. The critical path of the decoder was estimated to be 3.26 ns; hence, the clock frequency was set at 300 MHz. The decoder dissipates 222 mW when operating at 1.2 V and 300 MHz. Memory accesses dissipate 85% of the total power, partially due to the several memory accesses performed per cycle. The check-node memory accesses,

which account for 40% of the overall memory accesses, dissipate 49% of the total power. This is due to the large size of the check-node memory, which occupies 70% of the total decoder area. This motivates optimizing the check-node memory through partitioning, making use of the fact that it is accessed sequentially. Regarding area, 92% is occupied by memory, mainly due to the low message-processing throughput of the decoder and, consequently, the small area of the combinational components. However, in partially-parallel architectures, the area ratio of the interconnect and combinational function units to memory increases, as Table III suggests. The throughput of the synthesized serial decoder is plotted versus SNR in Fig. 14. Unlike the case for LDPC codes, no throughput advantage is achieved when the code rate increases; this is due to the increase in the number of iterations required for convergence that accompanies the rate increase.

VI. CONCLUSION

Hardware-efficient Raptor decoders have been developed through a methodology that encompasses code construction, decoding scheduling, and architectural optimizations in a single design cycle. The proposed method decouples the code structure from the random encoding process and utilizes a memory- and variability-aware decoding procedure and reconfigurable check processing. The decoding procedure is hence mapped to row-processing of a regular matrix. The resulting Raptor codes can be employed in wireless communication channels, where the code rate becomes a real-time parameter. The rateless nature of a Raptor code, added to the simple LT-encoding process, can be exploited to enhance communication under varying and poor channel conditions such as multiuser



Fig. 12. (a) CFU operation and interconnect of the Min-Sum implementation. Only LT-decoding mode is shown for clarity. (b) Check-node memory size of the three schemes in the Min-Sum implementation, relative to the memory organization discussed in Section IV-A; p = 11.

TABLE IV HARDWARE RESOURCES OF FIXED-DEGREE AND RECONFIGURABLE ARCHITECTURES OF THE THREE ALGORITHMS. THE MIN-SUM CORRECTION STEP AND THE SIGN-PRODUCING LOGIC ARE NOT INCLUDED

channels and ad hoc networks. Future work includes designing Raptor codes at higher rates, applying the turbo decoding approach, and investigating the possibility of enhancing the decoding convergence speed relative to LDPC codes.

APPENDIX I
PROOF OF THEOREM 1

Let … be a node of graph …. By abuse of notation, let … also denote the index of the corresponding node in matrix …, and let … be the node in the graph … from which … is formed by graph replication. Node … can be described by three quantities …, where … and ….

We first show that … is 4-cycle free. Assume … has a length-4 cycle …, where the …'s are bit-nodes and the …'s are check-nodes. Two cases exist.

Case 1) If … and …, then a length-4 cycle exists in …, which is impossible by the construction of ….
Case 2) If …, then edges … and … result in …, which is impossible by the edge replication method.

We next prove that … is 6-cycle free. Assume … has a length-6 cycle …. By construction of …, then … and …. Therefore …, or equivalently (6). By the replication method, … and …. Therefore …, or equivalently (7). Similarly, by construction of …, then …. Substituting in (7) yields (8). Solving (6) and (8) gives … or …, which are impossible by the construction of ….

Fig. 13. (a) Almost-ideal performance of the three algorithms considered. Performance of the (b) conventional, (c) BCJR-based, and (d) min-sum algorithms for various quantization levels. Since the max-quartet approximation involves a constant, the internal datapath width of the BCJR-based CFU is chosen so that the corresponding quantization step is a multiple of ….

TABLE V POWER AND AREA ESTIMATES OF THE BASIC COMPONENTS OF THE SYNTHESIZED SERIAL DECODER. THE ACCUMULATOR-BASED CFU IS USED IN THE ARCHITECTURE. THE IVM MEMORY CAN PERFORM (p 1) READS PER CYCLE. THE INTERCONNECT NETWORK CONSISTS OF THE PERMUTER/DEPERMUTER, IVM INTERCONNECT, ADDERS, AND OTHER AUXILIARY LOGIC
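The 4-cycle-freeness arguments in Appendices I–III all reduce to one standard fact: a Tanner graph contains a length-4 cycle exactly when two check-nodes (rows of the parity-check matrix) share support in two or more bit-nodes (columns). A minimal illustrative checker, our own sketch rather than code from the paper:

```python
from itertools import combinations

def has_length4_cycle(H):
    """Return True iff the Tanner graph of the 0/1 matrix H has a
    length-4 cycle, i.e., two rows overlap in >= 2 columns."""
    # Column support of each check-node (row of H).
    supports = [frozenset(j for j, v in enumerate(row) if v) for row in H]
    # A 4-cycle is two check-nodes sharing two or more bit-nodes.
    return any(len(a & b) >= 2 for a, b in combinations(supports, 2))

# Rows 0 and 1 share columns 0 and 1 -> a 4-cycle exists.
assert has_length4_cycle([[1, 1, 0], [1, 1, 1]])
# Every pair of rows overlaps in at most one column -> 4-cycle free.
assert not has_length4_cycle([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```

The replication and row-merging constructions in the paper are designed precisely so that this pairwise-overlap condition (and its length-6 analogue) can never occur.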

Fig. 14. Synthesized decoder throughput (in decoded information bits per second) versus SNR for Raptor codes RAP…, RAP…, RAP….

APPENDIX II
PROOF OF THEOREM 2

Assume …, described by matrix …, has a length-4 cycle …, where the …'s are bit-nodes and the …'s are check-nodes. Two cases exist.

Case 1) If …, then a length-4 cycle exists in the graph described by …, which is impossible by Step 1.i of the construction method.
Case 2) If …, then … is connected to … and … in the 4-cycle, which implies that …. Similarly, … is connected to … and … in the 4-cycle, hence …. Therefore …, which implies that … by construction.

APPENDIX III
PROOF OF THEOREM 3

Assume …, described by matrix …, has a length-4 cycle …, where the …'s are bit-nodes and the …'s are check-nodes. Three cases exist.

Case 1) Both … and … are LT check-nodes. By Theorem 1, this case is impossible.
Case 2) Both … and … are LDPC check-nodes. If …, then there exists … such that … and …, which is impossible by condition 1.ii) of the row-merging transformation. On the other hand, if …, then this case is impossible by an argument similar to that used in the proof of Case 2 of Theorem 2.
Case 3) … is an LT check-node and … is an LDPC check-node. That … is connected to … and … in the 4-cycle implies that … and …. Also, that … is connected to … implies that …. Therefore, …. Hence, in the matrix constructed by graph replication, bit-node …, where … (i.e., … does not correspond to any of the first … columns), is connected to two check-nodes having different …-coordinates but equal …-coordinates, which is impossible by the graph replication method.

REFERENCES

[1] M. Luby, "LT codes," in Proc. 43rd Annu. IEEE Symp. Foundations Comput. Sci., Vancouver, BC, Canada, Nov. 2002, pp. 271–280.


[2] A. Shokrollahi, "Raptor codes," IEEE Trans. Inf. Theory, vol. 52, no. 6, pp. 2551–2567, Jun. 2006.
[3] P. Elias, "Coding for two noisy channels," in Proc. 3rd London Symp. Inf. Theory, Sep. 1955, pp. 61–76.
[4] O. Etesami and A. Shokrollahi, "Raptor codes on binary memoryless symmetric channels," IEEE Trans. Inf. Theory, vol. 52, no. 5, pp. 2033–2051, May 2006.
[5] R. Palanki and J. S. Yedidia, "Rateless codes on noisy channels," in Proc. Int. Symp. Inf. Theory, Chicago, IL, Jul. 2004, p. 37.
[6] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[7] G. D. Forney, Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[8] IEEE Standard for Local and Metropolitan Area Networks—Part 16: Air Interface for Broadband Wireless Access Systems, IEEE Std. 802.16-2009, May 2009 [Online]. Available: http://www.ieee802.org/16
[9] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inf. Theory, vol. IT-27, pp. 533–547, Sep. 1981.
[10] M. M. Mansour and N. R. Shanbhag, "High-throughput LDPC decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 6, pp. 976–996, Dec. 2003.
[11] X.-Y. Hu et al., "Efficient implementations of the sum-product algorithm for decoding LDPC codes," in Proc. IEEE Global Commun. Conf., San Antonio, TX, Nov. 2001, pp. 1036–1036E.
[12] C. Howland and A. Blanksby, "Parallel decoding architectures for low density parity check codes," in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, May 2001, pp. 742–745.
[13] K. Gunnam et al., "VLSI architectures for layered decoding for irregular LDPC codes of WiMAX," in Proc. IEEE Int. Conf. Commun., Glasgow, Scotland, Jun. 2007, pp. 4542–4547.
[14] B. Xiang et al., "An area-efficient and low-power multirate decoder for quasi-cyclic low-density parity-check codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1447–1460, Oct. 2010.
[15] K. Zhang, X. Huang, and Z. Wang, "High-throughput layered decoder implementation for quasi-cyclic LDPC codes," IEEE J. Sel. Areas Commun., vol. 27, no. 6, pp. 985–994, Aug. 2009.
[16] N. Jiang et al., "High-throughput QC-LDPC decoders," IEEE Trans. Broadcast., vol. 55, no. 2, pp. 251–259, Jun. 2009.
[17] J.-B. Doré, M.-H. Hamon, and P. Pénard, "On flexible design and implementation of structured LDPC codes," in Proc. 18th Annu. IEEE Int. Symp. Personal, Indoor Mobile Radio Commun., Athens, Greece, Sep. 2007, pp. 1–5.
[18] R. Townsend and E. Weldon, "Self-orthogonal quasi-cyclic codes," IEEE Trans. Inf. Theory, vol. IT-13, no. 2, pp. 183–195, Apr. 1967.
[19] M. M. Mansour, "A turbo-decoding message-passing algorithm for sparse parity-check matrix codes," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4376–4392, Nov. 2006.
[20] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, pp. 284–287, Mar. 1974.
[21] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, May 1999.
[22] J. Chen et al., "Reduced-complexity decoding of LDPC codes," IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.
[23] M. M. Mansour and N. R. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 684–698, Mar. 2006.
[24] C.-H. Liu et al., "An LDPC decoder chip based on self-routing network for IEEE 802.16e applications," IEEE J. Solid-State Circuits, vol. 43, pp. 684–694, Mar. 2008.
[25] S. Kim, G. E. Sobelman, and H. Lee, "A reduced-complexity architecture for LDPC layered decoding schemes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, DOI: 10.1109/TVLSI.2010.2043965, preprint.
[26] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, "A minimum-latency block-serial architecture of a decoder for IEEE 802.11n LDPC codes," in Proc. IFIP Int. Conf. Very Large Scale Integr., Atlanta, GA, Oct. 2007, pp. 236–241.
[27] J. Chen and M. Fossorier, "Density evolution for two improved BP-based decoding algorithms of LDPC codes," IEEE Commun. Lett., vol. 6, no. 5, pp. 208–210, May 2002.
[28] Synopsys Design Compiler [Online]. Available: http://www.synopsys.com
[29] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "CACTI 6.0: A tool to model large caches," in Proc. Int. Symp. Microarchitecture, Chicago, IL, Dec. 2007, pp. 3–14.


Hady Zeineddine received the B.E. degree from the American University of Beirut (AUB), Beirut, Lebanon, in 2006 and the M.S. degree from the University of Texas at Austin in 2009, both in electrical and computer engineering. He is currently working toward the Ph.D. degree at AUB. His research interests are in the design of algorithms and architectures for efficient IC implementation of communication and digital signal processing applications.

Mohammad M. Mansour (S’98–M’03–SM’08) received the B.E. degree with distinction and the M.E. degree, both in computer and communications engineering, from the American University of Beirut (AUB), Beirut, Lebanon, in 1996 and 1998, respectively, and the M.S. degree in mathematics and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, IL, in 2002 and 2003, respectively. In 1996, he was a Teaching Assistant in the Electrical and Computer Engineering (ECE) Department at AUB, and in 1997 he was a Research Assistant in the same department. From 1998 to 2003, he was a Research Assistant at the Coordinated Science Laboratory (CSL) at UIUC. During the summer of 2000, he worked with the wireless research group at National Semiconductor Corporation, San Francisco, CA. From December 2006 to August 2008, he was on research leave with QUALCOMM Flarion Technologies, Bridgewater, NJ, where he worked on modem design and implementation for the 3GPP-LTE, 3GPP-UMB, and peer-to-peer wireless networking PHY-layer standards. He is currently an Associate Professor with the ECE Department at AUB, Beirut, Lebanon. His research interests are VLSI design and implementation for embedded signal processing and wireless communications systems, coding theory and its applications, digital signal processing systems, and general-purpose computing systems.

Prof. Mansour is a member of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. He has been serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II (TCAS-II) since April 2008, for ISRN Applied Mathematics since December 2010, and for the IEEE TRANSACTIONS ON VLSI SYSTEMS since January 2011. He is the Technical Co-Chair of the IEEE SiPS 2011 workshop. He received the Phi Kappa Phi Honor Society Award twice, in 2000 and 2001, and the Hewlett Foundation Fellowship Award in March 2006. He joined the faculty at AUB in October 2003.

Ranjit Puri received the B.E. degree with distinction from the Electronics and Telecommunication Department, University of Mumbai, India, in 2007 and the M.S. degree in electrical engineering from the University of Texas at Austin in 2009. He currently works as a Software Engineer in the Windows division with Microsoft Corporation, Redmond, WA. His current research interests include systems software for virtual machine management and embedded systems. Mr. Puri was awarded the J. N. Tata and Dorabji Tata Trust Scholarship in 2007 and is a member of Phi Kappa Phi and Gamma Beta Phi.