Hardware optimizations of hard-decision ECC decoders for MLC NAND flash memories
Youngjoo Lee
Department of Electronic Engineering
Kwangwoon University
Seoul, Republic of Korea
[email protected]

Abstract: This paper summarizes various optimization schemes of BCH-based error-correction code (ECC) decoders for commercial NAND flash memories. To improve the energy efficiency, either by reducing the decoding energy consumption or by increasing the decoding throughput, the decoder architectures are carefully analyzed so that internal processing units can be shared. An enhanced folding technique can be applied to reduce the hardware complexity further while constructing a regular pipelined architecture. In addition, a pre-processing method can be used to reduce the number of on-chip SRAM accesses in the decoding of BCH-based product codes, saving a significant amount of decoding energy.

Keywords: Decoder architecture; Error-correction codes; Hardware optimization; NAND flash memory; VLSI design

I. INTRODUCTION
Recently, multi-level-cell (MLC) NAND flash memories have steadily replaced traditional magnetic storage devices in commercial electronics owing to attractive characteristics such as short access latency, high density, and low-power operation. As the technology advances, however, the data integrity of MLC NAND flash memories suffers from various error sources, including stress-induced leakage current, cell-to-cell interference, and retention loss. Hence, there are strict limitations on the number of program/erase (P/E) cycles of NAND flash memories, as illustrated by the raw bit-error rate (RBER) in Fig. 1. To extend the number of usable P/E cycles as far as possible, error-correction codes (ECCs) are indispensable in NAND-flash-memory-based storage systems [1].

Figure 1. RBER of MLC NAND flash memory over P/E cycles [1]. (Axes: RBER from 0.000 to 0.008 versus P/E cycles from 0 to 50000.)

Compared to the ECCs of communication systems, the ECCs for storage have the following unique properties:
1) Due to the limited spare region of the flash memory, the acceptable code rate of storage systems is higher than that of communication systems. Recent works normally adopt ECCs with rates above 0.9, whereas modern communication systems allow low-rate ECCs with rates down to 1/3.
2) The target uncorrectable bit-error rate (UBER) is quite low. To replace traditional magnetic hard-disk drives, the ECCs for NAND flash memories should achieve a UBER of 10^-12 or lower.

Therefore, recent storage systems realize two different ECC decoders: an energy-efficient but weak hard-decision BCH decoder and an energy-hungry but strong soft-decision LDPC decoder [2]. In general, the BCH decoder is activated at the early-use stage of NAND flash memories, whereas the LDPC decoder is activated to extend the lifetime once the channel condition deteriorates at the late-use stage. Although recent works have focused on strong soft-decision LDPC codes [3], [4], it is still important to provide an energy-efficient hard-decision ECC decoder, as it mostly determines the overall performance of storage systems during the lifetime.

In this paper, numerous hardware optimization techniques that improve the decoding energy efficiency are summarized. Processing elements can be shared as much as possible by identifying common intermediate values. The advanced folding technique constructs a balanced pipeline architecture while further reducing the required processing units. Moreover, to reduce the energy consumption, the number of memory accesses can be minimized by applying pre-processing together with a proper data forwarding scheme.

II. OPTIMIZATION TECHNIQUES
A. BCH Decoder Architecture
Fig. 2 shows the conceptual diagram of the pipelined structure of the BCH decoder, the most popular hard-decision ECC decoder for storage systems [5]–[9]. A BCH decoder has three major stages: the syndrome generation block, the key-equation solver (KES), and the Chien search block. To raise the decoding throughput enough to keep up with recent high-speed host interfaces, the syndrome generation and Chien search blocks generally adopt massively parallel architectures, which take numerous hardware resources. On the other hand, the folding technique is actively applied to the KES to relax the hardware complexity and to balance the pipelined processing.
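To make the three stages concrete, the following is a minimal software sketch, not the hardware architecture of [5]–[9]. It assumes a toy BCH code of length 15 with t = 2 over GF(2^4), built on the primitive polynomial x^4 + x + 1, and it uses the Berlekamp-Massey algorithm as the key-equation solver. That choice is an assumption, since the text does not name a specific KES algorithm. All function names are hypothetical.

```python
# Toy software model of the three BCH decoding stages (illustrative only).
M = 4                      # assumed field GF(2^4)
PRIM = 0b10011             # assumed primitive polynomial x^4 + x + 1
N = (1 << M) - 1           # code length 15

# Exponent/logarithm tables for GF(2^4).
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
_x = 1
for _i in range(N):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & (1 << M):
        _x ^= PRIM
for _i in range(N, 2 * N):
    EXP[_i] = EXP[_i - N]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[(N - LOG[a]) % N]

def syndromes(r, t):
    """Stage 1: S_i = r(alpha^i) for i = 1..2t, where r[j] is the coefficient of x^j."""
    out = []
    for i in range(1, 2 * t + 1):
        s = 0
        for j, bit in enumerate(r):
            if bit:
                s ^= EXP[(i * j) % N]
        out.append(s)
    return out

def key_equation_solver(S):
    """Stage 2: Berlekamp-Massey; returns error-locator coefficients [1, c1, ..., cL]."""
    size = len(S) + 1
    C = [1] + [0] * (size - 1)   # current locator polynomial
    B = [1] + [0] * (size - 1)   # locator saved at the last length change
    L, m, b = 0, 1, 1
    for n in range(len(S)):
        d = S[n]                 # discrepancy
        for i in range(1, L + 1):
            d ^= gf_mul(C[i], S[n - i])
        if d == 0:
            m += 1
            continue
        coef = gf_mul(d, gf_inv(b))
        T = C[:]
        for i, bi in enumerate(B):
            if bi and i + m < size:
                C[i + m] ^= gf_mul(coef, bi)
        if 2 * L <= n:
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

def chien_search(locator):
    """Stage 3: position j is in error if the locator vanishes at alpha^(-j)."""
    positions = []
    for j in range(N):
        val = 0
        for k, c in enumerate(locator):
            if c:
                val ^= gf_mul(c, EXP[(-j * k) % N])
        if val == 0:
            positions.append(j)
    return positions

def bch_decode(r, t):
    S = syndromes(r, t)
    if not any(S):
        return list(r)           # zero syndromes: nothing to correct
    locator = key_equation_solver(S)
    positions = chien_search(locator)
    out = list(r)
    if len(positions) == len(locator) - 1:   # otherwise report the word unchanged
        for p in positions:
            out[p] ^= 1
    return out
```

Because the all-zero word is a codeword of any linear code, a quick check is to flip bits 3 and 10 of fifteen zeros and call `bch_decode(received, 2)`, which restores the all-zero word since two errors are within t = 2. In hardware, the two loops over `j` are the parts that the text describes as massively parallel.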
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the “ICT Consilience Creative Program” (IITP-2015-R0346-15-1007) supervised by the IITP (Institute for Information & communications Technology Promotion)
ISOCC 2015
Figure 2. Typical 3-stage pipelined architecture of the BCH decoder. (Block labels: Syndrome Calculation, Key-equation Solver, Chien Search, On-chip buffer, Corrected Data.)
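The pipeline balance in Fig. 2 can be illustrated with a first-order cycle-count model. This is only a sketch of generic folding (time-multiplexing operators over more cycles), not the overlapped folding scheme of [6]. The codeword length, parallel factor, KES cycle count, and multiplier count used below are hypothetical numbers chosen for illustration, as are the function names.

```python
import math

def stage_cycles(codeword_bits, parallel_factor):
    """Cycles for a stage that consumes `parallel_factor` bits per cycle."""
    return math.ceil(codeword_bits / parallel_factor)

def max_folding_factor(kes_cycles_unfolded, slot_cycles):
    """Largest integer f with f * kes_cycles_unfolded <= slot_cycles (at least 1)."""
    return max(1, slot_cycles // kes_cycles_unfolded)

def folded_kes(multipliers_unfolded, kes_cycles_unfolded, f):
    """Folding by f: roughly 1/f of the multipliers, f times the cycles."""
    return math.ceil(multipliers_unfolded / f), kes_cycles_unfolded * f

# Hypothetical example: 8640-bit codeword, 32 bits per cycle in the
# syndrome and Chien search stages, a 64-cycle unfolded KES with 96 multipliers.
slot = stage_cycles(8640, 32)            # pipeline slot set by the parallel stages
f = max_folding_factor(64, slot)
multipliers, kes_cycles = folded_kes(96, 64, f)
print(slot, f, multipliers, kes_cycles)
```

With these assumed numbers the folding factor is 4: the KES would use 24 multipliers for 256 cycles, which still fits in the 270-cycle slot, so the throughput of the pipeline is unchanged in this model.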
B. Optimization Techniques
To improve the energy efficiency by reducing the hardware complexity, sub-expressions can be shared to eliminate the redundant units that compute common intermediate results. In the parallel syndrome calculation and the parallel Chien search stages, all the constant Galois-field operations can be expressed as a single matrix, which enlarges the region searched for common sub-expressions as much as possible [7], [8]. Hence, the number of common terms is maximized, leading to an energy-efficient decoder architecture.
The KES uses fewer cycles than the other stages. Hence, it can be folded to share the operators at the cost of more cycles. The overlapped folding scheme in [6] provides an optimal folding factor that balances the pipelined decoder and thus preserves the overall decoding throughput.
As the code length increases, the complexity of a BCH decoder increases significantly, as depicted in [6]. Hence, the block-wise concatenated-BCH decoder is a promising hard-decision ECC solution, as it can reduce the decoder complexity while providing error-correcting performance similar to that of a single long BCH decoder [9]. The concatenated-BCH code uses short component BCH codes, which can be realized with minimal hardware cost. Moreover, to save energy, the number of on-chip page buffer accesses is minimized by using the syndrome pre-calculation [9].

III. IMPLEMENTATION RESULTS AND FUTURE WORKS
Fig. 3 shows the reported energy efficiencies of recent hard-decision ECC decoders. For fair comparison, all the results are normalized to a 1.2 V, 65 nm CMOS process. The required RBER in Fig. 3 represents the maximum RBER that each decoder can correct while still providing a UBER of 10^-12 or lower. Note that the single strong BCH decoder requires a significant amount of decoding energy. By adopting the hardware sharing schemes on the single BCH decoder, the energy efficiency is enhanced by more than 35%. The concatenated-BCH decoder offers error-correcting performance similar to that of the single BCH decoder while consuming much less energy per bit, as shown in Fig. 3. Note that there is still a large performance gap between the realized hard-decision decoders and the theoretical limit [10]. Therefore, more advanced hard-decision ECCs can be developed that target stronger error-correction capabilities without the multiple sensing operations that NAND flash memories need to generate the soft values for LDPC decoders [2]–[4]. The quasi-primitive concatenated-BCH code [10] and the half-product BCH code [11] may offer more attractive coding gains, but these ECCs require more complicated decoder architectures. Therefore, advanced low-energy techniques for realizing strong hard-decision ECC decoders should be considered in the near future to benefit low-power stand-alone IoT/sensor systems based on MLC NAND flash memories.
Figure 3. The energy efficiencies of hard-decision ECC decoders. (Axes: normalized energy efficiency (pJ/b) versus required RBER for achieving 10^-12 UBER, from 10^-3 to 6x10^-3. Labels: BCH (R=0.85), BCH (R=0.9), BCH (R=0.95), CBCH (R=0.93), "Applying the optimizations", "Future goals".)
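The notion of a "required RBER" for a target UBER can be illustrated with a short calculation. This is a sketch under stated assumptions, not the evaluation method behind Fig. 3: bit errors are assumed independent, the decoder is assumed to fail exactly when more than t of the n codeword bits are wrong, and UBER is taken as the codeword failure probability divided by the number of information bits k, which is one common first-order definition. The code parameters in the usage lines (n = 8752, k = 8192, t = 40) are hypothetical.

```python
import math

def codeword_failure_prob(n, t, rber):
    """P(more than t of n independent bits are in error), summed in the log domain."""
    log_p = math.log(rber)
    log_q = math.log1p(-rber)
    total = 0.0
    for i in range(t + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total

def uber(n, k, t, rber):
    """Assumed definition: codeword failure probability per information bit."""
    return codeword_failure_prob(n, t, rber) / k

def max_rber(n, k, t, target_uber=1e-12, lo=1e-6, hi=1e-1, iters=60):
    """Geometric bisection for the largest RBER whose UBER stays at or below the target.

    Assumes uber(lo) <= target_uber < uber(hi), which holds for the example below.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if uber(n, k, t, mid) <= target_uber:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical code: 8192 information bits plus 14 * 40 parity bits, t = 40.
n, k, t = 8752, 8192, 40
print("code rate:", k / n)
print("required RBER:", max_rber(n, k, t))
```

The log-domain sum avoids overflow in the binomial coefficients, and the bisection relies on the failure probability increasing with the RBER. Raising t for a fixed k moves the required RBER to the right on an axis like that of Fig. 3, at the cost of a lower code rate.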
IV. CONCLUSION
In this paper, we have summarized optimization techniques for energy-efficient hard-decision ECC decoders. Hard-decision ECCs have definite advantages over soft-decision ECCs in terms of energy consumption. Hence, it is necessary to develop low-power but strong hard-decision ECC decoders that cover the lifetime of NAND flash memories, leading to energy-efficient storage systems.

REFERENCES
[1] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu, “Data retention in MLC NAND flash memory: characterization, optimization, and recovery,” in Proc. IEEE Int. Symp. High Performance Comput. Archit. (HPCA), pp. 551–563, Feb. 2015.
[2] G. Dong, N. Xie, and T. Zhang, “Enabling NAND flash memory use soft-decision error-correction codes at minimal read latency overhead,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 9, pp. 2412–2421, Sep. 2013.
[3] J. Kim and W. Sung, “Rate-0.96 LDPC decoding VLSI for soft-decision error correction of NAND flash memory,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 5, pp. 1004–1015, May 2014.
[4] N. Miladinovic, “Designing error floor performance of iterative codes,” in Proc. Flash Memory Summit, Santa Clara, USA, Aug. 2013.
[5] Y. Lee, H. Yoo, I. Yoo, and I.-C. Park, “6.4Gb/s multi-threaded BCH encoder and decoder for multi-channel SSD controllers,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 426–427, Feb. 2012.
[6] Y. Lee, H. Yoo, I. Yoo, and I.-C. Park, “High-throughput and low-complexity BCH decoding architecture for solid-state drives,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 5, pp. 1183–1187, May 2014.
[7] Y. Lee, H. Yoo, and I.-C. Park, “Small-area parallel syndrome calculation for strong BCH decoding,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 1609–1612, Mar. 2012.
[8] Y. Lee, H. Yoo, and I.-C. Park, “Low-complexity parallel Chien search structure using two-dimensional optimization,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 8, pp. 522–526, Aug. 2011.
[9] Y. Lee, H. Yoo, J. Jung, J. Jo, and I.-C. Park, “A 2.74-pJ/bit, 17.7-Gb/s iterative concatenated-BCH decoder in 65-nm CMOS for NAND flash memory,” IEEE J. Solid-State Circuits, vol. 48, no. 10, pp. 2531–2540, Oct. 2013.
[10] D. Kim and J. Ha, “Quasi-primitive block-wise concatenated BCH codes for NAND flash memories,” in Proc. IEEE Inf. Theory Workshop (ITW), pp. 611–615, Nov. 2014.
[11] S. Emmadi, K. Narayanan, and H. Pfister, “Half-product codes for flash memory,” in Proc. Non-Volatile Memories Workshop, 2015.