An Elastic Error Correction Code Technique for NAND Flash-based Consumer Electronic Devices

Yu-Peng Hu, Nong Xiao, Member, IEEE, and Xiao-Fan Liu

Abstract — Multi-level cell (MLC) NAND flash-based consumer electronic devices suffer from random multiple bit errors that grow exponentially with the number of program/erase cycles. Numerous error correction codes (ECCs), such as Bose–Chaudhuri–Hocquenghem (BCH) and Reed–Solomon (RS) codes, have been developed to detect and correct multiple erroneous bits within a codeword. However, most existing techniques do not take into account the uneven distribution of bit errors over flash pages, so they cannot meet the varying correction needs of flash memories during their lifetime. Specifically, weak ECCs eventually fail to correct the bit errors of particular pages once these exceed their correction capability, while powerful ECCs protect each page longer yet incur unnecessary computation overhead too early. This paper proposes an elastic error correction code (EECC) technique that progressively enhances the error correction capability of each page at program time. In particular, based on a scalable coding mapping model, the EECC technique raises the ECC level progressively by allowing each page to employ a changeable ECC parity in its own spare out-of-band area, according to its remaining lifetime and the hot level of the data in it. In this way, the technique not only meets the changing error correction demands of different pages, but also obtains a good reliability-performance tradeoff. Analytical and experimental results demonstrate that the EECC scheme is efficient in many aspects of performance and, in particular, achieves significant power savings without degrading the error correction capability.

Index Terms — NAND Flash Memory, Storage, Reliability, Error Correction Code.

This work was supported in part by the Natural Science Foundation of China (61025009, 61232003) and by the China Postdoctoral Science Foundation (2012M512071).
Yu-Peng Hu is with Hunan University, Changsha 410082, P.R. China, and also with the National University of Defense Technology, Changsha 410073, P.R. China (e-mail: [email protected]).
Nong Xiao is with the National University of Defense Technology, Changsha 410073, P.R. China (e-mail: [email protected]).
Xiao-Fan Liu is with the University of Nottingham, Nottingham NG8 1BB, UK (e-mail: [email protected]).

I. INTRODUCTION


Thanks to the decrease of per-bit cost, multi-level cell (MLC) NAND flash memories have been widely used in various consumer electronic devices, such as cell phones, media players, and tablet computers [1]. As flash geometries are scaled down to 25 nm or less, MLC NAND flash memories face more uncertainties and become more prone to random bit errors induced by several mechanisms, such as read/program disturb, data retention, quantum-level noise effects, and endurance [2], [3]. The raw bit error rate (RBER) of MLC NAND flash memory is about 10^-6, at least two orders of magnitude worse than that of single-level cell (SLC) NAND flash memory [3]. To ensure highly reliable data retrieval, various strong error correction code (ECC) technologies have been proposed [4].

As the RBER grows exponentially with the number of program/erase cycles [5], most existing t-error-correcting ECC technologies for MLC NAND flash memories strive to make t as large as possible to meet the rising correction demands over time. Program/erase and even read operations have a detrimental impact on the reliability of flash memory, as they lead to retention, endurance, and disturb issues; the more program/erase or read operations are performed, the more bit errors there may be. The growing demand for error correction over time has therefore attracted a plethora of research efforts on the construction of strong error correction codes, such as the well-known linear Bose–Chaudhuri–Hocquenghem (BCH) [6]-[10], Reed–Solomon (RS) [11]-[13], and low-density parity-check (LDPC) codes [14], and the more powerful nonlinear multi-error correcting codes [15], [16].

However, the temporal and spatial uneven distribution of bit errors over flash pages has not been well investigated so far. The phenomenon that the RBER grows with the number of program/erase cycles over time is referred to as the temporal uneven distribution of bit errors. The spatial uneven distribution of bit errors refers to the fact that flash pages with fewer bit errors need only a short ECC parity, whereas pages with more bit errors require a high ECC level (which determines the error correction capability) as well as a long parity. Conventional ECC techniques simply assign the same ECC level and the same long parity to all flash pages. As a result, they cannot satisfy the different correction needs of flash pages over time: weak ECCs eventually fail to correct some pages' bit errors beyond their correction capability while most other pages are still good, whereas powerful ECCs protect each page longer yet incur extra computation and storage overhead too early. This paper proposes an elastic ECC (EECC) technique that aims to avoid unnecessary ECC computation overhead by progressively enhancing the correction capability of each page.


Instead of assigning the same ECC level to all pages, EECC dynamically adopts a different ECC level for each page. The key point of EECC is not to construct a particularly strong ECC, but to design a coding mapping model that enables each page to adaptively employ a changeable ECC parity in its spare area, according to its program/erase count and the hot level of the data in it. The design is elastic, and the reliability-performance tradeoff is also taken into consideration.

In this paper, the characteristics of the novel EECC technique are first described, after which the analysis and experiments of the proposed technique are presented. As will be seen, EECC proves to be efficient in many aspects, i.e., error correction capability, storage, and power consumption. The highlight is that, given the same reliability, the EECC scheme can reduce power consumption significantly.

II. RELATED WORK

Recent research on ECC schemes for NAND flash focuses on the construction of coding mechanisms providing strong multiple-error correction capability. Existing ECC schemes can broadly be divided into two types: linear and nonlinear coding schemes.

The past decade has seen a considerable number of linear ECC techniques proposed for NAND flash memories, e.g., the well-known BCH, RS, and more complex LDPC codes. Linear ECC schemes normally exploit long polynomials to generate the parity data. BCH codes perform correction over single-bit symbols, while RS codes operate over multi-bit symbols. Usually, BCH codes are used where bit errors are distributed in a random or non-correlated way, whereas RS codes are adopted where bit errors are expected to occur in bursts. Studies show that MLC NAND flash memories exhibit random bit errors [3], [5]. Therefore, BCH codes are more suitable for MLC NAND flash and have taken the lead, as they offer high correction capability with fewer parity bytes.

Among available ECC schemes, the adaptive-rate ECC scheme is the most relevant to the proposed EECC technique [6]. Based on BCH codes, this scheme employs four operation modes to obtain varying error correction capabilities for the storage system. However, it lacks a specific coding-rate adjustment mechanism. Moreover, it needs extra shared storage space outside the page itself to store the additional parity bytes generated by the stronger ECC. Since there is no in-place update in flash memories, it is difficult in this scheme to locate the related parity and then perform the data write operation as well as the address mapping table management [17]. The EECC scheme, on the contrary, needs no additional storage space, but takes advantage of the inherent spare area of a page to store the changeable parity bytes.

Furthermore, Choi et al. [7] presented a low-power, high-throughput ECC architecture with an emphasis on circuit design. It employs three different ECC schemes to trade off code rate, circuit complexity, and power consumption. Liu et al. [8] put forward a 4-error-correcting coding scheme with strong BCH codes. In [9], the authors demonstrated that the use of powerful BCH codes, e.g.,


t = 12, 67, and 102, can effectively increase the number of bits per cell and thus further enhance the capacity of MLC NAND flash memories. The authors of [10] suggested concatenating BCH codes with trellis-coded modulation. They showed that the error correction burden of a single BCH code may be reduced at the cost of five threshold states per cell.

RS codes have also been applied in MLC flash memories at the expense of larger hardware design area and coding latency. An RS code of length 828 with 820 information symbols over the Galois field GF(2^10) was constructed in [11], which is able to correct almost all error patterns of four or fewer bit errors. It needs less area overhead to implement, yet requires more parity bits than BCH codes of the same error correction capability. Yang et al. [12] introduced a product-code-based scheme providing higher error correction capability. The scheme uses RS codes along rows and Hamming codes along columns, and is thus flexible enough to migrate to a stronger ECC scheme if needed; however, no concrete migration operations are given. In [13], an asymmetric limited-magnitude ECC architecture was proposed, which can correct all asymmetric errors of multiplicity up to t. The authors of [14] argued that, with the advent of smaller bit cells in the near future, BCH and RS codes are likely to give way to more complex LDPC codes.

The nonlinear coding schemes of [15], [16] were motivated by the large number of undetectable and miscorrected errors (more than 2^k for k information bits) in linear block codes. In [15], an ECC architecture based on nonlinear single- and double-error-detecting codes was developed to resist soft errors in memories. Recently, Wang et al. [16] presented two constructions of nonlinear multi-error correction codes at the price of reasonable area and power. The advantage of these two codes is that few or even no bit errors are undetected or miscorrected within a codeword.

In summary, existing ECC schemes focus on powerful error correction capability with various coding methods. Clearly, stronger ECC schemes come at the cost of extra storage space and computational complexity. Unlike all the proposals above, the EECC technique sets its goal as enhancing the ECC level of different flash pages progressively during their expected lifetimes, to prevent unnecessary computation cost. In particular, the real value of EECC lies in its ability to co-exist, as a tuning technique, with many other ECC schemes such as the BCH [8] and RS codes [11].

III. PROBLEM DESCRIPTION

Previous ECC schemes are not scalable; in other words, they cannot adapt to the growing correction demands of different pages. As shown in Fig. 1, in addition to the data bytes, a flash page contains a few spare bytes used to store the ECC parity and other page-state information. To match the sector size traditionally used in storage applications, controllers typically correct a 512-byte data sector with 16 spare bytes. The specific number of bytes required for the ECC parity is determined jointly by the data size and the intended ECC level.



Fig. 1. The logical structure of a flash page.

However, the uneven distribution of bit errors, combined with the choice of data-sector size, leads to some interesting results, as shown in Fig. 2. To simplify the illustration, this paper describes the ECC problem with BCH codes; kB-ECCt denotes a BCH code capable of correcting up to t bit errors per k-byte data sector. Fig. 2(a) depicts two adjacent 512-byte data sectors within a page, each protected by a 13-byte ECC parity capable of correcting up to 8 bit errors. If more than 8 bit errors fall within one data sector, that sector cannot be decoded correctly. On the contrary, as shown in Fig. 2(b), if a higher-level BCH code, i.e., 1024B-ECC16, is employed, the data can be retrieved correctly at the cost of a slightly longer parity of 28 bytes.

Notably, further study shows that a higher-level BCH code with stronger correction capability can even come with a shorter ECC parity. Take two configurations of the same page as an example, i.e., 512B-ECC16 over two data sectors versus 1024B-ECC28 over one large data sector. The 512B-ECC16 code is built over GF(2^13), so the total length of its parities is 13 × 16 bits × 2 = 52 bytes, whereas the 1024B-ECC28 code is built over GF(2^14), so the length of its parity is only 14 × 28 bits = 49 bytes. Obviously, 1024B-ECC28 is more efficient, benefiting from its stronger correction capability and shorter redundant parity (this arithmetic is sketched in code after Fig. 2).

Likewise, the uneven distribution of bit errors also exists among multiple pages, i.e., some pages have more random bit errors while others have fewer. Therefore, this paper is devoted to designing an elastic control scheme that assigns an appropriate ECC level to every page, obtaining sufficient error correction capability while keeping the parity as small as possible.

Fig. 2. (a) A low ECC level over two small data sectors. (b) A high ECC level over a large data sector with a little longer parity.
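As a concrete check of this parity-size arithmetic, the following minimal Python sketch computes BCH parity sizes from the standard bound of m parity bits per correctable error over GF(2^m); the helper name is illustrative, and sizing m from the data length alone is a simplifying assumption.

def bch_parity_bytes(data_bytes: int, t: int) -> int:
    """Parity size in bytes of a t-error-correcting BCH code over a
    data sector of data_bytes bytes."""
    n_bits = data_bytes * 8
    # Smallest m with 2^m - 1 >= codeword length; the parity is short,
    # so sizing m from the data length alone is a close approximation.
    m = n_bits.bit_length()        # e.g., 4096 bits -> m = 13
    return (m * t + 7) // 8        # m*t parity bits, rounded up to bytes

print(2 * bch_parity_bytes(512, 16))  # 512B-ECC16, two sectors: 52 bytes
print(bch_parity_bytes(1024, 28))     # 1024B-ECC28, one sector: 49 bytes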

IV. ELASTIC ERROR CORRECTION CODE SCHEME

A. Scheme Overview

In Fig. 3, the overview of the EECC scheme is illustrated with a concrete example. The first few bytes of the spare area hold page-state information, including two special fields named sectSize and numErr, which correspond to the k and t of Section III and indicate the number of bytes in the correction sector and the maximal number of correctable error bits, respectively. After reading these leading bytes, the controller can determine the ECC level from the sectSize and numErr fields and then perform the proper decoding procedure; the remaining space in the spare area is reserved for parity bytes. As shown in Fig. 3, three different ECC levels are employed by the three flash pages, respectively. Compared with the short parity used in page 1 to protect a small sector, pages 2 and 3 employ longer parities to protect larger sectors. It is beneficial for a flash page to employ a higher ECC level, for the reason discussed in Section III. Note that even higher ECC levels could be employed as flash page sizes increase. A sketch of the corresponding read path is given after Fig. 3.

Fig. 3. Pictorial overview of the proposed EECC scheme.
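The following sketch shows how a controller could recover the ECC parameters from the spare area and dispatch decoding accordingly. The byte layout (two little-endian 16-bit fields ahead of the parity), the page and spare sizes, and bch_decode are all illustrative assumptions; the paper does not fix a concrete layout.

import struct

PAGE_DATA = 2048  # assumed main-area size in bytes
SPARE = 64        # assumed spare-area size in bytes

def bch_decode(sector: bytes, parity: bytes, t: int) -> bytes:
    """Hypothetical stand-in for the configurable BCH decoder, which
    would correct up to t bit errors in the sector."""
    return sector

def read_page(raw: bytes) -> bytes:
    """Recover sectSize/numErr from the spare area, then decode each
    correction sector at the indicated ECC level."""
    data = raw[:PAGE_DATA]
    spare = raw[PAGE_DATA:PAGE_DATA + SPARE]
    sect_size, num_err = struct.unpack_from("<HH", spare, 0)
    parity = spare[4:]  # the remaining spare bytes hold the parity
    decoded = []
    for off in range(0, PAGE_DATA, sect_size):
        decoded.append(bch_decode(data[off:off + sect_size], parity, num_err))
    return b"".join(decoded)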

In EECC, the two parameters related to the ECC level, i.e., sectSize and numErr, are determined by the two factors below:

(1) The program/erase count of a page. Due to the limited write endurance, flash pages undergoing a high frequency of program/erase operations are prone to more random bit errors [5]. The more program/erase operations a page undergoes, the higher the ECC level it should employ.

(2) The hot level of the data. The second key factor is the hot level of the data in a given page. Logical pages undergoing frequent read and update operations are hot pages, and the data on them are hot data. Two reasons call for ensuring the reliability of hot data. First, the physical pages storing such data are more likely to suffer from disturbs or from charge getting trapped in the cell oxide due to the intensive stress operations, whereas cold pages are relatively reliable, as their physical pages benefit from the quiescent periods between successive operations [18]. Second, hot data (e.g., metadata) are apparently more important than cold data and thus demand stronger correction codes for higher reliability. Generally, the hotter the data, the higher the ECC level the page should adopt. The above two factors can be accommodated in the address mapping table in the cache.

The operations and procedures of EECC are as follows:

(1) The EECC scheme can be implemented in the flash translation layer (FTL) and the ECC controller. When updating a logical flash page, the FTL first finds its hot level and the program/erase count of the corresponding physical page by looking up the address mapping table. Note that this lookup can be performed along with the address translation.

(2) The EECC scheme calculates sectSize and numErr and generates the corresponding codeword (data plus redundant parity). The controller writes them, together with other data, into a flash page as in Fig. 3. Afterwards, the two key factors in the address mapping table are updated for the next use.

(3) Upon reading a page, the controller retrieves sectSize and numErr so as to discover the ECC level and carry out the proper decoding procedure.

In this way, the EECC technique is able to adaptively and progressively control the ECC level of each page.

B. Coding Mapping Model

The core of the EECC technique is the coding mapping model, which determines the ECC level that a page needs. The proposed model consists of the two equations (1) and (2) below, which calculate the parameters numErr and sectSize, respectively. Let the integers α and β denote the two key factors, i.e., the program/erase count a page has undergone and the hot level of the data within it. Here T is the maximal number of correctable bit errors, and the parameter A is used to ensure numErr ≤ T. Let the integers E and L denote the maximal endurable program/erase cycles of a specific device and the predefined highest ECC level, respectively, and let G denote the number of bytes of a page. The ECC level is obtained by mapping the two key factors α and β to numErr and sectSize. Note that, in this model, the parameter B equals the actual ECC level, and sectSize is usually an integral multiple of 512 bytes.

numErr = f(α, β) = ⌊ TA ( 1/(1 + e^((E − α)/β)) − 1/(1 + e^((E − α + 1)/β)) ) ⌋    (1)

sectSize = ⌊ G/(2^(L−B) × 512) ⌋ × 512,  B = ⌈ (numErr × L)/T ⌉    (2)
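The following minimal Python sketch implements equations (1) and (2) directly, using the parameters of the Fig. 4 example below (T = 30, A = 4, E = 10, L = 3) together with an assumed page size of G = 2048 bytes; the function names are illustrative, not taken from the paper.

import math

T, A, E, L, G = 30, 4, 10, 3, 2048  # Fig. 4 parameters; G = 2 kB is assumed

def num_err(alpha: int, beta: int) -> int:
    """Equation (1): correctable-bit budget from the program/erase count
    alpha (in units of 10^3 cycles) and the hot level beta."""
    s1 = 1.0 / (1.0 + math.exp((E - alpha) / beta))
    s2 = 1.0 / (1.0 + math.exp((E - alpha + 1) / beta))
    return math.floor(T * A * (s1 - s2))

def sect_size(ne: int) -> int:
    """Equation (2): map numErr to an ECC level B, then to a
    512-byte-aligned correction-sector size."""
    B = max(1, math.ceil(ne * L / T))  # clamp to at least level 1
    return G // (2 ** (L - B) * 512) * 512

# The example from the text: numErr = 25, 15, 5 maps to ECC levels 3, 2, 1
# and sector sizes 2048, 1024, and 512 bytes, respectively.
for ne in (25, 15, 5):
    print(ne, sect_size(ne))

With these constants, num_err grows roughly exponentially while α < E, and a larger β yields a higher numErr at low program/erase counts but a lower one at high counts, matching the behavior described for Fig. 4.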

Fig. 4 below illustrates the mapping model with an example of the function f(α, β) in equation (1). In this example, let T = 30 and A = 4, and set E to 10, denoting the 10k program/erase cycles that a typical MLC flash memory can endure in its lifetime [3]. Assume there are four hot levels of data, i.e., β = 1, 2, 3, 4, and that the predefined highest ECC level L is three. As shown in Fig. 4, if not rounded down, the number of correctable error bits increases exponentially with the program/erase count α, which is consistent with the exponential probability distribution of raw bit errors in MLC NAND flash memories [5]. In particular, as β increases, numErr becomes higher at low program/erase counts yet turns lower beyond a particular program/erase count. As a result, β provides a good tradeoff between error correction capability and performance; e.g., with the help of β, popular/hot data can be protected by a short parity and retrieved quickly in modern storage environments where reads dominate writes.

For simplicity, in equation (2), T is divided into L equal parts so that each ECC level covers an equal span of numErr. For instance, consider the sectSize in Fig. 3: let G = 2 kB and numErr = 25, 15, 5; then the corresponding ECC levels are 3, 2, and 1, and the sectSize values are 2k, 1k, and 512 bytes, respectively. In practice, the span of each ECC level could differ; e.g., a higher ECC level could be given a wider range of numErr to obtain strong correction capability earlier. It is worth emphasizing that equation (2) is recommended but not required for determining the size of the correction sector; in fact, sectSize is subject to the page size specified by the storage device. Additionally, the reserved spare area should be large enough for the parities of the different ECC levels.

Fig. 4. An example of the coding mapping model: the number of correctable error bits (numErr) versus the program/erase count α (10^3 cycles) for hot levels β = 1, 2, 3, 4, with the spans of ECC levels 1-3 marked.

C. Key Factors Management

In the EECC scheme, the two key factors α and β of each page must be carefully managed. Apart from the logical-to-physical address translation information, the address mapping table has to maintain two additional fields, i.e., the physical page program/erase count and the logical page hot level. It is not hard to maintain these two fields in many dominant FTL schemes, such as those in [17], which use page-level mapping at log blocks for the actual data overwriting. Note that the memory cost of these factors is negligible compared to that of the logical block/page addresses. In EECC, the program/erase count of the physical page and the hot level of the logical page are updated along with the address mapping information upon each program operation, as sketched below.
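A minimal sketch of such an extended mapping-table entry follows; the field names, the four-level hot-level scale, and the update policy are illustrative assumptions rather than the paper's design (a real FTL tracks program/erase counts per physical block).

from dataclasses import dataclass

@dataclass
class MappingEntry:
    """One address-mapping entry extended with the two EECC key factors."""
    physical_page: int   # logical-to-physical translation (existing field)
    pe_count: int = 0    # alpha: program/erase count of the physical page
    hot_level: int = 1   # beta: hot level of the logical page (1..4 here)

    def on_program(self, new_physical_page: int) -> None:
        """Bookkeeping when the logical page is rewritten: remap it, count
        the program/erase cycle, and treat frequent updates as hotter data
        (a simplified promotion policy)."""
        self.physical_page = new_physical_page
        self.pe_count += 1
        self.hot_level = min(self.hot_level + 1, 4)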


D. Hardware Design

The hardware implementation of EECC relies on configurable ECC controller techniques, which support a programmable correction-sector size and a programmable number of correctable bits. Fortunately, there is a considerable amount of research [6], [7], as well as available products, with capabilities similar to such a configurable ECC controller. Thus, the size of the correction sector and the ECC level can be changed dynamically within a certain range for each correction operation, which allows flexibility in the parity generation for each page. The main implementations in these studies comprise two modules: a configurable linear-feedback shift register (LFSR) module that generates parity when writing data and syndromes when reading data, and an ECC module that locates the errors and applies corrections to the retrieved data stream. The intended maximal number of correctable bit errors implies different circuit designs for these modules. When the maximal correction capability is fixed for a module, correction and arithmetic operations can be performed over the same Galois field for pages at varying ECC levels. The highlight is that EECC can tailor the actual length of the codeword to be processed, reducing the computation cost. A prototype of the ECC hardware is under construction using field-programmable gate array (FPGA) circuits, and the implementation details will be given in another paper. An illustrative sketch of the LFSR parity computation follows.
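To make the LFSR module concrete, the following illustrative bit-serial parity computation divides the data stream by a generator polynomial; the polynomial below (x^8 + x^2 + x + 1, with the leading term implicit) is a hypothetical placeholder, whereas a real controller would load the BCH generator polynomial selected by sectSize and numErr.

def lfsr_parity(data: bytes, poly: int, degree: int) -> int:
    """Shift the data through an LFSR of the given degree and return the
    remainder of the polynomial division, i.e., the parity word."""
    mask = (1 << degree) - 1
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):
            feedback = ((byte >> i) & 1) ^ ((reg >> (degree - 1)) & 1)
            reg = (reg << 1) & mask
            if feedback:
                reg ^= poly & mask  # apply the generator polynomial taps
    return reg

print(hex(lfsr_parity(b"flash page data", poly=0b00000111, degree=8)))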

V. EFFICIENCY ANALYSIS

This section illustrates the efficiency of the EECC scheme with BCH codes in several respects, i.e., the mean error correction capability, the probability of page error, storage efficiency, and computation cost. Suppose the EECC scheme still employs the BCH code kB-ECCt, capable of correcting up to t bit errors per k-byte sector. The parameters used in the analysis are as follows.

- N is the number of bits of a correction data sector.
- M is the number of bits of a parity, M