Silverton Consulting, Inc. StorInt™ Briefing
Introduction
While all-flash appliances are increasingly used to supply high-performing enterprise data storage, proving their quality and reliability in extremely demanding environments has been an ongoing challenge. Enterprise data centers have come to depend on the reliability, availability and serviceability (RAS) of disk storage for their mission-critical systems, and they expect the same RAS from all-flash storage. However, given the unique nature of NAND memory failure modes, the degree of reliability and the level of continuous availability of all-flash data storage remain important product differentiators. Moreover, the serviceability of all-flash storage varies considerably across vendors and is yet another characteristic that merits consideration for enterprise data center applications.
IBM® has long been known for the superior RAS characteristics of its IT equipment. The new IBM FlashSystem™ family of all-flash storage arrays follows in this long-standing tradition of excellence.
Figure 1 IBM FlashSystem 840
Some definitions may help our discussion. In data center environments, reliability is normally defined as the mean time between failures (MTBF) of system components. System or data availability is usually described as the percentage of time a system provides uninterrupted access to data or services. Equipment serviceability is generally interpreted as the ease with which a system can be fixed and is measured by the mean time to repair (MTTR) a system’s failing components.
For example, one storage system could take advantage of better-quality components to achieve higher reliability, whereas another could endure multiple failures and yet continue to supply data access using fault-tolerant functionality. Keeping data online and continuously available is nearly as important as preventing failures in the first place. The need for equipment serviceability is often overlooked, yet today’s hardware still fails and must be replaced in a timely manner to ensure continued operations. Within an IT storage system, enterprise RAS functionality minimizes failures, continues to provide the correct data for customer applications when components do fail, and ensures those failed components can be rapidly repaired.
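The relationship between these three definitions can be made concrete with a small calculation. The sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair, for a single non-redundant component. The function names and every figure are hypothetical and chosen only for illustration; none come from an IBM specification.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=100_000, mttr_hours=8)
better_parts = availability(mtbf_hours=200_000, mttr_hours=8)   # higher reliability
faster_repair = availability(mtbf_hours=100_000, mttr_hours=4)  # better serviceability

for name, a in [("baseline", baseline),
                ("better parts", better_parts),
                ("faster repair", faster_repair)]:
    print(f"{name}: {a:.6%} available, "
          f"{annual_downtime_minutes(a):.1f} min/yr of downtime")
```

Under this approximation, halving the repair time yields the same availability as doubling the time between failures, which is one way to see why serviceability matters alongside reliability.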
IBM FlashSystem NAND Flash Characteristics
The underlying nature of the physical storage technology is key to system reliability. Three types of flash are used in flash storage arrays: SLC (single-level cell), MLC (multi-level cell) and eMLC (enterprise-class multi-level cell) NAND memory. SLC is very reliable and has high endurance, meaning it can be programmed/erased (overwritten) many times, but it is also very expensive on a $/GB basis and has the lowest chip data density (quantity of data that can be stored per NAND chip). MLC NAND has the lowest reliability and endurance but costs less and has twice the chip data density of SLC. Finally, eMLC has higher reliability and endurance than MLC but is slightly more expensive ($/GB).
IBM FlashSystem 840 RAS for Better Performance & Data Protection
While most IBM FlashSystem competitors have switched to standard or commodity MLC flash to reduce costs, IBM FlashSystem 840 uses higher-grade eMLC flash to achieve higher reliability at the chip level. eMLC flash provides 5-10x the write endurance of commodity MLC flash, with only a modest increase in price. Further, most enterprise data centers depend on storage systems to provide a component lifespan of at least five years of unfailing service. With the reduced write endurance of MLC flash today, it would be difficult for system designers to deliver an acceptable lifespan under enterprise I/O activity. IBM uses eMLC flash in the FlashSystem 840 mainly for its better endurance and longer lifespan.
In the latest FlashSystem 840, IBM has switched from 32nm flash to 24nm eMLC flash. The “24nm” (24 nanometers) refers to the feature size inside each flash chip, which strongly influences the data density per chip and system. As NAND memory feature size gets smaller, relative endurance levels also decline. This problem is especially acute with NAND geometries at the 1x nm technology node level and below. Endurance ratings are normally specified by NAND manufacturers, and NAND memory chips from different vendors exhibit widely differing endurance levels, even within the same technology generation. Endurance levels vary particularly among consumer-grade commodity MLC flash chips. For instance, at the 2x nm technology generation level, MLC endurance levels can range from 1/30th to 1/5th the program/erase (P/E) cycles of eMLC depending on the manufacturer. Such high variability in NAND memory endurance makes designing reliable all-flash storage systems significantly harder, which is yet another reason for IBM to use eMLC flash for enterprise storage.
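A rough lifespan estimate shows why write endurance drives the choice of flash type. The sketch below assumes ideal wear leveling, a fixed write amplification factor and a constant daily write load. The P/E cycle ratings, capacity and workload are hypothetical placeholders, not vendor data; the eMLC rating is simply set to 10x the MLC rating, the upper end of the 5-10x range cited above.

```python
def lifetime_years(raw_capacity_tb, pe_cycles, host_writes_tb_per_day,
                   write_amplification=1.0):
    """Estimate years until every cell has used its rated P/E cycles.

    Assumes writes are spread perfectly evenly across all flash.
    """
    total_writable_tb = raw_capacity_tb * pe_cycles
    nand_writes_tb_per_day = host_writes_tb_per_day * write_amplification
    return total_writable_tb / nand_writes_tb_per_day / 365.0


# Hypothetical inputs, for illustration only.
MLC_PE_CYCLES = 3_000
EMLC_PE_CYCLES = 10 * MLC_PE_CYCLES

estimates = {}
for name, cycles in [("MLC", MLC_PE_CYCLES), ("eMLC", EMLC_PE_CYCLES)]:
    estimates[name] = lifetime_years(raw_capacity_tb=10, pe_cycles=cycles,
                                     host_writes_tb_per_day=20,
                                     write_amplification=2.0)
    print(f"{name}: about {estimates[name]:.1f} years")
```

With these made-up inputs the MLC estimate falls short of a five-year service life while the eMLC estimate clears it comfortably; real results depend heavily on the actual workload and on the manufacturer's endurance rating.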
IBM FlashSystem Reliability
IBM FlashSystem 840 and earlier FlashSystem generations provide reliability technology well beyond high-quality flash chips to reduce common flash failure modes. For instance:
• IBM FlashSystem uses wear leveling to distribute writes across all flash chips within a system and eliminate premature chip wear-out due to high write activity at any one data location. Given NAND memory’s endurance limitations, any flash storage solution must spread write activity or P/E cycles across as many NAND locations as possible.
• IBM FlashSystem incorporates additional reserved flash capacity beyond user-accessible data space. IBM uses “overprovisioned space” to increase overall system reliability by adding extra NAND storage space and spreading write activity out across even more flash memory. Further, NAND memory cannot be overwritten in place; data can be written only to “erased blocks.” Thus, overprovisioning also helps IBM FlashSystem improve sustained write performance by supplying more erased blocks of NAND memory for heavy write activity. Most flash storage systems use some level of overprovisioning in their flash modules.
• IBM FlashSystem uses a strong ECC (error correcting code) algorithm to protect data as it is read from flash memory. For each new generation of NAND technology, manufacturers require a minimum level of ECC algorithm to meet their flash reliability specifications. IBM implements a more powerful ECC algorithm than that required by its NAND suppliers to provide even more flash reliability.
• IBM FlashSystem appends data path checksums to data being transferred internally around the system so that any transmission errors can be quickly identified and corrected. Cosmic radiation, alpha particles and other random natural phenomena can cause silent errors in data transfers, especially in high-speed electronic devices. Checksums used by IBM FlashSystem storage help to detect and correct these errors in real time.
TWITTER.COM/RAYLUCCHESI | RAYONSTORAGE.COM | +1-720-221-7270 | SILVERTONCONSULTING.COM © 2012 SILVERTON CONSULTING, INC. ALL RIGHTS RESERVED
• IBM FlashSystem uses a patented approach to mitigate NAND “disturb errors.” Disturb errors occur when NAND data accessed or modified in one location causes information in another location to be corrupted. Write disturb errors can corrupt data in adjacent NAND cells when data is written, and read disturb errors can corrupt data in adjacent cells when data is read. Over time, these single- and multi-bit changes can accumulate to a point that causes user data errors once the ECC algorithm’s capabilities have been exceeded. IBM FlashSystem products supply unique voltage and timing adjustments to prevent disturb errors, including special optimizations against read disturb errors, which can lead to silent data corruption. IBM’s read disturb optimization selectively moves data with a high potential for read disturbance before it can be corrupted. IBM FlashSystem prioritizes these data migrations and spaces them apart in time to avoid the sudden need for a large number of move operations.
• IBM FlashSystem implements a read sweeper algorithm to mitigate NAND memory data fade errors by periodically scanning all data stored within the system to verify its integrity and correct any issues. Flash memory data fade errors can occur when stored data has not been programmed or accessed in a long time, which can cause the data to deteriorate or become inaccessible due to NAND cell charge leakage. By periodically reading, verifying and correcting all NAND memory locations over time, IBM FlashSystem reduces NAND data fade.
• IBM FlashSystem products also incorporate redundant batteries. While some flash module storage designs have data corruption issues when power is abruptly lost, redundant batteries allow the system to shut down gracefully in the event of a system power failure. [1]
The above capabilities counteract or at least reduce the most common flash memory failures. While most all-flash storage systems rely upon some of these technologies, IBM FlashSystem storage uses all of them to maximize reliability for demanding enterprise environments.
Figure 2 IBM FlashSystem 840 flash modules
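Wear leveling and overprovisioning, the first two techniques in the list above, can be illustrated with a toy allocator. This is a minimal sketch of the general idea, not IBM's algorithm: every rewrite of a single hot logical location goes to the free block with the fewest erases, and the stale copy is erased and recycled. The class and function names are hypothetical.

```python
import heapq


class WearLeveler:
    """Toy allocator: always hand out the free block with the fewest erases."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        # Min-heap of (erase_count, block_id) for all currently erased blocks.
        self._free = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._free)

    def allocate(self):
        """Take the least-worn erased block out of the free pool."""
        _, block = heapq.heappop(self._free)
        return block

    def release(self, block):
        """Erase a stale block and return it to the free pool."""
        self.erase_counts[block] += 1
        heapq.heappush(self._free, (self.erase_counts[block], block))


def simulate(num_blocks, writes):
    """Rewrite one hot logical location `writes` times; return erase counts."""
    leveler = WearLeveler(num_blocks)
    current = leveler.allocate()
    for _ in range(writes):
        new = leveler.allocate()   # new data lands in a fresh erased block
        leveler.release(current)   # the old copy is stale: erase and recycle
        current = new
    return leveler.erase_counts


small_pool = simulate(num_blocks=8, writes=800)
large_pool = simulate(num_blocks=16, writes=800)  # same load, extra reserve
print("max erases per block, 8 blocks: ", max(small_pool))
print("max erases per block, 16 blocks:", max(large_pool))
```

Without leveling, one physical block would absorb all 800 erases. With it, the erases are spread almost evenly, and adding reserved blocks lowers the per-block count further, which is the reliability benefit the overprovisioning bullet describes.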
IBM FlashSystem Availability and Fault Tolerance
Beyond the basic NAND memory and common failure mode mitigations, there are other techniques that help systems mask flash failures from customer applications. For the IBM FlashSystem family of products, advanced error recovery technologies help to minimize the impact of flash memory failures that could diminish customer data access or impede ongoing storage operations. The combined use of higher-quality NAND memory, common failure mitigations and IBM FlashSystem flash-optimized data protection functionality yields a more reliable and available all-flash storage system for IBM’s enterprise customers.
[1] Please see http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715 for more information on these types of errors.
Two key technologies are used by IBM FlashSystem to repair native flash failures: IBM Variable Stripe RAID™ and system-level RAID. Patented Variable Stripe RAID provides data protection and error recovery within flash storage modules in FlashSystem products. System-level RAID protects and recovers customer data when entire flash modules fail. Collectively, these two functions can mitigate any NAND memory failure wherever that failure occurs. IBM calls the combination of Variable Stripe RAID and system-level RAID 5 “Two-Dimensional Flash RAID.”
IBM FlashSystem Variable Stripe RAID
Variable Stripe RAID decreases the need to replace flash modules when flash chip failures occur. When flash chips or parts of flash chips fail, Variable Stripe RAID rebuilds the inaccessible data using module-level RAID parity together with the remaining data segments in the stripe. Then, the rebuilt data is relocated to previously reserved areas within the affected flash module. Thus, IBM FlashSystem Variable Stripe RAID handles flash chip and sub-chip memory failures without decreasing storage capacity or data protection level. The fact that the data is rebuilt within the flash module itself also means that overall storage performance is not impeded and that FlashSystem can take care of multiple flash memory failures at the same time.
Specifically, Variable Stripe RAID is a 9-data plus 1-parity RAID 5 implementation (rotating parity) across NAND memory chips, using flash controllers inside IBM FlashSystem storage modules. When a flash failure (chip or sub-chip) occurs, the data is rebuilt on a previously reserved (overprovisioned) storage area, and the affected RAID stripe shrinks to become an 8-data plus 1-parity (or 7-data plus 1-parity, 6-data plus 1-parity, etc.) RAID group. Shrinking RAID group stripe size is unique in the industry and can better retain flash storage availability with little to no impact on data protection or system functionality.
Variable Stripe RAID flash chip or sub-chip data protection is superior to current industry practice, as many competitive flash systems have no RAID protection within modules. Competitors using only system-level RAID 5 across modules do not necessarily preserve flash capacity and performance as well as Variable Stripe RAID. Some of the key implementation differences between Variable Stripe RAID and traditional RAID 5 include:
• Higher resiliency. Rather than centralized RAID controllers within a storage system, IBM FlashSystem uses a distributed Variable Stripe RAID implementation within each flash controller, so it is able to correct for multiple flash failures at the same time.
• Higher storage efficiency. Due to the finer chip or sub-chip granularity of data protection, IBM FlashSystem Variable Stripe RAID uses less free space to rebuild flash failures.
• Higher rebuild performance. Because of the higher efficiency and the fact that the rebuild occurs inside the flash controller closest to the flash memory, IBM FlashSystem Variable Stripe RAID rebuilds take less time and fewer storage system resources.
• Lower setup time. Because Variable Stripe RAID requires no user configuration/optimization to activate, IBM FlashSystem Variable Stripe RAID data protection is present from power-on to flash module retirement and continuously protects customer data during all that time.
With Variable Stripe RAID, IBM FlashSystem products are built on a more fault-tolerant flash module. Other flash arrays that only use standard RAID 5 at the system level may need to rebuild at the system level and replace flash modules whenever a failure as small as 1/16th of a single flash chip occurs. With Variable Stripe RAID, however, IBM FlashSystem can completely resolve these kinds of failures within the affected module so that full storage module rebuilds and replacements are seldom needed.
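The rebuild-and-shrink behavior described in this section can be sketched with ordinary XOR parity. The code below is an illustrative model only, not IBM's patented implementation: it builds a 9-data plus 1-parity stripe, recovers the segment on a failed data chip from the survivors, and re-stripes the remaining data as 8-data plus 1-parity. Parity rotation, sub-chip failures and parity-chip failures are not modeled, and all function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_segments):
    """Return the data segments followed by one parity segment.

    Parity is kept last for simplicity; rotation is not modeled.
    """
    data_segments = list(data_segments)
    return data_segments + [xor_blocks(data_segments)]


def rebuild_segment(stripe, failed_chip):
    """Recover the segment on the failed chip by XOR-ing the survivors."""
    survivors = [seg for i, seg in enumerate(stripe) if i != failed_chip]
    return xor_blocks(survivors)


def shrink_stripe(stripe, failed_chip):
    """Handle a failed data chip: recover its segment, then re-stripe.

    Returns (new_stripe, relocated_segment). The relocated segment stands in
    for data moved to reserved space; the new stripe has one fewer data
    segment and freshly computed parity.
    """
    if failed_chip == len(stripe) - 1:
        raise ValueError("this sketch models data-chip failures only")
    relocated = rebuild_segment(stripe, failed_chip)
    remaining = [seg for i, seg in enumerate(stripe[:-1]) if i != failed_chip]
    return make_stripe(remaining), relocated


data = [bytes([n]) * 4 for n in range(1, 10)]  # 9 small data segments
stripe = make_stripe(data)                     # 9 data + 1 parity
smaller, relocated = shrink_stripe(stripe, failed_chip=3)
print(len(stripe), "->", len(smaller), "segments;",
      "lost segment recovered:", relocated == data[3])
```

Because the smaller stripe still carries valid parity, it can survive a further single failure, which is the sense in which the stripe shrinks without losing its protection level.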
IBM FlashSystem System-level RAID
IBM FlashSystem offers system-level RAID to provide supplementary data protection for failures affecting entire flash modules. For example, in the unlikely event that a flash controller in a flash module completely fails, system-level RAID 5 functionality can automatically rebuild the inaccessible data onto a hot-spare flash module within the IBM FlashSystem. The two components of Two-Dimensional Flash RAID (Variable Stripe RAID and system-level RAID) operate independently, but together provide synergistic system fault tolerance to mend multiple flash memory failures. Further, reserved space for Variable Stripe RAID and dedicated spares for system-level RAID mean there is no reduction in usable system capacity when flash failures do occur.
The system-level RAID component in IBM FlashSystem 840 has also been enhanced over prior generations of IBM FlashSystem. Specifically, IBM FlashSystem 840 supports a much wider set of configurations and offers a lower-capacity, RAID-protected, entry-level system with more granular options. In FlashSystem 840, a system-level RAID group can be 2-data plus 1-parity, 6-data plus 1-parity, or 10-data plus 1-parity, each with a dedicated spare flash module. IBM FlashSystem 840 also comes with flash modules in capacities of either 2 TB or 4 TB. Together with the RAID groupings described above, these modules allow FlashSystem 840 to support configurations ranging from 4 TB to 40 TB of usable RAID-protected storage.
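The 4 TB to 40 TB range follows directly from the RAID groupings and module sizes just listed. The short calculation below enumerates the combinations, assuming usable capacity equals the number of data modules times the nominal module capacity, with one parity module and one spare module added to each group. Formatting overhead is ignored, and the function name is hypothetical.

```python
DATA_MODULES_PER_GROUP = (2, 6, 10)  # 2+1, 6+1 and 10+1 RAID groups
MODULE_SIZES_TB = (2, 4)


def configuration(data_modules, module_tb):
    """Describe one configuration: data modules + 1 parity + 1 spare."""
    installed = data_modules + 2
    return {
        "layout": f"{data_modules}+1 plus spare",
        "module_tb": module_tb,
        "modules_installed": installed,
        "raw_tb": installed * module_tb,
        "usable_tb": data_modules * module_tb,
    }


configs = [configuration(d, size)
           for d in DATA_MODULES_PER_GROUP
           for size in MODULE_SIZES_TB]

for cfg in configs:
    print(f"{cfg['layout']:>16} with {cfg['module_tb']} TB modules: "
          f"{cfg['usable_tb']:>2} TB usable of {cfg['raw_tb']} TB installed")
```

The smallest case (2-data plus 1-parity with 2 TB modules) gives 4 TB usable, and the largest (10-data plus 1-parity with 4 TB modules) gives 40 TB, matching the range stated above.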
IBM FlashSystem Availability and Serviceability
IBM FlashSystem 840 introduces a whole new level of availability and serviceability to an already highly reliable and available platform design. FlashSystem 840 is a fully modular storage solution with all key, non-passive components contained within field-replaceable units (FRUs) or modules. As such, the following components are fully redundant and can be hot-swapped when needed:
Figure 3 IBM FlashSystem 840 Back
• Flash storage modules. FlashSystem 840 flash modules are accessible from the front of the unit and can be easily hot-swapped with new modules when failures occur with no impact to storage operations.
• Dual sets of interfaces, RAID controllers and management controllers. FlashSystem 840 has redundant FRUs or canisters that include all of these components. The canisters can be accessed and hot-swapped from the rear of the system, providing non-disruptive, continually available storage operations.
• Dual power supplies, batteries and fan modules. Redundant power supplies, batteries and fans can be accessed and hot-swapped whenever a failure occurs without impacting system operations.
As such, IBM FlashSystem 840 offers a highly serviceable design with components that are easily accessible and quickly replaced. Other solutions often place system components very close together or hide them under chassis lids, which makes hot-swapping these components impractical in most data center environments.
In addition to the enhanced availability and serviceability described above, FlashSystem 840 offers non-disruptive (concurrent) code load, which maintains data availability while system code is upgraded or changed. Other systems that claim 99.999% or greater system availability often hide the fact that they do not support non-disruptive code load. Solutions with hot-swappable components that lack non-disruptive code load capabilities periodically provide lower availability than advertised.
IBM FlashSystem 840 incorporates additional features like call home, detailed logging and remote power on, which improve system serviceability and increase system uptime. Call home facilities enable IBM FlashSystem 840 to tell service organizations that something is amiss, sometimes even before the customer is aware of a problem. With FlashSystem 840 call home functionality, service representatives can be dispatched faster to fix problems before those problems impact more critical functionality. Enterprise-class storage systems use detailed logging to help identify the steps that lead to a failure, helping to diagnose and fix problems sooner. IBM FlashSystem remote power on helps restore powered-down systems to operational status without having to dispatch customer or service personnel to remote, lights-out data centers.
IBM has a world-class service organization that is second to none, which makes responding to hardware failures in a short timeframe that much easier to do. Indeed, a company may have extensive storage RAS capabilities, but it can’t provide service personnel or replace parts for failing components unless it has a service presence near the customer’s location.
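The effect of disruptive code loads on an advertised availability figure can be estimated with simple arithmetic. The sketch below converts an availability percentage into an annual downtime budget and then adds planned upgrade outages; the number and duration of upgrades are hypothetical values chosen for illustration, and the function names are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct):
    """Minutes of downtime per year allowed at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def effective_availability_pct(unplanned_minutes, upgrades_per_year,
                               minutes_per_upgrade):
    """Availability once planned upgrade outages are counted as downtime."""
    down = unplanned_minutes + upgrades_per_year * minutes_per_upgrade
    return 100 * (1 - down / MINUTES_PER_YEAR)


budget = downtime_budget_minutes(99.999)
# Hypothetical: two disruptive code loads a year, 30 minutes each.
with_outages = effective_availability_pct(budget, upgrades_per_year=2,
                                          minutes_per_upgrade=30)
concurrent = effective_availability_pct(budget, upgrades_per_year=2,
                                        minutes_per_upgrade=0)

print(f"99.999% allows about {budget:.2f} minutes of downtime per year")
print(f"with disruptive code loads: {with_outages:.4f}%")
print(f"with concurrent code loads: {concurrent:.4f}%")
```

Since the whole annual budget at 99.999% is only a few minutes, even one short disruptive upgrade consumes it many times over.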
IBM FlashSystem and Application Business Continuity
Even with all the IBM FlashSystem 840 RAS capabilities listed above, systems or data centers sometimes fail for reasons that have nothing to do with storage. In these scenarios, customers often use data mirroring together with cluster failover or disaster recovery functionality to provide application or business continuity.
For business continuity purposes, IBM FlashSystem Enterprise Performance Solution offers mirroring and copy services that can be supplied by IBM SAN Volume Controller (SVC) software, host-based software services or hardware replication appliances. The FlashSystem Enterprise Performance Solution can provide hardware data mirroring for IBM FlashSystem through its advanced replication capabilities. For example, IBM Metro Mirror and Global Mirror data replication functionality operates between IBM SVC systems that can virtualize IBM FlashSystem and replicate data synchronously or asynchronously between systems. With IBM Metro Mirror and Global Mirror functionality, cluster failover and disaster recovery services can depend on IBM FlashSystem data to be current, available and ready for failover in the event of system outages.
Alternatively, many software facilities provide data mirroring for cluster failover or disaster recovery. Probably the most prevalent product in enterprise data center use today is Oracle Real Application Clusters (RAC). Oracle RAC uses a clustered database based on shared cache and software data mirroring to supply business continuity for Oracle database applications. Different clustering solutions available from Microsoft, Symantec, VMware and others provide similar system failover capabilities that use software data mirroring. Some of these cluster failover solutions also take advantage of hardware data mirroring.
Customers can supply hardware data mirroring with purpose-built replication appliances that tap into a storage network or sit in front of IBM FlashSystem to intercept data writes and replicate them to other remote storage systems.
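The practical difference between synchronous and asynchronous replication, mentioned above, is when the remote copy is updated relative to acknowledging the write. The toy model below is a generic illustration under that assumption and says nothing about how IBM Metro Mirror, Global Mirror or any other product is implemented; the class and method names are hypothetical.

```python
from collections import deque


class MirroredVolume:
    """Toy volume with a primary copy and a remote secondary copy."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.secondary = {}
        self._pending = deque()  # writes not yet applied remotely

    def write(self, block, data):
        """Apply a write locally, then replicate now or queue it for later."""
        self.primary[block] = data
        if self.synchronous:
            self.secondary[block] = data         # remote copy updated before ack
        else:
            self._pending.append((block, data))  # ack first, ship later

    def drain(self):
        """Apply all queued writes to the secondary, in original order."""
        while self._pending:
            block, data = self._pending.popleft()
            self.secondary[block] = data

    def unreplicated_writes(self):
        """Number of acknowledged writes the secondary has not seen yet."""
        return len(self._pending)


sync_vol = MirroredVolume(synchronous=True)
async_vol = MirroredVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    vol.write(0, b"ledger-v1")
    vol.write(0, b"ledger-v2")

print("sync lag: ", sync_vol.unreplicated_writes())
print("async lag:", async_vol.unreplicated_writes())
```

In this model a failover from the synchronous volume always finds current data on the secondary, while the asynchronous volume may be behind by whatever is still queued, a trade-off usually accepted in exchange for lower write latency over long distances.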
Summary
IBM has always taken a holistic and systematic approach to RAS capabilities. IBM FlashSystem uses higher-quality flash memory chips and special-purpose mitigations to minimize the likelihood of inherent flash failures. Other advanced data protection functionality is then layered on to enable repair and recovery from flash failures. With the IBM FlashSystem 840, IBM has redesigned its all-flash storage system to incorporate additional availability and serviceability characteristics to provide even more RAS than previous-generation IBM FlashSystems. For instance, the enhanced serviceability in IBM FlashSystem 840 can help to fix failing components before they have a chance to cause other damage that could potentially lead to system outages.
Taken together, all of these features and capabilities make IBM FlashSystem 840 a highly reliable, available and serviceable enterprise-class storage system. Combining IBM FlashSystem 840’s enhanced RAS characteristics with appropriate clustering, disaster recovery and data mirroring functionality enables data centers to easily meet or exceed stringent business continuity requirements for mission-critical applications. In addition, IBM FlashSystem 840 provides an enterprise-class RAS solution for flash storage systems that exceeds the capabilities of most other all-flash storage appliances.
Silverton Consulting, Inc., is a U.S.-based Storage, Strategy & Systems consulting firm offering products and services to the data storage community.
Disclaimer: This document was developed with International Business Machines Corporation (IBM) funding. Although the document may utilize publicly available material from various sources, including IBM, it does not necessarily reflect the positions of such sources on the issues addressed in this document.