Regular Paper

Improving System Reliability by Failure-Mode Avoidance Including Four Concept Design Strategies

Don Clausing1,* and Daniel D. Frey2

1 Massachusetts Institute of Technology (retired)

2 Massachusetts Institute of Technology, 77 Massachusetts Avenue, Room 3-449D, Cambridge, MA 02139

IMPROVING SYSTEM RELIABILITY BY FAILURE-MODE AVOIDANCE

Received 25 March 2005; Accepted 6 June 2005, after one or more revisions Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/sys.20034

ABSTRACT

To be reliable, a system must be robust: it must avoid failure modes even in the presence of a broad range of conditions including harsh environments, changing operational demands, and internal deterioration. This paper discusses and codifies techniques for robust system design that operate by expanding the range of conditions under which the system functions. A distinction is introduced between one-sided and two-sided failure modes, and four strategies are presented for creating larger windows between sets of one-sided failure modes. Each strategy is illustrated through two examples from industrial practice. For each strategy, one example is from paper handling and another is from jet engines. By showing that every strategy has been successfully applied to each system, we seek to illustrate that the strategies are widely applicable and highly effective.

© 2005 Wiley Periodicals, Inc. Syst Eng 8: 245–261, 2005

Key words: reliability; robust design; operating window; system architecture

1. MOTIVATION: RELIABILITY AND SYSTEMS ENGINEERING

Reliability is among the most important topics in systems engineering. Reliability is the proper functioning of the system under the full range of conditions experienced in the field. Reliability requires two critical conditions:

*

Author to whom all correspondence should be addressed: 245 Bishops Forest Drive, Waltham, MA 02452 (e-mail: [email protected]). Systems Engineering, Vol. 8, No. 3, 2005 © 2005 Wiley Periodicals, Inc.


CLAUSING AND FREY

1. Mistake avoidance
2. Robustness

By “mistake” we refer to the plethora of design decisions and manufacturing operations that may be grossly in error. Examples of mistakes are installing a switch backwards, or interpreting a software command as being expressed in inches when it represents centimeters. Reliability can be improved by reducing the incidence of such mistakes through a combination of knowledge-based engineering and the problem-solving process.

By “robustness” we refer to the ability of a system to function (i.e., to avoid failure) under the full range of conditions that may be experienced in the field. It is one sort of challenge to develop a system that functions for a demonstration under tightly controlled conditions such as in a laboratory. It is an entirely different challenge to make a system that functions reliably throughout its lifecycle as it experiences a broad set of real world environmental and operating conditions. Effective systems engineering must meet the second challenge, not just the first.

In its traditional formulation, reliability is stated as the probability of operating without failure under specified operating conditions. A typical textbook that addresses reliability will present a set of probabilistic concepts such as a survival function, failure rates, and mean times between failures. These concepts are then related to a model of the causes of failure such as component reliabilities or material and environmental variability. To make the model quantitative, specified operating conditions are stipulated as an agreed upon range of allowable conditions or an estimated probability density function for uncertain or variable parameters. This approach is well suited to calculating predicted failure rates once all of the data are available. This is the general approach of texts that emphasize reliability analysis (such as Ushakov [1994]) as well as texts oriented toward design for reliability (such as Rao [1992]).
This is a very sound approach, but here we present an alternative formulation of reliability that has proven very effective in the improvement of reliability early in the development of a new system. An alternative conception of reliability engineering is based on what we call “failure-mode avoidance.” Many changes in system design that improve reliability do so by moving the physical failure modes. In fact, we argue that the most significant improvements in reliability come about by this means. Although this approach can be integrated with probability theory, it is not necessary to use probability theory to understand how these design changes bring about their effects.

We claim that, especially in the early development of systems, the failure-mode avoidance approach will lead to many improvements being made with a minimum amount of data required—just enough to guide the next improvement. The failure-mode avoidance approach is deeply rooted in the physics of the system and is therefore tangible to the engineers, which facilitates the needed creative insights for concept design. This advantage is supported by recent results from cognitive psychology. Gigerenzer and Edwards [2003] conducted an experiment in which medical doctors were given data regarding tests for cancer. If the data are presented in terms of probabilities, the doctors typically perform very poorly (Fig. 1). However, given the same basic scenario described in frequency formats, doctors perform far better. In interpreting these results, Gigerenzer states “our perceptual system has been shaped by the environment in which our ancestors evolved, which is often referred to as the ‘environment of evolutionary adaptiveness’ or EEA … I propose that human reasoning algorithms are … designed for information that comes in a format that was present in the EEA” [Gigerenzer, 1998, p. 10]. Gigerenzer goes on to say “I believe we can be as certain as we ever can be: Probabilities and percentages were not the way organisms encountered information” and “I propose the original format was event frequencies, acquired by natural sampling,” p. 12. Gigerenzer then makes a more general claim: “Information needs representation. If a representation is recurrent and stable during human evolution, one can expect that mental algorithms are designed to operate on this representation,” p. 29. We seek to apply Gigerenzer’s insight to reliability engineering. What kind of information about reliability was recurrent and stable during human evolution? We propose that our ancestors observed failures in the systems they were crafting (spears, fields of crops, pottery, etc.) 
and could directly perceive the conditions that led to failures. Because of this, humans find it natural to reason about failure modes and their physical causes. Humans also have very natural visual thinking abilities and may find it natural to reason about failure-mode boundaries, that is, regions of a map of the parameter space that lead to failure. “Failure-mode avoidance” is a design activity in which these failure-mode boundaries are changed to create a large region in which the system can function. A further advantage of the “failure-mode avoidance” approach is that it reduces the salience of so-called “specified operating conditions.” Such a set of specified operating conditions is an approximation that helps to guide concept selection. At an early stage of system development, one cannot reasonably define a complete


Figure 1. Two ways of thinking about uncertainty—probabilistic and naturalistic. Medical doctors were asked questions in two different formats. Their answers are graphed here as dots and the correct answer is annotated. These results suggest that the more naturalistic formulation leads to far more accurate judgments by professional practitioners [adapted from Gigerenzer and Edwards, 2003].

set of conditions that a system is likely to experience in its lifecycle. Although an approximate set of conditions can be defined, it will surely miss some important combinations of conditions. Later on, these unanticipated operating conditions may arise and the system may cease to function. When this happens, it is tempting to say that, since the condition was not specified, the system did not actually fail but was merely misused. It is essential for systems engineers to recognize that nature does not care what systems engineers think the “specified operating conditions” are. When the system fails to function under the conditions it actually experiences, that constitutes a failure. This point is well understood by some reliability engineers. For example, Thomas, Ayers, and Pecht [2002] discuss “trouble not identified” warranty returns in the auto industry and conclude: “[I]t must not be assumed that a returned module that passes tests associated with an engineering specification is good,” p. 650. Because of uncertainty regarding specified operating conditions, we argue that an effective approach is to increase the set of conditions under which the system operates and do

this as quickly and economically as one can manage within the time available. This implies that systems engineers should not spend much energy on predicting field reliability but instead use that same energy to increase field reliability [Clausing, 1994]. It seems that the creative design work that leads to reliability improvement is a very natural activity and is consistent with our “failure-mode avoidance” conception of reliability.

We propose that thinking of reliability as failure-mode avoidance can have real advantages, especially in the early stages of system design or in a long-term scenario such as technology development. In early stages of system design, probability theory may be too quantitative for the task at hand. Probability density functions imply a level of precision in modeling the scenario that is often unwarranted, especially during early development. As a project advances through its development stages, the probabilistic view of reliability becomes increasingly useful. Analysis of reliability using probability theory is useful for component selection, system validation, and the management of


field-service operations. The value of the failure mode avoidance conception of reliability is greatest for technology strategy, systems architecting, concept design, and for some robust parameter design activities, all done early during the development of the system.

2. REVIEW OF RELATED WORK

This paper is intended to help engineers with the early-stage, conceptual phase of design. Therefore, an important related development is the Theory of Inventive Problem Solving (sometimes described by the acronyms TRIZ or TIPS). The theory was first described by Altschuller [1984] and was recently placed in a broader context of innovation by Clausing and Fey [2004]. The theory is based on a study of thousands of patents that revealed patterns among inventive solutions. An important underlying hypothesis is that inventive problems can be viewed as conflicts which the inventive solutions resolve. This enabled large numbers of patents to be organized in a useful taxonomy. It has also given rise to commercial software products that facilitate the use of the theory by professional practitioners. However, we note that many patents claim robustness as their primary advantage: they do not deliver new functions, but deliver existing functions over a broader range of conditions. While TRIZ is helpful in development of new functions and elimination of harmful side effects, it does not seem to support reliability innovations to the extent we desire. Therefore, this paper analyzes patents and seeks new patterns of inventive engineering work.

A development in reliability engineering closely related to this paper is the “physics-of-failure” (PoF) approach developed at the Computer Aided Life Cycle Engineering (CALCE) Electronic Products and Systems Center at the University of Maryland. The first instance in archival literature of the term “physics of failure” is Pecht et al. [1990], which emphasizes use of a physics-based model for reliability prediction and design for reliability. This approach has been extended to product development by Pecht and Desgupta [1995] and to accelerated life testing by Kimseng et al. [1999].
This paper builds upon the conception of physics-of-failure and seeks to extend this conception to the earliest, creative phases of system design.

An important development in reliability engineering is robust parameter design pioneered by Genichi Taguchi [Taguchi, 1993]. For any design concept, there is a potentially large space of control factor settings that will nominally place the function at the desired target value. In robust parameter design, the engineer explores the design space seeking changes that will make the system more robust while still keeping the performance

on target. Taguchi’s method employs orthogonal arrays to explore the design space. At the same time, outer arrays or compounded noises are used to explore the range of possible operating conditions. Signal to noise ratios are used as measures of the robustness of the system and guide the engineer to preferable levels of the control factors. Taguchi’s philosophy of robust design is consistent with the approach to reliability engineering discussed here. Taguchi rejected the “goal post” mentality inherent in tolerance limits and specifications. His notion of a quality-loss function replaced consideration of defect rates and process yields with an emphasis on reducing variance followed by adjustment to target. Taguchi encouraged engineers to deliberately expose designs to harsh conditions in experiments. To do this requires a transformation in the culture of an engineering organization. The emphasis must shift from demonstrating adequate performance with high statistical confidence to aggressive improvement followed by adequate confirmation. Robust parameter design is among the most important developments in systems engineering in the 20th century. These methods seem to have accounted for a significant part of the quality differential that made Japanese manufacturing so dominant during the 1970s. The methods were subsequently adopted outside of Japan. The timing of that adoption in the West corresponded closely with improvement in quality that improved competitiveness of North American and European manufacturers. Robust design methods were surely a significant part of both the rise of Japanese industry and the response to that competitive challenge. Robust design methods have continued to be refined and are still an active area of systems engineering innovation. Another approach relevant to this paper known as “operating window methods” was developed and practiced at Xerox Corporation in the 1970s. 
The operating window is the set of conditions under which the system operates without failure. In operating window methods, reliability is improved by making the operating window larger. Clausing [2004] described the approach in detail in a recent issue of Technometrics, but the essence of the approach is simple enough to present here:

1. Increase the value of the noise factors so that the failure rate is high.
2. Change the value of the control factors to seek a broader operating window at a fixed failure rate.

This approach was used, for example, to improve the reliability of paper handling machines. At Xerox, paper stacks were designed and constructed to deliberately


produce a large magnitude of variation. The papers varied in their weight, surface condition, geometry, and so on. These paper stacks were similar to the worst stacks one would encounter in field use, and, in conjunction with operation near the limit of the operating window, they brought about higher failure rates than would normally be encountered, on the order of 1 in 10 rather than 1 in 10,000. These high failure rates enabled the engineers to discern more quickly how the failure rate changed with the control factors, such as stack forces and feed belt angles. This approach worried managers, since they observed the machines jamming with high frequency, but they eventually came to understand why it was needed. As a consequence, the engineers were able to converge quickly on more reliable machine configurations.

Despite the use of failure rates as a measure of performance, the operating window method is, upon closer examination, consistent with Taguchi’s quality philosophy. Because failure rates were greatly increased by applying aggressive noises, improvements could be made rapidly, even though they sacrificed the ability to accurately predict field reliability. The term “operating window” may seem to imply an emphasis on goal posts, but in fact the “customer-specified” limits are viewed as irrelevant and the expansion of actual physical limits is valued instead.

Operating window methods continue to be an active area of research in quality engineering. Joseph and Wu [2004] showed that under certain conditions a failure rate of 50% maximizes the information gained from robust design using an operating window. As an example, they carried out a case study in which line width in a lithography process was set at a much finer pitch than actually needed in practice. The control factor settings that improved the robustness at the finer pitch also improved the robustness at the pitch needed in operation.
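The benefit of testing at elevated failure rates can be illustrated with a small calculation. The sketch below is not taken from the paper or from Joseph and Wu; it assumes each feed is an independent Bernoulli trial and asks how many feeds are needed before the standard error of the observed failure rate falls to a chosen fraction of the true rate. The function name and the 25% target are illustrative assumptions.

```python
import math


def trials_for_relative_error(failure_rate: float, rel_error: float) -> int:
    """Trials needed so that the standard error of the observed failure
    rate is at most rel_error times the true rate (binomial model)."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if rel_error <= 0.0:
        raise ValueError("rel_error must be positive")
    # SE(p_hat) = sqrt(p * (1 - p) / n); require SE <= rel_error * p, solve for n.
    n = (1.0 - failure_rate) / (failure_rate * rel_error ** 2)
    return math.ceil(n)


# Assumed rates from the text: about 1 in 10 under stress, 1 in 10,000 nominally.
stressed = trials_for_relative_error(0.1, 0.25)
nominal = trials_for_relative_error(0.0001, 0.25)
print(f"feeds needed at 1 in 10:     {stressed}")
print(f"feeds needed at 1 in 10,000: {nominal}")
```

Under these assumptions, the stressed test needs on the order of a hundred feeds while the nominal test needs on the order of a hundred thousand, which is consistent with the observation that aggressive noises let engineers see the effect of a control factor change quickly.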
The basic concept of operating windows was therefore further corroborated. While retaining the benefits of Taguchi’s quality philosophy, operating window methods may have a further advantage. In operating window methods, the progress in reliability is measured in physical terms by the size of the operating window. This may be preferable to measuring results with a more abstract measure such as signal to noise ratios. For example, operating window methods encouraged engineers at Xerox to devise ways to double the range of paper weights the machine could feed rather than contemplate how to increase signal to noise ratios by 6 decibels. As previously discussed, cognitive psychology suggests there is an advantage in maintaining a connection to physical quantities rather than probabilistic measures. We propose that a mental connection to the physics and logic


of the system is even more critical for early stage system design than it is for later stage parameter design. As discussed in this section, the basic concept of operating windows is to seek a larger set of conditions under which the system functions. While the idea is very simple, implementation is challenging, requiring deep knowledge of the system and the creativity to develop the needed design innovations. This paper seeks to help engineers implement early stage robustness work via operating window methods. The next section covers some theoretical developments. The subsequent sections present specific strategies for implementation.

3. OPERATING WINDOWS AND FAILURE MODES FORMALIZED

This section develops a formal treatment of operating windows and failure modes. The details developed here are not regarded by the authors as necessary for implementing the four strategies presented in this paper. The formal framework may, however, justify the approach and will be helpful to those who seek a deeper understanding of the strategies. Readers who are primarily interested in the operational aspects could skip to Section 4.

To formalize the idea of operating windows, it is helpful to define failure modes mathematically. A failure-mode criterion is an inequality that applies to a functional response of a system: Yi(X, Z) > Li or Yi(X, Z) < Ui. The criteria are defined such that, if the criteria are satisfied, the failure will not occur. The inputs X and Z are vectors of physical variables in the engineering system. The physical variables are sorted into two types, not necessarily disjoint: noise factors Z and control factors X. The control factors are variables the designer may change during the parameter-design phase of systems engineering. The noise factors are physical variables that vary in the environment, manufacture, or lifecycle of the system. Yi is a functional response of the system and the mapping Yi(X, Z) describes the physical or logical process by which the system responds to the control and noise factors. Li and Ui are lower and upper limits on a response, defined so that exceeding that limit constitutes a system failure.

To illustrate these ideas, consider a jet engine. A functional response of an engine is the thrust it develops. If thrust were to fall below some prescribed limit, we could define that condition as a failure. The thrust is affected by control factors such as the chord of the fan blades. The thrust is also affected by noise factors such as the inlet temperature and angle of attack of the free stream into the engine inlet. A reliable engine is


designed so that the thrust is within acceptable limits over a wide range of the noise factors. To make these ideas operational, we have found it necessary to introduce a distinction between two types of failure modes—one-sided and two-sided failure modes [Clausing and Frey, 2004]. A one-sided failure mode is a functional response and the associated physical process Yi(X, Z) with either a lower or upper limit but not both. A common one-sided failure mode is plastic deformation of a material. When plastic deformation is unacceptable or reaches a prescribed limit, the designer will define that as a failure. Plastic deformation often occurs when a level of stress is exceeded, so the failure criterion would naturally fit the form Yi(X, Z) < Ui where Yi denotes stress in physical units such as pounds per square inch. If there is no parallel failure mode for low values of stress, then it is most natural to think of plastic deformation as a one-sided failure mode. A two-sided failure mode is a functional response and the associated physical process Yi(X, Z) with both a lower and an upper limit. Two-sided failure modes are frequently found in measurement or metering functions within a system. If a measuring system is inaccurate, the designer will regard it as a failure when the readings are too high or too low compared to the true quantity, so the failure criterion would naturally fit the form Li < Yi(X, Z) < Ui where Yi denotes, for example, measurement error in physical units such as volts. Note that, given the definitions here, a two-sided failure mode is driven by the same physical process description Yi at both the high and low failure-mode boundaries. Thus, a single noise factor like ambient temperature can be limited from above and below by a single physical phenomenon. For example, a fluid metering system may operate in a limited temperature range due to the fact that the fluid viscosity is a function of temperature. 
This single physical phenomenon of temperature dependence of viscosity may make too much fluid flow at high temperatures and too little fluid

flow at low temperatures. In this paper, a two-sided failure mode is necessarily governed by a single set of failure-mode physics.

In the presence of a two-sided failure mode, robust parameter design is critical. Figure 2 depicts a two-sided failure mode applied to a response. The operating conditions give rise to a variation in the functional response; therefore, the response has a probability distribution p(Yi). In the scenario on the left side of Figure 2, the variability is so wide that it cannot be accommodated within the limits between the failure-mode boundaries. If robust parameter design were applied, the sensitivity of the response would be reduced, resulting in a tighter distribution of the response and enabling both sides of the failure mode to be avoided. Thus, robust parameter design is essential in the presence of two-sided failure modes and, indeed, much of the research in robust design is oriented toward scenarios with two-sided failure modes. This paper by contrast concerns itself primarily with one-sided failure modes, which seem to admit a wider range of robust design approaches.

It is common for a single noise factor to be limited from above and below by two different physical failure modes. Here, this situation is characterized as an operating window between two one-sided failure modes rather than a two-sided failure mode. To illustrate the difference, consider fluid metering again. It is possible that an upper limit on the noise factor of temperature is set by the physical process of boiling while the lower limit on temperature is set by the previously discussed increase in viscosity with reduced temperature. It therefore seems more natural to consider two failure modes governed by two different functional responses, Yi(X, Z) < Ui and Yi+1(X, Z) > Li+1. The difference here is reflected in the fact that the two responses have different indices. In theory this seems minor, but in practice we regard this as highly significant.
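The distinction between a two-sided failure mode and a window between two one-sided failure modes can be sketched in code. The following is a minimal illustration, not the authors' formulation: the class name, the scalar X and Z, and the toy response models and limits for the fluid metering example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FailureModeCriterion:
    """A failure-mode criterion on one response Y_i(X, Z) (hypothetical class)."""
    name: str
    response: Callable[[float, float], float]  # Y_i(X, Z)
    lower: Optional[float] = None              # L_i, or None if there is no lower limit
    upper: Optional[float] = None              # U_i, or None if there is no upper limit

    def is_two_sided(self) -> bool:
        return self.lower is not None and self.upper is not None

    def satisfied(self, x: float, z: float) -> bool:
        """True when the failure mode is avoided at control x and noise z."""
        y = self.response(x, z)
        if self.lower is not None and y < self.lower:
            return False
        if self.upper is not None and y > self.upper:
            return False
        return True


# Toy model (assumed): metered flow rises 1% per degree above 20 degrees.
def flow(x: float, z: float) -> float:
    return x * (1.0 + 0.01 * (z - 20.0))


# Two-sided: one physical process (viscosity) sets both limits on the same response.
metering = FailureModeCriterion("flow error", flow, lower=0.9, upper=1.1)

# Window between two one-sided modes: different physics, different responses.
boiling = FailureModeCriterion("boiling", lambda x, z: z, upper=100.0)
starvation = FailureModeCriterion("too little flow", flow, lower=0.8)

for temperature in (-10.0, 20.0, 60.0, 120.0):
    print(
        temperature,
        metering.satisfied(1.0, temperature),
        boiling.satisfied(1.0, temperature) and starvation.satisfied(1.0, temperature),
    )
```

In this sketch the two one-sided criteria carry separate response functions, so a design change could move one boundary without touching the other, whereas the two-sided criterion ties both limits to the same response.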
Robust parameter design might still be applied with success, but it seems that other approaches will also be applicable. All of the

Figure 2. Robust parameter design accomplishes failure-mode avoidance in the presence of two-sided failure modes.


four strategies presented here illustrate specific alternatives for such cases.

Now that we have defined failure-mode criteria of various types, we may define the operating window formally. The operating window is the set of noise conditions Z that satisfy the full set of failure-mode criteria:

$$
\left\{ Z \;\middle|\;
\begin{array}{ll}
Y_i(X, Z) \ge L_i & \text{for all } i \text{ with lower-limited one-sided failure modes} \\
Y_i(X, Z) \le U_i & \text{for all } i \text{ with upper-limited one-sided failure modes} \\
L_i \le Y_i(X, Z) \le U_i & \text{for all } i \text{ with two-sided failure modes}
\end{array}
\right\}
$$

To simplify the notation, we will compress this down to {Z | Yi(X, Z) ∈ Wi for all i}, where Wi therefore defines a window, which may be one-sided or two-sided. A goal of systems engineering is to make the system robust by adding more points to this set.

Given this concept of failure-mode criteria and operating windows, it is possible to identify design changes that improve reliability without any recourse to probability. The development principle is to add points to the operating window as rapidly as possible. The theorem below formalizes this concept for parameter design.

Operating Windows and Parameter Design -- If the design parameters of a system are changed from X to X′ and the new operating window holds the old operating window as a subset, {Z | Yi(X, Z) ∈ Wi for all i} ⊆ {Z | Yi(X′, Z) ∈ Wi for all i}, then reliability has improved.
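The subset condition in this theorem can be checked numerically on a grid of noise values. The sketch below is illustrative only: the two toy responses, their limits, the scalar control factor, and the integer noise grid are assumptions, and `operating_window` and `reliability` are hypothetical helper names. It also evaluates reliability under two arbitrary nonnegative weightings of the noise grid to show that the comparison does not depend on which weighting is chosen.

```python
import math
from typing import Callable, Dict, Iterable, Sequence, Set, Tuple

# (response Y_i, lower limit L_i, upper limit U_i); use -inf or +inf for one-sided modes.
Criterion = Tuple[Callable[[float, float], float], float, float]


def operating_window(x: float, criteria: Sequence[Criterion],
                     z_grid: Iterable[int]) -> Set[int]:
    """Noise values z on the grid for which every criterion is satisfied."""
    return {z for z in z_grid
            if all(lo <= y(x, z) <= hi for y, lo, hi in criteria)}


def reliability(window: Set[int], weights: Dict[int, float]) -> float:
    """Share of the (nonnegative) noise weight that falls inside the window."""
    total = sum(weights.values())
    return sum(w for z, w in weights.items() if z in window) / total


# Toy one-sided failure modes (assumed for illustration).
criteria = [
    (lambda x, z: z / x, -math.inf, 3.0),  # upper-limited response
    (lambda x, z: z * x, 4.0, math.inf),   # lower-limited response
]
z_grid = range(0, 11)

old_window = operating_window(2.0, criteria, z_grid)  # design X
new_window = operating_window(3.0, criteria, z_grid)  # design X'

print(sorted(old_window))        # [2, 3, 4, 5, 6]
print(sorted(new_window))        # [2, 3, 4, 5, 6, 7, 8, 9]
print(old_window <= new_window)  # True: the old window is a subset of the new one

uniform = {z: 1.0 for z in z_grid}
skewed = {z: float(z + 1) for z in z_grid}
for weights in (uniform, skewed):
    print(reliability(old_window, weights) <= reliability(new_window, weights))
```

Because the weights are nonnegative, adding points to the window can never reduce the weighted sum, which is why the comparison holds for both weightings without knowing the true noise distribution.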


This theorem is mathematically straightforward. Reliability is traditionally defined as the probability that the system operates without failure, where probability and probability density are defined so that integration over a set gives a probability. Since probability density cannot be negative, integrating over a set must give a probability greater than or equal to that obtained by integrating over any of its subsets. Although mathematically basic, the theorem may be important in system design due to its practical implications. The probability density function of the noise factors is generally known only very approximately. If changes can be identified that meet the conditions of the theorem above, then reliability can be improved in spite of our ignorance about the probability density function of the noise factors.

Figure 3 illustrates this theorem graphically: one axis represents a single noise factor and another axis represents a single control factor. Two different functional responses define constraints within the space defined. At the initial setting of the control factor X1, there is an operating window. A change is made in the control factor setting, making it X1′. Since the new range of the noise factor Z1 completely contains the old range of Z1, the operating window has been increased and reliability has been improved.

It is instructive to consider coupling among failure modes. In pursuing robustness to the ith failure mode, the designer may consider changing the value of control factor Xk. If a change in Xk that affects the set satisfying the ith failure-mode criterion also changes the set satisfying the jth failure-mode criterion, then we say that failure-mode criteria i and j are coupled by control factor Xk. This definition of coupling is consistent with the definition of coupling among equations in mathematics [Borowski and Borwein, 1991]. The definition

Figure 3. Robust parameter design can accomplish failure-mode avoidance in the presence of multiple one-sided failure modes.


is also similar to the definition of coupling in Axiomatic Design [Suh, 1990] except that coupling occurs among failure modes rather than functional requirements. It should be evident that the two failure modes in Figure 3 are coupled by control factor X1. In this instance, however, the coupling is not such that it negatively affects the robust parameter design process. The theorem’s conditions are satisfied and reliability improvements may proceed despite the coupling. However, it should also be clear that when failure modes are not coupled, robust parameter design may be simpler to accomplish. In the absence of coupling, any control factor Xi affects at most one failure-mode criterion. Once the direction of the dependence is determined, the operating window can be increased by sequentially maximizing or minimizing the size of the set as a function of that single control factor. This is frequently accomplished by driving the value of the control factor to its technical or architectural limits. An example of this is found in paper-feeding machinery. A higher friction coefficient of the feed rolls helps to prevent misfeeds and does not particularly encourage multifeeds. For this reason, developers of paper handlers worked to increase the friction coefficient of feed rolls as far as technically feasible. Even though these technical developments improved the system, the reliability was still not sufficient and further improvements had to be sought. Because of this phenomenon, in any system that is fairly mature, it is common for the parameters that do not couple multiple failure modes to be set near their physical or architectural limits. Since consideration of uncoupled parameters is straightforward, much of the attention in systems engineering is therefore directed to dealing with parameters that are coupled to multiple failure modes. It is often necessary to consider the operating window with respect to two or more noise factors simultaneously. 
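The definition of coupling given above can be sketched as a simple numerical test: change one control factor and see whether the set of noise values satisfying each of two criteria changes. Everything in the example is an assumption made for illustration, including the helper names, the two-component control vector (friction coefficient, stack force), the scalar noise factor, and the toy misfeed and multifeed models. A single trial change can only demonstrate coupling for that particular change; it does not prove that two criteria are uncoupled in general.

```python
from typing import Callable, Iterable, Set, Tuple

Design = Tuple[float, ...]                 # control factors X
Criterion = Callable[[Design, int], bool]  # True when the failure mode is avoided


def satisfying_set(criterion: Criterion, x: Design, z_grid: Iterable[int]) -> Set[int]:
    """Noise values on the grid for which the criterion is satisfied."""
    return {z for z in z_grid if criterion(x, z)}


def coupled_by(k: int, new_value: float, crit_i: Criterion, crit_j: Criterion,
               x: Design, z_grid: Iterable[int]) -> bool:
    """True if changing control factor k to new_value changes the satisfying
    sets of both criteria, i.e. the criteria are coupled by X_k for this change."""
    x_new = x[:k] + (new_value,) + x[k + 1:]
    z_values = list(z_grid)
    changed_i = satisfying_set(crit_i, x, z_values) != satisfying_set(crit_i, x_new, z_values)
    changed_j = satisfying_set(crit_j, x, z_values) != satisfying_set(crit_j, x_new, z_values)
    return changed_i and changed_j


# Toy models (assumed): x = (friction coefficient, stack force), z = sheet resistance.
def no_misfeed(x: Design, z: int) -> bool:
    return x[0] * x[1] - z >= 0.0  # drive force must overcome the resistance


def no_multifeed(x: Design, z: int) -> bool:
    return x[1] - z <= 2.0         # excess stack force drags a second sheet


x0 = (1.0, 4.0)
z_grid = range(1, 11)

print(coupled_by(0, 2.0, no_misfeed, no_multifeed, x0, z_grid))  # friction: False
print(coupled_by(1, 6.0, no_misfeed, no_multifeed, x0, z_grid))  # stack force: True
```

In this toy model the friction coefficient affects only the misfeed criterion, so it can simply be pushed toward its limit, while the stack force moves both boundaries and must be treated as a coupled parameter.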
This requires a representation of multidimensional failure-mode boundaries. Figure 4 is

a graphical depiction of a two-dimensional operating window formed between three one-sided failure-mode criteria. A key distinction between Figure 3 and Figure 4 is that in Figure 4 two noise factors are represented rather than one. In addition, no control factors are represented using an axis. Instead, Figure 4 represents the operating window at two distinct design configurations X and X′. The shape and size of the window can vary with the design parameters. A useful goal, as before, is to add points to the operating window without removing any points. This condition holds in Figure 4, so the change in design will improve the system’s reliability.

The theorem discussed previously applies to parameter design, but the idea depicted in Figure 4 can be readily extended to conceptual design in which not only the control factors are changed, but the functional response of the system is modified as well. All that is required is the idea that the functional response itself can be varied as well as the control factors.

Operating Windows and Conceptual Design -- If the conceptual design of a system is changed, including a change in functional responses from Yi to Yi′ and the corresponding design parameter changes from X to X′, and the new operating window holds the old operating window as a subset, {Z | Yi(X, Z) ∈ Wi for all i} ⊆ {Z | Yi′(X′, Z) ∈ Wi for all i}, then reliability has improved.

At the earliest stages of system design, when our latitude to make changes is greatest, it is these types of conceptual changes that are most critical to find and implement. Although robust parameter design has been a valued development in systems engineering, large changes in system reliability observed over time cannot be explained by parameter design alone. As an example, vehicles sold today are far more reliable than those that

Figure 4. Robust design with a two-dimensional operating window.


were sold 30 years ago. The majority of that reliability improvement is due to the scores of system design and technological changes made over these decades. Electronic spark timing replaced the distributor. Fuel injection systems replaced carburetion. In addition, there were many other less widely known innovations that created large improvements in reliability. We propose that at an early stage of system design, many design opportunities exist that meet the criteria of the theorems presented here. Much of early-stage, conceptual reliability engineering can therefore be undertaken without any probabilistic modeling, freeing up engineers for deep thought about patterns of innovation in reliability engineering. This is a principal message of this paper, and it will be emphasized by presenting specific strategies for carrying out this suggestion.
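The operating-window condition of Section 3 lends itself to a direct numerical check. The following Python sketch is a minimal illustration and not part of the original analysis: the three margin functions, the parameter names a, b, and c, and the integer noise grid are hypothetical choices made only to show how the subset test works without reference to any probability distribution.

```python
from itertools import product

# Hypothetical one-sided failure-mode criteria. Each returns a margin that
# must be non-negative for the system to avoid that failure mode.
def margin_a(x, z):
    return x["a"] - z[0]            # fails when noise z1 exceeds limit a

def margin_b(x, z):
    return x["b"] - z[1]            # fails when noise z2 exceeds limit b

def margin_c(x, z):
    return z[0] + z[1] - x["c"]     # fails when z1 + z2 falls below c

CRITERIA = (margin_a, margin_b, margin_c)

def operating_window(design, noise_grid):
    """Set of noise points at which no failure-mode criterion is violated."""
    return {z for z in noise_grid if all(m(design, z) >= 0 for m in CRITERIA)}

def is_guaranteed_improvement(old_window, new_window):
    """True when the new window adds points and removes none (proper superset)."""
    return old_window < new_window

# Assumed integer grid over two noise factors (arbitrary units).
NOISE_GRID = list(product(range(11), repeat=2))

old = operating_window({"a": 6, "b": 6, "c": 4}, NOISE_GRID)
new = operating_window({"a": 8, "b": 7, "c": 3}, NOISE_GRID)
trade = operating_window({"a": 9, "b": 4, "c": 4}, NOISE_GRID)

print(is_guaranteed_improvement(old, new))    # every boundary moved outward
print(is_guaranteed_improvement(old, trade))  # some old points were removed
```

In the third configuration the window gains points along z1 but loses points along z2, so whether reliability improves would depend on the noise distributions. The subset test declines to certify that change, which is the distinction the theorem draws.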

4. FOUR STRATEGIES FOR IMPROVED ROBUSTNESS

Up to this point, this paper has focused on the interrelated concepts of reliability, robustness, and one-sided failure modes. From this point forward, the paper concentrates on strategies to avoid one-sided failure modes. All of these strategies involve concept design rather than parameter design. The design changes considered here are not only changes in the values of design parameters but also additions of new features or components, changes in the configuration of the system, or even new inventions. We present four strategies along these lines:

1. Relax a constraint limit on an uncoupled control factor.
2. Use physics of incipient failure to avoid failure.
3. Create two distinct operating modes for two different demand conditions.
4. Exploit interdependence between two operating-window system variables.

To illustrate these strategies and demonstrate their versatility, we present two different example applications of each strategy: a primary example that is described in considerable detail and a supplementary example that is described in less detail. Two engineering domains are used throughout: paper feeders and jet engines. The next four subsections present these strategies.

4.1. Relax a Constraint Limit on an Uncoupled Control Factor

A control factor that affects only one of the one-sided failure modes in a system is said to be uncoupled as defined in Section 3. Such control factors should be maximized or minimized to create the greatest possible distance from the affected one-sided failure mode consistent with any constraints on the control factor. As the system is placed under greater demands over time due to system evolution and competition, the operating window afforded under the current system constraints may become insufficient. Under these circumstances, the constraint can often be relaxed by making changes in the system architecture or by changes in technology. The relaxed constraint enables further changes to the uncoupled control factor, which opens the operating window.

Primary Case Study—Paper Feeder. As an industrial example, we present the Xerox paper feeder that first went into production in 1981, and has appeared in many different Xerox copiers and printers. This paper feeder is known as a friction-retard feeder (Fig. 5). The feedbelt rests on the paper stack, and drags the top sheet forward. The friction of the retard roll holds back (retards) the second sheet if it tries to come through. Thus, the retard roll prevents multifeeds (feeding of more than one sheet). Therefore, the wrap angle between the feedbelt and the retard roll only affects the failure mode of multifeeds. The other primary failure mode is misfeeds (no sheet is fed). This failure mode is not affected by the wrap angle between the feedbelt and the retard roll. Because multifeeds are reduced by a large wrap angle and misfeeds are unaffected, it is clear that the wrap angle should be as big as possible. Despite the desirability of having a large wrap angle, the previous-generation feeder (ca. 1975) had a wrap angle of only 13°, which was constrained by the system architecture. In the new design that first went into

Figure 5. Friction-retard feeder, U.S. Patent #4,475,732 [Clausing et al., 1984].


Figure 6. The architecture on the left has a nearly linear paper path, U.S. Patent #3,930,725 [Jones and Van Deluyster, 1976]. A newer architecture on the right has a looping paper path, which enabled a larger wrap angle, U.S. Patent #4,475,732 [Clausing et al., 1984].

production in 1981 the wrap angle was increased to 45°. This large improvement in wrap angle was enabled by a change in the total system architecture. In large copiers and printers the next subsystem after the paper feeder is the registration subsystem, which aligns the sheet with the image. In the new design the architecture was changed so that the paper came out of the feeder and turned down to reach the registration subsystem (Fig. 6), which was underneath the feeder. This enabled the wrap angle to be greatly increased. This architecture also reduced the width of the copier/printer, which is desirable. This paper feeder with the large wrap angle has been very successful in many generations of Xerox copiers and printers.

Supplementary Case Study—Jet Engines. A similar approach was used to improve the reliability of axial-flow fans in jet engines. A fan is a component of modern high-bypass commercial jet engines that provides a significant increase in the total mass flow, and therefore an improvement in propulsive efficiency. A critical failure mode of such fans is flutter vibration due to the length of the blades and their exposure to inlet flow distortions. It had long been known that increasing the chord of a fan blade stiffened the blade and thereby reduced the incidence of the failure mode of flutter, but the chord of the blade was limited by constraints on weight [Koff, 2004]. Eventually, new technologies for manufacturing hollow blades enabled engine manufacturers to increase chords significantly without added weight. For example, both Patent #4,345,877 [Monroe, 1980] and Patent #4,720,244 [Kluppel and Monroe, 1987] contributed to these advances. Wide-chord fans provided much greater resistance to flutter and have thereby greatly improved engine reliability. As in the

case of wrap angles in paper feeders, innovation enabled a critical parameter to be pushed past its previous constraints to move a one-sided failure-mode boundary and increase the operating window. Summary of the Strategy. When a system variable only affects one of the one-sided failure modes, take its value to its constraint limit. If the operating window is still not large enough, seek new architectures or technologies that relax the constraint.

4.2. Use Physics of Incipient Failure To Avoid Failure

In some systems the physics of the incipient failure can be used to prevent or delay the failure mode. All one-sided failure modes are associated with underlying physical phenomena. In many cases the failure mode exhibits distinct physical mechanisms that become active as the onset of the failure mode is approached. In some systems there exists an opportunity to exploit the physics of incipient failure to increase the size of the operating window.

Primary Case Study—Jet Engines. An example is afforded by the use of shaped grooves in compressor casings in modern jet engines. An axial flow compressor comprises multiple alternating stages of rotor assemblies and stators. To limit engine complexity and weight, a large pressure rise per stage is desired so that the required overall pressure rise in the compressor can be accomplished with a small number of stages. However, the pressure increase of each stage is limited by a failure mode of aerodynamic stall and surge. A stall involves separation of airflow from a blade, which at any given time may affect only one stage or even a group of stages.


Figure 7. The arrangements of slots in an axial flow compressor. Adapted from U.S. Patent #4,086,022 [Freeman and Moritz, 1978].

A compressor surge generally refers to a complete flow breakdown throughout the compressor. The value of airflow and pressure ratio at which a surge occurs is termed the “surge point,” and the “surge margin” is the difference between the airflow and compression ratio at which the compressor will normally be operated and the airflow and compression ratio at which a surge will occur. Thus, we can readily interpret surge margin as the distance from the one-sided failure mode of compressor surge. In the late 1970s new technologies known as “casing treatments” were developed. In one casing treatment technology assigned to Rolls-Royce, Patent #4,086,022

[Freeman and Moritz, 1978], a series of angled channels are placed in the casing of the compressor extending from the leading edge of the rotors and extending just aft of the trailing edge (see Fig. 7). If a surge begins to occur, then “a rotating annulus of pressurized gas will begin to build up about the tips of the blades”. Because of the geometry of the slots, “the annulus of air will be directed into the slots … thus reducing or eliminating the surge” [Freeman and Moritz, 1978, p. 5]. To understand how the casing treatments are related to the operating window, it is useful to consider Figure 8 adapted from Cumpsty [1997]. The abscissa in the figure is mass flow of air into the engine. The mass flow

Figure 8. The effect of casing treatment on surge of jet engine compressors [adapted from Cumpsty, 1997].


in an engine may vary due to changes in inlet conditions caused by atmospheric conditions or aircraft maneuvers; therefore, mass flow is a noise factor as defined in Section 3. The ordinate in Figure 8 is pressure rise across a stage of the compressor. When conditions are at their nominal state, the engine will generally remain on the operating line with mass flow and pressure rise both changing as a function of the throttle position set by the pilot. At a fixed throttle position, when mass flow is reduced due to maneuvers or environmental conditions, the state of the engine moves toward the surge line as indicated in step 1 of Figure 8. This pushes the engine off the operating line and toward the failure-mode boundary. The amount of mass-flow drop that can be tolerated before failure (step 3a or step 3b) is sometimes called the “surge margin,” which we interpret as an indication of the operating window size. The technology described in Patent #4,086,022 can be viewed as a means to exploit the incipient failure-mode physics (the rotating annulus of air, step 2) to increase the surge margin. The treatments are designed so that the incipient physics will lead to a pressure relief across the stage (step 3b). The advanced casing treatment “increased fan stall margin by a staggering 20% under distorted inlet flow and with little loss in efficiency.” [Koff, 2004, p. 582].

Supplementary Case Study—Paper Feeder. A similar approach was used to improve the reliability of paper feeders. For friction-retard paper feeders, the stack force between the feedbelt and the paper stack is a critical system variable. If it is too large the multifeed

rate will be excessive. If the stack force is too small, the misfeed rate will be excessive. Therefore, there is an operating window between these two one-sided failure modes (Fig. 9). When the range of papers is moderate, it is easy to develop a sufficient operating window so that both the multifeed rate and the misfeed rate are very small. However, for the large range of papers that are typically used in large production copiers and printers, it is very difficult, or impossible, to develop a sufficient operating window, as shown on the left of Figure 9. On the left hand side of Figure 9, it is evident that no single value of stack force will simultaneously avoid both multifeeds and misfeeds over the full range of paper weights. This was still true after robust parameter design had been completed, so there was little hope to improve it further beyond the great improvement that had already been achieved. The problem was resolved through the development of a “stack force relief/enhancement” technology, U.S. Patent # 4,561,644 [Clausing, 1985]. This technology uses two different values of the stack force, a small value for most papers, and a larger value for heavy papers (as depicted on the right side of Fig. 9). Under normal conditions, the stack force is set to the small value. For most common paper weights this works very reliably. If a larger paper weight is used, a misfeed condition may begin to emerge. A sensor near the retard roll is designed to sense the arrival of the lead edge of the sheet. If an incipient misfeed occurs, the paper will not arrive within the desired time period. Under this

Figure 9. Operating window for friction retard paper feeder.


condition, the stack force is increased to the large value. This was done by energizing the solenoid 90 in Figure 5, which pushed the feeder around the pivot 11, thus increasing the stack force. Thus, the machine was able to reliably feed the full range of paper weights.

Summary of the Strategy. Exploit the physical mechanisms associated with an incipient failure to offset the failure mode, thereby increasing the size of the operating window.
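The stack force relief/enhancement logic in the supplementary case study can be summarized as a small decision rule. The Python sketch below is only an illustration under stated assumptions: the force values are in arbitrary units, the function names are hypothetical, and the sensor is reduced to a simple threshold comparison rather than the timing measurement used in the actual feeder.

```python
def lead_edge_arrives_in_time(stack_force, required_force):
    """Stand-in for the lead-edge sensor: the sheet arrives on time only if
    the stack force is at least what this (hypothetical) paper requires."""
    return stack_force >= required_force

def feed_with_force_enhancement(required_force, low_force=1.0, high_force=3.0):
    """Return (stack force finally applied, whether the sheet was fed)."""
    # Normal condition: the small stack force keeps multifeeds away.
    if lead_edge_arrives_in_time(low_force, required_force):
        return low_force, True
    # Incipient misfeed detected: energize the solenoid to raise the force.
    return high_force, lead_edge_arrives_in_time(high_force, required_force)

# Assumed requirements for a light and a heavy paper (arbitrary units).
for paper, required in [("light", 0.8), ("heavy", 2.5)]:
    force, fed = feed_with_force_enhancement(required)
    print(paper, force, fed)
```

The point of the sketch is that the larger force is applied only after the incipient misfeed has been sensed, so common papers never see the force level that would encourage multifeeds.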

4.3. Employ Two Different Operating Modes

In some cases, the development process reaches a state in which the system has a limited operating window between multiple one-sided failure modes and therefore cannot operate reliably. In such cases, it is often advisable to change from a single operating mode to two operating modes. Separately designing two distinct operating modes enables significant design freedom to seek better resistance to the failure modes. This strategy is often similar to the strategy “use physics of incipient failure to avoid failure” and in fact the two strategies can overlap. However, two key distinctions should be made: (1) Incipient failure-mode physics do not always lead to clearly distinct operating modes, and (2) the switch between two modes need not be cued by incipient failure physics and can instead be cued by operator inputs or state variables of the system.

Primary Case Study—Paper Feeder. A failure mode of friction-retard paper feeders (Fig. 5) is excessive wear of the retard roll. In previous designs the roll had been rotated approximately once per hour to distribute the wear over the entire roll. Nevertheless, the wear was excessive, and was a considerable expense in service cost and lost production of the copier/printer. The critical variable that determines the wear of the retard roll is the force between the feedbelt and the retard roll, F, multiplied by the contact distance D between the feedbelt and the retard roll. The product, FD, is the work that the retard roll can do to remove energy from the second sheet, and thus stop the second sheet. However, this is also the work that causes wear of the retard roll. The result is as shown in Figure 10. With the previous design, one system variable FD has control of both of the one-sided failure modes, excessive multifeeds and excessive wear of the retard roll.
Maurice Holmes at Xerox recognized that this problem could be resolved through a redesign of the retard mechanism by adding a second operating mode. The innovation was included in the advanced paper feeder that first went into production in the Xerox 1075 copier in 1981, Patent # 4,475,732 [Clausing et al., 1984].


Figure 10. Two failure modes, one system variable. (Crosshatched region is negative operating window—no safe range.)

The inventive process that led to this invention is well described in terms of the theory of inventive problem solving (TRIZ). The TRIZ process generally begins by framing the current problem as a conflict. In this case, there was an engineering conflict between avoiding multifeeds and avoiding excessive wear. In TRIZ, one effective way to seek a conflict resolution is through “Sufield” or “substance-field” analysis [Clausing and Fey, 2004]. Simple Sufield diagrams are in the form of a triad. The relevant triad diagram for the retard-roll problem is shown on the left-hand side of Figure 11. Here the substances are (1) the paper and (2) the roll/shaft. The field is the contact force. TRIZ includes many standards for the creative revision of the Sufield. One of the standards is: “To enhance the effectiveness of the Sufield, transform one substance into an independently controlled Sufield, thus generating a chain Sufield,” p. 112. This can be implemented by introducing a field between the retard roll and its shaft (as shown on the right-hand side of Fig. 11). This is as far as Sufield analysis will take us. Now we have to use science and art to identify a field and a component for creating the field that will open an operating window. One such approach is to insert a friction brake with a brake torque T into the design to produce a field between the retard roll and its shaft (U.S. Patent #4,475,732). This field creates the possibility of two distinct operating modes: (1) When the torque that is applied to the roll is less than T, the roll remains stationary, and (2) when the torque that is applied to the roll is greater than T, the roll rotates. The torque that is applied to the retard roll is produced by the friction from the belt or the paper, whichever is contacting the roll. When one sheet of paper is between the roll and the feedbelt, the friction coefficient has a value of 2, which overcomes the brake torque. Therefore, the roll rotates, and there is not any wear.
When two sheets of paper are between the roll and the feedbelt, the friction coefficient is 0.6, and the brake torque prevents rotation of the retard roll. Thus the second sheet is stopped.
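The mode switch created by the friction brake can be expressed as a simple torque comparison. The Python sketch below is a minimal illustration, assuming that the torque applied to the roll equals friction coefficient times normal force times roll radius. The friction coefficients 2 and 0.6 are the values quoted above, while the normal force, roll radius, and brake torque are hypothetical numbers chosen only to make the example concrete.

```python
def applied_torque(friction_coefficient, normal_force, roll_radius):
    """Friction torque on the retard roll from whatever is touching it."""
    return friction_coefficient * normal_force * roll_radius

def roll_rotates(friction_coefficient, normal_force, roll_radius, brake_torque):
    """Rotating mode if the applied torque exceeds the brake torque T,
    otherwise stationary mode (the roll retards the second sheet)."""
    torque = applied_torque(friction_coefficient, normal_force, roll_radius)
    return torque > brake_torque

def brake_torque_limits(mu_two_sheets, mu_one_sheet, normal_force, roll_radius):
    """Range of T that separates the two modes: (lower bound, upper bound)."""
    return (applied_torque(mu_two_sheets, normal_force, roll_radius),
            applied_torque(mu_one_sheet, normal_force, roll_radius))

MU_ONE_SHEET = 2.0    # quoted in the text
MU_TWO_SHEETS = 0.6   # quoted in the text
NORMAL_FORCE = 5.0    # assumed value, newtons
ROLL_RADIUS = 0.01    # assumed value, meters
BRAKE_TORQUE = 0.06   # assumed value, newton-meters

low, high = brake_torque_limits(MU_TWO_SHEETS, MU_ONE_SHEET, NORMAL_FORCE, ROLL_RADIUS)
print(low < BRAKE_TORQUE < high)
print(roll_rotates(MU_ONE_SHEET, NORMAL_FORCE, ROLL_RADIUS, BRAKE_TORQUE))
print(roll_rotates(MU_TWO_SHEETS, NORMAL_FORCE, ROLL_RADIUS, BRAKE_TORQUE))
```

Under these assumptions any brake torque between the two limits yields the desired behavior: the roll turns freely with a single sheet and holds still when a second sheet tries to pass.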


Figure 11. Sufield diagrams for retard roll.

The addition of the new operating mode created an additional design parameter “brake torque” which sets the condition for the switch between the two modes. Thus, the design space expands from a 1-D operating window to a 2-D operating window (Fig. 12). If the brake torque is set to an appropriate value, the retard roll will only rub against the paper when the incipient multifeed condition actually occurs. In this case, the excessive-wear failure-mode boundary is never active and a new failure mode (paper damage) becomes the limiting factor on parameter FD, leaving a greatly increased operating window. Supplementary Case Study—Jet Engines. A similar approach was used to simultaneously avoid two one-sided failure modes associated with combustion in jet engines. A combustor is a part of a jet engine in which fuel is injected into the air stream, mixed with air, and burned. Two key failure modes of a combustor are concerned with the composition of the exhaust gas, which is tightly regulated to protect the environment. One failure mode is excessive production of carbon monoxide (CO), which occurs with an overly lean mixture and low temperature in the combustion zone. Another failure mode is excessive production of oxides of nitrogen (NOX), which is associated with overly high temperature in the combustion zone. Given the changes in the thrust demands (and many other parameters that

vary), it is a challenge to maintain the combustion conditions in the small operating window between the failure modes. In the 1970s a new technology called “two-zone” or “staged” combustion substantially increased the operating window by affording multiple operating modes [Markowski, Lohmann, and Reilly, 1976; Lefebvre, 1999]. When the demand for thrust is low, all the combustion takes place in a single “primary zone.” When thrust demands are highest, the engine automatically switches to a mode in which combustion occurs in two different zones each of which is functioning within the operating window between the CO and NOX related failure modes. This technology has been developed through many inventions including Patent #4,052,844 [Caruel, Quillevere, and Gastebois, 1977] and has become popular especially in gas turbine engines for ground based power [Washam, 1983]. As in the case of the paper feeders with a friction brake, the system automatically switches between two modes of operation in order to increase the operating window between two coupled one-sided failure-mode boundaries. Summary of Strategy. When it is not possible to simultaneously avoid two one-sided failure modes due to a wide range of noise values, consider defining two distinct operating modes so that at least one of the failure modes will be moved to increase the size of the operating window.

4.4. Identify and Exploit Dependencies Among Failure Modes

In the operating-window approach, the parameter space is sketched out and the failure-mode boundaries are identified. In the sketch, it is often the case that the parameters associated with the axes are not independent. A small change induced in one parameter will

Figure 12. Operating window for improved retard-roll design.


Figure 13. Physical layout of a cooling system for a turbine blade.

have an associated effect on the other one. It seems clear that such dependencies can influence system reliability. What is sometimes overlooked is that they often provide an opportunity to use the dependence to stay within the operating window. Primary Case Study—Jet Engines. An example is afforded by turbine blade cooling systems [Sidwell, 2004]. The physical layout of the system is described in Figure 13. Air from the compressor is routed to the first-stage turbine blades. The cooling flow path includes a Tangential On-Board Injector, which brings the flow from a supply at Ps into the rotating parts of the engine. The area between the rotating seal and the blades acts as a plenum storing compressed gas at a


pressure Pp. The gas then flows through each of the many first stage blades. The purpose of this flow is to cool the surface of the blades and thereby avoid the failure mode of early blade oxidation. To apply operating-window methods to this scenario, one may first sketch the parameter space and the failure-mode boundaries. Figure 14 depicts a highly simplified window with just two failure modes, oxidation of blade #1 and oxidation of blade #2. Manufacturing variation may excite failure mode #1 (oxidation of blade #1) if its flow passages are constricted causing m1 to drop. However, the schematic diagram of Figure 14 suggests that there is a dependency among the failure modes. Any small drop in m1 tends to cause a rise in plenum pressure and a resulting rise in m2. The reverse is also true—any small drop in m2 tends to cause a rise in plenum pressure and a resulting rise in m1. This interdependency of the failure modes creates an opportunity to create larger distance from both failure modes. Turbine blades are routinely tested for their flow characteristics. Sidwell proposed that this test could be used to sort the blades into low flow, medium flow, and high-flow classes. In this way, a second interdependency is added to the system. The low m1 due to the sorting process brings about a low m2. The nature of the interdependency caused by the plenum causes the two effects to cancel (or very nearly cancel) as depicted in Figure 14. Sidwell [2004] estimated that “binning” turbine blades will increase the life of the high flow and medium flow blades by 50% or more and would enable low-flowing blades to be used with approximately the same life as current engines. Supplementary Case Study—Paper Feeder. In a document feeder for a copier it is highly desirable to feed from the bottom of the stack of documents. This leaves the top of the stack free to receive the recirculated document after it has been copied. 
The most advanced document-feeder technology uses air to move the document, which minimizes damage to the document. Such

Figure 14. The failure-mode boundaries in a simplified, two-blade system.


Figure 15. Operating window for bottom-feeding vacuum document feeder.

feeders typically use a combination of positive air pressure and negative air pressure (vacuum). The positive air pressure is used to levitate the document stack (otherwise the weight of the document stack would tend to cause both misfeeds and multifeeds). Therefore, a sufficient pressure under the stack is required to avoid both misfeeds and multifeeds. However, excessive pressure under the stack could cause the last sheet to blow away. Therefore, good system design requires an operating window between inadequate pressure and excessive pressure, as shown in Figure 15. The simple approach to achieve robust document feeding is to arrange a natural dependence between weight of the paper stack and the air pressure under the stack. This is done by careful sizing of all of the flow impedances. Thus the pressure under the stack is maintained proportional to the stack weight without the need for any additional components. This strategy was used in a series of patents at Xerox which made successive improvements in robustness [Stange, 1977; Silverberg, 1981; Browne, 1983]. Summary of the Strategy. When there are dependencies among failure modes, look for ways to use those dependencies to counteract the effects of noise factors.
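The plenum interdependence in the turbine-blade case study of this section can be illustrated with a lumped flow model. The Python sketch below is a deliberately simplified assumption and not Sidwell's model: flows are taken as linear in pressure drop, the blades exhaust to zero gauge pressure, and the supply pressure and conductance values are hypothetical.

```python
def blade_flows(supply_pressure, supply_conductance, blade_conductances):
    """Lumped linear model: flow = conductance * pressure drop, with the
    blades exhausting to zero gauge pressure.
    Returns (plenum pressure, list of blade flows)."""
    total = sum(blade_conductances)
    # Mass balance at the plenum: Gs * (Ps - Pp) = Pp * sum(g_i)
    plenum_pressure = (supply_conductance * supply_pressure
                       / (supply_conductance + total))
    return plenum_pressure, [g * plenum_pressure for g in blade_conductances]

PS, GS = 10.0, 1.0                   # assumed supply pressure and conductance
LOW, NOMINAL, HIGH = 0.8, 1.0, 1.2   # assumed blade flow conductances

_, nominal = blade_flows(PS, GS, [NOMINAL, NOMINAL])
_, constricted = blade_flows(PS, GS, [LOW, NOMINAL])
_, mixed = blade_flows(PS, GS, [LOW, HIGH])
_, low_bin = blade_flows(PS, GS, [LOW, LOW])

# A constricted blade 1 raises plenum pressure and hence the flow in blade 2.
print(constricted[1] > nominal[1])
# Pairing low-flow blades together raises the worst-case blade flow
# compared with mixing a low-flow blade with a high-flow blade.
print(min(low_bin) > min(mixed))
```

In this toy model the worst-cooled blade in a sorted set receives more flow than the worst-cooled blade in an unsorted set, which is the qualitative effect that binning exploits. The magnitudes carry no significance.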

5. SUMMARY

Reliability is one of the most important characteristics of an engineering system. Probabilistic formulations of reliability are useful for component selection, verification testing, and field-service management. However, at the early stages of system architecting and concept design, probabilistic formulations are not as helpful. We propose that thinking in terms of physical mechanisms of failure is much more effective and that the fundamental principle of reliability engineering is failure-mode avoidance. A useful reliability-engineering concept is the operating window, which is the region in noise parameter space that avoids failure modes. In this paper we have

given a mathematical definition of the operating window. We have shown that adding to the window increases the reliability regardless of the probability distributions of the noise factors. To this we add the principle that this should be done early and rapidly during the system development. In particular, concept design changes frequently add large regions to the operating window and account for some of the largest improvements to reliability of systems over the course of their development. To illustrate this approach, we have described four strategies for increasing operating window through concept design. Each strategy is illustrated by two case studies, one from the field of paper feeders for copiers and printers, and the other from the field of jet engines. Each case study includes past inventions that significantly improved reliability. By showing the theory and eight case studies we have displayed both the fundamentals and the diversity of industrial applications of this important approach to the development of reliable systems.

REFERENCES

G.S. Altschuller, Creativity and an exact science: The theory of the solution of inventive problems, Gordon and Breach, New York, 1984.
E.J. Borowski and J.M. Borwein (Editors), The Harper Collins dictionary of mathematics, Harper Collins, New York, 1991.
J.M. Browne, Bottom sheet feeding apparatus, U.S. Pat. #4,411,417, 1983.
J.E.J. Caruel, H.A. Quillevere, and P.M.D. Gastebois, Gas turbine combustion chambers, U.S. Pat. #4,052,844, 1977.
D.P. Clausing, Sheet feeding and separating apparatus with stack force relief/enhancement, U.S. Pat. #4,561,644, 1985.
D.P. Clausing, Total quality development, ASME Press, New York, 1994.
D.P. Clausing, Operating window–an engineering measure for robustness, Technometrics 46(1) (2004), 25–29.
D.P. Clausing and D.D. Frey, Failure modes and two types of robustness, INCOSE Annual Symp, 2004, CD, Paper number 321.
D.P. Clausing and V. Fey, Effective innovation, ASME Press, New York, 2004.
D.P. Clausing, M.F. Holmes, R.A. Povio, and R.P. Rebres, Sheet feeding and separating apparatus with stack force relief/enhancement, U.S. Pat. #4,475,732, 1984.
N.A. Cumpsty, Jet propulsion: A simple guide to the aerodynamic and thermodynamic design and performance of jet engines, Cambridge University Press, Cambridge, UK, 1997.
C. Freeman and R.R. Moritz, Gas turbine engine with improved compressor casing for permitting higher air flow and pressure ratios before surge, U.S. Pat. #4,086,022, 1978.


G. Gigerenzer, “Ecological intelligence: An adaptation for frequencies,” The evolution of mind, D.D. Cummins and C. Allen (Editors), Oxford University Press, New York, 1998, http://www.mpib-berlin.mpg.de/dok/full/gg/ggejuevm_/ggejuevm_.html.
G. Gigerenzer and A. Edwards, Simple tools for understanding risks: From innumeracy to insight, Br Med J 327 (2003), 741–744.
H. Jones and J.W. Van Deluyster, Multiple sheet feeding system for electrostatographic printing machines, U.S. Pat. #3,930,725, 1976.
R.V. Joseph and C.F.J. Wu, Failure amplification method: An information maximization approach to categorical response optimization, Technometrics 46(1) (2004), 1–12.
K. Kimseng, M. Hoit, N. Tiwari, and M. Pecht, Physics-of-failure assessment of a cruise control module, Microelectron Reliab 39(10) (1999), 1423–1444.
G.E. Kluppel and R.C. Monroe, Fan blade for an axial flow fan and method of forming same, U.S. Pat. #4,720,244, 1987.
B.L. Koff, Gas turbine technology evolution: A designer’s perspective, AIAA J Propulsion Power 18(14) (2004), 577–595.
A.H. Lefebvre, Gas turbine combustion, Taylor & Francis, Philadelphia, 1999.
S.J. Markowski, R.P. Lohmann, and R.S. Reilly, Vorbix burner: A new approach to gas turbine combustors, ASME J Eng Power 98(1) (1976), 123–129.
R.C. Monroe, Axial flow fans and blades therefore, U.S. Pat. #4,345,877, 1980.


M. Pecht and A. Dasgupta, Physics-of-failure: An approach to reliable product development, J Inst Environ Sci 38 (1995), 30–34.
M. Pecht, A. Dasgupta, D. Barker, and C.T. Leonard, The reliability physics approach to failure prediction modeling, Qual Reliab Eng Int 6 (1990), 267–273.
S.S. Rao, Reliability-based design, McGraw Hill, New York, 1992.
C.V. Sidwell, On the impact of variability and assembly on turbine cooling flow and oxidation life, Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, 2004.
M. Silverberg, Interrupted jet air knife for sheet separator, U.S. Pat. #4,275,877, 1981.
K.K. Stange, Air floatation bottom feeder, U.S. Pat. #4,014,537, 1977.
N.P. Suh, The principles of design, Oxford University Press, New York, 1990.
G. Taguchi, Taguchi on robust technology development, ASME Press, New York, 1993.
D.A. Thomas, K. Ayers, and M. Pecht, The trouble not identified phenomenon in automotive electronics, Microelectron Reliab 42(4–5) (2002), 641–651.
I.A. Ushakov (Editor), Handbook of reliability engineering, Wiley, New York, 1994.
R.M. Washam, Dry low NOX combustion system for utility gas turbine, ASME Paper 83-JPGC-GT-13, ASME, New York, 1983.

Don Clausing received the B.S. degree in mechanical engineering from Iowa State University in 1952. After working for nine years he again became a full-time student, and received his M.S. (1962) and Ph.D. (1966) degrees from the California Institute of Technology (Caltech). He worked in industry for a total of 29 years before becoming a half-time faculty member at MIT from 1986 until 2000. Starting about 1975 he has had a role in the major improvements in product development and systems engineering that have enhanced the competitiveness of many commercial industries. This includes the publication (1994) of his book Total Quality Development—World-Class Concurrent Engineering. He now has a new book (2004), co-authored with Victor Fey, Effective Innovation—The Development of Winning Technologies. Clausing has long been a leader in robust design, a key to reliable systems. During the 1970s he led in the development of the operating-window method to achieve robustly reliable systems.

Dan Frey earned the B.S. degree in aeronautical engineering from Rensselaer Polytechnic Institute in 1987. After serving as a Naval Officer for 4 years, he earned his M.S. from the University of Colorado in 1993 and Ph.D. from the Massachusetts Institute of Technology in 1997. Since then, he has been a faculty member conducting research in robust design, statistics, design methodology, and systems engineering. He currently holds a dual key faculty position at MIT in the Department of Mechanical Engineering and in the Engineering Systems Division.