Software Rejuvenation Policies for Cluster Systems under Varying Workload ∗

Wei Xie†, Yiguang Hong‡, Kishor S. Trivedi†
†Center for Advanced Computing and Communications (CACC), Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0294
‡Institute of Systems Science, Chinese Academy of Sciences, Beijing 100080, China
Email: [email protected]

∗ This research was supported in part by DARPA and US Army Research Office under Award No. C-DAAD19 01-1-0646, in part by the Air Force Office of Scientific Research under MURI Grant No. F4962000-1-0327, and in part by NSF of China. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the sponsoring agencies.

Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC’04) 0-7695-2076-6/04 $20.00 © 2004 IEEE

Abstract

This paper analyzes two software rejuvenation policies for cluster server systems under varying workload, called fixed rejuvenation and delayed rejuvenation. In order to achieve a higher average throughput, we propose the delayed rejuvenation policy, which postpones the rejuvenation of individual nodes until off-peak hours. Analytic models based on the well-known paradigm of Markov chains are used. Since the size of the Markov model is nontrivial, the model is generated and solved automatically via stochastic Petri nets. The deterministic time to trigger rejuvenation is approximated by a 20-stage Erlang distribution. Based on the numerical solutions of the models, we find that, under the given context, although fixed rejuvenation occasionally yields a higher throughput, the delayed rejuvenation policy seems to outperform the fixed rejuvenation policy by up to 11%. We also compare the steady-state system availabilities of these two rejuvenation policies.

1 Introduction

The World Wide Web (WWW) has acquired importance for its uniform and widely accepted application interface, used by a variety of services to reach a multitude of users. With the dramatic growth in Internet technology and the emergence of a number of new and advanced online applications, the performance and availability of e-business platforms have become critical to assuring quality of service (QoS) and providing highly available (HA) services, especially given the ubiquitous nature of computer networks and of e-business platforms offering a large number of services. One commonly applied method to significantly improve overall system QoS and HA under high demand is clustering (e.g., see [7] and white papers from server vendors such as [2, 11]). A cluster is a set of servers and related resources that acts like a single system and provides high availability, load balancing, and parallel processing. These servers, also known as nodes, are usually identical and may either reside in different physical machines or reside in the same box and share the same operating system. If one node fails, another can act as a backup. Clustering is also cost-effective, since a cluster of inexpensive low-end servers can achieve higher availability and performance than a single costly high-end server. A Web server farm is a typical cluster system found in many e-businesses.

Building a cluster system does not afford us the luxury of ignoring the availability of individual nodes, which is important to the availability and capacity of the entire cluster. Moreover, not all unexpected node outages are recoverable, and their consequences can be very serious. It has been reported that software faults and failures cause more outages in large computer systems than hardware faults (e.g., see [1, 9, 15]), and the aftermath of software failures may amount to huge economic losses or risk to human lives (e.g., see [13]). It has been observed that the status of continuously running software inevitably degrades over time. This phenomenon, known as software aging, is typically caused by memory bloating and leaking, data corruption, storage fragmentation, accumulation of round-off errors, and unreleased network/database connections. Usually software aging is closely related to workload, i.e., the heavier the workload, the faster the software deteriorates [17].

An important technique to counter software aging is software rejuvenation. First proposed in [10], software rejuvenation can be regarded as a preventive maintenance technique applied to the software system to clean up its internal state (or its environment) and restart the software, in order to avoid (or postpone) a future failure. Software rejuvenation has drawn attention recently, both for individual nodes [8] and for cluster systems [3]. For example, rejuvenation technology has been incorporated into IBM Director for xSeries servers to improve system availability and reduce cost [3]. The work in [16] showed that software rejuvenation can significantly improve the availability of cluster systems by studying time-based and prediction-based rejuvenation with different failure types and imperfect coverage. Both [3] and [16] concentrated on cluster availability, but not on the impact of rejuvenation on system capacity and throughput. Rejuvenation in cable modem termination systems is analyzed using a Markovian model in [12]. The availability optimization problem of a two-level software rejuvenation policy for Webservers is studied in [18], and [19] discusses the impact of Webserver rejuvenation on Web users, where Websites are studied as a whole instead of as clusters of servers.

However, most existing work on the analysis of software rejuvenation does not consider an important factor: workload variations caused by user behavior patterns. The servers are expected to have shorter times to failure during peak hours. Besides, since rejuvenation introduces planned outages to individual nodes, it could be beneficial to differentiate between peak hours and off-peak hours when developing a rejuvenation strategy.

In this paper, we propose and study two rejuvenation policies (fixed rejuvenation and delayed rejuvenation) for cluster systems under varying workload. We model the incoming traffic as an ON-OFF process, which has been validated in practice (e.g., see [6, 14]). Moreover, we take the impact of workload on server failure times into account. Each node in the cluster may serve user requests, fail, be repaired, or be rejuvenated. To make the model tractable, we assume that most of the distributions are exponential. It is unreasonable, however, to assume that the time to trigger rejuvenation is exponentially distributed. We resort to the device of stages and approximate the deterministic time to trigger rejuvenation by a 20-stage Erlang distribution. This makes the Markov model somewhat unwieldy to construct and solve by hand. We therefore resort to the powerful paradigm of stochastic Petri nets and stochastic reward nets (SRNs) to automatically generate and solve the underlying Markov model. We use the Stochastic Petri Net Package (SPNP) for this purpose.

The rest of the paper is organized as follows. In Section 2, the two rejuvenation policies are elaborated and the corresponding SRN models are introduced. In Section 3, we define the mean cluster system throughput based on the SRN models and give numerical results for the proposed rejuvenation policies. In Section 4, steady-state availability for the cluster system is defined and numerical results are presented. Section 5 provides the concluding remarks.

2 Rejuvenation Policies for Cluster Systems and SRN Models

We consider a cluster system with n identical nodes. These nodes could be either n physical boxes or n processes running in a single box, as long as they have the same configuration and can operate independently. The overall service is not interrupted by losing one or several of the nodes. The total system capacity at time t is calculated as the individual node capacity x multiplied by the number of available nodes at time t. Such a cluster system is scalable because simply adding more nodes increases the system capacity (nearly) linearly. In e-business infrastructure, Webserver clusters and application server clusters are examples of such cluster systems. Techniques such as session replication ensure the independent operability of each node in a Webserver cluster, i.e., when one node is down, the other nodes can pick up its user sessions and resume processing.

As mentioned before, it is observed that, with no rejuvenation action applied to the software system, internal errors may naturally accumulate, available resources may be exhausted, and consequently unplanned outages may occur. This phenomenon is known as software aging, and it depends directly on workload. Following [10], each node has four basic states: robust state, failure-prone state, failure state, and rejuvenation state. To avoid the unexpected shutdown of individual nodes, rejuvenation actions should be carried out according to certain policies when a node is in its failure-prone state. However, when a node is rejuvenated, it is brought down and the total cluster system capacity is reduced. This unavailability is especially undesirable during peak hours, when the workload is heavy. In this paper, two rejuvenation policies for the n-node cluster system under varying workload are studied:
• Policy-A: Fixed Rejuvenation. Similar to the time-based rejuvenation in [3], Policy-A is a simple and straightforward rejuvenation strategy:

– In both peak hours and off-peak hours, all nodes in the failure-prone state are taken offline immediately for rejuvenation (as a batch) if any node has been running in the failure-prone state for time t0, called the rejuvenation trigger interval or time to trigger rejuvenation in this paper. This interval is fixed regardless of the current workload.

• Policy-B: Delayed Rejuvenation. Bringing functioning nodes down for maintenance during peak hours obviously compromises system capacity and throughput. Therefore, we attempt to achieve a higher system throughput by considering a modified rejuvenation policy:

– In off-peak hours, all nodes in the failure-prone state are taken offline immediately for rejuvenation (as a batch) if any node has been running in the failure-prone state for time t0.

– In peak hours, all nodes in the failure-prone state are merely scheduled for rejuvenation (as a batch) if any node has been running in the failure-prone state for time t0. The scheduled nodes continue to operate as usual, and the actual rejuvenation starts as soon as the next off-peak hour begins.

SRN (Stochastic Reward Net) models are constructed for a cluster system with the rejuvenation policies described above. Policy-B is depicted in Fig. 1, in which circles represent “places”, open rectangles represent exponential transitions, filled rectangles represent general transitions, and thin bars represent immediate transitions. We consider two possible workload states, which correspond to two user group states: peak hours and off-peak hours, represented in Fig. 1 by places Ppeak and Poff-peak, respectively, connected by transitions Tpeak and Toff-peak. For the cluster system, all nodes are initially in the robust state, indicated by n tokens in place Pup. They age over time and enter the failure-prone state Pfailure-prone through transition Tfailure-prone. Without rejuvenation, these nodes will eventually fail (move to place Pfailure via transition Tfailure). The reactive recovery is denoted by transition Trepair, which brings the failed nodes back to the robust state.

If place Pfailure-prone is empty and receives a token, the deterministic transition Tdet is enabled (enforced by a guard function). After time t0 elapses, transition Tdet fires and the token in Pclock moves to place Ptemp. If it is an off-peak hour (indicated by a token in place Poff-peak), the token in Ptemp moves to place Pstartrejuv instantaneously through the immediate transition Timm; otherwise, transition Timm does not fire because of the inhibitor arc from Ppeak to Timm. If a token is in place Pstartrejuv, immediate transition Tto_rejuv is enabled and all the tokens in Pfailure-prone are moved to place Pto_rejuv, at which point rejuvenation starts (transition Trejuv). The rejuvenated nodes are returned to the robust state.

By simply removing the inhibitor arc from Ppeak to Timm in Fig. 1, we obtain the SRN model for Policy-A. In this case, rejuvenation starts as soon as transition Tdet fires, regardless of whether it is a peak hour or not. Note that the firing rates of Tfailure-prone and Tfailure still depend on the number of tokens in places Ppeak and Poff-peak (not shown in Fig. 1).

3 Throughput Analysis and Numerical Results

3.1 Throughput Definition

To solve these SRN models, their reachability graphs and the corresponding underlying Markov models are needed. The state space of the Markov model is denoted by Ω = Ωpeak ∪ Ωoff-peak, where Ωpeak and Ωoff-peak are the sets of peak-hour states and off-peak-hour states, respectively. For many problems, the reachability graph may contain thousands or millions of states, and solving the model by hand is not feasible. We resort to a powerful tool called SPNP (Stochastic Petri Net Package) [4, 5], which can automatically generate the reachability graph and solve the model numerically for both transient and steady-state analysis.

Let R1 and R2 be the average numbers of incoming requests per second in peak hours and off-peak hours, respectively. If each functioning node is capable of processing x requests per second, the average request throughput of the cluster system can be calculated by:

$$E[T] = \sum_{i \in \Omega} p(i) \times \min\bigl(R(i),\, M(i)\,x\bigr), \tag{1}$$

where

$$p(i) = \text{steady-state probability of state } i, \tag{2}$$

$$R(i) = \begin{cases} R_1, & i \in \Omega_{peak}; \\ R_2, & i \in \Omega_{off\text{-}peak}, \end{cases} \tag{3}$$

$$M(i) = \text{sum of the number of tokens in places “up” and “failure-prone” in state } i. \tag{4}$$

In SPNP, we define a throughput function and calculate its expected value, which is the E[T] in Eq. 1. We denote the mean system throughput of Policy-A and Policy-B by E[TA] and E[TB], respectively.
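As a minimal sketch of Eq. 1, assuming the steady-state probabilities have already been obtained from a solver such as SPNP, the expected throughput is a probability-weighted sum over states. The function name `expected_throughput`, the tuple layout, and the four-state distribution below are hypothetical illustrations and are not output of the SRN models; only R1, R2, and x are taken from Table 1.

```python
from typing import Iterable, Tuple

# One entry per Markov state i:
# (p(i), True if i is a peak-hour state, M(i) = number of functioning nodes)
State = Tuple[float, bool, int]


def expected_throughput(states: Iterable[State], r_peak: float,
                        r_offpeak: float, x: float) -> float:
    """Evaluate E[T] = sum_i p(i) * min(R(i), M(i) * x) as in Eq. 1."""
    total = 0.0
    for prob, is_peak, nodes_up in states:
        arrival_rate = r_peak if is_peak else r_offpeak  # R(i)
        capacity = nodes_up * x                          # M(i) * x
        total += prob * min(arrival_rate, capacity)
    return total


# Hypothetical steady-state distribution (an assumption for illustration only).
toy_states = [
    (0.30, True, 4),   # peak, all four nodes functioning
    (0.03, True, 3),   # peak, one node down: capacity 825 < 1000 requests/s
    (0.60, False, 4),  # off-peak, all four nodes functioning
    (0.07, False, 2),  # off-peak, two nodes down
]

throughput = expected_throughput(toy_states, r_peak=1000.0, r_offpeak=10.0, x=275.0)
print(f"E[T] = {throughput:.2f} requests/s")
```

With the Table 1 rates, capacity only limits throughput in peak-hour states with fewer than four functioning nodes, which is why taking nodes offline during peak hours is the costly case.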
3.2 Numerical results

The default parameters listed in Table 1 are employed in this section for model illustration purposes only. SPNP is used to solve the models defined in the previous section. We utilize a k-stage Erlang distribution to approximate the deterministic distribution in transition Tdet.

[Figure 1. SRN model. Places: Pup (initially n tokens), Pfailure-prone, Pfailure, Pto_rejuv, Pclock, Ptemp, Pstartrejuv, Ppeak, Poff-peak. Transitions: Tfailure-prone, Tfailure, Trepair, Trejuv, Tdet, Timm, Tto_rejuv, Treset, Tpeak, Toff-peak. Guard: g = (#Pfailure-prone > 0). The rates of Tfailure-prone and Tfailure depend on Ppeak.]

Table 1. Default Parameters Used

Parameter   Default Value   Comment
n           4               number of servers
1/γ1        4 hours         mean robust time under peak workload
1/γ2        3 days          mean robust time under off-peak workload
1/λ1        20 hours        MTTF under peak workload
1/λ2        3 days          MTTF under off-peak workload
1/µ         4 hours         MTTR
t1          1 hour          rejuvenation time
T1          8 hours         mean duration of peak hours
T2          16 hours        mean duration of off-peak hours
t0          3 hours         time to trigger rejuvenation
k           20              number of Erlang distribution stages
R1          1000 s^-1       request incoming rate in peak hours
R2          10 s^-1         request incoming rate in off-peak hours
x           275 s^-1        request processing rate per server
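The quality of the k-stage Erlang approximation of the deterministic trigger time can be illustrated with a short sketch. It assumes the standard construction in which each of the k exponential stages has rate k/t0, so that the mean is t0 and the coefficient of variation is 1/sqrt(k); the paper does not state the stage rate explicitly, and the function names below are hypothetical.

```python
import math
import random


def erlang_stage_rate(t0: float, k: int) -> float:
    """Rate of each exponential stage so that the k-stage total has mean t0."""
    return k / t0


def erlang_mean_and_cv(t0: float, k: int) -> tuple:
    """Mean and coefficient of variation of the k-stage Erlang approximation."""
    rate = erlang_stage_rate(t0, k)
    mean = k / rate             # equals t0
    variance = k / rate ** 2    # equals t0**2 / k
    return mean, math.sqrt(variance) / mean


def sample_erlang(t0: float, k: int, rng: random.Random) -> float:
    """Draw one trigger time as the sum of k exponential stage durations."""
    rate = erlang_stage_rate(t0, k)
    return sum(rng.expovariate(rate) for _ in range(k))


if __name__ == "__main__":
    t0_hours, stages = 3.0, 20  # t0 and k from Table 1
    mean, cv = erlang_mean_and_cv(t0_hours, stages)
    print(f"mean = {mean:.3f} h, coefficient of variation = {cv:.3f}")
```

A deterministic delay has a coefficient of variation of 0 and an exponential delay has 1, so increasing k moves the approximation toward the deterministic case at the price of a larger state space.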
3.2.1 Dependence on Time to Trigger Rejuvenation

Fig. 2 (a) shows how the system throughput E[T] changes with the time to trigger rejuvenation t0, and Fig. 2 (b) shows the corresponding throughput improvement achieved by rejuvenation Policy-B over Policy-A. If t0 is small, i.e., a node is rejuvenated shortly after it becomes failure-prone, Policy-A (fixed rejuvenation) yields a very low mean throughput while Policy-B (delayed rejuvenation) produces a high throughput. As t0 increases up to about 10 hours, E[TA] decreases and E[TB] increases; both curves then go down when t0 is greater than 10 hours. The mean throughputs for both rejuvenation policies seem to converge to the same point when t0 is very large. As shown in Fig. 2 (b), Policy-B outperforms Policy-A by up to 4.5% in terms of mean system throughput for a reasonable range of t0 values.

3.2.2 Dependence on Mean Robust State Duration in Peak Hours

Depending on the hardware and software infrastructure, an individual node may have a different mean sojourn time in the robust state under heavy load than under light load. Fig. 3 shows how the mean throughput E[T] is influenced by the mean robust state duration in peak hours, 1/γ1. As we can see, when 1/γ1 is small, the discrepancy between E[TA] and E[TB] can be as large as 11%, and this difference shrinks as 1/γ1 grows. This is not surprising, because when 1/γ1 is very large, all individual nodes are very reliable and the difference between the rejuvenation policies matters less. For a reasonable 1/γ1 (several hours to a day), Policy-B is noticeably better than Policy-A.

3.2.3 Dependence on Mean Peak Hours per Day

We depict the impact of peak hours duration on system throughput in Fig. 4. Compared to the cases discussed above, the difference between E[TA] and E[TB] for different peak hours lengths is small, up to 2.8%. Note that when T1 is sufficiently large, Policy-A outperforms Policy-B in terms of mean system throughput. Since it is rare to see peak hours longer than 16 hours per day, we conclude that Policy-B is better than Policy-A in most scenarios.

The SPN model has as many as 1030 states and 4422 transitions in its reachability graph, and the average elapsed time is only 0.26 seconds running SPNP on a Pentium III 700MHz laptop.

4 Availability Analysis and Numerical Results

4.1 Availability Definition

Whether a cluster system is available or not is determined by the business requirements and may vary between scenarios. In this paper, we say the cluster is available if at least TH servers are operational, i.e., there are at least TH tokens in places Pup and Pfailure-prone combined. The steady-state availability may be calculated by:

$$A = \sum_{i \in \Omega} p(i) \times A(i), \tag{5}$$

where

$$p(i) = \text{steady-state probability of state } i, \tag{6}$$

$$M(i) = \text{sum of the number of tokens in places “up” and “failure-prone” in state } i, \tag{7}$$

$$A(i) = \begin{cases} 1, & M(i) \ge TH; \\ 0, & M(i) < TH. \end{cases} \tag{8}$$

4.2 Numerical Results

As in the previous section, the default values are listed in Table 1. We vary the parameter t0 (time to trigger rejuvenation) for TH = 1, 2, and 3 in this subsection. As we can see in Fig. 5, neither Policy-A nor Policy-B dominates for the availability measure. It is not surprising that A is higher when TH is smaller, and, as shown in Fig. 5, higher availability is achieved for both rejuvenation policies when the time to trigger rejuvenation is short. Note that the traditional definition of availability precludes workload variation from being taken into consideration, which is what Policy-B is designed for.

5 Conclusions

This paper has proposed and discussed two rejuvenation policies for a cluster system under varying workload. In Policy-A (fixed rejuvenation), all failure-prone nodes are taken offline for rejuvenation as soon as any one of them has been in that state for a pre-determined duration, regardless of whether the system is in peak hours (high workload) or off-peak hours (low workload). In Policy-B (delayed rejuvenation), by contrast, all failure-prone nodes are merely scheduled for rejuvenation if any one of them reaches the time limit in that state during peak hours, and the actual rejuvenation is started as soon as the next off-peak hour starts. By postponing the rejuvenation to off-peak hours in Policy-B, we expect to see a higher overall system throughput in certain circumstances.
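The difference between the two policies can be sketched as a small decision function. This is an illustrative sketch only and not the SRN model itself: the names `Policy`, `Action`, and `rejuvenation_action`, as well as the boolean inputs, are hypothetical, and the timer is assumed to be tracked elsewhere.

```python
from enum import Enum


class Policy(Enum):
    FIXED = "Policy-A"
    DELAYED = "Policy-B"


class Action(Enum):
    WAIT = "keep serving"
    SCHEDULE = "mark batch for rejuvenation at the next off-peak hour"
    REJUVENATE = "take all failure-prone nodes offline now"


def rejuvenation_action(policy: Policy, is_peak: bool,
                        timer_expired: bool, scheduled: bool) -> Action:
    """Decide what to do with the batch of failure-prone nodes.

    timer_expired: some node has been failure-prone for at least t0.
    scheduled: a batch was already scheduled during peak hours (Policy-B only).
    """
    if scheduled:
        # A pending batch starts as soon as an off-peak hour begins.
        return Action.WAIT if is_peak else Action.REJUVENATE
    if not timer_expired:
        return Action.WAIT
    if policy is Policy.DELAYED and is_peak:
        return Action.SCHEDULE      # postpone the outage to off-peak hours
    return Action.REJUVENATE        # Policy-A always, Policy-B in off-peak hours


if __name__ == "__main__":
    for policy in Policy:
        action = rejuvenation_action(policy, is_peak=True,
                                     timer_expired=True, scheduled=False)
        print(f"{policy.value}, peak hour, timer expired: {action.value}")
```

The only case in which the two policies differ is an expired timer during peak hours, which matches the single inhibitor arc that distinguishes the two SRN models.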
[Figure 2. Throughput versus Time to Trigger Rejuvenation. Panel (a): throughput of Policy-A and Policy-B versus time to trigger rejuvenation (hour, logarithmic axis). Panel (b): throughput improvement (%) versus time to trigger rejuvenation (hour, logarithmic axis).]
[Figure 3. Throughput versus Mean Robust State Duration in Peak Hours. Panel (a): throughput of Policy-A and Policy-B versus mean robust state duration in peak hours (hour, logarithmic axis). Panel (b): throughput improvement (%) versus mean robust state duration in peak hours (hour, logarithmic axis).]
[Figure 4. Throughput versus Mean Peak Hours per Day. Panel (a): throughput of Policy-A and Policy-B versus peak hours per day (hour). Panel (b): throughput improvement (%) versus peak hours per day (hour).]
[Figure 5. Availability versus Time to Trigger Rejuvenation. Left panels: availability of Policy-A and Policy-B versus time to trigger rejuvenation (hour, logarithmic axis) for Threshold = 1, 2, and 3. Right panels: availability improvement (%) versus time to trigger rejuvenation (hour, logarithmic axis) for Threshold = 1, 2, and 3.]

We have constructed tractable analytic models for the cluster system under the rejuvenation policies considered. Due to the complexity of the models, we turn to a powerful tool, SPNP, to carry out the numerical analysis. Based on the parameters chosen in this paper, it has been shown that Policy-B is likely to outperform Policy-A in terms of expected system throughput by up to 11%, although under certain conditions Policy-A has the same or even slightly better performance than Policy-B. We also defined and presented numerical results for the steady-state availability of the cluster system.

References

[1] A. Avritzer and E. Weyuker. Monitoring smoothly degrading systems for increased dependability. Empirical Software Engineering, (2):55–77, 1997.

[2] BEA White Paper. Achieving scalability and high availability for e-business, clustering in BEA WebLogic Server. http://www.bea.com/content/news events/ white papers/BEA WL Server Clustering wp.pdf, 2003.

[3] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. Proactive management of software aging. IBM Journal of Research & Development, 45(2):311–332, March 2001.

[4] G. Ciardo, J. Muppala, and K. Trivedi. SPNP: Stochastic Petri net package. International Conference on Petri Nets and Performance Models, Kyoto, Japan, December 1989.

[5] G. Ciardo and K. S. Trivedi. Manual of Stochastic Petri Net Package. 1996.

[6] A. I. Elwalid and D. Mitra. Statistical multiplexing with loss priorities in rate-based congestion control of high-speed networks. IEEE Transactions on Communications, 42(11):2989–3002, 1994.

[7] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, pages 78–91, 1997.

[8] S. Garg, A. Puliafito, M. Telek, and K. Trivedi. Analysis of preventive maintenance in transactions based software systems. IEEE Trans. Computers, 47(1):96–107, Jan. 1998.

[9] J. Gray and D. P. Siewiorek. High-availability computer systems. IEEE Computer, (24):39–48, 1991.

[10] Y. Huang, C. Kintala, N. Kolettis, and N. Fulton. Software rejuvenation: analysis, module, and applications. In Proc. of 25th Int. Symposium on Fault-tolerance Computing, 1995.

[11] IBM White Paper. Server clusters for high availability in WebSphere Application Server Network Deployment Edition 5.0, Dec. 2003.

[12] Y. Liu, K. Trivedi, Y. Ma, J. Han, and H. Levendel. Modeling and analysis of software rejuvenation in cable modem termination systems. In The 13th International Symposium on Software Reliability Engineering (ISSRE), Annapolis, MD, November 2002.

[13] E. Marshall. Fatal error: how Patriot overlooked a Scud. Science, 3, 1992.

[14] M. Schwartz. Broadband integrated networks. Prentice-Hall, Englewood Cliffs, NJ, 1996.

[15] M. Sullivan and R. Chillarege. Software defects and their impact on system availability - a study of field failures in operating systems. In Proceedings of the 21st IEEE International Symposium on Fault-Tolerant Computing, pages 2–9, 1991.

[16] K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi. Analysis and implementation of software rejuvenation in cluster systems. In ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, June 2001.

[17] K. Vaidyanathan and K. S. Trivedi. A measurement-based model for estimation of resource exhaustion in operational software systems. In Proc. of the Tenth International Symposium on Software Reliability Engineering (ISSRE), Boca Raton, Florida, November 1999.

[18] W. Xie, Y. Hong, and K. Trivedi. Analysis of a two-level software rejuvenation policy. International Journal on Reliability Engineering and System Safety, 2003. submitted.

[19] W. Xie, Y. Hong, and K. Trivedi. Webserver rejuvenation and web user behavior. In Dependable Systems and Networks (DSN), 2004. submitted.