Adaptive Software Rejuvenation: Degradation Model and Rejuvenation Scheme

Yujuan Bao, Xiaobai Sun
Department of Computer Science, Duke University, Durham, NC 27708
{byj, xiaobai}@cs.duke.edu

Kishor S. Trivedi
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
[email protected]

Abstract

We present a framework of adaptive estimation and rejuvenation of software system performance in the presence of aging sources. The framework specifies that a degradation model not only describe an aging process but also enable the adaptation of model-based performance estimates to on-line measurements of data pertaining to the aging process. The adaptive estimation uses model-based a priori estimation and obtains a posteriori estimation based on the data measurements. With the adaptive estimation, the rejuvenation policy determines the time epochs for data collection and rejuvenation according to system dynamics. In the specific context of resource leaks previously assumed to lead to aging, we present a non-homogeneous Markov model to explicitly establish a connection between resource leaks and the failure rate. We demonstrate an increasing failure rate in the presence of resource leaks.

1. Introduction: the Adaptation Problem

We present a framework for adaptive estimation and rejuvenation of software system performance in the presence of aging sources. We are concerned with system resource loss, in particular, memory leakage. Memory is indispensable in computer and communication systems. Memory leakage is a typical aging source for server systems due to software bugs in the client applications that use server resources.

Our framework for adaptive estimation and rejuvenation consists of three integral components: a degradation model, an adaptive estimation scheme, and an adaptive rejuvenation scheduling policy. The degradation model allows adaptation of model-based performance estimates to on-line measurements of data pertaining to the aging process. The adaptive estimation scheme uses the model-based a priori estimation and obtains a posteriori estimation based on the measurements. With the adaptive estimation, the rejuvenation scheduling policy determines time epochs for data collection and rejuvenation.

The concept and framework of adaptive software rejuvenation (ASR) is a natural evolution of software rejuvenation efforts. In the short time since the concept and analysis of software rejuvenation were introduced by Huang et al. in 1995 [9], degradation analysis and rejuvenation techniques have advanced quickly and are widely used in various applications such as spacecraft systems [12], transaction processing systems [3], and telecommunications systems [1]. Software rejuvenation has been posed as an optimization problem, with an objective to minimize both the risk of preventable failures and the cost of rejuvenation; however, the optimization process has changed with improvements in modeling, degradation estimation and rejuvenation scheduling.

The model for a degradation-rejuvenation process presented by Huang et al. [9] is a continuous time Markov chain (CTMC). It has four states: the healthy state, the degradation state, the failure-and-recovery state, and the rejuvenation state. The model assumes the knowledge of a failure profile, which includes the distribution of time to the next failure, the cost of reactive recovery after a failure, the distribution of time from the healthy state to the degradation state, and the distribution of time to carry out rejuvenation. Garg et al. [6] modeled the process as a Markov regenerative stochastic Petri net (MRSPN) to deal with deterministic intervals between successive rejuvenations. In their subsequent work, the aging effect was classified into crash/hang failures and gradual performance degradation [7], while the instantaneous system workload was taken into account. Using semi-Markov process models, Dohi et al. [4] presented a non-parametric statistical method for determining the optimal rejuvenation schedule. Bobbio et al. [2] introduced a discrete degradation index to refine the description of the degradation state. In addition, they provided degradation-index-based policies to fine-tune the rejuvenation optimization process.

Most of these models have a common assumption that the knowledge of a failure profile is available in a simple or complex form. In other words, these models describe only the relationship between the failure profile and certain performance optimization objectives, separating this relationship from the profiling process. For a long-term and steady-running system, a failure profile may be obtained over time by a monitoring and learning process. In reality, computer and communication systems undergo updates on both the server and client sides while maintaining system operations.

In the measurement-based approaches, one attempts to capture the dynamic changes in a system by monitoring certain data that are related to system performance degradation. For example, it is reported [8] that an SNMP-based resource monitoring tool is used to monitor data on resource usage at regular intervals. Statistical trend detection techniques [11] are then applied to analyze the collected data and provide the estimated time to resource exhaustion. The workload is also taken into account in the improved trend analysis [13].

There are remaining problems. Primarily, the on-line approach of coupling the estimation of performance degradation with the estimation of prescribed model parameters makes rejuvenation decisions difficult. Resource usage depends on time and workload condition as well as on the aging effect. Rejuvenation operations help only when the main cause of resource exhaustion is resource leakage; they would have an adverse effect otherwise. Another problem is that the on-line models tend to be simple and overly sensitive to local temporary changes.

In summary, the model-based approaches put more weight on long-term static behavior of a system and separate model-based estimation from model construction; the measurement-based approaches put more weight on local dynamics and couple the model construction with the model-based estimation.

Our ASR framework is developed from these early approaches but differs from them in integrating model construction, model-based estimates and on-line inspection. We separate model construction and model-based estimation but adapt model-based estimation to the measured data that are pertinent to resource leakage, in particular, memory leakage. We introduce a methodology for developing degradation models that enable such adaptation. It is worth mentioning that, based on our model, the memory loss is related directly to the failure rate. We further demonstrate that the failure rate increases over time as the leaked resource accumulates, i.e., memory leakage is indeed an aging source. We present a backward analysis method that underlies our adaptive estimation scheme. In addition, we introduce a scheme to design rejuvenation policies based on adapted estimates.

Proceedings of the 2003 International Conference on Dependable Systems and Networks (DSN’03)

0-7695-1959-8/03 $17.00 (c) 2003 IEEE

2. Degradation Model and Analysis

We develop a model for performance degradation due to the gradual loss of system resources, especially the memory resource. In a client-server system, for example, every client process issues memory requests at varying points in time. An amount of memory is granted to each new request (when there is enough memory available), held by the requesting process for a period of time, and presumably released back to the system resource reservoir when it is no longer in use. A memory leak occurs when the amount of allocated memory is not fully released. The available memory space is gradually reduced as such resource leaks accumulate over time. As a consequence, a resource request that would have been granted in the leak-less situation may not be granted when the system suffers from memory resource leaks.

Our model accommodates performance analysis for both the leak-less case and the leak-present case. More importantly, the model can be used for the purpose of adapting performance estimates to on-line data measurements. Specifically, the model relates system performance to resource requests, releases or resource holding intervals, and memory leaks. These quantities can be monitored and modeled directly from obtainable data measurements [5].

We model an operating software system as a CTMC. Consider first the ideal, leak-less case, Figure 1.

[Figure 1. The leak-free model. A CTMC with workload states 0, 1, 2, 3, ..., n and a sink state. From state k, an arrival leads to state k+1 with rate $\lambda(1-\xi[k])$ or to the sink with rate $\lambda\xi[k]$; state k returns to state k-1 with release rate $\mu_k$.]

Denote by $M$ the initial total amount of available memory. The memory unit is application-specific. The system is in workload state $k$, $k \ge 0$, when there are $k$ independent processes holding a portion of the resource. The total number of states is practically finite. We assume that the memory requests are independent of each other and arrive from a Poisson process with rate $\lambda$. A request is granted when sufficient memory is available; otherwise the system is considered to have failed. In other words, each incoming request may cause the system to transit to the sink state when it asks for more memory than the available amount. Denote by $\xi[k]$ the conditional probability that the system fails in state $k$ upon the arrival of a new request. The amount of each memory request is modeled as a continuous random variable with the density function $g(x)$. The allocated resource is held for a random period of time, which is dependent upon the processing or service rate and determines the resource release rate. When the holding time per request is exponentially distributed with rate $\mu$, the release rate $\mu_k$ at state $k$ is equal to $k\mu$. Here, the time unit is also application-specific. Provided with the specification of request arrival rate $\lambda$,
service rates $\mu_k$ and the conditional probabilities $\xi[k]$ of a CTMC in Figure 1, the failure rate of the system can be obtained via the transient solution to the Kolmogorov equations for the CTMC,

$$
\begin{aligned}
\frac{d\pi_{\mathrm{sink}}(t)}{dt} &= \lambda \sum_k \xi[k]\,\pi_k(t), \\
\frac{d\pi_0(t)}{dt} &= -\lambda\,\pi_0(t) + \mu_1\,\pi_1(t), \\
\frac{d\pi_k(t)}{dt} &= \lambda\,(1-\xi[k-1])\,\pi_{k-1}(t) - (\lambda+\mu_k)\,\pi_k(t) + \mu_{k+1}\,\pi_{k+1}(t), \quad k > 0,
\end{aligned}
\tag{1}
$$

where the transient solution is normalized so that $\pi_{\mathrm{sink}}(t) + \sum_k \pi_k(t) = 1$. The initial condition is $\pi_0(0) = 1$. The failure rate is related to the transient solution as follows,

$$
h(t) = \frac{d\pi_{\mathrm{sink}}(t)/dt}{1-\pi_{\mathrm{sink}}(t)} = \frac{\lambda \sum_k \xi[k]\,\pi_k(t)}{\sum_k \pi_k(t)}.
\tag{2}
$$

We detail the conditional probabilities $\xi[k]$. It is easy to see that, for each $k$, $\xi[k]$ is equal to the conditional probability that the total requested amount over $k+1$ requests exceeds $M$, given that the total amount of $k$ requests does not. We have $\xi[0] = 1 - \int_0^M g(x)\,dx$. For any $k > 0$, let $g^{[k]}(x)$ be the density function for the total amount of $k$ independent resource requests. If the requests share the identical density function $g$, then $g^{[k]}$ is equal to the $k$-fold convolution of $g$,

$$
g^{[1]}(x) = g(x), \qquad g^{[k+1]}(x) = \int_0^x g^{[k]}(u)\,g(x-u)\,du, \quad k \ge 1.
\tag{3}
$$

Let $G^{[k]}(x)$ be the corresponding cumulative distribution function, $G^{[k]}(x) = \int_0^x g^{[k]}(u)\,du$. Then, it can be verified that at any $x > 0$, $G^{[k]}(x)$ decreases monotonically as $k$ increases,

$$
G^{[k+1]}(x) \le G^{[k]}(x).
\tag{4}
$$

Now we can give the formal expression of $\xi[k]$ for all $k$,

$$
\xi[0] = 1 - G^{[1]}(M), \qquad \xi[k] = 1 - \frac{G^{[k+1]}(M)}{G^{[k]}(M)}, \quad k > 0.
\tag{5}
$$

For every $k$, $\xi[k] \to 0$ as $M \to \infty$. This is consistent with the common expectation that the conditional probability from state $k$ to the sink state vanishes when the resource is unlimited.

In special cases, the conditional probabilities $\xi[k]$ can be obtained analytically. For example, when $g(x)$ is the density function of the Erlang distribution

$$
g(x) = \frac{(\gamma x)^{\alpha}\, x^{-1} e^{-\gamma x}}{(\alpha-1)!},
\tag{6}
$$

where $\alpha$ is a natural number and $\gamma$ is a nonnegative real number, we have

$$
\begin{aligned}
g^{[k]}(x) &= \frac{(\gamma x)^{k\alpha}\, x^{-1} e^{-\gamma x}}{(k\alpha-1)!}, \\
G^{[k]}(x) &= e^{-\gamma x} \sum_{i=\alpha k}^{\infty} \frac{(\gamma x)^i}{i!}, \\
\xi[k] &= \sum_{i=\alpha k}^{k\alpha+\alpha-1} \frac{(\gamma M)^i}{i!} \Bigg/ \sum_{i=k\alpha}^{\infty} \frac{(\gamma M)^i}{i!}.
\end{aligned}
\tag{7}
$$

We see from the expression that for every fixed state $k$, $\partial\xi[k]/\partial M < 0$ when $\gamma M > 1$, i.e., the conditional probability from state $k$ to the sink state increases as $M$ decreases. When $\xi[k]$ are not easily described in a closed form expression, we resort to fast numerical algorithms.

Provided with the specification of the leak-free model, one can derive the system failure rate $h(t)$ as in (2) and the system reliability $R(t) = 1 - \pi_{\mathrm{sink}}(t) = \sum_k \pi_k(t)$. Conversely, given a specified requirement on system reliability, the model can be used to derive a lower bound on the total amount $M$ of system resource to meet the performance requirement.

We model a leak-present system with a non-homogeneous CTMC, see Figure 2.

[Figure 2. The leak-present model. The same chain as in Figure 1 (workload states 0, 1, 2, 3, ..., n and a sink state), with leak-dependent transition rates $\lambda(1-\xi[k,\ell(t)])$ from state k to state k+1 and $\lambda\xi[k,\ell(t)]$ from state k to the sink, and release rates $\mu_1, \ldots, \mu_{n+1}$.]

The conditional probability that the system transits to the sink state from state $k$ upon a new request becomes leak dependent and hence time dependent, specifically,

$$
\xi[k,\ell] = 1 - \frac{G^{[k+1]}(M-\ell)}{G^{[k]}(M-\ell)},
\tag{8}
$$

where $\ell = \ell(t)$ is the amount of leaked memory at time $t$. Initially, $\ell(t) = 0$. In the leak-free case $\ell(t) = 0$, $\xi[k,\ell]$ equals $\xi[k]$ in (5) for each $k$.

This degradation model differs from many existing models. We separate the aging factor from the other factors and separate the aging process from the rejuvenation process, for the subsequent degradation analysis and rejuvenation design. We do not create an additional degradation function or index to quantitatively define the degradation degree. Instead, we establish a direct connection between the resource leaks and the failure rate, implying that the degradation is described by a continuous variable.
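The Erlang closed form (7) and its leak-dependent version (8) can be evaluated numerically. The following is a minimal Python sketch, not the authors' implementation: it assumes Erlang-distributed request sizes, evaluates the series terms in log space to avoid overflow, and truncates the infinite sum at a cutoff chosen here purely for illustration. The function names `poisson_terms` and `xi` are hypothetical.

```python
import math

def poisson_terms(mu, n_max):
    """Return [exp(-mu) * mu**i / i! for i = 0..n_max], evaluated in log space."""
    if mu <= 0.0:
        return [1.0] + [0.0] * n_max
    log_mu = math.log(mu)
    return [math.exp(-mu + i * log_mu - math.lgamma(i + 1))
            for i in range(n_max + 1)]

def xi(k, M, alpha, gamma, leak=0.0):
    """Conditional failure probability xi[k, leak] of (8) for Erlang(alpha, gamma)
    request sizes, using the series form (7) with M replaced by M - leak."""
    avail = M - leak
    if avail <= 0.0:
        return 1.0  # no memory left: every new request fails
    mu = gamma * avail
    lo, mid = k * alpha, (k + 1) * alpha
    # Assumed truncation point: well beyond both the mean and the upper index.
    n_max = int(max(mid, mu) + 40.0 * math.sqrt(max(mid, mu)) + 100)
    terms = poisson_terms(mu, n_max)
    band = sum(terms[lo:mid])        # numerator of (7), scaled by exp(-mu)
    tail = band + sum(terms[mid:])   # denominator of (7), i.e. G^[k](M - leak)
    return band / tail if tail > 0.0 else 1.0

# Parameter values quoted in Section 4: M = 100, alpha = 30, gamma = 7.
for leak in (0.0, 12.0, 25.0):
    print(leak, xi(15, 100.0, 30, 7.0, leak))
```

Computing the numerator as a band of terms, rather than as a difference of two tails, avoids cancellation when the two tails are nearly equal.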

We elaborate on the leak-present model. First, the rejuvenation may take place while the system is in any workload state $k$. When the system performance degrades to a certain extent, a rejuvenation operation is carried out to collect and reclaim the leaked system resource. With respect to the resource collection, the rejuvenation operation does not necessarily terminate the system. The system may or may not be interrupted by resource collection. It is desirable and reasonable to assume that at the finish of every resource collection, the system resumes where it was and the amount of leaked memory is reset to zero. This requires that the model allow the system to be in an arbitrary workload state $k \ge 0$ at the finish of a rejuvenation operation.

Second, we have introduced the leak function $\ell(t)$. Between two consecutive rejuvenation operations, the function is a nondecreasing function of time. The leaking process has complex characteristics. It is a parasite process because the leaked memory is generated and left behind by other processes. Moreover, the leaked memory aggregates over, or remembers, all the leaks by the processes that have existed since the last rejuvenation operation. These features substantially complicate the degradation model and limit the feasibility or computational efficiency of the subsequent analysis. Fortunately, the amount of leaked memory can be monitored in many application systems [5]. For example, one can periodically monitor the amount of leaked memory and model the leaking process. The leak function obtained from the data collected in the recent past may be used to predict the leaking behavior in the near future.

Consider the analysis based on the leak-present model. Generally speaking, the leak function is a function of time. We consider two basic cases. In one case, $\ell(t)$ is a nondecreasing piecewise constant function, $\ell(t) = \ell_j$ over consecutive intervals $\Delta_j = [t_j, t_{j+1}]$, $\ell_j < \ell_{j+1} < M$.
On each interval $\Delta_j$, we solve a homogeneous CTMC, as shown in Figure 1, with the total available memory equal to $M - \ell_j$ at $t_j$. The initial state probability vector for the CTMC on $\Delta_j$ is the state probability vector at $t_j$ obtained from the CTMC on the previous interval $\Delta_{j-1}$. Thus, we solve a sequence of homogeneous CTMCs with coefficients defined as follows,

$$
\xi[k,\ell(t)] = \xi[k,\ell_j] = 1 - \frac{G^{[k+1]}(M-\ell_j)}{G^{[k]}(M-\ell_j)}, \quad t \in \Delta_j.
$$

The computation of $\xi[k,\ell_j]$ is not necessarily more expensive than the computation of $\xi[k]$. Exploiting the fact that

$$
G^{[k]}(M-\ell_0) = \int_0^{M-\ell(t_c+T)} g^{[k]}(x)\,dx + \sum_j \int_{\Delta_j} g^{[k]}(x)\,dx,
\tag{9}
$$

one can get $G^{[k]}(M-\ell_j)$ by dividing the integration for $G^{[k]}(M-\ell_0)$ on the subintervals and taking the cumulative sum of the divided integrations.

In the second basic case $\ell(t)$ is a continuous function. We may use a piecewise constant function to approximate $\ell(t)$, which implies a piecewise approximation of coefficients $\xi[k,\ell(t)]$,

$$
\xi[k,\ell(t)] \approx \xi[k,\ell_j], \quad t \in \Delta_j.
$$

A simple method is to use equispaced time steps. The leak function $\ell(t)$, however, may change rapidly or slowly during different time periods. We may let the interval length vary in order to represent $\xi[k,\ell(t)]$ more accurately, compared to the approximation with equispaced time steps. By piecewise approximation and analysis, the system degradation over time due to resource leakage can be derived numerically with tools such as SHARPE [10]. By (9) the computation of $\xi[k,\ell_j]$ and the numerical transient solution of the non-homogeneous CTMC are not necessarily more expensive than that of the homogeneous CTMC.
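The piecewise-constant treatment of the leak function can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the SHARPE or MATHEMATICA computation used by the authors: request sizes are Erlang as in (6), the state space is truncated at a hypothetical maximum workload `K` (all arrivals in state `K` are counted as failures), and the Kolmogorov equations (1) are integrated with a classical fourth-order Runge-Kutta step. All function and parameter names are hypothetical, and the sketch is not claimed to reproduce the paper's figures.

```python
import math

def xi_table(K, avail, alpha, gamma):
    """xi[k] = 1 - G^[k+1](avail) / G^[k](avail) for k = 0..K, Erlang requests,
    with G^[k] evaluated as the Poisson tail of (7). xi[K] is forced to 1."""
    if avail <= 0.0:
        return [1.0] * (K + 1)
    mu_p = gamma * avail
    top = (K + 1) * alpha
    n_max = int(max(top, mu_p) + 40.0 * math.sqrt(max(top, mu_p)) + 100)
    log_mu = math.log(mu_p)
    tail = [0.0] * (n_max + 2)           # tail[i] = sum of series terms from i on
    for i in range(n_max, -1, -1):
        tail[i] = tail[i + 1] + math.exp(-mu_p + i * log_mu - math.lgamma(i + 1))
    xi = []
    for k in range(K + 1):
        g_k, g_k1 = tail[k * alpha], tail[(k + 1) * alpha]
        xi.append(1.0 - g_k1 / g_k if g_k > 0.0 else 1.0)
    xi[K] = 1.0                          # truncation assumption for the top state
    return xi

def deriv(pi, xi, lam, mu):
    """Right-hand side of the Kolmogorov equations (1) for states 0..K."""
    K = len(pi) - 1
    d = []
    for k in range(K + 1):
        flow = -(lam + k * mu) * pi[k]
        if k > 0:
            flow += lam * (1.0 - xi[k - 1]) * pi[k - 1]
        if k < K:
            flow += (k + 1) * mu * pi[k + 1]
        d.append(flow)
    return d

def failure_rate_curve(M, lam, mu, alpha, gamma, leak, t_end,
                       dt=0.02, steps_per_interval=50, K=40):
    """Integrate (1) with RK4, holding the leak constant on each interval of
    length steps_per_interval * dt. Returns samples (t_j, h(t_j), R(t_j))."""
    pi = [1.0] + [0.0] * K               # initial condition pi_0(0) = 1
    out, xi = [], None
    n_steps = int(round(t_end / dt))
    for n in range(n_steps + 1):
        t = n * dt
        if n % steps_per_interval == 0:  # start of an interval: refresh xi[k, l_j]
            xi = xi_table(K, M - leak(t), alpha, gamma)
            alive = sum(pi)              # reliability R(t) = sum_k pi_k(t)
            h = lam * sum(x * p for x, p in zip(xi, pi)) / alive   # equation (2)
            out.append((t, h, alive))
        if n == n_steps:
            break
        k1 = deriv(pi, xi, lam, mu)
        k2 = deriv([p + 0.5 * dt * a for p, a in zip(pi, k1)], xi, lam, mu)
        k3 = deriv([p + 0.5 * dt * a for p, a in zip(pi, k2)], xi, lam, mu)
        k4 = deriv([p + dt * a for p, a in zip(pi, k3)], xi, lam, mu)
        pi = [p + dt / 6.0 * (a + 2 * b + 2 * c + d)
              for p, a, b, c, d in zip(pi, k1, k2, k3, k4)]
    return out

# Leak-free versus a linear leak l(t) = beta * lam * t, parameter values as in Section 4.
free = failure_rate_curve(100.0, 0.8, 0.2, 30, 7.0, lambda t: 0.0, t_end=30.0)
leaky = failure_rate_curve(100.0, 0.8, 0.2, 30, 7.0,
                           lambda t: (25.0 / 24.0) * 0.8 * t, t_end=30.0)
```

The state probability vector is carried across interval boundaries unchanged, which mirrors the use of the final vector on one interval as the initial vector on the next.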

3. Adaptive Estimation and Rejuvenation

In this section we present an analysis method and a scheme for adapting model-based estimates to on-line data measurements. We then introduce a scheme to schedule inspection and rejuvenation with adapted estimates of system performance.

3.1. The backward analysis and adaptive estimation

Backward analysis underlies the incorporation of the model-based estimate with on-line measurements. The estimates on one side are based on the modeled statistics while the on-line data on the other side exhibit the dynamics of a particular trial of the aging process. In the analysis based on the degradation model in Figure 2, we have related the memory leak function $\ell(t)$ to the system failure via the variable of available resource amount. The variable is bounded from above by the total amount $M$ of the system resource. In this section, we denote by $h(t, M)$ the failure rate of a leak-present system with the initial amount $M$ of available memory. For simplicity, we depict the failure rate in the $t$-$h$ plane, which is a cross-section plane in a multivariate analysis, Figure 3. We denote by $h^*(t, M_0)$ the failure rate for a leak-free reference system with the initial amount $M_0$ of system resource.

The backward analysis is not restricted to the model we introduced in Section 2, but requires that a model satisfy the following four properties.

(A) The model specifies measurable data pertaining to a particular aging agent.
(T) The model supports transient performance analysis.

(I) The model accommodates the case with leaking as well as the case without leaking. In particular, $h(t, M)$ equals $h^*(t, M)$ when the leak function $\ell(t)$ is constant zero. We refer to this property as the inclusive property.
(B) The model provides the inequality $h(t, M) \le h^*(t, M - L)$ over the time period where $\ell(t) \le L$. We refer to this property as the bounding property.

Not every degradation model has these properties. In the discussion on degradation models, we exposed the complexity in analytical derivation of the leak function and exploited the fact that there are practical ways to obtain information on the lost memory. Property (T) provides a temporal platform for the adaptation to take place. The inclusive property (I) requires that a model be able to separate the impact of resource loss on system performance from that of resource usage or other affecting factors. The bounding property (B) specifies the consistency of system performance with respect to available resource. The system is not expected to perform well when the available resource becomes lower than expected from the designed system resource capacity.

In the backward analysis we determine at any given inspection time $t_c$ whether the failure rate is underestimated or overestimated. An underestimate leads to a missed rejuvenation opportunity and increases the risk of system failure and the recovery cost; an overestimate results in excess rejuvenation and increases the rejuvenation cost. We improve the rejuvenation optimization method by introducing estimate adjustments based on the measured datum $L_c$, the actual amount of leaked memory at time $t_c$. At the inspection time $t_c$, we consider the relationship between $h(t, M)$ and $h^*(t, M - L_c)$, Figure 3. The latter represents a leak-free reference system that has the same amount of available memory at time $t_c$ as the leak-present system with $h(t, M)$. By the non-decreasing property of $\ell(t)$, $\ell(t) \le L_c$ for all $t \in (t_0, t_c)$. By the model property (B), $h(t, M)$ is bounded from above by $h^*(t, M - L_c)$ between the initial time $t_0$ and the current inspection time $t_c$. We illustrate the overestimate and underestimate cases in Figure 3.

◦ For the overestimate case, the model-based prediction overestimates the failure rate by at least $\Delta h(t_c) = h(t_c, M) - h^*(t_c, M - L_c)$.
◦ For the underestimate case, the model-based prediction underestimates the failure rate by up to $\Delta h(t_c) = h(t_c, M) - h^*(t_c, M - L_c)$.

Based on the backward analysis, we develop a simple adaptive estimation scheme. We use the bounding leak-free function $h^*(t, M - L_c)$ as the adjustment reference. We make an adjustment in the a priori estimate at $t_c$ by simply replacing $h(t_c, M)$ with $h^*(t_c, M - L_c)$. If the adapted rate warrants a rejuvenation operation, a new inspection-rejuvenation period begins at the end of the rejuvenation.

[Figure 3. The estimate adaptation: down-shift of an overestimate (top) vs. up-shift of an underestimate (bottom). Each panel shows the curves $h(t, M_0)$ and $h^*(t, M_0 - L_c)$ and the shift $\Delta h$ at the inspection time $t_c$.]

Otherwise, we shift (the curve) $h(t, M)$ down or up (along the vertical line) at $t_c$ to the point $h^*(t_c, M - L_c)$, depending on whether $h(t, M)$ is an overestimate or an underestimate, Figure 3. Formally, the prediction for the time period between the current inspection point $t_c$ and the next inspection point $t_n$ is based on the curve $h(t, M)$ adjusted by a shift $-\Delta h(t_c)$. The shift of the estimation curve may be increased or decreased at any inspection point.

We refer to the above analysis as the backward analysis in order not to confuse it with a posteriori analysis. There are several differences. (1) At every inspection point $t_c$, the information in the measured datum $L_c$ is exploited in a backward way. It is used as the offset from $M$ at the initial time $t_0$ in the leak-free reference case. (2) Strictly speaking, the leak-free reference function $h^*(t, M - L_c)$ is not a posterior estimate; it bounds from above the estimated failure rate as long as the total leaked memory is no more than $L_c$ up to the point $t_c$. (3) We do not use the measured data at the inspection points between two rejuvenation operations to replace the statistical function $\ell(t)$, in order to avoid extra sensitivity to local temporary changes. Meanwhile, this does not prevent one from using the monitored data to model the leaking behavior or update the model.
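The adjustment described in this subsection amounts to a vertical shift of the a priori curve at the inspection time. A minimal Python sketch follows, assuming the a priori failure rate h(t, M) is available as a callable `h_prior` and the reference value h*(t_c, M − L_c) as a number `h_ref_at_tc`. These names, and the toy curve in the usage lines, are hypothetical and not taken from the paper.

```python
def adapt_estimate(h_prior, t_c, h_ref_at_tc):
    """Return (delta_h, h_adapted), where
    delta_h = h(t_c, M) - h*(t_c, M - L_c) and
    h_adapted(t) = h(t, M) - delta_h is the shifted curve used after t_c."""
    delta_h = h_prior(t_c) - h_ref_at_tc

    def h_adapted(t):
        return h_prior(t) - delta_h

    return delta_h, h_adapted

# Toy illustration with a made-up a priori curve (not the paper's data).
h_prior = lambda t: 0.01 + 0.002 * t
delta_h, h_new = adapt_estimate(h_prior, t_c=10.0, h_ref_at_tc=0.02)
# delta_h > 0 indicates an overestimate (down-shift); delta_h < 0 an underestimate.
```

Returning a closure keeps the a priori model untouched, so a later inspection can compute a fresh shift from the same `h_prior`.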

3.2. Adaptive rejuvenation scheme

We introduce the scheme for adaptive rejuvenation, with respect to both the statistics and dynamics of the system, based on adapted performance estimates. The rejuvenation scheme determines when to acquire data pertaining to memory leakage and when to recollect the lost resource. Any inspection point is a potential point for rejuvenation. Both inspection and rejuvenation operations use system resources and incur a certain cost. We attempt to reduce the total cost. For illustration, we assume the following performance criterion: the probability of system failure between any two consecutive inspection points should not be above a prescribed threshold.

The inspection points are determined as follows. The initial time $t_0$ is set at the finish of the last rejuvenation operation and is considered the first inspection point. At the current inspection point $t_c$, the next inspection point $t_n$ is determined. Let $\tau_p$ be the threshold on the failure probability over the time period between $t_c$ and $t_n$. To meet the performance criterion, $t_n$ is bounded from above as follows,

$$
1 - \exp\left(-\int_{t_c}^{t_n} h(t, M)\,dt\right) \le \tau_p,
$$

or,

$$
\int_{t_c}^{t_n} h(t, M)\,dt \le -\ln(1-\tau_p).
\tag{10}
$$

To minimize the inspection frequency, we select $t_n$ so that the interval $(t_c, t_n)$ is as large as possible. In the presence of memory leakage, $h(t, M)$ increases with time. Consequently, the interval length $t_n - t_c$ decreases with time and invokes frequent inspection. To prevent this from happening, we introduce a lower bound $T_\Delta$ on the interval length. When the interval length is smaller than $T_\Delta$, a rejuvenation operation is scheduled immediately. We summarize the rejuvenation scheduling as follows.

◦ Specify a threshold $\tau_p$ on the failure probability and a lower bound $T_\Delta$ on the length of the interval between two consecutive inspection points,
◦ At any inspection point $t_c$, get the adapted estimate of $h(t)$, as shown in Section 3.1,
◦ Determine the next inspection point $t_n$,

$$
t_n = \max\left\{ t : \int_{t_c}^{t} h(s)\,ds \le -\ln(1-\tau_p) \right\},
\tag{11}
$$

◦ If $t_n > t_c + T_\Delta$, let the system operation continue until the next inspection point; otherwise schedule an immediate rejuvenation operation.

By this policy, the rejuvenation sequence is a subsequence of the inspection sequence and neither of them is necessarily equispaced.

We illustrate the scheduling policy in Figure 4. The horizontal axis is the time from the last rejuvenation $t_0$, and the vertical axis is the estimated failure rate, which is adjusted according to the on-line measurements.

[Figure 4. Illustration of rejuvenation scheduling. Estimated failure rate $h(t)$ versus time $t$, with inspection points $t_1$, $t_2$, $t_3$.]

At time $t_0$, we determine the next inspection point $t_1$ by equation (10). That is, the area under $h(t)$ between $t_0$ and $t_1$ is no larger than $-\ln(1-\tau_p)$. At $t_1$, we collect the data pertaining to the memory loss and lift the estimate adaptively. According to the adjusted estimation, we obtain the next inspection point $t_2$. Notice that the intervals are decreasing in length, which indicates the presence of an aging source. Assume that at $t_3$, we have $t_4 - t_3 < T_\Delta$. A rejuvenation operation takes place at $t_3$. As a result, $t_0$ is reset and $\ell(t_0) = 0$. The failure rate at $t_0$ is set to $h^*(t_0)$, which is obtained by solving the leak-less model described in Section 2.
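The scheduling rule of Section 3.2, i.e. the bound (10), the choice (11) and the comparison with the lower bound on the interval length, can be sketched as follows. This is a minimal Python illustration under assumptions of our own: the adapted failure rate is given as a callable `h`, the integral is accumulated with the trapezoidal rule on a fixed grid of step `dt`, and the search stops at a hypothetical horizon `t_max`. The function names are hypothetical.

```python
import math

def next_inspection(h, t_c, tau_p, dt=0.01, t_max=1e4):
    """Largest grid point t_n with integral_{t_c}^{t_n} h <= -ln(1 - tau_p), as in (11).
    The integral is accumulated step by step with the trapezoidal rule."""
    budget = -math.log(1.0 - tau_p)
    t, area = t_c, 0.0
    while t < t_max:
        step_area = 0.5 * (h(t) + h(t + dt)) * dt
        if area + step_area > budget:
            break
        area += step_area
        t += dt
    return t

def schedule_step(h, t_c, tau_p, T_delta):
    """Return (t_n, rejuvenate_now): continue if t_n > t_c + T_delta,
    otherwise schedule an immediate rejuvenation."""
    t_n = next_inspection(h, t_c, tau_p)
    return t_n, (t_n - t_c) <= T_delta

# Toy usage with constant failure rates (made-up values, not the paper's data).
t_n_low, rejuv_low = schedule_step(lambda t: 0.01, 0.0, 0.05, 2.0)
t_n_high, rejuv_high = schedule_step(lambda t: 0.05, 0.0, 0.05, 2.0)
```

With a higher failure rate the failure-probability budget is used up sooner, so the next inspection point moves closer and the rule switches from continuing operation to immediate rejuvenation.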

4. Numerical Illustrations

We present a set of numerical results to illustrate the key aspects of the adaptive software rejuvenation (ASR) framework and their impacts. We specify first the system for the illustrative experiments shown in this section. The total amount of the system resource is $M$. The arrival of resource requests is a Poisson process with rate $\lambda$. The resource holding time per request follows an exponential distribution with parameter $\mu$. The resource amount per request is a random variable of Erlang distribution with parameter $(\alpha, \gamma)$, the mean value of which is $\alpha/\gamma$, see (6). The parameters are set with the values in the table below.

M      λ      µ      γ      α
100    0.8    0.2    7      30

In the leak-present case, we assume that the expected leak function is $\ell(t) = \beta\lambda t$, where $\beta = 25/24$ and $t$ is the time from the last rejuvenation. For the solution of the non-homogeneous CTMC, as in Figure 2, we use a piecewise constant function to approximate the leak function $\ell(t)$, namely, $\ell(t_i) = \beta\lambda t_i$, where $t_i$ are equispaced. The numerical solutions for the model described in Section 2 are obtained by using MATHEMATICA and SHARPE [10].

4.1. Degradation due to resource loss

A degradation model must address a degradation source or aging agent and give a quantitative description of the measurable, pertinent data. We are concerned, in particular, with degradation due to resource loss. The amount of lost resource can be detected in many systems [5]. A principle of our methodology for degradation modeling is that the impact of resource loss on performance degradation be differentiated from that of the other affecting factors, such as the workload. Figure 5 shows the impact of resource loss on the system reliability and on the failure rate. The dashed lines are for the leak-free case and the solid lines are for the leak-present case. The upper plot shows that for any value $p \in (0, 1)$, the failure probability in the leak-present case reaches $p$ earlier than in the leak-free case. This can be seen more clearly from the comparison of the failure rates in the lower plot. While the failure rate remains within a relatively low and narrow range in the leak-free case, it increases monotonically with time in the leak-present case and therefore manifests the aging phenomenon. We note that the temporal range in the lower plot is shorter than that in the upper plot for illustration purposes.

[Figure 5. Performance degradation due to resource loss: the failure time distribution $F(t)$ (top, time axis 0 to 500) and the failure rate $h(t)$ (bottom, time axis 0 to 60), each with leak and without leak.]

4.2. Optimal rejuvenation with adaptive approach

The adaptive rejuvenation approach introduced in Section 3.2 explores more search space for the rejuvenation optimization problem. For illustration purposes, we set the tolerance $\tau_p = 0.05$ on the failure probability and the lower bound $T_\Delta = 2$ on the length of the interval between two consecutive inspection points. Initially, by model-based analysis, we obtain an estimate of the failure rate and select the first inspection point at $t_1 = 14.5$. Further inspection finds that the amount of leaked memory at $t_1$ is 4, which is lower than the expected leak at $t_1$, $\ell(t_1) = \beta\lambda t_1 \approx 12$. We adjust our estimation as shown in Figure 6 and select the next inspection point at $t_2 = 20.5$. We adapt the estimates at $t = 20.5$ and then at $t = 24$ similarly. At $t = 24$, we schedule a rejuvenation because the next inspection point would be less than $T_\Delta$ away. As long as the aging agent is active, the inspection intervals get smaller from the last rejuvenation point to the next one, as shown in Figure 6.
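As a quick check of the quantities used in Section 4.2, the expected leak at the first inspection point and the right-hand side of the bound (10) can be evaluated from the values stated in Section 4 ($\lambda = 0.8$, $\beta = 25/24$, $\tau_p = 0.05$, $t_1 = 14.5$). This is arithmetic on the stated values only, not an additional result.

```latex
\begin{align*}
\ell(t_1) &= \beta \lambda t_1 = \tfrac{25}{24} \times 0.8 \times 14.5 \approx 12.08, \\
-\ln(1-\tau_p) &= -\ln(0.95) \approx 0.0513 .
\end{align*}
```

The measured leak of 4 reported at $t_1$ is well below this expected value, and the area under the adapted failure-rate curve between consecutive inspection points is limited to about 0.0513.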

[Figure 6. Illustration of the rejuvenation policy. Failure rate $h(t)$ versus time (0 to 35), showing the a priori estimate, the adapted estimate, and the rejuvenation point.]

5. Conclusion

We have presented the framework of adaptive software rejuvenation (ASR). We have illustrated the ideas and techniques within the specific context of performance degradation due to resource leaks. The adaptive analysis and method for integrating model-based and measurement-based approaches is novel, to our knowledge. Both the statistics and dynamics of an operating software system are respected. The adaptation method enables improvements in performance estimation and rejuvenation scheduling. In modeling, our methodology differs from previous methods. In particular, our model for performance degradation due to resource leaks provides a direct connection between resource leaks and the failure rate. We demonstrated in Section 4 an increasing failure rate in the presence of resource leaks.

The objective of the ASR framework is to make rejuvenation more effective in practical applications. To this end, application-specific models, application-specific optimization objectives, and implementation-specific issues are to be developed. Presently, a rejuvenation testbed is under development. We would like to add that the ASR framework requires that the adaptive analysis be efficient. Since many solutions are obtained by numerical algorithms, it is important to develop models with respect to computational feasibility and complexity on the one side and develop algorithms that exploit the model structures on the other side.

References

[1] A. Avritzer and E. J. Weyuker. Monitoring smoothly degrading systems for increased dependability. Empirical Software Engineering Journal, 2(1):59–77, 1997.
[2] A. Bobbio, M. Sereno, and C. Anglano. Fine grained software degradation models for optimal software rejuvenation policies. Performance Evaluation, 46:45–62, 2001.
[3] K. Cassidy, K. Gross, and A. Malekpour. Advanced pattern recognition for detection of complex software aging in online transaction processing servers. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 478–482, Washington D.C., 2002.
[4] T. Dohi, K. Goseva-Popstojanova, and K. S. Trivedi. Statistical non-parametric algorithms to estimate the optimal software rejuvenation schedule. In Proceedings of the 2000 Pacific Rim International Symposium on Dependable Computing, pages 77–84, Los Angeles, CA, December 2000.
[5] C. Erickson. Memory leak detection in embedded systems. Linux Journal, Web Article 6059, March 2003.
[6] S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. Analysis of software rejuvenation using Markov regenerative stochastic Petri nets. In Proceedings of the Sixth International Symposium on Software Reliability Engineering, pages 180–187, Toulouse, France, October 1995.
[7] S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. Analysis of preventive maintenance in transactions based software systems. IEEE Transactions on Computers, 47(1):96–107, January 1998.
[8] S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi. A methodology for detection and estimation of software aging. In Proceedings of the 1998 International Symposium on Software Reliability Engineering, pages 283–292, Paderborn, Germany, November 1998.
[9] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th International Symposium on Fault Tolerant Computing, pages 381–390, Pasadena, CA, June 1995.
[10] R. Sahner, K. S. Trivedi, and A. Puliafito. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers, Boston, November 1995.
[11] P. K. Sen. Estimates of the regression coefficient based on Kendall’s tau. Journal of the American Statistical Association, 63:1379–1389, 1968.
[12] A. T. Tai, S. N. Chau, L. Alkalaj, and H. Hecht. On-board preventive maintenance: Analysis of effectiveness and optimal duty period. In Proceedings of the Third International Workshop on Object-Oriented Real-Time Dependable Systems, pages 40–47, Newport Beach, CA, 1997.
[13] K. Vaidyanathan and K. S. Trivedi. A measurement-based model for estimation of resource exhaustion in operational software systems. In Proceedings of the 10th International Symposium on Software Reliability Engineering, pages 84–93, Boca Raton, Florida, November 1999.
