The Journal of Systems and Software 61 (2002) 121–128 www.elsevier.com/locate/jss
Availability analysis and improvement of Active/Standby cluster systems using software rejuvenation
Kiejin Park a, Sungsoo Kim b,*
a Department of Software, Anyang University, Kangwha, Incheon, 417-830, South Korea
b Graduate School of Information and Communication, Ajou University, San 5 Wonchon-Dong, Paldal-Gu, Suwon 442-749, South Korea
Received 2 December 2000; received in revised form 4 April 2001; accepted 24 April 2001
Abstract
Cluster systems built from commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. To improve the availability of personal computer-based Active/Standby cluster systems, we have conducted a study of software rejuvenation, a proactive fault-tolerant approach to handling system failures of software origin. In this paper, we map software rejuvenation and switchover states onto a semi-Markov process and obtain mathematical steady-state solutions of the chain. Using these solutions, we calculate the availability and the downtime of Active/Standby cluster systems and find that software rejuvenation can be used to improve their availability. © 2002 Elsevier Science Inc. All rights reserved.
Keywords: Software rejuvenation; High availability; Cluster systems; Semi-Markov process
1. Introduction
If the downtime of a system is less than 5 min per year (availability: 99.999%), the system can be classified as a highly available system. Due to the increasing complexity of software, studies on how to implement a highly available system using cluster technology are increasingly in demand. Cluster systems using commercially available personal computers connected in a loosely coupled fashion can provide high levels of availability. Moreover, highly available cluster systems are becoming more and more popular because of their cost-effectiveness (Buyya, 1999). For example, the downtime of duplex systems built from two clustered low-end personal computers is less than 9 h per year (availability: 99.99%). However, as cluster systems consist of many servers, one must solve the low-availability problems caused by the high likelihood of server software failures (Park et al., 2000; Lee and Lyer, 1995).
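The relation between an availability figure and yearly downtime used above can be sketched in a few lines. This is an illustrative helper only: the function names are hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime in minutes for a steady-state availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Inverse conversion: yearly downtime in minutes to availability."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    for a in (0.999, 0.9999, 0.99999):
        print(f"{a:.5f} -> {downtime_minutes_per_year(a):8.2f} min/year")
```

Under this conversion, 99.999% corresponds to about 5.3 min of downtime per year, 99.99% to about 53 min, and 99.9% to about 8.8 h.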
* Corresponding author. Tel.: +82-31-219-2644; fax: +82-31-2191614. E-mail address: [email protected] (S. Kim).
Generally, software-aging phenomena such as memory leaks and buffer overflows proceed quickly in the software of cluster servers due to the loss of communications or data. After rejuvenating cluster systems by buffer flushing, memory cleaning, file system purging, and initialization of the file allocation table, the systems can restart their service from a healthy condition in which the probability of a software failure is very low (Huang et al., 1995a). Software rejuvenation, which intentionally terminates an application or a system and restarts it in a clean internal state, prevents failures from occurring, whereas previous fault-tolerant methods recover from failures after they happen (Huang et al., 1995b). Because rejuvenation only requires the system manager to stop the operation of cluster servers, clean the internal state of the server processes, and restart them, it does not incur additional costs. Therefore, the method can be regarded as a proper choice for an application requiring high availability. The connection of highly available Active/Standby cluster systems is represented in Fig. 1. Through fast access network devices such as asymmetric digital subscriber line (ADSL) modems and cable modems, thousands of clients can join cluster systems. The data in the disk arrays are shared via storage interconnected with
0164-1212/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved. PII: S0164-1212(01)00107-8
K. Park, S. Kim / The Journal of Systems and Software 61 (2002) 121–128
Fig. 1. The connection of Active/Standby cluster systems.
all the cluster servers. In Active/Standby configuration, there is a primary server where the critical application runs and backup servers that are used as spares in standby mode (Buyya, 1999). We investigate the availability of the Active/Standby cluster systems with a different number of backups and operation policies to evaluate the effect of software rejuvenation.
Due to the fast increase in size and complexity of software, the frequency of software-originated system failure is much higher than that of hardware-originated system failure. It is therefore almost impossible to develop error-free software. As the software used in servers begins to age, software faults such as memory loss, file sharing error, and data damage are prone to occur. However, it is very difficult to detect the failure of a cluster server caused by software aging (this kind of error is called “heisenbugs” in the fault tolerance field) (Garg et al., 1998). If software faults increase with software-aging, the possibility of a system failure becomes high. The following are popular techniques, which have been used for software fault tolerance (Johnson, 1989):
• Recovery block: if errors occur, the process is re-executed in other modules, which have the same functionality.
• N-version programming: N independent software modules are executed at the same time. The results are compared and then the majority of the results are selected as the output.
• N-self checking programming: if a module fails to run, a standby module will continue its operations thereafter.
• Checkpointing: periodically saves the temporary result of a process task and, when a failure occurs the process re-executes its operations again not from the beginning but from the latest saved checkpoint.
However, due to high cost and software complexity the above-mentioned reactive methods are hardly used for the availability improvement of cluster systems.
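As a small illustration of one of the reactive techniques listed above, N-version programming with majority voting might be sketched as follows. The three `square_*` functions are hypothetical stand-ins for independently developed versions, and the voter shown is one possible design rather than a scheme taken from the cited literature.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on x and return the result backed by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A crashing version simply casts no vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    # Require more than half of all versions (not just the survivors) to agree.
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


# Hypothetical versions of "x squared"; the third one contains an injected fault.
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_c(x: int) -> int:
    return x * x + 1


if __name__ == "__main__":
    print(n_version_vote([square_a, square_b, square_c], 7))  # prints 49
```

The sketch also hints at why such methods are costly: every request is executed N times, and N separate implementations must be developed and maintained.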
Software rejuvenation is based on the idea of the preventive maintenance techniques that have been used in the mechanical engineering field for a long time. Garg et al. propose the idea of software rejuvenation as a means for availability improvement (Garg et al., 1995a,b; Garg et al., 1997; Pfening et al., 1996). In the calculation of an optimal rejuvenation period and job loss probability, they use buffer size and workload as the model parameters. However, they did not consider the cost function, which is used in the evaluation of the rejuvenation policies. Huang et al. considered a cost function, which calculates the cost of the downtime during the rejuvenation and shutdown periods (Huang et al., 1995b, 1996; Wang et al., 1997). However, their state transition diagram is very simple and only the simplex system is analyzed. Garg et al. combine a checkpointing method and software rejuvenation to minimize the completion time of a request (Garg et al., 1996). Levendel uses software rejuvenation to handle reliable data exchange between the server and terminals in a communication environment (Levendel, 1999).
In this paper, we present a proactive fault-tolerant approach called software rejuvenation to improve the availability of cluster systems. We map software rejuvenation and switchover states onto a semi-Markov process and acquire mathematical steady-state solutions of the chain. By adopting software rejuvenation, we calculate and improve the availability of cluster systems. To our knowledge at this time, no other research has applied software rejuvenation to cluster servers. Previous studies do not treat the number of servers as a general parameter in their analysis.
The organization of the paper is as follows. In Section 1, we define the problem and address related research. Section 2 presents a system availability model in which the operational states of Active/Standby cluster systems using software rejuvenation are described. In the following section, the model is analyzed and experimental results are given to validate the model solution. Finally, we conclude that software rejuvenation is a viable method and present further research issues.
2. System model
A state transition diagram of Active/Standby cluster systems concerning software rejuvenation and switchover states is presented in Fig. 2. The assumptions used in the modeling are as follows:
• Failure rate ($\lambda$) and repair rate ($\mu$) of the server are identical at all states.
• Unstable rate ($\lambda_u$), the speed of escaping the healthy condition, is identical at all states.
• Rejuvenation rate ($\lambda_r$), the frequency of rejuvenation, is identical at all states.
• Mean time spent during the rejuvenation process is constant ($1/\mu_r$).
Fig. 2. State transition diagram of Active/Standby cluster systems.
• Mean switchover time, the time needed to transfer from the primary to the backup server, is constant ($1/\lambda_s$).
• Sojourn time in a state is exponentially distributed except for rejuvenation and switchover states.
• Sojourn time in rejuvenation and switchover states follows a k-stage Erlangian distribution.
• During the rejuvenation process, continuous service is possible except for the simplex configuration ($n = 1$).
Unfortunately, under the above assumptions, the state transition diagram in Fig. 2 does not belong to the class of irreducible recurrent nonnull Markov chains. Because the amount of time for the rejuvenation and switchover processes is assumed to be constant, no closed-form solutions can be derived easily (Kleinrock, 1975). When even one of the states in the diagram violates the memoryless property, which means that the sojourn time in that state does not follow an exponential distribution, the diagram is classified as a semi-Markov process.
The numbers in the thick circles in the diagram represent the number of operating servers in the normal states ($n, n-1, \ldots, 1$). After a long mission time, normal states may change to unstable states ($U_n, U_{n-1}, \ldots, U_1$) with rate $i\lambda_u$ ($i$: the number of servers). In the unstable states, server performance is degraded and software-aging effects render the system unreliable. If the system is in an unstable state, the state can change either to a rejuvenation state with rate $i\lambda_r$ or to the shutdown of a server in the cluster system with rate $i\lambda$. After rejuvenating the cluster servers, the unstableness of the cluster systems is removed automatically. In the failure state (0), all servers stop running and no available server remains. The tasks of fault detection and/or switchover from primary to backup servers are represented in the switchover states ($T_i$ and $S_i$, where $i = n, n-1, \ldots, 2$). The two kinds of switchover state are given distinct notation to make the mathematical solution easier to derive. In the rejuvenation states ($R_n, R_{n-1}, \ldots, R_1$), backup servers stop running intentionally and revert to cleaning processes. After the rejuvenation, one of the healthy rejuvenated servers takes over the role of primary server, so service is not stopped even during the rejuvenation process.
Fig. 3. The shape of k-stage Erlangian distributions ($1/\mu_r = 3$ min).
In order to capture the rejuvenation and switchover processes of Active/Standby cluster systems more practically, we permit deterministic rejuvenation and switchover time. For this purpose, we assume sojourn time in these states as a k-stage Erlangian distribution shown below (Kleinrock, 1975).
$$b(x) = \frac{k\mu_r (k\mu_r x)^{k-1}}{(k-1)!} e^{-k\mu_r x}, \quad x \ge 0. \tag{1}$$
As k goes to infinity, this density function approaches a unit impulse function at the point $1/\mu_r$ (refer to Fig. 3). This implies that, in the limit of infinitely many stages, the time spent in the state is a constant with a probability of 1.
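The concentration of the k-stage Erlangian density around its mean can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, the density is Eq. (1) evaluated in log space, and the probability mass is estimated with a plain trapezoidal rule. It also uses the standard result that a k-stage Erlangian distribution with mean $m$ has variance $m^2/k$.

```python
import math


def erlang_pdf(x: float, k: int, mean: float) -> float:
    """Density of Eq. (1): k-stage Erlangian with overall mean `mean` (= 1/mu_r)."""
    if x < 0:
        return 0.0
    rate = k / mean  # per-stage rate k * mu_r
    if x == 0:
        return rate if k == 1 else 0.0
    # Log space avoids overflow of (k * mu_r * x)^(k-1) and (k-1)! for large k.
    log_pdf = math.log(rate) + (k - 1) * math.log(rate * x) - rate * x - math.lgamma(k)
    return math.exp(log_pdf)


def erlang_variance(k: int, mean: float) -> float:
    """Variance of a k-stage Erlangian distribution with the given mean."""
    return mean ** 2 / k


def mass_near_mean(k: int, mean: float, half_width: float, steps: int = 20000) -> float:
    """Trapezoidal estimate of P(|X - mean| <= half_width)."""
    lo, hi = max(0.0, mean - half_width), mean + half_width
    h = (hi - lo) / steps
    total = 0.5 * (erlang_pdf(lo, k, mean) + erlang_pdf(hi, k, mean))
    total += sum(erlang_pdf(lo + i * h, k, mean) for i in range(1, steps))
    return total * h


if __name__ == "__main__":
    # Mean of 3 min, as in Fig. 3; the mass within +/- 1 min grows with k.
    for k in (1, 5, 20, 100):
        print(k, erlang_variance(k, 3.0), round(mass_near_mean(k, 3.0, 1.0), 3))
```

As k grows, the variance shrinks as $1/k$ and the mass within a fixed window around $1/\mu_r$ approaches 1, which is the sense in which a large number of stages approximates a deterministic sojourn time.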
Fig. 4. State transition diagram of sub-rejuvenation and switchover states.
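To convert the semi-Markov process into a Markov process, we decompose each of the rejuvenation and switchover states in Fig. 2 into k sub-states. For example, the sojourn time ($1/\mu_r$) of a rejuvenation state ($R_i$) is represented as the sum of k exponential distributions with the rate $k\mu_r$ (refer to Fig. 4). Actually, the Erlangian distribution is the same as the sum of the exponential distributions. If we convert all of the rejuvenation and switchover states ($R_i$, $T_i$ and $S_i$), which have deterministic sojourn time, into a number of decomposed sub-states as shown in Fig. 4, our state transition diagram in Fig. 2 can be described as a Markov process. So we can perform steady-state analysis of the diagram easily. The steady-state balance equations of Fig. 2 are as follows:

$$(\mu + i\lambda_u) P_i = \mu P_{i-1} + (i-1) k\mu_r P_{R_{i,k}} + k\lambda_s P_{T_{i,k}} + k\lambda_s P_{S_{i+1,k}}, \quad i = 2, 3, \ldots, n-1, \tag{2}$$

$$n\lambda_u P_n = \mu P_{n-1} + (n-1) k\mu_r P_{R_{n,k}} + k\lambda_s P_{T_{n,k}}, \tag{3}$$

$$(\mu + \lambda_u) P_1 = \mu P_0 + k\mu_r P_{R_{1,k}} + k\lambda_s P_{S_{2,k}}, \tag{4}$$

$$\mu P_0 = \lambda P_{u_1}, \tag{5}$$

$$(\lambda + \lambda_r) P_{u_i} = \lambda_u P_i, \quad i = 1, 2, \ldots, n, \tag{6}$$

$$k\mu_r P_{R_{i,j}} = i\lambda_r P_{u_i}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, k-1, \tag{7}$$

$$k\mu_r P_{R_{i,k}} = \lambda_r P_{u_i}, \quad i = 1, 2, \ldots, n, \tag{8}$$

$$\lambda_s P_{T_{i,j}} = \mu_r P_{R_{i,k}}, \quad i = 2, 3, \ldots, n; \; j = 1, 2, \ldots, k, \tag{9}$$

$$k\lambda_s P_{S_{i,j}} = i\lambda P_{u_i}, \quad i = 2, 3, \ldots, n; \; j = 1, 2, \ldots, k. \tag{10}$$

The conservation equation of Fig. 2 is obtained by summing the probabilities of all states in cluster systems and the sum of the equation is 1.

$$\sum_{i=0}^{n} P_i + \sum_{i=1}^{n} P_{u_i} + \sum_{i=1}^{n} \sum_{j=1}^{k} P_{R_{i,j}} + \sum_{i=2}^{n} \sum_{j=1}^{k} \left( P_{T_{i,j}} + P_{S_{i,j}} \right) = 1. \tag{11}$$

The meaning of the probabilities is as follows:
• $P_i$: the probability that the cluster systems are in normal state i at steady-state.
• $P_{u_i}$: the probability that the cluster systems are in unstable state i at steady-state.
• $P_{R_{i,j}}$: the probability of being in the j-th of the k sub-rejuvenation states when cluster systems revert to the rejuvenation process at normal state i.
• $P_{T_{i,j}}$, $P_{S_{i,j}}$: the probability of being in the j-th of the k sub-switchover states when cluster systems revert to the fault detection and/or switchover process at normal state i.

Combining the above-mentioned balance equations with the conservation equation, and solving these simultaneous equations, we acquire the closed-form solutions for the model of Active/Standby cluster systems.

• Simplex system ($n = 1$)

$$P_1 = \left[ 1 + \frac{\lambda_u}{\lambda + \lambda_r} \left( 1 + \frac{\lambda_r}{\mu_r} + \frac{\lambda}{\mu} \right) \right]^{-1}. \tag{12}$$

• Multiplex systems ($n \ge 2$)

$$P_n = \left[ n! \left\{ \left( 1 + \frac{\lambda_u}{\lambda + \lambda_r} \left( 1 + \frac{\lambda_r}{k\mu_r} \right) \right) \sum_{i=1}^{n} \frac{1}{i!} \left( \frac{\lambda}{\mu} \frac{\lambda_u}{\lambda + \lambda_r} \right)^{n-i} + \frac{(k-1)\lambda_r}{k\mu_r} \frac{\lambda_u}{\lambda + \lambda_r} \sum_{i=1}^{n} \frac{1}{(i-1)!} \left( \frac{\lambda}{\mu} \frac{\lambda_u}{\lambda + \lambda_r} \right)^{n-i} + \frac{\lambda_r}{\lambda_s} \frac{\lambda_u}{\lambda + \lambda_r} \sum_{i=2}^{n} \frac{1}{i!} \left( \frac{\lambda}{\mu} \frac{\lambda_u}{\lambda + \lambda_r} \right)^{n-i} + \left( \frac{\lambda}{\mu} \frac{\lambda_u}{\lambda + \lambda_r} \right)^{n} \right\} \right]^{-1}, \tag{13}$$

$$P_i = \frac{n!}{i!} \left( \frac{\lambda}{\mu} \frac{\lambda_u}{\lambda + \lambda_r} \right)^{n-i} P_n, \quad i = 0, 1, 2, \ldots, n, \tag{14}$$

$$P_{u_i} = \frac{\lambda_u}{\lambda + \lambda_r} P_i, \quad i = 1, 2, \ldots, n, \tag{15}$$

$$P_{R_{i,j}} = \frac{i\lambda_r}{k\mu_r} \frac{\lambda_u}{\lambda + \lambda_r} P_i, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, k-1, \tag{16}$$

$$P_{R_{i,k}} = \frac{\lambda_r}{k\mu_r} \frac{\lambda_u}{\lambda + \lambda_r} P_i, \quad i = 1, 2, \ldots, n, \tag{17}$$

$$P_{T_{i,j}} = \frac{\lambda_r}{k\lambda_s} \frac{\lambda_u}{\lambda + \lambda_r} P_i, \quad i = 2, 3, \ldots, n; \; j = 1, 2, \ldots, k, \tag{18}$$

$$P_{S_{i,j}} = \frac{i\lambda}{k\lambda_s} \frac{\lambda_u}{\lambda + \lambda_r} P_i, \quad i = 2, 3, \ldots, n; \; j = 1, 2, \ldots, k. \tag{19}$$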
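A minimal numerical sketch of these solutions follows. It expresses every state probability relative to $P_n$ using Eqs. (14)-(19) and then normalises directly with the conservation equation (11), rather than evaluating the closed form for $P_n$. The function name, the state labels, and the conversion of the rates to a per-hour basis (a 365-day year and a 30-day month) are assumptions made for illustration.

```python
from math import factorial


def steady_state(n, k, lam, mu, lam_u, lam_r, mu_r, lam_s):
    """Steady-state probabilities of the n-server model, keyed by a state label.

    All rates must share one time unit. Labels: "P{i}" normal, "U{i}" unstable,
    "R{i},{j}" sub-rejuvenation, "T{i},{j}" and "S{i},{j}" sub-switchover states.
    """
    y = lam_u / (lam + lam_r)   # ratio P_ui / P_i, Eq. (15)
    x = (lam / mu) * y          # ratio raised to the power n - i in Eq. (14)
    weights = {}                # probabilities relative to P_n = 1
    for i in range(n + 1):
        p_i = factorial(n) / factorial(i) * x ** (n - i)          # Eq. (14)
        weights[f"P{i}"] = p_i
        if i >= 1:
            weights[f"U{i}"] = y * p_i                            # Eq. (15)
            for j in range(1, k + 1):
                coeff = i if j < k else 1                         # Eqs. (16), (17)
                weights[f"R{i},{j}"] = coeff * lam_r / (k * mu_r) * y * p_i
        if i >= 2:
            for j in range(1, k + 1):
                weights[f"T{i},{j}"] = lam_r / (k * lam_s) * y * p_i      # Eq. (18)
                weights[f"S{i},{j}"] = i * lam / (k * lam_s) * y * p_i    # Eq. (19)
    total = sum(weights.values())                                 # Eq. (11)
    return {state: w / total for state, w in weights.items()}


def grouped(probs):
    """Aggregate the per-state probabilities into the five kinds of state."""
    names = {"P": "normal", "U": "unstable", "R": "rejuvenation",
             "T": "switchover", "S": "switchover"}
    groups = {"normal": 0.0, "unstable": 0.0, "rejuvenation": 0.0,
              "switchover": 0.0, "failure": 0.0}
    for state, p in probs.items():
        groups["failure" if state == "P0" else names[state[0]]] += p
    return groups


if __name__ == "__main__":
    HOURS_PER_YEAR = 365 * 24   # assumption: 365-day year
    HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
    probs = steady_state(n=2, k=20, lam=1 / HOURS_PER_YEAR, mu=2 / 24,
                         lam_u=2 / HOURS_PER_MONTH, lam_r=1 / HOURS_PER_MONTH,
                         mu_r=6.0, lam_s=20.0)
    for name, value in grouped(probs).items():
        print(f"{name:13s} {value:.6e}")
```

Normalising numerically keeps the sketch short and makes it easy to confirm that the resulting probabilities satisfy individual balance equations such as Eqs. (3) and (5).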
Using the system-operating parameters, we first obtain $P_n$ and then calculate the probabilities of being in the normal, unstable, rejuvenation, switchover and failure states in turn. For example, Active/Standby duplex cluster systems ($n = 2$) are modeled in Fig. 5 and the steady-state probabilities of the system are given. If the cluster systems are in an unstable state ($U_2$), one of the servers may change to the failure state with rate $2\lambda$ or a backup server may revert to the rejuvenation process. The failure rate can be obtained from the mean time to failure (MTTF) of the server and the repair rate can be obtained from the mean time to repair (MTTR). In failure
Fig. 5. State transition diagram of Active/Standby duplex cluster systems.
state (0), no operational server exists and service is not available during the repair time ($1/\mu$). For the simplex system (1), it is not necessary to represent the switchover state. The steady-state probabilities of the duplex system are grouped as follows:
• normal state: $P_2 + P_1$;
• unstable state: $P_{u_2} + P_{u_1}$;
• rejuvenation state: $\sum_{j=1}^{k} (P_{R_{2,j}} + P_{R_{1,j}})$;
• switchover state: $\sum_{j=1}^{k} (P_{S_{2,j}} + P_{T_{2,j}})$;
• failure state: $P_0$.
3. Experimental results
Table 1
System-operating parameters

Parameters       Values
n                2 (duplex configuration)
T                1 year
$\lambda$        1 time/year
$\mu$            2 times/day
$\lambda_r$      1 time/month
$\lambda_u$      2 times/month
$1/\mu_r$        10 min
$1/\lambda_s$    3 min
$C_r$            1 unit
$C_s$            20 units
$C_f$            100 units
k                20 stages
3.1. Availability
The cluster systems are not available in all of the rejuvenation processes in the normal state (1), all of the switchover states, and the failure state (0). The availability of Active/Standby cluster systems is defined as follows:
$$\text{Availability} = 1 - \left( P_0 + \sum_{j=1}^{k} P_{R_{1,j}} + \sum_{i=2}^{n} \sum_{j=1}^{k} \left( P_{T_{i,j}} + P_{S_{i,j}} \right) \right). \tag{20}$$
3.2. Downtime cost
Predictable shutdown cost is far less than that of unexpected shutdown ($C_f \gg C_r$). Downtime cost of Active/Standby cluster systems can be calculated from the unavailability of cluster systems and defined as a function of operation time (T).
$$\mathrm{Cost}(T) = \left[ C_f P_0 + C_r \sum_{j=1}^{k} P_{R_{1,j}} + C_s \sum_{i=2}^{n} \sum_{j=1}^{k} \left( P_{T_{i,j}} + P_{S_{i,j}} \right) \right] T, \tag{21}$$
where $C_f$ is the unit cost of unexpected shutdown of a server, $C_r$ the unit cost of rejuvenation process and $C_s$ is the unit cost of switchover process.
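Eqs. (20) and (21) can be evaluated with a short routine. The sketch below is an assumption-laden illustration rather than the authors' tool: the function name is hypothetical, the stage sums are taken analytically from the closed-form state probabilities, the rates are converted to a per-hour basis with a 365-day year and a 30-day month, and T is expressed in hours on the assumption that the unit costs are per hour of downtime.

```python
from math import factorial


def availability_and_cost(n, k, lam, mu, lam_u, lam_r, mu_r, lam_s,
                          c_f, c_r, c_s, T):
    """Return (availability, downtime cost) following Eqs. (20) and (21)."""
    y = lam_u / (lam + lam_r)
    x = (lam / mu) * y
    # Normal-state probabilities relative to P_n = 1.
    p = [factorial(n) / factorial(i) * x ** (n - i) for i in range(n + 1)]
    unstable = [0.0] + [y * p[i] for i in range(1, n + 1)]
    rejuv = [0.0] * (n + 1)    # sum over j of P_Ri,j (k - 1 early stages plus the last one)
    switch = [0.0] * (n + 1)   # sum over j of (P_Ti,j + P_Si,j)
    for i in range(1, n + 1):
        rejuv[i] = ((k - 1) * i + 1) * lam_r / (k * mu_r) * y * p[i]
    for i in range(2, n + 1):
        switch[i] = (lam_r + i * lam) / lam_s * y * p[i]
    total = sum(p) + sum(unstable) + sum(rejuv) + sum(switch)   # normalisation
    p0 = p[0] / total
    r1 = rejuv[1] / total
    sw = sum(switch) / total
    availability = 1.0 - (p0 + r1 + sw)              # Eq. (20)
    cost = (c_f * p0 + c_r * r1 + c_s * sw) * T      # Eq. (21)
    return availability, cost


if __name__ == "__main__":
    HOURS_PER_YEAR = 365 * 24   # assumption: 365-day year
    HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month
    params = dict(k=20, lam=1 / HOURS_PER_YEAR, mu=2 / 24,
                  lam_u=2 / HOURS_PER_MONTH, lam_r=1 / HOURS_PER_MONTH,
                  mu_r=6.0, lam_s=20.0, c_f=100.0, c_r=1.0, c_s=20.0,
                  T=HOURS_PER_YEAR)
    for n in (1, 2, 3):
        a, c = availability_and_cost(n=n, **params)
        print(f"n={n}: availability={a:.6f}, downtime cost={c:.3f} units")
```

Setting all three unit costs to 1 reduces Eq. (21) to the expected downtime over T, which gives a quick consistency check between the two measures.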
3.3. Experiments
To acquire system dependability measures like availability and downtime cost, we perform experiments using the system-operating parameters shown in Table 1 (Garg et al., 1998). The system-operating parameters for the experiments are explained as follows. Active/Standby duplex cluster systems operate for one year continuously. The failure rate of the server is 1 time per year and the repair time is 15 h. Rejuvenation is scheduled every month and the rejuvenated healthy cluster systems become unstable every 15 days. The rejuvenation and switchover times are 10 and 3 min, respectively. The unexpected downtime cost per unit is 100 times greater than that of the scheduled rejuvenation cost. To have a deterministic sojourn time in rejuvenation and switchover states, the number of stages is set to 20.
The change in the availability of cluster systems with the different number of servers and rejuvenation rates is plotted in Fig. 6. The number of servers is varied from simplex to multiplex ($n = 5$), and at the same time we perform software rejuvenation with an interval ranging from 10 days (rate = 3) to infinity (rate = 0: no rejuvenation). The graph shows that the gain in availability from simplex to duplex is significant, whereas the gain from duplex to multiplex is very small. As unstable states are removed frequently with high rejuvenation rates, the availability of the cluster systems with simplex configuration increases. However, when the degree of redundancy is larger than or equal to 3, the improvement of availability is not significant. From this result, it is apparent that duplex configuration is a cost-effective way to build high availability systems. According to the required availability level, a rejuvenation rate can be chosen by considering various evaluation criteria such as state probabilities and downtime cost.
Fig. 6. The plot of availability versus number of servers and rejuvenation rate.
The influence of failure rates along with rejuvenation rates on availability is shown in Fig. 7. In the duplex configuration, availability is less sensitive to the failure rate than to the rejuvenation rate. These results suggest that software reliability is more important than hardware reliability in improvement of the availability of cluster systems. In other words, the unstable rate of the server software acts as an important factor in the availability of cluster systems.
Fig. 7. The plot of availability versus failure rate and rejuvenation rate.
The downtime cost of a scheduled shutdown is much lower than that of an unscheduled shutdown. Fig. 8 shows the possibility of the practical use of the software rejuvenation technique. If the unexpected shutdown cost of the system is large enough compared to the intended shutdown cost by software rejuvenation, frequent rejuvenation is beneficial. Because a high rejuvenation rate means high availability, the downtime cost of cluster systems decreases as the rejuvenation rate increases.
Fig. 8. The plot of downtime cost versus failure rate and rejuvenation rate.
The effect of the repair capability of cluster systems on availability with different rejuvenation rates is shown in Fig. 9. The graph shows that the faster the repair, the higher the expected availability. The rejuvenation process does not improve availability when there is good repair capability. When the repair time is less than 12 h, the relationship between the availability and rejuvenation rate is not significant.
Fig. 9. The plot of availability versus repair time and rejuvenation rate.
The ratio of downtime cost for unexpected shutdown of cluster systems over intentional shutdown is closely related to rejuvenation rates (refer to Fig. 10). Because the availability of cluster systems decreases with a low rejuvenation rate, the downtime cost increases sharply with a high cost ratio and a low rejuvenation rate.
Fig. 10. The plot of downtime cost versus cost ratio and rejuvenation rate.
Fig. 11 shows the relationship between switchover time and rejuvenation rate with availability. When switchover time is less than 15 min, a high rejuvenation rate is beneficial for improving availability. However, when switchover time exceeds 15 min, frequent rejuvenation is not beneficial. Due to this fact, switchover time must be considered carefully when determining the rejuvenation policy. When the mean switchover time is 10 min and the number of sub-switchover states is 20, the variance of the 20-stage Erlangian distribution is 5 min². This is only 5% of the variance of an exponential distribution with a mean of 10 min (100 min²). Hence, the method of stages is applicable here.
Fig. 11. The plot of availability versus switchover time and rejuvenation rate.
In Fig. 12, the effect of unstable rate and rejuvenation rate on availability is shown. If the unstable rate exceeds 4 and the rejuvenation rate is high, availability decreases. When the unstable rate is high, frequent rejuvenations are not beneficial for availability improvement. So the unstable rate as well as the switchover rate must be considered carefully when determining the rejuvenation policy.
Fig. 12. The plot of availability versus unstable rate and rejuvenation rate.
4. Conclusions
Highly available proprietary fault-tolerant systems using tightly coupled hardware and software are expensive to develop and deploy. We have analyzed the availability of Active/Standby cluster systems built with loosely coupled commercially available personal computers. According to the system-operating parameters, we have calculated steady-state probabilities, availability, and downtime cost of Active/Standby cluster systems by adopting a software rejuvenation technique. We have validated the closed-form solutions of the mathematical model with experiments based on the above parameters. We have also found that software rejuvenation can be used as a preventive fault-tolerant technique and it improves the availability of Active/Standby cluster systems. In future work we will consider the coverage factor of failure events with software rejuvenation. The integration of response time and throughput with downtime cost will provide a more accurate evaluation measure. To compute the optimal rejuvenation period, the system load must be included in determining the rejuvenation policies.
Acknowledgements
This work is supported in part by the Ministry of Information & Communication of Korea (“Support Project of University Foundation Research <2001>” supervised by IITA) and supported in part by the Ministry of Education of Korea (Brain Korea 21 Project Supervised by Korea Research Foundation).
References
Buyya, R., 1999. In: High Performance Cluster Computing Volume 1: Architectures and Systems. Prentice-Hall, Englewood Cliffs, NJ, p. 849.
Garg, S. et al., 1995a. Time and load based software rejuvenation: policy, evaluation and optimality. In: Proceedings of the First Conference on Fault Tolerant Systems.
Garg, S. et al., 1995b. Analysis of software rejuvenation using Markov regenerative stochastic petri net. In: Proceedings of the Sixth International Symposium on Software Reliability Engineering, pp. 180–187.
Garg, S. et al., 1996. Minimizing completion time of a program by checkpointing and rejuvenation. In: Proceedings of ACM SIGMETRICS Conference, pp. 252–261.
Garg, S. et al., 1997. On the analysis of software rejuvenation policies. In: Proceedings of 12th Annual Conference on Computer Assurance (COMPASS).
Garg, S. et al., 1998. Analysis of preventive maintenance in transactions based software systems. IEEE Transactions on Computers 47 (1), 96–107.
Huang, Y. et al., 1995a. Software tools and libraries for fault tolerance. Bulletin of the Technical Committee on Operating Systems and Application Environment (TCOS) 7 (4), 5–9.
Huang, Y. et al., 1995b. Software rejuvenation: analysis, module and applications. In: Proceedings of the 25th International Symposium on Fault Tolerant Computing (FTCS-25), pp. 381–390.
Huang, Y. et al., 1996. Components for software fault tolerance and rejuvenation. AT&T Technical Journal, 29–37.
Johnson, B., 1989. In: Design and Fault-Tolerant Analysis of Digital Systems. Addison-Wesley, Reading, MA, p. 584.
Kleinrock, L., 1975. In: Queueing Systems Volume 1: Theory. Wiley, New York, p. 417.
Lee, I., Lyer, R., 1995. Software dependability in the Tandem GUARDIAN system. IEEE Transactions on Software Engineering 21 (5), 455–467.
Levendel, H., 1999. Software dependability in wireless systems. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, 3–12.
Park, K. et al., 2000. Availability analysis of multiplex systems using software rejuvenation method. Journal of the Korea Information Science Society 27 (8), 730–740.
Pfening, A. et al., 1996. Optimal rejuvenation for tolerating soft failures. Performance Evaluation 27&28, 491–506.
Wang, Y. et al., 1997. Progressive retry for software failure recovery in message-passing applications. IEEE Transactions on Computers 46 (10), 1137–1141.
Kiejin Park was born in Seoul, Korea. He received the B.S. and M.S. degrees in Industrial Engineering from Hanyang University and POSTECH, Korea, in 1989 and 1991, respectively, and Ph.D. degree in Department of Computer Engineering, Graduate School of Ajou University, Korea, in 2001. He is currently with Department of Software, Anyang University in Korea. From 1991 to 1996, he worked in the Computer and Communication Research Center of Samsung Advanced Institute of Technology, Korea, as an Assistant Researcher. From 1996 to 1997, he was with the Software Research and Development Center of Samsung Electronics Co., Korea, as a senior researcher. From 2001 to 2002, he worked in the Network Equipment Test Center of Electronics and Telecommunications Research Institute (ETRI) as a senior researcher. His research interests include software dependability, fault-tolerant computing, performance evaluation, simulation, multimedia systems, and cluster systems.
Sungsoo Kim was born in Seoul, Korea. He received the B.S. and M.S. degrees in electronic engineering from Sogang University, Korea, in 1982 and 1984, respectively, and the Ph.D. degree in computer science from Texas A&M University, College Station, Texas, in 1995. He is currently an Associate Professor in the Graduate School of Information and Communication, Ajou University in Korea. From 1983 to 1986, he worked in the Research and Development Center of Samsung Electronics Co., Korea. From 1987 to 1996, he was with the Computer and Communications Research Center of Samsung Advanced Institute of Technology, Korea, as a Principal Researcher. His research interests include fault-tolerant computing, performance evaluation, multimedia systems, mobile systems, and cluster systems.