Supporting fault-tolerance for time-critical events in distributed environments

Qian Zhu, Gagan Agrawal

Research output: Contribution to journal > Article > peer-review

4 Scopus citations

Abstract

In this paper, we consider the problem of supporting fault tolerance for adaptive and time-critical applications in heterogeneous and unreliable grid computing environments. Our goal for this class of applications is to optimize a user-specified benefit function while meeting the time deadline. Our first contribution in this paper is a multi-objective optimization algorithm for scheduling the application onto the most efficient and reliable resources. In this way, the processing can achieve the maximum benefit while also maximizing the success-rate, which is the probability of finishing execution without failures. However, for the cases where failures do occur, we have developed a hybrid failure recovery scheme to ensure that the application can complete within the pre-specified time interval. Our experimental results show that our scheduling algorithm can achieve better benefit when compared to several heuristics-based greedy scheduling algorithms, while still having a negligible overhead. Benefit is further improved when we apply the hybrid failure recovery scheme, and the success-rate becomes 100%.
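
The trade-off between benefit and success-rate described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the `Plan` type, the exponential failure model, the independence assumption, and all numbers are hypothetical and chosen only for illustration. The sketch computes a success-rate for each candidate plan and keeps the plans that no other plan beats on both objectives (a Pareto front).

```python
from dataclasses import dataclass
from math import exp, prod


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate schedule (not a structure from the paper)."""
    name: str
    benefit: float         # value of the user-specified benefit function
    failure_rates: tuple   # assumed per-node failure rates, in failures per hour


def success_rate(plan: Plan, deadline_hours: float) -> float:
    """Probability that no node fails before the deadline.

    Assumes independent nodes with exponentially distributed failure times.
    This is an illustrative model, not one taken from the paper.
    """
    return prod(exp(-rate * deadline_hours) for rate in plan.failure_rates)


def pareto_front(plans, deadline_hours):
    """Keep the plans that no other plan beats on both benefit and success-rate."""
    scored = [(p, p.benefit, success_rate(p, deadline_hours)) for p in plans]
    front = []
    for p, b, s in scored:
        # A plan is dominated if another plan is at least as good on both
        # objectives and strictly better on at least one.
        dominated = any(
            (b2 >= b and s2 >= s) and (b2 > b or s2 > s)
            for _, b2, s2 in scored
        )
        if not dominated:
            front.append(p)
    return front


# Made-up candidates for illustration only.
plans = [
    Plan("fast-but-flaky", benefit=0.90, failure_rates=(0.20, 0.20)),
    Plan("balanced", benefit=0.75, failure_rates=(0.05, 0.05)),
    Plan("slow-and-flaky", benefit=0.50, failure_rates=(0.20, 0.30)),
]
front_names = [p.name for p in pareto_front(plans, deadline_hours=1.0)]
print(front_names)
```

With these made-up inputs, the third plan is dominated on both objectives and drops out. A scheduler would still need a policy for choosing among the remaining plans, and a recovery mechanism for the cases where a failure occurs anyway.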

Original language: English (US)
Pages (from-to): 51-76
Number of pages: 26
Journal: Scientific Programming
Volume: 18
Issue number: 1
DOIs
State: Published - 2010
Externally published: Yes
Event: the Conference - Portland, Oregon
Duration: Nov 14 2009 to Nov 20 2009

Keywords

  • Adaptive application
  • Fault tolerance
  • Grid computing
  • Time-critical event

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
