Supporting fault tolerance in a data-intensive computing middleware

Tekin Bicer, Wei Jiang, Gagan Agrawal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Scopus citations

Abstract

Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long-running nature and because of the potential for using a large number of nodes to process massive amounts of data. Fault tolerance has also been an important attribute of map-reduce in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. In particular, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in the absence and in the presence of failures.
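The checkpointing idea in the abstract can be illustrated with a small sketch. This is not the paper's API, which the abstract does not show. It assumes a word-count-style reduction whose reduction object is a `Counter`, and it uses an in-memory `CheckpointStore` as a stand-in for "another location". All names here (`CheckpointStore`, `process_chunks`, `run_with_recovery`, `checkpoint_every`, `fail_at`) are hypothetical.

```python
import copy
from collections import Counter


class CheckpointStore:
    """Hypothetical stand-in for the remote location that caches reduction objects."""

    def __init__(self):
        self._saved = {}

    def save(self, node_id, next_chunk, reduction_obj):
        # Store a copy so later local updates do not alter the cached state.
        self._saved[node_id] = (next_chunk, copy.deepcopy(reduction_obj))

    def load(self, node_id):
        # A node that never checkpointed starts at chunk 0 with an empty object.
        next_chunk, obj = self._saved.get(node_id, (0, Counter()))
        return next_chunk, copy.deepcopy(obj)


def process_chunks(node_id, chunks, store, checkpoint_every=2, fail_at=None):
    """Fold chunks into the reduction object, caching it periodically.

    Processing resumes from the last cached state for node_id.
    fail_at (assumption, for illustration only) simulates a crash just
    before the chunk with that index is processed.
    """
    start, reduction_obj = store.load(node_id)
    for i in range(start, len(chunks)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure before chunk {i}")
        reduction_obj.update(chunks[i])  # local reduction: count each element
        if (i + 1) % checkpoint_every == 0:
            store.save(node_id, i + 1, reduction_obj)
    return reduction_obj


def run_with_recovery(node_id, chunks, store, fail_at):
    """Run once with an injected failure, then recover from the cached state."""
    try:
        return process_chunks(node_id, chunks, store, fail_at=fail_at)
    except RuntimeError:
        # Recovery restarts from the cached reduction object, not from scratch.
        return process_chunks(node_id, chunks, store)
```

Because the reduction object summarizes everything processed so far, recovery only needs the cached object and the index of the next unprocessed chunk. Only the chunks handled since the last checkpoint are redone. In a real deployment the remaining work would presumably be picked up by a different node; the sketch reruns it in the same process for simplicity.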

Original language: English (US)
Title of host publication: Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
DOIs
State: Published - 2010
Externally published: Yes
Event: 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010 - Atlanta, GA, United States
Duration: Apr 19 2010 to Apr 23 2010

Publication series

Name: Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Conference

Conference: 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Country/Territory: United States
City: Atlanta, GA
Period: 4/19/10 to 4/23/10

Keywords

  • Cloud computing
  • Data-intensive computing
  • Fault tolerance
  • Map-Reduce

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Software
  • Theoretical Computer Science
