Supporting fault tolerance in a data-intensive computing middleware

Tekin Bicer; Wei Jiang; Gagan Agrawal

doi:10.1109/IPDPS.2010.5470462

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Scopus citations

Abstract

Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long running nature, and because of the potential for using a large number of nodes for processing massive amounts of data. Fault-tolerance has been an important attribute of map-reduce as well in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault-tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault-tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure-recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in absence and presence of failures.
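The idea of periodically caching a reduction object and resuming from it after a failure can be illustrated with a small sketch. This is not the authors' middleware or its API: the names `CheckpointStore`, `process_chunks`, `run_with_recovery`, and the `checkpoint_every` and `fail_at` parameters are all hypothetical, the "other location" is just an in-memory dictionary, and the reduction is a simple word count chosen for illustration.

```python
import copy
from collections import Counter


class CheckpointStore:
    """Hypothetical stand-in for the remote location that caches reduction objects."""

    def __init__(self):
        self._saved = {}

    def save(self, node_id, next_chunk, reduction_obj):
        # Store a copy so later local updates do not alter the cached state.
        self._saved[node_id] = (next_chunk, copy.deepcopy(reduction_obj))

    def load(self, node_id):
        # With no checkpoint yet, start from chunk 0 with an empty reduction object.
        next_chunk, obj = self._saved.get(node_id, (0, Counter()))
        return next_chunk, copy.deepcopy(obj)


def process_chunks(node_id, chunks, store, checkpoint_every=2, fail_at=None):
    """Fold chunks into a reduction object, caching it every few chunks."""
    next_chunk, reduction_obj = store.load(node_id)
    for i in range(next_chunk, len(chunks)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure on {node_id} at chunk {i}")
        for word in chunks[i]:
            reduction_obj[word] += 1  # the reduction step (assumed: word count)
        if (i + 1) % checkpoint_every == 0:
            store.save(node_id, i + 1, reduction_obj)
    return reduction_obj


def run_with_recovery(chunks, fail_at):
    """Run once with an injected failure, then resume from the cached state."""
    store = CheckpointStore()
    try:
        return process_chunks("node-0", chunks, store, fail_at=fail_at)
    except RuntimeError:
        # A replacement worker reloads the last cached reduction object
        # and reprocesses only the chunks after that checkpoint.
        return process_chunks("node-0", chunks, store)
```

In this sketch only the chunks processed since the last checkpoint are redone after a failure, because the cached reduction object already summarizes everything before it.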

Original language	English (US)
Title of host publication	Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
DOIs	https://doi.org/10.1109/IPDPS.2010.5470462
State	Published - 2010
Externally published	Yes
Event	24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010 - Atlanta, GA, United States Duration: Apr 19 2010 → Apr 23 2010

Publication series

Name	Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Conference

Conference	24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Country/Territory	United States
City	Atlanta, GA
Period	4/19/10 → 4/23/10

Keywords

Cloud computing
Data-intensive computing
Fault tolerance
Map-Reduce

ASJC Scopus subject areas

Computational Theory and Mathematics
Software
Theoretical Computer Science

Cite this

Bicer, T., Jiang, W., & Agrawal, G. (2010). Supporting fault tolerance in a data-intensive computing middleware. In Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010 Article 5470462 (Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010). https://doi.org/10.1109/IPDPS.2010.5470462

Supporting fault tolerance in a data-intensive computing middleware. / Bicer, Tekin; Jiang, Wei; Agrawal, Gagan.
Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010. 2010. 5470462 (Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010).

Bicer, T, Jiang, W & Agrawal, G 2010, Supporting fault tolerance in a data-intensive computing middleware. in Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, 5470462, Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010, Atlanta, GA, United States, 4/19/10. https://doi.org/10.1109/IPDPS.2010.5470462

@inproceedings{2b510d446afa489f9f064acd43d8ddf7,
  title = "Supporting fault tolerance in a data-intensive computing middleware",
  abstract = "Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long running nature, and because of the potential for using a large number of nodes for processing massive amounts of data. Fault-tolerance has been an important attribute of map-reduce as well in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault-tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault-tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure-recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in absence and presence of failures.",
  keywords = "Cloud computing, Data-intensive computing, Fault tolerance, Map-Reduce",
  author = "Tekin Bicer and Wei Jiang and Gagan Agrawal",
  year = "2010",
  doi = "10.1109/IPDPS.2010.5470462",
  language = "English (US)",
  isbn = "9781424464432",
  series = "Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010",
  booktitle = "Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010",
  note = "24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010 ; Conference date: 19-04-2010 Through 23-04-2010",
}

TY - GEN

T1 - Supporting fault tolerance in a data-intensive computing middleware

AU - Bicer, Tekin

AU - Jiang, Wei

AU - Agrawal, Gagan

PY - 2010

Y1 - 2010

N2 - Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long running nature, and because of the potential for using a large number of nodes for processing massive amounts of data. Fault-tolerance has been an important attribute of map-reduce as well in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault-tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault-tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure-recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in absence and presence of failures.

AB - Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long running nature, and because of the potential for using a large number of nodes for processing massive amounts of data. Fault-tolerance has been an important attribute of map-reduce as well in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault-tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault-tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure-recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in absence and presence of failures.

KW - Cloud computing

KW - Data-intensive computing

KW - Fault tolerance

KW - Map-Reduce

UR - http://www.scopus.com/inward/record.url?scp=77953974120&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77953974120&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2010.5470462

DO - 10.1109/IPDPS.2010.5470462

M3 - Conference contribution

AN - SCOPUS:77953974120

SN - 9781424464432

T3 - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

BT - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

T2 - 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010

Y2 - 19 April 2010 through 23 April 2010

ER -
