TY - JOUR
T1 - Robust network supercomputing with unreliable workers
AU - Konwar, Kishori M.
AU - Rajasekaran, Sanguthevar
AU - Shvartsman, Alexander A.
N1 - DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2015
Y1 - 2015
N2 - Internet supercomputing is becoming a powerful tool for harnessing massive amounts of computational resources. However in typical master-worker settings the correctness of the results of the computation crucially relies on the ability of the master to depend on the computation performed by the workers. We consider a distributed system consisting of a master process and a collection of synchronous worker processes that can execute tasks on behalf of the master and that may act nefariously by deliberately returning fallacious results. The master decides on the correctness of the results by assigning the same task to several workers. For such a setting with n processes and t tasks we study the problem of collectively performing the tasks under two different failure models: model Fa, where some fraction f of workers that are prone to returning arbitrary (e.g., incorrect) results and the probability p of such faulty behavior are not known a priori to the master, and model Fb when these quantities are known to the master. Previous works assume that the number of faulty processes or the probability of a process acting maliciously is known to the master, e.g., as in model Fb. In this paper this assumption is removed in model Fa. First, for model Fa we provide an efficient algorithm-based on the Stopping Rule Algorithm by Dagum et al. (1995) - that can estimate f and p with (ε,δ)-approximation, for any 0<δ<1 and ε>0. We also provide a randomized algorithm for detecting the faulty processes for model Fa. Finally, we provide algorithms to perform t tasks, with n workers, in models Fa and Fb.
AB - Internet supercomputing is becoming a powerful tool for harnessing massive amounts of computational resources. However in typical master-worker settings the correctness of the results of the computation crucially relies on the ability of the master to depend on the computation performed by the workers. We consider a distributed system consisting of a master process and a collection of synchronous worker processes that can execute tasks on behalf of the master and that may act nefariously by deliberately returning fallacious results. The master decides on the correctness of the results by assigning the same task to several workers. For such a setting with n processes and t tasks we study the problem of collectively performing the tasks under two different failure models: model Fa, where some fraction f of workers that are prone to returning arbitrary (e.g., incorrect) results and the probability p of such faulty behavior are not known a priori to the master, and model Fb when these quantities are known to the master. Previous works assume that the number of faulty processes or the probability of a process acting maliciously is known to the master, e.g., as in model Fb. In this paper this assumption is removed in model Fa. First, for model Fa we provide an efficient algorithm-based on the Stopping Rule Algorithm by Dagum et al. (1995) - that can estimate f and p with (ε,δ)-approximation, for any 0<δ<1 and ε>0. We also provide a randomized algorithm for detecting the faulty processes for model Fa. Finally, we provide algorithms to perform t tasks, with n workers, in models Fa and Fb.
KW - Distributed algorithms
KW - Fault-tolerance
KW - Internet supercomputing
KW - Randomized algorithms
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=84918803604&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84918803604&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2014.10.002
DO - 10.1016/j.jpdc.2014.10.002
M3 - Article
AN - SCOPUS:84918803604
SN - 0743-7315
VL - 75
SP - 81
EP - 92
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -