TY - GEN
T1 - Robust network supercomputing without centralized control
AU - Davtyan, Seda
AU - Konwar, Kishori M.
AU - Shvartsman, Alexander A.
N1 - Funding Information:
This work is supported in part by the NSF award 1017232.
PY - 2011
Y1 - 2011
N2 - Internet supercomputing provides means for harnessing the power of a vast number of interconnected computers. With this come the challenges of marshaling distributed resources and dealing with failures. Traditional centralized approaches employ a master processor and many worker processors that execute a collection of tasks on behalf of the master. Despite the simplicity and advantages of centralized schemes, the master processor is a performance bottleneck and a single point of failure. Additionally, a phenomenon of increasing concern is that workers may return incorrect results, e.g., due to unintended failures, over-clocked processors, or due to workers claiming to have performed work to obtain a high rank in the system. This paper develops an original approach that eliminates the master and instead uses a decentralized algorithm, where workers cooperate in performing tasks. The failure model assumes that the average probability of a worker returning a wrong result is inferior to 1/2. We present a randomized synchronous algorithm for n processors and t tasks (t ≥ n) achieving time complexity Θ(t/n log n) and work Θ(t log n). It is shown that upon termination the workers know the results of all tasks with high probability, and that these results are correct with high probability. The message complexity of the algorithm is Θ(n log n), and the bit complexity is O(tn log 3n). Simulations illustrate the behavior of the algorithm under realistic assumptions.
AB - Internet supercomputing provides means for harnessing the power of a vast number of interconnected computers. With this come the challenges of marshaling distributed resources and dealing with failures. Traditional centralized approaches employ a master processor and many worker processors that execute a collection of tasks on behalf of the master. Despite the simplicity and advantages of centralized schemes, the master processor is a performance bottleneck and a single point of failure. Additionally, a phenomenon of increasing concern is that workers may return incorrect results, e.g., due to unintended failures, over-clocked processors, or due to workers claiming to have performed work to obtain a high rank in the system. This paper develops an original approach that eliminates the master and instead uses a decentralized algorithm, where workers cooperate in performing tasks. The failure model assumes that the average probability of a worker returning a wrong result is inferior to 1/2. We present a randomized synchronous algorithm for n processors and t tasks (t ≥ n) achieving time complexity Θ(t/n log n) and work Θ(t log n). It is shown that upon termination the workers know the results of all tasks with high probability, and that these results are correct with high probability. The message complexity of the algorithm is Θ(n log n), and the bit complexity is O(tn log 3n). Simulations illustrate the behavior of the algorithm under realistic assumptions.
KW - Distributed Algorithms
KW - Fault-Tolerance
KW - Internet Supercomputing
UR - http://www.scopus.com/inward/record.url?scp=84055218390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84055218390&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-25873-2_30
DO - 10.1007/978-3-642-25873-2_30
M3 - Conference contribution
AN - SCOPUS:84055218390
SN - 9783642258725
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 435
EP - 450
BT - Principles of Distributed Systems - 15th International Conference, OPODIS 2011, Proceedings
T2 - 15th International Conference on Principles of Distributed Systems, OPODIS 2011
Y2 - 13 December 2011 through 16 December 2011
ER -