Taming massive distributed datasets: Data sampling using bitmap indices

Yu Su; Gagan Agrawal; Jonathan Woodring; Kary Myers; Joanne Wendelberger; James Ahrens

doi:10.1145/2493123.2462906

Research output: Contribution to conference › Paper › peer-review

21 Scopus citations

Abstract

With growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing sizes are making it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially as neither the memory capacity of parallel machines, memory access speeds, nor disk bandwidths are increasing at the same rate as the computing power. Sampling can be an effective technique to address the above challenges, but it is extremely important to ensure that dataset characteristics are preserved, and the loss of accuracy is within acceptable levels. In this paper, we address the data explosion problems by developing a novel sampling approach, and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, we see that bitmap indexing can not only effectively support subsetting over scientific datasets, but can also help create samples that preserve both value and spatial distributions over scientific datasets. We have developed algorithms for using bitmap indices to sample datasets. We have also shown how only a small amount of additional metadata stored with bitvectors can help assess loss of accuracy with a particular subsampling level. Some of the other properties of this novel approach include: 1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and 2) no data reorganization is needed, once bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
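The abstract does not give the authors' algorithms, so the following is only a minimal sketch of the general idea, under stated assumptions: a binned bitmap index holds one bitvector per value range, a subsetting predicate is applied as a bitwise AND, and roughly the same fraction of set bits is drawn from every bin so that the sample keeps the value distribution of the data. The function names (`build_bitmap_index`, `set_positions`, `sample_from_index`), the use of Python integers as uncompressed bitvectors, and the per-bin random selection are all illustrative assumptions, not the paper's method. The sketch does not attempt to preserve spatial distribution or to compute the accuracy metadata the abstract mentions.

```python
import random


def build_bitmap_index(values, bin_edges):
    """Return one bitvector (a Python int) per value bin.

    Bit i of bitvectors[b] is set when values[i] lies in
    [bin_edges[b], bin_edges[b + 1]); the last bin also includes its upper edge.
    Values outside all bins are left unindexed.
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [0] * n_bins
    for i, v in enumerate(values):
        for b in range(n_bins):
            is_last = b == n_bins - 1
            in_bin = bin_edges[b] <= v < bin_edges[b + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                bitvectors[b] |= 1 << i
                break
    return bitvectors


def set_positions(bitvector):
    """List the indices of the set bits, lowest index first."""
    positions = []
    i = 0
    while bitvector:
        if bitvector & 1:
            positions.append(i)
        bitvector >>= 1
        i += 1
    return positions


def sample_from_index(bitvectors, fraction, mask=None, seed=0):
    """Pick about `fraction` of the set bits from every non-empty bin.

    `mask` is an optional bitvector encoding a subsetting predicate
    (for example, a range of element positions); it is ANDed with each
    bin before sampling, so the data itself is never reorganized.
    """
    rng = random.Random(seed)
    chosen = []
    for bv in bitvectors:
        if mask is not None:
            bv &= mask
        positions = set_positions(bv)
        if not positions:
            continue
        # Keep at least one element per non-empty bin, never more than it holds.
        k = min(len(positions), max(1, round(fraction * len(positions))))
        chosen.extend(rng.sample(positions, k))
    return sorted(chosen)


# Hypothetical data: twelve values that fall into three value bins.
values = [0.1, 0.2, 0.3, 0.4, 5.0, 5.5, 6.0, 6.5, 9.0, 9.5, 9.7, 9.9]
index = build_bitmap_index(values, bin_edges=[0, 3, 7, 10])

# Sample half of the whole dataset, bin by bin.
full_sample = sample_from_index(index, fraction=0.5)

# Sample half of a positional subset: only the first six elements.
first_six = (1 << 6) - 1
subset_sample = sample_from_index(index, fraction=0.5, mask=first_six)
```

Because each bin contributes in proportion to its size, the sampled positions reproduce the histogram of the indexed values at the chosen bin granularity; a production index would normally use compressed bitvectors rather than plain integers.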

Original language	English (US)
Pages	13-24
Number of pages	12
DOIs	https://doi.org/10.1145/2493123.2462906, https://doi.org/10.1145/2462902.2462906
State	Published - 2013
Externally published	Yes
Event	22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013, New York, NY, United States (Jun 17 2013 → Jun 21 2013)

Conference

Conference	22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013
Country/Territory	United States
City	New York, NY
Period	6/17/13 → 6/21/13

Keywords

big data
bitmap indexing
data sampling

ASJC Scopus subject areas

Software

Access to Document

http://dl.acm.org/citation.cfm?doid=2493123.2462906

Cite this

Su, Y., Agrawal, G., Woodring, J., Myers, K., Wendelberger, J., & Ahrens, J. (2013). Taming massive distributed datasets: Data sampling using bitmap indices. 13-24. Paper presented at 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013, New York, NY, United States. https://doi.org/10.1145/2493123.2462906, https://doi.org/10.1145/2462902.2462906

Su, Y, Agrawal, G, Woodring, J, Myers, K, Wendelberger, J & Ahrens, J 2013, 'Taming massive distributed datasets: Data sampling using bitmap indices', Paper presented at 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013, New York, NY, United States, 6/17/13 - 6/21/13 pp. 13-24. https://doi.org/10.1145/2493123.2462906, https://doi.org/10.1145/2462902.2462906

@conference{3aecb2e70cc54dca9bb0382edc18db0f,

title = "Taming massive distributed datasets: Data sampling using bitmap indices",

keywords = "big data, bitmap indexing, data sampling",

author = "Yu Su and Gagan Agrawal and Jonathan Woodring and Kary Myers and Joanne Wendelberger and James Ahrens",

year = "2013",

doi = "10.1145/2493123.2462906",

language = "English (US)",

pages = "13--24",

note = "22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013 ; Conference date: 17-06-2013 Through 21-06-2013",

}

TY - CONF

T1 - Taming massive distributed datasets

T2 - 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2013

AU - Su, Yu

AU - Agrawal, Gagan

AU - Woodring, Jonathan

AU - Myers, Kary

AU - Wendelberger, Joanne

AU - Ahrens, James

PY - 2013

Y1 - 2013

KW - big data

KW - bitmap indexing

KW - data sampling

UR - http://www.scopus.com/inward/record.url?scp=84880071136&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880071136&partnerID=8YFLogxK

U2 - 10.1145/2493123.2462906

DO - 10.1145/2493123.2462906

M3 - Paper

SP - 13

EP - 24

Y2 - 17 June 2013 through 21 June 2013

ER -
