Stratified sampling for data mining on the deep web

Tantan Liu; Fan Wang; Gagan Agrawal

doi:10.1109/ICDM.2010.17

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

In recent years, the deep web has become an extremely popular mode of data dissemination. As with any other data source, mining the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, so mining must be performed on samples of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems: association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
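To make the idea of weighing estimation error against sampling cost more concrete, the sketch below applies the textbook cost-weighted allocation rule for stratified sampling, in which the sample size for stratum h is proportional to N_h * S_h / sqrt(c_h). This is not the allocation method or the greedy stratification from the paper, whose details are not given in the abstract. The function names, the per-stratum costs, and the fixed budget are assumptions made only for illustration.

```python
from math import sqrt


def cost_aware_allocation(sizes, std_devs, unit_costs, budget):
    """Split a sampling budget across strata (hypothetical helper).

    Uses the classic rule n_h proportional to N_h * S_h / sqrt(c_h), where
    N_h is the stratum size, S_h its standard deviation and c_h the assumed
    cost of drawing one sample from it (for a deep web source this could be
    the number of queries needed, which is an assumption here).
    """
    weights = [n * s / sqrt(c) for n, s, c in zip(sizes, std_devs, unit_costs)]
    # Cost incurred if every stratum received exactly its weight in samples.
    cost_per_scale = sum(c * w for c, w in zip(unit_costs, weights))
    if cost_per_scale == 0:
        return [0 for _ in sizes]
    scale = budget / cost_per_scale
    # Round down so the total cost stays within budget; cap at stratum size.
    return [min(n, int(scale * w)) for n, w in zip(sizes, weights)]


def stratified_mean(sizes, sample_means):
    """Combine per-stratum sample means into a population mean estimate."""
    total = sum(sizes)
    return sum(n * m for n, m in zip(sizes, sample_means)) / total


if __name__ == "__main__":
    # Two equally sized, equally variable strata; the second is 4x as costly.
    allocation = cost_aware_allocation(
        sizes=[1000, 1000], std_devs=[1.0, 1.0], unit_costs=[1.0, 4.0], budget=300
    )
    print(allocation)
    print(stratified_mean([1000, 3000], [2.0, 4.0]))
```

With equal sizes and variances, the rule gives the costlier stratum fewer samples, which is the qualitative behaviour a cost-aware allocation is meant to have.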

Original language	English (US)
Title of host publication	Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
Pages	324-333
Number of pages	10
DOIs	https://doi.org/10.1109/ICDM.2010.17
State	Published - 2010
Externally published	Yes
Event	10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW, Australia; Duration: Dec 14 2010 → Dec 17 2010

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Conference

Conference	10th IEEE International Conference on Data Mining, ICDM 2010
Country/Territory	Australia
City	Sydney, NSW
Period	12/14/10 → 12/17/10

ASJC Scopus subject areas

General Engineering

Access to Document

10.1109/ICDM.2010.17

Cite this

Liu, T, Wang, F & Agrawal, G 2010, Stratified sampling for data mining on the deep web. in Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010., 5693986, Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 324-333, 10th IEEE International Conference on Data Mining, ICDM 2010, Sydney, NSW, Australia, 12/14/10. https://doi.org/10.1109/ICDM.2010.17

@inproceedings{d6b73047b4154bc48d58599fbdf02978,

title = "Stratified sampling for data mining on the deep web",

abstract = "In recent years, one mode of data dissemination has become extremely popular, which is the deep web. Like any other data source, data mining on the deep web can produce important insights or summary of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems, which are association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experiment results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.",

author = "Tantan Liu and Fan Wang and Gagan Agrawal",

year = "2010",

doi = "10.1109/ICDM.2010.17",

language = "English (US)",

isbn = "9780769542560",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

pages = "324--333",

booktitle = "Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010",

note = "10th IEEE International Conference on Data Mining, ICDM 2010 ; Conference date: 14-12-2010 Through 17-12-2010",

}

TY - GEN

T1 - Stratified sampling for data mining on the deep web

AU - Liu, Tantan

AU - Wang, Fan

AU - Agrawal, Gagan

PY - 2010

Y1 - 2010

AB - In recent years, one mode of data dissemination has become extremely popular, which is the deep web. Like any other data source, data mining on the deep web can produce important insights or summary of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems, which are association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experiment results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.

UR - http://www.scopus.com/inward/record.url?scp=79951735358&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79951735358&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2010.17

DO - 10.1109/ICDM.2010.17

M3 - Conference contribution

AN - SCOPUS:79951735358

SN - 9780769542560

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 324

EP - 333

BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010

T2 - 10th IEEE International Conference on Data Mining, ICDM 2010

Y2 - 14 December 2010 through 17 December 2010

ER -
