TY - GEN
T1 - Stratified sampling for data mining on the deep web
AU - Liu, Tantan
AU - Wang, Fan
AU - Agrawal, Gagan
PY - 2010
Y1 - 2010
N2 - In recent years, one mode of data dissemination has become extremely popular, which is the deep web. Like any other data source, data mining on the deep web can produce important insights or summary of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems, which are association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experiment results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
AB - In recent years, one mode of data dissemination has become extremely popular, which is the deep web. Like any other data source, data mining on the deep web can produce important insights or summary of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, and therefore, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. In this paper, we target two related data mining problems, which are association mining and differential rule mining. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which processes the query space of a deep web data source recursively, and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experiment results show that our algorithms effectively and consistently reduce sampling costs, compared with a stratified sampling method that only considers estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
UR - http://www.scopus.com/inward/record.url?scp=79951735358&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951735358&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2010.17
DO - 10.1109/ICDM.2010.17
M3 - Conference contribution
AN - SCOPUS:79951735358
SN - 9780769542560
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 324
EP - 333
BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
T2 - 10th IEEE International Conference on Data Mining, ICDM 2010
Y2 - 14 December 2010 through 17 December 2010
ER -