TY - GEN
T1 - Extracting output metadata from scientific deep web data sources
AU - Wang, Fan
AU - Agrawal, Gagan
PY - 2009
Y1 - 2009
N2 - Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that, while each of these two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
AB - Increasingly, many data sources appear as online databases, hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that, while each of these two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
KW - Deep web
KW - Schema extraction
UR - http://www.scopus.com/inward/record.url?scp=77951149820&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77951149820&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2009.41
DO - 10.1109/ICDM.2009.41
M3 - Conference contribution
AN - SCOPUS:77951149820
SN - 9780769538952
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 552
EP - 561
BT - ICDM 2009 - The 9th IEEE International Conference on Data Mining
T2 - 9th IEEE International Conference on Data Mining, ICDM 2009
Y2 - 6 December 2009 through 9 December 2009
ER -