Extracting output metadata from scientific deep web data sources

Fan Wang; Gagan Agrawal

doi:10.1109/ICDM.2009.41

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Increasingly, many data sources appear as online databases hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that while each of the two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
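The difficulty the abstract describes can be illustrated with a toy sketch. This is not the paper's sampling model or mixture model approach; it is a hypothetical simulation, assuming a made-up source (`RECORDS`, `query_source`) that drops NULL-valued attributes from each response, and it approximates the output schema as the union of attribute names observed over several queries.

```python
# Hypothetical toy data source. All names and values here are invented
# for illustration and do not come from the paper.
RECORDS = {
    "gene_a": {"id": 1, "name": "A", "function": "kinase", "pathway": None},
    "gene_b": {"id": 2, "name": "B", "function": None, "pathway": "glycolysis"},
    "gene_c": {"id": 3, "name": "C", "function": None, "pathway": None},
}

# The true output schema, which a client of the source cannot see directly.
FULL_SCHEMA = {"id", "name", "function", "pathway"}


def query_source(term):
    """Return only the non-NULL attributes for a query term."""
    record = RECORDS[term]
    return {key: value for key, value in record.items() if value is not None}


def observed_schema(terms):
    """Union of attribute names seen across the responses to the given terms."""
    seen = set()
    for term in terms:
        seen |= set(query_source(term))
    return seen


def recall(found, truth):
    """Fraction of the true schema attributes that were discovered."""
    return len(found & truth) / len(truth)


one = observed_schema(["gene_c"])
several = observed_schema(["gene_a", "gene_b", "gene_c"])
print(sorted(one), recall(one, FULL_SCHEMA))
print(sorted(several), recall(several, FULL_SCHEMA))
```

In this toy setting a single query recovers only part of the schema, while the union over more queries recovers more of it; choosing which and how many queries to issue is the kind of question the paper's approaches address.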

Original language	English (US)
Title of host publication	ICDM 2009 - The 9th IEEE International Conference on Data Mining
Pages	552-561
Number of pages	10
DOIs	https://doi.org/10.1109/ICDM.2009.41
State	Published - 2009
Externally published	Yes
Event	9th IEEE International Conference on Data Mining, ICDM 2009 - Miami, FL, United States
Duration	Dec 6 2009 → Dec 9 2009

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Conference

Conference	9th IEEE International Conference on Data Mining, ICDM 2009
Country/Territory	United States
City	Miami, FL
Period	12/6/09 → 12/9/09

Keywords

Deep web
Schema extraction

ASJC Scopus subject areas

General Engineering

Access to Document

10.1109/ICDM.2009.41

Cite this

Wang, F & Agrawal, G 2009, Extracting output metadata from scientific deep web data sources. in ICDM 2009 - The 9th IEEE International Conference on Data Mining., 5360281, Proceedings - IEEE International Conference on Data Mining, ICDM, pp. 552-561, 9th IEEE International Conference on Data Mining, ICDM 2009, Miami, FL, United States, 12/6/09. https://doi.org/10.1109/ICDM.2009.41

@inproceedings{a6ec3aab7b8046cd9e453718223a14eb,
title = "Extracting output metadata from scientific deep web data sources",
abstract = "Increasingly, many data sources appear as online databases hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that while each of the two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.",
keywords = "Deep web, Schema extraction",
author = "Fan Wang and Gagan Agrawal",
year = "2009",
doi = "10.1109/ICDM.2009.41",
language = "English (US)",
isbn = "9780769538952",
series = "Proceedings - IEEE International Conference on Data Mining, ICDM",
pages = "552--561",
booktitle = "ICDM 2009 - The 9th IEEE International Conference on Data Mining",
note = "9th IEEE International Conference on Data Mining, ICDM 2009 ; Conference date: 06-12-2009 Through 09-12-2009",
}

TY  - GEN
T1  - Extracting output metadata from scientific deep web data sources
AU  - Wang, Fan
AU  - Agrawal, Gagan
PY  - 2009
Y1  - 2009
N2  - Increasingly, many data sources appear as online databases hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that while each of the two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
AB  - Increasingly, many data sources appear as online databases hidden behind query forms, thus forming the deep web. The popularity of this new medium for data dissemination is leading to new problems in data integration. In particular, to enable data integration from multiple deep web data sources, one needs to obtain the metadata for each of the data sources. Obtaining the metadata, particularly the output schema, can be very challenging. This is because, given an input query, many deep web data sources return only a subset of the output schema attributes, i.e., the ones that have a non-NULL value for the corresponding input. In this paper, we propose two approaches, the sampling model approach and the mixture model approach, to efficiently obtain an approximately complete set of output schema attributes from a deep web data source. Our experiments show that while each of the two approaches has limitations, a hybrid strategy that combines them achieves high recall with good precision for most data sources.
KW  - Deep web
KW  - Schema extraction
UR  - http://www.scopus.com/inward/record.url?scp=77951149820&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=77951149820&partnerID=8YFLogxK
U2  - 10.1109/ICDM.2009.41
DO  - 10.1109/ICDM.2009.41
M3  - Conference contribution
AN  - SCOPUS:77951149820
SN  - 9780769538952
T3  - Proceedings - IEEE International Conference on Data Mining, ICDM
SP  - 552
EP  - 561
BT  - ICDM 2009 - The 9th IEEE International Conference on Data Mining
T2  - 9th IEEE International Conference on Data Mining, ICDM 2009
Y2  - 6 December 2009 through 9 December 2009
ER  -
