We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, which consists of generating the extraction rules needed to obtain data from web pages, and the data integration problem, which consists of integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before the integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data that are exposed by structured and redundant web sources. We introduce novel polynomial-time algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model to real-world websites.
CEUR Workshop Proceedings
Published - Dec 1 2011
1st International Workshop on Searching and Integrating New Web Data Sources - Very Large Data Search, VLDS 2011 - Seattle, WA, United States
Duration: Sep 2 2011 → Sep 2 2011
ASJC Scopus subject areas
- General Computer Science