Fixing weakly annotated Web data using relational models

Fatih Gelgi, Srinivas Vadrevu, Hasan Davulcu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this paper, we present a fast and scalable Bayesian model for improving weakly annotated data, which is typically generated by a (semi-)automated information extraction (IE) system from Web documents. Weakly annotated data suffers from two major problems: it (i) might contain incorrect ontological role assignments, and (ii) might have many missing attributes. Our experimental evaluations with the TAP and RoadRunner data sets, and a collection of 20,000 home pages from university, shopping and sports Web sites, indicate that the model described here can improve the accuracy of role assignments from 40% to 85% for template-driven sites, and from 68% to 87% for non-template-driven sites. The Bayesian model is also shown to be useful for improving the performance of IE systems by informing them with additional domain information.

Original language: English (US)
Title of host publication: Web Engineering - 7th International Conference, ICWE 2007, Proceedings
Publisher: Springer Verlag
Number of pages: 15
ISBN (Print): 3540735968, 9783540735960
State: Published - 2007
Event: 7th International Conference on Web Engineering, ICWE 2007 - Como, Italy
Duration: Jul 16 2007 → Jul 20 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4607 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 7th International Conference on Web Engineering, ICWE 2007


  • Bayesian models
  • Classification
  • Information extraction
  • Weakly annotated data

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

