Retrieving and organizing web pages by "information unit"

Wen Syan Li, Kasim Candan, Quoc Vu, Divyakant Agrawal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

81 Scopus citations


Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.
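As a rough illustration of the information unit idea only, the sketch below searches a toy link graph for the smallest connected group of pages that together contain every query word. It is not the algorithm from the paper: it is an exhaustive search that ignores semantic similarity and progressive processing, and the page names, words, links, and the function names `is_connected` and `smallest_unit` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each page maps to the words it contains,
# and LINKS holds undirected hyperlinks between pages.
PAGES = {
    "home": {"conference"},
    "program": {"schedule"},
    "venue": {"hotel", "map"},
    "news": {"hotel"},
}
LINKS = {("home", "program"), ("home", "venue"), ("news", "program")}


def is_connected(group, links):
    """Return True if the pages in `group` are connected using only links among them."""
    group = set(group)
    start = next(iter(group))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in links:
            # Follow the link in either direction, staying inside the group.
            if a == current and b in group and b not in seen:
                seen.add(b)
                stack.append(b)
            elif b == current and a in group and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen == group


def smallest_unit(query, pages, links):
    """Return the smallest connected group of pages covering all query words, or None.

    Brute force over page subsets by increasing size; only practical for tiny graphs.
    """
    names = sorted(pages)
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            covered = set().union(*(pages[p] for p in group))
            if query <= covered and is_connected(group, links):
                return group
    return None


# No single page contains both words, but a linked group of three pages does.
print(smallest_unit({"schedule", "map"}, PAGES, LINKS))
```

In this toy graph, a page-level search for "schedule" and "map" finds nothing, while the group of `home`, `program`, and `venue` answers the query as one logical document.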

Original language: English (US)
Title of host publication: Proceedings of the 10th International Conference on World Wide Web, WWW 2001
Publisher: Association for Computing Machinery, Inc
Number of pages: 15
ISBN (Print): 1581133480, 9781581133486
State: Published - Apr 1 2001
Externally published: Yes
Event: 10th International Conference on World Wide Web, WWW 2001 - Hong Kong, Hong Kong
Duration: May 1 2001 to May 5 2001


Other: 10th International Conference on World Wide Web, WWW 2001
Country/Territory: Hong Kong
City: Hong Kong


  • Link structures
  • Progressive processing
  • Query relaxation
  • Web proximity search

ASJC Scopus subject areas

  • Computer Networks and Communications

