Efficiently determining the starting sample size for progressive sampling

Baohua Gu, Bing Liu, Feifang Hu, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations


Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns on increasingly larger random samples until model accuracy no longer improves. PS has been shown to be remarkably efficient compared with learning on the entire data set. However, how to set the starting sample size for PS is still an open problem. We show that an improper starting sample size can still make PS computationally expensive, because the learning algorithm is run on a large number of instances (across the sequence of random samples drawn before convergence) and excessive database scans are needed to fetch the sample data. Using a suitable starting sample size can further improve the efficiency of PS. In this paper, we present a statistical approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data. We introduce an information-based measure of this resemblance (Sample Quality) to define the SOSS and show that it can be efficiently obtained in one scan of the data. We prove that learning on a sample of SOSS will produce model accuracy that asymptotically approaches the highest achievable accuracy on the entire data. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for Progressive Sampling.
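The progressive sampling loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the geometric growth schedule, the `tolerance` stopping rule, the toy one-dimensional data, and the threshold learner are all assumptions made for illustration, and the function names (`progressive_sampling`, `make_toy_data`, `make_threshold_learner`) are hypothetical. The sketch does not compute the SOSS; it only shows where a starting size would plug in, via the `start_size` parameter.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature, label): a toy 1-D data set (assumption)


def progressive_sampling(
    data: Sequence[Example],
    train_and_score: Callable[[Sequence[Example]], float],
    start_size: int,
    growth: float = 2.0,       # assumed geometric schedule
    tolerance: float = 0.005,  # assumed convergence threshold
    seed: int = 0,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Learn on growing random samples until accuracy stops improving.

    Returns (final sample size, final accuracy, history of (size, accuracy)).
    """
    if not data:
        raise ValueError("data must be non-empty")
    rng = random.Random(seed)
    pool = list(data)
    size = max(1, min(start_size, len(pool)))
    history: List[Tuple[int, float]] = []
    best = float("-inf")
    while True:
        sample = rng.sample(pool, size)  # fresh random sample of the current size
        acc = train_and_score(sample)
        history.append((size, acc))
        # Stop when the gain falls below tolerance or the data is exhausted.
        if acc - best < tolerance or size == len(pool):
            return size, acc, history
        best = acc
        # Always grow by at least one instance so the loop terminates.
        size = min(len(pool), max(size + 1, int(size * growth)))


def make_toy_data(n: int, seed: int = 0) -> List[Example]:
    """Two Gaussian classes centred at 0 and 2 (illustrative data only)."""
    rng = random.Random(seed)
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def make_threshold_learner(
    holdout: Sequence[Example],
) -> Callable[[Sequence[Example]], float]:
    """Toy learner: threshold at the midpoint of the class means, scored on holdout."""
    def train_and_score(sample: Sequence[Example]) -> float:
        means = {}
        for label in (0, 1):
            xs = [x for x, y in sample if y == label]
            means[label] = sum(xs) / len(xs) if xs else float(label)
        threshold = (means[0] + means[1]) / 2
        correct = sum((x > threshold) == (y == 1) for x, y in holdout)
        return correct / len(holdout)
    return train_and_score


if __name__ == "__main__":
    data = make_toy_data(20000, seed=1)
    learner = make_threshold_learner(make_toy_data(2000, seed=2))
    for n0 in (10, 500):
        size, acc, history = progressive_sampling(data, learner, start_size=n0)
        total = sum(s for s, _ in history)
        print(f"start={n0}: stopped at {size}, accuracy={acc:.3f}, "
              f"instances processed={total}, iterations={len(history)}")
```

The `instances processed` total is the cost the abstract refers to: every sample in the sequence is fetched and learned on, so the choice of `start_size` affects both the number of iterations and the total work before convergence.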

Original language: English (US)
Title of host publication: Machine Learning
Subtitle of host publication: ECML 2001 - 12th European Conference on Machine Learning, Proceedings
Editors: Luc de Raedt, Peter Flach
Publisher: Springer Verlag
Number of pages: 11
ISBN (Print): 3540425365, 9783540425366
State: Published - 2001
Event: 12th European Conference on Machine Learning, ECML 2001 - Freiburg, Germany
Duration: Sep 5 2001 → Sep 7 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 12th European Conference on Machine Learning, ECML 2001

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

