Querying discriminative and representative samples for batch mode active learning

Zheng Wang, Jieping Ye

Research output: Contribution to journal › Article › peer-review

95 Scopus citations


Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound of the true risk, which is approximated by the summation of empirical risk and the complexity of the candidate classifier class. To guarantee a satisfactory learning performance, ERM requires that the training data are i.i.d. sampled from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these data may not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel form of upper bound for the true risk in the active learning setting; by minimizing this upper bound, we develop a practical batch mode active learning method. The proposed formulation involves a nonconvex integer programming optimization problem. We solve it efficiently by an alternating optimization method. Our method is shown to query the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels in the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.
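
The abstract does not state the formulation itself, so the following is only an illustrative sketch of the general idea of trading off uncertainty against representativeness, not the authors' method. It measures representativeness with a biased estimate of the squared maximum mean discrepancy (one of the listed keywords) under a Gaussian kernel, and it swaps in a simple greedy heuristic where the article uses alternating optimization. The function names, the `uncertainty` scores, and the `trade_off` and `gamma` parameters are all assumptions made for illustration.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy of two samples."""
    def mean_k(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)


def greedy_batch(pool, uncertainty, batch_size, trade_off=1.0, gamma=1.0):
    """Hypothetical greedy selection (an assumption, not the article's solver).

    Each step adds the pool index that maximizes
    mean uncertainty of the batch - trade_off * MMD^2(batch, pool).
    """
    chosen = []
    for _ in range(min(batch_size, len(pool))):
        best, best_score = None, -math.inf
        for i in range(len(pool)):
            if i in chosen:
                continue
            idx = chosen + [i]
            batch = [pool[j] for j in idx]
            gain = sum(uncertainty[j] for j in idx) / len(idx)
            score = gain - trade_off * mmd2(batch, pool, gamma)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Two well-separated clusters with equal uncertainty everywhere.
    pool = [[0.0], [0.1], [5.0], [5.1]]
    print(greedy_batch(pool, [0.5] * 4, batch_size=2))
```

With equal uncertainty scores, the discrepancy term alone drives the choice, so a batch of two drawn from two well-separated clusters ends up with one point from each. Raising `trade_off` favors batches that mirror the pool; lowering it favors the most uncertain points.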

Original language: English (US)
Pages (from-to): 17
Number of pages: 1
Journal: ACM Transactions on Knowledge Discovery from Data
Issue number: 3
State: Published - Feb 1 2015


Keywords

  • Active learning
  • Empirical risk minimization
  • Maximum mean discrepancy
  • Representative and discriminative

ASJC Scopus subject areas

  • Computer Science(all)

