Abstract
Empirical risk minimization (ERM) provides a principled guideline for many machine learning and data mining algorithms. Under the ERM principle, one minimizes an upper bound on the true risk, approximated by the sum of the empirical risk and the complexity of the candidate classifier class. To guarantee satisfactory learning performance, ERM requires that the training data be sampled i.i.d. from the unknown source distribution. However, this may not be the case in active learning, where one selects the most informative samples to label, and these samples need not follow the source distribution. In this article, we generalize the ERM principle to the active learning setting. We derive a novel upper bound on the true risk in the active learning setting; by minimizing this upper bound, we develop a practical batch mode active learning method. The resulting formulation is a nonconvex integer programming problem, which we solve efficiently with an alternating optimization method. Our method queries the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. We further extend our method to multiclass active learning by introducing novel pseudolabels for the multiclass case and developing an efficient algorithm. Experiments on benchmark datasets and real-world applications demonstrate the superior performance of our proposed method compared to state-of-the-art methods.
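The abstract describes the method only at a high level. As a rough illustration of the idea it names, the sketch below selects a batch that trades off classifier uncertainty (discriminativeness) against a maximum mean discrepancy (MMD) term keeping the batch representative of the unlabeled pool, optimized by alternating greedy swaps over the integer selection variables. This is a minimal sketch under assumptions, not the authors' actual formulation: the RBF kernel, the `lam` and `gamma` parameters, the precomputed uncertainty scores, and the names `select_batch` and `rbf_kernel` are all hypothetical stand-ins.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def select_batch(X_pool, uncertainty, batch_size, lam=1.0, gamma=1.0, n_iter=20):
    """Hypothetical objective: J(S) = -sum_{i in S} uncertainty[i] + lam * MMD^2(S, pool).
    Minimized over batches S of fixed size by greedy coordinate swaps, a simple
    stand-in for the alternating optimization described in the abstract."""
    n = X_pool.shape[0]
    K = rbf_kernel(X_pool, X_pool, gamma)

    def mmd2(S):
        # Squared MMD between the selected batch and the full pool.
        S = np.asarray(S)
        return K[np.ix_(S, S)].mean() - 2 * K[S, :].mean() + K.mean()

    def J(S):
        return -uncertainty[S].sum() + lam * mmd2(S)

    # Initialize with the most uncertain samples, then refine by swaps.
    selected = list(np.argsort(-uncertainty)[:batch_size])
    for _ in range(n_iter):
        improved = False
        for pos in range(batch_size):
            best_j, best_val = selected[pos], J(selected)
            for j in range(n):
                if j in selected:
                    continue
                cand = selected.copy()
                cand[pos] = j
                val = J(cand)
                if val < best_val:
                    best_j, best_val = j, val
            if best_j != selected[pos]:
                selected[pos] = best_j
                improved = True
        if not improved:  # converged: no swap lowers the objective
            break
    return selected

# Toy usage with a synthetic pool and stand-in uncertainty scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
u = rng.uniform(size=200)  # would come from a classifier in practice
batch = select_batch(X, u, batch_size=10, lam=5.0, gamma=0.5)
```

Larger `lam` pushes the batch toward covering the pool's distribution (representativeness); smaller `lam` favors the individually most uncertain points, mirroring the uncertain-versus-representative trade-off the abstract highlights.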
| Original language | English (US) |
|---|---|
| Pages (from-to) | 17 |
| Number of pages | 1 |
| Journal | ACM Transactions on Knowledge Discovery from Data |
| Volume | 9 |
| Issue number | 3 |
| DOIs | |
| State | Published - Feb 1 2015 |
Keywords
- Active learning
- Empirical risk minimization
- Maximum mean discrepancy
- Representative and discriminative
ASJC Scopus subject areas
- Computer Science (all)