Messing up with BART: Error generation for evaluating data-cleaning algorithms

Patricia C. Arocena; Boris Glavic; Giansalvatore Mecca; Renée J. Miller; Paolo Papotti; Donatello Santoro

doi:10.14778/2850578.2850579

Computing and Augmented Intelligence, School of (IAFSE-SCAI)

Research output: Contribution to journal › Conference article › peer-review

55 Scopus citations

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

Original language	English (US)
Pages (from-to)	36-47
Number of pages	12
Journal	Proceedings of the VLDB Endowment
Volume	9
Issue number	2
DOIs	https://doi.org/10.14778/2850578.2850579
State	Published - 2016
Event	42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India Duration: Sep 5 2016 → Sep 9 2016

ASJC Scopus subject areas

Computer Science (miscellaneous)
General Computer Science

Cite this

@article{dfeef2e9358e4dd19f5dc52244546d3b,

title = "Messing up with BART: Error generation for evaluating data-cleaning algorithms",

abstract = "We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.",

author = "Arocena, {Patricia C.} and Boris Glavic and Giansalvatore Mecca and Miller, {Ren{\'e}e J.} and Paolo Papotti and Donatello Santoro",

year = "2016",

doi = "10.14778/2850578.2850579",

language = "English (US)",

volume = "9",

pages = "36--47",

journal = "Proceedings of the VLDB Endowment",

issn = "2150-8097",

publisher = "Very Large Data Base Endowment Inc.",

number = "2",

note = "42nd International Conference on Very Large Data Bases, VLDB 2016 ; Conference date: 05-09-2016 Through 09-09-2016",

}

TY - JOUR

T1 - Messing up with BART

T2 - 42nd International Conference on Very Large Data Bases, VLDB 2016

AU - Arocena, Patricia C.

AU - Glavic, Boris

AU - Mecca, Giansalvatore

AU - Miller, Renée J.

AU - Papotti, Paolo

AU - Santoro, Donatello

PY - 2016

Y1 - 2016

N2 - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

AB - We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

UR - http://www.scopus.com/inward/record.url?scp=84975824359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975824359&partnerID=8YFLogxK

U2 - 10.14778/2850578.2850579

DO - 10.14778/2850578.2850579

M3 - Conference article

AN - SCOPUS:84975824359

SN - 2150-8097

VL - 9

SP - 36

EP - 47

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

IS - 2

Y2 - 5 September 2016 through 9 September 2016

ER -
