Abstract
We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that srifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
Original language | English (US) |
---|---|
Pages (from-to) | 36-47 |
Number of pages | 12 |
Journal | Proceedings of the VLDB Endowment |
Volume | 9 |
Issue number | 2 |
DOIs | |
State | Published - 2016 |
Event | 42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India Duration: Sep 5 2016 → Sep 9 2016 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Computer Science(all)