Abstract
Clustering is an important aspect of data mining, but clustering high-dimensional mixed-attribute data in a scalable fashion remains a challenging problem. In this paper, we propose CRAFTER, a tree-ensemble clustering algorithm for static datasets, to tackle this problem. CRAFTER handles categorical and numeric attributes simultaneously and scales well with both the dimensionality and the size of datasets. It leverages the advantages of a tree ensemble to handle mixed attributes and high dimensionality, and uses class probability estimates to identify the representative data points for clustering. Through a series of experiments on both synthetic and real datasets, we demonstrate that CRAFTER is superior to Random Forest Clustering (RFC), an existing tree-based clustering method, in terms of both clustering quality and computational cost.
Original language | English (US) |
---|---|
Article number | 8294273 |
Pages (from-to) | 1686-1696 |
Number of pages | 11 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 30 |
Issue number | 9 |
DOIs | |
State | Published - Sep 1 2018 |
Keywords
- Clustering
- categorical attribute
- ensemble method
- high dimensionality
- mixed attributes
- random forest
- static datasets
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics