TY - GEN
T1 - Feature selection for clustering
AU - Dash, Manoranjan
AU - Liu, Huan
N1 - Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 2000.
PY - 2000
Y1 - 2000
N2 - Clustering is an important data mining task. Data mining often concerns large and high-dimensional data but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Different features affect clusters differently, some are important for clusters while others may hinder the clustering task. An efficient way of handling it is by selecting a subset of important features. It helps in finding clusters efficiently, understanding the data better and reducing data size for efficient storage, collection and processing. The task of finding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data where class information is available. For unsupervised data, without class information, often principal components (PCs) are used, but PCs still require all features and they may be difficult to understand. Our approach: first features are ranked according to their importance on clustering and then a subset of important features are selected. For large data we use a scalable method using sampling. Empirical evaluation shows the effectiveness and scalability of our approach for benchmark and synthetic data sets.
AB - Clustering is an important data mining task. Data mining often concerns large and high-dimensional data but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Different features affect clusters differently, some are important for clusters while others may hinder the clustering task. An efficient way of handling it is by selecting a subset of important features. It helps in finding clusters efficiently, understanding the data better and reducing data size for efficient storage, collection and processing. The task of finding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data where class information is available. For unsupervised data, without class information, often principal components (PCs) are used, but PCs still require all features and they may be difficult to understand. Our approach: first features are ranked according to their importance on clustering and then a subset of important features are selected. For large data we use a scalable method using sampling. Empirical evaluation shows the effectiveness and scalability of our approach for benchmark and synthetic data sets.
UR - http://www.scopus.com/inward/record.url?scp=84942747959&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84942747959&partnerID=8YFLogxK
U2 - 10.1007/3-540-45571-x_13
DO - 10.1007/3-540-45571-x_13
M3 - Conference contribution
AN - SCOPUS:84942747959
SN - 3540673822
SN - 9783540673828
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 110
EP - 121
BT - Knowledge Discovery and Data Mining
A2 - Terano, Takao
A2 - Liu, Huan
A2 - Chen, Arbee L.P.
PB - Springer Verlag
T2 - 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2000
Y2 - 18 April 2000 through 20 April 2000
ER -