Feature selection for clustering

Manoranjan Dash, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

177 Scopus citations

Abstract

Clustering is an important data mining task. Data mining often concerns large and high-dimensional data, but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high-dimensionality or both. Different features affect clusters differently: some are important for clusters while others may hinder the clustering task. An efficient way of handling this is to select a subset of important features. This helps in finding clusters efficiently, understanding the data better, and reducing data size for efficient storage, collection and processing. The task of finding original important features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data, where class information is available. For unsupervised data, without class information, principal components (PCs) are often used, but PCs still require all features and may be difficult to understand. Our approach: first, features are ranked according to their importance on clustering, and then a subset of important features is selected. For large data we use a scalable method based on sampling. Empirical evaluation shows the effectiveness and scalability of our approach for benchmark and synthetic data sets.
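
The rank-then-select idea with sampling can be illustrated with a minimal sketch. The abstract does not state the ranking criterion, so the sketch assumes one: a distance-based entropy score, where removing an important feature is expected to raise the entropy of the remaining data. The function names (`entropy_score`, `rank_features`, `select_features`) and the parameters `sample_size`, `seed` and `k` are hypothetical and are not taken from the paper.

```python
import numpy as np

def entropy_score(X):
    """Distance-based entropy of a data set (assumed criterion).

    Pairwise similarities near 0 or 1 give low entropy, which is taken
    here as a sign of more pronounced cluster structure.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    d = dist[np.triu_indices(len(X), k=1)]  # each pair counted once
    mean_d = d.mean()
    if mean_d == 0:
        return 0.0
    alpha = np.log(2) / mean_d              # similarity is 0.5 at the mean distance
    s = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return float(-(s * np.log(s) + (1 - s) * np.log(1 - s)).sum())

def rank_features(X, sample_size=None, seed=0):
    """Rank features from most to least important under the assumed score."""
    X = np.asarray(X, dtype=float)
    if sample_size is not None and sample_size < len(X):
        # Scalability step: score a random sample of rows instead of all rows.
        rng = np.random.default_rng(seed)
        X = X[rng.choice(len(X), size=sample_size, replace=False)]
    # Min-max normalise so that no feature dominates the distances.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span
    # Score the data with each feature removed in turn.
    scores = [entropy_score(np.delete(Xn, j, axis=1)) for j in range(Xn.shape[1])]
    # Higher entropy after removing feature j means feature j mattered more.
    order = np.argsort(scores)[::-1]
    return order, scores

def select_features(X, k, **kwargs):
    """Keep the k top-ranked original features (k is chosen by the user)."""
    order, _ = rank_features(X, **kwargs)
    return np.asarray(X)[:, order[:k]]
```

Unlike principal components, the selected columns are original features, so they stay interpretable. How many features to keep is left to the caller in this sketch.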

Original language: English (US)
Title of host publication: Knowledge Discovery and Data Mining
Subtitle of host publication: Current Issues and New Applications - 4th Pacific-Asia Conference, PAKDD 2000, Proceedings
Editors: Takao Terano, Huan Liu, Arbee L.P. Chen
Publisher: Springer Verlag
Pages: 110-121
Number of pages: 12
ISBN (Print): 3540673822, 9783540673828
DOIs
State: Published - 2000
Externally published: Yes
Event: 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2000 - Kyoto, Japan
Duration: Apr 18 2000 → Apr 20 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1805
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2000
Country/Territory: Japan
City: Kyoto
Period: 4/18/00 → 4/20/00

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
