TY - JOUR
T1 - Eigen-Entropy
T2 - A metric for multivariate sampling decisions
AU - Huang, Jiajing
AU - Yoon, Hyunsoo
AU - Wu, Teresa
AU - Candan, Kasim Selcuk
AU - Pradhan, Ojas
AU - Wen, Jin
AU - O'Neill, Zheng
N1 - Funding Information:
This research was supported by the National Science Foundation under grant number IIP #1827757. The U.S. Government is authorized to reproduce and distribute the work for governmental purposes notwithstanding any copyright annotation by the author(s). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF or the U.S. Government.
Publisher Copyright:
© 2022
PY - 2023/1
Y1 - 2023/1
N2 - Sampling is a technique for identifying a representative data subset that captures the characteristics of the whole dataset. Most existing sampling algorithms require distributional assumptions about the multivariate data, which may not be available beforehand. This study proposes a new metric, Eigen-Entropy (EE), based on information entropy for multivariate datasets. EE is a model-free metric because it is derived from the eigenvalues of the correlation coefficient matrix, without any assumptions about the data distribution. We prove that EE measures the composition of a dataset, such as its heterogeneity or homogeneity. As a result, EE can be used to support sampling decisions, such as which samples and how many samples to consider for the application of interest. To demonstrate the utility of the EE metric, two sets of use cases are considered. The first use case focuses on classification problems with imbalanced datasets, where EE is used to guide the generation of homogeneous samples from minority classes. Using 10 public datasets, it is demonstrated that two oversampling techniques based on the proposed EE method outperform methods reported in the literature in terms of precision, recall, F-measure, and G-mean. In the second experiment, building fault detection is investigated, where EE is used to sample heterogeneous data to support fault detection. Historical normal datasets collected from real building systems are used to construct EE baselines for 14 test cases, and experimental results indicate that the EE method outperforms benchmark methods in terms of recall. We conclude that EE is a viable metric to support sampling decisions.
AB - Sampling is a technique for identifying a representative data subset that captures the characteristics of the whole dataset. Most existing sampling algorithms require distributional assumptions about the multivariate data, which may not be available beforehand. This study proposes a new metric, Eigen-Entropy (EE), based on information entropy for multivariate datasets. EE is a model-free metric because it is derived from the eigenvalues of the correlation coefficient matrix, without any assumptions about the data distribution. We prove that EE measures the composition of a dataset, such as its heterogeneity or homogeneity. As a result, EE can be used to support sampling decisions, such as which samples and how many samples to consider for the application of interest. To demonstrate the utility of the EE metric, two sets of use cases are considered. The first use case focuses on classification problems with imbalanced datasets, where EE is used to guide the generation of homogeneous samples from minority classes. Using 10 public datasets, it is demonstrated that two oversampling techniques based on the proposed EE method outperform methods reported in the literature in terms of precision, recall, F-measure, and G-mean. In the second experiment, building fault detection is investigated, where EE is used to sample heterogeneous data to support fault detection. Historical normal datasets collected from real building systems are used to construct EE baselines for 14 test cases, and experimental results indicate that the EE method outperforms benchmark methods in terms of recall. We conclude that EE is a viable metric to support sampling decisions.
KW - Correlation coefficient
KW - Eigenvalues
KW - Information entropy
KW - Model-free
KW - Sampling
UR - http://www.scopus.com/inward/record.url?scp=85141927735&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141927735&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2022.11.023
DO - 10.1016/j.ins.2022.11.023
M3 - Article
AN - SCOPUS:85141927735
SN - 0020-0255
VL - 619
SP - 84
EP - 97
JO - Information Sciences
JF - Information Sciences
ER -