Addressing Missing Data in GC × GC Metabolomics: Identifying Missingness Type and Evaluating the Impact of Imputation Methods on Experimental Replication

Trenton J. Davis; Tarek R. Firzli; Emily A. Higgins Keppler; Matthew Richardson; Heather D. Bean

doi:10.1021/acs.analchem.1c04093

Life Sciences, School of (SOLS)

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Missing data is a significant issue in metabolomics that is often neglected during data preprocessing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC × GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two GC × GC data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal component analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
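The contrast between single-value imputation and within-replicate imputation can be illustrated with a small sketch. This is not the MetabImpute API (which is an R package) and not the authors' procedure: the function names, the toy peak areas, and the use of a replicate-group mean as the fill value are all assumptions made for illustration only.

```python
from statistics import mean

def half_minimum(values):
    """Single-value imputation: fill every missing entry (None) with half
    of the smallest observed value, ignoring replicate structure."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, nothing to base a fill on
    fill = min(observed) / 2
    return [fill if v is None else v for v in values]

def impute_within_replicates(values, groups):
    """Hypothetical within-replicate rule: fill a missing entry with the mean
    of the observed values from the same replicate group. Groups with no
    observed value are left missing."""
    observed_by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed_by_group.setdefault(g, []).append(v)
    return [
        mean(observed_by_group[g]) if v is None and g in observed_by_group else v
        for v, g in zip(values, groups)
    ]

# Toy peak areas for one feature: three replicates each of samples A and B.
peak_areas = [100.0, None, 110.0, 10.0, 12.0, None]
groups = ["A", "A", "A", "B", "B", "B"]

global_fill = half_minimum(peak_areas)
within_fill = impute_within_replicates(peak_areas, groups)
```

In this toy case the half-minimum fill places a value of 5.0 among sample A replicates that otherwise read about 100, whereas the within-replicate fill stays close to the other replicates of the same sample, which is the intuition behind the reproducibility gain described in the abstract.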

Original language	English (US)
Pages (from-to)	10912-10920
Number of pages	9
Journal	Analytical Chemistry
Volume	94
Issue number	31
DOIs	https://doi.org/10.1021/acs.analchem.1c04093
State	Published - Aug 9 2022

ASJC Scopus subject areas

Analytical Chemistry

Access to Document

10.1021/acs.analchem.1c04093

Cite this

@article{222419aa49214cd3b06cf2f3ad5c1c5a,

title = "Addressing Missing Data in GC × GC Metabolomics: Identifying Missingness Type and Evaluating the Impact of Imputation Methods on Experimental Replication",

abstract = "Missing data is a significant issue in metabolomics that is often neglected during data preprocessing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC × GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two GC × GC data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal component analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.",

author = "Davis, {Trenton J.} and Firzli, {Tarek R.} and {Higgins Keppler}, {Emily A.} and Richardson, Matthew and Bean, {Heather D.}",

year = "2022",

month = aug,

day = "9",

doi = "10.1021/acs.analchem.1c04093",

language = "English (US)",

volume = "94",

pages = "10912--10920",

journal = "Analytical Chemistry",

issn = "0003-2700",

number = "31",

}

TY - JOUR

T1 - Addressing Missing Data in GC × GC Metabolomics

T2 - Identifying Missingness Type and Evaluating the Impact of Imputation Methods on Experimental Replication

AU - Davis, Trenton J.

AU - Firzli, Tarek R.

AU - Higgins Keppler, Emily A.

AU - Richardson, Matthew

AU - Bean, Heather D.

PY - 2022/8/9

Y1 - 2022/8/9

N2 - Missing data is a significant issue in metabolomics that is often neglected during data preprocessing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC × GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two GC × GC data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal component analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.

UR - http://www.scopus.com/inward/record.url?scp=85135949439&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85135949439&partnerID=8YFLogxK

U2 - 10.1021/acs.analchem.1c04093

DO - 10.1021/acs.analchem.1c04093

M3 - Article

AN - SCOPUS:85135949439

SN - 0003-2700

VL - 94

SP - 10912

EP - 10920

JO - Analytical Chemistry

JF - Analytical Chemistry

IS - 31

ER -
