Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis: Missing value imputation using temporal datasets

Wandaliz Torres-García; Steven D. Brown; Roger H. Johnson; Weiwen Zhang; George Runger; Deirdre Meldrum

doi:10.1039/c0mb00260g

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a nonlinear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach was applied to gene expression replicates to remove unwanted variability. Then, we addressed the missing measurements of gene expression through nonlinear smoothing spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
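
The permutation test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the dispersion measure, so the population standard deviation is assumed here, and the function name `permutation_pvalue`, the gene identifiers, and the simulated abundances are all hypothetical placeholders.

```python
import random
import statistics


def permutation_pvalue(values, group_ids, n_perm=2000, seed=0):
    """Estimate how often a random gene set of the same size is at least as
    tight as the known biological group (e.g. an operon).

    Dispersion is measured with the population standard deviation, which is
    an assumption of this sketch rather than a detail taken from the paper.
    """
    rng = random.Random(seed)
    observed = statistics.pstdev([values[g] for g in group_ids])
    all_ids = list(values)
    hits = 0
    for _ in range(n_perm):
        # Draw a random gene set with the same size as the known group.
        sample = rng.sample(all_ids, len(group_ids))
        if statistics.pstdev([values[g] for g in sample]) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical imputed log-abundances; gene names are placeholders.
data_rng = random.Random(1)
imputed = {f"gene{i}": data_rng.gauss(0.0, 1.0) for i in range(200)}

# Make four "operon" members nearly identical so the group is tightly clustered.
operon = ["gene0", "gene1", "gene2", "gene3"]
for offset, gene in enumerate(operon):
    imputed[gene] = 2.0 + 0.01 * offset

p_value = permutation_pvalue(imputed, operon)
print(p_value)
```

A small p-value indicates that the imputed values within the known group are less dispersed than those of random gene sets of the same size, which is the pattern the validation step looks for.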

Original language	English (US)
Pages (from-to)	1093-1104
Number of pages	12
Journal	Molecular BioSystems
Volume	7
Issue number	4
DOIs	https://doi.org/10.1039/c0mb00260g
State	Published - Apr 1 2011

ASJC Scopus subject areas

Biotechnology
Molecular Biology

Cite this

@article{63ec97c3ca5447f48236bd0324df17a4,

title = "Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis: Missing value imputation using temporal datasets",

abstract = "Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a nonlinear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach was applied to gene expression replicates to remove unwanted variability. Then, we addressed the missing measurements of gene expression through nonlinear smoothing spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.",

author = "Torres-Garc{\'i}a, Wandaliz and Brown, {Steven D.} and Johnson, {Roger H.} and Zhang, Weiwen and Runger, George and Meldrum, Deirdre",

year = "2011",

month = apr,

day = "1",

doi = "10.1039/c0mb00260g",

language = "English (US)",

volume = "7",

pages = "1093--1104",

journal = "Molecular BioSystems",

issn = "1742-206X",

publisher = "Royal Society of Chemistry",

number = "4",

}

TY - JOUR

T1 - Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis

T2 - Missing value imputation using temporal datasets

AU - Torres-García, Wandaliz

AU - Brown, Steven D.

AU - Johnson, Roger H.

AU - Zhang, Weiwen

AU - Runger, George

AU - Meldrum, Deirdre

PY - 2011/4/1

Y1 - 2011/4/1

N2 - Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a nonlinear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach was applied to gene expression replicates to remove unwanted variability. Then, we addressed the missing measurements of gene expression through nonlinear smoothing spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.

AB - Despite significant improvements in recent years, currently available proteomic datasets still suffer from a large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a nonlinear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, gene expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective of imputing protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach was applied to gene expression replicates to remove unwanted variability. Then, we addressed the missing measurements of gene expression through nonlinear smoothing spline curve fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constraints (i.e., operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.

UR - http://www.scopus.com/inward/record.url?scp=79952650377&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952650377&partnerID=8YFLogxK

U2 - 10.1039/c0mb00260g

DO - 10.1039/c0mb00260g

M3 - Article

C2 - 21212895

AN - SCOPUS:79952650377

SN - 1742-206X

VL - 7

SP - 1093

EP - 1104

JO - Molecular BioSystems

JF - Molecular BioSystems

IS - 4

ER -
