TY - JOUR
T1 - Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis
T2 - Missing value imputation using temporal datasets
AU - Torres-García, Wandaliz
AU - Brown, Steven D.
AU - Johnson, Roger H.
AU - Zhang, Weiwen
AU - Runger, George
AU - Meldrum, Deirdre
PY - 2011/4/1
Y1 - 2011/4/1
N2 - Despite significant improvements in recent years, proteomic datasets currently available still suffer from large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, genes' expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective to impute protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is performed on gene expression replicates to remove unwanted variability. Then, we focused on the missing measurements of gene expression through a nonlinear Smoothing Splines Curve Fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constrains (i.e. operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
AB - Despite significant improvements in recent years, proteomic datasets currently available still suffer from large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, genes' expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective to impute protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is performed on gene expression replicates to remove unwanted variability. Then, we focused on the missing measurements of gene expression through a nonlinear Smoothing Splines Curve Fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constrains (i.e. operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
UR - http://www.scopus.com/inward/record.url?scp=79952650377&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79952650377&partnerID=8YFLogxK
U2 - 10.1039/c0mb00260g
DO - 10.1039/c0mb00260g
M3 - Article
C2 - 21212895
AN - SCOPUS:79952650377
SN - 1742-206X
VL - 7
SP - 1093
EP - 1104
JO - Molecular BioSystems
JF - Molecular BioSystems
IS - 4
ER -