Metabolite profiles as a reflection of physiological status - a methodological validation
Steinfath, Matthias ; Repsilber, Dirk ; Hische, Manuela ; Schauer, Nicolas ; Fernie, Alisdair R. ; Selbig, Joachim
Journal of Integrative Bioinformatics - JIB (ISSN 1613-4516)
Abstract:
Biological "omics" data comprise numerous variables (metabolites, gene expression, physiological quantities) and comparatively few samples. These samples represent either measurements for slightly different genotypes in identical environments, or for different environmental conditions affecting the same genotype. Given this kind of data, it is natural to ask whether there are measurable associations between molecular variables and the phenotypical or physiological status. To evaluate such correlations we need a model for the functional dependency of the physiological state on the given molecular variables. Supervised machine learning methods such as neural networks, decision trees, or support vector machines may be used to reveal such correlations. The simplest model is a linear one. To investigate the association between molecular and phenotypical variables, we ask whether the correlation between predictor and response is statistically significant, and how much of the phenotypical variance of the response can be explained by a given set of predictors. In a set of molecular data, generally not all variables are relevant for each physiological trait, so the problem of feature selection arises. Different regression methods have been developed to address these questions. Ordinary Least Squares (OLS) yields an unbiased solution, but normally has a high mean square error. In particular, this method includes no dimension reduction and, hence, overfitting is a critical problem. In contrast, Principal Component Regression (PCR) offers such a dimension reduction; however, the principal components are found without considering the response. Partial Least Squares Regression (PLSR) is utilised as an alternative method since it considers the variance within the predictors as well as the covariance between predictors and response, whilst Ridge Regression is a further alternative worthy of consideration.
In our study we applied these methods to data resulting from a tomato metabolite experimental series. Comparing the results for this dataset with the experimentally relevant correlation structure between variables and samples allows us to test the relative merits of the regression methods with respect to the questions raised above. Given certain prerequisite knowledge, it also allows us to conjecture the true biological correlation. Our results show that under most circumstances OLS performs worst with respect to prediction. However, the ranking of the methods seems to change considerably once feature selection is considered. Understanding and discussing these differences is a relevant contribution to the task of choosing a suitable approach to correlation analysis for "omics" datasets with respect to the biological interpretation in question.
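As a rough illustration of the comparison described in the abstract, the following Python sketch fits the four linear methods to simulated data with more variables than samples and reports the mean square prediction error on held-out samples. It is not the authors' code or data: the simulated latent-factor data, the function names, the ridge penalty `lam`, and the number of components `k` are all assumptions chosen for illustration, and PLSR is implemented here with a basic NIPALS-style loop for a single response.

```python
import numpy as np

def fit_ols(X, y):
    # Minimum-norm least-squares solution (no dimension reduction).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_ridge(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def fit_pcr(X, y, k):
    # Regress y on the first k principal component scores of X.
    # The components are chosen from X alone, ignoring y.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])

def fit_pls1(X, y, k):
    # NIPALS-style PLS for one response: each weight vector follows
    # the covariance between the (deflated) predictors and the response.
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)        # deflate predictors
        yk = yk - c * t                 # deflate response
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Simulated data (assumption): 40 samples, 60 variables driven by
# 3 latent factors, with the response depending on the first factor.
rng = np.random.default_rng(0)
n, p = 40, 60
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.3 * rng.normal(size=n)

# Hold out the last 10 samples; center with training means only.
X_tr, X_te, y_tr, y_te = X[:30], X[30:], y[:30], y[30:]
x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
Xc, yc = X_tr - x_mean, y_tr - y_mean

# lam=10.0 and k=3 are illustrative values, not tuned.
models = {
    "OLS": fit_ols(Xc, yc),
    "Ridge": fit_ridge(Xc, yc, lam=10.0),
    "PCR": fit_pcr(Xc, yc, k=3),
    "PLSR": fit_pls1(Xc, yc, k=3),
}

test_mse = {}
for name, beta in models.items():
    pred = (X_te - x_mean) @ beta + y_mean
    test_mse[name] = float(np.mean((y_te - pred) ** 2))
    print(f"{name:5s} test MSE: {test_mse[name]:.3f}")
```

In a real analysis `lam` and `k` would typically be chosen by cross-validation rather than fixed in advance, and the ranking of the methods on simulated data like this need not match the ranking reported for the tomato dataset.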
Institution: 

Faculty of Technology, Research Groups in Informatics 
DDC classification: 

Data processing, computer science, computer systems 
Suggested Citation:
Steinfath, Matthias ; Repsilber, Dirk ; Hische, Manuela ; Schauer, Nicolas ; Fernie, Alisdair R. ; Selbig, Joachim (2006) Metabolite profiles as a reflection of physiological status - a methodological validation.
Journal of Integrative Bioinformatics - JIB (ISSN 1613-4516), 3(2), 2006. Special Issue: 3rd Integrative Bioinformatics Workshop, Harpenden, United Kingdom, 2
OnlineJournal: http://journal.imbio.de/index.php?paper_id=28
URL:
http://biecoll.ub.uni-bielefeld.de/volltexte/2007/209
Also published by Shaker:
Ralf Hofestädt, Thoralf Töpel (eds.). Integrative Bioinformatics - Yearbook 2006. Shaker, 2007.
