Wabnik K, Hvidsten TR, Kedzienska A, Van Leene J, De Jaeger G, Beemster GTS, Komorowski J, Kuiper MTR
Gene expression trends and protein features effectively complement each other in gene function prediction
Bioinformatics: 2009 25(3):322-330


Motivation: Genome-scale ‘omics’ data constitute a potentially rich source of information about biological systems and their function. There is a plethora of tools and methods available to mine omics data. However, the diversity and complexity of different omics data types is a stumbling block for multi-data integration, hence there is a dire need for additional methods to exploit potential synergy from integrated orthogonal data. Rough Sets provide an efficient means to use complex information in classification approaches. Here, we set out to explore the possibilities of Rough Sets to incorporate diverse information sources in a functional classification of unknown genes.

Results: We explored the use of Rough Sets for a novel data integration strategy where gene expression data, protein features and Gene Ontology (GO) annotations were combined to describe general and biologically relevant patterns represented by If-Then rules. The descriptive rules were used to predict the function of unknown genes in Arabidopsis thaliana and Schizosaccharomyces pombe. The If-Then rule models showed success rates of up to 0.89 (discriminative and predictive power for both modeled organisms), whereas models built solely on one data type (protein features or gene expression data) yielded success rates varying from 0.68 to 0.78. Our models were applied to generate classifications for many unknown genes, of which a sizeable number were confirmed either by PubMed literature reports or electronically inferred annotations. Finally, we studied cell cycle protein–protein interactions derived from both tandem affinity purification experiments and in silico experiments in the BioGRID interactome database and found strong experimental evidence for the predictions generated by our models. The results show that our approach can be used to build very robust models that create synergy from integrating gene expression data and protein features.
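
As a rough illustration of how If-Then rules that mix expression trends with protein features might assign a function to an unannotated gene, the following minimal Python sketch applies a handful of rules by simple voting. It is not the authors' Rough Set implementation: the attribute names (`expr_trend_0_2h`, `has_kinase_domain`, `has_tm_helix`), the rules themselves and the class labels are all hypothetical, and rule induction from data is not shown.

```python
from collections import Counter

# Hypothetical If-Then rules (assumed for illustration only).
# Each rule pairs a set of attribute conditions with a GO-style class label.
RULES = [
    ({"expr_trend_0_2h": "increasing", "has_kinase_domain": True}, "cell cycle"),
    ({"expr_trend_0_2h": "increasing", "has_tm_helix": False}, "cell cycle"),
    ({"expr_trend_0_2h": "decreasing", "has_tm_helix": True}, "transport"),
]


def matches(conditions, gene):
    """Return True if the gene satisfies every condition of a rule."""
    return all(gene.get(attr) == value for attr, value in conditions.items())


def classify(gene, rules=RULES):
    """Let every matching rule cast one vote; return the top label or None."""
    votes = Counter(label for conditions, label in rules if matches(conditions, gene))
    if not votes:
        return None  # no rule fires, so the gene stays unclassified
    return votes.most_common(1)[0][0]


# A gene described by both an expression trend and protein features.
unknown_gene = {
    "expr_trend_0_2h": "increasing",
    "has_kinase_domain": True,
    "has_tm_helix": False,
}
predicted = classify(unknown_gene)
```

In this toy setting a rule can fire only when both kinds of evidence agree with its conditions, which is one simple way to picture how the two data types could complement each other.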
