Computational Approaches for Protein Function Prediction: A Survey

Gaurav Pandey (gaurav@cs.umn.edu), Vipin Kumar (kumar@cs.umn.edu), Michael Steinbach (steinbac@cs.umn.edu)

Proteins are the most essential and versatile macromolecules of life, and knowledge of their functions is a crucial link in the development of new drugs, better crops, and even synthetic biochemicals such as biofuels. Experimental procedures for determining protein function are inherently low throughput and are thus unable to annotate a non-trivial fraction of the proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, within a single decade, several hundred articles have been published on this topic. This review aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus to identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction and to identify areas that can benefit from further research.

Systematic discovery of regulatory motifs in conserved regions of the human genome

Xiaohui Xie et al., PNAS | April 24, 2007 | vol. 104 | no. 17 | 7145-7150
Conserved noncoding elements (CNEs) constitute the majority of sequences under purifying selection in the human genome, yet their function remains largely unknown. Experimental evidence suggests that many of these elements play regulatory roles, but little is known about regulatory motifs contained within them. Here we describe a systematic approach to discover and characterize regulatory motifs within mammalian CNEs by searching for long motifs (12-22 nt) with significant enrichment in CNEs and studying their biochemical and genomic properties. Our analysis identifies 233 long motifs (LMs), matching a total of approximately 60,000 conserved instances across the human genome. These motifs include 16 previously known regulatory elements, such as the histone 3'-UTR motif and the neuron-restrictive silencer element, as well as striking examples of novel functional elements. The most highly enriched motif (LM1) corresponds to the X-box motif known from yeast and nematode. We show that it is bound by the RFX1 protein and identify thousands of conserved motif instances, suggesting a broad role for the RFX family in gene regulation. A second group of motifs (LM2*) does not match any previously known motif. We demonstrate by biochemical and computational methods that it defines a binding site for the CTCF protein, which is involved in insulator function to limit the spread of gene activation. We identify nearly 15,000 conserved sites that likely serve as insulators, and we show that nearby genes separated by predicted CTCF sites show markedly reduced correlation in gene expression. These sites may thus partition the human genome into domains of expression.
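
The idea of "enrichment in CNEs" can be illustrated with a much simpler calculation than the one in the paper. The following Python sketch is not the authors' pipeline: it assumes an exact-match motif, two hypothetical lists of sequences (`conserved` and `background`), and a pseudocount of 1 chosen only to avoid division by zero. It compares how densely a motif occurs per base in each set.

```python
def count_occurrences(motif, sequences):
    """Count exact, possibly overlapping, occurrences of motif in all sequences."""
    total = 0
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            total += 1
            # Step forward by one base so overlapping matches are counted.
            start = seq.find(motif, start + 1)
    return total


def fold_enrichment(motif, conserved, background, pseudocount=1):
    """Ratio of per-base motif density in conserved versus background sequences.

    The pseudocount (an assumption of this sketch) is added to both counts
    so that a motif absent from the background does not divide by zero.
    """
    conserved_len = sum(len(s) for s in conserved)
    background_len = sum(len(s) for s in background)
    if conserved_len == 0 or background_len == 0:
        raise ValueError("both sequence sets must contain at least one base")
    conserved_density = (count_occurrences(motif, conserved) + pseudocount) / conserved_len
    background_density = (count_occurrences(motif, background) + pseudocount) / background_len
    return conserved_density / background_density
```

A value well above 1 would suggest the motif is over-represented in the conserved set. A real analysis would also need to allow for degenerate motifs, both strands, and a statistical test rather than a raw ratio.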

Neural Competition and Selection during Memory Formation

Jin-Hee Han, Steven A. Kushner, Adelaide P. Yiu, Christy J. Cole, Anna Matynia, Robert A. Brown, Rachael L. Neve, John F. Guzowski, Alcino J. Silva, Sheena A. Josselyn; Science, 20 April 2007

Competition between neurons is necessary for refining neural circuits during development and may be important for selecting the neurons that participate in encoding memories in the adult brain. To examine neuronal competition during memory formation, we conducted experiments with mice in which we manipulated the function of CREB (adenosine 3',5'-monophosphate response element–binding protein) in subsets of neurons. Changes in CREB function influenced the probability that individual lateral amygdala neurons were recruited into a fear memory trace. Our results suggest a competitive model underlying memory formation, in which eligible neurons are selected to participate in a memory trace as a function of their relative CREB activity at the time of learning.

Comparing Sequences Without Using Alignments

Gilles Didier, Laurent Debomy, Maude Pupin, Ming Zhang, Alexander Grossmann, Claudine Devauchelle and Ivan Laprevotte; BMC Bioinformatics 2006, 7:535

In general, the construction of trees is based on sequence alignments. This procedure, however, leads to loss of information when parts of sequence alignments (for instance ambiguous regions) are deleted before tree building. To overcome this difficulty, one of us previously introduced a new and rapid algorithm that calculates dissimilarity matrices between sequences without preliminary alignment. In this paper, HIV (Human Immunodeficiency Virus) and SIV (Simian Immunodeficiency Virus) sequence data are used to evaluate this method. The program produces tree topologies that are identical to those obtained by a combination of standard methods detailed in the HIV Sequence Compendium. Manual alignment editing is not necessary at any stage. Furthermore, only one user-specified parameter is needed for constructing trees.
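
To make the notion of an alignment-free dissimilarity matrix concrete, here is a minimal Python sketch. It is not the algorithm evaluated in the paper; it swaps in a generic and widely used stand-in, the Euclidean distance between k-mer frequency profiles. The function names and the default `k=3` are assumptions of this sketch, which, like the method described above, exposes a single user-specified parameter.

```python
import math
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k):
    """Return the relative frequency of each overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def dissimilarity_matrix(seqs, k=3):
    """Build a symmetric dissimilarity matrix from a dict of name -> sequence.

    No alignment is computed at any point; only k-mer counts are compared.
    """
    names = list(seqs)
    profiles = {name: kmer_profile(seqs[name], k) for name in names}
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = kmer_distance(profiles[a], profiles[b])
        matrix[a][b] = matrix[b][a] = d
    return names, matrix
```

The resulting matrix could then be passed to any distance-based tree-building method, such as neighbor joining, which is how a dissimilarity matrix typically leads to a tree topology.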

Computational Inference of Neural Information Flow Networks

Smith VA, Yu J, Smulders TV, Hartemink AJ, Jarvis ED (2006) Computational inference of neural information flow networks. PLoS Comput Biol 2(11): e161. doi:10.1371/journal.pcbi.0020161

One of the challenges in the area of brain research is to decipher networks describing the flow of information among communicating neurons in the form of electrophysiological signals. These networks are thought to be responsible for perceiving and learning about the environment, as well as producing behavior. Monitoring these networks is limited by the number of electrodes that can be placed in the brain of an awake animal, while inferring and reasoning about these networks is limited by the availability of appropriate computational tools. Here, Smith and Yu and colleagues begin to address these issues by implanting microelectrode arrays in the auditory pathway of freely moving songbirds and by analyzing the data using new computational tools they have designed for deciphering networks. The authors find that a dynamic Bayesian network algorithm they developed to decipher gene regulatory networks from gene expression data effectively infers putative information flow networks in the brain from microelectrode array data. The networks they infer conform to known anatomy and other biological properties of the auditory system and offer new insight into how the auditory system processes natural and synthetic sound. The authors believe that their results represent the first validated study of the inference of information flow networks in the brain.

Atlas – a data warehouse for integrative bioinformatics

Sohrab P Shah, Yong Huang, Tao Xu, Macaire MS Yuen, John Ling and BF Francis Ouellette
UBC Bioinformatics Centre, University of British Columbia, Vancouver, BC, Canada

BMC Bioinformatics 2005, 6:34 doi:10.1186/1471-2105-6-34

Published 21 February 2005

Abstract

Background

We present a biological data warehouse called Atlas that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. The goal of the system is to provide data, as well as a software infrastructure for bioinformatics research and development.

Description

The Atlas system is based on relational data models that we developed for each of the source data types. Data stored within these relational models are managed through Structured Query Language (SQL) calls that are implemented in a set of Application Programming Interfaces (APIs). The APIs are available in three languages: C++, Java, and Perl. The methods in these API libraries are used to construct a set of loader applications, which parse and load the source datasets into the Atlas database, and a set of toolbox applications, which facilitate data retrieval. Atlas stores and integrates local instances of GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Database of Interacting Proteins (DIP), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), LocusLink, Entrez Gene and HomoloGene. The retrieval APIs and toolbox applications are critical components that offer end-users flexible, easy, integrated access to these data. We present use cases that use Atlas to integrate these sources for genome annotation, inference of molecular interactions across species, and gene-disease associations.
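
The loader-and-retrieval pattern described above can be sketched in a few lines. The example below is purely illustrative and does not reproduce the Atlas schema or its APIs: it uses Python and an in-memory SQLite database as stand-ins for the C++, Java, and Perl libraries, and the table names, column names, function names, and accession identifiers are all hypothetical.

```python
import sqlite3

# Hypothetical two-table schema: proteins and pairwise interactions.
SCHEMA = """
CREATE TABLE protein (
    accession TEXT PRIMARY KEY,
    source_db TEXT NOT NULL,
    taxon_id  INTEGER NOT NULL
);
CREATE TABLE interaction (
    protein_a TEXT NOT NULL REFERENCES protein(accession),
    protein_b TEXT NOT NULL REFERENCES protein(accession),
    source_db TEXT NOT NULL
);
"""


def create_warehouse():
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def load_proteins(conn, records):
    """Loader: insert (accession, source_db, taxon_id) tuples."""
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", records)
    conn.commit()


def load_interactions(conn, records):
    """Loader: insert (protein_a, protein_b, source_db) tuples."""
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", records)
    conn.commit()


def partners_of(conn, accession):
    """Retrieval: list interaction partners of a protein with their taxon IDs."""
    return conn.execute(
        """
        SELECT p.accession, p.taxon_id, i.source_db
        FROM interaction AS i
        JOIN protein AS p
          ON p.accession = CASE WHEN i.protein_a = :acc
                                THEN i.protein_b ELSE i.protein_a END
        WHERE i.protein_a = :acc OR i.protein_b = :acc
        ORDER BY p.accession
        """,
        {"acc": accession},
    ).fetchall()
```

Keeping the SQL inside small loader and retrieval functions means callers never write queries themselves, which is the separation the description attributes to the API layer. The foreign keys illustrate, in miniature, what it means for a common data model to enforce relationships between data types.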

Conclusion

The Atlas biological data warehouse serves as data infrastructure for bioinformatics research and development. It forms the backbone of the research activities in our laboratory and facilitates the integration of disparate, heterogeneous biological data sources, enabling new scientific inferences. Atlas achieves integration of diverse data sets at two levels. First, Atlas stores data of similar types using common data models, enforcing the relationships between data types. Second, integration is achieved through a combination of APIs, ontology, and tools. The Atlas software is freely available under the GNU General Public License at http://bioinformatics.ubc.ca/atlas/