Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Issue 3 • Date May-June 2011

  • [Front cover]

    Page(s): c1
  • [Inside front cover]

    Page(s): c2
  • A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

    Page(s): 577 - 591

    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

  • A Comprehensive Statistical Model for Cell Signaling

    Page(s): 592 - 606

    Protein signaling networks play a central role in transcriptional regulation and the etiology of many diseases. Statistical methods, particularly Bayesian networks, have been widely used to model cell signaling, mostly for model organisms and with a focus on uncovering connectivity rather than inferring aberrations. Extensions to mammalian systems have not yielded compelling results, likely due to greatly increased complexity and limited proteomic measurements in vivo. In this study, we propose a comprehensive statistical model that is anchored to a predefined core topology, has limited complexity due to parameter sharing, and uses microarray data of mRNA transcripts as the only observable components of signaling. Specifically, we account for cell heterogeneity and a multilevel process, representing signaling as a Bayesian network at the cell level, modeling measurements as ensemble averages at the tissue level, and incorporating patient-to-patient differences at the population level. Motivated by the goal of identifying individual protein abnormalities as potential therapeutic targets, we applied our method to the RAS-RAF network using a breast cancer study with 118 patients. We demonstrated rigorous statistical inference, established reproducibility through simulations, and showed the ability to recover receptor status from available microarray data.

  • A Fast Hierarchical Clustering Algorithm for Functional Modules Discovery in Protein Interaction Networks

    Page(s): 607 - 620

    With advances in technologies for predicting protein interactions, huge data sets portrayed as networks have become available. Identification of functional modules from such networks is crucial for understanding principles of cellular organization and function. However, protein interaction data produced by high-throughput experiments are generally associated with high false positive rates, which makes it difficult to identify functional modules accurately. In this paper, we propose a fast hierarchical clustering algorithm, HC-PIN, based on the local metric of edge clustering value, which can be used in both unweighted and weighted networks. HC-PIN is applied to the yeast protein interaction network, and the identified modules are validated by all three types of Gene Ontology (GO) terms: Biological Process, Molecular Function, and Cellular Component. The experimental results show that HC-PIN is not only robust to false positives but can also discover functional modules with low density. The identified modules are statistically significant in terms of the three types of GO annotations. Moreover, by varying its parameter's value, HC-PIN can uncover the hierarchical organization of functional modules, which corresponds approximately to the hierarchical structure of GO annotations. Compared to previous competing algorithms, HC-PIN is faster and more accurate.

  • A Max-Flow-Based Approach to the Identification of Protein Complexes Using Protein Interaction and Microarray Data

    Page(s): 621 - 634

    The emergence of high-throughput technologies has led to abundant protein-protein interaction (PPI) data and microarray gene expression profiles, and provides a great opportunity for the identification of novel protein complexes using computational methods. By combining these two types of data, we propose a novel Graph Fragmentation Algorithm (GFA) for protein complex identification. Adapted from a classical max-flow algorithm for finding the (weighted) densest subgraphs, GFA first finds large (weighted) dense subgraphs in a protein-protein interaction network and then breaks each such subgraph into fragments iteratively, weighting its nodes according to their corresponding log-fold changes in the microarray data, until the fragment subgraphs are sufficiently small. Our tests on three widely used protein-protein interaction data sets and comparisons with several recent methods for protein complex identification demonstrate the strong performance of our method in predicting novel protein complexes in terms of its specificity and efficiency. Given the high specificity (or precision) that our method has achieved, we conjecture that our prediction results imply more than 200 novel protein complexes.

  • A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks

    Page(s): 635 - 649

    Recently, much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here, we present an efficient, practical algorithm for reconstructing level-1 phylogenetic networks (a type of network slightly more general than a phylogenetic tree) from triplets. Our algorithm has been made publicly available as the program Lev1athan. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines, namely an exponential-time exact algorithm and a greedy algorithm, both of which are of independent theoretical interest. Most importantly, Lev1athan runs in polynomial time and always constructs a level-1 network. If the data are consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some level-1 network, it will find such a network. The potential of Lev1athan is explored by means of an extensive simulation study and a biological data set. One of our conclusions is that Lev1athan is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.

  • A Theoretical Analysis of the Prodrug Delivery System for Treating Antibiotic-Resistant Bacteria

    Page(s): 650 - 658

    Simulations were carried out to analyze a promising new antimicrobial treatment strategy for targeting antibiotic-resistant bacteria called the beta-lactamase-dependent prodrug delivery system. In this system, the antibacterial drugs are delivered as inactive precursors that only become activated after contact with an enzyme characteristic of many species of antibiotic-resistant bacteria (beta-lactamase enzyme). The addition of an activation step contributes an extra layer of complexity to the system that can lead to unexpected emergent behavior. In order to optimize for treatment success and minimize the risk of resistance development, there must be a clear understanding of the system dynamics taking place and how they impact on the overall response. It makes sense to use a systems biology approach to analyze this method because it can facilitate a better understanding of the complex emergent dynamics arising from diverse interactions in populations. This paper contains an initial theoretical examination of the dynamics of this system of activation and an assessment of its therapeutic potential from a theoretical standpoint using an agent-based modeling approach. It also contains a case study comparison with real-world results from an experimental study carried out on two prodrug candidate compounds in the literature.

  • Cancer Classification from Gene Expression Data by NPPC Ensemble

    Page(s): 659 - 671

    The most important application of microarrays in gene expression analysis is to classify unknown tissue samples according to their gene expression levels with the help of known sample expression levels. In this paper, we present a nonparallel plane proximal classifier (NPPC) ensemble that achieves higher classification accuracy on test samples in a computer-aided diagnosis (CAD) framework than a single NPPC model. For each data set, only a few genes are selected by using a mutual information criterion. Then a genetic algorithm-based simultaneous feature and model selection scheme is used to train a number of NPPC expert models in multiple subspaces by maximizing cross-validation accuracy. The members of the ensemble are selected by the performance of the trained models on a validation set. Besides the usual majority voting method, we have introduced a minimum average proximity-based decision combiner for the NPPC ensemble. The effectiveness of the NPPC ensemble and the proposed new approach of combining decisions for cancer diagnosis are studied and compared with a support vector machine (SVM) classifier in a similar framework. Experimental results on cancer data sets show that the NPPC ensemble offers testing accuracy comparable to that of an SVM ensemble with reduced training time on average.

  • Component-Based Modeling and Reachability Analysis of Genetic Networks

    Page(s): 672 - 682

    Genetic regulatory networks usually encompass a multitude of complex, interacting feedback loops. Being able to model and analyze their behavior is crucial for understanding their function. However, state space explosion is becoming a limiting factor in the formal analysis of genetic networks. This paper explores a modular approach for verification of reachability properties. A framework for component-based modeling of genetic regulatory networks, based on a modular discrete abstraction, is introduced. Then a compositional algorithm to efficiently analyze reachability properties of the model is proposed. A case study on embryonic cell differentiation involving several hundred cells shows the potential of this approach.

  • Estimating Genome-Wide Gene Networks Using Nonparametric Bayesian Network Models on Massively Parallel Computers

    Page(s): 683 - 697

    We present a novel algorithm to estimate genome-wide gene networks consisting of more than 20,000 genes from gene expression data using nonparametric Bayesian networks. Due to the difficulty of learning Bayesian network structures, existing algorithms cannot be applied to more than a few thousand genes. Our algorithm overcomes this limitation by repeatedly estimating subnetworks in parallel for genes selected by neighbor node sampling. Through numerical simulation, we confirmed that our algorithm outperformed a heuristic algorithm in a shorter time. We applied our algorithm to microarray data from human umbilical vein endothelial cells (HUVECs) treated with siRNAs, to construct a human genome-wide gene network, which we compared to a small gene network estimated for the genes extracted using a traditional bioinformatics method. The results showed that our genome-wide gene network contains many features of the small network, as well as others that could not be captured during the small network estimation. The results also revealed master-regulator genes that are not in the small network but that control many of the genes in the small network. These analyses were impossible to realize without our proposed algorithm.

  • FEAST: Sensitive Local Alignment with Multiple Rates of Evolution

    Page(s): 698 - 709

    We present a pairwise local aligner, FEAST, which uses two new techniques: a sensitive extension algorithm for identifying homologous subsequences, and a descriptive probabilistic alignment model. We also present a new procedure for training alignment parameters and apply it to the human and mouse genomes, producing a better parameter set for these sequences. Our extension algorithm identifies homologous subsequences by considering all evolutionary histories. It has higher maximum sensitivity than Viterbi extensions, and better balances specificity. We model alignments with several submodels, each with unique statistical properties, describing strongly similar and weakly similar regions of homologous DNA. Training parameters using two submodels produces superior alignments, even when we align with only the parameters from the weaker submodel. Our extension algorithm combined with our new parameter set achieves sensitivity 0.59 on synthetic tests. In contrast, LASTZ with default settings achieves sensitivity 0.35 with the same false positive rate. Using the weak submodel as parameters for LASTZ increases its sensitivity to 0.59 with high error. FEAST is available at http://monod.uwaterloo.ca/feast/.

  • Identifiability of Two-Tree Mixtures for Group-Based Models

    Page(s): 710 - 722

    Phylogenetic data arising on two possibly different tree topologies might be mixed through several biological mechanisms, including incomplete lineage sorting or horizontal gene transfer in the case of different topologies, or simply different substitution processes on characters in the case of the same topology. Recent work on a 2-state symmetric model of character change showed that for 4 taxa, such a mixture model has nonidentifiable parameters, and thus, it is theoretically impossible to determine the two tree topologies from any amount of data under such circumstances. Here, the question of identifiability is investigated for two-tree mixtures of the 4-state group-based models, which are more relevant to DNA sequence data. Using algebraic techniques, we show that the tree parameters are identifiable for the JC and K2P models. We also prove that generic substitution parameters for the JC mixture models are identifiable, and for the K2P and K3P models obtain generic identifiability results for mixtures on the same tree. This indicates that the full phylogenetic signal remains in such mixtures, and the 2-state symmetric result is thus a misleading guide to the behavior of other models.

  • Incorporating Nonlinear Relationships in Microarray Missing Value Imputation

    Page(s): 723 - 731

    Microarray gene expression data often contain missing values. Accurate estimation of the missing values is important for downstream data analyses that require complete data. Nonlinear relationships between gene expression levels have not been well-utilized in missing value imputation. We propose an imputation scheme based on nonlinear dependencies between genes. By simulations based on real microarray data, we show that incorporating nonlinear relationships could improve the accuracy of missing value imputation, both in terms of normalized root-mean-squared error and in terms of the preservation of the list of significant genes in statistical testing. In addition, we studied the impact of artificial dependencies introduced by data normalization on the simulation results. Our results suggest that methods relying on global correlation structures may yield overly optimistic simulation results when the data have been subjected to row (gene)-wise mean removal.

  • Manipulating the Steady State of Metabolic Pathways

    Page(s): 732 - 747

    Metabolic pathways show the complex interactions among enzymes that transform chemical compounds. The state of a metabolic pathway can be expressed as a vector, which denotes the yield of the compounds or the flux in that pathway at a given time. The steady state is a state that remains unchanged over time. Altering the state of the metabolism is very important for many applications such as biomedicine, biofuels, food industry, and cosmetics. The goal of the enzymatic target identification problem is to identify the set of enzymes whose knockouts lead the metabolism to a state that is close to a given goal state. Given that the size of the search space is exponential in the number of enzymes, the target identification problem is very computationally intensive. We develop efficient algorithms to solve the enzymatic target identification problem in this paper. Unlike existing algorithms, our method works for a broad set of metabolic network models. We measure the effect of the knockouts of a set of enzymes as a function of the deviation of the steady state of the pathway after their knockouts from the goal state. We develop two algorithms to find the enzyme set with minimal deviation from the goal state. The first one is a traversal approach that explores possible solutions in a systematic way using a branch and bound method. The second one uses genetic algorithms to derive good solutions from a set of alternative solutions iteratively. Unlike the former one, this one can run for very large pathways. Our experiments show that our algorithms' results follow those obtained in vitro in the literature from a number of applications. They also show that the traversal method is a good approximation of the exhaustive search algorithm and it is up to 11 times faster than the exhaustive one. This algorithm runs efficiently for pathways with up to 30 enzymes. For large pathways, our genetic algorithm can find good solutions in less than 10 minutes.

  • Multitask Learning for Protein Subcellular Location Prediction

    Page(s): 748 - 759

    Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Thus, accurate prediction of subcellular localizations of proteins can help the prediction of protein functions and genome annotations, as well as the identification of drug targets. Machine learning methods such as Support Vector Machines (SVMs) have been used in the past for the problem of protein subcellular localization, but have been shown to suffer from a lack of annotated training data in each species under study. To overcome this data sparsity problem, we observe that because some of the organisms may be related to each other, there may be some commonalities across different organisms that can be discovered and used to help boost the data in each localization task. In this paper, we formulate the protein subcellular localization problem as one of multitask learning across different organisms. We adapt and compare two specializations of the multitask learning algorithms on 20 different organisms. Our experimental results show that multitask learning performs much better than the traditional single-task methods. Among the different multitask learning methods, we found that the multitask kernels and supertype kernels under multitask learning that share parameters perform slightly better than multitask learning by sharing latent features. The most significant improvement in terms of localization accuracy is about 25 percent. We find that if the organisms are very different or are only remotely related from a biological point of view, then jointly training the multiple models cannot lead to significant improvement. However, if they are closely related biologically, multitask learning can do much better than individual learning.

  • Peakbin Selection in Mass Spectrometry Data Using a Consensus Approach with Estimation of Distribution Algorithms

    Page(s): 760 - 774

    Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDA), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and another two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page, which can be found at http://www.sc.ehu.es/ccwbayes/members/ruben/ms, includes extended info and results, in addition to Matlab scripts and references.

  • Prediction of Protein Functions with Gene Ontology and Interspecies Protein Homology Data

    Page(s): 775 - 784

    Accurate computational prediction of protein functions increasingly relies on network-inspired models for protein function transfer. This task can become challenging for proteins isolated in their own network or those with poor or uncharacterized neighborhoods. Here, we present a novel probabilistic chain-graph-based approach for predicting protein functions that builds on connecting networks of two (or more) different species by links of high interspecies sequence homology. In this way, proteins are able to “exchange” functional information with their neighbors-homologs from a different species. The knowledge of interspecies relationships, such as the sequence homology, can become crucial in cases of limited information from other sources of data, including the protein-protein interactions or cellular locations of proteins. We further enhance our model to account for the Gene Ontology dependencies by linking multiple but related functional ontology categories within and across multiple species. The resulting networks are of significantly higher complexity than most traditional protein network models. We comprehensively benchmark our method by applying it to the two largest protein networks, the Yeast and the Fly. The joint Fly-Yeast network provides substantial improvements in precision, accuracy, and false positive rate over networks that consider either of the sources in isolation. At the same time, the new model retains computational efficiency similar to that of the simpler networks.

  • Regular Networks Can be Uniquely Constructed from Their Trees

    Page(s): 785 - 796

    A rooted acyclic digraph N with labeled leaves displays a tree T when there exists a way to select a unique parent of each hybrid vertex resulting in the tree T. Let Tr(N) denote the set of all trees displayed by the network N. In general, there may be many other networks M, such that Tr(M) = Tr(N). A network is regular if it is isomorphic with its cover digraph. If N is regular and D is a collection of trees displayed by N, this paper studies some procedures to try to reconstruct N given D. If the input is D = Tr(N), one procedure is described, which will reconstruct N. Hence, if N and M are regular networks and Tr(N) = Tr(M), it follows that N = M, proving that a regular network is uniquely determined by its displayed trees. If D is a (usually very much smaller) collection of displayed trees that satisfies certain hypotheses, modifications of the procedure will still reconstruct N given D.

  • 3D Shape Reconstruction of Loop Objects in X-Ray Protein Crystallography

    Page(s): 797 - 807

    Knowledge of the shape of crystals can benefit data collection in X-ray crystallography. A preliminary step is the determination of the loop object, i.e., the shape of the loop holding the crystal. Based on the standard set-up of experimental X-ray stations for protein crystallography, the paper reviews a reconstruction method merely requiring 2D object contours and presents a dedicated novel algorithm. Properties of the object surface (e.g., texture) and depth information do not have to be considered. The complexity of the reconstruction task is significantly reduced by slicing the 3D object into parallel 2D cross-sections. The shape of each cross-section is determined using support lines forming polygons. The slicing technique allows the reconstruction of concave surfaces perpendicular to the direction of projection. In spite of the low computational complexity, the reconstruction method is resilient to noisy object projections caused by imperfections in the image-processing system extracting the contours. The algorithm developed here has been successfully applied to the reconstruction of shapes of loop objects in X-ray crystallography.

  • TCLUST: A Fast Method for Clustering Genome-Scale Expression Data

    Page(s): 808 - 818

    Genes with a common function are often hypothesized to have correlated expression levels in mRNA expression data, motivating the development of clustering algorithms for gene expression data sets. We observe that existing approaches do not scale well for large data sets, and indeed did not converge for the data set considered here. We present a novel clustering method, TCLUST, that exploits coconnectedness to efficiently cluster large, sparse expression data. We compare our approach with two existing clustering methods, CAST and K-means, which have been previously applied to clustering of gene-expression data with good performance results. Using a number of metrics, TCLUST is shown to be superior to or at least competitive with the other methods, while being much faster. We have applied this clustering algorithm to a genome-scale gene-expression data set and used gene set enrichment analysis to discover highly significant biological clusters.

  • TRIAL: A Tool for Finding Distant Structural Similarities

    Page(s): 819 - 831

    Finding structural similarities in distantly related proteins can reveal functional relationships that cannot be identified using sequence comparison. Given two proteins A and B and a threshold ϵ Å, we develop an algorithm, TRiplet-based Iterative ALignment (TRIAL), for computing the transformation of B that maximizes the number of aligned residues such that the root mean square deviation (RMSD) of the alignment is at most ϵ Å. Our algorithm is designed with the specific goal of effectively handling proteins with low similarity in primary structure, where existing algorithms perform particularly poorly. Experiments show that our method outperforms existing methods. TRIAL alignment brings the secondary structures of distantly related proteins to similar orientations. It also finds a larger number of secondary structure matches at lower RMSD values and increased overall alignment lengths. Its classification accuracy is up to 63 percent better than other methods, including CE and DALI. TRIAL successfully aligns 83 percent of the residues from the smaller protein in reasonable time while other methods align only 29 to 65 percent of the residues for the same set of proteins.
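
    The RMSD constraint in the abstract above can be illustrated with a minimal sketch. It only evaluates the RMSD of an already superimposed, already paired set of residue coordinates and checks it against the threshold ϵ; it is not the TRIAL algorithm, and the names `rmsd` and `within_threshold` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD (same unit as the input, e.g. angstroms) over paired coordinates.

    Assumes coords_b has already been transformed onto coords_a and that
    position i in one list is aligned to position i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared / len(coords_a))


def within_threshold(coords_a: Sequence[Point],
                     coords_b: Sequence[Point],
                     epsilon: float) -> bool:
    """True if the paired alignment satisfies RMSD <= epsilon."""
    return rmsd(coords_a, coords_b) <= epsilon
```

    In this reading, an alignment method searches for the transformation and residue pairing that make the paired lists as long as possible while `within_threshold` still holds.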

  • True Path Rule Hierarchical Ensembles for Genome-Wide Gene Function Prediction

    Page(s): 832 - 847

    Gene function prediction is a complex computational problem, characterized by several items: the number of functional classes is large, and a gene may belong to multiple classes; functional classes are structured according to a hierarchy; classes are usually unbalanced, with more negative than positive examples; class labels can be uncertain and the annotations largely incomplete; to improve the predictions, multiple sources of data need to be properly integrated. In this contribution, we focus on the first three items, and, in particular, on the development of a new method for the hierarchical genome-wide and ontology-wide gene function prediction. The proposed algorithm is inspired by the “true path rule” (TPR) that governs both the Gene Ontology and FunCat taxonomies. According to this rule, the proposed TPR ensemble method is characterized by a two-way asymmetric flow of information that traverses the graph-structured ensemble: positive predictions for a node recursively influence its ancestors, while negative predictions influence its descendants. Cross-validated results with the model organism S. cerevisiae, using seven different sources of biomolecular data, and a theoretical analysis of the TPR algorithm show the effectiveness and the drawbacks of the proposed approach.
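
    The two-way flow described in the abstract above can be sketched roughly as follows. This is an illustrative reading on a tree-shaped hierarchy, not the paper's exact formulation: the averaging rule, the capping rule, the 0.5 threshold, and the name `tpr_consensus` are all assumptions made for the example.

```python
from typing import Dict, List


def tpr_consensus(children: Dict[str, List[str]],
                  local: Dict[str, float],
                  root: str,
                  threshold: float = 0.5) -> Dict[str, float]:
    """Combine per-class scores over a tree in two passes.

    Bottom-up: a node's score is averaged with the scores of its children
    that are predicted positive (score > threshold), so positive evidence
    flows toward ancestors.
    Top-down: if a node ends up negative, its children are capped at the
    node's score, so negative evidence flows toward descendants.
    """
    consensus: Dict[str, float] = {}

    def bottom_up(node: str) -> float:
        positive = []
        for child in children.get(node, []):
            score = bottom_up(child)
            if score > threshold:
                positive.append(score)
        consensus[node] = (local[node] + sum(positive)) / (1 + len(positive))
        return consensus[node]

    def top_down(node: str) -> None:
        for child in children.get(node, []):
            if consensus[node] <= threshold:
                consensus[child] = min(consensus[child], consensus[node])
            top_down(child)

    bottom_up(root)
    top_down(root)
    return consensus


# Hypothetical toy hierarchy and per-class classifier scores.
hierarchy = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
scores = {"root": 0.6, "a": 0.3, "a1": 0.9, "b": 0.1, "b1": 0.4}
result = tpr_consensus(hierarchy, scores, "root")
```

    In the toy data, the strongly positive leaf `a1` lifts its parent `a` above the threshold, while the negative node `b` pulls its child `b1` down to its own score.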

  • A Note on the Fixed Parameter Tractability of the Gene-Duplication Problem

    Page(s): 848 - 850

    The NP-hard gene-duplication problem takes as input a collection of gene trees and seeks a species tree that requires the fewest gene duplications to reconcile the input gene trees. An oft-cited, decade-old result by Stege states that the gene-duplication problem is fixed parameter tractable when parameterized by the number of gene duplications necessary for the reconciliation. Here, we uncover an error in this fixed parameter algorithm and show that this error cannot be corrected without sacrificing the fixed parameter tractability of the algorithm. Furthermore, we show a link between the gene-duplication problem and the minimum rooted triplets inconsistency problem, which implies that the gene-duplication problem is 1) W[2]-hard when parameterized by the number of gene duplications necessary for the reconciliation and 2) hard to approximate to better than a logarithmic factor.

  • Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning

    Page(s): 851 - 857

    With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules handcrafted by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu