
Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Issue 2 • Date April-June 2008


  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Consensus Genetic Maps as Median Orders from Inconsistent Sources

    Page(s): 161 - 171

    A genetic map is an ordering of genetic markers calculated from a population of known lineage. Although, traditionally, a map has been generated from a single population for each species, recently, researchers have created maps from multiple populations. In the face of these new data, we address the need to find a consensus map - a map that combines the information from multiple partial and possibly inconsistent input maps. We model each input map as a partial order and formulate the consensus problem as finding a median partial order. Finding the median of multiple total orders (preferences or rankings) is a well-studied problem in social choice. We choose to find the median by using the weighted symmetric difference distance, which is a more general version of both the symmetric difference distance and the Kemeny distance. Finding a median order using this distance is NP-hard. We show that, for our chosen weight assignment, a median order satisfies the positive responsiveness, extended Condorcet, and unanimity criteria. Our solution involves finding the maximum acyclic subgraph of a weighted directed graph. We present a method that dynamically switches between an exact branch and bound algorithm and a heuristic algorithm and show that, for real data from closely related organisms, an exact median can often be found. We present experimental results by using seven populations of the crop plant Zea mays.

  • Extracting Dynamics from Static Cancer Expression Data

    Page(s): 172 - 182

    Static expression experiments analyze samples from many individuals. These samples are often snapshots of the progression of a certain disease such as cancer. This raises an intriguing question: Can we determine a temporal order for these samples? Such an ordering can lead to better understanding of the dynamics of the disease and to the identification of genes associated with its progression. In this paper, we formally prove, for the first time, that under a model for the dynamics of the expression levels of a single gene, it is indeed possible to recover the correct ordering of the static expression data sets by solving an instance of the traveling salesman problem (TSP). In addition, we devise an algorithm that combines a TSP heuristic and probabilistic modeling for inferring the underlying temporal order of the microarray experiments. This algorithm constructs probabilistic continuous curves to represent expression profiles and can thus account for noise and for individual background expression differences leading to accurate temporal reconstruction for human data. Applying our method to cancer expression data, we show that the ordering derived agrees well with survival duration. A classifier that utilizes this ordering improves upon other classifiers suggested for this task. The set of genes displaying consistent behavior for the determined ordering is enriched for genes associated with cancer progression.

  • Graphical Models of Residue Coupling in Protein Families

    Page(s): 183 - 197

    Many statistical measures and algorithmic techniques have been proposed for studying residue coupling in protein families. Generally speaking, two residue positions are considered coupled if, in the sequence record, some of their amino acid type combinations are significantly more common than others. While the proposed approaches have proven useful in finding and describing coupling, a significant missing component is a formal probabilistic model that explicates and compactly represents the coupling, integrates information about sequence, structure, and function, and supports inferential procedures for analysis, diagnosis, and prediction. We present an approach to learning and using probabilistic graphical models of residue coupling (GMRCs). These models capture significant conservation and coupling constraints observable in a multiply aligned set of sequences. Our approach can place a structural prior on considered couplings, so that all identified relationships have direct mechanistic explanations. It can also incorporate information about functional classes, and thereby learn a differential graphical model that distinguishes constraints common to all classes from those unique to individual classes. Such differential models separately account for class-specific conservation and family-wide coupling, two different sources of sequence covariation. They are then able to perform interpretable functional classification of new sequences, explaining classification decisions in terms of the underlying conservation and coupling constraints. We apply our approach in studying both G protein-coupled receptors and PDZ domains, identifying and analyzing family-wide and class-specific constraints, and performing functional classification. The results demonstrate that GMRCs provide a powerful tool for uncovering, representing, and utilizing significant sequence-structure-function relationships in protein families.

  • Identification of Protein Coding Regions Using the Modified Gabor-Wavelet Transform

    Page(s): 198 - 207

    An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods based on the occurrence of specific patterns of nucleotides at coding regions have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically predefined window length required for a local analysis of a DNA region. We introduce a method based on a modified Gabor-wavelet transform (MGWT) for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods by using eukaryote data sets. The results show that MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors but also makes a tool available for detailed exploration of the nucleotide occurrence.

  • Search for Steady States of Piecewise-Linear Differential Equation Models of Genetic Regulatory Networks

    Page(s): 208 - 222

    The analysis of the attractors of a genetic regulatory network gives a good indication of the possible functional modes of the system. In this paper, we are concerned with the problem of finding all steady states of genetic regulatory networks described by piecewise-linear differential equation (PLDE) models. We show that the problem is NP-hard and translate it into the problem of finding all valuations of a propositional satisfiability (SAT) formula. This allows the use of existing, efficient SAT solvers and has enabled the development of a steady state search module of the computer tool genetic network analyzer (GNA). The practical use of this module is demonstrated by means of the analysis of a number of relatively small bacterial regulatory networks, as well as randomly generated networks of several hundred genes.

  • Toward Verified Biological Models

    Page(s): 223 - 234

    The last several decades have witnessed a vast accumulation of biological data and data analysis. Many of these data sets represent only a small fraction of the system's behavior, making the visualization of full system behavior difficult. A more complete understanding of a biological system is gained when different types of data (and/or conclusions drawn from the data) are integrated into a larger scale representation or model of the system. Ideally, this type of model is consistent with all available data about the system, and it is then used to generate additional hypotheses to be tested. Computer-based methods intended to formulate models that integrate various events and to test the consistency of these models with respect to the laboratory-based observations on which they are based are potentially very useful. In addition, in contrast to informal models, the consistency of such formal computer-based models with laboratory data can be tested rigorously by methods of formal verification. We combined two formal modeling approaches in computer science that were originally developed for nonbiological system design. One is the interobject approach using the language of live sequence charts (LSCs) with the Play-Engine tool, and the other is the intraobject approach using the language of statecharts and Rhapsody as the tool. Integration is carried out using InterPlay, a simulation engine coordinator. Using these tools, we constructed a combined model comprising three modules. One module represents the early lineage of the somatic gonad of Caenorhabditis elegans in LSCs, whereas a second more detailed module in statecharts represents an interaction between two cells within this lineage that determine their developmental outcome. Using the advantages of the tools, we created a third module representing a set of key experimental data using LSCs. We tested the combined statechart-LSC model by showing that the simulations were consistent with the set of experimental LSCs. This small-scale modular example demonstrates the potential for using similar approaches for verification by exhaustive testing of models by LSCs. It also shows the advantages of these approaches for modeling biology.

  • Computing Phylogenetic Diversity for Split Systems

    Page(s): 235 - 244

    In conservation biology, it is a central problem to measure, predict, and preserve biodiversity as species face extinction. In 1992, Faith proposed measuring the diversity of a collection of species in terms of their relationships on a phylogenetic tree and using this information to identify collections of species with high diversity. Here, we are interested in some variants of the resulting optimization problem that arise when considering species whose evolution is better represented by a network rather than a tree. More specifically, we consider the problem of computing phylogenetic diversity relative to a split system on a collection of species of size n. We show that, for general split systems, this problem is NP-hard. In addition, we provide some efficient algorithms for some special classes of split systems, in particular presenting an optimal O(n) time algorithm for phylogenetic trees and an O(n log n + nk) time algorithm for choosing an optimal subset of size k relative to a circular split system.
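
To make the optimization problem in this abstract concrete, here is a minimal Python sketch. It assumes the commonly used definition in which the phylogenetic diversity of a subset is the total weight of the splits that separate at least two of its members; the abstract itself does not spell this out. The function names, the toy taxa, and the split weights are all hypothetical, and the exhaustive search is only an illustration, not the paper's O(n) or O(n log n + nk) algorithms.

```python
from itertools import combinations


def phylogenetic_diversity(subset, splits):
    """Sum the weights of all splits that separate members of `subset`.

    Each split is a triple (side_a, side_b, weight), where side_a and
    side_b are disjoint sets of taxa. This definition is an assumption.
    """
    chosen = set(subset)
    total = 0.0
    for side_a, side_b, weight in splits:
        # A split contributes only if the subset has taxa on both sides.
        if chosen & side_a and chosen & side_b:
            total += weight
    return total


def best_subset(taxa, splits, k):
    """Exhaustively pick a size-k subset with maximum diversity (exponential)."""
    return max(
        combinations(sorted(taxa), k),
        key=lambda s: phylogenetic_diversity(s, splits),
    )


# Hypothetical toy data: the splits of the tree ((a,b),(c,d)).
taxa = {"a", "b", "c", "d"}
splits = [
    ({"a"}, {"b", "c", "d"}, 1.0),
    ({"b"}, {"a", "c", "d"}, 1.0),
    ({"c"}, {"a", "b", "d"}, 2.0),
    ({"d"}, {"a", "b", "c"}, 3.0),
    ({"a", "b"}, {"c", "d"}, 4.0),
]

pd_ad = phylogenetic_diversity({"a", "d"}, splits)
pd_cd = phylogenetic_diversity({"c", "d"}, splits)
best_pair = best_subset(taxa, splits, 2)
```

In this toy example the pair {a, d} is separated by the two pendant splits and the internal split, while {c, d} is separated only by its two pendant splits, so spreading the selection across the tree scores higher.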

  • Haplotyping for Disease Association: A Combinatorial Approach

    Page(s): 245 - 251

    We consider a combinatorial problem derived from haplotyping a population with respect to a disease, either recessive or dominant. Given a set of individuals, partitioned into healthy and diseased, and the corresponding sets of genotypes, we want to infer "bad" and "good" haplotypes to account for these genotypes and for the disease. Assume, for example, that the disease is recessive. Then, the resolving haplotypes must consist of bad and good haplotypes so that 1) each genotype belonging to a diseased individual is explained by a pair of bad haplotypes and 2) each genotype belonging to a healthy individual is explained by a pair of haplotypes of which at least one is good. We prove that the associated decision problem is NP-complete. However, we also prove that there is a simple solution, provided that the data satisfy a very weak requirement.

  • Highly Scalable Genotype Phasing by Entropy Minimization

    Page(s): 252 - 261

    A single nucleotide polymorphism (SNP) is a position in the genome at which two or more of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals and mapping SNPs in the human population has become the next high priority in genomics after the completion of the Human Genome Project. In diploid organisms such as humans, there are two nonidentical copies of each autosomal chromosome. A description of the SNPs in a chromosome is called a haplotype. At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but it is possible to rather easily obtain the conflated SNP information in the so-called genotype. Computational methods for genotype phasing, that is, inferring haplotypes from genotype data, have received much attention in recent years as haplotype information leads to an increased statistical power of disease association tests. However, many of the existing algorithms have impractical runtime for phasing large genotype data sets such as those generated by the international HapMap Project. In this paper, we propose a highly scalable algorithm based on entropy minimization. Our algorithm is capable of phasing both unrelated and related genotypes coming from complex pedigrees. Experimental results on both real and simulated data sets show that our algorithm achieves a phasing accuracy worse than but close to that of the best existing methods while being several orders of magnitude faster. The open source code implementation of the algorithm and a Web interface are publicly available at http://dna.engr.uconn.edu/~software/ent/.

  • Inferring Connectivity of Genetic Regulatory Networks Using Information-Theoretic Criteria

    Page(s): 262 - 274

    Recently, the concept of mutual information has been proposed for inferring the structure of genetic regulatory networks from gene expression profiling. After analyzing the limitations of mutual information in inferring the gene-to-gene interactions, this paper introduces the concept of conditional mutual information and, based on this, proposes two novel algorithms to infer the connectivity structure of genetic regulatory networks. One of the proposed algorithms exhibits a better accuracy, whereas the other algorithm excels in simplicity and flexibility. By exploiting the mutual information and conditional mutual information, a practical metric is also proposed to assess the likeliness of direct connectivity between genes. This novel metric resolves a common limitation associated with the current inference algorithms, namely, the situations where the gene connectivity is established in terms of the dichotomy of being either connected or disconnected. On data sets generated by synthetic networks, the proposed algorithms compare favorably with existing state-of-the-art schemes. The proposed algorithms are also applied to realistic biological measurements such as the cutaneous melanoma data set, and biologically meaningful results are inferred.
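
The two quantities named in this abstract can be illustrated with a short Python sketch that uses plain empirical (plug-in) entropy estimates over discretized expression levels. The function names and the toy data are hypothetical, and the sketch shows only the textbook definitions; it is not either of the paper's inference algorithms or its proposed metric.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Empirical joint Shannon entropy, in bits, of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(count / n * log2(count / n) for count in Counter(rows).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Hypothetical binarized expression levels: gene z drives both x and y,
# so x and y look coupled even though neither regulates the other.
z = [0, 0, 1, 1]
x = [0, 0, 1, 1]
y = [0, 0, 1, 1]

mi_xy = mutual_information(x, y)
cmi_xy_given_z = conditional_mutual_information(x, y, z)
```

Here the mutual information between x and y is one bit, but it drops to zero once z is conditioned on, which is the kind of indirect dependence that mutual information alone cannot distinguish from a direct link.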

  • Nature Reserve Selection Problem: A Tight Approximation Algorithm

    Page(s): 275 - 280

    The nature reserve selection problem is a problem that arises in the context of studying biodiversity conservation. Subject to budgetary constraints, the problem is to select a set of regions to be conserved so that the phylogenetic diversity of the set of species contained within those regions is maximized. Recently, it has been shown in a paper by Moulton et al. that this problem is NP-hard. In this paper, we establish a tight polynomial-time approximation algorithm for the Nature Reserve Selection Problem. Furthermore, we resolve a question on the computational complexity of a related problem left open by Moulton et al.

  • Optimal Algorithms for the Interval Location Problem with Range Constraints on Length and Average

    Page(s): 281 - 290

    Let A be a sequence of n real numbers, L1 and L2 be two integers such that L1 ≤ L2, and let R1 and R2 be two real numbers such that R1 ≤ R2. An interval of A is feasible if its length is between L1 and L2, and its average is between R1 and R2. In this paper, we study the following problems: finding all feasible intervals of A, counting all feasible intervals of A, finding a maximum cardinality set of nonoverlapping feasible intervals of A, locating a longest feasible interval of A, and locating a shortest feasible interval of A. The problems are motivated from the problem of locating CpG islands in biomolecular sequences. In this paper, we first show that all the problems have an Ω(n log n)-time lower bound in the comparison model. Then, we use geometric approaches to design optimal algorithms for the problems. All the presented algorithms run in an online manner and use O(n) space.
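
The feasibility definition in this abstract can be checked directly with prefix sums. The Python sketch below is a naive baseline that enumerates every candidate interval in O(n (L2 - L1 + 1)) time; it is not the optimal geometric algorithm the paper describes, and the function name and sample sequence are hypothetical. Bounds are treated as inclusive, which is an assumption.

```python
from itertools import accumulate


def feasible_intervals(a, l1, l2, r1, r2):
    """Return all half-open index pairs (start, end) whose length lies in
    [l1, l2] and whose average lies in [r1, r2]."""
    prefix = [0, *accumulate(a)]  # prefix[i] is the sum of a[:i]
    n = len(a)
    result = []
    for start in range(n):
        for length in range(max(l1, 1), l2 + 1):
            end = start + length
            if end > n:
                break
            total = prefix[end] - prefix[start]
            # Compare sums instead of averages to avoid a division.
            if r1 * length <= total <= r2 * length:
                result.append((start, end))
    return result


# Hypothetical sample: lengths 2 to 3, averages between 2.5 and 3.5.
intervals = feasible_intervals([1, 3, 2, 5, 4], l1=2, l2=3, r1=2.5, r2=3.5)
count = len(intervals)
longest = max(intervals, key=lambda iv: iv[1] - iv[0])
```

The counting, longest, and shortest variants listed in the abstract all fall out of the same enumeration here, which is why a faster method has to avoid materializing every interval.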

  • Prediction of R5, X4, and R5X4 HIV-1 Coreceptor Usage with Evolved Neural Networks

    Page(s): 291 - 300

    The HIV-1 genome is highly heterogeneous. This variation affords the virus a wide range of molecular properties, including the ability to infect cell types, such as macrophages and lymphocytes, expressing different chemokine receptors on the cell surface. In particular, R5 HIV-1 viruses use CCR5 as a coreceptor for viral entry, X4 viruses use CXCR4, whereas some viral strains, known as R5X4 or D-tropic, have the ability to utilize both coreceptors. X4 and R5X4 viruses are associated with rapid disease progression to AIDS. R5X4 viruses differ in that they have yet to be characterized by the examination of the genetic sequence of HIV-1 alone. In this study, a series of experiments was performed to evaluate different strategies of feature selection and neural network optimization. We demonstrate the use of artificial neural networks trained via evolutionary computation to predict viral coreceptor usage. The results indicate the identification of R5X4 viruses with a predictive accuracy of 75.5 percent.

  • Shorelines of Islands of Tractability: Algorithms for Parsimony and Minimum Perfect Phylogeny Haplotyping Problems

    Page(s): 301 - 312

    The problem parsimony haplotyping (PH) asks for the smallest set of haplotypes that can explain a given set of genotypes, and the problem minimum perfect phylogeny haplotyping (MPPH) asks for the smallest such set that also allows the haplotypes to be embedded in a perfect phylogeny, an evolutionary tree with biologically motivated restrictions. For PH, we extend recent work by further mapping the interface between "easy" and "hard" instances, within the framework of (k, f)-bounded instances, where the number of 2s per column and row of the input matrix is restricted. By exploring, in the same way, the tractability frontier of MPPH, we provide the first concrete positive results for this problem. In addition, we construct for both PH and MPPH polynomial time approximation algorithms, based on properties of the columns of the input matrix.

  • 2SNP: Scalable Phasing Method for Trios and Unrelated Individuals

    Page(s): 313 - 318

    Emerging microarray technologies allow affordable typing of very long genome sequences. A key challenge in analyzing such a huge amount of data is scalable and accurate computational inferring of haplotypes (that is, splitting of each genotype into a pair of corresponding haplotypes). In this paper, we first phase genotypes consisting only of two SNPs using genotype frequencies adjusted to the random mating model and then extend the phasing of two-SNP genotypes to the phasing of complete genotypes using maximum spanning trees. The runtime of the proposed 2SNP algorithm is O(nm(n + log m)), where n and m are the numbers of genotypes and SNPs, respectively, and it can handle genotypes spanning the entire chromosomes in a matter of hours. On data sets across 23 chromosomal regions from HapMap [11], 2SNP is several orders of magnitude faster than GERBIL and PHASE when matching them in quality measured by the number of correctly phased genotypes, single-site, and switching errors. For example, the 2SNP software phases the entire chromosome (10^5 SNPs from HapMap) for 30 individuals in 2 hours with an average switching error of 7.7 percent. We have also enhanced the 2SNP algorithm to phase family trio data and compared it with four other well-known phasing methods on simulated data from [15]. 2SNP is much faster than all of them while losing in quality only to PHASE. 2SNP software is publicly available at http://alla.cs.gsu.edu/~software/2SNP.

  • IEEE Computer Society Digital Library [advertisement]

    Page(s): 319
    Freely Available from IEEE
  • Build Your Career in Computing [advertisement]

    Page(s): 320
    Freely Available from IEEE
  • IEEE/ACM TCBB: Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.


Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu