
IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue 3 • July-Sept. 2008

Displaying Results 1 - 19 of 19
  • [Front cover]

    Page(s): c1
    PDF (658 KB) | Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (85 KB) | Freely Available from IEEE
  • Guest Editors' Introduction to the Special Section on Bioinformatics Research and Applications

    Page(s): 321 - 322
    PDF (77 KB) | Freely Available from IEEE
  • Mixed Integer Linear Programming for Maximum-Parsimony Phylogeny Inference

    Page(s): 323 - 331
    PDF (1178 KB) | HTML

    Reconstruction of phylogenetic trees is a fundamental problem in computational biology. While excellent heuristic methods are available for many variants of this problem, new advances in phylogeny inference will be required if we are to continue making effective use of the rapidly growing stores of variation data now being gathered. In this paper, we present two integer linear programming (ILP) formulations for finding the most parsimonious phylogenetic tree from a set of binary variation data. One method uses a flow-based formulation that can produce exponential numbers of variables and constraints in the worst case. The method has, however, proven extremely efficient in practice on datasets that are well beyond the reach of the available provably efficient methods, solving several large mtDNA and Y-chromosome instances within a few seconds and giving provably optimal results in times competitive with fast heuristics that cannot guarantee optimality. An alternative formulation establishes that the problem can be solved with a polynomial-sized ILP. We further present a web server, based on the exponential-sized ILP, that performs fast maximum-parsimony inferences and serves as a front end to a database of precomputed phylogenies spanning the human genome.

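The paper's ILP formulations are not reproduced here, but the quantity they optimize, the parsimony score of a tree over binary characters, can be sketched with the classical Fitch small-parsimony recursion on a fixed tree. This is a minimal illustrative sketch, not the authors' code; the function name and toy data are assumptions.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of state changes on a fixed tree (Fitch's recursion).

    tree: nested 2-tuples, e.g. (("A", "B"), "C"); strings are leaf names.
    leaf_states: maps leaf name -> state (0 or 1 for binary variation data).
    Returns (candidate state set at the root, number of changes).
    """
    if isinstance(tree, str):                      # leaf: its own state, no cost
        return {leaf_states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_score(left, leaf_states)
    rset, rcost = fitch_score(right, leaf_states)
    inter = lset & rset
    if inter:                                      # children can agree: no change
        return inter, lcost + rcost
    return lset | rset, lcost + rcost + 1          # disagreement: one change

# Toy example: three taxa, one binary site.
states = {"A": 0, "B": 1, "C": 1}
_, changes = fitch_score((("A", "B"), "C"), states)
print(changes)  # 1
```

The ILP formulations in the paper search over tree topologies as well, whereas this recursion only scores one given tree.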
  • Solving the Preserving Reversal Median Problem

    Page(s): 332 - 347
    PDF (2639 KB) | HTML

    Genomic rearrangement operations can be very useful to infer the phylogenetic relationship of gene orders representing species. We study the problem of finding potential ancestral gene orders for the gene orders of given taxa, such that the corresponding rearrangement scenario has a minimal number of reversals, and where each of the reversals has to preserve the common intervals of the given input gene orders. Common intervals identify sets of genes that occur consecutively in all input gene orders. The problem of finding such an ancestral gene order is called the preserving reversal median problem (pRMP). A tree-based data structure for the representation of the common intervals of all input gene orders is used in our exact algorithm called tree common interval preserving (TCIP) for solving the pRMP. It is known that the minimum number of reversals to transform one gene order into another can be computed in polynomial time, whereas the corresponding problem with the restriction that common intervals should not be destroyed is already NP-hard. It is shown theoretically that TCIP can solve a large class of pRMP instances in polynomial time. Empirically, we show the good performance of TCIP on biological and artificial data.

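As a rough illustration of the common intervals that TCIP's reversals must preserve, a naive enumeration over two unsigned permutations can be written in a few lines. The function name and example permutations below are hypothetical; the paper's tree-based data structure is far more efficient than this quadratic scan.

```python
def common_intervals(p, q):
    """All common intervals (element sets of size >= 2) of permutations p, q.

    A set of elements is a common interval if its members occur consecutively
    (in some order) in both p and q.  Naive enumeration of all blocks of p.
    """
    pos_q = {v: i for i, v in enumerate(q)}
    result = set()
    n = len(p)
    for i in range(n):
        for j in range(i + 1, n):
            block = p[i:j + 1]
            qpos = [pos_q[v] for v in block]
            # block is consecutive in q iff its q-positions span exactly its size
            if max(qpos) - min(qpos) == len(block) - 1:
                result.add(frozenset(block))
    return result

found = common_intervals((1, 2, 3, 4), (3, 4, 2, 1))
print(sorted(sorted(s) for s in found))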
  • Exploring the Solution Space of Sorting by Reversals, with Experiments and an Application to Evolution

    Page(s): 348 - 356
    PDF (661 KB) | HTML

    In comparative genomics, algorithms that sort permutations by reversals are often used to propose evolutionary scenarios of rearrangements between species. One of the main problems of such methods is that they give only one solution while the number of optimal solutions is huge, with no criterion to discriminate among them. Bergeron et al. started to give some structure to the set of optimal solutions, in order to be able to deliver more presentable results than only one solution or a complete list of all solutions. However, no algorithm exists so far to compute this structure except through the enumeration of all solutions, which takes too much time even for small permutations. Bergeron et al. state the design of such an algorithm as an open problem. In this paper, we propose an answer to this problem: an algorithm which gives all the classes of solutions and counts the number of solutions in each class, with a better theoretical and practical complexity than the complete enumeration method. We give an example of how to reduce the number of classes obtained, using further constraints. Finally, we apply our algorithm to analyse the possible scenarios of rearrangement between mammalian sex chromosomes.

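The abstract's point that the number of optimal scenarios is huge even when the reversal distance is small can be seen on tiny inputs with a brute-force breadth-first search. This sketch uses unsigned reversals for simplicity (the literature usually treats the signed case) and hypothetical function names; the whole point of the paper is to avoid this kind of exhaustive enumeration.

```python
from itertools import combinations

def reversal_neighbors(p):
    """All permutations reachable from tuple p by reversing one segment."""
    for i, j in combinations(range(len(p)), 2):
        yield p[:i] + tuple(reversed(p[i:j + 1])) + p[j + 1:]

def count_optimal_scenarios(p):
    """(reversal distance, number of distinct minimum-length sorting scenarios).

    Level-by-level BFS that aggregates shortest-path counts; any walk whose
    length equals the distance is necessarily a shortest scenario.
    """
    target = tuple(sorted(p))
    frontier = {p: 1}          # permutation -> number of shortest scenarios to it
    steps = 0
    while target not in frontier:
        nxt = {}
        for perm, ways in frontier.items():
            for q in reversal_neighbors(perm):
                nxt[q] = nxt.get(q, 0) + ways
        frontier = nxt
        steps += 1
    return steps, frontier[target]

print(count_optimal_scenarios((3, 1, 2)))  # (2, 3): distance 2, three scenarios
```

Already for this 3-element permutation there are three optimal scenarios; the count grows explosively with permutation length, which motivates the paper's compact representation by classes of solutions.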
  • Reconstruction of 3D Structures From Protein Contact Maps

    Page(s): 357 - 367
    PDF (1805 KB) | HTML

    The prediction of a protein's tertiary structure solely from its residue sequence (the so-called protein folding problem) is one of the most challenging problems in structural bioinformatics. We focus on the protein residue contact map. When this map is assigned, it is possible to reconstruct the 3D structure of the protein backbone. The general problem of recovering a set of 3D coordinates consistent with a given contact map is known as the unit-disk-graph realization problem and has recently been proven to be NP-hard. In this paper, we describe a heuristic method (COMAR) that can reconstruct, with unprecedented speed (3-15 seconds), a 3D model that exactly matches the target contact map of a protein. Working with a nonredundant set of 1,760 proteins, we find that the efficiency of finding a 3D model very close to the protein's native structure depends on the threshold value adopted to compute the protein residue contact map. Contact maps whose threshold values range from 10 to 18 Ångströms allow reconstructing 3D models that are very similar to the proteins' native structures.

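Going from coordinates to a contact map (the easy, forward direction of the problem COMAR inverts) makes the role of the distance threshold concrete. This sketch assumes C-alpha coordinates and a hypothetical function name; the 12 Å default is just an example inside the 10-18 Å range the abstract reports as favorable.

```python
from math import dist

def contact_map(coords, threshold=12.0):
    """Binary residue contact map from per-residue 3D coordinates.

    coords: list of (x, y, z) tuples, one per residue.
    threshold: contact cutoff in Angstroms.
    """
    n = len(coords)
    return [[1 if i != j and dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)] for i in range(n)]

# Three residues on a line, 8 A apart: neighbors touch, the ends do not.
cmap = contact_map([(0, 0, 0), (8, 0, 0), (16, 0, 0)], threshold=12.0)
print(cmap)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The reconstruction problem studied in the paper asks for coordinates whose induced map equals a given one, which is NP-hard in general.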
  • Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene and Protein Expression Studies

    Page(s): 368 - 384
    PDF (5054 KB) | HTML

    The recent explosion in the procurement and availability of high-dimensional gene and protein expression profile data sets for cancer diagnostics has necessitated the development of sophisticated machine learning tools for analyzing them. While some investigators focus on identifying informative genes and proteins that play a role in specific diseases, other researchers have instead attempted to use patients' expression profiles to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional data sets stems from the "curse of dimensionality," which occurs when the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, principal component analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for the classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify appropriate DR methods for the analysis of high-dimensional gene and protein expression studies. Toward this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, and Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, and Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable. Owing to the inherent nonlinear structure of gene and protein expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data than the linear DR schemes. The DR schemes were evaluated by 1) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and 2) computing five cluster validity measures that evaluate the size, distance, and tightness of object aggregates in the low-dimensional space. For each of the seven evaluation measures considered, statistically significant improvement in the quality of the embeddings across 10 cancer data sets was observed with the three nonlinear DR schemes over the three linear DR techniques. Similar trends were observed when linear and nonlinear DR were applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embeddings obtained via the six DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g., cancer subtypes) within the data.

  • Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering

    Page(s): 385 - 400
    PDF (4567 KB) | HTML

    It is a consensus in microarray analysis that identifying potential local patterns, characterized by coherent groups of genes and conditions, may shed light on the discovery of previously undetectable biological cellular processes of genes, as well as macroscopic phenotypes of related samples. In order to simultaneously cluster genes and conditions, we have previously developed a fast coclustering algorithm, minimum sum-squared residue coclustering (MSSRCC), which employs an alternating minimization scheme and generates what we call coclusters in a "checkerboard" structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and resolve the degeneracy problem in partitional clustering algorithms. The strategies include binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of the various strategies on both synthetic gene expression data sets and real human cancer microarrays and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing coclustering and clustering algorithms. In particular, the combination of all three strategies leads to the best performance. Furthermore, we illustrate the coherence of the resulting coclusters in a checkerboard structure, where genes in a cocluster manifest the phenotype structure of the corresponding specific samples, and evaluate the enrichment of functional annotations in the Gene Ontology (GO).

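A minimal sketch of the sum-squared residue that gives MSSRCC its name, assuming the additive residue h_ij = a_ij - rowmean_i - colmean_j + mean used in this line of coclustering work; function and variable names are illustrative, and the alternating minimization over row/column assignments is not shown.

```python
def sum_squared_residue(submatrix):
    """Sum-squared residue of one cocluster (a submatrix of the data matrix).

    Residue of entry a_ij: a_ij - rowmean_i - colmean_j + overall mean.
    A perfectly coherent (additive, "checkerboard") cocluster scores 0.
    """
    rows, cols = len(submatrix), len(submatrix[0])
    total = sum(sum(r) for r in submatrix) / (rows * cols)
    rmean = [sum(r) / cols for r in submatrix]
    cmean = [sum(submatrix[i][j] for i in range(rows)) / rows
             for j in range(cols)]
    return sum((submatrix[i][j] - rmean[i] - cmean[j] + total) ** 2
               for i in range(rows) for j in range(cols))

# An additive pattern has zero residue; perturbing one entry makes it positive.
print(sum_squared_residue([[1, 2], [3, 4]]))  # 0.0
print(sum_squared_residue([[1, 2], [3, 5]]))  # 0.25
```

MSSRCC searches for a row and column partition minimizing the total of this quantity over all coclusters.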
  • Incorporating Gene Functions into Regression Analysis of DNA-Protein Binding Data and Gene Expression Data to Construct Transcriptional Networks

    Page(s): 401 - 415
    PDF (3316 KB) | HTML

    Useful information on transcriptional networks has been extracted by regression analyses of gene expression data and DNA-protein binding data. However, a potential limitation of these approaches is their assumption of a common and constant activity level of a transcription factor (TF) across all of the genes in any given experimental condition; for example, any TF is assumed to be either an activator or a repressor, but not both, whereas it is known that some TFs can be dual regulators. Rather than assuming a common linear regression model for all of the genes, we propose using separate regression models for various gene groups; the genes can be grouped based on their functions or some clustering results. Furthermore, to take advantage of the hierarchical structure of many existing gene function annotation systems such as the Gene Ontology (GO), we propose a shrinkage method that borrows information from relevant gene groups. Applications to a yeast data set and simulations lend support to our proposed methods. In particular, we find that the shrinkage method consistently works well under various scenarios. We recommend the use of the shrinkage method as a useful alternative to the existing methods.

  • PairProSVM: Protein Subcellular Localization Based on Local Pairwise Profile Alignment and SVM

    Page(s): 416 - 422
    PDF (892 KB) | HTML

    The subcellular locations of proteins are important functional annotations. An effective and reliable subcellular localization method is necessary for proteomics research. This paper introduces a new method, PairProSVM, to automatically predict the subcellular locations of proteins. The profiles of all protein sequences in the training set are constructed by PSI-BLAST, and the pairwise profile alignment scores are used to form feature vectors for training a support vector machine (SVM) classifier. It was found that PairProSVM outperforms the methods that are based on sequence alignment and amino acid compositions, even when most of the homologous sequences have been removed. PairProSVM was evaluated on Huang and Li's and Gardy et al.'s protein data sets. The overall accuracies on these data sets reach 75.3 percent and 91.9 percent, respectively, which are higher than or comparable to those obtained by sequence alignment and composition-based methods.

  • Reproducibility-Optimized Test Statistic for Ranking Genes in Microarray Studies

    Page(s): 423 - 431
    PDF (646 KB) | HTML

    A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. Because previous studies on simulated or spike-in data sets provide little practical guidance on how to choose the best method for a given real data set, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic performs consistently well under various simulated conditions and on an Affymetrix spike-in data set. Further, the feasibility of the novel statistic is confirmed in a practical research setting, using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given data set without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression but could be extended to a wide range of other applications as well.

  • Solving the Problem of Trans-Genomic Query with Alignment Tables

    Page(s): 432 - 447
    PDF (5780 KB) | HTML

    The trans-genomic query (TGQ) problem, enabling the free query of biological information even across genomes, is a central challenge facing bioinformatics. Solutions to this problem can alter the nature of the field, moving it beyond the jungle of data integration and expanding the number and scope of questions that can be answered. An alignment table is a binary relationship on locations (sequence segments). An important special case of alignment tables are hit tables: tables of pairs of highly similar segments produced by alignment tools like BLAST. However, alignment tables also include general binary relationships and can represent any useful connection between sequence locations. They can be curated and provide a high-quality queryable backbone of connections between biological information. Alignment tables can thus be a natural foundation for TGQ, as they permit a central part of the TGQ problem to be reduced to purely technical problems involving tables of locations. Key challenges in implementing alignment tables include the efficient representation and indexing of sequence locations. We define a location data type that can be incorporated naturally into common off-the-shelf database systems. We also describe an implementation of alignment tables in BLASTGRES, an extension of the open-source PostgreSQL database system that provides the indexing and operators on locations required for querying alignment tables. This paper also reviews several successful large-scale applications of alignment tables for TGQ. Tables with millions of alignments have been used in queries about alternative splicing, an area of genomic analysis concerning the way in which a single gene can yield multiple transcripts. Comparative genomics is a large potential application area for TGQ and alignment tables.

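The core of an alignment-table query is an overlap join between a query location and the locations stored in the table. The `Location` type, half-open coordinates, and linear scan below are illustrative assumptions only; BLASTGRES implements this with an indexed location type and operators inside PostgreSQL rather than a scan.

```python
from collections import namedtuple

# A minimal, hypothetical sequence-location type: half-open [start, end).
Location = namedtuple("Location", "seq start end")

def overlaps(a, b):
    """True if two locations lie on the same sequence and share bases."""
    return a.seq == b.seq and a.start < b.end and b.start < a.end

def query_alignment_table(table, loc):
    """Rows of an alignment table (pairs of locations) whose first location
    overlaps the query -- the join at the heart of trans-genomic queries,
    here as a linear scan instead of a database index."""
    return [row for row in table if overlaps(row[0], loc)]

table = [(Location("chr1", 100, 200), Location("chr7", 5000, 5100)),
         (Location("chr2", 0, 50),    Location("chr7", 5200, 5250))]
hits = query_alignment_table(table, Location("chr1", 150, 160))
print(len(hits))  # 1
```

Following such a hit to its second location is what lets a query "hop" from one genome to the aligned region in another.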
  • Fine-Scale Genetic Mapping Using Independent Component Analysis

    Page(s): 448 - 460
    PDF (2560 KB) | HTML

    The aim of genetic mapping is to locate the loci responsible for specific traits such as complex diseases. These traits are normally caused by mutations at multiple loci of unknown locations and interactions. In this work, we model the biological system that relates DNA polymorphisms with complex traits as a linear mixing process. Given this model, we propose a new fine-scale genetic mapping method based on independent component analysis. The proposed method outputs both independent groups of SNPs and specific SNPs associated with the phenotype. It is applied to a clinical data set for schizophrenia with 368 individuals and 42 SNPs, as well as to a simulation study that investigates its performance in more depth. The obtained results demonstrate the novel characteristics of the proposed method compared to other genetic mapping methods. Finally, we study the robustness of the proposed method with missing genotype values and limited sample sizes.

  • Hadamard Conjugation for the Kimura 3ST Model: Combinatorial Proof Using Path Sets

    Page(s): 461 - 471
    PDF (533 KB) | HTML

    Under a stochastic model of molecular sequence evolution, the probability of each possible pattern of characters is well defined. Kimura's three-substitution-types (K3ST) model of evolution allows an analytical expression for these probabilities by means of the Hadamard conjugation, as a function of the phylogeny T and the substitution probabilities on each edge of T. In this paper, we give a direct combinatorial proof of these results using pathset distances, which generalise pairwise distances between sequences. This interpretation provides tools that have proved useful in related problems in the mathematical analysis of sequence evolution.

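The Hadamard conjugation is built around multiplication by a Hadamard matrix, which for power-of-two lengths can be computed with the fast Walsh-Hadamard transform. This is a generic sketch of that transform, not the paper's K3ST-specific machinery; the function name and toy vectors are assumptions.

```python
def hadamard_transform(v):
    """Fast Walsh-Hadamard transform H*v (len(v) must be a power of 2).

    In the Hadamard conjugation, vectors like v index quantities over
    subsets of tree edges (e.g. site-pattern or edge-length "spectra").
    """
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                # butterfly: replace the pair with its sum and difference
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v

print(hadamard_transform([1, 0, 1, 0]))  # [2, 2, 0, 0]
```

Since H*H = n*I, applying the transform twice returns the input scaled by its length, which is what makes the conjugation (transform, map componentwise, transform back) invertible.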
  • Improved Layout of Phylogenetic Networks

    Page(s): 472 - 479
    PDF (1805 KB) | HTML

    Split networks are increasingly being used in phylogenetic analysis. Usually, a simple equal-angle algorithm is used to draw such networks, producing layouts that leave much room for improvement. Addressing the problem of producing better layouts of split networks, this paper presents an algorithm for maximizing the area covered by the network, describes an extension of the equal-daylight algorithm to networks, looks into using a spring embedder, and discusses how to construct rooted split networks.

  • Build Your Career in Computing [advertisement]

    Page(s): 480
    PDF (84 KB) | Freely Available from IEEE
  • IEEE/ACM TCBB: Information for authors

    Page(s): c3
    PDF (85 KB) | Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (658 KB) | Freely Available from IEEE

Aims & Scope

This bimonthly journal publishes archival research results on the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu