
IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue 3 • July-Sept. 2007


Displaying Results 1 - 23 of 23
  • [Front cover]

    Page(s): c1
    PDF (330 KB)
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (89 KB)
    Freely Available from IEEE
  • Associate Editor Appreciation and Welcome

    Page(s): 321
    PDF (83 KB)
    Freely Available from IEEE
  • Adjoint Systems for Models of Cell Signaling Pathways and their Application to Parameter Fitting

    Page(s): 322 - 335
    PDF (1987 KB) | HTML

    The paper concerns the problem of fitting mathematical models of cell signaling pathways. Such models frequently take the form of sets of nonlinear ordinary differential equations. While the model is continuous in time, the performance index used in the fitting procedure involves measurements taken at discrete time moments. Adjoint sensitivity analysis is a tool which can be used for finding the gradient of a performance index in the space of parameters of the model. In the paper, a structural formulation of adjoint sensitivity analysis called generalized backpropagation through time (GBPTT) is used. The method is especially suited for hybrid, continuous-discrete time systems. As an example, we use a mathematical model of the NF-kappaB regulatory module, which plays a major role in the innate immune response in animals.

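    The gradient computation sketched in this abstract can be illustrated with a much simpler stand-in: a discrete adjoint (backpropagation-through-time) pass over a forward-Euler discretization of a one-parameter decay model, with a sum-of-squares performance index evaluated at a few discrete measurement steps. This is a minimal sketch of the general idea only; the model, step size, and measurement times below are arbitrary, and it is not the paper's GBPTT formulation or its NF-kappaB model.

```python
import numpy as np

# Toy model dx/dt = -k*x, forward-Euler discretized; the performance index is a
# sum of squared errors at discrete measurement steps (hybrid continuous-discrete fit).

def simulate(k, x0, h, n_steps):
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        x[n + 1] = x[n] + h * (-k * x[n])   # forward Euler step
    return x

def cost_and_gradient(k, x0, h, n_steps, meas):
    """meas: dict mapping step index -> measured value."""
    x = simulate(k, x0, h, n_steps)
    J = sum((x[n] - y) ** 2 for n, y in meas.items())

    # Backward (adjoint) sweep: lam[n] = dJ/dx[n]
    lam = np.zeros(n_steps + 1)
    dJdk = 0.0
    for n in range(n_steps, -1, -1):
        if n in meas:
            lam[n] += 2.0 * (x[n] - meas[n])
        if n < n_steps:
            # x[n+1] = x[n]*(1 - h*k): propagate the adjoint and accumulate dJ/dk
            lam[n] += lam[n + 1] * (1.0 - h * k)
            dJdk += lam[n + 1] * (-h * x[n])
    return J, dJdk

if __name__ == "__main__":
    h, n_steps, x0, k_true = 0.01, 500, 1.0, 0.7
    x_true = simulate(k_true, x0, h, n_steps)
    meas = {50: x_true[50], 200: x_true[200], 450: x_true[450]}

    k = 0.3
    J, g = cost_and_gradient(k, x0, h, n_steps, meas)
    eps = 1e-6                                  # finite-difference check of the gradient
    Jp, _ = cost_and_gradient(k + eps, x0, h, n_steps, meas)
    print(J, g, (Jp - J) / eps)
```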
  • CISA: Combined NMR Resonance Connectivity Information Determination and Sequential Assignment

    Page(s): 336 - 348
    PDF (5149 KB) | HTML

    A nearly complete sequential resonance assignment is a key factor leading to successful protein structure determination via NMR spectroscopy. Assuming the availability of a set of NMR spectral peak lists, most of the existing assignment algorithms first use the differences between chemical shift values for common nuclei across multiple spectra as evidence that some pairs of peaks should be assigned to sequentially adjacent amino acid residues in the target protein. They then use these connectivities as constraints to produce a sequential assignment. With varying degrees of success, these algorithms typically generate a large number of potential connectivity constraints, a number that grows exponentially as the quality of the spectral data decreases. A key observation used in our sequential assignment program, CISA, is that chemical shift residual signature information can be used to improve the connectivity determination and, thus, dramatically decrease the number of predicted connectivity constraints. Fewer connectivity constraints lead to fewer ambiguities in the sequential assignment. Extensive simulation studies on several large test data sets demonstrated that CISA is efficient and effective compared to the three most recently proposed sequential resonance assignment programs, RANDOM, PACES, and MARS.

  • Comparing Compressed Sequences for Faster Nucleotide BLAST Searches

    Page(s): 349 - 364
    PDF (973 KB) | HTML

    Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper [1] and no exact description has been published. It is important that BLASTN be state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general-purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed byte-packed formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/.

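    The byte-packed comparison idea (four 2-bit bases per byte, compared with whole-byte numeric equality) can be illustrated with the toy sketch below. It is not the FSA-BLAST implementation; the packing scheme and the prefix-matching routine are illustrative only.

```python
# Minimal sketch of byte-packed nucleotide comparison: four bases are packed
# into one byte (2 bits each), so aligned stretches can be compared a byte
# (i.e. four bases) at a time instead of base by base.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string (padded to a multiple of 4 with 'A') into bytes."""
    seq = seq + "A" * (-len(seq) % 4)
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for ch in seq[i:i + 4]:
            b = (b << 2) | CODE[ch]
        out.append(b)
    return bytes(out)

def matching_prefix_bases(packed_a, packed_b):
    """Count identical leading bases between two byte-aligned packed sequences."""
    matches = 0
    for x, y in zip(packed_a, packed_b):
        if x == y:                      # one numeric comparison covers 4 bases
            matches += 4
            continue
        # fall back to 2-bit granularity inside the first differing byte
        for shift in (6, 4, 2, 0):
            if (x >> shift) & 3 == (y >> shift) & 3:
                matches += 1
            else:
                break
        break
    return matches

if __name__ == "__main__":
    a = pack("ACGTACGTGGTT")
    b = pack("ACGTACGTGGAA")
    print(matching_prefix_bases(a, b))   # 10
```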
  • Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis

    Page(s): 365 - 381
    PDF (4770 KB) | HTML

    Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It is commonly assumed that a smaller "filter-out" factor in the SVM-RFE, which eliminates fewer gene features in each recursion, should lead to extraction of a better gene subset. However, our simulations have shown that this assumption is not always correct and that the SVM-RFE is highly sensitive to the "filter-out" factor, making it an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and other applications, a new two-stage SVM-RFE algorithm has been developed. It is designed to effectively eliminate most of the irrelevant, redundant, and noisy genes while keeping information loss small at the first stage. A fine selection for the final gene subset is then performed at the second stage. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE to achieve better algorithm utility. We have demonstrated that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods based on our analysis of three publicly available microarray expression datasets. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is $O(d \log_2 d)$, where $d$ is the size of the original gene set.

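    A generic SVM-RFE loop with a configurable per-iteration elimination fraction (playing the role of the "filter-out" factor) might look like the sketch below, here built on scikit-learn's linear SVC. The crude coarse-then-fine usage at the bottom only gestures at the two-stage idea; it is not the paper's algorithm, datasets, or parameter choices.

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep, filter_out=0.1):
    """Generic SVM-RFE: repeatedly train a linear SVM and drop the fraction
    `filter_out` of remaining features with the smallest |weight|,
    until only n_keep features are left. Returns surviving column indices."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        scores = np.abs(clf.coef_).sum(axis=0)          # feature ranking criterion
        n_drop = max(1, int(len(remaining) * filter_out))
        n_drop = min(n_drop, len(remaining) - n_keep)
        order = np.argsort(scores)                      # smallest weights first
        remaining = np.delete(remaining, order[:n_drop])
    return remaining

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))
    y = (X[:, 3] + X[:, 7] > 0).astype(int)             # only two informative features
    # A crude "two-stage" use: coarse pass with a large filter-out factor,
    # then a fine pass that removes one feature at a time.
    coarse = svm_rfe(X, y, n_keep=20, filter_out=0.3)
    fine = coarse[svm_rfe(X[:, coarse], y, n_keep=5, filter_out=0.01)]
    print(sorted(fine))
```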
  • Neuroinformatics for Genome-Wide 3-D Gene Expression Mapping in the Mouse Brain

    Page(s): 382 - 393
    PDF (1675 KB) | HTML

    Large-scale gene expression studies in the mammalian brain offer the promise of understanding the topology, networks, and, ultimately, the function of its complex anatomy, opening previously unexplored avenues in neuroscience. High-throughput methods permit genome-wide searches to discover genes that are uniquely expressed in brain circuits and regions that control behavior. Previous gene expression mapping studies in model organisms have employed in situ hybridization (ISH), a technique that uses labeled nucleic acid probes to bind to specific mRNA transcripts in tissue sections. A key requirement for this effort is the development of fast and robust algorithms for anatomically mapping and quantifying gene expression for ISH. We describe a neuroinformatics pipeline for automatically mapping expression profiles of ISH data and its use to produce the first genomic-scale 3D mapping of gene expression in a mammalian brain. The pipeline is fully automated and adaptable to other organisms and tissues. Our automated study of more than 20,000 genes indicates that at least 78.8 percent are expressed at some level in the adult C57BL/6J mouse brain. In addition to providing a platform for genomic-scale search, high-resolution images and visualization tools for expression analysis are available at the Allen Brain Atlas web site (http://www.brain-map.org).

  • Reconstructing Recombination Network from Sequence Data: The Small Parsimony Problem

    Page(s): 394 - 402
    PDF (298 KB) | HTML

    The small parsimony problem is studied for reconstructing recombination networks from sequence data. The small parsimony problem is polynomial-time solvable for phylogenetic trees. However, the problem is proven NP-hard even for galled recombination networks. A dynamic programming algorithm is also developed to solve the small parsimony problem. It takes $O(dn2^{3h})$ time on an input recombination network over length-$d$ sequences in which there are $h$ recombination nodes and $n - h$ tree nodes.

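    For contrast with the network case, the polynomial-time tree version of small parsimony mentioned above can be computed by Fitch's algorithm with a single postorder pass per character, as in the hedged sketch below. The toy tree and states are arbitrary, and the paper's dynamic program on recombination networks is not reproduced.

```python
# Fitch small parsimony on a rooted binary tree for one character column.
# The paper's dynamic program on recombination networks generalizes this and
# is exponential in the number of recombination nodes.

def fitch(tree, leaf_states):
    """tree: dict internal node -> (left, right); leaves are absent as keys.
    leaf_states: dict leaf -> state (e.g. 'A', 'C', 'G', 'T').
    Returns (candidate state set at the root, parsimony score)."""
    score = 0

    def post(node):
        nonlocal score
        if node not in tree:                       # leaf
            return {leaf_states[node]}
        left, right = tree[node]
        s1, s2 = post(left), post(right)
        inter = s1 & s2
        if inter:
            return inter                           # no extra mutation needed here
        score += 1                                 # one substitution on this split
        return s1 | s2

    root = next(iter(set(tree) - {c for kids in tree.values() for c in kids}))
    return post(root), score

if __name__ == "__main__":
    tree = {"r": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
    states = {"a": "A", "b": "A", "c": "G", "d": "T"}
    print(fitch(tree, states))    # parsimony score 2
```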
  • Regulatory Motif Discovery Using a Population Clustering Evolutionary Algorithm

    Page(s): 403 - 414
    PDF (2262 KB)

    This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences.

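    As background for the position frequency matrix models mentioned above, the sketch below shows how a PFM, converted to log-odds scores against a uniform background, scores every window of a promoter sequence and reports the best-scoring site. The toy matrix, pseudocount, and sequence are arbitrary, and the evolutionary search itself is not reproduced.

```python
import math

# A position frequency matrix (PFM) gives, for each motif position, counts of
# A/C/G/T. Converting it to log-odds against a background lets us score every
# window of a promoter sequence and report the best-scoring site.

PFM = {  # toy 4-position motif resembling 'TGAC'
    "A": [1, 1, 8, 1],
    "C": [1, 1, 0, 8],
    "G": [0, 8, 1, 0],
    "T": [8, 0, 1, 1],
}
BACKGROUND = 0.25

def log_odds(pfm, pseudocount=0.5):
    width = len(next(iter(pfm.values())))
    cols = []
    for i in range(width):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        cols.append({b: math.log(((pfm[b][i] + pseudocount) / total) / BACKGROUND)
                     for b in "ACGT"})
    return cols

def best_site(seq, pwm):
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        s = sum(pwm[i][seq[start + i]] for i in range(width))
        if best is None or s > best[0]:
            best = (s, start, seq[start:start + width])
    return best

if __name__ == "__main__":
    pwm = log_odds(PFM)
    print(best_site("CCGATTGACGGTAC", pwm))   # should pick the TGAC window
```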
  • Strategies for Identifying Statistically Significant Dense Regions in Microarray Data

    Page(s): 415 - 429
    PDF (1339 KB) | HTML

    We propose and study the notion of dense regions for the analysis of categorized gene expression data and present some searching algorithms for discovering them. The algorithms can be applied to any categorical data matrices derived from gene expression level matrices. We demonstrate that dense regions are simple but useful and statistically significant patterns that can be used to 1) identify genes and/or samples of interest and 2) eliminate genes and/or samples corresponding to outliers, noise, or abnormalities. Some theoretical studies on the properties of dense regions are presented, which allow us to classify dense regions into several classes and to derive tailor-made algorithms for different classes of regions. Moreover, an empirical simulation study on the distribution of the size of dense regions is carried out, which is then used to assess the significance of dense regions and to derive effective pruning methods that speed up the searching algorithms. Real microarray data sets are employed to test our methods. Comparisons with six other well-known clustering algorithms using synthetic and real data are also conducted, confirming the superiority of our methods in discovering dense regions. The DRIFT code and a tutorial are available as supplemental material, which can be found on the Computer Society Digital Library at http://computer.org/tcbb/archives.htm.

  • Bayesian Basecalling for DNA Sequence Analysis Using Hidden Markov Models

    Page(s): 430 - 440
    PDF (460 KB)

    It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models. Basecalling, the procedure that determines the sequence of bases from the given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach which employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, it also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.

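    The Viterbi decoding step that basecalling builds on can be illustrated with a generic log-domain decoder for a small discrete HMM, as below. The two-state, three-symbol model is arbitrary, and neither the electropherogram HMM nor the MCMC parameter estimation from the paper is reproduced.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """obs: sequence of observation indices. Matrices hold probabilities:
    start_p[s], trans_p[s, s'], emit_p[s, o]. Returns the most likely
    state path, computed in the log domain to avoid underflow."""
    n_states = len(start_p)
    T = len(obs)
    logv = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            cand = logv[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(cand)
            logv[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
    path = [int(np.argmax(logv[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    # Two hidden states emitting one of three symbols; decode a short run.
    start = np.array([0.6, 0.4])
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi([0, 1, 2, 2, 0], start, trans, emit))
```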
  • Bidirectional Long Short-Term Memory Networks for Predicting the Subcellular Localization of Eukaryotic Proteins

    Page(s): 441 - 446
    PDF (1786 KB) | HTML

    An algorithm called bidirectional long short-term memory networks (BLSTM) for processing sequential data is introduced. This supervised learning method trains a special recurrent neural network to use very long-range symmetric sequence context by combining nonlinear processing elements with linear feedback loops for storing long-range context. The algorithm is applied to the sequence-based prediction of protein localization and correctly predicts the localization of 93.3 percent of novel nonplant proteins and 88.4 percent of novel plant proteins, an improvement over feedforward and standard recurrent networks solving the same problem. The BLSTM system is available as a Web service at http://stepc.stepc.gr/-synaptic/blstm.html.

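    A minimal bidirectional LSTM sequence classifier of the general kind described above can be written in a few lines of PyTorch, as in the sketch below. The feature dimension, hidden size, number of classes, and mean-pooling readout are arbitrary illustration choices, not the paper's architecture or training protocol.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Reads a protein encoded as a sequence of per-residue feature vectors in
    both directions and predicts one of n_classes localization labels."""
    def __init__(self, n_features=20, hidden=32, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)   # forward + backward states

    def forward(self, x):                  # x: (batch, seq_len, n_features)
        states, _ = self.lstm(x)           # (batch, seq_len, 2*hidden)
        pooled = states.mean(dim=1)        # average over sequence positions
        return self.out(pooled)            # (batch, n_classes) logits

if __name__ == "__main__":
    model = BLSTMClassifier()
    fake_batch = torch.randn(8, 120, 20)   # 8 sequences, 120 residues, 20-dim features
    logits = model(fake_batch)
    print(logits.shape)                    # torch.Size([8, 4])
```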
  • Compression of Annotated Nucleotide Sequences

    Page(s): 447 - 457
    PDF (1711 KB) | HTML

    This paper introduces an algorithm for the lossless compression of DNA files which contain annotation text besides the nucleotide sequence. First, a grammar is specifically designed to capture the regularities of the annotation text. A reversible transformation uses the grammar rules in order to equivalently represent the original file as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The output size of the decision-making process of the grammar is optimized by extending the states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement when compared to the general-purpose methods currently used for DNA files.

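    The core idea of separating annotation text from the nucleotide stream and encoding the two differently can be illustrated crudely as below: split a FASTA-like record into an annotation stream and a base stream, 2-bit-pack the bases, and compress each stream separately with zlib. The grammar, the coding of parser decisions, and the high-order Markov extensions from the paper are not reproduced.

```python
import zlib

# Split an annotated (FASTA-like) file into an annotation stream and a
# nucleotide stream, 2-bit-pack the nucleotides, and compress each stream
# separately; the paper's point is that stream-specific models beat
# general-purpose compression of the mixed file.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def split_streams(text):
    notes, bases = [], []
    for line in text.splitlines():
        (notes if line.startswith(">") else bases).append(line)
    return "\n".join(notes), "".join(bases)

def pack_bases(bases):
    out = bytearray()
    for i in range(0, len(bases), 4):
        chunk = bases[i:i + 4]
        b = 0
        for ch in chunk:
            b = (b << 2) | CODE.get(ch, 0)      # non-ACGT symbols packed as 'A' (lossy; a real coder handles them separately)
        b <<= 2 * (4 - len(chunk))              # left-align a short final chunk
        out.append(b)
    return bytes(out)

if __name__ == "__main__":
    record = (">gi|0000|toy record, chromosome 1, complete sequence\n"
              + "ACGT" * 300 + "\n"
              + ">gi|0001|toy record, chromosome 2, partial sequence\n"
              + "GGCA" * 300 + "\n")
    notes, bases = split_streams(record)
    combined = len(zlib.compress(record.encode()))
    separate = (len(zlib.compress(notes.encode()))
                + len(zlib.compress(pack_bases(bases))))
    print(combined, separate)                   # compressed sizes, mixed vs. split
```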
  • Computing the Hybridization Number of Two Phylogenetic Trees Is Fixed-Parameter Tractable

    Page(s): 458 - 466
    PDF (715 KB) | HTML

    Reticulation processes in evolution mean that the ancestral history of certain groups of present-day species is non-tree-like. These processes include hybridization, lateral gene transfer, and recombination. Although reticulation occurs, such events are relatively rare, so a fundamental problem for biologists is the following: Given a collection of rooted binary phylogenetic trees on sets of species that correctly represent the tree-like evolution of different parts of their genomes, what is the smallest number of "reticulation" vertices in any network that explains the evolution of the species under consideration? It has been previously shown that this problem is NP-hard even when the collection consists of only two rooted binary phylogenetic trees. However, in this paper, we show that the problem is fixed-parameter tractable in the two-tree instance when parameterized by this smallest number of reticulation vertices.

  • Effective Gene Selection Method With Small Sample Sets Using Gradient-Based and Point Injection Techniques

    Page(s): 467 - 475
    PDF (1704 KB) | HTML

    Microarray gene expression data usually consist of a large number of genes. Among these genes, only a small fraction are informative for performing a cancer diagnostic test. This paper focuses on effective identification of informative genes. We analyze gene selection models from the perspective of optimization theory and, as a result, design a new strategy to modify conventional search engines. Also, because overfitting is likely to occur in microarray data owing to their small sample sets, a point injection technique is developed to address the problem of overfitting. The proposed strategies have been evaluated on three kinds of cancer diagnosis tasks. Our results show that the proposed strategies can improve the performance of gene selection substantially. The experimental results also indicate that the proposed methods are very robust in all of the investigated cases.

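    One simple reading of point injection is to augment a small training sample with jittered copies of existing points so that a subsequent selection or classification step is less prone to overfitting; the sketch below does exactly that and nothing more. The noise scale and sample sizes are arbitrary, and the paper's gradient-based selection criterion is not reproduced.

```python
import numpy as np

def inject_points(X, y, n_new, scale=0.1, seed=0):
    """Create n_new synthetic samples by adding small Gaussian perturbations
    to randomly chosen training points (labels are copied unchanged)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    noise = rng.normal(scale=scale * X.std(axis=0), size=(n_new, X.shape[1]))
    return np.vstack([X, X[idx] + noise]), np.concatenate([y, y[idx]])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 500))            # tiny sample, many "genes"
    y = (X[:, 0] > 0).astype(int)
    X_aug, y_aug = inject_points(X, y, n_new=60)
    print(X_aug.shape, y_aug.shape)           # (90, 500) (90,)
```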
  • High-Throughput Ligand Screening via Preclustering and Evolved Neural Networks

    Page(s): 476 - 484
    PDF (1588 KB) | HTML

    The pathway for novel lead drug discovery has many major deficiencies, the most significant of which is the immense size of small-molecule diversity space. Methods that increase the search efficiency and/or reduce the size of the search space increase the rate at which useful lead compounds are identified. Artificial neural networks optimized via evolutionary computation provide a cost- and time-effective solution to this problem. Here, we present results that suggest that preclustering of small molecules prior to neural network optimization is useful for generating models of quantitative structure-activity relationships for a set of HIV inhibitors. Using these methods, it is possible to prescreen compounds to separate active from inactive compounds, or even active and mildly active compounds from inactive compounds, with high predictive accuracy while simultaneously reducing the feature space. It is also possible to identify "human-interpretable" features from the best models that can be used for the proposal and synthesis of new compounds in order to optimize potency and specificity.

  • Multicategory Classification Using An Extreme Learning Machine for Microarray Gene Expression Cancer Diagnosis

    Page(s): 485 - 495
    PDF (2990 KB) | HTML

    In this paper, the recently developed Extreme Learning Machine (ELM) is used for multicategory classification problems in the cancer diagnosis area. ELM avoids problems such as local minima, improper learning rates, and overfitting, which are commonly faced by iterative learning methods, and it completes training very quickly. We have evaluated the multicategory classification performance of ELM on three benchmark microarray data sets for cancer diagnosis, namely, the GCM data set, the Lung data set, and the Lymphoma data set. The results indicate that ELM produces comparable or better classification accuracies with reduced training time and implementation complexity compared to artificial neural network methods such as the conventional back-propagation ANN and Linder's SANN, and Support Vector Machine methods such as SVM-OVO and Ramaswamy's SVM-OVA. ELM also achieves better accuracies for classification of individual categories.

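    A bare-bones ELM is easy to write down: a random, fixed hidden layer followed by a closed-form least-squares (pseudoinverse) fit of the output weights, which is why training is non-iterative and fast. The sketch below is generic; the hidden-layer size, activation, and synthetic data are arbitrary, and the paper's microarray experiments are not reproduced.

```python
import numpy as np

class ELM:
    """Single-hidden-layer feedforward network whose hidden weights are random
    and fixed; only the output weights are fit, in closed form, by least squares."""
    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        T = np.eye(n_classes)[y]                      # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)              # random hidden activations
        self.beta = np.linalg.pinv(H) @ T             # closed-form output weights
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(150, 50))
    y = (X[:, :3].sum(axis=1) > 0).astype(int) + (X[:, 3] > 1)   # rough synthetic classes
    model = ELM(n_hidden=100).fit(X, y)
    print((model.predict(X) == y).mean())             # training accuracy
```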
  • Superiority of Spaced Seeds for Homology Search

    Page(s): 496 - 505
    PDF (595 KB) | HTML

    In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds remain open problems. This paper investigates these two important open questions by formally analyzing the average number of nonoverlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that, when the length of a nonuniformly spaced seed is bounded above by an exponential function of the seed weight, the seed strictly outperforms the traditional consecutive seed of the same weight in both 1) the average number of nonoverlapping hits and 2) the asymptotic hit probability. This clearly answers the first question mentioned above in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.

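    The sensitivity advantage of spaced seeds under the Bernoulli model can be checked empirically with a small Monte Carlo experiment comparing a spaced seed against the consecutive seed of the same weight, as below. The region length, match probability, and the particular weight-11 seed are illustrative choices, not parameters from the paper.

```python
import random

def hits(region, seed):
    """Does the 0/1 match region contain at least one hit of the seed?
    A hit at offset i requires region[i + j] == 1 wherever seed[j] == '1'."""
    positions = [j for j, c in enumerate(seed) if c == "1"]
    for i in range(len(region) - len(seed) + 1):
        if all(region[i + j] for j in positions):
            return True
    return False

def hit_probability(seed, region_len=64, p_match=0.7, trials=20000, seed_rng=0):
    """Estimate, by simulation, the probability that an i.i.d. Bernoulli(p_match)
    match region of the given length contains at least one seed hit."""
    rng = random.Random(seed_rng)
    count = 0
    for _ in range(trials):
        region = [rng.random() < p_match for _ in range(region_len)]
        if hits(region, seed):
            count += 1
    return count / trials

if __name__ == "__main__":
    spaced = "111010010100110111"            # a weight-11 spaced seed
    consecutive = "1" * spaced.count("1")    # consecutive seed of the same weight
    print("spaced     ", hit_probability(spaced))
    print("consecutive", hit_probability(consecutive))
```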
  • Optimization Over a Class of Tree Shape Statistics

    Page(s): 506 - 512
    PDF (408 KB) | HTML

    Tree shape statistics quantify some aspect of the shape of a phylogenetic tree. They are commonly used to compare reconstructed trees to evolutionary models and to find evidence of tree reconstruction bias. Historically, to find a useful tree shape statistic, formulas have been invented by hand and then evaluated for utility. This paper presents the first method that is capable of optimizing over a class of tree shape statistics, called binary recursive tree shape statistics (BRTSS). After defining the BRTSS class, a set of algebraic expressions is defined which can be used in the recursions. The set of tree shape statistics definable using these expressions in the BRTSS is very general and includes many of the statistics with which phylogenetic researchers are already familiar. We then present a practical genetic algorithm which is capable of performing optimization over the BRTSS given any objective function. The paper concludes with a successful application of the methods to find a new statistic which indicates a significant difference between two distributions on trees that were previously postulated to have similar properties.

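    A familiar example of a statistic computable by a binary recursion over the tree, in the spirit of the BRTSS class, is the Colless imbalance index, sketched below. The tree encoding is an assumption of the sketch, and the paper's genetic algorithm over the class is not reproduced.

```python
# A classic tree shape statistic computed by a binary recursion over the tree:
# the Colless index sums, over internal nodes, the imbalance between the leaf
# counts of the two subtrees.

def colless(tree, node="root"):
    """tree: dict internal node -> (left child, right child); leaves are absent
    as keys. Returns (number of leaves below node, Colless index of the subtree)."""
    if node not in tree:
        return 1, 0
    left, right = tree[node]
    n_l, c_l = colless(tree, left)
    n_r, c_r = colless(tree, right)
    return n_l + n_r, c_l + c_r + abs(n_l - n_r)

if __name__ == "__main__":
    balanced = {"root": ("a", "b"), "a": ("l1", "l2"), "b": ("l3", "l4")}
    caterpillar = {"root": ("l1", "a"), "a": ("l2", "b"), "b": ("l3", "l4")}
    print(colless(balanced)[1], colless(caterpillar)[1])   # 0 3
```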
  • IEEE/ACM TCBB: Information for authors

    Page(s): c3
    PDF (89 KB)
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    PDF (330 KB)
    Freely Available from IEEE
  • IEEE/ACM Transactions on Computational Biology and Bioinformatics - Spine

    Page(s): c5 - c6
    PDF (2222 KB)
    Freely Available from IEEE

Aims & Scope

This bimonthly journal publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology.


Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu