By Topic

Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Issue 2 • Date April-June 2009

Filter Results

Displaying Results 1 - 23 of 23
  • [Front cover]

    Publication Year: 2009 , Page(s): c1
    Save to Project icon | Request Permissions | PDF file iconPDF (371 KB)  
    Freely Available from IEEE
  • [Inside front cover]

    Publication Year: 2009 , Page(s): c2
    Save to Project icon | Request Permissions | PDF file iconPDF (176 KB)  
    Freely Available from IEEE
  • EIC Editorial

    Publication Year: 2009 , Page(s): 177
    Save to Project icon | Request Permissions | PDF file iconPDF (77 KB)  
    Freely Available from IEEE
  • Guest Editors' Introduction to the Special Section on Bioinformatics Research and Applications

    Publication Year: 2009 , Page(s): 178 - 179
    Save to Project icon | Request Permissions | PDF file iconPDF (91 KB)  
    Freely Available from IEEE
  • A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats

    Publication Year: 2009 , Page(s): 180 - 189
    Cited by:  Papers (2)
    Multimedia
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1215 KB) |  | HTML iconHTML  

    Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • A Framework for Multiple Kernel Support Vector Regression and Its Applications to siRNA Efficacy Prediction

    Publication Year: 2009 , Page(s): 190 - 199
    Cited by:  Papers (12)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1716 KB) |  | HTML iconHTML  

    The cell defense mechanism of RNA interference has applications in gene function analysis and promising potentials in human disease therapy. To effectively silence a target gene, it is desirable to select appropriate initiator siRNA molecules having satisfactory silencing capabilities. Computational prediction for silencing efficacy of siRNAs can assist this screening process before using them in biological experiments. String kernel functions, which operate directly on the string objects representing siRNAs and target mRNAs, have been applied to support vector regression for the prediction and improved accuracy over numerical kernels in multidimensional vector spaces constructed from descriptors of siRNA design rules. To fully utilize information provided by string and numerical data, we propose to unify the two in a kernel feature space by devising a multiple kernel regression framework where a linear combination of the kernels is used. We formulate the multiple kernel learning into a quadratically constrained quadratic programming (QCQP) problem, which although yields global optimal solution, is computationally demanding and requires a commercial solver package. We further propose three heuristics based on the principle of kernel-target alignment and predictive accuracy. Empirical results demonstrate that multiple kernel regression can improve accuracy, decrease model complexity by reducing the number of support vectors, and speed up computational performance dramatically. In addition, multiple kernel regression evaluates the importance of constituent kernels, which for the siRNA efficacy prediction problem, compares the relative significance of the design rules. Finally, we give insights into the multiple kernel regression mechanism and point out possible extensions. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Network-Based Inference of Cancer Progression from Microarray Data

    Publication Year: 2009 , Page(s): 200 - 212
    Cited by:  Papers (1)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (2450 KB) |  | HTML iconHTML  

    Cancer cells exhibit a common phenotype of uncontrolled cell growth, but this phenotype may arise from many different combinations of mutations. By inferring how cells evolve in individual tumors, a process called cancer progression, we may be able to identify important mutational events for different tumor types, potentially leading to new therapeutics and diagnostics. Prior work has shown that it is possible to infer frequent progression pathways by using gene expression profiles to estimate ldquodistancesrdquo between tumors. Here, we apply gene network models to improve these estimates of evolutionary distance by controlling for correlations among coregulated genes. We test three variants of this approach: one using an optimized best-fit network, another using sampling to infer a high-confidence subnetwork, and one using a modular network inferred from clusters of similarly expressed genes. Application to lung cancer and breast cancer microarray data sets shows small improvements in phylogenies when correcting from the optimized network and more substantial improvements when correcting from the sampled or modular networks. Our results suggest that a network correction approach improves estimates of tumor similarity, but sophisticated network models are needed to control for the large hypothesis space and sparse data currently available. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Generalized Gene Adjacencies, Graph Bandwidth, and Clusters in Yeast Evolution

    Publication Year: 2009 , Page(s): 213 - 220
    Cited by:  Papers (5)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1549 KB) |  | HTML iconHTML  

    We present a parameterized definition of gene clusters that allows us to control the emphasis placed on conserved order within a cluster. Though motivated by biological rather than mathematical considerations, this parameter turns out to be closely related to the bandwidth parameter of a graph. Our focus will be on how this parameter affects the characteristics of clusters: how numerous they are, how large they are, how rearranged they are, and to what extent they are preserved from ancestor to descendant in a phylogenetic tree. We infer the latter property by dynamic programming optimization of the presence of individual edges at the ancestral nodes of the phylogeny. We apply our analysis to a set of genomes drawn from the Yeast Gene Order Browser. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • The Gene-Duplication Problem: Near-Linear Time Algorithms for NNI-Based Local Searches

    Publication Year: 2009 , Page(s): 221 - 231
    Cited by:  Papers (6)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1135 KB) |  | HTML iconHTML  

    The gene-duplication problem is to infer a species supertree from a collection of gene trees that are confounded by complex histories of gene-duplication events. This problem is NP-complete and thus requires efficient and effective heuristics. Existing heuristics perform a stepwise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. A classical local search problem is the NNI search problem, which is based on the nearest neighbor interchange operation. In this work, we 1) provide a novel near-linear time algorithm for the NNI search problem, 2) introduce extensions that significantly enlarge the search space of the NNI search problem, and 3) present algorithms for these extended versions that are asymptotically just as efficient as our algorithm for the NNI search problem. The exceptional speedup achieved in the extended NNI search problems makes the gene-duplication problem more tractable for large-scale phylogenetic analyses. We verify the performance of our algorithms in a comparison study using sets of large randomly generated gene trees. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Designing Patterns and Profiles for Faster HMM Search

    Publication Year: 2009 , Page(s): 232 - 243
    Cited by:  Papers (1)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (491 KB) |  | HTML iconHTML  

    Profile HMMs are powerful tools for modeling conserved motifs in proteins. They are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. It is highly desirable to speed up HMM search in large databases. We design PROSITE-like patterns and short profiles that are used as filters to rapidly eliminate protein-motif pairs for which a full profile HMM comparison does not yield a significant match. The design of the pattern-based filters is formulated as a multichoice knapsack problem. Profile-based filters with high sensitivity are extracted from a profile HMM based on their theoretical sensitivity and false positive rate. Experiments show that our profile-based filters achieve high sensitivity (near 100 percent) while keeping around 20times speedup with respect to the unfiltered search program. Pattern-based filters typically retain at least 90 percent of the sensitivity of the source HMM with 30-40times speedup. The profile-based filters have sensitivity comparable to the multistage filtering strategy HMMERHEAD and are faster in most of our experiments. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Fuzzy-Adaptive-Subspace-Iteration-Based Two-Way Clustering of Microarray Data

    Publication Year: 2009 , Page(s): 244 - 259
    Cited by:  Papers (3)
    Multimedia
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (3894 KB)  

    This paper presents fuzzy-adaptive-subspace-iteration-based two-way clustering (FASIC) of microarray data for finding differentially expressed genes (DEGs) from two-sample microarray experiments. The concept of fuzzy membership is introduced to transform the hard adaptive subspace iteration (ASI) algorithm into a fuzzy-ASI algorithm to perform two-way clustering. The proposed approach follows a progressive framework to assign a relevance value to genes associated with each cluster. Subsequently, each gene cluster is scored and ranked based on its potential to provide a correct classification of the sample classes. These ranks are converted into P values using the R-test, and the significance of each gene is determined. A fivefold validation is performed on the DEGs selected using the proposed approach. Empirical analyses on a number of simulated microarray data sets are conducted to quantify the results obtained using the proposed approach. To exemplify the efficacy of the proposed approach, further analyses on different real microarray data sets are also performed. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Gene Clustering via Integrated Markov Models Combining Individual and Pairwise Features

    Publication Year: 2009 , Page(s): 260 - 270
    Cited by:  Papers (4)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1991 KB) |  | HTML iconHTML  

    Clustering of genes into groups sharing common characteristics is a useful exploratory technique for a number of subsequent computational analysis. A wide range of clustering algorithms have been proposed in particular to analyze gene expression data, but most of them consider genes as independent entities or include relevant information on gene interactions in a suboptimal way. We propose a probabilistic model that has the advantage to account for individual data (e.g., expression) and pairwise data (e.g., interaction information coming from biological networks) simultaneously. Our model is based on hidden Markov random field models in which parametric probability distributions account for the distribution of individual data. Data on pairs, possibly reflecting distance or similarity measures between genes, are then included through a graph, where the nodes represent the genes, and the edges are weighted according to the available interaction information. As a probabilistic model, this model has many interesting theoretical features. In addition, preliminary experiments on simulated and real data show promising results and points out the gain in using such an approach. Availability: The software used in this work is written in C++ and is available with other supplementary material at http://mistis.inrialpes.fr/people/forbes/transparentia/supplementary.html. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Semantics of Multimodal Network Models

    Publication Year: 2009 , Page(s): 271 - 280
    Cited by:  Papers (3)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1269 KB) |  | HTML iconHTML  

    A multimodal network (MMN) is a novel graph-theoretic formalism designed to capture the structure of biological networks and to represent relationships derived from multiple biological databases. MMNs generalize the standard notions of graphs and hypergraphs, which are the bases of current diagrammatic representations of biological phenomena, and incorporate the concept of mode. Each vertex of an MMN is a biological entity, a biot, while each modal hyperedge is a typed relationship, where the type is given by the mode of the hyperedge. The semantics of each modal hyperedge e is given through denotational semantics, where a valuation function f_{e} defines the relationship among the values of the vertices incident on e. The meaning of an MMN is denoted in terms of the semantics of a hyperedge sequence. A companion paper defines MMNs and concentrates on the structural aspects of MMNs. This paper develops MMN denotational semantics when used as a representation of the semantics of biological networks and discusses applications of MMNs in managing complex biological data. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Statistical Alignment with a Sequence Evolution Model Allowing Rate Heterogeneity along the Sequence

    Publication Year: 2009 , Page(s): 281 - 295
    Cited by:  Papers (2)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1827 KB) |  | HTML iconHTML  

    We present a stochastic sequence evolution model to obtain alignments and estimate mutation rates between two homologous sequences. The model allows two possible evolutionary behaviors along a DNA sequence in order to determine conserved regions and take its heterogeneity into account. In our model, the sequence is divided into slow and fast evolution regions. The boundaries between these sections are not known. It is our aim to detect them. The evolution model is based on a fragment insertion and deletion process working on fast regions only and on a substitution process working on fast and slow regions with different rates. This model induces a pair hidden Markov structure at the level of alignments, thus making efficient statistical alignment algorithms possible. We propose two complementary estimation methods, namely, a Gibbs sampler for Bayesian estimation and a stochastic version of the EM algorithm for maximum likelihood estimation. Both algorithms involve the sampling of alignments. We propose a partial alignment sampler, which is computationally less expensive than the typical whole alignment sampler. We show the convergence of the two estimation algorithms when used with this partial sampler. Our algorithms provide consistent estimates for the mutation rates and plausible alignments and sequence segmentations on both simulated and real data. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Visual Exploration of Three-Dimensional Gene Expression Using Physical Views and Linked Abstract Views

    Publication Year: 2009 , Page(s): 296 - 309
    Cited by:  Papers (10)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (4317 KB) |  | HTML iconHTML  

    During animal development, complex patterns of gene expression provide positional information within the embryo. To better understand the underlying gene regulatory networks, the Berkeley Drosophila Transcription Network Project (BDTNP) has developed methods that support quantitative computational analysis of three-dimensional (3D) gene expression in early Drosophila embryos at cellular resolution. We introduce PointCloudXplore (PCX), an interactive visualization tool that supports visual exploration of relationships between different genes' expression using a combination of established visualization techniques. Two aspects of gene expression are of particular interest: 1) gene expression patterns defined by the spatial locations of cells expressing a gene and 2) relationships between the expression levels of multiple genes. PCX provides users with two corresponding classes of data views: 1) physical views based on the spatial relationships of cells in the embryo and 2) abstract views that discard spatial information and plot expression levels of multiple genes with respect to each other. Cell selectors highlight data associated with subsets of embryo cells within a View. Using linking, these selected cells can be viewed in multiple representations. We describe PCX as a 3D gene expression visualization tool and provide examples of how it has been used by BDTNP biologists to generate new hypotheses. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Conditioning-Based Modeling of Contextual Genomic Regulation

    Publication Year: 2009 , Page(s): 310 - 320
    Cited by:  Papers (9)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (2607 KB) |  | HTML iconHTML  

    A more complete understanding of the alterations in cellular regulatory and control mechanisms that occur in the various forms of cancer has been one of the central targets of the genomic and proteomic methods that allow surveys of the abundance and/or state of cellular macromolecules. This preference is driven both by the intractability of cancer to generic therapies, assumed to be due to the highly varied molecular etiologies observed in cancer, and by the opportunity to discern and dissect the regulatory and control interactions presented by the highly diverse assortment of perturbations of regulation and control that arise in cancer. Exploiting the opportunities for inference on the regulatory and control connections offered by these revealing system perturbations is fraught with the practical problems that arise from the way biological systems operate. Two classes of regulatory action in biological systems are particularly inimical to inference, convergent regulation, where a variety of regulatory actions result in a common set of control responses (crosstalk), and divergent regulation, where a single regulatory action produces entirely different sets of control responses, depending on cellular context (conditioning). We have constructed a coarse mathematical model of the propagation of regulatory influence in such distributed, context-sensitive regulatory networks that allows a quantitative estimation of the amount of crosstalk and conditioning associated with a candidate regulatory gene taken from a set of genes that have been profiled over a series of samples where the candidate' s activity varies. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Multimodal Networks: Structure and Operations

    Publication Year: 2009 , Page(s): 321 - 332
    Cited by:  Papers (2)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1704 KB) |  | HTML iconHTML  

    A multimodal network (MMN) is a novel graph-theoretic formalism designed to capture the structure of biological networks and to represent relationships derived from multiple biological databases. MMNs generalize the standard notions of graphs and hypergraphs, which are the bases of current diagrammatic representations of biological phenomena and incorporate the concept of mode. Each vertex of an MMN is a biological entity, a biot, while each modal hyperedge is a typed relationship, where the type is given by the mode of the hyperedge. The current paper defines MMNs and concentrates on the structural aspects of MMNs. A companion paper develops MMNs as a representation of the semantics of biological networks and discusses applications of the MMNs in managing complex biological data. The MMN model has been implemented in a database system containing multiple kinds of biological networks. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Optimal Aggregation of Binary Classifiers for Multiclass Cancer Diagnosis Using Gene Expression Profiles

    Publication Year: 2009 , Page(s): 333 - 343
    Cited by:  Papers (6)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1203 KB) |  | HTML iconHTML  

    Multiclass classification is one of the fundamental tasks in bioinformatics and typically arises in cancer diagnosis studies by gene expression profiling. There have been many studies of aggregating binary classifiers to construct a multiclass classifier based on one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, as well as some comparison studies between them. However, the studies found that the best coding depends on each situation. Therefore, a new problem, which we call the ldquooptimal coding problem,rdquo has arisen: how can we determine which coding is the optimal one in each situation? To approach this optimal coding problem, we propose a novel framework for constructing a multiclass classifier, in which each binary classifier to be aggregated has a weight value to be optimally tuned based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can be a consistent answer to the problem. We apply this method to various classification problems including a synthesized data set and some cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method can improve classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

    Publication Year: 2009 , Page(s): 344 - 352
    Cited by:  Papers (10)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (951 KB) |  | HTML iconHTML  

    Large sets of bioinformatical data provide a challenge in time consumption while solving the cluster identification problem, and that is why a parallel algorithm is so needed for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed. It identifies clusters through the identification of densely intraconnected subgraphs. We have employed a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of a graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction algorithm is to first partition the graph, then construct MSTs for the partitioned subgraphs and auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. The computational results indicate that when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on single CPU, indicating that this program is capable of handling very large data clustering problems in an efficient manner. We have implemented the clustering algorithm as the software CLUMP. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Prediction of Cancer Class with Majority Voting Genetic Programming Classifier Using Gene Expression Data

    Publication Year: 2009 , Page(s): 353 - 367
    Cited by:  Papers (6)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (4178 KB) |  | HTML iconHTML  

    In order to get a better understanding of different types of cancers and to find the possible biomarkers for diseases, recently, many researchers are analyzing the gene expression data using various machine learning techniques. However, due to a very small number of training samples compared to the huge number of genes and class imbalance, most of these methods suffer from overfitting. In this paper, we present a majority voting genetic programming classifier (MVGPC) for the classification of microarray data. Instead of a single rule or a single set of rules, we evolve multiple rules with genetic programming (GP) and then apply those rules to test samples to determine their labels with majority voting technique. By performing experiments on four different public cancer data sets, including multiclass data sets, we have found that the test accuracies of MVGPC are better than those of other methods, including AdaBoost with GP. Moreover, some of the more frequently occurring genes in the classification rules are known to be associated with the types of cancers being studied in this paper. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • The IEEE Computer Society Career Center

    Publication Year: 2009 , Page(s): 368
    Save to Project icon | Request Permissions | PDF file iconPDF (308 KB)  
    Freely Available from IEEE
  • IEEE/ACM TCBB: Information for authors

    Publication Year: 2009 , Page(s): c3
    Save to Project icon | Request Permissions | PDF file iconPDF (176 KB)  
    Freely Available from IEEE
  • [Back cover]

    Publication Year: 2009 , Page(s): c4
    Save to Project icon | Request Permissions | PDF file iconPDF (371 KB)  
    Freely Available from IEEE

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu