Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Issue 1 • Date Jan.-March 2010

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Editor's Note

    Page(s): 1
    Freely Available from IEEE
  • Automatic Detection of Large Dense-Core Vesicles in Secretory Cells and Statistical Analysis of Their Intracellular Distribution

    Page(s): 2 - 11

    Analyzing the morphological appearance and the spatial distribution of large dense-core vesicles (granules) in the cell cytoplasm is central to the understanding of regulated exocytosis. This paper is concerned with the automatic detection of granules and the statistical analysis of their spatial locations in different cell groups. We model the locations of granules of a given cell as a realization of a finite spatial point process and the point patterns associated with the cell groups as replicated point patterns of different spatial point processes. First, an algorithm to segment the granules using electron microscopy images is proposed. Second, the relative locations of the granules with respect to the plasma membrane are characterized by two functional descriptors: the empirical cumulative distribution function of the distances from the granules to the plasma membrane and the density of granules within a given distance to the plasma membrane. The descriptors of the different cells for each group are compared using bootstrap procedures. Our results show that these descriptors and the testing procedure allow discriminating between control and treated cells. The application of these novel tools to studies of secretion should help in the analysis of diseases associated with dysfunctional secretion, such as diabetes.
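
    The two descriptors named in the vesicle abstract above can be illustrated with a minimal sketch. It assumes the granule-to-membrane distances have already been measured; the function names, the sample distances, and the band area are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def empirical_cdf(distances):
    """Return a function F where F(d) is the fraction of distances <= d."""
    ordered = sorted(distances)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one distance")

    def F(d):
        # bisect_right counts how many sorted values are <= d
        return bisect_right(ordered, d) / n

    return F


def density_within(distances, radius, band_area):
    """Granules per unit area among those within `radius` of the membrane.

    `band_area` is the area of the region within `radius` of the membrane;
    it is assumed to be supplied by the caller.
    """
    if band_area <= 0:
        raise ValueError("band_area must be positive")
    count = sum(1 for d in distances if d <= radius)
    return count / band_area


# Hypothetical distances from five granules to the plasma membrane
distances_nm = [12.0, 40.0, 40.0, 95.0, 210.0]
F = empirical_cdf(distances_nm)
```

    Evaluating F on a grid of distances for each cell gives one curve per cell, which is the kind of functional descriptor a bootstrap comparison between groups would operate on.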
  • BioExtract Server—An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data

    Page(s): 12 - 24

    Many in silico investigations in bioinformatics require access to multiple, distributed data sources and analytic tools. The requisite data sources may include large public data repositories, community databases, and project databases for use in domain-specific research. Different data sources frequently utilize distinct query languages and return results in unique formats, and therefore researchers must either rely upon a small number of primary data sources or become familiar with multiple query languages and formats. Similarly, the associated analytic tools often require specific input formats and produce unique outputs which make it difficult to utilize the output from one tool as input to another. The BioExtract Server (http://bioextract.org) is a Web-based data integration application designed to consolidate, analyze, and serve data from heterogeneous biomolecular databases in the form of a mash-up. The basic operations of the BioExtract Server allow researchers, via their Web browsers, to specify data sources, flexibly query data sources, apply analytic tools, download result sets, and store query results for later reuse. As a researcher works with the system, their "steps" are saved in the background. At any time, these steps can be preserved long-term as a workflow simply by providing a workflow name and description.
  • Feature Selection for Gene Expression Using Model-Based Entropy

    Page(s): 25 - 36

    Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. Using machine learning techniques, traditional gene selection based on empirical mutual information suffers from data sparseness due to the small number of samples. To overcome the sparseness issue, we propose a model-based approach to estimate the entropy of class variables on the model, instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and standard deviation and are widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, since the conditional distribution of class variables given the selected features is a normal distribution, its entropy can be computed with the log-determinant of its covariance matrix. Because of the large number of genes, the computation of all possible log-determinants is not efficient. We propose several algorithms to greatly reduce the computational cost. The experiments on seven gene data sets and the comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection, and the efficiency of our algorithms.
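
    The log-determinant step mentioned in the feature-selection abstract above follows from the standard entropy formula for a multivariate normal distribution. The symbols below (C for the class variables, S for the selected features, k and d for dimensions) are notation assumed here for illustration; the paper's own formulation may differ.

```latex
% X ~ N(mu, Sigma) in k dimensions:
H(X) = \tfrac{1}{2} \ln\!\big( (2\pi e)^{k} \det \Sigma \big)
     = \tfrac{k}{2} \ln(2\pi e) + \tfrac{1}{2} \ln \det \Sigma

% Class variables C (dimension d) and selected features S, jointly normal:
\Sigma_{C \mid S} = \Sigma_{CC} - \Sigma_{CS} \Sigma_{SS}^{-1} \Sigma_{SC}

H(C \mid S) = \tfrac{1}{2} \ln\!\big( (2\pi e)^{d} \det \Sigma_{C \mid S} \big),
\qquad
\det \Sigma_{C \mid S} = \frac{\det \Sigma_{(C,S)}}{\det \Sigma_{SS}}
```

    The last identity (a Schur-complement determinant) shows why comparing feature subsets reduces to comparing log-determinants of covariance submatrices.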
  • Heuristic Bayesian Segmentation for Discovery of Coexpressed Genes within Genomic Regions

    Page(s): 37 - 49

    Segmentation aims to separate homogeneous areas from the sequential data, and plays a central role in data mining. It has applications ranging from finance to molecular biology, where bioinformatics tasks such as genome data analysis are active application fields. In this paper, we present a novel application of segmentation in locating genomic regions with coexpressed genes. We aim at automated discovery of such regions without requirement for user-given parameters. In order to perform the segmentation within a reasonable time, we use heuristics. Most of the heuristic segmentation algorithms require some decision on the number of segments. This is usually accomplished by using asymptotic model selection methods like the Bayesian information criterion. Such methods are based on some simplification, which can limit their usage. In this paper, we propose a Bayesian model selection to choose the most proper result from heuristic segmentation. Our Bayesian model presents a simple prior for the segmentation solutions with various segment numbers and a modified Dirichlet prior for modeling multinomial data. We show with various artificial data sets in our benchmark system that our model selection criterion has the best overall performance. The application of our method in yeast cell-cycle gene expression data reveals potential active and passive regions of the genome.
  • Data-Fusion in Clustering Microarray Data: Balancing Discovery and Interpretability

    Page(s): 50 - 63

    While clustering genes remains one of the most popular exploratory tools for expression data, it often results in highly variable and biologically uninformative clusters. This paper explores a data fusion approach to clustering microarray data. Our method, which combines expression data and gene ontology (GO)-derived information, is applied to a real data set to perform genome-wide clustering. A set of novel tools is proposed to validate the clustering results and pick a fair value of the infusion coefficient. These tools measure stability, biological relevance, and distance from the expression-only clustering solution. Our results indicate that a data-fusion clustering leads to more stable, biologically relevant clusters that are still representative of the experimental data.
  • Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

    Page(s): 64 - 79

    The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex data sets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss 1) the integration of data clustering and visualization into one framework, 2) the application of data clustering to 3D gene expression data, 3) the evaluation of the number of clusters k in the context of 3D gene expression clustering, and 4) the improvement of overall analysis quality via dedicated postprocessing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.
  • Multidimensional Profiling of Cell Surface Proteins and Nuclear Markers

    Page(s): 80 - 90

    Cell membrane proteins play an important role in tissue architecture and cell-cell communication. We hypothesize that segmentation and multidimensional characterization of the distribution of cell membrane proteins, on a cell-by-cell basis, enable improved classification of treatment groups and identify important characteristics that can otherwise be hidden. We have developed a series of computational steps to 1) delineate cell membrane protein signals and associate them with a specific nucleus; 2) compute a coupled representation of the multiplexed DNA content with membrane proteins; 3) rank computed features associated with such a multidimensional representation; 4) visualize selected features for comparative evaluation through heatmaps; and 5) discriminate between treatment groups in an optimal fashion. The novelty of our method is in the segmentation of the membrane signal and the multidimensional representation of phenotypic signature on a cell-by-cell basis. To test the utility of this method, the proposed computational steps were applied to images of cells that have been irradiated with different radiation qualities in the presence and absence of other small molecules. These samples are labeled for their DNA content and E-cadherin membrane proteins. We demonstrate that multidimensional representations of cell-by-cell phenotypes improve predictive and visualization capabilities among different treatment groups, and identify hidden variables.
  • Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis

    Page(s): 91 - 99

    The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the gene ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach.
  • Sparse Support Vector Machines with L_{p} Penalty for Biomarker Identification

    Page(s): 100 - 107

    The development of high-throughput technology has generated a massive amount of high-dimensional data, and many of them are of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature selection and overfitting control. However, most feature selection algorithms are only applicable to the continuous data type. In this paper, we propose a novel method for sparse support vector machines (SVMs) with Lp (p < 1) regularization. Efficient algorithms (LpSVM) are developed for learning the classifier that is applicable to high-dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated through maximizing the area under the ROC curve (AUC) of the cross-validation data. Experimental results on protein sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers identified with our methods are compared with those from other methods in the literature. The software package in Matlab is available upon request.
  • A Multiple-Filter-Multiple-Wrapper Approach to Gene Selection and Microarray Data Classification

    Page(s): 108 - 117

    Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computational burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on the choice of specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of the MFMW-selected genes have been confirmed to be biomarkers or contribute to the development of particular cancers by other studies.
  • A Trade-Off between Sample Complexity and Computational Complexity in Learning Boolean Networks from Time-Series Data

    Page(s): 118 - 125

    A key problem in molecular biology is to infer regulatory relationships between genes from expression data. This paper studies a simplified model of such inference problems in which one or more Boolean variables, modeling, for example, the expression levels of genes, each depend deterministically on a small but unknown subset of a large number of Boolean input variables. Our model assumes that the expression data comprises a time series, in which successive samples may be correlated. We provide bounds on the expected amount of data needed to infer the correct relationships between output and input variables. These bounds improve and generalize previous results for Boolean network inference and continuous-time switching network inference. Although the computational problem is intractable in general, we describe a fixed-parameter tractable algorithm that is guaranteed to provide at least a partial solution to the problem. Most interestingly, both the sample complexity and computational complexity of the problem depend on the strength of correlations between successive samples in the time series but in opposing ways. Uncorrelated samples minimize the total number of samples needed while maximizing computational complexity; a strong correlation between successive samples has the opposite effect. This observation has implications for the design of experiments for measuring gene expression.
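
    The inference problem in the Boolean-network abstract above can be made concrete with a naive baseline: enumerate small subsets of input variables and keep those on which the output behaves as a deterministic function across all samples. This brute-force search is an illustrative assumption, not the fixed-parameter tractable algorithm from the paper; the function name and the sample data are hypothetical.

```python
from itertools import combinations


def consistent_input_sets(samples, max_inputs):
    """Return the smallest input subsets that determine the output.

    `samples` is a list of (inputs, output) pairs, where `inputs` is a tuple
    of 0/1 values observed at time t and `output` is the 0/1 value of the
    target variable at time t + 1. Subsets are tried in order of size, and
    all consistent subsets of the first successful size are returned.
    """
    n = len(samples[0][0])
    for size in range(max_inputs + 1):
        found = []
        for subset in combinations(range(n), size):
            seen = {}  # maps projected input pattern -> observed output
            ok = True
            for inputs, out in samples:
                key = tuple(inputs[i] for i in subset)
                # Same pattern with a different output means the subset
                # cannot explain the data deterministically.
                if seen.setdefault(key, out) != out:
                    ok = False
                    break
            if ok:
                found.append(subset)
        if found:
            return found
    return []


# Hypothetical samples generated by output = x1 AND x2 (x0 is irrelevant)
samples = [
    ((0, 0, 0), 0),
    ((0, 1, 0), 0),
    ((1, 0, 1), 0),
    ((1, 1, 1), 1),
    ((0, 1, 1), 1),
    ((1, 1, 0), 0),
]
result = consistent_input_sets(samples, 2)
```

    The search visits every subset of up to `max_inputs` variables, so its cost grows quickly with the number of inputs; with too few or too similar samples, several subsets can remain consistent, which is where sample complexity enters.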
  • Efficient Peak-Labeling Algorithms for Whole-Sample Mass Spectrometry Proteomics

    Page(s): 126 - 137

    Whole-sample mass spectrometry (MS) proteomics allows for a parallel measurement of hundreds of proteins present in a variety of biospecimens. Unfortunately, the association between MS signals and these proteins is not straightforward. The need to interpret mass spectra demands the development of methods for accurate labeling of ion species in such profiles. To aid this process, we have developed a new peak-labeling procedure for associating protein and peptide labels with peaks. This computational method builds upon characteristics of proteins expected to be in the sample, such as the amino sequence, mass weight, and expected concentration within the sample. A new probabilistic score that incorporates this information is proposed. We evaluate and demonstrate our method's ability to label peaks first on simulated MS spectra and then on MS spectra from human serum with a spiked-in calibration mixture.
  • Exploratory Consensus of Hierarchical Clusterings for Melanoma and Breast Cancer

    Page(s): 138 - 152

    Finding subtypes of heterogeneous diseases is the biggest challenge in the area of biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's, and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both the data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by high classification accuracy obtained in the data set of van de Vijver et al.
  • Identification of Regulatory Modules in Time Series Gene Expression Data Using a Linear Time Biclustering Algorithm

    Page(s): 153 - 165

    Although most biclustering formulations are NP-hard, in time series expression data analysis, it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results in synthetic and real data showing the effectiveness of the approach and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed methodology to extract relevant information compatible with documented biological knowledge but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general.
  • Incomplete Lineage Sorting: Consistent Phylogeny Estimation from Multiple Loci

    Page(s): 166 - 171

    We introduce a simple, computationally efficient algorithm for reconstructing phylogenies from multiple gene trees in the presence of incomplete lineage sorting, that is, when the topology of the gene trees may differ from that of the species tree. We show that our technique is statistically consistent under standard stochastic assumptions, that is, it returns the correct tree given sufficiently many unlinked loci. We also show that it can tolerate moderate estimation errors.
  • On the Importance of Comprehensible Classification Models for Protein Function Prediction

    Page(s): 172 - 182

    The literature on protein function prediction is currently dominated by works aimed at maximizing predictive accuracy, ignoring the important issues of validation and interpretation of discovered knowledge, which can lead to new insights and hypotheses that are biologically meaningful and advance the understanding of protein functions by biologists. The overall goal of this paper is to critically evaluate this approach, offering a refreshing new perspective on this issue, focusing not only on predictive accuracy but also on the comprehensibility of the induced protein function prediction models. More specifically, this paper aims to offer two main contributions to the area of protein function prediction. First, it presents the case for discovering comprehensible protein function prediction models from data, discussing in detail the advantages of such models, namely, increasing the confidence of the biologist in the system's predictions, leading to new insights about the data and the formulation of new biological hypotheses, and detecting errors in the data. Second, it presents a critical review of the pros and cons of several different knowledge representations that can be used in order to support the discovery of comprehensible protein function prediction models.
  • Approximate Maximum Parsimony and Ancestral Maximum Likelihood

    Page(s): 183 - 187

    We explore the maximum parsimony (MP) and ancestral maximum likelihood (AML) criteria in phylogenetic tree reconstruction. Both problems are NP-hard, so we seek approximate solutions. We formulate the two problems as Steiner tree problems under appropriate distances. The gist of our approach is the succinct characterization of Steiner trees for a small number of leaves for the two distances. This enables the use of known Steiner tree approximation algorithms. The approach leads to a 16/9 approximation ratio for AML and asymptotically to a 1.55 approximation ratio for MP.
  • 2009 Reviewer's List

    Page(s): 188 - 190
    Freely Available from IEEE
  • IEEE Computer Society CSDA Certification [advertisement]

    Page(s): 191
    Freely Available from IEEE
  • IEEE Computer Society Career Center

    Page(s): 192
    Freely Available from IEEE
  • IEEE/ACM TCBB: Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu