
IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue 1 • Jan.-Feb. 2011


Displaying Results 1 - 25 of 36
  • [Front cover]

    Page(s): c1
    PDF (830 KB) | Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    PDF (236 KB) | Freely Available from IEEE
  • EIC Editorial

    Page(s): 1
    PDF (36 KB) | Freely Available from IEEE
  • A Fast Algorithm for Computing Geodesic Distances in Tree Space

    Page(s): 2 - 13
    PDF (1116 KB) | HTML

    Comparing and computing distances between phylogenetic trees are important biological problems, especially for models where edge lengths play an important role. The geodesic distance measure between two phylogenetic trees with edge lengths is the length of the shortest path between them in the continuous tree space introduced by Billera, Holmes, and Vogtmann. This tree space provides a powerful tool for studying and comparing phylogenetic trees, both in exhibiting a natural distance measure and in providing a Euclidean-like structure for solving optimization problems on trees. An important open problem is to find a polynomial time algorithm for finding geodesics in tree space. This paper gives such an algorithm, which starts with a simple initial path and moves through a series of successively shorter paths until the geodesic is attained.

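The "simple initial path" mentioned in the abstract can be taken to be the cone path of Billera, Holmes, and Vogtmann, whose length is easy to compute and upper-bounds the geodesic distance. A minimal sketch, assuming each tree is given as a map from a (hypothetical) split identifier to its edge length:

```python
import math

def cone_path_length(tree_a, tree_b):
    """Upper bound on the BHV geodesic distance between two trees.

    tree_a, tree_b: dicts mapping a split identifier to its edge length.
    Edges present in both trees contribute a Euclidean difference;
    the remaining edges are contracted through the origin (the "cone").
    Requires Python 3.8+ for variadic math.hypot.
    """
    common = tree_a.keys() & tree_b.keys()
    only_a = [l for s, l in tree_a.items() if s not in common]
    only_b = [l for s, l in tree_b.items() if s not in common]
    shared_sq = sum((tree_a[s] - tree_b[s]) ** 2 for s in common)
    cone = math.hypot(*only_a) + math.hypot(*only_b)  # ||A|| + ||B||
    return math.sqrt(cone ** 2 + shared_sq)
```

The true geodesic length always lies between the plain Euclidean lower bound and this cone-path length; the paper's algorithm iteratively shortens the path from such a starting point.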
  • A General Framework for Analyzing Data from Two Short Time-Series Microarray Experiments

    Page(s): 14 - 26
    Multimedia
    PDF (1756 KB)

    We propose a general theoretical framework for analyzing differentially expressed genes and behavior patterns from two homogeneous short time-course data sets. The framework generalizes the recently proposed Hilbert-Schmidt Independence Criterion (HSIC)-based framework, adapting it to the time-series scenario by utilizing tensor analysis for data transformation. The proposed framework is effective in yielding criteria that can identify both the differentially expressed genes and time-course patterns of interest between two time-series experiments without requiring the data to be explicitly clustered. The results obtained by applying the proposed framework with a linear kernel formulation on various data sets are found to be both biologically meaningful and consistent with published studies.

  • Efficient Formulations for Exact Stochastic Simulation of Chemical Systems

    Page(s): 27 - 35
    PDF (1550 KB) | HTML

    One can generate trajectories to simulate a system of chemical reactions using either Gillespie's direct method or Gibson and Bruck's next reaction method. Because one usually needs many trajectories to understand the dynamics of a system, performance is important. In this paper, we present new formulations of these methods that improve the computational complexity of the algorithms. We present optimized implementations, available from http://cain.sourceforge.net/, that offer better performance than previous work. There is no single method that is best for all problems. Simple formulations often work best for systems with a small number of reactions, while some sophisticated methods offer the best performance for large problems and scale well asymptotically. We investigate the performance of each formulation on simple biological systems using a wide range of problem sizes. We also consider the numerical accuracy of the direct and the next reaction methods. We have found that special precautions must be taken in order to ensure that randomness is not discarded during the course of a simulation.

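Gillespie's direct method, the baseline formulation the paper optimizes, fits in a few lines. The following is the textbook sketch, not the optimized implementations from Cain; the species names and the decay example are illustrative:

```python
import random

def gillespie_direct(state, reactions, propensity, t_end, seed=0):
    """Gillespie's direct method: sample one exact trajectory.

    state      -- dict of species counts, e.g. {"A": 100} (mutated in place)
    reactions  -- list of stoichiometry dicts applied when a reaction fires
    propensity -- function (state, reaction_index) -> rate
    """
    rng = random.Random(seed)
    t, trajectory = 0.0, [(0.0, dict(state))]
    while t < t_end:
        props = [propensity(state, i) for i in range(len(reactions))]
        total = sum(props)
        if total == 0.0:                       # no reaction can fire
            break
        t += rng.expovariate(total)            # time to next reaction
        r, acc, chosen = rng.random() * total, 0.0, 0
        for i, p in enumerate(props):          # pick reaction with prob p_i/total
            acc += p
            if r < acc:
                chosen = i
                break
        for species, delta in reactions[chosen].items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory

# Illustrative use: first-order decay A -> 0 at rate 0.1 per molecule.
traj = gillespie_direct({"A": 50}, [{"A": -1}],
                        lambda s, i: 0.1 * s["A"], t_end=1000.0)
```

The next reaction method replaces the two random draws per step with a priority queue of tentative firing times; the paper's contribution is reformulating both to improve their complexity.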
  • Estimating Haplotype Frequencies by Combining Data from Large DNA Pools with Database Information

    Page(s): 36 - 44
    PDF (629 KB) | HTML

    We assume that allele frequency data have been extracted from several large DNA pools, each containing genetic material of up to hundreds of sampled individuals. Our goal is to estimate the haplotype frequencies among the sampled individuals by combining the pooled allele frequency data with prior knowledge about the set of possible haplotypes. Such prior information can be obtained, for example, from a database such as HapMap. We present a Bayesian haplotyping method for pooled DNA based on a continuous approximation of the multinomial distribution. The proposed method is applicable when the sizes of the DNA pools and/or the number of considered loci exceed the limits of several earlier methods. In the example analyses, the proposed model clearly outperforms a deterministic greedy algorithm on real data from the HapMap database. With a small number of loci, the performance of the proposed method is similar to that of an EM algorithm, which uses a multinormal approximation for the pooled allele frequencies but does not utilize prior information about the haplotypes. The method has been implemented in MATLAB and the code is available upon request from the authors.

  • F2Dock: Fast Fourier Protein-Protein Docking

    Page(s): 45 - 58
    PDF (5890 KB) | HTML

    The functions of proteins are often realized through their mutual interactions. Determining a relative transformation for a pair of proteins and their conformations which form a stable complex, reproducible in nature, is known as docking. It is an important step in drug design, structure determination, and understanding function and structure relationships. In this paper, we extend our nonuniform fast Fourier transform-based docking algorithm to include an adaptive search phase (both translational and rotational) and thereby speed up its execution. We have also implemented a multithreaded version of the adaptive docking algorithm for even faster execution on multicore machines. We call this protein-protein docking code F2Dock (F2 = Fast Fourier). We have calibrated F2Dock based on an extensive experimental study on a list of benchmark complexes and conclude that F2Dock works very well in practice. Though all docking results reported in this paper use shape complementarity and Coulombic-potential-based scores only, F2Dock is structured to incorporate the Lennard-Jones potential and to rerank docking solutions based on desolvation energy.

  • Fast Surface-Based Travel Depth Estimation Algorithm for Macromolecule Surface Shape Description

    Page(s): 59 - 68
    PDF (2942 KB) | HTML

    Travel Depth, introduced by Coleman and Sharp in 2006, is a physical interpretation of molecular depth, a term frequently used to describe the shape of a molecular active site or binding site. Travel Depth can be seen as the physical distance a solvent molecule would have to travel from a point of the surface, i.e., the Solvent-Excluded Surface (SES), to its convex hull. Existing algorithms providing an estimation of the Travel Depth are based on a regular sampling of the molecule volume and the use of Dijkstra's shortest path algorithm. Since Travel Depth is only defined on the molecular surface, this volume-based approach is characterized by a large computational complexity due to the processing of unnecessary samples lying inside or outside the molecule. In this paper, we propose a surface-based approach that restricts the processing to data defined on the SES. This algorithm significantly reduces the complexity of Travel Depth estimation and makes high-resolution surface shape analysis of large macromolecules possible. Experimental results show that compared to existing methods, the proposed algorithm achieves accurate estimations with considerably reduced processing times.

  • Finding Significant Matches of Position Weight Matrices in Linear Time

    Page(s): 69 - 79
    PDF (1897 KB) | HTML

    Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.

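The brute-force O(mn) scan that the paper's algorithms improve upon is straightforward to state. A hedged sketch, in which the log-odds weights, the DNA-only alphabet, and the threshold are illustrative:

```python
def pwm_matches(pwm, sequence, threshold):
    """Brute-force O(mn) scan: report every position where the
    log-odds score of a position weight matrix meets a threshold.

    pwm: list of m dicts, one per matrix column, mapping base -> weight.
    Returns a list of (position, score) pairs.
    """
    m, hits = len(pwm), []
    for i in range(len(sequence) - m + 1):
        # Score the length-m window starting at i, column by column.
        score = sum(pwm[j][sequence[i + j]] for j in range(m))
        if score >= threshold:
            hits.append((i, score))
    return hits

# Illustrative 2-column matrix that rewards the dinucleotide "AC".
pwm = [{"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
       {"C": 1.0, "A": -1.0, "G": -1.0, "T": -1.0}]
```

The superalphabet and filtration variants described in the abstract avoid rescoring every window from scratch, which is where the factor-of-m speedup comes from.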
  • Fuzzy ARTMAP Prediction of Biological Activities for Potential HIV-1 Protease Inhibitors Using a Small Molecular Data Set

    Page(s): 80 - 93
    PDF (3529 KB) | HTML

    Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques which are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model. The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two algorithms proposed in this paper are: 1) The GA-FAMR algorithm, which is new, consists of two stages: a) in the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data, which improves the generalization capability of the FAMR; b) in the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR is derived from a known algorithm. Instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in prior work. We conclude that when inferring from small training sets, both techniques are efficient in terms of generalization capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.

  • Genetic Networks and Soft Computing

    Page(s): 94 - 107
    PDF (2915 KB) | HTML

    The analysis of gene regulatory networks provides enormous information on various fundamental cellular processes involving growth, development, hormone secretion, and cellular communication. Their extraction from available gene expression profiles is a challenging problem. Such reverse engineering of genetic networks offers insight into cellular activity toward prediction of adverse effects of new drugs or possible identification of new drug targets. Tasks such as classification, clustering, and feature selection enable efficient mining of knowledge about gene interactions in the form of networks. It is known that biological data is prone to different kinds of noise and ambiguity. Soft computing tools, such as fuzzy sets, evolutionary strategies, and neurocomputing, have been found to be helpful in providing low-cost, acceptable solutions in the presence of various types of uncertainties. In this paper, we survey the role of these soft methodologies and their hybridizations, for the purpose of generating genetic networks.

  • Identification and Modeling of Genes with Diurnal Oscillations from Microarray Time Series Data

    Page(s): 108 - 121
    PDF (3272 KB) | HTML

    The behavior of living organisms is strongly modulated by the day and night cycle, giving rise to a cyclic pattern of activities. Such a pattern helps the organisms to coordinate their activities and maintain a balance between what could be performed during the “day” and what could be relegated to the “night.” This cyclic pattern, called the “Circadian Rhythm,” is a biological phenomenon observed in a large number of organisms. In this paper, our goal is to analyze transcriptome data from Cyanothece for the purpose of discovering genes whose expressions are rhythmic. We cluster these genes into groups that are close in terms of their phases and show that genes from a specific metabolic functional category are tightly clustered, indicating perhaps a “preferred time of the day/night” when the organism performs this function. The proposed analysis is applied to two sets of microarray experiments performed under varying incident light patterns. Subsequently, we propose a model with a network of three phase oscillators together with a central master clock, and use it to closely approximate a set of “circadian-controlled genes.”

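Clustering genes by phase presupposes a per-gene phase estimate. One standard way to obtain it from a short time series, not necessarily the authors' procedure, is a least-squares cosinor (harmonic regression) fit; the 24-hour period below is an assumption:

```python
import math

def fit_phase(times, values, period=24.0):
    """Least-squares cosinor fit y ~ a + b*cos(wt) + c*sin(wt);
    returns the acrophase (peak time of the fitted rhythm, in the
    same units as `times`, reduced modulo the period)."""
    w = 2.0 * math.pi / period
    # Design matrix rows [1, cos(wt), sin(wt)]; build the 3x3
    # normal equations X^T X beta = X^T y.
    rows = [[1.0, math.cos(w * t), math.sin(w * t)] for t in times]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, 3))) / A[r][r]
    _, cb, sb = beta
    return (math.atan2(sb, cb) / w) % period  # peak time in [0, period)
```

Genes can then be clustered on the circle of phases, which is the grouping step the abstract describes.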
  • Improving the Computational Efficiency of Recursive Cluster Elimination for Gene Selection

    Page(s): 122 - 129
    PDF (3348 KB) | HTML

    Gene expression data sets usually contain a large number of genes but a relatively small number of samples, which poses new challenges. Selecting informative genes becomes the main issue in microarray data analysis. Recursive cluster elimination based on support vector machines (SVM-RCE) has shown better classification accuracy on some microarray data sets than recursive feature elimination based on support vector machines (SVM-RFE). However, SVM-RCE is extremely time-consuming. In this paper, we propose an improved method of SVM-RCE called ISVM-RCE. ISVM-RCE first trains an SVM model with all clusters, then applies the infinite norm of the weight coefficient vector in each cluster to score the cluster, and finally eliminates the gene clusters with the lowest scores. In addition, ISVM-RCE eliminates genes within the clusters instead of removing a cluster of genes when the number of clusters is small. We have tested ISVM-RCE on six gene expression data sets and compared its performance with SVM-RCE and linear-discriminant-analysis-based RFE (LDA-RFE). The experimental results on these data sets show that ISVM-RCE greatly reduces the time cost of SVM-RCE while obtaining classification performance comparable to that of SVM-RCE, whereas LDA-RFE is not stable.

  • Influence of Prior Knowledge in Constraint-Based Learning of Gene Regulatory Networks

    Page(s): 130 - 142
    PDF (1797 KB) | HTML

    Constraint-based structure learning algorithms generally perform well on sparse graphs. Although sparsity is not uncommon, there are some domains where the underlying graph can have dense regions; one of these domains is gene regulatory networks, which is the main motivation for the study described in this paper. We propose a new constraint-based algorithm that can both increase the quality of output and decrease the computational requirements for learning the structure of gene regulatory networks. The algorithm is based on and extends the PC algorithm. Two different types of information are derived from the prior knowledge: one is the probability of existence of edges, and the other is the nodes that seem to be dependent on a large number of nodes compared to other nodes in the graph. We also propose a new method, based on Gene Ontology, for validating gene regulatory networks. We demonstrate the applicability and effectiveness of the proposed algorithms on both synthetic and real data sets.

  • Information-Theoretic Model of Evolution over Protein Communication Channel

    Page(s): 143 - 151
    PDF (1008 KB) | HTML

    In this paper, we propose a communication model of evolution and investigate its information-theoretic bounds. The process of evolution is modeled as the retransmission of information over a protein communication channel, where the transmitted message is the organism's proteome encoded in the DNA. We compute the capacity and the rate distortion functions of the protein communication system for the three domains of life: Archaea, Bacteria, and Eukaryotes. The tradeoff between the transmission rate and the distortion in noisy protein communication channels is analyzed. As expected, comparison between the optimal transmission rate and the channel capacity indicates that the biological fidelity does not reach the Shannon optimal distortion. However, the relationship between the channel capacity and rate distortion achieved for different biological domains provides tremendous insight into the dynamics of the evolutionary processes of the three domains of life. We rely on these results to provide a model of genome sequence evolution based on the two major evolutionary driving forces: mutations and unequal crossovers.

  • Learning Genetic Regulatory Network Connectivity from Time Series Data

    Page(s): 152 - 165
    PDF (4362 KB) | HTML

    Recent experimental advances facilitate the collection of time series data that indicate which genes in a cell are expressed. This information can be used to understand the genetic regulatory network that generates the data. Typically, Bayesian analysis approaches are applied which neglect the time series nature of the experimental data, have difficulty in determining the direction of causality, and do not perform well on networks with tight feedback. To address these problems, this paper presents a method to learn genetic network connectivity which exploits the time series nature of experimental data to achieve better causal predictions. This method first breaks up the data into bins. Next, it determines an initial set of potential influence vectors for each gene based upon the probability of the gene's expression increasing in the next time step. These vectors are then combined to form new vectors with better scores. Finally, these influence vectors are competed against each other to determine the final influence vector for each gene. The result is a directed graph representation of the genetic network's repression and activation connections. Results are reported for several synthetic networks with tight feedback showing significant improvements in recall and runtime over Yu's dynamic Bayesian approach. Promising preliminary results are also reported for an analysis of experimental data for genes involved in the yeast cell cycle.

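The initial scoring step described above, the probability of a gene's expression increasing in the next time step conditioned on a candidate influence, can be sketched loosely as follows. The binarization threshold and the score definition are illustrative, not the paper's exact formulation:

```python
def influence_score(regulator, target, threshold=0.5):
    """Score a candidate regulator for a target gene from time-series
    expression data: P(target rises at t+1 | regulator high at t)
    minus the unconditional P(target rises). Positive suggests
    activation-like influence, negative suggests repression-like.
    Both inputs are equal-length lists of expression levels."""
    rises = [target[t + 1] > target[t] for t in range(len(target) - 1)]
    high = [regulator[t] > threshold for t in range(len(target) - 1)]
    p_rise = sum(rises) / len(rises)
    conditioned = [r for r, h in zip(rises, high) if h]
    if not conditioned:
        return 0.0  # regulator never high: no evidence either way
    return sum(conditioned) / len(conditioned) - p_rise
```

In the paper, per-gene influence vectors built from such scores are then combined and competed against each other; this sketch covers only the single-pair scoring idea.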
  • Model Reduction Using Piecewise-Linear Approximations Preserves Dynamic Properties of the Carbon Starvation Response in Escherichia coli

    Page(s): 166 - 181
    Multimedia
    PDF (2317 KB)

    The adaptation of the bacterium Escherichia coli to carbon starvation is controlled by a large network of biochemical reactions involving genes, mRNAs, proteins, and signalling molecules. The dynamics of these networks are difficult to analyze, notably due to a lack of quantitative information on parameter values. To overcome these limitations, model reduction approaches based on quasi-steady-state (QSS) and piecewise-linear (PL) approximations have been proposed, resulting in models that are easier to handle mathematically and computationally. These approximations are not supposed to affect the capability of the model to account for essential dynamical properties of the system, but the validity of this assumption has not been systematically tested. In this paper, we carry out such a study by evaluating a large and complex PL model of the carbon starvation response in E. coli using an ensemble approach. The results show that, in comparison with conventional nonlinear models, the PL approximations generally preserve the dynamics of the carbon starvation response network, although with some deviations concerning notably the quantitative precision of the model predictions. This encourages the application of PL models to the qualitative analysis of bacterial regulatory networks, in situations where the reference time scale is that of protein synthesis and degradation.

  • New Methods for Inference of Local Tree Topologies with Recombinant SNP Sequences in Populations

    Page(s): 182 - 193
    PDF (1833 KB) | HTML

    Large amounts of population-scale genetic variation data are being collected. One potentially important biological problem is to infer the population genealogical history from these genetic variation data. Partly due to recombination, the genealogical history of a set of DNA sequences in a population usually cannot be represented by a single tree. Instead, genealogy is better represented by a genealogical network, which is a compact representation of a set of correlated local genealogical trees, each for a short region of the genome and possibly with a different topology. Inference of genealogical history for a set of DNA sequences under recombination has many potential applications, including association mapping of complex diseases. In this paper, we present two new methods for reconstructing local tree topologies in the presence of recombination, which extend and improve previous work. We first show that the “tree scan” method can be converted to a probabilistic inference method based on a hidden Markov model. We then focus on developing a novel local tree inference method called RENT that is both accurate and scalable to larger data. Through simulation, we demonstrate the usefulness of our methods by showing that the hidden-Markov-model-based method is comparable with the original method in terms of accuracy. We also show that RENT is competitive with other methods in terms of inference accuracy, its inference error rate is often lower, and it can handle large data sets.

  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices

    Page(s): 194 - 205
    PDF (1493 KB) | HTML

    Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive estimates of pairwise statistical significance, which is expected to bring more sequence-specific information into the estimation. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62. It is also better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on the same benchmark database, although PSI-BLAST with pretrained PSSMs is significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.

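For intuition, pairwise statistical significance can be approximated by an empirical shuffle test: score the pair, then score one sequence against composition-preserving shuffles of the other. This is the generic null model that extreme-value-distribution fitting, as used in work like this, refines; the toy Smith-Waterman scorer and its parameters below are illustrative:

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,        # gap in b
                    cur[j - 1] + gap)     # gap in a
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def empirical_pvalue(a, b, n_shuffles=200, seed=0):
    """Significance of score(a, b) against a null built by shuffling
    one sequence, preserving its residue composition. Uses the
    standard (exceedances + 1) / (trials + 1) estimator."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    chars = list(b)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if local_score(a, "".join(chars)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_shuffles + 1)
```

Fitting a Gumbel distribution to the shuffled scores, rather than counting exceedances, is what makes very small p-values (like the database-scale ones discussed here) estimable from a modest number of shuffles.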
  • Predicting Metabolic Fluxes Using Gene Expression Differences As Constraints

    Page(s): 206 - 216
    PDF (1585 KB) | HTML

    A standard approach to estimate intracellular fluxes on a genome-wide scale is flux-balance analysis (FBA), which optimizes an objective function subject to constraints on (relations between) fluxes. The performance of FBA models heavily depends on the relevance of the formulated objective function and the completeness of the defined constraints. Previous studies indicated that FBA predictions can be improved by adding regulatory on/off constraints. These constraints were imposed based on either absolute or relative gene expression values. We provide a new algorithm that directly uses regulatory up/down constraints based on gene expression data in FBA optimization (tFBA). Our assumption is that if the activity of a gene drastically changes from one condition to the other, the flux through the reaction controlled by that gene will change accordingly. We allow these constraints to be violated, to account for posttranscriptional control and noise in the data. These up/down constraints are less stringent than the on/off constraints as previously proposed. Nevertheless, we obtain promising predictions, since many up/down constraints can be enforced. The potential of the proposed method, tFBA, is demonstrated through the analysis of fluxes in yeast under nine different cultivation conditions, between which approximately 5,000 regulatory up/down constraints can be defined. We show that changes in gene expression are predictive for changes in fluxes. Additionally, we illustrate that flux distributions obtained with tFBA better fit transcriptomics data than previous methods. Finally, we compare tFBA and FBA predictions to show that our approach yields more biologically relevant results.

  • Probabilistic Analysis of Probe Reliability in Differential Gene Expression Studies with Short Oligonucleotide Arrays

    Page(s): 217 - 225
    Multimedia
    PDF (949 KB) | HTML

    Probe defects are a major source of noise in gene expression studies. While existing approaches detect noisy probes based on external information such as genomic alignments, we introduce and validate a targeted probabilistic method for analyzing probe reliability directly from expression data and independently of the noise source. This provides insights into the various sources of probe-level noise and gives tools to guide probe design.

  • Topology Improves Phylogenetic Motif Functional Site Predictions

    Page(s): 226 - 233
    PDF (1697 KB) | HTML

    Prediction of protein functional sites from sequence-derived data remains an open bioinformatics problem. We have developed a phylogenetic motif (PM) functional site prediction approach that identifies functional sites from alignment fragments that parallel the evolutionary patterns of the family. In our approach, PMs are identified by comparing tree topologies of each alignment fragment to that of the complete phylogeny. Herein, we bypass the phylogenetic reconstruction step and identify PMs directly from distance matrix comparisons. In order to optimize the new algorithm, we consider three different distance matrices and 13 different matrix similarity scores. We assess the performance of the various approaches on a structurally nonredundant data set that includes three types of functional site definitions. Without exception, the predictive power of the original approach outperforms the distance matrix variants. While the distance matrix methods fail to improve upon the original approach, our results are important because they clearly demonstrate that the improved predictive power is based on the topological comparisons, meaning that phylogenetic trees are a straightforward yet powerful way to improve functional site prediction accuracy. While complementary studies have shown that topology improves predictions of protein-protein interactions, this report represents the first demonstration that trees improve functional site predictions as well.

  • Twin Removal in Genetic Algorithms for Protein Structure Prediction Using Low-Resolution Model

    Page(s): 234 - 245
    PDF (2341 KB) | HTML

    This paper presents the impact of twins and the measures for their removal from the population of a genetic algorithm (GA) when applied to effective conformational searching. It is conclusively shown that a twin removal strategy for a GA provides considerably enhanced performance when investigating solutions to complex ab initio protein structure prediction (PSP) problems in a low-resolution model. Without twin removal, GA crossover and mutation operations can become ineffectual as generations lose their ability to produce significant differences, which can lead to the solution stalling. The paper relaxes the definition of chromosomal twins in the removal strategy to encompass not only identical but also highly correlated chromosomes within the GA population, with empirical results consistently exhibiting significant improvements when solving PSP problems.

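The relaxed twin definition, identical or highly correlated chromosomes, suggests a simple removal pass over the population. A sketch in which the 0.9 correlation cutoff is an illustrative assumption, not the paper's tuned value:

```python
def remove_twins(population, max_corr=0.9):
    """Keep only chromosomes that are not 'twins' (identical or highly
    Pearson-correlated) of a chromosome already kept. Chromosomes are
    equal-length numeric lists; constant chromosomes are treated as
    twins of everything (zero variance)."""
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy) if sx and sy else 1.0
    kept = []
    for chromo in population:
        if all(pearson(chromo, k) < max_corr for k in kept):
            kept.append(chromo)
    return kept
```

In a GA loop this pass would run each generation, with removed twins replaced by fresh random chromosomes to restore population diversity.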
  • A Weighted Principal Component Analysis and Its Application to Gene Expression Data

    Page(s): 246 - 252
    PDF (661 KB) | HTML

    In this work, we introduce in the first part new developments in Principal Component Analysis (PCA) and in the second part a new method to select variables (genes in our application). Our focus is on problems where the values taken by each variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate to deal with this kind of problems. In this context, we propose the use of a new correlation coefficient as an alternative to Pearson's. This leads to a so-called weighted PCA (WPCA). In order to illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of analyzing gene expression data sets. In the second part of this work, we propose a new PCA-based algorithm to iteratively select the most important genes in a microarray data set. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, by using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.
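A weighted PCA can be sketched with per-sample weights, a weighted covariance matrix, and power iteration for the leading component. Note this uses the plain weighted covariance, not the authors' robust correlation coefficient, so it illustrates the mechanics rather than their method:

```python
def weighted_top_pc(data, weights, iters=200):
    """First principal component of weighted data.

    data    -- list of n samples, each a list of d features.
    weights -- one nonnegative weight per sample (downweighting
               noisy or outlying observations).
    Builds the weighted covariance matrix, then extracts its leading
    eigenvector by power iteration. Returns a unit-length d-vector.
    """
    n, d = len(data), len(data[0])
    wsum = sum(weights)
    mean = [sum(w * row[i] for w, row in zip(weights, data)) / wsum
            for i in range(d)]
    cov = [[sum(w * (row[i] - mean[i]) * (row[j] - mean[j])
                for w, row in zip(weights, data)) / wsum
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v
```

Swapping the weighted covariance for a robust correlation estimate (as the paper does) changes only the construction of `cov`; the eigenvector extraction is unchanged.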

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu