
IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue 6 • Nov.-Dec. 2012


Displaying Results 1 - 25 of 37
  • [Front inside cover]

    Page(s): c2
    PDF (174 KB)
    Freely Available from IEEE
  • EIC Editorial

    Page(s): 1553 - 1557
    PDF (239 KB)
    Freely Available from IEEE
  • A Characterization of the Set of Species Trees that Produce Anomalous Ranked Gene Trees

    Page(s): 1558 - 1568
    PDF (579 KB) | HTML

    Ranked gene trees, which consider both the gene tree topology and the sequence in which gene lineages separate, can potentially provide a new source of information for use in modeling genealogies and performing inference of species trees. Recently, we have calculated the probability distribution of ranked gene trees under the standard multispecies coalescent model for the evolution of gene lineages along the branches of a fixed species tree, demonstrating the existence of anomalous ranked gene trees (ARGTs), in which a ranked gene tree that does not match the ranked species tree can have greater probability under the model than the matching ranked gene tree. Here, we fully characterize the set of unranked species tree topologies that give rise to ARGTs, showing that this set contains all species tree topologies with five or more taxa, with the exceptions of caterpillars and pseudocaterpillars. The results have implications for the use of ranked gene trees in phylogenetic inference.

  • A Constrained Evolutionary Computation Method for Detecting Controlling Regions of Cortical Networks

    Page(s): 1569 - 1581
    Multimedia
    PDF (1345 KB) | HTML

    Controlling regions in cortical networks, which serve as key nodes to control the dynamics of networks to a desired state, can be detected by minimizing the eigenratio R and the maximum imaginary part σ of an extended connection matrix. Until now, optimal selection of the set of controlling regions has remained an open problem, and this paper represents the first attempt to include two measures of controllability in one unified framework. The detection problem of controlling regions in cortical networks is converted into a constrained optimization problem (COP), where the objective function R is minimized and σ is regarded as a constraint. Then, the detection of controlling regions of a weighted and directed complex network (e.g., a cortical network of a cat) is thoroughly investigated. The controlling regions of cortical networks are successfully detected by means of an improved dynamic hybrid framework (IDyHF). Our experiments verify that the proposed IDyHF outperforms two recently developed evolutionary computation methods in the constrained optimization field as well as some traditional methods in control theory and graph theory. Based on the IDyHF, the controlling regions are detected in a microscopic and macroscopic way. Our results unveil the dependence of controlling regions on the number of driver nodes I and the constraint r. The controlling regions are largely selected from the regions with a large in-degree and a small out-degree. When r = +∞, there exists a concave shape of the mean degrees of the driver nodes, i.e., the regions with a large degree are of great importance to the control of the networks when I is small and the regions with a small degree are helpful to control the networks when I increases. When r = 0, the mean degrees of the driver nodes increase as a function of I. We find that controlling σ becomes more important in controlling a cortical network as I increases. The methods and results of detecting controlling regions in this paper would promote the coordination and information consensus of various kinds of real-world complex networks, including transportation networks, genetic regulatory networks, and social networks.

  • A Fast and Practical Approach to Genotype Phasing and Imputation on a Pedigree with Erroneous and Incomplete Information

    Page(s): 1582 - 1594
    Multimedia
    PDF (1781 KB) | HTML

    The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved its efficiency, its applicability to real data sets has been limited because it does not take into account some important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness (in both accuracy and performance) of the algorithm under several scenarios. The analysis of real data and the comparison with state-of-the-art programs reveal that our approach combines better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors in the model.

  • A Hybrid Cellular Automaton Model of Solid Tumor Growth and Bioreductive Drug Transport

    Page(s): 1595 - 1606
    PDF (2786 KB) | HTML

    Bioreductive drugs are a class of hypoxia-selective drugs designed to eradicate the hypoxic fraction of solid tumors. Their activity depends upon a number of biological and pharmacological factors, and we used a mathematical modeling approach to explore the dynamics of tumor growth, infusion, and penetration of the bioreductive drug Tirapazamine (TPZ). An in silico model is implemented to calculate the tumor mass, considering oxygen and glucose as key microenvironmental parameters. The next stage of the model integrated extracellular matrix (ECM), cell-cell adhesion, and cell movement parameters as growth constraints. The tumor microenvironments strongly influenced tumor morphology and growth rates. Once the growth model was established, a hybrid model was developed to study drug dynamics inside the hypoxic regions of tumors. The model used 10, 50, and 100 μM as initial TPZ concentrations and determined TPZ pharmacokinetic (transport) and pharmacodynamic (cytotoxicity) properties inside hypoxic regions of solid tumors. The model results showed that diminished drug transport is a reason for TPZ failure and suggest optimizing the drug transport properties of emerging TPZ generations. The modeling approach used in this study is novel and can be a step toward exploring the behavioral dynamics of TPZ.

  • A Mathematical Model to Study the Dynamics of Epithelial Cellular Networks

    Page(s): 1607 - 1620
    PDF (3296 KB) | HTML

    Epithelia are sheets of connected cells that are essential across the animal kingdom. Experimental observations suggest that the dynamical behavior of many single-layered epithelial tissues has strong analogies with that of specific mechanical systems, namely large networks consisting of point masses connected through spring-damper elements and undergoing the influence of active and dissipating forces. Based on this analogy, this work develops a modeling framework to enable the study of the mechanical properties and of the dynamic behavior of large epithelial cellular networks. The model is built first by creating a network topology that is extracted from the actual cellular geometry as obtained from experiments, then by associating a mechanical structure and dynamics to the network via spring-damper elements. This scalable approach enables running simulations of large network dynamics: the derived modeling framework in particular is predisposed to be tailored to study general dynamics (for example, morphogenesis) of various classes of single-layered epithelial cellular networks. In this contribution, we test the model on a case study of the dorsal epithelium of the Drosophila melanogaster embryo during early dorsal closure (and, less conspicuously, germband retraction).

  • A Tabu Search Approach for the NMR Protein Structure-Based Assignment Problem

    Page(s): 1621 - 1628
    Multimedia
    PDF (358 KB) | HTML

    Nuclear Magnetic Resonance (NMR) (Abbreviations used: NMR, Nuclear Magnetic Resonance; NOE, Nuclear Overhauser Effect; RDC, Residual Dipolar Coupling; PDB, Protein Data Bank; SBA, Structure-Based Assignments; NVR, Nuclear Vector Replacement; BIP, Binary Integer Programming; TS, Tabu Search; QAP, Quadratic Assignment Problem; ff2, the FF Domain 2 of human transcription elongation factor CA150 (RNA polymerase II C-terminal domain interacting protein); SPG, Streptococcal Protein G; hSRI, Human Set2-Rpb1 Interacting Domain; MBP, Maltose Binding Protein; EIN, Amino Terminal Domain of Enzyme I from Escherichia Coli; EM, expectation maximization) spectroscopy is an experimental technique which exploits the magnetic properties of specific nuclei and enables the study of proteins in solution. The key bottleneck of NMR studies is to map the NMR peaks to the corresponding nuclei, also known as the assignment problem. Structure-Based Assignment (SBA) is an approach to this computationally challenging problem that uses prior information about the protein obtained from a homologous structure. NVR-BIP used the Nuclear Vector Replacement (NVR) framework to model SBA as a binary integer programming problem. In this paper, we prove that this problem is NP-hard and propose a tabu search (TS) algorithm (NVR-TS) equipped with a guided perturbation mechanism to solve it efficiently. NVR-TS uses a quadratic penalty relaxation of NVR-BIP in which violations of the Nuclear Overhauser Effect constraints are penalized in the objective function. Experimental results indicate that our algorithm finds the optimal solution on NVR-BIP's data set, which consists of seven proteins with 25 templates (31 to 126 residues). Furthermore, it achieves relatively high assignment accuracies on two additional large proteins, MBP and EIN (348 and 243 residues, respectively), which NVR-BIP failed to solve. The executable and the input files are available for download at http://people.sabanciuniv.edu/catay/NVR-TS/NVR-TS.html.

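The tabu-search machinery this abstract describes (a move neighborhood, a tabu list, and an aspiration criterion that admits tabu moves improving on the incumbent) follows a standard pattern. The sketch below is purely illustrative and is not the NVR-TS algorithm: it applies that pattern to a toy assignment problem, and every name and parameter in it is the sketch's own.

```python
import itertools
import random

def tabu_search_assignment(cost, iters=200, tenure=5, seed=0):
    """Generic tabu search for a min-cost one-to-one assignment.

    cost[i][j] is the cost of assigning item i to slot j.  Moves are
    pairwise swaps of assigned slots; a recently swapped pair is tabu
    unless the move improves on the best solution found so far
    (the aspiration criterion).
    """
    rng = random.Random(seed)
    n = len(cost)
    perm = list(range(n))              # perm[i] = slot assigned to item i
    rng.shuffle(perm)
    obj = lambda p: sum(cost[i][p[i]] for i in range(n))
    best, best_obj = perm[:], obj(perm)
    tabu = {}                          # (i, j) -> iteration until which the swap is tabu
    for it in range(iters):
        candidates = []
        for i, j in itertools.combinations(range(n), 2):
            perm[i], perm[j] = perm[j], perm[i]   # try the swap
            val = obj(perm)
            perm[i], perm[j] = perm[j], perm[i]   # undo it
            if tabu.get((i, j), -1) < it or val < best_obj:   # aspiration
                candidates.append((val, i, j))
        val, i, j = min(candidates)    # steepest admissible move (may go uphill)
        perm[i], perm[j] = perm[j], perm[i]
        tabu[(i, j)] = it + tenure
        if val < best_obj:
            best, best_obj = perm[:], val
    return best, best_obj
```

With `cost[i][j] = (i - j)**2` the optimum is the identity permutation with cost 0, and the search reaches it because swapping any out-of-order pair strictly lowers the objective.
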
  • An Efficient Alignment Algorithm for Searching Simple Pseudoknots over Long Genomic Sequence

    Page(s): 1629 - 1638
    PDF (865 KB) | HTML

    Structural alignment has been shown to be an effective computational method to identify structural noncoding RNA (ncRNA) candidates, as ncRNAs are known to be conserved in secondary structure. However, the complexity of structural alignment algorithms becomes higher when the structure has pseudoknots. Even for the simplest type of pseudoknot (simple pseudoknots), the fastest algorithm runs in O(mn³) time, where m and n are the length of the query ncRNA (with known structure) and the length of the target sequence (with unknown structure), respectively. In practice, we are usually given a long DNA sequence and we try to locate regions in the sequence that are possible candidates for a particular ncRNA. Thus, we need to run the structural alignment algorithm on every possible region in the long sequence. For example, finding candidates for a known ncRNA of length 100 in a sequence of length 50,000 takes more than one day. In this paper, we provide an efficient algorithm to solve the problem for simple pseudoknots, and it is shown to be 10 times faster. The speedup stems from an effective pruning strategy consisting of the computation of a lower bound score for the optimal alignment and an estimation of the maximum score that a candidate can achieve, to decide whether to prune the current candidate or not.

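The pruning idea in this abstract, keep a lower bound from candidates already scored and skip any window whose optimistic estimate cannot beat it, can be illustrated independently of the RNA-specific algorithm. The toy scorer and bound below are hypothetical stand-ins of my own: the cheap bound (multiset character overlap) provably never underestimates the expensive score (best match count over all shifts), so pruning never changes the answer.

```python
from collections import Counter

def exact_score(query, window):
    """Expensive scorer: best number of matching characters over all
    placements (shifts) of the query inside the window."""
    best = 0
    for s in range(len(window) - len(query) + 1):
        best = max(best, sum(q == w for q, w in zip(query, window[s:])))
    return best

def upper_bound(query, window):
    """Cheap optimistic bound: positional matches can never exceed the
    multiset overlap of characters, so this never underestimates
    exact_score."""
    overlap = Counter(query) & Counter(window)
    return sum(overlap.values())

def scan(text, query, win):
    """Slide a window over the text, running the expensive scorer only
    when the cheap bound can still beat the best score found so far."""
    best, best_pos, pruned = -1, -1, 0
    for i in range(len(text) - win + 1):
        w = text[i:i + win]
        if upper_bound(query, w) <= best:
            pruned += 1        # this candidate cannot improve: skip it
            continue
        s = exact_score(query, w)
        if s > best:
            best, best_pos = s, i
    return best, best_pos, pruned
```

On a DNA-like toy string the pruned scan returns exactly the same best window as an exhaustive scan while skipping several candidates.
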
  • Automatic Identification and Classification of Noun Argument Structures in Biomedical Literature

    Page(s): 1639 - 1648
    PDF (904 KB) | HTML

    The accelerating increase in the biomedical literature makes keeping up with recent advances challenging for researchers, making automatic extraction and discovery of knowledge from this vast literature a necessity. Building such systems requires automatic detection of lexico-semantic event structures governed by the syntactic and semantic constraints of human languages in sentences of biomedical texts. The lexico-semantic event structures in sentences are centered around predicates, and most semantic role labeling (SRL) approaches focus only on the arguments of verb predicates and neglect argument-taking nouns, which also convey information in a sentence. In this article, a noun argument structure (NAS) annotated corpus named BioNom and an SRL system to identify and classify these structures are introduced. Also, a genetic algorithm-based feature selection (GAFS) method is introduced, and global inference is applied to significantly improve the performance of the NAS Bio SRL system.

  • Biomarker Identification and Cancer Classification Based on Microarray Data Using Laplace Naive Bayes Model with Mean Shrinkage

    Page(s): 1649 - 1662
    Multimedia
    PDF (2148 KB) | HTML

    Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution instead of the normal distribution is used as the conditional distribution of the samples, because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, whose optimal value can be evaluated simply at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of the biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.

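The core modeling choice here, replacing the Gaussian class-conditional density of naive Bayes with a Laplace density, is easy to sketch. The snippet below is a minimal illustration under the standard Laplace maximum-likelihood estimates (median for location, mean absolute deviation for scale); it deliberately omits the paper's L1 mean-shrinkage step, and all function names are the sketch's own.

```python
import math
from statistics import median

def fit_laplace_nb(X, y):
    """Fit per-class, per-feature Laplace densities.

    For each class c and feature j, the location mu is the sample median
    (the maximum-likelihood estimate for a Laplace distribution) and the
    scale b is the mean absolute deviation from it.
    """
    params = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        feats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = median(col)
            b = max(sum(abs(v - mu) for v in col) / len(col), 1e-6)
            feats.append((mu, b))
        params[c] = (log_prior, feats)
    return params

def predict_laplace_nb(params, x):
    """Return the class maximizing log prior plus the sum of Laplace
    log-densities -log(2b) - |v - mu| / b over the features."""
    def score(c):
        log_prior, feats = params[c]
        return log_prior + sum(-math.log(2 * b) - abs(v - mu) / b
                               for v, (mu, b) in zip(x, feats))
    return max(params, key=score)
```

The median/MAD estimates are what make the Laplace variant less sensitive to outliers than its Gaussian counterpart, whose mean and variance can be dragged far by a single extreme value.
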
  • Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies

    Page(s): 1663 - 1675
    Multimedia
    PDF (1177 KB) | HTML

    In many bioinformatics applications, it is important to assess and compare the performances of algorithms trained from data, to be able to draw conclusions that are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification, the basics of experiment design, and statistical tests. We then give the results of our survey of over 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design are well understood, such as resampling instead of using a single training set and the use of different performance metrics instead of error, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios that we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology as well as showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.

  • Distinguishing Endogenous Retroviral LTRs from SINE Elements Using Features Extracted from Evolved Side Effect Machines

    Page(s): 1676 - 1689
    PDF (1232 KB) | HTML

    Side effect machines produce features for classifiers that distinguish different types of DNA sequences. They have the, as yet unexploited, potential to give insight into biological features of the sequences. We introduce several innovations to the production and use of side effect machine sequence features. We compare the results of using consensus sequences and genomic sequences for training classifiers and find that more accurate results can be obtained using genomic sequences. Surprisingly, we were even able to build a classifier that distinguished consensus sequences from genomic sequences with high accuracy, suggesting that consensus sequences are not always representative of their genomic counterparts. We apply our techniques to the problem of distinguishing two types of transposable elements, solo LTRs and SINEs. Identifying these sequences is important because they affect gene expression, genome structure, and genetic diversity, and they serve as genetic markers. They are of similar length, neither codes for protein, and both have many nearly identical copies throughout the genome. Being able to efficiently and automatically distinguish them will aid efforts to improve annotations of genomes. Our approach reveals structural characteristics of the sequences of potential interest to biologists.

  • Improving Protein-Protein Interaction Pair Ranking with an Integrated Global Association Score

    Page(s): 1690 - 1695
    PDF (739 KB) | HTML

    Protein-protein interaction (PPI) database curation requires text-mining systems that can recognize and normalize interactor genes and return a ranked list of PPI pairs for each article. The order of PPI pairs in this list is essential for ease of curation. Most current PPI pair ranking approaches rely on association analysis between the two genes in the pair. However, we propose that ranking an extracted PPI pair by considering both the association between the paired genes and each of those genes' global associations with all other genes mentioned in the paper can provide a more reliable ranked list. In this work, we present a composite interaction score that considers not only the association score between two interactors (pair association score) but also their global association scores. We test three representative data fusion algorithms to estimate this global association score: two Borda-Fuse models and one linear combination model (LCM). The three estimation methods are evaluated on the data set of the BioCreative II.5 Interaction Pair Task (IPT) in terms of area under the interpolated precision/recall curve (AUC iP/R). Our experimental results indicate that using LCM to estimate the global association score can boost the AUC iP/R score from 0.0175 to 0.2396, outperforming the best BioCreative II.5 IPT system.

  • Large-Scale Signaling Network Reconstruction

    Page(s): 1696 - 1708
    Multimedia
    PDF (1036 KB) | HTML

    Reconstructing the topology of a signaling network by means of RNA interference (RNAi) technology is an underdetermined problem especially when a single gene in the network is knocked down or observed. In addition, the exponential search space limits the existing methods to small signaling networks of size 10-15 genes. In this paper, we propose integrating RNAi data with a reference physical interaction network. We formulate the problem of signaling network reconstruction as finding the minimum number of edit operations on a given reference network. The edit operations transform the reference network to a network that satisfies the RNAi observations. We show that using a reference network does not simplify the computational complexity of the problem. Therefore, we propose two methods which provide near optimal results and can scale well for reconstructing networks up to hundreds of components. We validate the proposed methods on synthetic and real data sets. Comparison with the state of the art on real signaling networks shows that the proposed methodology can scale better and generates biologically significant results.

  • Multiparameter Spectral Representation of Noise-Induced Competence in Bacillus Subtilis

    Page(s): 1709 - 1723
    PDF (2826 KB) | HTML

    In this work, the problem of representing a stochastic forward model output with respect to a large number of input parameters is considered. The methodology is applied to a stochastic reaction network of competence dynamics in Bacillus subtilis bacterium. In particular, the dependence of the competence state on rate constants of underlying reactions is investigated. We base our methodology on Polynomial Chaos (PC) spectral expansions that allow effective propagation of input parameter uncertainties to outputs of interest. Given a number of forward model training runs at sampled input parameter values, the PC modes are estimated using a Bayesian framework. As an outcome, these PC modes are described with posterior probability distributions. The resulting expansion can be regarded as an uncertain response function and can further be used as a computationally inexpensive surrogate instead of the original reaction model for subsequent analyses such as calibration or optimization studies. Furthermore, the methodology is enhanced with a classification-based mixture PC formulation that overcomes the difficulties associated with representing potentially nonsmooth input-output relationships. Finally, the global sensitivity analysis based on the multiparameter spectral representation of an observable of interest provides biological insight and reveals the most important reactions and their couplings for the competence dynamics.

  • On the Steady-State Distribution in the Homogeneous Ribosome Flow Model

    Page(s): 1724 - 1736
    PDF (355 KB) | HTML

    A central biological process in all living organisms is gene translation. Developing a deeper understanding of this complex process may have ramifications for almost every biomedical discipline. Reuveni et al. recently proposed a new computational model of gene translation called the Ribosome Flow Model (RFM). In this paper, we consider a particular case of this model, called the Homogeneous Ribosome Flow Model (HRFM). From a biological viewpoint, this corresponds to the case where the transition rates of all the coding sequence codons are identical. This regime has been suggested recently based on experiments in mouse embryonic cells. We consider the steady-state distribution of the HRFM. We provide formulas that relate the different parameters of the model in steady state. We prove the following properties: 1) the ribosomal density profile is monotonically decreasing along the coding sequence; 2) the ribosomal density at each codon monotonically increases with the initiation rate; and 3) for a constant initiation rate, the translation rate monotonically decreases with the length of the coding sequence. In addition, we analyze the translation rate of the HRFM in the limits of very high and very low initiation rate, and provide explicit formulas for the translation rate in these two cases. We discuss the relationship between these theoretical results and biological findings on the translation process.

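Properties 1 and 2 of this abstract lend themselves to a quick numerical check. The sketch below assumes the standard RFM equations from Reuveni et al. (a chain of sites whose flow depends on downstream occupancy), specialized to identical elongation rates; the rate values, site count, and integration settings are illustrative, not taken from the paper.

```python
def hrfm_steady_state(n=8, lam0=0.9, lam=1.0, dt=0.01, steps=100000, tol=1e-12):
    """Integrate the homogeneous Ribosome Flow Model to steady state.

    x[i] is the ribosome density of site i (0 <= x[i] <= 1), lam0 the
    initiation rate, and lam the common elongation rate of all sites
    (the homogeneous case).  Assumed RFM dynamics:
        dx_0/dt = lam0*(1 - x_0)        - lam*x_0*(1 - x_1)
        dx_i/dt = lam*x_{i-1}*(1 - x_i) - lam*x_i*(1 - x_{i+1})
        dx_n/dt = lam*x_{n-1}*(1 - x_n) - lam*x_n
    integrated with a simple forward-Euler scheme.
    """
    x = [0.0] * n
    for _ in range(steps):
        dx = []
        for i in range(n):
            inflow = lam0 * (1 - x[0]) if i == 0 else lam * x[i - 1] * (1 - x[i])
            outflow = lam * x[i] if i == n - 1 else lam * x[i] * (1 - x[i + 1])
            dx.append(inflow - outflow)
        x = [xi + dt * di for xi, di in zip(x, dx)]
        if max(abs(d) for d in dx) < tol:   # effectively at steady state
            break
    return x  # steady-state density profile
```

At steady state, `lam * x[-1]` is the translation rate, and the density profile should decrease monotonically along the chain (property 1) and rise site-by-site when the initiation rate is increased (property 2).
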
  • Probabilistic Arithmetic Automata and Their Applications

    Page(s): 1737 - 1750
    Multimedia
    PDF (405 KB) | HTML

    We present a comprehensive review on probabilistic arithmetic automata (PAAs), a general model to describe chains of operations whose operands depend on chance, along with two algorithms to numerically compute the distribution of the results of such probabilistic calculations. PAAs provide a unifying framework to approach many problems arising in computational biology and elsewhere. We present five different applications, namely 1) pattern matching statistics on random texts, including the computation of the distribution of occurrence counts, waiting times, and clump sizes under hidden Markov background models; 2) exact analysis of window-based pattern matching algorithms; 3) sensitivity of filtration seeds used to detect candidate sequence alignments; 4) length and mass statistics of peptide fragments resulting from enzymatic cleavage reactions; and 5) read length statistics of 454 and IonTorrent sequencing reads. The diversity of these applications indicates the flexibility and unifying character of the presented framework. While the construction of a PAA depends on the particular application, we single out a frequently applicable construction method: We introduce deterministic arithmetic automata (DAAs) to model deterministic calculations on sequences, and demonstrate how to construct a PAA from a given DAA and a finite-memory random text model. This procedure is used for all five discussed applications and greatly simplifies the construction of PAAs. Implementations are available as part of the MoSDi package. Its application programming interface facilitates the rapid development of new applications based on the PAA framework.

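Application 1 of this review (pattern matching statistics on random texts) admits a compact illustration of the PAA idea: pair a deterministic automaton state with an accumulated value and push probability mass through both. The sketch below, with hypothetical function names of my own, computes the exact distribution of overlapping occurrence counts of a short pattern in an iid random text; the paper's framework is far more general (hidden Markov text models, waiting times, clump sizes).

```python
from collections import defaultdict

def kmp_automaton(pattern, alphabet):
    """Deterministic automaton whose state is the length of the longest
    pattern prefix matching a suffix of the text read so far."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            if q < m and a == pattern[q]:
                delta[q][a] = q + 1
            else:
                # fall back to the longest pattern prefix that is a
                # suffix of (matched prefix + a); brute force for clarity
                s = pattern[:q] + a
                k = min(len(s), m)
                while k > 0 and s[-k:] != pattern[:k]:
                    k -= 1
                delta[q][a] = k
    return delta

def count_distribution(pattern, probs, n):
    """Exact distribution of the number of (possibly overlapping)
    occurrences of `pattern` in an iid random text of length n, where
    probs maps each letter to its probability.

    A miniature 'probabilistic arithmetic automaton': each automaton
    state carries an accumulated count, and probability mass flows
    through (state, count) pairs.
    """
    m = len(pattern)
    delta = kmp_automaton(pattern, list(probs))
    dist = defaultdict(float)
    dist[(0, 0)] = 1.0                 # (automaton state, count) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (q, c), p in dist.items():
            for a, pa in probs.items():
                q2 = delta[q][a]
                nxt[(q2, c + (q2 == m))] += p * pa
        dist = nxt
    out = defaultdict(float)
    for (q, c), p in dist.items():     # marginalize out the automaton state
        out[c] += p
    return dict(out)
```

For short texts the result can be checked against brute-force enumeration of all possible strings, which is exactly what makes this construction a convenient sanity check for more elaborate PAAs.
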
  • SC³: Triple Spectral Clustering-Based Consensus Clustering Framework for Class Discovery from Cancer Gene Expression Profiles

    Page(s): 1751 - 1765
    PDF (1306 KB) | HTML

    In order to perform successful diagnosis and treatment of cancer, discovering and classifying cancer types correctly is essential. One of the challenging properties of class discovery from cancer data sets is that cancer gene expression profiles not only include a large number of genes but also contain many noisy genes. In order to reduce the effect of noisy genes in cancer gene expression profiles, we propose two new consensus clustering frameworks in this paper, named triple spectral clustering-based consensus clustering (SC³) and double spectral clustering-based consensus clustering (SC² Ncut), for cancer discovery from gene expression profiles. SC³ integrates the spectral clustering (SC) algorithm multiple times into the ensemble framework to process gene expression profiles. Specifically, spectral clustering is applied to perform clustering on the gene dimension and the cancer sample dimension, and is also used as the consensus function to partition the consensus matrix constructed from multiple clustering solutions. Compared with SC³, SC² Ncut adopts the normalized cut algorithm, instead of spectral clustering, as the consensus function. Experiments on both synthetic data sets and real cancer gene expression profiles illustrate that the proposed approaches not only achieve good performance on gene expression profiles but also outperform most of the existing approaches in the process of class discovery from these profiles.

  • Sequence-Based Prediction of DNA-Binding Residues in Proteins with Conservation and Correlation Information

    Page(s): 1766 - 1775
    Multimedia
    PDF (626 KB) | HTML

    The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions, gene expression, and for guiding drug design. Therefore, a prediction method DNABR (DNA Binding Residues) is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids while the second reflects the dependency effect of amino acids with regards to polarity-charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflect the characteristics of 20 types of amino acids are used to build the DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthew's correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with a 68.47 percent sensitivity (SE) and 98.16 percent specificity (SP), respectively. The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding protein. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/. View full abstract»

  • Symmetry Compression Method for Discovering Network Motifs

    Page(s): 1776 - 1789

    Discovering network motifs can provide significant insight into systems biology. Interestingly, many biological networks have been found to exhibit a high degree of symmetry (automorphism), which is inherent in their topologies. Because of the large number of basic symmetric subgraphs (BSSs), this symmetry causes redundant calculation in motif discovery. We therefore compress all basic symmetric subgraphs before extracting compressed subgraphs, and propose an efficient decompression algorithm that recovers all compressed subgraphs without loss of information. In contrast to previous approaches, the novel Symmetry Compression method for Motif Detection, named SCMD, eliminates most of the redundant calculation caused by the widespread symmetry of biological networks. We use SCMD to improve three notable exact algorithms and two efficient sampling algorithms. Because SCMD is lossless, all exact algorithms with SCMD produce the same results as the original algorithms, and the sampling results show that SCMD has almost no effect on sampling quality. For highly symmetric networks, SCMD yields a remarkable speedup in both exact and sampling algorithms. Furthermore, SCMD enables us to find larger motifs in notably symmetric biological networks than previously possible.
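
    The symmetry being exploited is graph automorphism: node permutations that map the edge set onto itself. A toy brute-force counter (fine only for very small graphs; practical tools use nauty-style refinement, and SCMD's compression scheme is not reproduced here):

```python
# Illustrative brute-force count of graph automorphisms, the kind of
# symmetry SCMD compresses away. Exponential in node count; toy use only.
from itertools import permutations

def automorphism_count(nodes, edges):
    """Number of node permutations mapping the undirected edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges}
        if mapped == edge_set:
            count += 1
    return count
```

    A star with three leaves has 3! = 6 automorphisms (the leaves are interchangeable), which is exactly the redundancy a symmetry-aware method avoids re-enumerating.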

  • Top-k Similar Graph Matching Using TraM in Biological Networks

    Page(s): 1790 - 1804

    Many emerging database applications entail sophisticated graph-based query manipulation, most evident in large-scale scientific applications. To access the information embedded in graphs, efficient graph matching tools and algorithms have become of prime importance. While the prohibitively expensive time complexity of exact subgraph isomorphism has limited its practical use, approximate yet efficient graph matching techniques have received much attention for their pragmatic applicability. Since public domain databases are noisy and incomplete, inexact graph matching has proven more promising for inferring knowledge from large structural data repositories. In this paper, we propose TraM, a novel technique for approximate graph matching that off-loads a significant amount of its processing onto the database, making the approach viable for large graphs. Moreover, vector space embedding of the graphs and efficient filtration of the search space enable computation of approximate graph similarity at very low cost. We annotate the nodes of the query graph with their global topological properties and compare them with neighborhood-biased segments of the data graph to find proper matches. Experiments on several real data sets demonstrate the effectiveness and efficiency of the proposed method.
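
    The node-annotation idea can be sketched with simple topological features; here each node is described by its degree and mean neighbor degree, and query nodes are matched to the closest data-graph node. The feature choice and distance are illustrative stand-ins, not TraM's actual embedding or database off-loading:

```python
# Toy node-signature matching: annotate nodes with topological features
# and pair query nodes with their nearest data-graph counterparts.
import math

def node_signature(adj, v):
    """(degree, mean neighbor degree) for node v in adjacency dict adj."""
    deg = len(adj[v])
    nbr = sum(len(adj[u]) for u in adj[v]) / deg if deg else 0.0
    return (deg, nbr)

def best_match(query_adj, qv, data_adj):
    """Data-graph node whose signature is closest to query node qv."""
    qs = node_signature(query_adj, qv)
    return min(data_adj, key=lambda v: math.dist(qs, node_signature(data_adj, v)))
```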

  • uAnalyze: Web-Based High-Resolution DNA Melting Analysis with Comparison to Thermodynamic Predictions

    Page(s): 1805 - 1811

    uAnalyze is a web-based tool for analyzing high-resolution melting data of PCR products. The user inputs the PCR product sequence, and recursive nearest-neighbor thermodynamic calculations are used to predict a melting curve, similar to uMELT (http://www.dna.utah.edu/umelt/umelt.html). Unprocessed melting data are input directly from LightScanner-96, LS32, or HR-1 data files, or via a generic format for other instruments. A fluorescence discriminator identifies low-intensity samples to prevent analysis of data that cannot be adequately normalized. The temperature regions that define fluorescence background are initialized by prediction and can optionally be adjusted by the user. Background is removed either as an exponential or by linear baseline extrapolation. The precision, or "curve spread," of experimental melting curves is quantified as the average of the maximum helicity difference over all curve pairs. Melting curve accuracy is quantified as the area, or "2D offset," between the average experimental and predicted melting curves. Optional temperature overlay (temperature shifting) is provided to focus on curve shape. Using 14 amplicons of CYBB, the mean ± standard deviation of the difference between experimental and predicted fluorescence at 50 percent helicity was -0.04 ± 0.48°C. uAnalyze requires Flash, is not browser specific, and can be accessed at http://www.dna.utah.edu/uv/uanalyze.html.
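
    As background, nearest-neighbor thermodynamics sums stacking parameters over adjacent base pairs. A heavily simplified two-state Tm sketch is below; the dinucleotide parameters are the published SantaLucia (1998) unified values, but the fixed initiation terms and the formula are illustrative only. uAnalyze's recursive calculation predicts entire melting curves, not a single Tm:

```python
# Simplified nearest-neighbor melting-temperature estimate for a
# non-self-complementary duplex. Illustrative only.
import math

NN = {  # 5'->3' dinucleotide: (dH kcal/mol, dS cal/mol/K), SantaLucia 1998
    "AA": (-7.9, -22.2), "AT": (-7.2, -20.4), "TA": (-7.2, -21.3),
    "CA": (-8.5, -22.7), "GT": (-8.4, -22.4), "CT": (-7.8, -21.0),
    "GA": (-8.2, -22.2), "CG": (-10.6, -27.2), "GC": (-9.8, -24.4),
    "GG": (-8.0, -19.9),
}
COMP = str.maketrans("ACGT", "TGCA")

def predict_tm(seq, ct=0.25e-6):
    """Two-state Tm (deg C) at total strand concentration ct (mol/L)."""
    dh, ds = 0.2, -5.6  # fixed duplex-initiation terms (illustrative)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair not in NN:  # look up via the reverse complement
            pair = pair.translate(COMP)[::-1]
        h, s = NN[pair]
        dh += h
        ds += s
    # R = 1.987 cal/(K*mol); dH in kcal/mol hence the factor 1000
    return dh * 1000.0 / (ds + 1.987 * math.log(ct / 4.0)) - 273.15
```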

  • A Fast Ranking Algorithm for Predicting Gene Functions in Biomolecular Networks

    Page(s): 1812 - 1818

    Ranking genes in functional networks according to a specific biological function is a challenging task that raises both performance and computational complexity problems. To cope with both, we developed a transductive gene ranking method based on kernelized score functions that fully exploits the topology and graph structure of biomolecular networks and captures significant functional relationships between genes. We ran the method on a network constructed by integrating multiple biomolecular data sources in the yeast model organism, achieving significantly better results than the compared state-of-the-art network-based algorithms for gene function prediction, with substantial savings in computation time. The proposed approach is general and fast enough to be applied, in principle, to other relevant node ranking problems in large and complex biological networks.
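
    The general shape of such methods: build a kernel (similarity) matrix over network nodes, score each node by its aggregate kernel similarity to the genes already annotated with the function, and rank. The two-step adjacency kernel and averaging rule below are illustrative stand-ins for the kernels and score functions of the paper:

```python
# Minimal "average score" ranking with a toy graph kernel: nodes are
# 0..n-1, adj[i] lists the neighbors of i in an undirected network.
def two_step_kernel(adj, n):
    """K[i][j] = number of paths of length 1 or 2 from i to j."""
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            K[i][j] += 1.0
            for k in adj[j]:
                K[i][k] += 1.0
    return K

def rank_nodes(adj, positives, n):
    """Rank all nodes by mean kernel similarity to the positive set."""
    K = two_step_kernel(adj, n)
    score = [sum(K[i][j] for j in positives) / len(positives) for i in range(n)]
    return sorted(range(n), key=lambda i: -score[i])
```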

  • Fuzzy Intervention in Biological Phenomena

    Page(s): 1819 - 1825

    An important objective of modeling biological phenomena is to develop therapeutic intervention strategies that move an undesirable state of a diseased network toward a more desirable one. Such transitions can be achieved by using drugs to act on genes or metabolites that affect the undesirable behavior. Because biological phenomena are complex processes with nonlinear dynamics that no mathematical model can represent perfectly, there is often a need for model-free nonlinear intervention strategies capable of guiding the target variables to their desired values. In many applications, fuzzy systems have been found very useful for parameter estimation, model development, and control design of nonlinear processes. In this paper, a model-free fuzzy intervention strategy, one that does not require a mathematical model of the biological phenomenon, is proposed to guide the target variables of biological systems to their desired values. The proposed strategy is applied to three biological models: a glycolytic-glycogenolytic pathway model, a purine metabolism pathway model, and a generic pathway model. The simulation results for all three models demonstrate the effectiveness of the proposed scheme.
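
    A bare-bones model-free fuzzy step in this spirit: fuzzify the error between a target variable and its desired value with triangular membership functions, fire three rules, and defuzzify to a control adjustment. The membership shapes, rule outputs, and Sugeno-style weighted average are illustrative choices, not the controller of the paper:

```python
# Toy one-input fuzzy rule base: error = desired - current value.
def tri(x, a, b, c):
    """Triangular membership function, zero outside (a, c), peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_adjustment(error, span=1.0):
    """Control adjustment from 3 rules via weighted-average defuzzification."""
    mu_neg = tri(error, -2 * span, -span, 0.0)   # "error negative"
    mu_zero = tri(error, -span, 0.0, span)       # "error about zero"
    mu_pos = tri(error, 0.0, span, 2 * span)     # "error positive"
    # rules: push the variable up when below target, down when above
    num = mu_pos * span - mu_neg * span
    den = mu_neg + mu_zero + mu_pos
    return num / den if den else 0.0  # saturates to 0 far outside range
```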


Aims & Scope

This bimonthly journal publishes archival research results on the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu