Bioinformatics and Bioengineering, 2005. BIBE 2005. Fifth IEEE Symposium on

Date 19-21 Oct. 2005

  • Fifth IEEE Symposium on Bioinformatics and Bioengineering - Cover

    Page(s): c1
  • Fifth IEEE Symposium on Bioinformatics and Bioengineering - Title Page

    Page(s): i - iii
  • Fifth IEEE Symposium on Bioinformatics and Bioengineering - Copyright Page

    Page(s): iv
  • Fifth IEEE Symposium on Bioinformatics and Bioengineering - Table of contents

    Page(s): v - viii
  • Chairs Forward

    Page(s): ix
  • Committees

    Page(s): x - xi
  • A model-free and stable gene selection in microarray data analysis

    Page(s): 3 - 10

    Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Detecting the most significantly differentially expressed genes under different conditions, or gene selection, has been a central focus for researchers. The gene selection problem becomes more difficult when the numbers of samples under different conditions vary significantly, or are unbalanced. A novel model-free and stable gene selection method is proposed in this paper, i.e., the method does not assume any statistical model on the gene expression data and it is not affected by the unbalanced samples. The method has been evaluated on two publicly available datasets, the leukemia dataset and the small round blue cell tumor dataset, where the experimental results showed that the proposed method is efficient and robust in identifying differentially expressed genes.
  • An efficient algorithm for the extended (l,d)-motif problem with unknown number of binding sites

    Page(s): 11 - 18

    Finding common patterns, or motifs, from a set of DNA sequences is an important problem in molecular biology. Most motif-discovering algorithms/software require the length of the motif as input. Motivated by the fact that the motif's length is usually unknown in practice, Styczynski et al. introduced the extended (l,d)-motif problem (EMP), where the motif's length is not an input parameter. Unfortunately, the algorithm given by Styczynski et al. to solve EMP can take an unacceptably long time to run, e.g. over 3 months to discover a length-14 motif. This paper makes two main contributions. First, we eliminate another input parameter from EMP: the minimum number of binding sites in the DNA sequences. Fewer input parameters not only reduce the burden on the user, but may also give more realistic/robust results, since restrictions on length or on the number of binding sites make little sense when the best motif may not be the longest or have the largest number of binding sites. Second, we develop an efficient algorithm to solve our redefined problem. The algorithm is also a fast solution for EMP (without any sacrifice in accuracy), making EMP practical.
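
    As a rough illustration of the terms in the abstract above (not the paper's algorithm), an (l,d)-motif is commonly understood as a length-l pattern that occurs in the input sequences with at most d mismatches. The sketch below only checks a given candidate motif against a set of sequences; the function names and the use of Hamming distance are assumptions made for this example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurrences(motif: str, sequence: str, d: int) -> list[int]:
    """Start positions where `motif` matches `sequence` with at most d mismatches."""
    l = len(motif)
    return [i for i in range(len(sequence) - l + 1)
            if hamming(motif, sequence[i:i + l]) <= d]


def count_binding_sequences(motif: str, sequences: list[str], d: int) -> int:
    """Number of sequences that contain at least one approximate occurrence."""
    return sum(1 for s in sequences if occurrences(motif, s, d))
```

    A motif finder would search over candidate motifs (and, in the extended problem, over lengths) rather than test a single one; this helper only shows what counts as a binding site under the mismatch bound.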
  • Improved phylogenetic motif detection using parsimony

    Page(s): 19 - 26

    We have recently demonstrated (La et al., Proteins, 58:2005) that sequence fragments approximating the overall familial phylogeny, called phylogenetic motifs (PMs), represent a promising protein functional site prediction strategy. Previous results across a structurally and functionally diverse dataset indicate that phylogenetic motifs correspond to a wide variety of known functional characteristics. Phylogenetic motifs are detected using a sliding window algorithm that compares neighbor joining trees on the complete alignment to those on the sequence fragments. In this investigation we identify PMs using heuristic maximum parsimony trees. We show that when using parsimony the functional site prediction accuracy of PMs improves substantially, particularly on divergent datasets. We also show that the new PMs found using parsimony are not necessarily conserved in sequence, and, therefore, would not be detected by traditional motif (information content-based) approaches.
  • Highly scalable and accurate seeds for subsequence alignment

    Page(s): 27 - 31

    We propose a method for finding seeds for the local alignment of two nucleotide sequences. Our method uses randomized algorithms to find approximate seeds. We present a dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into the process of seed generation. Experimental results show that our method produces better quality seeds with improved running time and memory usage compared to traditional non-spaced and spaced seeds. The presented algorithm scales very well with higher seed lengths while maintaining the quality and performance.
  • A novel quartet-based method for phylogenetic inference

    Page(s): 32 - 39

    In this paper we introduce a new quartet-based method. This method makes use of the Bayes (or quartet) weights of quartets, like those used in quartet puzzling. However, all the weights from the related quartets are accumulated to form a global quartet weight matrix. This matrix provides integrated information and can lead us to recursively merge small sub-trees into larger ones until the final single tree is obtained. The experimental results show that the probability for the correct tree to be among a very small number of trees constructed using our method is very high. These significant results open a new research direction to further investigate more efficient algorithms for phylogenetic inference.
  • GOMIT: a generic and adaptive annotation algorithm based on gene ontology term distributions

    Page(s): 40 - 48

    We address the issue of providing highly informative annotations using information revealed by the structured vocabularies of gene ontology (GO). For a target, a set of candidate terms used to infer the target's property is collected and forms a unique distribution on the GO directed acyclic graph (DAG). We propose a generic and adaptive algorithm, GOMIT, which is based on term distributions and GO hierarchical characteristics, to assign correct annotations for a target. We establish a quantitative model with parameters that can be framed for optimal performance for different applications. We propose several criteria for evaluating GOMIT's performance, and conduct three experiments involving a) automated functional annotations, b) biological annotations of microarray data clusters and c) protein family GO assignments. In these experiments, we use our proposed criteria to compare GOMIT with other algorithms. Results not only reflect GOMIT's generality and adaptability, but also suggest that GOMIT is better than or comparable to other works for assigning correct annotations.
  • Classification of biomedical data through model-based spatial averaging

    Page(s): 49 - 56

    Ensemble learning is frequently used to reduce classification error. The more popular techniques draw multiple samples from the training data and employ a voting procedure to aggregate the decisions of the classifiers constructed from those samples. In practice, such ensemble methods have been shown to work well and improve accuracy. Here we present a meta-learning strategy that combines the decisions of classifiers constructed from spatial models taken at multiple resolutions. By varying the resolution from coarse to fine-grained, we are able to partition the data on global features that describe a majority of the objects, as well as small, local features that are present in just a few problem cases. We test our technique on a biomedical dataset containing surface elevation values for diseased and nondiseased corneas. We transform these elevations into a series of coefficients using two different spatial transformations. Using these coefficients, we determine how well they distinguish between the two classes. We find our algorithm can increase the classification accuracy of a single decision tree by up to 10% and can also be used in conjunction with traditional meta-learning techniques such as bagging to further improve performance. In an attempt to improve the execution time of the transformation algorithms, we have developed a distributed, grid-based implementation as well.
  • A multi-level approach to SCOP fold recognition

    Page(s): 57 - 64

    The classification of proteins based on their structure can play an important role in the deduction or discovery of protein function. However, the relatively low number of solved protein structures and the unknown relationship between structure and sequence require an alternative method of representation for classification to be effective. Furthermore, the large number of potential folds causes problems for many classification strategies, increasing the likelihood that the classifier will reach a local optimum while trying to distinguish between all of the possible structural categories. Here we present a hierarchical strategy for structural classification that first partitions proteins based on their SCOP class before attempting to assign a protein fold. Using a well-known dataset derived from the 27 most-populated SCOP folds and several sequence-based descriptor properties as input features, we test a number of classification methods, including Naive Bayes and Boosted C4.5. Our strategy achieves an average fold recognition of 74%, which is significantly higher than the 56-60% previously reported in the literature, indicating the effectiveness of a multi-level approach.
  • A wrapper induction application with knowledge base support: a use case for initiation and maintenance of wrappers

    Page(s): 65 - 72

    Integrating life science Web databases, while important and necessary, is a challenge for current integration systems mainly due to the large number of these databases, their heterogeneity and the fact that their interfaces may change often. BACIIS, a biological and chemical information integration system, is a tightly coupled federated database system that uses the mediator wrapper method in order to retrieve information from several remote Web databases. BACIIS relies on a semi-automated approach for generating and maintaining wrappers in order to provide a scalable system with a limited maintenance overhead. The semi-automatic wrapper induction in BACIIS is efficient because it is based on, but not limited to, domain knowledge. Tests show that the use of ontology increases the accuracy of the wrapper induction. We also present how the wrapper induction system facilitates wrapper update, and assists in the information extraction. By using a wrapper induction system for creation and maintenance of wrappers, the scalability, flexibility, and stability of the integrated information system are easily maintained.
  • Predicting human papilloma virus prevalence and vaccine policy effectiveness in demographic strata

    Page(s): 73 - 80

    Human Papilloma Virus (HPV) is a sexually transmitted virus, which can lead to cervical cancer. HPV DNA is found in cervical cancers with types 16, 18, 31 and 45 accounting for more than 75% of cervical cancers. Candidate vaccines have entered phase III testing with the Food and Drug Administration and several drug companies are in licensing arbitration. Once this vaccine becomes available, an effective vaccination strategy is needed. Hughes, Garnett and Anderson have developed a model to predict HPV prevalence and population-level vaccine effectiveness; however, this model does not allow for stratification with time-dependent demographic traits, such as age. With this in mind, we have developed a tool that facilitates predicting HPV prevalence in a variety of demographic settings and allows for quantification of different vaccination policies.
  • Approximate global alignment of sequences

    Page(s): 81 - 88

    We propose two novel dynamic programming (DP) methods that solve the approximate bounded and unbounded global alignment problems for biological sequences. Our first method solves the bounded alignment problem. It computes the distribution of the edit distance between the remaining suffixes. For a given bound k and approximation p%, it uses this distribution to prune the entries of the DP matrix that will lead to alignments with more than k edit operations with more than p% probability. Our second method addresses the unbounded global alignment problem. For each entry of the distance matrix, it dynamically computes an upper bound to the distance between the unaligned suffixes. This bound, along with the lower bound as computed for the bounded case, is then used to eliminate the entries of the distance matrix. According to our experimental results, our methods are up to three times faster than the competing methods for the bounded alignment and up to two times faster for the unbounded alignment, even with 100% approximation. Our methods use only 17-68% of the space used by the next best competitor.
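
    For context, the bounded alignment problem asks whether two sequences can be aligned with at most k edit operations. The sketch below is the classical exact banded DP for that problem, which skips every cell with |i - j| > k; it is not the probabilistic pruning described in the abstract, and the function name and the unit-cost edit model are assumptions of this example.

```python
def bounded_edit_distance(a: str, b: str, k: int):
    """Return the edit distance between a and b if it is <= k, else None.

    Only cells within k of the main diagonal are filled, because any
    alignment that leaves this band needs more than k insertions/deletions.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = k + 1  # any value above k means "too far"; values are capped here
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # deleting the first i characters of a (i <= k here)
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         inf)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

    The band keeps the work at roughly O(k * n) cells instead of O(n * m); the approach in the abstract prunes further by estimating, for each cell, the probability that the remaining suffixes push the total above k.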
  • Multi-class biclustering and classification based on modeling of gene regulatory networks

    Page(s): 89 - 96

    The attempt to elucidate biological pathways and classify genes has led to the development of numerous clustering approaches to gene expression. All these approaches use a single metric to identify genes with similar expression levels. Until now, the correlation between the expression levels of such genes has been based on phenomenological and heuristic correlation functions, rather than on biological models. In this paper, we derive six distinct correlation functions based on explicit thermodynamic modeling of gene regulatory networks. We then combine these correlation functions with novel biclustering algorithms to identify functionally enriched groups. The statistical significance of the identified groups is demonstrated by precision-recall curves and calculated p-values. Furthermore, comparison with chromatin immunoprecipitation data indicates that the performance of the derived correlation functions depends on the specific regulatory mechanisms. Finally, we introduce the idea of multi-class biclustering and with the help of support vector machines we demonstrate its improved classification performance in a microarray dataset.
  • DSIM: A distance-based indexing method for genomic sequences

    Page(s): 97 - 104

    In this paper, we propose a Distance-based Sequence Indexing Method (DSIM) for indexing and searching genome databases. Borrowing the idea of video compression, we compress the genomic sequence database around a set of automatically selected reference words, formed from high-frequency data substrings and substrings in past queries. The compression captures the distance of each non-reference word in the database to some reference word. At runtime, a query is processed by comparing its substrings with the compressed data strings, through their distances to the reference words. We also propose an efficient scheme to incrementally update the reference words and the compressed data sequences as more data sequences are added and new queries come along. Extensive experiments on a human genome database with 2.62 GB of DNA sequence letters show that the new algorithm achieves significantly faster response time than BLAST, while maintaining comparable accuracy.
  • Discovery of repetitive patterns in DNA with accurate boundaries

    Page(s): 105 - 112

    The accurate identification of repeats remains a challenging open problem in bioinformatics. Most existing methods of repeat identification either depend on annotated repeat databases or restrict repeats to pairs of similar sequences that are maximal in length. The fundamental flaw in most of the available methods is the lack of a definition that correctly balances the importance of the length and the frequency. In this paper, we propose a new definition of repeats that satisfies both criteria. We give a novel characterization of the building blocks of repeats, called elementary repeats, which leads to a natural definition of repeat boundaries. We design efficient algorithms and test them on synthetic and real biological data. Experimental results show that our method is highly accurate.
  • Prediction of outer membrane proteins by support vector machines using combinations of gapped amino acid pair compositions

    Page(s): 113 - 120

    Discriminating outer membrane proteins from proteins with other subcellular localizations and from proteins with other folding classes are both important for further predicting their functions and structures. In this paper, we propose a method for discriminating outer membrane proteins from other proteins by support vector machines using combinations of gapped amino acid pair compositions. Using 5-fold cross-validation, the method achieves 95% precision and 92% recall on the dataset of proteins with well-annotated subcellular localizations, consisting of 471 outer membrane proteins and 1,120 other proteins. When applied to another dataset of 377 outer membrane proteins and 674 globular proteins belonging to four typical structural classes, the method reaches 96% precision and recall and correctly excludes 98% of the globular proteins. Our method outperforms the OM classifier of PSORTb v.2.0 and a method based on dipeptide composition.
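
    To make the feature type concrete, the sketch below computes one plausible form of gapped amino acid pair composition: the relative frequency of each ordered residue pair separated by a fixed number of intervening positions, with gap 0 reducing to ordinary dipeptide composition. The exact definition, the normalization, and the names `gapped_pair_composition` and `combined_features` are assumptions for illustration; the abstract does not spell them out.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def gapped_pair_composition(seq: str, gap: int) -> dict[str, float]:
    """Relative frequency of residue pairs (x, y) with `gap` residues between them."""
    counts = dict.fromkeys(PAIRS, 0)
    step = gap + 1
    total = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


def combined_features(seq: str, gaps: list[int]) -> list[float]:
    """Concatenate the 400-dimensional compositions for several gap sizes."""
    features: list[float] = []
    for g in gaps:
        comp = gapped_pair_composition(seq, g)
        features.extend(comp[p] for p in PAIRS)
    return features
```

    A vector such as `combined_features(seq, [0, 1, 2])` could then be passed to any support vector machine implementation; which gap sizes to combine is a modeling choice this sketch does not address.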
  • Discovery of gene expression patterns across multiple cancer types

    Page(s): 121 - 128

    In this paper, we investigate the underlying common gene expression signatures in related cancer types. Shared expression signatures are investigated in breast and ovarian cancers specifically through the definition of four progressively more difficult classification problems. SHEBA, a stochastic Bayesian inference approach, is introduced to identify highly predictive gene sets in the defined classification problems. The heuristics reduce the computation time required to identify the most informative groups of features in the gene space, while providing a good approximation of comparable exhaustive approaches. The breast and ovarian cancer class could be distinguished well from the other classes of cancers using SHEBA in three of the four classification problems, suggesting the existence of a commonality between their gene expressions. Extensive statistical validation and preliminary biological review of the most predictive gene sets demonstrate their robustness and specificity.
  • Effective pre-processing strategies for functional clustering of a protein-protein interactions network

    Page(s): 129 - 136

    In this article we present novel preprocessing techniques, based on topological measures of the network, to identify clusters of proteins from protein-protein interaction (PPI) networks wherein each cluster corresponds to a group of functionally similar proteins. The two main problems with analyzing protein-protein interaction networks are their scale-free property and the large number of false positive interactions that they contain. Our preprocessing techniques use a key transformation and separate weighting functions to effectively eliminate suspect edges, potential false positives, from the graph. A useful side-effect of this transformation is that the resulting graph is no longer scale free. We then examine the application of two well-known clustering techniques, namely hierarchical and multilevel graph partitioning, on the reduced network. We define suitable statistical metrics to evaluate our clusters meaningfully. From our study, we discover that the application of clustering on the pre-processed network results in significantly improved, biologically relevant and balanced clusters when compared with clusters derived from the original network. We strongly believe that our strategies would prove invaluable to future studies on prediction of protein functionality from PPI networks.
  • Improving genome rearrangement phylogeny using sequence-style parsimony

    Page(s): 137 - 144

    The study of genome rearrangements, the evolutionary events that change the order and strandedness of genes within genomes, presents new opportunities for discoveries about deep evolutionary events. The best software so far, GRAPPA, solves breakpoint and inversion phylogenies by scoring each tree topology through iterative improvements of internal node gene orders. We find that the greedy hill-climbing approach means the accuracy is limited because of multiple local optima. To address this problem, we propose integrating GRAPPA with MPME, a string encoding of gene adjacency relationships whose optimal internal node assignments can be determined globally in polynomial time, to provide better initializations for GRAPPA. In simulation studies, the new algorithm yields shorter tree lengths and better accuracy in phylogeny reconstruction.
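
    As background for the gene adjacency idea mentioned above, the sketch below counts breakpoints between two signed gene orders: an adjacency (a, b) in one genome is conserved if the other genome contains either (a, b) or its reverse-strand reading (-b, -a). It treats genomes as linear, unframed lists of signed integers, which is a simplifying assumption; it does not reproduce GRAPPA or the MPME encoding.

```python
def adjacencies(genome: list[int]) -> set[tuple[int, int]]:
    """All adjacencies of a signed gene order, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the opposite strand
    return adj


def breakpoint_distance(g1: list[int], g2: list[int]) -> int:
    """Number of adjacencies of g1 that are not conserved in g2."""
    conserved = adjacencies(g2)
    return sum(1 for pair in zip(g1, g1[1:]) if pair not in conserved)
```

    For example, inverting the segment 2 3 of the order 1 2 3 4 gives 1 -3 -2 4, which keeps the internal adjacency of the inverted block and breaks the two adjacencies at its ends.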
  • Haplotype phasing using semidefinite programming

    Page(s): 145 - 152

    Diploid organisms, such as humans, inherit one copy of each chromosome (haplotype) from each parent. The conflation of inherited haplotypes is called the genotype of the organism. In many disease association studies, the haplotype data is more informative than the genotype data. Unfortunately, getting haplotype data experimentally is both expensive and difficult. The haplotype inference with pure parsimony (HPP) problem is the problem of finding a minimal set of haplotypes that resolve a given set of genotypes. We provide a quadratic integer program (QIP) formulation for the HPP problem, and describe an algorithm for the HPP problem based on a semi-definite programming (SDP) relaxation of that QIP. We compare our approach with existing approaches. Further, we show that the proposed approach is capable of incorporating a variety of additional constraints, such as missing or erroneous genotype data, outliers, etc.
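
    To illustrate what "resolving" a genotype means in the HPP problem, the sketch below uses a common encoding that is assumed here rather than taken from the abstract: haplotypes are 0/1 tuples, and a genotype site is 0 or 1 when both haplotypes agree on that value and 2 when they differ. The exhaustive search is exponential and only meant for toy inputs; it is not the QIP/SDP approach of the paper, and all function names are hypothetical.

```python
from itertools import combinations, product


def resolves(h1, h2, g) -> bool:
    """True if the haplotype pair (h1, h2) explains genotype g site by site."""
    return all((a == b == s) if s in (0, 1) else (a != b)
               for a, b, s in zip(h1, h2, g))


def candidate_pairs(g):
    """All unordered haplotype pairs that resolve genotype g."""
    het = [i for i, s in enumerate(g) if s == 2]  # heterozygous sites
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


def min_haplotype_set(genotypes):
    """Smallest haplotype set resolving every genotype (brute force)."""
    per_genotype = [candidate_pairs(g) for g in genotypes]
    pool = sorted(set().union(*(p for pairs in per_genotype for p in pairs)))
    for size in range(1, len(pool) + 1):
        for subset in combinations(pool, size):
            chosen = set(subset)
            if all(any(pair <= chosen for pair in pairs)
                   for pairs in per_genotype):
                return chosen
    return set()
```

    Because the number of candidate subsets grows exponentially with the number of heterozygous sites, practical methods replace this enumeration with an optimization formulation such as the one the abstract describes.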