Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Issue 3 • Date July-Sept. 2010

  • [Front cover]

    Publication Year: 2010 , Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Publication Year: 2010 , Page(s): c2
    Freely Available from IEEE
  • An Overview of BioCreative II.5

    Publication Year: 2010 , Page(s): 385 - 399
    Cited by:  Papers (8)
    Multimedia

    We present the results of the BioCreative II.5 evaluation in association with the FEBS Letters experiment, where authors created Structured Digital Abstracts to capture information about protein-protein interactions. The BioCreative II.5 challenge evaluated automatic annotations from 15 text mining teams based on a gold standard created by reconciling annotations from curators, authors, and automated systems. The tasks were to rank articles for curation based on curatable protein-protein interactions; to identify the interacting proteins (using UniProt identifiers) in the positive articles (61); and to identify interacting protein pairs. There were 595 full-text articles in the evaluation test set, including those both with and without curatable protein interactions. The principal evaluation metrics were the interpolated area under the precision/recall curve (AUC iP/R), and (balanced) F-measure. For article classification, the best AUC iP/R was 0.70; for interacting proteins, the best system achieved good macroaveraged recall (0.73) and interpolated area under the precision/recall curve (0.58), after filtering incorrect species and mapping homonymous orthologs; for interacting protein pairs, the top (filtered, mapped) recall was 0.42 and AUC iP/R was 0.29. Ensemble systems improved performance for the interacting protein task.

  • Classification of Protein-Protein Interaction Full-Text Documents Using Text and Citation Network Features

    Publication Year: 2010 , Page(s): 400 - 411

    We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top-performing classifier in the challenge, performed above the central tendency of all submissions, and therefore indicates a promising new avenue to investigate further in bibliome informatics.

  • Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles

    Publication Year: 2010 , Page(s): 412 - 420
    Cited by:  Papers (2)

    The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and using full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. Using the multistage GN algorithm, we have been able to improve system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.

  • Empirical Investigations into Full-Text Protein Interaction Article Categorization Task (ACT) in the BioCreative II.5 Challenge

    Publication Year: 2010 , Page(s): 421 - 427

    The selection of protein interaction documents is an important application for biology research and has a direct impact on the quality of downstream BioNLP applications, e.g., information extraction and retrieval, summarization, and QA. The BioCreative II.5 Challenge Article Categorization task (ACT) involves binary text classification to determine whether a given structured full-text article contains protein interaction information. This may be the first attempt at classification of full-text protein interaction documents in the wider community. In this paper, we compare and evaluate the effectiveness of different section types in full-text articles for text classification. Moreover, in practice, the small number of true-positive samples results in unstable performance and an unreliable classifier trained on them. Previous research on learning with skewed class distributions has altered the class distribution using oversampling and downsampling. We also investigate the skewed protein interaction classification and analyze the effect of various issues related to the choice of external sources, oversampling training sets, classifiers, etc. We report on the various factors above to show that 1) a full-text biomedical article contains a wealth of scientific information important to users that may not be completely represented by abstracts and/or keywords, and using it improves classification accuracy, and 2) reinforcing true-positive samples significantly increases the accuracy and stability of classification.

  • BioLMiner System: Interaction Normalization Task and Interaction Pair Task in the BioCreative II.5 Challenge

    Publication Year: 2010 , Page(s): 428 - 441
    Cited by:  Papers (2)

    This paper describes a Biological Literature Miner (BioLMiner) system and its implementation. BioLMiner is a text mining system for biological literature, whose purpose is to extract useful information from biological literature, including gene and protein names, normalized gene and protein names, and protein-protein interaction pairs. BioLMiner has three main subsystems in a pipeline structure: a gene mention recognizer (GMRer), a gene normalizer (GNer), and a protein-protein interaction pair extractor (PPIEor). All these subsystems are built on machine learning techniques, including support vector machines (SVMs) and conditional random fields (CRFs), together with carefully designed informative features. At the same time, BioLMiner makes use of some biology-specific resources and existing natural language processing tools. To evaluate and compare BioLMiner, it was adapted to participate in two tasks of the BioCreative II.5 challenge: the interaction normalization task (INT) using GNer and the interaction pair task (IPT) using PPIEor. Our system is among the highest performing systems on the two tasks, which suggests that GMRer provides good support for the INT and IPT, although its performance was not evaluated directly, and that the methods developed in GNer and PPIEor extend well to the BioCreative II.5 tasks.

  • Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System

    Publication Year: 2010 , Page(s): 442 - 453
    Cited by:  Papers (9)
    Multimedia

    Currently, relation extraction (RE) and event extraction (EE) are the two main streams of biological information extraction. In 2009, the majority of these RE and EE research efforts were centered around the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the “BioNLP event extraction shared task.” Although these challenges took somewhat different approaches, they share the same ultimate goal of extracting bio-knowledge from the literature. This paper compares the two challenge task definitions, and presents a unified system that was successfully applied in both these and several other PPI extraction task settings. The AkaneRE system has three parts: A core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either Support Vector Machines or Statistical Classifiers and features extracted from given training data. The specific modules solve tasks like sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and finally, assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results, and the system is freely available for academic purposes at http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.

  • An IR-Aided Machine Learning Framework for the BioCreative II.5 Challenge

    Publication Year: 2010 , Page(s): 454 - 461

    The team at the University of Wisconsin-Milwaukee developed an information retrieval and machine learning framework. Our framework requires only the standardized training data and depends upon minimal external knowledge resources and minimal parsing. Within the framework, we built our text mining systems and participated for the first time in all three BioCreative II.5 Challenge tasks. The results show that our systems performed among the top five teams for raw F1 scores in all three tasks and came in third place for the homonym ortholog F1 scores for the INT task. The results demonstrated that our IR-based framework is efficient, robust, and potentially scalable.

  • Exploring Species-Based Strategies for Gene Normalization

    Publication Year: 2010 , Page(s): 462 - 471
    Cited by:  Papers (3)

    We introduce a system developed for the BioCreative II.5 community evaluation of information extraction of proteins and protein interactions. The paper focuses primarily on the gene normalization task of recognizing protein mentions in text and mapping them to the appropriate database identifiers based on contextual clues. We outline a “fuzzy” dictionary lookup approach to protein mention detection that matches regularized text to similarly regularized dictionary entries. We describe several different strategies for gene normalization that focus on species or organism mentions in the text, both globally throughout the document and locally in the immediate vicinity of a protein mention, and present the results of experimentation with a series of system variations that explore the effectiveness of the various normalization strategies, as well as the role of external knowledge sources. While our system was neither the best nor the worst performing system in the evaluation, the gene normalization strategies show promise and the system affords the opportunity to explore some of the variables affecting performance on the BCII.5 tasks.

  • OntoGene in BioCreative II.5

    Publication Year: 2010 , Page(s): 472 - 480
    Cited by:  Papers (3)

    We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation in the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.

  • Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

    Publication Year: 2010 , Page(s): 481 - 494
    Cited by:  Papers (2)

    Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an F-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, F-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. All data and pattern sets, as well as Java classes that extend third-party software, are available as supplementary information (see Appendix).

  • Cache-Oblivious Dynamic Programming for Bioinformatics

    Publication Year: 2010 , Page(s): 495 - 510
    Cited by:  Papers (3)

    We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots. For each of these problems, we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency of earlier algorithms. We present experimental results which show that our cache-oblivious algorithms run faster than software and implementations based on previous best algorithms for these problems.

  • CollHaps: A Heuristic Approach to Haplotype Inference by Parsimony

    Publication Year: 2010 , Page(s): 511 - 523
    Cited by:  Papers (5)

    Haplotype data play a relevant role in several genetic studies, e.g., mapping of complex disease genes, drug design, and evolutionary studies on populations. However, the experimental determination of haplotypes is expensive and time-consuming. This motivates the increasing interest in techniques for inferring haplotype data from genotypes, which can instead be obtained quickly and economically. Several such techniques are based on the maximum parsimony principle, which has been justified by both experimental results and theoretical arguments. However, the problem of haplotype inference by parsimony was shown to be NP-hard, thus limiting the applicability of exact parsimony-based techniques to relatively small data sets. In this paper, we introduce the collapse rule, a generalization of the well-known Clark's rule, and describe a new heuristic algorithm for haplotype inference (implemented in a program called CollHaps), based on parsimony and the iterative application of collapse rules. The performance of CollHaps is tested on several data sets. The experiments show that CollHaps enables the user to process large data sets, obtaining very “parsimonious” solutions in short processing times. They also show a correlation, especially for large data sets, between parsimony and correct reconstruction, supporting the validity of the parsimony principle to produce accurate solutions.

  • Combinatorial Analysis for Sequence and Spatial Motif Discovery in Short Sequence Fragments

    Publication Year: 2010 , Page(s): 524 - 536
    Multimedia

    Motifs are overrepresented sequence or spatial patterns appearing in proteins. They often play important roles in maintaining protein stability and in facilitating protein function. When motifs are located in short sequence fragments, as in transmembrane domains that are only 6-20 residues in length, and when there is only very limited data, it is difficult to identify motifs. In this study, we introduce combinatorial models based on permutation for assessing statistically significant sequence and spatial patterns in short sequences. We show that our method can uncover previously unknown sequence and spatial motifs in β-barrel membrane proteins and that our method outperforms existing methods in detecting statistically significant motifs in this data set. Last, we discuss implications of motif analysis for problems involving short sequences in other families of proteins.

  • Nonnegative Principal Component Analysis for Cancer Molecular Pattern Discovery

    Publication Year: 2010 , Page(s): 537 - 549
    Cited by:  Papers (4)
    Multimedia

    As a well-established feature selection algorithm, principal component analysis (PCA) is often combined with the state-of-the-art classification algorithms to identify cancer molecular patterns in microarray data. However, the algorithm's global feature selection mechanism prevents it from effectively capturing the latent data structures in the high-dimensional data. In this study, we investigate the benefit of adding nonnegative constraints on PCA and develop a nonnegative principal component analysis algorithm (NPCA) to overcome the global nature of PCA. A novel classification algorithm NPCA-SVM is proposed for microarray data pattern discovery. We report strong classification results from the NPCA-SVM algorithm on five benchmark microarray data sets by direct comparison with other related algorithms. We have also proved mathematically and interpreted biologically that microarray data will inevitably encounter overfitting for an SVM/PCA-SVM learning machine under a Gaussian kernel. In addition, we demonstrate that nonnegative principal component analysis can be used to capture meaningful biomarkers effectively.

  • SCS: Signal, Context, and Structure Features for Genome-Wide Human Promoter Recognition

    Publication Year: 2010 , Page(s): 550 - 562
    Cited by:  Papers (3)
    Multimedia

    This paper integrates the signal, context, and structure features for genome-wide human promoter recognition, which is important in improving genome annotation and analyzing transcriptional regulation without experimental support from ESTs, cDNAs, or mRNAs. First, CpG islands are salient biological signals associated with approximately 50 percent of mammalian promoters. Second, the genomic context of promoters may have biological significance, which is based on n-mers (sequences n bases long) and their statistics estimated from training samples. Third, sequence-dependent DNA flexibility originates from DNA 3D structures and plays an important role in guiding transcription factors to the target site in promoters. Employing decision trees, we combine the above signal, context, and structure features to build a hierarchical promoter recognition system called SCS. Experimental results on controlled data sets and the entire human genome demonstrate that SCS is significantly superior in terms of sensitivity and specificity as compared to other state-of-the-art methods. The SCS promoter recognition system is available online as supplemental materials for academic use and can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.95.

  • A Study of Hierarchical and Flat Classification of Proteins

    Publication Year: 2010 , Page(s): 563 - 571
    Cited by:  Papers (3)

    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article, we investigate empirically whether this is the case for two such hierarchies. We compare multiclass classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multiclass settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data but not in the case of the protein classification problems. Based on this, we recommend that strong flat multiclass methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.

  • On the Complexity of uSPR Distance

    Publication Year: 2010 , Page(s): 572 - 576
    Cited by:  Papers (3)

    We show that subtree prune and regraft (uSPR) distance on unrooted trees is fixed parameter tractable with respect to the distance. We also make progress on a conjecture of Steel on the preservation of uSPR distance under chain reduction, improving on lower bounds of Hickey et al.

  • IEEE/ACM TCBB: Information for authors

    Publication Year: 2010 , Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Publication Year: 2010 , Page(s): c4
    Freely Available from IEEE

Aims & Scope

This bimonthly publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu