Evaluation of CD-HIT for constructing non-redundant databases


Abstract:

CD-HIT is one of the most popular tools for reducing sequence redundancy and is considered the state-of-the-art method. It aims to minimise redundancy by reducing an input database to a set of representative sequences under a user-defined sequence identity threshold. We present a comprehensive assessment of the redundancy remaining in the outputs of CD-HIT, exploring how different identity thresholds and new evaluation data affect that redundancy. We demonstrate that the relationship between threshold and redundancy is surprisingly weak. Applications of CD-HIT that set low identity thresholds may also suffer substantial degradation in both efficiency and accuracy.
Date of Conference: 15-18 December 2016
Date Added to IEEE Xplore: 19 January 2017
Conference Location: Shenzhen

I. Introduction

CD-HIT is arguably the state-of-the-art tool for reducing sequence redundancy and has been used in thousands of biological studies [1]. It reduces database redundancy by producing a non-redundant database that consists only of representative sequences. The objective is to produce a subset of a database in which no sequence is more similar to any other sequence in the subset than a user-defined threshold. Because exhaustive pairwise similarity comparison would be inefficient for such large databases, the method tolerates some redundancy in the output in exchange for speed.
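The objective stated above can be illustrated with a minimal Python sketch. It is not CD-HIT's implementation: the function names are hypothetical, `toy_identity` is a deliberately crude ungapped stand-in for alignment-based sequence identity, and treating identity equal to the threshold as redundant is an assumed convention. The sketch processes sequences longest first and keeps a sequence only if it falls below the threshold against every representative kept so far, then checks the non-redundancy property by exhaustive pairwise comparison, which is the costly check that a fast tool would want to avoid on large databases.

```python
from itertools import combinations


def toy_identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption, not CD-HIT's measure).

    Counts matching residues at the same position, without gaps,
    and divides by the length of the shorter sequence.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_representatives(seqs: list[str], threshold: float) -> list[str]:
    """Keep a sequence only if it is below the threshold to every kept one.

    Sequences are visited longest first; ties keep their input order.
    Identity >= threshold is treated as redundant (assumed convention).
    """
    reps: list[str] = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(toy_identity(seq, rep) < threshold for rep in reps):
            reps.append(seq)
    return reps


def is_non_redundant(seqs: list[str], threshold: float) -> bool:
    """Exhaustive pairwise check of the stated objective (quadratic cost)."""
    return all(toy_identity(a, b) < threshold for a, b in combinations(seqs, 2))


# Made-up toy sequences, for illustration only.
database = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGGGGGGG", "MKTAYI"]
representatives = greedy_representatives(database, threshold=0.8)
print(representatives)
print(is_non_redundant(representatives, threshold=0.8))
```

Because this sketch compares each candidate against every retained representative, its output satisfies the objective by construction under the toy measure. Any method that skips or approximates some comparisons for speed gives up that guarantee, which is the kind of residual redundancy the paper sets out to measure.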
