I. Introduction
CD-HIT is arguably the state of the art for reducing sequence database redundancy and has been used in thousands of biological studies [1]. It produces a non-redundant database that consists only of representative sequences. The objective is a subset of the database in which no sequence is more similar to any other sequence in the subset than a user-defined threshold. Because exhaustively comparing all pairs of sequences would be too slow for large databases, the method tolerates some redundancy in its output in exchange for speed.
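
To make the objective concrete, the following is a minimal sketch of the exhaustive baseline that the paragraph above describes as too slow. It is not CD-HIT's implementation. The names `toy_identity` and `exhaustive_nonredundant` are hypothetical, the similarity measure is a simplified ungapped position match rather than an alignment-based identity, and processing the longest sequences first is a choice made for this sketch.

```python
from typing import List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical similarity: fraction of matching positions (no gaps),
    relative to the longer sequence. A stand-in for a real alignment score."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def exhaustive_nonredundant(seqs: List[str], threshold: float) -> List[str]:
    """Keep a sequence only if its similarity to every representative chosen
    so far does not exceed the threshold. Each candidate is compared against
    all current representatives, so the worst case is quadratic in the
    number of sequences."""
    representatives: List[str] = []
    # Assumption of this sketch: longer sequences are considered first.
    for seq in sorted(seqs, key=len, reverse=True):
        if all(toy_identity(seq, rep) <= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    database = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGGGGGGG", "MKTAY"]
    print(exhaustive_nonredundant(database, threshold=0.8))
```

In this toy database the second sequence differs from the first at a single position, so it is dropped at a threshold of 0.8, while the other three are retained. Every retained pair satisfies the objective exactly, but the number of comparisons grows quadratically with database size in the worst case, which is the cost that faster heuristics avoid by accepting some residual redundancy.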