DNA datasets typically have a very low signal-to-noise ratio, which limits the performance that computational motif discovery tools can achieve. Reducing the search space and increasing the signal-to-noise ratio by means of filtering can therefore give these tools a better operating environment. This paper proposes unsupervised fuzzy filtering systems that aim to remove a large portion of the k-mers that are less relevant to potential motif instances in terms of location overlap in the given sequences. The Relative Model Mismatch Score (RMMS), a new quantitative metric for measuring the quality of motif models, is employed in this work to support the proposed filtering. A modified version of the fuzzy c-means clustering algorithm with an initialization strategy is adopted to group k-mers, and a complement of the fuzzified RMMS is used to rank k-mers for data filtering. Experimental results on eight real DNA datasets show that the proposed filtering systems can remove approximately (85 ± 5)% of the data samples while maintaining a high retention rate of relevant k-mers. Used as a data pre-processing component, this filtering will improve the operating environment of motif discovery tools, since the filtered datasets have much smaller cardinality and a higher signal-to-noise ratio than the original datasets.
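To make the filtering idea concrete, the following minimal Python sketch enumerates the k-mers of a set of sequences and keeps only the top-ranked fraction. It illustrates the general rank-and-filter step only. It does not implement the fuzzy c-means grouping or the fuzzified RMMS described above. The function names `extract_kmers`, `filter_kmers`, and `toy_score` are hypothetical, and `toy_score` (GC fraction) is an arbitrary stand-in for a relevance score, not the metric used in this work. The default `keep_fraction` of 0.15 is an assumption chosen to mirror the reported removal of roughly 85% of samples.

```python
from typing import Callable, List, Tuple

# A k-mer record: (sequence index, start position, k-mer string).
Kmer = Tuple[int, int, str]


def extract_kmers(sequences: List[str], k: int) -> List[Kmer]:
    """Slide a window of length k over every sequence and record each k-mer."""
    kmers: List[Kmer] = []
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            kmers.append((seq_idx, start, seq[start:start + k]))
    return kmers


def filter_kmers(
    kmers: List[Kmer],
    score: Callable[[str], float],
    keep_fraction: float = 0.15,  # assumption: keep ~15%, i.e. remove ~85%
) -> List[Kmer]:
    """Rank k-mers by a caller-supplied score (higher is better) and keep the top fraction."""
    ranked = sorted(kmers, key=lambda item: score(item[2]), reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def toy_score(kmer: str) -> float:
    """Hypothetical placeholder score (GC fraction); NOT the RMMS-based ranking."""
    return sum(base in "GC" for base in kmer) / len(kmer)
```

In an actual pipeline, `toy_score` would be replaced by the ranking derived from the complement of the fuzzified RMMS, and the retained k-mers would be passed to a motif discovery tool.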