Discovering frequent subsequences as patterns in large sequence databases is an important task in a variety of applications, such as biological sequence analysis. In general, the patterns to be discovered may occur only partially and asynchronously in the sequences, and may even contain gaps. In addition, the locations and frequencies of the patterns may be of interest for subsequent analysis. A further concern is how to enumerate candidate patterns for evaluation without the computation time growing exponentially. A modified periodicity transform is proposed to meet these requirements. Mining all partial periodic patterns of length 5 from a synthetic sequence of length 300 K takes 4 seconds. With minor modification, the method can handle asynchronous partial periodic patterns of arbitrary length. The approach is naturally suited to distributed environments, and a prototype system for distributed computing has been developed in Java. The system can be regarded as a feature extractor for an early stage of sequence analysis.
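To make the notion of a partial periodic pattern concrete, the following is a minimal Python sketch. It is not the modified periodicity transform itself, whose details are not given here. It only illustrates the simpler synchronous case under stated assumptions: the sequence is cut into consecutive segments of a fixed period, a pattern is a string of that length in which `*` marks a "don't care" position, and a symbol counts as frequent at an offset when it appears there in at least a chosen fraction of segments. The function names, the `*` wildcard, and the `min_conf` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def frequent_position_symbols(sequence, period, min_conf):
    """Map (offset, symbol) to the segment indices where the symbol occurs.

    Only pairs seen in at least `min_conf` of the full-length segments are
    kept. Trailing symbols that do not fill a whole segment are ignored.
    (Illustrative assumption, not the method described in the abstract.)
    """
    n_segments = len(sequence) // period
    hits = defaultdict(list)
    for seg in range(n_segments):
        for offset in range(period):
            symbol = sequence[seg * period + offset]
            hits[(offset, symbol)].append(seg)
    return {
        key: segs
        for key, segs in hits.items()
        if len(segs) / n_segments >= min_conf
    }


def matching_segments(pattern, sequence, wildcard="*"):
    """Return indices of segments that match a partial pattern.

    The period is taken to be len(pattern); wildcard positions match any
    symbol, which is what makes the pattern "partial".
    """
    period = len(pattern)
    n_segments = len(sequence) // period
    return [
        seg
        for seg in range(n_segments)
        if all(
            p == wildcard or p == sequence[seg * period + i]
            for i, p in enumerate(pattern)
        )
    ]


if __name__ == "__main__":
    seq = "abcabdabcaxc"  # segments of period 3: abc, abd, abc, axc
    print(frequent_position_symbols(seq, 3, 0.75))
    print(matching_segments("a*c", seq))
```

Returning segment indices rather than bare counts keeps both the locations and the frequencies of each pattern available for later analysis. Because every segment is scanned independently, the per-segment work could in principle be split across workers, although how the actual system distributes its computation is not specified here.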