In some real-world applications, a sample is better described as a string of symbols than as a vector of real numbers, and many training algorithms need to determine the similarity or dissimilarity of two such strings. A widely used measure for two strings of different lengths is the weighted Levenshtein distance (WLD), defined as the minimum total weight of the single-symbol insertions, deletions, and substitutions required to transform one string into the other. To incorporate prior knowledge about strings into the kernels used in support vector machines and other kernel machines, this paper uses variants of this distance to replace the distance measure in the RBF and exponential kernels and the inner product in the polynomial and sigmoid kernels, forming a new class of string kernels: Levenshtein kernels. Combining the new kernels with a support vector machine, the error rate and variance on the UCI splice site recognition dataset over 20 runs are 5.88 ± 0.53, which is better than the best result, 9.5 ± 0.7, from five other training algorithms.
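To make the idea concrete, the following is a minimal sketch of a weighted Levenshtein distance computed by dynamic programming, plugged into an RBF-style kernel in place of the Euclidean distance. It is an illustration under stated assumptions, not the paper's implementation: the function names, the uniform per-operation weights (`ins_cost`, `del_cost`, `sub_cost`), the parameter `gamma`, and the choice of squaring the distance are all hypothetical, and the specific distance variants used in the paper are not given in the abstract.

```python
import math


def weighted_levenshtein(s, t, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum total weight of insertions, deletions and substitutions
    needed to transform s into t (uniform weights are an assumption)."""
    n = len(t)
    # prev[j] holds the cost of turning the empty prefix of s into t[:j].
    prev = [j * ins_cost for j in range(n + 1)]
    for i in range(1, len(s) + 1):
        # curr[0] is the cost of deleting the first i symbols of s.
        curr = [i * del_cost] + [0.0] * n
        for j in range(1, n + 1):
            substitution = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            curr[j] = min(
                prev[j] + del_cost,          # delete s[i-1]
                curr[j - 1] + ins_cost,      # insert t[j-1]
                prev[j - 1] + substitution,  # match or substitute
            )
        prev = curr
    return prev[n]


def levenshtein_rbf_kernel(s, t, gamma=0.1, **weights):
    """RBF-style kernel with the WLD in place of the Euclidean distance
    (hypothetical form; gamma is an assumed width parameter)."""
    d = weighted_levenshtein(s, t, **weights)
    return math.exp(-gamma * d * d)
```

With unit weights, `weighted_levenshtein("kitten", "sitting")` evaluates to 3.0, and `levenshtein_rbf_kernel(s, s)` is 1.0 for any string `s` because the distance from a string to itself is zero. A Gram matrix of such kernel values over a training set could then be passed to any kernel machine that accepts precomputed kernels.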