Fonts That Fit the Music: A Multi-Modal Design Trend Analysis of Lyric Videos

Lyric videos, or kinetic typography videos, are music videos showing lyric text in synchronization with the music. The purpose of this paper is to quantitatively and qualitatively analyze lyric videos to understand their design trends via three modalities: word motion, font style, and music style. These trends will not only be helpful as hints for designing new lyric videos but also be meaningful to quantitatively reveal the thought processes of the video design professionals. To achieve this, we needed to develop or utilize several technologies. First, we developed a lyric word tracking method to capture the motion of individual lyric words. The proposed method uses the lyric text as the guiding information for word tracking to overcome the difficulties arising from the various word appearances and motions. Second, we developed a font style estimator to quantify the appearance of each word as a feature vector. Finally, we employed a music style estimator to quantify the mood of the music, e.g., “techno” and “fast.” We then analyzed feature vectors of these three style modalities collected at 3,494 time points in 100 lyric videos. After revealing the trend of each modality via k-means, we conducted a co-occurrence analysis to understand the correlation between each modality pair. Our experimental results indicate that such a cluster-wise co-occurrence analysis can capture interesting trends hidden in lyric video designs.


I. INTRODUCTION
Lyric videos (a.k.a., kinetic typography videos) have become a popular approach for promoting songs on video sharing services, such as YouTube and social network services. In lyric videos, the lyric words are displayed and animated synchronously with the music. The display style of the lyric words is very different from that of still video captions. Figure 1 shows a series of video frames taken from a lyric video. In this video, the lyric words are shown in a decorative font style and move dynamically along with the video frames.
Similar to conventional typographic designs, such as book covers, posters, and web advertisements, creating lyric videos requires that the video creator have expertise in graphic design and that the relationship between the graphical and musical expressions be considered. The creators need to carefully choose the font style for the lyric words while considering the style (mood) of the music. Moreover, creators  study, we need to develop or employ appropriate techniques to quantify these three style modalities. For example, to quantify word motions, we first need to detect and recognize lyric words in individual frames and then to track them over multiple frames. After quantifying the three style modalities, statistical analyses are conducted to reveal the correlations between the modalities. This correlation analysis is meaningful in two ways. First, it will lead to a deeper understanding of the typographic designs of lyric videos. This analysis provides hints as to how experts can use their knowledge of typography in music videos. Second, the relationships discovered by the analysis will help non-experts create lyric videos or help in the development of lyric video creation tools such as TextAlive [1]. The relationships could also be used to suggest suitable font styles for specific music styles.
Despite its meaningfulness, correlation analyses between these three style modalities for lyric videos are underexplored and remain challenging because of the following difficulties.
1) Word motion quantification is not a simple task. Lyric word detection and recognition for lyric videos is a difficult task, even for state-of-the-art scene text detectors and recognizers. Various decorated fonts and background images prevent the accurate detection and recognition of lyric words.
2) Even though quantification of the music style is possible using a standard style estimator, such as musicnn [2], [3], there is no standard tool for quantifying the font style. The font style in lyric videos has wide varieties, and therefore the employed font style estimator needs to be capable of dealing with them.
3) The correlation between the style modalities will likely be very subtle and weak. Styles largely depend on the designer's subjective choices and may undergo multiple artistic and artificial variations. For example, the same font style may be used for music with completely different styles. This indicates that style correlations will not have simple or clear (such as linear) trends or distinctive peaks. In fact, our preliminary regression analysis experiment using XGBoost [4] was unable to capture a clear correlation between the style modalities.
4) Because lyric videos are a relatively new multi-media video resource with typographic artwork, there is not yet a standard video dataset available for analyses. This situation is very different from other well-studied video analysis tasks, such as the Text REtrieval Conference Video Retrieval Evaluation (TRECVID).
To address the first of the above difficulties, we propose a lyric word detection and tracking method, called lyric-frame matching. Its key idea is to utilize the lyric word sequence, which is given as metadata, to improve the tracking performance. More specifically, state-of-the-art scene text detectors and recognizers are first applied to each video frame to obtain candidates for the lyric word locations. Then, dynamic programming (DP)-based optimization is applied to determine the optimal matching between the candidates and the lyric word sequence over the frames. The matching result gives a reliable spatio-temporal trajectory for each lyric word in a given sequence.
For the second difficulty, we developed a font style estimator based on a convolutional neural network (CNN). The estimator is realized by training the CNN on a font image dataset in which each font is annotated with its font style (e.g., "Sans-Serif"). Because there is no standard font style class definition, we roughly defined six font styles and represented the style of a given word image using a six-dimensional class-probability vector. In addition, because the word images extracted from the lyric video frames have various backgrounds and distortions, we needed to train the CNN not with clean font images but with synthetic font images that mimic actual lyric word images.
For the third difficulty, we made full use of cluster analysis. Even though clustering is a classic and simple method, it is useful for our correlation analysis task. Clustering involves vector quantization and therefore gives a rough view of the variations in the styles. Moreover, clustering can deal with highly nonlinear style trends because of its non-parametric nature. In this paper, we first apply k-means clustering to each modality independently and then apply a biclustering technique to understand the correlation between two modalities via the co-occurrence of their (quantized) styles.
For the fourth difficulty, we prepared a new lyric video dataset containing 100 lyric videos created by design experts. We manually attached the lyric word bounding boxes to 1,000 video frames to evaluate the accuracy of the lyric word tracking result. A list of the videos and the bounding box data are publicly available at https://github.com/uchidalab/Lyric-Video.
The main contributions of this paper can be summarized as follows:
• To the best of the authors' knowledge, this is the first study to analyze the design of lyric videos in a quantitative manner. Because of the design factors specific to lyric videos, we focus on three style modalities: font style, word motion, and music style. A correlation analysis between these style modalities will provide basic knowledge concerning kinetic typography designs in music videos. In fact, the analysis results reveal interesting trends between the three style modalities; for example, "Fancy" fonts tend to be used for "pop" and "guitar" music, and active motions are often printed in "Fancy" and "Sans-Serif" fonts.
• This is also the first attempt to detect and then track lyric words in lyric videos. We propose a novel word tracking technique using an optimal lyric-frame matching algorithm based on DP.
A preliminary version of this study, in which only a word motion analysis was conducted, was published in a conference paper [5]. The present paper contains much wider analyses introducing two new style modalities, i.e., font style and music style. The correlation analysis between the three modalities is another novel contribution of this paper.

II. RELATED WORK
Since this paper is the first attempt at a design analysis of lyric videos, there are presently no similar studies. In this section, instead, we review previous attempts to extract or analyze word motion, font style, and music style for more general subjects.

A. WORD MOTION ANALYSIS
There are several tasks involved in detecting and tracking words in video frames. The most typical task is caption detection [6]-[14]. Captions are defined as text superimposed on video frames. Captions, therefore, have characteristics that differ from scene text. Even though most studies have dealt with static captions (i.e., captions without motions), Zedan [11] addressed not only static captions but also moving captions. They referred to the vertical or horizontal scrolling of caption text as moving captions.
Recently, video text tracking [15]-[23] has also been attempted, as reviewed in [24]. Because such methods track words in a scene captured by a moving camera, they commonly assume that the words themselves are static in the scene. Therefore, they assume, for example, that neighboring words will move in similar directions. The paper [25] introduces "moving MNIST" for video prediction tasks. This paper focuses on synthetic videos capturing two digits moving with respect to a uniform background.
Our study is very different from these previous attempts in at least the following respects. First, our target words in lyric videos move far more dynamically and freely, invalidating the assumption used in previous studies. Second, we can utilize lyric information during tracking, whereas previous attempts did not include such guiding information.

B. FONT STYLE ESTIMATION
Most previous font image analysis studies have focused on the so-called font identification (or font recognition). This involves identifying the font name (such as "Helvetica") of a given text image. Zramdini and Ingold [26] presented a pioneering trial recognizing 10 different fonts. Recently, deep neural networks have also been used for font identification [27]-[29].
In this study, we use font style estimation, which is different from font identification. Font styles are defined as Serif, Sans-Serif, Script, and so on. If a method can estimate the style of an arbitrary font, it can be applied to lyric words printed with rare or even brand-new fonts. However, font style estimation is less common than font identification because font style classes are not well defined 1 . Shinahara et al. [31] developed a font style estimation method based on six font classes (Serif, Sans-Serif, Hybrid, Script, Historical Script, and Fancy) defined in a font guidebook [32]. They used simple pattern matching for the classification. In this paper, as described in Section V, we develop our own neural network-based font style estimator, following the same six classes. Note that there have been several classical attempts (see Table 1 of [33]) to classify font images into Roman, bold, and italic classes. We do not use these three classes because they are appropriate for font images from ordinary text documents but not for various font images of lyric words.

C. MUSIC STYLE ESTIMATION
Music audio tagging, including mood/emotion estimation, is a popular research topic in the music information retrieval (MIR) community. Various approaches have already been proposed for music audio tagging [34]-[44]. Recently, Pons and Serra released musicnn [2], [3], which can provide a "taggram" for each music segment using CNNs. Each taggram is a 50-dimensional vector, and each element corresponds to one of 50 tags (defined in the MagnaTagATune (MTT) dataset [45]). This is not a one-hot vector but rather a non-negative real-valued vector. Each value represents the property of the music segment for the corresponding tag.
In this paper, we use the 50-dimensional taggram given by musicnn as the music style. As in the case of the font styles, the music styles do not have any standard definition; this is because music styles are defined by multiple factors, such as instrument types and genres. Fortunately, taggram by musicnn covers these factors. Of the 50 tags, some tags indicate musical instruments (such as "drums" and "guitar"), some indicate vocal types (such as "male vocal" and "choral"), some indicate music genre (such as "rock" and "techno"), and some indicate moods (such as "loud" and "slow").

III. LYRIC VIDEO DATASET
As the lyric video dataset to be analyzed, 100 videos were collected via the following steps. First, a list of lyric videos was generated by searching YouTube with the keywords "official lyric video" (on July 18, 2019). The keyword "official" was added to find videos with not only long-time availability but also professional quality. The latter is very important because we want to exclude incomplete or thoughtless video designs from our analysis. Then, the videos in the list were manually checked to exclude videos with only static motion words (i.e., videos whose lyric words did not move). Finally, the top-100 videos on the list were selected as our experimental target 2 . The frame image size is 1,920 × 1,080 pixels. The average, maximum, and minimum lengths of the videos in the dataset are 5,471 frames (3 min 38 s), 8,629 frames, and 2,280 frames, respectively. The average, maximum, and minimum numbers of lyric words are 338, 690, and 113, respectively. Figure 3 shows four examples of lyric video frame variations. Figure 3 (a) depicts a frame showing lyric words. Typically, several words (i.e., a phrase in the song) are shown in a single frame. In the introduction, interlude, and ending parts, frames with no lyrics are often found, as shown in Figure 3 (b). In Figure 3 (c), the same word is duplicated, as in the refrain of a song. Sometimes, as shown in Figure 3 (d), the background image contains words unrelated to the lyrics.
To perform a quantitative evaluation of the word tracking method in Appendix A, bounding boxes were manually attached to the lyric words for 10 frames in each video. These frames were selected automatically. Specifically, for each video, the top 10 frames with the most words were selected from the frames sampled at three-second intervals. The lyric words were detected using the method described in Appendix A-A, and a bounding box was attached to each word in the lyrics. We attached non-horizontal bounding boxes 3 to the rotated lyric words. Consequently, we obtained 10 × 100 = 1,000 ground-truth frames with 7,770 word bounding boxes for the dataset.
1 The PANOSE System [30] was expected to be a good standard for font styles; however, most fonts currently do not follow it.
2 A list of all 100 videos and their annotations is published at https://github.com/uchidalab/Lyric-Video.
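The frame selection rule for the ground-truth annotation can be sketched as follows. This is only an illustrative sketch: the function name, the input format `word_counts` (one detected-word count per frame), and the tie-breaking rule are assumptions that are not specified in the paper.

```python
def select_annotation_frames(word_counts, fps, interval_s=3.0, top_n=10):
    """Pick the top_n sampled frames that contain the most detected words.

    word_counts: list where word_counts[t] is the number of words detected
                 in frame t (hypothetical input format).
    fps:         frames per second of the video.
    """
    step = max(1, round(fps * interval_s))       # frames between two samples
    sampled = range(0, len(word_counts), step)   # frames at 3-second intervals
    # Rank by word count (descending); ties go to the earlier frame (assumption).
    ranked = sorted(sampled, key=lambda t: (-word_counts[t], t))
    return sorted(ranked[:top_n])
```

For a real video, `word_counts` would come from the detection step of Appendix A-A.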

IV. WORD MOTION STYLE

A. LYRIC WORD TRACKING
We propose a word tracking method for extracting individual word motions and then quantifying their style. The proposed method is specialized to accurately track lyric words while utilizing lyric information (which is available via the metadata of the lyric video). The tracking method has three steps: word detection, word recognition, and lyric-frame matching. In the first step, lyric word candidates are detected and recognized by the method presented in Appendix A-A, as shown on the left-hand side of Figure 4 (a).
After detection and recognition, lyric-frame matching is conducted to establish the correspondence between the frames and the lyric words. The matching algorithm is based on DP-based optimization and is detailed in Appendix A-B. The red path on the right-hand side of Figure 4 (a) represents the optimal correspondence of the frames and lyric words. If the path passes through the grid (k, t), it means that the tth frame is determined to be the most confident frame for the kth lyric word. We then search the frames around the tth frame to find the same kth lyric word. The vertical orange paths in Figure 4 (b) depict the search results for individual lyric words. This search was done not only using simple spatio-temporal closeness but also by evaluating the word similarity of the kth word. As shown in Figure 4 (b), there are many misrecognized words; therefore, we cannot use the exact match with the lyric word in this search. Details are given in Appendix A-C.
The vertical orange paths for "EVER" and "THOUGHT" in Figure 4 (b) include skipped frames. For example, "EVER" was not detected in the second frame. Such missed detections occur because of occlusion and severe misrecognition. Therefore, we need to perform the interpolation process shown in Figure 4 (c) to complete the spatio-temporal tracking process of each lyric word. Roughly speaking, if a missed frame is found for a lyric word, the polynomial interpolation process determines the location of the lyric word at that frame. Details are given in Appendix A-C. Figure 5 shows the final result of the tracking process for the two lyric words "YOU" and "EVER."
The above tracking method is not perfect because of the various difficulties; however, the quantitative evaluation using the ground-truth bounding boxes attached to the video frames, detailed in Appendix A-E, shows that the tracked trajectories have high precision. Therefore, we believe that the following word motion style analysis based on the tracking result is sufficiently reliable.
3 To attach non-horizontal bounding boxes, we used the labeling tool roLabelImg available at https://github.com/cgvict/roLabelImg.

FIGURE 5. Tracking result of "YOU" and "EVER" (legend: matched frame, interpolated frame). In particular, interpolation is successfully performed for "EVER."

B. REPRESENTATIVE WORD MOTIONS
Later, in the correlation analysis, we represent the word motion style of each 30-second time-window in a so-called "bag-of-words" manner. Specifically, the motion trajectories of all the lyric words in the video are quantized into B representative word motions, and a histogram with B bins is created. Each bin corresponds to one representative motion and shows how many word trajectories are quantized to that motion. We therefore need to select the representative word motions in advance of the word style representation.
The steps to select the B (= 70) representative motion trajectories of all the lyric videos in the dataset are as follows. First, each motion trajectory is represented as a sequence of four-dimensional vectors (x_1, y_1, x_2, y_2), as shown in Figure 6, where (x_1, y_1) represents the location of the center of the word bounding box, and (x_2, y_2) is defined as the upper-right corner of a square whose center is (x_1, y_1) and whose edge length is the bounding-box height. The coordinates (x_2, y_2) indirectly represent the size (word height) and rotation of the bounding box in a manner consistent with (x_1, y_1). Second, each motion trajectory is translated such that its first location (x_1, y_1) becomes (0, 0). Third, the trajectories are grouped by their duration: 0.5 ∼ 1.0 s (5,107), 1.0 ∼ 1.5 s (4,423), 1.5 ∼ 2.0 s (3,581), 2.0 ∼ 2.5 s (2,744), 2.5 ∼ 3.0 s (1,658), 3.0 ∼ 4.0 s (1,742), and 4.0 ∼ 5.0 s (973). The numbers in parentheses are the numbers of trajectories in the individual groups. Extremely short (< 0.5 s) and long (> 5.0 s) trajectories were rare and were excluded. Finally, k-medoid clustering (k = 10) with a dynamic time warping distance metric was performed within each group, and 70 representative word motions were obtained. Figure 7 shows the 10 word motions (i.e., 10 medoids) in each of the seven duration groups, where (x_2, y_2) is omitted. The center of each plot is the origin (0, 0) (i.e., the starting point of the trajectory), and the change in the color saturation (white to vivid) indicates the transition of time. The majority consists of rather simple motions: vertical, horizontal, or no-motion (i.e., staying at the origin). In addition, we show a histogram of the number of trajectories in each of the 70 clusters for all 100 lyric videos. In each of the seven duration groups, the orange bin indicates the cluster having a no-motion trajectory in which the lyric words do not move; lyric words often appear and disappear without movement.
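The quantization of word trajectories into representative motions can be illustrated with a short sketch. It assumes that each trajectory is a list of (x_1, y_1, x_2, y_2) tuples, that the translation is applied to both points, and that a standard dynamic time warping (DTW) recursion with Euclidean local cost is used; the function names are hypothetical, and the grouping by duration described above is omitted for brevity.

```python
import math

def translate_to_origin(traj):
    """Shift a trajectory so that its first centre point (x1, y1) becomes (0, 0)."""
    x0, y0 = traj[0][0], traj[0][1]
    return [(x1 - x0, y1 - y0, x2 - x0, y2 - y0) for x1, y1, x2, y2 in traj]

def dtw_distance(a, b):
    """Standard DTW distance between two sequences of equal-dimension vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost
            g[i][j] = cost + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]

def motion_histogram(trajectories, medoids):
    """Count how many trajectories are nearest (by DTW) to each representative motion."""
    hist = [0] * len(medoids)
    for traj in trajectories:
        shifted = translate_to_origin(traj)
        nearest = min(range(len(medoids)),
                      key=lambda i: dtw_distance(shifted, medoids[i]))
        hist[nearest] += 1
    return hist
```

With the 70 medoids as `medoids`, the returned list would correspond to the B-bin histogram used as the word motion style of a time window.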
VOLUME 10, 2022 5 This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

V. FONT STYLE
To facilitate the correlation analysis performed later, we represent the font style of each video as a likelihood vector of six typical font styles: Serif, Sans-Serif, Hybrid (of Serif and Sans-Serif), Script, Historical Script, and Fancy (i.e., Display). Accordingly, the font style is given as a six-dimensional real-valued vector. For this purpose, we developed a six-class font style classifier that gives the likelihood vector of each word image. The font style estimates were derived using ResNet18, a CNN trained using a large number of word images synthesized by SynthText [46]. More specifically, we first collected 510, 314, 151, 74, 58, and 704 different fonts for the Serif, Sans-Serif, Hybrid, Script, Historical Script, and Fancy classes, respectively. The class of each font was specified in a font guidebook [32]. We then generated 19,000 synthetic word images for each of the six font styles using SynthText. The images were separated into training (80%), validation (10%), and test (10%) sets, and these sets were font-disjoint. Finally, ResNet was trained as a six-class classifier using the training and validation sets. We used the six-dimensional likelihood vector given before the softmax layer of the trained ResNet as the font style vector. Note that the accuracy of the brute-force classification (into one of six font classes) by the trained ResNet was 81.10% on the test set. Figure 8 shows the font style vectors for word images extracted from the lyric videos. The horizontal axis corresponds to the six font classes, and the vertical axis indicates their likelihood. The top row shows four cases with high likelihoods only at the correct single class. The middle row shows cases estimated as being a mixture of several font styles. The bottom row shows style vectors for word images taken from the same lyric video in which the font styles were consistent.
Note that, in the experiment in Section VII, the font style vectors of all lyric words detected within each 30-second time window were averaged and then used as the font style vector for the time period.
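The window-wise averaging can be sketched as follows. The input format (a list of timestamped style vectors) and the function name are assumptions made for illustration only.

```python
def window_font_style(words, start_s, window_s=30.0):
    """Average the style vectors of words detected in [start_s, start_s + window_s).

    words: list of (time_s, style_vector) pairs (hypothetical input format).
    Returns the element-wise mean vector, or None if the window has no words.
    """
    inside = [vec for time_s, vec in words
              if start_s <= time_s < start_s + window_s]
    if not inside:
        return None  # such windows are discarded from the analysis
    dim = len(inside[0])
    return [sum(vec[i] for vec in inside) / len(inside) for i in range(dim)]
```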

VI. MUSIC STYLE
The music style was obtained as a 50-dimensional vector using musicnn 4 [2], [3]. We used the "MTT_musicnn" model pre-trained on the MTT dataset [45]. For the audio in a 30-second time window, we estimated a 50-dimensional tag likelihood vector corresponding to the 50 MTT tags, including instrument-related tags such as guitar and drums, tempo-related tags such as slow and fast, and vocal-related tags such as male and female. We simply refer to this as the music style vector, even though the estimated tags are not always style-related tags and represent a variety of musical attributes of a song, which is desirable for our research purposes. Each vector was estimated from a 30-second time window, and the estimation was performed every 5 s (at five-second intervals). For example, a sequence of 19 music style vectors can be extracted from a 120-second song ((120 − 30)/5 + 1 = 19). Figure 9 (a) visualizes an actual music style vector sequence as a heatmap, where the horizontal axis indicates time and the vertical axis shows 10 tags of the 50 tags. Yellow indicates the highest value (i.e., 1). In this example, the music style changes along the time axis because of various interludes. Figure 9 (b) shows the averaged style vector over the same song and indicates that this music is sung by a male and has a techno mood with a fast tempo. Note that we do not use the averaged vector in Figure 9 (b) but rather the vector sequence in Figure 9 (a) in the later analysis.
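The window arithmetic above, (120 − 30)/5 + 1 = 19, can be checked with a few lines of code. The function name is hypothetical, and it is assumed that only windows that fit completely within the song are used.

```python
def window_starts(duration_s, window_s=30, hop_s=5):
    """Start times (in seconds) of all full windows that fit in the song."""
    if duration_s < window_s:
        return []
    count = int((duration_s - window_s) // hop_s) + 1
    return [i * hop_s for i in range(count)]
```

For a 120-second song this gives 19 windows, starting at 0 s, 5 s, and so on up to 90 s.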

A. TEN REPRESENTATIVE TYPES OF EACH STYLE MODALITY
As shown in Figure 2, we conducted a correlation analysis between the three style modalities of word motion, font style, and music style. Each feature vector of the three style modalities was extracted from a 30-second time window with a 5-second interval. Consequently, every 5 s, we obtained 70-, 6-, and 50-dimensional vectors for the word motion, font style, and music style, respectively. Note that time windows with no lyric words or only a few words were discarded from the analysis. This resulted in 3,494 feature vectors for each style modality from the 100 lyric videos. We used all of these vectors in the following clustering-based correlation analysis.
FIGURE 10. Ten word motion style types according to k-means clustering. The horizontal axis corresponds to the 70 word motions presented in Figure 7. The orange bars correspond to no motion (i.e., stay) with seven different durations. A brief description, such as "flash" (very short presence), is attached to each type.
In advance of the correlation analysis, standard k-means clustering 5 was performed to quantize the style vectors of each modality. As noted in Section I, cluster analysis is more promising for our task than orthodox multivariate analysis techniques, such as deep regression. Determination of the number of clusters (hyper-parameter k) relies on several criteria, such as the silhouette coefficient [48], the Calinski and Harabasz score [49], and the Davies-Bouldin index [50]. We examined these criteria but found no unanimous suggestion for the value of k. We therefore took the intermediate value of k = 10 for all of the style modalities. Consequently, we had k = 10 representative types (representative centroid vectors), as shown in Figures 10, 11, and 12 for the word motion, font style, and music style modalities, respectively.

B. CO-OCCURRENCE ANALYSIS BETWEEN THE STYLE MODALITIES
As noted above, for every 5 s, word motion, font style, and music style feature vectors were obtained via an analysis of the 30-second time window. Let those feature vectors be denoted by $w_{t,s} \in \mathbb{R}_{+}^{70}$, $f_{t,s} \in \mathbb{R}_{+}^{6}$, and $m_{t,s} \in \mathbb{R}_{+}^{50}$, respectively, where $t$ is the frame index and $s \in [1, 100]$ is the lyric video ID. We quantized these vectors into the nearest vector of the 10 representative vectors (types) in each modality. Consequently, we obtained $W_{t,s}$, $F_{t,s}$, and $M_{t,s}$, each of which represents the nearest vector index $\in [1, 10]$. Then, we obtained a 10 × 10 co-occurrence matrix for each pair of two modalities. For example, the co-occurrence matrix $C_{f,m}$ between the font style and the music style was created by adding 1 to the $(F_{t,s}, M_{t,s})$th element of the matrix for all $t$ and $s$. Figure 13 shows the co-occurrence matrices for all three pairs of style modalities. The matrices were pre-processed with biclustering (row-wise and column-wise reordering) such that blocks (sub-matrices) became more visible. Via careful observations of the matrices, the following trends in the lyric video designs were indicated.
There are also other interesting strong co-occurrences in Figure 13; for example, "Fancy" and "Historical Script" fonts are usually used for "rock" music. These trends found in our analysis could be useful for assisting in the design of lyric videos. Even though we highlighted strong co-occurrences in the above analysis, no or low co-occurrences might also provide useful information concerning the trends in lyric videos. However, we do not emphasize those low co-occurrences in this paper because they may be caused by insufficient lyric video data and a much larger dataset might prove the importance of such low co-occurrences in future research.
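The quantization and co-occurrence counting used in this section can be sketched as follows. This is a minimal illustration under stated assumptions: nearest-centroid assignment uses the Euclidean distance (as in standard k-means), indices are 0-based rather than in [1, 10], the function names are hypothetical, and the biclustering-based reordering is not included.

```python
import math

def nearest_index(vec, centroids):
    """Index of the centroid closest to vec in Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def cooccurrence_matrix(vecs_a, cents_a, vecs_b, cents_b):
    """Count co-occurring style types of two modalities.

    vecs_a[n] and vecs_b[n] must come from the same time window of the same video.
    """
    matrix = [[0] * len(cents_b) for _ in cents_a]
    for va, vb in zip(vecs_a, vecs_b):
        matrix[nearest_index(va, cents_a)][nearest_index(vb, cents_b)] += 1
    return matrix
```

With 10 centroids per modality and the 3,494 paired feature vectors, the result would be a 10 × 10 matrix whose entries sum to 3,494.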

VIII. CONCLUSIONS AND FUTURE WORK
In this paper, we tackled the novel task of analyzing lyric videos to understand the relationships between three style modalities: word motion, font style, and music style. To conduct this analysis, we developed an original lyric word tracking method, which is detailed in Appendix A, and an original font style estimator. Moreover, the clustering-based co-occurrence analysis of the style modalities from 100 lyric videos indicated several trends in the style combinations. That is, we were able to catch such trends in the videos in a totally objective and reproducible manner.
Because multi-modal analyses of lyric videos have not previously been explored and this paper is the first such attempt, there are tasks left as future work. First, the dataset can be expanded to make the analysis result more reliable. Second, using the discovered trends in the multi-modal style combinations, a recommendation system can be developed to decrease the difficulty of lyric video creation. For example, if a system can automatically suggest word motions and font styles for given music (i.e., audio), even a non-expert could easily create lyric videos. Third, the design of the video background images could be incorporated into the analysis. Even though the color and objects in the background images were not the focus of this paper, they are also an important modality and therefore cannot be ignored in the overall design analysis of lyric videos.

APPENDIX A LYRIC WORD DETECTION AND TRACKING BY USING LYRIC INFORMATION
In this section, we introduce the methodology [5] used to detect and track lyric words in a lyric video. The technical highlight of the methodology is the full use of the lyric information (i.e., the lyric word sequence of the song) to obtain accurate tracking results. Note that the conference paper [5] focused only on the word motion style and not on the font style or music style; therefore, no correlation analysis between the style modalities had previously been made.

A. LYRIC WORD CANDIDATE DETECTION
First, lyric word candidates are detected as bounding boxes using two pretrained state-of-the-art scene text detectors, PSENet [51] and CRAFT [52]. The detected bounding boxes are then fed into a state-of-the-art scene text recognizer, TPS-Resnet-BiLSTM-Attn, which was proposed in [53]. If two bounding boxes detected by the above detectors overlap by more than 50% and their recognition results are the same, they are regarded as duplicates, and one of the two boxes is removed before the later processing.
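The duplicate removal rule can be sketched as follows. The paper does not state how the 50% overlap is measured; this sketch assumes axis-aligned boxes (x_min, y_min, x_max, y_max) and measures the intersection area relative to the smaller box, which is only one possible reading. The function names and the choice of keeping the first box are also assumptions.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box (assumed measure)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (w * h) / min(area_a, area_b)

def remove_duplicates(detections, threshold=0.5):
    """detections: list of (box, recognized_text); keeps the first box of each duplicate pair."""
    kept = []
    for box, text in detections:
        is_duplicate = any(
            text == kept_text and overlap_ratio(box, kept_box) > threshold
            for kept_box, kept_text in kept
        )
        if not is_duplicate:
            kept.append((box, text))
    return kept
```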

B. LYRIC-FRAME MATCHING
As described in Section IV-A, after detection and recognition, lyric-frame matching is conducted to associate the lyric word sequence with the frame sequence of the given video.
The red matching path shown in Figure 4 (a) was determined by evaluating the distance D(k, t) between the kth word and the tth frame. A smaller value of D(k, t) means that the probability of the kth lyric word existing in the tth frame is high. More precisely, the distance D(k, t) is defined as $D(k, t) = \min_{b \in B_t} d(k, b)$, where $B_t$ is the set of bounding boxes detected in the tth frame and $d(k, b)$ is the edit distance between the kth lyric word and the word recognized in the bth bounding box of that frame. If the kth lyric word is perfectly detected in the tth frame, the distance is D(k, t) = 0.
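The distance D(k, t) can be illustrated with a short sketch. It assumes that d(k, b) is the standard Levenshtein edit distance with unit costs and, because the case of a frame without any detected word is not specified here, falls back to the length of the lyric word in that case (an assumption). The function names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between two strings (unit insert/delete/substitute costs)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]

def frame_distance(lyric_word, recognized_words):
    """D(k, t): smallest edit distance between the lyric word and any word in the frame."""
    if not recognized_words:
        return len(lyric_word)  # assumed fallback for frames with no detections
    return min(edit_distance(lyric_word, w) for w in recognized_words)
```

For example, a frame in which "EVER" is misrecognized as "EVEN" gives a distance of 1 rather than 0, so the frame can still be matched to the lyric word.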
Using the distances {D(k, t) | ∀k, ∀t} for DP, we can efficiently obtain the globally optimal lyric-frame matching, as shown by the red path in Figure 4 (a). In this algorithm, the so-called DTW algorithm, the DP recursion is calculated for each (k, t) from (k, t) = (1, 1) to (K, T) as follows:
where g(k, t) denotes the minimum accumulated distance from (1, 1) to (k, t). The parameter ∆ indicates the maximum number of frames skipped on the path. In the experiment, we set ∆ = 1,000. This means that a video with 24 fps is allowed to skip approximately 40 s. The computational complexity of the algorithm is O(∆T K).
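A DP of this kind can be sketched as follows. Because the recursion itself is not reproduced in this text, the sketch uses one plausible form as an assumption: g(k, t) = D(k, t) + min over t − ∆ ≤ t' < t of g(k − 1, t'), with the first word free to start at any frame and the last word free to end at any frame. The function name and these boundary conditions are illustrative and may differ from the authors' formulation; the cost is O(∆TK), in line with the complexity stated above.

```python
def lyric_frame_matching(D, max_skip):
    """Assign one frame to each lyric word, in strictly increasing frame order.

    D[k][t]:  distance between lyric word k and frame t (0-based indices).
    max_skip: maximum frame gap allowed between consecutive words (Delta >= 1).
    Returns the list of matched frame indices, one per lyric word.
    """
    if max_skip < 1:
        raise ValueError("max_skip must be at least 1")
    K, T = len(D), len(D[0])
    inf = float("inf")
    g = [[inf] * T for _ in range(K)]      # minimum accumulated distance
    back = [[-1] * T for _ in range(K)]    # back-pointers for path recovery
    g[0] = list(D[0])                      # assumed free start for the first word
    for k in range(1, K):
        for t in range(1, T):
            lo = max(0, t - max_skip)
            best = min(range(lo, t), key=lambda u: g[k - 1][u])
            if g[k - 1][best] < inf:
                g[k][t] = D[k][t] + g[k - 1][best]
                back[k][t] = best
    t = min(range(T), key=lambda u: g[K - 1][u])  # assumed free end
    if g[K - 1][t] == inf:
        raise ValueError("no feasible matching for the given max_skip")
    path = [t]
    for k in range(K - 1, 0, -1):
        t = back[k][t]
        path.append(t)
    return path[::-1]
```

In this sketch, `D` would be filled with the distances D(k, t) for every lyric word and frame of a video.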
Note that this lyric-frame matching process using lyric information is essential for lyric videos. For example, the word "the" appears many times in the lyric text; this means that the spatio-temporal location of a certain "the" is ambiguous. Therefore, the lyric-frame matching process needs to fully utilize the continuity of the lyric words, as well as that of the video frames, to determine the most reliable frame for each lyric word.

C. TRACKING OF INDIVIDUAL LYRIC WORDS
In the above lyric-frame matching step, the kth lyric word is matched only to the tth frame; however, this word may also appear in the frames around the tth frame. Therefore, we search for such frames around the tth frame, as shown in Figure 4 (b). This search evaluates not only a simple spatio-temporal similarity but also the word similarity with the kth word in each neighboring t′th frame. If both similarities are larger than a threshold in the t′th frame, we conclude that the same kth word is also found in the t′th frame.
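The neighborhood search can be illustrated with the following sketch. The two similarity measures are not defined in the text, so the code makes explicit assumptions: spatio-temporal similarity is the IoU with the box tracked in the adjacent frame, word similarity is the ratio returned by Python's `difflib.SequenceMatcher`, and both thresholds are set to 0.5. The function names and threshold values are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def word_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] (assumed measure, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def track_word(word: str, frames: List[List[Tuple[Box, str]]], t_matched: int,
               box_matched: Box, word_thr: float = 0.5,
               box_thr: float = 0.5) -> Dict[int, Box]:
    """Extend a matched word forward and backward in time.

    frames[t] lists the (box, recognized text) pairs detected in frame t.
    The search stops in each direction at the first frame without a candidate
    that passes both thresholds.
    """
    track = {t_matched: box_matched}
    for step in (1, -1):
        t, last_box = t_matched + step, box_matched
        while 0 <= t < len(frames):
            candidates = [
                (iou(last_box, box), box) for box, text in frames[t]
                if word_similarity(word, text) > word_thr and iou(last_box, box) > box_thr
            ]
            if not candidates:
                break
            last_box = max(candidates, key=lambda c: c[0])[1]
            track[t] = last_box
            t += step
    return track
```

Requiring both similarities keeps a nearby but different word, or the same word at a distant position, from being merged into the track.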
Finally, as shown in Figure 4 (c), we conduct an interpolation process as post-processing. If a lyric word is seriously misrecognized and/or occluded in a certain frame, we cannot track the word around the frame using the above simple searching process. If such a missed frame is found, polynomial interpolation is performed between the neighboring frames. The average running time of lyric word tracking per frame is approximately 440 ms.
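The interpolation step can be sketched as follows. The degree of the polynomial is not stated, so this example uses the simplest case, a degree-1 (linear) interpolation of the four box coordinates between the nearest tracked frames on both sides of a gap. The name `interpolate_track` is hypothetical, and any limit on the gap length is left out.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def interpolate_track(track: Dict[int, Box]) -> Dict[int, Box]:
    """Fill missing frames between consecutive tracked frames.

    `track` maps a frame index to the tracked box of one lyric word. Each
    coordinate of a missing frame is a linear blend of the boxes at the
    two surrounding tracked frames (assumption: degree-1 polynomial).
    """
    filled = dict(track)
    frames = sorted(track)
    for t0, t1 in zip(frames, frames[1:]):
        for t in range(t0 + 1, t1):
            w = (t - t0) / (t1 - t0)  # 0 at t0, 1 at t1
            filled[t] = tuple(
                (1 - w) * a + w * b for a, b in zip(track[t0], track[t1])
            )
    return filled
```

For example, a word tracked at frames 0 and 4 receives interpolated boxes at frames 1 to 3, while frames outside the tracked range are left untouched.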

D. QUALITATIVE EVALUATION OF LYRIC WORD DETECTION AND TRACKING
We applied the above method to all of the frames of the 100 collected lyric videos (approximately 547,100 frames in total) and obtained tracking results for all of the lyric words (approximately 33,800 words in total). Figure 15 shows several successful results of lyric word detection and tracking. In Figure 15 (a), we can see that a word with wavy motion is correctly tracked. As shown in Figure 15 (b), a word that is scaled or rotated from frame to frame is also correctly tracked.
Figure 16 shows how using lyric information in the lyric-frame matching process and the subsequent tracking process improves accuracy. Because these frames have a complex background (character-like patterns), unnecessary bounding boxes are found in the first word detection step; however, only the correct lyric words remain after lyric-frame matching and tracking.
Figure 17 shows typical failure cases. The failure in Figure 17 (a) is caused by severe distortion resulting from the complicated visual design of the video. The word "PREY" is always partially occluded and therefore never detected, even by the state-of-the-art word detector. The failure in Figure 17 (b) is caused by a refrain of the same phrase "I'M BROKEN" in the lyrics. In lyric videos, an important lyric word or phrase sometimes appears repeatedly (i.e., excessively) while changing its appearance, even though the lyric text contains it only one time.
Figure 17 panels: (a) Heavy distortion by partial occlusion ("PREY"). (b) Tracking error caused by multiple appearances of the same word within a short period ("BROKEN").
Table 1 shows the result of a quantitative evaluation of lyric word detection and tracking, using the 1,000 frames described in Section III as ground-truth data. If the bounding box of a lyric word obtained by the proposed method and the corresponding ground-truth bounding box have IoU > 0.5, the detected box is considered to be a successful result. The evaluation result for the lyric-frame matching step followed by the tracking step indicates that the precision is 90.98%. From this, we can see that the false positives are more successfully suppressed than in the case of only lyric-frame matching.
The introduction of the interpolation step increased the true positives as expected, although it also increased the false positives and slightly decreased the precision. The recall is approximately 71%. The main reasons for false positives are too many decorations and distortions in the word appearance, lyric-frame matching errors resulting from ambiguities in matching, and inconsistency between official lyric texts and actual song lyrics.
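The IoU-based success criterion and the resulting precision and recall can be sketched as below. The sketch assumes that each predicted or ground-truth box is identified by a (frame index, word identifier) key, so that a prediction is compared only with the ground-truth box of the same word in the same frame. This keying and the name `precision_recall` are illustrative assumptions; the toy values in the usage note are not results from the paper.

```python
from typing import Dict, Hashable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted: Dict[Hashable, Box],
                     ground_truth: Dict[Hashable, Box],
                     iou_thr: float = 0.5) -> Tuple[float, float]:
    """Count a prediction as a true positive when the same key exists in the
    ground truth and the two boxes have IoU > iou_thr."""
    true_pos = sum(
        1 for key, box in predicted.items()
        if key in ground_truth and iou(box, ground_truth[key]) > iou_thr
    )
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

With three predictions of which two pass the IoU test, and four ground-truth boxes, this function returns a precision of 2/3 and a recall of 1/2.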

APPENDIX B VIDEOS SHOWN IN THE FIGURES
The frames shown in the figures of this paper are taken from the following videos. For URLs, the common prefix "https://www.youtube.com/watch?v=" is omitted in the list. Note that the URL list of all 100 videos and their annotation data can be found at https://github.com/uchidalab/Lyric-Video.

MASATAKA GOTO received the Doctor of Engineering degree from Waseda University in 1998. He is currently a Prime Senior Researcher at the National Institute of Advanced Industrial Science and Technology (AIST), Japan. Over the past 29 years, he has published more than 300 papers in refereed journals and international conferences and has received 55 awards, including several best paper awards, best presentation awards, the Tenth Japan Academy Medal, and the Tenth JSPS PRIZE. He has served as a committee member of over 120 scientific societies and conferences, including the General Chair of ISMIR 2009 and 2014.
SEIICHI UCHIDA (Member, IEEE) received the B.E., M.E., and Dr. Eng. degrees from Kyushu University, in 1990, 1992, and 1999, respectively. He is currently a Distinguished Professor at Kyushu University. His research interests include image-informatics, especially document analysis and recognition (DAR). He received the 2007 ICDAR Best Paper Award and many other awards. He acted as a program chair at DAR-related conferences, such as ICDAR2021. He is an Associate Editor of Pattern Recognition.

VOLUME 10, 2022. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3184028