Content-Based Video Retrieval With Prototypes of Deep Features

The rapid development in the area of information and communication technologies has enabled the transfer of high-resolution, large-sized videos, and video applications have also evolved according to data quality levels. Content-based video retrieval (CBVR) is an essential video application because it can be applied to various domains, such as surveillance, education, sports, and medicine. In this paper, we propose a CBVR method based on prototypical category approximation (PCA-CBVR), which calculates prototypes of deep features for each category to predict the user’s query video category without a classifier. We also undertake fine searching to retrieve the video most similar to the user’s query video from the database videos of the predicted category. The proposed PCA-CBVR approach is efficient in terms of its computational cost and maintains meaningful information of the videos. It does not need to train a classifier even when the database is updated, and it uses all deep features without the dimension reduction step found in other CBVR studies. Moreover, we conduct fine-tuning of the 3D CNN feature extractor based on a few-shot learning approach for better domain adaptation ability and apply salient frame sampling instead of uniform frame sampling to improve the performance of the PCA-CBVR method. We demonstrate the performance capability of the proposed PCA-CBVR approach through experiments on various benchmark video datasets, in this case the UCF101, HMDB51, and ActivityNet datasets.


I. INTRODUCTION
In recent years, the rapid development of information and communication technologies has enabled faster and easier access to large volumes of data. In March of 2020, due to COVID-19, daily uploads and views of videos at home increased on YouTube by nearly 700% and 210%, respectively, compared to the corresponding levels before the pandemic [1]. By 2023, the numbers of internet users and 4K TV connections are predicted to be around 5.3 billion and 891 million, respectively [2], meaning that the transmitted video traffic and quality levels will also grow. Therefore, research focusing on video applications has been active [3]-[6].
Video contains more complicated information than a single image, combining motion, audio and text. Accordingly, video application research requires an integrated approach to consider the various types of information. Content-based video retrieval (CBVR) is one area of video application research. The aim of CBVR is to search a database for the videos that are most similar to a query video based only on the video contents, without any additional metadata. CBVR has been applied in several domains, such as crime prevention, by identifying suspects through CCTV videos [7], the indexing and retrieving of specific lecture videos for effective education [8], the retrieval of sports videos such as badminton [9], and the retrieval of surgical procedures similar to an ongoing procedure to ensure an efficient operation [10], among others.
The associate editor coordinating the review of this manuscript and approving it for publication was Hong-Mei Zhang.
In general, the CBVR process consists of three steps. The first step is frame sampling. To process the video efficiently and enhance the retrieval performance, it is necessary to sample meaningful frames, as many video frames do not assist with an understanding of the video. For example, there may be very similar frames in the same context over a certain duration. In this step, we obtain the meaningful key-frames [11], [12] or uniformly sampled frames [8], [13]. The second step is feature extraction. In this step, we extract a color feature histogram [14] of the sampled frames using traditional computer vision techniques or deep features [15]-[17] of the sampled frames using a convolutional neural network (CNN). The last step is the distance calculation step, which compares the features extracted from the database videos and the query video in terms of the Euclidean distance or the cosine similarity metric. Finally, the database video with the shortest distance relative to the query video is retrieved as the video most similar to the query video.
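The distance calculation step can be illustrated with a minimal sketch that compares a query feature with every database feature and returns the closest video. The feature values, the video ids, and the function names `euclidean` and `retrieve_nearest` are hypothetical and serve only as an illustration; real features would come from a color histogram or a CNN.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_nearest(query_feature, database):
    """Return the id of the database video closest to the query.

    `database` maps a video id to its feature vector (assumed layout).
    """
    return min(database, key=lambda vid: euclidean(query_feature, database[vid]))


# Toy 3-D features standing in for real extracted features.
database = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.1, 0.8, 0.1],
    "video_c": [0.0, 0.2, 0.9],
}
query = [0.2, 0.7, 0.1]
best = retrieve_nearest(query, database)
print(best)
```

This exhaustive comparison touches every database video, so its cost grows linearly with the database size.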
Recently, with the development of deep learning techniques, several deep learning approaches, especially 2D and 3D CNNs, have been widely applied as components of CBVR methods. 3D CNNs represent the spatial and temporal features of videos better than 2D CNNs [18]. However, 3D CNNs are less efficient than 2D CNNs because they require large amounts of video data and long processing times for 3D kernel optimization [19]. To overcome these problems, recent research has fine-tuned 3D CNNs pre-trained on large-scale video datasets, resulting in a better 3D CNN model that requires less effort than a model trained from scratch [20].
Although deep learning-based CBVR methods have been studied actively, several research problems remain. In this paper, we tackle four major challenges. The first challenge is information loss caused by dimension reduction. Many CBVR methods apply dimension reduction to compress the deep features and lower the computational cost. However, deep feature information is lost when the dimensions of these features are reduced. The second challenge is re-training the classifier when the database is updated. Many CBVR methods use classifiers trained on the database videos to predict the user's query video category. However, in the real world, new videos are generated constantly, so the database must be updated. Consequently, the classifiers must also be re-trained to fit the updated database, and this re-training process requires a lot of time and computing resources. The third challenge is adapting to novel domains with fewer training cycles, which is closely related to the second challenge. When the database is updated, novel categories may be added, not just new videos of existing categories. As mentioned above, re-training the classifiers requires a lot of time and resources; if a CBVR method can adapt to novel categories and domains without additional re-training, much of that time and those resources can be saved. The last challenge is frame sampling for effective video retrieval. Since video lengths vary and some videos are very long, CBVR methods need to sample frames from the videos. Thus, the performance of a CBVR method depends on the quality of its frame sampling, because sampled frames that are relevant to the video content carry better information about the video than meaningless frames.
VOLUME 10, 2022
To solve these four problems, we propose a CBVR method based on prototypical category approximation (PCA-CBVR), which calculates category prototypes of deep features to predict the user's query video category efficiently without using any dimension reduction algorithm or classifier. After predicting the user's query video category, the proposed PCA-CBVR uses a fine searching step to find, among the database videos in the predicted category, those most similar to the user's query video. Since PCA-CBVR does not reduce the dimension of the deep features and predicts the user's query video category based on a similarity measurement without classifiers, it does not suffer from information loss and does not need to re-train classifiers when the database is updated. Moreover, we apply fine-tuning with a few-shot learning approach to PCA-CBVR and verify through cross-domain evaluation that it increases the novel domain adaptation ability. Finally, we find that the PCA-CBVR performance depends on the frame sampling method: salient frame sampling shows better performance than simple uniform frame sampling. Figure 1 presents an overview of the proposed PCA-CBVR method, and Table 1 lists the abbreviations used in this paper.
This paper is organized as follows. Section II provides a brief introduction to research related to CBVR methods that use CNNs and to the few-shot learning method. In Section III, PCA-CBVR is proposed and its details are explained. Section IV analyzes the performance of the proposed PCA-CBVR with the uniform and salient frame sampling methods and assesses the domain adaptation ability based on several benchmark datasets. Finally, concluding remarks follow in Section V.

II. RELATED WORKS
In this section, we discuss the research related to the proposed PCA-CBVR. In Section II-A, CBVR research based on CNNs for retrieval performance improvements and the unique frame extraction strategy are explained. In Section II-B, few-shot learning methods are described, as we fine-tune the CNNs using the few-shot learning approach for better video retrieval performance on cross-domain video contents. Table 2 summarizes the characteristics of the related research and how it differs from the proposed PCA-CBVR.

A. USE OF CNNs FOR CBVR
CNNs are widely used in computer vision tasks, especially with large-scale image datasets [23] and video datasets [24], [25]. Given the performance improvements of CNNs on computer vision tasks, they have also been applied in CBVR research.
The first approach is to use 2D CNNs to extract the deep features of videos for CBVR. Yu et al. [15] extracted deep features using 2D CNNs pre-trained on ImageNet [23] and quantized the extracted deep features for computational efficiency. They showed that deep features were more efficient for CBVR than other low-level features. Yu et al. [26] also modified the architecture of 2D CNNs and utilized more data to improve the performance of existing 2D CNNs. Furthermore, they proposed a new feature fusion method to improve both the performance and robustness. Suo et al. [27] modified the SimHash [28] algorithm to reduce the dimensions of deep features and increase the retrieval efficiency. Moreover, they calculated the distance between two deep features of the frames for precise retrieval. Anuranji and Srimathi [16] proposed stacked heterogeneous multi-kernel 2D CNNs to capture complex deep features and used a bidirectional LSTM to learn the temporal information, with the LSTM output then passed to a fully connected layer to obtain the binary hash code. Abed et al. [12] proposed a new key-frame extraction method based on 2D CNNs and showed that it was efficient for CBVR by integrating it into a CBVR system.
The second approach is to use 3D CNNs for CBVR. The aforementioned studies applied 2D CNNs to video data; however, a video is a sequence of frames that contains temporal information. Thus, recent research has applied 3D CNNs, which represent video data including temporal information more efficiently. Ullah et al. [17] used a C3D model [18] pre-trained on the Sports-1M dataset [24] and reduced the dimensionality by means of principal component analysis to generate the hash code. However, according to work by Hara et al. [19], Kataoka et al. [20], and Tran et al. [29], working with the Sports-1M dataset is not easy because it contains more videos than the Kinetics 400 and Kinetics 700 datasets [30], [31], and its annotations are noisier than those of the Kinetics datasets. They also demonstrated that 3D ResNets (R3D) [32], [33] and R(2+1)D [29] models could outperform C3D and that a model pre-trained on the Kinetics 700 dataset was better than one pre-trained on the Kinetics 400 dataset. Therefore, we utilize the R3D and R(2+1)D models pre-trained on the Kinetics 700 dataset to extract the deep features of videos efficiently.
The CBVR research described above attempted to compress the deep features to lower the computational cost. However, deep feature information is lost when the dimensions of these features are reduced. Therefore, in this paper we use the category prototypes proposed by Snell et al. [34], which are the mean values of the deep features of videos in the same category, without any dimension reduction algorithm. These category prototypes not only reduce the computation cost but also make it possible to classify the query video category without classifiers. More details of PCA-CBVR are explained in Section III.

B. CROSS-DOMAIN GENERALIZATION WITH FEW-SHOT LEARNING
To train and optimize CNNs successfully, a large and well-labeled training dataset is essential. However, a large dataset labeled by humans is difficult to obtain, and when the training dataset is poorly or insufficiently labeled, the CNNs overfit the dataset. To overcome this problem, few-shot learning methods, which learn from only a few labeled examples and generalize to novel classes, have been proposed. Vinyals et al. [35] proposed a matching network that utilizes a memory and attention mechanism for rapid learning. They also provided the mini-ImageNet dataset as a few-shot learning benchmark. Snell et al. [34] proposed a prototypical network that generates prototypes by calculating the mean of the deep features for each class in the support set and classifies the query data by calculating the Euclidean distance between the prototypes and the query data. Sung et al. [36] proposed a relation network that applies a trainable distance calculation model for further generalization instead of a fixed distance calculation such as the Euclidean distance. Recently, Chen et al. [21] and Tseng et al. [22] also showed that few-shot learning methods worked well for cross-domain generalization.
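The prototype idea of the prototypical network can be made concrete with a minimal sketch: each class prototype is the mean of that class's support embeddings, and a query is scored by a softmax over the negative squared Euclidean distances to the prototypes. The 2-D embeddings, class names, and function names below are hypothetical placeholders and not values from any experiment.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prototypical_probabilities(support, query):
    """Class probabilities for `query` given a support set.

    `support` maps a class name to a list of embedding vectors.
    """
    prototypes = {c: mean_vector(vs) for c, vs in support.items()}
    logits = {c: -squared_euclidean(query, p) for c, p in prototypes.items()}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy 2-way 2-shot support set with 2-D embeddings.
support = {
    "run": [[1.0, 0.0], [0.8, 0.2]],
    "jump": [[0.0, 1.0], [0.2, 0.8]],
}
probs = prototypical_probabilities(support, [0.7, 0.1])
predicted = max(probs, key=probs.get)
print(predicted, probs)
```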
The proposed PCA-CBVR method is motivated by the prototypes of Snell et al. [34]. We consider the database videos as a support set in the few-shot learning approach to classify the query video without additional classifiers. Cross-domain generalization is also important for CBVR in actual applications because users do not necessarily send query videos from the same domain as the database videos. Therefore, we fine-tune the 3D CNN models on the UCF101 [37] dataset with the few-shot learning approach and evaluate them on the ActivityNet [39] dataset (UCF101 → ActivityNet) in an effort to improve the cross-domain generalization ability of PCA-CBVR. More details pertaining to the cross-domain generalization ability of PCA-CBVR are given in Section IV-D.

III. PCA-CBVR
In this section, we explain the proposed PCA-CBVR method in more detail. The proposed PCA-CBVR method consists of the following steps, as shown in Figure 1.
• An offline process that calculates and saves the category prototypes of database videos
  1) Uniformly sample 16 frames from each database video and resize them to 112 × 112.
  2) Extract the deep features from the sampled frames using pre-trained 3D CNNs without the last fully connected layer.
  3) Calculate the category prototypes from the extracted deep features of the database videos in each category.
  4) Save the category prototypes into a meta-database.
• An online process that retrieves the database video most similar to the user's query video
  1) Uniformly sample 16 frames from the query video and resize them to 112 × 112.
  2) Extract the deep features from the sampled frames using pre-trained 3D CNNs without the last fully connected layer.
  3) Predict the query video category based on the category prototypes in the meta-database, without classifiers.
  4) Finely search the database videos in the predicted category for those most similar to the user's query video.
FIGURE 2 (partial caption). (b) In this example, the numbers of clusters for each category, $K_{red}$, $K_{orange}$, and $K_{yellow}$, are set to 2, which is in the range of (1, $|V_c|$). We refer to this as a semi-prototype because there are several deep features for one category. (c) Prototype feature matching to predict the query video category by comparing the deep feature of the query video with the deep features of the prototypes in the database. The deep features of the prototypes are calculated according to the centroids of clusters, and the numbers of clusters for each category, $K_{red}$, $K_{orange}$, and $K_{yellow}$, are set to 1. We refer to these as prototypes because there is one deep feature for one category.
The proposed PCA-CBVR has two main parts. The first part is the category classification of the user's query video, which is done by measuring the similarity between the category prototypes and the deep features of the query video. The second part is a fine search within the query video category predicted in the first part to obtain the videos most similar to the query video. The following subsections explain each part in detail; the mathematical notation is organized in Table 3.

A. USERS' QUERY VIDEO CATEGORY PREDICTIONS WITH PROTOTYPES
Classifier-based methods must re-train the classifier whenever the database is updated. For example, if novel categories of videos are added to the database, the previously trained classifier cannot recognize the added categories. Therefore, in this paper, we use a prototypical category approximation technique built on the prototypes of Snell et al. [34] to classify the query video without a classifier.
A straightforward way to predict the query video category is to compare the query video with every video in the database, as shown in Figure 2(a). However, this approach requires a considerable computation cost, a long processing time, and much memory. Instead, we utilize the K-means clustering algorithm to reduce the number of deep feature matching points, as shown in Figures 2(b) and (c). For clarification, we redefine the K-means clustering terms as shown in Table 4.
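The reduction of matching points can be sketched as follows: the deep features of each category are clustered with K-means, and only the K centroids per category are kept for matching. This is a minimal sketch using plain Lloyd iterations with a deterministic start (the first k features), which is an assumption made for reproducibility; the toy 2-D features and the function names are likewise hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, iterations=20):
    """Lloyd's algorithm; starts from the first k features (assumed init)."""
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    centroids = [list(f) for f in features[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda i: squared_euclidean(f, centroids[i]))
            clusters[nearest].append(f)
        # Keep the previous centroid if a cluster received no features.
        updated = [mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def build_matching_points(features_by_category, k):
    """Replace each category's features by its k cluster centroids."""
    return {c: kmeans(fs, k) for c, fs in features_by_category.items()}


# Toy category with two visually distinct groups of videos.
features_by_category = {
    "red": [[0.0, 0.0], [4.0, 4.0], [0.0, 1.0], [4.0, 5.0]],
}
semi_prototypes = build_matching_points(features_by_category, k=2)
prototypes = build_matching_points(features_by_category, k=1)
print(semi_prototypes, prototypes)
```

With k set to 1, the single centroid is simply the mean of the category's features, so no clustering is actually required in that case.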
Equation (1) is a generalized form for predicting the query video category without a classifier:

$$c_q = \underset{c}{\arg\max}\ \cos\left(df_q, df_c\right) = \underset{c}{\arg\max}\ \frac{df_q \cdot df_c}{\lVert df_q \rVert \, \lVert df_c \rVert} \tag{1}$$

where $c_q$ is the query video category that we want to predict, and $df_q$ and $df_c$ denote the deep features of the query video and the cluster centroids of the database videos, respectively. Equation (1) compares the deep features of the query video with the cluster centroids in the database based on the cosine similarity measurement and then predicts the query video category as the category of the cluster centroid with the greatest cosine similarity. In contrast to Snell et al. [34], we employ cosine similarity in Equation (1) to increase the performance. The performance comparison between the Euclidean distance and cosine similarity is presented in Section IV-G and Table 8. If the number of clusters in each category is set to 1, the method is the proposed PCA-CBVR, and there is no need to apply the K-means clustering algorithm because the deep feature of each category, $df_c$, is calculated as follows:

$$df_c = \frac{1}{|V_c|} \sum_{v \in V_c} df_v \tag{2}$$

In this equation, $df_v$ represents the deep feature of $v$, which is a video from the set $V_c = \{v_1, \ldots, v_{|V_c|}\}$ in category $c$, and $|V_c|$ is the number of videos that correspond to category $c$.
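The category prediction described in this subsection can be sketched in a few lines: a prototype is the mean deep feature of a category, and the predicted category is the one whose prototype has the greatest cosine similarity with the query feature. The 3-D features, category names, and function names are hypothetical stand-ins for real 3D CNN features, and the sketch assumes that no feature vector is all zeros.

```python
import math


def category_prototype(features):
    """Mean of the deep features of the videos in one category."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_category(query_feature, prototypes):
    """Category whose prototype is most cosine-similar to the query."""
    return max(prototypes, key=lambda c: cosine_similarity(query_feature, prototypes[c]))


# Offline: build the meta-database of prototypes from toy database features.
database_features = {
    "basketball": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "shaving": [[0.0, 0.2, 0.8], [0.0, 0.4, 0.6]],
}
prototypes = {c: category_prototype(fs) for c, fs in database_features.items()}

# Online: predict the category of a query feature without a classifier.
predicted = predict_category([0.6, 0.3, 0.1], prototypes)
print(predicted)
```

Adding a new category only requires computing one more mean vector, which is why no classifier re-training is involved.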

B. FINE SEARCHING ON THE SELECTED EMBEDDING SPACE
In Section III-A, we predict a query video's category efficiently using the aforementioned prototypes. After predicting the query video's category, it then becomes necessary to retrieve the video most similar to the user's query video among the database videos in the predicted category. Therefore, the second step of PCA-CBVR is a fine search, which proceeds as follows:

$$v_r = \underset{v \in V_{c_q}}{\operatorname{argtopK}}\ \cos\left(df_q, df_{c_q}\right) \tag{3}$$

where $v_r$ is the retrieved video, argtopK returns the arguments with the top K ranks, and $df_q$ and $df_{c_q}$ are the deep features of the query video and of the database videos in the predicted query video's category, respectively. Equation (3) compares the deep features of the query video and the database videos in the predicted category based on the cosine similarity measurement and retrieves the videos most similar to the query video. Figure 3 explains the overall process of PCA-CBVR, including the category prediction of the user's query video and the fine search process to retrieve the most similar video.
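The fine search can be sketched as a ranking of the videos in the predicted category by cosine similarity with the query feature, keeping the K best. The video ids, the 3-D features, and the function name `fine_search` are hypothetical and only illustrate the ranking step.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fine_search(query_feature, category_videos, k):
    """Ids of the k videos most similar to the query, best first.

    `category_videos` maps a video id to its deep feature and contains
    only the videos of the predicted category.
    """
    ranked = sorted(
        category_videos,
        key=lambda vid: cosine_similarity(query_feature, category_videos[vid]),
        reverse=True,
    )
    return ranked[:k]


# Toy features for the videos of one predicted category.
mixing_videos = {
    "mix_01": [0.9, 0.1, 0.1],
    "mix_02": [0.1, 0.9, 0.2],
    "mix_03": [0.8, 0.3, 0.1],
    "mix_04": [0.2, 0.2, 0.9],
}
top2 = fine_search([1.0, 0.2, 0.1], mixing_videos, k=2)
print(top2)
```

Because the search is restricted to one category, its cost depends on the size of that category and not on the size of the whole database.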

IV. EXPERIMENTS AND RESULTS
In this section, we verify the performance of the proposed PCA-CBVR in different situations. First, we evaluated the proposed PCA-CBVR with different combinations of 3D CNNs and depths to determine the best feature extractor. We then verified that the proposed PCA-CBVR with the selected 3D CNN feature extractor outperforms other video retrieval methods on certain datasets. Second, we conducted experiments to demonstrate the domain adaptation ability of the proposed PCA-CBVR by fine-tuning based on the few-shot learning approach. Third, we showed the benefit of fine searching compared to random selection for returning the best recommendation results. Fourth, we applied different frame sampling approaches to the proposed PCA-CBVR, specifically uniform frame sampling and salient frame sampling based on the prototypes, and showed that the salient frame sampling approach outperforms the uniform sampling approach. Fifth, we showed that cosine similarity boosts PCA-CBVR performance compared to the Euclidean distance. Finally, we analyzed the video retrieval time to discuss the computational complexity of PCA-CBVR. Figure 4 summarizes the 3D CNN architectures applied in the experiments, along with the number of layers and the input size used in this paper. In the experiments, each category was treated as a single cluster, meaning that all videos in the same category share one category prototype, which is the mean vector of the video features of that category. Moreover, the code used in this paper is available on GitHub. Details of the datasets, the performance evaluation metrics, and the experimental results are explained in the following subsections.

A. DATASETS
We used the UCF101 [37], HMDB51 [38], and ActivityNet [39] datasets, which are representative video datasets widely used in human activity analyses. The UCF101 dataset is a trimmed dataset with 13,320 YouTube videos in 101 action categories, and the numbers of videos for training and testing are 9,537 and 3,783, respectively. The HMDB51 dataset is also a trimmed dataset, with 6,766 videos from YouTube, movies, and the web in 51 action categories, and the numbers of videos for training, validation and testing are 3,570, 1,666 and 1,530, respectively. On the other hand, ActivityNet is an untrimmed dataset with 19,994 web videos in 200 action categories, and the numbers of videos for training, validation and testing are 10,024, 4,926 and 5,044, respectively. Together, these three datasets allow us to confirm the retrieval performance on trimmed and untrimmed videos separately, as well as the cross-domain adaptation ability from trimmed to untrimmed videos (trimmed → untrimmed). We consider the training data and test data in the UCF101 and HMDB51 datasets as the database videos and the query videos in the video retrieval task, respectively. The ActivityNet dataset does not provide true labels for its test data; thus, we consider the validation data as query videos in the video retrieval task. For video data preprocessing, we applied uniform frame sampling to sample 16 frames from each video and resized the sampled frames to 112 × 112.
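Uniform frame sampling can be sketched as choosing 16 evenly spaced frame indices per video. The exact index rule is not stated in the text, so the rule below (taking the centre of each of 16 equal-length segments, and repeating frames for videos shorter than 16 frames) is an assumption, as is the function name `uniform_frame_indices`.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Pick `num_samples` frame indices spread evenly over a video.

    Assumed rule: split the video into `num_samples` equal segments and
    take the frame at the centre of each segment. Videos with fewer
    frames than samples yield repeated indices.
    """
    if num_frames <= 0:
        raise ValueError("video has no frames")
    step = num_frames / num_samples
    return [min(num_frames - 1, int(step * i + step / 2)) for i in range(num_samples)]


indices = uniform_frame_indices(160)
short_indices = uniform_frame_indices(8)
print(indices)
print(short_indices)
```

The selected frames would then be decoded and resized to 112 × 112 before being passed to the feature extractor.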

B. EVALUATION METRICS
To evaluate the video retrieval performance, we used the top1 and top5 accuracy and the mAP for information retrieval as evaluation metrics. The accuracy metric is used here because the PCA-CBVR performance depends on the query video category prediction ability. The mAP for information retrieval is different from that for classification. We used Equation (4) to measure the mAP in the information retrieval context:

$$\mathrm{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q) \tag{4}$$

where AP is the average precision function and $Q$ is the number of queries. AP is defined as follows:

$$\mathrm{AP} = \frac{\sum_{k=1}^{n} \left(P(k) \times rel(k)\right)}{\text{number of relevant videos}} \tag{5}$$

where $P$ is the precision function that returns the precision at cut-off $k$, $rel$ is a masking function that returns 1 if the video at rank $k$ is relevant and 0 otherwise, and $n$ is the number of retrieved videos. In the PCA-CBVR method, the AP result would be 1 or 0; therefore, the mAP of PCA-CBVR is identical to the top1 accuracy.
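The two metrics can be sketched directly from their definitions. In this minimal sketch, each query is represented by a list of 0/1 relevance flags for its ranked results together with the number of relevant videos; this input layout and the function names are assumptions made for illustration.

```python
def average_precision(relevance, num_relevant):
    """AP for one query.

    `relevance` holds 1 if the result at that rank is relevant, else 0.
    `num_relevant` is the number of relevant videos for the query.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at cut-off k, counted only at relevant ranks
    return total / num_relevant


def mean_average_precision(runs):
    """mAP over queries; `runs` is a list of (relevance, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


ap = average_precision([1, 0, 1, 0], num_relevant=2)

# With a single retrieved result per query, each AP is 0 or 1, so the
# mAP equals the fraction of correct queries (the top1 accuracy).
map_value = mean_average_precision([([1], 1), ([0], 1), ([1], 1), ([1], 1)])
print(ap, map_value)
```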

C. PERFORMANCE ANALYSIS DEPENDING ON FEATURE EXTRACTORS USING DIFFERENT COMBINATIONS OF 3D CNNs AND DEPTHS
To select the best 3D CNN feature extractor, we evaluated the PCA-CBVR performance with different feature extractors, namely R3D and R(2+1)D models of different depths pre-trained on the Kinetics 700 dataset. Figure 5 shows the PCA-CBVR results with the different combinations of feature extractors. As shown in Figure 5, the R3D model outperformed the R(2+1)D model as a feature extractor, and the deeper networks showed better performance as well. The best performance overall was achieved when we applied R3D50 (the R3D model with 50 layers) as a feature extractor. The overall performance of PCA-CBVR on the ActivityNet dataset was lower than on the UCF101 and HMDB51 datasets because ActivityNet is an untrimmed dataset, whereas UCF101 and HMDB51 are trimmed datasets. In other words, the ActivityNet dataset contains many more noisy frames that are not closely related to the video context than the UCF101 and HMDB51 datasets. Table 5 shows the mAP results of the proposed PCA-CBVR with the R3D50 feature extractor, which showed the best performance in Figure 5, and of other video retrieval methods on different datasets. As shown in Table 5, the PCA-CBVR without any fine-tuning showed a poorer mAP outcome on the trimmed datasets compared to Ullah et al. [17] and a better mAP outcome on the untrimmed dataset compared to Anuranji and Srimathi [16].
TABLE 6. The PCA-CBVR results, which are mAP (top5 accuracy), from the cross-domain (trained on UCF101 and evaluated on ActivityNet) evaluation depending on the number of fine-tuned residual blocks with specific models and learning algorithms. The red and blue values indicate the best mAP and top5 accuracy for each model, respectively.
In general, users send not only trimmed videos but also untrimmed videos as query videos; thus, good performance on an untrimmed dataset such as ActivityNet is an important point in video retrieval tasks, and the proposed PCA-CBVR showed better performance on an untrimmed dataset. Moreover, when we fine-tuned R3D50 and R(2+1)D50 on the UCF101 dataset and applied the fine-tuned feature extractor to the proposed PCA-CBVR, the corresponding mAP results on the UCF101 dataset were 0.86 and 0.89, respectively, which are higher than those in Ullah et al. [17]. Thus, if we fine-tune the 3D CNN feature extractor on a particular dataset and apply it to a video retrieval task on a dataset in the same domain, the retrieval performance increases. However, the user's query does not always involve the same domain as the database videos. Accordingly, evaluation results from the same domain are less meaningful than the video retrieval performance on videos from a different domain. To solve the cross-domain problem, we conducted more experiments with fine-tuning based on the few-shot learning approach. These results are discussed in Section IV-D.
Another possible reason why the proposed PCA-CBVR without fine-tuning showed poorer mAP results on trimmed datasets compared to Ullah et al. [17] is that Ullah et al. used a novel deep feature selection mechanism to choose the valuable features or frames. On the other hand, we applied simple uniform frame sampling in this experiment. Thus, we conducted more experiments to increase the performance of PCA-CBVR by applying the salient frame sampling approach instead of uniform frame sampling. These results are discussed in Section IV-F. Figure 6 shows examples of the success and failure of PCA-CBVR on the UCF101 dataset with ten different query videos. The PCA-CBVR approach successfully retrieved videos similar to the user's query video for 80% of these queries. As shown in Figure 6, the failures, i.e., the Haircut and Tennis Swing query videos, occurred when the query video and the retrieved videos have similar backgrounds or activity levels, as the proposed PCA-CBVR classifies the user's query video category based only on the generalized deep features of the videos. These outcomes indicate that the proposed PCA-CBVR is strongly influenced by the salient frames and by its video representation ability. Therefore, we applied the salient frame sampling approach and discuss the results in Section IV-F, as mentioned earlier.

D. CROSS-DOMAIN EVALUATION FOR THE PCA-CBVR DOMAIN ADAPTATION ABILITY
The domain of the user's query video is not always identical to that of the database videos. Therefore, we need to solve the cross-domain problem in the video retrieval task by increasing the domain adaptation ability of CBVR methods. In the proposed PCA-CBVR method, we fine-tuned a few 3D CNN feature extractors based on the few-shot learning approach, as fine-tuning based on few-shot learning is more appropriate for resolving the cross-domain problem than fine-tuning based on categorical learning. We conducted fine-tuning based on categorical learning and few-shot learning on all models for up to 100 epochs and 3000 episodes, respectively. We also used the stochastic gradient descent (SGD) optimizer and the cross-entropy loss function in the training phase; the learning rate, momentum, and weight decay were 1e-3, 0.9, and 1e-3 for categorical learning and 1e-4, 0.9, and 1e-3 for few-shot learning, respectively.
FIGURE 7. Video retrieval results of the fine search and random selection. Three example queries (Mixing, Shaving, Writing) and five retrieval results from the UCF101 dataset are shown. The blue boxes (left side) are the user's query videos, the green boxes are the fine search retrieval results, and the red boxes are the random selection retrieval results. Fine searching captures more semantic information than the random selection method.
To simulate the cross-domain problem in the video retrieval task, we treated an untrimmed dataset as the user's query videos and a small trimmed dataset as the database, meaning that the user's query videos come from a different domain than the database videos and that the database videos are not sufficient to train the model. To investigate the video retrieval performance in the cross-domain problem, we fine-tuned the model on the UCF101 dataset, which served as the small database, and evaluated it on the ActivityNet dataset, which served as the untrimmed user query videos. We used a batch size of 64 for the categorical learning algorithm and 5-way 1-shot, 5-way 5-shot, and 5-way 10-shot scenarios for the few-shot learning algorithm. For few-shot learning, we used the prototypical few-shot learning algorithm proposed by Snell et al. [34]. Although the categorical and few-shot training strategies differ, we still imposed certain constraints on the few-shot learning strategy. In summary, fine-tuning based on categorical learning used 9,537 videos in 101 categories, and fine-tuning based on few-shot learning used 9,283 videos in 71 categories for training. The videos in the remaining 30 categories were never used in fine-tuning based on few-shot learning. Table 6 shows the PCA-CBVR results (mAP and top5 accuracy) on the cross-domain task, in which the models were trained on the UCF101 dataset and evaluated on a different dataset, in this case the ActivityNet dataset. We fine-tuned each model based on the categorical learning and few-shot learning approaches to show that the few-shot learning approach performs better on the domain adaptation task. We trained different numbers of residual blocks of R3D and R(2+1)D from the bottom of the model to verify how many blocks must be fine-tuned for the best domain adaptation ability.
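An N-way K-shot scenario can be illustrated by the way a single training episode is drawn: N categories are sampled, and for each of them K support videos plus a few query videos are picked. This is a minimal sketch; the number of query videos per category (`n_query`), the toy video index, and the function name `sample_episode` are assumptions and are not taken from the experimental setup.

```python
import random


def sample_episode(videos_by_category, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode as (support, query) dictionaries.

    Each dictionary maps a sampled category to a list of video ids.
    Support and query videos of a category never overlap.
    """
    categories = rng.sample(sorted(videos_by_category), n_way)
    support, query = {}, {}
    for c in categories:
        picked = rng.sample(videos_by_category[c], k_shot + n_query)
        support[c] = picked[:k_shot]
        query[c] = picked[k_shot:]
    return support, query


# Hypothetical toy index: 8 categories with 12 video ids each.
videos_by_category = {
    f"cat_{i}": [f"cat_{i}_vid_{j}" for j in range(12)] for i in range(8)
}
rng = random.Random(0)  # fixed seed so that the episode is reproducible
support, query = sample_episode(videos_by_category, n_way=5, k_shot=1, n_query=3, rng=rng)
print(sorted(support))
```

In a prototypical training loop, the support videos of each episode would provide the prototypes and the query videos would provide the cross-entropy loss.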
As shown in Table 6, the few-shot learning approach showed better performance than the categorical learning approach, and when we fine-tuned more blocks, the performance increased. In this cross-domain experiment, the overall best performance was achieved when we applied the R(2+1)D50 model with fine-tuning of four blocks with the few-shot learning approach using the 5-way 1-shot scenario.

E. RANDOM SELECTION VS. FINE SEARCHING
After the proposed PCA-CBVR predicts the user's query video category, one step remains in the video retrieval task: returning the video most similar to the user's query video from the database videos of the predicted category. There are two possible ways to do this: random selection and fine searching. Random selection retrieves random videos from the database videos of the predicted category; however, even if the videos belong to the same category, their detailed contexts can differ. For example, videos in the basketball category are taken in different places, e.g., a street, an indoor court, or an outdoor court. Thus, random selection from the predicted category is not suitable for retrieving the video most similar to the user's query video.
TABLE 7. The PCA-CBVR performance comparison results between uniform and salient frame sampling, reported as mAP (top5 accuracy). The red and blue values indicate the best mAP and top5 accuracy outcomes for each model, respectively. For this comparison, we applied 3D CNN models that were pre-trained on Kinetics-700 without any fine-tuning.
TABLE 8. The PCA-CBVR performance comparison results between the Euclidean distance and cosine similarity metrics, reported as mAP (top5 accuracy). The red and blue values indicate the best mAP and top5 accuracy outcomes for each model, respectively. For this comparison, we applied 3D CNN models that were pre-trained on Kinetics-700 without any fine-tuning.
To retrieve the video most similar to the user's query video more accurately, we apply a fine searching step after category prediction by PCA-CBVR; the fine search is also based on the deep features calculated in PCA-CBVR. Figure 7 shows typical results of video retrieval based on random selection and fine searching. As shown in Figure 7, fine searching retrieved videos more similar to the user's query video than random selection did because it considers the detailed semantic information. For example, when the user's query video category was predicted as ''Mixing,'' the fine searching approach retrieved videos in the same context, including those with the same mixing ingredients, mixing bowl, and whisk. In contrast, the random selection approach retrieved videos from the same category that differed in their detailed context, such as different mixing ingredients and different cooking tools. Another example is the ''Shaving'' video. The fine searching approach retrieved videos from the same context, showing a man shaving with shaving cream, whereas the videos retrieved by random selection came from a different context, in this case showing a man using an electric razor. The last example is the ''Writing'' video. Here, the fine searching approach retrieved videos in the same context, in which a person writes on a whiteboard, whereas the videos retrieved by random selection included those in a different context, in which a person writes on a blackboard.
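The two steps described above, category prediction with prototypes followed by fine searching inside the predicted category, can be sketched as follows. This is a minimal illustration under stated assumptions and is not the authors' implementation: the function names, the toy three-dimensional features, and the video identifiers are hypothetical, cosine similarity is assumed as the ranking score for both steps, and feature vectors are assumed to be non-zero. Plain Python lists keep the sketch free of dependencies.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prototypes(database):
    """database: dict category -> dict video_id -> deep feature vector."""
    return {cat: mean_vector(list(videos.values()))
            for cat, videos in database.items()}

def predict_category(query, prototypes):
    """Step 1: pick the category whose prototype is most similar to the query."""
    return max(prototypes, key=lambda cat: cosine_similarity(query, prototypes[cat]))

def fine_search(query, videos, top_k=1):
    """Step 2: rank the videos of one category by similarity to the query."""
    ranked = sorted(videos,
                    key=lambda vid: cosine_similarity(query, videos[vid]),
                    reverse=True)
    return ranked[:top_k]

def retrieve(query, database, top_k=1):
    """Run both steps and return (predicted category, ranked video ids)."""
    prototypes = build_prototypes(database)
    category = predict_category(query, prototypes)
    return category, fine_search(query, database[category], top_k)

# Hypothetical toy database with two categories and made-up features.
toy_database = {
    "Mixing": {"mix_whisk": [1.0, 0.1, 0.0], "mix_spoon": [0.6, 0.8, 0.0]},
    "Shaving": {"shave_cream": [0.0, 0.2, 1.0], "shave_electric": [0.1, 0.9, 0.8]},
}
toy_query = [0.9, 0.2, 0.0]
result = retrieve(toy_query, toy_database, top_k=1)
```

Because only one prototype per category is compared in the first step, adding videos to the database requires recomputing a mean rather than retraining a classifier, and the second step searches only the videos of the predicted category.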

F. UNIFORM FRAME SAMPLING VS. SALIENT FRAME SAMPLING
The proposed PCA-CBVR method utilizes prototypes to predict the user's query video category, and each prototype is a generalized representation obtained by taking the mean of the videos' deep features. Thus, if the deep features contain many outliers, the ability of the prototypes to represent their categories is degraded. The frame sampling method can mitigate this problem.
To determine the effect of the frame sampling method, we conducted experiments comparing uniform frame sampling with salient frame sampling [40]. The salient frame sampling method proposed by Yoon et al. [40] eliminates meaningless and outlier frames from a video by using the mean of the deep features of all frames in the video. Table 7 shows the mAP and top5 accuracy results when applying uniform frame sampling and salient frame sampling to the proposed PCA-CBVR. As shown in Table 7, the salient frame sampling method [40] yields better results than uniform frame sampling in most cases.
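The idea of discarding outlier frames by comparing them with the mean frame feature can be sketched as below. The exact selection rule of [40] is not given in this section, so the rule used here, keeping the frames whose features have the highest cosine similarity to the mean feature, is an assumption, as are the function names and the toy features.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_salient_frames(frame_features, keep):
    """Return the indices of the `keep` frames closest to the mean feature.

    Assumed rule (not taken from [40]): frames are scored by cosine
    similarity to the mean of all frame features, and the lowest-scoring
    frames are treated as outliers. Indices are returned in temporal order.
    """
    center = mean_vector(frame_features)
    scores = [cosine_similarity(f, center) for f in frame_features]
    ranked = sorted(range(len(frame_features)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical per-frame features; the third frame points in a different direction.
toy_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
kept = select_salient_frames(toy_frames, keep=3)
```

Removing such frames before averaging keeps the video-level feature, and therefore the category prototype, from being pulled toward content that is unrepresentative of the video.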

G. EUCLIDEAN DISTANCE VS. COSINE SIMILARITY
To calculate the ranking score, we can consider two simple metrics: Euclidean distance and cosine similarity. To decide which metric is better, we conducted experiments comparing the PCA-CBVR performance based on Euclidean distance and on cosine similarity; Table 8 shows the comparison results. As shown in Table 8, cosine similarity improves the PCA-CBVR performance compared to Euclidean distance in most cases. This indicates that a similarity measure is more appropriate for the proposed PCA-CBVR than a distance measure; thus, we applied cosine similarity instead of Euclidean distance.
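To make the difference between the two metrics concrete, the following sketch ranks the same two candidates with each metric. The vectors are toy values chosen for illustration and do not come from the experiments: candidate "A" has the same direction as the query but a larger magnitude, while candidate "B" is closer in space but points in a different direction. The function names are likewise hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_euclidean(query, candidates):
    """Smaller distance ranks first."""
    return sorted(candidates, key=lambda name: euclidean_distance(query, candidates[name]))

def rank_by_cosine(query, candidates):
    """Larger similarity ranks first."""
    return sorted(candidates,
                  key=lambda name: cosine_similarity(query, candidates[name]),
                  reverse=True)

# Toy vectors: "A" is a scaled copy of the query, "B" is nearby but rotated.
toy_query = [1.0, 0.0]
toy_candidates = {"A": [3.0, 0.0], "B": [0.8, 0.6]}
euclidean_order = rank_by_euclidean(toy_query, toy_candidates)
cosine_order = rank_by_cosine(toy_query, toy_candidates)
```

Here the Euclidean ranking prefers "B" and the cosine ranking prefers "A", because cosine similarity ignores feature magnitude and compares direction only. Whether this property explains the results in Table 8 is not established by the sketch.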

H. RETRIEVAL TIME ANALYSIS DEPENDING ON THE NUMBER OF QUERIES
In this subsection, we discuss the retrieval time of the proposed PCA-CBVR. To measure the pure PCA-CBVR performance, we excluded unrelated components, such as the video loading, deep feature extraction, and feature loading times, and included only the processing time of subsections III-A and III-B. We used an Intel Xeon Silver 4215R 3.2 GHz CPU, Samsung 2,666 MHz RAM (16 GB × 3), and a Samsung 870 EVO 2 TB SSD for the experiments. According to Figure 8, the computation time of the proposed PCA-CBVR is almost linear in the number of queries, with a slope of 1/6230. Moreover, in theory, each video retrieval uses a 2 KB array (32 bits × 512) for the output of the 18- and 34-layer 3D CNN models and an 8 KB array (32 bits × 2048) for the output of the 50-layer 3D CNN models.
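The figures quoted above can be checked with a short calculation. The sketch below converts the feature dimensionality and bit width into bytes and evaluates the linear time model with the reported slope. The function names are hypothetical, and the time unit is that of the vertical axis of Figure 8, which is not stated in this subsection.

```python
def feature_bytes(dimensions, bits_per_value=32):
    """Size in bytes of one feature array of `dimensions` values."""
    return dimensions * bits_per_value // 8

def estimated_time(num_queries, slope=1 / 6230):
    """Linear time model read from Figure 8 (time unit as in that figure)."""
    return num_queries * slope

# 512-dimensional output (18- and 34-layer models): 2048 bytes, i.e. 2 KB.
small_model_bytes = feature_bytes(512)
# 2048-dimensional output (50-layer models): 8192 bytes, i.e. 8 KB.
large_model_bytes = feature_bytes(2048)
# With a slope of 1/6230, 6230 queries correspond to one time unit.
time_for_6230_queries = estimated_time(6230)
```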

V. CONCLUSION
This paper proposed the PCA-CBVR method, which retrieves the videos most similar to users' query videos based on the videos' contexts without any additional information such as tags. The proposed PCA-CBVR method consists of two main steps: category prediction of the user's query video and fine searching to retrieve the video most similar to the user's query video. To reduce the computational cost while maintaining the meaningful information of the videos in the category prediction step, PCA-CBVR calculates prototypes of the deep features for each category instead of using a dimension reduction strategy or generating binary hash codes as in previous CBVR research. Category prediction of the user's query video based on the prototypes is efficient because there is no need to train a classifier, even when the database is updated. The experimental results showed that the proposed PCA-CBVR performed better on an untrimmed dataset than state-of-the-art CBVR research, and that fine searching based on deep features retrieved the videos most similar to the user's query video by considering the detailed context information. Moreover, to address the cross-domain problem associated with the CBVR task, we fine-tuned the 3D CNN feature extractor based on the few-shot learning approach, and the PCA-CBVR with fine-tuned feature extractors showed better domain adaptation ability. To improve the performance of the PCA-CBVR further, we also applied salient frame sampling instead of uniform frame sampling. As a result, the mAP and top5 accuracy rates improved. In future work, we plan to improve the proposed PCA-CBVR by analyzing the architecture of 3D CNNs and the properties of the prototypes using explainable AI (XAI) techniques, and by utilizing concatenated low-level features such as color and texture from frames [41] as well as trajectory features [10], [42].
Moreover, we plan to apply the proposed PCA-CBVR to augmented reality (AR) and virtual reality (VR) applications to recommend suitable videos for adding and editing content based on the user's current situation in real time.