Sampling Fingerprints From Multimedia Content Resource Clusters

Nowadays, the growth of multimedia content over the web is exponential. The fingerprints are inconspicuously embedded in multimedia content. The fingerprints can be exploited to trace divergent information from multimedia resources. Sampling fingerprints, particularly from multimedia resources, is challenging since they are complex, heterogeneous, and diverse. This research proposed an approach to sample fingerprints from multimedia resources. Our approach partitions the multimedia content space into converged clusters using variations of Canberra distance and identifies the most diverged samples using Kullback-Leibler (KL) divergence. The resultant clusters represent the information belonging to particular concepts and the diverged samples within the clusters represent multimedia fingerprints. The fingerprint sampling process is leveraged using unsupervised learning algorithms, instantiated across various multimedia descriptors, and tested over standard multimedia datasets. The average results obtained over various standard visual and acoustic datasets reveal 80%, 77%, and 78% accuracy, precision, and recall, respectively, surpassing most of the existing baseline clustering methods such as K-Means, Mean-Shift, and DBSCAN. Furthermore, the rigorousness of the proposed algorithm clustering is evaluated using the internal clustering stability silhouette coefficient and the fingerprint diversity scores. The results unveil a maximum of 94% diversity score. The proposed variation of Canberra distance and KL divergence provides the most stable performance (SD=0.02) and creates promising implications in future multimedia retrieval, summarization, and exploration activities.


I. INTRODUCTION
Nowadays, exponential growth in the online production of multimedia content has been observed [1], [2]. The multimedia content in different media formats, i.e., text, audio, image, video objects, etc., collectively accumulated over massive multimedia resources [3]. Multimedia content has associated textual, acoustic, and visual information modalities [4]. Approximately 2.6 exabytes of multimedia content are consumed, replicated, and explored over the online multimedia resources [5]. Almost 82% of the global data traffic over the web is multimedia-based [6]. The contents in different media formats with multiple modalities are archived, retrieved, and interacted with by the web users in everyday exploration activities via search applications [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Geng-Ming Jiang.
Web users become overwhelmed with multimedia content, causing information overload, which hinders multimedia content exploration and access [8], [9]. Synthesizing vast multimedia resources with an abundance of different media formats and multiple information modalities via computing technologies is challenging [10], [11]. Ensuring users can access specific content from multimedia resources is a challenging endeavor [12]. Additionally, retrieving relevant information from immense piles of multimedia resources over the web becomes cumbersome. The retrieved multimedia content may include irrelevant, redundant, and insignificant content, leading to a partial satisfaction of information needs [4].
The massive amount of information in multimedia resources is undoubtedly invaluable in various user domains and retrieval scenarios [13], [14]. The techniques to access vast multimedia resources are becoming integral to user interaction and exploration scenarios [15], [16]. The exploration scenarios require fingerprint sampling from multimedia resources in the user's exploration activities [17]. Fingerprinting is about extracting information subsets from a divergent information resource that may represent a concept as a whole [18]. The fingerprints are inconspicuously embedded in multimedia content resources and are used to trace precise information from divergent resources [19], [20].
The existing literature broadly defines the multimedia fingerprinting concept in the context of audio content and copyright protection [21], [22]. The former provides effective matching of audio clips, and the latter preserves the copyright of the multimedia content. However, the main objective of multimedia fingerprinting is to facilitate the precise identification of massive content via signature matching [23]. In this research, we extend the idea of fingerprinting to solve the problem of multimedia content accessibility in retrieval and exploration contexts. We generalize the fingerprinting concept to identify samples from multimedia resources. The samples are fingerprints, which may give a holistic representation of multimedia resources.
This research proposes an approach to sample fingerprints from multimedia resources containing audio-visual content. Our approach initially clusters multimedia content instances into a dynamic number of the most converged clusters. The key representative samples are finally extracted from the most diverged samples as fingerprints based on their convergence in their respective clusters. We employed Canberra distance and Kullback-Leibler divergence measures in clustering and fingerprinting, respectively. The former distributes multimedia resources into clusters, and the latter identifies fingerprints from them. We also proposed clustering and fingerprint identification algorithms that employ the variations of Canberra distance and Kullback-Leibler divergence measure, respectively.
Our proposed approach provides a baseline to extract fingerprints from multimedia resources. To our knowledge, we are the first to employ multimedia fingerprinting to ease multimedia content accessibility and exploration. The proposed approach was instantiated over diverse standard audio-visual multimedia datasets. We extracted a variety of audio-visual descriptors from the multimedia contents and employed them in the instantiation. The performance of our proposed approach was measured in terms of precision, recall, and accuracy. We also used Mean-Shift, K-Means, and DBSCAN as baseline algorithms in a comparative evaluation. Our approach outperforms the baseline methods. We found that our proposed approach is more accurate and precision-oriented. The silhouette coefficient analysis highlights cluster stability across the different datasets and extracted descriptors.
Our proposed approach is generic and effective since the approach is equally applicable across multiple datasets and descriptors.
The rest of the discussion is organized as follows. Section II provides a literature review. Section III discusses the proposed approach. Section IV provides approach instantiation details. Section VI explains the experimental details and results. Section VII provides a comparative discussion. Finally, section VIII concludes the discussion and highlights future research directions.

II. LITERATURE REVIEW
A. MULTIMEDIA RESOURCES
In recent years, multimedia resources have converged over the web due to the emergence and proliferation of advanced computer and communication technologies [24]. The web has become a vast distributed multimedia resource. Multiple media objects have accumulated over the web as massive multimedia resources that enable the exploration of several different media types via advanced computing applications, i.e., digital libraries, social media platforms, knowledge-based systems, etc. [25], [26], [27]. The multimedia information resources enable access to and interaction with multiple media objects [28]. For example, Google provides user interaction with more than 30 trillion web pages containing textual content; Flickr enables social interaction with more than 10 billion images; SoundCloud contains 50 million tracks of audio content; YouTube contains more than 800 million video clips of variable length.

B. FINGERPRINTS
1) FINGERPRINTING: BASIC CONCEPT
Traditionally, fingerprinting involves biometrics, i.e., people's unique physical or biological characteristics used to identify them, e.g., thumb lines, retina, ears, etc. [29], [30]. The fingerprinting concept was first conceptualized from the theory of uniqueness [31]. However, in recent years, fingerprinting has been further employed in source identification, duplicate detection, copyright prevention, etc., in different domains [32], [33], [34], [35]. The research concerning multimedia fingerprints has recently gained the attention of researchers with a significant focus on the audio domain [36], [37]. The same idea of the theory of uniqueness in fingerprinting is also adopted in the context of fingerprinting of multimedia content [38]. Fingerprinting mainly distinguishes perceptually different artifacts in multimedia resources on the basis of uniqueness [31]. Fingerprint identification from multimedia resources can be defined as extracting a subset of information as representatives of a multimedia resource [39].

2) FINGERPRINTING: ANALOGY
The fingerprinting approach mainly identifies a set of samples from large resource sets [40]. In a way, fingerprinting comprises selective representative examples from the original datasets or resources [41]. Figure 1 illustrates the analogy of fingerprint identification. As Figure 1 shows, fingerprinting involves the selection of representative samples from information resources that can further be used in comparison, analysis, and management. Fingerprinting captures vital information from the original dataset more efficiently than any random sampling technique [13]. Fingerprints dependably and effectively portray the whole dataset and address the vital issues in scientific data analysis, having diverse utilization in artificial intelligence, signal processing, data recovery domains, etc. [18].

C. MULTIMEDIA FINGERPRINTS
The proliferation of multimedia data creates challenges in accessing, interacting with, and exploring massive multimedia resources [11], [42], [43]. However, fingerprinting is a non-trivial task that enhances human understanding of information resources via a smaller set of representatives identified as samples [44], [45]. Multimedia content available on the web is large and highly redundant and could be represented by a relatively small subset [36], [45], [46], [47]. The relevant and representative subset that demonstrates the global view of the entire resource can be nominated as fingerprints [13]. The identified fingerprints can be further used in processing since they precisely indicate the possible attributes of a collection. In multimedia resources, the fingerprints exist as a condensed content-based mark that summarizes content and provides evidence of uniqueness [48]. In multimedia resources, fingerprinting can be categorized into acoustic and visual.

1) ACOUSTIC FINGERPRINTS
Audio fingerprints have become popular because they permit the detection of audio independently of its structure. However, this may not cover the meta-data requirements [49]. An acoustic fingerprint is a digital summary generated from an audio clip. The objective is to locate an audio clip, or a similar one, in the audio database [50]. [36].

2) VISUAL FINGERPRINTS
In visual fingerprinting, most work is done in the context of either prototype selection from an image dataset or key-frame extraction from a series of video frames [46], [56]. Traditionally, visual fingerprints are employed to verify human identities; the objective is to improve security and safety against impersonation attacks [57]. The concept can be generalized to identify the sample visuals from a diverse set of video objects. Pandya et al. suggested the identification of fingerprints from the visual content by employing texture features, histogram equalization, Gabor filters, and deep learning approaches [58]. Li et al. proposed a fingerprinting method for video retrieval and copy detection by considering convolutional neural networks, quantization coding, and a feature extraction method [59]. Tseytlina et al. proposed a video fingerprinting plan for content-based video retrieval. The approach was based on Fourier Mellin, features, and compaction [60]. Mandelli et al. dealt with stabilizing video from the recording devices, particularly the method that involves the identification of images or video clips as fingerprints [61].
the Sensor Pattern Noise to effectively identify the image fingerprints using Large Scale Sparse Subspace Clustering. The technique produces many clusters from unclustered images [66]. Chen et al. introduced Deep Marks, a framework to retrieve authorship information and unique users from multimedia content as fingerprints. The framework provided the design of a unique codebook and encoding scheme to extract fingerprints from multimedia content [67]. Fan et al. investigated signature codes using the weighted binary adder channel and collusion resistance to extract multimedia fingerprints. They theoretically experimented and generated adversarial traceability fingerprints [18].

E. ISSUES AND MOTIVATION
In the present era, multimedia resources are growing exponentially. In contrast, individuals have a limited capacity to comprehend such content manually. The existing fingerprinting approaches provide the identification of fingerprints from audio-visual content. However, they exploit the low-level representation of multimedia content, such as binary encoding and signal manipulations [18], [63], [68], [70]. Moreover, the prime purpose of the existing fingerprinting approaches is to uniquely identify multimedia content for the prevention of unauthorized distribution [64]. Therefore, fingerprinting in the context of multimedia content identification is the least discussed in the literature. Most of the fingerprinting work has been leveraged in the context of source identification, duplicate selection, similarity-based retrieval, inverted index management, etc. However, almost all of the fingerprinting techniques are for particular domains. The research needs perceptual divergence to provide a comprehensive fingerprint identification mechanism for heterogeneous multimedia content resources. Hence, in this research, we are interested in exploring a generic multimedia fingerprinting approach based on state-of-the-art descriptors that provides representative samples to aid the exploration of immense multimedia data.

III. FINGERPRINTING APPROACH
In this research, we extended the fingerprinting analogy to address the issues in identifying fingerprints from multimedia resources. The objective is to suggest a generic approach that locates the most desired samples of multimedia resources as fingerprints. Notably, we extended the multimedia fingerprinting idea to sampling the most diverged fingerprints that may provide the sample-based coverage of the entire multimedia resource via the clusters with the most similar multimedia content. We aim to improve the performance of our generic algorithms and compare them with standard benchmarks that are applicable regardless of domain knowledge. We have proposed a novel approach to identify fingerprints from audio-visual resources. Primarily, we employed an unsupervised approach and developed an algorithmic fingerprint selection strategy from multimedia resources.
We hypothesized that the most convergent samples within clusters might have the most diverged characteristics within an entire multimedia resource. The clusters individually represent the unique concepts within an entire multimedia resource since a multimedia resource is a collection of diverse clusters. The most converged sample within a cluster shows maximum similarity with the other samples of the cluster. In this way, (i) a unique sample as a fingerprint from a cluster can be identified, (ii) the fingerprints can be sampled from the clusters as representative of the entire multimedia resource, and (iii) the sample representations of the entire resource can be recognized as fingerprints of the multimedia resource. We identified the most converged items as multimedia fingerprints from the most diverged clusters. In the following sections, we discuss the approach overview, preliminaries, distance measures, and algorithms employed to sample fingerprints from the multimedia resources.

A. APPROACH OVERVIEW
Our approach samples the most diverged components from the most converged multimedia clusters, where components are media objects belonging to a particular multimedia resource type, i.e., text, image, audio, video, etc. We introduced new variations of Canberra distance to identify the L-Most converged clusters. In turn, the proposed variations of Kullback-Leibler divergence identify the M-Most diverged components from the components of the L-Most converged clusters. L-Most and M-Most represent the dynamic number of clusters and sample fingerprints, respectively. Figure 2 demonstrates a schematic overview of our proposed fingerprinting approach.
Our proposed approach dynamically samples the fingerprints from the clusters by accommodating media objects belonging to a particular media type in separate media object spaces (Figure 2 (a)). The components are loaded into the media set space (Figure 2 (b)). The media set space is converged into the most relevant components in separate partitions called clusters (Figure 2 (c)). The divergence process is applied to the entire set of clusters to identify divergent samples (Figure 2 (d)). Amongst the divergent samples, the proposed approach identifies the fingerprints, which are the most discriminating and maximally correlated components in a media object space and clusters (Figure 2 (e)). Finally, the obtained fingerprints are evaluated empirically and compared with existing state-of-the-art algorithms (Figure 2 (f)).
in a pair of components $T_i$ and $T_j$ is computed as:
Equation 2 associates an individual mean Canberra distance with every $T_i \in T$. In fact, $C^{o}Cn_{mean}$ computes the degree of uniqueness of every $T_i \in T$. However, the uniqueness of individual components is not normalized; it varies across components and is hence only utilized to find a convergence of randomly distributed components into the dynamic number of clusters. We proposed a normalization of $C^{o}Cn_{mean}$ to compute a normalized factor, which can be used as a threshold value to decide the inclusion of a $T_i$ in a particular $C_i$. The normalized component-wise mean Canberra distance ($NC^{o}Can_{mean}(T)$) over all $T_i \in T$ is computed as:
The $NC^{o}Can_{mean}(T)$ is derived by taking the difference between maximum and minimum non-zero values of
141644 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

S N T = {S
In equation 4, $S^{T}_{N}$ is the set of non-clustered components $T_i$, where $T_i \in T$ and $T_i \notin C$; $S^{T}_{LC}$ is the set of all clustered components $T_i$, where $T_i \in \{T, C\}$; and $T^{o\,Can\,std}_{0-N}$ is the set of all components $T_i \in T$ whose $C^{o}Cn_{mean}$ distance with respect to $T_j \in T$ equals 0, where $i \neq j$.
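The Canberra-based quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact formulation: it assumes the standard Canberra distance, assumes the mean is taken over the other components, and uses hypothetical function names (`canberra`, `mean_canberra`, `canberra_threshold`). The threshold follows the prose description, i.e., the difference between the maximum and minimum non-zero mean distances.

```python
def canberra(u, v):
    """Standard Canberra distance; terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def mean_canberra(components):
    """Mean Canberra distance of each component to all other components.

    Assumption: the mean is taken over the other components (n - 1 terms).
    """
    means = []
    for i, t_i in enumerate(components):
        others = [canberra(t_i, t_j)
                  for j, t_j in enumerate(components) if j != i]
        means.append(sum(others) / len(others) if others else 0.0)
    return means


def canberra_threshold(components):
    """Difference between the max and min non-zero mean distances."""
    nonzero = [m for m in mean_canberra(components) if m > 0]
    return max(nonzero) - min(nonzero) if nonzero else 0.0
```

For three toy components `[[1, 0], [0, 1], [1, 0]]`, the mean distances are 1.0, 2.0, and 1.0, so the threshold is 1.0.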

2) KULLBACK-LEIBLER DIVERGENCE
We propose variations of Kullback-Leibler divergence as object-wise Kullback-Leibler divergence and normalized object-wise Kullback-Leibler divergence. These variations are used to sample the M-Most divergent objects from the objects of the L-Most convergent clusters. Kullback-Leibler divergence calculates the degree of dissimilarity between two objects, so it can be used to compute the divergence between objects. The Kullback-Leibler divergence between the feature vectors $E_i$ of objects $M_i$ and $M_j$ is given by equation 5. The range of the Kullback-Leibler divergence measure is $[0, \infty)$. The lower and upper bounds represent the degree of convergence between the pair of objects $(M_i, M_j)$. The individual divergence measure of any two individual objects can be calculated using equation 5. This measure cannot calculate the Kullback-Leibler divergence of an object with respect to all other remaining objects. The object-wise mean Kullback-Leibler divergence is used for that purpose. Equation 6 represents the object-wise mean Kullback-Leibler divergence of an object ($E^{o}KL\text{-}Divergence_{mean}(M_i)$) with respect to all other objects.
$E^{o}KL\text{-}Divergence_{mean}(M_i)$ can be calculated by dividing the sum of all the Kullback-Leibler divergences of an object $M_i$ with respect to all the objects $M_j$, where $j \neq i$, by the cardinality of the cluster object set $E_k$. $E^{o}KL\text{-}Divergence_{mean}(M_i)$ represents the individual divergence of each object with respect to all other cluster objects, i.e., its uniqueness in the set of objects in a cluster $E_k$. The proposed variation of Kullback-Leibler divergence only calculates the individual $E^{o}KL\text{-}Divergence_{mean}(M_i)$ in the set of media objects. It represents only the uniqueness of an object with respect to all other objects in the cluster. The uniqueness of individual objects in the clusters is not normalized; it varies from object to object. Hence, it cannot be used on its own to calculate the normalized divergence in cluster objects. A normalized object-wise mean Kullback-Leibler divergence ($NE^{o}KL\text{-}Divergence_{mean}(M)$) is therefore also proposed. The normalized factor can be used as a threshold value to decide the inclusion of an object in the set of candidates. $NE^{o}KL\text{-}Divergence_{mean}(M)$ is calculated by taking the average of the maximum and minimum $E^{o}KL\text{-}Divergence_{mean}(M_i)$ over the objects in the cluster.
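The object-wise mean divergence and its normalized threshold can be illustrated with a short Python sketch. It follows the prose description (sum over the other objects divided by the cluster cardinality; threshold as the average of the maximum and minimum means). How feature vectors are turned into probability distributions is not stated in this section, so the clipping, smoothing constant `eps`, and the function names are assumptions made for illustration only.

```python
import math


def to_distribution(vector, eps=1e-12):
    """Clip negatives to zero, add eps to avoid log(0), normalize to sum 1.

    Assumption: descriptors are converted to distributions this way.
    """
    smoothed = [max(x, 0.0) + eps for x in vector]
    total = sum(smoothed)
    return [x / total for x in smoothed]


def kl_divergence(p_vec, q_vec):
    """Standard KL divergence KL(p || q) between two feature vectors."""
    p = to_distribution(p_vec)
    q = to_distribution(q_vec)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mean_kl(cluster):
    """Sum of KL(M_i || M_j) over j != i, divided by the cluster cardinality."""
    size = len(cluster)
    return [
        sum(kl_divergence(m_i, m_j)
            for j, m_j in enumerate(cluster) if j != i) / size
        for i, m_i in enumerate(cluster)
    ]


def kl_threshold(cluster):
    """Average of the maximum and minimum object-wise mean divergences.

    Expects a non-empty cluster.
    """
    means = mean_kl(cluster)
    return (max(means) + min(means)) / 2.0
```

Note that KL divergence is asymmetric, so the direction `KL(M_i || M_j)` used here is itself a choice that the section does not pin down.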

D. ALGORITHMS
We developed four novel algorithms to sample the most diverged media objects (instances) as sample fingerprints from the most converged clusters. The algorithms compute the instance-wise mean Canberra distance of all the non-clustered instances, compute the instance-wise Kullback-Leibler (KL) standard deviation for all instances in a cluster, group instances into a dynamic number of clusters, and sample the most divergent instances from the clusters as sample instances. Algorithm 1 computes the instance-wise mean Canberra distance as $E^{o}Can_{mean}(M_i)$ of all non-clustered instances. It provides a threshold value as a normalization factor during the clustering of the instances. The threshold value is updated dynamically: it is re-computed for the objects remaining in the set after instances are included in a cluster. The threshold values are calculated dynamically until all objects are included in their corresponding most convergent cluster.

Algorithm 1 Canberra Distance Computation
Data: Feature Object Space
Result: Uniqueness of Media object and Threshold value
Algorithm 2 computes the Kullback-Leibler standard deviation of all instances in a cluster. The algorithm performs normalization on the calculated instance-wise Kullback-Leibler standard deviations. Algorithm 2 provides a threshold value for sampling the pair of instances. The threshold factor is a normalization factor. The threshold value is calculated dynamically for each cluster. Algorithm 1 and Algorithm 2 are further exploited to group instances into a dynamic number of converged clusters and to sample the most diverged instances from the clusters as sample fingerprints in Algorithm 3 and Algorithm 4, respectively.

Algorithm 2 Kullback-Leibler Normalization
Data: Feature object set contained in Clusters
Result: Uniqueness of Media object and Threshold value
Algorithm 3 initially takes the first instance of the media object space as the cluster centroid. The normalization factor for the centroid is dynamically calculated using the pseudo-code given in Algorithm 1. The objects from the set are included in the cluster and excluded from the object set if their Canberra distance to the centroid is less than $NE^{o}Can_{mean}(M)$ (computed via Algorithm 1). The procedure continues until all the objects are clustered into disjoint sets and the cardinality of the media object set becomes zero. Algorithm 3 creates the number of clusters dynamically.
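The clustering loop described for Algorithm 3 can be sketched as follows. The pseudo-code itself is not reproduced in this section, so this is a reading of the prose only: the first remaining object becomes the centroid, the threshold is recomputed on the remaining set, and objects closer to the centroid than the threshold join its cluster. The standard Canberra distance, the `n - 1` mean, and the names `threshold` and `cluster_objects` are assumptions.

```python
def canberra(u, v):
    """Standard Canberra distance; zero-denominator terms contribute 0."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(u, v) if abs(a) + abs(b) > 0)


def threshold(objects):
    """Max minus min of the non-zero mean Canberra distances (assumed form)."""
    n = len(objects)
    if n < 2:
        return 0.0
    means = [sum(canberra(a, b) for j, b in enumerate(objects) if j != i) / (n - 1)
             for i, a in enumerate(objects)]
    nonzero = [m for m in means if m > 0]
    return max(nonzero) - min(nonzero) if nonzero else 0.0


def cluster_objects(objects):
    """Partition objects into a dynamic number of disjoint clusters."""
    remaining = list(objects)
    clusters = []
    while remaining:
        centroid = remaining[0]          # first remaining object is the centroid
        limit = threshold(remaining)     # threshold recomputed on what is left
        members, leftover = [centroid], []
        for obj in remaining[1:]:
            if canberra(centroid, obj) < limit:
                members.append(obj)      # close enough: joins this cluster
            else:
                leftover.append(obj)     # stays in the non-clustered set
        clusters.append(members)
        remaining = leftover
    return clusters
```

Because the centroid is always removed from the remaining set, the loop terminates after at most one pass per object. For the toy input `[[1, 1], [1, 2], [10, 1]]`, the first two objects form one cluster and the third forms its own.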
In Algorithm 4, each cluster object is chosen individually, and its Kullback-Leibler divergence with respect to all other objects is computed. An object is considered divergent and sampled if its Kullback-Leibler divergence with respect to all other objects is greater than the KL-divergence threshold. A pair of objects is selected from each cluster as sample candidates. The algorithm samples the object from the pair with the least mean KL-divergence with respect to all other cluster objects. This procedure eliminates boundary objects from the candidate samples. The procedure continues cluster by cluster for all the objects until all the M-Most divergent objects have been sampled. The workflow is defined in Algorithm 4. The complexity of this algorithm is O(nk), as it computes all the distances and selects the fingerprint from each cluster.
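A possible reading of the sampling step in Algorithm 4 is sketched below. The text does not say how the candidate pair is chosen, so this sketch assumes the two most divergent candidates above the threshold form the pair; the conversion of descriptors to distributions and all function names are likewise hypothetical.

```python
import math


def kl(p_vec, q_vec, eps=1e-12):
    """KL(p || q) after clipping, smoothing, and normalizing both vectors."""
    p = [max(x, 0.0) + eps for x in p_vec]
    q = [max(x, 0.0) + eps for x in q_vec]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def sample_fingerprint(cluster):
    """Pick one fingerprint from a non-empty cluster."""
    size = len(cluster)
    if size == 1:
        return cluster[0]
    # Object-wise mean divergence: sum over other objects / cluster cardinality.
    means = [sum(kl(a, b) for j, b in enumerate(cluster) if j != i) / size
             for i, a in enumerate(cluster)]
    limit = (max(means) + min(means)) / 2.0   # normalized threshold
    candidates = [i for i, m in enumerate(means) if m > limit]
    if not candidates:                        # all means equal: keep everything
        candidates = list(range(size))
    # Assumption: the candidate pair is the two most divergent candidates.
    pair = sorted(candidates, key=lambda i: means[i], reverse=True)[:2]
    # Of the pair, keep the object with the least mean divergence.
    return cluster[min(pair, key=lambda i: means[i])]


def sample_fingerprints(clusters):
    """One fingerprint per cluster, processed cluster by cluster."""
    return [sample_fingerprint(c) for c in clusters]
```

Choosing the less divergent member of the pair is what removes the extreme boundary object from the result, which matches the stated purpose of the pair-based selection.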

IV. INSTANTIATION
Our proposed approach is instantiated and executed on publicly available datasets. This section also describes the implementation of the various measures and approaches. The following subsections briefly overview the fingerprinting instantiation details, including the datasets, implementation details, experimental setup, and baseline algorithms.

A. MULTIMEDIA DATASETS
We instantiated our approach on different publicly available, widely used datasets. The details of the datasets are given in Table 1. We have instantiated our approach on image datasets named I-Search (footnote 5) and Oxford-IIIT Pet (footnote 6). The I-Search image dataset contains 10305 images divided into 51 categories, and each category has approximately 200 images. Similarly, the Oxford-IIIT Pet dataset consists of 37 pet categories with approximately 200 images for each class, totaling around 7349 images. The I-Search audio dataset consists of 637 audio files classified into 43 categories. Finally, the AudioMNIST (footnote 7) dataset consists of 30000 audio samples of spoken digits (0-9) from 60 different speakers. These datasets contain ground truth values, which facilitate the calculation of accuracy, precision, and recall measures.

B. DESCRIPTORS
The visual feature extraction routines are mainly implemented in C# and MATLAB. We extracted features from image objects that include the Color and Edge Directivity Descriptor (CEDD), Color-Correlogram (CC), and Histogram of Oriented Gradients (HoG) features. These features were extracted via the OpenCV library (footnote 8). The resultant fingerprints from the CEDD, CC, and HoG are shown in Figure 3. In the case of audio datasets, Spectral Roll-off (SR), Spectral Centroids (SC), and Mel Frequency Cepstral Coefficients (MFCC) are extracted via the Librosa library (footnote 9). The librosa.display routine is used to display the audio files in different formats, such as wave plots, spectrograms, or color maps. Amplitude and frequency are important parameters of the sound and are unique for each audio, for which the librosa.display.waveplot routine is used. Figure 4 shows the fingerprints obtained via acoustic descriptors. The information contained in image and audio objects is extracted as vectors and matrices, respectively. These vectors and matrices are finally stored in text files comprising numeric values in corresponding matrices and vectors.
5 https://vcl.iti.gr/dataset/i-search-multimodal-dataset/
6 https://www.robots.ox.ac.uk/~vgg/data/pets/
7 https://www.kaggle.com/datasets/sripaadsrinivasan/audio-mnist
8 https://opencv.org/
9 https://librosa.org/
The remaining objects in the set whose Canberra distance is less than the normalization factor are deemed cluster elements. The clustering procedure continues until the normalization factor of the last created cluster is not less than the normalization factor of the non-clustered objects. The objects in the non-clustered set whose normalization factor is less than that of the last created cluster are included in a new cluster. The procedure stops automatically once the set of objects has been partitioned into a dynamic number of clusters. The simulation reveals that our proposed clustering approach yields effective results (Figure 5).
The media objects with the highest average similarity to the other objects offer the highest content coverage in the set. The sampling algorithm samples the most similar objects from the clusters created by the clustering algorithm. The algorithm calculates the $E^{o}CKL\text{-}D_{mean}(M)$ of all the objects in the first cluster. Points with maximum KL-divergence from the cluster are sampled as candidate samples. An object from the candidate samples with maximum KL-divergence with respect to all other objects in the cluster is considered a sample from the cluster. This procedure continues until objects are sampled from all clusters. Figure 5 demonstrates the fingerprints extracted from the clusters.

V. EVALUATION
A. EXPERIMENTAL SETUP
We ran our approach on a desktop computer with a quad-core Intel(R) Core(TM) i7-6700 @ 3.4 GHz processor and 8 GB of DDR3 RAM. All methods were implemented in Python 3.8 in the Spyder environment with the 64-bit interpreter. The Pandas, Sci-Kit, Flask, Keras, and OpenCV libraries were used.

B. BASELINE ALGORITHMS
We have proposed a novel method for clustering and provided an unsupervised approach that requires no prior information such as the number of clusters or initial values. We compared the performance of our algorithm with standard benchmarks to determine its efficiency. The Mean-Shift, K-Means, and DBSCAN algorithms were utilized to test the effectiveness of our algorithms.

C. EVALUATION MEASURES
The results are evaluated in terms of the quality and performance of clusters. The results are also compared with traditional clustering methods such as K-Means, DBSCAN, and Mean-Shift. The details are discussed in the subsequent subsections. An information-theoretic approach is adopted that views clustering as a series of decisions. To evaluate the performance of clustering, a contingency matrix is measured in terms of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) decisions. TP decisions assign similar elements to the same cluster, whereas TN decisions assign dissimilar elements to different clusters. Two types of errors can be committed: FP and FN. An FP decision assigns dissimilar elements to the same cluster, and an FN decision assigns similar elements to different clusters.
According to the literature, precision reveals the effectiveness of clusters. It is the fraction of relevant results among the retrieved results [71]. The precision can be measured as $P_c = TP/(TP + FP)$, and the average precision can be calculated as $AP_c = \sum_{i=1}^{n} P_c / n$. Recall describes the completeness of outcomes and is defined as the fraction of relevant results retrieved over the total number of relevant results. Mathematically, recall is defined as $R_c = TP/(TP + FN)$. Similarly, the average recall can be calculated as $AR_c = \sum_{i=1}^{n} R_c / n$. Another measure that can be used to check the performance of clusters is accuracy, which tells how correctly the elements are clustered. Accuracy is defined as $A_c = (TP + TN)/(TP + TN + FP + FN)$, and the average accuracy is defined as $AA_c = \sum_{i=1}^{n} A_c / n$.
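The pair-based decisions and the three measures above can be computed as in the following Python sketch. It assumes that "similar" elements are those sharing a ground-truth category, which the datasets provide; the function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def pair_confusion(true_labels, cluster_labels):
    """Count TP, TN, FP, FN over all unordered pairs of elements."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            tp += 1      # similar elements, same cluster
        elif not same_class and not same_cluster:
            tn += 1      # dissimilar elements, different clusters
        elif not same_class and same_cluster:
            fp += 1      # dissimilar elements, same cluster
        else:
            fn += 1      # similar elements, different clusters
    return tp, tn, fp, fn


def clustering_scores(tp, tn, fp, fn):
    """Precision, recall, and accuracy from the pair counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


def average(values):
    """Average of a per-run measure, e.g. AP_c over n runs."""
    return sum(values) / len(values) if values else 0.0
```

For ground truth `[0, 0, 1, 1]` and cluster assignment `[0, 0, 0, 1]`, the six pairs give TP=1, TN=2, FP=2, FN=1, hence a precision of 1/3, a recall of 1/2, and an accuracy of 1/2.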

A. BASELINE RESULTS
The evaluation was performed on visual and acoustic datasets.For the former, we used the I-Search dataset and the Oxford-IIIT Pet dataset.For the latter, we used the AudioMNIST and I-Search datasets.The results were obtained on the existing state-of-the-art clustering algorithms (K-Means, Mean-Shift, and DBSCAN) and the proposed algorithm.For the I-Search image dataset, the proposed algorithm achieved the average highest accuracy and recall of 84.39% and 80%, respectively, gained from CEDD embedding.Meanwhile, the highest precision is recorded at 89% in the case of K-Means and CEDD embedding.
The detailed results are summarized in Figure 6. Hence, the proposed algorithm outperforms all of the existing baselines in accuracy and recall with the CEDD embedding. Amongst the baselines, K-Means was observed to be the closest competitor. However, K-Means only surpassed the proposed approach in precision, while the proposed approach achieved higher accuracy and recall.
For the Oxford-IIIT Pet image dataset, the CEDD embedding again yielded the best overall accuracy, precision, and recall scores of 87%, 79%, and 77%, respectively, when compared to all existing baselines. The highest recall, reported at 82%, also belonged to the proposed system. Amongst the baseline algorithms, K-Means achieved the best results, with accuracy, precision, and recall of 79%, 78%, and 75%, respectively, for the CEDD embedding. Holistically, the proposed algorithm surpassed all the existing baselines for all the other feature sets, e.g., HoG and CC. Hence, the proposed algorithm presents a promising new baseline.
We also evaluated the proposed approach using the AudioMNIST and I-Search acoustic datasets. For the I-Search audio dataset, the proposed approach and the Mean-Shift clustering algorithm performed the best, achieving 85% accuracy on the SC and MFCC feature sets, respectively. In precision, K-Means and Mean-Shift performed marginally (2%) better than the proposed approach. The highest recall, 85%, was achieved by the K-Means clustering algorithm. However, the proposed approach outperformed all the existing baselines in accuracy and recall on the SC feature set. Holistically, the proposed approach performs nearly as well as the existing baselines on the I-Search acoustic dataset. The detailed results are presented in Figure 8.
For the AudioMNIST dataset, DBSCAN outperforms the other baselines, achieving accuracy, precision, and recall of 88%, 88%, and 87%, respectively, for the MFCC feature set. The Mean-Shift algorithm follows closely, with a margin of 1% in accuracy. The proposed approach and Mean-Shift perform nearly as well as each other, with a 1% margin in recall. The proposed approach outperforms the baselines in recall on the SC feature set, achieving 79%. The detailed results are shown in Figure 9.

B. APPROACH RESULTS
The proposed approach outperformed the baselines on the image datasets in terms of average accuracy (AA), achieving a maximum of 83%. The average recall (AR) was also the best amongst all the baselines, with a maximum score of 79%. The best average precision, 78%, was reported for the Oxford-IIIT image dataset. However, the proposed algorithm stayed marginally behind the K-Means algorithm on the I-Search image dataset. The proposed approach achieved stable performance across the audio datasets. The AA remained the highest on the I-Search audio dataset (81%). For the same dataset, the AR was the second best by a 1% margin. The rest of the baseline algorithms demonstrated variable performance for each instance of the dataset. The averaged results for the image and audio datasets are presented in Figure 10 and Figure 11, respectively. Hence, the proposed approach achieved the best performance for the image datasets with stable results (SD=0.02). Similarly, the proposed approach remained stable with acceptable performance on the audio datasets (SD=0.02). The detailed results are provided in Table 2.

C. CLUSTERING ANALYSIS
We also measured cluster performance with silhouette analysis, which is used to compute the stability of clusters. Silhouette analysis is also utilized to measure the separation distance between clusters. The investigation is done by generating a plot that visually supports the assessment of the number of clusters. Mathematically, the score of a sample can be calculated as $S = (b - a)/\max(a, b)$, where "a" represents the mean distance between the sample and all other points in the same cluster. In contrast, "b" represents the mean distance between the sample and all points in the next closest cluster. The scores lie in the range of −1 to +1. As the value approaches +1, it indicates well-separated clustering, whereas a value of zero reveals overlapping clusters. A higher score indicates more stable clusters.
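As an illustration of the silhouette formula, the following is a minimal pure-Python sketch. It assumes Euclidean distance and assigns a score of 0 to samples in singleton clusters or when only one cluster exists; both choices are assumptions of this sketch rather than details reported in the paper, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_sample(index, points, labels, dist=euclidean):
    """Silhouette score S = (b - a) / max(a, b) for a single sample."""
    own = labels[index]
    # Collect distances from this sample to every other sample, per cluster.
    by_cluster = {}
    for k, (point, label) in enumerate(zip(points, labels)):
        if k != index:
            by_cluster.setdefault(label, []).append(dist(points[index], point))
    if own not in by_cluster:
        return 0.0  # singleton cluster (convention assumed in this sketch)
    # a: mean distance to the other members of the sample's own cluster.
    a = sum(by_cluster[own]) / len(by_cluster[own])
    # b: smallest mean distance to the members of any other cluster.
    others = [sum(d) / len(d) for lab, d in by_cluster.items() if lab != own]
    if not others:
        return 0.0  # only one cluster present (convention assumed here)
    b = min(others)
    denom = max(a, b)
    return (b - a) / denom if denom else 0.0


def silhouette_mean(points, labels, dist=euclidean):
    """Average silhouette score over all samples."""
    scores = [silhouette_sample(i, points, labels, dist) for i in range(len(points))]
    return sum(scores) / len(scores)
```

With the points `[(0, 0), (0, 1), (10, 0), (10, 1)]`, the labelling `[0, 0, 1, 1]` gives an average score close to +1, while the mixed labelling `[0, 1, 0, 1]` gives a negative average, matching the interpretation of the score range given above.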
The silhouette coefficients have also been extracted to test the stability of the clusters. Silhouette analysis is employed to select an optimal number of clusters (n-clusters) [72], [73]. It also illustrates the stability of the clusters. Figure 12 shows the silhouette plot of the CEDD feature set, which indicates that the n-cluster values of 30, 70, and 90 are poor choices for K-Means on the given multimedia objects because of the occurrence of clusters with lower average silhouette scores. These n-cluster values are also poor because of the wide variations in the size of the silhouette plots. However, the plot is less decisive when choosing an n-cluster number between 10 and 50. Moreover, the results illustrate that the choice of 50 is quite beneficial, as it has a high score. At the same time, the Mean-Shift algorithm for CEDD features demonstrates that 10 and 50 clusters do not provide promising results. The n-cluster values of 30, 70, and 90 indicate a good number of clusters. The results show that an n-cluster value of 70 is more practical for Mean-Shift. However, Mean-Shift lacks satisfactory accuracy, precision, and recall results.
The results in Figure 12 demonstrate that the n-cluster values of 10, 30, 50, and 90 are not good choices for DBSCAN due to the below-average silhouette scores. The extensive fluctuation in the range also renders them poor choices. However, the n-cluster value of 70 shows the stability of DBSCAN clustering. The results show that our proposed approach follows the natural partitioning, as the scores are above average for 10, 30, 50, and 70. However, the silhouette scores illustrate that a drastic increase in the number of clusters does not give promising results. They also demonstrate the stability of n-clusters between 30 and 50, as both give favorable scores.
141650 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Figure 13 illustrates the silhouette coefficient plot resulting from the CC feature set. The plot demonstrates that the n-cluster values of 30 and 90 are not good choices for K-Means because of their low silhouette coefficients. However, the analysis is less decisive when choosing between 10, 50, and 70. Similarly, n-cluster values of 70 and 90 are poor choices for the Mean-Shift algorithm, whereas n-cluster values between 30 and 50 provide more balanced results. DBSCAN, in turn, shows that n-cluster values between 50 and 70 provide promising cluster stability. Our approach offers stable results for the CC feature set when the n-cluster value is between 30 and 50. This feature set also confirms that our algorithm does not support a drastic expansion in the number of clusters.
Figure 14 presents the average silhouette coefficient plot for the HoG feature set. The results demonstrate that K-Means offers stability for n-cluster values between 10 and 50. The silhouette analysis is ambivalent in choosing between 10 and 50. However, 30, 70, and 90 are rejected because of below-average scores. Similarly, DBSCAN offers stability at the n-cluster value of 70. The Mean-Shift algorithm shows that n-cluster values of 10, 70, and 90 are insufficient. Stability is indicated between 30 and 50, as they demonstrate good scores. Our proposed approach shows that its partition of the media objects gives promising results.

D. FINGERPRINTS ANALYSIS
The core idea of fingerprinting is to capture the divergent elements of a dataset. The fingerprint diversity scores are calculated based on the variability of the multimedia fingerprints within a cluster. This metric is suitable for assessing the diversity of sampled fingerprints, especially when using techniques like Canberra distance and KL divergence. In contrast, traditional clustering methods like K-Means, DBSCAN, and Mean-Shift do not inherently produce or focus on multimedia fingerprints. Instead, they aim to cluster data points based on their feature vectors, finding natural clusters in the data based on the chosen distance metrics. The proposed approach focuses on identifying multimedia fingerprints as a distinct task. The audio and image fingerprint sampling is therefore verified via the fingerprint diversity score. The diversity of the result set can be measured based on the distance between the visual and acoustic features. The visual features, i.e., HoG, CEDD, and CC, were extracted from the result set, and their diversity was calculated. Similarly, the acoustic features, i.e., SR, SC, and MFCC, were extracted from the result set, and their diversity was calculated for each feature set. We assume that media objects are encoded in the R feature vector of E dimensions. The diversity of a result set F with N elements can be formalized as:
where var(i) and var(j) are the variance terms computed as the standard deviation of the feature vector of all M media objects in F.
Similarly, δ(i, j) is computed as a distance function between the i-th and j-th feature dimensions. Here, δ(i, j) is calculated as the reciprocal of the similarity among features, $\delta(i, j) = 1/Q(i, j)$. The similarity is calculated by cosine distance. In this context, we take the media object feature vectors $r_i$ and $r_j$ as the i-th and j-th elements. In particular, Q(i, j) can be considered as the probability that the i-th and j-th feature vectors coincide as elements in all media objects. This probability, $P(r_1 = i, r_2 = j)$, is expanded in Equation (10). The diversity scores of the result sets, obtained on different audio and image datasets, are shown in Table 3. These diversity scores reveal the dissimilarity of the images contained in the result set. In our experiments, we take the result set as M and the extracted features as HoG, CEDD, and CC for the image dataset. For the audio dataset, we extracted MFCC. The score is normalized to [0, 1]: a score of "1" indicates a completely diverse set of results, whereas a score of "0" exposes highly redundant data. We achieved a maximum fingerprint diversity score of 94%. The detailed scores are provided in Table 3, which reveals that our fingerprints are diverse in nature.
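One possible reading of Equation (10) and of the reciprocal distance δ(i, j) = 1/Q(i, j) is sketched below in Python. The sketch assumes that each of the N elements of the result set carries its own conditional distribution over feature values and that the prior is uniform, P(F) = 1/N. The data layout (`cond_probs` as a list of per-element probability lists) and the function names are hypothetical illustrations, not the paper's implementation.

```python
def coincidence_probability(cond_probs, i, j):
    """Q(i, j) = sum over F of P(i|F) * P(j|F) * P(F), with P(F) = 1/N.

    cond_probs[f][k] is the (assumed) conditional probability of feature
    value k given element f of the result set.
    """
    n = len(cond_probs)
    prior = 1.0 / n  # uniform prior over the N elements of the result set
    return sum(p[i] * p[j] * prior for p in cond_probs)


def delta(cond_probs, i, j):
    """Distance as the reciprocal of the coincidence probability Q(i, j)."""
    q = coincidence_probability(cond_probs, i, j)
    # Features that never coincide are treated as infinitely far apart.
    return float("inf") if q == 0 else 1.0 / q
```

Under this reading, features that frequently co-occur across the result set obtain a large Q(i, j) and hence a small δ(i, j), so they contribute little to the diversity, while rarely co-occurring features are far apart.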

VII. DISCUSSION
Our proposed approach identifies the relevant samples as fingerprints, which may represent the multimedia resources. Our approach provides a diversified representation of a multimedia resource. It enhances the user's information-seeking journey to find relevant content from multimedia resources. The greater the overall diversity, the richer the information accessible to the system, and the higher the performance estimated via our proposed approach. In the fingerprint selection problem, we aim to pick a few representative samples that capture the distinguishing characteristics of an entire multimedia resource. We proposed a generic approach to seek fingerprints that proportionally reflect specified characteristics exemplified in a target population.

A. CONTRIBUTIONS
The proposed approach outperformed the baseline clustering results, with the additional advantage of diversity in the final results. The proposed approach consisted of three distinct phases. Firstly, the convergence of the distinct media set space was calculated using a novel variation of Canberra distance. Afterwards, using a variation of Kullback-Leibler divergence, the most distinct samples within each media set space were identified. Finally, the proposed approach was evaluated from the clustering and fingerprint identification perspectives. According to the obtained results, the proposed algorithm outperformed the existing state-of-the-art clustering algorithms for the image datasets in terms of cumulative accuracy (82%) and recall (79%). The precision (79%) was also on par with the existing baselines, marginally behind K-Means by 1%. Similar results were obtained for the audio datasets, where the accuracy (79%) and recall (78%) surpassed all the existing baseline results. The precision (76%) was marginally (1%) behind the existing baselines (77%). The proposed approach demonstrated stable performance (SD=0.02) across various feature spaces and datasets, as shown in Table 4. The summarized averaged results obtained for the image and audio datasets are shown in Figure 15. The averaged fingerprint diversity scores obtained for the image and audio datasets were 80% and 87%, respectively. To the best of our knowledge, no previous fingerprinting approach has been introduced that processes diverse multimedia datasets and surpasses the existing baseline algorithms.
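The first two phases build on Canberra distance and KL divergence. The specific variations proposed in this work are not restated in this section, so the following Python sketch uses only the standard textbook definitions of both measures. The helper `most_diverged`, which ranks cluster members by their divergence from the cluster's mean distribution, is a hypothetical illustration of selecting diverged samples and should not be read as the paper's procedure. Feature vectors are assumed to be non-negative with a positive sum.

```python
import math


def canberra(p, q):
    """Standard Canberra distance: sum of |p_k - q_k| / (|p_k| + |q_k|)."""
    total = 0.0
    for x, y in zip(p, q):
        denom = abs(x) + abs(y)
        if denom:  # terms with 0/0 are skipped by convention
            total += abs(x - y) / denom
    return total


def normalise(v):
    """Scale a non-negative vector with positive sum so that it sums to 1."""
    s = sum(v)
    return [x / s for x in v]


def kl_divergence(p, q, eps=1e-12):
    """Standard KL divergence D(P || Q) for discrete distributions (natural log)."""
    p, q = normalise(p), normalise(q)
    # eps guards against division by zero when q has empty bins.
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def most_diverged(cluster, count=1):
    """Indices of the members that diverge most from the cluster mean.

    Hypothetical selection rule used only for illustration.
    """
    dists = [normalise(v) for v in cluster]
    mean = [sum(col) / len(dists) for col in zip(*dists)]
    ranked = sorted(range(len(cluster)),
                    key=lambda k: kl_divergence(dists[k], mean),
                    reverse=True)
    return ranked[:count]
```

In this sketch, Canberra distance would serve to group similar feature vectors, and `most_diverged` would then pick candidate fingerprints inside each group; for instance, in the cluster `[[1, 1, 1], [1, 1, 1], [5, 1, 0]]` the third member is the most diverged one.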

B. FINGERPRINTS
This research proposed a unique approach to sampling fingerprints from multimedia resources using a combination of Canberra distance and Kullback-Leibler (KL) divergence to identify the most diverged samples within multimedia content clusters. The proposed approach differs from traditional methods that may rely on other techniques or algorithms for feature extraction and clustering, where the aim is to find natural clusters in the data based on the chosen distance metrics. In contrast, the proposed approach focuses on identifying multimedia fingerprints as a distinct task. This research leveraged unsupervised learning algorithms to create clusters of multimedia content based on their fingerprints. The approach is not limited to a specific algorithm but is instantiated across various multimedia descriptors, which shows its flexibility in adapting to different modalities and datasets.
The proposed research was evaluated against performance metrics such as accuracy, precision, and recall, which are commonly used to evaluate the effectiveness of fingerprinting methods. The reported values (80%, 77%, and 78%, respectively) for these metrics indicate the effectiveness of the proposed approach, surpassing existing baseline clustering methods like K-Means, Mean-Shift, and DBSCAN.
Furthermore, the clustering stability was measured using the silhouette coefficient, which represents how well-defined the clusters are. This aspect assesses the quality of the clusters and helps ensure that the identified clusters are meaningful and well-separated. In addition, the research introduces fingerprint diversity scores for the verification of the audio and image fingerprint samples, which indicate the variability and distinctiveness of the fingerprints. A high diversity score (up to 94%) suggests that the sampled fingerprints are diverse and can capture a wide range of information.
The proposed variation of Canberra distance and KL divergence is reported to provide stable performance with a low standard deviation (SD=0.02). This stability ensures consistent results across different datasets and multimedia types with statistical significance. However, a limitation of this study is that the choice of clustering algorithm may generate distinct results in the absence of standard selection criteria. According to the impossibility theorem, no single clustering algorithm can generate consistent and optimal results for a variety of problems. Hence, this aspect needs thorough investigation to determine the effect of clustering ensembles via detailed comparative analysis.

C. IMPLICATIONS
The proposed approach can also be adapted for different application scenarios. It can be utilized for enhanced content-based multimedia retrieval, exploration of big datasets, and summarization of multimedia result sets. Summarization creates a subset of information by computationally reducing the information [17]. The subset signifies the most relevant and valuable information contained in the original content. In the image and video domain, selecting the most representative images and frames can be regarded as image summarization and video summarization, respectively [13]. The proposed approach can be applied to image and video summarization, as it provides a diverse and representative view of multimedia objects. Dataset fingerprints can be regarded as summaries of a collection.
The proposed approach can also be applied in the context of content-based retrieval. It is the process of retrieving content via similar multimedia content, e.g., an acoustic query returns similar audio files [74]. This can be achieved by matching the query fingerprint and retrieving the cluster fingerprint results that demonstrate relatedness to the user query.
Information exploration is the process of searching for and discovering the required information. In the case of information exploration and discovery, users often lack clarity and the skills needed to express their information needs via queries [75]. To ease the user's task, relevant information and diverse content can be provided. With our proposed approach, fingerprints that contain a diverse collection of relevant information can be provided to the user.

VIII. CONCLUSION AND FUTURE RESEARCH
The paper presented a framework for identifying relevant samples as fingerprints. The proposed approach identified the m-most convergent items as multimedia fingerprints from n-most divergent clusters. The approach was instantiated across various multimedia datasets over widely recognized descriptors such as MFCC, SR, and SC for acoustic samples, and CEDD, HoG, and CC for visual samples. A detailed comparison was conducted between the proposed algorithm and existing state-of-the-art clustering techniques such as K-Means, DBSCAN, and Mean-Shift. On average, the proposed variation of Canberra distance and KL divergence achieved 80%, 77%, and 78% accuracy, precision, and recall, respectively, with the most stable clustering performance (SD=0.02) across all the descriptors. The fingerprints were further assessed in the context of diversity, obtaining scores of up to 94%. The proposed approach has implications for content-based multimedia retrieval, summarization, and multimedia exploration activities. The choice of clustering algorithm may generate distinct results in the absence of standard selection criteria, and, according to the impossibility theorem, no single clustering algorithm can generate consistent and optimal results for a variety of problems. Hence, in the future, the effect of clustering ensembles can be identified via detailed comparative analysis. Furthermore, the fingerprinting approach can be extended with deep learning-based embeddings to automate the multimodal feature description.
Panday et al. devised fingerprint Singular Value Decomposition (SVD) to generate the image fingerprints. The notion was to construct the fingerprint regardless of the rotation of the image [68]. Sharma et al. employed Local Adaptive Binary Patterns (LABP) and Uniform Local Binary Patterns (ULBP) along with Support Vector Machine (SVM) to learn LABP and ULBP features as fingerprints [69]. Ye et al. proposed a novel fingerprinting that decomposes the image fingerprint code via structure fingerprint embedding. The objective was to use a unique image fingerprint to encrypt the images [70].

FIGURE 2. The overview of the fingerprint sampling approach comprising (a) media object space accommodation, (b) features extraction, (c) cluster generation, (d) sample extraction, (e) fingerprint identification, and (f) approach evaluation.
$F_i$ is unique within a $C_i$, and $F_{sim} > (T_{sim}) \rightarrow C$ since $F \subseteq T$ and F is clusters are most diverged components of T.

C. DISTANCE MEASURES
1) CANBERRA DISTANCE
Our approach employs basic Canberra distance ($d_{cn}$) to split Equation 1 can not give the component-wise mean Canberra distance ($CoCn_{mean}$) of a component $T_i$ concerning all other components in set S, where S are non-clustered components and $\forall S_k \in T$. The $CoCn_{mean}$ is computed as: $CoCan_{mean}\,T_i$ and $\forall T_i \in T$. The equation 10 represents a set of all components that are not converged in any $\forall C_i \in C$ and their $CoCan_{mean}\,T_i > 0$. Equation 4 represents the components excluded in C.

FIGURE 3. The clustering samples obtained for each feature set.

FIGURE 4. The fingerprints obtained via acoustic descriptors.

FIGURE 6. Comparison of clustering results on I-Search image dataset.

FIGURE 7. Comparison of clustering results on Oxford-IIIT Pet image dataset.

FIGURE 8. Comparison of clustering results on I-Search audio dataset.

FIGURE 9. Comparison of clustering results on AudioMNIST audio dataset.

FIGURE 10. Averaged feature-set results on the image datasets for the proposed approach and state-of-the-art baselines.

FIGURE 11. Averaged feature-set results on the audio datasets for the proposed approach and state-of-the-art baselines.

FIGURE 12. Comparison of silhouette coefficients results based on CEDD embeddings on the I-Search image dataset.

FIGURE 13. Comparison of silhouette coefficients results based on CC embeddings on the I-Search image dataset.

FIGURE 14. Comparison of silhouette coefficients results based on HoG embeddings on the I-Search image dataset.

$\sum_{F} P(r_1 = i, r_2 = j \mid M)\,P(F) = \sum_{F} P(i \mid F)\,P(j \mid F)\,P(F)$ (10)

P(i|F) and P(j|F) indicate the conditional probability of a feature in result set F, whereas P(F) comprises the prior probability. If N elements are enclosed in the result set, then it is equal to 1/N. The diversity scores of the result sets are obtained on different audio and image datasets, as shown in Table 3.

Anguera et al. computed masks near the spectral peaks in the spectrogram for robust audio fingerprinting [51]. Yu et al. proposed hybrid high-performance data structures for indexing massive amounts of audio fingerprinting data for efficient search [52]. Ouali et al. quantized spectrogram regions into a series of horizontal and vertical slices, which are then represented as 48-dimensional fingerprints [53]. Malekesmaeili et al. computed scale-invariant features from two-dimensional time-chroma representations of spectrogram patches [54]. Saravanos et al. proposed a novel audio fingerprinting technique based on the expression of audio signals by establishing a dictionary [55]. Li et al. proposed a compact representation for audio fingerprints executed from local linear embedding that is further utilized in the retrieval task

TABLE 1. Datasets employed in the instantiation of the proposed fingerprinting research.

TABLE 2. Detailed experimental clustering results for each algorithm and corresponding features evaluation (CEDD/HOG/CC for visual datasets and SR/SC/MFCC for acoustic datasets), where the highest obtained results are bolded.

TABLE 3. Diversity scores of the proposed fingerprinting approach.

TABLE 4. Statistical significance of the overall obtained clustering results.

FIGURE 15. Averaged audio and image datasets summarized result.