Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds

Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods focus mainly on the semantic correspondence between videos and monaural sounds, and the spatial information of sound sources has not been considered. However, sound locations are critical for understanding the sound environment. To this end, it is necessary to acquire cross-modal features that reflect both the semantic and spatial relationship between videos and sounds. A video with stereo sound, which has become commonly used, provides the direction of arrival of each sound source in addition to the category information. This suggests its potential for acquiring the desired cross-modal feature space. In this paper, we propose a novel self-supervised approach that learns a cross-modal feature representation capturing both the category and location of each sound source, using stereo sound as input. For a set of unlabeled videos, the proposed method generates three kinds of audio-visual pairs: 1) perfectly matched pairs from the same video, 2) pairs from the same video but with the flipped stereo sound, and 3) pairs from a different video. The cross-modal feature encoder of the proposed method is trained with a triplet loss to reflect the relationship between these three pairs ( $1>2>3$ ). We apply this method to cross-modal image/audio retrieval. Compared with previous audio-visual pretext tasks, the proposed method shows significant improvement on both real and synthetic datasets.

…necessary for practical applications such as cross-modal retrieval, alignment, and localization [1], [2], [3]. … based on the temporal alignment of videos and sounds [5]. However, these methods mainly focus on the semantic relationship between images and monaural sounds, and they cannot utilize the spatial relationship typically represented by sound locations. Namely, the previous methods implicitly assume that the main sound source occupies a large portion of the image. Therefore, they are difficult to apply to videos in which the sound source exists at a specific location or in which multiple sound sources exist. For example, in the left video in Figure 1, the main sound source is at a specific location but does not occupy a large area of the image. To perform appropriate feature extraction, we need to identify that the trumpet is on the left side. Likewise, the right video in Figure 1 cannot be properly interpreted if the positions of the drum and piano are unknown. To acquire a feature extractor applicable to such videos, the directions of arrival of the sound sources need to be incorporated into training.

Nowadays, video clips with stereo sound can be easily acquired. They contain both semantic information and …

FIGURE 2. Assuming a dataset of unannotated videos, there are three types of stereo sounds paired with an image: (a) sound from the same video corresponding to the image, (b) the same sound but with the left and right audio channels flipped, and (c) sound from another video. The basic idea of this work is to learn to maintain the ordered distance relationships of this triplet so as to capture the semantic and spatial relationships between stereo sounds and images.
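To make the triplet in Figure 2 concrete, the following minimal sketch (PyTorch-style; the tensor shapes and the helper name are assumptions, not code from the paper) builds the three pair types for one image:

```python
import torch

def make_pair_variants(image, stereo_audio, other_audio):
    """Build the three pair types sketched in Figure 2 for one image.

    image        : tensor of shape (3, H, W), a frame from the video
    stereo_audio : tensor of shape (2, T), left/right waveform from the same video
    other_audio  : tensor of shape (2, T), stereo waveform from a different video
    """
    matched    = (image, stereo_audio)                 # (a) same video, correct channels
    flipped    = (image, stereo_audio.flip(dims=[0]))  # (b) same video, L/R channels swapped
    mismatched = (image, other_audio)                  # (c) audio taken from another video
    return matched, flipped, mismatched
```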
…method can process the difficult cases mentioned above and illustrated in Figure 1. We evaluate the proposed method in a cross-modal image/sound retrieval scenario using both real and synthetic datasets. Experimental results demonstrate that the proposed method enables cross-modal retrieval based on both the semantic and spatial correspondences of the sound sources.

Our contributions are twofold.
1) We propose a novel triplet loss function for self-supervised cross-modal feature learning of videos with stereo sounds. To our knowledge, this is the first work to enable learning of a cross-modal representation containing both semantic and spatial information for video with stereo sound.
2) We demonstrate an application of the proposed method to cross-modal retrieval and show that we can successfully retrieve images and sounds with the same sound category and location.

The rest of this paper is organized as follows. In Section 2, we review related work and position this study within audio-visual and self-supervised learning. In Section 3, we propose a method to acquire a feature space that captures both semantic and spatial relationships between images and stereo sounds. In Section 4, we evaluate the proposed method on audio-visual cross-modal retrieval. Finally, in Section 5, we conclude this paper.

Due to the difficulty of providing annotations for both visual and synchronized auditory information, self-supervised learning is often adopted for audio-visual representation learning. A wide variety of pretext tasks have been proposed to capture the audio-visual relationship. The most representative approach is to classify matched and unmatched audio-visual pairs, utilizing the fact that visual and audio inputs from the same video share the same semantic informa… [15], cross-modal retrieval [4], and sound separation [5]. However, the target of these prior works is limited to learn… It has also been adopted in various cross-modal learning methods, such as sound localization [ … ]. Among these datasets, we employ FAIR-Play, which contains various sound categories, as our training dataset. This is because variations in sound categories are necessary to evaluate whether the proposed method acquires general audio-visual semantic and spatial correspondences. However, FAIR-Play suffers from low visual diversity because it was recorded by a limited number of performers in a single music room. To solve this problem, we construct PseudoMUSIC, an additional synthetic dataset of images and stereo sounds, described in Section IV-A1.

Our goal is to learn a cross-modal feature representation between images and stereo sounds that captures the semantic and spatial information of sound sources. To this end, we introduce a novel loss function, called the Stereo Sound Ranking (SSR) loss, to learn a feature space where a spatially and semantically matched image/stereo-sound pair becomes closer than a spatially incongruent pair (flipped stereo sound) using one triplet loss, and a spatially incongruent pair becomes closer than a semantically incongruent pair … Here, $d_{\mathrm{m}}$, $d_{\mathrm{f}}$, and $d_{\mathrm{mm}}$ denote the distances between the image feature and the matched, flipped, and mismatched stereo-sound features $f^{S}_{\mathrm{m}}$, $f^{S}_{\mathrm{f}}$, and $f^{S}_{\mathrm{mm}}$, respectively. The SSR loss $\mathcal{L}_{\mathrm{SSR}}$ is then given as
$$\mathcal{L}_{\mathrm{SSR}} = \mathcal{L}_{\mathrm{m\&f}} + \mathcal{L}_{\mathrm{f\&mm}}, \quad \mathcal{L}_{\mathrm{m\&f}} = \max\left(0,\, d_{\mathrm{m}} - d_{\mathrm{f}} + \alpha_{\mathrm{m\&f}}\right), \quad \mathcal{L}_{\mathrm{f\&mm}} = \max\left(0,\, d_{\mathrm{f}} - d_{\mathrm{mm}} + \alpha_{\mathrm{f\&mm}}\right).$$
The triplet loss $\mathcal{L}_{\mathrm{m\&f}}$ requires that $d_{\mathrm{m}}$ be smaller than $d_{\mathrm{f}}$; the parameter $\alpha_{\mathrm{m\&f}}$ is the separation margin between $f^{S}_{\mathrm{m}}$ and $f^{S}_{\mathrm{f}}$. Similarly, the triplet loss $\mathcal{L}_{\mathrm{f\&mm}}$ requires that $d_{\mathrm{f}}$ be smaller than $d_{\mathrm{mm}}$, and the parameter $\alpha_{\mathrm{f\&mm}}$ is the separation margin between $f^{S}_{\mathrm{f}}$ and $f^{S}_{\mathrm{mm}}$. With $\mathcal{L}_{\mathrm{m\&f}}$ and $\mathcal{L}_{\mathrm{f\&mm}}$, the three distances are placed in the desired order ($d_{\mathrm{m}} < d_{\mathrm{f}} < d_{\mathrm{mm}}$).
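A hedged sketch of how the SSR loss could be implemented is shown below; the use of Euclidean distance and the margin values are assumptions, and the encoders that produce the features are omitted:

```python
import torch
import torch.nn.functional as F

def ssr_loss(f_img, f_s_m, f_s_f, f_s_mm, alpha_mf=0.1, alpha_fmm=0.1):
    """Stereo Sound Ranking (SSR) loss: a sketch of the two triplet terms.

    f_img  : image feature,                          shape (B, D)
    f_s_m  : matched stereo-sound feature,           shape (B, D)
    f_s_f  : channel-flipped stereo-sound feature,   shape (B, D)
    f_s_mm : mismatched (other-video) sound feature, shape (B, D)
    The margins alpha_mf and alpha_fmm are hyperparameters (values assumed here).
    """
    # Distances between the image feature and each sound feature.
    d_m  = F.pairwise_distance(f_img, f_s_m)   # matched pair
    d_f  = F.pairwise_distance(f_img, f_s_f)   # flipped pair
    d_mm = F.pairwise_distance(f_img, f_s_mm)  # mismatched pair

    # L_m&f pushes d_m below d_f; L_f&mm pushes d_f below d_mm,
    # which together enforce the ordering d_m < d_f < d_mm.
    loss_mf  = F.relu(d_m - d_f + alpha_mf)
    loss_fmm = F.relu(d_f - d_mm + alpha_fmm)
    return (loss_mf + loss_fmm).mean()
```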

The input RGB image is first resized to 480 × 240. Then, as data augmentation, we randomly crop 448 × 224 patches. The color and intensity of the image are randomly changed within the range of 0.7 to 1.3.
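One plausible torchvision composition of this preprocessing is sketched below; reading "color and intensity" as saturation and brightness jitter is an assumption:

```python
import torchvision.transforms as T

# A possible composition of the image preprocessing described above.
image_transform = T.Compose([
    T.Resize((240, 480)),        # resize to 480 x 240 (width x height)
    T.RandomCrop((224, 448)),    # random 448 x 224 crop
    T.ColorJitter(brightness=(0.7, 1.3), saturation=(0.7, 1.3)),  # jitter in [0.7, 1.3]
    T.ToTensor(),
])
```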

As sound input, we calculate the log-spectrogram of a …
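A minimal sketch of computing a log-spectrogram from a stereo waveform is given below; the STFT parameters (n_fft, hop length, window) are assumptions, not the values used in the paper:

```python
import torch

def stereo_log_spectrogram(waveform, n_fft=512, hop=160, eps=1e-7):
    """Compute a log-magnitude spectrogram for each channel of a stereo
    waveform of shape (2, T)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return torch.log(spec.abs() + eps)  # shape (2, n_fft // 2 + 1, frames)
```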

In this section, we report the experimental results on cross-modal retrieval to evaluate whether the proposed approach acquires a better feature representation containing both semantic and spatial audio-visual relationships. We present both quantitative and qualitative results for image-to-sound and sound-to-image retrieval. We also visualize the network attention and the learned feature spaces for further analysis.

FAIR-Play consists of 1,871 ten-second videos of one or more persons playing musical instruments. These are videos in which the sound sources come from specific directions or in which multiple sound sources exist, and to which existing methods are therefore difficult to apply. There are nine main musical instruments: banjo, cello, drum, guitar, harp, piano, trumpet, ukulele, and upright bass. In several cases, there are additional percussion instruments, such as the tambourine, and several videos also contain singing voices. The dataset is randomly divided into training, validation, and test subsets with a ratio of 80%/10%/10%. Because FAIR-Play has no ground-truth annotation of sound category and location, we manually annotate the test subset with category labels and their bounding boxes, using the instrument types mentioned above as category labels. For this purpose, from the test split, we further select items in which all sound sources emit sound. Finally, we obtain 352 annotated image/audio pairs, including pairs that are horizontally flipped in both image and sound for additional data.

We built PseudoMUSIC following previous work on generating images and their corresponding spatial sounds [29]. In this method, image patches representing sound sources are placed in specific directions on a background image, and synchronized monaural sounds are converted into spatial sounds arriving from those directions using head-related impulse responses (HRIRs) [31] to generate audio-visual pairs.
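The core of this generation procedure, convolving a monaural source with direction-specific HRIRs and pasting the source patch at a matching horizontal position, could be sketched as follows (function names and patch-placement details are assumptions):

```python
import numpy as np

def spatialize_mono(mono, hrir_left, hrir_right):
    """Convert a monaural waveform into a stereo (binaural) one by convolving
    it with the left/right head-related impulse responses for one direction.
    `mono`, `hrir_left`, and `hrir_right` are 1-D numpy arrays; the HRIRs are
    assumed to come from a measured HRIR set such as the one cited as [31]."""
    left  = np.convolve(mono, hrir_left,  mode="full")[:len(mono)]
    right = np.convolve(mono, hrir_right, mode="full")[:len(mono)]
    return np.stack([left, right], axis=0)  # shape (2, T)

def paste_patch(background, patch, x_offset, y_offset):
    """Place a sound-source image patch onto the background image at a
    horizontal position consistent with the direction used for the HRIR."""
    h, w = patch.shape[:2]
    canvas = background.copy()
    canvas[y_offset:y_offset + h, x_offset:x_offset + w] = patch
    return canvas
```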

Patches are cropped from the images using masks that contain both the performers and the instruments, taken as the union of the human masks and the bounding boxes of the instruments. For the human masks, we leverage the semantic segmentation model of a fully convolutional …

We introduce the CatLoc score to take into account both the categories and the locations of sound sources. An example of the calculation is shown in Figure 6. We assume an image-to-sound retrieval task, where we obtain a query image (Q) and an image corresponding to a retrieved sound (R). As shown in Figure 6, we divide each image into three regions: left, center, and right. We map each region to the instruments contained in it; if a region contains no sound source, No source is assigned, as shown in Figure 6. An instrument is judged to be in a region if the center of the x-coordinate of its corresponding bounding box is within the region. Then, as an extension of the cosine similarity between Q and R, s(Q, R) for CatLoc is calculated as follows.
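Before s(Q, R) can be evaluated, each image must be mapped to its per-region instrument sets. A minimal sketch of that mapping step is given below; treating the regions as equal thirds of the image width is an assumption, and the sketch does not reproduce the s(Q, R) formula itself:

```python
def region_to_instruments(boxes, image_width):
    """Map each of the three regions (left / center / right) to the set of
    instrument labels whose bounding-box center falls inside it.

    boxes : list of (label, x_min, x_max) tuples for the annotated instruments
    Returns a dict like {"left": {"trumpet"}, "center": set(), "right": {"piano"}};
    an empty set plays the role of 'No source'.
    """
    regions = {"left": set(), "center": set(), "right": set()}
    third = image_width / 3.0
    for label, x_min, x_max in boxes:
        cx = (x_min + x_max) / 2.0          # center x-coordinate of the bounding box
        if cx < third:
            regions["left"].add(label)
        elif cx < 2 * third:
            regions["center"].add(label)
        else:
            regions["right"].add(label)
    return regions
```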

We perform image-to-sound and sound-to-image retrieval tasks and evaluate the performance on the basis of both the categories and the locations of sound sources using CatLoc. Table 2 shows the performance of the proposed method and all baseline methods. We report scores both for queries with a single sound source (single) and for queries with multiple sound sources (multiple).
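Retrieval is performed by ranking gallery features by cosine similarity to the query feature, and the ranking is scored with nDCG@5 (Table 2). A hedged sketch is given below, where the graded relevance of each gallery item with respect to the query (e.g., its CatLoc score) is assumed to be precomputed:

```python
import numpy as np

def ndcg_at_k(query_feat, gallery_feats, relevances, k=5):
    """Rank gallery items by cosine similarity to the query feature and
    compute nDCG@k; `relevances[j]` is the graded relevance of gallery
    item j to the query. A sketch only."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    order = np.argsort(-(g @ q))                   # most similar first
    kk = min(k, len(order))
    gains = np.asarray(relevances)[order][:kk]
    discounts = 1.0 / np.log2(np.arange(2, kk + 2))  # 1 / log2(rank + 1)
    dcg = float(np.sum(gains * discounts))
    ideal = np.sort(np.asarray(relevances))[::-1][:kk]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```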

In FAIR-Play, we see that Proposed outperforms the baselines. Specifically, we claim the following superiorities of Proposed over the baselines.

… The score of Proposed is higher than that of Flip. This also indicates that Proposed captures both the categories and the locations of the sound sources.

Over Flip/Mismatch
Proposed is superior to Flip/Mismatch. This indicates that we cannot obtain the desired feature space simply by increasing the similarity between …

TABLE 2. Results of cross-modal retrieval in FAIR-Play and PseudoMUSIC using CatLoc. We adopt nDCG@5 as the metric for retrieval evaluation. "image-sound" represents the result of image-to-sound retrieval, and "sound-image" represents the result of sound-to-image retrieval. "single" and "multiple" indicate the results when the query contains only a single sound source and multiple sound sources, respectively.

Table 3 shows the performance of Proposed and all baseline methods using Cat.

In FAIR-Play, Proposed scores highest in most cases, suggesting that Proposed successfully extracts features on the basis of the semantic information of sound sources. However, Mismatch (Mono) and Mismatch (Stereo) obtain scores comparable to Proposed. This is because the diversity of FAIR-Play is low, and even learning the patterns of the recording room yields a certain level of performance on FAIR-Play. In addition, a significant improvement can be observed with Proposed compared to Mismatch (Mono) and Mismatch (Stereo) in PseudoMUSIC. Because PseudoMUSIC contains …

TABLE 3. Results of cross-modal retrieval in FAIR-Play and PseudoMUSIC using Cat. We adopt the same setting as Table 2.

… Table 4.

For an input image of 224 × 448, we use a mask of size 32 × 32 to nullify image regions. Then the distance between the feature of the nullified image and the corresponding stereo sound feature is calculated. If this distance increases over the original one, the nullified area should represent the sound source, because the sound source is critical in calculating the distance. By shifting the mask over the input image, we obtain a 7 × 14 (= (224/32) × (448/32)) map. This map indicates whether the network focuses on the sound sources.
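A minimal sketch of this occlusion-based attention computation is given below; the image-encoder interface and the use of zeros to nullify a block are assumptions:

```python
import torch

def occlusion_attention_map(image, f_sound, image_encoder, patch=32):
    """Occlusion-based attention: zero out one 32x32 block at a time,
    re-encode the image, and record how much the image-sound feature
    distance grows relative to the unoccluded distance.
    `image` is a (3, 224, 448) tensor; `image_encoder` is the trained image branch."""
    with torch.no_grad():
        f_img = image_encoder(image.unsqueeze(0))
        d0 = torch.norm(f_img - f_sound)                         # original distance
        h, w = image.shape[1] // patch, image.shape[2] // patch   # 7 x 14
        attn = torch.zeros(h, w)
        for i in range(h):
            for j in range(w):
                occluded = image.clone()
                occluded[:, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
                f_occ = image_encoder(occluded.unsqueeze(0))
                attn[i, j] = torch.norm(f_occ - f_sound) - d0     # increase in distance
    return attn  # larger values suggest the block covers a sound source
```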

Several examples of attention maps are shown in Figure 10. In FAIR-Play and PseudoMUSIC, we see that the generated attention maps successfully capture the sound sources, indicating that the proposed method focuses on the sound sources.

We additionally visualize the feature spaces to qualitatively evaluate whether the learned features capture both semantic and spatial information. In Figures 11 and 12, we show the results of applying t-SNE [36] to image and sound features from FAIR-Play and PseudoMUSIC, respectively. We visualize the image and sound feature spaces learned by the proposed method in two ways: by category and by location. First, we show the visualization by sound category. In Figures 11 …

FIGURE 11. Visualization of image and sound features of FAIR-Play. There are three kinds of labels: Category (Single), sound categories for a single sound source; Category (Multiple), sound categories for multiple sound sources; and Location, sound locations. For the Location labels, we use vectors representing the number of sound sources in each region. For example, the label 101 denotes that the left, center, and right regions contain one, zero, and one sound source, respectively.
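For reference, a minimal scikit-learn sketch of producing such a 2-D embedding is shown below; the feature shapes and plotting details are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title):
    """Project pooled image or sound features to 2-D with t-SNE and color the
    points by category (or by a location code such as '101').
    `features` is an (N, D) array; `labels` is a length-N list of strings."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for lab in sorted(set(labels)):
        idx = np.array([l == lab for l in labels])
        plt.scatter(emb[idx, 0], emb[idx, 1], s=8, label=lab)
    plt.legend(markerscale=2, fontsize=6)
    plt.title(title)
    plt.show()
```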