Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach

Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to its potential practical applications. Over the past decade, many efforts have been devoted to improving verification performance from human faces alone, neglecting other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, which consists of familial talking facial videos recorded under various scenarios. Based on this dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML), which consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. In addition, we evaluate human verification ability on a subset of TALKIN-Family, which indicates that humans achieve higher accuracy when they have access to both faces and voices; machine-learning methods can nevertheless outperform humans both effectively and efficiently. Finally, we outline future work and research opportunities with the TALKIN-Family dataset.


I. INTRODUCTION
FACIAL kinship verification (FKV) aims at automatically determining whether two individuals have a kin relationship from their facial images or videos [1]. Since the seminal work by Fang et al. [2], FKV has recently gained increasing attention [1] due to its wide range of potential applications, including finding missing persons, border control and customs, criminal investigations [3], family photo album organization, improving the performance of face recognition systems, and social media analysis [4]. Although closely related to face verification, which has been well developed and turned into real-world products [5], FKV technology is, to the best of our knowledge, not yet capable of performing at a level sufficient for practical applications, owing to the unique challenges discussed in great detail in [1].
Existing research in kinship verification has focused extensively on exploring kinship features from the visual modality of facial images and videos [3], [6], [7]. Certainly, facial similarity plays an important role in FKV, as facial similarity and kinship judgments are highly correlated according to recent psychology research [8]. However, studies have demonstrated that voice similarity is also related to kinship judgments [9], [10], [11], [12], [13]. For instance, according to [9], the vocal tract shape that affects voice properties is genetically determined; consequently, subjects with a kin relation tend to have similar voices. In addition, studies on human perception of kin voices indicate that humans have the ability to judge kinship by listening to the speaking voice [14], [15]. Despite this evidence, the voice modality has not yet been explored for FKV.
In recent years, audio-visual fusion has been shown to be an effective way of improving performance on various problems, including emotion recognition [16], speech recognition [17], event detection [18], and biometrics [19], such as speaker identification and speaker authentication. Based on the aforementioned discussion, it is natural to ask: in addition to the
visual modality, is it beneficial to explore other modalities (the voice channel specifically) for the problem of FKV? To answer this question, in this article we carry out the first study aiming to build an audio-visual kinship verification framework that further improves FKV performance. To this end, we need to address two main challenges: 1) collection and publication of a new audio-visual dataset, as no such dataset is available, and 2) development of novel approaches specifically designed to improve verification performance. High-quality datasets enable rapid progress on the FKV task. In our previous preliminary attempt [20], we collected the TALKIN dataset. However, TALKIN has some obvious limitations, namely a limited number of training samples, limited diversity in environmental conditions and kinship categories, and mono-annotation with binary kinship labels only. To address these limitations, we establish a new audio-visual kinship dataset, called TALKIN-Family, consisting of facial videos with synchronous speech whose properties differ from those of the existing dataset. TALKIN-Family contains 246 unique family trees and 1012 individuals with rich annotations of family relationships, age, gender, and scene conditions. The size of a family tree ranges from 2 to 14 subjects, aged between 5 and 81 years old. Each subject has multiple talking facial videos of about 10 s in length recorded under different conditions. Overall, TALKIN-Family contains 9.2 h of video.
In response to the second challenge, we design a novel framework for the task of audio-visual FKV. It encompasses two main steps: 1) extracting appropriate features and 2) integrating modality information. Representing the modalities, that is, audio and video, in an appropriate way is crucial before fusion. Visual features have been widely studied for FKV [21]. Comparatively, very few acoustic features have been designed specifically for kinship verification, because this direction remains largely underexplored. However, well-known acoustic representations such as Mel-frequency cepstral coefficients (MFCCs [22]) and data-driven features [23], [24] are commonly applied in the speech community. Similar to the correlation between facial similarity and FKV, we propose to compute voice similarity and set new benchmark methods for FKV using acoustic features.
When fusing audio-visual features for FKV, our benchmarks and investigation show that handling the intermodal discrepancy and weighting the modalities are essential for exploiting informative knowledge. Motivated by the adversarial learning [25] strategy and the self-attention mechanism [26], we propose a fusion method based on deep neural networks (DNNs), called unified adaptive adversarial multimodal learning (UAAML), which addresses these challenges. UAAML jointly considers multimodal feature learning and kinship attention weights with similarity learning. In particular, we introduce an L2 normalization layer [27] to generate unified features before fusion, making network training stable and efficient.
This article substantially extends our previous work [20] in terms of both the proposed dataset and the evaluated methods, including the benchmarks and the proposed fusion method. Human performance on kinship verification from faces and voices is also studied. The rest of this article is organized as follows. Section II briefly reviews related work. Section III introduces the details of the TALKIN-Family dataset. Section IV presents our proposed UAAML approach. Section V shows extensive experiments and results. Section VI concludes this article with possible future directions.

II. RELATED WORK
A. Kinship Verification

1) Kinship Datasets: Table I compares the main characteristics of existing kinship datasets, which we categorize by data modality. In the early years, kinship datasets were mainly image based; among these, the FIW dataset is the largest and most comprehensive. The facial video kinship datasets are those in which only facial information is available, including UvA-NEMO Smile [32], KFVW [33], and KIVI [34]. The video-and-audio kinship datasets include the TALKIN dataset [20] and the FIW MM dataset [35]. TALKIN is the first audio-visual kinship dataset. It is organized in a pairwise structure and lacks family structure, and each subject has only one video sample recorded under unconstrained conditions. FIW MM is more recent and has a larger data volume, with 200 families and multiple samples collected for some subjects under wild conditions. Compared with the TALKIN and FIW MM datasets, the dataset proposed in this article, TALKIN-Family, is superior in data volume and environmental scenarios, and includes more families. Moreover, TALKIN-Family contains subjects speaking both fixed and free content.
2) Kinship Verification Methods: Kinship verification has been studied for more than ten years [1], mainly on still facial images. Early image-based works focused on extracting facial features with handcrafted descriptors [30], [36] and measuring similarity with common distance metrics such as cosine similarity [36]. Metric-learning-based methods [3], [37] were then proposed to separate kin and non-kin pairs. With the development of deep learning, many methods with different motivations emerged [1]. The first end-to-end deep learning architecture for kinship verification was proposed by Zhang et al. [38] in 2015; the network takes two stacked facial images as input and predicts kinship at the top layer. Later, Li et al. [39] proposed a Siamese network with a similarity metric to learn discriminative features for kinship verification. Based on the Siamese CNN architecture, different strategies were explored for reasoning about the relations between two facial features. Dahan and Keller [40] computed kinship verification scores by fusing the face embeddings collected from the last FC layer. Li et al. [41], [42] introduced a star-shaped graph to model the facial features; relational reasoning is then performed on the graph via a recursive message-passing scheme. Kinship datasets have the intrinsic issue of limited positive samples and far more negative samples. To exploit all possible training samples, Li et al. [43] proposed a meta-mining approach to sample unbalanced training batches. Alternatively, Song and Yan [44] proposed the KinMix method, which augments positive kinship samples with linear sampling at the feature level. Extensive experiments showed that the refined training batches could effectively boost the model's learning capability.
On the basis of image-based studies, researchers proposed to study kinship verification from facial videos in order to capture multisource information. Compared with image-based studies [21], works on facial videos [32], [33], [34] exist only at a limited scale [1]. Initially, constrained facial video datasets were used: Dibeklioglu et al. [32] proposed to fuse facial appearance with dynamic facial features extracted from a smiling video clip. However, collecting standard smiling faces under unconstrained conditions is relatively hard, so researchers turned to kinship verification from unconstrained facial videos. Kohli et al. [34] extracted spatiotemporal kin information from videos, and Yan and Hu [33] studied metric-learning methods on unconstrained videos for kinship verification. However, the works above neglect the additional kinship cue that resides in the human voice.

B. Acoustical Study for Kinship
In our daily lives, people with a kin relation can have similar voices; for instance, it is sometimes hard to distinguish between father and son over the phone. This phenomenon has attracted researchers from many domains into the field, and the vocal similarity of kin has been studied explicitly. The earliest genetics-of-voice research dates back to the 1990s: Sataloff [9] demonstrated that voice function is related to the phonatory organ structures. These physical features are genetically determined, which intuitively indicates that the human voice is also genetically determined. Later, psychological studies assessed human perception in recognizing kin voices. Studies by Van et al. [14] and Taylor [15] showed that humans can verify kinship from voice when listeners are provided with recordings of specific sentences. Motivated by this research, acoustic studies quantitatively confirmed voice similarity within kinship by measuring and comparing various acoustic characteristics [10], [11], [13]. Although many works have studied the vocal similarity of kin, voice has not been directly applied to automatic kinship verification.

C. Multimodal Learning
Multimodal fusion methods can exploit complementary sources of information. Different sources of information are typically integrated through early fusion (feature level) or late fusion (score or decision level) [45]. Feature-level fusion using concatenation or aggregation is often considered to provide a high level of accuracy. However, feature patterns may be incompatible, and such fusion increases system complexity. Techniques for score-level fusion using deterministic (e.g., average fusion) or learned functions are commonly employed but are sensitive to the impact of score normalization methods on the overall decision boundaries.
When considering multimodal fusion, one main challenge is eliminating the modal discrepancy and learning a joint feature space in which the features can be fused more effectively. Recent generative adversarial networks (GANs) [25], [46] have achieved significant success in mapping a data distribution into a desired one through adversarial training. Inspired by this, Mai et al. [47] built encoder-decoder networks for different modal inputs to learn latent feature embeddings, introducing adversarial learning on the encoders to learn a joint feature space for the different modalities. Zhou and Shen [48] studied the multimodal clustering problem. They developed the end-to-end adversarial attention multimodal clustering (EAMC) method, which consists of an adversarial learning module and a modal attention module to align the feature distributions and quantify the modal importance weights; a dedicated clustering objective added on top of the network guides its training.

III. TALKIN-FAMILY DATASET

A. Motivations
Benchmark datasets serve as the common ground for performance measurement and comparison of various algorithms, and help the field progress toward challenging problems. On the other hand, dataset biases can introduce unwanted information that a system takes as class cues, yielding overconfident predictions [49], [50]. To ensure our kinship dataset is applicable, possible familial biases, such as recording devices, recording conditions, and speech content, are considered during the data collection procedure. We found that video-sharing websites such as YouTube usually contain free-style speaking videos but lack fixed-text speech.
To fill this gap, we chose to collect TALKIN-Family offline. The video recording task was distributed to the participating families, and family members recorded the qualifying videos by following the provided instructions. We introduce the collection steps in detail in the following section.

B. Collection Pipeline
The overall collection pipeline is shown in Fig. 1. The participants were asked to record frontal talking facial videos of themselves and of biologically related family members. To eliminate family-related biases (e.g., recording conditions, recording devices, and speech content), we set up several recording protocols.
Participants: The subjects involved in the recording task within one family must be biologically related. Each family must contain at least two subjects, including collateral relatives and direct relatives across generations; this means that collateral relatives alone cannot be considered an isolated family. Subjects from different families must have no biological connection.
Environment Conditions: The background must be quiet, without noise or voices from other people, and only one subject appears in each video. To further ensure that the videos within one family do not share a single background, we ask each subject to record videos against both a white background and a nonwhite one. We refer to the white background as "white" and the nonwhite background as "wild," as shown in Fig. 1. This eliminates the familial background bias [50] by allowing kin pairs to be generated across different backgrounds.
Speech Content: The speaker verification literature distinguishes between text-dependent and text-independent speaker verification. When the speaking content is fixed, the task is text-dependent speaker verification [51]; in text-independent speaker verification, subjects talk freely without explicit cooperation [52]. In our dataset, we consider both scenarios, in order to broaden the usage of TALKIN-Family while avoiding family-specific spoken utterances. The participants were provided with specific content (a Mandarin New Year greeting); beyond that, they could speak freely as long as the content differed from the provided one. We abbreviate text-dependent and text-independent as TD and TI. Therefore, each subject has four talking videos, referred to as BACKGROUND_CONTENT (i.e., White_TD, White_TI, Wild_TD, and Wild_TI), as shown in Fig. 1.
Shooting Device: The videos were recorded with smartphone cameras. The phone had to be held still during recording, and any retouching function was turned off. Within one family, multiple phones (more than one) had to be used for recording, to avoid device bias. Each video lasts about 10 s.
Data Packing: We set the principal subject as ROOT ("me"), who is one of the younger generation. Family members are backtracked from the root, and the family tree is generated and labeled as in Fig. 1. Each participating family has a family folder named FXXX (i.e., F001-F246). In addition, gender and age labels were also collected. Under the family folder, each subject has a subfolder named ID_GENDER_AGE (e.g., P1_female_6), where ID refers to the subject's family role defined by the family tree, GENDER is male or female, and AGE is an integer giving the subject's age. Under each subject's folder, the four facial videos are stored.
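The naming convention above is easy to consume programmatically. The following is a minimal sketch (our own illustration, not code shipped with the dataset) that parses a subject subfolder name of the ID_GENDER_AGE form into its fields:

```python
import re

def parse_subject_folder(name):
    """Parse a TALKIN-Family subject folder name of the form ID_GENDER_AGE,
    e.g. 'P1_female_6' -> ('P1', 'female', 6). The regex is our own
    assumption about the allowed field values."""
    m = re.fullmatch(r"(?P<id>[^_]+)_(?P<gender>male|female)_(?P<age>\d+)", name)
    if m is None:
        raise ValueError(f"not a valid subject folder name: {name!r}")
    return m["id"], m["gender"], int(m["age"])

subject_id, gender, age = parse_subject_folder("P1_female_6")
```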

C. Data Preparation
In the TALKIN-Family dataset, each video clip was recorded with the cooperation of the participants, and only one subject appears in each video. Therefore, speaker diarization [53], which determines "who spoke when," is not required before data preprocessing. We preprocess the visual and audio data separately, as described below.
Visual Data: We first extract facial frames from each video; faces are automatically detected, cropped, and aligned as in [54]. Note that some recorded videos were shot in landscape mode or upside down, so face orientation detection and image rotation are needed in such cases during face detection. Facial frames are then resized to 224 × 224 and encoded by face-image descriptors. Section V details the face descriptors we employ in the experiments, including traditional descriptors and deep encoders.
Audio Data: Since each recording starts as the subject begins to talk and ends right after the subject stops, we extract the audio directly from the videos and save it as WAV files. The signal is converted and normalized to a single channel at a 44.1-kHz sampling rate.

2) Data Details: The length of each video clip is about 10 s. In total, TALKIN-Family has 9.2 h of videos and about 1 million facial frames. All the subjects are from China and speak Mandarin Chinese (some with accents).

E. Problem Establishment
We address audio-visual kinship verification as a binary classification problem: given a pair of signals [e.g., a pair of video sequences with speech utterances, (X, Y)], the objective is to automatically determine whether they have a kin relation. In practice, we represent X and Y using recording-level representations. The kinship score, a numerical indicator that takes higher values for kin pairs, is obtained by computing a similarity score between the two feature representations. Three levels of generation (Siblings, Parent-Child, and Grandparent-Grandchild) are considered in our experiments.

IV. PROPOSED METHOD
The overall framework of the proposed method is shown in Fig. 2. It consists of modality-specific feature generators, modal fusion, and kinship assignment. The modality-specific networks are encouraged to exploit the distinct properties of each modality. The modal fusion is then trained to eliminate the cross-modal discrepancy so that the feature vectors from different modalities can be fused more effectively. Once the fused features are obtained, a contrastive loss encourages the network to learn compactness within kin pairs and separation between non-kin pairs.

A. Preliminaries
X_i and Y_i represent the ith sample pair; each sample comes with both audio and visual modalities, denoted by X_i^a, X_i^v and Y_i^a, Y_i^v, respectively. The pairwise label l_i denotes whether the ith sample pair has a kin relation: l_i = 1 indicates that X_i and Y_i are kin, and l_i = 0 indicates that they are not.
Our method has two feature encoders: 1) the audio encoder E_a(·; θ_a) and 2) the visual encoder E_v(·; θ_v), parameterized by θ_a and θ_v. The audio and visual data are fed into their modality-specific encoders, and the feature representations are expected to be modality invariant. This is achieved by adversarial learning with the discriminator D(·; θ_d), where θ_d denotes the discriminator parameters. In addition, to make the features attend to and emphasize effective kinship traits, an attention mechanism is proposed to learn the weights for feature-level fusion; the weight vector w is computed by a multilayer perceptron (MLP). The entire network is designed in a Siamese fashion, sharing weights for the two inputs X_i and Y_i. To preserve the kin discrimination of the network, we employ the contrastive loss L_kin so that the model learns closeness for kin and separation for non-kin.

B. Modality-Specific Networks
Different sources of data are difficult to combine at the raw data level. Therefore, we first adopt modality-specific networks to transform the face and voice data into a latent feature space. Following [20], the network inputs are the facial image and the spectrogram computed from the corresponding speech. The residual network (ResNet) architecture [55] is adopted for both the face and voice backbones, described as follows. We take sample X_i as an example; the same applies to the input Y_i.
1) Visual Subnet: The visual backbone directly adopts InsightFace with the ResNet-34 architecture [56], [57]. Given an input facial image X_i^v ∈ R^{D×H×W}, we extract the corresponding feature embedding as x_i^v = E_v(X_i^v; θ_v), where W and H indicate the spatial size and D is the number of channels. As the facial image is cropped and resized to 112 × 112, the generated facial feature is 512-D.
2) Audio Subnet: The audio backbone employs a ResNet-50 pretrained on VoxCeleb2 [23], [58] to extract vocal features from the spectrogram inputs. We extract a 3-s utterance clip and convert it to a single channel with a 16-kHz sampling rate. The spectrogram is generated by a sliding Hamming window of width 25 ms and step 10 ms. The audio network input X_i^a therefore has size 512 × 300, and the corresponding output is x_i^a = E_a(X_i^a; θ_a). Similarly, we obtain the audio and visual embeddings for Y_i.
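As an illustration of this audio front end, the following sketch computes a Hamming-window magnitude spectrogram for a 3-s, 16-kHz clip. The 1024-point FFT and the dropped DC bin are our assumptions chosen to yield 512 frequency bins; the exact padding convention that produces the stated 512 × 300 input size is not specified in the text.

```python
import numpy as np

def spectrogram(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=1024):
    """Magnitude spectrogram with a sliding Hamming window
    (25 ms window, 10 ms step, mono 16-kHz input)."""
    win = int(sr * win_ms / 1000)          # 400 samples
    hop = int(sr * hop_ms / 1000)          # 160 samples
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (n_frames, 513)
    return spec[:, 1:].T                   # drop DC bin -> (512, n_frames)

clip = np.random.default_rng(0).standard_normal(3 * 16000)  # 3-s utterance
S = spectrogram(clip)   # roughly the 512 x 300 input described above
```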

C. Model Fusion
The modal fusion module fuses the audio and visual features for comprehensive estimation. It consists of the unified feature operation, modal alignment, and feature-fusion attention learning.
1) Multimodal Adversarial Learning: Features generated from different modalities generally have different scales and norms, and directly combining them leads to poor fusion performance, since larger features can overwhelm smaller ones. Rather than carefully tuning the network parameters, Liu et al. [27] found that normalizing the features before fusion improves model stability. Therefore, before learning modality-invariant features, we add an L2 normalization layer that transforms each feature into a unified one. Formally, the audio feature x_i^a and visual feature x_i^v are normalized as \hat{x}_i^a = x_i^a / \|x_i^a\|_2 and \hat{x}_i^v = x_i^v / \|x_i^v\|_2. The audio and visual encoders learn multimodal representations that may exhibit a large gap between modalities. Inspired by recent GANs [25], we introduce the discriminator D(·; θ_d) to distinguish audio from visual features. Since the audio and visual features have different dimensions, we first feed them into fully connected layers FC_a(·) and FC_v(·) that map them to a common length, and then perform two-class classification. Labeling audio as 1 and visual as 0, the discriminator is optimized with the cross-entropy objective min_{θ_d} L_d = −[log D(FC_a(\hat{x}_i^a)) + log(1 − D(FC_v(\hat{x}_i^v)))]. On the other side, the modality-specific networks are trained to confuse the discriminator with the opposite modal labels by minimizing the adversarial loss min_{θ_a, θ_v} L_adv = −λ_adv [log(1 − D(FC_a(\hat{x}_i^a))) + log D(FC_v(\hat{x}_i^v))], where λ_adv is a weight coefficient. The discriminator guides the modal encoders to learn same-distribution representations through min-max adversarial learning.
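A toy numerical sketch of these two ingredients, using a fixed logistic "discriminator" with random placeholder weights (in the actual method, D and the FC mapping layers are learned networks trained by min-max optimization):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Unit-norm feature so that audio and visual features share a common
    scale before fusion (the L2 normalization layer)."""
    return x / (np.linalg.norm(x) + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim = 8
w_d = rng.standard_normal(dim)   # toy discriminator weights (placeholder)

x_a = l2_normalize(rng.standard_normal(dim))   # normalized audio feature
x_v = l2_normalize(rng.standard_normal(dim))   # normalized visual feature

p_a, p_v = sigmoid(w_d @ x_a), sigmoid(w_d @ x_v)
# Discriminator loss: cross-entropy with the correct modality labels
# (audio = 1, visual = 0), minimized over the discriminator parameters.
loss_disc = -(np.log(p_a) + np.log(1.0 - p_v))
# Adversarial loss for the encoders: same cross-entropy with flipped labels,
# so minimizing it pushes the two feature distributions together.
loss_adv = -(np.log(1.0 - p_a) + np.log(p_v))
```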
2) Feature Fusion Attention: After obtaining the modality-invariant representations, we concatenate the audio and visual features of the two inputs as x_f = [\hat{x}_i^a; \hat{x}_i^v] and y_f = [\hat{y}_i^a; \hat{y}_i^v], where [·; ·] denotes the concatenation operator. In particular, we design a fusion attention module to emphasize the informative vector entries. It consists of an MLP with a Sigmoid output, producing a weight vector w of the same dimension as x_f and y_f, computed as w = σ(FCs(x_f)), where σ(·) is the Sigmoid function and FCs(·) denotes two stacked fully connected layers. The concatenated features x_f and y_f have dimension 2560; the first FC layer reduces the dimension to 1024, and the second restores it to 2560. After the Sigmoid activation, the weight vector is applied element-wise to obtain the adaptive feature fusion x = w ⊙ x_f (and y analogously).
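The attention module can be sketched as follows; the ReLU between the two FC layers is our assumption, since the hidden-layer activation is not stated, and the random weights stand in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_attention(feat, w1, w2):
    """Fusion attention over a concatenated audio-visual feature:
    a 2560 -> 1024 -> 2560 MLP followed by a Sigmoid, giving an
    element-wise weight vector applied to the input feature."""
    hidden = np.maximum(0.0, w1 @ feat)    # FC 2560 -> 1024 (+ assumed ReLU)
    weights = sigmoid(w2 @ hidden)         # FC 1024 -> 2560, then Sigmoid
    return weights * feat                  # adaptive fused feature

rng = np.random.default_rng(0)
w1 = rng.standard_normal((1024, 2560)) * 0.01   # placeholder parameters
w2 = rng.standard_normal((2560, 1024)) * 0.01
x_f = rng.standard_normal(2560)                 # concatenated [audio; visual]
x_fused = fusion_attention(x_f, w1, w2)
```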
x and y are the fused representations used for the kinship analysis. We denote the attention parameters as θ_att.

D. Learning Kinship Awareness Embedding
To perceive the kinship traits, that is, similarity between kin and difference between non-kin, we adopt contrastive learning to train the network in a supervised way. Integrating the kinship label l_i, the network objective can be expressed as min L_kin = l_i d^2 + (1 − l_i) max(0, M − d)^2, where M is the margin and d = ‖x − y‖_2.
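A direct sketch of this contrastive objective for a single pair (the margin value below is arbitrary, chosen only for illustration):

```python
import numpy as np

def contrastive_loss(x, y, label, margin=1.0):
    """Contrastive loss on a pair of fused features: pull kin pairs
    (label 1) together, push non-kin pairs (label 0) at least `margin`
    apart, with d the Euclidean distance between the two embeddings."""
    d = np.linalg.norm(x - y)
    return label * d ** 2 + (1 - label) * max(0.0, margin - d) ** 2

x = np.array([0.0, 0.0])
y = np.array([0.3, 0.4])                      # distance 0.5
kin_loss = contrastive_loss(x, y, label=1)    # penalizes the distance
nonkin_loss = contrastive_loss(x, y, label=0) # penalizes being inside margin
```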
The training procedure is summarized in Algorithm 1. In each training step, the two multimodal encoders and the discriminator are first trained alternately in an adversarial way, without kin labels involved. Then, the entire network is jointly trained using the kin labels.
During testing, we collect the fused features from the network. The cosine similarity sim(x, y) = (x · y)/(‖x‖ ‖y‖) is calculated to represent the distance between two subjects, and a threshold applied to sim determines whether the two inputs have a kin relation, as done in [21].
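The test-time decision rule reduces to a few lines; the threshold value below is a placeholder, since in practice it is tuned on the evaluation protocol:

```python
import numpy as np

def kinship_decision(x, y, threshold=0.5):
    """Cosine similarity between two fused features, thresholded to a
    binary kin / non-kin decision. The threshold here is illustrative."""
    sim = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    return sim, sim >= threshold

sim, is_kin = kinship_decision(np.array([1.0, 1.0]), np.array([1.0, 0.9]))
```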

V. EXPERIMENTS AND RESULTS

A. Implementation Details

1) Data Preparation:
We first generate kin pairs with the 11 relationship types described in Section III, where the paired samples have different backgrounds and speech content. After obtaining the kinship pairs (positive pairs), we split them into a maximum of five folds to conduct K-fold validation [21]. Within each fold, we randomly generate non-kinship pairs as negative samples, where the non-kin subjects come from different families and are biologically unrelated. The negative samples have the same size as the positive samples, and there is no family overlap between folds. The experimental data distribution of audio-visual kinship verification in the wild is shown in Table II. The reason why some relations, such as SS and BS, cannot be divided into five folds is that the negative samples suffer from an insufficient number of families. We perform data preprocessing on all videos for visual and audio data as introduced in Section III. Algorithm 1 alternates, until convergence, between t steps that update the discriminator parameters θ_d by ascending its stochastic gradient and d steps that update θ_a, θ_v, and θ_att by their stochastic gradients, finally returning θ_a, θ_v, and θ_att. Since the video length varies from video to video and neighboring video frames differ only slightly, we extract and align 60 facial frames and audio frames for each video. Due to head variations and orientations, some frames are lost for a few subjects.
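The pair-generation protocol (family-disjoint folds, cross-family negatives of the same size as the positives) can be sketched as follows; the toy family/pair structure and all names are illustrative assumptions, not the dataset's actual contents:

```python
import random

def make_folds(kin_pairs, n_folds=5, seed=0):
    """Split positive (kin) pairs into folds with no family overlap, then
    sample an equal number of negative pairs whose two subjects come from
    different families. `kin_pairs` maps a family id to its list of
    (subject_a, subject_b) kin pairs. Assumes each fold holds at least
    two families, so cross-family negatives always exist."""
    rng = random.Random(seed)
    families = sorted(kin_pairs)
    rng.shuffle(families)
    folds = []
    for k in range(n_folds):
        fams = families[k::n_folds]                # disjoint family subsets
        pos = [p for f in fams for p in kin_pairs[f]]
        subjects = [(f, s) for f in fams for p in kin_pairs[f] for s in p]
        neg = []
        while len(neg) < len(pos):                 # same size as positives
            (fa, a), (fb, b) = rng.sample(subjects, 2)
            if fa != fb:                           # biologically unrelated
                neg.append((a, b))
        folds.append((pos, neg))
    return folds

# Toy data: 10 families, one kin pair each, subjects named FAMILY_ROLE.
toy = {f"F{i:03d}": [(f"F{i:03d}_P1", f"F{i:03d}_P2")] for i in range(10)}
folds = make_folds(toy)
```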

B. Compared Methods
To verify the effectiveness of our proposed method on the TALKIN-Family dataset and to compare the performance of unimodal and multimodal approaches, we run baseline methods for vocal and facial kinship verification as well as four fusion methods.
1) Voice Features: We employ two methods for audio analysis: 1) GMM-UBM [59] and 2) i-vector [60]. We extract MFCCs with 12 cepstral coefficients from the audio samples. A UBM, a GMM with 128 mixture components, is trained on the training set. For the GMM-UBM [59] method, the kin-pair model is created from the UBM using maximum a posteriori (MAP) estimation, and the verification likelihood is the log-likelihood ratio between the speaker models and the registered speakers' GMMs. In the i-vector [60] method, the UBM is trained with expectation-maximization (EM) on the MFCCs; the i-vector is obtained by MAP point estimation, and its dimension is reduced by linear discriminant analysis (LDA). We compute the similarity between two speakers as the cosine similarity of their i-vectors.
Besides, we also evaluate the pretrained deep models as feature encoders.
pyannote-S: pyannote.audio [61], [62] is an end-to-end generic PyanNet model trained on the VoxCeleb [24] and VoxCeleb2 [23] datasets. The trained model samples the utterance with a sliding window to generate overlapping 512-D features. pyannote-S means that we evaluate the performance using only a single vocal feature.
pyannote-A: For each utterance clip, we average all audio features in the sequence as its final feature representation.
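The averaging step can be written in a couple of lines. The trailing L2 normalization is our own addition for convenience (so the later cosine comparison becomes a dot product), not something stated in the text.

```python
import numpy as np

def utterance_embedding(window_embeddings):
    """Collapse overlapping sliding-window embeddings (e.g. the 512-D
    pyannote features) into one utterance-level vector by averaging,
    then L2-normalize the result."""
    e = np.asarray(window_embeddings, dtype=np.float64).mean(axis=0)
    return e / np.linalg.norm(e)
```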
VGG_M: The model architecture is based on VGG_M [24] and takes the audio spectrogram as input. The spectrogram is computed with the same method described in Section IV. VGG_M is trained on the VoxCeleb dataset [24] for the task of speaker verification. The final audio feature has 1024 dimensions.
ResNet-50: The model is trained on the VoxCeleb2 dataset [23], and the audio embedding is collected from the FC layer with a length of 2048.
2) Face Features: Furthermore, deep CNN models pretrained on large-scale face datasets are also widely used in kinship verification to encode facial images into output embeddings.
SphereFace [67], [70] is a CNN model trained with the angular softmax (A-Softmax) loss to learn more discriminative features. SphereFace is trained on the CASIA-WebFace face dataset [71]. The deep features are collected from the FC1 layer with 512 dimensions.
The VGG-Face network [68] is trained on a large face dataset with 2.6 million images of 2622 people. We feed the facial image into the network and collect features from the fc7 layer.
InsightFace [56], [57]: Compared to SphereFace, InsightFace utilizes the ArcFace loss, which has fewer parameters yet a better classification margin. The model is trained on the MS1MV2 dataset. The facial frames are fed into the pretrained model to obtain the final 512-D feature embedding.
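As a sketch of the additive angular margin behind the ArcFace loss that InsightFace trains with: the target-class angle θ is replaced by θ + m before the scaled cosine enters the softmax. The scale s = 64 and margin m = 0.5 are common defaults that we assume here; they are not taken from the article.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Compute ArcFace-style logits for a batch (illustrative sketch)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = np.clip(x @ w, -1.0, 1.0)                      # cos(theta) per class
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    cos[rows, labels] = np.cos(theta[rows, labels] + m)  # penalize the target class
    return s * cos                                       # scaled logits for softmax
```

Because the margin shrinks only the target-class logit, the network must push embeddings well inside their class cone, which yields the discriminative features used here for kinship comparison.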
3) Fusion Methods: We evaluate an early fusion method, two late fusion methods, and a Siamese fusion method for audio-visual kinship verification.
Early Fusion: The multiview features are concatenated into a single fused feature for the later similarity comparison.
Late Fusion (Mean): For late fusion, the similarity scores are computed separately for each modality. Mean fusion then averages the scores across the modalities to obtain the final decision score.
Late Fusion (Max): Rather than averaging, max fusion takes the maximum score as the final decision score.
Siamese Fusion: Siamese fusion [20] introduces one FC layer to learn the fusion scheme. By adding a contrastive loss on top of the network, the FC layer automatically learns a fusion weight for each element.
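The three non-learned baselines can be sketched for one candidate pair as follows; the L2 normalization before concatenation and the function names are our choices for the sketch.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(_unit(a) @ _unit(b))

def fuse(face_a, face_b, voice_a, voice_b):
    """Early fusion concatenates normalized modality features before one
    similarity; the late fusions combine the per-modality cosine scores."""
    s_face, s_voice = cosine(face_a, face_b), cosine(voice_a, voice_b)
    early = cosine(np.concatenate([_unit(face_a), _unit(voice_a)]),
                   np.concatenate([_unit(face_b), _unit(voice_b)]))
    return {"early": early,
            "late_mean": (s_face + s_voice) / 2.0,
            "late_max": max(s_face, s_voice)}
```

Note that with unit-norm inputs, early fusion of two modalities equals the mean of the per-modality cosines, so a learned scheme such as Siamese fusion can only help by re-weighting modalities or individual feature elements.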

C. Experimental Settings
1) Implementation Details:
We implement our network with the PyTorch library. Since the released pretrained InsightFace network and ResNet-50 (audio) model are implemented with the MXNet and MatConvNet libraries, respectively, we first convert those models into PyTorch using open-source code from GitHub [74], [75]. To initialize our network parameters, we use the ResNet-34 weights trained on MS1MV2 [56] for the visual network and the ResNet-50 weights trained on VoxCeleb2 [23] for the audio network. Parameters in the other layers are initialized with random weights. For training the proposed method, the network parameters are optimized by the Adam optimizer with a learning rate of 1e-6, weight decay of 1e-4, and a mini-batch size of 50. We train the entire network for 250 iterations. The program runs on two NVIDIA V100 GPUs (32 GB). The hyperparameter λ_adv determines the degree of multimodal discriminative information used during training; with a small λ_adv, insufficient modality discrimination is applied. We set λ_adv to 1 [76].
2) Evaluation Protocol: In our experiments, we compute the cosine similarity between two features, and a threshold is used to classify whether two subjects have a kin relationship [21]. The verification accuracy and receiver operating characteristic (ROC) curves are used to evaluate performance.
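A minimal version of this protocol might look like the sketch below, which sweeps the decision threshold over the observed cosine scores and reports the best verification accuracy. The exhaustive sweep is our simplification; the article's K-fold protocol may pick the threshold differently (e.g., on held-out folds).

```python
import numpy as np

def verification_accuracy(scores, labels, threshold):
    """Accuracy of 'kin if score >= threshold' (labels: 1 kin, 0 nonkin)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    return float(np.mean((scores >= threshold) == (labels == 1)))

def best_threshold_accuracy(scores, labels):
    """Sweep candidate thresholds at the observed scores."""
    return max(verification_accuracy(scores, labels, t) for t in scores)
```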

D. Experimental Results and Comparison
This section presents the experimental results of kinship verification on the TALKIN-Family dataset for both single and multiple modalities.
1) Single-Modal Kinship Verification: Table III shows the kinship verification accuracy for each single modality. For voice-based kinship verification, ResNet-50 has the best performance, while the traditional I-vector and GMM-UBM methods perform comparatively poorly. Notice that grandparent-grandchild results are not provided because the UBM is hard to converge with limited data; a possible solution is to employ external data to train the UBM. Regarding the pyannote model, the performance can be slightly improved by averaging all vocal features within one utterance.
For kinship verification from faces, deep models outperform traditional descriptors by a large margin. Among the traditional approaches, the MRNML metric-learning method [3] has a better average accuracy, and the spatial-temporal descriptor LBP-TOP also outperforms the averaged frame-level features. Among deep learning models, InsightFace surpasses the others by a large margin except for GFGS, where VGG-Face achieves the best performance. Better models boost kinship verification performance due to more accurate feature representations. Therefore, we apply ResNet-50 (voice) and InsightFace (face) as the backbone networks for the fusion.
2) Multimodalities: As presented at the end of Table III, the proposed UAAML method improves over the single modalities for all 11 kin relations and at the average level. Table IV also compares the results of several baseline fusion methods, and Fig. 3 visualizes the corresponding ROC curves. It can be seen that fusing the audio and visual features improves performance, demonstrating that the vocal and facial features complement each other. In addition, the proposed fusion method improves on the single-modality accuracies and the baseline fusion methods: average accuracy improves by about 3.5% and 2.0% over the single modalities and the baseline fusion methods, respectively. Although the baseline fusion methods cannot beat UAAML at the average level, score fusion shows slightly higher accuracy for relations such as BS, NS, and GMGS. This motivates future work that further explores multifusion strategies for audio-visual kinship verification.

3) Ablation Study: To analyze the effect of the different components of UAAML, we ablate the proposed method and evaluate the effectiveness of each.
a) Fusing different features: To study the fusion of various single-modality features, we include two single-modal features for each of the face and voice modalities in the evaluation. The VGG_M and ResNet-50 models are used for vocal features, and the FaceNet-V and InsightFace models for facial features. To simplify the process, we use feature fusion to evaluate effectiveness. L2 normalization is applied before fusion to reduce the discrepancy between different features. Table V shows the averaged verification accuracy when combining the various features. The experimental results show that fusing the InsightFace (face) and ResNet-50 (voice) features achieves the best performance. However, when combining the comparatively weak VGG_M (voice) or FaceNet-V features, the system is easily affected by the poor features. Therefore, the InsightFace and ResNet-50 encoders are used as our backbone networks.
b) Roles of different losses and components: We further evaluate the effectiveness of adversarial learning, the contrastive learning loss, and the attention layer: 1) w/o att + L_kin denotes the network that discards adversarial learning and the attention layer and is trained with the contrastive learning loss; 2) w/o att + L_adv denotes the adversarial network without the attention layer, trained with the self-supervised learning strategy [77] without kin labels; it learns the consistency between modalities to embed the semantic multimodal features; and 3) w/ att + L_kin denotes the network that discards the adversarial learning module but keeps the attention layer and is trained with the kinship loss. Table VI reports the verification accuracy. The results demonstrate the necessity of the adversarial module, attention layer, and kinship loss; the proposed UAAML further improves the performance compared with the three variants. These results also convey that the adversarial and attention modules are the key components for audio-visual kinship verification.
c) Normalization layer: We train the model with the same effort but without the normalization layer. As shown in Fig. 4, the performance drops significantly, showing that the normalization layer is crucial to stabilize training and improve performance.

E. Evaluation on the TALKIN Dataset
In this section, we further evaluate the effectiveness of the proposed UAAML method on the TALKIN dataset for audio-visual kinship verification. The TALKIN dataset has four parent-child kin relations, that is, FS, FD, MS, and MD. For each kin relation, there are 100 pairs of kin facial videos and 100 pairs of nonkin videos, and five-fold validation is performed. Similar to the previous experimental settings, we apply InsightFace with the ResNet-34 architecture [56], [57] (face) and ResNet-50 [23], [58] (voice) as the single-modality backbone networks. Table VII presents the performance of the single-modality methods and the different fusion methods, and Fig. 5 shows the corresponding ROC curves. The experimental results demonstrate that the proposed UAAML method obtains the highest accuracy compared to both the single-modality and the baseline fusion methods; only the baseline score fusion method (max) shows 1.0% higher accuracy in the FS relation. Considering that the videos in TALKIN contain additive background noise, the performance of the audio modality is relatively worse and, thus, brings limited fusion improvement. Therefore, for audio-visual kinship verification, especially in real-world settings, more robust voice models are needed [78].

F. Influence Factors
Audio-visual kinship verification is affected by many factors. From the perspective of biological attributes, these include the depth of the genealogical tree, age, and gender. From the data acquisition conditions, the factors include the recording background and the speech content. We analyze how those factors influence the performance through the corresponding experimental results.
1) Genealogical Tree: Fig. 6 shows the averaged verification accuracy for three generations of kinship with different inputs. It can be seen that the deeper the genealogical tree, the more the performance on faces drops. One reason is the age difference between kin, as plotted in Fig. 6: siblings of the same generation have the smallest age difference of about ten years on average, parent-child pairs have about a 26-year difference, and second-generation subjects have an average age difference of about 50 years. As people age, the appearance of their faces varies in structure and texture. These differences reduce the inner similarity of the kin image pairs and, consequently, the verification performance, whereas acoustic features compensate for the facial aging issue to some extent, especially for the grandparent-grandchild relationship.
2) Gender Factor: The relation-specific evaluation setting allows us to analyze the influence of gender. From Table IV, we observe that the influence of gender is significant for siblings, where the opposite-gender relation (BS) has comparatively lower accuracy than the same-gender cases (BB, SS). Regarding the parent-child and grandparent-grandchild relations, the influence of gender is more limited, and its impact is lower than that of the texture difference brought by the age gap. On the other hand, on the TALKIN dataset, the influence of opposite genders can be found in the parent-child relations (Table VII), as some kinship videos are recorded at similar ages (e.g., FS pairs: [25,26], [43,44], [45,46], [71,72], etc.) rather than at the same time (as in TALKIN-Family).
3) Recording Conditions: The data collection conditions potentially influence system performance, such as the speech text in speaker verification [51] and the same-photo issue in kinship verification [50], by providing latent clues. To control one variable at a time, we generate kinship pairs that: 1) speak the fixed text but with different backgrounds (text-dependent) and 2) are recorded against the white background but with different speaking content (white background). The data statistics for the two scenarios are listed in Table VIII. Fig. 7 shows the experimental results under the text-dependent and white-background conditions with different inputs. The background influence is clearly visible in Fig. 7(a): the white-background setting yields higher accuracy. Two reasons explain this: 1) the noise effect is eased with the white background and 2) the white-background videos within one family are possibly recorded at the same place with similar illumination, which could cause data bias [50]. This also explains why we asked the participants to take videos with two backgrounds, one of which is white, to easily distinguish same from different backgrounds. As illustrated in Fig. 7(b), the fixed-text setting achieves performance comparable to the free-speaking setting due to the equal similarity within kin and nonkin pairs. Overall, audio-visual fusion improves performance under all conditions, while under the two semicontrolled environments, the improvement of fusion is comparably limited.

G. Human Performance
We test human performance on kinship verification using a subset of TALKIN-Family. Twelve volunteers from China participated in the experiments. Before the test, they had never seen or known any information about the dataset subjects. They were asked to answer whether the given clips show a kin relation. In general, we set up three types of tasks, namely, kinship verification from: 1) facial videos without voice; 2) voice; and 3) facial videos with voice. For each task, we select two kin pairs and two nonkin pairs from each of the 11 kin relations, resulting in 22 positive pairs (kinship) and 22 negative pairs (nonkinship) in total. To avoid the recall of previously seen information, we designed the set such that there is no subject overlap between positive and negative pairs or among the three subtasks. Fig. 8 illustrates the human performance results: Fig. 8(a) shows the overall accuracy and the distribution of subject performance, and Fig. 8(b) compares the true positive (TP) and true negative (TN) accuracy. An important finding is that humans tend to verify kinship better from voice than from face, while given synchronous facial videos and voice, they make even better judgments. Fig. 8(a) indicates that combined face and voice information enables human observers to make a more stable assessment with higher accuracy. Fig. 8(b) shows that humans have higher accuracy in verifying the negative samples, and multimodal information helps humans to recognize nonkinship, thus improving the overall accuracy.
It is worth noting that it takes about an hour for one observer to complete the entire test, while the machine-learning methods spend much less time on inference. We conclude that machine-learning methods can outperform human ability both efficiently and effectively.

VI. CONCLUSION AND FUTURE WORK
Audio-visual kinship verification is a new and promising research topic. In this article, we systematically investigate the problem of audio-visual kinship verification. We establish the most comprehensive audio-visual kinship dataset to date, called TALKIN-Family. Moreover, baseline experiments on single-modal kinship verification are performed, of which vocal kinship verification is evaluated for the first time. On top of the single-modal methods, we provide a deep learning framework, called UAAML, to jointly learn modal-invariant and adaptively fused features for kinship verification with a contrastive loss. The extensive experimental results demonstrate the effectiveness of audio-visual fusion compared to unimodal methods, and our proposed fusion method outperforms the baseline methods. The human performance study shows that, given both faces and voices, people achieve higher kinship verification accuracy than with faces or voices only.
We expect this work to set a milestone for audio-visual kinship verification. To stimulate future study, in this section, we investigate the limitations of our dataset and the proposed approach, discuss future directions, and point out how TALKIN-Family can be applied in research beyond kinship verification.

A. Limitations and Future Work
1) TALKIN-Family Dataset:
Offline data collection has drawbacks, such as the difficulty of increasing the data volume, the cost of human effort to collect the data, and the homogeneous ethnicity distribution. Given this, future work will consider speeding up the data collection procedure through crowdsourcing, saving manual labor while increasing data diversity. Since TALKIN-Family only contains people from China, ethnicity adaptation and mitigating demographic bias [79] when validating on other ethnicities form another future research direction.
2) UAAML: The main limitation of the proposed UAAML is that model training demands high computational resources; at inference time, however, the proposed method is comparable to simpler methods such as naive fusion. In our experimental results, late fusion shows better performance for some kin relations. We argue that the reason lies in scores that are better classified by late fusion. This inspires us to explore hybrid fusion methods in the future to combine the advantages of both. More effective and efficient fusion methods are needed for audio-visual kinship verification, such as multimodal regularization [80] and multimodal joint representation [81] learning complementary semantics.

B. Research Opportunities With the TALKIN-Family Dataset
This work focuses on audio-visual kinship verification based on the TALKIN-Family dataset. Beyond that, the proposed dataset can also be used to study kinship more broadly. The TALKIN-Family database contains family information, subject labels, environment context, etc. Those data attributes allow researchers to explore kinship verification with intensive analysis, for example, at the family level, on the effects of age and gender, and with the background context and speaking content. Building on audio-visual kinship verification, the study could also be extended to other kinship recognition problems, such as trisubject kinship verification [30], family recognition, family retrieval [21], and child face/voice generation. Furthermore, the robustness of multimodal kinship recognition remains an open issue, for example, against adversarial attacks [82], spoof attacks [83], [84], and poor conditions (e.g., modality missing and cross-modal feature learning). Data bias, fairness [79], and privacy-aware studies [85] are also worthy of further attention given the growing concern over data privacy protection. TALKIN-Family can also be helpful in audio-visual studies, such as talking face generation [86] and face-voice matching [87], and in human perception studies on kin faces and voices.
In conclusion, we expect TALKIN-Family to motivate researchers from different fields to advance audio-visual kinship studies, techniques, and applications and to enable further development.

Fig. 1. Overall collection pipeline for the TALKIN-Family dataset. The data in TALKIN-Family are collected offline by recruiting a number of families. Subjects participate in the data collection with their family. Each subject has four facial talking videos under two background and two speech conditions. TALKIN-Family is organized by family structure, and within each family, people are labeled according to our kinship labeling rules. Then, we preprocess the audio and facial video separately. To study audio-visual kinship verification, we define the problem with different kin relation types.

Fig. 2. Proposed UAAML method. Standard methods in the speech field, MFCCs [22] and DNNs, are used to embed the audio features.

Algorithm 1: Training Procedure of Our UAAML
Input: Training set D; initialized modality-specific encoders E_a, E_v; hyperparameter λ_adv
Output: The parameters θ_a, θ_v, θ_att
1: while not converged do
2:   for t-steps do
3:     update the parameters θ_d of the discriminator by ascending their stochastic gradients
4:   end for
8:   for d-steps do
9:     update the parameters θ_a, θ_v, θ_att by ascending their stochastic gradients
     end for
14: end while
15: return θ_a, θ_v, θ_att

Fig. 3. ROC curves of different methods on TALKIN-Family under the wild condition, obtained on (a) sibling, (b) parent-child, and (c) grandparent-grandchild kin relations.

Fig. 5. ROC curves of different methods on the TALKIN dataset obtained on parent-child kin relations.

Fig. 6. Line charts illustrating the verification accuracy of different modalities. The bar chart shows the age gap between kin subjects.

Fig. 7. Performance of kinship verification on TALKIN-Family under different conditions. (a) Performance comparison for visual kinship verification under white and nonwhite backgrounds. (b) Comparison of single-modal and multimodal performance under different data recording settings.

Fig. 8. Human performance on a subset of TALKIN-Family from face, voice, and face&voice, respectively: (a) overall verification performance with different modalities and (b) TP and TN distributions of human performance under different settings.

TABLE I: Main characteristics of existing kinship datasets. We sort the datasets by data modality. In the early years, many image kinship datasets were proposed; then, some video datasets with aligned facial information followed. The dataset proposed in this article, TALKIN-Family, consists of both visual and vocal information and is the most comprehensive one by far.

TABLE II: Data statistics for studying audio-visual kinship verification in the wild on the TALKIN-Family dataset. # folds is the number of validation folds for each kin relation. # families and # subjects represent how many families and individuals are involved in studying the specific kin relation. # kin pairs is the number of kin pairs at the subject level. # videos is the total number of videos used, usually four times the number of subjects, since each subject has four facial videos. # sample pairs is the number of frame-level sample pairs for each kin relation. Applicable also to Table VIII.

TABLE III: Average accuracies (%) for K-fold kinship verification with voices, faces, and the fusion of voices and faces under the wild conditions of the TALKIN-Family dataset.
TABLE IV: Comparison of different fusion methods on the TALKIN-Family dataset for audio-visual kinship verification in the wild, with average accuracies (%) for K-fold validation. The first two rows are single-modal verification performance, with "A" short for audio and "V" for video. Applicable also to

TABLE V: Comparison of fusing different single-modal features. (A1 is the vocal feature collected from ResNet-50 trained on VoxCeleb2; A2 is the vocal feature obtained from VGG_M trained on VoxCeleb; V1 is the facial feature extracted from InsightFace; and V2 is the facial feature collected from FaceNet-V.)

TABLE VI: Loss and module analysis of the UAAML method on the TALKIN-Family dataset. Att is the abbreviation of feature attention.
Fig. 4. Comparison of the effect of training the network with the same effort with and without normalization.

TABLE VII: Comparison of different fusion methods on the TALKIN dataset, with average accuracies (%) for five-fold validation.

TABLE VIII: Data statistics for audio-visual kinship verification under the conditions of fixed speech and clean background on the TALKIN-Family dataset.