Person Gender Classification on RGB-D Data with Self-Joint Attention

Automatic gender classification has many potential applications, including automatic annotation of images, video surveillance, security, and human-computer interaction. In recent decades, many research works focused on classifying gender using cues from 2D images of the person's frontal view, which limits their application in the real world. Moreover, classifying a person's gender under different views, poses, and scales is still a challenging problem. RGB-D images, containing color and depth images, are more immune to this problem than 2D images. Recent approaches using RGB-D images explored different combinations of feature descriptors from the depth and color images to classify gender. These methods used low-level features extracted from depth and color separately, ignoring the inter-dependency between the two modalities. In this article, we propose a deep learning-based approach using a self-joint attention mechanism for human gender classification on RGB-D images. The proposed attention mechanism is designed to encode the inter-dependent information between the depth and color images to enhance the feature discrimination power. We benchmarked our method on a challenging gender dataset that consists of different views, poses, and scales. The presented method outperforms the state-of-the-art methods with accuracy improvements of 5.2%, 7.5%, and 8.7% on three different test sets.


I. INTRODUCTION
Automatic gender classification has received great attention in the pattern recognition and computer vision communities over the years. It has a variety of applications including image search, automatic annotation of images, demographic research, video surveillance and security, human-computer interaction, psychological analysis, and commercial profiling.
Various gender classification methods have been proposed employing different cues, from appearance information [1] [2] [3], e.g. face or body, to non-appearance information [4] [5], e.g. voice, iris, fingerprint, or gait. Most works focused on the face as it contains reliable and distinctive information. Hence, researchers investigated different techniques to find discriminative representations of the face that ensure reliable predictions. For instance, the Local Binary Pattern (LBP) histogram [6] has been used to extract shape and texture features from the face. However, LBP-type features are limited in describing local texture and cannot capture the globally dominant texture. Other approaches used global feature descriptors like the Histogram of Oriented Gradients (HOG) [7]. Combining representative local features with global features [8] has been proposed to ensure a representation insensitive to the effects of illumination, translation, and rotation.
With the advent of deep learning, Convolutional Neural Networks (CNN) have also been introduced in gender classification [1]. The expressive power of CNN is frequently adopted to extract high-level semantic features, which significantly boost the models' performance.
Despite the above advancements, 2D image-based methods still face constraints and limitations. A clear view of the face is required for most face-based approaches. Further, in the presence of occlusion, dynamic changes in facial pose, and illumination changes, most approaches still suffer from imperfections.
Encompassing both depth and appearance information, 3D data, e.g. RGB-D images, is intrinsically more immune to the aforementioned nuisance factors. The depth information contains rich geometric information about the person in 3D. However, there is limited work using 3D data for gender classification, and depth information is not yet fully exploited in this field. Some research works [9] only focus on extracting pattern features, like gait, from the depth image. Extracting the same type of features separately from the two modalities [10] is another common way to use RGB-D data for gender classification, but the inter-dependency between RGB and depth is not encoded in these features, which limits the potential performance of the models. 3D point cloud data has also been explored for gender classification using geometric features of the body [3]. However, these features are hand-crafted, and their performance degrades in challenging scenarios.
In this work, we propose a deep learning method exploiting the complementarity of the appearance and shape information in RGB-D data for gender classification. Inspired by visual object tracking [11] and object detection [12] methods, we employ a CNN with an attention module to learn representative features from the RGB-D data. The attention module serves as a feature fusion mechanism that finds the inter-dependency between the RGB image and the depth image. This mechanism boosts the discriminative power of the feature representation and enhances the capacity for gender recognition under different view angles, pose changes, and occlusion.
The proposed model contains three key components: a feature extraction module, a feature fusion module, and a prediction head. The feature extraction module consists of two branches of pre-trained CNNs that extract features from the two modalities separately. The self-joint attention module then fuses the features from the two modalities by learning the self-joint attention. A binary prediction head, consisting of fully connected layers, predicts the class of the input images. Compared to competitive methods, our method is distinguished by the following contributions:
• It explores deep neural networks to learn discriminative features from RGB-D images to distinguish men and women.
• It proposes a self-joint attention module to capture the inter-dependency between the RGB image and the depth image. The self-joint attention module effectively fuses the self-attentions of the two modalities, hence improving the feature representation.
• It achieves the best performance compared to the state-of-the-art methods on a challenging RGB-D gender dataset including occlusion, pose variation, and scale-changing scenarios.
The remainder of this paper is organized as follows. Section II briefly reviews related work. Section III describes the proposed gender classification method. Section IV discusses the experimental results. Section V concludes the paper.

II. RELATED WORK
While most previous work focuses on 2D images, some research works have explored 3D data for automatic gender classification. Ballihi et al. [13] proposed classifying a person's gender by measuring the similarity of geometric shapes from the face. Geometric shape features, e.g. geodesic paths and distances between facial curves, are extracted from a 3D face image captured by a laser scanner. These features are then compared to their counterparts in random male and female templates to derive the feature vector signature. Boutellaa et al. [10] compared the performance of Kinect depth and RGB images for different face analysis tasks, including gender classification. Four texture and shape features, i.e. LBP, local phase quantization (LPQ), HOG, and binarized statistical image features (BSIF), are extracted from the depth and RGB images separately, and two corresponding SVM classifiers are used for classification. The results, obtained on three different face datasets, show that the depth image performs close to the RGB image on gender classification. Azzakhnini et al. [14] proposed using face parts to improve the classification. They extracted LBP features from the face parts and trained an individual classifier for each part; a decision fusion scheme is then applied to fuse the classifiers' results. However, LBP is limited in describing local texture and cannot faithfully capture the globally dominant one.
Instead of focusing on the face, other research works look for cues from other body parts, such as the full body [3] and the skeleton [5]. Exhibiting more exposure than the face, the whole body area has been considered an opportune alternative despite its high variability in appearance, pose, and scale. Linder et al. [3] considered classifying a person's gender from the body using 3D data from RGB-D images. They propose a tessellation method that extracts local geometry features from the point cloud of the human body. Global texture descriptors like the Histogram of Oriented Gradients (HOG) and the Histogram of Oriented Depths (HOD), computed on the RGB and depth images respectively, are combined to integrate the information from the two modalities in [15].
Kwon et al. [5] proposed a gait feature called joint swing energy (JSE) extracted from a sequence of 3D positions of skeletal joints to discriminate the gender. The intuition behind the JSE feature is that women and men walk differently which can be visualized from the pelvis and hip motion.
Recent deep-learning methods have also been used in this field. Socher et al. [16] trained convolutional-recursive neural networks (CRNNs) on the RGB image and depth image separately to learn high-level features. The learned features from the two modalities are then concatenated to make predictions. Similarly, Bo et al. [17] employed a multi-layer sparse coding network to learn features from the RGB and depth images separately. The embedding outputs from each layer are used to form a dictionary for classification. Different from these works, we propose employing an attention module to learn and integrate the features from the two modalities. Overall, most of the works dedicated to gender classification use features of the face to discriminate the gender of the person. When using RGB-D data, most approaches still focus on extracting discriminative hand-crafted features. In addition, many methods separate the feature extraction for the two modalities, ignoring the inter-dependent information.

III. METHODOLOGY
The overall architecture of the proposed method is shown in Figure 1. We will elaborate on the three key components: feature extraction module, feature fusion module, and prediction head in Section III-A and we will illustrate the process of the proposed self-joint attention module in Section III-B.

A. OVERALL ARCHITECTURE
Feature Extraction: We adopt a deep convolutional neural network (CNN) [18] as the backbone to extract representative features from the RGB image, I_r ∈ R^{H_0×W_0×3}, and the depth image, I_d ∈ R^{H_0×W_0×1}, respectively. Conforming to the input format of CNNs, I_d is expanded into a three-channel image, I_d ∈ R^{H_0×W_0×3}, by duplicating its single channel. I_r and I_d are resized so that H_0 = W_0 = 256. The pipeline of the whole system is shown in Figure 1. The backbone contains two branches of CNN, for the RGB image and the depth image respectively, and the two branches do not share weights. Several convolution layers (in gray) are frozen, while the convolution layers in blue are fine-tuned during training. For the RGB branch, we only re-train the top convolution layers to learn the task-specific high-level features; because the CNN is pre-trained on a large amount of data, we deem it appropriate to keep the shallower layers frozen to derive an adequate feature representation. For the depth branch, however, we unfreeze more convolution layers from the top towards the bottom for better learning, as the CNN was not exposed to depth data during pre-training. The extracted features are denoted X_r ∈ R^{H×W×C} and X_d ∈ R^{H×W×C} for RGB and depth respectively.
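To make the freezing policy concrete, the per-branch split can be sketched as below. The stage names C1-C5 and the helper function are illustrative rather than taken from the actual code; the split mirrors the setup described in Section IV-C (only C5 is fine-tuned in the RGB branch, while C3-C5 are unfrozen in the depth branch):

```python
def split_trainable_stages(branch):
    """Return (frozen, fine_tuned) VGG-style stage names for a branch.

    The RGB branch only fine-tunes the top stage (C5), while the depth
    branch unfreezes C3-C5 because the pre-trained weights never saw
    depth data. Stage names C1..C5 are illustrative.
    """
    stages = ["C1", "C2", "C3", "C4", "C5"]
    first_trainable = {"rgb": "C5", "depth": "C3"}[branch]
    idx = stages.index(first_trainable)
    return stages[:idx], stages[idx:]

frozen, tuned = split_trainable_stages("rgb")
# frozen == ["C1", "C2", "C3", "C4"], tuned == ["C5"]
```

In a framework like PyTorch, the frozen stages would simply have `requires_grad` disabled on their parameters.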
Feature Fusion: The feature fusion module is meant to encode the complementarity between the RGB and depth images, thus enhancing the feature representation. We propose a self-joint attention module to fuse the information of the two modalities; the details of this module are elaborated in Section III-B. In the self-joint attention module, the extracted features X_r and X_d are first used to learn their self-attention maps f_r and f_d individually. Their self-attention maps are then combined and augmented with a residual connection to the features X_r to obtain the attention-enhanced features Z.
Prediction Head: The prediction head consists of two fully connected layers and an average pooling layer, shown as FCN in Figure 1. The average pooling layer pools each channel of Z into the feature vector Z_a ∈ R^{1×C}. Z_a is then flattened and forwarded into the fully connected layers. The fully connected layers, a 512-neuron layer with a ReLU activation function followed by a 2-neuron layer with a softmax activation function, predict the gender class from Z_a.
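The average pooling step can be sketched in NumPy (a minimal illustration with a toy tensor, not the actual implementation):

```python
import numpy as np

def global_avg_pool(Z):
    """Average over the spatial dims: (H, W, C) -> (1, C) vector Z_a."""
    return Z.mean(axis=(0, 1))[None, :]

Z = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # toy (H=2, W=2, C=3)
Z_a = global_avg_pool(Z)
assert Z_a.shape == (1, 3)
```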

B. SELF-JOINT ATTENTION
Self-attention has been widely adopted to enhance feature discriminability [12]. Inspired by this recent work and the model proposed in the Transformer architecture [19], we propose a self-joint attention module, depicted in Figure 2, to effectively fuse the features from the different modalities.
Given the embeddings of the RGB image, X_r ∈ R^{H×W×C}, and the depth image, X_d ∈ R^{H×W×C}, where H, W, and C are the height, width, and channel size respectively, the self-attention module first projects X_r and X_d into three subspaces using 1 × 1 convolutional layers. The three subspaces are named the query q ∈ R^{H×W×C'}, key k ∈ R^{H×W×C'}, and value v ∈ R^{H×W×C'}, where C' is the dimension of the subspaces. Note that, because the RGB and depth images go through mostly the same process, we simplify the notation in the rest of the paper by setting X = X_r or X_d unless specified. We compute the projections as:

q = W_q X,  k = W_k X,  v = W_v X,    (1)

where W_q, W_k, and W_v ∈ R^{C×C'} are the embedding weights. In the experiments, we set C' = 0.5C. The q and k are then used to find the self-attention map f. First, q and k are reshaped into R^{HW×C'}, and f is obtained by matrix multiplication:

f = q k^T / √C',    (2)

where √C' is empirically used to normalize f, following [19]. Each row f_i of f represents the attention scores at the i-th position. f(X_r) and f(X_d) denote the self-attention maps of the respective modalities.
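As an illustration, the projection and self-attention map computation can be sketched in NumPy, with the features flattened to HW × C and random matrices standing in for the learned 1 × 1 convolution weights (a minimal sketch, not the actual implementation):

```python
import numpy as np

def self_attention_map(X, Wq, Wk):
    """Compute the self-attention map f for one modality.

    X      : (H*W, C) features flattened over spatial positions.
    Wq, Wk : (C, C') projection weights (the 1x1 convs); here C' = C // 2.
    Returns f of shape (H*W, H*W); row i holds the scores at position i.
    """
    q, k = X @ Wq, X @ Wk
    c_prime = q.shape[-1]
    return (q @ k.T) / np.sqrt(c_prime)  # scaled dot-product scores

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
X = rng.standard_normal((H * W, C))
Wq = rng.standard_normal((C, C // 2))
Wk = rng.standard_normal((C, C // 2))
f = self_attention_map(X, Wq, Wk)
assert f.shape == (H * W, H * W)
```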
Depth is a particularly strong cue for object shape and has the potential to enhance the distinct parts of the person. Therefore, we add the self-attention maps of the two modalities to generate a joint-attention map. Because the attention scores of the self-attention maps have high variance, we apply a column-wise softmax normalization to the joint-attention map:

f_n = softmax(λ_1 f(X_r) + λ_2 f(X_d)),    (3)

where λ_1 and λ_2 are the weight factors. They are chosen in a learnable manner, as elaborated in Section III-C, and f_n is the normalized joint-attention map.
We can then find the self-joint attention A ∈ R^{H×W×C'} by multiplying the normalized joint-attention map f_n with the value v_r from the RGB embedding, as in (4):

A = f_n v_r,    (4)

where v_r is reshaped to R^{HW×C'} before the multiplication and the output is reshaped back to H × W × C'.
According to the findings in the Transformer paper [19], it is beneficial to learn the attention from multiple pairs of query, key, and value, which are different projections of X. The attention learned from multiple pairs of projections can jointly attend to information from different subspaces at different positions. These attentions are then concatenated and projected back to the C-dimensional subspace, yielding the multi-head attention. We adopt this multi-head attention in the self-joint attention module as follows:

MultiHead(X_r, X_d) = W_0 [A_1; A_2; …; A_n],    (5)

where [;] denotes channel-wise concatenation, W_0 ∈ R^{nC'×C} is a 1 × 1 convolutional layer, and n is the number of heads. We chose n = 2 in the experiments. Finally, a residual connection adds the multi-head attention to the RGB embedding X_r, yielding the attended features as the output of the self-joint attention module:

Z = X_r + MultiHead(X_r, X_d).
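Putting the pieces together, a single-head NumPy sketch of the self-joint attention, covering the joint map of (3), the attended values of (4), the output projection, and the residual connection, might look as follows (scalar fusion weights and random tensors are stand-ins for the learned quantities):

```python
import numpy as np

def softmax_cols(x):
    """Column-wise softmax, matching the normalization of the joint map."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def self_joint_attention(f_r, f_d, v_r, X_r, W0, lam1=0.5, lam2=0.5):
    """Single-head sketch of the self-joint attention module.

    f_r, f_d   : (HW, HW) self-attention maps of RGB and depth.
    v_r        : (HW, C') value projection of the RGB features.
    X_r        : (HW, C)  RGB features for the residual connection.
    W0         : (C', C)  output projection (stands in for the 1x1 conv
                 that follows the heads).
    lam1, lam2 : scalar stand-ins for the learned fusion weights.
    """
    f_n = softmax_cols(lam1 * f_r + lam2 * f_d)  # normalized joint map (3)
    A = f_n @ v_r                                # attended values (4)
    return X_r + A @ W0                          # project back + residual

rng = np.random.default_rng(0)
HW, C = 16, 8
Cp = C // 2
f_r, f_d = rng.standard_normal((2, HW, HW))
v_r = rng.standard_normal((HW, Cp))
X_r = rng.standard_normal((HW, C))
W0 = rng.standard_normal((Cp, C))
Z = self_joint_attention(f_r, f_d, v_r, X_r, W0)
assert Z.shape == (HW, C)
```

With n heads, n such attended tensors would be concatenated along the channel axis before the output projection.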

C. SELECTIVE FUSION
The self-joint attention module in (3) weights the attention from RGB and depth equally. However, in practice the two modalities do not present the same level of reliability; for example, depth data might be noisy and inaccurate compared to its RGB counterpart. Assigning different weights to different positions of the attention is an effective way to suppress unwanted attention. To this end, we employ a learnable scheme, proposed in [12], to assign the weights for the attention of the different modalities in the self-joint attention module, as shown in Figure 2(b). The fusion weights are also learned from the embeddings of the RGB image, X_r, and the depth image, X_d. A convolution layer with a softmax activation generates a weight map α from X_r and X_d:

α = softmax(W_α [X_r, X_d]),    (6)

where [ ] denotes channel-wise concatenation and W_α is a 1 × 1 convolution layer. The weight map α ∈ R^{H×W×2} is split into two weight maps α_r and α_d ∈ R^{H×W×1}. Each weight map represents the reliability, at all positions, of the attention of the corresponding modality. The selective self-joint attention is then obtained by updating (3) with a weighted sum of the two modalities' attentions:

f_n = softmax(α_r ⊙ f(X_r) + α_d ⊙ f(X_d)),    (7)

where ⊙ is the channel-wise Hadamard product.
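A minimal NumPy sketch of the selective weighting, with a random matrix standing in for the learned 1 × 1 convolution W_α, is shown below; the softmax over the two output channels guarantees that the two reliability maps sum to one at every position:

```python
import numpy as np

def selective_weights(X_r, X_d, W_alpha):
    """Learned reliability weights: a 1x1 conv on [X_r, X_d] followed by
    a softmax over the two output channels.

    X_r, X_d : (H, W, C) modality features; W_alpha : (2*C, 2).
    Returns alpha_r, alpha_d of shape (H, W, 1).
    """
    X = np.concatenate([X_r, X_d], axis=-1)   # (H, W, 2C) concatenation
    logits = X @ W_alpha                      # (H, W, 2), i.e. the 1x1 conv
    e = np.exp(logits - logits.max(-1, keepdims=True))
    alpha = e / e.sum(-1, keepdims=True)      # softmax over the modalities
    return alpha[..., :1], alpha[..., 1:]

rng = np.random.default_rng(1)
H, W, C = 4, 4, 8
X_r, X_d = rng.standard_normal((2, H, W, C))
W_alpha = rng.standard_normal((2 * C, 2))
a_r, a_d = selective_weights(X_r, X_d, W_alpha)
assert np.allclose(a_r + a_d, 1.0)  # the two reliability maps sum to one
```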

D. TRAINING LOSS
Our model is trained in an end-to-end fashion with a binary cross-entropy loss:

L = −(1/N) Σ_{i=1}^{N} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],

where y_i ∈ {0, 1} is the ground truth, ŷ_i is the prediction, and N is the number of samples.
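For reference, the loss can be computed in a few lines of plain Python (a sketch, not the training code; the small epsilon is a defensive addition to avoid log(0)):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N samples (the training loss)."""
    n = len(y_true)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, y_pred)
    ) / n

loss = binary_cross_entropy([1, 0], [0.9, 0.1])  # ≈ -log(0.9) ≈ 0.105
```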

IV. EXPERIMENTAL RESULTS

A. DATASETS
We benchmarked our model on the RGB-D gender dataset [3]. This dataset contains four sets of challenging scenes that capture 124 instances (60 females and 64 males) in different postures, such as standing, walking, and interacting with the camera. In set-1, each person stands at a 2.5 m distance from the camera and faces eight different directions. Set-2 and set-3 consist of a person walking in a circular path in the clockwise and counter-clockwise directions respectively. Set-4 simulates human close-up interaction with the camera, with the person going back and forth in front of it; in this set, due to the limits of the camera's view, only part of the person's body is visible most of the time. The dataset also includes scenarios of people wearing backpacks, carrying luggage, and using wheelchairs. The dataset is recorded using a Kinect-2 camera, whose depth range spans 0.5 m to 4.5 m; the camera remains static during recording. For a fair comparison, we followed the same evaluation protocols used in [3]. The training and testing data are separated by instance: 32 females and 33 males are randomly chosen for training, and the remaining 28 females and 31 males are used for testing. In the training phase, we used 50% of the training data for training and 25% of the training data for validation.
In the testing phase, three different testing sets, namely TestSet-1, TestSet1-3, and TestSet1-4, are constructed from the testing data. TestSet-1 contains only the standing posture. TestSet1-3 combines the standing and walking postures. TestSet1-4 includes the standing, walking, and interaction postures. Thus, TestSet1-3 and TestSet1-4 are more challenging than TestSet-1. This setting aims to assess the performance of the system in the three scenarios mentioned above: standing, walking, and interacting.
The statistics of the gender dataset are summarized in Table 1, which lists the numbers of samples in the training set and the test sets in detail. The distributions of males and females are almost equal in both the training and testing sets.

B. EVALUATION METRIC
The first metric used for gender classification is the overall accuracy (ACC), defined as:

ACC = (TP + TN) / (TP + TN + FP + FN),    (8)

where TP (true positives) counts the correctly classified male samples and TN (true negatives) the correctly classified female samples. The denominator (TP + TN + FP + FN) covers all predictions, both correct classifications and misclassifications.
Other evaluation metrics typically used in classification problems are also adopted in this work: Precision (P), Recall (R), and the F1 score, defined in (9), (10), and (11) respectively.

Precision = TP / (TP + FP)    (9)

Recall = TP / (TP + FN)    (10)

F1 = 2 · Precision · Recall / (Precision + Recall)    (11)
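The four metrics in (8)-(11) can be computed directly from the confusion counts; the following helper is a sketch of that computation (the zero guards are a defensive addition, not part of the definitions):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return acc, p, r, f1

# toy counts, purely illustrative
acc, p, r, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```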

C. IMPLEMENTATION DETAIL
We implemented two stages of model training: a warm-up stage and a backbone training stage. During the warm-up stage, the two convolution branches are frozen and only the self-joint attention module and the prediction head are trained, so that these components obtain a good initialization. The warm-up stage runs for one epoch over the whole training data. After that, the selected convolution layers of the backbone branches are unfrozen: the C5 layer in the RGB branch, and C3-C5 in the depth branch, as mentioned in Section III-A. Both backbone branches use weights pre-trained on ImageNet.
The model is trained with stochastic gradient descent (SGD) using a batch size of 16 pairs of RGB and depth images. The number of training epochs is set to 10, and an early stop takes place when the training loss decreases by less than 0.001 between consecutive epochs. The learning rate is set to 0.01 for both the self-joint attention module and the prediction head in the warm-up and backbone training stages. The backbone learning rate is set to one-tenth of this learning rate since the backbone is being fine-tuned. The learning rate decays exponentially from 0.01 over the last 5 epochs. Weight decay and momentum are set to 0.0001 and 0.9 in the SGD. The whole training and testing pipeline is implemented in Python using PyTorch on a PC with an Intel 1.8 GHz CPU, an NVIDIA GTX GPU, and 8 GB of RAM.
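The learning-rate schedule described above can be sketched as follows; the exponential decay factor gamma is an assumption, as the paper does not state its value:

```python
def learning_rates(epoch, total_epochs=10, base_lr=0.01, gamma=0.5):
    """Sketch of the schedule: the head learning rate stays at base_lr
    until the last 5 epochs, then decays exponentially; the backbone
    always runs at one-tenth of the head rate. gamma is an assumed
    decay factor, not a value given in the paper.
    """
    decay_start = total_epochs - 5
    head_lr = base_lr * (gamma ** max(0, epoch - decay_start))
    return head_lr, head_lr / 10.0

head_lr, backbone_lr = learning_rates(epoch=0)  # 0.01 and 0.001
```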

D. ABLATION STUDY
Backbone Comparison: We evaluated the performance of the model using five standard CNNs as the backbone: VGG19, VGG19 with batch normalization (VGG19bn), VGG16, VGG16 with batch normalization (VGG16bn), and ResNet50. The model is trained and tested using the protocols described in Sections IV-A and IV-C. The training curves of the model using the different backbones are shown in Figure 3, with training accuracy and training loss measured at each epoch. The model converges quickly, within four epochs. Table 2 shows the classification results using these standard CNNs as the backbone. The best and second-best performances are marked in bold and blue respectively for each evaluation metric. VGG19 and VGG16bn rank best or second-best on all the testing sets. Note that the proposed model has close performance across the three testing sets regardless of the backbone, which shows that it is effective at classifying a person's gender in different postures. Figure 4 shows the ROC curves of the model during testing; they confirm that the proposed model can classify the male and female classes with high precision. Figure 5 reports the classification confidence distributions of the model using VGG16bn as the backbone. The red dotted line represents the distribution of false classifications while the blue dotted line represents the distribution of correct classifications. The distributions are calculated using the cumulative distribution function. Because TestSet-1 has a limited amount of data, the distribution curve shown in Figure 5(a) is sparse. From the figures, we can see that the model's confidence exceeds 85% on correct predictions, while its confidence on false predictions is lower.
Impact of the fusion stage: In the proposed method, self-joint attention is applied to the features from the deep convolution layers. The top convolution layers have a larger receptive field than the bottom convolution layers, because the size of the feature maps reduces as we move towards the top layers and the convolution filters therefore cover larger receptive fields. As a result, the features from the top convolution layers capture more semantic and global information, though at the expense of spatial and local information. Hence, the self-joint attention module can have a different impact on the features from different stages of convolution layers. We modified the VGG19 and VGG16bn backbones by inserting the self-joint attention module at different stages to compare the model's performance. Both of the selected CNNs have five stages of convolution, Conv1 to Conv5. We changed the position of the self-joint attention module to fuse the features right after Conv3 or Conv4. These fusion variants of the model are trained using the same settings except for the selection of frozen blocks, which aims to fine-tune the convolution layers to the task-specific domain before the fusion. When the fusion is taken after Conv3, we only froze Conv1, and Conv4 and Conv5 are trained from scratch. Similarly, when the fusion is taken after Conv4, we froze Conv1 and Conv2, and Conv5 is trained from scratch. The rest of the convolution layers are fine-tuned. These variants are also tested using the same protocol.
Table 3 compares the impact of self-joint attention on the model using VGG19 and VGG16bn as backbones. The Fusion Stage column indicates the fusion options: Conv5 WA stands for fusing the features from Conv5 without the attention module, and Conv5 WS stands for fusing the features from Conv5 with the attention module but without the selective fusion scheme. The performances are measured in terms of Accuracy (ACC), Precision (P), Recall (R), F1 score, and computation overhead (CP); the best and second-best performances on each testing set are marked in bold and blue respectively.
From Table 3, we can see that early-stage fusion on Conv4 or Conv3 degrades the model's performance compared to late-stage fusion on Conv5 for both backbones. With the VGG19 backbone, fusing on Conv4 reduces the accuracy by around 4% to 6% on all the test sets. When fusing on Conv3, the accuracy is around 51% and the precision is 0 on all the testing sets, which means the model has no capacity to recognize the male class and can only recognize the female class; thus the Precision, Recall, and F1 score are 0. With the VGG16bn backbone, fusion on Conv3 and on Conv4 yields the same accuracy and precision, both lower than 50%. This implies that the models cannot distinguish the two classes and are biased toward the male class; thus the Recall is 100% while the accuracy and precision are very low. These results show that self-attention at the late stage can focus on the important parts of the features, while at the early stage it adds attention to non-important regions of the features. This is explicit when we visualize the heatmaps of the predictions during testing: Figure 6(a-c) shows the heatmaps on the same testing samples using the VGG19 backbone with the corresponding fusion stages.
In late-stage fusion like Conv5, the heatmaps have high responses on the discriminative parts of the person. In contrast, in early-stage fusion like Conv3 and Conv4, the heatmaps usually have high responses on less discriminative parts or even the background, and thus the model's performance degrades. In terms of computation overhead, the results in the last column of Table 3 show that the computation cost is almost the same for the different fusion stages, because the proposed self-joint attention module is lightweight.
Impact of self-joint attention: We also investigated the effectiveness of the self-joint attention by comparing the model with and without the self-joint attention module, again using the VGG19 and VGG16bn networks as backbones. In the variant without the self-joint attention module, we fused the features from Conv5 using element-wise addition with a softmax normalization. The training and testing use the same protocols mentioned in Section IV-C.
In Table 3, the performances of the model without the self-joint attention module are listed in the rows marked "Conv5 WA". The comparison between "Conv5" and "Conv5 WA" shows that the self-joint attention module improves the model's performance on the three testing sets, especially when using VGG16bn as the backbone. Note that when the model uses VGG16bn as the backbone without the self-joint attention module, it has the same performance as the early-stage fusion variants: the model is unable to distinguish the two classes. Conversely, this proves that the proposed self-joint attention module does enhance the model's discrimination capability for person gender classification.
The heatmaps in Figure 6 show further impact of the self-joint attention module. The model using the self-joint attention module, as shown in Figure 6(a), concentrates more on the body parts than the model without it, as shown in Figure 6(d).
Impact of selective fusion: To look into the impact of the selective fusion, we trained and tested the model using the self-joint attention module without the selective fusion scheme described in Section III-C. The training and testing follow the same protocols. The testing results are reported in the "Conv5 WS" rows of Table 3.
Comparing the results of "Conv5 WS" and "Conv5", the model using VGG19 as the backbone has close performance on the three testing sets. When using VGG16bn as the backbone, the selective fusion scheme outperforms the variant without it in terms of accuracy, precision, and F1 score, especially on TestSet-1. The comparison shows that adding the learnable selective fusion to the self-joint attention module improves the model's performance.

E. STATE-OF-THE-ART COMPARISON
We compared our method with state-of-the-art methods on the same dataset, following the experimental protocols described in Section IV-A. We report the two best-performing configurations of our model, using VGG19 and VGG16bn as backbones, as described in Section IV-D.
The performance comparison is depicted in Table 4. We can observe that our approach reaches the best accuracy of 96.3%, 93.9%, and 94.1% on the three testing sets respectively. Compared to the state-of-the-art methods, our method improves the accuracy by around 5.2% on the simplest TestSet-1 and by large increments of around 7.5% and 8.7% on the two more challenging test sets (TestSet1-3 and TestSet1-4), respectively. Our method significantly boosts the performance on the challenging testing sets that include the walking and interaction postures. Moreover, compared to the others, our method has close accuracy across the three testing sets, while the other methods show significant gaps between the testing sets. This shows that our method is robust and invariant to changes in posture.

F. STATISTICAL SIGNIFICANCE TEST
A statistical significance test is conducted to compare the performance of the proposed method and the state-of-the-art methods. The statistical significance test quantifies the likelihood of the observed performance scores, which here refer to accuracy, under the assumption that they were drawn from the same distribution. If this assumption, also called the null hypothesis, is rejected, it implies that the difference in the scores is statistically significant. The most commonly used statistical significance test is the paired Student's t-test, described in (12). The null hypothesis in this test is that there is no difference between the performance of the two models.
t = (S̄_1 − S̄_2) / √(σ_1²/n + σ_2²/n),    (12)

where S̄_1 and S̄_2 are the means of the two models' performance scores, σ_1 and σ_2 are their standard deviations, and n is the number of testing sets. Table 5 shows the t values of the comparison between the proposed method and the state-of-the-art methods. The values are calculated at the 0.05 significance level. Considering the critical value t_critical = 4.303, we can see that all the t values are larger than 4.303. Also, all the p-values are smaller than the significance level, so the null hypothesis is rejected. Therefore, the results show that the proposed model and the state-of-the-art models perform differently, and the difference in their performances is statistically significant.
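The t value of (12) can be computed from the raw score lists as follows (a sketch in plain Python; the input scores below are purely illustrative, not the values behind Table 5):

```python
import math

def t_statistic(scores_a, scores_b):
    """t value from the means and sample standard deviations of two
    models' accuracy scores; both lists have the same length n."""
    n = len(scores_a)
    mean = lambda s: sum(s) / n
    var = lambda s, m: sum((x - m) ** 2 for x in s) / (n - 1)  # sample variance
    m1, m2 = mean(scores_a), mean(scores_b)
    v1, v2 = var(scores_a, m1), var(scores_b, m2)
    return (m1 - m2) / math.sqrt(v1 / n + v2 / n)

t = t_statistic([0.96, 0.94, 0.95], [0.91, 0.86, 0.85])  # illustrative scores
```

In practice, a library routine such as the t-test functions in `scipy.stats` would also provide the corresponding p-value.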

G. PERSON ORIENTATION TEST
We sampled eight subsets, Orrt1 to Orrt8, from the testing sets to evaluate the model's sensitivity to the orientation of the person. The eight subsets contain the person in eight different orientations, as described in Table 6. Each subset contains one sample from each of the 28 females and 31 males standing in the same orientation. We evaluated the model with VGG16bn as the backbone on these subsets. The results show that the accuracies of most of the subsets are about the same, around 96.88%. These observations show that our proposed model is robust to changes in the person's orientation.

V. CONCLUSION
In this paper, we proposed a self-joint attention method for gender classification of persons captured by an RGB-D camera in challenging scenarios with different views, poses, and scales. The method uses an attention mechanism to fuse the features from the two modalities by encoding the inter-dependent information between them, so that the model can learn more discriminative features. We conducted a thorough ablation study to set the parameters of our system and evaluate its performance. The proposed method outperforms the state-of-the-art on all three testing sets of a public RGB-D gender dataset, with significant margins of 5.2%, 7.5%, and 8.7% respectively.