Efficient Lightweight Attention Network for Face Recognition

Although face recognition has achieved great success thanks to deep learning, many factors degrade the quality of faces in the wild, such as pose changes, age variations, and lighting changes, and these can seriously reduce recognition performance. In this work, an effective approach called Efficient Lightweight Attention Networks (ELANet) is proposed to address the impact of pose and age on face recognition performance. First, similar local patches are particularly important when the geometry and appearance of a face change drastically. To exploit them, spatial attention is used to capture important locally similar patches, and channel attention is employed to focus on features with different levels of importance. Furthermore, an Efficient Fusion Attention (EFA) module is designed to achieve better performance while alleviating the computational effort required to fuse spatial and channel attention. Second, multi-scale feature learning is necessary because pose or large expression changes can cause similar recognition regions to appear at different scales. For this purpose, a pyramid multi-scale module is presented, which constructs a series of features at different scales via pooling operations. Third, to unite low-level local detail information with high-level semantic information, the features of different layers are fused by Adaptively Spatial Feature Fusion (ASFF) instead of simple addition or concatenation. Compared to recent lightweight networks, the ELANet improves performance by 1.83% and 2.17% on the CPLFW and VGG2_FP datasets, respectively, and by 0.92% on the CALFW dataset. The ELANet addresses the impact of pose and age on face recognition performance with few parameters and little computation, and is suitable for embedded and mobile devices.


I. INTRODUCTION
Significant progress has been achieved in the field of face recognition by applying deep convolutional neural networks (DCNNs) [1], [2], [3]. However, most works do not simultaneously consider the importance of hierarchical multiscale features and local regions for face recognition.
Many factors influence the performance of face recognition, such as posture, age, illumination, occlusion, or quality variations. For example, as shown in Fig. 1, the face images in the second row are subject to different unconstrained factors, which remain a challenge for current face recognition algorithms even though humans recognise them easily. These problems may lead to great changes in facial geometries and appearances. In such cases, similar local face areas are particularly important. Several works depend on face landmarks to obtain local face information [4], [5]. However, landmark detection may not work due to posture, age, illumination, occlusion, or quality variations. As illustrated in Fig. 1, changes in pose make parts of the face disappear; blurred images make the whole face area unclear; changes in lighting cause detailed information about the face to be lost. Different face regions contribute to the final recognition results to different degrees. Spatial attention is incorporated to automatically characterize informative regions and extract local information. As presented in [6], the Local Aggregation Network (LANet) is used to locate the most distinguishable face regions and achieves good performance on datasets relating to posture and age. Furthermore, channel attention aims to highlight important channels and suppress channels with less information. Low-level feature channels contain local detail information, whereas high-level channels represent high-level semantic information. Thus, the Squeeze-and-Excitation Network (SENet) [7] adaptively recalibrates the channel-wise feature responses by modelling the interdependencies between channels and brings significant performance improvements to CNN models at a slight increase in computational cost. When CNNs are employed to extract face features, the most recognisable face regions should be given more weight, and similarly, the feature channels with the most distinguishing feature information should be assigned more weight. It is therefore intuitive to combine spatial and channel attention to obtain better performance. At the same time, to alleviate the computational effort caused by their fusion, an Efficient Fusion Attention (EFA) module is introduced into our model.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
Representing features at multiple scales is useful in various vision tasks [6], [8], [9]. Multi-scale features are necessary for face recognition because local face regions may have various sizes or shapes due to dramatic facial changes. As shown in the third and fourth rows of Fig. 1, mouths have various sizes in columns 1, 2, 3 and 6, and eyes appear at different sizes in columns 3, 4 and 5. Most methods fail to consider that useful feature information is not always fixed within the same layer. To address this challenge, [8] extracts multi-scale features with a hierarchical pyramid-based diverse attention network and uses diverse learning to alleviate the redundant response problem. This method also achieves state-of-the-art results on posture and age challenges. However, local discriminative face regions may appear in different layers. Thus, a pyramid multi-scale module is proposed, which scales features in the same layer to different sizes to extract more local features.
High-level features have larger receptive fields and represent high-level semantic information. Therefore, most previous works do not use low-level features with local information but directly use the last convolutional layer. These approaches inevitably lack local details or low-level small-scale information. To alleviate these problems, [10] obtains local features from the first network layer and global features by principal component analysis. Compared to MobileNet [11], this method extracts more comprehensive feature information. [12] combines low-level and high-level feature information to gain different representations. However, these methods all use simple addition or concatenation.
This paper proposes Efficient Lightweight Attention Networks (ELANet), suitable for face recognition on mobile or embedded devices. The contributions of this paper are as follows:
1) The proposed ELANet can learn multi-scale features from the same layer and local features from different layers. The proposed pyramid multi-scale module is embedded in the ELANet and encourages the model to learn multi-scale features by dividing the same feature into features of different scales through pooling operations. The ELANet has few parameters and computations and is well suited for deployment on mobile or embedded devices.
2) Spatial attention and channel attention are introduced simultaneously in the EFA module. An SENet module assigns different weights to different channels according to their importance levels, highlighting the discriminative channels while suppressing channels with less information. The LANet module locates the most discriminative face regions. The EFA module achieves better performance while alleviating the computational effort required for fusion and allows the model to focus on local features.
3) To unite low-level local detail information with high-level semantic information, the features of different layers are fused by Adaptively Spatial Feature Fusion (ASFF) instead of simple addition or concatenation. The proposed approach fuses hierarchical features to obtain more comprehensive feature information.
The rest of the paper is organized as follows. Section II briefly reviews the work related to face recognition and attention mechanisms. Section III describes the ELANet in detail. Section IV provides the experimental results and discusses the performance of the ELANet in detail. Section V gives our conclusions and discusses future work.

II. RELATED WORK
A brief review of face recognition and attention mechanisms is presented.
A. FACE RECOGNITION
DCNNs have achieved great success in the field of face recognition. Due to its simplicity and probabilistic interpretability, the softmax loss function is regarded as one of the important components in CNNs. Thus, early face recognition approaches mainly used the softmax loss function, but it cannot effectively lessen the within-class variance and expand the between-class variance. Several novel loss functions are proposed in [1], [3] to further reduce the within-class variance and increase the between-class variance. However, most of them do not effectively take into account multi-scale representations and local features of the face.
VOLUME 10, 2022

1) MULTI-SCALE FACE RECOGNITION
Multi-scale feature representation is of great importance for face recognition. [6] learns multi-scale representations from two perspectives: on the one hand, it uses convolutional kernels of different sizes to extract multi-scale information in the same layer; on the other hand, it connects the output of each layer to learn multi-scale features across layers. [9] replaces a set of 3 × 3 filters with a smaller set of filters, while connecting the different filter groups in a hierarchical residual-like style. [13] uses different CNN structures at the same level to extract multi-scale features. However, most of them overlook that features may cover a larger range of scales in a given layer. Thus, the proposed pyramid multi-scale module divides the same feature into features of different scales through pooling operations.

2) LOCAL FEATURE REPRESENTATIONS
Local representation learning can effectively handle postural and age variations. [4] trains multiple CNNs in facial regions, but the overall features of the face are ignored. [5] unites multiple face region features with global face features by sharing shallow and mid-level features. [14] solves for pose variation by simultaneously learning feature alignment and feature extraction through deformable convolution with spatial displacement fields. Most methods are inevitably dependent on face landmarks. However, landmark detection may not work due to posture, age, illumination, occlusion, or quality variations.

B. ATTENTION MECHANISMS
Attention has been widely investigated and plays a very important role in computer vision [7], [15], [16]. Attention assigns more weight to the most informative features while suppressing the less useful ones. However, few studies have applied attention to the general face recognition task. Residual-attention and self-attention were combined to address cross-age face recognition in [17]. Efficient attention was introduced to recognize faces under various poses in [18]. Two attention blocks were used to adaptively aggregate feature vectors into a single feature for video face recognition in [19]. An improved SENet module was applied in [20], and self-attention was employed to capture more detailed information in [21]. The LANet and SENet were introduced sequentially to automatically locate the most distinguishing face regions in [6]. However, most of these approaches apply only a single form of attention or apply attention sequentially. In this work, to achieve better performance, channel attention and spatial attention are fused. Furthermore, the proposed EFA is used to relieve the computational overhead caused by fusion.

III. A NEW NETWORK
The proposed ELANet model mainly contains three modules: the bottleneck attention (BA) module, the pyramid multi-scale module, and the ASFF module.

A. BOTTLENECK-ATTENTION
MobileNet [22], [11] builds lightweight networks via depthwise separable convolution and an inverted residual structure, where the depthwise separable convolution reduces the number of required parameters and the inverted residual structure preserves the performance of the model. The core building block of MobileNet is the bottleneck. Thus, combining the EFA module with the bottleneck results in the BA module, as shown in Fig. 2. The BA module consists mainly of two 1 × 1 convolution kernels, a 3 × 3 depthwise separable convolution kernel and an EFA module, which perform different operations with various strides. The first 1 × 1 convolution expands the feature channels to extract more feature information; the second 1 × 1 convolution reduces the feature channels; the 3 × 3 depthwise separable convolution reduces the number of parameters. To prevent the rectified linear unit (ReLU) from destroying features, a linear activation is used in the final output section. Besides, to match the shortcut dimension, two different structures are proposed for the BA module: when the stride is 1, a shortcut is used to boost the model performance, similar to the residual structure; when the stride is 2, the convolution module serves as downsampling.
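The parameter saving that motivates the bottleneck can be checked with simple arithmetic. The sketch below counts convolution weights only (biases, normalization layers and the EFA module are left out), and the channel widths and expansion factor in the usage lines are illustrative assumptions rather than values taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # the 1 x 1 convolution mixes the channels
    return depthwise + pointwise


def bottleneck_params(c_in: int, c_out: int, expansion: int, k: int = 3) -> int:
    """Weights in an inverted-residual bottleneck: expand, depthwise, project."""
    hidden = c_in * expansion
    expand = c_in * hidden        # first 1 x 1 convolution widens the channels
    depthwise = k * k * hidden    # k x k depthwise convolution
    project = hidden * c_out      # second 1 x 1 convolution with a linear output
    return expand + depthwise + project


if __name__ == "__main__":
    # Hypothetical widths: 64 input and output channels, expansion factor 2.
    print(standard_conv_params(64, 64))
    print(depthwise_separable_params(64, 64))
    print(bottleneck_params(64, 64, expansion=2))
```

For equal channel widths, the depthwise separable variant needs roughly 1/c_out + 1/k² of the weights of the standard convolution, which is where most of the saving comes from.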

B. EFA MODULE
To achieve better performance, the EFA module is proposed, which fuses the SENet and LANet instead of using them separately, as shown in Fig. 3. The EFA module can alleviate the computational effort required to fuse spatial and channel attention.
Let the feature $X \in \mathbb{R}^{h \times \omega \times c}$ denote the input of the EFA module, where $h$, $\omega$, and $c$ represent the height, width, and number of channels, respectively. First, the input features are split into $g$ groups $[X_1, X_2, \cdots, X_g]$, where $g$ is the number of groups. $X_i \in \mathbb{R}^{h \times \omega \times c_i}$ is the output of the $i$-th group, where $c_i$ represents its channel size. The channel size of each group is determined by $c$ and $g$.
Second, the $i$-th group is divided equally along the channel dimension into two parts $[X_{i1}, X_{i2}]$. To retain both spatial and channel information, the LANet module [6] and the SENet module [7] are used on these two parts.
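The two splitting steps can be illustrated on channel indices alone. This is a minimal sketch, assuming that c is divisible by 2g so that every group holds c/g channels and each half holds c/(2g); the function name `split_groups` is hypothetical and does not come from the paper.

```python
from typing import List, Tuple


def split_groups(channels: List[int], g: int) -> List[Tuple[List[int], List[int]]]:
    """Split channel indices into g groups, then halve each group."""
    size = len(channels) // g                      # assumed channel size per group
    halves = []
    for i in range(g):
        group = channels[i * size:(i + 1) * size]  # channels of the i-th group
        mid = len(group) // 2
        halves.append((group[:mid], group[mid:]))  # the two equal parts of the group
    return halves


if __name__ == "__main__":
    # Eight channels split into two groups of four, each halved into pairs.
    for first, second in split_groups(list(range(8)), g=2):
        print(first, second)
```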
The LANet, as shown in Fig. 4, uses two consecutive 1 × 1 convolution layers. The first convolutional layer outputs $c/r$ channels, where $c$ denotes the number of input channels and $r$ is the reduction rate, and is followed by a ReLU function. Then, a single-channel output feature, called spatial attention, is generated by a second 1 × 1 convolution layer followed by a sigmoid function. Finally, the LANet output is the input features scaled by the spatial attention.
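Because a 1 × 1 convolution is a linear map applied independently at every spatial position, the LANet steps can be sketched without a deep learning framework. The version below is an illustration under stated assumptions: features are nested lists indexed as [channel][row][col], biases are omitted, and the weight arguments `w1` and `w2` are hypothetical stand-ins for the two learned kernels.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][col]


def lanet_attention(x: Feature, w1: List[List[float]], w2: List[float]) -> Feature:
    """Scale every position of x by a spatial attention value in (0, 1).

    w1 has shape (c // r, c) and w2 has shape (c // r,); both stand in for
    1 x 1 convolution kernels, with biases omitted for brevity.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            pixel = [x[k][i][j] for k in range(c)]
            # First 1 x 1 convolution (c -> c // r) followed by ReLU.
            hidden = [max(0.0, sum(a * b for a, b in zip(row, pixel))) for row in w1]
            # Second 1 x 1 convolution (c // r -> 1) followed by a sigmoid.
            logit = sum(a * b for a, b in zip(w2, hidden))
            attention = 1.0 / (1.0 + math.exp(-logit))
            # The same attention value rescales all channels at this position.
            for k in range(c):
                out[k][i][j] = x[k][i][j] * attention
    return out
```

Since the attention map has a single channel, every channel at a given position is rescaled by the same value, which is what makes this branch spatial rather than channel-wise.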
The structure of the SENet is shown in Fig. 5. To obtain a single descriptor per channel, the squeeze operation compresses the global channel information by global average pooling. Formally, the statistic $z \in \mathbb{R}^{c}$ is obtained by reducing $U$ over the spatial dimensions of the feature; for channel $t$,

$$z_t = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_t(i, j)$$

where $u_t(i, j)$ is the element at position $(i, j)$ on channel $t$ and $H \times W$ is the spatial dimension of that channel. The excitation operation learns the weight coefficients of each channel, thus making the model more discriminative with respect to the features of each channel. Two fully connected (FC) layers are used, which consist of a dimensional reduction layer $\omega_1$ and a dimensional extension layer $\omega_2$:

$$s = \sigma(\omega_2 \, \delta(\omega_1 z))$$

where $\sigma$ denotes the sigmoid function, and $\delta$ represents the ReLU function. The dimensional reduction layer outputs $c/r$ channels, and the dimensional extension layer outputs $c$ channels. Finally, the learned activation values for each channel are multiplied by the input features. After concatenation, the $j$-th final output, $j \in [1, 2, \cdots, g]$, has the same channel size as the $i$-th group. Finally, the $g$ groups of subfeatures, $i \in [1, 2, \cdots, g]$, are aggregated together and then output by the "channel shuffle" operator [23].
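The squeeze, excitation and channel shuffle steps can be sketched in a few lines of plain Python. This is a simplified illustration rather than the authors' implementation: biases are omitted, features are nested lists indexed as [channel][row][col], and the weight matrices `w1` (reduction) and `w2` (extension) are hypothetical placeholders for learned FC weights.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][col]


def squeeze(x: Feature) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def excite(z: List[float], w1: List[List[float]], w2: List[List[float]]) -> List[float]:
    """Two FC layers: reduction with ReLU, then extension with a sigmoid."""
    hidden = [max(0.0, sum(a * b for a, b in zip(row, z))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, hidden))))
            for row in w2]


def se_attention(x: Feature, w1: List[List[float]], w2: List[List[float]]) -> Feature:
    """Rescale every channel of x by its learned activation value."""
    s = excite(squeeze(x), w1, w2)
    return [[[v * s_t for v in row] for row in ch] for ch, s_t in zip(x, s)]


def channel_shuffle(channels: list, groups: int) -> list:
    """Interleave channels across groups (reshape, transpose, flatten)."""
    per_group = len(channels) // groups
    return [channels[i * per_group + j]
            for j in range(per_group) for i in range(groups)]
```

The shuffle matters because each group is processed independently: without it, information would never flow between groups in later layers.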

C. PYRAMID MULTISCALE MODULE
The framework of the pyramid multi-scale module is shown in Fig. 6. It builds multi-scale local representations from the features contained in the same layer to extract more fine-grained features.
For a given feature map $X \in \mathbb{R}^{h \times \omega \times c}$, $h$, $\omega$ and $c$ represent the height, width, and number of channels, respectively. The pyramid multi-scale module first splits the feature $X$ into $s$ subfeatures $X_i$, $i \in [1, 2, \cdots, s]$, with different scale sizes via pooling operations, where $h_i \times \omega_i$ stands for the size of subfeature $X_i$. The maximum subfeature size is the same as that of the input feature. Second, the spatial and channel information of each subfeature $X_i$ is obtained through the EFA module, followed by a 1 × 1 convolution. The features at each scale are upsampled by bilinear interpolation, and the upsampled features are defined as $X_{ij}$, $i = j \in [1, 2, \cdots, s]$, which have the same size as the input features. Then the refined feature maps $R_{ij}$, $i = j \in [1, 2, \cdots, s]$, are obtained as the product of $X_{ij}$ and the input $X$:

$$R_{ij} = X_{ij} \circ X$$

where $\circ$ denotes the Hadamard product. Finally, to output the same number of channels as the input features, the refined feature maps are concatenated and passed through a 1 × 1 convolution.
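The pool, upsample and multiply pipeline can be sketched for a single channel. Several simplifications are assumed here and are not taken from the paper: average pooling is used because the pooling type is not specified, nearest-neighbour upsampling is swapped in for bilinear interpolation to keep the code short, and the EFA module and the 1 × 1 convolutions are left out.

```python
from typing import List

Map = List[List[float]]  # a single channel, indexed as [row][col]


def avg_pool(x: Map, k: int) -> Map:
    """Non-overlapping k x k average pooling (height and width divisible by k)."""
    h, w = len(x), len(x[0])
    return [
        [sum(x[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
         for j in range(0, w, k)]
        for i in range(0, h, k)
    ]


def upsample_nearest(x: Map, k: int) -> Map:
    """Repeat every value k times along both axes (stand-in for bilinear)."""
    return [[v for v in row for _ in range(k)] for row in x for _ in range(k)]


def refine(x: Map, scales: List[int]) -> List[Map]:
    """Return one refined map per scale: upsampled subfeature times the input."""
    refined = []
    for k in scales:
        sub = avg_pool(x, k)            # subfeature at a coarser scale
        up = upsample_nearest(sub, k)   # back to the input resolution
        # Hadamard product of the upsampled subfeature and the input.
        refined.append([[a * b for a, b in zip(ru, rx)] for ru, rx in zip(up, x)])
    return refined
```

A scale of 1 leaves the map unchanged, which corresponds to the largest subfeature having the same size as the input.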

D. ASFF
Most previous works do not use low-level features with local information but directly use the last convolutional layer to learn features. These approaches do not consider the fact that the representation obtained from any single layer is not comprehensive. Thus, it is natural to integrate the features of different layers. The pyramid multi-scale module is applied in every two BA modules. Therefore, the pyramid multi-scale module extracts more integrated features from different layers with the EFA module. Different from previous methods that aggregate information from different layers using elementwise summation or concatenation, the approach in [24] is taken to integrate multilevel information, which consists of two steps: scale transformation and adaptive fusion.
$x^{l}$ denotes the features at level $l$, and $x^{n \rightarrow l}$ ($n \neq l$) denotes the features resized from level $n$ to level $l$. In the network, the features in different layers have various scales and numbers of channels. Therefore, different up-sampling and down-sampling strategies are adopted for features at different scales. For up-sampling, a 1 × 1 convolution is used for channel adjustment, followed by bilinear interpolation to increase the resolution of the features. For down-sampling with a 1/2 ratio, a 2 × 2 convolution with a stride of 2 and a padding of 1 is used to change the number of channels and the resolution simultaneously. For the 1/4 ratio, a max-pooling layer with a stride of 2 is added before the convolution operation.
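The feature vector at position $(i, j)$ of the feature map resized from level $n$ to level $l$ is denoted as $x^{n \rightarrow l}_{ij}$. The levels interact with each other to obtain more comprehensive information, as shown below:

$$y^{l}_{ij} = \alpha^{l}_{ij} \cdot x^{1 \rightarrow l}_{ij} + \beta^{l}_{ij} \cdot x^{2 \rightarrow l}_{ij} + \gamma^{l}_{ij} \cdot x^{3 \rightarrow l}_{ij}$$

where $y^{l}_{ij}$ is the $(i, j)$-th vector of the output feature maps $y^{l}$ across channels. $\alpha^{l}_{ij}$, $\beta^{l}_{ij}$ and $\gamma^{l}_{ij}$ refer to the spatial weights of the different levels with respect to level $l$, which are learned adaptively in the network. $\alpha^{l}_{ij}$ is calculated by the following formula:

$$\alpha^{l}_{ij} = \frac{e^{\lambda^{l}_{\alpha_{ij}}}}{e^{\lambda^{l}_{\alpha_{ij}}} + e^{\lambda^{l}_{\beta_{ij}}} + e^{\lambda^{l}_{\gamma_{ij}}}}$$

where $\lambda^{l}_{\alpha_{ij}}$, $\lambda^{l}_{\beta_{ij}}$ and $\lambda^{l}_{\gamma_{ij}}$ refer to the control parameters of the softmax function, which enforces $\alpha^{l}_{ij} + \beta^{l}_{ij} + \gamma^{l}_{ij} = 1$ and $\alpha^{l}_{ij}, \beta^{l}_{ij}, \gamma^{l}_{ij} \in [0, 1]$.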
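The adaptive fusion step at a single position can be sketched as a softmax-weighted sum. The code below is an illustration under simple assumptions: the feature vectors are taken as already resized to the target level, and the control parameters are passed in directly, whereas in a real network they would be predicted per position by learned layers. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs are non-negative and sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def asff_fuse(features: Sequence[Sequence[float]], lambdas: Sequence[float]) -> List[float]:
    """Fuse resized feature vectors from several levels at one position (i, j).

    features[n] is the channel vector from level n after resizing, and
    lambdas[n] is the control parameter that yields its spatial weight.
    """
    weights = softmax(lambdas)   # plays the role of alpha, beta, gamma
    channels = len(features[0])
    return [sum(w * f[c] for w, f in zip(weights, features)) for c in range(channels)]


if __name__ == "__main__":
    levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(asff_fuse(levels, [0.0, 0.0, 0.0]))   # equal weights for all levels
    print(asff_fuse(levels, [5.0, 0.0, 0.0]))   # the first level dominates
```

Plain addition corresponds to fixing all weights to the same constant; the learned weights let each position favour whichever level is most informative there.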
Finally, an FC layer is employed to reduce the output to 128 dimensions.

E. EFFICIENT LIGHTWEIGHT ATTENTION NETWORKS
MobilefaceNet [2] is used as the base network because of its superior performance and its use of fewer parameters than other popular lightweight networks. The EFA module is introduced into the bottleneck to form the BA module, as shown in Fig. 2. The SENet module and LANet module are combined in the EFA module, as illustrated in Fig. 3, where they are applied simultaneously. Multi-scale features are necessary for face recognition because local face regions may have various sizes or shapes due to dramatic facial changes. Meanwhile, local discriminative face regions may appear in different layers, and features may cover a large range of scales in a given convolutional layer. To solve the above problems, the pyramid multi-scale module is introduced together with the EFA module, as demonstrated in Fig. 6. The pyramid multi-scale modules are applied in every two BA modules to extract more integrated features from different layers. Most methods use only the last convolutional layer and thus inevitably lack local details or low-level small-scale information. At the same time, simple fusion methods achieve sub-optimal results. Since ASFF fuses features adaptively and adds almost no overhead, it is used to aggregate the rich features.
The overall framework of the ELANet model is shown in Fig. 7. Four parts are included: inputs, local features, global features, and outputs. Two operations are included in the convolution layer: a 3 × 3 convolution and a depthwise 3 × 3 convolution. The proposed BA module (Fig. 2), which captures important local features and the importance of channels, is repeated n times. The pyramid multi-scale module and ASFF learn local multi-scale features and fuse them across layers. Local features and global features are fused, and 128-dimensional features are output through the fully connected layer.
For the loss function, ArcFace [3] is used to reduce the within-class variance and widen the between-class variance, based on the following formulation:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}} \tag{7}$$

where $N$ is the batch size, $n$ is the number of classes, $s$ is the hypersphere radius of the feature distribution, $m$ is an additive angular margin, and $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$.
VGGFace2 contains 3.14M face images covering a large range of poses, ages and ethnicities. Unless explicitly stated otherwise, MS1MV3 is used as the training dataset.
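The loss in Eq. (7) can be written out for a single sample to make the role of the margin concrete. This is a minimal sketch, assuming the cosines between the normalized feature and each class weight are already available; it ignores the case where the angle plus the margin exceeds π, which practical implementations usually handle separately. The function name and argument layout are illustrative.

```python
import math
from typing import Sequence


def arcface_loss(cosines: Sequence[float], label: int,
                 s: float = 64.0, m: float = 0.5) -> float:
    """ArcFace loss for one sample, given cos(theta_j) for every class j."""
    # Recover the target angle, clamping to guard against rounding error.
    theta = math.acos(max(-1.0, min(1.0, cosines[label])))
    logits = [s * c for c in cosines]
    logits[label] = s * math.cos(theta + m)   # add the angular margin to the target class
    # Stable log-sum-exp over all classes.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]            # equals -log(softmax probability of the target)


if __name__ == "__main__":
    cosines = [0.8, 0.1, -0.2]
    print(arcface_loss(cosines, label=0, s=10.0, m=0.0))  # plain scaled softmax loss
    print(arcface_loss(cosines, label=0, s=10.0, m=0.5))  # the margin raises the loss
```

Averaging this quantity over the N samples of a batch gives the value in Eq. (7). Because the margin lowers the target logit, a sample must sit closer to its class centre to reach the same loss, which is what tightens the within-class variance.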

B. IMPLEMENTATION DETAILS
The ELANet is implemented in PyTorch [33]. The hyperparameter s is set to 64 and the angular margin m of ArcFace is set to 0.5 according to [34]. The batch size is 256, and one NVIDIA 3090 (24 GB) GPU is used for training. The initial learning rate is set to 0.1 and is divided by 10 every epoch. The training process finishes after 25 epochs. The momentum is set to 0.9, and the weight decay is set to 5e−4.

C. ABLATION STUDY
The importance of the three components is first demonstrated in this section: the BA module, the pyramid multi-scale module, and ASFF. The performance of different combinations of the LANet and SENet is then compared. Next, the effects of the reduction rate r and the number of groups g on model performance are investigated. Finally, the performance of the different fusion methods is shown.

1) THE IMPORTANCE OF THE THREE MODULES
To gain insight into the ELANet model, the following modules are analyzed: the bottleneck [1], the BA module, the pyramid multi-scale module, and the ASFF module. The importance of each module is shown in Table 1.
The performance of the BA module is significantly improved relative to that of the original model. This is because the EFA module is added to the original model so that it can emphasize both where facial parts are and which features are significant. The experimental results in Table 1 show that pyramid multi-scale feature learning and cross-layer information fusion are necessary. As illustrated in Table 1, the proposed ELANet performs better than all of these variants, for two reasons. On the one hand, it incorporates the pyramid multi-scale module for extracting multi-scale features and enriching fine-grained feature information. On the other hand, it uses ASFF to fuse different levels of information, which makes the final output feature information more comprehensive and richer and helps to improve the recognition accuracy.

2) DIFFERENT ATTENTION COMBINATIONS
This section examines the effect of different combinations of the SENet and LANet on the performance of the model. Four combinations are shown: the first is to use the SENet module alone and the second is to use the LANet module alone; the third uses dual face attention (DFA) [6]. The last is the EFA module. Table 2 summarizes the experimental results. The LANet emphasizes where facial parts are, and the SENet learns where the significant features are. In the LANet and DFA, the parameters for the experiments are set as in [6]. The performance of the LANet or DFA alone is not as good as that of the other methods in Table 2. The possible reason for this is that the number of channels in the network is too small, and after compression in the LANet, the useful information is drastically reduced, leading to a decrease in model performance. The performance of the EFA model on the cross-pose and cross-age datasets is significantly improved compared to that of other methods except on the CPLFW dataset. A possible explanation is that we divide the number of channels into different groups with the parameter g in the EFA model, so the SENet in the EFA model cannot make good use of the global channel information, which leads to slightly worse performance for the EFA module on CPLFW than that of the SENet alone approach. Thus, the use of both the SENet and LANet can improve performance over that of the method of using one module before the other. As demonstrated in Table 2, compared to other methods, the EFA model makes a trade-off between accuracy and complexity while improving performance.

3) THE EFFECTS OF THE PARAMETERS g AND r ON THE MODEL
To investigate the effects of the parameters g and r in the EFA module on the fusion of the SENet and LANet, experiments are conducted, and the results are reported in Table 3.
The hyperparameter g is set to 2, 4, 8, and 16. The overall performance increases when g decreases. This can be explained by the fact that dividing the features into too many groups leads to useless or noisy information being given more attention. To investigate the trade-off between the computational cost and performance due to the reduction rate r, r is set to 2, 4, 8, and 16. However, the computation and parameter count of the model are also related to the parameter g, as demonstrated in the actual experiments. The overall performance of the model degrades when no group convolution is used relative to the case with group convolution. Finally, g = 2 and r = 2 are chosen to balance performance and complexity.

4) DIFFERENT INTEGRATION METHODS
Most feature fusion approaches add or concatenate features directly. However, simple addition or concatenation is not able to fuse cross-layer information effectively. To overcome this problem, ASFF is used to fuse cross-layer information.
Experimental results comparing ASFF with other fusion methods are shown in Table 4. Compared to the addition and concatenation fusion methods, the performance gains of ASFF on the CPLFW dataset are 0.55% and 0.6%, respectively; on the CALFW dataset, the performance gains are 0.45% and 0.54%, respectively. These results show the advantages of ASFF in capturing interlayer features and adaptively learning weights. However, ASFF does not perform as well as the concatenation approach on datasets containing large pose variations, such as VGG2(FP) and CFP(FP). In general, when facing larger pose variations, more channels are needed to extract richer feature information. In our experiments, the number of channels obtained with concatenation is the highest, so it has the best performance on this problem.

D. COMPARISON WITH DIFFERENT BACKBONE NETWORKS
Several popular CNNs are compared with ELANet, including lightweight face recognition networks and large complex networks. The experimental results are shown in Table 5. Results of lightweight face recognition models on different datasets derive from [35].
Compared with these lightweight models, the proposed ELANet model achieves an overall improvement in performance with only a small increase in computational complexity. In particular, on the cross-pose datasets CPLFW, VGG2_FP, and CFP_FP, ELANet improves performance by 1.83%, 2.17% and 0.26%, respectively, over the best-performing other lightweight face recognition models. On the cross-age dataset CALFW, ELANet improves performance by 0.92% over the best-performing other lightweight face recognition model. Compared with ResNet-50 [37] and DenseNet [36], the ELANet model has fewer parameters, requires less computation, and performs better, which demonstrates its better parameter efficiency. ResNet enhances the expressiveness of the model via shortcut connections, and DenseNet achieves improved model performance with dense connections. EfficientNet [38] optimizes the expressiveness of the model along three aspects simultaneously: the depth, width, and resolution of the network. MobileNet reduces the parameters of the model by using depthwise separable convolution and uses an inverted residual structure to enhance the model representation. Thus, the ELANet continues to use the bottleneck from MobileNet-V2 to reduce the number of model parameters and integrates the EFA module into the bottleneck, allowing it to learn local patch feature information. The EFA module has better performance and fewer fusion parameters. Different levels of feature information are used and fused by ASFF to enhance the performance of the model. The proposed EFA module enables the ELANet to focus more on the most discriminative features under pose changes, and thus the ELANet model performs better on datasets containing multiple pose changes.
In summary, the ELANet model has good representation capability, uses its parameters effectively, performs well under complex data distributions, and makes a good trade-off between accuracy and complexity. Especially important is that it requires few computations and parameters, making it very suitable for use in embedded devices with low computing power.

E. EXPERIMENTS ON CROSS-POSE
In the cross-pose experiments, the MS1MV3 and VGGFace2 datasets are used as training data. The results of the comparison between the ELANet model and state-of-the-art methods are shown in Table 6.
PIM [41] proposes a two-way generative adversarial network that learns both local and global information, together with a discriminative learning subnet that learns discriminative and generic feature representations, achieving 93.10% on the CFP_FP dataset. DA-GAN [42] generates high-resolution images by using a fully convolutional network and uses an autoencoder as a discriminator with dual agents. p-CNN [43] utilizes a multi-task convolutional neural network that groups different poses to learn pose-specific identity features and obtains 94.39%. NoiseFace [44] is trained with a large amount of noisy data and achieves 96.04%. LS-CNN [6] learns multi-scale and local feature information, which improves performance to 97.17% on CFP_FP. HDPA [8] reaches 92.35% on the CPLFW dataset through multivariate guided learning. DLL [45] proposes a distributed distillation loss to improve performance on hard samples and achieves state-of-the-art performance on both the CFP_FP and CPLFW datasets. ELANet is simple and efficient compared to data augmentation methods that involve a great deal of complexity (PIM, DA-GAN), and it achieves very good performance on cross-pose datasets. Compared to the large and complex LS-CNN network, ELANet obtains similar performance with few parameters and little computation. Compared to state-of-the-art methods, the ELANet model achieves sub-optimal performance with a lightweight model; compared to lightweight face recognition models proposed in recent years, the ELANet model outperforms them on cross-pose datasets.

F. EXPERIMENTS ON CROSS-AGE
In the cross-age experiments, the MS1MV3 and VGGFace2 datasets are used as training data. ELANet is compared with other state-of-the-art methods on the CALFW dataset in Table 7.
VGGFace [46] and CCL [47] are trained using advanced loss functions. VGGFace is trained using triplet loss. CCL disperses the face features in the coordinate space and divides the classification vectors on the hypersphere. AFJT-CNN [48] alternately trains a fusion network and combines a factor model. The proposed ELANet performs far better than these methods on cross-age datasets. Compared to LS-CNN, the EFA module proposed in ELANet is able to focus on more discriminative face regions. HDPA achieves state-of-the-art performance through multivariate guided learning, but its overall network is so complex that it is difficult to deploy on embedded or mobile devices. In contrast, ELANet uses fewer parameters and an easy, effective method to achieve optimal performance.

G. EXPERIMENTS ON IJB-B/C
A performance comparison between the ELANet model and several other methods on the IJB-B/C datasets is given in this section. The results of the ResNet50 [37], MN-v [49], MN-vc [49] and DCN [50] models are obtained from [3]. We use ArcFace [3] to conduct the same experiments.
This experiment compares the TAR(@FAR = 1e−4) of the ELANet with those of the state-of-the-art models, as shown in Table 8. With the exception of VarGFaceNet, ELANet achieves slightly better performance on IJB-B/C than any of the other lightweight face recognition models. The ELANet has similar performance to the complex network model on IJB-B/C except for ResNet100. This illustrates the necessity of introducing local features and multilayer and multi-scale information to face recognition. It also demonstrates that introducing different layers of information to jointly extract features is useful for face recognition. In Fig. 8, we show the receiver operating characteristic (ROC) curves of the proposed ELANet on the IJB-B dataset and the IJB-C dataset.
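The TAR at a fixed FAR can be computed from similarity scores as sketched below. This is a simplified pair-level illustration, not the official IJB-B/C evaluation protocol, which works on templates; the function name, the score lists and the tie-handling rule (accept only scores strictly above the threshold) are assumptions made for the example.

```python
from typing import Sequence


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float], far: float) -> float:
    """True accept rate at a threshold whose false accept rate does not exceed far."""
    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))   # number of impostor pairs that may be accepted
    # Accept a pair only when its score is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine if score > threshold)
    return accepted / len(genuine)


if __name__ == "__main__":
    genuine_scores = [0.9, 0.8, 0.3, 0.6]    # same-identity pairs
    impostor_scores = [0.7, 0.2, 0.1, 0.4]   # different-identity pairs
    print(tar_at_far(genuine_scores, impostor_scores, far=0.25))
    print(tar_at_far(genuine_scores, impostor_scores, far=0.0))
```

At FAR = 1e−4, at least ten thousand impostor pairs are needed before a single false accept is permitted, which is why this operating point is only meaningful on large benchmarks.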

V. CONCLUSION
An effective approach is proposed to address the impact of pose and age on face recognition performance. A new lightweight network structure based on MobilefaceNet is proposed that can learn rich multi-scale, multilevel features as well as discriminative local features; it provides different weights for different channels and spatial features and joins different levels of features together for face recognition. The proposed ELANet model can generalize across multiple datasets and achieve high performance with fewer parameters and computations than those required by other approaches, making it ideal for deployment on mobile and embedded devices. Experiments show that the ELANet achieves significantly improved model performance over some other state-of-the-art lightweight networks. The ELANet can achieve similar performance to that of a complex model and even better performance on some test sets. In the future, the ELANet will be tested in real deployments on mobile or embedded devices to further optimize the performance of the model.
PENG ZHANG received the Ph.D. degree in mechanical design and manufacturing and automation from the Lanzhou University of Technology. He is currently an Associate Professor with the School of Instrumentation and Electronics, North University of China. He has published more than ten articles. He is also engaged in scientific research on multifunctional simulation turntables, stabilization platforms, inertial devices and micro-inertial combined navigation, inertial sensing microsystems, collaborative control and estimation of multiple UAVs, and intelligent sensing systems for multiple sensors.
FENG ZHAO was born in 1996. He received the bachelor's degree in electronic information engineering from the North University of China, in 2020, where he is currently pursuing the master's degree. His main research interests include object detection and recognition.
PENG LIU received the Ph.D. degree in control science and engineering from Southeast University. He has published 11 academic articles.