PocketNet: Extreme Lightweight Face Recognition Network using Neural Architecture Search and Multi-Step Knowledge Distillation

Deep neural networks have rapidly become the mainstream method for face recognition (FR). However, such models typically contain an extremely large number of parameters, which limits their deployment on embedded and low-end devices. In this work, we present an extremely lightweight and accurate FR solution, namely PocketNet. We utilize neural architecture search (NAS) to develop a new family of lightweight face-specific architectures. We additionally propose a novel training paradigm based on knowledge distillation (KD), the multi-step KD, where the knowledge is distilled from the teacher model to the student model at different stages of the training maturity. We conduct a detailed ablation study proving both the sanity of using NAS for the specific task of FR rather than general object classification, and the benefits of our proposed multi-step KD. We present an extensive experimental evaluation and comparisons with the state-of-the-art (SOTA) compact FR models on nine different benchmarks, including large-scale evaluation benchmarks such as IJB-B, IJB-C, and MegaFace. PocketNets have consistently advanced the SOTA FR performance on nine mainstream benchmarks when considering the same level of model compactness. With 0.92M parameters, our smallest network, PocketNetS-128, achieved very competitive results to recent SOTA compact models that contain up to 4M parameters.


Introduction
Face recognition is an active research field that has benefited from recent advancements in machine learning, especially the advancements in deep learning [9] and the novelty of margin-based Softmax losses [5,31], achieving notable recognition accuracy. SOTA FR solutions rely on deep learning models with an extremely large number of parameters [5,24]. Deploying such models on embedded devices or in applications with limited memory specifications is a major challenge [22,6]. This challenge has received increased attention in the literature in the last few years [22,6].
However, none of these works [1,3,35,21,22] designed a network specifically for the FR task; rather, they adopted existing architectures designed for common computer vision tasks. With the developments in AutoML, Neural Architecture Search (NAS) has shown SOTA performances in many computer vision tasks [16,34]. NAS aims at automating the neural network architecture design, achieving higher performance than handcraft-designed network architectures. One of the early works of NAS was introduced by Zoph et al. [42]. That work [42] proposed that the architecture of a neural network can be described as a variable-length string. Thus, a recurrent neural network (RNN) can be used as a controller to generate such a string. While this method showed competitive results in comparison to SOTA models, it requires a very long search time (22,400 GPU days [42]). NASNet [43] points out that a convolutional neural network (CNN) such as ResNet [9] is a repetition of modules that consist of combinations of convolution operations. Based on that, they introduced a new search space, called NASNet. NASNet proposed to learn the network building block (cell), rather than learning the whole architecture. The network architecture, in this case, is constructed by stacking these cells together. NASNet was able to reduce the search time to 2,000 GPU days, in comparison to the 22,400 GPU days needed by NAS [42]. ProxylessNAS [2] directly learned the architectures for the target task. It trained an over-parameterized network by gradient optimization that contained all candidate paths, and pruned redundant paths to achieve a compact architecture. The architecture search needed 200 GPU hours, which was much faster than NASNet [43]. Differentiable Architecture Search (DARTS) [16] relaxed the discrete search space in a continuous manner. DARTS proposed to use gradient optimization to optimize the architecture search space. Similar to NASNet, DARTS proposed to learn the main building block (cell) of the network architecture rather than learning the entire network architecture. The DARTS search algorithm requires around 1.5 GPU days, which is orders of magnitude faster than NASNet [43]. To reduce the NAS search time, all mentioned NAS algorithms proposed to learn from small training datasets such as CIFAR-10 [15] and then utilized the discovered architecture to train on larger datasets such as ImageNet [4]. This advancement in NAS solutions has only recently captured the attention of biometric recognition solutions [41,32], however, with no deployments towards lightweight or embedded architectures.
In this work, we aim at intelligently designing and training a family of lightweight FR models, namely the PocketNets, that offer a SOTA trade-off between model compactness and performance. To achieve that, we focus on two aspects: the first is the use of a NAS algorithm to learn an FR-specific lightweight network architecture, and the second is the design of a novel knowledge distillation (KD) paradigm to relax the training difficulties raised by the substantial discrepancy between teacher and student models. We use CASIA-WebFace (500K images) [36] to learn the optimal architecture using DARTS [16]. We additionally propose a novel training paradigm based on KD, namely multi-step KD, to enable transferring the knowledge of the teacher network at different stages of the training process, and thus enhance the verification performance of the compact student model. We validate our face-specific NAS-based architecture and the proposed multi-step KD in two detailed ablation studies. First, we experimentally evaluate the impact of the NAS training dataset source (face vs. general image classes) on the FR performance of the learned architecture. Second, we experimentally prove and analyze the competence of our proposed multi-step KD in improving FR performance in comparison to baseline KD solutions, as well as training without KD. To experimentally demonstrate the competence of our proposed PocketNets, we report their FR performance on nine different benchmarks, in comparison to the recent SOTA compact models, in terms of FR performance and model compactness. In a detailed comparison, different versions of our PocketNets scored SOTA performances in both the under-1M-parameter and under-2M-parameter FR model categories. Moreover, PocketNets achieved very competitive results to much larger FR models, and even outperformed them in many cases.

Methodology
This section presents the methodology leading to our proposed PocketNets solution, both the architecture design and the training paradigm. We first present the NAS process leading to the architecture of our proposed PocketNets. Then, we present our proposed multi-step knowledge distillation training paradigm.

Towards PocketNet Architecture
Neural architecture search (NAS) automates the network design by learning the network architecture that achieves the best performance for a specific task. NAS has proved to be a robust method for discovering and optimizing neural network architectures. Previous works [2,16] demonstrate that the network architectures discovered by NAS do outperform handcraft-designed network architectures for different computer vision tasks. For our PocketNets, we opt to use the differentiable architecture search (DARTS) algorithm [16] to search for two types of building blocks (cells), i.e., a normal cell and a reduction cell, which can be stacked to form the final architecture. Our choice of DARTS is based on: a) it achieved competitive results to the SOTA NAS solutions on different image classification tasks [16], and b) the search time for DARTS is feasible in comparison to other search methods [42,43] and thus, it can be adapted to a large-scale dataset. Unlike common NAS algorithms that are applied to small images of a small dataset, our NAS is learned on a large-scale face image dataset with relatively high resolution. In the following, we briefly present the DARTS algorithm. Our goal here is not only to build an optimal architecture, but also to analyze the FR performance implications when optimizing such an architecture on a different learning task, as will be clarified later in this work.
DARTS aims at learning two types of cells: a normal cell and a reduction cell. Each cell is a directed acyclic graph (DAG) that consists of $N$ nodes. Each node $x^{(i)}$ is a latent representation, where $i \in [0, N)$. The operation space $\mathcal{O}$ is a set of candidate operations, e.g., convolutional layer, skip-connection, pooling layer, etc. Each edge $(i, j)$ between nodes $x^{(i)}$ and $x^{(j)}$ is associated with a candidate operation $o^{(i,j)} \in \mathcal{O}$ that applies a particular transformation to $x^{(i)}$. Each candidate operation $o$ is weighted by an architecture parameter $\alpha_o^{(i,j)}$. An intermediate node $x^{(j)}$ is calculated as $x^{(j)} = \sum_{i<j} o^{(i,j)}(x^{(i)})$. Each cell (DAG) has two input nodes and a single output node. The two input nodes are the outputs of the previous two cells of the network. The cell output, i.e., the output of the last node $x^{(N-1)}$, is a concatenation of all nodes in the DAG excluding the input nodes. During the search, the discrete choice of a candidate operation on edge $(i,j)$ is relaxed by applying a Softmax function over the weights of all possible operations in $\mathcal{O}$:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(x),$$

where $\alpha_o^{(i,j)}$ is the network architecture weight parameter of candidate operation $o$. Therefore, the architecture search becomes a task of learning the set of parameters $\alpha = \{\alpha^{(i,j)}\}$. The learning procedure of DARTS is based on jointly learning the network architecture represented by $\alpha$ and the network weights $w$. Given $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ as the training and validation losses, respectively, the learning objective of DARTS is to find the optimal architecture represented by $\alpha^*$ that minimizes the validation loss $\mathcal{L}_{val}(w^*, \alpha^*)$, with $w^* = \arg\min_w \mathcal{L}_{train}(w, \alpha^*)$ as the best-performing network weights on the training set. The architecture parameters are learned using a bi-level optimization problem with $\alpha$ as the upper-level and $w$ as the lower-level variable:

$$\min_{\alpha} \ \mathcal{L}_{val}(w^*(\alpha), \alpha) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha).$$

The final discrete architecture is derived by setting $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
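The continuous relaxation above can be illustrated with a minimal sketch (not the paper's implementation): each edge computes a Softmax-weighted sum of all candidate operations. The toy scalar "operations" stand in for convolution, identity, and zero, and are illustrative assumptions.

```python
import math

def mixed_op(x, alphas, ops):
    """Continuous relaxation of one edge's operation choice:
    each candidate op's output is weighted by softmax(alpha)."""
    exps = [math.exp(a) for a in alphas]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over architecture parameters
    return sum(w * op(x) for w, op in zip(weights, ops))

# toy 1-D stand-ins for conv / identity / zero candidate operations
ops = [lambda x: 2 * x, lambda x: x, lambda x: 0.0]

# equal alphas -> equal weights 1/3 each: (6 + 3 + 0) / 3 = 3.0
y = mixed_op(3.0, alphas=[0.0, 0.0, 0.0], ops=ops)
```

After the search, discretization keeps only the operation with the largest alpha on each edge, which corresponds to the argmax step above.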
Given an input of shape $w \times h \times c$, the output of the reduction cell is $w/2 \times h/2 \times 2c$, and the output of the normal cell is $w \times h \times c$. The first two nodes of cell $k$ represent the outputs of the two previous cells, $k-1$ and $k-2$.
3) max pooling layer with a kernel size of 3 × 3; 4) average pooling layer with a kernel size of 3 × 3; 5) identity; 6) zero. A zero operation indicates that there is no connection between the nodes. The max and average pooling layers are followed by batch normalization (BN) [13]. We use the Parametric Rectified Linear Unit (PReLU) [8] as the non-linearity in all convolutional operations. PocketNet architecture: We followed [16] by setting the number of nodes in all cells to N = 7. We apply fast downsampling at the beginning of the network using a 3×3 convolution (stride=2) followed by BN [13]. To obtain the feature embedding of the input face image, we use a global depthwise convolution [11] rather than average pooling or a fully connected layer directly before the classification layer. Our choice of the global depthwise convolution for the embedding stage is based on: a) it contains fewer parameters than a fully connected layer, and b) a convolutional neural network (CNN) with global depthwise convolution is more accurate than one with average pooling for FR, as reported in previous works [3,1]. The rest of the network architecture is constructed by stacking M normal cells and 3 reduction cells, placed at 1/3 and 2/3 of the network depth and after the last normal cell. We trained the NAS to optimize α_normal and α_reduction, used to construct the normal and reduction cells, respectively. We trained the search algorithm to learn from the CASIA-WebFace dataset [36]. Training details are presented later in Section 3.1. The best normal and reduction cells discovered by DARTS are shown in Figures 1a and 1b, respectively.
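To make point (a) concrete, the parameter counts of the two embedding options can be compared directly. This is an illustrative back-of-the-envelope sketch: the 7 × 7 × 512 feature-map shape and the bias-free layers are assumptions for illustration, not figures taken from the paper.

```python
def fc_embedding_params(h, w, c, d):
    # flatten the h*w*c feature map, then a fully connected layer to a
    # d-dimensional embedding (no bias term for simplicity)
    return h * w * c * d

def gdc_embedding_params(h, w, c, d):
    # global depthwise conv (kernel = h x w, groups = c) collapses each
    # channel's spatial map to one value, then a 1x1 conv maps c -> d
    return h * w * c + c * d

fc = fc_embedding_params(7, 7, 512, 128)    # 3,211,264 parameters
gdc = gdc_embedding_params(7, 7, 512, 128)  #    90,624 parameters
```

Under these assumed dimensions, the global depthwise convolution head needs roughly 35x fewer parameters than a fully connected embedding layer, which is why it is the preferred choice for compact FR backbones.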

PocketNet Training Paradigm
Towards the PocketNet training paradigm that incorporates our proposed multi-step KD, we start by formulating the margin-based Softmax loss and the knowledge distillation concept.

Margin-based Softmax loss has been widely deployed in recent FR solutions [5,31,24]. It achieved SOTA accuracy on major benchmarks [5,22,24]. In this work, we utilize the ArcFace loss [5] to train our PocketNets. The ArcFace loss extends the Softmax loss by manipulating the decision boundary between the classes, deploying an additive angular margin penalty on the angle between the weights of the last fully connected layer and the feature representation. Formally, the ArcFace loss is defined as follows:

$$L_{arc} = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))} + \sum_{j=1, j \neq y_i}^{C} e^{s \cos(\theta_j)}},$$

where $\theta_{y_i}$ is the angle between the feature $f_i$ and the $y_i$-th class center, $y_i \in [1, C]$ ($C$ is the number of classes), $M$ is the batch size, $m$ is the margin penalty value, and $s$ is a scale parameter.

Knowledge distillation (KD): KD is a technique to improve the performance and generalizability of smaller models by transferring the knowledge learned by a cumbersome model (teacher) to a single small model (student) [10]. The idea is to guide the student model to learn the relationships between different classes discovered by the teacher model, which contain more complex information beyond the ground-truth labels [10]. KD was originally proposed to improve the performance of a small backbone trained with Softmax loss for a classification task [10]. However, the learning objective of an FR model is to optimize the feature representations needed for face verification. In this work, as a step towards our proposed multi-step KD, we train our PocketNet model to learn feature representations that are similar to the ones learned by the teacher model. We achieve that by adding a mean squared error (MSE) loss, operating on the embedding layer, to the ArcFace loss. Formally, the $l_{mse}$ loss is defined as follows:

$$l_{mse} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{D} \sum_{d=1}^{D} \left( \Phi^S_i(d) - \Phi^T_i(d) \right)^2,$$

where $\Phi^S_i$ and $\Phi^T_i$ are the feature representations obtained from the last fully connected layer of the student and teacher models, respectively, and $D$ is the size of the feature representation. The final training loss function is defined as follows:

$$L = L_{arc} + \lambda \, l_{mse},$$

where $\lambda$ is a weight parameter. The feature representations learned by the ArcFace loss are normalized. Thus, the value range of $l_{mse}$ is very small, i.e., $\leq 0.007$, in comparison to the ArcFace loss value (around 60 at the beginning of the training phase). We set the $\lambda$ value to 100 so that $l_{mse}$ contributes to the model training.
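A minimal sketch of the combined objective, assuming the embeddings are L2-normalized as stated above. The helper names (`total_loss`, `l2_normalize`) and the toy values are illustrative, not the paper's code.

```python
import math

def l2_normalize(v):
    # unit-norm the embedding, as the ArcFace setting assumes
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mse(a, b):
    # mean squared error over the D embedding dimensions
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(arcface_loss, student_emb, teacher_emb, lam=100.0):
    # L = L_arc + lambda * l_mse, with lambda compensating for the tiny
    # magnitude of the MSE between two unit-norm embeddings
    return arcface_loss + lam * mse(l2_normalize(student_emb),
                                    l2_normalize(teacher_emb))
```

For example, with identical student and teacher embeddings the distillation term vanishes and the total loss equals the ArcFace term alone; any discrepancy between the unit-norm embeddings is amplified by the factor of 100.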

Multi-Step Knowledge Distillation: Previous works [25,35] observed that transferring the knowledge from a very deep teacher model to a small student model is difficult when the gap between the teacher and the student model in terms of network size is large.

In this work, we present a novel concept that relaxes this difficulty of a substantial discrepancy between the teacher and student models by synchronizing the student and the teacher model during the training, without the need for transferring the knowledge through intermediate networks [25,35]. Our solution is designed to transfer the knowledge learned by a teacher model in a step-wise manner after each x number of iterations, i.e., multi-step KD. The key idea is that the information learned by a teacher at different steps of the training phase is different from the one learned when the teacher is fully converged. Thus, transferring the knowledge learned by a teacher at an early stage of training is easier for the student to learn, and at a later point, when the student has converged to some degree, it can learn more complex patterns from the teacher. To achieve that, we first train the teacher for I iterations. This teacher model is noted as T1. Then, we train the student model for the same number of iterations I with the assistance of the teacher T1. In this case, Φ^T (Equation 4) is obtained from T1 after the first I iterations. We choose to train the teacher for one epoch each time, which gives the teacher a chance to learn from the whole training dataset. We repeat these two steps until the teacher and student models have converged. To simplify the implementation, we first train the teacher model until it has converged and save the model weights after each epoch. Then, we train the student model with the assistance of the teacher models. During the student training, we load the teacher weights that correspond to the same training epoch.
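The simplified two-phase implementation described above can be sketched as follows. This is a schematic only: the per-epoch training routines and checkpoint loader are hypothetical callables standing in for the actual PyTorch training code.

```python
def multi_step_kd(num_epochs, train_teacher_epoch, train_student_epoch,
                  load_teacher_checkpoint):
    """Sketch of multi-step KD: the teacher is first trained with a
    checkpoint saved per epoch; the student then trains epoch-by-epoch
    against the teacher checkpoint of the SAME epoch, so it always
    distills from a teacher of matching training maturity."""
    # phase 1: train the teacher, saving weights after every epoch
    checkpoints = []
    for epoch in range(num_epochs):
        checkpoints.append(train_teacher_epoch(epoch))
    # phase 2: train the student, synchronized with teacher maturity
    log = []
    for epoch in range(num_epochs):
        teacher = load_teacher_checkpoint(checkpoints[epoch])
        log.append(train_student_epoch(epoch, teacher))
    return log

# toy run: checkpoint "t<e>" is produced at epoch e and consumed at epoch e
log = multi_step_kd(3,
                    train_teacher_epoch=lambda e: f"t{e}",
                    train_student_epoch=lambda e, t: (e, t),
                    load_teacher_checkpoint=lambda c: c)
```

The key property the sketch demonstrates is the synchronization: the student at epoch e is always guided by the teacher weights saved at epoch e, never by the fully converged teacher.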

Neural Architecture Search
We train DARTS to learn the normal and reduction cells on the CASIA-WebFace dataset [36]. CASIA-WebFace consists of 494,141 face images from 10,757 different identities. We split the dataset equally into two parts used for training and validation. The images are pre-aligned and cropped to 120 × 120 for the training subset and to 112 × 112 for the validation subset using the Multi-task Cascaded Convolutional Networks (MTCNN) solution [37]. During the training phase, the training images are randomly cropped to a fixed size of 112 × 112 and then randomly horizontally flipped to make the search more robust, following common practice in FR research [5,24]. All the training and validation images are normalized to have pixel values between -1 and 1. We followed the DARTS training setup [16] by using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 3e-4 to optimize the DARTS weights w. We utilize a cosine annealing strategy [18] to decrease the learning rate after each epoch, with a minimum learning rate of 0.004. We set the batch size to 128 and the initial learning rate to 0.1. For the α optimization, we use a setup similar to DARTS [16], using the Adam optimizer with momentum β = (0.5, 0.999) and a weight decay of 1e-3. We set the initial learning rate for the Adam optimizer to 0.0012. The initial channel size is set to 64 and the number of nodes in each cell is set to 8. We use a batch size of 128 and train DARTS for 50 epochs. These configurations are chosen to enable DARTS training on the available GPUs. All training codes are implemented in PyTorch [27] and trained on 6 NVIDIA GeForce RTX 2080 Ti (11GB) GPUs. The training lasted 2,274 hours. We additionally conducted an experiment on CIFAR-10 [15] as a NAS domain ablation study for this work. CIFAR-10 is a commonly used dataset for object detection and image classification tasks, consisting of 60,000 images (of size 32 × 32) of 10 classes. We split CIFAR-10 equally into two parts: training and validation subsets. We ran the DARTS search using the exact configurations described previously in this section to learn on the CIFAR-10 dataset. This training lasted around 30 hours on 6 NVIDIA GeForce RTX 2080 Ti (11GB) GPUs.
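As a sketch, the per-epoch cosine annealing of the weight learning rate (from 0.1 down to the 0.004 minimum over the 50 search epochs) can be written as follows. This is one common form of the schedule in [18]; the exact variant used in the search is an assumption.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=0.1, lr_min=0.004):
    # cosine decay of the weight learning rate from lr_max to lr_min,
    # evaluated once per epoch as in the search setup above
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

# learning rate at the start of each of the 50 search epochs
lrs = [cosine_annealing_lr(e, 50) for e in range(50)]
```

The schedule starts at exactly 0.1, decreases monotonically, and reaches the 0.004 floor at the end of the search.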

Face Recognition models and training
Based on the normal and reduction cells learned by DARTS on CASIA-WebFace [36], we trained three instances of PocketNetS-128. The first instance (noted as PocketNetS-128 (no KD)) is trained only with the ArcFace loss described in Section 2.2. The second instance (noted as PocketNetS-128 (KD)) is trained with ArcFace loss with KD. The third instance is trained with ArcFace loss along with our proposed multi-step KD (noted as PocketNetS-128 (multi-step KD)). These three instances are used in our ablation study towards the proposed multi-step KD. On the other hand, based on the normal and reduction cells learned on CIFAR-10 [15] (object classification domain), we train another model based on these cells, noted as DartFaceNetS-128 (no KD). This training is used as an ablation study to analyze the effect of the training dataset source on the neural architecture search algorithm by comparing its FR performance to its direct counterpart, PocketNetS-128 (no KD).
Additionally, as detailed earlier, we trained four instances of PocketNets: PocketNetS-128, PocketNetS-256, PocketNetM-128, and PocketNetM-256, to compare our proposed PocketNets with the recent compact FR models proposed in the literature at different levels of compactness. All these models are trained with ArcFace loss along with our proposed multi-step KD. To enable the multi-step KD solution, we trained two instances of the ResNet-100 model with embedding sizes of 128-D and 256-D. The ResNet-100(128) is used as a teacher for PocketNetS-128 and PocketNetM-128, while ResNet-100(256) is used as a teacher for PocketNetS-256 and PocketNetM-256.
We use the MS1MV2 dataset [5] to train all the investigated FR models in this work. MS1MV2 is a refined version [5] of MS-Celeb-1M [7] containing 5.8M images of 85K identities. We follow the common setting [5] by setting the scale parameter s to 64 and the margin value of the ArcFace loss to 0.5. We set the mini-batch size to 512 and train our models on a single Linux machine (Ubuntu 20.04.2 LTS) with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 512 GB RAM, and 4 NVIDIA GeForce RTX 6000 GPUs. The proposed models in this paper are implemented using PyTorch [27]. All FR models are trained with the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 1e-1. We set the momentum to 0.9 and the weight decay to 5e-4. The learning rate is divided by 10 at 80K, 140K, 210K, and 280K training iterations. The total number of training iterations is 295K. During the training, we use random horizontal flipping with a probability of 0.5 for data augmentation. The networks are trained (and evaluated) on images of size 112 × 112 × 3, with pixel values between -1 and 1. These images are aligned and cropped using the Multi-task Cascaded Convolutional Networks (MTCNN) [37], following [5].
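The piecewise-constant schedule above (learning rate divided by 10 at each milestone iteration) can be sketched as a small helper. This mirrors a standard multi-step schedule such as PyTorch's MultiStepLR, expressed here in plain Python:

```python
def step_lr(iteration, base_lr=0.1,
            milestones=(80_000, 140_000, 210_000, 280_000)):
    # the learning rate is divided by 10 at each milestone iteration;
    # counting how many milestones have been passed gives the exponent
    passed = sum(iteration >= m for m in milestones)
    return base_lr * 0.1 ** passed
```

So the rate is 0.1 until iteration 80K, 0.01 until 140K, and so on, ending at 1e-5 for the final iterations before 295K.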

Ablation Study
This section presents two ablation studies addressing the two main aspects of our design of the PocketNets solution.
Ablation Study on the NAS training dataset source: We trained two different instances of the DARTS search algorithm to learn from CASIA-WebFace [36] (face images) and CIFAR-10 [15] (animals, cars, etc.), respectively. Figure 1 presents the normal and reduction cells learned on CASIA-WebFace and CIFAR-10, used to build our PocketNetS-128 (no KD) and DartFaceNetS-128 (no KD), respectively. These networks share the same structure, including the embedding stage and the number of cells. These networks are trained using the exact training setup described in Section 3.2. DartFaceNetS-128 (no KD) contains 885,184 parameters with 620.9286 MFLOPs. PocketNetS-128 (no KD) contains 925,632 parameters with 587.11 MFLOPs. Table 3 presents the performance achieved by PocketNetS-128 (no KD) and DartFaceNetS-128 (no KD) on nine different benchmarks. It can be clearly noticed that PocketNetS-128 (no KD) outperformed DartFaceNetS-128 (no KD) with an obvious margin on all considered benchmarks. This demonstrates that utilizing a neural network architecture designed for common computer vision tasks leads to sub-optimal performance when it is used for FR. It also supports our choice of training the NAS to learn from a face image dataset and points out that FR does require a face-specific architecture design.

Experimental results
Table 4 presents the FR results achieved by our PocketNets on all evaluation benchmarks. It also presents a comparison between our proposed PocketNets and the recent compact models proposed in the literature. The presented models are ordered in groups based on the number of parameters (compactness). The first part of Table 4 presents the results achieved by models that have between 2 and 5M trainable parameters, while the second and third parts present the results for models with less than 2M and less than 1M trainable parameters, respectively.
Our PocketNetM-128 (1.68M parameters) and PocketNetM-256 (1.75M parameters) outperformed all models proposed in the literature that have less than 2M parameters. They also achieved competitive results to the models that have between 2 and 5M parameters, even outperforming them in many cases. For example, our PocketNetM-128 achieved SOTA accuracies on the challenging CA-LFW and CP-LFW among all models that have less than 5M trainable parameters. On the large-scale evaluation benchmarks, IJB-B and IJB-C, our PocketNetM models also achieved very competitive results to larger models (Table 4).

Table 4: The achieved results on the 9 benchmarks. The results are reported in % based on the evaluation metrics described in Section 3.3. The models are ordered based on the number of parameters. Our PocketNetS-128 and PocketNetS-256 consistently extend the SOTA performance on all evaluation benchmarks for models that have less than 1M parameters. Our PocketNetM-128 and PocketNetM-256 also achieved SOTA performances for models that have less than 2M parameters. Additionally, they achieved very competitive results to larger models that have between 2 and 5M parameters. All decimal points are provided as reported in the respective works.
On MegaFace and the refined version of MegaFace, our PocketNetM models outperformed all models that have less than 2M trainable parameters, and they achieved competitive results in terms of identification and verification accuracies to the models that have between 2 and 5M parameters. For example, our PocketNetM-256 (1.75M parameters) outperformed MixFaceNet-S [1] (3.07M parameters), ProxylessFaceNAS [22] (3.2M parameters), and MobileFaceNetV1 [22] (3.4M parameters) on MegaFace and MegaFace (R).
To visually illustrate the competence of our PocketNets, we plot the number of parameters vs. the achieved verification performance for our PocketNets and the recent compact models proposed in the literature (all numbers are provided in Table 4). Figure 3 presents the trade-off between the number of parameters and the achieved verification performance. Each of the presented solutions is marked with a point (x, y) in the plot, where x is the number of parameters in millions and y is the achieved verification performance. A model placed towards the top-left corner (small x and large y) of the plot has a better trade-off between model compactness and verification performance. It can be observed in Figure 3 that our PocketNets are always in the top-left corner in comparison to other methods, proving that they achieve a SOTA trade-off between model compactness and FR performance. It must be noted that all the PocketNets reported in this section are trained with our proposed multi-step KD.

Conclusion
We presented in this paper a family of extremely lightweight FR models, namely PocketNets. This is one of the first efforts proposing to utilize NAS to learn to design a compact yet accurate FR model. We additionally presented a novel training paradigm based on knowledge distillation, namely multi-step KD, where the knowledge distillation is performed at multiple stages of the teacher's training maturity. Extensive step-wise ablation studies proved the benefits of both designing a face-specific architecture and training the lightweight model with the proposed multi-step KD. Through extensive experimental evaluations on nine FR benchmarks, we demonstrated the high verification performance achieved by our compact PocketNet models and our proposed multi-step KD. Under the same level of model compactness, our PocketNets consistently scored SOTA performances in comparison to the compact models proposed in the literature.
In this work, we present four architectures based on the learned cells: PocketNetS-128, PocketNetS-256, PocketNetM-128, and PocketNetM-256. The architectures of PocketNetS-128 and PocketNetS-256 (PocketNet small) are identical. Each of them contains 18 cells, i.e., 15 normal cells and 3 reduction cells. The number of feature maps (output channels) of the first layer is 64. The only difference is the embedding size, where the embedding in PocketNetS-128 is of size 128-D and in PocketNetS-256 is of size 256-D.
(a) ArcFace loss value of the model trained with KD vs. the model trained with multi-step KD over training iterations. (b) KD vs. multi-step KD loss values over training iterations.

Figure 2: Effect of multi-step KD on the student model convergence. It can be noticed that multi-step KD enables the model trained with ArcFace and multi-step KD losses to converge better in comparison to the case where the model is trained with ArcFace and KD losses (Figure 2a). Also, it can be observed that training with multi-step KD guides the model to learn feature representations that are more similar (in comparison to KD) to the teacher's (Figure 2b). These figures are based on training the PocketNetS-128 network.

Table 1: Architecture of PocketNetS-128. Normal and reduction cells are the cells learned by DARTS on CASIA-WebFace. The table shows the number of parameters for each operation. If the operation contains a set of sub-operations (e.g., Conv2d, BN), the number of parameters is presented as the sum of parameters for all these sub-operations, multiplied by R. Column R indicates how many times the operation is repeated. For the convolution layer (Conv2d), k refers to the kernel size, s is the stride, p is the padding, and g is the group parameter.

Table 2: Architecture of PocketNetM-128. Normal and reduction cells are the cells learned by DARTS on CASIA-WebFace. The table shows the number of parameters for each operation. If the operation contains a set of sub-operations (e.g., Conv2d, BN), the number of parameters is presented as the sum of parameters for all these sub-operations, multiplied by R. Column R indicates how many times the operation is repeated. For the convolution layer (Conv2d), k refers to the kernel size, s is the stride, p is the padding, and g is the group parameter.

Table 1 presents the overall architecture of PocketNetS-128. PocketNetS-128 contains in total 925,632 trainable parameters, and setting the embedding size to 256 increases the number of parameters in PocketNetS-256 to 991,424. All networks use 32-bit floating point, and the required memory footprints are 3.7 MB and 3.9 MB for PocketNetS-128 and PocketNetS-256, respectively. The main motivation for using different embedding sizes is to evaluate the effect of the embedding size on the network performance and memory footprint. We also investigate a wider PocketNet architecture by doubling the number of feature maps of the network and reducing the number of cells from 18 to 9. This results in two networks: PocketNetM-128 and PocketNetM-256 (PocketNet medium) with embedding sizes of 128-D and 256-D, respectively. The architecture of PocketNetM-128 is presented in Table 2. PocketNetM-128 contains 1,686,656 parameters and PocketNetM-256 contains 1,752,448 parameters.