Playing with blocks: Toward re-usable deep learning models for side-channel profiled attacks

This paper introduces a modular deep learning network for side-channel analysis. Our approach allows parts of the network (modules) to be exchanged with other networks. We aim to introduce reusable trained modules into side-channel analysis instead of building a new architecture for each evaluation, reducing the effort required to conduct evaluations. Our experiments demonstrate that our architecture can successfully carry out a side-channel evaluation, suggesting that learning transferability is possible with the network we propose in this paper.


Introduction
In the side-channel analysis (SCA) field of research, deep learning models (DL models) are powerful tools to evaluate the implementation of secure algorithms. Unfortunately, despite the significant accomplishments achieved with deep learning models, many challenges remain.
When evaluating a secure implementation on an IoT device, for example, it is challenging to develop a deep learning classifier that reliably assesses the resilience of the device. Electronic noise used as a countermeasure and desynchronization are specific challenges during the evaluation. Indeed, a noisy signal intrinsically implies dealing with high-dimensional signals. For instance, targeting a modern System-on-Chip with a high clock frequency requires increasing the sampling resolution; as a consequence, the leakage traces required for the evaluation contain many irrelevant features (sample points). Hence, noise filters and feature engineering are being reconsidered as pre-processing steps to deal with those challenges [22,19,27,16,13,11].
This paper proposes a new technique to overcome those challenges: a deep learning classifier whose architecture is partly re-usable in other models built with the same approach. Thanks to exchangeable modules, we can re-use networks across different SCA evaluations, reducing the effort of deriving a model each time. The suggested architecture comprises coupled modules, each with a specific task that addresses one of the challenges of an SCA evaluation. We call this approach the DL-SCA modular network. Precisely, an autoencoder and a convolution-based classifier are the two modules that we suggest in this paper.
An autoencoder can effectively deal with both high dimensionality and noise. An autoencoder itself comprises two parts: the encoder and the decoder. The encoder ends in an embedding where high-dimensional leakage traces are transformed into a lower-dimensional version of them. For this reason, autoencoders are learning algorithms used in pre-processing steps, i.e., feature extraction [15,19,7].
The classifier module serves two objectives: (i) to perform the classification required for the SCA evaluation and (ii) to regularize the autoencoder. As we explain in later sections of this paper, an autoencoder might fail to compress the samples taken from the device under test, so penalizing it with a regularization term might steer it toward better performance.
Our experiments use datasets with desynchronization and countermeasures. After showing the effectiveness of the DL-SCA modular network, we perform a second set of experiments where we exchange modules between modular networks. Our results show that transferability is feasible and applicable to side-channel analysis. The contributions of this paper are as follows:
- We introduce an approach called DL-SCA modular network, a deep learning architecture that features the exchange of modules between models. We provide the implementation details of the architecture, as well as the hyperparameters to take into account in the design to avoid pitfalls.
- We present a training strategy based on the weight sharing technique and an early stopping policy for a seamless adoption of our approach in current SCA evaluations.
- We elaborate experiments that demonstrate the effectiveness of re-using modules across modular networks, using different "sharing" protocols based on non-trainable layers.
The rest of the paper is organized as follows: Sect. 2 details theoretical aspects of the topics used in this work. Related work is discussed in Sect. 3. Sect. 4 provides information about the datasets used in the experiments. Sect. 5 discusses the main contribution of this paper. Sect. 6 and Sect. 7 discuss the experiments. Finally, Sect. 8 concludes the paper.

Profiled attack
A side-channel attack requires a leakage model to attack the sensitive information contained in a target device. A leakage model refers to a function ($\delta$) that models the leak of sensitive information. Using a leakage model, an adversary can steal the secret key from a device that implements a cryptographic algorithm. Expression (1) is an example of a leakage model used to attack a cryptographic implementation of AES. In that leakage model, $p$ is the publicly available data, i.e., the plaintext, and $k^*$ is the secret key.
The adversary measures the power consumption while the device runs the AES algorithm on random values $p$ drawn from the keyspace $\mathcal{K} = \{0, \dots, 255\}$, collecting several leakage traces (a.k.a. power traces). With enough leakage traces, the adversary can find a correlation between the power consumption and the inputs $p$ of the leakage model; consequently, the adversary can infer the key $k^*$.
The previous paragraph describes a traditional non-profiled side-channel attack on a cryptographic primitive. To understand how a profiled side-channel attack works, we have to explain how it differs from a non-profiled attack. Profiled attacks were born from the idea of training classifiers to distinguish the outputs of a leakage model, so the attack splits into two phases: (i) training a classifier (profiling phase) and (ii) performing the attack (attack phase).
The first phase consists of applying the corresponding leakage model to a clone device (a.k.a. profile device) and collecting leakage traces from it, forming a set of profiling traces ($\mathcal{X}$) used to train the classifier. During the attack phase, a set of attack traces is collected from the actual device and fed to the trained classifier to compute probabilities; a key recovery process then takes place using a metric called guessing entropy, which we explain briefly below.
Template attacks and machine learning are two techniques to build a classifier for side-channel evaluations [7,12]. It is well known that coming up with a classifier for an SCA evaluation is not straightforward. Indeed, reducing the effort of building a classifier every time an SCA evaluation is required is the motivation for the proposal of this paper. We propose a deep learning-based model whose architecture allows the model to exchange classifiers with another deep learning model aimed at conducting an SCA evaluation on a different device. In the following, we address the aspects necessary to support the theory behind this approach.

Guessing entropy (GE)
GE is the average rank of the correct key byte value $k^*$ in a key guessing vector $g$, over the whole set $\mathcal{K}$ of key candidates $k$ [26]. Formally, $GE = \mathrm{rank}_{k^*}(g)$, where $\mathrm{rank}_{k}(g) \in \{0, \dots, |\mathcal{K}| - 1\}$, and the key guessing vector is defined as

$$ g = \mathrm{sort}\left(\mathbb{E}[P_r]\right), $$

where $P_r$ is the input vector of probabilities $p_{i,j}$ from a classifier (usually aimed at the key recovery task) given a leakage trace $t_i$. After applying the expectation $\mathbb{E}$ over multiple experiments of $P_r$, the sort function orders the resulting vector $g$ in decreasing order. The element $g_0 \in g$ corresponds to the most likely key candidate, while $g_{|\mathcal{K}|-1} \in g$ is the least likely one.
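As an illustration of the metric, the following minimal sketch estimates the guessing entropy from classifier outputs. It rests on assumptions that are not taken from this paper: the classifier is assumed to output one probability per key candidate, the per-trace probabilities are combined by summing their logarithms (a common practice in the literature), and the names `key_rank` and `guessing_entropy` as well as the toy data are hypothetical.

```python
import math
import random


def key_rank(probabilities, correct_key):
    """Rank of correct_key (0 = most likely) after accumulating log-probabilities.

    probabilities: list of per-trace lists, one probability per key candidate.
    """
    n_keys = len(probabilities[0])
    scores = [0.0] * n_keys
    for trace_probs in probabilities:
        for k, p in enumerate(trace_probs):
            scores[k] += math.log(p + 1e-36)  # small constant avoids log(0)
    # Sort key candidates by decreasing score, as the sort function does for g.
    ordered = sorted(range(n_keys), key=lambda k: scores[k], reverse=True)
    return ordered.index(correct_key)


def guessing_entropy(probabilities, correct_key, n_experiments=20, n_traces=10, seed=0):
    """Average rank of the correct key over several random subsets of attack traces."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_experiments):
        subset = rng.sample(probabilities, n_traces)
        ranks.append(key_rank(subset, correct_key))
    return sum(ranks) / len(ranks)


# Toy data (assumption): every trace favours key candidate 42.
N_KEYS = 256
toy = [[0.5 if k == 42 else 0.5 / (N_KEYS - 1) for k in range(N_KEYS)]
       for _ in range(50)]
ge = guessing_entropy(toy, correct_key=42)
```

With this toy input the correct key is always ranked first, so the estimated guessing entropy is zero; with real classifier outputs the value typically decreases as more attack traces are accumulated.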

Deep learning-based profiled attacks
A deep learning classifier outputs a vector of probabilities that is fed into the guessing entropy (GE) metric to compute the rank of the key $k^*$. We denote a deep learning model for profiled attacks as $C_{\theta}^{\delta}$, a classifier $C$ with a vector of parameters $\theta \in \mathbb{R}^n$ aimed at distinguishing leakage traces labeled using a leakage model $\delta$.
Having labeled leakage traces means that our learning approach is supervised learning [10], which is one of the most practical ways to train a deep learning classifier.
Among the many deep learning architectures, CNN-based models are the preferred choice for profiled attacks. The convolutional part plays an essential role when leakage traces are desynchronized. The deep learning model we propose uses a specific type of convolution, called dilated convolution [18], to boost the feature extraction capability of the layer (see Sect. 2.5).

Feature extraction
A feature extraction process applies a transformation (linear or non-linear) to a space of observations, resulting in a new space mapped by the transformation. Formally, consider a profiling set $\mathcal{X}$ of $N$ leakage traces, where each trace comprises $m$ features (or sample points). Feature extraction applies a function to the profiling set $\mathcal{X}$, mapping it to a new profiling set $\mathcal{Y}$ whose elements have fewer dimensions than the corresponding elements in $\mathcal{X}$; precisely, this function, denoted $f$ here, is a mapping $f : \mathcal{X} \to \mathcal{Y}$ with $\mathcal{X} \subset \mathbb{R}^m$, $\mathcal{Y} \subset \mathbb{R}^n$, and $n < m$. This transformation aims to derive new features ($\mathcal{Y}$) that improve, for instance, the performance of a classifier. Theoretically, features in $\mathcal{Y}$ contain the "transformed" information that best represents the ground truth of $\mathcal{X}$; in the SCA case, this is the leakage of the sensitive information. In simple words, the valuable information gets emphasized while the irrelevant (non-correlated) information has little to no influence in the new space.
However, it is not straightforward to come up with a transformation that indeed emphasizes the side-channel information. A transformation that goes wrong discards a lot of useful information; this happens when the transformation cannot keep the variance that distinguishes one leakage trace from another. As a consequence, $\mathcal{Y}$ is made of several collapsed traces and becomes useless for classification purposes. In Sect. 5, we discuss how our proposed method implements regularization to avoid transformations that collapse the $\mathcal{Y}$ space.
The feature extraction function can be inferred directly from $\mathcal{X}$. For instance, Principal Component Analysis (PCA) [9] and Linear Discriminant Analysis (LDA) [2] are two algorithms to build linear functions for feature extraction. However, PCA and LDA are highly sensitive to desynchronization because of their "per feature" process, meaning they find a relation by correlating the feature at the same position across samples. Hence, when the samples suffer a spatial disruption, the relation weakens, and more samples are required.

Autoencoders
An autoencoder is a learning algorithm useful to infer the feature extraction function; contrary to PCA and LDA, an autoencoder can infer a non-linear transformation thanks to the non-linear activation functions in its architecture. Moreover, when the autoencoder architecture comprises convolution layers, it handles spatial disruption better than PCA and LDA.
An autoencoder consists of two parts: (i) an encoder $\phi$ and (ii) a decoder $\psi$. Let us define a leakage trace $t_i \in \mathcal{X}$ with dimension $\dim(t_i) = m$. The encoder outputs a new trace $t'_i$ with $\dim(t'_i) < \dim(t_i)$ (see expression (2)). At the other side of the autoencoder, the decoder tries to reconstruct $t_i$, but it can only re-build an approximation $\tilde{t}_i$; consequently, an autoencoder learns by minimizing the difference between $t_i$ and $\tilde{t}_i$ (as we will see in expression (6)).
From a functional perspective, the encoder maps $\mathcal{X}$ to an embedding space denoted by $\mathcal{Z}$ (i.e., $\phi : \mathcal{X} \to \mathcal{Z}$); the embedding $\mathcal{Z}$ is usually called latent space, code, latent code, or hidden code. In other words, $\mathcal{Z}$ is the space resulting from the transformation applied by the encoder. Following the discussion in the previous subsection, $\mathcal{Z}$ is the resulting space of a feature extraction process, i.e., $\mathcal{Y}$. Likewise, the decoder maps $\mathcal{Z}$ back to $\mathcal{X}$ (i.e., $\psi : \mathcal{Z} \to \mathcal{X}$), producing the reconstruction $\tilde{t}_i$. Expressions (3) and (4) formalize these two mappings, where the function $\sigma$ denotes a non-linear activation function. An encoder is parameterized by a weight matrix $W_{enc} \in \mathbb{R}^{m \times n}$ and a bias vector $b \in \mathbb{R}^n$; likewise, a decoder is parameterized by a weight matrix $W_{dec} \in \mathbb{R}^{n \times m}$ and a bias vector $b' \in \mathbb{R}^m$ (see Fig. 1). Training an autoencoder implies finding a vector of parameters $\theta = (W_{enc}, W_{dec}, b, b')$ that minimizes a loss function $L$ (expression (5)). As we said, an autoencoder learns by minimizing the difference between $t_i$ and $\tilde{t}_i$, so the Mean Squared Error (MSE) is a commonly used loss function (expression (6)).

Fig. 1. Typically, autoencoders are symmetrical models, meaning that the encoder and the decoder mirror each other. During training, the encoder learns to code the original signal into a latent space; ideally, this code consists of the features that best represent the characteristics of the original signal. From there, the decoder reconstructs the original signal as closely as possible.
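To make the shapes of the two mappings concrete, the sketch below runs one forward pass of a single-layer encoder and decoder and computes the MSE between a trace and its reconstruction. It is only an illustration under stated assumptions: the sigmoid is chosen arbitrarily as the activation $\sigma$, the weights are random rather than trained, the sizes `m = 8` and `n = 3` are toy values, and the helper names are hypothetical. The weight matrices follow the shapes given in the text ($W_{enc}$ is $m \times n$, $W_{dec}$ is $n \times m$).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vector, weights, bias):
    """Compute sigma(vector @ weights + bias); weights has shape (len(vector), len(bias))."""
    return [
        sigmoid(sum(v * weights[i][j] for i, v in enumerate(vector)) + bias[j])
        for j in range(len(bias))
    ]


def mse(trace, reconstruction):
    """Mean squared error between a trace and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(trace, reconstruction)) / len(trace)


m, n = 8, 3  # toy sizes (assumption): m sample points, n latent units, n < m
rng = random.Random(0)
W_enc = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]  # m x n
b_enc = [0.0] * n
W_dec = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)]  # n x m
b_dec = [0.0] * m

trace = [rng.random() for _ in range(m)]       # stands in for t_i
latent = dense(trace, W_enc, b_enc)            # encoder output in the latent space
reconstruction = dense(latent, W_dec, b_dec)   # decoder output, same length as t_i
loss = mse(trace, reconstruction)
```

Training would adjust the four parameter groups to reduce `loss`; that step is omitted here because it depends on the optimizer and framework in use.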
Convolution layer architecture. Autoencoders are built using either fully connected layers or convolutional layers. The latter make the autoencoder inherit robustness to spatial shifts, which is useful when leakage traces are desynchronized; our autoencoder uses dilated convolution layers.
A convolution layer consists of kernels, which are essentially matrices; dilating a convolution layer consists of inserting zeros into its kernels, i.e., separating the matrices' elements with zeros, which expands their receptive field. According to [18], a dilated kernel allows convolution-based classifiers to combine spread-out features that contain the leakage information while avoiding irrelevant features that might lie in between.
Let us consider expression (7), which shows a regular convolution where a leakage trace $t_i$ is multiplied by the kernel $q$ whose length is denoted by $l_q$. If we slide the leakage trace $t_i$ from right to left, a single feature $t$ of $t_i$ is multiplied $l_q$ times. If $l_q$ is large, then $t$ might be excessively used during the operation. According to [29], this excessive use of $t$ may decrease the effectiveness of the convolution. Notice that if $l_q$ is increased to reach features spread further apart, the number of times $t$ is used increases as well. Dilated convolutions avoid this downside.
Expression (8) shows a dilated kernel with one zero inserted between its elements. Notice that when the convolution is performed, the feature $t$ is alternately multiplied by a kernel element and by a zero; consequently, the operation uses the feature $t$ fewer times.
The dilation rate ($dr$) hyperparameter controls the number of zeros inserted. When a kernel is dilated, its receptive field changes according to the relation in expression (9). In this way, the receptive field increases by modifying either the length of the kernel or the dilation rate, letting the user regularize the convolution operation.
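The effect described above can be checked with a small sketch. It assumes that the parameter `zeros` counts the zeros inserted between two adjacent kernel elements (in common deep learning frameworks, a dilation rate of $d$ corresponds to $d - 1$ inserted zeros); under that assumption the receptive field is $l_q + (l_q - 1) \cdot \text{zeros}$. All function names are hypothetical.

```python
def dilate_kernel(kernel, zeros):
    """Insert `zeros` zeros between adjacent kernel elements."""
    dilated = []
    for i, weight in enumerate(kernel):
        dilated.append(weight)
        if i < len(kernel) - 1:
            dilated.extend([0.0] * zeros)
    return dilated


def receptive_field(kernel_length, zeros):
    """Number of trace samples spanned by the dilated kernel."""
    return kernel_length + (kernel_length - 1) * zeros


def conv1d_valid(trace, kernel):
    """Sliding dot product without padding (cross-correlation, as in DL frameworks)."""
    span = len(kernel)
    return [
        sum(trace[i + j] * kernel[j] for j in range(span))
        for i in range(len(trace) - span + 1)
    ]


def uses_of_feature(trace_length, kernel, index):
    """Count the non-zero multiplications that involve trace[index]."""
    count = 0
    for i in range(trace_length - len(kernel) + 1):
        j = index - i
        if 0 <= j < len(kernel) and kernel[j] != 0:
            count += 1
    return count


regular = [1.0] * 5                            # regular kernel, receptive field 5
dilated = dilate_kernel([1.0, 1.0, 1.0], 1)    # [1, 0, 1, 0, 1], receptive field 5
regular_uses = uses_of_feature(20, regular, 10)
dilated_uses = uses_of_feature(20, dilated, 10)
```

Both kernels span five samples, but the regular kernel multiplies the sample at index 10 five times while the dilated kernel multiplies it three times, which is the reduction in feature reuse that the text attributes to dilation.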

Related work
While few works in SCA discuss architecture transferability with reusable modules, several works have discussed feature reduction for SCA. Cagli et al. [6,5,4] discussed the application of traditional feature reduction methods using PCA [9], LDA [2], and their kernel-based variants, Kernel PCA and KDA. Picek et al. [21] published results using the same methods as [5]. However, the authors of [21] used an approach that combined feature extraction and feature selection; precisely, they combined PCA and LDA with SOST and SOSD, and called the result hybrid feature selection methods.
Intrinsically, any work that uses these feature reduction techniques aims to downsample the signal by taking it to a new space (latent space). However, these approaches consider only linear feature reduction, disregarding its more powerful non-linear counterpart; this is likely a consequence of CNNs being advertised as deep learning models with built-in feature extraction. Hence, very few works have addressed non-linear methods for SCA evaluation. Among those few works are, for instance, Paguada et al. [19] and Yang et al. [28]; similar to us, those works used autoencoders to infer a non-linear function that pre-processes leakage traces in a fashion that outperforms linear methods.
While those two works are the closest ones to ours, to the best of our knowledge, there is no previous work on side-channel analysis that suggests a deep learning approach based on modules that can be shared between models.

Datasets
The ASCAD dataset has two versions: traces collected with fixed-key encryption ($k_f$) and traces collected with random-key encryption ($k_r$); the plaintext is always random, and the target byte of the secret key in both cases is the third one. We name these versions ASCAD_f and ASCAD_r, respectively. Due to these key characteristics, ASCAD_r is more challenging and more realistic than ASCAD_f when conducting an SCA evaluation. Table 1 summarizes the main characteristics of these two datasets.
                    ASCAD_f    ASCAD_r
Profiling traces     50 000    200 000
Attack traces        10 000    100 000
dim(t_i)                700      1 400

Table 1. Cardinalities of the ASCAD datasets. Since their goal is to be used for benchmarking profiled attacks, the leakage traces are grouped into a profiling set and an attack set.
Leakage traces in each version are desynchronized according to a threshold value that shifts the traces along the x-axis; frequently used threshold values are 0, 50, and 100. To make clear distinctions when exchanging modules between modular networks, we append the threshold value to the dataset name, for instance, ASCAD_r desync50.
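The following sketch illustrates what a desynchronization threshold means in practice. It is a simplification under assumptions: each trace is shifted cyclically by a random offset drawn uniformly from 0 to the threshold, which may differ from how the dataset itself was generated, and the helper names `desynchronize` and `dataset_name` are hypothetical.

```python
import random


def desynchronize(trace, threshold, rng):
    """Cyclically shift `trace` to the right by a random offset in [0, threshold]."""
    shift = rng.randint(0, threshold)
    if shift == 0:
        return list(trace)
    return trace[-shift:] + trace[:-shift]


def dataset_name(version, threshold):
    """Build the naming convention used in the text, e.g. 'ASCAD_r desync50'."""
    return f"ASCAD_{version} desync{threshold}"


rng = random.Random(1)
trace = list(range(10))                     # toy trace (assumption)
aligned = desynchronize(trace, 0, rng)      # threshold 0 leaves the trace untouched
shifted = desynchronize(trace, 5, rng)      # same samples, displaced along the x-axis
```

A threshold of 0 keeps the traces aligned, while larger thresholds spread the leaking sample over more positions, which is why per-position methods need more traces in that setting.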

DL-SCA modular network architecture
This section explains the architecture of the DL-SCA modular network and describes the strategy to train it.
Since we use autoencoders, our suggested DL-SCA modular network comprises three main modules: an encoder, a decoder, and a classifier (see Fig. 2). In particular, we group the encoder and decoder into a single module called the downsampler. The downsampler has two goals: (i) to extract meaningful features by reducing the noise in the leakage traces and (ii) to downsample them. The classifier, in turn, is in charge of evaluating those extracted features as a classification problem.
It is worth mentioning that once the DL-SCA modular network is trained, we discard the decoder of the downsampler and use only the encoder and the classifier to perform the SCA evaluation. For this reason, we devise a training strategy that monitors only those two parts of the model; we elaborate on it later in this section.
The goal of both modules might be apparent; however, the downsampler has an implicit objective. To achieve compatibility with as many classifiers as possible, we use the downsampler to fix the classifier input. Precisely, we downsample the leakage traces to a fixed length; then, when we re-use the classifier with another downsampler, the latter fixes its output to match the classifier's input. By doing this, we fulfill the first step toward re-usability. We demonstrate this in the experimental section of this paper.
Training a DL-SCA modular network requires one loss function for the decoder and another for the classifier. The decoder's loss function ($L_{MSE}$) was discussed in Sect. 2.5. We now introduce the classifier loss function.

Classifier loss function
As we said, a classifier outputs a vector of probabilities used as input for the guessing entropy. For the classifier to output this vector, we train it using a cross-entropy (CE) loss function. In supervised profiled side-channel attacks, the leakage traces are labeled with the output of a leakage model (see expression (1)). The classifier then learns by minimizing its error in predicting the label of each trace.
To explain this better, let us consider expression (10). The space $\mathcal{K}$ corresponds to a batch of key candidates or labels; each label in $\mathcal{K}$ represents a trace. For instance, let us take $\delta_i \in \mathcal{K}$ as one of those labels; we say that $\delta_i$ is the ground truth, while $\sigma(\delta_i)$ is the output score computed by the neural network (often, $\sigma$ is the softmax activation function for multi-class classification).
During training, this loss function computes the error in the prediction made by the classifier; consequently, the weights of the classifier are updated toward achieving the highest possible prediction accuracy.
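As a reference for how such a loss behaves, the sketch below implements the usual softmax followed by categorical cross-entropy for integer labels. It is a generic formulation, given as an assumption about the form of the classifier loss rather than a transcription of the expression used in this paper; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the maximum for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the ground-truth label."""
    return -math.log(softmax(scores)[label])


def batch_cross_entropy(batch_scores, labels):
    """Average cross-entropy over a batch of labeled traces."""
    losses = [cross_entropy(s, y) for s, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# Toy scores (assumption): the network strongly favours class 0.
confident = [10.0, 0.0, 0.0]
loss_when_right = cross_entropy(confident, 0)
loss_when_wrong = cross_entropy(confident, 1)
```

The loss is small when the highest score matches the ground-truth label and grows when it does not, which is the signal that drives the weight updates during training.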
We use the classifier for the same purpose as in a common profiled side-channel evaluation. In our approach, however, the classifier additionally acts as a regularization term for the downsampler. Precisely, the supervised classifier adds an extra penalization to the downsampler with regard to the feature space $\mathcal{Y}$ that the downsampler is building, leading the whole network toward better performance. This cannot be achieved with a self-supervised downsampler trained separately, as the authors did in [19].
The arrangement depicted in Fig. 2 shows that both the classifier and the decoder attach to the encoder. Consequently, when training the modular network, the classifier forward-propagates the downsampled traces from the embedding and back-propagates its loss. Meanwhile, the decoder trains its reconstruction capability, which additionally penalizes the encoder. These two losses resemble a double voting system that the encoder uses to guide its learning. Notice that because the activation functions are non-linear, the classifier acts as a non-linear regularizer for the embedding space. The decoder perceives the regularization effect as small perturbations in that space; those perturbations challenge the decoder when reconstructing the original traces, as it treats them as small errors in its reconstruction. Contrary to the approach in [19], jointly training the autoencoder and the classifier produces an embedding likely to learn useful features. Due to the regularization factor, less correlated features are emphasized over highly correlated noisy features.

Analogy with linear regularized autoencoder
Autoencoders are meant to be imperfect models; when training an autoencoder, we must avoid an architecture that ends up as an "identity function". When this happens, the autoencoder just copies the data from the input to the output. One way to avoid this is to use an undercomplete architecture, which refers to the embedding we discussed earlier; further, the deeper an autoencoder is, the better it avoids ending up as an identity function. However, doing this carelessly might reduce the network performance as the model becomes overly complex, so we cannot always rely on it.
Applying a regularizer to the latent space is another alternative. Regularized autoencoders have been shown to outperform plain autoencoders at learning meaningful features in the embedding. A linear regularizer applies an extra penalization to the latent space. The embedding neurons pass this additional penalization on to the decoder, where it is added to the loss function as small epsilons of error. The decoder compensates for the disruption by training its neurons to reconstruct the original data, unaware that it is being fooled by the regularizer, so its learning is actually "imperfect" [10].
A drawback of a linear regularizer is precisely its nature. A linear regularizer applies the penalization uniformly to all the embedding neurons; there is no criterion that controls the magnitude each neuron should receive based on its contribution to the loss function. As a result, the decoder eventually starts copying the input as it is.
In a DL-SCA modular network, the classifier acts as a regularizer; however, the regularization is non-linear since the classifier is a non-linear function. The non-linear activation functions used in the classifier receive their input from the embedding neurons; when the classifier back-propagates, it applies an epsilon value to each embedding neuron according to its contribution to the classification. Once again, the decoder interprets those values as small errors, but now it faces a more advanced regularization.
Both linear and non-linear regularizers require a value to control the intensity of the penalization. For our non-linear regularizer, this value is a parameter $\gamma \in (0, 1] \subset \mathbb{R}$ multiplied by the result of the loss function.

DL-SCA modular network loss function
Now that we know the two losses required by our architecture, as well as the hyperparameter that controls their intensity, expression (11) defines the loss function of a DL-SCA modular network.
Notice that there is an $\omega$ parameter for $L_{MSE}$ that works exactly as $\gamma$. We fix $\omega = 1$ because our goal is to control the regularization and not the reconstruction.
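A minimal sketch of how the two losses could be combined is given below. It assumes the total loss is the weighted sum $\omega L_{MSE} + \gamma L_{CE}$, which is consistent with the description of $\gamma$ and $\omega$ as multiplicative factors but is an assumption about the exact form; the function name `modular_loss` is hypothetical.

```python
def modular_loss(mse_loss, ce_loss, gamma=1e-3, omega=1.0):
    """Weighted sum of the reconstruction loss and the classifier loss.

    gamma scales the classifier (regularization) term and must lie in (0, 1];
    omega scales the reconstruction term and is fixed to 1 in the text.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in the interval (0, 1]")
    return omega * mse_loss + gamma * ce_loss


# Toy loss values (assumption), only to show the effect of gamma.
weak_regularization = modular_loss(0.5, 2.0, gamma=1e-3)
strong_regularization = modular_loss(0.5, 2.0, gamma=1.0)
```

With a small `gamma` the reconstruction term dominates and the classifier only nudges the embedding, whereas `gamma = 1` gives both objectives equal weight.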

Training strategy for a DL-SCA modular network
Recently, the authors of [20] published an early stopping framework that monitors the state of a deep learning model during training, preventing it from overfitting or underfitting. Overfitting and underfitting are phenomena that might occur during training; they represent states in which a deep learning network cannot generalize beyond its training set.
The framework computes the guessing entropy at the end of each epoch and bases the stopping criterion on the whole guessing entropy vector, considering when the guessing entropy converges and how many traces keep the guessing entropy in the state of convergence; it was shown to outperform existing frameworks (more details can be found in the original paper [20]). We use this early stopping framework to devise a training strategy for our DL-SCA modular network.
Training strategy. An early stopping framework stops the training of a deep learning model when it meets conditions established using a metric, e.g., the accuracy of the model. Typically, these frameworks evaluate the entire model. In contrast, we need the framework to consider just the encoder and the classifier, as they are the parts used in the SCA evaluation. The framework from [20] is a "typical" framework, so it monitors the whole deep learning model.
We modified the suggested framework to receive a truncated model that comprises just the modules of interest (encoder and classifier). Applying this modification is effortless when using the weight sharing technique [30]. In this technique, two or more neural networks share references to some specific layers, and all of those networks can update the weights of those layers; in our particular case, however, the original network updates the weights, while the truncated model only monitors the state of the encoder and classifier weights.
Precisely, we set a truncated model (see Fig. 3) to reference those modules and to be only evaluated (not trained) by the early stopping framework. Training then stops when the encoder generates features that make the classifier achieve the expected performance, in this case an expected guessing entropy convergence. The modification works because the framework uses the truncated model as the predictor, and its output serves as the input to compute the guessing entropy.
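The idea of a truncated model that shares references with the full network can be sketched without any deep learning framework. Everything below is a simplified stand-in: `Layer`, `Model`, `train_step`, `fit`, and `toy_guessing_entropy` are hypothetical names, the "training" merely increments the weights, and the stopping rule (guessing entropy equal to zero for `patience` consecutive epochs) is a simplification and not the criterion of the framework in [20].

```python
class Layer:
    """A named layer holding its weights (stand-in for a real framework layer)."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = list(weights)


class Model:
    """An ordered list of layer references; the layers themselves are never copied."""

    def __init__(self, layers):
        self.layers = list(layers)


def train_step(model, step=1.0):
    """Placeholder update: only the model passed here modifies the shared layers."""
    for layer in model.layers:
        layer.weights = [w + step for w in layer.weights]


def fit(full_model, truncated_model, evaluate_ge, max_epochs=50, patience=2):
    """Train the full model; evaluate (never train) the truncated one each epoch."""
    converged_epochs = 0
    for epoch in range(1, max_epochs + 1):
        train_step(full_model)
        ge = evaluate_ge(truncated_model)
        converged_epochs = converged_epochs + 1 if ge == 0 else 0
        if converged_epochs >= patience:
            return epoch
    return max_epochs


encoder = Layer("encoder", [0.0])
decoder = Layer("decoder", [0.0])
classifier = Layer("classifier", [0.0])
full_model = Model([encoder, decoder, classifier])
truncated_model = Model([encoder, classifier])  # same objects, decoder left out


def toy_guessing_entropy(model):
    """Hypothetical metric that reaches zero once the encoder weight is at least 3."""
    return max(0.0, 3.0 - model.layers[0].weights[0])


stopped_epoch = fit(full_model, truncated_model, toy_guessing_entropy)
```

Because both models hold references to the same `encoder` and `classifier` objects, every update made through `full_model` is immediately visible to `truncated_model`, so the monitor always evaluates the current state of exactly the modules that will be used in the attack.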
In the first experimental results section of this paper, we show the outcome of the training strategy using surface plots. Notice that we did not stop the network training, so the surface plots correspond to the entire training process; our goal is to show that a DL-SCA modular network does not rely on an early stopping framework.

Experimental results of training modules
In this section, we discuss the results of using our proposed approach on the ASCAD datasets: ASCAD_f all desync and ASCAD_r all desync.
We organize the experimental results as two use cases in which a DL-SCA modular network analyzes (i) the ASCAD fixed-key dataset and (ii) the ASCAD random-key dataset. We accomplish two goals with these use cases: (i) to show the feasibility of an SCA evaluation that uses our architecture to attack a specific dataset, and (ii) to create a scenario where we demonstrate the feasibility of sharing modules. The strategy is applicable to real evaluations: the derived modular network evaluates a first dataset; subsequently, a second modular network can evaluate another dataset by borrowing a module from the previous modular network.
In our particular case, our experiments use two datasets that share the same source of data; precisely, both datasets are composed of leakage traces from the same microcontroller (8-bit ATmega8515). We leave experiments in which the two datasets come from different sources as future work.
Notice that we used the same model for all levels of desynchronization, meaning that no additional effort is required to find neural network architectures for specific noisy scenarios.

ASCAD_f all desync use case
Table 2 summarizes the hyperparameters of the modular network architecture used to evaluate ASCAD_f all desync.
Network's architecture. We set the architecture following the discussion in Sect. 2.5; the first convolutional block uses dilated convolutions to skip useless features that might reduce the model's performance. We dilate the convolutions in the first convolutional block because that is where we deal with the original version of the trace. Further, we add convolutional blocks to the encoder following the rules applied to VGG-based [25] deep learning architectures.
The decoder mirrors the encoder, as our downsampler uses a symmetric autoencoder. For the decoder to up-sample, namely to restore the actual length of the trace, it uses transposed convolutions. As is well known, matrix multiplication is not commutative, so we cannot obtain the same output dimensions in the corresponding convolutional blocks. Consequently, we have to tune the hyperparameters of the decoder's convolution layers. For instance, the encoder's third convolutional block uses a stride of 5, whereas its corresponding transposed convolutional block in the decoder, which is the first one, uses a stride of 7. By doing this, we make the output of the decoder match the original trace dimension.
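The need to retune the decoder can be seen from the output-length arithmetic of strided convolutions. The sketch below uses the basic formulas for a convolution and a transposed convolution without padding and without output padding; frameworks may apply different padding conventions, so the numbers are illustrative only. The kernel sizes and strides are toy values and are not the hyperparameters of the architecture in this paper.

```python
def conv_output_length(length, kernel, stride):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel) // stride + 1


def transposed_conv_output_length(length, kernel, stride):
    """Output length of a 1-D transposed convolution without padding or output padding."""
    return (length - 1) * stride + kernel


trace_length = 700                                             # length of an ASCAD_f trace
encoded = conv_output_length(trace_length, kernel=3, stride=5)
mirrored = transposed_conv_output_length(encoded, kernel=3, stride=5)
retuned = transposed_conv_output_length(encoded, kernel=5, stride=5)
```

Here a mirrored transposed convolution yields 698 samples instead of 700 because the integer division in the forward pass discards the remainder; changing one decoder hyperparameter (the kernel size in this toy case) restores the original length.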
Latent space hyperparameters. Regarding the number of latent space units and the value of $\gamma$, we perform a grid search for the best number of units in the latent space, using the values 100, 200, 300, 400, and 560. Further, we know that the parameter $\gamma$ relates strictly to the number of latent units; consequently, to find the value of $\gamma$, we create combinations of the latent space values and the $\gamma$ values $\{10^{-3}, 10^{-6}, 10^{-9}\}$. It turned out that the best combination was 300 latent space units and $\gamma = 10^{-3}$.

Table 2. DL-SCA modular network architecture used in the experiments with all desynchronization levels of ASCAD_f.
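The combinations explored by such a grid search can be enumerated as in the sketch below. The candidate values are the ones listed in the text; the scoring callback is a hypothetical placeholder (in practice it would train a modular network and return a quantity to minimize, such as the number of attack traces needed for the guessing entropy to converge), and the arbitrary toy score used here carries no experimental meaning.

```python
from itertools import product

LATENT_UNITS = [100, 200, 300, 400, 560]
GAMMAS = [1e-3, 1e-6, 1e-9]


def grid_search(evaluate, latent_units=LATENT_UNITS, gammas=GAMMAS):
    """Return (best_score, (units, gamma)) for the lowest score found."""
    best = None
    for units, gamma in product(latent_units, gammas):
        score = evaluate(units, gamma)
        if best is None or score < best[0]:
            best = (score, (units, gamma))
    return best


def toy_score(units, gamma):
    """Arbitrary placeholder score (assumption); NOT a result from the paper."""
    return units * gamma


n_combinations = len(list(product(LATENT_UNITS, GAMMAS)))
best_score, best_config = grid_search(toy_score)
```

The grid contains fifteen combinations, each of which requires training one modular network, so the cost of the search grows with the product of the two candidate lists.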
Regarding the classifier module, bear in mind that we are only interested in its classification performance and not so much in its ability to filter out unnecessary features of the leakage traces; hence, we use a shallow architecture since it deals with already filtered features.

Training strategy and results
As we said, to train a modular network, we use the early stopping framework from [20]. To show that our suggested architecture does not rely on the framework, we did not stop the training after the framework found the best learning state. Further, we will use this outcome in the next section to discuss the results of the module re-use experiment. Fig. 4 depicts the training process when our modular network evaluates the ASCAD_f datasets. As expected, the training outcome differs according to the level of desynchronization; regardless, the guessing entropy of our modular network converged to zero for all desynchronization levels. A view of the attack performance is depicted in Fig. 5.

ASCAD_r all desync use case
Network's architecture. For this dataset, our strategy was to keep the same modular network as in the previous use case, to reuse an already working model as much as possible and see how it performs. After experimenting, we noticed that the downsampler module required an additional convolutional block, identical to the third convolutional block of the decoder, without a pooling layer. Consequently, the decoder should also have the corresponding transposed convolutional block.

Latent space hyperparameters, training strategy, and results
We keep the same classifier as in the previous use case because we have the same number of latent units. In our case, keeping the same number of latent units is convenient because we aim to exchange a trained classifier in the following experiments to evaluate module re-usability. Fig. 7 depicts the guessing entropy per epoch during training for the ASCAD r dataset. In this case, we observe that the performance of our modular network decreases slightly, which is expected since the dataset has a higher level of noise than the previous one. Even so, we achieve good guessing entropy convergence, as Fig. 6 depicts. Finally, we compare our experimental results with previously reported results on the same datasets. TABLE 3 gathers this information.

Module re-usability experimental results
This section presents the results of module re-usability. We show that another, non-trained DL-SCA modular network can reuse the modules of a trained DL-SCA modular network. To demonstrate it, we use the DL-SCA modular networks trained in the previous section.

Analyzing transferability
We aim to show how "transferable" the knowledge of a classifier module is. We have six modular networks (meaning six classifiers): three trained on ASCAD f and three on ASCAD r. Further, because all of them use the same number of latent units (300), all classifiers are interchangeable without additional downsampling operations to adapt their inputs. For our experiments, we took the classifier from the DL-SCA modular network of ASCAD f desync50 and shared it with all the downsamplers from ASCAD r. We considered this sufficient to support our claim about "module re-usability". We chose the ASCAD f → ASCAD r direction because it represents the more complex direction, from fixed key to random key. We inspect the transferability of the ASCAD f desync50 classifier by conducting a similarity analysis using gradient activation operations. In particular, we use heatmaps and gradient visualization to compare how the neurons of the classifier are activated by the data output by the downsamplers.

We perform this analysis by locking specific layers of the classifier to identify how transferable those layers are. Precisely, we choose the convolutional block (Conv) layers and the fully connected block (FC) layers and lock them in turn to evaluate them separately. A heatmap allows us to inspect the convolutional layers of the classifier, while gradient visualization helps us analyze how both Conv and FC perform with the different datasets. TABLE 4 summarizes the similarity analysis we perform using the ASCAD f desync50 classifier, the ASCAD r datasets, and the gradient activation operations. Fig. 8 depicts the first convolutional layer heatmaps from the ASCAD f desync50 classifier and the ASCAD r all desync classifiers (desync0, desync50, and desync100).
For these particular experiments, all ASCAD r classifiers share similarities with the ASCAD f desync50 classifier in how their convolutional layer neurons get stimulated. Under our assumptions, this indicates that the weights of those layers might be transferable. This claim is experimentally demonstrated later in the final experiments.
Although the magnitude of the ASCAD f desync50 classifier's heatmap is higher than that of any heatmap from the ASCAD r classifiers, this does not represent a drawback for transferability. We could have obtained the same magnitudes if we had normalized the weights by applying constraints in the architecture, though a similarity analysis does not require this.
We use gradient visualization to inspect the classifiers' fully connected block (FC). The output of that operation indicates which input features are the most meaningful for the classification. Gradient visualization backpropagates the loss function of a trained classifier, collecting information about the neurons that contribute most to the performance. Further, when it reaches the input layer, it points out which features are connected to those neurons, indicating the meaningful features [14,24,1]. Fig. 9 depicts the result of the gradient visualization operation.
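The idea of reading feature relevance from the gradient of the loss with respect to the input can be illustrated on a toy model. The sketch below assumes a single linear layer followed by softmax and cross-entropy, which is far simpler than the classifier module of this paper; for that toy model the input gradient has the closed form W^T (p − y). All names are illustrative.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def input_gradient(weights, bias, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input features.

    weights[c][i] is the weight from input feature i to class c.
    For a linear softmax model the gradient equals W^T (p - y).
    """
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    p = softmax(logits)
    err = [pc - (1.0 if c == label else 0.0) for c, pc in enumerate(p)]
    return [
        sum(err[c] * weights[c][i] for c in range(len(weights)))
        for i in range(len(x))
    ]


# Feature 2 is not connected to any class, so its gradient is zero.
weights = [[1.0, 0.0, 0.0], [-1.0, 0.5, 0.0]]
grad = input_gradient(weights, [0.0, 0.0], [0.2, 0.1, 5.0], label=0)
saliency = [abs(g) for g in grad]
```

Features with a large absolute gradient are the ones the model relies on, which is the quantity plotted per sample point in a gradient visualization.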
Notice that gradient visualization is less intuitive than the heatmap. As a workaround, we apply Dynamic Time Warping (DTW) [17] to visualize the similarities between the gradient visualization signals.
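As a reminder of what DTW computes, the following sketch implements the textbook dynamic-programming formulation with an absolute-difference local cost. It is an assumption that this matches the variant of [17] used here; the example only shows why DTW suits signals whose meaningful features are displaced in time.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A peak and the same peak delayed by one sample align perfectly.
peak = [0, 1, 2, 1, 0]
delayed_peak = [0, 0, 1, 2, 1, 0]
shift_cost = dtw_distance(peak, delayed_peak)
```

Because the warping path may repeat samples, a displaced but otherwise identical pattern yields a small distance, whereas a point-by-point comparison would penalize the shift.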
According to this experiment, two phenomena occur: (i) the meaningful features are displaced for each classifier, and/or (ii) the meaningful features are less intense in magnitude. These phenomena could represent an issue. For instance, consider the ASCAD r desync0 classifier: notice the displacement, which arises because the ASCAD f desync50 classifier locates the meaningful features differently. Further, those features have an even lower magnitude than those that are supposedly the lowest (see points 0 to 30 in the top plot of Fig. 9).
This analysis suggests that we will need to retrain the classifier; nevertheless, recall that the classifier is just one part of a bigger model. The downsampler will adapt its learning to the constraints imposed by the classifier.

Playing with blocks
Let us suppose we have trained a DL-SCA modular network using a former dataset; then, we have the opportunity to evaluate another dataset.We could use the classifier module of the first network to evaluate it.In this hypothetical scenario, the first dataset is played by ASCAD f and the second one by the ASCAD r dataset.
To experimentally evaluate whether we need to re-train some or all parts of the classifier, we perform experiments locking the blocks of the classifier to prevent them from being trained. In the previous sub-section, we inspected the blocks of the classifier (Conv and FC), and we observed some similarities in their neurons' weights. Now, we evaluate the performance of the whole modular network when its classifier module has the following locks:
- Convolutional block
- Fully-connected block
- Both blocks
We will refer to these as "sharing protocols". By locking the blocks, we find out which is the best sharing protocol for these particular modular networks. TABLE 5 summarizes the combinations of locks and datasets where the shared classifier is used. We previously said that the chosen classifier, ASCAD f desync50, will tackle a more complex dataset, ASCAD r desync100. By also evaluating the ASCAD r desync0 dataset, we cover the scenario where the shared classifier comes from a more complex dataset. Still, bear in mind that this holds only in terms of desynchronization, because the classifier does not come from a more complex dataset in terms of the nature of its secret key (from random to fixed key, for example). So, we rate the "experience" of the classifier as medium.
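The three locks listed above can be pictured as sets of classifier blocks that are excluded from weight updates. The sketch below is a framework-free illustration under assumptions: the block names ("downsampler", "conv", "fc"), the flat weight lists, and the plain SGD step are all hypothetical simplifications, not the training code of the experiments.

```python
# Sharing protocols from the text, mapped to the classifier blocks they lock.
SHARING_PROTOCOLS = {
    "ConvLock": {"conv"},
    "FCLock": {"fc"},
    "BothLock": {"conv", "fc"},
}


def sgd_step(params, grads, locked_blocks, lr=0.1):
    """Apply one SGD update, skipping every block in `locked_blocks`.

    `params` and `grads` map a block name to a list of floats.
    """
    updated = {}
    for name, values in params.items():
        if name in locked_blocks:
            # Locked: keep the weights of the shared classifier unchanged.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


def trainable_count(params, locked_blocks):
    """Number of weights that still receive updates under a protocol."""
    return sum(len(v) for name, v in params.items() if name not in locked_blocks)


params = {"downsampler": [1.0, 2.0], "conv": [3.0], "fc": [4.0, 5.0]}
grads = {"downsampler": [1.0, 1.0], "conv": [1.0], "fc": [1.0, 1.0]}
after_fc_lock = sgd_step(params, grads, SHARING_PROTOCOLS["FCLock"])
```

In a deep learning framework the same effect is usually obtained by marking layers as non-trainable (for example, Keras layers expose a `trainable` attribute); the downsampler is never locked, so it keeps learning around the fixed classifier blocks.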
Due to space constraints, we did not perform an inter-classifier sharing or a no-block-lock sharing protocol; furthermore, we claim that the sharings addressed in our experiments represent the difficult cases, which is enough to support our contribution. We leave those experiments and further combinations of sharing protocols for future work. Fig. 10 depicts the training process of all chosen sharing protocols. It is worth mentioning that we did not change the loss intensity parameter (γ), reducing the effort in tuning the modular network.
For these experiments, we trained the modular networks using the early stopping framework from [20]. Contrary to what we did in the previous section, we do stop the training when the policy finds the best learning state. We can thus determine the number of epochs required to achieve good performance.
Generally, all sharing protocols perform well if we contrast the training process of Fig. 10 and Fig. 5. Nevertheless, ordered from best to worst, the sharing protocols are the fully-connected block lock, the both-blocks lock, and the convolutional block lock.
Observe that for the fully-connected block lock, ASCAD r desync0 has a convergent guessing entropy after 9 epochs, ASCAD r desync50 after 65 epochs, and ASCAD r desync100 took the whole training process (100 epochs); even so, it achieves good performance. The both-blocks lock cases seem to require more epochs, or the convergence is only roughly achieved (ASCAD r desync0 and ASCAD r desync50, for instance). Finally, we notice that the convolutional block lock converges after several more epochs than the previous locks. In this case, ASCAD r desync100 did not converge within 1,000 leakage traces. Fig. 11 summarizes the best guessing entropy from all combinations of locks.

Discussion
Using a shared classifier instead of a non-trained modular network, we have reduced the training time and the effort in tuning hyperparameters while evaluating the leakage of a dataset with good results. Since we locked some blocks or the whole classifier, we reduced the number of neurons to train; consequently, the training time is reduced, since the number of operations is lower than for a non-trained modular network. As we do not have to tune the hyperparameters of the classifier, we do not spend time on it. Further, we are confident that the classifier has a high probability of working since it already has previous "experience". We demonstrated the latter by actually achieving good results.
Clearly, some initial effort has to be made. For instance, we had to tune the latent space and loss intensity hyperparameters. Coming up with an initial deep learning modular network could be challenging, but the effort is equivalent to finding several small deep learning models for different datasets. Finally, bear in mind that by saying that a classifier has previous experience, we do not claim that it will work flawlessly. As we said, the experience of a shared classifier acts as an initializer for the neurons' weights. So instead of randomly initializing the weights using a well-known function (He uniform, for instance), we start from a state obtained from previous training. We have shown experimentally that this yields good results.

Conclusion
We introduced the DL-SCA modular network approach to conduct SCA evaluations by reusing modules from previously trained modular networks. A DL-SCA modular network consists of two main modules: a downsampler and a classifier. We demonstrate that modules from a modular network can be detached and attached to other modular networks and still conduct an efficient SCA evaluation. The strategy is to take a classifier with good performance and reuse it to conduct another evaluation on a different dataset.
Our experiments demonstrate that it is not mandatory to re-train a classifier module to effectively evaluate the aimed dataset, regardless of whether the source classifier has been trained with a dataset with a lower noise level.We systematically lock the layers of the classifier to restrict them from getting trained, replicating different sharing protocols to evaluate the effectiveness of our approach.
As we said in the paper, we aim to work with more sharing protocols and to improve the performance of our modular network in future work by using other types of deep learning architecture for the downsampler. Furthermore, we intend to apply methodologies that might help tune the hyperparameters of a modular network.

Fig. 2. DL-SCA modular network architecture illustration. The encoder, with its embedding layer, and the classifier together form the final model used to perform the attack.

Fig. 3. The truncated model shares weights with the DL-SCA modular network; while the latter is training, the former updates its weights. The early stopping framework uses the truncated model to compute the guessing entropy at the end of each epoch, and it stops the training when it meets the conditions.

Fig. 4. The training process of the modular network for ASCAD f datasets. The surface represents the values of the guessing entropy during a chosen number of epochs. Stopping condition for success: GE = 0.

Fig. 8. Comparison between heatmaps of the ASCAD f desync50 classifier and classifiers from all the ASCAD r datasets. Notice how the ASCAD f desync50 heatmap resembles all other heatmaps. It indicates that the convolutional layer of the ASCAD f desync50 classifier fires its neurons according to the data received.

Fig. 9. Comparison between gradient activation per sample of the ASCAD f desync50 classifier and the classifiers from ASCAD r all desync.

Fig. 10. The training results of the knowledge transferability experiments. The columns correspond to the desynchronization levels [0, 50, 100], while the rows correspond to the different block lock cases: ConvLock, BothLock, and FCLock.

Table 4. Summary of the similarity analysis between the ASCAD f desync50 classifier and the ASCAD r all desync classifiers.

Table 5. Combination of sharing protocols used for the ASCAD f desync50 classifier.