Optimizing Implementations of Non-Profiled Deep Learning-Based Side-Channel Attacks

The differential deep learning analysis proposed by Timon is the first non-profiled side-channel attack technique that uses deep learning. The technique recovers the secret key by exploiting differences in deep learning metrics across key guesses. However, the technique makes it difficult to observe results from the intermediate training process, and the neural network must be retrained repeatedly, so the cost of learning increases with the key size. In this paper, we propose three methods to solve these problems, as well as the challenges that arise from solving them. First, we propose a modified algorithm that allows the metrics to be monitored during the intermediate process. Second, we propose a parallel neural network architecture and algorithm that trains a single network, with no need to retrain the same model repeatedly. Attacks were performed faster with the proposed algorithm than with the previous one. Finally, we propose a novel architecture that uses shared layers to solve the memory problems of the parallel architecture while also achieving better performance. We validated the performance of our methods by presenting the results of non-profiled attacks on the benchmark database ASCAD and on a custom dataset of power consumption traces collected from a ChipWhisperer-Lite. On the ASCAD database, our shared layers method was up to 134 times more efficient than the previous method.


I. INTRODUCTION
Because attackers can physically access equipment, cryptographic algorithms embedded in that equipment must be secure not only against mathematical cryptanalysis but also against physical attacks. A side-channel attack is a physical attack that breaks a security system using side-channel information gained from cryptographic devices, such as operation time, sound, temperature, power consumption, or electromagnetic radiation. Cases of successful side-channel attacks on actual devices, such as mobile phones and transit cards, are rising [1], [2].
Side-channel attacks can be categorized into profiled and non-profiled attacks, depending on the attacker's environment. A profiled attack is a side-channel attack performed using a profiling device, which is identical (but not equal) to the attack target device with a fixed secret key; examples include the Template Attack [3] and the Stochastic Attack [4]. Attackers use profiling devices to characterize the leakage of the target device and then use the collected information to analyze the target device. Conversely, a non-profiled attack is a side-channel attack performed in an environment without profiling devices. Attackers collect side-channel information only from a target device and analyze secret keys using statistical techniques on the measurements captured from the target. These attacks recover the secret key using only the side-channel information collected from the target devices, without pre-calculated templates. Non-profiled attacks include Differential Power Analysis (DPA) [5] and Correlation Power Analysis (CPA) [6].
Recently, research on side-channel attacks combined with deep learning, which has demonstrated high performance in various fields, has been introduced [7]–[9]. Most studies have focused on profiling attack scenarios. Similar to conventional profiling attacks, an attacker trains a neural network using side-channel information acquired through a profiling device. Next, the secret key is recovered by classifying the side-channel information gained from the target device using the trained neural network. Studies have shown that masking and hiding countermeasures can be defeated without preprocessing when the attacker uses deep learning-based side-channel attacks [10]–[12]. However, profiled deep learning-based side-channel attacks have a limitation: the correct labels for the training data must be known to train the neural network. Therefore, this approach has been restricted to profiling attacks, where training data and labels can be obtained. Departing from research in the profiling environment, Timon proposed a side-channel analysis using deep learning in a non-profiling context, called differential deep learning analysis [13]. Timon's method uses deep learning metrics as a distinguisher instead of conventional metrics such as the correlation between side-channel information (e.g., power consumption or electromagnetic radiation) and hypothetical intermediate values. The attacker uses deep learning metrics, such as loss, accuracy, and gradient, as the distinguisher. The method takes advantage of the fact that a network labeled with intermediate values calculated from the right key trains faster than one labeled with values calculated from wrong keys, and the various training metrics can distinguish this.
To the best of our knowledge, Timon's work is the first to demonstrate that deep learning can be applied in non-profiling attacks. However, performing differential deep learning analysis requires training the same neural network as many times as the number of key guesses k. For example, if a single network needs to be trained for e epochs, the total training time is k × e epochs. This total number of epochs is expensive compared to profiled deep learning attacks. In addition, the cost of I/O operations is not included in the algorithm and cannot be ignored, because the adversary needs to initialize k identical neural networks to the same parameter values. Moreover, determining the total number of epochs needed to distinguish the right key before an attack is an enormous challenge. To monitor the metrics at every epoch, the computational cost, including I/O operations, must increase. More complex neural networks may be needed to succeed in a non-profiled deep learning attack than in a profiled attack, so higher time complexity is required. This implies that the cost of performing differential deep learning analysis is prohibitive. Differential deep learning analysis also differs from conventional non-profiled side-channel attacks such as CPA or DPA, which obtain fixed results with a single execution. Because random factors such as initial weights and hyperparameters affect neural network training, analysis results must be obtained through several repetitions and then analyzed. For example, the correlation coefficient between the measurements and hypothetical intermediate values is fixed unless the data change. However, even if the training data and their labels are the same, the training results vary from run to run owing to random factors, such as hyperparameters and initial weight values.
In addition, it is difficult to set hyperparameters in non-profiling environments, where the attacker cannot obtain the correct labels. It is therefore important to reduce the attack time, whether to evaluate more hyperparameters or to reduce probabilistic effects through repeated attacks. Thus, unlike in conventional attacks, the faster the attack, the better the results.
According to the previous algorithm, an attacker performing differential deep learning analysis sets an initial number of epochs and then performs the attack. However, a non-profiled deep learning attack can fail at the initially chosen value. If more training is needed, the attacker must start the training from the beginning. If the attacker modifies the algorithm and stores each model, they can load the stored training parameters and resume the attack from them; however, in this case, the added cost of storing as many models as the number of key guesses must be considered. Therefore, the attacker needs to monitor the results of each epoch to decide when to stop training. This allows training to finish with fewer epochs than the value set at the beginning, which is called early stopping. Early stopping [14] is widely used in deep learning, is clearly necessary, and has been applied in profiled deep learning side-channel attacks [12], [15], [16]. Non-profiled attacks are no exception. Prior research recognized these problems and suggested reducing complexity through monitoring, but failed to suggest a practical method to achieve this. In this paper, we propose novel methods that can monitor training metrics reflecting the intermediate training process and perform the attack faster than the previous approach.

A. OUR CONTRIBUTIONS
The contributions of this paper can be summarized as follows.
1) Modifying Timon's algorithm to apply an early stopping technique that can prevent overfitting
In Timon's algorithm [13], an attacker cannot observe metrics such as accuracy, loss, and gradient during the intermediate process of training. Therefore, it is difficult for the attacker to set an epoch count and check the result or terminate early; instead, they must wait for all the training to finish. In this paper, we modify the algorithm to allow an attacker to monitor every epoch. Owing to this modification, attackers can apply the early stopping technique, which reduces overfitting and time complexity. Furthermore, we show the problems that arise when the algorithm is changed, and propose new neural network architectures to solve them.
2) Introducing a novel parallel neural network architecture to improve speed
The previous method had to set up a neural network corresponding to each key guess and train the weights of each network separately. For example, when attacking the first subkey byte in the first AES round, the attacker has to train 256 neural networks, one for each guessed key value. Because the number of backpropagations required for a single neural network is (number of epochs) × (training set size) / (batch size), the number of backpropagations for this attack is 256 × (number of epochs) × (training set size) / (batch size). In general, when the size of the key-guess space is k, the total number of backpropagations is k × (number of epochs) × (training set size) / (batch size). In this paper, we propose a parallel strategy to reduce the time complexity of differential deep learning analysis. Our method keeps the training for each key guess separate while integrating it into a single network.

3) Proposing a novel method with shared layers to reduce memory and time costs
When an attacker modifies the algorithm so that the intermediate training process can be observed, the attack slows down. When an attacker uses a parallel architecture, the disadvantage is that memory complexity increases, because many weights are stored in memory at once. In this paper, we propose a new methodology that uses shared layers to reduce both memory and time complexity. In the parallel method, many training parameters are required because the networks for all the key guesses are composed independently. However, if the networks share layers or neurons, memory usage can be reduced. Our shared layers encode the data and decrease the number of weights in the networks by reducing the data dimensions during the encoding process, reducing memory complexity.
4) Implementing the proposed methods as well as verifying and comparing them with the previous method
We propose three methods to improve on the previous differential deep learning analysis. We expected our methods to recover the secret key and reduce time and memory costs, except for the method that only applies the early stopping modification.
To verify the strategies and compare them with the previous attack, we implemented the parallel method and the shared layers method. Through these implementations, we showed that the proposed methods can successfully perform side-channel attacks, just as the conventional method does. In our experiments, all the proposed methods were faster than Timon's work. In addition, the method with shared layers uses less memory than the previous method. According to our experiments on the ASCAD database, an attack using the shared layers technique runs up to 134 times faster than a conventional attack with a similar level of memory usage.

B. ORGANIZATION
We organized this paper as follows. Section II describes conventional deep learning-based side-channel attacks and the work most closely related to this study, differential deep learning analysis. In Section III, we present the proposed methods using novel neural network architectures. Then, we verify the proposed methods and compare them with previous work in Section IV. Finally, Section V includes concluding remarks and a discussion of future work.

II. PRELIMINARIES
In this section, we briefly review a previous attack, differential deep learning analysis, which is most closely related to our work. For a general introduction to deep learning, we refer interested readers to [17].

A. DEEP LEARNING-BASED PROFILED SIDE-CHANNEL ATTACKS
Deep learning has recently attracted much attention owing to developments in hardware and improvements in learning algorithms, which have significantly improved results in various fields, such as computer vision, natural language processing, and speech recognition. Deep learning has also been actively studied in the field of side-channel analysis, where it is used to classify side-channel measurements [18]–[21]. Research on classifying measurements using deep learning has been conducted primarily in a profiling environment, in which an attacker uses a profiling device. The attacker characterizes the leakage of the profiling device and then analyzes the measurements collected from a target device using the characterized information.
Maghrebi's results show that deep learning-based profiling attacks succeed regardless of whether masking countermeasures are applied [11]. Results by Cagli et al. show that if an attacker uses convolutional neural networks, the secret key can be recovered through deep learning-based side-channel attacks without preprocessing steps such as alignment [12]. Open datasets have been provided to compare and benchmark deep learning-based analysis performance [22]. Recently, deep learning-based attacks have been performed not only on block ciphers but also on public-key cryptosystems [23], [24].

B. DEEP LEARNING-BASED NON-PROFILED SIDE-CHANNEL ATTACK; DIFFERENTIAL DEEP LEARNING ANALYSIS
Differential deep learning analysis (DDLA) is the first deep learning-based side-channel analysis for non-profiled scenarios, where the attacker cannot obtain a template device [13]. Without a template device, it is not possible to obtain labels for the side-channel measurements, so the profiled deep learning-based side-channel attacks described in the previous subsection become difficult to apply. However, DDLA overcomes this limitation by providing a methodology for guessing labels. In correlation power analysis, a conventional side-channel analysis, the intermediate values guessed with the right key are the most correlated with the measurements. Similarly, in deep learning analysis, a neural network trained with the correct labels performs better than networks trained with incorrect labels. Various metrics describe the performance of deep learning, and in DDLA these metrics are used as a distinguisher. The attacker can distinguish the right key from the wrong keys using this distinguisher. Algorithm 1 describes DDLA for an AES implementation.

Algorithm 1 Differential Deep Learning Analysis [13] (Case: AES)
Input: N traces (t_i), 0 ≤ i < N, with corresponding plaintexts p_i; a neural network Net; number of epochs n_e; and substitution box Sbox(·).
Output: Metrics of training M.
The execution procedure of the algorithm is as follows. First, the attacker sets the collected traces as the training data X. Next, the following steps are repeated for each key guess. First, the trainable parameters of the neural network Net, such as the weights and biases, are initialized or re-initialized. Second, the intermediate values are estimated using the guessed key value k, and the hypothetical values H_i are calculated from the hypothetical intermediate values and a power consumption model. Third, the hypothetical power consumption values calculated above are set as the label Y_k of the training data for the guessed key. Finally, the training data X and label Y_k are used to train Net for n_e epochs. Once the deep learning training DL has been completed for all key guesses, the attacker can recover the right key, which is the one that leads to the best DL metrics M. Figure 1 shows the resulting training metrics, such as accuracy and loss, for each key guess. The loss and accuracy of training with the correct label differ significantly from the rest.
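The per-key-guess label construction in the second and third steps can be sketched in NumPy. This is a minimal sketch assuming MSB labeling of the SubBytes output as the power model (one of the labelings used in [13]); the function name hypothetical_labels is ours.

```python
import numpy as np

# Standard AES S-box (FIPS-197).
AES_SBOX = np.array([
    0x63,0x7C,0x77,0x7B,0xF2,0x6B,0x6F,0xC5,0x30,0x01,0x67,0x2B,0xFE,0xD7,0xAB,0x76,
    0xCA,0x82,0xC9,0x7D,0xFA,0x59,0x47,0xF0,0xAD,0xD4,0xA2,0xAF,0x9C,0xA4,0x72,0xC0,
    0xB7,0xFD,0x93,0x26,0x36,0x3F,0xF7,0xCC,0x34,0xA5,0xE5,0xF1,0x71,0xD8,0x31,0x15,
    0x04,0xC7,0x23,0xC3,0x18,0x96,0x05,0x9A,0x07,0x12,0x80,0xE2,0xEB,0x27,0xB2,0x75,
    0x09,0x83,0x2C,0x1A,0x1B,0x6E,0x5A,0xA0,0x52,0x3B,0xD6,0xB3,0x29,0xE3,0x2F,0x84,
    0x53,0xD1,0x00,0xED,0x20,0xFC,0xB1,0x5B,0x6A,0xCB,0xBE,0x39,0x4A,0x4C,0x58,0xCF,
    0xD0,0xEF,0xAA,0xFB,0x43,0x4D,0x33,0x85,0x45,0xF9,0x02,0x7F,0x50,0x3C,0x9F,0xA8,
    0x51,0xA3,0x40,0x8F,0x92,0x9D,0x38,0xF5,0xBC,0xB6,0xDA,0x21,0x10,0xFF,0xF3,0xD2,
    0xCD,0x0C,0x13,0xEC,0x5F,0x97,0x44,0x17,0xC4,0xA7,0x7E,0x3D,0x64,0x5D,0x19,0x73,
    0x60,0x81,0x4F,0xDC,0x22,0x2A,0x90,0x88,0x46,0xEE,0xB8,0x14,0xDE,0x5E,0x0B,0xDB,
    0xE0,0x32,0x3A,0x0A,0x49,0x06,0x24,0x5C,0xC2,0xD3,0xAC,0x62,0x91,0x95,0xE4,0x79,
    0xE7,0xC8,0x37,0x6D,0x8D,0xD5,0x4E,0xA9,0x6C,0x56,0xF4,0xEA,0x65,0x7A,0xAE,0x08,
    0xBA,0x78,0x25,0x2E,0x1C,0xA6,0xB4,0xC6,0xE8,0xDD,0x74,0x1F,0x4B,0xBD,0x8B,0x8A,
    0x70,0x3E,0xB5,0x66,0x48,0x03,0xF6,0x0E,0x61,0x35,0x57,0xB9,0x86,0xC1,0x1D,0x9E,
    0xE1,0xF8,0x98,0x11,0x69,0xD9,0x8E,0x94,0x9B,0x1E,0x87,0xE9,0xCE,0x55,0x28,0xDF,
    0x8C,0xA1,0x89,0x0D,0xBF,0xE6,0x42,0x68,0x41,0x99,0x2D,0x0F,0xB0,0x54,0xBB,0x16,
], dtype=np.uint8)

def hypothetical_labels(plaintexts, key_guess):
    """Y_k: MSB of the hypothetical SubBytes output Sbox(p XOR k)."""
    return (AES_SBOX[plaintexts ^ key_guess] >> 7).astype(np.uint8)

# Labels for every key guess can be stacked into a (256, N) array.
plaintexts = np.random.randint(0, 256, size=1000, dtype=np.uint8)
Y = np.stack([hypothetical_labels(plaintexts, k) for k in range(256)])
```

In Algorithm 1, these labels are generated one key guess at a time inside the loop; precomputing the whole (256, N) array at once is the variant used when the loop order is modified later in this paper.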
The first method, sensitivity analysis based on multilayer perceptron (MLP) first-layer weights, uses the weights of the first hidden layer of the MLP, as its name suggests. The first hidden layer has parameters, called weights, that connect the input layer to the hidden layer. The neural network updates the weights of the hidden layer to minimize loss through learning. In this process, the updates increase the absolute values of the weights connected to the feature points of the input, so as to transfer the features of the input to the next hidden layer. Therefore, when learning from labels generated with the correct key, the weights connected to the feature points of the input take significant values and are updated considerably. This implies that the gradients of these weights are large. However, when learning from labels generated with a wrong key, the neural network cannot find feature points that distinguish the label groups, or finds the wrong ones, so the weights connected to the feature points of the input are updated only slightly. Given these results, the attacker can recover the secret key using the gradient of the weights of the first hidden layer as the distinguisher. The attacker calculates the sensitivity S_weights for each key guess, and the key guess with the maximum S_weights value is recovered as the secret key.
The second method is sensitivity analysis based on network inputs, which uses a backpropagated value: the gradient propagated back to the network inputs. When attackers use a convolutional neural network to attack an implementation with hiding countermeasures, the first method becomes difficult to apply, because unlike in an MLP, there are no weights matched one-to-one with the input dimensions in a convolutional neural network. Therefore, when an attacker uses a CNN, the input-based sensitivity analysis technique should be applied. As in the first method, the input-based technique calculates the sensitivity S_inputs for each key guess, and the key guess with the maximum S_inputs value is recovered as the secret key.
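The selection rule shared by both sensitivity methods can be sketched as follows. Taking the per-guess peak of the sensitivity curve follows our reading of [13], and the array shapes here are illustrative, not the paper's.

```python
import numpy as np

def recover_key_by_sensitivity(sens):
    """sens[k]: accumulated absolute gradients (of first-layer weights or of
    the inputs) for key guess k, one value per input point.
    The guess whose sensitivity curve peaks highest is taken as the key."""
    peaks = sens.max(axis=1)          # peak sensitivity per key guess
    return int(np.argmax(peaks))

# Toy check: guess 0x3C has one pronounced sensitivity peak, as the right
# key would after training; the other guesses show only background noise.
rng = np.random.default_rng(0)
sens = rng.uniform(0.0, 0.1, size=(256, 700))
sens[0x3C, 350] = 1.0
print(recover_key_by_sensitivity(sens))   # -> 60 (0x3C)
```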
In DDLA, loss and accuracy are primarily used as the metrics M to recover the secret key. Training with the correct labels tends to reduce loss and increase accuracy as learning proceeds. However, when training with labels generated by wrong keys, the metrics do not change much or get worse, so the aforementioned phenomenon is not observed. Therefore, the wrong-key models are distinguishable from the right-key model. Figure 1 shows the loss and accuracy for each key guess. In this paper, we use the most common metric, accuracy.

III. PROPOSED METHODS
In this section, we propose several methods to overcome the limitations of the DDLA proposed in Timon's work. First, we introduce a modified algorithm in which the metrics of the intermediate process can be monitored, as well as the problems associated with this change. Second, we propose a parallel method to solve the time-complexity problem. We confirmed that the networks for each key guess are independent of each other, so each network can be run in parallel. Therefore, we propose a novel algorithm that operates in parallel by changing the neural network architecture: networks in network. Finally, we propose shared layers that solve both the speed and memory problems. The parallel technique is fast, but its memory complexity is increased because all the networks must be stored in memory at the same time. Since all the networks used in the parallel method are identical, we solve the memory problem by sharing the hidden layers connected to the input layer.

A. MODIFIED DIFFERENTIAL DEEP LEARNING ANALYSIS FOR MONITORING METRICS
In Algorithm 1, the process of training Net is repeated with the labels for each key guess. While the network is trained with the label computed from the guessed key k, the results or metrics for the other keys (larger than k) cannot be checked until all the epochs the attacker selected have been performed. In other words, even if k has the best result among the metrics for guesses 0 to k, the attacker cannot stop the attack, because the results for values higher than k are unknown. Therefore, in order to monitor the results of all key guesses during the intermediate training process, the key guesses should form the inner loop and the epochs the outer loop. When modifying the algorithm this way, it is not reasonable to create the labels anew in each epoch; it is more efficient to calculate and store the labels for all key guesses in advance. Note that additional memory space is required to store the labels for all key guesses.
When training for the n_e epochs that the attacker presets for all key guesses, it is not guaranteed that the key can always be selected from the metrics M obtained through the attack. There are several possible reasons; among them, if the number of training epochs is insufficient and the attack fails, the attacker must select a value larger than the previous number of epochs n_e and run the algorithm again. To handle this failure case, the trained networks for all key guesses should be stored, so that the attacker can reload them after a failed key recovery and perform additional training. This requires computational and memory resources for storing the trained networks. Algorithm 2 depicts the modified DDLA for observing metrics on a per-epoch basis: the labels Y_k = (H_{i,k}), 0 ≤ i < N, are precomputed for every key guess, the trainable parameters of all networks (Net_k), 0 ≤ k ≤ 255, are initialized once, and the training loop then iterates over epochs e = 1 to n_e in the outer loop and key guesses k = 0 to 255 in the inner loop, loading the trainable parameters of Net_k at each step. This change allows the network for each key guess to be stored. In addition, we added lines 15 to 17 to the algorithm to check after each epoch whether the early stopping condition on the metrics M is satisfied, so that the attacker can stop training early. With early stopping, if the attack is successful, the process terminates earlier than expected, preventing overfitting, where the network memorizes the training data instead of learning their features. Even if the attack fails, it can be determined early whether proper hyperparameters were chosen, so the algorithm can be stopped in the middle of the key recovery attack, reducing unnecessary cost. However, additional storage is required for this modification.
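The restructured loop, with epochs outside and key guesses inside plus a per-epoch early stopping check, can be sketched on a toy problem. This is an illustrative sketch, not the paper's implementation: the "networks" are single-feature logistic models in NumPy, the synthetic trace leaks only the labels of the true guess, and the stopping rule (an accuracy gap over the median) is our own choice.

```python
import numpy as np

rng = np.random.default_rng(1)
N_GUESSES, N_TRACES, N_EPOCHS, TRUE_K = 16, 500, 100, 7

# Precomputed labels Y_k for every guess (here random, except that the
# synthetic trace leaks the labels of the true guess).
labels = rng.integers(0, 2, size=(N_GUESSES, N_TRACES))
traces = labels[TRUE_K] + 0.3 * rng.normal(size=N_TRACES)

params = [(0.0, 0.0) for _ in range(N_GUESSES)]   # initialize every Net_k once

def train_one_epoch(w, b, x, y, lr=1.0):
    """One gradient step of a 1-feature logistic model; returns params + accuracy."""
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))        # sigmoid prediction
    acc = np.mean((p > 0.5) == y)
    w -= lr * np.mean((p - y) * x)                # gradient step on BCE loss
    b -= lr * np.mean(p - y)
    return w, b, acc

recovered = None
for epoch in range(N_EPOCHS):          # epochs form the OUTER loop, so metrics
    acc = np.zeros(N_GUESSES)          # are visible for every guess, every epoch
    for k in range(N_GUESSES):         # key guesses form the INNER loop
        w, b = params[k]               # load the stored parameters of Net_k
        w, b, acc[k] = train_one_epoch(w, b, traces, labels[k])
        params[k] = (w, b)             # store them again
    best = int(np.argmax(acc))
    if acc[best] - np.median(acc) > 0.25:   # early stopping condition
        recovered = best
        break
```

Because the accuracies of all guesses are available after every epoch, the loop can terminate as soon as one guess clearly dominates, instead of running the full epoch budget for every guess.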

B. PARALLEL METHODOLOGY FOR DIFFERENTIAL DEEP LEARNING ANALYSIS
In this subsection, we propose a new algorithm and a novel neural network architecture that improve the performance of DDLA. Using the networks-in-network methodology, we design the architecture to work in parallel and replace the previous algorithm so that the attack can be performed by training only a single network. In this paper, the small network within the neural network, denoted Net_i, is called the base architecture for key guess i. The base architecture is the neural network used in the previous work. The proposed neural network architecture replicates the base architecture once per key guess, sharing only the input layer and output layer. In addition, our architecture is locally connected, so the other parameters are not connected across branches. The proposed neural network architecture is illustrated in Figure 2.
Timon used MSB (most significant bit) or LSB (least significant bit) labeling, and we use the same labeling method. He used a dense output layer of two cells with the softmax function for one-hot encoding: if the output of the first cell is higher than that of the second, the MSB or LSB is predicted to be 0; otherwise, it is predicted to be 1. In contrast, we use a dense output layer with a single cell and the sigmoid function for multi-hot encoding: if the output of the single cell is lower than 0.5, the bit is predicted to be 0; otherwise, it is predicted to be 1. This reduces the dimensions of the output layer and the cost, because one-hot encoding is unnecessary when only the MSB or LSB is used. Thus, the base architecture uses the same hyperparameters as in Timon's paper, except for the output layer. Each output cell represents the predicted value for one key guess, as shown in Figure 2. Therefore, categorical loss functions are not available, and binary cross-entropy or the sum of squared errors should be used as the loss function. Table 1 compares the network used in Timon's work and the proposed network.
When using the proposed architecture, non-profiled attacks should be performed using Algorithm 3, not Algorithm 1 or 2. The main difference between Algorithm 3 and the other algorithms is that Algorithm 3 trains a single neural network rather than several. Up to line 6, the label setting and the initialization of the training parameters are performed as in Algorithm 2. After line 6, the algorithm trains the network, as in many other deep learning algorithms. However, the memory burden increases when the parallel method is used, because the networks for each key guess must be placed in memory all at once. For example, to recover a 1-byte AES subkey, the attacker has to load 256 networks into memory at once. The attacker can selectively choose which DDLA algorithm to use depending on the attack environment.
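The forward pass of the parallel architecture can be sketched with batched weight tensors: one einsum evaluates all locally connected branches at once, which is what lets a single training step update every key guess's sub-network. This is a NumPy sketch of the idea only; the dimensions are illustrative, and a Keras implementation would express the same structure with locally connected or reshaped dense layers.

```python
import numpy as np

K, D, H, B = 256, 700, 20, 32   # key guesses, trace length, hidden units, batch size
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.05, size=(K, D, H))   # first-layer weights of every branch
b1 = np.zeros((K, H))
W2 = rng.normal(scale=0.05, size=(K, H))      # single-cell sigmoid output per branch
b2 = np.zeros(K)

def forward(x):
    """x: (B, D) traces -> (B, K) predictions, one sigmoid output per key guess.
    The input layer is shared; each branch k applies only its own W1[k], W2[k]."""
    h = np.maximum(0.0, np.einsum('bd,kdh->bkh', x, W1) + b1)   # ReLU hidden layer
    z = np.einsum('bkh,kh->bk', h, W2) + b2
    return 1.0 / (1.0 + np.exp(-z))

preds = forward(rng.normal(size=(B, D)))   # preds[:, k] is compared against Y_k
```

One backpropagation through this combined network updates all K branches simultaneously, which is where the speedup over k separate trainings comes from.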

C. IMPROVED DIFFERENTIAL DEEP LEARNING ANALYSIS WITH SHARED LAYERS
DDLA with the parallel technique is faster than a conventional attack, but the memory burden increases with the size of the key-guess space. The memory burden comes from placing the weights of the networks for all key guesses in memory at once. The most memory-intensive part is storing the weights of the first hidden layer. No matter how small the hyperparameters of the base architecture are, the number of weights increases significantly if the trace is long. For example, when the length of the trace is 1,000 and the first hidden layer of the base architecture has 200 dimensions, 1000 × 200 × 256 training parameters are required for the weights.
We propose an architecture that encodes measurements using shared layers to reduce the dimensions of the measurements before transferring them to each base architecture. A similar approach, which encodes side-channel measurements using deep learning, was proposed by Robyns [25]. In [26], the authors also separated a neural network into parts for Points of Interest (PoI) detection and plaintext feature embedding; the part of their network that detects PoIs is similar to our shared layers. Considering that profiled deep learning-based side-channel attacks perform preprocessing and classification through the network (an end-to-end attack), it is reasonable to have the network perform the preprocessing. Therefore, we designed the network to preprocess the traces using the shared layers and deliver the preprocessed traces to each base architecture, so that all key guesses can be handled through one network, as in the proposed parallel method. Because only the neural network architecture is switched and the input and output are the same, the attack can use the same algorithm (Algorithm 3) as the parallel method. Figure 3 shows the neural network architecture with the proposed shared layers.
Comparing Figure 3 with Figure 2, it can be seen that as the number of shared layers increases, larger parts of the base architecture are shared, and the number of training parameters decreases. This approach clearly reduces the memory burden. Moreover, as the shared portion increases, fewer weights need to be updated, and the training speed increases. Therefore, if the analysis performance remains the same, it is advantageous to design the architecture to share as much as possible.
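The savings can be made concrete with the example from the previous subsection (a 1,000-point trace and a 200-unit first hidden layer). The shared-encoder width of 50 below is an illustrative choice, not a value from the paper:

```python
D, H, K = 1000, 200, 256   # trace length, hidden width, number of key guesses
E = 50                     # output width of the shared encoder (illustrative)

parallel_first_layer = D * H * K          # every branch consumes the raw trace
shared_first_layer = D * E + E * H * K    # one shared encoder feeds all branches

print(parallel_first_layer)   # 51200000
print(shared_first_layer)     # 2610000 -> roughly 20x fewer weights
```

The dominant D × H term is paid once instead of K times, which is why the first-layer weight count, the most memory-intensive part, shrinks the most.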
However, because a neural network is a black-box model, it is challenging to determine the extent to which the shared layers affect performance. In other words, it is difficult to confirm in advance whether sharing many weights will improve or worsen the attack, and the performance change according to the shared portion has not been verified. We show that a non-profiled deep learning attack remains possible, and we experimentally validate the sharing technique in the following section. To verify the performance of the shared layers method, we used the architecture that maximizes the shared portion in all of our experiments; that is, we used the architecture most different from the parallel method.

IV. EXPERIMENTAL RESULTS
In this section, we validate the proposed methods introduced in Section III. We do not show experimental results for the method proposed in Section III-A, because it only changes the order of operations relative to Timon's work. In our experiments, we used the same hyperparameters as in Timon's work [13] as much as possible, apart from adding a batch normalization layer [27], which makes training faster and reduces internal covariate shift by normalizing each training mini-batch, and fine-tuning only the initial learning rate. We do not claim that Timon's hyperparameters are optimal for our proposals. In this paper, we focus only on the success or failure of an attack and its complexity.

A. ATTACK ON ChipWhisperer-LITE
To verify the performance of the proposed methods, we used side-channel measurements collected with a ChipWhisperer-Lite [28]. We programmed an unprotected AES-128 first-round SubBytes implementation on the ChipWhisperer-Lite evaluation board, which has an Atmel XMEGA128 8-bit processor with a fixed clock frequency of 7.37 MHz. We captured power traces sampled at four points per clock cycle. The dataset includes 10,000 power consumption traces of 800 samples each, covering the AES SubBytes operation with random plaintexts and a fixed key. The experimental results obtained by applying the proposed methods to this dataset are shown in Figure 4. We target a 1-byte subkey; the red line represents the accuracy for the right key, and the gray lines are the results for the wrong keys. As shown in Figure 4, the difference between the two is clear. These results confirm that attackers can successfully recover the secret key.

B. ATTACK ON DE-SYNCHRONIZED TRACES
Timon showed that if the attacker performs DDLA with convolutional neural networks (CNNs), called CNN-DDLA, implementations with hiding countermeasures can be attacked without preprocessing, owing to the translation-invariance property of the CNN architecture [13]. In this paper, we show that the proposed methods can also be applied to CNN-DDLA. We construct the base architecture with convolutional layers and attack de-synchronized traces.
The traces were created by arbitrarily shifting the measurements captured from the ChipWhisperer-Lite in the first experiment. We set the maximum shift to 20 points, so each shifted trace retains 780 of the original 800 points. The shifted traces were derived from the ChipWhisperer-Lite dataset without increasing the number of traces. CNN_exp was used as the base architecture [13]. The experimental results obtained using CNN-DDLA with the proposed methods are shown in Figure 5. As in the previous results, the red line represents the right key, and the gray lines indicate the wrong keys. As can be seen in Figure 5, the difference between the two types of lines is clear. This shows that attackers can successfully recover the secret key even when a hiding countermeasure is applied.

C. ATTACK ON ASCAD
The third dataset to be analyzed is ASCAD [22], which was released as a benchmark for side-channel analysis. The data were collected at a 2 GS/s sampling rate from an implementation of AES with a Boolean masking countermeasure running on an Atmel ATmega8515, for a total of 60,000 electromagnetic measurements. In our experiments, we used 20,000 electromagnetic traces of 700 points from the ASCAD database, captured during the AES SubBytes operation on the third byte. Figure 6 shows the experimental results of applying the proposed methods to this open dataset. The red line is the accuracy for the right key, and the gray lines are the results for the wrong keys. As can be seen in Figure 6, the difference between the two is clear. This indicates that attackers can successfully recover the secret key even when a masking countermeasure is applied.
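Across all three experiments, the key is read off the metric curves in the same way: the guess whose accuracy curve separates from the cluster of wrong guesses is taken as the right key. A minimal sketch of this selection criterion; the toy curves are fabricated for illustration only:

```python
import numpy as np

def rank_key_guesses(acc_curves):
    """Rank key guesses by their best training accuracy over all epochs.

    acc_curves: array of shape (n_guesses, n_epochs), one accuracy
    curve per key guess, as produced by a DDLA-style attack.
    """
    best = acc_curves.max(axis=1)          # peak accuracy per guess
    return np.argsort(best)[::-1]          # most likely guess first

# toy curves: guess 3 separates clearly, the others hover near chance
curves = np.full((5, 10), 0.5)
curves[3] = np.linspace(0.5, 0.9, 10)
order = rank_key_guesses(curves)
# order[0] == 3
```

As the conclusion notes, peak accuracy is a heuristic rather than a principled criterion; overfitting can push a wrong guess above the right one.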

D. COMPARISON OF TIME AND MEMORY COMPLEXITY
In Sections IV-A, IV-B, and IV-C, we validated the performance of the proposed methods on measurements from different scenarios, and the methods successfully recovered the right key in every case. In this subsection, we verify that the proposed methods improve the speed of the attack. In our experiments, we recovered one AES subkey; that is, 8-bit key guessing is performed, so the number of key guesses is 256. Table 1 summarizes the architectures used in our experiments. The conventional DDLA attacks were all performed with the same architecture, MLP_exp, and all of the proposed methods used MLP_exp as the base architecture. Table 2 compares the execution time and memory usage of the previous DDLA method and the proposed methods. All experiments were performed with the TensorFlow (version 1.14.0) [29] and Keras (version 2.2.4-tf) [30] libraries in Python on a personal computer with a single NVIDIA GeForce GTX 1080Ti GPU, an Intel(R) Core(TM) i7-8700K CPU, and 48 GB of RAM. For a fair comparison, all methods were run for exactly 50 epochs without early stopping, and memory usage was measured over those 50 epochs. As expected, all the proposed methods were faster than the conventional DDLA attack. Notably, the shared layers method is significantly faster than the other attack methods: up to 140 times faster than the previous method on ChipWhisperer-Lite and 134 times faster on the ASCAD database. Memory usage is largest for the parallel method; the other two methods use less memory and are similar to each other.
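The memory gap between the parallel and shared layers variants comes down to parameter counts: the parallel design replicates a full network per key guess, while the shared design replicates only the per-guess heads on top of one common trunk. A back-of-the-envelope sketch; the layer sizes (700 input points, hidden widths 20 and 10, one shared first layer) are illustrative assumptions, not the exact MLP_exp dimensions:

```python
def mlp_params(layer_sizes):
    # Dense-layer parameter count: weights (in*out) plus biases (out).
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

base = [700, 20, 10, 2]                    # input -> hidden -> hidden -> 2 classes
independent = 256 * mlp_params(base)       # parallel: 256 full copies
shared_trunk = mlp_params([700, 20])       # one shared first layer
heads = 256 * mlp_params([20, 10, 2])      # per-guess heads only
shared_total = shared_trunk + heads
```

With these sizes, the 256 independent networks hold 3,648,512 parameters while the shared design holds 73,412, roughly a 50-fold reduction; the bulk of the parameters sits in the first layer, which is exactly the part the shared design stops duplicating.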
These experiments show that the proposed methods speed up the attack, and that the shared layers method is best in terms of both speed and memory usage.

V. CONCLUSION
In this paper, we proposed three methods to efficiently perform a non-profiled side-channel attack based on deep learning. First, we modified DDLA to allow attackers to observe training metrics for each key guess at each epoch, so that they can monitor the model and apply early stopping techniques where appropriate. Second, we designed neural network architectures that train all key guesses in parallel, so that attackers can perform all training at once and reduce the attack time. Finally, we reduced the number of training parameters by integrating parts of the parallel networks through shared layers, thereby reducing not only memory usage but also time complexity. With the reduced time and memory costs, an attacker can apply DDLA with more epochs and a wider range of hyperparameters, which will, in turn, improve the success rate of deep learning attacks.
Timon's research showed that a non-profiled context in side-channel analysis is not the same as an unsupervised learning context in machine learning, proving that deep learning-based side-channel attacks can be applied in non-profiled attack scenarios. However, learning without correct labels leads to several problems. The first is that searching for network hyperparameters is hard: it is difficult to tell whether a failure is caused by poorly chosen hyperparameters or by insufficient training, which makes the cause of failures hard to determine. Moreover, with poorly chosen hyperparameters, a wrong key may reach higher accuracy than the correct key due to overfitting, and the attack fails. The second problem is that DDLA has no clear criterion for choosing a key; we selected the guessed key with the highest accuracy as the right key. Owing to the first problem, it cannot be guaranteed that training with the right key always yields the best result, and for an attacker who does not know the correct answer, it is difficult to select the right key when the training metrics of several keys are similar. Further research will be undertaken to address these problems.

APPENDIX OVERFITTING IN DIFFERENTIAL DEEP LEARNING ANALYSIS
In profiled deep learning analysis, overfitting is one of the main causes of failure, and non-profiled attacks are no exception. In profiled attacks, validation sets, held out from the training data, are used to detect overfitting. In a non-profiled setting, attackers cannot know the real key for the training data, so they predict labels with guessed keys and train on all collected measurements without a validation set. At first glance, separating part of the training data into a validation set seems unnecessary, because the attacker does not aim to maximize the accuracy of a model but only to distinguish one model from the others. However, overfitting can also occur in DDLA, and our experimental results show that high accuracy on incorrectly labeled data caused by overfitting can be a major cause of attack failure. As shown in Figure 7, the right key (red line) was well distinguished at the beginning of training, but as training progressed, overfitting made it difficult to distinguish from the wrong keys.
As in the case we presented, overfitting can produce high accuracy not only for the right key but also for the wrong keys, leaving an attacker who does not monitor the intermediate process unsure which key to choose. We therefore recommend using part of the training data as a validation set, as in profiled attacks. If the features extracted from the training data also yield meaningful results on the validation data, the attacker can distinguish the right key even when overfitting occurs on the training data, as shown in Figure 7. K-fold cross-validation is a powerful method for detecting overfitting [31], and applying early stopping techniques is also useful. These methods do not fully solve the overfitting problem, but they help mitigate it.
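The recommended hold-out can be made once, before any key guess is trained, splitting the traces together with whatever labels the current guess assigns them. A minimal sketch; the 20% fraction is an illustrative choice:

```python
import numpy as np

def train_val_split(traces, labels, val_fraction=0.2, seed=0):
    """Hold out a random fraction of attack traces as a validation set.

    Returns ((train_traces, train_labels), (val_traces, val_labels));
    the same index permutation is applied to traces and labels so
    each trace keeps its (guessed-key) label.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(traces))
    n_val = int(len(traces) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    return (traces[train], labels[train]), (traces[val], labels[val])

traces = np.arange(100.0).reshape(100, 1)   # toy 100-trace dataset
labels = np.arange(100) % 2
(tr_x, tr_y), (va_x, va_y) = train_val_split(traces, labels)
# 80 training traces and 20 validation traces, with disjoint indices
```

Validation accuracy computed on the held-out traces is then monitored per key guess; only features that transfer to unseen traces support a guess, which is what lets the right key stand out despite training-set overfitting.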