A Combined Single Trace Attack on Global Shuffling Long Integer Multiplication and its Novel Countermeasure

Advanced collision-based single trace attacks, which can be applied to simple power analysis resistant scalar multiplications, have recently become a practical threat to elliptic curve cryptosystems as experimental results are increasingly reported in the literature. Since such attacks are based on detecting collisions of data-dependent leakage caused by the underlying long integer multiplications, the so-called global shuffling countermeasure, which breaks such collision correlation by independently randomizing the execution order of unit operations such as single precision multiplication and carry propagation, is considered a promising countermeasure provided that the theoretical randomness of the shuffling order is guaranteed. In this paper, we first analyze the practical security of global shuffling long integer multiplication by exhibiting a combined single trace attack on software implementations on an ARM Cortex-M4 microcontroller. Our combined attack consists of a simple power analysis that reveals the random permutation vectors, which in turn enables a collision-based single trace attack. First, we demonstrate how to reveal the random permutation vectors for the carry propagation process of all global shuffling long integer multiplications within a single power trace by simple power analysis accompanied by a straightforward substitution of power consumption samples. Then we perform collision-based single trace attacks after rearranging the order of the subtraces of unit carry propagations based on the revealed permutation vectors. Since the vulnerability to simple power analysis originates from the if-statement that selects the proper entries of the permutation vectors, we propose a novel countermeasure that eliminates this selection using a simple addition and modulus operation, and we demonstrate a practical result achieving regularity in the power trace patterns.


I. INTRODUCTION
Elliptic curve cryptosystems (ECC) [1], [2] have been widely used because they provide equivalent security with shorter key lengths compared with other public key cryptosystems (PKC) such as RSA, DSA, and DH. Furthermore, with the shorter keys, scalar multiplication, which is the main operation of ECC, features shorter execution time and lower memory requirements than RSA modular exponentiation. Hence ECC is preferably deployed for PKC on secure embedded devices.

The associate editor coordinating the review of this manuscript and approving it for publication was Xiangxue Li.
On the other hand, side-channel analysis attacks [3], [4], which can reveal the secret from implementations of ECC by exploiting their timing, power consumption, electromagnetic emanation, etc., have been researched consistently. Since ECC protocols such as ECDSA or ECDH use an ephemeral secret, resistance against Simple Power Analysis (SPA) or Timing Attacks [3], which can be performed with a single power trace, is essential for secure embedded devices. The core idea of such attacks is to distinguish the doublings and additions of a scalar multiplication such as the double-and-add algorithm [5]. Hence countermeasures which perform regular scalar multiplication, such as double-and-add-always [6], the Montgomery ladder [7], [8], and atomic scalar multiplication [9] deploying unified point addition formulae, have been proposed.
Nevertheless, advanced single trace attacks which can defeat such countermeasures have also been proposed. Walter [10] proposed the Big Mac attack, which can distinguish doubling and addition operations in a non-regular scalar multiplication from a single trace using the Euclidean distance. Inspired by Walter's work, Clavier et al. [11] proposed Horizontal Correlation Analysis (HCA), which can be performed on regular scalar multiplications by exploiting the correlation between intermediate data and power consumption in a single trace. Bauer et al. [12] introduced power trace averaging techniques which can defeat the countermeasures proposed by Clavier et al. [11] against their own attack. Recovery of Secret Exponent by Triangular Trace Analysis (ROSETTA) [13] can attack scalar multiplications by determining inner collisions of a long integer multiplication (LIM) [14] caused by single precision operations with the same inputs. The Horizontal Collision Correlation Attack (HCCA) [15] takes advantage of both the Big Mac attack and HCA by exploiting collisions originating from the same operand being input to two LIMs. Hanley et al. [16] improved HCCA by detecting collisions between the input and output operands of two LIMs. These attacks, except HCA and the attack of Bauer et al. [12], can be categorized as collision-based single trace attacks.
On the other hand, practical results of such collision-based single trace attacks were presented only later, since the original papers showed only simulated results, with the exception of the work of Hanley et al. [16], which exhibited experimental results targeting 192-bit implementations of scalar multiplication on a 32-bit microcontroller and an FPGA. Thereafter, practical results of an improved HCCA exploiting collisions of multiple LIMs in scalar multiplication, targeting a 384-bit implementation on a 64-bit architecture, were published by Danger et al. [17]. Practical experimental results of HCCA and ROSETTA on specific elliptic curves are presented in the works of Das et al. [18] and Cho et al. [19], targeting 192-bit and 256-bit implementations, respectively.
Countermeasures against advanced single trace attacks have also been proposed, mainly for securing LIM operations. Clavier et al. [11] first proposed randomizing the two loops of single precision multiplications in LIM operations to eliminate the collision characteristics caused by identical manipulation of the same input. Bauer et al. [12] enhanced this countermeasure and proposed the global shuffling LIM, which uses a single combined random permutation for randomizing the two loops of single precision multiplications and separate random permutations for the carry processing operations. Furthermore, in more recent work [15], this countermeasure is referenced as a possible countermeasure against HCCA. However, its practical effectiveness has not been explored in the literature.
Our contribution is threefold. First, we present the first practical results of a combined single trace attack, which combines a simple power analysis for revealing permutation vectors with a collision-based single trace attack on rearranged subtraces, on software implementations of the global shuffling LIM, which is known to be secure against advanced collision-based single trace attacks. The implementations run on an ARM Cortex-M4 based STM32F405 microcontroller [20] and target 128, 192, and 256-bit ECC primitives. We analyze the vulnerability of the algorithm's carry propagation process to SPA, even though the process is theoretically intended to impose on an adversary a complexity of $(2l-1)!$ for guessing the random permutation, where $l$ is the word length of the input operands of the LIM, and we demonstrate the practical recovery of whole permutation vectors from a single trace by SPA accompanied by a straightforward substitution of power consumption samples. We then successfully mount collision-based single trace attacks on the power consumption subtraces of unit carry propagation operations after rearranging their processing order on the basis of the revealed permutation vectors.

Secondly, we provide three attack scenarios in which this vulnerability of the global shuffling algorithm is exploited to recover the secret scalar. Since our proposed attack targets the carry propagation process, in which only the results of single precision multiplications are manipulated, exploitable collisions in the operands of LIMs are limited to the case where both operands are the same. Nevertheless, the vulnerability of the global shuffling LIM can still lead to the recovery of the secret scalar in three cases where particular unified point additions are deployed.

Finally, we propose a novel countermeasure against our proposed attack. The vulnerability of the global shuffling LIM is caused by the if-statement that selects the proper entries from the permutation vector. Our proposed countermeasure eliminates this selection with a permutation vector rearrangement method using a simple addition and modulus operation, and consequently achieves regularity in the power trace patterns, providing resistance against SPA. A practical result demonstrating the regularity of the power consumption trace acquired from an implementation of our countermeasure is also presented.

This paper is organized as follows. In Section II, we introduce SPA-resistant scalar multiplications and the advanced collision-based single trace attacks defeating such countermeasures. We also introduce the global shuffling LIM proposed in [12], which is known to defeat advanced collision-based single trace attacks when it is deployed in such scalar multiplications. In Section III, we analyze the vulnerability of the global shuffling LIM and present attack scenarios for three elliptic curve cases in which recovery of the secret scalar is possible when unified point additions in projective coordinates are deployed. Then we present practical results of our combined single trace attack, which consists of revealing the random permutation vectors by SPA, rearranging the subtraces based on the revealed permutation vectors, and detecting input collisions of global shuffling LIMs by a collision-based single trace attack. In Section IV, we propose a countermeasure eliminating the SPA leakage and present a practical result of its implementation. Finally, we conclude this paper in Section V.

II. PRELIMINARIES
Since ECC protocols such as ECDSA or ECDH use an ephemeral secret, differential power analysis (DPA) [4], which requires several or many side-channel leakage traces (for example, power consumption traces) manipulating the same secret, is not possible. As a result, attacks exploiting a single power trace have received considerable attention. SPA and timing attacks are simple and powerful attacks which can reveal the secret from a single power trace; hence SPA resistance has become essential for secure implementations on embedded devices. Thereafter, more advanced single trace attacks defeating SPA-resistant implementations have also been proposed. In this section, we describe the basic idea of SPA on some ECC protocols and the side-channel atomic scalar multiplication with unified point addition, which can defeat SPA. We then illustrate some advanced collision-based single trace attacks, which can be performed in the presence of additional countermeasures against DPA-like (advanced) single trace attacks such as HCA, and the global shuffling LIM countermeasure.

A. SIMPLE POWER ANALYSIS ON ELLIPTIC CURVE CRYPTOGRAPHY AND UNIFIED POINT ADDITION
The most sensitive operation in the ECDSA or ECDH protocols is scalar multiplication, since a secret scalar $k$ is directly manipulated during the calculation of $kP$, where $P$ is a point on an elliptic curve. The simplest algorithm for scalar multiplication is the left-to-right binary or double-and-add [5] method. Assume an $n$-bit scalar $k$ is represented as $k = (k_{n-1}, \ldots, k_0)_2$ and a register $R$ holding the result of the algorithm is initialized to the point at infinity of the elliptic curve. The algorithm scans $k$ from the most significant bit to the least significant bit; for each bit it performs a point doubling on $R$, followed by a point addition $R + P$ if the scanned bit is one.
Since the double-and-add algorithm executes different operation sequences depending on the secret bit, an adversary can reveal the key bit if it can determine the difference by inspecting the side-channel leakage of an implementation, such as timing or patterns in the power trace [3]. Hence so-called regular algorithms, that is, double-and-add-always [6], the Montgomery ladder [7], [8], and side-channel atomic scalar multiplication [9], have been proposed. The double-and-add-always and Montgomery ladder algorithms execute an identical sequence regardless of the key bit value, that is, one doubling and one addition per key bit. On the other hand, the operation sequence of side-channel atomic scalar multiplication, as described in Alg. 1, is similar to that of the double-and-add algorithm. However, doubling operations are replaced by point additions of $R_0$ with itself, and the process of scanning the key bit is replaced by Step 5, which is regular regardless of the key bit value, whereas double-and-add uses an if-statement for this process.
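To make the leakage concrete, the following is a minimal sketch that models only the sequence of doublings (`D`) and additions (`A`), not actual curve arithmetic. The function names are hypothetical and introduced here for illustration; the sketch assumes an adversary can perfectly label each operation in the trace.

```python
def double_and_add_ops(k: int, n: int) -> str:
    """Operation sequence of left-to-right double-and-add for an n-bit scalar k."""
    ops = []
    for i in reversed(range(n)):      # scan from most to least significant bit
        ops.append("D")               # a doubling is performed for every bit
        if (k >> i) & 1:
            ops.append("A")           # an addition only when the bit is one
    return "".join(ops)


def double_and_add_always_ops(k: int, n: int) -> str:
    """Regular variant: one doubling and one addition per bit, independent of k."""
    return "DA" * n


def recover_scalar(ops: str) -> int:
    """Read the scalar back from a key-dependent D/A sequence."""
    k, i = 0, 0
    while i < len(ops):
        if ops[i] != "D":
            raise ValueError("malformed operation sequence")
        bit = 1 if i + 1 < len(ops) and ops[i + 1] == "A" else 0
        k = (k << 1) | bit
        i += 2 if bit else 1
    return k
```

For the regular variant the sequence is the same for every scalar, so `recover_scalar` would return the all-ones value regardless of the key and carries no information.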
To properly implement side-channel atomic scalar multiplication, unified point addition, in which both the addition and the doubling operation are calculated with the same formula, is deployed so that the two operations are indistinguishable. Since Brier and Joye first proposed it for Weierstrass curves in [21], unified point additions on various curves and point representations have been studied [21]-[23].

Algorithm 1 Side-Channel Atomic Scalar Multiplication
Input: $P$, $k = (k_{n-1}, \ldots, k_0)_2$
Output: $kP$

Algorithm 2 Long Integer Multiplication [14]
Input:
8: $r_{a+b} \leftarrow v$ and $c \leftarrow u$
9: end for
10: $r_{a+l} \leftarrow c$
11: end for
12: Return $r$
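For orientation, a schoolbook long integer multiplication can be sketched as below. This is an assumed textbook form, not a transcription of Alg. 2: it is written so that the single precision multiplication $x_a \times y_b$ sits in the inner loop and the high and low words of the result play the roles of $u$ and $v$. The helper names `to_words` and `from_words` and the 32-bit word size are illustrative assumptions.

```python
def to_words(value, l, word_bits=32):
    """Split an integer into l little-endian words."""
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * i)) & mask for i in range(l)]


def from_words(words, word_bits=32):
    """Recombine little-endian words into an integer."""
    return sum(w << (word_bits * i) for i, w in enumerate(words))


def long_integer_multiplication(x, y, word_bits=32):
    """Schoolbook product of two l-word operands, returned as 2l words."""
    l = len(x)                         # both operands are assumed to have l words
    mask = (1 << word_bits) - 1
    r = [0] * (2 * l)
    for a in range(l):
        c = 0                          # carry word of the current row
        for b in range(l):
            uv = x[a] * y[b] + r[a + b] + c   # single precision multiply-accumulate
            r[a + b] = uv & mask              # v: low word is stored
            c = uv >> word_bits               # u: high word becomes the carry
        r[a + l] = c
    return r
```

The unit operations that the attacks in this paper correlate are the individual inner-loop iterations and, for the shuffled variant, the carry handling.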

B. ADVANCED COLLISION-BASED SINGLE TRACE ATTACKS AND COUNTERMEASURES
However, advanced collision-based single trace attacks [13], [15], [16] can be performed even when such regular scalar multiplications are implemented. These attacks are based on detecting collision characteristics that originate from identical unit operations in the field multiplications of regular scalar multiplications, and on the key-dependent existence of such characteristics. We describe the collision characteristics of advanced collision-based attacks in the following, assuming that a field multiplication is composed of a long integer multiplication, as shown in Alg. 2, followed by a modular reduction.
ROSETTA [13] targets the detection of inner collisions in field squaring operations. Assume an adversary can extract side-channel traces of single precision multiplications, that is, $x_a \times y_b$ in Step 7 of Alg. 2. If a field squaring is performed, half of the $(l^2 - l)$ traces have a collision correlation, since $x_a \times x_b = x_b \times x_a$ for all $a \neq b$. In the opposite case, for a field multiplication, the collision characteristic does not exist, since in general $x_a \times y_b \neq x_b \times y_a$ for all $a \neq b$.
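The inner-collision property can be checked on operand values directly. The sketch below is only an illustration on data, with no leakage model; `colliding_pairs` is a hypothetical helper that lists the index pairs $(a, b)$ with $a < b$ whose two single precision products coincide.

```python
def colliding_pairs(x, y):
    """Index pairs (a, b), a < b, for which x[a]*y[b] equals x[b]*y[a]."""
    l = len(x)
    return [(a, b)
            for a in range(l)
            for b in range(a + 1, l)
            if x[a] * y[b] == x[b] * y[a]]


x = [3, 5, 7, 11]
y = [13, 17, 19, 23]

squaring = colliding_pairs(x, x)        # every off-diagonal pair collides
multiplication = colliding_pairs(x, y)  # no pair collides for these operands
```

For a squaring with $l = 4$ words this yields $(l^2 - l)/2 = 6$ colliding pairs, while the multiplication with the distinct operands above yields none.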
HCCA [15] detects collisions in at least one input operand of two (or more) field multiplications. Consider the two LIMs of two field multiplications, LIM$(x, y)$ and LIM$(x, z)$. If an adversary can collect two groups of side-channel traces of single precision multiplications, for $x_a \times y_b$ and $x_a \times z_b$ respectively, and if the correlation between the two groups is higher than in cases where none of the inputs have the same value, the collision characteristic is exploitable.
The attack of Hanley et al. [16] exploits collisions between the input and output of field multiplications. Consider two field multiplications denoted by MUL$(x, y) = z$ and MUL$(z, w)$. If the correlation in this case is higher than in non-collision cases, the attack is possible. In contrast to the two former attacks, this attack must account for the different placement of the side-channel leakage in the time domain, so the authors deploy a cross-correlation technique to find the points of interest where the input-output collision correlation exists.
Countermeasures against those attacks have already been proposed. Clavier et al. [11] first proposed the idea of separately randomizing the two loops of Alg. 2, that is, over $a$ and $b$, to suppress the collision correlations caused by sequential (or predetermined) manipulation of the input operands in LIMs. After that, Bauer et al. [12] exhibited an attack weakening the effect of the separate randomization and proposed the global shuffling LIM, described in Alg. 3. In [15], this countermeasure is also considered to be effective against HCCA, and to the best of our knowledge no single trace attack defeating this countermeasure has been proposed in the literature. We further analyze the global shuffling LIM in the next section.

III. COMBINED SINGLE TRACE ATTACK ON GLOBAL SHUFFLING LONG INTEGER MULTIPLICATION
In this section, we present a combined single trace attack which can be applied to side-channel atomic scalar multiplications with unified point addition deploying the global shuffling LIM, which is known to be secure against advanced collision-based single trace attacks. First, we analyze a vulnerability of the global shuffling LIM proposed in [12] and present attack scenarios which can lead to the recovery of the secret scalar in three cases where this vulnerability can be exploited by an adversary, that is, unified addition formulas in projective coordinates for short Weierstrass, Jacobi quartic, and Jacobi intersection curves. Next, we demonstrate by practical experiments that the vulnerability really exists, showing distinct patterns in the power consumption traces depending on the condition of an if-statement. Consequently, since whole permutation vectors can be revealed by simple power analysis with a straightforward substitution of subtraces, collision-based single trace attacks can be mounted on the originally shuffled carry propagation operations after rearranging the subtraces based on the revealed vectors.

A. EXPLOITABLE VULNERABILITY ANALYSIS OF GLOBAL SHUFFLING LONG INTEGER MULTIPLICATION
The global shuffling long integer multiplication [12] is an enhanced version of the two loops randomization LIM proposed by Clavier et al. [11]. The core idea of both countermeasures is to impose an $(l!)^2$ complexity on an adversary performing advanced single trace attacks that exploit the single precision multiplications in a LIM, by randomizing their execution order. To this end, the two loops randomization LIM deploys two random permutations of $[0, l-1]$ to randomize the index variables $a$ and $b$ in Step 7 of Alg. 2 separately. However, Bauer et al. [12] demonstrate a technique that nullifies the effect of one of the random permutations by averaging $l$ subtraces. Such an attack is possible since an adversary knows that, while the loop of Step 6-9 is executed, there are $l$ single precision operations during which the index variable $a$ is fixed to some unknown value. Consequently, Bauer et al. [12] proposed randomizing both indices simultaneously for the single precision multiplications $x_a \times y_b$.
On the other hand, as a result of changing the loop structure for the single precision multiplications $x_a \times y_b$, the carry propagation process needs to be changed accordingly. In the global shuffling LIM algorithm, as described in Alg. 3, the carry propagations are processed independently, as shown in Step 12-21, and another random permutation vector of $[1, 2l-1]$ is deployed to prevent advanced single trace attacks exploiting the correlations caused by collisions of the input operands of LIMs. Thus, theoretically, an adversary attempting advanced single trace attacks on the carry propagation process must guess the permutation out of $(2l-1)!$ possibilities. Nevertheless, such attacks are still possible because of the if-statement in Step 15, through which the random permutations can be revealed by SPA.
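As a rough indication of scale, the theoretical guessing complexities quoted above can be evaluated for the word lengths used later in the paper. The mapping of 128, 192, and 256-bit operands to $l = 4, 6, 8$ assumes 32-bit words; the function name is hypothetical.

```python
from math import factorial


def shuffling_complexities(l):
    """Theoretical number of orderings for an l-word operand."""
    return {
        "two_loops": factorial(l) ** 2,          # (l!)^2 for the multiplication loops
        "carry_vector": factorial(2 * l - 1),    # (2l-1)! for the carry permutation
    }


for bits in (128, 192, 256):
    l = bits // 32                                # word length with 32-bit words
    print(bits, shuffling_complexities(l))
```

These figures are upper bounds on blind guessing only; they say nothing about an adversary who can observe the if-statement pattern discussed next.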
First, consider Step 12-21 of Alg. 3. When $i = 2l-1$, the condition of the if-statement is satisfied only once, namely where the entry of the permutation vector $P$ is $2l-1$. Hence in the power consumption trace there will be $2l-2$ identical patterns for the iterations where the condition of the if-statement is false, and one different and longer pattern caused by the additional operations of Step 16-19. If an adversary can distinguish the two patterns, it can recover the position of entry $2l-1$ in $P$, since there exist only $2l-1$ positions in the vector. Next, for $i = 2l-2$ the condition is satisfied twice, that is, for entries $2l-1$ and $2l-2$, so two different and longer patterns are expected in the power consumption trace. Since the adversary already knows the position of entry $2l-1$, it can recover the position of entry $2l-2$ as the remaining position at which a longer pattern appears. Similarly, the positions of all entries randomly distributed in $P$ can be recovered step by step.

Now, assuming that the adversary knows every permutation vector corresponding to each global shuffling LIM's carry propagation process, it can mount collision-based single trace attacks that exploit the carry propagation process alone. Note that the previous attacks mainly exploit the single precision multiplications in the LIM. Since our proposed attack targets the carry propagation process, in which only the results of single precision multiplications are manipulated, exploitable collisions in the operands of LIMs are limited to the case where both operands are the same. Nevertheless, this vulnerability of the global shuffling LIM can still lead to the recovery of the secret scalar in three cases where particular unified point additions are deployed, as described in the next section.

VOLUME 8, 2020

Algorithm 3 Global Shuffling Long Integer Multiplication [12]
Input:
19: $c_s \leftarrow 0$
20: end for
21: end for
22: Return $r$
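The step-by-step recovery described above can be sketched on an idealized leakage model. The sketch assumes that the if-statement in Step 15 is true exactly when the scanned entry $s$ satisfies $s \geq i$, which matches the counts given in the text (one true case for $i = 2l-1$, two for $i = 2l-2$, and so on), and that an adversary can tell the long pattern from the short one without error. The function names are hypothetical.

```python
def if_pattern(P):
    """Idealized SPA observation: for each i, which positions show the long pattern.

    Assumption: the condition is true when the entry s at a position satisfies s >= i.
    """
    m = len(P)                                    # m = 2l - 1 entries
    return {i: [s >= i for s in P] for i in range(1, m + 1)}


def recover_permutation(pattern):
    """Recover P from the long/short patterns, from i = 2l-1 down to i = 1."""
    m = len(pattern)
    recovered = [None] * m
    for i in range(m, 0, -1):
        # The single position that is long for this i and not yet explained holds entry i.
        new = [j for j, is_long in enumerate(pattern[i])
               if is_long and recovered[j] is None]
        if len(new) != 1:
            raise ValueError("pattern is not consistent with a permutation")
        recovered[new[0]] = i
    return recovered
```

With this model the cost of recovering a vector is linear in the number of observed loop iterations rather than $(2l-1)!$.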

B. APPLICABILITY OF PROPOSED ATTACK ON ELLIPTIC CURVES
Here, we analyze the possibility of attacking side-channel atomic scalar multiplication with unified point addition deploying the global shuffling LIM. Assuming an adversary can detect collisions of the two operands in two field multiplications, that is, two global shuffling LIMs in this context, the secret scalar $k$ of Alg. 1 can be recovered if such collisions exist only for doubling (or only for addition), even though the same sequence of operations is executed by the unified point addition. In the following, we demonstrate such attack scenarios by analyzing the unified point addition formulae for several elliptic curves.

1) SHORT WEIERSTRASS CURVES
A short Weierstrass elliptic curve $E_K$ defined over a field $K$ satisfies the following equation

$$E_K : y^2 = x^3 + ax + b$$

where $(x, y) \in K^2$ in affine coordinates and $a, b \in K$. Given two points on the curve $P_1 = (X_1, Y_1, Z_1)$ and $P_2 = (X_2, Y_2, Z_2)$ in the projective coordinate representation, where $x = X/Z$ and $y = Y/Z$, the unified addition formulas for $P_1 + P_2 = (X_3, Y_3, Z_3)$ are presented in the following algorithm [22].

Algorithm 4 Unified Point Addition for Short Weierstrass Curves [22]
Input:

If $P_1 = P_2$, the two input operands of the field multiplications corresponding to (Step 1, Step 2) and (Step 3, Step 4) are identical, respectively. When $P_1 \neq P_2$, these collisions of input operands do not occur.

2) JACOBI QUARTIC CURVES
A Jacobi quartic elliptic curve $E_K$ is defined over a field $K$ by an equation in affine coordinates $(x, y) \in K^2$ with a parameter $a \in K$. Given two points on the curve $P_1 = (X_1, Y_1, Z_1)$ and $P_2 = (X_2, Y_2, Z_2)$ in the projective coordinate representation, where $x = X/Z$ and $y = Y/Z$, the unified addition formulas for $P_1 + P_2 = (X_3, Y_3, Z_3)$ are presented in the following algorithm [22]. If $P_1 = P_2$, the two input operands of the field multiplications corresponding to (Step 1, Step 6), (Step 2, Step 7), and (Step 4, Step 9) are identical, respectively. When $P_1 \neq P_2$, these collisions of input operands do not occur.

3) JACOBI INTERSECTION CURVES
A Jacobi intersection elliptic curve $E_K$ defined over a field $K$ satisfies the following equation

Algorithm 5 Unified Point Addition for Jacobi Quartic Curves [22]
Input:

C. EXPERIMENTAL RESULT
In this section, we evaluate the security of the global shuffling LIM against advanced single trace attacks. As analyzed in the previous section, unified point additions for some elliptic curves have collisions in the two operands of field multiplications which can be exploited to recover the secret scalar from side-channel atomic scalar multiplications. Such collisions can be detected by an adversary if the field multiplications are based on schoolbook LIMs, as shown in previous works [13], [15], [16]. However, the global shuffling LIM can prevent such detection if it is as secure as intended. In the following, we demonstrate that collisions in the two operands of global shuffling LIMs can be detected by a collision-based single trace attack targeting the carry propagation operations, mounted after the random permutation vectors have been recovered with SPA.

1) EXPERIMENTAL SETUP
We conduct practical experiments of our attack on three power consumption traces, each consisting of $2n$ ($n = 128, 192, 256$) global shuffling LIMs, where $n$ is the number of bits of the targeted ECC primitive, running on an ARM Cortex-M4 based STM32F405 microcontroller [20] mounted on a ChipWhisperer [24] CW308T-STM32F [25] target board. Note that in a real attack there will be more than $2n$ LIMs in one scalar multiplication, because unified point additions consist of more than two LIMs. In that situation, an adversary must first extract the power consumption traces of the target LIMs. Such extraction can be achieved by finding one LIM trace by visual inspection, which we refer to as the reference trace, and then locating the remaining LIM traces by calculating the correlation coefficient of the reference with the whole scalar multiplication trace point by point. For simplicity of explanation and due to the limited memory of our oscilloscope, we conduct our experiments assuming that an adversary exploits one LIM from each unified addition. Alg. 3 is implemented in ARM assembly language. In detail, each power consumption trace has $n$ pairs of LIMs, where half of the pairs have input collisions in both operands and the rest have no collision. That is, if we denote by $(A, B, C, D)$ the operands of a pair of LIM operations, then $A = C$, $B = D$ in the collision case and $A \neq C$, $B \neq D$ in the opposite case, where all the inputs are different random values for each pair of LIMs. The permutation vectors $\gamma$ and $P$ are different for each individual LIM operation. The two LIMs forming a pair are located contiguously. Note that, as we target the carry propagation process of the global shuffling LIM, we focus on the power consumption samples corresponding to the $l(2l-1)$ unit carry propagation operations. Since we conduct experiments for 128, 192, and 256-bit ECC primitives based on 32-bit words, that is, $l = 4$, $6$, and $8$, there exist 28, 66, and 120 such unit operations within each LIM, respectively.
For all acquisitions of power consumption traces, we amplify the signal with a CW501 differential probe [26] and capture it with a LeCroy HDO6104A oscilloscope [27]. The operating clock frequency of the target device and the sampling rate of the oscilloscope for each acquisition are listed in Table 1. For simplicity, we illustrate our attack experiment on the 128-bit implementation hereafter. The attack principle and the actual results for the other implementations are similar.

2) REVEALING RANDOMIZED VECTORS BY SIMPLE POWER ANALYSIS
To reveal the permutation vector $P$ for each LIM, we first identify and extract the power consumption samples of Step 12-21 in Alg. 3, denoted by $C_k$, where $k \in \{1, \ldots, 2n\}$ is the index of each LIM operation. This is done by finding a single reference trace of $C_k$ and calculating the correlation coefficient point by point with the whole power consumption trace. Since the vector $P$ varies for each LIM, the actual patterns of the $C_k$ are not identical. Nevertheless, although the values of the correlation coefficient peaks range from 0.36 to 0.88, all the peaks are distinguishable enough to determine the starting point in time of each $C_k$. Hence we cut power consumption samples of the same length as the reference from the starting points and acquire every $C_k$. Fig. 1 presents all $C_k$ for $k \in \{1, \ldots, 256\}$. As shown in Fig. 1, we can determine the interval of each loop of Step 13-20 in Alg. 3 as $i$ increases, thanks to the power consumption peaks which are common to all $C_k$ and appear after each loop interval (they are located outside the dotted lines in Fig. 1). Considering the interval where $i = 7$, we can confirm that peaks appear at seven possible positions across the overlaid traces, since the if-statement of Step 15 in Alg. 3 is satisfied only once per trace, that is, when $s = 7$, while $s$ varies from 1 to 7 in random order. Hence, by observing at which of the seven positions the peak of a given trace appears, we can determine the location of entry 7 in each permutation vector $P$.
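The point-by-point correlation used to locate each $C_k$ can be sketched in plain Python as follows. The threshold, the toy trace, and the function names are assumptions for illustration; a real acquisition would contain noise and would need a threshold chosen from the observed peak heights.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return 0.0 if su == 0 or sv == 0 else cov / (su * sv)


def locate_segments(trace, reference, threshold):
    """Start indices where the sliding correlation with the reference peaks."""
    w = len(reference)
    scores = [pearson(trace[t:t + w], reference)
              for t in range(len(trace) - w + 1)]
    return [t for t, s in enumerate(scores)
            if s >= threshold
            and (t == 0 or s >= scores[t - 1])
            and (t == len(scores) - 1 or s > scores[t + 1])]


# Toy example: the reference pattern occurs twice in a short synthetic trace.
reference = [0, 3, 9, 3, 0, 1]
trace = [1, 1] + reference + [1, 1, 1] + reference + [1]
starts = locate_segments(trace, reference, threshold=0.9)
segments = [trace[t:t + len(reference)] for t in starts]
```

Cutting a window of the reference length at each returned start index gives the aligned segments that are then overlaid for visual inspection.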
However, for $i \in \{2, \ldots, 6\}$ such a simple determination is not possible, because the power consumption peaks in each interval do not appear at seven fixed positions. This is caused by the fact that the additional operations of Step 16-19 in Alg. 3 are performed only when the condition of the if-statement is true, and this happens at least twice for $i \in \{2, \ldots, 6\}$, so each longer pattern shifts everything that follows it. To solve this problem, we reconsider the interval of $i = 7$ in Fig. 1 and analyze the different patterns corresponding to the condition of the if-statement of Alg. 3, as represented in Fig. 2. Clearly, if the condition is true, a longer and distinct power consumption pattern is observed, and this pattern can be used as a reference pattern to locate similar patterns in every $C_k$ by calculating the correlation coefficient point by point. (Note that these patterns can be distinguished efficiently by checking whether the power consumption exceeds a certain threshold; however, we choose the correlation method to locate the patterns accurately and then easily substitute them, starting from the points where correlation peaks exist, in the next step.) After this process, we substitute all these particular patterns, which are similar to the reference pattern, with an arbitrary pattern that has the same length as the false case pattern in Fig. 2 but a different shape. For example, we choose the samples located in the center of the true case pattern in Fig. 2, where a relatively higher power consumption peak is observed. The result of the substitution for all $C_k$ is presented in Fig. 3. As all peaks caused by a true condition of the if-statement are now located at seven positions in each loop of $i \in \{2, \ldots, 6\}$, we can reveal all entries of the permutation vector $P$ for each $C_k$ by SPA.

We illustrate here an example of recovering the permutation vector $P$ for a particular $C_k$, where $k = 255$ in our experiment. Similarly to Fig. 2, we present $\tilde{C}_{255}$ in Fig. 4, where $\tilde{C}_k$ denotes the power consumption trace resulting from the previous substitution on $C_k$, together with the remaining $\tilde{C}_k$ for $k \in \{1, \ldots, 254, 256\}$ in gray, to reveal the permutation vector $P$ by visual inspection. The entries of the permutation vector $P$ are recovered one by one from the interval of $i = 7$ down to $i = 2$. As described previously, a power consumption peak in the interval of $i = 7$ indicates the position of entry 7 in the permutation vector. Hence we can determine that the first entry of $P$ is 7. Next, we observe that in the interval of $i = 6$ there are two peaks, in the first and the seventh position. As we already know that the peak in the first position is caused by the case $s = 7$, we can infer that the peak in the seventh position is caused by the case $s = 6$. As a result, entries 7 and 6 are located in the first and the seventh position of $P$, respectively. The rest of the entries can be recovered similarly, and for the case $i = 1$, entry 1 is assigned to the last remaining position. Finally, we recover the permutation vector as $P = \{7, 3, 1, 2, 4, 5, 6\}$ for $C_{255}$. The permutation vectors for all remaining $\tilde{C}_k$ can be recovered in the same manner.

We reconstruct the power consumption traces as $C_k = [T_{k,1}\; T_{k,2}\; \cdots\; T_{k,l(2l-1)}]$ for all $k \in \{1, \ldots, 256\}$, where each $T_{k,m}$ is the subtrace of one unit carry propagation operation.
Then we apply the integration compression technique [28] to each $C_k$. The window size parameter, which specifies how many samples are integrated into one point, is selected among the factors of the number of points per clock, that is, the sampling rate of the measuring device divided by the operating clock frequency of the target device. We present the selected window size parameters, which lead to the best success rates of the subsequent collision attacks, for all three experiments targeting 128/192/256-bit ECC primitives in Table 2.
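A minimal sketch of window-based integration is given below. It sums raw samples over non-overlapping windows, which is one common form of integration compression and is an assumption here rather than the exact variant of [28]; the sampling rate and clock frequency in the usage lines are hypothetical values, not those of Table 1.

```python
def window_candidates(sampling_rate_hz, clock_hz):
    """Factors of the number of points per clock, usable as window sizes."""
    points_per_clock = sampling_rate_hz // clock_hz
    return [w for w in range(1, points_per_clock + 1)
            if points_per_clock % w == 0]


def integrate(trace, window):
    """Sum consecutive non-overlapping windows of samples into single points."""
    return [sum(trace[t:t + window])
            for t in range(0, len(trace) - window + 1, window)]


# Hypothetical acquisition: 500 MS/s sampling of a 25 MHz target, 20 points per clock.
candidates = window_candidates(500_000_000, 25_000_000)
compressed = integrate([1, 2, 3, 4, 5, 6, 7], window=3)
```

Samples that do not fill a complete final window are dropped in this sketch.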
Finally, we subtracted the mean traces of $T_{k,m}$ in the same manner as in [29] to remove operation-dependent leakage and possible noise. Since we aim to detect collisions of manipulated data in a pair of global shuffling LIMs, we divide the power traces $C_k$, $k = \{1, \ldots, 256\}$, into two groups. Then, for each group separately, the mean trace of the group is subtracted from each $T_{k,m}$.
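The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, the window is assumed to sum raw sample values, and the group mean is assumed to be taken point-wise over all subtraces of one group.

```python
import numpy as np

def compress_by_integration(trace, window):
    """Sum every `window` consecutive samples into one point (assumed raw sum)."""
    trace = np.asarray(trace, dtype=float)
    usable = (len(trace) // window) * window  # drop an incomplete tail window
    return trace[:usable].reshape(-1, window).sum(axis=1)

def subtract_group_mean(subtraces):
    """Subtract the point-wise mean of a group of equally long subtraces."""
    subtraces = np.asarray(subtraces, dtype=float)
    return subtraces - subtraces.mean(axis=0)

# Toy data only: five samples compressed with a window of two samples
compressed = compress_by_integration([1, 2, 3, 4, 5], 2)
centered = subtract_group_mean([[1, 2], [3, 4]])
```

Here `window` plays the role of the window size parameter reported in Table 2, and `subtract_group_mean` would be called once per group of LIM traces.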

4) REORDERING POWER CONSUMPTION TRACES
Now, to exploit the collision characteristic of two global shuffling LIMs caused by their input operands, similarly to the previous attacks performed on non-shuffling LIMs, we reorder the subtraces $T_{k,m}$ in each $C_k$ so that all reordered traces $\bar{C}_k$ process carries in practically the same order. This removes the effect of the global shuffling countermeasure, which destroys the correlation between the power traces of two LIMs by shuffling the order of the unit carry processing operations, that is, Step 16-19 in Alg. 3, every time the global shuffling LIM is executed. More precisely, as analyzed in the previous subsection, the manipulated indices for each iteration $i$ in Alg. 3 can be denoted by $m_i$, where the positions of the entries of each $m_i$ are shuffled, and Alg. 7 rearranges these shuffled positions in ascending order.

Algorithm 7 Reordering Subtraces
Input: $C = [T_1\ T_2\ \cdots\ T_{l(2l-1)}]$, $P$
Output: Reordered $\bar{C} = [\bar{T}_1\ \bar{T}_2\ \cdots\ \bar{T}_{l(2l-1)}]$ where the position index $m$ of subtrace $T_m$ is rearranged according to the entry values of $P$
1: offset = 0
2: for i = 1 to 2l − 1 do
3:   for j = 1 to |P| do
4:     $\bar{T}_{\text{offset}+P[j]} \leftarrow T_{\text{offset}+j}$
5:   end for
6:   offset ← offset + |P|
7:   P ← P\{i} // Omit entry i from vector P
8: end for
9: Return $\bar{C} = [\bar{T}_1\ \bar{T}_2\ \cdots\ \bar{T}_{l(2l-1)}]$
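A minimal Python sketch of the reordering idea is given below. It is an illustration under stated assumptions rather than a transcription of Alg. 7: indices are 0-based, the function name `reorder_subtraces` is hypothetical, and each subtrace is written to the rank of its entry within the current block (`s - i`) so that the output holds exactly $l(2l-1)$ subtraces.

```python
def reorder_subtraces(subtraces, perm):
    """Undo the shuffling of one LIM trace, given its recovered permutation vector."""
    remaining = list(perm)                  # entries still processed, in shuffled order
    reordered = [None] * len(subtraces)
    offset = 0
    for i in range(1, len(perm) + 1):
        for j, s in enumerate(remaining):
            # s takes the values i..2l-1, so s - i is its rank within this block
            reordered[offset + (s - i)] = subtraces[offset + j]
        offset += len(remaining)
        remaining.remove(i)                 # entry i is no longer selected afterwards
    return reordered

# Toy case with 2l - 1 = 3; each label reads "iteration:entry"
shuffled = ["1:2", "1:3", "1:1", "2:2", "2:3", "3:3"]
restored = reorder_subtraces(shuffled, [2, 3, 1])
```

In the toy case the subtraces of every iteration end up in ascending order of the processed entry, which is the common order needed before two LIM traces are compared.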

5) FINDING POINTS OF INTEREST
To perform the subsequent collision-based single trace attack, we find points of interest (POIs) by calculating a correlation coefficient vector whose length equals that of $T_{k,m}$, denoted by $l_T$. As we have $n$ pairs of LIM traces, where each LIM trace $C_k$ is composed of $l(2l-1)$ subtraces $T_{k,m}$, we vertically stack all $T_{k,m}$ separately for each group of LIM traces and calculate the correlation coefficient of the two resulting matrices, as described in Alg. 8. In this way, correlation peaks caused by the collision of identical manipulated data in some pairs of LIM traces can be identified, even though pairs of LIM traces without a collision of input operands add noise to the resulting correlation coefficient vector at the same points in time. After choosing as POIs the samples where the correlation coefficient is equal to or greater than some threshold, as shown in Fig. 5, we can perform the collision-based single trace attack.
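The POI search can be sketched as a column-wise Pearson correlation between the two stacked matrices. The sketch below is an assumption about how such a computation might look, not a reproduction of Alg. 8: the function names are hypothetical, row $r$ of each matrix is assumed to hold the matching subtrace of the first and second LIM of a pair, and columns with zero variance are assigned a correlation of zero.

```python
import numpy as np

def columnwise_correlation(group_a, group_b):
    """Pearson correlation between matching columns of two stacked subtrace matrices."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    num = (a_c * b_c).sum(axis=0)
    den = np.sqrt((a_c ** 2).sum(axis=0) * (b_c ** 2).sum(axis=0))
    # Columns without variance carry no information; report zero for them
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def select_pois(corr, threshold):
    """Indices of samples whose correlation is at least the threshold."""
    return np.flatnonzero(corr >= threshold)

# Toy data: column 0 is perfectly correlated, column 1 is constant in group B
corr = columnwise_correlation([[1, 5], [2, 3], [3, 4]], [[2, 1], [4, 1], [6, 1]])
pois = select_pois(corr, 0.9)
```

The threshold value is a free parameter here; the text only states that some threshold is chosen.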

6) RECOVERING SECRET BY COLLISION-BASED SINGLE TRACE ATTACK
Algorithm 8 Calculating Correlation Coefficient Trace to Find POIs
Input: $C_k = [T_{k,1}\ T_{k,2}\ \cdots\ T_{k,l(2l-1)}]$
Output: $(1 \times l_T)$ correlation coefficient vector CI
1: for j = 0 to n − 1 do
2:   for m = 1 to l(2l − 1) do
3:

We can now determine whether each pair of LIM traces has an input collision. As analyzed in Section III-B, if an adversary can detect whether two input operands of global shuffling LIMs collide in the unified point additions deployed by a side-channel atomic scalar multiplication, it can recover the secret scalar as described in Alg. 10 (in our experiment we just consider collisions of a pair of global shuffling LIMs for simplicity). First, we extract the POIs selected in the previous subsection from each unit carry propagation subtrace $T_{k,m}$ and reconstruct $\hat{C}_k = [\hat{T}_{k,1} \| \hat{T}_{k,2} \| \cdots \| \hat{T}_{k,l(2l-1)}]$, where $\hat{T}_{k,m}$ denotes the group of extracted POI samples. Then we calculate the correlation coefficient $\delta_j$ between two contiguous LIMs $\hat{C}_{2 \times j}$ and $\hat{C}_{2 \times j+1}$ for all $j = \{0, \ldots, n-1\}$. If $\delta_j$ is greater than the mean of all $\delta_j$, we consider the bit $d_j$ of the secret scalar to be zero; otherwise the bit is one.

Results of these attacks targeting 128-bit, 192-bit, and 256-bit implementations are presented in Table 3, where each success rate is the number of correctly recovered bits divided by the secret bit length $n$. Although several bits are recovered incorrectly, every success rate is high enough to threaten the security of a side-channel atomic scalar multiplication deploying specific unified point additions with global shuffling LIMs, and the success rate increases as $n$ increases. This can be explained by the increased number of unit carry propagations available for recovering each bit of the secret scalar. Hence the success rates can be improved by exploiting more pairs of LIMs in the doubling or addition operations of a unified point addition (this can be done by utilizing more than one measuring device, as presented in [30]).
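The bit decision rule described above can be sketched as follows. This is a simplified illustration: `recover_bits` and `success_rate` are hypothetical names, each row is assumed to hold the concatenated POI samples of one LIM, and the toy data is not taken from the experiments.

```python
import numpy as np

def recover_bits(first_lims, second_lims):
    """Decide one scalar bit per pair of LIM traces from their correlation."""
    deltas = np.array([np.corrcoef(x, y)[0, 1]
                       for x, y in zip(first_lims, second_lims)])
    # Correlation above the mean is read as a collision, i.e. bit 0
    bits = np.where(deltas > deltas.mean(), 0, 1)
    return bits, deltas

def success_rate(recovered, secret):
    """Fraction of correctly recovered bits."""
    return float(np.mean(np.asarray(recovered) == np.asarray(secret)))

# Toy data: pairs 0 and 2 are strongly correlated, pair 1 is anti-correlated
first = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
second = [[2, 4, 6, 8], [4, 3, 2, 1], [1, 2, 3, 5]]
bits, deltas = recover_bits(first, second)
```

Using the mean of all $\delta_j$ as the decision boundary means no fixed threshold has to be calibrated in advance, at the cost of depending on the mix of zero and one bits in the scalar.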
Furthermore, for the full recovery of the secret scalar we can consider a method of correcting error bits. If an adversary can determine the locations of the error bits, the correction can be done by simply flipping the bit values at those locations. The cost of brute-forcing such locations is $\sum_{i=1}^{n_e} {}_{n-1}C_i$, where $n_e$ is the number of error bits, assuming that the most significant bit of the secret is one. In our experiments, the results are bounded by $2^{29}$, $2^{37}$, and $2^{35}$ for attacking the 128-bit, 192-bit, and 256-bit implementations, respectively. To decrease these costs, we can restrict the brute-force search to candidate bit locations whose correlation coefficient values $\delta_j$ lie within an arbitrary distance from the mean of all $\delta_j$. For example, in our experiment targeting the 192-bit implementation, the mean of all $\delta_j$ is 0.4915, where the values range from −0.6153 to 0.9614, and by choosing the distance as 0.3035 we obtain thirty-nine candidate locations that include all error-bit locations. We then calculate the brute-forcing cost again using the number of candidate locations instead of $(n - 1)$ in the combination formula; the results are presented in Table 4.
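The brute-forcing cost and the candidate selection can be sketched as below. The sketch reads the combination formula above as the sum, over $i = 1, \ldots, n_e$, of the number of ways to choose $i$ locations out of the available ones, which is an interpretation rather than a confirmed formula; the function names are hypothetical.

```python
from math import comb

def brute_force_cost(num_locations, num_errors):
    """Number of trials to flip up to `num_errors` bits among `num_locations` positions."""
    return sum(comb(num_locations, i) for i in range(1, num_errors + 1))

def candidate_locations(deltas, distance):
    """Bit positions whose correlation lies within `distance` of the mean correlation."""
    mean = sum(deltas) / len(deltas)
    return [j for j, d in enumerate(deltas) if abs(d - mean) <= distance]

# Toy values only
cost = brute_force_cost(4, 2)                           # C(4,1) + C(4,2)
near_mean = candidate_locations([0.9, 0.5, -0.6, 0.4], 0.3)
```

For the 192-bit experiment, `brute_force_cost(39, n_e)` would give the reduced cost for the thirty-nine candidate locations once the number of error bits $n_e$ is known.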

IV. COUNTERMEASURE
In this section, we propose a novel global shuffling LIM that resists our attack. As analyzed in Section III-A, the vulnerability of the carry propagation process is caused by the if-statement that selects an entry $s$ satisfying $s \ge i$ from the random permutation vector $P$ while $i$ increases from 1 to $2l - 1$. Hence, in our proposed countermeasure, described in Alg. 11, the if-statement is removed. In detail, Step 12-21 in Alg. 3, which we refer to as the original process, is replaced by Step 12-32 in Alg. 11. In the following, we describe the core idea for selecting $s$ without the if-statement, considering the value of $i$ in the original process, and then present power consumption traces of a practical implementation of our countermeasure, which show the same patterns for different executions of the LIM.
First, for the case $i = 1$ in Alg. 3, the selection of $s$ is not necessary since all $2l - 1$ entries of $P$ are required; hence the carry propagation process for this case is given separately as Step 12-18 in Alg. 11.
Likewise, for the case $i = 2l - 1$, the selection is not required since this case only adds the last carry to the most significant word of the result register. Hence, the process can be simplified to Step 32 in Alg. 11.
Consider the remaining case where $2 \le i \le 2l - 2$. The number of required entries of the permutation vector $P$ in the original process is $2l - i$, denoted by $q$ in Step 16 in Alg. 11, since entries with a value smaller than $i$ are not selected. Instead of selecting the required entries, our countermeasure iteratively rearranges the permutation vector so that the required entries occupy the first $q$ positions, and consequently needs no selection process.
The core idea is as follows. First, calculate the difference $t$ between the value $q + 1$ and the value of the entry in the last position of the original permutation vector. Next, add $t$ to all entries and apply mod $(q + 1)$; the entries in the first $q$ positions now take the values from 1 to $q$, still in random positions, and the last position holds zero. Then, since the values required in the original process range from $i$ to $2l - 1$, add the difference $i - 1 = (2l - 1) - q$ to the entries in the first $q$ positions of the derived permutation vector. The first $q$ entries of the derived permutation vector can thus be used for the carry propagation process. To derive the next permutation vector, the first $q$ entries of the derived permutation vector are treated as the original permutation vector.
The process described above is presented as Step 19-31 in Alg. 11. Note that the modulus operation is actually performed in Step 22-23 by utilizing the negative flag of the status register of the target processor after a subtraction of $q + 1$. If naively implemented, the modulus operation can cause SPA leakage. For example, the modulus operator % in the C language can be compiled to a division instruction of the target processor, whose number of execution cycles depends on the input value, if a dedicated instruction for the modulus operation does not exist. Considering the mod $(q + 1)$ operation, the result of subtracting $q + 1$ is negative for inputs lower than $q + 1$; for such inputs no modular reduction is necessary, so the result must be compensated by adding $q + 1$ back. On the other hand, if the result of the subtraction is non-negative, no compensation is required. Since the negative flag is set to one in the former case and to zero in the latter case, multiplying the value of the negative flag by $q + 1$ and adding the product to the result performs this compensation, and the correct result of the modulus operation is obtained without SPA leakage.
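The rearrangement and the flag-based reduction can be illustrated with the following sketch. It only demonstrates the arithmetic and makes no claim about timing, since Python is not constant-time. The names `mod_by_flag` and `rearrange` are hypothetical, the negative flag is emulated with an arithmetic shift, and the next permutation vector is assumed to be taken before the offset $i - 1$ is added.

```python
def mod_by_flag(x, modulus):
    """Reduce x in [0, 2*modulus) with one subtraction and a flag-controlled add-back."""
    d = x - modulus
    negative = (d >> 63) & 1          # emulates the negative flag: 1 if d < 0, else 0
    return d + negative * modulus

def rearrange(perm, offset):
    """perm is a permutation of 1..q+1; returns (entries to process, next permutation)."""
    q = len(perm) - 1
    t = (q + 1) - perm[-1]            # shift that maps the last entry to 0 mod (q+1)
    shifted = [mod_by_flag(p + t, q + 1) for p in perm]
    next_perm = shifted[:q]           # a permutation of 1..q in random positions
    selected = [v + offset for v in next_perm]   # values i..2l-1 when offset = i-1
    return selected, next_perm

# Example with the vector recovered for k = 255, iterations i = 2 and i = 3
selected_2, perm_2 = rearrange([7, 3, 1, 2, 4, 5, 6], 1)
selected_3, perm_3 = rearrange(perm_2, 2)
```

In both iterations the selected entries cover exactly the values $i, \ldots, 2l - 1$, and no comparison against $i$ is needed to obtain them.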
To demonstrate the practical effectiveness of our countermeasure, we implemented a 128-bit version of Alg. 11 on the same target and with the same setting as in our attack experiment introduced in Section III-C. Fig. 5 presents 256 power consumption traces of our proposed global shuffling LIM in the same manner as Fig. 1. It can be observed that there are no differences in the patterns of the power traces; hence revealing the random permutation vectors by SPA is not possible.

V. CONCLUSION
We present the first practical results of a combined single trace attack on software implementations of global shuffling LIM. The attack analyzes the vulnerability of the algorithm's carry propagation process, which is exploitable by SPA, and then performs collision-based single trace attacks on the power consumption subtraces of unit carry propagation operations after revealing the permutation vectors. Since the vulnerability of global shuffling LIM is caused by the if-statement that selects the proper entries from the permutation vector, we propose a novel countermeasure that eliminates this selection with a permutation vector rearrangement method based on simple addition and modulus operations. The countermeasure achieves regularity in the power trace patterns and consequently provides security against SPA. We also provide a practical result of implementing our countermeasure.