Fit the Joint Moments: How to Attack Any Masking Scheme

Side-Channel Analysis (SCA) allows extracting secret keys manipulated by cryptographic primitives through the leakage of their physical implementations. Supervised attacks, known to be optimal, can theoretically defeat any countermeasure, including masking, by learning the dependency between the leakage and the secret during a profiling phase. However, defeating masking is less trivial when it comes to unsupervised attacks. While classical strategies such as correlation power analysis or linear regression analysis have been extended to masked implementations, we show that these extensions only hold for Boolean and arithmetic schemes. Therefore, we propose a new unsupervised strategy, the Joint Moments Regression (JMR), able to defeat any masking scheme (multiplicative, affine, polynomial, inner product, etc.), schemes which are gaining popularity in real implementations. The main idea behind JMR is to directly regress the leakage model of the shares by fitting a system based on higher-order joint moment conditions. We show that this idea can be seen as part of a more general framework known as the Generalized Method of Moments (GMM). This offers mathematical foundations on which we rely to derive optimizations of JMR. Simulation results confirm the advantage of JMR over state-of-the-art attacks, even in the case of Boolean and arithmetic masking. Finally, we apply this strategy to real traces and provide, to the best of our knowledge, the first unsupervised attack on the protected AES implementation proposed by the ANSSI for SCA research, which embeds affine masking and shuffling countermeasures.


I. INTRODUCTION

A. CONTEXT
Side-Channel Analysis (SCA) is defined as the process of gaining information on a device holding a secret through its physical leakage such as power consumption [1] or Electromagnetic (EM) emanations [2]. The underlying assumption is that the secret and the side-channel data are statistically dependent. This allows an adversary to extract sensitive information such as cryptographic keys by carefully exploiting these dependencies.
Strategies are mainly divided into two categories, supervised and unsupervised SCA, and their utilization depends on the considered threat model. In the first one, the adversary is supposed to be able to conduct a profiling step of the target, most likely on a clone device, in which she learns the leakage model of the intermediate variables, and then adopts a maximum likelihood approach to recover the secret key. This includes strategies such as the Gaussian template attack [3] or deep learning profiled attacks [4]. If the model is perfectly learned during the profiling phase, these attacks are known to be optimal from an information theory point of view.
If the profiling step is not possible, the adversary has to use an a priori on the leakage model to mount an unsupervised SCA. As shown in [5], there does not exist a generic strategy that would work without requiring such an a priori. Different approaches have been developed, each exploiting an amount of information that reflects the quality of this a priori. For example, [6] showed that Mutual Information Attacks (MIA) can exploit a large part of the information contained in the traces but require an explicit representation of the leakage model (recent deep-learning-based unsupervised SCA [7], [8] also fall into this category).
The main alternatives to MIA are the stochastic attacks, such as Correlation Power Analysis (CPA) [9] or Linear Regression Analysis (LRA) [10], in which the adversary's a priori is reduced to a parameterized statistical model whose parameters are regressed on the fly. A measure of fitness is then used as a distinguisher to discriminate key candidates.
To prevent instantaneous leakage of the sensitive variables, a classical strategy is to protect implementations using masking techniques. It consists in splitting the internal state of the processing into multiple random shares following secret sharing ideas [11]. SCA against masked implementations is still possible through so-called higher-order attacks, which combine multiple leakage samples corresponding to each share. However, these attacks are harder to conduct since the impact of the noise is amplified exponentially with the masking order [12]. Among unsupervised attacks, the multivariate CPA described in [13] has often proved to be an efficient strategy in practice. However, it relies on a Hamming weight leakage assumption (on the shares) that may not be correct, especially when it comes to local EM measurements. Indeed, each bit of the intermediate variable can have a very different leakage behavior, and even sign inversions of their coefficients, as shown in [14]. To deal with such situations, [15] proposed a generalization of the LRA whose main strength is to offer flexibility on the a priori without constraining each bit to have the same impact on the leakage.
This method exploits information hidden in the covariance, i.e., the second-order joint moment of the distribution, since the first-order moments (the means) are leakage-free thanks to masking. However, we argue that this method is not generic enough because it is based on the assumption that the covariance per class can be expressed as a low-degree algebraic function, an assumption that only holds for Boolean and low-order arithmetic masking, as shown in this paper. Indeed, the proposed attack fails even in theory (on synthetic traces with zero noise) when dealing with other masking schemes such as the multiplicative or affine ones. These masking schemes, along with polynomial and inner product masking, have been increasingly studied recently and begin to be used in modern implementations. This trend may continue in the future since these schemes seem to offer better resistance against side-channel attacks [16]. Mutual information-based attacks have also been extended to masked implementations, but they have not proven to be valid strategies for every kind of masking either, and their applicability is mainly related to the open questions, raised in [6], about the choice of the partitioning function. This leads us to the following observation: to the best of our knowledge, no generic unsupervised strategy able to defeat any kind of masking outside of the Hamming weight leakage assumption emerges from the state-of-the-art.
We propose such a strategy in this paper: the Joint Moments Regression (JMR). The latter is built on the idea that the discriminating information, if it exists, is necessarily hidden in higher-order joint moments, since lower-order leakages are prevented by masking (at least when not considering glitches from the physical implementation [17]). Intuitively, joint moments encapsulate information about the corresponding distribution. The idea is to make a leakage assumption on each share (for example a linear leakage) and to directly regress the leakage model of each share using joint moment conditions, instead of trying to regress the joint moment itself as is done in [15]. This comes at the cost of the loss of linearity, since the joint moment conditions involve a multiplication between the leakage parameters of the different shares, which gives rise to a non-linear system of equations. However, we show that numerical optimization algorithms can be used to find an estimation of the solution that best fits the conditions. A measure of fitness is used as a distinguisher between key candidates. The joint moment conditions depend on the underlying masking scheme, which allows knowledge of the latter to be embedded into the system and, therefore, makes the attack generic.

B. CONTRIBUTIONS
• The first contribution of the paper is to present the state-of-the-art on the stochastic higher-order attacks, especially focusing on the method proposed in [15] to understand its strengths and limitations. This analysis can be found in section II.
• As a second contribution, we introduce a new attack strategy, the Joint Moments Regression (JMR), in section III. It is built to circumvent the issues found in the state-of-the-art and is agnostic to the underlying masking scheme.
• We then draw a parallel between the core of JMR and a more general framework: the Generalized Method of Moments (GMM) [18], which is a well-studied paradigm in statistics and economics. This allows us to improve our attack in the case of biased masking schemes such as the multiplicative and affine ones. This analysis can be found in section IV.
• Finally, section V-B presents applications of JMR to real traces and provides at the same time, to the best of our knowledge, the first unsupervised attack on the secured AES implementation of the ANSSI, protected by an affine masking scheme. Attacks that do and do not exploit the lower-order leakage are both presented.

A. NOTATIONS
Random variables are represented by upper-case letters such as X. They take their values in the corresponding set X depicted with a calligraphic letter. Lower-case letters such as x stand for elements of X. The expectation of X is denoted E[X] and the covariance between X_1 and X_2 is denoted cov(X_1, X_2). Eventually, |X| stands for the cardinality of X.

B. GENERAL ATTACK FRAMEWORK
In this paper, the attack framework is described considering an adversary who targets the manipulation of a sensitive variable Z ∈ Z = F_2^n, for a given n ∈ N. This variable is supposed to functionally depend on a public variable X ∈ X = F_2^m, for a given m ∈ N, and a secret key k* ∈ K = F_2^m through the relation Z = f(X, k*), where f : X × K → Z is a known function depending on the underlying cryptographic algorithm. The adversary is supposed to own a set {(ℓ_i, x_i), 1 ≤ i ≤ N} of N side-channel traces labeled with the corresponding public value of X. Traces correspond to realizations of a leakage variable L ∈ L produced from Z by a stochastic process S (often separable into a deterministic part and a noise part). The leakage variable L is supposed to contain information about Z. The general idea of an unsupervised side-channel attack is to make a series of hypotheses k_i on the key and to use the dependency between Z and L to build a distinguisher D : K → R ranking the different key candidates. One of these distinguishers, the LRA, is described in the next section.

C. LINEAR REGRESSION ANALYSIS
The following recalls the steps required to perform an LRA as suggested in [19]. Traces are assumed to feature one sample. In a real-life scenario, the same procedure would be repeated for each sample, and the final distinguisher keeps the best value across all samples according to a chosen policy (often the minimum or maximum value of the distinguisher).
1) Partitioning. Partition the traces into |X| classes according to the value of x_i.
2) Averaging. Compute the estimated mean per class L̄ = (ℓ̄_x)_{x∈X}, where ℓ̄_x is the average of the traces of class x.
3) Basis choice. Choose a basis of functions (b_i)_{1≤i≤r} such that b_i : Z → R.
4) Making hypotheses. For k ∈ K compute the hypotheses matrix M_k = (b_i(f(x, k)))_{x∈X, 1≤i≤r}.
5) Linear regression. For k ∈ K find the parameter vector θ_k = (θ_{k,1}, ..., θ_{k,r})^T minimizing the Euclidean norm of the error vector L̄ − M_k · θ_k.
6) Ranking. Rank the keys according to their distinguisher value (from low to high): D(k) = ||L̄ − M_k · θ_k||². (Sometimes the coefficient of determination R² is used instead, but the ranking is strictly equivalent except that one ranks from high to low values of the distinguisher.)
Since step 5 corresponds to a linear regression, it has a closed-form solution: θ_k = (M_k^T · M_k)^{-1} · M_k^T · L̄. However, to highlight similarities with JMR later in the paper, we decided to keep the generic formulation of the optimization problem.
The choice of the basis is important since it should be large enough for the leakage to be representable as a linear combination of the b_i ∘ f functions when k = k*, but small enough so that this is not the case for wrong hypotheses. The adversary uses her a priori on the leakage model, often related to physical assumptions, to choose the basis.
A common example is to assume that each bit of the sensitive variable contributes to the leakage independently from the others. If this assumption holds, there exists α = (α_0, ..., α_n) such that ℓ_i = α_0 + Σ_{j=1}^{n} α_j · bit_j(z_i) + ε, with bit_j denoting the projection on the j-th bit and ε being sampled from a noise distribution. In such a case, the basis would be {1, bit_1, ..., bit_n} and θ_{k*} should be close to α.
Another example is to assume that the leakage depends on the Hamming Weight (HW) of the sensitive variable, so that ℓ_i = α_1 · HW(z_i) + α_0 + ε. The basis is then reduced to {1, HW} and the attack corresponds to the classical CPA.
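As an illustration, the six LRA steps above can be sketched in a few lines of numpy on a hypothetical 4-bit target. The S-box, key value, and synthetic Hamming-weight leakage below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-bit S-box and synthetic traces (hypothetical, for illustration only)
SBOX = rng.permutation(16)
k_star = 7
N = 2000
x = rng.integers(0, 16, N)                     # public inputs
z = SBOX[x ^ k_star]                           # sensitive variable f(x, k*)
hw = np.array([bin(v).count("1") for v in range(16)])
traces = 2.0 * hw[z] + rng.normal(0, 0.5, N)   # linear (HW) leakage + noise

# Steps 1-2: partition per public value, average per class
L_bar = np.array([traces[x == c].mean() for c in range(16)])

# Step 3: basis {1, bit_1, ..., bit_4} applied to the hypothesis f(x, k)
def basis(vals):
    return np.column_stack([np.ones(len(vals))] +
                           [(vals >> b) & 1 for b in range(4)]).astype(float)

# Steps 4-6: regression and ranking for each key hypothesis
scores = np.empty(16)
for k in range(16):
    M_k = basis(SBOX[np.arange(16) ^ k])                      # hypotheses matrix
    theta, *_ = np.linalg.lstsq(M_k, L_bar, rcond=None)       # linear regression
    scores[k] = np.sum((L_bar - M_k @ theta) ** 2)            # distinguisher D(k)

best_k = int(np.argmin(scores))
```

The correct key yields the lowest residual because only the correct hypothesis makes the class means a linear function of the predicted bits.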

D. MASKING
To prevent instantaneous leakages and mitigate the first-order attacks presented above, one of the most widely used countermeasures is masking [20]. The idea is to split each sensitive intermediate value Z into d shares (Z_i)_{1≤i≤d}. The d − 1 shares Z_2, ..., Z_d are randomly chosen and the last one, Z_1, is processed such that Z = Z_1 * Z_2 * ⋯ * Z_d for a group operation * of Z. This has the effect of complexifying the stochastic process S generating L from Z, rendering it no longer separable into a deterministic and a noise part. Assuming the masks are uniformly distributed, the knowledge of d − 1 shares does not tell anything about Z (this is why such masking is said to be of order d − 1). Therefore, any sound SCA strategy has to combine leakage samples from the d shares to perform an attack (which corresponds to at least d samples if the leakages are disjoint). Such attacks are called d-th order attacks. One of them, the second-order LRA, is presented in the next section. The uniformity assumption is sometimes not strictly realized in practice, depending on the masking scheme being used. Four of the most common masking schemes, which will be studied in this paper, are listed in table 1. The symbols ⊕ and ⊗ respectively stand for the addition and the multiplication operation in F_2^n. Since the multiplication by 0 is not invertible, the ''multiplicative shares'' have to be chosen in F_2^n \ {0}. As Z itself can take the value 0, the multiplicative and affine schemes are then slightly biased and, therefore, do not guarantee in theory SCA resilience to all the (d − 1)-th and lower-order attacks. Such attacks will be discussed in section IV.
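A minimal sketch of how shares could be generated for the four schemes discussed here. The share conventions below are the standard textbook forms and may differ in detail from the paper's table 1; gf_mul is a helper for GF(2^8) multiplication with the AES reduction polynomial:

```python
import random

def gf_mul(a, b):
    """Carry-less multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def share_boolean(z, rng):
    z2 = rng.randrange(256)
    return (z ^ z2, z2)            # z = z1 XOR z2

def share_arithmetic(z, rng):
    z2 = rng.randrange(256)
    return ((z - z2) % 256, z2)    # z = z1 + z2 mod 2^8

def share_multiplicative(z, rng):
    z2 = rng.randrange(1, 256)     # multiplicative mask cannot be 0
    # z1 = z * z2^{-1}: brute-force the field inverse for clarity
    inv = next(v for v in range(1, 256) if gf_mul(z2, v) == 1)
    return (gf_mul(z, inv), z2)    # z = z1 * z2 in GF(2^8)

def share_affine(z, rng):
    z2 = rng.randrange(1, 256)
    z3 = rng.randrange(256)
    return (gf_mul(z, z2) ^ z3, z2, z3)    # z1 = (z * z2) XOR z3
```

Note the bias mentioned above: with the multiplicative (and affine) scheme, z = 0 always forces the masked share gf_mul(z, inv) to 0, whatever the mask.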

E. SECOND-ORDER LRA
This section describes the generalization of the LRA introduced in [15], which aims at defeating a first-order masked implementation (d = 2). Traces are considered to be composed of 2 samples, L = (L_1, L_2), where L_1 and L_2 respectively represent the leakage of the first and second share. In a real-life scenario, the attack would be repeated for all combinations of two samples from the raw traces. To perform a second-order LRA, the adversary is supposed to own a set {(ℓ_i, x_i), 1 ≤ i ≤ N} of N labeled traces. The idea is to replace the estimated mean per class by the estimated covariance per class in the classical LRA, which naturally combines information from the two samples. Indeed, the covariance Y = cov(L_1, L_2) involves the product of the centered variables L_1 − μ_1 and L_2 − μ_2, with (μ_1, μ_2) = E[L], which has been shown to be a good combining function for second-order SCA [13]. The steps to perform a second-order LRA are depicted hereafter.
1) Partitioning. Partition the traces into |X| classes according to the value of x_i.
2) Estimating covariances. Compute the estimated covariance per class Ȳ = (ȳ_x)_{x∈X}, with ȳ_x the average of (ℓ_{i,1} − μ̂_1)(ℓ_{i,2} − μ̂_2) over the traces of class x, where (μ̂_1, μ̂_2) stands for the estimated mean of L.
3) Basis choice. Choose a basis of functions (b_i)_{1≤i≤r} such that b_i : Z → R.
4) Making hypotheses. For k ∈ K compute the hypotheses matrix M_k = (b_i(f(x, k)))_{x∈X, 1≤i≤r}.
5) Linear regression. For k ∈ K find the parameter vector θ_k = (θ_{k,1}, ..., θ_{k,r})^T minimizing the Euclidean norm of the error vector Ȳ − M_k · θ_k.
6) Ranking. Rank the keys according to their distinguisher value (from low to high): D(k) = ||Ȳ − M_k · θ_k||².
The attack may seem very similar to a first-order LRA except that it is performed on the covariance instead of the mean (the change happens in step 2). However, the choice of the basis is much more delicate: the link between the adversary's a priori and a basis leading to a successful attack is no longer trivial. Indeed, the hypotheses matrix is constructed using the unmasked variable Z(k) = f(X, k) while the leakage a priori concerns the shares.
The choice of the basis proposed in [15] is based on an assumption that is recalled hereafter.
Let us define the set of functions (ϕ_k)_{k∈K}, ϕ_k : Z = F_2^n → R, such that ϕ_k(z) = cov(L_1, L_2 | Z(k) = z). Since all the functions from F_2^n to R can be represented by a multivariate polynomial in R[z_1, ..., z_n]/(z_1² − z_1, ..., z_n² − z_n) (i.e., the degree of every z_i in every monomial is at most 1) [23], there exists, for any k, a unique set of coefficients (α_{k,u})_{u∈F_2^n} such that ϕ_k(z) = Σ_{u∈F_2^n} α_{k,u} · z^u, where each term z^u denotes the monomial (function) z ↦ z_1^{u_1} · z_2^{u_2} ⋯ z_n^{u_n} with z_i ∈ F_2. Let deg(ϕ_k) stand for the degree of the polynomial representing ϕ_k.
The assumption on which the attack from [15] relies is the following.
Assumption 1: for every k ≠ k*, deg(ϕ_k) > deg(ϕ_{k*}).
The intuition behind this assumption is that ϕ_k is expected to have a high degree (close to n) if k ≠ k*, due to cryptographic properties of f, which often embeds highly non-linear S-boxes to prevent algebraic attacks. Note that this reasoning only holds if ϕ_{k*} itself has a low degree, which is implicitly assumed in [15]. This point will be discussed later.
If Assumption 1 holds, the basis {z ↦ z^u | u ∈ F_2^n, HW(u) ≤ deg(ϕ_{k*})} is a valid basis for the second-order LRA. Indeed, it spans all the functions of degree less than or equal to deg(ϕ_{k*}). Therefore, there exists a decomposition of ϕ_{k*} in this basis while, by hypothesis, it is not the case for the other ϕ_k, which guarantees the success of the attack (provided that the number of traces allows for a fair approximation of the covariances per class).
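To make the procedure concrete, here is a sketch of a second-order LRA on simulated Boolean-masked traces with a degree-1 basis. The 4-bit S-box and Hamming-weight share leakages are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
SBOX = rng.permutation(16)           # toy 4-bit S-box (illustrative)
hw = np.array([bin(v).count("1") for v in range(16)], float)
k_star, N = 5, 40000

# Boolean masking: share 1 = z XOR m, share 2 = m, each leaking its HW
x = rng.integers(0, 16, N)
z = SBOX[x ^ k_star]
m = rng.integers(0, 16, N)
l1 = hw[z ^ m] + rng.normal(0, 0.2, N)
l2 = hw[m] + rng.normal(0, 0.2, N)

# Step 2: estimated covariance per class (centering with global means)
c1, c2 = l1 - l1.mean(), l2 - l2.mean()
Y_bar = np.array([(c1[x == v] * c2[x == v]).mean() for v in range(16)])

# Steps 3-6: degree-1 basis on the unmasked variable, regression per hypothesis
def basis(vals):
    return np.column_stack([np.ones(len(vals))] +
                           [(vals >> b) & 1 for b in range(4)]).astype(float)

scores = np.empty(16)
for k in range(16):
    M_k = basis(SBOX[np.arange(16) ^ k])
    theta, *_ = np.linalg.lstsq(M_k, Y_bar, rcond=None)
    scores[k] = np.sum((Y_bar - M_k @ theta) ** 2)
best_k = int(np.argmin(scores))
```

Here the covariance per class is a degree-1 function of z (as Proposition 1 below predicts for Boolean masking), so the degree-1 basis fits it only under the correct hypothesis.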

F. LIMITATIONS
The first observation is that even if Assumption 1 holds, the attack may fail in practice if deg(ϕ_{k*}) is not low enough. Indeed, the cardinality of the basis, and therefore the number of parameters to estimate, increases quickly with deg(ϕ_{k*}), offering a large capacity for the statistical model to fit the data whatever the considered value of k. If the noise is not negligible, this often means that wrong hypotheses can reach scores similar to the correct one, which reduces the distinguishability and therefore the effectiveness of the attack. For example, with n = 8 and deg(ϕ_{k*}) ∈ {1, 2, 3}, the cardinality of the basis is respectively equal to 9, 37 and 93. In practice, the authors of [15] ran their attack with the bases corresponding to d_max ∈ {1, 2, 3}. Choosing d_max = 3 never led to the best attack, even in cases where deg(ϕ_{k*}) was strictly greater than 2 (due to the high model capacity and lack of distinguishability).
Then, one could ask whether Assumption 1 holds at all. Since ϕ_{k*}(z) = cov(L_1, L_2 | Z(k*) = z), it is obviously related to the nature of the leakages L_1 and L_2. These variables can be assumed to be separable into a deterministic part and a noise part with respect to the shares: L_i = l_i(Z_i) + ε_i, with l_i : Z → R representing the leakage of share i and ε_i being an independent random noise variable. By bilinearity of the covariance and independence of the ε_i: ϕ_{k*}(z) = cov(l_1(Z_1), l_2(Z_2) | Z = z). Both l_1 and l_2 can be assumed to be of low degree (through a physical a priori on the leakage). For example, it is realistic to assume that both shares follow a linear leakage. But we argue that this is not enough to guarantee Assumption 1 and that it still depends on the underlying masking scheme, especially on the nature of the * operation.
Then, a natural question arises: why does the attack presented in [15] work? We argue that it is related to the masking schemes studied in that paper, which focuses on the Boolean and arithmetic masking schemes; both are exceptions as far as Assumption 1 is concerned. This claim is justified by the two following propositions.
Proposition 1 (Boolean Masking): Let * = ⊕. Let l_1 : Z → R and l_2 : Z → R be two leakage functions of degree 1. Let ϕ_Bool(z) = cov(l_1(z ⊕ Z_2), l_2(Z_2)). Then deg(ϕ_Bool) ≤ 1. Proof can be found in appendix A.
Proposition 2 (Arithmetic Masking): Let * = + mod 2^n. Let l_1 : Z → R and l_2 : Z → R be two leakage functions of degree 1. Let ϕ_Arith(z) = cov(l_1(z − Z_2 mod 2^n), l_2(Z_2)). Then deg(ϕ_Arith) ≤ 2. Proof can be found in appendix A.
These two propositions explain the success of the attacks presented in [15]. However, we could not find equivalent propositions for other masking schemes, suggesting that Boolean and arithmetic masking are, in fact, exceptions. This will be empirically confirmed in section III-D, where it is shown that, even without noise, the higher-order LRA fails against multiplicative or affine masking with a linear leakage of the shares. Therefore, to the best of our knowledge, there is no strategy in the literature able to defeat a generic masking scheme in an unsupervised context with a simple linear leakage assumption on the shares. We introduce such a strategy in the next section.
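The contrast between the two propositions can be checked numerically by computing the exact (noise-free) ϕ for random degree-1 leakages and measuring the residual of a degree-1 fit. The 4-bit setting and the fitting procedure below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a = rng.uniform(-1, 1, n + 1)   # random degree-1 leakage l1 (intercept + bits)
b = rng.uniform(-1, 1, n + 1)   # random degree-1 leakage l2

def lin(coef, v):
    return coef[0] + sum(coef[j + 1] * ((v >> j) & 1) for j in range(n))

def phi(combine, masks):
    """Exact cov(l1(share1), l2(mask)) per value z, mask uniform on `masks`."""
    out = []
    for z in range(2 ** n):
        s1 = [lin(a, combine(z, m)) for m in masks]
        s2 = [lin(b, m) for m in masks]
        out.append(np.mean([u * v for u, v in zip(s1, s2)])
                   - np.mean(s1) * np.mean(s2))
    return np.array(out)

def deg1_residual(y):
    """Residual of the best degree-1 (intercept + bits) fit of y."""
    M = np.column_stack([np.ones(2 ** n)] +
                        [[(z >> j) & 1 for z in range(2 ** n)]
                         for j in range(n)]).astype(float)
    theta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return np.sum((y - M @ theta) ** 2)

phi_bool = phi(lambda z, m: z ^ m, range(2 ** n))
phi_arith = phi(lambda z, m: (z - m) % 2 ** n, range(2 ** n))
res_bool = deg1_residual(phi_bool)     # ~0: degree <= 1 (Proposition 1)
res_arith = deg1_residual(phi_arith)   # generally nonzero: degree can reach 2
```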

III. JOINT MOMENTS REGRESSION
We first introduce the concept of Joint Moment (JM), which generalizes, to any masking order, the idea of the covariance used in the previous section.

A. JOINT MOMENTS
Moments of probability distributions are quantitative measures related to the shape of the distribution. The moment of order d, denoted μ_d, of the variable X is defined as μ_d = E[X^d]. For second and higher orders, the centered moments μ̃_d of order d are often used instead and are defined as μ̃_d = E[(X − E[X])^d]. Joint moments are the generalization of moments to multivariate variables. Let X = (X_1, ..., X_n) ∈ R^n be a multivariate random variable. Let u = (u_1, ..., u_n) ∈ N^n be a vector of positive integers such that Σ_i u_i = d. The JM of order d with respect to the vector u, denoted jm_u, is defined as jm_u = E[X_1^{u_1} ⋯ X_n^{u_n}]. Centered JM are also defined as j̃m_u = E[(X_1 − E[X_1])^{u_1} ⋯ (X_n − E[X_n])^{u_n}]. One important property of JM (and of simple moments) is that, for distributions defined on a compact set of R^n, the distribution is fully defined by the list (maybe infinite) of all its JM. This is also true for the centered JM provided that the first-order JM are also given. The effect of a d-order masking is that no information related to the sensitive variable can be found in the JM of order d − 1 and lower. That is why the second-order LRA performed a regression on the second-order centered JM with u = (1, 1), which happens to be the covariance: it is the lowest-order JM bringing information on the sensitive variable. Information could also be found in higher-order JM but they are harder to estimate. Indeed, more terms are involved in the product and the noise in each one of them is amplified through the multiplication. One typically wants to use the JM with the lowest standard error (the standard deviation of its estimator). This also explains why centered JM are preferred: as shown in [13], they have a lower standard error than their uncentered counterparts.
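A straightforward numpy estimator of a centered joint moment, shown on a synthetic correlated pair (the distribution and its parameters are illustrative):

```python
import numpy as np

def centered_jm(samples, u):
    """Estimate the centered joint moment jm~_u of a multivariate sample.

    samples: (N, n) array of N draws of (X_1, ..., X_n).
    u: exponent vector (u_1, ..., u_n); the JM order is sum(u).
    """
    centered = samples - samples.mean(axis=0)
    return np.prod(centered ** np.asarray(u), axis=1).mean()

rng = np.random.default_rng(0)
xy = rng.normal(size=(200000, 2))
xy[:, 1] = 0.8 * xy[:, 0] + 0.6 * xy[:, 1]   # correlated pair, cov = 0.8

cov_est = centered_jm(xy, (1, 1))   # order-2 JM with u = (1, 1): the covariance
```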

B. ATTACK DESCRIPTION
Let an adversary own a set of N traces {ℓ_i, 1 ≤ i ≤ N} of a d-order masked implementation. Traces are considered to be composed of d samples: ℓ_i = (ℓ_{i,1}, ..., ℓ_{i,d}), one per share. In a real-life scenario, the attack would be repeated on combinations of d samples from the raw traces, depending on the attacker's a priori on the points of interest. To defeat this implementation, a naive solution would be to extend the attack proposed in [15] using centered JM instead of the covariance. But, as stated in section II-F, there is no obvious link between the physical a priori, which happens to concern the shares, and the basis that has to be chosen and applied to the unmasked sensitive variable.
That is why we propose a new strategy in which the adversary chooses d bases, one for each share (in practice they will often be the same basis), and directly regresses the leakage of each share using information from the estimated d-order centered JM. The steps of what we call the Joint Moments Regression (JMR) are depicted hereafter.

JMR Procedure
1) Partitioning. Partition the traces into |X| classes according to the value of x_i.
2) Estimating joint moments. Compute the estimated centered d-order joint moments matrix J̄M = (j̄m_x)_{x∈X}. Each row represents the estimation for one class.
3) Basis choice. For each share j, choose a basis of functions (b_i)_{1≤i≤r} such that b_i : Z → R. Intuitively, if l_j corresponds to the leakage of share j, the adversary wants to choose a basis such that l_j(z_j) = Σ_{i=1}^{r} θ_{j,i} · b_i(z_j) + ε_j for some coefficient vector θ_j ∈ R^r, with ε_j representing an independent random noise variable.
4) Making hypotheses. Let l̂_j(z_j) = Σ_{i=1}^{r} θ_{j,i} · b_i(z_j) stand for the leakage prediction of share j according to the chosen basis. For k ∈ K, define the theoretical JM vector JM_k(θ), with respect to θ ∈ R^{d×r}, that translates the leakage assumption of step 3 into |X| = 2^m JM-per-class expressions:
JM_k(θ)_x = (1/|A_x|) · Σ_{(z_1,...,z_d)∈A_x} Π_{j=1}^{d} (l̂_j(z_j) − μ_{θ_j}).
Here, A_x denotes the set of possible values (z_1, ..., z_d) of the shares given that Z = f(x, k), and μ_{θ_j} stands for the theoretical mean of the leakage of share j under the assumption of θ_j.
5) Non-linear regression. For k ∈ K, find through numerical optimization techniques (see subsection III-C) the parameter vector θ^{(k)} ∈ R^{d×r} minimizing the Euclidean norm of the error vector J̄M − JM_k(θ).
6) Ranking. Rank the keys according to their distinguisher value (from low to high): D(k) = ||J̄M − JM_k(θ^{(k)})||².
The general attack structure of JMR is very similar to the LRA and second-order LRA. The main difference with the latter is that the assumption is made on the leakage of the shares and is therefore directly related to the physical a priori. These assumptions are then combined to build a parameterized system JM_k(θ) = J̄M of unknown θ ∈ R^{d×r}, where each line represents a condition on the JM knowing that X = x. Note that, by the independence assumption, the noise terms ε_j cancel from the theoretical equations of the JM per class listed in the JM_k(θ) vector. The goal is then to find the solution θ^{(k)} that best fits the system and to use a measure of fitness as distinguisher. Note that the knowledge of the underlying masking scheme is embedded in the system through the A_x sets, which describe the possible values (z_1, ..., z_d) of the shares given the value of Z. This is what ensures the genericity of JMR regarding the masking scheme.
When the number of traces N tends towards infinity, the estimated JM per class J̄M tends towards the true JM per class. If the leakage assumptions are correct, there exists θ^{(k*)} ∈ R^{d×r} such that JM_{k*}(θ^{(k*)}) is equal to the true JM per class. Therefore, D(k*) tends towards 0, while it is unlikely to be the case for k ≠ k* due to the cryptographic properties of f, which ensures the soundness of the attack. However, this multi-share assumption comes at the cost of linearity. Indeed, even if all the shares are assumed to leak linearly, the system that JMR regresses is not linear anymore: it is of degree d. Therefore, there is no closed-form solution and one has to use numerical optimization tools to find an approximation of the solution. Numerical optimization is a research field in itself and is out of the scope of this paper. There exist multiple ready-to-use implementations in different programming languages, which is enough for our concern. Note that since the system is of degree d, the uniqueness of the solution of step 5 is not guaranteed. This is not a problem for the attack: as long as one solution can be found and equation 14 holds only for the correct key hypothesis, the attack will succeed for a sufficient number of traces.
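The following sketch puts the JMR steps together on a toy 4-bit Boolean-masked target, using scipy.optimize.least_squares as the solver (as the paper does in its experiments). The S-box, leakage models, and system-construction details are our own illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
n, N, k_star = 4, 40000, 9
SBOX = rng.permutation(2 ** n)
bits = np.array([[(v >> j) & 1 for j in range(n)] for v in range(2 ** n)], float)
B = np.column_stack([np.ones(2 ** n), bits])   # linear basis {1, bit_1..bit_n}

# Synthetic Boolean-masked traces with random linear share leakages (illustrative)
theta1_true = rng.uniform(-1, 1, n + 1)
theta2_true = rng.uniform(-1, 1, n + 1)
x = rng.integers(0, 2 ** n, N)
m = rng.integers(0, 2 ** n, N)
z1 = SBOX[x ^ k_star] ^ m
l1 = B[z1] @ theta1_true + rng.normal(0, 0.1, N)
l2 = B[m] @ theta2_true + rng.normal(0, 0.1, N)

# Step 2: estimated centered second-order JM per class
c1, c2 = l1 - l1.mean(), l2 - l2.mean()
JM_bar = np.array([(c1[x == v] * c2[x == v]).mean() for v in range(2 ** n)])

masks = np.arange(2 ** n)
def residuals(theta, k):
    t1, t2 = theta[:n + 1], theta[n + 1:]
    p1, p2 = B @ t1, B @ t2                    # predicted share leakages
    p1c, p2c = p1 - p1.mean(), p2 - p2.mean()  # centered by theoretical means
    z = SBOX[np.arange(2 ** n) ^ k]
    # A_x = {(z XOR m, m) : m in Z}: theoretical JM per class under theta
    jm_k = np.array([(p1c[zv ^ masks] * p2c[masks]).mean() for zv in z])
    return JM_bar - jm_k

# Steps 5-6: non-linear least squares per hypothesis, rank by residual cost
scores = np.empty(2 ** n)
for k in range(2 ** n):
    fit = least_squares(residuals, x0=np.full(2 * (n + 1), 0.5), args=(k,))
    scores[k] = 2 * fit.cost                   # cost = 0.5 * sum(residual**2)
best_k = int(np.argmin(scores))
```

The system is bilinear in (θ_1, θ_2); the sign ambiguity (θ_1, θ_2) ↦ (−θ_1, −θ_2) illustrates why the solution of step 5 need not be unique, without harming the attack.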

D. SIMULATION EXPERIMENTS
This section provides simulation experiments to assess the feasibility of JMR in practice against the masking schemes presented in table 1. Its efficiency is compared with state-of-the-art attacks at the second and third order.

1) IMPLEMENTATION
We implemented the core of the JMR attack using the least_squares function from the Python scipy.optimize package [24]. It solves a non-linear least-squares fitting problem using the Levenberg-Marquardt (LM) algorithm [25], [26], which is itself based on the Gauss-Newton algorithm and the gradient descent method. The attack time (or complexity) is mostly constant with respect to the number of traces, because the latter affects neither the number of parameters nor the number of equations in the system. The only part that scales with the number of traces is the estimation of the JM per class, which is just a product and a sum and can be handled with numpy [27] array manipulations.
Since the least-squares problems related to the different key hypotheses are independent, the implementation is highly parallelizable. We exploited this using a 48-core Xeon Platinum 8168 processor, which sped up the attack by a significant factor since the implementation of the least_squares function is not parallelized in itself. Other implementation optimizations could be explored, such as using the fast GPU version of the LM algorithm proposed in [28], but this is not in the scope of this paper. To give an order of magnitude, with our setup, running the full JMR procedure as described in subsection III-B for a d-tuple of time samples requires around 10 and 15 seconds for a second and third-order attack respectively (assuming one trace for each possible value of the shares: 2^16 and 2^24 traces respectively).

2) GENERATING DATASETS
To assess the JMR method and to compare it with state-of-the-art attacks, synthetic trace datasets with a linear leakage of the shares have been generated for first and second-order masking (d ∈ {2, 3}). Boolean, arithmetic, multiplicative and affine (only with d = 3) schemes are used to mask the classical sensitive variable of an AES: Z = Sbox[k* ⊕ P] (k* and P are both supposed to be 8 bits long). To be able to average the results of 100 different attacks, performed with 100 different linear leakage models, we have generated a matrix of random coefficients C_i for each share (1 ≤ i ≤ d): C_i = (α_{a,b})_{1≤a≤100, 0≤b≤8}, (15) where all the α_{a,b} are uniformly drawn from [−1, 1]. Each row represents a different linear leakage model.
To avoid any kind of estimation error (the error coming from sampling), each dataset contains one trace for each of the possible values of the shares (z_1, ..., z_d) ∈ Z^d (for the multiplicative and affine schemes the multiplicative shares cannot be 0, so we take them from [1, 255] instead). The trace ℓ^(a)_{(z_1,...,z_d)} corresponding to the d-tuple (z_1, ..., z_d) is generated by concatenating the leakage of each share (represented by the a-th row of the C_i matrices) as follows: ℓ^(a)_{(z_1,...,z_d)} = (l^(a)_1(z_1) + ε_1, ..., l^(a)_d(z_d) + ε_d), with l^(a)_i(z_i) = α_{a,0} + Σ_{b=1}^{8} α_{a,b} · z_i[b], where z_i[b] corresponds to the b-th bit of z_i and ε_i is drawn from a normal distribution N(0, σ²). The exact procedure generating the traces for the a-th leakage model is depicted in algorithm 1.

Algorithm 1 Generate Traces
Input: k*, the correct key byte
Input: a, representing a row in the matrices C_i
Input: d, the masking order
Input: *, a group operation with / the associated division
Input: σ, the value of the noise
Output: L, a (2^(8·d), d) array
Output: P, a (2^(8·d)) array
L ← empty list
P ← empty list
for (z, z_2, ..., z_d) ∈ Z^d do
    z_1 ← z / (z_2 * ⋯ * z_d)
    append (l^(a)_1(z_1) + ε_1, ..., l^(a)_d(z_d) + ε_d) to L
    append Sbox^(-1)[z] ⊕ k* to P
end for
return L, P
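A possible Python rendering of this generation procedure, reduced to d = 2 and a 4-bit toy for brevity. The loop body (deriving z_1 with the scheme's division and the plaintext with Sbox^(-1)) follows our reading of algorithm 1 and is an assumption:

```python
import itertools
import numpy as np

def linear_leak(alpha, v, n):
    """Linear leakage: intercept plus one coefficient per bit of v."""
    return alpha[0] + sum(alpha[b + 1] * ((v >> b) & 1) for b in range(n))

def gen_traces(k_star, coeffs, op_div, sigma, rng, sbox, n=4):
    """One trace per possible tuple (z, z_2) of a first-order masked (d = 2)
    S-box output, following the structure of Algorithm 1.

    coeffs: two coefficient rows (alpha_0..alpha_n), one per share.
    op_div: function mapping (z, z2) -> z1 so that z1 * z2 = z.
    """
    sbox_inv = np.argsort(sbox)
    L, P = [], []
    for z, z2 in itertools.product(range(2 ** n), repeat=2):
        z1 = op_div(z, z2)
        L.append([linear_leak(coeffs[0], z1, n) + rng.normal(0, sigma),
                  linear_leak(coeffs[1], z2, n) + rng.normal(0, sigma)])
        P.append(int(sbox_inv[z]) ^ k_star)   # public value with f(P, k*) = z
    return np.array(L), np.array(P)

rng = np.random.default_rng(0)
sbox = rng.permutation(16)
coeffs = rng.uniform(-1, 1, (2, 5))
L, P = gen_traces(k_star=3, coeffs=coeffs, op_div=lambda z, z2: z ^ z2,
                  sigma=0.1, rng=rng, sbox=sbox)
```

For Boolean masking the "division" is simply XOR; a multiplicative scheme would supply a field division and restrict z2 to nonzero values.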

3) RESULTS
Second-order attack results are presented in Figure 1. Each point represents the average rank of k* over the 100 datasets for a given value of σ. We recall that we are using exhaustive datasets; therefore, a failed attack for a given value of σ does not mean that the attack is impossible, but rather that the adversary would need more traces than one per possible value of the shares. We compare JMR with:
• A higher-order CPA, denoted HO-CPA, computed with a Hamming weight prediction model and using the JM of order d as combining function, which happens to be the same as the centered product described in [13].
• A higher-order LRA, denoted HO-LRA-d_max, where d_max is the assumed degree of ϕ_{k*} as defined in Equation 5. The basis used in HO-LRA-d_max therefore spans all the monomials of degree at most d_max. The combining function is also the JM of order d, which for d = 3 is a straightforward extension of the second-order attack described in [15].
• Mutual Information Analysis, denoted MIA-f, where the distinguisher used is MI(f(Z_k), L). MIA requires the use of a non-injective function f to create distinguishability for the correct hypothesis. Since the leakage model is unknown, we used very generic models: f = MSB and f = 7B, where MSB stands for the most significant bit of Z_k and 7B for the 7 most significant bits of Z_k. The MI has been estimated using the histogram method described in [29].
(a) For the Boolean case, JMR and HO-LRA-1 perform approximately the same, which is not surprising since, by Proposition 1, Assumption 1 holds for HO-LRA-1. It also holds for HO-LRA-2/3, but HO-LRA-1 perfectly explains the data with fewer parameters and thus performs better. One can notice that, even without noise, the HO-CPA does not converge towards 0, which confirms that it relies on the Hamming weight leakage assumption. Also, the MIA strategies do not perform well, which is not surprising since the underlying leakage model is unknown and it is therefore hard to select a good non-injective function.
(b) For the arithmetic scheme, JMR outperforms all the other attacks, even HO-LRA-2, for which Assumption 1 holds by Proposition 2. Again, this is explained by the fact that JMR only needs (9 × 2) parameters to predict the data while HO-LRA-2 needs 37. Even without noise, the data cannot be perfectly explained by an HO-CPA or HO-LRA-1 model, since their curves do not converge towards 0.
(c) For the multiplicative scheme, as predicted, none of the state-of-the-art attacks perform better than random, even without noise, which confirms that Assumption 1 does not hold at all for such a masking scheme. JMR is the only sound attack in this case.
Results for third-order attacks are presented in Figure 2. In this case, HO-LRA-d_max represents the generalization of the second-order LRA replacing the covariance by the third-order joint moment. Conclusions are the same as for the second-order attacks.
Among the considered attack strategies, one can observe that, as for the multiplicative case, JMR is the only sound option to attack affine masking under a linear leakage of the shares.

4) ABOUT THE BIASED SCHEMES
Both the multiplicative and affine schemes are slightly biased, which can induce lower-order leakage. We argue that such leakage has not been exploited in this section since the estimated JM were computed with the leakage of all the shares (thus, the variance of the estimation results from d multiplications of noisy leakages). To confirm this statement, we repeated the previous experiments for the biased schemes, removing Z = 0 from the possible values, thus simulating non-biased schemes. The results are essentially the same as those presented in Figures 1c, 2c, and 2d, so we do not plot them. Since the multiplicative and affine maskings do not seem to have special algebraic properties like the Boolean and arithmetic schemes (as shown in Propositions 1 and 2), we argue that these results could be extended to any other masking scheme. Indeed, the real added value of JMR is its ability to encode the scheme knowledge in the system's equations, making it generic and able to work even for non-biased schemes with a high algebraic degree, where other attacks would not.
However, in the specific case of biased schemes, lower-order leakage could be exploited with simpler attacks, such as a classical CPA with a zero-value power model. One could also perform more advanced attacks taking advantage of leakages at multiple orders at the same time. All these attacks are discussed in the next section, where we introduce the generalized method of moments paradigm.

IV. GENERALIZED METHOD OF MOMENTS PARADIGM
Looking from a broader perspective, it appears that the core of the JMR attack can be seen as an instance of a more general framework known as the Generalized Method of Moments (GMM) [18]. This method comes from the fields of statistics and econometrics, and its main purpose is to estimate the parameters of a statistical model. Embracing this paradigm requires a level of abstraction, but it gives access to powerful mathematical foundations. In particular, it tells us how to optimally combine information from different orders, which is useful when the masking scheme is biased.

A. BACKGROUND ON GMM
Suppose that the available data consist of N observations (L_i)_{1≤i≤N} of a random variable L ∈ R^n. This data is assumed to come from a stochastic process defined up to an unknown parameter vector θ ∈ R^p. The goal is to find the true value θ_0 of this parameter, or at least a reasonably close estimate.
In order to apply GMM, the data must come from a weakly stationary ergodic stochastic process (independent and identically distributed (iid) variables are a special case of these conditions). One then needs c ''moment conditions'', defined as a function g(·, θ): R^n × R^p → R^c such that:

E[ g(L, θ_0) ] = 0.

The idea is then to replace the theoretical expectation with its empirical analog:

m(θ) = (1/N) · Σ_{i=1}^{N} g(L_i, θ),

and to minimize the norm of m(θ) with respect to θ. The properties of the GMM estimator depend on the chosen norm, and therefore the theory considers the entire family of norms defined up to a positive-definite weighting matrix W ∈ M_c(R):

‖m(θ)‖²_W = m(θ)ᵀ · W · m(θ).

The GMM estimator is then defined as:

θ̂ = argmin_θ m(θ)ᵀ · W · m(θ).

The way of solving this optimization problem is not specified in the GMM theory; it is left to the numerical optimization field. The purpose of W is to weigh the different conditions. Choosing W = Id_c leads to the classical Euclidean norm and is equivalent to considering that all conditions should weigh the same. The intuition behind preferring another norm is that some conditions may be less informative, redundant, or more volatile in their empirical estimation. One typically wants to use the norm minimizing the asymptotic variance of the resulting estimator. This problem has a closed-form solution given by the following theorem:

Theorem 1 (Hansen 1982): Let θ̂_N be the random variable representing the output of the GMM estimator with N data observations, and let Ω be the covariance matrix of the condition function g evaluated at θ_0: Ω = cov-mat( g(L, θ_0) ). Then W = Ω⁻¹ minimizes the asymptotic variance of θ̂_N.

In the particular case where conditions are independent, the matrix Ω⁻¹ is diagonal, and choosing W = Ω⁻¹ simply means that the moment conditions are weighted inversely proportionally to their underlying variance. This is in line with the intuition that conditions with high variance are less informative.
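The GMM recipe above can be illustrated on a toy model; the model, the two moment conditions, and the sample size below are invented for the example and are not the conditions used by JMR.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stochastic process: L = theta_0 + noise, with known unit noise variance.
# Two moment conditions (c = 2) pin down the scalar parameter theta:
#   E[L - theta] = 0   and   E[L^2 - theta^2 - 1] = 0.
def g(L, theta):
    return np.stack([L - theta, L**2 - theta**2 - 1.0], axis=1)

def gmm(L, W, theta_init=0.0):
    """Minimise m(theta)^T W m(theta), m being the empirical mean of g."""
    def objective(t):
        m = g(L, t[0]).mean(axis=0)
        return float(m @ W @ m)
    return minimize(objective, x0=[theta_init]).x[0]

rng = np.random.default_rng(1)
L = 3.0 + rng.normal(0.0, 1.0, 5000)          # true theta_0 = 3
theta_hat = gmm(L, np.eye(2))                  # W = Id: plain Euclidean norm
Omega = np.cov(g(L, theta_hat), rowvar=False)  # covariance of the conditions
theta_opt = gmm(L, np.linalg.inv(Omega))       # W = Omega^{-1} (Theorem 1)
```

The last two lines are exactly the two-step construction discussed later: a first estimate under W = Id is used to estimate Ω, which then yields the optimally weighted estimator.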

B. PARALLEL WITH THE JMR ATTACK
This section exhibits the similarities between GMM and JMR. The core of the JMR attack relies on an estimation of the true parameters θ_0 ∈ Θ = R^d × R^r of a chosen statistical model (encoded in the choice of the basis) in order to explain the leakage of each share. Let L = (L_1, ..., L_d) represent the observed leakage variable and L_θ the predicted leakage variable under hypothesis θ, so that, under the assumption that the chosen model is correct, L = L_{θ_0}.
Since the moment conditions in JMR depend on the value of another public variable X, let us define, for each key hypothesis k, a condition function g_k: R^d × X × Θ → R^{|X|}, built so that, for each x ∈ X, the component of index x compares the joint moment of the leakages observed when X = x with the joint moment predicted under θ. Here, e_x = (0, ..., 1, ..., 0) ∈ R^{|X|} stands for a vector of 0s with a single 1 at position x, 1_d stands for a vector of d ones, and L is defined as in Equation 11. This definition of g_k may seem very artificial, but it is designed so that Equation 18 holds for the correct hypothesis k = k* (under the assumption that L = L_{θ_0}). Therefore, applying GMM with g_{k*} as condition function is sound, while it is not for wrong key hypotheses. In fact, this is precisely the property exploited by JMR: steps 1 to 5 of JMR are equivalent to applying |K| GMM estimations, one for each condition function g_k, with W = Id_{|X|}.
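A minimal sketch of the empirical analog of these per-class conditions; the array names and the `predicted_jm` vector (the joint moments predicted under θ) are illustrative, not the paper's notation.

```python
import numpy as np

def empirical_conditions(leakages, X, predicted_jm, n_classes=256):
    """Empirical analog m(theta) of the g_k conditions: for each class x, the
    mean over the traces with X = x of the product of the d share leakages,
    minus the joint moment predicted under theta for that class.
    leakages: (N, d) array; X: (N,) public values; predicted_jm: (n_classes,)."""
    prod = np.prod(leakages, axis=1)   # d-th-order joint term per trace
    m = np.empty(n_classes)
    for x in range(n_classes):
        sel = X == x
        m[x] = prod[sel].mean() - predicted_jm[x]
    return m  # GMM then minimises m @ W @ m over theta
```

For the correct key hypothesis and the true θ, each component of m converges to 0 as N grows; for a wrong hypothesis the classes are mislabeled and no θ can cancel all components.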

C. IMPROVING JMR USING GMM THEORY
This section describes two ways of improving JMR using the GMM theory. The first one is generic and the second one focuses on biased masking schemes.

1) USING THE OPTIMAL WEIGHTING MATRIX
Since the GMM theory recommends using Ω⁻¹ as weighting matrix, one may ask whether using the identity matrix was optimal. Indeed, the adversary typically wants to minimize the variance of the GMM estimator for the correct key k = k*. Therefore, it would be natural to replace the identity matrix with Ω⁻¹, where Ω = cov-mat( g_{k*}(L, X, θ_0) ). The problem is that Ω is, in general, hard to estimate from the data since θ_0 is unknown. The usual solution is the so-called two-step estimator: an estimate of θ_0 is first computed with a sub-optimal weighting matrix (for example, the identity), which allows estimating Ω, and GMM is eventually applied with the latter estimate as weighting matrix. However, in our case, Ω does not depend on θ_0, which makes the process easier. Indeed, the variance of the components of g_{k*} (and therefore the covariance matrix) only comes from the right term of Equation 23, which does not depend on θ, so Ω can be estimated directly. In addition, since, for a fixed x (here x ∈ X = F_2^m is seen as an element of Z/2^m Z), only one component of g_{k*}(·, x, θ) is non-zero, Ω is diagonal. One can thus estimate the diagonal terms of Ω using the observed data and then apply GMM. We denote by JMR++ the JMR attack with W = Ω̂⁻¹, where Ω̂ is an estimation of the optimal weighting matrix.
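Since Ω is diagonal here, estimating the optimal weights boils down to a per-class variance of the product of the share leakages; a sketch under that observation (array and function names are illustrative):

```python
import numpy as np

def jmr_pp_weights(leakages, X, n_classes=256, eps=1e-12):
    """Diagonal of Omega^{-1} for a JMR++-style weighting: the condition
    attached to class x only varies through the product of the share leakages
    on traces where X = x, so its variance can be estimated directly from the
    data, without knowing theta_0."""
    prod = np.prod(leakages, axis=1)
    omega_diag = np.array([prod[X == x].var() for x in range(n_classes)])
    return 1.0 / (omega_diag + eps)   # one weight per moment condition

def weighted_score(m, w):
    """GMM objective with diagonal W applied to the condition vector m."""
    return float(np.sum(w * m**2))
```

Low-variance (more reliable) conditions thus get larger weights, which is exactly the intuition behind Theorem 1.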
To confirm the soundness of this approach, we performed the same experiments as those described in subsection III-D to compare JMR and JMR++. Figures 3a and 3b show the results for the second-order Boolean and arithmetic masking and, in agreement with the theory, JMR++ performs a little better than JMR. It can be noticed that, in the case of Boolean masking, JMR++ also outperforms HO-LRA-1, which has approximately the same performance as JMR despite having more parameters to estimate.

2) THE CASE OF BIASED SCHEMES
Some masking schemes, such as the multiplicative or the affine one, violate the assumption of share uniformity. Therefore, the resilience to (d − 1)-th-order attacks is not guaranteed anymore. For example, Z = 0 implies Z_1 = 0 in a multiplicative scheme, inducing a first-order leakage. Likewise, when Z = 0, the affine scheme becomes a Boolean scheme of order 2, inducing second-order leakages. Since lower-order JM are informative in these cases, a first idea to exploit this weakness is to apply JMR at a lower order. This means that the considered conditions concern only the first-order moments for a multiplicative scheme and the second-order JM for an affine scheme. Such an attack is denoted JMR_Lower. Since it only exploits the difference between the classes Z = 0 and Z ≠ 0, this attack is very close to a CPA computed with a zero-value model considering only two classes, Z = 0 and Z ≠ 0, denoted CPA-0 (or HO-CPA-0 in the affine case) afterward. Figures 3c and 3d confirm this intuition by showing that both CPA-0 and JMR_Lower behave very similarly and obtain better results than JMR for high noise values but worse results for low noise values. Indeed, the main advantage of masking is to amplify the impact of the noise exponentially with the order of the mask [12], or, more accurately, with the order of the attack required to defeat it. For low values of σ, the JM conditions used in JMR_Lower are less informative than those used by JMR (they only exploit a difference between the class Z = 0 and the other classes), but the impact of the noise is amplified by a lower order, which explains the better results of JMR_Lower for high σ.
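The zero-value attack can be sketched as follows on a simulated first-order leakage of a multiplicative-style scheme; the S-box, trace count, and noise level are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(2)
SBOX = rng.permutation(256)  # stand-in bijective S-box (arbitrary for the example)
HW = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def simulate_multiplicative(k_star, n, sigma):
    """Z = 0 forces the masked value to 0 (the scheme bias); otherwise it is
    uniform over the non-zero values: a first-order, zero-value leakage."""
    p = rng.integers(0, 256, n)
    z = SBOX[p ^ k_star]
    masked = np.where(z == 0, 0, rng.integers(1, 256, n))
    return p, HW[masked] + rng.normal(0, sigma, n)

def cpa0(p, traces):
    """CPA with a zero-value model: only two classes, Z == 0 and Z != 0."""
    scores = np.zeros(256)
    for k in range(256):
        model = (SBOX[p ^ k] == 0).astype(float)
        scores[k] = abs(np.corrcoef(model, traces)[0, 1])
    return int(np.argmax(scores))

p, traces = simulate_multiplicative(42, 200000, 0.5)
best_k = cpa0(p, traces)
```

Because the Z = 0 class occurs for only 1 plaintext in 256 per key hypothesis, this first-order attack needs many traces, but it avoids the noise amplification of a d-th-order attack.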
A natural challenge is to design an attack benefiting from the best of both worlds: JMR and JMR_Lower. To this aim, we propose to use the flexibility of the GMM paradigm to develop an attack with conditions from both informative orders at the same time. This corresponds to building a system with 512 conditions instead of 256 when attacking a key byte. In this case, the weighting matrix is very important since each half of the system concerns conditions with very different variances (estimating joint moments becomes exponentially harder as the order grows). To highlight this fact, we denote by JMR_Full and JMR++Full the versions of JMR with conditions of both orders, respectively with W = Id_512 and W = Ω⁻¹.
Results are presented in Figures 3c and 3d. As expected, JMR_Full outperforms JMR but is impacted by the variance of the d-order conditions and therefore performs worse than JMR_Lower for high values of σ. However, JMR++Full benefits from the advantage of exploiting the d-order conditions for low values of σ, and still converges towards JMR_Lower for high noise values thanks to the well-chosen weighting of these conditions. Indeed, it is proven in [18] that adding more moment conditions can only improve the performance of the GMM estimator (by lowering its variance) when the optimal weighting matrix Ω⁻¹ is used.
We highlight the fact that, for the multiplicative scheme, there is an interest in using JMR++Full over CPA-0: for example, it gives a successful attack at σ = 1, where CPA-0 ranks the correct key at the 20th position, which is not enumerable considering the full 16-byte key. However, for the affine case, the curves look very similar and we argue that the overhead in time complexity of using JMR++Full (or JMR_Lower) over CPA-0 is not worth it.

V. EXPERIMENTS ON REAL TRACES
To assess the performance of JMR on real traces, we decided to attack two open-source protected AES implementations. The first one is protected by a first-order Boolean masking scheme (ASCAD) [30]. The second one embeds an affine scheme and a shuffling countermeasure (ASCADv2) [31].

A. ATTACK ON A FIRST-ORDER BOOLEAN MASKED AES (ASCAD)
As a first experiment, we performed the different stochastic attacks discussed in this paper on the public ASCAD dataset. It is a common set of side-channel traces, introduced for research purposes on deep-learning-based side-channel attacks. The targeted implementation is a software AES, protected with a first-order Boolean masking, running on an 8-bit ATMega8515 board.
We computed guessing entropies for the different attacks using the training dataset containing 50k traces. We extracted from the dataset the two Points-of-Interest (PoI) corresponding to the highest signal-to-noise ratio, one for each share. This step requires the knowledge of the shares and would not be feasible for a non-profiled adversary. In a real scenario, a visual analysis of the trace combined with knowledge of the implementation can be used to perform a PoI selection, reducing the number of sample combinations to be tested. Our goal here is to assess the security supposing that the adversary is able to apply the methodology on the best sample combination.

1) RESULTS
Results are depicted in Figure 4. We observe outcomes similar to those of the first-order Boolean simulations. As expected, the results of JMR++ and HO-LRA-1 are very close since it is a Boolean masking (and thus, Assumption 1 holds). Similarly to Figure 3a, the slight advantage of JMR++ may be explained by the use of the optimal weighting matrix. One may notice that HO-CPA performs better than in the simulations. It outperforms all the other attacks for low numbers of traces, even though an average correct key rank higher than 25 does not allow for a successful enumeration in a reasonable time. The better performance of HO-CPA can be explained by the fact that the leakage model of the ASCAD traces is much closer to a Hamming weight leakage model than those used in the simulated experiments. In such cases, regression-based attacks benefit less from their genericity. As the execution time depends only weakly on the number of traces, running JMR++ took approximately 10 seconds, as in the simulations.

B. ATTACK ON AN OPEN-SOURCE HARDENED AES IMPLEMENTATION (ASCADv2)
As a second experiment, we decided to attack the second protected AES implementation proposed by the Agence Nationale de la Sécurité des Systèmes d'Information (ANSSI) [31]. They published a library implementing an AES-128 on an ARM Cortex-M4 architecture using state-of-the-art countermeasures. Indeed, this implementation uses an affine masking as well as random shuffling of independent operations [32]. It is accompanied by a publicly available dataset called ASCADv2, providing 800,000 traces acquired on an STM32F303 microcontroller running this protected AES. A detailed leakage analysis of this dataset has been published in [33]. Following their terminology, we tried to attack the unmasked variable Z = Sbox[k* ⊕ P] using the leakage of the three shares. Unfortunately, the number of traces turned out to be too low to analyze the unsupervised attacks discussed in this paper. Thus, we reproduced a similar experimental setup, described in the next section, in order to collect significantly more traces.

1) ACQUISITION SETUP
Our setup has the following features:
• The acquisitions have been performed on a NUCLEO-F303RE board, which embeds the same STM32F303 microcontroller as the one used in ASCADv2.
• The device is clocked at 8 MHz, while the ASCADv2 device is clocked at 4 MHz. This allows faster acquisitions without altering the execution behavior (e.g., introducing FLASH wait cycles). Being in an evaluation setup, we had the labels of the shares and validated that this change did not affect the signal-to-noise ratio of these intermediate variables.
• We measured the magnetic field produced by the circuit with a Langer H-field probe (RF-U 5-2). This differs from the ASCADv2 setup, which measures the current of the device through a ChipWhisperer [34]. However, we observed better signal-to-noise ratios on the EM field. The probe covers a large portion of the CPU and no specific tuning of the probe placement was performed.
• The scope was configured at 3.125 GS/s and acquired a window of 8 µs, which represents 25,000 time samples.
• The masked AES implementation was taken ''as-is'' from the SecAESSTM32 repository [31]. We only made the following changes to the assembly code:
– a GPIO is raised in the Load_random function, which manipulates r_mul and r_out;
– a GPIO is raised in the first round of the AES, just after the Xor_Word operation.
• To further speed up the acquisitions, we do not transfer the plaintext and masking inputs through the serial port. Indeed, this represents 54 bytes (16 + 19 × 2) per encryption, which quickly becomes a bottleneck for acquisitions. Instead, the device runs a PCG32 Pseudo-Random Number Generator (PRNG) [35] to generate those data on the fly. This PRNG is re-seeded randomly (by sending 8 bytes on the serial port) every 250 acquisitions. This allows regenerating (from the stored seeds) the plaintexts and random masks offline, to label the dataset.
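The offline regeneration relies on re-running the same PRNG from the stored seeds. A sketch of the PCG-XSH-RR 64/32 generator, following the seeding convention of the PCG reference implementation; how the device actually derives plaintexts and masks from the raw 32-bit outputs is implementation-specific and not shown here.

```python
MASK64 = (1 << 64) - 1
MULT = 6364136223846793005  # LCG multiplier of the PCG reference implementation

class PCG32:
    """PCG-XSH-RR 64/32: 64-bit LCG state, 32-bit permuted output."""
    def __init__(self, seed, seq):
        self.inc = ((seq << 1) | 1) & MASK64  # stream selector, must be odd
        self.state = 0
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self):
        old = self.state
        self.state = (old * MULT + self.inc) & MASK64
        # xorshift-high then random rotation of the top bits of the old state
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF

# Check values from the PCG reference demo (seed 42, stream 54):
prng = PCG32(42, 54)
first = prng.next_u32()  # 0xa15c02b7
```

Storing only the 8-byte seed per batch of 250 acquisitions is thus enough to rebuild every plaintext and mask during labeling.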
For each encryption, the scope triggers twice and acquires 50,000 samples. The final dataset contains 100M traces and took 14 days to acquire. In summary, we used the same AES implementation and micro-controller as in the ASCADv2 setup. We only made some changes in the instrumentation and measurement chain to reduce the number of traces needed and improve the speed of acquisition.

2) SIMULATING AN UNSHUFFLED VERSION
The implementation uses a random permutation of the 16 Sbox applications. However, using the same idea as developed in the technical analysis of the ANSSI repository [31], one can simulate (through the knowledge of the key and of the permutation Sh being used) an attack on an unshuffled version even if the acquired traces are shuffled. Instead of targeting the first byte Z = Sbox[k*[0] ⊕ P[0]], one may target the byte of index Sh⁻¹(0), where Sh⁻¹(0) denotes the index of the byte that is computed first through the permutation Sh. Such an attack then uses the corresponding Sbox output as hypothesized intermediate variable, the attack being successful if the best hypothesis is 0.

3) RESULTS
In a similar way to the first experiment of subsection V-A, we extracted from the dataset described in subsubsection V-B1 the three Points-of-Interest (PoI) corresponding to the highest signal-to-noise ratio, one for each share. We performed the attacks on both the shuffled and unshuffled versions. The attacks on the shuffled version only use the leakage of the first Sbox computation. Shuffling adds a lot of noise since, even for the correct key hypothesis, the predicted value of Z is only correct once out of 16 on average.
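The PoI extraction step computes, for each share, a signal-to-noise ratio over the time samples and keeps its argmax; a sketch of this standard computation (array names are illustrative):

```python
import numpy as np

def snr(traces, labels, n_classes=256):
    """SNR per time sample: variance of the per-class means (signal) divided
    by the mean of the per-class variances (noise).
    traces: (N, T) array; labels: (N,) values of the share in [0, n_classes)."""
    t = traces.shape[1]
    means = np.empty((n_classes, t))
    variances = np.empty((n_classes, t))
    for v in range(n_classes):
        sel = traces[labels == v]
        means[v] = sel.mean(axis=0)
        variances[v] = sel.var(axis=0)
    return means.var(axis=0) / variances.mean(axis=0)

# one PoI per share, e.g.: poi_j = int(np.argmax(snr(traces, share_j_values)))
```

As noted above, this selection uses the share labels and is therefore a profiling step; the attacks themselves remain unsupervised once the PoI are fixed.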
Results are presented in Figure 5. Each point represents the mean ranking of the correct key over 100 attacks performed with the corresponding number of traces. For each attack, traces are randomly drawn among the 100M dataset. Both JMR++Full and JMR++ converge towards a guessing entropy of 0, which provides, by the same token, the first unsupervised attack on the ANSSI's secured AES implementation.
a: USING THE SCHEME BIAS
Not surprisingly, JMR++Full and HO-CPA-0, which exploit the bias in the masking scheme, give the best results. These attacks require 30k and 15M traces to converge towards 0 for the unshuffled and shuffled versions respectively. This confirms that, for high noise values, a lower-order leakage induces attacks with at least one order of magnitude smaller data complexity. Thus, it confirms that even though d shares are used to mask the sensitive value, a biased d-th-order masking should not be considered of order d as far as security is concerned.

b: NOT USING THE SCHEME BIAS
When this lower-order leakage is not considered in the attack, JMR++ outperforms the other state-of-the-art attacks and is the only one able to converge towards a guessing entropy of 0 with the considered number of traces. As in the simulations, the amount of time required to run this attack is approximately 15 seconds.
For biased masking schemes, there is no interest in performing this attack over CPA-0. However, we argue that this result is interesting since it shows how JMR would perform on a generic (high algebraic degree) unbiased second-order masking scheme.

C. DISCUSSION
Results obtained on the real traces collected on AES implementations (proposed by the ANSSI) protected with Boolean and affine masking are in line with the simulation results. This confirms that JMR provides a sound methodology, able to work with flexible leakage model assumptions (linear, quadratic, ...), which is applicable to any masking scheme, even newly invented ones. Such a strategy widens the state of the art.

VI. CONCLUSION
This paper introduced a new unsupervised strategy, JMR, which embeds the masking structure within it, allowing it to defeat arbitrary masking schemes. It is based on a non-linear system regression which derives the leakage model of each share by carefully exploiting higher-order joint moment conditions. JMR outperforms state-of-the-art attacks, which are limited to Boolean and arithmetic masking, especially when the Hamming weight leakage assumption does not hold. We cast the core of JMR into a more general framework, the Generalized Method of Moments, and derived optimizations of JMR from it. Experiments performed on synthetic data confirmed the effectiveness of the proposed attack, especially against multiplicative and affine masking schemes. Eventually, this new strategy has been confirmed on real traces, allowing a fully unsupervised attack on the ANSSI's protected AES implementation, which embeds affine masking and shuffling countermeasures.
The JMR method is not highly multi-dimensional in the sense that it only exploits d time samples when applied to a d-th-order masking. It is well known that sensitive variables can leak several times in a single trace. Strategies extending the JMR approach to exploit more informative time samples simultaneously (i.e., more of the available information) would be of great interest for further research.

APPENDICES PROOFS
Proposition 1 (Boolean Masking): Let * = ⊕. Let l_1: Z → R and l_2: Z → R be two leakage functions of degree 1. Let ϕ_Bool(z) = cov( l_1(z ⊕ Z_2), l_2(Z_2) ). Then, deg(ϕ_Bool) ≤ 1.

Proof 1: Since both l_1 and l_2 are of degree 1, there exist two unique sets of coefficients (α_i^(1))_{0≤i≤n} and (α_i^(2))_{0≤i≤n} in R such that:

l_j(z) = α_0^(j) + Σ_{i=1}^{n} α_i^(j) · z[i],

where z[i] stands for the i-th bit of z. Since the covariance involves a centered product, one can suppose without loss of generality that α_0^(j) = 0. Using the identity:

(z ⊕ z_2)[i] = z[i] + z_2[i] − 2 · (z[i] ∧ z_2[i]),

where ∧ stands for the Boolean AND, one gets:

ϕ_Bool(z) = Σ_{i=1}^{n} α_i^(1) · α_i^(2) · (1 − 2 · z[i]) · Var(Z_2[i]),

which is of degree at most 1 since the z[i] terms are not mixed.

Proof 2:
We give a proof by induction. Let us define the property P_n:

P_n: for any l_1 and l_2 of degree 1, deg(ϕ_n) ≤ 2, where, for z ∈ Z = F_2^n:

ϕ_n(z) = cov( l_1((z + Z_2) mod 2^n), l_2(Z_2) ).

Initialisation. The case n = 1 is trivial since deg(ϕ_Arith) is at most 1 in this case.
Induction. Suppose that P_n holds; let us prove that P_{n+1} also holds. Since both l_1 and l_2 are of degree 1, there exist two unique sets of coefficients (α_i^(1))_{0≤i≤n+1} and (α_i^(2))_{0≤i≤n+1} in R such that:

l_j(z) = α_0^(j) + Σ_{i=1}^{n+1} α_i^(j) · z[i],

where z[i] stands for the i-th bit of z. Since the covariance involves a centered product, one can suppose without loss of generality that α_0^(j) = 0 (we removed α_0^(j) for readability reasons but it does not change anything to the proof). For i ∈ ⟦1, n+1⟧, the following identity holds: ((z + z_2) mod 2^{n+1})[i] = (z + z_2)[i]. Indeed, the modulo corresponds to either doing nothing or subtracting 2^{n+1} when z + z_2 ≥ 2^{n+1}, which does not affect the first n + 1 bits. Injecting this into the expression of ϕ_{n+1}, one has:

ϕ_{n+1}(z) = (1/2^{n+1}) · Σ_{z_2=0}^{2^{n+1}−1} ( Σ_{i=1}^{n} α_i^(1) · (z+z_2)[i] + α_{n+1}^(1) · (z+z_2)[n+1] − µ_1 ) · ( Σ_{i=1}^{n} α_i^(2) · z_2[i] + α_{n+1}^(2) · z_2[n+1] − µ_2 ).   (36)

Again, one can insert a modulo 2^n in the (z + z_2)[i] terms since it does not change anything for i ∈ ⟦1, n⟧. Splitting the sum according to the value of z_2[n+1], the second line of Equation 36 can be re-indexed, summing from 0 to 2^n − 1. Then, by P_n, the first two lines of Equation 36 are of degree at most 2. So let us focus on the last term, denoted A, and prove that it is also of degree at most 2:

A = (1/2^{n+1}) · α_{n+1}^(2) · Σ_{z_2=2^n}^{2^{n+1}−1} ( Σ_{i=1}^{n+1} α_i^(1) · (z+z_2)[i] − µ_1 ),

since z_2[n+1] = 0 implies that all the terms in the sum are equal to 0. One can notice that the latter sum has two expressions depending on the (n+1)-th bit of z. Therefore:

A = α_{n+1}^(1) · α_{n+1}^(2) · (z − 2^n) · (2 · z[n+1] − 1) = α_{n+1}^(1) · α_{n+1}^(2) · ( Σ_{k=1}^{n+1} 2^{k−1} · z[k] − 2^n ) · (2 · z[n+1] − 1),   (39)

which is of degree at most 2 since developing the latter sum involves products of at most 2 bits of z. Injecting this into Equation 36 shows that deg(ϕ_{n+1}) ≤ 2 and therefore that P_{n+1} holds. This concludes the induction and therefore the proof of Proposition 2.
For the interested reader, we give as a bonus the coefficients of ϕ_Arith in terms of the α_i^(j).

THOMAS HISCOCK received the Ph.D. degree in computer science in 2017. He is a Researcher with the French Commissariat à l'énergie atomique et aux énergies alternatives (CEA). He performs security testing of pre-production embedded devices with CEA's industrial partners. His main research interests include the design of secure processors (against hardware attacks), side-channel analysis, fault injection, and micro-architectural attacks.
PHILIPPE MAURINE received the M.S. and Ph.D. degrees in electronics from the University of Montpellier, Montpellier, France, in 1998 and 2001, respectively. Since 2003, he has been an Associate Professor with the Laboratory of Informatics, Robotics, and Microelectronics, University of Montpellier, developing microelectronics in the engineering program of the University. His current research interests include secure IC design, side-channel analysis, and fault injection techniques.

VOLUME 10, 2022