Optimized Sequence Dataset Generation for a Two-Stage Pipelined Microcontroller Instruction-Level Electromagnetic Information Leakage Analysis

With the prevalence of the Internet of Things (IoT) and microcontrollers (MCUs), the security of IoT devices and MCUs has become increasingly important. Side-channel analysis (SCA) is a major threat to such devices. One non-invasive form of SCA is electromagnetic information leakage (EM-leak) analysis. The author has developed a machine-instruction-level EM-leak analysis based on a neural network (NN) model. The NN model needs a large dataset for training and validation, and the dataset should be both complete and sufficient. Sufficiency can be achieved by acquiring more data from the already proposed EM-leak measurement platform. Completeness, however, is a challenge because of the pipelined architecture of the target MCU. This paper addresses the completeness of an NN model dataset for EM-leak analysis. Experiments show that the proposed algorithm can find an optimal solution. The contributions of this paper are as follows: it reduces a complex and practical SCA problem to a two-stage pipelined MCU individual EM-leak analysis sequence (2-EMAseq) description, proves that the 2-EMAseq problem has at least one optimal solution, uses this proof to develop an algorithm, and uses this algorithm to find a complete dataset for the target two-stage pipelined MCU for NN model training/validation. Currently, the algorithm can only generate the optimal EM-dataset for two-stage pipelined MCUs. However, many MCUs have 4 to 8 pipeline stages. Whether an optimal solution exists when there are more than two stages is still an open question, and finding proofs or deriving heuristics for such MCUs remains future work.

The associate editor coordinating the review of this manuscript and approving it for publication was Su Yan.

form predefined behaviors according to internal programs to acquire environment input or to provide output to networks or other devices. An MCU is a single-chip programmable sequential digital electronic component with integrated peripherals. The MCU is pervasive [1], [3], [4] in the contemporary information age because of its small size and many input/output peripherals.

An MCU program is a carefully designed sequence of machine instructions. Different instructions (load, store, addition, multiplication, comparison, jumps, etc.) control the on/off states of the sub-functional blocks (network, IO interface, bus arbitration, etc.) within the MCU and affect the timing, power consumption, and electromagnetic (EM) radiation of these blocks. The observable EM behaviors are called the

it is a database of handwritten digits that is used for image processing systems. The database contains only 60,000 training images and 10,000 testing images. Clearly, it cannot cover the handwriting of every person, such as undocumented people or unborn children; MNIST is just a small group of handwriting samples. Generally, the completeness of these datasets is not important because the problems targeted by their NN models are open questions and cannot be fully elaborated.

However, this is not the case for the proposed EM-dataset. The dataset is finite and should be complete because the DUT's SCA should be identified by the NN model down to the rarest corner cases [11]. This paper therefore proposes a new problem to be solved, because other types of dataset do not meet this completeness requirement.

Thus, generating a "sufficient and complete" EM-dataset is very important for such studies. Here, sufficiency means that the dataset is large enough for the training and validation of the NN models, and completeness means that the dataset covers all instruction sequence cases. Sufficiency can be achieved by repeating the measurement to increase the dataset size. However, completeness cannot be guaranteed by repetition. The reason is described below.

Although [5] can identify the EM-leak to some extent, the dataset for the NN training/validation is far from complete, and only part of the instructions can be identified in [5]. The reason is detailed below.

If an MCU has $n$ different instructions, these instructions can be placed into a testing program in any sequence. The testing program is then programmed (burnt) into a target MCU (the device under test, DUT). Here, "programmed (burnt)" means downloading the compiled testing program into the DUT's memory (flash or RAM). After programming, the DUT is powered up and the collective EM-leak signals (cEM-leak) can be measured. Digital signal post-processing techniques [5] can then be applied to the cEM-leak signals to acquire the iEM-leak attributed to each instruction. These iEM-leaks, together with their corresponding instruction labels, can be collected into the EM-dataset. Since the iEM-leaks of all $n$ instructions can be measured and stored in the EM-dataset, the completeness is guaranteed.

A possible procedure for EM-dataset collection and EM-leak analysis of an unknown program may have the following steps.

2) The testing program is compiled into instructions (machine codes), and these instructions are programmed into the target DUT.

13.9 hours (100 * 1000 * 0.5 seconds) to generate a suitable EM-dataset.

However, a complete EM-dataset cannot be generated by such a naïve procedure. The reason is that modern MCUs contain pipeline designs [12]. A pipelined design increases the MCU's throughput and clock frequency and thus the overall MCU performance. However, it makes the EM-dataset generation in [5] very difficult.

To simplify the analysis, assume that an MCU contains only 3 instructions, A, B, and C, and that the MCU has a 2-stage pipeline (stage 1 and stage 2). This means that the MCU holds 2 consecutive instructions that together control an iEM-leak signal in one execution cycle. Thus, there are a total of $3^2 = 9$ different pairs of iEM-leak patterns: $A_1A_2$, $A_1B_2$, $A_1C_2$, $B_1A_2$, ..., $C_1B_2$, and $C_1C_2$.

It is easy to see that the mapped $n$-node digraph $G$ is a complete graph with $n^2$ arcs. In this paper, each arc has a number label from 1 to $n^2$. The arc label represents its corresponding line position (instruction sequence) in the program context.
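As a minimal sketch of the counting argument above, the following Python snippet enumerates the ordered (stage 1, stage 2) pairs for a small instruction set and attaches a label from 1 to $n^2$ to each pair. The function name `labeled_pairs` and the lexicographic labeling order are assumptions made for illustration; in the paper the labels follow the program context order.

```python
from itertools import product

def labeled_pairs(instructions):
    """Map labels 1..n^2 to ordered (stage-1, stage-2) instruction pairs.

    Assumption: labels are assigned in lexicographic order, for illustration only.
    """
    pairs = product(instructions, repeat=2)
    return {label: pair for label, pair in enumerate(pairs, start=1)}

arcs = labeled_pairs(["A", "B", "C"])
print(len(arcs))          # number of distinct pipeline pairs
print(arcs[1], arcs[9])   # first and last labeled pair
```

For the three-instruction example this yields nine labeled pairs, one per arc of the mapped digraph.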

For example (FIGURE 5), assume that an MCU has only 3 instructions, load, store, and add, labeled as {0 (load), 1 (store), 2 (add)}, and that the program context instruction sequence is randomly assigned as in FIGURE 5(a). The mapped digraph is defined as $G = (N, A)$, where $N$ is the node-set and $A$ is the arc-set. The corresponding arcs are shown in FIGURE 5(b). As described above, the total iEM-leak count is 9. Thus, the arc labels are defined as the program context's line numbers (labeled as {1, 2, 3

The graphs in this paper follow the adjacency matrix [14] representation. For example, if a complete digraph is as shown in FIGURE 5(c), its adjacency matrix can be represented as in FIGURE 6. The integers inside the matrix are the arc labels. Ordering the arcs by their labels gives FIGURE 7, which shows the node travel sequence according to the arc labels. Here, we define:

$\mathrm{from}(a_{XY}) \equiv X \equiv$ the arc label $a$'s head node
$\mathrm{to}(a_{XY}) \equiv Y \equiv$ the arc label $a$'s tail node

We define two arcs with adjacent labels as an "arc-pair". For an arc-pair $\{arc_i, arc_{i+1}\}$, if $\mathrm{to}(arc_i)$ and $\mathrm{from}(arc_{i+1})$ are the same node, the arc-pair need not change node when traveling from $arc_i$ to $arc_{i+1}$. This also means that the instruction count can be reduced by 1 without losing any completeness. In this paper, an "efficient" arc-pair is defined as one with $\mathrm{to}(arc_i) = \mathrm{from}(arc_{i+1})$. For example, in FIGURE 7, the arc-pair {5, 6} is not efficient because $\mathrm{to}(arc_5) = b \neq c = \mathrm{from}(arc_6)$. The arc-pair {8, 9} is efficient because $\mathrm{to}(arc_8) = a = \mathrm{from}(arc_9)$.

FIGURE 6. Digraph of FIGURE 5(c) and its adjacency matrix representation (matrix numbers are arc labels).

[15], the node-redundancy is zero.

Proof: By the redundant node definition, if an Eulerian trail (a walk through the graph which uses every edge exactly once) exists, it is trivial that there will be no redundant node and the node-redundancy is zero.

Lemma 2: For a digraph $G$ mapped by the 2-EMAseq problem, if the node-redundancy of $G$ is zero, it is an optimal solution to the 2-EMAseq problem.

Proof: If the node-redundancy is zero, the given arc sequence has no redundant node. If there is no redundant node, the efficiency is optimized by definition. Since there is always an edge from node $i$ to node $j$ in the 2-EMAseq mapped digraph, it is clear that $G$ is connected. If $G$ is efficient and connected, the corresponding 2-EMAseq problem is optimized.
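The efficient arc-pair test and the zero-redundancy criterion can be sketched in a few lines of Python. This is only an illustration under two assumptions that are not taken from the paper: node-redundancy is measured as the number of arc-pairs that are not efficient, and each inefficient arc-pair costs one extra instruction in the program. The names `is_efficient`, `inefficient_pairs`, and `program_length` are hypothetical.

```python
def is_efficient(arc, next_arc):
    """An arc-pair is efficient when to(arc) equals from(next_arc)."""
    return arc[1] == next_arc[0]

def inefficient_pairs(arcs):
    """Count adjacent arc-pairs that are not efficient.

    Assumption: this count is used here as a stand-in for node-redundancy.
    """
    return sum(1 for cur, nxt in zip(arcs, arcs[1:]) if not is_efficient(cur, nxt))

def program_length(arcs):
    """Instructions needed to realize the arcs in the given order.

    An efficient pair shares one instruction; an inefficient pair does not.
    """
    if not arcs:
        return 0
    return len(arcs) + 1 + inefficient_pairs(arcs)

# Arcs are (from, to) tuples listed in label order (hypothetical example data).
example = [("a", "b"), ("c", "a"), ("a", "a")]
print(inefficient_pairs(example))
print(program_length(example))
```

In this toy sequence the first arc-pair is not efficient and the second one is, so the program needs one instruction more than a fully efficient ordering of the same three arcs.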

Lemma 3: There is at least one optimal solution to the 2-EMAseq problem.

2) At most one vertex has in-degree − out-degree = 1.

In the 2-EMAseq problem, the mapped digraph is a complete digraph. Thus, every node has (in-degree) = (out-degree) = $n$ (condition 3). No node has unbalanced in-degree and out-degree (conditions 1 & 2). All the nodes belong to the same connected component (condition 4). It is clear that the digraph mapped from the 2-EMAseq problem meets all the conditions. Thus, an Eulerian trail exists in the mapped digraph.

Since an Eulerian trail exists, by Lemma 1, the node-redundancy is zero. Since there is at least one arc sequence whose node-redundancy is zero, by Lemma 2, it is an optimal solution. This means that at least one optimal solution to the 2-EMAseq problem exists.
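The degree argument used in this proof can be checked numerically. The sketch below assumes that the mapped digraph contains every ordered pair of nodes, including the self-loops $(i, i)$, which is what gives $n^2$ arcs; the function names `degrees` and `is_balanced` are illustrative. Connectivity is not tested separately because every ordered pair of nodes is joined by an arc.

```python
from itertools import product

def degrees(n):
    """Return (in_degree, out_degree) lists for the complete digraph on n nodes.

    Assumption: self-loops (i, i) are included, so there are n * n arcs.
    """
    in_deg = [0] * n
    out_deg = [0] * n
    for u, v in product(range(n), repeat=2):
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg

def is_balanced(n):
    """True if every node has in-degree == out-degree == n."""
    in_deg, out_deg = degrees(n)
    return all(i == o == n for i, o in zip(in_deg, out_deg))

print(degrees(3))
print(is_balanced(10))
```

A balanced, connected digraph has an Eulerian circuit, which is a special case of the Eulerian trail required by the lemma.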

Because an optimal solution exists, the author derives an algorithm to find it.

The algorithm's pseudo code is shown in FIGURE 9.

Theorem: Given a complete digraph $G(N, A)$, where $N = \{0, 1, \ldots, n\}$ is the node-set, $|N| = n + 1 \geq 3$, and $A = \{a_1, \cdots, a_{(n+1)^2}\}$ is the arc-set, the proposed algorithm can find an optimal solution to the 2-EMAseq problem of $G$.

The algorithm is implemented in Python. A 10-instruction iEM-leak example is shown here for validation. The result is shown in FIGURE 10. The algorithm finds an optimal iEM-leak program context sequence (node-redundancy = 0), as expected.
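As an independent illustration of how such a sequence can be constructed, the sketch below uses Hierholzer's algorithm, a standard method for building an Eulerian circuit. It is not claimed to be the algorithm of FIGURE 9. It assumes a complete digraph with self-loops on nodes 0 to `num_instructions - 1`; the function names are hypothetical.

```python
from itertools import product

def eulerian_instruction_sequence(num_instructions):
    """Return an instruction sequence whose consecutive pairs cover every
    ordered pair exactly once (Hierholzer's algorithm, iterative form).

    Assumption: the digraph is complete and includes self-loops.
    """
    # Unused outgoing arcs of each node, stored as lists of target nodes.
    unused = {u: list(range(num_instructions)) for u in range(num_instructions)}
    stack = [0]     # current walk, starting at instruction 0
    circuit = []    # finished nodes, collected in reverse order
    while stack:
        node = stack[-1]
        if unused[node]:
            stack.append(unused[node].pop())  # follow an unused arc
        else:
            circuit.append(stack.pop())       # no arcs left: node is finished
    circuit.reverse()
    return circuit

def covers_all_pairs_once(sequence, num_instructions):
    """Check that the consecutive pairs are exactly the n^2 ordered pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    expected = set(product(range(num_instructions), repeat=2))
    return len(pairs) == len(expected) and set(pairs) == expected

seq = eulerian_instruction_sequence(10)
print(len(seq))
print(covers_all_pairs_once(seq, 10))
```

A sequence of $L$ instructions contains $L - 1$ consecutive pairs, so covering all 100 pairs of a 10-instruction set needs at least 101 instructions; the sequence built here has exactly that length.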

Regarding the execution time of the algorithm, for the target DUT's instructions, the algorithm takes 0.49 seconds to derive the Eulerian trail and 0.2 seconds to prepare the program context. The iEM-leak signal acquisition will take longer, and the training time would be even longer. However, the completeness of the iEM-leak signal generation is guaranteed. The detailed instructions [10]

is an optimal solution when the stages are more than two is still an open question. Proofs or heuristics are still needed. The author will continue to study these problems.