Design and Implementation of a Configurable Encryption System for Power-Constrained Devices

In this work, we present a configurable encryption system based on the Encryption by Synchronization in a Cellular Automata (ESCA) system, a symmetric-key algorithm based on the synchronization phenomenon of cellular automata with rule-90. With the aim of producing a flexible system that trades off power consumption against security level, we implemented a pseudo-random number generator (PRNG) that can be configured with three different key sizes. This variable-length PRNG, together with the capability of bypassing specific modules in the rest of the system, allows the system to serve a wide range of applications. In particular, it would enable online adjustments in IoT and power-constrained devices to tune them between low power consumption and maximum security. The system can be implemented with 5956 gates, and it is designed to provide, in a 0.5 $\mu \text{m}$ CMOS process, a throughput of 50 Mbps @ 37 mW at the maximum-security level and a power consumption of less than 7 mW @ 30 Mbps at the lowest-security level, while still providing a satisfactory perceptual security metric.


I. INTRODUCTION
Nowadays, embedded systems are used in a wide variety of applications such as wearable devices, medical implants, Internet of Things (IoT) devices, Radio-frequency identification (RFID) tags, and Wireless Sensor Networks (WSN). An important property of these systems is their capacity to securely store, access, and transmit information, and the process responsible for safeguarding this information comes at a significant energy cost [1], [2]. To provide this protection, different types of encryption systems have been developed to safeguard the confidentiality and integrity of information.
A problem to overcome is that embedded systems are typically battery-operated, and the time and energy costs of using an encryption system are high [3], [4], [5]. Such energy-constrained embedded systems require a low-cost specialized encryption process with low energy consumption and a small footprint area [6], [7]. Instead of implementing these encryption systems in software or general-purpose hardware (e.g., single-board computers, microcontrollers, or FPGAs), an Application-Specific Integrated Circuit (ASIC) approach is preferred.
The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wang.
An encryption process is a tool to protect information by transforming the data into illegible text in such a way that only the recipient is able to return the information to its original form. Some of the most common encryption systems are AES [8], DES [9], and 3DES [10]. The main problem with the DES system is that, with current computing power, it can be broken using a brute-force attack. To solve this problem, the 3DES system was introduced, which applies DES three times with three different keys, with the disadvantage that it increases the execution time and the consumption of resources such as energy and silicon area. The AES system supports 128-, 192-, and 256-bit keys. The process consists of four stages and 10, 12, or 14 rounds, depending on the key size, which can be compute-intensive and prohibitive for some energy-aware applications. Currently, there are other types of encryption algorithms that have been shown to be comparable to, and sometimes better than, the AES system, depending largely on the type of information being encrypted and the means by which it is transmitted.
Cellular Automata (CA) systems provide simple algorithms owing to their high parallelism, homogeneity, and easy implementation in software and hardware systems [11], [12], making them good candidates for designing cryptosystems. CAs are discrete dynamical systems that evolve in discrete time steps and have been used to simulate biological systems [13], [14], road traffic [15], epidemics [16], and surface water flow [17], and have been applied in cryptography [18], [19], among others. In recent years, many encryption systems based on CA have been developed, such as those presented in [12], [20], [21], [22], [23], and [24], which are based on the evolution of cellular automata states. In [25], the synchronization phenomenon of CA was applied to develop an encryption system, now known as ESCA, composed of an indexed family of permutations and a pseudo-random number generator (PRNG) based on the forward and backward evolution of the CA rule-90. The indexed family permutations are the block cipher functions of the proposed scheme and are applied to plain-text and cipher-text by employing a key produced by the PRNG.
In this paper, we propose a configurable ASIC implementation of the ESCA system. The fundamental transformations are built around additive CA rules [26], [27], which makes them suitable for hardware implementations. The implemented system allows a trade-off between power consumption and security level, depending on the needs, by disabling selected modules and reusing hardware. We applied statistical and sensitivity tests to the proposed system to corroborate its high level of security, and validated its functionality in software and on an FPGA.
The rest of the paper is organized as follows: Section II presents an overview of the ESCA system and in Section III we review different tests applied to the encryption system to corroborate its security. Section IV describes the hardware implementation in ASIC technology, and Section V details its performance and power consumption. Section VI presents the implementation results and a functional validation of the system with FPGA and software implementations. Section VII presents conclusions and a brief discussion of the study.

II. ESCA SYSTEM OVERVIEW
The ESCA system is a symmetric block cipher of key- and data-variable lengths, based on the synchronization phenomenon of CA that evolves according to the local rule-90 [25], [28]. The system is composed of an asymptotically perfect pseudo-random number generator [26] employed to generate keys for two indexed families of permutations, $\Psi$ and $\Phi$, called encryption and decryption functions, respectively. The system can be implemented using a Basic Unit Cipher (BUC), as shown in Fig. 1a, which consists of a space-pattern formed by the words x, y, t, m, and c. These words can be initialized, and the automaton can then be evolved, according to rule-90, in the forward or backward direction to find the desired word, depending on the task required. The word c is obtained by iterating backward in time using x and m as initial conditions, and corresponds to the indexed family permutation $\psi_t$. The word m is obtained by iterating forward in time employing x and c as inputs, and corresponds to the indexed family $\phi_t$. On the other hand, the word t is generated by iterating backward in time using x and y, and corresponds to the function t = h(x, y) [25], [27]. A simpler way to implement a BUC is the one-time algorithm, in which these words are calculated using the XOR sum of only the required elements.
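As a point of reference for the rule-90 evolution mentioned above, the following minimal Python sketch shows one forward step of the rule on a finite word. It assumes zero (null) boundary cells, and the helper names `rule90_step` and `xor_words` are hypothetical, introduced only for illustration; it does not reproduce the BUC lattice of Fig. 1a. The final check illustrates the additivity of the rule, which is the property that lets the transformations be expressed as XOR sums.

```python
def rule90_step(cells):
    """One forward step of CA rule-90, assuming zero boundary cells."""
    padded = [0] + list(cells) + [0]
    # Each new cell is the XOR of its left and right neighbours.
    return [padded[i] ^ padded[i + 2] for i in range(len(cells))]


def xor_words(a, b):
    """Bitwise XOR (addition modulo 2) of two equal-length words."""
    return [p ^ q for p, q in zip(a, b)]


a = [0, 0, 0, 1, 0, 0, 0, 0]
b = [1, 0, 1, 1, 0, 0, 1, 0]

# Additivity: evolving a XOR b equals the XOR of the evolved words.
lhs = rule90_step(xor_words(a, b))
rhs = xor_words(rule90_step(a), rule90_step(b))
print(lhs == rhs)
```

Because every output bit is a fixed XOR of input bits, a fixed number of steps collapses into a small network of XOR gates, which is consistent with the one-time algorithm described above.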
In addition to the indexed family permutations, a pre-processing stage based on the structure of the PRNG was added to obtain a secure and flexible cryptosystem [29]. Furthermore, a substitution box (S-box) based on the cellular automata rule-90 [30] was added to confer greater security to the system. In [27], a matrix approach to the ESCA system was proposed to simplify the implementation when larger block sizes are required; its notation is the one used in this work. The encryption procedure starts with the pre-processing stage, where the plain-text is transformed to an unintelligible form; afterwards, the S-box applies a non-linear transformation; finally, the data are encrypted with the permutation $\Psi$, using the key generated for that permutation. The decryption process is carried out in reverse order.
The proposed implementation operates on an 8-bit block of data, and the supported key sizes are listed in Table 1. The cryptosystem can be configured in 15 different modes by bypassing one, two, or three modules and by selecting one of three key sizes. The notation used to denote the modes, also listed in Table 1, consists of five bits divided into two groups. The first group of three bits represents the enabled modules, where the first bit enables the permutation, the second bit the S-box, and the third bit the pre-processing module. The remaining group of two bits sets the key size of the PRNG: 00 for 15 bits, 01 for 31 bits, and 10 for 63 bits. For instance, mode 100-01 enables only the $\Psi$ module and sets the key size to 31 bits, and mode 111-10 enables all modules and sets the key size to 63 bits. The sub-blocks of the encryption system based on the matrix approach are described below.
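The mode notation can be illustrated with a short Python sketch that decodes a mode string into its enabled modules and key size. The function name `decode_mode` and the dictionary layout are hypothetical. Treating the key-size bits 11 as "no PRNG key" is an assumption inferred from how modes without the permutation are written elsewhere in the text (for example, 010-11); it is not stated in the mode definition itself.

```python
# Key-size field as described in the text: 00 -> 15, 01 -> 31, 10 -> 63 bits.
KEY_SIZES = {"00": 15, "01": 31, "10": 63}


def decode_mode(mode):
    """Decode a mode string such as '111-10' into module flags and key size."""
    modules, key_bits = mode.split("-")
    if len(modules) != 3 or len(key_bits) != 2 or set(modules + key_bits) - {"0", "1"}:
        raise ValueError(f"malformed mode string: {mode!r}")
    return {
        "permutation": modules[0] == "1",
        "sbox": modules[1] == "1",
        "preprocessing": modules[2] == "1",
        # Assumption: '11' is not in the table, so it maps to None (no PRNG key).
        "key_size": KEY_SIZES.get(key_bits),
    }


print(decode_mode("100-01"))
print(decode_mode("111-10"))
```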

A. PSEUDO-RANDOM NUMBER GENERATOR
We define a PRNG of size $N = 2^n - 1$, for n = 1, 2, 3, . . ., as a module that takes an input of size 2N + 1 and produces a sequence of outputs of size N, with the property that larger modules, that is, those with larger N values, produce higher-quality sequences [11].
VOLUME 11, 2023
The keys employed in the $\Psi$ and $\Phi$ permutations are generated by the PRNG and can be calculated using $U_i = H_N U_{i-1}$ (1), where $H_N = (H_{top}; H_{bot})$ is a square matrix of size (2N + 1) × (2N + 1), with $N = 2^n - 1$ for n = 1, 2, 3, . . ., and $U_i = (x; y)^T$ is a vector of 2N + 1 elements. The matrix $H_{top}$, of size N × (2N + 1), is produced by adding modulo 2 the two previous rows, with the previous row shifted to the right by one position, starting from two initial vectors. The matrix $H_{bot}$, of size (N + 1) × (2N + 1), is formed by an identity matrix in the first (N + 1) columns, and the remaining N columns are filled with zeros. In the first iteration, where i = 1, the vector $U_0$ is composed of the initial conditions (seeds) x and y, of N and N + 1 elements, respectively. The vector $U_1$ resulting from (1) contains the first generated key in its first N elements (also denoted $t_1$ at time i = 1), and the remaining N + 1 elements correspond to the feedback of the first N elements of $U_0$. The missing element is taken from the LSB of $U_0$ to be the new MSB of the remaining N + 1 elements of $U_1$. In other words, $H_{top}$ computes the key and $H_{bot}$ feeds back the previous key for the next operation.
In this work, three different PRNG sizes were implemented, with n = 4, 5, and 6, such that N = 15, 31, and 63 for the key length. For instance, for N = 7, equation (1) can be rewritten as (2), from which it is possible to obtain Boolean functions, as in (3), that allow an easy and suitable hardware implementation instead of evaluating the BUC or calculating the matrix-vector product, which would require a more complex circuit and a greater number of operations. The binary operation + is the logic XOR, and the word t corresponds to the number (key) generated.
By taking advantage of the CA-based PRNG structure, an efficient implementation is possible in which a larger PRNG is formed by growing the BUC as shown in Fig. 1a, allowing the generation of different key sizes using the same PRNG. For a module of size $2^n - 1$, the first $2^{n-1} - 1$ Boolean functions are those of a module of size $2^{n-1} - 1$. This allows an efficient hardware implementation of smaller PRNG modules out of a larger module. We exploit this property to derive smaller but faster and more energy-efficient PRNG modules out of larger, power-hungry modules, at the expense of producing lower-quality sequences.

B. TRANSFORMATION PERMUTATIONS
The transformation from plain-text m to cipher-text c is achieved by the permutation $c = \psi_t(m)$ and is calculated using $c = P_N t + Q_N m$, where the key t of N elements and the plain-text m of J elements are multiplied by the matrices $P_N$ and $Q_N$, respectively, and then added to complete the transformation. Matrix $P_N$ has dimensions of J × N and is generated by applying a zero-fill right shift of one position to the previous row, starting from the first row, whose components in positions $j = (2^n + 1) - 2^{i+1}$ for i = 0, 1, 2, . . . , (n − 1) are equal to 1 and the rest are equal to 0 [27]. The matrix $Q_N$ has dimensions of J × J and is generated as if it were a CA grid by applying rule-90, with zeros as boundary conditions, taking a given initial vector as the first row. To recover the plain-text, the inverse process corresponds to the permutation $m = \phi_t(c)$. In a symmetric encryption system, the same key must be used for both the encryption and decryption processes. In contrast to the PRNG, which is implemented as a single block, the permutations have to be implemented as three different blocks, one for each key size, since the single-time algorithms for each key size are different.

C. PRE-PROCESSING
The pre-processing transformation converts the plain-text into an unintelligible form by employing a seed z of J + 1 elements and the matrix $H_t$ used in the PRNG [29], producing the transformed word $\hat{m}$. For the first transformation, the seed z is employed; for the remaining calculations, the last computed $\hat{m}$ becomes the first elements of the next z, and the first element of the previous z becomes its last element.
Matrix $M_N$ has the same dimensions as $H_t$ and is made of matrices $Q_N$ and $D_N$, such that $M_N = (Q_N | D_N)$. Matrix $D_N$ contains a value of 1 in the elements of the first superdiagonal and of the diagonals $k_a = 2^a - 1$, for a = 1, 2, . . . , n − 1. The $Q_N$ matrix is described in detail in Section II-B. The same key is used for the inverse process.

D. SUBSTITUTION BOX
The CA rule-90 has shown attractive properties for cryptography, such as the flexibility to adapt to massively parallel computing architectures, randomness, and multifractal features, among others [26], [27], [31], [32]. For this reason, previous works [27], [31], [32] have considered a matrix approach based on this CA rule to implement the main elements of an encryption system in an adjustable way. This matrix approach was evaluated with several tests, and the results demonstrated the security performance required for many cryptographic applications [27], [31], [32]. Following this approach and taking advantage of the appealing properties of the CA rule, in [30] a square generating matrix K was implemented to generate a strong S-box, whose cryptographic strength was evaluated by standard tests, such as the strict avalanche criterion and nonlinearity, among others, showing high performance compared with other existing S-boxes [30]. There are different ways to implement the matrix K to generate a strong S-box, but here we adopt the one in [30]. An extensive examination with different initial condition vectors to generate the matrices, and an evaluation of their performance in a complete encryption system, is left for future work.
Basically, an S-box is a non-linear substitution applied to a word of eight elements of a state matrix to transform it into a different value. The transformation is obtained using (8), where x is the word to transform and K is a matrix based on the CA rule-90 with dimensions of 8 × 8 elements, formed by the matrices $K_L$ and $K_R$, with dimensions of 8 × 4 each. Similar to the matrix $Q_N$, the matrix $K_L$ can be generated by applying the CA rule-90 with zeros as boundary conditions for the next three rows, taking the vector a = [0, 0, 0, 1, 0, 0, 0, 0] as the initial sequence; the matrix is then rotated 90 degrees in the clockwise direction. Matrix $K_R$ is generated by multiplying $K_L$ with matrix P. Matrix K is formed as $K = (K_L | K_R)$ [30]. The S-box detailed above does not satisfy the condition of having no fixed points or opposite fixed points; thus, the implemented S-box was generated from the initial vector a = [0, 1, 0, 0, 1, 1, 0, 0] to obtain only one fixed point. Equation (9) shows the matrix generated from this vector.
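The construction of $K_L$ described above can be sketched in Python as follows. This is one reading of the text, under the assumptions that the initial vector is the first row, that three further rows are produced by rule-90 with zero boundaries, and that the resulting 4 × 8 array is rotated clockwise into an 8 × 4 matrix. The helper names `rule90_step` and `build_k_left` are hypothetical, and the sketch does not cover $K_R$, since the matrix P is not given here.

```python
def rule90_step(cells):
    """One step of CA rule-90, assuming zero boundary cells."""
    padded = [0] + list(cells) + [0]
    return [padded[i] ^ padded[i + 2] for i in range(len(cells))]


def build_k_left(initial):
    """Evolve the initial vector for three more rows, then rotate clockwise."""
    rows = [list(initial)]
    for _ in range(3):
        rows.append(rule90_step(rows[-1]))
    # Clockwise rotation of a 4 x 8 array: reverse the rows, then transpose.
    return [list(column) for column in zip(*rows[::-1])]


k_left = build_k_left([0, 0, 0, 1, 0, 0, 0, 0])
for row in k_left:
    print(row)
```

Under these assumptions the result has eight rows of four bits each, matching the 8 × 4 dimensions stated for $K_L$.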
Similar to the AES S-box [33], a constant is added to remove fixed points, modifying the transformation (8). The initial condition used for the generation of the S-box was selected from the $2^8$ possible combinations based on the number of fixed points and on having a greater number of valid constants for the sum. In addition, the S-boxes that met the above conditions were synthesized to compare the silicon area they occupied and to select the smallest one.
To recover the input data, the inverse function is employed using (10), which reverses the S-box operation by adding the same constant. Then, the multiplicative inverse is calculated, and the inverse affine transformation is applied.

III. SECURITY ANALYSIS
To measure the efficiency of the encryption system, the most common tests applied in cryptography [31], [34], [35], [36] were carried out: statistical tests, including histogram, correlation, entropy, and PSNR analyses, as well as sensitivity tests, such as differential attack analysis and key sensitivity analysis. In addition to these tests, detrended fluctuation analysis (DFA) was performed to evaluate the performance of the system [37]. The tests were performed on a database of 60 images.¹

A. STATISTICAL TESTS
DFA is a technique used to analyze singular or fractal behaviors that can be present in images. This is done by extracting the Hurst indices of sequences in the plane of the image, from which the average of the scaling exponent can be estimated [37]. To resist a common attack, a good cryptosystem should map the plain-image to a random cipher-image without exhibiting any pattern, and the image pixels should take as many random values as possible [38]. This can be measured in terms of the fractal dimension of the cipher-image and therefore by its scaling fluctuation exponent [39]. In [40], DFA and multifractal-DFA were generalized to higher dimensions, and 2D-DFA was subsequently employed to distinguish the fractal and multifractal properties of synthetic surfaces. A scaling fluctuation exponent α = 0.5 means that the pixels are uncorrelated or exhibit only short-range correlations (e.g., white noise). For 0 < α < 0.5, the correlation in the signal is said to be anti-persistent, where an increment is highly likely to be followed by a decrement and vice versa. In the case of 0.5 < α < 1, the signal is said to be persistent, where an increment is highly likely to be followed by an increment, and a decrement by a decrement. The values α = 1 and α = 1.5 correspond to 1/f noise and Brownian motion, respectively [41]. For secure encryption, the scaling exponent should be close to 0.5 and 1 for one-dimensional DFA and 2D-DFA, respectively [37], [39].
The results of the 2D-DFA tests show that in all modes, except mode 010-11 (S-box only), the scaling exponents α are close to 1 (Table 2), which suggests that there is no correlation between pixels and no visually comprehensible information, with pixels that appear totally random. As pointed out in [39], this metric evaluates the level of perceptual security, so we can conclude that, except for the 010-11 mode, the encryption modes are secure from the point of view of perceptual security. Figure 2 shows a plain-image and the cipher-images in the different operation modes, which do not reveal any information that would allow us to distinguish the original image.
In addition to the DFA, histogram and entropy analyses were performed to verify the level of randomness. In the case of a cipher-image histogram, a uniform distribution means that each possible pixel value has the same probability of appearing, which is ideal for a cipher-image because it hides the redundancy of plain-images [31], [36]. The chi-square test was used to check whether a given histogram fits a uniform distribution; using a significance level of α = 0.05, the critical value is χ²(255, 0.05) = 293. Comparing this with the chi-square values obtained from the ciphered images, shown in Table 2, the null hypothesis is not rejected; therefore, the histograms fit a uniform distribution, except for the 010-00 mode. Additionally, entropy is a measure of the degree of randomness in the image and reflects the distribution of pixel values without considering spatial information. For 8-bit grayscale images, the ideal value for a cipher-image is 8. Table 2 shows entropy values close to the theoretical value, except for the 010-11 mode (S-box only), obtaining the same result for both, because the S-box maintains the frequencies of the histogram and the entropy is insensitive to spatial information.
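The two histogram-based metrics can be sketched in a few lines of Python. The sketch assumes 8-bit pixels given as a flat list of integers; the function names are hypothetical, and the critical value is simply the figure quoted in the text rather than a computed quantile.

```python
import math


def histogram(pixels, levels=256):
    """Count how many times each gray level occurs."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


def chi_square_uniform(pixels, levels=256):
    """Chi-square statistic of the histogram against a uniform distribution."""
    expected = len(pixels) / levels
    return sum((c - expected) ** 2 / expected for c in histogram(pixels, levels))


def shannon_entropy(pixels, levels=256):
    """Shannon entropy in bits per pixel; 8 is the maximum for 8-bit images."""
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in histogram(pixels, levels) if c)


CRITICAL_VALUE = 293  # chi-square threshold quoted in the text (255 d.o.f., 0.05)

flat = list(range(256)) * 4   # perfectly uniform histogram
constant = [7] * 1024         # a single gray level
print(chi_square_uniform(flat) < CRITICAL_VALUE, shannon_entropy(flat))
print(chi_square_uniform(constant) < CRITICAL_VALUE, shannon_entropy(constant))
```

Both functions depend only on the histogram, which is why a pure substitution that relabels gray levels leaves the entropy unchanged.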
The Peak Signal-to-Noise Ratio (PSNR) represents the ratio between the maximum possible value (power) of a signal and the power of the distorting noise that affects the quality of its representation. The PSNR value approaches infinity as the mean squared error (MSE) between the plain-image and the cipher-image approaches zero; a higher PSNR value indicates greater similarity between the images, and a smaller PSNR value implies large differences between them [42]. The results in Table 2 show values close to 8.7, except for the S-box configuration, which confirms the good encryption results for the remaining modes.
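For reference, the standard PSNR definition in decibels can be sketched as follows, assuming 8-bit images stored as flat lists of equal length. The function name `psnr` is hypothetical, and the exact variant used to produce Table 2 is not specified in the text.

```python
import math


def psnr(plain, cipher, peak=255):
    """PSNR in dB between two equal-length pixel sequences."""
    mse = sum((p - c) ** 2 for p, c in zip(plain, cipher)) / len(plain)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


print(psnr([10, 20, 30], [10, 20, 30]))
print(psnr([0, 0, 0, 0], [255, 255, 255, 255]))
```

In the encryption context a low PSNR is the desirable outcome, since it indicates that the cipher-image differs strongly from the plain-image.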
The Pearson correlation coefficient is a measure of linear correlation between two sets of data. In images, the correlation coefficient of adjacent pixels is very high due to the amount of redundant information; therefore, a good encryption system should present a weak correlation, close to 0 [43]. As shown in Table 3, the correlation coefficients were calculated for the adjacent pixels in the horizontal, vertical, and diagonal directions, and between the cipher-images and plain-images. The results showed a low correlation between the cipher-images and plain-images, as well as a low correlation between adjacent pixels in the horizontal, vertical, and diagonal directions in all operation modes except for the 010-11 mode (S-box only), which shows a slight increase. While the above tests show poor performance with the S-box module alone, this does not imply that the S-box does not fulfill its purpose. The appropriate and most commonly used tests for S-boxes are the strict avalanche criterion (SAC), bit independence criterion (BIC), nonlinearity (NL), linear approximation probability (LP), and differential approximation probability (DP). These tests were performed on the S-box implemented in [30], showing that its performance is comparable to that of other S-boxes, including those used in the AES system.
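The adjacent-pixel correlation reported in Table 3 can be sketched as below. The sketch assumes an image stored as a list of rows and uses every adjacent pair; published evaluations often sample a random subset of pairs instead, and the text does not say which was done here. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def adjacent_correlation(image, dy, dx):
    """Correlation between each pixel and its neighbour at offset (dy, dx).

    (0, 1) is horizontal, (1, 0) vertical, and (1, 1) diagonal.
    """
    height, width = len(image), len(image[0])
    xs, ys = [], []
    for r in range(height - dy):
        for c in range(width - dx):
            xs.append(image[r][c])
            ys.append(image[r + dy][c + dx])
    return pearson(xs, ys)


# A smooth ramp (highly redundant) versus a checkerboard (anti-correlated).
ramp = [[r + c for c in range(8)] for r in range(8)]
checker = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]
print(adjacent_correlation(ramp, 0, 1), adjacent_correlation(checker, 0, 1))
```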

B. SENSITIVITY TESTS
Differential attack analysis consists of measuring the difference between cipher-images when a slight change (e.g., a single pixel) is made in one of the plain-images. The most common tests used to assess the resistance of a cryptosystem against differential attacks are the Number of Pixel Change Rate (NPCR) and the Unified Average Changing Intensity (UACI) [44], [45]. As shown in Table 4, the modes without the pre-processing module obtained NPCR and UACI values (marked with * ) that can be interpreted as low resistance, because their output does not depend on the previous pixels, whereas the remaining modes demonstrated resistance against differential attacks; thus, we recommend employing the pre-processing stage to guarantee a more secure encryption process.
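Both metrics follow directly from their usual definitions, sketched here for 8-bit images stored as flat lists of equal length. The function names are hypothetical.

```python
def npcr(cipher_a, cipher_b):
    """Percentage of pixel positions whose values differ."""
    changed = sum(1 for a, b in zip(cipher_a, cipher_b) if a != b)
    return 100.0 * changed / len(cipher_a)


def uaci(cipher_a, cipher_b, peak=255):
    """Mean absolute pixel difference, as a percentage of the peak value."""
    total = sum(abs(a - b) for a, b in zip(cipher_a, cipher_b))
    return 100.0 * total / (peak * len(cipher_a))


print(npcr([1, 2, 3, 4], [1, 2, 3, 5]))
print(uaci([0, 0], [255, 255]))
```

For two independent, uniformly random 8-bit images the expected values are about 99.61 % for NPCR and 33.46 % for UACI, which gives a reference point for reading Table 4.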
Key sensitivity can be measured by comparing two cipher-texts generated from the same plain-text with slightly different keys, which should result in two significantly different cipher-texts. To corroborate the expected performance, the set of images was encrypted with two different keys, differing in only one bit. Table 4 shows the correlation, NPCR, and UACI parameters (marked with †) between the cipher-image sets. The results suggest that the groups of cipher-images are significantly different, with the exception of the operation mode using only the S-box.
In addition, a good cryptosystem should not reveal information when the cipher-text is decrypted with a slightly different key. As shown in Table 4, all operation modes result in significantly different plain-text (marked with ‡), except for mode 001-11, which recovers much of the original information; therefore, using the pre-processing module alone is not recommended. To solve this problem when using the pre-processing mode, it is possible to use the inverse pre-processing module as the pre-processing module, simply by changing the mode from encrypt to decrypt and vice versa. The results of comparing decrypted images with a slightly different key were a correlation coefficient of 0.0043, an NPCR of 99.6127 %, and a UACI of 30.5535 %.

C. BRUTE-FORCE ATTACK
A brute-force attack is a method of defeating a cryptosystem by trying all the possible keys. The size of the key space should be greater than $2^{100}$ for a system to be considered secure [38]; the greater the computational power and the smaller the key space, the more feasible it is to decode the message. The implemented cryptosystem uses the x and y keys in the PRNG module to generate keys for the $\Psi$ and $\Phi$ permutations, which can be configured to three different sizes. Therefore, the sizes of x and y are 15-16, 31-32, and 63-64 bits. In addition, z and c, of 9 and 4 bits, respectively, are employed in the pre-processing and S-box modules. The total key length may vary from 4 to 140 bits, making the encryption system sufficiently secure to withstand a brute-force attack when the permutation with a 63-bit key length is used.
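The key-length arithmetic above can be checked with a small Python sketch. The function name and argument layout are hypothetical, and the sketch assumes that a module contributes its key material only when it is enabled, which is how the 4-bit and 140-bit extremes quoted in the text arise.

```python
# PRNG seeds: x has N bits and y has N + 1 bits for each supported key size N.
SEED_BITS = {15: 15 + 16, 31: 31 + 32, 63: 63 + 64}
Z_BITS = 9  # pre-processing seed z
C_BITS = 4  # S-box constant selector c


def total_key_bits(permutation, sbox, preprocessing, key_size=None):
    """Total secret bits for a mode, counting only the enabled modules."""
    bits = 0
    if permutation:
        bits += SEED_BITS[key_size]
    if preprocessing:
        bits += Z_BITS
    if sbox:
        bits += C_BITS
    return bits


print(total_key_bits(True, True, True, 63))   # all modules, 63-bit key
print(total_key_bits(False, True, False))     # S-box only
```

Under this accounting, only the modes with the 63-bit permutation key (127 seed bits) exceed the 100-bit threshold cited from [38].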

D. CRYPTANALYSIS ATTACKS
The implemented system was shown to perform well in the above statistical tests. In [29], a chosen-plain image attack analysis was performed on an older version of the implemented system (without the S-box module), which proved to be resistant to such attacks. We performed this attack analysis for all operation modes as follows: a mask image I m is generated by applying the bitwise XOR operation between the chosen plain-image (a completely black image) and its cipher-image. Then, a cipher-image I c is obtained from another plain-image I o (the peppers test image). Finally, we try to recover the plain-image I o by performing the bitwise XOR operation between the mask image I m and the cipher-image I c . If the obtained image reveals information from the plain-image, the encryption system is insecure; otherwise, the system resisted the attack. Figure 3 shows the images needed to perform the chosen-plain image attack. The encryption mode used was all modules with a 63-bit key length (mode 111-10). As can be seen in Figure 3, the result does not reveal plain-image information, suggesting that the system is resistant to this type of attack. In addition, as mentioned in [29], if the encryption system is secure against the chosen-plain image attack, it is also secure against the cipher image-only attack and the known-plain image attack.
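The attack procedure described above can be sketched as follows. The `encrypt` argument is a placeholder for any cipher under test. The `toy_xor_cipher` below is a deliberately weak, hypothetical stand-in (a fixed keystream XOR), not the ESCA system; it is included only to show what a successful attack looks like.

```python
def xor_images(a, b):
    """Bitwise XOR of two equal-length pixel sequences."""
    return [p ^ q for p, q in zip(a, b)]


def chosen_plain_attack(encrypt, target_plain):
    """Return the attacker's guess for target_plain using a black-image mask."""
    black = [0] * len(target_plain)
    mask = xor_images(black, encrypt(black))   # I_m
    cipher = encrypt(target_plain)             # I_c
    return xor_images(mask, cipher)            # attempted recovery of I_o


def toy_xor_cipher(pixels):
    """Insecure example cipher: XOR with a fixed, data-independent keystream."""
    keystream = [(37 * i + 11) % 256 for i in range(len(pixels))]
    return xor_images(pixels, keystream)


secret = [10, 200, 33, 47]
print(chosen_plain_attack(toy_xor_cipher, secret) == secret)
```

The toy cipher falls to the attack because its output is the plain-image XORed with a mask that does not depend on the data; a cipher resists when no such fixed mask exists.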

TABLE 4. Average NPCR and UACI values between cipher-images with a single pixel change in one of the plain-images (marked with * ). Average NPCR and UACI values, as well as correlation coefficients, between images ciphered with slightly different keys (marked with †) and between plain-images decrypted with slightly different keys (marked with ‡). In both cases, the keys differ in their least significant bit.

E. POWER ANALYSIS ATTACK
In a power analysis attack, the attacker monitors the power consumption of the encryption system in order to obtain information about the data being manipulated from the small movements of electrical charge on the device [46]. To assess the resistance of the system to this type of analysis, we performed simulations of the entire synthesized system by varying the seed while monitoring the switching activity of the gates. We performed toggle counting on 100 decryption processes with the correct and incorrect keys and obtained two normal-like distributions. To verify that both distributions were similar, we performed the Kolmogorov-Smirnov goodness-of-fit test [47], for which the null hypothesis is that the data from the incorrect-key decryption process were drawn from the same distribution as the data of the correct-key decryption group, and the alternative hypothesis is that they were not. The null hypothesis was not rejected at a significance level of 0.05, with a p-value of 0.4688. Once configured, the implemented system processes the data in the same way regardless of the seeds; therefore, the toggle counts with correct and incorrect seeds show insignificant variations that do not reveal information about the process being carried out.

F. NIST TEST
The NIST Statistical Test Suite (STS) is a testing suite for randomness analysis, based on statistical hypothesis testing. It comprises 15 tests that evaluate the null hypothesis over a binary sequence. From each test, we obtained a p-value that represents the probability that a perfect random number generator would have produced a sequence that was less random than the tested sequence [48]. As shown in Table 5, for N = 31 and N = 63, all tests were successful, suggesting that the PRNG produces high-quality pseudo-random numbers. For N = 15 bits, some statistical problems are reflected in the Runs test, so these sequences are considered to be of lower quality. One thousand blocks of one million bits were used to perform the tests.
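To make the notion of a per-test p-value concrete, the following Python sketch implements the simplest test of the suite, the frequency (monobit) test. It is shown only as an illustration of the hypothesis-testing pattern and is not the Runs test discussed above; the function name is hypothetical, and the 0.01 threshold is the default significance level of the suite.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for the balance of ones and zeros."""
    n = len(bits)
    # Map 1 -> +1 and 0 -> -1, then sum.
    s = sum(1 if b else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


balanced = [0, 1] * 500
biased = [1] * 1000
for name, sequence in (("balanced", balanced), ("biased", biased)):
    p = monobit_p_value(sequence)
    print(name, "pass" if p >= 0.01 else "fail")
```

A perfectly balanced sequence passes this test even though it is obviously not random, which is why the suite combines many complementary tests.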

G. DIEHARD TESTS
The DIEHARD tests² are a set of statistical tests that measure the quality of randomness in a sequence of numbers. The set consists of 18 tests, each of which yields a p-value with an acceptable range of [0,1). Table 6 shows the p-values obtained from sequences of numbers generated by the implemented PRNG of 63, 31, and 15 bits. As can be seen, all values are within the acceptable range and thus "pass" the tests, so we can conclude that the PRNG produces highly random keys.

IV. HARDWARE IMPLEMENTATION
The top-level architecture of the ESCA system core is illustrated in Fig. 4a. As mentioned above, the system can be configured in several modes (as shown in Table 1), with the configuration byte stored in an 8-bit shift register. The initial configuration process also includes an encryption key and the first byte to encrypt (or decrypt). They are streamed to the input in the following order: configuration byte, encryption key, and the first byte to encrypt. The data are streamed using the SPI protocol, and the configuration is loaded when the chip is in the reset mode (active-low cfg signal). The encryption key is stored in a 140-bit shift register, 127 bits of which are employed for x and y, 9 bits for z, and 4 bits for c. When the initial configuration stream is completed, a pulse on cs and a change of the cfg signal are needed to start transmitting the data to be processed. The data to be processed are entered in 8-bit blocks and stored in a shift register. When a new byte is received, it is processed, and the previous data are stored in a shift register to be sent. In other words, new data are processed every 8 cycles (every new valid 8-bit word); therefore, no clock cycles are wasted. The data_in and z registers were instantiated twice to improve routing area efficiency. The contents of the registers for the unused modules are masked to zero to prevent glitches.
² Available at https://web.archive.org/web/20160125103112/http://stat.fsu.edu/pub/diehard

A. PRNG MODULE
The PRNG module shown in Fig. 5a is composed of three sub-modules, which are enabled depending on the selected mode; each generates a key that can be concatenated to form a larger key. The smallest sub-module generates a 15-bit key, the second generates the next 16 bits to form a 31-bit key, and the largest generates the last 32 bits to form a 63-bit key. In each sub-module, two registers are used to store the x and y seeds. The input data of these registers are multiplexed between the initial key and the values described in Section II-A. The outputs of the registers are gated to prevent glitches when a sub-module is not in use. These gated signals feed the block composed of the Boolean equations that calculate the key, so that a key is generated each time the registers are clocked. Fig. 5b shows the structure of the block containing the Boolean equations and XOR gates for the calculation of the key.

B. PERMUTATION MODULES
The implemented modules of the two permutations are shown in Fig. 6a and Fig. 6b, respectively. Each permutation consists of three sub-modules, one for each key size. The datapath of the permutation starts with multiplexers for the plain-text and the key, which allow the signals to be masked to zero when the module is disabled. If the signals are propagated, they are fed to each sub-module, which may or may not be gated depending on the configured mode. When the module is enabled, only one sub-module can be enabled at a time; in it, the plain-text and the key propagate to the block of combinational logic that generates the ciphertext. The sub-module outputs are ORed to generate the permutation module output. The two permutation modules differ only in the latches that store the plain-text, which prevent glitches while the keys are being input.
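The mask-and-OR selection described above can be sketched in software. Since every gated sub-module outputs zero, a plain OR of the three outputs behaves like a multiplexer. The three lambda functions below are placeholders invented for this example; they do not reproduce the chip's permutation logic, and the name `permutation_output` is likewise hypothetical.

```python
# PLACEHOLDER sub-modules (one per key size); the real combinational logic
# of the permutations is not reproduced here.
PLACEHOLDER_SUB_MODULES = (
    lambda p, k: (p ^ k) & 0xFF,          # stands in for the 15-bit-key sub-module
    lambda p, k: (p ^ (k >> 8)) & 0xFF,   # stands in for the 31-bit-key sub-module
    lambda p, k: (p ^ (k >> 16)) & 0xFF,  # stands in for the 63-bit-key sub-module
)


def permutation_output(plain: int, key: int, module_enabled: bool,
                       selected: int, sub_modules=PLACEHOLDER_SUB_MODULES) -> int:
    """Model input masking, sub-module gating, and the final OR."""
    # Input multiplexers: mask plain-text and key to zero when disabled.
    p, k = (plain, key) if module_enabled else (0, 0)
    out = 0
    for index, sub_module in enumerate(sub_modules):
        # Only the selected sub-module sees live inputs; the rest output zero.
        gated = sub_module(p, k) if (module_enabled and index == selected) else 0
        out |= gated
    return out
```

For instance, `permutation_output(0x5A, 0x0F, True, 0)` exercises only the first placeholder, while any call with `module_enabled=False` yields zero.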

C. PRE-PROCESSING MODULE
As shown in Fig. 6c, the input data are fed to latches and multiplexed to prevent glitches inside the block, which contains a combinational circuit (equivalent to the matrix-vector multiplication detailed in Section II-C) that pre-processes the data.
The key is fed to a 9-bit register through multiplexers that select between the initial condition and the last word of the previous computation to perform the feedback process. When the input propagates through the latches, and the key through the registers, the data are transformed by the pre-processing block. The inverse process in Fig. 6d feeds the input data to the block of Boolean equations through multiplexers and, simultaneously, stores a key in a 9-bit register through a multiplexer that selects between the initial condition and the previous input data. The data and the key are processed by an inverse pre-processing block, which generates the output data.

D. SUBSTITUTION BOX MODULE
Fig. 6e shows a schematic of the implemented S-box, built mainly from a lookup table described in Verilog and synthesized into a combinational circuit. The same process was followed for the inverse S-box. The datapath consists of a multiplexer that masks the input data, which then feed the combinational logic block that transforms them. The transformed data are XORed with a constant taken from another lookup table of valid constants, which is controlled by a key stored in the initial configuration. The result of this XOR is the output of the module.
In the inverse process, the input data are multiplexed to prevent glitches and the constant is decoded from the lookup table of valid constants. The input data and the constant are XORed, and the result is transformed by the lookup table that performs the inverse byte transformation.
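The order of operations in the S-box and its inverse (lookup then XOR in one direction, XOR then inverse lookup in the other) can be checked with a small round-trip sketch. Both tables below are hypothetical stand-ins chosen only because they are easy to generate; the chip's actual S-box and its table of valid constants are not reproduced. Indexing the constants with the 4-bit c field of the key is also an assumption.

```python
# HYPOTHETICAL tables for illustration only.
SBOX = [(7 * v + 3) % 256 for v in range(256)]   # a bijection on 0..255 (7 is odd)
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index                      # invert the lookup table

# ASSUMPTION: 16 constants, selected by the 4-bit c field of the key.
CONSTANTS = [(17 * i + 1) % 256 for i in range(16)]


def sbox_forward(byte: int, c: int) -> int:
    """Forward path: table lookup, then XOR with the selected constant."""
    return SBOX[byte] ^ CONSTANTS[c]


def sbox_inverse(byte: int, c: int) -> int:
    """Inverse path: XOR with the same constant, then inverse lookup."""
    return INV_SBOX[byte ^ CONSTANTS[c]]
```

Because XOR is its own inverse, applying `sbox_inverse` with the same `c` recovers the original byte for any bijective table.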

V. IMPLEMENTATION RESULTS AND DISCUSSION
The design was described in Verilog and synthesized using a Standard Cell library. Pre- and post-synthesis simulations were performed to corroborate its performance, and the physical design was verified with Design Rule Check (DRC) and Layout Versus Schematic (LVS) checks. A maximum operating frequency of 50 MHz and a throughput of 50 Mbps were obtained with all modules enabled and a 63-bit key length. The latency equals the number of cycles needed to input the initial stream composed of the configuration, the seeds, and the first data (8 bits), and varies from 25 to 164 cycles. Figure 4b shows the layout of the chip, with an area of 4 mm$^2$, including the IO pads. Table 8 lists the gate count per module and its percentage occupation. The PRNG module occupies the largest area, with 27%, even though the modular design that generates three different key sizes is aimed at reducing area. The second largest blocks are the S-boxes, with 18% each. A possible way to obtain a more compact implementation is described in [49], where the computation of the GF($2^8$) inverse is broken up into computations in smaller subfields.
The maximum operating frequency depends on the chosen operation mode. When all modules are enabled and the key size is set to 63 bits (111-10 mode), the lowest maximum operating frequency is obtained (50 MHz); when only the pre-processing module is enabled, the highest is reached (110 MHz). The encryption/decryption datapath flows through up to three modules (pre-processing, S-box, and permutation) with no memory element between them, forming a single combinational logic path. Therefore, the propagation time is the sum of the propagation times of the three modules. The system could be improved by using a faster communication protocol and a pipelined structure: adding memory elements such as registers or latches at the output of each module would reduce the critical propagation time to that of the slowest of the three modules, thus increasing the performance of the system. The implemented system was designed to use the SPI communication protocol, with a maximum standard frequency of 50 MHz. The mode of operation with all modules enabled and a 63-bit key (111-10 mode) can already operate at that frequency, so a pipelined structure was not necessary. The main cost of a pipelined approach is the increase in power consumption, silicon footprint, and latency. Table 7 shows the simulation results of the power consumption in all configurations with a clock frequency of 30 MHz. When the maximum security is set (111-10 mode), the chip consumes the most energy owing to the number of gate switchings of all modules in use. When the PRNG key size is set to 31 and 15 bits, the power consumption is reduced to 63% and 45% of that value, respectively. On the other hand, if the S-box module, the pre-processing module, or both are disabled, consumption is reduced to 93%, 93%, and 89%, respectively.
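The timing argument (sum of delays for a single combinational path, slowest stage for a pipelined one) can be made concrete with a short sketch. The three delays below are invented for illustration: they are chosen only so that their sum is 20 ns, consistent with the reported 50 MHz in the 111-10 mode, and they are not measured values. Register setup and clock-to-output overheads are ignored.

```python
# HYPOTHETICAL per-module propagation delays in nanoseconds (not measured).
DELAYS_NS = {"pre_processing": 9.0, "s_box": 6.0, "permutation": 5.0}


def fmax_combinational_mhz(delays_ns: dict[str, float]) -> float:
    """Single combinational path: the clock period must cover the sum."""
    return 1000.0 / sum(delays_ns.values())


def fmax_pipelined_mhz(delays_ns: dict[str, float]) -> float:
    """One register per module output: the slowest stage sets the period."""
    return 1000.0 / max(delays_ns.values())
```

With these placeholder numbers the unpipelined path is limited to 50 MHz, while the pipelined variant would be limited by the 9 ns stage, at the price of the extra registers and latency mentioned above.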
Compared to the 111-10 mode (all modules enabled and 63-bit key size), the modes 010-11 (S-box only) and 000-11 (no module activated) consume 24% and 20%, respectively, because most of the energy is consumed by data transmission. To achieve the difference between these modes, the transistors of the flip-flops that compose the shift registers involved were sized by tuning the setup time, hold time, and C-Q delay values, so as to achieve minimal power consumption without affecting the previously reached performance. To corroborate this, simulations were performed with smaller register transistors, and the power consumption of the data_in and data_out registers was reduced to 30% and 26%, respectively. Table 8 shows the energy consumption of each module in the simulation configured in the 111-10 mode. The PRNG module consumes 56% of the power owing to its high-switching-activity gates. The next highest consumption comes from the data_in and data_out registers, which consume 13% of the total power. Next is the permutation module, with 12%, due to the glitches caused by the propagation delay of the data in the previous modules and the PRNG. The modules that are not in use, such as the inverse modules and the input register of the decryption process, have lower consumption than the others.
The consumption of all modules in the 010-11 mode (S-box only), which is the encryption mode with the lowest energy consumption, is also shown in Table 8. The data_in and data_out registers use 50% of the total power. The second highest consumer is the control unit, with 30%, whereas the S-box module consumes 13%, making data transmission the most energy-consuming part. Based on the above, we recommend three operation modes (highlighted in Table 7): one with the highest level of security (highlighted in blue), one with the lowest energy consumption (highlighted in green), and one that offers a fair compromise between security and energy consumption (highlighted in red).
Mode 111-01 (all modules enabled with a 31-bit key length) consumes 63% of the power of the 111-10 mode (all modules enabled with a 63-bit key length), and in terms of security both satisfy the statistical and sensitivity tests applied; however, its shorter key length makes it more susceptible to brute-force attacks. Another option to save more energy is to disable the pre-processing and S-box modules; however, the energy saved does not justify the reduction in security. The third option has the lowest power consumption in exchange for a lower level of security; nevertheless, S-boxes alone are used in some applications such as watermarking [50], image encryption [51], and embedded systems [52].
To make a fair comparison between our system and other cryptosystem implementations, it is necessary to consider several factors, such as the security level, whether the cipher is round-based or serial-based, and the CMOS technology used, among others. Since it is difficult to apply all these criteria, we decided to compare the ciphers using the power/frequency (µW/MHz) metric, which avoids favoring any specific evaluation criterion. Because the 0.5 µm process consumes significantly more power than smaller processes, we do not rule out implementing our design in newer technologies to improve the implementation. In comparison to the lightweight cipher implementations in [49], [53], [54], and [55], where values in the range of 37 to 278 µW/MHz are obtained using a 130 nm node with 3.1k to 18k gate equivalents (GE), we obtained a range of 62 to 733 µW/MHz with the 0.5 µm node. However, by implementing our system on a 130 nm node, and considering the quadratic effect of decreasing voltage and capacitance [56], we estimate that consumption could be reduced to about 10% of its current value, i.e., a range of 6.2 to 73 µW/MHz. In comparison with implementations of AES, one of the most important security benchmarks, the power consumption of the implementations presented in [46], [57], [58], and [59] ranges from 60 to 2461 µW/MHz, with gate counts from 60k to 200k GE.
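The power/frequency metric and the projected scaling to a 130 nm node amount to simple arithmetic, sketched below. The function names are hypothetical, and the factor of 0.10 is the paper's own estimate rather than something derived here.

```python
def uw_per_mhz(power_mw: float, freq_mhz: float) -> float:
    """Power/frequency metric: power in microwatts divided by clock in MHz."""
    return power_mw * 1000.0 / freq_mhz


def scale_range(low: float, high: float, factor: float) -> tuple[float, float]:
    """Scale a (low, high) range of the metric by a technology factor."""
    return (low * factor, high * factor)


# Reported 0.5 um range, scaled by the estimated 10% factor for 130 nm.
projected_low, projected_high = scale_range(62.0, 733.0, 0.10)
```

The projected values correspond to the 6.2 to 73 µW/MHz range quoted above, the upper bound being rounded down from 73.3.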

VI. SYSTEM VALIDATION
The proposed system was implemented on a Cyclone IV E FPGA, and a Raspberry Pi 3 Model B+ was used to test the embedded design. Figure 7a shows the error obtained by encrypting in all the modes at different frequencies. The error differs very little between modes; it is the operating frequency that mainly determines the error, owing to the limitations of the Raspberry Pi in generating higher-frequency clock signals. The system works correctly up to a maximum operating frequency of 16 MHz and uses 1341 logic elements. Despite the more advanced FPGA technology node (60 nm), the implementation in 0.5 µm CMOS technology is approximately three times faster, in addition to saving energy.
The system was also implemented in software using Python and the NumPy library on a computer with an Intel Core i7-8550U processor and 8 GB of RAM. Tests were performed to measure the throughput, and the results are shown in Fig. 7b. When all modules are disabled, that is, when only the cost of moving the information is measured, the software achieves a throughput of 1 Mbps; when all modules are enabled and the 63-bit PRNG is set, it achieves 176 kbps.
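A throughput measurement of this kind can be sketched with a small timing harness. This is not the authors' benchmark: the function `measure_throughput_bps` is hypothetical, it uses only the Python standard library rather than NumPy, and the transform passed to it stands in for whichever per-byte processing is being timed.

```python
import random
import time


def measure_throughput_bps(transform, n_bytes: int = 100_000, seed: int = 0) -> float:
    """Time `transform` over `n_bytes` pseudo-random bytes; return bits per second."""
    rng = random.Random(seed)
    data = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    start = time.perf_counter()
    for byte in data:
        transform(byte)  # process one byte at a time, as the chip does
    elapsed = time.perf_counter() - start
    return 8 * n_bytes / elapsed


def identity(byte: int) -> int:
    """Baseline with all processing disabled: only data movement is timed."""
    return byte
```

Calling `measure_throughput_bps(identity)` gives a baseline for data movement alone, against which a full per-byte encryption function can be compared; absolute numbers depend heavily on the host machine.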

VII. CONCLUDING REMARKS
In this work, we report on the ASIC implementation of a configurable ESCA system with the encryption and decryption processes embedded on the same chip. The design is implemented in a 5 V, 0.5 µm CMOS process. In its highest security configuration, it achieves a maximum operating frequency of 50 MHz with a throughput of 50 Mbps (at 37 mW), which is higher than that of the FPGA and software implementations of the same design. Furthermore, since the system is implemented using a Standard Cell based approach, it can be readily migrated to more advanced processes, thus improving its performance per watt. The nesting property of the PRNG implementation, in which smaller PRNG modules are embedded into larger ones, reduces the area of the PRNG module by 25%, resulting in a total of 5956 gates for the entire system. Of these, 16% are used for the SPI protocol and can be omitted in designs where it is not required. In addition, different tests, such as histograms, DFA, correlation, entropy, PSNR, and differential attack, have demonstrated the high level of security of the implemented encryption system, which is especially suitable for encrypting images, as it provides a high level of perceptual security in all but one mode of operation.
We believe that this type of configurable encryption system will be a key tool in a wide range of applications, owing to the trade-off it offers between power consumption and security level. It can be used in applications that require a higher level of security, where the power consumption of the encryption procedure is insignificant, as well as on resource-constrained devices, such as ASICs for IoT devices or IPs for system-on-chip.
Figure 8 shows the result of the chosen plain-image attack in the different system operation modes. As mentioned in Section III-D, the mask images were generated using a completely black image and its corresponding cipher-image. The plain-image is the peppers test image shown in Fig. 3a. As mentioned in Section III-B, the operation modes without the pre-processing module showed low resistance to the differential attack analysis, since they lack the influence of the previous pixels, and also performed poorly against the chosen plain-image attack; hence our recommendation to employ the pre-processing module.