A Highly Reliable Arbiter PUF With Improved Uniqueness in FPGA Implementation Using Bit-Self-Test

Physically unclonable functions (PUFs) promise to be a critical hardware primitive for billions of Internet of Things (IoT) devices. The arbiter PUF (A-PUF) is one of the most well-known PUF circuits. However, its FPGA implementation has poor reliability, and error correction codes (ECCs) are usually needed to eliminate the noise in the responses, which incurs high additional hardware overhead and requires nonvolatile memory (NVM) for helper data storage. In this paper, we present a highly reliable arbiter PUF with improved uniqueness using the bit-self-test (BST) strategy. A delay detection circuit is added to a classical arbiter PUF to test, in real time, the delay deviation that produces each bit of the PUF response and to mark the response as reliable with a reliability flag when the delay deviation exceeds a predefined threshold. Only the responses flagged as robust are then used. We implemented the BST-arbiter PUF on a Xilinx Artix 7 FPGA. The test results show that the selected responses achieve outstanding performance: the bit error rate is less than $10^{-9}$, the bias is 50.3%, and the uniqueness is 49.1%. Thus, the BST-APUF, which drastically reduces the ECC overhead, can be applied to lightweight cryptography applications.


I. INTRODUCTION
Internet of Things (IoT) systems usually rely on cryptographic algorithms to encrypt private data and authenticate each other [1], so the security of the key is essential. In most IoT devices, the cryptographic keys are stored in nonvolatile memory (NVM). However, keys stored in NVM are not secure enough because they can be extracted by physical attacks [2]. A physically unclonable function (PUF) is an emerging cryptographic primitive for key generation and storage and can be widely used in authentication and key generation for secure operations. A PUF can be represented as a mapping from an external input challenge C to an output response R. The response R to a challenge C is random but unique to each IC, even among ICs that contain the same PUF circuit, because of mismatches in their electrical parameters caused by manufacturing process variations. The challenge-response pairs (CRPs) that uniquely describe an IC are prohibitively difficult to reproduce using a physical clone of the same circuit. However, the FPGA implementation of the arbiter PUF (APUF) has poor reliability, mainly due to the metastability of the delay flip-flop (DFF) arbiter and the routing constraints of the FPGA. Postprocessing using helper data algorithms (HDAs) [16] is required to extract reliable keys from the noisy responses. However, most HDAs require complex error correction codes (ECCs) such as BCH, with high additional hardware overhead and NVM for helper data storage.
(The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek.)
Given that the ECC overhead decreases significantly when the raw responses contain fewer errors, many lightweight reliability enhancement mechanisms have been proposed for APUFs to reduce the error correction cost. A metastability detection technique that uses a 4-DFF or SR latch-based arbiter to significantly improve the average and minimum reliabilities was proposed in [17]. However, this technique can result in unbalanced responses and adds the overhead of detecting and recoding metastable states. More arbiters can also be added, which reduces the number of utilizable challenges and improves the reliability of the response at the expense of hardware overhead [18]. In addition, the outputs of multiple arbiters are correlated, which can compromise the unpredictability of the response. A reliability-enhanced A-PUF with trinary digit (trit) quadruple responses, which uses two flip-flop arbiters to produce a trit for metastability detection, was proposed in [19]. Its challenge-response quadruple classification dramatically reduces the burden of error correction, but the method still suffers from high hardware resource requirements.
Preselection is an efficient preprocessing method that identifies and selects the stable PUF responses. After preselection, the error rate decreases, so error correction becomes less complex or even unnecessary [20], [21]. A simple way to determine the stable responses is to measure them multiple times across different operating conditions [22], but this requires significant additional runtime and cost. More efficient approaches run a test on the PUF cells and select the responses whose mismatch exceeds a predefined limit. For example, Bhargava and Mai [23] use a built-in self-test to determine which sense amplifier (SA) PUF bits are reliable and only use those bits for key generation. Yizhak S et al. proposed a similar preselection mechanism for the SRAM PUF [24]. However, both approaches rely on precise differential analog voltages to identify unstable bits. These voltages usually require an accurate voltage regulator integrated into the PUF array, which adds cost, complexity, and power. To solve this problem, a capacitive preselection test circuit [25] and a VSS-bias generator [26] were proposed for SRAM PUFs; they detect unstable bit cells with insufficient mismatch without a voltage regulator. Preselection is also used in recently proposed RO PUF frameworks [27], [28]. In these mechanisms, if the delay difference between two ROs is larger than a threshold $R_{th}$, the PUF bit is identified as reliable. However, this efficient approach is not easy to apply to the APUF, because the delay difference of an APUF is much more difficult to measure than that of an RO PUF.
In [29], He et al. proposed a bit-self-test (BST) preselection strategy for a switched-capacitor (SC) strong PUF in which a self-test module was added to automatically test the capacitance difference that produces each response and generates a flag bit to indicate the reliability of this response. This reliability enhancement strategy is suitable for both weak PUFs and strong PUFs, and the selected robust responses can achieve very high stability across environmental variations.
In this paper, we present an arbiter PUF circuit with improved reliability and uniqueness based on the BST strategy. The main contributions of this paper are as follows:
1) We propose a novel bit-self-test arbiter PUF (BST-APUF), which, given a challenge, generates a response together with a reliability flag bit that indicates the reliability of that response. Thus, unlike traditional PUFs, the BST-APUF produces a large number of challenge-response-reliability flag pairs (CRRPs). The reliability flags can be used as helper data to extract or recover stable keys from the original PUF responses in cryptographic applications.
2) We design a BST-APUF by adding a delay detection circuit to a classical arbiter PUF. The detection circuit automatically tests the delay deviation that produces each bit of the PUF response and marks the response as reliable with a reliability flag when the delay deviation is greater than a certain threshold.
3) The BST-arbiter PUF is implemented on a Xilinx Artix 7 FPGA. The test results show that the selected responses achieve very high reliability with improved uniqueness. The hardware resource requirement of the BST-APUF is lower than those of the state-of-the-art mechanisms, which makes it suitable for resource-constrained systems.

II. BIT-SELF-TEST ARBITER PUF
A. THE CLASSICAL ARBITER PUF
The arbiter PUF (APUF), which is shown in Fig. 1, is one of the most famous PUFs [15]. It is classified as a delay-based PUF because the response is generated from the intrinsic timing differences between two topologically and functionally identical paths in an IC, which arise from manufacturing variations. Two electrical pulses race simultaneously through two paths consisting of several stages. Each stage consists of a pair of multiplexers configured as a crossbar switch and controlled by a challenge bit $C_i$. A latch or flip-flop acts as an arbiter to determine which of the signals along the two symmetrical paths is faster. When the output signal in the upper path is faster, the arbiter outputs 1; otherwise, it outputs 0. Therefore, an APUF converts the delay differences into digital responses controlled by the challenges. An N-stage arbiter PUF can generate $2^N$ challenge-response pairs (CRPs).
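The race described above can be illustrated with a small behavioral model. The following is a minimal sketch, assuming a simple additive delay model in which every stage contributes one of four segment delays drawn from a Gaussian distribution; the function names (`make_apuf`, `path_delays`, `response`) and the delay statistics are hypothetical and are not taken from the implementation in this paper.

```python
import random

def make_apuf(n_stages, seed=None):
    """Draw segment delays (arbitrary units) for a hypothetical N-stage APUF.

    Each stage holds four delays: straight-top, straight-bottom,
    cross top->bottom, cross bottom->top. The Gaussian parameters are
    illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    return [[rng.gauss(10.0, 0.1) for _ in range(4)] for _ in range(n_stages)]

def path_delays(apuf, challenge):
    """Return (D1, D2): arrival delays at the upper and lower arbiter inputs."""
    top, bottom = 0.0, 0.0  # accumulated delay of the signal on each rail
    for (s_top, s_bot, x_tb, x_bt), c in zip(apuf, challenge):
        if c == 0:
            # Straight connection: each signal stays on its rail.
            top, bottom = top + s_top, bottom + s_bot
        else:
            # Crossed connection: the two signals swap rails.
            top, bottom = bottom + x_bt, top + x_tb
    return top, bottom

def response(apuf, challenge):
    """Arbiter: output 1 when the upper signal arrives first, else 0."""
    d1, d2 = path_delays(apuf, challenge)
    return 1 if d1 < d2 else 0
```

Calling `response(make_apuf(64, seed=1), challenge)` with a 64-bit list of 0s and 1s yields one response bit; different seeds stand in for different chips.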

B. BIT-SELF-TEST STRATEGY FOR AN ARBITER PUF
Most PUFs produce responses by amplifying the difference in some electrical characteristic (e.g., delay or threshold voltage) between two nominally identical circuit components in the PUF core. When this electrical difference is large, the PUF response is more robust across environmental variations [15]. Therefore, if the electrical difference that produces each PUF response can be tested automatically, the responses with larger differences can be selected for key generation, and the reliability of the PUF will be significantly improved.
[Figure: Signal propagation delay as a function of temperature [15].]
An arbiter PUF produces responses by detecting the delay difference between two symmetric delay paths under the control of the challenge. If the delay difference $\Delta D$ for a given challenge at the nominal temperature is small, the polarity of $\Delta D$ is very likely to change with variations in temperature and supply voltage, and the response will change, as Fig. 2(a) shows. In contrast, for a larger $\Delta D$, as Fig. 2(b) shows, the polarity is unlikely to be affected by the temperature, so the response will be more stable. Thus, the absolute value of the delay difference, $|\Delta D|$, is a good indicator of the response reliability.
However, the delay differences $|\Delta D|$ are often too small to measure directly. Moreover, the arbiter PUF can generate a large number of responses, and it is not feasible to test the reliability of each response before delivery. To address this problem, this paper proposes a bit-self-test arbiter PUF (BST-APUF) that adds a delay detection circuit to a classical arbiter PUF. The detection circuit automatically tests the $|\Delta D|$ that produces each response and generates a reliability flag for that response. If $|\Delta D|$ is larger than a threshold $D_T$, the flag is set to 1; otherwise, it is set to 0. In general, when a challenge is applied to the BST-APUF, it generates a response and a reliability flag simultaneously, as Fig. 3 (a) shows. A reliability flag of 1 indicates that the corresponding response is reliable; otherwise, the response is unreliable. Therefore, unlike traditional PUFs, the BST-APUF can produce a large number of challenge-response-reliability flag pairs (CRRPs). We can select the robust responses according to the reliability flags to generate cryptographic keys or to conduct identification and authentication; thus, the reliability flags can be used as helper data to extract or recover the keys. The extraction process of robust responses is shown in Fig. 3.
It should be noted that the reliability flags might change between different evaluations of the same challenge because of noise. In some cases, the reliability flags generated during the registration phase should be stored so that they can be used in the recovery phase to recover the same stable keys from the noisy responses. In other cases, such as in the computationally secure fuzzy extractor proposed in Cao et al. [32], the reliability flags do not need to be stored. They can be treated as confidential information to recover the key, even though they change during each key extraction.
VOLUME 8, 2020
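The selection of robust responses by their flags can be sketched as follows. This is only an illustration of the case in which the flags from the registration phase are stored and reused as helper data; the function names are hypothetical, and the bit vectors in the usage note are made-up values.

```python
def select_robust(responses, flags):
    """Keep only the response bits whose reliability flag is 1."""
    if len(responses) != len(flags):
        raise ValueError("responses and flags must have the same length")
    return [r for r, f in zip(responses, flags) if f == 1]

def recover_with_stored_flags(noisy_responses, enrolled_flags):
    """Recovery phase (assumed flow): reuse the flags stored at registration
    to pick the same bit positions from a fresh, possibly noisy, readout."""
    return select_robust(noisy_responses, enrolled_flags)
```

For example, with enrolled responses `[1, 0, 1, 1, 0]` and flags `[1, 0, 1, 0, 1]`, the selected bits are those at positions 0, 2, and 4. A later readout in which only unflagged positions have flipped recovers the same selected bits.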

C. EXECUTION PROCESS OF THE BST-APUF
The key to realizing the BST-APUF is a detection circuit that measures the delay deviation producing each response and determines whether it is greater than a threshold $D_T$. Since $|\Delta D|$ is too small to measure directly, we embed an additional delay module, which generates a threshold delay $D_T$, into a classical APUF. By connecting the delay module to the upper and lower delay paths in turn and checking the arbiter outputs in these two situations, the reliability flags can be produced. The execution process of the BST-APUF is as follows:
First, the PUF works in response-output mode to generate a response $R_i$. At this time, the delay module is not connected to the circuit, and the challenge $C_i$ causes the N-stage switching delay module of the arbiter PUF to form two delay paths, as shown in Fig. 4. If the total delay of the upper path is $D_1$ and the total delay of the lower path is $D_2$, the delay difference $\Delta D = D_1 - D_2$ is input to the arbiter. When $\Delta D > 0$, the arbiter outputs a response of 0; when $\Delta D < 0$, it outputs a response of 1. The response $R_i$ is stored in a register.
Second, the PUF enters reliability self-test mode to produce the reliability flag $F_i$, which is generated in three steps:
1) The delay module is connected to the upper delay path to generate a test output $T_{i1}$, as shown in Fig. 5. Since the delay module generates a delay of $D_T$, the delay of the upper path becomes $D_1 + D_T$. With $\Delta D = D_1 - D_2$ denoting the original delay difference, the delay difference $\Delta D_1 = \Delta D + D_T$ is input to the arbiter, resulting in the test output $T_{i1}$.
2) The delay module is connected to the lower delay path to generate the test output $T_{i2}$, as shown in Fig. 6. At this time, the delay difference of the two delay paths, $\Delta D_2 = \Delta D - D_T$, is input to the arbiter.
3) The reliability flag $F_i$ is generated. If $T_{i1}$ and $T_{i2}$ from steps 1) and 2) are the same, that is, the polarities of $\Delta D_1$ and $\Delta D_2$ are the same, then the following is satisfied:
$$\Delta D + D_T > 0 \ \text{and} \ \Delta D - D_T > 0, \quad \text{or} \quad \Delta D + D_T < 0 \ \text{and} \ \Delta D - D_T < 0.$$
Since $D_T$ is a positive value, it follows that $|\Delta D| > D_T$. That is, the absolute value of $\Delta D$ is higher than the threshold $D_T$. If $D_T$ is set to be large, the response $R_i$ generated by this $\Delta D$ will be very stable across environmental variations.
Conversely, if $T_{i2}$ is different from $T_{i1}$, then $\Delta D_1 > 0$ and $\Delta D_2 < 0$, or $\Delta D_1 < 0$ and $\Delta D_2 > 0$. This means that the delay difference $|\Delta D|$ is less than $D_T$, and, thus, the output $R_i$ will be unstable.
Therefore, we can generate $F_i$ via the exclusive-NOR of $T_{i1}$ and $T_{i2}$: $F_i = \overline{T_{i1} \oplus T_{i2}}$. If $F_i = 1$, then $T_{i1}$ and $T_{i2}$ are the same, which means that the response $R_i$ generated by the PUF circuit under the challenge $C_i$ is reliable and should not change as the temperature and voltage change. $F_i$ is stored in the register along with $R_i$. By performing the above process repeatedly with different challenges $C_i$, we can obtain many CRRPs.
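The three-step self-test can be checked numerically. The sketch below is a behavioral illustration under the arbiter convention stated in the text (a positive delay difference gives 0, a negative one gives 1); the function names are hypothetical, and the handling of an exact tie is an assumption because the text does not specify it.

```python
def arbiter(delta):
    """Arbiter convention from the text: delta > 0 -> 0, delta < 0 -> 1.
    An exact tie (delta == 0) is mapped to 1 here as an arbitrary assumption."""
    return 0 if delta > 0 else 1

def bst_evaluate(delta_d, d_t):
    """Return (response, flag) for delay difference delta_d and threshold d_t > 0."""
    r = arbiter(delta_d)            # response-output mode
    t1 = arbiter(delta_d + d_t)     # delay module on the upper path
    t2 = arbiter(delta_d - d_t)     # delay module on the lower path
    flag = 1 if t1 == t2 else 0     # exclusive-NOR of the two test outputs
    return r, flag
```

With a threshold of 2.0, a delay difference of 5.0 or -5.0 is flagged as reliable, while 1.0 or -1.0 is not, which matches the condition that the flag is 1 exactly when the magnitude of the delay difference exceeds the threshold (away from exact ties).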

A. IMPLEMENTATION OF BST-APUF
The implementation structure of the designed BST-APUF is shown in Fig. 7. It consists of five modules: a 64-stage arbiter PUF, a self-test module, a reliability flag generator, a controller, and a UART.
In the design, the self-test module consists of a delay module, two 2-2 multiplexers $A_1$ and $A_2$, and two 2-1 multiplexers, as shown in Fig. 7. The delay module generates the test delay $D_T$. In our design, we simply use several nongates (NOT gates) in series; more complex circuits can be adopted if a more accurate $D_T$ is needed. In the following, we call each nongate a ''delay gate''. The multiplexer $A_1$ connects the delay gates to the upper or lower delay path, and $A_2$ ensures that the ports through which the two delay paths enter the arbiter do not change. By executing the response-output mode and the test-output mode in turn, the arbiter and the reliability flag generator produce the responses and the corresponding reliability flags.
The reliability flag generator includes a response register REG1, a reliability flag register REG2, an XNOR module, two 1-2 data distributors, and a 2-1 MUX, as shown in Fig. 8.
The controller generates the control signals S and K to control the operation of the PUF. It works as follows:
1) Set S = 0 to put the PUF in response-output mode. The signals $D_1$ and $D_2$ from the upper and lower paths are connected directly to the arbiter through MUX1 and MUX2 to generate the response $R_i$, which is stored in REG1 through DMUX1 inside the reliability flag generator.
2) Set S = 1 to put the PUF in self-test mode. The control signal K is first set to ''0'' to connect the delay module to the upper delay path; the arbiter generates the test output $T_{i1}$, which is stored in REG2 through DMUX2 and MUX3. Then K is set to ''1'' to connect the delay module to the lower delay path; the arbiter generates the test output $T_{i2}$, which is XNORed with the $T_{i1}$ held in REG2 to obtain the reliability flag $F_i$. Finally, $F_i$ is stored in REG2.
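The control sequence above, including the reuse of REG2 for both the first test output and the final flag, can be sketched at the register level. This is a behavioral illustration only, not the VHDL design; `run_bst_cycle` and `make_arbiter_stub` are hypothetical names, and the stub replaces the real delay paths with a single delay difference and threshold.

```python
def make_arbiter_stub(delta_d, d_t):
    """Hypothetical stand-in for the PUF core: returns the arbiter bit for
    control signals S and K (delta > 0 -> 0, otherwise 1)."""
    def arbiter_out(s, k):
        if s == 0:
            eff = delta_d            # response-output mode, delay module bypassed
        elif k == 0:
            eff = delta_d + d_t      # delay module on the upper path
        else:
            eff = delta_d - d_t      # delay module on the lower path
        return 0 if eff > 0 else 1
    return arbiter_out

def run_bst_cycle(arbiter_out):
    """Sequence S and K for one challenge and return (REG1, REG2)."""
    reg1 = arbiter_out(s=0, k=0)     # step 1: store R_i in REG1 (K is unused here)
    reg2 = arbiter_out(s=1, k=0)     # step 2a: store T_i1 in REG2
    t2 = arbiter_out(s=1, k=1)       # step 2b: produce T_i2
    reg2 = 1 - (t2 ^ reg2)           # XNOR with the stored T_i1, overwrite REG2 with F_i
    return reg1, reg2
```

For instance, `run_bst_cycle(make_arbiter_stub(5.0, 2.0))` leaves the response in REG1 and a flag of 1 in REG2, whereas a delay difference of 1.0 with the same threshold leaves a flag of 0.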

B. EXPERIMENTAL ENVIRONMENT
To verify the performance of the BST-arbiter PUF, a 64-stage BST-APUF design was coded in VHDL and simulated by the [...]
The threshold $D_T$ dramatically affects the value of the reliability flags. The larger $D_T$ is, the lower the probability that the reliability flag is 1, and the higher the reliability of the selected robust responses. Thus, we connect different numbers of delay gates (from 0 to 8 nongates) in the delay module and analyze the ratio and reliability of the selected robust responses under different temperatures.
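The trade-off between the threshold and the fraction of selected responses can be illustrated with a small Monte Carlo experiment. The sketch below is not a model of the FPGA measurements: it assumes Gaussian delay differences and Gaussian evaluation noise with arbitrary standard deviations, and the function name `simulate` and all parameter values are hypothetical.

```python
import random

def simulate(d_t, n_bits=20000, sigma_d=1.0, sigma_noise=0.1, seed=0):
    """Return (selected_ratio, ber_of_selected) for threshold d_t.

    sigma_d and sigma_noise are illustrative assumptions in arbitrary units.
    """
    rng = random.Random(seed)
    selected = errors = 0
    for _ in range(n_bits):
        delta = rng.gauss(0.0, sigma_d)        # delay difference at enrollment
        if abs(delta) > d_t:                   # reliability flag = 1
            selected += 1
            noisy = delta + rng.gauss(0.0, sigma_noise)  # later re-evaluation
            if (noisy > 0) != (delta > 0):     # polarity flipped -> bit error
                errors += 1
    ratio = selected / n_bits
    ber = errors / selected if selected else 0.0
    return ratio, ber
```

Comparing `simulate(0.0)` with `simulate(0.5)` shows the qualitative behavior described above: a larger threshold selects fewer responses, and the selected responses flip less often.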
It should be noted that if the routing is performed automatically by the tool when implementing the method on the FPGA, the mismatch between the elements in A1 and A2 of the self-test module could counteract the effect of the delay element. When the delay module uses only a few delay gates, the mismatch might even be large enough that the path without the delay module has a higher delay than the other path. Therefore, the layout design and routing need to be performed manually to reduce the path mismatch. We constrained and guided the placement and routing software to achieve the highest possible degree of symmetry in the PUF layout. After optimization, the latency bias between the two paths within the self-test module, as estimated by the ISE software, is less than 0.125 ns. The delay of a nongate implemented by a LUT in our test FPGA is approximately 0.6 ns, which is much greater than the 0.125 ns mismatch. Thus, even with a single nongate, the path with the delay module has a greater delay than the other path.

1) MISMATCH OF THE SELF TEST MODULE
We first measured the mismatch of the self-test module to ensure that it does not have a noticeable effect on the self-test circuit. In the original APUF (as shown in Fig. 1), if the delay difference $\Delta D$ between the two symmetric delay paths is greater than 0, the output response is 1; otherwise, it is 0. When the self-test module is connected to the delay paths (as shown in Fig. 7), the mismatch of the self-test module, $D_m$, and the additional delay generated by the delay module, $D_T$, are superimposed on the original delay difference $\Delta D$, which changes the bias (the ratio of '1's) of the test outputs. The larger $D_m$ or $D_T$ is, the larger the change in the bias. Since $D_m$ and $D_T$ are difficult to measure and compare directly, we instead measure the change in bias between the test outputs and the original APUF responses.
If the BST-APUF uses no delay gate in the delay module, the additional delay $D_T$ is zero, and the change in the bias is caused entirely by the mismatch $D_m$. When 1 delay gate is used, the delay $D_{T1}$ generated by that delay gate, together with the mismatch $D_m$, contributes to the bias change. Therefore, we can compare the mismatch $D_m$ with the delay $D_{T1}$ by comparing the bias changes of the PUF outputs in these cases.
We performed an experiment using 64 BST-APUF instances, each producing 65536-bit responses, to calculate the bias characteristics. We first measured the bias of the original responses of each BST-APUF instance in response-output mode. Then, we switched the instances to self-test mode and connected the delay module to the upper and lower delay paths, using no delay gate and 1 delay gate, respectively. The bias changes relative to the original PUF responses in these four cases are shown in Fig. 9, in which the X axis is the serial number of the PUF instance and the Y axis is the bias change relative to the original responses.
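The bias-change metric used in this experiment is straightforward to compute from the recorded bit vectors. A minimal sketch follows, with hypothetical function names and with the bias expressed as a fraction rather than a percentage.

```python
def bias(bits):
    """Fraction of 1s in a sequence of response bits."""
    return sum(bits) / len(bits)

def bias_change(test_outputs, original_responses):
    """Bias of the test outputs minus bias of the original responses.
    A positive value means the self-test configuration produced more 1s."""
    return bias(test_outputs) - bias(original_responses)
```

For example, `bias_change([1, 1, 1, 0], [1, 1, 0, 0])` evaluates to 0.25, meaning the share of 1s rose by 25 percentage points.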
It can be seen that the bias changes when connecting 1 delay gate are significantly larger than those when connecting no delay gate. This means that the delay generated by even 1 delay gate is higher than the mismatch of the self-test module. Moreover, when connecting 1 delay gate, the biases of all the PUFs increase relative to the original responses (the bias changes are greater than 0) when the delay module is connected to the upper path, and they decrease when the delay module is connected to the lower path. This shows that the path with the delay module always has a higher delay than the path without it. Furthermore, to obtain high reliability, more delay gates will be used in practical applications. Thus, the delay generated by the delay module will be much higher than the mismatch of the self-test module.

2) ROBUST RESPONSES AND THEIR RELIABILITY
Fig. 10 shows the percentage of 1s in the reliability flags for different numbers of delay gates at normal temperature (25 °C). As the number of test delay gates increases, the ratio of 1s in the reliability flags decreases rapidly. Then, the reliability of the selected robust responses (the responses whose reliability flags are 1) is measured. The reliability can be quantified using the bit error rate (BER), which is the average percentage of erroneous response bits obtained in different periods or operating environments. The BER of PUF instance i was calculated by the formula [31], [32]:
$$\mathrm{BER}_i = \frac{1}{k} \sum_{j=1}^{k} \frac{\mathrm{HD}(R_i, R_{i,j})}{n} \times 100\%$$
where $R_i$ is an n-bit response produced by PUF instance i under normal operating conditions (1.0 V and 25 °C) for a set of input challenges, C. The same set of challenges is then applied k times to the same PUF under multiple operating temperatures ranging from −25 to 80 °C to obtain the responses $R_{i,j}$ for j = 1, 2, . . . , k. Since $R_i$ and $R_{i,j}$ are n-bit responses generated by n challenges, their Hamming distances (HD) are divided by n to calculate the average error rate per bit.
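The BER computation described here, the average over k re-measurements of the Hamming distance to the reference response divided by n, can be sketched as follows. The function names are hypothetical, and the bit vectors used in the usage note are made-up values.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def bit_error_rate(reference, remeasurements):
    """BER in percent: mean over k re-measurements of HD(reference, r) / n."""
    n = len(reference)
    k = len(remeasurements)
    total = sum(hamming_distance(reference, r) for r in remeasurements)
    return 100.0 * total / (n * k)
```

For a 4-bit reference `[1, 0, 1, 1]` and two re-measurements, one identical and one with a single flipped bit, the result is 12.5 (percent).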
To test the reliability of the robust responses, 65536 random challenges are input into a BST-APUF instance, and 65536 response bits and 65536 reliability flag bits are generated at a voltage of 1.0 V and a temperature of 25 °C. The challenges and responses whose reliability flags are 1 are selected as the reference challenges and the reference robust responses. Then, the reference challenges are applied 1000 times to the same PUF instance at −25 °C, 0 °C, 25 °C, and 80 °C. The responses generated under these conditions are compared with the reference robust responses to calculate the BER for this BST-APUF instance. Then, the other 127 PUF instances are tested in the same way to calculate the average BER of the robust responses.
The BERs of the selected robust responses are measured for different numbers of delay gates (from 0 to 8). Fig. 11 shows the BERs at normal temperature (25 °C). As the number of connected delay gates increases, the BERs of the selected robust responses decrease rapidly. When the number of connected delay gates is 0, 1, and 2, the BER of the robust responses is 0.634%, 0.129%, and 0.004%, respectively. When 3 delay gates are connected, only 2 bit errors were observed among the $2.24 \times 10^{9}$ bit measurements for all 128 PUF instances; thus, the BER is approximately $8.9 \times 10^{-10}$. When the number of connected delay gates is greater than 3, no errors were found among more than $10^{10}$ tested robust response bits; thus, the bit error rate is less than $10^{-10}$.
Fig. 12 and Fig. 13 show the BERs of the selected robust responses for different numbers of delay gates at different operating temperatures. The worst-case BER was measured at a temperature of 80 °C, as Fig. 12 shows. At this temperature, the BERs of the robust responses are 3.52%, 1.03%, 0.036%, and $4.46 \times 10^{-9}$ when the numbers of connected delay gates are 0, 1, 2, and 3, respectively. If more than 3 delay gates are connected, no errors were found in more than $10^{10}$ measured robust response bits.
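As a quick check of the figures above, the BER estimates follow from dividing the observed error count by the number of measured bits. The bound for the error-free case assumes that fewer than one error occurred in the measured bits, which is a simple rule of thumb rather than a statistical confidence bound.

```latex
\begin{align*}
\mathrm{BER}_{3\ \text{gates},\ 25\,^{\circ}\mathrm{C}} &\approx \frac{2}{2.24 \times 10^{9}} \approx 8.9 \times 10^{-10} \\
\mathrm{BER}_{>3\ \text{gates}} &< \frac{1}{10^{10}} = 10^{-10}
\end{align*}
```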

3) BIAS
We computed the bias (percentage of 1s) of the selected robust responses for various numbers of delay gates. The bias is 58.1% when no delay gate is connected. Theoretically, as the number of delay gates increases, the percentage of 1s will increase, and the bias characteristics will deteriorate [15]. However, we observed no noticeable deterioration of the bias characteristics in our test. The ratio of ''1''s in the selected responses fluctuated between 58.3% and 50.3%, as shown in Fig. 14. This probably occurs because, when the design is implemented on an FPGA, a fixed deviation is introduced by the layout and routing of the PUF circuit, and the additional delays connected to the upper and lower paths are still not completely symmetrical; thus, more ''1'' responses are marked as unreliable. The bias of the selected robust responses is 50.3% when the number of delay gates is 3.

4) UNIQUENESS
Uniqueness represents the distinguishability of PUF pairs; it measures the average number of bit differences among the responses of different PUF instances to the same challenge. It is usually estimated by the inter Hamming distance (HD) as follows:
$$\text{Uniqueness} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{\mathrm{HD}(R_i, R_j)}{n} \times 100\%$$
where m is the number of PUF instances, and $R_i$ and $R_j$ are two n-bit responses to the same challenges generated by two different PUF instances, i and j, respectively.
We first tested the uniqueness of the original arbiter PUF. We randomly selected 65536 challenges, input them into 128 PUF instances, obtained a 65536-bit response from each PUF, and evaluated the Hamming distance of the 65536-bit responses between every pair of PUFs. The HD distribution is shown in Fig. 16 (a), and the average inter HD is 38.5%.
Then, we measured the uniqueness of the selected robust responses of the BST-APUF for different numbers of delay gates. For these measurements, it should be noted that when the same challenges are input into different BST-APUFs, the length and positions of the selected responses differ; therefore, it is difficult to calculate the Hamming distance between them directly.
Since we only use the selected robust responses, the average Hamming distance between these responses represents the interchip distinction of BST-APUFs. Given that the length of the selected response differs for each PUF, we selected the shortest as the benchmark. During the test, the same 65536 challenges were input into each of the 128 PUF circuits. If the robust responses selected from the 128 PUFs were at least N bits long, we collected the first N selected response bits from every PUF and evaluated their average Hamming distance distribution. The average inter HD calculated by this method for different numbers of delay gates is shown in Fig. 15. It can be seen that the uniqueness of the BST-APUFs greatly improved, from 38.5% to nearly 50%. The histogram of the inter HD distribution of the selected responses when 3 delay gates were connected is shown in Fig. 15(b), and the average inter HD was 49.1%. Therefore, the BST reliability enhancement method proposed in this paper can improve both the uniqueness and the reliability.
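The inter-HD evaluation on selected responses of unequal length can be sketched as below. The sketch assumes that "selecting the shortest as the benchmark" means truncating every selected response to the length of the shortest one and keeping the leading bits; the function name `inter_hd_percent` is hypothetical.

```python
from itertools import combinations

def inter_hd_percent(selected_responses):
    """Average pairwise Hamming distance, in percent, over all PUF pairs.

    Every response is truncated to the length N of the shortest one
    (an assumption about the benchmarking step described in the text).
    """
    n = min(len(r) for r in selected_responses)
    trimmed = [r[:n] for r in selected_responses]
    pairs = list(combinations(trimmed, 2))      # all m(m-1)/2 unordered pairs
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return 100.0 * total / (n * len(pairs))
```

Dividing the summed distances by the number of unordered pairs is equivalent to the 2/(m(m-1)) prefactor in the uniqueness formula.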

5) RESOURCE CONSUMPTION AND PERFORMANCE
The test results show that the bit error rate, bias, and uniqueness of the original 64-stage arbiter PUF were 3.52%, 58.1%, and 38.5%, respectively. The numbers of occupied slice LUTs and registers were 96 (of 20800) and 19 (of 41600), and the speed was 120.7 kbps. After adding the BST, the number of occupied slice LUTs was 150, and the number of slice registers was 47. The resource consumption increased by approximately 70%, and the speed was reduced to roughly one-third, or 43 kbps. This increased overhead is far less than those of the existing lightweight ECC techniques [33]. Moreover, if there are many BST-APUF instances in a chip, they can share one BST circuit, so the BST does not create significant overhead.
When 3 delay gates are connected, the selected responses achieve outstanding performance: the BER is less than $10^{-9}$, the bias is 50.3%, and the uniqueness is 49.1%. They can be used directly in cryptographic applications without any error correction mechanism. In this configuration, the ratio of selected responses is 66.85%, which means that the BST-APUF requires only about 1.5 raw response bits to generate one reliable bit. This is much better than the current ECC implementations. Therefore, the BST-APUF greatly reduces the complexity and implementation overhead of the error correction mechanism.
Table 1 gives a comparison of the BST-APUF with other recently proposed arbiter-based PUFs. Our proposed PUF has higher reliability, lower resource consumption, and bias and uniqueness characteristics similar to those of other state-of-the-art PUFs.
Table 2 provides the implementation overhead of producing a 1-bit response at different BERs in our experiments. As the reliability of the robust responses improves, the number of original responses required to produce a 1-bit robust response increases, and thus the execution time and energy consumption increase. For cryptographic applications that do not require high reliability, the implementation overhead can be traded off against the BER. We do not provide energy consumption data in the table, because our design and testing are based on an FPGA platform, where most of the power dissipation comes from static, clocking, and I/O power, which makes it difficult to accurately evaluate the energy consumed in producing a 1-bit PUF response.
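The selection ratio, resource, and speed figures quoted in this subsection can be cross-checked with a short calculation. Summing LUTs and registers into a single resource count is an assumption made here to reproduce the roughly 70% increase; the text does not state how that percentage was derived.

```latex
\begin{align*}
\text{raw bits per reliable bit} &= \frac{1}{0.6685} \approx 1.496 \approx 1.5 \\
\text{resource ratio (LUTs + registers)} &= \frac{150 + 47}{96 + 19} = \frac{197}{115} \approx 1.71 \\
\text{speed ratio} &= \frac{43\ \text{kbps}}{120.7\ \text{kbps}} \approx 0.356
\end{align*}
```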

A. APPLICATION POSSIBILITIES
The BST-APUF has many potential applications. Many state-of-the-art fuzzy extractors [33] and authentication protocols [34] use response reliability confidence information to eliminate the effects of noise. For example, Herder et al. [30] propose a computationally secure fuzzy extractor that treats the reliability confidence information as a trapdoor to build a stateless key generator. The TREVERSE protocol [34] allows the prover to process noisy PUF responses directly via reliability confidence information; it discards expensive ECC logic altogether, thus removing the need for helper data that can be exploited by an adversary. However, for most PUF constructions, the response reliability confidence is hard to measure directly on-chip. For instance, the APUF's response reliability is not measurable on-chip unless expensive peripheral circuits are used. Although some works capture the confidence information by modeling [34]-[36], they are only suitable for resource-rich servers and not for resource-constrained PUF devices. Our proposed bit-self-test strategy can measure the reliability confidence information of the APUF responses directly and easily on-chip. This makes it possible to apply APUFs to the above mechanisms, which can potentially minimize the computational complexity of the device and server.
Additionally, as described in the next subsection, BST-APUFs can be applied to many state-of-the-art authentication and key generation protocols, provided that the responses and reliability flags are not exposed to attackers.

B. SECURITY VULNERABILITY AND COUNTERMEASURES
Due to the strong linear correlation between the response and the challenge, it is well known that the APUF is vulnerable to modeling attacks that use machine learning (ML) algorithms [37]. Furthermore, in our proposed BST-APUF, the reliability flags reveal reliability confidence information about the responses, which increases the vulnerability to modeling attacks and reduces the number of CRPs required for a successful modeling attack [38], [39]. The confidence information in [38], [39] is obtained by measuring the stability of the responses multiple times, whereas the BST-APUF provides it directly. Therefore, the BST-APUF is more vulnerable to modeling attacks, and care is needed regarding security when applying it to cryptographic applications.
To resist modeling attacks, many defenses for traditional APUFs have been proposed; they can be roughly classified into structural nonlinearization and CRP obfuscation [40]. Structural nonlinearization implements nonlinear PUF structures to obstruct ML-based modeling. CRP obfuscation [35] hides the mapping of CRPs to prevent attackers from collecting valid CRPs to model strong PUFs. Many of these defenses, such as nonlinear structures like feed-forward PUFs (FF PUFs) [41] and multiplexer PUFs (MPUFs) [42], can be applied to BST-APUFs to increase their resistance to modeling attacks. However, most of the nonlinearization structures and CRP obfuscation methods are vulnerable to advanced ML attacks such as approximation attacks [43] and CMA-ES [38], [39]. Another approach is to reduce the exposure of responses and helper data in the protocol. For example, Yu et al. [44] proposed a lockdown technique that restricts the maximum number of APUF CRPs that an adversary can acquire, which increases the difficulty of machine learning. However, it limits the number of CRPs available for authentication.
In fact, since the helper data of the widely used fuzzy extractor may also leak some information about the derived key, many recently proposed provably secure PUF-based authentication protocols [45], [46] encrypt the responses and helper data before sending them over the network, so that attackers cannot obtain the CRPs and the helper data. By using lightweight encryption such as XOR, these protocols provide a good balance between security and low overhead. In future work, we will demonstrate that our proposed BST-APUF can be easily applied to the above protocols. By using the reliability flags instead of the helper data of the fuzzy extractor, the error correction process can be significantly simplified, and the execution overhead will be significantly reduced, making these protocols more suitable for resource-constrained IoT systems. Meanwhile, the responses and reliability flags are encrypted before transmission, which prevents an attacker from acquiring the CRPs needed to perform modeling attacks.

V. CONCLUSION
This paper presents a bit-self-test arbiter PUF with high reliability and uniqueness. We first proposed a new concept of BST-APUFs and then designed a BST-arbiter PUF by adding a delay detection circuit into a classical arbiter PUF. The detection circuit can automatically test the delay difference that produces each response and generate a reliability flag for this response. Thus, we can test and select the reliable responses automatically at normal temperatures and working environments with low overhead. We implemented the BST-APUF on a Xilinx Artix 7 FPGA. The test results show that the selected responses achieved very high reliability with improved uniqueness, and thus they drastically reduced the ECC overhead.