Stuck-at Fault Tolerance and Recovery in MLP Neural Networks Using Imperfect Emerging CNFET Technology

Devices built on emerging technologies and materials, with the potential to outperform their silicon counterparts, are actively explored in search of ways to extend Moore's law. Among these technologies, devices based on low-dimensional channel materials (LDMs), such as carbon nanotube field-effect transistors (CNFETs), promise to eventually outperform silicon CMOS. Because these technologies are in their early development stages, their devices still suffer from high levels of defects and variations, making them unsuitable for today's general-purpose applications. On the other hand, applications with inherent error resilience and high performance demands can suppress the impact of process imperfections while benefiting from the performance boost. Such applications, including image processing and machine learning through neural networks, are ideal targets for adopting these emerging technologies even at this early stage of technology and process development. In this article, the effects of stuck-at faults in a CNFET static random access memory (SRAM)-based multilayer perceptron (MLP) neural network are investigated. The impacts of various fault patterns are analyzed, several fault recovery techniques are introduced, and their effectiveness is evaluated under different scenarios. With the proposed recovery techniques, the system can recover from and tolerate stuck-at fault rates as high as 40%, paving the way to adopt early-stage, faulty emerging device technologies in such high-demand applications.


I. INTRODUCTION
For decades, the continuous increase in digital design performance following Moore's law has been enabled by shrinking silicon (Si) metal-oxide-semiconductor field-effect transistors (MOSFETs). This has yielded transistors with lower power consumption, improved density, and higher performance. However, as MOSFETs scale below 100 nm technology nodes, short channel effects become detrimental to performance by increasing leakage current and power consumption and lowering the on/off ratio [1]. To continue scaling, the industry has turned to alternative structures that improve gate electrostatic control [e.g., fin-shaped FETs (FinFETs) and gate-all-around (GAA) FETs].
Material enhancements through strain engineering and alternative channel materials with higher mobility than silicon have also been proposed. Low-dimensional materials (LDMs) possess the thin geometries suited to GAA/nanosheet structures, and some LDMs also offer high mobility [e.g., carbon nanotubes (CNTs) or graphene]. As such, LDMs have become promising candidates for continued scaling beyond the 5 nm node. As the first widely studied LDM, CNTs are arguably the most mature, with wafer-scale fabrication processes already demonstrated [2]. Carbon nanotube field-effect transistors (CNFETs) have been shown to outperform their silicon counterparts: 5 and 10 nm CNFETs have demonstrated subthreshold swings (SS) of 73 and 60 mV/decade, respectively [3]. Moreover, 10 nm CNFETs have exhibited energy delay products (EDPs) 36× better than their silicon counterparts at the same technology node [3].
While CNFETs offer better performance, their fabrication processes, as with many other emerging technologies in early development, are immature and exhibit high variation and fault rates [4]. CNTs are grown as a mixture of semiconducting and metallic CNTs, and the metallic CNTs must be removed during fabrication, either by applying a high voltage to break them down [5] or by separation [6]. Any remaining metallic CNTs short the transistor channel, causing ''stuck-at'' faults in circuits, arguably the most critical cause of CNFET circuit failure. Other problems, such as missing channels or high contact resistances, cause performance degradation, such as a low on/off ratio, low noise margins, and increased leakage current, rather than outright failure [7].
Currently, the most promising wafer-level fabrication approach is a solution-based separation and placement technique [6]. In this process, metallic CNTs are first separated out of the solution; the remaining semiconducting CNTs are then placed on the substrate to form the CNFET channel. Different dispersion techniques can improve solution purity to 99.99% [8] or even 99.9999% [9] semiconducting CNTs. However, consistent wafer-level results have yet to be demonstrated. In addition, the metallic CNTs sometimes cluster in the solution, causing stuck-at faults concentrated in certain areas. It has been proposed that these faults could be mitigated with larger CNFETs [2], but larger CNFETs increase circuit area and reduce power efficiency [2].
Despite CNFET faults, neuromorphic computing, image processing, and machine learning applications are prime candidates for adopting this emerging technology. Being fault-tolerant and compute-intensive, these applications can take advantage of the improved performance and power efficiency of CNFETs without being compromised by CNFET fabrication faults [4], [10], [11], [12].
This work studies both the effects of different fault patterns and higher, more realistic fault rates (up to 40%) in a CNFET static random access memory (SRAM)-based multilayer perceptron (MLP) neural network. The MLP is chosen because it more closely resembles the densely connected layers of modern machine learning applications, in contrast to the work described in [4] and [10]. The increased fault rate is intended to model the higher fault rates present in emerging technology processes. In this study, we analyze different fault patterns and propose techniques to mitigate their effects on system accuracy. Note that the techniques discussed in this article are not limited to CNFETs and are applicable to other emerging technologies with high fault rates. The rest of the article is organized as follows. In Section II, the effect of different stuck-at fault patterns is analyzed using simulation results from NeuroSim [13]. Section III discusses the proposed fault recovery techniques and their impact on system accuracy. Section IV concludes the article.

II. ANALYSIS OF FAULT PATTERNS
To study the effect of stuck-at faults, an SRAM-based MLP neural network recognizing handwritten digits from the MNIST dataset is used [14]. The MLP is a commonly used neural network in machine learning applications due to its simple structure [15]. Fig. 1 illustrates the structure of the SRAM-based MLP neural network used in our analysis [13]. Each weight is stored as a 6-bit unsigned binary integer in the SRAM and then mapped to [−1, 1] digitally in NeuroSim. The MLP is first trained on the MNIST dataset without faults to establish a baseline accuracy for recognizing handwritten digits. Then, to quantify how faults affect MLP accuracy, the MLP is retrained on the same dataset with certain weights held at either 0 or 1, depending on the type of fault under investigation. The performance of this faulty system is then evaluated against the fault-free baseline.
As shown in Fig. 1, the MLP consists of three neuron/node layers: an input layer, a hidden layer, and an output layer. Depending on 1) the data presented to each layer and 2) the bit values stored in the SRAM cells connecting neurons/nodes in one layer to the next, the distinctive features of the input data are extracted. Equation (1) describes the operation of our SRAM-based MLP network:

$$y_k = f\Big(\sum_{j} v_{jk}\, f\Big(\sum_{i} w_{ij}\, x_i\Big)\Big) \qquad (1)$$

where $f(\cdot)$ is the sigmoid activation function, $w_{ij}$ are the input-to-hidden (IH) weights, and $v_{jk}$ are the hidden-to-output (HO) weights. In (1), the output y_k is associated with the kth digit: applying the sigmoid to the outcome of the output layer yields the y_k values, and the digit with the highest y_k is selected. The output layer forms the weighted sum (with weighting factors v_jk) of the hidden layer outputs; similarly, each hidden layer output is the sigmoid of the weighted sum of the input image mapped to x_i. In the MNIST dataset, 20 × 20 black-and-white input images are presented, with each pixel mapped to a specific x_i value: black pixels map to 0 and white pixels to 1. The MNIST images were cropped from 28 × 28 to 20 × 20 and their pixel values mapped from [0, 255] to [0, 1] to reduce unnecessary computation and simplify the implementation in NeuroSim. The MNIST dataset was chosen because it provides a simple way to evaluate the effect of device-level defects on system performance; future work will evaluate more complicated datasets. In Sections II-A and II-B, we discuss stuck-at faults and the resilience of the SRAM-based MLP to various fault patterns.
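For concreteness, a minimal NumPy sketch of the forward pass in (1) is given below. The array shapes follow the network in Fig. 1 (400 inputs, 100 hidden units, 10 outputs); the names sigmoid, code_to_weight, w_ih, and v_ho are illustrative and do not correspond to NeuroSim internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def code_to_weight(code, bits=6):
    # Map a 6-bit unsigned code {0, ..., 63} linearly onto [-1, 1].
    return 2.0 * code / (2**bits - 1) - 1.0

def mlp_forward(x, w_ih, v_ho):
    # x: (400,) flattened 20x20 binary image; w_ih: (400, 100) IH weights
    # w_ij; v_ho: (100, 10) HO weights v_jk, both already mapped to [-1, 1].
    hidden = sigmoid(x @ w_ih)   # inner sum over i in (1), then sigmoid
    y = sigmoid(hidden @ v_ho)   # outer sum over j in (1), then sigmoid
    return int(np.argmax(y))     # select the digit with the highest y_k
```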

A. STUCK-AT FAULTS
The ''stuck-at'' faults are not unique to CNFETs: in CMOS devices and other emerging technologies, defects may arise at various fabrication steps (e.g., metal lithography and etching), leading to stuck-at faults [16]. For CNFETs, metallic CNTs shorting the source and drain of a transistor are a severe and major cause of stuck-at faults, as shown in Fig. 2(a). In a CNFET-based SRAM cell, as illustrated in Fig. 2(b), the metallic CNTs (in red) keep the transistor on and short the output node to the V_DD rail, leaving storage node Q stuck at logic ''1.'' Without an efficient and consistent process to remove metallic CNTs, CNFET circuits exhibit significantly high levels of stuck-at faults [4].
In this article, we explore these faults under random and clustered stuck-at fault patterns and study their effects on the SRAM cells used to store a neural network's weights (Figs. 2 and 5). Random stuck-at faults occur with a uniform distribution over the entire wafer due to randomly located metallic CNTs [11]. Clustered stuck-at faults are concentrated in adjacent areas and emulate metallic CNTs that have clustered together after the separation process [17].
Seven different cases, illustrated in Fig. 3, are considered in our analysis. Fault rate is defined as the percentage of SRAM cells in the array with stuck-at faults. We present results where the stuck-at faults are either all stuck-at 0 or all stuck-at 1; these are the worst-case conditions for determining whether the system can still operate under such a poor fabrication process. For cases with a mix of stuck-at 0 and stuck-at 1 faults, the effects counteract each other, resulting in better accuracy than with a single fault type at the same fault rate.
In our analysis of fault patterns on MLP accuracy, the NeuroSim simulator [13] is used with 1) randomly distributed faults [Fig. 3(a)] and 2)-7) different clustering and semi-clustering cases in which the cluster is placed at different locations in the SRAM array [Fig. 3(b) and (c)]. NeuroSim is an open-source platform that can represent the dense neural network layers used in today's machine learning architectures. Modifications to the NeuroSim code are added and validated to 1) represent stuck-at faults by a fixed 0 or 1 value (depending on the stuck-at fault type) and 2) prevent the training algorithm from updating the affected SRAM values.
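The sketch below illustrates, under simplifying assumptions, how such faults can be injected and pinned during retraining. For brevity it forces whole 6-bit weight codes (rather than individual SRAM bits) to the all-0 or all-1 code; make_fault_mask and apply_faults are hypothetical helpers, not NeuroSim functions.

```python
import numpy as np

def make_fault_mask(shape, rate, pattern="random", rng=None):
    # Boolean mask marking which weight locations carry stuck-at faults.
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(shape, dtype=bool)
    n = int(rate * mask.size)
    if pattern == "random":                      # Fig. 3(a)
        mask.flat[rng.choice(mask.size, n, replace=False)] = True
    elif pattern == "middle":                    # centered square cluster
        side = int(np.sqrt(n))
        r0, c0 = (shape[0] - side) // 2, (shape[1] - side) // 2
        mask[r0:r0 + side, c0:c0 + side] = True
    return mask

def apply_faults(codes, mask, stuck_at=1, bits=6):
    # Pin faulted locations to the all-0 or all-1 code. Calling this after
    # every weight update keeps training from overwriting the faulted cells.
    out = codes.copy()
    out[mask] = (2**bits - 1) if stuck_at == 1 else 0
    return out
```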

B. RESILIENCE TO STUCK-AT 1 FAULT PATTERNS
The accuracy after retraining with stuck-at-1 faults is shown in Fig. 4. Without faults in the SRAM cells, the system achieves a baseline accuracy of 93.77%. The fault rate range is chosen from a realistic estimate for state-of-the-art CNFET technology [2]. As shown in Fig. 4, accuracy decreases as the fault rate increases, with a tolerance that depends on the fault pattern. The random, top-right, and bottom-left cluster patterns [Fig. 3(a) and (c)] tolerate higher fault rates than the other patterns. Because stuck-at-1 faults compound across the layers of the network, the fault patterns shown in Fig. 3(b) achieve accuracies below 50% once the fault rate exceeds 20%.
To explain this compounding effect further, the network weights (i.e., the SRAM-stored values) for the middle-clustered stuck-at-1 fault pattern at a 40% fault rate are shown in Fig. 5. The area enclosed by dashed lines is where the stuck-at-1 faults occur, producing incorrectly high weight values. Because the fault pattern is centered, these incorrectly high weights affect the middle-range indices of the input digit image. Through (1), the system output will detect only digits whose middle-range indices are at logic ''1'' (e.g., the digits 4, 5, and 6), regardless of the input pattern. This occurs because the HO layer's weights for 4, 5, and 6 are held at ''1'' by the stuck-at-1 faults, increasing the probability that the detected digit is 4, 5, or 6.
In contrast, higher accuracy is observed when the faults are spread more evenly across the digit weights (i.e., the random case), as the faults are less likely to emphasize a particular digit. The same applies to the top-right and bottom-left cases, because the incorrectly high weight values are less likely to be multiplied together through (1) for any output k.

C. STUDY OF HIDDEN LAYER SIZE
The hidden layer size is also varied to investigate the effect of network size on fault tolerance. In Section II-B, the number of hidden layer units is 100, the default setting in NeuroSim [13]. Fig. 6 shows the accuracy for networks with 200 and 50 hidden units. For the 50- and 200-hidden-unit cases, the baseline accuracies with no faults (fault rate = 0%) are 90.93% and 95.7%, respectively.
As illustrated in Fig. 6, increasing the network size also increases accuracy by effectively adding redundancy to the network. For the middle-clustered fault pattern analyzed in Section II-B, as the SRAM array grows, the faults affect relatively fewer hidden units with middle-range indices. A jump in accuracy at the 30% fault rate is seen in the 200-unit case: the HO layer's fault cluster expands and is distributed more evenly over the weights, covering the output layer weights more uniformly and producing an even mix of correct and incorrect weights. The random and top-right cases also benefit from additional hidden units, which allow the network to detect more digit features.

D. RESILIENCE TO STUCK-AT 0 FAULT PATTERNS
The system accuracy under different stuck-at-0 fault patterns is shown in Fig. 7. Accuracy remains above 80% for all stuck-at-0 fault patterns except the top-right and bottom-left clusters. Fig. 8 illustrates the network weights when a bottom-left fault cluster with a 40% fault rate is applied. The area within the red dashed line has weights of zero due to the stuck-at-0 faults. As in Section II-B, since the faults are in the bottom left, the system can only identify digits whose bottom-left pixels are black (e.g., 7 or 9). For the other fault patterns, there is much more overlap in the hidden layer nodes affected by faulty IH and HO weights, leaving the remaining portion of the neural network usable to produce the correct output.

E. DIFFERENCES IN RESILIENCE TO STUCK-AT 1 AND STUCK-AT 0 FAULTS
Comparing the two fault types, the system is clearly less tolerant of stuck-at-1 faults. Of the seven fault patterns studied, stuck-at-1 faults are tolerable only in the top-right, bottom-left, and random patterns; all other patterns are intolerable. This is because the errors compound across the layers of the network: stuck-at-1 faults raise the activation probability of hidden layer nodes, which may in turn be connected to further stuck-at-1 faults in the next layer.
In contrast, stuck-at-0 faults tend to produce networks with higher retrained accuracy. Stuck-at-1 faults actively add incorrectly high weight terms into the weighted sum, masking the contributions of the unaffected units; the system therefore cannot benefit from those units. With stuck-at-0 faults, the affected units are simply excluded from the weighted sum, and the remaining unaffected units can operate as a reduced-size neural network. As long as the hidden layer is not too small, the network's inherent redundancy allows it to operate at this reduced size and still produce reasonable accuracy. Additionally, the patterns for which stuck-at-0 faults are tolerable are exactly those that are intolerable for stuck-at-1 faults [the middle, top-left, bottom-right, and 3 × 3 patterns of Fig. 3(b), as well as the random pattern, tolerate stuck-at-0 faults]. This, too, is attributed to how the incorrect weights affect the contributions of the correct weights in (1): in the stuck-at-1 case, the incorrectly high weights from the affected units mask the weights produced by the unaffected units, whereas in the stuck-at-0 case the affected units produce low weights, so the unaffected weights dominate the sum. In essence, the impacts of stuck-at-1 and stuck-at-0 faults counteract each other. Thus, results assuming a single fault type provide the worst-case system accuracy for a given fault rate; if both stuck-at-1 and stuck-at-0 faults are present, the two counteract each other, yielding better accuracy than a system with only stuck-at-0 or stuck-at-1 faults at the same fault rate.

III. FAULT RECOVERY
To further improve the accuracy of the neural network after retraining, several techniques can be used to diminish the effect of the stuck-at faults. The effectiveness of each technique depends on the specific fault pattern, as explored below.

A. INVERTING ROW AND COLUMN ACCESSES
A memory array is accessed via bitlines and wordlines. Inverting one (or both) of the row and column decoders mirrors the memory array, which also mirrors the location of a defective cluster; an example is shown in Fig. 9. This technique is highly effective for asymmetric fault patterns, as the inversion moves the fault cluster to a different location. Before inverting, for stuck-at-1 faults, the IH and HO layers have faults linked to hidden layer nodes with the same j values, causing the faults to compound; after inverting, this overlap is eliminated. For stuck-at-0 faults, the desired overlap is the opposite: before inverting, the IH and HO faults do not overlap at the same hidden layer nodes, and after inverting they do, leaving more of the neural network usable.
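A minimal sketch of the address-inversion idea follows, assuming the weight array is addressed as a 2-D matrix; read_weight is an illustrative helper, not part of any specific memory controller.

```python
def read_weight(array, row, col, invert_rows=False, invert_cols=False):
    # Mirror the physical array by complementing the decoder address.
    # A fault cluster in one corner is thereby remapped so that IH and HO
    # faults no longer (or, for stuck-at 0, now do) share hidden nodes.
    n_rows, n_cols = array.shape
    r = n_rows - 1 - row if invert_rows else row
    c = n_cols - 1 - col if invert_cols else col
    return array[r, c]
```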
This technique does not help symmetric patterns, such as the middle fault pattern or the 3 × 3 semi-clustered pattern, because inversion does not change the fault locations; however, stuck-at-0 faults already tolerate these symmetric cases well. The technique was tested with the bottom-left fault pattern under stuck-at-0 faults, inverting one of the decoders of the HO memory array prior to retraining. The resulting accuracy decreased monotonically from 93.24% to 89.84% as the fault rate increased from 10% to 40% (Table 1).
The storage overhead of this technique is 2 bits per weight array: 1 bit each to store the inversion state of the row and column decoders. For this two-layer network, 4 bits are required, a nearly 0% overhead. A slight performance penalty is incurred due to the additional hardware needed to invert the row or column accesses.

B. DISTRIBUTING WEIGHT STORAGE BITS
Ordinarily, the bits making up each weight are grouped together and stored as shown in Fig. 10(a): the bits of each weight are kept together, and these groups are stored in memory one after another. The technique in this section is illustrated in Fig. 10(b): the most significant bits (MSBs) of all weights in the neural network are grouped together in order, then the second MSBs of all weights, and so on, and these groups are stored in memory in sequence. Regardless of whether the faults are stuck-at 1 or stuck-at 0, this technique works well when they are clustered in the middle of the SRAM array or away from the MSB storage area, as the affected weights have their most significant bits moved away from the fault cluster. If the fault cluster is located where the MSBs end up [e.g., the left side in Fig. 10(b)], this technique alone is sometimes inadequate for recovering the accuracy of the neural network to a reasonable level (Fig. 11). However, when combined with the technique from Section III-A, the MSBs can be stored in the corner opposite the fault cluster.
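A sketch of the distributed (bit-plane) layout follows, assuming 6-bit weight codes held in a NumPy integer array; to_bitplanes and from_bitplanes are illustrative names, not part of any published implementation.

```python
import numpy as np

def to_bitplanes(codes, bits=6):
    # Store all MSBs first, then all second bits, etc. [Fig. 10(b)],
    # instead of keeping each weight's bits together [Fig. 10(a)].
    planes = [((codes >> b) & 1) for b in range(bits - 1, -1, -1)]
    return np.concatenate(planes)

def from_bitplanes(flat, n_weights, bits=6):
    # Reassemble the original weight codes from the distributed layout.
    planes = flat.reshape(bits, n_weights)
    codes = np.zeros(n_weights, dtype=np.int64)
    for i in range(bits):
        codes |= planes[i].astype(np.int64) << (bits - 1 - i)
    return codes
```

With this layout, a fault cluster confined to one region of the array corrupts at most a few bit positions of many weights, rather than every bit of the weights stored in that region.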
As indicated by the arrows in Fig. 11, the distributed weight storage technique allows the neural network to achieve 88.08% accuracy (up from 23.37% with ordinary bit storage) at a 40% fault rate. While this technique incurs no storage overhead, some additional circuitry is required to rearrange the weight bits, which results in a performance penalty.

C. WEIGHTED MSB PROTECTION
The MSB of each weight has the greatest effect on the weight's value, accounting for half of its full-scale value; for example, a weight represented by 8 bits has a range of [0, 255], with the MSB worth 128. We can therefore ensure these bits are protected by using, for example, an (8,6) Hamming code (Fig. 12). Such a code adds two parity bits for every six MSBs, resulting in a storage overhead of 4.17% for 8-bit weights: two parity bits for every 6 × 8 = 48 data bits. Since weight values are mapped to the range [−1, +1] in NeuroSim, increasing the number of bits per weight increases the weight resolution; because only the MSB is protected, the greater the resolution, the lower the percentage overhead.
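The layout and overhead arithmetic can be sketched as follows. Since the exact parity equations of the (8,6) code are not reproduced in this article, the two parity computations below are placeholders that only illustrate the grouping of six MSBs per codeword and the 4.17% overhead; protect_msbs is a hypothetical helper.

```python
import numpy as np

def protect_msbs(codes, bits=8, group=6):
    # Collect the MSBs of every six weights into one codeword and append
    # two parity bits (placeholder parity equations; the actual (8,6)
    # construction is the one shown in Fig. 12).
    msbs = ((codes >> (bits - 1)) & 1).reshape(-1, group)
    p0 = msbs[:, 0::2].sum(axis=1) % 2    # parity over even positions
    p1 = msbs[:, 1::2].sum(axis=1) % 2    # parity over odd positions
    return np.hstack([msbs, p0[:, None], p1[:, None]])  # 8-bit codewords

# Overhead check for 8-bit weights: 2 parity bits protect the MSBs of
# 6 weights, i.e., 2 extra bits per 6 * 8 = 48 data bits -> 4.17%.
```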
Fig. 13 shows the post-retraining accuracy of two schemes for the middle fault cluster with stuck-at-1 faults. The first scheme assumes perfect MSB protection, which is unattainable in practice. The second uses the (8,6) Hamming code with two parity bits per six MSBs. The second scheme performs better than the first because the extra parity bits at the right edge of the memory array offset the fault cluster: since the cluster is no longer perfectly centered, the compounding between the IH and HO layers through common hidden layer nodes is eliminated.
This technique is effective for clustered fault cases in which the parity bits are not faulted (middle, top left, and top right). For cases in which the parity bits may be faulted (top left, bottom left, or 3 × 3), the Hamming code cannot resolve the correct bit values, rendering the technique ineffective. Like the method in Section III-B, this technique is effective for both fault types under the same pattern and could be combined with the method in Section III-A to further enhance accuracy.

D. WEIGHTED AVERAGING OF 3 × 3 NEAREST-NEIGHBOR WEIGHTS
This technique is inspired by convolutional neural network (CNN) average pooling [18]. When a particular weight is required, its eight nearest neighbors are also read, and all nine weights are averaged. When this is performed on the IH layer, the nine weights correspond to a group of nine pixels arranged in a square on the input image. The technique is clearly effective for a random distribution of faults, as each stuck-at fault then has one-ninth of its original effect. For clustered faults it is ineffective, since the nearest neighbors of a faulted weight are also faulted. For weights on the edges, padding weights must be added to make averaging possible. Fig. 15 shows four different weights in the top-left corner of the weight array and the neighbors averaged for each.
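A sketch of the averaging read is given below, assuming the 20 × 20 IH weight map for one hidden node; the stored padding values are not specified in this article, so edge replication is used here as a stand-in padding policy.

```python
import numpy as np

def averaged_read(weights, i, j):
    # Return weight (i, j) as the mean of its 3x3 neighborhood. A stuck-at
    # fault inside the window thus contributes only 1/9 of its original
    # effect. 'weights' is the 20x20 IH map for one hidden node; edge
    # replication stands in for the stored padding weights.
    padded = np.pad(weights, 1, mode="edge")   # pad to 22x22
    return padded[i:i + 3, j:j + 3].mean()     # nine reads, then average
```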
For the IH layer used in this article, there are 400 weights per hidden layer node, arranged in a 20 × 20 array; padding this array requires 84 additional weights, yielding a 21% storage overhead. The greatest downside, however, is the need for nine memory reads plus averaging circuitry on each IH weight fetch, which greatly increases processing time and energy consumption. Fig. 14 shows the accuracy of the neural network for a random fault distribution and a top-left clustered fault distribution. At the lowest fault rate of 10%, the random-distribution accuracy remains about the same with the weighted averaging technique; at a 20% fault rate, an increase of around 2% in accuracy is observed, and at 40%, an increase of 5%. For the clustered case, no concrete improvement is seen, as expected.

E. SUMMARY AND RECOMMENDATIONS
Before making recommendations, we summarize each fault recovery scheme with its pros, cons, and overhead (see also Table 2).
1) Inverting SRAM array accesses performs well for asymmetric faults, such as corner cluster faults, but is unsuitable for symmetric faults. The method is effective on opposite sets of patterns for stuck-at-0 and stuck-at-1 faults. Low storage and hardware overheads are incurred.
2) Distributing weight bits performs well for center-clustered faults and matches the random-fault baseline, regardless of fault type. However, performance is poor when the fault cluster lands on the MSBs (i.e., the top-left cluster). No storage overhead is induced, though some hardware overhead arises from the additional wire routing.
3) MSB protection performs well against middle fault clusters and performs well for the random fault pattern at fault rates of 10%-30%, regardless of fault type. However, as the parity bits are stored on the right side of the SRAM, this technique is less effective for patterns affecting those storage cells. A storage overhead is induced (4.17% for 8-bit weights; see Section III-C), and extra hardware is needed to check the MSBs and parity bits.
4) 3 × 3 averaging provides improvements only for random/unclustered fault patterns; no benefit is gained in the clustered cases. The extra padding weights induce a storage overhead of roughly 21%.
While each technique has its own strengths and weaknesses, the most general technique is distributing the weight bits, owing to its ability to recover from most fault patterns without storage overhead. If the fault pattern is known, however, a recovery technique suited to that pattern may be selected; for example, the 3 × 3 averaging technique is recommended if the faults are known to be randomly distributed. Finally, different techniques may be combined into improved recovery schemes; inverted row/column accesses in conjunction with distributed weight bits provide the best coverage at the lowest overhead.

IV. CONCLUSION
In this article, the effects of various stuck-at fault patterns and fault rates caused by defects in a CNFET-based MLP neural network are investigated, and fault recovery techniques are proposed. Our analysis shows that the choice of the best-fitting recovery technique depends on the fault pattern and on the allowable computational and storage overheads. Even at fault rates as high as 40%, accuracy can be recovered post-retraining if the proper recovery technique is selected. While verified on the MNIST dataset, further work is needed to investigate validity on more complicated datasets such as CIFAR-10. The analysis and conclusions are not tied to the details of CNFET processes and can be applied to any emerging technology exhibiting stuck-at faults. Our study demonstrates that immature emerging technologies with power and performance benefits can be practically useful for applications with inherent error resilience, e.g., neural network applications. This may guide the direction of early adoption for immature emerging technologies, potentially expediting their adoption time frame significantly.

FIGURE 1. Implementation of the MLP Network in NeuroSim.

FIGURE 2. (a) Example of different faults in CNT fabrication. (b) Stuck-at-1 fault due to a metallic CNT (red) in an SRAM cell.


FIGURE 3. Fault patterns under study, where white indicates the locations of stuck-at faults: (a) random faults, (b) patterns (middle, top-left, bottom-right, and 3 × 3 array) intolerable for stuck-at-1 faults, and (c) patterns (top-right and bottom-left) intolerable for stuck-at-0 faults.

FIGURE 4. Accuracy of the system under different fault rates and patterns for stuck-at-1 faults. The dotted box indicates tolerable faults; the circled point in red corresponds to the weight distribution shown in Fig. 5.

FIGURE 5. Example of a network with 40% stuck-at-1 faults in a centered clustered pattern (corresponding to the circled data point in Fig. 4).

FIGURE 6. Effect of varying network size on accuracy with different stuck-at-1 fault patterns.

FIGURE 7. Effect of stuck-at-0 faults on accuracy with different fault patterns. The circled point in red corresponds to the weight distribution shown in Fig. 8.

FIGURE 8. Example of a network with 40% stuck-at-0 faults in a bottom-left pattern (corresponding to the circled point in Fig. 7).

FIGURE 9. (a) Top-left cluster case without recovery and (b) fault pattern after inverting HO row accesses. Dark areas indicate stuck-at faults.

FIGURE 11. Accuracy post-retraining with distributed weight storage bits under stuck-at-1 faults. Arrows indicate the improvement due to distributed weight storage.

FIGURE 13. Accuracy post-retraining with perfect MSB protection and with weighted MSB protection.

FIGURE 15. Four different 3 × 3 averaging windows shown in red. The solid squares toward the bottom right are the original weights, and the dotted squares bordering the top left are added for padding.