Efficient and Small Network Using Multi-Trim Network Structure for Tactile Object Recognition on Embedded Systems

Tactile object recognition (TOR) is critical in robot perception. However, as an embedded system, a robot brain has a fixed resource budget and is unsuitable for modern convolutional neural networks (CNNs). To bridge this gap, we present a simple network-compression approach that improves the accuracy–latency trade-off of the network. The multi-trim network structure (MTNS) is a robust combination of network compression (NC) techniques that provides a lightweight network with no performance drop. Furthermore, as an optical tactile sensor, we present a random-dot sensor that obtains rich information with a single touch, thus avoiding modality fusion. The random-dot sensor captures object shapes and feeds them to the TOR. In an experimental evaluation, we compare the performance of the proposed MTNS approach with those of CNN filter pruning, network quantization, an adaptive mixture of low-rank factorizations, and knowledge distillation. The MTNS resolved the accuracy–latency trade-off in tactile object recognition better than the modern NC methods. By combining the random-dot sensor and the MTNS approach, TOR improves in both accuracy and processing time.


I. INTRODUCTION
Much effort has gone into developing smart robots, wherein perception and manipulation are among the most fundamental and challenging problems. Tactile sensors (TSs) play an important role in robot perception of the environment [1], [2]. Robots use TSs as their sense of touch, and such sensors are improving continuously because of the demands for enhanced environmental understanding by robots.
Tactile object recognition (TOR) is among the most challenging problems in robot perception. TOR has two parts, namely (i) the TS and (ii) the recognition system. Recent works on TOR methods mostly use either multi-model sensors or multi-touch TSs [3], [4]. For a better understanding of the tactile object pattern, these methods use multimodality data [5]-[8], multiple touches (contact points) [9]-[13], or sequential data [8], [14]-[16]. By contrast, the multimodality example-based method struggles to associate local patterns and kinesthetic data. A multi-touch TS requires a large number of contacts and a long holding period. Therefore, our view of previous sensors is that they are (i) complex, (ii) sometimes generate more information than can be understood, and (iii) require a large number of contacts and a long holding period. To deliver its maximum benefits, TOR should be both accurate and timely. The TS and the recognition system are interdependent and are essential for realizing a high-performance TOR. For example, the correctness of the TOR relies on two factors, namely, (i) the information acquired during the data capture of the TS and (ii) the learning ability of the recognition system based on the captured data of the TS. Meanwhile, the processing speed of TOR depends on (i) the capturing time of the TS and (ii) the inference time of the recognition system (latency) on the captured data of the TS. Therefore, the complex and ineffective TS must be replaced with a simplified TS with few sensors, non-fusion modalities, few contacts, and a short capturing time. The remaining challenge is obtaining rich information by the TS with less contact than normal and without multimodality fusion.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Zhao.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This challenge is defined as the one-shot capture task. To enable rich information from a single-touch modality, we develop an optical TS (OTS), a random-dot sensor that recognizes the shapes of tactile objects [30]. As it has developed, machine learning has become a powerful tool for recognition, such as in the form of convolutional neural networks (CNNs). CNNs are capable of not only image classification but also other computer-vision tasks. For example, successful image-classification networks with deep CNNs are AlexNet [17], VGG [18], and ResNet [19]. Recently, the performances of CNNs have improved as their structures have become more complicated. This means that they incur higher computational costs and become slower. However, as an embedded system, the robot brain has a fixed resource budget, and modern CNNs are not suitable for real-time processing on such a system. Thus, compressing CNNs plays an important role in reducing their computational cost; the main approaches are (i) the hand-crafted approach and (ii) the network compression (NC) approach. The hand-crafted approach involves factorization and manual compression. Examples of hand-crafted networks include SqueezeNet [20], MobileNet [21], MobileNetV2 [22], ShuffleNet [23], CondenseNet [24], ShiftNet [25], and Neural Architecture Search (NAS) [41], [44], [54]. However, these approaches usually require significant amounts of manual effort and domain knowledge and are time consuming; although NAS removes the need for expert knowledge, it remains time consuming. Instead, the NC approach takes advantage of pre-trained networks, using techniques such as pruning [26], [52], quantization [27], low-rank factorization [28], [50], and knowledge distillation [29]. These techniques avoid the need for hand-crafted design. Although most NC techniques can achieve a good compression rate, high accuracy, or both, they do not affect certain dimensions (e.g., the width, resolution, and depth) of the networks.
For example, (i) pruning yields only a relatively narrow network, (ii) quantization reduces only the memory bandwidth and storage size of the network, (iii) low-rank factorization only lightens the weights, and (iv) knowledge distillation yields only a narrower and shallower network. The remaining challenge, then, is to compress a network structure in all dimensions using NC while achieving a trade-off between accuracy and latency. Our efficient NC approach achieves simplicity while improving the trade-off between accuracy and latency. We also present a multi-trim network structure (MTNS), a robust combination of the following NC techniques: (i) formulating the problem, (ii) shrinking the network structure, (iii) developing a training accelerator through the hybrid of pruning convolutional kernels and shrinking the network structure, and (iv) developing a search strategy through trim neural architecture search.
We show that using a random-dot OTS and an MTNS exceeds the previous best results (more details in Section III-B). In the conference version of this article [49], we proposed a network-structure shrinking approach to compress networks for an embedded system. We took pre-trained networks (VGG16 and MobileNetV2), shrank them, and searched for the best ratio of compound multipliers to achieve the best network performance.
The present article extends our previous work [49] by (i) increasing the number of pre-trained networks, (ii) developing a training accelerator, and (iii) expanding and analyzing the shrinking search space. The MTNS approach leads to a more-efficient trade-off between latency and accuracy. Experiments are performed to search for a network structure that is suitable for the embedded system. The tactile images captured using the random-dot OTS, processed by the compressed network, achieve low latency and high accuracy. Overall, the contributions of the present study are in the following three aspects.
1. Random-dot OTS. We develop a novel random-dot OTS that (i) obtains more detail and (ii) captures sufficient information in a single touch, with no need for modality fusion.
2. MTNS approach. We present a new, simple, and efficient NC technique whose resulting network goes beyond the usual trade-off between latency and accuracy.
3. High-performance resulting network. The MTNS approach allows us to bridge the gap between a high compression rate and high network accuracy.
The remainder of this article is organized as follows. In Section II, we review related recent literature. In Section III, we describe the random-dot OTS architecture and the MTNS approach. In Section IV, we evaluate the performance of the MTNS approach and compare it with those of CNN filter pruning [26], [52], a network quantization technique [38], an adaptive mixture of low-rank factorizations [50], and knowledge distillation [29], [51]. We end by drawing conclusions in Section V.

II. LITERATURE REVIEW
A. TACTILE SENSORS FOR TACTILE OBJECT RECOGNITION
There are several TSs for TOR, such as the Weiss [9], Teksan [31], and GelSight [32] sensors. Such sensors can be categorized into the following five types, and their advantages and disadvantages are summarized in Table 1.
1. Piezoresistive sensors are based on the change in the electrical resistivity of a material when it is deformed. Examples of sensors based on piezoresistive materials are multilayer conductive rubber [9] and a nonflexible pattern of electrodes combined with conductive rubber [31].
2. Piezoelectric sensors use the piezoelectric effect, i.e., the generation of electrical charge in a crystalline material due to deformation caused by an applied force. An example of a piezoelectric sensor is polyvinylidene fluoride (PVDF) [5].
3. Capacitive sensors are coupled capacitive plates whose capacitance changes when the gap between them changes under an applied force. A commercial capacitive sensor is DigiTacts [33].
4. Optical sensors are based on optical reflection between media with different refractive indices. The best-known optical sensor is GelSight [32], [34]. Optical sensors have good spatial resolution and high sensitivity, but they suffer from high computational cost and power consumption.
5. A multimodality TS is a combination of different types of TS [5], [6]. There is also the multi-model sensor, which is a fusion of a TS and a visual sensor [7], [8], [35]. Both multimodality TSs and multi-model sensors are less limited than a single sensor, but they struggle to associate local patterns and kinesthetic data.
In addition, each type of sensor has its disadvantages. We can categorize them into three groups: (i) those that cannot detect spherical shapes [31], [33], (ii) those that require multiple modalities [5]-[8], [35], and (iii) those that require multiple touches [9]-[13], [35]. More closely related to the present work, the GelSight sensor [32], [34] was proposed for tactile texture recognition, but it is ineffective for TOR.
In fact, most of the previous TSs are ineffective for TOR, for the following reasons: (i) the complexity of multiple modalities, (ii) the difficulty of capturing features from the unassociated data of multiple modalities, and (iii) the long capturing time with multiple touches. Resolving these problems is the aim of the one-shot capture task mentioned in Section I. To this end, we present a simple TS composed of a random-dot sensor, which obtains rich information with single-touch contact and without modality fusion. This efficient TS part realizes an accurate and timely TOR. The random-dot sensor enhances the accuracy of the TS part, and the speed of the TOR process is enhanced by obtaining sufficient information in a single touch.

B. NETWORK COMPRESSION FOR IMAGE RECOGNITION
For more-accurate performance, modern CNNs tend to become deeper and more complex, thereby becoming slower and incurring higher computational costs. A platform with a fixed resource budget is not suitable for processing modern CNNs because such a platform has low memory and small processors. The CNN compression approach therefore plays an important role in modern CNN applications. This approach is a computation-reduction technique for a network working on a platform with a fixed resource budget. CNN compression can provide a smaller and faster network. Recent work on compressing CNNs can be classified into two groups, namely (i) hand-crafted and (ii) NC.
The hand-crafted methods create lightweight, low-computation networks. SqueezeNet [20] addresses inefficient distributed training, high overhead, and infeasibility on embedded systems, but it has low accuracy. MobileNet [21], [22] reduces the high computational cost of CNNs by using depthwise separable convolutions and inverted residuals. However, although it is more accurate than SqueezeNet, MobileNet requires considerable manual effort. ShuffleNet [23] introduced a simple channel-shuffling technique, improving both accuracy and latency. CondenseNet [24] achieves lower latency than ShuffleNet at the same level of accuracy. ShiftNet [25] is smaller than previous networks [20], [23], [24] thanks to shift operations, but at the cost of accuracy. However, although these approaches achieve networks with either low latency or high accuracy, they require considerable manual effort and domain knowledge and are time consuming. Moreover, the NAS approach automates the network design process [41], [44], [54]. NAS removes the need for domain knowledge; however, it is usually time consuming.
Unlike the hand-crafted methods, the NC methods take advantage of pre-trained networks. NC squeezes an existing network into one that is smaller and requires less computation. We categorize the NC methods into the following four groups.
1. Pruning-based methods explore the redundant parameters or convolutional filters in the network and remove those that are not critical. Parameter pruning [26], [36] yields an effective reduction of network size; however, it relies on hardware and computing-library support because the parameter kernels are sparse after pruning. Convolutional filter pruning [26], [36], [52], [53] resolves this unfriendliness to hardware and Basic Linear Algebra Subprograms (BLAS) libraries. However, an accuracy drop is commonly encountered in both approaches.
2. Quantization-based methods reduce the number of bits associated with each weight. Eight-bit quantization [37], [38] leads to remarkable acceleration but lower accuracy, the latter having to be recovered by retraining [27]. Binary quantization (one bit) [39] enhances the speed but offers much-lower accuracy. Sixteen-bit quantization reduces the accuracy loss but increases the weight size.
3. Methods based on low-rank factorization use matrix decomposition to save storage. The low-rank method for fully connected layers and CNNs [28] achieves a good compression rate, but both cause some networks to lose accuracy. Adaptive Mixture [50] fixes the lower accuracy; however, it results in a network with more Madds.
4. Methods based on knowledge distillation transfer knowledge from a teacher network (a pre-trained network) to a student network (a smaller network). Softmax temperature [29] distills the teacher network's loss into a student of similar depth but hurts accuracy. Softened label probabilities [29], [40], [51] help student networks preserve more information while being more compact.
As mentioned above, most methods hurt the accuracy of the resulting network despite achieving a good compression rate, and complicated methods hinder their use. To solve these problems, we bridge the gap between trade-off improvement and simplicity. We present structure-compression methods that balance the trade-off between accuracy and latency. The present network-structure shrinking approach differs from previous studies: it compresses a pre-trained network by its structure. Trimming the network structure leads to lower computational costs and a faster network. Furthermore, we boost the performance in terms of both speed and accuracy by developing a training accelerator and a search strategy.

III. METHODS
The methods are divisible into four parts: (i) the random-dot OTS, (ii) the MTNS approach for embedded systems, (iii) training, and (iv) the TOR system. Section III-A introduces the random-dot sensor and the method for data collection. A random-dot sensor is an OTS that can obtain sufficient information in a single touch. The MTNS approach involves compressing the network structure (i.e., shrinking the network, developing a training accelerator, and formulating a search strategy). As described in Section III-B, the MTNS approach can reduce the computational demands and improve the accuracy for working on embedded systems. Next, the training methodology is presented in Section III-C. Finally, the TOR system is presented in Section III-D.
A. RANDOM-DOT OPTICAL TACTILE SENSOR AND DATA COLLECTION
1) ARCHITECTURE OF RANDOM-DOT OPTICAL TACTILE SENSOR
As shown in Fig. 1, the random-dot OTS (random-dot sensor) is shaped like a box with an area of 37 mm × 37 mm and a height of 14 mm. This sensor has two layers, namely (i) an opaque layer (white) that is 9-mm thick and (ii) a transparent layer (orange) that is 5-mm thick. The average size of the particle markers is around 2 mm, and the interval between them is random. The camera in the sensor takes original images of 640 × 480 pixels in size over an area of 31.4 mm × 23.4 mm at a rate of 30 Hz. The overall system is shown in Fig. 2. When an object is brought into contact with the opaque part, the transparent part is displaced, causing the positions of the dots to change. The shape of the object can then be classified by inspecting the transparent part.

2) DATA COLLECTION AND AUGMENTATION
As shown in Fig. 2(b), the objects in the experiments had surfaces of five different shapes, namely (i) a circle, (ii) a triangle, (iii) a square, (iv) a pentagon, and (v) a hexagon, as well as a non-contacted object (default). A tactile image was captured while the surface of the experimental object was in contact with the opaque layer of the random-dot sensor. Displacing the opaque layer changed the positions of the dots in the transparent layer. For each object, 12 images were taken under different conditions, resulting in 72 images in total. However, this number of images is too few to train a high-performance network. Therefore, we used image augmentation to boost the performance of the experimental networks by preventing overfitting. The conditions in the image augmentation were random combinations of rotation, shift, flipping, skewing, elastic distortion, zoom, and brightness. Skewing is performed in eight directions (corner = {top-left, top-right, bottom-left, bottom-right} and tilt = {left, right, top, bottom}). As elastic distortion is important for increasing the number of tactile images with low consistency, elastic distortion was performed using 10 × 10 grids. The range for each condition is given in Table 2, and examples of the augmented tactile images are shown in Fig. 3. In total, 10,368 images were generated, 1,728 for each object. The holdout method was used to split the dataset into training and testing parts with a ratio of 60:40. The training part consisted of training and validation datasets, which were separated by three-fold cross-validation.
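The 60:40 holdout split and the three-fold separation of the training part can be sketched in a few lines of plain Python (a minimal illustration; the function name and seeding are our own):

```python
import random

def holdout_split(n_images, test_ratio=0.4, n_folds=3, seed=0):
    """Split image indices into a held-out test set (holdout method,
    60:40) and n_folds cross-validation folds over the training part."""
    rng = random.Random(seed)
    idx = list(range(n_images))
    rng.shuffle(idx)
    n_test = round(n_images * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    folds = [train[i::n_folds] for i in range(n_folds)]  # 3-fold CV
    return train, test, folds
```

With n_images = 10,368 this yields a 6,221-image training part and a 4,147-image testing part, and the three folds exactly partition the training indices.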

B. MULTI-TRIM NETWORK STRUCTURE APPROACH FOR EMBEDDED SYSTEMS
The MTNS approach involves four steps: (i) problem formulation, (ii) shrinking the structure of the network, (iii) developing the training accelerator (a hybrid of pruning convolutional kernels and shrinking the network structure), and (iv) formulating the search strategy (TrimNAS).

1) PROBLEM FORMULATION FOR TRADE-OFF BETWEEN ACCURACY AND LATENCY
Embedded systems are platforms with a fixed resource budget, such as low memory and small processors. In contrast, modern deep CNNs (hereafter, networks) require heavy computation, which makes them slow. Thus, one of the main tasks of the MTNS approach is to optimize the trade-off between accuracy and latency for efficient work on embedded systems. The problem formulation is multi-objective (high accuracy and low latency). Following [41], we adapted the objective function as

Obj(m) = ACC(m) × [TAR / LAT(m)]^w,

where m is the child network, ACC(m) is the accuracy of each child network on the target task, LAT(m) is the latency on the target embedded system, TAR = 16 ms is the target latency, and w = 0.09 is a hyper-parameter that controls the trade-off between accuracy and latency. In the experiments, the higher the objective score, the better the child network.
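A minimal sketch of this objective (our reading of the adapted MnasNet-style formulation; the exponent placement follows the text's convention that higher scores are better and w = 0.09):

```python
def objective(acc, lat_ms, target_ms=16.0, w=0.09):
    """Score of a child network m: ACC(m) scaled by how far its
    latency sits from the 16 ms target; larger is better."""
    return acc * (target_ms / lat_ms) ** w
```

A child network at exactly the 16 ms target scores its raw accuracy; faster networks are rewarded mildly, slower ones penalised.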

2) SHRINKING STRUCTURE OF NETWORKS
The network-structure shrinking approach, called the shrinking method, is a network-structure compression technique that reduces the computation of networks for working on embedded systems. The shrinking method differs from previous work [26], [28], [29], [36]-[40] in that it trims the network structure. Compound scaling [42] shows how to extend networks, with significant performance improvement, using compound dimensions. Inspired by compound multipliers, we adapt them into a means of shrinking the structure of networks. The shrinking method is a simple approach that compresses the network structure into a computationally light network balancing accuracy and latency. The shrinking method compresses the network structure in three dimensions (width, resolution, and depth) and consists of three parameters, namely multipliers of the width, resolution, and depth. Each multiplier is a coefficient with a value of 0.25, 0.5, 0.75, or 1.0. Fig. 4 shows the baseline network (a) and the shrunk network (b). By decreasing the multipliers, the shrunk network reduces the number of parameters, the network size, and the computational cost. For example, the Inverted Bottleneck Conv network [22] was used in this experiment, and its computation at each block is described by

Madds(block) = γL × (βD_F)² × EαM × (αM + D_K² + αN),

where α is the width multiplier, β is the resolution multiplier, γ is the depth multiplier, E is the expansion ratio, D_K is the kernel size, M is the number of inputs, N is the number of outputs, L is the number of layers, and D_F is the feature-map size. By shrinking the Inverted Bottleneck Conv, we obtain a reduction in computation of approximately 78% relative to the original. Furthermore, the correctness of networks can be improved using a suitable ratio of multipliers. The shrinking experiment searched for a suitable ratio of multipliers that offered better performance and less latency.
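The effect of the multipliers on per-block cost can be checked numerically. The sketch below uses the standard inverted-bottleneck cost (1×1 expansion, depthwise D_K×D_K convolution, 1×1 projection); the channel counts, expansion ratio E = 6, and the uniform 0.75 multiplier setting are illustrative assumptions, not the paper's exact configuration:

```python
def inverted_bottleneck_madds(M, N, D_F, D_K=3, E=6, L=1,
                              alpha=1.0, beta=1.0, gamma=1.0):
    """Multiply-accumulate count of one inverted-bottleneck block
    under width (alpha), resolution (beta), and depth (gamma)
    multipliers; channel/layer counts are rounded to at least 1."""
    m = max(1, round(alpha * M))    # shrunk input channels
    n = max(1, round(alpha * N))    # shrunk output channels
    f = max(1, round(beta * D_F))   # shrunk feature-map size
    l = max(1, round(gamma * L))    # shrunk number of layers
    per_layer = f * f * E * m * (m + D_K * D_K + n)
    return l * per_layer

# Illustration: shrink all three dimensions to 0.75 of the baseline.
base = inverted_bottleneck_madds(32, 32, 28, L=4)
small = inverted_bottleneck_madds(32, 32, 28, L=4,
                                  alpha=0.75, beta=0.75, gamma=0.75)
reduction = 1 - small / base
```

With all three multipliers at 0.75, the computation drops by roughly three quarters, in line with the approximately 78% reduction reported above.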

3) HYBRID OF PRUNING CONVOLUTIONAL KERNELS AND SHRINKING STRUCTURE OF NETWORK (SHRINKING WITH TRAINING ACCELERATOR)
Following [26], the pruning technique prunes the convolutional kernels. Most pruning methods cause a significant drop in accuracy, although they can reduce the training time considerably. Pruning the convolutional kernels is close to the technique of shrinking the width dimension, because both decrease the number of CNN filters. With filter ranking, pruning the convolutional kernels retains important filters and removes redundant ones, and it can also boost the correctness of networks. For this reason, we consider pruning the convolutional kernels to be a training accelerator. However, compared with pruning, shrinking exerts different effects on the width scaling of the structure; e.g., shrunk networks have the same width scaling throughout the network. Pruned networks, in contrast, have different scaling for each layer, with the total number of removed filters characterized by the same ratio as that of the shrunk networks (see Figs. 4(b) and (c)). Thus, the number of parameters and Madds in shrunk networks differs from that of the pruned networks. The hybrid approach of shrinking and pruning is a method for compressing the network structure with a relatively short training period. The hybrid approach creates a compressed structure using the shrinking approach and then removes CNN filters using the pruning approach in the inner loop (25% of the total number of filters in the compressed structure in each inner loop) with 10 recovery epochs. The overall hybrid approach is shown in Fig. 5. With the hybrid approach, the training time can be reduced even if the search-space size of the hybrid experiment is the same as that of the shrinking experiment. Moreover, the accuracy of the network can be improved if the hybrid technique fits the network scale and there are sufficient recovery epochs.
Therefore, the hybrid experiment searches for a suitable ratio of the resolution and depth multipliers in the shrinking technique as well as a percentage of the number of filters in the pruning technique.
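The inner loop of the hybrid approach can be sketched as follows (a plain-Python illustration; the filter names and scores are hypothetical, the ranking criterion could be, e.g., the L1-norm of the filter weights, and recovery training is stubbed out):

```python
def prune_step(importance, frac=0.25):
    """One inner-loop step: rank the filters of the shrunk network
    and remove the least-important fraction `frac` of them.
    `importance` maps filter name -> ranking score."""
    n_remove = int(len(importance) * frac)
    ranked = sorted(importance, key=importance.get)  # least important first
    removed = set(ranked[:n_remove])
    return {f: s for f, s in importance.items() if f not in removed}

def hybrid_compress(importance, n_steps=2, recovery_epochs=10):
    """Shrink first (assumed already done), then alternate pruning
    with short recovery training (the paper uses 10 recovery epochs)."""
    for _ in range(n_steps):
        importance = prune_step(importance)
        # ... retrain the pruned network for `recovery_epochs` epochs ...
    return importance
```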

4) TRIM NEURAL ARCHITECTURE SEARCH WITH WEIGHT SHARING AND FACTORIZED HIERARCHICAL SEARCH SPACE
We explore the idea that a more-complex shrinking structure of a network may improve the performance of child networks. Shrinking the network structure at the block level provides an enormous search space compared with that provided by the original shrinking of the network structure: the original search space is 240 shrunk networks, whereas the block-level search space is 9.65 × 10²⁸ shrunk networks. The search strategy plays an important role in search-space exploration and exploitation. There are many different search strategies, such as Bayesian optimization, evolutionary methods, gradient-based methods, random search (RS), and reinforcement learning (RL). A case study comparing evolutionary methods, RL, and RS concluded that evolutionary methods and RL perform better than RS [43].
We present Trim Neural Architecture Search (TrimNAS), a macro search that compresses networks to meet the trade-off between accuracy and latency. TrimNAS, one of the MTNS methods, is a network-compression technique that combines NAS with shrinking the structure of the network. TrimNAS aims to explore a more-complex compressed network architecture, at the block level, than either the shrinking approach or the shrinking-with-training-accelerator approach can reach. Using RL, a controller can explore and exploit its target even though the search space is enormous, but this can demand computation on the order of thousands of GPU days. TrimNAS involves short training and network-structure compression at the block level to obtain better performance. TrimNAS uses shared weights [44] and a factorized hierarchical search space [41] to reduce the training time without decreasing the total search-space size. The shared weights reduce the otherwise notable training time of TrimNAS. Each training epoch has two phases, namely (i) the shared-parameters-of-the-child-networks (ω) phase and (ii) the controller-parameters (θ) phase. Training ω in the first phase fixes the controller's policy and optimizes ω with stochastic gradient descent. The goal is to minimize the expected loss

E_{m∼π(m;θ)}[L(m; ω)],

where m is the sampled child network and L(m; ω) is the cross-entropy loss on sample m. Training θ in the second phase fixes the shared parameters of the child networks and updates θ to maximize the expected reward

E_{m∼π(m;θ)}[R(m, ω)],

where π(m; θ) is the controller's policy and R(m, ω) is the reward of sample m. Fig. 6 shows the baseline structure of TrimNAS's search space. To shrink each block, the controller (an LSTM) samples N blocks of decisions. The predictions of the controller for each block involve the following four steps.
Each block has an identity that includes a convolutional operation, a width multiplier, a number of layers, and a skip operation. We search for the identity in each block's first layer and repeat the same layer L times. For example, layers 6-1 to 6-L6 have the same identity, while different blocks can have different identities.
Step 1. Sample a convolutional operation.
Step 2. Sample a width multiplier to apply to the operation selected in step 1.
Step 3. Sample the number of layers to apply to the operation selected in step 1.
Step 4. Sample a skip connection to apply to the operation selected in step 1.
The controller repeats the above four decision steps N times, corresponding to the seven blocks in the TrimNAS experiment. Each block has a list of identical layers, each defined by an operation, a skip connection, a width multiplier, and a depth multiplier. The sub search space in each block is as follows: • convolutional operations: Conv, SepConv [22], Inverted Bottleneck Conv [22];
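The controller's per-block sampling can be sketched as below. The operation list comes from the text; the width-multiplier, layer-count, and skip options are assumed values, since the printed sub-search-space list is truncated:

```python
import random

OPS    = ["Conv", "SepConv", "InvertedBottleneckConv"]
WIDTHS = [0.25, 0.5, 0.75, 1.0]   # assumed options
LAYERS = [1, 2, 3, 4]             # assumed options
SKIPS  = [True, False]

def sample_block(rng):
    """The controller's four decisions for one block."""
    return {
        "op": rng.choice(OPS),        # step 1: convolutional operation
        "width": rng.choice(WIDTHS),  # step 2: width multiplier
        "layers": rng.choice(LAYERS), # step 3: number of layers
        "skip": rng.choice(SKIPS),    # step 4: skip connection
    }

def sample_architecture(n_blocks=7, seed=0):
    """Repeat the four decisions for each of the N (= 7) blocks."""
    rng = random.Random(seed)
    return [sample_block(rng) for _ in range(n_blocks)]
```

Seven such blocks give (3 × 4 × 4 × 2)^7 ≈ 7.5 × 10¹³ candidate architectures under these assumed options; the paper's full block-level search space is larger.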

C. TRAINING
We used the tactile image dataset described in Section III-A to experiment with the MTNS approach for embedded systems. For a fair comparison, we selected VGG16, the CNN employed in the conference version of this article, as a pre-trained network. We also trained three high-performance models, namely, MobileNet, MobileNetV2, and MnasNet. The shrinking and hybrid approaches used a grid search to explore a target, while TrimNAS used RL. The target was the child network that provided the highest score in the trade-off between accuracy and latency, referring to Section III-B-1. Latency was measured on a Jetson TX2 with GPU. The weights of the experimental networks were all initialized using the Kaiming technique [45]. The training set was shuffled every epoch, and the batch size was 128 images. We selected the network that performed best on the validation dataset to measure performance on the testing dataset. To compare child-network performance, we used an objective function for which a higher score means better performance. Cross-entropy loss was the loss function for training the child networks, and Adam [46] was the optimizer. Training was performed on seven Nvidia GTX 1080Ti GPUs.
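The Kaiming (He) initialization used for all experimental networks draws each weight from a zero-mean normal with standard deviation sqrt(2 / fan_in); a plain-Python sketch:

```python
import math
import random

def kaiming_normal(fan_in, n, seed=0):
    """Draw n weights from the He/Kaiming normal distribution
    N(0, sqrt(2 / fan_in)) [45]."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [rng.gauss(0.0, std) for _ in range(n)]
```

For a layer with fan_in = 128, this gives a standard deviation of 0.125.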

D. TACTILE OBJECT RECOGNITION SYSTEM
Tactile object recognition (TOR) is an object recognition system that uses tactile sensors. The TOR in this article comprises two parts, namely, the TS (the random-dot sensor) and the recognition system (the lightweight network), as shown in Fig. 7. The TOR should be accurate and timely to deliver its maximum benefits. The accuracy of the TOR depends on (i) the information acquired when capturing the data of the random-dot sensor and (ii) the accuracy of the lightweight network based on the captured data of the TS. Meanwhile, the processing speed of TOR relies on (i) the capturing time of the random-dot sensor and (ii) the inference time of the lightweight network (latency) on the captured data of the TS. To evaluate the accuracy and processing speed of the TOR, we determine the accuracy of the lightweight network and the latency of the TOR, respectively. The total latency is the sum of the capturing time of the random-dot sensor and the inference time (latency) of the lightweight network.
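As a back-of-the-envelope check of the total latency (assuming capture takes one frame of the 30 Hz random-dot camera; the function name is ours):

```python
def tor_latency_ms(inference_ms, camera_hz=30.0):
    """Total TOR processing time: one frame from the random-dot
    camera plus the lightweight network's inference latency."""
    capture_ms = 1000.0 / camera_hz  # ~33.3 ms at 30 Hz
    return capture_ms + inference_ms
```

With the 12.29 ms shrunk-MnasNet latency reported in Section IV, the total comes to roughly 45.6 ms.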

A. EXPERIMENTS ON SHRINKING NETWORK STRUCTURE
1) SEARCH SPACE
In this section, we look for the compound multipliers that give the highest score for the child networks. The search strategy was a grid search, and there were 240 cases of compound multipliers, described as follows: • width multiplier {×0.25
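The grid search over compound multipliers can be sketched as an exhaustive scan (the `evaluate` callback, which would train and score a shrunk child network, and the 4-value grids below are placeholders; the paper's full grid has 240 cases):

```python
from itertools import product

MULTS = [0.25, 0.5, 0.75, 1.0]  # candidate multiplier values (subset)

def grid_search(evaluate, widths=MULTS, resolutions=MULTS, depths=MULTS):
    """Score every compound-multiplier combination with `evaluate`
    and return the best (score, (width, resolution, depth))."""
    return max((evaluate(a, b, g), (a, b, g))
               for a, b, g in product(widths, resolutions, depths))
```

For example, plugging in a toy score function peaked at (0.75, 0.5, 0.25) returns exactly that multiplier triple.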

2) TRAINING DETAILS
To shrink the pre-trained networks and then train the child networks for 50 epochs, we set the environment following Section III-C. The learning rate was set to 1 × 10⁻³ for VGG16 and MobileNet and to 2.56 × 10⁻² for MobileNetV2 and MnasNet. Fig. 8 shows that the depth multiplier affects the latency of the child networks of MobileNetV2 and MnasNet, while the other multipliers have only a minor effect on the latency; i.e., the low-latency group (LAT < 15 ms) corresponds to the ×0.25 depth multiplier, the mid-latency group (15 ≤ LAT ≤ 20 ms) corresponds to the ×0.5 depth multiplier, and the high-latency group (LAT ≥ 20 ms) corresponds to the ×0.75-1.0 depth multipliers, as shown in Fig. 9(a). The reason is that deeper networks multiply the computation by the number of repeated layers. Table 3 shows the results for the shrunk child networks. For example, a shrunk MobileNetV2 has 0.30M parameters and 12M multiply-accumulate operations (Madds), being 7.47× and 25× smaller than the baseline, respectively. With such small computation, the shrunk MobileNetV2 has a latency of approximately 12.07 ms, which is 2.19× less than the baseline. By contrast, the child networks of VGG16 and MobileNet are affected by both the depth and width multipliers, as shown in Figs. 9(b) and 9(c). For the child networks of VGG16, the high-latency group (LAT > 80 ms) corresponds to ×1.0:1.0 (width:depth multipliers), the low-latency group (LAT < 40 ms) is the set {(0.25:γ), (0.5:γ ≤ 0.75)}, and the mid-latency group (40 ≤ LAT ≤ 80 ms) corresponds to the other conditions. In the case of the child networks of MobileNet, the low-latency group (LAT < 12 ms) is α:γ ≤ 0.5 (width:depth multipliers), the high-latency group (LAT > 14 ms) is 1.0:γ ≥ 0.75, and the mid-latency group (12 ≤ LAT ≤ 14 ms) corresponds to the other conditions. The shrunk VGG16 has a 16.16-ms latency with 43M parameters and 378M Madds, being 6.58× faster and 41× smaller than the baseline.

3) RESULTS
a: LATENCY AND ACCURACY DISCUSSION
Regarding accuracy, we observe that all the multipliers are interdependent. For a lower-resolution image, the network depth should be decreased to reduce the computation, because excessive depth diminishes the accuracy gain and makes training more difficult owing to the vanishing-gradient problem. The network width should also be decreased to avoid difficulty in capturing higher-level features when the network is extremely wide and shallow. For example, Fig. 10(a) shows an accuracy graph of the network-depth analysis of MnasNet with ×0.75:0.5 (width:resolution multipliers). Too low a network depth (γ = 0.25) prevents the network from capturing complex features, and too high a network depth (γ ≥ 0.75) yields no further accuracy gain. In addition, Fig. 10(b) shows an accuracy graph of the network-width analysis of MnasNet with ×0.5:0.5 (depth:resolution multipliers). Too low a network width (α ≤ 0.5) prevents the network from capturing all the fine-grained features, and too high a network width (α = 1.0) hinders the capture of higher-level features. For these reasons, using balanced multipliers leads to child networks with higher accuracy. In Fig. 10, the black markers indicate the highest score of each child network, as described in Table 3. Although some of the highest scores correspond to lower accuracy, their latencies are significantly lower, which improves their scores relative to the more-accurate networks.
In summary, balanced multipliers and shallower networks provide the highest score, obtained with ×0.75:0.5:0.25 (width:resolution:depth multipliers) for MnasNet. Table 3 shows the highest-performing result of each pre-trained network in the network-shrinking experiment. The best shrunk MnasNet achieves 95.93% accuracy and 12.29-ms latency with 0.78M parameters and 26M Madds; it is more accurate than the baseline and 12× smaller. The training time is 4.46 days on Nvidia GTX 1080Ti GPUs (749 GPU hours), as shown in Table 4.
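The "score" used to rank child networks trades accuracy off against latency. One common formulation is the MnasNet-style soft-constrained objective, ACC × (LAT/TAR)^w; the exact objective, target latency, and exponent used in the paper are not stated here, so the values below are illustrative assumptions:

```python
def objective_score(accuracy, latency_ms, target_ms=15.0, w=-0.07):
    """MnasNet-style objective: ACC * (LAT / TAR)^w.

    With w < 0, models slower than the target latency are penalized
    and faster models are rewarded, so a slightly less accurate but
    much faster child can still rank first.  target_ms and w are
    illustrative, not values from the paper.
    """
    return accuracy * (latency_ms / target_ms) ** w

# A fast, slightly less accurate child outscores a slower, more accurate one.
fast = objective_score(0.9593, 12.29)
slow = objective_score(0.9700, 28.00)
print(fast > slow)  # True
```

This matches the observation above that some highest-scoring children have lower accuracy but significantly lower latency.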

b: SHRINKING AND PRUNING COMPARISON
The highest-performing shrunk networks outperform the pruned networks in accuracy, latency, and Madds. For example, a shrunk MobileNetV2 achieves 95.69% accuracy and 12.07-ms latency with 12M Madds, being 1.56% more accurate, 2.19× faster, and 25× smaller than the baseline. The pruned MobileNetV2s (hard and soft pruned), by contrast, have 88.83%-92.89% accuracy and 21.62-24.71-ms latency with 69-147M Madds; these networks are 1.23%-5.3% less accurate, 1.07-1.22× faster, and 2.04-4.35× smaller than the baseline. However, the pruned networks can have fewer parameters than the shrunk networks; e.g., a pruned MobileNetV2 has 12.4× fewer parameters than the baseline, whereas a shrunk one has 7.47× fewer. In the case of soft pruning, the parameter count equals that of the baseline because this method prunes CNN filters by setting their parameters to zero rather than removing them.
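The soft-pruning behavior described above (filters zeroed, not removed, so the parameter count stays at the baseline) can be sketched as follows; this is a minimal NumPy illustration of L2-norm-based soft filter pruning, not the paper's implementation:

```python
import numpy as np

def soft_prune_filters(weights, prune_ratio):
    """Soft filter pruning: zero out the filters with the smallest L2 norm.

    `weights` has shape (out_channels, in_channels, kH, kW).  The pruned
    filters are set to zero rather than removed, so the parameter count
    stays the same as the baseline -- matching the soft-pruned results
    discussed in the text.
    """
    norms = np.linalg.norm(weights.reshape(weights.shape[0], -1), axis=1)
    n_prune = int(weights.shape[0] * prune_ratio)
    prune_idx = np.argsort(norms)[:n_prune]  # weakest filters
    pruned = weights.copy()
    pruned[prune_idx] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))           # a toy conv layer: 8 filters
pruned = soft_prune_filters(w, 0.25)
print((np.linalg.norm(pruned.reshape(8, -1), axis=1) == 0).sum())  # 2
```

Hard pruning would instead delete the two rows, shrinking the tensor to (6, 4, 3, 3) and actually reducing the stored parameters.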

c: SHRINKING AND QUANTIZATION COMPARISON
According to [38], quantized networks have the same numbers of parameters and Madds as the baseline but require 2-4× less memory bandwidth and storage. Most shrunk networks are better than the quantized networks in accuracy and latency; e.g., a shrunk MnasNet has 95.93% accuracy (1.43% higher than the baseline) and 12.29-ms latency (2.28× faster than the baseline), whereas a quantized MnasNet has 88.09% accuracy (6.41% lower than the baseline) and 13.96-ms latency (2.01× faster than the baseline).
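The 2-4× storage saving of quantization comes from storing 8-bit integers plus a scale and zero point instead of 32-bit floats, while the parameter and Madds counts are unchanged. A minimal sketch of post-training affine 8-bit quantization (an illustration, not the scheme of [38]):

```python
import numpy as np

def quantize_8bit(w):
    """Post-training affine quantization of a weight tensor to 8 bits.

    Stores unsigned 8-bit integers plus a scale/zero-point, giving
    roughly 4x smaller storage than float32 while keeping the same
    number of parameters and Madds.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = np.round(-lo / scale).astype(np.int32)
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, s, z = quantize_8bit(w)
w_hat = dequantize(q, s, z)
print(np.max(np.abs(w - w_hat)) <= s)  # True: error within one quantization step
```

The rounding error is what degrades accuracy (6.41% for the quantized MnasNet above) unless the network is fine-tuned or quantization-aware trained.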

d: SHRINKING AND LOW-RANKING COMPARISON
In terms of accuracy, latency, parameters, and Madds, most shrunk networks outperform the low-rank factorized networks. For example, a shrunk VGG16 has 93.51% accuracy and 16.16-ms latency with 43M parameters and 378M Madds; it is 7.75% more accurate and 6.58× faster than the baseline, with 41× fewer Madds and 3.12× fewer parameters. A low-rank factorized VGG16, by contrast, has 82.49% accuracy and 89.16-ms latency with 120M parameters and 403M Madds; it is 3.27% less accurate and 1.19× faster than the baseline, with 38.46× fewer Madds and 1.12× fewer parameters. However, some low-rank factorized networks have fewer parameters than the shrunk networks; e.g., a low-rank factorized MobileNet has 12.84× fewer parameters than the baseline, whereas a shrunk one has 1.92× fewer.
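Low-rank factorization replaces one weight matrix with the product of two thinner ones. A minimal sketch using truncated SVD (an illustration of the general technique, not the adaptive mixture of low-rank factorizations compared above):

```python
import numpy as np

def low_rank_factorize(w, rank):
    """Factor a dense layer W (m x n) into A (m x r) @ B (r x n) via
    truncated SVD.  One matrix multiply becomes two cheaper ones, and
    the parameter count drops from m*n to r*(m + n) when r is small.
    """
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x r, singular values folded in
    b = vt[:rank, :]             # r x n
    return a, b

rng = np.random.default_rng(1)
# A matrix that is exactly rank 4, so a rank-4 factorization is lossless.
w = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 128))
a, b = low_rank_factorize(w, 4)
print(a.shape, b.shape)              # (64, 4) (4, 128)
print(np.allclose(a @ b, w))         # True
print(a.size + b.size, "<", w.size)  # 768 < 8192
```

Real weight matrices are only approximately low-rank, so truncation loses information; that approximation error is one source of the accuracy drop reported for the low-rank factorized VGG16.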

e: SHRINKING AND KNOWLEDGE DISTILLATION COMPARISON
According to [29] and [51], distilled networks have the same structure and number of parameters as the shrunk networks. Most shrunk networks are better than the distilled networks in terms of accuracy, latency, and Madds; e.g., a shrunk MobileNetV2 has 95.69% accuracy (1.56% higher than the baseline), 12.07-ms latency (2.19× faster than the baseline), and 12M Madds (25× fewer than the baseline), whereas distilled MobileNetV2s have 91.74%-93.82% accuracy (0.30%-2.39% lower than the baseline), 12.92-ms latency (2.14× faster than the baseline), and 44M Madds (6.82× fewer than the baseline). The Madds of the distilled networks differ from those of the shrunk networks owing to the image resolution (the distilled and shrunk networks use ×1.0 and ×0.5 of the original resolution, respectively).
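Knowledge distillation as in [29] trains the small network against the teacher's softened outputs as well as the hard labels. A minimal NumPy sketch of the Hinton-style loss (the temperature and mixing weight here are illustrative, not the paper's settings):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style distillation: blend hard-label cross-entropy with a
    soft-target cross-entropy computed at temperature T.  The soft term
    is scaled by T^2 so its gradient magnitude matches the hard term.
    T and alpha are illustrative choices."""
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels]).mean()
    p_teacher = softmax(teacher_logits, T)
    p_soft = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_soft)).sum(axis=-1).mean() * T * T
    return alpha * hard + (1 - alpha) * soft

logits_s = np.array([[2.0, 0.5, 0.1]])   # student outputs (toy values)
logits_t = np.array([[3.0, 0.2, 0.0]])   # teacher outputs (toy values)
loss = distillation_loss(logits_s, logits_t, labels=np.array([0]))
print(loss > 0)  # True
```

Because only the training loss changes, the distilled network keeps the student's structure and parameter count, as noted above.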
In summary, the shrinking technique achieves the trade-off between accuracy and latency. With balanced multipliers, the shrunk networks with the highest scores have higher accuracy, lower latency, and fewer parameters and Madds, as shown in Table 3. Furthermore, in terms of accuracy and latency, the shrunk networks perform better than the pruned, quantized, low-ranked, and distilled networks.

B. EXPERIMENTS ON HYBRID OF PRUNING CONVOLUTIONAL KERNELS AND SHRINKING STRUCTURE OF NETWORK (SHRINKING WITH TRAINING ACCELERATOR)
1) SEARCH SPACE
The search space of the hybrid approach was the same as that of the shrinking approach (Section IV-A-1), and the search strategy was a grid search.

2) TRAINING DETAILS
After shrinking the pre-trained networks in both resolution and depth and training the child networks for 50 epochs, we pruned the filters four times, removing 25% of the filters each time, with 10 recovery epochs after each pruning step. The recovery learning rate was set to 3 × 10−4 for VGG16 and MobileNet and to 8.5 × 10−3 for MobileNetV2 and MnasNet.
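The prune-and-recover schedule above can be sketched as follows. The `train` and `prune_filters` helpers are stand-ins for the actual training code, and the sketch interprets "25% of the filters" as 25% of the remaining filters each round (the text is ambiguous on whether the fraction refers to the original count):

```python
def prune_and_recover(model, train, prune_filters,
                      n_rounds=4, ratio_per_round=0.25, recover_epochs=10):
    """Hybrid schedule from the text: after the 50-epoch child training,
    prune 25% of the filters four times, recovering for 10 epochs after
    each pruning step."""
    for _ in range(n_rounds):
        model = prune_filters(model, ratio_per_round)
        model = train(model, epochs=recover_epochs)
    return model

# Toy stand-ins: the "model" is just a filter count, and the helpers
# log their calls so the schedule can be inspected.
log = []
remaining = prune_and_recover(
    64,
    train=lambda m, epochs: (log.append(("train", epochs)), m)[1],
    prune_filters=lambda m, r: (log.append(("prune", r)), int(m * (1 - r)))[1],
)
print(remaining)                  # 64 -> 48 -> 36 -> 27 -> 20
print(log.count(("train", 10)))  # 4 recovery phases
```

The recovery epochs are what make this a "training accelerator": the pruned child reuses the already-trained weights instead of being retrained from scratch.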

3) RESULTS
With the training accelerator, the training time is reduced from 749 to 449 GPU hours (2.67 days on Nvidia GTX 1080Ti GPUs), being 1.67× faster than the shrinking experiment. The child network with the highest score for each pre-trained network is shown in Fig. 11 and described in Table 5. Almost all the child networks in the shrinking-with-training-accelerator experiment have higher latency than in the shrinking experiment because both the parameters and Madds increase in number. For example, the result for MobileNet at ×0.5:0.75:0.75 (width:resolution:depth multipliers), shown in Fig. 12, has 13.16-ms latency with 1.18M parameters and 144M Madds, being 1.03× slower than a shrunk one, with 1.74× more parameters and 1.57× more Madds. The shrinking-with-training-accelerator technique improves the accuracy over the shrinking technique of Section IV-A, especially at 25%-50% pruning (×0.5-0.75 width multiplier); e.g., the result for MobileNet at ×0.5:0.75:0.75 (width:resolution:depth multipliers) has 93.48% accuracy, 4.25% better than a shrunk one. By ranking the CNN filters, the shrinking-with-training-accelerator technique not only retains the important features but also removes the redundant ones.
In summary, the shrinking-with-training-accelerator approach reduces the training time by 40% relative to the shrinking experiment. Although their latency increased, almost all the child networks gained accuracy when the width multiplier was between ×0.5 and ×0.75.

C. EXPERIMENTS WITH TrimNAS
1) SEARCH SPACE
The search space of TrimNAS was described in Section III-B-4.

2) TRAINING DETAILS
TrimNAS searched for 100 epochs at each resolution. Each epoch of the TrimNAS search comprised 350 shared-parameter training steps and 50 controller training steps. In the shared-parameter (ω) training phase, we used Nesterov momentum [47] with a learning rate of 0.001-0.05, T0 = 10, and Tmul = 2. In the controller-parameter (θ) training phase, all the parameters were initialized uniformly in the range [−0.1, 0.1] and optimized with Adam at a learning rate of 0.001. To prevent premature convergence, we used a tanh constant of 2.5 and a temperature of 5.0 [48] for the controller's samples. The skip-connection probability was 0.6.
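The learning-rate range with T0 = 10 and Tmul = 2 corresponds to cosine annealing with warm restarts (SGDR), which we assume is the schedule used in the shared-parameter phase; a minimal sketch:

```python
import math

def sgdr_lr(epoch, lr_min=0.001, lr_max=0.05, T0=10, Tmul=2):
    """Cosine annealing with warm restarts (SGDR), using the lr range,
    T0, and Tmul reported for the shared-parameter training phase.

    The lr decays from lr_max to lr_min along a cosine curve, then
    restarts; cycle lengths grow by Tmul, so restarts occur at epochs
    10, 30, 70, ...
    """
    t, T_i = epoch, T0
    while t >= T_i:        # locate position within the current cycle
        t -= T_i
        T_i *= Tmul
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T_i))

print(round(sgdr_lr(0), 3))   # 0.05  (start of first cycle)
print(round(sgdr_lr(10), 3))  # 0.05  (warm restart)
```

The periodic restarts keep the shared parameters from settling into a narrow minimum while the controller is still changing which sub-networks it samples.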

3) RESULTS
The TrimNAS technique achieves a better trade-off between accuracy and latency than shrinking or the hybrid of shrinking and pruning; e.g., TrimNASnet with the ×0.5 resolution multiplier (TrimNASnet-0.5) has 96.09% accuracy and 9.14-ms latency, being 0.04%-0.16% more accurate and 1.34-1.35× faster than a shrunk MnasNet and one with the training accelerator. TrimNASnet-0.5 has 0.61M parameters and 18M Madds, i.e., 1.27-1.48× fewer parameters and 1.44-1.67× fewer Madds than a shrunk MnasNet and one with the training accelerator. By shrinking at the block level, TrimNAS can search a larger space than shrinking or shrinking with the training accelerator to achieve a high objective score. Fig. 13 shows the architectures of the TrimNASnets. The TrimNAS technique reduces the training time significantly, from 749 to 455 GPU hours (2.70 days on Nvidia GTX 1080Ti GPUs), being 1.65× faster than the shrinking experiment but 6 GPU hours more than the shrinking-with-training-accelerator experiment. Moreover, the resolution multiplier affects the accuracy, latency, and training time of TrimNASnet: the higher the resolution multiplier, the greater the accuracy, latency, and training time. For example, TrimNASnet-0.25 has 94.97% accuracy, 9.10-ms latency, and 87.7 GPU hours of training, which is 1.24% less accurate, 1.23× faster, and 1.61× less training time than TrimNASnet-1.0. The highest score among the TrimNASnets is that of TrimNASnet-0.5, as shown in Table 5.
In summary, the TrimNAS approach reduces the training time by 39% relative to the shrinking experiment. With an enormous search space, the TrimNASnets achieve the best trade-off between accuracy and latency in this study.

D. EVALUATION OF THE TACTILE OBJECT RECOGNITION SYSTEM
Table 6 confirms the high accuracy and processing speed achieved in TOR. The reported accuracies are those of the MTNS child networks: the TOR captures sufficient information in the data of the random-dot sensor, and the MTNS child networks are efficient learners. The processing speed was measured by the overall latency of TOR, computed by summing the capturing time of the random-dot sensor and the inference time (latency) of the MTNS child networks. The capturing time of the present random-dot sensor was 33.33 ms (obtained from the 30-Hz frame rate of the camera sensor part). As discussed in Section IV-C, TrimNASnet-0.5 obtained the highest objective score of MTNS. The latency of the TOR based on TrimNASnet-0.5 was 42.47 ms, 1.07-1.44× faster than the baseline network (TOR based on MnasNet) and the other networks compressed using the MTNS approach. In summary, the random-dot sensor and the MTNS approach together achieve a high-performing TOR.
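The overall-latency computation above is simply one camera frame plus one network inference, which can be checked directly:

```python
def tor_latency(capture_fps, inference_ms):
    """Overall TOR latency: sensor capture time (one frame at the
    camera's frame rate) plus the network's inference latency."""
    return 1000.0 / capture_fps + inference_ms

# Random-dot sensor at 30 Hz + TrimNASnet-0.5 at 9.14-ms inference.
print(round(tor_latency(30, 9.14), 2))  # 42.47 ms, matching Table 6
```

With the capture time fixed at 33.33 ms by the camera, the inference latency is the only term the MTNS approach can reduce, which is why the latency gap between compressed networks (9-16 ms) narrows to 1.07-1.44× at the system level.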

V. CONCLUSION
In this article, we proposed a random-dot sensor, a novel OTS without modality fusion for the one-shot capture task, and the MTNS approach, a new robust combination of NC techniques for the accuracy-latency trade-off. For TOR, our random-dot sensor obtains sufficient information in a single touch, and our MTNS approach provides a lightweight network with no performance drop. The results of this study achieve high TOR performance in terms of accuracy and processing time compared with modern NC methods. We also demonstrated that the training accelerator and search strategy can significantly reduce the training time of the MTNS technique.