Sequential Image-based Attention Network for Inferring Force Estimation without Haptic Sensor

Humans can infer approximate interaction force between objects from only vision information because we already have learned it through experiences. Based on this idea, we propose a recurrent convolutional neural network-based method using sequential images for inferring interaction force without using a haptic sensor. For training and validating deep learning methods, we collected a large number of images and corresponding interaction forces through an electronic motor-based device. To concentrate on changing shapes of a target object by the external force in images, we propose a sequential image-based attention module, which learns a salient model from temporal dynamics. The proposed sequential image-based attention module consists of a sequential spatial attention module and a sequential channel attention module, which are extended to exploit multiple sequential images. For gaining better accuracy, we also created a weighted average pooling layer for both spatial and channel attention modules. The extensive experimental results verified that the proposed method successfully infers interaction forces under the various conditions, such as different target materials, illumination changes, and external force directions.


Introduction
Of the five basic human senses, touch is an important perceptional modality for understanding the relationship between our surroundings and us. It offers complementary information that helps comprehend the environment. From this perspective, touch or tactile sensing has been an attractive subject of research in robotics and haptics for many years [9][4] [13][16] [29]. The main physical property needed for grasping and interacting with objects is the force of interaction. When a robotic hand attempts to grasp an object, a contact-type haptic sensor is used to measure the force of interaction between the device and the object. This improves the success rate of the gripping and enables precise hand manipulations [14]. In the case of humans, the visual information sensed through the eyes is used in addition to tactile sensations when grasping an object. Through visual information, we perceive the shape, appearance, and texture of an object, and infer the tactile memory learned through the past experience. From the perspectives of neuroscience and psychophysics, Ernst and Banks [6] investigated the mechanism of information sharing between the senses of vision and touch. Newell et al. [17] showed that the human brain employs shared models of objects across multiple sensory modalities, e.g., vision and tactile sensing, so that knowledge can be transferred from one to another.
Inspired by the knowledge transfer from vision to touch [21], we propose a vision sensor-based method that simulates tactile sensing, a different modality, using only visual information. When humans try to touch an object, they can recall how it feels before touching it by summoning past experiences. Specifically, if we know what an object is and can observe how its appearance changes when a finger presses it, we can predict the interaction force between the object and the finger from past experience. Another advantage of the proposed method is that, compared with contact-type haptic sensors, a non-contact-type sensing method can measure the haptic force consistently because the camera sensor does not wear out even after long use. Moreover, as an additional touch sensor does not need to be attached to the instrument, the mechanism of the instrument can be miniaturized. In this paper, our computational approach is based on learning haptic information from past human experiences. It rests on two pivotal rules: we predict the interaction forces on the target object using only sequential images, and, for simplicity, we assume that the target objects have been touched in advance, analogous to past experience. For this purpose, we collected more than 300,000 images of different objects under a variety of conditions and used the corresponding databases to train and validate the proposed method.
From the viewpoint of predicting haptic information from images, the basic deep learning architecture is a convolutional neural network (CNN)-based recurrent neural network (RNN) [5]. As in human perception, we use the CNN to analyze the types of target objects and the changes in their appearance in the images, and we feed the temporal changes across images into the RNN to estimate the interaction force. In constructing the network, we expect that the attention mechanism [22] [23], which focuses only on important image regions and has been used for visual question answering (VQA) [1], helps improve the accuracy of force prediction. However, the main difference between the proposed method and previously developed attention networks [22] [23], which have commonly been designed as single image-based attention mechanisms, is that we use a temporal dynamics-based attention method that exploits sequential images to predict the interaction forces, because the appearance changes of the target object between sequential images play a pivotal role in inferring haptic information. As the number of CNN features increases with the sequential images, there is a growing need for a method that efficiently processes the large amount of information generated. For this purpose, we propose a sequential image-based attention method consisting of a sequential image-based spatial attention module (SSAM) and a sequential image-based channel attention module (SCAM) to attain higher accuracy in predicting haptic information. By developing the attention module based on sequential images independently of the RNN, as shown in Fig. 1, the concentrated information can be inferred clearly to predict the haptic force based on changes in the appearance of the target object.
Moreover, we use separate spatial and channel attention modules, each based on the proposed weighted average pooling (WAP) method, to handle the large amount of information generated by the sequential images.
The main contributions of this paper are as follows: (1) We propose a computational method for predicting the haptic interaction force from visual information only, without a haptic sensor. (2) We propose sequential image-based attention modules that efficiently process the increased number of convolutional features arising from the sequential images while obtaining more accurate haptic information. (3) We collected a large number of sequential images of objects and the corresponding information concerning the interaction forces on the objects by using an automatic mechanism.

Related Work
Studies have been conducted to measure interaction forces without force sensors. In [2], a stereo camera was used to reconstruct a 3D artificial heart surface and a supervised learning method was applied to predict the applied force. In [32], a video-based method to estimate the interaction force between a human body and an object was proposed using 3D modeling information. In [18], a single RGB-D camera-based method was used to estimate the contact forces between a human hand and an object. In [7], a deep learning-based hand action prediction method was proposed using only visual information. It can predict the force exerted by the fingertips using the proposed networks. In [12], the authors focused on predicting the interaction force using visual changes to the target objects by using a simple RNN-based method. Their work is the first to focus on predicting the interaction force using only images without additional sensors. However, the proposed RNN-based method does not have deeper layers to effectively train all variations in visual changes, such as illumination and pose changes, at the same time. To solve this problem, we employ the basic framework of the CNN-based RNN method in which the CNN first analyzes variations in salient visual features using the proposed sequential image-based attention module, and the RNN works on the serialized features to predict the final interaction force. Our proposed method can thus attain more robust accuracy with respect to variations in conventional images, such as different objects, various illumination conditions, and camera pose variations.
From the viewpoint of attention-based networks, CNN-based attention mechanisms have been widely studied for tasks such as image caption generation [24], image classification [31] [23], and visual sentiment analysis [28]. For action recognition, LSTM-based attention was proposed in [15] to learn attention weights for the LSTM, and the global long-sequence attentive network [27] was designed around spatial attention, building on a sub-sequence attention network that uses sub-skeleton images.
In [25], attentive spatial-temporal pooling was proposed for video-based person re-identification; it uses the similarity scores of two videos to compute attention vectors, which are then used to pool the RNN outputs. Most temporal attention methods for video-based recognition have been developed to improve the LSTM baseline rather than the CNN itself; in this paper, by contrast, we propose a sequential image-based attention method that enhances the convolutional features.

Baseline Method
The baseline algorithm [5] consists of CNN (visual feature extraction) and RNN (temporal dynamics modeling) and is described as follows.
Visual Feature Extraction The CNN has recently become a well-known method for representing images. For sequential data, each frame can be represented by its corresponding CNN feature. The $t$th image frame $I_t$ is passed as input to the visual extractor $V$, and the CNN generates the fixed-length visual feature representation $X_t = V(I_t)$. To confirm the feasibility of our model, we use a variant of the VGG model [19], a common deep CNN architecture, as the encoder. We extract feature maps from the last pooling layer. The convolutional features of each frame are considered one chunk for an input step of the RNN. The resulting frame-level vector is fed into our long short-term memory (LSTM) architecture.
Sequential LSTM Model Given a frame-level feature vector $X_t$ in the sequential frames, we use $X_t$ as input to the LSTM, which is known to perform well on many sequential problems [5] [26]. To extract sequential features, we apply an LSTM comprising self-recurrent units and a memory cell, which can store information over several dozen time steps. We use the bidirectional LSTM (BLSTM) [8], derived from the LSTM, which considers all available information from both the past and the future. As the BLSTM processes the input in two directions, i.e., from past to future and from future to past, there are two hidden-state outputs. We combine them at the last time step and send them to a fully connected layer.

Sequential Image-based Attention Module for inferring haptic information
In this section, we describe the dynamic attention module designed to model the interaction between objects by using sequential images. As described in Section 3.1, we use the CNN-based RNN module as the baseline for analyzing sequential images. The CNN first extracts the visual features of each frame, which are passed to the RNN to predict the interaction forces based on complex temporal dynamics. The sequential attention module focuses on salient regions and considers temporal dynamic information simultaneously, as illustrated in Fig. 1.
Sequential Image-based Spatial Attention Module (SSAM) In general, an interaction between objects occurs in the region that is touched; therefore, applying a global image feature may lead to a sub-optimal result because irrelevant regions are also considered. To solve this problem, spatial attention mechanisms have been proposed in many previous works [1] [22] [23]. Such a mechanism focuses on the key regions of an image by excluding less important ones, and has yielded improvements in performance. However, most previous works [1] [31] [28] compute attention on the convolutional features of a single frame only, while temporal attention methods [15] [25] have been proposed for the RNN rather than for the convolutional features. As the purpose of this work is to predict the interaction force between objects in sequential images, the dynamic information of each convolutional feature is also important. Therefore, instead of extracting an attention map from a single frame only, our attention module exploits multiple adjacent frames to generate an accurate attention map that accounts for the dynamics of the convolutional features. The overall procedure is illustrated in Fig. 2 (a).
We represent the convolutional feature of the $t$th frame as $X_t \in \mathbb{R}^{H \times W \times C}$. Unlike the existing methods [1] [22] [23], we make use of the previous frames to extract more salient attention information. For predicting interaction forces with a camera, how the appearance of an object changes across sequential images matters more than any single image. Accordingly, the convolutional features of $n$ sequential frames are concatenated as $X_{tn} = [X_{t-n+1}, \ldots, X_t]$, with $X_{tn} \in \mathbb{R}^{H \times W \times nC}$. The SSAM process can be summarized as follows:

$$X'_t = M_s(X_{tn}) \otimes X_t,$$

where $\otimes$ denotes the element-wise multiplication operation, $M_s \in \mathbb{R}^{H \times W \times 1}$ represents the sequential image-based spatial attention map, and $X'_t \in \mathbb{R}^{H \times W \times C}$ is the spatial-wise excited feature map.
To squeeze the concatenated feature map $X_{tn}$ by using the proposed WAP for spatial information, we use a $1 \times 1$ convolution kernel $\omega_s \in \mathbb{R}^{1 \times 1 \times nC}$ to generate the projection tensor $Y_s \in \mathbb{R}^{H \times W}$:

$$Y_s = \omega_s * X_{tn},$$

where $*$ denotes the convolution operation and $X_{tn} \in \mathbb{R}^{H \times W \times nC}$ represents the concatenated convolutional features from the $(t-n+1)$th to the $t$th image. Each $y_{i,j}$ of $Y_s$ is a linear combination of all $nC$ channels at spatial location $(i, j)$. To generate the attention map, the projected map $Y_s$ passes through a convolution layer, and the sigmoid function is applied as follows:

$$M_s(X_{tn}) = \sigma(\mathrm{Conv}(Y_s) + b),$$

where $\sigma$ is the sigmoid function, and $b$ is the bias parameter.
Sequential Image-based Channel Attention Module (SCAM) Similar to the SSAM, the proposed SCAM generates salient features by exploiting the channel information of frames adjacent to the given one. As the amount of channel information increases because of multiple images, so does redundant channel information. In this case, as noted in [30], non-salient channel information causes the problem of distraction. To solve this issue, we use the self-gating attention module based on channel dependence [11] in the proposed channel-wise WAP method. Fig. 2 (b) describes the overall block architecture of the SCAM.
The set of visual features of sequential frames $X_{tn} = [X_{t-n+1}, \ldots, X_t]$ is given as input as follows:

$$X'_t = M_c(X_{tn}) \otimes X_t,$$

where $M_c \in \mathbb{R}^{1 \times 1 \times nC}$ represents the sequential channel attention map, and $X'_t \in \mathbb{R}^{H \times W \times C}$ is the final refined feature map. To squeeze the concatenated feature map $X_{tn}$ along the channel axis, we use a $1 \times 1$ convolution kernel $\omega_c \in \mathbb{R}^{1 \times 1 \times (H \cdot W)}$ after reshaping $X_{tn}$ to obtain the squeezed vector $Y_c \in \mathbb{R}^{1 \times nC \times 1}$. Each $y_k$ of $Y_c$ is a linear combination of all spatial positions in channel $k$. The output then passes through two MLP layers to capture non-linear dependencies, and the sigmoid function is applied as follows:

$$M_c(X_{tn}) = \sigma(\mathrm{MLP}(Y_c)),$$

where $F_0 \in \mathbb{R}^{(nC/r) \times nC}$ and $F_1 \in \mathbb{R}^{nC \times (nC/r)}$ are the parameter weights of the multilayer perceptron, and $r$ is the reduction ratio.
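The two modules described above can be illustrated with a minimal pure-Python sketch operating on nested lists indexed as `X[i][j][k]` (height, width, channel). Several details here are assumptions, since the text does not specify them: the odd square kernel with zero padding for the spatial convolution, the ReLU between the two MLP layers, the choice to gate the current frame with the last C entries of the nC-dimensional channel map, and all function names. It is meant to show the data flow only, not to reproduce the trained network.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def concat_frames(frames):
    """Concatenate n feature maps (each H x W x C) along the channel axis."""
    H, W = len(frames[0]), len(frames[0][0])
    return [[[v for f in frames for v in f[i][j]] for j in range(W)]
            for i in range(H)]

def ssam(frames, w_s, kernel, bias):
    """Sequential spatial attention: returns (M_s, refined current frame)."""
    x_tn = concat_frames(frames)
    H, W = len(x_tn), len(x_tn[0])
    # Spatial WAP: a 1x1 convolution is a per-pixel weighted sum over nC channels.
    y = [[sum(w * v for w, v in zip(w_s, x_tn[i][j])) for j in range(W)]
         for i in range(H)]
    # Convolution layer (cross-correlation, zero padding; kernel size assumed),
    # followed by the bias and the sigmoid.
    k = len(kernel) // 2
    m_s = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += kernel[di + k][dj + k] * y[ii][jj]
            m_s[i][j] = sigmoid(acc)
    x_t = frames[-1]  # the current frame is the last one in the sequence
    refined = [[[m_s[i][j] * v for v in x_t[i][j]] for j in range(W)]
               for i in range(H)]
    return m_s, refined

def scam(frames, w_c, f0, f1):
    """Sequential channel attention: returns (M_c, refined current frame)."""
    x_tn = concat_frames(frames)
    H, W, nC = len(x_tn), len(x_tn[0]), len(x_tn[0][0])
    C = len(frames[-1][0][0])
    # Channel WAP: weighted sum over all H*W positions, one value per channel.
    y = [sum(w_c[i * W + j] * x_tn[i][j][c] for i in range(H) for j in range(W))
         for c in range(nC)]
    # Two MLP layers (F0: nC -> nC/r, F1: nC/r -> nC); the ReLU is an assumption.
    hidden = [max(0.0, sum(a * b for a, b in zip(row, y))) for row in f0]
    m_c = [sigmoid(sum(a * b for a, b in zip(row, hidden))) for row in f1]
    gate = m_c[-C:]  # assumption: the last C entries gate the current frame
    x_t = frames[-1]
    refined = [[[g * v for g, v in zip(gate, x_t[i][j])] for j in range(W)]
               for i in range(H)]
    return m_c, refined
```

Because both functions return the attention map alongside the refined feature, the map can be inspected separately, which is convenient for visualizations of where the spatial module concentrates.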

Weighted Average Pooling (WAP) Recent works [10] [11] have used global average pooling (GAP) to calculate the spatial average of the convolutional feature map, and this type of pooling helps achieve better accuracy in visual recognition. GAP efficiently encodes several convolutional feature maps into a vector of limited size. Therefore, many attention methods [11] [23] have adopted GAP for its simplicity and efficiency. However, we question the underlying assumption that all features deserve equal weight, and propose the WAP method, which encodes the convolutional features according to their importance for spatial and channel attention, respectively. In this paper, the proposed attention method for CNN features handles an increased amount of feature information due to the sequential images, and it is difficult to summarize this information well with uniform weights, as GAP does. For example, if the channel attention is obtained using two frames, the channel information extracted from the current frame should be given a higher weight than the information extracted from the previous frame. Therefore, the proposed WAP encourages the network to emphasize more discriminative features when the amount of feature information grows with the sequential images.
As shown in Fig. 2 (a), to average the channel information using different weights, the convolutional feature $X \in \mathbb{R}^{H \times W \times nC}$ is split into $\{x_1, x_2, \ldots, x_{nC}\}$ with $x_k \in \mathbb{R}^{H \times W}$. We calculate the weighted average by multiplying each element of the weight vector $w \in \mathbb{R}^{nC}$ with the corresponding spatial map. In practice, we implement this simply as a $1 \times 1$ convolution operation. A similar approach can be used to average spatial information using different weights, as shown in Fig. 2 (b). In this case, we flatten the tensors of the convolutional feature maps, e.g., $X \in \mathbb{R}^{H \times W \times nC} \rightarrow \mathbb{R}^{1 \times nC \times (H \cdot W)}$, and apply a $1 \times 1$ convolution to obtain different weights for the spatial regions.
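To make the relation between the two pooling schemes concrete, the following small sketch treats GAP as the special case of WAP in which every weight equals 1/N. The function names and the hand-picked weights are hypothetical; in the method described above the weights are learned by the 1×1 convolution and need not sum to one.

```python
def weighted_average_pool(values, weights):
    """Weighted sum of `values`; with uniform weights 1/N it reduces to GAP."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, values))

def global_average_pool(values):
    """Plain average, expressed through the weighted version."""
    n = len(values)
    return weighted_average_pool(values, [1.0 / n] * n)

# One channel response from the previous frame and one from the current frame.
responses = [2.0, 4.0]
gap = global_average_pool(responses)                      # equal weights
wap = weighted_average_pool(responses, [0.25, 0.75])      # favors the current frame
```

With these illustrative numbers, the weighted result lies closer to the current-frame response than the plain average does, which mirrors the two-frame example given in the text.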

Ensemble Module
The ensemble network has shown better accuracy in many applications [20] [3]. To combine the attention networks, Woo et al. [23] designed serialized spatial and channel-wise attention modules under a single network. However, in this study, we trained the SSAM and SCAM independently and calculated the average of their results based on the late fusion rule. A major reason for this merging using late fusion is that the two proposed attention mechanisms play different roles and focus on different characteristics to infer the forces. SSAM focuses on the spatial regions in images, whereas SCAM is responsible for evaluating which channels of the convolution layer are important. Learning the two attention methods, SSAM and SCAM, the characteristics of which are different under a single network, is challenging. Moreover, we used multiple images to learn more temporal dynamics for better performance. The amount of information to be assessed by the proposed method increased compared with that in the single image-based attention method, and separately training the SSAM and the SCAM is a better choice in terms of efficiency.
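The late fusion rule amounts to averaging the force predictions of the two independently trained networks. A minimal sketch follows, assuming each model outputs one scalar force per sample; the function and argument names are hypothetical.

```python
def late_fusion(ssam_forces, scam_forces):
    """Average per-sample force predictions from two independently trained models."""
    if len(ssam_forces) != len(scam_forces):
        raise ValueError("both models must predict the same number of samples")
    return [(a + b) / 2.0 for a, b in zip(ssam_forces, scam_forces)]

# Illustrative predictions in newtons (made-up values, not results from the paper).
fused = late_fusion([1.0, 3.0], [2.0, 5.0])
```

Because fusion happens only at the output, neither network needs to know about the other during training, which is the efficiency argument made above.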

Experimental Setup and Database
For a fair training and validation protocol, we built a data-collection system based on a motorized probe, and captured images during interactions between the probe and an object while recording the interaction forces. As shown in Fig. 3 (c), we used an RC servo-motor attached to the translation stage to generate the movement. The rod-type tool mounted on the translation stage moved up and down (z direction only) automatically to apply force to the object. We measured the interaction force between the tip of the tool and the object with a load cell (model BCL-1L, CAS) and captured 1280×1024×3 (RGB) images using a 149-Hz camera (Chameleon3, CM3-U3-13Y3C-CS, Pointgrey). We synchronized the collected images and interaction forces using the time stamp of the camera. Note that the maximum magnitude of the pressing force (e.g., 0 N to 12 N) and the pressing time were varied randomly.
To infer the interaction forces from the images, we selected four objects composed of different materials, as shown in Figs. 3 (a) and (b). Each object had a different rigidity. We collected images of a sponge, a paper cup, a tube, and a stapler. To vary the environment around the objects, each object was subjected to four pressing angles (0°, 10°, 20°, 30°) and three levels of light intensity (350, 550, 750 lux). Fig. 3 (a) shows example images. One image set consists of four contacts with the material, and a total of 15 sets were collected for each environment. In the end, we collected approximately 360,000 sequential images (= 15 sets × 500 images × 4 objects × 3 lights × 4 angles). We selected three test sets from each material, and the other sets were used to train the deep learning models. Table 1 summarizes detailed information concerning the training and test sets of images of the four objects.

Implementation Details
We trained the network weights through mini-batch stochastic gradient descent using the Adam optimizer for 120 epochs. The initial learning rate was 1e-4 and was multiplied by 1/10 every 30 epochs. In each iteration, a mini-batch of 64 samples was made by sampling 20 sequential training frames, and from each frame, an object was randomly selected. The image was then cropped and resized to a grayscale 128×128-pixel image. In the experiment, as a baseline, a variant of the VGG network was used to extract the visual features. As described in Table 2, the network was composed of 10 layers and output 256-channel feature vectors after the GAP. We also experimented with an 18-layer ResNet [10] to verify that our proposed model works well with other CNNs. To exploit the temporal dynamics, we used the BLSTM network with 256 hidden units and 20 time steps. The concatenated hidden-state features of the last time step were fed to a fully connected layer with 1,024 units. Finally, to predict the 1D interaction force, we used a linear regression model. We trained all models from scratch, and measured performance by using the root mean-squared error (RMSE) and mean absolute error (MAE). We used the MAE as the standard measurement for performance comparisons.
Table 3. Experimental results of a baseline, a single frame-based method and the proposed multiple frame-based method.
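The step-decay learning-rate schedule and the two error measures used for evaluation can be written down directly. The sketch below assumes zero-based epoch indices and uses hypothetical function names; the constants (1e-4, a factor of 1/10, every 30 epochs) follow the settings stated above.

```python
import math

def learning_rate(epoch, base_lr=1e-4, step=30, factor=0.1):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

def rmse(pred, target):
    """Root mean-squared error between predicted and measured forces."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def mae(pred, target):
    """Mean absolute error between predicted and measured forces."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Learning rate at the start of each of the four 30-epoch stages.
stage_rates = [learning_rate(e) for e in (0, 30, 60, 90)]
```

Over 120 epochs this schedule passes through four stages, so the final stage trains at one thousandth of the initial rate.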

Experimental Results and Discussion
In this section, we first provide an overall performance comparison between the baseline [5] and the proposed method. Table 3 shows that the attention methods improved the performance of the baseline by more than 9% in predicting unknown interaction forces. The channel attention method (SCAM) was always better than the spatial attention method (SSAM) in this paper because, at the high layers of the CNN, high-level features are found in the channel maps rather than in the spatial maps. Moreover, the proposed ensemble method, which merges the results of the spatial and channel attention methods, yielded an improvement of over 27% over each attention method. Compared with the single frame-based method, which had an MAE of 0.0340, the proposed method based on multiple frames always yielded better results, with an MAE of 0.0318 in the ensemble. This indicates that the attention map for inferring forces can be effectively generated by exploiting the temporal dynamics of the target object. A quantitative evaluation was conducted to find the optimal number of frames. Fig. 4 shows that the performance improvement saturated after n = 1, and in this paper we use n = 1 as a trade-off between computational complexity and improvement. We empirically verified that the proposed pooling method is effective at squeezing sequential frame information by comparing two methods of averaging the feature maps: GAP and WAP. From Table 5, we conclude that the proposed WAP is superior at handling concatenated sequential information and predicting the relevant forces.
Table 7. Comparative evaluation with the well-known attention modules.
Method             RMSE      MAE       Improvement
[11]               0.09838   0.03769   107%
CBAM [23]          0.09974   0.03745   108%
Proposed Method    0.09549   0.03122   130%
Experimental Results on Different Network Architectures To validate the generality of our method, we applied our model to ResNet [10], a well-known deep learning architecture. Table 6 shows the comparative results between VGG-like and ResNet. The proposed method worked successfully regardless of the architecture used. For example, the ResNet-based method also achieved results 30% better in terms of MAE than the baseline.
Comparative Evaluation with Well-known Methods We conducted a comparative analysis with other well-known attention methods. Table 7 summarizes the results of the comparative evaluation in terms of inferring the interaction forces on our dataset. The methods tested were our proposed attention module and recently developed state-of-the-art techniques based on the attention mechanism [11] [23]. The proposed method was superior. Note that methods such as [11] [23] are not designed to generate an attention map from sequential images, and thus suffered performance degradation.

Performance Analysis according to Changes in Force Intensity
To better understand why the proposed method improved performance over the baseline method, we divided the magnitudes of the forces into 11 bins, each of which spanned a 1 N force interval, as shown in Fig. 4. We used the MAE of each force interval to determine how the single-frame-based attention method and the proposed method improved compared with the baseline method. From Fig. 4, we confirm again that the proposed method of generating attention by using sequential images improved performance in most force intervals. In detail, from 0 N to 4 N, contact between the tool and the object begins, and the interaction force becomes measurable by the load cell. In this range, the appearance of the target object begins to change. Since the attention mechanism can concentrate on these changes, the proposed method improves the accuracy compared with the baseline and the single image-based method. In contrast, the range from 4 N to 6 N is the interval where the applied force gradually increases, and the performance of the proposed method saturates. In relatively strong force intervals, e.g., 9 N to 11 N, the proposed method achieved an average improvement of 16% over the single-frame-based attention model. As the appearance changes of the target object increased when the external force was strong, the proposed method effectively made use of differences in pixel values between sequential images to generate attention maps. Overall, the proposed method achieved better performance than the single frame-based attention method.
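The per-interval analysis can be sketched as a simple binning of the absolute errors by the ground-truth force. The code below assumes non-negative forces in newtons, assigns any force beyond the last interval to the final bin (an assumption, since the text does not say how such samples were handled), and uses a hypothetical function name.

```python
def mae_per_force_bin(pred, target, bin_width=1.0, num_bins=11):
    """Mean absolute error grouped by ground-truth force interval.

    Returns a list of length `num_bins`; bins without samples hold None.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for p, t in zip(pred, target):
        idx = int(t // bin_width)
        idx = max(0, min(idx, num_bins - 1))  # clamp out-of-range forces
        sums[idx] += abs(p - t)
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Made-up predictions and measurements, for illustration only.
per_bin = mae_per_force_bin([0.4, 1.0, 3.0], [0.5, 0.7, 3.2])
```

Computing the same list for the baseline, the single-frame attention model, and the proposed model allows the three to be compared interval by interval.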

Performance Analysis on Different Objects
In this section, we investigate the performance on the different objects. Overall, Fig. 5 shows that the proposed method predicts the interaction forces using only images, even when the maximum forces are randomly generated. Looking more closely, Fig. 6 shows that the proposed method is better than the baseline method when the external force reaches its peaks. The baseline method estimated the peak of the interaction force well at first, but its predictions were less stable than those of the proposed method, which were closer to the ground truth. In this respect, we can conclude that temporal dynamics are useful for generating the attention map in the CNN, even though the LSTM already makes use of the temporal information. Table 8 describes the performance improvements for the different target objects, and Fig. 7 illustrates the spatial attention map generated by the proposed method. Specifically, a sponge is an elastic object. Compared with the other objects used, changes to the appearance of the sponge owing to external forces were most apparent, and it thus yielded good results. The proposed method shows the best results on images of the paper cup, as its complex surface textures provide rich visual information. For this reason, it yielded a high estimation accuracy compared with the other rigid objects. As shown in the second row of Fig. 7, the network focused on the top and bottom textures of images of the paper cup, where significant changes occurred owing to external forces. The tube was composed of plastic rubber and was softer than the other objects, because of which changes to its surface were not obvious. For this reason, the proposed method showed a slightly lower improvement on images of the tube. In the case of the stapler, because it is composed of solid materials, its pattern of shape changes was always constant when an external force was applied. In this respect, temporal dynamics played a pivotal role in predicting the interaction forces, as confirmed by the experimental results in Table 8. The improvements of the single image-based attention method and the proposed method were 129% and 141%, respectively. Compared with the other objects, this gain of 12 percentage points is significant.

Conclusion
To predict the interaction forces between objects using only images, we developed a sequential image-based attention module that learns a salient model from temporal dynamics. We also proposed a weighted average pooling layer for modifying both spatial and channel attention modules, with the result generated by their ensemble. To verify our method, we collected 359,413 images and information concerning the corresponding interaction forces using an electronic motor-based device. Extensive experiments proved the effectiveness of our method, which achieved better performance than well-known single-frame-based methods. Our proposed method enables the network to concentrate on regions of interaction to infer interaction forces. It serves as good initial research in force prediction using only one vision sensor.