DL-Net: Sparsity Prior Learning for Grasp Pattern Recognition

The purpose of grasp pattern recognition is to determine the grasp type for an object to be grasped, which can be applied to prosthetic hand control and ease the burden on amputees. To enhance the performance of grasp pattern recognition, we propose DL-Net, a network inspired by dictionary learning. Our method includes two parts: 1) forward propagation for sparse representation learning and 2) backward propagation for dictionary learning. It utilizes the sparsity prior effectively and learns a discriminative dictionary with stronger expressive ability from a large amount of training data. Experiments were performed on two household object datasets: the RGB-D Object dataset and the Hit-GPRec dataset. The experimental results illustrate that DL-Net performs better than traditional deep learning methods in grasp pattern recognition.


I. INTRODUCTION
In semi-autonomous prosthetic hand control, cameras are integrated into the system to output grasp patterns that help the prosthetic hand grasp objects, thus reducing the user's burden. Traditional surface electromyography (sEMG) based grasp pattern recognition [1], [2] suffers from problems such as sensitivity to sweat, power line interference, and the variance of sEMG signals between different people. In contrast, image-based methods do not have such problems [3], [4], [5]. Their data acquisition is simple, and the data preprocessing methods are well established. Furthermore, it is hard for the amputee to generate distinctive sEMG signal patterns from which a machine learning model can predict the intended grasp accurately, which leads to lower accuracy and a higher mental workload for the amputee [6], [7].
In recent years, image-based grasp classification methods have achieved success. Morrison et al. [8] proposed a Generative Grasping Convolutional Neural Network (GG-CNN), which uses an end-to-end approach to output the grasp pose for each pixel. Therefore, it solves the problem of invalid candidate frames. However, GG-CNN places high demands on dataset annotation: grasp quality, grasp angle, and grasp width must be labeled for each pixel. This increases the cost of dataset annotation, which may make real-world practice difficult. Hundhausen et al. [9] also proposed a grasp pattern recognition method based on a convolutional neural network. To enhance the performance of grasp pattern recognition, the model first performs semantic segmentation of the input image to separate the target from the background, and the segmented targets are then used for grasp pattern recognition.
The associate editor coordinating the review of this manuscript and approving it for publication was Yen-Lin Chen.
However, because these models contain convolutional layers, their performance is subject to multiple constraints. First, the convolutional layer operation cuts the image into small pieces for feature extraction. It destroys the integrity of the image. Therefore, the extracted features are local. Second, convolutional neural networks are poorly interpretable. It is difficult to adjust the network structure to improve the network performance without explainable results. Finally, convolutional neural networks require a large number of samples for training to extract robust features [10], [11], [12].
Dictionary learning [13], [14] is able to extract the essential features of a sample. Its role is similar to that of convolutional layers, but it does not destroy the global structure of the sample. The sample data can therefore take part in dictionary learning as a whole, so that global features are extracted, which improves the prediction performance of the model. In general, the more global the extracted features are, the better the subsequent prediction results will be. Moreover, dictionary learning can still perform well even with a small number of samples [15]. Therefore, we choose dictionary learning to build the network DL-Net.
Furthermore, DL-Net differs from traditional dictionary learning in terms of classification. Traditional dictionary learning needs a separate classification step. In contrast, DL-Net is an end-to-end model that makes predictions with a classifier composed of fully connected layers, where the classifier layers are more flexible and can make full use of the sparse features for classification. With this classifier, the label information is learned more efficiently from the dictionary. In addition, DL-Net treats hyperparameters as trainable variables and learns them with a loss function that consists of the mean squared error loss and the cross-entropy loss, instead of tuning them manually as in traditional dictionary learning, which reduces labor costs.
The contributions of this paper are as follows:
• We build a network, DL-Net, using dictionary learning. Experimental results show that it is comparable to convolutional neural networks and, in some cases, even surpasses them.
• We create a novel deep learning framework whose layers can extract global features of the input image.
• The proposed DL-Net has the powerful self-adjusting capability of deep learning and the global feature extraction ability of dictionary learning.

II. RELATED WORK
Convolutional neural networks have demonstrated the feasibility of grasp pattern recognition. In [16], Ghazaei  However, the above methods are all based on convolutional neural networks, which might suffer from localized feature extraction, as discussed in the Introduction. Mei et al. [19] proposed a non-local sparse network (NLSN) for Single Image Super-Resolution (SISR). Its non-local sparse attention (NLSA) combined non-local operations with sparse feature representation, giving it the long-range modeling power of non-local operations together with the robustness and efficiency of sparse representation.
Global features can be extracted by sparse representation [20], [21], [22]. The purpose of sparse representation learning is to find a sparse matrix X such that DX reconstructs the sample Y as closely as possible, where X is the sparse representation of Y and the dictionary matrix D is given in advance. However, there are many candidate dictionaries, and choosing a suitable D in advance is difficult. Dictionary learning, in contrast, can find both the sparse matrix X and the dictionary D [13]. Therefore, dictionary learning has become a hot research topic. For example, Yang et al. [23] proposed a model called ADMM-CSNet for reconstructing images from sparsely sampled measurements. The model combined the traditional model-based compressive sensing method with the data-driven deep learning method. Zhong et al. [24] proposed a dictionary learning-based THz CT reconstruction (DLTR) model. The dictionary in the model can be adaptively adjusted during the reconstruction of the image. In addition, it was used for image denoising tasks. Zheng et al. [25] proposed a novel framework of deep convolutional dictionary learning (DCDicL) for image denoising. The framework strictly adhered to the dictionary learning representation model and can adaptively adjust the dictionary according to each input image.

III. THE PROPOSED METHODS
The dictionary learning model can be formalized as:
$$\arg\min_{D \in \mathcal{C},\, X} \ \frac{1}{2}\|Y - DX\|_F^2 + \lambda\|X\|_1, \tag{1}$$
where $\mathcal{C}$ is the constraint set of the dictionary $D$, and $\lambda$ is a penalty parameter to balance the sparsity of $X$ and the reconstruction error term $\|Y - DX\|_F$. Here, $Y = [y_1, y_2, \cdots, y_m]$ is a matrix composed of image samples and $X$ is its sparse representation.
There are two stages in traditional dictionary learning: (1) sparse representation learning by solving Eqn. (1) with given $D$, and (2) dictionary learning by solving Eqn. (1) with given $X$. Here, we propose a new method based on a deep unfolding network (as shown in Fig. 1), in which we use forward propagation for sparse representation learning and backward propagation for dictionary learning. In addition to the dictionary, we can learn the parameter $\lambda$ from the data instead of tuning it manually, as the traditional method usually does.

FIGURE 1. Architecture of DL-Net. The network contains a total of n blocks; each block consists of one X layer, one A layer, and one L layer. The FC layer represents the fully connected layer. The model structure is based on dictionary learning. Therefore, its main purpose is to convert the input image into a sparse matrix $X$ and a dictionary $D^{(n+1)}$, and then reconstruct the original image through the sparse matrix $X$ and dictionary $D^{(n+1)}$.

A. SPARSE REPRESENTATION LEARNING BY THE FORWARD PROPAGATION
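When the dictionary $D$ is fixed, Eqn. (1) reduces to:
$$\arg\min_{X} \ \frac{1}{2}\|Y - DX\|_F^2 + \lambda\|X\|_1. \tag{2}$$
To solve Eqn. (2), we remove the constraint on the sparse matrix $X$ by introducing an auxiliary matrix $A$. Then, we have:
$$\arg\min_{X,\, A} \ \frac{1}{2}\|Y - DX\|_F^2 + \lambda\|A\|_1 \quad \text{s.t.} \quad X = A. \tag{3}$$
The augmented Lagrangian function of Eqn. (3) can be written as:
$$\mathcal{L}_{\mu}(X, A, L) = \frac{1}{2}\|Y - DX\|_F^2 + \lambda\|A\|_1 + \langle L, X - A \rangle + \frac{\mu}{2}\|X - A\|_F^2, \tag{4}$$
where the matrix $L$ is the Lagrange multiplier and $\mu > 0$. According to the Alternating Direction Method of Multipliers (ADMM) [26] framework, we solve Eqn. (3) iteratively by the following steps.
• For the sparse matrix $X$:
$$X^{(k+1)} = \arg\min_{X} \ \mathcal{L}_{\mu^{(k)}}\big(X, A^{(k)}, L^{(k)}\big), \tag{5}$$
where we have:
$$\big(D^{T}D + \mu^{(k)} I\big) X^{(k+1)} = D^{T}Y + \mu^{(k)} A^{(k)} - L^{(k)}, \tag{6}$$
$$X^{(k+1)} = \big(D^{T}D + \mu^{(k)} I\big)^{-1}\big(D^{T}Y + \mu^{(k)} A^{(k)} - L^{(k)}\big). \tag{7}$$
• For the auxiliary matrix $A$:
$$A^{(k+1)} = \arg\min_{A} \ \lambda\|A\|_1 + \langle L^{(k)}, X^{(k+1)} - A \rangle + \frac{\mu^{(k)}}{2}\|X^{(k+1)} - A\|_F^2. \tag{8}$$
To solve Eqn. (8) for the auxiliary matrix $A$, we introduce the soft thresholding function [27]:
$$S_{\tau}(z) = \operatorname{sign}(z)\,\max(|z| - \tau,\, 0). \tag{9}$$
An equivalent expression is as follows:
$$S_{\tau}(z) = \begin{cases} z - \tau, & z > \tau, \\ 0, & |z| \le \tau, \\ z + \tau, & z < -\tau. \end{cases} \tag{10}$$
Therefore, $A^{(k+1)}$ can be obtained element-wise by
$$A^{(k+1)} = S_{\lambda/\mu^{(k)}}\Big(X^{(k+1)} + \frac{L^{(k)}}{\mu^{(k)}}\Big). \tag{11}$$
• For the Lagrange multiplier $L$:
$$L^{(k+1)} = L^{(k)} + \mu^{(k)}\big(X^{(k+1)} - A^{(k+1)}\big). \tag{12}$$
• For the penalty parameter $\mu$:
$$\mu^{(k+1)} = \min\big(\rho\,\mu^{(k)},\, \bar{\mu}\big), \tag{13}$$
where $\rho > 1$ and $\bar{\mu}$ is the upper bound of $\mu^{(k+1)}$.
Inspired by the above iterative framework, we propose the network DL-Net. Each block of DL-Net corresponds to an iteration that consists of the following steps:
$$\begin{aligned} x^{(k+1)} &= \big((D^{(k+1)})^{T} D^{(k+1)} + \mu^{(k+1)} I\big)^{-1}\big((D^{(k+1)})^{T} y + \mu^{(k+1)} a^{(k)} - l^{(k)}\big), \\ a^{(k+1)} &= S_{\lambda^{(k+1)}/\mu^{(k+1)}}\Big(x^{(k+1)} + \frac{l^{(k)}}{\mu^{(k+1)}}\Big), \\ l^{(k+1)} &= l^{(k)} + \mu^{(k+1)}\big(x^{(k+1)} - a^{(k+1)}\big), \end{aligned} \tag{14}$$
where $D^{(k+1)} \in \mathcal{C}$ (the constraint set on $D$ in Eqn. (1)), $\lambda^{(k+1)} > 0$ and $\mu^{(k+1)} > 0$ are the learnable parameters at the $(k+1)$-th block of DL-Net.
The structure of DL-Net is shown in Fig. 1, in which the layers $X^{(k+1)}$, $A^{(k+1)}$ and $L^{(k+1)}$ at the $(k+1)$-th block compute $x^{(k+1)}$, $a^{(k+1)}$ and $l^{(k+1)}$, respectively. From Fig. 1, the forward propagation of DL-Net is identical to the above ADMM-based iterative framework for given $\{D^{(k)}, \mu^{(k)}, \lambda^{(k)}\}_{k=1}^{n}$, allowing us to learn $D$ and the parameters $\mu$ and $\lambda$ from a large amount of training data by backward propagation. We use $f(y, \{D^{(k)}, \mu^{(k)}, \lambda^{(k)}\}_{k=1}^{n})$ to represent $x^{(n)}$ to clarify the relation between $x^{(n)}$ and these trainable parameters.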
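
The ADMM-based iteration that the forward propagation follows can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: it uses the standard ADMM updates for the problem min_x 0.5*||y - D x||_F^2 + lam*||x||_1, keeps one fixed dictionary `D` and one fixed `lam` for all iterations (in DL-Net these are learnable per block), and increases `mu` with the classical rule min(rho*mu, mu_max). The function names and default values are hypothetical.

```python
import numpy as np


def soft_threshold(z, tau):
    """Element-wise soft thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def admm_sparse_code(y, D, lam=0.1, mu=1.0, rho=1.1, mu_max=10.0, n_blocks=16):
    """Run n_blocks ADMM iterations for min_x 0.5*||y - D x||_F^2 + lam*||x||_1.

    y: (d, m) array of samples stored as columns (an assumption of this sketch).
    D: (d, k) dictionary. Returns the final x, a and l, each of shape (k, m).
    """
    k = D.shape[1]
    x = np.zeros((k, y.shape[1]))
    a = np.zeros_like(x)  # auxiliary variable
    l = np.zeros_like(x)  # Lagrange multiplier
    DtD, Dty = D.T @ D, D.T @ y
    for _ in range(n_blocks):
        # X step: closed-form solution of the quadratic subproblem.
        x = np.linalg.solve(DtD + mu * np.eye(k), Dty + mu * a - l)
        # A step: soft thresholding of x + l / mu with threshold lam / mu.
        a = soft_threshold(x + l / mu, lam / mu)
        # Multiplier step: dual ascent on the constraint x = a.
        l = l + mu * (x - a)
        # Penalty step: grow mu geometrically, capped at mu_max.
        mu = min(rho * mu, mu_max)
    return x, a, l
```

With an identity dictionary the minimizer is the soft-thresholded input, which gives a convenient sanity check for the iteration. In a trainable version, each pass through the loop body would become one block with its own dictionary, `lam` and `mu`.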

B. DICTIONARY LEARNING BY THE BACKWARD PROPAGATION
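Here, we update $D$ for given $X$. When we fix $X$, Eqn. (1) reduces to:
$$\arg\min_{D \in \mathcal{C}} \ \frac{1}{2}\|Y - DX\|_F^2, \tag{15}$$
where $\mathcal{C}$ denotes the constraint set on $D$ in Eqn. (1). In traditional dictionary learning, a standard way to obtain a dictionary is to solve Eqn. (15) by using the ADMM framework.
Here, we update $D$ by backward propagation, where the mean squared error (MSE) loss function is used:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{m}\sum_{i=1}^{m} \big\|y_i - D^{(n+1)} x_i^{(n+1)}\big\|_F^2. \tag{16}$$
Here, $m$ is the number of samples, $y_i$ is the $i$-th sample, and $x_i^{(n+1)}$ is the sparse representation of $y_i$ obtained by forward propagation.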
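
To make the idea of learning the dictionary from a reconstruction loss concrete, the following NumPy sketch performs one plain gradient-descent step on an MSE loss with the sparse codes held fixed. This is a simplification and an assumption of the sketch: DL-Net backpropagates through all blocks and uses the Adam optimizer, whereas here the gradient is written out by hand for a single dictionary, samples are stored as columns, and the names `mse_loss` and `dictionary_gradient_step` are hypothetical.

```python
import numpy as np


def mse_loss(Y, D, X):
    """Mean over samples of the squared reconstruction error ||y_i - D x_i||^2."""
    residual = Y - D @ X
    return np.sum(residual ** 2) / Y.shape[1]


def dictionary_gradient_step(Y, D, X, lr=1e-2):
    """One gradient-descent step on mse_loss with respect to D, with X fixed."""
    m = Y.shape[1]
    # Gradient of (1/m) * ||Y - D X||_F^2 with respect to D.
    grad = -2.0 / m * (Y - D @ X) @ X.T
    return D - lr * grad
```

For a sufficiently small learning rate, one such step lowers the reconstruction loss, which is the behaviour that the backward pass relies on.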

C. CLASSIFICATION
In this subsection, we classify the test sample $y_{test}$ after obtaining $D^{(n+1)}$ and $x^{(n+1)}_{test}$. In DL-Net, we use a classifier consisting of several fully connected layers to classify the test sample $y_{test}$. The classifier takes $x^{(n+1)}_{test}$ as input and is trained by using a cross-entropy (CE) loss function:
$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{i}\sum_{j} p_j^{(i)} \log q_j^{(i)}. \tag{17}$$
Here, $p_j^{(i)}$ represents the true probability that the $i$-th training sample belongs to the $j$-th category, while $q_j^{(i)}$ represents the probability predicted by DL-Net that the $i$-th training sample belongs to the $j$-th category. Here, $\theta$ denotes the trainable parameters of the classifier.
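
As an illustration of the classification step, the sketch below implements a two-layer fully connected classifier (ReLU followed by Softmax) and a cross-entropy loss in NumPy. It assumes that the sparse representation is flattened into a row vector per sample and that the loss is summed over samples and classes; the weight shapes, the `eps` guard inside the logarithm, and all function names are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def classifier_forward(x, W1, b1, W2, b2):
    """Two fully connected layers: ReLU hidden layer, then Softmax output."""
    h = np.maximum(x @ W1 + b1, 0.0)
    return softmax(h @ W2 + b2)


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between true probabilities p and predictions q, summed over all entries."""
    return -np.sum(p * np.log(q + eps))
```

With one-hot labels `p`, only the predicted probability of the true class contributes to the loss for each sample.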

IV. EXPERIMENTS

A. DATASET
We used two datasets, the RGB-D Object dataset [28], [29] and the Hit-GPRec dataset [17], to evaluate our proposed network. The RGB-D Object dataset was proposed for object category classification and was later manually labeled with gestures by Zhang [29]. It contains a total of 300 objects and 207921 images labeled with four gestures (palmar wrist neutral, palmar wrist pronated, pinch, tripod). The Hit-GPRec dataset was proposed by Shi [17] and was originally used for grasp pattern recognition tasks. It contains a total of 121 everyday objects that were labeled with four gesture types (cylindrical, lateral, spherical and tripod). Each object was photographed under different environmental conditions (4 types of lighting, 4 different camera positions and different postures of the objects) at 16 rotation angles to form the Hit-GPRec dataset. Some objects from the two datasets are shown in Fig. 2. The datasets were not filtered, and the entire datasets were used for the experiments.

B. SAMPLE METHODS
To test the learning ability and generalization ability of the model, we used two sampling methods: within-whole dataset cross-validation (WWC) and between-object cross-validation (BOC). The WWC sampling method tests the performance of the model when it faces objects at different angles. Therefore, the dataset is randomly divided into a training set, a validation set, and a test set in the ratio of 8 : 1 : 1. The BOC sampling method, in contrast, simulates the situation where the model is tested with objects that never appear in the training set, so it tests the generalization ability of the model. Therefore, BOC treats the images of different views of the same object as a whole and divides the objects into the training set, validation set, and test set in the ratio of 8 : 1 : 1.
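
The difference between the two sampling methods can be sketched in a few lines of Python. The sketch assumes that each sample is a hypothetical `(object_id, image_id)` pair; the function names, the fixed seed, and the rounding of subset sizes are illustrative choices, not details from the paper.

```python
import random


def shuffle_split(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(ratios[0] * len(items))
    n_val = round(ratios[1] * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def wwc_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """WWC: split individual images, so views of one object may fall in several subsets."""
    return shuffle_split(samples, ratios, seed)


def boc_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """BOC: split object IDs, so all views of one object land in the same subset."""
    object_ids = sorted({obj for obj, _ in samples})
    id_parts = shuffle_split(object_ids, ratios, seed)
    return tuple([s for s in samples if s[0] in set(ids)] for ids in id_parts)
```

Under `boc_split`, no object in the test subset has any view in the training subset, which is what makes BOC a test of generalization to unseen objects.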

C. IMPLEMENTATION DETAILS
The input image was converted to grayscale and resized to 48×48. The size of $D$ was 48×48 and α was set to $10^{-4}$. The classifier consisted of two fully connected layers. The size of the first layer was 128, and the size of the second layer was the number of gesture labels. We used the ReLU function in the first layer and the Softmax function in the second layer. The sparse matrix $x^{(0)}$, the auxiliary matrix $a^{(0)}$, and the Lagrange multiplier $l^{(0)}$ were initialized to all-zero matrices and used as model inputs. However, when forward propagation was performed in this way, the gradient explosion problem occurred.
To solve this problem, we added a LayerNorm layer after all X layers (excluding the $X^{(n+1)}$ layer), L layers (the Lagrange multiplier layers), and A layers.
We evaluated the model using global accuracy (GA), which is calculated as follows:
$$\mathrm{GA} = \frac{R_t}{R_t + R_f} \times 100\%, \tag{18}$$
where $R_t$ and $R_f$ indicate the numbers of true and false prediction results, respectively.
The experiments were conducted in the same environment using Ubuntu 9.4.0 and a GeForce RTX 2080Ti. Each experiment was run three times and the results were averaged. For the comparison methods, the image size was the same as that used for training or testing in the original papers. A total of 40 epochs were used for both stages of training DL-Net, as well as for all other models. The models employed the Adam optimizer, where β 1 and β 2 were set to their defaults of 0.9 and 0.999. The initial learning rate was set to 1 × 10 −4 and decayed to 1 × 10 −6 .
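
A minimal Python sketch of the GA computation is given below, assuming that GA is the percentage of correct predictions and that labels and predictions are sequences of equal length; the function name is hypothetical.

```python
def global_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    r_t = sum(t == p for t, p in zip(y_true, y_pred))  # number of true predictions
    r_f = len(y_true) - r_t                            # number of false predictions
    return 100.0 * r_t / (r_t + r_f)
```

For example, three correct predictions out of four give a GA of 75%.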

D. COMPARISON WITH STATE-OF-THE-ART METHODS
We compared DL-Net with several convolutional neural network models, including CnnGrasp [16], GhostNet [30], EfficientNet [31], RegNet [32] and Lightlayers [33]. CnnGrasp was designed specifically for grasp pattern recognition, and the remaining four methods were designed for image classification. We used the two sampling methods, WWC and BOC, to test the models. The number of model parameters and the computational cost of completing one prediction for the different models are shown in Table 3. According to Table 3, the computational cost required by DL-Net to complete one prediction is much lower than that of the other methods, even the lightweight network Lightlayers. The model size of DL-Net is comparable to that of Lightlayers.

1) COMPARISON IN WWC
To test the robustness of the model facing different views of the object and its ability to generalize within the dataset, we compared our method with five other methods on two datasets. The experimental results are shown in Table 1. Specifically, DL-Net reaches 99.20% on the RGB-D Object dataset, which is almost 100%.
Our method outperforms the other methods on both datasets. Compared to CnnGrasp, which was designed for grasp pattern recognition, DL-Net achieves the best performance on both datasets, while CnnGrasp has an average performance. The strongest competitor to DL-Net is EfficientNet. On the RGB-D Object dataset, the performance of the two methods is similar, while on the Hit-GPRec dataset, DL-Net outperforms EfficientNet by about 3%. All experimental results shown in Table 1 illustrate the effectiveness of the proposed method in WWC. In addition, the performance difference of DL-Net between the RGB-D Object dataset and the Hit-GPRec dataset illustrates that DL-Net can extract effective features even when facing a small number of samples. Compared to the lightweight network Lightlayers, DL-Net performs better on both the RGB-D Object dataset and the Hit-GPRec dataset, by nearly 15% in each case. Therefore, DL-Net fits the training data better than Lightlayers and achieves good prediction performance for objects that appear in the training set.

2) COMPARISON IN BOC
To test the prediction performance of the model for unseen objects, we compared our method with five other methods on two datasets. The experimental results are shown in Table 2. From Table 2, it can be seen that DL-Net has comparable performance.
In addition, comparing Table 1 and Table 2 shows that BOC is more challenging than WWC. On the Hit-GPRec dataset, DL-Net outperforms all other methods. This illustrates that DL-Net can extract effective features for enhanced grasp pattern recognition when faced with a small number of samples, regardless of whether WWC or BOC sampling is adopted. On the RGB-D Object dataset, DL-Net ranks third in performance, but its difference from GhostNet, which ranks second, is less than 1.5%. Although DL-Net's performance is not the best, it outperforms most of the convolutional neural networks, which shows that DL-Net is strongly competitive with convolutional neural networks. As in WWC, DL-Net outperforms Lightlayers, and the difference in performance between the two models is around 10%. The better performance of DL-Net under BOC sampling suggests that DL-Net is more robust than Lightlayers and can make more accurate predictions for objects that do not appear in the training set.
Combining Table 1 with Table 2, it can be found that DL-Net outperforms most traditional deep neural networks. We attribute this to the fact that DL-Net is more interpretable and effectively avoids the large amount of labor and resources needed to construct the network manually.

E. ABLATION STUDY
In this section, we study the effect of the block number of the model on the model performance, and the block number is set to {4, 8, 16}. The ablation experiments are all performed on the Hit-GPRec dataset with WWC and BOC, respectively.
The results are shown in Fig. 3. Regardless of the sampling method, the performance of DL-Net improves as the block number increases. This suggests that increasing the block number allows DL-Net to better extract the most essential features from images, which enhances the performance of grasp pattern recognition.
However, comparing DL-Net with block numbers 4 and 16, it can be found that the performance difference between the two cases is not large: the GA gap is around 5%. This indicates that although the block number is one of the factors affecting DL-Net performance, its influence is limited. We attribute the performance of DL-Net mainly to the sparse matrix extracted by using dictionary learning.
According to Fig. 3, the accuracy of the model increases with the block number, and when the block number is close to 16, the accuracy curve flattens. This shows that increasing the depth of the model can improve its performance, but the improvement diminishes as the depth increases. This is the reason why we set the block number of DL-Net to 16.

V. CONCLUSION
In this paper, we proposed a new network, DL-Net, that combines dictionary learning with deep learning. Our method can exploit the sparse prior information within the data (i.e., natural signals are often distributed in a low-dimensional space) more effectively. Compared with traditional dictionary learning methods, DL-Net is an end-to-end model that makes predictions with a classifier composed of fully connected layers, where the classifier layers are more flexible and can make full use of the sparse features for classification. With this classifier, the label information is learned more efficiently from the dictionary. To demonstrate the effectiveness of the model, we compared six methods (DL-Net and five benchmark models) on two datasets, the RGB-D Object dataset and the Hit-GPRec dataset. The experiments showed that DL-Net was comparable to the state-of-the-art convolutional neural networks.