Introduction
In semi-autonomous prosthetic hand control, cameras are integrated into the system to output grasp patterns, helping the prosthetic hand grasp objects and thus reducing the user's burden. Traditional surface electromyography (sEMG) based grasp pattern recognition [1], [2] suffers from problems such as sensitivity to sweat, power line interference, and the variance of sEMG signals across users. In contrast, image-based methods do not have such problems [3], [4], [5]: data acquisition is simple, and preprocessing methods are well established. Furthermore, it is hard for an amputee to generate distinctive sEMG signal patterns for a machine learning model to predict intended grasps accurately, which leads to lower accuracy and a higher mental workload for the amputee [6], [7].
In recent years, image-based grasp classification methods have achieved success. Morrison et al. [8] proposed the Generative Grasping Convolutional Neural Network (GG-CNN), which outputs a grasp pose at each pixel in an end-to-end manner, thereby avoiding the problem of invalid candidate frames. However, GG-CNN imposes demanding annotation requirements on the dataset: grasp quality, grasp angle, and grasp width must be labeled for every pixel. This increases the cost of dataset annotation, which may hinder real-world adoption. Hundhausen et al. [9] also proposed a grasp pattern recognition method based on convolutional neural networks. To enhance recognition performance, the model first performs semantic segmentation of the input image to separate the target from the background, and the segmented targets are then used for grasp pattern recognition.
However, because these models contain convolutional layers, their performance is subject to several constraints. First, the convolution operation processes the image in small patches for feature extraction, which destroys the integrity of the image; the extracted features are therefore local. Second, convolutional neural networks are poorly interpretable, and without explainable results it is difficult to adjust the network structure to improve performance. Finally, convolutional neural networks require a large number of training samples to extract robust features [10], [11], [12].
Dictionary learning [13], [14] is able to extract the essential features of a sample, playing a role similar to convolutional layers but without destroying the global structure of the sample. The sample can thus participate in dictionary learning as a whole, with global features extracted, which improves the prediction performance of the model. In general, the more global the extracted features are, the better the subsequent prediction results will be. Moreover, dictionary learning still performs well even with a small number of samples [15]. Therefore, we choose dictionary learning to build the network DL-Net.
Furthermore, DL-Net differs from traditional dictionary learning in terms of classification. Traditional dictionary learning requires a separate classification step, whereas DL-Net is an end-to-end model that makes predictions with a classifier composed of fully connected layers. These classifier layers are more flexible and can make full use of the sparse features for classification, so the label information is learned more efficiently from the dictionary. In addition, DL-Net treats the hyperparameters as trainable variables and learns them from a loss function consisting of a mean squared loss and a cross-entropy loss, instead of relying on the manual tuning used in traditional dictionary learning, which reduces labor costs.
The contributions of this paper are as follows:
We build a network, DL-Net, using dictionary learning. Experimental results show that it is competitive with convolutional neural networks and in some cases even surpasses them.
We create a novel deep learning framework whose layers can extract global features of the input image.
The proposed DL-Net combines the powerful self-adjusting capability of deep learning with the global feature extraction ability of dictionary learning.
Related Work
Convolutional neural networks have demonstrated the feasibility of grasp pattern recognition. In [16], Ghazaei et al. proposed a network that consists of two convolutional layers and one downsampling layer. Although its structure was simple, it showed impressive performance in experiments. Then, Shi et al. [17] studied the effect of image type on grasp pattern recognition. In their experiments, three image types were used: RGB images, grayscale images, and depth images. With similar model structures, the model with depth images had the best performance, followed by grayscale images, with RGB images the worst. In addition, convolutional neural networks are innovative in image classification. For example, Ding et al. [18] proposed RepVGG, a network whose inference-time body consists of only $3\times 3$ convolutions and ReLU.
However, the above methods are all based on convolutional neural networks, which may suffer from the localized feature extraction discussed in the Introduction. Mei et al. [19] proposed the Non-Local Sparse Network (NLSN) for single image super-resolution (SISR). Its non-local sparse attention (NLSA) module combines non-local operations with sparse feature representation, giving the network the power of long-range modeling together with the robustness and efficiency of sparse representation.
Global features can be extracted by sparse representation [20], [21], [22]. The purpose of sparse representation learning is to find a sparse matrix $X$ such that the samples $Y$ can be reconstructed from a dictionary $D$ as $Y \approx DX$.
The Proposed Methods
The dictionary learning model can be formalized as:\begin{equation*} \underset {X,D\in \Psi }{\arg \min } \frac {1}{2} \Vert Y-DX \Vert _{F}^{2} + \lambda \Vert X \Vert _{1}, \tag{1}\end{equation*} where $Y$ is the sample matrix whose columns are the input signals, $D$ is the dictionary constrained to the set $\Psi$ of matrices with unit-norm columns, $X$ is the sparse coefficient matrix, and $\lambda > 0$ balances reconstruction fidelity against sparsity.
There are two stages in traditional dictionary learning: (1) sparse representation learning, by solving Eqn. (1) with the dictionary $D$ fixed; and (2) dictionary learning, by updating $D$ with the sparse matrix $X$ fixed. The two stages alternate until convergence.
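As a point of reference, Eqn. (1) can be solved with off-the-shelf alternating solvers. The sketch below uses scikit-learn's `DictionaryLearning`; the matrix sizes and regularization weight are illustrative assumptions, and note that scikit-learn uses the row convention $Y \approx XD$ rather than the column convention $Y \approx DX$ used here.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Illustrative sizes only: 200 samples of 64-dimensional signals.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 64))

# alpha plays the role of lambda in Eqn. (1).
dl = DictionaryLearning(n_components=32, alpha=1.0, max_iter=50, random_state=0)
X = dl.fit_transform(Y)   # sparse codes, shape (200, 32)
D = dl.components_        # dictionary atoms, shape (32, 64)

print("fraction of zero coefficients:", np.mean(X == 0))
print("reconstruction error:", np.linalg.norm(Y - X @ D))
```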
Fig. 1. Architecture of DL-Net. The network contains a sequence of blocks, each implementing one unrolled ADMM iteration (Eqn. (14)), followed by a fully connected classifier.
A. Sparse Representation Learning by the Forward Propagation
When the dictionary $D$ is fixed, Eqn. (1) reduces to the sparse coding problem:\begin{equation*} \underset {X}{\arg \min } \frac {1}{2} \Vert Y-DX \Vert _{F}^{2} + \lambda \Vert X \Vert _{1}. \tag{2}\end{equation*}
By introducing an auxiliary matrix $A$, Eqn. (2) can be rewritten in the equivalent constrained form:\begin{align*}&\underset {X,A}{\arg \min } \quad \frac {1}{2} \Vert Y-DX \Vert _{F}^{2} + \lambda \Vert A \Vert _{1} \\&\quad s.t. \quad A=X. \tag{3}\end{align*}
The augmented Lagrangian function of Eqn. (3) can be written as:\begin{align*}&\hspace {-0.5pc}L_{\mu} (X,A,\Lambda)=\frac {1}{2} \Vert Y-DX \Vert _{F}^{2} + \lambda \Vert A \Vert _{1} + \langle X-A,\Lambda \rangle \\&+\, \frac {\mu }{2} \Vert X-A \Vert _{F}^{2}, \tag{4}\end{align*} where $\Lambda$ is the Lagrange multiplier and $\mu > 0$ is the penalty parameter.
According to the Alternating Direction Method of Multipliers (ADMM) [26] framework, we solve Eqn. (3) iteratively by the following steps.
For the sparse matrix $X$: fixing $A$ and $\Lambda$, the subproblem is\begin{equation*} X^{(k+1)} = \underset {X}{\arg \min } \frac {1}{2} \Vert Y-DX \Vert _{F}^{2} + \langle X-A^{(k)},\Lambda ^{(k)} \rangle + \frac {\mu ^{(k)}}{2} \Vert X-A^{(k)} \Vert _{F}^{2}. \tag{5}\end{equation*} Setting the gradient with respect to $X$ to zero yields the closed-form update\begin{equation*} X^{(k+1)} = (D^{T}D + \mu ^{(k)} I)^{-1}(\mu ^{(k)} A^{(k)} + D^{T}Y - \Lambda ^{(k)}). \tag{6}\end{equation*}
For the auxiliary matrix $A$: fixing $X^{(k+1)}$ and $\Lambda ^{(k)}$, the subproblem is\begin{equation*} A^{(k+1)} = \underset {A}{\arg \min } \lambda \Vert A \Vert _{1} + \langle X^{(k+1)}-A,\Lambda ^{(k)} \rangle + \frac {\mu ^{(k)}}{2} \Vert X^{(k+1)}-A \Vert _{F}^{2}, \tag{7}\end{equation*} which, after completing the square, is equivalent to\begin{equation*} A^{(k+1)} = \underset {A}{\arg \min } \lambda \Vert A \Vert _{1} + \frac {\mu ^{(k)}}{2} \left \Vert A - \left ({X^{(k+1)} + \frac {\Lambda ^{(k)}}{\mu ^{(k)}}}\right) \right \Vert _{F}^{2}. \tag{8}\end{equation*} Its solution is given elementwise by the soft-thresholding operator\begin{align*} S_{\theta }(x) = \begin{cases} \displaystyle x-\theta, & x > \theta,\\ \displaystyle 0, & \vert x \vert \le \theta,\\ \displaystyle x+\theta, & x < -\theta. \end{cases} \tag{9}\end{align*}
An equivalent expression of this operator is:\begin{equation*} S_{\theta }(x) = \operatorname {sign}(x) \max \{\vert x \vert - \theta,0\}. \tag{10}\end{equation*}
Therefore, the update of $A$ is\begin{equation*} A^{(k+1)} = S_{\frac {\lambda }{\mu ^{(k)}}}\left[{X^{(k+1)} + \frac {\Lambda ^{(k)}}{\mu ^{(k)}}}\right]. \tag{11}\end{equation*}
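For concreteness, a minimal NumPy implementation of the soft-thresholding operator of Eqn. (10):

```python
import numpy as np

def soft_threshold(x, theta):
    """Elementwise S_theta(x) = sign(x) * max(|x| - theta, 0), Eqn. (10)."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

# Entries with magnitude below theta are zeroed; the rest shrink toward zero.
print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 1.5]), 1.0))
# [-1. -0.  0.  0.  0.5]
```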
For the Lagrange multiplier $\Lambda$: it is updated by dual ascent:\begin{equation*} \Lambda ^{(k+1)} = \Lambda ^{(k)} + \mu ^{(k)} (X^{(k+1)} - A^{(k+1)}). \tag{12}\end{equation*}
For the penalty parameter $\mu$: following common ADMM practice, it is increased at each iteration:\begin{equation*} \mu ^{(k+1)} = \rho \mu ^{(k)}, \quad \rho > 1. \tag{13}\end{equation*}
Inspired by the above iterative framework, we propose the network DL-Net. Each block of the DL-Net corresponds to an iteration that consists of the following steps:\begin{align*} \begin{cases} \displaystyle \textbf {x}^{(k+1)} = ((D^{(k+1)})^{T}D^{(k+1)} + {\mu ^{(k+1)}} I)^{-1}(\mu ^{(k+1)} \textbf {a}^{(k)}+ \\ \displaystyle (D^{(k+1)})^{T}\textbf {y} - \textbf {l}^{(k)}),\\ \displaystyle \textbf {a}^{(k+1)} = S_{\frac {\lambda ^{(k+1)}}{\mu ^{(k+1)}}}\left[{\textbf {x}^{(k+1)} + \frac { \textbf {l}^{(k)}}{\mu ^{(k+1)}}}\right],\\[0.8pc] \displaystyle \textbf {l}^{(k+1)} = \textbf {l}^{(k)} + \mu ^{(k+1)} (\textbf {x}^{(k+1)} - \textbf {a}^{(k+1)}),\end{cases} \tag{14}\end{align*} where $\textbf {y}$ is the input sample and $D^{(k+1)}$, $\mu ^{(k+1)}$, and $\lambda ^{(k+1)}$ are trainable parameters of the $(k+1)$-th block.
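To make Eqn. (14) concrete, below is a minimal PyTorch sketch of one DL-Net block; the variable names, shapes, and initialization are our assumptions rather than the authors' implementation. Each block holds its own trainable $D^{(k)}$, $\mu^{(k)}$, and $\lambda^{(k)}$, with positivity enforced through $h(x)$ of Eqn. (18).

```python
import torch
import torch.nn as nn

class DLBlock(nn.Module):
    """One unrolled ADMM iteration of Eqn. (14). Sketch only; shapes assumed:
    y is (signal_dim, batch); x, a, l are (n_atoms, batch) column matrices."""

    def __init__(self, signal_dim, n_atoms, alpha=1e-4):
        super().__init__()
        self.D = nn.Parameter(0.01 * torch.randn(signal_dim, n_atoms))  # D^(k)
        self.mu_raw = nn.Parameter(torch.ones(1))    # mu^(k) before h(x)
        self.lam_raw = nn.Parameter(torch.ones(1))   # lambda^(k) before h(x)
        self.alpha = alpha                           # constant in Eqn. (18)

    def forward(self, y, a, l):
        mu = torch.relu(self.mu_raw) + self.alpha    # h(x), Eqn. (18)
        lam = torch.relu(self.lam_raw) + self.alpha
        # x-update: closed-form least-squares solve (first line of Eqn. (14))
        gram = self.D.t() @ self.D + mu * torch.eye(self.D.shape[1])
        x = torch.linalg.solve(gram, mu * a + self.D.t() @ y - l)
        # a-update: soft thresholding with threshold lambda/mu (second line)
        v = x + l / mu
        a = torch.sign(v) * torch.clamp(v.abs() - lam / mu, min=0.0)
        # l-update: dual ascent on the multiplier (third line)
        l = l + mu * (x - a)
        return x, a, l
```

Stacking $n+1$ such blocks, with $\textbf {a}^{(0)}$ and $\textbf {l}^{(0)}$ initialized to zero, yields the forward propagation of DL-Net.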
B. Dictionary Learning by the Backward Propagation
In traditional dictionary learning, the dictionary is updated with the sparse matrix $X$ fixed by solving:\begin{equation*} \underset {D \in \Psi }{\arg \min } \frac {1}{2} \Vert Y-DX \Vert _{F}^{2}. \tag{15}\end{equation*}
In DL-Net, we instead update the dictionary by backpropagation, minimizing the reconstruction loss over a batch of $m$ samples:\begin{align*}&\hspace {-2pc}f_{loss}(\textbf {y}_{i},\textbf {f}(\textbf {y}_{i}, \{D^{(k)},\mu ^{(k)},\lambda ^{(k)}\}^{n+1}_{k=1})) \\=&\frac {1}{m} \sum _{i=1}^{m} \Vert \textbf {y}_{i} - D^{(n+1)}\textbf {f}(\textbf {y}_{i}, \{D^{(k)},\mu ^{(k)},\lambda ^{(k)}\}^{n+1}_{k=1})\Vert _{F}^{2}, \tag{16}\end{align*} where $\textbf {f}(\textbf {y}_{i},\cdot)$ denotes the sparse code produced by the forward propagation for sample $\textbf {y}_{i}$.
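A minimal sketch of the reconstruction loss of Eqn. (16), assuming the column-matrix batch layout used in the block sketch above:

```python
import torch

def reconstruction_loss(y, x, D):
    """(1/m) * sum_i ||y_i - D x_i||^2, Eqn. (16).
    y: (signal_dim, m) inputs; x: (n_atoms, m) sparse codes from the last block."""
    return (y - D @ x).pow(2).sum(dim=0).mean()
```

Calling `.backward()` on this loss propagates gradients to every block's $D^{(k)}$, $\mu^{(k)}$, and $\lambda^{(k)}$.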
To ensure the learned dictionary satisfies $D \in \Psi$, each column (atom) is normalized to unit norm after every update:\begin{equation*} \hat {\textbf {d}}^{(k+1)}_{j} = \frac {\textbf {d}^{(k+1)}_{j}}{\Vert \textbf {d}^{(k+1)}_{j} \Vert _{2}}, \tag{17}\end{equation*} where $\textbf {d}^{(k+1)}_{j}$ is the $j$-th column of $D^{(k+1)}$.
To keep the learned parameters $\mu$ and $\lambda$ positive, they are passed through:\begin{equation*} h(x) = \max \{x,0\} + \alpha, \tag{18}\end{equation*} where $\alpha$ is a small positive constant.
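These two constraints can be implemented as simple post-processing steps, sketched below with hypothetical helper names:

```python
import torch

def normalize_columns(D):
    """Rescale each dictionary atom to unit l2 norm, Eqn. (17)."""
    return D / D.norm(dim=0, keepdim=True).clamp_min(1e-12)

def h(x, alpha=1e-4):
    """h(x) = max{x, 0} + alpha, Eqn. (18); keeps mu and lambda strictly positive."""
    return torch.relu(x) + alpha
```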
C. Classification
In this sub-section, we classify the test sample by feeding its sparse representation into the fully connected classifier, which is trained with the cross-entropy loss:\begin{equation*} \underset {\theta }{\arg \min } -\sum _{j=1}^{c} p_{j}(i) \log _{2} q_{j}(i), \tag{19}\end{equation*} where $\theta$ denotes the classifier parameters, $c$ is the number of grasp classes, $p_{j}(i)$ is the ground-truth probability that sample $i$ belongs to class $j$, and $q_{j}(i)$ is the predicted probability.
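A minimal sketch of such a fully connected classifier head trained with cross-entropy (Eqn. (19)); the four-class output matches the grasp types of the datasets, while the code dimension and hidden width are assumptions:

```python
import torch
import torch.nn as nn

n_atoms, n_classes = 256, 4             # 4 grasp types; n_atoms is an assumption
classifier = nn.Sequential(
    nn.Linear(n_atoms, 128),
    nn.ReLU(),
    nn.Linear(128, n_classes),
)
criterion = nn.CrossEntropyLoss()       # cross-entropy term of Eqn. (19)

codes = torch.randn(8, n_atoms)         # batch of sparse codes from the last block
labels = torch.randint(0, n_classes, (8,))
loss = criterion(classifier(codes), labels)
loss.backward()                         # trains the classifier (and, end-to-end, the blocks)
```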
Experiments
A. Dataset
We used two datasets, the RGB-D Object dataset [28], [29] and the Hit-GPRec dataset [17], to evaluate the proposed network. The RGB-D Object dataset was originally proposed for object category classification and was later manually labeled with gestures by Zhang [29]. It contains a total of 300 objects and 207,921 images labeled with four gestures (palmar wrist neutral, palmar wrist pronated, pinch, tripod). The Hit-GPRec dataset was proposed by Shi [17] and was originally designed for grasp pattern recognition. It contains a total of 121 everyday objects labeled with four gesture types (cylindrical, lateral, spherical, and tripod). Each object was photographed under different environmental conditions (4 types of lighting, 4 camera positions, and different object postures) at 16 rotation angles. Some objects from the two datasets are shown in Fig. 2. The datasets were not filtered, and the entire datasets were used for the experiments.
B. Sampling Methods
To test the learning ability and generalization ability of the model, we used two sampling methods: within-whole dataset cross-validation (WWC) and between-object cross-validation (BOC). WWC tests the performance of the model when facing the same objects from different angles: the dataset is randomly divided into a training set, a validation set, and a test set in the ratio 8:1:1. BOC simulates testing with samples that never appear in the training set, thereby measuring the generalization ability of the model: all images of the different views of the same object are kept together as a whole and assigned to the training, validation, or test set, again in the ratio 8:1:1.
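The difference between the two schemes comes down to what is shuffled: individual images for WWC versus whole objects for BOC. A minimal sketch of the BOC split, with hypothetical object identifiers:

```python
import random

def boc_split(object_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Between-object cross-validation: every image of an object lands in the
    same split, so test objects are never seen during training."""
    ids = sorted(set(object_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return (set(ids[:n_train]),                  # training objects
            set(ids[n_train:n_train + n_val]),   # validation objects
            set(ids[n_train + n_val:]))          # test objects
```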
C. Implementation Details
The input image was resized before being fed into the network. We used grasp accuracy (GA) as the evaluation metric:\begin{equation*} GA = \frac {R_{t}}{R_{t} + {R}_{f}}, \tag{20}\end{equation*} where $R_{t}$ and $R_{f}$ denote the numbers of correctly and incorrectly recognized test samples, respectively. For example, 9,920 correct predictions out of 10,000 test samples yield $GA = 99.20\%$.
The experiments were conducted in the same environment, using Ubuntu 9.4.0 and a GeForce RTX 2080Ti. Each experiment was run three times and the results were averaged. For the comparison methods, images were resized to the same sizes used in their original papers for training and testing. A total of 40 epochs were used for both stages of training DL-Net, as well as for all other models. All models employed the Adam optimizer.
D. Comparison With State-of-the-Art Methods
We compared DL-Net with several convolutional neural network models: CnnGrasp [16], GhostNet [30], EfficientNet [31], RegNet [32], and Lightlayers [33]. CnnGrasp was designed specifically for grasp pattern recognition, while the remaining four were proposed for general image classification. We used the two sampling methods, WWC and BOC, to test the models. The number of model parameters and the computational cost of completing one prediction for the different models are shown in Table 3. According to Table 3, the computational cost required by DL-Net to complete one prediction is much lower than that of the other methods, even the lightweight network Lightlayers, and the model size of DL-Net is comparable to that of Lightlayers.
1) Comparison in WWC
To test the robustness of the model when facing different views of an object and its ability to generalize within a dataset, we compared our method with the five other methods on the two datasets. The experimental results are shown in Table 1. Specifically, DL-Net reaches 99.20% on the RGB-D Object dataset, nearly saturating the benchmark.
Our method outperforms the other methods on both datasets. Compared to CnnGrasp, which was designed for grasp pattern recognition, DL-Net achieves the best performance on both datasets, while CnnGrasp is only average. The strongest competitor to DL-Net is EfficientNet: on the RGB-D Object dataset the two methods perform similarly, while on the Hit-GPRec dataset DL-Net outperforms EfficientNet by about 3%. All experimental results in Table 1 illustrate the effectiveness of the proposed method under WWC. In addition, the performance of DL-Net on the RGB-D Object dataset versus the Hit-GPRec dataset shows that DL-Net can extract effective features even from a small number of samples. Compared to the lightweight network Lightlayers, DL-Net is superior on both the RGB-D Object dataset and the Hit-GPRec dataset, outperforming it by nearly 15% on both. Therefore, DL-Net has a stronger memorization capability than Lightlayers and predicts well for objects that appear in the training set.
2) Comparison in BOC
To test the prediction performance of the model on unseen objects, we compared our method with the five other methods on the two datasets. The experimental results are shown in Table 2, from which it can be seen that DL-Net achieves comparable performance.
In addition, comparing Table 1 and Table 2 shows that BOC is more challenging than WWC. On the Hit-GPRec dataset, DL-Net outperforms all other methods, illustrating that DL-Net can extract effective features for grasp pattern recognition from a small number of samples, whether WWC or BOC sampling is adopted. On the RGB-D Object dataset, DL-Net ranks third, but its gap from the second-ranked GhostNet is less than 1.5%. Although DL-Net's performance here is not the best, it still outperforms most of the convolutional neural networks, showing that DL-Net is strongly competitive with them. As under WWC, DL-Net outperforms Lightlayers, with a performance difference of around 10%. Compared with Lightlayers, DL-Net performs better under BOC sampling, which indicates that DL-Net is more robust and can make more accurate predictions for objects that do not exist in the training set.
Combining Table 1 and Table 2, DL-Net outperforms most traditional deep neural networks. A contributing factor is that DL-Net is more interpretable, which avoids spending a large amount of labor and resources on manually constructing the network.
E. Ablation Study
In this section, we study the effect of the number of blocks on model performance; the tested block numbers range up to 16, as shown in Fig. 3.
The results are shown in Fig. 3. Regardless of the sampling method, the performance of DL-Net improves as the block number increases. This demonstrates that more blocks allow DL-Net to better extract the most essential features from images, which enhances grasp pattern recognition performance.
However, comparing DL-Net with block numbers 4 and 16, the performance difference between the two settings is modest, with a GA gap of around 5%. This suggests that although the block number is one factor affecting DL-Net's performance, its influence is limited; the strong performance of DL-Net is mainly attributed to the sparse representation extracted through dictionary learning.
According to Fig. 3, the accuracy of the model increases with the block number, and the increase flattens as the block number approaches 16. This shows that increasing the depth of the model improves performance, but the marginal gain diminishes as depth grows. This is why we set the block number of DL-Net to 16.
Conclusion
In this paper, we proposed a new network, DL-Net, combining dictionary learning with deep learning. Our method exploits the sparse prior within the data (i.e., natural signals are often distributed in a low-dimensional space) more effectively. Compared with traditional dictionary learning methods, DL-Net is an end-to-end model that makes predictions with a classifier composed of fully connected layers; these classifier layers are more flexible and can make full use of the sparse features for classification, so the label information is learned more efficiently from the dictionary. To prove the effectiveness of the model, we compared DL-Net with five benchmark methods on two datasets, the RGB-D Object dataset and the Hit-GPRec dataset. The experiments showed that DL-Net is competitive with state-of-the-art convolutional neural networks.