Introduction
Target detection is an important research subject in the field of computer vision [1]. It determines whether a video or image contains a target object and, if so, its regional coordinates and category. Over the years, image processing and recognition have become mainstream target detection methods, and the accuracy and speed of target detection have improved significantly. Examples include the two-stage detector Faster R-CNN [2] and the single-stage detectors RetinaNet [3], YOLOv5 [4], YOLOv7 [6], [7], [8], [9], [10], [11], and YOLOv8 [12], which improve detection accuracy by increasing network depth. At present, intelligent visual retail commodity detection technology requires both high precision and high speed. Among the single-stage detectors, YOLOv5 offers both high accuracy and high speed.
Currently, methods for occlusion detection and processing can be divided into methods based on constrained visible parts and components, methods based on optimizing the loss function, and methods that combine several of these ideas. In part-based methods, different parts are used to locate the occlusion for effective feature fusion or to define the information of certain areas. In loss-based methods, the loss function is optimized by constraining the distance loss between the predicted and real frames so that the prediction moves closer to the real detection frame. Another approach starts from the occlusion dataset: the detection model is trained on a large number of generated occlusion examples, and an attention mechanism is used to enhance the reliability of the network and improve detection performance and robustness.
In the process of commodity testing, we first obtain the movement information and angle information of commodities. Then, the feature of the proportion of unobstructed parts is extracted through the ResNet feature extraction network [13], [14] and the features of some location areas of commodities are combined with feature entropy
Bi-directional feature pyramid network feature fusion
The earliest approach extracts high-level features from the backbone and uses them directly for prediction. However, this structure does not include feature fusion, which results in low accuracy. Subsequently, the feature pyramid network (FPN) [24], which is based on the concept of feature fusion, was proposed. First, a top-down path is established for feature fusion. The fused feature maps then carry the semantic information of the higher-level features, which is used to improve commodity detection accuracy. However, this top-down FPN is limited by its one-way information flow. PANet was therefore proposed; it adds a bottom-up path on top of the FPN and uses the strong location information of the low-level feature maps to fuse features, thus improving detection accuracy. However, PANet extracts features from lower to higher levels, which can easily lead to missed features. The BiFPN used in this study stacks simple feature maps and feeds feature maps of different resolutions into the same node, thereby fusing more features without increasing the number of parameters.
Attention mechanism
In deep learning, the attention mechanism is inspired by human visual processing, whereby people focus on areas of interest. Algorithms that suppress interfering background information so that the region of interest can be located quickly are an important research direction in target detection. The purpose of introducing an attention mechanism is therefore to eliminate redundant information and extract effective information about commodities, thereby improving network performance.
Non-maximum suppression (NMS) based on the soft-IOU algorithm
The traditional NMS algorithm [25] uses greedy clustering with a fixed distance threshold. It selects high-score detection results and deletes neighboring results whose overlap exceeds the threshold, thus balancing precision and recall. However, if the prediction box is not aligned with the real box, the IOU cannot accurately reflect their overlap and therefore cannot filter detection boxes effectively. Therefore, a Dsoft-IOU algorithm is used to improve the NMS in this study. The algorithm uses a new intersection-over-union formula and optimizes the penalty function to take the center distance of the prediction boxes into account. It reduces the confidence of an overlapping prediction box instead of deleting it, which prevents preselection boxes from being removed, improves the recall rate, and enhances the prediction ability of the model.
The structure of this article is as follows: Section II introduces the overall algorithm and related theory. Section III introduces the details of the experiment, including the experimental platform, experimental dataset, comparative experiment, ablation experiment, visualization of experimental data and results, experimental results, and analysis. Section IV summarizes the proposed algorithm.
Related Work
In traditional smart retail containers, the most common technologies are gravity sensing, radio-frequency identification (RFID) [26], face recognition, and two-dimensional codes. With the growing demand for a better consumer experience, RFID and gravity-sensing technologies [27] no longer match the current trend. At the same time, real-time detection of occluded commodities [28], [29] is affected by objective factors such as illumination, motion blur [30], environmental impact, and morphological changes, which reduce the accuracy and stability of target detection. In the field of visual detection, occlusions have a significant impact on scene reconstruction, object recognition [31], behavior recognition [32], [33], target tracking, stereo matching, visual measurement, and other visual tasks. Thus, the occlusion problem has gradually been separated from these visual tasks and has been studied extensively by both domestic and foreign researchers. Therefore, reducing missed and false detections and increasing the detection accuracy of occluded targets has become a research hotspot in the field of commodity target detection.
In this study, a new feature extraction algorithm that combines global information and an adaptive matching algorithm with a feature entropy mechanism is presented for detecting occlusions and foreign bodies in smart retail containers. The algorithm applies a feature entropy mechanism and adds an attention module [34], [35] to the uncovered part to extract as many effective commodity features as possible. To address illumination change, color similarity between the background and the target, and the difficulty of feature extraction under occlusion, C. Xiu proposed a method combining the Camshift algorithm [36], [37] with a Kalman filter to improve robustness and ensure real-time performance. Liang et al. [38] proposed DetectPreer, a category-auxiliary object detector based on the Transformer, which uses data augmentation to improve the backbone network and an attention mechanism to extract channel and spatial characteristics and direction information. Liang et al. [39] proposed an improved Sparse R-CNN, which integrates an attention module with ResNeSt to construct a feature pyramid and modifies the backbone network to extract more effective features.
The influence of different occlusion levels on detection performance is quantified by analyzing commodity detection performance under different occlusion ratios. Based on this analysis, high-quality positive samples are selected for model training by introducing attenuation weights according to the occlusion ratio, which effectively improves detection performance under occlusion and locates the uncovered areas. This method uses VGG to convolve the commodity features. To obtain information on the commodity features, a feature entropy threshold is introduced to determine and apply the hole (dilated) convolution in SENet [15]; the weight of the feature entropy threshold [40] is set and passed into the attention channel, and the result is finally fed into the pooling layer to obtain effective features. By calculating the position offset of the detection target over time, the IOU loss regression between the prediction box of the occluded target and the real box is improved, the redundancy of detection boxes is reduced, and an appropriate detection box size is determined under non-maximum suppression (RPN) [41]. This significantly improves the accuracy of feature extraction for occlusion detection, enhances detection robustness, and yields an appropriate detection framework.
In this study, the structure of the target detection algorithm is modified. (1) An attention enhancement algorithm based on semantic segmentation is proposed to recover the high-frequency information that is missing in occlusion detection, to extract commodity features from the unobstructed regions identified by semantic segmentation, and to provide location information for video frame localization. (2) In the detection module, the designed YOLO-R algorithm, which adds a residual module to enhance the residual characteristics of the convolutional network, is deployed. The algorithm improves the NMS algorithm and adopts the BiFPN feature pyramid. During feature extraction, it enhances detailed information, fuses features from the different scales of the network, enhances the robustness of the features, and improves model accuracy. In YOLO-R, the SENet channel attention mechanism is added, which makes the network pay more attention to the unobstructed areas of commodities and extract effective commodity information. Finally, in the post-processing stage, the NMS is adjusted automatically using the Dsoft-IOU with the location information of the threshold set by the eigenvalue function, which prevents the true frame from being filtered out and enhances the generalization of the model.
Method
In a related study, a new YOLO-R algorithm is designed. The YOLO-R algorithm uses the BiFPN feature pyramid for feature fusion. In the upsampling module, the lightweight CARAFE upsampling operator is used to enrich the sampled features. In this study, the SENet attention mechanism is added to YOLO-R, and the attention mechanism threshold is set by feature entropy. YOLO-R contains two CSP structures designed to reduce inference cost and computation: CSP blocks for the backbone network and CSP_PPM, mainly for the neck network. In post-processing, the NMS algorithm is redesigned around the Dsoft-IOU. The penalty function threshold is increased, and samples are filtered out by Dsoft-IOU range loss regression to reduce the impact of redundant features. The algorithm used in this study is shown in Fig. 1.
Overview of the proposed architecture. (a) We first perform initial feature extraction convolution based on U-Net to get the basic features and fuse the basic features to get the semantic segmentation results. (b) The semantic segmentation features and also the commodity unmasked features are convolutionally pooled, in which the features are sampled to obtain the commodity effective features, the detection framework is built according to the effective features, and the suitable detection frame is filtered using non-maximal suppression.
A. Combining ECA Semantic Feature Extraction
The improved FCN used in this study adopts U-Net as the feature extractor and adds the ECA attention mechanism [33] to the U-Net sampling path to speed up model inference. Specifically, the U-Net semantic segmentation module is improved by introducing a channel interaction strategy module (ECA module). The module uses a one-dimensional convolution, which effectively reduces the number of parameters and the computational cost, thus improving performance. It adds only a few parameters and avoids the adverse effect of dimension-reducing convolution on the channel attention mechanism.
The selected ECA module balances computational performance and model complexity. First, the features are aggregated by global average pooling to obtain global channel information, as follows:\begin{equation*} y=\frac {1}{H\times W}\sum \nolimits _{a=1}^{H} \sum \nolimits _{b=1}^{W} {x_{i}(a,b)} \tag{1}\end{equation*}
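In the formula, H and W are the height and width of the feature map, x_i(a,b) is the value of the i-th channel at position (a,b), and y is the resulting channel descriptor.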
Secondly, using the channel dimension \begin{equation*} k=\varphi (C)=\left |{ \frac {{\mathrm {log}}_{2}(C)}{\gamma }+\frac {b}{\gamma } }\right | \tag{2}\end{equation*}
In the formula:
The extraction performance is improved through multichannel shared weight information interaction as follows:\begin{equation*} w=\sigma (C1D_{k}(y)) \tag{3}\end{equation*}
In this study, C1D represents a one-dimensional convolution with kernel size k, and σ is the sigmoid activation function.
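The three ECA steps in Eqs. (1)-(3) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the defaults gamma = 2 and b = 1, the rounding of k to an odd integer, the zero padding at the channel borders, and all function names are assumptions introduced here.

```python
import math


def global_average_pool(feature_map):
    """Eq. (1): mean of each channel over its H x W positions.

    feature_map: list of C channels, each a list of H rows with W floats.
    """
    pooled = []
    for channel in feature_map:
        height, width = len(channel), len(channel[0])
        pooled.append(sum(sum(row) for row in channel) / (height * width))
    return pooled


def eca_kernel_size(channels, gamma=2, b=1):
    """Eq. (2): kernel size k from the channel count C.

    gamma=2, b=1 and the rounding to an odd integer are assumptions.
    """
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


def conv1d_same(values, kernel):
    """1D convolution across neighbouring channels with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]


def eca_weights(feature_map, kernel):
    """Eq. (3): w = sigmoid(C1D_k(y)), one weight per channel."""
    y = global_average_pool(feature_map)
    return [1.0 / (1.0 + math.exp(-v)) for v in conv1d_same(y, kernel)]
```

In a trained network the kernel values would be learned; here they are passed in explicitly so that the data flow stays visible.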
B. Feature Extraction
1) Characteristic Entropy Weight Distribution
Local feature points are extracted using the YOLO-R residual module, a convolution self-coding neural channel is added for the low-dimensional pixel points, and global features are exploited to enhance the detection characteristics. The adaptive weight redistribution algorithm originated from the similarity principle of the gestalt grouping. The degree of color similarity is used as an influencing factor for weighting, and the influencing factor is determined by the Euclidean distance. In this study, a fully connected layer \begin{equation*} g=F\times \left ({\frac {1}{H_{D}W_{D}}\sum \nolimits _{h,w} d_{h,w}^{p} }\right)^{1/p}+b_{F} \tag{4}\end{equation*}
Assuming the color features extracted by the keyframe are \begin{equation*} F_{n}=H_{Concat}(B^{1}_{n},B^{2}_{n}\mathrm {,\cdots }B^{i}_{n}) \tag{5}\end{equation*}
Therefore, the fused feature is \begin{equation*} p(F_{i})=\frac {F_{i}}{\sum \nolimits _{i=1}^{n} F_{i} } \tag{6}\end{equation*}
Its corresponding eigenvalue is \begin{equation*} E_{i}=-\frac {2}{n+1}\sum \nolimits _{i=1}^{n} {p(F_{i})\ln p(F_{i})} \tag{7}\end{equation*}
First, according to the feature weight set in this study \begin{align*} \omega _{i}=\begin{cases} \displaystyle \frac {E_{i}}{\sum \nolimits _{i=1}^{n} E_{i} },&E_{i}>\tau \\ \displaystyle 0,&else\end{cases} \tag{8}\end{align*}
According to the weight, other IOUs can be filtered below the set threshold, and the obtained attention mechanism weight can be introduced into SENet, and the characteristics can be filtered to obtain the effective characteristics of the commodity.
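Equations (6)-(8) can be illustrated with a short sketch. It assumes that each feature group is summarized by a list of positive scalar responses and that the threshold tau is chosen by the user; the function names and this input representation are assumptions made for illustration only.

```python
import math


def feature_entropy(responses):
    """Eqs. (6)-(7): entropy of one feature group.

    responses: positive scalar responses F_1..F_n of the group (assumed input).
    """
    n = len(responses)
    total = sum(responses)
    probs = [r / total for r in responses]  # Eq. (6)
    return -2.0 / (n + 1) * sum(p * math.log(p) for p in probs if p > 0)  # Eq. (7)


def entropy_weights(entropies, tau):
    """Eq. (8): normalised weight for entropies above tau, zero otherwise."""
    total = sum(entropies)
    return [e / total if e > tau else 0.0 for e in entropies]
```

For example, `entropy_weights([0.5, 0.1, 0.4], tau=0.2)` keeps the first and third groups and sets the weight of the second group to zero.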
2) YOLO-R Multi-Cascade Convolution Feature Extraction
Commodity detection relies on the target position and semantic information of the video frame. In this study, YOLOv5 is used with a cascade for feature extraction according to the characteristics of occluded commodities [43]. ResNet residual features are added to YOLOv5, which yields the YOLO-R algorithm. YOLO-R strengthens the learning of feature relationships between layers, enlarges the receptive field of the network, fuses high- and low-resolution features through convolution, compensates for the loss of high-resolution semantic information, enriches edge information, and improves the detection accuracy of occluded targets.
Numerous residual module area features are used in the YOLO-R algorithm employed in this study. Therefore, a simple feature module is used instead of the representation algorithm module structure. In the residual extraction module, the backward-propagating module propagates from forward to backward, and finally selects the last layer in the feature block, as shown in Fig. 2.
The YOLO-R residual convolution is based on the ResNet deep residual network, which can effectively improve feature extraction performance. This module is a framework that stacks blocks with the same connection shape. The blocks used in this study are also known as residual units. A residual unit is computed as follows:\begin{align*} y_{n}&=h\left ({x_{n} }\right)+\mathcal {F}(x_{n},W_{n}) \tag{9}\\ x_{n+1}&=f(y_{n}) \tag{10}\end{align*}
In this formula,
Because \begin{equation*} x_{n+1}=x_{n}+\mathcal {F}(x_{n},W_{n}) \tag{11}\end{equation*}
\begin{align*} x_{n+2}&=x_{n+1}+\mathcal {F}\left ({x_{n+1},W_{n+1} }\right)=x_{n}+\mathcal {F}\left ({x_{n},W_{n} }\right) \\ &\quad +\mathcal {F}(x_{n+1},W_{n+1}) \tag{12}\end{align*}
For cell \begin{equation*} x_{n+1}=x_{n}+\mathcal {F}(x_{n},W_{n}) \tag{13}\end{equation*}
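The recursion in Eqs. (11)-(13) can be checked with a small sketch. It assumes an identity shortcut and uses a hypothetical residual branch (an elementwise scale followed by ReLU) in place of the convolutional branch; both names below are illustrative only.

```python
def residual_unit(x, branch):
    """Eq. (13): x_{n+1} = x_n + F(x_n, W_n) with an identity shortcut."""
    return [xi + fi for xi, fi in zip(x, branch(x))]


def make_branch(weight):
    """Hypothetical residual branch F(., W): elementwise scale, then ReLU."""
    return lambda x: [max(0.0, weight * xi) for xi in x]


# Stacking two units reproduces Eq. (12):
# x_{n+2} = x_n + F(x_n, W_n) + F(x_{n+1}, W_{n+1})
x0 = [1.0, -2.0]
f0, f1 = make_branch(0.5), make_branch(2.0)
x1 = residual_unit(x0, f0)
x2 = residual_unit(x1, f1)
```

Because the shortcut is additive, the input x0 reaches x2 unchanged, and each unit only contributes a correction term.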
Therefore, the low-resolution image \begin{equation*} F_{0}=H_{Conv}(I_{LR}) \tag{14}\end{equation*}
The image features are then added pixel-by-pixel before being fused with the BiFPN to obtain an effective feature map of the commodity.\begin{align*} B^{i}_{n}&=B^{i-1}_{n}+H_{Concat}(F^{i,1}_{n},F^{i,1}_{n} \\ &\quad +F^{i,2}_{n},F^{i,1}_{n}+F^{i,2}_{n}+F^{i,3}_{n}) \tag{15}\end{align*}
3) Multiscale Feature Fusion
With the deepening of the YOLO-R network level of the algorithm in this study, the commodity features are convoluted from low to high dimensions. However, with the deepening of the feature extraction in each layer of the YOLO-R extraction network, a few features are missing. Therefore, it is necessary to fuse features at different levels to enhance the feature semantics. A lightweight BiFPN is used for feature fusion, and the YOLO-R backbone network is used for bidirectional feature fusion at different scales.
A new BiFPN, where
The size of the feature map output by YOLO-R in this study is \begin{align*} P_{4}^{TD}&=Conv\left({\frac {w_{1}\cdot P_{4}^{IN}+w_{2}\cdot Resize\left ({P_{5}^{IN} }\right)}{w_{1}+w_{2}+\varepsilon }}\right) \tag{16}\\ P_{3}^{OUT}&=Conv\left({\frac {w_{3}\cdot P_{3}^{IN}+w_{4}\cdot Resize\left ({P_{4}^{TD} }\right)}{w_{3}+w_{4}+\varepsilon }}\right) \tag{17}\\ P_{4}^{OUT}&=Conv\left({\frac {w_{5}\cdot P_{4}^{IN}+w_{6}\cdot P_{4}^{TD}+w_{7}\cdot Resize\left ({P_{3}^{OUT} }\right)}{w_{5}+w_{6}+w_{7}+\varepsilon }}\right) \tag{18}\\ P_{5}^{OUT}&=Conv\left({\frac {w_{8}\cdot P_{5}^{IN}+w_{9}\cdot Resize\left ({P_{4}^{OUT} }\right)}{w_{8}+w_{9}+\varepsilon }}\right) \tag{19}\end{align*}
In the fusion network, different feature entropy weights \begin{equation*} \mathrm {Out=}\sum \nolimits _{i} {\frac {w_{i}}{\mathrm {\varepsilon +}\sum \nolimits _{j} w_{j}}\times In_{i}} \tag{20}\end{equation*}
According to the formula, each normalized feature entropy weight is
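The weighted fusion of Eqs. (16)-(20) can be sketched as follows. To keep the example short, feature maps are flat lists of equal length, and Conv and Resize are replaced by an identity placeholder; these simplifications, the clamping of weights to non-negative values, and all names are assumptions for illustration.

```python
def fuse(inputs, weights, eps=1e-4):
    """Eq. (20): weighted sum of same-sized maps, normalised by eps + sum(w)."""
    weights = [max(0.0, w) for w in weights]  # keep weights non-negative (assumption)
    norm = eps + sum(weights)
    return [
        sum(w * feat[i] for w, feat in zip(weights, inputs)) / norm
        for i in range(len(inputs[0]))
    ]


def identity(feature):
    """Placeholder for Conv and Resize, which are not modelled in this sketch."""
    return feature


def bifpn_layer(p3, p4, p5, w, eps=1e-4, conv=identity, resize=identity):
    """Eqs. (16)-(19); w maps the indices 1..9 to the fusion weights w_1..w_9."""
    p4_td = conv(fuse([p4, resize(p5)], [w[1], w[2]], eps))
    p3_out = conv(fuse([p3, resize(p4_td)], [w[3], w[4]], eps))
    p4_out = conv(fuse([p4, p4_td, resize(p3_out)], [w[5], w[6], w[7]], eps))
    p5_out = conv(fuse([p5, resize(p4_out)], [w[8], w[9]], eps))
    return p3_out, p4_out, p5_out
```

The small constant eps only guards against division by zero when all weights of a node are zero.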
4) SENet Attention Allocation Combined with Characteristic Entropy
In SENet, the commodity feature information is entered into the channel attention mechanism through stacked clustering feature layers. The feature map under the restriction of information entropy is beneficial for filtering out the effective features and learning the weights of each layer automatically. In the SENet module, a \begin{equation*} M_{s}=\sigma \left \{{f_{Conv}^{\mathrm {1\times 1}}\left \{{ f_{Conv}^{\mathrm {5\times 5}}\left [{ f_{Conv}^{\mathrm {7\times 7}}\left ({F_{i}^{\prime} }\right) }\right] }\right \} }\right \} \tag{21}\end{equation*}
The squeeze operation in SENet enlarges the global receptive field of a commodity, obtains the spatial feature information through max pooling, and uses a convolution kernel to raise the dimension, yielding a spatial attention feature map \begin{equation*} M_{s}=\sigma \left \{{f_{Conv}^{\mathrm {3\times 3}}\left [{ MaxPooling\left ({F^{\prime} }\right) }\right] }\right \} \tag{22}\end{equation*}
Finally, according to the obtained spatial attention feature map \begin{equation*} F^{\mathrm {''}}=M_{s}\odot F_{i}^{\prime} \tag{23}\end{equation*}
The attention model is introduced into the ResNet residual network. First, a squeeze operation compresses the feature dimension. Then, a ReLU activation function is applied after the fully connected layer to build the attention channel, and the channel feature weights are obtained through the sigmoid function. Finally, the original channels are weighted by these channel weights, and the effective commodity features are output. The improved attention mechanism in this study has few parameters and can be embedded easily into the YOLO-R residual network. The structure is shown in Fig. 5.
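The squeeze, excitation, and re-weighting steps described above can be sketched in a few lines. The sketch assumes the standard SE layout (global average pooling, a reducing fully connected layer with ReLU, an expanding fully connected layer with sigmoid); the weight matrices are passed in explicitly, and the function name and argument layout are assumptions rather than the configuration used in this study.

```python
import math


def se_block(feature_map, w_reduce, w_expand):
    """Channel attention in three steps: squeeze, excitation, scale.

    feature_map: list of C channels, each a list of H rows with W floats.
    w_reduce: R x C weights of the reducing fully connected layer.
    w_expand: C x R weights of the expanding fully connected layer.
    """
    # Squeeze: one global average per channel (assumed pooling type).
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: fully connected layer + ReLU, then fully connected layer + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_reduce]
    scale = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: weight every channel by its attention value.
    return [[[s * v for v in row] for row in ch] for s, ch in zip(scale, feature_map)]
```

With all-zero reducing weights every channel receives the neutral attention value 0.5, which makes the re-weighting step easy to verify by hand.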
C. Inspection Frame Construction
1) Strict Decision RPN
The core of the RPN [22], [23] is an anchor. The detection prediction box is generated using an anchor that could be used to select the location of the following detection box. In this study, the K-means clustering algorithm is used to adaptively generate the anchor parameters to improve the clustering effect.
Commodity detection is evaluated primarily by the overlap between the predicted and real detection boxes, measured by the overlap ratio (IOU). The IOU gives the similarity between two boxes according to their degree of overlap. In this study, a new measure based on the Dsoft-IOU is used. Its formula is as follows:\begin{align*} IOU(a,b)&=\frac {\vert a\cap b\vert }{\vert a\cup b\vert } \tag{24}\\ d_{i}&=\beta \sqrt {1-IOU(b_{bbx},c_{cluster,i})} \tag{25}\end{align*}
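Anchor clustering with the distance of Eq. (25) can be sketched as a small K-means loop. The sketch assumes that boxes are given as (width, height) pairs aligned at a common corner, that beta defaults to 1, and that the first k boxes serve as initial centroids; these choices and the function names are assumptions for illustration.

```python
import math


def wh_iou(a, b):
    """Eq. (24) for two (width, height) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def distance(box, centroid, beta=1.0):
    """Eq. (25): d = beta * sqrt(1 - IOU(box, centroid))."""
    return beta * math.sqrt(max(0.0, 1.0 - wh_iou(box, centroid)))


def kmeans_anchors(boxes, k, beta=1.0, max_iters=100):
    """Cluster (width, height) boxes into k anchors using the IOU distance."""
    centroids = [tuple(map(float, boxes[i])) for i in range(k)]  # simple init (assumption)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda j: distance(box, centroids[j], beta))
            clusters[nearest].append(box)
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if updated == centroids:  # assignments no longer change the centroids
            break
        centroids = updated
    return centroids
```

Using an IOU-based distance instead of the plain Euclidean distance keeps large boxes from dominating the clustering.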
Second, a loss function is combined with the IOU in this study to reduce the impact of redundant features. It is defined as follows:\begin{equation*} L_{IOU}=1-IOU(a,b)+\frac {d_{i}^{2}(a,b)}{c^{2}(a,b)}+\alpha \beta \tag{26}\end{equation*}
\begin{align*} \beta &=\frac {\alpha }{(1-IOU(a,b))+\alpha } \tag{27}\\ \alpha &=\frac {4}{\pi ^{2}}\left({\arctan \frac {w^{gt}}{h^{gt}}-\arctan \frac {w}{h}}\right)^{2} \tag{28}\end{align*}
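A worked sketch of Eqs. (26)-(28) is given below. It assumes boxes in (x1, y1, x2, y2) form, takes d as the distance between the box centers, and takes c as the diagonal of the smallest box enclosing both; these interpretations of d and c, and the function names, are assumptions that the text does not state explicitly.

```python
import math


def iou(a, b):
    """Eq. (24) for two boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def iou_loss(pred, gt):
    """Eq. (26): 1 - IOU + d^2 / c^2 + alpha * beta."""
    overlap = iou(pred, gt)
    # Squared distance between the box centres (assumed meaning of d).
    d2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
        + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest enclosing box (assumed meaning of c).
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
        + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # Eq. (28): aspect-ratio term.
    alpha = 4 / math.pi ** 2 * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    # Eq. (27): trade-off factor; zero when the boxes coincide.
    denom = (1 - overlap) + alpha
    beta = alpha / denom if denom > 0 else 0.0
    return 1 - overlap + d2 / c2 + alpha * beta
```

For two identical boxes every term vanishes and the loss is zero; for boxes with equal aspect ratio only the overlap and center-distance terms contribute.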
Structure diagram of the decision RPN. The pre-designed boxes are convolved and pooled to obtain the center area of the final detection box.
2) Nonmaximal Suppression Based on Scale Estimation
The NMS algorithm is a post-processing algorithm that deletes the redundant prediction boxes of a network. Each box is scored according to its confidence level. Then, a bubble sort is performed on the confidence scores, and the IOU between the highest-scoring prediction box and each remaining box is compared with the IOU threshold. If the IOU exceeds the threshold, the remaining box is deleted; the box with the maximum confidence score is kept. This is repeated until all prediction boxes have been processed.
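The procedure just described corresponds to the classic greedy NMS, which can be sketched as follows. Boxes are assumed to be in (x1, y1, x2, y2) form, Python's built-in sort stands in for the sorting step, and the default threshold of 0.5 is an illustrative assumption.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest remaining confidence
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

In this example the second box overlaps the first with an IOU of about 0.68 and is removed, while the distant third box is kept.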
In this study, the DSoft-NMS algorithm is used. In addition to the IOU, the algorithm takes into account the Euclidean distance between the highest-scoring box and each prediction box: the larger the overlap with the highest-scoring detection box, the smaller the confidence level assigned to the prediction box. The score update is derived from the IOU as follows:\begin{align*} S_{i}=\begin{cases} \displaystyle S_{i}, & \displaystyle IOU(M,b_{i})-\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)} < w_{i} \\ \displaystyle S_{i}\left [{ 1-IOU(M,b_{i})+\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)} }\right], & \displaystyle IOU(M,b_{i})-\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)}\ge w_{i} \end{cases} \tag{29}\end{align*}
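The score decay of Eq. (29) can be sketched as follows. The sketch assumes boxes in (x1, y1, x2, y2) form, a single threshold w shared by all boxes, center distance for d, the enclosing-box diagonal for c, and a final score cut-off for discarding boxes; none of these values is taken from the experiments, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def center_penalty(a, b):
    """d^2 / c^2: squared centre distance over squared enclosing diagonal."""
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return d2 / c2


def dsoft_nms(boxes, scores, w=0.3, score_threshold=0.001):
    """Decay the scores of overlapping boxes (Eq. (29)) instead of deleting them."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])  # box M
        remaining.remove(best)
        keep.append(best)
        for i in remaining:
            r = iou(boxes[best], boxes[i]) - center_penalty(boxes[best], boxes[i])
            if r >= w:
                scores[i] *= 1.0 - r  # second case of Eq. (29)
        # Discard boxes whose score has decayed below the cut-off (assumption).
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept, new_scores = dsoft_nms(boxes, [0.9, 0.8, 0.7])
```

Unlike hard NMS, the overlapping second box survives with a reduced score, so a truly separate but occluded commodity is not lost outright.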
Experimental Results and Discussion
In this section, the algorithm system, experimental parameters, and procedures are described in detail. First, the overall process is introduced step-by-step. Subsequently, the key experimental parameters are determined during the training process. Finally, these indices are used to evaluate the occlusion detection results.
A. Comparison With State-of-the-Arts
The experiments are analyzed and compared using self-made smart retail container datasets. The experimental device is a Lenovo Xiaoxin Air-12IIL 2020 with an Intel Core i5-1035G1 CPU and a discrete NVIDIA GeForce MX350 graphics card (2 GB). The system and software running the algorithm are Win10 and PyCharm, respectively. An HF899 with a 2.7-mm (135° distortion-free) camera is used.
A self-made dataset of common commodities in everyday retail containers is used, comprising 4,270 commodity images. The dataset is labeled according to the VOC2007 format. Of the 4,270 images, 3,843 are used for training and 427 for validation. The test commodity datasets are constructed separately.
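The 3,843 / 427 split corresponds to a 9:1 ratio. A minimal sketch of such a split is shown below; the file names, the fixed seed, and the use of a simple random shuffle are assumptions for illustration, since the text does not state how the split was drawn.

```python
import random


def split_dataset(items, val_fraction=0.1, seed=0):
    """Shuffle the items reproducibly and split them into train and validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]


# Hypothetical file names standing in for the 4,270 labeled images.
images = [f"img_{i:04d}.jpg" for i in range(4270)]
train, val = split_dataset(images)
```

Fixing the seed keeps the split reproducible across training runs of the compared models.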
B. Experimental Evaluation Criteria
Based on a review of related studies, the average precision (AP) and the mean average precision (mAP), which averages the AP over the commodity categories, are used in this study. Additionally, the F1 index is used as the evaluation standard for model stability [43]. A detected sample is considered positive when the confidence level of the commodity detection box is higher than the threshold value and negative when it is equal to or lower than the threshold value. The recall rate R is the proportion of all positive samples that are correctly detected, defined as \begin{equation*} R=\frac {TP}{TP+FN} \tag{30}\end{equation*}
The TP is the number of samples correctly classified as positive. FN is the total number of positive samples incorrectly identified as negative samples. T represents the maximum spacing between video frames.
Precision P represents the proportion of correctly detected positive samples among all samples that the algorithm detects as positive, where FP is the number of negative samples incorrectly identified as positive. It is defined as \begin{equation*} P=\frac {TP}{TP+FP} \tag{31}\end{equation*}
Set the total number of samples to \begin{equation*} AP=\sum \nolimits _{k=1}^{n} {p_{k}(r_{k+1}-r_{k})} \tag{32}\end{equation*}
\begin{equation*} mAP=\frac {1}{L}\sum \nolimits _{q=1}^{L} {AP_{q}} \tag{33}\end{equation*}
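Equations (32) and (33) can be evaluated with a short sketch. It assumes that the precision values p_1..p_n and the recall values r_1..r_{n+1} are already available as lists, and that mAP averages the AP over the commodity categories; the function names and example values are illustrative only.

```python
def average_precision(precisions, recalls):
    """Eq. (32): sum of p_k * (r_{k+1} - r_k).

    precisions holds n values; recalls holds n + 1 increasing values.
    """
    return sum(p * (recalls[k + 1] - recalls[k]) for k, p in enumerate(precisions))


def mean_average_precision(ap_per_class):
    """Eq. (33): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision([1.0, 0.5], [0.0, 0.5, 1.0])
m_ap = mean_average_precision([ap, 0.25])
```

Here the AP is the area under a two-step precision-recall curve, 1.0 x 0.5 + 0.5 x 0.5.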
The model stability is measured using the F1 value (H-mean value), which is the harmonic mean of precision and recall. A higher F1 value indicates a more stable model. The formula used is as follows:\begin{equation*} F_{1}=\frac {2PR}{P+R}=\frac {2TP}{2TP+FP+FN} \tag{34}\end{equation*}
The formula shows that F1 weights precision and recall equally and combines them as a harmonic mean.
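The relation between Eqs. (30), (31), and (34) can be checked numerically with a small sketch. The counts used in the example are made up for illustration and are not experimental results; the function name is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (30), (31), (34) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 correct detections, 2 false alarms, 2 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Both forms of Eq. (34) agree: 2PR / (P + R) and 2TP / (2TP + FP + FN) give 0.8 for these counts.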
C. Analysis of Experimental Results
In this study, a semantic segmentation algorithm is used to enhance the semantic information of the commodities and improve the accuracy of the data. Subsequently, the commodities are detected using YOLO-R, which is connected to the semantic segmentation network; together they constitute the overall process. The test results for this process are as follows.
The results of the algorithm detection are shown in Fig. 8. The first, second, and third rows represent the original image, semantically segmented prediction results, and detection results, respectively. It can be observed from Fig. 8 that in all semantic segmentation maps of commodities, the commodities are separated from the background, and the algorithm accurately detected obscured commodities.
Results output during the experiment. First row: input occluded commodities. Second row: output semantic segmentation results. Third row: output detection results.
At the beginning of this study, the ideal number of iterations is determined to be 100 based on training sessions conducted for each comparison model. Furthermore, the stability of the proposed model is compared to that of the traditional models, as shown in Fig. 9.
Iterative stability of loss training. In (a)–(f), the network models are stable after 100 training iterations.
1) Public Dataset Detection and Comparison
This study first verifies the performance of the algorithm. The algorithm used in this study and the traditional algorithms are trained and verified using the VOC2007 public dataset. The data contain 21,503 pictures, which are divided into a training set and a verification set at a ratio of 9:1 through cross-validation.
Table 1 reports the mAP and F1 values obtained on VOC2007. Following the literature, the YOLOX [44] and DETR [45] models are included for comparative analysis. The comprehensive analysis shows that the models achieve similar mAP, and that the higher the F1 index of a model, the stronger its stability. Therefore, this paper builds on the more mature YOLOv5 model, whose performance is more stable than that of the other networks.
2) Comparison of Self-Made Dataset Detection Models
The results in Table 2 are based on 100 training iterations. Compared with the current mainstream models, the proposed model is superior in AP and Recall: the YOLO-R model is 0.01% higher than YOLOX in AP and outperforms DETR, although its accuracy is slightly lower than that of YOLOX. In terms of commodity detection speed, the original YOLOv5 model is slower than YOLOv7, but the improved YOLO-R is 3.83 F/s faster than the original model and 2.48 F/s faster than YOLOv7. YOLO-R also improves the detection accuracy by 0.9% compared with the original model. This comparative experiment verifies the feasibility and superiority of the proposed algorithm. On the self-made dataset, the improved model is more accurate, and its comprehensive performance is also improved to a certain extent.
Compared with other algorithms, YOLO is an end-to-end target-detection neural network. YOLO predicts multiple candidate boxes at once and regresses the object location and the object category at the output layer, which makes it faster. Faster R-CNN and similar algorithms must first generate many candidate frames in the picture; they have many parameters and a long training time.
3) Comparison and Analysis of Different Commodity Tests on Self-Made Datasets
The occlusion detection performance of the proposed model is compared with those of other networks based on AP using test sets of different commodities. By comparing the detection effect of each network on other datasets, the stability of the proposed model is verified. Simultaneously, the universality and stability of the algorithm model are demonstrated, proving its applicability to different commodity detections, as shown in Table 3.
As shown in Table 3, the detection accuracy of YOLO-R on different commodities is higher than that of the other models; on the Scream commodity, the detection accuracy of the YOLOv7 model is 0.02% lower, and overall the detection accuracy of YOLO-R is 2.69% higher than that of YOLOv7. The detection accuracy of the improved model on different commodities is 0.05%–2.42% higher than that of the second-best model. Among the selected comparison models, the overall performance of the improved YOLO-R commodity detection method is 2.18% higher than that of the original model. In conclusion, the model outperforms the general approach.
4) Comparative Analysis of Network Model Stability on the Homemade Datasets
A series of comparisons are also made for commodities with different shielding to verify the stability of the model. Two representative commodities (packed potato chips and bottled mineral water) are shielded to different degrees and their average accuracies are obtained. The comparison analysis is based on the average accuracy.
In Table 4, to compare the stability of the network models, packed potato chips and bottled water are selected as the experimental objects and are shielded to varying degrees to measure the accuracy of each network model. The experimental results show that the detection accuracy of the proposed algorithm does not differ from that of the other network models when there is little or no occlusion. The detection performance of the models decreases as the occlusion increases. Nevertheless, with an increase in the occlusion ratio, the detection performance of the YOLO-R model remains high, indicating greater stability on the self-made occlusion dataset. The detection accuracy for potato chips reaches 97.88% under serious occlusion of approximately 60%, which is 1.35% higher than that of the second-best network model. For the detection of Farmer Spring, although Faster R-CNN is not better in the presence of moderate occlusion, the overall performance is better. For severe occlusions, the detection accuracy of YOLO-R reaches 68.87%, which is 1.08% higher than that of the second-best network.
5) Comparison of Attention Mechanisms Ablation Experiments
To identify the attentional mechanism that best matches the proposed model, numerous studies are consulted and three representative attentional models are selected for ablative comparison with the attentional model in this study.
As shown in Table 5, the YOLO algorithm with the improved SE attention mechanism is 0.2% better than the original model. In this analysis, the network with the ECA attention mechanism performs similarly to the improved SE model. The focus of this study is the comparison and validation of each network model on actual commodity detection. The comparative analysis shows that YOLO-R improves both accuracy and speed, with gains of 0.40% in accuracy and 2.38 F/s in speed.
Table 6 presents an extension of the ablation experiments described previously. Several representative commodities are screened in extensive experiments and tested to verify the superiority of the improved attention mechanism.
The four attention mechanisms are compared, and the improved SE attention mechanism in this study is found to be more stable, with a detection accuracy approximately 0.2% higher in terms of mAP than when no attention mechanism is used. In Table 6, six commodities randomly sampled from the self-made dataset used in this study are selected to verify the detection accuracy (the test datasets in Tables 6 and 3 are not the same). The average accuracy is the mean of the detection accuracies for these commodities. As shown in Table 5, the average detection accuracy improves when an attention mechanism is incorporated. The greatest improvement is observed for YOLO with the SE attention mechanism, whose detection accuracy is 3.18% higher than that of YOLO. On this basis, the accuracy of the improved YOLO-R algorithm is 4.14% higher than that of the original model. Tables 5 and 6 show that the algorithm is faster, more accurate, and has better detection stability.
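For reference, the squeeze-and-excitation idea underlying the compared attention modules can be sketched as follows. This is a minimal pure-Python sketch of a standard SE block, not the improved SE variant of this study, whose modifications are not reproduced here. The function names `se_gates` and `se_block` and the weight matrices `w1` and `w2` are illustrative assumptions, and biases are omitted.

```python
import math


def se_gates(feature_map, w1, w2):
    """Channel gates of a standard SE block.

    feature_map: list of C channels, each a list of rows (H x W floats).
    w1: (C // r) x C reduction weights; w2: C x (C // r) expansion weights.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation, step 1: bottleneck layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid gate in (0, 1).
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def se_block(feature_map, w1, w2):
    """Scale every channel of the feature map by its gate."""
    gates = se_gates(feature_map, w1, w2)
    return [[[g * v for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]


# Toy example: two 2x2 channels, reduction to a single hidden unit.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]
w1 = [[1.0, 1.0]]
w2 = [[1.0], [-1.0]]
reweighted = se_block(feature_map, w1, w2)
```

In this toy example the first channel receives a gate close to 1 and the second a gate close to 0, which is the channel reweighting behaviour that the heatmaps in this paper visualize.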
6) Innovative Comparative Ablation Experiment
First, the BiFPN feature fusion model is used to replace the original PANet fusion structure, and the model parameters in the BiFPN layer are reduced. Second, the detection box algorithm is improved and the threshold limit of the eigenvalue is increased so that a suitable detection box can be generated. An ablation experiment is therefore conducted in this study to verify each of these innovations.
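The weighted fusion at the core of BiFPN can be illustrated with a short sketch. It shows the fast normalized fusion rule from the original BiFPN design, in which each input feature is scaled by a non-negative learnable weight divided by the sum of all weights. The lightweight variant used in this study is not reproduced here. The function name `fast_normalized_fusion` and the flat-list feature representation are assumptions made for illustration.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors: out = sum_i (w_i / (eps + sum_j w_j)) * f_i."""
    # ReLU keeps every fusion weight non-negative.
    relu_w = [max(0.0, w) for w in weights]
    total = eps + sum(relu_w)
    norm_w = [w / total for w in relu_w]
    # Weighted sum of the inputs, element by element.
    fused = [sum(w * f[i] for w, f in zip(norm_w, features))
             for i in range(len(features[0]))]
    return fused, norm_w


# Two feature vectors from different pyramid levels (already resized to match).
fused, norm_w = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

With equal weights the result is close to the element-wise mean of the inputs; a weight that training drives below zero is clipped and contributes nothing.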
As indicated in Table 7, the lightweight BiFPN in this study achieves the highest stability and detection accuracy. The lightweight network model is 0.01-0.02 higher than the other networks on the F1 index and 0.3%-0.4% higher on mAP.
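For readers unfamiliar with the F1 index, it is the harmonic mean of precision and recall. The sketch below computes it from detection counts; the helper names `precision_recall` and `f1_score` and the example counts are illustrative and are not taken from Table 7.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed commodities.
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f1_score(p, r)
```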
To verify that the proposed Dsoft-IOU NMS algorithm constructs detection boxes with a more accurate position and size, the model structure is kept the same in this study and only the NMS algorithm is modified for the comparison experiments, as shown in Fig. 10.
Comparison of the improved NMS algorithm with the original one. We use different types of commodities, compare the detection results, and compute the average detection accuracy over these commodities.
The improved NMS algorithm locates the detection box more accurately, so that the box effectively contains the detected commodity, which improves the detection accuracy and the overall detection effect.
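To make the role of the NMS step concrete, the following sketch implements the standard Gaussian Soft-NMS, which decays the scores of overlapping boxes instead of discarding them. It is a stand-in for illustration only: the Dsoft-IOU rule of this study, with its location information and feature-entropy threshold, is not reproduced. The parameters `sigma` and `score_threshold` and the example boxes are assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS; returns (box index, final score) pairs in selection order."""
    remaining = [(s, i) for i, s in enumerate(scores)]
    kept = []
    while remaining:
        best = max(remaining)          # current highest-scoring box
        remaining.remove(best)
        best_score, best_idx = best
        kept.append((best_idx, best_score))
        decayed = []
        for s, i in remaining:
            overlap = iou(boxes[best_idx], boxes[i])
            s *= math.exp(-(overlap ** 2) / sigma)   # stronger overlap, stronger decay
            if s >= score_threshold:                 # drop boxes whose score has vanished
                decayed.append((s, i))
        remaining = decayed
    return kept


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this example the second box overlaps the first with an IoU of about 0.68. Hard NMS with a threshold of 0.5 would remove it, whereas Soft-NMS keeps it with a reduced score, which matters when a real commodity is partly hidden behind another one.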
D. Algorithm Detection Result Graph
1) Detection Results of Different Algorithms
To make this article more convincing, this section presents a visualization of the test results. The results of the six comparison network detection models selected in this study are shown in Fig. 11 to further illustrate the detection characteristics of the proposed algorithm. The algorithm is effective in detecting occluded images in the self-made datasets, which helps address the occlusion that arises when shoppers handle commodities in a container. The following is an experimental analysis and detection result diagram of several algorithms. In the diagram, bagged potato chips are used as the detection object to facilitate the comparison of the various models.
Detection results of common network models and this paper’s network model on Ritz-Carlton potato chips.
This section also adds a comparative visualization of the commodity detection models under different signal-to-noise ratios, as shown in Fig. 12.
Interference immunity comparison chart. (a) Our model compared with the original model when the signal-to-noise ratio is 0.01. (b), (c), (d) Our model compared with the original model when the signal-to-noise ratio is 0.02, 0.5, and 0.1, respectively.
The algorithm results are displayed and analyzed below. Fig. 11 shows the detection results of the proposed model and the other network models. In this study, a self-made occluded commodity dataset is used for the comparative experiments, and the proposed model achieves a high detection accuracy compared with the other models.
Following the relevant literature [46], the detection of goods under different noise conditions is compared and analyzed. In Fig. 12, our model has better detection results than the original model at different signal-to-noise ratios, but when SNR=0.1, neither our model nor the original model can detect the commodities. This comparison verifies that the algorithm in this paper is more stable. The signal-to-noise ratio experiment shows that the anti-interference ability of the improved algorithm has been enhanced, but the anti-interference ability of YOLO-R still needs to be improved.
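The text does not state how the noisy test images are generated. As one possible reading, the sketch below assumes salt-and-pepper noise whose `density` parameter plays the role of the ratio values quoted for Fig. 12 (0.01, 0.02, and so on). Both this noise model and the function name `add_salt_and_pepper` are assumptions, not the procedure of this study or of [46].

```python
import random


def add_salt_and_pepper(image, density, seed=0):
    """Return a noisy copy of a grayscale image given as rows of 0-255 integers.

    Each pixel is independently replaced by 0 or 255 with probability `density`.
    The input image is left unchanged.
    """
    rng = random.Random(seed)          # fixed seed keeps the experiment repeatable
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < density:
                row[x] = rng.choice((0, 255))
    return noisy


# Hypothetical 4x4 mid-gray image corrupted at two noise levels.
image = [[128] * 4 for _ in range(4)]
light_noise = add_salt_and_pepper(image, density=0.01)
heavy_noise = add_salt_and_pepper(image, density=0.1)
```

Running the detector on copies produced at increasing `density` values gives the kind of side-by-side comparison described for Fig. 12.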
2) Self-Made Datasets Tested in This Model
The dataset used in this study is a self-made commodity dataset. The commodities commonly sold in intelligent retail containers are selected through big data analysis and divided into three categories: bagged, bottled, and canned. A few commodities are selected from these three categories, and the detection experiment is visualized.
The type of dataset used in this study is similar to the displayed sample data. There are many kinds of commodities, and the degree of occlusion is determined by the proportion of occlusion.
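The notion of an occlusion proportion can be made concrete with a small sketch. It measures the fraction of a commodity's bounding box that is covered by an occluder's box and maps it to a coarse level. The box-based approximation, the function names, and the thresholds `light_max` and `medium_max` are hypothetical; the exact labelling rule of the dataset is not given in the text.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def occlusion_ratio(object_box, occluder_box):
    """Fraction of the object's box that lies inside the occluder's box."""
    overlap = (max(object_box[0], occluder_box[0]), max(object_box[1], occluder_box[1]),
               min(object_box[2], occluder_box[2]), min(object_box[3], occluder_box[3]))
    area = box_area(object_box)
    return box_area(overlap) / area if area > 0 else 0.0


def occlusion_level(ratio, light_max=0.3, medium_max=0.6):
    """Map a ratio to a level; the two thresholds are illustrative assumptions."""
    if ratio < light_max:
        return "light"
    if ratio < medium_max:
        return "medium"
    return "heavy"


# A hand covering the right half of a bag of potato chips.
ratio = occlusion_ratio((0, 0, 10, 10), (5, 0, 15, 10))
level = occlusion_level(ratio)
```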
Twenty-one types of commodity datasets are used in this study. Twelve categories of commodities are presented in the homemade datasets and are divided into three categories. Using a free graph, it is proven that the algorithm is effective on self-made datasets.
3) Attention Mechanism Thermal Visualization
In this section, a visualization of the attention mechanism is presented, as shown in Fig. 14. This section further demonstrates that the YOLO model enhances the network's ability to extract the features of the commodities and focuses well on their effective feature areas.
Display of three categories of self-made datasets. We divide the datasets into three categories of commodities and select some of them for visualization and detection.
Attention visualization and comparison. First row: heatmaps under light occlusion. Second row: heatmaps under medium occlusion. Third row: heatmaps under heavy occlusion.
The graph shows the heatmaps and detection accuracy of YOLO-R at different occlusion levels, together with a comparison of the proposed model with and without the attention mechanism network. As can be observed from Fig. 14, the algorithm increases the attention on the unobscured part to make the feature extraction more effective. The detection accuracy decreases as the occlusion degree increases, while the strengthened attention mechanism network becomes more effective in the area of interest, where the color is deeper. Each test commodity is the same as the training set commodity, and the AP reported for each commodity is obtained from tests on excerpted video frames. The comparative analysis shows that the attention mechanism model in this study focuses more on the unobstructed features of commodities and detects them more accurately.
Conclusion and Future Work
To address large changes in the target scale, multiple occlusion cases, and limited detection accuracy in occluded commodity detection, a residual network combined with an attention module is proposed to enlarge the receptive field and strengthen the multiscale information fusion ability of the model, thus improving its detection accuracy. To address the insufficient feature fusion in YOLO-R and the mixing of multilayer features, the BiFPN feature pyramid is used in this approach. The feature pyramid uses a CARAFE structure for sampling, which enlarges the receptive field, fuses features at different scales of the network structure, and enhances the robustness of the features. For the processing module of the YOLO-R model, a Dsoft-IOU loss regression module that combines the location information and the characteristic entropy threshold is proposed to adaptively adjust the model's Dsoft-IOU, thus preventing the real detection box from being filtered out and improving the prediction accuracy of the model.
The method is tested on the homemade datasets with masked commodities. The results show that the improved YOLO-R, which is based on YOLO and SENet and combines occlusion detection with the attention mechanism, uses the eigenvalue to limit the threshold value, thereby strengthening the attention model. The mean average accuracy obtained using this method is higher than that obtained using YOLO, and the speed is also improved. The algorithm also achieves good detection results for commodities with different occlusion ratios. However, the proposed algorithm, like the traditional algorithm, has certain limitations. Compared with the original method, the noise immunity of the improved algorithm is significantly enhanced, but the commodity detection ability decreases significantly as the noise ratio increases. The algorithm in this study still suffers from low anti-interference ability against noise and low detection accuracy under such conditions. Therefore, future work will focus on optimizing the model network structure, improving the noise resistance and anti-interference ability of the model, enhancing its generalization, and improving the stability of model detection.
Author Contributions
An Xie conceived the algorithms of the paper and wrote the manuscript; Kai Xie reviewed the paper; Hao-Nan Dong and Kai Xie designed the experiments; Hao-Nan Dong conducted the comparative experiments and collected the data; and Jian-Biao He checked spelling and grammar and made suggestions.