Detection of Commodities Based on Multi-Feature Fusion and Attention Screening by Entropy Function Guidance


Abstract:

Although traditional convolutional neural networks (CNNs) have improved significantly for target detection, they cannot be fully applied to occluded objects in commodity detection. Therefore, we propose a target detection method based on an improved YOLOv5 model and an improved attention mechanism algorithm to solve the commodity occlusion problem. This method improves the traditional YOLO deep convolution network, features a more detailed BiFPN layer, and performs lightweight two-way feature fusion, where the multidimensional features of the commodities are convolved and fused, thus improving the overall detection speed and accuracy of the YOLO-R algorithm. Feature entropy is introduced to the attention channel to restrict the threshold value and obtain the global information of the occlusion target. The global information obtained is fused with a bidirectional feature pyramid layer to enhance the robustness of the features. Experiments show that the improved YOLO-R model detects occluded commodities accurately and quickly and achieves good results in objective evaluation. The average accuracy of commodity detection on the self-made product dataset is up to 97.80%, and the detection rate is 22.72 frames/s.
Published in: IEEE Access ( Volume: 11)
Page(s): 90595 - 90612
Date of Publication: 21 August 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Target detection is an important research subject in the field of computer vision [1]. It determines whether a video or image contains a target object and, if so, its regional coordinates and category information. Over the years, image processing and recognition have become mainstream target detection methods, and the accuracy and speed of target detection have improved significantly. Examples include the two-stage detector Faster R-CNN [2] and the single-stage detectors RetinaNet [3], YOLOv5 [4], YOLOv7 [6], [7], [8], [9], [10], [11], and YOLOv8 [12], which improve network detection accuracy by increasing network depth. At present, intelligent visual retail commodity detection technology requires both high precision and high speed. Among the single-stage detectors, YOLOv5 offers both high accuracy and high speed.

Currently, the methods for object occlusion detection and processing can be divided into methods based on constrained visible parts and components, methods based on optimizing the loss function, and combinations of multiple methods. Part-based methods use different parts to distinguish the location of the occlusion for effective feature fusion or to define the information of certain areas. Loss-based methods constrain the distance loss between the predicted and real frames so that the prediction moves closer to the real detection frame. Another approach starts from the occlusion dataset: it trains the detection model on a large number of generated occlusion samples and uses the attention mechanism to enhance the reliability of the network and improve the detection effect and robustness.

In the process of commodity detection, we first obtain the movement and angle information of the commodities. Then, features describing the proportion of unobstructed parts are extracted through the ResNet feature extraction network [13], [14], and the features of selected location areas of the commodities are combined with the feature entropy E_{i} , which assigns different weights to the occluded and visible parts of the commodities. Subsequently, a spatial attention feature map is generated from the dilated (hole) convolution in the first layer of the squeeze-and-excitation network [15] and the attention module of the second layer. The feature map under the original feature entropy, activated by spatial attention, is the input, and the effective features of the commodities are the output. Thus, the basic features of commodities or foreign objects are obtained. Feature entropy is introduced into the attention channel to limit the threshold and obtain global information on the occlusion target. Furthermore, the global information [16], [17] is convolved with the adaptive dilated convolution under global embedding, and the effective feature points of the target commodity are extracted. Based on the effective feature points, a commodity detection frame is constructed in the channel using the channel correlation of the attention mechanism [21]. Moreover, the influence of the attention mechanisms is examined. The detection frame is generated adaptively through K-means++ clustering, following which the distance measurement loss regression of the intersection over union (IOU) [18], [19], [20] is used to reduce the impact of the redundant features.
In addition to the characteristic attributes of the commodity itself, the information related to the position of the commodity, generated by optical flow difference extraction in the case of commodity occlusion, is imported into the scale-adaptive module. According to the occlusion ratio, the attenuation weight method is introduced to screen high-quality positive samples for inclusion in model training, effectively improving the detection frame size under occlusion. Based on the reconciliation of the region proposal network (RPN) [22], [23], the detection confidence is obtained to compare the confidence weights and to generate appropriate and accurate detection frames.

  1. Bi-directional feature pyramid network feature fusion

    The earliest detectors extract high-level features from the backbone for direct prediction. This structure does not include feature fusion, which results in low accuracy. Subsequently, the feature pyramid network (FPN) [24], which is based on the concept of feature fusion, was proposed. First, a top-down path is established for feature fusion. The fused feature maps are then combined with higher-level features to obtain semantic information, which is used to improve the commodity detection accuracy. However, this top-down FPN is limited by its one-way information flow. Therefore, PANet was proposed; it adds a bottom-up path on top of the FPN and uses the strong location information of the low-level feature maps to fuse features, thus improving detection accuracy. However, PANet extracts features from lower to higher levels, which can easily lead to missing feature extractions. The BiFPN used in this study stacks simple feature maps and feeds feature maps of different resolutions into the same node, thereby fusing more features without increasing the parameters.

  2. Attention mechanism

    In deep learning, the attention mechanism is inspired by the human visual processing mechanism, whereby people focus on areas of interest. Algorithms that suppress interfering background information so that the region of interest can be located quickly are an important research field in target detection. The purpose of introducing an attention mechanism is therefore to eliminate redundant information and extract effective information about commodities, thereby improving the network performance.

  3. Non-maximum suppression (NMS) based on the soft-IOU algorithm

    The traditional NMS algorithm [25] uses greedy clustering with a fixed distance threshold. It selects the highest-scoring detection results and deletes neighboring results whose overlap exceeds the threshold, thus balancing precision and recall. However, if the prediction box is not aligned with the real box, the IOU cannot accurately reflect their coincidence and therefore cannot filter the detection boxes effectively. Therefore, a Dsoft-IOU algorithm is used to improve the NMS in this study. The algorithm uses a new intersection-over-union formula and optimizes the penalty function to take the center distance of the prediction boxes into account. Overlapping prediction boxes have their confidence reduced instead of being deleted, which prevents the removal of valid preselection boxes, improves the recall rate, and enhances the prediction ability of the model.
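The idea of lowering confidence instead of deleting boxes can be sketched as follows. This is only an illustration under stated assumptions: the exact Dsoft-IOU formula is not given in this section, so the sketch uses a Gaussian score decay (as in the generic Soft-NMS scheme) and subtracts a normalized center-distance term from the IOU. The function names, the parameter `sigma`, and the threshold `score_thresh` are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IOU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def center_term(box, boxes):
    """Assumed distance term: squared center distance divided by the
    squared diagonal of the smallest box enclosing both boxes."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    cxs, cys = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    d2 = (cxs - cx) ** 2 + (cys - cy) ** 2
    ew = np.maximum(box[2], boxes[:, 2]) - np.minimum(box[0], boxes[:, 0])
    eh = np.maximum(box[3], boxes[:, 3]) - np.minimum(box[1], boxes[:, 1])
    return d2 / (ew ** 2 + eh ** 2)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Greedy selection that decays the scores of overlapping boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    idx = np.arange(len(scores))
    keep = []
    while idx.size > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size == 0:
            break
        overlap = iou(boxes[best], boxes[idx]) - center_term(boxes[best], boxes[idx])
        overlap = np.clip(overlap, 0.0, None)
        scores[idx] *= np.exp(-overlap ** 2 / sigma)  # decay, do not delete
        idx = idx[scores[idx] > score_thresh]
    return keep, scores

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
keep, new_scores = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this toy input the second box overlaps the first heavily, so its score is reduced and it is ranked last, but it is not removed; a hard-threshold NMS would have discarded it.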

The structure of this article is as follows: Section II introduces the overall algorithm and related theory. Section III introduces the details of the experiment, including the experimental platform, experimental dataset, comparative experiment, ablation experiment, visualization of experimental data and results, experimental results, and analysis. Section IV summarizes the proposed algorithm.

SECTION II.

Related Work

In traditional smart retail containers, the most common technologies are gravity sensing, radio-frequency identification (RFID) [26], face recognition, and two-dimensional codes. With the heightened pursuit of improved consumer experience, RFID and gravity-sensing technologies [27] are no longer in line with the current trend. Meanwhile, real-time target detection of occluded commodities [28], [29] is affected by objective factors such as illumination, motion blur [30], environmental impact, and morphological changes, which reduce the accuracy and stability of target detection. In the field of visual detection, occlusions have a significant impact on scene reconstruction, object recognition [31], behavior recognition [32], [33], target tracking, stereo matching, visual measurement, and other visual tasks. Thus, the occlusion issue has become separated from related visual tasks over time and has been studied extensively by both domestic and foreign researchers. Therefore, reducing missed and false detections and increasing the detection accuracy of occluded targets has become a research hotspot in the field of commodity target detection.

In this study, a new feature extraction algorithm that combines global information and an adaptive matching algorithm with a feature entropy mechanism is presented for detecting occlusions and foreign bodies in smart retail containers. The algorithm applies a feature entropy mechanism and adds an attention module [34], [35] to the uncovered part to extract as many effective commodity features as possible. To address illumination change, color similarity between the background and the target, and the difficulty of feature extraction under occlusion, C. Xiu proposed a method combining the Camshift algorithm [36], [37] with a Kalman filter to improve the robustness of the algorithm and ensure its real-time performance. Liang et al. [38] proposed DetectFormer, a category-assisted object detector based on the Transformer, which uses data augmentation to improve the backbone network and uses the attention mechanism to extract channel, spatial, and direction information. Liang et al. [39] proposed an improved sparse R-CNN, which integrates the attention module with ResNeSt to construct a feature pyramid and modifies the backbone network to extract more important effective features.

The influence of different occlusion levels on detection performance is quantified by analyzing commodity detection performance under different occlusion ratios. Based on this analysis, high-quality positive samples are selected for model training by introducing attenuation weights according to the occlusion ratio, which effectively improves detection performance under occlusion and locates the uncovered areas. This method uses VGG to convolve the commodity characteristics. To obtain information on the commodity characteristics, a feature entropy threshold is introduced to determine the dilated (hole) convolution in SENet [15]; the weight given by the feature entropy threshold [40] is set, imported into the attention channel, and finally passed to the pooling layer to obtain effective features. By calculating the position offset of the detection target over time, the IOU loss regression between the predicted box and the real box of the occluded target is improved, the redundancy of detection boxes is reduced, and the appropriate size of the detection box is determined under non-maximum suppression (NMS) [41]. This significantly improves the accuracy of feature extraction for occlusion detection, enhances detection robustness, and constructs an appropriate detection framework.

In this study, the structure of the target detection algorithm is modified. (1) An attention enhancement algorithm based on semantic segmentation is proposed to recover the high-frequency information that is missing in occlusion detection, to extract commodity features in the unobstructed region given by semantic segmentation, and to provide location information for video frame localization. (2) In the detection module, the designed YOLO-R algorithm is deployed, which adds a residual module to enhance the residual characteristics of the convolution network. The algorithm improves the NMS algorithm and adopts the BiFPN feature pyramid. In feature extraction, it enhances the detailed information, fuses the features of the different scales of the network, enhances the robustness of the features, and improves the model accuracy. In YOLO-R, the SENet channel attention mechanism is added, which makes the network model pay more attention to the unobstructed areas of commodities and extract effective information on commodities. Finally, in the post-processing stage, the NMS model is adjusted automatically using the Dsoft-IOU with the location information of the threshold set by the feature entropy function, which prevents the true frame from being filtered out and enhances the generalization of the model.

SECTION III.

Method

In a related study, a new YOLO-R algorithm is designed. The YOLO-R algorithm uses the BiFPN feature pyramid for feature fusion. In the upsampling stage, the lightweight CARAFE operator is used to enrich the sampled features. In this study, the SENet attention mechanism is added to YOLO-R, and the attention mechanism threshold is set by feature entropy. YOLO-R uses two CSP structures designed to reduce inference cost and computing power: CSP blocks for the backbone network and CSP_PPM mainly for the neck network. In post-processing, the NMS algorithm is redesigned based on the Dsoft-IOU. The penalty function threshold is increased, and samples are filtered out by Dsoft-IOU distance loss regression to reduce the impact of redundant features. The algorithm used in this study is shown in Fig. 1.

FIGURE 1. - Overview of the proposed architecture. (a) We first perform initial feature extraction convolution based on U-Net to get the basic features and fuse the basic features to get the semantic segmentation results. (b) The semantic segmentation features and also the commodity unmasked features are convolutionally pooled, in which the features are sampled to obtain the commodity effective features, the detection framework is built according to the effective features, and the suitable detection frame is filtered using non-maximal suppression.

A. Combining ECA Semantic Feature Extraction

The improved FCN used in this study adopts U-Net as the feature extractor and adds the ECA attention mechanism [33] to the U-Net sampling path to speed up model estimation. The U-Net semantic segmentation module is improved with a channel interaction strategy module (ECA module). The module uses a one-dimensional convolution, which effectively reduces the number of parameters and the computing cost, thus improving performance. The module adds only a few parameters and avoids the effect of dimension-reduction convolution on the channel attention mechanism.

The selected ECA module optimizes the computational performance and model complexity. First, the features are aggregated by global average pooling to obtain the global channel information. The global average pooling operation is as follows:\begin{equation*} y=\frac {1}{H\times W}\sum \nolimits _{a=1}^{H} \sum \nolimits _{b=1}^{W} {x_{i}(a,b)} \tag{1}\end{equation*}


In the formula, x_{i}(a,b) represents the value at position (a,b) of the i -th feature map with input size H\times W , and y represents the global average pooling of feature x .

Second, the channel dimension C is used to adaptively calculate the number k of neighboring channels that share weights. The adaptive function is as follows:\begin{equation*} k=\varphi (C)=\left |{ \frac {{\mathrm {log}}_{2}(C)}{\gamma }+\frac {b}{\gamma } }\right | \tag{2}\end{equation*}


In the formula, C is the channel dimension, and b and \gamma are constants, where b =1 and \gamma =2. The logarithm of C , scaled by \gamma , is used to obtain the channel sharing weight k .

The extraction performance is improved through multichannel shared weight information interaction as follows:\begin{equation*} w=\sigma (C1D_{k}(y)) \tag{3}\end{equation*}


Here, C1D_{k} represents a one-dimensional convolution with kernel size k , and \sigma represents a sigmoid function. The amount of information involved is relatively small, which helps reduce the model's complexity. Therefore, this method of channel attention information interaction preserves both effectiveness and model efficiency, which enables efficient and fast extraction of commodity features.
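Equations (1)-(3) can be illustrated with a minimal NumPy sketch. It assumes a single feature tensor of shape (C, H, W) and a given 1-D kernel; in a trained network the kernel would be learned. Rounding k up to the nearest odd number follows the original ECA-Net design and is an assumption here, since Equation (2) above shows only the absolute value. The function names are hypothetical.

```python
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Eq. (2): kernel size from the channel dimension C.
    Assumption: the result is rounded up to an odd integer."""
    t = int(abs(np.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

def eca_attention(x, kernel):
    """x: (C, H, W) feature tensor; kernel: 1-D array of length k."""
    c = x.shape[0]
    y = x.mean(axis=(1, 2))                      # Eq. (1): global average pooling
    k = kernel.size
    y_pad = np.pad(y, k // 2)                    # zero padding keeps length C
    # 1-D cross-correlation across neighboring channels (shared weights)
    conv = np.array([np.dot(y_pad[i:i + k], kernel) for i in range(c)])
    w = 1.0 / (1.0 + np.exp(-conv))              # Eq. (3): sigmoid
    return x * w[:, None, None], w               # reweight each channel

x = np.arange(3, dtype=float)[:, None, None] * np.ones((3, 2, 2))
out, w = eca_attention(x, np.array([0.0, 1.0, 0.0]))
```

With the identity-like kernel used above, each channel weight is simply the sigmoid of that channel's mean, which makes the data flow easy to check by hand.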

B. Feature Extraction

1) Characteristic Entropy Weight Distribution

Local feature points are extracted using the YOLO-R residual module, a convolutional auto-encoding channel is added for the low-dimensional pixel points, and global features are exploited to enhance the detection characteristics. The adaptive weight redistribution algorithm originates from the similarity principle of gestalt grouping. The degree of color similarity is used as an influencing factor for weighting, and the influencing factor is determined by the Euclidean distance. In this study, a fully connected layer F\in R^{C_{F}\times C_{D}} is used to integrate it into the model architecture, and its learned bias is b_{F}\in R^{C_{F}} . These two parts are combined into a global feature g\in R^{C_{F}} that summarizes the discriminative content of the entire image:\begin{equation*} g=F\times \left ({\frac {1}{H_{D}W_{D}}\sum \nolimits _{h,w} d_{h,w}^{p} }\right)^{1/p}+b_{F} \tag{4}\end{equation*}


Here, p is the hyperparameter of the (generalized) average pooling, and d_{h,w}^{p} is the feature at position (h,w) of the feature map, raised to the power p .
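A minimal sketch of Equation (4), assuming a non-negative feature map d of shape (H_D, W_D, C_D) and treating the exponent as an element-wise power. The default p = 3 is a hypothetical choice for illustration, as the section does not state a value.

```python
import numpy as np

def global_feature(d, F, b_F, p=3.0):
    """Eq. (4): pooled descriptor followed by a fully connected layer.
    d: (H_D, W_D, C_D) non-negative feature map
    F: (C_F, C_D) weight matrix, b_F: (C_F,) bias."""
    pooled = np.mean(d ** p, axis=(0, 1)) ** (1.0 / p)  # one value per channel
    return F @ pooled + b_F

d = np.full((3, 3, 4), 2.0)
g = global_feature(d, np.ones((2, 4)), np.ones(2))
```

Setting p = 1 reduces the pooling step to ordinary global average pooling, while larger p gives more weight to strong activations.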

Assuming the color features extracted by the keyframe are F_{0},F_{1},\cdots F_{n} , the features fused are \begin{equation*} F_{n}=H_{Concat}(B^{1}_{n},B^{2}_{n}\mathrm {,\cdots }B^{i}_{n}) \tag{5}\end{equation*}


Therefore, the fused features are \left \{{ F_{1},F_{2},\cdots,F_{n} }\right \} , and the characteristic ratio of each feature is \begin{equation*} p(F_{i})=\frac {F_{i}}{\sum \nolimits _{j=1}^{n} F_{j} } \tag{6}\end{equation*}


The corresponding characteristic entropy is \begin{equation*} E_{i}=-\frac {2}{n+1}\sum \nolimits _{i=1}^{n} {p(F_{i})\ln p(F_{i})} \tag{7}\end{equation*}

where H_{Concat}(B_{n}^{1},B_{n}^{2},B_{n}^{3},\ldots ,B_{n}^{i}) is the feature concatenation of Equation (5), p(F_{i}) is the characteristic ratio, and E_{i} is the characteristic entropy used in this study. The E_{i} value is directly proportional to the effective characteristic quantity of commodities.

First, according to the feature weight set in this study \left \{{ \omega _{1},\omega _{2},\cdots,\omega _{n} }\right \} , the following formula is obtained, where \tau is the entropy threshold:\begin{align*} \omega _{i}=\begin{cases} \displaystyle \frac {E_{i}}{\sum \nolimits _{j=1}^{n} E_{j} },&E_{i}>\tau \\ \displaystyle 0,&\text {otherwise}\end{cases} \tag{8}\end{align*}


According to the weights, IOUs below the set threshold can be filtered out, the resulting attention weights can be introduced into SENet, and the features can be filtered to obtain the effective characteristics of the commodity.
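The ratio, entropy, and thresholded weight of Equations (6)-(8) can be sketched as below. The sketch assumes each feature is summarized by a non-negative scalar response, which is a simplification; the function names and the example values are hypothetical.

```python
import numpy as np

def feature_ratio(f):
    """Eq. (6): share of each feature response in the total."""
    f = np.asarray(f, dtype=float)
    return f / f.sum()

def characteristic_entropy(f):
    """Eq. (7): entropy of the ratios, scaled by 2 / (n + 1)."""
    p = feature_ratio(f)
    nz = p[p > 0]                         # treat 0 * ln(0) as 0
    return -2.0 / (p.size + 1) * np.sum(nz * np.log(nz))

def entropy_weights(E, tau):
    """Eq. (8): normalized entropy if above the threshold tau, else 0."""
    E = np.asarray(E, dtype=float)
    return np.where(E > tau, E / E.sum(), 0.0)

weights = entropy_weights([1.0, 2.0, 3.0, 4.0], tau=2.5)
uniform_entropy = characteristic_entropy([1.0, 1.0, 1.0, 1.0])
```

Note that, following Equation (8) as written, the denominator sums over all entropies, so the surviving weights need not add up to 1 once some entries are zeroed.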

2) YOLO-R Multi-Cascade Convolution Feature Extraction

Commodity detection is based on the target position and semantic information of the video frame. In this study, YOLOv5 is used with a cascade for feature extraction according to the characteristics of occluded commodities [43]. ResNet residual features are added to YOLOv5, which yields the YOLO-R algorithm. YOLO-R learns the feature relationships between layers, enlarges the receptive field of the network, completes the convolutional fusion of high- and low-resolution features, compensates for the loss of high-resolution semantic information, enriches edge information, and improves the detection accuracy of occluded targets.

Numerous residual module area features are used in the YOLO-R algorithm employed in this study. Therefore, a simple feature module is used instead of the representation algorithm module structure. In the residual extraction module, the backward-propagating module propagates from forward to backward, and finally selects the last layer in the feature block, as shown in Fig. 2.

FIGURE 2. - YOLO-R network structure.

The YOLO-R residual convolution is a ResNet deep residual network that can effectively improve the performance of feature extraction detection. This module is a framework that stacks blocks with the same connection shape. The blocks used in this study are also known as residual units. The residual unit is computed as follows:\begin{align*} y_{n}&=h\left ({x_{n} }\right)+\mathcal {F}(x_{n},W_{n}) \tag{9}\\ x_{n+1}&=f(y_{n}) \tag{10}\end{align*}


In this formula, \mathcal {F} is a residual convolution function, for which 5\times 5 convolution stacks are commonly used. x_{n} represents the input feature of the n -th residual unit. Function f is the operation applied after the addition, here the ReLU activation function. Function h is an identity mapping, h\left ({x_{n} }\right)=x_{n} . The residual units are shown in Fig. 3.

FIGURE 3. - Residual element structure diagram.

Because h is an identity mapping, and if f is also treated as an identity mapping so that x_{n+1}=y_{n} , the following can be obtained:\begin{equation*} x_{n+1}=x_{n}+\mathcal {F}(x_{n},W_{n}) \tag{11}\end{equation*}

Applying this relation recursively gives \begin{align*} x_{n+2}&=x_{n+1}+\mathcal {F}\left ({x_{n+1},W_{n+1} }\right)=x_{n}+\mathcal {F}\left ({x_{n},W_{n} }\right) \\ &\quad +\mathcal {F}(x_{n+1},W_{n+1}) \tag{12}\end{align*}

For a unit L of any depth and a unit l of any shallower layer, the following is obtained:\begin{equation*} x_{L}=x_{l}+\sum \nolimits _{i=l}^{L-1} \mathcal {F}(x_{i},W_{i}) \tag{13}\end{equation*}
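The recursion above can be checked numerically with a small sketch. It assumes identity mappings for both h and f, and uses a linear map followed by ReLU as a hypothetical stand-in for the residual branch \mathcal{F}; the paper's actual branch is a convolution stack.

```python
import numpy as np

def residual_branch(x, w):
    """Hypothetical stand-in for F(x, W): linear map followed by ReLU."""
    return np.maximum(w @ x, 0.0)

def forward(x, weights):
    """Stack residual units x_{n+1} = x_n + F(x_n, W_n); return all states."""
    states = [x]
    for w in weights:
        x = x + residual_branch(x, w)
        states.append(x)
    return states

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)) for _ in range(5)]
states = forward(rng.normal(size=4), weights)

# The deep state equals a shallow state plus the accumulated residuals.
l, L = 1, 4
unrolled = states[l] + sum(residual_branch(states[i], weights[i]) for i in range(l, L))
```

The check shows why the shallow feature x_l reaches any deeper unit unchanged, with only additive corrections on top, which is the property that eases training of deep stacks.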


Therefore, features are extracted from the low-resolution image I_{LR} , and the shallow feature extracted by the YOLO-R residual network is F_{0} , given by \begin{equation*} F_{0}=H_{Conv}(I_{LR}) \tag{14}\end{equation*}

where {H}_{Conv}\left ({I_{LR} }\right) represents the YOLO-R residual convolution feature extraction, which extracts shallow semantic features and transfers them to a deeper feature extraction module to extract deeper occluded commodity features.

The image features are then added pixel-by-pixel before being fused with the BiFPN to obtain an effective feature map of the commodity.\begin{align*} B^{i}_{n}&=B^{i-1}_{n}+H_{Concat}(F^{i,1}_{n},F^{i,1}_{n} \\ &\quad +F^{i,2}_{n},F^{i,1}_{n}+F^{i,2}_{n}+F^{i,3}_{n}) \tag{15}\end{align*}

where B^{i}_{n} is the deep feature map output of the i -th mixed YOLO-R residual convolution block in the n -th multi-scale feature extraction module, and F^{i,1}_{n},F^{i,2}_{n},F^{i,3}_{n} represent the distance feature outputs of the three different scale modules.

3) Multiscale Feature Fusion

With the deepening of the YOLO-R network level of the algorithm in this study, the commodity features are convoluted from low to high dimensions. However, with the deepening of the feature extraction in each layer of the YOLO-R extraction network, a few features are missing. Therefore, it is necessary to fuse features at different levels to enhance the feature semantics. A lightweight BiFPN is used for feature fusion, and the YOLO-R backbone network is used for bidirectional feature fusion at different scales.

A new BiFPN, where P_{3}\sim P_{7} denote five input features, is adopted in this study. Within the BiFPN, the algorithm applies lightweight (CARAFE) up-sampling, down-sampling, and superposition to output the five extracted features of a single channel. CARAFE up-sampling consists of an up-sampling kernel prediction module and a feature reorganization module: the first predicts the sampling kernel, and the second uses it to complete the up-sampling. According to the requirements of this study, the algorithm regards a BiFPN as a bidirectional stackable network structure for feature superposition to effectively enhance the feature fusion information of the occluded commodities. The structure is shown in Fig. 4.

FIGURE 4. - Structure diagram of BiFPN.

The size of the feature map output by YOLO-R in this study is 480\times 640 , and the five feature layers input in the BiFPN network are P_{3}^{IN}=(160,160,128) , P_{4}^{IN}=(80,80,256) , P_{5}^{IN}=(40,40,142) , P_{6}^{IN}=(20,20,1024) , P_{7}^{IN}=(10,10,2048) . The BiFPN assigns new weights w_{i} to the input features and then applies fast linear normalization to the weights from P_{3} to P_{5} . The input formula for the fusion node is given by Equations (16)–(19):\begin{align*} P_{4}^{TD}&=Conv\left({\frac {w_{1}\cdot P_{4}^{IN}+w_{2}\cdot Resize\left ({P_{5}^{IN} }\right)}{w_{1}+w_{2}+\varepsilon }}\right) \tag{16}\\ P_{3}^{OUT}&=Conv\left({\frac {w_{3}\cdot P_{3}^{IN}+w_{4}\cdot Resize\left ({P_{4}^{TD} }\right)}{w_{3}+w_{4}+\varepsilon }}\right) \tag{17}\\ P_{4}^{OUT}&=Conv\left({\frac {w_{5}\cdot P_{4}^{IN}+w_{6}\cdot P_{4}^{TD}+w_{7}\cdot Resize\left ({P_{3}^{OUT} }\right)}{w_{5}+w_{6}+w_{7}+\varepsilon }}\right) \tag{18}\\ P_{5}^{OUT}&=Conv\left({\frac {w_{8}\cdot P_{5}^{IN}+w_{9}\cdot Resize\left ({P_{4}^{OUT} }\right)}{w_{8}+w_{9}+\varepsilon }}\right) \tag{19}\end{align*}

where Conv refers to the convolution operation, and Resize refers to the upsampling (CARAFE) or downsampling operation on the input features. A lightweight operator (CARAFE) is introduced for upsampling to replace nearest-neighbor interpolation. The lightweight operator has low redundancy, strong feature fusion ability, and fast speed. w_{i} is the characteristic entropy weight, and \varepsilon =0.0001 is the stability coefficient.

In the fusion network, different feature entropy weights w_{i} are assigned to the input feature map such that the network can constantly adjust and determine each output feature. A fast normalization method is adopted.\begin{equation*} \mathrm {Out=}\sum \nolimits _{i} {\frac {w_{i}}{\mathrm {\varepsilon +}\sum \nolimits _{j} w_{j}}\times In_{i}} \tag{20}\end{equation*}

where Out and In_{i} are the output and input features, respectively.

According to the formula, each normalized feature entropy weight satisfies w_{i}\in (0,1) . Because the BiFPN does not use the softmax operation, the efficiency is improved to a certain extent.
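The fast normalization of Equation (20) can be sketched in a few lines. The sketch assumes the inputs have already been resized to a common shape and clamps the weights to be non-negative, as in the original BiFPN design; the function name and example values are hypothetical.

```python
import numpy as np

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Eq. (20): weighted sum with weights divided by (eps + their sum).
    inputs: list of arrays with identical shape; weights: one scalar each."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # keep weights >= 0
    norm = w / (eps + w.sum())
    fused = sum(wi * x for wi, x in zip(norm, inputs))
    return fused, norm

p_in = [np.full((2, 2), 2.0), np.full((2, 2), 4.0)]
fused, norm = fast_normalized_fusion(p_in, [1.0, 3.0])
```

Compared with a softmax over the weights, this needs only one division per weight and no exponentials, which is where the efficiency gain mentioned above comes from.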

4) SENet Attention Allocation Combined with Characteristic Entropy

In SENet, the commodity feature information is entered into the channel attention mechanism through stacked clustering feature layers. The feature map under the restriction of information entropy is beneficial for filtering out the effective features and learning the weights of each layer automatically. In the SENet module, a 5\times 5 Gaussian convolution kernel is used to enlarge the receptive field, a 1\times 1 convolution layer reduces the dimension, and a sigmoid activation function finally enhances the nonlinear characteristics of commodities to obtain the feature map M_{s}\in R^{H\times W\times 1} , as follows:\begin{equation*} M_{s}=\sigma \left \{{f_{Conv}^{\mathrm {1\times 1}}\left \{{ f_{Conv}^{\mathrm {5\times 5}}\left [{ f_{Conv}^{\mathrm {7\times 7}}\left ({F_{i}^{\prime} }\right) }\right] }\right \} }\right \} \tag{21}\end{equation*}

where \sigma is a sigmoid function, f_{Conv}^{1\times 1} , f_{Conv}^{5\times 5} , and f_{Conv}^{7\times 7} represent the convolution layers, and F_{i}^{\prime } is the attention output feature map of the SENet module.

The squeeze operation in SENet enlarges the global receptive field of a commodity, obtains the spatial feature information through maximum pooling, and uses a convolution kernel to raise the dimension to obtain a spatial attention feature map M_{s}\in R^{H\times W\times 1} . The formula used is as follows:\begin{equation*} M_{s}=\sigma \left \{{f_{Conv}^{\mathrm {3\times 3}}\left [{ MaxPooling\left ({F^{\prime} }\right) }\right] }\right \} \tag{22}\end{equation*}

where MaxPooling denotes max pooling and f_{Conv}^{3\times 3} is a 3\times 3 convolution layer.

Finally, the obtained spatial attention feature map M_{s} is used to activate the feature map F_{i}^{\prime } produced under the feature-entropy constraint. Multiplying the attention feature map by the original feature map reduces the parameter quantity, and the valid feature map F^{''} of the commodity is filtered through the linearly normalized weights:\begin{equation*} F^{\mathrm {''}}=M_{s}\odot F_{i}^{\prime} \tag{23}\end{equation*}

where \odot denotes the element-wise product of the feature maps.
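The gating in Eqs. (22) and (23) can be illustrated with a minimal pure-Python sketch. It assumes that MaxPooling is taken over the channel axis (so that the result has shape H\times W\times 1 ) and that the 3\times 3 convolution uses zero padding; the function names and the example kernel are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function used as sigma in Eq. (22)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_max(feature):
    """Collapse an H x W x C map to H x W by a max over channels (assumed pooling axis)."""
    return [[max(pixel) for pixel in row] for row in feature]


def conv3x3_same(plane, kernel):
    """3x3 cross-correlation with zero padding, so the H x W size is preserved."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * plane[y][x]
            out[i][j] = acc
    return out


def spatial_attention(feature, kernel):
    """Return (M_s, F'') where M_s = sigmoid(conv3x3(maxpool(F'))) and F'' = M_s * F'."""
    pooled = channel_max(feature)
    mask = [[sigmoid(v) for v in row] for row in conv3x3_same(pooled, kernel)]
    gated = [
        [[mask[i][j] * value for value in feature[i][j]] for j in range(len(feature[0]))]
        for i in range(len(feature))
    ]
    return mask, gated
```

Every entry of the mask lies strictly between 0 and 1, so the element-wise product only attenuates features; it never changes their sign or amplifies them.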

The attention model is introduced into the residual network ResNet. The features are first squeezed to compress the feature dimension; a ReLU activation function is then added to the fully connected layer to complete the construction of the attention channel; the feature weight of each channel is obtained through the sigmoid function; and finally the original channels are reweighted with these weights to output the effective commodity features. The improved attention mechanism in this study has few parameters and is easy to embed, so it can be quickly integrated into the YOLO-R residual network. The structure is shown in Fig. 5.
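The squeeze, excitation, and reweighting steps described above can be sketched as follows. This is a generic squeeze-and-excitation block written for illustration, assuming global average pooling for the squeeze step and bias-free fully connected layers; the weight matrices `w_reduce` and `w_expand` are hypothetical placeholders, and the entropy-based threshold of the paper is not modeled.

```python
import math


def se_channel_attention(feature, w_reduce, w_expand):
    """Reweight the channels of an H x W x C feature map.

    w_reduce: list of rows, each of length C (first fully connected layer).
    w_expand: list of C rows, each as long as the hidden layer (second layer).
    """
    h, w, c = len(feature), len(feature[0]), len(feature[0][0])

    # Squeeze: one scalar per channel (global average pooling is an assumption).
    squeezed = [
        sum(feature[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]

    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(wt * s for wt, s in zip(row, squeezed))) for row in w_reduce
    ]

    # Excitation, step 2: fully connected layer followed by sigmoid.
    weights = [
        1.0 / (1.0 + math.exp(-sum(wt * v for wt, v in zip(row, hidden))))
        for row in w_expand
    ]

    # Scale: multiply every channel by its learned weight.
    scaled = [
        [[feature[i][j][k] * weights[k] for k in range(c)] for j in range(w)]
        for i in range(h)
    ]
    return weights, scaled
```

The block adds only two small weight matrices, which is consistent with the small parameter count noted in the text.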

FIGURE 5. - Squeeze-and-Excitation Module.

C. Inspection Frame Construction

1) Strict Decision RPN

The core of the RPN [22], [23] is the anchor. Anchors are used to generate the predicted detection boxes and to select the locations of the subsequent detection boxes. In this study, the K-means clustering algorithm is used to adaptively generate the anchor parameters to improve the clustering effect.

Commodity detection is evaluated primarily by the overlap between the predicted and real detection boxes, that is, by the intersection over union (IOU) referenced in this study. The IOU measures the similarity between two boxes according to their degree of overlap. In this study, a new measure based on the Dsoft-IOU is used. Its formula is as follows:\begin{align*} IOU(a,b)&=\frac {\vert a\mathrm {\cap }b\vert }{\vert a\cup b\vert } \tag{24}\\ d_{i}&=\beta \sqrt {1-IOU(b_{bbx},c_{cluster,i})} \tag{25}\end{align*}

where a\cup b is the union area of boxes a and b , and a\cap b is the area where the two boxes intersect; d_{i} is the distance between the bounding box and the i -th cluster center, b_{bbx} is the bounding box, c_{cluster,i} is the i -th cluster center, and \beta is a coefficient. This distance measure amplifies the influence of the IOU and mitigates the influence of variations in the Euclidean distance, which produces more appropriate clustering results, as shown in Fig. 6.
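As a minimal illustration of Eqs. (24) and (25), the sketch below computes the IOU of two axis-aligned boxes and the resulting clustering distance, then assigns a box to its nearest cluster center as one K-means step would. Boxes are assumed to be `(x1, y1, x2, y2)` tuples; for anchor clustering they can all be placed at the origin as `(0, 0, w, h)`. The function names and the default `beta=1.0` are assumptions made for the example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2), Eq. (24)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cluster_distance(box, center, beta=1.0):
    """d_i = beta * sqrt(1 - IOU(box, center)), Eq. (25)."""
    return beta * math.sqrt(max(0.0, 1.0 - iou(box, center)))


def nearest_cluster(box, centers, beta=1.0):
    """Index of the cluster center with the smallest distance to the box."""
    distances = [cluster_distance(box, c, beta) for c in centers]
    return distances.index(min(distances))
```

Because the distance depends only on overlap, a large box and a small box with the same shape mismatch are penalized equally, which a plain Euclidean distance on width and height would not do.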

FIGURE 6. - Schematic diagram of anchor construction.

Second, a loss function is combined with the IOU in this study to reduce the impact of redundant features, which gives:\begin{equation*} L_{IOU}\mathrm {=1-}IOU(a,b)+\frac {d_{i}^{2}(a,b)}{c^{2}(a,b)}+\alpha \beta \tag{26}\end{equation*}

where IOU is the intersection over union of the predicted and real boxes, \alpha is the similarity factor that measures the aspect ratio, and c^{2}(a,b) is the square of the diagonal length of the minimum rectangle enclosing both boxes.\begin{align*} \beta &=\frac {\alpha }{(1-IOU(a,b))+\alpha } \tag{27}\\ \alpha &=\frac {4}{\pi ^{2}}\left({\arctan \frac {w^{gt}}{h^{gt}}-\arctan \frac {w}{h}}\right)^{2} \tag{28}\end{align*}
where w^{gt} and h^{gt} are the width and height of the true box, respectively, and w and h are the width and height of the predicted box, respectively. This process is illustrated in Fig. 7.
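A minimal sketch of Eqs. (26) to (28) follows. It assumes that the distance term in Eq. (26) is the Euclidean distance between the two box centers and that c is the diagonal of the smallest rectangle enclosing both boxes, as in CIoU-style losses; both readings are assumptions, as are the function names. Boxes are `(x1, y1, x2, y2)` tuples with positive width and height.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_loss(pred, gt):
    """L = 1 - IOU + d^2 / c^2 + alpha * beta, following Eqs. (26)-(28)."""
    overlap = iou(pred, gt)

    # Squared distance between box centers (assumed meaning of d).
    d2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 + (
        (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    ) ** 2

    # Squared diagonal of the smallest enclosing rectangle (assumed meaning of c).
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 + (
        max(pred[3], gt[3]) - min(pred[1], gt[1])
    ) ** 2

    # Aspect-ratio similarity factor, Eq. (28).
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    alpha = 4.0 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2

    # Trade-off coefficient, Eq. (27); defined as 0 when the boxes coincide.
    denom = (1.0 - overlap) + alpha
    beta = alpha / denom if denom > 0 else 0.0

    return 1.0 - overlap + d2 / c2 + alpha * beta
```

For two identical boxes every term vanishes and the loss is 0; the center-distance term keeps the loss informative even when the boxes do not overlap at all.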

FIGURE 7. - Structure diagram of decision RPN. The pre-designed box is convolved and pooled to obtain the center area of the final detection box.

2) Nonmaximal Suppression Based on Scale Estimation

The NMS algorithm is a post-processing algorithm that deletes the redundant prediction boxes of a network. Each box is scored according to its confidence level. The boxes are then sorted by confidence score, and the IOU between each remaining box and the highest-scoring prediction box is compared with the IOU threshold. If the IOU exceeds the threshold, the prediction box is deleted, whereas the prediction box with the maximum confidence score is retained. This is repeated until all the candidate boxes are processed.

In this study, the DSoft-NMS algorithm is used. The algorithm combines the IOU between the preselected box and the optimal detection box with the Euclidean distance between them: the more a prediction box overlaps the highest-confidence detection box, the more its own confidence is reduced. The score update derived from the IOU is as follows:\begin{align*} S_{i}=\begin{cases} \displaystyle S_{i}, & \displaystyle IOU(M,b_{i})-\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)} < w_{i} \\ \displaystyle S_{i}\left [{1-IOU(M,b_{i})+\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)}}\right], & \displaystyle IOU(M,b_{i})-\frac {d^{2}(b_{i},M)}{c^{2}(b_{i},M)}\ge w_{i} \end{cases} \tag{29}\end{align*}

where d(b_{i}, M) is the Euclidean distance between the center points of the prediction box b_{i} and the optimal detection box M ; c(b_{i}, M) is the diagonal length of the minimum rectangle enclosing the two boxes; and w_{i} is the threshold.
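The score update of Eq. (29) can be embedded in a soft-NMS loop, as in the sketch below. The loop structure, the `score_floor` cut-off used to discard boxes whose score has decayed to almost zero, and all function names are assumptions made for illustration; only the decay rule itself follows Eq. (29). Boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def distance_ratio(a, b):
    """d^2 / c^2: squared center distance over squared enclosing-box diagonal."""
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + (
        (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    ) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (
        max(a[3], b[3]) - min(a[1], b[1])
    ) ** 2
    return d2 / c2 if c2 > 0 else 0.0


def dsoft_nms(boxes, scores, threshold=0.5, score_floor=0.001):
    """Return (box, score) pairs in the order they are selected."""
    remaining = [[box, score] for box, score in zip(boxes, scores)]
    kept = []
    while remaining:
        # Select the box M with the highest current score.
        best = max(range(len(remaining)), key=lambda k: remaining[k][1])
        m_box, m_score = remaining.pop(best)
        kept.append((m_box, m_score))

        survivors = []
        for box, score in remaining:
            t = iou(m_box, box) - distance_ratio(m_box, box)
            if t >= threshold:
                # Second case of Eq. (29): decay the score instead of deleting the box.
                score *= 1.0 - t
            if score >= score_floor:
                survivors.append([box, score])
        remaining = survivors
    return kept
```

Unlike hard NMS, a heavily overlapping box is not removed outright; its score is lowered, so a partly occluded commodity behind another one can still survive if its confidence was high.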

SECTION IV.

Experimental Results and Discussion

In this section, the algorithm system, experimental parameters, and procedures are described in detail. First, the overall process is introduced step-by-step. Subsequently, the key experimental parameters are determined during the training process. Finally, the evaluation indices are used to assess the occlusion detection results.

A. Comparison With State-of-the-Arts

The experiments are analyzed and compared using self-made smart retail container datasets. The experimental device is a Lenovo Xiaoxin Air-12IIL 2020 with an Intel Core i5-1035G1 CPU and a discrete NVIDIA GeForce MX350 graphics card (2 GB). The system and software running the algorithm are Win10 and PyCharm, respectively. An HF899 camera with a 2.7-mm lens (135° distortion-free) is used.

A self-made dataset of common commodities in daily retail containers is used, for which 4,270 images of commodities are collected. The target data are labeled according to the VOC2007 format. Of the 4,270 images, 3,843 are used for training and 427 for validation. The test commodity datasets are constructed separately.

B. Experimental Evaluation Criteria

Based on a review of various test studies, the average accuracy (AP) and the mean AP (mAP) are used in this study to obtain the average value of the detection accuracy over the commodity categories. Additionally, the F1 index is used as the evaluation standard for model stability [43]. A detected sample is considered positive when the confidence level of the commodity detection box is greater than the threshold value and negative when it is equal to or lower than the threshold value. The recall rate R is the proportion of all positive samples that are correctly detected, defined as \begin{equation*} R=\frac {TP}{TP+FN} \tag{30}\end{equation*}

where TP is the number of samples correctly classified as positive, and FN is the number of positive samples incorrectly identified as negative. T represents the maximum spacing between video frames.

Precision P represents the proportion of correctly detected positive samples among all samples that the algorithm reports as positive, where FP is the number of negative samples incorrectly identified as positive. It is defined as \begin{equation*} P=\frac {TP}{TP+FP} \tag{31}\end{equation*}


Let n be the total number of samples, of which k are detected. The recall at this point is denoted r_{k} , and p_{k} is the maximum precision attained at a recall not lower than r_{k} . The average accuracy is defined as:\begin{equation*} AP=\sum \nolimits _{k=1}^{n} {p_{k}(r_{k+1}-r_{k})} \tag{32}\end{equation*}

mAP is the average accuracy over all categories, defined as \begin{equation*} mAP=\frac {1}{L}\sum \nolimits _{q=1}^{L} {AP_{q}} \tag{33}\end{equation*}
where L represents the total number of categories, and AP_{q} represents the average accuracy of category q .
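The sketch below shows one common way to evaluate Eqs. (32) and (33) from a list of recall and precision points sorted by increasing recall. It assumes all-point interpolation, with the recall sequence starting from 0 so that each precision value multiplies the recall step that ends at its own recall; this index shift and the function names are assumptions and may differ from the authors' evaluation script.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    recalls must be sorted in increasing order; precisions[k] is the
    precision measured at recalls[k].
    """
    # Interpolate: p_k becomes the maximum precision at any recall >= r_k.
    interpolated = list(precisions)
    for k in range(len(interpolated) - 2, -1, -1):
        interpolated[k] = max(interpolated[k], interpolated[k + 1])

    # Sum p_k * (recall step ending at r_k), starting from recall 0.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, interpolated):
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: arithmetic mean of the per-category AP values, Eq. (33)."""
    return sum(ap_per_class) / len(ap_per_class)
```

The interpolation step removes the local dips of the raw precision curve, so AP does not depend on the order in which equally ranked detections happen to be evaluated.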

The model stability is assessed using the F1 value (H-mean value), which is the harmonic mean of the precision and recall. A higher F1 value indicates a more stable model. The formula used is as follows:\begin{equation*} F_{1}=\frac {2PR}{P+R}=\frac {2TP}{2TP+FP+FN} \tag{34}\end{equation*}


The formula shows that F1 combines the precision and recall with equal weight in the form of a harmonic mean.
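A short sketch makes Eqs. (30), (31), and (34) concrete. The function name and the counts used in the usage note are hypothetical and do not come from the paper's experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1
```

For example, with 6 true positives, 2 false positives, and 4 false negatives, the precision is 0.75, the recall is 0.6, and F1 is 12/18, or about 0.667, which is closer to the lower of the two values, as expected of a harmonic mean.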

C. Analysis of Experimental Results

In this study, a semantic segmentation algorithm is used to enhance the semantic information of the commodities and improve the accuracy of the data. Subsequently, the commodities are detected using YOLO-R in association with the semantic segmentation network; together, these steps constitute the overall process. The test results for this process are as follows.

The results of the algorithm detection are shown in Fig. 8. The first, second, and third rows represent the original images, the semantic segmentation predictions, and the detection results, respectively. It can be observed from Fig. 8 that in all the semantic segmentation maps, the commodities are separated from the background, and the algorithm accurately detects the occluded commodities.

FIGURE 8. - Results output during the experiment. First row: input occluded commodities. Second row: output semantic segmentation results. Third row: output detection results.

At the beginning of this study, the ideal number of iterations is determined to be 100 based on training sessions conducted for each comparison model. Furthermore, the stability of the proposed model is compared to that of the traditional models, as shown in Fig. 9.

FIGURE 9. - Iterative stability of loss training. In (a)(b)(c)(d)(e)(f), the network models are stable after 100 trainings.

1) Public Dataset Detection and Comparison

This study first verifies the performance of the algorithm. The algorithm used in this study and the traditional algorithms are trained and verified using the VOC2007 public dataset. The data contain 21,503 pictures, which are divided into a training set and a verification set through cross-validation at a ratio of 9:1.

Table 1 lists the mAP and F1 data obtained on VOC2007. Following the literature, the YOLOX [44] and DETR [45] models are introduced for comparative analysis. The comprehensive analysis shows that the models have similar detection accuracy in terms of mAP, and that the higher the F1 index of a model, the stronger its stability. Because the performance of YOLO5 is more stable than that of the other networks, the more mature YOLO5 model is chosen as the basis for the improved model in this study.

TABLE 1 Comparison of Public Dataset Model Data. The Best Score is Highlighted in Bold

2) Comparison of Self-Made Dataset Detection Models

The results in Table 2 are based on 100 training iterations. Compared with the current mainstream models, the proposed model is superior in AP and recall: YOLO-R is 0.01% higher than YOLOX in AP and better than DETR, although its accuracy is slightly lower than that of YOLOX. In terms of commodity detection speed, the original YOLO5 model is slower than YOLO7, whereas the improved YOLO-R is 3.83 F/s faster than the original model and 2.48 F/s faster than YOLO7. YOLO-R also improves the detection accuracy by 0.9% compared with the original model. This comparative experiment verifies the feasibility and superiority of the proposed algorithm: on the self-made dataset, the improved model is more accurate, and its comprehensive performance is also improved to a certain extent.

TABLE 2 Comparison of Average Accuracy, Map Value and Recall Rate on Self-Made Commodity Datasets. The Best Score is Highlighted in Bold

Compared with other algorithms, YOLO is an end-to-end target-detection neural network. YOLO predicts multiple candidate boxes at one time and regresses the object location and the category of the object in each region at the output layer, which makes it faster. Faster R-CNN and similar algorithms must first generate many candidate boxes in the picture, so they have many parameters and a long training time.

3) Comparison and Analysis of Different Commodity Tests on Self-Made Datasets

The occlusion detection performance of the proposed model is compared with those of other networks based on AP using test sets of different commodities. By comparing the detection effect of each network on other datasets, the stability of the proposed model is verified. Simultaneously, the universality and stability of the algorithm model are demonstrated, proving its applicability to different commodity detections, as shown in Table 3.

TABLE 3 Comparison of Various Network Models in Different Commodity Detection. The Best Score is Highlighted in Bold

As shown in Table 3, the detection accuracy of YOLO-R on the different commodities is higher than that of the other models, except on the Scream commodity, where it is 0.02% lower than that of YOLO7; overall, its detection accuracy is 2.69% higher than that of YOLO7. The detection accuracy of the improved model on the different commodities is 0.05%–2.42% higher than that of the second-best model. Among the selected comparison models, the overall performance of the improved YOLO-R commodity detection method is 2.18% higher than that of the original model. In conclusion, the model outperforms the general approaches.

4) Comparative Analysis of Network Model Stability on the Homemade Datasets

A series of comparisons is also made for commodities with different degrees of occlusion to verify the stability of the model. Two representative commodities (packed potato chips and bottled mineral water) are occluded to different degrees, and their average accuracies are obtained. The comparison analysis is based on the average accuracy.

In Table 4, to compare the stability of the network models, the packed potato chips and bottled water are selected as the experimental objects and occluded to varying degrees to measure the detection accuracy of each network model. The experimental results show that the detection accuracy of the proposed algorithm differs little from that of the other network models when there is little or no occlusion. The detection performance of all models decreases as the occlusion increases. Nevertheless, as the occlusion ratio increases, the detection performance of the YOLO-R model remains high, indicating greater stability on the self-made occlusion dataset. The detection accuracy for potato chips reaches 97.88% at a serious occlusion degree of approximately 60%, which is 1.35% higher than that of the second-best network model. For the detection of Farmer Spring bottled water, YOLO-R does not outperform Faster R-CNN under moderate occlusion, but its overall performance is better. Under severe occlusion, the detection accuracy of YOLO-R reaches 68.87%, which is 1.08% higher than that of the second-best network.

TABLE 4 The Stability Between Network Models. The Best Score is Highlighted in Bold. This Data Indicates the Accuracy of Commodity Detection

5) Comparison of Attention Mechanisms Ablation Experiments

To identify the attentional mechanism that best matches the proposed model, numerous studies are consulted and three representative attentional models are selected for ablative comparison with the attentional model in this study.

As shown in Table 5, the YOLO algorithm with the improved SE attention mechanism is 0.2% better than the original model. The ECA attention mechanism network appears to perform similarly to the improved SE model in the analysis. The focus of this study is the comparison and validation of each network model on actual commodity detection. Through a comparative analysis, YOLO-R is found to improve both accuracy and speed, by 0.40% and 2.38 F/s, respectively.

TABLE 5 Comparing Models of Attention Mechanisms. The Best Score is Highlighted in Bold

Table 6 presents an extension of the ablation experiments described previously. Several representative commodities are screened in extensive experiments and tested to verify the superiority of the improved attention mechanism.

TABLE 6 Commodity Detection Accuracy of Yolo Model in the Self-Made Dataset. The Best Score is Highlighted in Bold

The attention mechanisms of the four models are compared, and the improved SE attention mechanism in this study is found to be more stable, with a detection accuracy approximately 0.2% higher in terms of mAP than when no attention mechanism is used. In Table 6, six commodities randomly sampled from the self-made dataset are selected to verify the detection accuracy (the test datasets in Tables 6 and 3 are not the same). The average accuracy is the mean of the detection accuracies for these commodities. As shown in Table 5, the average detection accuracy improves when an attention mechanism is incorporated. The greatest improvement is observed for YOLO with the SE attention mechanism, whose detection accuracy is 3.18% higher than that of YOLO. On this basis, the accuracy of the improved YOLO-R algorithm is 4.14% higher than that of the original model. Tables 5 and 6 show that the algorithm is faster, more accurate, and has better detection stability.

6) Innovative Comparative Ablation Experiment

First, the BiFPN feature fusion model is used to replace the original PANet fusion structure, and the model parameters in the BiFPN layer are reduced. Second, the detection box algorithm is improved and the threshold limit of the eigenvalue is increased so that a suitable detection box can be generated. An ablation experiment is therefore conducted in this study to verify each of these innovations.

As indicated in Table 7, the lightweight BiFPN in this study achieves the highest stability and detection accuracy. The lightweight network model is 0.01-0.02 higher than the other networks on the F1 index and 0.3%-0.4% higher on mAP.

TABLE 7 Comparison of YOLO-R’s Multi-Feature Fusion Module Network. The Best Score is Highlighted in Bold

To verify that the proposed NMS algorithm based on the Dsoft-IOU constructs detection boxes with more accurate position and size, the model structure is kept the same in this study and only the NMS algorithm is changed for the comparison experiments, as shown in Fig. 10.

FIGURE 10. - Comparing the improved NMS algorithm with the original one. We use different types of commodities, compare the results of testing, and solve the average detection accuracy of these commodities.

The improved NMS algorithm produces a more accurate detection box, locates it more precisely, and encloses the detected commodities effectively, which improves the detection accuracy and yields a better detection effect.

D. Algorithm Detection Result Graph

1) Detection Results of Different Algorithms

To make this article more convincing, the next section presents a visualization of the test results. The results of the six comparison network detection models selected in this study are shown in Fig. 11 to further illustrate the detection effect characteristics of the proposed algorithm. The algorithm is effective in detecting occluded images in self-made datasets. This can solve the problem of occlusion when shoppers use commodities in a container. The following is an experimental analysis and detection result diagram of several algorithms. In the diagram, bagged potato chips are used as the detection object to facilitate the comparison and detection of various models.

FIGURE 11. - Detection results of common network models and this paper’s network model on Ritz-Carlton potato chips.

This section also adds a comparative visualization of the commodity detection models under different signal-to-noise ratios, as shown in Fig. 12.

FIGURE 12. - Interference immunity comparison chart. (a) When the signal-to-noise ratio is 0.01, our model is compared with the original model. (b), (c), (d) When the signal-to-noise ratio is 0.02, 0.5, and 0.1, our model and the original model are compared for detection.

The algorithm results are displayed and analyzed here. Fig. 11 shows the detection results of the proposed model and the other network models. In this study, a self-made occluded commodity dataset is used for the comparative experiments, and the proposed model achieves a higher detection accuracy than the other models.

Following the relevant literature [46], the detection of goods under different noise conditions is compared and analyzed. In Fig. 12, our model has better detection results than the original model at different signal-to-noise ratios, but when SNR=0.1, neither our model nor the original model can detect the commodities. This comparison indicates that the algorithm in this paper is more stable. The signal-to-noise ratio experiment shows that the anti-interference ability of the improved algorithm is better than that of the original model, but the anti-interference ability of YOLO-R still needs to be improved.

2) Self-Made Datasets Tested in This Model

The dataset used in this study is a self-made commodity dataset. The commodities commonly found in intelligent retail containers are selected through big data analysis and divided into three categories, depending on whether they are bagged, bottled, or canned. A few commodities are selected from each of these three categories, and the detection experiment is visualized.

The data used in this study are similar to the displayed samples. There are many kinds of commodities, and the degree of occlusion is determined by the proportion of the commodity that is occluded.

Twenty-one types of commodities are used in this study. Twelve of them, taken from the self-made dataset and divided into the three categories, are presented here. The visualized results indicate that the algorithm is effective on the self-made dataset.

3) Attention Mechanism Thermal Visualization

In this section, a visualization of the attention mechanism is presented, as shown in Fig. 14. It further demonstrates that the attention mechanism enhances the ability of the YOLO model to extract features and allows it to focus well on the effective feature areas of the commodities.

FIGURE 13. - Display of three categories of self-made datasets. We categorize the datasets into three categories of commodities, and select some of them for visualization and detection.

FIGURE 14. - Attention visualization and comparison. First row: heatmaps under light occlusion. Second row: heatmaps under medium occlusion. Third row: heatmaps under heavy occlusion.

The figure shows the heatmaps and detection accuracy of YOLO-R at different occlusion levels, together with a comparison of the proposed model with and without the attention mechanism network. As can be observed from Fig. 14, the algorithm increases the attention on the unoccluded part to make the feature extraction more effective. Although the detection accuracy decreases as the occlusion degree increases, the strengthened attention mechanism network is more effective in the area of interest, where the heatmap color is deeper. Each test commodity is the same as the corresponding training set commodity, and the reported AP is obtained from tests on excerpted video frames. Through a comparative analysis, it can be concluded that the attention mechanism model in this study focuses more on the unoccluded features of the commodities and is more accurate for detection.

SECTION V.

Conclusion and Future Work

To address large changes in target scale, multiple occlusion cases, and limited detection accuracy in occluded commodity detection, a residual network combined with an attention module is proposed to enlarge the scale range of the receptive field and enhance the multiscale information fusion ability of the model, thus improving its detection accuracy. To address the insufficient feature fusion in YOLO-R and the mixing of multilayer features, the BiFPN feature pyramid is used in this approach. The feature pyramid is sampled with a CARAFE structure, which enlarges the receptive field, fuses features at different scales of the network structure, and enhances the robustness of the features. For the processing module of the YOLO-R model, a Dsoft-IOU loss regression module that combines the location information and the characteristic entropy threshold is proposed to adaptively adjust the model's Dsoft-IOU, thus preventing the real detection box from being filtered out and improving the prediction accuracy of the model.

The method is tested on the occluded self-made datasets. The results show that the improved YOLO-R, which is based on YOLO and combines the SENet attention mechanism with an eigenvalue-limited threshold to strengthen the attention model, is effective for occlusion detection. The mean average accuracy obtained using this method is higher than that obtained using YOLO, and the speed is also improved. The algorithm also achieves good detection results for commodities with different occlusion ratios. However, the proposed algorithm, similar to the traditional algorithms, has certain limitations. Compared with the original method, the noise immunity of the improved algorithm is significantly enhanced, but the commodity detection ability decreases significantly as the noise ratio increases. The algorithm in this study therefore still suffers from low anti-interference ability and low detection accuracy under noise. Future work will focus on optimizing the model network structure, improving the noise resistance and anti-interference ability of the model, enhancing its generalization, and improving the stability of model detection.

Author Contributions

An Xie conceived the algorithms of the paper and wrote the manuscript; Kai Xie reviewed the paper; Hao-Nan Dong and Kai Xie designed the experiments; Hao-Nan Dong conducted the comparative experiments and collected the data; Jian-Biao He checked spelling and grammar and made suggestions.
