Endoscopic ultrasound image recognition based on data mining and deep learning

The recognition of medical images, especially endoscopic ultrasound images, is characterized by variable images and insignificant gray-scale changes, which requires repeated observation and comparison by medical staff. In view of these characteristics of ultrasound imaging, an image processing scheme is proposed that can analyze the biliary tract, gallbladder, abdominal lymph nodes, liver, descending duodenum, duodenal bulb, stomach, pancreas, pancreatic lymph nodes, and esophagus, for a total of 10 organs, 21 sub-categories, and 3510 images. The images are preprocessed using binarization, histogram equalization, median filtering, and edge enhancement algorithms. An improved YoloV4 convolutional neural network is trained on the data set and performs high-accuracy detection in real time. The average accuracy of the algorithm reaches 91.59%. The proposed algorithm can make up for the shortcomings of manual detection in the original image detection system and improve detection efficiency. As an auxiliary system, it can also reduce detection misjudgments and promote the development of automated and intelligent detection in the medical field.

INDEX TERMS endoscopic ultrasonography; image processing; data mining; convolutional neural network

I. Introduction:
Ultrasound imaging is widely used in medical diagnosis. With the advent of the big data era, medical imaging has entered a new stage of development that combines big data with deep learning. Medical ultrasound image libraries import a large number of images every day, covering normal and abnormal organs in many parts of the body. Extracting the medical information in these images is critical: based on it, professionals can determine the patient's type of condition and its degree of harm, and choose treatment accordingly. At present, common medical imaging equipment allows images to be marked manually and specific areas to be colored. However, for specific parts and organs, judgments still depend on the doctor's experience and training. In recent years, computer technology and artificial intelligence have gradually expanded into the field of medicine. In medical image recognition, Chen Yan et al. proposed a direct image enhancement technique, in contrast to the indirect image enhancement techniques used in medical image classification [1]. Li Bo et al. proposed a medical image classification algorithm based on multi-feature fusion in scale space, which classifies images by establishing a Gaussian scale space [2]. Li Zuoyong described an improved segmentation algorithm for medical images based on digital morphology, which segments medical images by combining morphology and filtering operations [3]. In the field of deep learning, image segmentation algorithms are gradually showing their advantages. Wang Li et al. introduced an algorithm based on the Faster-RCNN neural network, which serves as an auxiliary diagnosis tool [4]. Bakalo Ran et al. proposed a weakly and semi-supervised dual-branch deep learning detection network for medical image recognition, which significantly improves recognition [5].
VOLUME XX, 2017 1
References [6][7][8][9] introduced a multi-layer boundary perception deep learning network, a detection network based on Spatial Domain, Transform Domain and CNN methods, and a deep learning image classification algorithm that uses wavelet decomposition to replace the convolution kernel. At present, this type of algorithm is mainly trained on large amounts of data, with heavy computation and high computational cost. It is difficult for such algorithms to show their advantages in endoscopic ultrasound detection, where key frames are few and detection samples are insufficient. Current research on ultrasound endoscopy mainly faces the following problems:
(1) Ultrasound images are inherently blurry, and key frames with few features are difficult to detect during processing.
(2) Medical ultrasound imaging converts the physical characteristics of ultrasound into an image. It is very limited in detecting organs that contain gas or bone, and the resulting interference is difficult to eliminate. Judgment requires strong professional knowledge, and misdiagnosis is likely.
(3) The labeling error of medical ultrasound images is large, labeling relies on the assistance of professional physicians, and the judgment is difficult to quantify.
To address the above problems, this paper proposes a medical ultrasound detection and recognition algorithm based on data mining and deep learning. Based on the existing data set, the algorithm subdivides the key frames of ten organs (the biliary tract, gallbladder, abdominal lymph nodes, liver, descending duodenum, duodenal bulb, stomach, pancreas, pancreatic lymph nodes, and esophagus), establishes multiple label libraries for the different parts of particular organs, and recognizes 21 organ subtypes. The improved algorithm greatly relaxes the requirements on the scene and solves the training problem caused by differing organ environments. Image edge enhancement and filtering algorithms improve image clarity and recognizability, greatly reducing the impact of image quality issues such as blur. OpenCV-based random image transformation and data mining address the shortage of key frames and expand the sample size. Finally, the improved YoloV4 convolutional neural network is trained on the images, which improves recognition accuracy and addresses the low efficiency and bias of current manual judgment of endoscopic ultrasound images.

II. Data set image preprocessing
Endoscopic ultrasound images suffer from the following quality problems: (a) interference and noise caused by useless burrs in the image itself; (b) image blur and ghosting caused by image framing; (c) local salt-and-pepper noise and global Gaussian noise.
To address these problems, this paper uses an OpenCV-based image preprocessing algorithm to optimize the image data set of medical ultrasound endoscopy. The preprocessing preserves high-frequency signals, filters out irrelevant noise, improves image quality, and makes the features in the data set easier to identify.
The main processing flow is as shown in Figure 1.

Acquisition of data set
At present, few annotated data sets exist for medical ultrasound image classification. The data set in this article is built by framing and annotating the medical ultrasound endoscopy training video of the People's Medical Publishing House. The video contains medically classified and annotated endoscopic images of dozens of internal organs, including the gallbladder and pancreas, which provides a scientific and rigorous image source for constructing the data set. The video is split into frames with a video framing algorithm, with the extraction interval set to 1~5 frames, until enough frames are obtained to form an image data set. This study selected ten normal organs, such as the stomach and pancreas, as the research and test objects, and finally extracted 3510 useful frames. The division of the endoscopic ultrasound image data is shown in Table 1. The obtained images cannot be used directly for image classification training; some repair and image enhancement are required [10][11]. The data set also needs further processing to remove interference frames dominated by bubbles (such as large bubbles in esophageal images) and other interference items in each organ, so that the characteristics of an organ in different orientations can still be identified in a complex scene [12]. In this paper, LabelImg (a visual tool for generating data set labels) is used for image labeling and region of interest (ROI) box selection, and the images are labelled with the 21 organ sub-categories. Generating the corresponding XML files yields a data set in VOC format. Part of the pancreatic endoscopic ultrasound image set after framing is shown in Figure 2.
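As an illustration of the VOC-format labels mentioned above, the following minimal sketch builds one annotation with Python's standard `xml.etree.ElementTree` module. The file name, class label, and box coordinates are hypothetical, and only the most common VOC fields are written; the files produced by the labeling tool in this study may contain additional fields.

```python
import xml.etree.ElementTree as ET


def make_voc_annotation(filename, width, height, boxes):
    """Build a Pascal VOC style annotation tree.

    boxes: list of (label, xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        ET.SubElement(obj, "difficult").text = "0"
        box = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"),
                              (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(value)
    return root


# Hypothetical frame name, class label, and ROI coordinates
annotation = make_voc_annotation(
    "pancreas_0001.jpg", 608, 608, [("pancreas_body", 120, 200, 380, 420)]
)
xml_text = ET.tostring(annotation, encoding="unicode")
```

Each labelled ROI becomes one `object` element, so a frame showing several organ sub-categories simply carries several such elements.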

Image graying and enhancement
The color images in the data set are composed of three RGB (Red, Green, Blue) channels. These channels only describe the optical characteristics of the image and cannot reflect the morphological characteristics of the target object. Color information does not help to identify internal organs in ultrasound images, and the redundant information it contains increases the number of features and the amount of calculation. To reduce the amount of calculation in later stages, the image is unified to the [0, 255] grayscale range [13]. The average value method is used in this study:
$$Gray(i,j) = \frac{R(i,j) + G(i,j) + B(i,j)}{3}$$
In the above formula, Gray(i,j) represents the gray value of the pixel in row i and column j, and R, G, and B respectively represent the value of each channel; the average of the three is used as the gray value of the pixel.
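The channel averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes an image stored as an (H, W, 3) `uint8` array; the function name is hypothetical. Note that common library conversions such as OpenCV's `cv2.cvtColor` use a weighted sum of the channels rather than the plain average.

```python
import numpy as np


def to_gray_average(rgb):
    """Average the three channels of an (H, W, 3) uint8 image."""
    gray = rgb.astype(np.float64).mean(axis=2)
    return np.round(gray).astype(np.uint8)


# One pixel with channel values 30, 60 and 90
pixel = np.array([[[30, 60, 90]]], dtype=np.uint8)
gray = to_gray_average(pixel)
```

Converting to floating point before averaging avoids overflow of the 8-bit channel values.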
After the medical ultrasound video is divided into frames, some images appear blurry or show ghosting. This reduces the recognizability of the image and makes recognition more difficult. To solve this problem, after the image is converted to grayscale, the PIL library in Python is used to enhance the sharpness of the image and improve its quality and definition [14]. As shown in Figure 3, the sharpness of the edges in image (b) is significantly enhanced compared to image (a). This gives a better feature enhancement effect for organ endoscopic images that are judged based on internal edge signals.

Histogram equalization
The grayscale image has an uneven brightness distribution. To improve the overall gray-level distribution and contrast, an image with a known grayscale distribution is transformed into an image whose gray levels are uniformly distributed. Spreading the irregularly distributed pixels over a wider range of gray levels further improves the contrast of the image. The steps to remap the histogram are as follows.
Calculate the probability density function at each gray level:
$$P_r(r_k) = \frac{n_k}{n}$$
In the above formula, $P_r(r_k)$ is the probability of gray level $r_k$, $n_k$ is the number of pixels at that gray level, and $n$ is the total number of pixels in the image.
Use the mapping relationship to get the distribution function value after gray-level mapping:
$$S_k = \sum_{j=0}^{k} P_r(r_j) = \sum_{j=0}^{k} \frac{n_j}{n}$$
Among them, $S_k$ represents the value of the cumulative probability distribution function over the gray levels $r_j$ up to $r_k$.
After the gray values are remapped in this way, the histogram is equalized: the pixels are no longer concentrated near black but are distributed over the gray scale of 0-255, which realizes the rearrangement of the gray levels of the medical image. The histograms before and after equalization are shown in Figure 4, where the abscissa represents the gray level and the ordinate represents the number of pixels at that gray level. The gray histogram of the pancreas image is obtained from gray-value and pixel statistics. Before equalization, most of the pixels are distributed in the 0-50 range. After equalization, the grayscale histogram is clearly more balanced, with pixels distributed across the 50~255 range, thus achieving grayscale remapping and image equalization [15].
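The remapping steps above can be sketched with NumPy as follows. This is a minimal illustration, not the exact routine used in this study: the final scaling of the cumulative distribution to the 0-255 range by rounding is an assumption, and library routines such as OpenCV's `cv2.equalizeHist` normalize slightly differently.

```python
import numpy as np


def equalize_histogram(gray, levels=256):
    """Remap gray levels using the cumulative distribution of the image."""
    hist = np.bincount(gray.ravel(), minlength=levels)  # pixel count per level
    pdf = hist / gray.size                              # probability per level
    cdf = np.cumsum(pdf)                                # cumulative distribution
    # Assumed scaling: stretch the cumulative distribution to [0, levels - 1]
    mapping = np.round((levels - 1) * cdf).astype(np.uint8)
    return mapping[gray]


# A small dark image whose gray levels all lie in the 0-50 range
dark = np.array([[0, 10, 10, 20],
                 [20, 20, 30, 30],
                 [30, 30, 40, 40],
                 [40, 40, 50, 50]], dtype=np.uint8)
equalized = equalize_histogram(dark)
```

Because the cumulative distribution is non-decreasing, the mapping preserves the ordering of gray levels while spreading them over the full range.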

Image filtering algorithm
Ultrasound imaging mainly uses the acoustic characteristics of ultrasound reflected by different organs and tissues, and can distinguish the contours of different organs. The interference in the image is mainly caused by speckle noise, which cannot be eliminated by physical methods and can only be handled by image processing [16]. In this article, the graininess and burr interference in the endoscopic ultrasound images are treated as noise. To reduce the impact of this noise, Gaussian denoising is applied first, followed by median filtering [17].
Gaussian denoising blurs noise across the whole image. A two-dimensional Gaussian function is used to design a 3*3 mask (template), and the output is computed by moving this mask over the image:
$$H_{i,j} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(i-k-1)^2 + (j-k-1)^2}{2\sigma^2}\right)$$
Among them, k determines the size of the window, which is $(2k+1) \times (2k+1)$, so k = 1 for the 3*3 mask; $H_{i,j}$ is the value in row i and column j of the mask, and $\sigma$ is the standard deviation of the Gaussian. As the window moves, Gaussian filtering is performed on the pixels in all positions.
However, Gaussian filtering cannot remove salt-and-pepper noise. For this reason, this paper uses median filtering as a supplementary method to further remove the salt-and-pepper noise in the image while retaining its edge features. As the template moves, the gray levels of the pixels under the mask are sorted, and the median value of the area covered by the mask is taken as the new gray value of the central pixel. The filtered result is shown in Figure 5. Compared with figure (a), the burrs and salt-and-pepper noise in the processed figure (b) are greatly weakened, and the joint filtering effect is good.
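The two filters described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the standard deviation `sigma`, the edge-replicating border handling, and the function names are not given in the text and are chosen here for the example; the mask is also normalized to sum to 1 so that flat regions keep their brightness. In an OpenCV pipeline, `cv2.GaussianBlur` and `cv2.medianBlur` play the same roles.

```python
import numpy as np


def gaussian_kernel(k=1, sigma=1.0):
    """(2k+1) x (2k+1) Gaussian mask, normalized to sum to 1."""
    ax = np.arange(-k, k + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def sliding_windows(image, size):
    """All size x size neighbourhoods of an edge-padded 2-D image."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    return np.lib.stride_tricks.sliding_window_view(padded, (size, size))


def gaussian_filter(image, k=1, sigma=1.0):
    """Weighted average of each neighbourhood with the Gaussian mask."""
    windows = sliding_windows(image.astype(np.float64), 2 * k + 1)
    return (windows * gaussian_kernel(k, sigma)).sum(axis=(-2, -1))


def median_filter(image, size=3):
    """Replace each pixel by the median of its neighbourhood."""
    return np.median(sliding_windows(image, size), axis=(-2, -1))


# A flat patch with one isolated "salt" pixel
noisy = np.full((5, 5), 100.0)
noisy[2, 2] = 255.0
smoothed = gaussian_filter(noisy)   # the outlier is spread out, not removed
cleaned = median_filter(noisy)      # the outlier is removed entirely
```

The example shows why the median filter is needed as a supplement: the Gaussian mask only spreads the isolated outlier over its neighbours, while the median discards it.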

Edge extraction and enhancement algorithm
The recognition of endoscopic images is mainly to judge the ultrasonic appearance characteristics of different organs and membrane structures. The main high-frequency information is the changes in the pixel gray levels at the edges and contours. Therefore, the optimization and extraction of edge information will play an important role in the recognition effect. In this paper, Laplace operator is used to construct a 3*3 convolution kernel, which is used to calculate the gray jump value of the edge pixels of the image [18][19].
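$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$
Among them, $\nabla^2 f$ represents the second-order gray-scale differential value of the image at the (x, y) coordinates.
The simplified expression is:
$$\nabla^2 f = -4f(x, y) + f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1)$$
Further use the Laplace convolution template to realize the calculation:
$$H = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
Among them, H represents the Laplace operator convolution template.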
The convolution operation can extract the edges of the image, as shown in Figure 6. The edge features of the image are preserved, and the features of the image are further enhanced.
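The Laplace convolution described above can be sketched in NumPy as follows. The sketch assumes the 4-neighbour template with a centre weight of -4 and edge-replicating borders. The enhancement step, which subtracts the Laplacian response from the image, is a common sharpening choice and is an assumption here; the text does not state how the extracted edges are combined with the original image.

```python
import numpy as np

# 4-neighbour Laplace template (assumed form, centre weight -4)
H = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]], dtype=np.float64)


def laplacian(image):
    """Second-order gray-level difference at every pixel."""
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return (windows * H).sum(axis=(-2, -1))


def enhance_edges(image):
    """Sharpen by subtracting the Laplacian, then clip to [0, 255]."""
    sharpened = image.astype(np.float64) - laplacian(image)
    return np.clip(sharpened, 0, 255).astype(np.uint8)


# A vertical step edge: dark on the left, brighter on the right
step = np.zeros((5, 6), dtype=np.uint8)
step[:, 3:] = 100
response = laplacian(step)
sharp = enhance_edges(step)
```

The response is zero in flat regions and changes sign across the step, which is what makes the operator useful for locating gray-level jumps at organ contours.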

III. Convolutional neural network training and prediction
The research approach based on the convolutional neural network is shown in Figure 7. The research route consists of three main parts: data mining and expansion, construction of the YoloV4 detector network framework, and network evaluation and improvement. To improve the robustness of the algorithm, this paper selects parameters and regularization methods to debug and optimize it.
The experiments run under the Windows 10 operating system, using the TensorFlow framework to build the YoloV4 convolutional neural network. An RTX Quadro 4000 graphics card is used as the acceleration platform. Training ran for 50 rounds and took a total of 2 hours and 35 minutes.

Construction of YoloV4 network framework
The construction of the YoloV4 network detector is very important. It mainly includes the backbone network, SPP, the PANet network, and the prediction network [20], as shown in Figure 8. The backbone network extracts features, SPP participates in pooling as an additional module [21], and PANet mainly participates in feature fusion [22][23][24][25]. The Yolo head is mainly used for prediction. SPP applies pooling with 1×1, 5×5, 9×9, and 13×13 kernels to the convolved feature layer. This structure enlarges the receptive field and separates out the most significant context features. PANet, which originates from a segmentation algorithm, is used to strengthen the features of the detection target through repeated feature extraction.
The feature layers designed with the Yolo network have dimensions (52, 52, 256), (26, 26, 512), and (13, 13, 1024). On this basis, the parameters are adjusted, and regularization and data enhancement are introduced. Training yields a convolutional neural network suited to endoscopic ultrasound samples. The improved network performed well in the experiments and can effectively identify endoscopic image features.
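The SPP block described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the network code used in this study: it assumes max pooling with stride 1 and same-size padding, as in common YoloV4 implementations, and uses a small random feature map with 8 channels instead of a real backbone output.

```python
import numpy as np


def max_pool_same(feature, k):
    """Stride-1 max pooling with a k x k window on an (H, W, C) map."""
    pad = k // 2
    padded = np.pad(feature, ((pad, pad), (pad, pad), (0, 0)),
                    mode="constant", constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(0, 1))          # shape (H, W, C, k, k)
    return windows.max(axis=(-2, -1))


def spp(feature, kernel_sizes=(1, 5, 9, 13)):
    """Concatenate the pooled maps along the channel axis."""
    pooled = [max_pool_same(feature, k) for k in kernel_sizes]
    return np.concatenate(pooled, axis=-1)


# Hypothetical 13 x 13 feature map with 8 channels
feature = np.random.default_rng(0).random((13, 13, 8))
fused = spp(feature)
```

The 1×1 branch is the identity, so the output keeps the original features alongside three progressively wider context summaries, which is how the block enlarges the receptive field without changing the spatial size.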

Training preparation and data mining
The endoscopic ultrasound images in this article contain both characteristic frames and useless frames. The characteristic frames are organs that have been correctly marked in the image, with fewer interference items. However, the number of such frames is relatively small. In order to solve this problem, data enhancement algorithms are used to increase the diversity of image features while avoiding over-fitting of the model [26].
(a) Data enhancement is based on the OpenCV function library: random flipping, rotation, scaling, and other operations are applied to the training data, and the frames in the training set are processed by setting a random state flag.
(b) The mosaic data enhancement method is introduced. Its principle is to crop partial pixel areas from four images and stitch them into one image. Because each stitched image already combines four pictures, the batch size (number of images per training iteration) does not need to be large, which improves the computational efficiency.
(c) During forward propagation and backpropagation, the neural network continuously updates the model parameters, calculates the loss value, and searches for the optimum of the loss function. Stochastic gradient descent is used for this optimization. However, on multimodal functions, gradient descent can become trapped in a local optimum that is not the global optimum. To solve this problem, cosine annealing decay with restarts is introduced: when the search falls into a local optimum, the learning rate is reset so that the search can jump out and continue toward the global optimum. The principle is:
$$\eta_t = \eta_{min}^{j} + \frac{1}{2}\left(\eta_{max}^{j} - \eta_{min}^{j}\right)\left(1 + \cos\left(\frac{T_{current}}{T_j}\pi\right)\right)$$
In the above formula, j is the index of the current run, $\eta_{max}^{j}$ is the maximum learning rate, $\eta_{min}^{j}$ is the minimum learning rate, $T_{current}$ is the number of epochs executed so far in the current run, and $T_j$ is the total number of epochs in the j-th run. After each restart, $T_j$ is multiplied by a fixed value so that it increases automatically, and the learning rate is updated accordingly.
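The cosine annealing schedule with restarts described in (c) can be sketched as follows. The sketch assumes the standard warm-restart form, in which the learning rate follows half a cosine wave from the maximum to the minimum within each cycle and the cycle length is multiplied by a fixed factor at each restart. All hyperparameter values and function names are hypothetical; the text does not report them.

```python
import math


def cosine_annealing_lr(t_current, t_cycle, lr_max, lr_min):
    """Learning rate after t_current epochs of a cycle of t_cycle epochs."""
    cos_term = 1.0 + math.cos(math.pi * t_current / t_cycle)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def schedule_with_restarts(total_epochs, first_cycle, cycle_mult,
                           lr_max, lr_min):
    """Per-epoch learning rates; the cycle grows by cycle_mult per restart."""
    rates, t_current, t_cycle = [], 0, first_cycle
    for _ in range(total_epochs):
        rates.append(cosine_annealing_lr(t_current, t_cycle, lr_max, lr_min))
        t_current += 1
        if t_current >= t_cycle:   # restart: jump back to the maximum rate
            t_current = 0
            t_cycle *= cycle_mult
    return rates


# Hypothetical hyperparameters for a 50-epoch run
rates = schedule_with_restarts(total_epochs=50, first_cycle=10, cycle_mult=2,
                               lr_max=1e-3, lr_min=1e-5)
```

With these values the rate restarts at epochs 10 and 30; each sudden jump back to the maximum rate is what allows the optimizer to leave a local optimum.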
Random image transformation, mosaic data enhancement, and cosine annealing decay are used to expand the limited set of ultrasound frames, reduce the amount of calculation, and avoid over-fitting.
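The random transformation and mosaic stitching described above can be sketched in NumPy as follows. This is a simplified illustration under assumptions: it works on square grayscale frames of equal size, omits scaling, and does not transform the bounding-box labels, which a real training pipeline must also do. The function names and the rule for choosing the stitching centre are hypothetical.

```python
import numpy as np


def random_flip_rotate(image, rng):
    """Randomly flip horizontally and rotate by a multiple of 90 degrees."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    return np.rot90(image, k=int(rng.integers(0, 4)))


def mosaic(images, out_size, rng):
    """Stitch crops of four square images around a random centre point."""
    assert len(images) == 4
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size), dtype=images[0].dtype)
    # (row start, row end, column start, column end) of the four quadrants
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = img[:y1 - y0, :x1 - x0]  # top-left crop
    return canvas


rng = np.random.default_rng(0)
# Four constant frames so that each quadrant is easy to recognize
frames = [np.full((64, 64), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
augmented = [random_flip_rotate(f, rng) for f in frames]
stitched = mosaic(augmented, out_size=64, rng=rng)
```

Each stitched sample shows four organ views at once, which is why a small batch size can still expose the network to varied content in every iteration.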

Training and identification of neural networks
The endoscopic ultrasound image data set includes processed images and original images, with a total of 3510 effective frames and an image size of 608*608. The ratio of the training set to the test set is 9:1, the batch size of each iteration is 2, each round requires 1580 batches, and training runs for a total of 50 epochs (training rounds).
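The batch count quoted above follows from the split ratio and the batch size. The short calculation below reproduces it, assuming the final, incomplete batch of each round is kept rather than dropped.

```python
import math

total_images = 3510
batch_size = 2

train_images = total_images * 9 // 10          # 9:1 split
test_images = total_images - train_images
batches_per_epoch = math.ceil(train_images / batch_size)
total_iterations = batches_per_epoch * 50      # 50 training rounds
```

This gives 3159 training images, 351 test images, and 1580 batches per round, matching the figure in the text.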
The training of the convolutional neural network mainly consists of the iterative calculation and update of the convolutional layer, the pooling layer, and the fully connected layer. The main steps are:
(a) Create an image data set in VOC2007 format, mark the identifiable information ROI, and generate the corresponding XML file.
(b) Convolve the image features through the CSPDarknet53 network, compress and extract useful features, and calculate the loss value under the loss function.
(c) Batch gradient descent is used to compute the gradients and update the parameters w and b, and regularization is used to shrink the weights and prevent overfitting.
(d) When the Loss value converges to the expected effect, the weight file is output.
(e) The weight file and the forward propagation process are used to identify the target in the image.
After training according to the above steps, the recognition results for some organs are shown in Figure 9. The recognition results are consistent with the actual situation, and recognition is fast: a single image takes about 0.1 ms.

IV. Result analysis
To verify the training effect, the accuracy, and the detection ability of the improved YoloV4 neural network, the degree of overlap between the detection box of the recognized target and the ground-truth box must be measured. The intersection over union (IOU) is the main basis:
$$IOU = \frac{|A \cap B|}{|A \cup B|}$$
Among them, A is the detection box and B is the ground-truth box; the numerator represents the area of their intersection, the denominator represents the area of their union, and IOU is the resulting intersection-over-union ratio.
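The overlap measure described above can be sketched as follows for axis-aligned boxes. The (xmin, ymin, xmax, ymax) box representation and the function name are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```

A detection is usually counted as correct when this value exceeds a chosen threshold; the threshold used in this study is not stated in the text.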
The precision P of the model is calculated by the following formula:
$$P = \frac{TP}{TP + FP}$$
Among them, TP (true positive) is the number of detections that are reported as positive and are actually positive, and FP (false positive) is the number of detections that are reported as positive but are actually negative.
The recall R of the model is calculated in a similar way:
$$R = \frac{TP}{TP + FN}$$
Among them, FN (false negative) is the number of samples that are reported as negative but are actually positive. Recall reflects the proportion of all positive samples that are correctly detected.
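The two ratios above can be computed directly from the detection counts. The counts in the example below are hypothetical and are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses
p, r = precision_recall(tp=90, fp=10, fn=30)
```

Here precision is 0.9 and recall is 0.75, which shows that a detector can raise few false alarms while still missing a sizeable share of the true targets.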
A single index has limitations for evaluating the model, and the effect of target detection should be judged together with the confidence level. Each detection is classified as a positive or negative sample according to a confidence threshold. Introducing confidence links the precision P and the recall R: varying the threshold traces out the P-R curve, and the average precision (AP) is the area under this curve. The AP statistics of normal pancreatic lymph nodes are shown in Figure 10, where the abscissa represents the recall and the ordinate represents the precision. As the positive-sample threshold point moves toward the left of the curve, the precision approaches 1 and the recall approaches 0. The AP of normal pancreatic lymph nodes is relatively high, reaching 95.52%. mAP means mean average precision, the mean of the AP values over all trained object categories, and is an important basis for evaluating the quality of the model. Its calculation formula is:
$$mAP = \frac{1}{c}\sum_{i=1}^{c} AP_i$$
Among them, c represents the total number of sample categories, and $AP_i$ is the AP value of the i-th category.
Using the above principles, the model in this article is evaluated and the mAP of the 21 sub-categories in the data set is plotted, as shown in Figure 11.
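The AP and mAP calculations described above can be sketched as follows. The sketch assumes all-point interpolation of the precision-recall curve, which is one common convention; the text does not state which interpolation was used. The detection scores and match flags in the example are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:                      # lower the confidence threshold
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detections for one category with 4 ground-truth boxes
ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6],
                       is_true_positive=[True, False, True, True],
                       num_ground_truth=4)
m_ap = mean_average_precision([0.9, 0.8])
```

For the 21 sub-categories in this data set, `mean_average_precision` would receive a list of 21 AP values, one per category.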