Automatic Segmentation Algorithm of Ultrasound Heart Image Based on Convolutional Neural Network and Image Saliency

The emergence of 4D heart images multiplies the volume of image data, so an effective and fast segmentation algorithm is urgently needed: one that can accurately segment the heart from a large amount of image data and extract the region of interest. Building on work in medical image segmentation and recognition, this paper proposes a fully automatic segmentation algorithm based on a neural network and image saliency. The algorithm exploits the obvious difference between the heart and other tissues within a slice and the high similarity between adjacent slices in a CT image sequence, and the segmented heart image is reconstructed visually in 3D. A convolutional neural network is a special deep neural network model in which neurons are not fully connected and the connection weights between certain neurons in the same layer are shared, which reduces both the complexity of the network model and the number of weights. Visual saliency techniques are used to achieve cardiac segmentation: an image saliency detection algorithm is adopted to drive the image segmentation algorithm. Because a PET image is a low-resolution grayscale image, an improved Itti model and an improved GrabCut image segmentation algorithm are proposed to solve the problems the original algorithms have with grayscale images. At the same time, the step in which the user marks regions is removed, so that processing is automatic, and the running time of the algorithm is reduced while the segmentation result is improved. A convolutional neural network is constructed to locate the heart in the image. The original cardiac CT image is cropped according to the localization result, which removes some non-target areas. A stacked noise reduction self-encoding network (stacked denoising autoencoder) is then constructed and trained on manually segmented images to classify the pixels that belong to heart tissue in the cardiac CT image, and the heart image is finally segmented based on the classification result. The results of the segmentation algorithm are evaluated quantitatively against manual segmentation results, and the segmentation results are reconstructed visually by surface rendering and volume rendering. The algorithm offers better accuracy and reliability, higher segmentation efficiency, and simpler user operation.


I. INTRODUCTION
Heart disease is one of the most common diseases of modern life and poses a serious challenge to human health. The latest statistics show that cardiovascular disease has become the leading cause of non-accidental death. About 17.3 million deaths worldwide each year are related to cardiovascular disease, and low- and middle-income countries account for around 80% of them [1]-[3]. Computer-aided diagnosis mainly includes image registration, segmentation, 3D reconstruction, and feature extraction and analysis. With the development of computer technology and advances in medical technology, the requirements for clinical diagnostic information have become more comprehensive. Especially in the three-dimensional diagnosis of cardiovascular diseases and in cardiac surgery navigation, the amount of medical image data keeps growing. The appearance of 4D heart images multiplies the data volume, so an effective and fast segmentation method is urgently needed. A segmentation algorithm that can accurately segment the heart from a large amount of image data and extract the region of interest is therefore necessary [4]-[6].
By analyzing the characteristics of CT and PET images and of existing heart segmentation algorithms, this paper decomposes the heart segmentation task into two parts, localization and segmentation, and proposes a segmentation algorithm based on a neural network and image saliency. The work on cardiac PET image segmentation addresses the accuracy and speed problems of existing algorithms and removes the need for user operation. It can assist in analyzing clinical images and reduce the workload of medical staff; the speed advantage is most apparent when a large number of images must be segmented. In addition, the method has guiding significance for PET image analysis of other organs. In the neural-network-based heart segmentation algorithm, positioning is implemented by a convolutional neural network and segmentation by a stacked noise reduction self-encoding network. Surface rendering and volume rendering are used to reconstruct the cardiac segmentation results visually and thereby verify the segmentation effect.
The rest of this paper is organized as follows. Section II reviews related work. Section III presents the cardiac PET image segmentation algorithm based on the visual saliency model, and Section IV presents the heart segmentation algorithm based on the convolutional neural network. The remaining sections show the simulation results and conclude the paper with a summary and future research directions.
The associate editor coordinating the review of this manuscript and approving it for publication was Honghao Gao.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
Visual saliency refers to algorithms that mark the salient areas of an image by simulating human visual characteristics through features such as color, brightness, and direction. Ruan et al. [7] proposed a very influential biologically inspired model, which became the foundation of research on visual saliency. According to the mode of calculation, saliency analysis algorithms can be divided into visual saliency algorithms, purely mathematical saliency algorithms, and hybrids of the two. Karthikeyan et al. [8] and Yang [9] proposed a series of classical visual saliency models. The basic idea is to simulate the behavior and neuronal structure of the primate dynamic vision system. The method obtains several single-feature saliency maps at different scales and then combines them by linear weighting. By rapidly selecting salient locations, it breaks down the complexity of the visual saliency problem. Wang et al. [10], Hrinivich [11], and Ramadan and Tairi [12] proposed graph-based visual saliency detection and carried out simulation experiments in Matlab. GBVS (Graph-Based Visual Saliency) is a bottom-up visual saliency model that consists of two steps: first, activation maps are generated for the target image on different feature channels; second, the activation maps are normalized and combined into a saliency map. Zhang et al. [13], Huang et al. [14], and Fu et al. [15] proposed salient region detection and segmentation, which has important application value in image segmentation, adaptive compression, region-based image retrieval, and so on. The method uses low-level brightness and color features to capture the salient areas of the image. It is efficient and can generate high-quality saliency maps with the same resolution and size as the input image.
Thereafter, a full-resolution algorithm based on frequency-adjusted salient region detection was proposed. Using color and brightness features, it outputs a full-resolution saliency map and effectively identifies salient objects with well-defined boundaries, with higher precision and computational efficiency for salient region detection. In contrast to the visual saliency models above, Yang et al. [16] and Song and Boreom [17] proposed the frequency-domain residual method (SR algorithm), which is based on purely mathematical calculation. By analyzing the spectrum of the input image and extracting its residual in the frequency domain, the method quickly constructs the saliency map in the spatial domain. It has been applied successfully to both natural and artificial images. Since image segmentation began to develop in the 1970s, a variety of segmentation methods have emerged through the efforts of researchers at home and abroad. The idea is to divide the image into regions according to differences or similarities in image features. At present, image segmentation methods are usually divided into four categories: threshold segmentation, boundary segmentation, region segmentation, and segmentation based on specific theories [18]-[20]. As the algorithms have diversified and the technology has matured, they have been widely used. Current research on medical image segmentation also builds on these existing methods [21]-[23].
Research hotspots in medical image segmentation include the following: methods based on probability atlases, which use a probability atlas to improve the reliability of medical image segmentation; methods based on morphological theory, which construct a probability atlas of a specific organ and combine it with deformable registration techniques to achieve automatic segmentation; and methods that fuse several approaches. Although many medical image segmentation methods exist, a large research space remains. First, traditional image segmentation methods mostly target natural images, and the imaging principles and characteristics of medical images differ considerably from those of natural images, so traditional methods are not directly applicable to medical image segmentation. Second, there are many medical imaging techniques, each with its own imaging principle and image characteristics. A single segmentation method cannot be applied to different types of medical images at the same time, which makes medical image segmentation techniques hard to transfer. In addition, because human organs are complex, professional medical knowledge is often required in the clinic to judge images accurately, which places high demands on the correctness and recognition capability of segmentation algorithms. Some previous automatic segmentation algorithms are fully computer-driven and require neither interaction nor manually entered parameters. However, such algorithms are computationally expensive and their segmentation results are unstable: over-segmentation or under-segmentation inevitably occurs, and it is difficult to obtain a satisfactory result.

III. CARDIAC PET IMAGE SEGMENTATION ALGORITHM BASED ON VISUAL SALIENCY MODEL
PET is an advanced medical imaging technology widely used in clinical practice. In addition to cardiovascular and neurological diseases, PET has important value in the diagnosis of other diseases. PET images have the advantages of high sensitivity, specificity, safety, and whole-body imaging. However, because of the limitations of PET imaging, PET images have low resolution, uneven backgrounds, and obvious radiation problems. PET image segmentation therefore still faces many challenges. At present, segmentation techniques for PET images fall into two categories: one is based on regions and the other on boundary edge detection. Region-based methods, represented by threshold segmentation, have difficulty producing an accurate result for an image that has no significant grayscale difference or a large overlap of grayscale values, and are thus not suitable for PET image segmentation. The GrabCut image segmentation algorithm uses image edge information together with manually selected background and foreground regions to obtain the segmentation result. It has the disadvantages of relying on user operations and of segmenting poorly when RGB differences in the unknown region are small. The traditional Itti model, although widely used in image edge segmentation, has problems such as large differences between feature maps and an excessively scattered spatial distribution of salient regions in the feature maps. These problems affect the accuracy and efficiency of PET image segmentation.
Based on the previous research, this paper proposes a PET image segmentation algorithm based on saliency techniques. The method adopts the Itti visual saliency model, which simulates in software how the human eye observes an image, and obtains the saliency map through feature extraction and feature map fusion. Within the Itti visual saliency model, the color feature map is optimized and interfering feature maps are filtered out, to obtain a saliency map that is more conducive to image segmentation. The improved Itti visual saliency model replaces the manual operation of the traditional GrabCut algorithm. After initialization from the saliency map, the image is segmented using the improved GrabCut algorithm, which obtains more accurate edges by optimizing the energy function. Figure 1 shows the overall framework of the image segmentation algorithm based on the visual saliency model. The main flow of the algorithm consists of three parts: saliency map generation based on the visual saliency model, saliency map transition processing, and image segmentation. In the first part, the target image is preprocessed by the visual saliency model to obtain the corresponding saliency map, which provides the data required for subsequent image segmentation. At this stage, the improved Itti model generates a pseudo color map of the target image, extracts different feature maps from the pseudo color map, and merges them into a saliency map. In the second part, image transition processing, the saliency map is divided into regions, region features are extracted, and the results are passed to image segmentation. The salient and non-salient regions of the saliency map are labeled as the foreground region and the background region respectively, so that the partition satisfies the requirements of the image segmentation algorithm; the data of the foreground and background regions are then extracted as parameters for the image segmentation algorithm. In the third part, image segmentation is performed according to the foreground and background data extracted in the transition stage, and the final segmented image is obtained.

A. VISUALLY SIGNIFICANT MODEL
According to the characteristics of the PET image and the principle of the Itti model, the Itti model is improved in two respects. First, feature maps are computed from a pseudo color map, to adapt the model to PET images that contain only gray levels. Second, each primary feature map is binarized, feature maps that would interfere with the synthesis are discarded [24]-[27], and the remaining feature maps are merged, which improves the feature extraction effect. After each primary feature map has been extracted, binarization is used to check the salient regions in each feature dimension, and only the feature maps that can contribute to the integrated saliency map are retained and merged. Experiments have shown that a feature map whose salient points are distributed too uniformly or too densely interferes with the final composite map.
Whether a feature map is an interference map is determined by the ratio $K_{rat}$ of the salient area $S$ of the feature map to the area $M$ of the image:

$$K_{rat} = \frac{S}{M}$$

When $K_{rat}$ is less than the decision value $T_{jud}$, the feature map is considered a non-interfering map; it is retained and takes part in the merging of the saliency map. This method effectively solves the problem of scattered regions in the Itti model. The choice of threshold directly affects the merged result. Based on the characteristics of PET images and an analysis of the experimental data, the experiment uses $T_{jud} = 0.65$.
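The interference check can be illustrated with a minimal NumPy sketch. The function names and the binarization threshold of 0.5 are assumptions made for illustration only; the text specifies just the ratio of salient area to image area and the decision value 0.65.

```python
import numpy as np

T_JUD = 0.65  # decision value reported in the text


def salient_area_ratio(feature_map, bin_threshold=0.5):
    """Return K_rat = S / M for a feature map scaled to [0, 1].

    bin_threshold is a hypothetical binarization level; the text does not
    state how the primary feature maps are binarized.
    """
    binary = np.asarray(feature_map) >= bin_threshold
    return binary.sum() / binary.size  # salient pixels / all pixels


def is_interference_map(feature_map, t_jud=T_JUD, bin_threshold=0.5):
    """A map is kept when K_rat < T_jud, so it interferes when K_rat >= T_jud."""
    return salient_area_ratio(feature_map, bin_threshold) >= t_jud
```

In this sketch a map whose salient pixels cover most of the image is flagged and would be left out of the merge, while a map with a compact salient region is kept.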
In this paper, the improved Itti visual saliency model is used to obtain the saliency map. In the original Itti model, a primary saliency map is generated for each of three features (brightness, color, and direction), and the three maps are then fused. The improved model adds two steps for PET images: a pseudo color map of the target image is generated before the primary feature maps are computed, and an interference map check is performed before the primary saliency maps are merged. With these two strategies, salient regions are recognized more reliably.
The first step is to generate a pseudo color map corresponding to the PET image. The second step is to generate a brightness saliency map, as follows: 1. Establish an RGB color model of the pseudo color map; 2. Generate a brightness map of the pseudo color map; 3. Generate a nine-level Gaussian pyramid using a Gaussian filter template; 4. Generate six brightness feature maps, using bicubic interpolation when enlarging images; 5. Superimpose the six brightness feature maps to highlight the salient regions and generate the brightness saliency map.
The third step is to generate a color saliency map, as follows: 1. Generate a nine-level Gaussian pyramid using a Gaussian filter template; 2. Generate twelve color feature maps; 3. Combine the twelve color feature maps to obtain the color saliency map.
The fourth step is to generate a directional saliency map, as follows: 1. Generate a nine-level pyramid; 2. Generate twenty-four directional feature maps; 3. Combine the twenty-four directional feature maps to obtain the directional saliency map.
The fifth step is to identify interference maps. Determine whether each of the three saliency maps is an interference map and discard those that are.
The sixth step is to synthesize the saliency map by combining the saliency maps retained in the fifth step. In this experiment, the brightness weight, the color weight, and the direction weight are α = 0.5, β = 0.4, and γ = 0.1, respectively, and the result is the synthesized saliency map.
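The brightness branch and the final weighted fusion in the steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: 2 × 2 block averaging stands in for the Gaussian filter template, nearest-neighbour enlargement stands in for bicubic interpolation, and the center and surround levels (c in {2, 3, 4}, s = c + 3 or c + 4) are those of the original Itti model, which the text does not restate. All function names are hypothetical, and the input is assumed to have sides that are multiples of 256 so that nine levels exist.

```python
import numpy as np


def downsample(img):
    """Halve the resolution by 2x2 block averaging.

    Assumption: a stand-in for Gaussian filtering followed by subsampling.
    """
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def build_pyramid(img, levels=9):
    """Return a list of `levels` images; level 0 is the input itself."""
    pyramid = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def enlarge(img, shape):
    """Nearest-neighbour enlargement (the text uses bicubic interpolation)."""
    fy = shape[0] // img.shape[0]
    fx = shape[1] // img.shape[1]
    return np.repeat(np.repeat(img, fy, axis=0), fx, axis=1)


def brightness_feature_maps(brightness):
    """Six center-surround maps |I(c) - I(s)| with c in {2, 3, 4}, s = c + {3, 4}."""
    pyramid = build_pyramid(brightness)
    maps = []
    for c in (2, 3, 4):
        for delta in (3, 4):
            surround = enlarge(pyramid[c + delta], pyramid[c].shape)
            maps.append(np.abs(pyramid[c] - surround))
    return maps


def fuse(brightness_sal, color_sal, direction_sal, weights=(0.5, 0.4, 0.1)):
    """Linear fusion with the brightness, color, and direction weights from the text."""
    w_b, w_c, w_d = weights
    return w_b * brightness_sal + w_c * color_sal + w_d * direction_sal
```

The three inputs to `fuse` must already share one resolution. A saliency map discarded as an interference map could be passed as an all-zero array, which is one possible reading of "discard" and is likewise an assumption.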
The saliency map obtained by the improved Itti visual saliency model cannot be applied directly to the image segmentation algorithm. Transition processing converts the saliency map into the input required by the subsequent GrabCut image segmentation algorithm. It includes two aspects: first, the salient and non-salient regions are distinguished according to the light and dark features of the image, and correspond to the foreground region and the background region in image segmentation; second, sample pixels corresponding to the salient and non-salient regions are extracted from the pseudo color map according to the regions of the saliency map.
1. Judging the salient and non-salient regions. The saliency map is a grayscale image, and the saliency of a region can be judged from its gray value. In actual image processing, a threshold is selected according to the characteristics of the picture, and the set of pixels whose gray value is greater than (or less than) the threshold forms the salient region.
2. Extracting the mapping position of the salient region in the pseudo color map. In generating the saliency map, α = 4 is selected, that is, the fourth level of the nine-level pyramid is adopted. The direct conversion method is used to obtain the original image position corresponding to a saliency map pixel. If the point on the saliency map is $(i, j)$, the pixel mapped to the pseudo color map is $(i/2^{\alpha}, j/2^{\alpha})$, and the area mapped to the pseudo color map is the square centered on $(i/2^{\alpha}, j/2^{\alpha})$ with side length $2^{\alpha}$. The mapping relationship of pixels is shown in Figure 2.
In Figure 2, the upper part shows a salient pixel, and the lower part shows the position and area corresponding to that pixel in the pseudo color map. Mapping the pixels of the saliency map in this way yields the corresponding foreground pixel set and background pixel set in the pseudo color map.
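The two transition steps can be sketched as follows. The sketch assumes only that the saliency map is smaller than the pseudo color map by an integer factor along each axis, so that every saliency pixel corresponds to a square block of pseudo color pixels; the factor is read from the array shapes rather than from a fixed pyramid level. The function name `split_samples` and the "greater than" threshold direction are illustrative choices.

```python
import numpy as np


def split_samples(saliency, pseudo_color, threshold):
    """Label salient pixels and collect foreground/background colour samples.

    saliency     : 2-D grayscale saliency map, shape (h, w)
    pseudo_color : pseudo color map, shape (H, W, 3) with H = k*h and W = k*w
    threshold    : gray value separating salient from non-salient pixels
    """
    salient = saliency > threshold                      # step 1: thresholding
    fy = pseudo_color.shape[0] // saliency.shape[0]
    fx = pseudo_color.shape[1] // saliency.shape[1]
    # step 2: every saliency pixel covers an fy x fx block of the pseudo color map
    mask = np.repeat(np.repeat(salient, fy, axis=0), fx, axis=1)
    foreground = pseudo_color[mask]                     # (n_fg, 3) colour samples
    background = pseudo_color[~mask]                    # (n_bg, 3) colour samples
    return mask, foreground, background
```

The returned sample sets play the role of the foreground and background data that initialize the segmentation stage.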

B. IMAGE SEGMENTATION PROCESSING
The GrabCut image segmentation algorithm cannot produce accurate edges when segmenting low-contrast images, so it cannot segment PET images well when they lack significant grayscale differences. In this paper, this problem is addressed by optimizing the energy function and adding a foreground form factor. A target image I is the union of three regions, F, U, and B, which correspond to the foreground region, the unknown region, and the background region, respectively. The foreground, unknown, and background regions obtained from user input or from the saliency model are denoted F', U', and B'.
The original GrabCut algorithm operates on a rectangular area that contains F ∪ U, whereas the improved GrabCut algorithm only needs to process the unknown area U', which is significantly smaller. The improved algorithm therefore has an advantage in both the amount of computation and the running time.
Here, the two terms on the right-hand side of the formula are the centroid coordinates of all foreground pixels in the current segmentation result and the centroid coordinates of the user-selected region (in this experiment, the salient region of the saliency map). With the energy function optimized in this way, the centroid of the segmented foreground object is drawn toward the centroid of the user-selected region during the iterations of the algorithm, which yields more accurate edges. Image segmentation with the improved GrabCut algorithm consists of three stages: initialization, building a Gaussian mixture model, and energy minimization. The specific process is as follows:
1. Image initialization. The region division of the pseudo color map is obtained from the result of the Itti visual saliency model, with foreground area F and background area B; the foreground area F', the background area B', and the unknown area U' are initialized. Foreground pixels are assigned α = 1 and background pixels α = 0;
2. Initialize the corresponding GMM according to the foreground area and the background area;
3. Calculate the GMM parameter corresponding to each pixel n of the unknown area U';
4. Use the minimum energy min E(α, k, θ, z) to obtain the current segmentation;
5. Repeat from the third step until convergence;
6. Adjust the edge and update the foreground area F';
7. Smooth the edges and use the border matting iteration of the GrabCut algorithm to find α.
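The centroid quantities used by the modified energy function can be illustrated with a short NumPy sketch. The exact form of the added term is not given in the text, so the Euclidean distance between the two centroids is an assumption, and the function names are hypothetical.

```python
import numpy as np


def centroid(mask):
    """Centre of gravity (row, col) of the nonzero pixels of a binary mask.

    The mask must contain at least one nonzero pixel.
    """
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])


def centroid_distance(foreground_mask, salient_mask):
    """Distance between the current foreground centroid and the centroid of the
    salient region (assumed Euclidean; the source formula is not available)."""
    diff = centroid(foreground_mask) - centroid(salient_mask)
    return float(np.linalg.norm(diff))
```

A penalty that grows with this distance would pull the segmented foreground toward the salient region during the iterations, which matches the behaviour described above.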

IV. HEART SEGMENTATION ALGORITHM BASED ON CONVOLUTIONAL NEURAL NETWORK
Accurately segmenting the target area of interest from an image by computer has always been a research problem in medical image processing. Because medical images are inevitably affected by environmental noise, partial volume effects, tissue motion, and similar factors, they show blurred edges, uneven gray-level distribution, and low contrast compared with optical images taken in natural environments. CT images are medical images, and their segmentation is subject to the same interference. To segment the heart image automatically while reducing the difficulty and improving the accuracy of segmentation, this paper decomposes the segmentation algorithm into two parts: positioning and segmentation. A convolutional neural network is used to locate the heart, and a stacked noise reduction self-encoding neural network is used to segment the heart image after positioning.
To make the algorithm fully automatic, the position of the heart in the image must first be found, so that the interference of non-target areas can be excluded based on the position information before the heart is segmented. In [28], [29], the generalized Hough transform is used to locate the heart, but it is computationally expensive and not suitable for processing CT image sequences. There are also statistical detection methods based on prior knowledge such as heart shape and image gray-level characteristics; a widely used one is a cascade method based on simple feature extraction. Conventional localization algorithms are difficult to apply to cardiac images because heart structure differs between individuals and between cardiac phases, and because the shape of the heart changes as it contracts and relaxes. Owing to their special structure, convolutional neural networks have great advantages in two-dimensional image recognition and video tracking, and are well suited to locating the heart in CT images. In this paper, the existing convolutional network is improved by dividing its training into two parts, pre-training and fine-tuning: a noise reduction self-encoding network initializes the convolution kernel parameters in place of the traditional random initialization, which makes the convolution kernels more robust to image features.
Cardiac CT image sequences typically contain the tissue surrounding the heart as well as the chest. To reduce the influence of surrounding tissue on segmentation and improve accuracy, the first step of the algorithm is to locate the heart and compute a region of interest that contains it, so that interference from ribs, vertebrae, and the like is removed. Figure 3 is a schematic diagram of the structure used to locate the heart with a convolutional neural network. The network consists of a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer extracts cardiac image features, the pooling layer subsamples the extracted features, and the fully connected layer acts as a classifier on the image features to classify pixels in the heart image. To reduce complexity and the amount of computation, the algorithm downsamples the original 512 × 512 image to 64 × 64 and uses it as the input of the convolutional neural network; the output binary image is set to 32 × 32 to reduce the dependence of the positioning effect on the number of images in the training set. The convolutional feature maps are then subsampled by mean pooling: each pooled feature value is the mean of a non-overlapping 6 × 6 region of the convolutional feature map,

$$P_l(i_1, j_1) = \frac{1}{36} \sum_{i=6(i_1-1)+1}^{6 i_1} \; \sum_{j=6(j_1-1)+1}^{6 j_1} C_l(i, j)$$

where $C_l$ denotes the $l$-th convolutional feature map, $1 \le i_1, j_1 \le 9$ are the coordinates of the pooled feature, and $l$ is the index of the pooled feature map $P_l \in \mathbb{R}^{9 \times 9}$. The pooled features are expanded column by column into a feature vector p, which is fully connected to the logistic regression layer. The logistic regression layer has a total of 1024 output units. Finally, the output of the logistic regression layer is reassembled into a matrix of size 32 × 32, which is a binarized mask image containing the position information of the heart.
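The pooling and output stages of the localization network can be sketched in NumPy as follows. The sketch assumes 100 convolutional feature maps of size 54 × 54 (the size implied by 9 × 9 pooled maps with 6 × 6 regions) and a sigmoid output unit for the logistic regression layer; the function names are hypothetical, and the weights `W1` and `b1` are placeholders rather than trained values.

```python
import numpy as np


def mean_pool(feature_map, size=6):
    """Mean over non-overlapping size x size regions (54x54 -> 9x9 for size=6)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // size, size, w // size, size).mean(axis=(1, 3))


def localization_head(conv_maps, W1, b1):
    """Pool each feature map, flatten, and apply the logistic regression layer.

    conv_maps : array of shape (100, 54, 54)
    W1, b1    : weights (1024, 8100) and bias (1024,) of the output layer
    Returns a 32 x 32 map of probabilities; thresholding it (for example at 0.5,
    an assumed value) gives the binary mask.
    """
    pooled = np.stack([mean_pool(c) for c in conv_maps])  # (100, 9, 9)
    p = pooled.reshape(-1)                                # 8100 pooled features
    out = 1.0 / (1.0 + np.exp(-(W1 @ p + b1)))            # 1024 sigmoid outputs
    return out.reshape(32, 32)
```

The exact ordering used when flattening the pooled features does not matter for this illustration, as long as the same ordering is used during training and inference.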
Since the original input CT image is 512 × 512, the obtained mask image is upsampled to that size and its center is computed. An ROI image of size 400 × 400 is then cropped from the original image based on the mask image; this ROI image is the re-cropped heart image and is used in the subsequent segmentation process. The network must be trained before it is used to locate the heart. Training is the process of obtaining the optimal parameters of the convolutional neural network. Because the structure of the convolutional network is relatively simple and the number of images in the training set is limited, the algorithm uses a noise reduction self-encoding network to pre-train the convolution kernels in order to improve the accuracy of cardiac positioning.
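The ROI extraction can be sketched as follows. Instead of explicitly upsampling the 32 × 32 mask, the sketch scales the mask centroid to original-image coordinates, which gives the same center. Clamping the crop window to the image border is an assumption; the text does not say how hearts near the edge are handled. The function name `crop_roi` is hypothetical.

```python
import numpy as np


def crop_roi(image, mask32, roi_size=400):
    """Crop a roi_size x roi_size window centred on the mask centroid.

    image  : original CT slice, e.g. shape (512, 512)
    mask32 : 32 x 32 localization output with foreground values > 0.5
    """
    scale_y = image.shape[0] // mask32.shape[0]   # 512 // 32 = 16
    scale_x = image.shape[1] // mask32.shape[1]
    rows, cols = np.nonzero(mask32 > 0.5)
    # +0.5 moves from pixel index to pixel centre before scaling up
    cy = int(round((rows.mean() + 0.5) * scale_y))
    cx = int(round((cols.mean() + 0.5) * scale_x))
    half = roi_size // 2
    # keep the window inside the image (assumed border handling)
    top = min(max(cy - half, 0), image.shape[0] - roi_size)
    left = min(max(cx - half, 0), image.shape[1] - roi_size)
    return image[top:top + roi_size, left:left + roi_size]
```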

A. CONVOLUTIONAL NEURAL NETWORK TRAINING
Convolutional neural network training involves obtaining the optimal values of the convolution kernel parameters $F_l$, $l = 1, \dots, 100$, and of the other parameters $b_0$, $W_1$, $b_1$. When the training set is large enough, the network can reach suitable parameter values through normal training: the parameters are simply initialized randomly and then trained. Because the number of data sets in the experiment is limited, the filters are instead initialized with the sdA network, that is, by pre-training, in order to obtain more robust features and avoid positioning errors. Figure 4 shows 204 of the pre-training set images. Each tile is expanded column by column into a vector and then used as the input of the sdA.
The pre-trained convolutional network performs a feedforward pass up to the output layer, and the output layer parameters are then pre-trained by minimizing a loss function. In this loss function, $I_i^{rot} \in \mathbb{R}^{1024}$ is the label data corresponding to the i-th input image, and $N_2$ is the number of training set images. The label data of the output layer is a binary mask image generated by manual marking, as shown in Figure 5.
As shown in Figure 5, the binary mask image has a black background and a white foreground; the white foreground corresponds to the ROI region in the image, and the foreground center corresponds to the center of the heart contour. The binary mask is sampled to a 32 × 32 image, expanded column by column into a vector, and used as the label data for training the convolutional network. The final step in convolutional network training is to fine-tune the convolutional network by minimizing the loss function.

B. HEART SEGMENTATION BASED ON STACKED NOISE REDUCTION SELF-CODING NETWORK
The noise reduction self-encoder improves on the self-encoder: by randomly destroying the input data with a certain probability, it improves the reconstruction capability of the encoder and reduces the reconstruction error [30]. The stacked noise reduction self-encoding network is built by stacking multiple noise reduction self-encoders and is an improved version of the stacked self-encoding network. It recognizes hidden features in 2D images well, especially spatial features, and is commonly used for image classification in practical applications. Because of these advantages, the stacked noise reduction self-encoding network is applied to the segmentation of the heart image. To meet the requirements of this task, the structure is modified and the final classification layer is changed to a logistic regression layer, which gives the network the ability to classify individual pixels. The network is further improved by combining sparsity and denoising, which makes it more robust to cardiac image features. Finally, the trained stacked noise reduction self-encoding network classifies the pixels of the re-cropped cardiac CT image, and segmentation is performed based on the classification result. Training of the noise reduction self-encoding network includes two parts: pre-training and fine-tuning. Because the data available in the segmentation experiment is limited, the network parameters are trained with the greedy layer-by-layer method, which effectively reduces over-fitting. In pre-training, the layer-by-layer training of the hidden-layer parameters does not use the label data, while the training of the final-layer parameters does.
First, the input layer and the hidden layer are separated from the stacked noise reduction self-encoding network, and an output layer of the same size as the input layer is added to form an sdA network. The sdA network is trained in an unsupervised mode. In addition to the noise reduction process, sparsity constraints are added so that the average output value of the hidden layer of the network is close to zero. The training process yields the parameter matrix Wa, and the optimal values of the parameters include λ = 3 × 10⁻¹⁰. The input and output training data of the separated sdA network are images sampled from a 400 × 400 full-size image whose center is the center of the heart; the sampled image size is 80 × 80. After training is completed, the output layer of the sdA network is discarded, and the output of the hidden layer units is used as the input of the next hidden layer (Hz). The calculation formula for the loss function used to train the last layer of the network is:
where I_lv^i ∈ R^6400 is the marker set data corresponding to the i-th image in the training set. The marker set data is a binary mask image generated by manual segmentation, and Figure 5 shows the input image and the corresponding binary mask image. Note that the marker set image is expanded into a vector column by column during training. At this point, the pre-training of the stacked noise reduction self-encoding network ends. SDA (Stacked Denoising Autoencoders) training is "layer-by-layer": on the basis of the CNN, each layer takes the output of the previous layer as both its input and its reconstruction target and, as the middle layer, forms a codec-type 3-layer neural network that is trained individually.
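The column-wise expansion of the marker set image mentioned above can be illustrated with a minimal sketch. The function names and the toy mask are hypothetical and only show how an 80 × 80 binary mask becomes a vector in R^6400 under column-major ordering; they are not taken from the original implementation.

```python
from typing import List

Image = List[List[int]]  # row-major 2D image: image[row][col]


def flatten_by_column(image: Image) -> List[int]:
    """Expand a 2D image into a vector column by column (column-major order)."""
    n_rows = len(image)
    n_cols = len(image[0]) if n_rows else 0
    return [image[r][c] for c in range(n_cols) for r in range(n_rows)]


def unflatten_by_column(vector: List[int], n_rows: int, n_cols: int) -> Image:
    """Inverse of flatten_by_column: rebuild the row-major image."""
    if len(vector) != n_rows * n_cols:
        raise ValueError("vector length does not match the requested shape")
    return [[vector[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]


# Toy 80 x 80 binary mask with a white square in the middle (illustrative data only).
mask = [[1 if 30 <= r < 50 and 30 <= c < 50 else 0 for c in range(80)]
        for r in range(80)]
label_vector = flatten_by_column(mask)
print(len(label_vector))  # 6400 entries, matching R^6400
```

The inverse function is included because the network output has to be reshaped back into an image before it can be used as a mask.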
The second step of network training is fine-tuning the parameters. After the initial value of the parameter matrix is obtained by layer-by-layer training, the entire network parameters are fine-tuned by minimizing the loss function shown in equation (9).
The fine-tuning process is implemented with a supervised backpropagation algorithm and batch gradient descent, similar to the training of the convolutional neural network for automatic positioning. The training process is run only once, after which the trained stacked noise reduction network can be applied to CT images. The output mask image of the stacked noise reduction self-encoding network has a size of 80 × 80. After upsampling to 400 × 400, the binary mask image has obvious jagged edges, so it is not suitable for direct use in segmenting the heart. Therefore, the mask image is smoothed with a two-dimensional Gaussian filter before it is used for heart segmentation.
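The post-processing of the network output can be sketched as follows. This is a minimal illustration under stated assumptions: nearest-neighbour upsampling by a factor of 5 (80 to 400), a separable Gaussian filter with a hypothetical sigma of 2.0, and re-binarization at a threshold of 0.5. The text specifies only that a two-dimensional Gaussian filter is used; the other choices and all function names are assumptions.

```python
import math
from typing import List

Image = List[List[float]]


def upsample_nearest(mask: Image, factor: int) -> Image:
    """Enlarge a mask by an integer factor using nearest-neighbour replication."""
    return [[row[c // factor] for c in range(len(row) * factor)]
            for row in mask for _ in range(factor)]


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image: Image, kernel: List[float]) -> Image:
    """Filter every row with the kernel, replicating edge pixels at the border."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(k * row[min(max(c + i - radius, 0), n - 1)]
                for i, k in enumerate(kernel))
            for c in range(n)
        ])
    return out


def transpose(image: Image) -> Image:
    return [list(col) for col in zip(*image)]


def gaussian_smooth(image: Image, sigma: float) -> Image:
    """Separable 2D Gaussian filter: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    rows_done = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(rows_done), kernel))


def smooth_mask(mask: Image, factor: int = 5, sigma: float = 2.0,
                threshold: float = 0.5) -> List[List[int]]:
    """Upsample, smooth, and re-binarize a network output mask (assumed pipeline)."""
    smoothed = gaussian_smooth(upsample_nearest(mask, factor), sigma)
    return [[1 if v >= threshold else 0 for v in row] for row in smoothed]
```

Calling `smooth_mask` on an 80 × 80 mask returns a 400 × 400 binary mask whose staircase corners are rounded by the filter; a larger sigma gives smoother edges at the cost of fine detail.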

A. CARDIAC IMAGE SALIENCY SEGMENTATION VERIFICATION
The cardiac PET image segmentation algorithm based on the saliency technique proposed in this paper is a fully automatic method that requires no user interaction. The system consists of two parts: saliency map generation and image segmentation. Saliency map generation uses the improved Itti visual saliency model: it takes the original cardiac PET image as input, generates the corresponding pseudo-color map, and then extracts the feature maps and merges them to obtain the saliency map of the image to be segmented. The saliency map is then processed to obtain the data needed for the image segmentation phase. This phase was developed using MATLAB 8.4, and the images were analyzed with the MATLAB development tool SaliencyToolbox 2.3 [31]-[34].
Comparative analysis of image segmentation algorithms requires reasonable evaluation indicators. In this paper, the overall error rate (OER) and the Kappa coefficient are used to compare the pixel point set and the boundary of each segmented image against the standard segmentation. The standard segmentation result is obtained by manual segmentation, and its pixel point set and boundary serve as the comparison standard for the evaluation indicators. Accuracy is one of the most important measures of a segmentation algorithm, and the indicators selected here cover both pixels and boundaries. The OER is a pixel-based measure of segmentation accuracy. Given the pixel point set of the real result and the pixel point set of the segmented image, it is the number of wrongly included pixels plus the number of missed correct pixels, taken as a proportion of the total segmentation pixel point set. The higher this proportion, the higher the error rate of the segmentation algorithm, and vice versa. The OER therefore effectively measures both over-segmentation and under-segmentation. Assume that pret is the set of pixel points of the segmented image obtained by an algorithm, pq is the set of target pixel points not included in the segmented image, and pu is the set of pixel points incorrectly included in the segmented image. The calculation formula of the overall error rate OER is as follows:
To obtain a more accurate reference segmentation curve for the salient region of the cardiac PET image, a clear PET/CT image is acquired and segmented manually.
The obtained heart region boundary segmentation results are shown in Figure 7.
The red curve in Figure 7 is the reference standard segmentation result for this experiment. The image resolution of PET/CT is consistent with that of cardiac PET, so the segmentation result can be transferred directly to the cardiac PET images. In the subsequent data analysis, the OER and the Kappa coefficient are computed against this standard segmentation.
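A minimal sketch of how the two indicators might be computed from binary masks is given below. The denominator of the OER is taken here to be the pixel point set pret of the segmented image, which is one reading of the description above and should be treated as an assumption. The Kappa coefficient is assumed to be Cohen's kappa between the result and the reference, since the variant used is not specified. All function names are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = target, 0 = background


def confusion_counts(result: Mask, reference: Mask) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) pixel counts of a result against the reference."""
    tp = fp = fn = tn = 0
    for res_row, ref_row in zip(result, reference):
        for res, ref in zip(res_row, ref_row):
            if res and ref:
                tp += 1
            elif res:
                fp += 1
            elif ref:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def overall_error_rate(result: Mask, reference: Mask) -> float:
    """OER = (|pq| + |pu|) / |pret| (assumed form of the formula)."""
    tp, fp, fn, _ = confusion_counts(result, reference)
    p_ret = tp + fp  # pixels in the segmented image
    p_u = fp         # pixels incorrectly included in the segmented image
    p_q = fn         # target pixels missing from the segmented image
    if p_ret == 0:
        raise ValueError("the segmentation result contains no foreground pixels")
    return (p_q + p_u) / p_ret


def kappa_coefficient(result: Mask, reference: Mask) -> float:
    """Cohen's kappa between result and reference (assumed Kappa variant)."""
    tp, fp, fn, tn = confusion_counts(result, reference)
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        # Degenerate case: both masks are constant and identical.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A lower OER and a Kappa value closer to 1 both indicate closer agreement with the manual reference.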
As shown in Table 1 and the above figure, in terms of running time, the Snake algorithm requires the contour of the segmentation region to be initialized manually before segmentation and then iterates repeatedly until the contour converges; it is therefore the slowest, with an average segmentation time of 81.24 seconds. The CA-GrabCut algorithm and the algorithm in this paper both pre-process the image with a visual saliency model before segmentation. Their average segmentation times are similar, 60.12 seconds and 56.76 seconds respectively; the algorithm in this paper is slightly faster because it uses the optimized GrabCut algorithm. The GrabCut algorithm has the shortest segmentation time, 49.88 seconds, but the worst segmentation effect. The experimental tests and the comparison with other image segmentation methods indicate that the proposed algorithm can effectively improve the segmentation accuracy of lung cancer PET images while maintaining segmentation speed. The fusion algorithm used in this paper replaces the manual operation steps required of users by traditional algorithms, improves the efficiency of image processing, and effectively shortens the segmentation time. It is of practical significance for the processing and analysis of batches of PET images.

B. CONVOLUTIONAL NEURAL NETWORK TRAINING VERIFICATION
The CT images used for the localization experiments came from a hospital and comprise 9 groups, each corresponding to one patient's cardiac CT scan. The CT image parameters are: slice thickness 8 mm, image size 512 × 512. To simplify computation and make the images display more clearly, the DICOM images are converted to bmp format with a fixed window level and window width of 300 and 1450 respectively, and 2,500 cardiac CT images are finally assembled as the training set data. The experiment was developed using MATLAB 8.4 and implemented with the open source architecture. The positioning performance and accuracy of the convolutional network depend on the number of training set images. Therefore, the existing training set data is linearly interpolated to increase it from 2,500 to 5,000 images; that is, the training samples and the marker samples each contain 5,000 images.
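The fixed display window applied during the DICOM-to-bmp conversion can be sketched as a linear mapping from raw CT intensities to 8-bit gray values. The sketch assumes that the two stated values are a window level of 300 and a window width of 1450, which is an interpretation of the text; the function name and the sample intensities are hypothetical.

```python
from typing import List


def apply_window(pixels: List[List[float]], level: float, width: float) -> List[List[int]]:
    """Map raw CT intensities to 8-bit gray values with a fixed display window."""
    if width <= 0:
        raise ValueError("window width must be positive")
    low = level - width / 2.0  # intensities at or below this value map to 0
    out = []
    for row in pixels:
        out_row = []
        for value in row:
            scaled = (value - low) / width        # 0.0 at the lower edge, 1.0 at the upper edge
            clipped = min(max(scaled, 0.0), 1.0)  # saturate values outside the window
            out_row.append(int(round(clipped * 255)))
        out.append(out_row)
    return out


# Illustrative intensities only; level and width follow the assumed reading of the text.
gray = apply_window([[-1000.0, -425.0, 300.0, 1025.0, 3000.0]], level=300.0, width=1450.0)
```

Intensities below the window are mapped to black and those above it to white, so the available gray levels are spent on the range that contains the tissue of interest.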
The training process is divided into two parts: initial training of the convolution kernel parameters and training of the convolutional network. In the initial training of the convolution kernels, the noise reduction self-encoding network is trained to reconstruct blocks cut automatically from the CT images. This step requires no marker set data; it uses a total of 20,000 image blocks of size 11 × 11. The training set used in convolutional network training consists of training sample images and labeled sample images. Each training sample image is a CT image containing the heart, with a size of 64 × 64 after downsampling. Each labeled sample image is a 32 × 32 binary image marking the center of the heart in the CT image. Pixels with a gray value of 0 correspond to the parts of the original CT image that do not contain cardiac tissue, and pixels with a gray value of 1 correspond to cardiac tissue. The labeled sample image is obtained by downsampling the binarized image produced by manually labeling the center of the heart. Figure 9 shows a training sample image and the corresponding labeled sample image. The test set is a complete sequence of cardiac CT images, 275 images in total, used for functional verification of the trained convolutional network.
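The automatic cutting of 11 × 11 blocks for kernel pre-training can be sketched as below. The sampling scheme is not stated in the text, so uniformly random block positions are an assumption, as are the function name and the number of blocks per image.

```python
import random
from typing import List, Optional

Image = List[List[float]]


def extract_patches(image: Image, patch_size: int = 11, count: int = 8,
                    rng: Optional[random.Random] = None) -> List[Image]:
    """Cut `count` square blocks from random positions of one image (assumed scheme)."""
    rng = rng or random.Random()
    n_rows, n_cols = len(image), len(image[0])
    if patch_size > n_rows or patch_size > n_cols:
        raise ValueError("patch is larger than the image")
    patches = []
    for _ in range(count):
        top = rng.randint(0, n_rows - patch_size)    # inclusive bounds
        left = rng.randint(0, n_cols - patch_size)
        patches.append([row[left:left + patch_size]
                        for row in image[top:top + patch_size]])
    return patches
```

No labels are produced at this stage, which matches the unsupervised nature of the kernel pre-training: each block serves as both the input and the reconstruction target.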
The network uses the noise reduction self-encoding network to initialize the convolution kernel parameters. Figure 10 shows the relative position error of the positioning results of two networks: one whose convolution kernels are initialized by the noise reduction self-encoding network and one whose kernels are not. The network with pre-trained convolution kernels has better positioning accuracy and a more stable positioning error; in particular, the positioning error at the bottom of the image sequence is greatly reduced. Figure 11 is a three-dimensional line graph of the manual positioning and convolutional neural network positioning results for the same cardiac sequence, where the z-axis represents the different slice images of the sequence and the x-axis and y-axis represent the position of the positioning center. The blue polyline is the manual positioning result, and the red polyline is the convolutional neural network positioning result. The results of the two positioning methods are basically the same from the apex to the bottom of the heart and follow the same overall trend. The trained convolutional neural network accurately locates the cardiac target and successfully tracks the trajectory of the heart center through the CT image sequence.
The comparison results in Figures 10 and 11 show that the trained cardiac positioning network has a good positioning effect. For the same series of slice images, it accurately locates the position of the heart in the image, and its positioning center is very close to the manual positioning center, so it can completely replace manual positioning. This lays a foundation for the segmentation of the image in the next step and greatly reduces the difficulty of that step.
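How a positioning result might be turned into a cropped region for the segmentation step can be sketched as follows. The sketch assumes that the positioning center is the centroid of the foreground of the 32 × 32 network output, that coordinates are scaled linearly to the 512 × 512 image, and that a 400 × 400 window is cut around the center and shifted where necessary to stay inside the image. The function names and these details are assumptions rather than the documented procedure.

```python
from typing import List, Sequence, Tuple


def mask_centroid(mask: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return the (row, col) centroid of the foreground pixels of a binary mask."""
    total = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                total += 1
                row_sum += r
                col_sum += c
    if total == 0:
        raise ValueError("mask has no foreground pixels")
    return row_sum / total, col_sum / total


def to_full_resolution(center: Tuple[float, float], mask_size: int = 32,
                       image_size: int = 512) -> Tuple[int, int]:
    """Scale a center from mask coordinates to full-image pixel coordinates."""
    scale = image_size / mask_size
    row, col = center
    return int((row + 0.5) * scale), int((col + 0.5) * scale)


def crop_around(image: Sequence[Sequence[float]], center: Tuple[int, int],
                size: int) -> List[List[float]]:
    """Crop a size x size window around center, shifted to stay inside the image."""
    n_rows, n_cols = len(image), len(image[0])
    if size > n_rows or size > n_cols:
        raise ValueError("crop is larger than the image")
    top = min(max(center[0] - size // 2, 0), n_rows - size)
    left = min(max(center[1] - size // 2, 0), n_cols - size)
    return [list(row[left:left + size]) for row in image[top:top + size]]


# Toy example: a 32 x 32 network output whose foreground is a 4 x 4 square.
toy_mask = [[1 if 10 <= r < 14 and 20 <= c < 24 else 0 for c in range(32)]
            for r in range(32)]
center = to_full_resolution(mask_centroid(toy_mask))
```

Shifting the window instead of padding keeps every cropped image at the same size, which a fixed-size network input requires.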

C. TRAINING VERIFICATION OF STACK NOISE REDUCTION SELF-CODING NETWORK
The stacked noise reduction network used in this paper is an improvement on the traditional stacked self-encoding network. It combines sparsity and noise reduction processing, which distinguishes it from existing improved methods: noise reduction is performed in addition to the sparsity processing added to the self-encoding network. Figure 12 compares the cardiac segmentation results of a conventional stacked self-encoding network and the improved stacked noise reduction self-encoding network. The evaluation criteria are accuracy, recall, and F-value. The comparison shows that the improved network obtains higher evaluation values and is better suited to the automatic segmentation of heart image sequences in which image edge features must be preserved.
As shown in Figure 12, the noise reduction process randomly destroys the image features extracted by the network. Networks trained with different destruction ratios differ in segmentation performance, and an appropriate amount of destruction can effectively improve it. The segmentation evaluation results for the different destruction ratios show that the network performs best when the destruction ratio is 0.3.
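The random destruction step can be illustrated with a minimal sketch. It assumes masking noise, in which each destroyed entry is set to zero; the text does not state how destroyed entries are replaced, and the function name is hypothetical.

```python
import random
from typing import List, Optional


def corrupt(features: List[float], destruction_ratio: float,
            rng: Optional[random.Random] = None) -> List[float]:
    """Masking noise: set each entry to zero with probability destruction_ratio."""
    if not 0.0 <= destruction_ratio <= 1.0:
        raise ValueError("destruction_ratio must lie in [0, 1]")
    rng = rng or random.Random()
    return [0.0 if rng.random() < destruction_ratio else value
            for value in features]


# With the ratio reported as optimal, roughly 30% of the entries are destroyed.
noisy = corrupt([1.0] * 10000, destruction_ratio=0.3, rng=random.Random(0))
destroyed_fraction = noisy.count(0.0) / len(noisy)
```

During training the self-encoder receives the corrupted vector as input but is asked to reconstruct the clean one, which is what forces it to learn features that are robust to missing values.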
As shown in Figure 13, existing heart segmentation algorithms such as RCNN-based segmentation, deformable model-based segmentation, and multi-atlas-based methods achieve good results in the left atrium segmentation task. However, their results in the whole heart segmentation task are not ideal, mainly because the whole heart is not distinct in the CT image and its contrast with the surrounding tissue is low, resulting in a smooth outline of the segmentation result. The SdA-based segmentation algorithm proposed in this paper has a good segmentation effect, a high recognition rate for heart tissue in the image, and a higher recall rate. It successfully suppresses ribs, spine, lung parenchyma, and other non-cardiac tissue while obtaining smooth edges. The segmentation results have smooth contours and complete cardiac information, reflecting the continuous smoothness of human tissue surfaces.
At the same time, the segmentation result of the stacked noise reduction self-encoding network is compared with the visual saliency-based segmentation method (SF) also proposed in this paper. To reflect the performance of the segmentation algorithms more fully, manual segmentation is taken as the benchmark, the number of correctly segmented pixels in each segmentation result is counted, and the evaluation criteria are the true positive, false positive, true negative, and accuracy of the segmentation results. Table 1 shows the segmentation results for different physiological segments of the same cardiac CT image sequence using the two segmentation algorithms, SF and neural network. Comparing the true positive and accuracy results, the SF-based method can segment the heart tissue in the CT image, but it has higher false negatives and false positives; a considerable part of the heart tissue is judged to be non-cardiac tissue, which destroys the edge features of the heart image to a certain extent, does not guarantee the integrity of the heart tissue, and affects the visual reconstruction based on the cardiac segmentation result. The SdA-based segmentation has a good segmentation effect and a high recognition rate for heart tissue in the image; it achieves low false positive and false negative values together with high segmentation accuracy, successfully suppresses the effects of non-cardiac tissue such as ribs, spine, and lung parenchyma, and is almost identical to the standard segmentation. In addition, to reflect running performance, the running times of the two segmentation methods proposed in the paper are compared and analyzed. The comparison results are shown in Table 3.
As shown in Table 3, the SdA-based segmentation algorithm has a relatively stable running time within the same cardiac sequence, with an average of 0.12 seconds, which is much lower than the average running time of the SF-based segmentation algorithm. Based on the above analysis, the SdA-based segmentation algorithm obtains more accurate segmentation results with a shorter running time, which makes it more suitable for automatic segmentation of CT images.
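The pixel-level criteria used in the comparisons above (true positive, false positive, true negative, false negative, accuracy, recall, and F-value) can be computed from a segmentation result and the manual benchmark as sketched below. The class and function names are hypothetical, and the F-value is assumed to be the harmonic mean of precision and recall.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PixelScores:
    true_positive: int
    false_positive: int
    true_negative: int
    false_negative: int

    @property
    def accuracy(self) -> float:
        total = (self.true_positive + self.false_positive
                 + self.true_negative + self.false_negative)
        return (self.true_positive + self.true_negative) / total if total else 0.0

    @property
    def precision(self) -> float:
        predicted = self.true_positive + self.false_positive
        return self.true_positive / predicted if predicted else 0.0

    @property
    def recall(self) -> float:
        actual = self.true_positive + self.false_negative
        return self.true_positive / actual if actual else 0.0

    @property
    def f_value(self) -> float:
        # Harmonic mean of precision and recall (assumed definition of the F-value).
        denom = self.precision + self.recall
        return 2 * self.precision * self.recall / denom if denom else 0.0


def score_segmentation(result: Sequence[Sequence[int]],
                       benchmark: Sequence[Sequence[int]]) -> PixelScores:
    """Count pixel-level agreement between a binary result and the benchmark."""
    tp = fp = tn = fn = 0
    for res_row, ref_row in zip(result, benchmark):
        for res, ref in zip(res_row, ref_row):
            if res and ref:
                tp += 1
            elif res:
                fp += 1
            elif ref:
                fn += 1
            else:
                tn += 1
    return PixelScores(tp, fp, tn, fn)
```

Because the background dominates a cardiac CT slice, accuracy alone can look high even for a poor segmentation, which is why recall and the F-value are reported alongside it.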

VI. CONCLUSION
In this paper, based on the imaging characteristics of PET images and an analysis of the deficiencies of the Itti visual saliency model and the GrabCut algorithm, the PET image of lung cancer is taken as the research object, the Itti visual saliency model is combined with the GrabCut algorithm, and an image segmentation algorithm based on the visual saliency model is proposed. The algorithm first obtains the salient feature map of the PET pseudo-color map with the improved Itti visual saliency model. The result, after transition processing, is used as the input of the improved GrabCut algorithm, which produces the final PET image segmentation result. The algorithm automates PET image segmentation, improves segmentation efficiency, achieves a better segmentation effect, and provides effective data support for medical image analysis.
The segmentation algorithm based on neural networks decomposes the segmentation task into two parts, localization and segmentation, which are implemented by a convolutional neural network and a stacked noise reduction self-encoding network respectively. Given the superior performance of convolutional neural networks in image classification and target recognition, a convolutional neural network is constructed to locate the heart in the image. The original cardiac CT image is cropped according to the positioning result, and some non-target regions are removed. Exploiting the strong sensitivity of the self-encoding network to image edge features, a stacked noise reduction self-encoding network is constructed. The network is trained with manually segmented images to classify and identify the pixels belonging to heart tissue in the cropped CT image, and the target cardiac image is finally segmented based on the classification result. The heart image segmentation experiments show that the neural network-based segmentation algorithm achieves a better segmentation effect, accurately obtains the peripheral contour of the heart with a smoother contour edge, and effectively suppresses other tissues and organs that have low contrast with the heart. The stacked noise reduction self-encoding network is only one of many deep learning algorithms, and other deep learning algorithms have unique advantages in their respective fields; the next step is to introduce other deep learning algorithms for segmentation of the heart image. In addition, the application of 4D images is becoming the mainstream trend in the medical image field. The basic data of a 4D image is three-dimensional volume data, which contains a large amount of information. The existing segmentation algorithm can be extended to three-dimensional space to achieve direct segmentation of volume data.