Deep Learning in Medical Ultrasound Image Analysis: A Review

Ultrasound (US) is one of the most widely used imaging modalities in medical diagnosis. It offers real-time, low-cost, noninvasive, and easy-to-operate imaging. However, it also suffers from strong artifacts and noise and depends heavily on the experience of the operator. To overcome these shortcomings of ultrasound diagnosis and help doctors improve diagnostic accuracy and efficiency, many computer-aided diagnosis (CAD) systems have been developed. In recent years, deep learning has achieved great success in computer vision, its potential for medical US image analysis is increasingly being exploited, and more and more researchers are applying it to CAD systems. In this paper, we first introduce the deep learning models commonly used in medical US image analysis; second, we review data preprocessing methods for medical US images, including data augmentation, denoising, and enhancement; finally, we analyze the applications of deep learning in medical US imaging tasks (such as image classification, object detection, and image reconstruction).


I. INTRODUCTION
Ultrasound (US) is an important part of medical imaging and one of the most commonly used medical diagnostic techniques. It plays an important role in the qualitative and quantitative diagnosis of diseases and in clinical examinations. 2D US imaging can be used to observe the morphology and anatomical structures of tissues and organs and to measure blood flow and muscle contraction speed; 3D US can visualize tissues and organs with complex 3D morphology, such as the heart and fetus (Fig. 1). Compared with other medical imaging technologies, US imaging has the advantages of low cost, convenience, no ionizing radiation, high sensitivity, and real-time imaging. However, compared with X-ray, CT, and MRI, US imaging faces some unique problems: more artifacts and noise, low contrast between tissues causing boundary ambiguities [1] (as shown in Fig. 2), highly subjective diagnosis, and strong dependence on the experience of doctors. To overcome these problems, it is particularly important to introduce computer-aided diagnosis (CAD) systems. CAD can supplement the personal experience and knowledge of doctors, thereby improving the accuracy of US diagnosis.
Deep learning, as a branch of machine learning, is capable of ''representation learning'' or ''feature learning''. After initial ''low-level'' feature representations are gradually transformed into ''high-level'' feature representations through a multi-layer neural network, classification and other complex tasks can be completed with a ''simple model'', reducing the dependence on physicians [5]. In recent years, deep learning has made breakthroughs in computer vision, speech recognition, natural language processing, and bioinformatics [6], and was hailed as one of the top ten technological breakthroughs of 2013 [7]. Currently, CAD based on deep learning has been applied to various tasks such as disease classification [8], [9], target detection [10], region-of-interest (ROI) segmentation [11], and image reconstruction [12] in the analysis of medical US images and other imaging modalities. Compared with traditional CAD, which requires manual feature extraction, deep-learning-based CAD can automatically extract low-level and high-level features from US images thanks to its deep nonlinear structure, overcoming traditional CAD's limited feature expression capability. At present, deep learning has achieved the best performance in many tasks, as shown in Table 1. This paper presents a review of deep learning applications in medical US image analysis, covering deep learning models, data preprocessing methods, and deep learning tasks. Fig. 2 shows some applications of deep learning in medical ultrasound image analysis.
The rest of the paper is organized as follows. Section II introduces four deep learning models, the convolutional neural network (CNN), recurrent neural network (RNN), autoencoder (AE), and restricted Boltzmann machine (RBM), together with publicly available ultrasound datasets. Commonly used preprocessing methods for medical US images in deep learning, including data augmentation, denoising, and enhancement, as well as transfer learning (TL), are presented in Section III. In Section IV, we discuss the specific tasks (classification, detection, segmentation, and reconstruction) of deep learning in medical US image analysis. Finally, conclusions are drawn in Section V.

II. DEEP LEARNING MODELS AND ULTRASOUND DATASETS
A. CONVOLUTIONAL NEURAL NETWORK
CNN was first proposed by LeCun et al. [13] in 1989 and applied to the image recognition of handwritten characters. In 2012, AlexNet [14] won the ImageNet competition by far surpassing the second place; CNN has advanced rapidly since then, and a large number of classic CNNs have emerged. VGGNet [15] studies the relationship between network depth and performance, combined with small convolution kernels, and achieved good performance. GoogLeNet [16] explores multi-scale fusion in convolution computations to better characterize image information. ResNet [17] introduces the idea of residual connections into CNN, which alleviates the vanishing-gradient problem caused by deepening the network. At present, CNN has become the most widely used network structure in deep learning [18], applied in various fields including computer vision. Fig. 3 shows the CNN architectures of GoogLeNet, ResNet, and VGGNet.
A typical CNN includes convolutional layers, pooling layers, activation functions, and fully connected layers. The convolutional and pooling layers are responsible for automatically learning high-level features of the image from the input; the activation function introduces nonlinear transformations into the network, since convolution alone is a linear operation with limited expressive power; the fully connected layer works like a general neural network, connecting each neuron to all neurons in the previous layer and feeding the result to a classifier (such as SoftMax).
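To make this structure concrete, the following is a minimal sketch of such a CNN in PyTorch; the layer sizes, the single-channel input, and the two-class output are illustrative assumptions, not the architecture of any model discussed in this review.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN: convolution + pooling learn features; a fully
    connected layer feeds a classifier (e.g., SoftMax)."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),                                   # nonlinear activation
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, num_classes),        # fully connected layer
        )

    def forward(self, x):
        # SoftMax would be applied to these logits by the classifier/loss.
        return self.classifier(self.features(x))

logits = SimpleCNN()(torch.randn(4, 1, 224, 224))  # batch of grayscale images
print(logits.shape)  # torch.Size([4, 2])
```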

B. RECURRENT NEURAL NETWORK
Basic neural networks process each input independently, which implies that they have no memory. When the data points are interdependent, we need a neural network with a memory function, which leads to the RNN.
RNN introduces a recurrent layer. Each layer in the network can be regarded as a step in a time sequence. The current state $h_t$ is determined by the input $x_t$ and the previous state $h_{t-1}$, while $h_{t-1}$ is in turn determined by $x_{t-1}$ and $h_{t-2}$. Thus $h_t$ is actually determined by all the previous states and inputs, which reflects the memory of the RNN (see Fig. 4(a)). However, because the RNN considers the impact of all previous inputs on the current output, exploding or vanishing gradients occur during backpropagation, especially when the input sequence is very long. To solve this problem, Hochreiter and Schmidhuber [24] proposed the long short-term memory (LSTM) model in 1997. LSTM changes the recurrent layer of the RNN, adding three gating units: an input gate, a forget gate, and an output gate. The computation graph of LSTM is shown in Fig. 4(d), where $\sigma$ is the sigmoid function, $C_t$ is the cell state at time $t$, $W$ are the weight matrices, and $b$ are the biases. If the outputs of the LSTM are unrolled, one finds that the input at the earliest moment is not repeatedly multiplied by the weights, which avoids the difficulty in backpropagation.
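For reference, the standard LSTM gate equations, written to be consistent with the notation above ($\sigma$, $C_t$, weights $W$, biases $b$); this is the textbook formulation rather than a reproduction of Fig. 4(d):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state)}\\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```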
In medical US image analysis, RNN can predict the next sequence from the shape of the tissue structure in the previous sequence, which can be used to refine tissue-structure boundaries and for other tasks. Chen et al. [25] used a knowledge-transferred RNN (T-RNN) to automatically detect the fetal standard plane and achieved an accuracy of 0.908, outperforming state-of-the-art methods. Yang et al. [26] used RNN to predict prostate shape in a study on prostate US image segmentation. LSTM is often used to generate ultrasound image descriptions: Zeng et al. used LSTM for ultrasound image caption generation in 2018 [27] and 2020 [28].

C. AUTOENCODER
AE is an unsupervised, or rather self-supervised, model: the input data itself is used as the label. AE consists of three parts: an input layer, a hidden layer, and an output layer. The hidden layer is an encoder with fewer neurons than the input layer, so the input can be represented with less data, realizing feature extraction. The output layer is a decoder, which reconstructs the extracted features back into the input data (Fig. 4(b)). Compared with traditional principal component analysis (PCA), AE can extract more complex and higher-level features: PCA is a linear transformation, while AE has nonlinear activation functions and can therefore handle more complex data.
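A minimal sketch of this three-part structure in PyTorch, assuming flattened 64×64 inputs and a 32-dimensional code; both sizes are illustrative:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=64 * 64, code_dim=32):
        super().__init__()
        # Encoder: fewer neurons than the input layer (feature extraction).
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder: reconstructs the input from the compressed code.
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(8, 64 * 64)        # the input itself serves as the label
loss = nn.MSELoss()(model(x), x)  # reconstruction loss
```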
In medical US image analysis, AE and its variants (such as the denoising autoencoder [29], the sparse autoencoder [30], and their stacked versions) are often used for classification. For example, Cheng et al. [31] used CAD based on a stacked denoising autoencoder (SDAE) to segment breast lesions; compared with traditional methods, it significantly improves CAD performance. Hassan et al. [4] used a stacked sparse autoencoder to extract high-level features from segmented images as the input of a SoftMax classifier, achieving a classification accuracy of 97.2%. Combining AE with convolution yields the convolutional autoencoder (CAE) [32]. CAE has the same structure as AE, but because of the convolution operations it is better suited to images and is often used for image denoising and feature extraction [33]. Li et al. [33] used an improved CAE, the denoising convolutional autoencoder (DCAE), to denoise and extract features from B-mode ultrasound tongue images.

D. RESTRICTED BOLTZMANN MACHINE
RBM is a stochastic neural network whose neuron outputs obey the Boltzmann distribution. The variables in an RBM are divided into visible variables and hidden variables; all are binary, taking only the values 0 or 1. The whole RBM is a bipartite graph, and the connections between variables are undirected (Fig. 4(c)). RBM finds the optimal parameters by maximizing a likelihood function.
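For reference, the energy and likelihood of a binary RBM in the standard formulation, with visible units $v$, hidden units $h$, and parameters $W$, $a$, $b$ chosen to maximize the likelihood of the training data:

```latex
\begin{aligned}
E(v, h) &= -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j\\
P(v, h) &= \frac{1}{Z}\, e^{-E(v, h)}, \qquad Z = \sum_{v, h} e^{-E(v, h)}
\end{aligned}
```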
In medical US image analysis, multiple RBMs are usually stacked to form a deep Boltzmann machine (DBM) [46]. Due to its multi-layer nature, it can extract high-level features of the input data and is commonly used for classification and segmentation tasks. For example, Zhang et al. [47] combined the point-wise gated Boltzmann machine (PGBM) and RBM to classify breast tumors, achieving an accuracy of 93.4%; Jaumard-Hakoun et al. [48] used RBM to extract the contours of the tongue. In both RBM and DBM, the connections between neurons are undirected; if the connections between layers are given a direction, the DBM becomes a deep belief network (DBN).

E. ULTRASOUND DATASETS
US medical datasets are often more difficult to obtain than other datasets. First, annotating medical images requires significant professional medical knowledge, which makes annotation expensive and rare. Second, medical data is private and cannot simply be made public. Third, different image acquisition devices may produce different images and parameters. Table 2 shows the details of some publicly available medical ultrasound image datasets, mainly from competitions and papers. Most datasets are distributed in CSV format and image (jpg, pgm, png, jpeg, and DICOM) formats. We briefly introduce a few of them. The CETUS Challenge dataset [34] consists of 3D cardiac US images obtained from 45 patients, collected for segmenting the left-ventricular endocardial border. Dataset B [40] uses the png format and contains 163 image samples, of which 106 are benign and 54 are malignant. The thyroid dataset [42] is a set of B-mode ultrasound images covering 329 cases, including several types of lesions such as thyroiditis, cystic nodules, adenomas, and thyroid cancers; accurate contours of these lesions are provided in XML format.

III. MEDICAL ULTRASOUND IMAGE PREPROCESSING
One of the reasons for the great success of deep learning in various fields is the availability of large numbers of labeled training samples, which allow neural networks to achieve good learning performance. In medical image analysis, however, large labeled datasets are difficult to obtain, and with only small datasets it is hard for deep learning to reach satisfactory performance. The most common way to deal with this problem is transfer learning (TL), which has been shown to play an important role in medical image analysis [49] and has been widely used in various tasks. At present, researchers mainly use model transfer to deal with the small amount of labeled medical image data: a neural network is pre-trained on other datasets and then applied to medical image analysis tasks. For example, Kermany et al. [50] applied TL to develop an artificial intelligence system that can diagnose both eye diseases and pediatric pneumonia. Besides model transfer, TL can also reduce the difficulty and cost of data annotation. Zhou et al. [51] used active learning and TL to implement medical data labeling, reducing labeling costs by at least half compared with state-of-the-art methods. In medical US image analysis, TL is frequently used to pre-train neural networks [3], [41], [52]. In addition to TL, feature fusion is also an option for improving the performance of deep learning models [8]. Wang et al. [53] used feature fusion in a classification study on COVID-19 to further improve the performance of neural networks. In a study of breast cancer diagnosis, Moon et al. [54] fused three types of images and achieved good performance.
Medical US image analysis not only faces the challenge of a small amount of labeled data but also suffers from a large amount of artifacts and noise. Preprocessing of US images therefore becomes particularly important: appropriate preprocessing methods can effectively improve the accuracy of medical US image analysis. The following discusses preprocessing methods for medical US images, including data augmentation, denoising, and enhancement.

A. DATA AUGMENTATION
Data augmentation artificially enlarges the training set, reducing overfitting and improving the generalization ability of deep neural networks. Traditional methods use rotation, random distortion, cropping, and similar operations to augment the data (see the sketch below). In addition, the generative adversarial network (GAN) [55] can be used for data augmentation. Although GAN is more complex than the traditional methods, the new data generated by GAN is more abundant and diverse, and it can overcome shortcomings of the traditional methods such as changes in image size and in the position of the ROI. In a study on synthesizing simulated US images, Tom and Sheet [56] generated realistic pathological US images using GAN (Fig. 5).
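As a sketch of the traditional operations just mentioned, the following applies rotation, random distortion, and cropping with torchvision; the parameter values are illustrative, not taken from the cited studies:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=15),                     # rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1),
                   scale=(0.9, 1.1)),                 # random distortion
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping
    T.ToTensor(),
])
# augmented = augment(pil_us_image)  # applied to each PIL ultrasound image
```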
GAN is composed of a generator and a discriminator. The generator produces samples similar to the real samples; the discriminator determines whether a sample is real or generated. During training, the samples generated by the generator become more and more similar to the real samples, and the discriminator's judgment becomes stronger and stronger; the two are constantly confronted and alternately improved. When the system reaches equilibrium, training ends. The details of the generator and discriminator are introduced below.
The generator receives a random noise variable $z$ with probability distribution $p_z(z)$ and maps it to data $G(z)$ whose distribution approximates that of the real samples. The parameters of the generator are denoted $\theta_g$. Let $D(i)$ denote the probability that an input sample $i$ is real. $\theta_g$ is adjusted so that $D(G(z))$ approaches 1, i.e., $\ln(1 - D(G(z)))$ is minimized. The discriminator receives real samples and generated samples, where a real sample $x$ has label 1 and a generated sample $G(z)$ has label 0. When the input is $x$, $D(x)$ should approach 1 by adjusting the discriminator parameters $\theta_d$, i.e., $\ln(D(x))$ is maximized; when the input is $G(z)$, $D(G(z))$ should approach 0 by adjusting $\theta_d$, i.e., $\ln(1 - D(G(z)))$ is maximized.
To summarize, the objective function of GAN is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\ln D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\ln\left(1 - D(G(z))\right)\right],$$

where $\mathbb{E}$ is the mathematical expectation, $\min_G$ means minimizing $\ln(1 - D(G(z)))$, and $\max_D$ means maximizing $\ln(D(x))$ and $\ln(1 - D(G(z)))$.
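A minimal sketch of one training step under this objective in PyTorch; the network shapes are illustrative, and the generator step uses the common non-saturating variant of minimizing $\ln(1 - D(G(z)))$:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()  # BCE on D's output implements the two log terms

x_real = torch.rand(64, 784)  # stand-in for a batch of real samples
z = torch.randn(64, 100)      # random noise variable z ~ p_z(z)

# Discriminator step: real samples labeled 1, generated samples labeled 0.
d_loss = bce(D(x_real), torch.ones(64, 1)) + \
         bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D(G(z)) toward 1.
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```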

B. DENOISING
US images acquire various types of noise during acquisition, transmission, and analysis, which hinders further analysis by doctors or CAD systems, so denoising is necessary. The following introduces three denoising methods: non-local means filtering, anisotropic diffusion, and U-net.

1) NON-LOCAL MEANS FILTER
The non-local means (NLM) filter was proposed in 2005 by Buades et al. [57], [58]. Compared with traditional neighborhood filtering, NLM fully considers the influence of the global information of the image on the central pixel instead of only its immediate surroundings. This filtering method eliminates noise well while retaining edge details. The main idea is to judge whether the gray values of two pixels are similar based on a Gaussian-weighted Euclidean distance and then replace the gray value of the central pixel with the weighted average gray value of similar pixels.
Given a noisy image $v = \{v(i) \mid i \in I\}$, where $I$ is the set of image pixels, the filtered value is

$$NL[v](i) = \sum_{j \in I} \omega(i, j)\, v(j),$$

where the weight $\omega(i, j)$ depends on the similarity between the gray-level vectors $v(N_i)$ and $v(N_j)$, and $N_k$ denotes a fixed-size square neighborhood centered on pixel $k$:

$$\omega(i, j) = \frac{1}{c(i)} \exp\left(-\frac{\left\| v(N_i) - v(N_j) \right\|_{2,\alpha}^2}{h^2}\right).$$

Here $\alpha > 0$ is the standard deviation of the Gaussian kernel, which usually takes a value between 2 and 5; $h$ controls the decay speed of the exponential function, with $h = \lambda\sigma$, where $\sigma$ is the standard deviation of the noise and $\lambda$ usually takes a value between 0.8 and 1.5; and $c(i)$ is a normalizing constant. However, because the neighborhood of each central pixel is compared with the whole image, the computational cost of NLM is very large. Therefore, many scholars have improved NLM, such as Mahmoudi and Sapiro [59] and Liu et al. [60].
Sudeep et al. [61] and Abrahim et al. [62] used different improved NLM algorithms to denoise medical US images. The results show that the improved NLM algorithms can effectively remove speckle noise while retaining the boundary information of the images.
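A sketch of NLM denoising using scikit-image's implementation; the patch sizes are illustrative, the search window is restricted (addressing the computational cost noted above), and $h$ is set via the $h = \lambda\sigma$ rule with an estimated noise level:

```python
import numpy as np
from skimage.restoration import denoise_nl_means, estimate_sigma

image = np.random.rand(256, 256)  # stand-in for a B-mode US image
sigma = float(np.mean(estimate_sigma(image)))
denoised = denoise_nl_means(
    image,
    patch_size=7,       # size of the square neighborhood N_k
    patch_distance=11,  # search window (limits the "whole image" cost)
    h=1.0 * sigma,      # decay speed of the exponential weights
)
```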

2) ANISOTROPIC DIFFUSION
The anisotropic diffusion model was proposed by Perona and Malik [63] in 1990; it has good edge retention, even enhancement, and excellent denoising capability, and it has been widely used. The idea of anisotropic diffusion is

$$\frac{\partial I}{\partial t} = \mathrm{div}\left(c(\|\nabla I\|)\, \nabla I\right),$$

where $\mathrm{div}$ is the divergence operator, $c(\|\nabla I\|)$ is the diffusion coefficient, $\nabla$ is the gradient operator, $\|\cdot\|$ is the modulus operation, and $t$ is the diffusion time. Perona and Malik proposed the following two expressions for $c(\|\nabla I\|)$:

$$c(\|\nabla I\|) = \exp\left(-\left(\frac{\|\nabla I\|}{k}\right)^2\right), \qquad c(\|\nabla I\|) = \frac{1}{1 + \left(\frac{\|\nabla I\|}{k}\right)^2}.$$

In the above equations, two coefficients, $t$ and $k$, need to be determined. The diffusion time $t$ affects the final smoothing effect, and the coefficient $k$ is the gradient threshold. If $\|\nabla I\|$ is much larger than $k$, $c(\|\nabla I\|)$ tends to 0 and diffusion is suppressed; if $\|\nabla I\|$ is much smaller than $k$, $c(\|\nabla I\|)$ tends to 1 and diffusion is enhanced. Therefore, a large value of $k$ makes the processed image smoother.
The discrete iterative form of anisotropic diffusion applied to digital images can be expressed as

$$I_p^{t+1} = I_p^t + \frac{\lambda}{|\bar{\partial}_p|} \sum_{q \in \bar{\partial}_p} c\left(\|\nabla I_{p,q}^t\|\right)\, \nabla I_{p,q}^t,$$

where $I_p^t$ is the discrete sampling of the current image, $p$ is the coordinate of the sampled pixel, $\nabla I_{p,q}^t = I_q^t - I_p^t$ with $I_q^t$ a neighboring sample of $I_p^t$, $\bar{\partial}_p$ is the neighborhood of $p$, $|\bar{\partial}_p|$ is the size of that neighborhood, and $\lambda$ is a coefficient that controls the overall diffusion intensity.
Although anisotropic diffusion can usually remove noise well while retaining edges, because of the gradient threshold $k$, when the gradient of the noise is large, anisotropic diffusion may fail to remove the noise and may even enhance it. For this reason, many scholars have improved this diffusion method, such as Catté et al. [64] and Ling and Bovik [65]. In a study on the diagnosis of focal liver lesions [4], an anisotropic diffusion filter was used during the US image preprocessing stage to enhance the image and remove noise.
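A compact sketch of the Perona-Malik iteration above with the exponential diffusion coefficient; the iteration count, $k$, and $\lambda$ are illustrative and would need tuning per image (boundary handling is periodic here for brevity):

```python
import numpy as np

def anisotropic_diffusion(img, n_iter=20, k=0.1, lam=0.2):
    img = img.astype(np.float64)
    for _ in range(n_iter):
        # Gradients toward the four neighbors (north, south, east, west).
        dn = np.roll(img, -1, axis=0) - img
        ds = np.roll(img, 1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # Diffusion is suppressed where the gradient exceeds k (edges kept).
        img = img + lam * sum(np.exp(-(d / k) ** 2) * d
                              for d in (dn, ds, de, dw))
    return img

smoothed = anisotropic_diffusion(np.random.rand(256, 256))
```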

3) U-NET
Deep learning has achieved exciting results in many fields such as computer vision. In recent years, image denoising methods based on deep learning have also attracted the attention of many researchers [66]. U-net [67] is a well-known deep learning network, initially known for medical image segmentation. Due to its unique network structure, many researchers also use U-net and its modified versions for image denoising [68]. U-net is a type of convolutional neural network that adopts an encoder-decoder structure. It includes four downsampling and four upsampling stages and uses skip connections to link the corresponding feature maps of the downsampling and upsampling paths, ensuring that image features are not lost.
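A compact PyTorch sketch of this encoder-decoder structure with four downsampling stages, four upsampling stages, and skip connections; the channel widths follow the original U-net paper but are otherwise illustrative:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class UNet(nn.Module):
    def __init__(self, c_in=1, c_out=1, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.downs = nn.ModuleList()
        c = c_in
        for w in widths[:-1]:                       # encoder: 4 downsamplings
            self.downs.append(double_conv(c, w)); c = w
        self.pool = nn.MaxPool2d(2)
        self.bottom = double_conv(widths[-2], widths[-1])
        self.ups, self.up_convs = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths[:-1]):             # decoder: 4 upsamplings
            self.ups.append(nn.ConvTranspose2d(w * 2, w, 2, stride=2))
            self.up_convs.append(double_conv(w * 2, w))
        self.head = nn.Conv2d(widths[0], c_out, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x); skips.append(x); x = self.pool(x)
        x = self.bottom(x)
        for up, conv, skip in zip(self.ups, self.up_convs, reversed(skips)):
            x = up(x)
            x = conv(torch.cat([skip, x], dim=1))   # skip connection
        return self.head(x)

out = UNet()(torch.randn(1, 1, 256, 256))  # side length divisible by 2^4
```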
Lan and Zhang [69] designed a new U-net-based neural network named MARU for denoising medical US images. MARU integrates residual connections and a mixed-attention mechanism into U-net and achieves good denoising performance in terms of peak signal-to-noise ratio and structural similarity.

C. ENHANCEMENT
Ultrasound images often have problems such as low contrast, unclear tissue boundaries, blurred edges, and low resolution. When a deep-learning-based CAD system is used to analyze ultrasound images, image quality is closely related to the performance of the neural network, so appropriate enhancement preprocessing is particularly important. This section introduces three medical image enhancement methods: histogram equalization, homomorphic filtering, and the super-resolution generative adversarial network.

1) HISTOGRAM EQUALIZATION
Histogram equalization is a basic technique of spatial-domain digital image processing. The idea is to transform an image with an uneven gray-value distribution into a new image with a uniform probability density distribution, expanding the dynamic range of pixel values and thus enhancing the overall contrast of the image [70]. Let $L$ be the number of possible gray levels in the image and $r \in [0, L-1]$ the gray value to be processed, where $r = 0$ means black and $r = L-1$ means white. With $n_k$ the number of pixels having gray value $r_k$ and $MN$ the total number of pixels, the probability of gray value $r_k$ is

$$P_r(r_k) = \frac{n_k}{MN}, \qquad k = 0, 1, \ldots, L-1.$$

The mapping of $r_k$ in the new histogram after equalization is

$$s_k = T(r_k) = \left\langle (L-1) \sum_{j=0}^{k} P_r(r_j) \right\rangle,$$

where $T(r_k)$ is the histogram equalization transformation and $\langle\,\cdot\,\rangle$ is the nearest-integer operation. However, traditional histogram equalization is not suitable for images with a two-level gray distribution. To address this issue, adaptive histogram equalization [71] or contrast-limited adaptive histogram equalization [72] can be used. Han et al. [73] used histogram equalization for image enhancement during preprocessing in a study on breast lesion classification.
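A sketch of global histogram equalization and its contrast-limited adaptive variant (CLAHE) using OpenCV on an 8-bit grayscale image; the clip limit and tile grid are illustrative defaults:

```python
import cv2
import numpy as np

img = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in image

equalized = cv2.equalizeHist(img)  # global histogram equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized_adaptive = clahe.apply(img)  # CLAHE, better suited to two-level
                                       # gray distributions
```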

2) HOMOMORPHIC FILTERING
Homomorphic filtering regards an image as the product of an illumination component and a reflectance component. Since illumination changes relatively slowly across a scene, it can be regarded as the low-frequency component of the image, while reflectance changes sharply and can be regarded as the high-frequency component. The idea is to reduce the low frequencies and boost the high frequencies, so as to suppress lighting variations and sharpen edges and details. The model is

$$f(x, y) = i(x, y)\, r(x, y),$$

where $f(x, y)$ is the image to be processed, $i(x, y)$ is the illumination component, and $r(x, y)$ is the reflectance component. Since this is a product of two functions, we take the logarithm,

$$z(x, y) = \ln f(x, y) = \ln i(x, y) + \ln r(x, y).$$

Applying the Fourier transform gives

$$Z(u, v) = F_i(u, v) + F_r(u, v).$$

Filtering with a filter function $H(u, v)$ yields

$$S(u, v) = H(u, v)\, Z(u, v),$$

and after the inverse Fourier transform,

$$s(x, y) = \mathcal{F}^{-1}\{S(u, v)\},$$

the exponential transform gives the final result,

$$g(x, y) = e^{s(x, y)}.$$

The filter function $H(u, v)$ is usually a Gaussian high-pass filter, a Butterworth filter, or an exponential filter [74]. However, the filter function has many parameters to determine, and it often takes multiple experiments to find reasonable values; much research has been conducted to improve the filter function. Duong et al. [75], in a study on segmenting alveolar bone from intraoral US images, first used homomorphic filtering to reduce noise and enhance the image and then used U-Net to segment the alveolar bone.

3) SUPER-RESOLUTION GENERATIVE ADVERSARIAL NETWORK
When using US to diagnose obese patients, the US waves need to penetrate to a greater depth. However, with increasing penetration depth, a tradeoff must be made between field of view, frame rate, and scan-line density. Since US diagnosis is usually performed with real-time imaging, doctors may need to narrow the field of view or reduce the scan-line density to increase the penetration depth, but both options reduce the quality and resolution of US imaging. The super-resolution generative adversarial network (SRGAN) [76] is a commonly used super-resolution network. Different from an ordinary GAN, the generator of SRGAN does not receive random noise variables but low-resolution images.
As mentioned earlier, GAN has two loss functions, one from the generator and the other from the discriminator, and SRGAN modifies the loss functions of GAN. SRGAN defines a new loss function that evaluates the generated image in terms of perceptual characteristics. The perceptual loss $L^{SR}$ is the weighted sum of a content loss $L_X^{SR}$ and an adversarial loss $L_{Gen}^{SR}$. The content loss consists of two parts: the mean squared error $L_{MSE}^{SR}$ between the generated super-resolution image and the label, and the mean squared error $L_{VGG}^{SR}$ between the feature maps of the generated image and of the label:

$$L_{MSE}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left(I_{x,y}^{HR} - G_{\theta_g}(I^{LR})_{x,y}\right)^2,$$

$$L_{VGG}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\left(G_{\theta_g}(I^{LR})\right)_{x,y}\right)^2,$$

where $r$ is the upscaling factor, $W$ and $H$ are the width and height of the low-resolution image, $I^{HR}$ is the high-resolution label, $I^{LR}$ is the low-resolution input, and $W_{i,j}$, $H_{i,j}$, and $\phi_{i,j}$ denote the width, height, and feature map of the convolution of layer $j$ before the $i$-th pooling layer, respectively. The adversarial loss is

$$L_{Gen}^{SR} = \sum_{n=1}^{N} -\ln D_{\theta_d}\left(G_{\theta_g}(I^{LR})\right),$$

where $D_{\theta_d}(x)$ is the output of the discriminator for input $x$. The total perceptual loss is then

$$L^{SR} = L_X^{SR} + 10^{-3} L_{Gen}^{SR}.$$
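A sketch of the VGG feature (content) loss $L_{VGG}^{SR}$ above in PyTorch: the MSE between feature maps $\phi_{i,j}$ of the label and of the generated image. The choice of VGG19 layer is illustrative, and input normalization is omitted for brevity:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Fixed, pre-trained feature extractor (a deep VGG19 feature map).
phi = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in phi.parameters():
    p.requires_grad_(False)

def vgg_content_loss(sr, hr):
    """MSE between VGG feature maps of generated (sr) and label (hr)."""
    return nn.functional.mse_loss(phi(sr), phi(hr))

sr = torch.rand(1, 3, 96, 96)  # generated super-resolution image
hr = torch.rand(1, 3, 96, 96)  # ground-truth high-resolution label
loss = vgg_content_loss(sr, hr)
```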

IV. APPLICATIONS OF DEEP LEARNING IN MEDICAL US IMAGES
A. CLASSIFICATION
Classification is one of the most fundamental tasks in medical US image analysis; it can provide doctors with diagnosis suggestions, improve diagnostic efficiency, and reduce the influence of subjective factors in the diagnostic process. The main classification targets are tumors, lesions, nodules, tissues, and organ fibrosis. Traditional CAD classification methods extract morphological features, or low-level texture features via the gray-level co-occurrence matrix, wavelet transform, local binary patterns (LBP), and similar methods, and then apply a classifier (such as a support vector machine, decision tree, naive Bayes, or k-nearest neighbors). However, traditional methods are easily affected by low imaging quality. Compared with them, deep learning can reduce the impact of low-quality US images by extracting high-level features. Among deep learning structures, CNN is the most widely used for classification, and classic CNN architectures (such as GoogLeNet, VGGNet, and AlexNet) are still widely used with fine-tuning. In 2017, Han et al. [73] studied the classification of breast lesions using a modified GoogLeNet: the network was pre-trained on ImageNet, the two auxiliary classifiers were removed, the input layer was modified to process grayscale images, and the original 1000-class output was changed to two classes corresponding to benign and malignant lesions (a sketch of this kind of fine-tuning follows below). This study achieved better classification results than conventional methods. In a study by Liu et al. [3] on thyroid nodules, VGG-F was applied: the team first pre-trained VGG-F on ImageNet and then fed ultrasound images into VGG-F to obtain high-level features. Traditional low-level features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), and LBP were fused with the high-level features extracted by VGG-F, concatenated into a one-dimensional vector, and finally classified with a positive-majority voting strategy. The results show that this hybrid method performs better than methods using a single feature type. Meng et al. [77] studied liver fibrosis classification, first using VGGNet pre-trained on the ILSVRC dataset to extract high-level features and then using an FCN to classify them.
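A sketch of this kind of fine-tuning with torchvision's GoogLeNet; the details differ from the cited study, which also rewrote the input layer for grayscale images (here the grayscale channel is simply replicated):

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

# ImageNet pre-trained GoogLeNet; torchvision drops the auxiliary
# classifiers by default when loading pre-trained weights.
model = googlenet(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)  # 1000 classes -> benign/malignant

# Replicating the grayscale channel to 3 channels matches the
# pre-trained input layer (an alternative to rewriting the first conv).
x = torch.randn(4, 1, 224, 224).repeat(1, 3, 1, 1)
logits = model(x)
```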
In addition to classic CNNs, other deep learning structures, such as DBN, the deep polynomial network (DPN), the deep neural network (DNN), AE, and deep learning software (DLS), are also used in classification tasks and have achieved good results. For example, Liu et al. [2] utilized a stacked DPN to classify tumors; Wu et al. [83] used DBN to study focal liver lesions; Hassan et al. [4] utilized a sparse autoencoder to extract image features in the diagnosis of focal liver lesions and then fed the extracted features into a neural network with a SoftMax classifier.
Various classification studies using deep learning in medical US images are summarized in Table 3.

B. DETECTION
In medical US image analysis, the purpose of detection is mainly to identify and locate regions of interest and to provide assistance for subsequent medical diagnosis, treatment, or image segmentation.
The detection of tumors or lesions is an important task in US image detection. The commonly used methods fall into two categories: two-step methods combining candidate regions with deep learning, represented by the R-CNN series [88]; and single-step methods that transform target detection into a regression problem, represented by you only look once (YOLO) [89] and the single-shot detector (SSD) [90]. In 2017, Li et al. [10] detected papillary thyroid carcinoma based on an improved Fast R-CNN, which can automatically detect the papillary thyroid cancer area with an accuracy of 93.5%. Cao et al. [20] compared four deep learning models for breast cancer detection, Fast R-CNN, Faster R-CNN, YOLO, and SSD, and concluded that SSD has the best accuracy and recall.
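As an illustration of the two-step detector family, the following runs torchvision's Faster R-CNN with generic COCO weights on a stand-in frame; this is not the improved detector of the cited studies, which were trained on ultrasound data:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
frame = torch.rand(1, 256, 256).repeat(3, 1, 1)  # one 3-channel image in [0, 1]
with torch.no_grad():
    predictions = detector([frame])[0]
# Each prediction holds candidate boxes with labels and confidence scores.
print(predictions["boxes"].shape, predictions["scores"].shape)
```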
US examination of the fetus is also of great significance: it can determine the stage of pregnancy, confirm the condition of the fetus, and look for factors that may affect fetal growth and delivery. In fetal ultrasound detection, acquiring a fetal standard plane is the prerequisite and key for subsequent biometric measurement and diagnosis [91]. Chen et al. [85] and Wu et al. [22] applied pre-trained CNNs to detect the standard plane of the fetus.
A number of deep learning detection studies in medical ultrasound images are summarized in Table 4.

C. SEGMENTATION
Image segmentation extracts the region of interest, which facilitates the analysis and recognition of medical US images. It is also a prerequisite for quantitative analysis of relevant imaging indicators.
In medical US image segmentation, fully convolutional networks (FCN) are among the most commonly used network structures. The main contribution of FCN is to use deconvolution instead of fully connected layers to restore the output feature map to the original image size, achieving pixel-level classification [105]. Chen et al. [96] presented an iterative multi-domain regularized FCN to segment anatomical structures; even when the image has artifacts and speckle noise and the boundary of the anatomical structure is unclear, the structure can still be accurately segmented. Zhang et al. [94] utilized a coarse-to-fine stacked FCN (CFS-FCN) in a study on lymph node segmentation and compared it with U-Net, CUMedNet [106], and other FCN variants. The results show that CFS-FCN is significantly better than the other deep learning methods in segmentation performance.
As one of the most classic segmentation models, U-Net is widely used in US image segmentation, and a variety of derivative models have been developed; for example, Viksit et al. [107] used Multi U-net to effectively segment breast masses, achieving a mean Dice of 0.82. A number of deep learning segmentation studies in medical US images are summarized in Table 5.
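For reference, the Dice score reported by such studies (e.g., the mean Dice of 0.82 above), computed over binary masks as $\mathrm{Dice} = 2|A \cap B| / (|A| + |B|)$:

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice overlap between a predicted and a ground-truth binary mask."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

print(dice(np.ones((64, 64)), np.ones((64, 64))))  # 1.0 for a perfect mask
```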

D. RECONSTRUCTION
Improving the quality of US imaging can effectively improve the accuracy and efficiency of diagnosis. The traditional approach to improving US image quality is to use compressed sensing (CS). Although CS approaches can improve the quality of US imaging, the improvement is limited, and CS approaches usually require replacing the hardware of US imaging machines, which adds cost. Yoon et al. [101] introduced CNN into US image reconstruction; the results show that, compared with traditional CS approaches, this method improves US image quality more, does not require any hardware replacement, and can be applied to any B-mode US system or transducer. Perdios et al. [104] used a stacked denoising autoencoder (SDAE) to compress and recover US images. This SDAE consists of four layers: the first layer compresses the ultrasound signal and the next three layers reconstruct it. The results show that this method is superior to other CS-based US image reconstruction techniques in terms of reconstruction quality and time consumption.
A number of deep learning reconstruction studies in medical US images are summarized in Table 6.

V. CONCLUSION
Cadieu et al. [108] confirmed in a study that deep learning can achieve the same performance as the primate inferotemporal visual cortex, which indicates that deep learning has advantages over traditional methods in computer vision tasks. With a simple network structure and automatic learning, deep learning can complete various complex medical US image analysis tasks. However, deep learning still faces many problems in medical US image analysis. First, compared with natural image datasets such as ImageNet, the available datasets in the US field are very limited, which makes training deep learning models difficult. Although traditional methods or TL can alleviate the small-sample problem, they are not the best way to address this issue, so it is very important for deep learning researchers and hospitals to cooperate in finding more efficient methods of medical data annotation. Second, the generalization ability of deep learning is poor: a model can perform well on images collected by a specific device, but when different devices are used, performance generally declines, which is an urgent problem to be solved. Third, the innovation of neural network models has slowed down; many studies improve performance by simply modifying or stacking existing neural network models. Fourth, deep learning has poor interpretability. In the field of medical artificial intelligence, one of the major reasons why deep learning is not yet applied on a large scale is its ''black-box'' effect. Although a neural network is built from very simple mathematical formulas, its output is very complicated, and it is difficult to know how the network works internally. This causes doctors and patients to have doubts about the diagnosis. Also because of the ''black-box'' effect, it is difficult for researchers to adjust the parameters of a neural network effectively; it usually takes many attempts to find good parameters.
Although deep learning still has many problems in the analysis of medical ultrasound images, its value has long been widely recognized. Deep learning technology can improve the medical environment in areas with underdeveloped medical resources and improve the diagnostic efficiency of medical institutions. It can also provide doctors with auxiliary diagnostic information before, during, and after treatment. To better integrate artificial intelligence technology represented by deep learning into ultrasound image analysis, and to address the shortcomings of deep learning in practical applications, the following efforts are needed. First, strengthen interdisciplinary cooperation between deep learning researchers and medical institutions to reduce the difficulty of obtaining and labeling medical data; at the same time, deep learning researchers can better understand the needs of medical artificial intelligence applications and continuously improve the accuracy and ease of use of deep learning technologies. Second, establish standardized, open, large-scale databases. Deep learning demands high data quantity and quality, but at present there are relatively few open databases of medical ultrasound data, with varying formats and uneven quality, which is not conducive to the development of data-driven deep learning in the medical ultrasound field.
Generally speaking, deep learning is still in the development stage for medical ultrasound images. With advances in deep learning research, improvements in computer hardware and image acquisition equipment, and enhanced interdisciplinary cooperation, deep learning will achieve better performance in medical ultrasound image analysis tasks.