Minimizing Image Quality Loss After Channel Count Reduction for Plane Wave Ultrasound via Deep Learning Inference

High-frame-rate ultrasound imaging uses unfocused transmissions to insonify an entire imaging view for each transmit event, thereby enabling frame rates over 1000 frames per second (fps). At these high frame rates, it is naturally challenging to realize real-time transfer of channel-domain raw data from the transducer to the system back end. Our work seeks to halve the total data transfer rate by uniformly decimating the receive channel count by 50% and, in turn, doubling the array pitch. We show that despite the reduced channel count and the inevitable use of a sparse array aperture, the resulting beamformed image quality can be maintained by designing a custom convolutional encoder–decoder neural network to infer the radio frequency (RF) data of the nullified channels. This deep learning framework was trained with in vivo human carotid data (5-MHz plane wave imaging, 128 channels, 31 steering angles over a 30° span, and 62 799 frames in total). After training, the network was tested on an in vitro point target scenario that was dissimilar to the training data, in addition to in vivo carotid validation datasets. In the point target phantom image beamformed from inferred channel data, spatial aliasing artifacts attributed to array pitch doubling were found to be reduced by up to 10 dB. For carotid imaging, our proposed approach yielded a lumen-to-tissue contrast that was on average within 3 dB compared to the full-aperture image, whereas without channel data inferencing, the carotid lumen was obscured. When implemented on an RTX-2080 GPU, the inference time to apply the trained network was 4 ms, which favors real-time imaging. Overall, our technique shows that with the help of deep learning, channel data transfer rates can be effectively halved with limited impact on the resulting image quality.


I. INTRODUCTION
Ultrasound imaging is well regarded as a noninvasive medical imaging modality. Current clinical scanners with array scanning capabilities typically form each image frame via a beamline-based pulse-echo sensing paradigm that involves sending one focused pulse firing along each of a set of beamlines in the imaging view. From each pulse-echo sensing event, one axial line of the image frame is generated by performing delay-and-sum (DAS) beamforming on the channel-domain radio frequency (RF) data samples acquired from all array elements [1]. Real-time frame rates can be achieved with this conventional ultrasound imaging paradigm. In contrast, over the past decade, unfocused pulsing of planar or diverging wavefronts, which can transiently insonify the entire imaging view, has demonstrated efficacy in achieving high frame rates of over 1000 frames per second (fps) [2], [3]. In this high-frame-rate imaging paradigm, one image frame can be generated from the channel RF data array of a single unfocused pulse-echo firing. The beamformed images obtained from unfocused pulse-echo firings with different steering angles can also be coherently compounded to improve the lateral image resolution [4], [5]. Accordingly, researchers have successfully imaged hemodynamics in vivo on a time-resolved basis with submillisecond temporal resolution [6]-[8]. Others have also developed new applications, such as shear wave elastography and functional ultrasound imaging [9], [10].

While conceptually powerful, high-frame-rate ultrasound imaging is not without practical implementation challenges. Its two primary hurdles are: 1) high computational capacity is required to generate images in real time [11] and 2) high streaming bandwidth is needed to transfer the channel RF data from the array front end to the computing back end [2]. The first hurdle can be overcome by applying software beamforming methods that use graphics processing unit (GPU) computing devices to parallelize the DAS beamforming process over different pixel positions [4], [12]. The second hurdle is more challenging to address. Specifically, upward of 10 GB of channel RF data can be transmitted every second, and few scanners are designed to accommodate this level of data streaming traffic [2]. In light of this bottleneck, it is of practical interest to reduce the amount of channel data that must be transferred.

Our work leverages the redundancy inherent in the channel RF data array. This redundancy stems from the prevailing concept in pulse-echo sensing that different transducer channels on the same aperture would tend to receive, for a given scatterer, a similar echo signature that is merely shifted in time. Accordingly, we posit that a convolutional neural network (CNN) architecture can be trained to capture the redundant structure within the channel RF data array. The trained network can then be applied to reconstruct full-channel RF datasets from truncated ones acquired with fewer array elements used on reception. In doing so, data streaming traffic can be effectively halved with minimal loss in image quality and with a bearable computational cost that does not compromise real-time feasibility, as will be shown later in this article.

In principle, it is readily possible to reduce raw data transfer in ultrasound imaging by uniformly decimating every other channel on the array aperture during pulse-echo reception.
Nevertheless, in doing so, the array pitch for beamforming would be concomitantly doubled. Using a large pitch relative to the acoustic wavelength (exceeding the spatial Nyquist limit of λ/2) results in spatial aliasing that, in turn, introduces imaging artifacts along the lateral dimension and reduces the contrast of echolucent regions [27]. Fortunately, this drawback is in theory addressable because, according to modern signal processing, even the Nyquist sampling limit can be overcome with additional knowledge of the structure of the signal [28], [29]. As explained in the Appendix, ultrasound RF data are indeed highly structured across adjacent channels. Accordingly, our proposed framework has been devised to implicitly exploit this structural redundancy via a data-driven, deep learning approach to recover decimated channel-domain RF data and, in turn, generate ultrasound images with quality comparable to those formed using RF data acquired from the fully populated array.

A convolutional encoder-decoder network is designed to use learned convolutional kernels to extract features in the RF data and to perform data recovery accordingly. This network forms the basis of our proposed pre-beamformed data formation framework, shown in Fig. 1 together with the overall imaging system. After a transmit event, the framework (dashed box) first acquires data from half of the receive channels (odd-indexed) of the ultrasound probe, then infers the unacquired channels (even-indexed) using the neural network, and finally interleaves the acquired and inferred channels to reform a full-channel set of data. The reconstructed full dataset is then passed to a beamforming module to generate an image frame.

The encoder-decoder network is chosen because this architecture is well suited to the RF data inference task based on the image processing literature. In modern image processing, encoder-decoders are used for image restoration tasks such as inpainting (i.e., the recovery of lost data in images and videos [30]), which can be considered analogous to the recovery of missing RF data between channels in ultrasound. Similar to inpainting, after data-driven training, the encoding stage of the neural network learns to generate a compressed representation of the most important spatiotemporal features, and the decoding stage learns to regenerate the missing portions of RF data based on those encoded features.

Although only half of the channels are acquired, the spatiotemporal structure needed for RF data inference is still present in the RF data frame, as shown in the top left of Fig. 1 (under "odd-indexed channels").
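To make the dashed-box pipeline concrete, the following is a minimal sketch of the acquire-infer-interleave step, assuming a 128-element array with 1-based element numbering (so the odd-indexed elements occupy columns 0, 2, 4, ... of the full frame); `infer_even` is a hypothetical callable standing in for the trained network.

```python
import numpy as np

def reform_full_channels(rf_acquired, infer_even):
    """Interleave acquired (odd-indexed) and CNN-inferred (even-indexed)
    channels into a full 128-channel RF frame, as in Fig. 1.

    rf_acquired: (depth_samples, 64) array from the odd-indexed elements.
    infer_even:  hypothetical callable wrapping the trained network; maps
                 the 64 acquired channels to the 64 unacquired ones.
    """
    depth, n_acq = rf_acquired.shape
    rf_full = np.empty((depth, 2 * n_acq), dtype=rf_acquired.dtype)
    rf_full[:, 0::2] = rf_acquired               # acquired channels
    rf_full[:, 1::2] = infer_even(rf_acquired)   # inferred channels
    return rf_full                               # passed on to beamforming
```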

Before entering the encoding-decoding stage of the network, the RF data frame is zero-padded to ensure that the appropriate sample size is used for downsampling along the depth dimension across all encoding layers. The kernels of the proposed network then capture and compress the structure in the odd-indexed channels and use that structure to infer the even-indexed channels. Finally, the RF data frame is cropped to return the output to the original size of the input RF data.

In a CNN, the capacity to capture complex structures is governed by the number of layers, the number of kernels per layer, and the kernel sizes. Increasing these parameters increases the capacity to capture complex structures; however, doing so comes at the expense of increased computational cost. To balance this tradeoff, a nine-layer encoder-decoder network was chosen. The encoder-decoder has four encoding layers to extract the redundant structure in the RF frame, four mirrored decoding layers for inference, and one recombination layer to combine the inferred features into a single RF frame. The number of kernels per layer was also chosen to balance the tradeoff between capacity and computation. The four encoding layers have 8, 32, 64, and 64 kernels, respectively; this subarchitecture is mirrored in the decoding layers. The number of kernels per layer, along with the convolutional stride length, helps to compress the RF data into feature maps during encoding and to infer the RF data during decoding. A consolidated sketch of the architecture, using the kernel sizes and strides detailed in the following paragraphs, is given below.
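The PyTorch sketch below consolidates the architecture as described in this section: kernel counts (8, 32, 64, 64), kernel sizes (5,3), (6,8), (3,3), (3,3), strides of two everywhere except the second encoding layer and its mirror, a (5,5) recombination layer, and two skip connections. The activation function, the exact placement of the skips (assumed here to follow encoding layers 1 and 3), and the use of interpolation to reconcile feature-map sizes before concatenation and output cropping are assumptions; Fig. 2 defines the authoritative layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelInferenceNet(nn.Module):
    """Sketch of the nine-layer encoder-decoder described in the text."""

    def __init__(self):
        super().__init__()
        # Encoding layers: 8, 32, 64, 64 kernels; stride 2 except layer 2.
        self.enc1 = nn.Conv2d(1, 8, (5, 3), stride=2)
        self.enc2 = nn.Conv2d(8, 32, (6, 8), stride=1)
        self.enc3 = nn.Conv2d(32, 64, (3, 3), stride=2)
        self.enc4 = nn.Conv2d(64, 64, (3, 3), stride=2)
        # Mirrored decoding layers (transposed convolutions).
        self.dec4 = nn.ConvTranspose2d(64, 64, (3, 3), stride=2)
        self.dec3 = nn.ConvTranspose2d(128, 32, (3, 3), stride=2)  # skip from enc3
        self.dec2 = nn.ConvTranspose2d(32, 8, (6, 8), stride=1)
        self.dec1 = nn.ConvTranspose2d(16, 1, (5, 3), stride=2)    # skip from enc1
        # Recombination layer ("dark blue" in Fig. 2).
        self.recombine = nn.Conv2d(1, 1, (5, 5), padding=2)

    @staticmethod
    def _match(x, like):
        # Resize feature maps so skip concatenations and the output align.
        return F.interpolate(x, size=like.shape[-2:], mode="nearest")

    def forward(self, x):
        # x: (batch, 1, depth_samples, 64 odd-indexed channels), zero-padded
        # along depth beforehand so stride-2 downsampling divides evenly.
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(e1))
        e3 = F.relu(self.enc3(e2))
        e4 = F.relu(self.enc4(e3))
        d4 = F.relu(self.dec4(e4))
        d3 = F.relu(self.dec3(torch.cat([self._match(d4, e3), e3], dim=1)))
        d2 = F.relu(self.dec2(self._match(d3, e2)))
        d1 = F.relu(self.dec1(torch.cat([self._match(d2, e1), e1], dim=1)))
        # Crop/resize back to the input size (the "crop" stage in Fig. 2).
        return self.recombine(self._match(d1, x))
```

Under these assumptions, a (1, 1, depth, 64) frame of odd-indexed channels maps to a same-size output interpreted as the 64 inferred even-indexed channels, matching the pad-infer-crop flow described above.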

The convolutional kernel sizes affect the extent of the structure captured in each layer. The sizes were selected so that the first layer captures more of the features related to the fundamental RF frequency along the depth dimension by using a long kernel of size (5,3). The second layer captures more of the salient spatiotemporal features using a wide kernel of size (6,8). The third and fourth convolutional layers, of size (3,3), add depth to the network to better capture the redundant structure. The subsequent deconvolutional layers (the first four layers of the decoding stage in Fig. 2) mirror the parameters of the encoding layers to form a standard encoder-decoder architecture. The convolutional layer at the end of the decoding stage (Fig. 2, in blue) uses a kernel of size (5,5) to recombine the inferred features into a single image.

To compress and decompress the extracted RF data features, the proposed network uses a stride length of two for all layers except the second encoding layer and its corresponding decoding layer (white layers in Fig. 2). The stride of a convolutional layer represents sample "skipping" in each dimension [31]. For example, a kernel with a stride length of one performs convolution at each index, whereas a kernel with a stride length of two performs convolution at every other index, compressing the features of the RF data. Similarly, deconvolutional strides expand the features of the RF data.

Fig. 2 (caption): The number and size of the kernels for each encoding and decoding layer are shown below the corresponding layers (dark and light gray). Two sets of skip connections concatenate the outputs of the encoding layers (start of arrows) to the inputs of the decoding layers (end of arrows). The last convolutional layer (dark blue) recombines the information into a single RF frame, which is cropped (orange) to ensure preservation of input and output dimensions, resulting in the inferred even-indexed RF channels (purple).

For testing, plane wave channel data were acquired from a point target phantom over a set of steering angles (increments of 0.5°) according to the parameters in Table I. This phantom test case served to assess the generalizability of our network both to angles and to an imaged structure that was entirely different from our training dataset. The images formed from these acquisitions served to evaluate image quality in terms of resolution, contrast, and spatial aliasing artifacts.

To analyze the efficacy of the proposed framework in recovering image quality and suppressing spatial aliasing artifacts, three derivative RF datasets were generated from the newly acquired RF data (a sketch of the interpolation baseline follows this list).

1) Decimated channels: This RF dataset retains only the odd-indexed channels (64 in total), emulating the reduced-channel acquisition with no recovery of the unacquired even-indexed channels; beamforming this dataset reflects the doubled array pitch.

2) Linearly interpolated channels: This RF dataset represents a reference data recovery scheme for the unacquired channel data. The recovery was done by linearly interpolating the neighboring odd-indexed channels to form the unacquired even-indexed channels and restore the full-channel (128) RF dataset.

3) CNN-inferred channels: This dataset was formed using our proposed framework. The odd-indexed channels were extracted from the original acquisition and passed into the CNN encoder-decoder to infer the even-indexed channels for full-channel (128) restoration.
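The following is a minimal sketch of the linear interpolation baseline (dataset 2), assuming each unacquired even-indexed channel is formed as the average of its two acquired neighbors; the handling of the edge channel is an assumption, as the text does not specify it.

```python
import numpy as np

def linear_interp_recovery(rf_acquired):
    """Reference recovery: restore a 128-channel frame from the 64
    odd-indexed channels by linearly interpolating their neighbors."""
    depth, n_acq = rf_acquired.shape
    rf_full = np.zeros((depth, 2 * n_acq), dtype=float)
    rf_full[:, 0::2] = rf_acquired
    # Interior even-indexed channels: mean of adjacent acquired channels.
    rf_full[:, 1:-1:2] = 0.5 * (rf_acquired[:, :-1] + rf_acquired[:, 1:])
    rf_full[:, -1] = rf_acquired[:, -1]  # edge channel: nearest neighbor (assumed)
    return rf_full
```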

B-mode images were subsequently formed from the original full-channel dataset and the three derivative RF datasets using a conventional DAS beamforming framework according to the parameters in Table II. The RF data were first prefiltered with a 3-7-MHz finite impulse response (FIR) bandpass filter applied along the fast-time dimension. The RF data were then converted to analytic form using the Hilbert transform. Afterward, DAS beamforming was performed to generate the four sets of images; a constant F-number of 1.25 and uniform (rectangular) apodization were used.

Plane wave images are often coherently compounded to improve the image quality relative to images beamformed from a single transmission. The image quality (spatial aliasing artifacts, contrast) was therefore also assessed after coherently compounding increasing subsets of the beamformed images from steered transmissions. The steering angles used for compounding were selected in pairs, so that each compounded image was always formed from balanced positive and negative steering angles. The selected angles were also evenly distributed along the angle span (e.g., −8° and 8° for a two-angle subset).

RF channel data inference was first evaluated on test set data (i.e., data not included in training and validation). As an example, Fig. 4(a) shows an in vivo human carotid B-mode image (40-dB dynamic range) generated from the test set, along with the corresponding reference even-channel RF frame [see Fig. 4(b)] and the inferred RF frame [see Fig. 4(c)] based on the odd-indexed RF input. The overall structure in the inferred RF data generally matched the reference; this observation is substantiated by the low NRMSE (0.012) between the two RF frames.
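The pre-beamforming steps and the balanced-angle compounding described above can be sketched as follows; the sampling rate, FIR tap count, zero-phase filtering, and the exact rule for picking evenly distributed angle subsets are assumptions not fixed by the text.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, hilbert

def preprocess_rf(rf, fs=40e6, band=(3e6, 7e6), numtaps=65):
    """Prefilter with a 3-7 MHz FIR bandpass along fast time (axis 0),
    then convert to the analytic signal prior to DAS beamforming."""
    taps = firwin(numtaps, band, pass_zero=False, fs=fs)
    rf_filtered = filtfilt(taps, 1.0, rf, axis=0)  # zero-phase (assumed)
    return hilbert(rf_filtered, axis=0)            # analytic RF

def compound(images, angles, n_angles):
    """Coherently average complex beamformed images over n_angles steering
    angles chosen symmetrically about 0 deg and spread over the span.
    The midpoint selection rule below is one plausible reading of the
    'evenly distributed' description (e.g., near -8 and 8 deg for n = 2)."""
    angles = np.asarray(angles)
    edges = np.linspace(angles.min(), angles.max(), n_angles + 1)
    targets = (edges[:-1] + edges[1:]) / 2  # balanced for a symmetric span
    idx = [int(np.argmin(np.abs(angles - t))) for t in targets]
    return np.mean([images[i] for i in idx], axis=0)
```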

As a further evaluation of the effect of using CNN-inferred channels for image formation, a cross section [at the depth shown in Fig. 7(a)] along the row of point targets is plotted in Fig. 7(b) for images generated from each compared dataset under an unsteered plane wave transmission. The artifact reduction of the proposed method (orange line) can be clearly observed on the left and right sides of Fig. 7(b). Fig. 7(c) expands the portion of the cross section corresponding only to the wire point targets to demonstrate the lack of change in resolution. Additionally, the full-width at half-maximum (FWHM; measured 6 dB below the peak for each target) for each point target labeled in Fig. 7(a) was compared, as shown in Table IV.

Images beamformed from RF data inferred with our method also show improved image quality in vivo. Specifically, it can be observed in Fig. 8 that the images of the human carotid in vivo generated from the full-channel (first column) and the CNN-derived (second column) datasets demonstrate clear visibility of the common carotid artery.

B-mode images formed from steered plane wave transmissions were coherently compounded to assess the image quality improvement compared to using only a single transmission. Figs. 10 and 11 demonstrate the effect of compounding images beamformed with CNN-inferred data using the same imaging scenarios as Figs. 6 and 9, respectively. Fig. 10 shows that coherent compounding reduces the spatial aliasing artifacts even without inference (light blue line), but our proposed CNN inference scheme results in reduced artifacts with fewer compounded transmissions.
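As a supplement to the resolution comparison above, here is a minimal sketch of a width measurement taken 6 dB below the peak, in the spirit of the Table IV values; the input arrays and the linear interpolation of the crossing points are assumptions, as the numerical procedure is not detailed in the text.

```python
import numpy as np

def width_6db(lateral_mm, envelope):
    """Width of an isolated point-target response measured 6 dB below its
    peak, from a lateral cross section of the envelope-detected image."""
    db = 20.0 * np.log10(envelope / envelope.max())
    above = np.flatnonzero(db >= -6.0)
    lo, hi = above[0], above[-1]

    def cross(i0, i1):
        # Lateral position where db crosses -6 dB between two samples.
        frac = (-6.0 - db[i0]) / (db[i1] - db[i0])
        return lateral_mm[i0] + frac * (lateral_mm[i1] - lateral_mm[i0])

    left = cross(lo - 1, lo) if lo > 0 else lateral_mm[0]
    right = cross(hi, hi + 1) if hi < len(db) - 1 else lateral_mm[-1]
    return right - left
```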

High-frame-rate imaging methods [3] are most effective when they are available to sonographers in real time. However, two barriers to enabling real-time high-frame-rate imaging are: 1) the immense data transfer rate needed to move the data from the sampling front end to the processing back end within the imaging system and 2) the computational cost of real-time beamforming at high frame rates.

Our work addresses the data transfer challenge of real-time high-frame-rate imaging via channel count reduction without sacrificing image quality. Simply reducing the transducer receive channel count results in a loss of post-beamformed image quality due to the introduction of artifacts and the loss of contrast. To compensate for the channel reduction, a neural network was trained, using over 60 000 in vivo sets of plane wave data (see Table I), to infer unacquired RF channel data from the acquired channel RF data. An encoder-decoder network structure (see Fig. 2) was designed to leverage the hyperbolic structure of the acoustic echoes to regenerate data from unacquired channels. Examining only the inferred RF data (see Fig. 4), our CNN correctly captures the expected spatiotemporal structure. After CNN inference, a set of full-channel RF data can be formed by interleaving the acquired and inferred channel data (see Fig. 1). B-mode images beamformed from the regenerated RF data closely matched the images beamformed from the full-channel RF data in grating lobe artifact power (see Fig. 6), point target resolution (see Fig. 7), and lumen-tissue contrast (see Fig. 9), even though only half of the channels were physically acquired.

Conventionally, when receiving acoustic echoes, the ultrasound transducer pitch obeys the Nyquist spatial sampling criterion. Violation of the Nyquist sampling rate leads to spatial aliasing, but other works have shown that the Nyquist rate can be overcome given knowledge of the signal structure [28], [29]. In the Appendix, we show from physical principles that spatiotemporal structure exists in the acoustic echoes received from a plane wave ultrasound transmission. Our experimental results showed that the proposed deep learning method successfully inferred the RF data structure using convolutional kernels. According to the parameters of the system used for this study (see Table I), the fundamental frequency was 5 MHz (wavelength of 0.308 mm at a 1540-m/s sound speed), while the effective receive pitch was double the wavelength (i.e., 0.61 mm for the pitch of the odd-indexed channels). Given this discrepancy, our results suggest that the Nyquist limit for transducer element pitch is too conservative in ultrasound, given an appropriate compensation technique. This conclusion is in the vein of similar insights from works such as [19].

The value of our framework's RF-to-RF nature is its ability to output raw RF data. This benefit can potentially enable the derivation of high-frame-rate imaging insights beyond B-mode imaging. Some high-frame-rate ultrasound methods, such as color Doppler [41], vector flow imaging [42], [43], or shear wave elastography [44], rely on the availability of the raw RF data or the phase to implement custom signal processing techniques that extract the desired information. Our proposed framework regenerates the raw RF data itself, unlike other channel count reduction methods that focus only on improving B-mode image quality, such as [19] or [45]. The fidelity of the reconstructed RF data (see Fig. 4) lends credence to our data reduction framework finding applications in non-B-mode ultrasound imaging, and evaluating our framework's applicability to color Doppler is an ongoing effort.
The ability to reduce the data at acquisition while maintaining a full RF dataset for image formation is essential for real-time display in a clinical setting. Our proposed framework demonstrated millisecond inference times (Section V-E). However, true real-time high-frame-rate ultrasound requires submillisecond inference to achieve 1000-fps imaging. There are several avenues for further improving the inference speed without compromising the fidelity of the regenerated RF data. Without modifying the network architecture, inference frameworks such as Nvidia's TensorRT (Nvidia, Santa Clara, CA, USA) can perform network optimizations and have shown up to twofold speedups of inference. Apart from software acceleration, next-generation hardware, such as Google's tensor processing unit (Google, Mountain View, CA, USA), is being produced and specifically tailored for neural network operations. These hardware advances can be integrated into an ultrasound open platform to add efficient CNN operations to the ultrasound imaging pipeline. Another possibility is to reduce the number of layers or kernels and correspondingly decrease the inference time. Nonetheless, in doing so, data recovery performance may suffer as the network learns to capture less.

High-frame-rate ultrasound imaging is also not limited to the imaging system used in this study. While this work was demonstrated using a specific probe (L14-5) and imaging setup, the proposed framework is not specific to this configuration.
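For completeness, the sketch below shows one way the millisecond-scale inference latency discussed above (4 ms on an RTX 2080 in our experiments) can be measured on a GPU with PyTorch; it reuses the `ChannelInferenceNet` sketch from earlier, the input dimensions are assumed placeholders, and this is not the benchmarking code used in the study.

```python
import torch

# Requires a CUDA-capable GPU and the ChannelInferenceNet sketch above.
net = ChannelInferenceNet().cuda().eval()
x = torch.randn(1, 1, 1600, 64, device="cuda")  # assumed frame size

with torch.no_grad():
    for _ in range(10):                 # warm-up excludes CUDA init costs
        net(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    net(x)                              # one inference pass
    end.record()

torch.cuda.synchronize()                # wait for the timed kernels
print(f"inference latency: {start.elapsed_time(end):.2f} ms")
```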
Fig. 12 (caption): Illustration of a five-element transducer transmitting an unsteered plane wave pulse (dashed line); elements (denoted with squares) and the scattering point source (circle) are projected onto a 2-D plane.

In summary, this work has shown that the image quality loss from halving the receive channel count can be mitigated in vitro and in vivo using a deep learning approach to regenerate the RF data from the unacquired channel set. This method not only helps to enable real-time high-frame-rate imaging but also has a natural extension to the 3-D ultrasound imaging domain, where the data transfer rate remains an even larger barrier.

APPENDIX

To explain the origin of the redundant structure present in the pulse-echo signals received over different array channels, let us consider the basic case of pulse-echo sensing for a point target. Fig. 12 illustrates a setup of five channels and a reflecting point source on a 2-D plane. The receiving elements have coordinates (0, x_0), (0, x_1), (0, x_2), (0, x_3), and (0, x_4), respectively, and the point target has coordinates (z_p, x_p). Assuming an ideal plane wave transmission at time t_0 = 0 and a constant speed of sound c_0, each transducer element (indexed by e) would receive the backscattered signal from the point source at the time

    t_e = \frac{z_p + \sqrt{z_p^2 + (x_e - x_p)^2}}{c_0}    (A1)

which can be rearranged into the hyperbolic form

    (c_0 t_e - z_p)^2 - (x_e - x_p)^2 = z_p^2.    (A2)

In this scenario, the signal received by each element is identical apart from the time of reception. Because of the isotropic reflection of the point source, the delays form a hyperbola in relation to the spacing between elements, as in (A2). This delay calculation is the fundamental principle behind DAS beamforming, but this spatial structure can also be leveraged in a novel way for data reduction.

From algebra and geometry, given only two pairs of values (t_e, x_e) and knowing that the delays form a hyperbolic shape, the equation becomes fully determined and can be obtained by solving a system of equations with two variables. Therefore, in the single point-source, five-element example without noise or other adverse effects, three of the five delayed signals can be considered to carry redundant information, as they can be predicted from the other two.
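This redundancy claim can be checked numerically. The sketch below, under the Fig. 12 geometry and equation (A1), recovers the point-source position from two (t_e, x_e) pairs and then predicts the remaining three delays; the element positions, target location, and solver choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import fsolve

C0 = 1540.0  # speed of sound (m/s)

def echo_time(x_e, x_p, z_p):
    # (A1): plane wave reaches depth z_p at z_p / c0; the echo then returns
    # over the element-to-target distance, tracing a hyperbola in (x_e, t_e).
    return (z_p + np.hypot(z_p, x_e - x_p)) / C0

# Five-element example: assumed positions and ground-truth point target.
x_elems = np.array([-0.6e-3, -0.3e-3, 0.0, 0.3e-3, 0.6e-3])
x_true, z_true = 0.2e-3, 20e-3
t_obs = echo_time(x_elems, x_true, z_true)

# Recover (x_p, z_p) from just two (t_e, x_e) pairs.
def residual(p):
    x_p, z_p = p
    return [echo_time(x_elems[0], x_p, z_p) - t_obs[0],
            echo_time(x_elems[1], x_p, z_p) - t_obs[1]]

# A reasonable initial guess keeps the solver on the physical branch.
x_fit, z_fit = fsolve(residual, x0=[0.0, 10e-3])

# The other three delays are then fully predicted, i.e., those received
# signals carry no new timing information.
print(np.allclose(echo_time(x_elems[2:], x_fit, z_fit), t_obs[2:]))
```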

Let us now extend the explanation to the case of a multi-point target phantom. Fig. 13 shows a snippet of the spatiotemporal structure of the RF data for such a phantom (the same one reported in Section IV-D; data were acquired using the same imaging parameters as those reported in Table I).

For each point target in the phantom, three of which are highlighted in yellow in Fig. 13(a), the hyperbolic shape of the corresponding echoes is shown in Fig. 13(b). While the