Deep Learning for Sensing Matrix Prediction in Computational Microwave Imaging With Coded-Apertures

This work aims to simplify the characterization process of coded-apertures for computational imaging (CI) at microwave frequencies. A major benefit of the presented technique is the minimization of the processing time needed to calculate the system sensing matrix for microwave CI-based compressive sensing applications. To achieve this, we propose a deep learning-based approach capable of generating the sensing matrix from features learned directly from the coded-aperture distribution. To avoid the vanishing gradient problem, the proposed deep learning network contains skip connections. Using a dataset of 1,000 testing samples, the average normalized mean-squared error (NMSE) between the sensing matrix generated by the conventional method and that predicted by the proposed network is 0.0036. Moreover, the average mean-squared error (MSE) between the images reconstructed using the conventional and the predicted sensing matrices is 0.00297. In addition to providing high-fidelity estimations with minimal error, we demonstrate that, using the trained network, the sensing matrix can be predicted in 0.212 s, corresponding to a 65% reduction in the computation time needed to calculate the sensing matrix. This is a significant step toward real-time operation of CI-based microwave imaging systems.


I. INTRODUCTION
Imaging using microwave frequencies has gained significant traction across a variety of applications, including security screening [1], medical imaging [2] and nondestructive testing [3]. One of the key advantages of microwave imaging is its ability to penetrate optically opaque materials using electromagnetic (EM) waves. Moreover, microwave radiation is non-ionizing, making it safe for human exposure, in contrast to X-ray imaging [4]. Conventional microwave imaging approaches are often based on synthetic aperture radar (SAR) techniques. SAR typically relies on a raster scan [5], in which the radar transmitters and receivers are scanned mechanically or electronically to probe the imaging scene point by point, sequentially, at the Nyquist interval [6]. While these methods yield good-quality image reconstructions, the raster scan requirement introduces several drawbacks, including a slow data acquisition process and increased hardware complexity. An alternative technique that can improve the data acquisition speed is computational imaging (CI) enabled by coded-apertures [7], [8], [9]. CI systems probe the scene under investigation using a set of spatially incoherent radiation patterns (or measurement modes) [10]. These modes can be generated in different ways. One approach employs a frequency-diverse antenna, i.e., a metasurface antenna populated with randomly distributed radiating elements of different sizes, whose behaviour changes as a function of the radiation frequency [11]. As a consequence, the antenna exhibits quasi-orthogonal radiation patterns. One issue with this approach is that it requires large frequency bandwidths to acquire enough information from the imaging scene. A different technique, which is the one considered in this paper, relies on dynamically reconfigured coded-apertures to generate the diverse measurement modes [12]. In this case, a metasurface antenna is populated with radiating elements that can be tuned on and off dynamically by means of a switching element such as a PIN diode [4], [10]. The quasi-orthogonal radiation patterns are then generated by changing the tuning states of the radiating elements, and each aperture configuration, synthesized with a corresponding set of on/off radiating elements, is considered a mask.
The associate editor coordinating the review of this manuscript and approving it for publication was Qingchun Chen.
The advantage of CI is that the number of data acquisition channels can be reduced significantly, decreasing the hardware complexity [4]. The reason is that the coded-aperture-based antennas used in CI systems encode the scene information onto the measurement modes they radiate, achieving a physical-layer compression [4]. However, CI techniques may require a considerable amount of computational power and time, particularly when the aperture is electrically large and when electrically large scenes are imaged. In particular, one of the main computational burdens of most CI systems is the calculation and storage of the sensing matrix. For instance, in [13], the authors showed that for imaging electrically large scenes, such as those including human-size targets, the scene can consist of a significant number of unknowns (or pixels), even when the imaging domain is constrained to a reduced volume enclosing the target itself. This, in turn, makes the calculation of the sensing matrix time-consuming and hardware-intensive: the calculations in [13] require more than 90 GB of memory in single precision.
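To give a sense of how storage scales, the memory needed for a dense sensing matrix can be estimated as M × N elements times the bytes per element, where a single-precision complex value (complex64) occupies 8 bytes. The sketch below is illustrative only: the function names are hypothetical, and the larger problem size in the usage lines is an arbitrary example, not the configuration used in [13].

```python
def sensing_matrix_bytes(num_modes: int, num_pixels: int, bytes_per_element: int = 8) -> int:
    """Memory footprint of a dense M x N sensing matrix.

    bytes_per_element defaults to 8, i.e. single-precision complex
    (4-byte real part + 4-byte imaginary part).
    """
    return num_modes * num_pixels * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


# Size used later in this paper: M = 70 masks, N = 196 pixels.
small = sensing_matrix_bytes(70, 196)
# Hypothetical electrically large problem (illustrative numbers only).
large = sensing_matrix_bytes(20_000, 600_000)

print(f"M=70, N=196: {small} bytes")
print(f"hypothetical M=20000, N=600000: {to_gib(large):.1f} GiB")
```

The footprint grows with the product M × N, so enlarging the scene (more pixels) and the number of modes together quickly pushes storage into the tens of gigabytes; double precision (complex128) doubles every figure.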
To achieve higher efficiency in CI, several authors have proposed generating new sensing matrices for compressive sensing. For example, [14] proposed a sensing matrix generation method in which a Bernoulli sensing matrix with more columns than required is generated, after which the columns with minimal contribution are removed by means of inner products and threshold comparisons. Compared with directly generating a conventional Bernoulli sensing matrix, this method enhances the column orthogonality of the resulting Bernoulli sensing matrix. In [15], a variant of a deterministic random sensing matrix was designed. Compared with conventional sensing matrices, including Bernoulli and Gaussian matrices, this sensing matrix was shown to provide more accurate reconstructions. Other works have investigated and compared the resolution of CI reconstructions obtained with different types of sensing matrices [16]. The results of these studies suggested that the Hadamard random matrix can provide a higher peak signal-to-noise ratio (PSNR) than other types of random matrices.
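The column-pruning idea attributed to [14] can be sketched as follows. This is a simplified, hypothetical reading, not the exact procedure of [14]: it draws an oversized ±1 Bernoulli matrix and greedily discards columns whose normalized inner product with an already-kept column exceeds a threshold, which lowers the mutual coherence (i.e., improves the column orthogonality) of the result. The threshold, the matrix sizes and all function names are illustrative assumptions.

```python
import math
import random


def bernoulli_columns(rows, cols, rng):
    """Random Bernoulli matrix with entries +-1/sqrt(rows), stored column by column."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.choice((-scale, scale)) for _ in range(rows)] for _ in range(cols)]


def abs_cosine(a, b):
    """|<a, b>| / (||a|| ||b||): normalized inner product between two columns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b)


def mutual_coherence(columns):
    """Largest normalized inner product over all distinct column pairs."""
    return max(
        (abs_cosine(columns[i], columns[j])
         for i in range(len(columns)) for j in range(i + 1, len(columns))),
        default=0.0,
    )


def prune_columns(columns, threshold, target_cols):
    """Greedily keep columns whose inner product with every kept column stays below threshold."""
    kept = []
    for col in columns:
        if all(abs_cosine(col, k) <= threshold for k in kept):
            kept.append(col)
            if len(kept) == target_cols:
                break
    return kept


rng = random.Random(0)
oversized = bernoulli_columns(rows=32, cols=200, rng=rng)  # more columns than needed
pruned = prune_columns(oversized, threshold=0.5, target_cols=40)
print(len(pruned), round(mutual_coherence(oversized[:40]), 3), round(mutual_coherence(pruned), 3))
```

By construction, every pair of kept columns satisfies the threshold, so the pruned matrix's mutual coherence never exceeds it, whereas the first 40 raw columns carry no such guarantee.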
Some studies have leveraged deep learning techniques to design the sensing matrix in CI. In [35], a network containing a sampling sub-network and a reconstruction sub-network was designed. Original images and their corresponding reconstructions were regarded as the input and the output of the network, respectively. The sampling sub-network compressed the input images into compressed measurements, while the reconstruction sub-network reconstructed images from the obtained compressed measurements. Compared with the Gaussian random matrix, the sensing matrix designed by the network in [35] was shown to provide higher PSNR in image reconstruction. Reference [36] proposed a network consisting of cascaded fully connected and multistage convolutional layers, where both the input and the output of the network were images. The network implemented an end-to-end learning process: the model was trained to learn both the compressed measurements extracted from the input image and the reconstruction process applied to the respective compressed representations. Thus, the purpose of [36] was not only to reconstruct images from the compressed measurements extracted from the input image but also to design a sensing matrix with lower average coherence with a given basis than a randomly selected sensing matrix. Moreover, [36] presented further comparisons showing that the sensing matrix designed by the proposed network provides reconstructed images with higher PSNR than those generated with other sensing matrices. In the fully connected neural network in [36], the compressed measurements were the output when the original scene information was provided as input; consequently, the weights of the fully connected layer in this network constituted the corresponding sensing matrix. A similar approach was proposed in [37], where an autoencoder architecture was leveraged to simultaneously estimate the sensing matrix and restore the original sparse signal. Specifically, the encoder of the autoencoder was configured to emulate the linear measurement process, while the decoder approximated the sparse signal recovery from the undersampled linear measurements. The results showed that the proposed network achieves significantly higher image reconstruction performance than methods that do not apply deep learning techniques. However, the methodology employed in our work for learning the sensing matrix is fundamentally different from these previous studies. In prior works, the sensing matrix is typically derived with the reconstructed image as the input. In contrast, our work calculates the sensing matrix with a neural network directly from the transmit and receive aperture fields inherent to a CI system. Notably, the approaches proposed in previous works necessitate the prior reconstruction of the scene as a prerequisite for sensing matrix prediction, which is a potentially challenging task, whereas our method relies exclusively on the aperture fields, eliminating the need for scene reconstruction. As shown by the results in Section IV, our approach computes the sensing matrix in a considerably reduced time frame compared with conventional techniques [4], [13], [38].
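The observation that the weights of a bias-free fully connected layer act as the sensing matrix can be illustrated with a toy example. In the sketch below, the layer is trained on scene/measurement pairs produced by a hidden matrix `phi` using a squared-error loss, and its weights converge to `phi`. The sizes, learning rate and function names are illustrative assumptions, and plain full-batch gradient descent stands in for whatever optimizer [36] and [37] used.

```python
import random


def matvec(w, x):
    """y = W x for a bias-free fully connected layer (W stored as rows)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def train_linear_encoder(xs, ys, m, n, lr=0.5, epochs=500):
    """Full-batch gradient descent on the mean squared error between W x and y."""
    w = [[0.0] * n for _ in range(m)]
    batch = len(xs)
    for _ in range(epochs):
        grad = [[0.0] * n for _ in range(m)]
        for x, y in zip(xs, ys):
            residual = [p - t for p, t in zip(matvec(w, x), y)]
            for i in range(m):
                for j in range(n):
                    grad[i][j] += residual[i] * x[j] / batch  # gradient of 0.5*||Wx - y||^2
        for i in range(m):
            for j in range(n):
                w[i][j] -= lr * grad[i][j]
    return w


rng = random.Random(1)
M, N = 3, 5  # toy sizes: 3 measurements of a 5-pixel scene
phi = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]  # "unknown" sensing matrix
scenes = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(40)]
measurements = [matvec(phi, s) for s in scenes]

w_learned = train_linear_encoder(scenes, measurements, M, N)
max_err = max(abs(a - b) for ra, rb in zip(phi, w_learned) for a, b in zip(ra, rb))
print(f"max |W - phi| = {max_err:.2e}")
```

Because the layer is linear and has no bias, its forward pass is exactly a measurement g = Wx, so once training converges the weight matrix can be read off as the sensing matrix.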
As outlined earlier, calculating the sensing matrix of an imaging system from the transmitter and receiver aperture fields is a computationally complex task. This article proposes to leverage deep learning techniques to generate the sensing matrix of a CI system more efficiently. To the best of our knowledge, this is the first time that deep learning techniques have been explored to build the sensing matrix of coded-aperture-based microwave CI systems. With the help of the proposed network, the computation time needed to generate the sensing matrix is reduced to less than half of the time required by the conventional technique, which is explained in Section II. Results show that the sensing matrix can be successfully learned, enabling the retrieval of images of similar resolution to those obtained with the sensing matrix computed using the conventional, computationally expensive method.
The main contributions of this paper are as follows:
• A novel CNN is developed to predict the sensing matrix of a CI system, given the transmit and receive aperture fields. The calculation of the sensing matrix from the aperture fields is a computationally expensive task that requires the processing of large data volumes. Prior literature has presented hardware parallelization strategies aimed at accelerating the computation of these matrices. In our research, we propose an alternative approach that eliminates the need for additional parallelization methods to enhance the computational efficiency of the process. To the best of our knowledge, this is the first time that the sensing matrix of a coded-aperture CI system is predicted directly from the coded-aperture fields using deep learning methodologies.
• The proposed network architecture designed for the CI framework incorporates skip connections, inspired by the residual network design. Additionally, in this architecture, the batch normalization layer is positioned before the activation and convolutional layers. This arrangement improves information transmission from the input to the output, facilitating effective training of the network. Notably, given the complex-valued nature of the data, a transformation is introduced in the architecture that separates the complex-valued data into two channels, the real and imaginary components, ensuring compatibility with the chosen network architecture.
• The viability of the approach is demonstrated by evaluating the proposed network's performance under realistic conditions, with a specific focus on system noise. The paper includes an in-depth analysis of the relationship between the model's performance and the system noise level. The proposed network not only achieves a substantial reduction in computation time but also performs well in both ideal and noisy scenarios. These findings demonstrate the significance and merit of this work.
The article is organized as follows. Section II provides an introduction to the CI paradigm and the image reconstruction algorithms. Section III explains the proposed deep learning architecture, which is based on a CNN. The achieved results are discussed in Section IV, where they are also compared with those obtained with the conventional technique. Finally, the main conclusions are drawn in Section V.

II. COMPRESSIVE RADAR IMAGING
A. MATHEMATICAL BACKGROUND FOR COMPUTATIONAL IMAGING
FIGURE 2. The proposed network consists of two residual blocks and two convolutional layers with 1 × 1 × 1 filters. The structure of the residual blocks is inspired by [39]. Each convolutional layer is composed of a batch normalization operation, a non-linear activation function and a 3D convolution operation. The two convolutional layers with 1 × 1 × 1 filters are used to change the number of channels.
Whereas the background of CI was introduced briefly in Section I, the mathematical framework of CI is given here in order to explain how the CI problem is addressed. Following the first Born approximation [11], [40], [41], the scattered field, $E_{\mathrm{scat}}$, contains the reflectivity information of each pixel of the scene, $\rho(\mathbf{r})$ [42]:
$$E_{\mathrm{scat}}(\mathbf{r}) = \rho(\mathbf{r})\,E_{\mathrm{inc}}(\mathbf{r}),$$
where $\mathbf{r}$ denotes the coordinate vector of each pixel of the scene, and $E_{\mathrm{inc}}$ is the incident field from the aperture. The signal measured at the receiver, $\mathbf{g}$, is given in matrix form by:
$$\mathbf{g} = \mathbf{H}\boldsymbol{\rho},$$
where $\mathbf{H}$ is the sensing matrix. We note that, in this work, bold font is used to denote vectors and matrices. $\mathbf{H}$ is given by the dot product of the fields radiated by the transmitter and the receiver, respectively, propagated to the scene [43].
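For scalar fields sampled at the scene pixels, the relation between H and the transmit/receive fields reduces to an element-wise product per mode and pixel, and each measurement is a matrix-vector product. The following sketch assumes scalar fields (so the dot product collapses to a multiplication) and uses hypothetical function names and toy values.

```python
def sensing_matrix(e_tx, e_rx):
    """H[m][n] = E_tx(m, r_n) * E_rx(m, r_n) for scalar fields.

    e_tx, e_rx: M rows (modes) x N columns (scene pixels) of field samples.
    Scalar fields are an assumption; with vector fields this becomes a dot product.
    """
    return [[a * b for a, b in zip(row_tx, row_rx)] for row_tx, row_rx in zip(e_tx, e_rx)]


def measure(h, rho):
    """Forward model g = H rho: one (complex) measurement per mode."""
    return [sum(h_mn * rho_n for h_mn, rho_n in zip(row, rho)) for row in h]


# Toy example: M = 2 modes, N = 2 pixels.
e_tx = [[1, 2], [3, 4]]
e_rx = [[1, 1], [2, 0.5]]
H = sensing_matrix(e_tx, e_rx)
g = measure(H, [1, 1])
print(H, g)
```

Each row of H corresponds to one mask, so H must be recomputed whenever the set of masks changes; this per-mask field product over all pixels is the cost the proposed network aims to avoid.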
The transmitter and receiver apertures can be modelled using a set of equivalent sources. In particular, each aperture is discretized into $N_s$ equivalent sources, and the radiated field corresponding to the $m$-th measurement mode at a point $\mathbf{r}$ of the scene is given by:
$$E(m,\mathbf{r}) = \sum_{s=1}^{N_s} \alpha_{m,s}\,\frac{e^{-j2\pi f\,|\mathbf{r}-\mathbf{r}'_s|/c}}{|\mathbf{r}-\mathbf{r}'_s|},$$
where $\alpha_{m,s}$ denotes the field value of the $s$-th equivalent source for the $m$-th measurement mode, $\mathbf{r}'_s$ indicates its position, $f$ is the frequency, and $c$ is the speed of light in free space.
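A minimal sketch of the equivalent-source model follows. It assumes that each source contributes α·exp(−j2πfR/c)/R at distance R (a scalar free-space Green's function with the 1/(4π) constant omitted) and models an on/off mask by setting α to 1 or 0; the function name and the four-source aperture are hypothetical.

```python
import cmath
import math

C0 = 299_792_458.0  # speed of light in free space (m/s)


def radiated_field(alphas, sources, point, freq):
    """Field of one measurement mode at `point` from a set of equivalent sources.

    Assumes each source contributes alpha * exp(-j*2*pi*f*R/c) / R, with R the
    source-to-point distance (scalar Green's function, 1/(4*pi) dropped).
    """
    k = 2 * math.pi * freq / C0  # free-space wavenumber
    field = 0j
    for alpha, src in zip(alphas, sources):
        r = math.dist(point, src)
        field += alpha * cmath.exp(-1j * k * r) / r
    return field


freq = 15e9
lam = C0 / freq
# Hypothetical 4-source aperture with an on/off mask (1 = element on, 0 = off).
sources = [(x * lam / 2, 0.0, 0.0) for x in range(4)]
mask = [1, 0, 1, 1]
e = radiated_field(mask, sources, (0.0, 0.0, 10 * lam), freq)
print(abs(e), cmath.phase(e))
```

Changing the mask changes only the weights α, so evaluating a new mode means re-summing the same propagation terms over every source and every scene point; with 256 sources and 196 pixels per mode this sum dominates the conventional sensing-matrix computation.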
The measured signal, $\mathbf{g}$, is a vector of size $M \times 1$, where $M$ is the number of measurement modes, and the sensing matrix has size $M \times N$, where $N$ is the number of pixels in the imaging scene.
As can be observed, the sensing matrix is not necessarily square, i.e., $M \neq N$. In order to estimate the reflectivity of the scene, $\boldsymbol{\rho}_{\mathrm{rec}}$, different techniques can be used, such as matched filtering [40]:
$$\boldsymbol{\rho}_{\mathrm{rec}} = \mathbf{H}^{\dagger}\mathbf{g},$$
where $\mathbf{H}^{\dagger}$ denotes the conjugate transpose of the sensing matrix.
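A minimal sketch of matched-filter reconstruction on a toy undersampled system (M = 2 modes, N = 3 pixels) is given below; the function names and values are hypothetical. It shows that matched filtering applies the adjoint H† rather than an inverse, which is why it works for non-square H.

```python
def measure(h, rho):
    """Forward model g = H rho."""
    return [sum(h_mn * rho_n for h_mn, rho_n in zip(row, rho)) for row in h]


def matched_filter(h, g):
    """rho_rec = H^dagger g, i.e. rho_rec[n] = sum_m conj(H[m][n]) * g[m]."""
    num_pixels = len(h[0])
    return [sum(h[m][n].conjugate() * g[m] for m in range(len(h))) for n in range(num_pixels)]


# Undersampled toy system: M = 2 modes, N = 3 pixels (M < N).
H = [[1, 0, 0],
     [0, 1j, 0]]
rho = [2, 3, 5]
g = measure(H, rho)
rho_rec = matched_filter(H, g)
print(g, rho_rec)
```

The third pixel is not illuminated by either mode, so its reflectivity (5) cannot be recovered and the estimate there is 0, a small-scale picture of what undersampling (M < N) means.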

B. IMAGING SETUP
In this paper, the imaging frequency is set to 15 GHz. The imaging scene is located at a distance of 10λ from the aperture plane, where λ denotes the free-space wavelength at the imaging frequency of 15 GHz. The imaging distance of 10λ was selected to position the imaging scene within the Fresnel region of the transmit and receive apertures [44]. The size of the imaging scene is 4λ × 4λ.
The imaging system consists of two dynamically tunable apertures. These two apertures, separated by 0.2λ, operate in a bi-static setting, as depicted in Fig. 1. The two apertures are identical, each with a size of 8λ × 8λ. As previously explained, each aperture is modelled using a set of equivalent sources. In particular, the aperture is discretized into a 16 × 16 grid of positions (i.e., 256 points) separated by half a wavelength (λ/2). Also, the same set of masks (i.e., measurement modes) is considered for both apertures. We note that, whereas these parameters are chosen for our simulations, the system parameters can, without loss of generality, be scaled to other frequencies and varied to study other scenarios, depending on the requirements of the desired application. The main parameters of the imaging setup used in this work are summarized in Table 1.
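The geometry described above can be laid out in code to make the sampling explicit. This sketch assumes that the two apertures sit side by side along x in the plane z = 0 with a 0.2λ edge-to-edge gap, and that the 4λ × 4λ scene is divided into 14 × 14 pixel cells (the N = 196 used in Section III). The layout and names are illustrative and are not taken from Table 1.

```python
C0 = 299_792_458.0          # speed of light (m/s)
FREQ = 15e9                 # imaging frequency (Hz)
LAM = C0 / FREQ             # free-space wavelength, about 20 mm


def grid(n_side, spacing, center_x=0.0, z=0.0):
    """n_side x n_side points centred on (center_x, 0, z), spaced `spacing` apart."""
    offset = (n_side - 1) / 2
    return [(center_x + (i - offset) * spacing, (j - offset) * spacing, z)
            for i in range(n_side) for j in range(n_side)]


# Assumption: apertures side by side along x, each 8*LAM wide, edges 0.2*LAM apart,
# so their centres sit at x = -/+ (4 + 0.1) * LAM in the plane z = 0.
half_width = 4 * LAM + 0.1 * LAM
tx_sources = grid(16, LAM / 2, center_x=-half_width)
rx_sources = grid(16, LAM / 2, center_x=+half_width)
# Assumption: the 4*LAM x 4*LAM scene is split into 14 x 14 pixel cells (N = 196).
scene_pixels = grid(14, 4 * LAM / 14, z=10 * LAM)

print(f"lambda = {LAM * 1e3:.2f} mm, {len(tx_sources)} sources per aperture, "
      f"{len(scene_pixels)} pixels")
```

Sixteen sources at λ/2 spacing cover exactly 8λ when each source is treated as the centre of a λ/2 cell, which is why the aperture size and the 16 × 16 discretization are consistent.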

III. DEEP LEARNING NETWORK
A. NETWORK ARCHITECTURE
Details of the deep learning network considered in this work are provided in this section. The proposed network employs residual networks [45], as depicted in Fig. 2. The short skip connection shown in the figure is the characteristic feature of a residual network. This connection combines the input and the output data across multiple layers in order to avoid the information loss resulting from multiple layers of computation. In other words, it enhances gradient propagation.
As shown in Fig. 2, the proposed network is composed of two residual blocks, two convolutional layers with 1 × 1 × 1 filters, and one max-pooling layer, which reduces the in-plane size of the feature maps. Each residual block contains seven layers, where each layer consists of a batch normalization, a leaky rectified linear unit (ReLU) [46], [47] and a three-dimensional (3D) convolutional operation. In this work, a leaky ReLU with a negative slope coefficient of 0.1 is used. Batch normalization is generally applied after the convolution and activation operations. However, as shown in Fig. 2, in the proposed network the batch normalization layer is placed before the activation and convolution steps. With this ordering, information can be transmitted directly from one residual block to another in both the forward and the backward pass of the network, which makes the network easier to train, as shown in [39], [48], and [49]. Fig. 2 also shows the structure of the residual block. All 3D convolutional layers in the residual blocks use filters of the same size, 3 × 3 × 1, and the numbers of filters are 8, 16, 32, 64, 64, 64, and 64, respectively. Two skip connections are included in each residual block to avoid the vanishing gradient problem [50], [51]. As shown in Fig. 2, the first skip connection links the input of the residual block with the output of convolutional layer IV, thus spanning the first four convolutional layers, while the second skip connection links the input of convolutional layer V with the output of the residual block.
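The pre-activation ordering and the two skip connections can be illustrated with a single-channel, one-dimensional toy in plain Python. This is a sketch under simplifying assumptions, not the authors' implementation: channel counts (8 to 64) are ignored, a 3-tap 1D convolution stands in for the 3 × 3 × 1 3D convolution, and normalization uses the statistics of one feature vector rather than a mini-batch. All function names are hypothetical.

```python
import math


def batch_norm(x, eps=1e-5):
    """Normalize to zero mean / unit variance (stand-in for batch statistics, no learnable scale)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def leaky_relu(x, slope=0.1):
    """Leaky ReLU with the 0.1 negative slope used in this work."""
    return [v if v >= 0 else slope * v for v in x]


def conv3tap(x, kernel):
    """1D analogue of a 3x3x1 convolution: 3-tap kernel, zero padding keeps the length."""
    n = len(x)
    return [sum(w * x[i + k - 1] for k, w in enumerate(kernel) if 0 <= i + k - 1 < n)
            for i in range(n)]


def preact_layer(x, kernel):
    """Pre-activation ordering: batch normalization -> activation -> convolution."""
    return conv3tap(leaky_relu(batch_norm(x)), kernel)


def residual_block(x, kernels):
    """Seven pre-activation layers with two identity skip connections."""
    assert len(kernels) == 7
    h = x
    for kernel in kernels[:4]:          # layers I-IV
        h = preact_layer(h, kernel)
    h = [a + b for a, b in zip(x, h)]   # skip 1: block input + output of layer IV
    layer5_input = h
    for kernel in kernels[4:]:          # layers V-VII
        h = preact_layer(h, kernel)
    return [a + b for a, b in zip(layer5_input, h)]  # skip 2: input of layer V + block output


x = [1.0, -2.0, 3.0]
print(residual_block(x, [[0.0, 0.0, 0.0]] * 7))
```

With all kernels set to zero, the block reduces exactly to the identity: the input reaches the output through the skip connections without passing through any normalization or activation, which is the direct information path that pre-activation residual designs rely on.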
Moreover, in order to match the size of the input data with that of the output, a max-pooling layer with a stride of 1 and a pool size of 3 × 3 × 1 is used. The third dimension of the filters and pooling layers is kept at 1 based on the assumption that each aperture distribution scenario (mask) is independent of the others and should not influence or interfere with them. The first convolutional layer with 1 × 1 × 1 filters has 64 filters, which increases the number of channels for the subsequent skip connections. Meanwhile, the second 1 × 1 × 1 convolutional layer, at the end of the proposed network, has only two filters, which decreases the number of channels to two (the real and imaginary parts).
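The shape bookkeeping can be checked with a few lines of arithmetic. The sketch assumes 'same' padding in the 3 × 3 × 1 convolutions, no padding in the pooling layer, and places the pooling after the residual blocks (the exact position is shown in Fig. 2 and is an assumption here); function names are hypothetical. A 3 × 3 window with stride 1 and no padding maps each 16-sample in-plane axis to 16 − 3 + 1 = 14 samples, matching the 14 × 14 output grid, while the 1 × 1 × 1 layers only change the channel count.

```python
def pooled_size(size, pool, stride=1, padding=0):
    """Output length along one axis of a pooling window."""
    return (size + 2 * padding - pool) // stride + 1


BLOCK_FILTERS = [8, 16, 32, 64, 64, 64, 64]  # filters of layers I-VII in each residual block


def output_shape(height=16, width=16, masks=70, num_blocks=2):
    """Trace (height, width, masks, channels) from the 2-channel input to the output.

    Assumptions: 3x3x1 convolutions use 'same' padding (height and width unchanged)
    and the max-pooling layer follows the residual blocks.
    """
    channels = 64  # first 1x1x1 convolution lifts the 2 input channels to 64
    for _ in range(num_blocks):
        block_input = channels
        for layer, filters in enumerate(BLOCK_FILTERS, start=1):
            channels = filters
            if layer == 4:
                # skip 1 adds the block input to the output of layer IV
                assert channels == block_input, "skip 1 needs matching channels"
                layer5_input = channels
        # skip 2 adds the input of layer V to the output of layer VII
        assert channels == layer5_input, "skip 2 needs matching channels"
    # 3x3x1 max-pooling, stride 1, no padding: only height and width shrink
    height, width = pooled_size(height, 3), pooled_size(width, 3)
    channels = 2  # last 1x1x1 convolution: 64 -> 2 (real and imaginary parts)
    return (height, width, masks, channels)


print(output_shape())
```

The channel checks also show why the first 1 × 1 × 1 layer uses 64 filters: both skip connections add tensors that must have the same channel count, and layers IV and VII each output 64 channels.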

B. DATA GENERATION
As previously discussed, the fields radiated by each aperture are modelled by a set of equivalent sources. In particular, each aperture is discretized into 256 points uniformly distributed on a 16 × 16 regular grid. As a consequence, each measurement mode, or mask, is discretized into 256 complex values (representing amplitude and phase). The number of masks considered is 70, which was found to be optimal by means of a singular value decomposition (SVD) analysis [52], as depicted in Fig. 3. The total number of masks, M = 70, is significantly smaller than the number of pixels in the scene, N = 196; hence, the scene is undersampled (M < N). It should also be noted that in this work we leverage graphics processing unit (GPU) parallelization, and the size of the sensing matrix (M × N) cannot be increased further due to GPU memory limitations during training. However, the size of the imaging problem can be increased, without loss of generality, if a GPU with larger memory is used. Each input data sample consists of the 70 aperture field distributions corresponding to the M = 70 masks (or measurement modes) considered to image the scene. Thus, referring to Fig. 2, the input and the output of the proposed network are the aperture fields and the sensing matrix of the system, respectively. Since the aperture fields and the sensing matrix are complex-valued, the real and imaginary parts are separated into two channels in the network. The size of the input data in each channel is 16 × 16 × 70. The size of the output data is 70 × 196 × 2, i.e., the number of masks times the number of pixels of the scene times two (to account for the real and imaginary parts). To avoid an additional reshape layer in the network, which would entail additional training time, the output is formatted as 14 × 14 × 70 × 2 when a square scene is considered.
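The data layout can be sketched as follows: each complex array is split into real and imaginary channels, and the flattened pixel (or source) index is folded back into a square grid, giving 16 × 16 × 70 × 2 for the aperture fields and 14 × 14 × 70 × 2 for the sensing matrix. The row-major index ordering, the [x][y][m][c] layout and the function names are assumptions for illustration, and dummy values stand in for real fields.

```python
def to_grid(rows, side):
    """Rearrange M rows of side*side complex values into [x][y][m][c], c = 0 real, 1 imag.

    Assumes row-major ordering of the pixel/source index within each row.
    """
    grid = [[[[0.0, 0.0] for _ in rows] for _ in range(side)] for _ in range(side)]
    for m, row in enumerate(rows):
        if len(row) != side * side:
            raise ValueError("each row must contain side*side values")
        for idx, value in enumerate(row):
            x, y = divmod(idx, side)
            grid[x][y][m][0] = value.real
            grid[x][y][m][1] = value.imag
    return grid


def shape(nested):
    """Shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


M = 70
apertures = [[complex(m, s) for s in range(256)] for m in range(M)]   # dummy aperture fields
sensing = [[complex(n, -m) for n in range(196)] for m in range(M)]    # dummy sensing matrix
print(shape(to_grid(apertures, 16)), shape(to_grid(sensing, 14)))
```

Folding the 196 pixels into a 14 × 14 grid lets the output keep the same spatial layout as the convolutional feature maps, which is what removes the need for a separate reshape layer.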

C. TRAINING PROCESS OF THE NETWORK
The proposed network has been trained using 14,000 randomly selected training pairs. Training and validation are performed on a CUDA platform with an NVIDIA Quadro RTX A5000 GPU with 16 GB of dedicated memory. Testing, in contrast, is carried out on a 12th Generation Intel(R) Core(TM) i7-1265U CPU. The image reconstructions obtained with the conventional procedure and with the proposed network were both implemented in MATLAB on the same 12th Generation Intel(R) Core(TM) i7-1265U CPU.
Before training, the entire dataset is standardized. Standardizing the data generally speeds up learning and leads to faster convergence, since the processed data elements have a mean of zero and a variance of one [53]. The standardization is performed as follows:
$$\mu = \frac{1}{U_s U_e}\sum_{i=1}^{U_s U_e} x_i, \qquad \sigma = \sqrt{\frac{1}{U_s U_e}\sum_{i=1}^{U_s U_e}\left(x_i - \mu\right)^2}, \qquad Z = \frac{x_i - \mu}{\sigma},$$
where $U_s$ indicates the number of training samples and $U_e$ denotes the number of elements in a sample. $Z$ denotes each processed element, $\mu$ indicates the mean value of all training dataset elements, $\sigma$ indicates the standard deviation of all training dataset elements, and $x_i$ denotes the unprocessed elements. The training (14,000 samples), validation (1,000 samples) and testing (1,000 samples) dataset elements are all standardized using the mean and the standard deviation of the training dataset, and the standardization is applied to both the input and the output of all datasets. The choice of loss function is important, as it drives the optimization of the model [54]. In this work, the mean squared error (MSE) is used as the loss function:
$$\mathrm{MSE} = \frac{1}{d}\sum_{i=1}^{d}\left(F_O(i) - F_P(i)\right)^2,$$
where $d$ indicates the number of elements of the matrix $F$, which is the same for both the ground truth and the prediction, and $F_O(i)$ and $F_P(i)$ denote the $i$-th elements of the ground-truth and the predicted matrices, respectively. A mini-batch size of 8 was chosen for the proposed network architecture due to GPU memory limitations. The weights in the network are initialized using the He initialization technique [55], which draws the weights from a normal distribution with a mean of zero and a variance of $2/t$, where $t$ is the total number of elements of the input. The network is optimized with the Adam optimizer [56] using the following exponentially decaying learning rate:
$$L_k = L_0\,\beta^{k},$$
where the initial learning rate, $L_0$, is $1 \times 10^{-4}$, the decay rate, $\beta$, is 0.96, and $k$ is the epoch number. The maximum number of epochs is set to 10,000, but the model is trained for only 100 epochs because the loss value is already low and stable at the 100th epoch. This is a case of early stopping, where training is stopped before the maximum number of epochs is reached because the model has already converged. The learning curves (training and validation) for the 100 epochs are presented in Fig. 4. The curves indicate effective training: the validation MSE loss converges to the training MSE loss. Although the loss value at the 100th epoch is sufficiently low, a lower loss could still be achieved by training the network for more epochs; however, this would require considerably more computational time, given the size of the training dataset.
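The preprocessing and training settings can be summarized in a few helper functions. This is a plain-Python sketch, not the authors' training code: it assumes that the standard deviation is the population value (division by U_s·U_e), treats the real and imaginary channels as ordinary real elements, and uses hypothetical function names. The hyperparameters (initial learning rate 1e-4, decay rate 0.96, He variance 2/t) are those stated in the text.

```python
import math


def fit_standardizer(train_samples):
    """Mean and (population) standard deviation over every element of every training sample."""
    flat = [x for sample in train_samples for x in sample]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in flat) / len(flat))
    return mu, sigma


def standardize(samples, mu, sigma):
    """Apply Z = (x - mu) / sigma using training statistics (also for validation/test data)."""
    return [[(x - mu) / sigma for x in sample] for sample in samples]


def mse(truth, pred):
    """Mean squared error over the d elements of a (flattened) matrix."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


def he_std(fan_in):
    """Standard deviation of He initialization: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def learning_rate(epoch, lr0=1e-4, decay=0.96):
    """Exponentially decaying learning rate L_k = L_0 * beta**k."""
    return lr0 * decay ** epoch


train = [[1.0, 2.0], [3.0, 4.0]]
val = [[2.5, 5.0]]
mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
val_z = standardize(val, mu, sigma)
print(mu, round(sigma, 4), val_z, [learning_rate(k) for k in (0, 1, 99)])
```

Fitting the statistics on the training set only, and reusing them for validation and test data, keeps information from the held-out samples out of the preprocessing. After 99 epochs the schedule has decayed the learning rate to roughly 1.8% of its initial value.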

D. NETWORK PARAMETERS AND TRAINING PERFORMANCE
One of the most important factors in the learning process is the number of residual blocks used in the network architecture. Increasing the number of residual blocks increases the number of trainable parameters in the network, so that the network can better learn the underlying behavior of the dataset and the latent connection between the input and the output data [57]. However, a larger network requires more training time; hence, there is a trade-off between the size of the network architecture and the accuracy of the results. Additionally, as explained in [58], more complex networks may treat the noise in the datasets as useful features and thus fail to represent the genuine patterns of the data. As a result, more complex networks may perform extremely well on the training set without further improving their performance on unseen samples. This phenomenon is usually referred to as over-fitting [59]. Hence, choosing the right number of residual blocks is a crucial step in the design of the network architecture.
In order to compare the predicted (learned) sensing matrix with the original sensing matrix, the normalized MSE (NMSE) is used [60]. The NMSE is given by:
$$\mathrm{NMSE} = \frac{\lVert \mathbf{H}_O - \mathbf{H}_P \rVert_F^{2}}{\lVert \mathbf{H}_O \rVert_F^{2}},$$
where $\mathbf{H}_O$ and $\mathbf{H}_P$ denote the original and the learned sensing matrix, respectively. The implications of the number of residual blocks in terms of the number of parameters, training time, and accuracy of the model outcomes can be observed in Table 2. In particular, networks with one to four residual blocks are compared, and the accuracy of each network is measured using the NMSE of the sensing matrix computed for the 1,000 testing samples. Table 2 shows that the average NMSE of the sensing matrices predicted by the networks with two or more residual blocks is much lower than that obtained with one residual block. As the number of residual blocks increases, the training time increases significantly, whereas the average NMSE does not change drastically. Fig. 5(a) shows the training process for networks with different numbers of blocks. The training loss curves of the networks with two or more residual blocks are in a similar range, while the training loss curve of the network with one residual block shows a higher error. The same behaviour is observed in the validation curves, shown in Fig. 5(b). As previously discussed, the marginal increase in NMSE for networks with more than two residual blocks is attributed to over-fitting. In other words, the two-residual-block network not only performs better than the one-residual-block network but also requires less training time than the three- and four-residual-block networks. Therefore, a two-residual-block architecture has been selected for the network.
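A minimal sketch of the NMSE metric, assuming the usual definition ||H_O − H_P||²_F / ||H_O||²_F over all complex entries, together with a test-set average, is given below. Averaging per-sample NMSE values (rather than pooling errors across samples) is an assumption about how the Table 2 averages were formed; the names and toy matrices are hypothetical.

```python
def nmse(h_true, h_pred):
    """||H_O - H_P||_F^2 / ||H_O||_F^2 for complex matrices stored as lists of rows."""
    err = sum(abs(a - b) ** 2 for ra, rb in zip(h_true, h_pred) for a, b in zip(ra, rb))
    ref = sum(abs(a) ** 2 for row in h_true for a in row)
    return err / ref


def average_nmse(pairs):
    """Mean of per-sample NMSE over a test set of (true, predicted) matrices."""
    return sum(nmse(t, p) for t, p in pairs) / len(pairs)


h_o = [[1 + 1j, 2], [0, 1j]]
h_p = [[1 + 1j, 2], [0, 0.9j]]
print(nmse(h_o, h_p), average_nmse([(h_o, h_o), (h_o, h_p)]))
```

Because the error is divided by the energy of the reference matrix, the NMSE is insensitive to the overall scale of H, which makes values comparable across different sets of aperture fields.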

A. RECONSTRUCTION AND TRAINING RESULT
To assess the quality of the images reconstructed using the predicted (learned) sensing matrix relative to the images reconstructed directly from the original (ground-truth) sensing matrix, we study the imaging of three targets, each considering a different set of transmit and receive aperture fields. The first and the second targets are arrays of reflective stripes oriented along the x- and y-axes, respectively. The third target is a T-shaped object. A comparison between the reconstructed images is shown in Fig. 6. The images obtained with the sensing matrix predicted by the proposed network are depicted in the first row of Fig. 6, whereas those retrieved using the original sensing matrix [40] are shown in the second row. Finally, the imaged targets, i.e., the ground truth, can be observed in the third row. The images reconstructed from the predicted sensing matrix are in close agreement with those reconstructed from the true sensing matrix. The MSEs between the images obtained with the true sensing matrix and those retrieved with the predicted one are 1.1 × 10⁻³, 6.2 × 10⁻⁴, and 5.5 × 10⁻⁴ for target (i), target (ii) and target (iii) in Fig. 6, respectively. For a testing set of 1,000 samples, the average MSE of the reconstructed images is 0.00297. This confirms that, in addition to good qualitative agreement, the images reconstructed using the predicted and the original sensing matrices are also in good quantitative agreement. Another significant advantage of the proposed technique can be appreciated by comparing the calculation times needed to obtain the sensing matrix of the coded-apertures forming the CI architecture. To this end, the computation times for the true and the predicted sensing matrix have been compared on the testing dataset. The process of generating the sensing matrix was repeated 10 times, so that the average time could be computed for each scenario. The conventional calculation of the sensing matrix on the CPU takes an average of 0.595 s, whereas the prediction of the sensing matrix using the proposed network on the CPU takes an average of 0.214 s. For the results presented in Fig. 6, the reduction in computation time achieved for target (i), target (ii) and target (iii) is 0.380 s, 0.373 s and 0.398 s, respectively. In other words, the time required to compute the sensing matrix using the proposed network is reduced by up to 65% compared with the conventional computation. The numerical results are summarized in Table 3.
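The timing protocol (repeating each computation 10 times and averaging) can be sketched with Python's `time.perf_counter`. The workload below is a hypothetical placeholder; in practice the two callables would be the conventional sensing-matrix calculation and the network's forward pass, and the relative saving follows from `relative_reduction` applied to the two averages. Function names are illustrative.

```python
import statistics
import time


def average_runtime(fn, *args, repeats=10):
    """Run fn(*args) `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def relative_reduction(t_reference, t_new):
    """Fractional time saving of t_new with respect to t_reference."""
    return (t_reference - t_new) / t_reference


def dummy_workload(n):
    """Placeholder for either sensing-matrix computation (hypothetical)."""
    return sum(i * i for i in range(n))


t_conv = average_runtime(dummy_workload, 200_000)
t_pred = average_runtime(dummy_workload, 50_000)
print(f"{t_conv:.4f} s vs {t_pred:.4f} s, reduction {relative_reduction(t_conv, t_pred):.0%}")
```

Averaging over repeated runs smooths out scheduler and cache effects, which matter when the timings being compared are fractions of a second.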

B. ANALYSIS OF THE NETWORK PERFORMANCE WITH SYSTEM NOISE
To assess the practicality and robustness of the presented approach, a noise analysis has also been performed, in which the performance of the learning model in the presence of noise is studied. To this end, two types of noise studies were carried out. For the first type, different levels of Gaussian white noise were added to the aperture fields $\boldsymbol{\alpha}$. As previously discussed, the radiated field, $E(m,\mathbf{r})$, corresponding to the $m$-th measurement mode at a point $\mathbf{r}$ of the scene depends on $\boldsymbol{\alpha}$. Hence, the noisy aperture distribution is given by:
$$\tilde{\boldsymbol{\alpha}} = \boldsymbol{\alpha} + \mathbf{N}_1,$$
where $\mathbf{N}_1$ is the aperture distribution noise.
In addition, a second type of noise was considered to simulate noise in the imaging system. For this, different levels of Gaussian white noise were added to the back-scattered measurements, $\mathbf{g}$ [61]. The back-scattered measurement with noise is given by:
$$\tilde{\mathbf{g}} = \mathbf{g} + \mathbf{N}_2,$$
where $\mathbf{N}_2$ is the measurement noise. Gaussian white noise is considered in both cases because it closely matches the statistical attributes of various types of noise encountered in radar systems [10], [40], [62]. Figs. 7(a) and 7(b) were generated with the proposed network, while Figs. 7(c) and 7(d) were produced with the conventional method. Figs. 7(a) and 7(c) show reconstructions with a 10 dB signal-to-noise ratio (SNR) on the aperture fields, while Figs. 7(b) and 7(d) correspond to a 30 dB SNR. As can be seen, the images reconstructed with a 10 dB SNR on the aperture fields are of worse quality than those in the 30 dB SNR case. This can be understood by analyzing the SVD plots provided in Fig. 3: whereas the system with 10 dB SNR provides fewer than 30 useful measurement modes (i.e., measurement modes that remain above the noise level), the 30 dB SNR case provides more than 60 useful measurement modes. Despite this, even with 10 dB SNR on the coded-aperture distribution, the outlines of the imaged object can be distinguished, whereas with 30 dB SNR the reconstructed images exhibit good fidelity. Moreover, in Fig. 7, the left column (labelled 1) and the right column (labelled 2) correspond to 10 dB and 30 dB SNR, respectively, in the acquisition of the back-scattered measurements. These results suggest that the proposed network can provide results similar to those of the conventional method in a noisy environment.
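The two noise models (additive white Gaussian noise on the aperture fields α and on the measurements g) and the notion of useful modes can be sketched as follows. Assumptions: the SNR is defined relative to the mean signal power, the noise is circular complex Gaussian, and a mode counts as useful when its singular value, normalized to the largest one, lies within `snr_db` of the top on a 20·log10 scale. The singular values in the usage lines are made up for illustration and are not read from Fig. 3; all names are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Add circular complex white Gaussian noise at the requested SNR (in dB).

    The SNR is taken relative to the mean power of `signal` (assumption).
    """
    p_signal = sum(abs(s) ** 2 for s in signal) / len(signal)
    p_noise = p_signal / 10 ** (snr_db / 10)
    sigma = math.sqrt(p_noise / 2)  # noise power split equally between real and imaginary parts
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for s in signal]


def empirical_snr_db(clean, noisy):
    """SNR actually realized by one noise draw."""
    p_signal = sum(abs(s) ** 2 for s in clean)
    p_noise = sum(abs(n - s) ** 2 for s, n in zip(clean, noisy))
    return 10 * math.log10(p_signal / p_noise)


def useful_modes(singular_values, snr_db):
    """Count modes whose normalized singular value stays above the noise floor.

    Assumes the floor sits snr_db below the largest singular value on a 20*log10 scale.
    """
    s_max = max(singular_values)
    floor = 10 ** (-snr_db / 20)
    return sum(1 for s in singular_values if s / s_max > floor)


rng = random.Random(7)
alpha = [complex(math.cos(0.01 * i), math.sin(0.01 * i)) for i in range(20_000)]
alpha_noisy = add_white_noise(alpha, 10.0, rng)
print(round(empirical_snr_db(alpha, alpha_noisy), 2))
print(useful_modes([1.0, 0.5, 0.05, 0.001], 10.0), useful_modes([1.0, 0.5, 0.05, 0.001], 30.0))
```

The same `add_white_noise` helper applies to either noise model: pass the aperture weights to emulate the first study or the measurement vector to emulate the second. Raising the SNR lowers the floor, so more of the decaying singular-value spectrum counts as usable, which is the trend described for the 10 dB and 30 dB cases.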

V. CONCLUSION
In this article, we developed a CNN with residual blocks to improve the efficiency of back-propagation computations for coded-aperture-based microwave CI systems. The proposed network architecture contains two residual blocks, each with two skip connections to mitigate the vanishing gradient problem. The network was trained on 14,000 training pairs and tested on 1,000 testing pairs. The CNN successfully generated the system sensing matrix by learning the features of the aperture distribution set. The results show the potential of the proposed technique to predict the sensing matrix of coded-apertures with good accuracy, producing image reconstructions on par with those obtained using the original sensing matrix. Quantitatively, the average MSE of the reconstructed images was 0.00297. We also demonstrated that the developed CNN can reduce the computation time of the sensing matrix by 65% while maintaining good accuracy. Finally, in the presence of noise, the proposed network still provides a sensing matrix that yields reconstructions comparable to those obtained with the conventional method. These results are highly promising for CI-based microwave imaging systems. In particular, the presented framework opens up the possibility of obtaining the sensing matrix information directly from in-situ measurements, eliminating the need for an additional characterization process to measure the coded-aperture radiated fields for imaging [63]. Moreover, the achieved reduction in the sensing matrix calculation time, while ensuring good accuracy, is crucial for enabling real-time operation of microwave CI systems.
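The error metrics used to compare predicted and conventional results can be computed as sketched below. This is a hedged illustration: the MSE is taken as the mean squared magnitude difference, and the NMSE is assumed to be the error energy normalized by the energy of the reference (conventionally computed) quantity; the normalization used in the study may differ. The helper names `mse`, `nmse`, and `average_metric` and the toy arrays are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two arrays (complex values allowed)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(np.abs(a - b) ** 2))

def nmse(pred, ref):
    """Normalized MSE: error energy divided by the reference energy (assumed)."""
    pred, ref = np.asarray(pred), np.asarray(ref)
    return float(np.sum(np.abs(pred - ref) ** 2) / np.sum(np.abs(ref) ** 2))

def average_metric(metric, preds, refs):
    """Average a metric over paired samples, e.g., a test set."""
    return float(np.mean([metric(p, r) for p, r in zip(preds, refs)]))

# Toy example: a "predicted" matrix offset by 0.1 from the reference.
ref = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = ref + 0.1
print(mse(pred, ref))   # approximately 0.01
print(nmse(pred, ref))  # approximately 0.04 / 30
print(average_metric(mse, [pred, ref], [ref, ref]))  # approximately 0.005
```

Averaging per-sample values with `average_metric` mirrors how an average NMSE or MSE over a testing set would be reported.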

FIGURE 1. Coded-aperture CI setup operating in a bistatic mode, used for dataset generation. The figure shows the letter "T" as the imaging target.

FIGURE 3. Singular value spectra of the imaging system. They indicate that a reconstruction using all 70 measurement modes requires an SNR slightly higher than 30 dB. They also show that the scene is undersampled, and that noise further reduces the number of useful measurement modes.

FIGURE 5. MSE curves for the (a) training and (b) validation steps of networks with different numbers of residual blocks.

FIGURE 6. Comparison of the reconstructed image quality when using (a) the sensing matrix estimated by the proposed network and (b) the conventional sensing matrix. Ground-truth targets are shown in (c). Targets (i) and (ii) consist of two reflective stripes oriented along the x-axis and y-axis of the scene. Target (iii) is a T-shaped object.

FIGURE 7. Comparison of the reconstructed images under noise: (a) and (b) show the images reconstructed using the predicted sensing matrix with 10 dB and 30 dB SNR added to the coded-aperture; (c) and (d) show the images reconstructed using the original sensing matrix with 10 dB and 30 dB SNR added to the coded-aperture. Columns (1) and (2) denote the reconstructions with 10 dB and 30 dB SNR for the back-scattered measurements.

TABLE 1. Parameters of the imaging system and the imaging scene.

TABLE 2. Comparison of metrics for different network architectures.

TABLE 3. Summary of the numerical results of the reconstructions.