Clinical Micro-CT Empowered by Interior Tomography, Robotic Scanning, and Deep Learning

While micro-CT systems are instrumental in preclinical research, clinical micro-CT imaging has long been desired with cochlear implantation as a primary application. The structural details of the cochlear implant and the temporal bone require a significantly higher image resolution than that (about 0.2 mm) provided by current medical CT scanners. In this paper, we propose a clinical micro-CT (CMCT) system design integrating conventional spiral cone-beam CT, contemporary interior tomography, deep learning techniques, and the technologies of a micro-focus X-ray source, a photon-counting detector (PCD), and robotic arms for ultrahigh-resolution localized tomography of a freely-selected volume of interest (VOI) at a minimized radiation dose level. The whole system consists of a standard CT scanner for a clinical CT exam and VOI specification, and a robotic micro-CT scanner for a local scan of high spatial and spectral resolution at minimized radiation dose. The prior information from the global scan is also fully utilized for background compensation of the local scan data for accurate and stable VOI reconstruction. Our results and analysis show that the proposed hybrid reconstruction algorithm delivers accurate high-resolution local reconstruction, and is insensitive to the misalignment of the isocenter position, initial view angle and scale mismatch in the data/image registration. These findings demonstrate the feasibility of our system design. We envision that deep learning techniques can be leveraged for optimized imaging performance. With high-resolution imaging, high dose efficiency and low system cost synergistically, our proposed CMCT system has great promise in temporal bone imaging as well as various other clinical applications.


I. INTRODUCTION
In the clinical practice of otology and neurotology, medical imaging is critical for evaluation and treatment of many diseases [1]. Magnetic resonance imaging (MRI) is an effective imaging tool with excellent soft tissue contrast and invaluable in detecting neoplasms with gadolinium contrast agent, but it is not good at evaluating bony structures [2]. Currently, temporal bone CT is the primary method of choice for otological imaging [3]. Chronic otitis media, otosclerosis, temporal bone fracture, congenital aural atresia, cochlear implantation, dehiscent superior semicircular canal, congenital labyrinthine dysplasia, labyrinthine fistula are all disorders or therapies where temporal bone CT is either absolutely necessary or a commonly desired adjunct to surgical management, for diagnosis or planning. In cochlear implantation, although psychophysical and physiological measures are of primary importance, the ability to localize electrodes and depict their 3D anatomical environment in vivo is of great relevance to understanding variations in threshold, uncomfortable level, and channel interaction.
CT is widely used to image a variety of middle and inner ear pathologies, but it is limited by suboptimal image resolution [4]. While most CT scanners are limited to, at best, a spatial resolution of about 0.3 mm, the latest Canon CT scanner resolves details down to about 0.2 mm [5]. This level of image resolution is still insufficient. For example, it is common that a precise diagnosis is unavailable until direct assessment during otologic surgery allows visual inspection and palpation of the ossicular chain. Also, much-improved image resolution of pre- and post-operative inner ear imaging is required for detailed analysis of cochlear morphometry and its relationship to an implanted electrode array [6]. In these and many other research and clinical applications, there are critical and immediate needs for ultra-fine spatial resolution without compromising other image quality indices at a minimized radiation dose level. A breakthrough in CT image resolution may greatly benefit diagnosis and treatment in general, and otology and neurotology in particular, such as for rational design and implantation of prosthetic devices.
In contrast to the classic CT system, the robotic-arm-based X-ray imaging system allows great flexibility of scanning. It supports different scanning trajectories optimized for diverse tasks focusing on various organs and locations. Siemens developed a robotic X-ray system named Multitom Rax, which can scan a patient under a natural weight-bearing condition. It enables a variety of clinical examinations. Three-dimensional images acquired by Multitom Rax improve diagnostic and planning performance compared to images obtained when a patient lies on a bed. Researchers at Johns Hopkins University developed a mathematical framework for the design of scanning trajectories optimal to a particular task with cone-beam CT [7]. FirstImaging also develops robotic imaging systems with excellent image quality.
The X-ray photon-counting detector (PCD) is an enabling technology on the horizon for high-resolution (HR) and low-noise imaging, which promises to add a spectral dimension to raw data and boost CT performance [8]. Different from the energy-integrating detector (EID), the PCD works in a pulse-counting mode and directly converts individual X-ray photons into the corresponding charge signals, which are then sorted into different energy bins based on the pulse heights. Thus, the intensity and wavelength information of incoming photons are simultaneously obtained. PCDs in principle have no electronic noise and offer a quite small effective pixel size; e.g., around 0.11 mm × 0.11 mm. In contrast, EIDs suffer from dark current and readout noise, with an element size of typically about 1 mm × 1 mm. In addition, EIDs put more weight on high-energy photons than on low-energy photons, but high-energy photons usually get attenuated less than low-energy photons, leading to a reduced image contrast. On the other hand, with PCDs optimal weights can be implemented on polychromatic photons for improved contrast and dose efficiency. More importantly, the energy discrimination ability of PCDs helps reduce beam hardening and metal artifacts, and enables K-edge imaging and material decomposition.
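The pulse-height sorting described above can be illustrated with a minimal sketch. The threshold values, the pulse list, and the function name `bin_photons` are hypothetical and chosen only for illustration; a real PCD performs this comparison in per-pixel electronics rather than in software.

```python
from bisect import bisect_right


def bin_photons(pulse_energies_kev, thresholds_kev):
    """Count photons per energy bin from their pulse heights.

    thresholds_kev must be sorted in ascending order. A pulse below the
    lowest threshold is rejected (treated as noise). Bin i collects pulses in
    [thresholds[i], thresholds[i + 1]); the last bin is open-ended.
    """
    counts = [0] * len(thresholds_kev)
    for energy in pulse_energies_kev:
        idx = bisect_right(thresholds_kev, energy) - 1
        if idx >= 0:
            counts[idx] += 1
    return counts


# Hypothetical thresholds (keV) and detected pulse heights (keV).
thresholds = [25.0, 50.0, 80.0]
pulses = [12.0, 30.5, 49.9, 50.0, 77.0, 95.0, 120.0]
counts = bin_photons(pulses, thresholds)
print(counts)
```

Because each photon is counted once regardless of its energy, the per-bin counts can later be combined with any weighting, which is the basis of the optimal energy weighting mentioned above.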
In parallel to the development of X-ray detectors, innovations of micro-focus X-ray tubes are also important. The NanoX tube is an example [9], and consists of electron emitting and receiving constructs. The receiving part is mainly an anode with a photoconductor. The emission part includes a backplate, a substrate, a cathode, a gate electrode, and an array of field emission electron sources. A microstructured array anode target (MAAT) X-ray source is another example, which was recently designed [10] to offer a significantly higher flux than an ordinary X-ray source in phase contrast imaging applications. The key parameters were optimized in the range of 40 keV to 130 keV. These types of technologies could be combined into a micro-focus X-ray tube for temporal bone CT imaging; for example, with a focal spot size of about 0.1 mm or less to match the PCD element size.
We envision a robotic-arm-based design of a clinical micro-CT scanner to integrate a microfocus tube, a PCD, interior tomography, and deep learning into a novel device that can be attached to a clinical CT scanner to form a clinical micro-CT system (CMCT) or used separately. The proposed CMCT workflow starts with a conventional volumetric scan of a patient, and then proceeds to a subsequent image analysis session by an expert or a smart software analyzer. After that, the patient is smoothly translated into the robotic scanning space for an interior photon-counting micro-CT scan, in registration to the previously performed global scan. Based on interior tomography theory, the interior scan can target only a region of interest (ROI) with a small detector panel and a customized scanning trajectory so that the image quality can be optimized. To facilitate accurate registration between the global scan data/images and local counterparts, an optical 3D surface scanner is used at the start of the local scan and it continues to monitor the head movement of the patient [11] during the local scan for motion compensation. This design integrates all relevant cutting-edge hardware and software elements, and systematically upgrades our earliest CMCT design published in 2005 [12] as well as other follow-up designs for similar purposes [13]- [18].
Compared to the above-mentioned prior designs, our approach aims at a much higher resolution of 50 μm with a high-power micro-focus source and a state-of-the-art PCD with fine detector elements for electronic-noise-free spectral imaging. Furthermore, the robotic arms add a great mechanical flexibility compared to the C-arm gantry and enable a moving VOI mask scanning for higher dose efficiency [14].
As far as the image reconstruction is concerned, the current mainstream of interior tomography methods focuses on improving reconstruction quality from truncated projections with no or little extra information [19]-[24]. Nevertheless, in our setting a global scan is available, which is clinically natural and can be utilized for robust reconstruction at least for the initial imaging task. Several methods that complete the HR projection data with a low-resolution global scan have been developed for truncation-artifact-free reconstruction [12]-[15], [25], [26]. The method in [27], in particular, directly fuses global and local reconstructions and is computationally efficient because it needs no forward projection operations, but it is restricted to a certain type of scanning geometry and is inappropriate for our application. Instead of completing the HR truncated projections to cover a global FOV, we propose to directly perform a background compensation on the HR local/interior scan with the global scan. The main benefit is that the subsequent interior reconstruction only involves a small portion of the sinogram, requiring much less memory space and computational time. This is critical since the projection resolution will be enhanced by up to about 10 times (from 1 mm to 0.11 mm); hence the amount of interpolated 2D projection data will be increased about 100 times compared to that of the global scan, and handling such big data for reconstruction can be very time-consuming or even prohibitive. Furthermore, for quality enhancement at reduced doses, deep learning techniques can be involved in multiple reconstruction stages; e.g., projection deblurring, image denoising and super-resolution, beam hardening correction and material decomposition.

A. SYSTEM DESCRIPTION
The proposed CMCT system consists of a standard medical CT scanner or a novel clinical CT system (such as a NanoX system) and a robotic micro-CT scanner, as shown in Fig. 1. The CT scanner performs a global scan. A standard CT algorithm can be used to reconstruct head slices, which show the inner ear region of interest. Then, the patient table transports the patient into the robotic micro-CT system. This robotic micro-CT system uses two robotic arms holding the X-ray source and the X-ray detector respectively, as shown in Fig. 2. The micro-focus tube is a key component for high spatial resolution imaging. The flat-panel PCD is suitable for material decomposition and tissue characterization. The robotic arms can perform a scan along an arbitrary trajectory such as a circular or spiral trajectory. A high-performance computer is in overall control. It sends instructions to the robotic arms, X-ray source and detector. It also acquires raw data from the detector. Each robotic arm needs one control box, which receives commands from the computer and drives the servo system.

B. DESIGN PRINCIPLES
1) COORDINATED ROBOTS-The lightweight, highly flexible, and collaborative UR5e and UR16e industrial robot arms with payloads of up to 5 kg and 16 kg, respectively, are suitable for our CMCT system [28]. The working radius of UR5e is 850 mm with power consumption of 250 W, while the working radius of UR16e is 900 mm with power consumption of 585 W. The communication between the control computer and the robotic arm control box is through a 1000BASE-T Ethernet cable using TCP/IP. The 6-axis robotic arms have a high position repeatability of 30 μm. Some models can be used to achieve even higher accuracy. The existing Radalytica robotic imaging platform can be modified for CMCT, pushing the limits of robotic imaging considerably further. The robotic-arm cone-beam micro-CT sub-system can be made to achieve a spatial resolution of 50 μm or better to meet temporal bone imaging requirements [29]. The power supply can be from 100 to 240 VAC at 47 to 440 Hz.
2) X-RAY SOURCE-The Hamamatsu L12161-07 tube is selected as the X-ray source for CMCT. It is an RS-232 controlled 40-150 kV X-ray source with a focal spot size adjustable among 7, 20, and 50 μm. Its maximum tube power output is 75 W. The anode target material is tungsten. The tube window is made of beryllium of 0.2 mm thickness. The effective cone beam angle is 43°. Its weight is 13.5 kg, manageable with the larger robot arm.

3) PHOTON-COUNTING DETECTOR-The photon-counting detector ADVACAM WIDEPIX-5 × 5 of 1280 × 1280 pixels fits the need for human temporal bone micro-CT. Its continuous sensitive surface is supported by an array of 5 × 5 detector tiles. Each tile consists of a single Timepix hybrid detector (256 × 256 pixels) with an edgeless CdTe sensor. Each pixel counts the number of X-ray photons, allowing a large dynamic range. The Timepix technology also allows for the use of multiple energy thresholds. The intrinsic spatial resolution is defined by the detector pitch of 55 μm. With 1280 × 1280 pixels, the imaging sensor thus covers a 7 cm × 7 cm area. The detector weighs 3.3 kg and can be easily carried by the smaller robotic arm. If we use the detector in a 2 × 2 binning mode, the image resolution is about 50 μm, and the diameter of the volume of interest (VOI) is 3.5 cm, assuming a magnification factor of 2. This should be sufficient to cover the human inner and middle ear. In addition to the Timepix detector, we can also customize the Medipix3 detector tiles, whose pixels have two integrated 12-bit digital counters and two energy discrimination thresholds. If we use the Medipix3 detector in a 2 × 2 binning mode, there are 8 spectral bins for data collection in a single scan.
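The detector numbers quoted above follow from simple arithmetic, sketched below. The function name `detector_geometry` and the dictionary keys are illustrative only; the sampling interval at the VOI is taken as the binned pitch divided by the magnification, which ignores focal-spot blur.

```python
def detector_geometry(n_pixels=1280, pitch_mm=0.055, binning=2, magnification=2.0):
    """Derive sensor size, binned pitch, and VOI-plane quantities."""
    side_mm = n_pixels * pitch_mm                 # physical sensor side length
    binned_pitch_mm = pitch_mm * binning          # pixel pitch after binning
    sampling_at_voi_mm = binned_pitch_mm / magnification
    voi_diameter_mm = side_mm / magnification     # detector width projected back to the VOI
    return {
        "side_mm": side_mm,
        "binned_pitch_mm": binned_pitch_mm,
        "sampling_at_voi_mm": sampling_at_voi_mm,
        "voi_diameter_mm": voi_diameter_mm,
    }


geom = detector_geometry()
for name, value in geom.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the sensor side is about 70 mm, the sampling interval at the VOI is about 55 μm (close to the 50 μm quoted), and the VOI diameter is about 35 mm.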

4) RADIATION DOSE-Radiation dose is mainly determined by the tube voltage, current and exposure period. With the use of the PCD, there is no electronic noise when recording projection data, but Poisson noise cannot be avoided. In the interior scanning mode, the X-ray source only irradiates about 1/10 of the diameter of the field of view, but an improvement in image resolution by four times (roughly, from 200 μm to 50 μm) would increase radiation dose significantly (by two orders of magnitude) [12]. Thanks to the latest advancement in deep-learning-based low-dose CT imaging techniques [30], we can reduce radiation dose by an order of magnitude. With all of these factors coupled together, we should be able to maintain approximately the current head CT dose for an interior micro-CT scan to achieve about 50 μm resolution.

5) SPATIAL RESOLUTION-We define the center of the focal spot as S, the center of the detector as D, and the rotation center as O, which is the center of a VOI. These three points should always be kept in a straight line. In Fig. 3, a is the source to VOI distance, b is the VOI to detector distance, and c is the source to detector distance. The geometric magnification factors M and M′ are c/a and c/b for the focal spot and the detector aperture respectively [31]. The spatial resolution r of the imaging system can be approximated as a convolution of the detector size d and the focal spot size x respectively scaled by M and M′. In our initial CMCT system design, x is no more than 50 μm, and d is equal to 110 μm (due to 2 × 2 binning). The source to VOI distance can be from 150 to 250 mm, and the VOI to detector distance is in the same range. Consequently, the magnification factor can be adjusted from 1.60 to 2.67 with the imaging field of view from 26 to 44 mm in diameter. According to the above equations, the system spatial resolution can be made close to 50 μm.
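A rough numerical check of these figures is sketched below. It rests on assumptions not stated explicitly in the text: c = a + b (S, O and D are collinear), the detector aperture d is referred to the VOI by dividing by c/a, the focal spot x is referred to the VOI by dividing by c/b, and the two blurs are combined in quadrature as a common approximation of their convolution. The function names and the 70.4 mm detector side (1280 pixels at 55 μm) are illustrative.

```python
import math


def system_resolution_um(a_mm, b_mm, d_um=110.0, x_um=50.0):
    """Approximate resolution at the VOI (quadrature sum, an assumption)."""
    c_mm = a_mm + b_mm
    detector_blur_um = d_um / (c_mm / a_mm)   # detector aperture seen at the VOI
    focal_blur_um = x_um / (c_mm / b_mm)      # focal-spot penumbra seen at the VOI
    return math.hypot(detector_blur_um, focal_blur_um)


def fov_diameter_mm(a_mm, b_mm, detector_side_mm=70.4):
    """Detector width projected back to the VOI plane."""
    return detector_side_mm * a_mm / (a_mm + b_mm)


for a_mm, b_mm in [(250.0, 150.0), (200.0, 200.0), (150.0, 250.0)]:
    magnification = (a_mm + b_mm) / a_mm
    print(
        f"a={a_mm:.0f} mm, b={b_mm:.0f} mm: M={magnification:.2f}, "
        f"FOV={fov_diameter_mm(a_mm, b_mm):.1f} mm, "
        f"r={system_resolution_um(a_mm, b_mm):.1f} um (x=50 um), "
        f"{system_resolution_um(a_mm, b_mm, x_um=7.0):.1f} um (x=7 um)"
    )
```

Under these assumptions the magnification spans 1.60 to 2.67 and the field of view spans about 26 to 44 mm, matching the ranges given above, while the estimated resolution stays in the neighborhood of 50 to 60 μm depending on the focal spot setting.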

6) GEOMETRIC ALIGNMENT-In the CMCT process, global and local projection data are obtained in different scanning geometries. An optical 3D surface scanner is preferably used for image registration, as shown in Fig. 4. The clinical CT scan obtains a 3D image of a patient who may wear landmarks such as a firmly-attached helmet. A boundary detection algorithm extracts the facial/helmet surface and key points for mesh generation as a basis for image registration [32]. The registration exploits the fact that the inner ear and the facial surface are in a rigid relation. As an example of 3D surface scanners, the laser scanner Micro-Epsilon LLT2910-100 is a high-quality profile sensor. The height range may be set from 125 to 390 mm, and the width range can be fixed to 143.5 mm, with a spatial resolution of 12 μm at a profiling frequency of 300 Hz. The compact size of 96 mm × 85 mm × 33 mm and low weight of 380 grams are ideal for static, dynamic and robotic applications. The wavelength of the semiconductor laser is switchable between 658 nm (red) and 405 nm (blue). The measurement data can be output via an Ethernet UDP, Modbus TCP, or serial communication protocol. The power supply is 24 VDC, 500 mA. The optical scanner and the two robotic arms are mounted on the same pedestal so that they share the same coordinate system. The optically scanned patient head surface needs to be registered with the head surface derived from the clinical CT scan. Then, the coordinate conversion is carried out to delineate a VOI for a robotic micro-CT scan. The temporal bone data and images from medical CT and micro-CT also need to be registered and fused to achieve the best imaging performance.

III. IMAGING ALGORITHM
It is well known that the CT interior problem is not uniquely solvable in an unconstrained space [33]. By introducing additional prior knowledge on the image to be reconstructed; e.g., an interior sub-region with known attenuation values or a piece-wise constant model of underlying images, the image reconstruction from local projection profiles that are truncated on both sides becomes uniquely solvable. However, those assumptions often do not exactly hold in practical cases, which potentially results in shifting and cupping in reconstructed attenuation values. To address this problem, we propose to use a low-resolution (LR) global CT scan to estimate the background attenuation in the sinogram of the HR local scan that involves the surrounding volume of the VOI, and obtain an accurate HR local reconstruction of the VOI.
The general idea of background compensation is illustrated in Fig. 5 in a 2D case without loss of generality. The transformation between the underlying/reconstructed image and its projection data is linear and invertible, and we can partition an underlying image into two parts, the region of interest (ROI) and the background. Given the sinogram of the background $P_{\mathrm{background}}$ and the global sinogram $P_{\mathrm{global}}$, we can easily obtain the pure sinogram of the ROI $P_{\mathrm{ROI}}$ as follows:

$$P_{\mathrm{ROI}} = P_{\mathrm{global}} - P_{\mathrm{background}}. \tag{2}$$

The above relationship becomes nontrivial with laterally truncated projection data. Let $\mathrm{trunc}(\cdot)$ denote the truncation operation, and Eq. 2 becomes

$$\mathrm{trunc}(P_{\mathrm{ROI}}) = \mathrm{trunc}(P_{\mathrm{global}}) - \mathrm{trunc}(P_{\mathrm{background}}), \tag{3}$$

where $\mathrm{trunc}(P_{\mathrm{global}})$ stands for a local scan $P_{\mathrm{local}}$. By intentionally letting the local scan cover the ROI, the truncated parts of $P_{\mathrm{ROI}}$ are all zeros, and we have

$$P_{\mathrm{ROI}} = P_{\mathrm{local}} - \mathrm{trunc}(P_{\mathrm{background}}). \tag{4}$$

This equation suggests that the ROI within a local scan can be accurately reconstructed from the laterally truncated scan after background subtraction. Clearly, this compensation will improve the stability of interior tomography.
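The linearity argument can be checked on a toy example. The sketch below is not the CMCT implementation: it uses a hypothetical 6 × 6 image and only two parallel-beam views (column sums and row sums) as a stand-in for a sinogram, and the helper names `project`, `trunc` and `subtract` are illustrative.

```python
def project(image):
    """Two-view parallel-beam 'sinogram': column sums, then row sums."""
    n_rows, n_cols = len(image), len(image[0])
    view0 = [sum(image[r][c] for r in range(n_rows)) for c in range(n_cols)]
    view1 = [sum(image[r][c] for c in range(n_cols)) for r in range(n_rows)]
    return [view0, view1]


def trunc(sinogram, lo, hi):
    """Keep only the detector bins lo..hi-1 in every view (lateral truncation)."""
    return [view[lo:hi] for view in sinogram]


def subtract(p, q):
    """Element-wise difference of two sinograms of equal shape."""
    return [[x - y for x, y in zip(vp, vq)] for vp, vq in zip(p, q)]


# Toy object: the ROI is the central 2 x 2 block (rows and columns 2..3).
image = [
    [1, 1, 1, 1, 1, 1],
    [1, 2, 2, 2, 2, 1],
    [1, 2, 5, 7, 2, 1],
    [1, 2, 6, 8, 2, 1],
    [1, 2, 2, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
]
lo, hi = 2, 4
n = len(image)
roi = [[image[r][c] if lo <= r < hi and lo <= c < hi else 0 for c in range(n)]
       for r in range(n)]
background = [[image[r][c] - roi[r][c] for c in range(n)] for r in range(n)]

p_local = trunc(project(image), lo, hi)                        # truncated global scan
p_roi = subtract(p_local, trunc(project(background), lo, hi))  # compensated local scan
print(p_roi)
```

The compensated data equal the projections of the ROI alone, even though the local scan never measured the detector bins outside the ROI shadow.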
For CMCT, a VOI can be accurately reconstructed at high resolution, given the HR local projection dataset $P_{\mathrm{local}}^{\mathrm{HR}}$ and an appropriate background estimation. Given a prior LR CT scan of the object $P_{\mathrm{global}}^{\mathrm{LR}}$, the LR background estimation $P_{\mathrm{background}}^{\mathrm{LR}}$ can be easily obtained as shown in Fig. 5(a), and it can be used to approximate the HR background closely via interpolation. This is based on the assumption that residual high-frequency background estimation errors will mostly cancel out during the integration, which makes the LR estimation sufficiently accurate for our purpose.
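The claim that a coarsely sampled background can be interpolated onto the fine detector grid can be illustrated with a small numerical experiment. Everything here is a hypothetical stand-in: the background projection is modeled as the chord length through a uniform disk of radius 100 mm, the LR and HR detector pitches are 1.0 mm and 0.1 mm, and plain linear interpolation is used.

```python
import math


def chord_length(t_mm, radius_mm=100.0):
    """Line integral through a uniform unit-density disk at lateral offset t."""
    return 2.0 * math.sqrt(max(radius_mm ** 2 - t_mm ** 2, 0.0))


def linear_interp(x, xs, ys):
    """Linear interpolation on a uniformly spaced ascending grid xs."""
    step = xs[1] - xs[0]
    i = int((x - xs[0]) / step)
    i = min(max(i, 0), len(xs) - 2)   # clamp so that i + 1 stays in range
    w = (x - xs[i]) / step
    return (1.0 - w) * ys[i] + w * ys[i + 1]


lr_pitch_mm, hr_pitch_mm = 1.0, 0.1
lr_t = [-5.0 + lr_pitch_mm * k for k in range(11)]    # coarse bins over the VOI shadow
lr_bg = [chord_length(t) for t in lr_t]               # LR background estimate
hr_t = [-5.0 + hr_pitch_mm * k for k in range(101)]   # fine bins of the local scan
hr_bg_est = [linear_interp(t, lr_t, lr_bg) for t in hr_t]

max_err = max(abs(est - chord_length(t)) for est, t in zip(hr_bg_est, hr_t))
print(f"max interpolation error: {max_err:.5f} (background level about {lr_bg[5]:.0f})")
```

For a slowly varying background such as this one, the interpolation error is orders of magnitude below the background level. A real background contains edges and fine structures, so this sketch shows only the favorable case that the assumption above relies on.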

A. CMCT RECONSTRUCTION
In clinical applications, a standard (relatively low resolution) global CT scan is first performed. Those regions with possible pathology or of physiologic importance can be further examined with a local micro-CT scan, which provides a HR local/interior reconstruction of a VOI. With interior tomography, a minimized additional dose will be involved. The prior information obtained through the global CT scan can be utilized to help the interior image reconstruction (at ultrahigh resolution).
The whole scanning procedure is summarized as follows:

1. Perform a global head CT scan $P_{\mathrm{global}}$ with a scanning geometry $G_{\mathrm{global}}$;

2. Determine a VOI and then plan a scanning geometry $G_{\mathrm{local}}$ for a local micro-CT scan;

3. Scan the patient optically with a surface scanner to generate the surface model $S_{\mathrm{local}}$ for data/image registration between the global and local scans at the start of the local scan;

4. Perform the micro-CT scan $P_{\mathrm{local}}$ following $G_{\mathrm{local}}$ (see the registration step below). During the local scan, the optical scanner continuously scans the patient head (preferably with optical markers) to track involuntary head movement.
Then, the VOI reconstruction mainly consists of the following three steps: data/image registration, background compensation, and image reconstruction. For brevity, in the following we assume that the head movement effects have already been compensated for in the local projections $P_{\mathrm{local}}$ with the feedback from the optical scanner using advanced correction techniques, such as locally linear embedding motion correction [34], [35].

1) Registration:
Find the relative geometry between $G_{\mathrm{global}}$ and $G_{\mathrm{local}}$ in reference to the facial surface model $S_{\mathrm{local}}$.

a. Reconstruct the global volume $V_{\mathrm{global}}$ from $P_{\mathrm{global}}$;

b. Render the surface model $S_{\mathrm{global}}$ from the global reconstruction $V_{\mathrm{global}}$;

c. Register the two surface models to align the orientation and position of $S_{\mathrm{global}}$ with $S_{\mathrm{local}}$. The registration result is used to guide the micro-CT scan;

d. From $P_{\mathrm{local}}$, directly reconstruct a volume of interest $V_{\mathrm{local}}$, which contains fine structures but may be subject to distorted attenuation values;

e. Refine the registration parameters (obtained in Step 1c) in reference to the registration between $V_{\mathrm{global}}$ and $V_{\mathrm{local}}$.

2) Compensation:
Correct the attenuation offsets in $P_{\mathrm{local}}$ to form a pure sinogram of the VOI, $P_{\mathrm{VOI}}$. Specifically, correct $P_{\mathrm{local}}$ with the estimated attenuation background as $P_{\mathrm{local}} - \mathrm{trunc}(P_{\mathrm{background}}^{\mathrm{HR}})$ to form $P_{\mathrm{VOI}}$.

3) Reconstruction:
Reconstruct the VOI from $P_{\mathrm{VOI}}$ with geometry $G_{\mathrm{local}}$ using a cone-beam reconstruction algorithm, preferably one developed in the deep learning framework.
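The surface alignment in Step 1c can be illustrated with a heavily simplified sketch. It assumes a 2D problem with known point correspondences between the two surface models and solves for the least-squares rotation and translation in closed form; a practical system would work in 3D on meshes without known correspondences. The function names and landmark coordinates are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (x0, y0), (x1, y1) in zip(src, dst):
        ax, ay = x0 - sx, y0 - sy          # centered source point
        bx, by = x1 - dx, y1 - dy          # centered destination point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)         # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    tx = dx - (c * sx - s * sy)            # translation aligning the centroids
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty


def apply_rigid_2d(points, theta, tx, ty):
    """Rotate points by theta about the origin, then translate."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


# Hypothetical landmark points (mm) on the surface from the global CT volume.
surface_global = [(0.0, 0.0), (80.0, 5.0), (60.0, 90.0), (-10.0, 70.0)]
# Simulated optical-scanner view: the same points after a 3 degree rotation and a shift.
surface_local = apply_rigid_2d(surface_global, math.radians(3.0), 1.5, -0.7)

theta, tx, ty = fit_rigid_2d(surface_global, surface_local)
print(f"rotation = {math.degrees(theta):.3f} deg, shift = ({tx:.3f}, {ty:.3f}) mm")
```

The recovered rotation and shift correspond to the initial view angle and isocenter offsets whose residual errors are examined in the mis-registration study.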

A. CMCT ACCURACY AND RESOLUTION
A simulation study was performed to demonstrate the feasibility of our proposed VOI reconstruction through background compensation. When the PCD is used, projections are collected in a number of energy bins. For inner ear imaging, we may initially focus on all the counts in a wide energy window to study the reconstruction performance; i.e., 40 keV to 110 keV (120 kVp source), for the following considerations: (1) to avoid the blurring from the X-ray fluorescence in the CdTe crystal; (2) to collect most photons that have penetrated through the head; and (3) to reduce the influence of pile-up effects. One head CT image containing the inner ear structures from the Visible Human Project [36] was used as a realistic image phantom, as shown in Fig. 6(a). The original phantom matrix is of 512 × 512 pixels with a pixel size of 0.4981 mm. In our study, the phantom image was first interpolated to 10,240 × 10,240 pixels with a pixel size of 0.025 mm to generate the HR phantom. Note that the bicubic interpolation method was used to generate pixel values. Then, two resolution bar patterns in horizontal and vertical orientations were embedded in the inner ear region as shown in Fig. 6(b) to (d). The amplitude of the added patterns is 600 HU, and the radius of the ROI is 23 mm as marked with a red circle in Fig. 6(b). A global CT scan was performed with a 1.024 mm detector pixel size, and the HR local CT scan covered a region of radius 35.05 mm with a detector pixel size of 0.11 mm. Both scans had the same system magnification factor of 2 and tube voltage of 140 kVp, and were simulated with an industrial CT simulator CatSim [37]. Two additional HR global scans were also performed with fine detection pixels of 0.11 mm and 0.04 mm to produce the reference reconstructions as the ground truth (GT) of the attenuation value and image resolution, respectively.
In Fig. 7, large deviations of attenuation values from the GT and strong cupping effects are observed for the direct local scan reconstruction with filtered back projection (FBP). In contrast, the reconstruction from the background compensated projections demonstrates accurate attenuation values. The negligible artifacts around the ROI boundary may come from minor mismatches in the background estimation. Interestingly, for the direct reconstruction, in spite of the attenuation value shifting, the fine details are still clearly discernible except for the distorted ROI boundary, which might be sufficient in those applications that only need structural features.
To evaluate the potential influence of the proposed method on the attenuation value and image resolution, the profiles along and around the vertical and horizontal midlines of the ROI and through the resolution bar patterns are shown in Fig. 8. The lines overlap well with the ground truth inside the ROI, as shown in Fig. 8(a) and (b), demonstrating the high fidelity of the reconstructed attenuation values. Similarly, in Fig. 8(c) and (d), the overlapping profiles of the GT and FBP reconstruction show that the proposed local reconstruction method does not compromise image resolution as compared with the global HR scan. In particular, the 50 μm gap is not resolved while the 75 μm gap is well resolved, which indicates that the resolution of the local scan protocol is between 50 μm and 75 μm. This agrees well with our analysis of the imaging parameters.

B. MIS-REGISTRATION EFFECTS
While the accuracy and resolution have been demonstrated above, the feasibility of the proposed reconstruction method will be illustrated in this subsection in terms of the robustness to potential mis-registration due to imperfect hardware components and their suboptimal coordination; i.e., with respect to mismatches in position, orientation and/or scale. The direct effects of these mismatches on the reconstruction process are an isocenter offset, a falsely tilted initial view angle, and an incorrect magnification factor, which disturb the interpolation between the local scan and the re-projection through a globally reconstructed image volume for background estimation.
First, the mismatches in the isocenter position were set to the range from 0 to 4.483 mm with an increment of 0.498 mm along the horizontal direction. The corresponding results with misaligned background compensation are in Fig. 13. The absolute and relative error maps were calculated against the GT. Then, the profiles along the horizontal and vertical midlines of the error maps corresponding to 6 selected mismatches are in Fig. 9. It is observed that the maximum relative error of the well-aligned compensated local reconstruction is within ±5%, which comes mainly from the error in the background estimation. The misalignment along the horizontal direction seems to have little effect on the vertical profiles, whereas there are significant drops in the horizontal profiles. Specifically, the maximum relative error remains around 6% when the misalignment is 0.498 mm, and it becomes close to 10% when the misalignment reaches 0.996 mm. The above observations suggest that the compensated local reconstruction is robust with respect to the isocenter misalignment and remains quite accurate even with a mismatch of up to 0.498 mm.
The initial view angle mismatches were simulated in the range from −4.39° to 4.39° with an increment of 1.098°. The resultant reconstructions are in Fig. 14. The error maps were calculated in the same way as for the positional mismatches, and the through-center profiles of the error maps are in Fig. 10. Similarly, the mismatches have stronger influences on the horizontal profiles than on the vertical ones, owing to the directional asymmetry of the background compensation. The maximum relative error is in the range [−10%, 5%] when the angular errors are within ±2.2°. In particular, the relative reconstruction errors are mostly contained within ±6% if the angular errors are within ±1.1°.
The magnification errors were also simulated in the range from −20% to 20% in a step of 5%, and the corresponding reconstructions are in Fig. 15. Interestingly, the "effective" ROI of reconstruction is determined by the applied magnification factor, with the regions outside the effective ROI having opposite attenuation shifting as compared to that inside the effective ROI. The relative reconstruction error profiles are in Fig. 11 with the corresponding magnification mismatches from −10% to 10%. Different from the distortions observed in the cases of positional and angular misalignments, which are mainly concentrated around a peripheral region, the magnification error causes a global attenuation shift inside the effective ROI. As shown in Fig. 11(a) and (b), the attenuation shift is proportional to the magnification mismatch and seems much more sensitive than other types of misalignments. Fortunately, this global shift can be effectively addressed by a bias correction method introduced below.

C. BIAS CORRECTION
The global shift, especially observed in the case with magnification errors, can be addressed with bias correction to make the attenuation value of a known region agree with the target value; i.e., to make the air value close to zero. Other known values can be used for the same purpose, such as that obtained from the global reconstruction; for example, we can select a relatively flat region and calculate its mean value as a benchmark.
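A minimal sketch of such a bias correction is given below, assuming the shift is a single additive constant over the reconstruction. The function name `bias_correct`, the choice of reference pixels, and the toy attenuation values are hypothetical.

```python
def bias_correct(image, reference_pixels, target_value=0.0):
    """Shift the whole image so the reference region's mean equals target_value.

    image is a list of rows; reference_pixels is a list of (row, col) indices
    of a region with known attenuation, e.g. air (target 0) or a flat region
    whose mean is taken from the global reconstruction.
    """
    ref_mean = sum(image[r][c] for r, c in reference_pixels) / len(reference_pixels)
    offset = ref_mean - target_value
    corrected = [[value - offset for value in row] for row in image]
    return corrected, offset


# Toy reconstruction with a global shift of about +0.02; three pixels are air.
recon = [
    [0.03, 0.01, 0.22],
    [0.02, 0.21, 0.23],
]
air_pixels = [(0, 0), (0, 1), (1, 0)]
corrected, offset = bias_correct(recon, air_pixels, target_value=0.0)
print(f"estimated offset: {offset:.4f}")
```

Because a single constant is subtracted, this only removes a uniform shift; it cannot repair the spatially varying errors near the ROI boundary.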
Quantitative metrics, including SSIM, Peak Signal-to-Noise Ratio (PSNR), MSE and Root Mean-Squared Relative Error (RMSRE), were used to quantify the reconstructions with these misalignments in reference to the ROI in the GT, with and without additional bias correction. The actual radius of the ROI was set to 21 mm. The evaluation results with respect to isocenter position, initial angle, and magnification errors are summarized in Tables 1, 2 and 3 respectively. The results after bias correction are denoted with the abbreviation 'Crt' in the Tables.
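For reference, three of these metrics can be written out as below. The exact definitions used in the study are not given in the text, so these are common forms stated as assumptions: PSNR takes the maximum GT value as the peak, and RMSRE normalizes each error by the corresponding GT value, which requires nonzero GT values inside the ROI. SSIM is omitted because it needs a windowed implementation.

```python
import math


def mse(recon, gt):
    """Mean squared error over the ROI pixels (flat sequences)."""
    return sum((r - g) ** 2 for r, g in zip(recon, gt)) / len(gt)


def psnr(recon, gt):
    """Peak signal-to-noise ratio in dB, with the GT maximum as the peak (assumed)."""
    return 10.0 * math.log10(max(gt) ** 2 / mse(recon, gt))


def rmsre(recon, gt):
    """Root mean-squared relative error; GT values must be nonzero (assumed form)."""
    return math.sqrt(sum(((r - g) / g) ** 2 for r, g in zip(recon, gt)) / len(gt))


# Toy ROI values (arbitrary attenuation units), flattened to 1D.
gt_roi = [0.2, 0.4, 0.5]
recon_roi = [0.21, 0.38, 0.5]
print(f"MSE   = {mse(recon_roi, gt_roi):.3e}")
print(f"PSNR  = {psnr(recon_roi, gt_roi):.2f} dB")
print(f"RMSRE = {100.0 * rmsre(recon_roi, gt_roi):.2f} %")
```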
The reconstructions are quite robust with respect to the isocenter positional and initial angular errors, and the bias correction method can further improve the accuracy, as shown in Tables 1 and 2. The attenuation deviation from the GT in the reconstruction with aligned background compensation is very small, with an MSE of 0.454 × 10⁻⁵ and an RMSRE of only 1.07%. As the position error increases, the SSIM and PSNR metrics decrease while the MSE and RMSRE increase, as expected. The RMSRE value is still below 2.0% when the position error reaches 0.996 mm, and the tolerance is extended up to 1.992 mm after the bias correction, which demonstrates the robustness of our method. Similarly, in Table 2 the RMSRE remains below 2.0% for all angular errors within ±4.39°.
The bias correction substantially improves the magnification-error-affected image reconstruction, as shown in Table 3. Note that since the effective ROI is scaled with the magnification factor, the intersection of these effective ROIs and the original ROI (radius 21 mm) was used for evaluation, with the radius being set to 16.8 mm. The metrics on the reconstruction before the bias correction change dramatically as the magnification error varies, demonstrating a relatively high sensitivity. Fortunately, with the bias correction method the RMSRE in the case of a −20% magnification error is significantly reduced from 10.43% to 1.67% within the effective ROI, demonstrating decent robustness.
An interesting phenomenon is that if we perform the bias correction on the aligned reconstruction, the metric scores drop slightly in the inner region within the radius of 16.8 mm in contrast to a tiny boost when evaluated on the whole ROI region (radius 21 mm).
This suggests that the reconstruction must already be very accurate in the inner region when the compensation is well aligned. The attenuation estimation from the global reconstruction cannot be perfect due to differences in resolution and existence of artifacts, and the linearity of the model to support our compensation method is just an approximation to the polychromatic X-ray imaging process, which may bring residual errors into the peripheral region of the ROI. Thus, there is no need to perform bias correction if the system is well calibrated and we are sure about the accuracy of the registration for background compensation. Otherwise, bias correction can be used for better performance.

V. DISCUSSION
In this study, we have proposed a novel CMCT system which incorporates a micro-focus source, a PCD, robotic arms and advanced imaging algorithms into a synergistic companion of a conventional CT scanner. The designed HR local scan protocol not only improves the dose efficiency but also reduces the area of detectors. The cost of PCDs currently remains high due to the complex manufacturing techniques, and the situation is likely to continue in the near future. Hence, this hybrid system for interior tomography can reduce the system cost and radiation dose without compromising the required performance. In the local/interior scan, the advanced robotic arms allow the free selection of a VOI, which is preferable compared to the traditional rotating gantry. In addition, the mobility of the robotic system enables surgeons to take projections from any view angle without moving the patient, which can be extremely helpful in many applications, such as high-quality evaluation in emergencies and real-time feedback in surgeries. It is worth mentioning that besides the exemplary application in inner ear imaging, the system can also work for other clinical imaging tasks that demand high resolution in a VOI/ROI, such as tumor examination in breast, nodule characterization in lung, bone quality analysis, and plaque imaging in the heart and the neck.
Besides the hybrid design as one imaging system shown in Fig. 1, the robotic micro-CT scanner can also be used separately. The reconstruction results from a traditional CT scanner can then serve as the prior knowledge, following the same procedure described in Section III.
However, this may impose extra work in registration due to the different positions of the patient in the local scan and an earlier global scan. Fortunately, as analyzed in Section IV-B, the proposed reconstruction method has a good tolerance to geometric misalignment. Although the reconstruction is relatively sensitive to the magnification mismatch, the resultant attenuation shift can be addressed with the proposed bias correction method. In addition, with the rapid development of face recognition technology [38], human facial surface measurement techniques have matured and now achieve high accuracy in real time [32], [39]-[41]. The registration between surfaces has been studied for many years. High-quality toolboxes, such as 3D Slicer [42], can be directly used or adapted for our purpose.

A. PROJECTION DEBLURRING
Another challenge for CMCT comes from the X-ray source. A micro-focus X-ray source usually has insufficient flux to produce an appropriate signal-to-noise ratio through a human head within a reasonably short time. To obtain an appropriate contrast, an X-ray source with a slightly larger focal spot may be used to provide enough power. The increased focal spot could generate penumbral shadows in the projections and blur structural details. The trade-off between X-ray intensity and image resolution can then be optimized using a deep learning deblurring method. Although the cone-beam projection with a finite focal spot is no longer a spatially invariant linear system, which is a challenge for traditional deblurring methods, advanced deep learning techniques have the capability to perform shift-variant deblurring tasks [43]. Large sets of paired blurred-original projections may be difficult to obtain for training a deblurring network. Alternatively, a forward projection model can be easily built to realistically synthesize the paired data. Then, the network trained with simulated data is fine-tuned with a small amount of paired real projection data. Finally, the trained network can be applied to blurred projections for inference, similar to what we did in an optical deblurring study [44].
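The idea of synthesizing blurred-original pairs can be illustrated with a deliberately simplified sketch. It approximates the focal spot blur by a shift-invariant box kernel whose width equals the geometric penumbra, spot size × (SDD − SOD) / SOD, whereas the real blur is shift-variant as noted above. The source-to-detector distance (400 mm) and pixel size (0.11 mm) follow the design values quoted in Appendix I; the spot size (0.3 mm), the source-to-object distance (200 mm) and all function names are hypothetical.

```python
import numpy as np

def penumbra_width_px(spot_mm, sod_mm, sdd_mm, pixel_mm):
    """Geometric penumbra of a finite focal spot on the detector, in pixels."""
    return spot_mm * (sdd_mm - sod_mm) / sod_mm / pixel_mm

def blur_rows(projection, width_px):
    """Blur each detector row with a normalized box kernel of odd length.

    This is a shift-invariant approximation used only for illustration.
    """
    n = max(1, int(round(width_px)))
    if n % 2 == 0:
        n += 1
    kernel = np.ones(n) / n
    return np.stack([np.convolve(row, kernel, mode="same") for row in projection])

# Hypothetical spot size and source-to-object distance
width = penumbra_width_px(spot_mm=0.3, sod_mm=200.0, sdd_mm=400.0, pixel_mm=0.11)

# Toy "sharp" projection: a vertical edge standing in for a simulated projection
sharp = np.zeros((4, 64))
sharp[:, 32:] = 1.0
blurred = blur_rows(sharp, width)   # (blurred, sharp) forms one training pair
print(f"penumbra width: {width:.2f} px")
```

A realistic simulator would replace the box kernel with a position-dependent focal spot model and add detector noise before pairing the projections.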

B. IMAGE DENOISING
Deep denoising techniques can be used to reduce radiation dose and improve image quality. According to the level of supervision during training, three types of deep denoising methods have been developed, i.e., supervised learning, weakly-supervised learning, and unsupervised learning. Supervised learning methods, such as deep CNNs with residual learning [45], [46] or with recurrent persistent memory units [47], were designed for image denoising and achieved the best performance. Weakly-supervised learning methods relax the requirement of paired noisy-clean data to unpaired noisy-clean data [48] or paired noise-to-noise data [49]. Using the unpaired noisy-clean data, [48] proposed GAN-based learning to create pairs of corresponding noisy-clean images as the training data. Recently, [49] demonstrated that paired noise-to-noise images are equivalent to paired noisy-clean images in training a model, achieving a denoising performance competitive with supervised learning methods. For applications where even unpaired noisy-clean or paired noise-to-noise images are unavailable, unsupervised learning methods were proposed that use only single noisy images for training. Deep image prior [50] is a generation process that maps random noise to a single noisy image; when the training process is terminated at the right moment, the network produces a denoised image. Most recently, Noise2Void [51] and its variants [52], [53] achieved promising results using only individual noisy images in training a network. Basically, a Noise2Void network estimates a blind spot in an image so that the network learns to map the surrounding pixels to the blind spot, achieving excellent denoising results.
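The blind-spot idea can be sketched in a few lines. The snippet below is a simplified illustration and not the reference implementation of [51]: a few interior pixels are selected at random, each is replaced by a randomly chosen 3×3 neighbor so that the network cannot simply copy its input, and the loss is evaluated only at those pixels against the original noisy values. The function names and the 3×3 neighborhood are assumptions.

```python
import numpy as np

# The eight neighbors of a pixel in its 3x3 neighborhood
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]

def blind_spot_input(noisy, n_spots, rng):
    """Return (masked image, boolean mask) for blind-spot training.

    Selected pixels are drawn from the image interior so that every
    3x3 neighbor exists; each is overwritten by one random neighbor.
    """
    h, w = noisy.shape
    masked = noisy.copy()
    mask = np.zeros((h, w), dtype=bool)
    flat = rng.choice((h - 2) * (w - 2), size=n_spots, replace=False)
    ys, xs = np.unravel_index(flat, (h - 2, w - 2))
    for y, x in zip(ys + 1, xs + 1):
        dy, dx = OFFSETS[rng.integers(len(OFFSETS))]
        masked[y, x] = noisy[y + dy, x + dx]
        mask[y, x] = True
    return masked, mask

def blind_spot_loss(prediction, noisy, mask):
    """MSE between prediction and original noisy values at masked pixels only."""
    return float(np.mean((prediction[mask] - noisy[mask]) ** 2))

rng = np.random.default_rng(0)
noisy = rng.normal(loc=0.2, scale=0.02, size=(32, 32))
masked, mask = blind_spot_input(noisy, n_spots=16, rng=rng)
# A network would take `masked` as input; the noisy image itself is used here
# as a stand-in prediction, which trivially gives zero loss.
print(blind_spot_loss(noisy, noisy, mask))
```

In an actual training loop, `masked` would be fed to the denoising network and `blind_spot_loss` would be minimized over many randomly masked patches.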
In our inner ear imaging application, paired noisy-clean images can be synthesized via Monte Carlo simulation, and single real noisy images can be acquired with the proposed CMCT system. The former data type can support supervised training although the simulated noise may not perfectly match the real counterpart, while the latter type of data contains realistic noise and texture. Combining these two types of datasets, we can design a semi-supervised learning method to learn from the data with and without ground truth labels simultaneously. For example, the model can be trained in the Noise2Void mode first and then fine-tuned with the paired noisy-clean data, or vice versa. An alternative is to train the model in the Noise2Void and supervised modes simultaneously.

C. MATERIAL DECOMPOSITION AND BEAM HARDENING CORRECTION
In addition to the high-resolution advantage, if the local/interior scan is performed in the multi-channel photon-counting mode, the energy information can be used for spectral analysis, e.g., K-edge imaging, material decomposition, beam hardening correction, and metal artifact reduction. Compared with traditional dual-energy CT, the PCD provides more energy channels and is more informative, while the potential spectral distortion issues at high imaging speed can be overcome with a deep learning based correction method [54]. In principle, the direct spectral measurement with the PCD allows better spectral separation than dual-source, fast kVp-switching, and dual-layer detector techniques. Most relevant to this inner ear imaging application is the use of the X-ray energy-dependent attenuation information for beam hardening correction, metal artifact reduction and material decomposition [8], so that the effects of the implanted electrodes and micro-environments can be optimally modeled.

VI. CONCLUSION
In conclusion, we have proposed a clinical micro-CT (CMCT) system empowered with a number of cutting-edge technologies for regional ultrahigh resolution imaging of a VOI, which is particularly suitable for human inner ear/temporal bone imaging. The background compensation technique has been proposed for fast and accurate local reconstruction with high resolution at a minimized X-ray dose, taking full advantage of prior information from a conventional medical CT scan. The HR feature, attenuation fidelity and robustness to geometric misalignment in the registration between the global and local scans have been demonstrated, establishing the feasibility of CMCT. Overall, our CMCT system is promising for inner ear imaging and other clinical applications.

APPENDIX I.: THE X-RAY TUBE POWER JUSTIFICATION
To justify the selection of our X-ray source, we measured the number of photons emitted per solid angle per second from a similar micro-focus X-ray source (Source-Ray, SB-120-350, with a spot size of s_0 = 75 μm) with a PCD (CZT-Medipix3RX, pixel pitch d_0 = 110 μm, with anti-charge-sharing ASICs) as a reference. The source-to-detector distance c_0 was 272 mm. The tube was operated at 120 kV and I_0 = 19 μA with a 1.96 mm aluminum filter. The measured average count rate was 1.03 × 10⁴ counts per second per pixel, corresponding to N_0 = 6.30 × 10¹⁰ counts per solid angle per second. The distribution of photons p(E) over the energy range was simulated with SpekCalc [55], as shown in Fig. 12.
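The conversion from counts per pixel to counts per solid angle can be reproduced with a short calculation. The sketch assumes a small-angle approximation in which one pixel near the central ray subtends a solid angle of (d_0 / c_0)²; the variable names are illustrative.

```python
# Values from the reference measurement
pixel_pitch_mm = 0.110            # d_0
source_detector_mm = 272.0        # c_0
counts_per_pixel_per_s = 1.03e4   # measured average count rate per pixel

# Small-angle approximation: solid angle of one pixel near the central ray
pixel_solid_angle_sr = (pixel_pitch_mm / source_detector_mm) ** 2

# Counts per steradian per second
n0 = counts_per_pixel_per_s / pixel_solid_angle_sr
print(f"pixel solid angle = {pixel_solid_angle_sr:.3e} sr")
print(f"N_0 = {n0:.2e} counts/sr/s")
```

With these inputs the calculation gives about 6.30 × 10¹⁰ counts per steradian per second, in agreement with the value of N_0 stated above.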
According to one study in the United States [56], the average human head circumference is 56.9 cm in males and 53.4 cm in females, corresponding to diameters of approximately 18.1 cm and 17.0 cm, respectively. A value of 20 cm is therefore a reasonable estimate of the maximum length of the X-ray path through the head during a scan.
Considering the mean skull thickness of 6.32 mm [57], we further assume that the path consists of l_b = 1.2 cm of bone and l_t = 18.8 cm of brain tissue.
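The path-length assumptions can be checked numerically. The sketch below treats the head cross-section as a circle, which is an idealization, and assumes that the beam crosses the skull wall twice; the variable names are illustrative.

```python
import math

circumference_cm = {"male": 56.9, "female": 53.4}   # from [56]
skull_thickness_cm = 0.632                          # mean thickness from [57]
max_path_cm = 20.0                                  # assumed maximum path length
bone_path_cm = 1.2                                  # l_b adopted in the text

# Equivalent circular diameters of the head
diameter_cm = {k: c / math.pi for k, c in circumference_cm.items()}

# The beam enters and exits through the skull, so it crosses bone twice
two_walls_cm = 2 * skull_thickness_cm

# Remaining path attributed to brain tissue
tissue_path_cm = max_path_cm - bone_path_cm

for k, d in diameter_cm.items():
    print(f"{k}: diameter = {d:.1f} cm")
print(f"two skull walls = {two_walls_cm:.3f} cm, l_t = {tissue_path_cm:.1f} cm")
```

Both equivalent diameters fall below the assumed 20 cm, and twice the mean skull thickness (about 1.26 cm) is close to the adopted l_b of 1.2 cm.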
Based on the CMCT design (a source-to-detector distance of c_m = 400 mm and a detector element size of d_m = 0.11 mm), when the selected source is operated at 120 kV and I_m = 500 μA with a focal spot size s_m of 50 μm, and is assumed to have an electron-to-photon conversion efficiency similar to that of our experimental source, the minimum counts (for the maximum path length through the head) received per second by one pixel on the detector can be estimated after corrections for the tube current and focal spot size as follows:  (5) Note that this is the count rate at the darkest pixel in the projections, which still provides a decent signal. In our VOI-centered scan, the projections from other views could be much brighter than this worst case; e.g., if we reduce l_t by half, the count rate reaches N_m = 3185.0 counts per second. Moreover, the selected source can be operated at 150 kV, which offers a much higher penetration ability. Hence, by taking 720 projections with a 1-second exposure each, we can finish the scan within 12 minutes and obtain a good-quality reconstruction.
In addition, innovative technologies such as the NanoX tube [9] and the MAAT tube [10] can raise the brightness of the micro-focus X-ray source severalfold, which would facilitate our undertaking and shorten the scan time. Furthermore, deep learning techniques can be employed for further scan acceleration and dose reduction. For example, deep denoising and few-view reconstruction can be utilized to reduce the exposure time and the number of views, respectively, while maintaining the reconstruction quality. Thus, it is feasible to reduce the scanning time to 3 minutes or less and still provide a good-quality reconstruction.
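The scan-time budget can be illustrated with simple arithmetic. The sketch below counts exposure time only and ignores robot motion and detector readout overheads, which is an assumption; the accelerated protocol (half the views at half the exposure) is one hypothetical way to reach the 3-minute target and is not a protocol specified here.

```python
def scan_time_min(n_views, exposure_s):
    """Total exposure time in minutes, ignoring motion and readout overheads."""
    return n_views * exposure_s / 60.0

# Baseline protocol: 720 projections with 1-second exposure each
baseline_min = scan_time_min(n_views=720, exposure_s=1.0)

# Hypothetical accelerated protocol: few-view reconstruction halves the views,
# deep denoising allows half the exposure per view
fast_min = scan_time_min(n_views=360, exposure_s=0.5)

print(f"baseline: {baseline_min:.0f} min, accelerated: {fast_min:.0f} min")
```

Any combination of view reduction and exposure reduction whose product gives a fourfold saving would meet the same target under these assumptions.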

APPENDIX II.: ROI RECONSTRUCTIONS WITH MIS-ALIGNMENTS
ROI reconstructions through background compensation with various misalignments in terms of isocenter position, initial view angle, and magnification factor, as shown in Fig. 13.

Schematic diagram of the proposed clinical micro-CT (CMCT) system.

Reconstructed ROI in the same display window, including the direct reconstruction with FBP from local data, FBP reconstruction after background compensation, the ground truth reconstructed from the HR global scan with 0.11 mm pixel size, and the HR ground truth from the HR global scan with 0.04 mm pixel size. The attenuation value is in cm⁻¹.

The simulated spectrum p(E) of our tungsten X-ray source operated at 120 kV.

ROI reconstructions with misalignment of isocenter position in background compensation, together with the ground truths (GT and GT HR), displayed in the same window (unit: cm⁻¹). The misalignment errors are from 0 to 4.483 mm with an increment of 0.498 mm.

ROI reconstructions with misalignment of initial view angle in background compensation displayed in the same window (unit: cm⁻¹).

ROI reconstructions with misalignment of magnification factor in background compensation displayed in the same window (unit: cm⁻¹). The misalignment errors are from −20% to 20% with a step of 5%.