Trajectory-Aware Adaptive Imaging Clue Analysis for Guidewire Artifact Removal in Intravascular Optical Coherence Tomography

Guidewire Artifact Removal (GAR) involves restoring missing imaging signals in areas of IntraVascular Optical Coherence Tomography (IVOCT) videos affected by guidewire artifacts. GAR helps overcome imaging defects and minimizes the impact of missing signals on the diagnosis of CardioVascular Diseases (CVDs). To restore the actual vascular and lesion information within the artifact area, we propose a reliable Trajectory-aware Adaptive imaging Clue analysis Network (TAC-Net) that includes two innovative designs: (i) Adaptive clue aggregation, which considers both texture-focused original (ORI) videos and structure-focused relative total variation (RTV) videos, and suppresses texture-structure imbalance with an active weight-adaptation mechanism; (ii) Trajectory-aware Transformer, which uses a novel attention calculation to perceive the attention distribution of artifact trajectories and avoid the interference of irregular and non-uniform artifacts. We provide a detailed formulation for the procedure and evaluation of the GAR task and conduct comprehensive quantitative and qualitative experiments. The experimental results demonstrate that TAC-Net reliably restores the texture and structure of guidewire artifact areas as expected by experienced physicians (e.g., SSIM: 97.23%). We also discuss the value and potential of the GAR task for clinical applications and computer-aided diagnosis of CVDs.


I. INTRODUCTION
Successfully applying restoration techniques to guidewire artifacts in the IntraVascular Optical Coherence Tomography (IVOCT) system can reliably correct this imaging defect [1], [2]. The correction improves vascular visualization and enables accurate analysis of the vessel wall's microstructure, ensuring the reliability and accuracy of IVOCT examination [3], as shown in Fig. 1. The IVOCT system provides high-resolution, microscopic-level imaging [4] in the polar and Cartesian domains, enabling accurate assessment of wall microstructure, vessel lumen area, and intracoronary stent expansion [5], [6]. However, due to the limitations of the imaging principle, guidewire artifacts inevitably appear in every frame, occupying about 10% to 20% of the image area [7]. These artifacts are caused by highly reflective artificial guidewires, which reflect the near-infrared light emitted by the probe during the 4Ps (Position, Purge, Puff, and Pullback) imaging process [8], [9], [10]. As a result, the artifact area appears as a black rectangular shadow that obscures the corresponding inner lumen information (Fig. 1-Left). The goal of Guidewire Artifact Removal (GAR) is to restore the actual vascular and lesion information within the artifact area by integrating clue representations in IVOCT videos that are highly related to artifacts through imaging clue analysis. Although the artifact area is occluded in a single frame, imaging clues for the artifact can be found in nearby frames of the IVOCT video. These imaging clues in the artifact-free areas consist of texture and structure information which, when analyzed, enable GAR to reliably restore the imaging defect of IVOCT and enhance the precision of vascular analysis, as shown in Fig. 2.
Although the significance of GAR has been emphasized in various literature [2], [3], [11], there has been limited development in addressing this task [12]. Several difficult challenges make it hard to reconstruct IVOCT videos that accurately reflect the actual vascular and lesion information. Reconstruction techniques designed for specific tasks usually target natural images or videos, prioritizing content rationality over detail restoration [13], [14], [15]. Achieving GAR with existing image or video reconstruction techniques is therefore difficult for two reasons: (i) Subtle imaging texture information and diverse imaging structure changes make it difficult to capture clinically reliable artifact information using imaging clues (Fig. 3(a)-Left).
(ii) The irregular direction of the guidewire trajectory and the non-uniform width distribution of the artifact, primarily caused by switching between auto-pullback and manual-pullback, make the location and variation of guidewire artifacts unpredictable (Fig. 3(b)-Left); as a result, traditional self-attention mechanisms cannot perceive inter-frame correlation in a single IVOCT video.
This work proposes an innovative Trajectory-aware Adaptive imaging Clue analysis Network (TAC-Net) to overcome the challenges associated with achieving reliable GAR in IVOCT. TAC-Net addresses these challenges with two key techniques: (i) adaptive clue aggregation, which adapts the reconstruction of the texture-focused original (ORI) video and the structure-focused relative total variation (RTV) video [16] to suppress texture-structure clue imbalances (Fig. 3(a)-Right); and (ii) a Trajectory-aware Transformer, which accurately perceives the location and variation of guidewire artifacts by capturing robust features highly correlated with reconstruction through a video-mask cascade and a polar location encoding, while avoiding interference from irregular and non-uniform guidewire trajectories (Fig. 3(b)-Right). We make the following contributions in this work:
• TAC-Net provides high-fidelity image reconstruction, mitigating the negative effects of guidewire artifacts, addressing the imaging defects in IVOCT, and eliminating the impact of missing signals.
• Adaptive clue aggregation balances texture-structure clues using active weight control in a parallel architecture, resulting in the realistic restoration of subtle textures and variable structures.
• The Trajectory-aware Transformer effectively extracts highly artifact-related features by mining self-attention distributions, while avoiding the interference of unpredictable artifact trajectories.
• Experiments on a dataset of 237 IVOCT cases demonstrate that our scheme achieves practical GAR, which in turn facilitates the clinical application and computer-aided diagnosis of CVDs.

II. RELATED WORK
In recent years, various deep learning-based methods have been proposed to address different IVOCT image and video analysis tasks [3], such as plaque analysis [17], [18], [19], lumen contour analysis [20], [21], [22], and stent analysis [23], [24], [25]. Despite the acknowledged significance of GAR in previous studies [2], [3], [11], only a limited number of reliable solutions have been developed for this task. Gharaibeh et al. utilized a traditional Generative Adversarial Network (GAN) to address missing image data in GAR [12]. GAR thus remains an underexplored yet valuable IVOCT analysis task, which is comprehensively explored for the first time in this research.
In the restoration of natural images (videos), deep learning-based methods focusing on ensuring content rationality have made significant progress. Initially, most reported methods utilized convolutional neural networks (CNNs) and recurrent neural networks (RNNs) as the architectural backbone. Kim et al. developed an encoder-decoder model that collected and refined information from neighboring frames to synthesize still-unknown regions [26]. Chang et al. introduced 3D gated convolutions to address uncertainty in free-form masks and a novel Temporal PatchGAN loss to enhance temporal consistency [27]. Lee et al. proposed copy-and-paste networks for video reconstruction that leverage additional information in other frames of the video [28]. Transformers, attention-based sequence-to-sequence (Seq2Seq) networks [29] initially proposed for natural language processing (NLP) tasks, have become a popular architecture in the NLP field due to their excellent ability to process sequence information. Researchers in the computer vision (CV) field have also applied Transformers to video reconstruction due to their strong performance in various tasks [30], [31]. Zeng et al. proposed learning a joint spatial-temporal Transformer network for video reconstruction [13]. Liu et al. designed a Transformer model for video reconstruction based on Soft Split and Soft Composition operations [14].
Recently, numerous reconstruction methods designed for specific video objects have emerged [32], [33]. However, these methods were designed for the reconstruction of natural images (videos) [34], [35], [36]. Compared with natural image (video) reconstruction, which prioritizes content rationality, GAR focuses more on restoring texture details and structural changes. Therefore, when natural image (video) reconstruction methods are transferred to IVOCT video GAR, their performance degrades to varying degrees. In this work, we propose TAC-Net, a method specifically designed to address the challenges and objectives of GAR. The network architecture is tailored to incorporate the unique characteristics of the IVOCT modality and address the difficulties associated with the GAR task. By carefully considering these factors, TAC-Net is capable of reconstructing IVOCT videos that are both suitable for image analysis tasks and fulfill clinical requirements.

III. METHODOLOGY
TAC-Net adaptively integrates subtle textures and variable structures and accurately perceives guidewire trajectories to represent the actual vascular and lesion information. It comprises two main modules: the Adaptive Clue Aggregation module (ACA-module) and the Trajectory-aware TRansformer module (TTR-module), as shown in Fig. 4.
Given an ORI video $X^{ori} = \{x^{ori}_1, \ldots, x^{ori}_N\}$ and an RTV video $X^{rtv} = \{x^{rtv}_1, \ldots, x^{rtv}_N\}$, each consisting of $N$ IVOCT frames of size $H \times W$, the artifact-free videos $Y^{ori} = \{y^{ori}_1, \ldots, y^{ori}_N\}$ and $Y^{rtv} = \{y^{rtv}_1, \ldots, y^{rtv}_N\}$ are obtained through the trajectory perception and clue balance of the ACA-module and TTR-module. Specifically, the ACA-module first encodes the input video sequences into clue representations, actively aggregates the representations of the ORI and RTV videos with an adaptive clue aggregator, and then reconstructs the video sequences through two decoders (Section III-A). The TTR-module is a Seq2Seq network that takes the clue representations encoded by the ACA-module's encoders as input, then provides the associated clue representations, consistent with the input shape, to the adaptive clue aggregator of the ACA-module (Section III-B). This ensures that the two modules work together seamlessly to accurately represent the vascular and lesion structures in the input videos and remove interference from artifacts.

A. Adaptive Clue Aggregation Module
To realistically restore subtle textures and variable structures in the guidewire artifact area, the adaptive clue aggregation module actively controls the aggregation weight of texture and structure clues to prevent any potential imbalance between the two types of clues during parallel aggregation.
Specifically, two independent two-stage encoders convert $X^{ori}$ and $X^{rtv}$ frame by frame into clue representation sequences $E^{ori} = \{e^{ori}_1, \ldots, e^{ori}_N\}$ and $E^{rtv} = \{e^{rtv}_1, \ldots, e^{rtv}_N\}$. The first stage uses convolutions that provide scale-progressive representations, and the second stage uses group convolutions that respond to the multiple scales of guidewire artifacts. Next, $E^{ori}$ and $E^{rtv}$ processed by the TTR-module are fed to the adaptive clue aggregator to obtain clue representations that adaptively integrate texture and structure information for reconstructing the video sequence. Finally, two independent one-stage decoders, symmetrical to the first stage of the encoder, use deconvolutions to reconstruct the video signals $Y^{ori}$ and $Y^{rtv}$ with high efficiency.
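As a rough illustration of this encoder design, the sketch below assembles a two-stage encoder using the channel widths, group numbers, and LeakyReLU slope reported in Section IV-C. The single-channel input, stride placement, and down-sampling in every first-stage layer are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """Sketch of an ACA-module encoder: stage 1 uses plain convolutions for
    scale-progressive features; stage 2 uses group convolutions to respond to
    multi-scale artifacts. Channel/group settings follow Sec. IV-C; strides
    and down-sampling positions are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        chs = [64, 64, 128, 128, 256]       # stage-1 output channels (Sec. IV-C)
        g_chs = [384, 512, 384, 256, 128]   # stage-2 output channels (Sec. IV-C)
        groups = [1, 2, 4, 8, 1]            # stage-2 group numbers (Sec. IV-C)
        stage1, in_c = [], 1                # single-channel polar frame (assumed)
        for out_c in chs:
            stage1 += [nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            in_c = out_c
        self.stage1 = nn.Sequential(*stage1)
        stage2, in_c = [], chs[-1]
        for out_c, g in zip(g_chs, groups):
            stage2 += [nn.Conv2d(in_c, out_c, 3, padding=1, groups=g),
                       nn.LeakyReLU(0.2)]
            in_c = out_c
        self.stage2 = nn.Sequential(*stage2)

    def forward(self, x):                   # x: (N, 1, H, W) stacked frames
        return self.stage2(self.stage1(x))
```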
The adaptive clue aggregator replaces the common concatenation scheme and actively and softly adapts the utilization of the representations corresponding to the ORI and RTV videos from the TTR-module. This allows for active texture-structure balance and integration, further enhancing the accuracy and realism of the reconstructed image, as shown in Fig. 4. Specifically, the adaptive clue aggregator is responsible for actively aggregating $E^{ori}$ and $E^{rtv}$ to generate the clue representations required by the one-stage decoder to reconstruct $Y^{ori}$. First, two independent adaptation blocks $B_1$ and $B_2$ (each consisting of a convolutional layer and a sigmoid activation function) obtain two feature parameters corresponding to the two clues, based on the comprehensive texture-focused and structure-focused representations. These feature parameters regulate, in a learnable form, the proportional relationship between the two clues during aggregation. Next, the scaling parameters $h_1$ and $h_2$ further scale this proportional relationship to control and constrain any imbalance in the weight values of the feature parameters. The products of the feature parameters and scaling parameters serve as adaptive weights $w^{ori}$ and $w^{rtv}$ that synthetically balance the aggregation of texture and structure clues, and a convolutional layer $C$ ensures that the output clue representation has the same number of channels as $E^{ori}$. After the processing of the adaptive clue aggregator, the output $E^{ori}_{out}$ adaptively aggregates the structure-focused clues in $X^{rtv}$, while the output $E^{rtv}_{out}$ is identical to the input $E^{rtv}$.
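For concreteness, the following is a minimal PyTorch sketch of the adaptive clue aggregator. The paper's display equation is not reproduced in this text, so the concrete aggregation form (sigmoid-gated weights applied to the two clue streams, followed by the channel-matching convolution $C$) is a hedged reconstruction from the surrounding definitions; the module and its internals are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdaptiveClueAggregator(nn.Module):
    """Sketch of the adaptive clue aggregator. B1/B2 (conv + sigmoid) produce
    feature parameters from the joint texture/structure representation;
    learnable scalars h1/h2 rescale them into adaptive weights w_ori/w_rtv,
    and a final convolution C matches the channel count of E_ori. The exact
    aggregation form is an assumption."""
    def __init__(self, c):
        super().__init__()
        self.B1 = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.Sigmoid())
        self.B2 = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.Sigmoid())
        self.h1 = nn.Parameter(torch.ones(1))   # scaling parameters constraining
        self.h2 = nn.Parameter(torch.ones(1))   # weight imbalance
        self.C = nn.Conv2d(2 * c, c, 3, padding=1)  # channel match with E_ori

    def forward(self, e_ori, e_rtv):
        both = torch.cat([e_ori, e_rtv], dim=1)  # joint texture+structure view
        w_ori = self.h1 * self.B1(both)          # adaptive weight, texture clues
        w_rtv = self.h2 * self.B2(both)          # adaptive weight, structure clues
        e_ori_out = self.C(torch.cat([w_ori * e_ori, w_rtv * e_rtv], dim=1))
        return e_ori_out, e_rtv                  # E_rtv_out equals the input E_rtv
```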

B. Trajectory-Aware Transformer Module
The trajectory-aware Transformer module offers fast modeling of long-distance associations while accurately detecting the attention distribution of artifact trajectories using a new joint attention calculation.
Specifically, the trajectory attention maps (Section III-B1) are first encoded into $E^{M} = \{e^{M}_1, \ldots, e^{M}_N\}$, and the resulting $E^{ori}$, $E^{rtv}$, and $E^{M}$ are used as input to the TTR-module. The feature representations are then divided and flattened into spatial and temporal sequences $(S^{ori}_{tem}, S^{ori}_{spa})$, $(S^{rtv}_{tem}, S^{rtv}_{spa})$, and $(S^{M}_{tem}, S^{M}_{spa})$ (Fig. 5(a)). Patches at different positions in a single frame are arranged in an embedding for spatial semantics, while patches at the same position in different frames of a sequence are arranged in an embedding for temporal semantics. Each temporal or spatial embedding sequence corresponding to an IVOCT video is denoted $S^{vid}$. Next, every $S^{vid}$ is contextually linked with the corresponding $S^{M}$ using a Transformer encoder that replaces traditional self-attention with trajectory-aware attention (Section III-B2) (i.e., $S^{ori}_{tem}$ and $S^{rtv}_{tem}$ are each calculated with $S^{M}_{tem}$, and $S^{ori}_{spa}$ and $S^{rtv}_{spa}$ are each calculated with $S^{M}_{spa}$). Finally, element-wise addition transforms $(S^{ori}_{tem}, S^{ori}_{spa})$ and $(S^{rtv}_{tem}, S^{rtv}_{spa})$ back into $E^{ori}$ and $E^{rtv}$ to ensure consistency between input and output.
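The division into temporal and spatial sequences can be pictured with the following sketch, which flattens patches within a frame into a spatial sequence and patches at the same position across frames into a temporal sequence. The tensor layout and the divisibility assumption are illustrative choices; the patch sizes are taken from Section IV-C.

```python
import torch

def to_sequences(e, pt=7, ps=3):
    """Sketch of dividing clue representations into temporal and spatial patch
    sequences (patch sizes from Sec. IV-C). e: (N, C, H, W) feature frames;
    H and W are assumed divisible by the patch sizes for clarity."""
    n, c, h, w = e.shape
    # Spatial semantics: patches at different positions within a single frame.
    s_spa = (e.unfold(2, ps, ps).unfold(3, ps, ps)   # (N, C, H/ps, W/ps, ps, ps)
              .reshape(n, c, -1, ps * ps)
              .permute(0, 2, 1, 3).flatten(2))       # (N, num_patches, C*ps*ps)
    # Temporal semantics: patches at the same position across frames.
    s_tem = (e.unfold(2, pt, pt).unfold(3, pt, pt)
              .reshape(n, c, -1, pt * pt)
              .permute(2, 0, 1, 3).flatten(2))       # (num_patches, N, C*pt*pt)
    return s_tem, s_spa
```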
1) Trajectory Attention Maps: The Trajectory Attention Maps (TAM) describe the extent to which artifacts affect each position in the polar IVOCT video, providing precise location and variation information for the trajectory-aware computation, as shown in Fig. 5(b).
Specifically, the maps are calculated by the following generalized operations: (i) the artifact center $cen$ and its symmetric point $sym$ in the Cartesian domain are computed (Eq. (2)) from $\theta_1$ and $\theta_2$, the horizontal coordinates of the left and right edges of the artifact in the polar domain; (ii) a mapping is then applied for any pixel $p(p_x, p_y)$ in a single-frame IVOCT, where $\alpha$ is the hyperparameter of mapping intensity and $M$ indicates the guidewire artifact area.
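Since Eqs. (2)-(3) are not reproduced in this text, the sketch below only illustrates the idea of a trajectory attention map: full weight inside the artifact area $M$ and a decay, controlled by $\alpha$, with circular angular distance from the artifact center $cen$. The exponential form and circular-distance normalization are assumptions, not the paper's mapping.

```python
import numpy as np

def trajectory_attention_map(mask, alpha=4.5):
    """Illustrative TAM sketch (assumed form, not the paper's Eqs. (2)-(3)):
    maximal inside the artifact area M, decaying with angular distance from
    the artifact center cen; alpha = 4.5 follows Sec. IV-C."""
    h, w = mask.shape                        # polar frame: rows=depth, cols=angle
    cols = np.where(mask.any(axis=0))[0]
    theta1, theta2 = cols.min(), cols.max()  # left/right edges of the artifact
    cen = 0.5 * (theta1 + theta2)            # artifact center (angular coordinate)
    x = np.arange(w)
    d = np.minimum(np.abs(x - cen), w - np.abs(x - cen)) / w  # circular distance
    tam = np.exp(-alpha * d)[None, :].repeat(h, axis=0)
    return np.where(mask > 0, 1.0, tam)      # full weight inside the artifact area
```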
2) Trajectory-Aware Attention: The TRajectory-aware Attention (TRA) models the attention distribution of artifact trajectories based on self-attention, thus introducing the location and variation of the artifact masks into the attention calculation to obtain the highly artifact-related semantics, as shown in Fig. 5(c).
Specifically, the vector group (queries $Q^{vid}, Q^{M}$, keys $K^{vid}, K^{M}$, and value $V$) that responds to attention relations is calculated by linear projection layers corresponding to $E^{vid}$ and $E^{M}$. To perceive the trajectory-attention distribution in the polar domain, $P^{M}$ is integrated into the calculation of trajectory-aware attention through element-wise multiplication, where $G^{L}$ is the parameter group of the polar location encoding. $G^{L}$ rigorously incorporates the relative relation of pixels in the polar-domain video into the attention calculation, so that the trajectory-aware Transformer completes the correlation analysis more precisely. Assuming two pixels in the plane space, the $\kappa$-adjusted polar distance is defined in terms of $\rho_1$ and $\rho_2$, the distances from the two pixels to the domain origin, and $\angle\Theta$, the absolute value of the polar-angle difference between the two pixels. For the polar location encoding, assuming $d_x \propto \kappa\rho_1$, $d_y \propto \kappa\rho_2$, and $a_{12} \propto 2\cos(\angle\Theta)$, the relative relation between two locations $l_1$ and $l_2$ is defined with $d_i$ and $a_{ij}$ as the distance and angle parameters for the location relation of the spatial sequence, and $b$ as the tuning parameter that adapts $d$ to the order relation of the temporal sequence. These parameters are selected from the alternative groups $\hat{D}$, $\hat{A}$, and $\hat{B}$ [31] according to the relative relation of the polar-domain video.
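The display equations for the attention calculation and the polar location encoding are not reproduced in this text. For the $\kappa$-adjusted polar distance, however, the stated assumptions $d_x \propto \kappa\rho_1$, $d_y \propto \kappa\rho_2$, and $a_{12} \propto 2\cos(\angle\Theta)$ are consistent with the law of cosines, which suggests the following plausible reconstruction:

```latex
% Plausible reconstruction (not verbatim from the paper): with \rho_1, \rho_2 the
% distances of the two pixels to the polar origin and \angle\Theta the absolute
% polar-angle difference, the law of cosines gives
d_\kappa(l_1, l_2) = \kappa\sqrt{\rho_1^2 + \rho_2^2 - 2\rho_1\rho_2\cos(\angle\Theta)}
                   = \sqrt{d_x^2 + d_y^2 - a_{12}\, d_x d_y}\,,
\quad d_x = \kappa\rho_1,\; d_y = \kappa\rho_2,\; a_{12} = 2\cos(\angle\Theta).
```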

C. Dual-Object Objective Function
The objective function is designed for the dual-object architecture, which aims to restore imaging information within guidewire artifacts in a reliable manner. The function comprises three losses whose gradients are calculated independently: a generator loss and two discriminator losses corresponding to the ORI video and the RTV video. The generator loss $L_G$ is a weighted combination of the reconstruction losses $L_{rec}$ and the adversarial losses $L_{adv}$, which are calculated separately for the two object videos; $\varepsilon$ is the hyperparameter that balances $L_{rec}$ and $L_{adv}$, and $\delta_1$ and $\delta_2$ are the hyperparameters of $L_{rec}$ and $L_{adv}$ that balance the two reconstructed objects. In the single-object video losses $L_{rec}$ and $L_{adv}$, $Y$ and $\hat{Y}$ denote the predicted video sequence and the object video sequence, respectively, and $\tau$ is a hyperparameter that balances the non-artifact and artifact areas. A single-object video discriminator loss $L^{vid}_D$ is defined for each object.
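The loss equations themselves are not reproduced in this text. A plausible form consistent with the stated roles of $\varepsilon$, $\delta_1$, $\delta_2$, and $\tau$ is sketched below; the exact weighting scheme is an assumption:

```latex
% Hedged sketch, not the paper's verbatim equations:
L_G = \delta_1\big(\varepsilon L_{rec}^{ori} + (1-\varepsilon)L_{adv}^{ori}\big)
    + \delta_2\big(\varepsilon L_{rec}^{rtv} + (1-\varepsilon)L_{adv}^{rtv}\big),
% with an L1 reconstruction term in which \tau balances the non-artifact and
% artifact areas (M denotes the artifact mask):
L_{rec} = \tau\,\big\| (1-M)\odot(Y-\hat{Y}) \big\|_1
        + (1-\tau)\,\big\| M\odot(Y-\hat{Y}) \big\|_1 .
```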

IV. EXPERIMENTS
Experimental results demonstrate the ability of TAC-Net to reliably restore the imaging signal of the guidewire artifact area in the IVOCT video, highlighting the superiority of the overall architecture and the effectiveness of the individual modules.

A. Dataset and Preprocessing
The experimental dataset consisted of 237 cases from different patients at Harbin Medical University. Each case comprised 50 consecutive frames, which were divided into five groups of 10-frame IVOCT video sequences. For training, 200 cases were used, while the remaining 37 cases were used for testing. To ensure the authenticity and reliability of the experimental analysis, we generated video sequences with ground truth for the guidewire artifact areas frame by frame using Algorithm 1 (see Fig. 6). The location distribution of guidewire artifacts in the experimental sequences obtained through this preprocessing was consistent with that of the initial videos, and the frame-by-frame motion trajectory of the guidewire artifact was synchronized with the rotation of the pullback.

Algorithm 1: The Initial Video Preprocessing.
Input: The initial video and mask sequence.
Output: The experimental input and ground-truth sequences.
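Since the body of Algorithm 1 is not reproduced in this text, the following Python sketch only illustrates the preprocessing idea described above: the real artifact mask is rotated to an artifact-free angular region of each polar frame so that the occluded pixels have a known ground truth, while the mask's rotation stays synchronized across frames. The half-turn relocation and the function signature are assumptions.

```python
import numpy as np

def make_training_pair(frames, masks, shift=None):
    """Illustrative sketch of the preprocessing idea in Algorithm 1 (assumed
    form): relocate the real artifact mask to an artifact-free angular region
    of each polar frame, keeping the frame-by-frame rotation synchronized,
    so the synthetically occluded pixels have a known ground truth."""
    inputs, gts, new_masks = [], [], []
    for f, m in zip(frames, masks):
        w = f.shape[1]
        s = shift if shift is not None else w // 2  # assumed half-turn relocation
        m_new = np.roll(m, s, axis=1)               # rotated, synchronized trajectory
        x = f.copy()
        x[m_new > 0] = 0                            # black out the synthetic artifact
        inputs.append(x); gts.append(f); new_masks.append(m_new)
    return inputs, gts, new_masks
```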

B. Evaluation Metrics
In order to evaluate the performance of the experimental methods on GAR, we adopted the following quantitative evaluation metrics for image reconstruction: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), Mean Absolute Error (MAE), and Fréchet Inception Distance (FID). These indicators are calculated as follows:
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right),$$
where $\mathrm{MAX}_I$ indicates the theoretical maximum value of the image, and $\mathrm{MSE}$ indicates the cumulative mean squared error between the ground-truth and predicted images.
$$\mathrm{SSIM} = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$
where $\mu_x$ and $\sigma_x^2$ are the mean and variance of image $x$, $\mu_y$ and $\sigma_y^2$ are the mean and variance of image $y$, $\sigma_{xy}$ is the covariance of images $x$ and $y$, and $c_1$ and $c_2$ are constants that maintain stability.
$$\mathrm{MAE} = \frac{1}{m \times n}\sum_{i=1}^{m \times n} \left|\hat{y}_i - y_i\right|,$$
where $m \times n$ is the resolution of the ground-truth or predicted image, and $\hat{y}_i$ and $y_i$ are the gray values of the ground-truth and predicted images.
$$\mathrm{FID} = \left\|\mu_r - \mu_g\right\|^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$
where $\mu_r$ and $\Sigma_r$ are the mean and covariance matrix of the feature vector set output for the real image set by Inception Net-V3, $\mu_g$ and $\Sigma_g$ are the mean and covariance matrix of the feature vector set output for the generated image set by Inception Net-V3, and $\mathrm{Tr}$ is the trace of the matrix.

TABLE I: Quantitative comparison indicates that TAC-Net is more suitable for the reconstruction of IVOCT video sequences compared to the SOTA methods.
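The per-frame indicators above can be computed with standard library routines; a minimal sketch follows (FID is omitted because it requires an Inception-V3 feature extractor over whole image sets).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(gt, pred, data_range=255):
    """Per-frame PSNR/SSIM/MAE for a ground-truth and a predicted grayscale
    frame, matching the definitions above."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=data_range)
    ssim = structural_similarity(gt, pred, data_range=data_range)
    mae = np.abs(gt.astype(np.float64) - pred.astype(np.float64)).mean()
    return psnr, ssim, mae
```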

C. Implementation Details
The video sequence length was set to 10 frames, and the input IVOCT frames were uniformly reshaped to a resolution of 624 × 336. The structured separation (RTV) parameters lambda, sigma, sharpness, and maxIter were set to 0.005, 2, 0.0005, and 4, respectively. For the adaptive clue aggregation module: the kernel size of the (group) convolutions was set to 3 × 3, and the negative slope of the LeakyReLU activations in the encoders and decoders was set to 0.2. The output channels of the convolutions and group convolutions were set to [64, 64, 128, 128, 256] and [384, 512, 384, 256, 128], respectively; the group numbers of the group convolutions were set to [1, 2, 4, 8, 1]; and the scale factor of the bilinear interpolation decoding was set to 2. For the trajectory-aware Transformer module: the α of the trajectory attention maps was set to 4.5, the number of trajectory-aware Transformer blocks was set to 4, the patch size for temporal semantics was set to 7, and the patch size for spatial semantics was set to 3. Two temporal patch discriminators [27] supervised the reconstruction of the ORI video and the RTV video, respectively.
In terms of the objective function, ε, δ<sub>1/2</sub>, and τ were set to 0.99, 0.8, and 0.1, respectively. The implementation was based on Python 3.6 and PyTorch 1.6, and an NVIDIA RTX 3090 GPU server was used for network training. The number of iterations and the batch size were set to 5e5 and 2, respectively. During training, the Adam optimizer was used to optimize the loss function, with parameters β<sub>1</sub> and β<sub>2</sub> set to 0.9 and 0.99, respectively. The learning rate was set to 1e-5 and decayed to 0.1 of the current learning rate every 1000 iterations.
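For reference, the optimizer and schedule described above map directly onto standard PyTorch components; the module below is only a placeholder for the TAC-Net generator.

```python
import torch
import torch.nn as nn

# Minimal sketch of the optimization setup in Sec. IV-C. The placeholder module
# stands in for the TAC-Net generator; all values follow the stated settings.
model = nn.Conv2d(1, 1, 3, padding=1)        # placeholder, not the real generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.99))
scheduler = torch.optim.lr_scheduler.StepLR( # decay to 0.1x every 1000 iterations
    optimizer, step_size=1000, gamma=0.1)
```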

D. Comparative Experiment
Based on the comparative experiment, the results demonstrate that TAC-Net produces reconstruction results that better reflect the actual vascular and lesion information, as shown in Fig. 7. Moreover, TAC-Net outperforms other state-of-the-art (SOTA) video reconstruction methods, as indicated in Table I.
We compared the reconstruction performance of TAC-Net with SOTA methods to validate the effectiveness of our method for GAR. For a clear and concise presentation of the comparative experiment, we consider the following SOTA methods:
• VINet: generates the reconstructed frames by refining information from multiple frames; both the flow and mask sub-networks aggregate and synthesize feature points stepwise [26].
• CAP: copies and pastes the corresponding contents of reference frames, filling the holes in the object frame with a copy network and a paste network [28].
• STTN: uses frame-level encoding-decoding and spatiotemporal Transformers to simultaneously fill holes in input frames with coherent content [13].
• ZITS: restores the image structure using an attention-based Transformer model in a fixed low-resolution sketch space [15].
We conducted a quantitative comparison of TAC-Net with the aforementioned SOTA methods on GAR, based on the PSNR, SSIM, and MAE indicators presented in Table I. TAC-Net outperformed all the SOTA methods on all indicators, demonstrating its superior performance in GAR. This improvement validates that the architecture of TAC-Net, tailored to the characteristics of IVOCT images, is more suitable for reconstructing IVOCT video sequences. Moreover, ZITS, an image reconstruction method, delivered acceptable results in the comparative experiment, highlighting its potential for exploiting structural features. The Transformer-based methods (e.g., FuseFormer) generally outperformed the CNN-based methods (e.g., CAP) in the quantitative comparison of the SOTA methods. This suggests that Transformers, which omit the transfer of stacked hidden parameters, are better suited to modeling video sequences than CNNs. Furthermore, the improvement of TAC-Net over the Transformer-based SOTA methods indicates that our trajectory-aware design of the Transformer is more appropriate for modeling IVOCT video sequences in the polar domain.
In Fig. 8, the boxplot displays the distribution of results for the comparative experiment. The boxplot is based on a sample size of 185 cases, where each data point represents the average value over all images in an IVOCT video. Statistical significance indicators are used to clearly show the differences between the various methods. The prediction distribution of ZITS is more scattered, which can be attributed to the inability of image reconstruction to utilize the coherence and correlation between frames in a video sequence. In contrast, TAC-Net has a more stable and concentrated prediction distribution with fewer outliers, indicating its reliability and effectiveness in GAR. Comparing the FID indicator of TAC-Net and the SOTA methods (Fig. 9), the distribution of the extracted feature vectors corresponding to TAC-Net is more similar to that of the real frame set. This demonstrates that the reconstructed video sequences obtained by TAC-Net are closer to the ground truth and provide more realistic imaging information.
The qualitative comparison results in Fig. 7(a)-(d) and (f) evaluate the detail of the guidewire artifact area, which is difficult to measure using the indicators in Table I. To present the comparison more intuitively, this work introduces two additional auxiliary views of the reconstruction results: a heatmap of the gray-level difference between the predicted result and the ground truth, and an edge binary map of the single IVOCT frame. These views reflect the texture reconstruction ability of each method and reveal the structure and trend of the predicted results without texture interference. The qualitative views show that TAC-Net recovers fine-grained details and consistent light-dark changes, and its predictions avoid ambiguity, which is important for doctors diagnosing CVDs. Compared with the other methods, TAC-Net's reconstruction is more similar to the ground truth in terms of hierarchical structure and contour trend. These qualitative results convincingly demonstrate that TAC-Net enables GAR with more realistic texture and structure.

E. Credibility Study
The credibility study results suggest that the reconstruction results of TAC-Net are highly consistent with the predictions of experienced physicians, as shown in Fig. 10.
To evaluate the reliability of GAR for clinical application, five experienced physicians were invited to score the visual quality of the reconstructed IVOCT videos. For each test case, a single frame of an IVOCT image was randomly chosen for evaluation. The scoring involved a total of 185 IVOCT images, resulting in 185 data cases in the experimental results. The final score for a case was the average of the scores provided by the five physicians. The physicians were instructed to assign scores on four aspects, each on a scale of 1 to 10 (higher scores indicate better visual quality), including a composite Synthesis score for overall visual quality (0.5 × Expectation + 0.25 × Texture + 0.25 × Structure). The reliability study results are presented in Fig. 10, which compares score variations across different cases, physicians, and methods. For different cases, we calculated the average scores of all physicians for all methods within each case. For different physicians, we computed each physician's average scores over all methods within a specific case. For different methods, we determined average scores based on all physician ratings for each method. Fig. 10(a) illustrates the variations observed across different cases, indicating discernible differences among them. Fig. 10(b) shows that although there are slight variations in each physician's scoring standards, the impartiality and professionalism of the five physicians' scores consistently reflect the disparities in reliability across the research methods. In Fig. 10(c), it is evident that TAC-Net achieved the highest scores in all four aspects of visual quality, signifying satisfactory results in terms of providing reliable guidance for physicians. Consequently, our method offers reliable assistance for clinical applications.

TABLE II: Ablation analysis indicates that the architecture design of TAC-Net is practical and effective according to the IVOCT modality and the polar-domain video.

F. Ablation Analysis
The ablation experiments indicate that the adaptive clue aggregation module and the trajectory-aware Transformer module in TAC-Net play a critical and beneficial role in the accurate restoration of artifact areas through contextual clue analysis, as shown in Table II.
We performed ablation experiments to assess the effectiveness of the proposed TAC-Net architecture. The ablation models mainly focus on the adaptive clue aggregation module and the trajectory-aware Transformer module. Specifically, the TAC-AS and TAC-AC variants evaluate the effectiveness of the asymmetric encoding-decoding and the adaptive clue aggregator in the adaptive clue aggregation module. The TAC-ED and TAC-TR variants verify the effectiveness of the overall design and advanced architecture of the trajectory-aware Transformer, while the TAC-PL variant verifies the effectiveness of the polar location encoding. The quantitative results of our ablation study are summarized in Table II, where the PSNR, SSIM, MAE, and FID metrics are used for evaluation.
Regarding the adaptive clue aggregation module, the asymmetric encoding-decoding evaluated by TAC-AS slightly improves TAC-Net's performance, indicating that group convolutions are beneficial for obtaining latent representations. Furthermore, the multi-scale mixed hierarchical features provided by two-level encoding are more robust than one-level encoding. However, this improvement is limited compared to the other designs, which underscores the significance of the trajectory-aware Transformer and polar-domain modeling in TAC-Net's overall architecture. The results of the full TAC-Net and the TAC-AC ablation also show that the adaptive clue aggregator more naturally integrates the structure clues of RTV video sequences into the reconstruction of ORI video sequences.
To verify the effectiveness of the trajectory-aware Transformer module, we compared TAC-Net with two baselines. The first, TAC-ED, uses an encoding-decoding architecture for image reconstruction. The second, TAC-TR, uses an encoding-Transformer-decoding architecture for video reconstruction but abandons the trajectory-aware design in the polar domain. The quantitative results demonstrate that simply using encoding and decoding for image reconstruction cannot achieve satisfactory results; the performance of TAC-TR is better, and the full TAC-Net achieves the best performance. These results confirm the significance of modeling frame-to-frame correlation in video sequences and highlight the effectiveness of our trajectory-aware upgrade of the Transformer for modeling and reconstructing video sequences in the polar domain. Compared with TAC-PL, in which the polar location encoding is replaced with an existing Cartesian-domain design, the full TAC-Net achieved better performance owing to its improved ability to analyze the contextual relevance of video sequences. Notably, the performance degradation of TAC-PL was more significant than that of most other ablation models, indicating that the introduction of polar location information is crucial for effective trajectory-aware modeling in the polar domain. The polar location encoding used in this work accurately describes location information in the polar domain and outperforms the traditional non-learning encoding used for the Cartesian domain in the GAR task.
To further demonstrate the importance of dual-object reconstruction in capturing structure information, we conducted an ablation study that removed the RTV object from the training process and compared the performance of the full TAC-Net with that of TAC-Net without the RTV object. As shown in Table III, TAC-Net's performance degrades without the RTV object, confirming the importance of the structure information contained in RTV video sequences for the network to perceive the structure features of video sequences. Additionally, as shown in Fig. 7(e) and (f), the inclusion of the RTV object leads to more reasonable, spatially coherent structure predictions. This can be attributed to the weight-balanced aggregation of structure features by the network during the reconstruction of RTV video sequences.

V. DISCUSSION
As a medical image analysis task, GAR has significant potential for exploration in terms of both significance and methods. The previous section detailed a method to address the conundrum of GAR. Here, we further discuss its value for clinical application and computer-aided diagnosis of IVOCT from multiple perspectives. Compared to the initial video, the IVOCT video after GAR provides physicians with an additional 10% ∼ 20% of reliable imaging signals, allowing for more continuous and complete vascular imaging while avoiding the interference of noisy signals from guidewire artifacts. The improved imaging enables physicians to gain a better understanding of IVOCT video sequences, and the assisted guidance in the guidewire artifact area effectively reduces the subjectivity of doctors in judging related diseases, thus minimizing the possibility of misdiagnosis and missed diagnosis.
The improvement in image quality brought by GAR has positive significance for computer-aided diagnosis of IVOCT video sequences, as it effectively enhances the integrity and continuity of IVOCT video sequences, which in turn helps deep learning methods extract features from IVOCT image signals. To segment vulnerable plaques before and after GAR in IVOCT videos, this work uses a trained segmentation network (U-Net [37] + ResNet50 [38]). Fig. 11 demonstrates that when guidewire artifacts overlap with vulnerable plaques, the reconstructed IVOCT video yields more accurate segmentation results that are closer to the ground truth. This indicates that the reconstructed video can provide the signal features of the artifact area. These additional features help the segmentation network better locate plaques and perform accurate segmentation. In video sequences with guidewire artifacts, the segmentation network can only predict vulnerable plaques in artifact areas from location features, whereas the reconstructed video enables the network to obtain signal features and location features simultaneously.
In addition, the absence of imaging signals from the guidewire artifact area hinders the segmentation of the vascular lumen contour. The contour of the artifact area is typically obtained through manual drawing and training based on past experience. GAR effectively restores the imaging signal in the artifact area, allowing for accurate segmentation of the vascular lumen contour in this region. Given these benefits, we believe that GAR is a promising medical image analysis technology that has the potential to facilitate the adoption of intelligent medical solutions and upgrades to imaging systems.
While our proposed TAC-Net for GAR shows promising results, our architecture design also has its limitations. The location-related operations in our method assume that the video sequence is in the polar domain, thus requiring domain transformation for video sequences in the Cartesian domain, which may lead to the loss of imaging information. Moving forward, we have several future research directions. Firstly, we plan to explore joint modeling of the polar and Cartesian domains to avoid unnecessary information loss during domain transformation. Secondly, we aim to continue investigating the potential and clinical application of GAR for computer-aided diagnosis of IVOCT, and to further enhance network performance on real clinical IVOCT images. Lastly, we will refine the evaluation framework for GAR to facilitate the advancement and research of this pioneering task.

VI. CONCLUSION
In this work, we studied the GAR task in depth to overcome imaging defects and improve the imaging quality of IVOCT examinations, thereby eliminating the interference of missing signals in the diagnosis of CVDs. To tackle the GAR task, we proposed an innovative and reliable GAR method, TAC-Net, which employs effective imaging clue analysis. The architecture of TAC-Net is designed to address the unique characteristics of the IVOCT modality and the numerous challenges associated with the GAR task. TAC-Net leverages adaptive clue balance to realistically restore subtle textures and variable structures, while the trajectory-aware Transformer precisely perceives guidewire artifacts and associates the spatiotemporal semantics. These design elements enable TAC-Net to complete the imaging signal reconstruction of the guidewire artifact area in a targeted and accurate manner. Experimental results demonstrate that our method can reconstruct IVOCT videos that are nearly indistinguishable from real imaging. Moreover, our method has great potential for clinical application and computer-aided diagnosis. Overall, our work provides valuable insights into the GAR task and presents a promising solution that can significantly improve the imaging quality of IVOCT examinations.