E-SEVSR—Edge Guided Stereo Endoscopic Video Super-Resolution

Integrating Stereo Imaging technology into medical diagnostics and surgeries marks a significant revolution in medical sciences. This advancement gives surgeons and physicians a deeper understanding of patients’ organ anatomy. However, like any technology, stereo cameras have their limitations, such as low resolution (LR) and output images that are often blurry. Our paper introduces a novel approach—a multi-stage network with a pioneering Stereo Endoscopic Attention Module (SEAM). This network aims to progressively enhance the quality of super-resolution (SR), moving from coarse to fine details. Specifically, we propose an edge-guided stereo attention mechanism integrated into each interaction of stereo features. This mechanism aims to capture consistent structural details across different views more effectively. Our proposed model demonstrates superior super-resolution reconstruction performance through comprehensive quantitative evaluations and experiments conducted on three datasets. Our E-SEVSR framework demonstrates superiority over alternative approaches. This framework leverages the edge-guided stereo attention mechanism within the multi-stage network, improving super-resolution quality in medical imaging applications.


I. INTRODUCTION
The continuous evolution of digital videos and images has led to significant advancements in visual quality across various domains. Cameras have become omnipresent, serving diverse purposes such as surveillance with CCTV, capturing moments through smartphones, contributing to medical sciences for more precise diagnostics and surgeries, aiding space exploration with satellites, and enriching daily life through various imaging devices. Technological progress has transformed imaging from the era of black and white to the current era of 8K resolution and beyond. Video and image resolution, determined by the number of pixels, is a crucial determinant of image quality.
The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.
Despite the impressive strides made in imaging devices and standards, certain limitations persist, resulting in blurriness, video noise, and loss of vital details. In contrast to stereo images, stereo videos are more susceptible to constrained spatial resolution owing to the added temporal dimension, which can limit applications demanding finer details. Although super-resolution techniques have been explored extensively over the years, prevailing methods predominantly center on restoring stereo images. The realm of stereo video super-resolution (StereoVSR) remains relatively uncharted. Super-resolution (SR) emerges as a solution to this challenge, recovering missing information from low-quality images.
Endoscopy is extensively used for surgical navigation and minimally invasive procedures [1]. However, the limited depth information and field of view in endoscopic videos captured by a single camera have prompted the increasing adoption of stereo cameras, particularly in intricate and robot-assisted surgeries [2]. Stereo endoscopic images, derived from two distinct viewpoints, offer valuable depth cues and enhanced sub-pixel information compared to their single-camera counterparts [3].
However, maintaining high video quality and resolution in endoscopy is challenging, primarily because of the confined surgical space and the limited field of view of endoscopic instruments. Optical sensors must be compact to capture tubular cavities and lumens of various scales effectively. Furthermore, unstable illumination conditions can lead to image degradation and the loss of crucial information in stereo endoscopic images. These issues can negatively impact subsequent procedures such as image classification, segmentation, and reconstruction [3], [4].
Consequently, there is a considerable advantage in improving the resolution and quality of stereo endoscopic images and video frames to mitigate these challenges and enhance the overall efficacy of endoscopic procedures.
In the context of stereo endoscopic video super-resolution (SR), including consecutive frames introduces valuable temporal consistency. Traditional video SR methods typically have the network process several successive images, extracting and synthesizing features to reconstruct high-resolution (HR) outputs. However, applying a conventional 2D video super-resolution approach to stereo video frames risks losing the alignment or correspondence between the left and right views. This risk is discussed further in recent work [80].
Despite the enhanced visual performance achieved by Convolutional Neural Network (CNN)-based super-resolution (SR) techniques compared to traditional methods, their efficacy is hampered by constraints related to the convolution kernel size and the restricted field-of-view regions. CNN models inherently possess limitations in capturing long-range dependencies. Recently, transformer networks, which integrate self-attention mechanisms, have emerged as a promising solution for addressing various visual challenges [5], [6].
Within transformer-based methodologies, the input image and frames are divided into smaller patches, which are subsequently treated as sequential token inputs. These tokenized inputs are then used to extract image features with self-attention mechanisms, considering the global relationships among the tokens. The Swin transformer [7] distinguishes itself by combining the advantages of both CNN and transformer architectures through parallel computing and the shifted window technique. This approach builds a hierarchical feature representation, starting with small patches and gradually combining adjacent patches in deeper transformer layers. By harnessing multi-scale feature maps, the Swin transformer model efficiently and effectively employs advanced methods for dense prediction and image reconstruction [8]. Furthermore, similar principles of feature extraction and handling complex scenarios have been effectively applied in gait recognition systems, where parameters such as clothing, angle shift, and walking style significantly impact system performance. These systems utilize advanced machine learning classifiers and feature selection methods to achieve high accuracy in real-time environments, demonstrating the adaptability of transformer models in various applications [75], [76]. This integration allows for a more comprehensive consideration of global dependencies, overcoming the limitations of purely CNN-based models.
Another approach integrates multi-view and temporal information by first using StereoSR techniques to generate super-resolved images. Video super-resolution (VSR) methods are then applied to enhance these frames further, resulting in videos with even greater resolution. In contrast to the previous two methods, this approach uses information from different views and from adjacent frames within a single view. However, these three approaches, while effective, do not consider cross-time, cross-view information, thereby missing an opportunity for further performance enhancement.

II. LITERATURE REVIEW
This section provides a concise overview of the super-resolution (SR) methods pertinent to our research, encompassing single-image SR [24], stereo-image SR [25], video SR techniques [26], and stereo video reconstruction. While single-image super-resolution (SISR) is effective for enhancing individual images, it falls short in leveraging the continuity present in video data, such as endoscopic videos, leading to suboptimal super-resolution results. Multiple Image Super-Resolution (MISR) addresses this limitation by incorporating multiple low-resolution (LR) images to generate a single high-resolution (HR) image.
Example-based methods typically take pairs of LR and HR patches and learn a mapping that translates LR patches into HR ones. Example-pair techniques can be tailored for general images or for specific types, such as medical images, depending on the provided training set of examples. Sparse coding stands as a state-of-the-art representative example-based super-resolution model.
In response to these challenges, researchers have introduced novel methods utilizing deep convolutional neural networks for super-resolution reconstruction. Hu et al. [27] modified the network architecture to improve performance and simplify training. Another method enhances the correlation between neighboring feature information and overall image quality by incorporating context information into the network. Despite these breakthroughs, ongoing research is essential to enhance super-resolution reconstruction methods further.

A. SINGLE IMAGE SR
Single Image Super-Resolution (SISR) has been a pivotal research focus for decades, with recent advancements showcasing the efficacy of deep learning in achieving high reconstruction accuracy [28], [29], [30]. The SENext [31] approach introduces a Squeeze-and-Excitation Next architecture for SISR, leveraging squeeze-and-excitation blocks (SEB) to reduce computational costs and dynamically recalibrate channel-wise feature mappings. Utilizing local, sub-local, and global skip connections enhances feature reusability and stabilizes training convergence. SENext, employing post-upsampling in the pre-processing step, outperforms previous methods.
Kim et al. [32] propose a popular SISR technique, the Very Deep Super-Resolution Network (VDSR) with twenty layers, showcasing the increasing complexity of SR networks in exploiting intra-view information. Zhang et al. [33] fuse residual and dense connections, introducing the Residual Dense Network (RDN) for comprehensive hierarchical feature characterization. Recent advancements include Residual Channel Attention Networks (RCAN) [34], Residual Non-Local Attention Networks (RNAN) [35], and Second-order Attention Networks (SAN) [36]. Muhammad et al. [37] present a novel architecture inspired by ResNet and Xception networks, significantly reducing network parameters and enhancing processing speed while achieving high-quality HR images. Experimental results establish this technique as a state-of-the-art SR method regarding accuracy, speed, and visual quality.

B. STEREO IMAGE SR
Recently, stereo image super-resolution (SR) has gained heightened attention, with notable works exploring the effective utilization of stereo information. Enhancing stereo images requires addressing the critical challenge of efficiently applying corresponding information between two views within the SR network. Bhavsar and Rajagopalan [25] introduced a comprehensive framework designed to simultaneously estimate the image depth map and the super-resolved (SR) image using multiple low-resolution (LR) images. This framework was developed by formulating a unified energy function and iteratively minimizing it through updates to the SR image and the disparity map.
Several convolutional neural network (CNN)-based stereo super-resolution (SR) approaches integrate features such as disparity and parallax attention. Jeon et al. [11] presented the Stereo Enhancement Super-Resolution model (StereoSR), which utilizes a single image along with a set of auxiliary shifted images to produce super-resolution (SR) results with improved details. Nonetheless, this method encountered constraints when dealing with stereo images containing varying disparities, mainly because it relied on a fixed maximum parallax. Addressing this, Wang et al. [38] introduced the Parallax-Attention Stereo Super-Resolution Network (PASSRnet). Their innovation was the parallax attention module (PAM), which efficiently captures information from both views along the epipolar line to improve correspondence matching. Ying et al. further expanded on this concept by integrating multiple PAMs into different stages of pre-trained single-image super-resolution (SISR) networks to enhance overall performance.
Yan et al. [9] pioneered a domain-adaptive stereo super-resolution (SR) network that estimates disparities using a pre-trained stereo matching network. They harnessed cross-view information by warping views to the other side, enhancing the overall stereo SR performance. On a different note, Xu et al. [39] introduced bilateral grid processing into convolutional neural networks (CNNs), presenting a Bilateral Stereo Super-Resolution Network (BSSRnet) explicitly designed for stereo image SR. In contrast, Chu et al. [40] introduced an innovative CNN-based approach known as NAFNet, which includes a distinctive Stereo Cross Attention Module (SCAM) block designed for parallax fusion.

C. VIDEO SR
One family of Video Super-Resolution (VSR) methods aligns frames using motion compensation modules that rely on optical flow estimation [46], [47]. Nonetheless, optical flow estimation is inherently challenging and susceptible to inaccuracies [48], [49]. Other approaches seek to exploit multi-frame information implicitly [17], [23]. For instance, Wang et al. [20] combine a pyramid, cascading, and deformable module for alignment with a temporal and spatial attention module for information fusion, attaining state-of-the-art results.
However, the StereoVSR task involves input with additional frames captured from a second viewpoint, so directly applying conventional VSR methods may not yield optimal results. To achieve improved performance, it becomes imperative to carefully account for and leverage the task-specific view-temporal correlations.

D. STEREO VIDEO SUPER-RESOLUTION
Stereo videos offer valuable multi-view information, presenting opportunities for improved performance in various reconstruction tasks. Recent studies have tackled the challenge of stereo video deblurring by simultaneously estimating 3D scene flow and removing blurs [50], [51]. In a different application, Li et al. [52] have harnessed depth information for stereo video re-targeting, facilitating the seamless adjustment of stereo content to screens of varying sizes and aspect ratios.
In endoscopic surgery, the quest for enhanced visual clarity has led to the development and comparison of various super-resolution (SR) methodologies, each with its own strengths and weaknesses. As delineated in Table 1, the progression begins with Minimally Invasive Surgery SR in 2011, which offered a comprehensive analysis of SR techniques but required a delicate balance between enhancement and artifact reduction [81]. By 2013, the RGB Hybrid 3-D Endoscopy method enhanced spatial resolution using RGB data, although its effectiveness was challenged in diverse environments [69]. Advancements continued with Hybrid Range Imaging, which introduced hybrid imaging with Maximum a Posteriori (MAP) estimation, though it faced motion estimation and data fusion challenges [69]. The introduction of Disparity-Constrained Parallel Attention marked a significant improvement in stereo image quality despite issues with stereo camera inconsistencies [83]. Real-Time Surgery Enhancement promised real-time performance in surgery but required further data collection and testing for validation [84]. The subsequent years saw the introduction of Disparity-Constrained Stereo SR, which was effective on specific datasets but struggled with adaptability in surgical applications [63]. The most recent advancements, Channel and Spatial Attention SR and Hybrid Attention for Endoscopic Video SR, highlighted the importance of detail enhancement and image reconstruction through attention mechanisms. Yet these methodologies underscored the need for refinement in diverse settings and for optimization of attention mechanisms, respectively [62], [66]. This progression reflects a continuous effort to refine image quality in endoscopic surgery, navigating the trade-offs between real-time applicability, environmental adaptability, and the computational demand of advanced SR techniques.
Our solution addresses the challenges that previously proposed models face in the Stereo Video Super-Resolution (StereoVSR) field, an area that has not been extensively explored in existing literature. The approach aims to leverage the benefits of deep learning for stereo videos, with the goal of notably enhanced super-resolution outcomes.

A. FEATURE EXTRACTION BLOCK
In this context, superscripts indicate tensor attributes such as left (L) and right (R) views, low-resolution (LR) and super-resolution (SR) resolutions, and processing statuses by specific modules. Subscripts of tensors represent temporal information, specifically the frame count, while subscripts of modules denote their order in the process. The model processes the input left and right LR frames to produce the corresponding high-resolution stereo endoscopic images.
The Combined Channel and Spatial Attention Block (CCSB) [66] consists of two integral components: the Channel Attention Block (CAB) and the Spatial Attention Block (SAB), as depicted in Figure 2. The CAB determines the importance of the various feature maps, while the SAB identifies critical areas within each feature map. The CAB applies average pooling and max pooling simultaneously, condensing the features into max-pooled and average-pooled descriptors. During the training phase, the max-pooled and average-pooled features undergo further processing via two densely connected layers. A reduction parameter is introduced to manage parameter complexity by setting the size of the hidden activation.
Spatial attention mechanisms are explored to emphasize important regions within feature maps. The refined features obtained from channel attention are separately subjected to max pooling and global average pooling, generating a 3-dimensional feature map. After the outputs of both pooling operations are concatenated, a 3-dimensional convolution with a kernel size of 3×3×3 is applied, creating a three-dimensional spatial attention map. This map undergoes a sigmoid activation to yield optimized features. The low-resolution (LR) images pass through the CAB, and the features extracted by the CAB then pass through the SAB for further refinement.
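The channel attention step described above can be illustrated with a minimal sketch in plain Python. It assumes a CBAM-style arrangement in which the two dense layers are shared between the average-pooled and max-pooled descriptors and their outputs are summed before the sigmoid; the text does not confirm these details, and the function names and weight layout (`w1` of size C/r × C, `w2` of size C × C/r) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_mlp(vec, w1, w2):
    """Two dense layers with a ReLU in between.
    w1 has shape (C/r) x C (the reduction), w2 has shape C x (C/r)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feature_maps, w1, w2):
    """feature_maps: list of C maps, each a list of rows.
    Returns one weight in (0, 1) per channel.
    Assumption: both pooled descriptors share the MLP and are summed."""
    avg = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    mx = [max(map(max, fm)) for fm in feature_maps]
    a = shared_mlp(avg, w1, w2)
    m = shared_mlp(mx, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]

# Two channels (C = 2) with reduction r = 2, so the hidden layer has one unit.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
weights = channel_attention(features, w1=[[1.0, 1.0]], w2=[[1.0], [-1.0]])
```

Each channel would then be multiplied by its weight before the spatial attention step. The reduction ratio trades parameter count against the capacity of the hidden layer.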

B. DEEP FEATURE REFINEMENT BLOCK
Each Deep Feature Refinement Block (DFRB) comprises four Residual Dense Blocks (RDBs). DFRBs are strategically placed after the feature extraction block and after every SEAM block to further refine the extracted features. Subsequently, the features extracted by these RDBs are concatenated and forwarded to the SEAM for further processing. This approach allows for comprehensive feature refinement and interaction before the features are used in the subsequent stages of the model.

C. SPATIAL FEATURE TRANSFORM
The spatial feature transform block [53] serves as a critical component in the processing pipeline, handling two sets of features: one obtained from an edge detection algorithm and the other refined through the Deep Feature Refinement Blocks (DFRB).
This block is designed to harmonize and integrate these distinct sets of features, leveraging their respective strengths. The features derived from the edge detection algorithm focus on capturing high-frequency information related to edges and boundaries within the input data. On the other hand, the refined features from the DFRB encapsulate more abstract and learned representations of the input, potentially encoding complex structures and patterns. The spatial feature transform block orchestrates the fusion or combination of these feature sets. It might employ various mechanisms, such as attention mechanisms, learnable transformations, or adaptive pooling strategies, to effectively merge the edge-focused information with the enriched and refined features from the DFRB. This fusion aims to leverage the complementary nature of the edge-derived details and the hierarchical representations learned by the DFRB.
By combining these features, the spatial feature transform block aims to create a unified representation that encapsulates detailed edge information and abstract contextual knowledge. This consolidated feature representation can significantly enhance subsequent processing stages, contributing to the overall effectiveness and robustness of the model for the given task.
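As one concrete illustration of such a fusion, spatial feature transform layers are commonly formulated as an element-wise affine modulation, in which the condition (here, the edge features) supplies a per-pixel scale and shift for the features being refined. The sketch below assumes that formulation and takes the scale and shift maps as given; in practice they would be predicted from the edge features by small convolutional layers. The function name is hypothetical, and the text above does not state which fusion mechanism the model actually uses.

```python
def sft_modulate(features, scale, shift):
    """Element-wise affine modulation of one feature map: features * scale + shift.
    All three arguments are H x W lists of rows with the same size.
    Assumption: scale and shift were predicted from the edge features elsewhere."""
    return [
        [f * s + b for f, s, b in zip(f_row, s_row, b_row)]
        for f_row, s_row, b_row in zip(features, scale, shift)
    ]

refined = [[1.0, 2.0], [3.0, 4.0]]        # DFRB output for one channel
edge_scale = [[2.0, 2.0], [2.0, 2.0]]     # hypothetical edge-derived scale
edge_shift = [[1.0, 1.0], [1.0, 1.0]]     # hypothetical edge-derived shift
modulated = sft_modulate(refined, edge_scale, edge_shift)
```

With a scale of one and a shift of zero the features pass through unchanged, so the edge condition only alters the refined features where it carries information.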

D. STEREO ENDOSCOPIC ATTENTION MODULE
The Stereo Endoscopic Attention Module (SEAM), a significant advancement in endoscopic imaging, incorporates several innovative features that set it apart from traditional super-resolution techniques. SEAM leverages stereo vision integration from stereo endoscopes, offering dual perspectives for more accurate depth and spatial relationship reconstruction. Its attention mechanism efficiently focuses on image areas with intricate details, thus enhancing super-resolution effectiveness. Unlike conventional methods, SEAM preserves edges and textures crucial for medical diagnostics and utilizes depth information from stereo images to guide the super-resolution process. This depth-aware approach is key to better preserving fine details.
Moreover, SEAM likely includes specialized noise reduction and artifact suppression components, ensuring enhanced image quality. Empirical evidence from practical applications demonstrates SEAM's superiority in preserving fine details in endoscopic images, backed by quantitative metrics and qualitative assessments.
We have also integrated the Occlusion Handling block within the SEAM module, which plays a pivotal role in generating symmetric stereo correspondence and deriving occlusions by utilizing the attention maps $M_{R \to L}$ and $M_{L \to R}$. This block is instrumental in identifying and managing occluded areas, which is crucial for accurate depth perception and feature extraction in stereo endoscopic videos.
The Occlusion Handling block's techniques are adept at processing occlusions, often caused by bodily fluids or tissues, which are common in endoscopic videos. These techniques focus on isolating and minimizing the impact of occlusions, leading to clearer, more interpretable images. This enhancement is vital in medical diagnostics and procedures where detail is paramount. The integration of this block improves the quality of stereo endoscopic videos by handling occlusions. It ensures accurate capture and representation of depth and spatial information, bolstering the overall effectiveness of the SEAM module in stereo endoscopic video super-resolution tasks.
Given a pair of stereo images $I_L, I_R \in \mathbb{R}^{H \times W}$, parallax attention maps $M_{R \to L}, M_{L \to R} \in \mathbb{R}^{H \times W \times W}$ can be generated. These maps are instrumental in identifying occluded regions, which arise where depth values change abruptly or near image boundaries. The occluded regions correspond to empty intervals in the attention maps, indicating the absence of counterparts in the other view.
The conversion of the right image into the left perspective, denoted as $I_{R \to L}$, is achieved through the equation
$$I_{R \to L} = M_{R \to L} \otimes I_R,$$
where $\otimes$ represents batch-wise matrix multiplication. The softmax normalization performed along the third dimension of $M_{R \to L}$ and $M_{L \to R}$ indicates the matching possibility between corresponding points in the stereo images.
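To make the warping step concrete, the following sketch applies an attention map of size H × W × W to a right image of size H × W one row at a time, so that each left-view pixel becomes a weighted average of the pixels on the same row of the right image. It is a plain-Python illustration under the assumption that raw matching scores are normalized with a softmax along the last dimension; the function names and the toy scores are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def warp_right_to_left(scores, right_image):
    """scores: H x W x W raw matching scores; right_image: H x W.
    Returns (attention, warped). attention[h][i] sums to 1 over j, and
    warped[h][i] = sum_j attention[h][i][j] * right_image[h][j]."""
    attention = [[softmax(s) for s in score_row] for score_row in scores]
    warped = [
        [sum(a * p for a, p in zip(att, right_row)) for att in att_rows]
        for att_rows, right_row in zip(attention, right_image)
    ]
    return attention, warped

# One row, three pixels: left pixel i matches right pixel max(i - 1, 0),
# which mimics a disparity of one pixel.
W = 3
scores = [[[50.0 if j == max(i - 1, 0) else 0.0 for j in range(W)] for i in range(W)]]
attention, warped = warp_right_to_left(scores, [[10.0, 20.0, 30.0]])
```

Because matching is restricted to the same row, the map needs only W entries per pixel, which is why its size is H × W × W rather than (HW) × (HW).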
The possibility of a point being occluded in the right view, and its effect on the left image, is calculated from the attention maps. To account for noise and rectification errors, this calculation is extended by ±2 pixels. The valid masks $V_L$ and $V_R$ for the left and right views, respectively, are calculated by applying a tanh function to $P'_L$ and $P'_R$, with a scaling factor $\tau$ set empirically.
The SEAM module, depicted in Figure 6, is enhanced by integrating point-wise convolution (PConv) and depth-wise convolution (DWConv) within the stereo endoscopic attention modules for cross-view feature extraction. Together with the Occlusion Handling block, this integration significantly advances stereo endoscopic video super-resolution. It also streamlines the network architecture by reducing parameters and computational demands.
Q denotes the query matrix derived from the source intra-view feature (for example, the left view), while K and V represent the key and value matrices derived from the target intra-view feature (for example, the right view). The dimensions H, W, and C correspond to the feature map's height, width, and number of channels.
SEAM introduces a cross-view attention mechanism that amalgamates information from both left and right-view images to produce cross-view attention maps.
This strategy leverages distinct information in each view, enhancing feature fusion and improving restoration. Given the intra-view features $F_{i,left}$ and $F_{i,right}$ (both with dimensions $\mathbb{R}^{H \times W \times C}$), the cross-view fusion features $F_{left \to right}$ are obtained through a process involving point-wise and depth-wise convolutions, denoted as $W^{(\cdot)}_p$ and $W^{(\cdot)}_d$, respectively. These convolutions refine features from both channel and spatial perspectives.
Similarly, the cross-view fusion features $F_{right \to left}$ are derived through a comparable process. Subsequently, the interacted cross-view information $F_{left \to right}$, $F_{right \to left}$ and the intra-view information $F_{i,left}$, $F_{i,right}$ are fused via element-wise addition, utilizing trainable channel-wise scales denoted as $\gamma_{left}$ and $\gamma_{right}$, which are initialized with zeros to stabilize training.
The final fusion equation combines these features to create a more comprehensive representation.
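The effect of the zero-initialized channel-wise scales can be seen in a small sketch. It assumes the fusion takes the form of the intra-view feature plus the scale times the cross-view feature, applied per channel; which cross-view feature is added to which view is not stated here, so the pairing and the function name are illustrative only.

```python
def fuse(intra, cross, gamma):
    """Element-wise fusion with one trainable scale per channel.
    intra, cross: C x N features (channel-major, pixels flattened); gamma: C scales.
    Assumed form: out = intra + gamma * cross."""
    return [
        [x + g * y for x, y in zip(intra_ch, cross_ch)]
        for intra_ch, cross_ch, g in zip(intra, cross, gamma)
    ]

intra_left = [[1.0, 2.0], [3.0, 4.0]]
cross_feat = [[10.0, 10.0], [10.0, 10.0]]

at_init = fuse(intra_left, cross_feat, gamma=[0.0, 0.0])   # equals intra_left
trained = fuse(intra_left, cross_feat, gamma=[1.0, 0.5])   # hypothetical learned scales
```

At initialization the block is an identity on the intra-view features, so the cross-view branch is blended in gradually as the scales move away from zero during training.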
In summary, SEAM employs a sophisticated attention mechanism that integrates information from multiple views, enhancing feature fusion and leading to more effective restoration results. This is achieved through operations involving projections, convolutions, and fusion techniques applied to intra-view and cross-view features.

E. EDGE ESTIMATION MODULE
Drawing inspiration from [54], we have refined the BDCN [55] model, tailoring it for endoscopic video super-resolution applications by fine-tuning it on several endoscopic datasets: Kvasir [85], Hamlyn [86], da Vinci [63], SCARED [57], and EndoVis [87]. This refinement has significantly improved the model's efficiency in detecting critical features in endoscopic imagery. Our edge estimation process involves passing the stereo input $I^{LR}$ through the BDCN-based edge detection network [56], generating multi-scale edge probability maps (specifically at a scale of 5) for both the left and right views. These maps, produced by the fine-tuned BDCN model, preserve stereo consistency across the views, an essential aspect of our methodology.
Subsequently, we employ a conditional subnetwork tailored for processing these enhanced edge probability maps. This subnetwork, consisting of four convolutional layers, takes as input the refined edge probability maps from both views and generates edge-guided features. These features, benefiting from the improved edge detection, serve as a shared input for the cross-view interaction component.
To limit the receptive field of the conditional network and focus on the improved edge features, we opt for 1×1 kernels across all convolutional layers. This design choice minimizes interference from smooth regions within the edge probability maps, emphasizing the extraction of pertinent information associated with the enhanced edge regions. By employing these kernel sizes, the network emphasizes edge features more effectively, allowing a refined and selective extraction of the edge-guided features critical for subsequent processing stages.
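The reason 1×1 kernels keep smooth regions from interfering can be shown directly: a 1×1 convolution mixes channels at each pixel independently, so changing one input pixel changes only the same pixel in the output. The sketch below is a plain-Python illustration with hypothetical weights, not the conditional subnetwork itself.

```python
def conv1x1(feature, weight, bias):
    """1x1 convolution. feature: C_in x H x W; weight: C_out x C_in; bias: C_out.
    Each output pixel depends only on the input pixel at the same location."""
    c_in = len(feature)
    h = len(feature[0])
    w = len(feature[0][0])
    return [
        [
            [b + sum(w_row[c] * feature[c][y][x] for c in range(c_in)) for x in range(w)]
            for y in range(h)
        ]
        for w_row, b in zip(weight, bias)
    ]

# One-channel edge probability map mapped to two feature channels.
edge_map = [[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
out = conv1x1(edge_map, weight=[[2.0], [-1.0]], bias=[0.0, 1.0])
```

A 3×3 kernel, by contrast, would let each edge pixel spread into its eight neighbours at every layer, so four stacked layers would blend edge responses with the surrounding smooth regions.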

F. RECONSTRUCTION AND UPSCALING
The Reconstruction Block is the final stage in the image processing pipeline, dedicated to reconstructing high-resolution (HR) images from the refined features derived from earlier processing stages. This block is designed to enhance image quality and detail, particularly in cross-view integration for stereo endoscopic videos. Composed of a sequence of tailored operations, the Reconstruction Block begins with a 1×1 convolution layer (Conv 1×1) designed to adjust channel dimensions efficiently. Following this, a Residual Dense Block (RDB) captures intricate patterns and fine-level details within the image content. The RDB's densely connected convolutional layers foster feature reuse and the extraction of intricate, hierarchical features. Subsequently, the Combined Channel and Spatial Attention (CCSA) block is employed, enhancing the discriminative power of the reconstructed images by dynamically accentuating pertinent spatial and channel features. The Reconstruction Block takes as input the output of the cross-view interaction block, represented as $F^m_{i,left}$ for the left view and $F^m_{i,right}$ for the right view at the m-th SEAM block.
The CCSA layer enhances the model's capability to focus on both channel-wise and spatially relevant features, which is crucial in endoscopic video super-resolution. This dual attention mechanism aids in the recovery of intricate details and textures, which is vital for medical diagnostics. It ensures the model does not overlook subtle yet diagnostically significant details often present in medical imagery.
After the CCSA layer, another Conv 1×1 operation fine-tunes the feature representations. To further refine spatial information and ensure contextual coherence, a 3×3 convolution (Conv 3×3) is applied. This step contributes to smoothing and enhancing local patterns, ultimately improving the overall quality of the super-resolved (SR) images.
The concluding step employs an Upsampling Block, which upscales the refined feature maps to the desired HR image size. This stage maps the LR feature maps into the HR image domain, ensuring that the final output images possess the desired level of detail and clarity.
The output of the Reconstruction Block encompasses both left-view and right-view SR images, representing the culmination of the entire processing pipeline. Their quality reflects the efficacy of the model's feature extraction, attention, and reconstruction mechanisms. Augmented with the CCSA block, the Reconstruction Block is pivotal in transforming LR stereo endoscopic inputs into high-quality, super-resolved output images.

IV. EXPERIMENTAL RESULTS
This section begins by presenting the datasets used and outlining the experimental settings. A comparative analysis is then performed between the proposed model and various image SR and video SR methods. Finally, ablation studies are carried out to validate the components of our proposed method.

A. EXPERIMENTAL SETTINGS
To train our model, we utilized 240 pairs of stereo video frames sourced from the da Vinci dataset [63] as the training dataset. The high-resolution (HR) images were downscaled with bicubic interpolation to create the low-resolution (LR) training images. Data augmentation included vertical flipping of the images. For testing, three stereo endoscopic video datasets were used: the test set from the da Vinci dataset, comprising 80 pairs of stereo endoscopic video frames recorded using the da Vinci system's stereo cameras; the SCARED dataset [57], containing 120 stereo video frames; and the EndoVis dataset [85] from the MICCAI 2017 Kidney Boundary Detection Sub-Challenge, which includes a variety of clinical conditions. This diverse testing regimen provides a comprehensive platform to evaluate the versatility and efficacy of the E-SEVSR model.
The network architecture was constructed using PyTorch and trained on an NVIDIA RTX 3090 Ti GPU. For optimization, the Adam optimizer was employed with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. A batch size of 8 was used during training, and the initial learning rate was set at $1 \times 10^{-4}$. The parameter k was set to 1, signifying that three consecutive frames were used as input during training.
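The relationship between k and the number of input frames can be sketched as follows: a window of radius k around frame t contains 2k + 1 frames, so k = 1 gives three. How the first and last frames of a sequence are handled is not stated in the text; the sketch assumes indices are clamped to the valid range, and the function name is hypothetical.

```python
def frame_window(t, num_frames, k=1):
    """Indices of the 2k + 1 frames centred on frame t.
    Assumption: indices falling outside the sequence are clamped to its ends,
    so the first and last frames are repeated at the borders."""
    return [min(max(i, 0), num_frames - 1) for i in range(t - k, t + k + 1)]

middle = frame_window(5, num_frames=10)   # previous, current, and next frame
first = frame_window(0, num_frames=10)    # the first frame is repeated
```

The same window would be taken from both the left and the right sequence so that each training sample carries matching frames from the two views.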
In our approach, we use a pixel-wise L1 loss function. For a training set of N training pairs, the loss is the L1 distance between the network output and the corresponding ground-truth high-resolution images, computed over all N pairs. The edge priors are supplied to the network as a condition within the function. The term E-SEVSR(·) denotes the complete function of the proposed E-SEVSRNet, encapsulating the entire process within the network architecture. Minimizing this loss optimizes the network parameters to reduce the discrepancy between the output of E-SEVSRNet and the ground-truth high-resolution images across the training dataset.
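One common way to write a pixel-wise L1 objective of this kind is sketched below. The symbols are assumptions chosen for illustration and may differ from the paper's own notation: $\theta$ for the network parameters, $E^{(n)}$ for the edge priors of the n-th training pair, $I_{LR}^{(n)}$ and $I_{HR}^{(n)}$ for the low- and high-resolution frames, and the $1/N$ normalization.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N}
  \left\| \text{E-SEVSR}\!\left( I_{LR}^{(n)} \,\middle|\, E^{(n)};\, \theta \right)
          - I_{HR}^{(n)} \right\|_{1}
```

Here the vertical bar expresses that the edge priors act as a condition on the network rather than as an additional reconstruction target.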

B. EVALUATION RESULTS
Image super-resolution evaluations commonly rely on the peak signal-to-noise ratio (PSNR) as a fundamental quantitative measure of the similarity between high-resolution (HR) and super-resolved (SR) images. The structural similarity index measure (SSIM) is also used as a perceptual metric of image similarity. Both metrics are computed in the RGB color space, and on both our algorithm outperforms the SR methods we compare against. The PSNR and SSIM scores are averaged over the left and right views across frames, calculated as (Left + Right)/2.
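The (Left + Right)/2 averaging can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: images are taken to be 8-bit and passed as flat sequences of RGB values, and the helper names `psnr` and `stereo_psnr` are hypothetical.

```python
import math


def psnr(hr, sr, max_val=255.0):
    """PSNR in dB between two equally sized images.

    Each image is a flat sequence of pixel values (all RGB channels
    flattened together). The 8-bit peak value is an assumption.
    """
    if len(hr) != len(sr) or len(hr) == 0:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((a - b) ** 2 for a, b in zip(hr, sr)) / len(hr)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def stereo_psnr(hr_left, sr_left, hr_right, sr_right, max_val=255.0):
    """Average the left-view and right-view PSNR as (Left + Right) / 2."""
    left = psnr(hr_left, sr_left, max_val)
    right = psnr(hr_right, sr_right, max_val)
    return (left + right) / 2.0
```

A dataset-level score would then be the mean of `stereo_psnr` over all test frames; SSIM can be averaged across views in the same way.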
Table 2 reports the PSNR and SSIM scores achieved by our proposed network on the test sets for the ×2 and ×4 SR tasks. Our method's PSNR values surpass those of the other single-image, stereo, and video SR methods on the three test sets. For the ×2 stereo SR task, our model achieves superior PSNR and SSIM values across all datasets. These quantitative results support our model's effectiveness in leveraging temporal cross-attention and parallel attention mechanisms to reconstruct HR images. Figure 11 demonstrates the efficacy of our model under varying lighting conditions on the EndoVis dataset. Notably, our model's outputs remain close to the ground-truth high-resolution (HR) images despite lighting variations, showing its robustness in complex lighting scenarios. This is particularly significant because lighting conditions can substantially affect the perceived quality and clarity of super-resolved images.
Furthermore, compared to existing methods, our model stands out by applying SEAM across stereo image pairs, which enhances SR performance, especially in edge and texture details. Incorporating SEAM contributes significantly to the portrayal of intricate details in the super-resolved images.

V. THE SIGNIFICANCE OF EDGES IN ENDOSCOPIC IMAGE AND VIDEO SUPER-RESOLUTION
Recent advancements in edge enhancement techniques have significantly improved the sharpness and clarity of endoscopic images [67]. Techniques such as edge enhancement optimization increase perceptual sharpness and reduce noise, thereby improving the overall image quality perceived by medical professionals. This is particularly important in endoscopic procedures, where fine details and contrasts in tissue structures play a critical role in diagnosis [67]. Edges play a crucial role in the processing and analysis of digestive endoscopy images, the diagnosis of colorectal diseases, and the identification of pathological collagen [68]. Enhanced edge representation is vital for distinguishing between tissue types and identifying abnormalities. This becomes even more crucial in minimally invasive surgeries, where visual clarity and detail are paramount for successful outcomes [69]. Edge enhancement in endomicroscopy is likewise important because it leads to more precise visualization of cellular structures and tissues [70]. In conclusion, because endoscopic videos are essentially sequences of image frames, edge enhancement matters for both endoscopic image and video super-resolution. Edges delineate critical structures in each frame, affecting the overall effectiveness of video analysis and diagnostics. Our model, by introducing a novel edge detection technique for endoscopic video super-resolution, has demonstrated improved results both quantitatively and qualitatively.

VI. ABLATION STUDY
A. FE BLOCK
We validate our model's effectiveness with three feature extraction configurations: Conv, CCSB, and CCSB+ASPP. These techniques extract the features used for subsequent transformation. The results in Table 3 show that the best performance is achieved when CCSB is combined with ASPP, with PSNR/SSIM scores of 43.07/0.9943.
Notably, omitting ASPP from the FE process degrades performance: PSNR and SSIM drop by 0.16 dB and 0.0001, respectively. Restricting the feature extraction to Conv, as employed in MESFINet [54], results in a larger decline, with a total decrease of 0.26 dB in PSNR and 0.0002 in SSIM. These decreases emphasize the importance of combining CCSB and ASPP to achieve optimal outcomes.
The quantitative outcomes effectively emphasize the benefits and effectiveness of integrating CCSB and ASPP simultaneously, reaffirming their pivotal contribution to substantial performance improvements. These findings underscore the crucial nature of this feature extraction strategy within our model, highlighting its capability to enhance output quality significantly.

TABLE 4. Ablation study increasing the number of SEAM blocks from q = 1 to q = 4.

B. NUMBER OF SEAM BLOCKS
We began by examining the impact of varying the number of SEAM blocks within the network while keeping the number of RDBs fixed at 4. Figure 12 displays the trade-off between PSNR and network parameters across different numbers of SEAM blocks. The results for both ×2 and ×4 SR are detailed in Table 4, where we conducted an ablation study by progressively increasing the number of SEAM blocks in the network from q = 1 to q = 4.
Our analysis indicates that setting q=3 strikes an optimal balance between SR performance and network parameters.This configuration allows for consistent enhancements by leveraging additional stereo information for image reconstruction.To optimize this equilibrium, we ultimately adopt a 3-stage E-SEVSR.

C. EDGE PROBABILITY MAP
To explore the impact of edge probability maps on super-resolution (SR), we used several edge detectors to generate these maps and report the outcomes in Table 5. The results show a clear connection between the quality of the edge priors and overall SR performance: higher-quality edge probability maps yield better SR outcomes. At the same time, although the differences among the detectors' edge probability maps are discernible, the impact of the detector choice on SR performance is somewhat limited.
Specifically, comparing edge probability maps generated by Canny [65], Sobel [64], DexiNed [73], RCN [74], and BDCN (Fine-tuned), we note that BDCN (Fine-tuned) increases PSNR by 0.24 dB and SSIM by 0.0004 compared to RCN [74]. Consequently, our model adopts BDCN (Fine-tuned) for edge estimation. These observations underscore the significance of edge priors, with BDCN (Fine-tuned) exhibiting a notable advantage in this context.
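To make the notion of an edge probability map concrete, the sketch below builds one with the classical Sobel operator, the simplest of the detectors compared above. It is an illustration only and is not the fine-tuned BDCN the model adopts. The function name `sobel_edge_map`, the replicate-border padding, and the normalization by the peak magnitude are assumptions made for this example.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edge_map(gray):
    """Return the Sobel gradient magnitude of a 2-D image, scaled to [0, 1].

    `gray` is a list of rows of intensities. Border pixels are replicated
    (an assumed padding choice).
    """
    h, w = len(gray), len(gray[0])

    def px(y, x):
        # Clamp coordinates so that borders repeat the nearest pixel.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = sum(SOBEL_X[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * px(y + j - 1, x + i - 1)
                     for j in range(3) for i in range(3))
            mag[y][x] = math.hypot(gx, gy)

    peak = max(max(row) for row in mag)
    if peak == 0:
        return mag  # flat image: no edges anywhere
    return [[v / peak for v in row] for row in mag]


# A vertical step edge: the response concentrates on the two middle columns.
step = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_edge_map(step)
```

Learned detectors such as BDCN produce maps in the same [0, 1] range but with far fewer spurious responses, which is consistent with the higher SR scores reported for them in Table 5.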

VII. LIMITATIONS AND FUTURE WORK
Our current model establishes a robust baseline for Stereo Endoscopic Video Super-Resolution, following experimental procedures consistent with existing studies [62], [71], [72]. However, our model currently does not support real-time super-resolution in endoscopic surgeries due to computational constraints and the lack of resources for real-time application in surgical settings. Future enhancements could include integrating motion estimation blocks, frame interpolation, and feature temporal interpolation. Additionally, hardware improvements such as multiple GPUs, high-speed I/O interfaces, FPGAs, server clustering, or Application-Specific Integrated Circuits (ASICs) could significantly augment real-time processing capabilities.
While our model is currently specialized for endoscopy, adapting it for broader applications in medical imaging, including modalities such as MRI, CT, and PET, is a compelling direction for our future research.These potential modifications and advancements pave the way for the practical deployment of our model in real-time surgical environments and beyond, extending its applicability and efficacy in clinical settings.

VIII. CONCLUSION
Our paper introduces a novel Stereo Endoscopic Attention Module (SEAM) to enhance cross-view feature interaction in Video Super-Resolution (VSR). To further augment stereo SR performance, we propose integrating a pre-trained BDCN (Fine-tuned) model to leverage edge information effectively. We demonstrate the effectiveness of our proposed network by conducting comprehensive comparisons, both qualitatively and quantitatively, with existing models in the domain of stereo super-resolution. These experiments are designed to illustrate the superior performance of our model, showcasing its competitive advantage over other methodologies in the field in terms of visual quality and quantitative evaluation metrics. Moreover, we substantiate the effectiveness of our SEAM through a series of experiments that involve quantitative comparisons. These experiments highlight the advantages and improvements of incorporating our proposed Stereo Endoscopic Attention Module. This demonstrates its capability to significantly enhance the quality and performance of stereo super-resolution tasks compared to other existing methods.

FIGURE 1. Overview of the proposed E-SEVSR network architecture. $I_{LR}$ denotes the input low-resolution video frames (left side) and $I_{SR}$ the reconstructed output video frame (right side). The network takes in low-resolution video frames and generates high-quality, super-resolved video frames as output.

FIGURE 2. Architecture of Combined Channel and Spatial Attention Block (CCSB).

FIGURE 3. The Feature Extraction block comprises one CCSB block and three ASPP blocks. Each ASPP block has three dilated convolution layers with dilation rates of 1, 4, and 8, which enlarge the receptive field of the features extracted by CCSB.
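The effect of the dilation rates on the receptive field can be checked with a short calculation. The sketch below assumes 3×3 kernels, which the caption does not state, and the helper name `effective_kernel_size` is hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel.

    A dilation of d inserts d - 1 gaps between neighbouring kernel taps,
    so a k-tap kernel spans k + (k - 1) * (d - 1) pixels.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation rates from the caption; the 3x3 kernel size is an assumption.
for rate in (1, 4, 8):
    size = effective_kernel_size(3, rate)
    print(f"dilation {rate}: covers {size}x{size} pixels")
```

Under that assumption the three layers cover 3×3, 9×9, and 17×17 neighbourhoods with the same number of weights each, which is how ASPP widens the context seen by the CCSB features without adding parameters per layer.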
$\frac{n_{\text{channels}}}{r} \times 1 \times 1 \times 1$. Finally, a sigmoid activation function is applied, yielding the channel attention values $F_i^{(\text{left},CA)}$ and $F_i^{(\text{right},CA)}$, where $i$ represents the current frame under processing.

FIGURE 4. Residual Dense Blocks (RDBs) consist of multiple convolution and ReLU layers that provide deep feature extraction for feature refinement.
FE block. Utilizing Residual Dense Blocks (RDB) allows for generating numerous local features while maintaining a broad receptive field, contributing significantly to superior Super-Resolution (SR) results. Incorporating RDBs is warranted due to their inherent capacity to facilitate learning complex and hierarchical features. These blocks encompass multiple densely connected convolutional layers, fostering feature reuse and empowering the model to capture intricate patterns and structures within the data.

FIGURE 5. Spatial Feature Transform (SFT) block, which processes features from the pre-trained edge detection model together with the output features of the DFRB. An SFT block is deployed before every SEAM block to improve the model's effectiveness by integrating edge information.
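For orientation, the commonly used form of a spatial feature transform is an element-wise affine modulation of the features, with the scale and shift predicted from the prior. The sketch below follows that general formulation. The symbols $F$ (DFRB output features), $E$ (edge features), $\gamma$, $\beta$, and the mapping $\mathcal{M}$ are notational assumptions, and the variant used in this network may differ in detail.

```latex
(\gamma, \beta) = \mathcal{M}(E), \qquad
\mathrm{SFT}(F \mid \gamma, \beta) = \gamma \odot F + \beta
```

Because $\gamma$ and $\beta$ have the same spatial size as $F$, the edge prior can strengthen or suppress features location by location before they reach the SEAM block.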

FIGURE 6. SEAM Architecture for cross-view feature extraction.

FIGURE 7. Edge Estimation using the fine-tuned BDCN model for enhanced edge detection in endoscopic imagery.

FIGURE 9. Evaluation of the perceptual quality of high-resolution images generated by image super-resolution methods for a scale factor of ×4 on the da Vinci dataset.

FIGURE 10. Evaluation of the perceptual quality of high-resolution images generated by image super-resolution methods for a scale factor of ×4 on the SCARED dataset.

FIGURE 11. Evaluation of the perceptual quality of high-resolution images generated by image super-resolution methods for a scale factor of ×4 on the EndoVis dataset.

Figure 9, Figure 10, and Figure 11 depict the qualitative performance comparisons of various methods for ×4 SR on the da Vinci, SCARED, and EndoVis datasets. These comparisons provide detailed observations in zoomed-in regions. Qualitatively, stereo SR methods capture finer details better than single-image super-resolution (SISR) approaches. E-SEVSR generated clearer and better-quality images than the state-of-the-art (SOTA) methods.

FIGURE 12. PSNR performance when using different numbers of SEAM blocks, from q = 1 to q = 4, on the da Vinci dataset at ×2.

TABLE 5. Ablation study incorporating different edge detection models, using the da Vinci dataset at ×2.

TABLE 1. Comparison of Super-Resolution Methods in Endoscopic Surgery.

TABLE 3. Ablation study integrating different FE blocks, using the da Vinci dataset at ×2.