Multi-Granular Semantic Analysis Based on Nasal Endoscopic Video

The semantic analysis of nasal endoscopic video is a challenging task because untrimmed surgical video contains a large amount of irrelevant and insignificant information, e.g., background, blurred, juddering or blood-stained video fragments. For medical education and research, it is important to identify the start and end points of the valid surgical fragments automatically and to remove the invalid fragments from endoscopic surgery videos. However, the performance of deep-learning-based methods that use a fixed time interval or a sliding window is severely affected when the interference appears randomly in the nasal endoscopic video. Specifically, the surgical video is a continuous process globally, while many local discontinuities are introduced as the endoscope frequently enters and exits the cavity. Hence, we propose a multi-granularity semantic analysis framework that meets both the accuracy and the timeliness required for semantic analysis of endoscopic surgery video. Our approach is an end-to-end solution. First, a joint model extracts the spatial-temporal features of the surgical video on a coarse-grained scale, while an attention mechanism automatically selects the informative spatial features of the endoscopic video. Second, a hierarchical self-correction module iteratively corrects the boundaries of the surgical operation on a fine-grained scale. Finally, we validate the proposed network through extensive experiments and quantitative comparisons against other state-of-the-art approaches. Our method achieves good performance in terms of accuracy and efficiency.


I. INTRODUCTION
Endoscopic surgery has been increasingly practiced in nasal surgery in recent years because it causes less trauma and allows a quicker recovery [1]-[3], and the number of nasal surgery videos has grown continuously. These videos provide a valuable basis for documentation, training of young surgeons [4], medical research [5] and analytics in healthcare [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Usually, a complete endoscopic surgical video is recorded from the beginning of the operation to the end. Not only the surgical operation fragments are preserved; fragments unrelated to the operation, such as those in which the endoscope lens is covered with blood stains, the lens is defocused during movement, or the lens is being cleaned, are also retained. However, doctors only need the valid video clips after the surgery, so the video has to be edited. Manual editing is difficult and time consuming for doctors, and asking a third-party agency, such as SurgiCast (https://www.surgicast.io/medical-video-editing), to edit the video is very expensive [7]. There is therefore a great opportunity for researchers to develop methods that automate the editing of endoscopic surgery videos. Semantic analysis of endoscopic surgical videos is one of the most important keys to this automation [8]. As shown in Figure 1, semantic analysis methods can identify not only the start and end points of the surgical operation but also the invalid images within the operation. Endoscopic surgery video is characterized by both continuity and discontinuity: the operation itself is a continuous process, but it is split into discrete pieces by invalid shots caused by, for example, the need to clean the endoscope lens. These interruptions occur randomly, with no regularity at all, which makes it hard to find the start and end points. Most previous research has focused on lesion detection [9], lesion segmentation [10], and lesion diagnosis [11], all of which are performed on single frames.
On the other hand, there have been studies on the classification of gynecological organs and on the recognition of eight customized surgical operations in abdominal surgery video [12]. However, there are relatively few studies on the semantic analysis of nasal surgery videos [13]. In the field of natural scene video, popular methods usually use a fixed time interval [14] or a sliding window [15] to generate candidate proposals and perform semantic analysis. These methods are not very effective for endoscopic surgery video because of its random discontinuity.
In this article, we propose a new framework for semantic analysis of endoscopic surgery videos based on a deep neural network, called the Multi-granular Hierarchical Network (MHN), as shown in Figure 2. First, a four-class classification is performed on n successive key frames using end-to-end spatial-temporal feature modeling. After a preliminary prediction sequence is obtained, a finer-grained correction is applied by a hierarchical self-correction module. Finally, the surgical operation is marked automatically and the editing is completed automatically. From inputting the original video into the network to outputting the edited video of valid surgical footage, the whole process requires no human participation; it is fully automatic.
In summary, the key contributions of our work include:
• This work provides the first semantic analysis of nasal endoscopic surgery video using a deep learning method. We propose a framework that automatically detects non-surgical operations in endoscopic surgery video.
• We perform semantic analysis of endoscopic surgery video with a modeling scheme that combines multi-granular spatial-temporal features.
• We propose a hierarchical self-correction module, from coarse to fine, to improve the accuracy of surgical video semantic analysis.
• Compared with state-of-the-art methods [16], [17], our method further improves the accuracy on our dataset to 89%.
The rest of this article is organized as follows. In Section II, some relevant works are reviewed. In Section III, we describe the details of our proposed approach. In Section IV, we present the experiments and results. Finally, we present our concluding remarks in Section V.

II. RELATED WORK
A. ENDOSCOPIC IMAGE PROCESSING
FIGURE 2. An overview of the semantic analysis framework. We first extract keyframes from the video and input successive t keyframes into a CNN that incorporates the attention module. Then, the feature map extracted from the CNN is input into the LSTM module for sequential learning. Finally, the results of the previous step are entered into a hierarchical self-correction module for more precise semantic analysis.
Deep learning has been widely applied to natural image tasks such as image classification [18], image segmentation [19], object detection [20] and so on. Despite the difference between natural and medical images, deep learning has been brought into endoscopic image processing and has shown impressive performance on polyp recognition [21], bleeding detection [22], and polyp classification [23]. At the same time, deep learning has also made great progress in solving specific problems in the field of medical imaging. For example, deep learning has been used to study deformable registration methods for medical images [24], heuristic algorithms have been used to detect respiratory diseases from medical images [25], and a bacterial recognition model has been built from the regional covariance of a convolutional neural network [26]. Ibtehaz and Rahman [27] used the MultiResUNet network to segment multi-peak medical images. Further, the semantic analysis of surgical video based on deep learning has gradually gained the attention of researchers [13]. For example, Twinanda et al. [28] used a CNN to extract image features from laparoscopic cholecystectomy video, transferring the pre-trained AlexNet model to the medical field in combination with a hidden Markov model, and obtained a single-frame recognition rate of 92.2%. Petscharnig and Schöffmann [12] used CNN and support vector machine models to identify eight customized surgical operations in abdominal surgery video. These works are based on single frames of the surgical video. Although a CNN effectively improves the ability to express features, processing a video stream as single frames tends to ignore hidden features in nasal endoscopic surgery videos, which makes it difficult to improve the accuracy of nasal endoscopic surgery video analysis.

B. VIDEO SEMANTIC ANALYSIS
Although few semantic analysis methods have been developed for medical videos, many new methods exist for natural scene video. Action recognition and temporal action detection are two important branches of video semantic analysis and have been extensively studied [29]-[32].
Action recognition models can be used to extract summary-level visual features from untrimmed video. Action recognition has been extensively studied in the past few years [29]-[33]. Earlier methods are mostly based on hand-crafted visual features such as HOF, HOG and MBH [33]. In recent years, two-stream networks [29], [30], [34] and C3D networks [31], [32], [35] have been used to learn appearance and motion features. Typically, a two-stream network learns appearance and motion features separately, from RGB frames and the optical flow field. For example, Lin et al. [30] proposed a Boundary Sensitive Network (BSN), which used two sub-networks (a spatial network and a temporal network) to encode video information. Because this kind of method models the spatial and temporal features of video separately, it easily ignores the relevance between them, and this defect has gradually been exposed in many tasks. A C3D network adopts 3D convolutional layers to capture appearance and motion features directly from the original frames. For example, Xu et al. [32] introduced a spatial-temporal feature-preserving filter into a C3D network to maximize the resolution of the video in the time dimension, which effectively improved the accuracy of frame-by-frame video recognition. However, a 3D network has higher requirements on data and hardware and is more difficult to train. In addition, it often performs poorly on scenes with frequent shot switching. Although the 3D convolutional network overcame the shortcomings of the two-stream convolutional network, it did not significantly improve the performance of video content parsing tasks.
The temporal action detection task aims to detect action instances in untrimmed videos, including their temporal boundaries and action classes, and can be divided into proposal and classification stages. Earlier works [36] directly used sliding windows for proposal generation. More recently, some methods [14], [37] generated proposals with pre-defined temporal durations and intervals and used various techniques to evaluate the confidence score of the proposals, such as dictionary learning [14] and recurrent neural networks [37]. These two kinds of methods work well for semantic analysis of natural videos with continuous features, especially for standard predefined actions. However, because of the discontinuity of endoscopic video, they have major disadvantages: (1) they are usually not temporally precise, whereas surgical video requires more precise positioning; (2) fixed pre-defined temporal durations and intervals are not suitable for randomly occurring invalid images.
At the same time, the Visual Question Answering (VQA) task is also a new field that involves action detection. However, VQA must not only perform action detection but also understand text, and it works on single-frame images, so its methods are not applicable to our work.
Compared with these methods, our multi-granularity semantic analysis method is superior in two aspects: (1) coarse-grained analysis overcomes the random discontinuity of endoscopic surgery video; (2) fine-grained hierarchical self-correction locates the boundaries of surgical operations more accurately.

III. MULTI-GRANULAR SEMANTIC ANALYSIS
The coexistence of continuity and discontinuity in endoscopic surgery video makes its semantic analysis difficult. To solve this problem, we propose MHN, a framework that combines spatial and temporal features, as shown in Figure 2. From input to output, no human intervention is required: the original video is fed to the model, and the output retains only the meaningful video clips. First, a coarse-grained semantic analysis is performed on the combined spatial-temporal features. It combines the spatial and temporal modeling of CNN and RNN networks and introduces an attention mechanism into the spatial network to enhance the learning of spatial features, thereby ensuring the accuracy of the analysis while taking its timeliness into account. Second, a hierarchical correction of the coarse-grained results provides more precise positioning of surgical action boundaries based on the timing relationship.

A. DATA DEFINITION
An untrimmed video is a sequence of frames. A key frame, or I-frame, is defined as a single frame of digital content that the compressor examines independently of the frames that precede and follow it and that stores all of the data needed. The video sequence is denoted as $X = \{x_n\}_{n=1}^{k}$, where $x_n$ is the nth key frame in X. In our work, we extract a key frame every fifteen original frames and mark the time point and label of each key frame. The nasal endoscopic images were assigned four pre-defined labels, $Y = \{\text{In}, \text{Out}, \text{Fuzzy-In}, \text{Fuzzy-Out}\}$, by medical professionals. As shown in Figure 3, our data samples have large differences within categories and small differences between categories, which makes semantic analysis difficult. In particular, there are big differences not only between the In category and the Fuzzy-In category but also between the Out category and the Fuzzy-Out category. In general, after semantic analysis of the nasal surgery video, only the In category shots are kept during editing. However, Fuzzy-In category shots are sometimes retained to maintain the continuity of the surgical video.
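The sampling rule described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `key_frame_indices` and `timestamps`, the zero-based frame indexing and the frame rate of 25 fps are assumptions made for the example.

```python
def key_frame_indices(total_frames, step=15):
    """Return indices of the frames kept as key frames (every `step`-th frame)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, total_frames, step))


def timestamps(indices, fps=25.0):
    """Map frame indices to time points in seconds (fps is an assumed value)."""
    return [i / fps for i in indices]


# The four pre-defined categories Y.
LABELS = ("In", "Out", "Fuzzy-In", "Fuzzy-Out")

idx = key_frame_indices(100)   # 100 original frames -> 7 key frames
times = timestamps(idx)        # time point of each key frame, in seconds
```

Each key frame then carries a time point from `times` and one label from `LABELS`.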

B. COARSE-GRAINED ANALYSIS ON SPATIAL-TEMPORAL FEATURES
As shown in Figure 4, after the key frames of the surgical video are extracted, a CNN is used to learn spatial features. An attention mechanism is introduced to perform feature tracking, given the particular nature of endoscopic images. The temporal characteristics are then learned by a Recurrent Neural Network (RNN), and a coarse-grained sequence is generated as the result.
We use ResNet-50, a deep neural network architecture developed by He et al. [38], as our spatial feature extractor. ResNet is a deep CNN architecture containing residual learning blocks that address the degradation problem encountered when training very deep networks. The output of each block of ResNet-50 has half the spatial resolution of the previous block. We tested various settings for the feature extractor, including the deeper ResNet-101, different convolutional neural network designs, and upsampling the image to a width of 512. The results were similar, so the simpler and more computationally efficient setting was chosen.
Endoscopic images differ distinctly from other images. Because endoscopes of different specifications have different shapes, the final image area is an irregular polygon or a circle. As shown in Figure 5, the shape of the endoscope lens also differs between manufacturers. There are many studies on the analysis of irregularly shaped endoscopic images. For example, related work detects the circular area of endoscopic video [39], and the MultiResUNet [27] network is used to handle the different scales of medical images. In our work, we introduce the SENet [40] attention module into the spatial network to track the effective information of the image. SENet automatically learns the importance of each feature channel and then uses this importance to enhance useful features and suppress features that are less useful for the current task. As shown in Figure 4, the convolutional transformation of the CNN yields a feature map with C channels, each of size H × W, which is the input of the attention module. First, in the spatial dimension, a global average pooling layer transforms each two-dimensional feature channel into a real number. This real number characterizes the global receptive field of the channel, and the output dimension equals the number of input feature channels. Then, a bottleneck structure formed by two fully connected layers models the correlation between channels and outputs as many weights as there are input channels. The first fully connected layer reduces the feature dimension to 1/16 of the input; after ReLU activation, the second fully connected layer restores the original dimension. Compared with using a single fully connected layer directly, this design has more nonlinearity and can better fit the complex correlation between channels.
It also greatly reduces the number of parameters and computations. A sigmoid activation function is then used to obtain normalized weights between 0 and 1, and finally a scale operation applies the normalized weights to the features of each channel:
$$ s = \delta\left(W_2\,\sigma\left(W_1 z\right)\right) \tag{1} $$
In Eq. 1, $s$ is the weight sequence output by the attention module, $\sigma$ is the ReLU activation function, $\delta$ is the sigmoid activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, $r$ is the reduction ratio (16, given the reduction to 1/16 of the input), and $z$ is the vector of real numbers obtained by the global average pooling layer.
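The squeeze, excitation and scale steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `se_recalibrate`), the nested-list layout and the weight values are hypothetical, and the example uses 2 channels with a single hidden unit instead of the 1/16 reduction used in the paper.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def se_recalibrate(feature_map, W1, W2):
    """Squeeze-and-excitation on a C x H x W feature map (nested lists).

    W1 has shape (C/r) x C and W2 has shape C x (C/r).
    Returns the channel weights s and the rescaled feature map.
    """
    # Squeeze: global average pooling turns each channel into one real number.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: reduce, ReLU, restore, sigmoid -> weights in (0, 1).
    hidden = [relu(h) for h in matvec(W1, z)]
    s = [sigmoid(a) for a in matvec(W2, hidden)]
    # Scale: multiply every value of channel c by its weight s_c.
    scaled = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, feature_map)]
    return s, scaled


# Toy example: C = 2 channels of size 2 x 2, reduced to one hidden unit.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
W1 = [[0.5, 0.5]]       # shape (C/r) x C = 1 x 2, hypothetical weights
W2 = [[1.0], [-1.0]]    # shape C x (C/r) = 2 x 1, hypothetical weights
weights, out = se_recalibrate(fmap, W1, W2)
```

In this toy case the first channel is amplified relative to the second because its learned weight is larger; in a trained network the weights would come from backpropagation rather than being set by hand.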
Compared with methods that detect the circular area, the attention mechanism is more general, and it avoids having to segment the image and extract the effective area before processing.
Since our task is semantic analysis of endoscopic video, the key frames extracted from the original video carry not only spatial features; the temporal characteristics of consecutive key frames are also what we want to learn. When the video is continuous, a key frame can be judged roughly from the preceding and succeeding frames. We use an RNN, which can learn patterns and long-term dependencies from sequential data, to capture global information. LSTM [41] is a type of RNN architecture that stores information in a cell state and can therefore predict the category of a key frame based on the relationship between consecutive frames. LSTM has selective memory: it controls the transmitted state through gates, remembering useful information for a long time and ignoring unnecessary information. As shown in Eq. 2, $z^f$ is the forget gate, which controls which parts of the previous memory cell are retained and which are forgotten, and $z^i$ is the input (selective memory) gate, which selects the important information to record.
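For reference, a commonly used formulation of LSTM gating in this notation is given below. It is a sketch under assumptions, not a reproduction of the paper's Eq. 2: the candidate input $z$, the output gate $z^{o}$, the weight matrices $W, W^{i}, W^{f}, W^{o}$, the cell state $c^{t}$ and the hidden state $h^{t}$ are standard LSTM symbols introduced here for illustration, bias terms are omitted, and $\mathrm{sig}$ denotes the logistic sigmoid.

```latex
\begin{aligned}
z     &= \tanh\!\left(W\,[x^{t};\,h^{t-1}]\right), \\
z^{i} &= \mathrm{sig}\!\left(W^{i}\,[x^{t};\,h^{t-1}]\right), \\
z^{f} &= \mathrm{sig}\!\left(W^{f}\,[x^{t};\,h^{t-1}]\right), \\
z^{o} &= \mathrm{sig}\!\left(W^{o}\,[x^{t};\,h^{t-1}]\right), \\
c^{t} &= z^{f} \odot c^{t-1} + z^{i} \odot z, \\
h^{t} &= z^{o} \odot \tanh\!\left(c^{t}\right).
\end{aligned}
```

Here $z^{f}$ scales the previous cell state (forgetting) and $z^{i}$ scales the new candidate (selective memory), which matches the roles described for the two gates.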
We use the CNN to extract local features and the LSTM network to model the timing relationship of consecutive frames. The combination of the two networks takes into account both the spatial characteristics of each key frame image and the temporal characteristics of consecutive key frames. As shown in Figure 4, we use the LSTM module to model consecutive key frames in order to obtain more temporal feature information from the endoscopic video. The LSTM module is added after the CNN with the attention module. Each training step takes n consecutive key frames as input and outputs n classification results. In our work, we first trained the ResNet-50 model with the attention module added. On this basis, we removed the last fully connected layer and fine-tuned the network to obtain 512-dimensional output features, which are used as the input of a unidirectional LSTM network. The LSTM network has 512 neurons and 5 time steps, so the input of the CNN is a unit composed of 5 consecutive key frames. After the LSTM module, the predicted key frame category is output through a fully connected layer with 4 neurons, corresponding to the four key frame categories.
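The grouping of key frames into units of 5 can be sketched as follows. The text does not state whether consecutive units overlap or how a trailing incomplete unit is handled, so this sketch assumes non-overlapping units and drops the remainder; the function name `make_units` is hypothetical.

```python
def make_units(key_frames, n=5):
    """Group key frames into consecutive, non-overlapping units of n frames.

    A trailing group shorter than n is dropped (an assumption of this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    usable = len(key_frames) - len(key_frames) % n
    return [key_frames[i:i + n] for i in range(0, usable, n)]


# 12 key frames (represented by their indices) give two units of 5 frames.
units = make_units(list(range(12)), n=5)
```

Each unit would then pass through the CNN frame by frame, giving 5 feature vectors of dimension 512 that feed the 5 time steps of the LSTM, which outputs one of the 4 categories per frame.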

C. FINE-GRAINED SEMANTIC ANALYSIS ON TIME SERIES RELATIONSHIP
In coarse-grained analysis, it is not sufficiently accurate to use keyframes as sample sequences for surgical operation boundary localization. In order to make the boundaries of the cropped effective video clips more precise, we proposed a fine-grained hierarchical self-correction module to solve this problem.
In the coarse-grained module, the CNN model analyzes the key frame image sequence and produces a preliminary result sequence. In this result sequence, we look for pairs of adjacent key frames with different predicted categories. For such a pair, the earlier frame can be regarded as the end point of the previous candidate video and the later frame as the start point of the next candidate video. At the coarse granularity, key frames are used to balance the timeliness and accuracy of video analysis. To improve the accuracy of the video clip boundaries, we perform a hierarchical analysis of the original frames at the beginning and end of each candidate video. As shown in Figure 6 (a), for adjacent P1 and P2, the original frames between the two key frames at their boundary are extracted to reconfirm the boundary frame result. Figure 6 (b) assumes that, after the original frames are classified, the results for the start and end points of the P2 candidate segment have changed. The key frame sequence is then updated with the correction result, and the updated sequence yields new candidate segments. The fine-grained hierarchical correction has three layers in total; in the figure, the step from L_0 to L_1 is the first layer of correction. The specific correction process is shown in Figure 7. We assume that there are K original frames between the nth key frame and the (n+1)th key frame. Each layer samples K/M frames at intervals of M frames. The sampled original frames are classified by the spatial feature model. If the count of the most frequent category among the sampled frames exceeds the threshold µ, we regard that category as a valid result, use it to update the result of the nth key frame, and end the hierarchical analysis. Otherwise, the sampling interval is updated by subtracting N from M, so that more original frames are sampled, and the analysis continues. Finally, if M has been reduced to 1 and the count of the most frequent category still does not exceed the threshold, we end the fine-grained module and retain the original result.
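The boundary search and the layered correction described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the per-frame predictions of the spatial model are passed in as a precomputed list, the threshold µ is interpreted as a share of the sampled frames (the text describes it as a count), and the interval is clamped at 1.

```python
from collections import Counter


def find_boundaries(labels):
    """Indices n where key frame n and key frame n + 1 have different labels."""
    return [i for i in range(len(labels) - 1) if labels[i] != labels[i + 1]]


def correct_key_frame(original_labels, coarse_label, m=5, n_step=2, mu=0.7):
    """Hierarchical self-correction for one boundary key frame.

    original_labels: spatial-model predictions for the K original frames
        between key frame n and key frame n + 1.
    m: initial sampling interval M; n_step: amount N subtracted per layer
        (both values are hypothetical defaults).
    mu: threshold on the share of sampled frames (assumed to be a ratio).
    Returns the (possibly updated) label and the number of layers used.
    """
    if not original_labels:
        return coarse_label, 0
    layers = 0
    while m >= 1:
        layers += 1
        sampled = original_labels[::m]           # about K / M frames
        label, count = Counter(sampled).most_common(1)[0]
        if count / len(sampled) > mu:
            return label, layers                 # valid result: update and stop
        if m == 1:
            break                                # densest sampling reached
        m = max(1, m - n_step)                   # sample more frames next layer
    return coarse_label, layers                  # keep the original result


boundaries = find_boundaries(["In", "In", "Out", "Out", "In"])
clear = correct_key_frame(["In"] * 14, "Out")
mixed = correct_key_frame(["Out"] * 6 + ["In"] * 8, "Out")
```

With M = 5 and N = 2 the interval goes 5, 3, 1, giving the three layers mentioned in the text. A clear run of frames is resolved in the first layer, while a mixed run falls through all three layers and keeps the coarse-grained result.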

IV. EXPERIMENT
A. EXPERIMENTAL SETTINGS
1) DATASETS
Our dataset consists of key frames extracted from 2 hours of clinical nasal endoscopic surgery video from the First Affiliated Hospital of Xi'an Jiaotong University, China. There are 17783 images in total, each labeled with one of the four shot categories. The number of images in each category is given in Table 1. The training, validation and test sets are randomly split in a ratio of about 7:2:1. The surgical videos come from 12 different nasal surgeries. The endoscope model is the Olympus ENF-T3 with a resolution of 720 × 576, and images are preprocessed to 224 × 224 before being input to the model.
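A random split in a ratio of about 7:2:1 can be sketched as follows. The function name `split_dataset`, the fixed seed and the rounding rule are assumptions for illustration; the resulting set sizes are those of this sketch and are not taken from Table 1.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and split them into training, validation and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)       # seeded for reproducibility
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    # The test set takes whatever remains after the first two parts.
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


# 17783 images, represented here by their indices.
train_set, val_set, test_set = split_dataset(range(17783))
```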

2) IMPLEMENTATION DETAILS
Optimization was performed using synchronous SGD with a momentum of 0.9, a learning rate of 0.001 and a decay of 0.0001. The entire experiment was implemented in Python 3.5 on TensorFlow 1.18.0 and run on a machine with two 12 GB Nvidia Tesla K80 GPUs, with a batch size of 16 for 100 epochs.
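A single SGD-with-momentum update using these hyperparameters can be sketched as follows. The text does not say whether "decay" means learning-rate decay or weight decay; this sketch assumes a time-based learning-rate decay, lr_t = lr / (1 + decay * step), and the function name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(w, v, grad, step, lr=0.001, momentum=0.9, decay=0.0001):
    """One SGD update with momentum and time-based learning-rate decay."""
    lr_t = lr / (1.0 + decay * step)    # assumed interpretation of "decay"
    v_new = [momentum * vi - lr_t * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy problem: minimise f(w) = w^2, whose gradient is 2w, starting from w = 1.
w, v = [1.0], [0.0]
for t in range(2000):
    w, v = sgd_momentum_step(w, v, [2.0 * w[0]], t)
```

On this toy problem the parameter moves steadily toward the minimum at 0; in training, `grad` would be the mini-batch gradient over 16 samples.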

3) EVALUATION METRICS
The performance of the model is measured quantitatively using four commonly used metrics:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
where TP, TN, FP and FN denote the number of true-positive, true-negative, false-positive and false-negative detection results, respectively. Recall reflects the classification model's ability to identify positive samples: the higher the recall, the stronger that ability. Precision reflects the model's ability to distinguish negative samples: the higher the precision, the stronger that ability. The higher the F1-score is, the more robust the classification model is.
[...] increased after the attention module is added to the backbone. More attention is given to valid areas. Secondly, we compared the ROC curves of the backbone and the coarse-grained model. As shown in Figure 9, the coarse-grained model performed significantly better than the backbone.
In addition, we compared using only the backbone with adding the LSTM module to the backbone. The results are shown in Figure 9. After the LSTM module was added to the backbone, the accuracy reached 0.82, and the weighted average accuracy, recall and F1-score reached 0.82. Compared with the backbone, the accuracy improved by 7%. We also compared the evaluation results of each category. The improvement in the Out and Fuzzy-In categories was more obvious than in the Fuzzy-Out category. These results show that, for time-series tasks, a CNN performs semantic analysis by learning spatial features and obtains good results; on this basis, further learning the timing characteristics through the LSTM module improves classification performance.
Finally, as shown in Table 2, the coarse-grained model improves on the backbone. The accuracy reached 0.85, an increase of 10%. As shown in Table 3, for the Fuzzy-In category the accuracy of the coarse-grained model reaches 0.79, which is 47% higher than that of the backbone, while the F1-score reaches 0.70, an increase of 23%. For the Fuzzy-Out category, the accuracy of the coarse-grained model reaches 0.94, 44% higher than the backbone; the recall is 10% higher and the F1-score is 17% higher. Comparison with the network in which only the LSTM module is added to the backbone demonstrates the effect of the attention module: accuracy increased by 3%, precision by 5%, recall by 4%, and F1-score by 1%. However, the recall and F1-score are still not satisfactory. We therefore added the fine-grained module to MHN; the results are as follows.

2) FINE-GRAINED RESULTS ANALYSIS
Further, we analyzed the results of the fine-grained model. From Table 2, we conclude that the macro average and weighted average results of our method are better than those of the backbone and the coarse-grained model. In particular, the recall and F1-score are significantly improved, which means that our method is more stable. MHN's weighted average precision, recall, and F1-score all reached 0.89. Compared with the coarse-grained model, precision increased by 2%, recall by 3%, and F1-score by 6%. Moreover, the accuracy is 14% higher than that of the backbone and 4% higher than that of the coarse-grained model. The per-category comparison in Table 3 suggests that MHN has good classification accuracy and good stability in each category. In particular, recall in the Fuzzy-Out category is 40% higher than for the coarse-grained model, and the F1-score is 35% higher. In Figure 10, the left panel is the confusion matrix of the coarse-grained model and the right panel is the confusion matrix of the fine-grained model. The fine-grained model clearly improves the Fuzzy-In and Out categories. In addition, we analyzed the effect of the fine-grained correction. A successful correction means that the start and end points of a valid surgical video are more accurate and that the edited video plays more smoothly. A failed correction is a case in which the judgment error of the coarse-grained model cannot be corrected. On the test data, a total of 97 corrections were performed, of which 69 were successful, a correction success rate of 71%.
Finally, as shown in Table 4, we compared the accuracy, total number of model parameters, and model processing time with several state-of-the-art methods. Our model had significantly higher accuracy than other models that learn only spatial features. The final accuracy reached 0.8927, an 8% improvement over Xception. The total number of parameters did not increase remarkably, and the processing speed was within an acceptable range. Although our model has more parameters, its processing time was shorter than that of InceptionV3 and Xception, mainly because both InceptionV3 and Xception spend more time on image pre-processing.
As shown in Figure 11, we also compared the ROC curves of these methods. The ROC chart clearly reflects the performance advantage of our method for video analysis of nasal endoscopic surgery.
These results suggest the effectiveness of MHN. MHN achieves this performance for three reasons: (1) the attention module pays more attention to the effective image area in irregular endoscopic images; (2) the LSTM models consecutive key frames to better capture the timing information in the endoscopic video; (3) the self-correction module judges the boundaries of the surgical operation more accurately.

V. CONCLUSION
In this article, we present a framework for semantic analysis of nasal endoscopic video. Our method can accurately and efficiently identify the surgical operation parts of a nasal endoscopic surgery video and remove the blurred frames. In experiments, we demonstrate that feature learning that combines spatial and temporal features is better than spatial learning alone. Moreover, the hierarchical self-correction from coarse to fine further improves the accuracy of semantic analysis for nasal endoscopic video, and this hierarchical structure greatly improves the efficiency.