Collaborative Multi-Metadata Fusion to Improve the Classification of Lumbar Disc Herniation

Computed tomography (CT) is the most commonly used radiographic imaging modality for detecting and diagnosing lumbar diseases. Despite many outstanding advances, computer-aided diagnosis (CAD) of lumbar disc disease remains challenging due to the complexity of pathological abnormalities and poor discrimination between different lesions. We therefore propose a Collaborative Multi-Metadata Fusion classification network (CMMF-Net) to address these challenges. The network consists of a feature selection model and a classification model. We propose a novel Multi-scale Feature Fusion (MFF) module that improves the network's ability to learn the edges of the region of interest (ROI) by fusing features of different scales and dimensions. We also propose a new loss function to improve the convergence of the network to the internal and external edges of the intervertebral disc. Subsequently, we use the ROI bounding box from the feature selection model to crop the original image and calculate the distance feature matrix. We then concatenate the cropped CT images, multi-scale fusion features, and distance feature matrices and input them into the classification network. Next, the model outputs the classification results and the class activation map (CAM). Finally, the CAM at the original image size is returned to the feature selection network during upsampling to achieve collaborative model training. Extensive experiments demonstrate the effectiveness of our method. The model achieves 91.32% accuracy in the lumbar spine disease classification task. In the labelled lumbar disc segmentation task, the Dice coefficient reaches 94.39%. The classification accuracy on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset reaches 91.82%.

As a common chronic disease in today's society, lumbar disc herniation seriously affects patients' productivity and quality of life [4], [5]. In clinical practice, the classification of lumbar disc herniation is crucial for treatment but extremely difficult. Different types of intervertebral disc herniation often require different treatment processes. Specifically, relatively mild types such as bulging can often be relieved or cured with conservative treatment, while more serious types such as protrusion, prolapse, migration, and Schmorl's nodes require surgical treatment [7], [8], [9]. However, as shown in Fig. 1, the structural similarity between different types makes it challenging for inexperienced clinicians to classify them correctly. Therefore, accurate and effective classification of the types and early stages of intervertebral disc herniation is a crucial and urgent problem.
In the medical imaging community, automatic disease diagnosis is widely explored and applied in various practical computer-aided medical systems [9]. Current computer-aided medical tasks mainly include lesion segmentation, disease classification, image reconstruction, image super-resolution, and other fields. Disease classification [10], [12], [13], [15] and pixel-level lesion segmentation [11], [14], [16] are the two main fundamental problems in medical image processing. Disease classification aims to predict class labels for disease severity, while segmentation aims to solve the more fine-grained pixel-level lesion identification task. For example, in [37], a knowledge-guided collaborative deep learning approach was employed to address the lung nodule classification task, achieving an accuracy of 91.63%. U-Net, U²-Net [52], and nnU-Net [20] have achieved excellent results in medical image segmentation. In particular, nnU-Net has shown outstanding transferability, delivering impressive outcomes across various lesion locations and thus providing a reference for image preprocessing in numerous studies. These two tasks are usually studied independently in most previous literature, but accurate lesion segmentation results are of great help in classifying disease types, and classification information is likewise beneficial to the segmentation task. Therefore, collaborative learning of segmentation and classification can be used in scenarios where both tasks are improved simultaneously and auxiliary tasks complement the main task [18]. Based on this, our work aims to build a unified architecture that strengthens the link between disease classification and lesion segmentation, treating classification as the main task. In addition, facing the challenges of blurred lesion boundaries and strong noise in medical images, we propose strategies to enhance the learning of image structural information and semantic features. Finally, by
designing a new method, we achieve high-precision classification of lumbar disc herniation, thereby helping doctors diagnose the disease. To validate the model's generalization capability, we also conducted tests on a publicly available lung nodule dataset and achieved an accuracy of 91.82%. This demonstrates that our method exhibits excellent classification performance across different medical datasets.
We propose a Collaborative Multi-Metadata Fusion classification network (CMMF-Net) for lumbar disc herniation. The network consists of a feature selection model and a classification model. The feature selection model uses a classic encoder-decoder framework to perform supervised segmentation, and we enhance its ability to learn strip and plane structures by adding series-connected strip pooling and plane pooling to the skip connections. Furthermore, we incorporate the proposed Multi-scale Feature Fusion (MFF) module, which combines multi-scale features and improves the model's cross-scale learning ability. To address poor segmentation of the disc edge and internal damage regions, we also introduce a new hybrid loss function with area weighting to constrain the network. In the classification network, we calculate a distance feature matrix from the segmentation results to capture the geometric features of the intervertebral discs. The network outputs the classification results of lumbar diseases and the CAM, and we feed the CAM back as ROI class saliency information into the feature selection network's upsampling process to account for the location-sensitive nature of segmentation. By repeating the above steps, we achieve collaborative optimization and jointly improve accuracy.
Our contributions are summarized as follows: • We propose a classification network called CMMF-Net.
The network consists of a feature selection model and a classification model, which cooperate and promote each other's learning ability. We conducted tests on both a private lumbar disc herniation dataset and a publicly available lung nodule dataset and achieved favorable results in both cases.
• In the feature selection network, we propose an MFF module to fuse multi-scale features and improve the performance of the classification network. Moreover, we use the segmentation mask to obtain the distance feature matrix, which helps the classification task learn better lumbar disc geometry and thus further improves classification accuracy.
• We present a new hybrid loss optimization for feature selection, which not only optimizes the learning of the outer edges of the ROI but also constrains the inner edges of the lumbar disc due to hollowing.

II. RELATED WORK
We found few medical image processing studies on lumbar disc data in the literature. Therefore, we introduce recent mainstream medical image segmentation and classification methods related to our task.
A. Medical Image Segmentation

To aid in surgical planning, Khandelwal et al. [24] proposed a clinically applicable geometric flow-based method for segmenting the human spine from CT scans. Alalwan et al. [19] proposed 3D-DenseUNet-569, a fully 3D semantic segmentation model with a deeper network and fewer trainable parameters; the model adopted Depthwise Separable Convolution (DS-Conv) in place of traditional convolution. Al Arif et al. [25] proposed a novel deep probabilistic spatial regression network to localize vertebra centers. Al Arif et al. [26] proposed a deep learning-based fully automatic framework for segmentation of cervical vertebrae in X-ray images. The framework first localized the spinal region using a deep fully convolutional neural network, then localized vertebra centers using a novel deep probabilistic spatial regression network, and finally segmented the vertebrae with a novel shape-aware deep segmentation network, achieving a Dice similarity coefficient of 0.84 and a shape error of 1.69 mm. Hammernik et al. [27] proposed a total variation (TV) based framework that incorporated an a priori model, a vertebral mean shape, image intensity, and edge information; the algorithm was evaluated using leave-one-out cross-validation on the CSI MICCAI 2014 spine and vertebrae segmentation challenge dataset. nnU-Net (no-new-UNet) [20] was proposed to automatically configure the preprocessing, network architecture, training, inference, and post-processing for a given medical image segmentation dataset based on the encoder-decoder structure of U-Net. Without manual intervention, nnU-Net surpassed most existing approaches and achieved state-of-the-art performance in several fully supervised medical image segmentation tasks.

B. Medical Image Classification

1) Deep Learning-Based Medical Image Classification:
DCNN models provided a unified feature extraction and classification framework that enabled users to avoid the troublesome process of extracting hand-crafted features for medical image classification.
Shen et al. [33] proposed using multi-scale features to learn lung nodule classification. Kumar et al. [34] proposed a deep convolutional neural network based on CT images for lung nodule classification. Naik et al. [35] combined manual features with deep features extracted by ResNet-50 for the diagnosis of melanoma. Zhang et al. [36] used computationally efficient surrogate models to approximate the validation error function of hyperparameter configurations, realizing lung nodule classification based on DCNNs with hyperparameter optimization. Unlike existing surrogate models, which adopted stationary covariance functions (kernels) to measure the difference between hyperparameter points, that paper proposed a non-stationary kernel that allowed the surrogate model to adapt to functions whose smoothness varies with the spatial location of inputs. Jetley et al. [38] proposed a trainable attention estimator, on the premise that identifying important image regions and amplifying their influence while suppressing irrelevant and possibly confusing information elsewhere is beneficial. This method can be extended to mainstream classification network frameworks to improve feature extraction, but it did not fuse multi-scale images and needed more learning of global features. Bi et al. [28] proposed a multi-level segmentation method in which an early fully convolutional network (FCN) learned appearance and localization characteristics and a late FCN learned subtle features of lesion boundaries. Yuan et al. [29] proposed a 19-layer FCN optimized using Jaccard distance loss. Li et al. [30] proposed a new dense deconvolution network based on residual learning. Mirikharaji et al. [31] proposed encoding prior star shapes into a loss function to ensure global structure in each segmentation result. Sarker et al. [32] proposed a robust deep encoder-decoder network to improve the accuracy of lesion boundaries. Zunair et al.
[39] proposed a classification method based on video preprocessing, which differed from traditional classification networks in its model design. The problem with this method was that data with large slice spacing did not perform well across different classification tasks. Wen et al. [17] proposed an OCT image classification model that uses DAISY [51] features to help improve the accuracy of CNNs in eye disease classification. The DAISY feature descriptor is fast and can efficiently calculate the gradient at each pixel.
2) Collaborative Learning-Based Medical Image Classification: Xie et al. [18] proposed the mutual bootstrapping deep convolutional neural networks (MB-DCNN) model for simultaneous skin lesion segmentation and classification. Yu et al. [49] presented an ensemble of multiple pre-trained ResNet-50 and VGGNet-16 models and multiple fully trained DCNNs, combined via a weighted sum of predicted probabilities. Xie et al. [37] proposed a collaborative method for lung nodule classification that learned to segment the lung nodule region from nine perspectives and then fed the segmentation results into the classification network, achieving 91.6% accuracy on the LIDC-IDRI dataset. However, this method only used the segmentation result as an auxiliary feature for classification and did not make full use of the hidden features in the segmentation network. Zhang et al. [47] proposed a novel 3D multi-attention guided multi-task learning network for simultaneous gastric tumor segmentation and lymph node classification, which made full use of the complementary information extracted from different dimensions, scales, and tasks. However, that work was a modular multi-task network that shared the model's parameters.

III. METHOD
In this section, we introduce the framework of CMMF-Net, explaining the feature selection network and the classification network in detail. Fig. 2 illustrates the flow of CMMF-Net.
In the first step, we use the labelled CT data of lumbar disc herniation to pretrain the feature selection network. We initialize the network parameters and obtain the segmentation results of the lesion region and the multi-scale feature map output by the MFF module. Then, we stop the pretraining process. In the second step, we concatenate the cropped CT images, the selected feature maps, and the distance feature matrix to form the input to the classification network. In the third step, the classification network outputs the lesion classification probability and the CAM of the lesion area. During upsampling, the CAM is transmitted to the feature selection network as class saliency information for the lesion area. By cooperating and passing information to each other, the two networks improve each other's performance. Each network is described in detail below.

A. Feature Selection Network
The main purpose of the feature selection network is to segment the lesion area and extract fusion features. It uses nnU-Net as the backbone, modified by adding the MFF module and the strip and plane pooling modules to the skip connections and by adding a global boundary-constrained loss function. The backbone consists of six downsampling layers, six skip connections, and six upsampling layers.
Spatial pooling effectively captures long-range context information for pixel-level prediction tasks such as scene parsing. In this task, the shape of intervertebral discs is irregular, and pooling layers of different shapes are required to learn semantic information. Therefore, we introduce a pooling strategy combining strip pooling and plane pooling. This method differs from traditional pooling layers and has lean, multi-angle characteristics. The specific design is shown in Fig. 3 and Fig. 4. Taking plane pooling as an example, we apply it in three directions to extract input features from different angles and finally fuse them to obtain the feature map. The fused features are adjusted in dimension by a 1 × 1 × C convolution, and a sigmoid function is used to obtain the feature weights. Finally, these are multiplied by the initial input features to achieve weighting.
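The pooling-and-reweighting pattern above can be sketched as a minimal 2D strip-pooling gate (the paper additionally applies plane pooling in three directions on 3D features; the layer shapes below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """Minimal 2D strip-pooling gate: pool along one spatial axis at a
    time, re-expand, fuse, and re-weight the input features."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # H x 1 vertical strips
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # 1 x W horizontal strips
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, 1)   # 1 x 1 fusion conv

    def forward(self, x):
        _, _, h, w = x.shape
        sh = self.conv_h(self.pool_h(x)).expand(-1, -1, h, w)
        sw = self.conv_w(self.pool_w(x)).expand(-1, -1, h, w)
        weight = torch.sigmoid(self.fuse(sh + sw))     # sigmoid feature weights
        return x * weight                              # weighted input features
```

The strip pools give the block an elongated receptive field suited to thin, irregular disc structures that square pooling windows capture poorly.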
Furthermore, we designed the MFF module to select relevant features during the skip-connection process; we refer to these as "more valuable features." The MFF module sorts the feature maps according to their importance for lesion-site learning and thus assists the subsequent classification network. We also propose a hybrid boundary loss function that considers the boundary pixels and increases the area weight of each pixel. This method converges better to the damaged area of the lumbar intervertebral disc.
The following section provides a detailed introduction to the network's components.
1) MFF Module: The MFF module is designed to fuse multi-scale image features and select relevant features to assist the segmentation task. Inspired by SENet [45], we recognize that the essence of the attention mechanism is to assign weights to the features of different channels through fully connected layers. To address the inability of traditional channel attention to learn the internal weights of each channel, we propose a new attention method. As shown in Fig. 5, the MFF module follows a specific flow of operations to fuse multiple scales of image features and select the features most relevant to segmentation.
The input features are passed through a global average pooling (GAP) layer and two fully connected layers to obtain a 1 × 1 × C feature map. The first fully connected layer has C/16 neurons and the second has C neurons; this additional nonlinear processing can fit complex relationships between channels. A sigmoid layer then yields a 1 × 1 × C weight map, and finally the input features are multiplied by this map, producing feature maps with different importance for different channels.
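The GAP → C/16 bottleneck → sigmoid gating just described can be sketched as follows (a standard SE-style gate; the exact layer configuration in the paper may differ):

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """SE-style channel gate as described above: GAP, an FC layer with
    C/16 neurons, an FC layer back to C neurons, sigmoid, re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck (C/16)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore to C
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))               # global average pooling -> (B, C)
        y = self.fc(y).view(b, c, 1, 1)      # 1 x 1 x C channel weights
        return x * y                         # channels scaled by importance
```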
The feature map U ∈ R^(c×h×w) output by a convolutional layer is a three-dimensional tensor containing depth (c), height (h), and width (w). We therefore designed the MFF module to capture valuable information from the height dimension of the feature map and enhance each feature map tensor along the height dimension.
As shown in (1), before inputting the feature map U into the MFF module, it is transposed into U^T ∈ R^(h×c×w) along the height dimension.
In (2), the MFF module squeezes U^T using GAP to obtain the feature vector Y_h ∈ R^(1×1×h).
Using a bottleneck of two fully connected layers reduces model complexity and improves generalization. The first FC layer reduces the dimensionality by a factor of r, where r is a hyperparameter, and is followed by ReLU activation. The last FC layer restores the original size.
In (4), U^T and s are multiplied element-wise to obtain the recalibrated transposed feature map.
The penalty coefficient β is then applied to Ũ after the squeeze-and-excitation step, and Û^T is restored to the R^(c×h×w) dimension.
For the feature selection network, we add the vector output of the standard SENet, Û, and average to obtain a cross-channel weighted feature U_N. For the classification network, all feature weights are sorted in descending order. In (7), N means selecting the feature map along the channel dimension, and sort_i indicates the weights sorted from large to small. The MFF in each skip-connection layer selects the four feature maps with the largest weights. Depending on the number of network layers, a total of 24 multi-scale feature maps are selected in this paper. We use a 1 × 1 × 24 convolution kernel to obtain a four-layer feature-selection map. Finally, the fused features are input into the classification network together with the distance feature matrix and the cropped CT images.
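The height-wise excitation and top-k channel selection of Eqs. (1)-(7) can be sketched roughly as below. The reduction factor, the channel-scoring rule, and the freshly initialized FC layers are illustrative assumptions, not the paper's trained weights:

```python
import torch
import torch.nn as nn

def mff_select(u: torch.Tensor, reduction: int = 4, beta: float = 0.1, k: int = 4):
    """Hedged sketch of the MFF height-wise excitation plus top-k channel
    selection; shapes and scoring are assumptions for illustration."""
    c, h, w = u.shape
    ut = u.permute(1, 0, 2)                      # transpose to height-first (H, C, W)
    y = ut.mean(dim=(1, 2))                      # squeeze by GAP -> (H,)
    fc1 = nn.Linear(h, max(h // reduction, 1))   # bottleneck FC, factor r
    fc2 = nn.Linear(max(h // reduction, 1), h)   # restore original size
    s = torch.sigmoid(fc2(torch.relu(fc1(y))))   # excitation weights in (0, 1)
    u_hat = beta * (ut * s.view(h, 1, 1))        # apply penalty coefficient beta
    u_hat = u_hat.permute(1, 0, 2)               # restore to (C, H, W)
    score = u_hat.mean(dim=(1, 2))               # rank channels by mean weight
    top = torch.argsort(score, descending=True)[:k]
    return u_hat[top]                            # keep the k largest-weight maps
```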
2) Loss Function: The total loss function L_total consists of three losses: L_Dice and the boundary constraint functions L_SSIM and L_w, which enhance convergence to the ROI boundaries and inner edges. L_SSIM is given by (8).
where μ_x, μ_y and σ_x, σ_y are the mean and variance of x and y, respectively, and σ_xy is the covariance of x and y.
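The SSIM-based boundary loss can be sketched from these statistics; using whole-map statistics rather than the usual sliding window is a simplifying assumption here:

```python
import torch

def ssim_loss(x: torch.Tensor, y: torch.Tensor, c1: float = 0.01**2, c2: float = 0.03**2):
    """Global SSIM loss, 1 - SSIM(x, y), built from the means, variances,
    and covariance defined above. c1, c2 are the usual stability constants."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x = x.var(unbiased=False)                     # sigma_x
    var_y = y.var(unbiased=False)                     # sigma_y
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()         # sigma_xy
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1 - ssim
```

For identical prediction and target the loss is 0, and it grows as structure diverges.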
We propose the L_w loss, which considers boundary segmentation constraints and the influence of area weights, defined in (9). Here, K is a hyperparameter: based on the error after each batch of forward propagation, the lesion pixels and background pixels with the largest errors are selected and sorted. X_n represents each pixel in the image and Y_n the category of each element. W_n is the area weight of a connected domain, calculated as in (10), where volume is the area of the connected domain and c is a constant, usually set to 85. According to the formula, the larger the area of a connected domain, the smaller its area weight W_n, and vice versa. P^0_ni denotes, for the n-th input image, the predicted probability of the i-th hard-to-classify background pixel, and P^1_nj the predicted probability of the j-th hard-to-classify lesion pixel; θ is a hyperparameter that controls the margin between them. During training, the model strictly enforces P^1_nj > P^0_ni + θ. This design makes the segmentation network pay more attention to hard pixels and thus learn more discriminative information.
The weighting addresses the problem that lumbar data often have incomplete internal regions. Adding a weight parameter based on the connected region effectively controls edge convergence within the internal region.
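The connected-domain area weight W_n can be computed roughly as follows. The inverse-area form c / volume is an assumption; the text only states that W_n shrinks as the domain's area grows:

```python
import numpy as np
from scipy import ndimage

def area_weights(mask: np.ndarray, c: float = 85.0):
    """Hedged sketch of the per-pixel area weight W_n from Eq. (10):
    pixels in small connected domains receive larger weights."""
    labels, n = ndimage.label(mask)                  # label connected domains
    w = np.ones_like(mask, dtype=np.float64)         # background weight = 1 (assumed)
    for i in range(1, n + 1):
        region = labels == i
        w[region] = c / region.sum()                 # larger area -> smaller W_n
    return w
```

Small fragments (such as the hollowed inner edges of a damaged disc) thus dominate the weighted loss instead of being averaged away by the large main region.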
Finally, we define the hybrid loss function of the feature selection network as (11), where λ is a hyperparameter.

B. Classification Network
The basic framework of the classification network is a 3D ResNet-18. Unlike ResNet-50, ResNet-18 differs not only in the number of residual blocks but also uses 3 × 3 convolution kernels inside each residual block, whereas ResNet-50 first reduces the number of channels with a 1 × 1 convolution and then performs a 3 × 3 convolution on the feature map. An SE channel attention module is added between layers of the network, and an additional Pyramid Scene Parsing (PSP) [50] module is added between the eighth and ninth layers to learn global features. Finally, we obtain the lumbar spine disease classification and CAM results.
In the actual data acquisition process, there may be angular bias in the data due to inconsistent patient positioning. For lumbar disc herniation, diagnosis is mainly based on the geometric changes visible in CT scans, but redundant background information caused by angular deviation can make the bounding box fit less accurately. To improve the fit of the bounding box, we present an adaptive selection strategy that differs from traditional target-area selection based on data layer partitioning. First, we perform cubic B-spline interpolation on the data to obtain a smoother middle layer. Then, we assign three coordinate planes, cross-sectional, coronal, and sagittal, to the bounding box. In our experience, for lumbar disc data collected in the same position, only the cross-sectional angle needs adjusting. We compute the volume change of the box, and when the volume reaches its minimum, we obtain the best bounding box. In the ablation experiments, we verify the effectiveness of this method.
In the actual diagnostic process, clinicians mainly observe changes in the geometric characteristics of the lesion area in CT images to preliminarily determine the lesion type. Therefore, we establish a voxel-based geometric distance feature matrix, taking the distance from the lesion boundary to the geometric centre as the measure. First, we extract the boundary of each ROI slice and establish a coordinate system for each point, providing geometric information. Second, we average the coordinates to obtain the centre point B. Third, we randomly downsample the edge voxel points (p = 200) to obtain edge feature points for each slice. Finally, we calculate the Euclidean distance from each edge voxel point to the centre B. Fig. 6 shows how the distance feature matrix is calculated from the voxel data. Edge points that are not selected are assigned, via nearest-neighbour clustering, the distance of the nearest selected point as their feature. Non-edge points (background and internal points) are given a large distance to mark them as outliers.
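The per-slice distance computation above can be sketched as follows (4-neighbour edge detection and uniform random sampling are assumptions):

```python
import numpy as np

def distance_features(mask_slice: np.ndarray, p: int = 200, seed: int = 0):
    """Sketch of the per-slice distance feature: find ROI edge voxels,
    take their centroid as centre B, sample p edge points, and return
    their Euclidean distances to B."""
    h, w = mask_slice.shape
    edge = []
    for y, x in zip(*np.nonzero(mask_slice)):
        nb = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
        # an edge voxel has an out-of-bounds or background 4-neighbour
        if any(not (0 <= j < h and 0 <= i < w) or mask_slice[j, i] == 0
               for j, i in nb):
            edge.append((y, x))
    edge = np.array(edge, dtype=np.float64)
    centre = edge.mean(axis=0)                       # centre point B
    rng = np.random.default_rng(seed)
    pick = rng.choice(len(edge), size=min(p, len(edge)), replace=False)
    return np.linalg.norm(edge[pick] - centre, axis=1)
```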
Fig. 6. The distance feature matrix, based on the distance from the geometric centre to boundary voxels, is derived from the segmentation mask. RAS denotes the patient coordinate system: Right, Anterior, and Superior.

The classification network outputs the classification results for lumbar disc herniation and the CAM. The CAM is obtained by superimposing each feature layer multiplied by its GAP weight. The process is as follows: for the feature map of the last convolutional layer, an n-dimensional vector P is obtained after GAP; a classifier is then connected, and the class with the largest probability is selected as the final output. There is a weight w_i between each unit and category in P, and the probability of being classified as a focus area is obtained. Each w then represents the influence of each unit value in P on the final classification result. As mentioned earlier, each unit corresponds to a channel of the feature map, so w also reflects the influence of each channel on the final classification; each channel of the feature map is therefore weighted and summed, and overlaying the resulting image on the original image completes the class activation mapping. Based on the bounding box information, the resulting CAM is derived from the cropped image, so we restore the CAM to the original CT image size. The CAM is transmitted as positioning information to the feature selection network's upsampling stage to participate in a new round of training.
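The channel-weighted summation that produces the CAM can be sketched as:

```python
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor, cls: int):
    """CAM as described: weight each channel of the last conv feature map
    by the classifier weight w_i for the chosen class, sum over channels."""
    # features: (C, H, W); fc_weight: (num_classes, C)
    cam = (fc_weight[cls].view(-1, 1, 1) * features).sum(dim=0)
    cam = torch.relu(cam)                                     # keep positive evidence
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1]
    return cam
```

The normalised map can then be resized back to the original CT image size (e.g. with torch.nn.functional.interpolate) before being passed to the feature selection network's upsampling stage.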

C. Implementation Details
In the proposed CMMF-Net, the feature selection network is trained on the training dataset using pixel-level labels, with a total of 300 cases used in the training phase. In the testing phase, we used 30 cases. Each case contains approximately 190 to 223 CT images, with an average of five intervertebral discs per case. To further expand the training dataset, we employ online data augmentation, including random rotation from 0 to 10 degrees, 0 to 20 pixel offsets, and horizontal and vertical flipping. In the classification network, the augmented patches are resized to 224 × 224 for training. The Adam algorithm, with batch sizes of 16 and 32 respectively, is used to optimize the segmentation and classification networks. We set the initial learning rate to 0.001, the maximum number of epochs to 500, and the hyperparameters in the hybrid loss to λ = 0.3, K = 35, θ = 0.3, c = 80, and β = 0.1. We use the validation set to monitor the performance of each network and terminate training when the network begins to overfit. The experimental environments are Ubuntu 16.04 and Windows 10; the CPU is an Intel® Core(TM) i7-10700K @ 3.8 GHz, the GPUs are two RTX 3090 Ti cards, and the IDE is PyCharm 2020.

A. Datasets
Our method is evaluated separately on two datasets: our private lumbar disc dataset and the public LIDC-IDRI dataset [48] 1 for lung nodule detection. Our private dataset is used for the segmentation and classification of lumbar disc herniation and consists of lumbar CT scans, class (23). In the second phase, each physician independently reviewed the labels of the remaining two physicians and gave a final diagnosis. The dataset generated or analyzed in this study cannot be made public due to sensitive medical information. The use of patient data was approved by the "Review and Monitoring Board" and the "Institutional Review Board." All patient data are completely anonymous, and all methods were implemented in accordance with relevant guidelines and regulations formulated by the institution. The LIDC-IDRI dataset is a chest medical image dataset that includes CT and X-ray images, along with corresponding diagnostic lesion labels. The dataset was collected by the National Cancer Institute to study early cancer detection in high-risk populations. A total of 1018 cases are included. For each image in each case, four experienced chest radiologists performed two-stage diagnostic labeling. In the first phase, each physician independently diagnosed and labeled lesion locations using three categories: nodules ≥3 mm, nodules <3 mm, and non-nodules ≥3 mm. In the second phase, each physician independently reviewed the labels of the other three physicians and gave a final diagnosis.

B. Experimental Results on Lumbar Disc Herniation Dataset
1) Comparison With Other Methods: For quantitative comparison, methods M1-M6 [18], [36], [37], [38], [39], [47] are selected. M1 is a trainable attention estimator. M2 and M3 are collaborative-training-based lung nodule classification models built on CT images. M4 is a lung nodule classification method that uses a surrogate-assisted evolutionary algorithm for hyperparameter optimization. M5 is a tuberculosis classification method based on video preprocessing. M6 is a 3D multi-attention guided multi-task learning network for automatic gastric tumor segmentation and lymph node classification.
Table I shows the comparative experiments with other 2D- and 3D-CNN-based medical image classification methods. CMMF-Net outperforms the comparison methods by +5.05%, +4.00%, +2.09%, +2.55%, +9.31%, and +1.10% under the accuracy metric, respectively. For a fair comparison, we use a uniform input image size across the comparison methods to avoid any effect of the input data on the results.
In feature selection networks, the choice of backbone is not unique. Therefore, in the experiments, we compare the classification results for lumbar intervertebral disc disease after adding different feature selection methods under current mainstream frameworks. We compare SENet and EPSANet [46] (an attention module based on multi-scale fusion). For fairness, we use the same classification network structure. The classification accuracy is shown in Table II. The results show that the feature selection module improves classification accuracy across different backbones while keeping the classification network unchanged.
2) Ablation Experiments: In this subsection, we conduct ablation experiments on the methods proposed in the feature selection network and the feature extraction modules added in the classification network. For the feature selection network, we compare the impact of the loss function L_w and its hyperparameters θ, K, and c on classification results. We also examine the impact of the hyperparameter λ on network performance, as well as the relationship between strip pooling and plane pooling, and we explore the optimal hyperparameter selection in MFF. In the classification network, we compare the effects of the input data format on the results, including adaptive bounding box selection and the distance feature matrix.
Table III compares the change in Dice Similarity Coefficient (DSC) when using different loss function strategies in the feature selection network. The performance evaluation standard of the feature selection network in this paper follows that of segmentation networks. We also examine the impact on classification accuracy. During the experiments, we keep the remaining network parameter settings fixed to ensure fairness.
From Eq. (9), we know that L_w has three hyperparameters: θ, K and c. Fig. 7 therefore shows the effect of different parameter choices on classification accuracy. There are three loss functions in L_total, of which the weight factor λ regulates L_w. As we can see from Table IV, when λ = 0.3 the DSC is highest and the network converges best to the edges. We also compare the baseline and experimental results to verify the effect of strip pooling and plane pooling, and compare the series and parallel structures. The results are shown in Table V. It is worth noting that the results in Table V are the Dice coefficients between the mask obtained from the feature selection network and the ground truth.
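The composition of L_total can be sketched as a weighted sum. The individual loss terms below are placeholders (the paper's Eq. (9) defines L_w precisely); only the role of the weight λ, best at 0.3 per Table IV, is illustrated.

```python
def total_loss(l_seg: float, l_cls: float, l_w: float, lam: float = 0.3) -> float:
    """Hypothetical composition of L_total: λ scales only the boundary loss L_w.

    l_seg and l_cls stand in for the other two loss terms; their exact
    definitions are given in the paper, not reproduced here.
    """
    return l_seg + l_cls + lam * l_w
```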
In MFF, there is also a parameter β, which scales one of the branches as a penalty factor. We select β through controlled-variable experiments. As shown in Table VI, network performance is best when β = 0.1. In the initial phase of training, the penalty factor β weakens useless information in the feature map while also affecting potentially valuable feature information; however, this effect diminishes as the model continues to learn.
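The role of β can be sketched as scaling one branch before fusion. This is an illustrative sketch only: the function name, the choice of which branch is penalized, and the additive fusion are assumptions, not the paper's exact MFF design.

```python
import numpy as np

def fuse_with_penalty(branch_feats: list, beta: float = 0.1, penalized_idx: int = -1) -> np.ndarray:
    """Fuse feature branches, scaling one branch by the penalty factor beta.

    branch_feats: list of equally shaped feature arrays (one per branch).
    The penalized branch contributes only beta times its activations.
    """
    feats = [f.copy() for f in branch_feats]
    feats[penalized_idx] = beta * feats[penalized_idx]
    return np.sum(feats, axis=0)
```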
In the classification network, the first step is to perform ablation experiments on data preprocessing. We compare the adaptive bounding box strategy with the traditional layer-based framing strategy at multiple fixed angles. As seen from Table VII, the adaptive bounding box selection strategy is superior to the fixed-angle strategy. At the same time, the baseline has some advantage over the fixed-angle strategy, since no interpolation is performed, but it is slightly inferior to the adaptive method.

C. Experimental Results on LIDC-IDRI Dataset
In order to comprehensively verify the model's ability on the classification task, we validate it on the public LIDC-IDRI dataset and give quantitative experimental results. The dataset consists of chest medical images and includes annotations of lung nodules. To provide the label information required by the feature selection network, we use a 32 × 32 bounding box to obtain all lung nodule instances. In the subsequent training process, the dataset uses the same image preprocessing as the lumbar disc herniation dataset.
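Extracting a fixed 32 × 32 patch around each annotated nodule can be sketched as follows; the function name and the zero-padding behavior at image borders are assumptions for illustration.

```python
import numpy as np

def crop_patch(image: np.ndarray, center: tuple, size: int = 32) -> np.ndarray:
    """Crop a size x size patch centered on a nodule, zero-padding at image borders."""
    half = size // 2
    cy, cx = center
    patch = np.zeros((size, size), dtype=image.dtype)
    # clip the source window to the image bounds
    y0, y1 = max(cy - half, 0), min(cy + half, image.shape[0])
    x0, x1 = max(cx - half, 0), min(cx + half, image.shape[1])
    # offset inside the patch when the window was clipped
    py0 = y0 - (cy - half)
    px0 = x0 - (cx - half)
    patch[py0:py0 + (y1 - y0), px0:px0 + (x1 - x0)] = image[y0:y1, x0:x1]
    return patch
```

Nodules near the image border still yield a full-size patch, with the out-of-bounds region filled with zeros.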
We quantitatively compare with different methods (M2, M7-M11 [40], [41], [42], [43], [44]). M2 is a collaborative-training-based lung nodule classification model for CT images. M7 utilizes a multi-crop CNN for classification on a small dataset. M8 introduces a 3D CNN to solve the nodule classification problem, but it also suffers from a small dataset and a single evaluation metric. M9 compares different types of Haralick features and finds that they represent lung nodules well, especially with 3D data; however, this method has poor robustness. M10 proposes a semi-automatic classification method, where the shape, margin, texture, and other features required for classification must be provided by the user by selecting a seed point. M11 proposes a decision-level fusion of texture, shape, and deep-model-learned information, combining the decisions of three classifiers to differentiate nodules.
Table VIII shows the comparison results on the LIDC-IDRI dataset. CMMF-Net outperforms the supervised baselines by +4.68%, +0.68%, +6.49%, +4.03%, +3.14% and +0.19% under the accuracy evaluation metric, respectively. Compared with M2, the advantage of our method is less pronounced, because M2 uses a multi-view CNN to learn representations of lung nodule regions; however, that approach favors small-sized images and has limitations.

V. DISCUSSIONS
In this network, we use the mask predicted by the feature selection model as a guide to obtain the distance feature matrix and the ROI CT image. The input of the classification network is obtained by concatenating these two with the multi-scale fusion features from the MFF. The objective is to use multi-meta information to guide lumbar disc classification. To assess the effectiveness of this strategy, we compare the lumbar disc classification performance obtained on 20 validation cases, with and without multi-meta features. The average AUC increased from 85.77% to 91.32%, which indicates that this synergistic model classifies the lumbar discs more accurately. This improvement is understandable because the cropped CT images and multi-scale fusion features obtained from the predicted lumbar disc mask, together with the distance feature matrix, allow the classification network to focus on global geometric morphological changes. We visualize the CAM obtained from the classification network in Fig. 9 to validate this explanation further. The results show that when multi-meta feature information is used, the obtained CAM is closer to the ground truth.
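The paper does not spell out the exact definition of the distance feature matrix; the sketch below assumes a Euclidean distance-to-centroid map derived from the predicted mask, normalized to [0, 1]. The function name is illustrative.

```python
import numpy as np

def distance_feature_matrix(mask: np.ndarray) -> np.ndarray:
    """Per-pixel distance from the centroid of the predicted mask, scaled to [0, 1].

    Assumption: distance-to-centroid; other variants (e.g. distance to the
    mask boundary) would be equally plausible readings of the paper.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    yy, xx = np.mgrid[0:mask.shape[0], 0:mask.shape[1]]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    return dist / dist.max()
```

The resulting matrix has the same spatial size as the ROI image, so it can be stacked channel-wise with the cropped CT image and the fused features before entering the classification network.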

A. Using Selection Feature to Boost Classification
The feature selection module is based on the idea of weight ranking, selecting the feature-layer outputs that are most important for the task. However, selecting more features to input to the classification network does not necessarily yield higher classification accuracy. In our method, the number of output channels for each skip-connection layer is four, and the multiscale fusion features are obtained by fusing the multilayer features. We also attempted three and five output channels, which changed accuracy by −1.47% and +0.14%, respectively. After weighing computational cost against performance, the method selects four channels.
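Weight-ranking channel selection can be sketched as a top-k pick over per-channel importance scores. This is an illustrative sketch: in the paper the scores come from the MFF attention weights, whereas here they are passed in as a plain array.

```python
import numpy as np

def select_top_channels(feature_map: np.ndarray, weights: np.ndarray, k: int = 4):
    """Keep the k channels of feature_map (C, H, W) with the largest weights.

    Returns the selected channels (k, H, W) and their indices, ranked from
    the highest to the lowest weight.
    """
    top = np.argsort(weights)[-k:][::-1]  # indices of the k largest weights
    return feature_map[top], top
```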

B. Advantages of Mutual Bootstrapping
The proposed CMMF-Net not only transfers the lesion location and geometric feature information generated by the feature selection network to the classification network for lesion classification but also transfers the lesion class-saliency information learned by the classification network back to the feature selection network to improve segmentation accuracy. Therefore, the two networks enhance each other in a bootstrapping manner, improving lumbar disc lesion segmentation and classification simultaneously, especially when the training dataset is small. This idea is also reflected in [16], which uses a cascaded training strategy to gradually optimize three stages by leveraging transferred knowledge. On public datasets with larger amounts of data, such as LIDC-IDRI, models often achieve more robust results.
One of the main advantages of using classification to facilitate segmentation is that the feature selection network can be trained without increasing the number of pixel-level annotated images, thereby reducing the requirement for pixel-level dense annotation.

C. Limitation of CMMF-Net
The performance gain decreases when the amount of pixel-annotated training data increases from 300 to 400 images. This is not surprising: the more pixel-annotated training images are used, the fewer auxiliary positioning features are required, and network performance is dominated by the pixel-annotated data. In addition, without adding new lumbar intervertebral disc training data, iteratively training CMMF-Net for three cycles (one cycle comprises 500 epochs each for the feature selection network and the classification network) resulted in an 11.43% decrease in classification performance, while segmentation accuracy remained essentially unchanged (±0.21%). This is because the classification network tends to overfit with limited training data, whereas the segmentation task, supervised by masks, is less affected.
CAM-guided lumbar intervertebral disc segmentation also has some shortcomings. As shown in Fig. 11, there are missed-segmentation and over-segmentation problems at some edges and central positions of the lumbar intervertebral disc. This may be because the guidance of the CAM at the boundary of the target area is not strong enough, while L_w forces the boundary to converge, resulting in segmentation errors. Our method requires pixel-level annotation as segmentation supervision, which remains a pressing problem for medical images. Furthermore, our method is limited by the scale of the network model, learns poorly on medical images with large lesions, and has not yet achieved end-to-end training.

VI. CONCLUSION
Classification of lumbar disc herniation based on deep learning has significant clinical value. On the one hand, it can help doctors quickly and accurately diagnose whether a patient has lumbar disc herniation, thus guiding clinical treatment decisions. On the other hand, it can help doctors accurately identify the type of disease, develop treatment plans tailored to individual patients, and shorten the diagnosis period.
We propose the CMMF-Net model, comprising a feature selection network and a classification network. First, the feature selection network enhances edge learning and yields accurate lesion prediction areas; the lesion location and distance feature matrix are obtained from the segmented masks. Our proposed MFF module produces multi-scale fusion features, and the boundary-constraint loss function drives convergence to the lumbar disc boundary. Subsequently, the multi-meta features are fused into the classification network to obtain the classification probability and CAM for lumbar disc herniation. Finally, the CAM is fed back into the feature selection network during the upsampling process of the next training round so that the two networks promote each other. In practical applications, given the good transferability of CMMF-Net, it can provide accurate diagnosis results for doctors and effectively shorten the diagnosis cycle. Our model is more accurate than the baseline and current advanced methods on the private lumbar disc dataset, and is also competitive on the LIDC-IDRI dataset.
In the future, we plan to extend the proposed model to a semi-supervised learning framework and use the CAM as weak-label supervision for the feature selection module to predict lesion areas. This will reduce the need for data annotation and improve the computational efficiency of the model. Additionally, we will use a lightweight DCNN structure in the proposed model to enhance its performance. Furthermore, a more in-depth study of case information is required, and we plan to incorporate natural language processing models to capture additional information so as to classify lumbar disc herniation more accurately.

Fig. 1 .
Fig. 1. Images of different types of lumbar intervertebral discs. The first row shows the normal type, the second row the bulge type, and the third row the protrusion type. The CT image is the intermediate layer corresponding to each type.

Fig. 2 .
Fig. 2. The illustration of the proposed CMMF-Net, which consists of two DCNNs: (a) the feature selection network and (b) the classification network. The feature selection network serves two main functions: generating an intervertebral disc mask and selecting features through the MFF module. From the segmentation mask, the distance feature matrix and region of interest (ROI) bounding box are calculated, then concatenated with the selected features and fed into the classification network. The classification network performs two functions: generating the lumbar disc disease classification results and generating CAM maps. The CAM map is input into the feature selection network as class-saliency information for the next training round, serving to locate the lesions.

Fig. 5 .
Fig. 5. MFF module pipeline, where F_sq(·), F_ex(·, W) and F_scale(·, ·) represent the squeeze operation, the excitation operation and the feature rescaling operation, respectively. β represents the penalty coefficient and C the number of channels.

Fig. 7 .
Fig. 7. Quantitative comparison of parameters θ, K and c in L_w.

Fig. 8 .
Fig. 8. Comparison of values under different loss function selections.

Fig. 9 .
Fig. 9. The influence of CAM visualization of the feature selection network on classification results. We compare the results of MFF under different backbone segmentation frameworks.

Fig. 10 .
Fig. 10. Display of lumbar disc segmentation results guided by CAM.

TABLE I
QUANTITATIVE COMPARATIVE EXPERIMENTS WITH OTHER MEDICAL IMAGE CLASSIFICATION METHODS BASED ON CONVOLUTIONAL NEURAL NETWORKS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION; * INDICATES THAT OUR METHOD SIGNIFICANTLY OUTPERFORMS OTHERS WITH p < 0.05

TABLE II
QUANTITATIVE COMPARISON OF THE CLASSIFICATION RESULTS OF LUMBAR SPINE LESIONS BY DIFFERENT FEATURE EXTRACTION METHODS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE III
QUANTITATIVE COMPARISON OF SEGMENTATION AND CLASSIFICATION RESULTS WITH DIFFERENT LOSS FUNCTIONS. DSC IS THE SEGMENTATION EVALUATION METRIC, AND THE REST ARE CLASSIFICATION EVALUATION METRICS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE IV
QUANTITATIVE COMPARISON OF SEGMENTATION AND CLASSIFICATION RESULTS WITH DIFFERENT LOSS FUNCTION WEIGHTS. DSC IS THE SEGMENTATION EVALUATION METRIC, AND THE REST ARE CLASSIFICATION EVALUATION METRICS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE V
QUANTITATIVE COMPARISON OF THE SEGMENTATION RESULTS OF LUMBAR SPINE LESIONS BY DIFFERENT POOLING METHODS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE VI
QUANTITATIVE COMPARISON OF THE IMPACT OF DIFFERENT PENALTY COEFFICIENTS β IN MFF ON THE RESULTS OF THE FEATURE SELECTION AND CLASSIFICATION NETWORKS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE VII
QUANTITATIVE COMPARISON OF CLASSIFICATION RESULTS OF LUMBAR SPINE LESIONS BY DIFFERENT BOUNDING BOX SELECTION METHODS. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION

TABLE VIII
COMPARISON WITH LEADING METHODS ON THE LIDC-IDRI DATASET. VALUES ARE THE MEAN (STDEV) OBTAINED OVER FIVE-FOLD CROSS-VALIDATION; * INDICATES THAT OUR METHOD SIGNIFICANTLY OUTPERFORMS OTHERS WITH p < 0.05