Mining Domain Knowledge: Improved Framework Towards Automatically Standardizing Anatomical Structure Nomenclature in Radiotherapy

The automatic standardization of nomenclature for anatomical structures in radiotherapy (RT) clinical data is a critical prerequisite for data curation and data-driven research in the era of big data and artificial intelligence, but it is currently an unmet need. Existing methods either cannot handle cross-institutional datasets or suffer from heavy imbalance and poor-quality delineation in clinical RT datasets. To solve these problems, we propose an automated structure nomenclature standardization framework, 3D Non-local Network with Voting (3DNNV). This framework consists of an improved data processing strategy, namely, adaptive sampling and adaptive cropping (ASAC) with voting, and an optimized feature extraction module. The framework simulates clinicians’ domain knowledge and recognition mechanisms to identify small-volume organs at risk (OARs) with heavily imbalanced data better than other methods. We used partial data from an open-source head-and-neck cancer dataset to train the model, then tested the model on three cross-institutional datasets to demonstrate its generalizability. 3DNNV outperformed the baseline model, achieving higher average true positive rates (TPR) over all categories on the three test datasets (+8.27%, +2.39%, and +5.53%, respectively). More importantly, for a small-volume OAR with only 9 training samples, 3DNNV raised the F1 score on the test dataset from 28.63% (baseline) to 91.17%. The results show that 3DNNV can be applied to identify OARs, even error-prone ones. Furthermore, we discuss the limitations and applicability of the framework in practical scenarios. The framework we developed can assist in standardizing structure nomenclature to facilitate data-driven clinical research in cancer radiotherapy.


I. INTRODUCTION
In the field of radiotherapy (RT), nomenclature standardization is the process of imposing a unified and structured labeling system on anatomical structures [1]-[3]. This is a prerequisite for clinical data curation and data-driven research, especially in the era of big data and artificial intelligence [1], [4]-[7]. However, because of differences in local policies, vendors, and language environments, structure labels are often inconsistent [8], [9]. A large number of retrospective RT datasets [10], [11] cannot be shared and reused without consistent labels, and manually cleaning RT data is very expensive and time-consuming [8], [9], [12]-[14]. Therefore, it is necessary to develop software tools to automate nomenclature standardization to facilitate data-driven clinical research.
Previous works have proposed standardizing the nomenclature of anatomical structures via text-based methods that rely on label matching and clinicians' intervention to correct mismatched labels at a single institution [8], [15], [16]. However, language constantly changes, and different naming conventions make the semantic information in labels difficult to recognize automatically. As a result, text-based methods cannot be applied to datasets collected even from a single institution, let alone to cross-institutional datasets.
The labels of organs at risk (OARs) have a one-to-one correspondence with the images (such as Computed Tomography [CT] scans and segmentation masks), and the image data contain invariant semantic information that can standardize nomenclature in multilingual environments. Methods that leverage this image information to tackle cross-institutional RT datasets are called image-based methods. Existing image-based methods try to automatically standardize nomenclature by exploiting semantic invariance in the image [17]-[20]. Among these methods, algorithms that leverage atlas-based registration can also be used to determine the category of the structure and then relabel it [18], [19]. However, atlas-based registration is unstable and time-consuming. Other image-based methods convert the task of structure nomenclature standardization to OAR classification based on deep learning (DL) frameworks [17], [20]. Nonetheless, these methods have largely overlooked the problems caused by imbalance and poor delineation in real RT datasets, especially for small-volume OARs with similar positions, shapes, and sizes, such as the pituitary and optic chiasm. RT datasets are imbalanced, not only in the number of OARs, but also in the size of each OAR. For example, in Fig. 1 (a), the volume of the brain is much larger than that of the pituitary. Models built on such datasets tend to be biased and inaccurate. Poor delineation of OARs increases the inter-class similarity and the intra-class variation. For example, in Fig. 1 (b), the pituitary and optic chiasm are very similar, but the larynx varies greatly across patients. Both imbalance and poor delineation will bias the classifier, which will lead to incorrect predictions for small-volume OARs.
As mentioned earlier, some image-based methods treat the task of structure nomenclature standardization as an OAR classification task [17], [20]. In the field of computer vision, deep learning has led to a series of breakthroughs for classification tasks [21], [22] that have improved upon traditional classification methods [23], [24]. Related works seeking to improve the performance of DL-based classifiers have mainly focused on three aspects: 1) constructing deeper, wider, and more elaborate architectures [22], [25]- [30] to increase the capacity for adapting data and training [31]; 2) enriching samples to get close to the actual distribution [32]- [36]; and 3) adding subjective constraints to make high-level features extracted within the network correspond to the domain knowledge required for specific tasks [37]- [40]. Existing state-of-the-art networks [21], [22], [25]- [30] for classification can be applied to the current task. ResNet [25] has a lower computational cost but better performance than other networks [41]. Therefore, we have made many attempts to use ResNet50 for this application, but these attempts have yielded results similar to previous reports [17], [20]: the true positive rate (TPR) of small-volume OARs (such as the pituitary and optic chiasm) cannot meet the requirements for clinical implementation. It is worth noting that clinicians can make quick and accurate decisions for small-volume OARs, even with poorly delineated samples, which means the images contain enough effective information for clinicians to apply their domain knowledge and recognition mechanisms. To date, there has been no relevant research on how clinicians make accurate decisions when classifying OARs, but we can simulate this process and, thus, incorporate the implicit domain knowledge and recognition mechanisms necessary for decision making into the target framework.
The main goal of our work is to explore ways to integrate clinicians' domain knowledge and recognition mechanisms into a neural network to improve the classifier's performance for categorizing small-volume OARs. To this end, we propose an automatic structure nomenclature standardization framework, 3D Non-local Network with Voting (3DNNV). This framework consists of an improved data processing strategy and an optimized feature extraction structure. The data processing strategy was proposed to provide the explicit information that clinicians use when labeling structures. The feature extraction structure simulates the observation process, which enhances the observational fineness of the region of interest (ROI) in the high-level features.

A. IMPROVED DATA PROCESSING STRATEGY
We propose a simple and effective adaptive data processing strategy: adaptive sampling and adaptive cropping (ASAC) with voting. ASAC simulates the process of clinicians observing images and collecting the information needed for decision making, and it generates multi-scale and multi-position inputs for a sample. ASAC constructs a set of augmented inputs, assists the model in mining the effective information implicit in the raw data, and extracts the domain knowledge that clinicians typically need to identify OARs. The voting strategy accounts for variations in a structure's shape and location that may lead to poor delineation. This strategy is a weighted sum of all the predictive results of inputs for the same sample; this makes the final result closer to predicting the ''ideal'' semantic features. The voting strategy also agrees with the principle of clinicians making decisions based on comprehensive information.

B. OPTIMIZED FEATURE EXTRACTION STRUCTURE
The convolutional network only processes one local neighborhood at a time, and the common way to model long-range dependencies in semantic features is to increase the receptive field. In order to fill the gaps in capturing long-range dependencies and to enhance the observational fineness in the region of interest in the high-level semantic features, we added non-local blocks [42] to ResNet50 to optimize the feature extraction structure in the network (designated ''NN'' for Non-local Network). Non-local blocks apply a self-attention mechanism [43] to image sequence processing by calculating the similarity matrix for high-level semantic features, thereby capturing long-range dependencies and enhancing the representation of the semantic features.
By combining the ASAC/Voting strategy with the Non-local Network, we obtained the final framework, 3DNNV, which can standardize the nomenclature of structures in RT datasets. The 3DNNV integrates clinicians' domain knowledge and recognition mechanisms into the final model from a new perspective, mitigates the problems caused by imbalance and poor delineation in RT datasets, and improves the performance for identifying small-volume OARs. This framework allows us to categorize structures in cross-institutional RT data quickly and efficiently, then automatically relabel these structures with general labels recommended in AAPM TG-263 [1]. Furthermore, 3DNNV is extensible and can be easily transferred to other anatomic sites after fine-tuning on a few samples.
The rest of this paper is organized as follows: Section II introduces related works that have sought to automate nomenclature standardization of OARs in recent years. Section III describes our 3DNNV framework. Section IV shows the results of experiments evaluating 3DNNV's performance and comparing it with other state-of-the-art methods. Section V discusses the limitations of this study and the future prospects of our work. Section VI summarizes our main findings and provides future directions.

II. RELATED WORK

A. TEXT-BASED METHODS
Text-based methods standardize structure nomenclature mainly by using structured naming templates or label mapping dictionaries. Mayo et al. [15] built software containing structured templates, which allows clinicians to relabel structures interactively. The fixed template helps to unify labels better than free-text interactive tools. Nyholm et al. [16] mapped the main structure labels in local clinical centers to the name list of the general naming convention, then manually corrected the mismatched labels through the interactive interface. The authors used the tool to aggregate RT data from 15 medical centers in Sweden. More recently, Schuler et al. [8] pointed out that, when standardizing radiotherapy data, it is difficult to distinguish between typographic name variations and fundamental semantic differences in the same structure. Therefore, they developed a tool called Stature that maps a local standard structure name (LSSN) to the AAPM TG-263 naming table by creating a lookup dictionary. The above methods map the original labels to standardized labels based on a dictionary and manual intervention. These kinds of methods can establish the mapping between the original labels and standardized labels to quickly solve the problem of inconsistent labels in the local RT dataset. However, language constantly changes, as shown in Table 1, which limits these methods' applicability to cross-institutional datasets. In addition, text-based methods cannot handle large-scale retrospective datasets.

B. IMAGE-BASED METHODS
Image-based methods, which exploit the invariant semantic information in medical images, are learnable automatic recognition methods that overcome the problems inherent in text-based methods. Label propagation, implemented with an atlas-based deformable image registration (DIR) algorithm, registers an atlas with known labels to the input and then relabels the input with the atlas whose mask has the highest overlap [19]. In this way, unknown datasets can be standardized with the labels in the atlas. However, the DIR's performance is unstable [18]. Also, this method is highly time-consuming, so it falls well short of practical requirements.
Our previous work departed from these methods, as it converted label standardization to the task of automatically categorizing structures in RT data and modeled the process with a deep neural network, which used the weighted mask of OARs to construct a composite mask as 2D input [17]. This work demonstrated the excellent performance of deep learning networks in standardizing OAR labels, but the experiment did not make full use of the three-dimensional shape and location information on the CT. The classes of OARs in the training dataset were clean and sufficient, but the real dataset contained many other challenges, such as heavy data imbalance, inter-class similarity, and intra-class variation, that could limit the method when extending it to other anatomic sites. More recently, Rhee et al. [20] extended the number of categories to 19 OARs in the head-and-neck region and loosely utilized the encoder of V-Net [44] to construct their framework, TG263-Net. This framework leveraged 3D inputs and achieved high accuracy in identifying 19 OARs, but it did not take into account imbalance and poor delineation in RT datasets, so its performance in identifying small-volume OARs is insufficient for practical clinical needs.

III. MATERIALS AND METHODOLOGY

A. OVERVIEW OF 3DNNV
This section outlines the workflow of 3DNNV (Fig. 2). 3DNNV consists of two parts in the inference phase for standardizing structure nomenclature: 1) the ASAC/Voting strategy and 2) the Non-local Network. For any OAR in given Digital Imaging and Communications in Medicine (DICOM) data, the CT and corresponding mask are extracted to form a raw data pair. Then, ASAC generates multi-scale and multi-position inputs for each sample. During training, each input generated by ASAC is regarded as an independent sample, and the parameters of the non-local network are updated and optimized based on the samples in each mini-batch. In the inference phase, multiple inputs for a sample are fed into the network, which outputs the vectors (''Vectors'' in Fig. 2, a 256-d vector for each). Sharing weights here allows the consistent representation of multi-scale/multi-position inputs in feature space so that we can leverage the same model with the same parameters to extract high-level features for each input. The 256-d vectors vote for a final predictive result as the output of 3DNNV, and the sample is renamed with a standardized label.

B. DATA
In accordance with Brouwer et al.'s suggestion [45], we selected the 28 categories of head-and-neck OARs shown in Fig. 1 (a) to train our model. We compared our model's performance in standardizing structure nomenclature against other models by testing them on three different head-and-neck image datasets.

1) HN_PETCT
HN_PETCT [46], [47] is an open-source head-and-neck RT dataset released on The Cancer Imaging Archive (TCIA) [48] that includes data collected from four different French medical institutions comprising 298 patients. We collected 4372 samples in total for the 28 OAR categories. Then, we divided the samples into three subsets for training, validation, and testing in a ratio of 3:1:1. It should be noted that the number of samples in the dataset is extremely imbalanced. For Glnd_Lacrimal_L/R and Pituitary, only 9 samples for each were used as training data.

2) PDDCA
PDDCA [49] is an open-source RT dataset containing data from 48 patients that was released by the MICCAI 2015 Segmentation Challenge. This dataset contains only 9 categories of head-and-neck OARs (Parotid_L, Parotid_R, Glnd_Submand_L, Glnd_Submand_R, Bone_Mandible, Brainstem, OpticChiasm, OpticNrv_L, and OpticNrv_R). All contours for OARs were re-delineated by trained radiologists. We collected 408 samples in total and used all of them as a test set.

3) HN_UTSW
HN_UTSW is an RT dataset collected by our team at UT Southwestern that contains data for 408 patients. We collected a total of 5153 samples for 28 OARs (the same categories as HN_PETCT) and used all of them for testing to show our model's generalizability.

C. PREPROCESSING
For each patient's data in given DICOM files, 3D CT volumes and corresponding masks were extracted to form raw data, then the voxel size of the 3D volumes in the raw data was normalized. To ensure that the small-volume OARs do not lose any information through down-sampling, we chose to use the same voxel dimension ratio z:y:x = 0.77:1:1 from the training dataset HN_PETCT for all other datasets. We performed trilinear interpolation for resizing and reshaping. Due to differences in maxima (and minima) of Hounsfield unit (HU) values for different patients, the range of HU values was truncated to [−1000, 2500], then normalized to [0, 1]. We directly used a binary [0, 1] matrix to represent the mask. For some patients, the OAR contours were missing in some intermediate slices and were generated using the nearest slices.
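The HU truncation and normalization step above can be sketched as follows; this is a minimal illustrative implementation (the function name and NumPy formulation are ours, not the authors' released code):

```python
import numpy as np

def normalize_hu(ct, lo=-1000.0, hi=2500.0):
    """Truncate CT Hounsfield units to [lo, hi], then rescale to [0, 1]."""
    ct = np.clip(np.asarray(ct, dtype=float), lo, hi)
    return (ct - lo) / (hi - lo)
```

After this step, all patients share the same intensity range regardless of their original HU extrema.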
D. 3DNNV: 3D NON-LOCAL NETWORK WITH VOTING
ASAC/Voting is an essential part of 3DNNV. It is worth noting that ASAC is a data processing strategy that can be applied in all stages, but the voting strategy is applied only in the inference phase.

1) ASAC: ADAPTIVE SAMPLING AND ADAPTIVE CROPPING
For each OAR, a pair of pre-processed 3D CT and mask volumes (''Raw Data'' in Fig. 2) are cropped into smaller volumes by using sliding cubes of n × m × m (the blue and orange cubes in the ''ASAC'' part of Fig. 2) along the patient long axis, which are then used as inputs for the non-local network. Here, m and n are the sizes of the sliding cube in the axial plane and along the patient long axis, respectively. In our experiments, we used 5 different sizes for the cropping cubes: 12 × 128 × 128, 18 × 192 × 192, 24 × 256 × 256, 30 × 320 × 320, and 36 × 384 × 384. The cubes slide at a step size of 2n/3. The image volumes cropped using cubes of different sizes are resized to 12 × 128 × 128 before being input into the non-local network. ASAC is not only a way to extract clinicians' domain knowledge, but also a way to deal with the issues related to limited computational resources and imbalanced training datasets. For some oversized OARs, such as Brain, the contour in the axial slice cannot be entirely captured by small-volume cubes (such as 12 × 128 × 128). Therefore, it is necessary to adaptively resize the CT and mask first to fit them into the cubes, then perform the sampling. By performing ASAC, we gain multi-scale and multi-position inputs for each sample.
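The sliding-cube sampling can be sketched as follows. This is our illustrative reading of the text: the 2n/3 step and the five cube sizes come from the paper, while the function name and the clamping of the final window (so the last slices are always covered) are assumptions:

```python
def asac_windows(depth, n):
    """Start indices of n-slice sliding cubes along the patient long axis,
    stepping by 2n/3; the final window is clamped to cover the last slices."""
    step = max(1, (2 * n) // 3)
    last = max(depth - n, 0)
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)
    return starts

# The five multi-scale cube sizes (n, m) used in the experiments.
CUBE_SIZES = [(12, 128), (18, 192), (24, 256), (30, 320), (36, 384)]
```

For example, a 36-slice volume sampled with n = 12 yields windows starting at slices 0, 8, 16, and 24; each cropped volume would then be resized to 12 × 128 × 128.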

2) VOTING
As mentioned above, ASAC generates multi-scale and multi-position inputs, which contain global and local information. The outputs of the non-local network, corresponding to the different inputs, are summed to vote for the final recognition result. This voting strategy is used only in the inference phase.
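A minimal sketch of this voting step (uniform weights over the per-input output vectors are assumed; the function name is ours):

```python
import numpy as np

def vote(output_vectors):
    """Sum the network's output vectors for all ASAC-generated inputs of
    one sample and take the argmax of the sum as the final category index."""
    scores = np.sum(np.asarray(output_vectors, dtype=float), axis=0)
    return int(np.argmax(scores))
```

Summing before the argmax lets a few confident local views outweigh ambiguous ones, which is the behavior the voting strategy relies on.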

3) NON-LOCAL NETWORK
We set vanilla ResNet50 as the backbone for our 3DNNV network and also as the baseline for our performance comparisons (Table 2).
Then, we added non-local blocks [42] to the backbone network to form the final 3D non-local network (Fig. 3). Inspired by the self-attention mechanism [43], Wang et al. [42] proposed the non-local block to capture global dependencies in semantic features. Although it was originally designed to handle sequential data, it can be stacked into our framework. In this work, we aim to enhance the dependence of the position information on the CT image and of the shape information on the mask image, so the pairwise function f is implemented in the concatenated form. The non-local block used in our network is defined as follows:

z_i = σ(y_i) + x_i

y_i = (1/C(x)) Σ_∀j f(x_i, x_j) µ(x_j)

f(x_i, x_j) = ReLU(W_f [θ(x_i), ψ(x_j)])

Here, x and z are the input and output, respectively, of the non-local block. Both are of size B × C × D × H × W: B denotes the batch size of the input, and C represents the number of channels; D, H, and W are depth, height, and width, respectively. The index i denotes an output position whose response is to be computed, j enumerates all possible positions, and y is an intermediate output with the same size as x. ψ, θ, µ, and σ are all 1 × 1 × 1 convolution layers. The self-attention mechanism [43] is applied in the non-local block to capture the long-range dependency within semantic features. Any input for the network is 2-channel 3D data (12 × 128 × 128 CT and 12 × 128 × 128 mask), and the corresponding output is a 256-d vector. The operator [·,·] indicates concatenation, and W_f is the mapping matrix that converts the concatenated vector to a scalar output. ''+x_i'' indicates the identity mapping: the input x_i is added to the transformed y to obtain the final output z of the non-local block. C(x) is a normalization term, set to the number of positions in x.
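As an illustration of the concatenation-form pairwise function, the following NumPy sketch applies the block to features flattened over the D × H × W positions, treating the 1 × 1 × 1 convolutions as per-position matrix multiplications. All names and weight shapes here are our assumptions for exposition, not the released implementation:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def nonlocal_block_concat(x, W_theta, W_psi, W_mu, W_sigma, w_f):
    """Concatenation-form non-local block on flattened features.

    x                : (C, N) features, N = D*H*W positions.
    W_theta, W_psi   : (C2, C) embeddings theta and psi (1x1x1 convs
                       act as per-position linear maps).
    W_mu             : (C2, C) value embedding mu.
    W_sigma          : (C, C2) output mapping sigma.
    w_f              : (2*C2,) weights mapping [theta_i, psi_j] to a scalar.
    """
    C, N = x.shape
    theta, psi, mu = W_theta @ x, W_psi @ x, W_mu @ x          # each (C2, N)
    c2 = theta.shape[0]
    # f(x_i, x_j) = ReLU(w_f^T [theta_i, psi_j]) decomposes into two dots.
    f = relu((w_f[:c2] @ theta)[:, None] + (w_f[c2:] @ psi)[None, :])  # (N, N)
    y = (mu @ f.T) / N            # y_i = (1/C(x)) sum_j f(x_i, x_j) mu(x_j)
    return W_sigma @ y + x        # z_i = sigma(y_i) + x_i (identity mapping)
```

Note that with w_f set to zero the pairwise term vanishes and this block reduces to the identity mapping on x.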

IV. EXPERIMENTS AND RESULTS

A. EXPERIMENTAL SETTING

1) TRAINING DETAILS
Using the training data outlined in section III.B and the preprocessing outlined in section III.C above, we trained our deep learning models as described below. To account for imbalance in the number of images for each OAR in the dataset, we applied a non-uniform sampling method: OARs with fewer samples were sampled proportionally more often, inversely to their frequency. We augmented the training data by performing affine transformations, including random translation, rotation, shearing, and scaling. Finally, the central cube of the sample was cropped as input data. The final input data size was 2 × 12 × 96 × 96, i.e., two-channel 3D data comprising the 3D CT volume and the corresponding mask on the same slices. All architectures used in this work were initialized as described by He et al. [51]. The Adam optimization algorithm [52] was applied to optimize the networks with an initial learning rate of 1e-4, and cross-entropy was set as the loss function. The batch size was set to 16. For samples generated by ASAC, we set the total number of epochs to 20, and the learning rate dropped by a factor of 10 after 2, 5, and 10 epochs. For other architectures without ASAC, we set the total number of epochs to 200, and the learning rate decreased by a factor of 10 after 10, 20, and 30 epochs. 3DNNV was implemented in the PyTorch 1.0 [50] framework and trained on a single NVIDIA Tesla K80 GPU.
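The inverse-frequency sampling can be sketched as follows; the helper is hypothetical, since the text specifies only that under-represented OARs are sampled inversely proportionally more often:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-sample sampling probabilities in which every class receives the
    same total probability mass, so rarer OAR classes are drawn more often
    per individual sample."""
    classes, counts = np.unique(labels, return_counts=True)
    count_of = dict(zip(classes, counts))
    w = np.array([1.0 / count_of[l] for l in labels], dtype=float)
    return w / w.sum()
```

In a PyTorch pipeline such weights could be passed (unnormalized) to torch.utils.data.WeightedRandomSampler.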

2) EVALUATION
For this multi-class classification task, we used the true positive rate (TPR), F1 score, and area under the receiver operating characteristic curve (AUC) to evaluate the performance of our models. Multi-class classification can be considered as multiple binary classifications, so the true positive (TP), false negative (FN), and false positive (FP) counts can be calculated for each category separately. The metrics are defined as follows:

TPR = TP / (TP + FN)

PPV = TP / (TP + FP)

F1 = 2 × PPV × TPR / (PPV + TPR)

AUC = (Σ_{i ∈ positive} rank_i − M(M + 1)/2) / (M × N) (7)

The F1 score is the harmonic mean of the positive predictive value (PPV) and the TPR. In (7), rank_i denotes the rank of the i-th positive sample when all samples are sorted by predicted probability, and M and N are the numbers of positive and negative samples, respectively. AUC indicates how well the model distinguishes between different classes and is insensitive to class imbalance in the test set.
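The one-vs-rest computation of TPR and F1 described above can be sketched as (function and variable names are ours):

```python
def per_class_metrics(y_true, y_pred, categories):
    """One-vs-rest TP/FN/FP counts per OAR category, then TPR and F1."""
    metrics = {}
    for c in categories:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        tpr = tp / (tp + fn) if tp + fn else 0.0
        ppv = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
        metrics[c] = {"TPR": tpr, "F1": f1}
    return metrics
```

Averaging the per-class values then gives the "average over all categories" figures reported in the tables.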

B. COMPARISONS AMONG ResNet MODELS
We developed the 3DNNV model in a step-wise manner. First, we set vanilla 3D ResNet50 as the backbone network; then, we optimized the architecture; and finally, we integrated domain knowledge into the network. We evaluated and compared the performance of the models obtained at each step. Our first goal was to determine an initial preprocessing strategy for the raw data. Beginning with the baseline network, we tested three different strategies: taking global samples without voxel size normalization (GS), global samples with voxel size normalization (VN-GS), and local samples with voxel size normalization (VN-LS) as inputs. Samples collected at the scale of 36 × 384 × 384 were marked as global samples (GS), and samples collected at the scale of 12 × 128 × 128 were marked as local samples (LS). ''VN'' means voxel normalization. Accordingly, we designated the architectures as Baseline (GS), Baseline (VN-GS) and Baseline (VN-LS). We found that, for the error-prone small-volume OARs in the head-and-neck region, detailed information contained in the local samples plays an important role in recognition (see Fig. 4 and Table 2), so incorporating local details benefits models in classifying small-volume OARs. However, the model trained only on local samples, Baseline (VN-LS), could not distinguish between BrachialPlex_L (BP_L) and BrachialPlex_R (BP_R) (Fig. 5). Without the global location information, the model failed to indicate on which side the OAR should be.
To enhance the representation of small-volume OARs in high-level feature space, we added non-local blocks to the backbone networks with voxel normalization and compared the performance of the non-local network (NN) with the baselines. We designated the non-local network architectures as NN (VN-GS) and NN (VN-LS). The NN architectures performed slightly better than the baselines over all categories, especially for Pituitary and OpticChiasm (Table 2). However, their performance on other small-volume OARs was barely satisfactory. Like the Baseline (VN-LS) architecture, the non-local network architecture trained on local samples, NN (VN-LS), could not distinguish between BrachialPlex_L (BP_L) and BrachialPlex_R (BP_R).
Finally, we applied the ASAC/Voting strategy to generate multiple inputs for a sample and combine the information through voting. We constructed and trained the 3DNNV network on the samples generated by ASAC. In the inference phase, all output vectors for the same sample voted for the final predictive result. We found that 3DNNV performed well in identifying small-volume OARs, even those similar in shape, size and location, such as the pituitary and optic chiasm.
When we compared the performance of the six ResNet-based models in classifying the 28 OARs across all three institutional test sets, we found that 3DNNV was superior to the baseline methods for classifying OARs and had good generalizability across different institutional datasets (Table 3).

C. COMPARISONS WITH PREVIOUS WORKS
To further test 3DNNV's ability to standardize structure nomenclature, we compared its performance with that of other image-based methods: specifically, atlas-based registration and several deep learning-based methods.

1) ATLAS-BASED REGISTRATION
Atlas-based registration can standardize structure nomenclature by matching OARs with an atlas in the database and renaming the input with the atlas label that has the largest overlap mask. To test atlas-based registration for this application, first, we constructed a 2D single-atlas database for the 28 OARs, each sample in which contained a CT slice and a mask for the OAR to be identified in the same slice. Second, for each pair of CT and mask of the OAR to be identified (noted as fixed CT and fixed mask), the moving CT in each atlas was registered to the fixed CT, and the transformation (warping parameters) was learned. We applied the transformation to the moving mask of the atlas, then calculated the area of overlap between the deformed moving mask and the fixed mask by using the Dice Similarity Coefficient (DSC):

DSC = 2|X ∩ Y| / (|X| + |Y|)

where X and Y denote the fixed mask in the given data and the moving mask in the atlas database, respectively.
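A minimal sketch of the DSC computation on binary masks:

```python
import numpy as np

def dice(fixed_mask, moving_mask):
    """Dice similarity coefficient between two binary masks."""
    x = np.asarray(fixed_mask, dtype=bool)
    y = np.asarray(moving_mask, dtype=bool)
    denom = x.sum() + y.sum()
    # Convention for two empty masks (perfect agreement) is an assumption.
    return 2.0 * np.logical_and(x, y).sum() / denom if denom else 1.0
```

The atlas label whose deformed mask maximizes this score is the one used to relabel the input.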
For the experiment comparing atlas-based registration with 3DNNV, every structure processed by 3DNNV was first run through an early-match module to avoid processing standardized structures repeatedly. The early-match module performed string matching between the original label and the standardized label: if and only if the original label fully matched one of the standardized labels in the dictionary, then the original label was treated as an already standardized label. Otherwise, the structure was fed into 3DNNV to obtain the prediction result. To limit the running time, we tested both methods on data from two randomly selected patients in the HN_UTSW dataset (Table 4).
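The early-match step can be sketched as follows (all names are hypothetical; the classifier stands in for 3DNNV):

```python
def standardize_label(original_label, standardized_labels, classify):
    """Early-match module: keep a label only if it exactly matches an entry
    in the standardized-label dictionary; otherwise fall back to the
    classifier's prediction."""
    if original_label in standardized_labels:
        return original_label
    return classify(original_label)
```

This guarantees that already standardized structures bypass the (comparatively expensive) network inference.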
The results show that the atlas-based registration algorithm is very time-consuming and unstable on different patient datasets, and its running time is almost 30 times longer than 3DNNV's, which makes atlas-based registration unacceptable for this application. The registration effect of atlas-based deformable image registration often depends on the atlas dataset, the deformation model, and the objective function. However, it is difficult to construct an optimal single-atlas database. Multi-atlas datasets could be applied to make up for this deficiency, but this would be even more time-consuming.

2) DL-BASED METHODS
We also applied and analyzed other DL-based methods for structure nomenclature standardization and compared their performance with 3DNNV. Taking into account the substantial impact of different sampling strategies and networks on performance, we set several different architectures for the experiment. For the various inputs used in this section, 1c2d is a 1-channel composite mask [17], 2c2d is a 2-channel input combining 2D CT and the corresponding mask, and 2c3d is a 2-channel 3D CT and mask [20]. For the different networks, we trained and tested a 5-layer CNN [17], vanilla 2D ResNet50 [25], and TG263-Net [20] on the same datasets and compared their performances with 3DNNV. To fairly compare different methods with different inputs, we set four architectures, 5-layer CNN (1c2d), ResNet50 (1c2d), 5-layer CNN (2c2d), and ResNet50 (2c2d), to determine the best combination of network and inputs. We found that the ResNet50-based models performed far better than the 5-layer CNN models (Fig. 6, Table 5, and Table 6), even though ResNet50 has fewer parameters and a lower computational cost. The 2-channel inputs include more information, which generally improves the overall performance on the 28 categories (Table 6). Of note, Pituitary got an F1 score of 0.0% (Table 5) because of the extremely imbalanced training sample: not only were there many more samples for OpticChiasm than for Pituitary, but the two OARs are quite similar and error-prone. As a result, all test samples for Pituitary were predicted as OpticChiasm.
Based on the results of the above experiments, we set three more architectures, TG263-Net (2c3d), NN (2c3d), and 3DNNV, to determine the optimal sampling strategy and structure for the framework. For TG263-Net [20], we loosely used the encoder in V-Net [44] to construct the classifier. Then, we normalized the voxel size of CT and mask volumes in raw data to 2 mm × 2 mm × 2 mm, and we cropped the central 64 × 64 × 64 cubes (on CT and mask) to construct the 2-channel input. We randomly translated the center-of-mass by 10 mm to gain 9 inputs for each sample. In the inference phase, the 9 vectors extracted from the 9 inputs vote for a final prediction result. This sampling strategy is similar to 3DNNV's, so we applied TG263-Net's sampling strategy (along with the voting strategy) to our Non-local Network (NN) and designated the architecture as NN (2c3d). When we compared these two architectures, NN (2c3d) performed notably worse than TG263-Net (2c3d). Nevertheless, after replacing NN (2c3d)'s sampling and voting strategy with ASAC/Voting, we arrived at the framework of 3DNNV, which includes the improved data processing strategy and the optimized feature extraction structure. The average TPR, F1, and AUC for 3DNNV and the other DL-based methods over all categories on all test datasets are shown in Table 6. Although TG263-Net performed slightly better than NN, it required a longer running time. Most importantly, 3DNNV outperformed the other DL-based methods and had better generalizability across institutional datasets.

TABLE 8. One-way analysis of variance (ANOVA). To illustrate the improvement directly, we compare the results of 3DNNV with Baseline (GS) in terms of TPR, F1, and AUC. Boldface indicates a statistically significant improvement (p-value < 0.05 and mean difference > 0).

D. 3DNNV's EXTENSIBILITY
To demonstrate 3DNNV's extensibility, we fine-tuned the model on other anatomical sites.
Data from 8 lung-region patients and 5 prostate-region patients were selected to fine-tune the model: we used the parameters of 3DNNV pre-trained on the 28 head-and-neck OAR data for initialization; we then froze all parameters except those in the fourth residual block (''Res4'' in Fig. 3) and the fully-connected layer, and we set the learning rate to 1e-5 for the trainable layers. Next, we tested the fine-tuned model on data from 29 lung-region patients and 28 prostate-region patients. All other training settings were the same as for 3DNNV.
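The freezing step can be sketched as selecting which named parameters remain trainable. The prefix strings "res4" and "fc" below are placeholders for whatever names the real model uses; in a PyTorch implementation, named_params would come from model.named_parameters() and the kept parameters would go to the optimizer at lr=1e-5.

```python
def trainable_params(named_params, unfrozen_prefixes=("res4", "fc")):
    """Select the parameters that stay trainable during fine-tuning.

    named_params: iterable of (name, parameter) pairs. Everything is
    frozen except parameters whose name starts with one of the given
    prefixes (here: the fourth residual block and the fully-connected
    layer, matching the setup described in the text).
    """
    kept = []
    for name, p in named_params:
        if any(name.startswith(pref) for pref in unfrozen_prefixes):
            kept.append((name, p))  # these get passed to the optimizer
    return kept
```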
The experimental results are shown in Table 7. With only a small amount of data, the model needed just 20 epochs to transfer to recognizing OARs in other anatomical regions, and it achieved good recognition accuracy. This means that, with very little data and time, we can easily transfer the pre-trained model to a target anatomical site to meet the needs of a new application.

V. DISCUSSION
A. EFFECTIVENESS
As mentioned before, 3DNNV identifies small-volume and error-prone OARs better than all the other deep learning-based models we investigated. To some degree, the sampling/voting strategies applied in TG263-Net and our framework are similar: generate many inputs for a sample, and vote for a final result at the inference phase. Here, we try to explain why 3DNNV works for error-prone OARs and compare it with TG263-Net. The 256-d vectors (outputs of the network) for error-prone OARs are visualized in Fig. 7. Several small-volume OARs in the head-and-neck region are often poorly delineated and imbalanced, and some of them are similar in location and shape, such as Pituitary and OpticChiasm. Fig. 7 (a) and Fig. 7 (b) show the predictive results of TG263-Net on small-volume and error-prone OARs; apparently, most of these OARs are hard to identify without the voting strategy. However, even after applying the voting strategy, OARs with similar shapes/locations/sizes and imbalanced training samples still tend to be confused, like Pituitary and OpticChiasm. 3DNNV's improved data processing strategy and optimized feature extraction structure solve this problem, as shown in Fig. 7 (c) and Fig. 7 (d): clear boundaries between the different categories allow easier classification, yielding more reliable and credible results.
FIGURE 7. Visualization of the predictive results on the test dataset (HN_UTSW). To show the performance of 3DNNV, we compared it with TG263-Net [20]. For each category of small-volume OARs shown in the top-right legend, 9 samples were selected from dataset HN_UTSW and fed into the networks to extract high-level features (256-d vectors). Then, we reduced the dimensionality of the high-level features using Principal Component Analysis (PCA) [53], as illustrated in the figure. We highlight the results of classifying Pituitary and OpticChiasm.
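The dimensionality reduction behind the Fig. 7 visualization can be sketched as a plain PCA projection of the 256-d feature vectors via SVD. This is a minimal version for illustration; the original analysis may have used a library implementation of PCA [53].

```python
import numpy as np

def pca_project(X, k=2):
    """Project row vectors (e.g. the 256-d network features, one per
    sample) onto their top-k principal components for visualization."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # (n_samples, k)
```

Plotting the resulting 2-D coordinates, colored by OAR category, reproduces the kind of scatter shown in Fig. 7.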

B. STATISTICAL SIGNIFICANCE OF THE PERFORMANCE IMPROVEMENT
To illustrate the statistical significance of 3DNNV's performance improvement over the Baseline (GS) model, we performed a one-way analysis of variance (ANOVA) test on the results of the Baseline (GS) and 3DNNV models over all 28 categories of OARs in the head-and-neck datasets. The mean difference denotes the difference between the average TPR/F1/AUC values over the six sets of models tested on the datasets (Table 8). A positive mean difference indicates that 3DNNV performed better than Baseline (GS); a negative one indicates that it performed worse. We set the p-value to 1.0 for samples whose variance was identical between Baseline (GS) and 3DNNV. We found that 3DNNV significantly outperformed Baseline (GS) (p-value < 0.05, mean difference > 0) in identifying small-volume and error-prone OARs, especially in HN_UTSW (Table 8).
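For reference, the F statistic underlying a one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below is a generic textbook computation, not the exact analysis pipeline used for Table 8; in practice a library routine such as scipy.stats.f_oneway would also supply the p-value.

```python
import numpy as np

def one_way_anova_F(groups):
    """One-way ANOVA F statistic for a list of 1-D sample arrays
    (e.g. per-category metric values for Baseline (GS) vs. 3DNNV)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = sum(len(g) for g in groups)          # total sample count
    k = len(groups)                          # number of groups
    grand = np.concatenate(groups).mean()    # grand mean
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (n - k)          # within-group mean square
    return ms_between / ms_within
```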

C. LIMITATIONS 1) RUNNING TIME
To reduce the running time and improve the performance of 3DNNV, we added an early-match module (Fig. 8) to the framework and maintained a locally standardized label dictionary. The early-match module performs string matching [54] between the original label and the standardized label: if and only if the original label fully matches one of the standardized labels in the dictionary, then the standardized label is used to rename the given structure. This reduces the number of structures to be processed by 3DNNV and allows the framework to process unknown structures not included in the training dataset.
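A minimal sketch of the early-match lookup, assuming the locally maintained dictionary maps known original labels one-to-one to standardized names. The sample entries below are illustrative, not the actual dictionary.

```python
def early_match(original_label, std_dictionary):
    """Early-match module: rename only on a full, exact string match.

    Returns the standardized label if the original label matches a
    dictionary entry exactly; returns None otherwise, in which case
    the structure falls through to the 3DNNV classifier.
    """
    return std_dictionary.get(original_label)

# Illustrative dictionary entries (not the paper's actual dictionary).
std_dict = {"SpinalCord": "SpinalCord", "Rt Parotid": "Parotid_R"}
```

Because the match must be exact, every spelling variant needs its own entry, which is the one-to-one mapping limitation discussed below.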
Originally, 3DNNV was used to process data from one patient containing 38 structures. A running time of 7 m 41.83 s was required to obtain all the recognition results, which is too long to be acceptable for further applications. To solve this problem, we added the early-match module before feeding the input into 3DNNV. This module relies on a pre-stored dictionary as the basis for string matching. After adding the early-match module to the framework, 3DNNV only needed to process 17 OARs for this patient (out of the 38 structures), so the total running time fell to 3 m 36.05 s. Timely updates and maintenance of the dictionary will help to optimize the automatic identification process and avoid reprocessing labels that have already been standardized. However, the dictionary can only handle one-to-one mappings. Given RT data collected from a multi-language environment, dictionary mapping will not significantly reduce the running time of standardization. This is why a single dictionary-mapping method cannot handle cross-institutional data.

2) MULTIPLE LABELS FOR THE SAME STRUCTURE
The original 3DNNV model was trained and tested on only 28 OARs in head-and-neck datasets, which limits the model's recognition range to these 28 categories. To make the model generalizable to more structures, we extended it to other anatomical sites, and it worked well. However, like Schuler et al. [8], we found that the model cannot distinguish typographic name variations from fundamental semantic differences in the same structure. In this work, we mainly discuss standardizing OAR labels, but in practice, the structures in individual RT data will be labeled differently for different treatment purposes. For example, the same structure might be labeled CavityOral_avoid or CavityOral; SpinalCord or SpinalCord_5mm; or IL_Parotid, CL_Parotid, Parotid_L, or Parotid_R, depending on the specific application for which the labels are being used. These inputs have similar semantic features in images, so it is very difficult to distinguish such structures based on image information alone. At the same time, some non-target structures will carry multi-level labels for a single OAR, such as Musc_Constrict_M, Musc_Constrict_S, Musc_Constrict_I, and Musc_Constrict, or OpticChiasm_aaa and OpticChiasm_bbb (where aaa is the resident's name and bbb is the attending physician's name), depending on the RT plan and local policies. These labeling conventions may vary across medical institutions and treatment plans. The standardization of target volumes also warrants attention. The target volume often overlaps with OARs and could be misidentified as an OAR. Additional information can be used to help identify target volumes, such as positron emission tomography (PET), which is widely used in clinical practice and can accurately define the biological target volume (BTV). Utilizing the BTV and gross tumor volume (GTV) will improve the accuracy of identifying the clinical target volume (CTV) [55], [56].
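To illustrate why such suffixes are hard to handle, a naive rule that strips one trailing suffix level recovers a candidate base name but also destroys semantically meaningful suffixes such as laterality (Parotid_L becomes Parotid). This is purely illustrative of the problem; no such rule is part of 3DNNV.

```python
import re

def base_label(label):
    """Strip one trailing underscore-delimited suffix level
    (e.g. _M, _avoid, _5mm, or an appended name) from a label.
    Illustrative only: conventions vary across institutions, and
    this rule cannot tell typographic variants from semantic ones."""
    return re.sub(r"_[^_]+$", "", label) if "_" in label else label
```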
Adding text information may also help us to improve the performance of 3DNNV and meet the requirements of clinical applications.

3) OUTLIERS
In previous experiments, we found that the masks collected from different clinical centers may have inconsistent contours. These inconsistencies result from differences in physician experience and in how the local institution defines delineation for OARs. Moreover, there are outliers in many datasets: some lack masks in some slices; in other cases, the label does not always match the contour in the mask because of inaccurate delineation or partial depiction. We believe that detecting delineation outliers also presents a challenge to standardizing nomenclature for RT data.

VI. CONCLUSIONS
In this paper, we propose a novel framework, 3DNNV, that combines an ASAC/Voting strategy with a non-local network to integrate clinicians' domain knowledge and recognition mechanisms into a deep learning architecture. To the best of our knowledge, our work is the first to propose an architecture that integrates domain knowledge to solve the recognition problems caused by imbalanced and poorly delineated data. Our model achieved a significantly higher average true positive rate than the baseline model across the three test datasets (+8.27%, +2.39%, and +5.53%, respectively). More importantly, with only 9 training samples, our model outperformed the baseline in terms of the F1 score for Pituitary (91.17% vs. 28.63%) when tested on the HN_UTSW dataset.
We visualized the vectors of our predictive results to evaluate the effectiveness of 3DNNV. One-way ANOVA tests showed the statistical significance of 3DNNV's performance improvement over Baseline (GS). Finally, we discussed limitations of the model that could impede application, and we suggested future work for automatically standardizing anatomical structure nomenclature in radiotherapy.
Our findings in this work will advance efforts to automate the standardization of organ labels in DICOM RT data, which will facilitate and improve data-driven research.
HONGYANG CHAO (Member, IEEE) received the B.S. and Ph.D. degrees in computational mathematics from Sun Yat-sen University, Guangzhou, China. In 1988, she joined the Department of Computer Science, Sun Yat-sen University, where she was initially an Assistant Professor and later became an Associate Professor. She visited the University of Texas Southwestern Medical Center, Dallas, TX, USA, from September 2018 to August 2019. She is currently a Full Professor with the School of Data and Computer Science. She has published extensively in the area of image/video processing and holds three U.S. patents and four Chinese patents in the related area. Her current research interests include image and video processing, image and video compression, massive multimedia data analysis, and content-based image (video) retrieval.