Development of the Osteosarcoma Lung Nodules Detection Model Based on SSD-VGG16 and Competency Comparing With Traditional Method

Osteosarcoma nodules that have metastasized to a patient's lungs are difficult to detect because the disease is rare and few cases are available. Traditionally, radiologists find lung nodules manually by reading CT-scanned images. As a result, the reported error rate for reading lung metastasis nodules ranges from 29 to 42 percent, whereas the acceptable error rate should be less than 29 percent. Advanced computer-aided techniques such as image processing and machine learning can help doctors identify Osteosarcoma lung nodules more easily and more accurately. Convolutional Neural Networks (CNNs) are promising because they can be trained with the knowledge of experienced radiologists. Nodule location and size, which are critical for treatment, can be obtained with object detection CNN models. In this research, the Single Shot Detection (SSD) framework combined with the VGG16 backbone, SSD-VGG16, was implemented to obtain bounding box locations and sizes, where each box represents one Osteosarcoma nodule with a confidence score. SSD-VGG16 was selected for its superior performance. A dataset of CT-scanned images collected from 202 patient cases was provided by Lerdsin hospital and used for training and validating the SSD-VGG16 model. The model was trained with two loss functions, the class confidence loss and the location loss. The trained model was then tested on unseen CT-scanned images, the performance scores were calculated, and the results were analyzed. SSD-VGG16 shows the ability to detect and locate the nodules efficiently and has a lower error than the traditional method.


FIGURE 1. Overview of the process.
read and identify the nodules from the CT-scanned image of the patient. CAD can be categorized into image processing and machine learning approaches. An image processing approach uses a static mathematical model to enhance the image, extract features, and mathematically categorize the results, while a machine learning approach uses a dynamic mathematical model that can be trained with human intelligence and experience to mimic human decisions in specific situations. Convolutional Neural Networks (CNNs) [25], [26], [28]-[30] are among the most popular machine learning techniques, especially for image-related applications.
This research received its requirements directly from the medical team, who want to begin applying artificial intelligence (AI) innovations based on object detection CNNs to the public health system of Thailand. The object detection method was therefore selected as the initial AI innovation for helping doctors diagnose bone cancer patients, because it is simple and could be extended to other areas in the future. Unfortunately, deep integration with the field of medicine could not be realized in this research due to a lack of medical data. Most Thai hospitals have only the patients' CT scans, while other useful medical records, such as the onset time of the disease and the patient's family medical history, are missing. Such medical data could be combined with the object detection CNNs to increase performance. Osteosarcoma is a rare disease, but it was selected for this research because it is a good starting point for applying AI to Thailand's public health system: the dataset is small but controllable and manageable. The main purpose of this research is therefore to aid clinicians in the detection of suspected lung osteosarcoma tumors. Compared with typical manual techniques, the goals are to avoid misreading by radiologists, to reduce errors related to doctor fatigue, and to reduce both the workload of radiologists and the time needed to read the CT-scan results, so that patients receive treatment earlier.
Object detection CNNs are widely used in many applications such as face or human detection. Region-CNNs (R-CNNs) [31], [32], You Only Look Once (YOLO) [33] and Single Shot Detection (SSD) [34]-[36] are the most popular frameworks for object detection networks because they can pinpoint the locations of multiple objects of interest within an image and classify the found objects with confidence scores. R-CNNs find the objects of interest by searching the image with bounding boxes of different sizes and estimating the probability of the class that represents each search box. Because R-CNNs have to search for objects with all possible boxes, the calculation time is very large and impractical for computers with low computing power, such as an ordinary computer in a government hospital. YOLO and SSD use a different method: a hierarchy, or pyramid, approach finds the objects of interest in less time. Normally, the output image size is reduced as images pass through the CNN layers. Therefore, feature maps of several sizes taken from different CNN layers, combined with fixed-size boxes, can be regarded as processing the image with all possible bounding boxes. Thus, the calculation time for both YOLO and SSD is small and practical for ordinary computer hardware. In this research, the SSD framework with the VGG16 backbone structure, called SSD-VGG16 [37]-[40], was used to construct the Osteosarcoma lung nodule detector. The lung nodule image dataset was collected from Lerdsin hospital, with ground truth labeled by radiologists and oncologists. The network was then trained, and its performance was evaluated. An overview of the research process is shown in Figure 1. The traditional method is explained first in this manuscript. The proposed SSD-VGG16 and the data preparation are discussed in the next sections. The experiments and results are then discussed.
Finally, the conclusion and future work are presented in the last section.

II. TRADITIONAL DIAGNOSTIC METHOD
Osteosarcoma screening is the process of determining the possibility of sickness or anomalies, such as the spread of bone cancer cells to the lungs, using medical procedures, clinical examination tools, laboratory testing, or other tests. The sickness is not diagnosed during the screening procedure; the purpose is to detect illnesses or problems in the early stages of development. The traditional bone cancer screening procedure comprises the following steps. The first step is to examine the medical history and conduct a physical examination. The second step uses computed tomography equipment to take images of the organs to be screened. A laparoscopy is performed in the third step. The fourth step is a biopsy in the laboratory. The last step is an examination of cancer cells in the laboratory.
According to the treatment guideline of the Orthopedic practice standard protocol in Thailand, an Osteosarcoma assessment must be conducted by a multidisciplinary team of medical specialists consisting of a musculoskeletal oncologist, a bone pathologist, a medical oncologist, a radiation oncologist and a musculoskeletal radiologist. To determine the cancer stage of a patient with Osteosarcoma, the following medical processes must be conducted: (1) history taking and physical examination, (2) magnetic resonance imaging (MRI) or computed tomography (CT-scan) at the site of the lesion, (3) radiographic examination (X-ray) and computed tomography (CT-scan) of the lungs, and (4) bone scanning. For a patient with suspected metastasis, MRI or CT-scan is the most efficient procedure for confirming the metastasis, because Osteosarcoma usually metastasizes to the lungs. Therefore, MRI and CT-scan reading is the most important step for the doctor in determining the stage of the Osteosarcoma. However, the MRI and CT-scan reading process in Thailand is a bottleneck, and results take a long time because experienced radiologists are scarce. This is a major problem for the standard practice protocol in Thailand: the medical team must wait for the diagnostic results to be analyzed before making decisions about patients. Applying AI to the Orthopedic medical field could help speed up the diagnostic process, because the diagnosis and treatment options depend on results that are read by a specialized radiologist. Moreover, some small provincial hospitals in Thailand do not have a specialized radiologist, so the adoption of AI can be very beneficial for performing a preliminary screening before referring the patient to a doctor. For these reasons, this research was initiated.
The traditional Osteosarcoma screening process is easier than other bone cancer screening processes because radiographic imaging is helpful. With a computed tomography scan (CT-scan), bone malignancies can be identified by the human eye, and a pulmonary CT-scan can reveal whether bone cancer has spread to the lungs. If suspected nodules are found in the lungs, the radiologist submits the information to the oncologist so that they can arrange the necessary joint therapy, including deciding on the best type of surgery. Unfortunately, according to previous research, the error rate for reading nodules in the lung metastasis group is between 29% and 42%, as shown in Table 1. These missed nodules occurred because radiologists failed to identify them when reading CT-scanned images; a missed nodule was detected when the preoperative CT result was compared with intraoperative palpation. In the worst case, the missed nodules were 36% of all nodules from 215 patients, where 52% of the missed nodules were benign and 48% were malignant. Osteosarcomas were the most common indication for surgery. These missed nodules can be defined as False Negative outcomes, which are the focal point of this research.
In addition, for the traditional medical methods presented in Table 1, the accuracy in the previous research was calculated based on the number of patients, not the number of nodules in the image dataset; that is, the accuracy was not calculated by considering all nodules. If the proposed model in this research is evaluated with the same patient-based approach, its accuracy is 75.97% compared with 71%, which is much better than the traditional methods.
This problem occurs because the traditional method relies on manual reading of CT-scanned images to find the nodules, which involves human performance factors. According to Table 1, the maximum rate of correctly identified nodules in the traditional-method studies was 71%, whereas the others were lower. Thus, if the acceptable correctness is 71%, only one traditional-method study meets the requirement while the rest fail.
According to Table 1, the previous studies also had small sample sizes because Osteosarcoma is rare even in countries with a large population. As a result, the previous research found that radiologists predicted cancer spread to the lungs incorrectly.
Today, radiographs are a basic tool used in almost every hospital in Thailand, including small community hospitals. However, those hospitals lack specialized orthopedic doctors or radiologists who can accurately read CT scans, especially CT scans of bone cancer. This can cause delays and errors in diagnosis, which could lead to complications and the death of the patient. This study therefore serves as a starting point for meeting the immediate need for osteosarcoma screening in Thailand and aims to help alleviate such problems.

III. SSD-VGG16
Because this study focuses on a rare disease with little prior research, it needs to serve as a baseline for further development. The researchers chose a model that is stable and has enough previous research on other diseases to give the medical team confidence in the reliability of the results, rather than proposing a new cutting-edge model that is not yet stable and has few practical results. The medical team can then approve the model with confidence for application to rare diseases with high mortality rates, and the results will be reliable enough to be practical in clinical practice. In previous work, the authors conducted experiments comparing many frameworks and found that SSD-VGG16 is suitable for this Osteosarcoma CT-image dataset. In that study, the authors compared the performance of VGG-16, ResNet-50, and MobileNet-V2 as the backbone of SSD for Osteosarcoma images. VGG16 was the most robust, and SSD is well suited to object detection with small computational resources, high accuracy and fast speed. Computing power is the biggest constraint for Thai hospitals in implementing AI in the hospital health care system, since the smallest hospitals cannot afford expensive computers or servers. The authors also conducted additional studies on hybrid architectures using Vision Transformers and Convolutional Neural Networks. The Vision Transformer is based on a self-attention mechanism that uses information on how pixels are connected and related to one another regardless of their absolute position in the image. Although Vision Transformers have proven to be good substitutes for CNNs, one major limitation makes their implementation difficult: the requirement for enormous datasets. In contrast, owing to their inductive biases, CNNs can learn even from a very limited amount of input, which suits the limited dataset of Osteosarcoma cases.
Because of these criteria, Single Shot Detection with VGG16 was selected. Single Shot Detection (SSD) [34] was chosen as the object detection framework for the lung nodule detector in this research because of its performance. SSD is a framework rather than a complete network: it requires an image classification CNN, called a backbone, in order to perform the object detection task. VGG16 [41]-[43], MobileNetV2 [44]-[47] and ResNet50 [48]-[50] can be used as the SSD backbone. The VGG16 backbone was the original implementation of the SSD network published by Google Inc. In the original implementation, SSD-VGG16 [51] takes a 300 × 300 pixel image as input and outputs the bounding boxes and the class confidence score of each box. The SSD-VGG16 structure is constructed from a reduced VGG16, which is the regular VGG16 structure without the fully connected layers, followed by extra feature layers. The hierarchical multi-scale feature maps used for detection are built from the outputs of several SSD-VGG16 layers. Unlike the original network, the input image in this research must be 512 × 512 pixels because the CT-scanned images from DICOM are 512 × 512 pixels. Thus, the input image layer was modified to fit the DICOM image size. Moreover, one extra feature layer was added to the SSD-VGG16; the new structure is shown in Figure 2. By feeding feature maps of different sizes from many layers into the detection and non-maximum suppression algorithms, the bounding boxes of the regions of interest in the image are obtained along with the class confidence score of each box. The outputs of the proposed SSD-VGG16 are the bounding boxes that may contain an Osteosarcoma nodule and the confidence score of each box.
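The non-maximum suppression step mentioned above can be illustrated with a minimal sketch. It assumes boxes in (x1, y1, x2, y2) pixel format and an overlap threshold of 0.45; the function names `iou` and `nms`, the threshold, and the sample boxes are illustrative assumptions and are not taken from the implementation used in this research.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression.

    Visits boxes from highest to lowest score and keeps a box only if it
    does not overlap an already kept box by more than iou_threshold.
    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical candidate boxes: the first two cover the same nodule.
boxes = [(100, 100, 140, 140), (102, 98, 142, 138), (300, 310, 320, 330)]
scores = [0.94, 0.60, 0.27]
kept = nms(boxes, scores)  # the lower-scoring duplicate (index 1) is dropped
```

In this toy case the two overlapping candidates collapse into the single higher-scoring box, while the distant low-confidence box survives and is left for the confidence threshold to handle.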
The SSD-VGG16 network is trained with a loss function that combines the bounding box location loss ($L_{loc}$) and the confidence score loss ($L_{conf}$) of each bounding box, as shown in Eq (1):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{1}$$

where $N$ is the number of matched boxes and $\alpha$ is the balance coefficient between the classification loss and the localization loss. Let $x_{ik}^{p}$ be an indicator for matching between the $i$-th predicted bounding box $l_i$ and the $k$-th ground truth box $g_k$ of class $p$, determined by the Intersection over Union (IoU).

For this study, only the specific type of bone cancer that has invaded the lung was considered, as required by the research; therefore, a broader group of statistics cannot be used. Under the ethical conditions, the international standard ethics protocol and the agreement with the participating hospital, this study cannot combine information from other hospitals or open sources with the participating hospital's patient dataset. The participating hospital is the major public hospital specialized in cancers and has the largest Osteosarcoma dataset in the country.
At the beginning, the authors attempted to use AI techniques such as GANs to increase the amount of data in the dataset. However, the generated images were deemed medically not to be actual patient information. As a result, the medical team questioned the properties of the image data generated by GANs, including the number, features and locations of the nodules. This raised doubts about the lung anatomy in the posteroanterior (PA) and lateral views: the locations of the nodules in the right upper lobe, right middle lobe, right lower lobe, left upper lobe and left lower lobe in the generated CT images cannot be used like those in a real patient's CT-scan image because they are not genuine. After discussion, the medical experts preferred real patient CT-scan data over generated data. Thus, only real patient CT-scan images were used in this research.
According to the research, the worldwide incidence of bone cancer is approximately 0.2 percent of all cancer types. Osteosarcoma is the most common type of bone cancer, yet it accounts for only 28 percent of all bone cancers.
The Thai National Cancer Institute found that the incidence of the disease is about 0.8 per 100,000 people in Thailand, which is low, but it has the highest mortality after leukemia and malignant tumors of the central nervous system. In the case of Osteosarcoma, if multiple metastatic lung nodules from bone cancer are found, the 5-year survival rate is only 5%. According to statistics, osteosarcoma invades the lungs at a rate of 50-75 percent. Even though Osteosarcoma is a very rare cancer, the sample size in this research was carefully calculated. Using the Cochran formula, the minimum sample size was estimated as follows:

$$n_0 = \frac{z^2 \, p \, (1 - p)}{e^2} \tag{5}$$

where $n_0$ is the minimum sample size, $z$ is the standard normal deviate corresponding to the 5% significance level ($z = 1.96$), $p$ is the prevalence in previous studies (50%), and $e$ is the precision, set at 0.07. Substituting these values into Eq (5) gives $n_0 = (1.96^2 \times 0.5 \times 0.5) / 0.07^2 = 0.9604 / 0.0049 = 196$. Thus, in order to screen for osteosarcoma that has invaded the lung, the minimum sample size was 196. The actual data collection consisted of 202 people, which was sufficient for the study.

In this research, the images were pre-processed by image normalization, brightness and contrast correction, noise reduction and background removal, because the CT images from the hospital were sometimes not clean. Original CT-scan images tend to have a low intensity level, which can be corrected by brightness and contrast correction. Moreover, different filters are required to obtain specific information about a human organ; for example, a high-pass filter is used to obtain a clear lung image. The CT images contained undesirable text such as the hospital name and patient information because they were not prepared specifically for deep learning training, so background removal is necessary. Pre-processing is important for improving the quality of the input images, which helps the model learn better, generalize and be robust.
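The pre-processing steps listed above can be sketched in a few lines. This is a minimal illustration on a small list-of-lists image in pure Python; the function names, the gain and bias values, the choice of a 3 × 3 median filter for noise reduction, and the rectangular mask for burned-in text are all assumptions made for the example, since the exact operations and parameters used in this research are not specified. The high-pass filtering step is omitted.

```python
from statistics import median


def normalize(img):
    """Min-max scale pixel values to the range [0, 1]."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in img]
    return [[(p - lo) / (hi - lo) for p in row] for row in img]


def adjust_brightness_contrast(img, gain=1.5, bias=0.1):
    """Linear correction gain * p + bias, clipped to [0, 1] (assumed values)."""
    return [[min(1.0, max(0.0, gain * p + bias)) for p in row] for row in img]


def median_filter3(img):
    """3 x 3 median filter; border pixels reuse the nearest valid pixel."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[min(h - 1, max(0, y + dy))][min(w - 1, max(0, x + dx))]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


def mask_region(img, x1, y1, x2, y2, fill=0.0):
    """Blank a rectangle, e.g. burned-in hospital or patient text."""
    return [
        [fill if (y1 <= y < y2 and x1 <= x < x2) else p
         for x, p in enumerate(row)]
        for y, row in enumerate(img)
    ]


# Hypothetical 4 x 4 image with one isolated noise spike at (1, 1).
toy = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 50],
]
norm = normalize(toy)
denoised = median_filter3(norm)          # the isolated spike is removed
bright = adjust_brightness_contrast(denoised)
clean = mask_region(bright, 0, 0, 2, 1)  # blank a hypothetical text area
```

The order of the steps here (normalize, denoise, correct brightness and contrast, mask) is one plausible arrangement, not a statement of the order used in the study.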
To confirm the correctness of the ground truths, one radiologist and two oncology specialists, three individuals in total, read the CT-scanned images and identified abnormal nodules in the images separately. If at least two of the three reached the same result, the nodules were labeled as Osteosarcoma metastatic disease nodules. The labeled CT-scanned nodule images are shown in Figure 3. The image dataset was separated into training images and validation images. There were 1,769 training images and 443 validation images, which are 80% and 20% of all CT-scanned images respectively. There is one label in this dataset, the Osteosarcoma nodule.
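The two-out-of-three labeling rule and the 80%/20% split can be sketched as follows. The function names, the fixed random seed and the image-level shuffling are assumptions for illustration; the text does not state how the images were assigned to the two subsets.

```python
import random


def majority_label(votes, min_agree=2):
    """Return True when at least min_agree readers marked a nodule."""
    return sum(bool(v) for v in votes) >= min_agree


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and cut into training and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed value is an arbitrary assumption
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Reader votes in the order (radiologist, oncologist 1, oncologist 2).
accepted = majority_label([True, True, False])   # two of three agree
rejected = majority_label([True, False, False])  # only one reader

# 1,769 + 443 = 2,212 images in total, as reported above.
train, val = split_dataset(range(2212))
```

With 2,212 images, truncating 80% of the total reproduces the reported counts of 1,769 training and 443 validation images.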

V. EXPERIMENTS AND RESULTS
According to the traditional screening method, accuracy should be calculated per patient, because the medical field focuses on each patient rather than on each image individually: even if only one nodule is correctly detected in a patient, the patient counts as a positive case. In this section, however, the standard engineering method is used for a deeper quantitative and qualitative analysis. The performance scores, including accuracy, are calculated per image. With the per-image approach the accuracy may drop, but every detail in every image is considered.

First, the SSD-VGG16 network was trained on the CT-scanned image dataset for 9,600 epochs. The confidence score loss and the location loss decreased as the number of training epochs increased, as shown in Figure 4. The decreasing location loss implies that the bounding box locations and sizes became closer to the ground truth boxes. The decreasing confidence score loss implies that each bounding box predicts the correct object class. After training, the SSD-VGG16 network was validated on 443 unseen CT-scanned images. The results show that the trained SSD-VGG16 network can detect nodules in the CT-scanned images, as shown in Figure 5. The location of each nodule was correctly obtained, and the bounding box sizes differed slightly from the ground truth box sizes.
To evaluate the performance of the SSD-VGG16, the prediction errors were considered. There are two kinds of error in the predicted results. The first occurs when the network predicts a nodule, but the ground truth shows no nodule at that location. This is defined as a False Positive (FP), as shown in Figure 6. The second is the opposite: the ground truth shows a nodule, but the network does not detect it. This is called a False Negative (FN), as shown in Figure 7.
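The two error types, together with the correct detections, can be counted by matching predicted boxes to ground truth boxes. The following is a minimal sketch, assuming boxes in (x1, y1, x2, y2) pixel format, predictions already sorted by descending confidence, and an IoU matching threshold of 0.5; the function names and the threshold are illustrative assumptions, not taken from the evaluation code of this research.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground truth.

    A prediction that overlaps a still-unmatched ground truth box by at
    least iou_threshold is a true positive; any other prediction is a
    false positive; ground truth boxes left unmatched are false negatives.
    Returns (tp, fp, fn).
    """
    unmatched = list(range(len(gt_boxes)))
    tp = fp = 0
    for p in pred_boxes:
        best, best_iou = None, 0.0
        for g in unmatched:
            overlap = iou(p, gt_boxes[g])
            if overlap > best_iou:
                best, best_iou = g, overlap
        if best is not None and best_iou >= iou_threshold:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    return tp, fp, len(unmatched)


# Hypothetical image: two labeled nodules, two predicted boxes.
gt = [(100, 100, 140, 140), (200, 200, 220, 220)]
pred = [(102, 98, 142, 138), (400, 400, 420, 420)]
tp, fp, fn = match_detections(pred, gt)
```

In this example one prediction matches the first nodule, the second prediction lies on empty tissue (a false positive), and the second nodule is missed (a false negative).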
A correctly predicted nodule is considered a True Positive (TP) result. To calculate the performance scores, the True Negative (TN) value must also be obtained. The evaluation images originally did not contain non-nodule CT-scanned images; therefore, non-nodule CT-scanned images were added to the evaluation images to obtain the FP value. A non-nodule CT-scanned image is called a no-class image. Since there is no standard procedure for adding no-class images to test an object detection network, in this research one non-nodule image was considered as one object, and the number of non-nodule images added to the evaluation equaled the number of TPs found in the original evaluation images. The common performance scores are the F1-score and the accuracy. The F1-score was calculated using Eq (6), with Recall and Precision calculated by Eq (7) and Eq (8) respectively, while the accuracy was calculated from Eq (9):

$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{6}$$

$$Recall = \frac{TP}{TP + FN} \tag{7}$$

$$Precision = \frac{TP}{TP + FP} \tag{8}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

In the first evaluation, the object detection threshold was set to 0.2, which means that a bounding box is selected as an output box if its confidence is greater than 0.2 (20%). The result of the first evaluation is displayed in Table 2, which shows the values of TP, TN, FP, FN, accuracy, F1-score, precision, recall, sensitivity and specificity from the experiment. In the experimental result, the False Negative rate is 13%, which is high in terms of epidemiology. A high False Negative rate (FN rate) may cause problems for the treatment strategy, for example in infection or cancer cases. If the model erroneously predicts that no cancer is detected, the patient will be released from the hospital, which makes long-term treatment difficult. Furthermore, if a model used to predict an outbreak has a high FN rate, it may lead to widespread infection.
The high FN rate can be improved by lowering the threshold, which increases the recall but also the False Positive rate. Therefore, reducing the threshold requires a trade-off between the model's performance, the recall and the acceptable False Positive rate.
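The scores referred to in Eq (6) to Eq (9) can be computed directly from the four counts. The sketch below uses the standard definitions; the function name and the sample counts are hypothetical and are not the values reported in Table 2.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score, accuracy and specificity.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "precision": precision,
        "recall": recall,  # recall is the same quantity as sensitivity
        "f1": f1,
        "accuracy": accuracy,
        "specificity": specificity,
    }


# Hypothetical counts for illustration only.
scores = detection_metrics(tp=80, tn=80, fp=20, fn=20)
```

Because recall depends only on TP and FN while precision depends on TP and FP, lowering the detection threshold tends to move the two in opposite directions, which is the trade-off described above.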
The predicted confidence score expresses, as a percentage, how confident the model is that there is a nodule inside the bounding box. A high score means the model is highly confident that the box contains a nodule. For example, a confidence score of 0.1 means 10% confidence that there is a nodule inside the bounding box. The threshold is the minimum acceptable confidence score. If the confidence score of a bounding box is greater than or equal to the threshold, the box is part of the model output and is displayed as an output bounding box that contains a nodule. If the confidence score is less than the threshold, the box is rejected from the model output and no nodule detection is shown for it. For example, if the threshold is set to 0.1, the bounding boxes with confidence scores greater than or equal to 10% are the result of the model and are shown in the output. To study the relation between the threshold and the FN rate, new thresholds of 0.3, 0.15, 0.10 and 0.8 were tested. The results are shown in Table 2, which lists the TP, TN, FP, FN, accuracy, F1-score, precision, recall, sensitivity and specificity from the experiment.
The results show that when the threshold is decreased, the FN value is reduced, but the accuracy is also reduced because the FP value increases. In this research, the best threshold value is 0.2 because it produces the highest accuracy of 75.97%, or an error rate of 24.03%, which is lower than the acceptable error rate of 29%. The 0.2 threshold gives the best accuracy because, when the threshold is increased to 0.3, the TP value decreases: bounding boxes with a confidence score lower than 0.3 are rejected from the output. Moreover, with the 0.3 threshold the FN value also increases, because bounding boxes that contain nodules but have a confidence score below 0.3 are treated as non-nodule bounding boxes. For these reasons, the accuracy at a threshold of 0.3 is lower than the accuracy at a threshold of 0.2. When the threshold is reduced from 0.2 to 0.15, 0.10 and 0.8, the results show that the FN value is reduced but the FP value increases, resulting in lower accuracy. The threshold must therefore be tuned carefully, since it is a trade-off between the model's performance, the recall and the FN rate.
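The effect of the threshold on TP, FP and FN can be illustrated with a small sweep. The detections below are hypothetical (score, is-a-real-nodule) pairs invented for the example, with six ground truth nodules of which one is never detected at any score; they are not data from this research, and the function name is an assumption.

```python
def sweep_thresholds(detections, n_ground_truth, thresholds):
    """Count TP, FP and FN for each confidence threshold.

    detections: list of (confidence, is_true_nodule) pairs, one per
    candidate box. A box is kept when confidence >= threshold.
    Returns {threshold: (tp, fp, fn)}.
    """
    results = {}
    for thr in thresholds:
        kept = [is_true for score, is_true in detections if score >= thr]
        tp = sum(1 for is_true in kept if is_true)
        fp = len(kept) - tp
        fn = n_ground_truth - tp
        results[thr] = (tp, fp, fn)
    return results


# Hypothetical candidate boxes for illustration only.
detections = [
    (0.94, True), (0.81, True), (0.35, True), (0.27, False),
    (0.22, True), (0.17, False), (0.12, True), (0.09, False),
]
results = sweep_thresholds(detections, n_ground_truth=6,
                           thresholds=[0.3, 0.2, 0.15, 0.10])
```

In this toy sweep the number of missed nodules falls from three to one as the threshold drops from 0.3 to 0.10, while the number of false boxes rises from zero to two, mirroring the trade-off discussed above.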
The result shows a better prediction accuracy than the traditional-method studies in Table 1. Furthermore, the output of the SSD-VGG16 includes a confidence score for each bounding box, and the FP (false) bounding boxes had very low confidence scores while most of the TP (true) bounding boxes had very high confidence scores, as shown in Figure 8. In Figure 8, the confidence score of the TP is 94% while the confidence score of the FP is 27.3%.
The results also show that the SSD-VGG16 still has a high FN value, which means that the network fails to find some nodules. This issue was investigated further by examining each CT-scanned image carefully. The investigation showed that it occurs when the nodules are very tiny, very large, or irregularly shaped, and when the images are blurred, as shown in Figure 9.

VI. CONCLUSION
This research focuses on the detection of Osteosarcoma lung nodules, which could lead to cancer. Normally, computed tomography (CT) scan images in DICOM format are used to detect lung nodules, but because the disease is rare, detection is very hard even for experienced radiologists. As a result, the error rate for reading nodules in the lung metastasis group is between 29% and 42% when the traditional method, manual reading by radiologists and oncologists, is used.
Computer-aided diagnosis (CAD) was proposed to help radiologists and oncologists detect nodules in the CT-scanned images of patients. The 512 × 512 SSD-VGG16 model was proposed as a CNN-based Osteosarcoma nodule detector. SSD-VGG16 consists of the Single Shot Detection framework and the VGG16 image classification CNN. The inputs of the SSD-VGG16 are CT-scanned images converted from DICOM files, and the outputs are the locations of the bounding boxes that contain nodules and the confidence score of each box. The dataset was prepared from the DICOM files of 202 patients from Lerdsin hospital. These images were converted to PNG format and separated into a training dataset and a validation dataset with a ratio of 80% to 20% of all images. SSD-VGG16 was trained and then tested on the unseen validation images. The experimental results show that the trained SSD-VGG16 can detect the nodules. The accuracy and F1-score were calculated, and they depend on the threshold. Different thresholds were tested to observe their effect on the TP, TN, FP and FN values. The FN rate can be reduced by lowering the threshold, but this also decreases the accuracy. The experiments showed that the proposed SSD-VGG16 detects true nodules with high confidence scores, while false nodules have low confidence scores. The maximum accuracy of the proposed SSD-VGG16 exceeds the maximum accuracy of the traditional method by 4.97 percentage points.
However, the network showed detection limitations when the nodules were tiny, very large or irregularly shaped, and when the images were blurred. Most of these limitations may be reduced by providing more training data, except for the tiny nodule problem. It is very hard to increase the CT-scanned image size without introducing noise into the image, and the CT-scanned image size in this research is the maximum that the CT-scan machine can provide. To overcome this issue, advanced super-resolution techniques may be considered in a future study.
Regarding a hybrid system to improve performance, Vision Transformers will be the next focus. Because of the data limitations in this research phase, the researchers intend to continue collecting patient data. As additional cases of bone cancer metastasizing to the lungs are gathered, the authors plan experiments on Vision Transformers and hybrid systems to obtain a superior model in the next study.
The main objective of this research is to be the starting point and foundation for the development of an artificial intelligence system for detecting nodules within the lungs for Thailand's emerging public health system. Accordingly, SSD-VGG16 was chosen as a basic model that is simple but functional at this stage, with acceptable performance and accuracy under several constraints, allowing it to be easily adapted to other medical requirements in the future.