HipXNet: Deep Learning Approaches to Detect Aseptic Loosening of Hip Implants using X-Ray Images

Radiographic images are commonly used to detect aseptic loosening of the hip implant in patients who have undergone total hip replacement (THR) surgery. Manual assessment of these images by medical professionals can suffer from low accuracy, poor inter-observer reliability, and delays due to the unavailability of experienced clinicians. This paper therefore proposes HipXNet, a novel stacking approach based on deep convolutional neural networks (DCNNs), for detecting loosening of the hip implant from X-ray images. Two major investigations were carried out in this study. Firstly, the performance of four state-of-the-art YOLOv5 object detection models was evaluated for detecting the implant region in hip X-ray images. Secondly, a stacking classifier was developed using three convolutional neural network (CNN) models to classify aseptic hip loosening, and its performance was compared with that of eight state-of-the-art CNN models. Moreover, one publicly accessible dataset with two sub-sets was created for these two experiments: 200 hip implant X-ray images were collected and annotated by two expert radiologists for implant detection, and 206 hip implant X-ray images were collected for loosening detection. The YOLOv5m model outperformed the other YOLOv5 variants in detecting the implant region, with precision, recall, mean average precision (mAP) at an intersection-over-union threshold of 0.5 (mAP0.5), and mAP0.5-0.95 of 100%, 100%, 100%, and 87.8%, respectively. The Densenet201 CNN model outperformed the other CNN models with accuracy, precision, sensitivity, F1 score, and specificity of 94.66%, 94.66%, 94.66%, 94.66%, and 94.5%, respectively, while the stacking technique with a Random Forest meta-learner produced the best performance, with accuracy, precision, sensitivity, F1 score, and specificity of 96.11%, 96.42%, 96.42%, 96.42%, and 96.74%, respectively, for loosening detection. The reliability of the performance was confirmed by the popular Score-CAM visualization. This study can help in the early and fast identification of hip implant loosening with the help of simple X-ray images and computer-aided diagnosis.


I. INTRODUCTION
It has been reported that about 20% of people older than forty suffer from bone degenerative diseases [1] such as osteoporosis, which drives a worldwide demand for total hip replacement (THR) procedures. Osteoporosis is largely unavoidable in the aged population. It is predicted that people aged 65 or higher are more susceptible and that the rate of disease among them will increase from 8.2% in 2018 to 17.6% in 2060 [2]. Functional failure of the implant may be followed by revision surgery, which is often painful and has a relatively low success rate [1,3]. The lifetime of implants depends on (i) the type of materials used for the implants, (ii) the surgical techniques implemented, (iii) the geometry of the implant, (iv) the patient's physical activity, and (v) the patient's age. Of major concern to implant recipients is the life cycle of implants, which is currently limited to the order of 10-15 years. This relatively short life can be attributed to implant wear, loosening, and misalignment, which often cause pain and discomfort to the patient. Wear and corrosion due to the contact of the implants with other parts and body fluids generate debris. Soluble debris enters the blood and is excreted through urine; particulate debris, however, accumulates in tissues, lymph, and bone marrow. This accumulated debris has short-term and long-term effects such as inflammation and cell tissue damage, hypersensitivity, chromosomal aberration, and toxicity, and both the short-term and the long-term effects result in revision surgery. Fibrous encapsulation due to the non-bonding of implants with surrounding tissues and inflammation (rejection) are the other reasons for implant failure. Osteoporosis, osteoarthritis, and trauma are among the main reasons for joint replacement. Despite the revolution that total hip arthroplasty (THA) brought to arthritis treatment, aseptic (mechanical) loosening always leads to joint failure and THR surgery [4].
Implant failure can be identified from radiolucent changes surrounding the acetabular and femoral implants and from the progression of osteolysis [5]. In a previously asymptomatic total hip arthroplasty, new onset of pain could indicate implant loosening, infection, or both. Aseptic loosening-related pain is often increased by weight-bearing and range of motion, especially internal and external rotation. When loosening happens early after surgery for no obvious reason, infection should be investigated. Fever, chills, and pain at rest are some of the symptoms that may accompany an infection. Unfortunately, aseptic loosening does not always cause discomfort to the patient. Loosening of cemented acetabular components, for example, is symptomatic in only 10% of instances [6]. X-ray or computed tomography (CT) images of the hip area are commonly used by medical experts for detecting aseptic loosening of hip implants, as these can be acquired easily and readily. The foundation of radiological evaluation of aseptic loosening is the visual detection of radiolucent areas around the bone-cement or bone-prosthesis interface [7,8]. Even in well-fixed implants immediately after surgery, thin sclerotic lines can be visible, which might suggest that the prosthesis is loose when in reality it is not [8]. As a result, loosening is determined by the extent of radiolucent zones surrounding the implant and the change (progression) in their appearance over time [9,10]. This results in frequent hospital visits and prolonged patient follow-up to confirm the diagnosis of mechanical loosening, while increasing patient morbidity and resource consumption. Aseptic loosening is typically indicated by a radiolucent zone wider than 2 mm or one that widens on repeated radiographs [9,10].
The human eye, especially on consecutive images, has a remarkable capacity for recognizing complicated patterns. The human aspect, on the other hand, has its own set of issues. Individuals notice patterns in different ways and give varying weights to different characteristics based on their unique experiences, making it challenging to translate what one practitioner sees visually into a set of "rules" that others can follow. As a result, there is significant scope for inter- and intra-subject variability and errors. Since accurate quantification of aseptic loosening and correct identification of its progression can reduce the consequences of aseptic loosening, artificial intelligence (AI) based detection and quantification have a high potential to avoid subjective variability and error. Temmerman et al. [11] reported that X-ray images of the hip implant can be used to diagnose cementless femoral component loosening with a sensitivity and specificity of 50% and 89.5%, respectively. Inter-observer agreement was found to be very low (intraclass correlation coefficient (ICC) of -0.1). Cheung et al. [9] reported a sensitivity and specificity of 83% and 82%, respectively. Temmerman et al. [12] showed that plain radiography has a sensitivity of 85% and a specificity of 78% for the diagnosis of cementless acetabular component loosening. There was a moderate inter-observer agreement; an ICC of -0.53 was reported. Aseptic loosening is even more difficult to diagnose early. Khalily and Whiteside [13] showed that, at a 2-year follow-up, the presence of radiolucent lines around porous-coated femoral stems was 100 percent sensitive but only 55 percent specific for predicting the need for future revision (8-12 years post-surgery). These variations in assessment have been questioned by Smith et al. [14].
Alternative imaging modalities including computed tomography, bone scans, and arthrography can increase diagnostic accuracy, but they come with higher costs, ionizing radiation exposure, and the risk of contrast agents.
Recently, machine learning applications for reliable automatic detection of abnormalities have become popular, for example for COVID-19 detection using radiological images [15,16], diabetic foot complication detection [17], and tuberculosis detection [18]. In these applications, the machine has performed comparably to professional surgeons and radiologists, and even better than general practitioners [19]. To the best of the authors' knowledge, no previous work has reported the use of object detectors to detect the implant region in hip implant X-ray images so that a deep learning model can be trained to detect aseptic loosening accurately. This study first focuses on the detection of the implant area before loosening detection, so that the implant region can be used precisely to identify whether the hip implant is loose or not. This study also reports saliency map visualization to confirm that the deep learning models are learning from the relevant region of interest for the classification.
The major contributions of this study are:
• Firstly, we developed the first publicly accessible dataset of hip implant X-ray images with two sub-sets. One sub-set is made up of hip area X-ray images (single leg or both legs) with implant annotations, and the other sub-set is for hip loosening detection, where two classes (Control and Aseptic loosening) of X-ray images are available.
• We developed an object detection model based on the YOLOv5 architecture for the detection of the implant region in the X-ray images and compared the performance of different versions of the YOLOv5 network.
• Then, we developed a stacking classifier using three different convolutional neural network (CNN) models to classify aseptic hip loosening and compared its performance with that of eight different state-of-the-art CNN models.
• Finally, a Score-CAM based visualization technique was used to display the saliency map of the area contributing most to loosening detection, to confirm the reliability of the model.
The rest of the paper is divided into the following sections: Section II describes the dataset, pre-processing steps, and methodology of this study, Section III provides the results, and Section IV discusses the results of the two major experiments: implant localization and loosening detection from the hip implant X-ray images. Finally, Section V concludes the study.

II. METHODOLOGY
The overall methodology of this study is illustrated in Figure 1. Two main experiments were carried out. Firstly, four versions of the YOLOv5 model (YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) were investigated to detect the implant regions in the hip implant X-ray images [20]. Secondly, the implant regions of the X-ray images were used for aseptic loosening detection using a novel stacking classifier, and the performance of this classifier was compared with that of eight different CNN models. The stacking approach consists of two kinds of learners, namely base-learners and meta-learners. We used three different CNN models as base learners, and the outputs of the base learners were fed to a machine learning classifier acting as the meta-learner, which makes the final prediction. Lastly, we evaluated the classification reliability using the Score-CAM technique. The experiments were done using the PyTorch library with Python 3.7 on an Intel® Xeon® CPU E5-2697 v4 @ 2.30 GHz with 64 GB RAM and a 16 GB NVIDIA GeForce GTX 1080 GPU. This section also provides the details of the dataset used, the pre-processing steps applied, the machine learning approaches adopted, the performance metrics, and the visualization technique used in the study.

A. Dataset Description
Although a large number of THR surgeries take place all over the world, no publicly available hip implant X-ray image database with control groups exists for implant localization and aseptic loosening detection. This motivated the authors to create such a dataset. A Kaggle dataset was created and made publicly available so that researchers worldwide can benefit from it and develop AI-based models for computer-aided diagnosis. This dataset is made up of two sub-sets: one for implant localization in the hip implant X-ray images, and the other for hip arthroplasty loosening detection. A detailed description of the dataset, its preparation, and the experiments is presented below.

1) Implant Detection Sub-set
The authors collected and indexed this first sub-set of hip implant X-ray images from different publicly available online medical sources such as medical journals (articles) and radiology websites. Each image had to be an X-ray image that includes at least a stem and a cup of the hip implant. These images were carefully checked to avoid duplications, and the clinical experts in the team evaluated each of the images to make sure that the collected X-ray images are of hip implants. For patients who had undergone total hip arthroplasty surgery, anteroposterior (AP) view X-ray images of patients with fixed (control group) and loosened hip implants were collected, while X-ray images having a wire or plate attached to the implant were excluded. The authors managed to collect 200 hip implant X-ray images from published articles [10,21-24], online resources [25-28], and Radiopaedia [29]; the complete dataset is also available [30]. Since these images were collected from different resources, different image resolutions, sizes, types of implants, and loosening conditions are represented in this sub-set. The collected images were manually annotated by the team and finally validated by an orthopedic surgeon who has more than 10 years of experience in THR surgery. A sample of X-ray images and corresponding implant annotations are shown in Figure 2.

2) Aseptic Loosening Detection Sub-set
The original X-ray images (with a single leg or both legs) were cropped to get the hip implant section (as shown in Figure 2(i)(B)). There are 206 X-ray images with a single hip implant available in this dataset, where hip implant images of both loose and control groups are included. The hip implant images in this database have varying resolutions (256 to 1024 pixels). Out of the 206 hip implant X-ray images in the database, 112 images were from aseptic loosening patients and 94 images were from control participants. Figure 3 shows the example X-ray images of the implant detection dataset.

3) Preprocessing
In this study, the two experiments used different types of deep learning models with different input image size requirements, and therefore the datasets were pre-processed to resize the original X-ray images (Implant Detection Sub-set). The state-of-the-art object detection network YOLOv5 [31] was used for the implant detection task (first experiment), where the input images were resized to 640×640 pixels. For the second experiment, popular pre-trained CNN models, such as InceptionV3 [32-34], ResNet [35], DenseNet [36], MobileNetV2 [37], and GoogleNet [38], were used. The X-ray images were resized to 299×299 pixels for InceptionV3 and 224×224 pixels for the other CNN models. The images were normalized using Z-score normalization [39] with the mean and standard deviation of the entire image dataset. The pre-processing also involved the dataset preparation for the machine learning experiments, which is described below.
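The Z-score normalization step can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of pixel values and uses the population standard deviation; the toy images and the helper names `dataset_mean_std` and `z_score` are hypothetical and are not taken from the study's code.

```python
from statistics import fmean, pstdev

def dataset_mean_std(images):
    """Mean and (population) standard deviation over every pixel of every image."""
    pixels = [p for img in images for row in img for p in row]
    return fmean(pixels), pstdev(pixels)

def z_score(image, mean, std):
    """Normalize one image with the dataset-wide statistics."""
    return [[(p - mean) / std for p in row] for row in image]

# Two tiny hypothetical 2x2 grayscale "images" with values in [0, 1].
images = [
    [[0.0, 0.2], [0.4, 0.6]],
    [[0.8, 1.0], [0.2, 0.6]],
]
mean, std = dataset_mean_std(images)
normalized = [z_score(img, mean, std) for img in images]
```

After this step the pooled pixel values have zero mean and unit standard deviation, which is the property the normalization is meant to provide.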

4) Image Augmentation and Training Parameters
For five-fold cross-validation, the entire image set was divided into 80% training and 20% testing subsets, with 10% of the training dataset used for validation, with the primary goal of avoiding overfitting. The training dataset has to be balanced to avoid biased training, which was done with the help of data augmentation, an approach that has provided reliable results in many of the authors' recent publications [40-45]. Moreover, the image dataset is small for training deep learning models. To balance the training image classes and to make the training set larger to avoid over-fitting [46], three popular image augmentation techniques (rotation, scaling, and translation) were used. The images were rotated clockwise and counterclockwise by an angle of 5 to 10 degrees. The scaling operation is the magnification or reduction of the frame size of the image, and image magnifications of 2.5% to 10% were used in this work. Image translation was done by shifting images horizontally and vertically by 5% to 10%. The numbers of training, validation, and test images used in the implant localization and hip implant loosening detection experiments are shown in Table 1.
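The splitting scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the split is not stratified by class, the shuffle seed is arbitrary, and the function name `five_fold_splits` is hypothetical.

```python
import random

def five_fold_splits(n_items, n_folds=5, val_fraction=0.10, seed=0):
    """Return one (train, val, test) index triple per fold.

    Each fold holds out about 20% of the items for testing; 10% of the
    remaining items are set aside for validation and the rest are used
    for training (augmentation would be applied to this part only).
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = round(val_fraction * len(rest))
        val, train = rest[:n_val], rest[n_val:]
        splits.append((train, val, test))
    return splits

# 206 images, as in the loosening detection sub-set.
splits = five_fold_splits(206)
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Every image appears in exactly one test fold, so the five test sets together cover the whole sub-set.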

B. Machine Learning Models
Two different experiments of this study used two different machine learning modalities, which are explained below:

1) Implant Localization
In this study, we used different versions of the YOLOv5 object detection model to detect the implant region in the hip implant X-ray images. YOLO is a state-of-the-art, real-time object detector, and YOLOv5 was developed as a continuous improvement effort from YOLOv1 to YOLOv4 to achieve top performance on two official object detection datasets: Pascal VOC (visual object classes) [47] and Microsoft COCO (common objects in context) [48]. There are three reasons for choosing YOLOv5 as the implant localizer model. Firstly, YOLOv5 incorporates a cross-stage partial network (CSPNet) [49] into Darknet, creating CSPDarknet as its backbone. CSPNet solves the problem of repeated gradient information in large-scale backbones and integrates the gradient changes into the feature map, thereby decreasing the parameters and FLOPS (floating-point operations per second) of the model, which not only ensures the inference speed and performance but also reduces the model size. Secondly, YOLOv5 applies a path aggregation network (PANet) [50] as its neck to boost information flow. PANet adopts a new feature pyramid network (FPN) structure with an enhanced bottom-up path, which improves the propagation of low-level features. At the same time, adaptive feature pooling, which links the feature grid and all feature levels, is used to make useful information in each feature level propagate directly to the following subnetwork. PANet improves the utilization of accurate localization signals in lower layers, which can enhance the location accuracy of the object. Thirdly, the head of YOLOv5, namely the YOLO layer, generates feature maps of 3 different sizes (18×18, 36×36, 72×72) to achieve multi-scale [51] prediction, enabling the model to handle small, medium, and large objects efficiently. The network architecture of YOLOv5 is shown in Figure 4.
In the network, the data is first supplied to CSPDarknet, which extracts features, and then to PANet, which fuses them. Finally, the YOLO layer outputs the detection results (class, score, location, size). In YOLOv5, the confidence score reflects whether a target object exists in a grid cell and how accurately the object is predicted. The confidence score is calculated using Equation (1):
$$C = \Pr(\text{Object}) \times \mathrm{IoU}(G, B) \qquad (1)$$
where $\Pr(\text{Object})$ is the prediction of the target, which lies in the range (0,1), and the intersection over union (IoU) is calculated between G and B, where G is the ground truth box and B is the predicted box. The confidence score of each class is produced using the leaky rectified linear unit (Leaky ReLU) and sigmoid activation functions, and a threshold on this score determines whether an object is detected. In YOLOv5, the Binary Cross-Entropy with Logistic Loss (BCELL) function from the PyTorch library is used for calculating the loss of the class probability and target object scores [52]. Moreover, an image may contain multiple target objects, and the objects might be of different shapes and sizes, so the target objects may not be captured perfectly by a single bounding box. The YOLOv5 object detection model creates more than one overlapping bounding box (BB) in a single image to detect target objects but needs to show only a single bounding box for each object in an image. Thus, the Non-Maximum Suppression (NMS) technique is applied to eliminate the overlapping problem; it selects a single BB out of several overlapping BBs to identify each object in an image. The NMS method removes the redundant detections and determines the best match for the final detection. The NMS technique is presented in Algorithm 1.
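The greedy form of non-maximum suppression can be sketched in a few lines. This is a generic illustration, not a reproduction of Algorithm 1 or of the YOLOv5 code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the IoU threshold of 0.5 is an assumed value, and the example boxes and scores are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only candidates that do not overlap the kept box beyond the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Hypothetical detections: two near-duplicate boxes on one implant and one separate box.
boxes = [(10, 10, 110, 210), (12, 14, 112, 208), (300, 40, 380, 200)]
scores = [0.92, 0.85, 0.70]
kept = nms(boxes, scores)
```

Here the second box overlaps the first almost completely and is suppressed, while the third box does not overlap either and is retained. The same `iou` function is the quantity that enters Equation (1).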

2) Aseptic Loosening Detection
In this study, we used a stacking approach for hip implant loosening detection using implant X-ray images, where eight state-of-the-art CNN models such as i) Resnet18 [34], ii) Resnet50 [34], iii) Resnet101 [34], iv) InceptionV3 [34], Mobilenetv2 [37], and viii) Googlenet [54] were investigated. The stacking approach was then deployed with the top-performing three models as base learners, and the predictions of these models were used to train ten different machine learning classifiers as meta learners to make the final decision. Consider a single dataset A, which consists of input vectors ($x_i$) and their classification scores ($y_i$). At first, a set of base-level classifiers $M_1, \ldots, M_p$ is trained, and the predictions of these base learners are used to train the meta-level classifier, as illustrated in Figure 5. We used five-fold cross-validation to generate a training set for the meta-level classifier. Among these folds, the base-level classifiers were trained on four folds, leaving one fold for testing. Each base-level classifier produces a probability value for each of the possible classes. Thus, using input x, a probability distribution is created using the predictions of the base-level classifier set, M:
$$P^{M}(x) = \left(P^{M_1}(x), P^{M_2}(x), \ldots, P^{M_p}(x)\right)$$
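How the meta-learner's training set is assembled from held-out predictions can be sketched as below. The sketch makes several assumptions that are not from the study: the three base learners are toy logistic functions of a single scalar feature standing in for trained CNNs, the folds are assigned by index modulo five, and all function names are hypothetical.

```python
import math

def meta_features(base_models, x):
    """Concatenate the class-probability vectors of all base learners for input x."""
    features = []
    for model in base_models:
        features.extend(model(x))
    return features

def out_of_fold_meta_set(make_base_models, samples, labels, n_folds=5):
    """Build the meta-learner training set from predictions on held-out folds."""
    meta_x, meta_y = [], []
    for k in range(n_folds):
        held_out = [i for i in range(len(samples)) if i % n_folds == k]
        training = [i for i in range(len(samples)) if i % n_folds != k]
        models = make_base_models([samples[i] for i in training],
                                  [labels[i] for i in training])
        for i in held_out:
            meta_x.append(meta_features(models, samples[i]))
            meta_y.append(labels[i])
    return meta_x, meta_y

def make_toy_models(train_x, train_y):
    """Hypothetical stand-ins for three CNNs: each returns [P(control), P(loose)]."""
    loose = [x for x, y in zip(train_x, train_y) if y == 1]
    control = [x for x, y in zip(train_x, train_y) if y == 0]
    midpoint = (sum(loose) / len(loose) + sum(control) / len(control)) / 2

    def make(sharpness):
        def predict(x):
            p_loose = 1.0 / (1.0 + math.exp(-sharpness * (x - midpoint)))
            return [1.0 - p_loose, p_loose]
        return predict

    return [make(s) for s in (1.0, 2.0, 4.0)]

# Toy data: label 0 = control, 1 = loose.
samples = [0.5, 1.0, 1.5, 2.0, 2.5, 5.5, 6.0, 6.5, 7.0, 7.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
meta_x, meta_y = out_of_fold_meta_set(make_toy_models, samples, labels)
```

Each row of `meta_x` holds six values (two class probabilities from each of three base learners), and `meta_x` with `meta_y` could then be passed to any conventional classifier, for example a random forest, acting as the meta-learner.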

C. Performance Metrics

1) Implant Localization
The performance of the YOLOv5 models in localizing the implant area in the hip implant X-ray images is evaluated using (i) Precision (P), (ii) Recall (R), and (iii) Mean average precision (mAP). Precision represents the ability of a model to detect only the relevant objects. Recall, on the other hand, represents the ability of a model to find all the relevant cases. The mAP for object detection is the mean of the average precision (AP) calculated over all the classes. The intersection over union (IoU) is given by the ratio of the area of intersection to the area of union of the predicted bounding box and the ground truth bounding box. Traditionally, mAP at IoU = 0.5 is used to measure object detection performance, while mAP over IoU = 0.5-0.95 is a stricter metric. Here, mAP0.5-0.95 represents the mean of the average precision values obtained by starting at IoU = 0.5 and stepping by 0.05 up to IoU = 0.95. As a result, AP is computed at ten distinct IoU thresholds. Their average provides a single value that rewards better localization.
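The averaging behind mAP0.5-0.95 can be made concrete with a short sketch. The per-threshold AP values below are hypothetical placeholders chosen only to show the arithmetic; they are not results from this study, and the function names are illustrative.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(n)]

def mean_ap_over_thresholds(ap_at):
    """Average AP over the IoU thresholds; ap_at maps threshold -> AP."""
    thresholds = iou_thresholds()
    return sum(ap_at[t] for t in thresholds) / len(thresholds)

thresholds = iou_thresholds()

# Hypothetical AP values: perfect at loose thresholds, lower at strict ones.
hypothetical_ap = dict(zip(
    thresholds,
    [1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.8, 0.8, 0.6, 0.4],
))
map_50_95 = mean_ap_over_thresholds(hypothetical_ap)
print(f"mAP0.5 = {hypothetical_ap[0.5]:.2f}, mAP0.5-0.95 = {map_50_95:.2f}")
```

The example shows how a detector can reach an mAP0.5 of 100% while its mAP0.5-0.95 is lower, because boxes that pass the 0.5 threshold may still fail the stricter thresholds near 0.95.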

2) Aseptic Loosening Detection
The performance of the different CNN models and machine learning classifiers was evaluated using five performance metrics: overall accuracy, weighted precision, weighted sensitivity (recall), weighted F1-score, and weighted specificity, using Equations (6-10). As the classes have different numbers of images, the networks were compared using per-class weighted performance metrics and overall accuracy. The area under the curve (AUC) was also used to assess the performance.
Here, true positive (TP), true negative (TN), false positive (FP), and false negative (FN) denote the number of loose hip implant X-ray images identified as loose, the number of control group hip implant X-ray images identified as control, the number of control group hip implant X-ray images incorrectly identified as loose, and the number of loose hip implant X-ray images incorrectly identified as control, respectively. We report the weighted performance metrics for sensitivity, specificity, precision, and F1-score, as well as the overall accuracy, each with a 95% confidence interval.
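The metrics can be computed directly from the four confusion-matrix counts, as in the following sketch. The weighting convention used here (each class treated in turn as the positive class and weighted by its number of images) is an assumption about how the weighted metrics are defined, and the function names are hypothetical. The example counts are those reported for the stacking model in the Results section (108 of 112 loose and 90 of 94 control images correctly classified).

```python
def binary_metrics(tp, tn, fp, fn):
    """Precision, sensitivity, specificity, and F1 for one positive class."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

def weighted_metrics(tp, tn, fp, fn):
    """Support-weighted average of the per-class metrics, plus overall accuracy."""
    n_loose, n_control = tp + fn, tn + fp
    total = n_loose + n_control
    loose = binary_metrics(tp, tn, fp, fn)
    # With control as the positive class, the roles of the counts swap.
    control = binary_metrics(tp=tn, tn=tp, fp=fn, fn=fp)
    out = {name: (n_loose * loose[name] + n_control * control[name]) / total
           for name in loose}
    out["accuracy"] = (tp + tn) / total
    return out

# Stacking model: 108 loose correct, 4 loose missed, 90 control correct, 4 control wrong.
stacking = weighted_metrics(tp=108, tn=90, fp=4, fn=4)
for name, value in stacking.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts the overall accuracy is 198/206, or about 96.1%, in line with the accuracy reported for the stacking model. Under this weighting convention the weighted sensitivity always equals the overall accuracy, which is a useful sanity check on any implementation.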

D. Visualization Techniques
With the emergence of visualization tools, there has been a rise in curiosity about how CNNs work and the reasoning underlying their decision-making. Visualization approaches improve the visual portrayal of the decision-making process of CNNs. They also improve a model's transparency by showing the rationale behind the inference in a way that humans can understand, hence enhancing trust in the CNN's outputs. Score-CAM was chosen for this investigation because of its promising performance in recent medical computer vision problems [18,55]. Figure 6 shows a Score-CAM visualization that highlights the regions that the CNN considers when making decisions. By confirming that decisions are made from the important regions of the images, these visualizations serve to increase trust in the reliability of deep networks.

III. RESULTS
This section discusses the results of the implant localization and aseptic loosening detection experiments along with the Score-CAM visualization to better interpret the model performance.

A. Implant Localization
We investigated four versions of the YOLOv5 object detection model (YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) to detect implant regions in hip implant X-ray images. Firstly, the precision, recall, and mean average precision of the four versions of the YOLOv5 model were examined with respect to training epochs. Precision is the ability of a model to detect only the implant objects, whereas recall (sensitivity) measures the proportion of implants that are correctly detected. We also investigated the mean average precision at IoU = 0.5, which is the traditional metric for measuring the performance of object detection models, as well as mAP over IoU 0.5 to 0.95. In this study, all versions of YOLOv5 object [...] The F1 score performance curve is depicted in Figure 7. The F1 score is calculated from precision and recall, and Figure 7 plots it against the confidence score for implant detection in hip implant X-ray images using the different versions of the YOLOv5 model. It is also evident in Figure 7 that the YOLOv5m version outperformed the other versions of YOLOv5 in implant detection. Figure 8 shows that the different versions of YOLOv5 trained on the hip implant X-ray dataset can detect the implant areas of the X-ray images very reliably. However, YOLOv5m was used for the remaining experiment as it outperformed the other models. Some sample test images with ground truth bounding boxes and predicted bounding boxes are shown in Figure 8, where it can be clearly seen that the ground truth and predicted bounding boxes for implant detection almost overlap each other.

B. Aseptic Loosening Detection
This section describes the performance of the different classification networks in detecting control and loose hip implant X-ray images. As mentioned earlier, eight different state-of-the-art CNN models and a stacking approach with the top-performing networks were investigated to identify loosening of the hip implants using X-ray images. The comparative performance of the different CNNs for these classification schemes is shown in Table 2 (A). The best classification accuracy, precision, sensitivity, F1 score, and specificity for loosening detection among the individual CNNs were 94.66%, 94.66%, 94.66%, 94.66%, and 94.5%, respectively, obtained with the Densenet201 CNN model. For binary classification (control vs loosening) using hip implant X-ray images, the top-performing three CNN models were DenseNet201, Resnet50, and Resnet18, with overall accuracies of 94.66%, 93.69%, and 91.75%, respectively. We used these three top-performing models as base learners in the stacking approach, where the predictions of these three models were used as input to a meta learner. Ten different machine learning classifiers were investigated as meta learners, and the Random Forest classifier outperformed the other classifiers with accuracy, precision, sensitivity, F1 score, and specificity of 96.11%, 96.42%, 96.42%, 96.42%, and 96.74%, respectively, for loosening detection (Table 4(B)). Figure 9 shows the receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC, also known as AUROC) for loosening detection using hip implant X-ray images, which is one of the most important evaluation metrics for checking a CNN model's performance. It is apparent from the ROC curves that the DenseNet201 CNN model outperformed the other networks for classification, with 97.68% AUC, in Figure 9(A). In the stacking model, the Random Forest classifier was the best-performing meta learner, with 98.94% AUC, as shown in Figure 9(B).
Figure 10 shows the confusion matrices for the best performing CNN model and for the stacking model with the Random Forest meta learner for loosening detection using hip implant X-ray images. Figure 10(A) shows the confusion matrix of the best performing CNN model (DenseNet201), and Figure 10(B) shows the confusion matrix of the best performing stacking CNN model (with the Random Forest classifier as the meta learner). The best performing DenseNet201 network failed to detect 5 out of 112 loose hip implant X-ray images and incorrectly classified 6 control group hip implant X-ray images as loose, whereas with the stacking CNN model 108 out of 112 loose hip implant X-ray images were correctly detected as loose and 90 out of 94 control hip implant X-ray images were correctly identified as control images. Thus, it is evident that the stacking CNN model with the Random Forest classifier as the meta learner outperformed the other state-of-the-art CNN models.

IV. DISCUSSION
The study carried out two major experiments: i) four different object detection models were investigated in detecting the implant from the hip implant X-ray images and ii) eight different CNN models and stacking models were investigated to classify the loosening in the hip implant X-ray images. The performance of the YOLOv5m model exceeded other models in detecting the implant region from hip implant X-ray images, with precision, recall, mAP0.5, mAP0.5-0.95 of 100%, 100%, 100%, and 87.8%, respectively. For different CNN models, the Densenet201 model outperformed others with the accuracy, precision, sensitivity, F1-score, and specificity of 94.66%, 94.66%, 94.66%, 94.66%, and 94.5%, respectively. However, the stacking CNN approach with Random Forest meta learner classifier produced the best performance with the accuracy, precision, sensitivity, F1-score, and specificity of 96.11%, 96.42%, 96.42%, 96.42%, and 96.74%, respectively for loosening detection. Moreover, it was also confirmed that the stacking approach can improve the detection accuracy by around 2%.
To the best of the authors' knowledge, no investigation this extensive has previously been carried out for hip or knee implant loosening detection. Table 5 compares the performance of this study with similar works on the hip implant (which used a very small dataset) and on the knee implant. A previous study [56] reported a loosening classification sensitivity of 94%, but it used a database of only 40 X-ray images. In [57], a larger private dataset was used for knee implant loosening detection without object detection, and the reported model achieved 88.3% accuracy. In contrast, this study used a dataset of 200 X-ray images (labeled by a radiologist with more than 10 years of experience) and obtained better results than both studies. Moreover, the datasets used in those studies were not made public, so our model could not be evaluated on them, whereas we have made our dataset public so that interested researchers can replicate our results easily. In addition, the implant localization approach, combined with a novel stacking CNN model, helped to improve loosening classification performance, making the complete approach more robust and versatile, with a detection sensitivity of 96.42%.
Score-CAM-based heat maps were generated for the X-ray images to visualize the saliency maps behind the network's predictions. Figure 11(A) depicts the Score-CAM visualization of correctly identified loose hip implant X-ray images, while Figure 11(B) depicts the same for misclassified loose hip implant X-rays, both using the best performing model. For the correctly classified images, the model bases its decision on the loosening area of the hip implant X-ray; black arrows indicate the loosening region in the images. For the misclassified images, the model bases its decision on areas that are not relevant to the loosening, and most of the misclassified images show an early stage of loosening. These heat maps increase end-user confidence in the network's output, making the deep learning model more explainable and reliable as a computer-aided diagnostic tool.
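For reference, the five classification metrics reported in this section can be derived from the counts of a binary confusion matrix. The sketch below uses the standard binary definitions; the function name and the counts are hypothetical. The study may have used class-weighted averages, so this sketch is not expected to reproduce the reported values.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: loose implants classified as loose / not loose
    tn/fp: non-loose implants classified as not loose / loose
    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the study's results)
example = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity are reported separately because a missed loosening (false negative) and a false alarm (false positive) carry different clinical costs.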

V. Conclusion
This study used a stacking approach with deep Convolutional Neural Networks to automatically detect aseptic loosening in hip implant radiographs. Two main experiments were carried out. Firstly, the performance of four object detection models was evaluated for localizing the implant region in hip implant X-ray images. The YOLOv5m model outperformed the other models, with precision, recall, mAP0.5, and mAP0.5-0.95 of 100%, 100%, 100%, and 87.8%, respectively. Secondly, the performance of stacking models was evaluated for classifying loosening in hip implant X-ray images and compared with eight state-of-the-art CNN models. Densenet201 outperformed the other individual CNN models, with accuracy, precision, sensitivity, F1-score, and specificity of 94.66%, 94.66%, 94.66%, 94.66%, and 94.5%, respectively, whereas the stacking CNN approach with a Random Forest meta-learner classifier produced the best performance for loosening detection, with accuracy, precision, sensitivity, F1-score, and specificity of 96.11%, 96.42%, 96.42%, 96.42%, and 96.74%, respectively. The stacking approach thus improved detection performance over the best single CNN by roughly 1.5 to 2 percentage points. The Score-CAM visualization shows that, when the model correctly detects loosening, its decision is based on the relevant region of the hip implant X-ray image. The performance could be improved with a larger dataset, which would also increase the robustness of the model; however, as mentioned earlier, no such dataset is publicly available. A future study will therefore aim to build a larger dataset with expert severity labeling (mild, moderate, and severe aseptic loosening), which can help in following up disease progression and allow clinicians to intervene in order to delay the need for revision surgery.
It can be concluded that the newly developed AI-based aseptic loosening detection system can aid diagnosis whether or not an expert radiologist is available. This can help reduce the large number of revision surgeries that might otherwise result from delayed or improper diagnosis.

Acknowledgment
This work was made possible by Qatar National Research Fund (QNRF) NPRP11S-0102-180178. The statements made herein are solely the responsibility of the authors. Open Access publication of this article is supported by Qatar National Library.

Data Availability
The database [30] created as part of this study will be made publicly available to researchers once the article is accepted.

Ethical Consideration
Since the dataset was created from publicly available images and the authors did not collect images from a hospital, no ethical approval was required for creating this dataset. Moreover, patient identity is not available in the X-ray images, and the dataset is entirely de-identified and anonymous.