Tufts Dental Database: A Multimodal Panoramic X-Ray Dataset for Benchmarking Diagnostic Systems

The application of Artificial Intelligence in dental healthcare holds great promise due to the abundance of imagery and non-imagery-based clinical data. Expert analysis of dental radiographs can provide crucial information for clinical diagnosis and treatment. In recent years, Convolutional Neural Networks have achieved the highest accuracy in various benchmarks, including the analysis of dental X-ray images to improve the quality of clinical care. This paper presents the Tufts Dental Database, a new panoramic dental radiography dataset. The dataset consists of 1000 panoramic dental radiographs with expert labeling of abnormalities and teeth. Each radiograph was classified at five different levels: anatomical location, peripheral characteristics, radiodensity, effects on the surrounding structure, and abnormality category. This first-of-its-kind multimodal dataset also includes the radiologist's expertise, captured in the form of eye-tracking and a think-aloud protocol. The contributions of this work are: 1) a publicly available dataset that can help researchers incorporate human expertise into AI and achieve more robust and accurate abnormality detection; 2) a benchmark performance analysis of various state-of-the-art deep learning systems for dental radiograph image enhancement and image segmentation; 3) an in-depth review of panoramic dental image datasets and of segmentation and detection systems. The release of this dataset aims to propel the development of AI-powered automated abnormality detection and classification in dental panoramic radiographs, to enhance tooth segmentation algorithms, and to enable the distillation of the radiologist's expertise into AI.


I. INTRODUCTION
The need for precise imaging and diagnostic tools has led to an exponential increase in computed tomography (CT) and X-ray imaging-based technologies to detect, diagnose, and treat various dental diseases over the last few decades. However, manual X-ray film interpretation and analysis may consume large amounts of crucial clinical hours and may lead to misdiagnosis or underdiagnosis owing to personal factors such as stress, fatigue, and experience level [1]. These shortcomings can be reduced if intelligent dental X-ray film interpretation tools are developed to assist dentists in improving dental care. From this perspective, dental informatics, including automatic teeth identification, dental anomaly identification, and annotation, becomes an essential aspect of intelligent health care.
Classical Machine Learning (ML) techniques such as mathematical morphology [2], active contours [3], and level sets [4] have been used for teeth segmentation. Additionally, Bayesian techniques [5], linear models [6], and support vector machines [7] have been used for classification, with hand-crafted features extracted using Fourier descriptors [5], contours [8], and textures [4]. However, most of these methods require careful engineering of the feature extractor to transform the raw data into a representation suitable for the algorithms to detect or classify the input images. The efficiency of these methods relies heavily on the image preprocessing performed and the quality of the features extracted. While some of these methods achieve satisfactory results, the test data is generally small, and they tend to perform poorly when tested on large datasets.
The limitations of the traditional approaches were overcome efficiently with the introduction of Artificial Intelligence (AI). The success of AI-based systems can be mainly attributed to the progress made in computer engineering, the massive availability of imagery data, and the development of deep learning algorithms [9]. Deep learning methods involve representation-based learning techniques that take raw image data as input, process this data in different layers to automatically identify the optimal representations, and produce a meaningful output in the form of detection or classification of the input data. In recent years, deep learning-based methods have been employed in various medical tasks, producing promising results [10]. The main advantage of deep learning systems over classical ML systems is their ability to learn features directly from the raw data using a general-purpose learning procedure instead of hand-crafted features.
In dentistry, AI has recently been applied to detect carious lesions on bitewing radiographs using pre-trained convolutional neural networks (CNNs) [11]. It has also been used to identify periodontal bone loss (PBL) in periapical radiographs and to detect apical lesions and plan treatment in panoramic dental radiographs [12]-[16]. Furthermore, given the complexity and the associated diagnostic effort, the application of CNNs to panoramic radiographs may result in a higher diagnostic yield in terms of increased accuracy, reliability, and time savings for the assessment [12].
In dentistry, imagery plays a vital role in screening and treatment. Multiple types of imagery of the same anatomical region of the same individual are taken over multiple time points, and the corresponding non-imagery-based clinical data is also present. Further, dental conditions such as caries, apical lesions, and PBL are relatively prevalent, making it easy to build datasets to train and optimize neural networks. Despite this potential, AI solutions have not yet become the norm in routine medical practice. Systems based on CNNs have only recently been adopted in dental radiograph research, and applications based on these technologies are only now entering the clinical arena [22].
As suggested in [23], some of the main reasons for AI technologies not being fully adopted in dentistry are: 1) dental data is not readily accessible due to data protection and privacy concerns, 2) datasets lack structure and are complex, multi-dimensional, and often biased with overly sick, overly healthy, or overly affluent data points, 3) datasets are relatively small compared to other image-based datasets for AI, 4) lack of "hard" gold standards and the requirement of an expert to label, and 5) lack of trust in AI, as it does not provide any feedback on how/why it arrived at the prediction.
To alleviate some of the problems mentioned above, a new X-ray dental image dataset called the "Tufts Dental Database (TDD)" is introduced in this paper. TDD consists of 1000 panoramic radiographs with the abnormalities, teeth, and maxillomandibular region of interest outlined. Among the various types of intra-oral and extra-oral radiographs, panoramic X-ray images were selected for this study due to their advantages over traditional X-ray imaging methodologies. They display (i) the entire mouth region, (ii) all the teeth in both the upper and lower jaws in a single X-ray, and (iii) the positions of fully emerged as well as emerging teeth. Furthermore, faster operation time, greater diagnostic capability, less radiation exposure, better patient acceptance, and fewer infection control procedures [24] make them ideal for dentistry. Intra-oral radiographs such as periapical and bitewing X-rays can image localized regions in the mouth but require multiple scans to capture different areas. This makes the process time-consuming, causes discomfort to the patient, and can increase the radiation dose [25]. Fig. 1 shows an example of each of these three kinds of radiographs.
Additionally, the perception (the ability to recognize abnormal patterns) and cognition (the ability to interpret) or the "expertise" of the radiologist is captured through eye-tracking and verbal dictation or the think-aloud protocol. To distill the knowledge of the expert into an AI system, it is essential that the eye-tracking data is supplemented with the think-aloud protocol. For example, a more prolonged fixation may point to an interesting area or an area challenging to interpret. Hence, supplementing the eye-tracking data with verbal dictation allows for better interpretations. The eye-tracking details are presented in TDD as gaze maps, and the oral dictation is provided as text within the .json file.
Incorporating these aspects into TDD makes it a multimodal dataset with modalities such as X-ray images, gaze maps, and audio-text files. The multimodal nature of the dataset allows for a much more comprehensive range of applications, such as deeper investigation of the visual characteristics of medical experts and optimization of instructional design for students. For example, the Eye Movement Modelling Examples (EMME) [26] technique utilizes the experts' eye movements and think-aloud reasoning to direct the students' attention to the relevant information in the X-ray images. Furthermore, the multimodal approach could be very helpful for use cases such as AI-based dental biometrics, which can utilize the X-ray images together with the technicians' eye-tracking and verbal dictation for better detection. Similarly, in general-health condition investigation, oral X-ray images can be utilized along with the patients' medical records for a better prognosis.
The remainder of the paper is organized as follows. In Section II, a brief review of the relevant literature is presented. Section III provides a detailed description of the Tufts Dental Database along with its acquisition procedure. Section IV outlines the instructions for accessing the database along with its structure. The baseline results and performance review are presented in Section V. Finally, Section VI summarizes this work's contributions and discusses future directions.

II. RELATED WORK
Databases: AE Rad [27] introduced a dataset containing 120 various periapical dental X-ray images of the top and bottom jaws. An expert specified the abnormality for each image. The dataset focuses solely on dental caries and contains the ground truth for each image. Abdi [28] introduced a dataset consisting of anonymized and de-identified panoramic dental X-rays of 116 patients. The subjects cover a wide range of dental conditions, from healthy to partially and completely edentulous cases. The dataset consists of mandibles that two dentists manually segmented. Jader [19] introduced a dataset containing 1500 panoramic radiographs with each tooth labeled. To the best of our knowledge, these are the only datasets that are publicly available. The literature covers several studies that use privately owned datasets.
Detection: Kim [29] proposed a fully automated network using convolutional neural networks (CNNs) to detect periodontal bone loss (PBL) in panoramic dental radiographs. The overall framework consists of multiple stages: the first stage was trained to extract the region of interest (the teeth region), the second stage focused on training a network to segment and predict the PBL lesion, the third stage used the pre-trained weights of the second-stage encoder to create a classification network that predicts the existence of PBL in each tooth, and the final stage consists of a classification network that predicts the existence of PBL lesions specifically for the molar and premolar teeth. The network was trained and tested on panoramic dental radiographs. Kim [30] investigated the application of CNNs to classify four types of implant fixtures. Periapical radiographs from 801 patients were used for this study and tested on state-of-the-art networks such as SqueezeNet [31], GoogLeNet [32], ResNet-18 [33], MobileNet-V2 [34], and ResNet-50 [33].
All five networks produced an accuracy above 90% in detecting the four types of implants, with SqueezeNet and MobileNet achieving accuracies of 96% and 97%, respectively. Chang [35] developed a hybrid framework to automatically detect and classify periodontal bone loss in dental panoramic radiographs. A total of 340 panoramic radiographs were used for detecting the periodontal bone level, cementoenamel junction level, and teeth. A modified Mask R-CNN [36] based on a feature pyramid network with a ResNet101 backbone was employed to detect the abovementioned categories.
Further, the percentage of radiographic bone loss was computed by combining the tooth long axis with the periodontal bone and cementoenamel junction levels, and this percentage was used to automatically classify the periodontal bone loss. The Pearson correlation coefficient between the automatic method and the radiologists' diagnosis was 0.73 for the entire jaw, and the intraclass correlation value was 0.91 for the entire jaw. Geetha [37] proposed a back-propagation neural network architecture to detect caries in intra-oral radiographs. The radiographs are passed through a Laplacian filter to obtain sharper images. The images are further passed through different adaptive thresholding filters to obtain the segmented output. These images are then fed to a feature extractor, and 16 statistical features are acquired.
Finally, the features are passed into the back-propagation neural network to detect the presence of caries.
Segmentation and Classification: Muresan [38] proposed a deep learning-based semantic segmentation technique to classify and segment 15 different classes of specific tooth issues. The proposed solution consists of five main stages. The first stage focuses on image preparation, where the image is cropped to the required dimensions and the pixels are annotated for each image. A total of 1000 images were selected and annotated for semantic segmentation. The second stage employs the Efficient Residual Factorized Convolutional Network (ERFNet) [39] to perform semantic segmentation on the images. The application of ERFNet to the test set resulted in an intersection-over-union score of 60.03%. In the third stage, different morphological operations were performed on different classes to better separate the teeth. The final part of this stage consisted of a two-step labeling algorithm where each object was assigned a unique numeric label. The bounding box for each tooth or dental anomaly was computed for each instance of the label. The fourth stage focuses on inconsistencies caused by segmentation errors and utilizes a thresholding technique to eliminate pixel regions below the threshold. Finally, the classification is performed using a majority voting technique. This process is repeated for each tooth, and a number is associated with each dental abnormality. Oktay [17] proposed the application of a modified AlexNet [40] to classify teeth in panoramic X-ray radiographs. The images are preprocessed to detect the mouth gap, which in turn gives the potential positions of the different classes of teeth. The images are split into four quarters and fed to the AlexNet for tooth classification. Kim [41] presented automatic tooth detection and classification methods that combine a region-based convolutional neural network (R-CNN), the Single Shot Detector, and heuristic methods.
The proposed algorithm yielded a precision of 84.5%, sensitivity of 75.5%, and specificity of 80.4%.
Forensics: Fahmy [42] proposed an Automatic Dental Identification System (ADIS) architecture capable of handling large amounts of data. The architecture adopts a two-stage approach: the first stage reduces the number of candidate images from thousands to hundreds, and the second stage reduces the candidates to a match list of a few cases to be presented to the forensic experts for their final decision. The experiment was tested and validated on a database consisting of antemortem and postmortem scans provided by the FBI. Experimental results demonstrated that, for 90% of the cases, the authentic images were among the top 10% of the retrieved images. Anil [43] developed a semi-automatic method to identify humans based on dental radiographs. The technique consisted of three stages: radiograph segmentation, tooth feature extraction, and tooth feature matching. Contours of the teeth were utilized as features, to which a transformation was applied to align the contours and correct the imaging geometry variations, thereby generating a matching distance. The proposed identification system was tested on 38 query images and shown to be a feasible approach. However, the proposed method would fall short in cases where a) the images are very blurry, b) the query shape is partially occluded, c) there is a substantial change in the imaging angle between the antemortem and postmortem images, or d) some teeth have been extracted. Sathya [44] proposed a three-stage transfer learning approach to identify humans using digital radiographic images. The first stage uses a pre-trained AlexNet [40] to determine whether the query tooth image belongs to the upper or lower jaw. The second stage further classifies the tooth as a molar, premolar, canine, or incisor. Finally, the tooth number is identified and matched with the images in the antemortem dataset.
Eye-tracking and radiology: Karargyris [45] introduced a chest X-ray dataset with 1083 images. This dataset included eye gaze coordinate data along with the radiologist's dictation audio and transcription. Khosravan [46] developed a collaborative computer-aided diagnosis (CAD) system that combined eye tracking with CAD. By creating a novel algorithm for graph-based clustering and sparsification, this system could collaborate with radiologists and help them improve their diagnostic decisions. Stember [47] utilized eye tracking with deep learning to generate segmentation masks based on eye-tracking data and achieved results similar to hand annotations. Similarly, Aresta [48] and Mall [49] utilized gaze information for automatic lung nodule detection and mammogram interpretation. Hermanson [50] conducted a study to identify the visual search patterns of dentists while viewing and interpreting periapical dental radiographs. While this research showed that dentists tend to fixate significantly more on regions of high contrast (radiopaque or radiolucent), it did not correlate these fixations with the presence of abnormalities in the regions of high contrast.

III. TUFTS DENTAL DATABASE
The protocol developed to collect and publish the TDD was approved by the Tufts University Institutional Review Board (IRB ID MODCR-01-12631, approved on 7/14/2017). The protocol discussed the study's purpose and procedure, the confidentiality and public access of the database, and the protection of participants' identities. We have included one thousand de-identified digital panoramic radiographs of Tufts University School of Dental Medicine patients in this pilot study. These radiographs were randomly selected from the electronic patient database (axiUm) at the School of Dental Medicine. The random selection was done through a search of panoramic radiographic procedure codes from January 1, 2014, to December 31, 2016. Up to 1000 qualifying panoramic radiographs were selected from the generated list of chart numbers with panoramic images in that date range. The inclusion criterion was optimum diagnostic quality of the image with minimal or no technical errors. The 1000 selected radiographs were de-identified and saved in a generic image format (TIFF/JPEG) with a unique identifier. Each of these radiographs was annotated by an expert and a student from the Tufts University School of Dental Medicine. In addition to the annotated radiographs, eye-tracking and audio data were collected from the expert and the student. Fig. 2 displays the acquisition system setup used for this project.
The dataset consists of six major components: a) radiographs, b) labeled masks, c) eye-tracker-generated maps (gray and quantized), d) text information describing each radiograph, e) a teeth mask for each radiograph with labels, and f) a maxillomandibular region-of-interest mask. Fig. 3 shows an example of each imaging component. The expert and the student used Labelbox [36] to annotate and label each radiograph. To capture the annotators' eye movements, our system uses a Tobii Eye Tracker 4C, which has an operating distance of 50-95 cm [51]. The eye tracker was placed at the bottom of the screen. The audio was captured using a microphone and was synchronized with the eye tracker.
All the digital panoramic radiographs included in this study were prescribed by clinicians based on the diagnostic and clinical needs of these patients. All panoramic images included in the dataset were obtained using OP100 Orthopantomograph (Instrumentarium Imaging/Kavo Kerr) and Plammeca Promax 2D (Henry Schein) radiographic units. Automatic exposure control was utilized to obtain clinically diagnostic images with optimal image density and contrast.
The expert utilized a systematic approach to analyze each radiograph, identify the abnormality, and describe the radiographic characteristics of the abnormality, which allowed for categorization and differential diagnosis of the entity. The student was a fourth-year dental student who had completed the Oral and maxillofacial radiology (OMFR) didactic course and clinical OMFR rotations with a passing grade in the radiographic interpretation case-based competency exam. The course includes instruction and assessment in the systematic analysis of radiographs.
A consistent logical sequence of steps was followed for evaluating the radiographs in this study. As each panoramic radiograph was reviewed, the expert and the student recorded their findings. Diagnostic reasoning and analytical framework have been routinely described to aid in the interpretation of diagnostic images. Therefore, the first step was recognizing the abnormality, which required a thorough knowledge of variations of normal findings. Evidence shows that the employment of a systematic search strategy by novice clinicians improves their ability to detect abnormalities in panoramic radiographs [52].
If an abnormality was detected, a detailed description of its radiographic characteristics was recorded at five different levels, as depicted in Fig. 4. The first level described the anatomic location relative to the jaw and teeth. The second level described the periphery or marginal characteristics of the abnormality; the periphery was described primarily as well-defined versus ill-defined. The internal architecture was the third level of description and focused on the radiodensity: radiolucent, radiopaque, or mixed with septae or calcifications. The fourth level focused on the effects of the abnormality on adjacent structures and was recorded in terms of tooth displacement, root resorption, osseous thinning and expansion, and extension into the adjacent soft tissue or degenerative changes. Based on the overall description, a category (fifth level) for the abnormality was selected from Trauma, Inflammation, Dysplasia, Developmental, Benign tumor or Cyst, Malignant Neoplasia, and Systemic or Metabolic conditions. An example of a radiograph with its abnormality labeled is seen in Fig. 5. Table II summarizes the level-one findings of the expert and the student in selecting the location of the abnormality for all 1000 panoramic radiographs. As seen in Table II, there are discrepancies in the student's categorization of the location of the abnormality. These discrepancies result from either missing the abnormality in the radiograph or misclassifying it. The five-level classification of each radiograph is recorded in the .json files for the expert and the student.
The regions of abnormality are outlined using the Labelbox software, and the ground truth masks are generated. As the expert and the student analyzed each radiograph, their eye movements were recorded simultaneously using the Tobii eye tracker. The Tobii starter kit [51] was employed to generate a file for each radiograph containing the x and y coordinates.
These coordinates are further processed to generate the gaze plot. An example of a gaze plot can be seen in Fig. 6. The gaze plot displays the locations looked at on the radiograph and the time spent looking at each of them [51]. The time spent looking at a location, commonly known as the fixation time, is indicated by the diameter of the fixation circle: as the fixation time increases, so does the diameter of the circle.
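The processing from raw gaze coordinates to fixation circles can be sketched as follows. This is an illustrative reconstruction, not the Tobii starter kit's actual algorithm; the dispersion threshold and the duration-to-radius scaling are assumed values chosen for illustration.

```python
import math

def fixations_from_gaze(xs, ys, ts, dist_thresh=30.0):
    """Group consecutive (x, y, t) gaze samples into fixations.

    A new fixation starts whenever the gaze moves farther than
    `dist_thresh` pixels (an assumed value) from the running centroid
    of the current fixation. Each fixation is returned as
    (center_x, center_y, duration); in a gaze plot, the circle
    diameter is drawn proportional to the duration.
    """
    fixations = []
    cx, cy, t0, n = xs[0], ys[0], ts[0], 1
    for x, y, t in zip(xs[1:], ys[1:], ts[1:]):
        if math.hypot(x - cx, y - cy) <= dist_thresh:
            # Still within the same fixation: update the running centroid.
            cx, cy = (cx * n + x) / (n + 1), (cy * n + y) / (n + 1)
            n += 1
        else:
            # Gaze jumped away: close the current fixation, start a new one.
            fixations.append((cx, cy, t - t0))
            cx, cy, t0, n = x, y, t, 1
    fixations.append((cx, cy, ts[-1] - t0))
    return fixations

def circle_radius(duration, scale=40.0):
    """Plot radius proportional to fixation time (scale is illustrative)."""
    return scale * duration
```

A plotting library can then draw one circle per returned fixation, centered at the centroid with the duration-scaled radius.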
The audio files acquired from the expert and the student describing each radiograph were converted to text using the Google [53] and Dragon [54] automatic speech recognition platforms. The text generated for each radiograph is added to the .json file. The structure of the .json file containing the label and text information is shown in Fig. 7. In addition, our dataset has the teeth labeled by the expert for all radiographs. These images also have the maxillomandibular region of interest annotated, so researchers can focus on the information contained within this region. Fig. 8 provides a sample radiograph with the teeth annotated and its maxillomandibular region of interest. The universal tooth numbering system was used to identify and label the teeth in all the panoramic images.
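The actual .json schema is defined by Fig. 7; the snippet below sketches a hypothetical record carrying the five description levels from this section, with invented field names and values, simply to show how the per-radiograph annotations could be consumed programmatically.

```python
import json

# Hypothetical record: the real field names and layout follow Fig. 7.
record = json.loads("""
{
  "image": "1.jpg",
  "expert": {
    "location": "mandible, posterior, periapical",
    "periphery": "well-defined",
    "radiodensity": "radiolucent",
    "effects": ["root resorption"],
    "category": "Inflammation",
    "dictation": "well-defined periapical radiolucency"
  }
}
""")

def five_level_description(annotation):
    """Collect the five-level abnormality description in order:
    location, periphery, radiodensity, effects, category."""
    return [annotation[k] for k in
            ("location", "periphery", "radiodensity", "effects", "category")]
```

The same accessor could be applied to the expert's and the student's files to compare their level-by-level findings, as Table II does for level one.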

IV. HOW TO USE THE DATABASE
The TDD contains 9000 images and 3 JSON files describing each radiograph. These images belong to 4 main subsets: panoramic radiographs, the labeled masks highlighting the abnormalities, the eye-tracking gaze plots (gray and quantized), and the labeled teeth mask. The labeled masks and gaze plots are present for both the expert and the student. Availability of this database is critical for research transparency in the field. Therefore, TDD will be publicly available for download from our research website on the Tufts ECE servers. To access the database, researchers will need to submit a request form to receive the download permission. The information on obtaining a copy of the TDD can be found on the project website (http://tdd.ece.tufts.edu/). Fig. 9 shows the folder structure of the TDD.

V. PERFORMANCE REVIEW
In this section, we present a practical evaluation of the TDD. In particular, the dataset is evaluated on recent state-of-the-art image enhancement and segmentation techniques. The performance of the segmentation and enhancement algorithms is assessed using popular metrics from the computer vision and deep learning fields.

A. Image Enhancement
In diagnostic radiography, the goal of image enhancement is to make the image visually more appealing. This is accomplished by increasing contrast, optimizing brightness, improving sharpness, and reducing noise. Subjective image enhancement is not expected to improve the accuracy of image interpretation and is often diagnostic-task-specific [55]. We test images from our dataset with methods previously used for medical image enhancement: Histogram Equalization (HE) [56], Contrast Limited Adaptive Histogram Equalization (CLAHE) [57], Bi-Histogram Equalization (BBHE) [58], Dynamic Fuzzy Histogram Equalization (DFHE) [59], and Guided Filtering (GF) [60]. In general, the performance of such enhancement methods is assessed using reference-based measures such as SSIM and PSNR. However, due to the unavailability of ground-truth reference images, we employ non-reference-based measures. AME and LogAME [61]-[66] are Human Visual System-inspired quality metrics for testing algorithms and new datasets in the absence of a reference. Furthermore, the presence of complex structures and non-uniform textures makes these measures good candidates for panoramic radiographs. Each image in the dataset was enhanced, and its quality scores were evaluated. An example set of images and their respective quality scores is shown in Fig. 10. As seen in Fig. 10, CLAHE performs best at enhancing the contrast of the images; lower AME and LogAME measures are evidence of higher image contrast [61]. Visually comparing the results of each algorithm, we see that GF performs well in enhancing edges when compared to HE, BBHE, and DFHE. To quantify the performance of each enhancement algorithm over the entire dataset, we compute the quality scores for each enhanced image. Averaging across all images, CLAHE performs best, with an average AME score of 24.32 and a LogAME score of 0.3702.
Table III displays the average scores and scores for a subset of images. As seen in the table, the CLAHE algorithm exhibits the best performance in enhancing the contrast of the images.
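As a concrete reference point, plain global HE can be written in a few lines; this is a generic textbook formulation, not the exact implementation benchmarked above. CLAHE differs by applying the same transfer function per tile, after clipping the histogram to limit local contrast amplification.

```python
import numpy as np

def hist_equalize(img):
    """Global histogram equalization (HE) for an 8-bit grayscale image.

    Gray levels are remapped through the normalized cumulative
    histogram, spreading the output levels over the full [0, 255]
    range and flattening the intensity distribution.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Classic HE transfer function, built as a 256-entry lookup table.
    # max(..., 1) guards against division by zero for constant images.
    lut = np.clip(
        np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255),
        0, 255).astype(np.uint8)
    return lut[img]
```

Applied to a low-contrast radiograph whose gray levels occupy a narrow band, this mapping stretches them across the full dynamic range, which is the effect the AME/LogAME scores quantify.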

B. Segmentation of Teeth From Radiographs
We evaluate the performance of different deep learning-based algorithms for segmenting teeth from panoramic radiographs. An example radiograph and its ground-truth mask can be seen in Fig. 3(a) and Fig. 3(e). We compare the performance of FPN [67], UNet [68], UNet++ [69], PSPNet [70], DeepLabV3 [71], and DeepLabV3+ [72]. Each of these networks was trained with a VGG19 [73], ResNet18 [33], or ResNet50 [33] backbone and was initialized using ImageNet pre-trained weights. The data consisted of the 1000 radiographs and ground-truth masks, split into training (85%) and validation (15%) sets. The models were trained on NVIDIA Tesla V100 GPUs with 32 GB of memory. For training, we use the Adam optimizer [74] with an initial learning rate of 1 × 10−3 and a weight decay of 1 × 10−4. All models are trained for 150 epochs with a batch size of 16. To evaluate the performance of each model, we employ the most popular metrics for assessing the accuracy of segmentation algorithms, namely Pixel Accuracy (PA) [75], Intersection over Union (IoU) [75], and the Dice coefficient [75]. Table IV summarizes the findings of the segmentation algorithms, displaying the scores for each model with its respective backbone architecture. UNet and UNet++ perform marginally better than the others. The average PA, IoU, and Dice coefficient scores across all models are 95.01%, 86.19%, and 91.96%, respectively. In addition, we test the performance of the nnUNet [76] and CE-Net [77] architectures, which were designed for medical image segmentation. Both architectures were trained from scratch without a backbone network. We utilize the generalized version of the nnUNet architecture and train the model for 150 epochs with the same hyperparameters. The nnUNet produces comparable results, as shown in Table IV. In the case of CE-Net, the architecture required considerably more training iterations to achieve comparable results.
After 400 epochs, the network produced a 92.67% PA score.
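The three evaluation metrics reduce to simple set operations on the binary masks. A minimal sketch for a single predicted/ground-truth mask pair (the benchmark scores above were computed over the full validation set):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel Accuracy, IoU, and Dice coefficient for binary masks.

    pred, gt: arrays of 0/1 (or bool) with identical shape, where 1
    marks tooth pixels.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # |pred ∩ gt|
    union = np.logical_or(pred, gt).sum()    # |pred ∪ gt|
    total = pred.sum() + gt.sum()            # |pred| + |gt|
    pa = (pred == gt).mean()                 # correctly classified pixels
    iou = tp / union if union else 1.0
    dice = 2 * tp / total if total else 1.0
    return pa, iou, dice
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), which is why the two columns in Table IV rank the models identically.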

VI. CONCLUSION
In this work, we have presented the Tufts Dental Database, a comprehensive new dataset with the potential to revolutionize dental AI practice. TDD includes tooth segmentation masks, abnormality masks, maxillomandibular region-of-interest masks, eye-tracking gaze maps, and text descriptions of the abnormalities. A detailed evaluation of baseline results for dental radiographic image enhancement and tooth segmentation is also provided. Furthermore, TDD opens the opportunity to seamlessly integrate the radiologist's expertise into AI systems, making them more efficient and reliable. This dataset also paves the way for a new generation of AI systems that highlight the identified features and explain the reasons for the outputs they generate.