Toward Better Ear Disease Diagnosis: A Multi-Modal Multi-Fusion Model Using Endoscopic Images of the Tympanic Membrane and Pure-Tone Audiometry

Chronic otitis media is characterized by recurrent infections, leading to serious complications, such as meningitis, facial palsy, and skull base osteomyelitis. Therefore, active treatment based on early diagnosis is essential. This study developed a multi-modal multi-fusion (MMMF) model that automatically diagnoses ear diseases by applying endoscopic images of the tympanic membrane (TM) and pure-tone audiometry (PTA) data to a deep learning model. The primary aims of the proposed MMMF model are to add “normal with hearing loss” as a category and to improve the diagnostic accuracy for the four conventional ear disease states: normal, TM perforation, retraction, and cholesteatoma. To this end, the MMMF model was trained on 1,480 endoscopic images of the TM and PTA data to distinguish five ear disease states: normal, TM perforation, retraction, cholesteatoma, and normal (hearing loss). It employs a feature fusion strategy of cross-attention, concatenation, and gated multi-modal units in a multi-modal architecture encompassing a convolutional neural network (CNN) and multi-layer perceptron. We expanded the classification capability to include an additional category, normal (hearing loss), thereby enhancing the diagnostic performance of extant ear disease classification. The MMMF model demonstrated superior performance when implemented with EfficientNet-B7, achieving 92.9% accuracy and 90.9% recall, thereby outperforming the existing feature fusion methods. In addition, five-fold cross-validation experiments were conducted, in which the model consistently demonstrated robust performance when endoscopic images of the TM and PTA data were applied to the deep learning model across all datasets. The proposed MMMF model is the first to include a category of normal ear disease state with hearing loss. The developed model demonstrated superior performance compared to existing CNN models and feature fusion methods.
Consequently, this study substantiates the utility of simultaneously applying PTA data and endoscopic images of the TM for the automated diagnosis of ear diseases in clinical settings and validates the usefulness of the multi-fusion method.


I. INTRODUCTION
Chronic otitis media (COM) with or without cholesteatoma is a significant public health issue affecting 0.5%-30% of the population and can lead to severe complications due to its characteristic recurrent infections [1]. Causes of cholesteatoma include congenital or chronic ear infections and even trauma. Additionally, cholesteatoma can have severe consequences, such as hearing loss, facial paralysis, and intracranial complications [2]. Otolaryngology involves various diagnostic methods for ear diseases, such as computed tomography (CT) and endoscopy analysis [3]. However, COM is often difficult to diagnose because it shows various signs and symptoms, such as middle ear inflammation, tympanic membrane (TM) perforation, and retraction [4]. Moreover, because the diagnosis of ear diseases primarily relies on visual data, such as endoscopy or CT images related to the eardrum, the accuracy of this diagnosis may be limited by the clinician's experience [5]. To address these problems, we applied artificial intelligence (AI) models in otolaryngological examinations to increase the accuracy of ear disease diagnosis.
Amidst recent advancements in deep learning technology, numerous studies have applied AI in the medical domain [6].
In otolaryngology, research leveraging deep learning technology to diagnose middle ear diseases has attracted increasing attention [7]. In an earlier study, Shie et al. deployed 865 otorhinolaryngological images in an AdaBoost model and discerned four diagnostic categories based on ear diseases: normal, acute otitis media, otitis media with effusion (OME), and COM, achieving an accuracy rate of 88.06% [8]. Khan et al. utilized 2,484 endoscopic images of the TM in a DenseNet-161 model, classifying ear diseases into three categories: normal, COM with perforation, and OME, and achieved an accuracy of 94.9% [9].
Endoscopic images of the TM are vital for diagnosing ear diseases because they qualitatively offer various visual markers indicative of ear pathologies, such as the color and transparency of the TM and the presence of middle ear effusion [5]. Consequently, the most prevalent method for automatically diagnosing ear diseases involves analyzing endoscopic images of the TM using AI [7], [10], [11]. However, the diagnostic performance of AI decreases when classes such as OME, which visually resembles normal TM, are included [12]. In addition, when relying solely on endoscopic images of the TM, without pure-tone audiometry, for diagnosing ear diseases, abnormal eardrum images with only subtle visual deviations may lead not only to misdiagnosis but also to disease progression. For example, when subtle otitis media is overlooked, symptoms such as hearing loss and ear fullness may worsen, with the possibility of newly formed chronic otitis media and cholesteatoma [13].
Studies have suggested that ear diseases can be diagnosed using pure-tone audiometry (PTA) data in conjunction with endoscopic images of the TM [1], [14], [15]. PTA measures air and bone conduction. The PTA air conduction threshold is ascertained using headsets or earphones, whereas the bone conduction threshold is assessed by vibrating the skull to stimulate the inner ear using a bone vibrator [16]. Both methods determine the patient's decibel threshold for each frequency band [17]. The discrepancy between PTA air and bone conduction thresholds, known as the air-bone gap (ABG), can help differentiate between normal and abnormal conditions of the middle ear [1], [16]. Numerous studies have proposed the fusion of images and electronic health record (EHR) data in the medical field [18]. However, to the best of our knowledge, no previous studies have integrated PTA data with endoscopic images of the TM for AI applications.
Ongoing efforts are being made to overcome the limitations of image-only models by fusing medical images with EHR data [18]. In prior research, Prabhu et al. applied MRI images combined with EHR data to multi-modal deep learning models for the classification of Alzheimer's disease [19]. Jabbour et al. diagnosed acute respiratory failure (ARF) by applying chest X-rays and EHR data to CNN and ANN models, respectively [20]. Additionally, besides EHR data, Kumar et al. analyzed patients' cough sounds, leveraging deep learning models to recognize pulmonary diseases [21]. Our model employs late fusion to integrate the endoscopic images of the TM and PTA data. However, while endoscopic images of the TM are high-dimensional data containing considerable information on ear diseases, PTA data contain significantly fewer details, resulting in an imbalance in the information between the two data types. Consequently, using a single-fusion approach may not deliver optimal performance [18]. To address these challenges, we propose a multi-modal multi-fusion (MMMF) model that employs multiple fusion methods rather than relying on a single method.
Cross-attention captures the interplay between two datasets [22]. Ying et al. [23] used a cross-attention method to fuse features from text and image data on social media platforms to detect fake news. Consequently, they surpassed the performance of existing state-of-the-art models. Concatenation is a straightforward feature fusion method that links features. Although simple, this method allows the fusion of different features without compromising the original state of each feature. Hilmizen et al. concatenated the features extracted from CT-scan and X-ray images to diagnose COVID-19 pneumonia, and the performance of this approach was superior to that of other approaches [24]. Gated multimodal units (GMUs) are modules designed to identify intermediate representations based on various feature combinations, enabling the learning of hidden, latent variables by fusing each feature [25]. Arevalo et al. developed a GMU fusion method to classify movie genres by fusing features from movie posters and plot data [25]. Their approach outperformed other fusion methods, including mixture-of-experts models. We implemented a multi-fusion method that applies feature fusion techniques, including cross-attention, concatenation, and GMUs.
The primary contributions of this work are as follows:
- We propose an MMMF model that automatically diagnoses ear diseases using semantic information from endoscopic images of the TM and PTA data.
- Our model employs a multi-fusion approach (cross-attention, concatenation, and GMU), incorporating each feature fusion method rather than depending on a single-feature fusion method to fuse the information extracted from the convolutional neural network (CNN) and multi-layer perceptron (MLP) models.
- We improved the diagnostic performance of ear diseases by fusing information derived from endoscopic images of the TM and PTA data. Furthermore, we expanded the classification capability beyond the typical categories of normal, TM perforation, retraction, and cholesteatoma by introducing an additional class: normal (hearing loss).
- We verified the efficacy of the multi-fusion method. Moreover, we applied the proposed MMMF model to the endoscopic images of the TM and PTA data, validating its superior diagnostic performance for ear diseases.

II. MATERIALS AND METHODS
A. PATIENT SELECTION AND DATA ACQUISITION
We collected and analyzed 1,632 TM endoscopic images from Korea University Ansan Hospital. Among these, 1,480 endoscopic images of the TM included 330 normal images, 554 images of TM with perforations, 300 retraction images, 159 cholesteatoma images, and 137 images obtained from patients categorized as normal (hearing loss). The remaining 152 images were excluded because they exhibited severe swelling, bleeding, indistinguishable diseases, and overlapping or blurred foci. Furthermore, the TM size, angle, location, rotation, light reflection, and smudging varied across the endoscopic images of the TM; however, these images were analyzed without filters, mirroring real-world clinical situations. The endoscopic photography equipment was replaced midway through the data acquisition process. Consequently, because the image resolutions varied between 1920 × 1080 and 640 × 480, we adjusted the image size to a uniform size of 384 × 384 pixels.
The clinical features of ear diseases are shown in Figure 1, with red circles indicating the location of a specific feature. TM perforation, TM or attic retraction, and cholesteatoma are features observed in individuals with hearing impairment. The presence of TM retraction indirectly shows the patient's Eustachian tube function while suggesting potential cholesteatoma formation in the middle ear. Cholesteatomas can induce symptoms such as hearing loss, otorrhea, vertigo, and headache. Table 1 lists the characteristics of the patients from whom PTA data were collected. Previous studies have suggested that hearing levels vary according to sex and age [27]; therefore, we included both factors in our PTA data. We collected PTA decibel thresholds at 0.25, 0.5, 1, 2, 3, and 4 kHz frequencies and computed the average decibel thresholds across the entire frequency band using two testing methods: air and bone conduction. Given that ABG can help differentiate normal and abnormal TM [1], [16], we included ABG in our analysis of patient characteristics. We also included the values calculated using the sexenary average formula to determine hearing loss grades [26]. The sexenary average is calculated according to the formula in [26]. When endoscopic images of the TM show normal eardrums but the sexenary average exceeds 25 dB, diseases such as sudden sensorineural hearing loss, congenital middle ear anomalies, and otosclerosis may cause hearing loss [28].
In addition, normal eardrums with an ABG of 11 or more may indicate otosclerosis, bone anomaly, and inner ear disorders [29]. Therefore, we included a hearing loss flag in our data because a difference of 11 or more between the ABG and the average value of PTA air, or a sexenary average value of 26 or more, significantly increases the likelihood of hearing loss. Consequently, we constructed one-dimensional data with a length of 25. The patients' ages ranged from 0 to 80 years, and the decibel levels varied between −10 and 120 dB. In addition, we classified the PTA dataset into 330 normal and 1,150 abnormal cases (perforation, retraction, cholesteatoma, and normal (hearing loss)) based on the analysis of endoscopic images of the TM.
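The derived PTA quantities described above can be illustrated with a minimal sketch. Two points are assumptions rather than details taken from this study: the sexenary average is written as the six-division weighted form (a + 2b + 2c + d) / 6 over the 0.5, 1, 2, and 4 kHz air-conduction thresholds, and the hearing loss flag is read as "ABG of 11 or more, or sexenary average of 26 or more". The function and key names are hypothetical.

```python
from statistics import mean

# Test frequencies (kHz) reported in the text.
FREQS_KHZ = (0.25, 0.5, 1, 2, 3, 4)


def sexenary_average(air):
    """ASSUMED six-division formula: (a + 2b + 2c + d) / 6, where a, b, c, d
    are the air-conduction thresholds (dB) at 0.5, 1, 2, and 4 kHz."""
    return (air[0.5] + 2 * air[1] + 2 * air[2] + air[4]) / 6


def pta_summary(air, bone):
    """Derive summary features from per-frequency thresholds (dicts keyed by kHz)."""
    avg_air = mean(air[f] for f in FREQS_KHZ)
    avg_bone = mean(bone[f] for f in FREQS_KHZ)
    abg = avg_air - avg_bone  # air-bone gap, taken here on the averaged thresholds
    six = sexenary_average(air)
    # ASSUMED reading of the flag rule: ABG >= 11 or sexenary average >= 26.
    flag = int(abg >= 11 or six >= 26)
    return {"avg_air": avg_air, "avg_bone": avg_bone, "abg": abg,
            "sexenary": six, "hearing_loss_flag": flag}


# Toy patient: flat 30 dB air-conduction and 10 dB bone-conduction thresholds.
air = {f: 30 for f in FREQS_KHZ}
bone = {f: 10 for f in FREQS_KHZ}
summary = pta_summary(air, bone)
```

For this toy patient the ABG is 20 dB and the assumed sexenary average is 30 dB, so the flag is set. How these values are arranged within the length-25 input vector is not specified in the text and is not modeled here.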
The endoscopic images of the TM and PTA data were randomly divided into five distinct datasets, each representing 20% of the total data per disease category, with no overlap. Four datasets, comprising 80% of the images (1,184), were employed for training, whereas the remaining dataset, containing 20% of the images (296), was used for validation. In addition, all procedures in this study were performed following the rules of the 1975 Helsinki Declaration, and the use of the data was approved by the IRB (2021AS0329) of Korea University Ansan Hospital. The ethics committee waived informed consent because of the retrospective nature of the study.
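A per-class (stratified) five-way split of this kind can be sketched as follows. This is an illustrative procedure under our own assumptions (round-robin assignment after shuffling, a fixed seed, and the hypothetical name `stratified_folds`), not the exact splitting code used in the study; with this scheme the fold sizes differ by a few samples from the nominal 296.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Return k disjoint lists of sample indices, each holding about 1/k of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class counts reported for the 1,480 retained images.
counts = {"normal": 330, "perforation": 554, "retraction": 300,
          "cholesteatoma": 159, "normal_hl": 137}
labels = [name for name, n in counts.items() for _ in range(n)]
folds = stratified_folds(labels)

# One cross-validation round: fold 0 validates, the other four folds train.
val_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
```

Rotating the validation fold over all five positions gives the five-fold cross-validation rounds.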

B. MLP MODEL FOR EXTRACTING SEMANTIC INFORMATION FROM PTA DATA
Images are high-dimensional data containing a wealth of information [30]. Consequently, they can be used to accurately identify the properties of normal, perforation, retraction, and cholesteatoma-affected eardrums. While one-dimensional PTA data can classify eardrums into normal and abnormal categories, identifying the specific characteristics of retraction, perforations, and cholesteatoma using PTA data presents a challenge. Hence, we applied PTA data to an MLP to develop a simple PTA model that extracts information regarding normal and abnormal TM states. The architecture of the proposed PTA model is shown in Figure 2. We constructed an MLP comprising an input layer with 25 nodes and a hidden layer with 144 nodes.
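The layer sizes stated above (25 input nodes, 144 hidden nodes, and a two-node head for the normal/abnormal check) can be sketched as a plain forward pass. The ReLU activation, the random initialization, and all function names are assumptions for illustration; the text does not state which activation or initialization was used.

```python
import random


def make_linear(n_in, n_out, rng):
    """Create a randomly initialized weight matrix (n_out x n_in) and a zero bias."""
    scale = (1.0 / n_in) ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


rng = random.Random(0)
hidden_layer = make_linear(25, 144, rng)   # 25 PTA inputs -> 144 hidden features
head = make_linear(144, 2, rng)            # normal vs. abnormal logits


def pta_features(x):
    """144-dimensional PTA feature used later for fusion (activation is ASSUMED)."""
    return relu(linear(x, *hidden_layer))


def pta_logits(x):
    """Two-class output used only to check that the features are informative."""
    return linear(pta_features(x), *head)


features = pta_features([0.5] * 25)
logits = pta_logits([0.5] * 25)
```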

C. CNN MODELS FOR EXTRACTING SEMANTIC INFORMATION FROM ENDOSCOPIC IMAGES OF THE TM
We extracted endoscopic image information using a pre-trained public CNN model validated using the ImageNet database. The CNN model was trained to classify images into 1,000 categories. Therefore, the ImageNet CNN model included a fully connected (FC) layer of 1,000 nodes. As we used the CNN model solely to extract image features, we employed it with the FC layer removed.

D. MMMF MODELS FOR THE AUTOMATIC DIAGNOSIS OF EAR DISEASES USING ENDOSCOPIC IMAGES OF THE TM AND PTA DATA
The architecture of the MMMF model is shown in Figure 3. First, the MLP model was applied to PTA data to extract MLP features related to hearing. In addition, endoscopic images of the TM were applied to the CNN model to extract the features of the TM. The CNN features were extracted both before and after average pooling. The CNN feature map before average pooling was reshaped into a channel × (width × height) map. The MLP feature map was then expanded by the number of channels in the CNN feature map to obtain CNN and MLP feature maps of the same size. Cross-attention was used to generate the CA features to fully integrate the information from the endoscopic images of the TM and PTA data. To preserve feature information from both models, the CNN feature map after average pooling was concatenated with the feature map extracted from the MLP, resulting in CC features. Finally, the GMU [25] module fused the CA and CC features to produce multi-fusion features. An FC layer was then used to classify the multi-fusion data into the following classes: normal, perforation, retraction, cholesteatoma, and normal (hearing loss).
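The tensor shapes implied by this pipeline can be traced with simple arithmetic. The sketch below assumes a backbone with an overall stride of 32 and 2,560 output channels, which are properties of the standard EfficientNet-B7 and are not stated in the text; the function name `mmmf_shapes` is hypothetical. Under these assumptions a 384 × 384 input yields a 12 × 12 map, that is, 144 spatial positions per channel, which matches the 144-node MLP hidden layer.

```python
def mmmf_shapes(image_size=384, stride=32, channels=2560, mlp_hidden=144):
    """Trace feature shapes through the fusion pipeline (stride/channels are ASSUMED)."""
    side = image_size // stride
    spatial = side * side
    if spatial != mlp_hidden:
        raise ValueError("MLP width must equal the flattened spatial size")
    return {
        "cnn_map": (channels, side, side),        # before average pooling
        "query": (channels, spatial),             # reshaped to c x (w * h)
        "key_value": (channels, mlp_hidden),      # MLP feature repeated per channel
        "pooled_cnn": (channels,),                # after average pooling
        "cc_feature": (channels + mlp_hidden,),   # concatenated CNN and MLP features
    }


shapes = mmmf_shapes()
```

How the cross-attention output is reduced before entering the GMU is not described in the text, so that step is left out of the trace.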

E. MULTI-FUSION METHOD
First, we applied the cross-attention mechanism in our model to mix information from the endoscopic images of the TM and PTA data. The cross-attention structure we used aligned with the scaled dot-product attention structure proposed in the transformer [31]. To learn the correlation between the information in endoscopic images of the TM and PTA data, each feature sequence should be utilized as three variables: query, key, and value. The CNN feature map, which contains information from endoscopic images of the TM, was employed as the query, and the MLP feature map, which contains PTA data information, was employed as the key and value.
We computed the correlation between the endoscopic images of the TM and the PTA data using dot products of the query and key. By scaling this result and applying the softmax function, we derived the attention weights for both datasets. These weights were then multiplied with the value to produce the CA feature, which was used for the GMU module. The operation process for cross-attention is as follows:
$$\mathrm{CA}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where Q is the query obtained from the CNN feature map, K and V are the key and value obtained from the MLP feature map, and d_k is the dimension of the key, as in the scaled dot-product attention of [31].
Second, the concatenation method was used to focus on the interactions between the features extracted from the CNN and MLP, and these features were fused without damaging their original state.
The structure of the GMU module is shown in Figure 4. We used the GMU module to extract a multi-fusion feature that learned the intermediate representation and hidden, latent variables of the cross-attention-fused CA feature and CC feature. This CA feature contained information about the interrelation between the two datasets, and the CC feature retained the original forms of the endoscopic images of the TM and PTA data.
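The cross-attention step described above, with the CNN feature map as the query and the expanded MLP feature map as the key and value, can be sketched in plain Python as scaled dot-product attention in the sense of [31]. Any learned projection layers applied to the query, key, and value are omitted, and the toy sizes (3 channels, 4 positions) and function names are illustrative assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def cross_attention(query, key, value):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = len(key[0])
    scores = matmul(query, transpose(key))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, value), weights


# Toy sizes: 3 channels x 4 spatial positions instead of the real feature map.
cnn_map = [[0.1, 0.4, -0.2, 0.3],
           [0.0, -0.5, 0.2, 0.1],
           [0.7, 0.2, 0.1, -0.3]]
mlp_feature = [0.5, -0.1, 0.3, 0.2]
# The MLP feature is repeated once per CNN channel so that both maps match in size.
mlp_map = [list(mlp_feature) for _ in range(len(cnn_map))]

ca_feature, attn = cross_attention(cnn_map, mlp_map, mlp_map)
```

Each row of `attn` sums to one, and `ca_feature` has the same shape as the query map.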
Within the GMU module, our first step was to extract hidden features from the CA and CC features. Subsequently, employing the concatenation operator and the sigmoid activation function, we derived the gate activation z from the two features. We then combined z with the hidden features of the CA and CC features, resulting in a multi-fusion feature used for ear disease classification. Following the GMU formulation of [25], the operational process is as follows:
$$h_{CA} = \tanh(W_{CA}\, x_{CA})$$
$$h_{CC} = \tanh(W_{CC}\, x_{CC})$$
$$z = \sigma(W_{z}\, [x_{CA}, x_{CC}])$$
$$h = z \odot h_{CA} + (1 - z) \odot h_{CC}$$
where θ = {W_CA, W_CC, W_z} represents the parameters to be learned, and [•, •] denotes the concatenation operator; all operations are differentiable. This fusion method can be easily integrated with other neural network architectures and trained using stochastic gradient descent.
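A minimal sketch of a two-input gated multimodal unit in the form given by Arevalo et al. [25] is shown below: tanh hidden features, a sigmoid gate computed on the concatenated inputs, and a gated sum. The feature sizes, the random weights, and the function names are illustrative assumptions, not the configuration used in this study.

```python
import math
import random


def linear(x, weights):
    """Matrix-vector product: one output per weight row (bias omitted)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gmu(x_ca, x_cc, w_ca, w_cc, w_z):
    """Fuse two feature vectors with a learned gate z in (0, 1)."""
    h_ca = [math.tanh(v) for v in linear(x_ca, w_ca)]
    h_cc = [math.tanh(v) for v in linear(x_cc, w_cc)]
    z = [sigmoid(v) for v in linear(x_ca + x_cc, w_z)]  # list '+' is concatenation
    return [g * a + (1.0 - g) * b for g, a, b in zip(z, h_ca, h_cc)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x_ca = [0.2, -0.4, 0.1, 0.9]             # toy CA feature (length 4)
x_cc = [0.5, 0.3, -0.7, 0.0, 0.6, -0.2]  # toy CC feature (length 6)
hidden = 3
fused = gmu(x_ca, x_cc,
            random_matrix(hidden, len(x_ca), rng),
            random_matrix(hidden, len(x_cc), rng),
            random_matrix(hidden, len(x_ca) + len(x_cc), rng))
```

Because the gate forms a convex combination of two tanh outputs, every element of `fused` lies strictly between -1 and 1; in the full model this vector would feed the final FC classification layer.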

F. TRAINING DETAILS
We evaluated the usefulness of the features extracted from the MLP model described in Figure 2, in which we added an FC layer with two output nodes to the MLP model. To extract features from endoscopic images of the TM, we compared seven CNN models pre-trained on ImageNet. The models include VGG-19 [32], ResNet-152 [33], GoogLeNet [34], DenseNet-161 [35], Inception-V3 [36], Inception+ResNet-V2 [37], and EfficientNet-B7 [38]. In addition, we added an FC layer with four output nodes to each CNN model. Following this, we adopted the CNN model that exhibited the best performance and compared and evaluated our proposed MMMF model with conventional feature fusion methods. All models were trained using a batch size of 16, a learning rate of 1e−4, the Adam optimizer, and a cross-entropy loss function. Moreover, owing to a class imbalance in the data, class weights were applied to the loss, calculated from the ratio of the number of data points for each disease. All experiments in this study were conducted on a deep learning server equipped with eight NVIDIA GeForce RTX 3080 12GB graphics processing units.
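The class-weighted loss can be illustrated with the reported class counts. The exact weighting rule is not given in the text; the sketch below assumes inverse-frequency weights, w_c = N / (K · n_c), and the function names are hypothetical.

```python
import math

# Class counts reported for the five-class dataset.
CLASS_COUNTS = {"normal": 330, "perforation": 554, "retraction": 300,
                "cholesteatoma": 159, "normal_hl": 137}


def inverse_frequency_weights(counts):
    """ASSUMED rule: weight each class by N / (K * n_c), so rare classes weigh more."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


def weighted_cross_entropy(logits, target_index, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return -weights[target_index] * (logits[target_index] - log_norm)


class_weights = inverse_frequency_weights(CLASS_COUNTS)
weight_list = list(class_weights.values())
# Uninformative logits on a "normal (hearing loss)" sample (class index 4).
loss = weighted_cross_entropy([0.0] * 5, 4, weight_list)
```

Under this rule the smallest class, normal (hearing loss), receives the largest weight, so its errors contribute more to the training loss.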

G. EVALUATION PROTOCOLS
We evaluated the performances of all the models using accuracy and recall indicators. Accuracy, which represents the proportion of correctly predicted data across an entire dataset, is the most commonly used performance metric. Recall indicates the proportion of data correctly predicted to belong to the actual class from the entire dataset of that class. The formulas for these evaluation metrics are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively. The higher the value of each metric, the better the classification performance.
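As a small worked example of these metrics in a multi-class setting, the sketch below computes accuracy and per-class recall from label lists. Averaging the per-class recalls with equal weight (macro averaging) is an assumption, since the text does not state how the single recall value was aggregated; the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred, n_classes):
    """TP / (TP + FN) for each class, treating that class as the positive one."""
    recalls = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return recalls


# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred, 3)
macro_recall = sum(recalls) / len(recalls)  # ASSUMED aggregation
```

Here four of six predictions are correct, and the per-class recalls are 0.5, 1.0, and 0.5.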

III. RESULTS
A. PTA MODEL FEATURE EXTRACTION PERFORMANCE
As depicted in Figure 3, to implement cross-attention between the semantic information from endoscopic images of the TM and PTA, a feature map of the same size is required. Therefore, we generated a feature map with a length of 144, corresponding to one channel size of the CNN feature map, to represent the semantic information of PTA data. In addition, it is essential to ascertain whether the features of PTA data can distinguish between the normal and abnormal states of the eardrum. We trained the model by adding an FC layer with two output nodes to the MLP model, as shown in Figure 2. The MLP model demonstrated an accuracy of 90.5%, recall of 94.0%, and loss of 0.221. This indicates that when the PTA data were applied to our proposed MLP model, the features of the normal and abnormal eardrum conditions were effectively differentiated.

B. CNN MODEL CLASSIFICATION PERFORMANCE
The performances of seven CNN models pre-trained on ImageNet were compared for use in our MMMF model. Our endoscopic images of the TM showed the characteristics of four ear disease states: normal, perforation, retraction, and cholesteatoma. Therefore, the CNN models were trained to classify the images into four classes.

C. MULTI-FUSION AND SINGLE-FUSION COMPARISON RESULTS
We extended the classification of ear diseases into five categories: normal, perforation, retraction, cholesteatoma, and normal (hearing loss). The latter was identified using features of abnormal hearing extracted from the PTA data in conjunction with features of the normal class extracted from endoscopic images of the TM. Table 3 compares the performance of the proposed MMMF model with that of conventional single-feature fusion methods. Our model exhibited superior performance with an accuracy of 92.9%, recall of 90.9%, and loss of 0.671. Our proposed model also outperformed the EfficientNet-B7 model, which was trained to classify four classes, despite adding a fifth class: normal (hearing loss). Figure 5 shows cases of improved diagnosis of ear diseases using our proposed MMMF model compared to using only endoscopic images of the TM in the EfficientNet-B7 model. This indicates that our model not only effectively classifies the additional normal (hearing loss) class but also enhances the diagnostic performance of the original four types of ear diseases.

D. MMMF MODEL FIVE-FOLD CROSS-VALIDATION RESULT
Table 4 summarizes the results of the five-fold cross-validation of the proposed MMMF model. Our model demonstrated consistent performance across all datasets, achieving an average accuracy of 89.7%, recall of 87.2%, and loss of [...]. Confusion matrices depict the rate of predictions for each class, as well as the proportions at which certain classes are mistakenly identified as others. Within these matrices, the diagonal elements represent the correct prediction rate for each specific class. Averaging the correct prediction rates over all classes, weighted by the number of samples in each class, yields the overall accuracy value. Furthermore, elements outside the diagonal provide insights into which specific classes were commonly misclassified. Figure 6 shows the confusion matrix results of all datasets for our proposed model, demonstrating the performance of our model in accurately classifying all five types of ear diseases across all datasets. This confirms the capability of the proposed MMMF model to accurately categorize the added normal (hearing loss) class across all datasets.
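The relationship between a row-normalized confusion matrix and the overall accuracy can be checked on a toy example. The counts below are invented for illustration and are not results from this study; rows are true classes and columns are predicted classes.

```python
def row_normalize(matrix):
    """Convert per-class counts into per-class prediction rates (rows sum to 1)."""
    return [[v / sum(row) for v in row] for row in matrix]


# Hypothetical 3-class confusion counts (rows: true class, columns: predicted class).
confusion = [[8, 2, 0],
             [1, 18, 1],
             [0, 1, 4]]

rates = row_normalize(confusion)
diagonal = [rates[i][i] for i in range(len(rates))]   # correct prediction rate per class
row_totals = [sum(row) for row in confusion]
total = sum(row_totals)

# Overall accuracy from raw counts ...
overall_accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
# ... equals the diagonal rates averaged with class-size weights.
weighted_diagonal = sum(d * n for d, n in zip(diagonal, row_totals)) / total
```

The two quantities coincide, whereas an unweighted mean of the diagonal would instead give the macro-averaged recall.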

E. GRAD-CAM ANALYSIS
Grad-CAM is a common method for visualizing the areas in an image that the CNN model focuses on during classification. Figure 7 shows the Grad-CAM outputs for each disease type for the MMMF and EfficientNet-B7 models. EfficientNet-B7, when integrated into the proposed MMMF model, displays a heat map of the precise eardrum and affected area locations, resembling the findings from the standalone EfficientNet-B7 model. This demonstrates that the EfficientNet-B7 model incorporated into the MMMF model classifies ear diseases by focusing on exact eardrum locations.

IV. DISCUSSION
The diagnosis of ear diseases predominantly depends on ear examinations and clinician expertise [3], [5]. Moreover, diagnostic accuracy can vary based on the clinician's experience, as the primary basis of diagnosis lies in visual data, such as endoscopic, CT, and MRI images [5]. Studies have indicated that the average diagnostic accuracy of pediatricians and otolaryngologists is below 70%, indicating that their diagnostic precision is not high [39]. Hence, incorporating AI technology into otolaryngology could assist in making objective judgments when diagnosing ear diseases. Studies on the automated diagnosis of ear diseases suggest that deep learning models based on endoscopic images of the TM can considerably assist physicians [40]. Most AI research in otolaryngology utilizes eardrum images [7], [10], [11]. However, if only endoscopic images of the TM are used in a deep learning model, there is a risk of misdiagnosing ear diseases that appear similar to a normal eardrum [12]. In previous research, the fusion of medical images and EHR data has been employed to overcome the challenge of diagnosing complex clinical situations using only medical images [18]. In otolaryngology, eardrum images and PTA data can be used to diagnose ear diseases [1], [14], [15]. Therefore, by applying PTA data and endoscopic images of the TM to our deep learning model, we addressed problems such as the misdiagnosis of diseased eardrums that appear similar to normal eardrums and of patients with hearing problems who exhibited normal eardrum images, as shown in Figure 5. In addition, our MMMF model achieved superior diagnostic performance compared with conventional feature fusion methods and with traditional deep learning models that use only endoscopic images of the TM.
Our proposed MMMF model, using 1,480 endoscopic images of the TM and PTA data, automated the diagnosis of five ear disease states with an accuracy of 92.9% and a recall of 90.9%. These results demonstrate improvements of 1.4% and 3.3%, respectively, over those of the CNN models that used only endoscopic images of the TM for diagnosing the four ear disease states. In our proposed model, the EfficientNet-B7 model accurately classified specific ear diseases by pinpointing the exact location of the eardrum and affected area in the otoendoscopic image. The PTA model accurately classified a patient's eardrum condition as normal or abnormal. Therefore, by merging the CNN and PTA models, we not only classified an additional normal (hearing loss) class but also enhanced the performance of automatic ear disease diagnosis. Furthermore, our proposed method represents the first attempt to apply a multi-fusion method in otolaryngology and is also the first study to classify the normal (hearing loss) class of ear disease. This supports the capability of our MMMF model and multi-fusion method to diagnose a broader spectrum of ear diseases more efficiently than previous diagnostic strategies. Moreover, it supports the enhanced utility of concurrently using endoscopic images of the TM and PTA data over solely relying on endoscopic images of the TM in deep learning models.
The medical significance of this study lies, first, in its potential use in developing countries, where otologists are not readily available for patients' hospital care. Using both PTA data and endoscopic images of the TM to increase diagnostic rates may also be employed in telemedicine in developing countries, where endoscopic images of the TM and PTA data may be sent for analysis and diagnosis in a cost-efficient manner. Moreover, physicians with endoscopic systems but different specialties, such as pediatrics, internal medicine, or family medicine, may examine features that signal chronic ear diseases, such as attic destruction or minimal TM perforations. The automatic diagnosis of chronic otitis media may assist medical doctors in diagnosing patients with such features. Finally, the current study not only enables the discrimination of a normal TM with hearing loss from a normal TM without hearing loss but also facilitates the diagnosis of the aforementioned diseases in clinical situations.
However, our proposed method has the limitation of not implementing data augmentation. Moreover, despite the concurrent use of endoscopic images of the TM and PTA data in our proposed MMMF model, it only classifies five ear disease categories. Previous studies indicated that increased amounts of data could enhance the accuracy of automatic diagnostic systems for ear diseases [40]. Nevertheless, patient medical data collection is challenging owing to various regulations such as the Privacy Act [41]. Hence, previous otolaryngology studies have used data augmentation techniques such as image rotation and flipping [9], [40], [42]. In our approach, the eardrum endoscopy data could be augmented in the same way. However, because each augmented image must be paired with PTA data, augmentation would generate duplicate PTA data, equivalent to the number of augmented eardrum endoscopic images. Therefore, we could not perform data augmentation. Despite this, our study demonstrated excellent performance with a relatively small dataset of 1,480 samples. If we manage to amass a larger dataset consisting of more eardrum endoscopy images and PTA data, our proposed MMMF model can potentially be used to diagnose six or more classes. Therefore, future studies should aim to collect more data and conduct tests to diagnose various ear diseases. Furthermore, a more detailed exploration of the relationship between the endoscopic images of the TM and PTA data, or the adoption of state-of-the-art technologies, might prove beneficial not only in classifying the 'normal with hearing loss' category but also in introducing new categories for diagnosis.

V. CONCLUSION
In this study, we developed an MMMF model for automatically diagnosing five ear disease classes: normal, perforation, retraction, cholesteatoma, and normal (hearing loss), employing endoscopic images of the TM and PTA data. Our model demonstrated the best performance when EfficientNet-B7 was applied, with an accuracy of 92.9% and recall of 90.9%. Furthermore, the proposed multi-fusion method exhibited superior performance over the single-fusion method, and our model demonstrated excellent results across all datasets in a five-fold cross-validation. Despite using a feature fusion method, our model categorizes ear diseases by referring to the precise locations of the eardrum and affected areas in the endoscopic images of the TM. Thus, the proposed model outperformed traditional single-feature fusion methods and CNN models that solely utilize endoscopic images of the TM. Considering that our PTA model was integrated with a conventional CNN model in this study, the additional classification of the normal (hearing loss) class could be attributed to the PTA model. Furthermore, considering the application of the multi-fusion method to multi-modal data, the enhanced performance of ear disease diagnosis can be attributed to the multi-fusion method. Consequently, the proposed method demonstrates that deep learning models can leverage new semantic information from endoscopic images of the TM and PTA data to diagnose complex ear diseases, thereby achieving high diagnostic performance. This indicates the potential benefits for future clinical scenarios, such as telemedicine and diagnostic support systems.

FIGURE 1. Types of ear diseases in the collected dataset. The red circles indicate the locations representing specific characteristics of each disease. (a) Normal TM. (b) Marginal perforation of TM. (c) Attic retraction. (d) Attic destruction with cholesteatoma.

FIGURE 2. MLP architecture for extracting meaningful features from PTA data.

FIGURE 3. MMMF model architecture. Information extracted from endoscopic images of the TM and PTA data through CNN and MLP is integrated using the cross-attention, concatenate, and GMU fusion methods. Following fusion, the combined features are forwarded through the FC layer to classify five ear diseases (normal, perforation, retraction, cholesteatoma, and normal (hearing loss)). PTA, Pure-tone audiometry. CNN, Convolutional neural networks. MLP, Multi-layer perceptron. CA, Cross-attention. CC, Concatenate. GMU, Gated multimodal units. Avg, Average. FC layer, Fully connected layer. c, Channel size. w, Width size. h, Height size.

FIGURE 5. Examples in which misclassifications by the EfficientNet-B7 model were corrected when using the proposed model. Tensor is the predicted value for each class by the model. For the EfficientNet-B7 model, the order is normal, perforation, retraction, and cholesteatoma. The order of the proposed model is normal, perforation, retraction, cholesteatoma, and normal (hearing loss). The target is the ground truth.
model classifies ear diseases by focusing on exact eardrum locations.

FIGURE 7. Comparison of heat maps of the Grad-CAM of the EfficientNet-B7 and MMMF models for each disease condition. The closer the color is to red, the greater the influence on the model's classification of ear diseases. Heat maps for the (a) normal, (b) perforation, (c) retraction, (d) cholesteatoma, and (e) normal (hearing loss) classes.

TABLE 1. PTA data characteristics for patients included within the dataset used to train the model.

TABLE 2. Performance comparison of CNN models.

TABLE 3. Performance comparison between the proposed MMMF model and conventional single-feature fusion methods.

TABLE 4. Five-fold cross-validation performance results of the proposed MMMF model.