Characterization of Synthetic Health Data Using Rule-Based Artificial Intelligence Models

The aim of this study is to apply and characterize eXplainable AI (XAI) to assess the quality of synthetic health data generated using a data augmentation algorithm. In this exploratory study, several synthetic datasets are generated using various configurations of a conditional Generative Adversarial Network (GAN) from a set of 156 observations related to adult hearing screening. A rule-based native XAI algorithm, the Logic Learning Machine, is used in combination with conventional utility metrics. The classification performance in different conditions is assessed: models trained and tested on synthetic data, models trained on synthetic data and tested on real data, and models trained on real data and tested on synthetic data. The rules extracted from real and synthetic data are then compared using a rule similarity metric. The results indicate that XAI may be used to assess the quality of synthetic data by (i) the analysis of classification performance and (ii) the analysis of the rules extracted on real and synthetic data (number, covering, structure, cut-off values, and similarity). These results suggest that XAI can be used in an original way to assess synthetic health data and extract knowledge about the mechanisms underlying the generated data.


I. INTRODUCTION
THE area of synthetic data generation is attracting growing attention in healthcare. Generation of high-quality synthetic data can help build realistic datasets that can be shared openly in the educational and scientific community, for example to support the development of predictive models of disease, averting issues related to patient identification and data privacy that frequently limit the widespread use of health data [1], [2], [3], [4]. Data augmentation can help limit issues related to missing data, misuse, or lack of compliance [5], [6], [7]. Synthetic data generation can help develop large datasets from small ones as well as balanced datasets from highly unbalanced ones, and it can help limit the costs of building datasets from large cohorts of patients [8], [9]. The goal of data augmentation algorithms is to create realistic and useful synthetic data, namely preserving distributions, predictive capabilities, and relationships [1], [9]. The field of data generation algorithms is still an important area of research; however, it is beyond the aims of this study to develop and assess synthetic data generation techniques. Rather, this study focuses on introducing and characterizing novel metrics to assess the quality of synthetic data. Several approaches have been introduced in the literature, such as utility metrics derived from the distributions (e.g., Maximum Mean Discrepancy (MMD), Hellinger distance (HD), Classifier Two Sample (C2S) metric), or measures based on classification performance on real and generated data [10], [11], [12], [13], [14]. Utility metrics and classification performance can give a general picture of the quality of generated data, but they provide limited insight into the way input-output relationships are preserved in synthetic data. EXplainable AI (XAI) techniques could help assess if, and to what extent, synthetic data maintain input-output relationships similar to those found in real data [15], [16].
When dealing with health data, XAI methods are particularly promising as they can help healthcare experts enter the logic of the machine learning process and extract knowledge about the mechanisms underlying the observed phenomena in a meaningful and transparent way, so that synthetic data can be validated against available knowledge [17], [18]. In a preliminary study on data from a pilot experiment on respiratory disease monitoring, we showed that conventional utility metrics are able to anticipate XAI classification performance, with low utility metrics being associated with low classification performance of XAI models trained on synthetic data and tested on real data [6]. However, the ability of XAI to provide additional information about the logic underlying synthetic health data has not been specifically investigated so far. The aim of this study is to apply and characterize XAI-based models and metrics as a means to assess the quality of synthetic health data. The novel contributions introduced here are: XAI evaluation of synthetic datasets in terms of feature relevance, visual inspection of rules, and classification performance; and the definition of a new rule similarity method to compare rule-based XAI models trained on synthetic and real data.

II. DATASET
The example health dataset assessed in this study includes hearing screening data collected from a self-administered adaptive speech-in-noise test for adult hearing screening in the context of project WHISPER (Widespread Hearing Impairment Screening and PrEvention of Risk) [19], [20]. Multivariate hearing screening data are particularly useful as they can be used to develop machine learning models able to identify individuals with hearing loss, therefore supporting widespread screening of this largely underdiagnosed condition, which is currently the third leading cause of years lived with disability worldwide [21], [22]. However, there is a scarcity of multivariate hearing screening data to build machine learning models to predict hearing loss as, to date, screening outcomes are typically determined on the basis of a single variable (e.g., the speech recognition threshold, SRT, or the number/percentage of correct responses) [23].
The dataset used in this study includes 156 records related to eight input features extracted from speech-in-noise testing and one output class, that is the presence or absence of hearing loss in the tested ear, as determined by the pure tone average (PTA), i.e. the average value of pure-tone thresholds measured at 0.5, 1, 2, and 4 kHz. The output class is defined following the World Health Organization (WHO) definition of slight/mild hearing impairment, in force until Feb 28, 2021 [24]: hearing loss ("HL", PTA > 25 dB HL: 55 records) and no hearing loss ("no HL", PTA ≤ 25 dB HL; 101 records). The input features extracted upon completion of the speech-in-noise test comprise: subject's age, SRT, measured in dB signal-to-noise ratio; #trials, i.e. number of presented stimuli; #correct, i.e., number of correct responses; %correct, i.e., percentage of correct responses; avg reaction time, i.e. average time needed to provide a response; total test time, i.e. total time needed to complete the test; and volume, i.e. self-adjusted volume set by the participant before taking the test, computed on a range from 0 to 1. The experimental protocol was approved by the Politecnico di Milano Research Ethical Committee (Opinion n. 2/2019, Feb 19 2019).

III. GENERATION OF SYNTHETIC DATA
In this study, synthetic data are generated using Generative Adversarial Networks (GAN) [25], a deep learning approach able to reach remarkable performance in generating high-quality synthetic data, for example in the field of images [26], biosignals [27], [28], and time series from patient monitoring devices [6]. A GAN comprises two neural networks: a generator (G) for generating fake but realistic data x′, and a discriminator (D) for distinguishing whether the data it receives are real or fake. Learning is achieved by an adversarial game between G and D: G uses the encoder-decoder scheme to build synthetic data, whereas D infers the separation between real and synthetic data. Therefore, D learns to become better at distinguishing real from synthetic data and G learns to generate better data to fool the discriminator [25]. In this study, a conditional GAN [29], [30] is implemented (see Fig. 1), i.e. a GAN in which G and D are conditioned during training by using output class labels in a way that G learns to produce realistic examples for each label in the training set starting from random noise, and D learns to distinguish fake example-label pairs (x′, y′) from real example-label pairs (x, y). A set of balanced synthetic datasets is generated by varying different GAN parameters, namely the number of nodes per layer in G and D networks, the batch size, and the number of epochs, as follows: 64. For each of these three configurations, the number of epochs is set at five different values: 10 000, 15 000, 20 000, 25 000, and 30 000, to obtain a total of 15 different synthetic datasets.

IV. ASSESSMENT OF SYNTHETIC DATA USING UTILITY METRICS
To monitor the quality of the GAN generation process, we use the following measures: MMD, C2S metric, HD and Pairwise Correlation Difference (PCD) [10].
The MMD metric is a measure of dissimilarity between two probability distributions P and Q that uses samples drawn independently from each of them [10]. Given a kernel k and its associated Reproducing Kernel Hilbert Space (RKHS) $H_k$ of functions defined on a set X, the distance between the two probability distributions P and Q in the original space is converted into a distance between their respective mean embeddings of features in the space $H_k$ [31]. A statistical hypothesis test is introduced to test the null hypothesis $H_0: P = Q$ versus the alternative hypothesis $H_1: P \neq Q$ [31]. The test statistic is compared to a threshold which depends on the probability P and K and is selected based on the chosen α level. In this study, a Gaussian radial basis function kernel (rbf) is chosen for the MMD statistical test.
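As an illustration of the quantity underlying the test, the following minimal sketch computes a biased estimate of the squared MMD with a Gaussian rbf kernel. The function names, the kernel width `gamma`, and the toy samples are assumptions made for this example and are not taken from the study; the hypothesis test and its threshold are not reproduced.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian rbf kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy two-feature samples (hypothetical values, for illustration only).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
close = [(0.1, 0.0), (0.9, 0.1), (0.0, 1.1)]
far = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]

print(mmd_squared(real, real))   # identical samples give 0
print(mmd_squared(real, close))  # small value for similar samples
print(mmd_squared(real, far))    # larger value for dissimilar samples
```

Lower values indicate that the two samples are closer in the kernel feature space, which matches the reading of the MMD used in the rest of this section.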
The C2S metric uses a machine learning classifier to assess whether two samples are drawn from the same distribution [10], [32], [33]. The C2S metric computation comprises the following steps: 1) A dataset D is built by combining the real samples, labeled as 0, and the synthetic samples, labeled as 1; 2) The dataset is randomly split into two disjoint training and testing subsets ($D_{train}$ and $D_{test}$, respectively); 3) A binary classifier (e.g., logistic regression) is trained on $D_{train}$ and the C2S metric is defined as the classification accuracy of this classifier computed on $D_{test}$. Hence, the higher the C2S metric (i.e., the accuracy), the more likely the two distributions are different, whereas for samples drawn from the same distribution the accuracy should remain near chance level. To maintain class balance between real and generated data, the MMD and C2S metrics are computed using the same number of samples as in the original dataset and then averaged across 10 random realizations of the sampled subsets.
The HD [14] is a utility metric, related to the Bhattacharyya coefficient, that evaluates the distance between two probability distributions in their original space. This metric ranges from 0 (i.e., identical distributions) to 1 (i.e., totally dissimilar distributions). The HD has been derived in this study starting from the probability density functions of the datasets to be compared.
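The three steps above can be sketched as follows. This is only an illustrative implementation: the small gradient-descent logistic regression, the learning rate, the number of epochs, the 30% test fraction, and the Gaussian toy data are all assumptions of this example, and the averaging over 10 random realizations is omitted.

```python
import math
import random

def train_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression weights (bias stored last) by batch gradient descent."""
    n_feat = len(X[0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
            grad[-1] += p - yi
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def c2s_metric(real, synth, test_frac=0.3, seed=0):
    """Accuracy of a real-vs-synthetic classifier on a held-out split."""
    rng = random.Random(seed)
    data = [(x, 0) for x in real] + [(x, 1) for x in synth]  # step 1: label samples
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)                      # step 2: disjoint split
    test, train = data[:n_test], data[n_test:]
    w = train_logistic([x for x, _ in train], [t for _, t in train])  # step 3
    correct = 0
    for x, t in test:
        z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
        correct += int((z > 0) == (t == 1))
    return correct / len(test)

# Toy data (hypothetical): two samples from the same distribution, one shifted.
rng = random.Random(42)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
shifted = [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(100)]

print(c2s_metric(real, same))     # expected to stay near chance level
print(c2s_metric(real, shifted))  # expected to be close to 1
```

Any binary classifier could replace the logistic regression used here; the metric only relies on the held-out accuracy.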
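A minimal sketch of the HD for a single feature is given below, assuming that the probability density functions are approximated by normalized histograms over shared bins. The binning choices and the toy values are assumptions of this example rather than the procedure adopted in the study.

```python
import math

def histogram(values, lo, hi, n_bins):
    """Normalized histogram of values over n_bins equal-width bins in [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp values on or outside the edges
        counts[idx] += 1
    return [c / len(values) for c in counts]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))

# Toy feature values (hypothetical).
a = [0.1, 0.2, 0.3, 0.4]
b = [0.6, 0.7, 0.8, 0.9]
p = histogram(a, 0.0, 1.0, 2)
q = histogram(b, 0.0, 1.0, 2)

print(hellinger(p, p))  # identical distributions give 0
print(hellinger(p, q))  # non-overlapping distributions give 1
```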
Finally, PCD [12] was evaluated to investigate if synthetic datasets were able to retain the correlations among features that characterize the original distribution. The PCD between a real and a synthetic dataset is defined as the Frobenius norm of the difference of the correlation matrices extracted from the two datasets to be compared. The lower the PCD, the greater the similarity between the correlations in the original dataset and those in the synthetic dataset.
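The definition above can be illustrated with the following sketch, which assumes Pearson correlation and represents each dataset as a list of records. The helper names and the toy records are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def corr_matrix(rows):
    """Correlation matrix of a dataset given as a list of records."""
    cols = list(zip(*rows))
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

def pcd(real_rows, synth_rows):
    """Frobenius norm of the difference between the two correlation matrices."""
    cr, cs = corr_matrix(real_rows), corr_matrix(synth_rows)
    n = len(cr)
    return math.sqrt(sum((cr[i][j] - cs[i][j]) ** 2
                         for i in range(n) for j in range(n)))

# Toy records with two features (hypothetical values).
real_rows = [(1, 2), (2, 4), (3, 6), (4, 8)]   # correlation +1
synth_rows = [(1, 8), (2, 6), (3, 4), (4, 2)]  # correlation -1

print(pcd(real_rows, real_rows))   # 0 when correlations are identical
print(pcd(real_rows, synth_rows))  # sqrt(8) when the correlation is reversed
```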

V. ASSESSMENT OF SYNTHETIC DATA USING XAI
Among the various XAI techniques available, in this study we used the Logic Learning Machine (LLM), a technique able to generate transparent models whose inner logic can be described using a set of n intelligible rules, in the form if (premise) then (consequence), where premise is a logical product of m conditions $c_j$, and consequence provides a class assignment for the output y [34], [35]. Let $x_1, \ldots, x_n$ be the input features, each defined in a specific domain. A condition involving the variable $x_j$ can take different forms depending on the type of variable, for example a threshold inequality such as $x_j \le \mu$ for an ordered variable. Each rule R(i) is characterized by its covering C(R(i)) and error E(R(i)), defined in (1) and (2):

$$C(R(i)) = \frac{TP(R(i))}{TP(R(i)) + FN(R(i))} \tag{1}$$

$$E(R(i)) = \frac{FP(R(i))}{TN(R(i)) + FP(R(i))} \tag{2}$$

where TP(R(i)), FP(R(i)), TN(R(i)), and FN(R(i)) are the true positives, false positives, true negatives, and false negatives associated with the rule R(i). Feature relevance is derived from (1) and (2). In order to obtain the relevance $Rel(c_j)$ of a condition, we compare the rule R, in which condition $c_j$ occurs, and the same rule without that condition, called R′. Since the premise part of R′ is less stringent, we obtain that E(R′) ≥ E(R), thus the quantity $Rel(c_j) = (E(R') - E(R))\,C(R)$ indicates the relevance for the condition of interest and, therefore, for the feature involved in that condition.
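The relevance computation can be illustrated with a short worked example. The sketch assumes the definitions of covering and error commonly used for the LLM, i.e. C(R) = TP/(TP + FN) and E(R) = FP/(TN + FP); the confusion counts are hypothetical.

```python
def covering(tp, fn):
    """Covering C(R): fraction of the rule's class that satisfies the rule (assumed definition)."""
    return tp / (tp + fn)

def error(fp, tn):
    """Error E(R): fraction of the other class that satisfies the rule (assumed definition)."""
    return fp / (tn + fp)

def relevance(err_without, err_with, cov):
    """Rel(c_j) = (E(R') - E(R)) * C(R), where R' is R without condition c_j."""
    return (err_without - err_with) * cov

# Hypothetical counts for a rule R and for R' (R without condition c_j).
cov_r = covering(tp=40, fn=60)   # C(R)  = 0.40
err_r = error(fp=5, tn=95)       # E(R)  = 0.05
err_r_prime = error(fp=20, tn=80)  # E(R') = 0.20, larger because R' is less stringent

print(relevance(err_r_prime, err_r, cov_r))  # (0.20 - 0.05) * 0.40, about 0.06
```

Removing a relevant condition lets more examples of the other class satisfy the rule, so the error grows and the relevance is positive.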

A. Analysis of Classification Performance
The real and synthetic datasets were randomly split into training and test sets by applying stratification. The classification performance was assessed by computing sensitivity, specificity, and F1-score in LLM models deployed with the following combinations of training (Tr) and test (Te) sets: Condition A, models trained and tested on real data; Condition B, models trained and tested on synthetic data; Condition C, models trained on synthetic data and tested on real data; Condition D, TrR = training set from real dataset (80%), TeS = test set is the whole synthetic dataset. A cross-classification (CC) measure [12] was introduced to summarize the similarity between real and synthetic datasets in terms of classification performance. Two CCs were computed as the ratio of the accuracy in conditions C and D to the accuracy in condition A.
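The CC computation can be sketched as follows. The accuracy values are hypothetical and only serve to show the arithmetic; the function name is an assumption of this example.

```python
def cross_classification(acc_cross, acc_real):
    """Ratio of the accuracy in a cross condition (C or D) to the accuracy in condition A."""
    return acc_cross / acc_real

# Hypothetical test accuracies for conditions A, C, and D.
acc_a = 0.80  # trained and tested on real data
acc_c = 0.76  # trained on synthetic data, tested on real data
acc_d = 0.72  # trained on real data, tested on synthetic data

cc_c = cross_classification(acc_c, acc_a)
cc_d = cross_classification(acc_d, acc_a)
print(cc_c, cc_d)  # values close to 1 indicate performance similar to the real model
```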

B. Analysis of Similarity Between Rules
A measure of similarity between rules is introduced, based on the cosine similarity between Bag of Words (BOW) [36] representations of the set of rules extracted from real and synthetic datasets. BOW is a widely used text representation approach (e.g., [37], [38]) where a text is decomposed into a matrix of words and their relative frequencies. In a preliminary study [39], a BOW-based metric was introduced to individually compare rules from different classes of the same dataset and rule sets referring to stratifications (e.g., different age groups) of the same phenomenon. In this study, we further elaborated this metric by considering the difference in covering between different rules and introducing a global similarity metric that, based on the similarity between pairs of rules, provides an estimate of similarity between rule sets, i.e., between the models that describe the underlying data. Each rule R(i), associated to the output class y, can be defined by a set of m conditions, each described by a word w (i.e., the combination of the feature name and direction of the inequality sign) and the related cut-off value t as shown in [35].
Two rules can be considered similar when their conditions share the same structure (i.e., same feature and same direction) and similar cut-off values [40]. In the specific case of classification rules, there can be at most one condition for each feature (i.e., a word can be present only once in the rule), so the related cells of the BOW matrix contain binary values (1 if the word is present and 0 if the word is not present). For each word, an additional column is added to account for the cut-off value, normalized between 0 and 1 based on the theoretical lowest and highest possible values of the feature. Once the BOW matrix is created for both rulesets to be compared, cosine similarity is applied to all the pairs of rules from $\{R_{real}(i_r)\}_{i_r=1}^{m_r}$ and $\{R_{synthetic}(i_s)\}_{i_s=1}^{m_s}$, divided by class, to obtain a measure of similarity between rules $S_{rs}$. Cosine similarity is a widely used text similarity measure, often combined with BOW representation (e.g., [41]), that measures the similarity between two vectors in terms of the cosine of the angle between them. To compute rule similarity, only rules with covering higher than 15% are considered, as rules with lower covering are representative of only a few input data and therefore may be subject to greater variability due to the choice of training and test partitions, especially in small datasets like the one used here. Intuitively, if the real and synthetic datasets are similar, their rules should be similar in terms of structure and covering. Hence, the difference in covering between rules extracted from real and synthetic data is introduced as a weighting factor in the computation of rule similarity. Therefore, the resulting similarity metric (5) combines $S_{rs}$ with the difference between $C(R_{real}(i_r))$ and $C(R_{synthetic}(i_s))$, i.e., the covering of the real rule and of the synthetic one, respectively.
A global similarity metric between rulesets $G_x$ is defined as the ratio of the number of real-synthetic rule pairs $n_x$ with similarity greater than a pre-determined threshold value x (i.e., 0.6 in this study) to the total number of rules extracted from the real dataset $m_r$, i.e., $G_x = n_x / m_r$.

VI. RESULTS
For the sake of simplicity, a reduced version of the experimental dataset including the three most relevant features (age, #correct, SRT), and the output class is used here to assess the outcomes of XAI on synthetic data and enable straightforward visualization and interpretation of results. Table I shows the MMD, the related p-value, the HD and the C2S metric as a function of the GAN settings for the 15 synthetically generated datasets. For the MMD and C2S metrics, the mean and standard deviation (s.d.) are computed over 10 iterations as described in Section IV. The results in Table I suggest that the synthetic datasets more similar to the real one in terms of MMD (lower values, p-value > 0.05), HD (lower values, e.g., < 0.40), and C2S metrics (near chance level) are #8, #9, #13, and #15; however, on the basis of the observed metrics, no straightforward indication of the 'most similar' synthetic dataset can be derived. The PCD was calculated to assess whether the synthetic datasets are able to maintain correlations between features that resemble those in the original dataset. PCD values obtained for the datasets with better MMD, C2S, and HD are very close to each other (i.e., PCD #8 = 1.01, PCD #9 = 0.83, PCD #13 = 0.65, PCD #15 = 0.81). Moreover, PCD values are better (i.e., smaller) in the datasets mentioned above than in the synthetic datasets with worse values of MMD, C2S, and HD (e.g., PCD #11 = 2.51).
TABLE I. MMD, HD, and C2S metrics for the synthetically generated datasets as a function of the GAN settings.
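The BOW encoding and the cosine similarity between two rules can be sketched as follows. The vocabulary, the feature ranges used for normalization, and the dictionary-based rule format are assumptions of this example; the covering-based weighting is omitted, so the sketch only reproduces the unweighted similarity between pairs of rules.

```python
import math

# Hypothetical theoretical ranges used to normalize cut-off values to [0, 1].
RANGES = {"Age": (0.0, 100.0), "SRT": (-20.0, 10.0)}
# One word per combination of feature name and inequality direction.
VOCAB = ["Age<=", "Age>", "SRT<=", "SRT>"]

def rule_vector(rule):
    """Encode a rule (dict: word -> cut-off) as [presence, normalized cut-off] per word."""
    vec = []
    for word in VOCAB:
        if word in rule:
            feat = word.rstrip("<=>")  # strip the inequality sign to get the feature name
            lo, hi = RANGES[feat]
            vec += [1.0, (rule[word] - lo) / (hi - lo)]
        else:
            vec += [0.0, 0.0]
    return vec

def cosine(u, v):
    """Cosine of the angle between two vectors (0 if either vector is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative rules: same structure with close cut-offs, and a different structure.
real_rule = rule_vector({"Age<=": 52.0})
synth_rule = rule_vector({"Age<=": 49.0})
other_rule = rule_vector({"SRT<=": -16.11})

print(cosine(real_rule, synth_rule))  # close to 1: same word, similar cut-off
print(cosine(real_rule, other_rule))  # 0: no shared word
```

Because each direction of the inequality is a separate word, two rules on the same feature but with opposite directions share no word and obtain a similarity of 0 under this encoding.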
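The global metric can be sketched as below, following a literal reading of the definition: all real-synthetic pairs above the threshold are counted and divided by the number of real rules. The similarity values are hypothetical.

```python
def global_similarity(sim_matrix, threshold=0.6):
    """Ratio of rule pairs with similarity above the threshold to the number of real rules.

    sim_matrix[i][j] is the similarity between real rule i and synthetic rule j.
    """
    m_r = len(sim_matrix)
    n_x = sum(1 for row in sim_matrix for s in row if s > threshold)
    return n_x / m_r

# Hypothetical similarities between 3 real rules (rows) and 2 synthetic rules (columns).
sims = [
    [0.70, 0.20],
    [0.10, 0.65],
    [0.30, 0.40],
]

g = global_similarity(sims)
print(g)  # 2 pairs above 0.6 over 3 real rules
```

Under this literal reading, the value can exceed 1 when a real rule is similar to more than one synthetic rule.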

A. Analysis of Classification Performance
The LLM model trained on the WHISPER dataset includes 12 rules overall (7 for "no HL", average covering = 25.32%; 5 for "HL", average covering = 25.65%). The rule with highest covering for class "no HL" (R r,noHL 1) indicates that subjects younger than 52 years are more likely to have better hearing ability than older subjects, in line with the well-known relationship between age and hearing loss [42]. The second rule with highest covering for class "no HL" (R r,noHL 2) indicates that subjects with a negative SRT (i.e., below −7.35 dB SNR) who achieve good results in the speech-in-noise test (i.e., more than 96 stimuli correctly identified) will probably belong to the normal hearing class. This rule synthesizes well the relationship between speech recognition ability and hearing loss. Conversely, subjects with a poor performance of speech recognition in noise (i.e., lower than 59 correct responses) as in R r,HL 1 will more likely suffer from hearing loss [23], [43]. Fig. 2 shows the classification performance on the test set (sensitivity, specificity, and F1-score) of the four synthetic datasets with low MMD and HD (#8, #9, #13, and #15) and a synthetic dataset with high MMD and HD (#11), as computed in the conditions A, B, C, and D defined in Section V-A. The degree of overfitting in the analyzed models was assessed by evaluating the difference between training and test accuracy of the LLM models obtained with the four different combinations of training and test set. The following mean differences were calculated for the datasets considered: condition A = 5.72% (sd = 3.6%) (average difference in performance obtained with 5-fold cross-validation), condition B = 2.28% (sd = 1.1%), condition C = 11.17% (sd = 1.8%), condition D = 12.43% (sd = 4.8%). The discrepancy between training and test performance is limited (i.e., lower than 13% on average), including conditions C and D in which training and test portions are extracted from different datasets.
Overall, the test performance is satisfactory, with accuracy around 75%-80% in the models with lower classification performance, thus demonstrating limited overfitting.
Generally, the performance metrics measured on the synthetic datasets #8, #9, #13, and #15 are higher than those measured on dataset #11, reflecting the well-known capability of the MMD to discriminate between datasets that are significantly different from the original one and datasets that are similar to the original one. In condition B, the classification performance of models trained and tested on synthetic datasets #8, #9, #13, and #15 is similar to or higher than that of models trained and tested on real data (condition A). Specifically, higher specificity and F1-score, and similar sensitivity are observed. In condition C, i.e. the condition in which the capability of synthetic models to be applied on real data is assessed, synthetic models from datasets #8, #9, #13, and #15 maintain a similar specificity and F1-score with respect to real data, but are characterized by a slightly lower sensitivity, suggesting that models trained on synthetic datasets are in general less able to detect the "HL" class, when applied on real data, compared to real models. In condition D, i.e. the condition in which the capability of the real model to classify synthetic data is evaluated, similar F1-score, higher specificity, and a drop in sensitivity are observed compared to condition A. The cross-classification based on test accuracy in condition C yields the following results: CC #8 = 0.98, CC #9 = 0.97, CC #13 = 0.94, CC #15 = 0.96, CC #11 = 0.60. The cross-classification based on the test accuracy in condition D yields the following results: CC #8 = 0.86, CC #9 = 0.99, CC #13 = 0.95, CC #15 = 0.92, CC #11 = 0.56. The classification performance is similar to the real one (i.e., CC close to 1) for synthetic datasets with lower MMD, HD and C2S, whereas classification performance is worse (i.e., CC lower than 1) for the synthetic dataset with higher MMD, HD and C2S metrics.
Table II shows the rule similarity coefficients, as defined in (5), obtained by comparing the LLM model trained on the real dataset with those trained on the synthetic datasets #8, #9, #13, and #15, i.e. the ones that are not significantly different from the real dataset, according to the MMD, HD, and C2S metrics. A global metric of comparison between rulesets $G_x$ is shown in the last column, defined as the ratio of the number of real-synthetic rule pairs with similarity greater than 0.6 to the total number of rules extracted from the real dataset. For the sake of clarity, only rules with covering higher than 15% are considered. The rules are reported in full detail in Appendix I.

B. Analysis of Similarity Between Rules
For most of the rules extracted from the real dataset there is at least one rule with similarity greater than 0.3 in each of the four synthetic datasets considered. It is worth noting that the rule similarity measure used here considers the rule structure, the cut-off values and the related covering, as defined in Section V-B. For example, from each of the four synthetic datasets a rule in the form Age ≤ $\mu_{Age}$ is observed, which is very similar to the one extracted from the real dataset (R r,noHL 1: Age ≤ 52), but the resulting similarities are slightly different, mainly due to differences in covering. The highest value of rule similarity has been identified for the rule R 15,noHL 5 that is similar to the real rule R r,noHL 3 (SRT ≤ −16.11; C: 21.74%) in terms of both structure and covering (SRT ≤ −17.75; C: 26.37%). Among the four synthetic datasets assessed here, #15 is the one with the highest global similarity $G_x$. Fig. 3 shows a visual overview of the rules extracted from the real dataset, from the optimal dataset, i.e. the one with low MMD, HD and C2S metrics and high rule similarity (#15), from a dataset with low MMD, HD and C2S metrics but relatively low global similarity (#9) and from a dataset with high MMD, HD and C2S metrics (#11). The inner circular crowns represent the rules of each model in terms of covering (outer diameter), error (inner diameter), and class (color) whereas the outer slices represent the values of each of the three input features in terms of class (color) and relevance (opacity). The rules extracted from the synthetic datasets #15 and #9 are more similar to the ones obtained from the real dataset in terms of number, covering, and error compared to those extracted from the synthetic dataset #11, which is associated with a higher number of rules, lower covering, and higher error.
In terms of value ranges associated with the two output classes, as shown in the outer slices, the synthetic dataset #15 shows a clear separation of the two classes for each of the three input features (cut-off values: Age: 49 years; #correct: 65; SRT: −9.49 dB SNR), with cut-off values that are similar to those observed in the real dataset (Age: 52 years; #correct: 64; SRT: −10.16 dB SNR). The model trained on dataset #9 presents similar cut-off in terms of #correct (i.e., 63), a clear, but higher, cut-off on age (i.e., 66), but no clearly defined cut-off on SRT. Conversely, no clearly identifiable cut-off values are found in the model trained on dataset #11 as the features are distributed in a similar way between the two classes.

VII. DISCUSSION
Synthetic data generation may be of help in creating large, balanced, de-identified medical datasets that can be used to train and validate new AI algorithms to improve disease detection and prediction, overcoming common problems in real-world clinical datasets such as data scarcity and class imbalance [5], [6], [7], [8]. Trustworthiness of medical decisions supported by AI models becomes essential, especially when the model has been built using synthetic or augmented data [44]. In this context, XAI techniques may enable transparent data generation and analysis, allowing the end user to understand the logic of the model and decide whether to trust and validate its decisions.
In this exploratory study, we propose and characterize a framework of XAI as a means to assess the quality of synthetic tabular data. Specifically, a fully interpretable algorithm (the LLM) is used to generate rule-based models of the data in order to simultaneously assess distributions, predictive capabilities, and relationships in synthetic data by the analysis of the set of rules and the related classification performances.
For a first characterization of the proposed approach, a dataset including multivariate measures of hearing performance, with a single record per subject (WHISPER dataset, 156 records) is considered. This dataset was chosen as an example, but the proposed approach is general and can be extended to different applications. Synthetic data (1000 records) are generated from this real dataset by using a conditional GAN and by systematically varying the number of G and D nodes, the batch size, and the number of epochs.

A. Assessment of Synthetic Data Using XAI
Datasets with significantly different values of MMD, HD, and C2S metrics are characterized by different levels of quality. Vice versa, when dealing with different synthetic datasets that exhibit similar values of utility metrics such as the MMD, HD and C2S metrics here used, quantitative analysis of XAI in terms of classification performances and inspection of decision rules is helpful to assess the similarity between synthetic and real data.
An example of application based on the WHISPER dataset including a subset of the most relevant input features (i.e., SRT, age, #correct), and output class is proposed in Section VI. Specifically, four different datasets similar to the real dataset based on MMD (i.e., low MMD, from $5.13 \times 10^{-2}$ to $5.40 \times 10^{-2}$, p-value > 0.05) and HD values (i.e., low HD, < 0.40) are compared in terms of classification performance (Fig. 2). LLM models trained on the selected synthetic datasets (#8, #9, #13, and #15) have, on average, slightly lower sensitivity with respect to the LLM model trained on real data, when tested on real data (condition C), thus they are generally less able to detect the target output class. Vice versa, synthetic models are on average better in identifying normal hearing subjects (i.e., higher specificity). All the LLM models trained on the selected synthetic datasets maintain a satisfactory classification performance, remarkably similar to the performance of the model trained on real data, as demonstrated by the cross-classification metric.

B. Analysis of Similarity Between Rules
In this study a rule similarity metric (5), defined as a combination of similarity in rule structure, cut-off values, and covering, is introduced to assess possible differences between the sets of rules that characterize the models extracted from different synthetic datasets. Rule similarity analysis highlights that the LLM model trained on synthetic dataset #15 is described by rules that are closer to those of the real model (i.e., higher $G_x$), with respect to those of the other candidate datasets (Table II). Rule visualization (Fig. 3) helps intuitively appreciate the differences in LLM models trained on the real dataset, the optimal synthetic dataset (#15, i.e., low MMD, HD and C2S metrics, highest $G_x$), a suboptimal synthetic dataset (#9, i.e., low MMD, HD and C2S metrics, low $G_x$) and an example where the data generation process has not achieved the desired results (#11, i.e., high MMD, HD and C2S metrics). As can be seen in Fig. 3, the inner logic of the LLM model trained on dataset #15 resembles that of the real one, by maintaining similar input-output relationships and cut-off values. Moreover, the data augmentation process seems to simplify the intrinsic behavior of certain variables, by cleaning up some regions of uncertainty in classification. For example, the model trained on synthetic dataset #15 amplifies the well-known relationship between SRT and hearing loss and allows us to define a cut-off at −9.49 dB SNR which is similar to the one suggested by previous studies (e.g., [43]). As expected, the LLM model trained on the synthetic dataset #11 (worse MMD and C2S metrics) has a much higher number of rules, with lower average covering, different structure and different cut-off values, than the one trained on the real dataset.
Rule similarity analysis provides additional information about the quality of the datasets compared to statistical measures derived from distributions (e.g., utility metrics like MMD, HD, and C2S metrics) or from model testing (e.g., classification performance in conditions B, C, and D). The synthetic datasets that pass the MMD, HD, and C2S tests (#8, #9, #13, #15) are then filtered by rule similarity, which confirms their quality as they all present one or more rules with similarity higher than 0.3 when compared to the real rules. However, for some of these datasets (e.g., dataset #15) higher $G_x$ is observed, suggesting a higher similarity to the real dataset in terms of input-output relationships. Therefore, the proposed rule similarity metric allows us to select a specific dataset, within a set of good-quality datasets that are considered equally similar in terms of utility metrics. For the computation of a global metric of comparison between datasets, in this preliminary study rule similarity has been considered as high when it exceeds 0.6; however, this value needs to be further validated. The proposed metric has been applied to LLM models, but it is in principle applicable to other native rule-based methods (e.g., Decision Trees) or black box models made explainable by post-hoc XAI methods. For example, visual inspection of partial dependence plots estimated from Random Forest models trained on real and synthetic datasets shows that the averaged partial dependence trends obtained from the synthetic datasets #15 and #9 are similar to the one obtained from the real dataset, and their approximate cut-off values are similar to the cut-off values of the rules as shown in Fig. 3. However, for partial dependence plots and, more generally, for post-hoc XAI techniques, further processing is needed to determine decision rules and further research in this direction would be necessary.
As common guidelines are still lacking on the evaluation of synthetic data in healthcare, further research may deal with a broader range of synthetic datasets, generated from other real-world datasets, to determine their specific similarity thresholds.
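The two thresholds mentioned above (0.3 for retaining a dataset, 0.6 for considering a rule similarity high) can be illustrated with a short Python sketch. The thresholds come from the text, but the aggregation used in `global_score` (the share of real rules whose best-matching synthetic rule exceeds 0.6) is only an assumed stand-in for the global metric, whose definition is not reproduced in this section. The function names, dataset labels, and similarity matrices are hypothetical.

```python
LOW_THRESHOLD = 0.3   # from the text: at least one rule above this value
HIGH_THRESHOLD = 0.6  # from the text: similarity considered high above this value


def passes_filter(sim):
    """True if at least one real/synthetic rule pair exceeds LOW_THRESHOLD."""
    return any(s > LOW_THRESHOLD for row in sim for s in row)


def global_score(sim):
    """Assumed aggregate: share of real rules whose best match is 'high'."""
    if not sim:
        return 0.0
    high = sum(1 for row in sim if row and max(row) > HIGH_THRESHOLD)
    return high / len(sim)


# Hypothetical similarity matrices (rows: real rules, columns: synthetic rules).
candidates = {
    "dataset_a": [[0.72, 0.10], [0.20, 0.65], [0.05, 0.31]],
    "dataset_b": [[0.35, 0.10], [0.20, 0.61], [0.05, 0.12]],
}

# Keep the datasets that pass the low threshold, then rank by the global score.
kept = {name: sim for name, sim in candidates.items() if passes_filter(sim)}
best = max(kept, key=lambda name: global_score(kept[name]))
```

Both hypothetical datasets pass the 0.3 filter, but they differ in how many real rules have a highly similar synthetic counterpart, which is the kind of distinction that utility metrics alone may not provide.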

C. Related Literature
TABLE III. RULES WITH HIGHER COVERING EXTRACTED BY THE LLM FROM WHISPER DATASET AND FROM THE SYNTHETIC DATASETS #8, #9, #13, AND #15, DIVIDED BY OUTPUT CLASSES: "NO HL" AND "HL"

In the past few years, some studies have explored different approaches for the generation and subsequent analysis of synthetic datasets in healthcare. Lu et al. [11] investigated the use of GANs to produce privacy-preserving synthetic data to circumvent possible privacy violation issues due to the release of publicly available datasets containing sensitive or identifying information. Specifically, correlation matrices were calculated to check whether the synthetic data preserved the original pairwise correlations between variables, and the similarity between the synthetic and original data distributions was assessed by evaluating the accuracy in a machine learning classification task, considering the same conditions A, B, and C as described in Section V-A. In our study, we further expand the approach by assessing whether the model trained on the original data is able to properly describe the synthetic data (condition D in the analysis of classification performance, Section V-A). A recent study by El Emam et al. [13] investigated the ability of a variety of utility metrics to evaluate 30 different health datasets and three different synthetic data generation methods, including Bayesian networks, GANs, and sequential tree synthesis. According to the authors, the HD is the metric that best ranks the synthetic data generation methods based on prediction performance. Another interesting example of synthetic data validation is the study by Goncalves et al. [12], which evaluated the quality of data generated from the cancer registry data of the Surveillance, Epidemiology, and End Results program of the US National Institutes of Health (NIH).
Data were generated using Bayesian Networks and GANs, and a set of different metrics was proposed, including utility metrics such as the Kullback-Leibler divergence, pairwise correlation difference, log-cluster metric, and support coverage, as well as cross-classification (i.e., models trained on the original data only and tested on hold-out data from both original and generated data, and models trained on synthetic data only and tested on hold-out data from both original and generated data). However, even though a decision tree was used to compute the cross-classification metrics, the study did not address the rules extracted by the decision tree trained on the real and synthetic datasets. To our knowledge, no study so far has evaluated the quality of synthetic data by combining statistics, performance metrics, and XAI-based measures. The results of this study confirm the potential value of XAI for assessing synthetic data qualitatively and quantitatively, due to its ability to drive the inspection of rules, thus clarifying the intrinsic mechanisms underlying the data.
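The cross-classification scheme discussed above (training on one data source and testing on hold-out data from both sources) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a single-variable threshold classifier (`fit_stump`) stands in for the LLM or decision tree, and the data values, function names, and train/test splits are hypothetical. The ("real", "synthetic") entry corresponds to the case of a model trained on real data and tested on synthetic data, referred to as condition D in the text.

```python
def accuracy(cut, data):
    """Fraction of samples for which (value > cut) matches the 0/1 label."""
    return sum((x > cut) == bool(y) for x, y in data) / len(data)


def fit_stump(data):
    """Pick the cut-off on a single variable that maximizes training accuracy.

    `data` is a list of (value, label) pairs; the model predicts class 1
    when value > cut-off. It is only a stand-in for a real classifier.
    """
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in data}):
        acc = accuracy(cut, data)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


def cross_classification(train_sets, test_sets):
    """Train on each source and test on the hold-out data of each source."""
    return {
        (tr, te): accuracy(fit_stump(train_sets[tr]), test_sets[te])
        for tr in train_sets
        for te in test_sets
    }


# Hypothetical (value, label) pairs; not data from the study.
train_sets = {
    "real": [(-12.0, 0), (-11.0, 0), (-10.0, 0), (-8.0, 1), (-7.0, 1), (-6.0, 1)],
    "synthetic": [(-12.5, 0), (-10.5, 0), (-9.0, 1), (-7.0, 1)],
}
test_sets = {
    "real": [(-11.5, 0), (-9.5, 0), (-7.5, 1), (-6.5, 1)],
    "synthetic": [(-11.0, 0), (-10.2, 0), (-8.5, 1), (-6.0, 1)],
}
results = cross_classification(train_sets, test_sets)
```

The resulting dictionary holds one accuracy value per (training source, test source) pair. Such a table reports how well each model transfers, but not whether the learned cut-offs or rules are similar, which is what the rule-based comparison adds.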

VIII. CONCLUSION
This study demonstrates that XAI can provide additional insights in evaluating the quality of synthetic data, beyond the use of conventional utility metrics, in a hearing screening dataset. Specifically, a global similarity metric was introduced to assess the quality of synthetic data based on the similarity between the classification rule sets extracted from real and synthetic datasets. This metric provides additional information for selecting a synthetic dataset when utility metrics do not allow for a clear ranking. Moreover, XAI helps to highlight which input-output relationships are amplified in synthetic data and which ones may be neglected. Among the several XAI techniques available, the LLM was used in this study due to its ability to generate fully interpretable, rule-based models. However, future studies will be needed to investigate novel metrics based on other XAI approaches, for example post-hoc XAI techniques such as partial dependence plots or Shapley additive explanations. Further research is also needed to investigate other datasets, including multivariate longitudinal data, time series from a large sample of subjects, or biomedical signals, to assess the generalizability of the proposed approach. Moreover, investigation of synthetic health data generated using other data generation algorithms (e.g., probabilistic models, classification-based imputation models, and different GAN algorithms) will be important to test whether XAI-derived metrics can be adapted to specific data generation algorithms and possibly used to assess the quality of synthetic data in real time, during the generation process. Providing real-time feedback during the data generation process is one of the most promising goals to pursue, as it could help improve the performance and efficiency of synthetic data generation methods.
Table III shows the rules with covering higher than 15% obtained from the real dataset and from four out of 15 synthetic datasets, specifically the ones with better MMD (lower values, p-value > 0.05) and HD (lower values, < 0.40), as shown in Table I. The results of Table III are used to compute the coefficients shown in Table II (Section VI).