Radiomics-Based Assessment of Primary Sjögren's Syndrome From Salivary Gland Ultrasonography Images

Salivary gland ultrasonography (SGUS) has shown good potential in the diagnosis of primary Sjögren's syndrome (pSS). However, a series of international studies have reported needs for improvements of the existing pSS scoring procedures in terms of inter/intra observer reliability before being established as standardized diagnostic tools. The present study aims to solve this problem by employing radiomics features and artificial intelligence (AI) algorithms to make the pSS scoring more objective and faster compared to human expert scoring. The assessment of AI algorithms was performed on a two-centric cohort, which included 600 SGUS images (150 patients) annotated using the original SGUS scoring system proposed in 1992 for pSS. For each image, we extracted 907 histogram-based and descriptive statistics features from segmented salivary glands. Optimal feature subsets were found using the genetic algorithm based wrapper approach. Among the considered algorithms (seven classifiers and five regressors), the best preforming was the multilayer perceptron (MLP) classifier (κ = 0.7). The MLP over-performed average score achieved by the clinicians (κ = 0.67) by the considerable margin, whereas its reliability was on the level of human intra-observer variability (κ = 0.71). The presented findings indicate that the continuously increasing HarmonicSS cohort will enable further advancements in AI-based pSS scoring methods by SGUS. In turn, this may establish SGUS as an effective noninvasive pSS diagnostic tool, with the final goal to supplement current diagnostic tests.


I. INTRODUCTION
P RIMARY Sjögren's Syndrome (pSS) is a chronic autoimmune disease, whose manifesting symptoms are oral and ocular dryness, fatigue, arthralgia and arthritis. The annual incidence of pSS has been estimated at a range from 200 to 3000 per 100.000 people, with highly unbalanced gender ratio (∼10 females per 1 male) [1]. Standardization of the pSS classification has been the subject of debate for decades. Chronologically, four standardized guides are: European Classification (PEC) criteria [2], American European Consensus Group (AECG 2002) classification criteria [3] the American College of Rheumatology (ACR 2012) criteria [4] and the more recent ACR-European League Against Rheumatism (EULAR) 2016 criteria [5]. Briefly, these guides are based on the combination of examined clinical symptoms, results of autoantibody tests and salivary gland (SG) biopsy [6]. All these criteria do not incorporate new insights in pSS enabled by noninvasive salivary gland ultrasonography (SGUS) [7]. According to clinical reports, failing to include any imaging modalities (as mentioned in the standardized guides) has been reported as an obstacle in the practice -as patients frequently complain at invasive tests and biopsies, especially during follow-up studies or when presented with negative findings [8].
Up until now, various SGUS-based pSS scoring approaches have been introduced and showed satisfactory results in comparison to both ACR 2012 and AECG 2002 [9]. The proposed approaches are based on the visual observation of parotid and submandibular SGs' characteristics from SGUS. These scores are further subtracted and compared to the cut-off threshold to determine the final pSS score [10]- [15]. In order to inves- tigate human-dependency, international experts have recently participated in the consensus meetings with the aim to evaluate reliability of SGUS echo structural parameters [16]. Considering the obtained results of inter/intra observer reliability, it is concluded that there is still no gold standard for pSS diagnosis based on the observation of echo structural abnormalities in SGUS images. In order to resolve these obstacles, leading SS experts (35 partners from 13 countries) have recently started the Harmon-icSS (http://harmonicss.eu) initiative. The aim of the joint European research initiative is to envelop independently reported cohorts and metacentric data with the end goal to ease further progress in the diagnosis and treatment of pSS. As recently suggested, one of desirable advances in SGUS is the development of dedicated computerized tools that could reduce screening time and dependency on human experts [17]. To the best of our knowledge, there is still no available solution with such ability on the market, nor reported in the scientific literature. By using the growing HarmonicSS cohort, the aim of the present study was to propose a novel radiomics-based approach for the assessment of pSS in SGUS images.
All subjects underwent a diagnostic work-up for pSS according to the AECG [3]. The evaluation included the following: 1) questionnaire with six questions to assess ocular and oral symptoms, 2) evidence of dry eye (Rose-Bengal), 3) presence of anti-SS-A/SS-B antibodies, 4) sialoscintography for the evidence of salivary dysfunction and 5) biopsy of minor salivary glands. Characteristics of patients involved in this study are shown in Table I.
In this study, pSS scores of SGUS images were defined using the original scoring system proposed by De Vita et al. [14]. This easy-to-apply score was chosen because adequate discriminant analyses were employed to select the items to build the score itself, and these items were subsequently confirmed to be of value. Since the scoring is expert-based approach, ground truth values were defined by using the Delphi method. The scoring assumed grading images on the 0-3 scale regarding SGs' echo structural characteristics, as proposed in Luciano et al. [14]. SGUS images were randomized and assessed twice by five independent clinicians, whose expertise varied between experienced to leading rheumatologists in the field. After the definite scores were obtained by the experts' consensus, the resulting class distribution in the database was 30%, 13%, 39% and 18%, respectively (class distribution of images used for the development of train and test sets are given in Table I).

A. Reliability of pSS Clinical Assessment in SGUS Images
Intra-observer and inter-observer reliability were assessed using the kappa coefficient. The intra-observer agreement was measured using the Cohen's weighted kappa, showing the substantial agreement: κ = 0.71 ± 0.11 (κ min = 0.58 and κ max = 0.88). The overall inter-observer agreement (before the expert consensus) was measured using the Scott/Fleiss' kappa (κ = 0.61).
In order to compare the performances of proposed radiomicsbased algorithms with clinicians, the performance indicators given in this paragraph were calculated with respect to scores adapted through the expert consensus. The mean Pearson's correlation of the five observers with ground truth was R 2 = 0.690  The agreement of observers with ground truth was measured using the weighted Cohen's kappa κ = 0.66 ± 0.14 (κ min = 0.44 and κ max = 0.83).

III. METHODS
Characteristics that are specific to the assessment of pSS from SGUS images are: a) high variance of SGs in appearance, shape and size, and b) low relevance of SGs' surrounding tissues for the diagnosis. Considering the cohort size, and these requirements, we propose a novel radiomics-based procedure as a suitable approach for solving the given problem. The procedure's workflow is sketched in Fig. 1 and Fig. 2, while its composing steps are described in the rest of this section.

A. Features Extraction
Semi-automatic segmentation of SGs was performed using the Snake algorithm ( Fig. 1(a)) [18]. The presence of artifacts and pepper noise were reduced by using the Wiener and Median filters, respectively ( Fig. 1(b)). The extraction of radiomics features from the segmented SGUS image was performed using a series of algorithms (Fig 1(c-j)). Invariance of the proposed procedure to both size and shape of the segmented SGs was ensured by computing: a) Histogram-based features f 1−9 (after expressing the bin counts in percentage, following Fig. 1(i)); and b) Descriptive statistics features f 10−19 that account for pixels inside the segmented SG region ( Fig. 1(j)). The considered feature-extractors were: 1) Multi-Resolution Image Gaussian Pyramid: The Gaussian filter (GF) is a 2D convolutional smoothing operator, whose kernel was generated using the Gaussian function ( Fig. 1(c)): where F Gaussian is the filtered image, * is the convolution operator, I is the image, whereas x and y are pixels coordinates in the local Gaussian filter space. By varying the σ parameter in the range ∼0:2:16, we obtained eight new filtered images. From each of them we extracted 17 histogram-based features listed in Fig. 1(i) -which represent the feature vector f 1−136 in Fig. 1(k).
3) Multi-Resolution Gabor Representation: The Gabor filter represents a two-dimensional sinusoidal wave (with predefined orientation and wavelength), whose amplitude is multiplied with the Gaussian function [19]: where λ is the length of the wave, θ is the wave orientation (so that x = x cos θ + y sin θ and y = −x sin θ + ycosθ), ψ is the phase shift, σ is standard deviation of the Gaussian function and γ is the factor that control elasticity of the filter. We created a bank of 16  The LBP features were computed following the steps in the literature [20]. Briefly, for each pixel LBP compares its value to N pixels along a surrounding circle with a diameter R. If the centre pixel's value is greater than the circle neighbour's value, LPB writes 0, otherwise it writes 1. This gives an N-digit binary number, which is finally converted to decimal for convenience, as sketched in Fig. 1(g). In the present study, we generated the feature set f 460−731 in Fig. 1(k) by varying N = 8:8:32 and R = 4:4:16.

B. Data Stratification
SGUS images were stratified on the training-learning and independent test sets, to more rigorously assess the generalization ability of the proposed procedure. Both data sets were computed only once and saved, so that they could be loaded on-demand during the development and evaluation of the considered predictive models.
1) Development of Balanced Independent Test Set: Since we deal with the development of multiclass predictor using the imbalanced cohort, the size of the independent test-set was determined with the size of the most under-sampled class. In this way, we ensured that the sufficient amount of balanced and realword samples of each class are available during both learning stage and subsequent independent testing. We created the balanced test-set by randomly sampling 25 images (approximately 30% of the least presented class -grade 1) from each grading category ( Fig. 2(a)). The remaining data was further used as the training-set.

2) Development of Balanced K-Folds for the Cross-
Validation: The training was performed using the k-fold crossvalidation. Accordingly, the training set was divided into k = 6 folds, ensuring that each fold consisted of the same number of samples with equal distribution of classes ( Fig. 2(b)). Since the training set was unbalanced, we applied the ADASYN algorithm that sharpens boundaries between classes by generating synthetic samples in minority classes (Fig. 2(c)) [22]. Therefore, the ADASYN improves learning by: 1) reducing the bias introduced by the class imbalance, and 2) adaptively shifting the classification decision boundary toward the difficult-challenging examples [11].

C. Selection of Optimal Features Subset
Since each feature extractor depends on several (hyper) parameters, there is a risk of omitting to use relevant (combination of) features if feature extractors' parameters are not set correctly. This problem was solved by varying the extractors' parameters (Section III-A) and using a robust supervised wrapper feature selector to subsequently find an optimal feature subset for a particular predictive model. The workflow of the proposed procedure is given in Fig. 2(d), whereas its key steps are explained in the following paragraphs.
1) Genetic Algorithm-Based Wrapper: Genetic algorithm (GA) is the iterative method for solving optimization problems [23]. The process starts from an initial guess of parameters (called population) subjected for the optimization. At each iteration (called generation), the GA selects some portion of best individuals from the current population and uses them as parents to produce the candidates (called children) for the next generation (crossover and mutation). Over successive generations, this process leads to the evolution of populations of individuals that are better adapted to their environment than the individuals that they originated from (similar to natural adaptation).
In the present study, we employed the GA to select optimal features subset by considering the feature selection as the integer optimization problem within the bounds 0 and 1. In each call of the GA objective function, the predictive model was cross-validated using the previously created k-folds. Parameters of the objective function represented features selector (i.e., F2 chromosome in Fig. 2(d)), so that the parameters with value 1 indicated features that should be selected while parameters with value 0 indicated features to be neglected during the training.
The value of the GA objective-function was the kappa-statistics for classification and the Pearson's correlation for the regression predictive models. In this study, population size was set to 600, number of generations was set to 500, while the rest of GA hyper parameters had default values defined within the Matlab gaoptimset function (see the Matlab online documentation).
2) Considered Predictive Models: The pSS scoring could be considered as both classification (the score is the ordinal value: 0, 1, 2 or 3) and regression (the score is any real number on the interval 0-3) problem. In order to find which one is the most efficient approach for the pSS scoring, we evaluated 7 classifiers and 5 regressors [24]: Decision Table (DT), J48 tree, K-nearest neighbors (KNN), Linear Regression (LinR), Logistic regression (LogR), Multilayer perceptron (MLP), Naive Bayes NET (NB NET ), Naive Bayes (NB) and Random forest (RF). The parameter settings for each of the predictive models were set iteratively, while the MLP was configured following the Evolutionary assembling approach [25].

IV. RESULTS
The implementation of the proposed procedure was performed using the Matlab R2010 (MathWorks, Natick, MA) and Java wrapper for the Weka v. 3.8 library (University of Waikato) [13]. The computational time needed to find optimal features and develop predictive models varied among algorithms. In worst case scenarios, it took up to several hours on the Dell Pow-erEdge server (204 processors, 800 GB RAM, 4.5 TB SSD). After the learning process had been completed, execution of the developed algorithms for scoring newly supplied SGUS images was done almost in real-time.
Performances of the assessed algorithms are given in Table II. Calculated efficiency indicators were: Pearson's correlation (R 2 ), Mean absolute error (MAE), Root mean squared error (RMSE) -for regression-based algorithms; and: Accuracy (ACC, %), Area under the receiver operating characteristic curve (AUC), kappa-statistics (κ), MAE and RMSE -calculated for classification-based algorithms.
After the identification of the top ranked algorithms, we further analyzed their sensitivity on errors during the SG segmentation (the only user dependent step). For each image in the test set, we automatically created six intra-observers' variability scenarios by: scaling up the segmented contours 20%, scaling down segmented contours 20%, as well as translating segmented contours in four directions for [20 20], [− 20 20], [20 −20] and [−20 −20] pixels. The obtained results of the sensitivity analysis are given in Table III. Histogram of the most frequently used features used for the development of considered predictive models is shown in Fig. 3, while sample results obtained by the top-ranked predictive models are shown in Fig. 4.

A. Configuration of the Top-Performing MLP Classifier
The development of the MLP may be intuitively described as scheduling of its hyper parameters (type of activation functions in layers, learning rate, learning momentum, number of neurons per layer, training algorithm, number of learning epochs)   with the aim to maximize the classification performances. In order to set these parameters automatically and correctly, we employed the recently proposed Evolutionary assembling approach [25].  Fig. 3). Robustness of the proposed procedure comes from the fact that hyper parameters of both feature extractors and MLP classifier are set automatically. Thus, we emphasize that it could be applied for solving a wide range of problems in computer aided diagnosis.

A. Performances of Top-Ranked Algorithms
Depending on the type of 12 considered algorithms, number of features selected by the GA wrapper varied from 23 up to 187. Histogram in Fig. 3 indicates that there were no key radiomics features. Instead, the challenge was to find an optimal feature subset that maximizes performance of a particular predictor. Each of the developed predictive models was evaluated twice, with the cross-validation and on the independent test-set, in order to more rigorously assess the generalization performances. Results from Table II indicate that RF, KNN and MLP were top-ranked algorithms in both classification and regression categories. In the following sections, classification and regression approaches for pSS scoring will be discussed separately in order to highlight their benefits.

1) Benefits From Using Classification-Based Algorithms:
The obtained results show that classification algorithms produce a lower mean absolute error and root mean squared error, which is important during the definite pSS classification (when clinicians consider scorings of four SGUS images acquired from a single patient). In such situations, procedures that are able to accurately grade 3 out of 4 images (over 75% accuracy) represent a considerable contribution to the current practice [2]. In our study, RF, KNN and MLP reached above 66% accuracy (guarantee that at least 2 of 4 images will be graded correctly). However, only the MLP classifier surpassed the threshold of 75% accuracy, which we recommend as the most reliable for the pSS diagnosis using the SGUS scoring system developed by De Vita et al. for pSS (14). In terms of the kappa-statistics, which is commonly used in pSS related studies, the MLP showed substantial agreement (κ = 0.7) with the ground truth defined via the expert consensus.
2) Benefits From Using Regression-Based Algorithms: One of the most challenging issues related to the screening of pSS from SGUS is the follow-up of patients, when clinicians have to estimate the disease progress by inspecting two or more SGUS images. Although using scores in the interval 0-3 is more appropriate for the follow-up (compared to the ordinal scale), it is difficult for clinicians to objectively perform such accurate estimation. In such situation, regression algorithms may appear as useful tools for assisting clinicians. In the present study, the RF and MLP regressors performed the best in terms of both Pearson's correlation (R 2 = 0.83) and RMSE (Table II) with respect to the ground truth defined via the expert consensus.

B. Sensitivity to Errors in SGUS Segmentation
Considering RF and MLP as top ranked algorithms, we first used the Stuart-Maxwell's test to prove that predictions of two multi-class classifier are statistically significant (χ 2 = 9.6 and p = 0.022 values for the significance level of α = 0.05). Furthermore, we analyzed RF and MLP classifiers sensitivity to errors that may occur during the SG segmentation (the only user dependent step). The obtained results in Table III showed that the proposed predictive models are robust on the SG underestimation (case 2). However, case 1 and case 3 types of inaccurate SG segmentation decreased accuracy of predictive models for 4-8%. Although larger segmentation errors are uncommon for the trained clinicians, the recommendation is to prefer underestimating SG when a user is presented with noisy SGUS images. In summary, we recommend the MLP as the most reliable predictive model for the assessment of De Vita scores from SGUS images.

C. Contribution to the State of the Art 1) Computerized Analysis of SGUS and Assessment of pSS:
After literature review, we report that the computerized medical image analysis of SG and pSS remain underestimated problems. Instead, the most of related work is focused on analyzing pSS biopsy images or other diseases present in SGs. Chernomordik et al. proposed a fluorescence scanning imaging system that performs a noninvasive optical biopsy of the Sjögren syndrome (based on the 2D CCD imaging of the lower lip), with the endgoal to replace the traditionally used histological biopsy [26]. Regarding the SGUS-based studies, Chikui et al. suggested using the fractal analyses to characterize SG tumors [27]. The same author afterwards reported that average size of the particles, area ratio of the particles within the region and Hurst-ori were useful predictors for detecting abnormal sialographic stages [28]. Siebers et al. performed multi-feature tissue characterization for differentiating malignant and benign parotid gland lesions using maximum likelihood supervised classifier [29]. Murakami et al. applied 2D wavelet analysis to SGUS images for the diagnosis of SS [30]. A couple of studies investigated the possibility of using the elastography techniques for diagnosing pSS in SGUS. Dejaco et al. used real-time sonoelastography of SGs for the diagnosis and assessment of glandular damage in pSS [31]. Zhang et al. assessed SG stiffness in pSS via acoustic radiation force impulse imaging [32].
Therefore, we found that currently there is a lack of methods for automated analysis and scoring of pSS in SGUS. This may be justified with a few facts: 1) studies that introduced scoring systems were mono-disciplinary (relied mostly on clinicians' experience in image analysis) [10]- [15]; and 2) the nature (rarity) of pSS makes it difficult for a single institution to collect a larger cohort appropriate for the training of robust AI algorithms. To the best of our knowledge and insight into the topic, the present study is the first one that proposes the procedure for computer-aided diagnosis and scoring of pSS from SGUS.

2) Performances of AI With Respect to Trained Clinicians:
By comparing the average performances of clinicians (average intra-observer agreement was κ = 0.71 and average agreement with ground truth was κ = 0.67) with the performances obtained with the proposed classifier, it may be found that the MLP classifier over-performed (κ = 0.7) clinicians by the considerable margin while its reliability is on the level of humans' intra-observer variability. Therefore, we confirm the hypothesis that the proposed AI-based procedure represents a potential improvement of healthcare standards present in most clinics worldwide [33]. Also, it is worth emphasizing that it still cannot compete with the most-skilled clinicians who are the leading scientists in the field (κ max = 0.83 and intra-observer κ max = 0.88). Regarding the regression-based assessment, the MLP (R 2 = 0.83) over-performed clinicians (R 2 = 0.69) by a large margin and it is on the level of leading experts (R 2 max = 0.837). In addition, while the score by De Vita was proposed as ordinal values, our findings may be a starting point towards developing a continuous scale that is more appropriate for the follow-up assessment of pSS and for the detection of changes.
3) Reliability of SGUS Scoring Systems: Incorporation of SGUS scoring systems into standardized diagnosing guides has been prolonged due to their dependency on experts. As a solution to this problem, the score by De Vita et al. was proposed as an easy and practical measure [14]. Our findings support this claim since the average intra observer reliability was quite good κ = 0.71, while the most experienced clinicians reached excellent results in terms of both intra κ max = 0.88 and inter reliability κ max = 0.83. Therefore, we report that the obstacle for wide acceptance of SGUS in pSS screening may be related to the lack of highly skilled sonographers rather than to the need for more suitable scoring systems. The present study confirmed this hypothesis and showed that the problem of clinicians intra/inter observer variability could be solved by employing AI-based algorithms. Particularly, AI algorithms could be trained from data annotated by highly skilled experts and afterwards they could be used to assist and ease the training of less experienced clinicians.

D. Future Work on This Topic
Further development and improvements of dedicated computerized software tools for the pSS assessment from SGUS may significantly advance the way of treating pSS by reducing the invasiveness, screening time and dependency on experts. We highlight the strong potential of applying such technology for assisting and training of novice clinicians, whose perfecting could improve and equalize the healthcare quality worldwide [33]. Although, at the current stage, we have achieved performance on the edge with trained clinicians [11], we refer to this study as the first milestone of the wider HarmonicSS initiative. Considering the size of the cohort at the moment, we have assessed the radiomics-based algorithms to prove our hypothesis. In the future work, we aim to automate both SG segmentation and pSS scoring by employing other AI methods that benefit from the ongoing cohort growth, like Deep learning methods [34].

VI. CONCLUSION
Although humans are efficient in high-level cognitive tasks, our limitation in performing lower-level vision tasks such as calculation of textures' statistics or differentiating shades of colors are well studied and depend on many factors (i.e., ageing, genetics, fatigue, environment, diseases and so on) [35]. As an alternative to using descriptive and linguistic nominal attributes to characterize pSS in SGUS images, the present study aimed to assess various radimomics-based AI algorithms for pSS scoring from SGUS using the score proposed by De Vita in pSS [14]. We found that the MLP classifier (κ = 0.7) overperformed average score achieved by the clinicians (κ = 0.67) by the considerable margin, while its reliability is on the level of humans' intra-observer variability (κ = 0.71). We emphasize that the proposed procedures still cannot compete with the leading scientists in the field (κ max = 0.83 and intra-observer κ max = 0.88). With further increase in the HarmonicSS cohort and improvements, validation and democratization of the AI able to compete leading clinicians in the pSS scoring, SGUS could be established as a reliable assessment procedure supplementing or replacing currently used invasive diagnostic tests.

ACKNOWLEDGMENT
The Titan V GPU used for this research was donated by the NVIDIA Corporation to A. M. Vukicevic.