Adversarially-Regularized Mixed Effects Deep Learning (ARMED) Models Improve Interpretability, Performance, and Generalization on Clustered (non-iid) Data

Natural science datasets frequently violate assumptions of independence. Samples may be clustered (e.g., by study site, subject, or experimental batch), leading to spurious associations, poor model fitting, and confounded analyses. While largely unaddressed in deep learning, this problem has been handled in the statistics community through mixed effects models, which separate cluster-invariant fixed effects from cluster-specific random effects. We propose a general-purpose framework for Adversarially-Regularized Mixed Effects Deep learning (ARMED) models through non-intrusive additions to existing neural networks: 1) an adversarial classifier constraining the original model to learn only cluster-invariant features, 2) a random effects subnetwork capturing cluster-specific features, and 3) an approach to apply random effects to clusters unseen during training. We apply ARMED to dense, convolutional, and autoencoder neural networks on 4 datasets including simulated nonlinear data, dementia prognosis and diagnosis, and live-cell image analysis. Compared to prior techniques, ARMED models better distinguish confounded from true associations in simulations and learn more biologically plausible features in clinical applications. They can also quantify inter-cluster variance and visualize cluster effects in data. Finally, ARMED matches or improves performance on data from clusters seen during training (5–28% relative improvement) and generalization to unseen clusters (2–9% relative improvement) versus conventional models.


INTRODUCTION
I N predictive modeling, one often assumes that data is independent and identically distributed (iid), such that no samples are correlated or interdependent.However, this assumption is frequently violated in the natural sciences when samples are clustered.For example, many multisite neurological studies acquire cognitive scores using a different human rater at each site, which are subject to interrater differences [1], [2], [3].As a result, these measurements have inherent intra-site correlation and inter-site variability.Another example is medical imaging, such as magnetic resonance imaging (MRI), where differences in imaging protocol and scanner hardware lead to substantial site effects in multi-site studies [4], [5].Clustering also occurs in biological data, such as when measurements are collected across different experimental batches [6] or tissue samples [7], and in environmental data collected across locations [8].
If not properly handled in analysis, the cluster effects of • The authors were with the Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, 75390.E-mail: kevin3.nguyen@utsouthwestern.edu, albert.montillo@utsouthwestern.edu • Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).
As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report.A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how to apply/ADNI Acknowledgement List.pdf non-iid data can lead to erroneous conclusions.The so-called Simpson's paradox occurs when an association between two variables appears, disappears, or even reverses when analysis is performed at the population level vs. when analysis is stratified by cluster, indicating a confounding effect.This situation can lead to Type I (false positive) or Type II (false negative) findings in many situations, including clinical studies [9], proteomics [7], and economics [10].Despite these consequences, the machine learning community has generally ignored the problems underlying noniid data.Meanwhile, the traditional statistics community has addressed clustered data with mixed effects models, which learn a combination of fixed and random effects.The most common of these is the linear mixed effects (LME) model, which builds upon the basic linear regression model.Suppose that we have data X ∈ R n×p with n samples and p independent variables (features), originating from c clusters, and a dependent variable (target) y ∈ R n×1 .We can define the following LME regression model; for a sample i = 1, 2, ..., n originating from cluster j = 1, 2, ..., c we have: ŷi = β 0 + x i,1 β 1 + ... + x i,p β p +u j,0 + x i,1 u j,1 + ... + x i,p u j,p + i (1) where ŷi is the predicted target, x i = [x i,1 , ..., x i,p ] is the p-dimensional feature vector of the i th sample from X, and i is the residual.The model contains two types of weights.

PREPRINT
The fixed effect intercept β 0 and slopes β = [β 1 , ..., β p ] are cluster-invariant and apply globally to all samples.The random effects weights include the intercept u j,0 and slopes u j = [u j,1 , ..., u j,p ], whose values are specific to each cluster j.The random effect weight values are assumed to follow a random distribution, most often a multivariate normal distribution with mean 0, i.e. u ∼ N (0, σ).Consequently, the random effect weights u can be interpreted as clusterspecific offsets from the fixed effect weights β.The LME model separates the variance explained by global associations from the inter-cluster variance, controls for correlated samples, and improves weight estimates [8], [11].Unfortunately, proper handling of mixed effects in deep learning, delivering all of these gains, has gone unanswered.In this work, we describe how appropriate handling of mixed effects can address the inadequacies of deep learning models when applied to clustered data.

Related work
Previous deep learning approaches for clustered data have key limitations.A naive but prevalent strategy is to insert cluster information into the model as an additional, onehot encoded covariate [12].This increases data dimensionality, which may cause overfitting with a high number of clusters c [12], [13], and it entangles the cluster-invariant and cluster-specific features within the model weights, hampering model interpretation.Domain adaptation techniques train a model on a source domain (i.e.cluster), then adapt it in a subsequent training step to a target domain [14].This yields an adapted model for each target domain but not a single unified model.It also does not scale easily to many domains or separate domain-invariant from domainspecific features, which also limits interpretability.Domain generalization techniques address some of these weaknesses by producing a single generalized model agnostic to domain differences.Earlier approaches used gradient reversal layers, which modify backpropagation to maximize domain invariance [15], [16].Other methods use meta-learning to guide gradient descent in a direction that reduces the loss for all domains [17], [18].However, these involve second-order optimization which vastly increase computational cost.A third category of domain generalization methods uses an adversarial classifier [19], [20].The adversarial classifier learns to classify domains from the latent features of the main model, while the main model learns features that maximize domain classification error.The common limitation of all domain generalization techniques is that they produce a model that has only learned the domain-invariant features (fixed effects), while domain-specific information (random effects), are discarded.Our proposed framework captures this ignored information in a separate random effects subnetwork, while an adversarially-regularized subnetwork captures global fixed effects.We show that this adds predictive value and allows users to understand more about cluster variance in their data.
To date, there have been three prior approaches to incorporate mixed effects into deep learning.Xiong et al. proposed MeNet, a mixed effects convolutional neural network (CNN), for a gaze estimation dataset containing repeated images per subject [21].While improving accuracy, the method requires an expensive expectation-maximization algorithm with inversion of large covariance matrices (n j ×n j where n j is the number of samples within each cluster).MeNet also only models random slopes and not intercepts.Next, Tran et al. proposed DeepGLMM, a mixed effects approach for dense feedforward neural networks (DFNNs) using Bayesian deep learning and variational inference for more efficient training [22].Though theoretically capable of modeling both random slopes and intercepts, their applications only used models with random intercepts.Their experiments also lacked comparisons with other deep learning methods.Finally, Simchoni et al. proposed LMMNN, a mixed effects approach for both DFNNs and CNNs, and demonstrated a performance benefit across multiple applications.However, LMMNN is trained using expensive covariance matrix inversions, and their real-world applications use only random intercepts [13].
There are several common limitations across the MeNet, DeepGLMM, and LMMNN approaches.These methods prioritize the improvement of predictive performance and ignore the additional interpretability afforded by mixed effects, such as quantification and visualization of intercluster variance.They also lack explicit guidance of the fixed effects to be cluster-invariant, so their resilience to confounded associations is unclear.Additionally, none of these works demonstrate models with both random slopes and intercepts or unsupervised learning models, such as autoencoders.Lastly, there are no specific recommendations for applying these models to new data that does not originate from the same clusters seen during training, which limits real-world utility where data from new clusters is frequently encountered.

Contributions
We propose an Adversarially-Regularized Mixed Effects Deep learning (ARMED) framework that generalizes across model archetypes and alleviates the shortcomings of the previous approaches.This framework contains three components that can be readily added to a conventional deep learning model with minimal modification of the existing architecture.First, inspired by domain generalization, we employ an adversarial classifier to regularize the model to learn cluster-invariant fixed effects.We show through simulations that this improves the separation of cluster-specific, potentially confounded features from cluster-invariant features.Second, we introduce a Bayesian random effects subnetwork to learn the cluster-specific features, and we demonstrate how it can quantify and visualize the variance across clusters.Third, we add another classifier which infers random effects for so-called "unseen cluster" data, where samples originate outside the clusters seen during training.We demonstrate the advantages of our framework across 4 test cases using DFNNs, CNNs, and convolutional autoencoders, including simulations and three biomedical examples.In each case, we achieve not only the separation and identification of fixed and random effects, but also better predictive performance on data from seen clusters and better generalization to unseen clusters.

Input data
Fig. 1.The ARMED framework for a generic neural network.The conventional model (blue area) predicts ŷF from the data sample x.Cluster membership of the sample is one-hot encoded into z.The fixed effects subnetwork (blue + gray areas) is constructed by adding an adversarial classifier (gray area) to predict cluster membership ẑ.The original model is penalized through the generalization loss for learning features that allow cluster membership prediction.The random effects subnetwork (orange area) uses Bayesian layers to learn cluster-specific weights, dependent on z, that follow zero-mean multivariate normal distributions.These weights can be formulated as nonlinear slopes multiplied by the fixed effects latent representation h F (X; β), linear slopes multiplied by X, and/or intercepts.The fixed and random effects are combined with a mixing function m(...).For prediction on data from clusters unseen during training, z is inferred with a classifier (Z-predictor) trained on data from seen clusters.

METHODS
In general, a conventional feed-forward neural network computes a nonlinear transformation of the data X through its layers (Fig. 1, blue area).We denote the output of the penultimate layer as h(X, β) ∈ R n×q , where β contains all learned weights up to and including this layer and q is the number of neurons.For a typical regression or classification task, a final linear or softmax output layer o then transforms h(X, β) into the final prediction output ŷ: During training, the model finds the weights, β, which minimize a given loss function quantifying the error for the predictive task, L e (y, ŷ).
To encode cluster membership information for a dataset with n samples and c clusters, we introduce a one-hot encoded design matrix Z ∈ R n×c , where Z i,j = 1 if sample i belongs to cluster j and Z i,j = 0 otherwise.The following sections present a description of the ARMED framework components, agnostic to model architecture.These components include the fixed effects subnetwork h F , including a conventional neural network and an adversarial classifier a that together learn cluster-invariant features, the random effects subnetwork h R for learning Z-dependent clusterspecific features, the mixing function m that combines the fixed and random effects for prediction, and the Z-predictor used to apply random effects to new clusters.

Fixed effects subnetwork
First, we add an adversarial classifier (Fig. 1, gray area) to the conventional model (Fig. 1, blue area) to enforce the learning of cluster-invariant fixed effects, creating the Fixed effects subnetwork h F (X; β).This is based on the adversarial learning technique for domain generalization [19], [20].For a neural network with L layers, let represent the collected outputs of each layer, where β F,l contains the weights up to the lth layer.We define an adversarial classifier a which predicts a sample's cluster membership from these layer outputs, Ẑ = a(H F (X; β F ); β A ), where β A contains the weights for this adversary.The adversary is trained to minimize the categorical cross-entropy loss: Meanwhile, the main model is penalized for learning features that allow the adversary to predict cluster membership.It must maximize this cross-entropy, which we call the cluster generalization loss.The resulting training objective of the fixed effects subnetwork is where the hyperparameter λ g controls the weight of the generalization loss.We use ŷF to denote the prediction output of this fixed effects subnetwork.

Random effects subnetwork
We next define a second subnetwork to learn the Random effects, h R (X; U (Z)) with cluster-specific weights U (Z) (Fig. 1, orange area).The cluster-specific values for each individual weight u(Z) in U (Z) are assumed to follow a normal distribution with mean 0, i.e. u(Z) ∼ N (0, σ) where σ represents the inter-cluster variance of each weight.Collectively, Σ contains the inter-cluster variance for all weights in U (Z).We implement these weights using a Bayesian formulation.We specify a zero-mean normal prior distribution for each weight p(U ) ∼ N (0, σ p ) with the fixed prior variance σ p as a global hyperparameter.The posterior distribution p(U |X) is then learned through variational inference, which reframes Bayesian modeling as an optimization problem that can be efficiently handled through gradient descent [23], [24].The objective of variational inference is to learn a surrogate posterior q(U ), here a multivariate normal distribution, which closely approximates the true posterior p(U |X), where "closeness" is measured by the Kullback-Leibler (KL) divergence: Minimizing D KL (q(U )||p(U |X)) directly is impossible because computing the posterior through Bayes Rule, p(U |X) = p(X|U )p(U ) p(X) , involves the intractable marginalization p(X).Instead, variational inference maximizes the Evidence Lower Bound (ELBO) which contains fully tractable and differentiable quantities: where the first right-hand term is the log-likelihood and the second term is the KL divergence between the surrogate posterior and the prior.For gradient descent, we minimize the negative ELBO and let L e (y, ŷ) represent the first term, i.e. the negative log-likelihood loss.This yields the following objective: L e (y, ŷ) with the hyperparameter λ K controlling the strength of the regularization to the prior.Note that our method based on variational inference does not require expensive inversions of covariance matrices as in MeNet and LMMNN [13], [21].
The architecture of this subnetwork will depend on the types of random effects to be modeled.Nonlinear random effects slopes can be modeled as weights multiplied by the fixed effects latent representation h F (X; β): where z i and x i are the rows in Z and X for the ith sample and u nlin (z i ) ∈ R q×1 returns the slopes for cluster z i , q being the number of output neurons of h F (x i ; β).A random intercept is modeled simply as a weight: where u int (z i ) is a scalar value.Additionally, for tabular data, we can model linear random effects slopes multiplied directly with X, which allows each slope to be interpreted directly with respect to a corresponding input variable: where u lin (z i ) ∈ R p×1 returns the slopes for cluster z i .The random effects subnetwork outputs the sum of these random effects: +h R,int (u int (z i )) These three cases will apply to most models with a dense penultimate layer producing a vector-form h F (X; β).For models such as autoencoders, we describe in the Supplemental Materials how random effects can be readily applied across multiple convolutional layers (Section 3.1.3,Fig. S4).

Combining fixed and random effects
We construct the final ARMED model by combining the outputs of the fixed effects and random effects subnetworks.
In the linear model of Eq. 1, random and fixed effects were combined through addition.For greater flexibility here, we substitute the addition in Eq. 1 with a more general mixing function m(...).
For example, in the following binary classification applications, we use a nonlinear analog of Eq. 1.We add h R (X; U (Z)) to the logit of ŷF (equal to h F (x i ; β) β L where β L are the weights of the output layer), then apply the sigmoid activation function: The objective function is obtained by combining Eq. 2 and Eq.3: L e (y, ŷM ) + λ F L e (y, ŷF ) The second term ensures that the fixed effect subnetwork will still be capable of prediction on its own so that the fixed effect features will be meaningful in later analyses.The loss weight λ F < 1 balances the fixed effect error with the mixed effect error L e (y, ŷM ).ARMED includes these hyperparameters: the generalization loss weight λ g , the KL divergence weight λ K , the fixed effect prediction error weight λ F , and the prior distribution variance σ p .Usage of linear vs. nonlinear slopes must also be considered.In practice, we find that these can be easily tuned for model performance using standard hyperparameter optimization approaches, such as random search or Bayesian optimization, and appropriate cross-validation.

Prediction on unseen clusters
The previous mixed effects deep learning approaches provide no method for using the learned random effects when predicting on data not from clusters seen during training, i.e. not included in Z [13], [21], [22].The authors of LMMNN propose to use only the fixed effects of their model on unseen clusters [13].While the learned fixed effects, by definition, represent population-average associations, new data is not necessarily free of cluster effects and performance may be improved by fully utilizing the learned random effects.We propose to infer Z for unseen cluster data using a classifier we call the Z-predictor.We train this classifier to predict Z from X on the data from seen clusters, then use it to infer Z for data from unseen clusters.The unthresholded softmax predictions from the classifier provide a weighted combination of seen clusters that are most similar to each unseen cluster sample.In our applications, the Z-predictor uses the same architecture as the adversarial classifier.

Applications of ARMED to dense feedforward neural networks
Our first architectural application of ARMED is to a dense feedforward neural network (DFNN), which is suited to tabular data such as clinical measurements or pre-engineered image features.We describe the specifics of the ARMED-DFNN architecture in Fig. S1 and the Supplemental Materials 3.1.1.

Spiral classification simulations:
First, we evaluated the ARMED-DFNN on a simulated classification problem where cluster effects can be controlled, model-learned information can be compared to ground truth, and known confounded features can be added.The simulations are built upon the well-known spiral classification problem, where points must be classified into one of two spirals based on their coordinates x 1 and x 2 [25].We simulated a nonlinear random effect by dividing the points into 10 clusters and randomly varied the spiral radius across clusters (Fig. S2).There were 3 variations of this simulation: 1) spiral radii varied across clusters, 2) spiral radii varied across clusters and spiral labels were inverted in half of the clusters (a more severe random effect), and 3) spiral radii varied across clusters and 2 known confounded probe features x 3 and x 4 were added.These probes created a spurious association between cluster and label but were not associated with the underlying spiral functions.Further details on these simulations can be found in the Supplemental Materials 3.2.Because we have defined the random effects to be nonlinear, we used an ARMED-DFNN architecture with nonlinear random slopes (Eq.4) and a random intercept (Eq.5).
To test the ability of the fixed effects subnetwork to correctly downweight these confounded probes, we measured feature importance by computing the gradient of the model output with respect to the input features [26], [27].Features with larger gradient magnitudes are more important in forming the model output.We compared the importance of each confounded probe (x 3 and x 4 ) to that of the least important true feature (x 1 or x 2 ).

Mild cognitive impairment conversion prediction:
For a complementary real-world application, the ARMED-DFNN was used to predict the future development of full Alzheimer's Disease (AD) in subjects with mild cognitive impairment (MCI).MCI is an early stage of cognitive decline that may progress to dementia.Our target was to distinguish progressive MCI (pMCI), where a subject converts to AD within 24 months of baseline observation, from stable MCI (sMCI), where the subject does not convert within 24 months.We used data from the Alzheimer's Disease Neuroimaging Initiative, which includes baseline demographic information, cognitive scores, neuroimaging measurements, and biomarker measurements, as well as longitudinal diagnoses for each participant, acquired with informed consent and institutional review board approval (Supplemental Materials 3.3.1).The training dataset came from the largest 20 study sites, and we used site as the random effect cluster.Inter-site variance has been shown to affect cognitive scores, which are sensitive to judgments by human raters, and neuroimaging, which is sensitive to MRI scanner parameters [1], [3], [28].We held out the remaining 34 sites to evaluate model performance on sites unseen during training.Performance metrics included area under the receiver operating characteristic curve (AUROC), balanced accuracy, sensitivity, and specificity.For this application, we used an architecture with linear random slopes (Eq.6) and a random intercept (Eq.5).These were chosen to allow direct interpretation of the learned random slopes and inter-site variance for each input feature.
As with the spiral simulations, we subsequently added simulated confounded probe features to test how well each model could downweight known confounded features.We generated 5 confounded probes that were nonlinearly asso-ciated with site and with the probability of being labeled pMCI but had no real biological relevance (Supplemental Materials 3.3.1).We then compared how highly each model ranked the probes based on feature importance (gradient magnitudes).

Application of ARMED to convolutional neural networks
We next applied our approach to a convolutional neural network (CNN), another important deep learning archetype, creating an ARMED-CNN capable of learning nonlinear random slopes and random intercepts.Architecture details are described in Fig. 3 and Supplemental Materials 3.1.2.
We applied the ARMED-CNN to the classification of AD vs. cognitively normal (CN) structural MRI, with study site as the random effect cluster.We acquired T1-weighted MRI from 12 sites in the ADNI dataset (inclusion criteria and preprocessing details are in Supplemental Materials 3.3.2).These 12 sites were selected to emphasize the confounding site effect, where sites using General Electric MRI scanners had a greater proportion of AD subjects compared to sites using Philips or Siemens scanners (Table S1).The remaining 51 sites were held out to evaluate performance on sites unseen during training.We extracted a two-dimensional coronal slice through the hippocampi from each image.Performance metrics included AUROC, balanced accuracy, sensitivity, and specificity.

Application of ARMED to autoencoders
To demonstrate our framework on unsupervised learning models, we developed a mixed effects autoencoder.Our fourth application was the melanoma live-cell image compression and phenotypic classification problem described in [6].In this work, the authors used a convolutional autoencoder to compress the images into a vector latent representation, then trained a classifier to label cells as having either high or low metastatic efficiency.They revealed that batch effects are prominent in this dataset, due to discrepancies between image batches acquired across different days, and that the latent representations strongly segregated by batch.The dataset is described further in Supplemental Materials 3.4.The training data from the melanoma cell image dataset contained images acquired over 13 days (batches), and the remaining 11 days were held out as unseen batches.
We extended their autoencoder architecture by connecting the metastatic efficiency classifier directly to the autoencoder and training the autoencoder-classifier (AEC) end-toend.We then applied our ARMED framework to create an ARMED-AEC, containing a fixed effects subnetwork that produces batch-invariant latent representations and a random effects subnetwork that learns how the batch effects alter image appearance (Fig. S4).Our hypothesis was that the modeling of mixed effects would improve classification performance over the base AEC.This architecture is described in Supplemental Materials 3.1.3.
In addition to evaluating the reconstruction error (MSE) and phenotype prediction performance (AUROC), we also measured how strongly each model's latent representations clustered by batch.We computed the Davies-Bouldin (DB) score, where lower values indicate stronger clustering [29], and the Calinksi-Harabasz (CH) score, where higher values indicate stronger clustering [30].Consequently, we desire a higher DB score and lower CH score to achieve batchinvariant latent representations.

Compared methods and ablation tests
In each application, we compared the proposed mixed effects model with the following approaches.First, we tested a conventional neural network where the cluster membership Z is disregarded and data is assumed to be iid.Second, for the DFNN and CNN, we tried the "cluster input" approach of treating the one-hot cluster membership Z as a categorical covariate, i.e. an additional model input.For the DFNN, Z was simply concatenated to X.For the CNN, Z was concatenated to flattened output of the last convolutional layer, before the dense hidden layer.When evaluating on unseen clusters, we used the inferred Z from the Z-predictor.Third, we also compared to meta-learning domain generalization (MLDG) [17].However, due to the high computational cost of second-order gradients in MLDG (training took 10 times longer than the conventional DFNN) and poor performance, we dropped the MLDG comparison for the other applications, after the spiral simulation application.Fourth, we tested a domain adversarial (DA) neural network, i.e. the fixed effect subnetwork by itself.Despite regularization to learn only fixed effects, it does not model any cluster-specific random effects.Finally, for the DFNN and CNN, we tested MeNet [21] and LMMNN [13].For the autoencoder, only the proposed ARMED approach has a suitable adaptation.
Additionally, we performed two ablation tests of the proposed mixed effects approach.We first trained the ARMED models without the adversarial classifier ("w/o Adv.") to test the necessity of the generalization loss to learn fixed effects.Additionally, we evaluated the model using randomly-assigned cluster memberships in Z ("randomized Z").For data from seen clusters, this tested whether the model truly learned cluster-specific effects.For data from unseen clusters, this tested the impact of using the Zpredictor to infer cluster membership.

Spiral classification simulations
The classification accuracy of each model, with 10-fold cross-validation, is presented in Table 1.In simulation 1 (random cluster-specific radii distributed around 1), the ARMED-DFNN outperformed all other models and had statistically significantly higher accuracy than the secondbest model, MeNet (78.8% vs. 77.4%,p = 0.003 in paired T-test).It was also uniquely able to learn appropriate cluster-specific decision boundaries that scaled in size with the cluster-specific spiral radii (Fig. 2).For example, cluster 1 (left column) has the smallest ground truth radius (green dashed line), and while cluster 2 (middle column) has the largest true radius, and the ARMED-DFNN uniquely learned this difference.In simulation 2 (greater inter-cluster variance, spiral labels inverted in half), only the cluster input DFNN, MeNet, and ARMED-DFNN achieved accuracy substantially higher than chance (50%), with 67.1%, 53.3%, and 65.0% respectively.The cluster input DFNN and ARMED-DFNN statistically significantly outperformed MeNet (p 0.001), but did not differ significantly from each other at p < 0.05.In simulation 3 (confounded probe features added), the cluster input ranked first (76.4%),followed by the ARMED-DFNN (74.5%) and MeNet (73.0%).However, the ARMED-DFNN more effectively downweighted the 2 confounded features compared to the true features (T-statistic = 12.631 and 18.173) compared to the cluster input DFNN (T-statistic = 5.346 and 5.042) and MeNet (T-statistic = 7.923 and 4.541) (Table 2).The conventional and meta-learning models placed greater importance on the confounded than the true features.
In ablation tests, removing adversarial regularization non-significantly improved the accuracy of the ARMED-DFNN in simulations 1 and 2, but decreased accuracy in simulation 3. It also worsened the separation of confounded and true features in simulation 3 (T = 1.390 and 0.343).Using randomly assigned cluster memberships in Z uniformly decreased performance, confirming that the ARMED-DFNN learned necessary cluster-specific information.

MCI conversion prediction
The performance of each model in classifying pMCI vs. sMCI, over 10×10 nested cross-validation folds, is compared in Table 3.On study sites seen during training, the ARMED-DFNN outperformed all other models in AUROC, accuracy, and specificity (Table 3, top).The AUROC of the ARMED-DFNN was statistically significantly higher than that of the second-best model, the conventional DFNN (0.926 vs. 0.884, p = 0.048).On held-out study sites unseen during training, the ARMED-DFNN again outperformed all other models in AUROC, accuracy, and specificity (Table 3, bottom).The AUROC of the ARMED-DFNN was statistically significantly higher than that of the second-best conventional DFNN (0.837 vs. 0.806, p = 0.007).The DA-DFNN performed the poorest on both seen and unseen sites, with AUROC of 0.811 and 0.723 respectively.
Removing the adversarial regularization of the ARMED-DFNN reduced AUROC (0.926 to 0.919) and accuracy (81.9% to 81.4%) on seen sites and accuracy (75.6% to 73.5%) and sensitivity (72.4% to 65.4%) on unseen sites.On seen sites, randomizing the site assignments reduced all metrics, including AUROC from 0.926 to 0.889.On unseen sites, using random instead of inferred site assignments also reduced all metrics including sensitivity from 72.4% to 69.8%.
We examined the feature importance ranking, based on the fixed effects subnetwork, and learned site-specific random slopes, based on the random effects subnetwork, of the ARMED-DFNN (Fig. 3).Demographic features including, race, ethnicity, and marital status had especially low intersite variance.Cognitive scores such as the Clinical Dementia Rating Sum of Boxes (CDR-SB) and and Mini Mental State Exam (MMSE) had especially high inter-site variance.These results are further discussed in Section 4.3.1.Feature importance rankings for all 6 DFNNs are presented in Fig. S6.We also examined the site-specific random intercepts of the ARMED-DFNN and found they correlated strongly with the percentage of pMCI subjects at each site (Pearson's r = 0.860, p < 10 −5 ), indicating the random intercepts captured the variability in class balance across sites, a major source of confounding effect.When simulated confounded probe features were added, the ARMED-DFNN ranked these probes the lowest.The 10 highest ranked features for each model are shown in Fig.

AD diagnosis
The cross-validated performance of each model in classifying brain MRIs as CN vs. AD is presented in  LMMNN had the lowest AUROC on the unseen sites, indicating poor generalization.Without adversarial regularization, the performance of the ARMED-CNN increased slightly on unseen sites (mean AUROC 0.645 to 0.655) but decreased on seen sites (mean AUROC 0.900 to 0.816).
Randomizing the site membership for seen sites drastically reduced all metrics, including mean AUROC from 0.900 to 0.585.On unseen sites, randomizing instead of inferring site membership reduced mean AUROC from 0.645 to 0.551.
Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations from each model revealed differences in the features learned (Fig. 5) [31].The conventional, cluster input, MeNet, and LMMNN CNNs attributed more weight to regions in the edges of each image, near the periphery of the brain.However, the DA-CNN emphasized medial brain areas, including the hippocampi and surrounding parahippocampal gyri.For the ARMED-CNN, we produced Grad-CAMs using the fixed effects subnetwork, which contains  the learned cluster-invariant features.Like the DA-CNN, the ARMED-CNN also emphasized medial brain areas but gave additional weight to the superior regions including the lateral ventricles.Furthermore, we created separate Grad-CAMs to visualize the distinct site-specific features learned by the ARMED-CNN random effects subnetwork (Fig. S7), which involved the image periphery for some sites and more medial areas for others.

Cell image compression and classification
The performance of each AEC model in compressing and classifying melanoma live-cell images is presented in Table 5.For computational efficiency, the pre-trained and frozen DA-AEC was reused as the fixed effects subnetwork of the ARMED-AEC.For the ablation test without adversarial regularization ("w/o Adv."), the pre-trained conventional AEC was reused as the fixed effects subnetwork.Confidence intervals were computed using DeLong's method [32].On seen batches, (Table 5, first column group) the ARMED-AEC had the highest performance in classifying metastatic efficiency (AUROC 0.869), statistically significantly outperforming the second-best model, the conventional AEC (p < 0.001), and it had the lowest reconstruction error (MSE 0.0012).On unseen batches (Table 5, second column group), the ARMED-AEC again showed the best classification performance (AUROC 0.789).This classification performance was statistically significantly higher than the second-best model, the conventional AEC (p < 0.001).All models had similar reconstruction error (MSE 0.0024) on unseen batches.Examining each AEC's latent representations, the DA-AEC and ARMED-AEC (using the DA-AEC as its fixed effects subnetwork) exhibited much less batch effect contamination.Compared  to the conventional AEC, the DB score improved from 8.885 to 43.009 (484% relative increase) and the CH score improved from 545.9 to 20.4 (96% relative decrease).
In the ablation tests, removing the domain adversarial regularization of the fixed effects subnetwork in the ARMED-AEC (using the conventional AEC as the fixed effects subnetwork) slightly increased classification AUROC on seen batches (0.876 vs. 0.869) and on unseen batches (0.791 vs. 0.789).However, this came at the expense of greater batch contamination of the latent representations (DB score 8.885 and CH score 545.9).When cluster assignments were randomized instead of using the true cluster assignments on seen batches, reconstruction MSE worsened from 0.0012 to 0.0018 and classification AUROC decreased from 0.869 to 0.732.On unseen batches, randomized instead of Zpredictor-inferred cluster assignments reduced classification AUROC from 0.789 to 0.712.
To visualize the random effects learned by the ARMED-AEC, we generated image reconstructions from the random effects subnetwork with various learned batch-specific effects applied (Fig. 6).These simulate the appearance of an image if it had been acquired within different batches.We compared these with the image reconstructions from the fixed effects subnetwork, where batch effects have been removed.Some batches showed stronger specular highlights (e.g.batches 2 and 5), while others had greater contrast in the cell periphery (e.g.batches 1 and 3).

General observations
Our experiments across four applications illustrate the three critical contributions of ARMED.First, we demonstrated that the fixed effects subnetwork of ARMED models assigns feature importance more appropriately than the compared models.In the spiral simulations, the ARMED-DFNN most strongly separated the true and confounded features by feature importance.The conventional and MLDG models erroneously placed greater importance on the confounded probes than the true features, while the cluster input, DA-DFNN, MeNet, and LMMNN models downweighted the confounded probes to a lesser degree than the ARMED-DFNN.In MCI conversion prediction with simulated confounded probes, the ARMED-DFNN ranked the probes statistically significantly lower than any other model, including the DA-DFNN and MeNet.In contrast, the conventional, cluster input, and LMMNN models were most sensitive to the probes.In AD diagnosis, Grad-CAM visualizations showed that the ARMED-CNN highlighted more biologically plausible brain regions than the conventional, cluster input, MeNet, and LMMNN CNN's, which is further discussed in Section 4.3.2.
Second, we demonstrated the ability of ARMED to visualize random effects learned by the random effects subnetwork.In MCI conversion prediction, we quantified the learned inter-site variance of the random slopes for each feature.This allowed us to identify which features are most contaminated by site effects, and we discuss these below (Section 4.3.1).In AD diagnosis, we visualized site-specific differences in Grad-CAMs.Finally, in the cell imaging application, we generated image reconstructions showing the impact of learned batch effects.
Third, ARMED typically outperforms the compared nonmixed effects methods and outperforms or matches the other mixed effects methods.In the spiral simulations, the ARMED-DFNN had either the best or second-best accuracy, while being more discriminative between true and confounded features and learning the most cluster-appropriate decision boundaries.In the MCI conversion application, the ARMED-DFNN outperformed all other methods on both data from seen and unseen sites.In AD diagnosis, the ARMED-CNN performed similarly to MeNet and LMMNN methods on seen sites and competed favorably with the DA-CNN on unseen sites.Meanwhile, MeNet and LMMNN generalized poorly to unseen sites.In the cell imaging application, the ARMED-AEC had the best reconstruction error on data from seen batches, the best metastatic efficiency classification on both seen and unseen batches, and substantially reduced batch effects in its latent representations compared to the conventional AEC.In ablation tests, we found that ARMED models without DA often performed similarly to or non-significantly better than the full model with DA, but their fixed effects subnetworks were more sensitive to confounded probe features.Therefore, we recommend always using the full ARMED model with DA, as any small performance increase comes at the cost of confound susceptibility.We also found that randomizing cluster assignment reduced performance on seen clusters, confirming that the ARMED models had learned clusterspecific information in the random effects subnetworks.Similarly, performance on unseen clusters decreased when using randomized cluster assignment instead of using the Zpredictor to infer cluster membership.This indicates that the Z-predictor is needed to fully exploit the learned random effects when predicting on data from unseen clusters.
Though we have focused on biomedical data in this work, we anticipate that our approach will be of use to any case where data is non-iid and subject to random effects.Given its flexible and modular nature, the ARMED framework should apply readily to other architecture types besides the three demonstrated here.

Comparison to prior work
A common approach to handling clustered data is to include the cluster membership, which is an unordered categorical variable, as additional one-hot encoded covariates in X [12].This approach is unable to disentangle the cluster-specific random effects and cluster-independent fixed effects, and we found it was more sensitive to simulated confounded probes than ARMED models.We also found inferior performance vs. ARMED, likely due to the high cardinality of the added features which can lead to overfitting [13], [33].For example, the MCI conversion application had 20 sites and 37 input features, meaning that to add cluster membership to X would increase the width of X by 35%.ARMED is better suited to handling this high-cardinality information by modeling clustering as a random effect, which imposes a normal distribution prior.
A more recent approach to handling differences across clusters is domain adversarial learning.We showed that DA does improve generalization to data from unseen clusters.However, ARMED improves upon DA, adding a random effects subnetwork to capture the cluster-specific information that DA discards, which results in better performance on clusters seen during training.Using the Z-predictor, this cluster-specific information can also be used when pre-dicting on data from unseen clusters, allowing ARMED to outperform DA on unseen clusters as well.
This work remedies key weaknesses in previous approaches to incorporate mixed effects into deep learning.We described specific random effects architectures for random intercepts, linear slopes, and/or nonlinear slopes.This allows greater flexibility than DeepGLMM and LMMNN, which only learn random intercepts, and MeNet, which only learns nonlinear random slopes [13], [21], [22].Another key improvement was adversarial regularization of the fixed effects subnetwork to learn generalizable, clusteragnostic information.In our experiments with simulated confounders, this allowed ARMED models to appropriately upweight nonconfounded features and downweight confounded features, while MeNet and LMMNN, lacking adversarial regularization, were susceptible to the spurious confounded features.Next, we demonstrated interpretation and visualization of the learned random effects, which was not explored in these previous works.Finally, we evaluated ARMED models on data from clusters unseen during training and provided a method to infer cluster membership and apply learned random effects on this data.The previous works lacked such a method, meaning that the learned random effects cannot be utilized on new data.This is a major limitation for practical applications, where a deployed model may need to be applied to data from a new cluster, such as a new clinical site or patient.

MCI conversion prediction
The ARMED-DFNN quantifies the inter-site variance of the learned random slope for each feature (Fig. 3).We found that demographic features such as race and ethnicity had the lowest inter-site variance, which in unsurprising as the association between these features and MCI conversion should not be sensitive to measurement differences across sites.Certain cognitive measurements, however, had distinctly high inter-site variance.The CDR-SB score had the highest variance, which concurs with a previous report that CDR-SB has suboptimal inter-rater reliability in early dementia patients, such as those with MCI [34].MMSE had the second highest variance in our ARMED-DFNN, again agreeing with previous findings of low inter-rater reliability [35].
Though we intentionally held out a large portion of the ADNI dataset to evaluate our models on unseen sites, our ARMED-DFNN performed similarly to or better than several published results on predicting 24-month MCI conversion in ADNI using deep learning.Lee et al. achieved 80% accuracy compared to the 81.9% of our ARMED-DFNN [36].Shi et al. and Lian et al. achieved AUROC of 0.816 and 0.793, respectively, compared to our 0.926 [37], [38].Note that neither of these studies held out entire study sites for evaluation, and our AUROC on unseen sites (0.837) still exceeded their results on seen sites.

AD diagnosis
The Grad-CAMs of the DA-CNN and ARMED-CNN appropriately emphasized the importance of medial brain regions including the hippocampus and surrounding medial temporal lobe, which are involved in AD-related brain atrophy (Fig. 5) [39], [40], [41].The ARMED-CNN Grad-CAMs also indicated the importance of the lateral ventricles, where enlargement has been connected to AD [40], [42].The incorporation of these additional structures likely contributed to the better performance of the ARMED-CNN (AUROC 0.900) vs. the DA-CNN (AUROC 0.823).Meanwhile, the conventional, cluster input, MeNet, and LMMNN models relied highly on likely spurious features in the image periphery.Such features appear to be related to site effects on imaging, since the random effects of the ARMED-CNN affect similar peripheral areas (Fig. S7).
The performance of our ARMED-CNN compares favorably to previous models using 2D MRI to diagnose AD in the ADNI dataset.We achieved 88.7% accuracy, while Kang et al. report 90.4% and Ebrahimi et al. report 87.5% [43], [44].However, we trained on a fraction of the total ADNI data that these reports used, holding out the rest for evaluation of models on unseen sites.Consequently, our work focuses on comparisons across architectures, not with previous studies.

Cell image compression and classification
We compare our ARMED-AEC results to the previous analysis published by Zaritsky et al. [6].While they discussed the batch effect present in the latent representations produced by their autoencoder, their methods did not explicitly suppress this batch effect.In contrast, our proposed ARMED-AEC reduced the batch effect in the latent representations by 484% based on the DB score, compared to an unmodified AEC.We also improved classification AUROC to 0.876 compared to their reported 0.723, though this may be partially due to the direct incorporation of the phenotype classifier into the autoencoder, while Zaritsky et al. trained their classifier separately from their autoencoder.

Limitations
Mixed effects models generally require the presence of several clusters to accurately estimate the random effect distributions; with <4 clusters, LME models provide less of an advantage over generalized linear regression [8], [11].Consequently, we suggest some caution when using our method for data with fewer than 4 clusters.Additionally, our method applies to datasets with a single level of random effects, but there are often cases with multiple levels of random effects, such as when multiple observations are collected per subject who are then clustered by study site.We plan to extend our methodology to such multi-level cases in future work.Finally, a practical limitation of ARMED is the additional complexity, which may increase training time by approximately 1.5-2x.However, we note that other methods have even greater computation cost, such as metalearning domain generalization (MLDG) which uses secondorder optimization and MeNet and LMMNN which involve expensive matrix inversions [13], [17], [21].

CONCLUSION
Our proposed approach uses powerful mixed effects techniques from traditional statistics to improve the interpretability, reliability, and performance of deep learning models on non-iid data.ARMED models separately learn random effects and fixed effects in distinct subnetworks, with the fixed effects subnetwork more appropriately assigning feature importance with resilience to confounding effects, helping to avoid Type I and Type II errors.In biomedical applications, this allows better hypothesis formation and prevents waste of resources in following up biased or confounded results.Meanwhile, the random effects subnetwork allows users to understand the cluster effects in their data, which can inform future research.For example, clinical study organizers could prioritize measurements with less inter-site variance in future studies.Besides these benefits, ARMED also increases predictive performance on clustered data, including better generalization to clusters unseen during training.Given these advantages demonstrated across multiple model architectures and applications, we broadly recommend the ARMED framework to deep learning practitioners dealing with non-iid data.We make our code available at tinyurl.com/ARMEDCode.
Adversarially-regularized mixed effects deep learning (ARMED) models improve interpretability, performance, and generalization on clustered (non-iid ) data: Supplemental Materials

Mixed effect prediction
Figure S3: ARMED convolutional neural network (ARMED-CNN).For the conventional CNN (blue area), the penultimate dense layer produces a latent representation h F (X, β) of the data X.A final dense layer with weights β L produces the classification prediction ŷF .To create the fixed effects subnetwork, an adversarial classifier (gray area) is added that learns to predict the sample's cluster ẑ from the outputs of the layers of the CNN.The generalization loss penalizes the CNN for learning features that allow the adversarial classifier to predict cluster ẑ.The random effects subnetwork (orange area) learns normally-distributed weights U (Z) ∼ N (0, Σ) dependent on cluster ẑ, including scalars (nonlinear slopes) applied to h F (X, β) and a bias.These are combined with the fixed effects subnetwork output to yield the mixed effects-based prediction ŷM .Note: for illustration, not all CNN layers are shown.

Fixed effect phenotype classification
Figure S4: ARMED autoencoder-classifier (ARMED-AEC).A conventional autoencoder-classifier (blue area) contains an encoder that compresses the input image X into a latent representation e F (X; β), an auxiliary classifier that predicts the phenotype label ŷF , and a decoder that reconstructs the image XF .The fixed effects subnetwork combines this conventional model with an adversarial classifier (gray area), which penalizes the encoder through the generalization loss for learning features predictive of cluster membership ẑA .The decoder shares weights with the encoder and so receives the same adversarial guidance.Meanwhile, the random effects subnetwork (orange areas) is a parallel autoencoder using random effects convolution blocks with cluster-dependent weights (Fig. S5).Its encoder compresses the image into a latent representation e R (X; U (z)) containing information predictive of cluster ẑL .Its decoder then reconstructs an image XR that is also contains information predictive of cluster ẑI .A learned combination of the two latent representations yields the mixed effects-based phenotype prediction ŷM .Note: for illustration, not all autoencoder layers are shown.Our base DFNN architecture for both the spiral classification and the MCI conversion prediction problems contained 3 hidden dense layers with 4 neurons each (Fig. S1, blue area).Each was followed by a ReLU activation function.For binary classification, the output layer was a single-neuron dense layer with a sigmoid activation.To construct the ARMED-DFNN, we began by introducing an adversarial classifier containing 3 hidden dense layers with 8, 8, and 4 neurons, respectively (conceptualized in Fig. S1, gray area).ReLU activations were used for the hidden layers, and the output layer was a softmax layer for multi-class classification.The random effects subnetwork (Fig. S1, orange area) of the ARMED-DFNN varied by problem.For the spiral classification problem where the random effect was nonlinear by design, we included an 4-dimensional cluster-specific nonlinear slope (main text Eq. 4), which was multiplied with the output of the last hidden dense layer in the main (fixed effects) subnetwork, and a cluster-specific intercept (main text Eq. 5):

LMMNN ARMED-DFNN
For the MCI conversion prediction problem, we included a p-dimensional cluster-specific linear slope (main text Eq. 6), which was multiplied by the input X with p features, and a cluster-specific intercept (main text Eq. 5): To combine the fixed (h F (x i ; β)) and random effects (h R (x i ; U (z i ))) subnetwork outputs and obtained the mixed effects prediction ŷM,i , we used the following additive mixing function: where β L contains the weights of the output layer in the fixed effects subnetwork.The final training objective for the ARMED-DFNN is where L BCE is the binary crossentropy loss: The Z-predictor model, which infers the cluster membership matrix Z of unseen sites, used the same architecture as the adversarial classifier.All DFNN models were trained with the Adam optimizer with a learning rate of 0.001 for 50 epochs, with early stopping based on validation loss.

Convolutional neural network (CNN)
The base CNN used for AD vs. CN classification contained 7 blocks, each comprising a two-dimensional convolutional layer with 3 × 3 kernels, batch normalization, and a PReLU activation function (Fig. S3, blue area).A 2 × 2 max-pooling layer was used between each block.The output of the last convolutional block was flattened and fed into a 512-neuron dense layer.Here, h F (X; β) is defined as the output of this hidden dense layer.The output layer was a single-neuron dense layer with a sigmoid activation.To form the fixed effects subnetwork of the ARMED-CNN, we first added an adversarial classifier (Fig. S3, gray area).We used an architecture similar to the base CNN, but the output layer was replaced by a softmax layer for multi-class classification.Intermediate outputs from each layer in the original CNN were sent into the adversarial classifier at the layer with the corresponding shape.Next, the random effects subnetwork (Fig. S3, orange area) consisted of a cluster-specific bias and 512-dimensional cluster-specific scalars (nonlinear slopes, akin to Eq. 1), which was multiplied with h F (X; β).We used the mixing function in Eq. 3 to combine the fixed and random effects subnetwork outputs.The loss function was the same as in Eq. 4. The Z-predictor model used the same architecture as the adversarial classifier and all models were trained with the Nadam optimizer with a learning rate of 0.0001 for 20 epochs.
Nguyen & Montillo: ARMED Models for Clustered Data Supplemental Materials

Autoencoder-classifier (AEC)
We began with the autoencoder architecture used in [1] (Fig. S4, blue area), which contains an encoder to compress an image into a 56-dimensional latent representation e F (X; β).A decoder then reconstructs the image XF .The encoder contained 6 blocks, each with a convolutional layer with 4 × 4 kernels and 2 × 2 striding, batch normalization, and PReLU activation.The output of the last convolutional block was flattened and fed into a 56-neuron dense layer with produced the compressed latent representation.The decoder architecture was symmetric and replaced convolutional layers with transposed convolutional layers.To simultaneously perform classification, we introduced an auxiliary classifier subnetwork which predicts the cell phenotype ŷF , i.e. high vs. low metastatic efficiency, from the latent representation.The auxiliary classifier took the encoder's latent representation as input and contained a 32-neuron hidden layer and a sigmoid output layer.
To create the fixed effects subnetwork, we added an adversarial classifier to predict each image's cluster ẐA from the layer activations of the encoder (Fig. S4, gray area).Through the generalization loss, the autoencoder is penalized for learning features that allow accurate cluster prediction.The encoder and decoder weights are tied so that this penalty affects both modules of the autoencoder.The adversary used the same architecture as the encoder, with the addition of a final softmax output layer.
Because image-level batch effects are believed to pervade all levels of the feature hierarchy, including lower level pixel-level features as well as higher-level morphology features, we construct a second, mirrored autoencoder as the random effects subnetwork (Fig. S4, orange area).In this second autoencoder, called the RE-AEC, random effect features are learned at every layer using random effects convolution (RECON) blocks (Fig. S5), which were inspired by style transfer networks [2,3].RECON blocks replaced the batch normalization layers of the original autoencoder.After a convolutional layer, these blocks apply instance normalization.In contrast to batch normalization which normalizes each mini-batch to zero mean and unit variance, instance normalization normalizes each sample independently [3].Afterward, a cluster-specific random scale and bias is applied to each feature map, effectively rescaling and shifting the convolution outputs in a cluster-dependent manner.
The RE-AEC produces a cluster effect-laden latent representation e R (X; U (Z)) and reconstruction XR .To further enforce the learning of cluster-specific features in the RE-AEC, we added two more classifiers to predict cluster membership from the latent representation, ẐL , and from the image reconstruction, ẐI .By minimizing the prediction error of these two classifiers, the RE-AEC is encouraged to produce latent representations and reconstructions characterizing the cluster effects.
To create the ARMED-AEC, we combined the fixed effects subnetwork (the domain adversarial AEC, denoted as DA-AEC) with the RE-AEC.Specifically, we concatenated the latent representations e F (X; β) and e R (X; U (Z)) into a single vector and used this as input to a separate DFNN classifier trained to produce a mixed effects-based phenotype prediction ŷM .In other words, the mixing function was ŷM,i = m (h F (x i ; β), h R (x i ; U (z i ))) = m([h F (x i ; β), h R (x i ; U (z i )]) where m(...) represents this new classifier using the concatenated latent representations as input.The combined training objective of the ARMED-AEC was The first line in Eq. 5 contains the image reconstruction loss for the fixed and random effects subnetworks.Here, we use the mean squared error (MSE) between the input X and the reconstruction X.The second line contains the phenotype classification loss (binary crossentropy) between the true and predicted phenotype labels y and ŷ.The third line is the KL divergence for Bayesian layers.The fourth line contains the categorical crossentropy between the true cluster label Z and the cluster labels predicted by the latentcluster classifier ẐL and the image-cluster classifier ẐI .Finally, the last line is the cluster generalization loss which encourages the fixed effects subnetwork to learn features that prevent the adversary from predicting cluster labels ẐA .
We used an additional CNN classifier analogous to the adversarial classifier as the Z-predictor.All AEC models were trained with the Nadam optimizer with a learning rate of 0.0001 for 20 epochs, with early stopping based on reconstruction mean squared error on validation data.

Spiral classification simulations
The spiral simulations generated two spirals, labeled y = 0 and y = 1 and separated in phase by π radians (see Fig. S2 and Eq. 6).We randomly generated n = 10, 000 points along the spirals at arc lengths between 0 and 2π.Consequently, the measured features for the i th point are its two-dimensional coordinates x i,1 and x i,2 and the label is y i .We then added Gaussian noise η ∼ N (0, 0.1) to X to increase the classification challenge.
x i,1 = − r j t i 2π cos(t i − φ) + η x i,2 = r j t i 2π sin(t i − φ) + η t ∈ R n×1 ∼ Uniform(0, 2π) To simulate random effects, we divided the points evenly into c = 10 clusters and varied the spiral radius r j for each cluster j.In simulation 1, we sampled the cluster-specific radii from a normal distribution centered at 1, r j ∼ N (1, 0.3) (Fig. S2a).In simulation 2, we increased the severity of the random perturbations by sampling the cluster-specific radii from a normal distribution centered at 0, r j ∼ N (0, 1.0), which causes the spirals to be inverted in half of the clusters (Fig. S2b).In simulation 3, we added two new features x 3 and x 4 which were confounded, i.e. they were associated with spiral label, y, but uncorrelated with the underlying spiral function.To simulate confounded variables, we first sampled a random variable ρ j ∼ N (0, 0.1) for each cluster.We let ρ j determine the class balance for each cluster, e.g. in cluster 3 with ρ 3 = 0.2, 20% of samples belong to class y = 0 and 80% belong to class y = 1.We then generated two additional data features as a function of ρ j (Eq.7), thereby creating a confounded relationship between x and y that is unrelated to the true underlying spiral function (Fig. S2c).

Alzheimer's Disease and mild cognitive impairment clincal and neuroimaging datasets
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (https://adni.loni.usc.edu).The ADNI was launched in 2003 as a publicprivate partnership, led by Principal Investigator Michael W. Weiner, MD.The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

MCI conversion prediction dataset
The DFNN models for the classification of progressive vs. stable MCI (pMCI vs. sMCI) were trained on the curated "ADNIMERGE" dataset provided by ADNI.This dataset contains a selection of key variables from the ADNI study.We selected the subjects with the MCI diagnosis at the baseline visit and who had a 24-month follow-up visit.We then selected the 20 sites with the most subjects and held out the remaining 34 sites as the "unseen site" testing data.Subjects were labeled pMCI if their 24-month diagnosis changed

Fig. 2 .
Fig.2.Decision boundaries learned by each model in spiral simulation 1, where spiral radii varies across clusters as a random effect.Each row illustrates one of 3 representative clusters from 10 total simulated clusters.Each column contains the decision boundaries (black solid line) learned by one model.The green dashed line illustrates the true decision boundary, computed as the midpoint between the two spirals.Only the ARMED-DFNN was able to learn the appropriate cluster-specific decision boundaries.

4 .
The conventional DFNN, cluster input DFNN, LMMNN, and ARMED-DFNN without domain adversarial regularization all included 3 of the 5 confounded probes within the top 10 features.The DA-DFNN and MeNet included 1 confounded probe and the full ARMED-DFNN included none in the top 10 features.Paired sign tests indicate that the ARMED-DFNN ranked the confounded probes significantly lower than any other model, e.g.p = 0.031 when compared to the second-best models, DA-DFNN and MeNet.

Fig. 3 .
Fig. 3. Feature importance and random slope variance for the ARMED-DFNN predictor of stable vs. progressive mild cognitive impairment.left) Features are ranked by descending median feature importance (gradient magnitude) across 10 cross-validation folds, measured from the fixed effects subnetwork.right) The inter-site variance of each feature's random slopes.See Supplemental section 3.3.1 for abbreviations.

Fig. 5 .
Fig. 5. Grad-CAM visualizations indicating important image regions for classifying Alzheimer's Disease vs. cognitively normal individuals.Each row contains examples from one of three representative study sites.

Fig. 6 .
Fig. 6.Reconstructed melanoma cell images from the ARMED-AEC.The first row contains the real image, the second row contains the fixed effects-based reconstructions, and remaining rows show random effects-based reconstructions using different learned random effects.

Figure S1 :
FigureS1: ARMED dense feedforward neural network (ARMED-DFNN).A conventional DFNN (blue area) applies a nonlinear transformation to the input data sample x, yielding h F (x, β) as the output from the penultimate dense layer.The final dense layer with weights β L then produces a fixed effects-only prediction ŷF .The fixed effects subnetwork consists of this conventional DFNN and an adversarial classifier (gray area) which, through the generalization loss, penalizes the DFNN for learning features that are predictive of the sample's cluster ẑ.The random effects subnetwork (orange area) learns normally-distributed weights U (z) ∼ N (0, Σ) dependent on cluster z.These include nonlinear slopes of h F (x, β), linear slopes of x, and/or intercepts.The combination (e.g.sum) of the fixed and random effects subnetwork outputs produces the mixed effects-based prediction ŷM .

Figure S5 :
FigureS5: Random effects convolution (RECON) block used in the proposed mixed effects autoencoderclassifier (ARMED-AEC).A convolutional layer with f filters yields a d × d × f output tensor.Instance normalization is applied, centering and rescaling values to zero mean and unit variance.A cluster-specific scale γ(z) and bias b(z) are then applied to each of the f feature maps.These cluster-specific scales and biases are regularized to follow normal distributions with learned variances σ γ and σ b , respectively.

Figure S6 :
Figure S6: Feature importance in the classification of stable vs. progressive mild cognitive impairment with various DFNN models (continued).

TABLE 1
Spiral simulation results with 10-fold cross-validation.The best results in each simulation are bolded.

TABLE 2
Sensitivity to confounded probe features in spiral simulation 3. Paired T-tests were computed to compare the feature importance of each confounded probe with the least important true feature.Positive and larger T-statistics are desired, indicating the model placed higher importance on the true feature than the confounded probe.

Table 4 .
On study sites seen during training, LMMNN showed the highest AUROC, followed by MeNet and the ARMED-CNN (Table4, top).Neither LMMNN nor MeNet significantly outperformed ARMED-CNN (paired T-test, p = 0.064 and p = 0.210, respectively).On unseen sites (Table4, bottom), the ARMED-CNN performed second best after the DA-CNN in AUROC, accuracy, and sensitivity.MeNet and

TABLE 3
Prediction of stable vs. progressive mild cognitive impairment.Confidence intervals were computed through 10×10-fold nested cross-validation.Sensitivity and specificity were computed at the Youden point.The best results for each metric are bolded.

TABLE 4
Alzheimer's Disease diagnosis from MRI. Metrics were computed through 10 Monte Carlo cross-validation replicates.Sensitivity and specificity were computed at the Youden point.The best results for each metric are bolded.

TABLE 5
Melanoma live cell image compression and classification, and batch effect contamination of latent representations.Confidence intervals were computed with DeLong's method.The best results for each metric are bolded.

Table S1 :
Characteristics and MRI acquisition parameters for ADNI2 and ADNI3 study sites included in Alzheimer's Disease diagnosis training dataset.