Decentralized Data-Privacy Preserving Deep-Learning Approaches for Enhancing Inter-Database Generalization in Automatic Sleep Staging

Automatic sleep staging has been an active field of development. Despite multiple efforts, the area remains a focus of research interest. Indeed, while promising results have been reported in past literature, uptake of automatic sleep scoring in the clinical setting remains low. One of the current issues regards the difficulty of generalizing performance beyond the local testing scenario, i.e. across data from different clinics. Issues derived from data-privacy restrictions, which generally apply in the medical domain, pose additional difficulties in the successful development of these methods. We propose the use of several decentralized deep-learning approaches, namely ensemble models and federated learning, for robust inter-database performance generalization and data-privacy preservation in the automatic sleep staging scenario. Specifically, we explore four ensemble combination strategies (max-voting, output averaging, size-proportional weighting, and Nelder-Mead) and present a new federated learning algorithm, the so-called sub-sampled federated stochastic gradient descent (ssFedSGD). To evaluate the generalization capabilities of these approaches, experimental procedures are carried out using a leaving-one-database-out direct-transfer scenario on six independent and heterogeneous public sleep staging databases. The resulting performance is compared against two baseline approaches involving single-database and centralized multiple-database derived models. Our results show that the proposed decentralized learning methods outperform baseline local approaches, and provide similar generalization results to centralized database-combined approaches. We conclude that these methods are preferable choices, as they come with additional advantages concerning improved scalability, flexible design, and data-privacy preservation.


Adriana Anido-Alonso and Diego Alvarez-Estevez

I. INTRODUCTION
In Sleep Medicine, the polysomnographic (PSG) recording of the physiological activity of a patient throughout the night represents the standard tool for the diagnosis of numerous sleep disorders. Sleep macrostructure characterization, a.k.a. sleep staging, constitutes one of the most important tasks involved in the clinical review of the PSG. According to the standard protocol, the process involves the analysis of various recorded electroencephalographic (EEG), electrooculographic (EOG), and electromyographic (EMG) derivations, labeling the corresponding signal activity according to a set of pre-established visual scoring rules. This process, which takes place on a 30 s epoch-by-epoch basis, leads to the construction of the so-called hypnogram, i.e. the resulting alternating epoch sequence of five possible sleep stages (W, N1, N2, N3, and R) throughout the night [1].
Visual analysis of the vast amounts of data contained in the PSG, however, is complex, which also makes scoring prone to errors and subjective interpretations. Clinicians' time, in addition, is expensive and scarce. As a consequence, PSG analysis is one of the most time-consuming and costly tasks in the daily routine of a sleep center. Introducing automatic scoring to support clinicians in the sleep staging task is therefore appealing: it should contribute to reducing analysis times, enhancing productivity, and lowering the overall associated costs. Furthermore, expert-supervised automatic scoring has been shown to improve inter-rater agreement, reducing variability and improving diagnostic quality [2], [3], [4], [5]. For this reason, many attempts have been made to automate this process [6], [7], [8], [9], [10], [11], [12]. However, despite the promising evaluation results reported in many of these works, uptake of automatic sleep scoring in the clinical setting remains low [13], [14], [15].
Accurate validation of automatic sleep scoring approaches has traditionally been biased due to limitations of the associated benchmark datasets. Data would be scarce and lack sufficient heterogeneity, and evaluation would usually involve single-source datasets relying on widespread train-test partitioning, or local k-fold cross-validation. This approach leads to overly optimistic estimates of broad generalization. Effectively, under this setting, training and testing data are still gathered from the same local distribution. However, when the same algorithm is evaluated on completely external databases (e.g. from another sleep center), scoring performance drops significantly [16], [17], [18], [19], [20]. A number of reasons contribute to this "database variability problem" in the case of sleep medicine [18]. These include differences among source patient populations, recording and/or acquisition methods, or the aforementioned divergences among clinical experts [3], [5], [21], [22], [23], [24]. More generally, challenges regarding the associated domain shift are well known in the scope of machine learning, leading to related work in the sub-field of "domain adaptation" [25]. Some recent approaches to automatic sleep staging are indeed focusing on developing ideas in the context of transfer learning, i.e. reusing previous parametrization, or parts of a model, trained on a source dataset, to be fine-tuned using external independent datasets in a target domain [26], [27], [28], [29], [30], [31], [32], [33]. Nevertheless, all of the referenced approaches use some amount of data from intermediate datasets to perform the transfer step. Therefore, the actual generalization capacity of these models on completely unseen target domains remains uncertain. To the best of our knowledge, only one study has considered evaluation of an automatic sleep staging model, developed using transfer learning, on an unbiased direct-transfer scenario [34].
An alternative approach to improve domain adaptation is to train the model on a large centralized database arranged from data of different heterogeneous cohorts [35], [36]. This strategy, however, has its own disadvantages, concerning complex logistics and the high resource demands related to centralizing and learning from all these data. The resulting model, in addition, becomes inflexible as new data become available over time. That is, if a new dataset becomes available, any previously derived centralized model would need to be retrained, either completely from scratch, or by relying on transfer-learning methods. Regardless, re-learning can be expensive and is eventually exposed to the risk of catastrophic forgetting [37]. Furthermore, privacy and ethical problems quickly arise when dealing with potentially sensitive information, as is the case in the clinical domain. This may prevent data exchange between different centers.
In contrast with these methods, decentralized learning strategies represent an interesting alternative in the context of restrictive data-sharing scenarios. One such possibility is to train machine-learning models locally within each data source location, and then integrate the resulting models into an ensemble. This approach shows advantages regarding flexibility and scalability of the design, as the resulting ensemble can be easily expanded by adding new local models when new training data or datasets become available, without the need to retrain from scratch. Furthermore, because each model integrating the ensemble has been developed locally in the context of its own data source, there is no need to share and/or centralize data from different centers. Only the resulting local model parameters (i.e. weights) would need to be shared for their integration in the final ensemble. Therefore, potential issues due to patient privacy protection regulations are minimized. This approach has been explored in recent work by the authors, with preliminary results suggesting that more robust inter-database generalization can be achieved in comparison to individual models derived from single source datasets [38]. These preliminary results are reviewed and expanded in this work. More specifically, in past experimentation, direct comparison to centralized approaches was missing, and ensemble combination was only considered under a majority voting strategy.
Alternatively, recent progress in the area of federated learning is opening interesting new paths of development. More specifically, the federated approach is based on the idea of collaboratively training a learning model across multiple participating nodes, each holding decentralized local samples, without the necessity of exchanging their data [39]. Instead, individual client nodes from different geographic locations exchange local model parameters or aggregated non-sensitive information, therefore preventing sensitive raw data from being directly shared. An interesting property of federated learning, which contrasts with other distributed learning approaches, is that it does not assume client data to be mutually independent and identically distributed (IID) across the participating nodes. This is a relevant consideration in the clinical setting, where the representative patient phenotype would presumably diverge across different medical centers. Federated learning has been barely examined in the context of sleep medicine. To the best of our knowledge, only one recent work has considered this approach, in which, however, the corresponding client data were simulated by partitioning one single dataset [40]. As stated before, such an approach involves a considerable relaxation of the non-IID assumption, and it does not allow proper evaluation of the actual generalization capabilities of the proposed solution.
In light of the above observations, in this work we investigate the use of different decentralized deep-learning approaches based on the two previously described scenarios, ensemble and federated learning, and explore their utility in the context of automatic sleep staging. The main objective is to develop predictive models with robust inter-database generalization capabilities while, at the same time, overcoming limitations due to the exchange and centralization of sensitive information. As a novel contribution, we expand preliminary work on ensemble learning [38], [41]. First, by including direct comparison to centralized approaches. Second, by exploring four different ensemble combination strategies, namely max-voting, output averaging, size-proportional, and Nelder-Mead model weighting approaches. In addition, we explore the use of federated learning and present a new variant of the more general federated stochastic gradient descent (FedSGD) approach [42], namely Sub-sampled Federated SGD (ssFedSGD). Inter-database generalization performance of each of these methods is examined on a leaving-one-database-out direct-transfer scenario using six independent and heterogeneous sleep staging databases collected from public online repositories. To set a baseline for comparative analysis, the obtained results are also compared against traditional approaches consisting of training individual models i) on each of the local datasets, and ii) on the centralized dataset that results from gathering together data from the individual cohorts.
Based on the results of our experimentation, we analyze and discuss the advantages and disadvantages of each of the explored approaches.

II. MATERIALS AND METHODS
In this section we describe the two proposed decentralized learning approaches, ensemble and federated learning, including the specific variants explored in each case. In addition, we detail the general deep neural network architecture and the different sleep staging databases used during experimentation.

A. Ensemble Approach
An ensemble comprises the aggregation of several pre-trained local model outputs to produce a final prediction. Intending to expand the results from [38], in this work we explore four different output assembly techniques in order to compare their effectiveness with respect to the local and centralized approaches (in addition to the federated approach, described in the next section). More specifically, the following ensemble combination approaches are considered (a minimal sketch of these strategies is given after the list):
- Max-voting: each local model integrating the ensemble selects its output class according to the corresponding highest softmax activation at its output layer. The final ensemble prediction corresponds to the most represented class, that is, the most frequently voted among the models composing the ensemble.
- Output averaging: in contrast with max-voting, this method averages each of the corresponding output softmax activations of the models participating in the ensemble, prior to individual class assignment. The final prediction corresponds to the class associated with the highest averaged value.
- Size-proportional weighting: under this approach, different weights c_i are assigned to each of the models M(i) integrating the ensemble, proportionally to the amount of data contained in their local datasets (n_i). Denoting N = Σ_i n_i the total amount of virtual data, then c_i = n_i / N. The respective output softmax activations are then balanced by multiplying their value by the corresponding coefficient c_i. The output with the highest score is selected as the final predicted class.
- Nelder-Mead: this method uses a weighted combination of the output softmax activations of each of the models integrating the ensemble, similar to the previous method. Here, in contrast, the Nelder-Mead optimization algorithm [43] is used to find the best possible weight combination following an iterative process. The loss function of the corresponding ensemble combination, evaluated on an ancillary (validation) dataset, is used as reference for this purpose.
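To make the four combination rules concrete, the following is a minimal NumPy sketch, assuming each local model has already produced per-epoch softmax probabilities over the five sleep stages; array names and shapes are illustrative, and the Nelder-Mead variant is represented only through the generic weighted combination it optimizes (see Section III-E for the actual search settings).

```python
import numpy as np

def max_voting(probs):
    """probs: list of (n_epochs, 5) softmax arrays, one per local model."""
    votes = np.stack([p.argmax(axis=1) for p in probs])            # (n_models, n_epochs)
    # most frequently voted stage per epoch
    return np.apply_along_axis(lambda v: np.bincount(v, minlength=5).argmax(), 0, votes)

def output_averaging(probs):
    """Average the softmax activations before class assignment."""
    return np.mean(np.stack(probs), axis=0).argmax(axis=1)

def size_proportional(probs, n_local):
    """Weight each model by c_i = n_i / N, with n_local the local dataset sizes."""
    c = np.asarray(n_local, dtype=float) / np.sum(n_local)
    weighted = np.tensordot(c, np.stack(probs), axes=1)            # (n_epochs, 5)
    return weighted.argmax(axis=1)

def weighted_combination(probs, weights):
    """Generic weighted combination; Nelder-Mead searches over `weights`
    by minimizing the ensemble loss on a held-out validation set."""
    weighted = np.tensordot(np.asarray(weights, dtype=float), np.stack(probs), axes=1)
    return weighted.argmax(axis=1)
```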

B. Federated Learning
Federated learning is a machine-learning technique which involves collaborative learning while preserving data-privacy [39].
It applies the General Data Protection Regulation's (GDPR) data minimization principle [44], whereby the information transmitted is intended to be the minimum necessary for guiding the targeted learning process. In particular, it is assumed that the exchanged information is always less than the raw source data, and that it does not contain any personal or potentially sensitive information. From a general perspective, federated learning comprises a global model, the so-called "server", which is successively improved by aggregating parametric information from multiple decentralized local nodes, the so-called "clients". Let us consider the general optimization problem to be represented as:

$$\min_{w} F(w), \quad \text{with} \quad F(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i; M(w)),$$

where \ell(x_i, y_i; M(w)) denotes the prediction loss on the sample (x_i, y_i) of a global model M that depends on the set of parameters w. If we assume that the n data points are distributed across K decentralized datasets, and denote P_k the set of data indexes within client k, with n_k = |P_k| and k = 1...K, then we can reformulate the problem in a federated setting as:

$$F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \quad \text{where} \quad F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} \ell(x_i, y_i; M(w)).$$

Notice that, with the above general formulation, we are implicitly assuming that data can be non-IID and imbalanced across the partitioning P. Effectively, several issues must be attended to when optimizing this function collaboratively. First, because of the non-IID assumption, a client's particular distribution may not be representative of the entire population. Further, as stated, there can be uneven data availability, resulting in unbalanced datasets. In addition, the setting might be massively distributed, meaning that the number of participating clients might exceed the average amount of data per client. And last but not least, client-server communication might be limited due to temporary unavailability or slow connectivity [42]. Under this setting, the general federated learning workflow, shown in Fig. 1, involves three basic steps repeating along an iterative process: i) distribution of the current server state (w_t) to the clients, ii) computation of the local update for each client (θ_{k,t}), and iii) client parameter aggregation and server model state update (w_{t+1}). In general, θ_{k,t} = g(P_k; w_t), a certain function of the corresponding set of local data points and the current server model parameters. Similarly, the exact aggregation formula needs to be defined, leading to different implementation variants of the federated learning algorithm [42], [45], [46], [47]. Importantly, and regardless of the exact formulation, during the local update step each client computes its θ_{k,t} independently, using information from its local data source, but without exchanging raw data with any other client or with the server. After each cycle, the process starts over with the server distributing the new updated global state to the participating clients. The procedure stops when a predefined number of learning rounds is reached, or when a specific stopping condition is met.
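As an illustration of this three-step workflow, the following is a schematic Python sketch of the iterative server-client loop, with hypothetical local_update and aggregate callables standing in for the algorithm-specific choices discussed next; it is not the authors' implementation.

```python
def federated_training(w_0, clients, local_update, aggregate, max_rounds, stop_condition=None):
    """Generic federated loop: distribute, update locally, aggregate.

    clients      -- handles to the decentralized nodes; each holds only its own P_k
    local_update -- client-side function computing theta_{k,t} = g(P_k; w_t)
    aggregate    -- server-side function combining the client updates into w_{t+1}
    """
    w_t = w_0
    for t in range(max_rounds):
        # i) the server distributes its current state w_t to every client
        # ii) each client computes its local update using only its own data
        updates = [local_update(client, w_t) for client in clients]
        # iii) the server aggregates the updates into the new global state w_{t+1}
        w_t = aggregate(w_t, updates)
        if stop_condition is not None and stop_condition(w_t, t):
            break
    return w_t
```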

C. Sub-Sampled Federated Stochastic Gradient Descent (ssFedSGD)
Stochastic gradient descent (SGD) is the most popular optimization algorithm in (deep) machine learning [48]. In the federated environment, this optimizer leads to the so-called Federated SGD (FedSGD) algorithm, which applies a single batch gradient descent calculation per round of communication [42].
More specifically, during the learning process in FedSGD, each client computes θ_{k,t} = ∇F_k(w_t), where:

$$\nabla F_k(w_t) = \frac{1}{n_k} \sum_{i \in P_k} \nabla \ell(x_i, y_i; M(w_t)).$$

That is, θ_{k,t} represents the average gradient over the local dataset k given the current server model state w_t. In fact:

$$\nabla F(w_t) = \sum_{k=1}^{K} \frac{n_k}{n} \nabla F_k(w_t).$$

The central server thus simply aggregates all the local ∇F_k(w_t), for which, assuming a fixed learning rate η, the server state update formula becomes:

$$w_{t+1} = w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} \nabla F_k(w_t).$$

Notice that, under this approach, the averaged local gradients are balanced taking into account the respective amount of data used at each client. However, one problem with the aforementioned FedSGD baseline concerns the very large number of rounds needed to train accurate server models. The associated computation times, in fact, might become impractical depending on the specific type of application and the amount or complexity of data contained on each client [42]. For this reason, many FedSGD variations are emerging that aim to speed up the learning process and cope with instability due to dissimilar local updates or the presence of non-IID data [46], [49], [50], [51], [52].
While the matter remains an open area of research, in this work we propose a new variant, namely Sub-sampled Federated SGD (ssFedSGD), using the above described FedSGD framework as baseline. The main contribution of this method is the use of an arbitrary fixed-length (n_s) sub-sample S_{k,t}, obtained by uniformly randomly sampling each client dataset k at the beginning of each federated training round t. Notice that n_s = |S_{k,t}|, ∀k, t, where k = 1...K and t = 1...max_rounds.
Hence, under ssFedSGD, each client locally computes θ_{k,t} = ∇F_k(w_t), where:

$$\nabla F_k(w_t) = \frac{1}{n_s} \sum_{i \in S_{k,t}} \nabla \ell(x_i, y_i; M(w_t)),$$

and the global server state update formula becomes:

$$w_{t+1} = w_t - \frac{\eta}{K} \sum_{k=1}^{K} \nabla F_k(w_t).$$

Notice that, because of the uniform random sub-sampling, the resulting S_{k,t} still hold the same data distribution as the original P_k's. By selecting an appropriate n_s we thus hypothesize that effective learning can still be achieved at a fraction of the cost per round, therefore speeding up the overall learning process in practice. Moreover, by using a fixed-length sub-sample we ensure equal client contribution to the global learning at each step, irrespective of the total amount of local data; notice, in contrast, that in the original FedSGD setting more relative importance is given to the update resulting from the client with the largest dataset. Furthermore, we avoid possible collateral effects due to disparity of local computation steps among the client nodes, in particular when assuming the use of a common batch size.
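A minimal single-process sketch of one ssFedSGD round is given below, assuming a Keras model and in-memory NumPy arrays per client; in a real deployment each gradient would be computed on the client's own node and only the gradients would travel to the server. Function and variable names are illustrative, not the released implementation.

```python
import numpy as np
import tensorflow as tf

def ssfedsgd_round(model, client_data, n_s, eta, loss_fn):
    """One ssFedSGD round over K simulated clients.

    client_data -- list of (x_k, y_k) NumPy arrays, one pair per client
    n_s         -- fixed sub-sample length drawn from every client each round
    """
    client_grads = []
    for x_k, y_k in client_data:
        # uniform random fixed-length sub-sample S_{k,t} of the client dataset
        idx = np.random.choice(len(x_k), size=n_s, replace=False)
        with tf.GradientTape() as tape:
            loss = loss_fn(y_k[idx], model(x_k[idx], training=True))  # mean loss over S_{k,t}
        client_grads.append(tape.gradient(loss, model.trainable_variables))
    # equal client weighting (1/K): every sub-sample contributes the same number of samples
    k = float(len(client_grads))
    avg_grads = [tf.add_n(grads) / k for grads in zip(*client_grads)]
    # server update: w_{t+1} = w_t - eta * (1/K) * sum_k grad_k
    for var, grad in zip(model.trainable_variables, avg_grads):
        var.assign_sub(eta * grad)
```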

D. Deep-Learning Model Architecture
We use a convolutional (CNN) long short-term memory (LSTM) deep-learning model architecture based on the general schema proposed in past work [38]. The model was completely re-implemented in Python (version 3.9.7, TensorFlow 2.7.0) with some additional modifications, namely the elimination of an artifact-removal preprocessing step and of a batch normalization layer in the operation block, both of which were included in the original design. The latter was motivated by new experimental results in this work showing convergence problems in the federated scenario. This effect is further analyzed in the discussion section. An overview of the resulting architecture can be seen in Fig. 2. We refer to past work for a detailed discussion of the remaining general architectural design [38].

E. Databases
For testing our methods, a heterogeneous dataset comprising six independent sleep staging databases (DREAMS, Dublin, SHHS, Telemetry, ISRUC and HMC) was used. For the sake of repeatability, all data were collected from public online repositories, digitally encoded using the open EDF(+) format [53]. Fig. 3 summarizes the number of samples across the six collected databases and their corresponding class distributions, illustrating the presence of size and class imbalance. A detailed description of each of the databases can be found in past work [38].

III. EXPERIMENTAL DESIGN
The experimental design involves the scheduling of four learning strategies following the different local, centralized, and decentralized methods described in the previous section. The purpose is to compare the resulting inter-database generalization performance on the targeted sleep staging prediction task. All experiments use as reference the set of databases mentioned in Section II-E and the deep neural network architecture referred to in Section II-D. We write TR(x), VAL(x), and TS(x) to refer to the respective training, validation, and testing split partitions resulting from database x. The complete set of data in the corresponding database is denoted as FULL(x). Likewise, we denote M(x) to refer to the model derived from (training) data of dataset x, which can be a single source, or a combination of several databases, depending on the specific experiment as described next. Detailed diagrams of the experiments are shown in Fig. 4.

A. Experiment 1: Local Models
Six different deep-learning models are built. Each M(x) model is trained using data from one single database, i.e. TR(x), using VAL(x) as the corresponding validation set for implementing early stopping. Local generalization performance of the resulting model is evaluated on TS(x), while the actual (database-agnostic) external generalization is assessed on all remaining FULL(i), i ≠ x (Fig. 4(a)).
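For reference, a sketch of how one such local model might be trained with the settings later described in Section III-E is shown below; build_model is a hypothetical factory returning the CNN-LSTM of Section II-D, and the snippet is illustrative rather than the released code.

```python
import tensorflow as tf

def train_local_model(build_model, tr_x, tr_y, val_x, val_y, patience=5):
    """Train one local model M(x) on TR(x), early-stopping on VAL(x)."""
    model = build_model()  # hypothetical factory for the CNN-LSTM architecture
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=patience, restore_best_weights=True
    )
    model.fit(
        tr_x, tr_y,
        validation_data=(val_x, val_y),
        batch_size=100,
        epochs=10_000,            # upper bound; early stopping terminates training
        callbacks=[early_stop],
    )
    return model
```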

B. Experiment 2: Centralized Database-Combined Models
We build six database-combined models, C_x, by pooling data from five out of the six available databases, following a leaving-one-database-out strategy. In other words, let d be the left-out database; the corresponding C_x model is trained using the combined dataset {TR(k), k = 1...6, k ≠ d}. Likewise, the same procedure is followed for arranging the corresponding VAL(x) and TS(x) datasets to implement early stopping and perform local evaluation, respectively. Notice that, under this approach, all data but the left-out database are locally available. It therefore represents the baseline for the classical centralized learning approach against which the distributed ensemble and federated learning strategies can be compared. Inter-database generalization performance is here evaluated on the left-out FULL(d) dataset (Fig. 4(b)).

C. Experiment 3: Ensemble Models
Six ensemble models, E_x, are created by combining five out of the six local models resulting from Experiment 1, thus following a leaving-one-model-out strategy. Let d be the left-out model; then E_x = {M(k), k = 1...6, k ≠ d}. In contrast with Experiment 2, which involves combined models, this approach is implemented by sharing the local models' parameters, not the local data. External performance of each resulting ensemble is again evaluated on its corresponding left-out FULL(d) dataset, whose data were not used for the derivation of any of the M(k) included in the ensemble. For this purpose the four output assembly strategies described in Section II-A, namely max-voting, output averaging, size-proportional weighting, and Nelder-Mead, are tested and compared (Fig. 4(c)).

D. Experiment 4: Federated Models
We build six different servers, F_x, using the described federated learning approach. As in Experiments 2 and 3, each F_x takes as reference five out of the six gathered databases, leaving one out for the purpose of evaluating the corresponding external generalization performance. In order to simulate the federated environment, each of the five used databases is distributed and treated separately as one independent client. Hence, each client k, k ≠ d, corresponds to one database and uses its own TR(k) dataset to perform the local update step. The proposed ssFedSGD learning algorithm, described in Section II-C, is then used for training and deriving the corresponding global model. Early stopping is here implemented using as reference the aggregated performance on the corresponding clients' local validation partitions. For this purpose, each client k locally evaluates its performance on the corresponding VAL(k) dataset using the last communicated server state. The result is then sent back to the server which, in the general FedSGD scenario, averages the individual clients' validation performances proportionally to the respective number of local data samples n_k. Notice that in the context of ssFedSGD equal weighting results, as n_k = |S_{k,t}|, ∀k = 1...K, t = 1...max_rounds. Early stopping then takes place once the number of rounds configured in the patience setting has been surpassed without further improvement over the best validation performance obtained so far. Finally, as in previous cases, inter-database generalization performance is evaluated on the left-out FULL(d) dataset (Fig. 4(d)). Additionally, a set of ablation experiments involving different parameter configurations is conducted in order to analyze their effects on ssFedSGD performance. More specifically, these experiments focus on evaluating the influence of different learning rates and sub-sample sizes.
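The early-stopping criterion described above could be sketched as follows: each client reports its validation loss on VAL(k) under the latest server state, the server averages these scalars with equal weight (the ssFedSGD case), and a patience counter tracks the rounds without improvement. Names and the loss callable are assumptions for illustration only.

```python
def aggregated_validation_loss(model, client_val_sets, loss_fn):
    """Average of the clients' local validation losses under the current server state.

    Only a scalar per client travels back to the server; with ssFedSGD every
    client is weighted equally because all sub-samples have the same length.
    """
    losses = [float(loss_fn(y_val, model(x_val, training=False)))
              for x_val, y_val in client_val_sets]
    return sum(losses) / len(losses)

def should_stop(history, patience=100):
    """Early stopping: no improvement over the best loss for `patience` rounds."""
    best_round = min(range(len(history)), key=history.__getitem__)
    return len(history) - 1 - best_round >= patience
```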

E. Configuration Settings
For all the experiments described above, the same epoch-wise database partitioning scheme was used. For each local database, 20% of the data were set aside as the testing dataset, while the remaining 80% were further split following an 80-20 proportion into training and validation data, respectively. Results of all experiments were evaluated on the respective datasets using Cohen's Kappa (κ) as the main reference for performance assessment [54]. The multi-class and imbalanced nature of the sleep staging scenario makes Kappa a more adequate criterion in this context than other widespread metrics in machine learning, such as accuracy, as the former corrects for agreement due to chance. Remarkably, Kappa is the standard measure of agreement reported across the literature regarding human and computer-based inter-rater scoring agreement in the context of sleep studies [3], [5], [21], [22], [23], [24]. Regardless, in order to provide a more complete picture, the final experimental summary results are also quantified using supplementary widespread evaluation metrics, including accuracy, macro-F1 score, precision, and recall. For the learning step, stochastic gradient descent was used with a constant learning rate (lr = 0.001) and momentum (p = 0.9). Categorical cross-entropy was selected as the loss function to be minimized. A batch size of 100 was used, and the maximum number of iterations was set high enough so that learning would stop based on the corresponding validation performance, namely the early stopping criterion. In this regard a patience of 5 was used for all but Experiment 4, regarding federated learning, where a patience of 100 was used. The latter, larger patience was required to compensate for the sub-sampling step in ssFedSGD, which effectively makes it necessary to perform more federated rounds to process an amount of samples comparable to one regular training epoch in Experiments 1 to 3. More specifically, for ssFedSGD a random sub-sample size n_s = 2000 was used for each participating client.
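For clarity, the partitioning and the summary metrics could be computed as in the following sketch; the exact shuffling and any stratification details are not specified in the text, so this is only an assumption-laden illustration using scikit-learn's metric implementations.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def epochwise_split(n_epochs, seed=0):
    """80/20 split into (train+val)/test, then 80/20 of the remainder into train/val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_epochs)
    n_test = int(0.2 * n_epochs)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

def summarize(y_true, y_pred):
    """Main (Kappa) and supplementary metrics used to report performance."""
    return {
        "kappa": cohen_kappa_score(y_true, y_pred),          # corrects for chance agreement
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```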
Concerning the Nelder-Mead ensemble combination strategy described in Section III-C, proportional weights were used for initializing the ensemble combination of the participating models at iteration 0. That is, the initial conditions match the output of the additionally tested output averaging strategy. A maximum of 10 optimization iterations was then allowed. At each iteration, the loss function evaluated on the corresponding combined validation dataset was taken as reference to guide the underlying search process.
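A possible realization of this weight search with SciPy is sketched below; per the text, the weights start from the output-averaging configuration (equal weights) and at most 10 Nelder-Mead iterations are run against the combined validation loss. The helper names and the loss callable are assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def nelder_mead_weights(val_probs, val_loss, max_iter=10):
    """Search ensemble combination weights on the combined validation data.

    val_probs -- list of (n_epochs, 5) softmax outputs of each local model on
                 the combined VAL datasets
    val_loss  -- callable mapping combined (n_epochs, 5) activations to a scalar loss
    """
    stacked = np.stack(val_probs)                              # (n_models, n_epochs, 5)
    w0 = np.full(len(val_probs), 1.0 / len(val_probs))         # iteration 0: output averaging
    def objective(w):
        combined = np.tensordot(w, stacked, axes=1)            # weighted softmax activations
        return val_loss(combined)
    result = minimize(objective, w0, method="Nelder-Mead", options={"maxiter": max_iter})
    return result.x
```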
All experiments were conducted on an Intel Xeon CPU E5-2620 v3 @ 2.40 GHz x 8, equipped with two NVIDIA RTX A6000 (48 GB) GPUs. Source code for reproducibility of our experiments and methods will become available online at GitHub: https://github.com/adrania/Decentralized-deep-learning.git.

IV. RESULTS
Results of the experiments described above are detailed in Tables I, II, III, IV, and V, and Fig. 5. Tables I and II refer to the results of Experiment 1, showing performances of the local models on their respective datasets (including training, validation and test partitions) and on the external FULL databases, respectively, the latter meant to assess their generalization capabilities in the direct-transfer scenario. Comparing Tables I and II, it can be seen that when local test partitions are evaluated, performance ranges between κ = 0.79 and 0.83. Notice that reasonable generalization is achieved with respect to their corresponding training and validation partitions. However, when an external data source is used, generalization performance decreases to between κ = 0.14 and 0.70 (κ = 0.40 to 0.58, on average, per model). The best scenario corresponds to M(ISRUC) on FULL(DREAMS) with κ = 0.70, the worst to M(SHHS) predicting FULL(Telemetry). Overall, M(HMC) is the best generalizing model (averaged κ, κ_avg = 0.58, across all left-out external databases), followed by M(ISRUC) with κ_avg = 0.57. In contrast, the worst generalization capability corresponds to M(Telemetry), with κ_avg = 0.40. Database-wise, DREAMS seems to be the easiest database to predict (κ_avg = 0.54), while Telemetry is the most difficult one (κ_avg = 0.44).
Table III details the results of Experiment 2, regarding centralized database-combined models. Similarly to the local models, the corresponding train, validation and test performances are reported in the first columns. For each combined model (C_x), inter-database generalization is assessed on the corresponding left-out FULL(i), as described in Section III. This database is identified in the fifth column of Table III. A similar performance downgrading effect can be observed as in the case of local models, whereby performance on the local TS dataset shows higher and more stable results (κ = 0.78 to 0.79) than on the corresponding external databases (κ = 0.61 to 0.71). The performance drop in this case is more contained, as expected, due to the higher amount and heterogeneity of training data used. The best scenario is obtained by C_3, which uses DREAMS as the predicted database (κ = 0.71), in line with the results observed for Experiment 1, while the worst generalization is observed for C_5, predicting HMC (κ = 0.61).
Ensemble models' results are described in Table IV, which allows comparison of the generalization performances obtained on each of the corresponding external FULL datasets under the different tested combination strategies. Overall, the best results are achieved using the size-proportional weighting strategy (κ_avg = 0.64), followed by Nelder-Mead (κ_avg = 0.63), output averaging (κ_avg = 0.62) and, lastly, max-voting (κ_avg = 0.59). The size-proportional weighting approach also shows the most stable results, ranging from κ = 0.60 to 0.65 across databases.

TABLE VI SUMMARY OF AVERAGE GENERALIZATION PERFORMANCE (COHEN'S KAPPA)
Performance results regarding the federated models are detailed in Table V. As in Experiments 2 and 3, each federated model F_x is built using a leave-one-database-out strategy, while in this case the remaining databases used for developing the model are distributed and treated as independent local clients. In order to assess the resulting local database performance, the final server state is independently evaluated, on each client, using the corresponding local train, validation, and test partitions. These metrics are finally averaged, resulting in the TR, VAL and TS values shown in the corresponding columns of Table V. Generalization performance evaluated on the corresponding left-out FULL(i) database is indicated in the last column. From the Table it can be observed that the obtained metric values range between κ = 0.59 and 0.70. In comparison with the local testing sets, a similar downgrading trend can be observed when predicting external data. The effect is similar to the one observed in Table III with regard to the centralized database-combined models, with perhaps slightly less local over-fitting in this case. On an individual basis, the best generalization was obtained by F_3 (κ = 0.70), corresponding to the prediction of the DREAMS database. Likewise, the worst scenario was represented by F_5 when attempting to predict HMC data (κ = 0.59).

Additionally, intending to analyze the effects of different parameter settings on ssFedSGD convergence and generalization, we conducted several ablation experiments regarding different sub-sampling sizes and learning rates. We evaluated four sub-sampling scenarios (500, 1000, 2000 and 4000) and three different learning rates (0.0001, 0.001 and 0.01) using the F_5 configuration as baseline. The results of these experiments are shown in Fig. 5. As can be seen in Fig. 5(a), no convergence variations are observed when using different sub-sample sizes, nor in the corresponding generalization results, as similar performances are obtained when these models are presented with the external database (κ = 0.56-0.59). Notice that model ss1000 is the one that converges earliest (round ≈ 3800); however, it presents the worst results when predicting HMC data (κ = 0.56). Regarding learning rate effects, the results can be seen in Fig. 5(b). Differences in convergence speed appeared, as expected, with normal behavior: the higher the learning rate, the earlier the convergence, and vice versa. Notice that variations in convergence speed among different settings, specifically in the number of learning rounds or iterations, occur naturally when applying a consistent early stopping criterion (with a patience of 100) for all federated learning experiments. Although loss differences are displayed, there are no significant variations in generalization performance (κ = 0.51-0.59).

Finally, for easy comparison of the different tested methods, Table VI and Fig. 6 show a summary of the averaged generalization performances obtained in each case. Table VI regards Kappa results, and Fig. 6 details the complete performance picture involving accuracy, precision, recall, macro-F1 score and Kappa metrics. As can be seen, the federated model practically matches the classical database centralization approach, with respective κ = 0.65 and 0.66. This result is followed by the size-proportional ensemble combination strategy with κ = 0.64, and Nelder-Mead with κ = 0.63. All decentralized ensemble and federated approaches obtain considerably better generalization results than those of models derived from local databases.

V. DISCUSSION
In this work we have explored several machine-learning strategies applied to the medical domain of sleep staging. More specifically, special focus was placed on the assessment of generalization robustness in the context of a multi-database prediction scenario, and on the preservation of local data-privacy.
We have proposed different decentralized learning approaches, namely ensembles and federated learning, whereby global model development can take place taking advantage of heterogeneous data distributed across independent decentralized nodes. Remarkably, these approaches avoid direct sharing of the local raw data contained in each node, therefore allowing different medical centers (in this case) to contribute without unnecessary exposure of potentially sensitive or restricted information. We explored different configurations of such learning approaches, using a deep-learning neural network architecture as reference, and studied their generalization capabilities in the prediction of several independent and heterogeneous sleep staging databases. Traditional approaches that derive local models from individual databases, or from centralizing data from multiple cohorts, were submitted to the same experimental benchmarking for baseline comparison of the resulting inter-database generalization performances.
Our experimentation has evidenced that the performance of local models is influenced by the specifics of the source dataset, leading to poor inter-database generalization. In particular, while comparable performance between local validation and test partitions was observed, a considerable performance drop was experienced when local models targeted the prediction of data from external sources (see Tables I and II). Similar performance downgrading effects were reported in recent literature regarding the same automatic sleep staging scenario [16], [17], [18], [19], [20]. Our current results confirm this trend. The effect is directly related to the aforementioned database variability due to differences in patients' conditions and physiology, signal acquisition methods, or intra- and inter-expert scoring interpretations [18]. Thereunder, local models are biased toward the specific training source domain, providing unreliable results when evaluated on external data and leading to the need to retrain the model from scratch. Transfer learning has recently been explored as a means to mitigate the amount of required retraining, by focusing on fine-tuning only certain parts of the model, or on the use of minimal subsets of target-domain data [26], [27], [28], [29], [30], [31], [32], [33]. However, certain amounts of target data and model re-parameterization are still needed. Alternatively, one might opt to centralize data from different cohorts to develop more robust prediction models [35], [36]. Our experimentation also confirms this result. Overall, models derived from combined datasets achieved the best generalization performance (κ = 0.66, see Table VI). This is expected, as it is well known that a larger and more heterogeneous dataset will contribute to reducing database dependency. However, as introduced before, this approach has several disadvantages. First, all data need to be centralized in a single dataset, which might be technically difficult and expensive. Second, data sharing might be problematic if it involves sensitive information, as in the medical domain. Third, this strategy is inflexible and does not scale well as new data become available over time, therefore still requiring the model to be re-adapted or completely re-trained from scratch.
In contrast, the decentralized methods proposed in this work provide a promising alternative to cope with the aforementioned problems. In this regard, our experimental data have shown that better generalization can be achieved in comparison with local models. Moreover, similar performance can be obtained with respect to the centralized database-combination approach (see Table VI).
Model ensembling has been introduced as a suitable option in terms of scalability and dynamism. It addresses catastrophic forgetting, as the resulting global model can be easily enlarged by just adding new local models derived from newly available data sources. The integrity of previously integrated models, therefore, remains intact. Moreover, the ensemble naturally addresses data-privacy problems, as it is the model, and not the data, that is shared to build the ensemble. Likewise, less memory and fewer computational resources are needed for the implementation of individual local training in contrast with combined-database models. One caveat here regards the Nelder-Mead approximation, which is not fully database-independent. This is because the Nelder-Mead algorithm uses the combined VAL datasets of the models integrating the ensemble to guide the weight optimization search. In this regard, it resembles a sort of transfer-learning approach, where partial data (i.e. the validation datasets) are used for guiding the fine-tuning of the weight combination. Similar to transfer learning, flexibility and data-privacy are therefore compromised in this case. In contrast, the target domain data (the left-out dataset) remain completely independent. Regardless, notice that the other three ensemble combination strategies analyzed in this study are not subject to such limitations. Moreover, size-proportional weighting shows the best performance results overall.
A further step in decentralized strategies research involves the study of federated learning. As mentioned above, the federated scenario hosts collaborative and distributed training without the sharing of local sensitive information. The applicability of this technique within the field of sleep medicine has been barely investigated. The only study that we know of [55] simulated a federated environment by partitioning one database (Sleep-EDF [56]) among different nodes. This approach hence violates the non-IID assumption, nor does it allow assessment of the inter-database generalization capabilities of the resulting model. To extend knowledge in this area, we have experimented on a federated scenario involving six independent and heterogeneous sleep staging databases. In addition, we have proposed a new federated learning algorithm, namely ssFedSGD, as a variant of the baseline FedSGD approach. Our first attempt was, in fact, to directly apply FedSGD in the context of our problem. However, we experienced one of the main reported limitations of this method, that is, the very large number of rounds required to achieve effective training convergence [42]. More specifically, with our described setting and available hardware resources, experimental times using this method became intractable, with estimations above six months of uninterrupted computation. We also experimented with alternative approaches aimed at solving the aforementioned FedSGD problems, in particular Federated Averaging (FedAvg) [42]. Our experiments with FedAvg, however, were also unsuccessful. In particular, we were unable to achieve a stable convergence trend during federated training. We could speculate that the unbalanced nature and non-IID properties of our experimental cohort came to the detriment of the applicability of this algorithm. More specifically, we hypothesize that because the aggregation step in FedAvg involves local model weights (instead of gradients), together with the fact that disparate amounts of local learning updates occur within the same learning round (due to the differences in size of each database), the local models quickly become misaligned in the parameter space. The literature, in fact, has reported certain limitations of this approach depending on the specific node data distribution [49], [50], [52], [57]. In this scenario, we have proposed ssFedSGD, whereby, using a fixed-length client sub-sample on each round, we enforce equal client contribution to the global learning, regardless of the specific amount of local data contained in each node. Using this approach we were able to speed up computation time per round with respect to baseline FedSGD by a factor of 17.5x, enabling tractability of the problem with the same setting and computational resources. Moreover, our ablation experiments have shown that varying the sub-sample size and learning rate has no significant effect on the generalization performance; therefore, these parameters can be conveniently fine-tuned according to clients' requirements. At the same time, we avoided the aforementioned problems related to FedAvg. Our experimental data confirm this intuition, as we show that similar inter-database generalization performance can be achieved with respect to the baseline database-centralized approach.
One final remark with regard to the federated learning scenario concerns the usage of batch normalization layers in the architecture of the underlying deep neural network model. The original version of our design, in fact, included this layer following the output of the average pooling operation in each of the three operational blocks of the CNN block [38]. In the context of the non-federated experiments (local, combined, and ensemble models) the use of this layer was able to speed up the training process by reducing the number of learning epochs necessary to achieve model convergence. The deep-learning literature has extensively discussed the general benefits of the input layer normalization provided by this method [58]. While experimental results including batch normalization are omitted due to length restrictions, it is worth mentioning that no improvement in the local or external generalization performance among these methods was observed. However, in the federated scenario, we experienced convergence problems when including this layer in our deep-learning pipeline. It is possible that the non-IID properties of our data, involving different databases, cause the weights and biases of the batch normalization layer to be affected by local offsets, which might penalize client parameter aggregation at the server. Similar problems have also been reported in the literature [46], [47], [49], [59]; however, the exact implications remain uncertain, and the debate around the inclusion of batch normalization in federated learning is considered an open area of research. In light of the experimental data, we finally excluded this layer from our final design described in Section II-D.
Further investigation is needed to better understand some of the reported effects. One interesting line of future work will be to explore additional federated learning algorithms recently proposed in the literature to address some of the described convergence problems [49], [50], [59], [60]. Future developments should also incorporate additional realistic assumptions into the experimental setting, such as time-varying and unobservable content, when designing the distributed learning approach. One possible limitation of our current work concerns the assumption of a rather static environment where clients (i.e. hospital centers) are assumed to remain stable through time. While this might be a valid assumption in the context of in-hospital PSG, it might compromise applicability in a more dynamic setting related to mobile and edge devices (e.g. sleep apps or wearables). More generally, in the context of the rapid progress taking place in the area, future work should also reassess the optimality of the baseline deep-learning model used as reference in this work. Likewise, we would also like to extend the proposed methods and experimentation to other applied domains beyond sleep staging to check the generalization of our results.

VI. CONCLUSION
With our current setting, we conclude that decentralized learning methods outperform local methods in terms of inter-database generalization, and provide similar results to centralized learning, with additional advantages concerning scalability, flexibility, and data-privacy protection. Ensemble learning with size-proportional weighting, in particular, shows a good compromise between generalization performance and simplicity of the method. The reported federated approach shows slightly improved performance and similar advantages regarding privacy preservation; however, it usually involves a more cumbersome communication infrastructure and training convergence. Further work and experiments have to be performed in order to confirm the generalization of these results.

Fig. 1. General federated learning process. The workflow is divided into t rounds of computation. The current server state (w_t) is distributed to the K clients, where the learning process is performed using their corresponding local datasets (x_k). Clients' updated local parameters (θ_{k,t}) are then communicated back to the server, where aggregation and the global model state update (w_{t+1}) take place. The process continues until the predefined number of rounds (t) has been reached.

Fig. 2. Preprocessing steps and general CNN-LSTM architecture. The process is divided into three blocks: the preprocessing step, the convolutional step (CNN block) and the time-series dependencies (LSTM block).

Fig. 3. Amount of sleep stage files used per database.

Fig. 4. Diagram of the experimental design. Figures are divided into two panels depending on the performance evaluation location: local (dashed-blue panel) and external (blue-colored panel). (a) Experiment 1: local models. Local models M(x) are built using the TR(x) and VAL(x) datasets split from database source x. The local TS(x) is used for local generalization evaluation. The remaining databases, other than x, are used for evaluation of external generalization performance. (b) Experiment 2: centralized database-combined models. TR and VAL datasets are combined using a leave-one-database-out strategy into a single, larger central dataset. The resulting combined datasets are then used to train the model C_x, while the discarded database d is used to perform external evaluation. (c) Experiment 3: ensemble models. Local models M(x) are assembled using a leave-one-database-out strategy. In each case the excluded database d is used to evaluate generalization capability. The output activations of each model integrating the ensemble, which correspond to the five sleep stages, are combined for calculation of the final ensemble classification. We explore four approaches, namely max-voting, output averaging, size-proportional weighting and Nelder-Mead weight optimization. (d) Experiment 4: federated models. As in the other experiments, each federated model is built using a leave-one-database-out strategy. Each database included in the federated model is used as an independent client, where ssFedSGD is used to perform local gradient calculations and server model integration during t rounds. The resulting federated models are tested on the corresponding left-out databases d.

Fig. 5. Ablation experiments of the proposed federated algorithm (ssFedSGD) using several parameter settings in the F_5 configuration. Results regard loss values during training, and accuracy, precision, recall, macro-F1 score and Cohen's Kappa metrics when attempting to predict the corresponding external dataset (HMC). (a) Convergence analysis using several sub-sample sizes (500, 1000, 2000 and 4000) and a fixed learning rate (0.001). (b) Convergence analysis using different learning rates (0.0001, 0.001 and 0.01) and a fixed sub-sample size (2000).

Fig. 6. Overview of the generalization performance metrics (accuracy, precision, recall, macro-F1 score and Cohen's Kappa) by explored strategy. Displayed results regard the averaged performances of each approach when predicting the full external datasets.

TABLE I LOCAL MODELS' SELF PERFORMANCE (COHEN'S KAPPA)