Explicit Modeling of Brain State Duration Using Hidden Semi Markov Models in EEG Data

We consider the detection and characterization of brain state transitions based on ongoing electroencephalography (EEG). Here, a brain state represents a specific brain dynamical regime or mode of operation that produces a characteristic quasi-stable pattern of activity at the topography, sources, or network levels. These states and their transitions over time can reflect fundamental computational properties of the brain, shaping human behavior and brain function. The hidden Markov model (HMM) has emerged as a useful tool for uncovering the hidden dynamics of brain state transitions based on observed data. However, the limitations of the Geometric distribution of states’ durations (dwell times) implicit in the standard HMM, make it sub-optimal for modeling brain states in EEG. We propose using hidden semi Markov models (HSMM), a generalization of HMM that allows modeling the brain states duration distributions explicitly. We present a Bayesian formulation of HSMM and use the variational Bayes framework to efficiently estimate the HSMM parameters, the number of brain states, and select among candidate brain state duration distributions. We assess HSMM performance against HMM on simulated data and demonstrate that the accurate modeling of state durations is paramount for making reliable inference when the task at hand requires accurate model predictions. Finally, we use actual resting-state EEG data to illustrate the benefits of the approach in practice. We demonstrate that the possibility of modeling brain state durations explicitly provides a new way for investigating the nature of the neural dynamics that generated the EEG data.


I. INTRODUCTION
There is great current interest in the identification and characterization of transiently stable and recurrent patterns of activity in resting-state EEG at topography [1], [2] The associate editor coordinating the review of this manuscript and approving it for publication was Humaira Nisar .sources [3], [4], or functional connectivity levels [5], [6].These switching patterns are often interpreted as traces of hidden brain states or modes of operation with functional and clinical relevance [2], [7], [8].Although the relationship between such ''phenomenological'' description and the causal mechanisms of brain dynamics is not clear, the described brain state switching might still reflect fundamental computational properties of the brain, which shape human behavior and brain function [9], [10].Of particular interest is the duration of the brain states (dwell-time of each brain state).A heavy-tailed duration distribution of transient interhemispheric synchronization patterns can be a sign of metastable brain dynamics [10], and longer brain state durations are found in patients of Lewy body dementia compared to Alzheimer's disease and healthy controls [11].However, current methods of inferring brain state transitions (i.e.dynamic brain state allocation [4]) in EEG are not ideally suited to model the brain state duration.This study advances the state of the art by proposing a new method that overcomes this limitation.
Current brain state allocation methods are dominated by descriptive rather than explanatory methods, including temporal sliding windows [12], [13] and adaptive segmentations using clustering [14], [15], [16], among others [8], [17].The shortcomings of these methods are well established [17], [18] and stem from the fact that the properties of brain state dynamics are not considered during brain state allocation, but are assessed post-hoc.On the contrary, explanatory methods go about brain state allocation by inverting an assumed model of the generative process by which the hidden brain state dynamics gives rise to data [5].The switching and recurrent behavior of brain state dynamics suggests using a Hidden Markov Model (HMM) to represent the generative process.Although HMM has proved to be an effective generative model for EEG brain state allocation [3], [4], [6], [10], [19], it still imposes restrictions on the type of brain state dynamics that can be modeled.In brief, the HMM generates data sequentially by using a Markov chain to model the hidden brain state transitions, while one observation is emitted by the active brain state at each time point.The Markov chain however, satisfies the first-order Markov property, which requires that, given the current brain state, the probability of a future brain state is independent of the past history of the chain.This induces a memoryless Geometric distribution (discretized exponential) of brain state durations, so that the probability of staying in the same brain state for a fix duration does not depend on the time already spent in the brain state.Moreover, the Geometric distribution places higher probability on shorter brain state durations [20], [21].Therefore, the dynamics prescribed by a HMM are characterized by short-term correlations (short memory) and fast brain state switching, which is not consistent with observed features of resting-state EEG data.
Recent findings suggest that resting-state EEG is a lot more complex typically showing multiscale dynamics.On one hand, using HMM and microstates analysis [22] to study power amplitude fluctuations and time-dependent functional connectivity, has revealed brain state switching in the 50-100 ms time-scale [2], [3], [4], [5], [15], [19], [23].On the other hand, haemodynamic correlates of this fast dynamics show spatial patterns that resemble the resting-state networks observed in functional Magnetic Resonance Imaging (fMRI), which vary in the seconds scale [1], [8], [24].The link between such disparate time-scales is hypothesized to be mediated by underlying mechanisms of dynamic instability (see [25] for a review), such as multistability [26], metastability [10] and criticality [27], which endow brain state dynamics with long-term dependencies (memory) and scale-invariance across several timescales [27], [28], [29], [30].For example, [30] found signs of monofractality (scaleinvariance) over 6 dyadic time-scales (256ms-10s range) in EEG microstate sequences.Notably, the degree of long-range dependency was destroyed when microstate durations were equalized, suggesting that important information about the nonlinear mechanisms underlying resting-state EEG is carried in the brain state switching times.The complex dynamics described above can induce brain state duration distributions which are far from the memoryless Geometric distribution predicted by HMM.
Realistic computational simulations of large-scale spontaneous brain activity have demonstrated the emergence of metastable traveling patterns of activity with heavy-tailed duration distributions [10].This result was in broad agreement with the duration distribution of phenomenological brain state obtained from resting-state MEG data [6].More generally, heavy-tailed and power-law distributions also occur in human behavioral data, such as the duration of percepts in perceptual multi-stability [31], [32], [33] and the duration of active periods during human activity [34].These distributions are consistent with persistent (longlived) brain states (''trapping'' effect) and the existence of long-term dependencies in the data.Therefore, while standard HMM seems like a natural fit to the dynamic brain state allocation, it inadequately models the temporal properties of the resting-state EEG and can potentially bias the inference towards unrealistically fast brain state switching.
The present paper proposes to extend HMM into a more flexible model called Hidden Semi Markov Model (HSMM) [35].HSMM is a generalization of HMM where the Markov assumption is relaxed to allow for explicit modeling of the brain state duration distribution.By choosing an appropriate duration distribution HSMM can naturally encapsulate prior beliefs that brain states are persistent, while at the same time allowing brain state switching and recurrence.This flexibility comes at the cost of an increased model complexity.We will show however that in many situations, the added complexity is compensated for by the explanatory power of the HSMM model.Using simulations, we investigate the sort of problems the HMM assumptions can have a negative impact on, and demonstrate how HSMM resolves these issues.We use a variational Bayes formulation of HSMM [36], which allows for coherent and efficient estimation of the parameters of the model, the hidden brain state sequence and its dimensionality (number of brain state modes).The proposed Bayesian framework also allows for selection of the appropriate brain state duration distribution based on data by means of formal Bayesian model comparison.This is demonstrated in practice using actual resting-state EEG data from 5 subjects to arbitrate between different brain state duration distributions to model the data.A MATLAB (Mathworks ®) implementation of our Brain State Dynamics (BSD) toolbox is also available ( https://github.com/daraya78/BSD) The paper is organized as follows.Sections II-A and II-B describe the technical aspects of HSMM highlighting its main differences with respect to the standard HMM.Sections II-C and II-D summarize the variational Bayes framework used to infer the parameters of the models.Sections II-E and II-F present the computational simulations and the performance evaluation metrics used to compare HSMM vs HMM.Sections II-G and II-H describe the EEG data used in this paper, its preprocessing and the methodology used for its analysis with HSMM.Finally, sections III and IV describe respectively the results obtained, and a discussion of the main findings in relation to the existing literature, as well as the limitations of the work and future research directions

II. METHODS
The proposed method is motivated by hybrid dynamical systems' (HDS) theory for multivariate time series analysis [36] applied to EEG.The principles of HSMM are summarized and interested readers are referred to Appendices and previous descriptions of these models for additional mathematical and algorithmic details.

A. HIDDEN SEMI MARKOV MODEL AS A GENERALIZATION OF HIDDEN MARKOV MODELS
HSMM and HMM are successful models of HDS.A HDS can be thought of as a continuous system with some hidden discrete logic so that it produces continuous observations while its hidden dynamics can rapidly transition or switch among a discrete set of states or modes of operation.HSMM and HMM both use two stochastic processes to capture such dual behavior, which is shown in their Dynamic Bayesian Network (DBN) representations in figure 1 (a, b).The underlying discrete process generates the hidden states of the system.This influences the continuous process that generates the observed data or emissions [21], [37].
However, HSMM and HMM differ in the way the state duration is considered.In HMM, the hidden process is a Markov chain where the state emits one observation at a time (figure 1a).In this case the state duration is implicitly captured as consecutive transitions to the same state, which induce a Geometric distribution over the state duration [21].In HSMM, the hidden process is a semi-Markov chain where the state is a ''super-state'' that emits a segment of observations of random length (figure 1b).The segment's length determines the state's duration explicitly and is assumed to be sampled from a duration distribution that can be of arbitrary form.In order to uniquely define the state duration, self-transitions between super-states are not permitted.The super-state representation does not lead to a valid DBN, because the number of observation nodes of each segment is random and then so is the structure of the DBN (figure 1b).Therefore, standard inference methods developed for DBN are not applicable [38].Instead, we use the equivalent representation in figure 1c, where the HSMM is embedded in a standard HMM [36], [38].The state-space is augmented to consider the joint process of states and their residual times (time left in the state) as a single Markov chain.The HMM embedding allows the efficient inference algorithms developed for standard HMM to be used for HSMM.In this case, the data are generated in the following way.When the chain first enters a state, the residual time is set to a duration sampled from the state's duration distribution and it then deterministically counts down to 1.At this point, the chain is free to switch to a different state, and the residual time is reset to a duration sampled from the duration distribution of the new state.The state's value is copied across every time slice spent in the state to ensure a regular structure.There are several modeling approaches to semi-Markovianity [35], [39].Our focus is on the setting where VOLUME 12, 2024 12337 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
each state is given an explicit parametric duration distribution (so-called Explicit-Duration HMM).

B. HSMM AS GENERATIVE MODEL FOR BRAIN STATE ALLOCATION
We consider the brain as a HDS that can switch between a discrete set of hidden brain states while emitting continuous EEG signals or related features.This is modeled using a HSMM (figure 1c), where y 1:T (y t ∈ R n ) denotes the n-variate time-series of EEG signals of length T at n emission channels; s = s 1:T is the sequence of hidden brain states taking values in the set S = {1, 2, . . ., K } of possible brain states at each time point (s t ∈ S); and τ = τ 1:T with (τ t ∈ D) is the sequence of residual times taking values in the set = {1, 2, . . ., D} of possible brain state durations.The HSMM is then defined by four distributions: the initial, the transition, the duration and the emission distributions of the brain states; which will be specified as follows.

1) BRAIN STATE TRANSITION DISTRIBUTION
In the present setting it is assumed that: (i) the transition to a brain state is independent of the duration of the previous brain state, (ii) the duration is only conditioned on the brain state; (iii) the observation at a given time point only depends on the active brain state; and (vi) the transition probability of each segment is constant.The joint process (s, τ ) t taking values in S × D, then evolves according to a stationary Markov chain with a transition kernel such that the probability of transition from brain state i with residual time d' at time t to brain state j with residual time d at time t+1, is where a ij are the elements of the K × K brain state transition matrix A with diagonal elements a ii = 0; p j (d) = p(d|λ (j) ) is the duration distribution of brain state j with parameters λ (j) ; and the indicator function I (x) =1 if x is true and zero otherwise.In words, equation (1) means that during a transition, the brain dynamics evolves in two stages.If the brain is in state i, and the end of the segment has not been reached (d ′ >1), then it remains in that state and decrements by 1 its residual time (d = d ′ −1).Instead, if the end of the segment has been reached (d ′ =1), then the brain probabilistically transitions to a different state j(a ii = 0) with transition probability a ij and the residual time is reset to a duration d sampled from the duration distribution p j (d) of state j.Note that the transition kernel is factorized over states and durations.

2) BRAIN STATE DURATION DISTRIBUTIONS
We model duration distributions based on Normal and Lognormal densities.Since EEG measurements are always in discrete time (encoded as positive integers) and states' durations have a pre-defined maximum duration, the duration probabilities were then obtained by discretisation and truncation of the corresponding probability density.That is, the duration probabilities of the duration distribution with probability density function g k of state k, is obtained as: Almost all probability densities can be used in this way to model the states' duration probabilities.In this paper we use a Normal density to model states' durations in the simulated data and compare its performance to using a Lognormal density in the empirical data section where ν (k) and ρ (k) −1 are respectively location and dispersion parameters of the two densities for the k − th brain state.Other options of duration distributions are available in our MATLAB (Mathworks ®) toolbox.

3) EMISSION/OBSERVATION DISTRIBUTIONS
Since our focus is on the modeling of brain state durations, without loss of generality we assume that the brain states are characterized by simple features where the observations emitted during the k-th brain state are independently sampled from a multivariate Normal distribution of dimension n with state-specific mean µ (k) and precision (k) [1], [3], [4], [40].
The conditional probability of observing y t given the brain was in state k at time t is then where

4) INITIAL DISTRIBUTION AND BOUNDARY CONDITIONS
The initial brain state distribution π = {π kd } is defined so that and represents the probability of the brain being in state k with residual time d at the beginning of the sequence.
To completely specify a HSMM, boundary conditions at the beginning and end of the observation sequence must also be defined.Commonly used simplifying conditions assume that observations start and/or end at a state segment boundary [35], [41].Here we assume the more realistic case that observations may start and end at any time during a state segment [35], [37].
12338 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

C. BAYESIAN INFERENCE OF HSMM
We followed the variational Bayes (VB) approach to HSMM in [36], and adapted it to the model specification used here.We present the main principles of VB for HSMM, and refer the reader to Appendix for mathematical details.The specification of the HSMM defines a complete likelihood function of the form where θ = θ (1:K ) and λ = λ (1:K ) denote the collection of parameters of the emission and duration distributions of all the states, respectively; M is the model class defined by the choice of the model type (HSMM or HMM) or by the choice of the number of brain states given a model type, and = {π, A, θ, λ} denotes all the parameters that specify a model within the class.HMM is completely defined by the reduced set of parameters = {π, A, θ}, because the duration distribution is not explicitly defined for that model.For notational simplicity, dependency on M in the right-handside of equation (7) and in the rest of the paper is implicit and only used when strictly necessary.
The Bayesian approach requires defining prior distributions on all the parameters.After combination with the likelihood via the Bayes rule, the joint posterior distribution of parameters and hidden states given the data is obtained, based on which statistical inference can proceed.We assume that all parameters are independent a priori so that the posterior is: where the prior distributions p(π), p(A) and p(θ) are assumed to be conjugate (see Appendix B).Exact Bayesian inference based on ( 8) is not possible due to the dependence between the parameters and the hidden states.However, the augmented state representation of HSMM allows using the efficient VB algorithm developed for HMM [36], [42].VB allows making inference on parameters, hidden states, and models by approximating the joint posterior with a simpler variational density.The usual Mean Field Approximation [43] requires the approximate posterior to factorize over subsets of parameters and states: q(s 1:T , τ 1:T , ) = q(s 1:T , τ 1:T )q( ) = q(s 1:T , τ 1:T )q(π)q(A)q(λ)q(θ) (9) Note that the structure of the HSMM implies an exact posterior that is already factorized over π, A and θ (equations ( 7) and ( 8)), therefore only the factorization over parameters and states is an actual approximation in (9).In the present paper additional factorizations were required over the parameters of the duration, and the emission distributions (see Appendix A).
The VB algorithm minimizes the Kullback-Leibler (KL) divergence [44] between the true and the approximate posterior.This is equivalent to maximizing the Negative Free Energy (NFE), a lower bound on the logarithm of the model's evidence log(p(y (1:T ) |M ))).By approximating the model's evidence, the NFE becomes a measure of model quality that trades model accuracy for model complexity and can be used for both model class selection and for monitoring the convergence of the VB algorithm.

D. 2.4 ESTIMATION OF THE NUMBER OF BRAIN STATES
The number of states is assumed to be known (i.e. a structural parameter), therefore inferring its optimal value given the data amounts to a model class selection task.In the VB framework proposed here, we used a model selection strategy based on the NFE, similar to [5].First, a model space is created by inferring multiple models with varying numbers of states.The optimal number of states is then inferred by selecting the model with maximal NFE.In practice we infer models with an increasing number of states until the NFE peaks.Additionally, during each model's inference, an automatic state ''resetting'' (ASR) was implemented within the VB algorithm, whereby at each VB iteration, all parameters of states not receiving support from the data, were reset to their prior values.The interim irrelevant states were defined as those getting negligible posterior probability at all time points after each iteration: where γ is the marginal posterior probability of being at state k at time t, marginalized over all possible durations; and ϵ is a small threshold (0.001 in our case).After convergence of VB, the remaining irrelevant states were eliminated, which effectively amounts to an automatic relevance determination of state.Note also that the ASR makes the VB algorithm more robust to falling in local minima.This is because, in subsequent VB iterations, those states reset to their priors, can account for data points which have not been well explained by the other states.

E. SIMULATION FRAMEWORK
We used computational simulations as a test ground for evaluating the performance of HSMM vs HMM for brain state allocation.The simulations pipeline is shown in figure 2. The data in all simulations were generated from a HSMM with uniform distributions over state durations.However, the HSMMs that are then fitted to this data are not provided with knowledge of this distribution, they rather assumed a Normal distribution.All simulations were based on 3 brain states, each characterized by a specific EEG topography VOLUME 12, 2024 12339 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
map and a specific data covariance matrix.The brain state topographies were obtained using a generative process, whereby a simulated source activity was projected to the sensor space using the linear EEG Forward Model [45].A realistic lead field matrix relating the source and sensor spaces was computed based on the digital brain phantom developed at the Montreal Neurological Institute (MNI) [46].The source space consisted of a mesh of G=20000 current dipoles located at the vertices of the tessellated grey/white matter interface.The sensor space consisted of the standard 128 electrodes BioSemi system, which was digitally placed on the scalp of the phantom.In each simulation experiment, the source activity image associated with each brain state was sampled from a G-variate Normal distribution with zero mean and covariance matrix J = (L T L) −1 , where L is the discrete surface Laplacian operator [47] defined on the nodes of the vertices in source space.This choice of J is consistent with the assumptions of the popular LORETA source reconstruction method [48] and ensured producing spatially smooth distributed source activity.After projection to the sensor space, the EEG map associated with each brain state were normalized to have unit norm.
The temporal dynamics of the EEG signals where generated considering the brain as a HDS using a HSMM with uniform brain state duration distributions, where at each brain state, the brain emitted a segment of observations of length sampled from a uniform duration distribution centred on a brain state-specific mean duration ⟨d⟩ and with a brain state-specific width w.During the lifetime of a brain state, state emissions were generated by sampling from a multivariate normal distribution with constant mean and diagonal covariance matrix specific to that brain state.The total length of the observed data in each realization of a simulation experiment was T =400 in all cases.The mean of the brain state emissions was the scalp-projected source activity associated with the brain state, and the covariance matrix modeled the (possibly brain state-specific) zero mean Gaussian iid observation noise.The variance of the noise was manipulated to achieve the required Signal-to-Noise Ratio (SNR) in each simulation experiment (see Appendix C for details).
Brain state transitions were designed to emulate the quasi-periodicity of empirically observed EEG microstate sequences [49].At the end of a brain state segment, a new brain state s was randomly generated based on the asymmetric transition matrix: This choice of matrix promotes cyclic brain state transitions (s1 → s2 → s3).Without loss of generality, in all simulations the system was assumed to be in the brain state 1 at the beginning of the sequence.
Since our focus is on the modeling of brain state durations, brain state maps were generated ensuring a fair degree of dissimilarity between the maps of different brain states.That way, differences in performance between the evaluated methods could be attributed to the differences in how brain state duration is modeled, rather than to inability to discriminate between similar brain state maps.

F. EVALUATION OF MODEL PERFORMANCE
The mathematical details of the evaluation scores used to assess the performance of the models are summarized in Appendix D. Following [49], the Auto Information Function (AIF) (Appendix D) was used to demonstrate that the HSMM was capable of producing features of state dynamics similar to those observed in EEG microstate sequences.The AIF of a brain state sequence measures the amount of information about the future of the brain state sequence that is contained in its own past, therefore providing an indicator of the long-term dependencies in the sequence.The Sequence Accuracy Score (SAS) (Appendix D.2) was used to compare the accuracy of HSMM vs HMM in solving the brain state allocation problem.SAS measured the similarity between the inferred and the true brain state sequences expressed in percentage, so that a value of 100% indicated identical sequences, while 0% indicated no matching between any of the elements of the two sequences.The explanatory power of a model was assessed in terms of the extent to which new state sequences predicted from the trained models, showed a brain state duration distribution similar to the one used to generate the original training dataset.The Jensen-Shannon Divergence (JSD) [50] (Appendix D) was used to measure the dissimilarity between the histograms of the brain state durations distributions of the training and the predicted sequences.The JSD is bounded between 0 (perfect similarity) and 1 (perfect dissimilarity).
Model quality measures were used to evaluate the relative performance of two given models (or model classes) based on information theoretic principles, which do not require comparison to a gold standard.Two model classes were compared based on the Log-Bayes Factor logBF(M 1 , M 0 ) [51] (Appendix D.4) that measures the evidence provided by the data in favor of the model class M 1 (e.g.HSMM) against M 0 (e.g.HMM).Models within the same model class were also evaluated based on their Discriminative Power Score DSP(H 0 , H 1 ) [52] (Appendix D.5).DSP is a directed divergence that measures the ease of discriminating between two models (H 0 vs H 1 ) of the same model class (e.g. two HSMMs with different set of parameters 0 and 1 ) given an observed data sequence.This divergence is always positive and is zero if the two models are the same ( 0 = 1 ).

G. EMPIRICAL DATA ACQUISITION AND ANALYSIS
EEG data collection was approved by the ethics committee of the University of Manchester, UK (ref.14374) and was conducted according to the declaration of Helsinki.Ten minutes of spontaneous task-free EEG were recorded from 12340 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.five healthy subjects who were asked to sit comfortably with eyes open during the experimental session.The EEG data were recorded from 64 scalp electrodes using the ActiveTwo system (BioSemi, Amsterdam, Netherlands) and Actiview® acquisition software (Biosemi, Amsterdam, Netherlands) at a sampling rate of 2048 Hz.

1) DATA PRE-PROCESSING
For consistent comparisons, we followed a similar pre-processing pipeline as in [3], but slightly modified to the case of EEG data.The pipeline was implemented in MATLAB (Mathworks ®) as follows.Channels and periods of data containing obvious artefacts were visually identified and discarded (5%).The signals were band-pass filtered between 0.1 Hz and 120 Hz using a high-pass and low-pass FIR filter in sequence; and were subsequently de-trended and down-sampled to 512 Hz.Artefact detection and correction was carried out by decomposing the data into 30 temporally independent components using independent component analysis.Stereotyped components related to ocular, cardiac, motion, and mains interference artefacts were classified and rejected based on their topographic, temporal and spectral features.The remaining components were then used to reconstruct the clean data.Bad channels were interpolated using spline interpolation and the data was subsequently re-referenced to average reference for further analysis.

2) DATA PREPARATION FOR HSMM ANALYSIS
Following previous studies, the HSMM was applied to the power envelopes of the EEG signals [1], [3], [40].The pre-processed data was first band pass (FIR) filtered between 4 and 30 Hz.The amplitude envelope of the oscillatory activity at each channel was then derived by computing the magnitude of the analytic signal, obtained from the Hilbert transformed data.For computational efficiency, the envelope amplitudes were down-sampled to 40 Hz using a polyphase anti-aliasing filter as implemented in the MATLAB signal processing toolbox.Principal Component Analysis (PCA) was then used on the normalized envelopes (zero mean and unit standard deviation) for whitening and dimension reduction to keep the first 40 principal components (99.21% explained variance).

3) HSMM SETUP
Two HSMMs were inferred using either a Normal or a Lognormal duration distribution, and maximum brain state duration of 5 s.In all models, the brain states were assumed to emit observations using each a 40-variate Normal distribution (the first 40 principal components of the envelope data), specified by their corresponding mean vector and covariance matrix.
Group level parameters were inferred by sharing parameters across subjects [21], [42] by assuming that the observed EEG sequence of different subjects are independent realizations of the same hidden semi Markov process.This should be contrasted with other works where subjects' data sequences are concatenated in time and a single HMM model is fit to the full concatenated data sequence [1].The parameter sharing approach avoids introducing potential artificial state switches at the boundaries of concatenated data, as well as allows accommodating boundary conditions for each subjects' data sequence.VOLUME 12, 2024 12341 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

4) BRAIN STATE TOPOGRAPHIC MAPS
Due to the PCA dimensionality reduction, the means and covariance matrices characterising the brain states are not straightforwardly interpretable since they are estimated in the low dimensional space spanned by the principal components.To obtain interpretable scalp activity maps associated with each brain state, we followed a procedure similar to [3].The partial correlation of the estimated brain states time courses with the amplitude envelope data at all the electrodes, was computed.brain state time course correspond a binary vector indicating whether the brain states is on or off at each point of the most probable brain state sequence, which was estimated using the Viterbi algorithm for HSMM [37].The obtained correlation maps represent the unique contribution of each brain state to the oscillatory activity power, while components of the envelope signals that are common across brain states are minimized.
The partial correlation was computed using a General Linear Model (GLM) analysis [53] with the brain state time course as predictors and the full-rank envelope data (prior to dimension reduction and whitening) as response variables.Given the brain state time course ŝ1:T , each element of the [T × K ] GLM's matrix of predictors X is The columns of X and the envelope data were both z-scored prior to GLM analysis.The K vectors of estimated regression coefficients defined scalp spatial maps representing estimates of the partial correlation coefficient between each brain state and the envelope data.To facilitate interpretation, brain states from different models or model classes were assigned the same label if they had similar topographic maps.

H. INITIALISATION OF THE VB ALGORITHM
To account for dependence of the inference on the initial conditions (e.g.convergence to local minima), the HSMM and the HMM VB algorithms were ran 10 times each using different starting points.Each starting point was determined using k-means to cluster all data points and defining an initial brain state sequence by assigning each data point to the closest cluster.The k-means algorithm was itself initialized 3 times using random data points as initial centroid positions and the solution producing the most compact clusters was chosen as the optimal one.Finally, the sample estimates of the models' parameters obtained based on the obtained k-means brain state sequence were used to initialize the VB iterations.The optimal VB initialisation was determined as the one with highest NFE after 10 iterations of the algorithm.In practice, this strategy produced identical results as compared to running the VB fully until convergence for every starting point and then choosing the solution with highest NFE; but it was cheaper computationally.

A. NON-GEOMETRIC BRAIN STATE DURATIONS CAN GENERATE ''SENSIBLE'' BRAIN STATE DYNAMICS
We first demonstrated that the brain state sequences underlying the simulated data showed dynamical properties of empirically observed EEG microstates sequences.We focused primarily on the relatively long but finite memory (temporal dependencies) demonstrated via distinct periodicities of the AIF of microstate sequences, with a tail that converges asymptotically to Markovian memoryless in the long run [54].Figure 3 shows the AIF of brain state sequences generated by HSMM with Uniform durations and the corresponding HMM predictions; as functions of the mean (⟨d⟩ in figure 3a,b) and width (w in figure 3c,d) of the generating duration distributions.The HSMM used 3 brain states with identical duration distributions and the transition matrix defined in section II-E.The HMM predictions were generated using the empirical transition matrix computed based on the matching HSMM sequence.The AIF of HSMM showed oscillations with amplitude decaying slowly with the time lag.This behavior corresponds to a process with memory (time dependencies), but Markovian ''memoryless'' in the long run; which is consistent with empirical results of EEG microstates analysis [49].Both the frequency and amplitude of the AIF oscillations depended on the brain states' mean duration, with higher mean durations corresponding to slower but stronger oscillations (figure 3a).Changes in the width of the duration distribution however, only affected the amplitude of the AIF oscillations, so that increased widths were associated with reduced amplitudes (figure 3c).The AIF of the predicted HMM sequences showed monotonic fast decays with the time lag in all cases.The decay rate was higher for smaller mean duration values (figure 3b), while it was insensitive to changes in the width of the duration distribution (figure 3d).

B. MODELING BRAIN STATE DURATIONS IMPROVES GENERALIZATION POWER
The generalization power of any model is underpinned by the accuracy of its predictions.A simulation study was then carried out to investigate whether the correct modeling of the brain state duration distribution had an important effect on the ability of HSMM vs HMM to make accurate outof-sample predictions.For this analysis, synthetic EEG data sequences of length 4000 a.u.were first generated based on HSMMs with different uniform brain state durations but identical transition matrix as in previous section.The duration distribution used to generate each dataset was varied by changing the mean duration of a target brain state (20-140 a.u. with 20 a.u.intervals), while the mean of the other two brain states was fixed (50 a.u.).All brain state duration distributions had identical width (6 a.u.).Two SNR levels (30% and 60%) were also used, for a total of 8 simulated EEG datasets (5 mean durations x 2 SNR levels).Inference on each dataset was carried out by training models with two different 12342 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.The periodicities and the asymptotic decay in the case of the HSMM resemble those of the empirically observed EEG microstates, as illustrated by [49].The AIF is measured in natural units of information (nats).
brain state duration distributions, a HSMM with Normal distribution for brain state durations and a HMM with an implicit Geometric distribution for durations.After training, new (predicted) brain state sequences were generated from each of the trained models and the resultant histograms of durations of the target brain state were compared to the true duration distribution that generated the training dataset.
Figure 4 shows the true and predicted histograms for one exemplary dataset.As expected, the Geometric distribution implicit in HMM allocated nonzero probability mass across the whole range of possible duration values with higher probability at shorter durations.On the contrary, the explicit Normal duration distribution in HSMM concentrated its probability mass in the correct range of duration values.
Plots of the JSD between the true and predicted histograms of each model are depicted in figure 5 as a function of the brain state mean duration and SNR of the training dataset.HSMM was consistently more accurate and noise-robust at regenerating the true duration distributions, as evidenced by the lower JSD values across all simulations.Notably, HMM reproducibility deteriorated as mean duration increased.This primarily reflects the effect of the unrealistic Geometric duration model implicit in HMM.That is, underlying duration distributions with longer mean durations have to be captured more widespread duration histograms when used to generate new sequences after training.A Geometric distribution that encourages short brain state durations may also lead to less accurate estimation of HMM parameters in the training phase on data produced by long brain state durations.That is, given a fixed length of data, a longer brain state duration translates into a lower number of state transitions and therefore less data is available for a reliable estimation of transition probabilities.In this situation, the ability of HSMM to model brain state durations helps to regularize and stabilize the inference.No apparent effect of the SNR was found for any of the two models.These results demonstrate that, beyond good data fitting, an accurate modeling of the brain state durations is important when prediction and interpretation is the goal.In this case, using HMM to model a system with non-Geometric durations (as seems to be the case with the brain) could lead to wrong conclusions.In the next sections we investigated how this problem can affect the performance of HSMM vs HMM, in some typical EEG applications.

C. BRAIN STATE DECODING UNDER DATA UNCERTAINTY
Common applications of EEG brain state allocation involve brain state decoding.This is relevant e.g. for developing systems for Brain Computer Interfaces, online EEG monitoring of patients in intensive care units, identification of epileptic seizures, sleep monitoring and study of memory retrieval in movie viewing, among others.Typically, a model is first trained on a dataset representative of the dynamics of interest; and then the trained model is used for decoding the brain states in a new dataset in an unsupervised fashion.In such a situation, accurate predictions are important for the model-based monitoring system to perform robustly in a noisy environment.We tested the impact of incorrectly modeling brain state durations, on the brain state decoding task.For this, the accuracy of HSMM (with Normal durations) and HMM (Geometric durations) was evaluated when used to decode data generated from a system with uniformly distributed brain state durations.Given a previously trained HSMM or HMM, with estimated parameters , and a new observation sequence y (1:T ) (test dataset), we were interested in the most likely brain state sequence that produced the new observations.This is solved using the well-known Viterbi algorithm [37].In this simulation experiment, the same model was used to generate the training and test datasets, but the test dataset was noisier.This mimics a situation in which brain state monitoring is performed under uncontrolled observation noise fluctuations.Two sets of training and test SNR values and 100 pairs of training and test datasets in each case were simulated.All parameters of the simulation setup were the same as in the previous section, except for the mean duration of the two fixed brain states, which was 10 a.u. in this case.
As expected, using the more suitable Normal duration model resulted in a better brain state decoding performance.Figure 6 shows the SAS for HSMM with Normal durations and HMM as a function of the mean duration of the target brain state, and for different SNR values of the training and the test datasets.In all cases, HSMM showed higher accuracy than HMM.For increased difference between the SNR of the training and the test datasets, the accuracy of the HMM reduced significantly; while HSMM remained stable with a SAS above 90% in almost all cases.Interestingly, in the case of greatest reduction in data quality (figure 6 lower panel) the accuracy of the HSMM increased for increasing values of the mean brain state duration, while the HMM deteriorated.

D. CLASSIFICATION OF BRAIN STATE TEMPORAL STRUCTURES
It has been shown that the presence (or absence) of specific brain states during rest or task conditions can have cognitive or clinical relevance, because a brain state, as defined here, could represent short lasting but meaningful modes of brain operation [55].Similarly, the temporal structure of brain state sequences, might contain additional information, representing ''phenotypes'' of longer lasting experimental, 12344 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.clinical or environmental conditions, for example anesthesia, sleep or effect of drugs.Therefore, HSMM and HMM can enable ''condition phenotyping'', via model-based classification of data streams (e.g.task conditions or clinical groups) corresponding to brain state sequences with distinctive temporal structures.
The temporal structure of the brain state sequence refers to the brain states temporal order or brain state ''syntax'' (given by the transition probabilities), and the dwell time in each brain state (given by the duration distributions).We focussed on evaluating the impact of modeling brain state durations on condition phenotyping.That is, given the generative assumptions (a specific duration model), which condition is most likely to be associated with previously unseen data.That is, given previously trained HSMMs (or HMMs) each representing one condition, with estimated parameters ˆ 1:C (one for each condition c = 1, . . ., C), and given a new sequence of observations y 1:T , we are interested in the probabilities p(y 1:T | ˆ c , M ) that the new data were generated by each of the models.These probabilities can be used for model-based Naïve Bayes classification [56] to allocate the new data to the most probable model (condition).HSMM and HMM were then compared in terms of their ability to classify a test data sequence as belonging to the correct condition among a set of candidate conditions.The conditions differed in the mean duration of the underlying brain states.This experiment mimics a situation in which the models are used as e.g.diagnostic tools by allocating a new patient to one among a set of clinical conditions.Each clinical condition is characterized by a phenotypical alteration of the mean time that the brain spends in each brain state.We used a semi-supervised setting for classification, where labeled data were first used to train different models, with each label denoting a specific condition.The new unlabeled test data sequence was then allocated by assigning the label corresponding to the model with highest probability of having generated the test data.The training set consisted of N =11 simulated data sequences (length T=4000 a.u.), one for each condition, and their corresponding labels.All conditions were simulated based on HSMMs with the same transition matrix as in previous sections and Uniform duration distributions for the 3 brain states.The mean duration of all brain states within each condition was the same, but varied across conditions from 2 a.u. to 22 a.u.(step-size 2), which determined each condition's label.The unlabeled test sequence was generated from the model corresponding to condition 10.The widths of all brain state duration distributions in the training and the test data were fixed to 6 a.u.A previously unseen data sequence (with mean brain states duration of 10 a.u) is classified as being generated by one of 11 HSMM and HMM models which were trained on data that differed in the mean duration of the underlying brain states, from 2-22 a.u. in steps of 2. Each point on the curves is a measure of the discrimination of the hypotheses H0 that the test data was generated by its true model (⟨d⟩=10) against the hypothesis H1 that it was generated by the candidate model with duration on the x-axis.A value of DPS=0 indicates no discrimination between the two hypotheses, that is the candidate model is very close to the truth.The shadowed areas represent the 95% confidence interval for discrimination.HSMM with Normal durations is able to allocate the test sequence to the correct model, whereas HMM has no discrimination power.
HSMM (with Normal durations) and HMM (Geometric durations) were used for training and classification in 100 repetitions of the experiment as described above.Figure 7 shows the ability of each model to discriminate between the test sequence and each condition, measured in terms of the DPS.Each point of the curve represents the average information per test data sample for discrimination of the hypothesis H 0 that the test sequence was generated by the correct model (mean brain state duration 10 a.u.) against the hypothesis H 1 that it was generated by the candidate condition model (mean duration in the x-axis).The zero value represents no ability to discriminate between the two hypotheses and therefore indicates the condition the test sequence is allocated to.As expected, using the more suitable Normal duration model resulted in better classification performance.Discrimination based on HSMM allowed for allocating the test sequence to the condition associated with the correct mean brain state duration.In the case of HMM, there was not enough information in the test data for discrimination between the two hypotheses in any of the conditions, except for the one with the shortest mean brain state duration (2 a.u.).That is, classification based on HMM would allocate the test data sequence to any of the conditions with equal probability except for the case of shortest mean brain state duration (high sensitivity but low specificity).This is consistent with the inability of the implicit Geometric duration distribution to model long durations accurately.

E. RESTING-STATE EEG: LIGHT-TAILED VS HEAVY-TAILED DURATION DISTRIBUTIONS
We have so far shown that accurate modeling of brain state durations is paramount for a meaningful brain state allocation.A key question then is which duration distribution should we use?The choice of the duration Probability Density Function (PDF) will ultimately define the kind of brain state dynamics that the model can explain and therefore it effectively imposes constraints on the temporal properties of the brain state process.For example, a light-tailed ''bellshaped'' PDF is typical of a process with a characteristic timescale, while a heavy-tailed right skewed PDF corresponds to a process with long-term dependencies and possibly temporal scaling law [28], [29], [57].A Geometric (or Exponential) distribution in turn describes a memoryless process, and separates heavy-tailed from light-tailed distributions.That is, heavy-tailed (light-tailed) distributions are those whose tail decays slower (faster) than the Geometric distribution.We refer the reader to the last section for a more detailed discussion on the relationship between the shape of the distribution and the types of dynamics of the underlying system.
We use real resting state EEG data to select among two alternative duration models.Specifically, we compare using HSMM with Lognormal (heavy-tailed and skewed) vs Normal (light-tailed and symmetric) duration distributions to model the fluctuations of the EEG amplitude's envelopes.
First, we investigated whether there is evidence in the data supporting one duration model vs the other.Two HSMMs, each using one of the duration distributions with maximum duration D=5000 ms, were trained on the actual data as described in section II-G.In the two cases, 5 brain states were automatically inferred from data using model comparison.Figure 8 shows the convergence curve of the NFE for the two optimal models with 5 brain states.The HSMM with Lognormal durations converged faster and showed higher NFE with a log-Bayes Factor of 400 in its favor after convergence.The partial correlation maps associated with each brain state are shown in figure 9. Based on the pair-wise similarity between the maps of the two duration models, the brain states were relabeled and matched based on these maps by simple visual inspection.When analyzing the EEG topographic maps brain state 4 and brain state 5, it is notable that both maps exhibit identical activation patterns, but with inverse polarities in all corresponding areas.This symmetry and inversion of polarities suggest a possible functional operation with a complementary relationship.When comparing the EEG topographic maps associated with normal duration and lognormal duration models, notable differences in brain activation patterns can be observed.The most prominent discrepancies are found in maps brain state 4 and brain state 5.In the case of the normal duration model, the band that extends horizontally from the left temporal area to the right temporal area, crossing the central region of the brain, is interrupted at the center of the head.In addition, it can be seen that the empirical durations of brain state 2 show asymmetric heavy-tailed histograms in the two cases, which do not follow a Normal distribution, whereas the Lognormal distribution shows an accurate fit.

IV. DISCUSSION
We have proposed Hidden Semi-Markov Model (HSMM) as a generative model for brain state allocation in EEG, which allows for the explicit modeling of the brain state duration distribution.The added flexibility circumvents a well-known limitation of the standard HMM where the implicit distribution of brain state duration is Geometric.The Geometric distribution places more probability mass at shorter than at longer brain state duration, making HMM intrinsically biased towards fast brain state switching.A more fundamental problem is that the Geometric distribution is memoryless.This property implies that the probability of waiting time until the next brain state transition is independent of the time already spent in the current brain state.For a model to describe such behavior, it must continuously forget the state the system is currently in (hence memoryless).A Geometric duration is then not consistent with the long-range temporal dependency (''long memory'') and near scale-invariance (self-similarity) ubiquitous in real EEG data [29], [57], [58], [59], [60].This poses limitations on using HMM for modeling EEG data accurately [49], as demonstrated in the model generalization study (figure 3).More importantly, we showed that modeling brain state duration distributions explicitly with HSMM allows capturing EEG dynamical features that are inconsistent with the Geometric duration implicit in HMM.This qualifies HSMM as a more suitable generative model for brain state allocation when solving problems where prediction accuracy is important.
Another usually overlooked limitation of HMM is that it places a small penalty on adding a new state, and no penalty on the new state having similar features to another state [20].Consequently, using model comparison to infer the number of brain states based on EEG data, where persistent brain states are expected, can lead to the creation of unnecessary extra states and unrealistically fast switching, in order to reproduce the long brain state durations.Indeed, prior EEG studies using HMM for brain state allocation noted a problematic increase in NFE with an increasing number of states, rendering estimation of the number of states through NFE maximization impractical [40].This issue is crucial as extra states may produce unreliable parameter estimates and pose a risk of overfitting due to insufficient data per state.To enhance inference stability, HMM-based brain state allocation has been commonly applied to extensive datasets with hundreds of subjects [61].Nevertheless, this approach fails to address the challenge of optimal state number selection.The results in this study demonstrate the successful application of the proposed HSMM framework for analyzing EEG data from a limited number of subjects, while still allowing estimation of the optimal number of brain states through NFE maximization.
We note that the potentially redundant brain states and fast switching in HMM is not necessarily a problem if model averaging is the goal [4].However, if interpretation and prediction is important, one would like to incorporate prior knowledge that a ''slow'' switching dynamics is more likely.HMM does not incorporate this kind of prior, while in HSMM this is done explicitly by choosing a suitable duration distribution to model the data.We therefore compared the performance of these model in two tasks which rely on accurate model predictions.The first task tested whether an accurate modeling of brain state durations helped brain state allocation based on data with increased external noise.The (Bayesian) rationale is that if the data is unreliable, the accuracy and robustness of the brain state decoding is determined by the ability of the model to represent the actual dynamics of the system producing the observations.We found that HSMM with Normal brain state durations allowed robust and accurate brain state allocation for different data uncertainty levels and different mean brain state durations of the underlying system.This is because the Normal duration used in HSMM was a good approximation of the true brain state duration distribution.In contrast, the Geometric brain state durations in HMM led to non-robust brain state inference dominated by the noise fluctuations.The second task involved identifying data segments whose underlying state sequence differed in their temporal structure (states' syntax and durations), rather than the individual states' emission parameters.We argue that brain state sequences with a characteristic temporal structure could serve as phenotypes of complex mental processes, which live in a time-scale beyond that of an individual brain state.Similar concepts have been used to interpret both task-related [62], [63] and resting-state [2], [61] activity.The target phenotype to be identified differed from the rest of the candidate phenotypes only in the mean duration of their component brain states.This task tested whether the implicit Geometric distribution in HMM was ''good enough'' to discriminate between phenotypes, or if the more accurate modeling of the state duration in HSMM was required.Our results showed that, HSMM allowed the correct identification of the target phenotype, whereas HMM could not discriminate between the target and any of the candidate phenotypes.It can be argued that in the identification of these long-lived complex mental processes, the temporal structure of the underlying brain state sequence might play an even more important role than in the identification of individual short-lived brain states, and therefore HSMM is recommended over HMM.
We argue that the choice of duration PDF determines the temporal properties of the brain state process, which will ultimately impose constraints on the kind of dynamics that the model can explain.In other words, the choice of duration PDF can be used to represent our beliefs about specific nonlinear dynamical mechanisms (instabilities) generating the data, such as multi-stability, meta-stability or criticality (see [27]).In a multi-stable system driven by noise, the system jumps erratically between different attractors where it gets briefly trapped in each basin of attraction.Consequently, the time series produced by a multi-stable system shows a relatively heavy-tailed (stretched exponential) dwell distribution.In a meta-stable system, there are no attractors, but rather a sequence of linked weakly unstable fixed points, such that the system dwells in the neighborhood of each, but does not show trapping.In this case, the distribution of sequential dwell times show a unimodal but highly skewed heavy-tailed distribution characteristic of long term dependencies.In a critical system, a single fixed point is very weakly attracting or neutral, so that the system noise leads to long and unstructured excursions corresponding to scale-free fluctuations and corresponding power-law statistics (e.g.power-law duration distributions).We can therefore use the explicit brain state duration PDF in HSMM to express alternative hypotheses of temporal dependency and scaling-law of the brain state sequence, and select the best distribution given the EEG data in a principled way via Bayesian Model Selection.This approach can help bridge between the observed brain state data features and the associated underlying neural mechanisms, which would allow interrogating the data about such mechanisms.To illustrate the approach we used actual resting-state EEG data to select between a heavytailed (Lognormal) and a light-tailed (Normal) brain state duration PDF for brain state allocation using HSMM.Model comparison demonstrated evidence in favor of the Lognormal PDF.
The results in figure 10 also showed that the histogram of brain state durations obtained by fitting a HSMM with Normal brain state durations did not actually conform to a Normal distribution.This indicates a ''mismodeling'' of brain state durations by the Normal model.Additionally, the results in figure 9 showed differences in the brain maps inferred based on the two models, which indicates differences in the identification of the active brain state at each time point.As both models have the same emission model, then these differences arise from the type of duration model assumed.These findings suggest that, in certain time periods, the information derived from the emission model may be ambiguous and not allow for accurate identification of brain states.In these cases, having an appropriate duration model becomes relevant.
Finally, our results suggest that brain states' fractional occupancy (FO), a widely used post-hoc summary statistic [3], [6], [61], is difficult to interpret because it conflates brain state duration and transition probabilities.A clear advantage of the proposed HSMM is that it allows dissambiguating the individual contribution of state duration transitions probabilities to FO because the two parameters are modelled explicitly and independent of each other.This implies that experimental or clinical conditions linked to changes in brain state durations, can be distinguished from those associated with changes in transition probabilities between brain states.In contrast, in HMM, this is unfeasible due to the confounding of brain state transitions and durations.
The framework proposed here enables several lines of work in the study of brain function and related translational applications.Future research may delve into refining the subtle aspects of brain state modeling introduced by HSMM.Of particular interest, investigating different forms of the explicit brain state duration distribution within HSMM could be used to express hypotheses concerning non-linear dynamical mechanisms, such as multistability, meta-stability, or criticality inherent in neural processes.In this context, the proposed application of Bayesian model comparison for selecting optimal duration distributions presents an avenue for principled exploration, linking observed EEG data features with underlying neural mechanisms.Coupling this approach with integrative research exploring the extension of the HSMM to other functional imaging modalities (e.g.MEG, fMRI, fNIRS) or behavioral data, could deepen our understanding of brain function.Additionally, evaluation of the proposed approach for multi-subject inference and addressing challenges associated with extensive data collection could also enhance its utility in large-scale studies.Future translational research may focus on assessing the clinical relevance of the proposed framework, particularly in identifying aberrant brain state dynamics associated with neurological disorders.In this context, novel therapeutic interventions may be developed by adapting HSMM to real-time EEG-based applications such as brain-computer interfaces and neurofeedback systems.

V. CONCLUSION
The present study introduced and extensively evaluated the Hidden Semi-Markov Model as a robust and flexible generative model for brain state allocation in EEG data.By explicitly modeling the distribution of brain state durations, HSMM overcomes limitations of the widely used (conventional) Hidden Markov Model, particularly its inherent reliance on a Geometric distribution for representing brain state durations.The findings presented herein highlight the advantages of HSMM in capturing the nuanced dynamics of EEG data, especially in scenarios demanding accurate predictions and meaningful interpretations, indicating its potential significance in neuroscientific and clinical research.Additionally, the demonstrated superiority of HSMM in tasks involving increased external noise and the identification of complex mental processes suggests its potential utility in real-world applications.In summary, the introduced HSMM framework is a promising avenue for advancing EEG-based brain state modeling, and future research endeavours can build upon this foundation to address emerging challenges and extend its applicability to diverse domains.
(Nelson J. Trujillo-Barreto and David Araya Galvez are co-first authors.)

APPENDIX A MEAN FIELD APPROXIMATION AND NEGATIVE FREE ENERGY
The mathematics of the VB framework for HSMM is described in detail in [36].The mean-field approximation used here has the following form: q(s, τ , π , A, θ, λ) = q(s, d, τ )q(π) K k=1 q(a k: ) • q(µ)q( )q(δ)q(ρ) (12) where, in addition to the factorization between hidden states and parameters (see main text) we have assumed factorization over the parameters of both the duration and the emission distributions; as well as over the rows of the transition matrix [42].This factorization induces the following form of the NFE: NFE(q(s, τ , )) = −KL(q(π)∥p(π)) KL(q(ρ (k) )∥p(ρ (k) ) + log(C(s 1:T ))) (13) where KL(q(x)||p(x)) denotes the Kullback-Leibler (KL) divergence between the approximate variational posterior q(x) and the prior p(x); and C(s 1:T ) is a normalization constant that can be computed after each VB-E step [36].Note that due to conjugacy of the priors, all KL divergences are between known standard distributions and can be computed analytically.We refrain from including the KL expressions here, since these are standard results available somewhere else.

APPENDIX B PRIORS AND POSTERIOR UPDATES FOR ALL PARAMETERS
This section presents the update equations of the VB-M-Step.It uses the following marginal statistics obtained from the Forward-Backward algorithm implemented in the VB-E step: • The joint posterior probability of the joint process (s t , τ t ) being at (i,d) at time t − 1 and at (j, d ′ ) at time t: • The posterior probability of the joint process (s t , τ t ) being at (i,d) at time t: Given the structure of the HSMM presented here, these statistics can usually be reduced respectively to ξ t (i, 1, j, d), the posterior probability of a transition occurring at time t; and γ t (i) = q(s t = i) = D d=1 γ t (i, d), the posterior probability of state i being active at time t.

1) MULTIVARIATE NORMAL EMISSION MODELS
The signal emitted from the k-th brain state is modeled as a n-variate Normal distribution with mean µ (k) and precision matrix (k) .

p(y t |s
For the simulation experiments, the data emitted from each state was assumed to be independent over channels with precision matrix (k)  = diag(σ (k) ), where σ (k) is the n × 1 vector of precisions (inverse variances) of each channel's VOLUME 12, 2024 12349 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
data.For the empirical data analysis a full precision matrix was assumed.
• Prior distribution for means of emissions For both the simulations and the actual data we assumed that all brain state means are independent across states and i.i.d a priori with identical conjugate weakly informative priors where 0 n × 1 is a n×1 vector of zeros, I n is the identity matrix of size n × n.We used σ 0 = 0.1 for the simulations and σ 0 = 0.001 for the empirical data.
• Prior distribution for precision of emissions For the emission model used in simulations, precisions were assumed to be independent over states and channels a priori, with identical conjugate Gamma priors and weakly informative shape and scale parameters a and b, respectively where a 0 = 0.001 and b 0 = 1000.For the full precision model used in the actual data, precision matrices are independent over states a priori with identical conjugate n-dimensional Wishart distributions with degrees of freedom a and scale matrix B p( |a 0 , B 0 ) where a 0 = n and B 0 = nI n • Posterior updates for means of emissions Due to conjugacy of the priors, the variational densities of the states' means of the two emission models have both Gaussian forms.For the independent emission model the posterior is also independent over channels: with posterior mean and precisions given by where â(k) i and b(k) i are the shape and scale parameters of the approximate posterior of the precisions, associated with emissions from state k (see below) at channel i; and N (k) is the expected state's counts (i.e. the expected number of state's visits) In the case of the emission model with full precisions, the variational posterior for the means are independent over states so that with posterior parameters • Posterior updates for precisions of emissions Again, conjugacy of the priors ensures that the variational posteriors have the same functional form as the prior.In the case of the independent observation model, the variational posterior of the precisions factorizes over states and channels so that with posterior shape and scale parameters In the case of the emission model with full precision matrices, the variational posterior of the precisions factorizes over states with state-specific Wishart distributions so that with posterior degrees of freedom and scale parameters Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

2) DURATION MODELS
• Normal duration model We impose independent conjugate and weakly informative prior distributions on the mean and the precision of the normal duration distribution of each state.
-Prior distribution for the parameters Independent Normal prior distributions with identical mean ν 0 and precision η 0 were assumed on the mean duration of each state so that the full prior factorizes over states: where ν 0 = 1 (ν 0 = 100 for the empirical data) and η 0 =10 −5 .In the same way, independent weakly informative Gamma prior distributions with identical shape and scale parameters ((u 0 and v 0 )) were assumed for the precision of the duration of each state where u 0 = 0.001 and v 0 = 1000, in order to achieve non-informative priors.-Posterior updates for the parameters Conjugacy and independence of the priors induce variational posterior densities of the same functional form as the priors, which are also factorized over states, so that where the posterior mean and precision of the duration of state k are given by Similarly, the variational densities of the precision of the durations have the same functional form as their priors and also factorize over states, so that where the posterior shape and scale parameters of the posterior Gamma distribution associated with state k are given by In the above expressions, γ • Lognormal duration model Similar to the Normal case, we impose independent conjugate and weakly informative prior distributions on the parameters of the Lognormal distribution associated with each state.Given that the conjugate priors for the parameters of the Lognormal distribution are the same as for the Normal distribution, the expressions for the priors and their variational posteriors are the same as for the Normal case, but replacing the mean and precision of the Normal model with the first and second parameters, respectively of the Lognormal model.In this case, the values of the parameters of the priors (hyperparameters) were ν 0 = 4.5, η 0 = 0.2, u 0 = 0.001 and υ 0 = 1000

3) TRANSITION MODEL
From the definition of HSMM, the transitions between state segments are governed by a Markov process without selftransitions.Therefore, the transition distribution between states is Multinomial.We then use the usual conjugate priors and assumptions as in the standard HMM [42].
• Prior distribution for the transition probabilities Using standard results from the Bayesian inference of HMM, we assume that the rows a i: of the transition matrix are independent and identically distributed a priori.We then choose the same conjugate non-informative prior distribution for all a i: , which is a symmetric non-informative Dirichlet distribution with concentration parameter α 0 and zero mass on self-transitions, Dir(a i: ; α 0 ) (43) where a i: = (a i1 , . . ., a ii−1 , a ii+1 , . . ., a iK ) is the i-th row of the transition matrix A without its i-th element; and α 0 = α 0 1 1×(K −1) with α 0 > 0. A non-informative prior is achieved by using α 0 = 1.
• Posterior updates for the transition probabilities Due to the conjugacy of the prior, the variational posterior of all the rows are Dirichlet, so that Dir(a i: ; α) Similar to the transition distribution, the initial probability distribution π is also Multinomial and therefore can be treated in a similar way as one row of the transition matrix.
• Prior distribution for the initial probabilities In the general case (no simplifying initial condition) a conjugate symmetrical Dirichlet prior can be imposed on the initial joint state (s 1 , τ 1 ) as

APPENDIX C DEFINITION OF SIGNAL-TO-NOISE RATIO
For a given data sequence, the SNR was defined as the percentage of noise-free signal energy present in the data: For iid Gaussian noise in each channel and normalized brain state maps, it can be demonstrated that where N s 2 is the unbiased sample variance estimate and n and T are the number of channels and the length of the data sequence respectively.This expression was then used to compute the simulated noise variance required to achieve a selected SNR value.

APPENDIX D EVALUATION SCORES 1) AUTO INFORMATION FUNCTION
The Auto Information Function (AIF) is analogous to the autocorrelation function, but for symbolic sequences such as the brain states sequences.It measures the dependence between different time points with a given time lag l.This is done by measuring the Kullback-Leibler divergence between the symbol distributions at time t and t+l.That is, the AIF for time lag l is defined as: This is the difference between the entropies H of the distributions p(s t+l ) and p(s t+l |s t ).For a stationary Markov chain the AIF can be obtained analytically [54].

2) SEQUENCE ACCURACY SCORE
Denoting the brain state decoded at time t as ŝt and the true one as st , the Sequence Accuracy Score (SAS) between the sequences ŝ1:T and s1:T of equal length T is defined as where δ(•, •) is the Kronecker's delta function and HD(x, y) denotes the Hamming distance between sequences x and y [64].Given two sequences of equal length, the Hamming Distance is an edit distance defined as the minimum number of substitutions required to change one sequence into the other.A SAS of 100% means the two sequences are identical (no substitutions needed), while 0% indicates no matching between any of the elements of the two sequences (all elements need to be substituted).

3) JENSEN-SHANNON DIVERGENCE
In probability theory, the Jansen-Shannon Divergence (JSD) [50] is used to measure the similarity or dissimilarity between two probability distributions or histograms.JSD is a symmetrized version of the Kullback-Leibler divergence KL(P∥Q) between two distributions or histograms P and Q.
It is defined as where H = 1 2 (P + Q).The JSD is bounded between 0 (perfect similarity) and 1 (perfect dissimilarity) when the KL is expressed in base 2 logarithm (bits); and its square root is a metric known as the Jensen-Shannon distance.

4) LOG-BAYES FACTOR
Given a data sequence y 1:T and two given model classes M 1 and M 0 , the log-Bayes factor [51]  Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
where p(y 1:T |M k ) is the evidence of model class M k .Since the evidence cannot be calculated in closed form, we used its lower bound approximation in terms of the (variational) Negative Free Energy, p(y 1:T |M k ) ≈ NFE(M k ).

5) DISCRIMINATIVE POWER SCORE
Given two models (of the same model class) each defined by the set of parameters 0 (model 0) or 1 , (model 1), and given a data sequence y (1:T ) that was generated by the model 0, the Discriminative Power Score is defined in terms of the directed divergence [52] DPS(H 0 , H 1 ) = lim where p(y 1:T | k , M ) is the probability that the data sequence was generated by model k (i.e. the model with parameters k ), after marginalization over all possible brain state paths.For a sufficiently long sequence, if H 0 and H 1 are the hypotheses that the data sequence was generated by model 0 and model 1, respectively, DPS(H 0 , H 1 ) is the average information per data sample for discrimination of H 0 against H 1 .This divergence is always positive and is zero if the two models are the same ( 0 = 1 ).

FIGURE 1 .
FIGURE 1. Dynamic Bayesian Network (DBN) representations of the hybrid system models used in this paper.Shaded grey circles represent observations, clear circles represent hidden state variables.Parameters have been omitted for simplicity.(a) HMM: A hidden Markov chain (s t ) emitting one observation (y t ) at each time point.(b) Super-state representation of HSMM: A hidden Markov chain on a set ''super-states'' (z l ) where each super-state visited emits a segment of observations of random duration d l .(c)HMM embedding of HSMM: A hidden Markov chain on the augmented state-space of states and residual times (s t , τ t ) emitting one observation at each time point.The augmented joint-state is highlighted in blue.When the chain first enters the state s t at time t , the residual time τ t is set to a duration sampled from the state's duration distribution; it then deterministically counts down to 1 at which point the state is free to transition and τ t is reset to a duration sampled from the duration distribution of the new state.

FIGURE 2 .
FIGURE 2. Simulation pipeline.EEG activity is simulated considering the brain as a hybrid dynamical system.The hidden discrete state process represents fast transitions between source activity configurations, which remain stable over a time segment.The length (duration) of the time segments are randomly sampled from uniform distribution with state-specific mean ⟨d ⟩ and width w .The neural source activity is instantaneously projected to the sensors' space using the linear EEG Forward model and subsequently corrupted with Normal noise to produce the observed EEG signals.

FIGURE 3 .
FIGURE 3. Reproducing microstates features in simulated brain state sequences.Auto Information Function (AIF) of the brain state sequences generated using HSMM (a & c) and HMM (b & d) without emissions, as a function of the mean and width of the duration distributions of the brain states.In (a) and (b), the mean duration of the three brain states was varied while the widths were fixed at 6 a.u.In (c) and (d), the widths were varied while the duration means were fixed at 50 a.u.The periodicities and the asymptotic decay in the case of the HSMM resemble those of the empirically observed EEG microstates, as illustrated by[49].The AIF is measured in natural units of information (nats).

FIGURE 4 .
FIGURE 4. Histograms of durations of a test brain state, generated from trained HSMM and HMM.HSMM and HMM were trained on the same dataset simulated using the uniform duration distribution in red (mean 50 a.u. and width 6 a.u.).Subsequently, new sequences of brain states were generated from each trained model and the respective duration histograms corresponding to one brain state were computed and shown.The geometric distribution implicit in HMM, produced brain state durations across the whole range of possible values (green bars).The explicit normal duration distribution in HSMM concentrated the durations in the correct range of values (blue bars).The best fitted geometric (green line) and normal (blue line) brain state duration distributions (respectively for the HMM and the HSMM), are also shown.

FIGURE 5 .
FIGURE 5. Jensen-Shannon Divergence (JSD) between the true (H0) and predicted (H1) brain state duration histograms of HSMM and HMM.A lower JSD value indicates higher similarity between the histograms of the predicted (out-of-sample) durations by a model after training, and of the true durations underlying the training dataset.HSMM out-of-sample predictions consistently and robustly reproduce the brain state durations of the underlying system.The simulations were the same as in figure 4, except the mean duration of one of the brain states in the training sequences was now varied from 20 a.u. to 140 a.u.(step 40 a.u.) at two SNR values.The error bars over 100 realizations of the experiment are indicated.

FIGURE 6 .
FIGURE 6. Accuracy of brain state allocation based on unreliable test data.HSMM and HMM were first trained on data produced by a HDS with non-geometric states' duration distributions, and were then used to decode the brain state sequence underlying a new test dataset from the same system but with reduced SNR.The mean duration of a test brain state was varied at two levels of SNR of the training dataset (SNR 0 ) and the test dataset (SNR).The accuracy of the decoding was evaluated using the Sequence Accuracy Score (SAS), between the decoded and the true brain state sequence.Higher SAS indicates a more accurate decoding.HSMM achieved higher SAS for all duration values explored, suggesting robustness to the noise fluctuations.The error bars over 100 realizations of the experiment are indicated.

FIGURE 7 .
FIGURE 7. Discriminative Power Score (DPS) of HSMM vs HMM.A previously unseen data sequence (with mean brain states duration of 10 a.u) is classified as being generated by one of 11 HSMM and HMM models which were trained on data that differed in the mean duration of the underlying brain states, from 2-22 a.u. in steps of 2. Each point on the curves is a measure of the discrimination of the hypotheses H0 that the test data was generated by its true model (⟨d⟩=10) against the hypothesis H1 that it was generated by the candidate model with duration on the x-axis.A value of DPS=0 indicates no discrimination between the two hypotheses, that is the candidate model is very close to the truth.The shadowed areas represent the 95% confidence interval for discrimination.HSMM with Normal durations is able to allocate the test sequence to the correct model, whereas HMM has no discrimination power.

FIGURE 8 .
FIGURE 8. Convergence of the VB algorithm.Negative Free Energy (NFE) of the HSMM for the two duration models used in the EEG resting-state data.The HSMM with Lognormal durations converged faster and reached a higher NFE value.

FIGURE 9 .
FIGURE 9. Brain states allocation in resting-state EEG.Partial correlation maps of the estimated brain states in each duration model.Based on the pair-wise similarity between the maps of the two duration models, the brain states were relabeled and matched based on these maps by visual inspection.The analyzed EEG topographic maps display a distribution of brain electrical activity in which clearly localized and differentiated areas of positive and negative polarities are observed.The regions with positive polarity, indicated by the red color, and the regions with negative polarity, represented by the blue color.

FIGURE 10 .
FIGURE 10.Estimated brain state duration distributions.The left panels show the logarithm of the approximate posterior distribution of duration for all brain states.The right panels show (in log-scale) the empirical survival distribution of duration (P(X > x)) for each duration model (Normal and Lognormal) for the exemplary state S2, as well as their corresponding estimated parametric distributions (red curve).

Figure 10
Figure 10 summarizes the temporal characteristics of the inferred brain states in terms of their duration distributions induced by the posterior expected values of the parameters of the Lognormal and the Normal duration models.It can be observed that the Lognormal distribution allows for assigning probability mass to longer durations, while the Normal distribution allocates the mass only to shorter durations.In addition, it can be seen that the empirical durations of brain state 2 show asymmetric heavy-tailed histograms in the two cases, which do not follow a Normal distribution, whereas the Lognormal distribution shows an accurate fit.

VOLUME 12, 2024 12347
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

d
and C(k) are derived from the sufficient statistics obtained in the VB-E step of the VB algorithm γ

VOLUME 12, 2024 12351
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.