Model-Free Change Point Detection for Mixing Processes

This paper considers the change point detection problem under dependent samples. In particular, we provide performance guarantees for the MMD-CUSUM test under exponentially <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula>, <inline-formula><tex-math notation="LaTeX">$\beta$</tex-math></inline-formula>, and fast <inline-formula><tex-math notation="LaTeX">$\phi$</tex-math></inline-formula>-mixing processes, which significantly expands its utility beyond the i.i.d. and Markovian cases used in previous studies. We obtain lower bounds for average-run-length (<inline-formula><tex-math notation="LaTeX">$ {\mathtt {ARL}}$</tex-math></inline-formula>) and upper bounds for average-detection-delay (<inline-formula><tex-math notation="LaTeX">$ {\mathtt {ADD}}$</tex-math></inline-formula>) in terms of the threshold parameter. We show that the MMD-CUSUM test enjoys the same level of performance as the i.i.d. case under fast <inline-formula><tex-math notation="LaTeX">$\phi$</tex-math></inline-formula>-mixing processes. The MMD-CUSUM test also achieves strong performance under exponentially <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula>/<inline-formula><tex-math notation="LaTeX">$\beta$</tex-math></inline-formula>-mixing processes, which are significantly more relaxed than existing results. The MMD-CUSUM test statistic adapts to different settings without modifications, rendering it a completely data-driven, dependence-agnostic change point detection scheme. Numerical simulations are provided at the end to evaluate our findings.


I. INTRODUCTION
Change point detection studies the problem of monitoring for abrupt changes in the statistical properties of an observation sequence, which has been widely considered in the literature [1,2,3,4]. Change point detection has diverse applications spanning many areas, including cybersecurity, network intrusion detection, automated fault monitoring, and factory quality control. In many of these application scenarios, one may face various challenges, such as complex unknown dynamics, noisy non-i.i.d. observations, and unknown pre- and post-change distributions. Ideally, a completely data-driven method with very few distributional assumptions (independence, density functions, etc.) would be preferred. The goal of this paper is to study the change point detection problem under a completely data-driven setting. To tackle this problem, we employ the MMD-CUSUM statistic proposed in [5] and analyze its performance under three common mixing conditions, namely α-, β-, and ϕ-mixing.
The MMD-CUSUM statistic is an extension of the well-known CUSUM statistic [6] with the maximum mean discrepancy (MMD). MMD has wide adoption in statistical two-sample tests [7] and the training of generative adversarial networks [8]. As a probability distance, MMD can be easily estimated from samples on general domains (continuous or discrete) without the need for a density function. Thus, it is well suited for change point detection under the completely data-driven setting where pre- and post-change distributions can be unknown. Additionally, kernel methods are widely applicable [9,10] because kernel functions are available for many data structures, such as discrete data, continuous data, graphical data, etc. Thus, kernel-based methods have vast application potential in designing completely data-driven change point detection schemes. In particular, sequential testing procedures using the MMD have attracted research interest lately [11,12,13,14,5]. Most of the existing studies focus on the properties of MMD-based procedures in the i.i.d. case. For Markov chains on continuous state spaces, the MMD-CUSUM test was proposed in [5] under uniform ergodicity, a condition that is known to be hard to satisfy in practice.
Thus, more relaxed assumptions need to be considered to meet the demands of the completely data-driven setting. The main challenge in generalizing the performance analysis of MMD-CUSUM lies in the dependence of samples. Our proposal assumes the mixing property of the stochastic processes generated by the dynamic system. Mixing measures the dependence in the process by its definition [15], and it is widely considered in extending various results in probability theory to dependent time series [16,17,18]. Thus, establishing the performance bounds under various mixing conditions is a natural choice. Furthermore, the mixing conditions we assume highlight the fundamental limit for MMD-CUSUM to achieve a good performance; that is, the speed and strength of the mixing condition the processes satisfy.
In the current paper, we analyze the performance of MMD-CUSUM under three common mixing conditions, namely α-, β-, and ϕ-mixing. We provide bounds on the average-run-length (ARL) and average-detection-delay (ADD), which are the common performance metrics [19]. ARL characterizes how frequently false alarms occur, and ADD characterizes the quickness of the reaction. As outlined in [20], the information-theoretic lower bounds are O(exp(b)) for ARL and O(b) for ADD for large b > 0, where b is the threshold parameter. We show that under the fast ϕ-mixing condition, the MMD-CUSUM achieves these lower bounds and thus is order optimal. Under exponential α/β-mixing, ADD is bounded by O(b) while ARL is bounded by O(exp(b^{γ/(γ+1)})), where γ > 0 controls the mixing speed (more details in Section IV).
The rest of the paper is organized as follows. Section II introduces the necessary background about reproducing kernel Hilbert space and mixing processes. Section III states the problem setting for online change point detection and introduces the MMD-CUSUM test statistic. Section IV establishes the main results of this paper. Section V presents the experiments of the MMD-CUSUM test on synthetic datasets. We conclude the paper with discussions of the limitations and future work in Sections VI and VII.

A. Related works
Continuous efforts have been made to adapt the kernel two-sample test to a sequential setting, i.e., change point detection. Early work focused on detecting changes in a stream of i.i.d. samples [11,12,13,14]. In [11,12], the authors developed a Shewhart chart-type [21] procedure that maintains a running estimate of the MMD between a set of curated reference data and incoming samples within a fixed sliding window. Analysis shows strong performance guarantees with an O(exp(b^2)) average-run-length (ARL) and an O(b) average detection delay (ADD), where b is the threshold. However, testing schemes with sliding windows suffer from loss of information as older samples are discarded. To maintain history information, kernel-based CUSUM-type statistics were proposed in [14] with an O(exp(b)) ARL and an O(b) ADD. In [13], the authors devised a neural network-based kernel selection strategy that finds a kernel whose MMD can best separate the nominal distribution from an adversarial one. The testing scheme estimates the MMD with the selected kernel on two adjacent sliding windows. Empirical studies show promising performance, albeit without theoretical guarantees.
The analysis of the above methods is based heavily on the i.i.d. assumption. Their techniques and results do not carry over naturally to the non-i.i.d. case. Due to the ubiquity of time series data in machine learning, signal processing, economics, and dynamic systems, the i.i.d. assumption limits the application of these methods. More recently, researchers have been adapting kernel-based change point detection to dependent data. In [5], the MMD-CUSUM test is proposed and analyzed under the setting of uniformly ergodic Markov chains on general state space. Recently, [22] extended the analysis of MMD-CUSUM to noisy observations of uniformly ergodic Markov chains, i.e., hidden Markov models (HMM). Both cases are special cases of ϕ-mixing processes [15]. In fact, we show that the same performance can be obtained even when the Markovian and HMM structures are ignored. In other words, the Markov chain and HMM assumptions are not necessary for the performance of the MMD-CUSUM test. Our work even extends to α/β-mixing processes, which have never been considered for the MMD-CUSUM test previously.
More broadly, our study falls under the umbrella of the quickest change detection (QCD) theory [23]. Studies on the QCD problem can be split into two categories, the Bayesian and the minimax formulations, depending on the assumption on the change point. The Bayesian formulation, pioneered by [24,25], places a prior on the distribution of the change point (usually a geometric distribution). The minimax formulation, first considered by [26], instead assumes the change point is unknown and deterministic. Under both formulations, different notions of detection delay are minimized subject to a constraint on the probability of false alarm or the false alarm rate (1/ARL). A well-known Bayesian QCD formulation is Shiryaev's problem [24], which seeks the stopping rule that minimizes the average detection delay (under the change point prior) subject to a constraint on the probability of false alarm. The minimax formulations include Lorden's problem [26] and Pollak's problem [27], where the former minimizes the worst-case average delay and the latter minimizes the conditional average delay, both subject to a constraint on the false alarm rate.
Although the CUSUM statistic was first proposed as a heuristic for the minimax formulation under the i.i.d. setting by [6], strong optimality properties have been shown for the CUSUM statistic under various settings. Under the i.i.d. setting, exact optimality was shown by [28,29] for Lorden's problem. For general non-i.i.d. settings, [20] has shown that an extension of the CUSUM statistic achieves the information-theoretic lower bound on the conditional average delay (as well as the worst-case delay) asymptotically as the false alarm rate goes to 0.
However, the optimality result mentioned previously requires specific knowledge of the pre- and post-change distributions. Furthermore, the QCD problems are intractable for general stochastic processes due to the lack of problem structure [20]. Thus, the numerous studies on QCD for non-i.i.d. settings [1,20,30,31,2,19,3,32,33,34,35,36] cannot be easily converted to the completely data-driven setting.

B. Contributions
As a non-parametric model-free change point detection procedure, the MMD-CUSUM test exhibits great potential in completely data-driven applications where distributional assumptions may be difficult to verify. Our performance guarantees under general mixing conditions establish its robustness under dependent samples and further strengthen its capability as a model-free testing scheme. The mixing conditions considered in this paper not only subsume the i.i.d., Markov chain, and HMM settings but also greatly expand beyond those appearing in previous studies on the performance of the MMD-CUSUM test. Our results indicate that the Markovian or HMM structures are not necessary for the strong performance of the MMD-CUSUM test. Additionally, we provide the first performance guarantee for the MMD-CUSUM test under exponentially α/β-mixing and fast ϕ-mixing processes. Note that stationary exponentially β-mixing processes include the geometrically ergodic Markov chains as a special case, and such chains need not satisfy Doeblin's condition [37, page 402]. In stark contrast, Doeblin's condition is the core assumption for the performance analysis of the MMD-CUSUM test in [5] and [22].

II. BACKGROUND
In this section, we introduce the necessary background for our discussion. Section A collects the usual facts about reproducing kernel Hilbert space (RKHS) and maximum mean discrepancy (MMD). Section B presents the notions of mixing used to obtain the main results. Our standard reference is [38] for RKHS and [15] for mixing processes.

A. RKHS and MMD
Let (X, 𝒳, P) be a measure space with Borel σ-algebra 𝒳 and σ-finite measure P. Let P(𝒳) denote the set of probability measures over the σ-algebra 𝒳. The supremum norm of f is written as ∥f∥_∞ := sup_{x∈X} |f(x)| and its span is written as
$$\mathrm{span}(f) := \sup_{x \in X} f(x) - \inf_{x \in X} f(x).$$
A reproducing kernel Hilbert space (RKHS) H(X) on X with kernel k : X × X → R is a Hilbert space of real-valued functions on X equipped with inner product ⟨·, ·⟩_{H(X)}. The corresponding Hilbert space norm satisfies
$$\|f\|_{H(X)}^2 = \langle f, f \rangle_{H(X)}.$$
The kernel function k satisfies the reproducing property:
$$f(x) = \langle f, k(x, \cdot) \rangle_{H(X)} \quad \text{for all } f \in H(X),\ x \in X.$$
The current paper relies on a particular application of RKHS, namely Hilbert space embeddings of probability measures. The Hilbert space embedding of µ ∈ P(𝒳) under k is written as
$$U(\mu) := \int_X k(x, \cdot)\, \mu(dx),$$
where U(µ) is also called the kernel mean embedding of µ. Suppose ν ∈ P(𝒳) is another probability measure. One can define a distance function between µ and ν using the Hilbert space metric between U(µ) and U(ν),
$$\mathrm{MMD}_k(\mu, \nu) := \|U(\mu) - U(\nu)\|_{H(X)},$$
which is known as the maximum mean discrepancy (MMD) [7]. A kernel k such that MMD_k(µ, ν) = 0 ⇔ µ = ν for all µ, ν ∈ P(𝒳) is called a characteristic kernel [39]. MMD_k with a characteristic kernel k is a metric on P(𝒳).
MMD enjoys a computational advantage, compared with other probability distance functions, such as KL divergence [40] and total variation metric (Definition 7), that allows it to be easily estimated empirically for distributions on general domains [9,10].
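Given samples {x_i}_{i=1}^m from µ and {y_j}_{j=1}^n from ν, define their empirical measures as μ̂_m and ν̂_n, respectively. A consistent estimate of the squared MMD is
$$\mathrm{MMD}_k^2(\hat{\mu}_m, \hat{\nu}_n) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i, x_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i, y_j).$$
This was first used by [7] to propose the kernel two-sample test, and it is the core component of the MMD-CUSUM test studied in the current paper.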
Throughout the paper, we assume the kernel k is real-valued, measurable, characteristic, and bounded, i.e., sup_{x∈X} k(x, x) = k̄ < ∞. The boundedness ensures MMD_k is well-defined.
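To make the empirical estimate concrete, the following is a minimal sketch of the squared MMD between two empirical measures, computed as the biased (V-statistic) sum of pairwise kernel evaluations. The Gaussian kernel, the bandwidth, the sample sizes, and all function names are assumptions chosen for illustration; they are not prescribed by the paper.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on real vectors; bounded by 1 (illustrative choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Squared MMD between the empirical measures of xs and ys (V-statistic)."""
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    # The exact value is a squared RKHS norm, hence nonnegative;
    # clip tiny negative values caused by floating-point rounding.
    return max(k_xx + k_yy - 2.0 * k_xy, 0.0)

# Illustration with i.i.d. Gaussian samples (hypothetical data).
rng = random.Random(0)
same_a = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
same_b = [(rng.gauss(0.0, 1.0),) for _ in range(200)]
shifted = [(rng.gauss(2.0, 1.0),) for _ in range(200)]
mmd_same = math.sqrt(mmd_squared(same_a, same_b))
mmd_diff = math.sqrt(mmd_squared(same_a, shifted))
```

In this sketch `mmd_same` compares two samples from the same distribution and should be small, while `mmd_diff` compares samples whose means differ and should be clearly larger. No density function is needed at any point, only kernel evaluations on the samples.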

B. Mixing processes
Defining the mixing coefficients requires the following notation. Consider the space of X-valued doubly infinite stochastic processes (X^∞, 𝒳^∞, P). For a process X = {X_i}_{i∈Z}, the joint probability measure on {X_i}_{i=t}^∞ is written as P_t^∞ and the joint probability measure on {X_i}_{i=−∞}^t as P_{−∞}^t. With these notations, we have the definitions of the α-, β-, and ϕ-mixing coefficients following [41,15].

Definition 1 (α-mixing coefficient). The α-mixing coefficient [42] of a stationary process X is defined as

The following β-mixing coefficient provides a stronger notion of decaying dependence. It can be shown that 2α(n) ≤ β(n) [41].
Definition 2 (β-mixing coefficient). The β-mixing coefficient [16] of a stochastic process X is defined as

The β-mixing coefficient can be equivalently written as

Comparing the second definition of β-mixing with the following definition of ϕ-mixing, we can see that β(n) ≤ ϕ(n).
Definition 3 (ϕ-mixing coefficient). The ϕ-mixing coefficient [18] of a stochastic process X is defined as

We say X is stationary with respect to µ ∈ P(𝒳) if the one-dimensional marginal probability of X_i equals µ for all i ∈ Z. For stationary processes, the supremum over t in the above definitions can be ignored, and one can set t = 0 without loss of generality. To maintain the simplicity of the presentation, we focus on stationary stochastic processes with α-, β-, and ϕ-mixing properties in the sequel. However, the results put forward in the current paper can be extended to asymptotically stationary processes, which is discussed in Section VI.
The decay rates of the mixing coefficients play an important role in our discussion. The following definitions introduce the exponential α/β-mixing condition and the fast ϕ-mixing condition, which are used throughout the paper.

Definition 4 (exponential α/β-mixing). X is said to be exponential α- or β-mixing if the α- or β-mixing coefficient satisfies

An exponentially decaying ϕ-mixing coefficient is certainly summable and thus is covered under the above definition. Definitions 4 and 5 form the basic assumption on the mixing processes studied in the current paper.
To bridge the notions of mixing with RKHS, it is convenient to consider the following kernel mixing coefficient introduced in [43].
Definition 6 (kernel mixing coefficient). Let X be a stationary process with distribution µ. For n ∈ N, define the kernel mixing coefficient as

We denote the cumulative sum of the kernel mixing coefficient as Σ_µ := Σ_{n=0}^∞ ρ_k(n). If we treat {k(X_i, ·)}_{i∈Z} as a Hilbert space valued stochastic process, then as shown by [44, Lemma 2.2], ρ_k(n) can be bounded by a constant multiple of the α-mixing coefficient, i.e., ρ_k(n) ≤ 10 α(n) k̄^2. Thus, we get Σ_µ < ∞ under the assumptions of exponential α-mixing, exponential β-mixing, and fast ϕ-mixing.

C. Examples of mixing processes
One notable example of ϕ-mixing processes is the uniformly ergodic Markov chain. A Markov chain is said to be uniformly ergodic if it is aperiodic and satisfies Doeblin's condition [37]. Thus, it is also called the Doeblin chain. A q-th order autoregressive (AR) process is ϕ-mixing if the Markov chain generated by stacking q consecutive states is a Doeblin chain. The ϕ-mixing coefficient decays exponentially for uniformly ergodic Markov chains, therefore satisfying the fast ϕ-mixing condition in Definition 5.
Examples of exponential β-mixing processes include V-geometrically ergodic Markov chains. The Markov transition kernel P : X × 𝒳 → [0, 1] with stationary distribution π is said to be V-geometrically ergodic if it satisfies where

III. PROBLEM FORMULATION
In this section, we first introduce the online change point detection problem and the commonly used performance metrics [see 19,4]. Later, we discuss the proposed MMD-CUSUM test and its properties.
In the sequel, we make the following assumption and restrict our attention to stochastic processes satisfying the exponential α/β-mixing and fast ϕ-mixing conditions in Definitions 4 and 5.
Assumption 1. The stochastic processes considered in what follows satisfy one of the three mixing conditions in Definitions 4 and 5.

A. Online change point detection
The online change point detection problem is often formulated as a sequential two-sample test, which has been widely considered in the past [6,26,21,24]. Given a sequence of samples {X_i} from a stationary mixing process X with distribution µ, at each time step, the following null and alternative hypotheses are proposed:
H_0 : µ remains the same,
H_1 : µ has changed.
Test statistics are calculated using the samples collected up to the current time step. To detect the change quickly and accurately, one attempts to reject the null hypothesis H_0 via a threshold rule at every time step. More formally, consider a stationary stochastic process X = {X_i}_{i∈Z} ∈ X^∞ adapted to its natural filtration with unknown distribution µ. At some unknown but deterministic time index τ ∈ Z, we have X_i ∼ µ for 0 ≤ i ≤ τ and X_i ∼ ν for i ≥ τ + 1, where µ, ν ∈ P(𝒳) and µ ≠ ν. This can be conceptually thought of as having a separate and independent stochastic process X′ ∈ X^∞ following unknown distribution ν running alongside X. From the outside, one can only observe X up to time τ, and at time τ, the observation is immediately switched to X′.
Suppose the null hypothesis is rejected at time T(b), which is a stopping time adapted to the filtration {X_{−∞}^i}_{i∈Z} and a function of the threshold b. If we use E_∞ and E_0 to denote the expectation under H_0 and H_1 respectively, then the average-run-length ARL and the average-detection-delay ADD can be written in terms of the stopping time T as follows
$$\mathtt{ARL} := E_\infty[T(b)], \qquad \mathtt{ADD} := E_0[T(b)].$$
Unlike the Bayesian formulation, we assume the change point τ is unknown and deterministic, and thus we set τ = 0 without loss of generality. ARL measures the robustness of the test against false alarms, whereas ADD measures the quickness of the test in response to an abrupt change. The overall goal of online change point detection is to have an ARL that grows with b as fast as possible and an ADD that grows with b as slowly as possible.

B. MMD-CUSUM test
The MMD-CUSUM test is a sequential adaptation of the kernel two-sample test. Consider a bounded, measurable, characteristic, reproducing kernel k : X × X → R. Denote the reference dataset of h samples as D_h. The detection algorithm processes the incoming data in blocks of size r, which is denoted as B_r(t) = {X_i}_{i=(t−1)r}^{tr−1} for an integer t ≥ 1. Let ν̂_h and μ̂_r denote the empirical measures constructed using the dataset D_h and the block B_r(t), respectively. The MMD between these two empirical measures is written as MMD(μ̂_r, ν̂_h). At time step i = t · r, the algorithm computes the following test statistic; otherwise, it collects the new observations and remains idle. Let integer M ≥ r be the minimum number of samples required to perform the test. We write the test statistic at time step i as

where ∆ > 0 is a tunable parameter that keeps the summand slightly below 0 under the null hypothesis. The corresponding stopping rule with threshold b and M minimum samples is written as

We make the following remarks regarding the above MMD-CUSUM statistic.
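The block-wise procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact statistic: the recursion `stat = max(stat, 0) + MMD - offset` is the standard CUSUM form, which equals the maximum over starting blocks of the partial sums of (MMD − ∆); the handling of the minimum sample count, the scalar Gaussian kernel, the i.i.d. toy data, and all names and parameter values are hypothetical.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Bounded kernel on the real line (an illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, kernel):
    """Average of kernel(x, y) over all pairs from the two samples."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd_cusum_alarm(stream, reference, block_size, offset, threshold,
                    min_samples=0, kernel=gaussian_kernel):
    """Return the 1-based sample index of the first alarm, or None.

    Assumed recursion: stat_t = max(stat_{t-1}, 0) + MMD_t - offset,
    evaluated once per completed block of block_size samples.
    """
    ref_term = mean_kernel(reference, reference, kernel)  # computed once
    stat, block = 0.0, []
    for i, x in enumerate(stream, start=1):
        block.append(x)
        if len(block) < block_size:
            continue  # stay idle until the current block is full
        sq = (mean_kernel(block, block, kernel) + ref_term
              - 2.0 * mean_kernel(block, reference, kernel))
        mmd = math.sqrt(max(sq, 0.0))  # clip rounding noise below zero
        stat = max(stat, 0.0) + mmd - offset
        block = []
        if i >= min_samples and stat >= threshold:
            return i
    return None

# Toy data: i.i.d. N(0, 1) before the change, N(3, 1) after it.
rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
change_at = 300
stream = ([rng.gauss(0.0, 1.0) for _ in range(change_at)]
          + [rng.gauss(3.0, 1.0) for _ in range(300)])
alarm = mmd_cusum_alarm(stream, reference, block_size=20,
                        offset=0.3, threshold=2.0, min_samples=20)
```

With these toy settings the statistic stays near or below zero before the change, because the block-wise empirical MMD is typically smaller than the offset, and it climbs by roughly MMD − ∆ per block afterwards. The reference-versus-reference kernel average is computed once because it does not depend on the incoming block.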

a: Convergence of Empirical MMD
To correctly configure the offset parameter ∆, we need to determine the envelope of the deviation of the empirical MMD from the true one. The result collected in the following lemma shows that the estimation error is bounded by a term diminishing in the sample size plus a small margin, almost surely for all three mixing conditions. Note that the empirical MMD can be equivalently written as the MMD between empirical measures. For probability measures µ and ν, we write the empirical estimate of MMD(µ, ν) as MMD(μ̂_r, ν̂_h), where μ̂_r and ν̂_h are empirical measures of µ and ν with r and h samples, respectively.

Lemma 1. Let X and X′ be two independent processes with stationary distributions µ and ν satisfying the mixing conditions introduced before. Given δ > 0, there exists a constant C(r, h) such that the following holds almost surely for sufficiently large h,

Proof: Applying the triangle inequality, we get the following two expressions:

Let us consider the first inequality above; the other one follows similarly. Suppose we take the expectation over the randomness of μ̂_r, and due to independence we have,

On the right-hand side, the term E_X[MMD(μ̂_r, µ)] ≤ √((1 + 2Σ_µ)/r) by Lemma 7.1 of [43] for all r > 0 and X which satisfies Σ_µ < ∞. It remains to bound MMD(ν, ν̂_h) for a particular ν̂_h. Observe that {H_i} is a Hilbert space valued stochastic process that enjoys the same mixing property as X′, since H_i is a measurable function of X′_i. Thus, we can apply the law of the iterated logarithm for Hilbert space valued α-mixing processes [44, Theorem 6] or [46, Theorem 2] to conclude that there exists a constant c_0 > 0 such that almost surely

Note that the hypothesis of [44, Theorem 6] holds in our case under the assumption of a bounded kernel k and exponential α/β-mixing or fast ϕ-mixing. Thus, there exists a constant C(r, h) = O(√(1/r) + √(log log h / h)) such that MMD(μ̂_r, ν̂_h) − MMD(µ, ν) ≤ C(r, h) + δ for sufficiently large h. Similarly, MMD(μ̂_r, ν̂_h) − MMD(µ, ν) can be bounded from below by −C(r, h) − δ, and the proof is complete.
Lemma 1 indicates that under the null hypothesis (no change), the bias of the empirical MMD is bounded by a positive quantity decaying at rate o(r^{−1/2} + h^{−1/2} log log h) plus a small margin for sufficiently large reference data. To maintain a low value of the MMD-CUSUM statistic under the null hypothesis, it is necessary to apply a certain negative offset to the empirical MMD so that the cumulative sum in (4) does not blow up when change is absent. This leads to the second remark, regarding the parameter ∆.

We shall determine the appropriate range for the offset parameter ∆ in (4) using Lemma 1. Note that ∆ needs to be sufficiently large under the null hypothesis such that the MMD-CUSUM statistic does not blow up due to the estimation error of the empirical MMD. As suggested by Lemma 1, if ∆ is strictly larger than C(r, h), i.e., ∆ ≥ C(r, h) + δ for some margin δ > 0, then the empirical MMD is bounded by ∆ almost surely for sufficiently large sample size. On the other hand, the upper bound for ∆ appears under the alternative hypothesis (with post-change distribution ν). As we shall see in Theorem 3, ∆ should be strictly less than MMD_k(µ, ν) − C(r, h) − δ; otherwise the ADD can be unbounded. To tune ∆ in practice, one can simulate the pre-change scenario with different values of ∆ ≥ C(r, h) + δ using the reference dataset. For each value of ∆, the ARL can be estimated with multiple runs of the experiment. Then, choose the smallest ∆ that yields an acceptable ARL performance. Keeping ∆ small allows the MMD-CUSUM to achieve better ADD.
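The tuning procedure for ∆ can be sketched as a small grid search. The sketch assumes the per-block empirical MMD values of several pre-change runs are already available (here replaced by hypothetical placeholder numbers), uses the standard CUSUM recursion as an assumed form of the statistic, and measures run length in blocks; all names and values are illustrative.

```python
import random

def run_length(mmd_values, offset, threshold):
    """Blocks until the assumed CUSUM recursion crosses the threshold.

    Returns len(mmd_values) if no alarm occurs (a censored run).
    """
    stat = 0.0
    for t, mmd in enumerate(mmd_values, start=1):
        stat = max(stat, 0.0) + mmd - offset
        if stat >= threshold:
            return t
    return len(mmd_values)

def estimate_arl(runs, offset, threshold):
    """Average run length, in blocks, over several pre-change runs."""
    return sum(run_length(r, offset, threshold) for r in runs) / len(runs)

def smallest_acceptable_offset(runs, candidates, threshold, target_arl):
    """Smallest candidate offset whose estimated ARL meets the target."""
    for offset in sorted(candidates):
        if estimate_arl(runs, offset, threshold) >= target_arl:
            return offset
    return None

# Placeholder pre-change data: 20 runs of 500 per-block MMD values.
# In practice these would come from blocks of the reference dataset.
rng = random.Random(0)
runs = [[abs(rng.gauss(0.15, 0.05)) for _ in range(500)] for _ in range(20)]
candidates = [0.05, 0.10, 0.15, 0.20, 0.25]
arls = [estimate_arl(runs, d, threshold=2.0) for d in candidates]
chosen = smallest_acceptable_offset(runs, candidates, threshold=2.0,
                                    target_arl=400)
```

Because runs without an alarm are censored at their length, the estimated ARL is a lower estimate of the true one. For a fixed set of runs the estimate is nondecreasing in the offset, so scanning the candidates in increasing order returns the smallest acceptable value.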

IV. MAIN RESULTS
In this section, we establish the detection performance of the MMD-CUSUM test using the metrics introduced in the previous section. The average-run-length ARL characterizes the average interval between false alarms, which is lower-bounded in Theorem 2. The average detection delay measures the quickness of the detection, and an upper bound is given in Theorem 3. The proofs are omitted due to the page limit, and they can be found in the Supplementary Material's Appendix.
Before we state the results, let us briefly summarize the technique we employed. Recall that E_∞ denotes the expectation under H_0. We can expand E_∞[T(b, M)] as follows

where the union bound is applied to the last inequality. At this point, it suffices to obtain an upper bound on the tail probability P_∞{s_{k:t} ≥ b} using Propositions 4 and 5. The tail probability bounds in Propositions 4 and 5 offer simple explicit sub-Gaussian decay rates with linear or sublinear dependency on the sample size n inside the exponential. This kind of decay rate is necessary for our analysis as it dictates the scaling of ARL in the threshold b. As we shall see in the theorem below, the slower decay rate of Proposition 5 causes the difference in ARL between exponential α/β-mixing and fast ϕ-mixing processes.
We note that the existing concentration inequalities obtained for generic purposes are not well-suited for the task at hand. For example, the classic concentration inequalities for α-mixing, such as [45, Theorem 3.5], have a tail bound with an additive term in addition to the common exponential term seen in the usual Hoeffding's inequality. When combined with our technique, it leads to a prohibitively cumbersome derivation of the ARL. The α-mixing concentration inequality in [47, Theorem 2] gives the tail bound on the relative deviation (scaled by variance) instead of the absolute deviation. The β-mixing results in [48] and the α-mixing results in [49] provide a subexponential bound of O(exp(−ϵ)), which is a weaker dependency on ϵ than we desire. The detailed discussion of the concentration inequalities we derived is postponed until the main results are introduced.
We now state the main result on the lower bound of ARL under the mixing conditions described in Definitions 4 and 5.
Theorem 2. The average-run-length for test statistics (4) and stopping rule (5) under the null hypothesis has the following lower bounds.
1) Suppose X is α/β-mixing satisfying Definition 4, then
2) Suppose X is ϕ-mixing satisfying Definition 5, then
where γ is defined in Definition 4, and δ > 0 is defined in Lemma 1 and depends on ∆, h.

Proof: Appendix D

Theorem 2 establishes the first ARL bound for the MMD-CUSUM test under α/β/ϕ-mixing processes. The performance of the MMD-CUSUM test in the α/β-mixing case has not been considered in the literature before, and Equation 6 provides the first exponential lower bound on the ARL. In previous studies, ϕ-mixing processes are considered in certain specific cases, such as the uniformly ergodic Markov chains [5] and hidden Markov models (HMM) [22]. Equation 7 generalizes the ARL bound therein to the broader class of ϕ-mixing processes without loss of performance. It also indicates that Markovian or HMM structures are not necessary for the exponential lower bound of the ARL.
The ARL bound in Equation 6 has a dependency on γ, which controls the mixing speed (Definition 4). This dependency on γ is the result of applying the concentration bound in Proposition 5. Suppose the α- or β-mixing coefficient has a decay rate of O(exp(−n)), i.e., γ = 1; the ARL then achieves an Ω(exp(b^{1/2} δ^{3/2})) lower bound, which is slightly degraded in terms of the threshold b compared to Equation 7.
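As a small worked check of how the mixing speed enters the threshold exponent, take the b-dependence of the α/β-mixing ARL rate as stated in the introduction, exp(b^{γ/(γ+1)}), and ignore the dependence on δ and on constants (a simplifying assumption of this sketch, with γ = 3 chosen only as an example):

```latex
\[
\left.\frac{\gamma}{\gamma+1}\right|_{\gamma=1} = \frac{1}{2},
\qquad
\left.\frac{\gamma}{\gamma+1}\right|_{\gamma=3} = \frac{3}{4},
\qquad
\lim_{\gamma\to\infty}\frac{\gamma}{\gamma+1} = 1 .
\]
```

So γ = 1 reproduces the b^{1/2} exponent, and faster mixing (larger γ) moves the exponent toward 1, the exp(b) scaling obtained under fast ϕ-mixing.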
Surprisingly, the ARL under the fast ϕ-mixing condition (Equation 7) achieves the Ω(exp(b)) lower bound (same as Markovian samples) while only requiring a summable ϕ-mixing coefficient. In comparison, the ARL lower bounds in [5] and [22] are obtained under Doeblin's condition [37, page 402], which corresponds to exponential ϕ-mixing conditions.
To measure the quickness of the MMD-CUSUM test, we estimate the expected value of the stopping time T(b, M) under the alternative (H_1). Recall that E_0 denotes the expectation under H_1. We can write E_0[T(b, M)] as follows

where the first inequality is due to s_{1:t} ≤ ŝ_t. We split the summation at t_0 and trivially bound the first term by t_0. With a certain choice of t_0, the second term can be shown to be ultimately negligible, or o(1), compared to t_0 using the concentration inequalities in Propositions 4 and 5.
Theorem 3. Suppose X is a mixing process satisfying Definition 4 or 5, and the pre- and post-change stationary distributions µ and ν satisfy MMD_k(µ, ν) > C(r, h) + ∆ + δ for some δ > 0. The average-detection-delay for test statistics (4) and stopping rule (5) under the alternative hypothesis has the following upper bounds.

Proof: Appendix E

Theorem 3 gives the first O(b) upper bound on ADD under all three types of mixing conditions. Similar to the ARL lower bound, it was previously considered only under uniformly ergodic Markov chains and HMM. Our result shows that the Markovian or HMM structure is also not necessary for the O(b) upper bound on ADD.
Intuitively, the realization of the MMD-CUSUM statistic should track its mean, which is just n·MMD(µ, ν) for time n. Therefore, the threshold b should affect the average detection delay in a linear fashion. We note that a sufficient separation between µ and ν is required due to the estimation error of the empirical MMD, as indicated by Lemma 1. This can be satisfied by choosing r, h sufficiently large and δ sufficiently small to ensure MMD_k(µ, ν) − 2C(r, h) − 2δ > 0.
We now establish the concentration inequalities for the sum of bounded functions under mixing conditions. Proposition 4 is a Hoeffding-type inequality for ϕ-mixing processes with summable mixing coefficients. We provide a proof based on the martingale decomposition. A concentration inequality for exponential ϕ-mixing processes was obtained in [50] using an information inequality-based argument. The martingale-based method was used in [51] to study the concentration inequality of dependent random variables on countable spaces. Proposition 4 complements the results therein by considering stationary ϕ-mixing processes on complete separable metric spaces. The sum of bounded functions of ϕ-mixing processes has a tight concentration bound that resembles that of i.i.d. random variables, which can be recovered by setting Φ = 0.

Proposition 4. Let X be a stationary ϕ-mixing process with coefficient satisfying Definition 5. Assume that f : X → R has bounded span and let S_n = Σ_{i=0}^{n−1} f(X_i). Then for ϵ ≥ 0, it holds

where Φ is defined in Definition 5.

Proof: Appendix B
Compared to the O(exp(−nϵ^2)) tail bound in Proposition 4, the following concentration inequality for β-mixing processes has an O(exp(−ñϵ^2)) tail bound, where ñ grows sublinearly with the sample size n. The proof follows [47, Theorem 2] with the modification of replacing Bernstein's inequality with Hoeffding's Lemma (Lemma 10) to yield the desired result for our purpose.

Proposition 5. Let X be a stationary β-mixing sequence with the coefficient satisfying Definition 4. Assume that f : Y → R has bounded span, i.e., span(f) < ∞, and let

Then for all ϵ ∈ (0, span(f)), it holds

where ñ = ⌊n ⌈(10n/c)^{1/(γ+1)}⌉^{−1}⌋ and c, γ are defined in Definition 4.
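To see how the block-adjusted sample size at the end of Proposition 5 behaves, the following sketch evaluates ⌊n ⌈(10n/c)^{1/(γ+1)}⌉^{−1}⌋ directly. The function name `effective_sample_size` and the choice c = 1 are assumptions made for illustration only.

```python
import math

def effective_sample_size(n, c, gamma):
    """Evaluate floor(n / ceil((10 n / c) ** (1 / (gamma + 1))))."""
    block_len = math.ceil((10.0 * n / c) ** (1.0 / (gamma + 1.0)))
    return n // block_len  # integer floor division

# Hypothetical constant c = 1; compare slow and fast mixing speeds.
slow = effective_sample_size(10_000, c=1.0, gamma=1.0)   # 31
fast = effective_sample_size(10_000, c=1.0, gamma=5.0)   # 1428
```

For n = 10000 and c = 1, γ = 1 gives 31 while γ = 5 gives 1428, which illustrates that a faster-decaying β-mixing coefficient leaves a much larger effective sample size inside the exponent.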

Proof: Appendix C
To our knowledge, the tail bound in the above form has not been considered previously. As opposed to the classic two-term version in [45, Theorem 3.5] and the relative error version in [47, Theorem 2], which can be difficult to apply in our analysis, Proposition 5 streamlines the calculation of ARL and ADD in Theorems 2 and 3.
Compared to the regular Hoeffding's inequality for bounded i.i.d. random variables [52], the exponent of the tail bound has a sublinear dependence on the sample size due to the presence of ñ, the quantity defined in Proposition 5. ñ is closer to n when γ is large, which corresponds to a faster-decaying β-mixing coefficient (Definition 4). This sublinear relation with respect to n is also reported in [48] and [49] under exponential α- and β-mixing conditions with γ = 1. They provide an O(exp(−nϵ/(log n log log n))) tail bound, which has a faster rate in n than Proposition 5 with γ = 1. It is tempting to think that this tail bound could improve the lower bound of ARL in Theorem 2. However, its subexponential, rather than subGaussian, dependence on ϵ makes it inapplicable to our proof. A similar concentration inequality for α-mixing processes is obtained in Proposition 14 following an analogous proof.
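To make the sublinear dependence concrete, the following minimal sketch evaluates the floor expression from Proposition 5 for a few sample sizes and mixing exponents. The function name `effective_sample_size` and the value c = 1 are illustrative assumptions, not quantities taken from the paper.

```python
import math


def effective_sample_size(n: int, c: float, gamma: float) -> int:
    """Evaluate floor(n / ceil((10 n / c)^(1 / (gamma + 1)))).

    This is the floor expression stated in Proposition 5; the function
    name and the parameter values used below are illustrative only.
    """
    k = math.ceil((10.0 * n / c) ** (1.0 / (gamma + 1.0)))
    return n // k  # floor of n / k for positive integers


if __name__ == "__main__":
    for gamma in (1.0, 3.0, 9.0):
        row = [effective_sample_size(n, c=1.0, gamma=gamma)
               for n in (10**2, 10**4, 10**6)]
        print(f"gamma={gamma}: {row}")
```

A larger γ shrinks the divisor inside the floor, so the returned value moves closer to n, in line with the remark above.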

V. NUMERICAL SIMULATIONS
In this section, we apply the MMD-CUSUM test to a simulated stochastic process and verify the theoretical results. The stochastic process is generated by simulating a stable linear system A ∈ R^{4×4} with an observation matrix C ∈ R^{2×4}. Let Z = {Z_i}_{i∈N} denote the state process and Y = {Y_i}_{i∈N} denote the observation process. The system update equations can be written as follows where A = [[0.96, 0.99, −0.88, 0.56], [0, 0.98, 0.75, −0.65], [0, 0, 0.97, 0.95], [0, 0, 0, 0.94]] and C = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]. Randomness is introduced into the system through the actuation noise W_i and the observation noise V_i, where W_i ∼ N_1 and V_i ∼ N_2 for all i. In our experiments, N_1 and N_2 are two multivariate normal distributions. This is an example of a hidden Markov model (HMM). The joint state-observation process (Z, Y) and the state process Z alone are Markov chains; however, the observation process Y in general is not. The observation process of this system is exponentially β-mixing. This can be deduced from the fact that the matrix A is stable and the noise has bounded variance [45, Section 3.5, page 100]. To obtain an exponentially ϕ-mixing process from the observations, one can simulate the above system with truncated versions of N_1 and N_2.
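The simulation setup can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard state-space form Z_{i+1} = A Z_i + W_i and Y_i = C Z_i + V_i, isotropic zero-mean Gaussian noise with hypothetical standard deviations `std_w` and `std_v`, and a hypothetical burn-in length. The matrix C is used as listed in the text (four rows, two of them zero), and the truncated variant rejects noise draws outside the [−1, 1]^4 box.

```python
import random

# Matrices as listed in the text.
A = [[0.96, 0.99, -0.88, 0.56],
     [0.0, 0.98, 0.75, -0.65],
     [0.0, 0.0, 0.97, 0.95],
     [0.0, 0.0, 0.0, 0.94]]
C = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]


def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gaussian_noise(rng, dim, std):
    """Draw an isotropic zero-mean Gaussian vector (assumed noise model)."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


def truncated_noise(rng, dim, std, bound=1.0):
    """Rejection sampling: redraw until the vector lies in [-bound, bound]^dim."""
    while True:
        w = gaussian_noise(rng, dim, std)
        if all(abs(x) <= bound for x in w):
            return w


def simulate(n_steps, burn_in=200, std_w=0.1, std_v=0.1, truncated=False, seed=0):
    """Return n_steps observations after discarding burn_in initial steps."""
    rng = random.Random(seed)
    draw = truncated_noise if truncated else gaussian_noise
    z = [0.0] * 4
    observations = []
    for i in range(burn_in + n_steps):
        # Assumed observation equation: Y_i = C Z_i + V_i.
        y = [a + b for a, b in zip(matvec(C, z), draw(rng, 4, std_v))]
        if i >= burn_in:
            observations.append(y)
        # Assumed state equation: Z_{i+1} = A Z_i + W_i.
        z = [a + b for a, b in zip(matvec(A, z), draw(rng, 4, std_w))]
    return observations


if __name__ == "__main__":
    ys = simulate(5)
    for y in ys:
        print([round(v, 3) for v in y])
```

Since A is upper triangular, its eigenvalues are the diagonal entries 0.96, 0.98, 0.97, and 0.94, all inside the unit circle, which is what stability requires here.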
The kernel chosen for the MMD-CUSUM test is the rational quadratic kernel $k^{rq}_\sigma(x, y) = (1 + (2\sigma)^{-1}\|x - y\|^2)^{-\sigma}$ for σ > 0 instead of the popular Gaussian RBF kernel $k^{rbf}_\sigma(x, y) = \exp(-\|x - y\|^2/(2\sigma^2))$ for σ > 0. As demonstrated by [53], the rational quadratic kernel is favored over the Gaussian RBF kernel in GAN applications, which indicates its superior performance in separating probability distributions. We fix the kernel parameter σ = 1 for all experiments. The reference dataset is obtained by recording Y for 10^4 steps under the pre-change configuration, with an appropriate burn-in period applied to the samples to maintain stationarity. We estimate the ARL and ADD by averaging over 50 independent experiments for each threshold. The experiments are performed under 3 different offsets to demonstrate the sensitivity to this parameter.
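For reference, the two kernels can be written in a few lines. The sketch below implements the formulas exactly as stated above; the function names `k_rq` and `k_rbf` are illustrative, and the default parameter value 1 mirrors the fixed choice described in the text.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_rq(x, y, sigma=1.0):
    """Rational quadratic kernel: (1 + ||x - y||^2 / (2 sigma))^(-sigma)."""
    return (1.0 + sq_dist(x, y) / (2.0 * sigma)) ** (-sigma)


def k_rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    x, y = [0.0, 0.0], [4.0, 0.0]
    print(k_rq(x, y), k_rbf(x, y))
```

The rational quadratic kernel decays polynomially in the distance while the Gaussian RBF kernel decays exponentially, so distant points retain a non-negligible kernel value under the former.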
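We apply abrupt changes to the noise distribution N_1 of the state process. The MMD-CUSUM test is applied to the observation process Y only. To our surprise, the ARL under the regular Gaussian case maintains an exponential relationship with the threshold, which suggests that the ARL bound for α/β-mixing processes can be improved. We discuss the difficulty associated with this improvement in Section VI.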

A. Unbiased MMD estimator
The following unbiased estimator of the squared MMD, introduced in [7], can also be used to replace Equation 3. We write the unbiased estimator of the squared MMD between µ, ν using m samples from µ and n samples from ν as where We abuse the notation here and write the empirical squared MMD as the squared MMD between empirical measures, although the latter is not equivalent to the unbiased estimator. Due to the unbiasedness, the estimator is not always non-negative and thus should be plugged directly into the partial sum without taking the square root. To adapt the unbiased estimator of $\mathrm{MMD}_k^2$ to the current framework, it suffices to obtain a consistency result such as Lemma 1, and the rest should follow. Consider two independent stochastic processes X = {X_i} and X′ = {X′_i} with stationary distributions µ, ν and summable kernel mixing coefficients as in Definition 6. Suppose we use m consecutive samples from X and n consecutive samples from X′. Then, we can bound the estimation bias caused by the dependency between samples as follows, where the second inequality comes from [43, Lemma 7.1], Σ_µ, Σ_ν are defined in Definition 6, and the expectations are taken with respect to the randomness in the samples. After denoting $\tilde{C}_{\mu,\nu}(m, n) := \Sigma_\mu/m + \Sigma_\nu/n$ and replacing $C_{\mu,\nu}(m, n)$ with $\tilde{C}_{\mu,\nu}(m, n)$ throughout the paper, the same set of results also holds for the CUSUM statistic defined with the unbiased estimator.
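As an illustration of why the square root cannot be taken, the sketch below implements the widely used U-statistic form of the unbiased squared-MMD estimator. It is an assumption that this is the form intended by [7]; the function names and the choice of a Gaussian RBF kernel are for illustration only.

```python
import math


def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel, used here only as an example kernel."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, kernel):
    """U-statistic estimate of the squared MMD (assumed standard form).

    Diagonal terms k(x_i, x_i) and k(y_j, y_j) are excluded, which removes
    the bias but allows negative values.
    """
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples from each distribution")
    kxx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    kyy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    kxy = sum(kernel(x, y) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)


if __name__ == "__main__":
    same = [[0.0], [1.0]]
    # Identical samples give a negative estimate, so no square root exists.
    print(mmd2_unbiased(same, same, rbf))
    print(mmd2_unbiased([[0.0], [0.1]], [[5.0], [5.1]], rbf))
```

With two identical two-point samples, the estimate equals k(0, 1) − 1, which is strictly negative for any kernel with k(x, x) = 1 and k(0, 1) < 1.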

B. Computational complexity
The time complexity at each time step is O(rh), where r is the block size on the incoming data and h is the size of the reference dataset. Compared to the overlapping block design in [5] with time complexity O(r²h), the non-overlapping block design here increases the speed at the expense of incurring a constant detection delay. The memory usage is constant over time since only the current block and the reference data need to be stored. We present the implementation of the detection procedure in Algorithm 1.
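The cost structure of a non-overlapping block design can be sketched as follows. This is not the paper's Algorithm 1: it assumes a standard CUSUM recursion with a reset at zero, a biased (V-statistic) MMD estimate, and no minimum burn-in period, and the class name `BlockCusum` is hypothetical. The reference Gram term is computed once, so each completed block only costs the cross term with the reference data plus the within-block term.

```python
import math


def k_rq(x, y, sigma=1.0):
    """Rational quadratic kernel as defined in Section V."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return (1.0 + d2 / (2.0 * sigma)) ** (-sigma)


def mean_gram(xs, ys):
    """Average kernel value over all pairs (x, y)."""
    return sum(k_rq(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


class BlockCusum:
    """Hypothetical non-overlapping block CUSUM detector (illustrative)."""

    def __init__(self, reference, block_size, offset, threshold):
        self.reference = reference
        self.block_size = block_size
        self.offset = offset
        self.threshold = threshold
        # Computed once: O(h^2) work that is not repeated per block.
        self.ref_term = mean_gram(reference, reference)
        self.block = []
        self.stat = 0.0

    def update(self, y):
        """Feed one observation; return True if the statistic exceeds the threshold."""
        self.block.append(y)
        if len(self.block) < self.block_size:
            return False
        blk, self.block = self.block, []
        # Per block: O(r^2) within-block term plus O(r h) cross term.
        mmd2 = mean_gram(blk, blk) + self.ref_term - 2.0 * mean_gram(blk, self.reference)
        mmd = math.sqrt(max(mmd2, 0.0))  # guard against float round-off
        # Assumed CUSUM recursion with offset and reset at zero.
        self.stat = max(0.0, self.stat + mmd - self.offset)
        return self.stat > self.threshold
```

Only the current block, the reference data, and two scalars are stored, which matches the constant memory usage noted above.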

C. Connection to HMM
Hidden Markov models (HMM) cover a wide array of real-world scenarios where the MMD-CUSUM test can be applied. For a comprehensive review of HMM, please refer to [54] and the references therein. Change point detection for HMM arises in the monitoring of complex dynamic systems [55], such as communication networks [56], power plants [57], healthcare monitoring [58], manufacturing process monitoring [59], distributed machine learning systems, etc.
For change detection, an HMM can be treated as a mixing process. Consider a Markov chain X := {X_i} ⊂ X and its observation process Y := {Y_i} ⊂ Y, where Y is a complete separable metric space with Borel σ-algebra Y. Define the observation kernel Then, Y is α/β/ϕ-mixing whenever X is α/β/ϕ-mixing [45, Theorem 3.12].

D. Asymptotically stationary processes
In practice, many mixing processes may not be strictly stationary but converge towards a stationary distribution at a certain speed. For example, a Doeblin chain may start from an initial distribution that differs from its stationary distribution. Weak asymptotic stationarity was introduced in [60] to study the generalization bound of online algorithms. It combines the convergence to a stationary distribution and β-mixing into a single condition, which we choose not to include for the sake of simplicity. Instead, we provide a discussion on how to adapt Proposition 4 to asymptotically stationary processes in the supplementary materials. The adaptation of Proposition 5 follows a similar argument. The intuition is that as long as the process converges sufficiently fast, the concentration of the partial sum will still hold. Thus, the same results on ARL and ADD can be extended to asymptotically stationary processes at no cost.

E. Obtaining an Ω(exp(b)) bound on ARL under α/β-mixing
As shown in Figures 1a and 1b, the difference in ARL between α/β-mixing and ϕ-mixing is minimal, which might indicate a tighter Ω(exp(b)) bound on ARL under α/β-mixing. This would be an improvement over the Ω(exp(b^{1/γ})) bound in Theorem 2, where γ controls the mixing speed. However, the difficulty lies in the unavailability (to the best of our knowledge) of a subGaussian tail bound with linear dependence on the sample size n for stationary α/β-mixing processes. This bottleneck is also reported by a recent study [61] on the concentration of kernel density estimators with dependent data. Their findings are limited to ϕ-mixing processes due to the same issue. Circumventing this bottleneck might require significantly new techniques, which are left as future work.

VII. CONCLUSION
In this paper, we derive bounds on the ARL and ADD for the MMD-CUSUM test under three stationary mixing conditions. Under the ϕ-mixing condition, the performance of the MMD-CUSUM test is shown to match the i.i.d. case and the Markov chain case with uniform ergodicity. As a byproduct, we provide concentration inequalities for the partial sum of bounded functionals under α-, β-, and ϕ-mixing processes. To our knowledge, the concentration inequality in Proposition 5 and the proof of Proposition 4 are novel.
We note the limitations of this study and future directions as follows. MMD is known to separate poorly between probability measures that differ only in the high-frequency region [39]. The MMD-CUSUM test may experience performance degradation in such scenarios. A recent study [62] tackles this problem in the kernel two-sample test setting via kernel spectral regularization. The spectral regularized kernel achieves the optimal minimax separation boundary, which results in an improved sample efficiency compared to the usual kernel two-sample test. Additionally, there have been several other exciting developments on the kernel two-sample test [63,64,65]. It would be an interesting future direction to adapt those methods to the sequential test setting and analyze their performance.
Another limitation is that our technique does not exploit the finer structures produced by the max operator over the partial sum. The theory of extremes of random fields [66] provides handy tools to estimate the probability of events such as {sup_{θ∈Θ} S_θ > ϵ}, where S_θ is the sum of n random variables in the random field and Θ is an index set, such as the integers or the real numbers. The work in [12] has demonstrated the utility of this technique in the i.i.d. case and shown a sharp ARL bound of O(exp(b²)). However, the extension of this technique has yet to be explored in the non-i.i.d. cases. Additionally, leveraging the martingale property of the MMD-CUSUM statistic with an unbiased estimator and the non-asymptotic version of the law of the iterated logarithm for martingales [67] yields another possible route to establish the performance bounds. We plan to investigate these directions in the future.
Lemma 7. Suppose {X_i} is a stationary ϕ-mixing process. Let g : X^∞ → R be an essentially bounded function that is measurable with respect to the σ-algebra $\mathcal{X}_{t+n}^{\infty}$. Then where $x_{-\infty}^{t}$, $y_{-\infty}^{t}$ are two realizations of the trajectory up to time t, and span(g) ≤ ∥g∥_∞ when g is non-negative.
Proof: where the first inequality is due to the triangle inequality and the second is due to Definition 3 of ϕ-mixing and Lemma 6.
Lemma 8 (Corollary 2.2 in [45]). Suppose {X_i} is a stationary α-mixing process. Suppose g_0, ..., g_l are essentially bounded functions, where g_i depends only on X_{ik}. Then where span(g_i) ≤ ∥g_i∥_∞ when g_i is non-negative.
Lemma 9 (Theorem 2.1 in [45]). Suppose {X_i} is a stationary β-mixing process. Suppose g_0, ..., g_l are essentially bounded functions, where g_i depends only on X_{ik}. Then where span(g_i) ≤ ∥g_i∥_∞ when g_i is non-negative.
Lemma 10 (Lemma 8.1 in [69]). Let X be a random variable such that a ≤ X ≤ b almost surely. Then, for r > 0,
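For orientation, Hoeffding's Lemma is usually stated in the following form for a random variable bounded in [a, b]. This is the standard textbook statement; the exact arrangement used in [69] is an assumption and may differ.

```latex
% Standard form of Hoeffding's Lemma (assumed to match Lemma 10 up to rearrangement).
\[
  \mathbb{E}\!\left[ e^{\,r\,(X - \mathbb{E}[X])} \right]
  \;\le\;
  \exp\!\left( \frac{r^{2}\,(b-a)^{2}}{8} \right),
  \qquad a \le X \le b \ \text{a.s.}
\]
```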

B. Proof of Proposition 4
We show a generalized version of Proposition 4 with time-dependent functions in Proposition 11, that is, the partial sum S_n of interest is replaced by $\tilde{S}_n = \sum_{i=0}^{n-1} f_i(X_i)$, where the f_i are potentially different. The key technique employed here is the martingale decomposition of the partial sum process generated by any stochastic process. In Lemma 12, we demonstrate the martingale decomposition. In Lemma 13, we establish that the martingale difference is bounded under the ϕ-mixing condition in Definition 5. Finally, we give the proof of Proposition 11 using the two supporting lemmas.
Proposition 11. Let X be a stationary ϕ-mixing process with coefficient satisfying Definition 5. Assume that Then for ϵ ≥ 0, it holds where Φ is defined in Definition 5 and {A_0, ..., A_{n−1}} is defined in Equation 13.
First, we give the martingale decomposition of the partial sum process generated by any stochastic process.
Lemma 12. Let X = {X_i} be a stationary stochastic process on the probability space (X^∞, X^∞, P). Then, there exists a martingale difference sequence where the expectations are taken w.r.t. the stationary distribution.

Proof:
We employ the martingale decomposition technique introduced in Chapter 23 of [68]. For the following develop- [...] on the desired quantity. Taking the moment generating function on both sides of Equation 11 and applying the chain rule for conditional expectation recursively yield for θ ≥ 0 From Lemma 13, we know that D_i lies in an interval of length A_i for all i ∈ {−1, ..., n − 1}. By Hoeffding's Lemma [68, Lemma 23.1.4] for bounded martingale difference sequences, we have for θ ≥ 0, and plugging the above into Equation 15 yields Applying Markov's inequality to the left-hand side, we have Picking $\theta = 4n\epsilon / \sum_{i=-1}^{n-1} A_i^2$ minimizes the right-hand side and yields The tail probability of the other side can be bounded analogously. Therefore, we have This completes the proof after noticing that X is stationary and hence
Remark 1. For asymptotically stationary processes, the marginal distribution of X_i differs from the stationary distribution µ but converges to µ as i → ∞. To consider the tail probability of $\tilde{S}_n$ centered around $\sum_{i=0}^{n-1} \mu(f_i)$, we apply the triangle inequality to (a) and Equation 16 to (b) and obtain Thus a similar tail bound can be obtained after assuming there exists a constant upper bound for $\sum_{i=0}^{n-1} \mathrm{span}(f_i)\, 2\,\mathrm{TV}(P_i, \mu)$ for all n.
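The choice of θ in the proof of Proposition 11 can be checked with a short Chernoff computation. The sketch below assumes that the martingale differences D_i lie in intervals of length A_i, that Hoeffding's Lemma is applied conditionally to each D_i, and that the deviation level is nϵ; the symbols M_n and V_n are introduced here only for illustration, and the constants in the proposition itself are not reproduced.

```latex
% Illustrative check of the minimizer; M_n and V_n are notation introduced here.
\begin{align*}
  \mathbb{P}\{ M_n \ge n\epsilon \}
    &\le e^{-\theta n \epsilon}\, \mathbb{E}\big[ e^{\theta M_n} \big]
     \le \exp\!\Big( -\theta n \epsilon + \frac{\theta^{2}}{8} V_n \Big),
     \qquad M_n := \sum_{i=-1}^{n-1} D_i,\quad V_n := \sum_{i=-1}^{n-1} A_i^{2}, \\
  0 &= \frac{d}{d\theta}\Big( -\theta n \epsilon + \frac{\theta^{2}}{8} V_n \Big)
     = -n\epsilon + \frac{\theta}{4} V_n
     \;\Longrightarrow\;
     \theta^{\star} = \frac{4 n \epsilon}{V_n}, \\
  -\theta^{\star} n \epsilon + \frac{(\theta^{\star})^{2}}{8} V_n
    &= -\frac{4 n^{2} \epsilon^{2}}{V_n} + \frac{2 n^{2} \epsilon^{2}}{V_n}
     = -\frac{2 n^{2} \epsilon^{2}}{V_n}.
\end{align*}
```

Under these assumptions, the one-sided tail is bounded by exp(−2n²ϵ²/V_n), and the minimizer agrees with the value of θ picked in the proof.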

C. Proof of Proposition 5
We modify the proof of [47, Theorem 2] by replacing Bernstein's inequality with Hoeffding's Lemma (Lemma 10) in bounding the Φ_1 term to yield the desired result for our purpose.

Proof:
Given an integer n, choose any integer k ≤ n and define l = ⌊n/k⌋. Let p = n − kl and define the index sets I_i for i = 1, 2, ..., k as follows Note that ∪_i I_i = {1, ..., n} and within each set I_i the elements are pairwise separated by at least k.
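One construction that satisfies the stated properties is to split {1, ..., n} into residue classes modulo k. The sketch below assumes this standard blocking scheme, since the displayed definition is not reproduced here; the function name `index_sets` is illustrative.

```python
def index_sets(n: int, k: int):
    """Split {1, ..., n} into k sets whose elements are spaced exactly k apart.

    Assumed construction: I_i = {i, i + k, i + 2k, ...} intersected with {1, ..., n}.
    With l = n // k and p = n - k * l, the first p sets have l + 1 elements
    and the remaining k - p sets have l elements.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return [list(range(i, n + 1, k)) for i in range(1, k + 1)]


if __name__ == "__main__":
    n, k = 10, 3
    for i, block in enumerate(index_sets(n, k), start=1):
        print(f"I_{i} = {block}")
```

Every set has at least l elements, which is the property |I_i| ≥ l used later in the proof when bounding Φ_1.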
Now, we write the moment generating function of $\sum_{i=1}^{n} G_i / n$ for r ≥ 0, which can be bounded as follows using the convexity of the exponential function, We now bound the right-hand side in the following fashion. For i = 1, 2, ..., k, we have For convenience, we denote the first term on the right-hand side of the above as Φ_1 and the second term as Φ_2. We bound them separately. Φ_1 can be estimated with Hoeffding's Lemma (Lemma 10) for bounded random variables. For r > 0, where (a) is due to stationarity and (b) comes from Lemma 10. Note that |I_i| ≥ l for i = 1, 2, ..., k, thus we have Φ_2 can be bounded by the β-mixing inequality in Lemma 9 and the exponential β-mixing condition (Definition 4). where (a) is due to Lemma 9 and (b) is due to Definition 4 and the fact . Now, we plug Equations (19) and (20) into Equation (18), Since |I_i| and k are free variables, we add some structure to simplify the right-hand side of the above. First, we require Then, we have which holds for $0 < r \le \frac{4l}{\mathrm{span}(f)} \le \frac{4|I_i|}{\mathrm{span}(f)}$ for i = 1, ..., k. Plugging the above back into Equation 17 and using the fact Applying Markov's inequality, we have for ϵ > 0 The right-hand side achieves its minimum w.r.t. r when Plugging the minimizer into Equation (21) yields Replacing l by $\tilde{n} = \lfloor n/k \rfloor = \lfloor n \lceil (10n/c)^{1/(\gamma+1)} \rceil^{-1} \rfloor$ gives the desired result.
A similar proof gives an analogous Hoeffding-type inequality for exponential α-mixing processes, which is of independent interest. We document it here for completeness.
Proposition 14. Let X be a stationary α-mixing sequence with coefficient satisfying Definition 4. Assume that f : Y → R has bounded span, i.e., span(f) < ∞. Let $S_n = \sum_{i=0}^{n-1} f(X_i)$. Then, for all ϵ ∈ (0, span(f)), it holds where $\tilde{n} = \lfloor n \lceil (10n/c)^{1/(\gamma+1)} \rceil^{-1} \rfloor$ and c, γ are defined in Definition 4.

Proof:
The proof is the same as that of Proposition 5 except for the part where Φ 2 is estimated.In the α-mixing case, Φ 2 can be bounded by Lemma 8 and exponential α-mixing condition (Definition 4).

Proof:
To determine the lower bound for ARL, we condition on the event that the change point τ is ∞. We use E_∞ and P_∞ to denote the expectation and the probability under τ = ∞. For threshold b > 0, minimum burn-in period M, and stopping rule T(b, M) in Equation (5), the ARL reads where (a) is due to the majorization of the event {T(b, M) = t} by {ŝ_t > b}, (b) is due to the application of the union bound, and M < L < ∞ is an integer constant.
To further lower-bound the right-hand side of the above, we consider the tail probability in (24). Due to stationarity, we can study $P_\infty\{s_{1:t} \ge b\}$ for some t ≥ 1 without loss of generality. Suppose we pick the offset parameter ∆ = C(r, h) + δ for some δ > 0. By Lemma 1, we know that [...] The right-hand side achieves its maximum when L = L*, where L* is obtained as the largest solution of
where C(r, h) = O(1/r + log log h / h) and E[·|D_h] denotes the expectation taken over the randomness in μ̂_r conditioned on the reference dataset D_h.
Figure 1: (a) log10(ARL) with Gaussian noise. (b) log10(ARL) with truncated Gaussian noise. (c) ADD under mean shift. (d) ADD under variance change. Labels: b, offset parameter ∆.
The MMD-CUSUM test is applied to the observation process Y only. The noisy observation poses an additional layer of challenge for the detector. To simulate an α/β-mixing process, it suffices to use the regular Gaussian noise. To simulate a ϕ-mixing process, we sample from the same Gaussian distribution and reject the samples falling outside the [−1, 1]^4 box. The random seeds are kept the same across the regular and the truncated cases to ensure comparability. The log(ARL) under both cases is shown in Figure 1a and Figure 1b. To our surprise, the ARL under the regular Gaussian case maintains an exponential relationship with the threshold, which suggests that the ARL bound for α/β-mixing processes can be improved.

D. Proof of Theorem 2
1) Case 1: β-mixing: When X is exponential β-mixing satisfying Definition 4, we show the lower bound of ARL as follows. For exponential α-mixing processes satisfying Definition 4, the proof follows the same procedure after replacing Proposition 5 with Proposition 14.