Latent Neural Stochastic Differential Equations for Change Point Detection

Automated analysis of complex systems based on multiple readouts remains a challenge. Change point detection algorithms aim to locate abrupt changes in the time series behaviour of a process. In this paper, we present a novel change point detection algorithm based on Latent Neural Stochastic Differential Equations (SDE). Our method learns a non-linear deep learning transformation of the process into a latent space and estimates an SDE that describes its evolution over time. The algorithm uses the likelihood ratio of the learned stochastic processes at different timestamps to find change points of the process. We demonstrate the detection capabilities and performance of our algorithm on synthetic and real-world datasets. The proposed method outperforms the state-of-the-art algorithms in the majority of our experiments.


I. INTRODUCTION
Recognition of state changes of complex systems is a common task in data analysis. The system's behaviour is represented by a signal produced by continuous monitoring with multiple sensors. The task of unsupervised detection of abrupt changes in the signal forms a standalone field of research in time series analysis, and is called change point detection (CPD). CPD arises in many applications such as production quality control [1], chemical process control [2], detection of climate changes [3], human motion and health state analysis [4], aircraft monitoring [5], vibration monitoring of mechanical systems [6], seismic signal processing [7], detection of cyberattacks [8], video scene analysis [9], audio signal segmentation [10], and many others [11].
There are numerous CPD algorithms, such as subspace methods [12], [13], [14], [15], [16], probabilistic methods [17], [18], window-based methods [19], [20], clustering [21], stochastic differential equation-based methods [22], [23], etc. However, most of these methods are limited in time series dimension (and are applicable to univariate time series only), change point types, complexity (most CPD algorithms are still learning-free), or robustness. Other methods, like [24] and [25], utilize deep learning (DL) power, but provide unstable results with suboptimal quality. At the same time, despite the development of deep learning, most of the conventional CPD approaches remain unchanged and do not use the full power of DL.

The associate editor coordinating the review of this manuscript and approving it for publication was Dost Muhammad Khan.
In this work, we propose the first SDE-based likelihood-ratio CPD algorithm that utilizes the full power of DL. Namely, we use deep neural networks to learn a Latent Stochastic Differential Equation (SDE) [26], [27] that approximates the time series, so that the continuous flow of the time series dynamics is described by the Latent SDE. In previous works [22], [23], SDEs are applied to a limited set of change point types and use a strictly limited parametric set of drift and diffusion functions. In contrast, we develop an unsupervised CPD algorithm for arbitrary-dimensional signals that uses deep learning to find the appropriate SDE and approximate its solution. The method has no limitations on the set of drift and diffusion functions or change point types. We evaluate the performance of the algorithm on open CPD benchmarks: TCPD [28], SKAB [29], TEP [30], and TSSB [21], and compare the results with available state-of-the-art methods.
The work has the following structure. Section II gives the problem statement, describes related works and background, and presents the Latent SDE fit using deep neural networks. The proposed change point detection algorithm and data processing are described in Section III. Section IV defines the quality metrics that we use to compare our method with others. The experimental results and their discussion are provided in Sections V and VI respectively. Finally, the conclusion with the main results of this work is presented in Section VII.

II. BACKGROUND, MOTIVATION, AND CHALLENGES
This section contains the CPD problem statement along with an overview of the existing CPD methods. We also provide here all necessary background information for our work, including the Latent SDE inference framework.
A. PROBLEM STATEMENT
Consider a d-dimensional time series, where each observation at a moment t is described by a vector of features $x_t \in \mathbb{R}^d$. We assume that all observations with $t < \nu$ are sampled from a distribution P, and all observations with $\nu \le t$ are sampled from a distribution $Q \ne P$. The moment $\nu \ge 1$ when the distribution changes is called a change point. In other words, the change point is the moment when a time series changes its behaviour. Several change points are illustrated in Fig. 1. The goal is to detect all change points in time series data. Usually, this is an unsupervised problem in statistics and machine learning due to the absence of the true positions of such points. In this work, we study this problem in such an unsupervised setting.
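To make the problem statement concrete, the following sketch generates a univariate series whose distribution changes at two moments, mirroring the setting of Fig. 1 (change points at 200 and 400). The segment means, the noise level, and the function name `simulate_mean_shifts` are illustrative assumptions, not values taken from the experiments.

```python
import random
import statistics


def simulate_mean_shifts(segment_means, segment_length, noise_std, seed):
    """Concatenate Gaussian segments; return the series and the change point indices."""
    rng = random.Random(seed)
    series, change_points = [], []
    for k, mean in enumerate(segment_means):
        if k > 0:
            # First index whose observation is drawn from the new distribution.
            change_points.append(len(series))
        series.extend(rng.gauss(mean, noise_std) for _ in range(segment_length))
    return series, change_points


# Hypothetical parameters: three segments of 200 points each.
series, change_points = simulate_mean_shifts(
    segment_means=[0.0, 3.0, -1.0], segment_length=200, noise_std=1.0, seed=0
)
segment_averages = [
    statistics.fmean(series[start:start + 200]) for start in (0, 200, 400)
]
```

A detector operating in the unsupervised setting receives only `series`; `change_points` plays the role of the ground truth used for evaluation.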

B. RELATED WORK
Change point detection is a long-studied problem. The first works on change point detection date back to the 1950s [31], [32] and address detecting a shift in the mean value of a signal for quality control of manufacturing processes. In the following decades, a range of change point detection methods was developed that can be split into several groups based on cost function, search method, and additional constraints [33].
One such group is the set of subspace methods. This line of CPD algorithms is based on subspace analysis of the original time series, which has a strong connection with system identification, a method that has been thoroughly studied in control theory [12]. Some subspace methods, such as subspace identification (SI) [13] and singular spectrum transformation (SST) [15], [16], are based on classical approaches, for example, matrix and SVD transformations of the time series. Others use more complicated neural network projections to a time-invariant subspace [24].
Another common group of change point detection methods is based on a comparison of the empirical probability density distributions before and after change points. The CUSUM [32], [34] algorithm assumes that these distributions are known and detects a change point using a sequential hypothesis test procedure. The GLR [11], [35] method supposes that the parameters of the distribution after the change point are unknown and estimates them by likelihood maximization. Change forest [36] is a likelihood ratio classifier based on random forests that uses class probability predictions to compare different change point configurations. Generally, these methods use hypothesis comparison tests, comparing null and alternative hypothesis likelihoods at each point [37], [38], [39], [40], [41]. Usually [17], [35], [42], the null hypothesis suggests that no change point occurs at the timestamp, whereas the alternative hypothesis suggests that a change point is present at the timestamp.
The next group of methods is based on the estimation of some statistics. For instance, a Gaussian process (GP) is a probabilistic method for the analysis and prediction of stationary time series [18]. Another method, called Bayesian Change Point Detection (BCPD) [17], estimates the posterior distribution over an auxiliary variable, the run length $r_t$, which represents the time that has elapsed since the last change point. Given the run length at a time instant t, the run length at the next time point can be either reset back to 0, if a change point occurs at this time, or increased by 1, if the current state continues for one more time unit.
A range of CPD approaches is based on direct estimation of the probability density ratio for two samples without the need to know the individual densities [43]. One of the first such algorithms uses logistic regression on RBF kernels [44]. Later, other methods based on RBF kernels were proposed: KLIEP [45], uLSIF [46], and RuLSIF [47]. Their applications in change point detection are described in [12], [14], and [48].
Another group of methods also splits the time series into windows and then uses a kernel-based statistical test to assess the homogeneity between subsequent windows [19]. One such recent deep learning kernel-based method is called KL-CPD [25]. It uses the maximum mean discrepancy (MMD) statistical test [20] between the distributions P and Q of windows before and after the change point respectively (see Section II-A).
One more group of CPD methods is based on cost functions [49]. These methods estimate the discrepancy score between two segments of a time series by comparing cost function values before and after splitting the segment into two by a change point. The most popular algorithms of this group are Binseg [50], Pelt [51], [52], and Window [49].
Graph-based CPD methods first infer a graph by mapping observations (i.e., windows or sets of time series) to nodes and connecting nodes by edges if their pairwise similarity exceeds a predefined threshold. Next, a bespoke graph statistic is applied to split the graph into subgraphs, leading to change points in the time series [53]. From a different perspective, the problem of change point detection can be considered as a clustering problem with a known or unknown number of clusters, such that observations within clusters are identically distributed, and observations between adjacent clusters are not. If a data point at the time stamp t belongs to a different cluster than the data point at the time stamp t + 1, then a change point occurs between the two observations. One such recently introduced method is the Classification Score Profile (ClaSP), which performs segmentation of the time series [21] using a KNN classification procedure.
However, most of the aforementioned CPD algorithms do not use the power of deep learning, tend to overfit on time series outliers, or can be used in a supervised setting only. We believe that the main reasons for this are the lack of labeled data in most real-world tasks, the specifics of the data in each individual CPD case with the resulting complexity of transfer learning, and the difficulty of generalizing the aforementioned conventional approaches to DL.
In this work, we evolve conventional SDE approaches [22], [23] and propose a novel robust likelihood ratio method based on Latent Stochastic Differential Equations for the change point detection problem. Unlike the previous algorithms, the proposed method combines the conventional, more limited SDE-based time series analysis approach with the full power of deep learning. The method is entirely unsupervised and requires no labeled change points in the training dataset.
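Then, a stochastic process $\{x_t\}_{t \in \mathbb{T}}$, $x_t \in \mathbb{R}^D$, can be defined by an Itô SDE:

$$x_T = x_0 + \int_0^T \mu(x_t, t)\,dt + \int_0^T \sigma(x_t, t)\,dw_t, \quad (2)$$

if $x_0$ is independent of the σ-algebra generated by $w_t$, and where $\mu(x_t, t)$ is the drift function, $\sigma(x_t, t)$ is the diffusion function, and

$$\int_0^T \sigma(x_t, t)\,dw_t \quad (3)$$

is the Itô stochastic integral [54]. When the functions are globally Lipschitz, that is,

$$\|\mu(x, t) - \mu(y, t)\| + \|\sigma(x, t) - \sigma(y, t)\| \le L \|x - y\|$$

for some constant $L > 0$, there exists a unique t-continuous strong solution to (2) [26], [54].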
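As an illustration of how a trajectory of an Itô SDE can be approximated numerically, the sketch below uses the Euler-Maruyama scheme in one dimension. The mean-reverting drift, the constant diffusion, the step count, and all function names are illustrative assumptions; they are not the drift and diffusion networks used in this work.

```python
import math
import random


def euler_maruyama(mu, sigma, x0, t0, t1, n_steps, rng):
    """Approximate one path of dx = mu(x, t) dt + sigma(x, t) dw on [t0, t1]."""
    dt = (t1 - t0) / n_steps
    x, t = x0, t0
    path = [x0]
    for _ in range(n_steps):
        # Increment of the Wiener process over one step: dw ~ N(0, dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + mu(x, t) * dt + sigma(x, t) * dw
        t += dt
        path.append(x)
    return path


# Hypothetical drift and diffusion, both globally Lipschitz in x.
def drift(x, t):
    return -2.0 * x  # mean reversion towards zero


def diffusion(x, t):
    return 0.5  # constant volatility


path = euler_maruyama(drift, diffusion, x0=1.0, t0=0.0, t1=1.0,
                      n_steps=1000, rng=random.Random(0))

# With the diffusion switched off the scheme reduces to the explicit Euler
# method for dx/dt = -2x, whose exact solution at t = 1 is exp(-2).
noise_free = euler_maruyama(drift, lambda x, t: 0.0, x0=1.0, t0=0.0, t1=1.0,
                            n_steps=1000, rng=random.Random(1))
```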
In case of a high dimension D of the time series X, SDEs can be used more efficiently on latent spaces $\{z_t \in \mathbb{R}^d \mid d \le D\}$ [26], [27]. In [27], the authors propose an efficient variational inference framework for such latent SDE models. In particular, given observations X, they parameterize both a prior over functions and an approximate posterior of the latent Z using SDEs:

$$dz_t = \mu_\theta(z_t, t)\,dt + \sigma(z_t, t)\,dw_t, \quad \text{(prior)}$$

$$dz_t = \mu_\phi(z_t, t)\,dt + \sigma(z_t, t)\,dw_t, \quad \text{(approx. posterior)}$$

where $\mu_\theta$, $\mu_\phi$, and $\sigma$ are Lipschitz in both arguments. $\mu_\theta$ is a prior drift function with prior parameters θ, $\mu_\phi$ is an approximate posterior drift function with variational parameters φ, ψ is an encoder, and f is a decoder. In such a setting, the evidence lower bound (ELBO) can be written [27] as:

$$\log p(x_1, \ldots, x_N \mid \theta) \ge \mathbb{E}_{z_t}\left[\sum_{i=1}^{N} \log p(x_{t_i} \mid z_{t_i}) - \int_0^T \frac{1}{2} |u(z_t, t)|^2\,dt\right], \quad (5)$$

where $u(z_t, t)$ satisfies

$$\sigma(z_t, t)\,u(z_t, t) = \mu_\phi(z_t, t) - \mu_\theta(z_t, t), \quad (6)$$

and where the expectation is taken over the approximate posterior process defined by (approx. posterior). The term with $u(z_t, t)$ can be considered as a Kullback-Leibler (KL) divergence between the approximate posterior and the prior, which penalizes the approximate posterior when it is far from the prior at a point $(z_t, t)$. The likelihoods of observations $x_1, \ldots, x_N$ at times $t_1, \ldots, t_N$ depend solely on the latent states $z_t$ at the corresponding times and a predefined likelihood function $p(x_t \mid z_t)$. In practice, this function is set [26] as a Gaussian distribution:

$$p(x_t \mid z_t) = \mathcal{N}(x_t \mid f(z_t), C), \quad (7)$$

where f is the decoder, and $\mathcal{N}(x_t \mid f(z_t), C)$ is a Gaussian p.d.f. with mean $f(z_t)$ and diagonal covariance matrix C.
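The two ingredients of the ELBO, the Gaussian observation likelihood and the KL-type path penalty, can be sketched for a single sampled latent path as follows. This is a minimal illustration under stated assumptions: a scalar latent state, a constant diffusion `sigma`, a Riemann-sum approximation of the time integral of one half of u squared, where u is the difference of the posterior and prior drifts divided by the diffusion as in the framework of [27], and function names that are purely hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, variances):
    """log N(x | mean, diag(variances)) for plain Python sequences."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def kl_path_penalty(posterior_drifts, prior_drifts, sigma, dt):
    """Riemann-sum estimate of the integral of 0.5 * u^2 along a scalar latent
    path, where u = (posterior drift - prior drift) / sigma."""
    penalty = 0.0
    for mu_q, mu_p in zip(posterior_drifts, prior_drifts):
        u = (mu_q - mu_p) / sigma
        penalty += 0.5 * u * u * dt
    return penalty


def elbo_single_path(observations, decoded_means, variances,
                     posterior_drifts, prior_drifts, sigma, dt):
    """One-sample ELBO estimate: data log-likelihood minus the path penalty."""
    log_lik = sum(
        gaussian_log_likelihood(x, m, variances)
        for x, m in zip(observations, decoded_means)
    )
    return log_lik - kl_path_penalty(posterior_drifts, prior_drifts, sigma, dt)
```

Averaging `elbo_single_path` over several sampled posterior paths gives a Monte-Carlo estimate of the expectation in the bound.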

D. CHANGE POINT DETECTION USING SDE
In [22], the authors propose an SDE model of a volatility change point in which the diffusion term is scaled by a coefficient θ, and find the time ν at which this coefficient changes, $\theta_{\nu-1} \ne \theta_\nu$. However, one of the main limitations of the method is that volatility may change in a more complicated way than by a multiplier θ.
In [23], a more complicated volatility change point model is used, in which the diffusion function $\sigma(\cdot; \theta)$ depends on a parameter θ. Here, the time when the diffusion parameter θ changes is also considered as a change point.
However, the aforementioned CPD algorithms [22], [23] still have a set of limitations. First, they are designed to detect volatility change points only. Second, they are strictly limited to a fixed set of drift and diffusion functions $\mu(\cdot)$ and $\sigma(\cdot; \theta)$.
For instance, [55] and [56] each fix a specific parametric form of the drift and diffusion functions.

III. PROPOSED METHOD
In contrast to the previous works [22], [23], we propose a new CPD method based on Latent SDE, which approximates the drift and diffusion functions with arbitrary neural networks. The proposed method aims to find different types of change points in an arbitrary multivariate time series with unknown drift and diffusion functions.
In the following section, we describe the proposed algorithm with all the related preprocessing and postprocessing stages.

A. PREPROCESSING
First, the original time series is preprocessed with a standard scaling technique. Very often, a time series has a trend and seasonality, and its observations $x_i$ can be autocorrelated. In some cases, these properties can complicate the detection of change points. To remove the trend and autocorrelation, we use the SARIMAX [57] model implemented in the statsmodels [58] package. We fit the model for the whole time series with the (5, 1, 0) order of AR parameters, differences, and MA parameters respectively. The prediction of the model for a time moment i is denoted as $x_i^{SARIMAX}$. Residuals of the predictions $r_i$ are used for further analysis and are estimated as

$$r_i = x_i - x_i^{SARIMAX}. \quad (14)$$

In this work, we optionally use this preprocessing as a hyperparameter of our model for each dataset. Furthermore, the positional encoding features are passed as input instead of time t [59].
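The preprocessing steps above (standard scaling followed by residuals of a forecasting model) can be sketched as follows. To stay dependency-free, this sketch swaps SARIMAX for a naive persistence forecast, where the prediction for moment i is the previous observation; this substitution and all function names are assumptions made for illustration only and do not reproduce the (5, 1, 0) model used in this work.

```python
import statistics


def standard_scale(series):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0.0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


def persistence_forecast(series):
    """Stand-in predictor: the forecast for moment i is the observation at i - 1.
    The first point has no history, so it is predicted by itself."""
    return [series[0]] + list(series[:-1])


def residuals(series, predictions):
    """Residuals r_i = x_i - prediction_i."""
    return [x - p for x, p in zip(series, predictions)]


raw = [1.0, 3.0, 6.0, 6.5, 7.0]
scaled = standard_scale(raw)
r = residuals(scaled, persistence_forecast(scaled))
```

In practice, the predictions would come from a fitted forecasting model instead of `persistence_forecast`; the residual computation itself is unchanged.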

B. POSTPROCESSING
The output of our algorithm is multidimensional, containing the scores for all the original and auxiliary dimensions from Section III-A. In the post-processing stage, we use max aggregation over all the dimensions. This choice of aggregation is justified by the fact that a change point in one dimension denotes a change point of the whole time series, so the score of the most likely per-dimension change point at each time is taken.
Moreover, we use prominence post-processing [24] to remove the duplicated peaks around the top one.
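The post-processing described above can be sketched in two steps: max aggregation of the per-dimension scores and selection of peaks by prominence. The prominence computation below is a simplified pure-Python version written for illustration; the threshold `min_prominence`, the function names, and the toy scores are assumptions, and the exact procedure of [24] may differ.

```python
def aggregate_max(scores):
    """scores[t][k] is the score of dimension k at time t; keep the largest."""
    return [max(row) for row in scores]


def prominence(signal, i):
    """Height of the peak at index i above the higher of the two lowest points
    found while walking left and right until a strictly higher value is met."""
    peak = signal[i]
    left_min = peak
    j = i - 1
    while j >= 0 and signal[j] <= peak:
        left_min = min(left_min, signal[j])
        j -= 1
    right_min = peak
    j = i + 1
    while j < len(signal) and signal[j] <= peak:
        right_min = min(right_min, signal[j])
        j += 1
    return peak - max(left_min, right_min)


def prominent_peaks(signal, min_prominence):
    """Indices of local maxima whose prominence reaches the threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]:
            if prominence(signal, i) >= min_prominence:
                peaks.append(i)
    return peaks


# Toy two-dimensional scores: a main peak at t = 2, a duplicate at t = 4,
# and a separate peak in the second dimension at t = 6.
scores = [[0, 0], [1, 0.2], [5, 1], [4.5, 0], [4.8, 0], [1, 0], [0, 3], [0, 0]]
aggregated = aggregate_max(scores)
kept = prominent_peaks(aggregated, min_prominence=1.0)
```

The duplicate peak at t = 4 sits on the shoulder of the main peak, so its prominence is small and it is discarded.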

C. ALGORITHM
In this work, we present a novel likelihood-ratio CPD method based on latent stochastic differential equations. Our change point criterion is a likelihood ratio built from $p(x \mid t)$, the probability to observe x at time point t, evaluated over a range of time lags $L > 0$, which is considered as a hyperparameter of the algorithm. To estimate the conditional likelihood $p(x_t \mid \nu)$, we use the following Monte-Carlo approximation:

$$p(x_t \mid \nu) \approx \frac{1}{N} \sum_{i=1}^{N} p(x_t \mid z^i_\nu, \nu),$$

where N trajectories $z^i_\nu \sim p(z_\nu \mid \nu)$ are sampled from the pretrained latent SDE model [27]. The SDE dynamics of the latent $Z = \{z_t\}$ is approximated by maximizing the ELBO (5) described in the previous section. If the latent trajectories $z^i_\nu$ are complex enough, the likelihood function becomes $p(x_t \mid z^i_\nu, \nu) \approx p(x_t \mid z^i_\nu)$, and can be defined as in (7). So, the CPD score (15) at each point $x_t$ is computed from the decoded trajectories $f(z^i_\nu)$, where f is a decoder, L = 5 (lags range) and C = 0.

In this section, we provide a range of theoretical properties of the proposed algorithm for change point detection.
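The Monte-Carlo likelihood estimate can be sketched as below. The code averages Gaussian likelihoods over decoded latent samples in log space and forms a log-likelihood ratio between two time stamps. It is a sketch under stated assumptions: an isotropic covariance with a strictly positive `variance`, a particular sign convention for the ratio, hypothetical function names, and no aggregation over the lags, so it is not the exact score (15).

```python
import math


def log_gaussian_isotropic(x, mean, variance):
    """log N(x | mean, variance * I) for plain Python sequences."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * variance) + sq / variance)


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def mc_log_likelihood(x_t, decoded_samples, variance):
    """log of (1/N) * sum_i N(x_t | f(z_i), variance * I), where decoded_samples
    holds the decoder outputs f(z_i) for N sampled latent states."""
    return log_mean_exp(
        [log_gaussian_isotropic(x_t, f_z, variance) for f_z in decoded_samples]
    )


def log_likelihood_ratio(x_t, decoded_at_t, decoded_at_lag, variance):
    """Assumed convention: log p(x_t | t) - log p(x_t | t - l)."""
    return (mc_log_likelihood(x_t, decoded_at_t, variance)
            - mc_log_likelihood(x_t, decoded_at_lag, variance))
```

A large ratio indicates that the observation at time t is explained much better by the latent state at t than by the lagged one, which is the behaviour expected around a change point.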
We demonstrate that the CPD($x_t$) score (15) is capable of detecting changes of mean, trend, and variance of the given signal. To show that, let us estimate an approximate analytical form of the score, which is defined in Theorem 1.
Theorem 1: Let $\nu \in \mathbb{T}$ be a time moment, $L > 0$ be the predefined time lags hyperparameter (15), and $z_\nu$ be the latent state at time ν. Consider the following normal form of the $p(f(z_\nu) \mid \nu)$ and $p(x_t \mid f(z_\nu), \nu)$ distributions:

$$p(f(z_\nu) \mid \nu) = \mathcal{N}(f(z_\nu) \mid b_\nu, \Sigma_\nu), \quad (21)$$

$$p(x_t \mid f(z_\nu), \nu) = \mathcal{N}(x_t \mid f(z_\nu), C). \quad (22)$$

Then, the change point detection score CPD($x_t$) (15) takes the following analytical form:

Proof: Let us substitute Equations (22) and (21) into the definition of $p(x_t \mid \nu)$. Since the distributions $p(x_t \mid f(z_\nu), \nu)$ and $p(f(z_\nu) \mid \nu)$ are conjugate normal, $p(x_t \mid \nu)$ is also normal:

$$p(x_t \mid \nu) = \mathcal{N}(x_t \mid b_\nu, C + \Sigma_\nu).$$

Then, the change point detection score, defined in Equation (15), takes the following form: □

This theorem shows how the change point detection score behaves with different alterations of the signal. We assume that the mean vector $b_t$ and the covariance matrix $C + \Sigma_t$ represent the observed values. Thus, all changes of the time series are reflected in the score values.
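The key step of the proof, marginalizing one Gaussian over another, can be written out as a short worked derivation. The notation is partly an assumption: $b_\nu$ and $\Sigma_\nu$ denote the mean and covariance of $p(f(z_\nu) \mid \nu)$, $S_\nu$ is a shorthand introduced here, and the final expression is the generic log-ratio of two Gaussians for a single lag $l$, not the full score of the theorem.

```latex
% Marginal likelihood: x_t = y + \varepsilon with y \sim N(b_\nu, \Sigma_\nu)
% and independent noise \varepsilon \sim N(0, C), so the covariances add.
\begin{align*}
p(x_t \mid \nu)
  &= \int \mathcal{N}(x_t \mid y, C)\, \mathcal{N}(y \mid b_\nu, \Sigma_\nu)\, dy
   = \mathcal{N}(x_t \mid b_\nu, S_\nu),
  \qquad S_\nu = C + \Sigma_\nu. \\[4pt]
% Log-likelihood ratio between time stamps t and t - l.
\log \frac{p(x_t \mid t)}{p(x_t \mid t - l)}
  &= \tfrac{1}{2} (x_t - b_{t-l})^{\top} S_{t-l}^{-1} (x_t - b_{t-l})
   - \tfrac{1}{2} (x_t - b_t)^{\top} S_t^{-1} (x_t - b_t)
   + \tfrac{1}{2} \log \frac{\det S_{t-l}}{\det S_t}.
\end{align*}
```

The first two terms respond to shifts of the mean $b$, while the determinant term and the inverse covariances respond to changes of the variance, which is why both kinds of change are visible in a likelihood-ratio score.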
Using Theorem 1, let us estimate the behaviour of the change point detection score for three popular cases. Consider a multivariate time series with a change point at time moment t. In the first case, we assume that the mean value of the signal changes at some time. In the second case, the trend of the signal changes. Finally, in the third case, the variance of the signal changes. Corollaries 1, 2, and 3 define the score values for these three change point types.
Corollary 2: Consequence of Theorem 1. Consider a multidimensional time series x with the trend jump $\Delta^2 b$ at the time t and its first difference time series $\Delta x$ with a constant covariance matrix $\Sigma$ and mean jump $\Delta b$. Then, the change point detection score CPD($x_t$) has the following form:

Corollary 2 proves that trend changes can also be detected if first difference preprocessing is performed over the original time series x and used as input to the model.

Corollary 3: Consider a multidimensional time series with the covariance change

Then, the change point detection score CPD($x_t$) has the following form:

The theorem and corollaries considered in this section provide a theoretical foundation for the proposed algorithm. They demonstrate the ability of the algorithm to detect all three main types of change points on multivariate time series and predict the score values for different cases.

A. F1
Consider a time series with several change points. Following [28], we define $\hat{T} = \{\hat{\tau}_i\}_{i=1}^{m}$ as the set of change point locations provided by a detection algorithm and let $T = \{\tau_i\}_{i=1}^{n}$ be the combined set of all human annotations. For a set of ground truth locations T, the set of true positives is defined as $TP = \{\tau_i \mid \exists\, \hat{\tau}_j : |\hat{\tau}_j - \tau_i| < M\}$, where M = 5 [28]. That means a predicted change point counts as correct only if it lies within M of a ground truth one. We also ensure that only one $\hat{\tau}_j$ can be used for a single $\tau_i$. The latter requirement is needed to avoid double-counting, and M = 5 is the allowed margin of error. Then, precision (P) and recall (R) are defined as

$$P = \frac{|TP|}{|\hat{T}|}, \qquad R = \frac{|TP|}{|T|}.$$

Then, $F_1$ is used as a quality measure of change point detection:

$$F_1 = \frac{2 P R}{P + R}.$$

The second quality metric is based on a segmentation approach, where each change point is considered as a border between two separate segments of a time series. By analogy with other segmentation tasks, the Jaccard score can be used here:

$$J(A, A') = \frac{|A \cap A'|}{|A \cup A'|},$$

where A is a ground-truth segment and A′ is a segment formed by the found change points. To expand this metric to the case of multiple segments, the authors of [28] propose the Covering metric:

$$C(G, G') = \frac{1}{T} \sum_{A \in G} |A| \max_{A' \in G'} J(A, A'),$$

where G and G′ correspond to the ground truth and algorithm segmentations, respectively, and T is the time series length. In the case of multiple annotators, the metric is averaged over the annotators' labels.
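Both metrics can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the one-to-one matching is done greedily in order of ground truth position, segments are half-open index ranges starting at 0, a single annotator is used, and all function names are hypothetical. It is not the reference implementation of [28].

```python
def count_true_positives(truth, predicted, margin=5):
    """Ground-truth points matched one-to-one by a prediction closer than margin."""
    used = set()
    matched = 0
    for tau in sorted(truth):
        candidates = [
            (abs(p - tau), j)
            for j, p in enumerate(predicted)
            if j not in used and abs(p - tau) < margin
        ]
        if candidates:
            _, best = min(candidates)
            used.add(best)  # each prediction may serve only one ground-truth point
            matched += 1
    return matched


def f1_score(truth, predicted, margin=5):
    if not truth or not predicted:
        return 0.0
    tp = count_true_positives(truth, predicted, margin)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def segments(change_points, length):
    """Split range(length) into index sets separated by the change points."""
    bounds = [0] + sorted(change_points) + [length]
    return [set(range(a, b)) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


def covering(truth, predicted, length):
    """Length-weighted average of the best Jaccard overlap per true segment."""
    predicted_segments = segments(predicted, length)
    total = 0.0
    for a in segments(truth, length):
        best = max(len(a & b) / len(a | b) for b in predicted_segments)
        total += len(a) * best
    return total / length
```

For example, with ground truth `[100, 200]` and predictions `[102, 150, 199]`, two of the three predictions are matched, so precision is 2/3, recall is 1, and `f1_score` returns 0.8.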

C. NAB
The Numenta Anomaly Benchmark (NAB) score was introduced in [61] and uses a distance-weighted score for predicted change points within and outside the predefined region around ground truth change points. The region starts at the ground truth position of a change point. Thus, all the change point predictions before the ground truth and all others outside the region are considered as false positives (FP). Within the region, only the nearest prediction is considered as a true positive (TP). All the rest of the predictions within the region are ignored.
For each predicted change point $\hat{\tau} \in \hat{T}$, the NAB score is calculated in the following way:

$$\sigma^{A}(\hat{\tau}) = (A_{TP} - A_{FP}) \left( \frac{1}{1 + e^{5 |\hat{\tau} - \tau|}} \right) - 1,$$

where $|\hat{\tau} - \tau|$ is the relative position of the detected $\hat{\tau}$ within the region. Here, the profile coefficients $A = \{A_{TP}, A_{FP}, A_{FN}, A_{TN}\}$ are predefined. In this work, we use 3 different profiles for the evaluation; the NAB scores for these three profiles are denoted as NAB.Standard, NAB.LowFP, and NAB.LowFN respectively.

We use default region sizes in the NAB score computation. In [61], the authors show that the region size has a minor impact on the final metric value and is chosen to be in the range from $\frac{5}{|T|}\%$ to $\frac{20}{|T|}\%$ of the time series length T. For the predictions $\hat{T}$, the raw score is:

$$S(\hat{T}) = \sum_{\hat{\tau} \in \hat{T}} \sigma^{A}(\hat{\tau}) + A_{FN} |FN|,$$

where |FN| is the number of false negatives (empty windows with no detections around the ground truth). Then, the final NAB score for the predictions $\hat{T}$ is written in the following rescaled form:

$$S_{NAB}(\hat{T}) = 100 \cdot \frac{S(\hat{T}) - S_{null}}{S_{perf} - S_{null}},$$

where $S_{perf}$ denotes the raw score of a "perfect" detector (one that outputs all true positives and no false positives) and $S_{null}$ denotes the raw score of a "null" detector (one that outputs no anomaly/change point detections).

D. RELATIVE CHANGE POINT DISTANCE (RCPD)
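RCPD was introduced in [60] and later used in the ClaSP change point algorithm [21]. It computes an average distance between the predictions $\{\hat{\tau} \in \hat{T}\}$ and the nearest change points $\{\tau \in T\}$:

$$RCPD = \frac{1}{T \, |\hat{T}|} \sum_{\hat{\tau} \in \hat{T}} \min_{\tau \in T} |\hat{\tau} - \tau|,$$

where the T in the denominator is the time series length, following the notation used earlier in this section.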
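The RCPD computation can be sketched as follows. The normalization by the series length and by the number of predictions follows the verbal description above ("an average distance between the predictions and nearest change points"); the function name and the error handling for empty inputs are assumptions.

```python
def rcpd(truth, predicted, length):
    """Average distance from each prediction to its nearest ground-truth change
    point, expressed as a fraction of the time series length (lower is better)."""
    if not truth or not predicted:
        raise ValueError("both sets of change points must be non-empty")
    total = sum(min(abs(p - t) for t in truth) for p in predicted)
    return total / (length * len(predicted))


# Two predictions, each 10 steps away from a true change point, in a series
# of 1000 observations: (10 + 10) / (1000 * 2).
example = rcpd(truth=[100, 200], predicted=[110, 190], length=1000)
```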

V. EXPERIMENTAL RESULTS
In this section, we describe the evaluation of our model along with a comparison against state-of-the-art CPD algorithms such as KL-CPD [25], TIRE [24], ClaSP [21], BOCPD [17], BINSEG [62], CHANGE_FOREST [36], and the ruptures [49] algorithms PELT, OPT, KERNEL, and WIN. Each algorithm is evaluated at the best threshold for each quality metric described in the previous section. If an algorithm does not return detection scores, the optimal number of predicted change points is taken for each metric. Univariate algorithms, like ClaSP and BOCPD, are evaluated on univariate datasets only. For uncertainty estimation, each algorithm was trained and evaluated on each dataset 5 times with different initialization seeds.
In the following subsections, we describe all the evaluation corpora in detail and provide aggregated result tables for each corpus. The detailed model and hyperparameter setup is described in Appendix VII-A. All other implementation details and datasets are available at our public paper repositories for source code and data respectively.

A. OUR SYNTHETIC STUDIES
To test the model performance, we first run a simple set of synthetic experiments using artificial datasets. The aim of these synthetic experiments is to check how well our algorithm detects different kinds of change points: trend changes, mean jumps, and volatility changes, in univariate and multivariate cases. Each experiment consists of noisy data generation followed by estimation of model quality and robustness.
Figure 2 shows the behaviour of the algorithm on 3 simulated change point types: mean, trend, and volatility change points.
The averaged metric values for the synthetic corpus are listed in Tables 1 and 2. For univariate datasets, our algorithm outperforms all the others on all CP metrics. For multivariate datasets, the algorithm outperforms the others on all metrics except NAB.LowFN.

TABLE 3.
Results on univariate TCPD datasets [28]. Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.

TABLE 4.
Results on multivariate TCPD datasets [28]. Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.

TABLE 5.
Results on SKAB benchmark datasets [29]. Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.
The complete description of the synthetic corpus, along with detailed case studies is listed in Appendix VII-B.

B. EVALUATION ON OPEN DATASETS
In this section, we evaluate our algorithm and compare it with baselines on open datasets for change point detection: TCPD [28], SKAB [29], TSSB [21], and Tennessee Eastman Process (TEP) [30].
The TCPD dataset consists of 37 real time series collected for the change point detection benchmark [28]. It includes 33 univariate and 4 multivariate series with manually labeled change points.
The SKAB corpus contains 35 individual data files. Each file represents a single experiment and contains a single anomaly (change point). The dataset represents a multivariate time series collected from the sensors installed on the testbed [29].

TABLE 6.
Results on TSSB benchmark datasets [21]. Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.

TABLE 7.
Results on Tennessee Eastman Process datasets [30]. Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs and 21 data samples.

The TSSB benchmark consists of 98 univariate datasets: 49 single change point time series, 22 datasets with two change points each, 10 datasets with three change points each, and 11 datasets with 4 change points each [21].
For univariate and multivariate datasets of the TCPD corpus, the results are shown in Tables 3 and 4 respectively. The results for the SKAB, TSSB, and TEP corpora are shown in Tables 5, 6, and 7 respectively.
On the TSSB benchmark, the proposed algorithm outperforms the others only on the F1 score. However, on all the other benchmarks studied, our algorithm outperforms all other algorithms.

VI. DISCUSSION
Theorem 1 and the corresponding Corollaries 1, 2, and 3 prove the theoretical discriminating power on trend, jump, and volatility changes under certain conditions (21), (22). This provides a strong hint for good results. The obtained experimental results support the aforementioned properties of the proposed algorithm (see the Appendix for more details and experiments). According to the experiments and theoretical properties, the algorithm effectively detects all the main types of change points and outperforms the existing CPD methods on a wide range of benchmarks. Unlike the previous algorithms, our method shows the best average quality on both univariate and multivariate datasets. We suggest that the main reason for such good quality is that our method effectively fits the latent dynamics of a multivariate process with the robust continuous dynamics of an SDE. The proposed method generalizes the conventional likelihood ratio CPD methods based on stochastic differential equations [22], [23] with modern DL techniques. To our knowledge, this is the first DL approach to CPD that uses Latent Stochastic Differential Equations. Moreover, unlike the preceding SDE-based methods, our algorithm is designed for all main types of change points, including trend, mean, and volatility change.

FIGURE 3. Architecture of Neural SDE for CPD. In our work, we use a 3-layer dense neural network to approximate the posterior drift, and constant diffusion (equal to 1 at each point). This configuration is similar to the default configuration in the original neural SDE work [27]. Here, D denotes the dimensionality of the original dataset, and n_pos_encodings denotes the number of the added time positional encoding features. The factor 2 before the dimensionality D means that we augment each time series with SARIMAX residuals of the same dimensionality.

TABLE 9. NAB.Standard metric for synthetic datasets [61]. The first column contains dataset indices. The header contains algorithm names. In cells, the NAB.Standard metric for the corresponding dataset and algorithm is given.

TABLE 10. NAB.LowFP metric for synthetic datasets [61]. The first column contains dataset indices. The header contains algorithm names. In cells, the NAB.LowFP metric for the corresponding dataset and algorithm is given.

TABLE 11. NAB.LowFN metric for synthetic datasets [61]. The first column contains dataset indices. The header contains algorithm names. In cells, the NAB.LowFN metric for the corresponding dataset and algorithm is given.
Along with good performance, our algorithm has good scalability and computing efficiency. The training and inference stages of the algorithm are linear with respect to the time series size N, whereas many other state-of-the-art algorithms with comparable quality are much less scalable or have much higher computational complexity.
Another main property of our algorithm is flexibility, which allows any deep learning based encoder-decoder architecture to be used behind the algorithm. Moreover, more complicated preprocessing techniques can be used, particularly to detect seasonality changes.
In this way, we are the first to provide and study a general Neural SDE framework in a CPD setting, which makes it possible to apply a wide range of modern deep learning techniques to CPD problems. This fact, along with all the aforementioned properties, provides a broad perspective for further study and improvements of the proposed algorithm.

VII. CONCLUSION
This work aims to design an efficient DL generalization of the conventional likelihood ratio CPD approaches based on stochastic differential equations. To this end, we present a first study of Latent SDE in a change point detection setting. As a result of this work, a novel CPD algorithm at the intersection of modern deep learning approaches and conventional CPD methods is introduced. This is the first deep learning modification of the stochastic differential equations approach to change point detection.
It was theoretically and experimentally shown that the proposed method is capable of detecting all the main types of change points in multivariate time series data: trend, mean, and volatility changes. In most of the scenarios and metrics, the model shows high robustness and a performance that is substantially higher than that of the other state-of-the-art CPD algorithms used in this work.
Given all of the above, the proposed algorithm is of considerable interest from both the theoretical and the performance perspective for the change point detection problem.

APPENDIX A. EXPERIMENT SETUP. TRAIN, INFERENCE, AND IMPLEMENTATION DETAILS
In this auxiliary section, the experimental and training details are described.
The Neural Stochastic Differential Equation is implemented using the torchsde framework with noise_type = diagonal, sde_type = Stratonovich_SDE. The configuration of the neural SDE networks is shown in Figure 3. In this work, we use the torchsde configuration [27] as a basis for ours.
The configuration is trained for 100 iterations on each dataset with batch size 512. We monitor the evaluation metrics on each epoch in inference mode and save their best values.
For training, we use the Adam optimizer with learning rate $10^{-2}$ and default parameters.
In all the experiments, we use a machine with a single 8-core CPU Xeon E5-2689 and a single Nvidia 1080 Ti GPU. The training stage for each dataset on that hardware takes approximately 200 minutes.
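For convenience, the hyperparameters stated in this appendix and in Section III can be collected in a single configuration object, as sketched below. The values are the ones reported in the text; the class name, the field names, and the lowercase spelling of the option strings are hypothetical and do not reflect the layout of the released source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # Values come from the text; the field names themselves are hypothetical.
    noise_type: str = "diagonal"          # diffusion noise structure
    sde_type: str = "stratonovich"        # stated as Stratonovich_SDE in the text
    train_iterations: int = 100           # training iterations per dataset
    batch_size: int = 512
    optimizer: str = "adam"
    learning_rate: float = 1e-2
    lags: int = 5                         # lags range L of the CPD score
    sarimax_order: tuple = (5, 1, 0)      # AR, differences, MA (optional step)
    n_seeds: int = 5                      # runs used for uncertainty estimation


config = ExperimentConfig()
```

Freezing the dataclass keeps a run's settings immutable, which makes it straightforward to log them next to the metrics of each of the 5 seeds.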

B. SYNTHETIC CORPUS
In our experiments, an additional corpus of synthetic datasets is introduced. We build this corpus to see how each algorithm works on different types of change points. To accomplish that, a set of synthetic tests (datasets) is generated for three main types of change points: trend, mean, and volatility change.

FIGURE 1. Example of a time series with two change points at times $\nu_1 = 200$ and $\nu_2 = 400$. At these moments, the mean value of the signal changes with jumps.

Corollary 1: Consider a multidimensional time series with the mean jump $\Delta b$ at a time t. It means that $\Sigma_{t-l} = \Sigma_t = \Sigma$, $b_t = b + \Delta b$, and $b_{t-1} = b_{t-2} = \ldots = b_{t-L} = b$. Then, the change point detection score CPD($x_t$) has the following form:

FIGURE 2. The behaviour of our algorithm on three main types of change points. In (a), the mean change point (step) is shown. In (b), the trend change point (fracture) is shown. In (c), the volatility change point is shown. The top halves of the figures contain the original time series with averaged SDE dynamics. The bottom halves of the figures represent the score of our algorithm, maximized over all the original and auxiliary dimensions.

TABLE 1. Toy synthetic benchmark (Univariate). Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.

TABLE 2. Toy synthetic benchmark (Multivariate). Each row (except header) represents the results of a specific algorithm. The first column contains algorithm names. All the remaining columns correspond to specific metrics listed in the header. In each cell, a metric value averaged over all the datasets is shown. The best values for each metric are highlighted in bold. The uncertainty for each algorithm is estimated on 5 different runs.

TABLE 3. Results on univariate TCPD datasets

TABLE 6. Results on TSSB benchmark datasets

TABLE 8. List of synthetic datasets. The first column contains dataset indices. The second column contains the types of change points present in a dataset. The last column contains a detailed dataset description.

TABLE 12. F1 metric for synthetic datasets [28]. The first column contains dataset indices. The header contains algorithm names. In cells, the F1 metric for the corresponding dataset and algorithm is given.

TABLE 13. Covering metric for synthetic datasets [28]. The first column contains dataset indices. The header contains algorithm names. In cells, the Covering metric for the corresponding dataset and algorithm is given.

TABLE 14. RCPD metric for datasets [21]. The first column contains dataset indices. The header contains algorithm names. In cells, the RCPD metric for the corresponding dataset and algorithm is given.