Large-Scale Characterization and Segmentation of Internet Path Delays With Infinite HMMs

Round-Trip Times are one of the most commonly collected performance metrics in computer networks. Measurement platforms such as RIPE Atlas provide researchers and network operators with an unprecedented amount of historical Internet delay measurements. It would be very useful to process these measurements automatically (statistical characterization of path performance, change detection, recognition of recurring patterns, etc.). Humans are quite good at finding patterns in network measurements, but it can be difficult to automate this and enable many time series to be processed at the same time. In this article we introduce a new model, the HDP-HMM or infinite hidden Markov model, whose performance in trace segmentation is very close to human cognition. We demonstrate, on a labeled dataset and on RIPE Atlas and CAIDA MANIC data, that this model represents measured RTT time series much more accurately than classical mixture or hidden Markov models. This method is implemented in RIPE Atlas and we introduce the publicly accessible Web API. An interactive notebook for exploring the API is available on GitHub.


INTRODUCTION 1.Scope of the paper
Network management has traditionally been entrusted to humans.But this mode of operation is expensive, error-prone, and slow to adapt to changes.The task of human experts is very complex because of the large number and heterogeneity of equipments, as well as the wide variety of applications.
We believe that the future of network management is in automation, or driverless (autonomous) networks.[3,11,12,24].For self-driving networks to become reality, it is necessary to rely on recent machine learning techniques to extract information from network measurements and automate decision-making.Different needs should be addressed: statistical characterization, prediction, detection of changes or anomalies, classification, etc.The results should be reliable and accurate to automate decision-making related to network management or to security and resilience and the analysis should be scalable and fully automated (no human intervention).
Delay is an important performance metric.In particular it is easy to measure Round Trip Time (RTT) and there is a good availability of data from measurement infrastructures at the Internet scale like RIPE Atlas [36].Humans are pretty good at finding patterns in this latency data (try it for yourself in Figure 1), but it is difficult to automate this which would allow many time series being processed at the same time.In this article we propose to use a Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM), also called infinite HMM, or nonparametric Bayesian HMM.
This model mimics human cognition very well (in terms of segmentation of the series, recognition of different states, etc.).
These models are used for the segmentation of audio sequences for which they give very good results for speaker recognition [19].These recent techniques are more complex than standard HMMs but they are worth the effort.
The goal of the article is to recall the major principles of infinite HMMs and apply this theory to network measurement data.Whereas [19,28,37] are written for statisticians, we want to make the theory accessible to a wider audience and show the potential of this model for automating the analysis of a wide variety of delay time series.The method has been implemented in RIPE Atlas to automate the processing of anchor to anchor RTT measurements, and a Web API is available.The article introduces the API and an accompanying notebook is provided to help taking control of the API 1 .

State of the art
Network delay modeling and prediction is a well-studied problem.Some of the simplest models assume independent observations 1 https://github.com/maxmouchet/atlas-trends-demo.

arXiv:1910.12714v1 [cs.NI] 28 Oct 2019
and can be used to detect anomalous delay values.They, however, cannot predict the delay or find recurring patterns in a delay series since they do not account for time dependencies.Such models include the Pareto distribution [40], mixtures of Weibull [22] or Normal distributions [32].
More complex time series models have been used for short-term delay predictions (from seconds to minutes), with applications such as telerobotics.These include autoregressive models [23,39] and deep neural networks [9,15].As a drawback their parameters are more difficult to interpret and they do not provide a segmentation of the data.
HMMs are another kind of time-series model that can model different delay distributions and the dynamics between them.In [31] a discrete-time HMM is used to model packet losses, while in [38] a continuous-time HMM is used to model both packet losses and delays.In [13] a HMM is used to model inter-packet times and packet sizes.HMMs have few parameters and those are easily interpretable (state transition probabilities, means, variances, ...).
However, standard HMMs require to use heuristics to determine the number of hidden states.To remedy this problem we use a nonparametric HMM for which the number of hidden states is inferred from the data.A nonparametric mixture model has been used in the past to model the delay of a set of hosts measured over disjoint time intervals [18].In comparison our model is a nonparametric HMM and concerns the delay between two hosts over a large and continuous time interval, from a few hours to a few weeks.
We first introduced the use of the HDP-HMM for RTT time series in [27].This article expands on the statistical theory behind the model, describes two new applications (CAIDA MANIC measurements, and anomaly detection), and introduces a RIPE Atlas API for time series segmentation as a service.

Structure of the article
The paper is structured as follows.Section 2 is a reminder on mixture models (MM) and hidden Markov models (HMM).In section 3 we describe their nonparametric Bayesian counterparts, the Dirichlet Process MM (DPMM) and the Hierarchical Dirichlet Process HMM (HDP-HMM, or infinite HMM).In the same section, we explain how to automatically calibrate these models, that is how their parameters can be inferred from measurements without human intervention.
In section 4 the accuracy of the model is demonstrated on a dataset that has been labeled by humans, as well as on some RIPE Atlas RTT time series where we discuss the matching between routing configurations (from traceroutes) and states learned by the statistical model.We also briefly address the analysis of some CAIDA MANIC measurements.In Section 5 we introduce a Web API that permits requesting the HDP-HMM analysis of anchor to anchor RTT measurements in RIPE Atlas.We also demonstrate that analyzing the frequency of state changes in RTT time series over Atlas allows a very precise detection of the moment of occurrence of events affecting large infrastructures of the Internet (such as IXPs).In Section 6 we conclude and present our vision of the research axes to be developed in the future.
Readers that are less interested in the description of the Bayesian nonparametric context can skip most of sections 2 and 3, just reading their summaries (subsections 2.5 and 3.6).

A REMINDER ON MIXTURE MODELS AND HIDDEN MARKOV MODELS
In the next two sections our goal is to let the reader understand the HDP-HMM model, starting from simpler and more popular models such as mixtures or HMMs.

A taxonomy of statistical models
We will start by a taxonomy of the different models discussed in the article.Our taxonomy takes into account three criteria: (i) whether there is naturally a notion of "hidden state" in the statistical model (ii) whether time dependency is taken into account, and (iii) whether the number of states is supposed to be known (and finite) or unknown (and potentially infinite).
The RTT is stable over long periods of time (usually a few hours), and its distribution switches from one probability law to another (see Fig. 1).This can be explained by IP-level routing changes, congestion, and traffic engineering at lower layers than layer 3 [29].Propagation delays give a lower bound on the RTTs, and as routers queue lengths increase with the traffic, so do the observed RTTs.From a statistical point of view, it is natural to think of models with "hidden states" such as MMs or HMMs.
Knowing that the delay is stable over several hours means that, if the path quality is measured at a frequency of one "ping" every few minutes, the delay distribution remains stable for tens or hundreds of time slots.In order to have a model that can be used for prediction, it is necessary to account for this temporal dependence.This is made possible by HMMs, while mixture MMs assume independent observations.But a classical problem in statistics with MMs or HMMs is that the order of the model is assumed to be known (and finite).In practice this hypothesis is unrealistic, in most applications considered.This is where models with Dirichlet processes (DP) priors on the number of components of the mixture, or of the HMM, find their interest.
In the Dirichlet Process MM (DPMM) and the Hierarchical Dirichlet Process HMM (HDP-HMM), the number of model states is "infinite".And the order of the model can be learned from the measured data, as it is the case for the other parameters of the model.This is an important property to have a model that is flexible enough to adapt, without manual human intervention (initialization of algorithms, etc.), to a large number of time series.
In Table 1 we have summarized which properties are satisfied by which models.This justifies the choice of the HDP-HMM to characterize RTTs and to automate their processing.
This flexibility is obtained at the cost of a greater complexity of the model of inference algorithms for parameter estimation.However, we could provide an efficient implementation for it embedded in an operational API (see Section 5.1).

Mixture Models
Some of the simplest statistical models that include hidden states are mixture models.MMs are a kind of generative statistical models used to describe data produced by different system states.For instance, in a Gaussian Mixture Model (GMM), observations y 1:T = (y 1 , y 2 , . . ., y T ) are assumed to be independent and a normal distribution is associated to each hidden state.For continuously distributed observations, conditionally to the underlying state z t = k ∈ {1, 2, . . ., K }, where K denotes the number of states of the model, the observation y t follows a distribution with probability density function p θ k , where θ k is a parameter vector.For example, in a GMM, θ k consists of mean and variance parameters, so Finally, the data distribution writes p(y where π k denotes the probability that the state of an observation is k, that is, π k = P(z t = k).MM parameters ϕ = {π k , θ k } k=1:K can be estimated from measurements y 1:T according to different criteria and algorithms.A common choice is the Maximum Likelihood Estimator (MLE) which supplies the parameters that maximize the likelihood of the observations: ϕ M LE = arg max ϕ p(y 1 , y 2 , . . ., y T ; ϕ).In general, direct maximization of the likelihood p(y 1:T ; ϕ) with respect to ϕ is infeasible.The Expectation-Maximization (EM) algorithm [14] is a popular iterative two-steps algorithm to compute the MLE for models with incomplete data, in particular mixture models.

Hidden Markov Models
Because of the independent observations assumption, the predictive ability of MMs is limited.Knowing model parameters and which state value z t has generated the last observation y t does not bring any information about the next state z t +1 .HMMs are a generalization of MMs that take into account temporal dependencies among states.These temporal dependencies are expressed through a Markov property assumed for the states that writes p(z t +1 |z 1:t ) = p(z t +1 |z t ).Thus, the probability distribution of the next hidden state z t +1 depends on the current state z t only.
The MLE of HMM parameters can be estimated using a variant of the EM algorithm known as the Baum-Welch (or forward-backward) algorithm [1].While easy to implement and well-studied, this approach is prone to overfitting on noisy data or data with few samples.Furthermore this method requires the number K of hidden states to be known, which is usually not the case when studying RTTs.

Limitations of vanilla MMs and HMMs
Classical mixtures and HMMs are parametric models, meaning that they have a set of parameters with fixed size.This is a major difficulty when estimating HMM parameters as often the number of hidden states is not a priori known.
One could estimate models for different numbers of states, but the maximum of the likelihood would increase with the number of states as a model of order K is a degenerated case of model of order K + 1.A classical approach consists in penalizing the MLE optimization criterion by adding a penalty term to the loglikekihood such as the AIC [6] or the BIC [33] criterions and by selecting the model that minimizes this penalized criterion.Another approach is to use nonparametric models with unbounded number of parameters.
Another limitation of parametric models is that the EM algorithm usually used to tune the parameters of the model is sensitive to the choice of its initialization point.Appropriate initialization strategies must be used otherwise it may converge to a local but non-global maximum of the likelihood.
Because of these limitations standard MMs or HMMs cannot be used on a large scale to analyze Internet measurements.In what follows we introduce a new approach for RTT measurement analysis, based on nonparametric Bayesian models, and more particularly the HDP-HMM.

Section summary
MMs and HMMs are interesting for characterizing time series of RTTs.They are designed to model phenomena that change state from time to time and in which the value of the observations, here the RTTs, noisily depends on the hidden states.One can imagine that different hidden states result from different routing configurations, traffic engineering choices, or link loads.However, these models are too simple to characterize a large variety of RTT series and not suitable for automating their processing at large scale.
We propose to use a more generic model, the HDP-HMM.This model does not make assumptions about the number of states of the system, contrary to vanilla mixtures or HMMs, and it is possible to learn the number of states from the data itself.Contrary to DPMMs it also takes into account time dependency and makes it possible to account for the RTT distribution being stable for a long period of time.

NONPARAMETRIC BAYESIAN APPROACH
A more formal approach to models with an unknown number of components can be found in Bayesian statistics.The Bayesian framework allows one to specify models with several layers of uncertainty and to perform inference of the parameters in a systematic way.We will make better use of this flexibility to estimate HMMs from RTT series where neither the number of states, nor the probability distribution in each state is known.

Bayesian setting
In the MLE approach, estimates of the parameters are inferred from data.In contrast, Bayesian approaches make use of prior distributions upon the model parameters, and output a posterior probability distribution over the model parameters.These prior distributions can account for prior knowledge upon the parameters distributions.
When the dimension of the model is unknown as for MMs or HMMs with unknown order K, one can resort to nonparametric Bayesian approaches, where the number of components of the model is inferred from the data itself.
Bayesian inference can be performed from the posterior likelihood which is defined as p(ϕ |y 1:T ) ∝ p(y 1:T |ϕ)p(ϕ) where p(y 1:T |ϕ) is the likelihood of the data y 1:T , p(ϕ) is a prior distribution and ∝ denotes proportionality.
In general, a direct maximization of the posterior likelihood p(ϕ |y 1:T ) with respect to ϕ is not feasible as p(ϕ |y 1:T ) can be quite complex.Note, however, that there are situations where the likelihood and the prior distribution are such that posterior distribution belongs to the same family as the prior.In this case, the prior is said to be conjugate.Using conjugate priors, when possible, often makes inference simpler.
Markov Chain Monte Carlo (MCMC) techniques, and in particular Gibbs sampling, can be used in very general situations for inference [30].Alternatively, variational Bayesian methods can be considered ( [25], chap.33).The principle of MCMC methods is to use simulations to draw a large number of samples ϕ from the posterior distribution p(ϕ |y 1:T ).

Dirichlet Processes and DP mixtures
Modelling a HMM with an infinite number of states is generally achieved by means of a Dirichlet process (DP) prior.DPs have been introduced by Ferguson [16] in 1973 and have first been applied to mixture models with an unknown number of components in [7].The extension to the modelling of HMMs has first been defined in 2002 in [8].More recently this has been formalized in the framework of hierarchical Dirichlet processes (HDP) in [37] where HDP-HMMs have been introduced.These models are called nonparametric Bayesian, meaning that they are Bayesian and involve parameter spaces of infinite dimension [20].
A Dirichlet Process (DP) is a stochastic process G ∼ DP(α, H ), the realizations of which are probability distributions.It is parameterized by a concentration parameter α and a base distribution H .It can be seen as a process indexed by partitions (A 1 , . . ., A n ) (n > 0) of the space E on which H is defined, with n-variate Dirichlet random realizations: Here Dir(α 1 , . . ., α n ) denotes the n-variate Dirichlet distribution with parameters α 1:n = (α 1 , . . ., α n ), that is to say the probability distribution with density function: where 1I A (x) = 1 if x ∈ A and 0 otherwise, and B(α ) is a normalization factor.0 1 Figure 2: The stick-breaking process Alternative definitions of DPs are also useful both for their understanding and simulation.In particular it can be proved that a Dirichlet Process G ∼ DP(α, H ), can also be defined via the stickbreaking constructive approach [34].The idea is to build a discrete distribution by assigning probabilities π k to samples θ k drawn independently from H .As the probabilities π k must sum to 1, a unit-length stick is divided as displayed on Figure 2. The stick is first broken into two parts, of lengths η 1 and 1 − η 1 .Then the second portion, of length 1 − η 1 , is broken again into two parts in proportions η 2 and 1 − η 2 .The three resulting portions are now of lengths η 1 , η 2 (1 − η 1 ) and (1 − η 2 )(1 − η 1 ).The process of breaking the stick into smaller parts continues indefinitely.
The weights π k are defined as is the beta distribution with parameters a and b and probability density function The distribution with weights π We then get the stick-breaking representation of the Dirichlet Process G: Note that the π k s tend to decay to zero at geometric rate.Indeed it can easily be proven that: Now, suppose we want to fit a mixture model to some observations y 1:T = (y 1 , y 2 , . . ., y T ).Assume that the mixing distributions are in the form p θ (y), where θ is a vector of parameters and that the prior distribution over the vector of parameters is θ ∼ H .We can build a nonparametric Bayesian generative model of observations in the form of a Dirichlet Process Mixture model (DPMM).In this model the distribution of observations is a mixture: and the weights π k and parameters θ k of the different components of the mixture are defined as a Dirichlet Process:

Hierarchical Dirichlet Process HMM
The idea of using a DP as a prior in mixture models has been extended to the case of Hidden Markov Models (HMMs).In fact, for some technical reasons that we will explain, the extension of this approach to HMM modelling involves a hierarchy of DPs.
In the Hierarchical Dirichlet Process HMM (HDP-HMM), DPs are used as priors on the rows π i = (π i1 , π i2 , . . ., π ik , . ..) of the transition matrix Π of the hidden Markov chain (z t ) t .This makes it possible to specify that the number of states of the Markov chain is unknown.
But it is also necessary to ensure that the transition probabilities π ik , for all row i, weight the same emission distribution p θ k .This is made possible by parameterizing the DPs G i (i = 1, 2, . ..) by the same discrete valued base distribution G 0 where G 0 is modeled by a DP prior with base distribution H λ : This hierarchy of DPs yields the HDP-HMM process [37].A graphical representation of the HDP-HMM is given in Figure 3, where the arrows represent the dependencies.The HMM itself is represented by states z t and observations y t .Its parameters are (θ k ) k ≥1 and (π i ) i ≥1 , where p θ k (y t ) = p(y t |z t = k) and π i denotes the i th row of the transition matrix Π of the HDP-HMM, so α, γ and λ are hyper-parameters.γ and λ are the parameters of a Dirichlet process G 0 ∼ DP(γ , H λ ) that lies at the top of the HDP hierarchy.These random dependencies and vague priors introduce enough flexibility in the model to let it adapt to many different time series.

Inference in DP mixtures
Inference in DPMMs is better addressed via the so called Polya urn representation of DPs than through stick breaking.Imagine an urn that contains black and colored balls.The "values" of balls are their colors.At initialization the urn only contains α black balls.When drawing a ball from the urn, if the ball drawn is black then a new colored ball is drawn from a base distribution H and the black and colored balls are put back into the urn.If it is not black, the ball is put back into the urn together with a new one of the same color.The labels of the infinite sequence of draws follow a DP.
We are going to use this formalism in a DPMM, where z t denotes the hidden state and y t is the observation.The Polya urn model can be described as follows.Let us introduce θ ′ t = θ z t the value of θ associated to z t .If z t = k then θ ′ t = θ k and y t is distributed according to p θ k (•).Given a sequence of random variables (θ ′ t ) t >0 with P(θ ′ 1 ∈ B) = H (B), and it has been shown in [10] that the distribution of θ ′ t converges almost surely to a DP(α,H ) (when t → ∞).
The estimation of the parameters and states of a nonparametric mixture model from the posterior distribution p(z 1:T , θ 1:K T |y 1:T ), where y 1:T represent the data, can be addressed via Gibbs sampling [28].The principle of Gibbs sampling [30] is to sequentially update, in turn, the values of z t , t = 1, . . .and θ k , k = 1, . . .conditionally to y 1:T and to the current value of the other parameters.It requires knowing the distribution of each latent variable conditionally to the observations y 1:T and the other latent variables.
Going back to the Polya urn's model, let us index by 1, . . ., K t the distinct colors of the balls present in the urn at time t and let z t denote the color index of the new ball.As the role of the balls can be exchanged, letting z −t = {z 1:t −1 , z t +1:T } and n −t,k = #{z τ ∈ z −t ; z τ = k} be the number of occurrences of the value k among z −t , it can be shown [28] that: where K −t is the number of distinct elements in z −t with indexing set from 1 to K −t .Equation 10 can be interpreted as follows: knowing the values of z 1:t −1 and z t +1:T , the distribution of z t is a mixture of the values k ∈ z −t and of a new index value (K −t + 1).
The respective weights of this mixture are α +T −1 for any k ∈ z −t and α α +T −1 for the value K −t + 1.It can be proven [28] that, if observations y 1:T and parameters θ ′ −t are moreover taken into account, it comes that: where Note that, provided I(y t ) is known, the proportionality factor in Eq. ( 11) can be obtained from the normalization condition If p θ and H λ are conjugate distributions, I(y t ) can easily be calculated in closed form.In other cases one can resort to Metropolis-Hastings simulation using the prior distribution of z t in (10) as an auxiliary distribution [28] to calculate I(y t ).
After sampling z t , t = 1 : T from Eq. ( 11), θ k , k = 1, . . .can be sampled from the following distribution [28]: Here again simulation can be performed directly or via Metropolis-Hastings simulation depending whether p θ and H λ have conjugate distributions.

Inference in HDP-HMMs
Inference in HDP-HMM is technically more involved than for mixture models.We briefly summarize it here.Interested readers can find additional information in appendices of [19].
Letting K denote the current number of states, the Gibbs sampler should sample z 1:T .Note that θ 1:K can be marginalized out and does not need to be sampled in Gibbs.To make it possible, we will also have to sample the π j , which in turn requires sampling the weights of the base distribution G 0 = k =1:∞ β k δ θ k .As only (β k ) k=1:K is concerned for describing the weights of the states of the finite size data set at hand, letting that follows a Dirichlet distribution of order K + 1.The sampling of (β 1:K , β −K ) is described in [19].
Note also that we want to implement inference for a sticky HDP-HMM, that is, a modified version of the HDP-HMM that models persistency of the states by biasing the model towards self transitions (z t −1 = j, z t = j).This is ensured by introducing an additional parameter κ and changing the prior upon π j : When κ = 0 we get the standard HDP-HMM, while when κ → ∞, π j tends to only weight state j.
To implement the Gibbs sampler for the states z 1:T let ψ = (α, β, κ, λ), and π = (π j ) j .Then P(z t |y 1:T , z −t ,ψ ) can be expressed by marginalizing against the π j s and θ k s: Let us introduce the following notations: x i• = j x i j and n −t jk denotes the number of transitions from state j to state k, not counting the transitions z t −1 → z t or z t → z t +1 .Then, the first factor in (14) writes The second factor in (14) writes As far as discussed earlier, if the θ k s have conjugate prior distributions, p(y t | y −t , z 1:T , λ) can be calculated in closed form.Note in addition that to avoid a particular choice of hyperparameters (α, γ , λ) biasing the solution, they can also be given some prior distribution.
At the end of the process, after the z 1:T have been estimated, the θ k s can be estimated easily, e.g. by maximizing the likelihood p({y t ; z t = k} | θ k ).

Section summary
In this section we have introduced non-parametric Bayesian approaches.In Bayesian statistics some of the parameters on which the data depend are considered random.The term "non-parametric" means that there is a large number of parameters that are estimated from the data.
When the number of states of a mixture or a HMM is not known in advance, it is possible to use a non-parametric Bayesian approach using Dirichlet processes (DP) as priors.This is called the Dirichlet Process Mixture Model (DP-MM) or the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM).Equivalently, the name infinite (or non-parametric) mixture or HMM can also be used.
Missing data, that is, states and parameters, can be estimated from observations using a Gibbs sampling algorithm which comes up to randomly simulating, in turn, the different components of the model which are not measured directly.These components are simulated according to some conditional distributions which have been specified in this section.

A FIRST LOOK AT RTTS THROUGH THE HDP-HMM
As stated previously, HDP-HMM is a flexible method for inferring HMM parameters and segmenting data when the number of latent states is unknown.This fits the problem of segmenting RTT time series (remember that of Figure 1), where the number of different states is not a priori known.Furthermore, it is not mandatory to make an assumption on the type of RTT distribution in each state (Gaussian, exponential ...).This distribution can be assumed nonparametric, which introduces even more flexibility and allows a very generic model that adapts to a very large number of traces.
In this section we show that the model produces realistic segmentations from a human point of view, and that the inferred parameters are easily interpretable with respect to the application domain.In addition, we provide two validations for the model.We show on a labeled change point dataset that the model performs at least as well as ad-hoc change point detection methods.And we also show that the states inferred from the RTT time series match well with the AS and IP paths seen in RIPE Atlas traceroutes.

A nonparametric observation model
Many parametric models have been proposed in the literature to explain the distribution of the delay in computer networks and on the Internet.For example, in [32] a Gaussian mixture model is proposed, in [22] a Weibull mixture model, and in [40] a Pareto distribution.In practice, however, it seems that the distribution can be very different depending on the network state.For example, in some states the delay can be relatively stable with occasional spikes above a baseline, in which case it might be modeled by an exponential distribution, while in other states the delay can experience large variations caused by a high traffic level, and might be better modeled by a normal distribution.
In this work we choose instead to use a nonparametric Dirichlet Process Mixture Models (with a Gaussian as "base" distribution) as emission distributions of the HDP-HMM.As such, the delay in each state is modeled by a varying number of Gaussian components.This allows us to model a wide range of distributions, and we avoid choosing a particular parametric emission distribution for each state of the HDP-HMM.For each Gaussian component, we use a Normal-Inverse-χ2 prior, which is the conjugate prior to the normal distribution with unknown mean and variance.The use of a nonparametric observation model reinforces the need for Bayesian inference methods, since a more traditional MLE approach would require several layers of penalization.
The segmentation of the series from Figure 1 using an HDP-HMM with DP-GMM emissions is shown in Figure 4.A same state corresponds to a same color.As a matter of comparison, we provide in Fig. 5 the segmentation obtained with a HDP-HMM with DPMM emission distributions, with that resulting from parametric and nonparametric MMs and HMMs with a Gaussian observations model.In the case of the gaussian MM and of the HMM, the number of latent states has been chosen by estimating the model for a varying number of components and choosing the number that minimizes the penalized log-likelihood using the BIC criterion.As we can see the HDP-HMM produces a segmentation close to what a human would do, contrary to other models which generate far too many state changes.

Change point detection
Quantifying the performance of the HDP-HMM on real RTT time series is not easy since there is no ground truth.The "network state" is not known or vaguely defined.But it was possible to compare the performance of the model on a change point detection task where the goal is to detect significant changes in the delay.While not the primary purpose of the HDP-HMM, detecting change points is simply a matter of segmenting the data and finding changes in the inferred state sequence and this this allows to partially validate the quality of the segmentation obtained.
We have benchmarked the HDP-HMM against different change point detection methods on a labeled dataset introduced by [35].This dataset is particularly interesting because change points in RTT timeseries have been manually labeled by human experts.To our knowledge, there are no other RTT time series datasets that are both realistic and labeled.
The dataset consists of 50 RTT series of varying length for a total of 34,008 hours of observations.In [35] change point detection is performed by minimizing m+1 i=1 C(y τ i −1 +1:τ i ) + β f (m).m is the number of changes, C is a cost function that measures the stability of the delay over a range of successive values, and f (m) is a penalty that prevents overfitting.Different cost functions and penalties are considered.
We have compared the performance of the segmentation obtained by HDP-HMM with the best performing changepoint detection methods of [35].In our approach a HDP-HMM model is learnt on each timeseries, the most likely hidden state sequence is computed, and changepoints are simply defined as changes in the hidden states sequence.
We show on Fig. 6 that the HDP-HMM performs similarly to the best performing change point detection methods of [35] in terms of precision ( # True Positive # True Positive+# False Positive ), while performing better in terms of recall ( # True Positive # True Positive+# False Negative ).This means that our model is more sensitive to small changes in the delay without generating unnecessary false alarms.

RIPE Atlas measurements
In addition to detecting significant changes in the delay, the HDP-HMM also provides a notion of hidden states.In this section we validate the quality of this clustering visually and by studying the correlation with AS and IP paths revealed by traceroutes.

Dataset. RIPE Atlas offers two types of measurement sources:
probes and anchors.Probes are deployed in heterogeneous environments while anchors are restricted to high-availability environments such as data centers, universities, and IXPs (Internet eXchange Points).Anchors tend to be located in well-connected autonomous systems and measurements between anchors represent more stable paths than what may be observed from probes located at the edges of the Internet.On the other hand, anchors are more powerful and perform the so-called anchoring mesh measurements, where various measurements are performed regularly between each pair of anchors.This allows us to collect traceroute results both on the forward and on the reverse path.
Our dataset consists of one week of IPv4 RTT measurements between all Atlas anchors and the at-vie-as1120 anchor 2 .Delay is measured every four minutes using three ICMP (Internet Control Message Protocol) pings towards the target anchor.We kept the minimum value of the delay for each time step.Considering the subset of anchors that were online over the time period, we collected 301 series of 2520 data points.We also collected the associated traceroute measurements, both on the forward path, and on the reverse path.Traceroutes are performed every fifteen minutes using three ICMP probe packets for each hop.

Inference.
We have segmented each series using our Julia implementation of the Gibbs sampler.It takes less than 5 seconds on a single thread of a 2.80GHz Intel Core i7-7600U CPU to process a 2520 point time series (1 week of an Atlas RTT measurement) with 500 iterations of the sampler.The task is highly parallelizable as each time series can be processed independently.Using 4 threads, 300 one-week long time series can be processed in 6 minutes.Figure 8 shows the distribution of the number of states in the resulting HMMs for different measurement timescales.It is clear that the number of states grows with the length of the series.This is not surprising and visual inspection by a human expert would also probably identify more states in longer timeseries.One, three, and seven days long series have respectively less than 8, 10, and 11 states.This confirms the capability of the HDP-HMM to learn more complex models as the number of RTT observations, and possibly the number of underlying network configurations, grows.

State durations vs. delay variations.
An advantage of HMMs over other timeseries models (e.g.autoregressive models or neural networks) is that the parameters are more easily interpretable with respect to the application domain.In our case, the state transition matrix Π gives us information about the frequency of network configuration changes and the relation between them, while the observations distributions give us in particular the mean value of the delay and its variance (of the delay in each configuration).
On most time series we can distinguish two types of states: those where the delay is relatively constant (such as the green one on Fig. 7), and states where the delay is very variable (such as the purple one).This is reflected by the variance of the delay in the state.And the average duration of a HMM in a state i is given by 1/(1 − π ii ) where π ii is the probability of self-transition.In the example of Fig. 7 the average duration of the purple state is of 45 timesteps (= 3 hours) and of 149.5 timesteps (= 9 hours 58 minutes) for the green state.The standard deviation of the delay in the purple state is of σ = 10.3 msec while the standard deviation of the green state   if of σ = 4.1 msec.States with a high variance could possibly be explained by intra-domain load-balancing (since Atlas pings flow ID is not constant), congestion, or in-path devices delaying the processing of ICMP packets.However asserting the cause of such variations and studying the possibility of detecting them from delay measurements is to be done in future works.
Figure 9 displays the standard deviation of the RTT against the average duration in a state.In the analyzed dataset the average state duration decreases as the RTT standard deviation increases.This is not surprising as we expect Internet paths to spend more time in stable states.

4.3.4
Relationship with the AS and IP paths.We hypothesized that the distribution of delay observations is conditioned on the underlying network state, such as the inter and intra-AS routing configuration, as well as the traffic level.As shown in Figure 10, the majority of the states learned over all the paths in our dataset matches only one AS path and one IP path.For example there are 595 states which always correspond to the same AS path over the 746 states learned.Stated differently only 16% of the states learned can match two AS paths or more.States associated with more than one AS path can  be explained by delay differences too small to be separated into two clusters.
Conversely, one AS or IP path can be mapped to several states.For example in Figure 7 we only observe the AS path ASN MARKLEY → GTT BACKBONE → NTT COMMUNICATIONS → ACONET SERVICES in the traceroutes from us-bos-as26167 to at-vie-as1120 and ACONET SERVICES → ACONET → NEXTLAYER AS → NTT COMMUNICATIONS → ASN MARKLEY in the reverse traceroutes (as resolved using the RIPEstat API).In the forward traceroutes we observe IP path changes every 15 minutes, in the GTT and NTT ASes, probably due to intra-AS load-balancing, while in the reverse traceroutes we only observe two different IP paths in NTT AS that are perfectly correlated to state changes in the model.

CAIDA MANIC measurements
In addition to RIPE Atlas delay measurements, the HDP-HMM fits other kinds of network measurements as well.In this section we show the results obtained on delay measurements from the CAIDA MANIC project [2].The CAIDA MANIC project uses Time Series Latency Probes (TSLP) to measure inter-domain congestion.Once a peering link between two ASes has been identified, ICMP probes are sent to the near-end (i.e. the last router in the first AS) and the far-end (i.e. the first router in the second AS) of the link.The intuition is that if there is congestion the router queues will fill up, and the delay between the near-end and the far-end will increase.Using the same as for the RIPE Atlas RTT series, we segment the delay difference time series (far-end -near-end) from publicly available measurements.
In Figure 11 we show the resulting segmentation for a peering link experiencing periodic congestion.Three states are learned.The green state, corresponding to a non-saturated link, has a standard deviation of 0.1 ms, while the standard deviation for the red and blue states are of, respectively, 7 ms and 11 ms.The blue state seems to correspond to a state of increased traffic level, while the red state seems to correspond to a saturated link.Because the model accounts for temporal dependencies, it is able to clearly separate those two states even though their distributions are overlapping.good chance to provide enough information to let detect anomalous latency patterns in important network components, such as IXPs or large transit providers.However, detecting and characterising these anomalies has proven challenging (e.g. the analysis in [4] took weeks).In this section we will show how aggregating change points learned with the HDP-HMM from a large number of origindestination pairs is a simple and elegant method to detect and characterise anomalies in key Internet infrastructures.

RIPE Atlas Trends API
In order to make our method accessible to many people, we have developed a publicly exposed Web API into RIPE Atlas.Given an origin-destination pair (measurement and probe ID) and a time frame (start and stop time), the trends API provides the segmentation of a RIPE Atlas delay measurement.The API offers three endpoints, described in Table 2.The /ticks endpoint returns the minimum RTT for a given pair with a constant time interval (duplicated results due to probes connectivity problems are suppressed, and missing results are explicitly inserted).The /trends endpoint returns the minimum RTT and the associated segmentation.For example, the URL https://trends.atlas.ripe.net/api/v1/trends/1437285/6222/?start=2018-05-02&stop=2018-05-10 gives the segmentation of the Figure 7 (it should take less than 10 seconds to segment one week of data).A summary of the time series, as shown in Listing 1, can also be requested by appending /summary to the path.Start and stop time are specified as YYYY-MM-DDTHH:MM where THH:MM is optional and defaults to the start of the day.
Additionally to this article we provide interactive notebooks to document and demonstrate the API, and compare various statistical models.Links to interactive Google Colab sessions, as well as the notebooks source and code to facilitate the usage of the API are provided on GitHub [26].

AMS-IX May 2015 outage.
According to [4], on the 13th of May 2015, AMS-IX experienced a partial outage due to a switch interface generating looped traffic on the peering LAN.The event lasted for seven minutes and two seconds, from 10:22:12 to 10:29:14 (UTC time) before the switch interface was disconnected.This event caused some peers located at AMS-IX to loose their BGP session.In [4] the event has been studied using traceroutes, by looking at the percentage of paths seeing AMS-IX peering LAN in their traceroute over time.However changes in the IP paths often result in changes in the delay.Using the ping measurements corresponding to the same origin-destination pairs, provided by RIPE NCC, we learned the models and extracted the changepoints.
By default RIPE Atlas ping measurements are performed every 4 minutes, with a jitter of 2 minutes to maximize the temporal coverage over all the probes participating in a measurement.Hence we counted the number of changepoints in buckets of 6 minutes.We show the resulting state change frequency on Figure 12.We highlighted in red the real event duration.The event corresponds to a clearly visible increase in the number of changes.The frequency stays high for a few hours as first of all many peers switch to alternative paths, and then some of them come back to AMS-IX.
We also see a spike between 14:45 and 15:00 (UTC).Further investigation has shown that almost all the changepoints that have occured during this period are related to measurements targeting the DNS Root-A server.We have repeated a similar procedure for all the origin-destination pairs in the Atlas built-in measurement to this DNS server and we have seen a similar spike, but all source ASes seem to be affected equally, leading us to believe that the spike was caused by an event close to one of the DNS Root-A instances.

DE-CIX April 2018 outage.
According to [5], between April 9th and April 20th 2018, some networks located at DE-CIX Frankfurt lost their connectivity to the route servers, and as a result rerouted their traffic to other interconnections, or experienced an interruption of traffic.An analysis of the rates of BGP updates received by route collectors located at DE-CIX showed that the rates of updates dropped close to zero between 19:43 and 23:28 on the 9th of April, and between 02:02 and 03:51 on the 10th of April.Applying the same methodology as for the AMS-IX event, we show the state changes frequency for this time frame in Figure 13.The two largest spikes match exactly the two times where the rates of BGP updates dropped to zero.The two smaller spikes match with the two times when the collectors started receiving BGP updates again.

Validation of the HDP-HMM model at large scale on RIPE Atlas
In Section 4.2 we show that the HDP-HMM model is at-least as good as classical change point detection methods on a labelled RTT change points dataset.This however, does not tell us whether the model fits well RTT data from a statistical point of view.In this section, we propose to compare the likelihood of the time series (with respect to their inferred model) with the likelihood of time series simulated according to an HDP-HMM model.If the models fit well the data, we can expect that the likelihood of the data with respect to the model should follow the same distribution as the likelihood of synthetic data generated by the model.
To do so, we consider 100k time series of one week duration (2520 data points) from the anchoring mesh measurements.We learn the model for each time series, and compute their likelihood p(y|π, θ ) with respect to the model.In addition, for each HMM with parameters (π, θ ) we sample a time series y ′ and compute its likelihood p(y ′ |π, θ ).
We compare the distributions of the likelihood on observed and synthetic time series in Figures 14 (Q-Q plot) and 15 (histograms).It can be seen that both distributions are similar, with the simulated time series being slightly more likely.This demonstrates that the HDP-HMM explains well the diversity of observed trajectories in RIPE Atlas measurements.Thus, we have not only visually verified on a large number of series that the segmentation obtained with the model is consistent with what a human expert would produce (Section 4).But in addition, we have checked on a very large scale (about 100k randomly chosen series among the Atlas mesh measurements) that all these series are well modeled by the HDP-HMM.

CONCLUSION
In this paper we have shown that the HDP-HMM model, a hidden Markov chain model with a potentially infinite number of states, is a very promising method for analyzing RTT time series over the Internet on long time scales (hours to weeks).We have recalled the principles of this model that produces an accurate segmentation of time series and identification of hidden states.Unlike black box  approaches, the HDP-HMM provides some explainable parameters that can be used as input in different network management tasks such as the choice of routes, QoS prediction, or optimization of the measurement strategy.Segmentation results are very close to what a human expert would provide.But the analysis method is fully automated with no human intervention, even in the initialization phase, and it is scalable.As proof, it has been implemented on an Internet-wide operational measurement infrastructure, RIPE Atlas, with a publicly available Web API.
We have shown that this method can accurately detect moments when abnormal events occur on the Internet.In the future we would like to automate this detection, and in particular to locate anomalies (infrastructure failures, etc...) in a precise way.This will require the use of other methods exploiting the diversity of the measured paths and tomographic approaches or using a preliminary timeseries filtering strategy.We will also work on a real-time processing of measured data to detect novelties in RTT series with HDP-HMMs in an almost instantaneous way, based on some recent sequential approaches to inference in HDP-HMMs.

Figure 1 :
Figure 1: RTT between two RIPE Atlas anchors from May 1st to May 5th, 2018.

Figure 6 :
Figure Segmentation of a RTT time series with parametric and nonparametric mixture models and HMMs.

Figure 7 :
Figure7: Segmentation of RTT observations between at-vie-as1120 and us-bos-as26167 using an HDP-HMM with DP-GMM emissions.Each color identifies a state or an IP path observed in the traceroute.

Figure 8 :
Figure 8: Distribution of the number of states learned for different timescales.

Figure 9 :
Figure 9: Density estimation of the (standard deviation, average duration) couple.Darker colors indicate a higher density.
Number of unique IP-level paths

Figure 10 :
Figure 10: Distribution of the number of states associated with a given number of unique paths.

Figure 11 :
Figure 11: Segmentation of a RTT difference (far -near) time series obtained with TSLP probes from the CAIDA MANIC project.Each color identifies a state.

Figure 13 :
Figure 13: Change frequency between the 9th and the 10th of April 2018 for the 60k pairs that saw DE-CIX Frankfurt in their traceroutes the day before.

Figure 14 :
Figure 14: Q-Q plot of observed vs. simulated log-likelihood on 100k time series.

Figure 15 :
Figure 15: Distribution of observed and simulated loglikelihood on 100k time series.

Table 1 :
Taxonomy of models Internet monitoring projects such as RIPE Atlas provide a large amount of latency information.Due to its scale RIPE Atlas has a

Table 2 :
Endpoints of the Atlas Trends API.