Point-process modeling and divergence measures applied to the characterization of passenger flow patterns of a metro system

The problem of characterizing the passengers' movement in a public transport system has been considered in the literature for analysis, simulation and optimization purposes. In particular, origin-destination matrices are commonly used to describe the total number of passengers that travel between two points of the transport system in a given time interval. In this paper, we propose to model the instantaneous rate of arrival of passengers for the origin-destination pairs of a metro system using point processes. More specifically, we apply the expectation-maximization algorithm to estimate the parameters of a Gaussian mixture intensity function for the daily flow, using real passenger data from multiple days provided by the EFE Valparaíso metro system. The uncertainty in the parameter estimates is quantified by computing standard errors and confidence intervals. Secondly, we consider the problem of how to quantitatively analyze the similarity of the obtained intensity functions among all the different origin-destination pairs. In particular, we propose a dissimilarity index based on the Kullback-Leibler divergence. We then apply this index in hierarchical agglomerative and partitioning methods to cluster origin-destination pairs with a similar daily flow of passengers. The obtained numerical results confirm expert knowledge about the passengers' behavior in the EFE Valparaíso metro system and, more interestingly, they provide additional insights into the passengers' behavior for specific origin-destination pairs.


I. INTRODUCTION
Efficient and reliable transport systems have become increasingly important to support the large urban and suburban daily flow of passengers in large cities all over the world. In particular, metro systems are among the most commonly used means of transport due to their low cost for passengers, effectiveness and speed. At the end of 2017, there were metros in 182 cities in 56 countries, carrying on average a total of 168 million passengers per day [37].
The modeling of the passenger flow in a metro system may be of interest to reduce operating costs, to improve service quality and for long-term investment in the metro system infrastructure [28]. The passenger flow is usually modeled using origin-destination (OD) matrices that gather the information on the flow of passengers between different pairs of stations of the line. In the past, transit agencies used to rely on surveys to estimate these matrices; nowadays, however, database management and geographic information systems provide large amounts of data that can be exploited [42]. In fact, payment systems are the main source of information on passenger movement, where some transport lines validate payment cards only at the passenger entry, and in others, payment cards are validated at both ends of the trip. Alternatively, boarding and alighting data have been used with, for example, the iterative proportional fitting method, which could be considered the state-of-the-art practical OD estimation method [14]. However, these methods generally describe the total number of passengers traveling in a time interval (hours, days, weeks or years) for a given OD pair, and not the variation within that period.
Poisson processes have been used to model transport systems. For example, in [22] short-term demand for bus arrival was predicted by describing the intensity using neural networks, and in [23], the demand for taxis was modeled using a discrete intensity. Moreover, Poisson processes have been used to model patient arrivals to emergency departments using a piecewise constant approximation of the intensity [9]. The modeling of the intensity of the passenger flow is not always possible and, in such cases, researchers have proposed methods such as the short-term forecasting of real-time OD matrices [5], [43]. For public transport, Munizaga and Palma [24] estimated an OD matrix combining bus and metro system data.
In this paper, we propose a point process model for the instantaneous rate of arrival of passengers for every OD station pair of a metro network along a day of operation. The modeling of passenger arrival was previously considered in Allende et al. [1], where we studied three different point process models: the Hawkes-Phan process, the Poisson process with Gaussian mixtures, and the so-called Hawkes-Gauss process (which combines the previous two). In that work, the intensity function was obtained by applying maximum likelihood and the EM algorithm, however, using only single-day real data provided by EFE Valparaíso. Additionally, the Bayesian information criterion was used in that work to determine the (best) model order and, thus, the number of parameters for the different point process intensity functions.
In this paper, our first contribution is to continue and extend the work in [1] by estimating the intensity function of the passenger arrivals using data from multiple days via the superposition principle [17]. We consider nonhomogeneous Poisson processes (also considered in [1]) by defining a time-dependent intensity function λ(t) as a Gaussian mixture model.
To estimate the parameters of the intensity Gaussian mixture function for each OD pair, we use the EM algorithm [8], [21]. This algorithm has been applied to truncated Hawkes point process modeling in [11]. For estimation from real data, the accuracy analysis is an important issue to detect high uncertainty in the parameter estimates. In this work, the accuracy of the parameter estimates is analyzed via confidence intervals and standard errors. The Oakes method [25] calculates the derivatives of the conditional expected function of the EM algorithm to obtain the observed information matrix that leads to estimating confidence intervals. In addition, parametric bootstrap [10] based on a Monte Carlo simulation is used to approximate standard errors. Moreover, we analyze the impact of combining data from multiple days to reduce the associated uncertainty in the parameter estimates of the intensity functions. It is worth noticing that the derivation of the observed information matrix using the Oakes method is a novel side result, which to the best of our knowledge has not been reported previously for Gaussian mixture models.
In the literature, different works have considered the characterization of passenger flow patterns in transport systems, focusing on clustering days with similar user-arrival behavior. For example, Weijermars and Berkum [36] classified daily traffic profiles using hierarchical Ward's clustering for a Dutch highway. Yang et al. [40] applied dimensionality reduction to OD matrices to classify a set of days of a metro system using affinity propagation. For daily traffic data, Yu and Hellendoorn [41] proposed a clustering algorithm for mixture models. Caceres et al. [3] applied hierarchical clustering to Euclidean distances of road section features to estimate traffic flow.
In this paper, our second contribution is to propose a methodology to characterize OD pairs with similar passenger flow patterns using a quantitative index to compare the associated intensity functions. In particular, we propose a dissimilarity index between OD pairs using a symmetrized Kullback-Leibler divergence [6]. In this way, a dissimilarity matrix between all OD pairs of Metro Valparaíso is obtained. To characterize similar passenger flow patterns, we use two clustering approaches: agglomerative hierarchical methods using single, complete and average linkage [15], and partitional methods using the k-medoids algorithm [16]. A key problem of clustering algorithms in practice is the validation of the obtained set of clusters [27], which depends on the context of the application. For the metro system, the choice of the number of clusters may be considered a design parameter depending on the number of groups required for analysis, simulation or optimization purposes.
The remainder of the paper is structured as follows: in Section II, the parameter estimation of the intensity function is presented, and the associated confidence intervals and standard errors are obtained. Section III provides background on the Kullback-Leibler divergence and how it can be used to define a symmetric distance between OD intensity functions. Then, in Section IV, the clustering methods and algorithms are presented. In Section V we present the results obtained by the proposed analysis in the case of EFE Valparaíso metro line using real passenger data. Finally, in Section VI, we discuss the results obtained and draw conclusions.

II. ESTIMATION OF INTENSITY FUNCTIONS
Origin-destination (OD) matrices are used to model the movement of passengers in a public transport system. They usually summarize the total number of passengers that travel between given origin and destination points (or stations) during a time interval. For instance, for a set of s stations, the matrix N = (N_ij) is an OD matrix where N_ij represents the number of trips made from station i to station j over a given period of time.
Note that it is assumed that there are no trips with the same station as the origin and destination.
In this section, we introduce point process intensity functions to model the instantaneous rate of arrival of passengers for each OD pair of a metro system along one operation day. We assume that the intensity associated with an OD pair is denoted by λ(t) and corresponds to the intensity of a Poisson point process as described in [1], [31].

Definition 1: A mixture of one-dimensional Gaussian densities is a finite linear combination of g Gaussian densities of the form

f(y; θ) = Σ_{i=1}^{g} π_i φ(y; μ_i, σ_i²),  (1)

where

φ(y; μ_i, σ_i²) = (2πσ_i²)^{−1/2} exp{−(y − μ_i)²/(2σ_i²)}  (2)

denotes the Gaussian density function with mean μ_i and variance σ_i², and the coefficient π_i is the weight associated with each Gaussian component. If a random variable Y has the density function (2), we shall write Y ∼ N(μ_i, σ_i²).

In (1), the parameter vector associated with the Gaussian mixture model is θ = (πᵀ, μᵀ, σᵀ)ᵀ with π = (π_1, ..., π_{g−1})ᵀ, μ = (μ_1, ..., μ_g)ᵀ and σ = (σ_1², ..., σ_g²)ᵀ, where the weight of the last component is determined by the constraint

π_g = 1 − Σ_{i=1}^{g−1} π_i.  (3)

Accordingly, the parameter space is defined as

Θ_g = {θ ∈ R^p : π_i > 0, Σ_{i=1}^{g−1} π_i < 1, σ_i² > 0, i = 1, ..., g},

where p = 3g − 1 is the dimension of Θ_g.

Following [1], for a fixed day d, we consider (1) to model the intensity of a point process for an OD pair such that the Gaussian mixture of densities is weighted by the number of arrivals n_d at the destination. In this framework, we define the intensity through

λ_d(t) = n_d Σ_{i=1}^{g} π_i φ(t; μ_i, σ_i²).  (4)

The above model is an extension of the problem addressed in [1], in which the parameter vector θ was estimated using data from one day only. Here, we consider data from a set of days, where for each day we assume that the intensity function of the Poisson point process depends on the same fixed parameter vector θ. Furthermore, if we assume that the Poisson processes for different days are independent of each other, then from the superposition principle [17],

λ(t) = Σ_{d=1}^{N_d} λ_d(t) = N_d n̄_d Σ_{i=1}^{g} π_i φ(t; μ_i, σ_i²)  (5)

is the intensity of the realizations of the Poisson point processes of the selected N_d days, where n̄_d is the average number of arrivals (or passenger trips).
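As a quick illustration, the Gaussian-mixture intensity λ(t) = n̄_d Σ_i π_i φ(t; μ_i, σ_i²) can be evaluated directly. The following is a minimal Python sketch; the parameter values (weights, means, variances, average daily arrivals) are hypothetical placeholders, not estimates from the EFE data:

```python
import math

def gaussian_pdf(t, mu, sigma2):
    """One-dimensional Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(t - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def intensity(t, weights, means, variances, n_bar):
    """Mixture intensity: n_bar * sum_i pi_i * phi(t; mu_i, sigma_i^2)."""
    return n_bar * sum(w * gaussian_pdf(t, m, v)
                       for w, m, v in zip(weights, means, variances))

# Hypothetical g = 3 components: morning, midday and afternoon peaks (hours)
weights = [0.4, 0.3, 0.3]
means = [8.0, 13.0, 18.0]
variances = [1.0, 2.0, 1.5]
n_bar = 300.0  # hypothetical average daily arrivals

rate_8am = intensity(8.0, weights, means, variances, n_bar)
```

Since the mixture density integrates to one, the intensity integrates to the average number of daily arrivals over the operating horizon, which is a useful sanity check of any implementation.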
In the framework of discrete mixtures of Gaussian distributions, the EM algorithm [8] is often used to perform maximum likelihood (ML) estimation of the parameters of interest. The simplicity and stability of this estimation approach have made it a popular algorithm (see, for instance, [39]). The power of the EM algorithm lies in considering a data augmentation scheme Y_aug, including latent variables or missing data alongside the observed data Y_obs = M(Y_aug) for some many-to-one mapping M. Then, we proceed to maximize the observed log-likelihood function ℓ(θ) iteratively based on a surrogate function known as the log-likelihood function of the augmented data Y_aug = (Y_obs, Y_mis), denoted by ℓ_c(θ). The EM procedure iteratively computes the ML estimator by alternating between the following steps:

E-step: compute Q(θ; θ^(k)) = E{ℓ_c(θ) | Y_obs, θ^(k)};
M-step: set θ^(k+1) = arg max_{θ ∈ Θ_g} Q(θ; θ^(k)).

A remarkable property of the EM algorithm is that, under mild general conditions [38], this procedure increases the observed-data log-likelihood function after each iteration, and the sequence {θ^(k)}_{k≥1} converges to a stationary point of ℓ(θ).
To obtain the maximum likelihood estimates in the context of discrete mixtures of normals, we augment the observed data Y_obs = (Y_1, ..., Y_n)ᵀ by incorporating latent variables to obtain Y_aug = (Y_1, ..., Y_n, Z_1ᵀ, ..., Z_nᵀ)ᵀ, where Z_ij = (Z_j)_i is an indicator variable that identifies whether the observation Y_j belongs to the ith component of the mixture. This leads to the following hierarchical model:

Y_j | Z_ij = 1 ∼ N(μ_i, σ_i²),  Z_j ∼ Mult_g(1, π),

where Mult_g(1, π) denotes the multinomial distribution with parameters 1 and probabilities π = (π_1, ..., π_g)ᵀ. Thus, the augmented-data log-likelihood function assumes the form

ℓ_c(θ) = Σ_{j=1}^{n} Σ_{i=1}^{g} Z_ij {log π_i + log φ(y_j; μ_i, σ_i²)}.  (6)

It is straightforward to show that the conditional expectation required to evaluate the expectation step of the EM algorithm takes the form

z_ij^(k) = E{Z_ij | y_j, θ^(k)} = π_i^(k) φ(y_j; μ_i^(k), σ_i^{2(k)}) / Σ_{l=1}^{g} π_l^(k) φ(y_j; μ_l^(k), σ_l^{2(k)}),  (7)

for j = 1, ..., n; i = 1, ..., g. This allows us to calculate the conditional expectation of the log-likelihood of the augmented data. Moreover, Q(θ; θ^(k)) is given by

Q(θ; θ^(k)) = Σ_{j=1}^{n} Σ_{i=1}^{g} z_ij^(k) {log π_i + log φ(y_j; μ_i, σ_i²)}.

Thus, the M-step updates related to π, μ and σ are given by

π_i^(k+1) = (1/n) Σ_{j=1}^{n} z_ij^(k),  (8)

μ_i^(k+1) = Σ_{j=1}^{n} z_ij^(k) y_j / Σ_{j=1}^{n} z_ij^(k),  (9)

σ_i^{2(k+1)} = Σ_{j=1}^{n} z_ij^(k) (y_j − μ_i^(k+1))² / Σ_{j=1}^{n} z_ij^(k),  (10)

for i = 1, ..., g. The E- and M-steps detailed in (7) and (8)-(10) are iterated until convergence. Note that the updates of μ_i^(k+1) and σ_i^{2(k+1)} correspond to weighted averages whose weights are given by z_ij^(k). This is computationally inexpensive and guarantees nonnegative variance estimates.
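The E-step responsibilities and the weighted-average M-step updates described above can be sketched directly in code. The following is a minimal Python illustration on synthetic data, not the implementation used for the EFE data; the function and variable names are ours:

```python
import math
import random

def em_gmm_1d(y, g, mu0, iters=200):
    """EM for a one-dimensional Gaussian mixture with g components.
    y: observations; mu0: initial means. Returns (weights, means, variances)."""
    n = len(y)
    pi = [1.0 / g] * g
    mu = list(mu0)
    var = [1.0] * g
    for _ in range(iters):
        # E-step: responsibilities z[j][i] of component i for observation j
        z = []
        for yj in y:
            dens = [pi[i] * math.exp(-(yj - mu[i]) ** 2 / (2.0 * var[i]))
                    / math.sqrt(2.0 * math.pi * var[i]) for i in range(g)]
            s = sum(dens)
            z.append([d / s for d in dens])
        # M-step: weighted-average updates of weights, means and variances
        for i in range(g):
            wi = sum(z[j][i] for j in range(n))
            pi[i] = wi / n
            mu[i] = sum(z[j][i] * y[j] for j in range(n)) / wi
            var[i] = sum(z[j][i] * (y[j] - mu[i]) ** 2 for j in range(n)) / wi
    return pi, mu, var

# Synthetic two-component data for illustration (true means 8 and 18)
random.seed(1)
y = ([random.gauss(8.0, 1.0) for _ in range(400)]
     + [random.gauss(18.0, 1.5) for _ in range(400)])
pi, mu, var = em_gmm_1d(y, 2, [7.0, 19.0])
```

With well-separated components and reasonable initial means, the estimated means land close to the true values and the weights near 0.5 each, mirroring the weighted-average character of the M-step.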
Under mild regularity conditions, the ML estimator θ̂ of θ is asymptotically normally distributed, that is,

√n (θ̂ − θ) →d N_p(0, I(θ)^{−1}),  (11)

where I(θ) = E{−∂²ℓ(θ)/∂θ∂θᵀ} corresponds to the Fisher information matrix. Note that the Fisher information matrix is necessary not only to obtain the standard errors of ML estimators but also to evaluate confidence intervals and hypothesis test statistics such as the Wald and score statistics [2]. The following alternatives for estimating the Fisher information matrix have been frequently suggested:

I_o(θ̂) = −∂²ℓ(θ)/∂θ∂θᵀ |_{θ=θ̂},  I_e(θ̂) = Σ_{i=1}^{n} U_i(θ̂) U_iᵀ(θ̂),

which correspond to the observed and empirical versions of the information matrix, respectively. Here, U_i(θ) represents the score function for a single observation, which is available when the likelihood function is additive. Another alternative to derive the observed information matrix uses the missing information principle [19]. In this paper, we obtain the observed information matrix in the Gaussian mixture model using the method proposed by Oakes [25], who suggested carrying out the following computation:

−∂²ℓ(θ)/∂θ∂θᵀ = −{Q̈_{θθ}(θ) + Q̈_{θθ'}(θ)},  (12)

where the matrices Q̈_{θθ}(θ) = ∂²Q(θ; θ')/∂θ∂θᵀ |_{θ'=θ} and Q̈_{θθ'}(θ) = ∂²Q(θ; θ')/∂θ∂θ'ᵀ |_{θ'=θ} assume partitioned forms. The entries of the matrices Q̈_{θθ}(θ) and Q̈_{θθ'}(θ) are presented in Appendix A.
To the best of our knowledge, the use of the Oakes method to obtain the observed information matrix for Gaussian mixture models has not been reported previously in the literature.

III. DIVERGENCE MEASURES
In this section, we introduce a dissimilarity index based on the Kullback-Leibler divergence to quantify the similarity among the intensity functions estimated for the different OD pairs of the metro line.
Let Z ∈ R be a random variable with density function f_Z(z). The Shannon entropy, or expected information, is given by (see [6])

H(Z) = −E{log f_Z(Z)} = −∫ f_Z(z) log f_Z(z) dz.

Now, suppose there are two random variables X and Y with probability density functions f(x) and g(y), respectively, which are assumed to have the same support. Based on the entropy notion, we can define divergence measures between the distributions of X and Y. One of the most common measures to determine the divergence between two distributions is the Kullback-Leibler (KL) divergence [6], which is defined as

D(f‖g) = E_f{log[f(X)/g(X)]} = ∫ f(x) log[f(x)/g(x)] dx,

where the notation E_f emphasizes that the expectation is defined with respect to the probability density function f(x).
Although the KL divergence measures the discrepancy between two densities, it is only a pseudodistance measure: the KL divergence from f to g is generally not the same as the KL divergence from g to f. However, from the statistical point of view, it is relevant that D(f‖g) ≥ 0 and D(f‖g) = 0 if and only if f = g almost everywhere. A familiar symmetric variant of the KL divergence is the J-divergence (see, for instance, [6]), which is defined as

J(f, g) = D(f‖g) + D(g‖f).  (13)

These divergence measures have several useful applications, including telecommunications, image analysis and econometrics. An excellent description of the properties and extensions of these procedures is given in [34].
Although it is possible to obtain explicit expressions for the KL divergence between normal, uniform or gamma variables, closed forms for the KL divergence are not available for the class of discrete mixture of normal distributions. This has motivated considerable effort in proposing procedures to approximate the KL divergence in the context of discrete mixtures of normals. For a review of a wider variety of methods, see [7] and [12]. In our numerical experiments, we use the Monte Carlo approach, which will be described next.
Assume that f and g are density functions of random variables following a discrete mixture of normal densities as defined in (1). A Monte Carlo estimate of D(f‖g) is based on a random sample {x_1, ..., x_M} from f(x) and is given by

D_M(f‖g) = (1/M) Σ_{m=1}^{M} log[f(x_m)/g(x_m)].

By the law of large numbers, D_M(f‖g) → D(f‖g) as M → ∞. Moreover, this procedure allows us to obtain a Monte Carlo estimate of the variance of D_M(f‖g), given by

σ_M² = (1/M) Σ_{m=1}^{M} {log[f(x_m)/g(x_m)] − D_M(f‖g)}².

In addition, the standard error of the estimate decreases at rate σ_M/√M, so this procedure can be made as accurate as required simply by increasing the number of random variables generated.
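A direct sketch of this Monte Carlo estimator for Gaussian mixtures, together with the symmetrized J-divergence, might look as follows. This is a minimal illustration under our own naming conventions (each mixture is a tuple of weights, means and variances), not the paper's implementation:

```python
import math
import random

def mixture_pdf(x, weights, means, variances):
    """Density of a one-dimensional Gaussian mixture."""
    return sum(w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
               for w, m, v in zip(weights, means, variances))

def mc_kl(f_params, g_params, M=50_000, seed=0):
    """Monte Carlo estimate of D(f || g) and its standard error, sampling from f."""
    weights, means, variances = f_params
    rng = random.Random(seed)
    comps = rng.choices(range(len(weights)), weights=weights, k=M)
    xs = [rng.gauss(means[i], math.sqrt(variances[i])) for i in comps]
    logs = [math.log(mixture_pdf(x, *f_params) / mixture_pdf(x, *g_params)) for x in xs]
    d = sum(logs) / M
    var = sum((l - d) ** 2 for l in logs) / M
    return d, math.sqrt(var / M)

def j_divergence(f_params, g_params, M=50_000):
    """Symmetrized J-divergence: D(f||g) + D(g||f)."""
    return mc_kl(f_params, g_params, M)[0] + mc_kl(g_params, f_params, M)[0]

# Illustration with single-component "mixtures" N(0,1) and N(1,1),
# whose true KL divergence is 0.5 in each direction.
f = ([1.0], [0.0], [1.0])
g = ([1.0], [1.0], [1.0])
d, se = mc_kl(f, g)
```

For the single-Gaussian case the closed-form KL value is known, which makes it a convenient check that the sampler converges at the expected 1/√M rate.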

IV. CLUSTER ANALYSIS
In this section, we briefly describe the most common methods used in clustering analysis in order to apply them to group OD pairs with similar behavior. We seek to determine whether clustering techniques based on a symmetrized KL distance, i.e., the J-divergence in (13), are suitable for characterizing different types of passenger flow between different pairs of stations in the OD matrix. In particular, we expect a high value of the divergence between the OD pairs with a higher passenger load in the morning and the OD pairs with a peak of passenger flow in the afternoon, leading to clearly separated clusters. Other types of flow are expected to appear in the analysis when clustering together OD pairs that have similar traffic profiles during the day.

A. CLUSTERING METHODS
Let X = {x_1, ..., x_n} be the finite set of all possible elements to be grouped. In this case, X ⊂ F_{Θ_g} is the set of all Gaussian mixture densities associated with the OD pairs. Now, the network concept can be introduced [4].

Definition 2:
A network N is a pair (X, D), where X = {x_1, ..., x_n} is a finite set of points to be grouped, and D(·‖·) : X × X → [0, +∞) is a divergence measure. The set of all networks is denoted by N.
A partition P of X is a collection of sets P = {G_1, ..., G_k} such that G_i ⊆ X for all i = 1, ..., k, ∪_{i=1}^{k} G_i = X, and G_i ∩ G_j = ∅ for all i ≠ j. We will call the partition P a grouping or clustering of X. A partition P depends on the network used; thus, two different divergence measures will not necessarily be associated with the same partition.
Among all possible clustering techniques, we consider two approaches: partitional and hierarchical clustering. Within the first approach, there are two well-known techniques: k-means and k-medoids. These methods require choosing the number of groups k and using a symmetric divergence measure J(·‖·) between pairs of elements in X. Because the k-means method is well known in the literature (see, for instance, [15]), in the Appendix we briefly describe only the k-medoids technique.
Hierarchical clustering techniques proceed by either successive mergers or successive divisions. The way in which these techniques use the nearest neighbor between items is known as linkage methods. Hierarchical clustering generates a series of nested partitions [32], the description of which has been relegated to the Appendix.

B. SELECTION OF NUMBER OF CLUSTERS
One of the issues in clustering is choosing the number of groups for an available data set. Depending on the method used or the type of data, there are several techniques for the validation of the groups, among which heuristic indexes are commonly used to select the number of groups. The main difficulty is to find the right number of clusters so that there are neither too few clusters containing very dissimilar data nor too many clusters containing very similar data. Among all existing coefficients, analysts prefer techniques that satisfy an optimality condition. We mention three coefficients of this type: the elbow method, the silhouette method, and the gap statistic [30], [33]. These methods are widely known, and implementations in R are available on several websites. See, for instance, https://uc-r.github.io/kmeans_clustering#gap. In the numerical experiments carried out in Section V, the silhouette method was used to select the number of clusters. However, for the metro system considered here, the choice of the number of clusters may be considered a design parameter depending on the number of groups required for analysis, simulation or optimization purposes.
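As a minimal illustration of the silhouette criterion, the coefficient can be computed directly from a precomputed dissimilarity matrix (such as one built from the J-divergence). The sketch below uses hypothetical toy data and our own function names:

```python
def silhouette_from_dissimilarity(D, labels):
    """Mean silhouette coefficient from a precomputed dissimilarity matrix D
    (list of lists) and integer cluster labels, one per item."""
    n = len(D)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i in range(n):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # singleton convention
            continue
        # a: mean dissimilarity to own cluster; b: to the nearest other cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(D[i][j] for j in other) / len(other)
                for lab, other in clusters.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Toy example: four items forming two well-separated groups
pts = [0.0, 0.1, 5.0, 5.1]
D = [[abs(a - b) for b in pts] for a in pts]
good = silhouette_from_dissimilarity(D, [0, 0, 1, 1])
bad = silhouette_from_dissimilarity(D, [0, 1, 0, 1])
```

Selecting the number of clusters then amounts to evaluating the candidate partitions for each k and keeping the one with the largest mean silhouette.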

V. CASE STUDY: EFE VALPARAISO
In this section, we introduce numerical experiments to gain insights into passenger behavior in the EFE Valparaíso metro line, analyzing a real data set consisting of passenger trips during August 2019. EFE is the Chilean state railway company, and EFE Valparaíso is the largest metro line outside Santiago, the capital city of Chile, with 20 stations along a 43-kilometer line and moving around 20 million passengers per year.

A. PARAMETER ESTIMATES
Our first goal is to estimate the parameters of a Gaussian mixture model of the form (1) for one of the OD pairs. In particular, we estimate the intensity function for the pair (17,1) using the data of all Tuesdays during August 2019. We also provide confidence intervals for all parameters of model (1). Based on the experience of line operators, we expect clear patterns in different OD pairs. Trips from the last to the first stations correspond mainly to the working population, so a greater flow of passengers is expected in the morning. In the opposite direction, when workers mainly return to their homes, a higher flow is expected in the afternoon. Later, in Section V-B, we numerically analyze the passenger flow patterns of the metro system considering all OD pairs of EFE Valparaíso.
In [1], Bayesian Information Criterion (BIC) was used to determine the (best) model order and, thus, the number of parameters for the different point processes intensity functions. That analysis showed that the choice of g = 3 Gaussian components for the different OD pairs of the EFE Valparaíso metro line was the optimal trade-off between fitting of the real data and model complexity. Thus, g = 3 is used for all the experiments carried out in this section.
To define the initial parameters for model (1) that are required for the EM estimation algorithm, the time instants were defined based on the available operation schedule as T i = 6 and T f = 23. The initial parameter vector was θ (0) = (0.33, 0.33, 10.25, 14.5, 18.75, 1, 1, 1) . The bootstrap confidence intervals for the components of θ were generated using B = 1000 bootstrap samples [10].
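The parametric bootstrap used for the confidence intervals follows a generic recipe: simulate B data sets from the fitted model, re-estimate the parameters on each, and take empirical percentiles of the replicates. A hedged, generic sketch follows; for brevity it bootstraps the mean of a single Gaussian rather than the full mixture model, and all names are ours:

```python
import random

def parametric_bootstrap_ci(estimate_fn, simulate_fn, theta_hat,
                            B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a scalar estimator.
    simulate_fn(theta_hat, rng) draws one data set from the fitted model;
    estimate_fn(sample) returns the re-estimated scalar parameter."""
    rng = random.Random(seed)
    reps = sorted(estimate_fn(simulate_fn(theta_hat, rng)) for _ in range(B))
    return reps[int(B * alpha / 2)], reps[int(B * (1 - alpha / 2)) - 1]

# Illustration: 95% CI for the mean of a model fitted as N(0, 1), n = 100
simulate = lambda th, rng: [rng.gauss(th[0], th[1]) for _ in range(100)]
estimate = lambda sample: sum(sample) / len(sample)
lo, hi = parametric_bootstrap_ci(estimate, simulate, (0.0, 1.0), B=1000)
```

In the mixture setting, `simulate_fn` would draw arrival times from the fitted intensity and `estimate_fn` would rerun the EM algorithm, applied once per parameter of interest.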
For OD pair (17,1), the total numbers of passenger arrivals, ordered from the first to the last Tuesday of August 2019, were 275, 309, 302, 305 and 287. The daily average number of passengers was n̄_d = 295.6. Figure 1 shows the initial intensity and the estimated value provided by the EM algorithm. In addition, Table 1 shows the obtained parameter estimates, their standard errors, and the corresponding Oakes and bootstrap 95% confidence intervals for all parameters of model (1). Notice that, from (3), the weight of the third Gaussian density is given by π_3 = 1 − π_1 − π_2. The Oakes method yields the lowest standard errors for 7 of the 8 parameters of the model, while the bootstrap method gives the best results for π_1, the Gaussian component with the highest passenger arrivals (see Table 1). The Oakes method also achieves the shortest confidence intervals for the second and third Gaussian components, which have lower passenger flows. The confidence intervals for π_1, μ_1 and σ_1² exhibit similar behavior for all methods.
The estimated intensity in Figure 2 shows a good fit for every day included in the study. The passenger arrival quantiles vs. the Gaussian mixture quantiles show small deviations from the straight line, supporting the goodness of fit in each case. Figure 3 shows the 95% confidence intervals of the parameter estimates obtained by the Oakes method when the number of days used as data for estimation is increased. It can be noticed that the uncertainty around the point estimates is clearly reduced as the number of days increases. In fact, all the parameter estimates (except σ_1) show a 50% reduction in the associated confidence intervals. The figure also shows that the intervals obtained with 3 or more days of data are similar for each parameter.
In the next subsection, we are interested in quantifying the similarity between the intensity functions of different OD pairs. To estimate these densities, the same procedure as in this subsection was followed: a mixture of g = 3 Gaussian densities was fitted to each OD pair using the EM algorithm.

B. CLUSTERING STATIONS PAIRS
In this subsection, we use the OD pair densities of the line model to estimate a dissimilarity matrix by the Monte Carlo method using 10 6 random samples. Later, the dissimilarity matrix is used as input to hierarchical and partitional clustering algorithms.
We compute the symmetrized KL divergence (13) between all of the 380 OD pairs in the metro line, yielding a 380 × 380 dissimilarity matrix. The values of the dissimilarity index range approximately from 0 to 8, with low divergence values observed most often. Figure 4 shows the KL-based dissimilarity index, where the maximum of the color scale has been set to 2.25, which covers 90% of the distances between OD pairs. The pairs containing station 11 as an entry divide the matrix into quadrants. The most significant contrast appears in the lower left quadrant, which compares OD pairs with entry between the first and eleventh stations against pairs with entry between the twelfth and twentieth stations. The silhouette coefficient suggests that the best fit for all algorithms is a partition into 2 clusters (Figure 5).
Although complete linkage has the second worst silhouette coefficient, Figure 6 shows that this method produces more balanced hierarchies than single linkage (which tends to link OD pairs one by one). For instance, when using complete linkage, a partition into 2 clusters produces clusters with 155 and 225 elements, and a partition into 7 clusters still does not produce a singleton. Average linkage is an intermediate case between single and complete linkage because outliers are detected at higher dissimilarity values. However, average linkage is also capable of recognizing clusters with many densities at lower dissimilarity values (see Figure 7).
The choice of the number of clusters for the OD pairs may be considered a design parameter depending on the number of groups required for analysis, simulation or optimization purposes. Although the silhouette coefficient suggests a partition into 2 clusters, we analyze the partition into 5 medoids to recognize more traffic profiles in the metro system. The numbers of grouped densities for clusters 1 to 5 are 131, 74, 42, 63 and 70. Figure 8 shows that most OD pairs above the main diagonal of the OD matrix (cluster 1, in blue) exhibit an afternoon traffic peak, when a large number of passengers return home after work. By way of contrast, OD pairs corresponding to clusters 3 and 5 show a higher flow in the morning, when passengers most probably travel to their workplaces. Notice that these two clusters are different because cluster 3 shows a more pronounced peak. On the other hand, cluster 2 shows both a morning and an afternoon peak. Finally, a key characteristic appears in cluster 4, where a higher passenger flow at midday can be noticed. Figure 8 also shows the singular behavior of station 11. When considered as an origin station (row 11 of the matrix), we notice an afternoon peak (cluster 1, in blue). On the other hand, when considered as a destination station (column 11), peak traffic appears in the morning. This behavior might be expected since this station corresponds to the El Salto industrial area in Viña del Mar, where people travel in the morning to work and leave in the afternoon to go home.
An interesting insight that arises from the clusters and matrix shown in Figure 8 is related to the OD pairs close to the diagonal. Most of these pairs are in clusters 2 and 4, which correspond to a more even flow of passengers during the day. In fact, this kind of passengers' behaviour has not been previously recognized by EFE Valparaíso operators, showing an advantage of the proposed methodology.

VI. CONCLUSIONS
In this paper, we have proposed a methodology to model passenger movement in a metro line and to characterize the flow patterns that appear among the different origin-destination pairs. The paper makes two contributions. First, the instantaneous rate of arrival of passengers for each origin-destination pair is modeled using a Gaussian mixture intensity function, exploiting the EM algorithm to obtain parameter estimates and the associated confidence intervals. In fact, the use of data from multiple days leads to a reduction of the uncertainty in the parameter estimates, showing, for example, a 50% reduction of the confidence intervals obtained by the Oakes method when going from 1 day to 3 or more days of data.
The case study presented in the paper confirms that the Gaussian mixture model allows for good approximation of the (observed) intensity functions throughout the day. Moreover, the confidence intervals of the estimates obtained by different approaches are comparable and allow us to detect estimates with poor accuracy, for example, for stations with fewer passengers.
The second contribution presented in the paper is to quantitatively compare the obtained intensity functions for the different origin-destination pairs of stations in the line. The dissimilarity index proposed to measure the distance between the intensity functions was derived from the KL divergence. This approach provides asymptotic properties that enhance its use in practice since approximate confidence intervals and hypothesis tests could be derived as a direct consequence.
In addition, the hierarchical and medoid clustering methods proposed in the paper to characterize similar patterns among the origin-destination intensity functions provide both visual and quantitative information about different passenger behaviors in the line. The results obtained show patterns that are consistent with knowledge about passenger behavior in morning and evening peak hours but, more significantly, provide additional insights for specific origin-destination pairs.
We believe that the general methodology proposed in the paper can be applied to other similar transport networks, providing useful information for analysis, simulation and optimization.

APPENDIX B CLUSTERING ALGORITHMS
Let m_i ∈ G_i be the medoid of the group G_i; then, for all i = 1, ..., k, m_i is obtained as

m_i = arg min_{x ∈ G_i} Σ_{y ∈ G_i} J(x‖y).  (14)

Let S ⊆ X be the set of all selected medoids. The objective of the method is to determine the set S = {m_1, ..., m_k} that minimizes the sum of all dissimilarities between each medoid and the elements of the respective group, i.e.,

arg min_S Σ_{i=1}^{k} Σ_{x ∈ G_i} J(m_i‖x).  (15)
Algorithm 1 summarizes the necessary steps to yield the final clusters.
One of the most commonly used methods to obtain a set of medoids is partitioning around medoids (PAM), introduced in [18]. This method consists of two steps, BUILD and SWAP. More details can be found in [18, Section 2.4].
In order to describe the hierarchical agglomerative methods we state the following definition.
Algorithm 1 k-Medoids Method
1. Define k initial medoids m_1, ..., m_k.
2. Assign each x ∈ X to the cluster of its nearest medoid: x ∈ G_i ⇔ m_i = arg min_{m ∈ S} J(x‖m).
3. Update S through Equation (14).
4. Repeat steps 2 and 3 until there are no new updates in S.
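The alternation of assignment and medoid update in Algorithm 1 can be sketched in a few lines, assuming a precomputed symmetric dissimilarity matrix D; the names and toy data below are ours:

```python
def k_medoids(D, init_medoids, iters=100):
    """Sketch of the k-medoids alternation on a precomputed dissimilarity
    matrix D (list of lists). init_medoids: initial medoid indices."""
    n = len(D)
    medoids = list(init_medoids)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest medoid
        groups = {m: [] for m in medoids}
        for x in range(n):
            nearest = min(medoids, key=lambda m: D[x][m])
            groups[nearest].append(x)
        # Update step: each medoid minimizes total dissimilarity within its group
        new_medoids = [min(g, key=lambda c: sum(D[c][y] for y in g))
                       for g in groups.values()]
        if sorted(new_medoids) == sorted(medoids):
            break  # no update: converged
        medoids = new_medoids
    return medoids, groups

# Toy data: six points on a line, two natural groups
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(a - b) for b in pts] for a in pts]
medoids, groups = k_medoids(D, [0, 3])
```

Note this is the simple alternation of Algorithm 1, not the BUILD/SWAP phases of PAM, which explore swaps more exhaustively.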
Definition 3: A partition P_1 of k clusters is said to be nested within a partition P_2 that has r < k clusters if for each G ∈ P_1 there exists G' ∈ P_2 such that G ⊆ G'. Nested partitions are denoted by P_1 ⊑ P_2.

Hierarchical clustering generates a hierarchy of nested partitions. Let P_0 be the partition of X into n groups, i.e., singletons, and let P_{n−1} be the partition with a single group containing all objects. Agglomerative hierarchies generate the nested sequence of partitions

P_0 ⊑ P_1 ⊑ · · · ⊑ P_{n−1},

in which, if two objects are joined together in a group, they are never separated again. On the other hand, divisive hierarchies generate the reverse sequence, from P_{n−1} down to P_0.
Merging of the clusters is carried out using one of the three linkage criteria described as follows.

Single linkage: groups are formed from the individual entities by merging nearest neighbors, i.e.,

D(G_i, G_j) = min_{x ∈ G_i, y ∈ G_j} J(x‖y).

Complete linkage: at each stage, the distance between clusters is determined by the distance between the two elements, one from each cluster, that are most distant, i.e.,

D(G_i, G_j) = max_{x ∈ G_i, y ∈ G_j} J(x‖y).

Average linkage: computes the average of the dissimilarities between objects from different clusters through

D(G_i, G_j) = (1 / (|G_i||G_j|)) Σ_{x ∈ G_i} Σ_{y ∈ G_j} J(x‖y),

with |G_i| being the number of objects belonging to G_i. The hierarchical agglomerative method is summarized in Algorithm 2.
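The three linkage criteria can be written as one small helper operating on a precomputed dissimilarity matrix. This is an illustrative sketch with our own names, not the paper's implementation:

```python
def linkage_dissimilarity(D, Gi, Gj, method="average"):
    """Between-cluster dissimilarity under single, complete or average linkage.
    D: precomputed dissimilarity matrix; Gi, Gj: lists of item indices."""
    pairs = [D[x][y] for x in Gi for y in Gj]
    if method == "single":
        return min(pairs)       # nearest pair across the two clusters
    if method == "complete":
        return max(pairs)       # farthest pair across the two clusters
    return sum(pairs) / (len(Gi) * len(Gj))  # average over all cross pairs

# Toy data: three points on a line
pts = [0.0, 1.0, 5.0]
D = [[abs(a - b) for b in pts] for a in pts]
```

At each agglomerative step, the pair of clusters with the smallest linkage value is merged, so single, complete and average linkage can produce quite different hierarchies from the same matrix.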