Cluster-Specific Latent Factor Estimation in High-Dimensional Financial Time Series

Unsupervised learning methods have been increasingly used for detecting latent factors in high-dimensional time series, with many applications, especially in financial risk modelling. Most latent factor models assume that the factors are pervasive and affect all of the time series. However, some factors may affect only certain assets in financial markets, due to their clustering within countries, asset classes, or sector classifications. In this paper we consider high-dimensional financial time series with pervasive and cluster-specific latent factors, and propose a clustering and latent factor estimation method. We also develop a model selection algorithm, based on the spectral properties of asset correlation matrices and asset graphs. A simulation study with known data generating processes demonstrates that the proposed method outperforms other clustering methods and provides estimates with a high degree of accuracy. Moreover, the model selection procedure is also shown to provide stable and accurate estimates for the number of clusters and latent factors. We apply the proposed methods to datasets of asset returns from global financial markets using a backtesting approach. The results demonstrate that the clustering approach and estimated latent factors yield relevant information, improve risk modelling and reduce volatility in optimal minimum variance portfolios.


I. INTRODUCTION
With the rise of data driven decision making in risk management, statistical and machine learning methods are becoming increasingly important as their ability to uncover meaningful information and perform well out-of-sample is put to the test in real world scenarios. This field has recently attracted a fair amount of interdisciplinary research, bringing together mathematical, physical, econometric and computer science approaches [1]- [3]. These methods are of critical importance in financial risk modelling, where the dynamics of asset return time series are driven by underlying risk factors [4]. To estimate the effects that these underlying factors have on observed asset returns, traditional modelling approaches use observable macroeconomic time series (such as GDP growth, interest rates, or market returns) as model inputs [5], while others focus on finding proxies for unobservable factors (known as size, value, or momentum) using economic firm-level data [6], [7]. However, this information The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang . is not always available for every security (i.e. derivatives or certain ETFs or indices), meaning that these standard approaches may not be universally applicable [8]. Moreover, recent empirical results have been challenging some of these models, giving advantage to more agnostic statistical approaches [9].
Today, with the advances in financial technology and the globalization of financial markets, the number of investable securities and their diversity in terms of asset classes and country of origin is larger than ever. Throughout the past decades, these developments motivated the increased focus on statistical and unsupervised learning techniques for uncovering latent risk factors in asset return data [10], [11]. However, even though the number of assets continues to grow, the observable time period used to estimate these models must remain short. This is due to the fact that financial markets are known to exhibit sudden changes in dynamics and stationarity can not be assumed over long time periods -asset return volatilities and correlations change over time, especially in the presence of financial bubbles and crashes [12]- [14]. Therefore, these unsupervised learning methods must be VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ able to perform well on high-dimensional datasets when the number of time series N is commensurate or even larger than their length T [15]. The out-of-sample performance of these estimates is crucial for many portfolio optimization or risk management applications [1], [16]. Since deep learning techniques (such as autoencoders, restricted Boltzmann machines, or GANs) use nonlinear models with a large number of parameters, they require large amounts of data for training [17] and may perform poorly in high-dimensional settings -instead, more restricted and parsimonious methods are required [18].

FIGURE 1.
A grid view of a setting with time series X 1 , . . . , X 7 affected by factors F 1 , . . . , F 5 , such that F 1 and F 2 are pervasive factors, F 3 is specific to time series X 1 , . . . , X 4 , while F 4 and F 5 are specific to time series X 5 , . . . , X 7 .
In this search for tractable and plausible latent factor estimation methods, it is crucial to take advantage of the structural specifics and statistical stylized facts of financial markets [19]. Since global markets consist of assets from different exchanges and various asset classes, certain risk factors will be specific only to a subset of assets [20]. For instance, in a global set of financial assets, pervasive global factors may affect all time series (such as the global macroeconomic and market shocks), and cluster-specific factor related to certain countries will affect only specific clusters of assets (for instance, European stocks will be affected by their own set of factors and may not be affected by some Asian market factors, after controlling for the common global component). Such a setting is shown in Figure 1, where the assets X 1 , . . . , X 7 are exposed to pervasive factors F 1 and F 2 and only certain clusters of assets are exposed to the cluster specific factors F 3 (affecting cluster of assets X 1 , . . . , X 4 ) and F 4 , F 5 (affecting cluster of assets X 5 , . . . , X 7 ). However, the majority of modelling approaches consider only pervasive latent factors, decomposing the asset return variability into the variability explained by pervasive factors (affecting all assets) and idiosyncratic components (individual asset risk) [5], [21]. In this paper we consider the existence of cluster-specific latent factors, and propose a clustering and latent factor estimation method which simultaneously estimates the unknown cluster structures with the pervasive and cluster-specific latent factors. We consider an approximate factor model 1 which belongs to a class of models proposed by Ando and Bai [22], [23], who consider panel data with observable pervasive and unobserved pervasive and cluster-specific factors. The variability of asset returns is decomposed into the variability explained by pervasive factors, cluster-specific factors and idiosyncratic components. The pervasive factors affect all asset return time series, and these assets are divided into clusters in which a certain number of cluster-specific latent factors (the number of which may vary between clusters) affect the assets within that cluster. Since the clustering procedure may be biased towards the clusters with a larger number of cluster-specific factors (due to the fact that more factors will always be able to explain more variability in the data), the algorithm is divided into two main phases: the clustering phase which uses a fixed number of cluster-specific factors for all clusters, and the latent factor estimation phase based on the estimated asset clusters. We also propose a computational approach to model selection which detects the number of pervasive factors, the number of clusters and the number of cluster-specific factors in each cluster.
The main contributions of this paper are: (i) a new method for clustering and estimation of pervasive and cluster-specific latent factors in high-dimensional financial time series; (ii) a new model selection procedure based on spectral properties of the time series correlations and asset graphs. Since there is no ''ground truth'' in financial data (the number of factors, the factors themselves, as well as the clusters are all unknown), we also develop a simulation framework based on data generating processes (DGPs) which feature heavy-tailed distributed returns and correlated residuals (thus replicating statistical properties of asset returns), in which the ground truth is known -allowing us to measure the performance of the estimation procedure and the model selection method. Furthermore, we consider two datasets covering global financial markets, and apply the proposed method to the weekly return data. To measure the quality of the estimates, we develop a backtesting framework which enables us to obtain cross-validation results for the out-of-sample performance of the estimated models. We also construct a portfolio optimization scenario based on mean-variance optimization and perform backtests on financial market data to demonstrate the value of the proposed approach and the ability of the method to reduce risk in portfolio optimization scenarios.
The rest of the paper is organized as follows. In Section II we provide a deeper look into the latent factor model approach and its relation with dimensionality reduction techniques. Section III defines the clustering and latent factor estimation algorithm, as well as the model selection approach. In Section IV a detailed description of the DGPs used to obtain simulation results is given. Section V provides our results on both simulation data and financial market data, and Section VI ends with a conclusion.

II. LATENT FACTORS IN HIGH-DIMENSIONAL FINANCIAL TIME SERIES
Dimensionality reduction techniques are commonly applied to obtain lower-dimensional representations of highdimensional data, such that these representations maintain some key properties of the original data [24], [25]. This is a crucial step in coping with the so-called curse of dimensionality which manifests itself through computational issues, such as sparse samples in high dimensions [26] or the rank deficiency of sample covariance estimates and the difficulties in estimating sample distributions [15]. Feature selection algorithms primarily focus on finding a function z = g(x) transforming the original high-dimensional data x (which may have irrelevant or redundant information) to a lower-dimensional set of variables z which aggregate the relevant information for a certain modelling task [27]. The function g is found by optimization of certain properties, which may be assisted by the class labels or target variables, depending on the modelling task. When the class labels or target variables are not available, unsupervised feature selection techniques focus on finding features which best preserve the clusters in the data [28], remove redundancy [29], [30] or optimize certain spectral properties of the underlying data graphs [31], [32] -either in the original data space or new subspaces [33]. Generally, unsupervised feature selection approaches have been found to yield relevant results in many machine learning tasks, including sequence analysis in bioinformatics [34], text classification [35], and other applications [36].
As opposed to feature selection, the latent factor model approach is focused primarily on finding a function x = h(f ) which explains the high-dimensional observed data x by a lower-dimensional set of factors f . The task of estimating factor models in high-dimensional data may be reduced to a regression task when these factors are known and observed -such cases may be common in biometric, psychometric or economic applications, where factor models are used to investigate the driving factors underlying the dynamics of some phenomena or processes [37]. However, these factors can often be unknown and unobserved, meaning that they must be estimated as latent variables from the data [38], requiring an unsupervised learning approach. The primary task is still to estimate the function x = h(f ), but now the factors need to be estimated from the data f = g(x). Evidently, autoencoder-type approaches can be used to estimate the encoder (f = g(x)) and decoder (x = h(f )) parts of the model, offering a large range of architectures and the ability to model non-linear relationships [39], [40]. However, in the presence of high-dimensional data with the number of samples being small in comparison to the number of features/variables, nonlinear models often fail to generalize due to the large number of parameters -this turns the attention of recent research to high-dimensional latent factor estimation based on robust and regularized statistical methods [18], [41].
In this paper we consider high-dimensional financial time series of asset returns, with the goal of modelling the asset return time series by associating the assets with a lower-dimensional set of underlying factors. Since risk in finance is most commonly proxied by the variability of asset returns, the goal is to explain the variability of asset return time series by their exposure to latent factors. In addition to explaining risk, the estimated latent factor models are often used to obtain better estimates of the high-dimensional covariance matrices, which are ultimately a key component in portfolio optimization [15]. Traditionally, latent factor models in finance assume that the factors are pervasive (they affect all assets) and thus can be found as common components in high-dimensional asset return time series [18], [21]. On the other hand, some recent results suggest that assets indeed tend to form clusters and communities which can be observed in their dependence network structures (modelled either by correlation or other measures of connectedness) [42], [43]. Assuming a strict hierarchical clustering structure, Tumminello et al. [44] form a hierarchical latent factor model and propose an estimation method based on the minimum spanning tree of the underlying assets. Clusters of assets are also known to emerge in stocks of single equity markets (for instance, clusters of stocks belonging to the same sectors) -Kakushadze et al. [45] consider clustering techniques for estimating these groups from the asset return time series. Verma et al. [20] proposed a cluster-specific factor model for the log-volatility with the goal of studying the heteroskedastic properties of volatility in financial assets returns. Other clustering approaches were also shown to improve high-dimensional covariance matrix estimates, which ultimately reduces risk in optimized portfolios [46]- [49]. However, while the evidence on the existence of asset clusters is compelling, certain latent factors may still be pervasive and affect all assets -for instance, global macroeconomic shocks or the market factor [50], [51]. These may not be omitted in the search for asset clusters. To fully exploit the structural properties and obtain better latent factor models, both the asset clustering as well as latent pervasive and clusters-specific factors need to be estimated from the data.

A. MODEL
Let X ti denote the return 2 of asset i at time step t, calculated as the percentage change in prices between periods t − 1 and t. Each asset i is associated with one of K clusters where g i ∈ {1, . . . , K } denotes the cluster index for asset i. We assume a latent factor model in which asset returns depend on the realizations of pervasive (common) factors f tp and 2 In this paper we consider the periodic (also known as arithmetic) returns X t = (S t − S t−1 )/S t−1 , where S t is the asset price at time t. Even though log-returns may have more elegant statistical properties, periodic returns allow for efficient matrix operations to be used in cross-sectional and portfolio return calculations. For more details on this topic see [52]. VOLUME 8, 2020 cluster-specific factors φ tq : where t = 1, . . . , T is the time index, i = 1, . . . , N is the asset index, p = 1, . . . , P is the pervasive factor index, and q = 1, . . . , C k is the cluster-specific factor index for cluster k = 1, . . . , K . Each of the K clusters is allowed a different number of factors C k -thus, the total number of cluster-specific factors is Q = k C k . The residual term e ti (also called the idiosyncratic component) represents the sources of risk which are individual to each asset and are not explained by the common factors. The model (1) can also be written in matrix notation as: where X ∈ R T ×N contains N asset return time series of length T , F ∈ R T ×P are the realizations and B ∈ R N ×P are the loadings for P pervasive factors. The realizations of Q cluster-specific factors for all K clusters are = (1) , . . . , (K ) and the cluster-specific factor loadings are the C k columns of and corresponding to factor realizations and loadings associated with cluster k. The term e ∈ R T ×N contains all of the N individual idiosyncratic components.
Since the pervasive factors affect all time series, the pervasive factor loading matrix B is full, whereas the cluster-specific loading matrix is non-zero only for the elements which correspond to assets and factors associated with the same cluster: The pervasive and cluster-specific factors are assumed to be uncorrelated: Cov(f p , φ q ) = 0, ∀p, q. The factor covariance matrices Cov(F) and Cov( ) are assumed to be positive definite, thus allowing for some correlations between cluster-specific factors. The assumed factor model is approximate, meaning that the error terms e, also known as idiosyncratic components (since they represent individual sources of risk for each asset), are zero-mean but are allowed cross-sectional correlations and heteroskedasticity. This implies that residual covariance Cov(e) is not necessarily diagonal, but it needs to be sparse (the cross-correlations in the idiosyncratic components can not be a consequence of common factors in the data) [18].
The factors are latent (unobservable), the clustering is unknown, as well as the numbers of factors, clusters, and cluster-specific factors -all of these need to be estimated from the data. Given the model (2) and the assumptions, in the following we propose an approach to estimate all of the above. First an iterative method clusters the data assuming a fixed number of cluster-specific factors in each cluster. Then the numbers of cluster-specific factors inferred from the data using the estimated clusters. To estimate the number of pervasive factors and clusters, we propose a model selection method based on the spectral properties of the asset correlation matrix and the asset graph estimated from the return time series.

B. CLUSTERING AND LATENT FACTOR ESTIMATION
ij denote the Frobenius norm of a matrix A. Given a data matrix X, and assuming a known number of pervasive factors P, number of clusters K and number of cluster-specific factors in each cluster C k , consider the following loss function: The loss function is the error of unexplained variation in the data. According to the Eckart-Young-Mirsky theorem, if all factors are pervasive the optimal low-rank approximation is given by the principal components (PC) estimator [5], [18], [53], based on the eigenvalue decomposition of the matrix 1 T X X = UDU . The pervasive factors and loadings are then estimated as: where U P are the P eigenvectors corresponding to the largest P eigenvalues, contained in the diagonal matrix D P . The principal components estimator is in the focus of many high-dimensional statistical applications [18], [21] -however, it is not applicable in the presence of cluster-specific factors. Moreover, for the assumed model, such a direct analytic estimation is not obtainable, since the loss function (4) needs to be optimized subject to the cluster-specific factor condition (3), given the clustering G = [g 1 , . . . , g N ]. The estimates of the pervasive factors, cluster memberships, and cluster-specific factors all depend on each other, and thus require an iterative approach -in which the PC estimator will prove useful.

1) CLUSTER ASSIGNMENT
If the pervasive factors F with loadings B and cluster specific factors are known, each asset can be assigned to the cluster which minimizes its value of the loss function (4).
To do so, we define Y = X − FB and find the candidate cluster-specific loadings for cluster k as: where (k) are the cluster-specific factor realizations for cluster k, as defined previously. Using the estimates we calculate the loss matrix L ik = l(X i ; F, B, (k) ,˜ (k) ) for each combination of assets i = 1, . . . , N and clusters k = 1, . . . , K . The clusters are then directly assigned as: meaning that each asset belongs to the cluster whose factors minimize the loss function (4) for that asset. This step can also be interpreted as a generalization of the assignment step in Lloyd's algorithm for k-means clustering, with C k cluster-specific factors instead of centroids, and the loss function (4) instead of the Euclidean distance.

2) ESTIMATION OF CLUSTER-SPECIFIC FACTORS
For a given clustering g = [g 1 , . . . , g N ] and known pervasive factors F with loadings B, all assets within cluster k are exposed to the cluster-specific factors (k) -for that subset of assets, these factors could be considered pervasive. This enables the estimation of the factors using the subset of asset return time series Y (k) ∈ R T ×N k containing only the N k time series in cluster k. Following the logic in (5), the factor loadingsˆ (k) for cluster k are then estimated from the eigenvectors of the largest C k eigenvalues of the The cluster-specific factor realizations are calculated asˆ .

3) ESTIMATION OF PERVASIVE FACTORS
Given the clustering g and cluster-specific factors with loadings , we define Z = X − . The pervasive factor loadingsB are estimated from the eigenvectors of the largest P eigenvalues of 1 T Z Z, and the factor realizations arê F = ZB(B B ) −1 .

C. MODEL SELECTION 1) ESTIMATING THE NUMBER OF PERVASIVE FACTORS AND CLUSTERS
To estimate the number of pervasive factors P and the number of clusters K from the data, we apply the Ahn-Horenstein eigenvalue ratio (ER) test [54] and propose a method for estimating the number of clusters using a graph (network) of assets. Since the estimates depend on each other, we propose a method in which P and K are estimated from several considered candidates, based on a joint criterion.
The ER approach sorts the eigenvalues of the data correlation matrix in a descending order and defines the eigenvalue ratio: where ξ i is the i-th largest eigenvalue. The test detects the shift from the common factor part of the spectrum to the idiosyncratic part [54], as seen in Figure 2. The larger the ER ratio η i . However, in this case, the shift will be between the pervasive factor part and the cluster-specific factor part (since the cluster-specific factors affect less assets, the eigenvalues corresponding to them will be lower than those representing pervasive factors). Moreover, instead of just picking the maximum value of ER, to obtain a more robust final estimate and avoid discarding potentially better solutions, we select a number of candidates for the the number of pervasive factorsP 1 , . . . ,P n , corresponding to the n largest ratiosη To detect the clusters of data, for eachP i we form an asset graph from the time series Y = X −FB , whereF andB are estimated from the data using the PC estimator. Each of the N nodes in the graph represents an asset and the edges between them depend on a similarity measure w ij = |ρ(Y i , Y j )|. In order to obtain accurate and robust estimates, the asset graph needs to reflect the following properties: i) assets which are very close (having a high w ij ) should be connected, ii) assets in the same cluster should have a short path between them (high connectivity clusters), iii) the spectral properties of the graph need to be stable, since the estimation depends on the Laplacian spectrum.
The first property is found in the -neighborhood ( -N) graph, constructed simply by keeping only the edges w ij > which are above a certain threshold . The second property is found in the k-nearest neighbors (kNN) graph constructed by keeping the k edges with highest values of w ij for each node i = 1, . . . N , commonly used in spectral clustering [55]. Finally, since the -N and kNN graphs may not always be connected graphs (they may contain multiple connected components), their spectral properties may differ depending on the number of connected components, we also consider the maximum spanning tree (MST) graph, which always consists of one connected component. To obtain the maximum spanning tree, the edges w ij are multiplied by −1 and Kruskal's algorithm for minimum spanning tree construction is applied. The final asset graph is a combination of the three approaches: with W ( N ) , W (kNN ) , and W (MST ) being the adjacency matrices of the -N, kNN and MST graphs. The asset graph has favorable properties from all three methods combined, resulting in a structure shown in Figure 3. Given the graph adjacency matrix W, the number of clusters can be estimated based on the spectral properties of the VOLUME 8, 2020 asset graph Laplacian: where D is the diagonal node degree matrix D = diag(d 1 , . . . , d N ). The number of zero valued eigenvalues in the spectrum of the Laplacian matrix is equal to the number of connected components in the graph. Since the proposed graph W contains the MST, it will always have one connected component -thus the Laplacian L will have exactly one eigenvalue equal to zero. The K − 1 eigenvalues ψ 2 , . . . , ψ K will be close to zero for a graph containing K clusters (the end case being a graph divided into K connected components which will have exactly K eigenvalues equal to zero). 3 To find the number of clusters in the graph, eigenvalues ψ i of the Laplacian are sorted in an ascending order and, in analogy with (8), we define the the Laplacian eigenvalue ratio (LER) η (c) i = ψ i /ψ i+1 , as seen in Figure 4. Just like in the ER test, the number of clusters is estimated as the i which maximizes the LER: Combining the two approaches, for eachP i with its corresponding ERη (p) i , we have aK i with its LERη (c) i . To decide between the candidate numbers of pervasive factors and clusters, we focus on the ER and LER values -the higher the ER and LER, the more evidence towards them representing the correct P and K . Thus, taking both into account, we propose a heuristic rule to selectP =P j andK =K j , where j = argmax iη i . By doing so, among the similar candidates for P, we select those which yield better resolution for The overview of the proposed model selection algorithm is given in Algorithm 1.

2) ESTIMATING THE NUMBER OF CLUSTER-SPECIFIC FACTORS
During the cluster assignment step, the clusters with a larger number of cluster-specific factors C k will naturally attract more assets (since the time series in clusters with more cluster-specific factors will tend to have a lower value of L ik ), and the cluster membership estimates will be biased towards them. Even knowing the right number of cluster-specific factors in each cluster will not guarantee that the assets will be associated with the correct clusters. Our algorithm resolves this issue by having the number of clusters equal for all clusters C k = C 0 , ∀k during the entire iterative clustering procedure. Given the estimated clusteringĝ, the N k time series Y (k) = X (k) − FB (k) will have a pure factor structure, containing C k factors, and C k can be estimated using the ER estimator. After C k is estimated for each cluster, another phase of the iterative procedure is run, containing only the update step for the cluster-specific factor estimates and the pervasive factor estimates. An overview of the entire procedure, including clustering, factor estimation and the estimation of the number of cluster-specific factors is given in Algorithm 2.

Algorithm 2 Clustering and Estimation of Pervasive and
Cluster-Specific Factors initializeF,B,ˆ ,ˆ ,ĝ set C k = C 0 for all clusters k = 1, . . . , K while clustering convergence criteria not met do update cluster membership: ] update pervasive factors: givenˆ ,ˆ , calculate Z = X −ˆ ˆ estimate and setF,B from Z end givenF,B,ĝ update C k for all clusters k = 1, . . . , K while error convergence criteria not met do update cluster-specific factors: ] update pervasive factors: givenˆ ,ˆ , calculate Z = X −ˆ ˆ estimate and setF,B from Z end

D. INITIALIZATION AND CONVERGENCE CRITERION
For the initialization, the P pervasive factors F and loadings B are estimated from the data X first, then the asset graph is constructed from Y = X − FB , based on which a spectral clustering method is used to obtain the initial clustering. Finally, for the given clustering g and pervasive factors F with loadings B, we estimate the cluster-specific factors using the data Y (k) , for each cluster k = 1, . . . , K . In both phases (the clustering and the cluster-specific factor estimation), the algorithm stops when there are no cluster changes and the reduction in the loss function l (i) −l (i−1) is less than 10 −5 ·σ 2 m , where σ 2 m is the median variance of all time series X.

IV. SIMULATIONS AND DATA
To verify the validity of our proposed approach and test the estimation algorithm, we define several data-generating processes (DGP) which correspond to the assumed factor model structures. To obtain a model in the form of 2, we generate random factor loadings. The elements of the pervasive loadings matrix B are drawn from a uniform random distribution with mean 0 and variance 1. For the cluster-specific loadings matrix , the elements (k) i are random (also uniform with mean 0 and variance 1) if asset i belongs to cluster k, and are zero otherwise. We form clusters which are all of equal size N k = N /K . Since the approximate factor model allows for some off-diagonal elements in the covariance of residuals, we also generate random sparse covariance matrices with a given idiosyncratic variance σ 2 e on the diagonal (for the details on the procedure for generating positive semi-definite sparse covariance matrices, see the Appendix VI). Given the factor loadings and the idiosyncratic components, the asset mean and covariance can then be calculated as where µ F are the means F is the covariance of P pervasive factors, while µ are the means and is the covariance of Q cluster-specific factors. In our simulations, the means are all zero, and the covariances are both diagonal matrices with equal variances σ 2 F and σ 2 on the diagonal. The full set of simulation parameters is given in Table 1. To simulate asset returns, we use the asset mean and covariance (11) to simulate T returns, drawing from the Gaussian distribution N (µ X , X ) and the Student's t-distribution, with 6 degrees of freedom t 5 (µ X , X ) and 4 degrees of freedom t 3 (µ X , X ). Although such models and estimation methods are often tested using simulations of normally distributed data [18], [22], we additionally use the heavy-tailed Student's distribution, since they replicate the properties of financial returns, as seen in Figure 5.  In addition, we also use two dataset containing weekly financial return time series. Firstly we use a dataset of NAS-DAQ global equity indices between 2005 and 2020 [56]. The original dataset contains a large number of redundant time series, from which we select only the total return NASDAQ indices available in the considered period, leaving us with N = 797 assets. We also consider a dataset of N = 621 ETFs 4 available between 2010 and 2020. The data for this dataset is obtained by downloading historical return data using Yahoo Finance for the ETF tickers available on etf.com, and then again selecting only those time series which have price data for the entire considered period. Both of these datasets cover a wide range of exchanges, countries and specific sectors, and can be used to represent and study the latent risk factors in global financial markets.

A. SIMULATION RESULTS
We verify the proposed algorithm and measure the accuracy of the clustering and model selection methods using the proposed simulation scenario. Using the parameters defined in Table 1), we randomly generate models and for each random model we simulate time series of length T .
We apply the proposed method given the correct P and K , and measure the quality of clustering and the accuracy of the detected number of cluster-specific factors C. The estimated clusteringĝ and ground truth clustering g are compared using the Rand statistic and Jaccard coefficient (see Appendix C). Moreover, for both of these cluster validation measures and any pair of clustering methods we define a paired statistical test 5 in order to test the hypothesis H 0 : There is no difference between two clustering methods, with a one-sided alternative H 1 : Method 2 outperforms Method 1. For each randomly generated model m = 1, . . . , m max , the considered clustering methods are applied and the cluster validation measure is calculated for both results (for instance Rand 1 (m) and Rand 2 (m)), then the p-value is calculated as the fraction of pairs for which Method 2 outperforms Method 1 (in this example, the fraction of samples for which Rand 2 (m) > Rand 1 (m)). We repeat this procedure for the both cluster validation measures, pairing our proposed model-based method with several commonly used clustering approaches (k-means algorithm, spectral clustering [55], and the Ando-Bai estimation procedure [23]). The k-means method uses 1 − |ρ ij | as a distance measure, and the spectral clustering method employs the proposed asset graph estimated directly from X. The Ando-Bai procedure iteratively estimates clusters and latent factors, but using a procedure which does not account for the bias in clusters with different numbers of cluster-specific factors.
We generate a number of m max = 1000 models and for each we simulate time series realizations of length T = 1000, 500, 250, and apply the considered clustering methods and tests. The average Rand and Jaccard statistics, as well as the p-values of the paired resampling tests (comparing the proposed model-based method with each of the other considered clustering methods) are shown in Table 2. These results demonstrate the advantage of the proposed model-based approach, as well as the fact that the existence of pervasive factors may severely hinder clustering accuracy when they are not taken into account. Moreover, in the paired tests, the proposed method outperformed the considered methods for virtually all of the 1000 resampled model realizations (the p-values of < 0.001 mean that in the m max = 1000 simulated models, none were found for which the considered method outperformed our algorithm). To better visualize the paired comparison for these two methods across the simulations, we show the Rand statistic for the Ando-Bai and the proposed model-based method across all 1000 simulations in Figure 6.
In order to look into the bias in clustering when the number of cluster-specific factors differ between clusters, in Figure 7 we also look into the average number of assets in clusters with different numbers of cluster-specific factors, given by two different approaches. The first uses the real C k as the number of cluster-specific factors in each cluster (corresponding to the Ando-Bai method [23]), while the second uses the fixed C 0 in each cluster during the clustering phase. The bias towards the clusters with a larger C k is evident and might be a large source of inaccuracy in the clustering procedure, while our method with C 0 seems to provide accurate clustering without any evident bias in the cluster sizes. These results are obtained for the T = 500 window and the t 4 distribution, but hold for all of the considered combinations.
We also evaluate the performance of the model selection method, using the same simulation environment and the simulated time series lengths. In addition to measuring the percentage of correctly estimated number of pervasive factors, clusters and cluster-specific factors, we also measure the mean absolute deviation for each of these. The results are shown in Table 3. The accuracy of the proposed model selection method is remarkably high, even when presented with heavy tailed data and short time window length. Only the number of pervasive factors seems to suffer a bit in case of the t 4 distribution and T = 250 -nevertheless, the accuracy of 90% achieved for this case is high.

B. FINANCIAL MARKET DATA
The simulation results confirm the ability of the proposed method to provide accurate estimates, even in the presence of correlated residuals, heavy tails, and high-dimensional sample data. However, in real financial market data, such as the NASDAQ global equity indices and ETF data, the latent factors are unknown, as well as the clustering and the number of clusters and latent factors. The proposed method allows us TABLE 2. Clustering performance on simulation data for the proposed method and other considered clustering techniques, using different simulation time window lengths and data distributions. The brackets below each value contain the p-value of the paired resampling test of the considered method compared to the proposed model-based algorithm. All of the values are obtained using simulation parameters given in Table 1.  Table 2. to study and estimate these from the data directly. We first focus on two distinct periods in the NASDAQ dataset: Figure 8 shows the asset graph for the period 2007-2009 FIGURE 7. The sizes of clusters (number of assets N k for different numbers of cluster-specific factors C k , given by two estimation methods. The real number of assets in each cluster is known in the simulation and is equal to N k = 200 for each k. around the global recession, and Figure 9 show the graph for the subsequent period 2010-2020 which corresponds to one of the strongest and longest bull markets in the  history of financial markets. In both graphs, some common clusters emerge (shown in same colors on both graphs): European markets (pink), Brazil and Latin America (purple), North America and global developed market indices (blue), Asian emerging markets (teal), Middle East and Africa (darker green), Asian developed markets (light green). The 2007-2009 graph contains another cluster for India and New Zealand (yellow), and European emerging markets (red)both of which are encapsulated within other clusters in the 2010-2020 graph. The clusters found in the ETF dataset reflect different asset classes, rather than geographic origin, mainly because of the assets within -they mostly follow broad indices from various countries, but differ between stocks, bonds, commodities and others (to remain concise in this section, we do not display them additionally). In addition to serving as a sanity check for the meaning behind the estimated clusters, these results suggest that the clusters and latent factor structures in the data may change through time. This is why, in the rest of our analysis, we focus on rolling time window estimates of the latent factors and clusters, and use out-of-sample data from subsequent future windows to measure the quality of our estimates.
To validate our approach on the available financial market data, we propose a backtesting framework in which the model is estimated on return time series X on look-back windows of fixed length T . Using the estimated model (mainly, the factor loadingsÂ = [B,ˆ ]), a reconstruction of any time series X can be obtained using the N × N filtering matrix of rank P + Q:M =Â(Â Â ) −1Â . This enables us to obtain a reconstruction of any out-of-sample time series X using the in-sample loadings estimates from whichM is calculated: Using the reconstructed time seriesX, we can calculate the unexplained variance in each asset (either for the in-sample or out-of-sample data): where X ti is the realization of time series i at time t,X ti is the model reconstruction given by (12), and X i is the sample mean of time series i. This framework enables us to apply cross-validation principles for estimating the out-of-sample model performance. Specifically, the model estimates from a look-back window of length T are used to reconstruct the future out-of-sample returns on a look-ahead window of length T -using these we can measure the average unexplained variance V = 1 N N i V i both in-sample and outof-sample (denoted V i and V i , respectively).
In addition, to measure the decline in out-of-sample model performance, we calculate the average percentage deterioration in out-of-sample vs. in-sample unexplained variance: We compare the proposed estimation method with the PC estimator, where the number of components for the PC estimator is selected so that it explains at least the amount of variance explained by the proposed model.
The results in Table 4 demonstrate that the proposed approach finds relevant estimates of latent factors in the data which outperform the PC estimates in out-of-sample data, for both considered datasets. Even though the PC estimator yields the lowest in-sample unexplained variance V , the PC estimates deteriorate much more than the proposed model, as seen in the out-of-sample unexplained variance V and the average deterioration d. Moreover, all of these results are in line with other econometric and unsupervised learning studies which find that approximately 30-50% of variance in financial data corresponds to idiosyncratic components [37], [51]. In addition, we find that the model performance in terms of unexplained variance deteriorates fairly less than the PC estimates, suggesting that the proposed estimation method finds more persistent and relevant latent factors in high-dimensional financial time series. In other words, the proposed model-based estimation method generalizes very well to out-of-sample data. This result is expected since the proposed method utilizes the assumed clustering structures within the markets to reduce the number of parameters, thus providing a type of structural regularization of the estimates.
To demonstrate the applicability of the proposed method to risk modelling and portfolio optimization, we consider global minimum variance (GMV) portfolios [51], obtained by solving the following problem: where is the asset covariance matrix. Since the estimation procedure only depends on the covariance, it is a suitable way to demonstrate the ability of the proposed method to provide reliable estimates of asset risk, and is often used to benchmark covariance estimation methods [21]. The covariance matrix is obtained using the expression (11), based on latent factors estimated on a look-back window of length T . The idiosyncratic covariance e is estimated using a thresholding approach, as described in the Appendix VI. At each time step, the estimated GMV portfolios w are held on a look-ahead window of length T and their risks are measured as the realized volatility: √ w w. We compare the volatilities GMV portfolios estimated using the empirical covariance and the covariances calculated using the latent factors estimated by PC and the proposed method. We also use all of the out-of-sample returns of the portfolios built using the PC estimator based covariance and the model based covariance, and apply a test for the equality of their variances. Since these data are paired (measured at same timesteps and thus dependent on some common market factors), and may exhibit heavy tails, we apply a non-parametric version of the Morgan-Pitman test for the equality of variances with paired data, proposed by McCulloch [57]. This test for H 0 : σ 2 x = σ 2 y uses variables u = x + y and v = x − y, and H 0 is rejected if the Spearman correlation coefficient between u and v is significant. The out-of-sample GMV portfolio volatilities and the p-value for the equality of variance tests (comparing our method with the PC estimator) are given in Table 5.
The results demonstrate that not only the out-of-sample portfolio risk is reduced by implementing latent factor models VOLUME 8, 2020 as opposed to empirical estimates, but also that the proposed method works best at reducing out-of-sample portfolio risk. All of the out-of-sample portfolio variances are significantly reduced in comparison to the PC estimator, except for the T = 4 years case in the ETF dataset, where the variance is still reduced but the statistical evidence is not strong enough to reject the equality of variances. Moreover, these results suggest that the empirical covariance based portfolios deteriorate with reducing the look-back time window length, but the model-based covariance estimates yield portfolios which perform very well in short time windows. Evidently, accurate estimates of latent factors allow for the usage of shorter time windows for estimation, which in turn may improve the properties of optimal portfolios since they adapt more quickly to new market conditions.
The results presented in this section demonstrate that the proposed method yields relevant and robust estimates of latent pervasive and cluster-specific factors, which can be applied to global equity market data with the goal of modelling and managing financial risk.

VI. CONCLUSION
In this paper we consider latent factor models of highdimensional financial time series with pervasive and cluster-specific factors. We propose an estimation method which performs time series clustering and estimates latent pervasive and cluster-specific factors iteratively. In order to estimate the unknown number of clusters and latent pervasive and cluster-specific factors, we also propose a model selection method based on the asset correlation matrices and asset graphs.
We test the method using several data generating processes under the approximate factor model assumptions, featuring heavy tailed returns with some off-diagonal correlations of residuals. The simulation study shows that the proposed method yields very accurate clustering results, even for the most severe high-dimensional setting and heavy-tailed distributions. Moreover, we show that the proposed two-phase model-based method estimates clusters which are not biased towards those clusters with a larger number of cluster-specific factors, as is the case with other clustering methods using cluster-specific factors. In addition, the simulation study results suggest that the proposed model selection method provides stable and accurate estimates of the number of clusters, latent pervasive, and latent cluster-specific factors.
We also apply our methods to datasets of return time series of NASDAQ indices and ETFs in a backtesting approach which allows us to use in-sample model estimates to reconstruct out-of-sample return data. By doing so we cross-validate the unexplained variance, and find that the proposed model-based estimation method, although explaining less variance in-sample than the PC estimator, explains more variance out-of-sample, meaning that it generalizes better and provides more robust estimates. In addition, we apply the method for estimating the asset covariance matrix, based on which we build optimized minimum variance portfolios. The results demonstrate the ability of the proposed method to reduce risk in the minimum variance portfolios, which outperform the portfolios built on empirical and PC estimates of the covariance matrix. Moreover, we find that, whereas the empirical covariances deteriorate with the shorter look-back windows, the model-based estimates thrive in these high-dimensional situations, allowing one to use short look-back windows and thus being more adaptive to changing market conditions.
The results presented in this paper suggest that the clustering assumption in high-dimensional financial time series data holds, and that the model-based estimation method indeed extracts useful information about the latent factor structure. These findings affirm and refine asset pricing theories based on multi-factor models, providing evidence on the clustering structures of latent risk factors. This approach may help shed more light on the intricate latent factor structures in global financial markets, as is demonstrated in our results. Ultimately, the robust estimates of pervasive and cluster-specific factors may be used to improve risk assessment and enhance the out-of-sample performance of portfolios built on the estimated models.

APPENDIX A SPARSE COVARIANCE ESTIMATION
To estimate a sparse covariance matrix (sp.) e we start from a full sample covariance estimate (est.) e and apply an adaptive thresholding technique [58], [59]. A specific threshold is set for each element of the matrix (est.) e , so that the scale (the variance of each time series) is taken into account. The simplest way to do this is to consider the sample correlation matrix R (est.) e , and apply a fixed threshold ε r to all elements: The sparse correlation matrix R (sp.) e thus contains only elements larger than ε r or smaller than −ε r . However, this simple hard thresholding rule does not always produce positive-definite matrices R (sp.) e , since certain elements r (sp.) ik and r (sp.) jk may be non-zero (pass above the threshold ε r , but the element r (sp.) ij may be zero (fall under ε r ). This case may be generalized in the term of graphs -the sparse correlation matrix defines a graph where the edges are only those pairwise correlations which pass the threshold value ε r . This graph is actually a very sparse graph with a relatively large number of connected components -however each component may not necessarily be fully connected, and as long as they are not, the resulting correlation matrix will not necessarily be positive-definite. Thus, in order to correct this, we iterate over all connected components defined by matrix R In our simulations, to generate random sparse correlation matrices, we first generate N × N random matrices U where each element is drawn from a uniform distribution U(0, 1), and then define the simulated full correlation matrix R (est.) e = √ UU UU √ UU . Based on this correlation matrix and the procedure proposed above, we obtain the sparse correlation matrix R (est.) e -the threshold in simulations is set so that approximately 10% of off-diagonal elements are kept nonzero. Finally, the sparse covariance (est.) e is calculated similarly as in (16) so that the diagonals are the idiosyncratic component variance σ 2 e . In the estimation procedure on the NASDAQ equity indices dataset, we set the threshold equal to the -neighborhood graph threshold used in the model selection procedure.

APPENDIX B HYPERPARAMETER SELECTION
The proposed estimation method depends on a handful of hyperparameters: the fixed number of cluster-specific factors C 0 in the clustering phase, number of neighbors k in the kNN graph, and the neighborhood threshold in the -N graph. Although the algorithm is not too sensitive to small changes in these hyperparameters, here we provide some quick guidelines on how to select them. Firstly, the algorithm in its clustering phase will not depend too much on the selection of C 0 since the cluster-specific factors in clusters where C k < C 0 will model the C k latent factors and the rest will be noise, while for clusters where C k > C 0 , all C 0 latent factors will be relevant. Nevertheless, we find that a balanced C 0 which is not too large but encapsulates most of the cluster-specific factors will be best, thus we use C 0 = 4 in all of our simulations and results. Furthermore, the number of neighbors k in the kNN graph should primarily reflect the size of the clusters we want to detect in the data. These are naturally dependent on the number of time series N -we find that as a rule of thumb, a good choice will be somewhere between log N and √ N . In our simulations and results we use: k = (log N + √ N )/2 . Finally, for the selection of the neighborhood threshold in the in the -N graph, we suggest that both the length of the time series T and their number N are taken into account. Since longer time series will provide smaller estimation error and more accurate correlations between asserts ρ ij , the standard error in the estimates will be reduced and the threshold may be lowerhowever, the threshold still needs to be above a certain level ρ 0 above which we wish the pairs of assets to be connected in the graph. To account for the statistical uncertainty in the estimation, we propose to set the threshold to the critical value of the approximate Pearson correlation test for the hypothesis H 0 : ρ ij = ρ 0 : where we set ρ 0 = 0.4, T is the time window length, and z is the 1 − α quantile of the standard normal distribution N (0, 1). To account for the fact that the test is applied to all pairwise coefficients ρ i j, we use to Bonferroni correction and set α = 0.05/ N 2 . These values are used in our simulations and results for all different lengths of time windows.

APPENDIX C CLUSTER VALIDATION
To measure the clustering performance of the proposed method, we calculate the Rand statistic and Jaccard coefficient, both of which are commonly used techniques to measure the agreement between different partitions of the same set and can be used even when there are no class labels available [60]. Given the estimated clusteringĝ and the ground truth clustering g, define the following variables: where 1[c] is an indicator function with value 1 if the condition c in the brackets holds, and 0 otherwise. The variable SS simply counts the number of pairs of assets which belong to the same cluster in both clusteringsĝ and g; SD counts the number of pairs belonging to the same cluster inĝ and different clusters in g; DS counts the number of pairs belonging to different clusters inĝ and the same cluster in g; DD counts the number of pairs belonging to different clusters in both clusteringsĝ and g. Given these variables, the Rand statistic and the Jaccard coefficient can be calculated: Following the above expression, in our case the Rand statistic simply measures the proportion of pairs which are correctly clustered together or apart, and the Jaccard coefficient measures the intersection of the correctly clustered pairs in proportion to the union of all the pairs of assets. Both of these can be interpreted as focusing on the sets of pairs, rather than the original set of assets, and look into whether the pairwise clustering properties match in the two given clusterings.