Principal Tensor Embedding for Unsupervised Tensor Learning

Tensors and multiway analysis aim to explore the relationships between the variables used to represent the data and to find a summarization of the data with models of reduced dimensionality. However, although great attention has been devoted to this problem, dimension reduction of high-order tensors remains a challenge. The aim of this paper is to provide a nonlinear dimensionality reduction approach, named principal tensor embedding (PTE), for unsupervised tensor learning, that is able to derive an explicit nonlinear model of data. As in standard manifold learning (ML) techniques, it assumes that multidimensional data lie close to a low-dimensional manifold embedded in a high-dimensional space. On the basis of this assumption, a local parametrization of the data that accurately captures its local geometry is derived. From this mathematical framework a nonlinear stochastic model of the data that depends on a reduced set of latent variables is obtained. In this way the initial problem of unsupervised learning is reduced to the regression of a nonlinear input-output function, i.e. a supervised learning problem. Extensive experiments on several tensor datasets demonstrate that the proposed ML approach gives competitive performance when compared with other techniques used for data reconstruction and classification.


I. INTRODUCTION
Tensors, also referred to as multiway arrays, are high-order generalizations of vectors and matrices and have been adopted in diverse branches of data analysis to represent a wide range of real-world data. Examples of tensor data are grayscale and color video sequences [1]-[4], gene expression [5], genome-scale signals [6], and magnetic resonance imaging [7], to cite just a few.
Data modeling and classification of these data are important problems in several applications, such as human action and gesture recognition [8], tumor classifications [9], spatio-temporal analysis in climatology, geology and sociology [10], neuroimaging data analysis [11], big data representation [12], completion of big data [13], and so on.
To address these problems most previous works represent a tensor by a vector in a high-dimensional space and apply ordinary learning methods for vectorial data. Representative techniques in this context include feature extraction and selection [14], [15], linear discriminant analysis (LDA) [16], and support vector machine (SVM) [17]. Unfortunately this approach needs to arrange the tensor data into long vectors, causing two main problems: i) loss of the structural information of tensors, ii) vectors with very high dimensionality. To face these problems, specific tensor learning algorithms that retain the original structure of tensor data have recently been developed [18]-[26]. However, in all these methods the problem of the high dimensionality of data, also known as the curse of dimensionality, remains. To deal with this issue the classical dimensionality reduction method known as principal component analysis (PCA) [27], [28] was generalized to the second-order case (2-DPCA) [29], low-rank matrices (GLRAM) [30], and high-order cases (MPCA) [31], [32]. Likewise, the linear discriminant analysis (LDA) [33] technique was extended to a 2D case (2-DLDA) [34], [35] and multilinear discriminant analysis (MDA) [36]. Nevertheless, to overcome the limitations of methods based on classical approaches, the decomposition of tensors into low-rank components, using two popular models, namely the Tucker decomposition (TD) [37] and the CANDECOMP/PARAFAC (CP) decomposition [38], has been one of the main concerns in tensor analysis to reduce the dimensionality of data [39], [40]. In all the above methods, data are considered as points in a multidimensional space, thus using only the global structure information of the dataset. Instead, many studies have shown that several classes of real-world high-dimensional data lie on a low-dimensional manifold (a parametrized surface), thus exhibiting a local geometric structure.
Manifold learning (ML) is a nonlinear dimensionality reduction (NLDR) technique that, assuming the existence of an intrinsic structure, the manifold, has proven to be very effective in modeling data with reduced dimensionality [41], [42]. ML, also classified as an embedding method, is based on the assumption that high-dimensional data are embedded in a nonlinear manifold of lower dimension [43]-[50]. In this context several algorithms have been proposed, such as locally linear embedding (LLE) [51], local tangent space alignment (LTSA) [52], locally multidimensional scaling (LMDS) [53], and ISOMAP [54].
To deal with high-order tensor data, some of these methods were extended by using multiway data analysis [55] and in particular higher-order tensor decomposition [56], [57]. For example, Lai et al. [58] proposed a robust tensor learning method called sparse tensor alignment (STA) for unsupervised tensor feature extraction. Ju et al. [59] introduced a new tensor dimension reduction model based on Bayesian theory. The proposed method assumes that each observation can be represented as a linear combination of some tensor bases; thus the CP decomposition and a variational EM algorithm are used to solve this model. He et al. [60] proposed tensor subspace analysis (TSA) for second-order learning. In the method suggested by Jiang et al. [61], given image tensor data, a k-nearest neighbour graph is constructed to encode the geometrical structure of the data. Liu et al. [62] proposed a nonlinear dimensionality reduction algorithm based on locally linear embedding called supervised locally linear embedding in tensor space (SLLE/T). SLLE/T preserves the local manifold structure within each class based on locally linear embedding (LLE) and enforces separability of data points belonging to different classes. Chen et al. [63], assuming that data lie in a nonlinear manifold, attempted to discover the intrinsic structure of this manifold with a two-stage algorithm named tensor-based Riemannian manifold distance-approximating projection (TRIMAP). Jia et al. [64] suggested a low-rank tensor subspace learning method for RGB-D action recognition, in which the tensor samples are factorized to obtain three projection matrices by Tucker decomposition (TD).
The central objective in ML algorithms is to determine an effective parametrization of data. This is a key issue in order to accurately capture the local geometry of the low-dimensional manifold, and the following aspects are relevant to this end: i) nonlinearity, ii) explicit modeling, iii) intrinsic dimension (ID) estimation.
With regard to nonlinearity, data generally have a nonlinear geometric structure; thus using the tangent space at each data point to locally describe its neighbourhood, as assumed in some of the previous techniques, is a strong limitation.
With reference to the second aspect, a main drawback of most ML methods is that no explicit mapping representing the local manifold parametrization can be obtained after the training process, as they learn high-dimensional data implicitly.
Regarding ID estimation, the intrinsic dimension (ID) may be interpreted as the minimum number of parameters required to describe the data [65]; thus, to derive a low-dimensional model, the dataset ID has to be discovered first.
At present, none of the methods suggested so far are able to take into account all the key requirements of nonlinearity, explicit modeling, and ID estimation for low-dimensional tensor modeling.
The aim of this paper is to develop a manifold learning-based approach, named principal tensor embedding (PTE), for unsupervised tensor learning, that is able to address the first two of the aforementioned key points, while adopting the most relevant state-of-the-art methods for ID estimation. This result represents an advancement with respect to the state of the art, as standard tensor learning techniques are not able to combine all the relevant aspects previously mentioned. The method has been derived by considering a tensor as an element of the finite-dimensional linear space of tensors, in which the inner product defines a metric for the space. Once a basis is computed using the Gram-Schmidt procedure, the coefficient vector in this basis establishes an isomorphism between the space of tensors and a finite-dimensional vector space. In this way the problem of dimensionality reduction in tensor space reduces to dimensionality reduction in vector space. To this end, a recently proposed and effective manifold learning algorithm can be used for the parametrization of the data, to accurately capture the local geometry of the low-dimensional manifold that represents the data. In such a way a nonlinear model of the data with reduced dimensionality is obtained.
The model establishes an explicit one-to-one correspondence between a tensor, a point on the manifold embedded in the high dimensional space, and a vector, a point in the low-dimensional Euclidean space. Additionally a relationship for the geodesic distance of all pairs of points on the manifold, as a nonlinear function of the Euclidean distance between points in the low-dimensional space, is given.
The rest of the paper is organized as follows. Section II reviews related work on tensor learning. Section III summarizes our method highlighting the most relevant aspects. Section IV introduces some general concepts on finite dimensional linear space of tensors. Section V gives a representation of a tensor in terms of a basis derived by the tensor version of the Gram-Schmidt procedure. In Section VI a nonlinear dimensionality reduction approach, named principal tensor embedding (PTE), is developed, and the estimation of nonlinearity in such a model is treated in Section VII using a nonparametric kernel regression (NPKR) technique. Experimental results are presented in Section VIII, in which the proposed tensor learning approach is used and compared with some other techniques for data reconstruction and classification problems.

II. RELATED WORK
Tensor learning techniques have mostly been applied to 2-order and 3-order tensors. In the following we summarize the most relevant approaches in these two main fields.

A. TENSOR LEARNING OF 2-ORDER TENSORS
Principal component analysis (PCA) is one of the most common techniques for unsupervised subspace learning; however, when applied to tensor objects it requires reshaping them into vectors of high dimensionality (vectorization). As a result, this implies high processing costs in terms of computation and memory.
Multilinear principal component analysis (MPCA) [31] is the multilinear extension of classical PCA and has been used for both 2-order and 3-order tensors. Recently a method called graph-Laplacian PCA (gLPCA), which combines a manifold learning method, i.e. the graph Laplacian, with PCA, has been proposed [66].
Tucker decomposition (TD) is a technique that reduces a given tensor to a low-rank tensor [37]. It solves an optimization problem, by minimizing the Frobenius distance between the given tensor and a tensor of lower dimensionality.
To account for the geometrical (manifold) structure of image tensor data, a technique called graph-Laplacian Tucker tensor decomposition (GLTD), that combines graph-Laplacian with a regularized version of TD, has recently been proposed in [61].

B. TENSOR LEARNING OF 3-ORDER TENSORS
Tensor neighborhood preserving embedding (TNPE) and tensor locality preserving projection (TLPP) [67] are tensor embedding techniques. They extend neighborhood preserving embedding (NPE) and locality preserving projection (LPP), which can only work with vectorized representations, to be used with more general tensor representations. More specifically, given a set of data points {x_i, i = 1, …, N} in a higher-dimensional space, NPE and LPP seek a transformation matrix that maps each data point x_i to a corresponding lower-dimensional data point y_i. Similarly, TNPE and TLPP find a set of transformation matrices defining the embedded tensor subspaces that together give an optimal approximation to the tensor manifold, preserving some local geometric properties.
Orthogonal tensor neighborhood preserving embedding (OTNPE) [68] is a generalized tensor subspace model similar to TNPE. However, while TNPE cannot ensure that the obtained transformation matrices have orthogonal column vectors, OTNPE aims to derive orthonormal basis tensors for TNPE.
Sparse tensor alignment (STA) [58] incorporates sparse representation into the tensor alignment (TA) framework, a technique that unifies tensor learning methods. Since a tensor X_i can be unfolded into a large matrix X_i, to its k nearest neighbour tensors correspond k large matrices X_i^(k). The alignment technique aims to obtain the projection matrices U_k that map the unfolded tensors into a low-dimensional subspace.

Unfortunately, none of the aforementioned techniques is able to satisfy the key requirements needed to accurately capture the local geometry of tensors embedded in a manifold, i.e. nonlinearity, explicit modeling and ID estimation.

III. OUR METHOD
In this paper we address the problem of unsupervised learning of tensors. In this case one has a set of N observations, the data Ω = {X^(1), …, X^(N)}, of a random M-order tensor X. The goal is to derive a model that depends on a reduced set of parameters, the latent variables, and that is able to reconstruct the data. To be effective, the dimension d of the parameters in the model must be less than the dimension L of the tensor linear space.
Mathematically this problem is equivalent to deriving a parametrization of the data in terms of a reduced set of latent variables. The main steps of our approach are:
• Given the dataset Ω = {X^(1), X^(2), …, X^(N)}, the Gram-Schmidt procedure is applied to L observations {X̂^(1), X̂^(2), …, X̂^(L)}, so that an orthonormal basis {U^(1), U^(2), …, U^(L)} is obtained.
• The generic tensor X of the set Ω can be represented as the summation X = Σ_{i=1}^{L} α_i U^(i), where α = (α_1, α_2, …, α_L)^T is the coefficient vector. Thus to the dataset Ω = {X^(1), X^(2), …, X^(N)} corresponds a set of N vectors {α^(1), α^(2), …, α^(N)} in R^L.
• We assume these vectors are points sampled from a manifold M of dimension d embedded in the L-dimensional observation space, so that a local parametrization α = F(θ) exists, with d < L, where θ is the vector of latent variables and d is the so-called intrinsic dimension.
• Since θ is hidden, i.e. not known, it will be shown that a local parametrization γ of the manifold, through a partition α = ϕ(α′) = (α′, G(α′))^T of the vector α, can be derived. This is an explicit model of the manifold depending on the known variables α′ ∈ R^d, with d = ID. The model is completely defined once α′ and G(·) have been estimated.
• By estimating the variance vector σ_α = (σ_{α_1}, …, σ_{α_L})^T of α, its elements are put in decreasing order and the corresponding basis elements {Û^(1), Û^(2), …, Û^(L)} are ordered accordingly. In this way a new representation of the tensor X is obtained in terms of the new vector β = (β_1, β_2, …, β_L)^T, such that σ_{β_1} ≥ σ_{β_2} ≥ … ≥ σ_{β_L}. Since the values σ_{β_i} are in decreasing order, the higher the index i, the lower the importance of the corresponding component. An overview diagram that explains the main steps of the approach is reported in Fig. 1.
• On the basis of this result, a low-dimensional representation of the tensor X can be simply obtained by truncating the summation in (4) to the first r terms, thus obtaining a truncation error that decreases as r increases. Following this property the proposed technique has been called principal tensor embedding (PTE).
• Assuming the ID has been determined with one of the methods known in the literature, to estimate the function G(·) an effective method for nonparametric input-output nonlinear function regression in tensor space, called nonparametric kernel regression (NPKR), will be used.
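The steps above can be sketched on a toy dataset. This is an illustrative NumPy sketch with invented shapes and data, not the authors' implementation; a QR factorization is used as a numerically stable stand-in for explicit Gram-Schmidt on the flattened tensors.

```python
# Illustrative PTE pipeline sketch (assumed shapes and variable names).
import numpy as np

rng = np.random.default_rng(0)
I, L = (4, 4, 3), 4 * 4 * 3          # tensor shape, linear-space dimension L
N = 200                               # number of observations

# Synthetic 3-order tensor dataset Omega, one tensor per row once flattened.
Omega = rng.normal(size=(N,) + I)
X = Omega.reshape(N, L)

# Step 1: orthonormal basis from the first L observations (Gram-Schmidt via QR).
U, _ = np.linalg.qr(X[:L].T)

# Step 2: coefficient vectors alpha^(k) = <X^(k), U^(i)> as columns of A.
A = U.T @ X.T                         # shape (L, N)

# Step 3: reorder basis elements by decreasing coefficient variance -> beta.
order = np.argsort(A.var(axis=1))[::-1]
U_hat, B = U[:, order], A[order]

# Step 4: truncate to the first r principal components and reconstruct.
r = 10
X_r = (U_hat[:, :r] @ B[:r]).T.reshape((N,) + I)
err = np.linalg.norm(Omega - X_r) / np.linalg.norm(Omega)
print("relative truncation error:", err)
```

With r = L the reconstruction is exact (the basis spans the whole space); smaller r trades accuracy for dimensionality, as stated in the first bullet above.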
It is worth noticing that the proposed approach satisfies nonlinearity, explicit modeling and ID estimation, i.e. all the key aspects of ML, thus representing a real advancement over previous techniques.

IV. THE FINITE DIMENSIONAL LINEAR SPACE OF TENSORS
Let us refer to tensors, regarded as multidimensional arrays and denoted by Euler script calligraphic letters, e.g. X ∈ R^{I_1×I_2×…×I_M}, where × represents the Cartesian product. The number of dimensions M, also known as modes, denotes the order of the tensor. The elements of an M-order tensor X will be represented by

x_{i_1,i_2,…,i_M}, i_l = 1, 2, …, I_l, l = 1, 2, …, M. (7)

The inner product of two tensors of the same size X, Y ∈ R^{I_1×I_2×…×I_M} is defined as

⟨X, Y⟩ = Σ_{i_1=1}^{I_1} ⋯ Σ_{i_M=1}^{I_M} x_{i_1,…,i_M} y_{i_1,…,i_M}.

From this definition it follows that the norm of a tensor is given by ‖X‖ = ⟨X, X⟩^{1/2}. The outer product of tensors X ∈ R^{I_1×…×I_M} and Y ∈ R^{J_1×…×J_P} is the (M+P)-order tensor Z = X ∘ Y, where the generic element of Z is given by z_{i_1,…,i_M,j_1,…,j_P} = x_{i_1,…,i_M} y_{j_1,…,j_P}. In particular, for two vectors x and y the outer product Z = x ∘ y is the matrix with generic element z_{ij} = x_i y_j. With reference to the canonical basis e_1 = (1, 0, …, 0), …, e_{I_l} = (0, 0, …, 1) for R^{I_l}, an M-order tensor X can be decomposed as

X = Σ_{i_1=1}^{I_1} ⋯ Σ_{i_M=1}^{I_M} x_{i_1,…,i_M} e_{i_1} ∘ e_{i_2} ∘ ⋯ ∘ e_{i_M},

where the outer product e_{i_1} ∘ e_{i_2} ∘ ⋯ ∘ e_{i_M} is the M-order canonical basis tensor. As this basis is of size L = I_1 I_2 ⋯ I_M, this is the dimension of the linear space of tensors. For ease of reference, Table 1 reports some notations that will be frequently used in the following.
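These definitions can be checked numerically. The following NumPy sketch (our own illustration, with assumed shapes) verifies that the tensor inner product, norm and outer product reduce to elementwise operations on the arrays.

```python
# Numerical check of the inner product, norm and outer product definitions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4, 2))           # a 3-order tensor
Y = rng.normal(size=(3, 4, 2))

inner = np.sum(X * Y)                    # <X, Y>: sum over all indices
norm = np.sqrt(np.sum(X * X))            # ||X|| = <X, X>^(1/2)
assert np.isclose(norm, np.linalg.norm(X))

# Outer product of two vectors: z_ij = x_i * y_j (a matrix).
x, y = rng.normal(size=3), rng.normal(size=4)
Z = np.multiply.outer(x, y)
assert np.allclose(Z, x[:, None] * y[None, :])
```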

V. A BASIS FROM DATA
One of the main problems in representing elements of a linear space is to find a basis. For vectors v ∈ R^{I_1} the problem can be solved by estimating the covariance matrix R_vv = E(v v^T) from data. Assuming a set {v^(1), v^(2), …, v^(N)} of observations is collected, then (15) can be approximated as R_vv ≈ (1/N) V V^T, where V = [v^(1), …, v^(N)] is the data matrix. Thus, once an estimation of R_vv is derived, the problem reduces to the decomposition of R_vv. By noting that v v^T = v ∘ v, (15) can be generalized to tensor space by simply substituting v with X, thus obtaining R_χχ = E(X ∘ X), where R_χχ is a tensor of order 2M. Even though some techniques for the decomposition of a tensor are known in the literature, the dimension of the tensor R_χχ in (18) can be very large, so that these approaches cannot be used in practice.
A more effective approach for this purpose can be derived using the well-known Gram-Schmidt procedure. Extracting from a given dataset L observations {X̂^(1), X̂^(2), …, X̂^(L)} and assuming they are independent of each other, the tensor version of the Gram-Schmidt procedure is as follows.
It is straightforward to show that the tensors so obtained are orthonormal. Having derived a basis U = {U^(1), U^(2), …, U^(L)}, the generic tensor X ∈ R^{I_1×I_2×…×I_M} can be represented as the summation X = Σ_{i=1}^{L} α_i U^(i), where α_i = ⟨X, U^(i)⟩ is the i-th coordinate with respect to the element U^(i) of the basis. Due to the randomness of X, α = (α_1, α_2, …, α_L)^T is a realization of a random vector α; here a bold face character is used for random variables. In such a way a correspondence ξ between the space of tensors and the space H of random variables is defined. For every observation X^(k) of the dataset Ω, let us determine the vector α^(k) = (α_1^(k), …, α_L^(k))^T, where α_i^(k) = ⟨X^(k), U^(i)⟩. Thus the correspondence between the dataset Ω and the matrix A ∈ R^{L×N} holds, where A is the data matrix of the coefficient random vector α. To represent this correspondence in a more compact way we use the concept of cell array, in which the elements, the cells, are containers that can hold arrays of different sizes. Using this concept, (26) becomes the cell array correspondence (27).
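A minimal sketch of the tensor Gram-Schmidt step described above (our own illustration, using the inner product of Section IV; shapes and tolerances are assumptions):

```python
# Tensor Gram-Schmidt sketch: orthonormal basis tensors from raw observations.
import numpy as np

def tensor_inner(X, Y):
    """<X, Y>: sum of elementwise products over all modes."""
    return float(np.sum(X * Y))

def gram_schmidt(tensors, tol=1e-12):
    basis = []
    for X in tensors:
        # Subtract the projection onto the basis built so far.
        V = X - sum(tensor_inner(X, U) * U for U in basis)
        n = np.sqrt(tensor_inner(V, V))
        if n > tol:                      # non-zero residual => X is independent
            basis.append(V / n)
    return basis

rng = np.random.default_rng(2)
data = [rng.normal(size=(2, 3)) for _ in range(6)]   # L = 6 = 2 * 3
basis = gram_schmidt(data)

# Orthonormality check: <U^(i), U^(j)> = delta_ij.
G = np.array([[tensor_inner(U, V) for V in basis] for U in basis])
```

The non-zero-norm check is the same independence test discussed later in the experimental section: a zero residual signals that an observation is a linear combination of the previous ones.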

VI. LOCAL PARAMETRIZATION OF DATA
On the basis of the previous results, for a generic tensor X we can write X = Σ_{i=1}^{L} α_i U^(i), which establishes a one-to-one correspondence (an isomorphism) between R^L and the space of tensors.
To reduce the dimensionality of the above representation, a Manifold Learning (ML) approach will be used. In this context the problem can be formalized as follows.
To the dataset Ω corresponds the set of vectors {α^(1), …, α^(N)}, which are assumed to be sampled from a manifold M of dimension d embedded in the L-dimensional observation space. We further assume that the data points do not contain either noise or outliers. This implies that a local parametrization α = F(θ) exists, with d < L. The dimension d represents the so-called intrinsic dimension (ID), which may be interpreted as the minimum number of parameters required to represent the data [65]. From differential geometry [69], [70] it can be shown that, assuming the values of α lie on a manifold M of dimension d, a local parametrization ϕ exists, represented by the graph of a function G(·), such that α = ϕ(α′) = (α′, G(α′))^T, where (α′, α″)^T is a partition of α and α′, α″ are row vectors. A proof of this result is reported in the Appendix for ease of reference. Comparing (29) and (30) it clearly results θ = α′ and F(·) = ϕ(·). In this way the data matrix A can be partitioned accordingly, where A′ and A″ are the data matrices of α′ and α″ = G(α′), respectively.
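As a toy illustration of this graph-of-a-function parametrization (the manifold and the function G below are invented for the example, not taken from the paper):

```python
# Coefficient vectors alpha in R^3 lying on a d = 2 manifold expressed as the
# graph alpha = (alpha', G(alpha')): alpha' is known, G(alpha') is the hidden part.
import numpy as np

rng = np.random.default_rng(3)
alpha_p = rng.uniform(-1, 1, size=(500, 2))          # alpha' in R^d, d = ID = 2

def G(a):
    """An example nonlinear graph function R^2 -> R^1 (invented)."""
    return np.sin(a[:, :1]) * np.cos(a[:, 1:])

alpha = np.hstack([alpha_p, G(alpha_p)])             # points in R^L, L = 3
```

Every point of `alpha` satisfies the graph constraint α″ = G(α′) exactly, so the 500 three-dimensional points are fully described by their first two coordinates.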

A. PRINCIPAL TENSOR EMBEDDING
Since α is a random vector, the mean vector µ_α and the variance vector σ_α can be computed from the set A, so that a cell array collecting {U, µ_α, σ_α} can be defined. If the elements (σ_{α_1}, …, σ_{α_L}) of σ_α are put in decreasing order and the corresponding terms {U^(1), …, U^(L)} are ordered accordingly, then a new cell array is derived, where β = (β_1, …, β_L)^T is a random vector such that σ_{β_1} ≥ σ_{β_2} ≥ … ≥ σ_{β_L}, and a bold face character is used, following the convention previously adopted for random variables. In this way for every X^(k) it results X^(k) = Σ_{i=1}^{L} β_i^(k) Û^(i), where the terms β^(k) and {Û^(1), …, Û^(L)} correspond to the reordered cell arrays. It is worth noticing that when β is used in normal face it represents an observation or realization of the random variable β. Having reordered the basis elements, the correspondence (27) becomes (40), where the components of β are such that σ_{β_1} ≥ … ≥ σ_{β_L}. Among the possible choices by which the vector α can be partitioned, (40) has the following useful property. By choosing a generic index r < L, (28) can be rewritten as X = X_r + N, where X_r = Σ_{i=1}^{r} β_i Û^(i). Assuming X_r is a good approximation of X and µ_α = 0 for the sake of notation simplicity, the truncation error in approximating X with the first r components is given by the norm of the residual N = X − X_r, i.e., ‖N‖² = Σ_{i=r+1}^{L} β_i². As the values σ_{β_i} are in decreasing order, the partition (41) ensures that the truncation error (42) is minimum on average, since the discarded components are those with the smallest variance. With reference to the new partition (40), the local parametrization of the corresponding data {β^(1), …, β^(N)} can be written as (43), and the generic observation of the dataset Ω can be modeled by (44). The data matrix B of β can be partitioned accordingly, where B′ and B″ are the data matrices of β′ and β″ = G(β′), respectively. As can be seen from (44), the tensor dataset Ω of dimension L × N is represented by the data matrix B′ of dimension d × N; thus a reduction of complexity in representing the data from L × N to d × N is obtained with the model (44).
Formally this model is equivalent to a correspondence between tensors and low-dimensional vectors, which can be rewritten in a more compact form. The values X^(k) in (44) can be interpreted as observations of a random tensor X; thus the stochastic model X = γ(β′), which establishes a one-to-one correspondence between the random tensor X and the random vector β′, holds. In the context of random models for tensors, β′ represents the vector of latent variables, that is, a smaller set of variables that cannot be observed directly, and γ(β′) is a local parametrization of X. Once the new basis Û is obtained from U by the reordering procedure previously described, the hyperparameter vector β^(k) for a generic observation X^(k) can be easily derived by the inner product (44).
On the basis of the previous results, a low-dimensional representation of the tensor X can be obtained by truncating the summation in (48) to the first r terms, thus giving a truncation error that decreases as r increases. As the terms in (49) correspond to the most important components in the representation of the tensor X, this approach for tensor learning has been called the principal tensor embedding (PTE) technique.

B. THE METRIC OF THE MANIFOLD M
A relationship between the metric on the manifold M, that is, the geodesic distance of all pairs of points on the manifold, and the Euclidean metric of the corresponding points in R^d can be derived as follows. By taking advantage of the one-to-one property of the correspondence (48), the geodesic distance between two points X^(i) and X^(j) on the manifold can be defined as

d_M(X^(i), X^(j)) = ‖γ^{-1}(X^(i)) − γ^{-1}(X^(j))‖ = ‖β′^(i) − β′^(j)‖, (51)

where β′^(i), β′^(j) are the corresponding points in the low-dimensional space R^d and γ^{-1} is the inverse of γ. Moreover, the Euclidean distance between two points X^(i) and X^(j) on the manifold is given by

‖X^(i) − X^(j)‖ = ‖γ(β′^(i)) − γ(β′^(j))‖. (52)

Using the differentiability of the local parametrization, we can apply the first-order Taylor expansion at β′_0 to represent a generic point on the manifold at β′,

γ(β′) ≈ γ(β′_0) + J(γ)(β′ − β′_0), (53)

where J(γ) is the Jacobian of γ. Choosing β′^(j) = β′_0 and β′^(i) = β′ and substituting (53) into (52), we have

‖X^(i) − X^(j)‖ ≈ ‖J(γ)(β′^(i) − β′^(j))‖. (54)

This relationship clearly shows that the geodesic distance between two points X^(i), X^(j), defined by (51), is different from their Euclidean distance.

VII. NONPARAMETRIC KERNEL REGRESSION (NPKR)
Assuming the ID of the dataset B has been determined with one of the methods known in the literature [71]-[75], the unsupervised learning of the stochastic process (s.p.) X that generates the data Ω reduces to the estimation of the input-output function G(·) in (43). In this way the initial problem of unsupervised learning reduces to a supervised learning problem, as the input data B′ of G(·) are known. The function G(·) represents a mapping (in general nonlinear) from the data B′ in the low-dimensional feature space to the high-dimensional data B″. The estimation of this function is a regression problem that can be solved using several different approaches.
In this context an effective method for nonparametric input-output nonlinear function regression in tensor space has recently been proposed [76]. The method can be summarized as follows. Given any continuous and bounded function f(x) of the n-dimensional variable x = (x_1, …, x_n), defined in a compact subset I ⊂ R^n, there exist sequences k_m(x), named kernel functions, such that the convolution

g_m(x) = ∫_I k_m(x − t) f(t) dt (55)

converges uniformly to f(x) on I as m → ∞. Examples of these functions are: i) the polynomial kernel, where ‖x‖ = (x^T x)^{1/2} is the norm of x and C_m is a normalization factor; ii) the Gaussian kernel. As a consequence of property (55) we have

f(x) = lim_{m→∞} ∫_I k_m(x − t) f(t) dt, (59)

which can be considered as a universal approximating relationship. The main issue of (59) is that it requires calculating the integral on the right-hand side of (55). To overcome this problem, suppose we want to compute the mean value of the n-dimensional function g(t), t = (t_1, …, t_n), in the interval I,

E(g(t)) = ∫_I g(t) p(t) dt, (60)

where t is a realization of the random variable (r.v.) t = (t_1, …, t_n) with probability density function (pdf) p(t), and E(·) denotes the expected value. To numerically solve the integral in (60), a Monte Carlo integration technique [77], [78] can be derived as follows. Let us select at random N points (t_1, …, t_N) sampled from the pdf p(t); then the Monte Carlo approximation of (60) is

E(g(t)) ≈ (1/N) Σ_{i=1}^{N} g(t_i). (61)

By applying this approach to the convolution g_m(x), and in particular to its numerator and denominator terms, and combining (62) and (63), we finally get (64). The function approximation (64) is a nonparametric model of the function f(x), as it only depends on the observations f(t_i) and not on parameters to be estimated.
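The integral-to-sample-mean step in (60)-(61) can be checked with a small Monte Carlo sketch (illustrative values; a uniform p(t) and g(t) = t² are our own choices):

```python
# Monte Carlo integration: E[g(t)] approximated by a sample mean over draws
# from p(t). Here p(t) is uniform on [0, 1] and g(t) = t^2, so E[g(t)] = 1/3.
import numpy as np

rng = np.random.default_rng(5)
t = rng.uniform(0.0, 1.0, size=200_000)   # N samples from p(t)
g = lambda u: u ** 2
mc = g(t).mean()                           # Monte Carlo estimate of E[g(t)]
print(mc)                                  # close to the exact value 1/3
```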
The method described above, named nonparametric kernel regression (NPKR), can be used for the approximation of the function G(·) in (48), thus giving the relationship

Ĝ(β′) = [ Σ_{k=1}^{N} k_m(β′ − β′^(k)) β″^(k) ] / [ Σ_{k=1}^{N} k_m(β′ − β′^(k)) ], (65)

where k_m(·) is a given kernel function, β′^(k) and β″^(k) are the input and output training points, respectively, and β′ is a testing point, chosen from the data matrix B′ = [β′^(1), …, β′^(N)]. It is worth noticing that the previous relationship has the same form as the well-known Nadaraya-Watson nonparametric kernel estimator [79]-[81], which was proposed for the estimation of the regression function of data (x_1, y_1), …, (x_n, y_n) sampled from a population having density f(x, y), thus giving a link between the two theories.
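A minimal sketch of the NPKR/Nadaraya-Watson estimator with a Gaussian kernel follows; the bandwidth h and the one-dimensional toy target are our own assumptions, standing in for the kernel order m and the real data matrices.

```python
# Nadaraya-Watson / NPKR sketch: estimate G at a query point as a kernel-
# weighted average of the training outputs.
import numpy as np

def npkr(B_in, B_out, query, h=0.2):
    """Estimate G(query) from training pairs (B_in[k], B_out[k])."""
    d2 = np.sum((B_in - query) ** 2, axis=1)        # squared distances
    w = np.exp(-d2 / (2.0 * h ** 2))                 # Gaussian kernel weights
    return (w[:, None] * B_out).sum(axis=0) / w.sum()

rng = np.random.default_rng(4)
B_in = rng.uniform(-1, 1, size=(2000, 1))            # training inputs beta'
B_out = np.sin(3.0 * B_in)                           # training outputs G(beta')
est = npkr(B_in, B_out, np.array([0.3]), h=0.05)
print(est[0], np.sin(0.9))                           # estimate vs true value
```

The estimator is nonparametric in the sense of the section above: nothing is fitted, the prediction depends only on the stored training pairs and the kernel.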

VIII. EXPERIMENTS
The experiments for the validation of the proposed tensor learning approach address two different problems, namely data reconstruction and classification. Data reconstruction aims at reconstructing the original dataset Ω = {X^(1), …, X^(N)} from the embedded vectors {β′^(k), k = 1, …, N}, using the model given by (48) and (65).
A pseudo-code of the algorithm used for the estimation of the model from the dataset Ω is reported in Algorithm 1.
As far as the extraction of the L observations is concerned, the following considerations are useful. A complete basis can be derived provided the number N of observations is larger than the dimensionality L of the tensor X. In this case, to obtain a set of non-zero orthonormal tensors, the selected L data are required to be independent. In general this assumption is naturally satisfied, since each element in the database is obtained independently of the others. Nevertheless, independence can be easily verified by checking that the elements obtained by the Gram-Schmidt procedure are non-zero, as they are linear combinations of the dataset elements. In case we have fewer data, i.e. N < L, a complete basis cannot be obtained, but the method for manifold nonlinear dimensionality reduction can still be applied provided the condition d ≪ N is satisfied. However, in this case, the main consequence of having a reduced dataset is a large error in the estimation of the data ID, as will be discussed in experiment VIII-B for classification.
In order to visually assess the quality of Algorithm 1, a simple experiment on a synthetic dataset was preliminarily performed. To this end, data X ∈ R^3 in a low-dimensional space were generated by a parametrized function. Using Ω as the input dataset in Algorithm 1, the model X = γ(β′) has been derived. Fig. 3 depicts the surface achieved with the points sampled from the model. As can be seen, the model is able to reconstruct the manifold embedded in the high-dimensional space of the data. In all the experiments a Gaussian kernel was used for the regression of the function G(·) in (48), using the NPKR method.
Classification aims at classifying the data Ω with the kNN algorithm, using the embedded vectors β′ as low-dimensional features.

A. DATA RECONSTRUCTION
The capability of the proposed method to model data with a reduced dimensionality, has been validated by three experiments conducted on different datasets, namely CIFAR-10, RGB-D Object Dataset, AT&T Faces Dataset.
1) Experiment on CIFAR-10 Dataset (3D-Tensor)
In this experiment, we use the 50000 RGB images of the training set for image reconstruction by the model (48)-(65). For this purpose, the data have been organized in a 3D-tensor dataset Ω = {X^(1), …, X^(N)}, with N = 50000 and X ∈ R^{32×32×3}, so that the dimension of the basis is L = 32 · 32 · 3 = 3072. Once a basis U is derived with the Gram-Schmidt procedure, to the set Ω corresponds the data matrix A = [α^(1), …, α^(N)], whose columns can be considered as realizations of the random vector α. For the set chosen in this experiment, the estimated values of the vectors µ_α and σ_α are reported in Fig. 4.
To estimate the intrinsic dimension d of the data matrix B we used the following relevant state-of-the-art intrinsic dimension (ID) estimators: Dimensionality from Angle and Norm Concentration (DANCo) and its faster variant (FastDANCo) [84], [85], Minimum Neighbor Distance - Maximum Likelihood (MiND-ML) and Minimum Neighbor Distance - Kullback Leibler (MiND-KL) [86], Maximum Likelihood Estimation (MLE) [87], and Intrinsic Dimensionality Estimation of Submanifolds in R^d (Hein) [88]. Table 2 reports the values of intrinsic dimension as obtained with the authors' Matlab implementations of the above-mentioned methods for ID estimation. Although the values of ID so obtained show a large spread, they are all of the same order and significantly reduce the dimensionality of the tensor data, which is two orders of magnitude higher (L = 3072). To stress the model and prove the dimensionality-reducing capability of the approach we chose the minimum value of ID. Table 3 reports the root mean squared error (RMSE) for each processed image, computed using the NPKR method for the regression of the function G(·) in (48) (here a Gaussian kernel and m = 2 are used in (58)), and compared with two well-known regression methods, namely Support Vector Machine (SVM) regression (with different kernels) and Regression Tree. Table 3 shows that the NPKR method for the regression of the function G(·) gives the best results, as it is able to reconstruct the data with the minimum error.

(Per-image results in Table 3 refer to the test images I1, I100, I500, I1000, I3000, I10000, I10500, I20000, I20500, I50.)
In order to study how robust the NPKR method is with respect to hyperparameters, the sensitivity of the RMSE and of the PSNR (peak signal-to-noise ratio) to the dimension d of β is reported in Fig. 6 and Fig. 7 for different values of m. The PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\max(f)^2}{\mathrm{MSE}}\right),$$
where f represents the original image and MSE the mean squared error. PSNR and MSE are used to compare the squared error between the original image and the reconstructed image. Since there is an inverse relationship between PSNR and MSE, a higher PSNR value indicates a higher quality of the image.
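These two figures of merit can be sketched as follows (a minimal sketch assuming 8-bit images, i.e. a peak value of 255):

```python
import numpy as np

def rmse(f, g):
    """Root mean squared error between original f and reconstruction g."""
    return np.sqrt(np.mean((f.astype(float) - g.astype(float)) ** 2))

def psnr(f, g, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means better quality."""
    mse = np.mean((f.astype(float) - g.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

f = np.full((8, 8), 100.0)     # toy "original" image
g = f + 5.0                    # reconstruction with a constant error of 5
print(rmse(f, g))              # 5.0
print(round(psnr(f, g), 2))    # 34.15
```

Note the inverse relationship: as the MSE grows, the PSNR decreases logarithmically.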

2) Experiment on RGB-D Object Dataset (4D-Tensor)
The RGB-D Object Dataset [89] contains 300 objects divided into 51 categories. For each object the dataset provides a number of images ranging from a minimum of 506 to a maximum of 852, for a total of 207920 frames. In this experiment we used the subset Cropped RGB and depth images with object segmentation masks [90], which contains the cropped RGB-D frames that tightly include the object as it is spun around on a turntable. We used all the 207920 images, resized to a 32 × 32 pixel box. In particular, we divided each class into groups of 5 frames of 32 × 32 × 3 RGB images to obtain a 4D-tensor dataset Ω = {X^(1), . . . , X^(N)}, with N = 207920 and X ∈ R^{32×32×3×5}, resulting in a basis of dimension L = 15360. Once a basis U is derived with the Gram-Schmidt procedure, the set Ω corresponds to the data matrix A = [α^(1), . . . , α^(N)], whose columns can be considered as realizations of the random vector α. For the set chosen in this experiment, the estimated values of the vectors µ_α and σ_α are reported in Fig. 8. Table 4 reports the results obtained with the same ID estimators used in Experiment VIII-A1. Choosing an intrinsic dimension d = 5 for the data matrix B, a reduction of dimensionality from L = 15360 to d = 5 is obtained with the model (48). Then we applied the proposed reconstruction method to different 4D-tensors, that is, 32 × 32 RGB videos of 5 frames each. Figs. 9-13 show the 5 frames that compose the original videos and the corresponding reconstructed frames obtained with the proposed approach, demonstrating the validity of this approach.

(Figs. 9-13 show, for each video, the original and the reconstructed frames 1-5.) The reconstruction errors, computed with the NPKR regression in (58), show that the method is able to reconstruct the data with a very low error. In this case as well, to assess the robustness of the PTE method with respect to hyperparameters, the sensitivity of RMSE and PSNR to the dimension d of β is reported in Fig. 14 and Fig. 15 for different values of m.
3) Experiment on AT&T Faces Dataset (2D-Tensor)
In this experiment, we applied the proposed approach to the original and occluded images obtained from the AT&T Faces Dataset, with the same procedure described in [61].
Here, 20 distinct persons are selected and each face image is resized to a 56 × 46 format. In addition to the original face data, we also test our method on partially occluded face data.
Here, 20% of the images of each person class were selected randomly and corrupted manually, with a corruption block of size 11 × 10. To apply the proposed method, the data has been organized in a 2D-tensor dataset Ω. On the basis of the estimated ID values reported in Table 6, we select an optimal intrinsic dimension d = 6 for the reconstruction of this dataset.
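The occlusion procedure can be sketched as follows (block placement and corruption values are illustrative assumptions of ours; the paper follows the procedure of [61]):

```python
import numpy as np

def occlude(images, frac=0.2, block=(11, 10), rng=None):
    """Randomly corrupt a fraction of the images of one class with an
    11x10 block placed at a random position (filled with random noise).
    images: (n, h, w) uint8 array. Returns the corrupted copy and the
    indices of the corrupted images."""
    rng = rng or np.random.default_rng(0)
    images = images.copy()
    n, h, w = images.shape
    idx = rng.choice(n, size=max(1, int(frac * n)), replace=False)
    bh, bw = block
    for i in idx:
        r = rng.integers(0, h - bh + 1)   # random top-left corner
        c = rng.integers(0, w - bw + 1)
        images[i, r:r + bh, c:c + bw] = rng.integers(0, 256, (bh, bw))
    return images, idx

faces = np.zeros((10, 56, 46), dtype=np.uint8)   # toy class of 10 images
occluded, idx = occlude(faces)
print(len(idx))   # 2 of the 10 images are corrupted (20%)
```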
Also in this case, to assess the robustness of the PTE method with respect to hyperparameters, the sensitivity of RMSE and PSNR to the dimension d of β is reported in Fig. 17 and Fig. 18 for different values of m. Fig. 19 and Fig. 20 show a set of original images and the corresponding reconstructed images obtained with the model (48)-(65), demonstrating the validity of this approach. Furthermore, Fig. 21 and Fig. 22 report a set of original partially occluded images and the corresponding reconstructed images, showing the good noise tolerance of the method. Table 7 reports the reconstruction residuals for different reconstruction methods. The methods used to evaluate the PTE method are: PCA-based methods (PCA [27], gLPCA [66]) and tensor decomposition methods (TD [37], GLTD [61]). In this table the average residual for a reconstructed tensor, Res(X), has been defined as
$$\mathrm{Res}(\mathcal{X}) = \frac{1}{N}\sum_{n=1}^{N} \frac{\left\|\mathcal{X}^{(n)} - \mathcal{X}^{(n)}_{0}\right\|_{F}}{\left\|\mathcal{X}^{(n)}_{0}\right\|_{F}}, \qquad (70)$$
where X_0 = (X_0^(1), . . . , X_0^(N)) represents the original N images and X = (X^(1), . . . , X^(N)) the reconstructed images. Meanwhile, for the occluded image data, the same equation (70) defines the average noise-free residual (NF-Res), X_0 being in this case the set of original non-occluded signals.
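A minimal sketch of the average residual, assuming (70) is the mean relative Frobenius error per sample:

```python
import numpy as np

def avg_residual(X, X0):
    """Average relative reconstruction residual over N tensors, using the
    Frobenius norm (our reading of the residual definition in the text)."""
    return np.mean([np.linalg.norm(x - x0) / np.linalg.norm(x0)
                    for x, x0 in zip(X, X0)])

X0 = [np.ones((4, 4)) for _ in range(3)]   # toy originals
X = [x0 * 1.1 for x0 in X0]                # reconstructions, 10% rel. error
print(round(avg_residual(X, X0), 3))       # 0.1
```

The noise-free residual NF-Res is computed with the same function, simply passing the non-occluded originals as X0.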

1) Experiment on Classification of 2-Order Tensors
This experiment addressed the classification of data collected from the following datasets.
• AT&T Faces dataset: as described in the experiment of Section VIII-A3.
• MNIST dataset: consists of 8-bit gray-scale images of the digits "0" to "9", with about 6000 examples for each class [93]. Each image is centered on a 28 × 28 grid. In our experiments, we randomly chose 50 images for each class.
• COIL-20 dataset: contains 20 objects [94]. Each object has 72 images. The size of each image is 32 × 32 pixels, with 256 gray levels per pixel. We use the first 32 images of each object in our experiments.
It is worth noticing that for all these datasets the number of images in each class is far less than the dimension of the corresponding basis. The main consequence of this mismatch is a large error in the estimation of the intrinsic dimension of the data, so that all the ID estimation methods fail. However, although the intrinsic dimension of the data cannot be determined with a sufficient degree of reliability, the model given by (48)-(65) continues to be valid, with an uncertainty on the dimension d. As a consequence, in all these experiments the dimension d was determined empirically, so as to obtain a good modeling of the data, instead of using the ID estimation methods.
We perform semisupervised learning on the different datasets by training the classifier on the labeled data (20% of the dataset) and using the rest as unlabeled data (80% of the dataset). In particular, 20% of the data points of each class were randomly selected as labeled data, and the rest were used as unlabeled data. The classifier was trained on the labeled data and the class labels were predicted on the unlabeled data.
We performed classification using the k-nearest neighbor (kNN) algorithm [28], and compared the results obtained with the features extracted by our model (matrix B) and by several other tensor learning methods, including PCA-based methods (PCA [27], gLPCA [66]) and tensor decomposition methods (TD [37], GLTD [61]). Although other valuable classification methods exist, we chose the kNN algorithm since it is common in the manifold learning literature [60], [62], thus making the comparison with other tensor learning techniques easier. Table 8 reports the results of the classification experiment as achieved by the methods used in [61] and by the PTE method. Also in this case the intrinsic dimension d was determined empirically. As can be seen, our method outperforms all the other methods.
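The 20%/80% kNN evaluation protocol can be sketched as follows (the feature matrix B is a synthetic placeholder, and a plain numpy kNN stands in for [28]):

```python
import numpy as np

def knn_predict(B_lab, y_lab, B_unlab, k=3):
    """Plain kNN with Euclidean distance and majority vote."""
    d = np.linalg.norm(B_unlab[:, None, :] - B_lab[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest labeled points
    return np.array([np.bincount(v).argmax() for v in y_lab[nn]])

rng = np.random.default_rng(0)
# Placeholder reduced features (matrix B): 4 well-separated classes in R^5
y = np.repeat(np.arange(4), 25)
B = rng.standard_normal((100, 5)) + 3.0 * y[:, None]

# 20% labeled / 80% unlabeled, balanced per class
# (every 5th sample of each contiguous class block is labeled)
lab = np.zeros(100, dtype=bool)
lab[::5] = True
pred = knn_predict(B[lab], y[lab], B[~lab])
acc = (pred == y[~lab]).mean()
print(f"accuracy on unlabeled data: {acc:.2f}")
```

Here the labeled split is deterministic for reproducibility; the text instead draws the 20% labeled points at random within each class.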

2) Experiment on Classification of 3-Order Tensors
In order to compare the proposed method with some other tensor-based learning methods, a last experiment was performed on the following datasets.
• Weizmann Action Database: a high-order dataset commonly used for human action recognition. The database includes 90 videos from 10 categories of actions - a) bending (bend), b) jacking (jack), c) jumping (jump), d) jumping in place (pjump), e) running (run), f) galloping sideways (side), g) skipping (skip), h) walking (walk), i) single-hand waving (wave1), j) both-hands waving (wave2) - which were performed by nine subjects [95], [96]. A tensor sample of size 32 × 24 × 10, represented in the spatio-temporal domain, is formed by 10 successive frames of each action, each of which was normalized to a size of 32 × 24 pixels.
• Cambridge Hand Gesture Database: consists of 900 image sequences of nine gesture classes, which are defined by three primitive hand shapes and three primitive motions [97], [98]. Each class contains 100 image sequences (5 different illuminations × 10 arbitrary motions × 2 subjects). The procedure to format the data is the same as for the Weizmann action database.
In these experiments, we randomly selected six action tensors of each category for training and used the remaining tensors for testing. The experiments were independently repeated 10 times using the kNN algorithm with Euclidean distance for classification, following the procedure in [58].
The methods used to evaluate the proposed PTE are: the baseline method (nearest neighbor classifier on the original data), multilinear PCA (MPCA) [31], tensor locality preserving projection (TLPP) and tensor neighborhood preserving embedding (TNPE) [67], orthogonal tensor neighborhood preserving embedding (OTNPE) [68], and sparse tensor alignment (STA) [58]. Table 9 reports the results of the classification experiment as achieved by the aforementioned methods and by the PTE method. Also in this case the intrinsic dimension d was determined empirically. As can be seen, the proposed PTE outperforms the other methods in terms of recognition rate.

IX. CONCLUSION
In this paper a nonlinear, explicit model of tensor data that depends on a reduced set of latent variables has been derived. The main steps required for the estimation of the model from data are: i) compute a basis by a Gram-Schmidt procedure; ii) reorder the basis in such a way that the variances of the coefficients are in decreasing order; iii) estimate the intrinsic dimension d of the data; iv) define a data parametrization of dimension d; v) approximate the nonlinearity in the parametrization by a regression model.
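The five steps can be summarized in a minimal end-to-end sketch (the ID estimate is assumed given, and a linear least-squares fit stands in for the nonlinear NPKR regression of G):

```python
import numpy as np

def pte_pipeline(tensors, d):
    """Minimal sketch of the five estimation steps; the ID estimation
    and the NPKR regression of G are stubbed for brevity."""
    # i) basis via Gram-Schmidt (QR) on the vectorized tensors
    X = np.stack([t.ravel() for t in tensors], axis=1)        # (L, N)
    U, _ = np.linalg.qr(X)
    A = U.T @ X                                               # coefficients
    # ii) reorder the basis so coefficient variances are decreasing
    order = np.argsort(A.var(axis=1))[::-1]
    U, A = U[:, order], A[order]
    # iii) intrinsic dimension d: assumed given here (see the ID estimators)
    # iv) parametrization of dimension d: first d reordered coefficients
    Beta = A[:d]                                              # latent variables
    # v) regress the remaining coefficients on Beta (linear least squares
    #    standing in for the nonlinear NPKR regression of G)
    G, *_ = np.linalg.lstsq(Beta.T, A[d:].T, rcond=None)
    return U, Beta, G

tensors = [np.random.rand(4, 4, 3) for _ in range(20)]   # toy 3D tensors
U, Beta, G = pte_pipeline(tensors, d=3)
print(Beta.shape, G.shape)   # (3, 20) (3, 17)
```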
The capability of the proposed approach for data reconstruction has been validated by performing several experiments on datasets with tensors of different orders. In these experiments several regression methods, i.e. SVM with different kernels and the NPKR method, have been adopted. In all cases the proposed tensor learning approach gives good performance for data reconstruction, with the NPKR regression allowing the PTE method to reconstruct the data with the minimum error. Additionally, the proposed tensor learning approach has proven to be effective for classification problems, using data of reduced dimensionality. To show this ability, classification on several different datasets has been performed. The comparison of the results obtained with the features extracted by the proposed approach and by state-of-the-art tensor learning methods (PCA, gLPCA, TD, GLTD, MPCA, TLPP, TNPE, OTNPE, STA) shows the effectiveness of the suggested model.

APPENDIX. LOCAL PARAMETRIZATION OF DATA EMBEDDED IN A MANIFOLD
Assume all the values of α embedded in the manifold M are described by the following parametrized surface in R^L:
α = F(θ), θ ∈ U ⊂ R^d, α ∈ R^L, d < L .
This means that a bijective and differentiable function f(x), defined on a subset V = U × R^m ⊂ R^L, can be given as in (72). This ensures that a one-to-one correspondence is established between a point x = (θ, 0)^T in V and a point α in the manifold M; as a consequence, f is invertible and unique. Rearranging (72) according to the dimension d of θ we obtain (73), and from the differentiability of f one gets (74), where I_{m×m} is the (m × m) identity matrix and J(f) is the Jacobian of f. For f(x) to be invertible, as it is a bijective mapping, the condition det J(f) ≠ 0 on the Jacobian must be satisfied. As a consequence, from (74) we have det J(F) ≠ 0, meaning that the function F(θ) = α is invertible. Thus the inverse F^{-1} of F exists, such that θ = F^{-1}(α).
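One concrete construction consistent with this argument (an illustrative assumption of ours, not necessarily the paper's own (72)) extends F by translating the last m coordinates, so that f(θ, 0) = F(θ):

```latex
% A possible choice for f in (72):
f(\theta,\xi) \;=\; F(\theta) + \begin{pmatrix} 0 \\ \xi \end{pmatrix},
\qquad (\theta,\xi) \in U \times \mathbb{R}^m .

% Writing F = (F_1, F_2)^T with F_1 taking values in \mathbb{R}^d,
% the Jacobian is block lower-triangular:
J(f) \;=\;
\begin{pmatrix}
  J(F_1) & 0 \\
  J(F_2) & I_{m \times m}
\end{pmatrix},
\qquad
\det J(f) \;=\; \det J(F_1) .
```

With this choice, the invertibility condition det J(f) ≠ 0 reduces to an invertibility condition on the d × d block of the Jacobian of F, consistent with the requirement on J(F) stated above.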

ACKNOWLEDGMENT
This work was supported by Università Politecnica delle Marche.