Meta-Transfer Learning Using Wavelet Decomposition for Multi-Horizon Time Series Forecasting

Multi-horizon time series forecasting is a challenging task in many fields of research. In machine learning, artificial neural networks have been used to carry out this task. However, open problems of general interest remain, such as loss of data during acquisition and long-term forecasting. In this paper, we propose a hybrid Meta-Transfer Learning technique based on transfer-learning, meta-learning, and signal detection by means of the discrete wavelet transform to address these problems in multi-horizon time series forecasting. Input-to-state stability analysis and strong and weak convergence analyses of the proposed method are included. To validate the effectiveness of the method, the long-term prediction of earthquake magnitude (M>4.5) in Italy is taken as a case study, using information from Italy and Mexico. Simulations of classic neural-model-based forecasting methods are performed for comparison. The mean squared error (MSE) obtained with the classic methods is 0.091, while for meta-transfer learning the MSE is 0.032.


I. INTRODUCTION
Time series forecasting is one of the most important tasks in the field of information engineering. Two main types of forecasting can be distinguished [1]: short-term prediction and long-term prediction, also called multi-step or multi-horizon prediction. Many popular methods are used to solve this problem, such as Box and Jenkins' approach [2]. In [3], a review of the most common methods is presented. Both linear and nonlinear regression models allow the modeling of a time series, for instance, ARX, ARMA, and NARMA models [4], and neural networks for nonlinear modeling [5].
Unlike short-term time-series forecasts, long-term forecasts often present a challenge to research efforts as they have well-known problems, such as increased forecast error when forecasts are made over a long period of time, since spatiotemporal conditions are normally unknown. In addition, there is uncertainty due to lack of information as a result of failures in data acquisition or failures in measurement instruments. Prediction based on linear models under these conditions tends to have poor performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Neural networks have been successfully applied to the problem of time series prediction [5]-[7]. The Multilayer Perceptron (MLP) is the most widely used type of neural network. The Backpropagation (BP) algorithm is adopted to minimize the modeling error and update the weights accordingly. However, the MLP with BP has two main problems: slow convergence and local minima [8]. To avoid these problems, several machine learning approaches have been developed to improve the results, such as Deep-Learning [9] and Meta-Learning (ML). The strong and weak convergence of ML and of BP with a momentum term are given in [10] and [11].
The methods of Meta-Learning and Transfer-Learning (TL) are relatively new ideas drawn from psychology studies that explain how the learning process works by solving new problems based on knowledge and experience [12]. Moreover, the ML algorithm has been proposed to solve the problem of multi-step forecasting [13]. In general, the combination of TL and ML has been little explored in advanced neural network architectures; both concepts are mostly used in a qualitative way [14]-[16].
Strictly speaking, meta-learning is only capable of improving the learning process in the same domain or task. When the information is insufficient to complete the task, or there are many problems in the implementation of the solutions, the transfer-learning method can be used when the tasks and distributions used in the training and testing stages are different. Therefore, the neural model has been given the ability to learn from other problems or tasks. A full description of this topic is provided in [17]. It should be considered that for the transfer-learning method it is necessary to answer the question: What knowledge should be transferred? This represents a challenge for researchers, since determining the similarities or patterns between databases is not an easy task to perform.
The application of deterministic or stochastic methods is not enough for multi-horizon prediction. It is necessary to understand time series in a domain other than time. Better results can be obtained using Wavelet Transform (WT) analysis. An advantage of wavelet analysis is the ability to perform local space-time analysis of time series [18], [19]. The WT reveals aspects of a signal that other analysis techniques overlook, such as trends, breakpoints, and discontinuities.
This multiple resolution can also be obtained using a form of the WT called the discrete wavelet transform (DWT) [20]. The DWT uses filter banks, while the discrete WT uses discrete versions of the scale and expansion axes. The DWT is a transformation that decomposes a given signal into a number of sets; this technique has been successfully implemented in [21], [22].
The complexity in the prediction of time series increases when dealing with chaotic systems, since the trends and behaviors do not follow the characteristics of seasonality and periodicity.
Since friction is a nonlinear phenomenon [23], earthquakes can be considered a chaotic deterministic system [24] with limited predictability. Therefore, earthquakes can be interpreted either as a stochastic process or as a deterministic chaotic process [25]. In general, there are two approaches to earthquake forecasting: 1) Earthquakes are considered a stochastic process, where the main-shock intervals between events are stationary and typically follow a Poissonian distribution [26]. The earthquakes can be described by a renewal time model that mimics the theory of elastic rebound [27]. Although nobody can currently predict the next earthquake exactly, some parameters of the next big earthquake, such as time interval and magnitude, can be estimated in a probabilistic sense from past seismicity. 2) Earthquakes are considered the result of a deterministic process, such as stick-slip friction [28]. The deterministic predictability of earthquakes remains a debated topic in seismology. Theoretical and numerical studies based on deterministic equations indicate that stick-slip can produce chaotic time series [29]. However, natural climatic earthquakes are explained by a chaotic deterministic time series. The chaotic behavior in regular earthquakes remains a challenge due to the short period of observation time [30].
Long-term forecasting is based on periodically arriving earthquakes; in general, a long-term event is very difficult to predict due to the limited information available. A complete earthquake prediction procedure should provide three types of information: magnitude, location, and time of occurrence. Many methods are used to predict earthquakes, such as the rule-based approach [31], shallow neural networks [32], and deep learning [33]. Many methods use neural network models [34], which face great difficulties due to the rarity of the data, the quality of the historical earthquake data, the lack of pattern, and the variability of the performance in different geological locations. The most important challenges are: forecast precision is limited to large magnitudes, large forecast error in long-term prediction, the effect of environmental factors, and uncertainty in the factors.
In this work, a new method called Meta-Transfer Learning (MTL), with a searching algorithm based on wavelet decomposition, is proposed to solve the classical problems of multi-horizon time series forecasting. The MTL method is a hybrid of the ML and TL methods. The modified Transfer-Learning is applied to the problem where there are not enough historical data in the training domain; in combination with the wavelet decomposition, it provides a tool for determining what information to use within a set of secondary tasks. This knowledge improves the accuracy of the prediction of the main task. The modified Meta-Learning helps to solve the problems of local minima and slow convergence of neural networks.
Comparisons with other classical neural network models are presented. The comparative analyses show that: 1) The novel method has better modeling performance than the other algorithms in earthquake forecasting in terms of the MSE criterion; 2) The proposed method converges rapidly and is capable of achieving the assigned task.
In order to create an effective learning method for neural models, especially for long-term forecasting, we make the following contributions: 1) Meta-transfer learning and neural networks are applied to time series forecasting in the cases of multiple horizons and lacking data. 2) Transfer learning is modified with multiple-resolution wavelet decomposition, such that the most important information is transferred. 3) Meta-transfer learning is modified to provide a way out of a cycling problem within the algorithm. 4) Important properties of the proposed meta-transfer learning, such as stability and convergence (weak and strong), are analyzed. 5) The proposed method is successfully applied to earthquake forecasting.
VOLUME 10, 2022

II. MULTI-HORIZON TIME SERIES FORECASTING USING NEURAL NETWORKS
The behavior of a time series $y(1), \ldots, y(N)$ can be described as a dynamic system governed by an unknown nonlinear function $F(\cdot) \in C^{\infty}$, where $n^{*}$ is the number of past events needed to make the forecast. In the multi-horizon forecasting of the time series $y(1), \ldots, y(N)$, $\hat{y}$ is the prediction value, $d_\sigma$ is the prediction horizon, and $d_\alpha$ is the recursive delay, with $d_\sigma = \{0, 1, 2, \ldots, n_\sigma\}$ and $d_\alpha = \{1, 2, \ldots, n_\alpha\}$, where $n_\sigma$ is the maximum horizon and $n_\alpha$ is the number of past events.
Because $F(\cdot)$ is unknown, a neural model $NN(\cdot)$ is used to approximate it. If $NN(\cdot)$ is a single-layer neural network (7), $W_k \in \mathbb{R}^{n}$ is the weight matrix, used together with an activation function.
If $NN(\cdot)$ is a two-layer neural network (8), $W_k \in \mathbb{R}^{m \times n}$ is the weight matrix of the hidden layer and $V_k \in \mathbb{R}^{o \times m}$ is the weight matrix of the output layer.
If $NN(\cdot)$ is a deep neural network (9), $l$ is the number of hidden layers. The scheme of the time series modeling using neural networks is shown in Figure 1. In this paper, we use these three types of neural networks (7)-(9) for multi-horizon time series forecasting.
The objective of time series forecasting is to minimize the modeling error (10); for multi-horizon forecasting, the modeling error is defined over the prediction horizons. The training objective of the neural network models is to update the weights $W_k$ and $V_k$ such that the modeling error is minimized. A gradient method for (8) can minimize (12), where $\eta$ is the positive learning rate, $\eta < 1$. This is the backpropagation algorithm (13). To increase the convergence speed, a momentum term is added to (13), where $\Delta W_k = W_k - W_{k-1}$ and $\alpha$ is a constant, $0 < \alpha < 1$.
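The gradient update with a momentum term can be illustrated with a minimal Python sketch. It is not the authors' implementation: the single-layer form (a weighted sum of tanh-activated past values), the sine test series, and the names `make_windows`, `predict`, `train_momentum`, and `mse` are assumptions made for illustration. Only the learning rate $\eta$, the momentum constant $\alpha$, and the increment $W_k - W_{k-1}$ come from the text.

```python
import math


def make_windows(series, n_past, horizon=0):
    """Pair each window of n_past values with the value horizon + 1 steps later."""
    last = len(series) - n_past - horizon
    return [(series[i:i + n_past], series[i + n_past + horizon])
            for i in range(last)]


def predict(weights, window):
    """Assumed single-layer model: weighted sum of tanh-activated inputs."""
    return sum(w * math.tanh(x) for w, x in zip(weights, window))


def train_momentum(series, n_past=3, horizon=0, eta=0.05, alpha=0.5, epochs=200):
    """Gradient descent on J = 0.5 * e^2 plus a momentum term alpha * (W_k - W_{k-1})."""
    weights = [0.0] * n_past
    prev_delta = [0.0] * n_past  # stores W_k - W_{k-1}
    for _ in range(epochs):
        for window, target in make_windows(series, n_past, horizon):
            error = predict(weights, window) - target
            grad = [error * math.tanh(x) for x in window]  # dJ/dW for this sample
            delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
            weights = [w + d for w, d in zip(weights, delta)]
            prev_delta = delta
    return weights


def mse(weights, series, n_past=3, horizon=0):
    """Mean squared forecasting error over all windows of the series."""
    pairs = make_windows(series, n_past, horizon)
    return sum((predict(weights, w) - t) ** 2 for w, t in pairs) / len(pairs)


series = [math.sin(0.3 * k) for k in range(60)]  # made-up test signal
trained = train_momentum(series)
baseline_error = mse([0.0, 0.0, 0.0], series)
trained_error = mse(trained, series)
```

Setting `horizon` to a value greater than zero trains the same model to predict further ahead, which loosely mirrors the role of the prediction horizon $d_\sigma$.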

III. NEURAL NETWORK WITH WAVELET DECOMPOSITION
Meta-learning is used to avoid local minima, while transfer-learning is applied when the training data of the neural network models contain insufficient information. Meta-transfer learning brings together the properties and characteristics of meta-learning and transfer-learning. Our method can be divided into two parts:
1) The modified Transfer-Learning method is responsible for determining what information is relevant to transfer between neural models, through the synaptic weights $W^{s*}$. For this, the searching algorithm uses the multilevel decomposition of the Discrete Wavelet Transform to determine the coefficients $(\sigma_{cA}, \sigma_{cD})$, which give the standard deviation between different databases. If the standard deviation is low, the databases are strongly correlated and that information can be used. This stage aims to overcome the problem of lack of information in a time series due to failures in data acquisition.
2) The purpose of the Meta-Learning method is to avoid local minima. This is achieved once the modified Transfer-Learning method selects a weight matrix $W^{s*}$, called the sub-optimal matrix, such that at each iteration the weight matrix $W_k$ converges to the sub-optimal weights $W^{s*}$ by means of the BP learning law, modified by the addition of the terms $\beta^{s*}_{W,k} X^{s*}_{W,k}$ associated with Meta-Learning.

A. WAVELET TRANSFORM
For non-stationary and multi-horizon forecasting of real-world time series, meta-transfer learning alone cannot provide good prediction accuracy. We use wavelets to address these problems.
A wavelet function $\psi \in L^{2}(\mathbb{R})$ that generates an orthogonal basis for $L^{2}$ is also called the mother wavelet. Considering the closed spaces $Z_i$, for all $i \in \mathbb{Z}$, the wavelet basis has the following properties. Using (16), it is possible to build an orthogonal basis for $L^{2}$, where $W_m$ is the orthogonal complement of $Z_m$ with respect to $Z_{m+1}$. The system $\{\psi_n\}_{n \in \mathbb{Z}}$ is then an orthogonal basis of $W_0$. Consequently, for each $m$ the system $\{\psi_{m,n}(x)\}_{n \in \mathbb{Z}}$ is an orthogonal basis of the space $W_m$; therefore, taken over all $m, n \in \mathbb{Z}$, it is an orthogonal basis of $L^{2}$.
Any continuous function $f \in L^{2}[0, 1]$ can be expanded in a series whose coefficients $w_{m,n}$, $m, n \in \mathbb{Z}$, can be calculated by an inner product. As described above, the subspace formed by the basis can be reduced to the trivial space as $j \to -\infty$, and the series can be rewritten accordingly.

B. WAVELET DECOMPOSITION
The wavelet decomposition is the application of the discrete wavelet transform (DWT) for different scale factors [18]. In the DWT representation, $m$ is the scale index, $n$ is the translation variable, $\psi$ is the mother wavelet, and $L$ is the length of the series or the function $f$. The Haar wavelet [35] is the simplest discrete wavelet transform and the most commonly used. It suits models that must eliminate high-frequency noise and avoid distorting the rest of the signal; its disadvantages are that it is discontinuous and does not approximate continuous signals very well.
The Haar wavelet is produced from the Haar mother function (25). When the input has $2^{n}$ numbers, the transform can be viewed as simply pairing up input values: it operates on the data by calculating the sums and differences of adjacent elements. This function is able to capture both frequency and temporal content. In a typical Haar wavelet family, $m$ and $n$ are integers and $\psi$ is defined by (25). However, the discrete-time Haar wavelet $WH$ is needed: the system $\{WH_n\}_{n \in \mathbb{Z}}$ is an orthogonal basis of $W_0$. Moreover, the system $\{WH_{m,n}(x)\}_{n \in \mathbb{Z}}$ is an orthonormal basis of the space $W_m$ and therefore an orthogonal basis of $L^{2}$. Any continuous function $f \in L^{2}$ can be rewritten as a series whose coefficients $w_{m,n}$, with $m, n \in \mathbb{Z}$, are calculated by the inner product $w_{m,n} = \langle \phi_{m,n}, WH_{m,n} \rangle$. Here, $\phi_{m,n}$ is the Haar wavelet transform; it starts with an array of $2^{n}$ values and performs $n$ iterations of the basic transform. For each index $l \in \{1, \ldots, n\}$, the array structure consists of coefficients for $2^{n-(l-1)}$ step functions, where $\phi_{m,n}$ is also called the scaling function. The Haar basis can be formed into a subspace. Without loss of generality, we can assume that $j = 0$ to obtain the Haar series.
Although it is formulated mathematically as a wavelet transform, in practice the decomposition consists of a set of low-pass and high-pass filters [18]. Figure 2 shows an example of how a decomposition of scale 4 (or level 2) of a signal is done. The wavelet decomposition is applied to each domain.
The wavelet transform uses a broad range of compactly supported orthogonal analyzing wavelets. Orthogonality in the DWT means that the information obtained at a certain scale $m$ is disjoint from the information at other scales. In the corresponding expression, $N$ represents the number of wavelet coefficients at a given scale $m$, and $W_{m,n}$ is the average among the coefficients.
The following equations describe how to compute the standard deviation from the wavelet coefficients for a pair of domains, where ''cA'' represents the lowest-frequency component of the signal, ''cD'' is the highest-frequency component, $W_{m1} = \mathrm{mean}(cA)$, and $W_{m2} = \mathrm{mean}(cD)$.
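The Haar decomposition and the standard deviation of its coefficients can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's exact formulas: the orthonormal $1/\sqrt{2}$ scaling, the use of the population standard deviation, the choice of the first-level detail as ''cD'', the comparison by absolute difference in `deviation_gap`, and all function names and sample values are hypothetical.

```python
import math
import statistics


def haar_step(signal):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    scale = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) / scale for a, b in pairs]
    detail = [(a - b) / scale for a, b in pairs]
    return approx, detail


def haar_decompose(signal, levels):
    """Return (cA, details): cA is the coarsest approximation, details[0] the finest detail."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def coefficient_stds(signal, levels):
    """Standard deviation of cA (lowest band) and cD (assumed: first-level detail)."""
    c_a, details = haar_decompose(signal, levels)
    return statistics.pstdev(c_a), statistics.pstdev(details[0])


def deviation_gap(signal_p, signal_a, levels):
    """Assumed comparison: absolute difference of the two domains' sigma_cA and sigma_cD."""
    sa_p, sd_p = coefficient_stds(signal_p, levels)
    sa_a, sd_a = coefficient_stds(signal_a, levels)
    return abs(sa_p - sa_a), abs(sd_p - sd_a)


primary = [4.6, 4.8, 5.1, 4.7, 4.9, 5.3, 4.6, 5.0]    # made-up magnitudes
auxiliary = [5.9, 4.5, 6.4, 4.6, 7.1, 4.7, 5.2, 6.0]  # made-up magnitudes
gap_cA, gap_cD = deviation_gap(primary, auxiliary, levels=2)
```

A small gap suggests that the two series have similar spread in the corresponding frequency band, while a large gap suggests dissimilar behavior; which of the two is preferred is a choice left to the selection criterion.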

IV. NEURAL NETWORK WITH META-TRANSFER LEARNING
A. TRANSFER LEARNING
The fundamental property of knowledge transfer states that it is possible to use knowledge previously acquired in an auxiliary task a to help the performance of the main task p. Let us define two domains, a and p. The auxiliary domain a is for the learning task a. The principal domain p is for the principal learning task p. Transfer-Learning for a neural model aims to improve the time series forecasting of the principal task p in the principal domain p, using the knowledge of the auxiliary domain a and the auxiliary task a. The data of the auxiliary domain a and the principal domain p are defined accordingly. There is a fundamental problem within the Transfer-Learning technique: how to select the knowledge previously acquired by an auxiliary task in order to improve the performance of a defined task?
In this work, we propose a method to find the optimal information through the Wavelet Transform, which helps to determine a correlation between the two domains a and p. In general, there can be n domains across which comparisons can be made to find the best set, the one that guarantees the improvement of the results obtained by the main task.
The standard deviation can be interpreted in two ways, as strong correlation or weak correlation, depending on the nature of the time series. We create a domain TF to generate a hybrid database between a and p, which is used by the meta-learning to find the optimal weights $W^{*}$ of the neural network. With weak correlation, local minima in the main task p may be avoided. We assume the time series has a definite time Ts. A function (29) is used to generate TF. The model (29) contains information of the time series from the tasks p and a; TF mixes the data from both sets. The mixing depends on the nature of the phenomenon described by the time series; in addition, the selected characteristic is based on the researcher's previous knowledge of the problem.

B. META LEARNING
The time between events is an important characteristic of time series. We use the wavelet transformation for a multiple-resolution analysis of the databases p and a, then we use Meta-Transfer Learning to consider other characteristics.
Algorithm 1 Transfer-Learning Modified
1: Choose a task defined by p
2: Propose several tasks a_i. This set may or may not be correlated with p
3: Apply the model (24) to each p and a_i
4: Apply the model (27) to obtain the coefficients cA and cD
5: Compare the correlation factor with the model (26)
6: Select the pair p and a_i with more or less correlation. This criterion is chosen by test.
7: Apply (29) to combine the data sets of p and a_i
8: Use the model (9)
9: Return $W^{*}$
We use the following method for the inter-event time: when the inter-event time of p is smaller than $\gamma$, the data of a are saved, so that we know where the information of a is to be added to p. $\beta$ is a parameter that indicates the number of data of a added to p; see Algorithm 1. Figure 4 shows how to use transfer-learning to find $W^{*}$.
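The inter-event rule for mixing the two data sets can be sketched as below. This is one possible reading of the description, not the authors' function (29): inserting each saved auxiliary value immediately after the corresponding primary event, treating zeros as "nothing saved", and the names `build_hybrid`, `inter_event`, `gamma`, and `beta` as Python parameters are all assumptions, and the sample values are made up.

```python
def build_hybrid(primary, inter_event, auxiliary, gamma, beta):
    """Insert up to beta auxiliary values after primary events whose
    inter-event time is below gamma (an assumed reading of the rule)."""
    # Auxiliary values are saved where the condition holds, zeros elsewhere.
    marked = [aux if gap < gamma else 0.0
              for gap, aux in zip(inter_event, auxiliary)]
    # Positions of the saved (non-zero) values, limited to beta entries.
    positions = [i for i, value in enumerate(marked) if value != 0.0][:beta]
    selected = set(positions)
    hybrid = []
    for i, value in enumerate(primary):
        hybrid.append(value)
        if i in selected:
            hybrid.append(marked[i])
    return hybrid


primary = [5.0, 4.6, 4.8, 5.2]        # made-up primary magnitudes
inter_event = [10.0, 2.0, 30.0, 1.0]  # made-up inter-event times
auxiliary = [6.1, 5.5, 5.9, 6.3]      # made-up auxiliary magnitudes
hybrid = build_hybrid(primary, inter_event, auxiliary, gamma=5.0, beta=1)
```

With `beta=1` only the first qualifying auxiliary value is added; raising `beta` adds more of them, which is the knob the text describes as requiring experimentation.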
After wavelet transformation and transfer learning, we use Meta-Learning and back-propagation to train the neural network models. This is our modified meta-transfer learning, see Algorithm 2.
In order to improve the forecasting accuracy, the following modified back-propagation algorithm is applied, which uses the principal task p and the knowledge of $W^{*}$, where $\eta$ is the positive learning rate, $\eta < 1$, and $X_{W,k} \in \mathbb{R}^{m \times n}$ is the vector that forces the angle $\theta_{W,k}$ between $W_k$ and $W^{*}$ to reach its maximum value.
The Meta-Learning term $(\beta_{W,k} X_{W,k})$ can reduce the forecasting error in each step $k$ for the neural model (2), and produces a fast convergence between the pairs $(W_k, W^{*})$. The weights $W_k$ are projected onto the sub-optimal weights $W^{*}$, i.e., the current weights $W_k$ move towards the desired weights $W^{*}$ with the direction $X_{W,k}$ and the step size $\beta_{W,k}$. Figure 3 and Figure 4 show how to apply the modified Meta-Transfer Learning for neural network training.
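The effect of the Meta-Learning term on a single update can be sketched as follows. The sketch assumes that the direction is the unit vector pointing from the current weights to the reference weights and that the step size is a fixed value capped by the remaining distance; the paper instead derives a time-varying step from an angle condition, so the names `meta_step` and `distance` and both of these choices are illustrative assumptions.

```python
import math


def distance(u, v):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def meta_step(weights, grad, w_star, eta=0.05, beta=0.1):
    """W_{k+1} = W_k - eta * grad + step * X_k, with X_k the unit vector toward W*."""
    gap = distance(w_star, weights)
    if gap == 0.0:
        direction = [0.0] * len(weights)
    else:
        direction = [(s - w) / gap for s, w in zip(w_star, weights)]
    step = min(beta, gap)  # never overshoot the reference weights
    return [w - eta * g + step * x
            for w, g, x in zip(weights, grad, direction)]


weights = [0.0, 0.0]
w_star = [3.0, 4.0]     # made-up reference weights
zero_grad = [0.0, 0.0]  # isolates the meta-learning term
after_one = meta_step(weights, zero_grad, w_star, beta=1.0)
```

With a zero gradient, each call moves the weights a distance of at most `beta` toward `w_star`, so the added term acts as a pull toward the previously found weights on top of the usual gradient step.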
The following steps show the MTL methodology:
1) We train the neural model (9) with the classical gradient descent algorithm (13) using different initial weights $V_0$ and $W_0$, for TF.
2) We select the best final weights, $V^{*}$ and $W^{*}$, which minimize the modeling error in the sense of (10). The idea is to extract some properties from previous knowledge.
3) We further train the neural model (9) with the ML algorithm (30). The step size $\beta_{W,k}$ reduces the distance between $W_k$ and $W^{*}$. This time-varying term ensures that the angle condition is fulfilled in each step.
However, to obtain $\beta_{i,k}$, the angular condition Ac is needed. When a deep neural network model is applied, the size of the vectors $X_{i,k}$ increases in dimension due to the number of weights. We need $W_k$ and $V_k$ to converge to $W^{*}$ and $V^{*}$. Normal algorithms need long execution times. We propose the following modified meta-learning to avoid the aforementioned problem.
5: Calculate $X_{V,k}(k)$
6: Compute
7: end for
8: Select the max(cos$_{V,i}$)
9: Return $X^{s*}_{V,k}$
According to the above, the models (31) and (30) can be rewritten, which gives the modified meta-learning laws (32) and (33).

V. CONVERGENCE ANALYSIS
To show the effectiveness of our meta-transfer learning for time series forecasting, we give the strong and weak convergence properties of the proposed algorithm. We first give the following stability result for the meta-transfer learning.
The following theorem gives convergence of the modified meta-learning.
Theorem 1: If the meta-learning algorithms (32)-(33) are applied, the training processes of the neural networks (7)-(9) are stable in the sense of $L_\infty$, $|e(k)| < \infty$ (34).
Proof 1: For the single-layer neural network (7), we define a Lyapunov candidate function $L_k$. From the meta-learning update law (33), $n \min(w_i^{2})$ and $n \max(w_i^{2})$ are $\mathcal{K}_\infty$ functions, while $\pi e^{2}(k)$ and $\eta_k \zeta^{2}(k)$ are $\mathcal{K}$ functions. The Lyapunov function $L_k$ is a function of $e(k)$ and $\zeta(k)$, so $L_k$ is a smooth ISS-Lyapunov function. Therefore, the dynamics of the identification error is input-to-state stable.
For the multilayer neural networks, we use a positive definite matrix $L_k$. From the update law (33), the development is similar to that of the single-layer neural network; furthermore, the corresponding term is a $\mathcal{K}$ function. $L_k$ is a smooth ISS-Lyapunov function of $e(k)$ and $\zeta(k)$. If the ''input'' $\zeta(k)$ is bounded, then the dynamics of the ''state'' $e(k)$ is bounded.
The weak convergence of the proposed Meta-Transfer learning is given by the following theorem.
Theorem 2: The Meta-Transfer Learning laws (32)-(33) are weakly convergent, i.e., the increments of the weights, $\tilde{W}_k$ and $\tilde{V}_k$, are bounded (37).
Proof 2: For the single-layer neural network (7), we use a Lyapunov function. From the stability proof of Theorem 1, the update law is input-to-state stable, and (37) is established. For the multilayer neural networks, we use a positive definite matrix $L_k$. From the stability proof of Theorem 1, where $\pi$ and $K$ are defined in (39), the modeling error $e(k)$ and $L_k$ are bounded, so the gradient term associated with each of the layers due to the ML learning law is bounded as time goes to infinity. Therefore, the increments defined by $\tilde{W}_k$ and $\tilde{V}_k$ are bounded, and (37) is established. The weak convergence continues to hold as long as there is a sufficient number of iterations.
The following theorem gives the strong convergence of the proposed meta-transfer learning.
Theorem 3: There exists a $W^{*}_{ai}$ such that $W^{s*} \subset W_{\sigma}$. The meta-learning (33) leads to strong convergence (41) with proper initial conditions and rich input signals.
Proof 3: We define $W^{s*}$ as the sub-optimal weight of $W_k$ at step $k$, and $\theta$ as the projection angle between the two vectors $W_k$ and $W^{s*}$. We also define $l$; see Figure 5. The proof uses the triangle inequality and, from the update law (33), the increment for $k = 1, 2, \ldots$. Using the Cauchy-Bunyakovsky-Schwarz inequality, the increment of $W$, and Lemma 1, the increment along the sequence is obtained; since the modeling error is bounded as in (34), (41) is fulfilled. From Lemma 1, the difference on the right side of the inequality is then evaluated.
Let the vector $X_{V,k}$ lead the weights $V_k$ towards $V^{*}$. Then the vector $X^{s*}_{V,k}$ will drive the weights $V_k$ towards $V^{s*}$. Therefore, if the number of iterations is sufficient according to Algorithm 2, then $V^{s*} = V^{*}$; otherwise the weights $V^{s*}$ belong to a closed compact set of radius $\delta$, with $\delta < \infty$, which is a neighborhood close to $V^{*}$, and $V^{*} = V^{s*} + \delta$.

VI. EARTHQUAKE PREDICTION USING META-TRANSFER LEARNING
We use the proposed method to forecast earthquakes in Italy (M>4.5) using data from both Italy and Mexico. The data for Italy are extracted from the publicly available database at ''cnt.rm.ingv.it/en/iside'', and the data for Mexico from the publicly available database at ''http://www.ssn.unam.mx''. The motivation for using both datasets is that the available earthquake data for Italy are not sufficient for neural network models. We add Mexican earthquake data to the time series of Italy. Normally, this is not reasonable, because the two time series correspond to different models. However, we successfully combine three techniques, namely wavelet decomposition, meta-learning, and transfer-learning, such that the two earthquake datasets can be used to train one neural network model.
The time series of M>4.5 events in the Italian seismic catalog contains 104 elements for 1970-2018. This information may be insufficient for a multi-horizon prediction. Visual inspection of Figure 9 shows that there is no trend in the prediction: the classical neural models can get one test datum right and then fail on the immediately subsequent datum with a considerable error. The idea of the modified TL is to generate a data set that shares the information of the Mexican and Italian catalogs, through which the weights $W^{s*}$ needed by the ML method can be determined. To choose the information, a search algorithm based on the multiresolution of the DWT is proposed. The WT makes it possible to identify spatio-temporal features within a time series, thus extracting relevant information that cannot be easily obtained from a simple analysis of the discrete-time data. In [18], the standard deviation is used to compare the levels obtained from the multiresolution of the DWT. Since lower scales are associated with higher-frequency oscillations, an increase in $\sigma_{wav}(m)$ with scale indicates that higher-frequency fluctuations are weaker than lower-frequency ones. Similarly, in this work the standard deviations are used to determine the correlation between databases and thus what information should be transferred by the MTL method. At this stage of the investigation, there is no way to determine which values $(\sigma_{cA}, \sigma_{cD})$ of the standard deviation should be chosen to obtain good performance. For this paper, the closer the deviation coefficients of the Mexican and Italian datasets are, the higher their correlation, and therefore the greater the similarity in characteristics or properties between the time series.

A. WAVELET DECOMPOSITION FOR MULTI-HORIZON TIME SERIES
In other words, the standard deviation, in this case, lets us know which set of seismic data from Mexico is similar to the seismic information from Italy. If information that is similar in terms of standard deviation is added to the Italian data set, the network is fed with similar events; however, to train a neural network correctly, diversity in the signal is necessary. Therefore, the Mexican time series selected is the one whose standard deviation differs most from that of the Italian seismic information.
The eighth level of decomposition was achieved. In Figure 6 the decomposition of the seismic data of Italy is shown. In Figure 7, the two sets of data with the higher standard deviation from the information of Italy are shown. The information of Mexico in 2016 is selected. We will show the standard deviation of the wavelet coefficients from two data sets, the inter-event time of both data sets, as well as the created functions from two time series.
1) The inter-event time allows us to graphically find a condition for adding the seismic information of Mexico in 2016 to the information of Italy. The inter-event time contains the time intervals between successive seismic events. When the inter-event time of Italy is smaller than an inter-event time called $\gamma$, the magnitude information of Mexico is saved in an array, which therefore contains not only the magnitudes of the registered earthquakes but also zeros.
Then, the positions where magnitude information is saved in this array are detected, which tells us where to add the seismic information of Mexico to the information of Italy. After that, a parameter $\beta$ indicates the number of seismic data of Mexico added to the data set of Italy. Finally, the new signal to train the neural network is ready; it is constructed according to the model (20).
2) META-LEARNING
Table 1 shows the comparison of the developed algorithm with some well-known methods. Only 10 events are taken for the testing stage. The experimentation takes into account the information available for a magnitude window M > 4.5. The experiments are repeated at least 15 times to ensure the repeatability of the results under the same conditions for the aforementioned hyper-parameters. In this way, we can avoid random errors in the predictions. We also use the mean squared error (MSE) as the performance index.
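The performance index and the repetition of experiments can be illustrated with a short sketch. The helper names `mean_squared_error` and `summarize_runs`, the use of the population standard deviation as a measure of spread across runs, and the sample values are assumptions made for illustration; they are not the reported experimental data.

```python
import statistics


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def summarize_runs(actual, runs):
    """Return the mean and population standard deviation of the per-run MSE."""
    scores = [mean_squared_error(actual, predicted) for predicted in runs]
    return statistics.mean(scores), statistics.pstdev(scores)


actual = [4.6, 4.8, 5.0]                   # made-up test magnitudes
runs = [[4.6, 4.8, 5.0], [4.7, 4.7, 5.1]]  # made-up forecasts from two runs
mean_mse, spread = summarize_runs(actual, runs)
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two methods exceeds the run-to-run variation.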
C. RESULTS DISCUSSION AND FINAL REMARKS
1) As shown in Figure 9 and in Table 1, the forecasts of the classical neural models are not satisfactory, because the amount of information contained in the historical database for earthquakes with M > 4.5 from Italy is insufficient. In general, when forecasting the magnitude of earthquakes, a deviation between the actual data and the estimate implies a release of more or less energy, due to the nonlinear nature of earthquakes. Our meta-transfer learning reduces the MSE to 0.060, which is much better than the other methods.
2) An advantage of meta-transfer learning is that it gives the neural network the ability to use knowledge from another data set, and thus improves the accuracy of time series forecasting. Through the search method based on the multiresolution wavelet transform, the proposed method establishes a correspondence between the data set on which the time series prediction is to be made and the data set from which knowledge is extracted. Properties and characteristics are transformed into knowledge and experience. The selected time series are therefore analyzed in a space-time domain to determine a correlation that cannot be determined in another domain. With the above method, $W^{s*}$ is found.
3) The meta-transfer learning method takes advantage of the experience acquired between neural models to improve the accuracy in the forecast stage of a task. This is achieved because the synaptic weights $W$ of the ANN converge to the sub-optimal synaptic weights through the projection generated by the meta-learning-based learning law on $W^{s*}$.
4) There are two main limitations of the proposed approach:
• This method requires a large amount of similar data for meta-training, which is costly.
• Each neural model is a low-complexity base learner, such as a shallow neural network, to avoid model over-fitting, so it is unable to use deeper and more powerful architectures.

5) There are also some implementation aspects:
• The selected databases must have an intrinsic relationship; however, determining it requires adequate knowledge of the phenomenon to be predicted, so that the information is selected in a logical way. Otherwise, there may be no connection between the data, and the results can be worse than those of the classical methods.
• The hyper-parameters, such as the number of hidden layers, the learning constants, and the number of neurons per layer, must be chosen by the user, so tests are necessary to determine appropriate values for the task.
• The computational cost can increase if there are many secondary databases from which the best $W^{*}$ is to be obtained, since each database must be compared individually with the original database.
• To determine the amount of information to be added according to equation (47) it is necessary to experiment until an acceptable response is obtained in the forecast of the time series.

VII. CONCLUSION
In this paper, multi-horizon time series forecasting is realized by deep neural networks with Meta-Transfer Learning and wavelet decomposition. The method addresses the common problems of missing data and long-term prediction in time series forecasting. The proposed method is applied to predict earthquake magnitude with two different data sets.
Future work will focus on studying the adaptation properties of the meta-transfer method for automatic control, to improve the results of identification and control systems based on neural networks.