Characteristic Representation of Stock Time Series Based on Trend Feature Points

Stocks are the most active part of the securities market, and stock analysis generally starts from price fluctuation. Stock trading data have the characteristics of a time series, in which transaction prices are recorded in time order. Because time series are large and complex to study, ideal data for analyzing and predicting stock movements are difficult to obtain. To address this problem, we use a piecewise linear representation that combines turning points and maximum absolute deviation points in the stock time series to extract sequence features. Firstly, the proposed method finds turning points that satisfy the condition given in this paper, and defines a point distance formula to calculate the fitting errors between each subsequence segment and the fitted straight line; the average fitting error is set as the threshold $P$, and the subsequence length threshold is denoted $d$. Secondly, if the fitting error or the length of a subsequence segment is greater than $P$ or $d$, respectively, the maximum absolute deviation point is obtained according to the difference between the fitted value and the original data. Finally, all trend feature points are connected by linear interpolation. Different sequence lengths, different thresholds $d$, different methods and industrial data from different fields are discussed and compared in detail to verify the proposed method. The experimental results show that the proposed method achieves satisfactory data fitting and expanding effect, and retains the characteristics and integrity of the initial time series.


I. INTRODUCTION
A. BACKGROUND
A securities time series refers to the data of futures, stocks and other securities in the exchange market, recorded as prices change with time, and it is affected by various factors [1]-[3]. Major events of all kinds in society may affect the trend of economic development, which in turn causes stock prices to fluctuate in the trading market [4]-[6]. The stock is therefore the most active part of the securities market. Stock trading activity is affected by the data generated in daily trading, and stock investors also buy and sell stocks according to the economic rules reflected by these data. How to effectively utilize these data to obtain valuable information and provide scientific guidance for investors has become a hot research topic [7]-[10].
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
Since the stock time series is characterized by large data volume, high data dimension and fast updates, direct data mining or similarity measurement on the initial time series is not only computationally intensive but also highly complex, and it affects the reliability and effectiveness of experimental results. Researchers therefore generally adopt a feature representation method to preprocess the stock time series. The existing feature representation methods are based on time domain representation [11], [12], feature domain representation or model representation [13]-[16]. The first is the simplest: it represents the time series with minimal operations, directly according to the processing of the time domain signal. This method preserves the initial state characteristics of the time series, as in time wave analysis and probabilistic analysis [17]. It has the advantages of being easy to implement and preventing the loss of time domain information. The second projects the original time series into a new feature space. The new feature space can be divided into segmentation representation and global representation according to the feature classification criteria [18], [19]. The last applies mathematical models to generate sequence data based on the time series. The classical time series feature representation models include the auto-regressive moving average model [20], [21], the first-order Markov hybrid chain [22], the hidden Markov model and the dynamic Bayesian network [23].
In addition to the data analysis methods mentioned above, we also find that some simulation methods based on entropy, Poisson mixtures and agents work well for data analysis. Ponta and Carbone [24] proposed a piecewise segmentation for interpreting information based on the intersection of random sequences and moving averages, which was novel in that it divided the inherent informative/uninformative clusters along the sequences and better distinguished the sequence features. Teng and Shang [25] proposed a new transfer entropy coefficient to quantify the level of information flow between financial time series in view of the complex variability of transfer entropy, so as to analyze which stocks dominated the market, providing a reference for the analysis and development of stock market data. Stock series are random sequences that change over time, and the Poisson mixture itself is a random order distribution. Therefore, it is widely used in risk markets such as the stock market. Ponta et al. [26] studied tick-by-tick financial returns of the FTSE MIB index of the Italian Stock Exchange and proposed a simple nonstationary return model based on a non-homogeneous normal compound Poisson process. The compound Poisson process assigns a certain probability value to a random variable. The authors utilized this property to approximate financial data, and used three information criteria to analyze and compare its effects. It was found that the compound Poisson process was most effective when the parameters were small. This study provided an alternative for the analysis of high-frequency financial time series. The essence of the agent-based simulation method is to simulate the process of market changes and provide guidance for decision-makers and managers. In [27], the authors used a multi-asset artificial stock market composed of heterogeneous agents, with different types of stocks as characteristics and information, to analyze the influence of the agent network on the market structure. The characteristics of agents included cash, stock, and emotion. The results showed that the agent-based simulation method is of great significance in the analysis of stock market data.
The securities industry is resource-intensive and knowledge-intensive, and the analysis of the stock market has been a major challenge for this industry. To overcome the problem, many experts at home and abroad have explored a variety of methods and theories. Data mining technology has become an important technical means of stock analysis [12], [28], [29], providing many novel ideas, mainly for customer analysis, financial indicator analysis, transaction data analysis and investment analysis. In this paper, the feature representation of the stock time series amounts to mining data information, so it belongs to the field of data mining. The mining of stock market information is significant not only for investors but also for social and economic development.

B. RELATED WORK
A time series is a sequence of values that change with time. Given the characteristics of time series, such as high dimension and large scale, Faloutsos et al. [30] believe that when the sequence dimension decreases, the mining effect is more obvious. In recent years, researchers [31], [32] have proposed many new ideas and methods for the dimensionality reduction preprocessing of time series. Generally, there are two kinds of dimensionality reduction preprocessing methods for time series: feature extraction and feature representation.
Feature extraction selects, by some algorithm, an optimal feature subset that reflects the initial time series. Feature extraction of a time series $T = \{t_1, t_2, \ldots, t_n\}$ is to find the subset $Q = \{q_1, q_2, \ldots, q_m\}$ that best reflects its characteristics, where $Q$ is a subset of $T$ and $m$ is much smaller than $n$.
Feature representation reduces the sequence dimension while preserving the features of the initial time series. A feature representation of a time series $T = \{t_1, t_2, \ldots, t_n\}$ is the transformation of the sequence into a low-dimensional time series $Q = \{q_1, q_2, \ldots, q_m\}$.
In view of the processing methods for time series mentioned above, domestic researchers have already done much work. Yan et al. [33] proposed a time series segmentation method based on local maximum and minimum points by combining some important points including the extreme points. The maximum and minimum values were obtained according to the extremum function, and the first and last endpoints of the original sequence were added to fit them. In their experiment, the authors utilized different datasets to demonstrate the effectiveness of the method. Yin et al. [34] studied a novel segmentation method based on turning points. Turning points in their method were defined as the points extracted from the maxima or minima of the time series. The segmentation had two layers. The first layer selected the maximum and minimum points, and the second layer eliminated some unimportant points according to four given strategies. At the same time, in order to combine small-range trends into a large trend, some local turning points were discarded. The results showed that the segments generated by this segmentation method could retain more trends of the sequences. Si and Yin [35] proposed a segmentation method based on inflection points that accounts for the different importance of turning points to a time series. Their method evaluated the importance by the tendency and shape in the sequence, and stored the turning points in an Optimal Binary Search Tree (OBST), which reduced the average retrieval cost. The results showed that their method kept the lowest average retrieval cost while retaining more trends. Luo et al. [36] proposed a piecewise approximation algorithm with max-error guarantees based on the given time series and the specified error. Their method first constructed a piecewise linear function to judge whether the error between the function and the sequence met the conditions, and then selected the sequence that satisfied the conditions as an approximate representation of the flow data trend. Finally, an online algorithm was designed to generate the optimal piecewise approximate representation with the maximum error guaranteed. Lin and Wang [37] researched a time series segmentation algorithm based on first-order filtering. Their method aims to prevent slope-based edge-extraction segmentation from falling into a local optimum when the slope fluctuates drastically. In their method, the first and last points of the sequence, the relative extreme points obtained from the midpoint, the points that met the change of slope, and the remaining points were first divided into four levels. The most significant step was to apply a first-order filter to smooth burrs in the sequence when extracting extreme points, so as to reflect the basic trajectory of the sequence. Then, a priority queue was used to store the points by importance level. Finally, the final sequence was obtained according to the compression rate. The experimental results showed that the method fitted sequences with small slope variation well. For the piecewise linear representation proposed by Keogh [38], the more subsequence segments the sequence is divided into, the more basic feature information is extracted; conversely, the fewer the segments, the less basic feature information is extracted. The extraction of such key points is studied not only in time series analysis but also in other areas, such as face recognition, where it is often used for positioning [39], [40].

C. OUTLINE OF PAPER AND APPROACH
As can be seen from the research about the piecewise linear representation of the stock time series in the previous section, the basic features of the initial time series should be retained as much as possible. Therefore, how to achieve a better modeling effect of time series under the condition of obtaining a higher compression rate is the key to the current time series feature representation. In this paper, we mainly focus on the research of the stock time series. The existing piecewise linear representation cannot retain the characteristics of the initial time series well, so we propose an improved piecewise linear representation method based on searching the trend feature points. By the proposed method, we could also obtain the trend fluctuation feature points in the local range. On the basis of the piecewise linear representation, the improved scheme of trend feature points searching can effectively extract and discriminate the initial time series trend fluctuation information, thus providing the necessary data for the securities investment.
In this paper, we first introduce the method for judging trend fluctuation points of a time series, which robustly finds the segmentation points on the initial time series. Then, we connect the found segmentation points by straight-line interpolation, which realizes the fitting of the preliminary feature points. Finally, according to the fitting error and the search principle of the trend fluctuation feature points, the feature points are searched again in the subsequence segments that satisfy the defined requirements.
The structure of this paper is arranged as follows: Section II reviews the theoretical basis of time series. Section III analyzes stock time series data. Section IV introduces the algorithm flow and the improved method of piecewise linear representation. Section V describes the experimental results analysis, and Section VI summarizes the conclusions.

II. THEORETICAL BASIS OF TIME SERIES
With the rapid development of science and technology, computers are becoming more and more powerful, and the amount of data stored in them is huge, such as the basic information of students stored in a school management system or the large amount of stock data stored in a securities company system. Many of the data in these systems are listed in chronological order, that is, as time series [41]-[43]. Research on time series mainly focuses on how to find the intrinsic connection of things based on the observed sequences, and to draw more valuable information from these sequences [44]-[46]. This section briefly introduces the basic concepts of time series, time series feature representation and some models of time series analysis, providing a theoretical basis for the subsequent work.

A. TIME SERIES ANALYSIS
The definition of time series analysis has been presented in much of the literature; although it is expressed in different ways, the basic theory is the same. Over time, new methods and technologies for time series keep emerging, so potential information can be discovered more effectively when solving problems related to time series analysis. Time series analysis starts from the time series data of the relevant systems and establishes a mathematical model by parameter estimation and curve fitting. It includes general statistical analysis, the creation and identification of statistical analysis models, and the prediction and utilization of the relevant time series. Generally, time series analysis has two parts: frequency-domain analysis and time-domain analysis. In the first, frequency, phase and amplitude are usually used to extract meaningful features of the time series. The second mainly covers the currently available linear time-domain models: Auto-Regressive (AR) models [47], [48], Moving Average (MA) models [49], [50], and Auto-Regressive Moving Average (ARMA) models [51].

B. CLASSIFICATION OF TIME SERIES FEATURE REPRESENTATION
Feature representation of time series refers to data dimensionality reduction, which converts the initial time series data from a high-dimensional space to a low-dimensional space while reflecting the characteristics of the initial time series as much as possible. The classification of time series feature representation is shown in Fig. 1.
1) The data non-adaptive method transforms the time series from the current dimensional domain to another dimensional domain. The transformation depends on the setting of the characteristic coefficients and has nothing to do with the original data.
2) The data adaptive method is affected by the local sequence data values, and its experimental effect is affected by the overall sequence.
3) The model-based representation method applies a model, such as a regression model, hidden Markov model or neural network, to the time series data to search for the optimal model parameters, and then extracts meaningful time series features.

III. STOCK TIME SERIES DATA ANALYSIS
A. RECORDING METHOD OF STOCK DATA
Stock data come from the records of stock trading in the stock exchange market, and each transaction price and volume constitute the basis of stock data. There is no fixed time interval between trades. There may be only a few trades in an hour for low-frequency stock trading and dozens of trades in one second for high-frequency stock trading. In the securities field, these data are therefore generally recorded at a fixed time interval. As shown in Table 1, this kind of recording uses four prices, namely the opening price, the closing price, the lowest price and the highest price, to represent the stock trend in a fixed time interval.
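The fixed-interval recording described above can be illustrated with a minimal sketch that groups irregularly spaced trades into bars of four prices. The function name `to_ohlc`, the `(timestamp, price)` tuple layout and the use of seconds as the time unit are assumptions made for illustration; they are not taken from the paper or from any trading platform.

```python
def to_ohlc(trades, interval):
    """Group (timestamp_seconds, price) trades into fixed-interval bars.

    Assumes `trades` is sorted by timestamp. Returns a dict that maps the
    start time of each interval to its open, high, low and close prices.
    """
    bars = {}
    for ts, price in trades:
        start = int(ts // interval) * interval  # start of the interval holding this trade
        bar = bars.get(start)
        if bar is None:
            # First trade in the interval sets all four prices.
            bars[start] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last trade seen so far in the interval
    return bars


trades = [(0, 10.0), (20, 10.5), (59, 9.8), (60, 9.9), (100, 10.2)]
bars = to_ohlc(trades, interval=60)
```

Intervals without any trade simply do not appear in the result; how such gaps are filled in practice depends on the data provider.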

B. RAW DATA AND ACQUISITION TO STOCK TRADING
Stock trading data are generally derived from the stock exchanges. There are four stock exchanges in China: the Shanghai Stock Exchange, the Shenzhen Stock Exchange, the Hong Kong Stock Exchange and the Taiwan Stock Exchange. Ordinary individuals can obtain the data from websites such as Sina Finance and Hexun Finance's stock interface, export the data from trading software such as communication software, or use the Tushare financial data interface package in Python. In addition, the data can be purchased through some interfaces, such as the Nezip stock interface.

C. DATA TYPES OF STOCK TRADING
Through data mining technology, we can acquire new ideas and knowledge that can be applied to real-time stock trading, stock picking or stock analysis. To achieve these goals, it is important to know how the data are delivered. Stock trading software delivers data in two ways: on-demand data and full push data. With on-demand data, a stock is selected according to requirements specified by the user, and the data of that stock are then downloaded. The advantage of this mode is that it occupies fewer resources, avoids network congestion, and is conducive to the stability of the server, while the disadvantage is that timely stock picking is impossible. With full push data, the data resources are updated in time, all the stock data are sent synchronously, and there is no stagnation when the data are viewed. Generally, the data are sent every three seconds, which is beneficial to real-time stock picking and intraday warning. Because push data occupy a large amount of network bandwidth, the missing data must be supplemented manually if the server is interrupted or the network is blocked.

IV. THE PROPOSED METHOD
Aiming at the shortcomings of current trend feature point search schemes, this section describes the algorithm for finding trend feature points and the piecewise linear representation based on them. To find the trend feature points, we first standardize the data based on the theoretical knowledge, then present the trend feature points of this paper, and finally analyze the experimental results. The basic frame diagram of the proposed method is shown in Fig. 2. In the extraction of the trend turning point, the slope and the set threshold are used to derive point B as the turning point. The data are segmented according to the turning points, where points O, P, Q, A, B and C are turning points. $L_{O-P}$ in Fig. 2 represents the number of points between O and P, $E_{OP}$ represents the fitting error of line segment OP, and $ME_{OQ}$ represents the average fitting error of line segment OQ. In the extraction of the maximum deviation point, the straight line AC is the fitted line segment, the curve AC is the initial data segment, and the point B on the curve is the maximum deviation point. Finally, the found turning points and deviation points are combined into feature points and represented linearly.
In this paper, the piecewise linear representation of the stock time series $Q = \{q_1, q_2, \ldots, q_N\}$ is defined as a segmentation of the form
$$Q(t) = \begin{cases} f_1(t, w_1) + e_1(t), & t \in [w_0, w_1] \\ f_2(t, w_2) + e_2(t), & t \in [w_1, w_2] \\ \quad \vdots \\ f_k(t, w_k) + e_k(t), & t \in [w_{k-1}, w_k] \end{cases}$$
where $w_{i-1}$ and $w_i$ are the two endpoints of the time interval $[w_{i-1}, w_i]$, $f_i(t, w_i)$ is the linear function connecting the two endpoints of the $i$-th segment, and $e_i(t)$ is the error between the stock time series and its segmented representation on that segment.

A. THE ALGORITHM FLOW
The algorithm flow is shown in Fig. 3. This paper realizes the feature representation of stock time series data based on trend feature points through a two-part feature point search. The data are obtained from the JoinQuant trading platform and then standardized. The most critical part of the process is the determination of feature points. This paper combines the slope and the distance difference to fit the data.

B. THE ALGORITHM DESCRIPTION FOR FINDING TREND FEATURE POINTS
The algorithm for feature points lookup is shown in Table 2.
In this paper, the key to the piecewise linear representation algorithm based on trend feature points is to find the segmentation points in each subsequence. According to the definition of the trend feature point and the judgment method of the fitting error, it is determined whether the subsequence still needs to continue the process of finding trend feature points. Firstly, the algorithm traverses the original time series, checks whether the slope changes between adjacent points meet the given threshold, and finds the initial feature points. In this process, the slope $(x_{i+1} - x_i)/(t_{i+1} - t_i)$ needs to be computed $n-1$ times, so the time complexity of this step is $T_1 = O(n)$. Then, the fitting error of the sequence points is introduced to check whether the fitting of each subsequence segment meets the given threshold, so as to determine whether the second step of the feature point search needs to be carried out. At this step, the judgment needs to be made $n$ times, so $T_2 = T_1 = O(n)$. Finally, a single scan is performed to obtain all sequence points meeting the condition, so $T_3 = T_2 = O(n)$. As described above, the time complexity of the trend feature point search is $O(n)$.
When judging whether the current point is an initial feature point, two constraints need to be met. On the one hand, the point is an extreme point. On the other hand, the slopes of the adjacent segments differ from each other by more than the set threshold. This method can be used for online segmentation to facilitate the application of time series analysis.

C. THE MANIFESTATION OF TREND FEATURE POINTS
The piecewise linear representation replaces the initial time series with interconnected straight-line segments. The purpose is to remove noise points and some unimportant points, and to better detect the trend of the sequence change. When people observe data, they always pay attention to those special points. As shown in Fig. 4, the evolutionary trend between two points is divided into ascending, descending and stationary. The development trend of a run of continuous points evolves from this basic development model between two points. As shown in Fig. 5, the evolutionary trend for three adjacent points is divided into nine situations: "rise-up", "rise-stable", "rise-down", "stable-stable", "stable-rise", "stable-down", "down-down", "down-up" and "down-stable".
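As a small illustration of how the nine three-point situations arise from the three two-point trends, the following sketch labels each adjacent pair and joins the two labels. The function names and the tolerance parameter `tol` are hypothetical, and the sketch uses the uniform labels "rise", "stable" and "down" for both halves, so its "rise-rise" and "down-rise" correspond to the "rise-up" and "down-up" cases named above.

```python
def trend(a, b, tol=0.0):
    """Label the move from value a to value b as 'rise', 'down' or 'stable'.

    `tol` is an assumed tolerance: changes no larger than tol count as stable.
    """
    diff = b - a
    if diff > tol:
        return "rise"
    if diff < -tol:
        return "down"
    return "stable"


def three_point_pattern(a, b, c, tol=0.0):
    """Combine the two adjacent trends of three consecutive values."""
    return f"{trend(a, b, tol)}-{trend(b, c, tol)}"


print(three_point_pattern(1.0, 2.0, 1.5))  # rise-down
print(three_point_pattern(2.0, 2.0, 3.0))  # stable-rise
```

Because each of the two adjacent moves has three possible labels, the combined label takes exactly 3 x 3 = 9 values.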

D. TREND FEATURE POINT OF TIME SERIES
1) JUDGMENT OF TREND FEATURE POINT
a: JUDGMENT OF PRELIMINARY FEATURE POINT
As shown in Fig. 6, a feature point is not only an extreme point; the fluctuation between it and its adjacent points must also reach a certain extent. In Fig. 6, a, b and c are three adjacent points, and point b is the desired trend feature point, namely $X_i$. The time series can be seen as an ordered set whose elements contain the time record $t_i$ and the value $v_i$. In the time series
$$X = \{X_1 = (t_1, v_1), X_2 = (t_2, v_2), \ldots, X_N = (t_N, v_N)\},$$
the criteria for judging whether $X_i$ is an extreme point with a relatively sharp fluctuation range are as follows:
1) $X_i$ must be an extreme point of the time series, except for the first endpoint $X_1$ and the tail point $X_N$;
2) Given the threshold $K$ and the slopes $k_{ab}$ and $k_{bc}$ of the two adjacent segments, both $k_{ab} \times k_{bc} < 0$ and $|k_{ab} - k_{bc}| > K$ must hold.
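The two conditions above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, the series is passed as two parallel lists of times and values, and the first and last points are always kept, which is one reading of the exemption for the endpoints.

```python
def preliminary_feature_points(times, values, K):
    """Return indices of preliminary trend feature points.

    A point b with neighbours a and c is kept when the adjacent slopes
    k_ab and k_bc have opposite signs (b is a strict extreme point) and
    |k_ab - k_bc| > K. The first and last indices are always included
    (an assumption about how the endpoints are treated).
    """
    n = len(values)
    if n < 2:
        return list(range(n))
    points = [0]
    for i in range(1, n - 1):
        k_ab = (values[i] - values[i - 1]) / (times[i] - times[i - 1])
        k_bc = (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        if k_ab * k_bc < 0 and abs(k_ab - k_bc) > K:
            points.append(i)
    points.append(n - 1)
    return points


times = [0, 1, 2, 3, 4, 5, 6]
values = [0, 4, 3, 4, 10, 2, 3]
print(preliminary_feature_points(times, values, K=3))  # [0, 1, 4, 5, 6]
```

In the example, index 2 is a local minimum, but its slope change of 2 does not exceed $K = 3$, so it is filtered out as a minor fluctuation. The loop computes two slopes per interior point, in line with the $O(n)$ cost stated for this step.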

b: BASIS FOR JUDGMENT IN SEARCHING TIME SERIES FEATURE POINTS AGAIN
For each subsequence that satisfies the condition for searching again, the difference between the fitted value and the original value at each point is calculated to obtain the maximum absolute deviation point, which is the point to be found. If $y = f(x)$ is the fitted function used in this paper and the subsequence has $n$ data points, the maximum absolute deviation is
$$\max_{1 \le i \le n} |f(x_i) - y_i|$$
where $\{y_i\}$ are the feature points searched again in the original time series. The sequence points determined by the above judgment method represent the trend of the stock price at these points. In this paper, the points obtained by the judgment methods (a) and (b) are defined as the trend feature points of the stock. With this method, the changing characteristics of any point in the stock series can be judged quickly, and the changing trend of the stock market can be described effectively.
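A minimal sketch of the maximum absolute deviation search follows, assuming the fitted function is the straight line through the two endpoints of the subsequence (the linear interpolation this paper adopts) and that points are equally spaced in index. The function name and the `(index, deviation)` return value are illustrative choices.

```python
def max_deviation_point(values, start, end):
    """Find the interior point farthest from the chord values[start]..values[end].

    Returns (index, deviation). The index is None when no interior point
    deviates from the fitted line.
    """
    v0, v1 = values[start], values[end]
    best_index, best_dev = None, 0.0
    for i in range(start + 1, end):
        fitted = v0 + (v1 - v0) * (i - start) / (end - start)  # value on the chord
        dev = abs(fitted - values[i])
        if dev > best_dev:
            best_index, best_dev = i, dev
    return best_index, best_dev


print(max_deviation_point([0, 1, 5, 3, 4], 0, 4))  # (2, 3.0)
```

Ties are resolved in favour of the earliest index, which is an arbitrary choice of this sketch.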

2) PIECEWISE LINEAR REPRESENTATION BASED ON TREND FEATURE POINTS
The key to the segmentation of the stock sequence is the determination of the feature points. According to the definition of trend feature points, the stock time series feature representation method based on trend feature points is divided into two parts. The first part finds feature points according to the slope and the set threshold, and the second part finds feature points according to the distance difference. The stock time series is then represented piecewise linearly based on the feature points found.
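The second part of the search can be sketched as an iterative refinement, using the thresholds $P$ and $d$ that the numbered steps of this subsection define. The sketch is one possible reading, not the authors' code: it assumes equally spaced points, recomputes the total Euclidean error $E_N$ on every pass, splits a segment only when its error reaches $P = (E_N / N) \times \mathrm{length}(i)$ and its length is at least $d$, and takes the initial feature point indices as input. All function names are hypothetical.

```python
import math


def interpolate(values, knots):
    """Fitted values from straight lines joining consecutive knot indices."""
    fitted = list(values)
    for a, b in zip(knots, knots[1:]):
        for i in range(a, b + 1):
            fitted[i] = values[a] + (i - a) / (b - a) * (values[b] - values[a])
    return fitted


def euclidean_error(values, fitted, start, end):
    """Euclidean distance between original and fitted values on [start, end]."""
    return math.sqrt(sum((values[i] - fitted[i]) ** 2 for i in range(start, end + 1)))


def refine_feature_points(values, initial_knots, d=7):
    """Add maximum-deviation points until every segment is accurate or short."""
    n = len(values)
    knots = sorted(set(initial_knots) | {0, n - 1})
    while True:
        fitted = interpolate(values, knots)
        e_total = euclidean_error(values, fitted, 0, n - 1)  # E_N
        new_knots = []
        for a, b in zip(knots, knots[1:]):
            length = b - a + 1  # number of points in the segment
            if length < max(d, 3):
                continue  # too short to split further
            p = e_total / n * length  # threshold P for this segment
            e_seg = euclidean_error(values, fitted, a, b)
            if e_seg >= p and e_seg > 0:
                # Split at the interior point with maximum absolute deviation.
                k = max(range(a + 1, b), key=lambda i: abs(values[i] - fitted[i]))
                new_knots.append(k)
        if not new_knots:
            return knots
        knots = sorted(set(knots) | set(new_knots))


series = [5, 5, 5, 5, 5, 5, 5, 9, 5, 5, 5]
print(refine_feature_points(series, [5], d=3))  # [0, 5, 6, 7, 8, 10]
```

With the default `d=7` the same series is left as `[0, 5, 10]`, because both segments contain only six points. This shows how strongly the length threshold controls the compression rate.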
There are two kinds of time series representation after linear segmentation. The first is linear regression: each line segment fits all the original data in the segment by the least squares method, and adjacent segments can be discontinuous. The least squares method chooses the line $f$ that minimizes
$$\sum_{i=1}^{n} (y_i - f(x_i))^2$$
over the $n$ original points $(x_i, y_i)$ of the segment. The second is linear interpolation: each line segment simply connects the first and last points of the segment, and adjacent segments are connected end to end. The method used in this paper is linear interpolation, so the segmented model is continuous.
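To make the difference between the two representations concrete, the following sketch fits one segment both ways and returns the slope and intercept of each line. The function names are illustrative, and the closed-form least squares expressions are the standard ones for a straight line.

```python
def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolation_line(xs, ys):
    """Slope and intercept of the line through the first and last points."""
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    return slope, ys[0] - slope * xs[0]


xs, ys = [0, 1, 2, 3], [0, 2, 1, 3]
print(least_squares_line(xs, ys))   # approximately (0.8, 0.3)
print(interpolation_line(xs, ys))   # (1.0, 0.0)
```

The regression line does not pass through the segment endpoints in general, which is why neighbouring regression segments can be discontinuous, whereas interpolated segments share their endpoints by construction.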
(1) Using the preliminary search principle of time series trend feature points in Section IV-D-1, determine all $m$ trend feature points in the time series $V$, where $m$ is the number of trend feature points found in the original stock market sequence, $V = \{v_1, v_2, \ldots, v_N\}$, and $N$ is the number of points in the original stock time series. The subsequences delimited by the first and last endpoints and the trend feature points of the initial time series are each represented linearly, where $f$ denotes the linear function fitted for each subsequence, and the linear function values $V_1, V_2, \ldots, V_N$ of all the sequence points in the initial time series are obtained from these functions. The fitting error between the sequence values after the above processing and the initial time series values is then calculated as $E_N$. The fitting error reflects the difference between the recorded value and the fitted value. In this paper, the residual between the original time series values and the fitted values is measured by the Euclidean distance
$$E_N = \sqrt{\sum_{i=1}^{N} (X_i - \hat{X}_i)^2}$$
where $X_i$ is the original sequence value and $\hat{X}_i$ is the fitted value.
(2) Define the average fitting error of each sequence point as $E_N / N$. The fitting error of the $i$-th subsequence is $E_i$ and its length is $\mathrm{length}(i) = W$, where $W$ is the number of data points contained in the $i$-th segment. Compare the fitting error of each subsequence segment with the threshold $P = (E_N / N) \times \mathrm{length}(i)$: 1) If $E_i \geq P$, the $i$-th subsequence needs to go through the process of finding trend feature points and computing the fitting error again. In this process, following (b) in Section IV-D-1, the segment obtained from the initial feature points is searched again for its large deviation point, which yields the desired feature point; 2) If $E_i < P$, the subsequence segment does not need to be divided further.
(3) Repeat steps (1) and (2) until the fitting error satisfies $E_k < P$ at the $k$-th subsequence or the subsequence length $\mathrm{length}(k)$ is less than the set threshold $d$ ($d \leq 7$).
(4) Through steps (1), (2) and (3), find all the trend feature points of the stock market time series, linearly interpolate these feature points, and then use the improved linear representation method based on trend feature points to represent them piecewise linearly.
to December 31, 2015, with 244 data points per stock. In order to show whether the experimental results are influenced by the sequence length, this paper also adds one more year of closing prices to each stock on the basis of the original data. The specific time period is from January 2, 2014 to December 31, 2015, with 489 closing prices per stock. Five different industries are selected from Eastmoney: the real estate and construction industry, the machinery and equipment industry, the energy industry, the petrochemical industry and the jewelry industry. The selected indices are the total sales of commodity housing, the price index of hardware and machinery, the US crude oil index, the chemical index and the gold index. 200 values are selected for each industry index. In the field of financial metrology, daily, weekly, monthly, quarterly and annual financial data all count as low-frequency data, and the time series data studied in this paper are all daily data. The experimental algorithm is implemented in MATLAB R2016a, which is used to obtain the segmentation points and draw the figures. The experimental environment is an i7-8700 CPU with 32 GB of memory, running Windows 10. The first step after getting the data is to standardize them into the range between 0 and 1, which is convenient for calculating the fitting error. The standardization formula is defined in terms of $x_i$, $\min(X)$ and $\max(X)$, where $x_i$ is the original data, and $\min(X)$ and $\max(X)$ are the minimum and maximum values in the original data, respectively. A constant of 0.000001 is included to avoid data equal to 0.
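The standardization step can be sketched as a min-max scaling with a small offset. The text does not show where the constant 0.000001 enters the formula, so adding it to the scaled value is an assumption of this sketch, as are the function name and the `eps` parameter.

```python
def standardize(xs, eps=0.000001):
    """Min-max scale xs to roughly [0, 1], offset by eps so no value is 0.

    Assumption: eps is added after scaling. The source only states that
    the constant keeps the standardized data from being equal to 0.
    """
    lo, hi = min(xs), max(xs)
    span = hi - lo
    if span == 0:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(x - lo) / span + eps for x in xs]


print(standardize([10.0, 15.0, 20.0]))
```

Under this placement the minimum maps to 0.000001 and the maximum to 1.000001, so the upper bound is exceeded by the same small constant.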
To better understand the characteristics of the experimental data, we compute descriptive statistics of the research data, as shown in Tables 3, 4 and 5. The indicators include the mean, standard deviation, maximum, minimum, kurtosis and skewness. The kurtosis coefficient (hereinafter abbreviated as KC) describes the sharpness of the peak of the distribution curve relative to the normal distribution: (1) if KC > 0, the data show a sharp peak distribution; (2) if KC < 0, the data show a flat peak distribution. The skewness coefficient (hereinafter abbreviated as SC) describes the symmetry of the data relative to the normal distribution: (1) if SC = 0, the data are symmetric; (2) if SC > 0, the data show a positively skewed distribution; (3) if SC < 0, the data show a negatively skewed distribution; (4) if SC > 1 or SC < −1, the distribution is highly skewed; (5) if SC ∈ [0.5, 1] or SC ∈ [−1, −0.5], the distribution is moderately skewed.
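The two coefficients and the classification rules can be sketched in Python as follows. This is an illustration, not the computation used for Tables 3 to 5: it assumes population (biased) moment estimators and excess kurtosis, and it follows the standard convention that a positive skewness coefficient means a positive (right) skew. The function names, the "slightly skewed" and "normal-like peak" labels, and the sample data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (SC, KC) from population moments; KC is excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("constant series: skewness and kurtosis are undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def describe_shape(sc, kc):
    """Map (SC, KC) to the verbal categories used in the text."""
    if kc > 0:
        peak = "sharp peak"
    elif kc < 0:
        peak = "flat peak"
    else:
        peak = "normal-like peak"  # label assumed; the text does not name KC = 0
    if sc == 0:
        return f"{peak}, symmetric"
    direction = "positive skew" if sc > 0 else "negative skew"
    if abs(sc) > 1:
        degree = "highly skewed"
    elif abs(sc) >= 0.5:
        degree = "moderately skewed"
    else:
        degree = "slightly skewed"  # label assumed for |SC| < 0.5
    return f"{peak}, {degree}, {direction}"


sample = [1, 2, 3, 4, 10]  # made-up data with one large value on the right
sc, kc = skewness_and_kurtosis(sample)
print(round(sc, 3), round(kc, 3), describe_shape(sc, kc))
```

Statistical packages differ in whether they apply small-sample bias corrections, so values computed this way may not match the tables exactly.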
In this paper, we mainly analyze the distribution of the data through the skewness and kurtosis coefficients. From Table 3, it can be concluded that the closing price of Ping An Bank shows a moderately skewed distribution with a flat peak; Hualian Holdings and China Southern Airlines show a negatively skewed distribution with a flat peak; CITIC Securities presents a positively skewed distribution with a flat peak; and the closing price of Zhejiang Energy Power presents a highly skewed distribution with a sharp peak. Table 4 gives the descriptive statistics of the samples extended by one year relative to Table 3. Here the closing price of Ping An Bank shows a moderately skewed distribution with a sharp peak; CITIC Securities, Hualian Holdings and China Southern Airlines present a moderately skewed distribution with a flat peak; and Zhejiang Energy Power shows a highly skewed distribution with a sharp peak.
Since Table 5 reports industry indices, the values of the indicators are relatively large. In terms of skewness and kurtosis, Commodity housing sales shows a moderately skewed distribution with a sharp peak; the Hardware and electrical price index, the Chemical index and the Gold index show a negatively skewed distribution with a flat peak; and the US crude oil index presents a positively skewed distribution with a sharp peak.

B. ANALYSIS OF EXPERIMENTAL RESULTS
Given the length of the paper, only the figures for one stock and one industry index are shown here, and the specific evaluation indices of all stocks and industry indices are given in the following tables. The initial time series before modeling are shown as the black curve in all figures, and the sequences represent the standardized daily closing prices of the stock. The initial data show that the closing price changes irregularly and the trend fluctuates greatly. Therefore, to reduce risk as much as possible, we process the initial data to extract the important information.
In time series analysis, the compression ratio and the fitting error are generally adopted as evaluation indices for piecewise linear representation algorithms. Data compression reduces the amount of data in order to save storage space and improve transmission efficiency without losing information, and the compression ratio measures how effective the compression is. The fitting error measures how close the fitted data are to the original data. The higher the compression ratio and the smaller the fitting error, the better the piecewise linear representation method performs and the better it depicts the features of the initial time series; otherwise, the performance is worse and the features of the initial time series are not depicted.
Let the stock time series be $X = (x_1, x_2, \ldots, x_N)$ and let the endpoints of its subsequences be $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$; the piecewise linear representation connects adjacent endpoints with straight line segments. The compression ratio is then defined as
$$C = \left(1 - \frac{n}{N}\right) \times 100\% \tag{13}$$
where n is the number of segment points and N is the number of points in the original time series.
In this paper, the piecewise linear representation algorithm based on trend feature points is used to extract segmentation points from stock time series of length 244, as shown in Fig. 7, which presents the fitting of the stock data (closing price). To illustrate the reliability of the method, one additional year of data is added for the stock of Fig. 7, as shown in Fig. 8, and index data for five industries and different segmentation thresholds for the stock data are also examined. In addition, other methods from related work are included to further illustrate the performance of the method. Figure 9 presents the fitting of the industry index, the effect of changing the segmentation threshold is shown in Fig. 10, and the other methods are presented in Fig. 11.
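The two evaluation indices can be made concrete with a short Python sketch. The compression ratio follows formula (13) directly. The fitting error is only a stand-in: this section does not restate the paper's point distance formula, so the sketch assumes the Euclidean distance between the original series and its piecewise linear reconstruction. All function names are hypothetical.

```python
import math


def compression_ratio(n_segment_points, n_points):
    """Formula (13): C = (1 - n / N) * 100%."""
    return (1 - n_segment_points / n_points) * 100


def interpolate(series, feature_idx):
    """Rebuild the series by joining feature points with straight lines.

    feature_idx must be strictly increasing and include 0 and len(series) - 1.
    """
    fitted = [0.0] * len(series)
    for a, b in zip(feature_idx, feature_idx[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)
            fitted[t] = series[a] + w * (series[b] - series[a])
    return fitted


def fitting_error(series, fitted):
    """Assumed error measure: Euclidean distance between the two series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(series, fitted)))


# 55 segment points is back-computed from the 77.459% reported for a series
# of 244 points; the count itself is not stated in the paper.
print(round(compression_ratio(55, 244), 3))  # 77.459

toy = [0.0, 1.0, 0.0]                        # made-up series
print(fitting_error(toy, interpolate(toy, [0, 2])))
```

Keeping more feature points lowers the compression ratio and, under any reasonable error measure, tends to lower the fitting error, which is the trade-off the tables quantify.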
Because stock data vary widely, the performance of the method is also evaluated by changing the threshold d (the segmentation length, which reflects the change of sequence points). Considering that the threshold can be neither too large nor too small, two additional thresholds, d ≤ 5 and d ≤ 9, are set around the original threshold d ≤ 7 for comparative analysis.
The initial fitting and re-fitting of the stock time series are shown in Figs. 7 and 8, where the black curve is the initial stock data and the red piecewise line is the fitted stock data trend. The red dots on the left indicate the feature points found in the first search, and the red dots on the right indicate the feature points determined by combining those of the first search with the maximum absolute deviation points. When the feature points of the initial fitting are extracted, they must be extreme points of the data, and a certain trend change must be reached between adjacent points. The feature points added in the re-fitting are the maximum absolute deviation points obtained according to formula (3). As shown in Fig. 7, although the initial fitting curve is close to the trend of the original data, many important points are ignored and a large deviation from the original data is produced. Adding the second search for points makes the fitted stock price closer to the initial price data and more consistent with its trend, so the fitting effect is better.
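The re-fitting stage can be sketched in Python as follows. This is an interpretation under stated assumptions, not the authors' implementation: the initial turning points are taken as given, the segment error is assumed to be the sum of absolute vertical deviations from the connecting line (the paper's own point distance formula is not restated here), P is the mean segment error of the initial segmentation, and a segment is split only while its error is at least P and its length is at least d. All names are hypothetical.

```python
def chord_value(series, a, b, t):
    """Value at position t of the straight line joining points a and b."""
    return series[a] + (series[b] - series[a]) * (t - a) / (b - a)


def segment_error(series, a, b):
    """Assumed error: sum of absolute vertical deviations from the chord."""
    return sum(abs(series[t] - chord_value(series, a, b, t))
               for t in range(a, b + 1))


def refine(series, initial_points, d=7):
    """Add maximum absolute deviation points to an initial segmentation."""
    segments = list(zip(initial_points, initial_points[1:]))
    # Threshold P: mean fitting error over the initial segments.
    p = sum(segment_error(series, a, b) for a, b in segments) / len(segments)
    points = set(initial_points)
    stack = list(segments)
    while stack:
        a, b = stack.pop()
        too_short = b - a < 2 or (b - a + 1) < d
        if too_short or segment_error(series, a, b) < p:
            continue  # stop: error below P or length below d
        # Interior point that deviates most from the fitted line.
        split = max(range(a + 1, b),
                    key=lambda t: abs(series[t] - chord_value(series, a, b, t)))
        points.add(split)
        stack.extend([(a, split), (split, b)])
    return sorted(points)


# Made-up series: a spike inside a flat stretch, followed by a straight ramp.
series = [0, 0, 0, 0, 5, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
print(refine(series, [0, 8, 16], d=7))
```

In this toy series the spike at position 4 is missed by the initial segmentation and recovered by the refinement, while the straight ramp gains no extra points.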
Figure 8 shows that the fitting effect is comparable to or better than that in Fig. 7, which indicates that the results remain reliable and the fitting effect remains good even when the length of the time series differs. Figure 9 shows the initial fitting and re-fitting of the industry index time series, where the black curve corresponds to the initial index and the red piecewise line is the fitted data trend. The specific evaluation criteria for the fitting effect in Figs. 7-9 are given in Table 6. Figure 10 presents the effect of fitting the original data with different segmentation thresholds, where the black curve is the initial data and the red piecewise line is the fitted stock data trend. Panel (a) shows the fitted feature points with the threshold d ≤ 5, panel (b) with the threshold d ≤ 7, and panel (c) with the threshold d ≤ 9. Visually, the fitted curves for d ≤ 5 and d ≤ 7 differ only slightly, while the difference between d ≤ 7 and d ≤ 9 is obvious. Their numerical values are therefore analyzed in Table 7, which shows the differences. Figure 11 compares the fitting of other methods in related work with that of the proposed method, where (a), (b) and (c) correspond to the segmentation methods in [30], [31] and [34], respectively, and (d) represents the proposed method. In addition to the visual changes, the specific evaluation indicators are given in Table 8 in order to reveal more significant differences between the methods. Table 6 gives the fitting analysis of the stock data with different sequence lengths and of the five industry indices.
Table 7 gives a comparative analysis of the fitting error and compression ratio obtained with different thresholds, where d ≤ 7 is the threshold set in this paper. Table 8 shows the comparative analysis of the compression ratio and fitting error of different methods. We mainly use the compression ratio and the fitting error as evaluation indices. The compression ratio, computed by formula (13) as C = (1 − n/N) × 100%, where n is the number of segment points and N is the number of points in the original time series, depends on the number of feature points found. Table 6 shows that fewer feature points are found in the first search than in the second. Although the compression ratio of the initial search is relatively high, its fitting error is also high and its fitting effect is relatively poor, whereas the proposed method gives a much lower fitting error with only a small difference in compression ratio. As shown in Table 6, the difference between the initial fitting and re-fitting compression ratios of Hualian Holdings with data length 244 is only 4.508%, while the fitting error is 0.792. From Table 7, we can see that when the threshold d is less than or equal to 5, the fitting error is relatively low but the compression ratio is also low; when the threshold d is less than or equal to 9, the compression ratio is high but the fitting error is also high. The fitting effect is good only when the compression ratio is high and the fitting error is low. This means that the threshold should be set moderately according to the data, since a threshold that is too high or too low fails to meet the requirements. The threshold chosen in this paper follows the data changes more closely and is more suitable for data fitting.
As shown in Table 7, the compression ratio and fitting error of Hualian Holdings are 72.951% and 0.326, respectively, when the threshold d is less than or equal to 5, and 78.689% and 0.513, respectively, when the threshold d is less than or equal to 9. With the threshold of this paper, the compression ratio is 77.459% and the fitting error is 0.387. In conclusion, the proposed method provides a lower fitting error while keeping a relatively high compression ratio. Table 8 provides the results of the different methods. The mean fitting error of the proposed method is generally lower than that of the methods in [30], [31] and [34]. As shown in Table 8, the mean compression ratio and the mean fitting error are 81.639% and 0.767 in [30], 80.492% and 0.599 in [31], and 76.223% and 0.958 in [34], respectively, whereas the proposed method obtains a mean compression ratio of 77.213% and a mean fitting error of 0.478. Compared with the method in [30], the proposed method obtains a lower fitting error with a small difference in compression ratio. Compared with the method in [31], although the compression ratio in this paper is 1% to 2% lower, the fitting error is roughly half, as for Ping An Bank. Compared with the method in [34], both indicators are better, whether for single stocks or for the overall mean, and the fitting error is even about a third of that in [34] for stocks such as Hualian Holdings and Zhejiang Energy Power. In short, the proposed method has advantages over those in [30], [31] and [34].
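The comparison of the Table 8 means can be reproduced with a few lines of Python. The numbers below are the mean values quoted in the text; the variable names and the choice of reporting the gap in percentage points and the relative error reduction are illustrative assumptions.

```python
# Mean (compression ratio in %, fitting error) as quoted from Table 8.
reported_means = {
    "[30]": (81.639, 0.767),
    "[31]": (80.492, 0.599),
    "[34]": (76.223, 0.958),
    "proposed": (77.213, 0.478),
}

ours_cr, ours_err = reported_means["proposed"]
comparison = {}
for name, (cr, err) in reported_means.items():
    if name == "proposed":
        continue
    cr_gap = ours_cr - cr                         # percentage points, + means higher
    err_reduction = (err - ours_err) / err * 100  # % reduction in mean fitting error
    comparison[name] = (cr_gap, err_reduction)
    print(f"vs {name}: compression gap {cr_gap:+.3f} points, "
          f"fitting error reduced by {err_reduction:.1f}%")
```

On these means the proposed method lowers the fitting error against all three methods, with a lower mean compression ratio than [30] and [31] and a slightly higher one than [34].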
To sum up, the experimental results show that the piecewise linear representation built only from the trend feature points found in the first search has a higher compression ratio but a larger fitting error. On this basis, additional feature points are searched for in the subsequences selected by comparing each subsequence fitting error with the global mean fitting error. Table 6 shows how the compression ratio and the fitting error change across industries and across stocks with different sequence lengths. In the initial fitting, both the fitting error and the compression ratio are high, whereas in the re-fitting the fitting error is significantly reduced and the compression ratio remains relatively high. The compression ratios and fitting errors for different thresholds in Table 7 show that the threshold used in this paper yields a low fitting error while keeping the compression ratio high, which meets the requirements of data compression. The comparison of different methods in Table 8 leads to the same conclusion: the proposed method obtains a low fitting error and maintains a high compression ratio. According to the experimental results, we conclude that the method retains the characteristics and integrity of stock time series and achieves a good fitting effect in their feature representation, which provides an option for data fitting. At the same time, the fitting of stock data with different time series lengths shows that the method is not affected by the sequence length. Last but not least, the research in this paper also benefits managers and traders.
For managers, research on changes in stock market trends can prompt them to improve the relevant regulatory system and make regulation more targeted, so as to further promote the healthy and stable development of the stock market. A stable stock market plays a very important role in the national economy, drives economic development and benefits many people. For traders, research on stock market turbulence provides a theoretical basis and guidance for decision making, which can reduce their investment risks to a certain extent.

VI. CONCLUDING REMARKS
The feature representation of a security time series converts the original data from a higher dimension to a lower dimension while preserving the features of the initial time series as much as possible, which helps to improve the efficiency of data mining, similarity measurement and other research work. Piecewise linear representation of stock time series is an effective way to reduce the difficulty of processing stock time series data. Based on prior knowledge, this paper first selects the feature points that meet the conditions, then processes the stock time series with the piecewise linear representation method, and finally analyzes the experimental results according to the evaluation indicators.
The experimental results show that the proposed algorithm effectively compresses time series data while reflecting the trend of the time series, and works well on time series with obvious periodicity and drastic fluctuations. The method is also suitable for online segmentation. However, some weak points remain open for debate. Firstly, the threshold in this paper is set manually, which is time-consuming for experiments with a large amount of data. Secondly, only historical data are used to analyze stock trends, so the analysis relies on a single indicator. To better predict the movements of the stock market, in future work we will look for ways to select thresholds intelligently from the data, and will also draw on behavioral finance in social media to predict stock trends by interpreting the emotions in text information and combining them with historical stock data. Textual information from social media and online news has been shown to be effective in predicting the future trend of the stock market. By adding textual feature information from the media, we can make the results more reliable and of greater benefit to market regulators and traders.