Classification of Interbeat Interval Time-Series Using Attention Entropy

Classification of interbeat interval time-series, which fluctuate in an irregular and complex manner, is very challenging. Typically, entropy methods are employed to quantify the complexity of the time-series for classification. Traditional entropy methods focus on the frequency distribution of all the observations in a time-series. This requires a relatively long time-series with at least a couple of thousand data points, which limits their use in practical applications. The methods are also sensitive to the parameter settings. In this paper, we propose a conceptually new approach called attention entropy, which pays attention only to the key observations. Instead of counting the frequency of all observations, it analyzes the frequency distribution of the intervals between the key observations in a time-series. Attention entropy has no parameters to tune, is robust to the time-series length, and requires only linear time to compute. Experiments on real-world datasets show that it outperforms fourteen state-of-the-art entropy methods. It achieves an average classification accuracy of AUC = 0.71, while the second-best method, multiscale entropy, achieves AUC = 0.62, when classifying four groups of people with a time-series length of 100.

Living systems exhibit self-regulating mechanisms that process inputs with a broad range of characteristics [10], [11]. Many biological time-series, such as heart rate variability (HRV), also called interbeat intervals, extracted from ECG, are extremely inhomogeneous and non-stationary, and fluctuate in an irregular and complex manner [12]. Fig. 1 shows four time-series of interbeat intervals from different subjects. We can see that they vary in an irregular manner. HRV is used for physiological analysis, such as depressive disorder analysis [2], stress recognition [13], [14], [15], and affective state analysis [16]. There has also been considerable interest in quantifying the complexity of HRV to uncover hidden information about conditions such as heart failure [17], [18], [19] and coronary artery disease [20]. Typical methods such as multiscale entropy (MSE) [21] and grouped horizontal visibility graph entropy (GHVE) [22] analyze complexity by segmenting the signals into equal-length sub-series and calculating the entropy based on how frequently artificial patterns extracted from the sub-series occur. The process of a typical method is illustrated in Fig. 2 (top). Given a time-series X, the method segments it into overlapping sub-series of equal length, extracts patterns from the sub-series, and then calculates the entropy based on the frequencies of the patterns. The result depends on the length of the sub-series and the definition of the artificial patterns.
There are three main challenges with typical entropy methods. The first is that the patterns in the time-series data must be complex enough to model the data. Therefore, a lot of data is required to populate all histogram bins and obtain a dense histogram. Typical entropy methods need a time-series length of at least 30,000 samples to model the data [21]. This takes more than 30 minutes to collect, which incurs a high cost in clinical diagnosis and therefore limits their use in real-world applications.
The second challenge is that it can take considerable time to extract the patterns. Most methods require $O(m^a n^b)$ time, where m is the dimension of the vector (see Section 2), n is the time-series length, and $a, b > 1$. This is a dilemma: the methods need a lot of data to calculate a reliable entropy value, but more data also means more computing time. This prevents the use of the methods on large-scale data.
The third challenge is that the artificial patterns lack a clear intuitive interpretation. As a result, the patterns have no direct analytical capability, which limits their contribution to the medical analysis of different diseases.
To overcome these challenges, we propose a conceptually new method called attention entropy, which pays attention only to the key observations and focuses on how regularly they repeat in the time-series. Fig. 2 (bottom) illustrates the process of computing the attention entropy. Given a time-series X, attention entropy extracts the key patterns and uses the intervals between the key patterns to calculate the entropy value.

ENTROPY METHODS
Entropy is a quantitative measure of the randomness and disorder of a system. Rudolf Clausius [23] was the first to introduce a mathematical version of the concept to measure the proportion of heat energy transferred from one body to another. Boltzmann and Gibbs [24], [25] extended the concept into statistical mechanics to model molecular disorder and chaos. Shannon later defined entropy as the smallest size to which a message can be encoded without loss [26]. In this section, we review the entropy measures that are most relevant to our study.

Existing Methods
The process of typical entropy methods has four components as summarized in Fig. 3: (1) convert the original series into another series; (2) construct the sub-series; (3) extract the patterns from the sub-series; (4) analyze the frequency distribution of the patterns. Different entropy methods are based on the different combinations of these four components, as summarized in Table 1.
From Table 1, we can see that some entropy methods convert the original series into another series and then segment the converted series into sub-series to extract patterns. For example, spectral entropy [27], average entropy [28], and MSE [21] convert the series using the discrete Fourier transform [27], the grid [28], and the coarse-graining function [21], respectively.
We can also see that there are three typical methods to construct the sub-series: single value, template vector, and delay vector. They can be formed as $z_i^{m,t} = [x_i, x_{i+t}, \ldots, x_{i+(m-1)t}]$ for $1 \le i \le n-(m-1)t$, where t is the time delay and m is the dimension of the vector, given a finite time-series $X = x_1, \ldots, x_n$ with the length n. Single value is the case of $z_i^{m,t}$ with $m = 1$ and $t = 0$. Template vector is the case of $z_i^{m,t}$ with $m > 1$ and $t = 1$. Delay vector is the case of $z_i^{m,t}$ with $m > 1$ and $t > 0$.
Different entropy methods differ mainly in the way they extract the patterns from the sub-series. Shannon entropy [26], Rényi entropy [29], Tsallis entropy [30], spectral entropy [27], and average entropy [28] use the values directly. Permutation entropy [31] and edge permutation entropy (EPE) [32] use the permutations of the rankings of each value in the template vectors as the patterns. Approximate entropy [33], sample entropy [34], and multiscale entropy [21] use similar template vectors as the patterns. Bubble entropy [35] uses the swaps of sorting sub-series with bubble sort algorithm as the patterns. Horizontal visibility entropy (HVE) [36] uses visibility graphs [37] and GHVE [22] uses grouped visibility graphs as the patterns. The singular value decomposition entropy (SVDE) [38] uses the singular values obtained by performing singular value decomposition on the embedding space spanned by the delay vectors as the patterns.
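The three sub-series constructions can be illustrated with a minimal sketch. The function name `sub_series` and the 0-based indexing are assumptions made for illustration; they are not taken from any of the cited methods.

```python
def sub_series(x, m, t):
    """Build the vectors z_i = [x_i, x_{i+t}, ..., x_{i+(m-1)t}].

    Hypothetical helper: indices are 0-based here, so i runs over
    0 .. n-(m-1)t-1 instead of 1 .. n-(m-1)t as in the text.
    """
    if m < 1 or t < 0:
        raise ValueError("m must be >= 1 and t must be >= 0")
    count = len(x) - (m - 1) * t          # number of vectors that fit
    return [[x[i + k * t] for k in range(m)] for i in range(max(count, 0))]


x = [1, 2, 3, 4, 5]
single_values = sub_series(x, m=1, t=0)     # single value: m = 1, t = 0
template_vectors = sub_series(x, m=3, t=1)  # template vector: m > 1, t = 1
delay_vectors = sub_series(x, m=2, t=2)     # delay vector: m > 1, t > 0
```

With this helper, the single-value case simply wraps each observation in a one-element vector, while the template and delay cases differ only in the stride t.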
Once the patterns are defined, the entropy values will be calculated by analyzing the frequency distribution of these patterns. Approximate entropy [33] and sample entropy [34] analyze the frequency distribution of the patterns defined with m and m+1 dimensional template vectors, respectively. They calculate the entropy value from the difference of these two distributions.
From Table 1, we can also see that the proposed attention entropy does not need to convert the series. It uses peak points in the series as the patterns. It analyzes the frequency distribution of patterns' intervals, which will be discussed in Section 3.

Discussion
Each method introduced above has its advantages and disadvantages. Shannon entropy [26], Rényi entropy [29], and average entropy [28] can be applied globally to all data, or locally only to points around specific points [39]. However, they ignore the temporal order of the patterns in the signal [40].
Permutation entropy [31] and edge permutation entropy [32] use the temporal information [39], but they rely on the occurrence of equal values in the sub-series [41]. Approximate entropy [33] has the advantage of lower computational demand and less effect from noise, but it strongly depends on the time-series length and therefore lacks consistency [40]. Sample entropy [34] is invariant to the time-series length and it performs more consistently under various conditions. However, it has a strong dependency on the input parameters [39].
Bubble entropy [35] and GHVE [22] are not sensitive to the parameter settings. However, they have high computational costs, and therefore, they are not practical for large-scale data [35], [36].
MSE [21] is capable of discovering the multiscale feature of data but it requires long time-series to work. SVDE [38] allows analyzing even very short and non-stationary data, but it has high computational costs when applied to large-scale data [38]. Spectral entropy [27] has the advantage of simplicity, but it is sensitive to noise and relies on the assumption that the data error is independent of time [27].

ATTENTION ENTROPY
To overcome the shortcomings of the typical entropy methods, we propose attention entropy. We first introduce the general principle and then give a suggestion of how to select the key patterns.

The General Principle of Attention Entropy
Attention entropy is calculated in three main steps: (1) define the key patterns; (2) calculate the intervals between two adjacent key patterns; (3) calculate Shannon entropy of the intervals. The difference between classical entropy methods and attention entropy is demonstrated in Fig. 4. Classical frequency-based entropy methods cannot separate Series 1 and 2, as both have the same frequency distribution of the patterns. Attention entropy can separate them because the distributions of the intervals between the key patterns (Apple) in the two series are different.
Formally, given a finite series X, we first define the key pattern V. Second, we calculate the intervals $I_V = \{v \mid v = j - i\}$ for any given sub-series $u_i$, $u_k$, and $u_j$ of X which satisfy that $u_i$ and $u_j$ match the pattern V, but $u_k$ does not match V for any $i < k < j$. We finally calculate Shannon entropy over $I_V$ as the attention entropy.
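The general principle can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the predicate `is_key`, the function names, the logarithm base of 2, and the two fruit sequences are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Shannon entropy of the empirical distribution of `values`.

    The logarithm base is an assumption; the text does not specify it.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c, base) for c in counts.values())


def key_intervals(series, is_key):
    """Intervals j - i between adjacent elements that match the key pattern."""
    positions = [i for i, u in enumerate(series) if is_key(u)]
    return [j - i for i, j in zip(positions, positions[1:])]


def attention_entropy(series, is_key, base=2.0):
    """Shannon entropy over the intervals between adjacent key patterns."""
    return shannon_entropy(key_intervals(series, is_key), base)


# Two hypothetical series with the same pattern frequencies (4 apples, 4 pears).
series_1 = ["apple", "pear", "apple", "pear", "apple", "pear", "apple", "pear"]
series_2 = ["apple", "apple", "pear", "apple", "pear", "pear", "apple", "pear"]


def is_apple(u):
    return u == "apple"


e1 = attention_entropy(series_1, is_apple)  # intervals [2, 2, 2]
e2 = attention_entropy(series_2, is_apple)  # intervals [1, 2, 3]
```

In this toy case, a frequency-based measure sees identical distributions in both series, whereas the interval distributions differ, so `e1` is zero and `e2` is positive.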

Peak Points as the Key Patterns
We define a point $x_i$ as a peak point, including local maxima and local minima, if it satisfies one of the conditions below:
  $x_{i-1} < x_i$ and $x_i > x_{i+1}$ ($x_i$ is defined as a local maximum)
  $x_i < x_{i-1}$ and $x_i < x_{i+1}$ ($x_i$ is defined as a local minimum)
If each point in a time-series is considered as one state of a system, the change of the state can then be seen as the system's adjustment to the environment. A complex system is expected to have a complex process of state changes when adapting to the environment. The peak points represent the local upper and lower bounds of the state changes. This makes them potential key patterns.
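The two peak-point conditions can be checked with a short sketch such as the following. The function name `find_peak_points` and the 0-based indices are assumptions for illustration.

```python
def find_peak_points(x):
    """Return (maxima, minima) as lists of 0-based indices.

    A point is a local maximum if it is strictly greater than both
    neighbours and a local minimum if it is strictly smaller than both.
    The first and last points have only one neighbour and are skipped.
    """
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima


maxima, minima = find_peak_points([1, 3, 2, 5, 4, 4, 6, 1])
```

Because the inequalities are strict, the flat segment at indices 4 and 5 in this example yields no peak point.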
A time-series can then be represented by the series of its peak points. We then calculate the intervals between two successive peak points. We consider four cases:
  Intervals of local maxima to local maxima (Max-Max)
  Intervals of local minima to local minima (Min-Min)
  Intervals of local maxima to local minima (Max-Min)
  Intervals of local minima to local maxima (Min-Max)
We can use any one of these four cases individually by calculating the entropy of the respective interval distribution. We can also merge the results by analyzing the four distributions separately and then taking the average of the four individual entropy values. In the rest of the paper, we use this merging strategy as our recommended method and denote it as Average-4. Fig. 5 shows an example of how to calculate the attention entropy when defining peak points as the key patterns.
In general, the individual entropy values are not expected to differ much from each other. In most cases, the result is about the same regardless of which of the four cases we use. However, using all four cases brings two additional benefits. First, it can smooth possible abnormalities in the data. Second, we have four times more data. This can potentially make the method work with shorter time-series.
Fig. 6 shows the expected behavior of the attention entropy: it increases with increasing randomness of the peak points. Fig. 7 shows sample distributions of the intervals among peak points of the four different subjects from Fig. 1. We can see that all intervals of AF are smaller than 10 and the distribution of AF always concentrates on the lower values, leading to low entropy. Some intervals of CHF are bigger than 10 but all of them are smaller than 20, and the distribution of CHF drops faster than those of young and elderly. The difference between the distributions of young and elderly is less visible from the graphs, but the average of the four entropy values makes the distinction clear (young = 2.68, elderly = 2.25).
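The four interval types and the Average-4 merge can be sketched as a single pass over the series. This is a sketch under assumptions rather than the published code: the Max-Min (Min-Max) interval is taken here as the distance from the most recent local maximum (minimum) to the current local minimum (maximum), the logarithm base is 2, and all function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, base=2.0):
    """Shannon entropy of the empirical distribution (0.0 for empty input)."""
    if not values:
        return 0.0
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c, base) for c in counts.values())


def peak_intervals(x):
    """Collect the four interval sets in one pass over x (linear time)."""
    sets = {"max-max": [], "min-min": [], "max-min": [], "min-max": []}
    last = {"max": None, "min": None}      # index of the latest peak of each kind
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] > x[i + 1]:
            kind, other = "max", "min"
        elif x[i - 1] > x[i] < x[i + 1]:
            kind, other = "min", "max"
        else:
            continue                        # not a peak point
        if last[kind] is not None:          # same-kind interval
            sets[f"{kind}-{kind}"].append(i - last[kind])
        if last[other] is not None:         # interval from the other kind
            sets[f"{other}-{kind}"].append(i - last[other])
        last[kind] = i
    return sets


def average4(x, base=2.0):
    """Mean of the four interval entropies (the Average-4 strategy)."""
    sets = peak_intervals(x)
    return sum(shannon_entropy(v, base) for v in sets.values()) / 4


regular = [1, 3, 2, 5, 1, 4, 2, 6, 0, 3, 1]        # peaks alternate every step
irregular = [1, 3, 2, 5, 6, 4, 7, 8, 9, 3, 5, 1]   # peaks at uneven spacing
```

On these two hypothetical series, the regularly alternating one has constant intervals and therefore zero entropy, while the unevenly spaced one yields a positive Average-4 value.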

Algorithm 1. AttentionEntropy(X, V)
Input: X: time-series of length n; V: key patterns
Output: E: entropy value
FOR i = 1 TO n:
    IF matchKeyPatterns(x_i, V) THEN:

Implementation
Implementation of attention entropy is shown in Algorithm 1. It requires O(n) time, where n is the length of time-series X. The algorithm contains the following steps:
1) Detect whether the point is a key pattern;
2) Calculate the interval between two key patterns;
3) Count the frequencies of all intervals;
4) Calculate Shannon entropy over the frequencies of all intervals.
When a point $x_i$ is detected as a key pattern, we calculate the interval as $i - j$, where $x_j$ is the previous key pattern.

Fig. 6. The more randomly the peaks "^" and "v" appear, the greater the attention entropy.

EXPERIMENTAL SETUP
Datasets. We first tested with simulated Gaussian distributed white and 1/f noises [42]-[45], and then tested with real-world data of healthy and pathological subjects: the interbeat intervals dataset downloaded from PhysioNet [46]. There are 72 healthy subjects divided into two groups: subjects with age ≤ 55 (young) and subjects with age > 55 (elderly). There are also 44 subjects with congestive heart failure (CHF), and 24 subjects with atrial fibrillation (AF). The information about the dataset is shown in Table 2, and the selected sub-series of different subjects are shown in Fig. 1. The length is the number of samples in the time-series, with a sampling frequency of 128 Hz for young, elderly, and part of CHF, and 250 Hz for AF and part of CHF [46].
Define Key Patterns. We used the peak points introduced in Section 3 as key patterns. The same attention entropy (Average-4) calculation illustrated in Fig. 5 was applied to the experiments.
Baseline Methods. We compared the proposed method to all the entropy methods in Table 1. We used the parameters suggested in the original paper of each method.
Measurements. We used the analysis of variance (ANOVA) [47] and the area under the receiver operating characteristic curve (ROC AUC) [48] as the measurements. ANOVA can determine if the means of groups of data are significantly different from each other. ANOVA yields a p-value, and if the p-value is below the threshold chosen for statistical significance (usually 0.1, 0.05, or 0.01), there are significant differences among the groups. The idea of the receiver operating characteristic (ROC) curve is to plot the true-positive rate against the false-positive rate over the ranked entropy values at various threshold values. The area under the ROC curve (ROC AUC) serves as the accuracy evaluation, ranging from 0 to 1. The value 1 corresponds to a perfect classification result.
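For two groups, the ROC AUC over ranked entropy values can be computed without drawing the curve, since the area equals the fraction of (positive, negative) pairs in which the positive subject has the higher value, with ties counted as one half. The sketch below uses this pairwise form; the function name and the entropy values are hypothetical and do not come from the experiments.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical entropy values for two groups of three subjects each.
group_a = [2.7, 2.5, 2.9]
group_b = [2.2, 2.6, 2.0]
auc = roc_auc(group_a, group_b)   # 8 of the 9 pairs are ordered correctly
```

The pairwise form takes quadratic time in the number of subjects, which is acceptable for group sizes like those in Table 2; a rank-based variant would be preferable for large samples.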

Simulated White and 1/f Noises
We applied the attention entropy method to the simulated Gaussian distributed white and 1/f noises, and the results are shown in Fig. 8. We can see that the attention entropy values of 1/f noise are significantly higher (p-value < 0.01) than those of white noise. This result is consistent with the fact that, unlike white noise, 1/f noise contains complex structures [42], [43].

Real-World Heart-Rate Data
We next tested the interbeat interval time-series dataset with time-series length = 100. The p-value results are shown in Table 3. We used the star symbol (*) to mark the results that are statistically significant (p-values < 0.01). We can see that the results of attention entropy are statistically significant in all the important cases of separating healthy and non-healthy subjects. The differences of the entropy values in the cases of young-vs-elderly and CHF-vs-AF are as we expected but not statistically significant. A possible reason is that the number of samples in the data is too small for this. Attention entropy is the only method able to separate all six pairs of groups with statistical significance (p-values < 0.01) when the time-series length = 1000.
Another measurement of the results of classifying binary groups is the AUC shown in Table 4. We can see that attention entropy outperforms the other entropy methods on average (AUC: 0.71 versus 0.62 with time-series length = 100, 0.81 versus 0.77 with time-series length = 1000, and 0.79 versus 0.77 with time-series length = 10000). This indicates that attention entropy separates the groups more effectively than the other methods. It gives evidence that analyzing the frequencies of the intervals between patterns is more beneficial than analyzing the frequencies of the patterns themselves, especially when the time-series is short, for example, of length 100.

Effect of the Time-Series Length
We studied the effect of the time-series length, and the results are summarized in Tables 3 and 4.
The "*" symbol means the p-values < 0.01.

TABLE 4 ROC AUC Results
From Fig. 9, we found that, regardless of the time-series length, the attention entropy values decrease in the following order: entropy (young) > entropy (elderly) > entropy (CHF) > entropy (AF). These results are consistent with the concept that the cardiac dynamics of healthy young subjects are the most complex [43] and provide stronger support for the hypothesized complexity-loss theory of aging and disease [49] than multiscale entropy. Attention entropy reflects the regularity with which patterns repeat in a signal, which appears to play a critical role in the complexity loss of aging and disease. The regularity loss ignored by conventional entropy methods is explicitly addressed by attention entropy.

Intervals Between Peak Points
To study the intervals among peak points further, we tested the intervals between local maxima and local maxima (Max-Max intervals), the intervals between local minima and local minima (Min-Min intervals), and the intervals between local maxima and local minima (Max-Min intervals and Min-Max intervals). We calculated Shannon entropy of these four intervals and the average of Shannon entropy of these four intervals (Average-4). The AUC results are summarized in Table 5. We can see that the choice of the interval does not matter regardless of which time-series length is used. To simplify the choice, we recommend using Average-4 by default.

Fig. 9. Attention entropy analysis of interbeat interval time-series derived from healthy subjects with age ≤ 55 (young), healthy subjects with age > 55 (elderly), subjects with congestive heart failure (CHF), and subjects with atrial fibrillation (AF). Symbols represent the mean values of entropies, and bars represent the standard error ($SE = \text{standard deviation} / \sqrt{n}$, where n is the number of subjects).

Compared With Basic Statistics
Basic statistics such as the mean, standard deviation, root mean square, and the number of pairs of successive interbeat intervals that differ by more than 50 ms (NN50, defined in [50]) are also used to analyze interbeat time-series [50]. We compare these statistics with attention entropy, and the results are summarized in Table 6. We can see that attention entropy outperforms all the basic statistics regardless of the time-series length.
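The basic statistics used as a baseline can be sketched as follows. Several details are assumptions because the text does not fix them: the root mean square is taken over the raw interval values, the standard deviation is the sample (n - 1) version, the intervals are given in milliseconds, and the function name and example values are hypothetical.

```python
import math
import statistics


def basic_statistics(rr_ms):
    """Mean, standard deviation, root mean square, and NN50 of intervals in ms."""
    successive_diffs = [abs(b - a) for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean": statistics.mean(rr_ms),
        "sd": statistics.stdev(rr_ms),   # sample standard deviation (assumption)
        "rms": math.sqrt(sum(v * v for v in rr_ms) / len(rr_ms)),
        # NN50: pairs of successive intervals differing by more than 50 ms
        "nn50": sum(1 for d in successive_diffs if d > 50),
    }


# Hypothetical interbeat intervals in milliseconds.
stats = basic_statistics([800, 860, 850, 790, 795])
```

Unlike attention entropy, none of these statistics depends on where in the series the peak points occur, only on the values or on their successive differences.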

Effect of Noise and Outliers
The result of an experiment may be affected by the type of noise. Here, we discuss the effects of superimposing uncorrelated (Gaussian distributed white) noise on a physiologic time-series. Fig. 10 shows that the attention entropy method is sensitive to the noise. The same observation in Fig. 11 holds for the effects of outliers. This is because noise and outliers affect the key patterns, namely the peak points.

Computational Complexity
Attention entropy takes O(n) time, where n is the time-series length. To measure the actual processing time, the algorithm was implemented in Python 3.7 (available on the web, see footnote 1) and tested on a PC with an Intel Core i7 CPU at 2.3 GHz and 16 GB RAM. Fig. 12 and Table 7 show the relationship between the running time and the time-series length of one young subject.
We can see that with the increase of the time-series length, attention entropy requires much less computing time than most of the competing entropy methods, including the competitive MSE [21].

Discussion
In this section, we discuss the potential usability of the method in affective computing, its limitations, and the threats to the validity of the results. Many methods based on HRV have been developed for affective state analysis. This is because many affective computing studies assume that specific emotional states elicit changes in the autonomic nervous system, which can be monitored by HRV analysis, as shown by studies over the decades. However, entropy-based quantification of HRV has rarely been used for affective analysis, although it has been widely adopted in tasks such as disease detection and classification. This may be because conventional entropy methods were proposed for long-term HRV analysis, as introduced in Section 2, and are therefore not applicable to short-duration HRV analysis. This obstacle is expected to be removed by attention entropy, which works well with short-duration HRV signals and can therefore potentially be applied to affective state analysis. Moreover, attention entropy may be able to capture changes of affective states in a timely manner, given its linear time complexity. One limitation of the proposed method is that key patterns must be defined in advance. The limitation of using peak points as key patterns is that they are sensitive to outliers and noise.
The key patterns may be application-specific, which may be a threat to the validity of the results. However, these threats may be overcome by defining different key patterns and combining the results from multiple key patterns; future work could explore this strategy. The mechanisms behind the key patterns such as peak points could also be explored in future work.

CONCLUSION
A novel complexity analysis method called attention entropy is proposed, which does not need any parameter tuning when using peak points as key patterns. It has linear time complexity and is robust to the time-series length. We compared it to fourteen state-of-the-art complexity analysis methods with real-world datasets. The results show that attention entropy outperforms all the compared methods and is the only method able to separate all groups with statistical significance using a time-series length of 1000. This shows that attention entropy has higher discrimination power in short-duration HRV signals and has potential in other tasks such as affective computing. Future work could uncover more key patterns and the hidden mechanisms behind them.

TABLE 7 Running Time by Time-Series Length
Method          100     1000    10000    100000
Shannon [26]    <1      3       36       636
Rényi [29]      <1      5       77       1328
Tsallis [30]    <1      5       75       1374
Per. [31]       59      60      67       133
App. [33]       68      74      254      7599
Sample [34]     50      53      237      8236
Bub. [35]       78      7714    795821   >1h
HVE [36]        85      8763    885471   >1h
GHVE [22]       91      253     1935     19019
SVDE [38]       206     206     223      229
EPE [32]        1       5       523      5001
Spe. [27]       41      42      43       106
Ave. [28]       <1      5       44       448
MSE [21]        <1      2       160      16404
Attention       <1      2       16

Pasi Fränti (Senior Member, IEEE) received the MSc and PhD degrees in computer science from the University of Turku, in 1991 and 1994, respectively. Since 2000, he has been a professor of computer science with the University of Eastern Finland. He is currently a visiting professor with Shenzhen Technology University, China. He has published 89 journal papers and 173 peer-reviewed conference papers, including 15 IEEE transaction articles. His main research interests include machine learning, data mining, pattern recognition, clustering algorithms, and intelligent location-aware systems.