Machine Learning-Based Pain Intensity Estimation: Where Pattern Recognition Meets Chaos Theory—An Example Based on the BioVid Heat Pain Database

In general, classification tasks can differ significantly in their complexity. For instance, image-based differentiation between vehicles and pedestrians is most likely less complex than CT-scan-based differentiation between several lung diseases. Intuitively, from a human point of view, one can identify some classification tasks as more complex than others. Moreover, based on expert knowledge and/or task-specific meta information, one could attempt to estimate the complexity ranking of specific classification tasks. In this work, based on the publicly available BioVid Heat Pain Database (BVDB), we experimentally confirm the intuitive assumption that the task of automated pain intensity recognition (PIR) is very challenging. Inspired by the field of chaos theory, we show that the BVDB-specific PIR task can not only be seen as highly complex, but can even be identified as a classification task of chaotic nature. To this end, we apply Hao's working definition for chaotic systems and provide an experiment-based chaos check method. To validate our approach, we include a task of handwritten numeral recognition as a non-complex counterpart. Our study provides two main contributions, i.e.: i) an enhanced understanding of the still present and, more importantly, substantial gap between the ground truth and the predictions reported by different research groups for automated PIR tasks; and ii) an approach for a numerical complexity check based on chaos theory. Different research directions are discussed for future work. Note that improving PIR accuracy performance is not part of the study objective.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Turgay Celik.)

Machine learning-specific pain assessment based on physiological signals constitutes a challenging task. Several studies indicate that it seems feasible to design robust and effective models which can reliably distinguish between a person's no pain and severe pain conditions. For instance, in [30], Werner et al. obtained an averaged accuracy value of 94.3% based on the X-ITE Pain Database [9], in combination with a leave-one-subject-out cross validation (LOSO-CV), with focus on the binary scenario of no pain vs the highest electrical pain level, using random forests. However, the distinction […] several image-based classification tasks including one or even two hundred classes [1] (e.g. defined by the CIFAR-100 [13] or Caltech Birds [29] data sets). For readers interested in automated pain intensity recognition, we refer to the recently published survey studies, [18] and [32], which focus on ANN-based and hand-crafted feature extraction approaches, respectively.

In [16], the authors introduced three data complexity measures, which they identified as infeasible in practice. However, they showed that the complexity can be approximated by classification models. To this end, they used support vector machines (SVMs) [26] for their data complexity analysis. More precisely, they focused on the number of support vectors obtained during the training, with a higher number of support vectors implying a higher complexity.

In this work, we focus on the complexity of a given feature space. We aim at showing that a classification task can be identified as chaotic (and hence as complex) based on Hao's working definition for chaotic systems [10].
Similar to the classification model-based approach in [16], we will use decision tree models to this end, proposing a chaos check method based on Hao's definition. Note that in contrast to [16], we use the term task complexity instead of data complexity to emphasise that a classification task [19] is defined by the combination of data samples and the corresponding labels. Note that improving pain assessment accuracy performance is not part of our current contribution.

The remainder of this study is organised as follows. In Section II, we motivate our work, present the goal of the study, provide Hao's working definition for chaotic systems and justify the choice of decision tree models. Subsequently, in Section III, we briefly describe the BioVid Heat Pain Database, which constitutes the main example of our numerical chaos check. The formalisation is presented and discussed in Section IV. Section V consists of the experimental evaluation, including a brief description of the Multiple Features data set [25], which constitutes a low-complexity classification task and is used as the counterpart in our proposed chaos (complexity) check approach, the experimental settings, as well as the illustration and discussion of the results. Finally, the paper is concluded in Section VI.

II. MOTIVATION
In this section, we will first discuss the versatile usability of decision tree models. Subsequently, we will provide a summary of Hao's working definition for chaotic systems and check its applicability to decision tree classifiers.

Note that our motivation is based on the following intuitive idea. Identifying a classification task as chaotic based on the decision tree model (i.e. system) implies that the corresponding task is (highly) complex.

Classification and regression trees [7] are classic machine learning models. In this work, we focus on classification trees, which we will simply denote as decision trees. In general, decision trees mainly serve as base classifiers in classification ensembles [14], such as in bagging [5], boosting [20], and random forests [6]. However, one can also count the number of decision nodes constructed during the training process to obtain an initial estimate of the corresponding task (labelled data) complexity. In addition, decision trees can be used to obtain feedback on the importance of individual features.

Note that decision trees are unstable classification models [5]. This means that small changes of the training data can lead to large changes in the final model. Although small and large are relative terms, we will focus on the decision trees' instability. We will use this characteristic for the identification of some chaos-specific properties and hence of classification task complexity. In the following section, i.e. in Section II-B, we will discuss the importance of stability and instability in chaotic systems.

In this work, we will define decision trees as our system. […] the test set accuracy and the test set-specific label outputs. Since there is no universal task complexity measure [16], we will use Definition 1 in combination with these characteristics as an indicator for the complexity of classification tasks.

III. THE BIOVID HEAT PAIN DATABASE
In this work, we focus on Part A of the BioVid Heat Pain Database. […] After defining the ground truth, each participant was stimulated 20 times with each of the pain levels in randomised order. To this end, the temperature was linearly increased to the corresponding value and held for four seconds. After decreasing the temperature back to T_0, i.e. 32 °C, the no pain level was held for a random duration of eight to twelve seconds.

During the main phase, the experimenters recorded videos from three different angles as well as three physiological signals. In this work, we focus on the recorded physiological signals, i.e. electrocardiogram (ECG), electrodermal activity (EDA) and electromyogram (EMG). ECG measures a person's heart activity, whereas EDA and EMG measure a person's skin conductance and muscle activity, respectively. The EMG sensors were attached in the shoulder area with focus on the trapezius muscle. The EDA sensors were attached to the ring finger and index finger, on one of the participant's hands.

To keep this study consistent with our previous works, we will use exactly the same hand-crafted features as in [11] and [12]. The features were extracted from windows of 5.5 seconds length, from the temporal and frequency domains, including statistical descriptors, such as mean and extreme values, and signal-specific descriptors, such as the heart rate variability (defined by the ECG signal), amongst others. In total, 194 features were extracted, including 56, 68 and 70 features for the signals EMG, ECG and EDA, respectively. Each person-specific feature set was normalised, leading to zero mean and a standard deviation of value one. To focus on our current contribution, we refer the reader to [11] and [12] for a complete description of the preprocessing and feature extraction steps.

Moreover, we refer readers interested in facial videos-specific pain intensity recognition based on the BVDB to [22] and [31].
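The per-person normalisation step described above can be sketched as follows. This is a minimal pure-Python sketch, not the original preprocessing code of [11] and [12]; the function name is ours, and we assume the population (not sample) standard deviation:

```python
def zscore_per_person(feature_matrix):
    """Normalise each feature column to zero mean and unit standard
    deviation, applied separately to one person's feature set.

    feature_matrix: list of rows (one row per 5.5 s window),
    each row a list of feature values.
    """
    n_rows = len(feature_matrix)
    n_cols = len(feature_matrix[0])
    normalised = [row[:] for row in feature_matrix]
    for j in range(n_cols):
        col = [row[j] for row in feature_matrix]
        mean = sum(col) / n_rows
        var = sum((v - mean) ** 2 for v in col) / n_rows
        std = var ** 0.5
        for i in range(n_rows):
            # Constant features (std == 0) are mapped to zero.
            normalised[i][j] = (feature_matrix[i][j] - mean) / std if std > 0 else 0.0
    return normalised
```

In the BVDB setting, this would be called once per participant, so that each of the 87 person-specific feature sets has zero mean and unit variance.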

IV. FORMALIZATION
By X ⊂ R^d, d ∈ N, we denote a d-dimensional, labelled data set. More precisely, the elements of X consist of pairs of data points and corresponding labels, i.e. X = {(x_1, y_1), …, (x_n, y_n)}, with data points x_i ∈ R^d and corresponding labels y_i.

Our analysis is based on decision tree (DT) classifiers. By Θ, we denote the set of DT-specific training parameters and settings, for instance, including the split criterion or the cost of misclassification. Moreover, by DT_X^Θ, we denote the decision tree that is designed in combination with training set X and parameter set Θ. Note that in most cases, we will omit the superscript Θ for the sake of readability, simply using the term DT_X. For any data point z ∈ R^d, we denote the label output of model DT_X specific to z simply by DT_X(z).

By Q, we denote the set of model-specific measures, such as the number of decision tree nodes.

Let X_1, X_2 ⊂ R^d be two training sets. In the current study, we focus on measuring the differences between the resulting DT classifiers. To this end, we evaluate the relative difference, Δ_q, between the corresponding classification models DT_{X_1} and DT_{X_2}, which we define as follows:

    Δ_q(DT_{X_1}, DT_{X_2}) = |q(DT_{X_1}) − q(DT_{X_2})| / q(DT_{X_1}),    (1)

whereby q ∈ Q is a DT-specific measure as discussed above.

Note that Δ_q is undefined if the corresponding denominator is equal to zero. However, this case never occurred in our experiments, which are presented in Section V. Moreover, it holds that Δ_q ≥ 0.

In addition, let Z ≠ ∅ be a set of d-dimensional data points, i.e. Z ⊂ R^d. To measure the relative difference of label outputs between models DT_{X_1} and DT_{X_2} specific to the set Z, we define Δ_Z as follows:

    Δ_Z(DT_{X_1}, DT_{X_2}) = |{z ∈ Z : DT_{X_1}(z) ≠ DT_{X_2}(z)}| / |Z|.    (2)

V. EXPERIMENTAL EVALUATION
The Multiple Features (MFeat) data set [25] consists of handwritten numerals, i.e. 0, …, 9, thus constituting a 10-class classification task.

The feature dimension of the provided data is equal to 649.

The features are organised in six feature sets. […]

For each test set, we will focus on the percentage difference in the number of nodes, the accuracy, as well as the output diversity. The first two measures are computed by using Eq. (1), whereas the output diversity is calculated by applying Eq. (2). Note that we will analyse whether Properties 3 (instability condition) and 4 (stability condition) of Definition 1 are fulfilled. More precisely, we will check whether all three measures are sensitively or not sensitively influenced by small changes in the training data.
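A minimal plain-Python sketch of these measures follows. The function names are ours, the toy values are made up, and we assume Eq. (1) is normalised by the reference (first) model's value; the actual q-values, such as node counts or accuracies, would come from trained decision trees:

```python
def rel_diff(q1, q2):
    """Relative difference of a model-specific measure q, e.g. the
    number of decision tree nodes or the test set accuracy (Eq. (1));
    q1 is the reference model's value."""
    if q1 == 0:
        raise ValueError("relative difference undefined for q1 == 0")
    return abs(q1 - q2) / q1

def output_diversity(labels_1, labels_2):
    """Fraction of test points on which two models' label outputs
    disagree (Eq. (2))."""
    if not labels_1 or len(labels_1) != len(labels_2):
        raise ValueError("label lists must be non-empty and equal-length")
    return sum(a != b for a, b in zip(labels_1, labels_2)) / len(labels_1)

# Toy values: a baseline tree with 100 nodes vs a perturbed tree with 97,
# and two label outputs that disagree on 2 of 5 test points.
node_change = rel_diff(100, 97)                                  # 3%
out_change = output_diversity([0, 1, 2, 3, 4], [0, 1, 2, 0, 0])  # 40%
```

With one tree per removed training point, these two functions yield exactly the per-iteration quantities that are later averaged and maximised per test fold.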

For the BVDB, we will apply a nested 87-fold cross validation as follows. Note that the BVDB consists of 87 participants, with 100 data points each, i.e. 20 per class (5 classes). For each test fold (i.e. test subject), we will apply 8,601 iterations. In each iteration, we will remove one data point from the initial training set, which consists of 8,600 data points. Thus, the change of the initial conditions is equal to 1/8600 ≈ 0.012%, for the BVDB.

For the MFeat data set, we will apply a nested 20-fold cross validation. Note that the MFeat data set consists of 2,000 data points in total, with 200 points per class (10 classes). For each test fold, we will apply 1,901 iterations. In each iteration, we will remove one data point from the initial training set, which consists of 1,900 data points. Thus, the change of the initial conditions is equal to 1/1900 ≈ 0.053%, for the MFeat data set.
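The fold sizes and perturbation magnitudes above follow directly from the data set sizes; as a quick sanity check (variable names are ours):

```python
# BVDB: 87 subjects with 100 points each; holding out one test subject
# leaves 8,600 training points.
bvdb_train = (87 - 1) * 100
# MFeat: 2,000 points in total; one test fold of 100 leaves 1,900.
mfeat_train = 2000 - 100

# Removing a single point changes the initial conditions by (in percent):
bvdb_change = 100.0 / bvdb_train
mfeat_change = 100.0 / mfeat_train

# Per test fold: one baseline model plus one model per removed point.
bvdb_iterations = 1 + bvdb_train    # 8,601
mfeat_iterations = 1 + mfeat_train  # 1,901
```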

Note that for both data sets, MFeat and the BVDB, each test fold consists of 100 data points, equally distributed among the classes, i.e. 20 per class for the BVDB and 10 per class for the MFeat data set. The data set-specific nested cross validation parameters are summarised in Table 1.

[…] MFeat and the BVDB. Secondly, for the MFeat data set, all of the relative differences (ΔNds, ΔAcc, ΔOut) are smaller than 1%, even smaller than 0.3%, on average. Thirdly, for the BVDB, only the relative difference for the number of nodes is less than 1%, and even less than 0.3%. For ΔAcc and ΔOut, the averaged relative difference is equal to 2.5% and […]

Note that the size of the test sets is always equal to 100 data points. Therefore, a change of 9% implies that by removing […]

Figure 5 depicts the maximum percentage changes for the BVDB per epoch, i.e. the maximum over the 8,600 training iterations, with a change of 0.012% of the initial training data. From Figure 5, we can observe that the relative change in the number of decision tree nodes never exceeds 10%. The maximum value is observed for epoch 3 and is approximately equal to 3% (3.0004%). The maximum ΔAcc and ΔOut values always exceed 10%. The maximum change in accuracy is observed in epoch 42, exceeding 90%. The maximum ΔOut value is noted in epoch 57 and is equal to 79%. This means that removing one single data point from the 8,600 training data points led, at least once, to 79 differences in the label outputs on the corresponding 100 data points-specific test set, in comparison to the decision tree model that was trained in combination with the whole data set.

The reason why we included the difference in the label outputs (ΔOut) is that it is more precise than the difference in accuracy (ΔAcc). Note that two classifiers can each have an accuracy of 50% while disagreeing on all of their label outputs. Therefore, in the following, we set the focus on the measures ΔNds and ΔOut. While in a chaotic system, i.e. a complex classification task, we expect the difference in the number of nodes, ΔNds, to be low on average, we assume the difference in the label outputs, ΔOut, to be relatively high in comparison. Moreover, in non-complex tasks, we expect both measures to be low on average, thus violating Property 3 (instability condition) of Definition 1.
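The point that equal accuracies can hide complete disagreement is easy to illustrate with made-up labels:

```python
# Ground truth for a toy 4-point test set (binary task).
truth = [0, 1, 0, 1]

# Two hypothetical classifiers: each is correct on exactly half the points.
preds_a = [0, 1, 1, 0]  # correct on points 1 and 2
preds_b = [1, 0, 0, 1]  # correct on points 3 and 4

acc_a = sum(p == t for p, t in zip(preds_a, truth)) / len(truth)  # 0.5
acc_b = sum(p == t for p, t in zip(preds_b, truth)) / len(truth)  # 0.5

# Accuracy-based comparison sees no difference at all ...
diff_acc = abs(acc_a - acc_b)   # 0.0
# ... yet the two models disagree on every single test point.
diff_out = sum(a != b for a, b in zip(preds_a, preds_b)) / len(truth)  # 1.0
```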

Since our current work presents the initial outcomes in combination with our proposed complexity check, we do not have any empirical data to compare with. We observed that both measures, ΔNds and ΔOut, stayed below 0.3% on average, based on a 20 × 1,900 cross validation evaluation, for the MFeat data set. On the other hand, while ΔNds stayed below 0.3%, the averaged ΔOut values exceeded 3.5%, based on an 87 × 8,600 cross validation evaluation in combination with the BVDB, with a change of 0.012% in the initial conditions.

If we focus on the relation between the mean ΔOut and ΔNds values, we obtain the following outcomes. For the MFeat data set, it holds ΔOut : ΔNds ≈ 1.42. In contrast, for the BVDB, it holds ΔOut : ΔNds ≈ 14.68. While it could be difficult to define task-independent absolute thresholds for ΔNds and ΔOut, the relation ΔOut : ΔNds might allow for a complexity comparison across different classification tasks.

VI. CONCLUSION
From the current work, we can draw the following conclusions. […]

VOLUME 10, 2022