Detection of Diabetes Mellitus With Deep Learning and Data Augmentation Techniques on Foot Thermography

Diabetes mellitus (DM) is a metabolic disorder characterized by increased blood glucose. The pathology can manifest itself with different conditions, including neuropathy, the main consequence of diabetic disease. Statistics show worrying figures worldwide, diagnosed an estimated 1.6 million people with DM by 2025. In this sense, alternative and automated methods are necessary to detect DM, allowing it to take the pertinent measures in its treatment and avoid critical complications, such as the diabetic foot. On the other hand, foot thermography is a promising tool that allows visualization of thermal patterns, patterns that are altered as a consequence of shear and friction associated with lower limb neuropathy. Based on these considerations, we explored different strategies to detect patients with DM from foot thermography in this research. Initially, the study focused on finding a classification index like Thermal Change Index (TCI). Subsequently, we used the deep convolutional neural networks paradigm, implementing 12 different data augmentation methods, of which four are conventional, and 8 are newly proposed methods. The results showed that the proposed and the conventional methods increased the network’s performance, where a 100% detection was achieved by weighting the DM probability percentages for both images of the feet. Finally, it was also possible to demonstrate the importance of transfer learning, which does not depend on the type of database, but on the data corpus with which the transfer was trained.


I. INTRODUCTION
Diabetes Mellitus (DM) is a metabolic disorder characterized by increased blood glucose, known as hyperglycemia [1]. The pathogenesis of DM is due to immune disorders or insulin resistance, generated by genetic predisposition or environmental factors [2], [3]. In general, risk factors associated with DM are first-degree relatives with the pathology or a high Body Mass Index (BMI). In fact, high BMI has increased the number of cases worldwide due to poor eating habits and excessive sugar consumption [4], [5]. The impact is seen in alarming statistics worldwide. For example, World Health Organization (WHO) reported DM as the leading cause of death in 2019, with approximately 1.5 million cases [6].
The associate editor coordinating the review of this manuscript and approving it for publication was M. Shamim Kaiser .
Projections indicate that this number could increase to approximately 1.6 million by 2025 [7]. Similarly, the International Diabetes Federation (IDF) reported 537 million adults (20-79 years) with diabetes in 2021, i.e., one in ten adults are living with diabetes, where more than half are undiagnosed. Estimates project a global increase to 643 million by 2030 and 784 million by 2045. In addition, the IDF estimated approximately 6.7 million deaths by 2021 [8].
DM manifests as a group of metabolic disorders that can range from heart disease, stroke, renal failure, blindness, ischemia, and peripheral vascular disease, among the most common conditions [9]. Peripheral vascular disorders generate the well-known diabetic foot, being this also the main neuropathic alteration [10]. The disorder generates infections, ulcerations, or destruction of deep tissue in the lower limb, limiting the patient's mobility [11]. In this order of ideas, the pathology requires early and continuous medical attention to prevent complications such as lower limb amputation [12].
On the other hand, body temperature change is a widely used strategy to identify abnormalities in the medical field [13]. The research addresses different applications in the medical field, and diabetes is no exception to this trend. For example, different studies are currently based on foot thermography and even on the detection of type 2 DM from facial or tongue thermography, as explored by Thirunavukkarasu et al. in 2020 [14], [15]. Thermography is presented as a novel method for the exploration of different conditions [16], [17]; even such images have presented a significant growth in recent decades due to their noninvasive nature, being used even in face detection [18]. Additionally, thermographic images can be integrated with different Artificial Intelligence (AI) strategies due to their digital character. Advances have not been long in coming, and there are already several research studies aimed at diagnosing DM or the detection of conditions that are a consequence of it, such as neuropathic complications or diabetic foot [19]- [24]. The growing interest has even motivated the development of low-cost proprietary systems to register such thermographic images, as has been done by Bayareh [25] and Sruthi et al. [26].
At the same time, the topic has been approached from different perspectives, covering topics ranging from image processing to state-of-the-art deep learning networks. For example, Christy Evangeline et al. suggest a method of asymmetry analysis between angiosomes for the exclusive characterization of subjects with diabetes [27]. Filipe et al. propose a new coefficient like the Thermal Change Index (TCI). The method consists of parceling the foot into different regions based on the clustering algorithm applied to the intensity levels of the image (temperature). The average temperature is estimated from the regions, and they calculate the coefficient called Classification Temperature Threshold (CTT) [28]. Prabhu et al. developed an automatic system for segmentation of the plantar foot region in thermographic images, as it is an inherent need for feature extraction in such images. The development was done using the C means algorithm [29]. On the other hand, Carlos Padierna et al. extract 12 features from the top of the foot to train a Support Vector Machine (SVM) [30]. Similarly, Adam et al. also rely on a support vector machine for classification. The model uses Gray Level Co-occurrence Matrix (GLCM), Hu's invariant moment, Local Binary Pattern (LBP), Laws Texture Energy (LTE), and entropy, extracted from the decomposed form of the Discrete Wavelet Transform (DWT) and Higher-Order Spectrum (HOS). The development achieves an accuracy of over 89%. However, the authors point out that the work was based on a reduced database; moreover, the system is semi-automated, as it requires feature extraction [31]. In the same year, Adam et al. present new research on diabetic foot classification. The authors report higher values than in their previous research.
Similarly, the research is based on feature extraction but through decomposition into coefficients using Double Density-Double Tree-complex Wavelet Transform (DD-DT-CWT). The features were used to train different machine learning algorithms, where they managed to achieve the best results through four LSDA features from bilateral foot thermograms with k nearest neighbors (kNN) classifier [32]. In turn, Josephine Selle et al. follow the same classification path through SVM. The implementation is based on 15 features, of which 14 are generated from the GLCM, being the features in mention: autocorrelation, contras, correlation, dissimilarity, energy, entropy, homogeneity, max probability, sum of squares, sum average, sum entropy, sum variance, difference entropy, and difference variance. In this particular case, the model performance reaches an accuracy of 96.42%; however, the authors highlight that the number of the database is limited, the features are not optimized, and the classification is not performed on higher-order machine learning algorithms [33]. Reddy et al. implement four machine learning algorithms to diagnose foot ulcers in diabetic patients in a more comprehensive approach. The research is based on 22 recorded characteristics of each patient. The 22 attributes range from gender, age, body mass index, and qualitative indices such as smoking habit [34]. Nag et al. perform the classification with five previous steps, starting from acquisition, data preprocessing, analysis, segmentation of regions of interest, and feature extraction. The latter were obtained with first-order statistical features, textures based on GLCM, and features based on Wavelets. Finally, they perform the classification using three machine learning methods: SVM, kNN, and Decision Tree (DT), reaching an accuracy of 97.78% through the latter method [35].
Contrary to the previous developments, Thirunavukkarasu et al. perform a study base on standing thermographs' acquisition. In the process, they measure the temperature of the toe, metatarsal 1, 3, 5, instep, and heel, achieving an accuracy of over 81% through SVM [36]. For their part, Vardasca et al. use dynamic thermography to classify the diabetic foot. The proposed study was based on three machine learning algorithms using 12 temperature landmarks for subject discrimination. The development achieved 81% accuracy by implementing the kNN algorithm [37]. Finally, Cruz-Vega et al. use an artificial neural network and a support vector machine from the deep learning paradigm. The development is performed by obtaining several features after preprocessing. Additionally, they include the AlexNet deep learning network with the raw data, achieving the best classification performance in cross-validation [38]. A year later, Cruz-Vega et al. introduced a new convolutional neural network like AlexNet, which they call DFTNet. Additionally, the authors explore the multilayer perceptron and the support vector machine. The first model is trained with the raw data, while the third and second are trained with features extracted from the regions of interest [39], achieving an accuracy close to 95% [39]. Similarly, Khandakar et al. propose diabetic foot detection from two perspectives. The first one is from detection with different deep learning networks, and the second one uses conventional machine learning methods based on the subjects' characteristics, reaching an accuracy close to 96% [40]. Table 1 summarizes the related works described above,  where the best metrics obtained by the different authors with  the proposed methods are indicated. The previous review was based on the most recent and relevant articles in the study of DM diagnosis in thermographic images. The systematic review shows significant shortcomings in this area of interest. While it is evident that artificial intelligence has reached DM diagnosis, approaches to deep learning are scarce. On the other hand, despite the small corpus of data, approaches toward data augmentation or transfer learning are also not addressed in depth. Moreover, from previous reviews, one can confirm critical limitations such as small data volume, data variability, and low data quality [42].
At the same time, convolutional neural networks have had exponential growth in different research areas, and every day new architectures, models, and variations emerge that make them more and more efficient [43], [44].
Likewise, DL strongly impacts the medical area, addressing issues ranging from diagnosis to the prediction of possible pathologies [45]- [47]. Research is focused on networks with greater generalization capabilities, increasing the performance of models while trying to outperform human performance [48], [49]. The impact is seen in novel architectures such as ResNet [50], Inception [51], DenseNet [52] and transformers [53] among the most relevant developments. In addition, performance enhancement has been addressed with practical solutions such as regularization [54], batch normalization [55], transfer learning [56], and data augmentation. While the strategies are promising, data augmentation has been more widely embraced because it largely corrects the overfitting problem suffered by DL networks with little training data [57]. Traditionally, data augmentation consists of generating new images starting from reference images that retain their attributes, i.e., the new images must have the same labels as the original images [58]. The methods can address linear transformations, color space changes, and even intuitive transformations such as flipping images [59]. In other more inventive and current approaches, Szlobodnyik et al. [60] put forward a new method based on guided deep interpolation with an auxiliary linear selfexpressive layer. In turn, Raj et al. [61] suggest a nonlinear image blending technique, which challenges the classical notion of data augmentation. In this sense, exploring new data augmentation methods is an inherent need for the constant demand of large data corpora for training DL networks.
On the other hand, most DL applications are focused on conventional images, i.e., images of people, objects, or animals [62]. The trend was expected since high-quality digital cameras are now available, even at low cost and accessible to most users. Consequently, there are now genuinely novel developments that have not been exploited in medical imaging or are not known to be effective cases, such as foot thermography. For example, it has recently been shown that the ResNet network outperforms human performance in classification tasks and is even part of the top 5 networks [63]. Moreover, this network introduces residual connections to stabilize parameter training, allowing for deepening the model and, in turn, tackling increasingly complex tasks [50].
In this sense, we propose a framework for exploring different approaches to the detection of subjects with diabetes mellitus (DM). Initially, we propose the study of a new coefficient of characterization of subjects based on characteristics or descriptors such as age, general temperature, temperature by angiosomes, and rate of thermal change. Subsequently, due to the limited corpus of data, we propose the implementation of one of the most novel neural networks implemented with 12 data augmentation methods, of which 8 are newly proposed methods. Additionally, as a comparative analysis, we study the behavior of the network under three different training conditions. First, we train the network from scratch. Secondly, we use transfer learning from the conventional ImageNet image database, and, finally, we perform the same process but from a thermographic database, hoping that the thermal patterns of the data will allow us to achieve a higher accuracy of the model.
• Therefore, this work shows the following relevant contributions.
• A coefficient for the characterization of subjects with DM was sought.
• One of the state-of-the-art convolutional neural networks was implemented for accurate detection of DM subjects.
• Eight new data augmentation methods were introduced to overcome the limitation of a low amount of data.
• A comparative analysis of the training conditions of the convolutional models was performed.

A. DATASET
The investigation was performed using the public database ''Plantar Thermogram Database for the Study of Diabetic Foot Complications'' [64].  Figure 1). Additionally, the set includes the description of gender, age, TCI, height, weight, and BMI. It should be noted that the last three descriptors are not for all subjects.

B. DATA PREPROCESSING
The .csv files were converted to single-channel images in the 'float32' format. The size was set to 199 × 88, padding with zeros the edges of all images that did not meet that size.  Finally, the images were normalized with respect to the maximum and minimum of the data set, i.e., temperatures from 15.98 to 37.28 • C were taken as values from 0 to 1.

C. CONVOLUTIONAL NEURAL NETWORK
Artificial intelligence is one of the areas of computational sciences with many ramifications. AI ranges from the most straightforward systems composed of linear models to the most recent deep learning methods. One of the fundamental elements in AI is convolutional artificial neural networks or simply convolutional networks (See Appendix A). Networks are one of many bio-inspired systems based on the functioning of the central nervous system, more specifically, the brain.
Convolutional networks emulate the primary visual cortex to perform tasks such as detection, classification, or feature extraction to achieve a specific task [65]. This concept has led to the emergence of a large number of networks; in fact, there are currently several networks that differ from the first convolutional model developed by Lecun et al. in 1998 [66] or from the first convolutional DL model, known as AlexNet [65]. The various architectures and their depth allow them to perform increasingly complex tasks. However, the increase in depth has two drawbacks. First, the depth implies a more significant number of layers and, consequently, a larger number of training parameters. Consequently, a more significant number of training data is needed to fit many model parameters. In this sense, this is one of the reasons that motivated us to use data augmentation methods and include new strategies to address this problem. Secondly, convolutional layer augmentation limits the model's training generated by gradient fading [66].
For this reason, in this research, we chose to use the ResNet50v2 network [50], which has an elegant solution to this drawback. The network is based on the concept of connection or residual mapping. In general, the connection creates parallel paths to the convolutional layer sequences, allowing smooth transmission of the gradient through the layers and preventing the gradient value from being zero. In addition, the connection forces the network to learn the residual mapping f (x) − x, being easier to train if the residual mapping is the identity function f (x) = x [50], [67].

D. DATA AUGMENTATION
As mentioned above, neural networks require a large corpus of data to train all the model parameters. However, this is one of the main drawbacks in medical imaging-related research due to limited data sets. Fortunately, it has already been demonstrated that neural networks can be trained with synthetic images, allowing to reach the optimal values of the training parameters, and even avoiding overtraining the model [57], [58]. In this order of ideas, in the following, we present the four most used conventional methods in data augmentation, and we also propose eight new strategies based on image reduction methods.

1) FLIP
Among the most used conventional methods are geometric transformations [58]. The operations change the spatial position of pixels while maintaining the relationship between them, which can be subject to a mathematical model. One of the simplest methods is image flipping. The process consists of taking each pixel and assigning it a new inverted position concerning one of the two axes (in the case of two-dimensional images). The mathematical model of this process is governed by Equations (1) and (2) for horizontal and vertical flipping, respectively. The result is illustrated in Figure 2.
where, x max and y max are the last positions reached by the image pixels on the respective axes.

2) COLOR SPACE CHANGE
Conventional images are stored as arrays of three channels with the exact dimensions. The channels represent each of the intensity levels that make up the RGB image, i.e., the intensities of red, green, and blue colors. However, the intensity ratios of these channels can be combined to generate new intensity levels, creating a new color space [68]. Since the channels can be combined in different ways, there are many color spaces, e.g., Figure 3 shows five examples. For this particular case, we preferred to use our own color space, which was obtained with the equivalent transformations from RGB space to YCbCr, and from YCbCr to HSV space. However, the first transformation was applied by copying the gray channel to the remaining channels. That is, the transformation performed is the one expressed in equation (3) The transformation generated 3 channels from a single channel; however, we make the clarification that equation (3) is the equivalent transformation to YCbCr, but it is not such transformation because to obtain the Y, Cb and Cr channels, the operation must be applied to the RGB channels.
Subsequently, the channels obtained in the previous process were subjected to the equivalent RGB to HSV transformation. Mathematically, the process was performed with equations (4), (5) y (6), to generate the 3 definitive channels used in the training. [70].

3) LINEAR FILTERS
Another of the most used strategies is linear filters. Generally, filters are used to focus and blur the image, as illustrated in Figure 4. The method consists of sliding the filter through the whole image, obtaining new values for each filter position, generating the new image [71].
The new images were generated using a 3 × 3 Gaussian filter with a deviation of 90 pixels (σ = 90).

4) PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a method used to represent a set of observations in an equivalent, represented in latent variables of lower dimensionality [72]. The method takes the observations and combines them linearly to create new components. These components are generated hierarchically, each retaining a percentage of the variability of the original data. The first component, z 1 , is created with the highest percentage of variability, while subsequent components take smaller percentages than the previous ones. In principle, the first principal component is expressed as shown in Equation (7) or (8) in the form of a dot product (For more details see appendix B).

5) KERNEL PCA
The kernel PCA is an extension of principal component analysis (PCA), developed to work with nonlinear data. Kernel PCA implements a function to project the data set into a higher dimensional feature space. That is, the original data set is projected into a linearly separable higher dimensional space and then reduced in dimensionality through the PCA method described in the previous section. In general, this method can be used with different filters, where the most common is the Gaussian filter, as shown in Equation (9).
Like the previous case, incremental PCA is another extension of conventional PCA but used for too large data sets. The method consists of a low-rank approximation for the data, independent of the number of input data samples.

7) FACTOR ANALYSIS
Factor analysis is one of the most popular multivariate analysis methods used to study the relationship between sets of variables and project them into a space of lower dimensionality, known as factors. In particular, the mathematical model of factor analysis assumes that the original p variables are functions of m common factors (with m < p) plus a single factor ε. Mathematically, this can be expressed as shown in Equations (10) and (11).
where X is the vector of the original variables (X ∈ R p ), U is the factor matrix or matrix with factor loadings (U ∈ R p×m ), F is the vector of common factors (F ∈ R m ), and E is the vector of unique factors (E ∈ R p ) [73], [74].

8) INDEPENDENT COMPONENT ANALYSIS
In statistics, independent component analysis is a widely used technique to find a linear representation of non-Gaussian data such that the components of the new representation are statistically independent. Assuming a data set represented with m variables (X ∈ R m ), one can estimate the independent components as a linear combination of the original variables, as expressed in Equation (12) or (13).
Starting from this assumption, the components or latent variables are determined by the matrix W , which is estimated by maximizing the non-Gaussianity of W · X [75], [76].

9) NON-NEGATIVE MATRIX FACTORIZATION
Non-negative matrix factorization, or also known as nonnegative matrix approximation, is a dimension reduction technique that factors the matrix X observations into the approximation of two matrices. The method assumes that the three matrices do not have negative elements, allowing the resulting matrices to be more easily inspected. Mathematically, the observations matrix can be defined as an approximation of two matrices, as shown in Equation (14).
The estimation of the two new matrices is calculated keeping as much information as possible and using the objective function expressed by Equation (15). The WH matrices are calculated in such a way that the objective function reaches VOLUME 10, 2022 its smallest possible value [77], [78].
here, α w and α h are the regularization coefficients associated with each matrix. L 1 is the regularization mixture parameter, m the number of variables or features in the original set, n the number of samples or observations, . 1 is the l 1 norm and . F is the Frobenius norm or also called as the Hibert-Schmidt norm, defined by Equation (16).
here, a j,k is each of the elements of the m × n-dimensional matrix A [79].

10) DICTIONARY LEARNING
Dictionary learning is a way to find a better sparse mapping matrix by using training data. The technique is like nonnegative matrix factorization since it is based on the minimization of an objective function. In particular, the method consists of finding a dictionary D (Matrix) and a representation of the data Y such that the objective function expressed in Equation (17) is minimized.
here . F is the Frobenius norm, λ the coefficient associated with the regularization and is the sparsity inducing function, which can represent different regularization terms, such as l 0 and l 1 norms. ζ represents the dictionary constraint such that ζ ≡ {D ∈ Y n×m : d_i _2 < 1∀i = 1, . . . , m}. Where, . 2 is the length of the vector and d i the elements that compose the dictionary D [80], [81].

11) LATENT DIRICHLET ALLOCATION
Latent Dirichlet Allocation (LDA) is a generative statistical model used to represent a set of observations (M ) by latent subjects. LDA is a three-level hierarchical model, with each level underlying a finite set of the respective lower level. Suppose that an observation m can be modeled as the multinomial mixture of k latent topics z. In turn, each topic can be represented as a mixture of the probability distribution of n variables that make up the set of all possible variables constituting the space of observations (V ∈ R N ).
The LDA algorithm randomly assigns, to each variable of an observation, one of the k topics. Subsequently, it computes the proportion of the variable n in each observation of the total set and represents it as a Dirichlet probability distribution over the latent topics, where the model of the whole corpus (D) of data is given by Equation (18). (18) here, α is the Dirichlet distribution parameter, β is the parameterization matrix of observation m, θ is the expected proportion of topics in observation m, and z represents the assignment of a word in a topic k [82].

E. IMAGE GENERATION (PROPOSED METHODS)
The above reasoning can be applied to images, where each pixel is taken as the features of the vectors subject to reduction. In other words, the 199 × 88 images are flattened to a vector of 17, 512 features, which can be represented in a latent space of lower dimensionality through the different methods described. In this case, for all transformations, the feature vector was reduced to 267 latent variables (principal components, for the case of PCA variations).
The latent variables are constructed with the different mathematical models, which are reversible and are generally referred to as inverse transformations. In fact, this concept is widely used in machine learning for data whitening. In principle, the data are represented with lower dimensionality latent variables, which may be associated with noise. Therefore, removing such variables (or components) would eliminate the noise in the images when they are returned to the original space through inverse transformation [83]. In this sense, altering the latent variables implies partially modifying the images without losing their spatial characteristics.
Based on the above consideration, altering the latent variables with random noise would imply the partial modification of the image, preserving its main attributes and creating new images from a reference image. In summary, our method consists of projecting the images to a latent space of lower dimensionality (see Equation (19)). Each latent variable is multiplied point by point (Hadamard Product) with a random vector V of the same dimensions but made up with values from a threshold t to 1 (see Equation (20)). The threshold value is determined by the expression t = 1 − noise, where noise is the proportion of noise added. Finally, the modified latent variables are used to return to the original space through the inverse transformation, as indicated in Equation (21).
here, X represents the flattened image, T {} the transformation depending on the reduction model used. Z the image in latent space, i.e., the vector of latent variables or components.
is the Hadamard product. Z the component vector modified with the random noise of the vector V , T −1 {} the inverse transformation and X the new flattened image.

F. TRANSFER LEARNING
DL networks require a large corpus of data due to many training parameters. Fortunately, some methods help overcome this drawback, such as transfer learning. The process consists of taking a network and training it with an extensive database, allowing filters to be tuned to this reference database. The adjusted filters are then used in a secondary network, allowing the model to train from the predefined filters. In general, if the database is large enough, the network learns the task with a high degree of generalization. The attribute can be retained for an equivalent task with another dataset. The model is trained based on the learned weights, and the network would achieve better performance, even if the data is small [84], [85].
Based on this premise, it was proposed to train the ResNet50v2 network with learning transfer from two different databases. First, the transfer was performed from the Ima-geNet database (1.28 million images), a widely accepted and used database [86]. Second, the ''Thermal Image dataset for object classification'' database (6695 images) was used [87], consisting of thermal images of cats, cars, and humans. In principle, we expect that the weights learned will be able to preserve thermal patterns that will allow better performance with diabetic foot thermography databases. It should be noted that the first database already has the weights organized in the structure of various neural networks, as is the case of the ResNet50v2 network. In contrast, the second database does not have this attribute; therefore, for this database, it was necessary to train the model to store the weights and transfer them to the classification model of subjects with DM. The training process for the latter is presented in Appendix C.

G. PERFORMANCE EVALUATION METRICS
The review of related work shows several performance evaluation metrics of artificial intelligence algorithms. Therefore, this research focused on evaluating the five most common metrics, namely accuracy, sensitivity, specificity, F 1 score, and precision. The metrics are expressed as shown in Equations (22) through (26) [88]- [90].
The five metrics are given in terms of True Positives (TP), True Negatives (TN ), False Positives (FP), and False Negatives (FN ). In addition, for this specific case, the metrics represent the following observations: • Accuracy: The ability of a network to correctly classify different classes, i.e., subjects with DM and healthy or control group CG) subjects.
• Sensitivity: The ability of a network to classify actual MDs.
• Specificity: The ability of a network to correctly classify images of CG subjects.
• F 1 score: The ability to correctly identify the different classes in proportion to the number of classes.
• Accuracy: The proportion of real DM subjects among all detected DMs.

H. STATISTICAL ANALYSIS
For statistical estimation between groups, the non-parametric Kruskal Wallis test was implemented. The statistic evaluates whether two or more groups belong to the same distribution, based on the median of these groups. In other words, the test is based on the null hypothesis with the assumption that all samples come from the same distribution. Consequently, a p-value lower than 0.05 significance level would indicate that the null hypothesis is false, revealing a statistically significant difference between the two groups tested [91].
Assuming k groups with n observations, the method defines the H statistic given by the mathematical expression of Equation (27).
where, n i is the number of observations of the i-th group, N is the total number of observations of the two groups, r ij is the rank of the i-th observation over the j-th observation among all observations, and k is the number of groups [92], [93].

I. SEGMENTATION OF CRITICAL REGIONS
As mentioned above, diabetic foot neuropathy results in loss of lower limb sensation and reduced proprioception. The condition manifests with recurrent walking trauma, atrophying and causing dislocation of the protective pads, compromising the integrity of these regions [94]. Likewise, neuropathy decreases sweating and affects skin perfusion, generating dryness and hyperkeratosis, compromising the tissue with cracks or fissures [95], [96]. The consequences can be seen in the appearance of lesions preceded by ulcerations. However, trauma, recurrent peak pressures, and shearing produce significant temperature changes in the compromised regions [97]. In this regard, a simple model was used to highlight critical regions prone to injury and ulceration. The model consisted of highlighting regions that exceed the average plus the standard deviation of the plantar temperature of each foot in subjects classified with MD. That is, critical regions were segmented following the mathematical model VOLUME 10, 2022 of equation (29).
Here, T is the temperature of each pixel of the plantar image, µ f is the average temperature, and σ f is the standard deviation of each foot.

J. EXPERIMENTAL DESIGN
Initially, the characteristics of the subjects were taken and evaluated through the Kruskal-Wallis test statistic. The evaluation was done between the control group (CG) and diabetes mellitus (DM) in the characteristics of general temperature (average), angiosome temperatures, TCI coefficient, age, and gender. Based on this validation, a new coefficient was constructed to discriminate between the different subjects. In addition, the temperature histogram was also explored, where the average curves and the error bands associated with the two groups and each foot were generated. Subsequently, the thermographic images were taken and processed as described in the methodology. The images were adjusted to 199 × 88 size, normalized, divided into training and test data, and evaluated through 5-fold cross-validation.
The training data were used as reference images to augment the images with the strategies of: Once the augmented images were obtained, the ResNet50v2 convolutional neural network was trained under different training conditions (See Figure 5). Initially, it was trained from scratch, i.e., with the training parameters initialized with arbitrary values. Second, the network was trained by transfer learning with the weights generated by the ImageNet database. Third, the experiment was replicated with transfer learning but from the ''Thermal Image dataset for object classification'' database [87]. For the latter case, ResNet50v2 was trained, and the generated weights were used in the transfer. The results of this training are shown in the appendix section.
After training, the network was evaluated with the test set under accuracy, sensitivity, and specificity, F 1 score, and precision metrics. Additionally, the network was run six times per fold using the following hyperparameters: Next, the different data augmentation methods were evaluated using the non-parametric Kruskal-Wallis test statistic with the obtained accuracies. Finally, the best performing model was used to detect subjects with DM to which the critical temperature regions were segmented to establish power ulceration regions.

III. RESULTS
This section presents the results generated by the different parts described in the methodology. Initially, the problem was approached from the characteristics obtained from the subjects and the thermal images. Table 2 shows the p-value generated by the Kruskal-Wallis non-parametric test statistic. The statistic was evaluated for the eight characteristics between healthy or CG subjects and subjects with DM. In particular, the p-value was several orders below the significance level of 0.05 between CG and DM, i.e., the null hypothesis that the two groups come from the same distribution is rejected. Conversely, the p-value generated in comparing both feet is above the 0.05 level, indicating that the null hypothesis is retained. In other words, there is no statistically significant difference between the characteristics of both feet of both CG and DM subjects.
On the other hand, exploration of the histogram (see Figure 6) showed that the temperature distributions present differences between the two subjects. First, patients diagnosed with DM present a temperature shift above 30 • Celsius, while healthy subjects present a temperature distribution centered between 25 to 30 • Celsius. In other words, the prevalence of a higher proportion of the foot with temperatures above 30 • would be an indication associated with DM.
Based on the above considerations, it is possible to assume that the characteristics studied serve to discriminate between the two groups. In this order of ideas, a new classification coefficient was constructed, governed by the mathematical expression of Equation (30).
here,T is the average temperature of both feet, A is the age, TCI R and TCI L are the thermal change index of the right and left foot, respectively.   Figure 7a, shows the score returned for the 167 study subjects. The coefficient showed a marked difference between CG and DM subjects concerning the TCI value (see Figure 7b). It was feasible to search for a classification threshold through a support vector machine in awareness. The strategy with the SVM was replicated in the TCI for the right and left feet. The results are shown in Table 3 with the conventional evaluation metrics. Table 3 shows an increase of more than 17% with the new coefficient concerning the most accurate  TCI (right foot). In addition, the coefficient has a sensitivity of more than 97%, i.e., it has a high capacity to identify true DM cases.
As mentioned in the materials and methods section, in the second instance, the diagnosis of subjects with DM was approached through the novel deep learning convolutional neural networks. In addition, this development was approached from data augmentation strategies to mitigate the effect generated by the limited data sets. Figure 8 shows the images synthesized by the data augmentation methods for a reference image of a healthy subject. Additionally, the proposed methods project the data onto new latent variables to which a noise component is introduced. Consequently, the proposed methods are presented with different noise ratios. Similarly, Figure 9 shows synthesized images, but for the case of a subject with DM.
The results presented in Figure 8 and Figure 9 show that the feet retain their morphological structure with changes in temperature distributions. The patterns generated vary from one method to another. For example, principal component analysis with 0.1 noise preserves the reference image almost unchanged, while the LDA deforms the original foot pattern.
As mentioned above, the ResNet50v2 model was trained under different training conditions. Initially, the model was trained with the different data augmentation methods by initializing the training parameters with arbitrary values (under the Uniform Glorot distribution). The results of this first experiment are shown in Table 4. The values are expressed as average percentage values with the respective standard deviation. Subsequently, the same training was performed by transferring the weights from the ImageNet database. The results of this are shown in Table 5. Finally, the process was replicated using the transfer learning from the database ''Thermal Image dataset for object classification'' [87], where the latter results are presented in Table 6.
The results in Table 4 show an increase of about 9% in accuracy between the highest and best scoring methods. That is, data augmentation based on incremental PCA presented a 9% increase in training without any data augmentation method (row None). At the same time, data augmentation with the conventional color space shift technique presented about a 5% increase in training without methods. In other words, the results demonstrate the high effectiveness of data augmentation methods, including the proposed new methods.
Similarly, Table 5 presents a similar behavior, i.e., the data augmentation methods increased the accuracy of the ResNet50v2 network. For example, the linear Gaussian filterbased strategy improved by 8.5% in training without data augmentation. Similarly, the conventional image flipping method increased the model's accuracy by 5.6%, again guaranteeing the high effectiveness of training with data augmentation. On the other hand, it is worth noting that transfer learning produced a marked difference to the model trained from scratch. The results in Table 4 are close to 81%, while the results VOLUME 10, 2022  with learning transfer (Table 5) are close to or above 90%, i.e., learning transfer significantly improves the network's performance.
Finally, the results of the model trained with the data transfer from the thermographic database are shown in Table 6. Although the results present a significant increase concerning the training without transfer, it does not present any noticeable difference with respect to the transfer from the Ima-geNet database. In other words, transfer learning increases significantly; however, the database used in the transfer of the weights does not seem to be relevant, as long as the database has a large number of data. Figure 10 shows the distribution of the accuracy obtained by the data augmentation methods and the three training conditions of the ResNet50v2 network. It should be noted that the graphs are shown with fractional values to percentage equivalents, i.e., with values from 0 to 1, indicating values from 0 to 100%. In addition, the results of the remaining metrics are shown in Appendix D.
The distributions corroborate what was presented above. In the model trained from scratch, the box and whisker plots of the methods are centered near 80%, while the boxes of the methods with transfer learning are above 80 and centered at approximately 90%. In addition, it can also be observed that the distribution without the data augmentation method (None) is partially below the others in all three training conditions. Table 7, Table 8 and Table 9 show the test statistic evaluated between the different data augmentation conditions for model training. The first one shows the values for training from scratch and the second and third ones show the values for training with transfer learning from ImageNet and the thermographic database. Although the p-values vary substantially between methods, it is evident that the comparison Training of the ResNet50v2 model with learning transfer from the thermographic database as a function of different training conditions (data augmentation). The scores are given in average values with the respective standard deviation, calculated from the 30 different runs, i.e., the runs generated by the six repetitions of each fold (5 folds). Scores generated with the test data. without data augmentation (None) shows p-values below the 0.05 significance level for all cases. That is, the null hypothesis that the scores come from samples with the same distribution can be rejected. In other words, all data augmentation methods are statistically significant in neural network performance, including the proposed new methods. VOLUME 10, 2022 TABLE 7. P-value generated by the Kruskal-Wallis test statistic evaluated between the data augmentation methods and the accuracy metric in the case of training from scratch. In particular, the results showed high performance of the ResNet50v2 network, reaching values close to 100% accuracy, training the model with transfer learning, and using the proposed data augmentation method: Incremental PCA.
However, it should be emphasized that the development was performed on each foot individually, i.e., the scores of the evaluation metrics were performed on each foot individually (test data). Therefore, to augment the model's predictive model, we weighted the probability score of each foot to generate an overall score. Using the best-performing model, the implementation of this principle allowed us to reach 100% accuracy with both training and test data.
Finally, starting from this model, we implemented the detection of subjects with DM and highlighted the higher temperature regions, generating the critical temperature regions, regions prone to further ulceration due to temperature increase from friction and shear.
In this sense, Figure 11 and Figure 12 present the results of the detection model, where the probability of a subject suffering from diabetes mellitus is estimated [98]- [100].
The results indicate that the healthy subject with the highest probability of belonging to DM is 24.97%, i.e., in the worst case, the network has a certainty of 75.09 that the subject is healthy. Similarly, the subject with DM with the best probability of being DM is 69.19%, i.e., in the worst case, the network has a 69.19% certainty of classifying the subject with DM.
On the other hand, some subjects presented the maximum temperatures over the inner part of the foot or over the butterfly pattern. This would imply that the subject, although suffering from diabetes, is far from developing a diabetic foot ulcer. Although the classification is 100% accurate and the regions of maximum temperature could reveal the possibility of developing diabetic foot, longitudinal studies are needed to determine if the proposed method has clinical validity.

IV. DISCUSSION
Diabetes Mellitus (DM) can manifest itself through different metabolic disorders. The disease has a high recurrence in people over 40 years of age, and the worldwide figures are alarming due to the high number of cases. In this sense, early detection of DM is a primordial necessity to avoid other affections. In this order of ideas, in this research, we proposed a framework to explore different approaches for detecting or diagnosing subjects with DM. First, we proposed to search for a new coefficient superior to the TCI to classify patients with DM. The proposed new coefficient proved to be more efficient than the TCI; the difference between the two is evident, as there was a nearly 17% increase in accuracy 1 (see Table 3). In addition, our new coefficient managed to outperform the most recent research-based on machine learning techniques and the TCI coefficient. While the latest research reported an accuracy of 90.1% [41], our coefficient reached 94.61%, a difference of more than 4%.
However, due to the lack of information with the data used, it is not possible to determine whether the value of the coefficient is proportional to the degree of severity of the pathology, i.e., it is not possible to determine whether the score corresponds to the real stratification of the pathology. In this sense, it is suggested to study whether subjects with different degrees of DM or diabetic foot correspond proportionally to the new estimated coefficient.
On the other hand, the study of the characteristics confirms what has been reported in the literature, where diabetic foot pathology is closely related to temperature changes due to friction and shearing due to lower limb 1 Accuracy validated through a support vector machine. neuropathy [98]- [100]. Secondly, subjects with DM were detected from a deeper approach through convolutional neural networks, specifically through the ResNet50v2 network. Additionally, the implementation was approached from the data augmentation approach due to the lack of data in foot thermographs. Eight new proposed methods were integrated, which make a significant contribution to data augmentation. Although the synthesized images varied substantially among the data augmentation methods, this does not indicate that they are not efficient. For example, color space-shifting is one of the most accepted methods [58]; however, it differed quite a bit from the reference image (see Figure 8 and Figure 9). In other words, the data augmentation does not necessarily have to have an intuitive interpretation for the human, since, even if the image lacks visual interpretation, it could have real meaning for the artificial networks, a statement that has already been affirmed in previous research. Additionally, the values returned during training and the p-value of the Kruskal-Wallis test statistic ensure that the methods are statistically significant.
The results section shows that data augmentation has a significant difference concerning training without augmentation. Similarly, transfer learning also presented an improvement in network performance. However, the type of database does not show statistically significant differences. In other words, using transfer from ImageNet and the thermographic database is not relevant. These rules out the assumption that the network can learn thermal patterns that improve the performance of an application based on the same type of images. Although the ImageNet and the thermal database are based on two types of images, the transferred weights are taken as the initial values of the model, which are adjusted in training with the final database. Therefore, the fine-tuning would occlude the original transfer patterns, implying that the performance does not depend on the type of images FIGURE 11. Output generated by the convolutional neural network. The scores are the weighted probability (between both feet) of being subjects with DM. Additionally, subjects diagnosed with DM have their critical temperature regions highlighted, i.e., regions prone to further ulceration. 59580 VOLUME 10, 2022 FIGURE 12. Output generated by the convolutional neural network. The scores are the weighted probability (between both feet) of being subjects with MD. Additionally, subjects diagnosed with DM have their critical temperature regions highlighted, i.e., regions prone to subsequent ulcerations. VOLUME 10, 2022 in transfer learning. The results delivered by the network are highly satisfactory. 100% accuracy was achieved, i.e., true positives and true negatives were correctly classified, while no false positives or negatives were generated. That is, the network achieved 100% effectiveness in sensitivity and specificity, surpassing all previous investigations, where the maximum metric was achieved by Cruz-Vega et al. with an accuracy of 99.98% [38]. In this sense, we assume that our approach has two strong axes concerning the implementation performed by Cruz-Vega et al. First, our research was based on one of the state-of-the-art networks (ResNet50 Network), which has proven to be highly efficient in classification tasks. Second, the network training was based on proposed new data augmentation methods. Also, we assume that our framework is the most efficient because most of the recent research focused on conventional machine learning methods (See Table 1). For example, only three of the mentioned researches implemented deep learning [38]- [40]. However, our research is the first development that integrates a state-ofthe-art network trained under new data augmentation methods to the best of our knowledge.
In addition, the critical temperature regions are a tool to determine regions with a high probability of ulceration. In fact, most of the processed images presented temperature maxima in the regions of highest pressure with a high probability of ulceration, i.e., the bottom of toes, pad of foot, and heel of foot [101].
The results are promising and are statistically significant concerning training without any data augmentation methods. Also, the methods were used to augment the existing data in the same proportion. For each reference image, a synthetic image was created, and with this data, the neural network was trained, allowing it to reach 100% accuracy. However, it should be clarified that the research has some limitations. First, the new coefficient generates a stratification of the subjects that correspond to their nature Figure 7), but it was not verified whether these correspond to the degree of severity of the pathology. Furthermore, these values are not clinically corroborated; further studies are required to determine their validity. Secondly, the investigation could only be based on a single database, which we assume does not have sufficient heterogeneity, which is achieved with data from different centers.
Furthermore, this experimental framework leaves open some research questions that can be explored in future work. The methods allowed us to reach this excellent performance without considering that all methods were used individually. Therefore, it would be convenient to study if the combination of the methods would generate better performance, or posed as a question, could there be a combination of methods that improve the performance of convolutional networks? Finally, although ResNet50v2 showed excellent results, it has many training parameters; therefore, it would be interesting to validate if the proposed methods have the same impact on lowerorder networks.

V. CONCLUSION
A framework for exploring different approaches for screening subjects with diabetes mellitus (DM) was proposed. Initially, we planned to study a new discrimination coefficient of the subjects based on the characteristics of average temperature, age, overall TCI of right and left feet. The new coefficient was able to exceed the accuracy of the TCI by up to 17%, which is a useful tool for the stratification of subjects with DM. Secondly, the ResNet50v2 network was explored to accurately classify subjects by integrating it with 12 data augmentation methods to make up for the low corpus of thermographic image data. The augmentation consisted of 4 conventional methods and eight proposed methods based on dimension reduction methods: principal component analysis, kernel PCA, incremental PCA, factor analysis, independent component analysis, non-negative matrix factorization, dictionary learning, and LDA. Initially, the images were taken to latent spaces of lower dimensionality using the aforementioned methods. The latent variables were altered with a random noise vector and returned to the original image space, generating new synthetic images with characteristics like the reference images. All methods were shown to be statistically significant with a significance level below 0.05, which is a relevant contribution to data enhancement.
Additionally, a comparative analysis was performed, we studied the behavior of the network under three training conditions. First, the network was trained from scratch. Secondly, we used transfer learning from the conventional ImageNet image database, and finally, we performed the same process, but from a thermographic database. The results showed that data augmentation and learning transfer are statistically significant strategies to improve the performance of convolutional neural networks. Furthermore, it was also shown that learning transfer has the same impact regardless of the nature of the images, as long as the data corpus is large enough to generate transferable learning patterns.

APPENDIX
The study presents some secondary results not presented in the main section. This section makes such useful results available to support those discussed throughout this paper.

A. CONVOLUTIONAL NEURAL NETWORK
In particular, as the name implies, convolutional networks are made up of a convolutional operator (or filter) that highlights some features of the image. The filter is a two-dimensional element belonging to the set of real numbers, which can have any number of combinations of values, as shown in Figure 13.
In general, networks are built with a different number of convolutional layers and with different filters per layer, which define the mathematical model of the network. Each filter is initially made up of random values that are progressively adjusted in the training process until the optimal parameters are reached. Once the model is fully trained, the network can take a new image, convolve it, and progressively generate feature maps as it moves through the different convolutional layers. For example, Figure 14 shows the first feature maps of a 64-filter model. The features of the first layer are then used by subsequent layers, which in turn create new and increasingly abstract feature maps. The features or feature maps predict a new result from what is learned from the training data. In essence, the mathematical model of the network is shaped by the expression of Equation (31).
Here, X i is the input image in the i-th channel, M is the total number of channels or number of feature maps (as the case may be). K ij is the i-th channel of the j-th convolutional filter. ( * ) is the convolutional operator, b j is the bias associated with the j-th filter, and ϕ is the activation function of the model. Therefore, A j is the feature map generated by the j-th convolutional filter. It should be emphasized that Equation (31) is only the output of a single filter in a single layer of the network. The model is similar throughout all layers, but for anyone layer, the layer's input is the previous layer's output; therefore, this principle is represented as expressed in Equation (32).     where, the superscript (l) indicates the l-th layer and (l − 1) the previous layer.

B. PRINCIPAL COMPONET ANALYSIS
Principal component analysis (PCA) is a method used to represent a set of observations in an equivalent, represented in latent variables of lower dimensionality [72]. The method takes the observations and combines them linearly to create new components. These components are generated hierarchically, each retaining a percentage of the variability of the original data. The first component, z 1 , is created with the highest percentage of variability, while subsequent components take smaller percentages than the previous ones. In principle, the first principal component is expressed as shown in Equation (7) or (8) in the form of a dot product. z 1 = u 11 x 1 + u 12 x 2 + u 13 x 3 + . . . + u 1m x m (33) In other words, let X be an observation of m variables (X ∈ R m ); this can be represented from a smaller number of latent variables Z , as shown in Equation (35).  where, Z is the vector of n principal components, with n < m.
In other words, Z t is equal to (z 1 , z 2 , z 3 , . . . , z n ), with each principal component z i being the linear combination of the m original variables that make up X . W is the matrix of the coefficients of these linear combinations, which are calculated according to the following considerations: The first VOLUME 10, 2022 component z 1 must satisfy the maximum variance subject to the constraint of Equation (37).
The subsequent components are calculated under the same reasoning but considering that the new components must be orthogonal to the previous ones. Therefore, the i-th component must fulfill the restriction of Equation (38).
59586 VOLUME 10, 2022  FIGURE 19. Distribution of the scores generated by the data augmentation methods in the ResNet50v2 model trained with learning transfer from the thermographic database for the metrics of a) accuracy, b) sensitivity, c) specificity, d) sensitivity, and e) precision.
C. MODEL TRAINING FOR TRANSFER LEARNING FROM EXTERNAL THERMOGRAPHIC DATA As mentioned above, it was necessary to train the ResNet50v2 model for transfer learning with the thermal database to generate the fitted weights and transfer them to the training model with the diabetic foot database. Similarly, to the training discussed in the materials and methods section, this network was trained under the same considerations. VOLUME 10, 2022 Table 10 shows the mean values with the respective standard deviation of the accuracy, sensitivity, specificity, precision, and F1 score metrics. Figure 15 presents the distribution of accuracy scores of the five folds in the cross-validation.
The results in Figure 15 show that the model achieved high performance with the test data, ensuring that there was no overtraining of the model. Figure 16, on the other hand, shows the training curve with the accuracy and loss metrics. The curves are the average with the 95% error band. However, the error between training is too small to be visible to the naked eye. In fact, Table 10 also shows this behavior, where the standard deviation for the five folds does not exceed 0.5%.

D. RESULTS OF CROSS-VALIDATION WITH 5-FOLDS
Similarly, the ResNet50v2 network was trained using crossvalidation with 5-folds. However, the results do not show significant differences between the different folds. Therefore, it was decided to present these results in this section. Table 11 presents the training results from scratch for the ResNet50v2 model evaluated with 5-fold cross-validation. Similarly, Table 12 and Table 13 show the same results for the model trained with transfer learning. The first one uses the ImageNet database, and the second uses the thermographic database.
Similarly, Figure 17, Figure 18 and Figure 19 show the distribution of scores from the three training conditions, i.e., from scratch, with learning transfer from ImageNet and the thermographic database. The three plots show the distributions in fractional values, with values ranging from 0 to 1 equivalent to 0 and 100%. In addition, the figures show the accuracy, sensitivity, specificity, precision, and F1 score metrics as a function of the 13 training conditions. That is, under the eight newly proposed methods, the four conventional methods, and the model's training without data augmentation