Evolving Pre-Trained CNN using Two-Layers Optimizer for Road Damage Detection from Drone Images

Numerous pre-trained Convolutional Neural Networks (CNNs) have been introduced in the literature, such as AlexNet, VGG-19, and ResNet. These pre-trained CNN models can be reused and applied to tackle different image recognition problems. Unfortunately, these models are complex and have a large number of convolutional filters. To tackle this complexity challenge, this research aims to evolve a pre-trained VGG-19 using an efficient two-layers optimizer. The proposed optimizer performs filter selection on the last layers of VGG-19, guided by the accuracy of a linear SVM classifier. The proposed approach has three main advantages. Firstly, it adopts a powerful two-layers optimizer that works with a micro swarm population. Secondly, it automatically evolves a lightweight deep model that uses a small number of VGG-19 convolutional filters. Thirdly, it applies the developed model to real-world road damage detection from drone-based images. To evaluate the effectiveness of the proposed approach, a total of 529 images of various road damages were captured using a drone-based camera. Reported results indicated that the proposed model achieved an F1-score of 96.4% with a reduction in VGG-19 filters of up to 52%. In addition, the proposed two-layers optimizer was able to outperform several related optimizers such as the Arithmetic Optimization Algorithm (AOA), the Wild Geese Algorithm (WGA), Particle Swarm Optimization (PSO), Comprehensive Learning Particle Swarm Optimization (CLPSO), and Reinforcement Learning-based Memetic Particle Swarm Optimization (RLMPSO).


I. INTRODUCTION
Road maintenance is essential for keeping roads in good condition, which in turn provides a safe and secure environment for road users. However, manual inspection by humans is still the most common method used for evaluating and detecting road damage [1]. The main challenge of this method is the high personal safety risk, especially on highways.
Unmanned Aerial Vehicles (UAVs), commonly known as drones, are a promising solution for automating the road inspection process. This is due to the high mobility of a drone, which can move easily from one place to another and can therefore be used to monitor road conditions over larger areas. In the long run, this will bring better road maintenance at lower cost. The essential operation of the drone in performing road inspection is illustrated in Fig. 1. As can be seen, a flying camera carried by a drone records road conditions in real time. Then, the recorded videos are converted to images that are presented to a classifier, which classifies them as normal or abnormal (see Fig. 2).
Recently, deep learning has shown great success in recognizing road damage [2] [3]. For instance, Dung et al. [3] employed a pre-trained CNN model (VGG-16) as the backbone network for feature extraction with a softmax classifier. Reported results showed that road crack detection accuracy reached 97.8% on a dataset of concrete crack images. Nevertheless, the models in [2] and [3] were applied to non-drone captured images. More importantly, previous studies did not consider the model complexity caused by the number of CNN filters in the employed pre-trained networks such as VGG-19. To fill this gap and tackle this complexity challenge, this study aims to evolve a pre-trained VGG-19 with a two-layers optimizer. The optimizer aims to find the most discriminative VGG-19 filters guided by two measures: the classification accuracy of a linear SVM classifier and the VGG-19 model complexity. The complexity measure is represented by the percentage of the remaining VGG-19 filters. It should be noted that our proposed two-layers optimizer was successfully applied to large-scale benchmark problems in a recently published conference paper [4]. In this study, however, it is used for evolving the pre-trained VGG-19 model to tackle the problem of road damage detection. The main contributions of this work can be summarized in the following points:
• It adopts an efficient two-layers optimizer that has a global and a local search layer and works with a micro swarm population.
• It evolves a lightweight pre-trained VGG-19 feature extractor that has a small number of convolutional filters.
• It applies the optimized lightweight model to real-world road damage detection from drone-based images.
• It compares the outcomes of the proposed two-layers optimizer with other well-known and recent algorithms.
The remaining part of this paper is organized as follows. Section II discusses related work. The details of the proposed approach are explained in Section III. A series of experiments conducted to evaluate the effectiveness of the proposed approach is presented in Section IV, followed by the conclusions, limitations, and future directions given in Section V. Table 1 lists all abbreviations used in this study.

II. RELATED WORK
In the literature, many deep learning-based techniques have been used for the problem of road damage detection. Zhang et al. [2] proposed a deep ResNet model for concrete crack detection. Reported results showed that the ResNet model achieved an accuracy of 96.84%. An end-to-end trainable CNN model named DeepCrack was given by Zou et al. [5]. The main aim was to combine the convolutional feature maps at different scales. Therefore, they introduced a multiscale feature-fusion map for road crack detection. The outcome indicates that DeepCrack achieved an F1-score of over 87%. However, the model was sensitive to noisy cracks. A CNN-based scheme, which was evaluated with 500 smartphone images of size 3264 x 2448, was given by Zhang et al. [6]. Another study that introduced a CNN-based scheme with an adaptive thresholding technique to detect road cracks was presented by Fan et al. [7]. In [7], a CNN was used to recognize cracked images, followed by an enhancement technique that was applied to reduce the noisy pixels and emphasize the edges of the cracks. Finally, the filtered images were down-sampled and thresholded to segment road cracks. The scheme in [7] reported a detection accuracy of 99.92% and a pixel-level segmentation accuracy of 98.7%. A modified deep AlexNet model was introduced by Fan et al. [8]. Their model was applied to images with various road crack situations, such as shadowed and blurry cracks. In their work, a sliding window of 156 x 156 pixels was applied and fed to the modified AlexNet to detect road cracks. Their reported performances showed that an average accuracy of 99.09% had been achieved. Yusof et al. [9] presented a fully automated crack detection approach using a CNN, which was applied to two datasets. The first one consists of 9000 images with two classes, namely crack and non-crack. The second dataset consists of 5700 images with four categories, including non-crack, transverse, longitudinal, and alligator cracks. Their research reported 98%, 99%, and 99% for recall, precision, and accuracy, respectively.
A very recent deep learning model was investigated by Arya et al. [10]. Their approach performed transfer learning of the pre-trained MobileNet with an SSD object detector to localize road damage. Their model was evaluated using a large-scale dataset of images captured by a smartphone camera installed in a car. The dataset was collected from different countries, namely India, Japan, and the Czech Republic. The conducted experiment showed low performance in terms of the F1-score measure. A lightweight auto-encoder model was given by Shim et al. [11]. The key idea of their model is that it classifies the input image at the pixel level as normal or abnormal (damage). Their model was evaluated with 1700 images, where 400 images were used for testing and the rest were used for model training. Results showed that the F1-score reached 79.33%. A hierarchical semantic segmentation scheme for road damage inspection was studied by Wei et al. [12]. The proposed hierarchical system used a deep U-Net model that consists of four cascaded steps. These steps are road mark detection, tracking, region segmentation, and road damage identification. Their proposed hierarchical model was evaluated on images captured from a 15 km segment of highway road in Beijing. The conducted analysis indicated that an average IoU of 0.945 had been achieved. Ali et al. devised a real-time deep Faster R-CNN model for bridge steel crack detection [13]. The proposed Faster R-CNN was able to achieve a 93.31% precision rate. A novel two-stage deep learning model for simultaneous road crack detection and segmentation was given by Nguyen et al. [14]. Basically, the developed two-stage model adopted a CNN for performing the detection and the segmentation at the pixel level. In particular, the first stage is responsible for locating and detecting damage from a square input image of size 96 x 96. If the scanned square is recognized as damage, it is passed to the second stage for identifying damaged pixels. Reported results on various benchmark datasets showed that the two-stage deep model achieved an F1-score of 91%. Further recent work was studied in [15], where a pre-trained deep CNN model was employed. The pre-trained CNN was used to predict both road-side weather and surface conditions. Particularly, their model was trained to distinguish three different weather conditions, namely clear, light snow, and heavy snow, as well as three different surface categories, namely dry, snowy, and wet. During their experiment, several pre-trained CNN models were investigated, and the results confirm the superiority of the ResNet model over the other models. However, the ResNet model is considered a complex network, and it has a large number of filters that consume more computational time. The idea of performing bridge damage assessment using a deep autoencoder model was studied by Sarwar et al. [16]. They adopted an LSTM with a deep autoencoder to recognize damage information encoded in a 1D input signal. Road pothole detection based on different deep learning models, including LSTM and CNN, was presented by Varona et al. [17]. Results showed that the CNN was able to distinguish potholes from non-potholes with an accuracy of 98%. An additional recent deep learning-based scheme was conducted by Li et al. [18] for the problem of automatic defect detection of metro tunnel surfaces. Specifically, the Faster R-CNN model was employed to localize three different types of defects, namely crack, falling block, and leakage. Faster R-CNN demonstrated the best performance compared with other deep models such as YOLOv3, SSD, and R-FCN. Similarly, deep Faster R-CNN has been applied for concrete crack detection [19]. The authors investigated the stability of the deep models against changes in illumination level as well as weather conditions. Results showed a strong effect on the crack detection accuracy when the darkness increased. A CNN for pavement distress detection was studied by Zhang et al. [20]. They adopted VGG-19 with different architectures. Their model was tested on classifying input images of size 150 x 150 pixels into six classes, namely clean, patch, pothole, linear crack, network crack, and pavement marking. The best results were achieved by a small VGG-19 with a recognition accuracy of 83.8%. A lightweight deep model that adopts an autoencoder network was utilized for performing semantic segmentation of road damage [21]. The key idea of their approach is to reduce model complexity by eliminating the decoding stage of the autoencoder network. Besides the good accuracy achieved by the proposed lightweight autoencoder network, their model reduced the computational time by 12

An IoT-based sensor for capturing images of the road was given in [22]. The captured images are passed to a deep CNN model for performing pixel-level image segmentation. In their approach, the CNN was trained using a bio-inspired optimizer, and results showed 99% per-pixel segmentation accuracy. A pyramid-based deep architecture has been devised by Wang et al. [23] for the problem of road pavement damage. The proposed pyramid model was evaluated with a dataset of 500 images, and results indicated a segmentation accuracy of 0.6235 in terms of the IoU measure. The concept of transfer learning of pre-trained deep models for damage detection was illustrated by Feng et al. [24]. Specifically, Inception-v3 was transferred and fine-tuned to recognize five different types of road cracks. The results indicated that the transferred deep model significantly outperformed traditional machine learning models.

III. EVOLVING PRE-TRAINED CNN USING TWO-LAYERS OPTIMIZER
The architecture of the evolved pre-trained model is given in Fig. 3. As can be seen, the proposed architecture contains two main phases, namely the training phase and the testing phase. In the training phase, an efficient two-layers optimizer is used to perform VGG-19 filter selection guided by the performance of a linear SVM classifier. In the testing phase, the final optimized VGG-19 with the linear SVM is applied to the testing set to measure model performance. The main components of the proposed approach are explained as follows.

A. VGG-19 Network
VGG-19 is a pre-trained network that was developed by a research team from Oxford University [25]. VGG-19 was trained with the ImageNet challenge dataset that contains millions of images. It was trained to classify a total of one thousand different objects such as cars, balloons, strawberries, etc. The internal architecture of VGG-19 is shown in Fig. 5 (a). Basically, it consists of 25 cascaded layers. The input layer was designed to receive an input image of size 224 × 224 pixels. Then, the input image is passed to two cascaded convolutional layers, each with 64 filters of size 3 x 3 pixels. After each convolutional operation, a ReLU activation function is applied to introduce nonlinearity into VGG-19, and it is defined as follows.
$$\mathrm{relu}(x) = \max(0, x) \tag{1}$$
where x is the pixel value. After that, the output of the relu(x) function is passed to the max-pooling layer as given in Fig. 5(a). The main idea of the max-pooling operation is to reduce the size of the feature map produced by relu(x). In VGG-19, the convolutional, ReLU, and max-pooling operations are repeated several times until they reach the flattening layer, which produces a 1D vector, as shown in Fig. 4. Basically, the flattening layer concatenates the output of the max-pooling operation into the form of a 1D vector. It should be noted that when going deeper into VGG-19, the number of convolutional filters increases, which increases the complexity of the model. Therefore, this research aims to eliminate the filters in the last layers of VGG-19 that contribute least.
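To make the ReLU, max-pooling, and flattening steps concrete, the following is a minimal pure-Python sketch on a toy 4 x 4 feature map. It is only an illustration: the input values, the 2 x 2 pooling window, and the helper names (`relu`, `max_pool_2x2`, `flatten`) are assumptions chosen for this example and are not taken from the actual VGG-19 implementation used in this work.

```python
def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the maximum of each 2x2 block.

    Assumes an even height and width (a simplification for this sketch).
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def flatten(feature_map):
    """Concatenate the rows of a 2D feature map into a 1D vector."""
    return [value for row in feature_map for value in row]


# Toy output of a single convolutional filter (values are made up).
conv_output = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -0.5, -2.0, 1.0],
    [0.0, -1.0, 3.0, 2.5],
    [-2.0, 1.5, -0.5, -4.0],
]

activated = [[relu(v) for v in row] for row in conv_output]
pooled = max_pool_2x2(activated)
features = flatten(pooled)
print(features)
```

In the real network the same three operations are applied per filter and per layer, so removing a filter removes its whole feature map from the flattened vector.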

B. Two-Layers Optimizer
The proposed two-layers optimizer was presented in our previous research as a conference paper [4]. Basically, it consists of two layers, namely global search and local search, as given in Fig. 6. It automatically selects between local and global search based on the action generated by the Q-learning algorithm, as explained in [4]. The main stages of the two-layers optimizer are explained as follows.

1) INITIALIZATION
This stage is responsible for the initialization of the micro population (3 particles) of the two-layers optimizer. Each particle X is given a random initial location according to the search space of the problem at hand. In addition, a velocity value V is randomly initialized for each particle, and its value belongs to the same range as X.
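The initialization stage can be sketched as follows. This is a minimal illustration under stated assumptions: the search range [0, 1], the dictionary layout of a particle, and the function name `init_micro_swarm` are hypothetical choices for this example; only the population size of three particles and the shared range of X and V come from the description above.

```python
import random


def init_micro_swarm(n_dims, lower=0.0, upper=1.0, n_particles=3, rng=random):
    """Create a micro swarm with random positions and velocities.

    Positions X and velocities V are drawn from the same range,
    and each particle starts with its own position as personal best.
    """
    swarm = []
    for _ in range(n_particles):
        x = [rng.uniform(lower, upper) for _ in range(n_dims)]
        v = [rng.uniform(lower, upper) for _ in range(n_dims)]
        swarm.append({"x": x, "v": v, "pbest": list(x)})
    return swarm


# One dimension per candidate filter in the last eight layers (512 x 8).
swarm = init_micro_swarm(n_dims=512 * 8, rng=random.Random(0))
print(len(swarm), len(swarm[0]["x"]))
```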
2) TRANSITION
In this stage, a Q-learning algorithm is embedded to control the switching from global search to local search and vice versa. As such, the Q-learning is modeled as a two-state diagram given in Fig. 7.

FIGURE 7. State diagram transition of Q-learning
As mentioned in [4], the Q-table is updated with a reward of +1 when the executed search operation improves the search performance; otherwise, a penalty of -1 is given.
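The two-state transition and the +1/-1 reward can be sketched with a small tabular Q-learning routine. This is a hedged illustration rather than the implementation from [4]: the learning rate `alpha`, the discount factor `gamma`, the epsilon-greedy action choice, and all function names are assumptions introduced for this example.

```python
import random

STATES = ("global", "local")
ACTIONS = ("global", "local")  # the action names the search layer to run next


def make_q_table():
    """Q-table with one row per state and one column per action."""
    return {s: {a: 0.0 for a in ACTIONS} for s in STATES}


def choose_action(q_table, state, epsilon=0.1, rng=random):
    """Epsilon-greedy choice (assumed policy): mostly pick the best-valued action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    row = q_table[state]
    return max(ACTIONS, key=lambda a: row[a])


def update_q(q_table, state, action, improved, alpha=0.1, gamma=0.9):
    """Standard Q-learning update with reward +1 on improvement, else -1."""
    reward = 1.0 if improved else -1.0
    next_state = action  # running a search layer moves the optimizer into that state
    best_next = max(q_table[next_state].values())
    q_table[state][action] += alpha * (
        reward + gamma * best_next - q_table[state][action]
    )
    return next_state


q = make_q_table()
state = update_q(q, "global", "local", improved=True)   # local search helped
state = update_q(q, state, "global", improved=False)    # global search did not
print(q)
```

With these assumed settings, an action that keeps improving the fitness accumulates a higher Q-value and is selected more often, which is the mechanism that lets the optimizer switch between exploration and exploitation.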

3) SEARCH EXECUTION
The micro population of the two-layers optimizer is updated and evolved based on the following equations.
$$V_i^{t+1} = w V_i^{t} + c_1 r_1 \left(P_i - X_i^{t}\right) + c_2 r_2 \left(G - X_i^{t}\right) \tag{2}$$
$$X_i^{t+1} = X_i^{t} + V_i^{t+1} \tag{3}$$
where $X_i^{t+1}$ is the new location of the currently executed particle $i$, $V_i$ is the particle velocity, and $w$ is the inertia. Parameters $c_1$ and $c_2$ are the cognitive and social acceleration coefficients, respectively. Variables $r_1$ and $r_2$ are random numbers in the range (0,1). $P_i$ is the local best position achieved, and $G$ is the global best position achieved by the micro swarm. Global and local search operations are identical; however, when the optimizer performs a local search, only randomly selected bins of the evolved particle are updated, while the other bins remain unchanged, as indicated in Fig. 8.
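The difference between the global and local search layers can be illustrated with one update step of a single particle. This sketch assumes the standard PSO velocity and position update; the coefficient values (`w`, `c1`, `c2`), the number of bins touched by the local search (`n_bins`), and the function name `pso_step` are hypothetical and not taken from [4].

```python
import random


def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5,
             local=False, n_bins=2, rng=random):
    """One PSO update of a single particle.

    Global search updates every bin; local search updates only a random
    subset of `n_bins` bins and leaves the remaining bins unchanged.
    """
    dims = range(len(x))
    active = set(rng.sample(dims, n_bins)) if local else set(dims)
    new_x, new_v = list(x), list(v)
    for d in active:
        r1, r2 = rng.random(), rng.random()
        new_v[d] = (w * v[d]
                    + c1 * r1 * (pbest[d] - x[d])
                    + c2 * r2 * (gbest[d] - x[d]))
        new_x[d] = x[d] + new_v[d]
    return new_x, new_v


rng = random.Random(1)
x = [0.0] * 6
v = [1.0] * 6
best = [0.0] * 6  # particle already sits on both best positions

global_x, global_v = pso_step(x, v, best, best, rng=rng)
local_x, local_v = pso_step(x, v, best, best, local=True, n_bins=2, rng=rng)
print(global_x)
print(local_x)
```

In the toy run, the global step moves all six bins, whereas the local step moves exactly two of them.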

4) STOP CONDITION
This stage is responsible for checking whether the maximum number of allocated iterations has been reached. If so, the execution of the two-layers optimizer is stopped, and the best position achieved by the micro population is returned.

C. Model Training Steps
As can be seen in Fig. 5(a), the last eight convolutional layers of VGG-19 contain 512 filters each. As such, the main goal of this research is to focus on the last eight layers of VGG-19 and eliminate the filters that contribute least. To encode VGG-19 filter selection as an optimization problem, this work proposes a binary encoding scheme. The length of this binary vector is 512 x 8, which results in a search vector of length 4096. Each bin of the encoding scheme is associated with a variable F that can take a value of zero or one. If the bin value is one, then the corresponding filter is selected, and it will be activated during the feature extraction process. Otherwise, it will be skipped. Referring to the steps of the model training phase shown in Fig. 3, after identifying the VGG-19 filters that will be activated during the feature extraction stage, all training images are fed to the optimized VGG-19, and each input image produces a 1D vector that represents the computed features. Then, a linear SVM classifier is trained on the computed 1D vectors for each class, i.e., normal and abnormal. In the next step (Evaluate Model), the fitness function is computed for each particle of the proposed two-layers optimizer according to the following formula.
where TP is the total number of images correctly classified as road damage, TN is the total number of images correctly classified as non-road damage (Normal). The complexity variable is the ratio of the total selected VGG-19 filters with respect to the total number of filters (i.e., 512 x 8 layers).
Therefore, the proposed two-layers optimizer performs iterative VGG-19 filter selection guided by the fitness function given in Equation (4). It should be noted that a higher weight (i.e., 0.9) is given to accuracy than to complexity due to the importance of accuracy. Once the two-layers optimizer reaches the maximum number of iterations, the best solution achieved is returned, and the optimization process is terminated. After that, the testing phase is executed based on the optimal VGG-19 filters, which are applied to the testing set.
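The binary encoding and the accuracy/complexity trade-off can be sketched as below. The exact form of Equation (4) is not reproduced in this text, so the weighted sum used here is an assumed form: accuracy weighted by 0.9, complexity weighted by the complementary 0.1, and the result negated so that lower values are better (in line with the negative fitness values reported later). The 0.5 threshold in `decode` and all function names are likewise hypothetical.

```python
import random

N_LAYERS = 8
FILTERS_PER_LAYER = 512
VECTOR_LENGTH = N_LAYERS * FILTERS_PER_LAYER  # 4096 bins


def decode(position, threshold=0.5):
    """Map a continuous particle position to a binary filter mask (assumed rule)."""
    return [1 if p > threshold else 0 for p in position]


def split_by_layer(mask):
    """Group the flat mask into one list of 512 bins per layer."""
    return [mask[i * FILTERS_PER_LAYER:(i + 1) * FILTERS_PER_LAYER]
            for i in range(N_LAYERS)]


def complexity(mask):
    """Ratio of selected filters to all 512 x 8 candidate filters."""
    return sum(mask) / len(mask)


def accuracy(tp, tn, total):
    """Fraction of images classified correctly."""
    return (tp + tn) / total


def fitness(tp, tn, total, mask, w_acc=0.9):
    """ASSUMED form of the fitness: reward accuracy, penalise complexity, minimise."""
    return -(w_acc * accuracy(tp, tn, total)
             - (1.0 - w_acc) * complexity(mask))


rng = random.Random(0)
position = [rng.random() for _ in range(VECTOR_LENGTH)]
mask = decode(position)
# Hypothetical confusion counts on a training set of 422 images.
print(sum(mask), round(fitness(tp=165, tn=250, total=422, mask=mask), 4))
```

Under this assumed form, two masks with equal accuracy are ranked by the number of active filters, which is what drives the optimizer toward lightweight models.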

IV. EXPERIMENTS

A. Dataset
In this study, a total of 529 images were captured by an S-Series S30W drone that has a wide-angle camera, shown in Fig. 1. These images cover both normal roads (316 images) and abnormal, i.e., damaged, roads (213 images). A number of sample images of different road damages are given in Fig. 9. For evaluation purposes, the collected images were divided randomly into 80% for training and 20% for testing, as indicated in Table II. The training set contains 252 normal images and 170 abnormal images, while the testing set has 64 normal images and 43 abnormal images.

B. Performance Measures
The standard evaluation measures were employed in this study to determine the efficiency of the proposed approach. These measures are precision, recall, and F1-score, and they are defined as follows.
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is the total number of images correctly classified as road damage, TN is the total number of images correctly classified as non-road damage (Normal), FP is the total number of normal images wrongly classified as road damage, and FN is the total number of road damage (Abnormal) images wrongly classified as normal. In addition, the Receiver Operating Characteristic (ROC) curve is computed to evaluate the performance of the proposed approach graphically.
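The measures above translate directly into a few lines of code. The sketch below uses the standard definitions of precision, recall, and F1-score; the confusion counts are hypothetical and serve only as an illustration (35 detected cases out of 43 abnormal test images gives a recall of about 81.4%), since the actual confusion matrices are not listed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts.

    Zero denominators yield 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 43 abnormal test images, 35 detected, 2 false alarms.
precision, recall, f1 = precision_recall_f1(tp=35, fp=2, fn=8)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```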

C. Model Evaluation
The proposed approach is evaluated on the collected images according to the data split given in Table II. In this experiment, the employed two-layers optimizer was executed ten times on the training data. Specifically, it is used to evolve the pre-trained VGG-19 guided by the fitness function defined in Equation (4), as explained in Section III. For each run, it was executed for 1000 iterations, and the final optimized VGG-19 was evaluated on the testing dataset, as shown in Fig. 3. Furthermore, the proposed approach was compared with the outcomes of the non-optimized VGG-19 with linear SVM, and the results are detailed in Table III. It is clearly seen that the proposed approach yields the best performance in all measures, i.e., accuracy, precision, recall, and F1-score. This is due to the benefit of the incorporated two-layers optimizer, which simultaneously reduces VGG-19 complexity and enhances the accuracy of the linear SVM on the training set (see Equation (4)). In other words, these improvements are related to the reduction in the complexity of VGG-19, which resulted in better generalization of the linear SVM. To assess the performance of the proposed approach graphically, the ROC plot of the proposed approach and VGG-19 with linear SVM is given in Fig. 10. From the curves, it can be seen that the proposed approach has the largest area under the curve. This confirms the usefulness of the embedded two-layers optimizer in improving model performance.

FIGURE 10. ROC curve analysis
Additional analysis was performed to examine model complexity in terms of the total number of VGG-19 filters. Table IV presents the total number of filters used in the proposed approach against the number of filters used in VGG-19. It can be seen that the non-optimized VGG-19 model has a total of 5504 convolutional filters, whereas the optimized VGG-19 has an average of 2650 filters. This reduction in VGG-19 filters is due to the effectiveness of the incorporated two-layers optimizer. Nevertheless, there are still some challenging road damage cases that the proposed approach failed to classify correctly. These cases are given in Fig. 11. They are abnormal, but they have been classified as normal. This misclassification occurs due to the presence of background objects such as cars and trees. In addition, the road damage is not clearly visible, which makes it difficult for VGG-19 to capture road texture features.
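The filter counts quoted above can be checked with a short calculation. The per-layer filter counts below follow the standard VGG-19 configuration (2 x 64, 2 x 128, 4 x 256, and 8 x 512 convolutional filters); the assumption that the first eight layers are left untouched follows from the selection being applied only to the last eight layers.

```python
# Convolutional filters per layer in the standard VGG-19 configuration.
filters_per_layer = [64] * 2 + [128] * 2 + [256] * 4 + [512] * 8

total_filters = sum(filters_per_layer)         # non-optimized model
fixed_filters = sum(filters_per_layer[:8])     # layers not subject to selection
candidate_filters = sum(filters_per_layer[8:])  # 512 x 8 search vector

optimized_total = 2650  # average reported for the optimized model
selected_in_last_layers = optimized_total - fixed_filters

reduction = 1 - optimized_total / total_filters
complexity = selected_in_last_layers / candidate_filters

print(total_filters, candidate_filters)
print(f"overall reduction: {reduction:.1%}")
print(f"share of last-layer filters kept: {complexity:.1%}")
```

The overall reduction of roughly 52% matches the figure quoted in the abstract, while the share of retained filters in the last eight layers is lower because the early layers are kept in full.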

FIGURE 11. Misclassified abnormal cases
Further analyses were conducted by measuring the effect of changing the brightness level of the input image. This analysis was applied to the test set without retraining the proposed model. Specifically, two types of illumination change were investigated, namely increasing the brightness level by a factor of 1.5 and decreasing it by a factor of 0.5. For illustration purposes, a sample image with its corresponding illumination variations is shown in Fig. 12.
The experimental analysis is given in Table V. It is clearly seen that the recall rate decreased to 81.4% due to the reduction of the brightness level, which makes the image darker, as shown in Fig. 12 (c). This reduction in the brightness level affects the texture and shape features that contribute to the detection of road damage. On the other hand, increasing the brightness level by 1.5 did not affect model accuracy. This is because the texture details of road damage become clearly visible and are easily encoded by the VGG-19 feature extractor.
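One simple way to produce such brightness variations is to scale every pixel intensity and clip the result to the valid range. The sketch below assumes 8-bit grayscale pixels and a multiplicative interpretation of the factors 1.5 and 0.5; the actual image-processing routine used in the experiment is not specified, so the function name `adjust_brightness` and its behavior are illustrative only.

```python
def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor` and clip to the 8-bit range [0, 255]."""
    return [
        [min(255, max(0, round(pixel * factor))) for pixel in row]
        for row in image
    ]


# Toy 2 x 3 grayscale patch (values are made up).
patch = [
    [100, 200, 40],
    [0, 128, 250],
]

brighter = adjust_brightness(patch, 1.5)
darker = adjust_brightness(patch, 0.5)
print(brighter)
print(darker)
```

Darkening compresses the intensity range (here 0 to 250 becomes 0 to 125), which reduces the contrast of fine texture and is consistent with the drop in recall discussed above.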

D. Compare with the standard VGG-19
This section compares the performance of the proposed approach against the standard VGG-19 with a softmax classifier. As such, VGG-19 was retrained with the same training set and evaluated using the same test set. The outcome of this experiment is given in Table VI. It is clearly seen that the proposed approach outperforms VGG-19 with a softmax classifier. This is due to the generalization ability of the linear SVM classifier. More importantly, an SVM can be trained with a small training dataset, whereas VGG-19 with a softmax classifier requires a large number of training instances to produce better recognition performance.

E. Computational time analysis
The computational time of the proposed optimized approach is compared with that of the non-optimized VGG-19 that works with all filters (5504 filters). This experiment was executed on a PC running MATLAB 2021a under Windows 10. The hardware specifications are 32 GB of memory and an i7-8700 CPU @ 3.2 GHz. The computational time analysis is given in Table VII. It is clearly shown that the proposed approach reduces the computational time to 57% of that of the non-optimized model, requiring only 0.066 seconds per image. On the other hand, the non-optimized VGG-19 takes longer, with 0.115 seconds per image.
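The relation between the two per-image timings and the quoted 57% can be verified with a short calculation. The throughput figures in images per second are derived values added here for illustration and are not reported in Table VII.

```python
optimized_s = 0.066  # seconds per image, optimized VGG-19
baseline_s = 0.115   # seconds per image, non-optimized VGG-19

ratio = optimized_s / baseline_s  # fraction of the baseline time
saving = 1 - ratio                # fraction of time saved

print(f"time relative to baseline: {ratio:.1%}")
print(f"time saved: {saving:.1%}")
print(f"throughput: {1 / optimized_s:.1f} vs {1 / baseline_s:.1f} images/s")
```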

F. Compare with other optimizers
This section aims to compare the outcomes of the proposed two-layers optimizer against other well-known related PSO-based variants such as PSO [26], RLMPSO [27], and CLPSO [28]. In addition, two recent optimization algorithms have been included to validate the performance of the proposed two-layers optimizer. Specifically, AOA [29] and WGA [30] were used in this analysis. The conducted experiment compares the performance of ten runs produced by each algorithm. Table VIII reports the mean fitness, total number of filters, and F1-score measure for each algorithm. It is shown in Table VIII that the proposed two-layers optimizer reports the best mean fitness value of -0.8520. This is due to the benefit of the dynamic transition from exploration to exploitation by the Q-learning algorithm. In addition, PSO, AOA, WGA, and  [27]. In terms of the total number of VGG-19 filters produced by each optimizer, it can be seen that the two-layers optimizer produced the lowest number of filters, with 2650. In addition, the reported mean F1-score values of all algorithms are somewhat similar, as indicated in Table VIII. This is because of the ability of a linear SVM classifier to give a good result even with many VGG-19 filters. It is worth mentioning that, using all VGG-19 filters, the linear SVM classifier achieved an F1-score of 95.4%, as indicated in Table III.

G. Statistical test analysis
To measure the superiority of the proposed approach over other methods, this section compares the outcomes statistically. Specifically, the Wilcoxon rank-sum test [31] is used in this analysis. The null hypothesis H0 of the Wilcoxon test assumes that the outcomes of all algorithms are drawn from the same distribution. The significance level was set to 0.05, which means that the alternative hypothesis H1 would be accepted when the p-value was less than 0.05 (95% confidence level). The results of the Wilcoxon rank-sum test are given in Tables IX and X. It can be clearly seen that the proposed approach significantly outperforms both VGG-19 with softmax classifier and VGG-19 with linear SVM (non-optimized), as given in Table IX.
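For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation (without tie or continuity correction). The two samples of ten fitness values are made-up numbers for illustration, not results from this study, and the helper names are hypothetical.

```python
import math


def rank_with_ties(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns the z statistic and the p-value; no tie correction is applied.
    """
    n1, n2 = len(a), len(b)
    ranks = rank_with_ties(list(a) + list(b))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical fitness values from ten runs of two optimizers (lower is better).
proposed = [-0.852, -0.851, -0.853, -0.850, -0.854,
            -0.849, -0.855, -0.848, -0.856, -0.847]
baseline = [-0.801, -0.805, -0.799, -0.810, -0.803,
            -0.807, -0.798, -0.809, -0.802, -0.806]

z, p_value = rank_sum_test(proposed, baseline)
print(f"z = {z:.3f}, p = {p_value:.5f}, reject H0: {p_value < 0.05}")
```

With ten runs per algorithm, complete separation of the two samples gives a p-value far below 0.05, while identical samples give a p-value of 1.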

H. Model Evaluation using public dataset
To validate the performance of the proposed approach, an additional benchmark dataset has been used in this section. Unfortunately, there is no public dataset for road damage detection from drone images. Therefore, the most closely related task, concrete crack detection on the public SDNET2018 dataset [32], is used instead. SDNET2018 contains more than 56k cracked and non-cracked images. Some sample images from SDNET2018 are shown in Fig. 13.
FIGURE 13. Sample of cracked images from SDNET2018 dataset [32]
In this experiment, the SDNET2018 dataset has been divided into 70% for training and 30% for testing, as in [33].
The outcomes of the proposed approach and other results reported in the literature are given in Table XI. Further analysis has been conducted by comparing the outcomes of the two-layers optimizer with other optimizers on the SDNET2018 dataset. The results of this analysis are given in Table XII, and it can be clearly seen that all optimizers achieved almost the same test accuracy. One possible reason is related to the difficulty of this problem, as none of the optimizers was able to improve test accuracy. Nevertheless, the proposed approach produced the best fitness value and the minimum number of VGG-19 filters, with an average of 2917 filters.

V. CONCLUSION, LIMITATION AND FUTURE DIRECTIONS
This study proposed a novel pre-trained CNN model that has been evolved with a two-layers optimizer. The presented approach was employed to classify drone-captured road images as normal or abnormal (damaged). The proposed approach has been evaluated with a total of 529 drone-captured road images. Reported results indicated that the proposed approach is effective in road damage detection, and it achieved an F1-score of 96.4%. In terms of model complexity, the proposed approach reduced the number of VGG-19 filters by up to 52%. Finally, the comparative experiment on the employed two-layers optimizer confirms its ability to outperform other well-known optimizers such as AOA, WGA, RLMPSO, PSO, and CLPSO in VGG-19 filter selection. Nevertheless, the proposed approach suffers from several limitations, such as its sensitivity to changes in image illumination. This is due to the lack of sufficient training images. In addition, localizing road damage inside the recognized images was not investigated in this study.
As future work, it is suggested that the proposed model could be evaluated using a large dataset with heterogeneous road damage types such as crack, falling block, and leakage. Moreover, the proposed approach could be applied for several computer vision problems such as COVID-19 classification and plant diseases recognition.