Cognitive-Based Crack Detection for Road Maintenance: An Integrated System in Cyber-Physical-Social Systems

Effective road maintenance can not only balance limited resources against the long-term, high-efficiency performance of roads but also reduce the loss of life and property that road damage causes to vehicles and pedestrians. Because they lack a multidimensional dynamic monitoring system and sufficient data on rare, extreme cases, existing road maintenance systems cannot accurately assess road surface conditions or provide timely early warning of sudden road damage. In this article, we propose the M-RM system, a metaverse-enabled road maintenance system based on cyber–physical–social systems (CPSSs), which fully utilizes the social and artificial system information of CPSS, as well as the simulation, monitoring, diagnosis, and prediction functions of road systems in the virtual world of the metaverse. Then, for road damage detection in the system model of the virtual world, and for the virtual data that are the core assets of the metaverse, we propose an adaptive and information-preserving data augmentation (AIDA) algorithm based on nonclassical receptive field suppression and enhancement, an algorithm developed from human visual cognition. This algorithm generates a large amount of scarce, high-fidelity data and prevents the introduced noise from impairing performance on nonaugmented data. Finally, a crack detection algorithm named pay attention twice (PAT) is proposed, which is trained on the generated virtual data and achieves secondary attention to high-frequency targets by fusing frequency-division convolution with a mixed-domain attention mechanism, enhancing the detection of small targets in uncertain environments. The metaverse system built in this research can be used not only for road maintenance but also to empower the traffic metaverse through the traffic flow prediction module embedded in the algorithm. Experimental results demonstrate that the proposed algorithm can be applied to the road damage detection task under different noise and weather conditions and that it outperforms other state-of-the-art algorithms.

information for the maintenance and management of the infrastructure. Although the development of BIM, GIS, VR, and AR can realize the intelligent maintenance of roads to a certain extent, the urban road system includes many interrelated elements, such as transportation objects, transportation tools, infrastructure, management facilities, and organizational management systems. It is inextricably linked with the outside world and is a complex socio-technical system [20]. If only part of the information is emphasized or applied while the rest is ignored, the whole system becomes inefficient. An important reason is that existing work pays attention to the construction of technical systems but fails to fully consider the understanding and integration of social subsystems. This may lead to many problems, such as the lack of professional integration between traditional traffic engineering technology and advanced information and communication technology (a mechanism problem), insufficient improvement of traffic infrastructure protection for traffic participants (a social awareness problem), and a service concept that overemphasizes the road itself while ignoring the maintenance process concerning the causes of emergencies (a cognitive problem) [21], [22].
Connecting through digital spaces has become an essential part of everyday life, and the concept of a cyber-physical-social system (CPSS)-based metaverse creates environments where people can meet and socialize in virtual spaces. In the future, culture, economy, marketing, entertainment, and education will inevitably be integrated into the metaverse, bringing people into a virtual but real world [23]. To this end, we built a metaverse road maintenance system based on CPSS. It realizes the virtual digital mapping of road maintenance by transmitting road data collected in real time to the road maintenance model system of the virtual society. Real data are combined with virtual data under the guidance of social information [16], [24], and the system model uses them to carry out the analysis of road damage and traffic flow, process deduction, and emergency road damage drills. Finally, the system realizes the establishment of the traffic space-time map, the management of the whole life cycle of the road, the evaluation of road health, the formulation of the road maintenance strategy, and maintenance and damage early warning. The system can also be used for connected-automated vehicles [25], [26] and intelligent transportation systems [27], [28], [29]. As the cornerstone of the metaverse, data are the bridge connecting the real and virtual worlds [30]. Based on real-time and historical real data, and under the guidance of social information, the virtual world of the metaverse-enabled road maintenance (M-RM) system we built can generate virtual data that cover comprehensive road and traffic data in different scenarios and conditions. Traditional data generation methods introduce noise that degrades performance on nonaugmented data during inference [31]. In the virtual world of the metaverse road maintenance system, facing the task of road crack detection, we propose an adaptive and information-preserving data augmentation (AIDA) algorithm.
By enhancing the fidelity of the image, more faithful virtual data are generated, thus maintaining the performance of the algorithm under different weather and lighting conditions. A large amount of virtual data improves the performance of the system model. In the road performance evaluation module, the detection of small damages plays a pivotal role in the management of the whole life cycle of the road [32]. The system can even perceive sudden road damage in advance through subtle changes in the road surface, so as to give timely warnings. Existing studies focus only on obvious road damage (e.g., coarse cracks, road depressions, potholes, pumping, swelling, and settlement) and ignore the detection of small damages, especially small cracks. We propose a crack detection algorithm called pay attention twice (PAT), which uses scale transformation and octave convolution to let the network process high- and low-frequency components separately and to handle the low-frequency components at reduced cost, which saves computation. In addition, a mixed-domain attention mechanism further enhances the detection targets. The main contributions of this article are summarized as follows.
1) Novel Framework M-RM: A CPSS-based metaverse road maintenance system is proposed, which utilizes the social and artificial system information of CPSS to realize the full life cycle management of roads and timely warning of road damage.
2) We propose an AIDA algorithm based on nonclassical receptive field suppression and enhancement. It can process a large amount of data from unpredictable scenarios and unexpected complications caused by the surrounding environment, and it avoids degrading performance on nonaugmented data through augmentation.
3) We propose a crack detection algorithm named PAT, aimed at cracks that serve as an early warning of sudden road damage. The algorithm focuses on crack objects twice, especially small crack objects in complex backgrounds, by separately processing high-frequency and low-frequency components and fusing a mixed-domain attention module.
4) We conduct experiments on three public datasets (Cracktree200, CFD, and AigleRN) and on the SHADOW-CRACK dataset, which contains a lot of shadow noise; the experiments demonstrate the effectiveness of our algorithm.

A. Nonclassical Receptive Field Perception Mechanism
As the most important perceptual system architecture for people, the visual cognitive system can process 89%-90% of the external information, which is one of the cores of current brain research [33]. The middle visual receptive field of its information processing mechanism is divided into classical receptive field and nonclassical receptive field [34]. As shown in Fig. 1, the central area of the classical receptive field can filter out high-frequency noise in the image information, and the peripheral function is to remove low spatial frequency components to highlight edge features of the image. However, there is a large area outside the classical receptive field. Stimulating this area alone cannot directly cause the cell's response, but to a certain extent affects the response generated by the stimulation in the cell's classical receptive field. This part is the nonclassical receptive field, which has the functions of isotropic inhibition, anisotropic disinhibition, and facilitation to external stimuli. When gratings and textures are presented in the environment, there is often significant inhibition, and lateral inhibition reduces neuronal responses of homogenous components, so places where homogeneity changes (such as region boundaries) have relatively higher salience [35].
However, when the stimuli in the nonclassical receptive field and those in the classical receptive field line up to form a smooth contour, the neuron's response to the central stimulus is enhanced, and the response is even stronger when multiple line segments in the surround form a smooth curve together with the central line segment [36], [37].
As shown in Fig. 2(a), our vision can easily identify a structure composed of multiple edge segments with good spatial consistency against a background of randomly scattered edge segments [38], [39]. In Fig. 2(b), our first reaction is to focus on the circular pattern in the center, even though the circular pattern accounts for far less of the picture than the vertical line segments. This phenomenon is due to the inhibitory effect of nonclassical receptive fields on a large number of similar visual signals, which weakens the stimulation of visual cells by such image information. Therefore, the vertical line segments, despite their number, are not the first thing we notice. The line segments in the center and those in the surrounding area are orthogonal or close to orthogonal, so the circular pattern stands out through the anisotropic disinhibition of the nonclassical receptive field.

B. Road Crack Detection Under Complex Background
In the metaverse traffic maintenance based on CPSS, road cracks, as one of the most common damages, will reduce the safety of road operation and directly endanger road traffic and driving safety [40], [41]. With the rapid development of air-space-ground integrated multidirectional detection and artificial intelligence, it is imperative to improve the detection and recognition accuracy [42], [43], [44]. Many road crack detection algorithms have been proposed, including edge detection [45], [46], [47], machine learning [48], deep learning [49], [50], graph-based [51], [52], etc. In these algorithms, the main environmental factors that interfere with crack detection include varying intensities of ambient lighting, shadows from trees, signs, railings, etc. Gu et al. [53] first used a median filter to remove shadows, then combined a multiscale line filter with Hessian matrix to enhance cracks, and then used the probability relationship to roughly detect cracks. This algorithm can easily judge crack information as shadows, reducing the performance of shadow removal. Ai et al. [54] proposed to use multiscale neighborhood and pixel intensity information to deal with poor lighting conditions and shadows under certain conditions. However, this method is only effective when the uniformity of the image brightness is good. The image illumination normalization method based on Gaussian blur value proposed by Qu et al. [55] can effectively eliminate the influence of illumination. However, this method can only improve the uniformity of image brightness, and cannot completely eliminate the influence of illumination. Wang et al. [56] proposed to first use the GRS algorithm and Gaussian filter to remove shadows and noise, and then use LOF-based local guided filtering algorithm to enhance crack information and suppress image noise. Finally, the edge of the crack is extracted by the improved level set algorithm. 
This algorithm has some accuracy advantages, but it needs to set too many parameters, which affects the generalization of the model. In addition, the existing algorithms do not consider the detection of small crack targets. The detection of small cracks can more accurately predict the performance of the road, and even carry out early warning of road damage. Based on the mechanism of visual cognition, we propose an AIDA algorithm for nonclassical receptive field suppression and enhancement. In addition, we propose a novel PAT algorithm for small crack targets, which separately handles high and low frequency components, and a mixed-domain attention mechanism is added to enhance the attention to small crack objects.

III. SYSTEM DESIGN AND METHOD
In this part, we introduce the metaverse road maintenance system based on CPSS and the AIDA algorithm based on nonclassical receptive field suppression and enhancement and the PAT crack detection algorithm for detecting small targets in the artificial world.

A. M-RM System Based on CPSS
The proposed M-RM system framework is given in Fig. 3. It contains the physical world and the artificial world. Compared with existing systems, the proposed CPSS-based road maintenance system can access real information, such as roads, vehicles, environment, weather, and even human beings, through the metaverse port to build a real-like space-time framework for road maintenance. Combined with data security and privacy protection strategies, this information makes the road maintenance system closer to reality and more accurate. The virtual data generated from real data and social information contain a large number of unknown and scarce samples, which can simulate and preview what will happen in the future, so as to realize the evaluation and prediction of road performance. The system is described in detail as follows.
1) Stereoscopic Monitoring System in Physical World: The stereoscopic monitoring system in the physical world collects data from satellites, UAVs, and ground equipment [57], [58], which are used to evaluate road performance, analyze traffic flow, and so on. In road performance evaluation, satellite remote sensing conducts time-series-based road settlement monitoring and road slope deformation analysis through a large-scale census. UAVs are used for monitoring in key areas; they identify key hidden dangers and support auxiliary analysis through refined road inspections. The bottom layer is the ground monitoring part, which uses roadside infrastructure and data collection equipment on mobile platforms (such as taxis, buses, garbage trucks, and driverless inspection vehicles) to conduct real-time inspection and early warning of road conditions. In traffic flow analysis, the air-space-ground sensors can calculate the equivalent axle load and the traffic volume from the obtained vehicle attribute information, vehicle motion information, vehicle distribution information, traffic volume information, and environmental precipitation information.
2) Data and Model in Artificial World: The artificial world is composed of data, social information, and system models. The data include real-time data and historical real data transmitted from the physical world, as well as virtual data generated based on real data and social information [59], [60]. The real data comprise basic road data, remote sensing information, road network management data, hydrological data, traffic flow data, vehicle data, and so on. Social information, an important component of the CPSS-based M-RM system, integrates the attributes of society and human cognition, including regional development information, policy information, expert experience, and human cognition. On the one hand, regional development and policy information directly affect the calculation of the number of equivalent axle loads and thus indirectly affect the calculation of traffic flow and the road damage assessment in the system model. On the other hand, this information can guide the type and quantity of virtual data to generate.
Expert experience and human cognition can not only guide the construction of virtual data generation models and system models but also can be used as prior knowledge to judge the results of the models, thereby improving the performance of the models.
Virtual data are the core asset of the metaverse, and their virtual scenes are generated based on real data and social information. On the basis of high-precision maps, information technologies such as perception and positioning are used to build the simulation of road and traffic scenes and the running environment of vehicle dynamics software. Real information, such as roads, vehicles, environment, weather, social information, and people, is accessed through the metaverse port to build a real-like spatiotemporal framework of road traffic and thus participate deeply in road maintenance. In addition to real scene modeling, the generation of virtual data also includes data augmentation for special scenes or insufficient samples. Data augmentation based on human visual cognition can not only make up for the lack of real data but also keep the noise introduced during augmentation from affecting the system model.
The model in the M-RM system mainly includes two submodules: road performance evaluation and traffic flow analysis [61], [62]. As a direct guide for road maintenance, road performance evaluation can make maintenance work more proactive, reduce life-cycle maintenance costs, and prolong the service life of roads. In addition, timely warning of road damage, especially the detection and warning of dangerous roads, is essential for traffic safety. Road performance evaluation mainly includes three parts: 1) automatic detection of road damage; 2) detection of road structure performance; and 3) material performance detection. The detection of road damage is mainly aimed at the divergent type (cracks), fatigue type (road depression), water damage type (pothole, pumping), mixture deformation type (rutting, subsidence, crowding, etc.), roadbed structure type (subsidence and bridge jumping), and other damages. In the detection model, the input of a large amount of virtual data can not only solve the problem of low detection performance under noise and in different environments and weather caused by insufficient data in the physical world but also improve the efficiency and accuracy of damage detection and enable accurate detection of minor damages.
Traffic flow analysis makes it possible to arrange reasonable working hours for road maintenance according to the traffic flow characteristics of the construction section, reduce the impact of maintenance construction on passing vehicles, and, most importantly, reduce the safety risks during maintenance and construction operations. It mainly includes two models: section equivalent axle load and traffic volume prediction. The number of equivalent axle loads of the section within the design life can be calculated from the traffic flow and can be used to calculate the fatigue cracking life of the road stabilization layer, the permanent deformation of the material, the low-temperature cracking index of the road layer, the thickness of the antifreeze layer, the deflection value of the top surface of the subgrade, and so on. Traffic volume forecasting can be achieved through vehicle type detection and density detection [63], [64], [65], [66]. Factors such as historical traffic volume, the annual growth rate of traffic volume, changes in the road network, regional economic development, and policies are used for traffic volume forecasting.
The establishment of a system model based on spatiotemporal dimension information, virtual data information, and social information can ultimately realize unified management and intelligent scheduling of data resources.
In terms of road health, it can conduct health assessment, full life cycle management, performance prediction, and maintenance guidance for roads. When the damage is serious or abnormal, the system can issue maintenance warnings and road damage warnings.

B. Adaptive and Information-Preserving Data Augmentation Algorithm Based on Nonclassical Receptive Field of CPSS
In the identification task of road cracks, there are also cases where people usually pay attention to the cracks first, rather than the normal road, as shown in Fig. 4. Interestingly, even if the cracks are surrounded by noise such as shadows, as shown in Fig. 5, we will give priority to the cracks.
Based on this idea, we propose an adaptive and information-preserving data augmentation method based on local entropy noise enhancement of nonclassical receptive fields. The detailed process is shown in Fig. 6.
To realize the isotropic suppression and anisotropic disinhibition mechanisms of the nonclassical receptive field at the algorithm level, we convert the image into a grayscale image, normalize it, and calculate the average grayscale value of the entire grayscale image

$$a = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} A(i, j)$$

where $A(i, j)$ represents the gray value of the ith row and jth column in the matrix formed by the grayscale image, $H$ and $W$ represent the grayscale image size, and $a$ represents the average grayscale of the grayscale image. We found that the gray values of the shadow part and the crack part are low, so the shadow part is segmented by the average gray level, and a binary grayscale matrix is generated

$$M_g(i, j) = \begin{cases} 1, & A(i, j) > a \\ 0, & \text{otherwise} \end{cases}$$

where $M_g(i, j)$ represents the value of the ith row and jth column in the binary grayscale matrix. If the gray value in the original grayscale matrix is greater than the average grayscale, $M_g(i, j)$ is 1; otherwise it is 0. The part where $M_g(i, j) = 0$ represents the segmented shadow part.

The entropy value describes the degree of chaos in a system: the larger the entropy, the more chaotic and unstable the system, and vice versa. The visual system adopts isotropic suppression for the parts with lower information entropy and anisotropic disinhibition for the parts with higher information entropy to improve our perception of edges. In the crack detection task, the parts with lower information entropy are normal ground or large shadows, while the parts with higher information entropy are the edges of cracks and shadows. In the same way, within the shadow part, we apply isotropic suppression. So, we first use the information entropy formula to calculate the local entropy value around each pixel to get the entropy value matrix. The specific calculation formula is as follows:

$$M_s(i, j) = -\sum_{k} p_k(i, j) \log p_k(i, j)$$

where $M_s(i, j)$ represents the entropy value of the ith row and jth column in the grayscale image, and $p_k(i, j)$ is the proportion of pixels with gray level $k$ in the local neighborhood of pixel $(i, j)$. To allow the edge elements to be calculated normally, we pad the grayscale image matrix with zeros, and at the end, $M_s(i, j)$ is linearly normalized. Similarly, we calculate the average entropy

$$S = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} M_s(i, j)$$

where $S$ represents the average entropy. A pixel whose entropy value is less than the average entropy has relatively stable surroundings and may belong to normal ground or to a crack, so it is then judged by the average grayscale. We observed that the road gray level is generally larger than the crack gray level, so if the pixel gray level is greater than the average gray level, we regard the pixel as ground, and if it is smaller, we regard it as crack. Because we want the contrast between road and crack to be as large as possible, we increase the gray value of the road part and decrease the gray value of the crack part. Pixels whose entropy value is greater than the average entropy are considered to be edges of shadows and other parts, and we increase their gray value so that the influence of shadow edges on crack identification is reduced. The specific procedure is shown in Algorithm 1. After local entropy enhancement, we get an enhanced image. To improve the contrast of road and cracks in the shadow part while keeping the information of the original image in the nonshadow part, and thus avoid introducing too much noise through data enhancement, we only use the enhanced pixel values in the shadow part to get the final result. The enhanced picture is shown in Fig. 7; the picture is from the Cracktree200 dataset, where Fig. 7(a) is the original image and Fig. 7(b) is the entropy visualization. The higher the brightness, the greater the entropy. We can see that the entropy of the road surface is lower, while the shadow and crack edges have higher entropy.
The image enhanced with local entropy noise is Fig. 7(c). All operations at this stage are the simplest matrix operations, and all operations basically do not affect each other. Therefore, the local entropy enhancement algorithm is very suitable for concurrent computing and can greatly improve the efficiency.
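The local entropy enhancement described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration of the steps as the text states them, not a reproduction of Algorithm 1: the function names `local_entropy` and `aida_enhance`, the window size `win`, the number of gray-level bins `bins`, and the step size `delta` are all assumptions, since the text does not give these values.

```python
import numpy as np


def local_entropy(gray, win=5, bins=16):
    """Normalized Shannon entropy of the gray levels in a win x win window.

    gray is a 2-D float array scaled to [0, 1]; win (odd) and bins are
    assumed values, not taken from the article.
    """
    h, w = gray.shape
    pad = win // 2
    # Quantize to a few gray levels so that window histograms are meaningful.
    levels = np.minimum((gray * bins).astype(int), bins - 1)
    # Zero padding so that border pixels also get a full window.
    levels = np.pad(levels, pad, mode="constant", constant_values=0)
    ent = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = levels[i:i + win, j:j + win].ravel()
            counts = np.bincount(window, minlength=bins)
            p = counts[counts > 0] / counts.sum()
            ent[i, j] = -np.sum(p * np.log2(p))
    # Linear normalization to [0, 1].
    span = ent.max() - ent.min()
    return (ent - ent.min()) / span if span > 0 else np.zeros_like(ent)


def aida_enhance(gray, delta=0.2, win=5, bins=16):
    """Entropy-guided contrast enhancement, applied only in the shadow mask."""
    gray = np.asarray(gray, dtype=float)
    a = gray.mean()                        # average grayscale a
    shadow = gray <= a                     # pixels where M_g(i, j) == 0
    ent = local_entropy(gray, win, bins)   # entropy matrix M_s
    s = ent.mean()                         # average entropy S
    stable = ent < s                       # low-entropy (homogeneous) pixels

    enhanced = gray.copy()
    enhanced[stable & (gray > a)] += delta   # treated as ground: brighten
    enhanced[stable & (gray <= a)] -= delta  # treated as crack: darken
    enhanced[~stable] += delta               # treated as edges: brighten
    enhanced = np.clip(enhanced, 0.0, 1.0)

    # Keep the original pixels outside the shadow mask.
    return np.where(shadow, enhanced, gray)
```

Calling `aida_enhance(img)` on a normalized grayscale image returns an image of the same shape in which pixels brighter than the global mean are left untouched. The per-pixel loop is written for readability; because each window is independent, the same computation parallelizes easily, in line with the remark on concurrent computing.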

C. PAT Crack Detection Algorithm
In road damage detection, the timely discovery of small cracks and other minor damages can more accurately evaluate and predict road performance, and even provide early warning of sudden road damage. For small cracks, the traditional convolution method cannot accurately extract the feature information of cracks due to their small area, and the color gradation information on the normal road is more inclined to random noise, resulting in poor traditional convolution effect.
In the system model of the M-RM virtual world, in order to quickly detect small defects such as small cracks, we propose the PAT crack detection algorithm, which focuses on crack targets twice, based on color level information and channel information, as shown in Fig. 8.
By converting the image to grayscale, we found that road cracks are often points with strong grayscale changes (high-frequency signals), and this information is exactly what we want the model to learn.
We choose Octave Convolution [67] as the backbone network because, unlike traditional convolution, it pays more attention to the high-frequency part of the image, that is, to information such as edges and contours. Such information helps the model learn image details, so key information can be obtained while saving computational overhead. In addition, UNet, which is based on an encoder-decoder architecture, adopts skip connections to alleviate the loss of details caused by downsampling operations [68]. The feature fusion operation in the skip connection uses concatenation. Compared with the addition operation, concatenation retains more features at the cost of increased computation, which slows the attenuation of information in the decoder. We add the attention module to the feature map after the skip connection in the decoder module. This puts the feature map obtained by downsampling in the encoder and the feature map obtained by upsampling in the decoder on the same level of importance. Attention is thus paid twice: to the high- and low-frequency information of the image through octave convolution, and to the concatenated skip-connection feature maps through convolutional attention.
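To make the frequency-division idea concrete, the following NumPy sketch shows one octave convolution step as defined in the Octave Convolution paper [67]: the high-frequency branch is kept at full resolution, the low-frequency branch at half resolution, and the two exchange information through pooling and upsampling. For brevity it assumes 1 × 1 (pointwise) kernels; the function names and weight shapes are illustrative and are not taken from the PAT implementation.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample2(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def conv1x1(x, weight):
    """Pointwise convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def octave_conv(x_high, x_low, w_hh, w_hl, w_lh, w_ll):
    """One octave convolution step with 1x1 kernels (a simplification).

    x_high: (C_h, H, W) high-frequency features at full resolution.
    x_low:  (C_l, H/2, W/2) low-frequency features at half resolution.
    w_ab maps branch a to branch b, e.g. w_hl maps high -> low.
    """
    # High output: high->high path plus upsampled low->high path.
    y_high = conv1x1(x_high, w_hh) + upsample2(conv1x1(x_low, w_lh))
    # Low output: low->low path plus pooled high->low path.
    y_low = conv1x1(x_low, w_ll) + conv1x1(avg_pool2(x_high), w_hl)
    return y_high, y_low
```

Because the low-frequency branch holds a quarter as many spatial positions as the high-frequency branch, every operation on it costs roughly a quarter of its full-resolution equivalent, which is where the computational saving comes from.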
In our algorithm, the road crack image first enters the encoder module, and after a series of encoder feature extraction steps, the model enters the decoding stage. In the decoder module, the feature map is first processed by octave transposed convolution each time, and its size is increased by upsampling. A skip connection is then made with the feature map obtained by downsampling in the corresponding encoder stage to obtain the feature map in the decoder. Before the octave convolution operation, however, we use the convolutional attention module to evaluate how important each channel and each spatial position of the concatenated feature map is for the final segmentation task. CBAM [49] is our chosen convolutional attention module; it quantifies the importance of each channel and of the spatial information to the final task result during image feature extraction. It avoids the problem of treating every channel as equally important, as traditional convolution does: the convolutional attention module assigns a different importance level to each channel. Likewise, attention is paid to the spatial information of feature maps.
In the channel dimension, we use average pooling and max pooling to extract two channel-dimension feature vectors from the feature map. Each feature vector is mapped through a shared-weight MLP, and the two mapped vectors are added and passed through a sigmoid to obtain a weight vector whose elements lie in the range [0, 1]. The weight vector is multiplied with the feature map to obtain the channel-weighted feature map. Likewise, two spatial-dimension feature matrices of the feature map are extracted using average pooling and max pooling in the spatial dimension. The two feature matrices are spliced according to the spatial dimension and passed through a convolution layer, and a sigmoid finally maps the result to the value range [0, 1]. The resulting matrix is multiplied with the feature map that has already been weighted in the channel dimension to obtain the feature map evaluated by convolutional attention. We continue the subsequent operations of the decoder with these feature maps, including octave convolution and octave transposed convolution. After the decoder stage is over, we obtain a binary image of the crack segmentation.
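The channel and spatial attention steps just described can be sketched as follows. This is an illustrative NumPy version of a CBAM-style block, not the authors' code: the MLP with one hidden ReLU layer, the weight shapes, the kernel size, and the choice to stack the two pooled maps as two input channels of the convolution are assumptions in the spirit of the CBAM design.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C_hidden, C); w2: (C, C_hidden)."""
    avg = feat.mean(axis=(1, 2))   # average pooling over space -> (C,)
    mx = feat.max(axis=(1, 2))     # max pooling over space -> (C,)

    def mlp(v):
        # Shared-weight MLP with a ReLU hidden layer (assumed structure).
        return w2 @ np.maximum(w1 @ v, 0.0)

    weights = sigmoid(mlp(avg) + mlp(mx))   # one weight in [0, 1] per channel
    return feat * weights[:, None, None]


def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, k, k) with odd k."""
    # Pool across channels, then stack the two maps as a 2-channel input.
    stacked = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    response = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            response[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return feat * sigmoid(response)[None, :, :]


def cbam(feat, w1, w2, kernel):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(feat, w1, w2), kernel)
```

With all weights set to zero, both sigmoids output 0.5 and the block scales every feature by 0.25, which is a convenient sanity check before plugging in learned weights.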

A. Dataset
We conducted experiments on three public datasets (Cracktree200, CFD, and AigleRN) and on the SHADOW-CRACK dataset, which contains a large amount of shadow noise. The details of the datasets are as follows.

1) Cracktree200 Dataset: Cracktree200 contains 206 images of 800 × 600 pixels. The dataset contains various types of cracks and poses challenges such as shadows, occlusions, noise, and low contrast. It is annotated with pixel-level labels. We split Cracktree200 into 146 images for training, 40 images for testing, and 20 images for validation, and all dataset splits were randomly sampled.
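A random split with fixed subset sizes, like the 146/40/20 split of Cracktree200, can be reproduced with a seeded shuffle. The sketch below is only an illustration: the function name `random_split`, the seed, and the image identifiers are hypothetical, and the article does not state how the sampling was implemented.

```python
import random


def random_split(items, n_train, n_test, n_val=0, seed=0):
    """Shuffle items with a fixed seed and cut them into three subsets."""
    if n_train + n_test + n_val != len(items):
        raise ValueError("split sizes must sum to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val


# Hypothetical image identifiers standing in for the 206 Cracktree200 images.
image_ids = [f"img_{k:03d}" for k in range(206)]
train_ids, test_ids, val_ids = random_split(image_ids, 146, 40, 20)
```

Fixing the seed keeps the split reproducible across runs, and the size check guards against subset sizes that do not add up to the dataset size.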
2) CFD Dataset: The CFD dataset is an annotated road crack dataset proposed by Shi et al. It consists of 118 images of 480 × 320 pixels, each with manually labeled crack contours. These images were taken under uneven lighting and contain noise, such as water, oil, and shadows, which makes crack detection more difficult. The images were collected with an iPhone 5 at a focal length of 4 mm, an aperture of f/2.4, and an exposure time of 1/135 s. In the experiments of this article, there are 72 training images and 46 test images. Because of the small amount of data, no validation set is used, and all dataset splits are randomly sampled.
3) AigleRN Dataset: The AigleRN dataset was taken on French sidewalks and contains 38 images with pixel-level annotations, with sizes of 311 × 462 and 991 × 462 pixels. In order to correctly calculate indicators such as AUROC, the F03b data, which did not contain crack annotations, were deleted. In the experiments of this article, the training set is 22 and the test set is 40.

4) SHADOW-CRACK Dataset: The SHADOW-CRACK dataset was captured on different roads in Changchun and Beijing, with a total of 210 images. Data were collected with iPhone XR handsets held 1 m to 1.2 m above the ground, and the images are 480 × 480 pixels. The ratio of linear cracks to network cracks is 3:1, which is consistent with actual road conditions. Objects that produce shadows include vehicles, pedestrians, and other traffic participants, as well as features such as trees, city buildings, and roadside amenities. In the experiments in this article, there are 116 training images, 84 test images, and 10 validation images, and all dataset splits are randomly sampled.

B. Evaluation Metrics
To evaluate the performance of the proposed model, we use ACC, SE, SP, and AUROC as evaluation metrics, where AUROC represents the area under the ROC curve.
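A minimal sketch of how pixel-level metrics such as ACC, SE, SP, and AUROC can be computed is given below. It assumes the standard confusion-matrix definitions, ACC = (TP + TN)/(TP + TN + FP + FN), SE = TP/(TP + FN), and SP = TN/(TN + FP), and computes AUROC as the probability that a randomly chosen crack pixel receives a higher score than a randomly chosen background pixel. The function names, the 0.5 threshold, and the toy data are illustrative assumptions.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equal-length sequences of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def acc_se_sp(pred, truth):
    """Accuracy, sensitivity, and specificity from binary predictions."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    acc = (tp + tn) / (tp + fp + fn + tn)
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return acc, se, sp


def auroc(scores, truth):
    """Pairwise AUROC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, t in zip(scores, truth) if t == 1]
    neg = [s for s, t in zip(scores, truth) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one crack and one background pixel")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: eight pixels with ground-truth labels and predicted crack scores.
truth = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8, 0.2, 0.05]
pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5

acc, se, sp = acc_se_sp(pred, truth)
auc = auroc(scores, truth)
print(acc, se, sp, auc)
```

The pairwise AUROC above is quadratic in the number of pixels and is meant only to make the definition concrete. It also shows why AUROC is undefined for an image without any crack pixels.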
Parameter Settings: The hyperparameters include a batch size of 4 and a learning rate of 0.001. The Adam optimizer is used for learning, with eps set to 1e-8 and a weight decay of 0.0005. When the loss function value has been stable for ten epochs, the current learning rate is calculated with the contraction rule η = 0.9η. Due to the small dataset size, we train for 800 epochs in each run to prevent the model from overfitting.
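The learning-rate contraction described above can be sketched as a small plateau-based scheduler. This is a sketch under stated assumptions: the class name `PlateauDecay` and the tolerance `tol` used to decide whether the loss is "stable" are hypothetical, since the text does not specify how stability is measured. In PyTorch, a comparable behavior is provided by `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.9`.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when the loss has not improved for `patience` epochs."""

    def __init__(self, lr=1e-3, factor=0.9, patience=10, tol=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.tol = tol                 # assumed threshold for counting an improvement
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, loss):
        """Update the state with this epoch's loss and return the current learning rate."""
        if loss < self.best - self.tol:
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor     # eta = 0.9 * eta
                self.stale_epochs = 0
        return self.lr


# Hyperparameters quoted in the text, collected in one place.
config = {
    "batch_size": 4,
    "lr": 1e-3,
    "optimizer": "Adam",
    "eps": 1e-8,
    "weight_decay": 5e-4,
    "epochs": 800,
}

sched = PlateauDecay(lr=config["lr"])
losses = [1.0, 0.8, 0.7] + [0.7] * 10      # the loss stalls for ten epochs
for loss in losses:
    lr = sched.step(loss)
print(lr)
```

After ten stalled epochs the learning rate drops from 0.001 to roughly 0.0009; a further ten stalled epochs would contract it again.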

1) Evaluation of Adaptive and Information-Preserving Data Augmentation Algorithm: As the core asset of the metaverse virtual world, data plays a key role in the construction of the system model and can directly affect the performance of the model. The AIDA algorithm, based on nonclassical receptive field suppression and enhancement, is built on the human visual cognition mechanism. It provides system models with a massive amount of data on unpredictable scenarios and unexpectedly complicated conditions caused by the surrounding environment, while maintaining data quality. To verify the performance of AIDA, we used the Octave UNet backbone network to conduct experiments on the Cracktree200, CFD, AigleRN, and SHADOW-CRACK datasets, where "No" means that AIDA was not used and "Yes" means that AIDA was used. The ACC, SE, SP, and AUROC results are illustrated in Table I and Fig. 9. It can be seen from the table that, on the four datasets and for most evaluation metrics, the model with the AIDA module added performs significantly better than the Octave UNet backbone network alone. When the evaluation metric is AUROC, the performance after data augmentation changes by +0.9% (CFD), +1.51% (AigleRN), and +2.27% (SHADOW-CRACK), respectively. When the evaluation metric is SE, the performance of the algorithm with the AIDA module is greatly improved, by more than 8.7% on average over the four datasets. The results reach 96.84% (Cracktree200), 83.66% (CFD), 98.99% (AigleRN), and 75.02% (SHADOW-CRACK), respectively, which surpass those of the model without data augmentation (93.14% on Cracktree200, 81.38% on CFD, 84.96% on AigleRN, and 59.21% on SHADOW-CRACK). On this evaluation metric, the detection algorithm performs better on all four datasets after the AIDA module is added. The visualization of the detection results is shown in Fig. 10, where green represents TPs, red represents FPs, blue represents FNs, and black represents TNs. To enhance the contrast effect, we show the test results on the SHADOW-CRACK dataset. As can be seen from the figure, the model with AIDA added locates the target more accurately and is more robust in noisy environments.
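The color coding used in this kind of visualization (green for TPs, red for FPs, blue for FNs, black for TNs) can be reproduced with a short sketch. The function name `error_map` and the nested-list image representation are illustrative assumptions; a real pipeline would more likely operate on image arrays.

```python
# RGB colors keyed by (prediction, ground truth), following the scheme in the text.
COLORS = {
    (1, 1): (0, 255, 0),   # true positive: green
    (1, 0): (255, 0, 0),   # false positive: red
    (0, 1): (0, 0, 255),   # false negative: blue
    (0, 0): (0, 0, 0),     # true negative: black
}


def error_map(pred, truth):
    """Return an RGB image (nested lists of tuples) color-coding each pixel's outcome."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("prediction and ground-truth masks must have the same shape")
    return [
        [COLORS[(p, t)] for p, t in zip(pred_row, truth_row)]
        for pred_row, truth_row in zip(pred, truth)
    ]


# Toy 2 x 3 binary masks (1 = crack pixel, 0 = background).
pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[1, 0, 0],
         [0, 1, 0]]
rgb = error_map(pred, truth)
print(rgb)
```

In the toy example, the first row contains one TP and one FP, and the second row contains one FN; all remaining pixels are TNs.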
2) Pay Attention Twice Crack Detection Algorithm Evaluation: In road maintenance, road damage detection is a key task. Road cracks, especially small cracks, are the early manifestations of various kinds of damage. If cracks can be detected and repaired in time, road safety can be ensured and the maintenance cost can be minimized. More importantly, the timely detection of small cracks can give early warning of sudden road damage and protect people's lives and property. To verify the performance of the proposed PAT crack detection algorithm, we compare it with the backbone model on the four datasets Cracktree200, CFD, AigleRN, and SHADOW-CRACK. The evaluation metrics are ACC, SE, SP, and AUROC. Table II and Fig. 11 demonstrate the performance advantage of our algorithm. When the evaluation metric is SE, our algorithm achieves significant improvements of +3.69% (Cracktree200), +10.43% (CFD), +13.75% (AigleRN), and +15.32% (SHADOW-CRACK), respectively, over the benchmark model. It can be seen intuitively from Fig. 11 that when the PAT algorithm is tested on the four datasets, all metrics are higher than those of the benchmark model. Fig. 12 shows the visualization results of the algorithm tested on three datasets. From left to right are the raw image, the ground truth, the result without PAT, and the result with PAT. The first row comes from the AigleRN dataset, which contains a large amount of road particle noise. PAT focuses only on crack information, while the benchmark model misdetects a large amount of road particle noise as cracks. The crack type in the second row is a network crack, which is a more complex type of road crack. The benchmark model ignores a large amount of crack information and detects only a small number of coarse and obvious cracks, whereas the PAT algorithm detects almost all cracks, even small ones. The last row is from the SHADOW-CRACK dataset and contains some shadows. The PAT algorithm detects the crack completely, while the benchmark algorithm detects only a very small part of it and ignores most of the information.
3) Performance Comparisons With Other Models: To demonstrate the effectiveness of the proposed algorithm, we compare it with three classical semantic segmentation models. FCN, proposed in [71], is an important model for applying deep learning to the semantic segmentation task. FCN replaces the fully connected layers of a traditional classification network with convolution operations, so that the model classifies each pixel and outputs a heat map, thereby achieving semantic segmentation. FCN uses upsampling to recover the image size, which compensates for the reduction in size caused by convolution and pooling. UNet was initially used for medical image segmentation tasks and won several first places in the ISBI cell tracking competition in 2015. Compared with the previous two versions, DeepLabV3 [70] adds a batch normalization layer, replaces the 3 × 3 atrous (dilated) convolution with dilation = 24 in ASPP with a standard 1 × 1 convolution, and adds global average pooling to capture global information. Table III shows the performance advantages of our approach, which achieves average improvements of +0.57% (CFD), +0.93% (SHADOW-CRACK), and +13.75% (AigleRN), respectively, compared with the baseline models when the evaluation metric is SE. Fig. 13 shows the visualization of the algorithm tested on three datasets. From left to right are AigleRN, CFD, and two columns of SHADOW-CRACK. From top to bottom are the original image, the ground truth, UNet, DeepLabV3, FCN, and ours. The first column is from the AigleRN dataset, which contains a large amount of road particle noise. PAT maintains its ability to detect small, inconspicuous cracks and has a lower error rate than the three baseline algorithms. The second column is from the CFD dataset. It is clear that PAT does not misclassify the road particle noise near the upper left corner of the middle, and it maintains excellent detection capability for the fine cracks in the middle that are not detected by UNet and DeepLabV3. The third and fourth columns are from the SHADOW-CRACK dataset, where the PAT algorithm detects all crack sections under the influence of shadows and large road particle noise, and minimizes the misclassification caused by them.

V. CONCLUSION
In this work, we explored the importance of road maintenance and the limitations of existing methods, such as the neglect of social attributes based on human cognition, the lack of multidimensional data, and the low accuracy of small object detection. To solve these problems, we proposed a novel M-RM system that makes full use of the advantages of the metaverse and CPSS. The M-RM system pays special attention to human cognition and systematic models, so as to achieve precise guidance for road maintenance and early warning of sudden road damage. In addition, in order to carry out full-life-cycle modeling of roads and all-round simulation of actual scenes, we proposed the AIDA algorithm based on nonclassical receptive field suppression and enhancement, an algorithm built on human visual cognition. The model not only processes a large amount of high-quality data but also avoids the damage to model performance caused by data augmentation. Finally, the PAT algorithm for detecting small damage targets on roads was developed. The experimental results demonstrated that the proposed algorithm can accurately detect small cracks with a shorter training time.
In future work, we will explore a lightweight system model based on human cognition, and test the proposed system model in practice to further verify its performance.