A Survey on Audio-Video based Defect Detection through Deep Learning in Railway Maintenance

Within Artificial Intelligence, Deep Learning (DL) represents a paradigm that has shown unprecedented performance in image and audio processing by supporting or even replacing humans in defect and anomaly detection. The railway sector is expected to benefit from DL, especially in predictive maintenance, where smart audio and video sensors can be leveraged yet kept distinct from safety-critical functions. Such separation is crucial, as it allows system dependability to be improved with no impact on safety certification. This expectation is further supported by the development of DL in other transportation domains, such as automotive and avionics, which opens up knowledge transfer opportunities and highlights the potential of the paradigm in railways. To summarize the recent state of the art and explore future opportunities, this paper reviews DL approaches for the analysis of data generated by acoustic and visual sensors in railway maintenance applications published up to August 31st, 2021. The current state of the research is investigated and evaluated using a structured and systematic method, in order to highlight promising approaches and successful applications, as well as to identify available datasets, current limitations, open issues, challenges, and recommendations for future research directions.


I. INTRODUCTION
Railway components have a long operating life, often spanning decades, with installations that may extend over hundreds of kilometres and need to withstand harsh environmental conditions. Therefore, keeping a railway in efficient working condition is a task requiring a considerable amount of resources. Automating inspection, condition monitoring and eventually repairs of railway assets may lead to a more efficient use of human skills and time, as well as reduce maintenance mistakes and costs. Artificial Intelligence (AI) can help to analyse specific railway components through diverse sensors [1], and some works investigate to what extent it is possible to use non-intrusive data sources (e.g., LiDAR [2], X-Ray [3], light sensors [4], and fiber optic [5]) to improve maintenance effectiveness and efficiency. Among these, audio- and video-based inspection can provide effective means for observing the status of railway assets and selectively identifying items that need further attention. Indeed, automated and continuous audio-video inspection opens unprecedented ways to cope with railway maintenance, which is currently the most investigated AI area in railways [6]. Several European projects have investigated and experimented with the adoption of AI techniques in the maintenance and inspection of railway systems, for example, the recent H2020 Shift2Rail (S2R) projects IN2SMART [7] and IN2SMART2 [8].
Current railway maintenance has focused on the use of special sensors, and many diagnostic tasks will continue to be addressed using those mature methods and technologies. However, those approaches might be rather expensive in terms of infrastructure and data analysis. Moreover, inspection activities are often performed on a fixed schedule, possibly resulting in late anomaly detection. The use of those sensors may also be problematic in terms of intrusiveness due to the need for "instrumenting" components, pausing railway traffic, etc. In such a context, audio and video sources seem to have a huge potential and several advantages, as i) they do not need to be installed on the object they have to monitor, thus limiting their intrusiveness and any need for (re)certification of railway components, and ii) they might be needed or already installed for other reasons (e.g., safety/security surveillance [9]), thus enabling multiple uses of technologies for higher cost-effectiveness. In particular, Deep Learning (DL) represents a paradigm that has shown unprecedented performance in image and audio processing by supporting humans in defect and anomaly detection [10].
Therefore, although other types of non-intrusive sensors exist, this paper aims to provide a structured survey of the DL approaches for railway maintenance that leverage video (i.e., sequences of frames) and/or audio data. In doing so, we borrowed and tailored to our scope some practices from Systematic Literature Review (SLR) methodologies [11], [12], such as the definition of research questions (RQ), well-structured queries, and exclusion criteria. In particular, in this study we aim at answering the following research questions:
RQ1 Which are the main railway applications and objectives addressed by current research on DL for audio and video analysis applied to railway maintenance?
RQ2 Which specific DL approaches and datasets have been used in existing audio-video analytics applications for railway maintenance? Are those datasets publicly accessible?
RQ3 What are the main open issues, challenges and opportunities in the field of DL applied to railway maintenance and inspection with audio and video sensors?
Different terms with similar meanings are used in the literature to refer to maintenance activities supported by technologies such as the Internet of Things (IoT), cloud computing and data analytics, including predictive maintenance, smart maintenance and computer-aided maintenance. In this paper, we adopt those terms according to the usage made by the authors of the surveyed papers; otherwise, we simply use the word "maintenance". This work has been developed within the RAILS 1 H2020 research project [13], whose main objective is to provide recommendations and research directions for a fast take-up of AI in railways.
A structured literature review can be presented in several ways, according to the points of view chosen to cluster and discuss the findings. We have chosen to organize the results according to reference railway areas and thus provide one section for each area, from Section IV to VII. The internal organization of these sections is homogeneous, hence it is possible to read the paper focusing on specific areas of interest. Those sections summarize the findings of the review, and they are preceded by the following sections: Section II presents existing review papers surveying DL applications to maintenance; Section III provides background information, including a description of the reference railway areas, and presents the organization of their related sections. The presentation of the review findings is followed by: Section VIII, which reports some overall statistics on the findings; Section IX, which discusses the responses to the research questions; and Section X, which summarizes and discusses the main results, challenges, opportunities, and research directions for effective utilization of AI techniques in the railway domain. Finally, Section XI provides some closing remarks.
1 Roadmaps for AI integration in the raiL Sector - https://rails-project.eu/

II. RELATED WORK
The last few years have seen a growing interest in AI applications to railway systems. This trend is attested by industrial research and innovation initiatives, as well as by the growing number of scientific publications (Fig. 2), addressing the application of AI techniques to the rail sector. Technologies such as computer vision and audio processing, especially fostered by Machine Learning (ML) and Deep Learning (DL) approaches, will play an important role in providing effective methods to solve various problems, including intelligent surveillance, automatic train operation, timetable optimization and network management. Different authors tried to organise these works based on the task or application in the railway domain.
Rail track maintenance is among the most analysed aspects, with works focusing on both corrective and predictive studies. Focusing on the former, reference [14] reviews papers using both shallow and deep neural networks to detect structural defects, including those due to physical deterioration of the tracks, or geometrical irregularities (e.g., misalignment). The review concludes that, although DL algorithms have been widely adopted in recent years to identify structural defects, there are still shortcomings to address, such as i) the lack of benchmarks, ii) the lack of labelled datasets, and iii) the difficulty of finding defective observations. Moving to the latter, reference [15] reports a detailed Systematic Literature Review (SLR) focused on data-driven methods (including ML and DL). The review was published in October 2020 and analyses about 109 papers covering the period from 2014 to 2019. Among the results of the SLR, it has been concluded that the application of data-driven models can help in avoiding the unnecessary replacement of track components. Nevertheless, DL approaches accounted for only 8% of the reviewed models. Two hot topics have been highlighted: i) the need to develop an automatic data labelling method (currently carried out manually in most cases) and ii) the importance of interpretability of black-box models.
Predictive maintenance has also been analysed in a broader sense in [16], where the authors provided a wide-spectrum SLR of suitable ML approaches. The analysis, conducted on papers published from 2009 to 2018, highlights that predictive maintenance has attracted increasing interest from researchers since 2013. The work shows that only a few papers propose solutions specifically tailored to railway systems and suggests that appropriate sensors could help to avoid unnecessary replacements and hence lead to savings while ensuring safety, availability, and efficiency of maintenance processes. It is worth noting that a similar conclusion had already been stated in another work, dating back to 2010 [17].
An SLR on Prognostic and Health Management (PHM) has been presented in [18], where for each PHM sub-field, i.e., Fault Detection, Fault Diagnosis, and Prognosis (Remaining Useful Life estimation), papers have been classified according to i) the DL approach (including convolutional/recurrent neural networks, autoencoders, etc.) and ii) the type of dataset they rely on (vibrational, time-series, imagery, and structured). The paper also emphasizes DL challenges concerning model complexity and its characterization, the lack of labelled data, and the fact that most of the papers were about datasets gathered from bench-scale experiments that possibly lead to poor applicability in the real world. It is also worth noting that no image-based approaches have been identified for prognosis.
More recently, researchers are also focusing on the use of computer vision supported by ML and DL. In [19], the authors provided a review of obstacle detection based on AI and DL for risk reduction at Level Crossings. The paper analyses the combination of obstacle detection with intelligent decision making as an opportunity to provide robust interlocking decisions. Another recent work [20] focuses on visual inspection based on image processing, including traditional and ML-based techniques, without a clear focus on DL.
All the works mentioned above provide literature reviews with different degrees of abstraction and focus. In some cases, DL has been analysed as one of the possible solutions, while in others the authors focused on different transport domains, data sources, maintenance topics, or specific railway areas. To the best of our knowledge, no literature review specifically addresses DL for audio-video analytics in railway maintenance.

III. BACKGROUND AND REVIEW STRUCTURE
In this section, we provide background about railway areas and components that are relevant for maintenance, as well as an overview of DL approaches and maintenance activities.

A. RAILWAY AREAS
Based on the results of our review, current research on railway maintenance focuses on the following main areas and components:
• Rail Track: the core element of the railway system infrastructure. It includes rails, switch points, sleepers, fasteners, and ballasts. It is also worth mentioning that the rails are, in some points, joined with welds [21];
• Pantograph & Catenary: the Pantograph and Catenary (PAC) system plays a central role in electric trains, since it provides them with the power necessary to operate. The catenary is the fixed structure located alongside and above (overhead line) the rail track; the pantograph is a component of the train placed on its roof. Trains get the power through the contact between the pantograph slide plates and the catenary contact wire. Catenaries are composed of a high number of elements (e.g., screws, bolts, etc.). In Section V, we divided this area into Pantograph and Arcs, Catenary Support Device (CSD), and Catenary Wires;
• Rolling Stock: within rolling stock, bogies are undercarriages with four or six wheels pivoted beneath the end of a vehicle that could also provide traction and braking, while Electric Multiple Units (EMU) are multiple-unit trains consisting of self-propelled carriages using electricity as the motive power. This category also includes a few other components of the trains' body and frame (i.e., chassis);
• Tunnel & Bridge: encompassing all components of tunnels and bridges, including concrete structure and lining.
Fig. 1 provides an at-a-glance view of a railway system.

B. DEEP LEARNING
Deep Learning is a subfield of Machine Learning encompassing a multitude of algorithms and models, including Convolutional Neural Networks (CNN), which specialise in image processing by learning how to extract a suitable set of features for a specific task [14]. There are three main families of image processing tasks:
• Classification (C), in which a sample has to be assigned to a class within a fixed set of alternatives. This represents the earliest task CNNs have been used for, as classification is commonly used in computer vision. Dozens of models have been proposed, with some remarkable examples such as AlexNet [22], VGG [23], ResNet [24], DenseNet [25], etc. In this study we refer to those as state-of-the-art (SOTA) networks;
• Object Detection (OD), where CNN features are used to localise objects of interest within a scene using a bounding box. The literature includes many SOTA CNN-based models for OD, such as region-based CNNs (e.g., R-CNN [26], Faster R-CNN [27], Mask R-CNN [28]) and those belonging to the You Only Look Once (YOLO) family (e.g., YOLO [29], YOLOv2 [30], YOLOv3 [31]);
• Semantic Segmentation (SS), which is the precise pixel-wise assignment of portions of an image to a given class. These architectures are often similar to autoencoders, as their output is a segmentation map having the same size as the input image. Well-known examples are the SegNet [32] and U-Net [33] architectures.
VOLUME 4, 2016. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.
Interestingly, CNNs can also be used for the analysis of audio signals. In this case, the idea is to analyse the frequency-phase spectrum as if it were an image. This naive approach not only proved to be effective, but also allowed for the re-use of CNN architectures originally intended for images [34].
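As a minimal sketch of the image-like representation mentioned above, the following Python snippet turns a one-dimensional signal into a two-dimensional time-frequency array that a CNN could consume like a single-channel image. It computes a log-magnitude short-time Fourier transform, which is an assumption made for brevity: the surveyed works may rely on other representations (e.g., ones that include phase). The function name, frame length, hop size and synthetic test tone are all illustrative.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a (frequency_bins, frames) log-magnitude spectrogram.

    frame_len and hop are illustrative defaults, not values from the survey.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # taper each frame to limit spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per frame
    return np.log1p(magnitude).T  # rows: frequency bins, columns: time frames

# Synthetic 1 kHz tone sampled at 8 kHz, standing in for a recorded signal.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
image = log_spectrogram(tone)

# Frequency (in Hz) of the strongest bin in the first time frame.
peak_hz = image[:, 0].argmax() * sample_rate / 256
```

The resulting array can be resized or stacked into channels in the same way as a greyscale picture, which is what makes the re-use of image-oriented architectures possible.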

C. MAINTENANCE TASKS
Although the railway domain encompasses several maintenance tasks, all the works covered by this survey fall into two main categories:
• Surface Defect Detection (SDD), focusing only on surface defects (e.g., cracks in tunnels, rails' heads, etc.);
• Defect Inspection (DI), including all models intended to detect and analyse different types of faults (e.g., broken objects, missing parts, anomalies in components, etc.).
It is worth noting that we also reviewed papers that, although totally focused on the railway domain, are intended to support a maintenance process and not to address it (e.g., detection and/or classification of railway-related objects). We decided to keep those papers (gathered under the name "Promising works") in this survey, as they represent promising foundations for future works on maintenance.

D. REVIEW METHODOLOGY
To evaluate the current status of the research and answer the questions highlighted in Section I, we followed the SLR methodology proposed in [12]. Therefore, we defined a research query by combining keywords from four sets:
• Railway domain: "rail*", "metro", "subway";
• Maintenance and defect/fault detection: "smart maintenance", "predictive maintenance", "condition based maintenance", "fault diagnosis", "fault detection", "fault forecast*", "fault prediction", "fault prevention", "defect inspection", "defect detection", "health monitoring", "health status", "remaining useful life", "condition monitoring";
• Deep Learning: "deep learning", "deep reinforcement", "deep neural", "convolutional neural network", "CNN*", "recurrent neural network", "LSTM", "autoencoder*", "generative adversarial", "GAN";
• Image/audio processing: "computer vision", "vision", "image*", "video*", "sound", "acoustic", "audio".
The final search query is given by the OR among keywords of each set, combined with the AND of all sets. These keywords were selected by performing empirical trials in order to cover as many studies as possible when submitting the query to the considered scientific literature databases: IEEE Xplore Digital Library, Scopus, and ACM Digital Library. Notably, given the limited number of wildcards (i.e., asterisks), we opted for some generic keywords (e.g., "deep neural") and not for specific ones (e.g., "deep neural network") to include possible keyword variations (e.g., "network", "networks", "architecture", "architectures") that might not be encompassed otherwise. Lastly, the query was submitted to search within the title, abstract, and keywords data fields.
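The OR-within-set, AND-across-sets construction described above can be sketched in a few lines of Python. The function name is hypothetical and the keyword sets are abbreviated for readability; the exact syntax (field restrictions, wildcard limits) differs between IEEE Xplore, Scopus and the ACM Digital Library and is not modelled here.

```python
def build_query(keyword_sets):
    """OR the keywords inside each set, then AND the sets together."""
    groups = []
    for keywords in keyword_sets:
        quoted = [f'"{keyword}"' for keyword in keywords]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

# Abbreviated versions of the four keyword sets listed in the text.
keyword_sets = [
    ["rail*", "metro", "subway"],
    ["predictive maintenance", "fault detection", "defect detection"],
    ["deep learning", "deep neural", "CNN*"],
    ["image*", "video*", "audio"],
]
query = build_query(keyword_sets)
print(query)
```

A record matches such a query only if it contains at least one keyword from every set, which is why generic stems such as "deep neural" widen coverage without adding wildcards.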
Then, to filter the results, we applied the following exclusion criteria:
E1 works that are not written in English;
E2 works not related to maintenance and DL;
E3 works not focusing on the rail sector;
E4 works that present results leveraging data coming from sensors different from cameras or audio devices;
E5 works that are non-experimental;
E6 literature reviews.
The query was submitted to cover all years until August 31st, 2021. However, no relevant paper was found before 2014. Also, we removed all duplicated papers found in different libraries. After the application of all exclusion criteria, we got 95 relevant papers. Fig. 2 reports the publication trend per year; please note that the 13 papers published in 2021 are not included in the figure because the partial coverage of the year (until August 31st) would have led to a misleading trend. Fig. 3 shows the distribution of papers in the reference areas.
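The screening step described above (duplicate removal followed by the exclusion criteria) could be organised as in the sketch below. This is only an illustration under stated assumptions: the `Record` fields and the sample records are hypothetical, only the criteria that can be checked mechanically (E1, E4, E6) are encoded, and E4 is read here as "no camera or audio data at all". Criteria E2, E3 and E5 require reading the paper and are left to manual review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; field names are illustrative."""
    title: str
    language: str
    is_review: bool
    data_sources: frozenset

def title_key(title):
    """Case- and whitespace-insensitive key used to spot duplicates."""
    return " ".join(title.lower().split())

def screen(records):
    """Drop duplicates, then apply the mechanically checkable criteria."""
    seen, kept = set(), []
    for record in records:
        key = title_key(record.title)
        if key in seen:
            continue  # same paper returned by more than one library
        seen.add(key)
        if record.language != "English":
            continue  # E1
        if not record.data_sources & {"camera", "audio"}:
            continue  # E4, read here as "no camera or audio data at all"
        if record.is_review:
            continue  # E6
        kept.append(record)
    return kept

# Invented sample records, used only to exercise the function.
records = [
    Record("Rail defect detection with CNNs", "English", False, frozenset({"camera"})),
    Record("rail defect  detection with cnns", "English", False, frozenset({"camera"})),
    Record("A review of track monitoring", "English", True, frozenset({"camera"})),
    Record("Axle fault diagnosis from vibration", "English", False, frozenset({"accelerometer"})),
]
selected = screen(records)
```

Matching on a normalised title is a simple heuristic; matching on a DOI, where available, would be more robust.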

E. REVIEW ORGANIZATION
The survey is structured into four sections, one for each relevant railway area: Rail Track (Section IV), Pantograph & Catenary (Section V), Rolling Stock (Section VI), Tunnel & Bridge (Section VII). For each section, we provide two tables: one to gather the papers under the three groups identified in Section III-C (SDD, DI and Promising works); the other to summarise the maintenance task, DL approach, data acquisition system, dataset and obtained performance. The reported data preserve, as much as possible, the original format of the corresponding paper. Although this causes some inconsistencies (e.g., different performance metrics in the same column), it ensures a fair and transparent analysis. Nonetheless, for the sake of correctness, we omitted misleading results when possible. For example, we did not report the accuracy for semantic segmentation or object detection tasks when other, fairer metrics, such as the IoU (Intersection over Union) or the mAP (mean Average Precision), were available. Those tables provide all the important information at a glance, while some fine-grained details (where available) are provided within the subsections.
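To illustrate why accuracy can be misleading for segmentation, the sketch below compares pixel accuracy with the IoU on a toy mask in which defect pixels are rare. The masks are invented for the example and are flattened to plain lists; the function names are illustrative.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def iou(pred, truth, positive=1):
    """Intersection over Union for the `positive` class."""
    intersection = sum(p == positive and t == positive for p, t in zip(pred, truth))
    union = sum(p == positive or t == positive for p, t in zip(pred, truth))
    return intersection / union if union else 1.0

# Toy ground truth: 2 defect pixels out of 100 (values chosen for illustration).
truth = [1] * 2 + [0] * 98
# A degenerate model that labels every pixel as background.
all_background = [0] * 100

accuracy = pixel_accuracy(all_background, truth)
defect_iou = iou(all_background, truth)
print(accuracy, defect_iou)
```

The degenerate prediction scores 98% accuracy while its IoU on the defect class is zero, which is the kind of imbalance effect that makes IoU and mAP fairer choices for these tasks.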

IV. RAIL TRACK

A. RAILS' HEADS
Rails' heads surface defect detection is addressed by several papers. To classify cavities or small hills on the surface of rails' heads, a CNN-based classifier is trained on photometric stereo images in [41], while the authors of [42] used 3D laser camera data. Reference [63] proposes a CNN-based approach to classify rails' health status through Acoustic Emission recognition. The test data were obtained by stressing a rail test specimen until it broke, recording and labelling the audio events. Reference [54] proposes an architecture based on CNNs to identify surface rail defects at switch points. The proposed network leverages transfer learning on a dataset collected through a monitoring system previously developed by the same authors [75]-[77]. Reference [43] introduces a three-stage pipeline: the first stage detects and removes blurring from the images based on an Inertial Measurement Unit and on the Attitude and Heading Reference Systems algorithm; in the second stage, a CNN built from scratch is trained to detect surface defects (healthy or faulty image); in the third stage, the CNN is tested on new images. Experiments show that the whole process can run at up to 40 frames per second, which corresponds to a real-time operating speed of around 72 km/h. Reference [44] makes use of images collected by a robot running on the rail line. Local image processing and classification are first performed onboard (on-the-fly). Once the onboard CNN architecture detects a defective image, the location is saved and the image is post-processed in the cloud. Reference [49] proposes an approach based on the YOLOv3 architecture to improve the detection speed (about 0.15 s), preserving a recognition rate of 97%. Training and testing have been performed on images captured by a camera installed perpendicular to the rail. Reference [45] introduces a new network intended to operate (also) under poor lighting conditions. The model combines YOLOv3 for regression, multi-scale prediction and the loss computation method, with a feature extractor based on MobileNetV2 [86]. The resulting system operates at up to 60 FPS. References [50] and [52] evaluate Faster R-CNN models against the use of other architectures as feature extractors. To eliminate manual inspection, reference [64] proposes an IoT-based multi-robot system able to detect rail tracks' defects in real time. Multiple robots were involved in collecting data about defects through ultrasonic sensors and cameras. For surface cracks, a CNN is compared against SVM, ANN, and Random Forest trained with data obtained through the Speeded-Up Robust Features (SURF) algorithm, with the CNN showing the best performance. When a defect is detected, an alert SMS is sent to an operator, while the GPS data and the image are sent to an IoT web server to be remotely available.
Table row fragment: Combination of Type-I and Type-II [73], [74]; as described in [73], [74]; OD; -; 2764 (650x188) images obtained by cropping and collecting together images from the Type-I and Type-II datasets. The aim was to identify defects within the images.
Table note fragment: The acronyms used for the tasks refer to Classification (C), Object Detection (OD) and Semantic Segmentation (SS). In the "collection method" the following acronyms have been [...] Table 1; while the † symbol indicates data obtained from specimens only (laboratory tests).
References [46], [47], and [51] focus on Semantic Segmentation. The first proposes a new network to address the high false alarm rate of traditional commercial off-the-shelf visual-based track inspection systems, implementing a multi-stage approach: in the first stage, a U-Net is used to extract rail tracks and locate the Regions of Interest (ROI); then, the ROIs are cropped; finally, the resulting patches are fed to a CNN which classifies them into True or False alarms. The authors highlighted three limitations: (i) only one type of defect was considered; (ii) the labelling of ROIs introduced unnatural noise; (iii) the network was tested only on vertically aligned rail tracks. In [47], the authors proposed a SegNet variant that copes, through a histogram matching method, with images of different surface defects collected through CCD industrial cameras arranged near the rail on a testing vehicle. Similarly, reference [51] combines saliency cues [87] of the damaged area with a U-Net to perform defect detection through segmentation.
Lastly, reference [53] presents two deep learning approaches for scenarios with a reduced number of available samples. The former relies on a CNN with a Long Short-Term Memory (LSTM) and an attention module to select significant information; the latter removes the attention module and leverages a TD-LSTM [88].
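The relation between processing rate and operating speed reported for [43] earlier in this subsection (up to 40 frames per second, around 72 km/h) can be checked with a short calculation. The sketch assumes that consecutive frames cover the rail contiguously and without overlap; that assumption, and the function names, are ours and are not stated in the source.

```python
def metres_per_frame(speed_kmh, fps):
    """Track length travelled between two consecutive frames."""
    return speed_kmh * 1000 / 3600 / fps

def max_speed_kmh(fps, frame_coverage_m):
    """Highest speed at which consecutive frames still cover the rail without gaps."""
    return fps * frame_coverage_m * 3600 / 1000

# Figures reported for [43]: 40 frames per second, around 72 km/h.
spacing = metres_per_frame(72, 40)
speed = max_speed_kmh(40, spacing)
print(spacing, speed)
```

Under this assumption the two reported figures imply that each frame accounts for roughly half a metre of rail; a camera with a shorter field of view would need a higher frame rate to sustain the same speed.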

B. RAILS
Reference [37] presents a systematic procedure to configure a CNN-based classifier to evaluate the rails in the quality assessment step. Images were collected by four cameras able to capture the whole rail surface. Although the images were manually labelled into six defect classes, the paper focuses on a binary classification by grouping the samples into defective and non-defective images. An RNN was used in [38], where the inspection was performed using an ultrasonic flaw detection vehicle collecting B-scan data of ten different types of defects. The idea was to turn the problem into the identification and classification of "sequence languages".

C. FASTENERS/FASTENING SYSTEMS
Some of the reviewed works address fasteners type recognition and fasteners defect detection. Reference [72] focuses on the recognition of rail fastening systems and proposes an intelligent information processing architecture able to automatically recognize the fasteners' type among six fastening systems (e.g., DO2, KB, etc.). Images were acquired through a rail detector car equipped with four cameras and an automated workstation to pre-process the video stream. References [65] and [69] propose architectures to evaluate the faulty or normal status of the fastening system. Images were collected through a camera installed on an inspection vehicle equipped with a GPS module to mark the location of damaged fasteners. Reference [66] relies on images captured by an Unmanned Aerial Vehicle (UAV) flying at 30 m on one side of the railway line. The authors tried different architectures for defect detection, with a YOLOv3-based detector proving to be the fastest, with a detection time of 0.01 s. Reference [67] compares three different approaches: the first is based on Dense Scale Invariant Feature Transform [89], bag-of-visual-word model [90], Spatial Pyramid Decomposition [91], and SVM; the second is based on VGG16 convolutional layers; the last one is based on Faster R-CNN. The last configuration yields better performance and a faster overall speed (0.23 s for both detection and classification). Reference [48] focuses on both rails' heads and fasteners defect detection by relying on an architecture based on YOLOv3 with scale reduction and further feature connections between layers to achieve lower complexity and faster detection. In [68], the authors developed a 3D Laser Railway Detection System for automated railway fasteners detection on ballastless tracks. The system collects "depth images" to be used as input to a CNN-based classifier.

D. WELDED JOINTS AND SLEEPERS
Two papers deal with thermite welded joints detection and with the width estimation of cracks in concrete and sleepers. Reference [21] proposes a YOLOv3-based object detector for the welded joints detection task, trained and tested on images of thermite-welded specimens collected in the laboratory. Reference [36] proposes three Semantic Segmentation models, based on the SegNet model, to predict the width of cracks in concrete and sleepers. The authors made use of rectangular convolutional kernels to make the model more compliant with the shape of the cracks.
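The intuition behind rectangular kernels can be illustrated with a toy example, which is not the model of [36]: the kernels below are fixed averaging filters rather than learned ones, and their sizes are chosen only for illustration since the source does not report them. On a thin, elongated structure, a kernel with the same elongated shape produces a much stronger response than a square one.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding (what CNN layers compute)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i : i + kh, j : j + kw] * kernel)
    return out

# Toy image: a one-pixel-wide vertical "crack" on an empty background.
image = np.zeros((9, 9))
image[:, 4] = 1.0

tall = np.full((5, 1), 1 / 5)    # rectangular kernel aligned with the crack
square = np.full((3, 3), 1 / 9)  # square kernel of comparable extent

tall_peak = correlate2d_valid(image, tall).max()
square_peak = correlate2d_valid(image, square).max()
print(tall_peak, square_peak)
```

The elongated kernel overlaps the crack along its whole length, while most of the square kernel's weights fall on background pixels and dilute the response.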

E. FINDINGS ON RAIL TRACK
In most cases, researchers presented a single-target solution, i.e., focused on a specific component or a class of components (e.g., rails' heads), proposing new architectures or carrying out interesting comparative studies (e.g., [66], [67]). Nevertheless, we also found a comparative study presenting an interesting multi-target architecture able to cope with both defective fasteners and rails' heads defects [48].
Regarding rails' heads surface defect detection, the analysed papers mostly implement custom architectures from scratch. Instead, in Fasteners Defect Inspection most of the studies presented approaches based on SOTA object detectors (e.g., YOLOv3 and Faster R-CNN). Moreover:
• in two cases, the authors "simply" focused on item classification or detection, without looking at defects. In particular, [72] focused on fasteners type classification, while [21] focused on welded joints detection;
• in two cases, tests were performed only on data coming from laboratory experiments (cf. [21], [63]);
• some studies use geolocation to identify the location of faulty components on the rail tracks (e.g., [44], [65], [69]).
A further finding regards the implementation of approaches based on Audio Processing for maintenance tasks: besides the results obtained by [63] on audio events generated in the laboratory, the authors in [35] leveraged audio data related to the wheel-rail interaction, while those in [54] leveraged Acoustic Emission data collected in the field, pre-training their model on data from the well-known AudioSet dataset [34], [92].

F. DATASETS
In most cases, data have been collected using custom systems, hand-labelled, and not made available. Some exceptions leverage third-party data, such as [46] using ImageNet [93], or [44], [51] and [53] using the Type-I [73] or the Type-II [74] datasets. Reference [36] leveraged the dataset proposed by [94], even though it includes cracks related to generic concrete structures. Lastly, in [54], the authors leveraged the AudioSet dataset [92]. Only in one case [64] did the authors present a new dataset, available on GitHub, for rails' cracks and rust.

V. PANTOGRAPH & CATENARY

A. CATENARY SUPPORT DEVICE (CSD)
Different DL-based solutions perform defect inspection on CSD components. In [134], the authors applied a DL-based approach to detect the insulators before performing the defect detection through Contour Features and Gray similarity matching. Reference [114] uses both a classical approach (based on SURF) and an AlexNet-based method to detect insulators and bolts in images of CSD in metro power supply systems. Reference [113] proposes a method based on the Markov Random Field (MRF) and on Faster R-CNN models to perform loose strands diagnosis in isoelectric lines. Differently, a three-stage DL-based model, combining SSD, YOLO, and CNN and operating at >60 FPS, is proposed in [99] to detect fastener defects on the catenary support device. Reference [95] proposes a two-stage method to detect surface defects in catenary insulators leveraging a Faster R-CNN (based on VGG16), a deep denoising autoencoder (DDAE), and a deep material classifier (DMC, a segmentation approach inspired by [135] and [136]) trained through a multitask learning approach, sharing the first convolutional layers. A two-stage approach is also described in [98] to perform defect detection in contact wire clamps and split pins by using a custom catenary components segmentation network (CCSN) inspired by the Mask R-CNN architecture. With the same aim, reference [100] presents a three-stage method adopting an improved YOLOv3 network and DeepLabV3+, achieving the best performance with respect to the state of the art. Reference [103] proposes a new two-stage approach to detect defects of catenary insulators. Composed of a Basic Localization Network and of an Oriented Region Proposal Network, it uses a post-refinement method based on a Generative Adversarial Network (GAN) to detect defects within the images, associating a defect score to each faulty image. Tests showed that, despite being effective, it operates at <0.3 FPS.
Reference [101] proposes a solution consisting of a Spatial Transformer Network [137], to learn invariant representations of the original image, and two parallel networks to extract features on which classification is performed. Different couples were tested, with VGG16+VGG19 obtaining the best accuracy. Reference [102] presents a pyramid fusion architecture focused on bird preventing, relying on ResNet-101 as backbone network. Some works also address CSD component detection. In [112], the authors proposed a comparative study between four frameworks to perform catenary support components detection. Despite the best results were obtained by an R-CNN with ResNet-101, none of the tested architectures achieved good precision on small components, underlining the difficulties of such frameworks in extracting the features of small-scale targets. Reference [111] presents an approach based on the Faster R-CNN to detect brace sleeve screws. Faster R-CNN has been improved by discrimination feature maps and Proposal feature maps. Then, VGG16 and ResNet-101 have been tested as feature extractors, with the former resulting slightly better both in terms of mAP and AP, even if a little bit slower than the latter. Finally, in [115], the authors proposed to detect rod insulators within images captured through two drones. Analyses were conducted considering two types of drones, custom and off-the-shelf, evaluating different pre-processing strategies to obtain optimal images.

B. PANTOGRAPH & ARCS
The constant contact between the pantograph sliding plate and the catenary contact wire can cause erosion on one or both components, resulting in a sparking event called an arc. Reference [119] proposes an arc detection system based on CNNs to analyse frames captured by a camera located on top of the train and pointing at the catenary-pantograph contact point. In reference [120], the authors proposed a Faster R-CNN based architecture able to detect both pantographs and arcs on high-speed train lines. The proposed system also computes the height of the pantograph (to detect a bow drop) and uses a SIFT-based approach to geolocate the fault on the railway line. The authors of reference [122] proposed a detection method for systems with double contact lines. They also introduced a new dataset, called PAC-TPL2020, collected using a high-resolution system installed on the roof of the train. The authors of reference [121] proposed a two-stage object detection approach to detect pantograph horns as well as faults such as pantograph loss and deviation. Lastly, reference [117] proposes an approach to estimate the wear of the pantograph sliding plate on metro line images collected with a handheld DSLR camera. The authors also presented a procedure to estimate the thickness of the pantograph sliding plate.

C. CATENARY WIRES
With respect to Catenary Wires, the droppers have caught the attention of researchers, as their failure (e.g., bending, breakage, deformation) can lead to undesirable consequences such as the burnout of the contact wire [125]. Reference [125] proposes a complex system for dropper defect inspection, subdivided into (evident) faults and (tiny) defects. If a dropper is identified as faulty, a first system produces that output; otherwise, it is analysed by a second system to identify tiny defects (e.g., small foreign objects, broken strands). The same task is addressed in [126], with the authors proposing a two-stage detection system based on a light version of YOLOv3, a Faster R-CNN based network, and traditional image processing (OTSU, digital morphological operation, and self-defined linear difference detection) to detect non-stress defects. Differently, in reference [127], the authors mostly focused on dropper detection, as droppers can be easily obscured by other catenary components. With the same aim, the authors of reference [129] implemented an approach for high-speed railways based on the Faster R-CNN architecture. The model was tested on images from the high-speed rail 2C system, where it operated at only up to ∼8 FPS.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication.

Table caption (fragment): The acronyms used for the tasks refer to Classification (C), Object Detection (OD), and Semantic Segmentation (SS).

D. FINDINGS ON PANTOGRAPH & CATENARY
What most characterises the Pantograph & Catenary area compared to the Rail Track area is the presence of very small components. This represents, especially with regard to CSDs, one of the greatest challenges, as applying a defect detection approach directly to the entire image results in poor accuracy. Therefore, in many cases, multi-stage approaches have been presented that first identify the component under examination and then apply a defect identification mechanism, whether based on traditional image processing techniques (e.g., [134]), semantic segmentation (e.g., [98], [100]), classification networks (e.g., [99]), or other approaches such as deep denoising autoencoders (DDAE) and a deep material classifier (DMC), as in [95]. The architectures most used as the first stage were those belonging to the YOLO family, given their detection speed, and Faster R-CNN, given its accuracy. Clearly, these frameworks have been modified to suit the case under examination. In this respect, the work described in reference [103] deserves a mention, as it proposes an innovative methodology to better identify insulators and applies a GAN to identify defects. Interestingly, while in some cases other DL or image processing approaches have been implemented downstream of these detectors to find anomalies, in other cases the presented approach is limited to target detection. For instance, [115] focuses on insulator detection, [111] focuses on brace sleeve screw identification, while [127] and [129] propose architectures for dropper detection. It is worth noting that "simple" object detection applications can also directly involve defect detection, such as the identification of arcs (i.e., objects), which indicate possible deterioration of pantographs or contact wires (e.g., [119], [120]). Finally, to the best of our knowledge, we did not find any paper making use of audio data for this task.
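The multi-stage pattern described above (locate the component first, then examine only the cropped region) can be sketched as follows. This is a minimal illustration, not the method of any reviewed paper: `two_stage_inspection`, `toy_locator`, and `toy_classifier` are hypothetical names, and the toy stages are simple stand-ins for a trained detector (e.g., a YOLO or Faster R-CNN model) and a trained defect classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


@dataclass
class Finding:
    box: Box
    label: str


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def two_stage_inspection(
    image: Image,
    locate_components: Callable[[Image], Sequence[Box]],
    classify_crop: Callable[[Image], str],
) -> List[Finding]:
    """Stage 1 proposes component boxes; stage 2 labels each cropped region."""
    return [
        Finding(box, classify_crop(crop(image, box)))
        for box in locate_components(image)
    ]


# Hypothetical stand-ins for the two learned stages (assumptions, not real models).
def toy_locator(image: Image) -> Sequence[Box]:
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_classifier(patch: Image) -> str:
    mean = sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))
    return "defect" if mean < 50 else "normal"


image = [
    [200, 200, 0, 0],
    [200, 200, 0, 0],
    [0, 0, 10, 20],
    [0, 0, 30, 40],
]
results = two_stage_inspection(image, toy_locator, toy_classifier)
labels = [finding.label for finding in results]
```

Keeping the two stages behind plain callables reflects the observation that the reviewed works mix different first-stage detectors with different downstream mechanisms (image processing, segmentation, or classification).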

E. DATASETS
Researchers mostly built datasets from scratch or relied on data that are not publicly available. Only in a few cases did the authors use existing datasets or release those they created. In [129], the authors tested their model on two well-known datasets (VOC2012 [130] and MS-COCO 2014 [133]) to compare its performance with that achieved by other SOTA architectures. Differently, in [120], the authors used the ImageNet dataset [93] and VOC2007 [131] to pre-train their architectures, thus transferring knowledge from large datasets and then adapting it to their purposes. Reference [111] used the Catenary-5000 dataset [132] to test the proposed model for the detection of brace sleeve screws; however, it is not clear whether the dataset is available. The same issue applies to the PAC-TPL2020 dataset used in [122].

Table fragment (component: references; first row label not recovered): [35], [139], [140], [141], [142], [143], [144], [145], [146], [147], [148]; EMU Train Key Components: [149], [150]; Others: [151].

VI. ROLLING STOCK
In this study, we found a few papers, mostly related to the detection of objects that are small relative to the whole train, and to the identification of their defects, within bogies and the train's body and frame. Tables 5 and 6, whose structure is defined in Section III-E, report their characteristics.

A. BOGIE & FRAME
Reference [139] proposes a defect detection architecture to deal with low image quality and complex backgrounds, sliding a set of anchors over multi-level convolutional feature maps. The model was tested on four datasets related to 1) Cut-out cock handle, 2) Dust collector, 3) Fastening bolts, and 4) Bogie block key. The same authors improved the approach in [140] to make it suitable for real-time fault detection by introducing a novel multilevel feature fusion strategy. They also proposed an image acquisition system composed of a device placed in the middle of the rails and two devices on the rails' sides, each comprising CCD cameras and an auxiliary light device. When a train passes through, images are acquired and sent to a server system for analysis. A multi-defect (i.e., lost pin, lost bolt, lost rivet, broken chain, broken wire, and foreign object) detection system was proposed in [141]. The authors built a three-stage model based on ResNet-101, trained on data collected through 12 high-speed cameras placed on the tracks. Reference [144] proposes a DL-based two-stage component defect detection architecture to recognize anomalies in train bogies. In the first stage, a hierarchical object detection scheme is implemented to detect and localize small and large objects, while the second pipeline is tailored to detect smaller components. Reference [142] focuses on bolt defect detection. The authors proposed a system for bolt-loosening detection in key bogie components (Axle Box, Traction Motor, and Gear Box), acquiring data in an online fashion, i.e., while trains were running. First, a CNN-based system detects regions of interest, then another CNN extracts bolt edges; lastly, to detect single or multiple bolt loosening, a 3D reconstruction method is used to calculate the distance between the bolt cap and the mounting surface.
Results show optimal performance, with a relative error smaller than 1.42%; moreover, the processing time to complete the fault detection for a train with 8 cars was about 4.6 minutes, i.e., 1.1 s per image (within China's requirement of 5 minutes). Lastly, in reference [143], the authors implemented a two-stage approach to determine the status (Normal/Abnormal) of Axle Box Cover bolts. Images were collected through an acquisition system called "metro inspection shed", equipped with several high-precision line-scan digital cameras. In the first stage, a Faster R-CNN architecture was implemented to identify bolts within the component under examination. Then, a one-class convolutional architecture (OC-CNN) was used to distinguish normal samples from defective ones, to deal with the lack of negative examples. Results show that the whole architecture requires 265 ms to process a single image.

Table fragment (rolling stock datasets; entries partially recovered):
D6 Brake shoe key, 9600 images grouped into 2 classes (no-fault, fault); from [139], [140]; collection as described in [139], [140]; task: OD; SqueezeNet; mCDR: 98.60%
ImageNet [93]; collection as described in [93]; task: OD; SqueezeNet; Top1 ACC: 64.4%
Pascal VOC 2007 [131]; collection as described in [131]; task: OD; SqueezeNet; mAP: 70.01%
[146]: MS-COCO [133]; collection as described in [133]; task: OD
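The timing figures reported for [142] and [143] can be related to each other with simple arithmetic. The sketch below assumes that images are processed strictly one after another and that the quoted per-image times are averages; both are assumptions made here for illustration, and the function names are hypothetical.

```python
def images_in_budget(total_seconds: float, seconds_per_image: float) -> float:
    """Number of images that fit in a time budget under sequential processing."""
    return total_seconds / seconds_per_image


def frames_per_second(seconds_per_image: float) -> float:
    """Throughput equivalent to a given per-image latency."""
    return 1.0 / seconds_per_image


# Figures quoted for [142]: about 4.6 minutes per 8-car train, 1.1 s per image.
total_s = 4.6 * 60                                   # about 276 s
approx_images = images_in_budget(total_s, 1.1)       # roughly 250 images per train
within_requirement = total_s < 5 * 60                # 5-minute requirement

# Figure quoted for [143]: 265 ms per image.
fps_143 = frames_per_second(0.265)                   # a little under 4 FPS
```

Under these assumptions, the 4.6-minute figure corresponds to roughly 250 images per train, and 265 ms per image corresponds to a little under 4 FPS, which gives a common scale for comparison with the frame rates quoted for catenary inspection.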

B. EMU TRAIN KEY COMPONENTS
With the term EMU Train key components, the authors of [149] and [150] refer to the brake disk, brake caliper, tractor, side suspension, under suspension, and plate bolt. The authors proposed a system for EMU component defect detection inspired by the Faster R-CNN architecture, with a feature extractor based on ResNet-101 (instead of VGG16). Extracted regions were cropped and processed to obtain "super-resolution images" leveraging, among others, a super-resolution Generative Adversarial Network (SRGAN) [155].

C. OTHER COMPONENTS
Reference [151] proposes faulty rail-valve detection through a two-step segmentation approach leveraging U-Net. Images were collected through a VTIS system, which always captures images from the same distance. The first network segments the whole image, producing a mask that highlights the valve. The result is then processed by the second network, which performs a high-resolution segmentation. The output mask is then processed by an image processing algorithm to detect the fault. Despite the good performance, the system works only for train-valve binary classification and only if there is a single valve per image, oriented in a specific way.

Footnote: Northeastern University (NEU) dataset: http://faculty.neu.edu.cn/yunhyan/NEU_surface_defect_database.html

D. FINDINGS ON ROLLING STOCK
From the above analysis, it is clear that bogies have been the most investigated components. As in the Pantograph & Catenary area, bogies are composed of multiple small elements that are very hard to identify given the complex background. Therefore, in most cases, multi-stage approaches based on Object Detection frameworks (mainly Faster R-CNN) were proposed to cope with Bogie and EMU train component defect inspection. Regarding Audio Processing, only the authors of [35] leveraged audio data, to estimate the adhesion conditions between wheels and rails. Lastly, transfer learning approaches have been applied, as mentioned in the following subsection.

E. DATASETS
Regarding the datasets used, references [139] and [140] leverage those introduced in [152] to train and test their models. In [139], the authors also used the ImageNet dataset [93] to pre-train the model, implementing, de facto, a transfer learning approach. Likewise, in [142], the authors relied on the BSD500 dataset [154] to pre-train their bolt edge detection model. In two cases, the models proposed for bogie or EMU train component defect detection were also tested on other kinds of data to evaluate their generalization performance. Indeed, in [141], the model, trained to identify defects such as lost pin, lost rivet, broken chain, etc., was also tested after fine-tuning it to cope with defects such as oil leaks and scratches; nevertheless, it is not clear whether these datasets are available to the research community. Differently, reference [149] proposes a model to address EMU Train Key Components defect detection, which was also tested on the Northeastern University (NEU) surface defect dataset [156]. Lastly, reference [151] mentioned the Singapore Mass Rapid Transit (SMRT) dataset; however, also in this case, the dataset does not seem to be accessible.

Table note (fragment): the * mark indicates data not related to the railway domain.

VII. TUNNEL & BRIDGE
Only a few papers were found for the Tunnel & Bridge area. Tables 7 and 8, whose structure and aim have been described in Section III-E, report their characteristics. It is worth noting that this section also covers a paper (i.e., [36]) already analysed in Section IV, as it addresses concrete crack analysis in a setting compatible with the estimation of cracks in tunnel linings.

A. TUNNEL
As shown in Table 7, all but one of the works target tunnel lining defects, performing such analysis at different levels. In [157], the authors used a Kinect to estimate crack defect intensity, leveraging a DL-based approach (a CNN classifier) to classify the faulting lines as vertical, horizontal, upward tilt, or downward slope. Reference [161] uses Mask R-CNN to perform tunnel inspection and crack detection. Data were collected by a smart vision measurement system composed of different cameras, each capturing a segment of the tunnel. Segments were arranged in a circular array to obtain the tunnel cross-section. The authors considered both 2D and 3D information: the former was used to train a Mask R-CNN, with ResNet-50 as backbone, to obtain crack segmentation and detection; the latter was used to build a 3D model of the tunnel using a self-developed robust B-spline algorithm.
Reference [158] proposes a two-stage Semantic Segmentation approach using two Fully Convolutional Network (FCN) architectures, based on VGG16, to detect (possibly overlapping) cracks and leakages. Images for training and testing were acquired by a Moving Tunnel Inspection (MTI-200a) system, developed by the authors, which captured images of the tunnel's surface through 6 cameras while moving on the rails. Similarly, reference [159] uses a two-stage approach (inspired by the R-FCN architecture) to detect leakages, cracks, and scratches. Images were collected through the Movable Tunnel Inspection (MTI-100) system and two datasets were built. The first dataset was used to pre-train the Fully Convolutional Network (based on GoogLeNet and VGG16) composing the first stage of the model. This network was treated as a traditional classifier (with a final fully connected layer) and trained to classify images according to five classes: leakage, crack, segment joint, pipeline, and lining. Once the network was trained, the last FC layers were dropped and replaced to obtain the final feature map used as input for the next stage, which performs object classification and localization. Tests showed that, although the detection accuracy was almost the same, the localization accuracy increased with smaller images.
Finally, reference [160] proposes a Semantic Segmentation based approach to detect cracks in tunnels' lining referring to the DeepLab-v3 architecture. The authors built two datasets starting from the images captured by a proprietary image acquisition system equipped with eight industrial linear array CCD cameras: the first contains only cracks (tunnel crack dataset); the second contains cracks, structural seams, water stains, and scratches (augmented tunnel crack dataset).

B. BRIDGE
In [163], the authors used a YOLOv3 network, pre-trained on the VOC2007 dataset [131], to detect cracks on bridge beams. Since YOLO does not perform well on small-scale objects, images acquired by a CCD camera were first cropped into 64 smaller images. To improve generality, the brightness and contrast of the images were modified to simulate different lighting and weather conditions. It is worth highlighting that, although the authors reported an accuracy greater than 95%, the recall rate was less than 10%.
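Two points in the description of [163] lend themselves to a short illustration: cropping an image into 64 tiles, and the gap between high accuracy and low recall. The sketch below is only an illustration under stated assumptions: the 8x8 grid is one possible way to obtain 64 tiles (the source does not specify the layout), the confusion-matrix counts are hypothetical and not taken from [163], and the function names are invented here.

```python
from typing import List, Tuple

Image = List[List[int]]


def split_into_tiles(image: Image, rows: int = 8, cols: int = 8) -> List[Image]:
    """Split an image into rows*cols equally sized tiles, in row-major order."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    tile_h, tile_w = height // rows, width // cols
    return [
        [row[c * tile_w:(c + 1) * tile_w]
         for row in image[r * tile_h:(r + 1) * tile_h]]
        for r in range(rows)
        for c in range(cols)
    ]


def accuracy_and_recall(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Accuracy over all samples; recall over the actual positives only."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# A synthetic 64x64 image split into an (assumed) 8x8 grid of 8x8-pixel tiles.
image = [[(x + y) % 256 for x in range(64)] for y in range(64)]
tiles = split_into_tiles(image)

# Hypothetical counts: 1000 tiles, 40 of which contain a crack; the detector
# finds only 3 of them and raises no false alarms.
acc, rec = accuracy_and_recall(tp=3, fp=0, tn=960, fn=37)
```

With these hypothetical counts the accuracy is above 95% while the recall is below 10%, because the many crack-free tiles dominate the accuracy figure. This is why recall (or precision-recall curves) is usually more informative than accuracy when defects are rare.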

C. FINDINGS ON TUNNEL & BRIDGE
In the Tunnel & Bridge area, tunnels are subject to complex conditions such as uneven illumination and image noise. To address those problems, most of the reviewed papers leveraged existing state-of-the-art architectures. In almost all cases, the Surface Defect Detection task has been addressed as an Object Detection or Semantic Segmentation problem. Apart from reference [36], which also focused on width estimation, and [157], which implemented Image Processing approaches, the reviewed papers have not implemented a mechanism to "estimate" crack or defect intensity through DL; instead, they mostly focused on detecting, classifying, and locating these defects in concrete structures. Most of them focused on cracks, except reference [159], where the authors also included leakages and scratches, and reference [158], where the authors proposed an architecture composed of two FCNs to detect cracks and leakages.

D. DATASETS
Concerning datasets, as in the previous areas, the authors used some existing datasets to build transfer learning approaches: reference [163] used the VOC2007 dataset [131] to pre-train the proposed architecture, while in reference [36], the model was trained on the dataset from [94], although in our understanding this is related to cracks in generic concrete structures. On the other hand, [158] built a dataset from scratch related to defects in tunnel linings, using a proprietary acquisition system.

Table caption (fragment): MTI: Metro Tunnel Inspection.

VIII. STATISTICAL ANALYSIS
In this survey, we reviewed 95 papers distributed as shown in Fig. 3; interestingly, only 5 leverage audio data (Fig. 4). Fig. 2 describes the temporal progress of the papers published during the last years and reviewed in this survey, with the first work found on the subject published in 2014. The trend shows an almost exponential growth, which demonstrates an increasing interest of researchers in this field (the year 2021 was not included in the graph because its partial coverage would show a misleading trend). Fig. 5 summarizes the different DL approaches that have been adopted by researchers to solve specific maintenance tasks in the reference Railway Areas. Notably, some architectures have been leveraged to deal with different problems in different areas and have also been exploited for different DL tasks (i.e., Classification, Object Detection, and Semantic Segmentation). However, some pairs are missing: in all papers, the authors proposed a methodology aimed at the detection of defects, while no study addressing object identification was found. It is worth mentioning that we also found a single paper [164] focusing on track-side equipment anomaly detection. Although it is not related to any of the four main railway areas, it deserves a mention, as it uses a DL approach (MDNet [165], [166]) trained on 600 images (2048x1011 px each) to identify defective objects in trackside components. The approach achieves a True Positive Rate of 98.9%.

Citation information: DOI 10.1109/ACCESS.2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 5. DL Architectures. The chart shows the cumulative number of times that a specific network or architecture has been used to address a Classification, Object Detection, or Semantic Segmentation task in the reference railway areas for Surface Defect Detection (SDD), Defect Inspection (DI), or Object Identification (OI). OI refers to the papers we have identified as "Promising Works", which only perform object detection (no defects) or classification. Notably: i) Faster R-CNN is an architecture involving a backbone network to extract features, with networks belonging to the ResNet series being the most adopted for this purpose; ii) YOLOv3 was the most exploited YOLO architecture; iii) with "Custom Net" we indicate approaches that were not (explicitly) based on SOTA networks.

IX. RESPONSES TO RESEARCH QUESTIONS
In Sections from IV to VII, we reviewed DL approaches that have been adopted in the context of Rail Tracks, Catenary & Pantograph, Rolling Stock, and Tunnel & Bridge, to address defect detection and inspection. Specifically, Tables 2, 4, 6, and 8 summarise the main characteristics of these approaches, while Tables 1, 3, 5, and 7 report the maintenance tasks per railway component (e.g., CSD, fastening systems, etc.). From these analyses, we derived the following answers to the three RQs we have previously identified.

A. RESPONSE TO RQ1
We reviewed 95 papers distributed as shown in Fig. 3. "Rail Track" and "Pantograph & Catenary" are the most investigated areas. To better understand trends and research directions, Figs. 6 and 7 report the papers grouped by components within each area. The most investigated components are "Rails' Heads" and "Catenary Support Device", discussed in Sections IV-A and V-A respectively, with papers providing suitable and effective approaches. For other subcomponents, by contrast, the outcome was unexpected, especially for cracks in tunnels and bridges (which account for only 8 papers out of 95), since such defects have already been analysed in other domains. Evidence for this includes: i) a study related to the road sector [167]; ii) the datasets it uses (CRACK500 [168], [169], GAPs384 [170], and CFD [171]), which are related to different types of cracks; iii) two approaches based on transfer learning that we reviewed in this SLR, i.e., reference [36], using a general crack dataset [94], and reference [163], using the VOC2007 dataset [131]. Fig. 7 also highlights interesting aspects of publication trends per area. In particular, papers focusing on "Rail Track", "Pantograph & Catenary", and "Rolling Stock" show a growing trend (with the latter mostly pushed by papers on Bogies, suggesting that this will be a hot topic in the future), while those focusing on "Tunnel & Bridge" appear almost steady.

Citation information: DOI 10.1109/ACCESS.2022.3183102

B. RESPONSE TO RQ2
Several object detection approaches have been implemented for Rail Tracks, Bogies and, especially, Catenaries. Indeed, the latter are composed of very different parts (some of them very small), which often makes defect detection on the acquired images ineffective. To deal with this, different multi-stage approaches have been proposed, with the first stage detecting the area of the component under inspection (e.g., [99], [100]). Among the various object detection methods, the most used are those belonging to the YOLO family (mostly YOLOv3), as they seem to be among the fastest architectures, and those belonging to the region-based CNN family (mostly Faster R-CNN), which have been shown to be (generally) more precise, even if slower than the former [172]. Such models have been used both as defect detectors (e.g., [118], [125]), evaluating defects starting from the original images, and as the first stage of multi-stage solutions, to detect the portion of the image related to the components of interest (e.g., [95], [105]). This aspect can be easily derived from Fig. 5.
As for the datasets, unfortunately, most of the studies leveraged ad-hoc datasets built from scratch; additionally, data have been collected through different methods and sensors. However, as shown in Tables 2, 4, 6, and 8, some works also make use of (new or already existing) datasets that are specific to certain railway components (e.g., Type-I and Type-II [73], [74], Catenary-5000 [132], and BSD500 [154]). Additionally, only a few works made the datasets they built available online (e.g., [62], [64]). The heterogeneity of datasets represents an obstacle to identifying the best approach among those proposed in the relevant literature.

C. RESPONSE TO RQ3
We have identified the following main research directions based on the issues highlighted by the works examined and their proposed solutions:

1) Defects in concrete infrastructures
Considering the analysis conducted on concrete infrastructures in section VII, we can state that only a few studies have been carried out in this direction. As already mentioned, these studies are scattered over time, and it seems that no particular emphasis has been placed on this topic in the railway sector. Therefore, starting from existing studies (including those not directly related to the rail sector [173]) and datasets, further analyses could be carried out on the detection of surface defects in concrete infrastructures, especially considering that, as expressed in some of the reviewed studies, delineating defects inside tunnels is not straightforward given some unfavourable environmental conditions (e.g., uneven illumination).

2) Small-scale object detection
Small-scale object detection introduces difficulties in tasks such as defect inspection of CSD and bogie components, leading some researchers to focus only on component detection (e.g., [111], [115], [129]). Building multi-stage approaches, as discussed in the response to RQ2, could be a suitable solution to deal with this issue. Additionally, the groundwork laid by some of the reviewed studies could lead to improved methods for detecting small-scale objects, and thus to better maintenance tasks. In this respect, the authors of reference [103] presented a quite innovative approach to improve insulator detection by leveraging rotated anchors instead of horizontal or vertical ones.
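A purely geometric sketch can illustrate why rotated anchors may help with elongated, tilted components such as insulators. It is not the network proposed in [103]; the box dimensions and the 45-degree tilt are hypothetical values chosen for illustration, and the function names are invented here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotated_box_corners(cx: float, cy: float, w: float, h: float,
                        angle_deg: float) -> List[Point]:
    """Corners of a w x h box centred at (cx, cy) and rotated by angle_deg."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in half]


def enclosing_axis_aligned_area(corners: List[Point]) -> float:
    """Area of the smallest axis-aligned box that contains all corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical elongated component: 200 x 20 pixels, tilted by 45 degrees.
w, h = 200.0, 20.0
corners = rotated_box_corners(0.0, 0.0, w, h, 45.0)
oriented_area = w * h                                # area of the rotated box
aligned_area = enclosing_axis_aligned_area(corners)  # area of the upright box
fill_ratio = oriented_area / aligned_area            # share of useful pixels
```

For this example the upright box is about six times larger than the oriented one, so most of its pixels are background. A rotated anchor that follows the component's orientation keeps the proposal focused on the object itself.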

3) Multi-target systems
Most of the reviewed papers focused on a single target or a class of targets such as rails' heads, fasteners, and insulators (or a few of the CSD components). Future research should aim at implementing multi-target systems focusing on more than one component. That can be done by combining and improving some of the proposed methods in order to detect defects of multiple components by leveraging the same video and audio data. Some steps have already been taken in this direction by evaluating rails' heads and fasteners' defects at the same time [48]. The same approach could be applied to CSD, pantographs, and arcs.

4) Harmonised datasets and benchmarks
The availability and quality of data have always been among the major challenges when it comes to properly characterising ML and DL models. Additionally, most of the studies built an ad-hoc dataset from scratch to conduct their experiments, even when solving the same task on the same rail component (e.g., rails' heads). This means that: i) it is not possible to immediately identify the best approach to solve a given problem; ii) there is no guarantee that a model that works well on a given dataset will perform properly on another dataset, possibly collected in a different way. Hence, besides DL-related concepts such as the "generalization" of the model, it would be advisable to define some guidelines for data collection and dataset construction. These guidelines should allow the creation of harmonised datasets, i.e., datasets that share similar characteristics, including data format, data quality, and data variety (e.g., under different weather or illumination conditions). Notably, data quality should not only address the resolution of the image, but should also take into account the number of collected samples and how relevant they are to the task. Harmonised datasets, in our view, would facilitate performance evaluation and enable comparison among different models. Reference datasets, possibly managed by reputable railway stakeholders, would enable reliable benchmarks to assess the performance of DL solutions and effectively support future research activities.
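One possible way to make such guidelines concrete is a small metadata record that every dataset must fill in before it is shared. The sketch below is only an assumption about what such a record could contain: the `DatasetCard` class, its fields, and the example values are hypothetical, while the task acronyms (C, OD, SS) follow those used in the tables of this survey.

```python
from dataclasses import dataclass, field
from typing import List

VALID_TASKS = {"C", "OD", "SS"}        # Classification, Object Detection, Semantic Segmentation
VALID_MODALITIES = {"audio", "video"}


@dataclass
class DatasetCard:
    """Hypothetical metadata record describing one railway dataset."""
    name: str
    component: str
    modality: str
    task: str
    num_samples: int
    data_format: str
    conditions: List[str] = field(default_factory=list)  # e.g., weather, lighting
    publicly_available: bool = False

    def problems(self) -> List[str]:
        """Return the list of guideline violations (empty if none)."""
        issues = []
        if self.modality not in VALID_MODALITIES:
            issues.append(f"unknown modality: {self.modality}")
        if self.task not in VALID_TASKS:
            issues.append(f"unknown task: {self.task}")
        if self.num_samples <= 0:
            issues.append("num_samples must be positive")
        if not self.conditions:
            issues.append("acquisition conditions not documented")
        return issues


# Illustrative records; names and numbers are made up.
good_card = DatasetCard(
    name="example-rail-heads", component="rails' heads", modality="video",
    task="OD", num_samples=1200, data_format="PNG",
    conditions=["daylight", "rain"], publicly_available=True,
)
bad_card = DatasetCard(
    name="example-undocumented", component="fasteners", modality="video",
    task="detection", num_samples=0, data_format="JPEG",
)
good_issues = good_card.problems()
bad_issues = bad_card.problems()
```

A check of this kind does not measure data quality by itself, but it forces each dataset to state its format, size, and acquisition conditions, which are the characteristics that would need to match for results to be comparable.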

5) Audio-based techniques
Our results showed a strong deficiency in the use of audio-based techniques to support maintenance activities in railways (Fig. 4). Indeed, in only a few of the reviewed papers, mainly related to the Rail Track area, did the authors leverage audio data: reference [35] proposed a model to identify wheel-rail adhesion conditions; reference [39] proposed an approach to identify rail crack status and, similarly, reference [40] aimed at detecting rail cracks; reference [63] describes laboratory experiments to detect rail head defects, while, to address the same task, the authors of reference [54] used acoustic emissions recorded in the field and implemented a more complex approach. Therefore, based on our analysis and given its known potential for warning maintenance operators about defects, the exploitation of audio data through DL is an aspect worthy of further investigation and extensive evaluation.

6) Combining audio and video data
No study has attempted to combine data from both audio and video sources. However, in the field of data-driven approaches and DL, there are plenty of feature fusion techniques that may be exploited to merge data from multiple sources; such a combination therefore has great potential. In fact, by focusing on the same target, the models built on the two different data sources could be synergistic and complement each other, reducing their respective shortcomings and ultimately leading to more accurate defect detection.
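As a purely illustrative sketch of how two independent models could complement each other, the snippet below implements decision-level (late) fusion: the per-class probabilities produced by an audio model and a video model for the same target are combined through a weighted average. This is only one simple option among the many fusion techniques available, and it is not taken from any of the reviewed studies; the function name `late_fusion`, the weight `w_audio`, and the example probabilities are hypothetical.

```python
from typing import List, Sequence


def late_fusion(p_audio: Sequence[float],
                p_video: Sequence[float],
                w_audio: float = 0.5) -> List[float]:
    """Weighted average of two per-class probability vectors.

    p_audio and p_video are assumed to be the outputs of two independent
    classifiers (hypothetical here) scoring the same classes in the same order.
    """
    if len(p_audio) != len(p_video):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= w_audio <= 1.0:
        raise ValueError("w_audio must lie in [0, 1]")

    # Convex combination of the two predictions, class by class.
    fused = [w_audio * a + (1.0 - w_audio) * v
             for a, v in zip(p_audio, p_video)]

    # Renormalise so the result is again a probability vector.
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [f / total for f in fused]


# Hypothetical scores for the classes ("defect", "no defect").
audio_scores = [0.7, 0.3]
video_scores = [0.4, 0.6]
print(late_fusion(audio_scores, video_scores, w_audio=0.5))
```

In practice, the weight would have to be tuned on validation data, and feature-level fusion (merging intermediate representations before classification) is an alternative worth comparing against.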

X. DISCUSSION
The common idea shared by a number of the reviewed works (e.g., [48], [140], and [143]) has been to move towards data-driven methods implementing mechanisms to improve traditional inspection and maintenance activities, which are often time-consuming and limited by the experience of the operators. It is worth noting that, although our scope was to review works addressing DL approaches to perform or facilitate maintenance and inspection activities by Audio-Video Analytics in railways, search results also included several approaches based on traditional Image Processing techniques for image pre-processing (e.g., image enhancement, noise reduction), feature extraction and image segmentation (e.g., [67], [134], [114]), and defect evaluation (e.g., [46], [113], [125], [126], [151]). Although some of those works are very recent and interesting, they tend to show limited performance when compared against DL, especially in unfavourable conditions, such as in tunnels (e.g., uneven illumination, image noise, etc.). On the other hand, DL approaches often require more data and computational power to be trained, as well as hardware acceleration (e.g., GPU, Graphics Processing Unit) even for the inference phase. This latter point is particularly critical for embedded applications designed to operate in near real-time and/or at very high frame rates, often running on batteries (e.g., on-track monitoring, without direct access to the power grid) or without a suitable cooling system. Despite this, the achievements made by deep neural networks in recent years are strongly pushing the research towards new procedures, techniques and devices able to cope with the aforementioned points, making DL more and more appealing even for inspection and maintenance applications in many domains, including railway transport.
In particular, the growing interest in DL in the railway domain is demonstrated by the studies we have reported in Tables 2, 4, 6, and 8, even though our review is restricted to works leveraging DL for audio and video analyses, a focus motivated by the need to analyse the current state-of-the-art on the usage of non-intrusive, possibly cheap and readily available sensors.
In Section I, we discussed why non-intrusive and cost-effective sensors, such as cameras and microphones, could be more appealing and advantageous than "special sensors" (e.g., laser, ultrasonic, etc.), which might be expensive in terms of infrastructure and data analysis. Cameras and microphones thus enable a transition from scheduled inspection activities using expensive inspection vehicles to continuous monitoring of railway assets. Furthermore, modern smart cameras used for railway monitoring and surveillance feature high definition and on-board computation capabilities, and are often equipped with quality microphones or microphone inputs; thus, it is possible to collect multiple types of data with a single device. Therefore, independent DL models that process different types of data can be built to improve detection performance, as discussed in Section IX-C6. As for the small-scale object detection issue (Section IX-C2), it depends mainly on the data acquisition system; for example, when track images are collected through UAVs (e.g., [66]), the fasteners will clearly occupy only a small portion of the image. Hence, this issue is shared among almost all areas and their corresponding components; the same holds for the multi-target problem (Section IX-C3). Components such as CSD, bogies, and tracks are typically composed of a large number of "sub-components"; evaluating the status of only one or a few of them would not be sufficient to ensure an adequate safety level for the whole component. Image analysis has the potential to capture characteristics related to multiple sub-components at once; therefore, it would be possible to build systems capable of detecting multiple defects to improve maintenance and inspection activities.
However, there might be some limitations related to the usage of audio and video sensors. For example, captured images might be blurred [43] given the train speed or vibration, objects of interest may be occluded by other objects [99], and, as already mentioned above, there might be a foreground-background imbalance [60], as railway components usually occupy only a part of the image. Similarly, audio might be distorted by other noise, and microphones might need a specific orientation [35]. On the other hand, from the data perspective, collected samples may be altered by natural phenomena (e.g., weather conditions, light reflection, especially when considering rails' heads [60]), or, given the rarity and different shapes/classes of the various possible defects, datasets built from scratch may turn out to be unbalanced¹¹ [56], [71]. Part of these issues (e.g., blurring, noise) represent technological challenges related to the choice of the sensor(s). The remaining issues, instead, are related to methodological aspects encompassing both data acquisition and the choice of the DL model. Part of these challenges may be overcome by means of harmonised datasets and collection guidelines (Section IX-C4), and by indicating the adequate set of sensors for the given task. Notably, harmonised datasets, intended as balanced collections of samples gathered under different weather and light conditions, may bring two positive side effects: they can be used as benchmarks to compare different approaches or as reference datasets for transfer learning.
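To make the notion of an unbalanced dataset more concrete, the following minimal sketch computes the ratio between the most and the least represented class in terms of number of samples, together with inverse-frequency class weights, a common heuristic (n / (k · n_c), with n samples, k classes and n_c samples in class c) for compensating imbalance during training. The function names and the example labels are hypothetical and are not drawn from any of the reviewed datasets.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Ratio between the largest and the smallest class sizes."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-class weights n / (k * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical dataset: 90 defect-free samples and 10 samples with a crack.
labels = ["ok"] * 90 + ["crack"] * 10
print(imbalance_ratio(labels))            # 90 / 10 = 9.0
print(inverse_frequency_weights(labels))  # "crack" is weighted more than "ok"
```

Such weights could, for instance, be used to scale the per-class contribution to the training loss; collecting a balanced harmonised dataset would address the problem at its source.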
As a final consideration, it is interesting to note that, although some works use IoT for data acquisition, during our analysis we did not find any work relying on edge computing, i.e., the use of embedded (possibly custom) hardware to run AI applications where needed. This is partly expected, as almost all the approaches rely on powerful and power-hungry GPUs; however, the recent development of embedded GPUs and architectures is increasingly opening the way to a massive use of AI at the edge, which is sometimes referred to as edge intelligence [174]. Thus, although it has not yet been exploited, we believe this will be a very promising research direction in the near future.

XI. CONCLUSIONS
Within AI, artificial vision and audio signal recognition are two extremely relevant research areas with many promising applications in several domains. Those areas are among the ones that are expected to benefit most from modern deep learning paradigms. Among the many domains and applications where audio-video analytics based on deep learning could succeed, railway maintenance has been explored by researchers and practitioners with the aim of making it more: effective, i.e., capable of detecting more faults and defects before they can cause failures and harmful consequences, and less error-prone compared to human inspection; and efficient, i.e., requiring less time, resources and effort through remote monitoring and cost-effective automation. Those recent and planned advances in railway maintenance are sometimes referred to as smart maintenance, including predictive analytics, usage of IoT sensors, intelligent cameras and microphones, autonomous drones (see e.g. [175]), etc. However, one of the main issues limiting the development of effective AI solutions in the railway domain is the lack of harmonised, well-described, open-source or publicly available datasets and benchmarks for the broad set of applications experts have to cope with. In this paper, we have summarized the state-of-the-art in this field by adopting a review approach whose results are reproducible and for which the risk of missing relevant results is minimal. We have provided a thorough classification, representation and extensive discussion of the review results throughout the paper, highlighting current challenges and some pointers to promising research directions and future developments. It is our expectation that this work can effectively support research and development connected with the usage of smart audio and video sensors to significantly aid railway maintenance by leveraging novel deep learning technologies.
¹¹ High ratio between different classes in terms of the number of samples.
ROBERTO NARDONE is an Assistant Professor at the University of Naples "Parthenope", Italy. He received his master's degree in computer engineering from the University of Naples Federico II, Italy, in 2009, and his PhD in computer and automation engineering from the same University in 2013. His research interests are in the area of V&V of critical systems and include quantitative evaluation of non-functional properties by means of formal methods and model-driven techniques.
CARLO SANSONE (M'12) is Full Professor of Computer Engineering at the University of Naples Federico II. His basic interests cover the areas of image analysis, pattern recognition and machine and deep learning. From an applicative point of view, his main contributions were in the fields of biomedical image analysis, biometrics and image forensics. He coordinated several projects, mainly in the areas of biomedical images interpretation and network intrusion detection. He was also coeditor of three special issues and of three books.
VALERIA VITTORINI is Associate Professor of computer engineering at the University of Naples Federico II. She has co-authored more than 100 papers in the areas of verification and validation of critical systems, physical protection of critical infrastructure, formal modeling, and model-driven engineering. She has recently been investigating the application of AI to the transportation sector, and in particular to railway systems. She currently serves as the Coordinator of the H2020 Shift2Rail project RAILS (Roadmaps for A.I. integration in the raiL Sector).
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3183102