Assessing Identifiability in Airport Delay Propagation Roles Through Deep Learning Classification

Delays in air transport can be seen as the result of two independent contributions, respectively stemming from the local dynamics of each airport and from a global propagation process; yet, assessing the relative importance of these two aspects in the final behaviour of the system is a challenging task. We here propose the use of the score obtained in a classification task, performed over vectors representing the profiles of delays at each airport, as a way of assessing their identifiability. We show how Deep Learning models are able to recognise airports with high precision, thus suggesting that delays are defined more by the characteristics of each airport than by the global network effects. This identifiability is higher for large and highly connected airports, constant through years, but modulated by season and geographical location. We finally discuss some operational implications of this approach.


I. INTRODUCTION
Air transport delays, and specifically the phenomenon of delay propagation, is one of the most important research topics in air transport management, due to delays' profound implications in the cost-efficiency [1] and safety of the system [2], as well as their contribution to the negative impact of air transport on the environment [3]. Research works on air transport delays can be divided in two (partially overlapping) groups, depending on whether they focus on understanding the mechanisms behind delay propagation, or on predicting their appearance. The first group can further be divided among those works in which the propagation is modelled by means of large-scale simulations, as for instance in [4]- [8]; and those that rely on historical data to extract some properties of the propagation, as e.g. in [9]- [12]. With respect to the second group, the use of data mining and machine learning models, and specifically on tasks related to The associate editor coordinating the review of this manuscript and approving it for publication was Hao Luo . the prediction of delays, has a long history, possibly fostered by the operational relevance of such models. To illustrate, such models have extensively been used in the last two decades to predict delays [13]- [17]. Narrower models have also been proposed, focusing on identifying and estimating the impact of some operational elements on delays, e.g. of adverse weather events [18]- [21].
In more recent years a new trend is emerging: the use of Deep Learning (DL) models [22], [23], i.e. machine learning models not requiring an a priori definition of features, to predict the occurrence and magnitude of delays. The idea is thus to train a model using data that are not strongly pre-processed; on the contrary, the definition and selection of high-level features is performed in an automatic way by the model. The possibly first application of DL to delay prediction was proposed by Kim and co-authors, in which the sequences of departure and arrival flight delays of an airport were predicted using a Long Short-Term Memory network architecture, using input features like the delay of previous flights and the weather condition [24]. Numerous new studies have followed this initial work, mainly focusing on increasing the spectrum of information fed in the models: from micro-scale meteorological conditions [25]- [28], reasons of previous delays [29], airline and flights connection structure [26], [28], [30], [31], airport crowdedness [26], [32], to aircraft trajectories [33] and airspace structure [32]. The interested reader can refer to [34] for a review on the use of data analysis in the study of air transport delays.
In this study we propose a bridge between both groups, by analysing the role of airports in the delay propagation process through the use of data mining, and specifically of Deep Learning, models. The starting hypothesis is that, given time series representing the observed dynamics of a set of elements (here, airports), the score of a classification task aimed at distinguishing them can be used as a metric of their dissimilarity -or, in other words, of their respective identifiability. In other words, if the classifier model is able to successfully discriminate between the two elements, it can then be concluded that there are differences between them, even though the exact nature of such differences is not readily available. Note that this is conceptually different from predicting the magnitude of future delays, as commonly done through DL models; and is instead closer to the problem of identity recognition [35]. The choice of DL models, as opposed to classical machine learning ones, is justified by the fact that i) they are extremely sensitive, i.e. they are able to detect even subtle and complex differences between data sets; and ii) their main drawback, i.e. their black-box nature, is not a limitation for our objective. While the use of DL models for identifying relationships and couplings between pairs of time series is not new (see [36]- [38]), to the best of our knowledge this is the first time such problem is tackled through the concept of identifiability.
We specifically analyse the delay profiles of the top-30 European airports from 2015 to 2018, where the delay profile is here defined as the normalised average hourly delay observed during one day of operation. By performing a classification task on pairs of airports, we are able to define their identifiability, i.e. a metric describing how uniquely are delays distributed across the day for each airport in the set. Two alternative scenarios can then emerge: airports can have similar (or identical) profiles, i.e. low identifiability; or, on the other hand, unique ones and hence a high identifiability. In the first case, this would imply that delays are a global property of the system, i.e. that their appearance is independent on the considered airport, or that their generation is driven by some systemic properties of air transport. On the other hand, unique delay profiles imply that delays are a local property, more the result of the dynamics and rules of each airport than of the global system.
The results here presented suggest an intermediate and complex situation. Generally speaking, airports are highly identifiable, i.e. their delay profiles are unique. Delays thus seem to be a local property; or, at least, characteristic dynamics at each airport dominate over global network effects. This identifiability is nevertheless not homogeneous and, on one hand, increases with connectivity, i.e. large and highly connected airports are more unique; and, on the other hand, reduces with geographical proximity, such that near airports tend to share similar profiles. We further analyse how this identifiability changed over time, and how it is affected by the winter and summer seasons.
The remainder of the work is organised as follows. Sec. II presents a simplified synthetic model of the identifiability of a set of networking elements, aimed at clarifying the interpretation of subsequent results. Sec. III then introduces the main materials and methods of this work, including the used flight data set, the Deep Learning models used in the classification, and the optimisation of the classification parameters. Results are presented in Sec. IV, including the study of the identifiability of airports, their geographical distribution, and their evolution through time. Conclusions on these results and ideas for future works are finally drawn in Sec. V.

II. A SYNTHETIC MODEL OF IDENTIFIABILITY
Let us suppose a system composed of three coupled elements a, b and c (in the case here considered, these would be three airports), each one described by an observable metric through time (here, average hourly delay) -see left part of Fig. 1 for a graphical representation. In each realisation of the system, 24 values are observed, thus yielding three time series x a (t), x b (t) and x c (t), with t = 1, . . . , 24; note that each realisation is assumed to be independent from the other ones. These time series are further defined as the sum of three components: The first part (i.e. v a , v b and v c ) represents a constant dynamic observed at each element independently of the specific realisation, or, in other words, what is characteristic of that element. To this, a random component is added, in the form of random numbers drawn from a normal distribution N of zero mean and 0.1 standard deviation -note that, without loss of generality, vectors v are expected to be defined between zero and one, hence this noise is 10% of them. This random component thus represents the variability that is observed across different realisations -or across different days. Increasing (or decreasing) this noise only changes the final classification score, as a large noise will mask the underlying dynamics; but the conclusions drawn from the model are independent from the selected noise level. Finally, each element receives contributions from the other two, through a coupling constant γ . While, for the sake of simplicity, this parameter is considered equal for all pairs of interactions, a more realistic scenario should include six parameters γ a,b , γ b,a , γ a,c , γ c,a , γ b,c , and γ c,b ; results should then be represented in a six-dimensional space. In synthesis, the behaviour of each element is given by a characteristic and (almost identical) repetitive dynamic, plus a tuneable contribution from the two neighbours.
Once a value of γ has been fixed, 100 random realisations have been obtained, thus obtaining 100 vectors x(t) for each element. Note that this is equivalent to a data set comprising the average hourly delays of three airports along 100 days, as will be considered in the next section; still, results here reported only correspond to the previously described synthetic model, and not to real data. Vectors of elements b and c were then classified, using the Residual Network (ResNet) Deep Learning algorithm that will be described in detail in Sec. III-C. The final classification score, measured through the accuracy metric, has finally been calculated. Note that this classification score represents the identifiability of elements b and c, as these two elements can correctly be classified only if they are different in a characteristic way. As discussed in the introduction, the use of Deep Learning models is justified by the fact that they yield the best classification scores, i.e. they are able to detect the subtlest differences between the elements. Also note that a full identifiability analysis, as the one performed in the following sections, would require a similar classification task between all pairs of elements; the analysis is here limited to a single pair for the sake of clarity. Fig. 1, right panel and blue line, reports the median of the classification score as a function of the coupling γ . For low values of the coupling constant γ , the behaviour of each element is dominated by its own characteristic dynamic v; as v b (t) = v c (t), it is in general possible to distinguish x b from x c , and the classification yields a score close to 1. On the other hand, values of γ approaching 1 imply that the behaviour of each element is virtually indistinguishable from those of the other ones, as all dynamics are mixed, reaching what known in statistical physics as a mean field; hence no classification is possible. Finally, a perfect classification is recovered for γ > 1.2. In this latter case, the dynamics of each airport is dominated by the dynamics of the remaining ones; in other words, airport b is mainly defined by v a + v c , airport c by v a + v b , and hence x b = x c .
When the system is expanded to include 30 airports, similar results are still observed, see the red line in the same panel.
The larger number of elements, and hence of connections, nevertheless acts as a global noise, which reduces the coupling required to loose the identifiability of the elements. It further becomes impossible to identify elements for γ > 1, as now the dynamics of each airport becomes the result of the sum of a large number of independent contributions, i.e. it effectively becomes unpredictable. Most importantly, it can be appreciated that a high classification score (i.e. a high identifiability) is still obtained for low values of γ ; in other words, the size of the system does not affect the fact that elements are identified as long as they are not strongly coupled.
This simple model illustrates how the identifiability of a set of elements coupled together is the result of two contributions: how unique the dynamic of each element is, and how tightly they are coupled together. While such unambiguous conclusions cannot in principle be drawn when studying a real system, as more elements may affect the identifiability of elements, this synthetic model still suggests some interesting ideas. High classification scores will always imply high identifiability, as elements have dynamics that can be recognised; and this is usually the result of the unique dynamic of each element dominating over a global network signal. On the other hand, low classification scores (low identifiability) suggest that pairs of elements are coupled together in a tight way, such that they have a shared dynamics. It is additionally worth noting that in a real system the coupling γ i,j can be different for each pair of elements (i, j); in this case, elements can be identifiable even when having similar internal dynamics, provided their coupling patterns differ enough. For the sake of completeness, a final possibility can also emerge, not relevant for the present study: elements may not be identifiable when they are both disconnected and lacking an individual dynamic, i.e. v i = v j and γ i,j = 0 for all i and j.

A. OPERATIONS AND DELAY DATA
Data about air transport operations have been extracted from the EUROCONTROL's R&D Data Archive, a public repository of historical flights made available for research purposes, and freely accessible at https://www.eurocontrol.int/ dashboard/rnd-data-archive. It includes information about all commercial flights operating in and over Europe, completed with flight plans, radar data, and associated airspace structure. While it is limited to four months (i.e. March, June, September and December) of four years (2015-2018), it provides a good starting point to analyse both the structure of operations in Europe and the corresponding evolution.
In this study, we considered the information associated with the 30 largest airports in Europe, ranked according to their number of passengers. Table 1 reports the full list, and information about the number of landings and delayed flights. Additionally, Table 2 describes the evolution through time of some basic network metrics. For each flight landing at these 30 airports, the corresponding delay has been calculated as the difference between the actual (from the ATFM-updated flight plan) and the planned (according to the last filed flight plan) landing times. Afterwards, for each airport, flights have been grouped according to the actual landing hour, and the average delay per hour has been calculated. Subsequently, these time series have been split in windows of 24 hours, i.e. of the average hourly delay per airport per day.
In this work we are interested in the profile of delays, i.e. their distribution throughout the day, as opposed to the absolute value. In other words, we are interested in seeing if an airport suffered from delays homogeneously throughout the day or at some specific hours, and not in the value of the average delay at a given time. For that, average delays have been transformed through a Z-Score. Mathematically, given a time series (x 1 , x 2 , . . . , x 24 ), it is transformed according to: withx being the average and σ x the standard deviation. Therefore, a value of z i above (below) zero indicates that the average delay at time i was higher (respectively, lower) than what is observed throughout the same day.

B. WEATHER DATA
Each airport of Tab. 1 has further been characterised by the average climatic conditions that are expected in winter and summer; this information will be used in Sec. IV-B to explain changes in identifiability between these two operational seasons. Towards this aim, we have downloaded several average climate indicators for each airport from the corresponding city's Wikipedia page. These include: the average minimum winter temperature, as the average of the average minimum temperature observed in March and December (i.e. the two winter months available in the data set); the average minimum summer temperature, corresponding to June and September; the drop in temperature between summer and winter, i.e. the difference between the two previous values; and the average number of rainy days in winter (again, average between March and December).

C. CLASSIFICATION MODELS
Deep learning can be defined as a set of machine learning algorithms that progressively extract higher-level features from the raw input, usually with the objective of performing a supervised classification [22], [23]. Compared to standard machine learning classification models, deep learning presents the advantage of not assuming nor requiring a priori structures in the data, and of not requiring a pre-processing of features; in other words, features are automatically extracted from data, without human intervention. This results in a drastically higher efficiency, especially in complex problems for which features are difficult to be defined. On the other hand, this also implies high computational costs, and usually the need of dedicated hardware -as, e.g., general purpose graphics processing unit (GPGPU).
Here we consider a subset of deep learning algorithms designed to classify time series; in other words, given a set of time series, each one associated with a label, the objective is to assign the correct label to a new time series presented to the algorithm. While this is not one of the main focuses of deep learning, several models have been developed, usually evolutions of models designed for image classificationsee [39] for a full review. More specifically, the following five models have been used: • Multi Layer Perceptron (MLP). One of the most traditional and simplest form of neural networks, it is composed of a set of nodes organised in layers, each one receiving information from the previous layer and responding through a nonlinear activation function. Even though it does not encode temporal information, the MLP model has been proposed as a baseline architecture for classifying time series [40]. The network here considered is composed of 4 layers, each one fully connected to the outputs of the previous one, and with the final layer being a softmax classifier. The activation function is the well-known rectifier linear unit (ReLU) [41].
• Convolutional Neural Network (CNN). Convolutional networks are specialised versions of MLP, in which the matrix multiplication is substituted by a convolution operation [42]. Their advantages include a space (or, in the case of time series, time) invariance [43], and a reduced tendency to overfitting. We here consider a simple convolutional model, composed of two convolutional layers followed by a final sigmoid classifier.
• Residual network (ResNet). Residual networks are inspired by the way pyramidal cells are organised in the cerebral cortex; specifically, the connections between layers are not sequential, but instead some layers can be skipped (creating shortcuts or jumps). This presents the advantage of a simpler structure, and consequently of a reduced training cost [44]. The networks here considered are composed of 11 layers, the first 9 of them being convolutional, followed by a global average pooling layer that averages the time series across the time dimension, and by a final softmax classifier, as proposed in [40].
• Fully Convolutional neural Network (FCN). FCNs are networks in which only convolution operations can be performed; in other words, they are equivalent to CNNs without fully connected layers [45]. The model is composed of three convolutional blocks, each one performing a convolution, a batch normalisation and a final activation. As a last step, the result of the third convolutional block is fed to a softmax classifier [40].
• Multi Channel Deep Convolutional Neural Network (MCDCNN). This model is based on a modified CNN, in which the convolutions are applied independently (in parallel) on each dimension (or channel) of the input multivariate time series [46], [47].
The five models have been implemented in Python 3.8.5 using the libraries TensorFlow [48] and Keras [49]. In each iteration of the classification problem, a random half of the available time series has been used for training, and the remaining half for evaluating the model, being thus equivalent to a two-fold cross-validation. Each model performance is finally measured through the corresponding accuracy score; note that other complementary metrics, as e.g. recall or F-score, are here redundant due to the use of a perfectly balanced data set.

D. INTERPRETING THE CLASSIFICATION SCORE
The output of any classification task described in this work is measured in terms of the accuracy, i.e. the number of correctly labeled instances divided by the total number of instances. As only problems involving two classes and with the same number of instances in each class are here considered, the accuracy is expected to be included between 0.5 and 1.0. The former case indicates that only half of the instances have been correctly labeled, which is also what expected if labels were assigned at random; in other words, the model is not able to detect any useful pattern in the data. On the other hand, an accuracy of 1.0 points towards a perfect classification; from the point of view of the model, the two groups of instances are clearly different.
It is straightforward to interpret this classification score as a metric of identifiability. Specifically, if the delay profiles of two airports can perfectly be classified, i.e. with an accuracy of 1.0, they then are substantially different, and it is possible to construct a model able to identify the airport corresponding to each delay profile without errors. It is worth noting that such identifiability metric is a lower bound of the real identifiability. On one hand, the fact that a given classification model is not able to identify differences between two groups does not imply that the two groups are equal -they may actually be identifiable by more advanced and complex models. On the other hand, a high classification accuracy implies that there are differences, at least as large as those identified by the considered model. The use of Deep Learning classification models, i.e. of the most accurate models presently available, guarantees that this lower bound is as close as possible to the real identifiability value.
The five models here considered are different in terms of their internal structure, and of what features in the time series they are able to detect; as such, they may perform differently when classifying different pairs of airports. In other words, it may happen that one model works well classifying two airports, but its efficiency may lower for another pair, as the features that were important in the first case are not so in the second. As the objective here is to assess the identifiability of airport delays, and not the presence of a specific feature, each classification is executed with all five models, for then retaining only the highest score.

E. PARAMETERS TUNING
There are two parameters that have to be set for each model, and that may strongly affect the outcome of the classification.
The first one is the number of epochs, i.e. the number of times the training is performed over all available data. This number controls a trade-off: the larger the number of epochs, the more accurate is the classification, albeit at an increased computational cost. Additionally, performing additional trainings beyond a certain level usually does not improve the results.
Secondly, one has to note that the training is a stochastic process: the original data set is split between training and validation at random; and the initial state of the neural network is also random. As a consequence, it is customary to repeat the whole training and validation process several times, and to finally average the result. The second parameter is thus the number of iterations, i.e. executions of the full training and evaluation cycle with random initial conditions. The larger this value, the more stable is the final result, yet again at the price of a larger computational cost. Fig. 2 reports the results of tuning these two parameters. Specifically, we have considered one pair of airports (Dublin Airport, DW, and Brussels Airport, EBBR), and performed the corresponding classification varying these parameters. The right panel of Fig. 2 indicates that 400 iterations is a value high enough to obtain very stable results with all models. On the other hand, the left panel presents a more complex situation. Some models, as e.g. MCDCNN, only require a few epochs; others, specifically CNN and MLP, only saturate for thousands of epochs. As a compromise between precision and computational cost, the following number of epochs have been selected: 600 for CNN and MLP, and 200 for Resnet, FCN and MCDCNN.

F. TOPOLOGICAL METRICS
In order to evaluate how the identifiability of each airport is modulated by its connectivity inside the network, the following six metrics have been calculated for each one of them. These are based on a network analysis [50], [51], in which each airport is represented by a node, and nodes are VOLUME 10, 2022 pairwise connected by links whenever a direct flight exists between the corresponding airports [52].

Number of flights.
Total number of flights landed at the considered airport, as reported in the available data set, irrespectively of their origin. Number of destinations. Number of unique airports which the considered airport is connected to, according to all available flights. Clustering coefficient. The local clustering coefficient of a node quantifies the tendency of the neighbours of a node to form a clique, i.e. a complete graph [53]. More specifically, it is defined as the proportion of the number of links between the nodes within the neighbourhood of the considered airport, divided by the number of links that could possibly exist between them. Weight centrality. Measure of the centrality (i.e. importance) of each airport, and defined as the sum of all flights in it landing, divided by the total number of flights. Eigenvector centrality. Measure of node centrality, according to which the centrality of a node is proportional to the sum of the centralities of nodes to it connected. Closeness centrality. Node centrality measure calculated as the reciprocal of the sum of the length of the shortest paths between the considered node and all other nodes in the network.
Note that, in order to simplify the comparison between different metrics, all three centralities have been normalised between zero and one, with one being the centrality of the most central airport.

A. IDENTIFIABILITY OF AIRPORTS
We start by analysing the identifiability of airports in Fig. 3, understood as the score of the classification of hourly delay profile. Firstly, the top left panel reports the score obtained for all couples of airports. It can be appreciated that most pairs of airports yield a very good classification, in all cases above 70%. This is well above what obtained for the same classification problem when the data of each airport are randomly shuffled in order to destroy any characteristic pattern (average score of 61.3 ± 9.3%), thus confirming the statistically significance of these results. Additionally, the top right panel reports the fraction of times each model yields the highest classification score; ResNet is the clear winner, followed by CNN. The bottom panel of Fig. 3, left Y axis, depicts the average classification score for each airport, i.e. the classification score averaged over all pairs including that airport. All airports are highly identifiable, and especially London Heathrow (EGLL), Zurich (LSZH) and Geneva (LSGG). In order to exclude that these results could be biased by the use of a pairwise classification task, scores are also reported for a task in which the profiles of one airport are compared to a random selection of profiles of all other airports (Fig. 4, left panel); and a task in which all airports are considered at the same time, thus corresponding to a 30-classes classification (Fig. 4, right panel). While scores are generally lower, as is to be expected due to the higher complexity of these tasks, airports with the highest scores coincide with those of Fig. 3. These 1-to-all classification tasks can also be used to extract class-discriminative vectors, i.e. vectors representing what parts of the delay profiles are more characteristic of each airport, see Appendix A. Additionally, the 30-classes classification yields a score (0.446 ± 0.0076) significantly higher than what would be expected if results were random (1/30 ≈ 0.033).
In order to understand whether such identifiability is associated to some airport properties, the right Y axis of Fig. 3 reports the corresponding number of flights and number of connected airports. This is further explored in Fig. 5, depicting scatter plots of the classification score of each airport as a function of the six metrics defined in Sec. III-F, the corresponding linear fits, and the resulting ρ and p-values. While the effect is not very strong (and seldom statistically significant for α = 0.05), large and highly connected airports are more identifiable; in other words, small airports seem to have more common patterns of delays, while those of large airports are more unique. Still, a linear model using the six metrics to predict the classification score only reaches an R 2 = 0.377; this indicates that the metrics are highly correlated, and that the identifiability of each airport only marginally depends on them.
We afterwards study the structure created by such pairwise identifiability, with the objective to understand if there are clusters (or communities) of highly similar airports. For that, the pair-wise identifiability (as depicted in Fig. 3 top left panel) has been transformed into a similarity using the standard metric s i,j = (1 − I i,j )/2, I i,j denoting the identifiability of airports i and j. Afterwards, those similarities have been interpreted as weights of the network links, in which nodes represent airports and links the pairwise similarity. Finally, the celebrated Louvain algorithm for community detection [54] has been applied. The results of this community analysis are presented in Table 3; additionally, the three largest communities are also plotted on the European map in Fig. 6. Most notably, communities are not randomly distributed in space, but instead seem to be related to geographical regions -e.g., the largest one (red) to the central Europe, the second one (yellow) to UK and Belgium, etc. Prima facie, this may seem to be associated to time zones, such that two airports may have the same delay profile because most delays occur at the same (local) time. Nevertheless, this is disproved by several cases, e.g. Brussels, which does not share time zone with UK, or Prague, which shares time zone with Germany.
Another possibility is the existence of medium-scale weather patterns, that create disruptions across large regions and hence force the delay profiles of several airports to some common dynamics. This would explain why, for instance, a similar pattern is observed in UK and northern Europe. The fact that not all airports in those regions are included can be explained by considering their different role in air transport; to illustrate, and following the previous example, London Heathrow, Paris Charles de Gaulle and Amsterdam Schiphol may not be included in the yellow community due to their significantly higher traffic, which results in specific delay patters -as also illustrated in Fig. 5.

B. INVARIANCE OF IDENTIFIABILITY THROUGH TIME
One natural question is whether airport delay profiles are stable through time, or, on the contrary, have changed throughout the years. In order to assess this point, we here perform a classification in which the delay profiles of an airport are organised in two groups, one for years 2015 and 2016, and a second one for years 2017 and 2018. A high classification score would thus indicate that delays have changed through the four years. The result, depicted in Fig. 7, indicates that this is not the case, and that most airports have maintained a similar delay profile. VOLUME 10, 2022 FIGURE 5. Scatter plots of the identifiability of each airport, represented by its average classification score, vs. the six metrics defined in Sec. III-F. The red lines represent the best linear fit; and each panel further includes the ρ and the p-value of each fit.

TABLE 3.
List of communities, obtained from the structure of similarities between them, and ordered by size. The color in parenthesis for the three largest communities corresponds to their color in Fig. 6.   FIGURE 6. Map of the spatial location of the three largest communities, as listed in Table 3. Airports included in the analysis, but not belonging to these three communities, are marked in grey.
As in the previous section, we analysed whether this time-dependent identifiability can be explained by metrics representing the dynamics of airports. The difference is that, here, metrics must represent variations between the two groups of years; for that, given a metric m, its variation is calculated as m = We further analysed whether the delay profiles of each airport changes between summer and winter, through a classification task involving days in March and December on one hand, and June and September on the other. Results, reported in Fig. 8, indicate that there is a generally strong difference in the delays of different seasons; yet, such difference is not driven by changes in the number of   flights or of destinations (blue and green lines). We further checked whether such difference may be due to weather patterns, using the weather data described in Sec. III-B, and by performing a linear regression between the score of the summer/winter classification and the weather values. No statistically significant relation was found with the drop in minimum temperature between summer and winter (ρ = 0.0452, p-value = 0.8157), average minimum winter temperature (ρ = −0.0591, p-value = 0.7605), and average number of rainy days in winter (ρ = 0.2700, p-value = 0.1917).
As a final issue, we have studied how the identifiability of each airport, as depicted in Fig. 3 bottom panel, has evolved through time. For this, for each pairs of airports, four VOLUME 10, 2022 different classification tasks have been performed, each one using data of one single year. The result, as shown in Fig. 9, is thus four scores for each airport, each one quantifying the identifiability of the airport's delays in one given year. Note that this is complementary to what presented in Fig. 7, as there the focus was the identifiability of each airport from itself in two time intervals, and here is of each airport from the other ones. It can be appreciated that variations across years are minimal, with no clear trend. Also, the fact that the identifiability by year is always lower than the global one (the latter represented by the horizontal dashed lines) is easily explained by the fact that each classification task can rely on one forth of the data.

V. DISCUSSION AND CONCLUSION
In this work we have leveraged on the possibilities offered by Deep Learning to create an identifiability index for the top-30 European airports, i.e. a metric describing how uniquely distributed are average delays throughout the day. The main obtained result is that airports are actually highly identifiable and unique, with pair-wise classification scores almost reaching a 100% precision -see Fig. 3. While it is tempting to conclude that delays are a local phenomenon, i.e. only driven by local dynamics and operational constrains, several caveats have to be highlighted.
First of all, high identifiability does not preclude a propagation process, i.e. the diffusion of delays throughout the system through secondary (or reactionary) delays. Both aspects are compatible if received delays are different enough across different airport, as seen in Fig. 1. They are also compatible if one allows for received delays to be processed (or dealt with) by each airport through different (i.e. local) mechanisms. To illustrate, two airports may receive the same amount of delays from other airports; yet, the resulting dynamics can be different due to, e.g., different equipments and operational buffers. Therefore, what here presented does not contradict the large body of literature studying the propagation of delays [7], [11], [55]- [59]; instead, it suggests that propagation itself cannot be considered as a homogeneous phenomenon.
Secondly, this global identifiability is not homogeneous, but is instead modulated by different factors. On one hand, it is positively (albeit weakly) correlated with the traffic volume and connectivity of each airport -see Fig. 5. Thus, being strongly integrated in the network, i.e. receiving delays from multiple and heterogeneous sources, makes an airport more unique. On the other hand, airports located in some geographical regions seem to be more similar (see Fig. 6), possibly because their operations are modulated by common weather patterns.
Thirdly, when all flights throughout one year are considered, airport identifiability did not substantially change between the first and last year analysed. On the other hand, delay profiles were different (hence identifiable) between the winter and summer seasons of the same airport. This suggests that the change in network connectivity between winter and summer produces different, albeit characteristic, delay patterns in each season.
Irrespectively on how this identifiability originates, the fact that airports are characterised by fairly unique delay profiles leads to important conclusions. Delays (at least at the aggregated level here considered) are mostly defined by their past, as opposed to other external factors. In other words, it has previously been shown that predicting the delay of an individual flight requires taking into account aspects like local weather patters, and airport and airspace crowdedness [34]. Results here presented instead show that the average delay per hour can be estimated by only knowing the delay evolution of the same airport in previous days, as this evolution will be unique to that airport, and hence different from those of other airports. Weather and other elements, while useful to reduce the prediction error, are therefore not the main features. Most notably, this also applies to network effects: while the delay of individual flights can only correctly be predicted by taking into account the status (e.g. delays and crowdedness) of other airports and of the system as a whole [26], [32], [33], [60], the aggregated dynamics at an airport can be described without it. To make a parallelism with statistical physics, macroscopic observables of a system (e.g. the pressure or the temperature of a gas) can be studied without explicit knowledge of the individual elements composing it (e.g. position and velocity of all gas' particles).
From an operational perspective, this implies that, firstly, delays can be predicted, and hence acted upon. This stems from the fact that, if delays were unpredictable and were randomly changing across different days, they would not form consistent and identifiable profiles. This is of course in agreement with the large body of Literature dealing with delay prediction [13]- [17], [34]. Secondly, and more importantly, that these predictions and interventions aimed at reducing delays must be local in nature, as the delay profile at one airport (and hence, the evolution of delays through time) encodes most of the information about such delay dynamics. In other words, average delays at a given airport are largely predictable without looking at the network status. These results also suggest that airport identifiability could be used as a metric to measure how efficiently delays are managed throughout the system, and how intensely is the local dynamics of airports dictated by the global propagation.
As a final point, several limitations of the present study have to be highlighted. The calculation of delays has been performed with the data available in the EUROCONTROL's R&D Data Archive. While it provides a complete and official source of information about all flights operating in Europe, the planned time of landing corresponds to the last filed flight plan, and not to the original intentions of the airlines. This may result in a bias (specifically, a decrease) in the estimated delay at landing. Also, delays have been calculated per flight, i.e. disregarding the number of passengers that were affected and how these delays may disrupt multi-leg trips [61]- [64]; in other words, only the magnitude of delays, and not their FIGURE 10. Graphical representation of the class-discriminative vector (solid red lines, right Y axes) for each airport, as obtained by the Gradient-weighted Class Activation Mapping (Grad-CAM) approach [74]. Dashed black lines (left Y axes) represent examples of the delay profiles of each airport. VOLUME 10, 2022 importance, has here been considered. Regarding available data, limited information about the status of the airports and of the system in general has been included in the analysis; on the other hand, using data like flight connectivities, airspace structure and crowdedness may help explaining the origin of the identifiability, as previously shown for delay prediction [26], [34]. While delay propagation is a global process involving multiple airports simultaneously, such process as here been represented as a set of pairwise interactions. This is customary in network science, and specifically in the reconstruction of functional networks [65], [66]. Still, higher dimensional options have recently been proposed [67], [68], which could in principle yield more complete representations of the propagation dynamics. It has to finally be noted that, while the strength of the interactions was represented by the parameter γ in the model, it is not possible to derive a real interaction strength from the available data. While it may be tempting to state that, given a pair of airports, their associated γ should be inversely proportional to the corresponding pairwise identifiability, this would not take into account that the airport dynamics is in reality not constant, and it may be changed by random events or different traffic patterns. While outside the scope of this work, a more precise identification of all interaction γ s may be possible through the use of recent results on the dynamics of coupled systems, e.g. [65], [69]- [71].

APPENDIX A VISUALISATION OF AIRPORT CHARACTERISTIC PROFILES
In spite of the unavoidable black-box nature of Deep Learning models, several attempts have been made in recent years to extract intuitive and understandable components from them, in order to allow the practitioner to understand why a given model predicted what it predicted -see [72], [73] for reviews on the topic. We here applied the Gradient-weighted Class Activation Mapping (Grad-CAM) approach [74] to the 1-to-all classification problems presented in Fig. 4. In synthesis, a ResNet model is trained to recognise the delay profiles of one airport against a pool of profiles of all other airports; the Grad-CAM approach is then used to retrieve a class-discriminative vector, i.e. a vector representing how important is the value observed at a given time of the day for recognising such airport. The vectors for the 30 airports here considered are depicted in Fig. 10  She is currently an Assistant Professor with the Faculty of Computer Science and Engineering, Institute for Intelligent Systems, Ss. Cyril and Methodius University. Her current work in teaching includes courses, such as algorithms and data structures, object-oriented analysis and design, discrete mathematics, information systems, and intelligent information systems. Areas of her scientific research interests include machine learning, complex networks, and bioinformatics.
LUISINA PASTORINO was born in Argentina, in 1990. She graduated in industrial engineering from the National University of Rosario, Argentina, in 2016. She received the master's degree in big data analysis in economics and business from the University of the Balearic Islands, Spain, in 2021. She is currently pursuing the Ph.D. degree with the Institute for Cross-Disciplinary Physics and Complex Systems, Palma de Mallorca, Spain. She focuses on the study of the propagation of delays in the air transport system from an information processing perspective within the ARCTIC project.
MASSIMILIANO ZANIN was born in Verona, Italy, in 1982. He received the Ph.D. degree in computer engineering from the Universidade Nova de Lisboa, Portugal, in 2014.
He is currently a Researcher with the Institute for Cross-Disciplinary Physics and Complex Systems, Palma de Mallorca, Spain. His main topics of interests include complex networks and data science, both from a theoretical perspective and through their application to several real-world problems. Throughout his career, he has published more than 80 articles in international journals and more than 50 contributions in conferences, reaching an H-index of 25. He has participated in several European competitive research projects, and he is currently a PI of the ERC StG ARCTIC, on the analysis and modelling of delay propagation in air transport.