Missing Traffic Data Imputation for Artificial Intelligence in Intelligent Transportation Systems: Review of Methods, Limitations, and Challenges

Missing data in Intelligent Transportation Systems (ITS) can introduce errors into the analysis of traffic data, and applying Artificial Intelligence (AI) in these circumstances can mitigate such problems. Past works focused only on specific data imputation methods, such as tensor factorization or a specific neural network model. While there are review papers covering singular topics regarding missing data, to the best of our knowledge there are none in the field of traffic that introduce the process of traffic data collection and the viability of the traffic data collected while also broadly covering the popularly used models of recent years. This has led to non-uniformity of the terms used in missing data imputation, limited research in areas where datasets are not available, and a narrowed view of the methods used for data imputation. Hence, this paper aims to standardize the terms used in missing data classifications, look into the limitations of using available public or private datasets for urban traffic research, and discuss popular statistical and data-driven methods used by recent AI and ITS papers. It was found that tensor decomposition-based methods are the most popular for missing data imputation, followed by Generative Adversarial Networks and Graph Neural Networks, all of which rely on a large training dataset. Meanwhile, Probabilistic Principal Component Analysis (PPCA) methods provide valuable insights via traffic analysis and are used for real-time traffic imputation. This paper also highlights the need for more efficient and reliable methods for traffic data collection, such as online APIs.


I. INTRODUCTION
Missing data is a prevalent problem in many fields of study, and Intelligent Transportation Systems (ITS) is one of them. As the number of vehicles on the road continues to increase yearly, the importance of improving the existing ITS framework continues to grow as well. Hence, there has been much research in the field of traffic modeling, prediction, and routing, among others. All this research can be done thanks to the availability of traffic data or access to traffic data collection tools. However, these traffic data could be missing, possibly due to a sensor malfunction or connection errors between the sensor and the system. Hence, these missing data pose a major obstacle in traffic research, as they would introduce errors or biases in the results if not handled appropriately.
The associate editor coordinating the review of this manuscript and approving it for publication was Jjun Cheng.
Historically, such missing data are handled via historical averaging, deletion-based methods, and other relatively basic statistical methods [1]. However, these methods tend to result in other problems, such as incorrect data size or unnatural data patterns due to deleted data. Hence, researchers started to investigate missing traffic data imputation using better methods.
Over the years, there have been studies proposing various missing traffic data imputation methods, as shown by the many reviews from more than ten years ago [1] to recent times [2], which shows how crucial missing data imputation is to the future of a well-developed ITS. Recent reviews such as [3], [4], [5], and [6] tend to focus on a single aspect of missing traffic data imputation and the methods related to it. Their in-depth detail makes them very suitable for investigating the improvements made and for finding detailed explanations of the reviewed methods alongside the authors' insight. However, focusing on a single aspect leaves other aspects of missing traffic data in ITS under-reviewed beyond the popular methods of recent years, such as the limitations, possible challenges regarding data collection, and discussions of parameters and statistical methods that future researchers could use or investigate.
Besides that, while there are many traffic studies that have studied different kinds of missing data, the classifications of the types of missing data tend to be somewhat vague outside of random missing data, which by itself technically has three different classifications on its own. For example, the definition of block missing in [7] coincides with the definition of what is generally known as fiber missing. This paper intends to define and classify these different missing data types into three categories for the purpose of easing future traffic research.
Additionally, this paper aims to introduce the different data collection methods and their feasibility when researching a detailed urban network.
Finally, the paper reviews the few popular methods many researchers employ when dealing with missing traffic data to provide a general idea of the popularly selected model used as the base (e.g., Deep Neural Network, tensor decomposition, etc.) as well as investigate the other parameters applied to the research. Topics such as whether rural or urban road networks were used, the classification of the missing data the proposed research aimed to solve, as well as other possible limitations, are discussed in this paper.
To clarify, the objective of this paper is threefold: i) to provide a generalized classification for the different types of missing data to allow for better identification, ii) to introduce the popular data collection methods and their weaknesses when researching detailed urban road networks, while providing another, less used avenue of data collection, and iii) to review the popular missing data imputation methods, their common design choices, and their future potential.
The paper is organized as follows: Section II discusses the literature reviews and research gap. Section III discusses traffic data retrieval and the type of missing data faced by researchers. Section IV reviews the various popular methods from the statistical and data-driven models. Section V discusses the popular design choices used in conjunction with the base data imputation model. Section VI covers the challenges and limitations. Finally, Section VII concludes the paper, and Section VIII is the acknowledgment of contribution.

II. SIMILAR WORKS
Missing data as a whole has been studied extensively over the years, and several recent reviews cover the topic of missing data imputation extensively. Reference [8] has reviewed missing data imputation techniques from 2006 to 2017, while [9] has reviewed techniques from 2010 to 2021. These two reviews have split the missing data imputation techniques into two types, statistical and machine learning-based methods, and looked into the distribution of studies done for each of the techniques and the evaluation methods considered. It is interesting to note that both [8] and [9] have classified missing data only as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the field of traffic, and likely for time series or spatial datasets, there are more than just these classifications of missing data types, as explained later on.
Other missing data review papers, such as [10], have compared different missing traffic data imputation methods, namely prediction, interpolation, and statistical learning methods, and concluded that the PPCA-based (Probabilistic Principal Component Analysis) methods perform the best overall in terms of accuracy and computational complexity. In addition, [11] has compared the performance of variations of other existing statistical methods such as linear regression, Predictive Mean Matching (PMM), and mean imputation, while also comparing regression tree-based methods such as Classification and Regression Trees (CART) and Random Forest. The conclusion is that the random forest implementation performed the best.
Looking into reviews done more specifically in the field of traffic, [2] provided a summary of the methods of traffic data collection, splitting them into fixed and mobile types, and explained the classification of various missing data types along with traffic imputation methods. Meanwhile, [5] reviewed temporal data imputation methods specifically, providing a more in-depth analysis of the state-of-the-art data imputation methods that utilized only the temporal aspect of traffic, covering their application conditions and limitations, as well as providing a list of popular public datasets. Reference [4] focused on traffic state estimation in urban road networks, where data are missing for some segments because detectors are either absent due to installation costs or faulty, with a focus on methods that fuse multiple sources of data into their models.
In these existing works, it is noted several times that while random missing data has been tested quite often, research that simulates non-random missing data due to situations such as faulty detectors is significantly less common. Also, the authors would like to note that many public datasets, such as PeMS [12], are freeway traffic datasets, which do not equate to an urban traffic environment, as also mentioned by [5] and [4]. Certain studies may make use of road segments in an urban environment [11], but a few individual roads are not representative of urban traffic as a whole. In reality, urban networks are more likely in need of such research, and this paper aims to review recent papers whose work covers urban networks.
Also, data acquisition can be difficult, depending on the country. As can be seen in later sections, many urban network datasets utilize taxi GPS datasets either from the public or private sector. It should be noted, however, that such methods may not be available for all countries and locations of interest, which would result in certain ambiguities when it comes to the viability of the various proposed methods in said locations.
FIGURE 1 summarizes the flow of the following sections. This paper first discusses the various common traffic data collection methods and categorizes missing data types into three categories, namely random missing, fiber missing, and block missing, before going into commonly used data pre-processing methods. Secondly, the paper looks into popular research methods, broadly categorized into statistical, machine learning, and ensemble methods. This paper reviews the type of road networks used in the various recent research and the types of missing data scenarios tested, and investigates the commonly used fundamental and auxiliary methods proposed. Notable design decisions are then mentioned in the following section. Finally, this paper also discusses the potential limitations of previously available datasets and how future researchers should investigate a more flexible yet easily accessible source of traffic data, along with emphasizing a focus on the scalability of models and their interpretability and robustness, besides purely accuracy-focused models.

III. TRAFFIC DATA RETRIEVAL
For any form of traffic management effort to succeed, the acquisition of traffic data is essential. Only by utilizing these data can the ITS process, learn, predict, and resolve the traffic issues it oversees. While some literature reviews include reviews of public datasets [5], they focus on the effect these datasets would have on the models rather than covering the different methods of traffic data retrieval. The aim of this section is to provide insight into the different methods of traffic data collection, a definition of the types of missing data, and possible data pre-processing methods that could be used to augment a limited dataset. Various factors need to be considered when handling traffic data: A) data collection method, B) types of missing data, and C) data pre-processing.

A. DATA COLLECTION METHOD
Acquisition of traffic data can be made via several methods. The most used methods are access to public datasets or publicly available sensor data, such as those discussed below. Another method, rarely seen in studies, is the use of online traffic API services, which is also discussed.

1) SENSORS AND CAMERAS
Sensors such as induction loop detectors were employed by [13], [14], [15], [16], and [17] to collect real-time traffic data. The main reason for their frequent usage is that induction loop detectors perform well in vehicle counting in high and low-volume traffic under different weather conditions.
Besides that, studies such as [7], [18], [19], and [20] make use of street cameras and vehicle identification software in order to capture traffic data. Using cameras has the advantage of being able to analyze certain traffic parameters more accurately, such as the traffic flow, average gaps between vehicles during different traffic hours, as well as traffic accidents and other such events.
The drawback to such methods is that the user is limited to where the sensors and cameras are placed, making research into other areas or even larger urban networks not possible.

2) ONLINE SERVICES
An Application Programming Interface (API) is a software intermediary which enables the communication between two applications. Using APIs, an application can send a request to a server and receive a reply in the form of an output of the data interpreted by the corresponding server. By utilizing these applications, users can obtain information almost immediately. This is especially useful when an application requires real-time data, as is the case for GPS applications such as Google Maps or Waze. Examples of such services are Google Maps [21], Bing Maps [22], HERE Traffic API [23], and TomTom Traffic API [24]. Despite the flexibility of obtaining traffic data using these services, there is hardly any literature on missing traffic data imputation that makes use of them. This could be due to the location of interest, having other available sources of data, or the difficulty of collecting data over a period using the API service. However, it should be emphasized that as traffic research grows in technology and knowledge, so should their simulations, and [25] has shown that online traffic data can be a good indicator of traffic speed.
There has also been literature that compiled other available public datasets, such as [5], and has shown that these datasets tend towards freeways, highways, expressways, or limited signalized intersections. As shown here, there is a lack of public datasets with regard to urban traffic networks on a wider scale, as well as fewer public datasets outside of America and China, with few specific datasets in countries like Spain and England. This limits traffic studies for larger or more detailed urban networks and for areas located outside these few locations of interest.

B. MISSING DATA
The following subsections provide a standardized categorization of the types of missing data commonly experienced and how these missing data are manifested from the different types of data collection methods.

1) TYPES OF MISSING DATA
Existing missing data imputation studies have differing classifications for similar types of missing data. For example, [50] describes random missing data, along with two other types, namely univariate missing data and multivariate missing data. In other papers, such as [4], [51], and [52], univariate and multivariate would be named fiber and block or panel missing data, respectively. Other papers might have also given overlapping or different names for similar kinds of missing data, such as continuous missing data to represent fiber missing data [53].
For the sake of unification, these missing data types should be defined and generalized in order to help simplify the direction of future research. The general idea of the three categories is as described below and visualized in FIGURE 1:
Random Missing Data: Missing data caused by sporadic errors in transmission, for which there is little to no known correlation between the data loss and other variables. Results in missing data at random points in the dataset.
Fiber Missing Data: Missing data caused by a sudden, temporary failure in connectivity or in the data-capturing device. Results in missing data for a continuous length of time.
Block Missing Data: Missing data caused by the absence of a detector in the area of interest (e.g., a rarely used arterial road that does not justify the installation of a loop detector [4]) or by all sensors being out of operation for some reason. Results in completely missing data for the entire length of time over a long period, or completely missing data from all sources of information for the same time horizon. This is seen in datasets with multiple sources of data.
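The three categories can be made concrete as boolean masks over a sensor-by-time matrix, which is how missing data scenarios are commonly simulated. The following is a minimal sketch under stated assumptions: the function names, the matrix layout (rows are sensors, columns are time steps), and the synthetic speed values are all hypothetical and not taken from any reviewed paper, and only the "detector has no data at all" variant of block missing is modeled.

```python
import numpy as np

def random_missing(shape, rate, rng):
    """Random missing: each entry is dropped independently with probability `rate`."""
    return rng.random(shape) < rate

def fiber_missing(shape, n_fibers, length, rng):
    """Fiber missing: a sensor drops out for `length` consecutive time steps."""
    n_sensors, n_steps = shape
    mask = np.zeros(shape, dtype=bool)
    for _ in range(n_fibers):
        sensor = rng.integers(n_sensors)
        start = rng.integers(n_steps - length + 1)
        mask[sensor, start:start + length] = True
    return mask

def block_missing(shape, sensors):
    """Block missing: the listed sensors have no data for the whole horizon."""
    mask = np.zeros(shape, dtype=bool)
    mask[list(sensors), :] = True
    return mask

# Synthetic example: 6 sensors, 288 five-minute steps (values are made up).
rng = np.random.default_rng(0)
speeds = rng.uniform(20.0, 80.0, size=(6, 288))
mask = (random_missing(speeds.shape, 0.2, rng)
        | fiber_missing(speeds.shape, 2, 36, rng)
        | block_missing(speeds.shape, [5]))
observed = np.where(mask, np.nan, speeds)  # True in `mask` means "missing"
```

Keeping the mask separate from the data makes it straightforward to score an imputation method only on the entries that were artificially removed.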
While random missing data can be further broken down into three more types, namely Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), simulations are usually done in an MCAR situation, such as [54]. Reference [55] has also stated that MNAR is generally not considered. Hence, for most research, MCAR is the general test case, followed by fiber and block missing.
It is important to note that block missing data imputation is not researched much, probably due to the significant lack of data as well as some research deeming that the areas with these levels of missing data do not contribute much to the overall traffic state.

2) MISSING DATA IN DATA COLLECTION METHODS
It can be said that all three types of missing data can occur for all the data collection methods mentioned above. However, the impact of such cases may differ depending on the method.
When it comes to monitoring systems such as sensors and cameras, the main reason tends to be equipment malfunction and electrical breakdowns, which lead to lost or damaged data [55]. Early detection leads to this being a case of fiber missing data, and failure to do so causes it to devolve into block missing data. Public or private datasets which use similar monitoring systems would also be subject to similar issues. However, as the data has already been collected in the past, it is trivial to ignore datasets with missing data and select the ones for which the dataset is complete. Online traffic APIs might face similar issues, but applications like HERE Traffic API [23] or Google Maps [21] would have more than one source of data to ensure the integrity of their data, such as floating car data (FCD) or probe vehicle data from a fleet of connected vehicles via GPS services or applications [25], although even then, there are times when missing data can occur with an online service.
Random missing data can be caused by sporadic errors due to aged electrical parts or packet drops during the transmission of data, causing data loss or corruption for an element in the dataset. It tends to be spread out and is not obviously affected by the environment.

C. DATA PRE-PROCESSING
Data retrieved could sometimes require additional processing to reduce possible errors or noise for training and prediction, such as smoothing, outlier detection [56], or removal. Large datasets may require some form of data compression for scalability. Research done by [57] has proposed a data denoising and compression method based on wavelet transformation, along with the construction of a data model.
In cases where there is a lack of traffic data, data augmentation is also considered a way to generate additional data for model training purposes, such as the one conducted by [58]. As traffic data are usually time-series data, [59] has conducted an empirical survey on various time-series data augmentation methods and their suitable use cases. While data augmentation is useful in generating additional datasets, it is important to use it cautiously to avoid distorting the dataset as a whole. Missing data imputation is itself also considered a form of data pre-processing when it comes to traffic forecasting or routing models, but given the focus of this paper, data pre-processing is treated here as the processing done before the actual missing data imputation.

IV. RESEARCH METHODS
Past literature reviews a specific aspect of missing traffic data imputation, such as [3], [4], [5], and [6], usually focusing on the results but largely ignoring other aspects, such as the road networks or missing data types involved. This section reviews the popularly used methods, broadly categorized into two methods, along with looking into the type of road networks and missing data scenarios used in various literature.
There are generally two categories of missing data imputation methods: statistical and machine learning. Statistical methods refer to the more classical methods of utilizing mathematical models and statistical theories to impute the data, whereas machine learning makes use of modern computational power and big data to better learn the non-linear, latent features and patterns in a dataset and attempt to learn and output the most likely result based on an input from a similar dataset.

A. STATISTICAL METHODS
Statistical methods analyze the available data and aim to develop a model that best represents the original dataset. Unlike machine learning, which relies on big data to learn, statistical methods require less data, at the cost of being less robust in general.
There are various ways to handle missing data, as mentioned by [60], such as deletion-based methods, learning methods utilizing complete and incomplete data, as well as imputation methods. Mean smoothing has also been used in studies such as [61]. On the other hand, deletion-based methods tend to be avoided as deleting data may result in bias in the estimates and decrease the quality of the dataset itself [62]. Note that deletion-based methods and mean smoothing represent the simplest methods and are usually not used in missing traffic imputation studies.
With regard to learning methods, predictive mean matching (PMM) based on multiple imputation by chained equations (MICE) has been looked into in [63]. A later study compared variations of PMM methods, including MICE, Classification and Regression Trees (CART), Least Absolute Shrinkage and Selection Operator (LASSO), and random forest, and found the MissForest implementation of random forest to be the best performer [11]. It is noteworthy that random forest is considered a machine learning algorithm, which shows why machine learning tends to be researched more than statistical methods, especially in recent years.
Instead, two of the most popular methods for missing data imputation are Probabilistic Principal Component Analysis (PPCA) and tensor decomposition. These methods are explained below:

1) PROBABILISTIC PRINCIPAL COMPONENT ANALYSIS (PPCA)
The most commonly used statistical method when it comes to data imputation is the PPCA-based (Probabilistic Principal Component Analysis) model. PPCA is an extension of the Principal Component Analysis (PCA) method through the use of the expectation-maximization algorithm [64]. The resulting probabilistic model is better able to deal with missing data by treating it as not-yet-observed data [65].
Recently, [3] has excellently reviewed spatiotemporal PPCA-based data imputation methods in an urban network setting for traffic flow data. As expected, the accuracy of the PPCA-based model changes depending on its field of view, i.e., whether it is a network, sub-network, or single-point imputation. Interestingly, if the view is too large, the accuracy drops. It was found that for a more realistic use case, the sub-network PPCA-based model worked the best for an urban road network as it is within a reasonable range of detectors.
Focusing on real-time missing traffic data imputation, [66] has proposed a PPCA-based minimum data imputation optimization method that ignores certain missing data points that it deems not required to be imputed, along with simplification of the spatial correlation between road segments on the map. However, not every country has a well-built traffic infrastructure that would provide clear road segment data, thus hampering the effectiveness of data imputation methods that require the use of spatial data. Furthermore, although the effects may be small, missing data should be imputed to ensure the completeness of the data set and to prevent possible bias in prediction results down the line.
Reference [65] also conducted a case study on the PPCA model for traffic analysis, data imputation, and flow prediction, and while the missing data rates tested were not large (1.4%, 4%, and 33% missing data rates), it was found that the PPCA did not show a large degradation in performance when comparing the Weighted Mean Absolute Percentage Error (WMAPE) between the 1.4% and 33% missing rates (around a 1% drop in accuracy from 1.4% to 33%), which means it is overall robust. However, the initial WMAPE itself is rather high at around 14.75%. Despite that, the case study also exhibits the strength of statistical methods, namely the ability to conduct traffic analysis via a breakdown of its principal component scores. While it seems that PPCA was not used much for missing traffic data imputation, it should not be ignored due to its analytical ability, which could contribute to the advancements of itself as well as other techniques.
Additionally, a comparison between MICE and PPCA was made for missing data imputation in the healthcare sector [67], and PPCA was found to have performed better as well, further explaining why this method is one of the more popular statistical methods.
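To illustrate the idea of imputing through a low-dimensional principal subspace, the sketch below fills missing entries by repeatedly reconstructing the matrix from its leading principal components. It is a simplified stand-in and not full PPCA: it omits the probabilistic noise term and the EM updates of the cited models. The function names, the synthetic data, and the WMAPE definition (sum of absolute errors divided by the sum of absolute true values, one common convention) are assumptions made for illustration.

```python
import numpy as np

def pca_impute(x, rank=2, n_iter=50):
    """Fill NaNs in a (time x sensor) matrix by repeated rank-`rank` PCA reconstruction."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = np.where(missing, np.nanmean(x, axis=0), x)  # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank] + mu
        filled[missing] = low_rank[missing]  # observed entries are never changed
    return filled

def wmape(truth, estimate):
    """Assumed WMAPE convention: total absolute error over total absolute truth."""
    return np.abs(truth - estimate).sum() / np.abs(truth).sum()

# Synthetic data: 200 time steps x 8 sensors driven by two shared patterns.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
profiles = np.column_stack([np.sin(t), np.cos(2.0 * t)])
truth = 50.0 + 10.0 * (profiles @ rng.normal(size=(2, 8)))
observed = truth.copy()
observed[rng.random(truth.shape) < 0.2] = np.nan  # 20% random missing

filled = pca_impute(observed, rank=2)
gaps = np.isnan(observed)
mean_fill = np.broadcast_to(np.nanmean(observed, axis=0), truth.shape)
err_pca = wmape(truth[gaps], filled[gaps])
err_mean = wmape(truth[gaps], mean_fill[gaps])
```

Comparing `err_pca` against `err_mean` gives a quick sanity check that the low-rank structure is actually being exploited; on real traffic data the appropriate rank would need to be chosen by validation.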

2) TENSOR DECOMPOSITION AND FACTORIZATION
Tensor factorization and its derivatives have seen a significant rise in popularity when it comes to the field of missing data imputation, and missing traffic data is not an exception. This can be seen when comparing the reviewed literature between [8] and [9], noting that tensor factorization methods show a spike in use in [9] compared to [8]. In fact, tensor factorization can be considered both a statistical model and a machine learning model. However, tensor factorization is more interpretable compared to other machine learning models because it enables the extraction of a dataset's latent variables via tensor decomposition. Even papers that focus on traffic forecasts, such as [68], make use of tensor decomposition to deal with their missing data before moving on to their proposed model. Papers such as [18], [19], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], and [79] are some of the recent state-of-the-art missing data imputation methods proposed in the past three years that utilize tensor factorization as a core part of their model. These tensor-based models performed well because they are able to extract latent features from a traffic dataset and, through decomposition and completion, fill in the missing entries accurately. Via Bayesian statistics ([69], [71], [74]), extending or modifying the existing tensor factorization methods ([18], [19], [70], [72], [75]), and even adding an additional preprocessing method ([76], [79]), the base tensor factorization method has shown a significant improvement in the field of missing traffic data imputation. This can also be seen as a majority of these models have been tested for robustness in imputing missing traffic data at rates ranging from 1% to 90% while still retaining a high level of accuracy when compared to their respective benchmarks.
Besides that, out of the 13 papers mentioned, 11 of them ( [18], [19], [69], [70], [73], [74], [75], [76], [77], [78], [79]) have also been tested on urban traffic networks, raising the evaluation on their robustness as urban traffic tends to be significantly more complicated than freeways/highways/expressways. However, tensor factorization methods are largely dependent on their dataset and would be unable to perform similarly if the same trained model is tested in another location without retraining [52]. Besides that, tensor decomposition tends not to scale well with larger datasets [80].
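The decomposition-and-completion idea can be sketched with a plain CP (CANDECOMP/PARAFAC) factorization fitted by alternating least squares, where missing entries are refilled with the current reconstruction after each sweep. This is a generic textbook-style illustration and not the method of any cited paper; the function name `cp_complete`, the tensor layout (sensor x day x time of day), the rank, and the synthetic data are all assumptions.

```python
import numpy as np

def cp_complete(x, rank=2, n_iter=200, seed=0):
    """Impute NaNs in a 3-way tensor with a rank-`rank` CP model (EM-style ALS)."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x)
    filled = np.where(observed, x, np.nanmean(x))  # initial guess: global mean
    a, b, c = (rng.standard_normal((n, rank)) for n in x.shape)
    for _ in range(n_iter):
        # One ALS sweep on the currently filled tensor: update each factor in turn.
        a = np.einsum('ijk,jr,kr->ir', filled, b, c) @ np.linalg.pinv((b.T @ b) * (c.T @ c))
        b = np.einsum('ijk,ir,kr->jr', filled, a, c) @ np.linalg.pinv((a.T @ a) * (c.T @ c))
        c = np.einsum('ijk,ir,jr->kr', filled, a, b) @ np.linalg.pinv((a.T @ a) * (b.T @ b))
        recon = np.einsum('ir,jr,kr->ijk', a, b, c)
        filled = np.where(observed, x, recon)  # keep observed values, refill the gaps
    return filled

# Synthetic low-rank tensor: 6 sensors x 7 days x 24 hourly slots (values are made up).
rng = np.random.default_rng(0)
factors = [rng.uniform(0.5, 1.5, size=(n, 2)) for n in (6, 7, 24)]
truth = np.einsum('ir,jr,kr->ijk', *factors)
observed = truth.copy()
observed[rng.random(truth.shape) < 0.3] = np.nan  # 30% random missing

filled = cp_complete(observed, rank=2)
gaps = np.isnan(observed)
err_cp = np.abs(filled[gaps] - truth[gaps]).mean()
err_mean = np.abs(np.nanmean(observed) - truth[gaps]).mean()
```

Each sweep touches the full tensor, which hints at why scalability becomes a concern as the number of sensors or the time horizon grows.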

B. MACHINE LEARNING
Data-driven models in machine learning methods utilize the availability of data and learn the best weights to obtain the optimum result for a certain model, as compared to classical statistical methods, which require prior knowledge to derive an appropriate mathematical expression from a given data trend. In general, the model is trained via a training set to 'learn' the optimum values to output given a certain set of inputs. This is a very powerful tool as it requires little to no supervision from the user, but at the same time, a certain understanding of the model may be lost. However, it could be understood that the underlying features of the dataset have been, in theory, mined via the model, allowing it to be more robust and accurate compared to traditional statistical methods.
VOLUME 11, 2023
Neural networks are the models which are the most synonymous with the term machine learning despite just being a subset of it. Regardless, the idea of neural network was introduced in [81] back in 1943 and only began gaining traction in recent years due to the improvement in computation technology. Now, it is being used in various fields, from classification, prediction, and identification to missing data imputation, among others. The following subsections cover the popular methods used in missing data imputation.

1) GENERATIVE ADVERSARIAL NETWORKS
Generative Adversarial Networks (GAN) is a model proposed in 2014 [82] that trains a Generator and a Discriminator against each other. To summarize, the generator continuously attempts to 'trick' the discriminator into accepting that the generated data come from the training dataset. This results in both being trained, yielding better, more realistic generated data as well as more discriminating testing, which in theory allows the overall model to impute missing data more accurately or realistically. While not as popular as tensor methods, GAN is a fairly popular method in missing data imputation applications due to its nature of constantly training to create a better dataset to trick the discriminator. This can be seen in the recent papers focusing on GAN methods such as [80], [83], [84], [85], [86], and [87]. As with other methods, this research tends to focus on the spatiotemporal features of the traffic data ([80], [84], [87]) when conducting traffic data imputation. Some utilize the attention mechanism ([83], [84]). Besides that, [85] makes use of additional external factors such as weather and holidays. Interestingly enough, that research found that external factors excluding holidays do not influence the data imputation much for missing rates less than 40%. While researchers tend to test for high levels of missing data, it could be said that missing rates for real traffic data would not be that high. In this case, future researchers can focus on methods that improve the missing traffic data imputation at low missing rates with minimal concern that external factors might cause a large discrepancy in their performance. Another interesting GAN model was proposed by [86], whereby the generated result is once again used as an input into another generator, and the discriminator tries to discriminate between the first generated data and the double-generated data.
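For reference, the adversarial training between generator and discriminator is usually written as the minimax objective of the original GAN formulation [82]. The symbols $G$, $D$, $p_{\text{data}}$, and $p_{z}$ are notation introduced here for illustration; imputation-oriented variants may additionally condition the generator on the observed entries, which this generic form does not show.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here $D(x)$ is the discriminator's estimated probability that $x$ is real data, and $G(z)$ is a sample generated from noise $z$.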

2) GRAPH NEURAL NETWORKS
Various real-world datasets are represented as graphs, such as from a social network or the internet itself, and traffic data is not an exception. Traffic networks are naturally represented as a graph, as it is a suitable form in which to visualize road connections and their related information. Realizing this, researchers have proposed the use of Graph Neural Networks (GNN).
Recently, [6] conducted a comprehensive survey of GNN and classified various GNN models into four categories: recurrent GNN, convolutional GNN, graph autoencoders, and spatial-temporal GNN. Among these, we have found that convolutional GNNs have been the most popular choice in recent traffic research, as shown by [52], [88], [89], and [90]. Convolutional GNNs, or Graph Convolutional Networks (GCN), utilize convolutional neural networks to embed graph information into a tensor, resulting in a uniform framework from irregular datasets [89].
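To make the idea of graph convolution more concrete, the following is a minimal sketch of a single propagation step using the widely used symmetric normalization of the adjacency matrix with self-loops. It is an illustration only and not the architecture of any reviewed paper; the four-sensor chain graph, the feature sizes, and the random weights are all hypothetical.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops so a node keeps its own signal
    deg = a_hat.sum(axis=1)                  # node degrees, including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by 1/sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU activation

# Hypothetical road graph: four sensors connected in a chain 0-1-2-3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
features = rng.random((4, 3))   # 3 assumed features per sensor (e.g., recent speeds)
weights = rng.random((3, 2))    # untrained, randomly initialized layer weights
hidden = gcn_layer(adj, features, weights)
```

Each row of `hidden` mixes a sensor's own features with those of its direct neighbors, which is how spatial information from adjacent road segments can inform the estimate for a sensor with missing readings.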
While GNN and GCN are popular methods in traffic studies, most recent research focuses on traffic forecasting, and fewer studies focus on missing data imputation. Some research, such as [88], treats missing data as part of the traffic prediction process instead of the focus of the problem. This could be useful, as traffic actions tend to require real-time analyses and predictions. While it is good to design a traffic prediction model that is robust towards missing data, a missing data imputation model should not be neglected, as it can further enhance the already robust traffic prediction model. On the other hand, [52] is more focused on missing data imputation, proposing a model that uses a bidirectional recurrent neural network (RNN) to capture temporal patterns and GCN to capture spatial patterns. Meanwhile, [90] proposed a graph neural network that makes use of the attention mechanism, as well as a temporal convolutional network instead of an RNN, as the standard RNN suffers from various drawbacks such as being unable to hold memory for long, being prone to vanishing or exploding gradients, and having low efficiency in parallel training and inference. While not exactly imputing missing traffic data itself, [89] combined GCN with a mapping function to impute missing spatial flow data. This is another important aspect of traffic data that the authors believe should be highlighted and receive attention, as origin-destination flow data can be a vital addition to other traffic-related models that could use additional traffic features.
Besides GCN, there are also studies, such as [7] and [91], that make use of spatial-temporal GNN instead. In other words, instead of utilizing convolution for feature extraction and graph embedding, these studies propose other methods, such as the fusion of multiple data sources ([7], [91]) or the attention mechanism, as well as multitask learning [91].
As the concept of GNN was introduced only in 2005 [92], and GCN itself even more recently in 2017 [93], there is still plenty of room for improvement, as can be seen from the recent literature mentioned above. As road networks differ depending on the location, it is imperative to find a model that is robust towards various forms of missing data and road network structures. GNN may have strong potential here due to its deep-learning structure, as compared to tensor decomposition methods, which might be more transductive.

C. ENSEMBLE MODEL
A single model tends to have shortcomings along with its advantages. For this reason, researchers have come up with the idea of combining multiple models to offset each model's weaknesses and further enhance their strengths.
For example, [94] uses the very popular tensor decomposition but utilizes a Fuzzy Neural Network to further enhance the imputation accuracy by optimizing the weights of the tensor resolvers. Besides that, [95] combines GCN and tensor decomposition, using the graph Laplacian for tensor completion.
Meanwhile, [20] designs a framework that combines matrix modeling and factorization, conducting matrix decomposition before using a dendrite neural network to fuse the information and obtain the final imputed data. Besides being part of another ensemble model, the dendrite neural network itself was only recently proposed by [96], whose code is provided in their paper. This could be another good avenue for researchers to look into, as it expands upon the existing neuron structure to further resemble the human nervous system.
As shown, these ensemble models make use of already established methods while modifying them to work together to obtain an even better result. However, it should be noted that using more models would inevitably increase computation time, which may result in the inability to function in a real-time scenario.

D. OVERVIEW OF RESEARCH METHODS
Information regarding the base method used, the referenced papers, the type of road network, the method of data acquisition (i.e., public, private, or manually collected dataset), and the type of missing data tested is summarized in TABLE 1.
For the road network, data acquisition, and missing type columns, the number of reviewed papers that fulfilled each criterion was counted, and the total is shown in the table cells.
Other statistical methods include the MICE implementation [63] and Gaussian Processes [97], [98] which are less popular methods but were nonetheless researched relatively recently and showed good performance when benchmarked against established methods.
From TABLE 1, it can be seen that much of the reviewed literature was conducted in an urban network setting. This is because urban networks are the most volatile as well as the busiest, making them the most in need of the support of intelligent transportation systems. However, a deeper look into the datasets used shows that many studies rely on the same set of data, such as the Guangzhou urban traffic speed dataset, or on datasets related to public transport, such as taxi traffic data. Studies such as [97] and [98] made use of crowdsourced data from Google Maps' Location Sharing function, which could allow more flexibility in the location chosen at the cost of access to specific traffic data variables, as certain information is hidden for user privacy and security [98]. These datasets are limited either in their location or in their comprehensiveness, as other countries do not have the same traffic patterns or road networks as America or China, nor do taxis represent the entire state of the network at any time. This shows that researchers need to conduct simulations based on a larger variety of locations and utilize datasets that better represent the state of the traffic. Besides that, few researchers acquire their traffic data manually; most utilize public datasets or datasets from the private sector. This is understandable but is also a form of limitation, as discussed later.
Meanwhile, it can be seen that random missing types are almost always tested, followed by fiber missing and block missing. While some literature mentions block missing, by this paper's definition those cases are fiber missing, as only one source of data is affected or the missing period is not long enough. More research could be put into the block missing type.
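For readers who wish to test all three missing types on their own data, the patterns can be simulated with boolean masks. The sketch below is an illustration under stated assumptions: the data is assumed to be laid out as a sensors-by-time-steps matrix, fiber missing is taken to be a continuous gap at a single sensor, and block missing is taken to be a longer gap shared by several sensors, following the distinction drawn above. The function names, matrix size, gap lengths, and missing rate are all hypothetical.

```python
import numpy as np

def random_missing(shape, rate, rng):
    """Each entry is dropped independently with probability `rate` (True = missing)."""
    return rng.random(shape) < rate

def fiber_missing(shape, sensor, start, length):
    """A continuous gap at one sensor (one source of data)."""
    mask = np.zeros(shape, dtype=bool)
    mask[sensor, start:start + length] = True
    return mask

def block_missing(shape, sensors, start, length):
    """A longer gap shared by several sensors over the same period."""
    mask = np.zeros(shape, dtype=bool)
    mask[np.asarray(sensors), start:start + length] = True
    return mask

# Hypothetical setting: 6 sensors, 48 time steps
shape = (6, 48)
rng = np.random.default_rng(0)
rm = random_missing(shape, rate=0.2, rng=rng)
fm = fiber_missing(shape, sensor=2, start=10, length=6)
bm = block_missing(shape, sensors=[1, 2, 3], start=10, length=24)
```

Applying a mask to a complete dataset and comparing the imputed values against the hidden ground truth is the usual way such patterns are benchmarked.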
Additionally, a summary of the imputed variables is shown in TABLE 2. The variable most studies focus on imputing is traffic speed, as it is the most direct indicator of the state of the traffic. This is followed by traffic flow, which could be due to the unavailability of a speed dataset for the area studied. It is interesting to note that the majority of the studies that impute traffic flow shown in TABLE 2 are those which use taxi GPS data, which could explain this situation, as taxi GPS data may not have accurate traffic speeds logged. However, GPS data does provide researchers with a more detailed view of the road network, which would help in proving the robustness of their work. Traffic volume sees fewer missing data imputation studies, likely due to data availability and the rather imprecise nature of traffic volume. However, traffic volume does give a good idea of the state of the traffic as well. Travel time and congestion levels are studied only rarely but are also parameters to keep in mind for future research.
Most studies focus on imputing only one traffic variable, the exception being [7], which conducted missing data imputation on both traffic speed and traffic volume, further supporting their model's credibility. TABLE 3 lists the advantages and disadvantages of the popular methods mentioned in TABLE 1. As a general guideline, future studies should take into account the accuracy, interpretability, and computational complexity when designing a model.

V. NOTABLE DESIGN DECISIONS
Section IV discussed the popular base models that were the focus of recent papers. This section discusses the popular design decisions that the literature tends to use to augment their base models. To the best of the authors' knowledge, past literature reviews do not look into the common mechanisms used between different reviewed models and focus more on the overall quality of each individual model instead.

A. ATTENTION MECHANISM
The attention mechanism is widely used in many studies due to its optimization abilities, such as by [95] for weight optimization or by [51] for extracting multiple features. Besides those two, several of the reviewed studies had also incorporated the attention mechanism into their models [52], [84], [90], [91].
The attention mechanism is proving to be a valuable addition when dealing with feature extraction or weight optimization, and researchers who require such functions should take note of it. To that end, [99] has reviewed the state-of-the-art attention models proposed recently and provided more in-depth guidance on making use of this mechanism, along with its real-life applications.
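As an illustration of how attention weighs features, the following is a minimal sketch of scaled dot-product attention, one common variant of the mechanism. The reviewed papers may use different formulations, so this should be read as an assumption rather than a description of their models. The eight time steps and four-dimensional embeddings are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Return the attended output and the attention weights."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # similarity of each query to each key
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ values, weights

# Hypothetical input: 8 time steps, each embedded as a 4-dimensional vector
rng = np.random.default_rng(0)
steps = rng.standard_normal((8, 4))
attended, attn_weights = scaled_dot_product_attention(steps, steps, steps)
```

Here each time step attends to every other time step, so a step with unreliable readings can draw more heavily on the steps most similar to it.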

B. EXPECTATION MAXIMIZATION
Probabilistic Principal Component Analysis (PPCA) applies the expectation-maximization (EM) algorithm to the base Principal Component Analysis (PCA) to derive a probabilistic formulation of PCA. This is important as it allows Bayesian methods to be applied as an extension to the existing PCA [64], enabling further improvements as well as further analysis, as shown by [100]. This trait can be used in other models as well to possibly provide deeper insight and data for machine learning.
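The alternating structure of such an approach can be illustrated with a simplified stand-in. The sketch below is not full PPCA: it omits the noise variance and the posterior over latent variables, and instead alternates between fitting a low-rank PCA model to the currently filled data and replacing only the missing entries with the model's reconstruction. The function name, the rank, the iteration count, and the synthetic rank-1 data are all assumptions for illustration.

```python
import numpy as np

def iterative_pca_impute(x, observed, rank=1, n_iter=200):
    """Fill unobserved entries of x (time steps x sensors) by alternating
    between a low-rank PCA fit and re-imputation.
    `observed` is True where a value was actually recorded."""
    # Initial guess: each missing entry starts at its column's observed mean.
    counts = np.maximum(observed.sum(axis=0), 1)
    col_means = np.where(observed, x, 0.0).sum(axis=0) / counts
    filled = np.where(observed, x, col_means)
    for _ in range(n_iter):
        # M-like step: refit the mean and principal subspace on the filled data.
        mean = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank] + mean
        # E-like step: overwrite only the missing entries with the model estimate.
        filled = np.where(observed, x, low_rank)
    return filled

# Hypothetical example: 20 time steps x 5 sensors with exact rank-1 structure
truth = np.outer(np.linspace(1.0, 2.0, 20), np.arange(1.0, 6.0))
observed = np.ones_like(truth, dtype=bool)
for row, col in [(0, 0), (5, 2), (10, 4), (15, 1)]:
    observed[row, col] = False
corrupted = np.where(observed, truth, np.nan)
imputed = iterative_pca_impute(corrupted, observed, rank=1)
```

Observed values are never altered; only the hidden entries are updated at each iteration. A full PPCA treatment would additionally estimate a noise variance, which is what opens the door to the Bayesian extensions mentioned above.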

C. FUZZY THEORY
Fuzzy theory introduces the concept of membership functions, which allow a variable to belong partially to a set instead of giving a single yes-or-no answer. This allows uncertain or imprecise data to be represented in a more flexible manner. While this has seen use in many fields, it has seen minimal use for missing traffic data; one example is [53], which makes use of fuzzy rough sets combined with a fuzzy neural network. Another study using fuzzy theory for missing data imputation is [101], which used a hybrid model combining fuzzy rough sets with fuzzy C-means. However, that study was conducted using a medical dataset.
Despite its minimal use, the authors found the method worth mentioning, as traffic data tends to be rather imprecise due to many external variables. Fuzzy theory could potentially improve the performance of other models in a hybrid setting, as shown by the research above.
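A minimal sketch of partial membership is given below, using triangular membership functions, which are one simple and common choice. The three speed categories and their breakpoints in km/h are hypothetical and chosen only for illustration; they are not taken from [53] or [101].

```python
def triangular_membership(x, low, peak, high):
    """Degree (0 to 1) to which x belongs to a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)    # rising edge
    return (high - x) / (high - peak)      # falling edge

# Hypothetical fuzzy sets for traffic speed in km/h: (low, peak, high)
SPEED_SETS = {
    "slow": (0.0, 20.0, 50.0),
    "moderate": (30.0, 55.0, 80.0),
    "free_flow": (60.0, 90.0, 120.0),
}

def fuzzify(speed):
    """Membership degree of a speed reading in every fuzzy set."""
    return {name: triangular_membership(speed, *params)
            for name, params in SPEED_SETS.items()}

memberships = fuzzify(40.0)
```

A reading of 40 km/h is partly "slow" and partly "moderate" at the same time, which is the kind of graded representation that can absorb imprecise sensor readings.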

VI. CHALLENGES AND LIMITATIONS
This section covers certain challenges that existing literature faces and suggests directions that future researchers could take when undertaking their research. The challenges and limitations mentioned here concern large-scale deployment in different areas, for which the common issues are traffic data retrieval, scalability, and model interpretability, as explained below.

A. LIMITATION OF DATA
It can be seen from TABLE 1 that most of the traffic data used came from publicly available datasets and some were obtained from other sources such as private institutions, while fewer studies attempted to collect the traffic data manually. While it could be said that successful simulations on these datasets would imply similar results on other datasets, researchers in countries with limited public datasets might still want to test the model's validity in their own country and location of choice. In such situations, it should be noted that online traffic APIs, as mentioned in Section III-A2, could be used to collect the relevant traffic data, as they leverage the companies' existing infrastructures (i.e., various sources of data) when collecting data, which ensures a level of reliability on top of the accessibility.
However, collecting a significant amount of traffic data requires a significant amount of time, and as such, researchers might not have as many data points as the available public datasets. This can be seen from some of the reviewed papers, such as [7], [19], [65], [66], [69], [91], and [20], whose datasets range from 14 days to 2 months. The quantity of data collected within such a short period would prove difficult for models that rely on large amounts of data to be properly trained. This leads to the research question of how well a model can perform when faced with a limited amount of training data.
Models which rely on external features, such as [84] and [85], might suffer a reduction in performance. However, based on the experiments by [85], it would seem that external factors, barring holiday factors, might not play a big part unless the missing rate is greater than 40%. Future researchers are therefore encouraged to differentiate between holiday and non-holiday traffic datasets whenever possible. As previously mentioned, manually gathering data takes a long time, and the collected data would be lacking in both quantity and quality, given that some research may be conducted under a time constraint and data collecting equipment may suffer from occasional breakdowns or failures in data transmission. While most, if not all, methods might have lowered performance, data-driven methods might suffer the most, depending on how limited the quantity of data is. However, there are methods such as data augmentation, mentioned in Section III-C, which could be used with caution, as well as data generation methods via GAN, which could be researched.

B. SCALABILITY
Scalability is another challenge that researchers face. While tensor-based methods are popular, they also suffer from a scalability problem: as the size of the dataset increases, so does the computational cost of conducting the traffic imputation. In such cases, a data-driven approach might be a better method. However, research into developing a scalable tensor decomposition method for missing traffic imputation should not be ignored. For example, [79] has proposed a method that utilizes the tensor nuclear norm minimization scheme to model the inherent low-rank property of traffic data, breaking down the large tensor into smaller matrices and allowing for an overall more efficient computation while maintaining a similar level of accuracy. More research should be conducted to improve the accuracy further and reduce the computational cost. In [79], the comparison was made against existing tensor-based models but not against other models, such as machine learning models, so further testing can still be done.
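Nuclear norm minimization schemes are typically built around singular value thresholding, which shrinks the singular values of a matrix and thereby promotes low rank. The sketch below shows this building block for a single matrix only; it is not the algorithm of [79], which operates on tensors. The matrix size, noise level, and threshold `tau` are hypothetical. The full singular value decomposition used here is also the step whose cost grows with the data size, which is where the scalability concern above comes from.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Shrink every singular value by tau and clip at zero.
    This is the proximal operator of tau times the nuclear norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    shrunk = np.maximum(s - tau, 0.0)
    return (u * shrunk) @ vt

# Hypothetical data: a rank-2 matrix (30 time steps x 12 sensors) plus small noise
rng = np.random.default_rng(0)
low_rank = rng.random((30, 2)) @ rng.random((2, 12))
noisy = low_rank + 0.01 * rng.standard_normal((30, 12))
denoised = singular_value_threshold(noisy, tau=0.5)
```

Small singular values, which mostly carry noise, are removed entirely, while the dominant ones that carry the shared traffic pattern are kept in reduced form.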

C. MODEL INTERPRETABILITY AND ROBUSTNESS
Machine learning or data-driven models tend to have less interpretability than statistical methods, which is understandable given how they work. However, future researchers should take note of how much data is being used in their model to reduce their model's computational cost. A more interpretable model also helps researchers see which part of their model can be adjusted or trimmed, especially in situations with a limited dataset.
Besides that, while statistical methods are interpretable, they lack robustness in contrast to data-driven or machine learning methods, as the learned model is highly dependent on the dataset it was trained on and might require retraining. Tensor-based models, which are probably the most popular statistical method, also suffer from this issue, as mentioned by [52]. Deep-learning methods may be more robust towards this issue, and ongoing research continues to make deep-learning models more computationally efficient for improved real-time utilization.
There is always a cost when selecting a model, and the most suitable model depends on the goal of the researcher. The idea is to design a model that is both interpretable for further understanding and robust to various changes in the dataset's environment while keeping the cost to a minimum. Ensemble models such as [94] managed to propose a robust, generalized model utilizing both a Fuzzy Neural Network and tensor decomposition, which could be said to be an improvement but at the cost of computational complexity.

VII. CONCLUSION AND FUTURE WORK
This paper introduced, defined, and categorized the three different types of missing data, namely random missing, fiber missing, and block missing. This paper reviewed various popular state-of-the-art methods and their corresponding research in recent years based on the papers' focus and goals. It was found that tensor decomposition has been widely used in recent years. However, tensor computations could lead to scalability problems and are dependent on the location of the training dataset. Generative Adversarial Networks (GAN) and Graph Neural Networks (GNN) were found to be similarly popular, as they are well suited to data generation and traffic networks, respectively. Both GAN and GNN are relatively new models, with Graph Convolutional Networks (GCN) emerging as a branch of GNN. It is also shown that the attention mechanism and expectation-maximization algorithm are popularly used as auxiliary methods to help bolster the base model's missing traffic data imputation capabilities. In addition, this paper also discussed the limitations of popular datasets and collection methods. Various challenges related to the scalability and availability of data have been highlighted with different data collection methods. As traffic differs from location to location, even within a country, different countries would have different traffic patterns.
Moving forward, a traffic data-collecting initiative for improved traffic performance is encouraged for further analysis. Researchers also need to develop methods that are robust toward different locations' traffic patterns. Addressing the lack of data while keeping in mind the interpretability of the models is also important. Methods such as PPCA have shown their strength in breaking down traffic analyses, which could help in further understanding the various traffic factors as well as determining which variables have the largest influence on the accuracy and robustness of the data imputation. Scalability of models to function well enough for real-time applications is also important. While tensor factorization tends to suffer from scalability and complexity issues, there are also studies on the design of online and quicker algorithms for it, making this another topic worth pursuing.