Automated Machine Learning for COVID-19 Forecasting

In the context of the COVID-19 pandemic, various sophisticated epidemic and machine learning models have been used for forecasting. These models, however, rely on carefully selected architectures and detailed data that is often only available for specific regions. Automated machine learning (AutoML) addresses these challenges by allowing forecasting pipelines to be created automatically, in a data-driven manner, resulting in high-quality predictions. In this paper, we study the role of open data, along with AutoML systems, in acquiring high-performance forecasting models for COVID-19. We adapted the AutoML framework auto-sklearn to the time series forecasting task and introduced two variants for multi-step ahead COVID-19 forecasting, which we refer to as (a) multi-output and (b) repeated single-output forecasting. We studied the usefulness of anonymised open mobility datasets (place visits and the use of different transportation modes) in addition to open mortality data. We evaluated three drift adaptation strategies to deal with concept drift in the data by (i) refitting our models on part of the data, (ii) refitting on the full data, or (iii) retraining the models completely. We compared the performance of our AutoML methods in terms of RMSE with five baselines on two testing periods (over 2020 and 2021). Our results show that combining mobility features and mortality data improves forecasting accuracy. Furthermore, we show that when faced with concept drift, our method refitted on recent data and using place visits mobility features outperforms all other approaches for 22 of the 26 countries considered in our study.


I. INTRODUCTION
In December 2019, a coronavirus disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in the city of Wuhan, China. By January 2020, the World Health Organisation advised governments to prepare for active surveillance and case management [1]. For policymakers to respond adequately, the ability to accurately forecast the spread of the disease is essential. This has inspired many researchers to work on forecasting methods in response to the COVID-19 pandemic based on available data. Such data may be in the form of the number of confirmed cases, deaths and hospitalisations. Organisations such as the European Centre for Disease Prevention and Control [2] have invested substantial effort into consolidating such data sources. Furthermore, several technology companies, including Apple, Facebook, Foursquare and Google, have published data reflecting the movement of people within a population. These data sources are interesting with respect to COVID-19 forecasting, as the movement of people is directly related to the spread of this contagious disease.

Despite such efforts, Ioannidis et al. [3] claim that forecasting for COVID-19 has largely failed. They argue that draconian countermeasures have been taken on the basis of incorrect modelling assumptions, poor data quality and high sensitivity of estimates due to exponentiated variables. Early models were built upon speculation while predicting for entire seasons. As a result, many forecasting models would only work well for isolated homogeneous populations.

Our contributions are as follows:
• We adapt the auto-sklearn AutoML framework to the task of forecasting COVID-19 mortality data and introduce two AutoML forecasting variants for multi-step ahead time series forecasting.
• We study how we can incorporate anonymised mobility data representing place visits and the use of different transportation modes.
We also study to what extent doing so permits more accurate forecasting.
• We extend this framework to take into account non-stationarity and concept drift in the data by comparing the performance of three different drift adaptation strategies.

• We evaluate our methods on real-world datasets from 58 countries worldwide and against five baselines.

Transitioning from one compartment to the other is described by differential equations, representing contact ratios and recovery time. Given more knowledge about a given disease, more complex compartmental models may be created by adding more compartments that reflect that knowledge. The SEIR model [11], for instance, extends the SIR model by adding the exposed compartment, holding people infected by the disease but not yet capable of infecting others. Other work, similarly, used a multi-scale network to simulate an influenza-like disease. Instead of individuals, they used sub-populations as nodes and gravitational flows derived from commuting and flight data as weights for the edges, introducing a form of spatial awareness to the compartmental models.
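The compartmental dynamics described above can be made concrete with a minimal numerical sketch. The parameter values below (β, σ, γ) are invented for illustration, not taken from the paper, and the model is integrated with a simple Euler step:

```python
# Minimal SEIR sketch: S -> E -> I -> R, with transitions described by
# differential equations, integrated here with a simple Euler step.
# beta, sigma and gamma below are assumed values for illustration only.

def seir_step(s, e, i, r, beta, sigma, gamma, dt=1.0):
    """One Euler step of the SEIR model (population normalised to 1)."""
    n = s + e + i + r
    ds = -beta * s * i / n             # susceptibles become exposed via contact
    de = beta * s * i / n - sigma * e  # exposed become infectious after latency
    di = sigma * e - gamma * i         # infectious recover
    dr = gamma * i
    return s + ds * dt, e + de * dt, i + di * dt, r + dr * dt

# Illustrative run: 100 days, starting from 1% infectious.
state = (0.99, 0.0, 0.01, 0.0)
history = [state]
for _ in range(100):
    state = seir_step(*state, beta=0.3, sigma=0.2, gamma=0.1)
    history.append(state)
```

Adding further compartments (as in the more complex models mentioned above) amounts to adding further state variables and transition terms of the same form.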

In order to create realistic contact networks, detailed mobility datasets are needed. Ideally, these datasets encompass the entire population of a region, detailing where and how people have come into contact with each other. In reality, datasets often summarise interactions and often represent samples of a population. Also, recorded interactions in these datasets are not enriched with duration or intensity [14]. Contact networks in which individuals are simulated as a basis for the spread of diseases are called agent-based networks. To create agent-based networks, one needs datasets containing the movement patterns of individuals. For instance, Aleta et al. [15] created an agent-based network using a dataset containing place visits published by Foursquare to simulate the spread of COVID-19 through a synthetic population in the Boston metropolitan area. While for some

who predicted flu in the United States using a combination of CNN, RNN and residual links. They achieved a robust improvement over autoregressive models using multiple real-world datasets. Aiken et al. [19]

The creation of regression pipelines encompasses many steps: data pre-processing, feature pre-processing, hyperparameter optimisation and algorithm selection. The best choice of algorithm and pre-processing steps, and how to set their hyperparameters, typically depends on the data at hand. Therefore, it is difficult to select a single algorithm to ensure that the best model is configured for a forecasting problem. Different choices of these components may vastly influence the predictive performance of the pipeline, which is why we can benefit from making these choices automatically. AutoML systems have recently addressed this issue by developing techniques to automatically configure high-performing machine learning pipelines. Sequential Model-Based Optimisation (SMBO) is a black-box optimisation framework that has been used for the purpose of hyperparameter optimisation. Hutter et al. [26] used SMBO to automatically optimise the hyperparameters of machine learning algorithms. Sequential Model-based Algorithm Configuration (SMAC) [27] is a system that implements SMBO and can be used for hyperparameter optimisation. It is a general-purpose algorithm configurator, which makes it possible to both select algorithms and tune their hyperparameters efficiently. Auto-WEKA [6] is an AutoML framework around the WEKA software package that uses SMAC for its configuration. This framework fully automates the creation and tuning of classification and regression pipelines. Auto-sklearn [4] is an AutoML framework by Feurer et al. around the scikit-learn [28] Python package. This framework includes meta-learning to warm-start the configuration search and creates ensembles of pipelines. In more recent updates, this framework has been extended with multi-output regression. This option makes it suitable for forecasting over a horizon of multiple days.
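The SMBO loop described above can be illustrated with a deliberately simplified sketch. This is a toy, not SMAC: the surrogate here is a crude nearest-neighbour lookup and the "expensive" objective is a synthetic function, but the shape of the loop (fit surrogate, pre-select a promising candidate cheaply, evaluate it expensively, update the incumbent) is the same:

```python
# Toy illustration of the SMBO loop: a surrogate is fitted to observed
# (configuration, loss) pairs, candidate configurations are scored with it,
# and only the most promising candidate is actually evaluated.
import random

def true_loss(x):                 # stand-in for an expensive model evaluation
    return (x - 0.3) ** 2

def surrogate(observations, x):   # crude 1-nearest-neighbour "model" of the loss
    return min(observations, key=lambda p: abs(p[0] - x))[1]

random.seed(0)
observations = [(x, true_loss(x)) for x in (0.0, 0.5, 1.0)]  # initial design
incumbent = min(observations, key=lambda p: p[1])            # best-seen configuration
for _ in range(30):
    candidates = [random.random() for _ in range(20)]
    x = min(candidates, key=lambda c: surrogate(observations, c))  # cheap pre-selection
    observations.append((x, true_loss(x)))                         # expensive evaluation
    incumbent = min(observations, key=lambda p: p[1])              # update incumbent
```

SMAC replaces the nearest-neighbour lookup with a proper surrogate model and scores candidates by expected improvement over the incumbent, but the control flow is analogous.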
TPOT [29] is a tree-based pipeline optimisation tool for AutoML. Similar to auto-sklearn, it is built upon scikit-learn. Instead of using SMBO, TPOT uses genetic programming for hyperparameter optimisation. H2O [30] is another AutoML framework; it uses random search for its hyperparameter optimisation and combines models in stacked ensembles. Unlike auto-sklearn, H2O does not optimise data and feature pre-processors, but only optimises models. It is also possible to automatically construct deep neural networks. Frameworks that support this include Auto-Keras [5].

Among these frameworks, we have selected auto-sklearn to adapt to the task of COVID-19 forecasting.
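The multi-output regression support mentioned above is what enables the two forecasting variants studied in this paper. The following sketch contrasts them using plain scikit-learn (not auto-sklearn) on a synthetic sine series; the window size w = 3 and horizon h = 2 are illustrative choices:

```python
# Sketch of multi-output vs. repeated single-output forecasting on a toy
# series, using plain scikit-learn and an assumed window size w=3, horizon h=2.
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.sin(np.arange(100) / 5.0)
w, h = 3, 2

# Build lagged training instances: X holds windows, Y holds the next h values.
X = np.array([series[i:i + w] for i in range(len(series) - w - h + 1)])
Y = np.array([series[i + w:i + w + h] for i in range(len(series) - w - h + 1)])

# (a) Multi-output: one model predicts the whole horizon at once.
multi = LinearRegression().fit(X, Y)
forecast_multi = multi.predict(series[-w:].reshape(1, -1))[0]

# (b) Repeated single output: a one-step model applied recursively,
# feeding each prediction back into the input window.
single = LinearRegression().fit(X, Y[:, 0])
window = list(series[-w:])
forecast_rso = []
for _ in range(h):
    nxt = single.predict(np.array(window[-w:]).reshape(1, -1))[0]
    forecast_rso.append(nxt)
    window.append(nxt)
```

In the actual framework, the `LinearRegression` stand-in is replaced by a whole auto-sklearn ensemble whose pipeline and hyperparameters are selected automatically.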

As data is limited when forecasting the pandemic, using

In this definition, p_{t0} is the joint distribution over the set of input sequences X, where {x, m} ⊆ X.

We aim to address the forecasting task by formulating the Combined Algorithm Selection and Hyperparameter (CASH) optimisation problem [6]. Given a set of machine learning algorithms A = {A^(1), . . . , A^(k)} with hyperparameter spaces Λ^(1), . . . , Λ^(k), we search for the optimal algorithm with optimal hyperparameter settings A*_{λ*} following

A*_{λ*} ∈ argmin over A^(j) ∈ A, λ ∈ Λ^(j) of L(A^(j)_λ, D_train, D_valid)

Here, L is the loss generated by algorithm A when trained using the set D_train ⊂ X and validated using the set D_valid ⊂ X. This loss is the mean squared error between the forecast made by algorithm A using xm_{t,w} with hyperparameter settings λ (i.e., x̂_{t,h}) and the true observations in the validation set (i.e., x_{t,h}), unseen by algorithm A. We are optimising a full pipeline. Therefore, optimising A means that we are optimising the hyperparameters of a combination of pre-processors P, features F and regressors R, or A = {P, F, R}. Part of this process is internally optimising the input window size w, which is a newly added feature pre-processing step for time series forecasting.
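The CASH idea can be illustrated with a toy sketch that enumerates a tiny algorithm/hyperparameter space and minimises validation error; auto-sklearn instead searches a very large space with SMAC. The data, algorithms and value grids below are invented for illustration:

```python
# Toy CASH sketch: jointly pick an algorithm A^(j) and hyperparameters
# lambda from Lambda^(j) by minimising validation loss, here by plain
# enumeration over a tiny, made-up space.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)
X_train, X_valid, y_train, y_valid = X[:90], X[90:], y[:90], y[90:]

# A tiny joint space of (algorithm, hyperparameter setting) pairs.
space = [(Ridge, {"alpha": a}) for a in (0.1, 1.0, 10.0)] + \
        [(KNeighborsRegressor, {"n_neighbors": k}) for k in (1, 5)]

best = None
for algo, params in space:
    model = algo(**params).fit(X_train, y_train)
    loss = mean_squared_error(y_valid, model.predict(X_valid))  # L(A_lambda, ...)
    if best is None or loss < best[0]:
        best = (loss, algo.__name__, params)
```

In the real setting, the space additionally covers pre-processors and the window size w, and the loss is the MSE of multi-step forecasts on a temporal holdout set.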

As discussed in Section II, in this work we extend auto-sklearn to address the problem mentioned in Section III, as it supports multi-output regression and holdout validation. Furthermore, it supports the automation of data and feature pre-processing steps, which are both important for time series forecasting to configure the auto-regressive model and set its window size.
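Such a window-size pre-processing step can be sketched as follows. This is an illustration of the idea, not auto-sklearn's actual pre-processor; the 30-day cap mirrors the maximum window used later in this work:

```python
# Illustrative sketch of a variable window size pre-processor: inputs arrive
# with a static maximum length, and the number of lags w actually kept is a
# hyperparameter left to the AutoML optimiser.

MAX_WINDOW = 30  # maximum window size in days, as used in this work

def cut_window(instances, w):
    """Keep only the last w values of each fixed-length input sequence."""
    assert 1 <= w <= MAX_WINDOW
    return [seq[-w:] for seq in instances]

# One training instance covering the full 30-day window; w=7 keeps one week.
instance = list(range(30))
short = cut_window([instance], w=7)[0]
```

Within auto-sklearn, w would be exposed as a tunable hyperparameter of the pipeline rather than passed by hand.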

Still, as this system was not necessarily created to perform time series forecasting, we add an additional variable input window size as a feature pre-processor and introduce a new way to perform multi-step ahead forecasting. In this section, we provide the details on the data used in this work and how we adapted auto-sklearn to perform the forecasting task. Finally, we specify how we adapt the auto-sklearn ensembles when faced with concept drift.

The data used for our predictions comes from three sources: mortality data and mobility data representing two types of mobility modalities: (i) the mode of transport and (ii) place visits. Table 1 presents the meta-data of these sources.

Both datasets are maintained and adjusted by ECDC when numbers are deemed inaccurate due to delays in reporting. We use the daily new deaths as part of our input and as the truth value to evaluate our estimations. We do so because the reported deaths are likely to be more reliable than reported cases, as mentioned in [7]. To make sure the data is comparable across countries, we normalise it by population size.

We merged the mortality data and the mobility data into two combined datasets. The first combined dataset captures the first year of the pandemic. We used the intersection of dates and countries of the first ECDC dataset and both mobility datasets. There were some missing values, which we imputed by taking the average of the values 7 days before the missing data point and 7 days after the missing data point. This way, the imputed value fits well between the previous and next week, and daily trends are preserved. For the country of Serbia, the number of missing values exceeded 10%, which is why we omitted it from the dataset. The resulting first combined dataset contains data from February 15th, 2020 until December 14th, 2020. The second combined dataset contains data from March 1st, 2021 until July 10th, 2021.
When combining the mortality data with the mobility data for these periods, there were no missing values to account for. The first dataset includes 58 countries from all over the world. The second dataset contains 26 countries from the European Union.
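The imputation rule described above (average of the values 7 days before and 7 days after the gap) can be sketched as follows; the helper name and toy series are ours, and boundary cases are ignored for brevity:

```python
# Sketch of the week-aware imputation: a missing value is replaced by the
# average of the values 7 days before and 7 days after it, so that the
# weekly pattern of the series is preserved. Assumes both neighbours exist.

def impute_weekly(series):
    """Fill None entries with the mean of the values one week before/after."""
    filled = list(series)
    for t, v in enumerate(filled):
        if v is None:
            filled[t] = (filled[t - 7] + filled[t + 7]) / 2.0
    return filled

# Toy daily series with a weekly cycle and one missing observation.
week = [10, 12, 15, 14, 13, 8, 6]
series = week * 3
series[10] = None          # a missing Thursday-equivalent
imputed = impute_weekly(series)
```

Gaps at the very start or end of a series, and countries with too many missing values (such as Serbia, above 10%), need separate handling, as described in the text.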

Auto-regressive modelling is a common approach for forecasting tasks. An auto-regressive model performs regression using past measurements in a time series to predict its future timestamps. Many regression algorithms can be used to create an auto-regressive model. Furthermore, the data can be pre-processed in different ways within a machine learning pipeline before being fed into the regression algorithm. In this paper, we extend the auto-sklearn [4] AutoML framework to achieve this goal. Auto-sklearn is a wrapper around the popular Python module scikit-learn [28]. Scikit-learn is a machine learning library including a large set of algorithms that can be used for regression and classification tasks, providing various ways to pre-process data, select features, fit models and evaluate the results.

FIGURE. The multi-output ensemble. This ensemble creates multiple predictions at once but has no access to meta-learning. Within the framework, pipelines are constructed to form a forecasting ensemble. By feeding this ensemble test data, predictions can be made.

Promising configurations are those with a high expected improvement over the incumbent, the best-seen configuration. A local search is performed near these promising configurations to find configurations with higher expected improvement. In each iteration, the incumbent is updated to store the best found configuration. The best configurations are grouped together in an ensemble using ensemble selection [38].

To predict the value of [x_{t+1}, . . . , x_{t+h}], we train the models with sequences of the time series in the form of [x_{t−w}, . . . , x_t]. In vanilla auto-sklearn, this window size w has to be determined by the user. This would mean that when we use lags of the time series as features, the number of lags is predetermined.
When making predictions with different regressors, not all parts of the time series may be relevant, and depending on the configuration, it can be good to use a longer or shorter input sequence. This is why we implement the variable window size feature pre-processor, as proposed in our earlier work [40]. This pre-processor has the hyperparameter w, which is optimised within auto-sklearn. The pre-processor takes the input sequence with a predetermined static length and cuts off the first values, resulting in an input sequence of the form [x_{t−w}, . . . , x_t]. The work presented in [40] experiments on a large set of time series tasks and showed that the variable window size had a major impact on the accuracy of the framework. We still need to set a maximum value for the window size. As larger windows limit the number of data instances we can use, we limit our window size to a maximum of 30 days. By incorporating the variable window size optimisation in auto-sklearn, it is possible to define a forecasting task in the following two ways.

of the test set, to be sure that the ensemble model generated by auto-sklearn cannot learn future information. This is why we also disable shuffling. This keeps the temporal integrity of the data intact and ensures that the holdout validation set consists of the last dates in the train set. As an optimisation metric, we use the mean squared error to evaluate the performance of the pipelines. This ensures that the regressor tries to fit the set of data points as closely as possible. To ensure our ensembles are fully trained on the data, we refit the ensembles on the full train and validation set after validation is finished. This means that while the pipeline stays the same, the models are updated with both the train and validation set.
This way, we make sure that there is no gap in knowledge just before the forecasting starts.

For the pandemic problem, it is important to consider the changes in the data generation process that lead to concept drift in the data. On the one hand, there may be a concept drift caused by the fact that in 2021, many countries in Europe started their vaccination programs. Furthermore, lock-downs, mutations in the disease and changes in healthcare can lead to additional concept drifts in the data. On the other hand, we use two mortality datasets, separated in time, each normalised with a different population size (the country population numbers have slightly changed from 2020 to 2021). Currently, auto-sklearn has no drift detection mechanism.

Celik and Vanschoren [41] created several concept drift adaptation mechanisms for automated machine learning frameworks. It is not trivial to use drift detection methods while training models with auto-sklearn, as this requires dynamically training multiple models to monitor the drift. However, auto-sklearn works with a predefined number of training instances to create a single model and cannot dynamically detect drift in consecutive windows of training data. As training a single auto-sklearn ensemble with sufficient complexity takes multiple hours, creating many ensembles for drift detection can quickly increase the time needed beyond feasibility. While in the problem of COVID-19 forecasting we can safely assume that drift exists in the data, further research can study how automatic drift detection techniques can be incorporated directly into auto-sklearn. We implement three methods based on the work of Celik and Vanschoren [41] that cope with concept drift without using drift detection. For each of the methods, we first construct ensembles using the old dataset.
The drift adaptation strategies can be viewed as a forget mechanism, discarding old information to varying degrees. Depending on the magnitude of the concept drift, there can be merit to each method. In our experiments, we study the performance of these approaches in forecasting. The methods are explained below:

We selected the following baselines based on earlier research in COVID-19 forecasting that uses machine learning models and can train models based on the dataset we have collected. Compartmental methods (e.g., [15]) need specific data that is not available for all regions. Therefore, we cannot compare our methods with these:

Table 2. We did, however, enlarge the batch size from 10 to 58 for the first scenario or 26 for the second scenario, which are the numbers of countries in the datasets. This allows the models to train for each country simultaneously without being able to see future time steps. We also increased the number of time steps used as input to 30 to match the other ensembles and baselines in our comparison.
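The three drift adaptation strategies can be sketched as follows, with a fixed scikit-learn model standing in for an auto-sklearn ensemble and synthetic data standing in for the old (2020) and new (2021) datasets:

```python
# Sketch of the three drift adaptation strategies. "Refitting" keeps the
# configured pipeline and only re-estimates its weights; "retraining" would
# rerun the whole AutoML search. Data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
old_X, old_y = rng.normal(size=(200, 4)), rng.normal(size=200)  # "2020" data
new_X, new_y = rng.normal(size=(60, 4)), rng.normal(size=60)    # "2021" data

# Ensemble constructed on the old dataset (configuration + fitted weights).
configured = Ridge(alpha=1.0).fit(old_X, old_y)

# (i) Partial refit: keep the configuration, re-estimate weights on recent data.
partial = Ridge(**configured.get_params()).fit(new_X, new_y)

# (ii) Full refit: keep the configuration, re-estimate weights on all data.
full = Ridge(**configured.get_params()).fit(
    np.vstack([old_X, new_X]), np.concatenate([old_y, new_y]))

# (iii) Retrain: discard the configuration and rerun the entire AutoML search
# on the data (not shown here, as it would restart model selection).
```

In the actual framework, each strategy is applied to every pipeline in the auto-sklearn ensemble rather than to a single model.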

Our framework is built on version 0.12.1 of auto-sklearn. Auto-sklearn requires users to define a maximum runtime. All of our ensembles, multi-output or repeated single output, were run for 3 hours. For the training of every single pipeline, we limit the runtime of auto-sklearn to a maximum of 10% of the total runtime, which comes down to 18 minutes. The majority of iterations, however, finish much faster. This amount of time ensures that hundreds of models are compared to create the resulting ensembles. We run auto-sklearn in parallel on 8 cores of an Intel(R) Xeon(R) CPU at 2.1 GHz with 10 GB of RAM. As mentioned in Section IV, we use a holdout set as a validation strategy and make sure not to shuffle the data.

We rank all methods based on RMSE over all countries. These rankings come from a bootstrap distribution of 1,000 resamples, based on 25 runs per ensemble. A method that is consistently better than other methods in most countries will be assigned a lower average rank. These average ranks give insight into how well these methods perform compared to each other.

In Figure 3a, we compare our multi-output ensembles using different sources of mobility data and our repeated single-output ensemble for the 2020 scenario. In this scenario, the repeated single-output ensemble and the multi-output ensemble using place visits mobility have the best performance. The repeated single-output ensemble outperforms all multi-output ensembles not using place visits data. When we compare the same methods for the 2021 scenario in Figure 3b, we see that there is a drop in predictive power when using place visits mobility features. In this scenario, the repeated single-output ensemble and the multi-output ensemble using only mortality features are better than the ensembles using mobility features.
The best mobility ensembles now use the combination of place visits and mode of transport, with place visits ranking slightly higher than mode of transport. The drop in predictive power of the ensemble using place visits mobility can be explained by the concept drift and changes in data distribution in the second scenario. In this case, complex models with more features will lose to simpler models. In Figure 3c, we show the comparative performance of our methods in 2021 with the partial refit adaptation strategy. We found this strategy to be the best approach, as we will detail when discussing the answer to Q3. The figure indicates that mobility datasets can also show their power with proper drift adaptation in the second scenario. This experiment has shown that using mobility features can improve forecasts but does not guarantee improvement. Of the mobility datasets studied, the best results can be found using the place visits data. This dataset holds more predictive power than the mode of transport dataset. This may be due to their level of abstraction. The place visits data holds six categories, whereas the mode of transport has only two. Moreover, the place visits categories specify groups of locations instead of just an increase in activity. If more contagion happens at specific location groups, this can be picked up more easily from the place visits data.

FIGURE 3. The comparative performance of our methods with varying mobility features using RMSE. A lower rank depicts a better performance. When methods are linked with a horizontal bar, they are within critical distance, meaning there is no significant difference between average ranks. Our methods are denoted with the prefix M.

In Figure 4a, we show the results for the 2020 scenario.

Here, the performances of the methods and baselines are close.

The best baseline is the persistence baseline. Our best two ensembles perform slightly worse than the persistence baseline

in the first year that is necessary for training these models.

The predictive power of these models, however, improves as

Table 3 shows that for 20 out of 58 countries, the persistence baseline has the best forecast. However, as the first five of these have no new deaths in the test period, the persistence baseline wins these by default, as there are no fluctuations in the time series. Our best method for this scenario, the repeated single-output ensemble, scores best for 21 of the 58 countries. Using this table, we further investigate whether the performance of models depends on the properties of the time series acquired from different countries. Notably, we look at the existence of (i) periodic patterns and (ii) trends that point to the complexity of the time series. In this table, we grouped countries based on the trend and periodicity importance of the true values, acquired using the procedure explained in [44]. To compute this importance, we split the true value time series Y_t into its trend T_t, periodicity P_t and remainder series E_t. Then, the trend importance can be computed as 1 − Var(E_t)/Var(T_t + E_t) and the periodicity importance as 1 − Var(E_t)/Var(P_t + E_t). These measures range from 0 to 1, allowing us to group the countries into 4 quadrants. We indicate values lower than 0.5 as low and higher than 0.5 as high. When writing about quadrants, we mention trend importance first and periodicity importance second. The low-high quadrant, thus, has low trend importance and high periodicity importance. The table shows that for the low-low quadrant, the persistence baseline often has the lowest error. When there is high periodicity importance, in both the low-high and the high-high quadrants, our repeated single-output ensemble proves to be the strongest, as illustrated in Figure 8.
This shows a similar situation to the low-high quadrant, where periodic patterns are somewhat captured by most methods and baselines, but not as strongly as by the repeated single-output ensemble. In cases like these, we see that the persistence baseline can be difficult to beat if the observations on the day before the test period are close to the average of the true observations later.
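The trend and periodicity importance measures used for this grouping can be sketched as follows. The decomposition below is hand-made toy data; a real analysis would obtain T_t, P_t and E_t from an STL-style decomposition as in [44]:

```python
# Sketch of the trend/periodicity importance measures: the series Y_t is
# split into trend T_t, periodicity P_t and remainder E_t, and importance
# is 1 - Var(E) / Var(component + E). Toy components for illustration.
import statistics

def importance(component, remainder):
    """1 - Var(E)/Var(C + E): close to 1 when the component dominates."""
    mixed = [c + e for c, e in zip(component, remainder)]
    return 1.0 - statistics.pvariance(remainder) / statistics.pvariance(mixed)

# Four weeks of toy data: linear trend + weekly cycle + small remainder.
n = 28
trend = [0.5 * t for t in range(n)]                              # T_t
period = [3.0 if t % 7 in (5, 6) else -1.0 for t in range(n)]    # P_t (weekend bump)
remainder = [0.1 * ((-1) ** t) for t in range(n)]                # E_t

trend_importance = importance(trend, remainder)
periodicity_importance = importance(period, remainder)
```

With both measures near 1, this toy series would land in the high-high quadrant; a series of pure noise would yield values near 0 and land in low-low.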

This shows that, compared to the other baselines, our repeated single-output ensemble and the multi-output ensembles using place visit mobility data are quite strong in the 2020 scenario. While the persistence baseline wins for 20 of the 58 countries, it fails on time series data that exhibits strong patterns of periodicity or trends. The other baselines perform worse than our methods. Our repeated single-output ensemble is strong when cycles are apparent but fails when the true observations suddenly change. In the 2021 scenario, all baselines perform better than our methods. Our methods are not adapted to the concept drift in this scenario. Due to the change in the normalising factor, old patterns learned may obfuscate the new ones. We demonstrate how to address this using the concept drift adaptation techniques mentioned in Section IV-C.

We aim to understand if adapting for concept drift helps in improving COVID-19 forecasting accuracy using this AutoML approach. The answer to Q2 showed that our methods performed worse than the baselines in 2021, while they were better than most in 2020. This may be a result of concept drift. This section shows the results of our experiments adapting our methods to this drift.

As Figure 10a shows, the retraining was detrimental to their performance. Therefore, in the subsequent comparisons, we only consider the deep learning baselines using the full dataset.
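The bootstrap average-rank comparison used throughout these experiments can be sketched as follows. The RMSE values below are invented for illustration; the real comparison uses per-country RMSEs from 25 runs per ensemble and 1,000 resamples:

```python
# Sketch of the average-rank computation: per-country RMSEs of each method
# are ranked (1 = best) and ranks are averaged over bootstrap resamples of
# the countries. All numbers below are made up.
import random

random.seed(0)
rmse = {                      # method -> RMSE per country (toy values)
    "M-multi-output (pvm)": [0.8, 1.1, 0.9, 1.0],
    "M-repeated-single":    [0.9, 1.0, 1.1, 0.9],
    "B-persistence":        [1.2, 1.3, 1.0, 1.4],
}
methods = list(rmse)
n_countries = 4
n_resamples = 1000

def ranks_for_country(c):
    """Rank the methods on country c: 1 for the lowest RMSE."""
    order = sorted(methods, key=lambda m: rmse[m][c])
    return {m: order.index(m) + 1 for m in methods}

avg_rank = {m: 0.0 for m in methods}
for _ in range(n_resamples):
    sample = [random.randrange(n_countries) for _ in range(n_countries)]
    for c in sample:
        r = ranks_for_country(c)
        for m in methods:
            avg_rank[m] += r[m] / (n_resamples * n_countries)
```

A method that is consistently better in most countries ends up with a lower average rank, which is what the Nemenyi-style figures in this section visualise.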

As we can only effectively adapt for drift in the 2021 scenario, due to the lack of a drift detection mechanism, we show only results for 2021 in this section. We compare all drift adaptation strategies previously introduced in

mobility features - outperform all methods using different adaptation strategies at a significant level. The next group of methods consists mainly of the multi-output ensemble using only mortality features. For this ensemble, changes in performance with different drift adaptation strategies are smaller than for the ensembles using mobility features, but a partial refit still yields the best performance. The last group consists of multi-output methods using mobility features and drift adaptation strategies other than the partial refit strategy. These strategies do not go well together. The repeated single-output ensemble is the only method that does not improve by adapting to drift. The non-adapted version of this approach is significantly better than all its adapted counterparts. Still, its performance is ranked worse than that of all other partial refit methods.

We also compare the ensembles using the partial refit drift adaptation strategy with the baselines in Figure 10c. This figure shows that all multi-output ensembles using the partial refit strategy outperform all baselines. In this scenario, the ARIMA wavelet baseline is the strongest but performs significantly worse than the multi-output ensembles using place visits mobility data or combined mobility data. The deep learning methods are in the same group as the persistence and ARIMA wavelet baselines and are within critical distance of the partial refit multi-output ensemble using mortality data. However, they are all significantly outperformed by all multi-output ensembles using mobility features.
We show the RMSE of the baselines and our methods using the partial refit drift adaptation strategy for all countries separately in Table 4. This table shows that the multi-output ensemble using place

We show the country of Romania in Figure 9, grouped in the low-low quadrant, with irregular true observations in the test set but some indication of trend and periodicity. The sudden drops and spikes are quite difficult to anticipate for all baselines, as well as for our methods not using mobility features. The ensembles using these features, however, while not exactly predicting the magnitude of the extreme values, can predict where spikes and drops will occur.

Drift adaptation may seem much more impactful for our AutoML-based approaches than for the baselines. We perform hyperparameter optimisation to ensure the best models are configured on the provided training data. However, as the concept changes, this approach will lead to a model that over-fits the older part of the data. Consequently, this approach performs much worse on new data compared to baselines with average performance on all data.

This experiment has shown that adapting to concept drift can indeed help to improve the accuracy of COVID-19 forecasts using an AutoML approach. This is specifically the case for our multi-output ensembles using the partial refit strategy. This strategy entails keeping the ensembles trained on the old dataset but updating the model weights using the new data. This way, old knowledge is used, but the emphasis is placed on the newer data. This strategy works especially well when combined with mobility data.

FIGURE 10. Nemenyi plots showing the comparative performance of baselines and our methods when exposed to drift, based on RMSE. Methods on the left have a lower average rank and are thus comparatively better than methods on the right. When methods are linked with a horizontal bar, they are within critical distance, meaning there is no significant difference between average ranks. M and B prefixes denote our methods and baselines.

Another limitation of our work is that the best moments to adapt the ensembles over time are not detected automatically. Current AutoML systems use large batches of data at the same time to train their models. If these batches are too large, however, chances are that concept drift slips in undetected. A proper trade-off should be made between how much data is used, in order both to learn the data patterns sufficiently and to be able to detect concept drift within the used data. Future work can address this issue further.

Finally, we want to note that, due to a lack of availability of COVID-19 mortality data, we were only able to use countries in Europe for our 2021 scenario. For the countries outside of Europe that were used in the 2020 scenario, we were thus not able to test the drift adaptation strategies. It would be interesting to see whether or not the partial refit adaptation improves forecasts consistently for these countries as well.
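Whether an improvement holds consistently across countries can be checked statistically, as in the Nemenyi comparison of Figure 10: rank the methods per country by RMSE, apply a Friedman test, and inspect the average ranks. A minimal sketch with scipy follows; the per-country RMSE values are invented purely for illustration:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# invented per-country RMSE for three methods over six countries (hypothetical)
rmse_mo_pvm = [10.1, 8.3, 12.0, 7.5, 9.9, 11.2]
rmse_rso_d  = [12.4, 9.1, 13.5, 8.8, 10.7, 13.0]
rmse_base   = [13.0, 9.8, 14.2, 9.5, 11.5, 13.8]

# Friedman test over per-country rankings of the methods
stat, p = friedmanchisquare(rmse_mo_pvm, rmse_rso_d, rmse_base)

# average rank per method (1 = best), as plotted on a Nemenyi diagram's axis
scores = np.array([rmse_mo_pvm, rmse_rso_d, rmse_base]).T   # countries x methods
ranks = np.mean(np.argsort(np.argsort(scores, axis=1), axis=1) + 1, axis=0)
```

A significant Friedman p-value justifies the post-hoc Nemenyi comparison, in which methods whose average ranks differ by less than the critical distance are linked by a horizontal bar.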

In this work, we adapted the AutoML framework auto-sklearn to COVID-19 forecasting. We used mortality data and mobility data collected from 26 European countries to construct automatically configured ensembles of regression models.

TABLE 3. RMSE on 2020 forecasts. Countries are grouped by quadrant: low trend - low periodicity, low trend - high periodicity, high trend - low periodicity and high trend - high periodicity. Our methods are abbreviated to MO for multi-output and RSO for repeated single-output. Between parentheses, d denotes mortality data, mtm mode of transport mobility, pvm place visits mobility and cm combined mobility. Our methods and baselines are denoted with M and B, respectively. All baselines only use mortality data.

TABLE 4. RMSE on 2021 forecasts. Countries are grouped by quadrant: low trend - low periodicity, low trend - high periodicity, high trend - low periodicity and high trend - high periodicity. Our methods are abbreviated to MO for multi-output and RSO for repeated single-output. Between parentheses, d denotes mortality data, mtm mode of transport mobility, pvm place visits mobility and cm combined mobility. All our methods in this table use the partial refit drift adaptation strategy. M and B prefixes denote our methods and baselines. All baselines only use mortality data.

We compared the performance of a multi-output ensemble and a repeated single-output ensemble, and further combined these with concept drift adaptation strategies. We evaluated the performance of our ensembles in terms of root mean squared error against five different baselines found in recent COVID-19 forecasting literature.
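The two multi-step forecasting variants compared above can be sketched as follows. A `RandomForestRegressor` stands in for the auto-sklearn ensemble, and the lag and horizon sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_windows(series, n_lags, horizon):
    """Sliding windows: X holds n_lags past values, Y the next horizon values."""
    X, Y = [], []
    for i in range(len(series) - n_lags - horizon + 1):
        X.append(series[i:i + n_lags])
        Y.append(series[i + n_lags:i + n_lags + horizon])
    return np.array(X), np.array(Y)

series = np.sin(np.arange(200) / 5.0)          # toy periodic signal
X, Y = make_windows(series, n_lags=14, horizon=7)

# (a) multi-output: one model predicts all 7 steps at once
mo = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y)
mo_forecast = mo.predict(series[-14:].reshape(1, -1))[0]       # shape (7,)

# (b) repeated single-output: a one-step model applied recursively,
# feeding each prediction back into the input window
rso = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y[:, 0])
window = list(series[-14:])
rso_forecast = []
for _ in range(7):
    nxt = rso.predict(np.array(window[-14:]).reshape(1, -1))[0]
    rso_forecast.append(nxt)
    window.append(nxt)
```

The design trade-off is that the multi-output model learns all horizons jointly, while the recursive variant needs only a one-step model but can accumulate its own prediction errors over the horizon.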

Overall, our work has demonstrated the potential of devising AutoML solutions for COVID-19 forecasting, as well as of using open mobility data to guide predictions. Our experiments have shown that it is possible to increase forecasting accuracy by using mobility features in addition to mortality features. Our experimental results also suggest that the place visits mobility data is more informative than the mode of transport mobility data; this may be because the place visits data is less aggregated than the mode of transport set. Nevertheless, using either of these sets can improve forecast quality. We also found that when concept drift occurs, due to a shift in data normalisation and possibly virus mutations, it is necessary to incorporate concept drift adaptation techniques into our AutoML methods in order to obtain useful predictions. When adapted, our multi-output methods using mobility data significantly outperform the baselines we have considered in our study. The most effective approach uses the adaptation strategy of refitting the ensembles once drift has occurred. Automatically finding the best moments to adapt the ensembles over time is an interesting direction for future work.