Machine Learning Approach for Forecast Analysis of Novel COVID-19 Scenarios in India

The novel coronavirus (nCoV) is a new strain whose spread needs to be contained through effective preventive measures taken as swiftly as possible. Timely forecasting of COVID-19 cases can support significant decisions and planning for the implementation of preventive measures. In this study, three common machine learning (ML) approaches, namely linear regression (LR), sequential minimal optimization (SMO) regression, and M5P, are discussed and implemented for forecasting novel coronavirus disease-2019 (COVID-19) pandemic scenarios. To demonstrate the forecast accuracy of these ML approaches, a preliminary sample study has been conducted on the first wave of the COVID-19 pandemic for three countries: the United States of America (USA), Italy, and Australia. Furthermore, the contributions of this study are extended by an in-depth forecast study of COVID-19 pandemic scenarios for the first, second, and third waves in India. An accurate forecasting model has been proposed on the basis of the results of the aforementioned forecasting models. The findings highlight that LR outperforms all other forecasting models tested herein in the present COVID-19 pandemic scenario. Finally, the LR approach has been used to forecast the likely onset of the fourth wave of COVID-19 in India.

Simple mathematical models that capture the fundamentals of epidemic spread can be used to fit data using a large number of parameters, and the resulting values can be used to generate accurate forecasts. In recent years, the scientific community has gathered significant evidence for diverse and complex social network connection patterns [15], [16]. These are important in defining the behavior of equilibrium and non-equilibrium systems in general, as well as the spread of pandemics and the development of effective prevention measures. Digital epidemiology and the theory of epidemic processes on complex networks are the results of interdisciplinary studies at the intersection of statistical physics, network science and epidemiology, driven by the vast amounts of data documenting our health status and lifestyle.

Various dynamic models were used to investigate and evaluate epidemiological parameters such as incubation period, transmissibility period and many others in prior pandemic outbreaks [17], [18]. Machine learning (ML)-based forecasting methods have proven to be effective in analyzing post-operative outcomes and making better decisions about future activities [19]. ML models have long been used in numerous domains, including detecting and prioritizing adverse aspects of a threat. Several studies used simple techniques to estimate the number of COVID-19 cases, assuming that government data is reliable and accurate. Auto regressive integrated moving average (ARIMA) and Holt's simple exponential method have been used in [20] for short-term forecasting of COVID-19 spread in India. For Italy, China and France, simple mean-area models, susceptible-infected-recovered-death models, the Gompertz model, the logistic model and the Bertalanffy model have been utilized [21], [22].
Researchers have applied the Gompertz model to predict the growth of tumors and many other processes, whereas the logistic growth model has been used to model the outbreak of COVID-19 and predict its global spread. Similarly, the exponentially escalating model was used to forecast the final size and spread of COVID-19 in Italy, as well as the total number of confirmed COVID-19 cases in China, Italy, South Korea, Iran, and Thailand [23], [24]. Other nations, such as the United States, Iran, Slovenia, and Germany, were expected to have COVID-19 instances between March 29 and April 12, 2020, according to the prediction in [25]. In addition, the number of new confirmed cases, recovery, and mortality numbers for Algeria, Australia, and Canada have been assessed [26].

Furthermore, in [27], the authors used ML and evolutionary computing methods with regression for the COVID-19 virus spread prediction and control model. In addition, [28] systematically reviewed forecasting models to identify key factors in the spread of the COVID-19 pandemic. In a study described in [29], artificial neural networks (ANNs) were used to make a real-time predictor model for COVID - [...]

The remainder of this paper is organized as follows. The LR, SMO regression and M5P forecasting approaches are briefly explained in Section 2. Section 3 presents the results of the preliminary sample study on the first wave of the COVID-19 pandemic performed for three countries, namely the USA, Italy, and Australia. Section 4 presents the results of an in-depth forecast study of COVID-19 pandemic scenarios for the first, second, and third waves in India. Based on the proposed forecasting model, the likely onset of the fourth wave of COVID-19 in India is also indicated in Section 4. Finally, the conclusion is drawn in Section 5.

II. TECHNIQUES USED TO FORECAST NOVEL COVID-19
With the ever-increasing availability of data, ML has become an emerging technology for comprehensive data analysis over the past two decades and an essential component of technological advancement. ML algorithms are also increasingly being used in forecasting studies. In this work, three common ML-based forecasting approaches, viz. LR, SMO regression and M5P, have been employed to predict COVID-19 scenarios in India. All three methods are well known and commonly used in recently reported literature, which is why the authors selected them for this study. In addition, several error measures have been used to assess the forecast accuracy of these techniques.

The steps for producing forecasts for a given problem using a common LR technique have been comprehensively explained in [26] and [46]. LR models a linear relationship between y and x, which is called regression:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\varepsilon$ is the error term of the linear regression. The error term accounts for the variability in y that the linear relationship with x does not explain, $\beta_0$ represents the y-intercept and $\beta_1$ represents the slope.
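
As an illustration of how the intercept β0 and slope β1 can be estimated, the following minimal sketch fits a single-predictor linear regression by ordinary least squares and extrapolates two steps ahead. Ordinary least squares is an assumption here, since this section does not state the estimation procedure, and the function names `fit_simple_lr` and `predict` as well as the toy series are hypothetical and not taken from the study.

```python
from statistics import mean


def fit_simple_lr(x, y):
    """Estimate (b0, b1) of y = b0 + b1 * x by ordinary least squares."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x; zero means the slope is undefined.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    # Sum of cross-deviations between x and y.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # y-intercept
    return b0, b1


def predict(b0, b1, x_new):
    """Evaluate the fitted line at new x values."""
    return [b0 + b1 * xi for xi in x_new]


# Hypothetical toy series (day index vs. daily cases), not data from the study.
days = [1, 2, 3, 4, 5]
cases = [110, 120, 130, 140, 150]

b0, b1 = fit_simple_lr(days, cases)
forecast = predict(b0, b1, [6, 7])
```

On this perfectly linear toy series the residual term is zero, so the fitted line reproduces the data exactly. Real case counts would leave a non-zero error term, which is what the error measures discussed later quantify.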

SMO regression is a simplified technique for rapidly solving the support vector machine (SVM) quadratic programming (QP) problem without any additional matrix storage or numerical QP optimization. To achieve convergence, SMO regression decomposes the overall QP problem into QP sub-problems. The use of SMO regression to produce forecasts for a given problem has been described in [47].
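
For orientation, the QP problem that SMO decomposes can be written, in the standard ε-insensitive support vector regression formulation, as the dual problem below. This is the textbook form and is given here as an assumption, since the formulation is not reproduced in this section; the symbols $\alpha_i$, $\alpha_i^*$ (Lagrange multipliers), $K$ (kernel), $C$ (regularization constant) and $\varepsilon$ (tube width, unrelated to the regression error term) are not the source's notation.

```latex
\begin{aligned}
\max_{\alpha,\,\alpha^*}\quad
  & -\tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
      (\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\,K(x_i,x_j)
    \;-\; \varepsilon\sum_{i=1}^{n}(\alpha_i+\alpha_i^*)
    \;+\; \sum_{i=1}^{n} y_i\,(\alpha_i-\alpha_i^*) \\
\text{s.t.}\quad
  & \sum_{i=1}^{n}(\alpha_i-\alpha_i^*) = 0,
    \qquad 0 \le \alpha_i,\ \alpha_i^* \le C .
\end{aligned}
```

Because of the equality constraint, the smallest sub-problem that can change the solution while staying feasible involves two multipliers at a time, which is presumably why the procedure works on a pair of variables.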

The implementation of SMO regression has the following steps:

Step 1: Break the large QP problem into a series of the smallest possible QP sub-problems. Find the most promising pair (µ1 and µ2).

Step 2: Solve the small QP problems promptly when com- [...]

The M5P technique is a numeric prediction tool based on classification and regression analysis. It is a modified version of the original M5 tree algorithm, which enables it to deal with enumerated attributes and missing attribute values. M5P is more sensitive to data segmentation and gives better results with longer data sets as input. The following steps are involved in implementing the M5P technique to produce forecasts for a given problem, as detailed in [48] and [49]:

Step 1: Take the input data (enumerated attributes), convert it into binary variables, and apply the algorithm to maximize the standard deviation reduction (SDR):

$$\mathrm{SDR} = \delta(C_s) - \sum_{k} \frac{|C_{s_k}|}{|C_s|}\,\delta(C_{s_k})$$

where $C_s$ is the set of cases, $C_{s_k}$ is the k-th subset of cases that results from the tree splitting process, $\delta(C_s)$ is the standard deviation of $C_s$, and $\delta(C_{s_k})$ is the standard deviation of the k-th subset as a measure of error.

Step 2: Use these binary variables to construct a tree (over-fitting increases as the tree grows).

Step 3: Perform the tree pruning process (which reduces the problem of over-fitting) and compensate for discontinuities.

Step 4: Carry out the tree smoothing process to compensate for sharp discontinuities that occur between adjacent linear models at the end nodes (leaves) of the pruned tree.

Step 5: Produce the tree model as the output.

[...] [50], [51], [52] are mathematically defined as follows: [...] Table 2 and Table 3; Table 4 and Table 5; and Table 6 and [...]

The MAPE is one of the most commonly used key performance indicators of forecast accuracy (i.e., the lower the MAPE, the higher the forecast accuracy). However, MAPE values can exceed 100%, which means that the errors are larger than the actual values [53]. On the other hand, setting arbitrary forecast performance targets without reference to the forecast data (e.g., MAPE < 10% is excellent, MAPE < 20% is good) is irrational [54].

[...] Table 5 and illustrated in Figure 2, [...]

VOLUME 10, 2022

[...] daily cases for the first wave of COVID-19 in Australia, and the same has been depicted graphically in Figure 3. It can be observed that the average daily MAPE values for the LR, M5P, and SMO regression techniques are 1.48, 2.8 and 2.78, respectively, for 01-07 April 2020; 0.71, 1.61 and 0.96, respectively, for 08-14 April 2020; 0.46, 0.65 and 0.54, respectively, for 15-21 April 2020; and 0.18, 0.24 and 0.56, respectively, for 22-28 April 2020. The daily MAPE, as summarized in Table 7 and illustrated in Figure 3, implies that the LR technique again outperforms M5P and SMO regression for forecasting daily cases for the first wave of COVID-19 in Australia.

Forecasts of daily cases for April 01-28, 2020 using the LR, SMO regression and M5P techniques for the first wave of COVID-19 in the USA, Italy and Australia are depicted in Figure 4. A comparison with the actual data shows the variations between the forecasted values and the actual values.
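
The MAPE figures quoted in this section can be made concrete with a short sketch. It assumes the standard definition of MAPE (the mean of the absolute errors expressed as a percentage of the actual values), because the error-measure equations are not reproduced here. The helper names and the toy numbers are hypothetical and are not values from the study.

```python
def daily_ape(actual, forecast):
    """Absolute percentage error for each day, in percent."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when the actual value is zero.
        raise ValueError("actual values must be non-zero")
    return [abs(a - f) * 100.0 / abs(a) for a, f in zip(actual, forecast)]


def mape(actual, forecast):
    """Mean absolute percentage error over the whole period, in percent."""
    errors = daily_ape(actual, forecast)
    return sum(errors) / len(errors)


# Hypothetical three-day window: errors of 10%, 5% and 0%.
window_mape = mape([100, 200, 400], [110, 190, 400])

# A forecast far above a small actual value pushes MAPE beyond 100%.
large_mape = mape([10], [35])
```

The second call illustrates the remark above that MAPE is not bounded by 100%: a forecast of 35 against an actual value of 10 gives an error of 250%. The zero-value guard also shows why MAPE is awkward for days with no reported cases.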
Table 8 summarizes the daily MAPE for 22-28 April 2020 of the LR, SMO regression and M5P techniques to compare the accuracy of the daily death forecasts for the first wave of COVID-19 in the USA, Italy and Australia, and the same has been depicted graphically in Figure 5. The average [...]

FIGURE 6. Daily MAPE of LR, SMO regression and M5P techniques to compare forecast accuracy of daily cases for the first wave of COVID-19 in India for the duration: (a) 01-07 July 2020, (b) 08-14 July 2020, (c) 15-21 July 2020, (d) 22-28 July 2020.

[...] the work presented with other methods reported in recently published papers.

The various error measures summarized in Table 2 to Table 8 indicate that all three approaches employed in the preliminary sample study of the previous section, viz. LR, SMO regression and M5P, have acceptable forecast accuracy and that the LR technique outperforms the M5P and SMO regression techniques. Although the LR technique performed best during the first wave in three different countries, using the LR technique alone would not be sufficient for extensive forecast analysis in the Indian scenario, since ML algorithms rely heavily on quality data to learn future trends and build better-performing forecasting models. This prompted the authors to continue their extensive forecast analysis on the first, second and third waves of the COVID-19 pandemic in India using all three techniques.

In this sub-section, the authors have conducted a complete forecast study on the first wave of the COVID-19 pandemic scenario in India. [...] have been evaluated as 4.28, 1490, 1832.85, and 4247828, respectively. Hence, LR has higher forecasting performance than M5P and SMO regression. As stated in the previous section, MAPE is one of the most commonly used key performance indicators of forecast accuracy. Therefore, Table 11 has been prepared to summarize [...] graphically represented in Figure 6. It can be observed that the average daily MAPE values for the LR, M5P and SMO regression techniques are 4.12, 4.20 and 4.52, respectively, for 01-07 July 2020; 2.56, 2.56 and 2.98, respectively, for 08-14 July 2020; 5.12, 5.12 and 6.18, respectively, for 15-21 July 2020; and 5.18, 5.19 and 6.06, respectively, for 22-28 July 2020. The daily MAPE values, as summarized in Table 11 and shown in Figure 6, indicate that the LR technique outperforms M5P and SMO regression for forecasting daily cases for the first wave of COVID-19 in India.
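
The comparison above can be checked with simple arithmetic. The sketch below takes the four weekly average MAPE values quoted in the text for July 2020, averages them per technique, and ranks the techniques. The unweighted mean over the four weeks and the variable names are illustrative assumptions, not a procedure stated by the authors.

```python
# Weekly average daily MAPE (%) for India's first wave, as quoted in the text:
# 01-07, 08-14, 15-21 and 22-28 July 2020.
weekly_mape = {
    "LR":  [4.12, 2.56, 5.12, 5.18],
    "M5P": [4.20, 2.56, 5.12, 5.19],
    "SMO": [4.52, 2.98, 6.18, 6.06],
}

# Unweighted mean over the four weeks for each technique.
overall = {name: sum(values) / len(values) for name, values in weekly_mape.items()}

# Techniques ordered from lowest (best) to highest mean MAPE.
ranking = sorted(overall, key=overall.get)
best = ranking[0]

# Weeks in which LR is strictly better than each competitor.
lr_wins = {
    name: sum(lr < other for lr, other in zip(weekly_mape["LR"], values))
    for name, values in weekly_mape.items()
    if name != "LR"
}
```

With these figures the four-week means are 4.245 for LR, 4.2675 for M5P and 4.935 for SMO regression, so LR ranks first. The margin over M5P is small, however: at two-decimal precision LR is strictly better in only two of the four weeks and ties in the other two, whereas it is better than SMO regression in all four.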

A forecast of daily cases from 01 June 2020 to 31 Jan 2021 using the LR technique for the first wave of COVID-19 in India is depicted in Figure 7. A comparison with the actual data shows how closely the forecasted values match the actual data. The first part of Table 16 summarizes the daily MAPE for 08-14 July 2020 of the LR, SMO regression and M5P techniques to compare the accuracy of the daily death forecasts for the first wave of COVID-19 in India, and the same has been depicted graphically in Figure 8.

[...] of COVID-19 in India and the same has been illustrated graphically in Figure 9. It can be observed that the average daily MAPE values for the LR, M5P, and SMO regression techniques are 6.72, 9.85 and 7.92, respectively, for 01-07 May 2021; 6.61, 11.68 and 7.30, respectively, for 08-14 May 2021; 4.40, 4.78 and 5.11, respectively, for 15-21 May 2021; and 6.79, 7.39 and 8.24, respectively, for 22-28 May 2021. The daily MAPE, as summarized in Table 13 and shown in Figure 9, indicates that the LR technique outperforms M5P and SMO regression for forecasting daily cases for the second wave of COVID-19 in India.

A forecast of daily cases from 01 Feb 2021 to 31 Oct 2021 using the LR technique for the second wave of COVID-19 in India is depicted in Figure 10. A comparison with the actual data shows that the forecasted values match the actual data. The middle part of Table 16 summarizes the daily MAPE for 08-14 May 2021 of the LR, SMO regression and M5P techniques to compare the accuracy of the daily death forecasts for the second wave of COVID-19 in India, and the same has been depicted graphically in Figure 11. The average daily MAPE values corresponding to the LR, M5P, and SMO regression techniques have been evaluated as 7.46, 10.04 and 7.56 [...]

[...] graphically in Figure 12. It can be observed that the average [...] Table 15 and shown in Figure 12, clearly indicates that the LR technique for forecasting daily cases for the third wave of [...] On the other hand, the last part of Table 16 [...]

Figure 16 depicts the box plot of the daily MAPE of new cases during the first, second and third waves of COVID-19 in India. The combined plot of daily COVID-19 cases for all three waves in India shows that the first wave lasted longer than the second wave. However, the number of daily new cases was the lowest for the first wave compared to the second and third waves. The third wave was the shortest of the three, and its number of daily new cases was slightly lower than for the second wave. The second wave affected people more severely than the first and third waves of COVID-19 in India. Nevertheless, the Indian government took action against COVID-19 before the situation could get worse, which had been a concern for many experts given India's large population.

Forecasting the likely onset of the fourth wave will be of great help in making important decisions and planning for the implementation of preventive measures. Based on the extensive analysis conducted for the first, second and third waves of COVID-19 in the Indian scenario, the LR technique alone would be sufficient to forecast the likely onset of the fourth wave of COVID-19 in India. The forecast obtained with the LR technique for daily new cases for the period 27 March 2022 to 28 July 2022 is shown in Figure 17, which shows an upswing in daily new cases after May 2022. Given the rapidly increasing daily new cases during June-July 2022, India is likely to witness a fourth wave of COVID-19 in the coming days if preventive measures are not taken.