Novel Air2water Model Variant for Lake Surface Temperature Modeling With Detailed Analysis of Calibration Methods

The air2water model is a simple and efficient tool for modeling surface water temperature in lakes based solely on the air temperature. In this article, we propose to modify the air2water model in such a way that different parameters would be associated with lake stratification of cold waters than with lake stratification of warm waters. The situation of a mix of both cold water and warm water is also considered. The model is tested on 22 lowland Polish lakes against two classical air2water variants. As the new air2water model variant is slightly more complicated than the basic versions, we focus on the importance of the choice of the calibration method. Each variant of the air2water model is calibrated with eight different optimization methods, which are also compared on various benchmark problems. We show that the proposed variant is superior to the classical air2water models on about 90% of tested lakes, but only if the calibration approach is properly selected, which confirms the importance of the links between the model and appropriate optimization procedures. The proposed air2water variant performs well on various lowland lakes, with exception of large but shallow ones, probably due to the weak stratification of the shallow lakes.

The majority of lake water temperature models are either physical-based [14], [15] or empirical [16], [17], [18], [19]. However, recently, some kinds of physically based empirical models have been proposed. One group of such approaches are physics-guided neural networks [6], which have a large number of parameters, another is the air2water model that in its basic version has only eight parameters that represent physical and thermal processes that occur between air, land, and the lake surface [3], [20]. Air2water model has found numerous applications in hydrology and limnology [18], [21], [22], [23], [24]. In general, the air2water model aims at modeling water temperature within some depth of a lake's surface layer. However, it is very difficult to define the bounds of this surface layer [24]. Recently Guo et al. [25] introduced a variant of the air2water model with 15 parameters developed for the thermal surface conditions of frozen lakes. Based on the idea introduced in [25], in this article, we propose a new version of the air2water model that has 12 parameters but is much more flexible in capturing interrelationships between a cold or frozen lake surface and the water body. The variant uses different parameterizations for cold and warm water periods (basically, those with classical thermal stratification and inversion of water temperatures), and considers a situation of a mix of both of them.
In this article, we present a modified version of the air2water model and analyze how the modeling quality will be affected by the choice of the optimizer. Both issues are interrelated, as although the air2water model is physically based, the physical processes in air2water are parameterized, and such parameters require calibration. Simpler optimizers may be unable to show the advantage of the new version of the air2water model over its older counterparts, but more advanced optimization procedures may be able to do so. Hence, proposing a new variant of a model should be associated with a discussion on the appropriate calibration method. The question regarding the relationship between the model and its optimizer may be addressed from a philosophical [26] or empirical point of view (e.g., [27]), but is asked again and again in various fields of science. It is to some extent similar to the problem of the choice of the appropriate machine learning approach for the particular data [19].
In the present article, we compare the performance of the new air2water model variant against its two older counterparts [3], [20] on 30-year-long daily air temperatures and water temperatures of surface layers in 22 lakes that are located in the lowland part of Poland. Each air2water variant is trained with eight selected optimization algorithms that are discussed in Section III-A. Before the algorithms are applied to our problem of interest, we test them on benchmarks and real-world problems from the literature to determine which among eight metaheuristics seems to be the "best". To do so, we define a few criteria for comparing optimization algorithms and test the eight metaheuristics on 30 IEEE CEC2017 benchmarks [76] and 22 real-world problems from different fields of sciences introduced in IEEE CEC2011 [77]. We discuss the performance of each optimizer and analyze the ranking of metaheuristics based on particular criteria and benchmarks. We point at algorithms that seem most appropriate based on artificial benchmarks or various real-world problemsfollowing a roughly similar way as in [78]. After that, we get back to the problem of modeling water temperatures in lakes and use each among eight metaheuristics for calibration of the three air2water model variants. We fit parameters of the air2water model variants to observational data from the calibration set that is composed of roughly 20 years of daily measurements; the remaining 10 years of daily data is kept aside as the validation set. Based on the results obtained for calibration and validation sets, for each metaheuristic separately we choose the best variant of the air2water model and discuss the impact of the optimizer on that choice. Finally, we compare the performances obtained by different air2water variants trained by different metaheuristics and point out to what degree the choice of the best metaheuristics based on artificial benchmarks or real-world problems not related to our problem of interest were useful in determining the most appropriate method for calibration of the air2water model.
To summarize, there are three main goals of the present study:  Table I. 1) to improve the air2water model for lake water temperature modeling, by proposing a new variant; 2) to analyze the impact of existing optimization algorithms on the performance of air2water variants -both the new one and two classical versions from [20]; 3) to find out if one could pick up an appropriate optimization algorithm for the specific application-air2water model calibration in our case-based on the results obtained by optimizers on classical benchmark sets: either numerical benchmarks [76] or collections of real-world problems [77].

A. Data
In this article, 30 years long daily data (1987-2016) from 22 lowland lakes located in the northern and western part of Poland and from nine meteorological stations located in the nearby areas are used (see Table I and Fig. 1). Data from the years 1987-2006 are used for optimization, the remaining 2007-2016 data are left for the model validation. In the case of Lake Bachotek, the data set ends on November 30, 2016, and in the case of Lake Slawskie, the data ends on December 31, 2014; in both cases, the validation periods are shorter, but calibration periods are the same as for the other lakes.
In each lake, the water temperature was measured once a day at 6:00 GMT at a water depth of 0.4 m below the water surface, or below the ice cover during winter months. All lakes are dimictic, which means that the difference in temperature between the surface and bottom layers becomes negligible twice per year (about 4°C), allowing all strata of the lake's water to circulate vertically. During the year the surface may have water temperatures both lower and higher than 4°C-this is important for the mixing among lake layers, as 4°C is the temperature in which the water is the densest.

B. Air2water Model With Eight Parameters
The air2water model, which has been proposed by Piccolroaz et al. [20] and Toffolon et al. [3], is a simple physically inspired model that relates the surface water temperature in lakes only to the air temperature. The model is based on a volume-integrated heat balance equation with linearized heat flux terms. The relation between air and lake water temperatures is expressed as follows [3], [20]: (1) where δ is defined as In (1), t is the day of the year and t y is the number of days in a year. In (1) and (2), T w is the lake water temperature in the surface layer, T a is the air temperature, δ is a dimensionless parameter representing the ratio between the volume of the surface lake layer and a reference volume, T h is a reference value for the deep lake water temperature, and a 1 -a 8 are eight model parameters that are to be calibrated for the particular lake. In the original air2water model with eight parameters, the reference value T h is fixed depending on the lake type. In the case of Polish lakes that are dimictic, it is set to 4°C. For warm or cold monomictic lakes, it is set to the minimum or the maximum water temperature, respectively. In the version of the air2water model proposed in [79], T h is calibrated together with the remaining eight parameters, instead of being fixed. In the present study, we keep the value fixed to 4°C, as proposed for dimictic lakes in the original air2water model [20].
The air2water model is solved using the 4th-order Runge-Kutta method, and water temperatures below 0°C are set as 0°C during computations. To limit the impact of the initial conditions on the results, the first 30 days of each data set are considered a warm-up period and are not used to evaluate the modeling performance. As the basic model has eight parameters, we call it a2w8 in the future part of the text.

C. Air2water Model With Six Parameters
As may be seen from (2), the computation depends on whether the reference value for the deep lake water temperature is higher or lower than the water temperature in the surface lake layer. As discussed in [3] and [20], for some lakes air2water model may be insensitive to the parameters a 7 and a 8 , which are used solely for the calculations of δ, and only in a case when the water temperature is between 0°C and 4°C. Hence, a simplified air2water model with six parameters was proposed [3] in the following form: This way the parameters a 7 and a 8 are ignored and the model requires calibration of just six parameters (hence we call it a2w6). However, the relation between the air and the lake surface temperatures in colder periods of the year may be oversimplified and less adequate.

D. Proposed Air2water Model With 12 Parameters
The classical air2water model relates the air temperature with the temperature of water within the upper layer of the lake, hence the temperature that is observed within a body of liquid water at some depth under the surface. However, a conceptually similar approach could be also used to link air temperature with the temperature of the sheer surface of the lake that may be measured from remote sensing (e.g., [80]). In such a case the surface temperature may be measured at the top of either the liquid water or the ice that may cover the lake. Guo et al. [25] introduced the air2water variant in which heat exchange between the air and the water may in some periods be open, but in others become blocked by the ice. In [25], it was proposed that the temperature at the surface T w may be computed either as the temperature of the liquid water or as the temperature of the ice on the lake's surface (7) or as a mix of both temperatures. Which equation will be used is governed by water temperature (T w ) and two additional model parameters (a 14 and a 15 ).
is used, and when a 15 > T w > a 14 , the mix of both (5) and (7) is applied such that the result from (5) is multiplied by (1 -K) and the result from (7) is multiplied by K, where K = (a 15 − T w )/(a 15 − a 14 ) is a ratio between frozen and open water part of the lake surface.
Note that (7) is similar to (5) or (1), but with different parameter names (a 9 instead of a 1 , a 10 instead of a 2 , and so on). This allows the model to perform differently in icy and liquid surface conditions. The approach proposed in [25] was defined for cases when one aims to determine the temperature of the lake surface with satellite data, but it does not skip the problem with the lack of model sensitivity to a 7 and a 8 parameters. However, if we are interested in the temperature of liquid water in a surface layer of the lake (in the case of Polish lakes, either 0.4 m under the open surface or under the ice cover), one may eliminate these two parameters by using two similar equations with a separate parameterization for different lake stratifications in a similar way as proposed in [25]. This new variant developed for in situ water temperature measurements may be proposed in the following form: where a 11 , a 12 ∈ [0 • C, 25 • C], and K that appears in (8) represents the ratio between the occurrence of two different stratifications (attributed to cold or warm waters), defined as In the proposed variant of the air2water model [ (8) and (9)], there are 12 parameters: a 1 -a 5 and a 6 -a 10 represent similar processes, but for warmer (a 1 -a 5 ) or colder (a 6 -a 10 ) waters. Similar to the idea presented in [25], the role of K, and hence the parameters a 11 and a 12 , is to choose whether the water temperature should be computed according to the version representative for warm waters, cold waters, or a mix of both. This division is important due to different stratification dynamics in colder and warmer waters [81], [82]. Note that δ does not appear in this variant of the air2water model. Because the new variant uses different approximations for cold and warm waters, it may better fit the conditions in the particular lake. The number of parameters (12, hence we call this variant a2w12) is larger than in the case of the classical (8 parameters) or simplified (6 parameters) air2water variants, but lower than in the version designed for satellite measurements [25]. In Supplementary  Table I, the bounds for each parameter used in a2w6, a2w8, and a2w12 variants are given.

E. Optimization Criteria
In the present study, we test the performance of three different air2water model variants when each of them is calibrated with eight different optimization algorithms. As a result, we compare both the models and the optimizers. The objective function used during optimization is the mean square error (MSE) where N is the number of observations in the calibration data set, and T P and T M are predicted and measured lake temperatures at the ith day, respectively. Obviously, the data from the validation set are used to compute MSE for the validation period (with different days i and a different number of days N). The maximum number of function calls is set to 100 000 for all considered metaheuristics, irrespective of the variant of the air2water model, and hence of the number of parameters. This approach is similar to the one applied in the CEC2011 competition on real-world problems (where 150 000 function calls were allowed) but different from the majority of CEC competitions held on artificial benchmark problems, in which the number of function calls depends on the problem dimensionality (e.g., [76], [83], and [84]). The fixed number of function calls benefits rather the classical than the new air2water variant. During empirical tests, each optimization algorithm was run 30 times on every air2water variant and lake.

A. Chosen Optimization Algorithms
In [85], 12 various metaheuristics were tested for the calibration of the basic air2water model with 8 parameters. In the present study, we introduce a new air2water model variant and analyze how to choose the right algorithm to train it. As a result, we need to apply recent and competing algorithms that belong to various families of methods. For historical reasons, the basic PSO is also tested, as it is the first method used for air2water optimization in the past. The eight metaheuristics tested, include the following: 1) The basic PSO algorithm [86] that has been used for the calibration of the basic air2water model in the past papers [87]. The PSO version with global topology [86], c 1 = c 2 = 1.49445 [88], inertia weight decreasing linearly with time from 0.9 at the start of the search to 0.4 at the end [89] and swarm size set to 50 [59] has been applied. 2) Triple archives PSO (TAPSO) [90] is a new version of PSO that divides the particles into different subsets depending on their fitness and the speed of progress and proceeds with both swarm intelligence and genetic operators. We used 60 particles in the experiment, as there was no guide provided in [90] on how to set the population size for a particular type of problems. 3) HARD-DE [91] that turned out the best method for the classical eight-parameter version of the air2water model in [85]. The good performance of HARD-DE was also shown in [91] against a number of other DE variants on various benchmarks. HARD-DE uses nonlinear, parabolic population size reduction, from initial 25 × lnD × sqrt(D) (but, in the present study, not fewer than 25) to 4 at the end of the search. 4) EL-SHADE-SPACMA [92], which was ranked the third best method in the CEC2018 competition. This algorithm is an updated version of L-SHADE-SPACMA [93], and is a member of the L-SHADE family of methods [61] that was developed by step-by-step improvements of older algorithms [94] and performs exceptionally well in many competitions between metaheuristics [95]. This variant uses linear population size reduction during the run; the algorithm starts from the population size set to 18D and ends with four individuals. 5) L-SHADE-50-PWI [96] that, unusually, is a simplified version of some L-SHADE variants, developed by reducing some of their unneeded elements, but adding an inertia weight that is so widely used in PSO algorithms [89]. In the source paper, it outperformed a number of other L-SHADE variants on CEC2017, CEC2014, and various real-world problems. L-SHADE-50-PWI found a successful application in the soil erosion analysis [97] and achieved reasonable results for brain magnetic resonance image segmentation [98]. The use of inertia weight within L-SHADE was also considered a promising solution in space flight trajectory design [99]. 6) djDE with diversity-based adaptive population size [60].
This approach combines one of the best DE variants from early 2000's, jDE algorithm [100] with a new population size adaptivity method that allows to balance exploration and exploitation capabilities by controlling the population diversity [101]. 7) Stochastic fractional search algorithm with fitnessdistance balance (FDB) [102], which is a fresh approach based on the fitness-distance balance algorithm proposed in 2020 [103]. The population size is set to 50. 8) Self-organizing migrating algorithm-T3A -(SOMA, [104]), an adaptive extension of an algorithm proposed in [105], that has been ranked 4th out of 18 algorithms for the 2019 100-Digit Challenge on real-parameter single objective optimization [63], a competition with (almost) unlimited time to solve 10 hard problems with precision up to 10 digits of accuracy. SOMA solved 9 problems out of 10 with the required accuracy, the only algorithm that solved all problems with assumed precision was a variant of jDE called jDE100 [106]-a modification of which, called djDE [60], has already been included in our comparison. Three among eight tested algorithms are based on the famous L-SHADE [61] optimizer that won IEEE Competition in Evolutionary Computation in 2014. We have used three algorithms from this family of methods, as several L-SHADE-based variants won various IEEE Competitions or achieved very good performances in comparison studies [94]. The efficient L-SHADE algorithms include the three variants applied in this study, as well as SPS-L-SHADE-EIG that uses rotation invariant mechanism [107], L-SHADE-cnEpSin [108] that uses an ensemble of sinusoidal approaches in parameter adaptation, L-SHADE-X [109] that introduces dimensionality-dependent spread parameter in distribution from which scaling factor is generated and an archive with offsprings that did not reach the main population despite having good performance, or OLSHADE-CS [110] that uses novel initialization and conservative selection mechanisms. In our opinion, large successes of L-SHADE-based algorithms justify testing various variants on different practical applications, as in [57], [111], and [112].
All the metaheuristics were run with default control parameters as proposed by their inventors, with a few exceptions noted in the above description. In our tests, first, each method is tested on 22 real-world problems from CEC2011, and, due to the low-dimensionality of the air2water model-on 10-dimensional CEC2017 artificial benchmarks. After that, each method is applied to every air2water variant for each lake. A comparison of the results would show us to what extent the choice of metaheuristics based on either various real-world problems or CEC2017 artificial benchmarks will be useful for choosing the appropriate approach for a particular air2water variant.
On CEC2011 and CEC2017 problems, each algorithm is run 51 times as suggested in [76]. We compare the optimization algorithms in three ways.
First, the averaged performance of each algorithm on every problem is computed, and algorithms are ranked from the best (rank 1) to the worst (rank 8) on each problem. The rankings are then averaged over all problems in the set, and Friedman's test is used with the post hoc Shaffer's static procedure at α = 0.05 to choose if there are statistically significant differences between algorithms [113], [114]. This is the classical way of comparing metaheuristics, used in plenty of papers. However, it compares just average performance, and the specialization of a particular metaheuristic in a particular type of problem is often lost; as a result, the algorithm that performs above the average for all problems may be ranked best, even if it is not the best approach for any specific problem.
Second, we count the number of problems for which the particular algorithm has won. The algorithm that achieved the best-averaged performance on the particular problem is termed a winner. We set a range of precision to 10 -19 . If the difference between performances achieved by some algorithms is lower than 10 -19 , and no other algorithm performed better, we consider all such equally performing algorithms as winners. As a result, there may be more than 1 winner for the particular problem. By counting the number of problems for which the particular algorithm turns out a winner we find whether the algorithm may be especially useful for some problems. However, the possible failures of the algorithm in finding any reasonable solutions for some problems are neglected by this measure.
Thirdly, instead of averaged ranking, we count the number of wins differently. For each problem, we find the best run achieved by the particular algorithm. Then, we compare the best runs achieved by different algorithms and find a winner for a particular problem. After that, we count the number of problems for which the particular algorithm turned out a winner using the best run performance. This way we may look at algorithms that achieve diversified performance in each run, and may occasionally lead to very good results, even though some poor runs may highly affect their average performance. The algorithm chosen in this way would not necessarily be reliable for a single application and may perform poorly for some problems. However, for other problems, if run many times, it may find a solution better than algorithms that were chosen according to the average performance. Here we do not claim which way of comparison is a better option for choosing the right approach for the specific problem-in our case the calibration of the air2water variants. Our only intent is to verify how consistent (or not) the results will be if we use different ways to compare metaheuristics.

B. Comparing Metaheuristics on CEC2017 Benchmarks
The CEC2017 set contains 30 problems with various difficulties [76]. Because tested air2water model variants have 6 to 12 parameters, we have selected 10-dimensional versions of CEC2017 problems.
The average results obtained by eight optimization algorithms on each among thirty 10-dimensional CEC2017 problems are given in Supplementary Table II. The averaged ranking of algorithms, and the number of wins based on the averaged performance is given in Table II. As may be seen, based on the ranking averaged over all problems HARD-DE turned out to be the best method (with mean ranking = 2.4), followed by EL-SHADE-SPACMA (mean ranking = 2.72), djDE (3.0), and L-SHADE-50-PWI (3.5). Clearly, the worst method was SOMA (mean ranking = 7.44). In Supplementary  Table III, the statistical significance of the differences achieved by all algorithms, computed according to Friedman's test with Shaffer's post hoc method with α = 0.05, is presented. As may be seen, the differences between the best 4 algorithms are not statistically significant. FDB, ranked fifth, is significantly poorer than HARD-DE, but the difference between FDB and the other three well-performing methods is statistically insignificant.
Marginally different results are obtained if we count the number of wins for each algorithm when the win is based on the averaged performance for the particular problem. In such a case HARD-DE and EL-SHADE-SPACMA perform equally well with 12 wins, and djDE and L-SHADE-50-PWI share 10 wins. Among the other four methods, FDB achieves 7 wins, whereas other methods achieve at best a single win among 30 problems.
Based on these two ways of comparison, HARD-DE may be chosen as the most appropriate method for 10-dimensional benchmark functions, closely followed by EL-SHADE-SPACMA. Two other algorithms, i.e., L-SHADE-50-PWI and djDE, perform only marginally poorer. Other methods are clearly inferior, and only FDB may be considered somehow competitive.
Supplementary Table IV contains the best results obtained by particular algorithms on the specific problems in 51 runs. The number of wins counted with respect to this best performance is also given in Table II. If we count the number of problems for which the particular algorithm can find a better solution than the   Although these findings are not in contradiction to the results based on the average ranking, they nonetheless clearly differ. First of all, L-SHADE-50-PWI and EL-SHADE-SPACMA turn out more efficient in finding the best solutions than HARD-DE. Secondly, the best solutions for specific problems are found by seven various algorithms, including FDB, PSO, and TAPSO. Only SOMA remains fully unable to locate the best solution for any problems. This suggests that the good performance of HARD-DE observed when the performance is averaged over all runs may be due to the consistent and reliable performance in all runs, rather than the ability to find a solution that cannot be found by other optimizers.
According to the all above discussion based on artificial benchmarks of similar dimensionality to air2water models, one could conclude that one of four algorithms, i.e., HARD-DE, EL-SHADE-SPACMA, L-SHADE-50-PWI, and djDE, should be applied to the calibration of air2water model variants.

C. Comparing Metaheuristics on 22 Real-World Problems From Different Fields of Science
In Supplementary Tables V-VII, the same eight algorithms are compared on 22 real-world problems of different dimensionality that originate from various fields of science or industry. Supplementary Table V contains the averaged performance obtained on a particular problem, the ranking of algorithms averaged over all problems, and the number of wins achieved according to the average performance. Supplementary Table VI contains information on the significance of the differences between performances obtained by various algorithms according to Friedman's test with Shaffer's post hoc procedure with α = 0.05. Supplementary Table VII shows the information on the best performance in 51 runs achieved by a particular algorithm on a specific problem, and the number of wins counted with respect to the best performance. The main results are, like in the case of CEC 2017 benchmarks, summarized in Table II. According to the averaged ranks based on the mean performance, on real-world problems, L-SHADE-50-PWI turns out the best method (mean ranking = 2.57). L-SHADE-50-PWI is followed by HARD-DE (mean ranking = 3.05), PSO (3.64), and EL-SHADE-SPACMA (3.64). The djDE algorithm does not perform well on real-world problems and is only ranked 6th (mean ranking = 5.43). SOMA remains the poorest approach (with a mean ranking of 7.48).
Like in the case of CEC2017 problems, the differences between the four best methods are not statistically significant at α = 0.05. However, L-SHADE-50-PWI is significantly better than SOMA, djDE, TAPSO, and FDB algorithms.
If we count the number of problems for which the particular algorithm is the winner according to the mean performance (see Table II and Supplementary Table V), L-SHADE-50-PWI is clearly the best approach (ten wins), followed by HARD-DE (six wins) and, surprisingly, TAPSO (four wins) and FDB (three wins). This shows the superiority of L-SHADE-50-PWI and, to some extent HARD-DE, over the other methods. However, it also shows that some algorithms that do not perform well across all considered real-world problems (i.e., TAPSO and FDB) may be well suited for the specific goals: TAPSO is the best method for 30-dimensional problem 2 and 110-and 216-dimensional problems 11.1 and 11.2; FDB-is the best method for 1-dimensional problem 4 and 96-dimensional problem 11.10. Note also that EL-SHADE-SPACMA turned out inefficient in this competition.
When we compare the number of wins for the best among 51 runs performed on a particular problem, the results are much different. Although L-SHADE-50-PWI remains the best approach with nine wins, it is only marginally better than TAPSO with eight wins. EL-SHADE-SPACMA has seven wins, HARD-DE has six, and FDB has five. Hence, five out of eight algorithms can find solutions for the specific problem that is better (including ties) than all other solutions for this problem found by the other methods. We may again note that the specialization of algorithms for a particular kind of search becomes much more evident when the best runs, instead of the mean performances, are compared.
Tests performed on 22 CEC2011 real-world problems do not confirm the superiority of HARD-DE, and do not point at djDE as a promising method. Instead, overall for real-world problems L-SHADE-50-PWI seems to be the best initial choice. However, for many specific real-world problems HARD-DE, TAPSO, EL-SHADE-SPACMA, or even basic PSO may be considered the second-best approach. It seems that choosing the appropriate metaheuristics for the air2water model calibration based on the results obtained for various-dimensional real-world problems is more difficult than making this choice based on artificial benchmarks of fixed dimensionality.

A. Choosing the Best Air2water Model Variant Using Different Optimizers
In this section we use, one by one, each among eight metaheuristics to calibrate the three air2water variants on every among 22 lakes. We run each metaheuristic 30 times for every variant and lake. We consider two ways of comparison-mean and best performance among 30 runs. MSE (10) is used as an objective function and as a comparison criterion. We separately compare air2water variants for calibration and for testing data sets.
The graphical comparison of averaged MSE obtained for three air2water variants calibrated by each metaheuristic is given in Fig. 2; in Supplementary Fig. 1, a similar comparison is shown solely for the MSE obtained in the best run. The number of times the particular air2water variant turns out the best according to each comparison scenario is given in Tables III and IV-we seek the best air2water variant for each lake, and then sum up the number of lakes for which particular air2water variant is the best choice.
From Fig. 2 and Supplementary Fig. 1 and Tables III and IV, we note that the relative performance of different air2water variants indeed depends on the metaheuristic used for comparison. The results also differ depending on whether we compare the averaged MSE from all runs, or the best result obtained in all runs.
When MSE is averaged over all runs, three algorithms, namely PSO, SOMA, and TAPSO point at the superiority of a2w6, the variant with the lowest number of parameters, over a2w8 and the newly proposed a2w12 (see Table III). The choice based on the calibration data is, for these three algorithms, confirmed on the testing data set. When the FDB algorithm is used for calibration, a2w12 seems to be the best variant according to the calibration data, but this result is not confirmed on the testing set. On the contrary, when air2water model variants are calibrated with L-SHADE-50-PWI or djDE, the a2w12 variant seems to be the best according to the calibration data, and the validation data set does not change the outcome (although the number of wins achieved by a2w12 and a2w6 is close to each other, a2w12 is marginally better). Finally, HARD-DE and EL-SHADE-SPACMA algorithms show the clear advantage of a2w12 over the two simpler variants. When HARD-DE is used for calibration, a2w12 is better than both a2w6 and a2w8 on 21 out of 22 lakes according to the calibration data, and on 19 lakes according to the validation set.
From such results, we see that the a2w12 variant may outperform the other two versions on almost all lakes providing that the appropriate optimization method is chosen for calibration. In such a case the superiority of a2w12 is confirmed on independent validation data set. However, if for example the classical PSO algorithm is used to calibrate the models-as in various recent studies on air2water-the simplest air2water variant with 6 parameters will be pointed out as the best option. It seems that simpler algorithms are able to find appropriate solutions for simpler air2water variants, but fail on more difficult ones. Only advanced metaheuristics are able to find appropriate solutions also for other air2water variants.
Interestingly, the results would change if one compares only solutions obtained in the best run (out of 30) of the calibration algorithm. In such a case only SOMA would point to a2w6 as the best choice (see Supplementary Fig. 1 and Table IV), all other algorithms clearly show the advantage of a2w12 over simpler variants, and this result is confirmed on both calibration and testing data sets. It means that almost every algorithm was able to converge to the better solution for the a2w12 model than for the a2w6 variant, but finding such good solutions for a2w12 was relatively harder than for a2w6, and was done successfully only in some runs. As a result, in the case of PSO, TAPSO, and FDB and to some extent L-SHADE-50-PWI and djDE, the conclusions based on the performance from the mean and the best run do differ. Only for HARD-DE and EL-SHADE-SPACMA, the final results are basically the same for both cases (compare Tables III and IV).
From Fig. 2 and Supplementary Fig. 1, one may note that when the HARD-DE algorithm is used for calibration, the results obtained by the a2w12 variant are clearly better than the results obtained by a2w8 and a2w6 calibrated by any algorithm. In that case a2w12 variant may not outperform other air2water variants only on three lakes: Jamno (lake no. 2 in Table I and Fig. 2 and Supplementary Fig. 1), Lebsko (lake no. 4), and Jeziorak (lake no. 11). For all other 19 lakes the performance of a2w12 is superior to the performance of a2w8 and a2w6 variants. May we find any common feature of these three lakes on which a2w12 fails? All three lakes are relatively shallow and relatively large compared to the other lakes under study (see Table I). Jamno and Lebsko are two of three most shallow lakes, out of 22, and  Jeziorak is 4th-6th the most shallow lake, depending on whether one compares maximum or average depth. Lebsko lake is the 3rd largest lake, and both Jamno and Jeziorak are among the 6 largest lakes, out of 22 tested. No other common features of the three lakes may be found. This suggests that the a2w12 variant is not appropriate for large but shallow lakes. This observation may be confirmed (see Fig. 2 and Supplementary Fig. 1) by the very good performance of the a2w12 variant on the two deepest lakes-Wdzydze (lake 7) which is of moderate size, and Elckie (lake 17), which is very small. We suppose that the reason is in weak stratification of large shallow lakes, which results in low importance of seasonal mixing between deeper and surface lake layers. As a result, we may recommend an a2w12 variant of the air2water model for various lowland lakes with exception of large shallow ones.

B. Comparison Between Metaheuristics for Each Air2water Variant
In the previous section, we analyzed how the choice of calibration method would affect the choice of appropriate air2water variant. In this section, we compare the performance of metaheuristics for each individual air2water variant.
In Tables V-VII, the averaged over 30 runs performance of each algorithm achieved for every air2water variant is given for the validation set. The respective results for the calibration set are given in Supplementary Tables VIII-X. The results are shown separately for calibration and testing (validation) data sets. From the metaheuristics' point of view, the results for the calibration set are more meaningful, as the aim of the optimizer is to fit the model to these data. However, from a hydrological point of view results obtained for the testing set are much more important-as what is finally needed is the extrapolation of the calibrated model for the future, unknown data. In the bottom of Tables V-VII and Supplementary Tables VIII-X, the mean ranking (averaged over all 22 lakes) achieved by each metaheuristic is given. The lower the ranking, the better metaheuristic. We set the threshold in comparison among metaheuristics to 10 -19 ; if the difference in averaged performance is lower than this threshold for the particular lake, the methods are given an equal rank.
From Tables V-VII and Supplementary Tables VIII-X, we see that HARD-DE is the best choice to calibrate a2w12 and a2w8 variants. This result is consistent for calibration (see Supplementary Tables VIII-X) and validation (see Tables V-VII) data sets. In the case of the a2w8 variant, HARD-DE performs equally well for the calibration set as EL-SHADE-SPACMA and L-SHADE-50-PWI, for the other cases -it is the sole winner. EL-SHADE-SPACMA is constantly the second-best approach, and L-SAHDE-50-PWI and djDE compete for third place. All four metaheuristics perform well, and the differences between their results are relatively minor for the majority of lakes.
For a2w12 and a2w8 variants, the FDB algorithm is to some extent competitive, but the three remaining algorithms (TAPSO, PSO, and SOMA) perform much poorer. SOMA is the weakest approach.
The ranking of algorithms remains roughly similar for the a2w6 variant when calibration data are considered, with HARD-DE and EL-SHADE-SPACMA as two clear winners. Interestingly, for the validation set, PSO seems to be the best approach to calibrate the a2w6 variant. This may be highly surprising, but a quick look at the specific results given in Table VII shows that, from the practical point of view, the performance of all algorithms apart from SOMA is almost equal. Only the marginal differences noted for some lakes resulted in a better ranking of PSO in this specific case. The possible reason for the marginal superiority of PSO for validation data in the case of the air2water variant with just six parameters is that PSO was sufficient to find a good solution but unable to overfit to the specific training data; other algorithms (with exception of SOMA) fitted to both signal and noise present in the training sample.
Supplementary Figs. 2 and 3 allow visual comparison between observed and simulated lake water temperatures for the year 2016, which belongs to the validation period, on two typical lakes: Mikolajskie and Gardno. The simulated values are single realizations, out of 30 runs, from each metaheuristic for every air2water variant. One may note that simulations obtained by different metaheuristics for the a2w6 variant are almost identical (overlap on Supplementary Figs 2 and 3). On the contrary, simulations obtained by different metaheuristics for a2w8 and a2w12 variants are highly diversified for Lake Mikolajskie, and also differ, although on a much lower scale, for Lake Gardno. The differences between simulations obtained with the use of different optimizers may be large, and for some days exceed 3°C. This confirms that the calibration of the a2w6 variant is rather straightforward, but the a2w8 and a2w12 variants are much harder to calibrate and require much more attention in choosing the appropriate calibration method.
Having all this in mind, one could conclude that, for the problem of air2water model calibration, the choice of the optimizer based on CEC2017 benchmarks seems more appropriate than the choice based on real-world problems of various dimensionalities that come from versatile fields of knowledge.

V. CONCLUSION
In this article, we have introduced a novel air2water model variant for modeling lake surface water temperatures based on air temperatures, and have shown that the choice of the optimization method may be crucial to confirm the good performance of the new model variant. The new variant uses separate parameterization for cold and warm water periods. If inappropriate metaheuristics are used, they point at simpler versions of the model as the better. Only more flexible optimization algorithms may find appropriate solutions for more complicated air2water model variants.
Which model variant performs the best may also depend on the performance measure. If instead of the averaged performance from all runs, one compares only the best performance achieved by a particular algorithm in all runs, seven out of eight considered metaheuristics would show the superiority of the new air2water model variant over its older versions, and only one would suggest the superiority of the simplest variant with six parameters. It turns out that almost every algorithm was able to converge at least once to the sufficiently good solution of the new air2water variant to show its superiority. However, finding such a good solution was relatively hard, and many metaheuristics were able to achieve it only occasionally. Nonetheless, repeating computation many times and finding the best solution may be a simpler and cheaper option for many practitioners than seeking the method that will find a similarly good solution-but in every run.
We have also found that it is not easy to choose the right metaheuristics based on available benchmark problems from the literature. Depending on whether the tests are to be performed on 10-dimensional IEEE CEC2017 problems (which are similar to the air2water model variants in dimensionality), or on 22 real-world problems from the IEEE CEC2011 set (which, like optimization of the air2water model, represent real-world problems, not artificial functions), different metaheuristics may be considered as the best ones. Depending on the criteria of comparison, on CEC2017 problems HARD-DE or EL-SHADE-SPACMA are pointed out as the best algorithms. On the contrary, on CEC2011 real-world problems L-SHADE-50-PWI performs best and is followed by TAPSO, HARD-DE, or EL-SHADE-SPACMA.
In the practical application of the calibration of air2water model variants, the HARD-DE algorithm turns out clearly the best approach, confirming the choice that would be based on 10-dimensional artificial benchmarks, rather than variousdimensional real-world problems from diversified fields of science. When HARD-DE is used, the air2water model with 12 parameters turns out clearly better than the older air2water versions. However, if one would calibrate air2water models with L-SHADE-50-PWI, the superiority of the new air2water variant is weaker. If one decides on PSO for model calibration, the simplest air2water variant with six parameters would seem to be the best. Hence, the choice of metaheuristics for model calibration may be a crucial factor affecting the modeling outcome.
The general conclusion from our work regarding the search for the appropriate optimizer for the specific application based on the results obtained on benchmark problems is to look for algorithms that perform best on problems of similar dimensionality. Results obtained on versatile real-world problems of various dimensionalities may be misleading. If an analysis of the results for similarly-dimensional problems does not lead to any conclusions-a number of tests with different optimizers for the particular application of interest is needed.
We may finally note that if the right calibration method is chosen, the proposed air2water model variant with 12 parameters outperforms other air2water model variants on 19 out of 22 lakes. The new variant is not recommended for larger but shallow lakes but performs well on other kinds of lowland lakes. For large shallow lakes, we suggest using the simplified air2water model version with 6 parameters.