ALiDAn: Spatiotemporal and Multiwavelength Atmospheric Lidar Data Augmentation

Methods based on statistical learning have become prevalent in various signal processing disciplines and have recently gained traction in atmospheric lidar studies. Nonetheless, such methods often require large quantities of annotated or resolved data. Such data are rare and require effort, especially when exploring evolving phenomena. Existing simulators and databases primarily focus on atmospheric vertical profiles. We propose the Atmospheric Lidar Data Augmentation (ALiDAn) framework to fill this gap. ALiDAn serves as an end-to-end generation and augmentation framework of spatiotemporal and multiwavelength resolved lidar simulated data. ALiDAn employs a hybrid approach of physical models, data statistics, and sampling processes. In addition, it takes into account geographical and seasonal characteristics of aerosols and meteorological conditions along with short- and long-term phenomena that affect lidar measurements. This approach can provide diversified data and robust benchmarks to assist in developing and validating new lidar processing algorithms. We demonstrate simulations compatible with a pulsed time-of-flight lidar. Our approach leverages a broader use of existing databases and can inspire similar data augmentation to other types of lidars and active sensors.

key tool in atmospheric sensing. A lidar, however, is affected by changes caused by external factors and internal wear, such as in-door temperature fluctuations affecting optical alignment and laser energy. Thus, the analysis should address the dynamic phenomena of both the atmosphere and the lidar system. Lidar analysis is often done in synergy with other systems [1], [2], [3]. Thus, it requires many resources, including expert supervision to obtain well-validated continuous retrievals.
A lidar essentially samples vertical profiles in the atmosphere. Over the course of a day, consecutive vertical profiles create a two-dimensional (2D) spatiotemporal map. Advancing spatiotemporal lidar research can be achieved by adapting powerful deep-learning (DL) methods that had become highly developed in image processing. Recent image processing methods, such as segmentation and detection [4], [5], denoising [6], [7], and classification [8], have shown that spatiotemporal analysis better overcomes challenges posed by sparse or low-SNR signals. The use of sequential methods has also gained attention in recent boundary layer studies [9], [10], [11].
DL uses inputs that follow the statistical distribution of real measurements. However, this often requires large databases to facilitate the training. We suggest building a ground-truth spatiotemporal and multiwavelength (STMW) lidar database to support such approaches. According to our hypothesis, let us define the requirements for a desirable database.
1) The database should contain resolved or annotated STMW lidar data. This allows developing and implementing supervised or semisupervised learning algorithms.
2) The database should account for typical variability factors affecting ground-based lidar measurements, including the spatiotemporal variation of aerosol type and density; diurnal and environmental signal effects along with system-related effects; and seasonal-and geographical-related effects. This allows developing robust algorithms by expanding their scope to natural lidar signals. Currently, there are limited continuously validated lidar databases [12], [13]. Inspired by the recent contribution of synthesized databases to learning-based methods in various disciplines [14], [15], [16], we suggest augmenting the desired data with simulations. Potentially, we can achieve this using existing lidar databases and simulators. We elaborate on these tools and their limitations concerning our goal in Section II. Following this review, we conclude that each tool separately has not yet met the requirements above.
We present the Atmospheric Lidar Data Augmentation (ALiDAn) framework. 1 ALiDAn is a model-and data-driven generation and augmentation approach of STMW resolved atmospheric lidar measurements. In addition to spatial changes, we also include short-and long-term temporal changes. The proposed framework is adaptable to varying geographical locations and times.
To achieve diversified simulated lidar data, we define and employ a sampling space of typical generation parameters of each measurement component. Generation parameters are derived and augmented from an ensemble of physical models, measurements, and past analysis [1], [2]. The simulations demonstrated in this work are compatible with a pulsed timeof-flight [17], [18], [19] lidar. We derive simulation parameters from the literature and measurements collected in a field campaign during 2016-2019. In this campaign, a Polly XT [20] system was assembled on the rooftop of the Meyer Building (Electrical Engineering (EE) Faculty), Technion, Haifa, Israel [21], [22], [23], [24], [25]. To generate simulation parameters adapted to other scenarios, we suggest alternative sources throughout this work.
The suggested framework of ALiDAn consists of three generation modules of measurement factors, as shown in Fig. 1: the atmosphere (aerosols and air) presented in Section V; the electro-optical (EO) system factor presented in Section VI; and the background (BG) signal (scattered sunlight) presented in Section VII. We provide an additional simulation model that uses the augmented parameters to generate a new STMW simulated lidar measurement, as presented in Section IV. We provide simulated products of ALiDAn modules throughout Sections V-VII. In Section VIII, we demonstrate a fully STMW simulated lidar measurement and, accordingly, a real one for qualitative comparison. The outline of this work is presented in Table I. Using the ALiDAn framework, we demonstrate the concept of resolved atmospheric lidar database augmentation. As a result, augmented databases can better accommodate real-measurements statistics. Thereby, one can apply future data-driven algorithms to real data following minimal 1 The project repository is available at https://github.com/Addalin/pyALiDAn modification. Such a framework may assist various lidar processing tasks, such as inversion, aerosol/cloud classification, and aerosol typing. Its modular nature could benefit various types of lidar or sensors along with future extensions or modifications.

II. PRIOR DATABASES AND SIMULATORS
Here, we review and examine possible ways of using existing tools to obtain the desired lidar data, as discussed previously in Section I. The European Aerosol Research Lidar Network (EARLINET) provides one-dimensional (1D) synthetic vertical lidar profiles for typical aerosol conditions [26], [27]; the Lidar Climatology of Vertical Aerosol Structure (LIVAS) database [28] provides aerosol and cloud properties in multiple wavelengths that are designed to be used in space-based lidar simulations. However, both do not address the time-varying atmospheric state or measurement-related phenomena that affect ground-based lidars. The Optical Properties of Aerosols and Clouds (OPAC) database [29], [30] offers a few predefined 1D atmospheric profiles or a user interface for manual specifications [31]; however, OPAC has not addressed temporal atmospheric variations.
A possible way to create simulated STMW validated lidar data is using existing simulators. Radiative transfer simulators [32], [33], [34], [35] usually require predefined ingredients, such as aerosol profiles commonly obtained from OPAC. The radiative transfer library (LIBRADTRAN) [36] offers a lidar model. In addition to aerosol profiles, LIBRADTRAN requires predefined lidar system parameters. However, some parameters, such as the overlap between the laser beam and the receiver field-of-view, are not addressed.
The Earth Clouds, Aerosols, and Radiation Explorer (Earth-CARE) [37] simulator (ECSIM) [38], [39], [40] refers to a space-based multisensor platform, which also includes a lidar [41]. ECSIM is a comprehensive simulator that addresses a setup different than that of this article. The lidar measurements of EarthCARE are affected differently by external and internal factors. Therefore, adapting ECSIM to ground-based lidars is not straightforward.
The polarimetric lidar simulator (PLS) [42] provides a tool for a thorough simulation and assessment of the instrument components. Nonetheless, it does not include the statistics of the environment inducing the backscatter and the background signals.
The existing simulators and databases mentioned above focus on vertical profiles and do not address the dynamic phenomena of both the system and the atmosphere. To emulate such data with current simulators, one should manually provide the system and atmospheric state variations. Since the system and the atmosphere change over time, their respective simulations should also vary according to real statistics. Here, we suggest a framework to accomplish such simulations practically.
The value of using statistical databases in lidar study has already been demonstrated in aerosol typing study [43]. The ground-based lidar networks, EARLINET [44], MPLNET [45], and PollyNet [20], [46], have accumulated data globally and over time. Recently, the PollyNet Processing Chain [2], [47] created a vast database of lidar-derived aerosol optical property profiles from past years for several stations around the globe. These data, together with AERONET inversion products [48], yield rich statistics. Here, we demonstrate how to leverage these databases to achieve lidar data augmentation.

A. Lidar Model
An outdoor scene has downwelling radiance induced by natural lighting at wavelength λ [nm]. Denote by p BG (λ) [photons] the resulting BG number of photons detected by a detector on the ground during exposure time δt [sec]. An atmospheric lidar [49] is mounted on the ground, pointing upward.
Now, we derive p BS (r, λ). Let the backscatter and extinction coefficients of particles at r be β(r, λ) [1/(km · sr)] and α(r, λ) [1/km], respectively. The unitless optical depth corresponding to distance r is The attenuated backscatter coefficient for elastic interactions is The lidar detector has area A [km 2 ]. With a unitless system efficiency η, the lidar constant (LC) [1], [50] (sometimes referred to as the lidar factor) is In practice, the lidar laser beam does not fully coincide with the field-of-view of the detector. The portion of the laser beam that can geometrically be detected varies continuously with height. This portion is expressed by the unitless overlap function O(r ). We define an EO system factor Then, An atmospheric lidar samples heights above r 1 [km] A lidar typically averages the signals during a period of t[sec] that includes a number of laser pulses n pulse [20]. When acquisition starts at time t 1 , the time measurement samples are Each measurement is done simultaneously at w ∈ wavelength channels. Hence, a discrete time-dependent representation of (1) at λ k ∈ (λ 1 , . . . , λ w ) is

IV. LIDAR MEASUREMENT SIMULATION MODELING
We present an array-based formalism for measurements. Let the temporal and the altitude bin measurements be i ∈ [1, . . . , m] and j ∈ [1, . . . , n], respectively. Let the wavelength channel bins be k ∈ [1, . . . , w]. These create an STMW array m×n×w + . Then, from (9), the expected measurements can be arranged in an STMW array The ALiDAn framework is built on top of the XARRAY platform, designed to handle multichannel spatiotemporal gridded data [61]. The array formalism in (14) is elaborated on in Section IV-A. To induce natural statistics of photoelectron signals, measurements are simulated using a Poissonian distribution [62] PM ∼ Poiss(P) ∈ m×n×w We present simulations of the multiwavelength polarization Raman lidar (Polly XT ) by TROPOS [20], [63].

A. Array Formalism
Each element of P in (14) satisfies (9). Similar to (14), we denote the sampled optical coefficients in arrays α ∈ Let α j,k be the j th column at the kth channel of array α.
Then, the discrete form of (2) is Using (16) and (17), the sampled array of optical depths is Denote by the elementwise Hadamard product. Use elementwise exponent exp(−2τ ) with (3) to define an array for the attenuated backscatter Temporal variations in the EO system factor are mainly caused by the LC variations. Except for the pulse duration, all of the EO components are prone to disturbances, distortions, and wear. Hence, a temporal form of (4) is Overlap changes are expected to be slower than those of the LC. Hence, the height-dependent overlap function remains + . An array broadcast operation [61] shares on values of an array across additional dimensions. Using broadcasting, we share p LC k across all heights into P LC Similarly, broadcasting the vector O twice, across all wavelength and times, creates O ∈ Ê m×n×w + . Then, from (5), the EO factor is  (14) is  (12)  subarray of m×n + . Assume that the spatiotemporal distribution of air molecules does not change significantly between adjacent measurements. Hence, a spatiotemporal interpolation is performed and broadcast to an STMW array. According to Section III-B, we calculate elementwise α mol ∈ Ê m×n×w + [1/km] using [68]. Then, with LR mol and (11), we get To calculate daily optical coefficients of air gases, we use meteorological measurement profiles by GDAS, available globally every 3 h on a 360 × 181 (1 • × 1 • ) latitude-longitude grid [64]. Fig. 3 shows the generated extinction coefficient by air molecules at λ = 532 [nm] α mol for September 1, 2017, in Haifa.

B. Aerosol Optical Properties
To generate aerosol optical coefficients, our model addresses the following four questions: 1) where are the aerosols found? 2) is there a typical 2D texture associated with their spatiotemporal densities? 3) what are the typical optical coefficients of particles per wavelength? and 4) how can one simulate measurements typical to a geographic site or a season?
In the troposphere, generally, aerosol concentration decreases with altitude. Beyond a reference height r ref [km] [69], [70], [71], [72], the aerosol optical coefficients become negligible in comparison to air. Aerosols tend to have a smooth spatiotemporal distribution. Therefore, we assume that their density can be approximated or represented by smooth base functions, specifically spatiotemporal Gaussians. Here, we further leverage the aerosol conversion features noted in LIVAS [28] to generate typical random spatiotemporal variations of aerosols.
We denote aer as the aerosol optical parameter space used to statistically generate an STMW aerosol optical density. The parameter space aer consist of the following: Here, each matrix • k →k and k is diagonal, containing respectively the temporal values of spectral relations • 1,k→k , · · · , • n,k→k and LR relations LR aer 1,k , · · · , LR aer n,k . To simulate various atmospheric states, we initially randomly generate aer , as illustrated in 2. Section V-B1 describes how to derive these parameters. Then aer yields the optical coefficients as elaborated in Section V-B2. Simulating an atmosphere state with a mixture of aerosols is discussed in Section V-B3. Section V-B4 shows a few diurnal simulations of aerosol optical coefficients.

1) Aerosol Statistical Model:
Each parameter of aer is sampled from an empirical continuous probability density function (pdf), which is related to an aerosol optical property at a given location and time of year. Let f (D) be a truncated multivariate Gaussian mixture (GM) pdf [73] fit to dataset D.
For data given prior to statistical assessment as raw measurements, we used kernel density estimation (KDE) to calculate f (D) [73]. If data were provided after statistical analysis, we used the mean and uncertainty of each data point to generate a single Gaussian in a weighted mixture to calculate f (D).
Denote by D aer profiles a dataset of previously retrieved aerosol optical coefficients profiles [2]. From D aer profiles , we derivê f (α max k , r ref ). This pdf empirically represents the typical maximum value of the extinction coefficient at a reference wavelength λk, e.g., α max 352 , as a function of the aerosols' reference height.
The parameters are mainly sampled from monthly statistics, corresponding to the location and time of the synthesized measurements. The values of r ref and α max k are sampled once in a period of n temporal measurements. With these values, we produce varying spatiotemporal optical coefficients, as discussed in Section V-B2. To simulate dynamic changes in aerosol composition within such a period, we sample values of Å and LR at n s < n times e.g., three to five times.
Let I n {y j , t j } n s be a spline interpolation from a set of n s values y j at times t j to n times [76], e.g., the Bèzier interpolation. Using such an operator, we produce continuous changes of Å and LR based on their corresponding sampled values. The sampling process is summarized in Algorithm 1.
From the PollyNet Processing Chain, the automated analysis software of Polly XT [2], [47], we derived D aer profiles . The    Fig. 4. During the campaign, an AERONET sunphotometer was operated near the lidar. Using AERONET cloudless inversion products (level > 1.5), we created D Å . We first calculated optical depths, and using (13), the values of Å 355,532 and Å 532,1064 were calculated for each AERONET measurement in a month. The optical depths of wavelengths not measured by the sunphotometer were interpolated from neighboring measured wavelengths. Empirical distributionsf (Å 355,532 , Å 532,1064 ) for April and September 2017 are presented in Fig. 5.

2) Generating the Optical Coefficients of Aerosols:
We suggest representing the aerosol spatiotemporal distribution using a 2D GM in the spatiotemporal domain. In Appendix A, we compare this assumption with data derived from measurements. Accordingly, this section presents the generation process of the predefined sampled parameters aer . We generate random spatiotemporal Gaussians up to a given reference height r ref [km]. A randomly weighted summation of these Gaussians creates ρ ∈ Ê m×n + such that ρ 1 = 1. We now denote a few definitions in (22)- (24). These definitions are used in the process of creating spatiotemporal optical coefficients.
Let us denote the global min-max normalization scaling operator of an arbitrary array X to be where the global normalization factor is x max = max(X) − min(X) ∈ Ê + . A temporal min-max normalization scaling operator of a spatiotemporal array X ∈ Ê m×n such that x j is a column of X at time t j . The normalization factors of columns x j ∀ j ∈ [1, . . . , n] are arranged in a diagonal matrix Using (22) Using (23), we get ρ TN = S TN α aer k ∈ Ê m×n + , the density normalized per time t j . Using (24), we set α max k ∈ Ê n×n as the normalization factors of extinction coefficients at λk. Then, the extinction coefficient at λk can be rewritten as For September 1, 2017, the generated unitless ρ and ρ TN are presented in Fig. 6(a) and (b), respectively. Let the aerosol particle type be vertically uniform at each time t j . Then, from (18) and (26), the optical depths at λk are According to (13), the spectral relations set the temporal optical depths at λ k  Then, by plugging (27) into (28), and multiplying from the left by D −1 , the spatiotemporal array of the extinction coefficient at λ k is From (12), the backscatter coefficient at any λ k is β aer k = α aer k k ∈ Ê m×n 3) Aerosol Types: Consider a scene having several aerosol types at time t j e.g., marine minerals, smoke, dust, and soot [23], [25], [76]. Here, we suggest two methods for simulating such a scene.
The first approach considers an atmosphere with an effective mixed aerosol type. It yields a mixed distribution off (LR k , Å), affecting aer . Denote χ = (χ 0 , . . . , χ ν ) to be a vector of weights of ν aerosol types s.t. : χ = 1. For example, a set of weights may indicate the presence of different aerosol types typical of a specific location [1] and/or a season [79], [80]. Then, The second approach addresses an atmosphere state containing several aerosol layers. Here, optical coefficients are generated per each aerosol type, as elaborated on in Section V-B2. Then, one can set χ i, j,l to represent the percentage share of aerosol type l at height r i and time t j . This creates a weighting map χ ∈ Ê m×n + for each aerosol type l. Fig. 7(b) illustrates an example of such a map. Then, the optical coefficients of a

4) Examples of Diurnal Simulated Aerosol Optical Coefficients:
Here, we demonstrate diurnal simulations of aerosol optical coefficients for April 4, 2017, May 16, 2017, and September 1, 2017. We derivedf (LR k , Å) from [78]. With the first approach proposed in Section V-B3, we calculate a mixed f (LR, Å). As such, the weighting values for aerosols of types A (biomass burning), B (urban/ industrial), and C (a mixture of desert dust with biomass burning) are χ = (0.05, 0.75, 0.2), respectively. Such pdf is presented in Fig. 7(a).
For each day, we sample the Ångström values from f (Å 355,532 , Å 532,1064 ) of the corresponding months. Based on sampled values of the Ångström exponent, we samplef (LR|Å). The respective samples for September 2017 are marked with white stars in Fig. 7(a). The original pdf from [78] is given at λ = 355 [nm]. Here, we applied the LR values to be similar at all wavelengths. Diurnal Ångström and LR values are shown in Fig. 8.
Then, with ρ and ρ TN , as earlier presented in Fig. 6, we calculated the aerosol optical coefficients in other wavelengths according to Section V-B2. Fig. 9 presents accordingly the simulated diurnal aerosol extinction coefficients α aer [1/km] for these days.

A. LC Generation
The LC factor is affected by variations in the EO setup, including laser power drift, optical transmittance changes at the detector due to environmental exposure disturbances, and so on [50]. On a scale of days, observations show a monotonic decline in LC values. The expected descending trend resembles an exponential decay at rate t decay . Maintenance in every two to three months resets the LC to a higher range of values. We use these trends to generate simulated LC factors varying over time.
Denote a prediction band LC (t, λ) [81] of LC values at any time t. For any time between maintenance interrupts, fluctuations are introduced to the source power. We achieve this by setting LC (t, λ), within which randomized fluctuations deviate from the average parametric curve. After maintenance at time t LC 0 , the value of LC (t, λ) is narrowest, and then, it expands over time. As an example, the uncertainty ranges from 5% to 20% by the end of the period. Hence, the generation model for the time-varying LC factor is The parameters in (33) are set based on typical past LC retrievals for each λ. Except for operational maintenance, the power is expected to be smooth and continuous over time; thus, samples are temporally interpolated into p LC We generate LC values for an extended period, e.g., two months. For each arbitrary period of n time bins within this period, we use the corresponding temporal section of LC values for measurement generation, as illustrated in the diagram in Fig. 10. Fig. 11(a) shows the retrieved LC values from September to October 2017 at Haifa. The newly simulated values of such a period are presented in Fig. 11(b). Here, we used typical values such as LC = (15,45,35) [10 3 photons · km 3 ]. The value of t LC 0 was randomly selected within a period of t decay = 70 days.

B. Overlap Generation
The optical setup directly affects the overlap function, e.g., when the laser beam is slanted. Usually, changes in the overlap function are caused by maintenance interrupts or temperature fluctuations in the lidar's cabin. Various overlap functions can be realistically simulated by adjusting a modeled or estimated  overlap function [82], [83]. We generate overlap functions based on a typical set of past overlap retrievals. Denote r MO , d, g, and s as parameters in Ê + . The overlap function resembles an "S" curve; therefore, we model it as a generalized logistic Fig. 10. Diagram of the lidar EO system factor generator.
Previously retrieved overlap functions are fit to (34). From these, we derive normalized distributions for the parameters above. To simulate new overlap functions, the parameters are sampled from their respective distributions. The overlap sampled parameters are denoted by o in Fig. 10. The overlap function tends to be more stable than other system factors, so it can be generated infrequently, e.g., once per several months. For the simulations presented in this work, we assumed a similar overlap function across all wavelengths. Fig. 11(c) shows several examples of newly generated overlap functions.

VII. BACKGROUND SIGNAL GENERATOR
The Sun's position in the sky affects p BG , with a diurnal symmetry. A lidar samples the temporal average of p BG before every laser trigger event. This enables creating a statistical model. Denote a 0 , a, t peak ≥ 0, and u > 0 as Gaussian curve parameters. Then, the expected daily BG signal model per wavelength λ is Now, we elaborate on how each of the parameters is tailored to a particular day of the year and geographic location. Initially, an actual lidar measurement from a clear day, denoted as (orig), was fit to (35).
Let θ be the solar elevation at any time [84], [85]. Let t noon be the noon of true solar time [86]. At t noon , the Sun is at its highest angle θ noon . However, the BG peaks on t peak (λ), which varies around t noon , depending on the wavelength. For example, the signal at λ = 1064 [nm] peaks slightly later than the shorter wavelengths. We set t peak (λ) of an arbitrary day in proportion t noon using the ratio t orig peak (λ)/t orig noon . We use a clear sky model of global downwelling solar irradiance from [87] with air mass turbidity conditions. We utilize its trend by calculating a fit to a function with parameters a θ , b θ , c θ , and c θ expressed by E(θ ) = a θ cos(b θ θ + c θ ) + d θ .
Then, values of a 0 (λ) and a(λ) for an arbitrary day are set to be proportional to E(θ noon )/E(θ orig noon ). This is illustrated in Fig. 12(a).
To set u(λ), we use the twilight times. Let t daylight [sec] be the time difference between dusk and dawn. Denote the number of photons per δt reaching the detector due to sunlight at twilight as p twilight (λ) = p BG (t twilight , λ) [photons]. During twilight, we assume similar clear sky lighting conditions, i.e., p twilight ≡ p orig twilight (λ) > a 0 (λ), for any arbitrary day at a given location. From the symmetry of a Gaussian curve, we set u(λ) = t daylight 2 2 ln 1 q twilight (λ) (36) where In addition to θ , the BG signal is influenced by environmental conditions [87]. We, thus, introduce temporal fluctuations to the expected signal within a diurnal prediction band BG(t, λ). For example, one can set the prediction band to vary with θ . Then, the sampled BG signal is A BG signal was created by taking an original BG measurement from April 4, 2017. We adapted the BG signal for each day of 2017. A generated p BG for the original day and the one adapted to December 21, 2017, are presented in Fig. 12(b) and (c), respectively. An annual time series of p BG is shown in Fig. 12(d).
For comparison, Fig. 14 shows a real example of a Polly XT lidar RCS for a cloudless day during September 2017 at Haifa. Simulated measurements by ALiDAn in Fig. 13(c) possess similar characteristics to those of the Polly XT lidar presented in Fig. 14. One can observe resemblance in the dynamic range of values per wavelength channel, in the diurnal atmospheric dynamic variability, and in the diurnal change due to the Sun BG signal. In this work, the aerosol optical properties, as shown in Fig. 7(a), are derived from a modified distribution of [78]. Hence, we expect only a partial statistical resemblance.

IX. DISCUSSION
This work presents ALiDAn, a generation and augmentation framework of STMW lidar data based on actual signal statistics. ALiDAn incorporates effects related to atmospheric compounds, diurnal and seasonal phenomena, and EO effects. In addition, ALiDAn can be tailored to different geographical locations and times and, thus, can assist in simulations of new stations.
Generation via ALiDAn promotes broader use of wellestablished databases. ALiDAn can help generate traceable signals at any stage of their creation, eventually yielding  annotated databases. The generated data can assist in developing and testing lidar algorithms. We demonstrate this for supervised learning of lidar calibration in [88]. Future extensions or modifications can apply to other types of lidar and active sensors. For example, adding modules to ALiDAn can support polarized or inelastic Raman channels; this may address aerosol typing and clouds or water vapor analysis [89], [90], [91]. Two significant aspects set the strengths, and the limitations of ALiDAn are given as follows.
1) The first aspect relates to the source data statistics that our model relies on. Though being derived from well-established databases, the source data might be biased by various causes, such as temporal averaging of measurements, erroneous lidar calibration, and incorrect aerosol estimation arising from faulty assumptions or initialization in state-of-the-art retrieval algorithms, e.g., the LR assumption [1], [92]. Mitigating such issues is possible as lidar measurement technologies and analysis improve.
2) The second aspect relates to ALiDAn parametric model. This relates to a tradeoff between accuracy and practicality. The more complex the model (i.e., with more parameters it has), there is potentially better accuracy of its products. However, in practice, a more complex model requires resources, including training data and computations. We believe that, as time progresses, more data and better resources will facilitate better simulations by ALiDAn. Modifications related to source statistics may enhance desirable characteristics and enrich simulation by ALiDAn, e.g.,  incorporation of meteorological data by ERA5 [66] for the molecular model (see Appendix B). The aerosol model can benefit from additional databases [26], [28], [67], [93], [94], [95], from which one can generate D aer profiles , D Å , and D LR,Å . Such databases may also be useful for locations without a sunphotometer. Wavelength-related statistics of D LR,Å obtained over long periods [78], [96], [97] can expand data on aerosols, in particular, advanced analysis at 1064 nm [98]. ALiDAn can be adapted to handle aerosol microphysical properties, i.e., { aer , P aer , σ aer , ρ aer # }, in addition to aerosol optical properties. It may be useful when having statistical particle compositions instead of optical properties. Data of this kind can be available via reanalysis databases, such as CAMS [95] or MERRA-2 [67]. Enabling such an extension requires an a priori calculation of aerosol optical properties. Such a calculation is available via OPAC [30] or a modeling tool of aerosol optical properties [99].
Modifications related to the parametric model can focus on aerosols. The aerosols' variability is related to changes in their density or type. The variability depends on atmospheric dynamics. Dynamics can be simulated using physical simulators or can be learned from measurements. Representation of aerosol type variability may be enhanced via conditional sampling of LR and Å values from typical aerosol type variations. Typical aerosol type variations may be characterized by learning joint change patterns of LR and Å.
Sequential lidar measurements are often represented by a 2D spatiotemporal map. Statistical image-based methods [100], [101] may yield spatiotemporal density representations to enrich aerosol generation. To support effective learning of natural aerosol variability, these approaches may require additional computational efforts and sufficient spatiotemporal aerosol data.
The overlap function depends on the mechanical structure stability of the lidar system [102]. Diurnal or abrupt variations of the overlap function may occur due to variations in ambient temperature. The overlap function variability can be represented by learning joint change patterns of additional indicators, such as the lidar's power and housing temperature. To model aerosols' spatiotemporal distribution, we suggest using a 2D GM distribution 2 in Section V-B2. Here, we demonstrate a representation of empirical aerosol data in the spatiotemporal domain using a 2D GM distribution. We employ a scikit-learn [108], [109] implementation for fitting a GM distribution [110], [111]. We apply time interpolation to β Polly ; this allows using data from n s = 144 profiles appearing every 10 [min] in the day. Each profile has m s = 1998 height bins, corresponding to measurements up to an altitude of 15 [km]. Additional processing steps applied to β Polly include trimming of negative values, removing of noisy values above the aerosol layer using a height-dependent sigmoid function, and applying min-max scaling using (22) to set weights W s ∈ Ê m s ×n s , as presented in Fig. 15(c).
The weights W s are used for deriving samples that feed the GM fitting process; Fig. 15

APPENDIX B
As mentioned in Section V-A, the molecular distribution is calculated from meteorological measurements. For the molecular model, we have used GDAS [64] to be consistent with the PollyNet Processing Chain [47]. ERA5 [66] provides similar data, available every hour at four times higher spatial resolution. Although ERA5 provides a higher resolution of temperature and wind speed/direction, we have not observed significant differences for data related to molecular optical coefficients. This conclusion is supported by a temperature comparison between GDAS and ERA5 at corresponding pressure levels. The comparison was made on measurements of Haifa during April, May, September, and October 2017, corresponding to months of data presented in this work. Fig. 17(a) and (b), respectively, shows linear regression of GDAS and ERA5 temperatures and their residuals. The results show a high correlation between GDAS and ERA5 measured temperatures and pressures.