Structural Equation Modeling Applied to Internet Consumption Forecast in Brazil

The growth of Internet services over the past few years has increased the importance of Internet consumption forecasting models in telecommunications network planning. Most of these models use historical data on variables that, when reduced in dimension, form context variables: economic, innovation, technology, and Internet supply. Although they include multiple context variables, the techniques employed in these models are limited to the analysis of a single dependency relationship. Thus, these models fail to capture the interrelationships between context variables and their direct and indirect effects on Internet consumption. This research uses Partial-Least-Squares Structural Equations Modeling to understand Internet consumption in Brazil between 2002 and 2017. The results show, for example, that the economic context produces direct effects on the innovation context and indirect effects on the technology context and on Internet consumption. These results increase knowledge of the relationships between the context variables that influence Internet consumption and provide inputs for the development of more accurate forecasting models, thus contributing to telecommunications network planning.


I. INTRODUCTION
Telecommunications network planning is more than capacity allocation and Internet/traffic routing [1]. Studies show that most telecommunications network planning problems are associated with forecasting future Internet consumption [2]. Researchers suggest that such forecasts are important in ensuring that the planned network can meet rising Internet consumption [3]. In this context, several studies have been dedicated to forecasting Internet consumption. Among these studies, the most commonly found models are those that assume that past consumption is a good indication of future consumption [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Shunfeng Cheng.
Most Internet consumption forecast models based on historical data use a small number of independent variables (time series) [5]. This is because multicollinearity in time series regressions [6] interferes with the estimated coefficients, errors, parameter reliability, and forecast accuracy [7]. Besides, time series models usually distort long-term forecasts since such models: I) do not capture emerging consumption behaviors [8]; II) ignore that future, hitherto unknown technologies and innovations may change Internet consumption behavior and create new demands [9]; III) treat Internet consumption as independent of supply, ignoring that Internet supply enables the development of technologies and innovations that stimulate Internet consumption, that is, they disregard that supply creates its own demand [10], [11]; and IV) are sensitive to the quality of historical Internet consumption data, which is itself questionable [12].
In turn, Internet consumption forecast models not based on historical data enable forecasts in short-term changes in the amount of Internet traffic over five months [13] or to understand the concentration and distribution of consumption [14], but fail to replicate and/or generalize results to longer scenarios and/or to other geographic dimensions, such as cities and countries.
Considering the limitations discussed above, researchers claim that traditional models are unreliable for forecasting phenomena embedded in complex contexts and that new models should represent such contexts in forecasting Internet consumption [1] by including variables such as the profile and number of Internet consumers [15]. Also, it must be considered that these variables, which represent the context of demand for connectivity and accessibility, correlate with economic factors that vary over time and space [16]. Loskot et al. [4] add that forecast models should incorporate variables of different types and natures, not just variables of an economic nature. The literature includes studies that correlate Internet consumption with demographic variables [8], technological and innovation variables [9], [17], and the supply of Internet-based technologies [11]. However, according to the OECD [12], the inclusion of variables in the model should rest on theoretical foundations capable of representing an Internet consumption profile consistent with reality. Therefore, how can the various contexts that permeate Internet consumption, their interrelationships, and their effects on future Internet consumption be represented?
This research addresses the problem of understanding future Internet consumption by approaching the multiple dependency relationships among variables of various types and natures associated with Internet consumption. To this end, a simultaneous equation model is employed that combines factor analysis (dimensional reduction) and Partial-Least-Squares (PLS) regression [18], the so-called Partial-Least-Squares Structural Equations Modeling (PLS-SEM).
In addition to this Introduction, the paper presents the theoretical framework of the Internet consumption forecast model in Section II. Section III describes the operational structure of PLS-SEM. Section IV details the 32 variables that represent the contexts (economy, technology, innovation, and Internet supply) as well as the methods used to process these variables: factor analysis, PLS regression, and exponential smoothing. The obtained results are presented in Section V, and the conclusions of this study are given in Section VI.

II. THEORETICAL FRAMEWORK OF THE INTERNET CONSUMPTION FORECAST
In PLS-SEM, the simultaneous estimation of dependency relationships requires the presence of a pre-established theoretical model that indicates the combination and order of dependency relationships, with coherent justifications derived from the literature [26]. In this sense, the proposed model is formed by four exogenous or latent variables that represent the innovation, technology, economy, and Internet supply environments, as well as the endogenous variable of Internet consumption. The model assumes that Internet consumption is positively correlated with: I) technology and innovation factors [9], [17]; II) Internet supply [10], [11]; and III) the economic context, including its demographics [8], [16]. The model further assumes that the economic context is a relevant predictor in understanding Internet supply and technology and innovation factors over time [4]. Finally, the model considers that technologies are dependent on innovations [9]. Fig. 1 illustrates the structural model of Internet consumption in Brazil based on these assumptions. This structural model includes thirty-two observed variables (detailed in Table 1/Section IV), five constructs or latent variables, four dependency relationships between latent variables (continuous arrows), and four dependency relationships between latent variables and independent variables (dotted arrows). The proposed structural model is associated with a method of maximizing the explained variance of exogenous and endogenous constructs [18]. PLS-SEM retains a larger number of variables for each construct, which helps increase the forecast accuracy of endogenous variables [36]. Also, the sample size may be smaller in PLS-SEM than in covariance-based SEM (CB-SEM), and there is no need for normally distributed data [26]. Thus, PLS-SEM is an appropriate method for conducting exploratory analyses with smaller samples, more complex models, rich data, and weak theory [37]-[39].
In this sense, the proposed PLS-SEM structural model explores the interrelationships between the latent variables mentioned in the literature, and their respective direct and indirect effects on Internet consumption. Thus, it seeks to strengthen the theory of how the various contexts interrelate with Internet consumption over time.

III. OPERATIONAL FRAMEWORK OF THE INTERNET CONSUMPTION FORECAST
The basics of SEM are presented in the studies of Galton [19] and Pearson and Lee [20]. However, it is in Spearman's research [21] that the factor model (exploratory and confirmatory factor analysis) was presented. From this, Jöreskog [22] develops a general model containing two approaches in which factors (latent variables) correlate to a specific subset of observed variables [23]. In the measurement approach, the observed variables are related to the latent variables through confirmatory factor analysis [24]. Based on the covariance technique, this approach is called Covariance-based SEM (CB-SEM). By allowing the representation of constructs by factors, CB-SEM is more used in studies that seek confirmation of theories [25]. In the structural approach, latent variables, that is, variables constructed from observed variables, are related to each other through systems of simultaneous equations [18], [26]. Based on the partial least squares technique, this approach is called PLS-SEM. By enabling the representation of significance and strength (correlation coefficients) of factor relationships, PLS-SEM is most commonly used in studies that explore and analyze causal relationships [24], [27], [28].
VOLUME 8, 2020
SEM is an approach focused on the analysis of complex phenomena [29], and brings several advances concerning traditional methods (factor analysis, principal components, discriminant analysis, and multiple regression) [26], [27]. SEM enables: first, the inclusion of observed and/or latent variables [29]. Second, the analysis of more than one dependency relationship [30]. Third, the incorporation of estimation/measurement errors into the model [31]. Fourth, the confirmation of theoretical propositions, including them automatically to the model [32]. Fifth, through the interaction between theory and data, statistically testing theoretical assumptions by modeling relationships between predictor and multiple criterion variables [33]. In summary, unlike traditional methods, SEM allows the analysis of theories developed in previous research from an unlimited number of dependent variables and/or relationships [31], [32]. Therefore, SEM-based temporal data forecasts [34], [35] capture the dynamic interrelationships between the constructs examined, and offer a broader understanding of the evolution of observed and latent variables over time.

IV. MATERIALS AND METHODS
The process of building a forecast of Internet consumption includes three stages: data collection, use of the PLS-SEM, and utilization of a forecast model.
The first stage: eleven data sources [40]-[50] were consulted to reproduce the conceptual model presented in Section III. In total, 32 variables were taken into account, which are related to four dimensions or contexts: Economic, Innovation, Internet supply, and Technology. The data of the 32 variables tested in the model cover the period from 2002 to 2017 and are presented in Table 1.
The second stage: the collected data were processed with PLS-SEM, which combines the methods of factor analysis and partial least squares regression [38]. Factor analysis is a statistical method that represents a set of observed variables by a small number of unobserved variables (factors) [28], [52]. The factor analysis model is given as follows:
$$X = \mu + LF + e$$
where X is the measurement vector, p × 1; µ is the vector of means, p × 1; L is the matrix of factor loadings, p × m; F is the vector of common factors, m × 1; e is the vector of residuals, p × 1; p is the number of measurements; and m is the number of common factors. Applying factor analysis, the 32 observed variables presented in Table 1 were reduced to four context variables: economy, technology, innovation, and Internet supply. Such a decrease in dimensionality reduces the probability of collinearity between independent variables in the forecast model [6], [14], [17].
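To make the structure of the factor model concrete, the following minimal sketch computes the covariance matrix implied by a model of the form X = µ + LF + e, consistent with the symbol definitions above. It assumes unit-variance, uncorrelated common factors and uncorrelated residuals, which are standard assumptions but are not stated in this paper. The loadings and the function name `implied_covariance` are hypothetical and are not taken from the study's data.

```python
def implied_covariance(loadings, residual_variances):
    """Covariance of X implied by X = mu + L F + e.

    Assumes Cov(F) = I and uncorrelated residuals (illustrative assumptions),
    so that Cov(X) = L L^T + diag(residual_variances).
    """
    p = len(loadings)        # number of measurements
    m = len(loadings[0])     # number of common factors
    sigma = [
        [sum(loadings[i][k] * loadings[j][k] for k in range(m)) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        sigma[i][i] += residual_variances[i]
    return sigma


# Hypothetical example: p = 3 standardized measurements, m = 1 common factor.
L = [[0.9], [0.8], [0.7]]
psi = [1 - row[0] ** 2 for row in L]  # residual variances of standardized variables
sigma = implied_covariance(L, psi)

for row in sigma:
    print([round(value, 3) for value in row])
```

In this sketch, the off-diagonal entries are products of loadings, which illustrates how a single factor can account for the correlations among several observed variables.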
The partial least squares regression is a well-known technique that minimizes the sum of the squares of the residuals of the regression (Y = a + bx) [26]-[28], [52] within the following model:
$$y_i = a + b x_i + e_i$$
where $i = 1, \ldots, n$ is the observation number and $e_i$ is the ith residual.
The minimization of $\sum_{i=1}^{n} e_i^2$ permits one to adjust the regression model to the observed data.
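As a minimal sketch of the least squares criterion described above, the code below fits a single line y = a + bx by minimizing the sum of squared residuals in closed form. It illustrates only this criterion for one predictor, not the full PLS algorithm as implemented in SmartPLS. The data and the function name `fit_line` are hypothetical.

```python
def fit_line(x, y):
    """Return (a, b, sse) for y_i = a + b * x_i + e_i, minimizing sum(e_i ** 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return a, b, sse


def sum_squared_residuals(x, y, a, b):
    """Sum of squared residuals for an arbitrary candidate line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
a, b, sse = fit_line(x, y)
print(a, b, sse)
```

Any other choice of intercept or slope, for example `sum_squared_residuals(x, y, a, b + 0.1)`, yields a larger sum of squared residuals than `sse`.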
In the use of the partial least squares regression, the significance and strength of the correlations between the context variables were calculated and tested [36], [39]. The significance and correlation of the independent variables indicate which context variables are to be forecasted to calculate Internet consumption.
The factor analysis combined with the partial least squares regression was performed by the PLS-SEM. In turn, the PLS-SEM can be operationalized in different software packages such as SmartPLS, EQS, MPLUS, PROC CALIS (in SAS), HLM, SIMPLIS, and GLLAMM [23], as well as the popular SPSS AMOS [38] or the pioneering LISREL [22]. The Internet consumption model developed in this research was operationalized in the SmartPLS software [39], following a usage tutorial presented in [36], and using a license available at www.smartpls.de.
The exploratory analysis of the relationships between multiple observations (independent variables) and multiple constructs (dependent variables) was carried out in two steps [36]. The first step in applying the PLS-SEM algorithm concerns factor analysis. The observed variables that influence the latent variable, as measured by the factor loading, remain in the model. The factor loading must be greater than 0.7 [26]-[28]. The latent variables that present acceptable convergent validity, measured through the Average Variance Extracted (AVE), remain in the model. The AVE must be greater than 0.5, so that the latent variable explains, on average, more than half of the variance of its observed variables [26].
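The two retention rules above can be sketched as follows. The sketch assumes the commonly used definition of AVE as the mean of the squared standardized loadings of a construct; the paper does not spell out the formula, so this is an assumption. The loadings, variable names, and function names are hypothetical.

```python
LOADING_THRESHOLD = 0.7  # factor loading threshold stated in the text
AVE_THRESHOLD = 0.5      # AVE threshold stated in the text


def retain_indicators(loadings, threshold=LOADING_THRESHOLD):
    """Keep only observed variables whose factor loading exceeds the threshold."""
    return {name: value for name, value in loadings.items() if value > threshold}


def average_variance_extracted(loadings):
    """AVE as the mean of squared standardized loadings (assumed definition)."""
    values = list(loadings.values())
    return sum(value ** 2 for value in values) / len(values)


# Hypothetical standardized loadings for one latent variable.
loadings = {"v1": 0.92, "v2": 0.85, "v3": 0.55}
kept = retain_indicators(loadings)
ave = average_variance_extracted(kept)
construct_remains = ave > AVE_THRESHOLD
print(sorted(kept), round(ave, 3), construct_remains)
```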
Once the observed and latent variables are selected, the reliability of the model is analyzed by the cross-loadings, Cronbach's Alpha (CA), and the Dillon-Goldstein Rho (Rho) criteria [52], [53]. The factor loadings of the observed variables must be higher in their respective latent variables than in others [33]. Differences close to 2.5% are acceptable, but it is possible to increase the rigor of discriminant validity by excluding observed variables with high correlations in two latent variables [36]. Internal consistency is measured through CA. Its threshold varies by type of study. For instance, for exploratory studies, CA values must be greater than 0.60 [27], [28]. Rho is considered the best reliability measure [52]. Its values must be greater than 0.70 [53].
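The two reliability measures can be illustrated with a short sketch. It assumes the usual textbook formulas: Cronbach's Alpha computed from item variances and the variance of the summed score, and the Dillon-Goldstein Rho computed from standardized loadings. These formulas are not given in the paper, and the data and function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's Alpha; `items` holds one list of observations per observed variable."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # summed score per observation
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def dillon_goldstein_rho(loadings):
    """Dillon-Goldstein Rho (composite reliability) from standardized loadings."""
    squared_sum = sum(loadings) ** 2
    error_sum = sum(1 - value ** 2 for value in loadings)
    return squared_sum / (squared_sum + error_sum)


# Hypothetical data: three observed variables measured on five occasions.
items = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 3.0, 4.0, 5.0, 7.0],
    [1.0, 3.0, 3.0, 5.0, 6.0],
]
alpha = cronbach_alpha(items)
rho = dillon_goldstein_rho([0.92, 0.85, 0.78])
print(round(alpha, 3), round(rho, 3))
```

With these illustrative inputs, both values exceed the thresholds quoted above (0.60 for CA in exploratory studies and 0.70 for Rho).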
The second step in applying the PLS-SEM algorithm concerns the internal validity of the model, in which the PLS regression is performed. The observed variables that are not statistically significant for the latent variables (t-statistic < 1.96) are excluded from the model [18]. Correlations between latent variables that are not significant (p-value greater than 0.05) do not remain in the model. Correlations between latent variables that are not consistent with the theory, for example, a negative correlation between Internet consumption and the context of technology and innovation, do not remain in the model either [36]. The significance and strength of the regressions are evaluated by Student's t-test and the coefficient of determination $R^2$ [26]. At the end of the PLS-SEM, path coefficients are generated, which are the standardized weights of the regression variables [36], [37].
The third stage: after running the two steps of the PLS-SEM algorithm, the latent variables with significant correlations with the dependent variable of Internet consumption are projected using an exponential smoothing autoregressive model. Exponential smoothing is a traditional linear statistical model that allows forecasting future values based on past observations [54], [55] as follows:
$$s_t = \alpha x_t + (1 - \alpha) s_{t-1}$$
where α (0 < α < 1) is the smoothing factor. The smoothed statistic $s_t$ is a simple weighted average of the observation $x_t$ and the previous smoothed statistic $s_{t-1}$. The value of α defines the smoothing level, so that the smoothing level will be zero when α = 1. The mean squared error (MSE) and mean absolute deviation (MAD) are applied to measure the forecasting accuracy. The ideal model has MSE and MAD values close to zero [54].
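The smoothing recursion and the two accuracy measures can be sketched as follows. The sketch assumes the initialization s_0 = x_0 and uses the previous smoothed value as the one-step-ahead forecast; both choices are common but are assumptions, since the paper does not specify them. The series and function names are hypothetical.

```python
def exponential_smoothing(x, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = [x[0]]  # assumed initialization: s_0 = x_0
    for t in range(1, len(x)):
        smoothed.append(alpha * x[t] + (1 - alpha) * smoothed[t - 1])
    return smoothed


def forecast_errors(x, smoothed):
    """MSE and MAD of one-step-ahead forecasts, taking s_{t-1} as the forecast of x_t."""
    errors = [x[t] - smoothed[t - 1] for t in range(1, len(x))]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mad = sum(abs(e) for e in errors) / len(errors)
    return mse, mad


# Hypothetical series of a latent variable score.
series = [10.0, 12.0, 11.0, 13.0]
smoothed = exponential_smoothing(series, alpha=0.5)
mse, mad = forecast_errors(series, smoothed)
print(smoothed, mse, mad)
```

Calling `exponential_smoothing(series, 1.0)` returns the original series unchanged, which matches the statement that no smoothing occurs when α = 1.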
Internet consumption was forecasted considering the future values of the latent variables generated through exponential smoothing and the coefficients of these latent variables generated through the PLS-SEM. An overview of the three stages of the Internet consumption forecast model is illustrated in Fig. 2.

V. RESULTS
After running the first step of the PLS algorithm, the latent variable of Internet supply did not meet the AVE threshold of 0.50 and did not remain in the model. The other latent variables (economic, innovation, and technology) met the AVE threshold and remained in the model. Six observed variables of the economic context, two observed variables of the innovation context, and three observed variables of the technology context did not meet the factor loading threshold. The observed variables of each context, that is, of each latent variable that remained in the model, are given in Table 2 and reflect the reliability of the model by the cross-loadings criterion. The factor loadings of the observed variables must be higher in their respective latent variables than in others [33]. Only two of the seventeen observed variables do not meet the criterion of discriminant validity. The differences in cross-loadings of variables V_IN_1 (0.91 and 0.92) and V_TIC_4 (0.99 and 1.00) are less than 2.5%, which is an acceptable difference [33]. Table 3 shows the reliability of the model measured by the CA, Rho, and AVE criteria for each of the three latent variables maintained in the model. Considering the results of Tables 2 and 3, the V_IN_1 and V_TIC_4 variables were maintained in their original constructs, establishing the structural model of Internet consumption. Fig. 3 provides a synthesis of the PLS-SEM-based structural model of Internet consumption, showing the variance extracted from the observed and latent variables, as well as the strength of these relationships (coefficient of determination).
By analyzing the internal validity of the structural model measured by the coefficients of determination ($R^2$) [26], which presented values $R^2$ = 0.987 (Economic → Innovation), $R^2$ = 0.994 (Innovation → Technology), and $R^2$ = 0.790 (Technology → Internet consumption), it can be said that the structural model is strong, since $R^2 > 0.67$ indicates a strong model and $R^2 > 0.33$ a moderate model [33]. Table 4 summarizes the correlations between the latent variables used in the model.
The significance of the correlations and regressions was evaluated by Student's t-test [26]. Fig. 4 shows that all relationships between variables had t-statistics greater than 1.96, indicating that the correlations and regressions are acceptable.
Once the PLS-SEM parameters were estimated, the relationships between the latent variables indicated that the economic variables are strongly correlated with innovation (R = 0.994), technology (R = 0.965), and Internet consumption (R = 0.740), supporting the understanding that variables of an economic nature are relevant predictors for understanding other variables over time [4]. From the path coefficients obtained in the model, the effects that the latent variables have on each other are estimated. This effect is estimated from the variation of the Standard Deviation (SD). In the present model, the variation of one SD in the economic variable produces effects of 0.994 SD in the innovation variable, 0.990 SD in the technology variable, and 0.880 SD in the variable of Internet consumption.
The relationships between the latent variables also indicate that the innovation variable is strongly correlated with the technology (R = 0.997) and Internet consumption (R = 0.830) variables, supporting the understanding that innovations drive Internet consumption [17] as well as the creation of technologies [9]. Considering the path coefficients extracted from the model, it is estimated that a variation of one SD in the innovation variable produces an effect of 0.997 SD in the technology variable and 0.886 SD in the Internet consumption variable.
Finally, the relationships between the latent variables indicate that the technology variables are strongly correlated with Internet consumption (R = 0.889), supporting the understanding that Internet-based technologies also drive Internet consumption [11], [17]. The path coefficients indicate that a variation of one SD in the technology variable produces an effect of 0.889 SD in the Internet consumption variable.
It is noteworthy that the model also indicates that the Internet supply variable may not stimulate Internet consumption, contrary to what some research suggests [10], [11]. Taking into account this set of results, this research provides new evidence of the applicability of factor analysis to reduce a large number of observed variables to a few latent variables, and to avoid multicollinearity between such variables in time series [6], [14], [17].
Besides, the results of the PLS-SEM also indicate that the effect between latent variables and Internet consumption may not always be direct. In fact, the results show the presence of indirect effects between: I) Innovation → Technology → Internet consumption (R = 0.886); II) Economic → Innovation → Technology → Internet consumption (R = 0.880); and III) Economic → Innovation → Technology (R = 0.990). Table 5 shows the path coefficients, indicating the direct and indirect effects between the latent variables. The indirect effects alter the curve slope of the independent variables and, therefore, the forecast of Internet consumption and the statistical parameters of the model. Fig. 5 shows how indirect effects influence the forecast of Internet consumption, while Table 6 shows the influence of indirect effects on the model's statistical parameters. Table 6 shows that considering indirect effects in the model can improve, to some extent, the forecasting accuracy, generating higher $R^2$ and adjusted $R^2$ coefficients and reducing the model forecast errors, with smaller MAD and MSE. The obtained results suggest that understanding the interrelationships between latent variables, their respective factor loadings, correlations, coefficients, and direct and indirect effects may improve the forecast of Internet consumption.
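The indirect effects listed above can be approximately reproduced by multiplying the direct path coefficients along each chain, which is the standard way of computing indirect effects in path models. The sketch below uses the direct coefficients reported in this section (0.994, 0.997, and 0.889). The products differ from the reported values in the third decimal place, presumably because the published coefficients are rounded; that explanation is an assumption. The function name `indirect_effect` is hypothetical.

```python
# Direct path coefficients reported in the text (standardized, in SD units).
paths = {
    ("Economic", "Innovation"): 0.994,
    ("Innovation", "Technology"): 0.997,
    ("Technology", "Internet consumption"): 0.889,
}


def indirect_effect(chain, path_coefficients):
    """Product of the direct path coefficients along a chain of latent variables."""
    effect = 1.0
    for source, target in zip(chain, chain[1:]):
        effect *= path_coefficients[(source, target)]
    return effect


inn_to_cons = indirect_effect(
    ["Innovation", "Technology", "Internet consumption"], paths
)
eco_to_cons = indirect_effect(
    ["Economic", "Innovation", "Technology", "Internet consumption"], paths
)
eco_to_tech = indirect_effect(["Economic", "Innovation", "Technology"], paths)

print(round(inn_to_cons, 3), round(eco_to_cons, 3), round(eco_to_tech, 3))
```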
Taking all results and their analysis into account, this research contributes to broadening understanding of the direct and indirect effects of exogenous factors on Internet consumption over time. By providing a broader understanding of the causal relationships of the phenomenon under study, this research provides insights that enable future Internet consumption to be forecasted with a greater theoretical foundation and accuracy.

VI. CONCLUSION
This research showed that, at least in Brazil, independent variables (economy, innovation, technology) do not always have a direct effect on future Internet consumption. This study also demonstrates that the latent variables that represent economy and innovation, by influencing the latent technology variable, have an indirect effect on Internet consumption. This shows that, in the PLS-SEM, an independent variable (technology) can assume the role of a dependent variable of the other independent variables (economy and innovation). Thus, the multiple dependency relationships, correlation coefficients, and regression significance are summarized in a structural model. This contributes to broadening the understanding of Internet consumption over time. By bringing together multiple contexts, the structural model brings the forecast of Internet consumption closer to reality, opening the possibility for better planning of telecommunications networks.
Given these possibilities, it is suggested that future work be directed at comparing the results of different Internet consumption forecasting models. In addition to attempting to generalize the results of this paper to other countries, it is suggested to evaluate the impacts of the different results of these forecasts on the planning of telecommunications networks.
PETR YA EKEL received the Ph.D. degree from the Kiev Polytechnic Institute, National Technical University of Ukraine, and the D.Sc. (Habilitation) degree from the Institute of Electrodynamics, Ukrainian Academy of Sciences. Since 2004, he has been the President and a Principal Consultant with Advanced System Optimization Technologies. He is currently a Full Professor with the Pontifical Catholic University of Minas Gerais, Belo Horizonte, Brazil. He is also with the Federal University of Minas Gerais. His main research interests include operational research, decision making, computational intelligence, problems of modeling and optimization under different levels of uncertainty, and multicriteria decision making, including decision making in a fuzzy environment, fuzzy identification and control, processing of heterogeneous information, and risk evaluation and management. He has published numerous articles and two research monographs related to these topics. He was a member of numerous program and advisory committees of international conferences in system modeling, optimization, and control and decision making. He is a member of the Ukrainian Academy of Engineering Sciences. He serves as an Editorial Board Member for numerous international journals, including Information Sciences, Information Fusion, and Group Decision and Negotiation.