Semiparametric Wavelet-Based JPEG IV Estimator for Endogenously Truncated Data

A new and an enriched JPEG algorithm is provided for identifying redundancies in a sequence of irregular noisy data points which also accommodates a reference-free criterion function. Our main contribution is by formulating analytically (instead of approximating) the inverse of the transpose of JPEG-wavelet transform without involving matrices which are computationally cumbersome. The algorithm is suitable for the widely-spread situations where the original data distribution is unobservable such as in cases where there is deficient representation of the entire population in the training data (in machine learning) and thus the covariate shift assumption is violated. The proposed estimator corrects for both biases, the one generated by endogenous truncation and the one generated by endogenous covariates. Results from utilizing 2 000 000 different distribution functions verify the applicability and high accuracy of our procedure to cases in which the disturbances are neither jointly nor marginally normally distributed.


I. INTRODUCTION
Scientists routinely try to model and extract causal relations among covariates, rather than merely their correlations. 1In practice however, the presence of endogenous covariates in the model challenges the causal inference due to comovement of the random disturbance with these covariates.We distinguish between population induced comovement and training data (as in machine learning) comovement without having to rely on a covariate shift assumption since the behavioral (causal) model embedded in the training data does not necessarily describing the behavior in the entire population (see discussion in [2]).
The common way to overcome the aforementioned, is to generate a variation in the endogenous covariate without introducing variation in the random disturbance.This idea is achieved by employing a proper instrumental variable (IV). 2 Application of a proper instrumental variable generates variation in the endogenous covariate without introducing variation in the random disturbance and hence is orthogonal to it.Thus, IV should contribute to exogeneity and therefore has been extremely popular in empirical work.
Once we have analytically shown that the IV estimator is no longer valid in an endogenously truncated environment, we offer a truncation-proof estimator, which is a semiparametric wavelet-based JPEG-IV denoising algorithm. 3This denoising algorithm decomposes the random disturbance into a noise and a systemic bias part, enabling the elimination of the truncation bias.The magnitude of the this bias is captured by the size of the wavelet coefficients which quantify and measure the degree of redundancy hidden in a sequence of data points.Consequently, this algorithm nests the conventional IV estimator as a special case due to the fact that in the absence of systemic endogenous truncation, the wavelet coefficients approach zero except for the intercept which describe the coarse level of the function. 4ur main contribution to the biorthogonal wavelet estimator is by formulating analytically (instead of approximating) the inverse of the transpose of wavelet transform without involving matrices which are computationally cumbersome [7]  5 as well as the management of irregular-spaced data. 6Additionally, our proposed methodology enables the combination of several penalty functions in the estimation procedure which are resolution-dependent. 7 Wavelets are useful in denoising data.Several image quality assessment (IQA) measures have been introduced to choose the optimal level of denoising.These approaches can be classified to full-reference (FR) in cases the original image (noise-free) is observed; reduced-reference (RR) in cases where there is a partial information about the reference; reference-free (RF) in cases where the original image is not accessible [10], [11].As we deal with truncated distributions, the source (the complete non-truncated distribution) is intrinsically unobservable and thus, we cannot assess the success of the denoising by comparing it to the original nontruncated distribution.Therefore, we select both the thresholding (tuning) parameter as well as the penalty function using a reference-free criterion function.
The proposed JPEG IV is biorthogonal, thus preserving both the symmetry (the original shape of the data) and compact support (small number of coefficients) properties of the data. 8Importantly, the proposed methodology is easy to compute by precluding the need to find an optimal bandwidth as conventionally done. 9These properties make it suitable for denoising by alleviating both the problem of coefficient expansion as well as border discontinuities [15].The proposed algorithm corrects for both sources of bias: the endogeneity 4 Unlike Fourier transform, the wavelet estimator preservers not only the data average (coarse) behavior but also its local behavior capturing deviations (details) from the average.This fact renders our denoising suitable for irregular-spaced data which largely depend on local behavior and play an important role in the denoising. 5"The implemented routine for the inverse transpose transform is approximate."[7], p.285. 6 In the orthogonal wavelets design various interpolation methods are used to alleviate these irregularities [8] and specific methodologies can be used to extend the Haar wavelet transform to the unequally spaced case [9]. 7It is known that soft thresholding provides smoother results relative to the hard thresholding because it is continuous.The latter, however, provides better edge preservation in comparison with the former. 8Our JPEG IV is a biorthogonal wavelet as it requires two sets of vectors, which are the dual basis and the series expansion sets, to obtain a denoised representation of the data.The elements in the former set are orthogonal to the corresponding elements in the latter set.See [12] for a formal definition of biorthogonality. 9Kernel estimation involves computational burden due to the necessity of finding the optimal bandwidth [13].Unlike the nonparametric case, in the semiparametric context there is no "protocol" for finding the optimal bandwidth, as the traditional bandwidth choice methods might lead to bias estimates due to improper bandwidth choice [14].
of covariates as well as the endogenous self-selection biases.
We run Monte Carlo simulations to measure the magnitude of the potential bias in the parameters' estimates under endogenous truncation, obtained by employing a conventional IV to eliminate the endogeneity bias.Our empirical implementation shows that even under mild correlation between the random disturbances, the resulting bias in the estimated parameter of the endogenous covariate in the substantive equation can amount to almost tenfold the true parameter value.Further, for sake of generality of the offered estimator, we subject it to various distributions in which the disturbances are neither jointly nor marginally normally distributed.These disturbances are constructed as realizations of non-symmetric and non-unimodal distribution functions. 10he rest of this paper is organized as follows.The methodology is presented is section II.Section III prepares the ground for the biorthogonal wavelet.Section IV presents our proposed JPEG algorithm.In section V we employ Monte Carlo simulations to validate our estimator performance.Section VI concludes.

II. METHODOLOGY
As discussed above, the IV is based on the following basic requirements: it is correlated with the endogenous covariate, as well as orthogonal to the random disturbance.Additionally, it must satisfy the exclusion restriction, such that in the presence of the endogenous covariate, the IV must be excluded from the regression.The IV is allowed to affect the dependent variable only through its effect on the endogenous covariate.However, the orthogonality condition is rarely satisfied in the presence of endogenous truncation, which is very frequently the nature of data used in empirical research, and therefore the IV will not provide a solution for the endogeneity problem.In what follows, we demonstrate the shortcoming of the conventional IV estimator, as well as potential bias generated in an environment of endogenously truncated data.
Suppose that there is a population random variable w = (z; x 1 , x −1 ; w) and that there is an independent and identically distributed sample {z i , x 1i , x −1i , w i } N i=1 drawn from this population, referred to as the complete data set consisting of N observations. 11The instrumental variable is z, the endogenous variable is x 1 and the exogenous random variables are (x −1 ; w), and where w ∈ R l is a covariate vector.
Let ξ 1i , ξ 2i and v i be jointly dependent random disturbances with the respective marginal distribution functions F ξ1 , F ξ2 and F v .Their joint distribution function is F ξ1,ξ2,v .The model is semiparametric, as neither the marginals nor the joint distribution function are required to be specified by the researcher.
The underlying model is composed of two parts.The first part consists of a selection equation, while the second part consists of the substantive (of interest) equation.
The population (non-truncated) selection equation is defined as: where γ ∈ R l and w i ∈ R l are the selection equation's coefficients and covariates vector, respectively.The selection equation's random disturbance is denoted by ξ 2i .
The substantive equation and the endogenous variable equation are defined as a system of equations: the substantive equation variable the endogenous variable equation (2)   where β ∈ R p1 and δ ∈ R p2 are covariates vectors, x 1i is an endogenous variable included in vector x i ∈ R p1 , and the exogenous variables are denoted by x T −1i .The substantive equation's random disturbances are denoted by ξ 1i and v 1i .
However the variables y * 1i , y * 2i , x * 1i are latent in the truncated environment and their respective observed realizations are denoted by y 1i , y 2i , x 1i , defined in ( 3) and (4) to follow.
The variable y * 2i is latent, while y 2i is observed and defined as: , the substantive equations (4) In the next section we reformulate the substantive equation as a partially linear single index model.

A. SEMIPARAMETRIC SELECTIVITY BIAS CORRECTION
The key difference between censored and truncated sample selection models is that in the former the entire covariate set (including the non-participants) and the selection variable are fully observed.In the latter, the entire data are truncated.Nevertheless, in both cases, the substantive equation can be represented as a partially linear regression, in which the dependent variable is observed only for the participants, as we are about to show.Following [16], the conditional expectation of the substantive equation in semiparametric (censored) 12 sample selection models is some generally unknown function M 1 (.) (to be estimated) of the selection equation's covariates variables w i : such that γ is the selection equation's coefficient vector.Since y 1i is observed only if i is a participant, the substantive equation's dependent variable obtains the following func- 12 His approach is a generalization of the well-known inverse-mills ratio estimator introduced by [3] for the substantive equation's bias term E ξ 1i |ξ 2i > −w T i γ in the case of a censored sample selection model.Note the difference between censored data and truncated data, which is the case we deal with.tional form: the bias term + ˜ 1i white noise (6)   The regression equation in ( 6) is referred to as a semiparametric partially linear regression (SP-NLS), in which the non-linear part is the bias term function.This regression can be estimated semiparametrically in cases of a truncated sample selection model using a non-linear least squares procedure as suggested by [13].
Both [13] and [16] models involve a kernel function estimation.However, kernel estimates' accuracy is sensitive to the bandwidth selected.This entails a potential problem of finding the optimal bandwidth resulting in computational complexity. 13Due to the lack of applicability of the traditional bandwidth selection methods in the semiparametric context, informal methods are being used, that may lead to a non-ignorable bias in the estimates [18]. 14In order to avoid the problems involved with kernel estimation, our methodology relies on a (thresholding-propagated) nonlinear wavelet-based JPEG IV estimator to approximate the bias term (in ( 6)).
The substantive equation depicted in ( 6) deals with endogenous truncation bias, assuming that the random disturbance and the covariates are not jointly dependent.However, in cases where this random disturbance is jointly dependent with one (or more) of the covariates there will emerge two bias terms: the first one is propagated by the endogenous truncation and the second one is propagated by the endogenous covariate.Next we present a decomposition Theorem 1, which enables reformulating the substantive equations as a partially linear single index model in the presence of an endogenous covariate.

B. DECOMPOSITION OF THE SUBSTANTIVE EQUATIONS
Theorem 1.Let the underlying model be as depicted in (3) and (4).Denote the random disturbances ε i and 1i which are constructed as: , respectively.The following requirements must hold: Using (7) we get: 13 There is an open question whether there is a way to choose a bandwidth sequence that is optimal for the estimation of the parameters [17]. 14"The well known bandwidth selection rules used in non-parametric estimation, such as cross validation, are not generally applicable to semiparametric settings."[18, p. 191]  which is simplified to: In order to obtain the substantive equation in the truncated environment, we construct 1i = y 1i − E[y * 1i |y 2i = 1] where E[y * 1i |y 2i = 1] is obtained from (9). 15 Following [17], the conditional expectation of ε i , given participation is expressed by some unknown function . Thus, we obtain: endogeneity bias term + 1i white noise (10)   For sake of brevity we present equation (10), which is a decomposition of the substantive equation into its components, such as the substantive equation's covariates, selection bias term, endogeneity bias term and a stochastic white noise term.It is easy to see that the conventional IV can not be sufficient in eliminating the endogeneity bias E[ξ 1i | x i ] in (10), since under truncation the endogeneity bias term is actually We express where M 2 (•) is some unknown function and obtain (see Theorem 4 to follow): selection bias term + 2i white noise (13)   It is easy to see the joint dependence of * 2i and * 1i through the selection bias terms in (10) and (13).
Next we formulate the relationship between the covariates and dependent variables in the equations to be estimated, in the presence of an endogenous covariate in the substantive equation under truncation.

C. TRUNCATED SAMPLE SELECTION MODEL WITH AN ENDOGENOUS COVARIATE
In cases where the substantive equation's dependent variable is a function of an endogenous covariate x 1i , both x 1i as well as y 1i (as in (4)) are truncated, we face a truncated sample selection model with an endogenous covariate.
Thus, the semiparametric partially linear index model in a truncated environment consists of the following system of 15 By construction of y 1i , the equality equations: where * * 1i and 2i are two jointly dependent random disturbances, 16 and by construction are independent of the random variables vector w. 17 The intrinsic endogeneity in the model is captured by the joint dependence of * * 1i and the covariates. 18The presence of the function M 2 (.) implies that we allow for a dependence between v i (the endogenous part of x i ) and the selection equation's random disturbance ξ 2i (in (1), the complete, non-truncated, sample selection equation).
Our primary interest is to show that the instrumental variable and the random disturbance might be correlated in a truncated environment as will be depicted in Theorem 2 to follow.By doing this, we denote a truncated environment using the indicator (selection variable) s = I(ξ 2i > −w T γ) and postulate the following assumptions: Assumtption 1.The instrumental variable z is jointly distributed with all covariates in the data: Assumtption 2. Conditioning the instrumental variable z both on random variable w and a stochastic function of w denoted by F(w, ε) (given that the stochastic component ε is an i.i.d white noise which is independent of z), would be the same as conditioning it only on w.Formally: These two assumptions implies that the conditional expectation of the instrumental variable, given the selection variable, is a function of the random variable vector w, as the following proposition argues: Proposition 1.Given assumptions 1 and 2, the conditional expectation of the instrumental variable given the selection variable is a function of w, rather E[z|s = s] = w G(w)f w|s=s (w|s = s)dw ∀s ∈ {0, 1}, where f w|s=1 (w|s = 1) and f w|s=0 (w|s = 0) are the conditional density functions of vector w given participation and nonparticipation, respectively.
Proof.It easy to see that w mediates between z and s using the Tower property of conditional expectation [19]: 19

E[z|s
The indicator variable s is a stochastic function of w, thus, it follows from assumption 2 that 16 There is dependence of these two random disturbances due to the dependence between v i and ξ 1i (as in (2)) in the complete (non-truncated) data. 17Not to be confused with its realization w i . 18The intrinsic model's endogeneity is related to the joint dependence of the random disturbance and the covariates in the population, unlike a conditional joint dependence of the random disturbance and the covariates given participation in the sample. 19The Tower property is referred interchangeability to the law of iterated expectations.For formal proof see [19].
Following assumption 1 we get: In Theorem 2 to follow we use proposition 1 and present our primary argument: in truncated sample selection models, the orthogonality condition of the instrumental variable with respect to the random disturbance might be violated.This violation stems from a dependency between the instrumental variables and the selection equation's covariates.
Theorem 2 (Lack of orthogonality).Let ξ 1 and ξ 2 be two jointly distributed random disturbances, and let z be a valid instrumental variable satisfying E[z • ξ 1 ] = 0. Denote a random variables vector w ∈ R l , a parameters vector γ ∈ R l and a truncated environment using the indicator variable s = I(ξ 2 > −w γ).Suppose that the following conditions are satisfied: (i) assumptions 1 and 2 hold; (ii) E[ξ 1 |s = s, w = w] = M(w T γ); (iii) z and ξ 1 are conditionally independent given w and s; (iv) G and M are linearly dependent in the truncated environment (given s). 20nder conditions (i)-(iv) above, z is not orthogonal to the random disturbance ξ 1 given s.
Proof.Using the Tower property, the following must hold: As G and M are conditionally linearly dependent random variables in the truncated environment (given s), implies: G(w)M(w T γ)f w|s=s (w)dw (17)   = G(w)f w|s=s (w)dw M(w T γ)f w|s=s (w)dw and consequently: Therefore, z is not orthogonal to ξ 1 given s (in the truncated environment).
However, the orthogonality condition can be satisfied by removing the contamination factor, which is the covariate generating the comovement between the random disturbance and the instrumental variable, as shown in the following Theorem 3. 20 The conditional linear dependence between G and M given the indicator (selection) variable implies that Since G and M are both functions of the random variable w, this inequality implies that Theorem 3 (Bias removal).Let ξ 1 and ξ 2 be two jointly distributed random disturbances, and let z be a valid instrumental variable satisfying E[z • ξ 1 ] = 0. Denote a random variables vector w ∈ R l , a parameters vector γ ∈ R l and a truncated environment using the indicator variable s = I(ξ 2 > −w γ).Suppose that the following conditions are satisfied: (i) assumptions 1 and 2 hold; Under conditions (i) and (ii) above, removing the contamination factor (the bias term) from the residual in the truncated environment leads to orthogonality of the instrumental variable to the substantive equation's disturbance, such that: Using the Tower property, to get: Therefore, a valid instrumental variable z is orthogonal to the truncated distribution (non-contaminated) disturbance * * 1i in (14), even though z and w are dependent.
The joint dependence of (ξ 1 , ξ 2 , v) implies the violation of zero mean expectation (under truncation) in the x 1i regression equation ( 4 That is, the conditional expectation of v given an endogenous truncation is a function of the covariate vector w, while in the population it does not depend on w and has a zero mean expectation.This violation is a precondition for the endogeneity of {x −1i , z i } with respect to v i given participation in the regression of x 1i . 21The following theorem indicates that such violation is also obtained in cases where the comovement of v and ξ 2 is entirely related to a variation in ξ 1 . Theorem 4 (Conditional independence).Let ξ 1 and ξ 2 be two jointly distributed random disturbances of the substantive and selection equations, respectively.Let v be a random variable which depends on ξ 1 such that v and ξ 2 are conditionally independent given ξ 1 .Denote a random variables vector w ∈ R l independent of (ξ 1 , ξ 2 , v) with a realization w, a parameters vector γ ∈ R l and a truncated environment using the indicator variable s = I(ξ 2 > −w γ).
Assume the following conditions are satisfied: (i) the conditional expectation of the random disturbance given par- Proof.Applying Tower property to E[v|s = 1]: It can be shown that ξ 1 mediates between v and s (participation), in that it generates a comovement between the random variables v and s.The last equality relies on the fact that the random variable H(ξ 1 ) is a monotonic mapping of ξ 1 , implying dependence on s due to the dependency between ξ 1 and s.
Next we show that the conventional IV estimator is inconsistent in the presence of a truncated environment in which the expectation of the instrumental variable and the random disturbance are functions of the selection equation's covariates vector w.The proof in section II-D to follow, relies on a linear dependence assumption between these two functions of w.The rationale for the linear dependence is due to the fact that the random disturbance's (ξ 1 ) conditional expectation generally satisfies monotonicity with respect to the index variable w γ.Therefore, it is enough to assume that, on average, z is affected monotonically by the index variable w γ to generate a linear dependence between z and the conditional expectation of ξ 1 given participation. 22

D. THE CONVENTIONAL IV ESTIMATOR'S ASYMPTOTIC BIAS
The IV estimator's asymptotic bias is: Given any correlation between z and M 1 (w T γ), plim Thus, the β iv estimator is an inconsistent estimator for β.
Next we discuss the two types of joint dependence which are present in our model.This is done is order to facilitate the understanding of our proposed procedure, which is intended to correct for the bias propagated by each type of joint dependence.

III. PRELIMINARIES
The objective is to eliminate the selection bias term captured by M 1 (•) in ( 14).As we don't want to impose a specific distribution function on the random disturbances, the aforementioned elimination should be performed in a nonparametric manner.This can be achieved using a semiparametric estimation method, which is distribution-free.However, the bias term might be a discontinuous function with different levels of smoothness that must be considered.These issues can be alleviated using multi-resolution analysis by employing the wavelet estimator [20].Wavelet is a bandwidth-free estimator, that is based on the idea of multi-scale representation of the data [21]  23 and is used as a denoising technique by simple thresholding, which is based on the concept of sparsity. 24he applicability of the classical wavelet estimator is problematic in several important aspects.First, it limits the sample size to be represented as 2 J , with J a non-negative integer, and the observations to be equispaced, which challenges the estimation in case of irregular-spaced data. 25Second, the classical wavelet estimator imposes the parametric assumption that the disturbances are independent identically distributed normal variables [23].Lastly, there are the problems of coefficient expansion and border discontinuities. 26n order to overcome these limitations, second generation wavelets have been introduced [24] which define wavelets in terms of lifting-steps instead of matrices to reduce computational complexity. 27An earlier attempt to deal with irregularspaced data using second generation wavelets is presented in [21] by postulating a prior distribution function for the wavelet coefficients. 28Alternative approaches extend Haar wavelet transform to accommodate for irregular data [27].
Both first as well as second generation wavelet estimation methods involve three steps: coefficient estimation (forward transform); (ii) denoising by using element-wise thresholding (coefficients selection) and (iii) reconstruction of the data without the noise (inverse transform).It is important to notice that the sequential nature of the estimation that relies on element-wise thresholding is applicable for limited types of wavelets, referred to as orthogonal wavelets which consist of the above described limitations.The main shortcoming of orthogonal wavelets is that the compact support and the symmetry properties which are useful in denoising are conflicting. 29To preserve both these properties, the biorthogonal wavelet-based JPEG is used [29], [30]. 30 23 De to its multi-scale property, we can distinguish between the important information, the function's average behavior, from the noise.The coarse scales (lower resolution-levels) usually convey important information, while at fine scales there is usually more noise. 24Sparsity implies that the majority of wavelet coefficients are small, and can be replaced by zero [22]. 25The observations location in space or time must be of equal distance. 26The standard orthogonal wavelet transform has the shortcoming in that it requires a large number of coefficients (coefficient expansion) to represent the original data [15]. 27The lifting-steps are consecutive operations of prediction (scaledmoving average) and update (scaled-first difference) to obtain the wavelet coefficients.
28 [21] adopt the parametric Bayesian denoising approach introduced by [25], [26] to obtain the wavelet coefficients assuming the coefficients are distributed according to a continuous mixture of a normal by a Beta density. 29Unlike biorthogonality, orthogonality and symmetry are conflicting properties for design of compactly supported nontrivial wavelets (see Theorem 8.1.4 in [28]). 30The JPEG algorithm used here is termed 'wavelet CDF 9/7'.
In what follows we briefly explain the concept of biorthogonality.Denote a set of functions {ϕ k (t)} which spans a vector space F, referred to as the expansion set.By construction, any function g(t) ∈ F can be expressed by using a series expansion, such that g where η k and ϕ k are the expansion coefficients and expansion functions, respectively.The set {ϕ k (t)} is biorthogonal to the set { φk (t)} if ϕ k , φk = d(k − k ) ∀k and k , with • being the L 2 inner product and the function d(•) is the Kronecker delta. 31These two sets form a biorthogonal system, in which { φk (t)} is referred to as the dual basis of {ϕ k (t)}.Thus, we get the following unique representation: Substituting each η k coefficient with its analytic expression in (20), to obtain: Obviously, in the present case of biorthogonality, the coefficients in (20) are obtained by using the dual basis and the function is reconstructed in (21) by using another basis which is the expansion set.In cases where { φk (t)} = {ϕ k (t)} we have an orthogonal basis {ϕ k (t)}, which is referred to as self-dual.Therefore, biorthogonality is a generalization of orthogonality that allows for a larger class of expansions.
Recall that our objective is to estimate the bias term for an unknown functional form, captured by M 1 (•) in ( 14), the conditional expectation of ε i , given participation defined as 32 In what follows, we attend to the estimation of M 1 (•) using the wavelet estimator.
We use the concept of a frame in (1) to define Riezs basis in (2).Riezs basis is a building block in the definition of biorthogonal wavelets in (3) to follow.
Let H be a separable Hilbert space with inner product •, • and a norm • .We denote a sequence We use the following frame and Riesz basis definitions [31]: .
Definition 2 (Riesz basis).A sequence F is a Riesz basis if and only if it is a frame having the additional property that upon the removal of any element from the sequence, it ceases to be a frame.
Let L 2 (R) be the space of square integrable and realvalued functions on R. We use the following biorthogonal wavelet definition [32]: 32 For brevity, we present M 1 (•) only.Identical treatment is applied to It is worth noticing that both existence as well as uniqueness of the series representation are satisfied in definition 3.However, our proposed nonparametric estimator might be unstable, rendering the estimation problem ill-posed, which is one of the challenges in nonparametric estimation of unknown functions. 33This ill-posed problem can be alleviated by employing regularization on the wavelet series expansion coefficients [33], [34].
Let u i = y 1i − x T i β be the i'th element in vector u of size n × 1, which satisfies: where {t i } n i=1 is a sequence in which the i'th element satisfies t i = w T i γ and 1i is the white noise described in (14). 34 use Φ I and Φ F ≡ Φ −1 I to denote the inverse and forward transformation matrices, respectively, each of size n × n [35] in an orthogonal wavelet. 35We note that using orthogonal wavelets one obtains the close-form solution to the wavelet coefficients as follows: where M 1 (•) is the estimate of the unknown function M 1 (•), and ρ λ (•) represents the element-wise thresholding generated by some penalty function, in which the tuning parameter is represented by λ.The procedure in (23) to obtain M 1 (•) by employing a thresholding operator is referred to as denoising.
We depart from the denoising procedure in ( 23) by employing biorthogonal wavelets, as we are interested in the applicability of the general case where the penelization is not an element-wise due to correlations among wavelet regressors. 36In such cases, there is no such a close-form solution, which necessitates the regularized least squares optimization method to follow.
Let Ψ I and Ψ F ≡ (Ψ T I Ψ I ) −1 Ψ T I denote the inverse and forward transformation matrices, respectively, each of size n × n of the wavelet-based JPEG, which is a biorthogonal 33 An estimator violating at least one of the requirements: existence, uniqueness and stability is referred to as ill-posed. 34A more general formulation solves the ill-posed problem by employing regularization in cases where a linear transform of the unknown function replaces the original function [33]. 35The forward transform is referred to as the Discrete Wavelet Transform (DWT). 36It has been shown that the performance of wavelet estimator can be improved when the dependencies among coefficients were taken into account [36].
We define resolution-dependent regularized least squares at resolution levels 1, ..., J: where T is the wavelet coefficient vector of size n×1 and δ j is of size .The penalty function is It is evident that when the δ j → 0, the bias propagated by the endogenous truncation approaches zero and thus, our algorithm is reduced to the conventional IV estimator.
The univariate solution of a regularized least squares problem using the penalty function in (24) is denoted by S α (•) and defined as: 40 where α ∈ (0, ∞).It is worth noting that if γ → ∞ the solution is soft-thresholding introduced by [40]; in case that αγ → 1 + the solution is hard-thresholding (see proof in the Appendix A).
To reduce computational complexity, the optimization problem in (25) is reformulated as: where Ψ T I is the transpose of matrix Ψ I , αI is an approximation of the Hessian and I is the identity matrix of size 37 For a definition of biorthogonal wavelets see [24]. 38Unlike biorthogonality, orthogonality implies that the wavelet regressors are mutually uncorrelated and that the inverse transform is the transpose of the forward transform.This simplifies the computation as the wavelet coefficients are obtained analytically (closed-form) using element-wise thresholding operators (e.g., hard and soft thresholding operators).However, we opted for the biorthogonality wavelet to exploit the correlation structure of the regressors.Biorthogonal wavelets preserve the perfect reconstruction property (by employing dual-filters) as well, but is more flexible in that the inverse of X is not required to be its transpose.Consequently, the thresholding is applied to the entire coefficient vector. 39The penalty function in (24) represents a family of penalty functions as a generalization of the soft thresholding (if γ → ∞) and hard thresholding (if γ → 1 + ) [38]. 40In order to utilize the min-max concave (MCP) penalty function in (24), we depart from the regularized least squares algorithm in [39], as it is limited to its special case of the LASSO penalty function.We introduce α as an approximation to the Hessian of the least squares problem in order to obtain an element-wise thresholding.This amounts to a dimensional reduction technique for reducing computational complexity.For the special case of α = 1, see [38].
n × n.The number of iterations is denoted by the integer iter.
For brevity, we divide the argument to be minimized by α and complete the squares using the expressions in • [39] to get: (28) The iterative procedure performs MCP-thresholding on a proximal gradient-descent update for k = 1, ..., n (see Algorithm 5 in the Appendix): where δ is the k'th coefficient in vector δ .We use (29)  to update the wavelet coefficients iteratively until the update is negligible, such that the following convergence criterion is satisfied: where τ is the tolerance which is a positive real number that we arbitrarily set to 10 −16 .
The optimization method in (29) involves matrices multiplication which is computationally infeasible for large data sets.To alleviate this computational complexity we develop a lifting scheme to be employed in order to perform simultaneously the transposed-inverse of the wavelet transform, consisting of lifting steps (see Algorithms 1-4 to follow).Conventionally, a lifting step can be either a prediction, that is a procedure generating a smoothed version of the data (the scaled coefficients), or an update that is the procedure to generate the remainder (the detail coefficients) between the data and its smoothed version.For the present case we define a new operator because the existing lifting steps do not provide an analytic representation of the transposed-inverse, as discussed in [7].
In the next section we discuss the main idea behind lifting steps, in order to obtain analytically the transposed-inverse transform.First we describe the lifting steps in a regularspaced data given a sample size of 2 J for a non-negative integer J. Then in equations ( 1)-(IV-E) to follow, we alleviate these two restrictions by formulating our proposed algorithm.

1) Lifting steps to obtain the wavelet coefficients
Let w = (w 1 , ...w n ) be a discrete sequence of data consisting of n real numbers, such that the sequence is referred to as dyatic iff n = 2 J for some integer J ≥ 0. The sequence can be expressed uniquely in terms of detail (difference) and summation coefficients denoted by d J−1 ,k n/2 k=1 and c J−1 ,k n/2 k=1 , respectively.The former capture the variation in the sequence at different scales and locations and the latter are a smooth representation of the original sequence.
The multi-scale representation of a function g ∈ L 2 (R) is obtained as follows: The first set of terms, φ 0,k , represents the average level of function g and the second set of terms ϕ j,k represents its details by accumulating information at a set of scales j ∈ Z.
Let {0, ..., J − 1} denotes a set of scales (resolution levels).We define d J−1 ,k and c J−1 ,k as follows [41]: The key idea is that a lower detail coefficient d J−1 ,k implies that w 2k is very close to w 2k−1 and visa versa, as such a smoother function is represented by a small sequence of detail coefficients.
In order to represent the sequence in a coarser-scale (using a lower resolution), we define the coefficients: By repeating the procedure in (33) we obtain detailed and smoothed coefficients for lower resolutions.The multiscale algorithm stops when the c 0 ,1 coefficient is produced.
Next we discuss how to select optimally the thresholding (tuning) parameter in (25) for each resolution-level.

2) Optimal thresholding by a reference-free criterion function
Since we deal with truncated distributions, the source (the complete non-truncated distribution) is intrinsically unobservable and thus, we cannot assess the success of denoising by comparing it to the original non-truncated distribution.Therefore, we utilize the "two-fold cross-validation" in [42]  which is a reference-free criterion function assesing the quality of the function estimated by denoising:41 where λ j,γ j is the tuning (thresholding) parameter being used in (29), in which γ j is a specific penalty function.The odd sample and even sample are denoted by u o j and u e j , respectively, and their corresponding estimates are f o λj,m and f e λj,m .These estimates are obtained by employing the iterative procedure in (29).
As previously discussed, our proposed truncation-proof IV estimator requires controlling for the bias terms M 1 (•) and M 2 (•).For generality and applicability purposes of the proposed estimator, we adopt a semiparametric approach which is not subjected to distributional assumptions and consequently, does not require specifying the functional form of these unknown functions.
The wavelet-based JPEG semiparametric estimator is chosen for its many advantages.It enables a multi-resolution representation of the noisy data points, implying that the data points are characterized both globally as well as locally. 42Such a multi-resolution decomposition facilitates distinguishing between the noise and the systemic part.The systemic part is the functional relationship between the covariates and the dependent variables in the regression equations in (14).
An additional advantage of incorporating the aforementioned newly introduced JPEG estimator is that it accommodates for various data set forms of different types of irregularities, such as non-equispaced design that will be described in section IV-B to follow.These irregularities are alleviated by introducing the locations in space of the various data points as an additional covariate that is unique for each resolutionlevel.An additional merit of our approach is enabling a group-wise denoising rather than the traditional element-wise JPEG denoising in the cases of image processing.Groupwise denoising plays an important role in data denoising, as it takes into account potential dependencies among the various data points.Thus, our contribution to the JPEG algorithm are controlling for irregularities, group-wise thresholding on the entire data and utilizing a reference-free criterion function to choose the optimal thresholding.
In next section we describe the JPEG algorithm which is introduced to estimate each of the bias terms M 1 (•) and M 2 (•).Although our proposed denoising procedure can be applicable to both even as well as odd sample sizes (as will be demonstrated in Algorithm 1 to follow), for ease of presentation and without loss of generality, the denoising procedure is formulated as a function of a data set consisting of 2n observations.

IV. THE WAVELET-BASED JPEG DENOISING
Let {(u i , t i )} 2n i=1 be a pairwise sequence of 2n data points as described in (22), such that t i < t j ∀i < j.The sequence {u i } 2n i=1 indicates the noisy data points (or colors of pixels in image processing) and their respective locations in space are represented by {t i } 2n i=1 .The JPEG algorithm is a procedure generating a multi-resolution denoised representation of the sequence {u i } 2n i=1 , which is denoted by the sequence { u i } 2n i=1 .The purpose of the present section is three-fold: first, to describe the JPEG algorithm to be employed in order to obtain a noise-free representation of the noisy data; second, to extend the JPEG algorithm to be compatible with irregularities in the data; 43 and thirdly, to incorporate a reference-free criterion to evaluate the denoising procedure accuracy.
Applying the conventional JPEG algorithm on a vector of data points is equivalent to employing three different 42 Global representation is a weighed average (smoothing) of the data, while local representation consists of more detailed information regarding first differences between neighboring data points. 43We define ∆ i ≡ t i − t i−1 ∀ 2 ≤ i ≤ n, such that equispaced (regular) data is a sequence of data points satisfying ∆ i = ∆ j ∀i and j.Other cases are referred to as non-equispaced (irregular) spaced data.
procedures on the noisy data: (i) the JPEG forward transform T F : R 2n×2 → R 2n×1 to obtain the wavelet-based JPEG coefficients (as will be shown in (45) to follow); (ii) coefficients selection T S : R 2n×2 → R 2n×1 by applying a thresholding procedure (as will be shown in (49)) and (iii) the JPEG inverse transform T I : R 2n×2 → R 2n×1 , which recovers the noise-free data by utilizing the selected coefficients (as will be shown in (46) to follow).
In the ensuing section we introduce auxiliary matrices to be used in each of the JPEG transforms, which are essential to construct the covariate matrix in the wavelet-based JPEG regression (in (48) to follow).

A. AUXILIARY MATRICES FOR THE JPEG ALOGRITHM
The implementation of the JPEG algorithm necessitates the construction of T F and T I operators.For this purpose, we construct auxiliary matrices A 2n , S 2n , H (t) 2n,

=1
which are the shifting, rescaling and smoothing operator matrices, respectively, each of size 2n × 2n.
Let S 2n and S −1 2n be the rescaling and inverse-rescaling matrices, respectively each of size 2n × 2n.Its elements are defined for m = 0, ..., n as: The rescaling operator ṽ = S 2n v takes a vector v of size 2n × 1 and return a rescaled vector ṽ of the same size, such that even and odd elements of the original vector are multiplied by the scalars 1/ϕ and ϕ, respectively.
Unlike the conventional JPEG, we allow for data irregularities by controlling for the data set location in space.For doing so, we denote a sequence of matrices H (t) 2n,
We define the linear interpolation weights: For tractability, we formulate the JPEG coefficients estimation problem as a linear regression estimation, which necessitates obtaining a closed-form expressions of the JPEG forward and inverse transforms.These closed-form expression are required to characterize the JPEG covariate matrix to be used in the wavelet-based JPEG regression.In the following section we express analytically each of the forward and inverse transforms using matrix notation as a function of the auxiliary matrices presented above.

B. THE VARIOUS JPEG TRANSFORMS IN MATRIX NOTATION
Employing our proposed JPEG algorithm on a data set involves representation of data set in multiple resolution levels, a property which referred to as a multi-resolution analysis.Let J be the highest resolution level, which requires the same number of data points as in the noisy data set.The data set representation in j'th resolution-level ∀j < J is a transformation of the data set representation in the finer (higher) resolution-level j+1.Consequently, the JPEG noisefree representation ∀j < J can be formulated recursively.However, the implication of this formulation is that the location in space of the data points in any given resolution-level is also determined recursively.This fact stems from depicting the noisy data set u = [u 1 , ..., u 2n ] T and its location in space t = [t 1 , ..., t 2n ] T as a pairwise sequence.For ease of notation we construct the adjusted space location operator ∀j < J: (41)   where m(j) ≡ 2n/2 J−j , J = log 2 (2n) is the number of resolution-levels and j is a specific resolution level. 44his recursive formulation takes the the noisy data points locations in space as control variables, which are essential for alleviating irregularities in the noisy data.
Using the adjusted space location sequence t j in (41), we define matrix Ψ (t) F (to be used in (45) to follow) for J ∈ 44 The operator's notation • represents the ceiling of a real number.{1, ..., log 2 (2n) } resolution levels as: 0 m(j)×(m(J)−m(j)) 0 (m(J)−m(j))×m(j) I (m(J)−m(j))×(m(J)−m(j)) (43)   where m,1 and I m×m is the identity matrix of size m × m.It worth noticing that m×m is a product of invertible matrices and consequently, its inverse is characterized as: The multi-resolution JPEG forward transform operator T F (u, t) is the linear transform: where u, t and δ are the noisy data, the location in space and the JPEG coefficient vectors, respectively, each of size 2n × 1.
Similarly, the multi-resolution JPEG inverse transform operator T I (δ, t) is the linear transform: where u, t and δ are the noisy data, the location in space and the JPEG coefficient vectors, respectively, each of size 2n × 1.
Matrix 46) is defined for J ∈ {1, ..., log 2 (2n) } resolution levels as: which Ψ (t) I is the analytic inverse transform operator.We introduce the wavelet-based JPEG nonparametric regression given a non-equispaced irregular data set: (48)   where δ consists of the sequences of coefficients c 0,k and d j,k , which capture the function's average behavior and details, respectively.The matrix Ψ (t) I consists of the sequences of covariates φ 0,k (t) and φ j,k (t) .Unlike the present case which employs a group-wise denoising on the entire data, in the conventional JPEG the coefficients selection operator, T S (δ, t), is constructed to be used element-wise (for each resolution-level j), e.g, T S (δ ij , t) = S α (δ ij ; λ j,γ j , γ j ) using S α (•) in (26), where λ j,γ j and γ j are defined in (34) i=1 is a subset of vector δ consisting of the j'th resolution-level coefficients.
In the following section, we discuss about the JPEG groupwise coefficients selection to perform denoising.

1) JPEG group-wise coefficients selection for irregular data denoising
Lastly, we formulate the transpose of the inverse wavelet transform, in order to employ the group-wise denoising pro-cedure depicted in ( 29): where Φ (t) m×m −1 T is constructed as: The JPEG estimator, denoted by δ, is obtained by minimizing the objective function in (28) using the iterative procedure in (29) given the chosen thresholding level.The latter is determined by minimizing the reference-free criterion function depicted in section III-2.
The construction of the wavelet-based JPEG transforms matrix involves computational complexity, a problem which can be alleviated by employing a faster algorithm, referred to as 'a lifting step'.In the succeeding section we discuss the algorithm to obtain a denoised representation of non equispaced data design by using a procedure that does not necessitate matrix operation to reduce computational complexity.

C. THE IRREGULAR FORWARD TRANSFORM
The irregular forward transform in (45) is the procedure to obtain the wavelet coefficients, δ, as follows: Algorithm 1 The wavelet-based JPEG Forward Transform .
In the ensuing section we describe the estimation procedure of the parameter vector of interest, β, using the waveletbased JPEG estimate of M 1 (•) obtained from (29).

F. THE INSTRUMENT VARIABLE ESTIMATOR: TRUNCATED SAMPLE
Denote the truncated data by a sequence of observations {y 1i , x i , w i , z i } n i=1 , such that each observation is an independent realization of the conditional joint distribution function of the random variables {y 1 , x, w, z} given that they are selected into the sample (y 2 = 1).The endogenous variable is denoted by x 1 and is included in vector x.There are two types of joint dependence between the covariate vector and the substantive equation's random disturbance.The first type is intrinsic in the model and is generated by a variation in v (the endogenous part of x 1 ) leading to a comovement between x 1 and ξ 1 .The second type is related to the sample selection and is generated by a variation in w leading to a comovement between the covariate vector x and ξ 1 .This implies that there are two sources of endogeneity to be taken into consideration: the first source is related to the endogenous covariate, while the second source is due to the truncation environment of the data.
Next we discuss the two-step estimation procedure to be employed for the correction of both endogeneity and truncation bias propagated by truncation.

G. THE ESTIMATION PROCEDURE
In this section we introduce a two-step estimation procedure to eliminate the two sources of bias discussed.To eliminate the endogeneity bias term we adapt a similar approach to the two step procedure in [44] for a partially linear single index model estimation, in which the first stage is a regression of the endogenous covariate on all the exogenous covariates and the instrumental variable.In the second stage, the endogenous covariate is substituted with the fitted values obtained from the first stage.However, the estimation approach in [44]  cannot be implemented in a truncated environment, because it treats the first stage regression as a linear population regression (as if the entire covariates distribution function is observed).We alleviate this by modeling both the first as well as the second stage equations as endogenously truncated equations.In order to eliminate the endogenous truncation bias, we control for this source of bias by including the truncation bias term as an additional covariate in the substantive equations, as depicted in (14).Thus, the partial linearity is applied to both the first as well as the second stage equations.
In the first stage, we regress the endogenous covariate on the instrumental and exogenous variables, by minimizing the partially linear index model: In the second stage, the endogenous variable is replaced by its predicted value obtained from the first stage in (51), and we minimize the following function: As can be seen in ( 52) the two sources of endogeneity bias we deal with are: (i) the bias propagated by the endogenous covariate is alleviated by utilizing the covariate set x 1i , x T −1i consisting entirely of exogenous covariates and (ii) the bias propagated by the endogenous truncation is alleviated by controlling for the selection bias term M 1 (•).
Next we present Monte Carlo simulation to examine our semiparametric IV estimator's performance in a truncated environment.

V. SIMULATION
In this section, we generate multiple random data sets to be used for the examination of our model's performance, using different sample sizes.
First, we discuss the procedure for the data generation process (DGP).

A. DATA GENERATION PROCESS
Denote the sample size by N ∈ 500, 2000, 3000, 5000, 8000, 10000 .In order to not restrict the data generation process to the family of symmetric unimodal distribution functions, a mixture of distribution functions is utilized to generate each of the selection model's disturbances that are jointly dependent (as will be discussed in section V-A1 to follow).In order to verify that our proposed model performs well under different data generating processes (DGP), we construct a data set consisting of 2,000,000 distribution functions, 45 practically generating 100 millions realizations 45 The estimates obtained given the various data distribution functions will be supplied upon request.which are not i.i.d.By construction, each observation is randomly drawn from a unique mixture of distribution functions.

1) The disturbances' joint distribution function
Each triple of disturbances {ξ 1i , ξ 2i , v i } is randomly and independently drawn from F ξ1,ξ2,v , which is the substantive and participation equations' disturbances joint distribution function.The aforementioned joint density function consists of two components: a Copula function, 46 which characterizes the disturbances' dependence structure, and three marginal distribution functions F ξ1 , F ξ2 and F v .In order to verify our model's performance in the presence of random disturbances' distribution functions that are not restricted to the family of symmetric and unimodal distribution functions, each one of the sample selection model's disturbances ξ 1 and ξ 2 is marginally-distributed according to a mixture of three different distribution functions: (i) a normal distribution function with expectation and standard deviation parameters (µ, σ a ) denoted by N (µ, σ (iii) a gamma distribution function with scale and shape parameters (µϕ, ϕ) denoted by Γ Gamma (µϕ, ϕ) 47 .This mixture distribution function is defined as: where The parameters set (µ, σ a , σ b , ϕ, σ v ) = (4, 2.5, 1.5, 2, 1) is arbitrarily chosen.Due to its simplicity, the Clayton Copula (as will be discussed in section V-A2 to follow) with a degree of dependence parameter is set to equal 1, assuring a mild correlation between the disturbances, is used for controlling the dependence structure.Choosing a mild correlation, is important in order to be conservative by examining the potential bias in the parameter estimates under conditions which are not extreme.
Next we employ a function characterizing the dependence properties of the Copula [46], referred to as a generator function to construct the joint dependence of the random disturbances in (53).

2) Archimedean Copula function
An Archimedean Copula is a Copula characterized by a nonincreasing, continuous generator function ψ: [0, ∞] → [0, 1], which satisfies ψ(0) = 1, ψ(∞) = 0 and is strictly decreasing on [0, inf {t : ψ(t) = 0}].In particular, we are interested in the d dimensional Archemdean Copula family (3 in the 46 Any continuous joint distribution function can be characterized by a set of marginal distribution functions and a joint distribution function determining the dependence structure which is referred to as a Copula function (Sklar's Theorem [45]). 47The scale and shape parameters imply that the expectation and standard deviation parameters are (µ, µ/ϕ), respectively.present case 48 ) which has the simple algebraic form [46]: 49 where ψ is a specific function known as the generator of C. To generate the disturbances, the Clayton Copula's generator ψ(t) = (1 + t) −1/θ is chosen.
The covariates vector of i'th observation is a realization of the random variables [z, x 2 , w 1 , w 2 ] which are jointly distributed with their corresponding marginal distribution functions F i z , F i x2 , F i w1 , F i w2 .The dependence structure is modeled by utilizing a Gaussian Copula which is a convenient way to generate high dimensional data.By construction, each datum is generated by utilizing a different sequence of marginal distribution functions (constructed as a finite mixture drawn from a menu of 2, 000, 000 continuous distribution functions).These random variables expectation vector µ = [0, 0, 0, 0] T and a covariance matrix Σ 4×4 .The arbitrarily chosen covariance matrix is: We generate the data y 1i , y 2i , x 1i according to the following data generation process (DGP) [49]: where each i element in the sequence {x 2i , z i , w 1i , w 2i } N i=1 is an independent realization of the random variables (x 2 , z, w 1 , w 2 ).We choose the parameter setting The truncated data set is characterized by the following equations: where x 1i is an endogenous variable included in vector x i ∈ R p , in which all the elements (except for x i ) are exogenous variables and β ∈ R p is a covariates vector.The substantive equation's random disturbance is denoted by ξ 1i .

B. SIMULATIONS RESULT
We have randomly generated for each sample size N ∈ {500, 2000, 3000, 5000, 8000, 10000}, a total of 10, 000 data sets using the data generation process elaborated on in V-A.For a given number of observations N , different models are estimated: (i) an OLS estimator utilizing a sample consisting of random realizations from the complete distribution 48 d = 3 representing the three-dimensional vector of random disturbances (v i , ξ 1i , ξ 2i ). 49Knowing the distribution corresponding to a generator ψ, [47] presented a sampling algorithm for exchangeable Archimedean copulas which does not require the knowledge of the copula density.This algorithm is therefore applicable to large dimensions [48].Note: a The parameters that are used in the data generation process.We estimate by ordinary least squares (OLS) method the parameters for the full sample and truncated sample without correction for the selectivity bias.Then, we calculate for these estimates the mean, median and standard deviation (Std.) over all data sets.The standard deviations are obtained using the estimates from the Monte Carlo simulations.
function, without correcting for the endogeneity of x 1i covariate; (ii) a conventional IV estimator, correcting for the endogeneity of x 1i covariate using the aforementioned entire distribution function; (iii) both OLS as well as a conventional IV estimators are applied to a truncated portion of the data distribution function consisting of participants only (without correcting for the self-selection bias); (iv) truncated sample model's estimates using the developed wavelet-based JPEG IV estimator, correcting for both truncation as well as endogeneity biases. 50ble 1 presents summary statistics of estimates for models (i) and (ii), while Table 2 presents summary statistics of estimates for models (iii) and (iv).In Table 3, different convergence measures of these estimates are presented.1 indicate that regardless of sample size, the means of the OLS estimates are biased, such that e.g., for a sample size of 3000 observations β 1 = 2.6807 and β 2 = −0.6988,while the mean of the full sample IV's estimates are β 1 = 1.0025 and β 2 = 1.2457 for the same sample size.For a sample size of 500 observations, the standard deviation obtained for β 1 (the endogenous covariate's coefficient) using the IV estimator is 2.73 times larger than in the OLS estimator and decreases from 0.4397 to 0.1006, when the sample size increases from 500 to 10, 000 observations.

Entries in Table
In Table 2 to follow, the estimates obtained by using the truncated data are presented.It is evident that the mean of the truncated sample OLS estimates are β 1 = 2.26 and β 2 = −0.3084for a sample size of 500 observations, where as the true parameter values are β 1 = 1 and β 2 = 1.25, respectively.This is a huge bias which is hardly improved as the sample size increases.Further, applying a conventional IV produces estimates which still represent a huge bias, particularly β 1 = 0.1084 and β 2 = 2.0544 for the same sample size (500 observations).Note: a The parameters that are used in the data generation process.We estimate the parameters for the truncated sample without correction for the selectivity bias by employing the OLS and the (conventional) IV methods.Additionally, we estimate the parameters using the same truncated sample by employing our JPEG IV estimator, correcting for both truncation as well as endogeneity bias.For brevity, we introduce here only the second stage results and delegate the first stage results to Table 4 Appendix B.
In each of the estimation procedures (OLS, IV and JPEG IV), we calculate for these estimates the mean, median and standard deviation (Std.) over all data sets.
Entries in Table 2 indicate that regardless of sample size, the means of the truncated sample IV's estimates are biased (ranges from one-tenth to one-fifth of the estimate that would have emerged by employing the conventional IV method in the absence of truncation). 51Note that estimates' accuracy hardly improved as sample size increases.This is due to the presence of two sources of bias.The mean estimate of β 1 (the endogenous covariate's parameter) which is obtained from implementing our proposed methodology, basically mimics the results obtained using a random sample from the entire data distribution function for sample sizes, above 2, 000 observations.The standard deviations of this estimate for sample sizes of 500 and 10, 000 observations are 0.7325 and 0.1577, respectively.For a sample size of 5, 000 observations (or above), the mean estimate of β 2 (the exogenous covariate's parameter) approximates the estimate obtained by employing the conventional IV, using a random sample from the entire data distribution function.However, the estimate of β 2 obtained by employing the conventional IV in a truncated sample is biased even for 10, 000 observations.We conduct sensitivity test to measure the influence of an increase in number of observations on the accuracy of the truncated sample's parameter estimates.
The first accuracy measure we use is the standardized root mean square error, RM SE j , measuring the bias in the truncated regression estimate relative to the true parameter value that would have been obtained in an non truncated distribution, defined as: 51 For sake of brevity, we have omitted the estimates of the nuisance parameters which can be furnished upon request as well as the parameter estimates of the first stage, which are delegated to where βs i,j and β s j stand for the substantive (s) equation's j'th coefficient estimated in the i'th sample and the coefficient in the theoretical model that would have been obtained in the entire population, respectively.Ω is the number of data sets generated for the Monte Carlo simulations, which is 5000 data sets (each one consists of N observations).
Another measure is based on a formula similar to the one described in (57), and is intended to find the relative accuracy of the truncated sample's estimates, in comparison to full sample estimates, defined as: where βts i,j and βs i,j stand for the substantive (s) equation's j'th coefficient estimated using the truncated (t) sample and the full sample, respectively.This measure evaluates the relative model's performance in the truncated sample, with respect to the conventional IV using the full sample.
The last estimates' accuracy measure is the δ coefficient used for the calculation of the estimators' standard deviations convergence rate n δ with respect to the sample size.It depicts the speed of standard deviation's shrinkage resulting from increasing the sample size.This coefficient is calculated based on the following ratio: where σ 1 and σ 2 are the estimate's standard deviations that are calculated for data sets with n 1 and n 2 number observations, respectively (calculated for a given estimate).

Note:
The parameters that are used in the data generation process.We examine three different measures for the parameters presented in Table (3).First, the standardized root mean square error RM SE between each model's estimates and the true parameters is calculated based on equation (57).Second, the Rj measure is calculated based on equation (58).Third, the convergence rate which is measured n δ is calculated for both the conventional IV as well as the JPEG IV estimates.This convergence rate measure implies that multiplying the sample size by 2 shrinks the estimators' standard deviations by 2 δ .
Table 3 entries indicate that the root mean squares error (RMSE) measure of the estimates obtained by employing the conventional IV estimator, using a random sample from the entire data distribution function, gets smaller as the sample size increases, as can be expected.However, applying the same procedure to the truncated data set leads to RMSE measures, which are in the range of 2 to 8-fold larger, given a sample size of 2, 000 to 10, 0000 observations, respectively.This is indeed a huge bias generated by the conventional IV, which is not immune to truncation bias.Additionally, the RMSE measures show negligible improvements as a function of number of observations for the conventional IV, whereas there is a huge improvement of the RMSE, as a function of the number of observations for the JPEG IV estimator provided by our model.The proximity between the JPEG IV and the full sample IV estimates increases with the sample size, as reflected by the R j proximity measure.Using the same measure, we find that there is a much smaller improvement in the proximity between the truncated sample conventional IV and the full sample IV, relative to the improvement in the proximity between the JPEG IV and the full sample IV estimates.
It is evident that JPEG IV is a √ n consistent estimator, as depicted by the δ consistency measure, which is about 0.5, implying that multiplying the sample size by 2 shrinks the estimators' standard deviations by 2 δ = √ 2. It is also evident that the truncated data conventional IV is poorly functioning in terms of consistency, as is shown by entries in Table 3.

VI. CONCLUSION
We provide an analytical proof showing that in an endogenously truncated data the conventional IV estimator does not perform the task it was intended to, but rather introduces an additional unintended bias into the parameter estimates of the substantive equation.The instrumental variable is endogenous by itself in the context of endogenously truncated data due to a comovement between the instrumental variable and the substantive equation's random disturbance, generated by mediating covariates.We offer a truncation-proof JPEG IV, shown to be a proper estimator under endogenous truncation.Employing Monte Carlo simulations attests to the JPEG IV estimator's high accuracy and its √ n consistency.These results have been verified by utilizing 2,000,000 different distribution functions (not restricted to the unimodal symmetric family), generating 100 million realizations to construct the covariates' data sets which are not imposed to be i.i.d.The various distribution functions attest to a very high accuracy of the model as depicted by the parameter estimates that closely mimic the true parameters.

A. AN ELEMENT-WISE THRESHOLDING
In this appendix we show that for any γ ∈ (1, ∞), S α (•) in ( 26) is the solution to the min-max concave penalty function in (24) ∀α ∈ (1/γ, ∞).We obtain the first order condition of (61) with respect to δ: After some algebraic manipulation we obtain: Next we show how to obtain an analytic representation of the transpose of the wavelet inverse transform.For doing so, we based our arguments on the equivalence between the matrices-product and the lifting scheme representations of the wavelet transform.Then using this equivalence, we generate a filter to render the costly matrices building useless.Our objective is to reduce computational complexity.
Next we present the first stage estimates.Note: a The parameters that are used in the data generation process.We estimate the parameters using the truncated sample by employing our JPEG IV estimator, correcting for both truncation as well as endogeneity bias.We calculate for these estimates the mean, median and standard deviation (Std.) over all data sets.The standard deviations are obtained using the estimates from the Monte Carlo simulations.

by conditional independence of z and ξ1 given w and s =
E w [GM|s = s] = w G(w)M(w T γ)f w|s=s (w|s = s)dw Similarly (using proposition 1), E[z|s = s] = w G(w)f w|s=s (w|s = s)dw and E[ξ 1 |s = s] = E w [E[ξ 1 |w, s]|s = s]

Definition 3 (
Biorthogonal wavelet).A pair of functions ϕ j,k , φj,k ∈ L 2 (R) is a pair of biorthogonal wavelets if the sets ϕ j,k |j, k ∈ Z and φj,k |j, k ∈ Z form the Riesz basis for L 2 (R) and if any function g ∈ L 2 (R) has the representation: g = j∈Z k∈Z g, φj,k ϕ j,k .
k'th column in Ψ I (the inverse wavelet transform).The notation ψ T I k implies the transpose of ψ I k 2 a ); (ii) a normal distribution function with expectation and standard deviation parameters (−µ, σ b ) denoted by N (−µ, σ 2 b );

TABLE 1 :
Monte Carlo Simulation -Non-truncated (complete) data set with (without) IV correction

TABLE 2 :
Monte Carlo Simulation -Truncated data set with (without) truncation bias correction

Table 4
Appendix B.

TABLE 4 :
Monte Carlo Simulation -Truncated data set with truncation bias correction