Analytics on Non-Normalized Data Sources: More Learning, Rather Than More Cleaning

Data analysis is increasingly performed over data assembled from uncontrolled sources, facing inconsistency in knowledge-representation conventions. The typical practice is to create "clean" data for analysis, matching entities and merging variants to overcome differences in knowledge representation. Despite progress in data management techniques to automate this process, it still needs labor-intensive supervision from the analyst. In this paper, we evaluate the benefit of advanced statistical tools to address directly many analytic tasks across data sources without such entity-matching cleaning. Reframing analytical questions as machine-learning tasks makes it possible to replace exact matching of entities by continuous descriptions (vectorial embeddings) that expose similarities between entries. But are analyses with less cleaning trustworthy? We answer this question with a thorough benchmark on questions typical of socio-economic studies across 14 employee databases: we compare the approaches based on machine learning to manual data cleaning (entity matching). It reveals that using embeddings and machine learning improves the validity of results (smaller estimation error) more than manual cleaning, with considerably less human labor. While machine learning is often combined with data management for the purpose of cleaning, our study suggests that using it directly for analysis is beneficial because it captures ambiguities that are hard to represent during curation.


I. INTRODUCTION
Data analysis is increasingly performed across non-normalized data sources, facing data-integration challenges. For instance, estimating product prices must match offers referring to the same product [1]; studying the influence of climate warming on plant species must overcome variability in plant names [2], [3]; and early detection of acute kidney injuries faces the heterogeneous vocabulary of clinical notes [4]. Assembling and curating data into a clean form for statistical analysis is often described as one of the biggest hurdles to data science [2], [5], [6]. Even when the schemas are aligned, data curation needs some form of entity matching. Indeed, across different information systems, there are often different ways to represent the same concept, e.g. "professor", "prof", or "professeur". Entity matching bridges the various knowledge-representation conventions across sources to produce normalized entries (see Fig. 1). Though this curation process can be partly automated [7], [8], it remains challenging and time-consuming as it requires tedious manual supervision and quality assurance across a large number of entries. Yet, with current analytic methodology, this task is central to the validity of downstream analyses.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
Here we show that more sophisticated analysis pipelines can alleviate the need for careful curation to answer many data-science questions. We empirically benchmark such approaches, drawing examples from studies of the determinants of salary, in data journalism [9] or academia [10]-[12], which ask quantitative questions such as:

1. For a given job, how does salary evolve with experience?
2. For a given job, what is the 0.75-quantile of salaries?
3. What is the typical male-female pay gap?
Answering these questions from data assembled across different employers must overcome the lack of correspondences in job titles: as illustrated in Fig. 1, the same job title appears with multiple variants, such as senior research associate and sr research assoc. This problem is typically addressed via entity-matching procedures, available in increasingly sophisticated data-integration software such as Wrangler [13], Tamr [14], and OpenRefine [15]. Despite these tools, entity matching remains a difficult task as it often involves domain expertise or faces the lack of clear correspondences in entities across sources. Is such matching necessary to the validity of the analysis, or can more complex statistical pipelines do without?
In this paper, we show that applying advanced statistical techniques directly to non-normalized data can avoid labor-intensive data curation for many analytical questions. We benchmark whether relying more on machine learning and less on manual data cleaning compromises the validity of the analysis. To answer this important question, we formalize how many analyses boil down to estimating statistical quantities, and use various experiments that give an unbiased measure of the corresponding estimation error.
The quantities needed for the analytical questions can be estimated with machine-learning models applied to continuous embeddings of entries that represent ambiguities. A suitably trained model can be directly queried to give e.g. the evolution of salary with experience for a given job, giving a less noisy result than standard techniques after best-effort manual cleaning (see Fig. 2). Machine learning is already increasingly used in data integration to create more uniform data warehouses [16], [17] or to clean their entries [18], [19]. Instead, our study applies it directly to the analytical question, as this can be easier than curating the data for fundamental reasons. First, the analytic task provides supervision [20]-[22], while cleaning needs examples of curated data. Second, representing ambiguities in the analysis often leads to more accurate results.
We first give a brief summary of how data cleaning is used to enable data analysis with diverse sources; after which we show how data-science questions can be formulated in terms of machine learning on embeddings of entries. We then study the validity of the results: on an analysis of wages across 14 data sources, we compare manual data cleaning to a simple machine-learning approach using embeddings of entries. Qualitatively and quantitatively, analyzing the non-normalized data gives better results. Finally, we discuss perspectives on adapting data-analysis practices to rely less on cleaning.

A. ENTITY MATCHING TO INTEGRATE DATA
Integrating data often faces alignment challenges, in particular if it is assembled across sources. Notably, correspondences are first needed at the schema level: different sources may come with different structures, e.g. columns (relations) of different names for the same information, or information split differently across columns [23]. Bridging such mismatches is known as schema matching, and is often required as a data-preparation step. However, it tends to burden the human operator less than instance-level entity matching, because there are fewer matches to check. We thus focus on entity matching in the remainder, though there are many other aspects to data quality [19].

1) ENTITY MATCHING
For data integration, entity matching strives to match different variants that denote the same entity [24]. Classic situations include deduplication of multiple variants of the same entity in a given table, or record linkage, matching entities across two tables [25]. Matching entries to uncover categories is necessary for standard statistical procedures to answer a question conditional on a non-normalized category: analyzing one quantity (such as salary) while keeping another (job title) constant [26]. Conversely, computing marginal quantities, such as the overall distribution of salary, does not require entity matching as it relies on aggregates of all the data (assuming that there are no duplicates across the sources).
Entity-matching techniques rely on an appropriate similarity (typically across strings) and a threshold to assign entries to the same entity. The issue is that both similarity and threshold must be tailored to domain specificities and the resulting matches must often be manually reviewed. The process is thus labor-intensive.
Automating the matching of entities is challenging because it is an unsupervised-learning problem [27], unless there are known matches for supervision [28]-[30]. Such matches must typically be constructed manually by a user, though active learning can reduce human intervention [18], [31]. Dedicated data-integration software, such as OpenRefine (http://openrefine.org/), facilitates the process with a user interface. The software suggests potential matches and enables users to tune parameters and match entries in a semi-automated way.
Automation tools can use techniques that capture natural-language semantics, a field that shares with entity matching the challenge of relating multiple forms that denote the same thing. For instance, natural language processing tools such as fastText [32] provide word embeddings resilient to morphological variations. Embeddings have led to much recent progress in entity-matching pipelines [8], [30], [33], [34].

2) CLEANING REAL-WORLD SALARY DATA
To answer our questions on salary, we consider data from a study of salaries in Texas state institutions, 1 as illustrated in Table 1. The tables are assembled across 14 different employers and the Job Title information is particularly challenging: without normalization there are about 14 000 different job positions for a total of around 160 000 employees.
We performed manual entity matching on the job titles using OpenRefine. We first cleaned common abbreviations as they are the main hurdle to entity matching: string metrics struggle to capture their similarities. Typical examples include (sr/senior), (asst/assistant) or (mgr/manager). Some abbreviations can have multiple or complex meanings: (tech/technology, technician, technical), (CMC/Chemical, Manufacturing and Control). Such manual cleaning is thus limited by the domain expertise of the operator. We then used OpenRefine to search and manually merge variants across sources. Around 1000 job titles were paired in the process, which took about 3 days. We believe that a more thorough entity matching, especially on rare job titles, can be performed, but it would require intensive human labor to bring minor improvements.

B. STANDARD ANALYTICAL PRACTICE: MATCHING AND AVERAGING
In general, the analytic questions can be formalized as estimating a quantity y for a population or group of instances that share a set of attributes X: for instance, the typical salary of a project manager with 3 years of experience. To that end, the standard technique consists in matching & averaging:
1) a query on X to match and select the relevant instances;
2) a procedure (typically a form of averaging) to aggregate the results and estimate y.
Even when the entities in the data are normalized, a successful analysis may require matching them with the vocabulary used by the analyst: for instance, in some data the correct query for project manager may be mgr project (Fig. 2). In our example analysis, studying determinants of salary, the motivation to unite data sources is to establish a more general result: project-manager salaries or the male-female pay gap may vary across employers; there may be no instance of a project manager with 3 years of experience at a given employer. The underlying problem is that of statistical estimation: to compute the quantity that best represents the complete population from the instances at hand. If the entity matching is valid, matching & averaging estimates are unbiased from a statistical point of view. But they may exhibit high variance, as the data often contains only a small number of representatives of a given category. A paradox of statistics is that the most accurate way of estimating the mean of a population from a small sample may not be the sample average, but a biased estimate that draws on other sources of information (see Stein's paradox [35]). For instance, estimating the typical salary of an associate professor can leverage similar populations: professor, lecturer. Drawing information across similar entities is related to the notion of semantic queries in databases [36], as opposed to exact value matching, used in matching & averaging.
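The two steps of matching & averaging can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the toy records and the helper names `match_and_average` and `group_sizes` are hypothetical. It shows how exact matching splits variants of the same job into separate, small groups, and how a query without an exact match gets no answer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (job title, years of experience, salary).
RECORDS = [
    ("project manager", 3, 62000.0),
    ("project manager", 3, 66000.0),
    ("mgr project", 3, 70000.0),
    ("project manager", 10, 81000.0),
]


def match_and_average(records, job, experience):
    """Step 1: select the instances that match the query exactly.
    Step 2: aggregate the selected salaries with a plain average."""
    selected = [s for (j, e, s) in records if j == job and e == experience]
    if not selected:
        return None  # no exact match: the estimator has no answer
    return mean(selected)


def group_sizes(records):
    """Count how many instances back each (job, experience) estimate."""
    sizes = defaultdict(int)
    for job, experience, _ in records:
        sizes[(job, experience)] += 1
    return dict(sizes)
```

In this toy data, the variant mgr project forms its own group of size one, so its estimate rests on a single salary, which is the high-variance situation described above.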

III. ANSWERING ANALYTICAL QUESTIONS WITH MACHINE LEARNING
We now give the statistical underpinnings of approaches based on machine learning. We specifically consider the three analytical questions of our example study on determinants of salary. While these originally do not appear to be machine learning problems, we show that they can be reformulated as such. An overview of our analysis pipeline is depicted in Figure 3.

A. SALARY EVOLUTION AND QUANTILES
To study the evolution of one quantity, salary, as a function of another, experience, matching & averaging methods typically group employees by job and experience level, and compute the mean salary in each group. This quantity is an estimate of the conditional expectation E[Salary | Job, Experience].
Instead of averaging on groups, a machine-learning model trained to predict the salary given the job and experience level can estimate this quantity. Indeed, modeling the salary as a function f_θ(Job, Experience) gives a consistent 2 estimate of the conditional expectation E[Salary | Job, Experience] if model parameters θ are optimized on the data to minimize the mean squared error on the salary [37, section 1.5.5].
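The link between the squared error and the conditional expectation follows from a standard decomposition. It is written here with generic symbols that are introduced only for this illustration: Y stands for the salary, X for the pair (Job, Experience), and f for any square-integrable predictor.

```latex
\begin{align*}
\mathbb{E}\big[(Y - f(X))^2\big]
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]
   + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - f(X))^2\big] \\
\Rightarrow\quad
\operatorname*{arg\,min}_{f}\; \mathbb{E}\big[(Y - f(X))^2\big]
  &= \mathbb{E}[Y \mid X].
\end{align*}
```

The cross term vanishes because Y − E[Y | X] has zero mean conditionally on X. The first term on the right does not depend on f, so the error is smallest when f(X) equals the conditional expectation.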
Unlike averaging, casting the analytic question into a machine-learning task does not require matching. Rather than viewing each job title j as a discrete category, we can represent it by a vector j ∈ R^p that will serve as features to train the machine-learning model f_θ.
The crucial point here is that these vectors should capture the similarities between job titles. For instance, administrative assistant and administrative asst should have close representations. This allows the machine-learning model to leverage these similarities and implicitly account for matching. We use in our experiments pretrained fastText embeddings [32], which readily provide vector representations for strings and account for semantic and morphological similarities. Other approaches to encode string similarities into vectors could be used as well [26], [38].
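To make the notion of close representations concrete, the sketch below builds a vector of character trigram counts for each job title and compares titles with a cosine similarity. This is a deliberately crude stand-in and not fastText: it captures only morphological overlap, has no learned semantics, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Bag of character n-grams, padded with spaces to mark boundaries."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(title_a, title_b):
    """Similarity of two job titles under the trigram representation."""
    return cosine(char_ngrams(title_a), char_ngrams(title_b))
```

With this representation, administrative assistant is much closer to administrative asst than to an unrelated title such as police officer, which is the property a downstream model can exploit without explicit matching.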
A wide variety of machine-learning models can be used to estimate the quantity of interest. For our experiments we use gradient boosted tree models from scikit-learn [39], as they generally perform well in prediction tasks. Besides avoiding tedious entity-matching, machine-learning models can form weakly-parametric estimators that are resilient to other imperfections in the data. For instance, imperfect correspondences between schemas across the sources lead to missing values: some sources may not have all the information. Despite these missing values, supervised learning can give optimal estimates without relying on probabilistic modeling of the missing-data mechanism [40], [41].
A model can be trained to estimate different statistical quantities by choosing the measure of error (loss) that it minimizes [42]: a mean squared error leads to conditional expectations. Similarly, a model f_θ(Job) trained with a quantile loss estimates a quantile of the salary distribution for a given job [43]. The appendix (sec. VI-A and VI-B) gives more details about the specific implementation of the analysis.
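The effect of the loss can be checked on the simplest possible model, a single constant. The sketch below uses the standard quantile (pinball) loss and searches for the constant that minimizes it; the data and function names are hypothetical and the brute-force search is only for illustration.

```python
def pinball_loss(y_true, prediction, q):
    """Mean quantile (pinball) loss of a constant prediction at level q."""
    total = 0.0
    for y in y_true:
        diff = y - prediction
        # Under-predictions cost q per unit, over-predictions cost 1 - q.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Sample value that minimizes the pinball loss at level q."""
    candidates = sorted(set(y_true))
    return min(candidates, key=lambda c: pinball_loss(y_true, c, q))


# Hypothetical salaries in arbitrary units.
SALARIES = [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

On these nine values, the minimizer at level 0.75 is the seventh smallest value and the minimizer at level 0.5 is the median, as expected for sample quantiles. A flexible model trained with the same loss does this conditionally on its inputs.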

B. PAY GAP ACROSS SEX: COUNTERFACTUAL ANALYSIS
Many of the advanced analyses and visualizations of rich data sources pertain to understanding causes and effects. For instance, when studying salaries, measuring and understanding the causes of the gender gap is a long-running question [12], [44], [45]. Shedding light on this question requires contrasting salaries for men and women in similar positions, with similar experience, and so on. Can it be done without cleaning the data?
Counterfactual analysis provides a good framework to address quantitatively the question of the gender pay gap. A counterfactual is a thought experiment measuring the effect on the outcome of interest (the salary y_i) of changing only the feature of interest W_i (here the sex) for an individual i. Borrowing from clinical trials, W_i is called the "treatment" in the literature. The outcome y_i can take two potential values depending on W_i (1 for a man, 0 for a woman), though for each individual only one of these is observed in the data.
The analytical quantity of interest is the typical gender pay gap, known as the average treatment effect τ = E[y(1) − y(0)]: the average difference in outcome for the same individual under scenarios W_i = 1 and W_i = 0, i.e. that differ only by their sex [46], [47]. In general, it does not suffice to subtract the average salary of women from that of men: indeed, the populations of men and women may not be directly comparable in the database. For instance, women may need to interrupt their careers during maternity leave, causing them to have less work experience and thus lower salaries. To account for such confounding factors and isolate the effect of interest, the potential outcome framework (Table 2) uses covariates, extra information X on each individual, such as the job title or the experience level, allowing us to directly compare salaries for men and women with the same features.
VOLUME 10, 2022
TABLE 2. Illustration of the potential outcome framework. The data contains an observation of a male white firefighter with 15 years of experience, but not a matching female employee; likewise with a Hispanic female post-doc with 2 years of experience. The challenge is to interpolate the missing data.
To estimate the average treatment effect, modern causal inference techniques rely either on estimates of the outcome given the covariates and the treatment, E[y|X, W], or on estimates of the propensity-score P(W = 1|X), i.e. the probability for an employee to be a man given their covariates. The first quantity can be estimated using regression models, trained to predict y from X and W, as detailed before. Similarly, the propensity-score can be estimated with classification models, trained to predict W from X, when they are calibrated [48]. Finally, powerful causal-inference tools combine both estimates for more robustness [49]. State-of-the-art approaches already rely on machine-learning models to adapt to biases and noise in the input data [50], [51]. The appendix (sec. VI-D) details the exact models used in our experiments to estimate the average treatment effect.
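As an illustration of how the two kinds of estimates enter the computation, the sketch below gives the textbook inverse-propensity-weighted estimator and its augmented variant that also uses an outcome model. It is an assumption that these standard forms are representative; the exact estimators and models of the study are those of its appendix and are not reproduced here. The function names and the toy numbers are hypothetical, and the propensities and outcome predictions are taken as given rather than learned.

```python
def ipw_ate(y, w, propensity):
    """Inverse-propensity-weighted estimate of E[y(1) - y(0)]."""
    n = len(y)
    return sum(
        wi * yi / ei - (1 - wi) * yi / (1 - ei)
        for yi, wi, ei in zip(y, w, propensity)
    ) / n


def aipw_ate(y, w, propensity, mu1, mu0):
    """Augmented estimate: outcome-model difference plus weighted residuals.

    mu1[i] and mu0[i] are the outcome-model predictions for individual i
    under W = 1 and W = 0 respectively.
    """
    n = len(y)
    total = 0.0
    for yi, wi, ei, m1, m0 in zip(y, w, propensity, mu1, mu0):
        total += m1 - m0
        total += wi * (yi - m1) / ei - (1 - wi) * (yi - m0) / (1 - ei)
    return total / n


# Hypothetical toy data: salaries in thousands, W = 1 for men, 0 for women.
Y = [12.0, 14.0, 10.0, 11.0]
W = [1, 1, 0, 0]
PROPENSITY = [0.5, 0.5, 0.5, 0.5]
MU1 = [12.0, 14.0, 12.5, 13.5]  # predicted salary if W = 1
MU0 = [9.5, 11.5, 10.0, 11.0]   # predicted salary if W = 0
```

In this toy setting the outcome model reproduces the observed salaries exactly, so the residual terms vanish and both estimators return the same gap of 2.5.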

1) A PATTERN: MACHINE LEARNING
Because machine learning can capture complex links in complex data, it is increasingly used in data science to estimate quantities of interest to the analyst, whether they are intermediate quantities, as for counterfactual analysis (subsection III-B), or the direct answer to the question of interest, as for conditional links (subsection III-A). For data integration, this evolution brings exciting new opportunities: machine-learning models do not need to rely on averaging, and hence do not need actual matching of entities across sources. Rather, they can use vector representations that express, even indirectly, relevant similarities between entities.

IV. EMPIRICAL STUDY: LEARNING VERSUS CLEANING
Using machine learning can be less labor-intensive, as it does not require human-guided entity matching. But does it come at a cost to the validity of the results? We now compare empirically learning and matching-based approaches for the different analytic questions. 3

A. EXPERIMENTAL DETAILS

1) MEASURING ESTIMATION ERROR
How to compare estimators of a quantity such as the conditional expectation of salary given job title? Even without entity-matching noise, the data at hand is limited and its mean is an imperfect estimate of the unknown population quantity y. We adapt a classic procedure of machine learning: we leave out a test fraction of the databases, and use the rest of the data to derive estimates ŷ_train. Applying an averaging-based estimator on the test data provides another estimate ŷ_test, that is unbiased though noisy. Importantly, as it has been estimated from different data than ŷ_train, its estimation error is independent. We can thus use the difference between ŷ_train and ŷ_test over multiple splits (a cross-validation loop) to quantify the estimation error of the procedure that we use to compute ŷ_train.
3 The code and data to reproduce our experiments are available on Code Ocean: https://codeocean.com/capsule/6435573/tree
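The error-measurement procedure described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions: rows are hypothetical (source, job, salary) triples, the quantity estimated is the mean salary per job, and the function names are invented for the example. Any estimator that maps training rows to a dictionary of per-job estimates can be plugged in.

```python
import random
from collections import defaultdict
from math import sqrt
from statistics import mean


def group_means(rows):
    """Averaging-based estimate of the mean salary for each job title."""
    groups = defaultdict(list)
    for _, job, salary in rows:
        groups[job].append(salary)
    return {job: mean(values) for job, values in groups.items()}


def split_error(rows, estimator, seed):
    """RMSE between estimates from the training sources and plain averages
    on the held-out sources, over the jobs present on both sides."""
    sources = sorted({source for source, _, _ in rows})
    random.Random(seed).shuffle(sources)
    train_sources = set(sources[: len(sources) // 2])
    train = [r for r in rows if r[0] in train_sources]
    test = [r for r in rows if r[0] not in train_sources]
    y_train, y_test = estimator(train), group_means(test)
    shared = sorted(set(y_train) & set(y_test))
    if not shared:
        return None
    return sqrt(mean((y_train[j] - y_test[j]) ** 2 for j in shared))


def cross_validated_error(rows, estimator, n_splits=10):
    """Average the split error over several random splits of the sources."""
    errors = [split_error(rows, estimator, seed) for seed in range(n_splits)]
    errors = [e for e in errors if e is not None]
    return mean(errors) if errors else None
```

Splitting by source rather than by row mirrors the fact that whole databases are left out, so the held-out average is computed on data that the estimator has never seen.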

2) ANALYTICAL APPROACHES STUDIED
We compare several approaches to estimate the quantities relevant to our analytical questions (implementation details are provided in the appendix):
1) Matching & averaging, as described in subsection II-B.
2) Embedding & learning: strategies of section III, relying only on standard machine-learning tools. Gradient boosted tree models from scikit-learn [39] are trained, using pretrained fastText embeddings [32] to represent the job titles, capturing semantic and morphological similarities.
3) Embedding & fuzzy matching: the notion of continuous similarities, as between embeddings, can also be exploited to define weighted averages. We modify the matching & averaging procedure to use fuzzy matches and weights defined with a cosine string similarity on the job title with an affine decay and a cut-off at zero.
Parameters such as the affine decay and cut-off, or the hyper-parameters of the machine-learning models, are tuned in a nested cross-validation procedure. To study the effect of entity matching, we apply these techniques on raw and manually matched entries.
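The weighting used by the fuzzy-matching approach can be illustrated with a short sketch. The exact parametrization of the affine decay is not given in this section, so the formula below, with a single hypothetical `slope` parameter, is an assumption chosen to match the description: the weight is 1 for an exact match, decreases linearly as the similarity drops, and is cut off at zero.

```python
def affine_weight(similarity, slope):
    """Weight that decays linearly as similarity drops below 1 and is cut
    off at zero. `slope` is a hypothetical tuning parameter."""
    return max(0.0, 1.0 - slope * (1.0 - similarity))


def fuzzy_average(salaries, similarities, slope=4.0):
    """Average of salaries weighted by their similarity to the query."""
    weights = [affine_weight(s, slope) for s in similarities]
    total = sum(weights)
    if total == 0.0:
        return None  # nothing is similar enough to the query
    return sum(w * y for w, y in zip(weights, salaries)) / total
```

With a slope of 4, entries whose similarity to the query is 0.75 or lower receive no weight, so they are excluded just as they would be by exact matching, while near-variants contribute partially.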

B. QUALITATIVE RESULTS: DISPERSION ACROSS VARIANTS
The curves of salary as a function of experience shown in Fig. 2 are computed either with a matching-based or a learning-based approach. Machine-learning estimates leverage job similarities and have low dispersion across variants of project manager or administrative assistant. This robustness reduces the need for manual matching: taking the model output on any variant provides reliable estimates that are representative of the whole population. It is more convenient and reliable for an analyst to query the model for "project manager" or "administrative assistant" (thick magenta curves) than to search the database for all variants and average them. Beyond the dispersion across variants, matching & averaging curves appear noisier; in particular, they fail to capture well the evolution of salary with experience. Finally, machine-learning estimates show plausible extrapolations for queries where there is no data with exact matches, such as project managers with more than 25 years of experience.
TABLE 3. Cross-validated errors for salary, quantile, and propensity-score estimation. We report here estimates of the propensity-score P(W = 1|J) conditionally to the job title, rather than on all covariates, as matching-based estimates are very noisy in that case. RMSE = Root Mean Square Error. MAE = Mean Absolute Error.

C. QUANTITATIVE RESULTS: CROSS-VALIDATED ERRORS
To go beyond the face validity of Fig. 2, we use cross-validation, as detailed in subsection IV-A, to quantify which approach best estimates the population quantities. The 14 databases are randomly split into two sets of 7 databases: one to compute estimates for salary, quantile, and propensity-score; and the other to measure their error, reported in Table 3. Results show that for all three quantities embeddings notably reduce the error compared to exact matching and perform best when combined with learning. Adding manual matching on top of embeddings improves further, but the benefit is smaller than that brought by embeddings & learning. The residual error is due to variance in individual salary that is not explained by the attributes of the employees present in the databases, such as the appreciation of the manager.

D. ESTIMATION OF COUNTERFACTUALS
How do the differences in estimation errors reported in Table 3 impact complex end-user analytical questions? We investigate their impact on the estimation of the salary gap across sex. Fig. 4 gives average treatment effects computed with statistical methods (IPS and AIPS [52]) based on embedding & learning approaches, as well as manual matching and fuzzy-matching estimates (see appendix). To force the need for analysis across the databases, we create a sex imbalance by randomly dropping a fraction of either men or women in each database, with 50/50 probability. As a result, the estimation relies on employees of opposite sex with matching job titles across databases. Machine-learning methods have much less variance than matching & averaging methods, but both approaches lead to estimates across databases (large sex imbalance) that do not depart from values obtained within databases (no sex imbalance). On the other hand, fuzzy matching creates sizeable bias: an analysis performed across databases differs markedly from an analysis comparing employees inside each database. The low variance of machine-learning methods comes from their implicit interpolation, visible in Fig. 2: if a given employee lacks an opposite-sex counterpart with the same covariates, the model will use information from similar profiles.
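The construction of the sex imbalance can be sketched as below. The data layout (a dictionary mapping a database name to a list of (sex, salary) rows), the function name, and the choice to drop each row independently with the given probability are assumptions made for illustration; the study's own code may proceed differently.

```python
import random


def create_sex_imbalance(databases, fraction, seed=0):
    """For each database, pick one sex at random (50/50) and drop roughly
    the given fraction of its employees.

    Each row of the chosen sex is dropped independently with probability
    `fraction`, so the dropped share matches `fraction` only on average.
    """
    rng = random.Random(seed)
    result = {}
    for name, rows in databases.items():
        dropped_sex = rng.choice(["M", "F"])
        result[name] = [
            row for row in rows
            if row[0] != dropped_sex or rng.random() >= fraction
        ]
    return result
```

With a fraction of 1, every database keeps employees of a single sex, so any comparison between men and women has to draw on different databases.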

V. DISCUSSION: HOW MUCH CAN LEARNING REPLACE CLEANING?
On the data-integration problem that we have studied, relying more on learning than on cleaning facilitates the data analysis, and actually improves the validity of the results without manual labor. This result departs from classic data-management practices, and we now discuss its interpretation and impact for analytical practices.

A. CLEANING IS IN THE EYE OF THE BEHOLDER

1) CLEANING IS ANALYSIS
Studying the salary gap showcases the importance of analysis across data sources: for the highest-paid positions, finding employees of opposite sex requires considering multiple companies. Matching entities faces the fundamental challenge that there might not be exact correspondences: not every institution has a chief data officer (CDO) and the nearest match may be chief technology officer (CTO). Omitting companies without a CDO will bias the analysis by excluding large tech companies.
The notion of cleaning, to make data more uniform, carries in itself analytical choices which may bias the results [53], [54]. While vinaigrette is just French for salad dressing, its use on an American restaurant's menu signals an upper-scale clientele. Merging the two will lead to loss of information. From an ontological point of view, the solution would be to create a new category, posh salad dressing. But maintaining a complete and consistent ontology, catering for all the edge cases, requires manual work each time new data is integrated. Should the necessity to merge entities be considered as a bug of analytic pipelines, rather than a feature? New tools that do not require exact matches can give more reliable analyses in the face of ambiguity, as illustrated by the estimation of the salary gap across databases with sex imbalance (Fig. 4).
Manually curating entity matching brings to the data a consistency that is good practice in production settings. Yet, as illustrated in our empirical study, favoring more advanced statistics down the line facilitates valid analysis. It can indeed be easier to pass on uncertainties to the statistical analysis tools than to resolve them in a relational store. The best data representation, clean or fuzzy, is tied to the analytic question.

B. SUPERVISION FACILITATES INTEGRATING DATA WITH AMBIGUITIES
Representing uncertainty in relational systems helps tackle ambiguities [36], [55], [56] or curate data [57]. However, extending relational data management to a general probabilistic framework is intrinsically hard. Indeed, unlike with the relational algebra, queries in a probabilistic database can suffer non-polynomial complexity [58]. Approximate probabilities [59], [60] or fuzzy logic and similarities [61] have better tractability. Yet how to weight similarities to best capture ambiguities is often a challenge in itself.
Using supervised learning to answer a given statistical question alleviates the need for probabilistic models. In particular, many recent successes rely on discriminative modeling using empirical risk minimization, as with deep learning [62]. It is crucial to the success of our empirical study: optimizing the statistical models gives accurate estimates from non-probabilistic similarities, namely word representations that were not tailored to the question at hand. Such an approach goes much further than fuzzy matching (Fig. 4), as supervised learning can be seen as implicitly tuning scaling factors and thresholds to combine information optimally while minimizing noise.

1) EMBEDDINGS TO CAPTURE AMBIGUITIES
Entity embeddings are crucial to the success of our approach, to expose ambiguities to the analysis step. Our proof of principle purposely used a very simple implementation: a general-purpose machine-learning model applied to off-the-shelf word embeddings. Yet, it is noteworthy that it leads to analyses on the unaligned data that are more accurate than standard statistical approaches on data cleaned with three days of manual labor using dedicated software (Table 3, Fig. 4). There is ample room to use better embeddings of entries, for instance training them from the data at hand to adapt to its specificities, via the string forms [38] or the relations to other entities [63], including distant relational information [64].

C. THE ROAD AHEAD: RETHINKING ANALYTIC PIPELINES

1) MORE COMPLEX DATA-INTEGRATION PIPELINES
The data-integration problem studied in section III is very simple: it consists in analyzing the union of tables across sources. In relational algebra terms, the machine-learning models replace a GroupBy followed by aggregations. However, data integration often calls for joining and aggregating across tables of different nature. Tackling these operations using machine learning on embeddings will require exploring new tools, for instance adapting similarity joins to merge information across tables [65], [66], logic inferences on top of entity embeddings [67], or graph CNNs for relational data [68].

2) BACK TO THE DATA SCIENTIST: OPENING UP BLACK BOXES
Without explicitly merging variants into a small number of human-recognizable entities, data-analysis pipelines can be complicated to audit for the human analyst. And yet, such human inspection of pipelines is often important for validation and debugging. Understanding analytic pipelines based on machine learning rather than cleaning will need techniques from the growing field of black-box model explanation in AI [69]: counterfactual reasoning can be applied to understand how a data-assembly pipeline transforms an input [70]; permutation importance can gauge how a given attribute impacts the results by shuffling its values across instances [71]; finally, entity embeddings can be crafted to relate to human-comprehensible notions, for instance revealing latent categories [38].
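Permutation importance, mentioned above, is simple enough to sketch directly. The version below is a minimal illustration with hypothetical names: it treats the model as a black-box function of a row (a tuple of attribute values), shuffles one attribute across instances, and reports the average increase in mean squared error over several shuffles.

```python
import random


def mean_squared_error(model, rows, targets):
    """Mean squared error of a black-box model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, n_repeats=20):
    """Average increase in error after shuffling one column across rows.

    Rows must be tuples; the model is only called, never inspected.
    """
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for seed in range(n_repeats):
        values = [row[column] for row in rows]
        random.Random(seed).shuffle(values)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, values)
        ]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats
```

An attribute that the model ignores gets an importance of zero, while shuffling an attribute that the model relies on degrades its predictions. scikit-learn provides a function of the same name in `sklearn.inspection` for fitted estimators.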

D. CLEANING OR LEARNING? TWO COMPLEMENTARY TOOLS
Replacing explicit cleaning by machine learning follows the trend from "schema on write" to "schema on read": it displaces the burden from the data producer to the data consumer [72].
Cleaning is difficult, but it comes with the hope that the efforts will yield long-lasting benefits, useful for multiple usages of the data. These hopes are certainly well-grounded. Yet cleaning never ends; ambiguities in entity matching must be revisited given a new topic of analysis, or a new data source to integrate [5]. On the other hand, while variations may capture nuances (vinaigrette being posh for salad dressing), expressing the exact same entity in two different ways is often an unnecessary hurdle to data integration. Standard vocabularies, such as the universal resource identifier (URI) developed for linked data [73], address these hurdles. They are complementary to a strategy based on embedding and learning, and can be priceless to bridge data sources, even if only a fraction of the entities can be expressed within the vocabulary. An analysis using machine learning to tackle ambiguities will be more successful if there are only a few of these ambiguities. If data is normalized enough, data integration can leverage off-the-shelf embeddings, such as the fastText embeddings used in our proof of concept. These continuous embeddings are complementary to standard vocabularies.

VI. CONCLUSION: LEARNING CUTS HUMAN LABOR BUT KEEPS VALID RESULTS
Ambiguities often arise when analyzing data, for instance if it comes from different sources with different conventions. The analysis then faces a fundamental challenge of validity: has the data been merged right, so as not to bias the results?
The correct correspondence between entities across different data representations depends on the goal of the analysis: when integrating a ''CDO'' (chief data officer) into an employee directory that does not know such a role, it could be legitimate to convert ''CDO'' to ''executive officer'' to study salary, or to ''data scientist'' to study expertise.
The traditional view is that data cleaning is necessary for a valid analysis: carefully establish correspondences, typically combining automated approaches with manual supervision and quality assurance. Rather, our benchmark shows that valid answers to a given analytic question can be assembled by exposing ambiguities to a machine-learning pipeline. Indeed, many questions that do not explicitly call for machine learning can be formulated using such models as flexible estimators of the underlying quantities. Our empirical comparison of a simple machine-learning approach to a labor-intensive manual cleaning shows that learning improved the quality of the analysis as much as, if not more than, the cleaning. We hope that it can provide a point of reference to future analysts, and justify saving time on manual cleaning.

APPENDIX: IMPLEMENTATION DETAILS
We describe here the estimation methods used in our empirical evaluation. Subsections A, B and C correspond to the analytical tasks of Table 3: the evolution of salary with experience, salary quantiles across jobs, and the proportion of women across jobs. Subsection D focuses on the causal inference problem of Figure 4: estimating the effect of gender on salaries, accounting for confounding factors.

A. SALARY EVOLUTION AS A FUNCTION OF EXPERIENCE
For a given job, we aim to estimate the mean salary as a function of work experience. This amounts to estimating the conditional expectation τ = E[Salary | Job, Experience].

1) MATCHING AND AVERAGING
We form a group G(j, e) of employees with the job j and experience level e of interest, and compute the empirical mean of the employee salaries $y_i$:
$$\hat{\tau}_{\text{matching}}(j, e) = \frac{1}{|G(j, e)|} \sum_{i \in G(j, e)} y_i.$$
Note that in our experiments, we estimate τ for job titles in the test set. Some of them have no equivalent in the training set and thus cannot be matched, meaning that G would be empty.
In this case, we include in G all employees with the desired experience level, regardless of their jobs.
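The matching-and-averaging estimate and its fallback for unmatched job titles can be sketched as follows. This is a minimal illustration on made-up records; the function names and the `(job, experience, salary)` tuple layout are assumptions, not the code used in our experiments.

```python
from collections import defaultdict
from statistics import mean

def fit_matching(train):
    """Group training salaries by (job, experience) and by experience alone."""
    by_job_exp = defaultdict(list)
    by_exp = defaultdict(list)
    for job, experience, salary in train:
        by_job_exp[(job, experience)].append(salary)
        by_exp[experience].append(salary)
    return by_job_exp, by_exp

def estimate_matching(groups, job, experience):
    """Empirical mean salary over G(j, e), with a fallback when G is empty."""
    by_job_exp, by_exp = groups
    group = by_job_exp.get((job, experience))
    if not group:
        # Unmatched job title: use all employees with this experience level
        group = by_exp.get(experience)
    if not group:
        return None  # no employee with this experience level at all
    return mean(group)

# Hypothetical training records: (job title, experience level, salary)
train = [
    ("professor", 5, 90_000),
    ("professor", 5, 110_000),
    ("clerk", 5, 40_000),
]
groups = fit_matching(train)
matched = estimate_matching(groups, "professor", 5)   # exact match exists
unmatched = estimate_matching(groups, "prof", 5)      # falls back to experience only
```

The variant ''prof'' has no exact match in the toy training set, so its estimate mixes professors and clerks. This is the kind of error that the fuzzy-matching and learning approaches aim to reduce.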

2) EMBEDDINGS AND FUZZY MATCHING
Matching and averaging provides noisy estimates when the group G(j, e) contains few employees. To obtain reliable estimates in these cases, fuzzy matching averages the manual matching estimates $\hat{\tau}_{\text{matching}}(j', e)$ over several jobs $j'$, giving more weight to jobs $j'$ that are similar to the job $j$ of interest:
$$\hat{\tau}_{\text{fuzzy}}(j, e) = \frac{\sum_{j' \in J} \text{sim}(j', j)\, \hat{\tau}_{\text{matching}}(j', e)}{\sum_{j' \in J} \text{sim}(j', j)},$$
with $J$ the set of all job titles and $\text{sim}(j', j) \geq 0$ the string similarity between the job $j'$ and the job $j$ of interest.
To define the string similarity $\text{sim}(j_1, j_2)$ between job titles, we encode them into vectors $\mathbf{j}_1, \mathbf{j}_2$ using a pretrained fastText model (footnote 4) and compute their cosine similarity:
$$\cos(\mathbf{j}_1, \mathbf{j}_2) = \frac{\langle \mathbf{j}_1, \mathbf{j}_2 \rangle}{\|\mathbf{j}_1\|\, \|\mathbf{j}_2\|}.$$
We finally obtain the similarity score by rescaling the cosine similarity into [0, 1], based on a threshold t that we tune to minimize cross-validation errors. We select the threshold in the following range of values: t ∈ {0.9, 0.8, 0.7, 0.6, 0.5}.
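The fuzzy-matching estimate can be sketched as a similarity-weighted average of per-job estimates. Two elements below are assumptions made for illustration: the rescaling max(0, (cos − t) / (1 − t)), which is only one plausible way to map the cosine similarity into [0, 1] given a threshold t, and the toy 2-dimensional vectors, which stand in for fastText embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity(u, v, t):
    """ASSUMED rescaling: 0 below the threshold t, linear up to 1 at cosine = 1."""
    return max(0.0, (cosine(u, v) - t) / (1.0 - t))

def estimate_fuzzy(job_vector, matching_estimates, vectors, t):
    """Similarity-weighted average of the per-job matching estimates."""
    weights = {j: similarity(vectors[j], job_vector, t) for j in matching_estimates}
    total = sum(weights.values())
    if total == 0:
        return None  # no training job is similar enough
    return sum(weights[j] * matching_estimates[j] for j in weights) / total

# Hypothetical embeddings and matching estimates for two training job titles
vectors = {"professor": [1.0, 0.0], "clerk": [0.0, 1.0]}
matching_estimates = {"professor": 100_000.0, "clerk": 40_000.0}

# Hypothetical embedding of the unmatched variant "prof"
prof_vector = [0.8, 0.6]
estimate = estimate_fuzzy(prof_vector, matching_estimates, vectors, t=0.5)
```

With these toy vectors, ''prof'' is closer to ''professor'' than to ''clerk'', so the estimate is pulled towards the professor salary without discarding the other job entirely.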

3) EMBEDDING AND LEARNING
We can estimate τ by training a machine-learning model $f_\theta$ to predict the salary of an employee given their job and experience level. Importantly, we optimize the model parameters θ to minimize the mean squared error:
$$\hat{\theta} = \arg\min_\theta \sum_i \left( y_i - f_\theta(j_i, e_i) \right)^2,$$
where $y_i$, $j_i$ and $e_i$ are the salary, job title and experience level of the i-th employee. Indeed, minimizing the mean squared error leads to estimates of the conditional expectation [37, section 1.5.5].
Once trained, we can directly query the model to estimate the mean salary for the desired job j and experience level e:
$$\hat{\tau}_{\text{learning}}(j, e) = f_{\hat{\theta}}(j, e).$$
In our experiments, we use gradient boosted regression trees (footnote 5) as the machine-learning model $f_\theta$. We also use vector representations $\mathbf{j}_i$ of the job titles as features (obtained from a pretrained fastText model) to implicitly account for entity matching.
We also tune the learning rate α ∈ {0.01, 0.03, 0.1, 0.3} of the model to minimize cross-validation errors.
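The link between the mean squared error and the conditional expectation follows from a short, standard derivation. The notation x = (j, e) for the model inputs and m(x) = E[y | x] for the conditional expectation is introduced here for convenience only.

```latex
\begin{align*}
\mathbb{E}\left[(y - f(x))^2\right]
  &= \mathbb{E}\left[(y - m(x))^2\right]
   + 2\,\mathbb{E}\left[(y - m(x))\,(m(x) - f(x))\right]
   + \mathbb{E}\left[(m(x) - f(x))^2\right] \\
  &= \mathbb{E}\left[(y - m(x))^2\right]
   + \mathbb{E}\left[(m(x) - f(x))^2\right].
\end{align*}
```

The cross term vanishes because E[y − m(x) | x] = 0. The first remaining term does not depend on f, so the expected squared error is minimized by f(x) = m(x), the conditional expectation.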

B. SALARY QUANTILES
We are also interested in the distribution of salaries among employees with job j. More precisely, we aim to estimate the 0.75-quantile, i.e. the salary τ(j) such that 75% of employees with job j earn less than τ(j).

1) MATCHING AND AVERAGING
To estimate this quantity, we group employees based on their jobs, and then compute the empirical 0.75-quantile of salaries $\hat{\tau}_{\text{matching}}(j)$ for each group.

2) EMBEDDINGS AND FUZZY MATCHING
We follow the same procedure that we used to estimate the mean salary given the job and experience level (see Section VI-A2), with $\hat{\tau}_{\text{fuzzy}}(j)$ being a weighted average of the ''matching & averaging'' estimates.

3) EMBEDDING AND LEARNING
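We can estimate τ(j) by training a machine-learning model $f_\theta$ to predict the salary of an employee given their job. To estimate quantiles, we optimize the model parameters θ to minimize the quantile loss, instead of the mean squared error:
$$\hat{\theta} = \arg\min_\theta \sum_i L_\alpha\left( y_i, f_\theta(j_i) \right),$$
where
$$L_\alpha(y, \hat{y}) = \begin{cases} \alpha\,(y - \hat{y}) & \text{if } y \geq \hat{y}, \\ (1 - \alpha)\,(\hat{y} - y) & \text{otherwise,} \end{cases}$$
with α = 0.75 the quantile to estimate. As before, we can directly query the model to estimate the 0.75-quantile of salaries among employees with job j:
$$\hat{\tau}_{\text{learning}}(j) = f_{\hat{\theta}}(j).$$
Again, we use gradient boosted regression trees for prediction and fastText embeddings to encode job titles. We also tune the learning rate of the model, over the values {0.01, 0.03, 0.1, 0.3}, to minimize cross-validation errors.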
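To see why the quantile loss targets a quantile, the following minimal sketch minimizes it over a constant prediction for a single job. The salaries are made up, and the grid search stands in for the actual model training.

```python
def quantile_loss(y, prediction, alpha):
    """Quantile (pinball) loss: under-predictions cost alpha, over-predictions 1 - alpha."""
    diff = y - prediction
    return alpha * diff if diff >= 0 else (alpha - 1) * diff

def total_loss(salaries, prediction, alpha):
    return sum(quantile_loss(y, prediction, alpha) for y in salaries)

# Hypothetical salaries (in thousands) for one job title
salaries = [30, 35, 40, 45, 50, 55, 60, 100, 120]
alpha = 0.75

# Constant prediction minimizing the total quantile loss, searched on an integer grid
candidates = range(30, 121)
best = min(candidates, key=lambda q: total_loss(salaries, q, alpha))
```

On this toy sample, the minimizer is the seventh smallest salary, below which about three quarters of the observations fall, and it is barely affected by the two largest values.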

C. PROPORTION OF MEN ACROSS JOBS
We aim here to estimate the percentage τ (j) of men among employees with job j.

1) MATCHING AND AVERAGING
As before, we simply group employees by job titles, and compute the empirical frequency of men in each job.

2) EMBEDDINGS AND FUZZY MATCHING
We apply the same procedure as in the previous subsections.

3) EMBEDDING AND LEARNING
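We can estimate τ(j) by training a classification model $f_\theta$ to predict the gender W of an employee given their job j. The model output $f_{\hat{\theta}}(j) = \hat{\tau}_{\text{learning}}(j)$ then estimates the probability that an employee with job j is a man. Model parameters are optimized to minimize the logistic loss:
$$\hat{\theta} = \arg\min_\theta \; -\sum_i \left[ W_i \log f_\theta(j_i) + (1 - W_i) \log\left( 1 - f_\theta(j_i) \right) \right],$$
where $W_i$ and $j_i$ are the gender (1 for a man, 0 for a woman) and job of the i-th employee.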
As before, we use gradient boosted trees as the classification model (footnote 6) and use pretrained fastText embeddings to encode job titles. The learning rate of the model is also tuned to minimize cross-validation errors.
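The following minimal sketch illustrates why minimizing the logistic loss yields a proportion: for a single job title, the constant prediction with the lowest loss is the empirical frequency of men. The gender labels are made up, and the grid search stands in for the actual classifier training.

```python
import math

def logistic_loss(genders, p):
    """Logistic loss of a constant prediction p, with 1 = man and 0 = woman."""
    return -sum(w * math.log(p) + (1 - w) * math.log(1 - p) for w in genders)

# Hypothetical genders observed for one job title: 3 men, 1 woman
genders = [1, 1, 1, 0]

# Grid search over constant predictions p in {0.01, ..., 0.99}
grid = [k / 100 for k in range(1, 100)]
best_p = min(grid, key=lambda p: logistic_loss(genders, p))
empirical_frequency = sum(genders) / len(genders)
```

A classifier fed with embeddings generalizes this behavior: it shares information across job titles with similar vectors instead of treating each title as a separate group.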

D. CAUSAL EFFECT OF GENDER ON SALARY
As described in Section III-B, we are interested in the average treatment effect (ATE) τ = E[y(W = 1) − y(W = 0)]: the average salary gap between a man and a woman, all else being equal. In our experiments we use the following features: job title, experience level, ethnicity and the type of employer (city, county, university, hospital). Including these features allows us to compare salaries between similar employees and isolate the effect of gender.
Note that the ethnicity feature is also non-normalized: multiple variants for each ethnicity exist in the data (e.g. ''Black'', ''BLK'', ''Black or African American''). When estimating the ATE with manual or fuzzy matching techniques, we thus had to group similar ethnicities into 7 categories. When using machine-learning models for estimation, we simply encoded ethnicities into vectors of dimension 10, using a Gamma-Poisson factorization (footnote 7) [38]. Besides, these vectors can capture nuances that would otherwise have been lost in the matching process, for instance when grouping ''Mexican'' with ''Hispanic or Latino''.
We could easily estimate the ATE if for each employee we had access to $y_i(1)$ and $y_i(0)$, i.e. salaries under scenario W = 1 (employee is a man) and W = 0 (employee is a woman):
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i(1) - y_i(0) \right), \qquad (4)$$
with N the number of employees. Unfortunately, we observe either $y_i = y_i(1)$ or $y_i = y_i(0)$ in the data, never both. To be able to apply Eq. 4, we can replace the unobserved salary $y_i^{\text{unobs}}$ by an estimate.

1) MATCHING AND AVERAGING
A simple way to estimate the unobserved salary $y_i^{\text{unobs}}$ is to consider the set $O_i$ of employees with the same features as employee i but of opposite sex, and take their average salary. However, this is not always possible: some employees may have no counterpart of the opposite sex in the data. We thus consider only the set M of employees for which $O_i$ is not empty.
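The matching estimate of the ATE can be sketched as follows, on made-up records. The `(features, gender, salary)` tuple layout and the function name are assumptions, and the average is taken over the employees that have at least one counterpart, which is our reading of the restriction to the set M.

```python
from statistics import mean

def ate_matching(employees):
    """Matching estimate of the ATE over employees with an opposite-sex counterpart.

    employees: list of (features, w, salary), with w = 1 for a man, 0 for a woman.
    """
    effects = []
    for features, w, salary in employees:
        # O_i: salaries of employees with the same features but of opposite sex
        counterparts = [y for f, w2, y in employees if f == features and w2 != w]
        if not counterparts:
            continue  # employee i is not in M
        y_unobs = mean(counterparts)
        y1, y0 = (salary, y_unobs) if w == 1 else (y_unobs, salary)
        effects.append(y1 - y0)
    return mean(effects) if effects else None

# Hypothetical records: features are (job title, experience level)
employees = [
    (("clerk", 5), 1, 42_000),
    (("clerk", 5), 0, 40_000),
    (("professor", 10), 1, 100_000),
    (("professor", 10), 0, 96_000),
    (("driver", 3), 1, 35_000),   # no counterpart: dismissed
]
ate = ate_matching(employees)
```

The last record is silently dropped, which is harmless here but can bias the estimate when the dismissed employees differ systematically from the matched ones.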

2) EMBEDDINGS AND FUZZY MATCHING
Dismissing employees that have no counterpart of the opposite sex in the data can bias the results. To avoid this, we allow our estimate of $y_i^{\text{unobs}}$ to include employees from similar, but non-identical jobs. For an employee i such that $O_i = \emptyset$, we consider instead the set of employees of the opposite sex whose job is similar, though not identical, to that of employee i. We use the same similarity score as in Section VI-A2, with a threshold t = 0.8.

3) EMBEDDINGS AND LEARNING
Modern causal inference tools rely on machine-learning models, for instance to estimate $y_i^{\text{unobs}}$. Typically, a function $f_\theta$ is trained to predict $y_i$ given the employee covariates/features $X_i$ (job, experience level, ...) and their gender $W_i$. As before, parameters θ are optimized to minimize the mean squared error. We obtain the following estimate:
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left( f_{\hat{\theta}}(X_i, 1) - f_{\hat{\theta}}(X_i, 0) \right),$$
with N the number of employees. Other approaches are based on inverse propensity weighting. They rely on estimates of the propensity score $e(X_i) = P(W_i = 1 \mid X_i)$, the probability of being a man given features $X_i$, to account for imbalances between the covariates of men and women in the ATE:
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{W_i\, y_i}{\hat{e}(X_i)} - \frac{(1 - W_i)\, y_i}{1 - \hat{e}(X_i)} \right).$$
A machine-learning model $f_\theta$ trained to predict the gender of an employee from their covariates provides estimates of the propensity score: $f_{\hat{\theta}}(X_i) = \hat{e}(X_i)$.
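Powerful methods combine both approaches for more robustness [49]. We use such techniques in our experiments to estimate the ATE:
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{i,1} - \hat{y}_{i,0} + \frac{W_i\,(y_i - \hat{y}_{i,1})}{\hat{e}_i} - \frac{(1 - W_i)\,(y_i - \hat{y}_{i,0})}{1 - \hat{e}_i} \right).$$
Salary estimates $\hat{y}_{i,0/1} = f^{(y)}_{\hat{\theta}}(X_i, W = 0/1)$ are obtained from the machine-learning model $f^{(y)}_\theta$, trained to predict the salary y from covariates X and gender W. Similarly, propensity-score estimates $\hat{e}_i = f^{(w)}_{\hat{\theta}}(X_i)$ are obtained from the machine-learning model $f^{(w)}_\theta$, trained to predict the gender W from covariates X.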
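Given salary and propensity-score estimates, a combined estimator of this kind can be computed in a few lines. The sketch below uses the standard augmented inverse-propensity-weighting form, which we assume here for illustration; the function name and the toy numbers are hypothetical.

```python
def ate_doubly_robust(y, w, y1_hat, y0_hat, e_hat):
    """Combine outcome-model and propensity-score estimates into one ATE estimate.

    y: observed salaries; w: genders (1 = man, 0 = woman);
    y1_hat, y0_hat: predicted salaries under W = 1 and W = 0;
    e_hat: estimated propensity scores, assumed strictly between 0 and 1.
    """
    total = 0.0
    for yi, wi, m1, m0, ei in zip(y, w, y1_hat, y0_hat, e_hat):
        total += (m1 - m0
                  + wi * (yi - m1) / ei               # correction on observed men
                  - (1 - wi) * (yi - m0) / (1 - ei))  # correction on observed women
    return total / len(y)

# Hypothetical values (in thousands) for four employees
y = [52.0, 40.0, 61.0, 45.0]
w = [1, 0, 1, 0]
y1_hat = [52.0, 44.0, 61.0, 48.0]
y0_hat = [49.0, 40.0, 57.0, 45.0]
e_hat = [0.5, 0.4, 0.6, 0.5]

ate = ate_doubly_robust(y, w, y1_hat, y0_hat, e_hat)
```

In this toy example the salary model reproduces the observed salaries exactly, so the correction terms vanish and the estimate reduces to the average predicted gap. When the salary model is off, the residuals reweighted by the propensity scores compensate for part of its error.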
A technical subtlety is that we use a cross-fitting procedure to estimate salaries and propensity scores [50]. Instead of fitting machine-learning models on all the data and then taking the models' outputs as estimates, we split the samples into K folds and obtain estimates for each fold using models fitted on the K − 1 remaining folds.
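Cross-fitting can be sketched as follows. The `fit` and `predict` callables are placeholders for any learner, and the interleaved fold assignment (sample i goes to fold i mod K) is a simplification chosen for illustration; in practice the samples would typically be shuffled first.

```python
from statistics import mean

def cross_fit(samples, fit, predict, n_folds=5):
    """Out-of-fold predictions: each sample is predicted by a model that never saw it."""
    n = len(samples)
    predictions = [None] * n
    for k in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == k]
        training = [samples[i] for i in range(n) if i % n_folds != k]
        model = fit(training)  # fitted on the K - 1 remaining folds
        for i in held_out:
            predictions[i] = predict(model, samples[i])
    return predictions

# Hypothetical learner: predicts the mean salary of its training samples
def fit_mean(training):
    return mean(salary for _, salary in training)

def predict_mean(model, sample):
    return model

# Hypothetical samples: (covariates, salary)
samples = [("a", 10), ("b", 20), ("c", 30), ("d", 40)]
out_of_fold = cross_fit(samples, fit_mean, predict_mean, n_folds=2)
```

Each prediction comes from a model that did not see the corresponding sample, which limits the overfitting bias that would arise if salaries and propensity scores were predicted on the training data itself.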