Cleaning Data With Selection Rules

In this paper, we propose and study a type of tuple-level constraint that arises from the selection operator $\sigma$ of relational algebra and that closely resembles tuple-level denial constraints. We call constraints of this type selection rules and study their concepts and properties in the setting of data consistency management. The main contribution of this paper is the study of rule implication with selection rules in order to solve the error localization problem by means of the set cover method. It turns out that rule implication can be applied more easily if the representation of selection rules is extended to allow gaps between attribute values. We show that the properties of selection rules make it possible to improve the performance of rule implication. An evaluation of our approach against HoloClean on four real-world datasets shows promising results. First, repair with selection rules is often faster and consumes less memory than HoloClean, especially when the amount of work that rule implication has to do is limited. Second, in terms of precision and recall of error detection and correction, repair strategies with selection rules almost always outperform HoloClean.


I. INTRODUCTION
Data quality has become a crucial part of data management over the past decades. Apart from tackling completeness, accuracy or currency problems, one of the main problems is to safeguard the consistency of data [1], [2], [3], [4], [5]. A popular way to handle this is the rule-based approach, where rules (or constraints) model what consistent data should look like. Examples of data quality rules are functional dependencies and relatives [6], [7], [8], [9], [10], inclusion dependencies [7], denial constraints [11] and edit rules [12], which all differ in terms of expressiveness. With this in mind, choosing the type(s) of rules to use poses a difficult, yet fundamental problem. On the one hand, more expressive types of rules are able to capture more types of errors. On the other hand, more expressive types of rules come with a higher complexity in terms of fundamental problems like discovery, implication and repair.
In this paper, we focus on tuple-level constraints in relational databases. Although more expressive types of rules exist, it has been stated that these types of rules can capture a large portion of inconsistencies in real-world datasets [13].
More specifically, we investigate a type of tuple-level constraint that arises from the selection operator σ of relational algebra [14]. We call these constraints selection rules or, for short, σ-rules, and show that they are a generalization of edit rules [12] and have the same expressive power as tuple-level denial constraints [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Asif Naeem.
A schematic overview of the contributions presented in this paper is given in Figure 1. As a first contribution, we study the concepts and properties of σ-rules and investigate how the concepts and properties of edit rules and tuple-level denial constraints relate to them. A second contribution is the proposal of an efficient algorithm for the implication of σ-rules based on these properties, in order to compose a so-called sufficient set. This algorithm is an extension of the implication algorithms for regular edit rules [15], [16], [17] based on the implication procedure proposed by Fellegi & Holt [12]. The purpose of a sufficient set is to find minimal-cost repairs of inconsistent tuples easily. More precisely, if the cost of making a change to an attribute in a relation is fixed, then minimal-cost repairs can be found by searching for minimal set covers of failing rules in a sufficient set. Because of this property, repairing dirty data in a guaranteed minimal way is much faster than with other procedures that rely on more expressive constraints, such as Llunatic [18] and HoloClean [19]. A remarkable result is that when domains are ordinal, the construction of a sufficient set is more difficult, because it is necessary to model gaps between attribute values to keep the number of implied rules manageable. The selection operator of relational algebra lacks the expressiveness to model such gaps easily. However, we propose a solution to this problem by extending the expressiveness of the selection operator σ to σ⁺, where gaps between data values can be enforced that indicate how large the difference between two attribute values can or cannot be. Interestingly, it does not matter whether we start from σ-rules or σ⁺-rules: the implication procedure for both can be described in terms of σ⁺-rules. As a third and final contribution, we assess the usefulness of selection rules by evaluating their capability to detect and correct erroneous data in real-world datasets.
The remainder of this paper is structured as follows. In Section II, we recall some basic concepts and notations related to the relational database model and relational algebra. In Section III, we formally define σ-rules and σ⁺-rules, and state their concepts and properties. After this, in Section IV, we exploit these properties in a σ⁺-rule implication algorithm that can be used to generate sufficient sets. In Section V, an overview is given of state-of-the-art literature related to the concepts and approaches studied in this paper, together with a positioning of our contributions within this literature. We evaluate our contributions in practical settings and discuss the results in Section VI. Finally, we state our conclusions and our view on future research in Section VII.

II. PRELIMINARIES
In this paper, we assume the relational model for databases [14]. For a countable set of attributes $\mathcal{A}$, we denote, for each $a \in \mathcal{A}$, the domain of $a$ by $A$. For each $a \in \mathcal{A}$, we assume that the domain $A$ is of one of the following types.
• Nominal: A is finite and features the equality relation =.
• Ordinal: A is equipped with an order ≤ and an order isomorphism exists from (A, ≤) to (ℤ, ≤).
• Continuous: A is equipped with an order ≤ and operator + and an order isomorphism exists from (A, ≤, +) to (ℝ, ≤, +).
When A is ordinal, it automatically comes with a successor function S. Applied to a value $v_i \in A$, the successor function returns an element $S(v_i) = v_j \in A$ such that $v_i < v_j$ and $\nexists v_k \in A : v_i < v_k < v_j$. In words, a successor function S returns the closest element greater than its argument that exists in the domain on which S is defined. If we apply a successor function S $n$ times to a value $v \in A$, i.e., $S^n(v) = S(S(\dots S(v) \dots))$ with $S$ applied $n$ times, the $n$-th closest element greater than $v$ in A is returned. If $n = 0$, the successor function is not applied to $v$ and $v$ itself is returned, i.e., $S^0(v) = v$. Moreover, if $n < 0$, the inverse function is applied, of which the outcome is the $|n|$-th closest element lower than $v$ in A.
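To make the successor function concrete, the following minimal Python sketch builds $S^n$ from an order isomorphism to the integers. The helper name `make_successor` and the example domain of even integers are assumptions chosen for illustration; they do not appear in the paper.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def make_successor(to_int: Callable[[T], int],
                   from_int: Callable[[int], T]) -> Callable[..., T]:
    """Build S^n for an ordinal domain from an order isomorphism to the integers.

    `to_int` and `from_int` are hypothetical helpers: they map a domain value
    to its integer image and back.
    """
    def successor(v: T, n: int = 1) -> T:
        # Moving n steps along the order is integer addition in the image;
        # n = 0 returns v itself and n < 0 walks down the order.
        return from_int(to_int(v) + n)
    return successor


# Hypothetical ordinal domain: the even integers, with isomorphism v -> v // 2.
succ_even = make_successor(lambda v: v // 2, lambda i: 2 * i)

print(succ_even(4))      # S(4)
print(succ_even(4, 0))   # S^0(4)
print(succ_even(4, -3))  # third closest element below 4
```

In this domain the closest element above 4 is 6, not 5, which is why the successor is defined through the isomorphism rather than by adding 1 to the raw value.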
A relation schema $\mathcal{R} = \{a_1, \dots, a_k\}$, or schema for short, is defined by a non-empty and finite subset of $\mathcal{A}$. A relation $R$ over $\mathcal{R}$ is defined by a finite set $R \subseteq A_1 \times \dots \times A_k$. Each element $t$ of a relation $R$ with schema $\mathcal{R}$ is called a tuple over $\mathcal{R}$. For any $R$ with schema $\mathcal{R}$ and $X \subseteq \mathcal{R}$, we denote the projection of $R$ over $X$ by $R[X]$. Note that, in this work, we always assume only one relation over one schema, but our contributions are not limited to a certain number of relations or schemata, because they can easily be applied when considering each of the relations/schemata separately.
A propositional formula $\varphi$ over a schema $\mathcal{R} = \{a_1, \dots, a_k\}$ is defined as a well-formed formula consisting of a set of propositional predicates and the Boolean operators ∧, ∨ and ¬. Each propositional predicate (or predicate for short) $P$ in $\varphi$ is either
• a constant predicate $a\ \theta\ v$, where $a \in \mathcal{R}$, $v \in A$ and $\theta \in \{\le, <, =, \ne, >, \ge\}$ is an operator on $A$, or
• a variable predicate $a_i\ \theta\ a_j$, where $a_i, a_j \in \mathcal{R}$ with $A_i = A_j$, and $\theta \in \{\le, <, =, \ne, >, \ge\}$ is an operator on both $A_i$ and $A_j$.
Evaluation of a predicate $P$ for a given tuple $t$ over $\mathcal{R}$ is then obtained by replacing the attributes in $P$ with the projection of $t$ over those attributes. In other words, for a tuple $t$, we test whether $t[a]\ \theta\ v$ (resp. $t[a_i]\ \theta\ t[a_j]$) is true or false. The generalized selection $\sigma_\varphi(R)$ with selection operator $\sigma$ and propositional formula $\varphi$ applied on relation $R$ is defined by
$$\sigma_\varphi(R) = \{t \mid t \in R \wedge \varphi(t)\}.$$

III. SELECTION RULES

A. BASIC DEFINITIONS AND PROPERTIES
Now that the necessary concepts and notations related to the relational database model and relational algebra have been recalled, we can formally introduce σ-rules.
Definition 1 (σ-Rule): A σ-rule over a schema $\mathcal{R}$ is defined (and denoted) by a propositional formula $\varphi$. A tuple $t$ satisfies (or is consistent against) the rule (denoted by $t \models \varphi$) if $\varphi(t)$ evaluates to false. Otherwise, the tuple fails (or is inconsistent against) the rule (denoted by $t \not\models \varphi$).
From Definition 1, it follows that each σ-rule describes a set of tuples that are not allowed in a consistent relation. If a σ-rule $\varphi$ contains at least one variable predicate, it is called a variable σ-rule. Otherwise, it is called a constant σ-rule. Like any propositional formula, $\varphi$ can always be (re)written in disjunctive normal form (DNF for short, i.e., a disjunction of conjunctions). If the DNF of $\varphi$ consists of more than one conjunctive clause (i.e., $\varphi = \varphi_1 \vee \dots \vee \varphi_n$), we can rewrite $\varphi$ as a set of σ-rules $\Phi = \{\varphi_1, \dots, \varphi_n\}$. A tuple $t$ then fails if any of the conjunctive clauses $\varphi_i \in \Phi$ evaluates to true. Furthermore, the set $\{\le, <, =, \ne, >, \ge\}$ includes for each operator $\theta$ the inverse operator $\neg\theta$, so that Boolean negations of predicates can always be rewritten (cfr. Table 1). By these two facts, we can define a normal form of σ-rules, in which $\varphi$ is restricted to a conjunction of predicates without negations. From now on, we assume that $\varphi$ always has this normalized form, unless stated otherwise. Also, for ease of notation, it is convenient to consider a third type of predicate, called a set predicate. A set predicate is of the form $a\ \theta\ V$, where $a \in \mathcal{R}$, $V \subseteq A$ and $\theta \in \{\in, \notin\}$. It can easily be verified that when a σ-rule contains a set predicate, it can always be rewritten as a set of rules that do not use set predicates. Set predicates are therefore only a tool for convenience of notation.
Example 1: An example relation that we will use regularly in the remainder of this work is the relation over the stock schema¹ given in Table 2. This relation consists of tuples representing the daily changes of stock prices. The schema consists of four attributes² with domain ℝ, which are previous close (PC, price of the stock on the latest daily transaction on the previous day), last trading price (LTP, price of the stock on the latest daily transaction on the current day), change in dollar (CID = LTP − PC) and change percentage (CP = (CID · 100) / PC). Although it is not possible to represent all constraints on this relation exactly by means of σ-rules (e.g., the formula to calculate CP), a set of 10 σ-rules (in normalized form) is given as an example in Table 3 to solve some consistency problems.³ Examples are that it is not permitted that PC (resp. LTP) is strictly negative ($\varphi_1$, resp. $\varphi_2$), that LTP is greater than or equal to PC while CP is strictly negative ($\varphi_3$), or that CP equals zero while CID does not ($\varphi_9$). $\varphi_1$, $\varphi_2$ and $\varphi_7$-$\varphi_{10}$ are constant σ-rules, whereas $\varphi_3$-$\varphi_6$ are variable σ-rules. In the data shown in Table 2, $t_1$ and $t_2$ are the only tuples that are consistent against the given set of rules, because $t_3$ fails $\varphi_3$ and $\varphi_7$, $t_4$ fails $\varphi_1$ and $\varphi_3$, and $t_5$ fails $\varphi_2$, $\varphi_6$ and $\varphi_9$. Note, however, that, although $t_2$ is consistent against the given set of rules, it does not follow the (linear) relationship to calculate CID and CP.
¹ https://lunadong.com/fusionDataSets.htm
² Note that attributes change_in_dollar and change_percentage contain derived data, which is, strictly speaking, not allowed in a well-designed relational database. However, in this paper, we do allow it in order to make it easier to explain certain concepts and techniques.
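As an illustration of Definition 1 on the stock schema, the following Python sketch encodes the four rules that Example 1 spells out in words ($\varphi_1$, $\varphi_2$, $\varphi_3$ and $\varphi_9$) and reports which of them a tuple fails. The predicate encoding and the two sample tuples are assumptions made for this sketch; the tuples are hypothetical and are not the rows of Table 2.

```python
import operator
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, float]
# A predicate is (attribute, operator, right-hand side). The right-hand side is
# a constant (constant predicate) or an attribute name (variable predicate).
Predicate = Tuple[str, Callable[[float, float], bool], Union[float, str]]

# Only the rules described in the text; each rule is a conjunction of predicates.
RULES: Dict[str, List[Predicate]] = {
    "phi1": [("PC", operator.lt, 0.0)],
    "phi2": [("LTP", operator.lt, 0.0)],
    "phi3": [("LTP", operator.ge, "PC"), ("CP", operator.lt, 0.0)],
    "phi9": [("CP", operator.eq, 0.0), ("CID", operator.ne, 0.0)],
}


def fails(row: Row, rule: List[Predicate]) -> bool:
    """A tuple fails a normalized rule when every predicate evaluates to true."""
    return all(
        op(row[attr], row[rhs] if isinstance(rhs, str) else rhs)
        for attr, op, rhs in rule
    )


def failing_rules(row: Row) -> List[str]:
    """Names of all rules in RULES that the tuple fails."""
    return [name for name, rule in RULES.items() if fails(row, rule)]


# Hypothetical tuples (not taken from Table 2).
clean = {"PC": 10.0, "LTP": 11.0, "CID": 1.0, "CP": 10.0}
dirty = {"PC": -5.0, "LTP": 6.0, "CID": 11.0, "CP": -2.0}

print(failing_rules(clean))  # []
print(failing_rules(dirty))  # ['phi1', 'phi3']
```

The second tuple fails the same pair of rules that the text reports for $t_4$: a negative previous close, and a price increase combined with a negative change percentage.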
In terms of expressiveness, σ-rules are a generalization of regular edit rules, introduced for categorical data in the framework of Fellegi and Holt [12] and extended to ordinal data in [17]. Indeed, each edit rule can be written as a set of (constant) σ-rules (and even as one σ-rule if we allow set predicates), but the advantage of σ-rules, in terms of expressiveness, comes from the fact that one (variable) σ-rule can stand for a set of infinitely many edit rules. Besides that, σ-rules are a less expressive subclass of linear edit rules [20], because they are represented by linear (in)equalities involving one or two attributes with coefficients equal to 1. Also, σ-rules are a special case of denial constraints (DCs) in which predicates are limited to a single tuple [4], [11], [21]. This makes σ-rules equivalent to tuple-level DCs and, therefore, the concepts and properties of tuple-level DCs also apply to σ-rules. As we will explain in the following, the main virtue of these observations is that σ-rules come with a procedure for attribute elimination based on Fourier-Motzkin elimination, which helps to reduce the complexity of solving the error localization problem.
With this in mind, we will redefine the concepts and properties of edit rules and denial constraints within the setting of σ-rules. We say that an attribute is involved in a σ-rule if it appears in at least one of its predicates. The set of attributes involved in a σ-rule $\varphi$ is denoted by $I(\varphi)$. If the cardinality $|I(\varphi)| = 1$, then the rule provides a constraint on the domain of that attribute and we call such rules domain constraints. We denote the set of predicates in $\varphi$ involving (resp. not involving) an attribute $a$ by $\varphi^{a}$ (resp. $\varphi^{\bar{a}}$). Also, the concepts of redundancy, tautologies and contradictions can be formally defined in the setting of σ-rules.

B. MINIMAL-COST REPAIRS
Now that σ-rules are formally defined and their related concepts and properties are stated, we can use them as a tool to detect and repair inconsistent tuples. In the remainder, the problem of finding correct repairs for inconsistent tuples is introduced, and we propose a solution to this problem for σ-rules, which is in line with the solution introduced by Fellegi & Holt for regular edit rules [12].
To do this, suppose that a set of σ-rules $\Phi$, defined over schema $\mathcal{R}$, and a tuple $t$ that is inconsistent against $\Phi$ are given. By definition, $t$ fails a non-empty subset of the rules in $\Phi$, which we refer to as the failing rules. Any set of attributes $S \subseteq \mathcal{R}$ for which different values can be assigned in $t$ to construct $t'$, such that $t'$ is consistent against $\Phi$, is called a solution. In this case, we call $t'$ a repair⁴ of $t$ based on solution $S$. If weights are assigned to each attribute (e.g., to represent the cost of changing the value for this attribute), we can search for a solution with a minimal cost. A specific case is when all attributes receive equal weights, which implies solutions that are minimal in terms of set cardinality. Repairs based on minimal solutions are called minimal-cost repairs. Straightforwardly, a (minimal) solution $S$ should contain at least one involved attribute for each failing rule. Indeed, rules for which no involved attribute is added to $S$ will remain failed after repair. If this condition is met, we say that $S$ is a (minimal) set cover of the failing rules, as it covers all of them. Therefore, all (minimal) solutions are also (minimal) set covers.
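The set cover condition can be sketched in a few lines of Python. The brute-force search below assumes equal attribute weights, so minimal cost means minimal cardinality; the function name and the input encoding (rule name mapped to its set of involved attributes) are assumptions of this sketch. It returns candidate covers only: as discussed in this section, a cover is guaranteed to be a solution only when the rule set has been extended with implied rules.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def minimal_set_covers(failing: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """All minimum-cardinality attribute sets that hit every failing rule.

    `failing` maps each failing rule to its set of involved attributes I(phi).
    Exhaustive search, intended for small schemas only.
    """
    attributes = sorted(set().union(*failing.values()))
    for size in range(len(attributes) + 1):
        covers = [
            frozenset(candidate)
            for candidate in combinations(attributes, size)
            if all(set(candidate) & involved for involved in failing.values())
        ]
        if covers:  # the first size that admits a cover is the minimum
            return covers
    return []


# A tuple that fails phi1 (involving PC) and phi3 (involving LTP, PC and CP).
print(minimal_set_covers({"phi1": {"PC"}, "phi3": {"LTP", "PC", "CP"}}))
```

For the failing rules of the example call, changing PC alone covers both rules, so the only minimal cover is {PC}.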

, an empty solution) exists (because t 3 is initially inconsistent).
Although the set cover strategy for finding minimal-cost repairs is quite simple, there is initially no guarantee that all (minimal) set covers are also (minimal) solutions, so we cannot yet rely on the set cover method to find all correct (minimal) solutions. The method nevertheless comes with a great benefit. Indeed, as we know for regular edit rules, it is guaranteed that all (minimal) set covers are also (minimal) solutions if we extend the initial set of rules with implied rules [12]. Fortunately, because the properties of edit rules still hold in the setting of σ-rules, the same approach can be applied.
Definition 5 (Implied σ-Rules): Given a set of σ -rules is called an implied σ-rule, if $\varphi^*$ is not a tautology. Semantically, this means that, We will call $\Phi_c$ a contributing set of $\varphi^*$.
Definition 5 provides a method based on propositional logic for validating whether a given σ-rule is implied or not. A procedure for generating implied rules from a given contributing set will be studied extensively in Section IV. Note that each rule $\varphi_r$ that is redundant to at least one $\varphi \in \Phi_c$ (with all $\varphi \in \Phi_c$ as special cases) is also an implied rule of $\Phi_c$. With this made clear, it is straightforward to see from Definition 5 that implied rules do not forbid combinations of values that are not already forbidden by $\Phi_c$, but they are indispensable in the sense that they guarantee correct error localization by means of the set cover method. Indeed, Fellegi & Holt initially proved that one needs to generate all implied rules in order to apply the set cover method properly [12], but generating all implied rules is a task of exponential complexity, because each combination of rules can potentially lead to the generation of many implied rules. Fortunately, the number of implied rules that must be generated can be limited, as implied rules only need to be adopted when they are (1) new and (2) non-redundant (NNR or necessary rules for short) [12], [15], [16], [22].
Definition 6 (New σ-Rules): An implied σ-rule $\varphi^*$ defined over $\mathcal{R} = \{a_1, \dots, a_k\}$, with contributing set $\Phi_c = \{\varphi_1, \dots, \varphi_n\}$, is new if an attribute $a_g \in \mathcal{R}$ exists that is involved in each rule $\varphi \in \Phi_c$, but not in $\varphi^*$. We will call attribute $a_g$ the generator of $\varphi^*$.
Definition 6 states that generating new rules comes down to a procedure of attribute (i.e., generator) elimination. The state-of-the-art algorithm for generating all necessary rules is the Field Code Forest (FCF) algorithm [15], [16]. This algorithm repeats two steps during execution, which are (1) the selection of a generator $a_g$ and a corresponding set of rules $\Phi_{a_g}$ involving $a_g$ (called candidate contributors), and (2) the generation of all necessary rules with generator $a_g$ and contributing sets $\Phi_c \subseteq \Phi_{a_g}$. Eventually, the algorithm returns a set of rules that is sufficient to properly solve the error localization problem by means of the set cover method. Taking these notes into account, we will propose and study an algorithm to apply as the second step of FCF, specifically for σ-rules, in Section IV.
VOLUME 10, 2022

C. EXTENSION TOWARDS σ + -RULES
To end this section, we will extend the expressiveness of the selection operator σ, used by σ-rules (cfr. Definition 1), to σ⁺. This extension should make it possible to enforce gaps between attribute values, modelling the smallest (e.g., $a_1 > a_2 + c$) or largest (e.g., $a_1 < a_2 + c$) prohibited difference $c$ between the values of the two involved attributes $a_1$ and $a_2$. Regular edit rules, σ-rules and denial constraints lack the expressiveness to model such gaps easily. Take, for example, a constraint on the stock schema that only permits an increase of at least 1 from PC to LTP. Until now, at least one rule per value of ℝ, and thus infinitely many rules in total, was needed to represent this (e.g., for value 0 of PC, a potential σ-rule that models this is PC = 0 ∧ LTP < 1). In the remainder, we will show that this constraint can be represented by means of only one σ⁺-rule.
With σ⁺-rules defined, we can make two interesting remarks. First, it is clear that the definition of σ⁺-rules exactly matches the definition of σ-rules when domains are nominal. Second, each σ⁺-rule can be written as a set of (sometimes infinitely many) σ-rules and each σ-rule can be converted to one σ⁺-rule. Indeed, a σ-rule is a special kind of σ⁺-rule in which each variable predicate is, depending on the domain type of $A_i$ and $A_j$, of the form $a_i\ \theta\ a_j$, $a_i\ \theta\ S^0(a_j)$ or $a_i\ \theta\ a_j + 0$. A direct implication of this is that the concepts and properties of σ-rules can easily be transferred to the setting of σ⁺-rules.
Example 5: All σ-rules given in Table 3 are also σ⁺-rules (with $c = 0$ in the variable predicates). Moreover, we can now use a single σ⁺-rule to represent the example constraint stated at the beginning of this section, which only permits an increase of at least 1 from attribute PC to attribute LTP: PC > LTP − 1.
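The following Python sketch illustrates how the single σ⁺-rule PC > LTP − 1 stands in for the infinite family of σ-rules PC = v ∧ LTP < v + 1. Both function names are hypothetical, and the comparison is carried out only on a small grid of sample values, so it is a sanity check rather than a proof.

```python
def fails_sigma_plus(pc: float, ltp: float) -> bool:
    """The tuple fails the single sigma+-rule PC > LTP - 1."""
    return pc > ltp - 1


def fails_sigma_family(pc: float, ltp: float) -> bool:
    """The tuple fails the member of the sigma-rule family with v = PC,
    i.e., the rule PC = v AND LTP < v + 1."""
    v = pc
    return pc == v and ltp < v + 1


# Sample values are multiples of 0.5, so the float arithmetic is exact.
grid = [x / 2 for x in range(-10, 11)]
agree = all(
    fails_sigma_plus(pc, ltp) == fails_sigma_family(pc, ltp)
    for pc in grid
    for ltp in grid
)

print(agree)                         # both formulations flag the same tuples
print(fails_sigma_plus(10.0, 11.0))  # increase of exactly 1: permitted
print(fails_sigma_plus(10.0, 10.5))  # increase below 1: forbidden
```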

IV. σ + -RULE IMPLICATION
Now that we have introduced two types of selection rules (σ-rules and σ⁺-rules), we can shift our attention to rule implication for these types of rules. Because σ-rules are a special subclass of σ⁺-rules and, therefore, the properties of σ⁺-rules also hold for σ-rules (cfr. III-C), we will mainly focus on the implication of σ⁺-rules.

A. THE IMPLICATION FUNCTION
Before going into detail about rule implication, we will first define an implication function that is fundamental in the process of attribute elimination. This function should make it possible to find all constraints implied by $\varphi$ on all attributes (except one) involved in $\varphi$.
Definition 8 (Implication Function): Given a propositional formula $\varphi$ defined over $\mathcal{R} = \{a_1, \dots, a_k\}$ and an attribute $a_g \in \mathcal{R}$, the implication function returns a propositional formula, written as ϕ
With this implication function set, an important question that remains to be answered is how to construct $\varphi^*$ as the result of $\mathrm{Imp}(\varphi, a_g)$ correctly and completely. To answer this question, we can point out some observations. First, we can rely on the transitivity property of the operators $\{\le, <, =, \ne, >, \ge\}$ for finding all possible constraints implied by $\varphi$. Indeed, exploiting this property for all pairs of predicates that are connected by a conjunction in $\varphi$ yields a set of predicates representing these (potentially hidden) constraints. As an example, the conjunction of predicates $\{a_i \le a_g, a_g \le a_j\}$, with $A_i = A_j = \mathbb{R}$, reveals a constraint $a_i \le a_j$ on both attributes $a_i$ and $a_j$, in which $a_g$ is eliminated, because $a_i \le a_g \wedge a_g \le a_j \Rightarrow a_i \le a_j$. Second, exploiting the transitivity property on a pair of predicates may imply a propositional formula consisting of the conjunction of more than one (and potentially infinitely many) predicate(s). However, this formula automatically reduces, such that only the most restrictive implied predicate remains. Following the previous example, $a_i \le a_g \wedge a_g \le a_j \Rightarrow \bigwedge_{c \ge 0} a_i \le a_j + c$, but because $\bigwedge_{c \ge 0} a_i \le a_j + c \equiv a_i \le a_j$, the only (and most restrictive) implied predicate that follows from this is $a_i \le a_j$. With this made clear, note that the maximal number of (hidden) predicates not involving $a_g$ that are generated by the implication function, with $\varphi = \varphi_1 \vee \dots \vee \varphi_n$ in DNF, equals $\binom{k_i}{2}$ for each conjunctive clause $\varphi_i$ of $\varphi$ with $k_i$ predicates, which grows quadratically with increasing $k_i$.
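The transitivity step can be sketched for continuous domains as follows. The encoding of a predicate on the generator as a triple `(op, attr, c)`, meaning `a_g op attr + c`, and the function name `eliminate` are assumptions of this sketch. It handles inequalities only and does not cover ordinal domains, where an additional gap has to be taken into account.

```python
from typing import List, Tuple

# (op, attr, c) encodes the predicate: a_g op attr + c
Bound = Tuple[str, str, float]
# (x, op, y, c) encodes the implied predicate: x op y + c
Implied = Tuple[str, str, str, float]


def eliminate(bounds: List[Bound]) -> List[Implied]:
    """Pair every lower bound on a_g with every upper bound and drop a_g.

    From x + c1 (<|<=) a_g and a_g (<|<=) y + c2 it follows that
    x (<|<=) y + (c2 - c1); the result is strict if either input is strict.
    Two bounds in the same direction imply nothing about the other attributes.
    """
    lowers = [b for b in bounds if b[0] in (">", ">=")]
    uppers = [b for b in bounds if b[0] in ("<", "<=")]
    implied: List[Implied] = []
    for lo_op, x, c1 in lowers:
        for up_op, y, c2 in uppers:
            strict = lo_op == ">" or up_op == "<"
            implied.append((x, "<" if strict else "<=", y, c2 - c1))
    return implied


# a_i <= a_g and a_g <= a_j imply a_i <= a_j (the example from the text).
print(eliminate([(">=", "a_i", 0), ("<=", "a_j", 0)]))
# a_g > a_i + 2 and a_g < a_j + 5 imply a_i < a_j + 3.
print(eliminate([(">", "a_i", 2), ("<", "a_j", 5)]))
```

Keeping only lower-upper pairs mirrors the remark in the text that a pair of predicates bounding the generator from the same side reveals no further constraint.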
In Tables 4 and 5, we provide an overview of the most restrictive predicates, not involving generator $a_g$, that are constructed after exploiting the transitivity property on each possible pair of predicates connected by a conjunction (depending on the operators), both involving $a_g$. Note that Table 4 covers ordinal domains and Table 5 covers continuous domains and (in a limited version) nominal domains. Each cell in these tables contains two predicates, depending on whether we start from a pair of variable predicates (e.g., $a_g < a_i + c_1 \wedge a_g = a_j + c_2 \Rightarrow a_i > a_j + c_2 - c_1$) or from a pair consisting of one variable and one constant predicate (e.g., $a_g < a_i + c_1$ represents a tautology, following from the fact that exploiting the transitivity property on the corresponding pairs of predicates will not reveal any more restrictive constraints on the involved attributes other than $a_g$. For example, $a_g < a_i + c_1 \wedge a_g < v$ only implies that $a_i$ can take any value in its domain. An interesting observation is that exploiting the transitivity property on the pairs $\{a_g < S^m(a_i), a_g > S^n(a_j)\}$ and $\{a_g > S^m(a_i), a_g < S^n(a_j)\}$ results in predicates involving $a_i$ and $a_j$ with a 'self-expanding' gap (indicated by resp. +1 and −1 in the resulting predicates in Table 4). Indeed, if we consider these pairs, the constraint on both $a_i$ and $a_j$ should account for the fact that the gap between these attributes should be large enough such that $a_g$ can take values meeting the original constraints in the pairs. For predicates involving attributes with continuous domains, this is not a problem, as there always exists a value for $a_g$ meeting these original constraints. Note that for some σ-rules involving attributes with ordinal domains, the result potentially contains an infinite number of σ-rules, because these gaps cannot be represented. This reveals an additional and important benefit of introducing σ⁺-rules for rule implication. Finally, note that two trivial cases are not listed in the tables. If a pair contains a predicate that does not involve $a_g$, this predicate is automatically implied (e.g., $a_g < a_i \wedge a_j > 0 \Rightarrow a_j > 0$). If a pair consists of two predicates involving no other attribute than $a_g$, applying the transitivity property results in either a tautology (e.g., $a_g \ge 0 \wedge a_g \le 1 \Rightarrow \top$) or a contradiction (e.g., $a_g \le 0 \wedge a_g \ge 1 \Rightarrow \bot$).
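The 'self-expanding' gap for ordinal domains can be checked by brute force. The sketch below identifies the ordinal domain with the integers, so that $S^n(a) = a + n$, and compares the existence of a generator value with a candidate implied predicate $a_i > S^{n-m+1}(a_j)$. This predicate is derived here from the definitions and is not copied from Table 4; all function names are hypothetical.

```python
def generator_exists(ai: int, aj: int, m: int, n: int) -> bool:
    """Is there an integer a_g with a_g < S^m(a_i) and a_g > S^n(a_j)?"""
    lo, hi = aj + n, ai + m  # we need lo < a_g < hi
    return any(lo < g < hi for g in range(lo - 1, hi + 2))


def implied_with_gap(ai: int, aj: int, m: int, n: int) -> bool:
    """Candidate implied predicate a_i > S^(n-m+1)(a_j), with the extra +1."""
    return ai > aj + (n - m + 1)


def implied_without_gap(ai: int, aj: int, m: int, n: int) -> bool:
    """Continuous-style predicate a_i > a_j + n - m, which ignores the gap."""
    return ai > aj + (n - m)


values = range(-4, 5)
shifts = range(-2, 3)
gap_is_exact = all(
    generator_exists(ai, aj, m, n) == implied_with_gap(ai, aj, m, n)
    for ai in values for aj in values for m in shifts for n in shifts
)

print(gap_is_exact)
# With a_i = 1, a_j = 0 and m = n = 0, no integer lies strictly between 0 and 1,
# although the continuous-style predicate holds.
print(generator_exists(1, 0, 0, 0), implied_without_gap(1, 0, 0, 0))
```

On this sample, the predicate with the extra step agrees with the existence of a generator value in every case, whereas the continuous-style predicate accepts pairs for which no integer generator value exists.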

B. GENERATING NECESSARY σ + -RULES
As explained in III-B, generating necessary rules is crucial in order to properly execute the FCF algorithm for generating a sufficient set. Because we have generalized edit rules to σ⁺-rules, we need an extension of the implication procedure proposed in [12] that works correctly and completely for (contributing) sets of σ⁺-rules. In order to generate all new rules from a given set of σ⁺-rules by means of generator $a_g$, we can rely on the following theorem.
Proof: See Appendix.
Example 6: Consider again the σ-rules (or σ⁺-rules) defined on the stock schema (cfr. Remark that passing any superset of $\{\varphi_1, \varphi_4\}$ as first argument to the implication function will also result in a propositional formula $\varphi^*$ in which one of the conjunctive clauses $\varphi_i^*$ equals LTP ≤ 0 ∧ CP > 0.
Now, we point out three important remarks following from the construction procedure given in Theorem 1. First, if one wants to generate all implied rules (including rules that are not new) from a set $\Phi$, one can apply Theorem 1 and keep all resulting predicates (so not only the ones that do not involve $a_g$). We will denote this by $\mathrm{Imp}(\varphi, -)$ in the following. Second, repeatedly applying Theorem 1, together with the observations listed above, ensures that all necessary σ⁺-rules will be generated, as stated in the beginning of this section. The reason for this is that the proposed implication procedure is a special case of Fourier-Motzkin elimination to eliminate attributes in linear edit rules [23], [24], [25]. Fellegi & Holt proved that repeatedly applying Fourier-Motzkin elimination on pairs of linear edit rules from a given set eventually results in a complete (and thus, sufficient) set [12]. In order to avoid a repeated application of this method, the proposed optimizations to generate sufficient sets (e.g., the FCF algorithm) can also be used [15], [16], [20], [22]. Third, note that the result of the proposed procedure might contain (many) new rules that are redundant to other rules generated during the execution of the FCF algorithm. To resolve this, it is possible to test each generated rule for being redundant to all other rules on each completion of the implication procedure and treat redundant rules properly as described in [16].
To end this section, we will state the following corollary of Theorem 1, which gives the upper bound on the number of new σ⁺-rules that can be generated when applying the proposed procedure.
Corollary 1: Consider a set of σ⁺-rules with $|\varphi_i|$ the number of predicates in $\varphi_i$.
Proof: See Appendix.
With this corollary stated, it is clear that it is sufficient to apply Theorem 1 in order to generate all necessary σ⁺-rules with a given generator $a_g$ (i.e., to adopt as the second step of the FCF algorithm). However, the number of generated new rules can rapidly become unmanageable when the number of (predicates in the) candidate contributors increases. For two or three candidate contributors, this is not directly a problem, but when considering, for example, the five rules defined on the stock schema (cfr. Table 3) that involve PC, with PC as generator, the upper bound on the number of new rules with generator PC equals $10^8$. The main reason for this is that the procedure does not exploit any strategy to avoid generating rules that are new, but not necessary (i.e., redundant). In the following section, we will investigate techniques to overcome this problem.

C. A σ + -RULE IMPLICATION ALGORITHM
To overcome the potential performance problem of using the implication function when searching for all necessary rules, we will investigate some useful properties of σ⁺-rule implication (cfr. IV-C1) and exploit these properties in an optimized σ⁺-rule implication algorithm (cfr. IV-C2).

1) PROPERTIES
σ⁺-rules come with some useful properties that can be exploited during the implication algorithm.

Proposition 1: Consider
• two σ⁺-rules $\varphi_r$ and $\varphi_d$, defined over $\mathcal{R}$, and
• two contributing sets $\Phi_r = \{\varphi_r, \varphi\}$ and $\Phi_d = \{\varphi_d, \varphi\}$, with $\varphi$ any σ⁺-rule defined over $\mathcal{R}$, which lead to the generation of resp. $\varphi_r^*$ and $\varphi_d^*$, i.e., $\varphi_r^* \equiv \neg\mathrm{Imp}(\neg\varphi_r \wedge \neg\varphi, -)$, and
Proof: See Appendix.
Proposition 1 states that a contributing set $\Phi_r$ that contains a redundant rule $\varphi_r$ will always lead to the generation of rules $\varphi_r^*$ that are redundant. This implies that we should never test combinations of rules containing redundant rules during the implication algorithm, which reduces the number of combinations to test. Moreover, due to this fact and the fact that redundant rules are not necessary, we can always ignore rules that are redundant to one of the candidate contributors and that are captured in the result of the implication function. Note that tautologies are special cases in this regard, because they are, by definition, redundant to all σ⁺-rules.
Example 7: Consider again the σ-rules (or σ⁺-rules) defined on the stock schema (cfr.
Proposition 2: Consider two σ⁺-rules $\varphi_1$ and $\varphi_2$, defined over $\mathcal{R} = \{a_1, \dots, a_k\}$. If $\varphi_1^{a_g} \Rightarrow \varphi_2^{a_g}$ (with $a_g \in \mathcal{R}$), then rule with a generator $a_g' \in \mathcal{R}$ other than $a_g$.
Proof: See Appendix.
Proposition 2 states that we should only test combinations of rules in which each σ⁺-rule contributes to the elimination of $a_g$, again reducing the number of combinations to test. Concretely, this means that for any two rules $\varphi_1$ and $\varphi_2$ in a combination, $\varphi_1^{a_g} \nRightarrow \varphi_2^{a_g}$ and $\varphi_2^{a_g} \nRightarrow \varphi_1^{a_g}$. Otherwise, the combination will either lead to the construction of redundant rules or of new rules with a generator $a_g'$ other than $a_g$, in which we are not interested. A first special case in this regard is when $\varphi_1 \Rightarrow \varphi_2$ (or $\varphi_2 \Rightarrow \varphi_1$), which automatically implies that $\varphi_1^{a_g} \Rightarrow \varphi_2^{a_g}$ (or $\varphi_2^{a_g} \Rightarrow \varphi_1^{a_g}$). This means that combinations should not contain pairs of rules in which one rule is redundant to the other. A second special case is when $a_g$ is not involved in $\varphi_1$ (or $\varphi_2$), because other rules will never contribute to the elimination of $a_g$ in combination with such rules. First, this means that only combinations of rules in which the generator is involved (i.e., candidate contributors) should be tested, which confirms the necessary condition in the definition of new rules (cfr. Definition 6). Second, once a combination of rules leads to the generation of a new rule $\varphi^*$ with generator $a_g$, we should never add $\varphi^*$ to a combination that is tested for the generation of new rules with generator $a_g$.

2) ALGORITHM
Below, we propose a $\sigma^+$-rule implication algorithm that exploits the properties stated above in order to reduce the number of combinations to test. It serves as an alternative to the general implication procedure (cfr. IV-B) in case this procedure may result in a large number of implied $\sigma^+$-rules. As mentioned earlier, the algorithm is a generalization of the implication algorithm for regular edit rules proposed in [17]. Specifically, the algorithm keeps a list of implied $\sigma^+$-rules that can still lead to the generation of a necessary rule when combined, in an optimized way (i.e., by exploiting Proposition 1 and Proposition 2), with one or more other $\sigma^+$-rules. If it is certain that a rule cannot lead to the construction of a necessary rule in any case, the rule is ignored. The pseudocode of the $\sigma^+$-rule implication algorithm is given in Algorithm 1.
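As first input parameter, the algorithm expects a generator $a_g$ for which necessary rules will be generated. As second input parameter, the algorithm expects any set $\Phi_{a_g}$ consisting of candidate contributors for $a_g$. Combinations of rules in $\Phi_{a_g}$ will be tested for contributing to the generation of necessary rules with generator $a_g$. From Proposition 2, we know that no other rules (i.e., rules in which $a_g$ is not involved) should be considered as they will not account for this matter.

Algorithm 1 $\sigma^+$-Rule Implication Algorithm
 1: function getNecessaryRules($a_g$, $\Phi_{a_g}$)
 2:   if the rules in $\Phi_{a_g}$ cannot jointly eliminate $a_g$ then
 3:     return $\emptyset$
 4:   $Q \leftarrow$ queue containing all rules of $\Phi_{a_g}$
 5:   $\Phi_c \leftarrow$ map of each rule of $\Phi_{a_g}$ to the singleton containing itself
 6:   $\Phi^* \leftarrow \emptyset$
 7:   while $Q \neq \emptyset$ do
 8:     $\varphi_q \leftarrow Q$.dequeue()
 9:     $\Phi_q \leftarrow$ contributors of $\varphi_q$ as stored in $\Phi_c$
10:     for all $\varphi_k \in \Phi_{a_g} \setminus \Phi_q$ do
11:       if $\varphi_k^{a_g} \Rightarrow \varphi_q^{a_g} \vee \varphi_q^{a_g} \Rightarrow \varphi_k^{a_g}$ then
12:         continue
13:       $\Phi^*_q \leftarrow \neg\mathrm{Imp}(\neg\varphi_q \wedge \neg\varphi_k, -)$
14:       for all $\varphi^*_q \in \Phi^*_q$ do
15:         if $\exists \varphi_i \in \Phi_c : (\varphi^*_q \Rightarrow \varphi_i)$ then
16:           continue
17:         add $\varphi^*_q$ together with its contributors to $\Phi_c$
18:         if $a_g \notin I(\varphi^*_q)$ then
19:           $\Phi^* \leftarrow \Phi^* \cup \{\varphi^*_q\}$
20:           continue
21:         $Q$.enqueue($\varphi^*_q$)
22:   return $\Phi^*$

In the first step of the algorithm, it is verified whether the set of selected candidate contributors can eventually lead to the elimination of $a_g$ (line 2). If this is not the case, an empty set is returned (line 3). Then, before searching for necessary rules, three variables are initialized (lines 4-6), which are used throughout the execution and which are listed below.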
• $Q$: Queue used to keep track of all implied rules that are not new and not redundant (i.e., that can still lead to the generation of a necessary rule). Initially, this queue contains all rules of $\Phi_{a_g}$.
• $\Phi_c$: Map of implied rules of which each is mapped to its contributors. Initially, this map contains all rules of $\Phi_{a_g}$ that are mapped to singletons containing themselves.
• $\Phi^*$: Set containing all necessary rules with generator $a_g$ and any contributing set $\subseteq \Phi_{a_g}$. This set will be returned upon completion (line 22).

After the initialization phase, the algorithm continues to test combinations of rules for generating necessary rules with generator $a_g$, until $Q$ is empty (lines 7-21). In each loop, a certain rule $\varphi_q$ is dequeued from $Q$ (line 8), and its contributors are stored in $\Phi_q$ (line 9). Then, any rule $\varphi_k$ that did not already contribute to the construction of $\varphi_q$ (line 10) and contributes to the elimination of $a_g$ with $\varphi_q$ (lines 11-12) is tested for generating implied rules $\Phi^*_q$ in combination with $\varphi_q$ (line 13). Each implied rule $\varphi^*_q$ can either be (1) redundant to any of the previously generated (or given) rules, in which case it is ignored (lines 15-16), (2) not redundant, but new with generator $a_g$, in which case it is added to $\Phi^*$ (lines 18-20), or (3) not redundant and not new, in which case it should be tested further and is, therefore, added to $Q$ (line 21). Note that in the last two cases, the algorithm adds $\varphi^*_q$ together with its contributors to $\Phi_c$ (line 17).
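The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It does not operate on $\sigma^+$-rules: as a stand-in, it uses categorical edit rules in the style of Fellegi & Holt, where a rule maps each involved attribute to a set of values and a tuple fails the rule if all its values lie in those sets. The names `combine`, `is_redundant` and `necessary_rules` are hypothetical, and the pairwise `combine` step merely plays the role of the implication function for this simplified rule type.

```python
from collections import deque

def combine(r1, r2, gen, domains):
    """Combine two categorical edit rules on generator `gen` (stand-in for the implication function)."""
    out = {}
    for attr, dom in domains.items():
        s1, s2 = r1.get(attr, dom), r2.get(attr, dom)
        s = (s1 | s2) if attr == gen else (s1 & s2)  # union on generator, intersection elsewhere
        if not s:
            return None  # empty component: no rule is implied
        if s != dom:
            out[attr] = s  # attributes covering their whole domain are not involved
    return out

def is_redundant(rule, other, domains):
    """`rule` is redundant to `other` if every tuple failing `rule` also fails `other`."""
    return all(rule.get(a, d) <= other.get(a, d) for a, d in domains.items())

def necessary_rules(gen, candidates, domains):
    # Line 2-3 analogue: the candidates must jointly cover the generator's domain.
    if frozenset().union(*(r[gen] for r in candidates)) != domains[gen]:
        return []
    rules = list(candidates)                       # all known rules (keys of the map)
    contributors = [frozenset([i]) for i in range(len(candidates))]
    queue = deque(range(len(candidates)))          # indices into `rules`
    necessary = []
    while queue:
        q = queue.popleft()
        for k in range(len(candidates)):
            if k in contributors[q]:
                continue                           # already contributed to rule q
            gq, gk = rules[q][gen], candidates[k][gen]
            if gk <= gq or gq <= gk:
                continue                           # no joint contribution to eliminating gen
            new = combine(rules[q], candidates[k], gen, domains)
            if new is None or any(is_redundant(new, r, domains) for r in rules):
                continue                           # nothing implied, or redundant
            rules.append(new)
            contributors.append(contributors[q] | {k})
            if gen not in new:
                necessary.append(new)              # generator eliminated: new rule
            else:
                queue.append(len(rules) - 1)       # test further

    return necessary

domains = {"a": frozenset({1, 2, 3}), "b": frozenset({"x", "y"})}
candidates = [
    {"a": frozenset({1}), "b": frozenset({"x"})},
    {"a": frozenset({2}), "b": frozenset({"x"})},
    {"a": frozenset({3})},
]
print(necessary_rules("a", candidates, domains))
```

In this toy setting, intermediate rules that still involve the generator are queued and extended, while combinations that only reproduce a known rule are discarded by the redundancy check, which is the pruning behaviour the algorithm relies on.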
To end this section, we focus shortly on the performance of the $\sigma^+$-rule implication algorithm, which highly depends on the number of implied rules that are added to the queue. To bound this number, we can count, in a worst-case scenario, the number of rules in $\Phi^*_q$ each time $\neg\mathrm{Imp}(\neg\varphi_q \wedge \neg\varphi_k, -)$ is executed, assuming that all rules in $\Phi^*_q$ can be added to $Q$. This count can be determined in a similar way as done in the proof of Corollary 1. However, one should take into account that, each time, only two $\sigma^+$-rules ($\varphi_q$ and $\varphi_k$) are used in the implication function at the cost of generating all implied rules (and not only new rules). In the end, we can state that each $\Phi^*_q$ contains at most $\sigma^+$-rules. Now, the theoretical, worst-case complexity of executing the entire $\sigma^+$-rule implication algorithm, assuming that all rules are added to $Q$, is often worse than the theoretical, worst-case complexity of Theorem 1. However, the strength of the algorithm is that a very large share of the generated implied rules that are captured in $\Phi^*_q$ are either new or redundant, and are not added to the queue for further extension. Indeed, if we consider Example 9 again, we can see that, when only taking into account the number of rules generated by means of two given rules, 1044 implied rules are generated in total, of which 2 are necessary (i.e., $\varphi^*_1$ and $\varphi^*_2$), and 2 are not redundant and not new, and therefore added to the queue (i.e., $\varphi^*_3$ and $\varphi^*_4$). Further, 342 implied rules are generated by extending $\varphi^*_3$ or $\varphi^*_4$, of which only 1 is necessary (i.e., $\varphi^*_5$) and all others are redundant. After this, the algorithm terminates, such that only $1044 + 342 = 1386$ implied rules are tested instead of $10^8$ when using Theorem 1. Later, in VI-B, we will analyze the effects of Algorithm 1 compared to Theorem 1 by evaluating it in a more practical setting and give a recommendation on when to use which procedure.

V. DISCUSSION AND RELATED WORK
One of the most important aspects in data quality handling research is the study of methods to resolve data inconsistencies by means of data quality rules in a (semi-)automated way [4], [22], [26], [27]. The problem of resolving data inconsistencies can be tackled as a process with two steps: error detection/localization (i.e., determining the attribute values in error) and error correction (i.e., estimating correct values for the attribute values in error). Because our contributions mainly focus on error localization, we give an overview of state-of-the-art contributions in this regard and position our work among these contributions in the remainder of this section. Over the last decades, many error localization techniques have been proposed (cfr. Table 6). Examples include vertex generation methods [28], [29], branch-and-bound methods [28], [30], [31] and holistic [19], [32] and probabilistic methods [33], [34], [35], [36]. However, one of the most elegant and easy-to-understand methods to solve the error localization problem is the set cover method, which was proposed by Fellegi & Holt for regular edit rules [12]. As confirmed by the lifting property, this method only works on sufficient sets of rules [12], [15], [16], which can be constructed by means of rule implication, exploiting attribute elimination. For many types of data quality rules that are typically more expressive than edit rules (e.g., (conditional) functional dependencies [6], [7], [8], [9] and denial constraints [11]), the lifting property does not hold [37]. In such cases, it is not possible to solve the error localization problem by means of the set cover method. Because of this, other error detection and repair frameworks, such as Llunatic (based on the Chase algorithm) [18] and HoloClean [19], have been proposed to repair data failing more expressive data quality rules.
However, these frameworks may have scalability problems when used in combination with a mix of different types of rules, with a large number of rules or when a large amount of data to repair is passed as input. As a solution, they rely on heuristics or estimates, in order to complete successfully. In this regard, data quality rules that (i) have expressiveness in between edit rules and denial constraints and (ii) come with an efficient rule implication and set cover repair method, provide an appealing trade-off between expressiveness and efficiency. This resulted in the introduction of σ -rules, which are based on the selection operator σ of relational algebra and to which the concepts and properties of edit rules and edit rule implication can easily be transferred.
A similar trade-off in complexity has also been observed for linear edit rules. In [22], it was stated that there are problems with edit rule implication featuring integer-valued data. Indeed, the properties, methods and algorithms for rule implication with nominal data [12], [15], [16] and continuous data [20] are quite straightforward. However, for integer data (or mixed data), linear edit rules encounter the same problems with gaps as stated in III-C and, thereby, it is not guaranteed that an integer solution always exists. Therefore, one has to rely on a complex post-processing step in terms of shadow regions during Fourier-Motzkin elimination in which this is verified [22], [30], [31]. In our work, we overcome this problem by generalizing $\sigma$-rules to $\sigma^+$-rules (being a simplification of linear edit rules), which have the option to easily represent gaps between attribute values and for which, as a direct consequence, no verification step is needed because an integer solution always exists (if the rules in the set do not contradict each other). In other words, rule implication as the basis of solving the error localization problem by means of the set cover method for $\sigma^+$-rules is guaranteed to work correctly and completely.
Finally, with regard to the efficiency of rule implication, many optimizations for edit rules were studied and proposed in previous works. These optimizations are related to the properties of edit rules that can help in applying rule implication more efficiently. The main reason for this is that the initial implication procedure proposed by Fellegi & Holt can become very labour intensive when many edit rules are passed as input. Contributions to this are due to Liepins [38], [39], [40], Garfinkel et al. [15], [20], Winkler [41], [42], Boskovitz [16], Chen [43], [44] and Boeckling [17]. With this in mind, our final contribution in this paper is the study of these optimizations adapted to the setting of σ + -rules and the proposal of an efficient σ + -rule implication algorithm exploiting these optimizations.

VI. EVALUATION
In the following, we evaluate the scalability and the error detection/correction ability of using selection rules (σ + -rules in particular) over a set of experiments. More specifically, we try to provide an answer to the following questions.
• What is the scalability of repairing data with selection rules?
• What is the impact of the proposed optimizations on the σ + -rule implication algorithm?
• What is the ability of selection rules to detect and correct erroneous data?
• What is the impact of repairing data with selection rules on the amount of erroneous data?
All experiments are executed on a machine running Ubuntu 21.04, with an Intel Core i9-10920X CPU (3.50 GHz, 12 cores) and 64 GB of RAM.

A. EXPERIMENTAL SETUP
In order to answer the questions stated above, we consider six different approaches to evaluate on four different datasets. Details about the setup of the experiments are given below.

1) APPROACHES
In the experiments, we use three different repair strategies for $\sigma^+$-rules and compare the results of these strategies with three different configurations of HoloClean [19]. We compare against HoloClean because it is, to the best of our knowledge, one of the best repair engines in terms of error detection/correction based on data quality rules. Other well-performing, state-of-the-art repair engines, such as Raha/Baran [34], [35] and Spade [36], are not based on data quality rules and are, therefore, not taken into account.

a: σ + -RULE-BASED REPAIR STRATEGIES
For evaluating $\sigma^+$-rules, we consider three constant-cost repair strategies. By this, we mean that the cost to change a value $v$ into $v'$ is a constant, positive integer, fixed per attribute, and, therefore, does not depend on $v$ or $v'$. This is a very simple approach and it is not as flexible as non-constant cost strategies, which can account for a rich variety of repair strategies, potentially taking into account certain error mechanisms (e.g., single digit errors, rounding errors, phonetic errors, ...). The advantage of constant-cost repair strategies, however, is that the set cover method can be applied to find minimal-cost repairs (cfr. III-B). This makes constant-cost strategies easy to understand and quite fast compared to non-constant cost strategies. The three repair strategies for $\sigma^+$-rules are listed below. Note that all three are closely related to the strategies for regular edit rules, introduced by Fellegi & Holt [12].
• Sequential repair (σ + -rules (S)): With this strategy, one minimal solution is picked at random out of all potential minimal solutions. Attributes in this solution are repaired one at a time in random order. This is done by choosing, for each attribute, one repair value from the set of permitted values, which is created based on the values of the attributes that do not need a repair and the values of the attributes that are already repaired.
• Joint repair (σ + -rules (J)): Again, a minimal solution is picked at random. The difference with sequential repair is that all attributes in the solution are repaired jointly, in order to preserve joint distributions. This is done by picking a donor-tuple among the consistent tuples for which the values of the attributes to repair are covered by the set of permitted values of these attributes. The values of the attributes to repair are then copied from the donor-tuple to the inconsistent tuple. One problem with this strategy is that, sometimes, no donor-tuple is found. If this is the case, we will fall back on sequential repair.
• Conditional sequential repair ($\sigma^+$-rules (CS)): This strategy is an extension of the sequential repair strategy, but chooses a minimal solution in a more intelligent, less random way, which is based on the lift (a notion of correlation) that would be observed if the values of some attributes are kept. The reason for this is that random selection of a (minimal) solution has been criticized, because it can lead to repairs that would be very infrequent and thereby inflate the frequency of such improbable value combinations [22].

Note that we did not yet elaborate on strategies to pick one particular repair value (in case of the sequential repair approaches) or donor-tuple (in case of joint repair) if multiple possibilities exist. The reason for this is that this might depend on the characteristics of the relation schemata or the datasets. More information on this matter is given in VI-D. Moreover, we used custom implementations of all concepts and algorithms related to $\sigma^+$-rules in Java 8, of which the source code is provided in the open-source ledc-sigma package. 5
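The set cover step shared by the constant-cost strategies above can be sketched as follows. This is a brute-force illustration in Python, not the ledc-sigma implementation: the function name `minimal_cost_covers`, the attribute names and the costs are hypothetical, and the failed rules are assumed to come from a sufficient set and to be given as sets of involved attributes.

```python
from itertools import combinations

def minimal_cost_covers(failed_rules, costs):
    """Return the minimal total cost and all attribute sets of that cost
    that share at least one attribute with every failed rule."""
    attrs = sorted(set().union(*failed_rules))
    best_cost, best = None, []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            chosen = set(subset)
            # A cover must intersect the involved attributes of every failed rule.
            if not all(chosen & rule for rule in failed_rules):
                continue
            cost = sum(costs[a] for a in subset)
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, [chosen]
            elif cost == best_cost:
                best.append(chosen)
    return best_cost, best

# Hypothetical tuple failing two rules over attributes a, b and c.
failed = [{"a", "b"}, {"b", "c"}]
print(minimal_cost_covers(failed, {"a": 2, "b": 1, "c": 1}))
```

Because costs are positive, every returned solution is also minimal with respect to set inclusion. A sequential or joint strategy would then pick one of the returned solutions, at random or guided by a criterion such as lift.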

b: HoloClean CONFIGURATIONS
HoloClean is a framework for holistic data repairing driven by probabilistic inference exploiting different types of data quality (integrity) constraints [19]. It can make use of three different error detection mechanisms, which are the NullDetector (HoloClean (N)), the ViolationDetector (HoloClean (V)) and a combination of the NullDetector and the ViolationDetector (HoloClean (NV)). In the experiments, we use three different configurations of HoloClean, each using a different error detection mechanism. For the other parameters, we kept the default values as provided. HoloClean is written in Python 3.6, using a PostgreSQL (version 9.4+) database in the backend. The source code of HoloClean is also provided as open-source software. 6

2) DATASETS

a: CHARACTERISTICS
The experiments are conducted on four real-world datasets (referred to as Al, Eu, St1, St2) over three schemata: allergen, eudract and stock. Moreover, each dataset comes with a set of correct tuples, which forms the golden standard of the dataset. The subset of tuples in the dataset for which correct tuples are provided in the golden standard is called the overlap. The characteristics of these schemata (name and value type), the corresponding datasets (id, number of tuples and number of attributes), and the golden standards and overlaps (number of tuples and number of attributes for each) are provided in Table 7. For the number of attributes, we distinguish between the number of attributes that is considered for repair and the number of remaining attributes (shown between brackets).
• allergen: The dataset over this schema captures data related to food products and their allergens. Detailed information about the construction of the dataset and golden standard is given in [45]. We use a slightly modified version in our experiments in which we combine, for each food product, the information provided by the Alnatura 7 webshop and the Open Food Facts 8 registry in one tuple. (Footnotes 5 and 6: https://gitlab.com/ledc/ledc-sigma and https://github.com/HoloClean.)
• eudract: The dataset over this schema captures data related to the design of clinical trials as reported by the EudraCT registry. 9 Detailed information about the construction of the dataset and golden standard is given in [46]. Note that the golden standard does not contain correct values for attributes placebo and active_comparator. Still, we decided to consider these attributes for repair, as they are involved in the set of σ + -rules and, therefore, may contribute to the repair of other attributes.
• stock: The datasets over this schema capture data related to daily prices of 1000 stocks, reported by 54 sources. Detailed information about (the construction of) the dataset and golden standard can be consulted and downloaded freely on the web. 10 For our experiments, we used a subset of the available data, in which we only considered continuous attributes on which constraints (in the form of $\sigma^+$-rules) are defined and in which we only considered tuples capturing data of the first week of July 2011. An important remark is that the values of two attributes (change_percentage and change_in_dollar) can be calculated based on (linear) relationships between the attributes previous_close and last_trading_price (cfr. Example 1). We will show later (cfr. VI-D) that these (linear) relationships can be exploited for calculating repair values. In the provided golden standard, tuples that were inconsistent with these relationships were manually corrected, if possible. Otherwise, the tuples were left out. Note, also, that we use a second dataset over this schema, only consisting of tuples that appear in the overlap for which the date is July 6th, 2011. The reason for this is that HoloClean was unable to execute successfully on the larger dataset, because the machine ran out of memory resources during execution. The smaller dataset was, therefore, used to obtain an estimate of the performance of HoloClean.

TABLE 7.
Characteristics of three schemata with their datasets, golden standards and overlaps. For the number of attributes, we distinguish between the number of attributes that is considered for repair and the number of remaining attributes (shown between brackets).
From Figure 2, we can notice the following. First, the allergen-based dataset (Al) does not contain any NULL-values. This implies that error detection can only rely on the $\sigma^+$-rules defined over this schema, which makes it more difficult to correctly localize the errors. Moreover, HoloClean without ViolationDetector is not applicable on Al. Second, Al contains, relatively speaking, the highest number of erroneous tuples, but the lowest number of erroneous cells. This means that, in relative terms, the Al dataset has few errors, but the errors are highly distributed across the tuples. Third, the stock-based datasets (St1 and St2) have the highest number of erroneous cells. Also, the distribution of errors in St1 and St2 is more or less the same and, therefore, we can state that St2 is quite representative for evaluating error detection/correction on the stock schema, although it consists of far fewer tuples than St1.

In Table 8, an overview is given of the number of $\sigma^+$-rules defined over each relation schema. Note that, for each schema, a distinction is made between domain rules (which are always constant, because, by definition, only one attribute is involved in them) and non-domain rules (both variable and constant). Moreover, to give a notion of the work that needs to be done to generate a sufficient set (by applying the FCF algorithm), a distinction is made between the number of rules in the sufficient set and the number of rules that are initially defined (shown between brackets).
With respect to the information provided in Table 8, we can say the following. First, for the allergen schema and the eudract schema, a domain rule is defined for each attribute that is considered for repair. This is not the case for the stock schema, because attributes change_percentage and change_in_dollar are allowed to take any value in their domain. Regarding the non-domain rules, we can see that only variable σ + -rules are defined over the allergen schema and only constant σ + -rules are defined over the eudract schema. The stock schema features a combination of constant and variable σ + -rules. Moreover, note that, for the eudract schema, the number of σ + -rules in the sufficient set equals the number of σ + -rules in the initially given set, which is not the case for the other schemata. This is because the sufficient set defined over the eudract schema equals the initially given set.

B. SCALABILITY
In a first set of experiments, we evaluate the scalability of repairing with $\sigma^+$-rule-based approaches compared to repairing with HoloClean. First, we measure, for each approach and dataset, the repair execution time. Second, we evaluate the effectiveness of the $\sigma^+$-rule implication algorithm compared to applying Theorem 1.

TABLE 8.
Overview of the number of $\sigma^+$-rules in the sufficient set and the number of initially given $\sigma^+$-rules defined over each schema (shown between brackets). A distinction is made between domain rules, variable and constant non-domain rules.
The execution times that the different approaches need to repair the datasets entirely are listed in Table 9. Note that, for the HoloClean-based approaches, we distinguish between error detection and error correction and, for the $\sigma^+$-rule-based approaches, we distinguish between sufficient set generation with FCF and repair. First, the execution times for error detection and correction in case of HoloClean, and repair in case of $\sigma^+$-rules, are directly proportional to the size of the dataset. Moreover, the number of attributes, initially given rules and rules in the generated sufficient set impacts the execution time of the FCF algorithm, as expected. Second, we can state that, in general, HoloClean takes (much) more time to execute, especially if the time to generate a sufficient set is limited and the size of the dataset tends to increase. As stated earlier, we did not manage to execute HoloClean successfully on the St1 dataset, because the machine ran out of memory resources. HoloClean performs better only on the Al dataset, because generating a sufficient set takes quite long there, due to the high number of attributes, rules and implicit relations between these rules, whereas the number of rows is very small. Note, however, that generating a sufficient set should typically be done only once and that it can be reused afterwards each time that one wants to repair a dataset over this schema with any of our approaches. This is because the set of rules depends on the schema and not on the datasets. On the other hand, in an ideal case, HoloClean should be executed entirely every time a dataset changes, because it exploits the value distributions in these datasets for training. Third, if we only consider the $\sigma^+$-rule-based approaches, joint repair needs, generally speaking, the most time to execute (but has the better error correction ability, cfr. VI-D), whereas (conditional) sequential repair is (much) faster.
TABLE 9.
Overview of the repair execution times (in seconds). For the HoloClean-based approaches, we distinguish between error detection and error correction and, for the $\sigma^+$-rule-based approaches, we distinguish between sufficient set generation with FCF and repair. N/A indicates that the approach is not applicable. N/E indicates that the approach could not be executed successfully due to limitations in resources.

To test the effectiveness of the properties exploited by the $\sigma^+$-rule implication algorithm compared to the procedure proposed in Theorem 1, we count the number of implied rules that will be generated by both methods when passing a given generator as input. For the stock schema, we already stated in IV-C2 that the $\sigma^+$-rule implication algorithm has a large advantage over Theorem 1. Especially for the attributes involved in many $\sigma^+$-rules (e.g., previous_close, change_percentage, ...), the number of generated implied rules is highly reduced. When considering the eudract schema, only attributes open, single_blind and double_blind can lead to necessary rules, because the other attributes cannot be eliminated. Each of these attributes is involved in four $\sigma^+$-rules, of which one features one predicate, two feature two predicates and one features three predicates, resulting in an upper bound of $6^{12} \approx 2 \cdot 10^9$ new rules (see (4)). When executing Algorithm 1 with each of these attributes as input, merely around 5000 implied rules are generated (and tested). Only for the allergen schema does Theorem 1 perform better in terms of the number of generated rules. The main reason for this is that all $\sigma^+$-rules feature only one predicate, resulting in an exponent in (4) that is always equal to 1, and few candidate contributors per attribute exist, resulting in a low base. Therefore, we can conclude that the choice of procedure highly depends on the number of candidate contributors and the number of predicates involved in the candidate contributors. As a general rule of thumb, we recommend using Algorithm 1 when the number of candidate contributors is larger than 2 (resulting in a base equal to at least 3) and the average number of predicates per candidate contributor is at least 2.
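The rule of thumb above is simple enough to encode directly. The following Python sketch is only an illustration: the function name `recommend_procedure` and the returned labels are hypothetical, and the input is assumed to be the number of predicates of each candidate contributor of a single generator.

```python
def recommend_procedure(predicates_per_contributor):
    """Apply the rule of thumb: Algorithm 1 if there are more than 2 candidate
    contributors with at least 2 predicates on average, else Theorem 1."""
    n = len(predicates_per_contributor)
    if n == 0:
        return "theorem_1"  # assumption: nothing to combine, either procedure is trivial
    average = sum(predicates_per_contributor) / n
    return "algorithm_1" if n > 2 and average >= 2 else "theorem_1"

# eudract-like generator: four rules with 1, 2, 2 and 3 predicates.
print(recommend_procedure([1, 2, 2, 3]))
# allergen-like generator: few contributors with one predicate each.
print(recommend_procedure([1, 1]))
# Exact value behind the reported bound of roughly 2 * 10^9.
print(6 ** 12)
```

For the eudract-like input the average is exactly 2, so the sketch recommends Algorithm 1, which agrees with the reported gap between roughly 5000 tested rules and the bound $6^{12} = 2176782336$.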

C. ERROR DETECTION
In a second set of experiments, we evaluate the error detection ability of $\sigma^+$-rules compared to the HoloClean-based approaches. More specifically, we evaluate how well the different approaches perform in detecting erroneous tuples and erroneous cells in the overlap of each of the datasets. The reason to distinguish between error detection and error correction is that, if an approach performs well in detecting errors but performs poorly in correcting errors, we can still rely on other approaches (e.g., domain experts, deductive (i.e., certain) repairs, ...) to find correct values.

1) DETECTING ERRONEOUS TUPLES
First, we evaluate the ability to detect erroneous tuples in the overlap of each of the four datasets. In other words, we evaluate how well the different approaches perform in finding tuples in which they are going to change at least one value (i.e., which they consider as erroneous). For this, we report on the following evaluation metrics for each dataset and approach.
• Precision (P): the number of tuples correctly identified as erroneous divided by the number of tuples identified as erroneous.
• Recall (R): the number of tuples correctly identified as erroneous divided by the number of erroneous tuples.
• $F_1$-score ($F_1$): harmonic mean of precision and recall.

In Table 10, an overview of the results is given. Note that we did not make a distinction between the $\sigma^+$-rule-based approaches. The reason for this is that error detection with $\sigma^+$-rules only depends on the expressiveness of the $\sigma^+$-rules and does not depend on the repair strategy (cfr. VI-A1). Indeed, each tuple failing any of the $\sigma^+$-rules or containing a NULL-value is detected as erroneous. Besides that, the precision is 1 for each approach and dataset. If this were not the case, tuples would exist in the golden standard that fail any of the $\sigma^+$-rules or that contain NULL-values. For the recall and $F_1$-score, the $\sigma^+$-rule-based approaches always outperform the HoloClean-based approaches, especially on the Al and Eu datasets.
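The three metrics defined above can be computed from two sets of tuple identifiers. The Python sketch below is a minimal illustration: the function name `detection_scores` and the identifiers are hypothetical, and returning 0 for an undefined ratio is an assumption, not a convention taken from the experiments.

```python
def detection_scores(detected, erroneous):
    """Precision, recall and F1 of error detection, given the set of items
    flagged as erroneous and the set of items that are truly erroneous."""
    true_positives = len(detected & erroneous)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(erroneous) if erroneous else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical overlap: tuples 1-4 are erroneous, an approach flags tuples 1 and 2.
print(detection_scores({1, 2}, {1, 2, 3, 4}))
```

The example mirrors the situation reported for Table 10: every flagged tuple is truly erroneous, so precision is 1, while recall and $F_1$ depend on how many erroneous tuples remain undetected. The same function applies to cells if identifiers are (tuple, attribute) pairs.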

2) DETECTING ERRONEOUS CELLS
Second, we evaluate the ability to detect erroneous cells in the overlap of each of the four datasets. In other words, we evaluate how well the different approaches perform in finding cells whose value they are going to change (i.e., which they consider as erroneous). Note that, for the $\sigma^+$-rule-based approaches (i.e., the constant-cost repair strategies), this depends on the cost assigned to each attribute to change its value (i.e., the cost model). In order to give a notion of the error detection ability of these approaches and for simplicity reasons, we only consider one cost model for each schema. For the allergen and eudract schema, an equal cost is assigned to each attribute. For the stock schema, a lower cost is assigned to those attributes whose value can potentially be determined in a deductive way, because of the (linear) relationships that exist between these attributes, or which contain derived data. More specifically, we assigned cost 1 to change_percentage and change_in_dollar, cost 2 to previous_close and last_trading_price and cost 3 to the other attributes. We report on the following evaluation metrics for each dataset and approach.
• Precision (P): the number of cells correctly identified as erroneous divided by the number of cells identified as erroneous.
• Recall (R): the number of cells correctly identified as erroneous divided by the number of erroneous cells.
• $F_1$-score ($F_1$): harmonic mean of precision and recall.

In Table 11, an overview of the results is given. Note that we did not make a distinction between the sequential and joint repair strategy, because these approaches search for attributes to repair in the same way (cfr. VI-A1). First, a straightforward insight is that, when errors are due to NULL-values rather than due to erroneous values, error detection is easier, because cells with NULL-values will always be identified correctly as erroneous. Second, although the $\sigma^+$-rule-based approaches score best for detecting erroneous tuples in the Al dataset, they score worst in detecting erroneous cells in this dataset compared to applying the $\sigma^+$-rule-based approaches on the other datasets. This is because of the high number of attributes and $\sigma^+$-rules, the lack of NULL-values, and the relatively high number of inconsistent tuples. Indeed, we noticed that, for each inconsistent tuple, the approaches can choose between a high number of minimal solutions (in terms of attributes to repair) for this dataset. Nevertheless, we can safely state that, for all datasets, the $\sigma^+$-rule-based approaches (especially the $\sigma^+$-rules (CS) approach) generally outperform the HoloClean-based approaches. Only for the Eu dataset, the HoloClean (N) approach is slightly better in terms of precision. For the Al dataset, $\sigma^+$-rules (S/J) slightly outperforms $\sigma^+$-rules (CS). Also, note that the results shown in Table 11 serve as upper bounds for the error correction results discussed below.

D. ERROR CORRECTION
In a third and final set of experiments, we evaluate the error correction ability of $\sigma^+$-rules compared to the HoloClean-based approaches. First, we evaluate how well the different approaches perform in finding correct values for erroneous cells in the overlap of each of the four datasets. Second, we check how many erroneous tuples still exist after finishing the repair process.
Before discussing the results, we elaborate on the selection strategy used to pick a particular repair value (in case of the sequential repair approaches) or donor-tuple (in case of joint repair) if multiple possibilities exist. For the Al and Eu datasets, the (conditional) sequential repair approach picks a value in a frequency-driven way. From the consistent tuples, the frequency of each of the permitted values is measured, and each of these values is chosen with a probability proportional to its observed frequency. For the stock-based datasets, if it is possible to determine a value deductively (i.e., for attributes change_percentage, change_in_dollar, previous_close and last_trading_price) due to the (linear) relationships between them, and if this value is permitted, it is chosen. If not, we fall back on the frequency-driven strategy. Regarding joint repair, donor-tuples are also picked in a frequency-driven way, with more frequent donors more likely to be chosen. For the Eu (resp. St1 and St2) datasets, a donor is chosen in a frequency-driven way from the set of consistent tuples that refer to the same clinical trial (resp. stock-date combination), if possible.

TABLE 11.
Overview of the results regarding the ability to detect erroneous cells. N/A indicates that the approach is not applicable. N/E indicates that the approach could not be executed successfully due to limitations in resources. For all σ + -rule-based approaches, the results are shown as averages over 10 executions.
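As a rough illustration of the value selection strategy described in this section, the following Python sketch first tries a deductively determined value and otherwise samples a permitted value with probability proportional to its frequency among the consistent tuples. The function name `pick_repair_value`, its parameters, and the uniform fallback for the case where no permitted value occurs in the consistent tuples are assumptions made for this sketch; they are not taken from the implementation evaluated here.

```python
import random
from collections import Counter


def pick_repair_value(permitted, consistent_values, deduced=None, rng=random):
    """Pick a repair value for one attribute of one inconsistent tuple.

    permitted         -- set of values the rules allow for this attribute
    consistent_values -- values of this attribute observed in consistent tuples
    deduced           -- value derived from other attributes, if any (hypothetical input)
    rng               -- random source with choice() and choices(), e.g. random.Random(seed)
    """
    if not permitted:
        raise ValueError("no permitted value to choose from")

    # Prefer a deductively determined value, but only if it is permitted.
    if deduced is not None and deduced in permitted:
        return deduced

    # Frequency-driven fallback: count permitted values among consistent tuples.
    counts = Counter(v for v in consistent_values if v in permitted)
    if not counts:
        # Assumption of this sketch: without frequency information, pick uniformly.
        return rng.choice(sorted(permitted, key=repr))

    values = list(counts)
    weights = [counts[v] for v in values]
    # Sample one value with probability proportional to its observed frequency.
    return rng.choices(values, weights=weights, k=1)[0]
```

Under the same assumptions, donor-tuples for joint repair could be weighted in the same way by passing hashable tuples as values and restricting `consistent_values` to the tuples that refer to the same entity.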

TABLE 12.
Overview of the results regarding the ability to correct erroneous cells. N/A indicates that the approach is not applicable. N/E indicates that the approach could not be executed successfully due to limitations in resources. For all σ + -rule-based approaches, the results are shown as averages over 10 executions.

TABLE 13.
Overview of the percentage of erroneous tuples in the repaired datasets compared to the original datasets. N/A indicates that the approach is not applicable. N/E indicates that the approach could not be executed successfully due to limitations in resources.
The results regarding the ability to correct erroneous cells are listed in Table 12, for which we report the following metrics.
• Precision (P): the number of correctly repaired cells divided by the number of repaired cells.
• Recall (R): the number of correctly repaired cells divided by the number of erroneous cells.
• F1-score (F1): harmonic mean of precision and recall.
We can state that the results are more or less in line with the results regarding error detection on cell-level. For the Al and Eu datasets, the σ + -rule-based approaches outperform HoloClean for all metrics and approaches. Only for the St2 dataset is HoloClean better. The reason for this is that there are many (up to 54) sources reporting on the same stock, and HoloClean is able to learn repairs based on correlations between attributes (values). For the St1 dataset, we cannot compare with HoloClean, because it did not execute successfully due to limited memory resources (cfr. VI-A2). If we only consider the σ + -rule-based approaches, we can see that σ + -rules (J) is, generally speaking, the better one. Especially for the Eu, St1 and St2 datasets, the difference with the other approaches is quite large, because of the ability to use an optimized donor selection strategy, as stated earlier. The difference in performance between the sequential and conditional sequential repair strategies is negligible and depends on the dataset (with a small overall advantage for σ + -rules (CS)). Based on these findings, we can conclude that the performance of error correction highly depends on the performance of error detection and the ability to use optimized value selection strategies, but also on the number of permitted values/donors. Take, for example, a continuous attribute not containing derived data (e.g., today_low in the stock schema). For these kinds of attributes, one may be allowed to choose from a huge set of permitted values, and the choice from this set is often quite arbitrary because a more intelligent strategy does not exist.
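To make the cell-level correction metrics defined above concrete, the following Python sketch computes precision, recall and F1 from a set of repairs and a ground truth. The function name `correction_scores`, the dictionary-based representation of cells, and the convention of returning 0 when a denominator is empty are assumptions of this sketch, not details of the evaluation setup.

```python
def correction_scores(repairs, truth, erroneous_cells):
    """Return (precision, recall, f1) for cell-level error correction.

    repairs         -- dict mapping a cell id to the value written by the repair
    truth           -- dict mapping a cell id to its ground-truth value
    erroneous_cells -- set of cell ids that are erroneous in the dirty data
    """
    # A repair counts as correct if it hits an erroneous cell and restores the true value.
    correct = sum(
        1
        for cell, value in repairs.items()
        if cell in erroneous_cells and cell in truth and truth[cell] == value
    )
    precision = correct / len(repairs) if repairs else 0.0
    recall = correct / len(erroneous_cells) if erroneous_cells else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four erroneous cells, three repaired, two of them correctly.
truth = {(1, "a"): "x", (2, "b"): "q", (3, "c"): "z", (4, "d"): "w"}
repairs = {(1, "a"): "x", (2, "b"): "y", (3, "c"): "z"}
p, r, f1 = correction_scores(repairs, truth, set(truth))
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.667 R=0.500 F1=0.571
```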
To finish this section, we would like to point out that an important property of all σ + -rule-based approaches is that they guarantee error-free correction. This, however, is not the case with HoloClean. Indeed, in Table 13, we have listed, for each approach and dataset, what percentage of tuples in the repair still fail any σ + -rule or contain a NULL-value compared to the original dataset. We can see that the overall number of erroneous tuples generally decreases when applying HoloClean (with HoloClean (NV) performing the best), but for the Al and Eu datasets, the decrease is only slight. Also, for the HoloClean (N/NV) approaches, it is not guaranteed that all original NULL-values will contain a non-NULL-value after repair (cfr. Eu dataset). Moreover, for the HoloClean (N) approach, the number of errors against σ + -rules can even increase, which is as expected, because the approach does not take any constraints into account.
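The measure reported in Table 13 can be sketched as follows, under one possible reading: the number of tuples that are still erroneous after repair, expressed as a percentage of the number of erroneous tuples in the original dataset. This reading, the representation of each rule as a hypothetical predicate that returns True when a tuple fails it, and the attribute names `low` and `high` in the toy example are all assumptions of this sketch.

```python
def is_erroneous(row, failing_rules):
    """A tuple is erroneous if it contains a NULL-value or fails at least one rule."""
    has_null = any(value is None for value in row.values())
    # Short-circuit: rules are only evaluated on tuples without NULL-values.
    return has_null or any(fails(row) for fails in failing_rules)


def remaining_error_percentage(original, repaired, failing_rules):
    """Erroneous tuples after repair as a percentage of erroneous tuples before repair."""
    before = sum(is_erroneous(row, failing_rules) for row in original)
    after = sum(is_erroneous(row, failing_rules) for row in repaired)
    if before == 0:
        return 0.0
    return 100.0 * after / before


# Toy example with a single hypothetical rule: a tuple fails when low > high.
rules = [lambda row: row["low"] > row["high"]]
original = [
    {"low": 1, "high": 2},
    {"low": 5, "high": 3},     # fails the rule
    {"low": None, "high": 4},  # contains a NULL-value
    {"low": 2, "high": 2},
]
repaired = [
    {"low": 1, "high": 2},
    {"low": 3, "high": 3},     # repaired
    {"low": None, "high": 4},  # NULL-value left untouched
    {"low": 2, "high": 2},
]
print(remaining_error_percentage(original, repaired, rules))  # 50.0
```

Under this reading, an approach that guarantees error-free correction would yield 0, while a repair that introduces new rule violations could exceed 100.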

VII. CONCLUSION
In this paper, we investigated the concepts and properties of two types of tuple-level constraints, which we called σ - and σ + -rules (selection rules in short) and which arise from the selection operator σ of relational algebra. First, σ -rules are an extension of regular edit rules, in the sense that they allow more operators in their representation and they allow variable and constant comparison instead of constant comparison only. Despite the increase in expressivity, the concepts and properties of edit rules can easily be transferred to the setting of σ -rules. A particularly interesting result is that rule implication with σ -rules, in order to solve the error localization problem by means of the elegant set cover method, can properly be applied in most cases. Only when attribute domains are ordinal do we have to rely on a representation that allows gaps between attribute values to keep the number of rules manageable, which is the reason why we proposed σ + -rules as an extension of σ -rules. In order to assess the usability of selection rules in practical settings, we evaluated the scalability and the error detection and correction ability of selection rules compared to HoloClean, which is a state-of-the-art repair engine, in a set of experiments on four real-world datasets. The results show that our approach outperforms HoloClean in terms of scalability, especially when datasets are very large and the work that FCF has to do is limited. Note that, ideally, FCF has to be executed only once per schema, whereas HoloClean is supposed to train its model again after each modification to the data. Moreover, our approach also outperforms HoloClean in terms of error detection and error correction if the number of sources describing an entity in a dataset is limited.
Still, some research questions remain to be studied in future work. First, although it is possible to capture many errors by means of selection rules, other types of (more expressive) data quality rules (e.g., functional dependencies) exist, which have also proven their utility in the past. Therefore, one might wonder to what extent (implication with) selection rules can account for more expressive constraints without highly increasing the complexity of the repair engine. Besides that, one can also investigate the benefit of using selection rules and other types of constraints jointly. Second, as is the case for many types of data quality rules, an efficient strategy to automatically discover selection rules can be useful. A potential solution could be to study and simplify the strategy to discover (tuple-level) denial constraints. Third, instead of restating the properties of edit rules that help in optimizing rule implication in the setting of σ -rules, one can study additional properties (e.g., of variable rules in particular or related to rule folding) that help to reduce the number of combinations to test. Fourth, although the implication-based method with selection rules proves to perform quite well in terms of error detection, there might exist better (perhaps more complex) strategies for error correction. One potential direction of research can, therefore, focus on error correction mechanisms for selection rules depending on different schema (or dataset) characteristics.