Supporting Uncertain Predicates in DBMS Using Approximate String Matching and Probabilistic Databases

Current relational database systems are deterministic in nature and lack support for approximate matching. The result of approximate matching would be tuples annotated with a percentage of similarity, but existing relational database systems cannot process these similarity scores further. In this paper, we propose a system to support approximate matching in a DBMS. We introduce a '≈' (uncertain predicate) operator for approximate matching and devise a novel formula to calculate the similarity scores. Instead of returning an empty answer set in case of no match, our system gives ranked results, thereby providing a glance at existing tuples closely matching the queried literals. Two variants of the '≈' operator are also introduced for numeric data: '≈+' for higher-the-better and '≈−' for lower-the-better cases. Efficient approximate string matching methods are proposed for matching string-type data, whereas numeric closeness is used for other types of data (date, time, and number). We also provide results of our system over several sample queries that illustrate its significance. All experiments are performed using the MySQL database, with the IMDb movie database and a European football database as sample datasets.


I. INTRODUCTION
In traditional databases, select, from, and where are the fundamental clauses of any SQL query. A general SQL query takes the relations specified in the from clause as input, removes tuples that do not satisfy the predicates in the where clause, and projects the attributes specified in the select clause. In the end, the query returns a filtered relation as output.

A. PATTERN MATCHING IN DETERMINISTIC DATABASE
Pattern matching in a deterministic database is performed using the like operator. Patterns are described using two special characters, '%' and '_'. The '%' matches any string, i.e., any number of characters, and '_' matches a single character. SQL uses the like operator to express the pattern. Consider the query in Fig. 1. This query returns the titles of books containing 'Computer' as a substring.
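The query in Fig. 1 itself is not reproduced here. As a minimal, hypothetical sketch (the books(title, price) table and its contents are our assumptions for illustration), such a substring query can be tried with SQLite:

```python
import sqlite3

# Hypothetical books table; a query like Fig. 1's matches titles
# containing 'Computer' as a substring via LIKE '%Computer%'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books(title TEXT, price INTEGER)")
con.executemany("INSERT INTO books VALUES (?, ?)", [
    ("Computer Networks", 450),
    ("Database Systems", 400),
    ("Introduction to Computer Science", 350),
])
titles = [row[0] for row in con.execute(
    "SELECT title FROM books WHERE title LIKE '%Computer%'")]
```

A title differing even by one character from the pattern, e.g. 'Komputer', would not be returned, which motivates the approximate matching discussed next.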
Conditions in the where clause are used to compare expressions. An expression can be a column of a relation, or a constant such as a string literal (also called a text literal or text constant), a number, a date, or an alphanumeric string. We can use the comparison operators <, ≤, >, ≥, =, and <>, as well as the like operator, in the where clause. Clearly, the traditional database does not support approximate matching of values. It can only match the pattern specified in the like operator, which is equivalent to substring matching. The user needs comprehensive knowledge of the data in the database; otherwise, the query may return an empty result-set. The same can also happen due to a single misspelled word or a typing error. Moreover, problems may arise due to the homophonic nature of words, for example, Cristopher and Kristopher, i.e., words pronounced the same way but differing in spelling. If strings differ even in a single character, exact string matching fails. Thus, one of the problems in the existing deterministic database is the lack of support for approximate matching. In this paper, we try to address this issue.

B. NEED OF UNCERTAIN MATCHES
Instead of returning an empty result-set due to the above-mentioned reasons, like lack of knowledge about the existing data, misspellings, typing mistakes, etc., it is better to return a result-set ranked according to the degree of similarity. This helps the casual user who has very little knowledge about the stored data. From the approximate results, the user comes to know the actual values in the database and may re-query with the exact keywords she intended.
Let us assume there exists an uncertain predicate operator, denoted '≈', which is used for approximate matching just as the '=' operator is used for exact matching. If approximate matching is performed, the resulting rows carry a percentage of matching, where a higher percentage indicates higher similarity. In such a case, the result of the given query could be a relation annotated with the percentage of matching. These matching scores can be treated as probability scores, or confidence scores, for the respective tuples. The probability scores of filtered tuples need to be stored along with the tuples, but the deterministic database has no support for managing tuple probabilities during query processing. In that direction, the probabilistic database was proposed by Dalvi and Suciu [1] for managing the probability of a tuple during query processing. Moreover, queries with uncertain predicates over a probabilistic database have also been proposed. For a given SQL query, each tuple in an input database is assigned a probability based on its match with the constants in the uncertain predicates, and from these the probability of each intermediate tuple is derived. Finally, the result-set is ranked according to the derived probability, presenting the most credible (i.e., the most probable) answer first. For string matching purposes, Approximate String Matching (ASM) methods can be used, which are discussed in the next section.

C. EXAMPLE OF UNCERTAIN PREDICATE
Our general aim is that, using Approximate String Matching (ASM) techniques and probabilistic databases, we should be able to execute the kind of query shown in Fig. 2. As mentioned, the '≈' operator is the proposed uncertain predicate operator. The expected result is depicted in Table 1, which shows the result-set arranged in decreasing order of the tuples' probability, the topmost tuple being the most similar.

D. OUTLINE AND CONTRIBUTIONS
To the best of our knowledge, database systems still do not support uncertain predicates and are incapable of performing approximate string matching. In this work, we extend the idea of queries with uncertain predicates introduced in [1] and provide the design of a complete system supporting uncertain predicates in the database, which, we believe, will make database fields more informative. This will also enable the capabilities of information retrieval to be applied to database fields. With a probabilistic database in hand, support for uncertain predicates is attainable.
The rest of the paper is organized as follows. We discuss the related work in Section II. In Section III, we give a brief overview of the concepts of the probabilistic database and its use for modeling uncertainty. We also review various existing ASM methods and identify the method best suited to our purpose. Later, we cover the main contributions of the paper:
• We propose the design and implementation of a system to support uncertain predicates along with the regular ''certain'' predicates.
• We identify the deficiency of the basic q-gram distance in our scenario and propose the global q-gram distance. We provide the complexity and overhead analysis of the proposed formula (Section IV).
• We present the queries to calculate the above-mentioned distance by utilizing the in-built features provided by the DBMS (Section IV).
• We also introduce two variants (≈+ and ≈−) of uncertain predicate operator (≈) for the numeric type of data (Section IV).
• We formulate the procedures to implement our proposed method. We also describe details of the sample databases that we have used for the testing (Section V).
• We demonstrate the results of several queries on sample databases (Section VI). Finally, we conclude in Section VII with possible future directions.

II. RELATED WORK
An SQL query with an uncertain predicate was first proposed by Dalvi and Suciu [1]. However, the main focus of their work was query evaluation in probabilistic databases. They gave intensional and extensional query evaluation methods to derive the probability of the filtered tuples. But, to the best of our knowledge, a full-fledged system supporting uncertain predicates is still missing. Probabilistic databases have been widely explored for storing uncertain data, instead of just replacing uncertain fields with nulls [2]. Hassanzadeh and Miller [3] used probabilistic databases to remove duplicate records from a base dirty relation. They find the similarity between records and use clustering algorithms to identify sets of records that are probable duplicates. Kanagal et al. [4] proposed sensitivity analysis to find the most influential input tuples for a query, which predominantly affect the probability of the answer. Qin and Yu [5] proposed a dynamic programming algorithm to perform sensitivity analysis for Inequality Queries (IQ) efficiently. Sensitivity analysis can be utilized for ranking query results [6].
Hakak et al. [7] presented a comprehensive survey of exact string matching algorithms that are commonly used for pattern searching. Unlike exact string matching, Approximate String Matching (ASM) methods allow errors in matching, which makes them useful for applications like spell checking, autocompletion, etc. [8], [9]. ASM methods are also used to extract features for machine learning and to calculate intra-comment similarity to tackle cybercrimes like cyberbullying and online aggression over social media [10]-[13]. Loo [14] developed the stringdist package in R for approximate string matching. Methods in the stringdist package have an optional parameter to specify the string matching method to be used among several well-known implemented methods. Parallel versions of ASM methods using Graphics Processing Units (GPUs) are discussed in [15].
Another area of DBMS that finds application for ASM is the approximate string join [16], [17]. Effective pruning techniques are indispensable in this area. Various filters based on count filtering, position filtering, length filtering [18], subtrie pruning [19], minimal-edit-distance pruning, and duplication pruning [20] have been proposed to improve performance. The probabilistic database system has promising applications in various fields that have to deal with uncertain, imprecise, or inaccurate information [21], [22]. But a concrete probabilistic database management system is still lacking. Only a few university projects like MayBMS [23]-[25] and Trio [26] have been proposed.
With the proposal of probabilistic databases, additional queries like top-k queries [27]-[29] and skyline queries [30], [31] over probabilistic databases have been proposed in the literature. Both of these queries highlight the dominating tuples of the database and are very useful in the fields of decision making and multi-criteria analysis. In probabilistic databases, tuples or attributes are assigned some probability. Therefore, a probabilistic database exists in multiple instances, and each instance is known as a 'possible world'. If possible world semantics is followed for aggregate queries, a query may return an exponentially sized result-set, i.e., one aggregate result for each possible world. To overcome this problem, three variants of aggregate queries have been proposed by Murthy et al. [32], termed the low bound, high bound, and expected value of the aggregate.

III. PRELIMINARIES
In this section, we briefly review the probabilistic database and the ASM methods, which are the building blocks of our system.

A. UNCERTAINTY AND PROBABILISTIC DATABASES
To deal with uncertain data and probabilistic query processing, probabilistic databases are proposed [33]. Nowadays, many applications have to deal with uncertain data, such as object recognition, sensor networks, moving object monitoring, data extraction, and integration [21]. Uncertainty may arise during the data collection, data transmission, or data processing stages.
With traditional deterministic databases, uncertainty cannot be modeled. The use of nulls was proposed by Codd [34] to handle uncertainty. But there can be information loss in using nulls in place of doubtful values. For example, consider a classifier that classifies objects. It could identify that object X belongs to class A with probability 0.6 and to class B with probability 0.4. If, due to this uncertainty, the class of X is recorded as null, the information that we had earlier is lost. Such uncertainty can be best modeled and processed using probabilistic databases [35]-[39]. A probabilistic database stores uncertainty as a probability attached to the uncertain tuple or attribute. The probability of a tuple or an attribute is a measure of the confidence that the system has in that reading.
Formally, probabilistic databases can be defined as follows:
Definition 3.1 (Probabilistic Database and Possible World [23]): Consider a schema with relations R_1, . . . , R_k. A probabilistic database is a finite set of worlds {(W^[1], p^[1]), . . . , (W^[n], p^[n])}, where each W^[i] = (R_1^i, . . . , R_k^i) is an instance of the schema, called a possible world, and its probability of existence is p^[i]. Further, Σ_{1≤i≤n} p^[i] = 1. Every relation R_j^i in world W^[i] contains one of the possible subsets of tuples from relation R_j. If relation R_j contains only certain tuples, i.e., tuples with probability 1, it is called a certain relation, and R_j^1 = R_j^2 = · · · = R_j^n = R_j. Let us consider the following example of a probabilistic database.

1) EXAMPLE OF PROBABILISTIC DATABASE
Consider an institute that has installed an automated system for maintaining students' attendance. The system recognizes a student using facial recognition, with some error. The percentage of the match can be treated as the probability of that tuple. One particular instance of such a table is given in Table 2. A database with such uncertain relations is called a probabilistic database.

2) POSSIBLE WORLD SEMANTICS
As seen in the previous example, every tuple is assigned some probability of existence. For instance, tuple t_1 is true with probability 0.9 and, at the same time, false with probability 0.1. Thus, in one instance, some of the tuples remain true and the rest are false. This forms one possible instance of the above database. In this way, the probabilistic database can exist in multiple possible instances. In one of the possible instances, all tuples are true. Similarly, in one particular instance, a table may not have any tuple (this can happen only if there is no certain tuple in the relation). Such multiple instances are called the possible worlds of a probabilistic database.
A traditional relational database has only one state, containing all the present tuples. In contrast, a probabilistic database can be in one of several states. It is a set of multiple possible certain databases, each of which has some chance to appear in the real world. The probability of a possible world is determined by the probabilities of its tuples. For example, the probability of a possible world pw containing three tuples (say t_1, t_4, and t_5) out of five tuples is

p(pw) = p_1 · (1 − p_2) · (1 − p_3) · p_4 · p_5

where p_1, p_4, and p_5 are the probabilities of presence of tuples t_1, t_4, and t_5, respectively, and (1 − p_2) and (1 − p_3) represent the probabilities of absence of tuples t_2 and t_3.
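The computation above can be sketched as follows. Only t_1's probability (0.9) comes from the attendance example; the remaining tuple probabilities are hypothetical values chosen for illustration:

```python
def world_probability(tuple_probs, present):
    """Probability of a possible world: the product of p_i for tuples
    present in the world and (1 - p_i) for tuples absent from it."""
    p = 1.0
    for name, prob in tuple_probs.items():
        p *= prob if name in present else (1 - prob)
    return p

# t1's probability (0.9) is from the attendance example; the rest are hypothetical.
probs = {"t1": 0.9, "t2": 0.2, "t3": 0.3, "t4": 0.8, "t5": 0.5}
pw = world_probability(probs, present={"t1", "t4", "t5"})
# pw = 0.9 * (1-0.2) * (1-0.3) * 0.8 * 0.5
```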

B. APPROXIMATE STRING MATCHING
Approximate String Matching (ASM) is a string matching problem where two strings are matched approximately rather than exactly. ASM aims to find the closest match for a query string; the uncommon part between the two strings is considered the error between them. As ASM allows errors, it is useful in cases of undesired corruption or inadvertent typing errors. Most often, the approximate string match is measured in terms of distance, with exactly matched strings having a distance of 0. ASM distance metrics are classified into three major categories, namely edit based, q-gram based (sometimes called n-gram based), and heuristic based [9], [14].
The edit based metrics are measured in terms of the number of edit operations required to make two strings equal. These edit operations include addition, deletion, and substitution of a character, or transposition of characters. Based on the allowed operations, edit based distances are categorized as Hamming distance [40], Longest Common Subsequence (LCS) distance [41], Levenshtein (LV) distance [42], Optimal String Alignment (OSA) distance, and Damerau-Levenshtein (DL) distance [43].
The q-gram based distances aim to find the common q-grams between two matched strings. The q-grams of a string s are all possible substrings of s of length q. The maximum number of distinct q-grams for a string s is |s| − q + 1. For example, if q = 2 and s = 'VNIT', then the q-grams are {'VN', 'NI', 'IT'}. Jaccard distance, q-gram distance [44], and Cosine distance are some examples of q-gram based distances.

Definition 3.2 (q-Gram Distance):
The q-gram distance is the number of q-grams that are not common between the two strings and can be calculated as

d_{q-gram}(s, t) = Σ_{x ∈ Σ^q} |υ_x(s; q) − υ_x(t; q)|

where Σ represents the finite alphabet. The Kleene closure Σ* is the infinite set of all possible strings over Σ, and the set Σ^q contains all the strings of length q from Σ*. The vector υ(s; q) is the q-gram profile of string s, of dimension |Σ|^q, whose coefficients represent the number of occurrences of every possible q-gram of length q. If a q-gram x ∈ Σ^q has index i in the vector υ(s; q), then υ_i(s; q) represents the total occurrences of x in s. Thus, the q-gram distance is equal to the L_1-norm of the difference of the q-gram profile vectors of the two strings [44]. It varies between 0 and |s| + |t| − 2q + 2. For example, the q-gram distance between 'Tuesday' and 'Thursday' is 7 for q = 2, as 'Tu', 'ue', 'es', 'Th', 'hu', 'ur', and 'rs' are the q-grams that are not common between them.
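A minimal sketch of the profile-based q-gram distance of Definition 3.2 (function names are ours, chosen for illustration):

```python
from collections import Counter

def qgram_profile(s, q=2):
    """Multiset (profile) of all q-length substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(s, t, q=2):
    """L1-norm of the difference of the two q-gram profiles."""
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    return sum(abs(ps[g] - pt[g]) for g in set(ps) | set(pt))
```

Here 'Tuesday' and 'Thursday' share 'sd', 'da', and 'ay', leaving seven uncommon bigrams, so the distance is 7.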
Lastly, heuristic based distances have a particular application in mind but lack a solid mathematical backbone. For example, the Jaro distance [45] was specifically designed to pair records between the census data and the independent Post Enumeration Survey (PES) conducted by the U.S. Census Bureau in 1985. The intuition behind the method is that errors like transpositions of adjacent characters or mismatches between close characters are most likely due to typing errors, as in martha-marhta or jonathon-jonathn. These are legitimate variations and should have smaller distances: though such a pair differs by a single transposition of characters, its Jaro distance remains small, as the error is more likely due to typing.
After this brief overview of uncertainty and probabilistic databases, as well as Approximate String Matching (ASM), our proposed methodology based on these two concepts is described in the next section.

IV. METHODOLOGY
In order to find the similarity between values in terms of distance, we introduce the '≈' operator for uncertain predicates. To implement such an operator for string-type data, we can use the distance between strings, which is a notion of their dissimilarity. For numerals, the distance can indicate numeric closeness. Fig. 3 shows the basic modules of the proposed system: preprocessing, distance calculation, distance normalization, and probability calculation. When the parser encounters the '≈' symbol in the where clause of a query, it performs approximate matching on the columns involved in the uncertain predicates. A query may have other predicates along with the uncertain predicate, as in our earlier books example from Section I (depicted again in Fig. 4(a)). An instance of the books relation is shown in Fig. 4(b). All predicates except the uncertain predicates are applied first, and then the uncertain predicates are applied to the filtered result-set. These steps form the preprocessing stage of our system. Fig. 4(c) illustrates the filtered result-set. The Distance Calculation Module calculates the distance between the queried literal and each of the field values in the corresponding column to obtain the distance array (Fig. 4(d)). The distance array is normalized to the range [0, 1] by the Distance Normalization Module (Fig. 4(e)). The above steps are repeated for all the uncertain predicates. The Probability Calculation Module combines the probabilities obtained from the uncertain predicates to calculate the final probability of the filtered tuples (Fig. 4(f)), and finally we get a probabilistic database as the output (Fig. 4(g)).
Procedures to find distances differ based on the data type of a column. Most of the standard databases support various built-in data types, such as number for integer type and real type data, char for a character, varchar for a string. Along with these basic types, databases also support types for a date (date) and time (time).
Firstly, we focus on string-type data. Out of all the ASM techniques discussed in the previous section, we believe the q-gram distance (with the required modifications) is best suited for our purpose. Edit based distances do not fare well in cases of swapped words. For example, 'John Watson' may get recorded as 'Watson, John'. The edit based distance between these two strings is 10 (for OSA) out of a possible range of [0, 12], which is significant. In the case of q-gram distance, it would be just 5 (for q = 2) out of a possible range of [0, 21]. We neglect heuristic-based methods as they are very much application-specific.
In Section IV-A we describe a naive solution to find the distance between two strings. In Section IV-B we propose a more accurate distance formula that overcomes the drawbacks of the naive solution.

A. NAIVE SOLUTION TO CALCULATE THE DISTANCE BETWEEN TWO STRINGS
As noted earlier, the q-gram distance needs O(|Σ|^q) storage. To avoid such high space complexity, we use the following technique to calculate the q-gram distance between two strings (s and t) in an RDBMS. Here one can utilize built-in features already available in the RDBMS, like efficient join methods, clustered indexes, etc. Auxiliary tables A_1(qgram) and A_2(qgram), with qgram as the only attribute, are created on-the-fly for the two strings and deleted after use. The example in Fig. 5 depicts auxiliary tables A_1 and A_2 storing the q-grams of the strings s = 'abdgh' and t = 'abcbcdeghgh'. Query Q1 in Fig. 5 is the SQL query to find the q-gram distance between the strings s and t.
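Query Q1 appears only in Fig. 5, so its exact SQL is not available here. The following sqlite3 sketch shows one plausible formulation over the on-the-fly auxiliary tables A_1 and A_2 (the SQL text is our assumption, not the authors' Q1): each table's q-grams are grouped into counts, and the L1 distance between the two profiles is summed inside the database.

```python
import sqlite3

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_distance_sql(s, t, q=2):
    """q-gram distance computed inside SQLite using on-the-fly
    auxiliary tables A1 and A2, in the spirit of query Q1."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE A1(qgram TEXT)")
    con.execute("CREATE TABLE A2(qgram TEXT)")
    con.executemany("INSERT INTO A1 VALUES (?)", [(g,) for g in qgrams(s, q)])
    con.executemany("INSERT INTO A2 VALUES (?)", [(g,) for g in qgrams(t, q)])
    # L1 distance between the two q-gram profiles: |c1 - c2| for grams
    # seen in A1, plus the counts of grams that occur only in A2.
    row = con.execute("""
        WITH p1 AS (SELECT qgram, COUNT(*) AS c FROM A1 GROUP BY qgram),
             p2 AS (SELECT qgram, COUNT(*) AS c FROM A2 GROUP BY qgram)
        SELECT COALESCE(SUM(d), 0) FROM (
            SELECT ABS(p1.c - COALESCE(p2.c, 0)) AS d
            FROM p1 LEFT JOIN p2 ON p1.qgram = p2.qgram
            UNION ALL
            SELECT c FROM p2
            WHERE qgram NOT IN (SELECT qgram FROM p1)
        )
    """).fetchone()
    con.close()
    return row[0]
```

For the strings of Fig. 5, s = 'abdgh' and t = 'abcbcdeghgh', the common bigrams are 'ab' and one occurrence of 'gh', giving a distance of 10.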

B. PROPOSED SOLUTION TO CALCULATE THE DISTANCE BETWEEN TWO STRINGS
The problem with using the q-gram distance in its pure form is that the user may query only a part of the original string present in the table. For example, she may query 'Harry Potter' instead of 'Harry Potter and the Order of the Phoenix' while searching for movie titles. Such extra uncommon parts of one of the strings increase the distance. To overcome this problem, we introduce the concepts of local q-gram distance and global q-gram distance. We define them as follows: Definition 4.1 (Local q-Gram Distance): The local q-gram distance is the q-gram distance between the smaller-length string and one of the substrings of the same length of the larger string.
The larger string (say y) among the two strings to be matched is divided into the set of substrings y_i, where the length of each substring is equal to the length of the smaller string (say x). Let l_x and l_y be the lengths of strings x and y, respectively, and let n = l_y − l_x + 1, which is the number of possible substrings y_i of length l_x of string y. The local q-gram distances d_i between x and each y_i are calculated using query Q1. The proposed net distance between the strings x and y, termed the 'global q-gram distance', is derived from their local q-gram distances.

Definition 4.2 (Global q-Gram Distance):
The global q-gram distance between strings x and y is proposed to be calculated as

d_global(x, y) = min(d_1, d_2, . . . , d_n) + (l_y − l_x)/l_x + (Σ_{i=1}^n d_i) / maxdist(x, y)

Here d_1, d_2, . . . , d_n are the local q-gram distances between the two strings, calculated as mentioned above. The next two terms are fine-tuning terms that help to calculate a more accurate distance. maxdist(x, y) is the sum of the maximum distances possible between x and each of the y_i, and therefore maxdist(x, y) = 2n(l_x − q + 1). Now we explain the significance of each term in the global q-gram distance formula proposed above. The overall aim is to reduce the increased distance between two strings due to the missing part of the query string, the lacuna in the naive solution mentioned earlier. We first note that the first string may be contained anywhere in the second string. Hence, to minimize the distance between the two strings, the minimum of the local q-gram distances, min(d_1, d_2, . . . , d_n), is taken as the first term of the formula. But it can be quickly seen that this is not enough. Even though d('aaa', 'aaa') is 0 and d('aaa', 'aaaaa') is also 0 (for q = 2), the first string should get precedence over the second in terms of similarity. Hence, there is a need for a penalty for the difference in the relative lengths of the strings, which is computed by (l_y − l_x)/l_x as the second term of the formula. The higher the length difference between the matched strings, the higher the penalty. Now a penalty with a fractional value (0.667) gets added in the calculation of d('aaa', 'aaaaa'), whereas d('aaa', 'aaa') remains at 0. Even then, d('aaa', 'aaaaa') is 0.667 and d('aaa', 'baaad') is also 0.667. We believe that 'aaa' is closer to 'aaaaa' than to 'baaad', as 'aaa' matches more often in 'aaaaa' than in 'baaad'. To factor in this observation, the third term, (Σ_{i=1}^n d_i)/maxdist(x, y), is added in the calculation of the global q-gram distance.
It denotes the ratio of the summation of the actual distances of the substrings, Σ_{i=1}^n d_i, to the maximum possible distance between the strings x and y, maxdist(x, y). This ratio measures the cumulative dissimilarity between the two strings relative to the maximum distance. The distance is maximum when there is no common character or q-gram between the strings x and y. In the example discussed above, the sum of the three terms with respect to 'baaad' (0 + 0.667 + 0.333 = 1) is more than that with respect to 'aaaaa' (0 + 0.667 + 0 = 0.667) due to this third term, and hence 'aaaaa' is considered a better match for 'aaa' than 'baaad', as required. Table 3 depicts some sample strings and their distances when the formula is applied in a part-wise manner. The distances calculated using the global distance formula in the table justify the significance of each term of the formula. Let AUX be an auxiliary table used to store the local q-gram distances. To calculate the global q-gram distance, we execute query Q2 in Fig. 6 on the AUX table. Before executing Q2, we need to set constants like q, l_x, l_y, and maxdist using MySQL commands, as shown in Fig. 6.
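The global q-gram distance can also be sketched directly in code (a standalone illustration using the profile-based local distance of Definition 3.2; the paper itself computes the local distances with queries Q1 and Q2 instead):

```python
from collections import Counter

def qgram_profile(s, q=2):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(s, t, q=2):
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    return sum(abs(ps[g] - pt[g]) for g in set(ps) | set(pt))

def global_qgram_distance(x, y, q=2):
    """Global q-gram distance: minimum of the local distances, plus a
    relative-length penalty, plus the normalized sum of local distances."""
    if len(x) > len(y):
        x, y = y, x                       # x is the smaller string
    lx, ly = len(x), len(y)
    n = ly - lx + 1                       # substrings of y of length lx
    d = [qgram_distance(x, y[i:i + lx], q) for i in range(n)]
    maxdist = 2 * n * (lx - q + 1)        # maximum possible cumulative distance
    return min(d) + (ly - lx) / lx + sum(d) / maxdist
```

This reproduces the worked example: d('aaa', 'aaaaa') = 0 + 0.667 + 0 ≈ 0.667, while d('aaa', 'baaad') = 0 + 0.667 + 0.333 = 1, so 'aaaaa' ranks closer to 'aaa' than 'baaad'.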
For other data types like numeric or date, a simple absolute difference of the values would be the distance between them. For example, a unit of distance between two date values is the number of days separating them, like d ('1973-12-30', '1973-10-15') = 76 days. In this way, approximate matching is performed for all the columns involved in uncertain predicates and corresponding distance arrays are obtained.
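The date distance above can be reproduced with the standard datetime module (the function name is ours):

```python
from datetime import date

def date_distance(d1: str, d2: str) -> int:
    """Absolute difference between two ISO-format dates, in days."""
    return abs((date.fromisoformat(d1) - date.fromisoformat(d2)).days)
```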

C. COMPLEXITY ANALYSIS
In the case of the naive solution, the q-gram distance between strings x and y can be computed in O(|x| · |y|) time with O(|Q(y; q)|) storage, where Q(y; q) denotes the set of q-grams occurring in string y, assuming l_y > l_x.
The overhead due to the creation of substrings in the global q-gram distance increases the time complexity to O(|x|² · |y|), but the memory requirement reduces to O(|Q(x; q)|). This overhead comes with improved accuracy of the system.

D. VARIANTS OF '≈' OPERATOR FOR NUMERIC DATA
Along with the '≈' operator, we introduce the '≈+' and '≈−' operators. The operator '≈+' indicates the higher-the-better case, where a higher value is preferable. For example, if we are searching for books with a discount nearly equal to 20%, it is better to show books with a higher discount at the top. Similarly, '≈−' indicates the lower-the-better case. In our book example, we were searching for books with a price nearly equal to 400; here, it is better to show books with lower prices at the top. For the '≈' operator we take the absolute difference between the values, but in the case of '≈+' we take the signed difference (constant − column_value), and in the case of '≈−', (column_value − constant).
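A small sketch of the three distance conventions (the function name and calling convention are ours; tuples are ranked by ascending distance, so a more negative signed distance ranks higher):

```python
def uncertain_distance(value, constant, variant=None):
    """Distance under '≈' (variant=None), '≈+' (variant='+'),
    and '≈-' (variant='-'). Smaller distance ranks higher."""
    if variant == "+":            # higher-the-better: (constant - value)
        return constant - value
    if variant == "-":            # lower-the-better: (value - constant)
        return value - constant
    return abs(value - constant)

# Ranking discounts around 20% with '≈+': higher discounts come first.
discounts = [10, 20, 30]
ranked = sorted(discounts, key=lambda v: uncertain_distance(v, 20, "+"))
```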

E. NORMALIZATION OF DISTANCE ARRAY
All distance arrays may differ in their range of values. For uniformity, each distance array is normalized to the range [0, 1], and the probability array is obtained using the following formula:

p_i = 1 − d_i / max

The probability of a match is equal to one minus the normalized score, because d = 0 indicates an exact match, whereas, in the probability notion, p = 1 indicates an exact match. When we get the distance array for a data type other than string, we take the highest element in the array as the maximum calculated distance (max). In this way, the probability array for each of the columns involved in uncertain predicates is calculated.

F. FINAL PROBABILITY CALCULATION
A query may have multiple uncertain predicates combined with and and or operators. Moreover, each uncertain predicate is an independent probabilistic event, which means the occurrence of one predicate does not affect the probability of the others. Therefore, the conditional probability of a predicate A, given a predicate B, is equal to the probability of predicate A, i.e., p(A|B) = p(A). Also, the precedence of the and operator is higher than that of the or operator. The probability of n predicates combined with the and operator is first calculated using the following formula:

p(A_1 and A_2 and . . . and A_n) = Π_{i=1}^n p(A_i)

The probability of m predicates combined by the or operator is calculated using the principle of inclusion-exclusion, which, for independent events, reduces to:

p(A_1 or A_2 or . . . or A_m) = 1 − Π_{i=1}^m (1 − p(A_i))

In the case of a complex expression containing multiple and and or operators, operations are performed based on operator precedence. Parentheses can be used to override the default precedence; for nested parentheses, the innermost parenthesis is evaluated first.
For example, consider a query with four uncertain predicates combined with one and and two or operators, applied sequentially without explicitly resolving the precedence. After performing approximate matching, let 0.7, 0.6, 0.4, and 0.5 be the probabilities of the corresponding four columns for the resultant tuple t_1. Then, evaluating the and first, the final probability of tuple t_1 is 1 − (1 − 0.7 × 0.6)(1 − 0.4)(1 − 0.5) = 0.826. In the next section, we summarize the entire methodology in the form of a set of procedures. We discuss the implementation details and the datasets used for the experimentation as well.
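The probability combination for independent predicates can be sketched as follows (function names are ours); the example evaluates (p_1 and p_2) or p_3 or p_4 with the probabilities above:

```python
from math import prod

def p_and(probs):
    """Conjunction of independent predicate probabilities."""
    return prod(probs)

def p_or(probs):
    """Disjunction via inclusion-exclusion; for independent events
    this reduces to 1 - product of (1 - p_i)."""
    return 1 - prod(1 - p for p in probs)

# Four predicates combined as (p1 AND p2) OR p3 OR p4, and evaluated first.
final = p_or([p_and([0.7, 0.6]), 0.4, 0.5])
```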

V. IMPLEMENTATION, DATASETS AND EXPERIMENTATION
To support the uncertain predicate operator '≈', the query parser should recognize the newly introduced '≈' operator. For experimentation, we parse the input query in the C code itself to check whether the given query contains the '≈' operator. If it does, we first filter out tuples by executing a query with the certain predicates. Afterward, we apply the uncertain predicates to the filtered result-set by invoking the procedure UncertainPredicate given in Procedure 1.

Procedure 1 UNCERTAINPREDICATE(uncertain-query)
1: Let m be the number of tuples remaining after applying all predicates except the uncertain ones
2: Retrieve the columns over which uncertain predicates are specified, and let n be the number of those columns
3: Let dist[1..m, 1..n] be a matrix storing the distances of the m tuples for the n attributes
4: for i = 1 to n do

The distance between the constant mentioned in the query and the corresponding column value is calculated based on the data type of the column. The type of a column is determined using the typeOf() function. The procedure CalculateDistanceForString() is used for columns of string type; for other data types, CalculateDistanceForOther() is called.
The matrix dist[m][n] stores the calculated distances for the m tuples and n columns. The NormalizeDistance() function normalizes the distance array of every column to the [0, 1] range. DeriveProbability() then derives the probability of each tuple using the probabilistic inference rules discussed in the previous section.
In Procedure 2, the value of 'q' is obtained using the getValueOfQ() function. Using the calGlobalQgramDist() function, we calculate the global q-gram distance; it issues query Q2 to the database, which itself makes use of query Q1.
In Procedure 3, the calDaysDifference() function calculates the distance in days for a column of type date. The distance for the time data type is calculated in minutes using calMinsDifference(). For the numeric data type, the distance is the difference of the values, calculated using the calDifference() function.
For experimentation, we used the MySQL database. The MySQL C API library (libmysqlclient, the client version of the library that C applications use) is used to communicate with the MySQL server. This C API offers several data structures for managing the database connection and storing query results, such as MYSQL, MYSQL_RES, and MYSQL_ROW. It also provides various functions such as mysql_init() for initializing the MYSQL structure, mysql_real_connect() for connecting to the MySQL server, mysql_query() for executing a query, and many more [46].

A. DATABASES USED FOR TESTING
The IMDb movie [47] and European football [48] databases were used for testing. The IMDb database contains information about 5000+ movies in a single relation named 'Movies'. The original relation had a large number of movie-related attributes; our version of the 'Movies' relation retains only 11 important ones: Movie_Title, Director_Name, Duration, Actor_1_Name, Actor_2_Name, Actor_3_Name, Gross, Genres, Budget, Title_Year, and IMDb_Score.
The football database contains information on 11,000+ players and their attributes, 25,000+ matches, and 11 leagues. The 'Player' relation stores a player's personal information, while his attribute values are stored in the 'Player_Attributes' relation. The attributes in 'Player' are id, player_api_id, player_name, player_fifa_api_id, birthday, height, and weight. 'Player_Attributes' has 42 columns covering all attributes of a football player; its player_api_id and player_fifa_api_id attributes are foreign keys referring to the identically named attributes in the 'Player' relation. Some fields in these databases were missing or incorrect; our corrected versions are available at [49].

VI. RESULTS AND DISCUSSION
In this section, we present the results of different experiments. As part of the experimentation, we compared the performance of the naive approach with that of the proposed solution. We also compared the proposed global q-gram distance with the classic edit-based distances, analyzed the effect of the value of q on accuracy, and present results for the additionally proposed operators (≈+ and ≈−).

A. NAIVE SOLUTION VS. PROPOSED SOLUTION
Here we compare the performance of the naive solution described in Section IV-A with the proposed solution of Section IV-B. One sample query on the IMDb movie database is shown in Fig. 7, and Table 4 shows the results of both approaches on some random strings. The queried strings and the expected strings appear in the first two columns of the table; the third and fourth columns show the results of the naive approach and the proposed technique, respectively. The naive method fails when there is a significant mismatch between the lengths of the two strings. The proposed method incurs the overhead of matching individual substrings, which increases the complexity from O(|x| · |y|) to O(|x|² · |y|). The actual increase in execution time is depicted in Fig. 8. The Movies relation contains 5043 tuples, and each execution time shown in Fig. 8 is the average of 10 runs. Averaging the execution times of all the experiments (5 × 10 = 50 runs) shows a 46.3% increase in execution time. This trade-off, however, comes with the improved accuracy shown in Table 4 and Fig. 8.

B. EDIT BASED DISTANCE VS. GLOBAL Q-GRAM DISTANCE
As mentioned earlier, edit-based distances fail to identify similarity between strings with swapped words (e.g., 'John Watson' vs. 'Watson, John'). Here, we illustrate this further with a few examples. In query 1 (see Fig. 9), we intend to retrieve a film from the Harry Potter series whose IMDb score is near 7.0, but which we (wrongly) suppose was directed by 'David Bates'. In reality, several films of this series were directed by David Yates (not Bates). Tables 5-7 show the correct movies at the top. We tried the OSA and LCS methods from the edit-based category and obtained the intended results for all methods, as shown in the tables mentioned above.
In query 2 (see Fig. 10), we consider the case of swapped words. Here we queried for titles similar to 'Hunters of Ghost' in which an actor named 'Steve' has acted. The real name of the movie is 'Ghost Hunters', and the complete name of the actor is 'Steve Gonsalves'. The IMDb database has three columns storing the names of three actors, and we are not sure which of them contains 'Steve'; notice that we used parentheses to override the default precedence of the operators and to associate these three conditions. In the q-gram distance, a set of all q-grams of each string is formed, so the order of the substrings inherently does not affect the distance. Thus, the q-gram method is expected to give the intended result at the top. OSA and LCS, on the other hand, perform sequential matching: even if two strings consist of exactly the same substrings, the distance is not zero when their order is swapped.

C. ADDITIONAL OPERATORS
To illustrate the utility of the proposed uncertain operator '≈', two additional operators ('≈+' and '≈−') are proposed as an extension. We conducted several experiments to test the practicability of these operators: we executed the same query first with '≈' and then with '≈+'/'≈−' and examined the results. In query 3 (see Fig. 11), we use our system as a predictor to find the best striker in the football database. Several attributes represent a player's skills and performance; the key attributes for any striker are finishing, dribbling, acceleration, composure, and pace. To obtain the players' names, we joined the 'Player' and 'Player_Attributes' relations on the 'player_api_id' attribute. Table 11 shows the result of query 3, with today's leading strikers at the top.

In query 4 (see Fig. 12), we demonstrate the use of the '≈+' operator: players with better finishing, acceleration, and dribbling scores should be ranked first. Table 12 shows the result of query 4, where Messi has now jumped to the top of the table.
In query 5 (see Fig. 13), we try to find the highest-grossing movies. The gross amount should therefore exceed the constant in the condition by as much as possible, and the budget should be as low as possible. The gross and budget values specified in the conditions are the averages over all movies. Table 13 shows the results of query 5 and indicates Avatar as the highest-grossing movie of that period.

D. EFFECT OF THE VALUE OF q
Our experiments show that a small value of q is more likely to give better results. According to [9], the optimal value is q = log_{|Σ|} |s|, where Σ is the finite alphabet and |s| is the string length; therefore, q = 2 or 3 is more favorable. The results are not reliable for q = 1, which is expected: with a q-gram size of 1, no q-grams overlap and no ordering of characters is checked.
Unlike traditional database queries, Information Retrieval (IR) techniques retrieve results based on the matching of a set of keywords from the query string. Returned web pages need not match all the keywords; they may match only part of them. Moreover, IR techniques always return a result set sorted by degree of matching. Thus, IR techniques differ from database queries in two respects: approximate matching, and ranking of the result set with the most promising results at the top. By supporting uncertain predicates in this work, we are, in a way, bringing these same features to traditional databases.

VII. CONCLUSION AND FUTURE WORK
In this paper, a database management system supporting uncertain predicates is proposed. The similarity between string values is measured in terms of modified q-gram distances, i.e., the local q-gram distance and the global q-gram distance, and a novel formula for calculating these distances is proposed and explained in detail. A set of sample queries indicates how this can be beneficial. Probabilities of the filtered tuples are derived from the normalized distance arrays and treated as a confidence score for each tuple. Finally, the result set is ranked in decreasing order of probability, placing the most likely matches at the top. As an extension, we introduced two additional variants of the uncertain predicate that add to the practicability of the system, and we demonstrated its significance through various sample queries. We believe our system will be helpful for users having only minimal or superficial knowledge of the data actually present in the database. The proposed global q-gram distance may also find application in other fields such as record linkage, data deduplication, and DNA sequence alignment.
The output of our proposed system is a probabilistic relation: a derived column named 'prob' is appended to the result set. A concrete system usable on a wider scale could be implemented by incorporating all proposed procedures into an open-source database system. For further query processing, the query evaluation techniques of probabilistic databases can also be utilized to process the resulting uncertain relations.