Comparative Analysis of Relational Database Watermarking Techniques: An Empirical Study

Digital watermarking is considered one of the most promising techniques to verify the authenticity and integrity of digital data. It is used for a wide range of applications, e.g., copyright protection, tamper detection, traitor tracing, maintaining the integrity of data, etc. In the past two decades, a wide range of algorithms for relational database watermarking has been proposed. Even though a number of surveys exist in the literature, they are unable to provide insightful guidance to choose the right watermarking technique for a given application. In this paper, we provide an exhaustive empirical study and thorough comparative analysis of various relational database watermarking techniques in the literature. Our work is different from the existing survey papers as we consider both distortion-based and distortion-free techniques along with a rigorous experimental analysis demonstrating a detailed comparison on robustness, data usability, and computational cost with considerable empirical evidence.


I. INTRODUCTION
Digital watermarking is considered one of the most promising techniques to verify the authenticity and integrity of digital data. It is used for a wide range of applications, e.g., copyright protection, tamper detection, traitor tracing, and maintaining the integrity of data. For several decades, relational databases have been at the heart of many information systems. As they contain crucial information, they must be protected before being shared over the internet. Although encryption is used to protect the data stored in a relational database from being accessed by individuals with malicious intent, it is very restrictive in nature. Since the first proposal in 2000 in [1], which used a digital watermark to protect a database of map information, various relational database watermarking techniques have been proposed in the literature. Among them, the first and most significant one was proposed by Agrawal and Kiernan in [2]. Database watermarking techniques embed a piece of information (known as the watermark) in the underlying data and extract it later from any suspicious content in order to verify the absence or presence of any possible attacks. The former phase is known as the Embedding phase, whereas the latter phase is known as the Detection or Verification phase. In general, these database watermarking techniques are classified as (i) distortion-based techniques that embed the watermark into the underlying content of the data and (ii) distortion-free techniques that generate the watermark based on various characteristics of the data.
A number of survey papers [3]-[11] already exist in the literature, which provide a comprehensive summary of different techniques and their comparison. The authors in [3] elaborate on the features of relational databases, the applications of digital watermarking, and the attack analysis of the distortion-based and distortion-free watermarking techniques existing at that time. Surveys of reversible watermarking approaches are presented in [4], [5]. A holistic study of distortion-based watermarking techniques is presented in [6]. A recent survey on multimedia and database watermarking is reported in [7] where, in addition to different multimedia artifacts, a comparative summary of only nine existing database watermarking techniques is presented. Other significant surveys of relational database watermarking include [8]-[11].
Despite this, the existing survey papers do not provide the following insights, which are needed to choose the right watermarking technique for a given application: (i) what the criteria should be for comparing different categories of watermarking techniques, (ii) how to show empirically that a particular watermarking technique is better than the others, and (iii) adequate emphasis on distortion-free techniques.
To fill this knowledge gap and to give users well-informed guidance for choosing the right watermarking technique, in this paper we provide an exhaustive empirical study and thorough comparative analysis of various relational database watermarking techniques. Our work is different from the existing survey papers as we consider both distortion-based and distortion-free techniques along with a comprehensive experimental analysis of robustness, data usability, and computational cost, and their comparisons with considerable empirical evidence.
In order to achieve these objectives, our major contributions in this paper are as follows:
1) We classify the distortion-based and distortion-free techniques into various categories on the basis of the algorithmic steps adopted as well as the type of the watermark information used in the algorithm.
2) We perform an empirical study on a selected number of algorithms, each representing the class of algorithm it belongs to. In particular, we perform a rigorous experimental analysis demonstrating a detailed comparison on robustness, data usability, and computational cost.
3) Our empirical analysis gives users well-informed guidance for choosing the right watermarking technique.
The structure of the rest of the paper is as follows: Section II explains the research methodology we adopted. Sections III and IV provide the detailed comparative performance analysis of distortion-based and distortion-free algorithms, respectively. Section V discusses our evaluation results w.r.t. the existing experimental observations. Section VI provides guidance to the users for choosing the right watermarking technique for a given application. Finally, we conclude our work in Section VII.

II. RESEARCH METHODOLOGY

A. PRIMARY STUDY SELECTION
We perform the primary study by searching the major online scientific repositories (depicted in Table 1) using the following search queries: "relational database watermarking", "watermarking of relational databases", and "copyright protection of relational databases". In all cases, we set as a filter the years from 2002 to 2022. We carefully analyze each and every publication obtained in the search result by following the inclusion and exclusion criteria mentioned in the subsequent subsection.

B. INCLUSION/EXCLUSION CRITERIA
In this study, we consider research works published in journals, conferences, symposia, or workshops, and we exclude other kinds of works such as books, newsletters, magazines, technical reports, Ph.D. theses, and undergraduate/master project documents. These criteria are depicted in Table 2.

Table 2: Inclusion and exclusion criteria.
Inclusion:
• It is a journal, conference, symposium, or workshop paper and the title, keywords, and abstract explicitly indicate that the paper is related to relational database watermarking.
Exclusion:
• The research work relates to the watermarking of other databases such as XML, JSON, etc.
• It is a patent.
• It is not published in the English language.
• It is a Ph.D. thesis or undergraduate/master project document.
• It is a book, newsletter, magazine, or technical report.

As these search results overlap, we remove the duplicate entries and obtain 416 publications. Finally, after applying the inclusion and exclusion criteria, we obtain 94 publications that we consider in our paper. The summary of the articles by type of publication and the temporal trend of these research publications under consideration are depicted in Table 3 and Figure 1, respectively. We analyze these 94 papers on the basis of a brief overview of the watermarking technique, the data set used in the experiment, and the attacks performed.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

The distortion-based techniques are classified as:
• Meaningless bit-pattern as the watermark.
• Virtual primary key based.
• Image as the watermark.
• Fake tuple or fake attribute insertion.
• Other meaningful watermark information.
The distortion-free techniques, in turn, are classified as:
• Permutation of tuples.
• Conversion of the database into binary form.

E. RESULTS
We examine the motivations, contributions, and future works of the papers that passed the quality assessment. For the experimental analysis, we select one algorithm in each category that satisfies any one of the following criteria:
1) Criterion 1: Select the pioneering work if the other recent works are minor variants of it and offer no significant improvement.
2) Criterion 2: If there is a significant improvement in a recent work compared to the previous works, then select the recent one.
3) Criterion 3: Select a work in a category if it is published in a venue having a higher CORE ranking and h5-index.
Let us discuss the research works under each category of distortion-based and distortion-free techniques in detail.

1) Distortion-based techniques:
The distortion-based watermarking techniques are classified on the basis of the algorithmic steps adopted as well as the type of the watermark information, as described below.

a: Meaningless bit-pattern as the watermark
Authors in [2], [12]-[20] propose watermarking algorithms that embed a meaningless bit pattern as the watermark into the data set. Authors in [2] propose an algorithm in which a hash function is used to decide the marking of a particular tuple. Authors in [12], [14]-[16], [21] extend the proposal of [2]. For example, in [12] a pseudo-random number generator is used instead of a hash function. In [14], a chaotic random number generator is used instead of the hash value. Gupta et al. in [15] extend the proposal of Agrawal et al. in [2] and propose a reversible watermarking algorithm. Authors in [16] use an approach similar to [2], but instead of flipping the least significant bits (LSBs) they embed random digits (0 to 9) at the LSB of the attribute values. Authors in [17] apply data flow analysis to identify the variant and non-variant parts of the relational database, and then apply the watermarking algorithm in [2] to embed the watermark.
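The hash-driven marking idea described for [2] can be sketched as follows. This is only an illustrative sketch and not the exact construction of [2]: HMAC-SHA256 as the keyed hash, the way the hash value is split into attribute index, bit index, and mark bit, and all function names are assumptions made for this example. Here `gamma` controls the fraction of marked tuples (roughly one in `gamma`), `nu` is the number of candidate attributes, and `xi` is the number of least significant bits available for marking.

```python
import hashlib
import hmac


def keyed_hash(key: bytes, primary_key: str) -> int:
    """Keyed hash of a tuple's primary key as a non-negative integer (HMAC-SHA256 is an assumption)."""
    digest = hmac.new(key, primary_key.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def mark_position(key, primary_key, gamma, nu, xi):
    """Return (attribute index, bit index, mark bit) if the tuple is selected, else None."""
    h = keyed_hash(key, primary_key)
    if h % gamma != 0:          # roughly one tuple in gamma is selected
        return None
    attr = (h // gamma) % nu    # which attribute to mark
    bit = (h // (gamma * nu)) % xi          # which of the xi LSBs to mark
    mark = (h // (gamma * nu * xi)) % 2     # value written into that bit
    return attr, bit, mark


def embed(rows, key, gamma, nu, xi):
    """rows maps primary key -> list of non-negative integers; returns a marked copy."""
    marked = {}
    for pk, values in rows.items():
        values = list(values)
        pos = mark_position(key, pk, gamma, nu, xi)
        if pos is not None:
            attr, bit, mark = pos
            values[attr] = (values[attr] & ~(1 << bit)) | (mark << bit)
        marked[pk] = values
    return marked


def detect(rows, key, gamma, nu, xi):
    """Return (matching marks, selected tuples); a high ratio suggests the watermark is present."""
    matches = total = 0
    for pk, values in rows.items():
        pos = mark_position(key, pk, gamma, nu, xi)
        if pos is None:
            continue
        attr, bit, mark = pos
        total += 1
        matches += ((values[attr] >> bit) & 1) == mark
    return matches, total
```

Because only the secret key determines which tuples, attributes, and bits are touched, detection needs neither the original data nor the watermark itself; it only recomputes the positions and counts the matches.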

b: Virtual Primary Key based
In most of the watermarking algorithms, it is assumed that the primary key exists and is not distorted by the attackers. However, this may not always be true. To deal with this situation, various techniques have been proposed to generate and use a Virtual Primary Key (VPK) instead of a primary key. An extension of the proposal in [2], named the S-scheme, is described in [22]. In the S-scheme, one attribute is used to generate the VPK and the remaining attributes are used for watermark embedding. Authors in [22] propose the E-scheme and the M-scheme. VPK generation in the E-scheme is similar to that in the S-scheme, but it considers all of the attributes. The M-scheme considers more than one attribute per tuple to generate the VPK. In this approach, different attributes are selected each time, which makes it more resilient to the deletion problem. Other approaches based on virtual primary keys are proposed in [23]-[27]. The approaches in [23], [24] are similar to the M-scheme: two attributes having hash values near zero are considered. In [23], the textual attributes are considered. The VPK is generated based on two numeric attributes in [24]. Different attributes are selected each time in [25]. The HQR-Scheme [26] generates one VPK per tuple based on the cyclic model of the attribute.

c: Image as the watermark
Various watermarking approaches [28]-[36] embed images as the watermark into relational databases. All these approaches first group the tuples and then embed the bit string of the image watermark. Authors in [28] insert a binary image watermark into a relation. In the case of text data, the carriage return character represents 1 and the linefeed character represents 0 of the watermark bits. In the case of numeric data, the watermark bits are embedded in the LSBs of the attribute value.
In most of the techniques, the partitioning of tuples is based on hashing. However, in the case of Huang et al. in [29], the tuples are clustered into equivalence classes by using the k-means algorithm. The parity of the watermark bit is compared with the LSB of the candidate attribute for embedding the watermark bit. The location of the embedded watermark is assured by the clustering method. In [30], the original image of size N × N is converted into a binary string of length L = N × N. The tuples are grouped into L groups based on the hash function, and the i-th bit of the binary string is embedded into the bit positions of a fixed attribute in the i-th group. The authors in [31] follow the same algorithm as in [30], but they neither use a fixed attribute nor consider the order of the image when computing the bit position. After marking, the usability constraints are also checked. The approach in [32] is also similar to [31]. The difference is that the image is divided into two parts: header and image data. The header is used for the grouping of the tuples, and the image data is converted into a binary string and embedded into these groups.
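The grouping-and-embedding step described for [30] can be illustrated with a short sketch. It is written under several assumptions that are not taken from [30]: SHA-256 over the secret key and the primary key as the grouping hash, a single bit position `bit_pos` in the fixed attribute, majority voting during extraction, and all function names.

```python
import hashlib


def image_to_bits(image):
    """Flatten an N x N binary image (list of rows of 0/1) into a bit list of length N * N."""
    return [pixel for row in image for pixel in row]


def group_of(key: bytes, primary_key: str, num_groups: int) -> int:
    """Assign a tuple to one of num_groups groups using a keyed hash of its primary key."""
    digest = hashlib.sha256(key + primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_groups


def embed_image(rows, image, key, attr, bit_pos=0):
    """Write the i-th image bit into bit `bit_pos` of attribute `attr` for every tuple of group i."""
    bits = image_to_bits(image)
    marked = {}
    for pk, values in rows.items():
        values = list(values)
        b = bits[group_of(key, pk, len(bits))]
        values[attr] = (values[attr] & ~(1 << bit_pos)) | (b << bit_pos)
        marked[pk] = values
    return marked


def extract_image(rows, n, key, attr, bit_pos=0):
    """Recover the N x N image by majority vote inside each group (voting is an assumption)."""
    ones = [0] * (n * n)
    count = [0] * (n * n)
    for pk, values in rows.items():
        g = group_of(key, pk, n * n)
        ones[g] += (values[attr] >> bit_pos) & 1
        count[g] += 1
    bits = [1 if 2 * ones[g] > count[g] else 0 for g in range(n * n)]
    return [bits[i * n:(i + 1) * n] for i in range(n)]
```

Since every tuple of a group carries the same bit, the image can still be recovered after some tuples are deleted or altered, as long as the majority of each group remains intact.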

d: Partitioning based
The partitioning based watermarking techniques [37]-[44] partition the data into various groups and embed the watermark into these groups independently. In [37], a marker tuple is used for partitioning and one watermark bit is embedded into one group while maintaining the usability constraints. In [38], instead of a marker tuple, a hash function is used for partitioning the tuples into groups, and in each group, the watermark bit is embedded by altering the group statistics while satisfying the usability constraints. In [40] also, a hash function is used for partitioning. The changes are minimized by selecting a few tuples for watermarking, and the watermark bit (generated from the date-time) is embedded in each of the selected tuples. In [43], the tuples are partitioned and in each partition two types of watermark, an attribute watermark W1 and a tuple watermark W2, are embedded.

e: Fake tuple/attribute insertion
The watermarking techniques in this category insert a new tuple or a new attribute into the database relation as a watermark. In [45], probability distributions are used to determine the properties of the new tuple inserted as a watermark. In [46], a new attribute is inserted into the existing relation. Parity checks are calculated from each attribute and appended to that attribute. The new attribute also has a value from the aggregate function of any of the attributes for all tuples.

f: Fingerprinting Techniques
A fingerprint is a piece of meaningful information, e.g., a social security number, that is used as a watermark. Authors in [48] extend the proposal of [2] but embed a fingerprint of length L computed from a hash function that takes a secret key K and a user identifier n as input. Liu et al. in [47] propose a block-oriented fingerprinting technique. A hash function based on a secret key and the buyer's ID is used to generate the fingerprint. Authors in [49] propose a twice-embedding watermarking scheme. In the first process, the fingerprint value is used to select the position and the embedding value for every group. In the second process, a pattern is embedded using the fingerprint as the secret key. Authors in [22] also extend the proposal of [2], but they embed a fingerprint instead of a meaningless bit pattern, and they propose the E-scheme and the M-scheme for constructing the virtual primary keys. In [50], watermarking is based on integer linear programming constraint solving. In [51], a buyer's "thumb impression" is used for embedding the fingerprints.
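As an illustration of deriving a fingerprint of length L from a secret key K and a user identifier n, as described for [48], the following sketch takes the first L bits of a keyed hash. HMAC-SHA256, the bit ordering, and the limit of 256 bits are assumptions of this example and are not taken from [48].

```python
import hashlib
import hmac


def fingerprint(secret_key: bytes, user_id: str, length: int) -> list:
    """Return the first `length` bits of HMAC-SHA256(secret_key, user_id) as a list of 0/1."""
    if not 0 < length <= 256:
        raise ValueError("length must be between 1 and 256 bits")
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).digest()
    value = int.from_bytes(digest, "big")
    # Bit 0 of the fingerprint is the most significant bit of the digest.
    return [(value >> (255 - i)) & 1 for i in range(length)]
```

Each buyer receives a copy marked with a different fingerprint, so a leaked copy can be traced by recomputing the fingerprint of every known user identifier and comparing it with the bits extracted from the leaked data.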

g: Other Meaningful Watermark Information based
In [41], the database tuples are partitioned based on a hash function, and meaningful information is embedded in a single attribute as the watermark. Authors in [54] use a pseudorandom sequence number to determine the attribute and bit position where the watermark is to be embedded. Similarly, [52], [53], [55], [56] also embed meaningful information as the watermark.
A brief overview of different distortion-based watermarking techniques within each category is depicted in Tables 4, 5, and 6.

2) Distortion-free techniques:
The distortion-free techniques can be classified into the following categories:

a: Permutation of tuples
In these techniques, the order of the tuples is arranged based on secret parameters without causing any distortion in the data values. The significant tuple-reordering based watermarking proposals are [57]-[61]. In [57], some secure parameters are used to partition the tuples into groups. The order of two tuples is changed based on the hash values of the tuples and the watermark bit. In [62], the value of some critical attribute(s) is used to re-order the tuples relative to a secret initial order, e.g., ascending. The schemes proposed in [58]-[61] are similar to the approach in [57], as they also partition the tuples into groups and re-order the tuples within a group according to the watermark.
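The idea of encoding watermark bits in the relative order of two tuples can be sketched as follows. This is a simplified illustration rather than the scheme of [57]: the partitioning into groups is omitted, SHA-256 over the key and the tuple's textual representation is an assumed hash, and the function names are hypothetical.

```python
import hashlib


def tuple_hash(key: bytes, row: tuple) -> int:
    """Keyed hash of a whole tuple (the choice of SHA-256 and repr() is an assumption)."""
    data = key + repr(row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def embed_order(rows, bits, key):
    """Sort rows by keyed hash, pair neighbours, and order each pair to encode one bit."""
    ordered = sorted(rows, key=lambda r: tuple_hash(key, r))
    out = []
    for i in range(0, len(ordered) - 1, 2):
        low, high = ordered[i], ordered[i + 1]      # hash(low) <= hash(high)
        bit = bits[(i // 2) % len(bits)]
        # Bit 1: larger hash first; bit 0: smaller hash first.
        out.extend([high, low] if bit else [low, high])
    if len(ordered) % 2:
        out.append(ordered[-1])                     # unpaired tuple carries no bit
    return out


def extract_order(rows, key):
    """Read one bit from each consecutive pair by comparing the keyed hashes."""
    return [1 if tuple_hash(key, rows[i]) > tuple_hash(key, rows[i + 1]) else 0
            for i in range(0, len(rows) - 1, 2)]
```

No attribute value is modified, which is what makes the approach distortion-free; the price is fragility, since any re-sorting of the relation erases the encoded bits.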

b: Converting Database Relation into Binary Form
These techniques convert the database relation into a binary form. In [63], the watermark is generated from the most significant bits (MSBs) of the attribute values and can be verified publicly. In [64], the watermark cannot be verified publicly as it uses a private key. The approaches in [65], [66] also extend the approach of [63]. In [67], tuples are first grouped, then a fixed number of MSBs and LSBs of the selected attribute value are used to generate the watermark.

Tables 4-7 (partial): Overview of watermarking techniques.
Technique | Overview | Data set | Attacks |
… | Uses pseudo-random sequence generator instead of hash function. | same as [2] | same as [2] | No
Agrawal et al. [13] | Uses pseudo-random sequence generator instead of hash function. | same as [2] | same as [2] | No
Lafaye [20] | Describes the security analysis of [2] | Random databases and keys were … | … | …
… | Based on the cyclic model of the attribute | same as [2] | same as [25] | No
Gort et al. [27] | Proposes double fragmentation of the watermark by using the existing redundancy in the set of virtual primary keys | same as [2] | attribute deletion, tuple addition, … | …
Kamran et al. [40] | Data is partitioned by using hash function and watermark bits are embedded in each selected tuple. | Real life data set that shows power consumption rates | deletion, insertion, alteration, multifaceted, collusion, additive | Yes
Shehab et al. [38] | Data is partitioned by using hash function and watermark bit is embedded by altering the partition statistics. | same as [40] | deletion, alteration, insertion | …
… | The tuple ordering is done on the basis of value of critical attribute(s) and re-arrangement is done relative to a secret initial order. | … | … | …

Table 8: Reversible database watermarking techniques.
Technique | Overview | Data set | Attacks
Hou et al. [78] | Quality of watermarked data is used to claim copyright | Generated data | Tuple deletion, tuple addition, tuple alteration
Lin et al. [79] | Two different secret embedding keys are generated | Generated data | Alteration, deletion, mix-match, sorting, combination
Shen et al. [80] | Clustering-based and difference expansion technique is used | Generated data | Tuple delete, tuple modification
Li et al. [81] | Embeds the watermark bit by bit on the basis of grouping | Same as [2] | Insertion, deletion, modification
Lian et al. [82] | Differential expansion technology based on ant colony algorithm | Same as [2] | Subset deletion, modification
Li et al. [83] | Based on continuous columns in histogram | Same as [2] | Insert, delete, alter
Hamadou et al. [84] | Prediction-error expansion method | Same as [2] | Attribute alteration, tuple deletion, tuple insertion
Ge et al. [85] | Histogram shifting watermarking method | Wisconsin breast cancer diagnosis data set | Tuple addition, tuple deletion, attribute value modification
Tufail et al. [86] | Binary Bat Algorithm used for watermark creation | Heart disease medical data set | Insertion, deletion, alteration
Chai et al. [87] | Based on clustering grouping | Same as [2] | Attribute modification or deletion, subset deletion, subset addition, subset alteration
Chai et al. [88] | Based on erasure code | Same as [2] | Attribute modification or deletion, tuple deletion, tuple addition, subset alteration
Wu et al. [89] | Difference-expansion reversible data hiding method is used | Protected numeric data | -
Li et al. [90] | Based on histogram gap low distortion | Same as [2] | Insertion, deletion, modification
Hu et al. [91] | Genetic Algorithm with a new proposed Histogram Shifting of prediction error Watermarking (HSW) method to minimize distortion | Same as [2] | Insertion, deletion and alteration
Imamoglu et al. [92] | Difference expansion watermarking (DEW) with Firefly Algorithm is used to embed watermark | Same as [2] | Addition, deletion, bit-flipping, tuple-wise multifaceted, attribute-wise multifaceted, sorting
Chang et al. [93] | Watermark is embedded into the textual relational database | Textual relational database | Sorting, deletion, modification, addition
Chang et al. [94] | The content of textual attributes is used to generate the virtual primary attribute | Synthetic data | Tuple deletion, tuple alteration, tuple insertion
Iftikhar et al. [95] | Genetic Algorithm is used for getting optimal watermark information | Heart disease medical data set | Insertion, deletion, alteration
Chang et al. [96] | Embeds the watermark into the fractional portion of the numerical attributes to minimize the distortion | Generated database | Alteration, deletion, mix-match, sorting
Jawad et al. [97] | Genetic algorithm is used to improve the capacity of difference expansion based watermarking in databases | Same as [2] | Addition, deletion, bit flipping, sorting, tuple-wise multifaceted, attribute-wise multifaceted, secondary watermarking
Farfoura et al. [98] | An identification image is converted into a stream of bits (0's and 1's) and embedded into numeric attributes | Synthetic data | Deletion, insertion, modification
Farfoura et al. [99] | Time-stamping protocol is used | Synthetic data | Tuple alteration, tuple deletion, mix and match, attribute deletion
Contreras et al. [100] | Based on a circular representation of a bijective transformation | Medical database | Modification of attribute values, elimination or insertion of tuples
Gupta et al. [101] | Difference expansion on integers is used to achieve reversibility | Generated database | Random bit-wise flipping, subtractive, sorting, secondary watermarking
Gupta et al. [102] | Query-preserving watermarking scheme is proposed. | - | -
Zhang et al. [103] | Based on expansion on data error histogram | Generated database | -
Gupta et al. [104] | Based on difference expansion | Same as [101] | Same as [101]
Unnikrishnan et al. [105] | Based on orthogonal learning particle swarm optimization | Synthetic data | Insertion, deletion and alteration

c: Attribute Reordering
Authors in [68] have proposed a fragile distortion-free watermarking technique based on the attribute reordering method. First, a secret initial order of attributes is defined by virtually sorting the attributes based on the hash of attribute names. Thereafter, the MSBs are extracted for generating the watermark.
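The two steps described above (a secret attribute order derived from hashed attribute names, followed by MSB extraction) can be sketched as below. The details are assumptions made for illustration and are not taken from [68]: the hash is SHA-256 over a secret key and the attribute name, values are treated as unsigned integers of a fixed `width`, and one MSB per value is collected.

```python
import hashlib


def attribute_order(names, key: bytes):
    """Virtually sort attribute names by a keyed hash; the physical table is not touched."""
    return sorted(names, key=lambda n: hashlib.sha256(key + n.encode("utf-8")).hexdigest())


def msb(value: int, width: int = 16) -> int:
    """Most significant bit of `value` viewed as an unsigned integer of `width` bits."""
    return (value >> (width - 1)) & 1


def generate_watermark(table, key: bytes, width: int = 16):
    """table is a list of dicts (attribute name -> int); returns the watermark as a bit list."""
    order = attribute_order(list(table[0].keys()), key)
    return [msb(row[name], width) for row in table for name in order]
```

The generated bits are stored outside the database and later compared with bits regenerated from a suspicious copy; a mismatch indicates tampering, while a mere change of the physical column order does not affect the result.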

d: Content Characteristics based watermarking
The watermarking approach in [70] generates the watermark based on local characteristics such as the frequency distribution of various digits, lengths, and ranges of data values. In [69], the data set is grouped into square matrices and the watermark is generated using the determinant and the minor of those square matrices.

Table 9: Watermark embedding time (Te) and detection time (Td) for data sets of increasing size. Each Te/Td pair corresponds to one of the eight algorithms.
Size (MB) | Te | Td | Te | Td | Te | Td | Te | Td | Te | Td | Te | Td | Te | Td | Te | Td
276 | 44593 | 36333 | 238693 | 225158 | 51900 | 57727 | 38693 | 29089 | 94193 | 87239 | 44270 | 42005 | 16589 | 5677 | 107848 | 107039
532 | 101587 | 72430 | 455016 | 429327 | 113053 | 92188 | 71165 | 57295 | 193080 | 169147 | 93525 | 72687 | 29355 | 9897 | 314846 | 276820
888 | 142363 | 119988 | 613255 | 586367 | 213112 | 146547 | 138523 | 100428 | 320350 | 287711 | 146400 | 91233 | 50078 | 21219 | 359679 | 332385
1124 | 178702 | 158994 | 887012 | 859342 | 213136 | 181424 | 139666 | 104371 | 367192 | 336507 | 198813 | 151873 | 63113 | 19883 | 519823 | 525728
1338 | 234311 | 174010 | 945584 | 883639 | 285000 | 219363 | 169044 | 128484 | 502728 | 427423 | 252000 | 176136 | 70664 | 33010 | 536742 | 530287
1692 | 292721 | 228411 | 1318398 | 1292265 | 312628 | 268167 | 205991 | 166133 | 543450 | 482110 | 291490 | 229154 | 83886 | 27686 | 926221 | 898700
2237 | 354479 | 287743 | 2168703 | 1430286 | 464093 | 370848 | 278627 | 214460 | 818662 | 694515 | 696066 | 422730 | 118508 | 51731 | 933961 | 910995

We have classified the recent works in distortion-free watermarking in this category. The significant recent research works are proposed in [71]-[76]. In the scheme proposed in [71], some secure parameters are used to partition the tuples and three fake tuples are generated for each partition. A hash function is used to generate the first tuple. For the other two tuples, a genetic algorithm is used for numeric attributes, and for non-numeric attributes, the most frequent value is selected. These fake tuples are stored in a separate file, not inside the database, which makes this approach distortion-free.
Authors in [72] have adapted the MapReduce paradigm for watermarking of relational databases to decrease the computational cost and have implemented distortion-free algorithms in both sequential and MapReduce form. The proposal in [73] generates an image as a watermark from the database content. In [74], each column (attribute) is organized into groups, each having g data elements. The data elements in each group are re-ordered based on a watermark value. In [75], the data elements are grouped and the group watermark is generated by extracting µ MSBs of the hash of attribute names. They present the proposed watermarking as a service (WaaS) scheme. A brief overview of different distortion-free watermarking techniques within each category is depicted in Table 7.
Note that the reversible database watermarking techniques [4], [5], [77]-[105], as depicted in Table 8, have a wider scope of research, and we would like to explore these techniques separately in the future. Figure 2 provides a quick reference on the classification of different relational database watermarking algorithms.

III. COMPARATIVE PERFORMANCE ANALYSIS OF DISTORTION-BASED WATERMARKING TECHNIQUES:
We select the distortion-based algorithms proposed in [2], [22], [28], [40], [41], [46], [48], [49] for the experimental analysis. We implement all the algorithms in Java. The experiments are performed on a server equipped with a six-core Intel Xeon processor with a 2.4 GHz clock speed, 128 GB of RAM, and the Linux operating system. We use benchmark data sets obtained by modifying the Forest CoverType data set (https://kdd.ics.uci.edu/databases/covertype/covertype.html) into data sets of size 276MB, 532MB, 888MB, 1124MB, 1338MB, 1692MB, and 2237MB and perform the following analysis:
1) We analyze the usability of the watermarked databases in terms of the differences between the mean and variance of attribute values before and after the embedding of the watermark.
2) We also analyze the watermark embedding and detection time while increasing the data set size.
3) We perform the robustness analysis of these techniques over various attacks, e.g., insertion, update, delete, zero-out, and multifaceted attacks.
The prime reason behind choosing this data set is its wide use by the majority of the proposals in the literature. In particular, 33 out of 94 research works considered this data set as their benchmark, whereas the rest of the proposals used either a different kind of real-world data or self-generated data, which differ from one proposal to another. This makes it difficult to compare them empirically on a uniform basis. In order to unify the comparative analysis, in this paper we consider this most popular Forest CoverType data set as the benchmark for all the proposals under our consideration.

A. COMPUTATIONAL TIME
In database watermarking, the time spent during watermark generation and detection is an important factor to consider. The watermark embedding and detection time for various approaches is shown in Table 9. The comparison of watermarking time for these techniques is depicted in Figure 3.
From Table 9 and Figure 3, we have the following observations:
1) For all algorithms, the watermark embedding and detection times increase as the data size increases.
2) The watermark embedding and detection times are lowest in the case of [48].
3) The watermark embedding and detection times are highest in the case of [22].
4) The order of computational cost from lowest to highest is: [

Many operations may affect the computational time. We identify these operations as partitioning, hash calculation, random number generation, virtual primary key generation, and updating the attribute value. Other parameters, such as the number of attributes and the number of tuples considered for watermark embedding, also affect the computational time. We observe that the computational cost is highest in the case of [22] because it generates a virtual primary key for each of the tuples in the data set and therefore takes more time. In contrast, the approach in [48] has the lowest computational cost, as it generates some random numbers instead of computing hashes. The computational cost is also low in the case of [2], [28]. In the case of [28], after partitioning, only some of the tuples and attributes satisfying a criterion are considered for embedding the watermark. In the case of [2], the tuples are not partitioned, but only a fraction (γ) of tuples satisfying a particular condition are considered. The LSBs of one attribute in each selected tuple are flipped based on the watermark bits. The approach in [49] also considers only one fixed attribute in each partition to embed the watermark.

[Figure 2: Classification of relational database watermarking algorithms. Distortion-based categories: meaningless bit pattern [2], [12]-[20]; virtual primary key based [2], [22]-[27]; image as watermark [28]-[36]; partitioning based [37]-[44]; fingerprinting techniques [22], [47]-[51]; fake tuple or attribute insertion [45], [46]; other meaningful watermark information [41], [52]-[56]. Distortion-free categories: …]

[Figure 3: Comparison of watermarking time, including panel (b) Watermark Detection Time. Legend: AHK [2], Li et al [22], Prasanna [46], Zhang [28], Kamran [40], FieGuo [49], Li2005 [48], Huang [41].]
Partitioning is a common operation in [40], [41], [46], [49], and all of these approaches have a higher computational cost than the others, second only to the approach in [22]. This suggests that the partitioning operation contributes to the computational cost.

B. USABILITY OF DATA AFTER WATERMARK EMBEDDING
The usability of the database is based on the domain, e.g., a minor change in a voter database can create a problem, and hence the watermarking should not cause any changes to the voter database, whereas, minor changes in a forest survey database can be tolerated. Therefore, it is difficult to generalize the criteria for usability. However, Table 10 will help the users to understand the effect of watermark embedding on the mean and variance values of the attributes and give them an idea about whether watermarking causes more changes in the underlying data or not. Table 10 shows the change in variance of database values before embedding of watermark and after embedding. The watermark embedding algorithms in [2], [22], [28], [41], [49] embed the watermark bit in a particular bit of the attribute value. The number of bits available for watermark embedding is denoted by the variable ξ. We compute the variance of each attribute by varying the value of ξ to 3, 5, and 8. We observe that there is no change in mean after watermark embedding. In distortion-based watermarking, it is assumed that a certain level of distortion is tolerable. In the case of [2], there are small changes in variance when the bits available for embedding are increased to 8 bits. In the case of [46] and [48], there is no change in the variance at all, whereas in the case of [22] and [28], the variance is highly affected after the watermark embedding.
From Table 10, we observe that in the case of the approaches in [40], [41], [49] the variance of only one attribute is affected after watermark embedding, because they consider only a single attribute to embed the watermark. In the case of [46], there is no change or a negligible change in the variance, as it inserts a new attribute into the database relation. Therefore, it does not cause any change in the existing attribute values, and the variance of the attributes is not affected. Similarly, in the case of [48] there is no change in the variance, and in the case of [2] there is a negligible change, as both algorithms embed the watermark into only a fraction of the tuples. The usability is highly affected in the case of [28] because an attribute is selected for embedding if the length of the attribute value is greater than a particular value. This causes the watermark to be embedded in more than one attribute.
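The usability comparison discussed above amounts to computing, per attribute, the change in mean and variance between the original and the watermarked relation. A minimal sketch is given below; the use of the population variance and the function name are assumptions, since the text does not state which variance estimator is used.

```python
from statistics import mean, pvariance


def usability_delta(original, marked):
    """For each attribute (column), return (change in mean, change in variance).

    Both arguments are lists of rows with numeric values in the same column order.
    """
    deltas = []
    for col_orig, col_mark in zip(zip(*original), zip(*marked)):
        deltas.append((mean(col_mark) - mean(col_orig),
                       pvariance(col_mark) - pvariance(col_orig)))
    return deltas
```

For example, if a scheme raises one value of an attribute by 1 and lowers another by 1, the mean of that attribute stays the same while its variance changes, which is the kind of effect a per-attribute variance comparison makes visible.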

1) Delete Attack
In a delete attack, the attacker deletes some of the tuples of the watermarked database in order to distort the embedded watermark. Though the attacker is supposed to delete the tuples keeping in mind the usability of the data, we analyze the detection rate by varying the attack percentage from 10% to 90%. The rate of detection for various distortion-based techniques after the delete attack is shown in Figure 4(a). From Figure 4(a), we can observe that the rate of watermark detection remains more than 90% even after 90% deletion in case of [2], [22], [28], [41], [48].

2) Update Attack
In an update attack, the attacker randomly updates some of the values of the watermarked database with his own values and tries to claim ownership of the database. We analyze the detection rate by varying the update percentage from 10% to 90% as depicted in Figure 4(b).
We can observe from Figure 4(b) that the rate of detection is more than 80% in case of [2], [22], [28], [40] even after a 90% attack.

3) Insertion Attack
In an insertion attack, the attacker removes a particular number of tuples from the watermarked database and inserts the same number of tuples into the database to destroy the embedded watermark. The rate of watermark detection for various techniques after the insertion attack is shown in Figure 4(c). We can observe from Figure 4(c) that the rate of detection is 100% in the case of [22], [28], [48] even after a 90% attack.

4) Zero-out Attack
In this attack, some tuple values are selected randomly and updated with zero by the attacker to destroy the embedded watermark. We analyze the rate of watermark detection by varying the attack percentage as shown in Figure 4(d).
We can observe from Figure 4(d) that the rate of detection is more than 80% in case of [2], [22], [28] even after 90% attack.

5) Multifaceted Attack
This is the combination of delete, update, and insertion attacks. The attacker randomly updates some of the tuple values, deletes some of the tuples, and inserts his own tuples to destroy the embedded watermark. The data usability is highly impacted by this attack. The intensity of these attacks that we have considered is shown in Table 11. The rate of detection after the multifaceted attack is depicted in Figure 5. We can observe from Figure 5 that the rate of detection is 100% in case of [22], [28] even after a 90% attack.
The robustness against various attacks is higher in the case of [2] and [48], since the detection in [2] is based on the match counts that are computed on the remaining watermarked tuples after the attack. Similarly, in the case of [48], the detection is based on majority voting for each fingerprint bit, and the fingerprint can be recovered from the watermarked tuples that remain after the attacks.
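The majority-voting argument can be illustrated with a short Java sketch. It assumes, hypothetically, that every surviving watermarked tuple contributes one vote of the form {bit index, bit value}; how [48] actually selects tuples and bit positions is not reproduced here, and the class and method names are invented for illustration.

```java
import java.util.Arrays;

final class MajorityVoteSketch {

    /**
     * Recovers a fingerprint of the given length by majority voting.
     * Each row of `votes` is {bitIndex, bitValue}. A position that receives
     * no votes, or a tie, is reported as -1 (undecided).
     */
    static int[] recover(int[][] votes, int length) {
        int[] ones = new int[length];
        int[] zeros = new int[length];
        for (int[] vote : votes) {
            if (vote[1] == 1) {
                ones[vote[0]]++;
            } else {
                zeros[vote[0]]++;
            }
        }
        int[] fingerprint = new int[length];
        for (int i = 0; i < length; i++) {
            if (ones[i] == zeros[i]) {
                fingerprint[i] = -1;
            } else {
                fingerprint[i] = ones[i] > zeros[i] ? 1 : 0;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        // Votes carried by the watermarked tuples for the fingerprint 1 0 1.
        int[][] allVotes = {
            {0, 1}, {0, 1}, {0, 1},
            {1, 0}, {1, 0}, {1, 0},
            {2, 1}, {2, 1}, {2, 1}
        };
        // After a heavy delete attack only one vote per position survives.
        int[][] survivors = {{0, 1}, {1, 0}, {2, 1}};

        System.out.println(Arrays.toString(recover(allVotes, 3)));
        System.out.println(Arrays.toString(recover(survivors, 3)));
    }
}
```

Both calls recover the same fingerprint, which reflects the observation above: as long as at least one unmodified vote per bit position survives and is not outvoted, the deleted tuples do not matter.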

IV. COMPARATIVE PERFORMANCE ANALYSIS OF VARIOUS DISTORTION-FREE WATERMARKING TECHNIQUES
The data values in the database are not changed in the case of distortion-free watermarking techniques. These techniques mainly generate the watermarks from the database contents. The primary phases in these techniques are (i) Partitioning of tuples into groups, and (ii) Watermark generation from each group. The watermarks for each group can be combined together to generate the watermark for the whole database.
A generic distortion-free watermarking algorithm GEN_WM is shown in Algorithm 1.

Algorithm 1 GEN_WM(R)
1: for each tuple t ∈ R do
2:     Compute g_t = f(t).
3: end for
4: for each group g_i ∈ G do    // |G| = total number of groups
5:     T = set of all tuples t_i ∈ group g_i.
6:     Compute watermark W_i for group g_i using T.
7: end for
8: Compute W = || W_i, ∀ i = 1 to |G|.

The database relation R is taken as input. Steps 1 to 3 compute the group id g_t to which a tuple t belongs by applying a function f (e.g. a hash function). In Steps 4 to 7, a group watermark W_i for each group g_i is generated by using the tuples belonging to that group. Step 8 computes the overall watermark W by applying a suitable operation || (e.g. concatenation) to the group watermarks.
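A minimal Java sketch of Algorithm 1 is given below. Algorithm 1 leaves the grouping function f, the group watermark, and the combining operation open, so the choices made here are assumptions for illustration only: f is the hash of the primary key modulo the number of groups, each group watermark is a SHA-256 digest over the group's tuples in primary-key order, and || is concatenation of hexadecimal strings. The Tuple layout and all names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GenWmSketch {

    /** A tuple with a primary key and its attribute values (hypothetical layout). */
    static final class Tuple {
        final String key;
        final String[] attributes;

        Tuple(String key, String... attributes) {
            this.key = key;
            this.attributes = attributes;
        }
    }

    /** Steps 1-3: group id g_t = f(t); here f is the key hash modulo numGroups. */
    static int groupOf(Tuple t, int numGroups) {
        return Math.floorMod(t.key.hashCode(), numGroups);
    }

    /** Step 6: watermark W_i of one group; here a SHA-256 digest of its tuples. */
    static String groupWatermark(List<Tuple> group) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
        // Sort by key so the result does not depend on the storage order of tuples.
        List<Tuple> ordered = new ArrayList<>(group);
        ordered.sort(Comparator.comparing((Tuple t) -> t.key));
        for (Tuple t : ordered) {
            digest.update(t.key.getBytes(StandardCharsets.UTF_8));
            for (String attribute : t.attributes) {
                digest.update((byte) 0); // separator between fields
                digest.update(attribute.getBytes(StandardCharsets.UTF_8));
            }
            digest.update((byte) 1); // separator between tuples
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Algorithm 1: partition R, compute every W_i, then concatenate them (step 8). */
    static String genWm(List<Tuple> relation, int numGroups) {
        Map<Integer, List<Tuple>> groups = new HashMap<>();
        for (Tuple t : relation) {
            groups.computeIfAbsent(groupOf(t, numGroups), g -> new ArrayList<>()).add(t);
        }
        StringBuilder watermark = new StringBuilder();
        for (int g = 0; g < numGroups; g++) {
            // Empty groups still contribute a digest, so the length is fixed.
            watermark.append(groupWatermark(groups.getOrDefault(g, new ArrayList<>())));
        }
        return watermark.toString();
    }

    public static void main(String[] args) {
        List<Tuple> relation = new ArrayList<>();
        relation.add(new Tuple("k1", "2596", "51"));
        relation.add(new Tuple("k2", "2590", "56"));
        relation.add(new Tuple("k3", "2804", "139"));
        System.out.println(genWm(relation, 4));
    }
}
```

In this sketch, verification would amount to regenerating W from the suspect relation and comparing it group by group with the stored watermark, so that a mismatch points to the groups that were modified.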
Authors in [57] proposed the first work in this domain. We classify the distortion-free watermarking techniques in the following categories: (i) permutation of tuples, (ii) converting database relation into binary form, (iii) attribute reordering, (iv) content characteristics based, and (v) others. We analyze these techniques and select the algorithms for experimental analysis on the basis of the same criteria as discussed in Section II.
We consider the distortion-free watermarking algorithms proposed in [57], [63], [64], [68]-[70], [75] and perform the robustness analysis of these techniques over various attacks, e.g. insertion, update, delete, zero-out, and multifaceted attack. We also analyze the computational cost. We implement all the algorithms using Java. The experiments are performed on a server equipped with a six-core Intel Xeon processor, 2.4 GHz clock speed, 128 GB RAM, and the Linux operating system. We use benchmark data sets obtained by modifying the Forest CoverType data set² into data sets of size 276MB, 532MB, 888MB, 1124MB, 1338MB, 1692MB and 2237MB. The reason for choosing this data set is discussed in Section III.

A. COMPUTATIONAL TIME
In database watermarking, the time spent during watermark generation and detection is an important factor to consider. The watermark generation and detection time for various approaches is shown in Table 12. The comparison of watermarking time for these techniques is depicted in Figure 6.

[Figure 6 panels: (a) Watermark Generation Time, (b) Watermark Detection Time. Legend: Bhat09 [64], KhanHusain [70], Li2006 [63], Hamadou2011 [68], Camara2014 [69], naz2020 [75], Li2004 [57].]

Following are the observations from Table 12 and Figure 6:
1) For all the watermarking approaches, watermark generation and detection time increases as the data size increases.
2) Watermark generation and detection time is least in case of [70] and highest in case of [64].

Authors in [72] adapted the MapReduce paradigm to watermark relational databases. They implemented the algorithms proposed in [57], [64], [67], [69], [70] in sequential as well as MapReduce form and observed that, as the data size increases, the percentage reduction in watermarking time from sequential to MapReduce increases.
In the case of distortion-free watermarking techniques, there are various operations that affect the computational cost, e.g. hash computation, partitioning, watermark generation, pseudo-random number generation, matrix operations, etc. The number of attributes, tuples, and the bit positions available for watermark generation also affect the computational cost. From Figure 6, we can observe that the computational time is highest in the case of [64] since it partitions the database relation based on the hash function and uses all attributes of all tuples for generating a binary form of the relational database. The computational cost is least in case of [70], since it does not partition the database relation. The watermark is generated by considering all attributes of all tuples and by generating digit, length, and frequency sub-watermarks. The basic step in the case of distortion-free techniques is partitioning. For example, the approaches in [57], [64], [68], [69], [75] partition the data based on, e.g., a hash function or a pseudo-random number. The group watermarks are then generated independently.

² https://kdd.ics.uci.edu/databases/covertype/covertype.html

B. USABILITY OF DATA AFTER WATERMARK GENERATION
In the case of distortion-free watermarking approaches, the watermark is generated from the underlying content of the data and there is no distortion in the data itself, hence the data usability is not affected.

C. ROBUSTNESS ANALYSIS
We perform the robustness analysis of the watermarking techniques over various attacks, e.g. insertion, update, delete, zero out, and multifaceted attack. We analyze the rate of detection by varying the intensity of the attacks from 10% to 90%.

1) Delete Attack
In a delete attack, some of the tuples of the watermarked database are deleted by the attacker in order to distort the watermark. Though the attacker is supposed to delete the tuples keeping in mind the usability of the data, we analyze the detection rate by varying the attack percentage from 10% to 90%. The rate of detection for various distortion-free techniques after the delete attack is shown in Figure 7(a). From Figure 7(a), we observe that the rate of detection remains 100% in case of [63] even after a 90% attack.

3) Insertion Attack
In an insertion attack, the attacker removes a particular number of tuples from the watermarked database and inserts the same number of tuples into the database to destroy the watermark. The rate of watermark detection for various techniques after insertion attack is depicted in Figure 7(c). We observe that the rate of detection remains 100% in the case of [63] even after 90% attack.

4) Zero out Attack
Some of the tuple values of the watermarked database are randomly selected by the attacker and updated with zero to destroy the watermark. We analyze the rate of watermark detection by varying the attack percentage as shown in Figure 7(d). The rate of detection even after a 90% attack is highest in the case of [63].

5) Multifaceted Attack
This is the combination of delete, update, and insertion attacks. The attacker randomly updates some of the tuple values, deletes some of the tuples, and inserts his own tuples to distort the watermark. The intensity of the update, delete, and insertion attacks is taken as shown in Table 11. We analyze the rate of detection after the multifaceted attack in Figure 8. The rate of detection remains near 100% in the case of [63] even after a 90% attack.

FIGURE 8: Rate of detection after Multifaceted Attack in case of distortion-free algorithms.
[Figure 8 legend: KhanHusain [70], Li2006 [63], Camara2014 [69], Bhat2009 [64], naz2020 [75], Hamadou2011 [68], Li2004 [57].]
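For readers who wish to reproduce this kind of experiment, the following Java sketch shows one possible way to simulate a multifaceted attack on an in-memory relation. It is an assumption-laden illustration rather than the experimental harness used in this paper: the class name, the representation of a tuple as an int array, the value range of attacker-generated data, and the attack fractions in the usage example are all hypothetical, and the intensities of Table 11 are not reproduced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class MultifacetedAttackSketch {

    /**
     * Applies delete, update, and insertion attacks in sequence to a copy of
     * the relation. Each fraction is relative to the original tuple count.
     * The order of tuples in the result is randomized.
     */
    static List<int[]> attack(List<int[]> relation, double deleteFraction,
                              double updateFraction, double insertFraction, long seed) {
        Random random = new Random(seed);
        int n = relation.size();
        int width = n == 0 ? 0 : relation.get(0).length;

        // Work on deep copies so the original relation is left untouched.
        List<int[]> attacked = new ArrayList<>();
        for (int[] tuple : relation) {
            attacked.add(tuple.clone());
        }

        // Delete: shuffle, then drop a prefix, i.e. a random subset of tuples.
        Collections.shuffle(attacked, random);
        int toDelete = Math.min((int) Math.round(deleteFraction * n), attacked.size());
        attacked.subList(0, toDelete).clear();

        // Update: overwrite one random attribute in some of the surviving tuples.
        int toUpdate = Math.min((int) Math.round(updateFraction * n), attacked.size());
        for (int i = 0; i < toUpdate && width > 0; i++) {
            attacked.get(i)[random.nextInt(width)] = random.nextInt(10_000);
        }

        // Insert: append attacker-generated tuples with random values.
        int toInsert = (int) Math.round(insertFraction * n);
        for (int i = 0; i < toInsert; i++) {
            int[] fake = new int[width];
            for (int j = 0; j < width; j++) {
                fake[j] = random.nextInt(10_000);
            }
            attacked.add(fake);
        }
        return attacked;
    }

    public static void main(String[] args) {
        List<int[]> relation = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            relation.add(new int[] {i, 2 * i, 3 * i});
        }
        // Placeholder intensities: 30% delete, 30% update, 30% insertion.
        List<int[]> attacked = attack(relation, 0.3, 0.3, 0.3, 42L);
        System.out.println("tuples after attack: " + attacked.size());
    }
}
```

The attacked relation can then be passed to a detection routine, and the fraction of correctly recovered watermark bits recorded for each attack intensity.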
From Figure 7 and Figure 8, we observe that the approach in [63] has the highest robustness against four types of attacks since it considers the number of attributes as that of binary attributes (γ) present in the database relation. It generates the watermark bits from the MSB positions of the attribute values. If the value of γ is increased, though it will increase the robustness, the computational cost will be increased.

V. DISCUSSION W.R.T. THE EXISTING EXPERIMENTAL OBSERVATIONS
While comparing our evaluation results with the results reported in the existing papers, we draw the following observations:
1) Although watermark embedding and detection times are significant when watermarking is applied to large-scale data sets, none of the existing proposals (except [2]) under distortion-based approaches performs this evaluation. To the best of our knowledge, this paper is the first to report a detailed comparative study on the computational costs incurred by the different algorithms under consideration. Under distortion-free approaches, only one proposal [75] evaluates its performance, on patient's medical data, achieving watermark embedding and detection times of 13 and 21.1 seconds, respectively. Even though [75] considers a data set different from CoverType, we observe a linear growth of computational time in both embedding and detection phases similar to [2], [75].
2) Experimental evaluation of data usability using the CoverType data set is conducted by the authors in [2], [48], [49], whereas [40] considers a data set comprising consumers' power consumption rates. Like the results reported in [2], [48], [49], our evaluation results also reveal that there is no notable change in the mean value of the data after watermark embedding, while very little change, in the range 1-99, is observed in the variance when a larger number of LSBs (e.g., 8 bits) is considered for watermark embedding. On the other hand, when we conduct experiments for the algorithm in [40] on the CoverType data set, we observe a significant increase in the variance and a small decrease in the mean values compared with those reported in [40] on power consumption data. This is due to the difference in the semantic domains of the attributes used for watermark embedding in the two data sets. Note that distortion-free approaches do not suffer from this issue.
3) Attack analyses to demonstrate the robustness of the algorithms are conducted over the CoverType data set in [2], [22], [48], [49], [68]-[70]. Interestingly, we obtain similar results. To be more precise, in both cases, the results show that the watermark can be detected even after a 90% attack. On the other hand, the attack analysis of the algorithms in [28], [40], [75] is performed on data sets different from CoverType. It is worthwhile to mention that the result reported in [28], [40] is similar to the result obtained using CoverType in our case, which shows that the watermark detection rate is above 70% even after a 90% attack. Similarly, the attack result reported in [75] exhibits a trend similar to the one we observe in our case (on CoverType data), which shows that the watermark detection rate drops below 20% after a 90% attack.

VI. OVERALL RESEARCH GUIDANCE
In this section, we discuss in detail the guidance to the users for a wise decision on choosing the right watermarking technique. We observe that various operations and parameters (such as the number of attributes, tuples, and bit positions for embedding or generation) in the watermarking algorithms impact the computational cost, data usability, and robustness. A few observations are listed below:
• The number of attributes involved in watermark embedding: Some algorithms embed the watermark in all attributes of the database relation. Even though this increases the robustness, it may cause more distortion and may affect the usability, with increased computational time.
• The number of tuples considered for watermark embedding: If all of the tuples are considered for embedding the watermark, the computational cost increases. The usability is also affected more, though the robustness may increase. In contrast, some watermarking algorithms consider only a fraction of the tuples for embedding the watermark, which decreases the computational cost and affects the data usability less.
• The number of bits available for watermarking: If the number of bits considered for embedding watermarks is increased, the distortion increases. The computational time and robustness are not affected much by this.
• Parameters particularly affecting the computational cost: There are many operations that may affect the computational time. We identify these operations as partitioning, hash calculation, random number generation, virtual primary key generation, matrix operations, and updating the attribute value.
Although we cannot generalize, we categorize the usability, computational time, and robustness towards attacks for the relative comparison of various watermarking techniques in the following groups: The ∆Variance represents the change in variance of the attribute values after the embedding of the watermark. A comparative summary of the distortion-based algorithms that we have considered for the experimental analysis is shown in Table 13. The best algorithm should affect the usability "Less" after watermark embedding, have "Less" computational cost, and have "Very High" robustness against various attacks. In the case of the distortion-based algorithms, if usability is the main concern, then the approaches in [2], [46], [48] are the better options, since the attributes show no change or negligible change in the variance after watermark embedding. If we consider the robustness and computational cost, then [2], [48] are better, whereas the approach in [46] has less robustness against all types of attacks. If only computational cost is considered, then the approach in [48] has the least computational cost. If only robustness is considered, then the approach in [22] is the most robust, but the usability is highly affected after embedding. The computational cost is also highest in the case of [22], since it computes a virtual primary key for each of the tuples.
From Table 13, we can observe the following in the case of both [2] and [48]:
• The data usability is least affected after the watermark embedding.
• The computational cost is "Less".
• "Very High" robustness against three kinds of attacks.
Therefore, considering the usability constraints as defined, the computational cost, and the robustness towards various attacks, we can say that the watermarking algorithms in [2] and [48] perform better than the other distortion-based watermarking algorithms we have considered for experimental analysis.
A comparative summary of the distortion-free algorithms that we have considered for the experimental analysis is shown in Table 14.
In the case of distortion-free watermarking techniques, if only computational cost is considered, then the approach in [70] is the best option as it takes the least watermarking time. If only the robustness against various attacks is considered, then the approach in [63] has a very high robustness in case of update, delete, insertion and multifaceted attacks. The usability is not affected, as the watermark generation process does not cause any distortion in the data.
From Table 14, we can observe the following in case of [63]:
• The usability of the data is not affected after the watermark generation.
• The computational cost is "Less".
• "Very High" robustness against four kinds of attacks.
Overall, considering the above-mentioned facts, the watermarking algorithm in [63] performs better than the other distortion-free watermarking algorithms in terms of computational overhead and robustness.

VII. CONCLUSION
In this paper, we perform a detailed empirical comparative analysis of various relational database watermarking techniques. We classify the existing distortion-based watermarking techniques into seven categories, namely (i) meaningless bit-pattern as the watermark, (ii) virtual primary key based, (iii) image as watermark, (iv) partitioning based, (v) fake tuple/attribute insertion, (vi) fingerprinting techniques, and (vii) other meaningful watermark information. Similarly, the existing distortion-free techniques are classified as (i) permutation of tuples, (ii) conversion of the database into binary form, (iii) attribute reordering, (iv) content characteristics based, and (v) others. We perform an exhaustive empirical study and comprehensive analysis of a number of algorithms selected based on our quality criteria. In particular, our evaluation focuses on the following three crucial factors: computational cost, data usability, and robustness, as a way to provide insightful guidance to choose the right watermarking technique for a given application.