Semi-Stream Similarity Join Processing in a Distributed Environment

Similarity join has become very important for the processing and analysis of semi-structured and unstructured data. Although several studies have been conducted on the similarity join, little attention has been paid to the semi-stream similarity join, which is a similarity join between stream data and a large disk-based relation. In this study, we propose the first distributed solution, called DSim-Join, for the semi-stream similarity join problem. DSim-Join minimizes data transmission, reduces database accesses using a cache in a distributed stream processing engine, parallelizes join processing, and balances the load between parallel join threads. Experimental results obtained using real-world datasets show that DSim-Join yields significantly improved throughput compared with state-of-the-art methods, especially for large datasets. The results also show that DSim-Join is scalable and stable; it is not very sensitive to parameters such as the micro-batch interval, checkpoint interval, and similarity threshold.


I. INTRODUCTION
A similarity join between two sets of records finds all similar record pairs from the two sets [1]. This type of operation has become very important for semi-structured and unstructured data processing and analysis, as encountered in document clustering [2], plagiarism detection [3], and near-duplicate detection [4]. Similarity join processing is quite expensive because it needs to find all similar pairs. When the data volume is large, it becomes difficult to perform join processing on a single node.
Distributed similarity join processing based on the MapReduce model [5] has attracted significant research attention [6]–[19]. Recently, a distributed in-memory similarity join method, called Dima [18], has been proposed. Dima uses distributed in-memory indexes to find similar pairs and balances the workload of distributed index partitions. However, Dima has difficulty handling data larger than the total main memory of the worker nodes in a distributed cluster. Furthermore, previous studies on distributed similarity join processing do not consider a stream environment.
In a stream environment, a semi-stream join operation is important for joining stream data with a large disk-based relation [20]–[29]. Most of the previous studies address the semi-stream equi-join in a single node. S3J [28] is a semi-stream similarity join method, but it does not support a distributed environment. DS-Join [29] is a distributed semi-stream join method, but it does not support the similarity join.
In this paper, we examine the semi-stream similarity join problem in a distributed environment and propose a comprehensive solution called DSim-Join. To the best of our knowledge, this is the first distributed solution that supports the semi-stream similarity join. We make the following contributions. First, we develop a data partitioning technique that minimizes data transmission. Second, we devise a caching mechanism that reduces database (DB) accesses. Third, we develop a technique that parallelizes the join processing using a cache. Fourth, we propose a novel cache replacement algorithm that dynamically balances the load between parallel join threads by adjusting the cache size.
We have implemented DSim-Join on top of Spark Streaming [30] because it provides several useful APIs for distributed stream processing. We have extended state-of-the-art distributed methods [18], [29] to accommodate semi-stream similarity joins and performed extensive experiments on real datasets. The results show that DSim-Join significantly outperforms the state-of-the-art methods in terms of throughput, especially for big databases (larger than the total amount of main memory in a cluster).
The remainder of the paper is organized as follows. Section II introduces Spark and signature-based similarity join processing, and Section III reviews the existing work. Sections IV and V present DSim-Join and the experimental results, respectively. Finally, Section VI presents our conclusions and future work. For ease of reading, Table 5 in the Appendix lists the abbreviations used in this paper.

II. BACKGROUND
A. SPARK
Spark is a distributed in-memory computing framework that extends the MapReduce model. Spark performs parallel computations with fault tolerance using Resilient Distributed Datasets (RDDs) [31]. An RDD is an immutable distributed collection, and a pair RDD contains key-value pairs. Spark processes each partition of an RDD in parallel, and the number of partitions sets the degree of parallelism. Users can explicitly store (or persist) an RDD in memory to reuse it. For fault-tolerance, Spark tracks the lineage information to rebuild lost data.
Spark provides per-partition operations that can iterate through all the elements in a partition, such as mapPartitions and zipPartitions. Such operations are crucial for implementing complex functions such as the semi-stream similarity join. The mapPartitions operation applies a function to each partition of an RDD. The zipPartitions operation combines multiple RDDs into a new RDD by applying a function to their corresponding partitions. It assumes that all the input RDDs have the same number of partitions; however, it does not require each partition to have the same number of elements. To preserve partitioning information for per-partition operations on pair RDDs, we need to set the preservesPartitioning option to true.
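As an illustration of these per-partition operations, the following minimal Scala sketch (the keys, values, and object name are made up for this example) applies mapPartitions to a hash-partitioned pair RDD while keeping its partitioning:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PerPartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-partition").setMaster("local[2]"))

    // A pair RDD of (signature, record), explicitly hash-partitioned by the signature key.
    val pairs = sc.parallelize(Seq(("sigA", "record1"), ("sigB", "record2"), ("sigA", "record3")))
      .partitionBy(new HashPartitioner(4))

    // mapPartitions processes a whole partition per task; preservesPartitioning = true
    // tells Spark that the keys are unchanged, so later key-based operations avoid a shuffle.
    val normalized = pairs.mapPartitions(
      iter => iter.map { case (sig, rec) => (sig, rec.toUpperCase) },
      preservesPartitioning = true)

    normalized.collect().foreach(println)
    sc.stop()
  }
}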
Spark Streaming is a Stream Processing Engine (SPE) that builds on Spark. In Spark Streaming, a stream is divided into a sequence of RDDs (or micro-batches). In a distributed environment, a semi-stream join method [29] based on this micro-batch model shows better throughput than the methods [32]–[34] based on the record-at-a-time model [35].

B. SIGNATURE-BASED SIMILARITY JOIN PROCESSING
The state-of-the-art similarity join methods [18], [36]–[38] are based on signatures, which are subsets of tokens. These methods generate signatures for each record such that if two records are similar, they must share at least one signature. In other words, for two given records r and s, let Sig(r) and Sig(s) be the signature sets of r and s, respectively; r and s can be similar only if Sig(r) ∩ Sig(s) ≠ ∅. We find candidate record pairs having common signatures and then verify that each candidate pair meets the similarity criterion. This filter-and-verification technique can prune many dissimilar pairs. Table 1 summarizes the related work along three dimensions: distributed environment, semi-stream join processing, and join type. Most of the studies [6]–[27], [29] do not consider the semi-stream similarity join; the exception, Gao et al. [28], does not consider a distributed environment.
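The following self-contained Scala sketch illustrates the filter-and-verification principle with a prefix-filter signature scheme and Jaccard similarity; it is only an example of the general idea, not the specific signature schemes used in the cited methods:

// Filter-and-verification with prefix-filter signatures and Jaccard similarity.
object FilterAndVerify {
  type Record = Set[String] // a record viewed as a set of tokens

  // Jaccard similarity of two token sets.
  def jaccard(r: Record, s: Record): Double =
    if (r.isEmpty && s.isEmpty) 1.0
    else r.intersect(s).size.toDouble / r.union(s).size

  // Prefix signatures for threshold tau: the first |r| - ceil(tau*|r|) + 1 tokens under a
  // fixed global token order. Two records with Jaccard >= tau must share at least one.
  def signatures(r: Record, tau: Double, order: Map[String, Int]): Set[String] =
    r.toSeq.sortBy(t => order.getOrElse(t, Int.MaxValue))
      .take(r.size - math.ceil(tau * r.size).toInt + 1).toSet

  def similarPairs(rs: Seq[Record], ss: Seq[Record], tau: Double,
                   order: Map[String, Int]): Seq[(Record, Record)] =
    for {
      r <- rs
      s <- ss
      if signatures(r, tau, order).exists(signatures(s, tau, order)) // filter step
      if jaccard(r, s) >= tau                                        // verification step
    } yield (r, s)
}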

III. RELATED WORK
There are several studies [6]–[19] on distributed similarity join processing based on the MapReduce model. Elsayed et al. [6] proposed a similarity join method using the filter-and-verification technique and an inverted index over all tokens. Vernica et al. [7] described similarity join methods that balance the workload and minimize replication. Metwally and Faloutsos [8] proposed a similarity join method applicable to sets, multisets, and vectors. Afrati et al. [9] provided a comparative analysis of similarity join methods that execute in a single MapReduce stage. Kim and Shim [10] proposed top-k similarity join methods. Deng et al. [11] supported set-based and character-based similarity joins by extending the signature scheme [39]. Deng et al. [12] proposed a new signature scheme to improve the pruning power. Rong et al. [13] proposed a parallel similarity join method using a vertical partitioning technique. Meena et al. [14] presented a character-based similarity join method for highly skewed data. Fier et al. [15] empirically compared the distributed set similarity join methods. Ding et al. [16] addressed privacy challenges for the similarity join.
For the distributed in-memory framework, i.e., Spark, some work [17]–[19] has been performed on the similarity join. Chen et al. [17] proposed an approximate similarity join method using a locality-sensitive hashing (LSH)-based distance function. Sun et al. [18] proposed a similarity-based query processing system called Dima. Dima uses the partition-based signature scheme of Deng et al. [12] and adaptively selects signatures to balance the workload. Rashtchian et al. [19] proposed an approximate similarity join method based on locality-sensitive filtering [40]. In this work, we address a problem fundamentally different from those of existing studies on distributed similarity join processing. We assume that one input of the similarity join is a stream, and the other input is stored in an independent external database system. This fundamental difference raises many challenging problems, which are described in Section IV-B.
Considerable research has been conducted on semi-stream joins [20]–[29]. Naeem et al. [20] proposed a semi-stream equi-join method called SSCJ, which uses a cache to store frequently accessed disk tuples in memory. The join is between the foreign key of the stream data and the primary key of the relation. SSCJ sequentially alternates between the stream-probing and disk-probing phases. The stream-probing phase finds a match in the cache for each incoming stream tuple. Unmatched stream tuples are forwarded to the disk-probing phase through a fixed-size queue. In the disk-probing phase, the oldest stream tuple in the queue is used to look up and load a partition of the disk-based relation, and then the queued stream tuples are matched against each tuple in that partition. The matched stream tuples are deleted from the queue, and the stream-probing phase is restarted. Naeem et al. [21] proposed a strategy for selecting an appropriate stream tuple for the partition lookup. The strategy alternates between the last queue element (i.e., the oldest stream tuple) and an intermediate queue element. Mehmood and Naeem [22] used parallel execution between the stream-probing phase and the disk-probing phase by introducing an intermediate buffer between the two phases. Naeem et al. [23] proposed multi-way semi-stream join methods. Naeem et al. [24] presented a technique for load shedding in semi-stream join processing. Naeem et al. [25] proposed a semi-stream join method for the many-to-many equi-join. Mehmood and Anees [26] analyzed the CPU and memory usage of semi-stream join processing with NoSQL. Naeem et al. [27] implemented parallel loading of partitions into memory in the disk-probing phase. Gao et al. [28] proposed semi-stream similarity join methods using tries. None of the aforementioned studies on the semi-stream join considered a distributed environment, which is the focus of this study.
Finally, Jeon et al. [29] proposed a distributed semi-stream equi-join method, called DS-Join, which uses the micro-batch model. Our method, DSim-Join, for the semi-stream similarity join also uses the micro-batch model for distributed stream processing, but our join algorithm is completely different from that of [29] because the type of join is different.

IV. DISTRIBUTED SEMI-STREAM SIMILARITY JOIN PROCESSING
In this section, we propose a distributed semi-stream similarity join method called DSim-Join. Section IV-A defines the problem, and Section IV-B explains the design goals. Sections IV-C and IV-D present the architecture and algorithm outline. Sections IV-E and IV-F describe and discuss DSim-Join in detail.

A. PROBLEM DEFINITION
Given a disk-based relation R, a stream S, a similarity function f, and a threshold τ, we define the semi-stream similarity join of R and S as finding all pairs of records r ∈ R, s ∈ S such that f(r, s) ≥ τ.
It is commonly assumed that the disk-based relation changes gradually. Our method can support various similarity functions (e.g., Jaccard, Cosine, Dice, and edit distance) using the same signature schemes as used in [12], [41]. The focus of this paper is on the join algorithm, and the signature scheme is orthogonal to our work.

B. DESIGN GOALS
We establish the following design goals to achieve high performance for semi-stream similarity join processing in a distributed environment:
• Minimizing data transmission;
• Reducing database accesses;
• Parallelizing join processing;
• Balancing load between parallel join threads.

C. THE PROPOSED ARCHITECTURE
Fig. 1 shows the proposed architecture. In the distributed SPE, the distributed semi-stream join operator, DSim-Join, processes each micro-batch of the input stream of records and generates joined output data. We transform each micro-batch into a pair RDD (called inputRDD) comprising ⟨signature, record⟩ pairs by generating signatures for each record. The signature is used as a join key in subsequent processing. The disk-based relation in an independent external database also contains ⟨signature, record⟩ tuples, and a part of the relation is cached as a pair RDD (called cacheRDD) inside the SPE. DSim-Join supports any type of database system that allows key-based searching. Because the relation changes gradually, there is sufficient time to generate signatures for changed records.
We use the filter-and-verification technique, which finds candidate pairs with common signatures and then verifies the candidate pairs. To reduce database accesses, we first find candidate pairs between inputRDD and cacheRDD using an outer equi-join operation. In general, the two inputs of a join operation are shuffled across multiple worker nodes; however, we can avoid this data shuffling by carefully controlling the data partitioning. We split the result of the join operation into cache-hit and cache-missed data, process them in parallel using multi-threading, and then combine their results. For missed data, we need to query the database to find matching database tuples. To avoid duplicate verification, we use the same technique as in [18], which checks whether the current signature is the first matching signature of each candidate pair.

D. OUTLINE OF THE PROPOSED ALGORITHM
In this section, we briefly explain how each design goal is achieved. A detailed explanation is presented in Section IV-E.

1) MINIMIZING DATA TRANSMISSION
Data shuffling redistributes data across partitions and thus incurs expensive data transmission. DSim-Join minimizes data transmission by shuffling only the input micro-batch throughout the entire join process. Obviously, the shuffling of the input micro-batch is inevitable. To prevent data shuffling for the join between inputRDD and cacheRDD, all the elements of the two RDDs with the same key should be co-located in the same worker node. To that end, we first partition the input micro-batch using the partitioner of cacheRDD and then specify each partition's preferred location using the getPreferredLocations function of Spark such that corresponding partitions are co-located in the same worker node. Fig. 2 shows the difference between joining RDD1 and RDD2 without and with getPreferredLocations. In Fig. 2(a), corresponding partitions are located in different worker nodes and thus incur a huge data transmission overhead. We also avoid shuffling in subsequent operations by preserving the partitioning.
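How the preferred locations are specified in DSim-Join is not detailed here, but as a hypothetical sketch of the Spark mechanism involved, a thin wrapper RDD can override getPreferredLocations to pin each partition to the worker that holds the corresponding cacheRDD partition; the class and parameter names below are ours:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps a parent RDD and reports a preferred worker host for each partition index, so a
// partition of inputRDD can be scheduled on the node holding the matching cacheRDD partition.
// The map from partition index to hosts must be built by the application (e.g., from the
// cacheRDD's block locations).
class LocationAwareRDD[T: ClassTag](parent: RDD[T], preferredHosts: Map[Int, Seq[String]])
    extends RDD[T](parent) {

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context) // pass the parent's data through unchanged

  override protected def getPartitions: Array[Partition] = parent.partitions

  // The scheduler consults this to place each task close to its data.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    preferredHosts.getOrElse(split.index, Nil)
}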

2) REDUCING DATABASE ACCESSES AND PARALLELIZING JOIN PROCESSING
Caching in the SPE not only reduces database accesses but also allows parallel join processing. The cache-hit data can be processed simultaneously with the cache-missed data. We call the thread that processes the hit (missed) data the Hit-Thread (Missed-Thread). For the hit data, we simply verify the candidate pairs to generate the final results. For the missed data, we generate a new RDD that contains only the missed data. For each partition of this RDD, we generate one query for all the join keys (i.e., the signatures) present in the partition; for a relational DB, the query has a WHERE clause with the IN operator (or multiple ORs). We can generate queries for multiple partitions in parallel using mapPartitions. We send the queries to the database system to find candidate pairs and then verify the candidate pairs.
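As a rough sketch of this per-partition query generation (the table name R, the column names, and the RDD element types are illustrative assumptions), one IN-clause query can be built from all the signatures found in a partition:

import org.apache.spark.rdd.RDD

// Build one SQL query (with its bind values) per partition of the missed data.
def buildPerPartitionQueries(missedRDD: RDD[(String, String)]): RDD[(String, Seq[String])] =
  missedRDD.mapPartitions { elems =>
    val signatures = elems.map(_._1).toSeq.distinct   // every join key present in this partition
    if (signatures.isEmpty) Iterator.empty
    else {
      // e.g. SELECT signature, record FROM R WHERE signature IN (?, ?, ..., ?)
      val placeholders = Seq.fill(signatures.size)("?").mkString(", ")
      Iterator((s"SELECT signature, record FROM R WHERE signature IN ($placeholders)", signatures))
    }
  }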

3) BALANCING LOAD BETWEEN PARALLEL JOIN THREADS
The performance of the parallel phase is determined by the slowest parallel thread. We balance the load between the Hit- and Missed-Threads by adjusting the cache size. As the cache size increases, the load of the Hit-Thread increases because more candidate pairs need to be verified. In contrast, the load of the Missed-Thread decreases with increasing cache size owing to fewer database accesses and candidate pairs. To balance the load, we propose a Cost-based Cache Replacement (CCR) algorithm. The CCR algorithm monitors the execution times of the database querying and the similarity verification, and dynamically adjusts the cache size based on these execution times.

E. DSIM-JOIN IN DETAIL
Fig. 3 shows a flowchart of the DSim-Join process, which is iteratively run for each micro-batch. All the branches in the flowchart are processed in parallel. The shaded RDDs in Fig. 3 denote persisted RDDs. We explain each component of the process in turn.

1) JOIN BETWEEN INPUTRDD AND CACHERDD
We generate signatures for the input micro-batch and build inputRDD. It is worth noting that data shuffling occurs only when building inputRDD. We then perform a left outer equi-join operation, which returns all the elements from inputRDD, along with matches from cacheRDD. The join operation is implemented as the sort-merge join operation using zipPartitions, and the joined result is persisted in zippedRDD. There is no data shuffling for the join because both inputRDD and cacheRDD are partitioned by the same partitioner and corresponding partitions of the two RDDs are co-located.
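A simplified Scala sketch of this co-partitioned left outer join is shown below; for brevity it groups the cache-side partition by key instead of performing a true sorted merge, but the zipPartitions structure and the Option-valued cache side match the description above. Element types are simplified to strings for illustration:

import org.apache.spark.rdd.RDD

// Per-partition left outer join of the co-partitioned inputRDD and cacheRDD. The output
// keeps the signature as the key, the stream record, and an optional cached record
// (None on a cache miss).
def leftOuterZipJoin(inputRDD: RDD[(String, String)],
                     cacheRDD: RDD[(String, String)]): RDD[(String, (String, Option[String]))] =
  inputRDD.zipPartitions(cacheRDD, preservesPartitioning = true) { (streamIter, cacheIter) =>
    val cacheByKey: Map[String, Seq[String]] =
      cacheIter.toSeq.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    streamIter.flatMap { case (sig, streamRec) =>
      cacheByKey.get(sig) match {
        case Some(cached) => cached.iterator.map(c => (sig, (streamRec, Some(c): Option[String])))
        case None         => Iterator((sig, (streamRec, Option.empty[String])))
      }
    }
  }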

2) HIT DATA PROCESSING
In hit data processing, we first select the elements of zippedRDD in which the cache side is not empty using filter. We then verify the elements (i.e., candidate pairs) using mapPartitions. No data shuffling is performed during this process because filter and mapPartitions do not change keys.
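A minimal sketch of this hit-data path is given below, with verify standing in for the actual similarity check; the element types follow the zippedRDD sketch above:

import org.apache.spark.rdd.RDD

// Hit-data path: keep the entries whose cache side is present, then verify each
// candidate pair per partition.
def processHits(zippedRDD: RDD[(String, (String, Option[String]))],
                verify: (String, String) => Boolean): RDD[(String, (String, String))] =
  zippedRDD
    .filter { case (_, (_, cached)) => cached.isDefined }          // cache hits only
    .mapPartitions(iter => iter.collect {
      case (sig, (streamRec, Some(cachedRec))) if verify(streamRec, cachedRec) =>
        (sig, (streamRec, cachedRec))                              // verified similar pair
    }, preservesPartitioning = true)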

3) MISSED DATA PROCESSING
In missed data processing, we select the elements of zippedRDD in which the cache side is empty and call the result missedRDD. For missedRDD, two steps are involved in the mapPartitions operation.
Step 1 generates and executes a query for each partition to retrieve database tuples.
Step 2 performs a join-and-verify operation between missedRDD and the retrieved database tuples. The join-and-verify operation performs the sort-merge join simultaneously with verification to avoid intermediate data materialization; that is, as soon as we find a candidate pair, we verify it. Finally, we persist the result in DB_RDD. As in the hit data processing, no data shuffling is performed during this process.
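The following sketch illustrates the join-and-verify idea for one partition of the missed data; it uses a simple grouping by signature rather than the sort-merge join described above, but it verifies each candidate pair as soon as it is found. Types and the verify function are placeholders:

// Join-and-verify for one partition of the missed data: candidate pairs are formed by
// matching signatures and verified immediately, so no candidate list is materialized.
def joinAndVerify(missed: Iterator[(String, String)],      // (signature, stream record)
                  dbTuples: Seq[(String, String)],         // (signature, database record)
                  verify: (String, String) => Boolean): Iterator[(String, String)] = {
  val dbBySig = dbTuples.groupBy(_._1)                     // index database tuples by signature
  for {
    (sig, streamRec) <- missed
    (_, dbRec)       <- dbBySig.getOrElse(sig, Nil).iterator
    if verify(streamRec, dbRec)                            // verify as soon as the pair is found
  } yield (streamRec, dbRec)
}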

4) CACHE MANAGEMENT
In the cache initialization phase, we create a cache file in advance by randomly selecting database tuples and build cacheRDD from the cache file. We then pre-partition the cacheRDD before the continuous join processing is started and preserve the partitioning in subsequent processing. To fully utilize all the worker nodes, the cache file should be large enough, and the number of partitions should be determined by considering the total number of CPU cores.
The cache management phase adds the database tuples in DB_RDD to cacheRDD using union and applies the CCR algorithm using mapPartitions for cache replacement. There is no data shuffling because we preserve the partitioning and do not change keys. To maximize parallelism, we run the cache management phase in parallel with the inputRDD building phase, as shown at the bottom of Fig. 3. To do so, we add the database tuples of the n-th iteration to cacheRDD in the (n + 1)-th iteration. Periodic checkpoints are performed on cacheRDD to reduce the lineage re-analysis overhead.
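A hypothetical sketch of this cache-management step is shown below; the checkpoint-interval handling and the applyCCR placeholder are assumptions for illustration, and Spark additionally requires a checkpoint directory to be configured via setCheckpointDir:

import org.apache.spark.rdd.RDD

// Cache management: add the database tuples fetched in the previous iteration to the
// cache, run the (placeholder) CCR replacement on each partition, and periodically
// checkpoint the cache to truncate its lineage. Assumes both RDDs share a partitioner.
def updateCache(cacheRDD: RDD[(String, String)],
                dbRDD: RDD[(String, String)],
                iteration: Long,
                checkpointInterval: Int,
                applyCCR: Iterator[(String, String)] => Iterator[(String, String)])
    : RDD[(String, String)] = {
  val updated = cacheRDD
    .union(dbRDD)                                           // add the newly fetched tuples
    .mapPartitions(applyCCR, preservesPartitioning = true)  // cost-based replacement per partition
    .persist()                                              // keep the new cache in memory

  if (iteration % checkpointInterval == 0) updated.checkpoint()
  updated
}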

5) COST-BASED CACHE REPLACEMENT
Algorithm 1 presents the CCR algorithm; the notation is defined in Table 2. To balance the load between the Hit- and Missed-Threads, the cache size is adjusted based on the execution times of the threads. If T_hit ≤ T_missed for the current iteration, we simply add the database tuples to cacheRDD, and the CCR algorithm does not take any action in the next iteration. If T_hit > T_missed for the current iteration, we remove some elements from cacheRDD such that T_hit decreases and T_missed increases in the next iteration. We use the average query execution time as the criterion for selecting the elements to remove. To keep elements with high query times in the cache, we remove elements with the lowest query times first. Because we execute a query for each partition, the query execution time of a single element cannot be measured accurately; thus, we use the average query execution time for each element. The CCR algorithm is applied in parallel to each cacheRDD partition.
Using the notation in Table 2, T_hit and T_missed are expressed as in Formulas (1) and (2).
Let us assume that n elements are currently stored in each cacheRDD partition in ascending order of average query execution time, and that the input micro-batch of the next iteration is the same as that of the current iteration. If we remove the k elements with the lowest query times from the cacheRDD partition, the next-iteration times T_hit^next and T_missed^next can be expressed as shown in Formulas (3) and (4). In Algorithm 1, when T_hit > T_missed, we iteratively remove the elements with the lowest query times from the cacheRDD partition until the stopping condition T_hit − T_missed ≤ 2·S_verify + S_db holds, that is, until we reach the minimum k that satisfies Formula (5) and makes T_hit^next ≤ T_missed^next.
Because the input micro-batch changes for each iteration, we cannot accurately estimate k. However, this is not a problem because our cache management mechanism dynamically adjusts the cache size for each iteration.
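For illustration, the following sketch applies a CCR-style decision to the entries of a single cacheRDD partition. The stopping rule mirrors Algorithm 1, but the per-element cost adjustments are only illustrative stand-ins for Formulas (3)-(5), which are not reproduced here; the names CacheEntry, sVerify, and sDb are ours:

// CCR-style replacement for one cacheRDD partition (cost updates are illustrative only).
final case class CacheEntry(signature: String, record: String, avgQueryTime: Double)

def ccrReplace(entries: Seq[CacheEntry],
               tHit: Double, tMissed: Double,       // measured thread times of this iteration
               sVerify: Double, sDb: Double): Seq[CacheEntry] = {
  if (tHit <= tMissed) entries                      // hit thread is not the bottleneck: keep all
  else {
    var hit = tHit
    var missed = tMissed
    val sortedAsc = entries.sortBy(_.avgQueryTime)  // cheapest-to-requery elements first
    var k = 0
    // Remove elements until the estimated gap closes (cf. the stopping rule in Algorithm 1).
    while (k < sortedAsc.size && hit - missed > 2 * sVerify + sDb) {
      hit    -= sVerify
      missed += sVerify + sortedAsc(k).avgQueryTime
      k += 1
    }
    sortedAsc.drop(k)                               // keep the elements with higher query times
  }
}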

F. DISCUSSION
Our work can be extended to other SPEs that support window semantics. We can consider a non-overlapping window as a micro-batch and implement the optimization techniques of DSim-Join using the existing APIs of those SPEs. For example, we can implement the stream-cache join using a non-overlapping window and a bounded stream for the cache. We leave this for future work. Cache coherence between the cache and the database can be provided using the Change Data Capture (CDC) feature of database management systems (DBMSs). DSim-Join receives the stream of record updates from the DBMS using CDC and removes the updated records from the cache before the sort-merge join operation between inputRDD and cacheRDD.

V. PERFORMANCE EVALUATION
A. EXPERIMENTAL SETUP
We compare our method, DSim-Join, with Dima and DS-Join in terms of throughput for various scenarios. We choose Dima because it is a state-of-the-art method that supports similarity join processing in a distributed in-memory environment and DS-Join because it is a state-of-the-art method that supports semi-stream equi-join processing in a distributed environment.
We have extended Dima to support semi-stream join processing. Given two datasets, we construct a relation from one dataset and use the other dataset as a stream. For each input micro-batch, we build a pair RDD consisting of ⟨signature, record⟩ (as in the inputRDD building phase of DSim-Join) and query the database for each partition of the RDD to find candidate pairs (as in the Missed-Thread phase of DSim-Join). We also apply the join-and-verify operation. However, we do not use a cache because Dima does not have a caching mechanism, which is a feature of our algorithm. We also do not use the broadcast operation for load redistribution of the original Dima implementation because it incurs a high data transmission overhead when the datasets are large. Hereafter, we call the extended implementation of Dima Dima_Stream.
We have extended DS-Join to support similarity join processing. To that end, we apply the filter-and-verification technique of DSim-Join to DS-Join. Hereafter, we call the extended implementation of DS-Join DS-Join_Sim. Almost all the optimization techniques of DSim-Join are applied to DS-Join_Sim except the cache replacement algorithm. We do not change the caching algorithm of DS-Join in order to observe the effectiveness of the CCR algorithm of DSim-Join.
We use the following real-world datasets: the Amazon review (http://jmcauley.ucsd.edu/data/amazon), Internet Movie DataBase (IMDB, http://ai.stanford.edu/~amaas/data/sentiment), and DBLP (https://data.mendeley.com/research-data) datasets, as detailed in Table 3. The average length of records is measured in bytes. For each dataset, we construct a relation consisting of ⟨signature, record⟩. The DB size includes the sizes of both the relation and the index on the signature column. To generate the input stream for each dataset, we randomly select records from the dataset, store them in files, and iteratively read and send the records. For the Amazon review dataset for books, we generate three datasets with different numbers of database tuples. The IMDB dataset consists of movie reviews. The DBLP dataset is a computer-science bibliographic dataset; we use its title field and generate a much larger dataset, DBLP_×10, by duplicating the DBLP dataset 10 times.
To measure the average throughput per second for each dataset, we count the number of similarity join results over a 10-minute period after the cache of the DBMS has warmed up. For all the methods, we use the maximum sustainable input rate of the system. We use Spark Streaming 2.4.4 (as the SPE), MongoDB 4.2.2 [42] (as the DBMS), and Mesos 1.8.0 [43] (as the cluster manager). We set up a MongoDB sharded cluster, which consists of shards, a query router, and a configuration server. The relations are hash-partitioned across the DB nodes (shards). A worker node for Spark Streaming and a DB node can be mapped to the same physical machine. Table 4 lists the default experiment parameters. All the nodes are connected by a 1-gigabit Ethernet network and run Ubuntu 18.04 LTS. PCs with an Intel Core i5 CPU and 16 GB of RAM are used as the worker nodes and DB nodes. A server with two Intel Xeon E5-2620 v2 CPUs and 48 GB of RAM is used as the query router and configuration server. Samsung 850 PRO 256 GB SSDs are used as storage devices.

B. EXPERIMENTAL RESULTS
We perform the following experiments to test various scenarios.

1) EXPERIMENT 1 (THROUGHPUT FOR VARYING DATABASE SIZE)
For the Amazon review dataset, we measure the throughput for databases with 10, 30, and 50 million tuples. We randomly select 7,000 records from the Amazon review dataset to generate the input stream. For AmazonReview_10M, the database almost fits in the DBMS cache but is larger than the cache in the SPE. In contrast, for AmazonReview_50M, the database is too big to fit in the DBMS cache. We use four worker nodes and four DB nodes, where the worker nodes and DB nodes are physically distinct machines. Hereafter, we call this configuration the default system configuration. Fig. 5 shows that DSim-Join achieves much higher throughput than the other methods for all database sizes. The performance improvement of DSim-Join over Dima_Stream is up to 1.8 times. This is because DSim-Join uses the cache in the SPE, whereas Dima_Stream does not. Furthermore, DSim-Join optimizes the join processing using the cache. As the database size increases, the advantage of DSim-Join over Dima_Stream becomes more prominent because the database access time increases. The improvement of DSim-Join over DS-Join_Sim is up to 1.3 times because DSim-Join balances the load between parallel join threads using the CCR algorithm. The improvement of DSim-Join over DS-Join_Sim is almost constant (1.2 to 1.3 times) for varying database sizes because both methods use the cache.

2) EXPERIMENT 2 (THROUGHPUT FOR VARYING MICRO-BATCH INTERVAL)
We measure the throughput by varying the micro-batch interval. To generate the input stream, we randomly select 7K, 7K, and 100K records for AmazonReview_50M, IMDB, and DBLP_×10, respectively. We use the default system configuration. Fig. 4 shows that DSim-Join significantly outperforms the other methods when the micro-batch interval is greater than 2 s for all the datasets. DSim-Join exhibits up to 2.2 times (1.5 times) higher throughput than Dima_Stream (DS-Join_Sim). Increasing the micro-batch interval beyond 5 s only slightly increases the throughput of DSim-Join because the system becomes fully utilized. When the micro-batch interval is very small (0.5 s to 1 s), Dima_Stream outperforms the other methods, which use the cache, because the amount of data that needs to be accessed from the database is reduced; in this case, the overhead of caching becomes greater than the benefit gained from caching.
In ascending order of throughput, the three datasets are IMDB, AmazonReview_50M, and DBLP_×10. This order coincides with the descending order of average record length shown in Table 3, because longer records generate more candidate pairs and are more expensive to verify.

3) EXPERIMENT 3 (STABILITY TEST)
To demonstrate the performance stability of DSim-Join, we run DSim-Join for 24 hours and measure the average processing time and cache size for each hour. We use AmazonReview_50M with 7K records for the input stream under the default system configuration. Fig. 6(a) shows that the processing time and the cache size stabilize quickly, within a few minutes. Fig. 6(b) shows that the average processing time and cache size remain stable, and the processing is always completed within the micro-batch interval of 3 s.

4) EXPERIMENT 4 (SCALABILITY TEST)
To test the scalability of DSim-Join, we measure the throughput for the Amazon review dataset while varying the number of nodes n from 4 to 8. In this experiment, there are n nodes, each of which serves as both a worker node and a DB node. Thus, both the SPE and the database system are scaled out as n increases. We also vary the database size. Fig. 7 shows that the performance improvement of DSim-Join over Dima_Stream increases with n (e.g., from 1.7 to 1.9 times for AmazonReview_50M). This is because the total size of the cache in the SPE increases with n for DSim-Join. The improvement of DSim-Join over Dima_Stream becomes more prominent as the database size increases. The improvement of DSim-Join over DS-Join_Sim is almost constant (1.17 to 1.24 times) for varying n and database size because both methods use the cache.

5) EXPERIMENT 5 (THROUGHPUT FOR VARYING DATA DISTRIBUTION)
We measure the throughput while varying the data distribution. To that end, we vary the number of distinct records n in the input stream; obviously, a smaller n results in a higher cache hit ratio. We use the default system configuration. Fig. 8 shows that the performance improvement of DSim-Join over Dima_Stream is higher for smaller n owing to more cache hits in the SPE. DSim-Join outperforms DS-Join_Sim because DSim-Join uses the CCR algorithm, which balances the load between parallel join threads. In Fig. 8(c), DS-Join_Sim exhibits lower throughput than Dima_Stream because the caching algorithm of DS-Join_Sim excessively reduces the cache size when database access is not relatively slow compared with the other operations. The CCR algorithm does not have this problem.

6) EXPERIMENT 6 (THROUGHPUT FOR VARYING INPUT RATE)
We measure the throughput while varying the input rate (i.e., the number of incoming records per second). We vary the input rate from 20 to 60 and set the micro-batch interval to 3 s. It is worth noting that an input rate of 60 is the maximum sustainable input rate for a micro-batch interval of 3 s. We use AmazonReview_50M under the default system configuration. Fig. 9(a) shows that, as the input rate increases, DSim-Join increasingly outperforms Dima_Stream. A higher input rate results in more data that must be accessed from the database; thus, the advantage of caching increases. DSim-Join exhibits up to 1.6 times (1.2 times) better throughput than Dima_Stream (DS-Join_Sim).

7) EXPERIMENT 7 (THROUGHPUT FOR VARYING CHECKPOINT INTERVAL)
We measure the throughput by varying the checkpoint interval used in cache management. We use AmazonReview_10M under the default system configuration. Fig. 9(b) shows that DSim-Join outperforms the other methods for all checkpoint intervals between 5 and 40 iterations and achieves the best throughput at an interval of 10 iterations. When selecting the checkpoint interval, there is a trade-off between the checkpoint overhead and the lineage re-analysis overhead. Optimizing this parameter is a separate research topic.

8) EXPERIMENT 8 (THROUGHPUT FOR VARYING SIMILARITY THRESHOLD)
We measure the throughput by varying the similarity threshold τ. We use AmazonReview_50M under the default system configuration. Fig. 9(c) shows that DSim-Join achieves higher throughput than the other methods for all threshold values. DSim-Join exhibits up to 2.0 times (1.5 times) better throughput than Dima_Stream (DS-Join_Sim). As the threshold increases, the throughput of all the methods improves because a higher threshold results in fewer candidate pairs.

C. SUMMARY
DSim-Join outperforms the state-of-the-art methods when the database size is large and the input rate is high, and is thus suitable for big data. Furthermore, DSim-Join is scalable and stable; it is not very sensitive to parameters such as the micro-batch interval, checkpoint interval, and similarity threshold.

VI. CONCLUSIONS AND FUTURE WORK
In this study, we have proposed DSim-Join, the first semi-stream similarity join method for a distributed environment. We have achieved high performance by minimizing data transmission, reducing database accesses, parallelizing join processing, and balancing the load between parallel join threads using the CCR algorithm. Experimental results show that our method significantly improves the throughput compared with state-of-the-art methods. We have implemented our method in a widely used distributed SPE, namely Spark Streaming; thus, our approach has broad applicability and is practically useful to the big data community. Finally, we plan to apply our work to online data analysis, such as online text mining, and to extend our work to other SPEs that support window semantics for wider applicability.