Bitcoin Shared Send Transactions Untangling in Numbers

Bitcoin uses an unspent transaction output (UTXO) model for coin circulation, which is similar to banknotes. The transaction history is publicly available and allows tracing cryptocurrency flows. To tangle the flows, different users merge their transactions into a single bigger one. The merged transaction is called a shared send mixer (SSM). One can try to find the original subtransactions, i.e., solve an untangling problem. The number of untanglings and their sizes give additional information about coin circulation. Theoretical analysis of the untangling problem is known from the literature. This paper aims to collect statistics of SSM usage by transaction type for the Bitcoin blockchain. We propose an algorithm to solve the problem, prove its correctness, and provide the source code. We applied the algorithm to the Bitcoin historical data: 15% of transactions are SSMs, and 90% of them allow a unique untangling. Future work includes applying the algorithm to other UTXO systems and adapting the results to address grouping.


I. INTRODUCTION
Bitcoin is the first and the biggest by capitalization blockchain-based cryptocurrency [1], [2]. A Proof-of-Work consensus protocol run by a globally distributed network allows censorship-resistant cross-border transfers in a native digital token, bitcoin [3], [4], [5], [6]. The system does not force users to disclose their identity, but only the information needed to prove bitcoin ownership: digital fingerprints of public keys upon receiving, and public keys with transaction signatures upon sending. The Bitcoin protocol represents the receiver's information, together with a format identifier and a checksum, as an address [7], [8]. Although data encryption techniques exist, they are still challenging, increase costs, and decrease trust [9], [10], [11], [12]. So the key anonymity option for users is that they need not disclose their names and keep only in-platform pseudonyms, i.e., addresses [13], [14].
The censorship resistance prevents the network from freezing suspicious transactions and does not allow banning users. An immutable transaction log makes it impossible to cancel a transaction once it is in the blockchain [15]. The (pseudo) anonymity helps users to remain unknown. These properties make cryptocurrencies, including Bitcoin, a popular tool for criminal groups to launder money, sell illegal goods, and run scams and ransomware [16]. In 2021, the amount of illegal funds passing through cryptocurrencies was estimated at 14 billion USD [17], although this is only 0.15% of the total volume of all transactions for that year. We do not claim that only criminals want to hide their bitcoin flows. But criminals are vitally interested in it, and the web is full of examples of their failures [18], [19], [20].
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
Bitcoin uses an unspent transaction output model (UTXO) for the currency circulation [1]. Each transaction consists of several inputs and outputs. An output contains a non-negative amount and an instruction to spend it. An instruction usually requires a digital signature with a private key that corresponds to the public key fingerprint stated in the output [21], [22]. Once a transaction is committed to the blockchain, its outputs become unspent (UTXO) and usable as inputs for future transactions. If a UTXO is used as an input of a committed transaction, it is considered spent and cannot be used as an input of any other transaction. Bitcoins are like paper banknotes in circulation: to spend the wanted amount of currency, one provides banknotes for a greater or equal amount. If the amount is greater, the sender gets change back. In Bitcoin, transaction input banknotes are burned, while output banknotes are printed. The banknote denomination can be any multiple of 1 satoshi = $10^{-8}$ bitcoin. The sum of the input amounts has to be greater than or equal to the sum of the outputs, and their difference is a transaction fee paid to block miners. The only exception is the first transaction of each block. It has no inputs but only outputs, and the outputs contain the emitted currency and the fees for all the transactions in the block.
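The banknote analogy can be illustrated with a minimal sketch. The function name `build_payment` and the largest-first selection policy are assumptions made for illustration only; real wallets use their own coin selection rules. Amounts are integers in satoshi.

```python
def build_payment(unspent, amount, fee):
    """Pick 'banknotes' (UTXO amounts in satoshi) to pay `amount` plus `fee`.

    Returns (inputs, outputs). The outputs hold the payment and, if needed,
    the change. The largest-first policy is a hypothetical choice.
    """
    selected, total = [], 0
    for note in sorted(unspent, reverse=True):
        if total >= amount + fee:
            break
        selected.append(note)
        total += note
    if total < amount + fee:
        raise ValueError("insufficient funds")
    change = total - amount - fee
    outputs = [amount] + ([change] if change else [])
    return selected, outputs


inputs, outputs = build_payment([30_000, 50_000, 20_000], amount=60_000, fee=1_000)
# The fee is never written explicitly: it is the input sum minus the output sum.
implied_fee = sum(inputs) - sum(outputs)
```

Here two banknotes are burned, two new ones are printed (the payment and the change), and the difference goes to the miner.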
A shared send mixer (SSM) is a technique to merge several transactions into one so that the correspondence between senders and receivers from the original subtransactions is not available [23]. SSM is an old but effective instrument for increasing privacy in Bitcoin transactions, as recent Europol reports state [24], [25]. Kristov Atlas introduced the SSM untangling by matching in a Sudoku manner in 2014 [26]. The white paper [27] provided a mathematical formalism for the untangling and proved that the problem is computationally hard, allowing only pseudo-polynomial algorithms.
The main contributions of our work to the Bitcoin transaction analysis are
1) Introduce an algorithm for SSM untangling.
2) Provide an open source code of the proposed algorithm.
3) Present SSM usage statistics for Bitcoin historical data.
The rest of the paper is organized as follows. Section II considers the related work and the contribution of the current paper. We formulate the SSM untangling problem and list existing results in Section III, supporting them with real transaction examples. The algorithm to solve the problem is proposed in Section IV. We analyze Bitcoin transaction history in Section V. Finally, Section VI concludes the paper.

II. RELATED WORK
Privacy preservation is an arms race: while one person wants to keep it, another tries to reveal information. The Bitcoin privacy model consists in keeping public keys anonymous [1]. Since the origin of Bitcoin, various UTXO templates have been developed: P2PK, P2PKH, P2SH, P2WPKH, etc. [28]. Different templates utilize a public key in different ways, and we cannot always compute one template from another because of hashing [29]. Researchers consider an address as a commitment to a public key rather than the public key itself.
A user is free to generate addresses. To use a generated address, the user needs to remember the corresponding private key(s). Hierarchical deterministic wallets made managing multiple addresses as easy as storing a single private key [30]. Thus, the problem of grouping addresses by owners (wallets) arises. Bitcoin usage patterns inspire heuristics to reveal information: common spending (CS), one-time change (OTC) [31], and coinbase clustering (CBC) [32]. At the banknote circulation level, CS groups input addresses, as a user spends several banknotes to reach the wanted amount. OTC groups one output address with the input addresses and stands for getting a banknote back: when the input amount exceeds the amount to spend, the user gets change. CBC groups the outputs of the miner reward transaction (the coinbase transaction), as a miner (pool) rewards itself no matter the number of emitted banknotes.
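As an illustration of the common spending heuristic, the following minimal sketch merges all input addresses of each transaction into one cluster with a union-find structure. The function name and the input format (a list of input-address lists) are assumptions for this example and are not taken from the cited works.

```python
def cluster_common_spending(transactions):
    """Group addresses that appear together as inputs of one transaction.

    `transactions` is an iterable of lists of input addresses (hypothetical
    labels). Returns a set of frozensets, one per inferred wallet.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    for input_addresses in transactions:
        if not input_addresses:
            continue  # coinbase transactions have no inputs
        first = input_addresses[0]
        find(first)  # register single-input transactions as well
        for other in input_addresses[1:]:
            union(first, other)

    clusters = {}
    for address in parent:
        clusters.setdefault(find(address), set()).add(address)
    return {frozenset(group) for group in clusters.values()}


wallets = cluster_common_spending([["A", "B"], ["B", "C"], ["D"]])
```

For a shared send mixer this grouping would wrongly merge the addresses of independent participants, which is one motivation for untangling.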
As CS, OTC and CBC are heuristics, they are prone to errors. One source of errors is a shared send mixer (SSM) [23]: CS and OTC provide wrong groupings for SSMs. An SSM both tangles the transaction history and reduces the fee per transfer. These properties made SSMs a part of custodial wallets before 2017, and Europol claims they are still in use [24], [25]. The idea of SSM untangling is formulated in [26]. The paper [27] stated the problem mathematically, analyzed its complexity class, and computed statistics for a selected month. Since then, neither an untangling algorithm nor quantitative information on SSMs has been publicly available. Zero knowledge proofs [9], [33] allow advanced coin mixing, but the Bitcoin transaction language does not allow such involved logic. Another anonymization technique is a shared coin mixer (SCM) [34]. In SCM, the user sends bitcoins to an intermediary service. After a certain time, the service sends other available funds to the recipient's address. Sending to and receiving from the intermediary is usually divided into several transactions. A client needs to trust the service. If the operation is successful, the connection between the sending and receiving addresses is almost impossible to trace. SCM can be provided by exchanges or services without registration, depending on the desired anonymity.
Although address owners are not required to disclose information about themselves, much public information, called off-chain information, can be found on the Internet. Both delayed and real-time off-chain data collection helps deanonymization. Examples of public tag providers are Wallet Explorer [35], with categorized organization labels, and Bitcoin Abuse [36], with scam reports for addresses. Network attacks for deanonymization include Sybil and Fake Node attacks [37], which reveal users' IP addresses, while transaction remote release [38] and delayed propagation [39] hide the IP address of the transaction's author. Finally, machine learning models incorporate available data to cluster addresses by wallets [40], [41], [42], classify wallets by user type [13], [43], or score addresses for anti-money laundering compliance and risk management [44], [45], [46], [47].

III. UNTANGLING PROBLEM
We introduce the basic notation and assumptions in this Section.

A. NOTATIONS
An address and a non-negative amount represent each transaction input and output. Let small letters be amounts, capital letters be addresses, and calligraphic capital letters be multisets of amount and address pairs. For example, $(c_k, C_k)$ is an input or output with address $C_k$ and amount $c_k$, and $\mathcal{C}$ is a multiset of such pairs. Coinbase transactions have no inputs. We do not consider them as they are not shared send mixers. A Bitcoin transaction is viewed as an ordered triple $t = (\mathcal{A}, \mathcal{B}, c)$, where
• $\mathcal{A}$ is a multiset of transaction inputs, and each input $(a_n, A_n) \in \mathcal{A}$ is an ordered pair of the address $A_n$ and the value of the input $a_n \ge 0$.
• $\mathcal{B}$ is a multiset of transaction outputs, and each output $(b_m, B_m) \in \mathcal{B}$ is an ordered pair of the address $B_m$ and the value of the output $b_m \ge 0$.
• $c = \sum_{(a_n, \cdot) \in \mathcal{A}} a_n - \sum_{(b_m, \cdot) \in \mathcal{B}} b_m \ge 0$ is the transaction fee.
For an arbitrary multiset of transaction inputs or outputs $\mathcal{C}$, $\mathrm{Addr}(\mathcal{C}) = \cup_{(\cdot, C) \in \mathcal{C}} \{C\}$ denotes the multiset of addresses in $\mathcal{C}$, and $\mathrm{Sum}(\mathcal{C}) = \sum_{(c, \cdot) \in \mathcal{C}} c$. An address may be used several times in the set of inputs and/or outputs. For convenience, we simplify the multisets of inputs and outputs as follows (Figure 1):
1) Sum all inputs grouped by addresses, and sum all outputs grouped by addresses.
2) For each address C existing both as input and output, subtract the smaller amount from both input and output.
3) Discard all pairs with zero amounts.
I.e., $\mathrm{Simplify} : (\mathcal{A}, \mathcal{B}, c) \mapsto (\{(\mathrm{Bal}(C), C) : \mathrm{Bal}(C) > 0\}, \{(-\mathrm{Bal}(C), C) : \mathrm{Bal}(C) < 0\}, c)$, and Bal is the balance function, $\mathrm{Bal}(C) = \sum_{(a, C) \in \mathcal{A}} a - \sum_{(b, C) \in \mathcal{B}} b$.
Hereafter, we consider only simplified transactions t = Simplify(t) and omit any special notations for them. As a result, no address appears in both the inputs and the outputs of a transaction. A transaction $t = (\mathcal{A}, \mathcal{B}, c)$ can be represented by a bipartite graph G(t), which reflects coin flows among participants:
• The first part of the vertices is $\mathcal{A}$, and the second part is $\mathcal{B}$.
• Each vertex C has a weight equal to the corresponding amount c.
• Two vertices A and B are connected by an edge if the entity controlling the input A authored spending to B.
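The three simplification steps listed above can be sketched as follows. This is an illustrative reading, not the published implementation: the function name `simplify` and the (amount, address) list format are assumptions, and step 2 is read as subtracting the smaller of the two amounts from both sides.

```python
from collections import Counter


def simplify(inputs, outputs):
    """Simplify a transaction given as lists of (amount, address) pairs.

    Returns (inputs, outputs, fee) with one pair per address, no address on
    both sides, and no zero amounts. Amounts are integers in satoshi.
    """
    # Step 1: sum amounts grouped by address.
    in_by_address, out_by_address = Counter(), Counter()
    for amount, address in inputs:
        in_by_address[address] += amount
    for amount, address in outputs:
        out_by_address[address] += amount
    fee = sum(in_by_address.values()) - sum(out_by_address.values())

    simple_inputs, simple_outputs = [], []
    for address in sorted(set(in_by_address) | set(out_by_address)):
        # Step 2: net the address; a Counter returns 0 for a missing key.
        balance = in_by_address[address] - out_by_address[address]
        # Step 3: zero balances are dropped.
        if balance > 0:
            simple_inputs.append((balance, address))
        elif balance < 0:
            simple_outputs.append((-balance, address))
    return simple_inputs, simple_outputs, fee


simplified = simplify(
    inputs=[(5, "A"), (3, "A"), (4, "B")],
    outputs=[(6, "A"), (4, "B"), (1, "C")],
)
```

In this toy example address A keeps a net input of 2, address B cancels out, and the fee of 1 satoshi is unchanged by the simplification.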
Assumption 1: Intended expenses of each input subset in the transaction do not exceed their actual expenses.
Public information about the transaction contains all data about the vertices of the transaction graph, but does not contain any information about the edges. Thus, the goal of shared send analysis is to reconstruct the edges based on the vertices and, perhaps, some other information. Recovering information about the edges is an important subtask of the previously stated problem of finding relationships among addresses; therefore, it would be useful in a variety of applications, such as transaction risk scoring and law enforcement investigations.
Hereafter, we display transactions as a set of input addresses in purple boxes on the left and a set of output addresses in pink boxes on the right. An input box is connected to an output box if the coins flow from the input box to the output box. The addresses are presented with the first and the last two symbols when there is no confusion. Bitcoin has integer arithmetic in satoshis, where 1 BTC = $10^8$ satoshi. We display amounts in BTC with eight decimals to keep alignment and scale. The fee is listed at the bottom of a figure. We refer to a committed transaction by its position in the blockchain, i.e., the block height and index. The recovery of the edges of G(t) can have multiple solutions. We can always construct a complete graph. If the transaction t is a shared send mixing transaction, we can assign a connected component to each participating entity. The goal of untangling is to construct all the possible graphs.
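Definition 2: Given a transaction $t = (\mathcal{A}, \mathcal{B}, c)$, a pair of sets $\mathcal{A}' \subset \mathcal{A}$ and $\mathcal{B}' \subset \mathcal{B}$, with at least one non-empty set, is called connectable iff the following condition holds: $\mathrm{Sum}(\mathcal{B}') \le \mathrm{Sum}(\mathcal{A}') \le \mathrm{Sum}(\mathcal{B}') + c$. That is, connectivity means that bitcoins from the addresses in $\mathcal{A}'$ were spent to the addresses in $\mathcal{B}'$, and from the addresses in $\mathcal{A} \setminus \mathcal{A}'$ to the addresses in $\mathcal{B} \setminus \mathcal{B}'$.
A collection of sets $\{X_k\}_{k=1}^{K}$ is called a partition of a set X if the sets in the collection are pairwise disjoint, and their union equals X. We denote a partition as $X = \sqcup_{k=1}^{K} X_k$.
Definition 3: Consider a transaction $t = (\mathcal{A}, \mathcal{B}, c)$. A pair of K-element partitions of input and output sets, i.e., $\mathcal{A} = \sqcup_{k=1}^{K} \mathcal{A}_k$ and $\mathcal{B} = \sqcup_{k=1}^{K} \mathcal{B}_k$, is called an acceptable partition iff the pair $(\mathcal{A}_k, \mathcal{B}_k)$ is connectable for each k = 1, . . . , K.
We will denote partitions in the form $P = \{(\mathcal{A}_k, \mathcal{B}_k)\}_{k=1}^{K}$ and say that sets $\mathcal{A}_k$ and $\mathcal{B}_k$ in the partition correspond to each other. We need to restrict the search space for untangling to reduce complexity.
Definition 4: A connectable pair $(\mathcal{A}, \mathcal{B})$ is called minimal iff the sets in it do not allow smaller partitions.
Definition 5: An acceptable partition $\{(\mathcal{A}_k, \mathcal{B}_k)\}_{k=1}^{K}$ is called minimal iff it cannot be further subdivided into an acceptable partition, i.e., $(\mathcal{A}_k, \mathcal{B}_k)$ is a minimal connectable pair for each k = 1, . . . , K.
Using the concept of minimal partitions, the untangling problem is formulated as follows.
Definition 6: The minimal untangling problem is, based on the transaction $t = (\mathcal{A}, \mathcal{B}, c)$, to produce every minimal partition of the transaction graph G(t).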
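To make connectable pairs concrete, the sketch below enumerates them for a tiny transaction. It is an illustration under a stated assumption, namely that a pair is connectable iff Sum(B') <= Sum(A') <= Sum(B') + c with at least one non-empty set. The function names and the address labels are hypothetical.

```python
from itertools import chain, combinations


def all_subsets(names):
    """Every subset of `names`, including the empty one."""
    return chain.from_iterable(
        combinations(names, size) for size in range(len(names) + 1)
    )


def connectable_pairs(inputs, outputs, fee):
    """List connectable (input subset, output subset) pairs by brute force.

    `inputs` and `outputs` map address labels to satoshi amounts.
    ASSUMPTION: connectable means Sum(B') <= Sum(A') <= Sum(B') + fee.
    """
    pairs = []
    for a_names in all_subsets(sorted(inputs)):
        spent = sum(inputs[name] for name in a_names)
        for b_names in all_subsets(sorted(outputs)):
            if not a_names and not b_names:
                continue  # at least one set must be non-empty
            received = sum(outputs[name] for name in b_names)
            if received <= spent <= received + fee:
                pairs.append((a_names, b_names))
    return pairs


pairs = connectable_pairs({"A1": 10, "A2": 5}, {"B1": 10, "B2": 5}, fee=0)
```

For this zero-fee example the connectable pairs are the two matching subtransactions and the whole transaction; the enumeration visits every subset pair, so it is only practical for small inputs.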
By accepting such a restriction, we implicitly assume that a non-mixing transaction is unlikely to look like a mixing transaction. Furthermore, the value flows are more likely to correspond to a minimal acceptable partition than to a non-minimal partition extensible to this minimal partition, as additional measures need to be taken to make value flows dividable.
Definition 7: A shared send transaction is called
• Simple iff the unique minimal partition P exists and |P| = 1, i.e., we cannot find non-trivial disconnected components.
• Separable iff the unique minimal partition P exists and |P| > 1.
• Ambiguous iff there are at least two different minimal partitions.
Additionally, we semi-formally define time limit transactions as transactions for which attribution to a specific category is infeasible because of computational limitations inherent to any practical solver. Unlike the categories described in Definition 7, which transactions are categorized as time limit depends on the capabilities of a particular solver. For a solver with infinite computational resources, all shared send transactions would be from one of the classes: simple, separable, or ambiguous. A simple transaction gains no additional information from untangling. A separable transaction can be divided into several subtransactions, usable for further analysis. For example, we can apply CS and OTC heuristics to them individually and get new data on user wallets. An ambiguous transaction indicates a mixing transaction without a way to analyze it. The OTC heuristic can be misused for an ambiguous transaction. Hence, a Bitcoin transaction is regular, simple, separable, ambiguous, or time limit (see Figure 5). Time limit is a denial of the transaction classification into simple, separable or ambiguous due to the computational limitations. Simple, separable, ambiguous, and time limit are types of shared send transactions. Transaction examples are in Figures 2, 3, 4, 6, 7 and 8.
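The three classes can be checked directly from the definitions on very small transactions. The sketch below is a brute-force illustration, not the algorithm proposed in this paper, and it rests on an assumption: a pair is connectable iff Sum(B') <= Sum(A') <= Sum(B') + c. Under that assumption, and because the balances of all blocks add up to the fee, a partition is acceptable exactly when every block has a non-negative balance (inputs minus outputs). The function names are hypothetical, and the regular/shared send distinction is not modeled.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every partition of the list `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for size in range(len(rest) + 1):
        for companions in combinations(rest, size):
            remaining = [x for x in rest if x not in companions]
            for tail in set_partitions(remaining):
                yield [(first, *companions), *tail]


def classify_by_definition(input_amounts, output_amounts):
    """Return 'simple', 'separable' or 'ambiguous' for a simplified transaction."""
    # Inputs count as positive and outputs as negative; the total is the fee.
    signed = list(input_amounts) + [-b for b in output_amounts]
    if not signed or sum(signed) < 0:
        raise ValueError("empty transaction or outputs exceed inputs")

    def balance(block):
        return sum(signed[i] for i in block)

    def splittable(block):
        # A block splits if a proper part and its complement both keep
        # a non-negative balance.
        total = balance(block)
        return any(
            0 <= balance(part) <= total
            for size in range(1, len(block))
            for part in combinations(block, size)
        )

    minimal = [
        partition
        for partition in set_partitions(list(range(len(signed))))
        if all(balance(block) >= 0 for block in partition)
        and not any(splittable(block) for block in partition)
    ]
    if len(minimal) > 1:
        return "ambiguous"
    return "simple" if len(minimal[0]) == 1 else "separable"


labels = [
    classify_by_definition([10, 5], [10, 5]),  # two distinct matching flows
    classify_by_definition([5, 5], [5, 5]),    # equal amounts, two matchings
    classify_by_definition([6, 4], [7, 2]),    # fee 1, no split possible
]
```

The enumeration grows faster than exponentially with the number of inputs and outputs, so it only serves to check the definitions on toy examples.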

C. THEORETICAL RESULTS
The white paper [27] provides a necessary condition for ambiguous transactions. The same white paper [27] also reveals the computational complexity class of the shared send untangling problem:
Theorem 2: The problem of detecting ambiguous shared send transactions is NP-complete.
Theorem 2 shows that we are unlikely to find an algorithm faster than brute force. As the search space for brute force includes all the input and output subsets, the execution time may be infeasible for some transactions. So we terminate the execution upon reaching a computational time budget and label such transactions as time limit (see Figure 8).
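The idea of a computational time budget can be sketched as a deadline check inside the subset enumeration. The names `scan_with_budget` and `visit` are assumptions for illustration; the clock is injectable so that the behavior can be checked without waiting.

```python
import time
from itertools import chain, combinations


def all_subsets(items):
    """Every subset of `items`, including the empty one."""
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )


def scan_with_budget(inputs, outputs, visit, budget_seconds, clock=time.monotonic):
    """Call visit(A', B') on every subset pair until the budget runs out.

    Returns True if the whole search space was scanned, and False if the
    scan stopped early (a 'time limit' outcome).
    """
    deadline = clock() + budget_seconds
    for a_subset in all_subsets(inputs):
        for b_subset in all_subsets(outputs):
            if clock() >= deadline:
                return False
            visit(a_subset, b_subset)
    return True


seen = []
finished = scan_with_budget([1, 2], [3], lambda a, b: seen.append((a, b)), 60.0)
```

With N inputs and M outputs the scan visits 2^N * 2^M subset pairs, which is why a budget is needed for large transactions.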

IV. PROPOSED ALGORITHM
In this section, we present a transaction classification algorithm. The algorithm is divided into three distinct steps (Algorithm 1):

1) find all connectable pairs (Algorithm 2),
2) remove all non-minimal pairs (Algorithm 3),
3) check for intersections in minimal pairs (Algorithm 4).
We denote as $\mathcal{C}_k$ the set of subsets of $\mathcal{C}$ with cardinality k or smaller, sorted by Sum in increasing order: $\mathcal{C}_k = \{\mathcal{C}' \subseteq \mathcal{C} : |\mathcal{C}'| \le k\}$. In Step 1, we use a loop over the maximum subset size to find ambiguous transactions earlier. We find connectable pairs by iterating through both outputs and inputs (see Algorithm 2). In Step 2, we check the minimality of connectable pairs. If a connectable pair does not include other minimal connectable pairs with smaller Sum, it is minimal. In Step 3, we search for intersections of minimal pairs. If the minimal connectable pairs form a partition P, then the transaction is simple if |P| = 1 and separable if |P| > 1. Otherwise, minimal connectable pairs have intersections, and the transaction is ambiguous by Lemma 2.
Proof: Firstly, regular transactions are classified by Definition 1 (Algorithm 1, lines 1-3). After that, we iterate over the maximum subset size c_length (lines 4-10). In each iteration, the algorithm can either terminate or run the three steps from scratch with an increased subset search space.

Algorithm 1 Classify Transaction
Algorithm 2 considers pairs $(\mathcal{A}_i, \mathcal{B}_j)$ in increasing order of $\mathrm{Sum}(\mathcal{A}_i)$ and $\mathrm{Sum}(\mathcal{B}_j)$ as i and j increase. If the pair $(\mathcal{A}_i, \mathcal{B}_j)$ is connectable by Definition 2, then for a single increment of either index i or j, it is possible to get
• $\mathrm{Sum}(\mathcal{A}_i) \ge \mathrm{Sum}(\mathcal{B}_{j+1})$, then $(\mathcal{A}_i, \mathcal{B}_{j+1})$ is connectable, and the transaction is ambiguous by Lemma 1 for $\mathcal{A}_i$, $\mathcal{B}_j$ and $\mathcal{B}_{j+1}$ (lines 5-6).
• $\mathrm{Sum}(\mathcal{B}_j) + c \ge \mathrm{Sum}(\mathcal{A}_{i+1})$, then $(\mathcal{A}_{i+1}, \mathcal{B}_j)$ is connectable, and the transaction is ambiguous by Lemma 1 for $\mathcal{A}_i$, $\mathcal{A}_{i+1}$ and $\mathcal{B}_j$ (lines 5-6). This and the upper case cover all cases of Lemma 1.
We save the pair $(\mathcal{A}_i, \mathcal{B}_j)$ and continue the execution. After line 10, the pair $(\mathcal{A}_i, \mathcal{B}_j)$ is either connectable or not. If $(\mathcal{A}_i, \mathcal{B}_j)$ is connectable, we can increment either i or j, and we increment j (line 14). If $(\mathcal{A}_i, \mathcal{B}_j)$ is not connectable, then we increment i or j based on the violated inequality in Definition 2. In a single iteration of the while loop (lines 3-19), the algorithm either stops or increases one of i or j by 1. As both |input_sets| and |output_sets| are finite, we exit the loop after a finite number of iterations. If Algorithm 2 does not return the class, it considers all pairs in increasing order of $\mathrm{Sum}(\mathcal{A}_i) + \mathrm{Sum}(\mathcal{B}_j)$ and returns all connectable pairs.

Algorithm 2 Find Pairs
Input: Simplified transaction t = (A, B, c)
Output: Connectable pairs set or transaction class
1: i = 0, j = 0
2: pairs = ∅
3: while i < |input_sets| and j < |output_sets| do
4:   if Sum(B_j) < Sum(A_i) < Sum(B_j) + c then
5:     if … then
6:       return Ambiguous (Lemma 1)
7:     end if
8:     …
9:     …
10:  end if
11:  if Sum(A_i) < Sum(B_j) then
12:    i = i + 1
13:  else
14:    j = j + 1
15:  end if
16:  if Time exceeded then
17:    return Time limit
18:  end if
19: end while

Algorithm 3 takes sorted connectable pairs as input. By Definition 2, there is at least one address in each connectable pair. If a connectable pair $(\mathcal{A}_i, \mathcal{B}_i)$ is not minimal, then there exists a connectable pair $(\mathcal{A}_j, \mathcal{B}_j)$ in its partition. So no later than after the consideration of $|\mathcal{A}_i| + |\mathcal{B}_i|$ smaller partitions, we will get a subpartition with at least one minimal connectable pair. The property of being a part of a partition, like … So Algorithm 3 is either out of time or returns all the minimal connectable pairs. Algorithm 4 checks for intersections in connectable pairs (lines 1-7). If it finds one, then the transaction is ambiguous by Lemma 2. If c_length = max{N, M}, and we have no intersections, then Lemmas 1 and 2 do not work for the transaction. The transaction is not ambiguous by Theorem 1 and has a unique partition P. Based on the partition size, the transaction is either simple or separable (lines 8-14). This concludes the proof of Theorem 3.
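The sorted subset collections used as input_sets and output_sets can be built as in the following minimal sketch. The function name `sorted_subsets` is an assumption, and including the empty subset is also an assumption, made because a connectable pair may have one empty side.

```python
from itertools import combinations


def sorted_subsets(pairs, max_size):
    """Subsets of `pairs` with at most `max_size` elements, sorted by Sum.

    `pairs` is a list of (amount, address) tuples. Ties keep the smaller
    subsets first because Python's sort is stable.
    """
    subsets = [
        subset
        for size in range(max_size + 1)
        for subset in combinations(pairs, size)
    ]
    subsets.sort(key=lambda subset: sum(amount for amount, _ in subset))
    return subsets


outputs = [(3, "X"), (1, "Y"), (2, "Z")]
output_sets = sorted_subsets(outputs, max_size=2)
sums = [sum(amount for amount, _ in subset) for subset in output_sets]
```

Growing max_size step by step mirrors the loop over c_length: small subsets are examined first, and the full power set is reached only at the last iteration.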

V. NUMERICAL EXPERIMENTS
We analyzed the first 747,936 Bitcoin blocks, mined between January 3, 2009 and August 4, 2022. The workflow is as follows:
1) Collect data in the form t = (A, B, c) for all transactions (Section V-A).
2) Classify all historical transactions by Algorithm 1 with a fixed time limit (Section V-B).
3) Examine how increasing the time limit reclassifies time limit transactions into simple, separable, and ambiguous (Section V-C).
The source code of the workflow, including Algorithm 1, is available on GitHub [48].

A. DATA COLLECTION
To get the data, we use a local Bitcoin Core node, version 22.0. The Bitcoin node stores the data in a compact format with scripts for both inputs and outputs of each transaction. We read raw data through the Bitcoin node remote procedure calls (RPC) with the bitcoinrpc library for the Python programming language. The bitcoinrpc library converts scripts to addresses for outputs with no additional information. But inputs refer to unspent transaction outputs (UTXO) by transaction hash and output index. So we store UTXOs in a local dictionary and dump it as a JSON file. After one sequential run over all blocks, the transaction data was collected in the form t = (A, B, c), suitable for further consideration.
The data collection took 5 days with a 10-core 2.81 GHz CPU, 96 GB RAM and 2 TB SSD. The Bitcoin node size is 500 GB, the UTXO dump size is up to 12 GB, and the size of the transactions for untangling is 400 GB. The data collection process is sequential, so the CPU is a bottleneck. Also, the Bitcoin node uses 5 GB for UTXOs, our JSON dump uses 12 GB, but the Python dictionary uses up to 85 GB RAM. So 96 GB RAM is important for our implementation, but strict data type management can relax the requirement.
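The role of the local UTXO dictionary can be illustrated with a minimal sketch. The block and transaction layout below (dictionaries with `txid`, `vin` and `vout` keys holding plain tuples) is a simplified assumption for this example and is not the actual RPC response format or the published implementation.

```python
import json


def collect_triples(blocks):
    """Turn blocks in chain order into (inputs, outputs, fee) triples.

    Each transaction is a dict with a 'txid', a 'vin' list of
    (previous txid, output index) references and a 'vout' list of
    (amount, address) pairs. Returns the triples and the remaining UTXOs.
    """
    utxos = {}  # "txid:index" -> [amount, address]; string keys keep it JSON-friendly
    triples = []
    for block in blocks:
        for tx in block:
            # Resolve every input through the dictionary and mark it as spent.
            inputs = [tuple(utxos.pop(f"{txid}:{index}")) for txid, index in tx["vin"]]
            outputs = [tuple(out) for out in tx["vout"]]
            for index, (amount, address) in enumerate(outputs):
                utxos[f"{tx['txid']}:{index}"] = [amount, address]
            if inputs:  # coinbase transactions are skipped
                fee = sum(a for a, _ in inputs) - sum(b for b, _ in outputs)
                triples.append((inputs, outputs, fee))
    return triples, utxos


toy_chain = [
    [{"txid": "cb", "vin": [], "vout": [(50, "M")]}],
    [{"txid": "t1", "vin": [("cb", 0)], "vout": [(30, "X"), (19, "M")]}],
]
triples, utxos = collect_triples(toy_chain)
dump = json.dumps(utxos)  # the dictionary can be dumped between runs
```

Because every input must be resolved against earlier outputs, the pass over the blocks is inherently sequential, which matches the CPU bottleneck noted above.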

B. SHARED SEND STATISTICS
To classify a transaction t, we Simplify it and run Algorithm 1 for shared send mixers (Definition 1). We aggregated the results with a granularity of 2016 blocks for visualization, i.e., approximately two weeks per point. Each transaction classification is an isolated problem, so the execution parallelizes by transaction. The classification with a time limit τ = 1 second took 48 CPU days (10 calendar days) on an 8-core 3.2 GHz CPU with 32 GB RAM and a 2 TB HDD. Algorithm 1 is implemented in C++. The visualization is implemented in Python.
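The aggregation with a granularity of 2016 blocks can be sketched as follows. The function names and the (block height, label) input format are assumptions for illustration.

```python
from collections import Counter, defaultdict

BLOCKS_PER_POINT = 2016  # approximately two weeks of blocks


def aggregate_by_period(classified):
    """Count transaction classes per 2016-block period.

    `classified` is an iterable of (block_height, label) pairs.
    Returns {period index: Counter of labels}.
    """
    periods = defaultdict(Counter)
    for height, label in classified:
        periods[height // BLOCKS_PER_POINT][label] += 1
    return dict(periods)


def shares(counter):
    """Convert class counts of one period into fractions."""
    total = sum(counter.values())
    return {label: count / total for label, count in counter.items()}


periods = aggregate_by_period(
    [(0, "regular"), (2015, "simple"), (2016, "regular"), (4031, "ambiguous")]
)
first_point = shares(periods[0])
```

Each period yields one point per class in a plot of class shares over time.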

C. TIME LIMIT VARIATION
Time limit is a specific category of transactions caused by the computational limitations of an NP-complete problem. The proposed implementation of Algorithm 1 terminates the execution for 1.4% of SSMs on the given computer with a τ = 1 second computational budget. Our goal is to provide numerical evidence of SSM popularity. Therefore, such a classification denial rate is acceptable. If one needs to decrease the rate, one can increase the computational budget or design heuristics for early transaction classification. The analysis of time limit transactions can be a key to heuristics design (see Figure 8), but the heuristics design is beyond the scope of the current research. In this section, we vary the time limit τ to examine its importance for classification. We selected 1000 time limit transactions uniformly at random. For a time limit τ in {5, 10, 30, 60, 300} seconds, the classification results are in Figure 11. Increasing the computational budget τ up to 300 seconds helps to reveal the true class for more than 25% of time limit transactions. Simple transactions still dominate, but are followed by ambiguous and separable ones with a smaller gap.
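The experiment design can be sketched as below. The function `reclassify_sample` and the `classify(tx, tau)` callback are hypothetical names; the stand-in classifier in the usage lines only mimics a solver that succeeds when the budget is large enough and produces no real measurements.

```python
import random
from collections import Counter


def reclassify_sample(time_limit_txs, classify, budgets=(5, 10, 30, 60, 300),
                      sample_size=1000, seed=0):
    """Re-run a classifier on a uniform random sample with growing budgets.

    `classify(tx, tau)` is any function returning a class label for a
    transaction under a budget of `tau` seconds.
    Returns {tau: Counter of labels}.
    """
    population = list(time_limit_txs)
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sample = rng.sample(population, min(sample_size, len(population)))
    return {tau: Counter(classify(tx, tau) for tx in sample) for tau in budgets}


# Stand-in data: each "transaction" is just the time a solver would need.
required_seconds = [1, 7, 20, 100, 1000]
stand_in = lambda cost, tau: "simple" if cost <= tau else "time limit"
results = reclassify_sample(required_seconds, stand_in)
```

Comparing the counters across budgets shows how many transactions leave the time limit class as τ grows.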

VI. CONCLUSION AND FUTURE WORK
Shared send mixer (SSM) is a technique to merge several transactions into one in cryptocurrencies and tokens with the UTXO model. Senders and receivers in such transactions save on fees and increase their privacy by tangling money flows.
SSMs are popular in Bitcoin. In the paper, we verify their popularity numerically.
Are SSMs secure? One can try to untangle a transaction, i.e., divide it into subtransactions. Under general assumptions, the untangling problem is known from the literature. In this paper, we propose an algorithm to solve the problem, prove its correctness, and provide the source code. We applied the algorithm to the Bitcoin historical data: 15% of transactions are SSMs, and 90% of them allow a unique untangling. A unique untangling retrieves data about the money flows and indicates the lack of transaction security. Thus, analysts can use untangling information for address grouping with CS and OTC. And cryptocurrency users can assess the complexity of an SSM transaction upon generation and avoid too simple transactions to keep their privacy.
The untangling is a well-posed mathematical problem, but the assumptions are debatable [27]. First, inputs and outputs that are smaller than the fee make transactions look more tangled than they are. A preprocessing step can be a part of the assumptions. Second, address grouping changes a transaction and may cause a different untangling. The consistency and practice of untangling with and without grouping is an open question. Finally, small noise in amounts may change the transaction type, for example, from ambiguous to simple. One can relax the inequality in Definition 2, allowing small violations, to decrease the false-negative mixer detection rate.
Bitcoin is a permanent leader among cryptocurrencies by capitalization and the most interesting one for untangling. But the same algorithms work for other UTXO systems. For example, Litecoin, Bitcoin Cash and Bitcoin SV are Bitcoin forks and use the same transaction model. Our code is applicable to these blockchains. Cardano uses an extended UTXO model, which does not affect the untangling. Zcash uses the Bitcoin UTXO model for public transactions and a UTXO-based shielded pool secured by zero knowledge proofs. One can untangle public transactions and, probably, shielding and deshielding events. Tornado Cash and Railgun are UTXO-based anonymity pools for tokens secured by zero knowledge proofs. Tornado Cash has fixed amounts for each pool, so untangling is unlikely to yield any useful information. Railgun has arbitrary amounts within a pool, so the untangling may be challenging but useful.