Private Computation of Phylogenetic Trees Based on Quantum Technologies

Individuals’ privacy and legal regulations demand genomic data be handled and studied with highly secure privacy-preserving techniques. In this work, we propose a feasible Secure Multiparty Computation (SMC) system assisted with quantum cryptographic protocols that is designed to compute a phylogenetic tree from a set of private genome sequences. This system significantly improves the privacy and security of the computation thanks to three quantum cryptographic protocols that provide enhanced security against quantum computer attacks. This system adapts several distance-based methods (Unweighted Pair Group Method with Arithmetic mean, Neighbour-Joining, Fitch-Margoliash) into a private setting where the sequences owned by each party are not disclosed to the other members present in the protocol. We theoretically evaluate the performance and privacy guarantees of the system through a complexity analysis and security proof and give an extensive explanation of the implementation details and cryptographic protocols. We also implement a quantum-assisted secure phylogenetic tree computation based on the Libscapi implementation of the Yao protocol, the PHYLIP library and simulated keys of two quantum systems: Quantum Oblivious Key Distribution and Quantum Key Distribution. This demonstrates its effectiveness and practicality. We benchmark this implementation against a classical-only solution and conclude that both approaches render similar execution times, the only difference being the time overhead taken by the oblivious key management system of the quantum-assisted approach.


I. INTRODUCTION
The emerging fields of Data Mining and Data Analysis of genomic data have deeply benefited from the increasing power of computers [1]. However, their need for a massive and methodical collection of data can lead to the complete or partial leak of private sensitive data [2]-[5]. Besides these threats, the aggregation of data from different sources may be blocked by legally imposed regulations such as the General Data Protection Regulation (GDPR) [6], preventing honest collaborative studies from taking place. To overcome these privacy-related issues, several Secure Multiparty Computation (SMC) protocols have been developed, rendering different framework implementations [7]-[10]. The speed and security of SMC heavily rely on the speed and security of an important cryptographic primitive known as Oblivious Transfer (OT) [11]. However, most current OT implementations use public-key cryptography, whose security is based on unproven computational assumptions. Moreover, with the emergence of quantum computers, Shor's algorithm [12] jeopardizes all the current public-key methods based on RSA, Elliptic Curves or Diffie-Hellman. This puts at risk the deployment of classical Oblivious Transfer, which ultimately leads to the exposure of the SMC parties' private inputs. Thus, it is essential to develop SMC methods secure against quantum computers while not compromising current performance levels. Several privacy-enhancing technologies (PETs) (Differential Privacy [13], Homomorphic Encryption [14] and SMC) have been applied to biomedical data analysis [15]-[19]. In particular, these classical techniques have been used in the context of genomic private data analysis.
The associate editor coordinating the review of this manuscript and approving it for publication was Junggab Son.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
As a way to push research and innovation forward, there have been several competitions [20] focused on developing faster and more secure solutions in the field of genomic analysis. Also, in recent surveys [21], [22], the authors describe the role of PETs in four different computational domains of the genomics field (genomic aggregation, GWASs and statistical analysis, sequence comparison and genetic testing). However, they do not reference any privacy-preserving method applied to phylogeny inference. In contrast to classical technologies, the usage of quantum cryptographic technologies in private computation has not been widely reported. Chan et al. [23] developed real-world private database queries assisted with quantum technologies, and in [24] the authors simply suggest that their implementation of quantum oblivious transfer is suitable to be applied in an SMC environment. In [25], a system assisted with quantum technologies is presented for the private recognition of composite signals in genomes and proteins, and in [26] the authors give a brief description of a private UPGMA (Unweighted Pair Group Method with Arithmetic mean) protocol assisted with quantum technologies. Despite their limited integration with PETs, quantum cryptographic technologies have already reached a maturity level that enables this integration: Quantum Key Distribution (QKD) and Quantum Random Number Generators (QRNG) are currently being commercialized and applied to critical use cases (e.g. governmental data storage and communications, data centres [27]) with in-field deployment (e.g. OpenQKD, https://openqkd.eu/); the Quantum Oblivious Key Distribution (QOKD) protocol is based on the same technology as QKD and QRNG, benefiting from its development and making it possible to generate the resource needed to execute Oblivious Transfer [28]-[30].
In this work, we present a feasible modular private phylogenetic tree protocol that provides enhanced security against quantum computer attacks and decreases the complexity of the computation phase when compared to state-of-the-art classical systems. The system is built on top of the Libscapi [31] implementation of the Yao protocol and the PHYLIP phylogeny package [32], and it integrates three crucial quantum primitives: Quantum Oblivious Transfer, Quantum Key Distribution and Quantum Random Number Generator.
This work follows a top-down approach. In section II, we start by explaining the concept of phylogenetic trees and the distance-based algorithms used to generate these trees. In section III, we set down the security definitions that will be used to analyse and prove the system's security. In section IV, we explain the cryptographic tools used in the system. In sections V and VI, we describe the quantum cryptographic tools and the software tools that are integrated into the protocol, respectively. In section VII, we describe the proposed Secure Multiparty Computation of phylogenetic trees. In section VIII we explain how the quantum cryptographic tools are integrated into the system and we comment on the experimental threats and possible mitigation strategies. Section IX is devoted to the theoretical security analysis of the protocol and in section X we perform a complexity analysis. In the last section we present a performance comparison of the system between a classical-only and a quantum-assisted implementation.

II. PHYLOGENETIC TREES
Phylogenetic trees are diagrams that depict the evolutionary ties between groups of organisms [33] and are composed of several nodes and branches. The nodes represent genome sequences and each branch connects two nodes. It is important to note that the terminal nodes (also called leaves) represent known data sequences, whereas internal nodes are ancestral sequences inferred from the known sequences [34], [35]. The length of the branch connecting two nodes represents the number of substitutions that have occurred between them. However, this quantity must be estimated because it cannot be computed directly from the sequences. In fact, by simply counting the number of sites where two nodes have different base elements (Hamming distance), we underestimate the number of substitutions that have occurred between them, since several substitutions at the same site are counted at most once.
The best way to compute a correct phylogenetic tree depends on the type of species and sequences under analysis and on the assumptions we make about the substitution model of the sequences. By a correct tree, we mean a tree that depicts as closely as possible the real phylogeny of the sequences, i.e. the real ties between known sequences and inferred ancestors. These assumptions lead to different algorithms, which can be divided into two categories:
1) Distance-based methods: they base their analysis on the evolutionary distance matrix, which contains the evolutionary distances between every pair of sequences. The evolutionary distance used also depends on the substitution model considered. These methods are computationally less expensive than character-based methods.
2) Character-based methods: they base their analysis on comparing every site (character) of the known data sequences and do not reduce the comparison of sequences to a single value (evolutionary distance).
In this work, we will only consider the distance-based algorithms that are part of the PHYLIP [36] distance matrix models, namely: Fitch-Margoliash (fitch and kitsch), Neighbour-Joining (neighbor) and UPGMA (neighbor). Also, we will only consider the evolutionary distances implemented in the PHYLIP dnadist program: Jukes-Cantor (JC) [37], Kimura 2-parameter (K2P) [38], F84 [39] and LogDet [40]. We refer interested readers to some textbooks on phylogenetic analysis [34], [35].
In the next two subsections, we give an overview of these distance-based methods to build some intuition on how to tailor them to a private setting. We start by looking at the different evolutionary distances and then at the distance-based algorithms.

A. EVOLUTIONARY DISTANCES
The evolutionary distance depends on the number of substitutions estimated between two sequences, which is governed by the substitution model used. So, before defining a suitable distance, it is important to have a model that describes the substitution probability of each nucleotide across the sequences at a given time.
The distances considered in this work can be divided into two groups by their assumptions. JC, K2P and F84 assume that the substitution probabilities remain constant throughout the tree (i.e. stationary probabilities), whereas the LogDet distance assumes that the probabilities are not stationary.
Also, the first three evolutionary distances (JC, K2P and F84) assume an evolutionary model that can be described by a time-homogeneous stationary Markov process. This Markov process is based on a probability matrix $P(t)$ that defines the transition probabilities from one state to the other after a certain time period $t$. It can be shown [41] that this probability is given by
$$P(t) = e^{Qt}, \qquad (1)$$
where the rate matrix $Q$ is of the form given by (2). In $Q$, each entry $Q_{ij}$ represents the substitution rate from nucleotide $i$ to $j$, and both its columns and rows follow the order A, C, G, T. $\mu$ is the total number of substitutions per unit time and we can define the evolutionary distance, $d$, to be given by $d = \mu t$. The parameters $a, b, c, \ldots, l$ represent the relative rate of each nucleotide substitution to any other. Finally, $\pi_A, \pi_C, \pi_G, \pi_T$ describe the frequency of each nucleotide in the sequences.
From expression (1), it is possible to define a likelihood function on the distance $d$ and use the maximum likelihood approach to get an estimate of the evolutionary distance. The likelihood function defines the probability of observing two particular sequences, $x$ and $y$, given the distance $d$:
$$L(d) = \prod_{s=1}^{n} \pi_{x_s} P_{x_s y_s}\!\left(\frac{d}{\mu}\right),$$
where $n$ is the length of the sequences and $x_s$, $y_s$ are the nucleotides at site $s$. The parameters of $Q$ are defined differently depending on the evolutionary model used, and the maximum likelihood solution leads to different evolutionary distances.

1) JUKES-CANTOR
The Jukes-Cantor model [37] is the simplest possible model based on $Q$ as given in (2). It assumes the frequencies of the nucleotides to be the same, i.e. $\pi_A = \pi_C = \pi_G = \pi_T = \frac{1}{4}$, and sets the relative rates $a = b = \ldots = l = 1$. This model renders an evolutionary distance between two sequences $x$ and $y$ given by:
$$d_{xy} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{h_{xy}}{n}\right),$$
where $h_{xy}$ is the uncorrected Hamming distance and $n$ the length of the sequences.
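As an illustration, the following minimal Python sketch computes the uncorrected Hamming distance and the Jukes-Cantor correction for two aligned sequences. It assumes the usual closed form $d = -\frac{3}{4}\ln(1 - \frac{4}{3}h/n)$; the function names and the toy sequences are hypothetical and are not taken from PHYLIP.

```python
import math


def hamming(x: str, y: str) -> int:
    """Number of sites at which two aligned sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))


def jukes_cantor(x: str, y: str) -> float:
    """Jukes-Cantor distance, assuming d = -(3/4) ln(1 - (4/3) h/n)."""
    p = hamming(x, y) / len(x)
    if p >= 0.75:
        # The logarithm is undefined once 3/4 or more of the sites differ.
        raise ValueError("distance undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (hypothetical data): they differ at two sites.
x = "ACGTACGTAC"
y = "ACGTTCGTAA"
print(hamming(x, y), jukes_cantor(x, y))
```

The corrected distance is always at least as large as the observed proportion of differing sites, which reflects the underestimation discussed earlier.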

2) KIMURA 2-PARAMETER
This model [38] distinguishes between two different nucleotide mutations:
1) Type I (transition): A ↔ G, i.e. from purine to purine, or C ↔ T, i.e. from pyrimidine to pyrimidine.
2) Type II (transversion): from purine to pyrimidine or vice versa.
These two different types of substitution lead to different probabilities, denoted by $P$ and $Q$, where $P$ is the probability of homologous sites showing a type I difference, while $Q$ is that of these sites showing a type II difference. So, the Kimura [38] metric between $x$ and $y$ is given by the following:
$$d_{xy} = -\frac{1}{2}\ln\left((1 - 2P - Q)\sqrt{1 - 2Q}\right),$$
where $P = \frac{n_1}{n}$, $Q = \frac{n_2}{n}$, and $n_1$ and $n_2$ are respectively the number of sites for which the two sequences differ from each other with respect to type I (''transition'' type) and type II (''transversion'' type) substitutions.
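The classification of differing sites into transitions and transversions can be sketched as follows. This is a minimal Python illustration that assumes the standard Kimura closed form $d = -\frac{1}{2}\ln((1 - 2P - Q)\sqrt{1 - 2Q})$; the helper names and the example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(x: str, y: str):
    """Return (n1, n2): the numbers of transition and transversion sites."""
    n1 = n2 = 0
    for a, b in zip(x, y):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            n1 += 1  # type I: purine <-> purine or pyrimidine <-> pyrimidine
        else:
            n2 += 1  # type II: purine <-> pyrimidine
    return n1, n2


def kimura_2p(x: str, y: str) -> float:
    """Kimura 2-parameter distance, assuming the standard closed form."""
    n = len(x)
    n1, n2 = count_differences(x, y)
    P, Q = n1 / n, n2 / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Hypothetical aligned sequences: one transition (A/G), one transversion (A/C).
x = "AACCGGTTAA"
y = "GACCGGTTAC"
print(count_differences(x, y), kimura_2p(x, y))
```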

3) F84
This model [39] also distinguishes different nucleotide transitions but does not assume the nucleotide frequencies to be the same. This leads to a more general distance which can be estimated in closed form:
$$d_{xy} = -2A\ln\left(1 - \frac{P}{2A} - \frac{(A - B)Q}{2AC}\right) + 2(A - B - C)\ln\left(1 - \frac{Q}{2C}\right),$$
where $A = \frac{\pi_C \pi_T}{\pi_Y} + \frac{\pi_A \pi_G}{\pi_R}$, $B = \pi_C \pi_T + \pi_A \pi_G$ and $C = \pi_R \pi_Y$ for $\pi_Y = \pi_C + \pi_T$ and $\pi_R = \pi_A + \pi_G$, and $P$ and $Q$ are defined as in the Kimura 2-parameter model above.
Although more complex models can be considered with different combinations of parameters in Q, not all of them produce a distance function that can be estimated in closed form.
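A short Python sketch of a closed-form F84 distance is given below. It assumes the expression $d = -2A\ln(1 - \frac{P}{2A} - \frac{(A-B)Q}{2AC}) + 2(A - B - C)\ln(1 - \frac{Q}{2C})$ with $A$, $B$ and $C$ built from the nucleotide frequencies as described above; the function name and the frequency dictionary are hypothetical. With equal frequencies, this expression reduces to the Kimura 2-parameter distance, which gives a simple sanity check.

```python
import math


def f84_distance(P: float, Q: float, freqs: dict) -> float:
    """F84 distance from transition (P) and transversion (Q) proportions.

    freqs maps each of "A", "C", "G", "T" to its frequency (assumed input).
    """
    pA, pC, pG, pT = (freqs[b] for b in "ACGT")
    pY, pR = pC + pT, pA + pG          # pyrimidine and purine frequencies
    A = pC * pT / pY + pA * pG / pR
    B = pC * pT + pA * pG
    C = pR * pY
    return (-2 * A * math.log(1 - P / (2 * A) - (A - B) * Q / (2 * A * C))
            + 2 * (A - B - C) * math.log(1 - Q / (2 * C)))


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
k2p = -0.5 * math.log(1 - 2 * 0.1 - 0.1) - 0.25 * math.log(1 - 2 * 0.1)
# With uniform frequencies the two values should agree.
print(f84_distance(0.1, 0.1, uniform), k2p)
```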

4) LogDet
As mentioned before, the models based on matrix $Q$ assume that the probability matrix $P(t)$ is stationary, i.e. remains constant throughout the tree. However, there are evolutionary scenarios where this assumption does not give a correct description of reality. The LogDet evolutionary distance [40] suits a wider set of models and considers the case where $P(t)$ is different at each branch in the tree. This is given by
$$d_{xy} = -\frac{1}{4}\ln\left(\frac{\det F_{xy}}{\sqrt{\det \Pi_x \det \Pi_y}}\right),$$
where the divergence matrix $F_{xy}$ is a $4 \times 4$ matrix such that the $ij$-th entry gives the proportion of sites in sequences $x$ and $y$ with nucleotides $i$ and $j$, respectively. Also, $\Pi_x$ and $\Pi_y$ are diagonal matrices whose $i$-th component corresponds to the proportion of nucleotide $i$ in the sequences $x$ and $y$, respectively.
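The construction of the divergence matrix can be sketched in Python as follows. The sketch assumes the common form $d = -\frac{1}{4}\ln(\det F_{xy} / \sqrt{\det \Pi_x \det \Pi_y})$ and uses a small hand-written determinant routine to stay within the standard library; all names and the test sequences are hypothetical.

```python
import math

BASES = "ACGT"


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-15:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def logdet_distance(x: str, y: str) -> float:
    """LogDet distance; requires all four nucleotides in both sequences."""
    n = len(x)
    F = [[0.0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        F[BASES.index(a)][BASES.index(b)] += 1.0 / n
    fx = [sum(row) for row in F]                              # frequencies in x
    fy = [sum(F[i][j] for i in range(4)) for j in range(4)]   # frequencies in y
    # Pi_x and Pi_y are diagonal, so their determinants are plain products.
    return -0.25 * math.log(det(F) / math.sqrt(math.prod(fx) * math.prod(fy)))


print(logdet_distance("AAAACCCCGGGGTTTT", "AAACCCCGGGGTTTTA"))
```

Identical sequences give a diagonal divergence matrix equal to $\Pi_x$, so the ratio inside the logarithm is 1 and the distance is 0.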

B. DISTANCE-BASED ALGORITHMS
All distance-based methods reduce the comparison between sequences to their evolutionary distance. Although this may lead to less accurate phylogenetic trees, these methods are highly popular among researchers who have to handle large numbers of sequences. It is common to all of them to assume the following:
1) The evolutionary distance computed between each pair is independent of all other sequences;
2) The estimated distance between each pair of sequences is given by the sum of the lengths of the branches that connect them.
These algorithms are thus divided into two phases:
1) Distance computation phase: all the pairwise evolutionary distances are computed according to the selected model. This step is common to all distance-based methods;
2) Iterative clustering: aggregate the sequences in clusters iteratively. This step is specific to each method.
Let us briefly describe three of the most common distance-based methods [34].

1) UPGMA
The Unweighted Pair Group Method with Arithmetic mean (UPGMA) method produces a rooted phylogenetic tree and assumes the data to be ultrametric, i.e. assumes that
$$d(x, y) \leq \max\{d(x, z), d(y, z)\}$$
for sequences $x$, $y$ and $z$. These assumptions imply that all the sequences are equidistant to the inferred root sequence.
It starts by considering every sequence as a single-element cluster. Then, it repeatedly merges the two clusters with the smallest distance between them and recomputes the distance matrix through a simple average of distances. In summary, we have the following steps:
1) Merge the clusters $C_i = \{c_i\}$ and $C_j = \{c_j\}$, for sets $c_i$ and $c_j$, with the smallest distance present in the distance matrix, i.e. $d_{i,j} \leq d_{k,l}\ \forall k, l$. Create a new cluster $C_{i/j} = \{\{c_i, c_j\}\}$. This new cluster represents a branch between clusters $C_i$ and $C_j$;
2) Recompute the distance between the new cluster $C_{i/j}$ and every other cluster $C_l$ as an average of the distances from $C_i$ and $C_j$ to $C_l$;
3) Eliminate clusters $C_i$ and $C_j$ from the distance matrix and add cluster $C_{i/j}$ with the distances computed as in the previous step;
4) Repeat steps 1-3 until there is only one cluster left.
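The clustering loop above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PHYLIP implementation: it assumes the usual UPGMA update in which the new distance is the average of the two old distances weighted by cluster size (for two single-element clusters this is the simple average). The function name, the dictionary-based distance matrix and the toy data are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the rooted topology as nested tuples.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Step 1: pick the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Steps 2-3: size-weighted average distance to every other cluster.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


raw = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
       ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(k): v for k, v in raw.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

In this toy ultrametric example, A and B are merged first, then C joins them, and D is attached last at the root.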

2) NEIGHBOUR-JOINING
As we have seen, UPGMA joins the clusters with the minimum distance between them. The Neighbour-Joining method, in contrast, considers not only how close two clusters are, but also how far these two clusters are from the others. Thus, the clusters to be merged should minimize the following quantity:
$$q(C_i, C_j) = (r - 2)\, d(C_i, C_j) - u(C_i) - u(C_j),$$
where $r$ is the number of clusters in the current iteration and $u(C_i) = \sum_j d(C_i, C_j)$.
As opposed to the UPGMA algorithm, this method produces an unrooted tree and it can be summarized in the following steps:
1) Consider every sequence as a single-element cluster and connect it to a central point;
2) Compute a matrix $Q$ whose entries are given by the quantity above, i.e. $Q_{ij} = q(C_i, C_j)$;
3) Identify the clusters $C_i$ and $C_j$ with the smallest value in the matrix $Q$. Create a new node $C_{i/j}$ and join both clusters $C_i$ and $C_j$ to it;
4) Assign to the branch $C_i C_{i/j}$ a distance given by:
$$d(C_i, C_{i/j}) = \frac{1}{2} d(C_i, C_j) + \frac{u(C_i) - u(C_j)}{2(r - 2)}$$
and to the branch $C_j C_{i/j}$ a distance given by:
$$d(C_j, C_{i/j}) = d(C_i, C_j) - d(C_i, C_{i/j});$$
5) Eliminate clusters $C_i$ and $C_j$ from the distance matrix and add cluster $C_{i/j}$ with the distances to the other clusters computed as follows:
$$d(C_{i/j}, C_l) = \frac{1}{2}\left[d(C_i, C_l) + d(C_j, C_l) - d(C_i, C_j)\right]$$
for all other nodes $C_l$;
6) Repeat steps 2-5 until there is only one cluster left.
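One iteration of the Neighbour-Joining procedure (steps 2 to 5) can be sketched as follows. The sketch assumes the standard Neighbour-Joining formulas: the criterion $q = (r-2)d(C_i,C_j) - u(C_i) - u(C_j)$, the branch length $\frac{1}{2}d(C_i,C_j) + \frac{u(C_i)-u(C_j)}{2(r-2)}$ and the update $\frac{1}{2}[d(C_i,C_l) + d(C_j,C_l) - d(C_i,C_j)]$. The function name and the five-taxon distance matrix are hypothetical illustration data.

```python
from itertools import combinations


def nj_step(clusters, d):
    """Perform one Neighbour-Joining iteration (requires more than 2 clusters).

    clusters: list of labels; d: dict frozenset({a, b}) -> distance.
    Returns (new_clusters, new_d, (a, b, branch_a, branch_b)).
    """
    r = len(clusters)
    u = {c: sum(d[frozenset((c, o))] for o in clusters if o != c)
         for c in clusters}

    def q(pair):
        a, b = pair
        return (r - 2) * d[frozenset(pair)] - u[a] - u[b]

    a, b = min(combinations(clusters, 2), key=q)       # step 3
    dab = d[frozenset((a, b))]
    branch_a = 0.5 * dab + (u[a] - u[b]) / (2 * (r - 2))  # step 4
    branch_b = dab - branch_a
    node = (a, b)
    rest = [c for c in clusters if c not in (a, b)]
    # Step 5: drop a and b, add the new node with its updated distances.
    new_d = {k: v for k, v in d.items() if a not in k and b not in k}
    for c in rest:
        new_d[frozenset((node, c))] = 0.5 * (
            d[frozenset((a, c))] + d[frozenset((b, c))] - dab)
    return rest + [node], new_d, (a, b, branch_a, branch_b)


raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
d = {frozenset(k): v for k, v in raw.items()}
clusters, new_d, joined = nj_step(["a", "b", "c", "d", "e"], d)
print(joined)
```

Note that the pair (d, e) has the smallest raw distance in this example, yet the criterion selects (a, b), which illustrates the difference with UPGMA.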

3) FITCH-MARGOLIASH
This method renders an unrooted tree and also assumes that the distances are additive. It iteratively analyses three-leaf trees and computes the distance between three known nodes and one created internal node. This is based on the following observation. Given three clusters $C_i$, $C_j$ and $C_l$, and one internal node $a$ that is connected to all these three clusters, the distances between the clusters are given by:
$$d(C_i, C_j) = d(C_i, a) + d(a, C_j), \quad d(C_i, C_l) = d(C_i, a) + d(a, C_l), \quad d(C_j, C_l) = d(C_j, a) + d(a, C_l),$$
from which we can easily see that
$$d(C_i, a) = \frac{1}{2}\left[d(C_i, C_j) + d(C_i, C_l) - d(C_j, C_l)\right],$$
$$d(C_j, a) = \frac{1}{2}\left[d(C_i, C_j) + d(C_j, C_l) - d(C_i, C_l)\right], \qquad (7)$$
$$d(C_l, a) = \frac{1}{2}\left[d(C_i, C_l) + d(C_j, C_l) - d(C_i, C_j)\right].$$
Thus, we can estimate the distances from the known clusters to the new internal node using the distances between the clusters as given in (7). Based on this, the Fitch-Margoliash algorithm goes as follows:
1) Consider every sequence as a single-element cluster;
2) Identify the two clusters, $C_i$ and $C_j$, with the smallest distance in the distance matrix;
3) Consider all the other clusters as a single cluster $C_l$ and recompute the distance matrix with just three clusters. The distance between each identified cluster and the new cluster is given by the average of the distances between the identified cluster and the elements inside the cluster $C_l$, i.e.
$$d(C_i, C_l) = \frac{1}{|C_l|}\sum_{c \in C_l} d(C_i, c),$$
and similarly for $C_j$;
4) Using expressions (7), compute the distances from the three clusters to the central node;
5) Merge clusters $C_i$ and $C_j$ into a new one, $C_{i/j}$, and recompute the distance between $C_{i/j}$ and all the other clusters $c \in C_l$ by a simple average expression:
$$d(C_{i/j}, c) = \frac{1}{2}\left[d(C_i, c) + d(C_j, c)\right];$$
6) Repeat steps 2-5 until there is only one cluster left.
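The three-point observation behind the Fitch-Margoliash method amounts to solving three additive equations for three unknown branch lengths. A minimal Python sketch, with a hypothetical function name, is:

```python
def three_point(d_ij: float, d_il: float, d_jl: float):
    """Branch lengths from clusters i, j, l to their common internal node a.

    Solves d_ij = d_ia + d_ja, d_il = d_ia + d_la, d_jl = d_ja + d_la.
    """
    d_ia = (d_ij + d_il - d_jl) / 2
    d_ja = (d_ij + d_jl - d_il) / 2
    d_la = (d_il + d_jl - d_ij) / 2
    return d_ia, d_ja, d_la


# A star tree with branch lengths 2, 3 and 4 has pairwise distances 5, 6, 7.
print(three_point(5, 6, 7))
```

If the input distances are not additive, some of the returned lengths can be negative, which is one reason these values are treated as estimates.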
All these methods output a tree with some topology $T$, along with the lengths of its branches.

III. SECURITY DEFINITION
In this work, we consider a multi-party computation scenario that is secure against semi-honest parties. This means that all the parties strictly follow the protocol but can use their inputs, received messages and outputs to deduce any additional information. As such, these are also commonly called honest-but-curious parties. Nevertheless, we can extend the protocol to the malicious setting by simply implementing a two-party secure computation protocol that is secure against malicious adversaries [42]. Our security analysis follows the simulation paradigm and we start with the definition of security in a multi-party setting. The formal definition is taken from [42].
Notation:
• $F$ denotes the ideal functionality to be computed in the Secure Multiparty Computation (SMC) session, i.e. $F : \mathcal{X}^n \to \mathcal{Y}^n$, where $n$ is the number of parties participating in the SMC and $\mathcal{X}$ and $\mathcal{Y}$ are the input and output space of each party, respectively. $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$ denote the input and output of party $P_i$, respectively. Also, for short, $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$;
• $\pi$ denotes the protocol that implements the ideal functionality $F$;
• $C$ is the set of corrupted parties;
• $\mathrm{view}^i_\pi(X) = (X_i, r_i, m^i_1, m^i_2, \ldots)$. This tuple is called the view of party $P_i$ and it contains its inputs ($X_i$), its random-tape value ($r_i$) and the messages $m^i_j$ received during the SMC execution;
• $\mathrm{output}_\pi(X) = (\mathrm{output}^1_\pi(X), \ldots, \mathrm{output}^n_\pi(X))$, where $\mathrm{output}^i_\pi(X)$ is the output of party $i$ computed from its view $\mathrm{view}^i_\pi(X)$;
• $\mathrm{Sim}$ is a probabilistic polynomial-time simulator in the ideal world;
• The distribution on inputs $X$ given by a real-world execution of the protocol $\pi$:
$$\mathrm{Real}_\pi(C; X) = \left(\{\mathrm{view}^i_\pi(X) : i \in C\},\ \mathrm{output}_\pi(X)\right);$$
• The distribution on inputs $X$ given by the ideal-world simulation of the parties' view:
$$\mathrm{Ideal}_{\mathrm{Sim},F}(C; X) = \left(\mathrm{Sim}(C, \{(X_i, Y_i) : i \in C\}),\ F(X)\right).$$
Definition 1 (Semi-Honest Security): A protocol securely realizes $F$ in the presence of semi-honest adversaries if there exists a simulator $\mathrm{Sim}$ such that, for every subset of corrupted parties $C$ and all inputs $X$, we have
$$\mathrm{Real}_\pi(C; X) \overset{c}{\equiv} \mathrm{Ideal}_{\mathrm{Sim},F}(C; X), \qquad (8)$$
where $\overset{c}{\equiv}$ denotes computational indistinguishability.
This definition conveys the notion that whatever can be computed by a party during the execution of the protocol is only based on his inputs and outputs, i.e. the execution of the protocol does not provide any further information. This is captured by expression (8), which states that the distribution of the view and outputs in a real-world execution is computationally indistinguishable from the distribution generated by a simulator and the functionality output. It is also worth noting that, as proved in [43], for deterministic $F$, Definition 1 is equivalent to the simpler case where the Real and Ideal distributions do not take into account the output of the real protocol execution and the output of the functionality, respectively, i.e.
$$\mathrm{Real}_\pi(C; X) = \{\mathrm{view}^i_\pi(X) : i \in C\}, \qquad \mathrm{Ideal}_{\mathrm{Sim},F}(C; X) = \mathrm{Sim}(C, \{(X_i, Y_i) : i \in C\}).$$
Therefore, we just need to build a simulator that satisfies expression (8) for $\mathrm{Real}_\pi(C; X)$ and $\mathrm{Ideal}_{\mathrm{Sim},F}(C; X)$ given as above in order to prove security.

A. DISTANCE MATRIX FUNCTIONALITY
For our private phylogenetic tree problem, the ideal functionality $F$ outputs the distance matrix according to the selected evolution model (Jukes-Cantor, Kimura 2-parameter, F84 or LogDet). We denote such a functionality by $DM_d$, $d \in \{JC, K2P, F84, LD\}$. Note that this functionality is deterministic and, as we pointed out before, we just have to prove that expression (8) holds for the simpler definition of Real and Ideal.
The protocol that privately computes the distance matrix $DM_d$ is built up from many invocations of a two-party distance functionality, denoted by $D_d$ for $d \in \{JC, K2P, F84, LD\}$. Consequently, we can reduce the security of $DM_d$ to that of $D_d$ and use the composition theorem proved in [44] to prove the security of $DM_d$.
Before presenting the composition theorem, we provide some informal definitions. An oracle-aided protocol using the oracle-functionality $f$ is a protocol where the parties can interact with an oracle which outputs to each party according to $f$. Also, when an oracle-aided protocol privately computes some $g$ in the sense of (8) using the oracle-functionality $f$, we say that it privately reduces $g$ to $f$. For a more detailed discussion on this topic, we refer the interested reader to [44]. The composition theorem for the semi-honest model can therefore be stated as follows:
Theorem 1 (Composition Theorem): Suppose that $g$ is privately reducible to $f$ and that there exists a protocol for privately computing $f$. Then, there exists a protocol for privately computing $g$.
In other words, there exists a private protocol of g when the oracle-functionality f is substituted by its real private protocol in the corresponding oracle-aided protocol g.

IV. CRYPTOGRAPHIC TOOLS
In this section, we present the functionalities that build up the Secure Multiparty Computation (SMC) system assisted with quantum technologies.

A. OBLIVIOUS TRANSFER
Oblivious Transfer is a rather exotic functionality that turns out to be crucial for the feasibility of SMC. This primitive was proposed by Rabin in 1981 in a different flavour [11] and it was proved by Kilian [45] that it is theoretically equivalent to SMC, i.e. OT can be built from SMC and vice-versa. Succinctly, it is a two-party protocol between a sender and a receiver. The sender holds two $l$-bit messages, $m_0$ and $m_1$, and the receiver holds a one-bit choice $b \in \{0, 1\}$. The OT functionality allows the receiver to receive $m_b$ without the sender learning $b$, while the receiver is not able to learn $m_{1-b}$. Schematically, OT is given by the functionality described in Figure 1.
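The input/output behaviour of the ideal 1-out-of-2 OT functionality can be written down directly. The short Python sketch below models only the trusted-party view of the functionality (what each side puts in and gets out); it is not a secure protocol, and the function name is hypothetical.

```python
from typing import Tuple


def ideal_ot(messages: Tuple[bytes, bytes], b: int) -> bytes:
    """Ideal 1-out-of-2 OT as computed by a trusted party.

    The sender contributes (m0, m1) and receives nothing back;
    the receiver contributes the choice bit b and receives only m_b.
    """
    if b not in (0, 1):
        raise ValueError("choice bit must be 0 or 1")
    m0, m1 = messages
    if len(m0) != len(m1):
        raise ValueError("both messages must have the same length")
    return messages[b]


print(ideal_ot((b"key-for-0", b"key-for-1"), 1))
```

A real OT protocol must achieve the same behaviour without any trusted party, which is what makes the primitive non-trivial.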
Impagliazzo and Rudich [46] proved that OT protocols require public-key cryptography and cannot just rely on symmetric cryptography. However, quantum computers pose a threat to our currently deployed public-key systems. More specifically, Shor's algorithm [12] can break RSA, Diffie-Hellman and Elliptic Curve Cryptography systems, as it can solve the integer factorization and discrete logarithm problems in polynomial time. In section V, we present a quantum cryptographic protocol that executes OT and we describe how quantum cryptography can prevent these attacks.

B. RANDOM NUMBER GENERATOR
A Random Number Generator (RNG) is another very important tool in the realm of Secure Multiparty Computation (SMC). The SMC security can be compromised and the parties' privacy can be broken if the RNG used is predictable. An attack of this kind was reported in [47], where the authors exploited the weak Java Random Number Generator used in FastGC v0.1.1 [48] and disclosed the inputs of both parties in an SMC scenario. This example points out the fact that not every kind of Random Number Generator is suitable for cryptographic purposes.
In the case of Cryptographically Secure Pseudorandom Number Generators (CSRNG), it is crucial that they provide both forward and backward security. The former means that an attacker should not be able to predict the next generated number even when he knows the whole sequence generated so far. The latter means that an attacker should not be able to predict the whole generated sequence from a small set of generated elements. These two properties are not present in common Random Number Generators. For example, Linear Congruential generators are not fit for cryptographic tasks since they can be easily predicted, as reported in [49]. Also, Krawczyk found that a large class of General Congruential Generators do not provide forward security even for obscured parameters [50]. So, in order to produce CSRNGs, instead of using linear operations, the research community decided to rely on the computational intractability of computing the discrete logarithm. Both [51] and [52] use modular exponentiation as an intermediate step in order to generate pseudorandom bits. As mentioned above, all the cryptographic protocols with their security based on the Discrete Logarithm problem are threatened by quantum computers and these CSRNG protocols are no exception. Besides this technique, one could use either AES or DES as a cryptographically secure random generator.
Although these techniques are used to provide unpredictability and backward secrecy, all the randomness relies on the initial seed. This seed is needed because the whole process is based on deterministic algorithms. So, a Pseudo RNG can be viewed as a randomness extractor from some initial random value. For this reason, it is crucial to use an initial random value that is as close as possible to a truly random value. This can be generated from different sources and, usually, the best randomness comes from physical devices (e.g. atomic decay [53] or thermal noise [54]).
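To make the predictability of linear congruential generators mentioned above concrete, the following Python sketch recovers the multiplier and increment of such a generator from three consecutive outputs and then predicts the next one. It is a simplified illustration that assumes the modulus is known to the attacker and that the first difference is invertible modulo it; the parameters are hypothetical and the sketch is not the attack described in [49]. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def lcg(seed: int, a: int, c: int, m: int):
    """Linear congruential generator: x_{k+1} = (a * x_k + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def recover_lcg(x0: int, x1: int, x2: int, m: int):
    """Recover (a, c) from three consecutive outputs, with m known.

    Uses x2 - x1 = a * (x1 - x0) mod m, assuming (x1 - x0) is invertible.
    """
    a = ((x2 - x1) * pow((x1 - x0) % m, -1, m)) % m
    c = (x1 - a * x0) % m
    return a, c


M = 2**31 - 1                      # prime modulus (hypothetical parameters)
gen = lcg(seed=42, a=48271, c=12345, m=M)
x0, x1, x2, x3 = (next(gen) for _ in range(4))
a, c = recover_lcg(x0, x1, x2, M)
print((a, c), (a * x2 + c) % M == x3)
```

Once the parameters are recovered, every future output follows deterministically, which is why such generators offer no forward security.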

C. SECURE MULTIPARTY COMPUTATION
Let us consider a scenario with n parties, P i , each with input x i , i ∈ {1, . . . , n}. Secure Multiparty Computation (SMC) allows these n parties to jointly compute some function f (x 1 , . . . , x n ) = (y 1 , . . . , y n ) without disclosing their inputs to the other parties. So, this functionality is designed to be equivalent to the case where every party P i sends his inputs to some independent and trusted third party Q who computes f () and sends back to each party their corresponding output.
A solution to SMC was given for the first time by Yao [7] and its main idea resides in the fact that every function has a Boolean circuit representation. From this fact, Yao developed the concept of Garbled Circuits, which is one of the key elements for secure computation. Yao's Garbled Circuit (YGC) protocol is constrained to only two parties, but its generalization was achieved by GMW [8], BGW [55], [56] and BMR [57]. Also, some implementation optimizations of YGC were later developed in order to improve its performance: point-and-permute [57], row reduction [58], [59], FreeXOR [60] and half gates [61].
Our system security can be reduced to the secure computation of some predefined distance. Therefore, it only requires several two-party secure computations of the distance between two sequences, making YGC a good candidate due to its simplicity.

1) YAO PROTOCOL
As we said before, the main idea of YGC is to represent the desired function $f()$ as a Boolean circuit $C$, i.e. by a sequence of logical gates interconnected with wires. After the generation of the circuit $C$, each party takes a very different role. Generally speaking, one of the parties, $P_1$ (usually called the garbler), randomly generates keys for each input bit, encrypts each of the circuit's gates and sends both elements to $P_2$ (called the evaluator). This procedure masks $P_1$'s inputs from $P_2$. Then, through the OT functionality, $P_2$ receives the keys corresponding to his input bits. So, the OT masks $P_2$'s inputs from $P_1$. Finally, since the evaluator has all the input keys, he can decrypt every gate, i.e. evaluate the circuit. Let us see in more detail how the protocol works using a four-input Boolean circuit description of the Millionaires' problem, given by expression (9). The protocol goes as follows:
1) Circuit generation: The garbler $P_1$ generates a Boolean circuit of function (9). In this case, the circuit contains one NOT gate ($g_1$), two AND gates ($g_2$ and $g_5$), two XOR gates ($g_4$ and $g_6$), one XNOR gate ($g_3$) and four input wires ($w_1$ and $w_2$ belong to $P_1$, and $w_3$ and $w_4$ to $P_2$).
2) Wire encryption: $P_1$ uses a Random Number Generator to generate two keys, $k^0_i$ and $k^1_i$, for each wire $w_i$, $i \in \{1, \ldots, 10\}$. These keys correspond to the possible values (0 or 1) on the wire. Note that this is done to prevent $P_2$ from knowing the true value of the wires during the evaluation process.
3) Gate encryption: For every gate $g_l$ in the circuit with corresponding input wires $w_i$ and $w_j$ and output wire $w_s$, $P_1$ creates the following table:
$$\begin{array}{c} \mathrm{Enc}_{k^0_i}(\mathrm{Enc}_{k^0_j}(k^{g_l(0,0)}_s)) \\ \mathrm{Enc}_{k^0_i}(\mathrm{Enc}_{k^1_j}(k^{g_l(0,1)}_s)) \\ \mathrm{Enc}_{k^1_i}(\mathrm{Enc}_{k^0_j}(k^{g_l(1,0)}_s)) \\ \mathrm{Enc}_{k^1_i}(\mathrm{Enc}_{k^1_j}(k^{g_l(1,1)}_s)) \end{array}$$
where $g_l(t, r)$ is the output of gate $g_l$ for inputs $t, r \in \{0, 1\}$. So, we could think of each row as a locked box that requires two keys to be opened. If the two correct keys are used, it outputs the key corresponding to the desired output value given by $g_l$. As an example, we can easily see that if we use input keys $k^0_i$ and $k^1_j$ (corresponding to real values 0 and 1), we would only be able to decipher the second row of the table, $\mathrm{Enc}_{k^0_i}(\mathrm{Enc}_{k^1_j}(k^{g_l(0,1)}_s))$, and get $k^{g_l(0,1)}_s$. After encrypting each gate, $P_1$ permutes the rows of the corresponding table; otherwise, it would be easy to know the real value of the input keys. Then, he sends to $P_2$ the garbled tables along with $P_1$'s input keys.
4) Oblivious Transfer: At this stage of the protocol, the evaluator knows the garbled circuit and $P_1$'s input keys, but he does not know the keys corresponding to his real inputs. However, since $P_2$ wants to keep his input value private, he cannot directly ask for those keys. At this point, the Oblivious Transfer functionality enables the evaluator to receive his input keys without compromising either the evaluator's or the garbler's security. In fact, for every input wire of $P_2$, both parties perform an OT where $P_1$ plays the role of sender and $P_2$ plays the role of receiver. Let us assume $P_1$'s input keys to be $k^0_1$ and $k^1_2$ (corresponding to the real value 01) and $P_2$'s input bits to be 11. This means that $P_2$ must use the respective input keys ($k^1_3$ and $k^1_4$) in order to correctly evaluate the circuit. So, they will execute two OT protocols: in the first, $P_1$ inputs the pair $(k^0_3, k^1_3)$ and $P_2$ inputs the choice bit 1, receiving $k^1_3$; in the second, $P_1$ inputs the pair $(k^0_4, k^1_4)$ and $P_2$ inputs the choice bit 1, receiving $k^1_4$.
5) Circuit evaluation: Once the evaluator has all the necessary elements, he can proceed with the circuit evaluation. In this step, he simply has to decipher the correct rows of the garbled tables sent by $P_1$ with the corresponding keys. Since the rows of the tables are shuffled, the evaluator does not know which row is the correct one. This small issue can be solved by simple techniques (point-and-permute or encryption with a certain number of padded zeros) which, for the sake of brevity, we will not explore here. At the end of the evaluation, the evaluator obtains the key that corresponds to the result. Finally, the evaluator sends the resulting key to the garbler and the garbler tells him the final bit. According to our Millionaires' problem, the evaluation yields the following results for $a = 01$ and $b = 11$: $\ldots,\ g_6(k^0_6, k^0_9) = k^0_{10}$. Indeed, the desired result is 0.
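The garbling and evaluation of a single gate, as described in the steps above, can be illustrated with a toy Python sketch. It is deliberately simplified and is not the Libscapi implementation: the two-key encryption is modelled as an XOR with a SHA-256 pad derived from both keys (instead of nested AES encryptions), and the correct row is recognised by appended zero padding. All names are hypothetical and the construction should not be regarded as secure.

```python
import hashlib
import random
import secrets

KEY_LEN = 16
PAD = bytes(KEY_LEN)  # zero padding used to recognise the correct row


def enc(k_i: bytes, k_j: bytes, payload: bytes) -> bytes:
    """Toy two-key encryption: XOR a 32-byte payload with SHA-256(k_i || k_j)."""
    pad = hashlib.sha256(k_i + k_j).digest()
    return bytes(p ^ q for p, q in zip(payload, pad))


dec = enc  # XOR with the same pad inverts the encryption


def new_wire():
    """Two random keys for a wire: index 0 encodes bit 0, index 1 encodes bit 1."""
    return [secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN)]


def garble_gate(gate, keys_i, keys_j, keys_s):
    """Build the four encrypted rows of a gate and shuffle them."""
    rows = [enc(keys_i[t], keys_j[r], keys_s[gate(t, r)] + PAD)
            for t in (0, 1) for r in (0, 1)]
    random.SystemRandom().shuffle(rows)
    return rows


def evaluate_gate(rows, k_i: bytes, k_j: bytes) -> bytes:
    """Try every row; only the matching one decrypts to key || zero padding."""
    for row in rows:
        plain = dec(k_i, k_j, row)
        if plain[KEY_LEN:] == PAD:
            return plain[:KEY_LEN]
    raise ValueError("no row decrypted correctly")


w_i, w_j, w_s = new_wire(), new_wire(), new_wire()
table = garble_gate(lambda t, r: t & r, w_i, w_j, w_s)   # garbled AND gate
out = evaluate_gate(table, w_i[0], w_j[1])               # inputs 0 and 1
print(out == w_s[0])
```

The evaluator holding one key per input wire learns exactly one output key and nothing about the bit it encodes; in the full protocol, the evaluator's input keys would be obtained through OT.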
The Yao GC protocol has its security based on two main building blocks: Garbled Circuits and Oblivious Transfer. Although Garbled Circuits can be generated with symmetric encryption (i.e. using double AES encryption), we have already seen that OT protocols cannot be classically achieved with symmetric cryptography alone. Thus, it is crucial to find some efficient protocol for a quantum-resistant OT.

V. QUANTUM TOOLS
In this section, we start by talking about the very basics of quantum information. Then, we present three quantum primitives used in the private computation of phylogenetic trees, rendering a full quantum-proof solution.

A. BASICS OF QUANTUM INFORMATION
In quantum information theory, we characterise quantum states as qubits. Mathematically, these qubits are normalized vectors of a Hilbert space equivalent to $\mathbb{C}^2$ and we represent them using bra-ket notation. Here, we just consider two orthonormal bases: the computational basis $Z = \{|0\rangle, |1\rangle\}$ and the Hadamard basis $X = \{|+\rangle, |-\rangle\}$, where $|\pm\rangle = (|0\rangle \pm |1\rangle)/\sqrt{2}$.
Qubits can be used as a medium to encode some information. To extract this information, it is necessary to measure them. However, contrary to classical measurements, a quantum measurement is intrinsically probabilistic. In this work, we will just use projective measurements taken with respect to some basis. To describe the probabilistic nature of projective measurements, we make use of the scalar product between two vectors. More specifically, the squared modulus of the scalar product, $|\langle x|y\rangle|^2$, between two states $|x\rangle$ and $|y\rangle$ gives the probability of obtaining $|x\rangle$ when measuring $|y\rangle$ in a basis that contains $|x\rangle$. As an example, the probability of obtaining $|0\rangle$ when measuring $|+\rangle$ in the $Z$ basis is $|\langle 0|+\rangle|^2 = 1/2$. On the other hand, $|\langle 0|0\rangle|^2 = 1$ and $|\langle 0|1\rangle|^2 = 0$, which means that we always see the state $|0\rangle$ if we measure it using the $Z$ basis. This is the core ingredient that guarantees the security of the quantum tools used in the system. We refer the interested reader to the book by Nielsen and Chuang [63] for a more thorough introduction to the topic.
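The measurement rule above can be checked numerically. The following is a minimal Python sketch, assuming qubits are stored as pairs of amplitudes in the computational basis; the names `inner` and `prob` are illustrative and do not come from the system's implementation.

```python
import math

# Qubit states as (amplitude of |0>, amplitude of |1>).
ZERO = (1.0, 0.0)
ONE = (0.0, 1.0)
PLUS = (1 / math.sqrt(2), 1 / math.sqrt(2))
MINUS = (1 / math.sqrt(2), -1 / math.sqrt(2))


def inner(x, y):
    """Scalar product <x|y>, conjugating the amplitudes of the first state."""
    return sum(complex(a).conjugate() * complex(b) for a, b in zip(x, y))


def prob(outcome, state):
    """Probability |<outcome|state>|^2 of obtaining `outcome` when measuring `state`."""
    return abs(inner(outcome, state)) ** 2


# Measuring |0> in the Z basis is deterministic; measuring |+> in Z is a fair coin.
print(prob(ZERO, ZERO), prob(ONE, ZERO))
print(round(prob(ZERO, PLUS), 6), round(prob(ONE, PLUS), 6))
```

A state prepared in one basis and measured in the other therefore yields no information about the encoded bit, which is the property the protocols in this section rely on.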

B. QUANTUM OBLIVIOUS TRANSFER
As we have seen in section IV-C, Oblivious Transfer (OT) is a crucial primitive that guarantees the security of the Yao protocol, so it is of utmost importance to develop OT methods that are both quantum-secure and efficient. The first quantum OT (QOT) protocol was proposed by Bennett et al. [64], although they were not able to prove its security. Unfortunately, several no-go theorems [65]- [67] showed that unconditionally secure QOT is impossible without further assumptions. Several QOT protocols were therefore proposed under assumptions that limit the technological power of the adversary [30], [68]- [71].
Damgård et al. [72] and Lemus et al. [28] proposed a hybrid QOT (HQOT), where they use specific classical commitment schemes instead of a quantum commitment version. Unruh [73] proved the security of this hybrid version with ideal commitments in the universal composability model. So, we have that the HQOT protocol is secure against quantum adversaries as long as the commitment scheme used is quantum-resistant. Furthermore, Lemus et al. [28] also stressed that this HQOT protocol can provide a very practical way to perform OT in a Secure Multiparty Computation (SMC) environment. They split the HQOT protocol into two phases: a precomputation phase that generates oblivious keys (oblivious key phase), and a postprocessing phase that executes the OT based on the oblivious keys (oblivious transfer phase). Since we only need quantum technology during the first phase of HQOT, this splitting method allows separating the use of quantum technology and the execution of OT during the Yao protocol. Moreover, Santos et al. [74] proposed an optimization that makes the oblivious transfer phase as fast as the current most efficient classical methods.

1) QUANTUM OBLIVIOUS KEYS
The concept of oblivious key appeared for the first time in Jakobi et al. [29] as a way to implement Private Database Queries (PDQ). Also, a similar concept was used in [30] under the name of weak string erasure.
We can define the oblivious keys shared between two agents (sender and receiver) as a tuple of the form $(k^S, (k^R, x^R))$, where $k^S$ is the sender's key, $k^R$ is the receiver's key and $x^R$ is the receiver's signal string. $x^R$ indicates which indexes of $k^S$ and $k^R$ are correlated and which indexes are uncorrelated. By correlated indexes $i$, we mean that the receiver knows that $k^S_i = k^R_i$. By uncorrelated indexes $j$, we mean that the receiver does not know whether $k^S_j = k^R_j$ or not. Moreover, half the elements in the oblivious keys are correlated and half are uncorrelated.
In order to generate the oblivious keys, Lemus et al. [28] follow the prepare-and-measure quantum approach developed by Bennett et al. [64], combined with the Halevi and Micali classical bit commitments based on universal and cryptographic hashing [75]. The generation of correlated and uncorrelated elements comes from the quantum uncertainty principle along with the use of commitments. The security of the protocol is based on the laws of physics and on the fact that there is no significant quantum speed-up in finding collisions on the hash-based bit commitments [28], [76], [77]. Also, as discussed in [28], [74], this protocol has an important security feature: it is resistant against intercept-now, decipher-later attacks. The Quantum Oblivious Key Distribution (QOKD) protocol is summarized in Figure 3.
Following a similar approach, König et al. [30], [78] developed a prepare-and-measure protocol secure in the noisy quantum storage model. Also, under the same noisy quantum storage model, Kaniewski [79] and Ribeiro [80] proposed device-independent (DI) protocols that generate oblivious keys. Theoretically, these DI protocols offer enhanced security guarantees because they assume untrusted quantum devices.

2) HYBRID QUANTUM OBLIVIOUS TRANSFER
Based on the oblivious keys, we can easily execute an OT using a protocol similar to the reduction of Rabin 1 2 OT to 1-out-of-2 OT. The oblivious transfer phase with the optimization proposed in [74] is described in Figure 4.
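To make the idea concrete, the following Python sketch shows one generic way of consuming an oblivious key to run a 1-out-of-2 OT on bit strings. It is only an illustration under stated assumptions: it is not the optimized protocol of [74] nor a transcription of Figure 4, the oblivious key is simulated classically by the hypothetical helper `toy_oblivious_key`, and the convention that `x_r[i] == 0` marks a correlated index is chosen here for the example.

```python
import secrets


def xor_bits(a, b):
    """Bitwise XOR of two equally long lists of bits."""
    return [x ^ y for x, y in zip(a, b)]


def toy_oblivious_key(n):
    """Classically simulated oblivious key of length 2n (assumption, no quantum step)."""
    k_s = [secrets.randbits(1) for _ in range(2 * n)]
    x_r = [0] * n + [1] * n  # 0 = correlated index, 1 = uncorrelated index
    secrets.SystemRandom().shuffle(x_r)
    k_r = [k if flag == 0 else secrets.randbits(1) for k, flag in zip(k_s, x_r)]
    return k_s, k_r, x_r


def ot_from_oblivious_key(m0, m1, choice, k_s, k_r, x_r):
    """Receiver obtains m_choice; the other message stays masked by unknown key bits."""
    n = len(m0)
    known = [i for i, flag in enumerate(x_r) if flag == 0][:n]
    unknown = [i for i, flag in enumerate(x_r) if flag == 1][:n]
    if len(m1) != n or len(known) < n or len(unknown) < n:
        raise ValueError("not enough oblivious key material for these messages")
    # Receiver -> sender: two index sets, with the known set placed at position `choice`.
    index_sets = (known, unknown) if choice == 0 else (unknown, known)
    # Sender -> receiver: each message masked with the sender's key bits of one set.
    c0 = xor_bits(m0, [k_s[i] for i in index_sets[0]])
    c1 = xor_bits(m1, [k_s[i] for i in index_sets[1]])
    # Receiver: unmask the chosen ciphertext with the correlated part of his own key.
    chosen = c1 if choice else c0
    return xor_bits(chosen, [k_r[i] for i in known])


k_s, k_r, x_r = toy_oblivious_key(4)
m0, m1 = [0, 1, 1, 0], [1, 1, 0, 0]
print(ot_from_oblivious_key(m0, m1, 1, k_s, k_r, x_r))  # the receiver recovers m1
```

Since the sender never learns the signal string, the two index sets look alike to him, so the choice bit stays hidden; this sketch makes no attempt to model noise or dishonest behaviour.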

C. QUANTUM RANDOM NUMBER GENERATOR
As noted before, a potentially good source of True RNG comes from natural phenomena where some part of the system is used as the source of entropy. In the case of classical natural phenomena, the entropy is frequently taken from some unknown or chaotic subsystem which can ultimately be described by a deterministic theory. In this case, the unpredictability drawn from the system's entropy comes from our lack of knowledge and inability to fully grasp the underlying complex natural mechanisms. Also, some classical phenomena (e.g. mouse pointers) may not have enough entropy to generate good quality random numbers. However, quantum natural phenomena have their roots in Quantum Mechanics, which is intrinsically related to Probability Theory. For this reason, quantum systems can be potential sources of entropy even assuming complete knowledge of the system. This comes from the fact that, in Quantum Mechanics, we only have access to the probability distribution of the system's state and we can only know the outcome after measuring it [81].
Within the scope of SMC, the generation of the circuit's wire keys must be guaranteed to be unpredictable and efficient. All these features can be achieved with a QRNG [82].

D. QUANTUM KEY DISTRIBUTION
As we will explain in the last section, part of the communication between the parties should be kept encrypted. Message encryption is commonly achieved with symmetric cryptographic tools, such as AES (Advanced Encryption Standard) or the perfectly secure one-time pad cipher. These symmetric tools are used to encrypt the communication content through a common key assumed to be known only by the two communicating parties. However, the distribution of a common key cannot be realized using symmetric cryptography alone and classically requires asymmetric cryptography. Unfortunately, most of the commonly used techniques in asymmetric cryptography (RSA, Elliptic Curves or Diffie-Hellman) rely on computational assumptions that can be broken by a quantum computer through the already mentioned Shor's algorithm [12].
So, to render a quantum-resistant privacy-preserving solution, we make use of a Quantum Key Distribution (QKD) protocol to share symmetric keys to be used along with symmetric cryptography [83]- [86]. Its security relies on the laws of Quantum Physics and it is proven to be resistant against computationally unbounded adversaries [87], [88]. This level of security is closely related to one very important quantum property known as the No-Cloning theorem: an unknown quantum state cannot be perfectly copied, and it cannot be measured without introducing a measurable perturbation in the system. Thus, both parties taking part in the QKD protocol will be able to detect a potential eavesdropper in case some adversary tries to intercept and read the quantum signals.

VI. SOFTWARE TOOLS
Next, we present the open-source tools used to implement the system presented in the subsequent sections.

A. CBMC-GC
The CBMC-GC compiler [89] is used in step 1) of the Yao GC protocol to generate the boolean circuit representation of the desired function. It translates C-like code into boolean circuits based on a model checking tool called CBMC and it optimizes circuits for size and depth [90], [91]. HyCC [92] is also a potential candidate for this step as it builds upon CBMC-GC. However, it aims to build circuits for hybrid MPC protocols, on which our system is not based.

B. LIBSCAPI
The Libscapi library [31] implements several important cryptographic primitives for two-party and multi-party protocols. It is extensively used to implement steps 2) to 5) of the Yao GC protocol in the repository MPC-Benchmark [93]. This implementation integrates one of the most efficient OT extension protocols [94] along with the base OTs proposed by Chou and Orlandi [95].

C. PHYLIP
The PHYLIP package [36] is an open-source project written in C that provides a set of programs to infer phylogenies. Among other programs, it implements distance-based methods (UPGMA, Neighbour-Joining, Fitch-Margoliash) and computes the evolutionary distances described previously in section II-A (JC, K2P, F84, LD). Due to its modularity, we integrate the PHYLIP distance methods with the quantum-assisted Yao protocol used to compute the evolutionary distances.

VII. SECURE MULTIPARTY COMPUTATION OF PHYLOGENETIC TREES
The proposed system allows the secure computation of a suite of algorithms that perform phylogeny analysis through the computation of phylogenetic trees. Based on the modular nature of distance-based algorithms, the system combines different evolution models with different phylogenetic algorithms. In this section, we describe how to integrate the tools presented in the previous sections IV-VI to develop this modular private system.

A. FUNCTIONALITY DEFINITION
As already mentioned in section II, all distance-based methods are divided into two phases: distance matrix computation and distance matrix processing. Apart from the metric used, the first phase is similar among all methods, whereas the second phase is specific to each one and depends only on the distance matrix. Therefore, each phase corresponds to a particular functionality that can be formalized as follows: where each $l_1$ and $l_2$ denotes the distance to its parent node, each subtree is built up from other subtrees and the leaves are given by $(\mathrm{subtree}_{k-1} : l_{k-1},\ s_{i_k} : l_k)$. For consistency, leaves are also considered as subtrees. Note that this representation is not unique, e.g.

B. PRIVATE PROTOCOL
During the distance matrix computation phase (DM) of the private $A^a_d$ protocol, each party has to compute the distance between his sequences and the other parties' sequences privately, i.e. without revealing his sequences to the other participating parties. Since this corresponds to several instances of a two-party secure computation, we make use of the Yao GC protocol described in IV-C1. This means that each party has to generate the boolean circuit representation of the chosen distance $d$, which is accomplished by the CBMC-GC software tool before the beginning of the protocol. In section IX-A, we analyse how to generate these circuits. Now, since the Yao protocol is executed only between two different parties $P_i$ and $P_j$ for $i, j \in [n]$, the other participating parties $P_t$, $t \in [n] \setminus \{i, j\}$, do not have access to the distances computed between these two parties' sequences. For this reason, $P_t$ has to receive the result of the Yao protocol execution from both $P_j$ and $P_i$. After this, each party outputs the distance matrix in the format required as input by the PHYLIP programs fitch, kitsch and neighbor.
In the second phase of the protocol (A), the parties do not need to communicate as this phase only depends on the quantities computed during the first phase. For this reason, this phase is executed internally by each party, which then computes the phylogenetic tree. This phase is carried out by the PHYLIP programs mentioned in the previous paragraph. These two phases are shown in Figure 6 and we give more details about the protocol assisted with quantum technologies in the next section.

C. QUANTUM PRIVATE PROTOCOL
Let us specify the private $A^a_d$ protocol with the quantum cryptographic tools. Following the scenario depicted in Figure 6, we define $S_i = \{s_{i,1}, \ldots, s_{i,l}\}$ to be the set of sequences owned by party $P_i$. Also, we denote by $d_{(i,l),(j,k)}$ the distance between the $l$-th sequence of party $P_i$ and the $k$-th sequence of party $P_j$, i.e. $d_{(i,l),(j,k)} = d(s_{i,l}, s_{j,k})$.
As briefly described before, the private $A^a_d$ protocol has two phases. The first phase requires different types of interactions between the parties to compute the desired distance matrix and the second phase is computed internally. Since the second phase is carried out internally, there is no need for communication between the parties. Therefore, the quantum cryptographic tools will only be used during the first private phase. In summary, each pair of parties requires two quantum channels as depicted in Figure 6: one to generate oblivious keys for oblivious transfer and the other to generate symmetric keys for encryption.
Consider the case where $P_t$ has to compute the distance matrix entry corresponding to distance $d_{(i,l),(j,k)}$. Depending on whether $P_t$ owns both, one or none of the sequences $s_{i,l}$ and $s_{j,k}$, $P_t$ proceeds as follows:
1) If $i = j = t$ (i.e. both sequences are owned by $P_t$), $d_{(i,l),(j,k)}$ is computed internally by $P_t$ (blue arrow in Figure 6);
2) If $i = t$ and $j \neq t$ (i.e. one of the sequences is owned by $P_t$), $d_{(i,l),(j,k)}$ is computed privately with the Yao GC protocol assisted with the Quantum Oblivious Key Distribution system (red arrow in Figure 6);
3) If $i \neq t$ and $j \neq t$ (i.e. none of the sequences is owned by $P_t$), both parties $P_i$ and $P_j$ (or just party $P_i$ in case $i = j$) must send to $P_t$ the distance $d_{(i,l),(j,k)}$ encrypted with the symmetric key generated through the Quantum Key Distribution system (black arrow in Figure 6).
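The case analysis can be summarized as a small dispatch routine. The Python sketch below is illustrative only: the function names and the string labels are hypothetical, and it assumes that the case in which exactly one sequence is owned by $P_t$ is handled the same way regardless of which of the two indices equals $t$.

```python
def distance_source(t, i, j):
    """How party t obtains the distance between a sequence of P_i and one of P_j."""
    if i == t and j == t:
        return "local"        # both sequences are owned by P_t: computed internally
    if i == t or j == t:
        return "yao_qokd"     # one sequence is owned by P_t: Yao GC assisted with QOKD
    return "qkd_forward"      # none is owned by P_t: received encrypted with QKD keys


def plan_for_party(t, owners):
    """Map every unordered pair of sequence indices to the mechanism used by party t.

    `owners[a]` is the index of the party owning sequence number a.
    """
    plan = {}
    for a in range(len(owners)):
        for b in range(a + 1, len(owners)):
            plan[(a, b)] = distance_source(t, owners[a], owners[b])
    return plan


# Four sequences: two owned by party 0, one by party 1 and one by party 2.
owners = [0, 0, 1, 2]
for pair, mechanism in plan_for_party(0, owners).items():
    print(pair, mechanism)
```

In this toy configuration party 0 computes one entry locally, runs four private two-party computations, and receives a single entry over the QKD-encrypted channel.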

VIII. QUANTUM TECHNOLOGIES INTEGRATION
Now, let us see the role of quantum technologies in this private system and its integration with quantum networks.

A. QUANTUM OBLIVIOUS TRANSFER
Libscapi implementation of Yao GC protocol combines a very efficient base OT protocol with one of the fastest OT Extension protocols: it uses the base OT (SimpleOT) proposed by Chou and Orlandi [95] integrated with the OT Extension presented in [94]. In this setting, the HQOT protocol can be implemented in two different ways depending on the number of oblivious keys generated between the two parties: as a base OT protocol integrated within OT Extension protocol or as a stand-alone method substituting all Libscapi OT implementation. If the number of oblivious keys generated is scarce compared to the number of OT required, then one should integrate HQOT with OT Extension. Otherwise, one could directly use the HQOT. A scheme of the integration of the Quantum Oblivious Key Distribution (QOKD) system is depicted in Figure 7.
It is important to note that the base OTs executed during the pre-computation phase of the OT Extension have the parties' roles reversed. This means that the OT Extension sender is the base OT receiver and vice-versa. This should be taken into consideration in case the HQOT is integrated with OT Extension because HQOT is not symmetric in the sense that the apparatus used by the sender is different from that of the receiver. However, since it is known that Oblivious Transfer is symmetric, we can use the reduction proposed in [96] without having to swap the quantum technological material.
We can use oblivious keys to execute a Sender Random Oblivious Transfer (SR-OT) as presented in [74]. This is the flavour of Oblivious Transfer with the smallest computation and communication complexity that can be implemented with oblivious keys. From an implementation perspective, it is important to note that the oblivious transfer step has to be implemented before the garbling phase in case we use the SR-OT version. This is because the input wire keys are defined by the oblivious keys in this case. Consequently, there is no need to use the Quantum Random Number Generator to generate random keys for the evaluator's keys as they are already being generated by the oblivious keys. However, it is still necessary to generate random keys for the corresponding garbler's inputs. This will cut in half the number of random numbers required by the QRNG. So, in case SR-OT is adopted, the structure of the Yao GC protocol must be as follows: 1) Circuit generation; 2) Random Oblivious Transfer; 3) Wire encryption; 4) Gate encryption; 5) Circuit evaluation.

B. QUANTUM RANDOM NUMBER GENERATION
As previously described, the Yao GC protocol needs to generate random numbers for the keys in the Wire encryption step. This is crucial for the security of the protocol because predictable keys allow an adversary to deduce the parties' inputs, as reported in [47]. The Libscapi implementation makes use of the OpenSSL library function RAND_bytes to randomly generate a seed from which it computes new numbers. In this private system, we replace this function with a call to the QRNG.
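Conceptually, this substitution amounts to making the randomness source of the wire-key generator pluggable. The Python sketch below only illustrates that design; the real system is built on Libscapi, the names `make_wire_keys` and `qrng_read` are hypothetical, and the stub falls back to the operating system's generator because no concrete QRNG interface is specified in the text.

```python
import secrets


def qrng_read(num_bytes):
    """Hypothetical stand-in for reading `num_bytes` random bytes from a QRNG device.

    Assumption: here it simply delegates to the operating system's CSPRNG.
    """
    return secrets.token_bytes(num_bytes)


def make_wire_keys(num_wires, key_bytes=16, random_bytes=qrng_read):
    """Return one (key for bit 0, key for bit 1) pair per circuit wire."""
    return [(random_bytes(key_bytes), random_bytes(key_bytes))
            for _ in range(num_wires)]


keys = make_wire_keys(4)
print(len(keys), len(keys[0][0]))
```

Passing the source as a parameter keeps the garbling code unchanged when the classical generator is swapped for a quantum one.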

C. QUANTUM KEY DISTRIBUTION
The QKD system allows the participating parties to receive the distance elements of the sequences they do not own, while preserving the security of the system. We use the keys generated by the QKD system along with the perfect cipher: One-time Pad.
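As an illustration of this step, the following Python sketch encrypts a single distance value with a one-time pad. It is a sketch under assumptions: the 8-byte big-endian floating-point encoding and the function names are choices made here, and the key is drawn from `secrets` as a placeholder for key material delivered by the QKD system.

```python
import secrets
import struct


def otp_encrypt(distance, key):
    """XOR the 8-byte encoding of a distance with a key of the same length."""
    plaintext = struct.pack(">d", distance)
    if len(key) != len(plaintext):
        raise ValueError("a one-time pad key must be exactly as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, key))


def otp_decrypt(ciphertext, key):
    """Invert otp_encrypt with the same key."""
    if len(key) != len(ciphertext):
        raise ValueError("a one-time pad key must be exactly as long as the message")
    plaintext = bytes(c ^ k for c, k in zip(ciphertext, key))
    return struct.unpack(">d", plaintext)[0]


qkd_key = secrets.token_bytes(8)  # placeholder for 8 bytes of QKD-generated key
ciphertext = otp_encrypt(0.1875, qkd_key)
recovered = otp_decrypt(ciphertext, qkd_key)
print(recovered)
```

The perfect secrecy of the one-time pad only holds if each key byte is used once, so every forwarded distance consumes fresh key material.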

D. QUANTUM NETWORK INTEGRATION

1) TECHNOLOGICAL EQUIPMENT
Both QKD and QOKD protocols rely on the same physical processes. They can both be realized either with continuous or discrete variables [28], [83], [85], [97]. Also, the technological equipment used by the receiver and transmitter is the same in both quantum services (QKD and QOKD). In the prepare-and-measure setting, the first quantum step is the same in both protocols: the sender sends quantum states prepared in one of two bases chosen at random and the receiver measures these states in randomly chosen bases. The difference lies in the classical post-processing phase. So, we can conclude that both services share the same technological equipment (fibre, receiver and transmitter). Moreover, as proposed by Pinto et al. [25] in a similar setting, both QKD and QOKD services can coexist with classical signals in the same fibre.
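The shared quantum step and the two different post-processing phases can be illustrated with a classical simulation. The Python sketch below assumes ideal, noiseless devices and honest parties; it omits the commitments, parameter estimation, error correction and privacy amplification of the real protocols, and all function names as well as the convention that a signal bit 0 marks matching bases are assumptions made for the example.

```python
import random


def prepare_and_measure(n, rng):
    """Shared quantum step: random bits and bases at the sender, random bases at the receiver."""
    bits = [rng.randrange(2) for _ in range(n)]
    send_bases = [rng.randrange(2) for _ in range(n)]  # 0 = Z basis, 1 = X basis
    meas_bases = [rng.randrange(2) for _ in range(n)]
    # Same basis: the outcome equals the encoded bit. Different basis: uniformly random.
    outcomes = [b if sb == mb else rng.randrange(2)
                for b, sb, mb in zip(bits, send_bases, meas_bases)]
    return bits, send_bases, meas_bases, outcomes


def qkd_sift(bits, send_bases, meas_bases, outcomes):
    """QKD post-processing (sifting only): keep the rounds where the bases match."""
    keep = [i for i in range(len(bits)) if send_bases[i] == meas_bases[i]]
    return [bits[i] for i in keep], [outcomes[i] for i in keep]


def qokd_postprocess(bits, send_bases, meas_bases, outcomes):
    """QOKD post-processing: keep every round and record where the bases matched."""
    x_r = [0 if sb == mb else 1 for sb, mb in zip(send_bases, meas_bases)]
    return bits, (outcomes, x_r)  # (k_S, (k_R, x_R))


rng = random.Random(2022)  # seeded, non-cryptographic generator: simulation only
run = prepare_and_measure(16, rng)
sender_key, receiver_key = qkd_sift(*run)
k_s, (k_r, x_r) = qokd_postprocess(*run)
print(sender_key == receiver_key, x_r.count(0))
```

Both services start from the same raw data; QKD discards the mismatched rounds, whereas QOKD keeps them as the uncorrelated half of the oblivious key.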

2) NETWORK TOPOLOGY
The quantum private protocol explained above (VII-C) assumes that every two parties have a direct quantum channel between them that is used to generate oblivious keys and symmetric keys, i.e. a fully connected quantum network. This approach follows from the fact that the first Quantum Key Distribution and Quantum Oblivious Transfer protocols were based on prepare-and-measure techniques [64], [98]. However, there are also protocols that implement device-independent QOKD (DI-QOKD) [79], [80] (under some constraints) and DI-QKD [83]. In addition to the advantages from a security point of view, these DI protocols can also be implemented within a star-structured quantum network having an untrusted party as the middle point. This increases the implementation flexibility of the proposed quantum private protocol of phylogenetic trees (VII-C).
As analysed by Joshi et al. [99], existing networks fall into three possible types: trusted node networks, actively switched and fully connected quantum networks based on entanglement sharing and wavelength multiplexing. Using the two types of protocols just mentioned (prepare-and-measure and device-independent), it is possible to implement our proposed system in all three existing quantum network implementation types.
Moreover, Kumaresan et al. [100] analyse possible Secure Multiparty Computation infrastructure topologies that can be created based on a set of OT channels shared between some pairs of parties in the network. They developed ''secure protocols that allow additional pairs of parties to establish secure OT correlations using the help of other parties in the network in the presence of a dishonest majority'' (Abstract, [100]). Since they work in the information-theoretic setting, there is no security loss in combining the protocol of Kumaresan et al. with quantum approaches. This integration increases the range of configurations allowed. However, further efficiency analysis has to be done to understand the impact of this approach in practice.

E. EXPERIMENTAL ATTACKS
Although QKD and QOKD systems are proved to be theoretically unbreakable, all experimental implementations come with possible loopholes. Theoretical proofs usually assume that the physical apparatus of honest parties cannot be hacked. However, imperfections in both generating and measuring the photons can be exploited in multiple ways to perform quantum attacks. We refer the interested reader to proper review articles [83], [101] on QKD attacks and possible mitigation measures. Here, we briefly discuss the impact of these attacks on QOKD systems.

1) QOKD ATTACKS
It is important to stress that there is a fundamental difference between QKD and QOKD systems. In the former, both parties can cooperate in order to detect an external attack, whereas, in the latter, the two parties do not trust each other. Regarding the QOKD system, the sender must not be able to know which set of indexes is known by the receiver (i.e. $x^R$) and the receiver must have limited knowledge of the sender's key (i.e. $k^S$). This means that both sender and receiver can leverage quantum attacks to gain some information (or control) about the set of bases used by the other. Two of the most problematic attacks on quantum systems are faked-state attacks [102] (FSA) and Trojan-horse attacks [103] (THA). The former targets the measurement apparatus and the latter can target both the preparation and the measurement apparatus. In a prepare-and-measure setting, FSA can only be used by the sender while THA can be used by both.
FSA relies on well-crafted optical signals that allow the sender to take control over the receiver's measurement outcomes. In summary, as described by Jain et al. [104], when both parties' bases coincide, the receiver's detector clicks; when these are incompatible, he gets no detection event ($\perp$). The indexes corresponding to no detection events will be discarded by both parties whereas the others will be used in the rest of the protocol. This way, the sender has full knowledge of the receiver's bases and can easily distinguish $I_0$ from $I_1$. Note that the sender does not have to attack all measurement rounds. He only needs one successful FSA to guess one basis. This happens with high probability in the number of attacks $q$. This attack is summarized in Figure 8. We denote by $S_{qokd}(J)$ ($R_{qokd}(J)$) the sender's (receiver's) quantum hacking procedure that provides him with the receiver's (sender's) bases from index set $J$.
THA is achieved by sending bright pulses into the equipment under attack and scanning through the different reflections to obtain the bases used. As with the FSA, the sender only needs one successful attack, as summarized in Figure 9. However, the receiver's attack is more challenging. Not only does he have to successfully guess all the sender's bases, he also has to be able to correctly measure the corresponding qubits after learning the sender's bases. Without the help of quantum memories, it is much more difficult for this procedure to succeed and allow the receiver to extract the whole key, $k^S$. The receiver's attack based on THA is summarized in Figure 10.

2) COUNTERMEASURES
We have seen how two well-known quantum hacking techniques can undermine the security of oblivious keys and, consequently, the security of oblivious transfer. Fortunately, there are some countermeasures that can prevent such attacks from breaking the system's security. These countermeasures can be divided into two categories: security patches that tackle specific vulnerabilities and novel schemes that tolerate faulty devices.
Regarding the two attacks presented, it is usually possible to implement security patches that prevent them. FSA can be prevented by placing an additional detector (usually called a watchdog) at the entrance of the receiver's measurement device. This detector monitors possible malicious radiation aimed at blinding his detector. Also, THA can be blocked by an isolator placed at the entrance of both parties' devices. However, as mentioned by Jain et al. [104], these two countermeasures only prevent these attacks perfectly if the isolators and watchdogs work at all relevant frequencies, which is not the case in practice.
This security-patch strategy only tries to bring the experimental implementation closer to the ideal protocol. However, since the ideal protocol does not account for faulty devices, this task is very difficult to accomplish. A better approach to mitigate these security issues is the development of novel schemes that tolerate faulty devices. This is the main aim of device-independent protocols, which treat both sender and receiver devices as black boxes with minimal security guarantees. To the best of our knowledge, there are only two proposed DI protocols for oblivious keys [79], [105]. However, Kaniewski's protocol [79] is only proven to be secure against sequential attacks and Broadbent's protocol [105] relies on a post-quantum computational assumption.
To avoid the technological challenges of DI protocols, we can relax its security levels and work in the measurement-device-independent (MDI) setting. This approach allows two parties to perform QOKD with untrusted measurement devices while trusting in their sources. However, Ribeiro et al. [80] showed that although the protocol is secure with ideal photon sources, it is not proven to be secure with imperfect sources.

IX. SYSTEM SECURITY
In this section, we analyse the security of the proposed system. We start by describing the methods used to privately compute the distance between two sequences and then we prove the security of the private protocol proposed in VII-C which implements the functionality described in VII-A.

A. PRIVATE COMPUTATION OF DISTANCES
The private computation of distances between sequences is an important building block in the security of the system. We have that the privacy of the sequences directly relies on this step. Here, we go through the methods used to compute the distances used by the PHYLIP program: Jukes-Cantor, Kimura 2-parameter, F84 and LogDet.
A common building block to all these four distance metrics is the computation of the Hamming distance between two sequences x and y, h xy . We start by looking at an adapted divide-and-conquer way to compute the Hamming distance between two sequences and then we see how to apply it to the private computation of distance metrics.

1) HAMMING DISTANCE
We are interested in the boolean representation of the Hamming distance and, as mentioned above, we use the CBMC-GC tool to translate ANSI-C code into this representation. Usually, to compute the Hamming distance between two binary strings, x and y, we start by applying the XOR operation, z = x ⊕ y. Then, we just have to count the number of 1's in z. This operation is commonly known as population count or popcount(z) for short. So, the binary Hamming distance is given by h xy = popcount(x ⊕ y).
We use an adapted divide-and-conquer technique for the computation of popcount(z) [106]. Originally, this divide-and-conquer technique starts by dividing the sequence into 2-bit blocks and then counts the number of 1's inside each 2-bit block. After that, it allocates the result of each block in a new 2-bit block. Then, we can sum the values inside these 2-bit blocks iteratively. We follow the approach described above but we have to tailor it for the computation of the Hamming distance between two sequences over the four bases (A, C, G, T). Since we are using a boolean circuit representation, the nucleotide sequences must be represented in binary. So, by convention, we use the following 2-bit encoding: A = 00, C = 01, G = 10 and T = 11. If we followed the approach described above directly, the Hamming distance between the single-valued sequences ''A'' and ''C'' would be smaller than that between the single-valued sequences ''A'' and ''T'', since popcount(00 ⊕ 01) = 1 whereas popcount(00 ⊕ 11) = 2. This issue comes from the fact that we are counting the number of 1's inside every 2-bit block. Instead, we are just interested in knowing if there is at least one 1 inside each 2-bit block because it indicates that the bases at that site are different. Therefore, before counting the number of 1's in the XORed sequence, we apply an OR operation to the bits inside every 2-bit block. We call this operation $\mathrm{popcount}_t(z)$. For simplicity, hereafter we denote by $h_{xy}$ the tailored Hamming distance between sequences x and y. Now, we have that the tailored Hamming distance between ''A'' and ''T'' gives the desired result: $h_{AT} = \mathrm{popcount}_t(00 \oplus 11) = 1$. In Figure 11, we show an example of how to compute the Hamming distance between the two-valued sequences ''AG'' and ''GC''.
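The tailored population count can be prototyped in a few lines before it is written as circuit-friendly ANSI-C. The Python sketch below is an illustration under assumptions: it works on a fixed 32-bit word (at most 16 nucleotides), and the helper names `encode`, `popcount_t` and `hamming` are chosen here rather than taken from the system's source code.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

M1 = 0x55555555   # low bit of every 2-bit block
M2 = 0x33333333   # low 2 bits of every 4-bit block
M4 = 0x0F0F0F0F   # low 4 bits of every byte
M8 = 0x00FF00FF   # low byte of every 16-bit half
M16 = 0x0000FFFF  # low 16 bits of the word


def encode(seq):
    """Pack a nucleotide string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODING[base]
    return value


def popcount_t(z):
    """Number of non-zero 2-bit blocks in a 32-bit word."""
    z = (z | (z >> 1)) & M1              # OR inside each block: 1 if the site differs
    z = (z & M2) + ((z >> 2) & M2)       # divide and conquer: sum neighbouring blocks
    z = (z & M4) + ((z >> 4) & M4)
    z = (z & M8) + ((z >> 8) & M8)
    z = (z & M16) + ((z >> 16) & M16)
    return z


def hamming(x, y):
    """Tailored Hamming distance between two equally long sequences (up to 16 bases)."""
    if len(x) != len(y) or len(x) > 16:
        raise ValueError("sequences must have the same length, at most 16 bases")
    return popcount_t(encode(x) ^ encode(y))


print(hamming("A", "C"), hamming("A", "T"), hamming("AG", "GC"))
```

With the OR step in place, every differing site contributes exactly one to the count, whatever the pair of bases involved.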

2) JUKES-CANTOR
As described in section II-A1, the Jukes-Cantor distance between two sequences is given by $d_{xy} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{h_{xy}}{N}\right)$, where $h_{xy}$ is the Hamming distance between sequence x and sequence y. Now, note that the function $f(x) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{x}{N}\right)$ is one-to-one. This means that, from a privacy point of view, $f(x)$ carries the same amount of information as $x$. Therefore, we can simply proceed as follows: 1) Privately compute the Hamming distance, $h_{xy}$, using the tailored Hamming distance method described above and the Yao GC protocol assisted with quantum oblivious keys; 2) Internally compute $d_{xy} = f(h_{xy})$ (no need for quantum SMC). This way, we just have to generate the boolean circuit for $h_{xy}$ rather than for the full expression $d_{xy}$.
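The second, purely local step can be sketched as follows. This Python fragment assumes the standard Jukes-Cantor expression $-\frac{3}{4}\ln(1 - \frac{4}{3}\frac{h}{N})$ with $N$ the number of sites; the function name and the error handling are illustrative choices.

```python
import math


def jukes_cantor(h, n_sites):
    """Jukes-Cantor distance from a Hamming distance h over n_sites aligned sites."""
    p = h / n_sites
    if p >= 0.75:
        # The logarithm's argument is not positive: the distance is undefined.
        raise ValueError("Jukes-Cantor distance is undefined for h/N >= 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# h would be the output of the private Hamming distance computation.
print(jukes_cantor(10, 100))
```

Because the map from $h$ to the distance is one-to-one on its domain, revealing $h$ to both parties leaks nothing beyond what the final distance already reveals.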

3) KIMURA
In section II-A2, we saw that the Kimura 2-parameter model leads to the following distance: $d_{xy} = -\frac{1}{2}\ln\left((1 - 2P - Q)\sqrt{1 - 2Q}\right)$, where $P = \frac{n_1}{N}$, $Q = \frac{n_2}{N}$ and $n_1$ and $n_2$ are respectively the number of sites for which two sequences differ from each other with respect to type I (''transition'' type) and type II (''transversion'' type) substitutions.
Similar to the case of the Jukes-Cantor metric, note that $h(x) = -\frac{1}{2}\ln\sqrt{\frac{x}{N^3}}$ is one-to-one and only defined for $x > 0$. Thus, we can proceed as follows: 1) Privately compute the expression $c = (N - 2n_1 - n_2)^2 (N - 2n_2)$ using the tailored Hamming distance method described above and the Yao GC protocol assisted with quantum oblivious keys; 2) Internally compute $d_{xy} = h(c)$ (no need for quantum SMC). More precisely, the ANSI-C code that privately computes the expression $c = (N - 2n_1 - n_2)^2 (N - 2n_2)$ proceeds as follows. It uses the function $\mathrm{popcount}_t(z)$ described above to compute the quantities $n_1$ and $n_2$. Observe that both transition types (A ↔ G and C ↔ T) render the same XOR value: A ⊕ G = 00 ⊕ 10 = 10 and T ⊕ C = 11 ⊕ 01 = 10. Therefore, using a four-sized sequence, the quantities $n_1$ and $n_2$ are given by:
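A plain (non-private) prototype of this two-step split is sketched below in Python. It is an illustration under assumptions: the counting rule is derived from the 2-bit encoding (XOR value 10 for transitions, 01 or 11 for transversions) rather than copied from the elided expressions, the distance is taken to be the standard Kimura 2-parameter formula $-\frac{1}{2}\ln((1-2P-Q)\sqrt{1-2Q})$, and all function names are hypothetical.

```python
import math

ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def count_substitutions(x, y):
    """Return (n1, n2): the numbers of transition and transversion sites."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n1 = n2 = 0
    for a, b in zip(x, y):
        z = ENCODING[a] ^ ENCODING[b]
        if z == 0b10:
            n1 += 1   # A <-> G or C <-> T: high bit set, low bit clear
        elif z != 0:
            n2 += 1   # XOR value 01 or 11: the low bit is set
    return n1, n2


def kimura_c(n, n1, n2):
    """Integer expression that would be evaluated inside the garbled circuit."""
    return (n - 2 * n1 - n2) ** 2 * (n - 2 * n2)


def kimura_distance(c, n):
    """Local post-processing: d = -1/2 * ln(sqrt(c / N^3)), defined for c > 0."""
    if c <= 0:
        raise ValueError("distance undefined for c <= 0")
    return -0.5 * math.log(math.sqrt(c / n ** 3))


x, y = "ACGTACGTAC", "GCGTACGTAA"
n1, n2 = count_substitutions(x, y)
c = kimura_c(len(x), n1, n2)
print(n1, n2, c, kimura_distance(c, len(x)))
```

Only the integer $c$ has to be produced by the circuit, which avoids implementing logarithms and square roots in boolean gates.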

4) F84 AND LogDet
Recall from sections II-A3 and II-A4 the expressions of the F84 model and LogDet metrics, where $A = \frac{\pi_C \pi_T}{\pi_Y} + \frac{\pi_A \pi_G}{\pi_R}$, $B = \pi_C \pi_T + \pi_A \pi_G$ and $C = \pi_R \pi_Y$ for $\pi_Y = \pi_C + \pi_T$ and $\pi_R = \pi_A + \pi_G$, and $P$ and $Q$ are defined as in the Kimura 2-parameter model above. Also, the divergence matrix $F_{xy}$ is a $4 \times 4$ matrix such that the $ij$-th entry gives the proportion of sites in sequences x and y with nucleotides $i$ and $j$, respectively. Finally, the diagonal matrices associated with x and y have as their $i$-th component the proportion of nucleotide $i$ in sequences x and y, respectively.
As before, we want to split the private computation of both F xy and L xy into two steps. Note that, in this case, there is no clear way to define two bijective functions, g() and q(), on some simple parameters, d and e, such that F xy = g(d) and L xy = q(e). By simple parameters, we mean parameters that do not depend on complex operations such as logarithms or square roots. Instead, one can use the CORDIC algorithm [107], [108] for the square-root and logarithm functions and translate an approximation of both F xy and L xy into boolean circuits.

B. PRIVATE COMPUTATION OF PHYLOGENETIC TREES
In this section we prove that the protocol A a d described in section VII-C securely implements the functionality A • DM described in section VII-A according to security definition 1. So, we want to prove the following theorem:
Theorem 2: The protocol A a d securely realizes A • DM in the presence of semi-honest adversaries.
We start by noting that the ideal functionality outputs the distance matrix to the parties and that during the computation of A there is no interaction between the parties. Therefore, the security of the system is independent of the distance-based algorithm used (UPGMA, Neighbour-Joining or Fitch-Margoliash) and we can focus only on the computation of the DM functionality.
As already mentioned, the protocol that implements the functionality DM is built from many invocations of a two-party distance functionality, denoted by D d for d ∈ {JC, K2P, F84, LD}. So, in order to prove the above theorem, we need the following two lemmas:
Lemma 1: A a d privately reduces DM to D d , i.e. an oracle-aided A a d protocol privately computes DM using the oracle-functionality D d .
Proof: In order to prove this lemma, we have to develop a simulator Sim that simulates the view of a set of corrupted parties C. Sim starts by receiving all the input sequences from the corrupted parties. It then proceeds as follows:
1) Generates random sequences for the honest parties, H .
2) Invokes the oracle-functionality D d on these sequences.
3) Sends to all corrupted parties C the distances computed from the honest parties' sequences.
4) Invokes the oracle-functionality D d on the sequences owned by the corrupted parties.
5) Invokes the oracle-functionality D d (s i , s j ) for s i ∈ H and s j ∈ C.
In a real execution, the corrupted parties only receive the distances computed by D d on the honest parties' sequences (as in step 2), on their own sequences (as in step 4) and between corrupted and honest parties' sequences (as in step 5). Therefore, the oracle-aided A a d protocol privately computes DM using the oracle-functionality D d .
Lemma 2: The Yao Garbled Circuits protocol with the OT primitive instantiated by the HQOT protocol (section V-B) privately computes D d .
Proof: A framework that allows quantum protocols to be composed in a classical environment was developed in [109]. The authors also mention that a general secure function evaluation remains secure when the OT primitive is instantiated by a secure quantum version. In [72], it was proved that HQOT is secure according to the security definition given in [109]. Therefore, we can compose the HQOT protocol with a Yao Garbled Circuit [110] while preserving the overall security.
So, from Lemmas 1 and 2, we can use the composition theorem (Theorem 1) and conclude that the protocol A a d is secure. We have thus proved that our system is secure against quantum computer attacks in the semi-honest model. In order to extend the protocol to the malicious setting, we just have to implement a two-party secure computation protocol that is secure in the malicious adversary model [42].

X. COMPLEXITY ANALYSIS
In this section, we start by analysing the complexity of the protocol A a d presented before. We assume there are n parties, P 1 , . . . , P n , with M 1 , . . . , M n sequences, respectively. Also, we assume that the sequences are aligned and that they have the same number of nucleotides, s. Then, we extend the analysis carried out in [74] and compare the computation and communication complexity of the optimized version of HQOT with that of the fastest reported OT extension protocol in the malicious model [94], which is the one used by Libscapi.

A. PROTOCOL COMPLEXITY ANALYSIS
Now, let us analyse the complexity of the protocol presented in section VII-C.

1) YAO GC EXECUTIONS
Regarding the number of Yao GC protocol executions, each party $P_j$ owning $M_j$ sequences has to perform $N^j_{Yao} = M_j \sum_{i \neq j} M_i$ secure distance computations. So, the total number of Yao GC executions is given by

$$N_{Yao} = \sum_{j=1}^{n} N^j_{Yao} = \sum_{j=1}^{n} M_j \sum_{i \neq j} M_i.$$

If we assume the number of sequences per party to be the same, i.e. $M_j = M \ \forall j \in [n]$, then we can simplify the expression above and conclude that $N_{Yao} = M^2 n(n - 1)$. This means that the number of Yao GC executions is quadratic in the number of sequences per party ($O(M^2)$) and also in the number of parties ($O(n^2)$).
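The counting argument can be checked numerically with a short Python sketch. The function names are hypothetical; the code simply evaluates the per-party count $M_j \sum_{i \neq j} M_i$ and sums it over all parties, so each pair of sequences is counted once from each owner's side, as in the closed form $M^2 n(n - 1)$.

```python
def yao_executions_per_party(m: list[int], j: int) -> int:
    """Secure distance computations performed by party j (0-indexed).

    m[i] is the number of sequences owned by party i.
    """
    return m[j] * sum(m_i for i, m_i in enumerate(m) if i != j)


def total_yao_executions(m: list[int]) -> int:
    """Sum of the per-party counts over all parties."""
    return sum(yao_executions_per_party(m, j) for j in range(len(m)))


if __name__ == "__main__":
    sequences = [10, 10, 10]  # n = 3 parties with M = 10 sequences each
    print(yao_executions_per_party(sequences, 0))  # M^2 (n - 1)
    print(total_yao_executions(sequences))         # M^2 n (n - 1)
```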

2) OT EXECUTIONS
From $N_{Yao}$ we can deduce the number of OT executions. In the Yao GC protocol, we need to execute one OT for each of the evaluator's input wires. For a sequence with s nucleotides and using a two-bit representation of each nucleotide, the boolean circuit that computes the distance between two sequences has 2s input wires for each party's input. Therefore, each party executes the following number of OT executions (∀j):

$$N^j_{OT} = 2s \, N^j_{Yao} = 2s \, M_j \sum_{i \neq j} M_i.$$

It is important to note that $N^j_{OT}$ is independent of the size of the boolean circuit used, i.e. it is independent of the distance metric d used in the protocol. This is a consequence of using the Yao GC protocol, where the number of OT executions only depends on the input size. If we were using the GMW protocol [8] instead, the number of OT executions per party would depend on the size of the circuit.
As mentioned in section VIII-A, in case the number of oblivious keys generated is scarce compared to the number of OT required, we can use the HQOT protocol to generate the base OT used in the OT extension protocol. In this case, we just have to execute κ HQOT protocols per Yao execution, which for an equal number M of sequences per party gives: $L^j_{bOT} = N^j_{Yao} \cdot \kappa = \kappa M^2 (n - 1)$
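A small Python sketch makes these counts concrete. It assumes one OT per evaluator input wire and 2s input wires per sequence, so that the per-party OT count is 2s times the per-party number of Yao executions; the function names are hypothetical, and the example values (s = 32 000, M = 10, n = 3, κ = 128) are those of the use case in section X-C.

```python
def yao_executions_per_party(m: list[int], j: int) -> int:
    """Secure distance computations performed by party j (0-indexed)."""
    return m[j] * sum(m_i for i, m_i in enumerate(m) if i != j)


def ot_executions_per_party(s: int, m: list[int], j: int) -> int:
    """One OT per evaluator input wire: 2*s wires per Yao execution."""
    return 2 * s * yao_executions_per_party(m, j)


def base_ot_per_party(kappa: int, m: list[int], j: int) -> int:
    """kappa base OTs per Yao execution when HQOT only seeds OT extension."""
    return kappa * yao_executions_per_party(m, j)


if __name__ == "__main__":
    sequences = [10, 10, 10]
    print(ot_executions_per_party(32_000, sequences, 0))  # 12.8 x 10^6
    print(base_ot_per_party(128, sequences, 0))
```

The ratio between the two counts is 2s/κ, which does not depend on the number of parties or sequences.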

3) OBLIVIOUS KEYS
At this point, we can easily deduce the size of oblivious keys that each pair of parties have to generate when using messages of size l.
In case we use HQOT to generate the final Oblivious Transfer: Also, we can use the number of OT executions per party and the analysis from Table 2 and [74] to compute the computational and communication complexity (in bits) of HQOT: In case we use HQOT to generate the base OT, the total size of oblivious key required is:

4) QRNG
The QRNG has to generate twice the total length of oblivious keys, i.e. L QRNG = 2L ok .

Number of internal computations per party:
As discussed before, for every party $P_j$, each other party $P_t$ (t ≠ j) has to receive from $P_j$ the distances known by $P_j$ to which $P_t$ does not have access. So, $P_j$ has to send $M^2 (n - 2) + N^j_{int}$ distance values to $P_t$. Consequently, the length of the QKD key used to send these distances to $P_t$ is: for a 32-bit number representation. Therefore, the total size of key shared between two parties $P_j$ and $P_t$ must be:

B. OBLIVIOUS TRANSFER COMPARISON
To implement practical SMC protocols, we need to be able to execute OT at a rate of the order of millions of OT per second. To reach this rate, classical solutions make use of extension algorithms: they generate a small number κ of base OT (precomputation phase, as in HQOT) and extend them to m (κ ≪ m) real OT through symmetric cryptography [111] (oblivious transfer phase). Currently, the most efficient OT extension protocol in the semi-honest model is reported in [47] (ALSZ13), and in the malicious model it is reported in [94] (KOS15). In [74], the authors showed that the overall complexity of the transfer phase of ALSZ13 is higher than that of HQOT. Furthermore, they argued that the complexity of KOS15 is also higher than that of HQOT, but did not perform a complexity comparison between them. Here, we analyse the complexity of the KOS15 protocol, which is implemented in the Libscapi library, and we compare it with HQOT.

1) KOS15 AND HQOT COMPARISON
The KOS15 protocol is very similar to ALSZ13, with the addition of a check correlation phase. This phase ensures that the receiver is well behaved and does not cheat. The KOS15 protocol that generates m l-bit string OT out of κ base OT, with computational security given by κ and statistical security given by w, is shown in Figure 12. Note that in Figure 12 we join all the subprotocols presented in the original paper: COTe, ROT and DeROT, each with parameters κ and m. Also, they identify $\mathbb{Z}_2^{\kappa}$ with the finite field $\mathbb{Z}_{2^{\kappa}}$ and use ''·'' for multiplication in $\mathbb{Z}_{2^{\kappa}}$. For example, the element $t_j$ in $\sum_{j=1}^{m} t_j \cdot \chi_j$ (Figure 12, step 10) should be considered in $\mathbb{Z}_{2^{\kappa}}$.
Similarly to HQOT, the KOS15 starts with a precomputation phase that can be carried out before the actual computation of the OT protocols. However, in the HQOT, the precomputation phase is based on quantum technologies while the transfer phase is solely based on classical methods. Since it is not clear how to compare quantum and classical protocols, we only focus our comparison on the transfer phase of both protocols.
Note that in the original KOS15 paper [94] the computation of the pseudorandom generator G is carried out in the OT extension phase. However, these 3κ G computations can be executed during the precomputation phase because they do not depend on the input elements. As mentioned before, the additional steps that KOS15 adds to the ALSZ13 protocol are steps 9-11 (check correlation phase). Here, both parties start by calling a random oracle functionality $\mathcal{F}_{Rand}(\mathbb{F}_{2^{\kappa}}^{m})$ that provides them with equal random values. The receiver has to compute twice m κ-bit sums and m κ-bit multiplications, and sends 2κ bits (x and t) to the sender. Finally, the sender has to compute m κ-bit sums and m κ-bit multiplications. We consider the Karatsuba method for multiplication, with complexity $O(\kappa^{1.585})$, and schoolbook addition, with complexity $O(\kappa)$. Therefore, we consider that the sum of two κ-bit values takes κ bit operations and the multiplication takes $\kappa^{1.585}$ bit operations. [...] the check correlation phase. However, since this overhead is independent of m (number of OT executed), its effect is amortized for big m.
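The cost model of the check correlation phase can be turned into a rough estimate with the Python sketch below. It covers only the additions and multiplications listed above (not the rest of KOS15), reads "twice m κ-bit sums" as 2m additions, and uses hypothetical function names; the results are estimates under this model rather than the figures of Table 3.

```python
KARATSUBA_EXPONENT = 1.585  # exponent of the Karatsuba multiplication cost


def addition_cost(kappa: int) -> float:
    """Schoolbook addition of two kappa-bit values: kappa bit operations."""
    return float(kappa)


def multiplication_cost(kappa: int) -> float:
    """Karatsuba multiplication of two kappa-bit values."""
    return kappa ** KARATSUBA_EXPONENT


def check_correlation_bit_ops(m: int, kappa: int) -> tuple[float, float]:
    """Return (receiver, sender) bit operations for the check phase.

    Assumed counts: receiver does 2m additions and m multiplications,
    sender does m additions and m multiplications.
    """
    receiver = 2 * m * addition_cost(kappa) + m * multiplication_cost(kappa)
    sender = m * addition_cost(kappa) + m * multiplication_cost(kappa)
    return receiver, sender


if __name__ == "__main__":
    receiver, sender = check_correlation_bit_ops(m=12_800_000, kappa=128)
    print(f"receiver: {receiver:.3e} bit ops, sender: {sender:.3e} bit ops")
```

Under this model the computational cost grows linearly with m, while the 2κ bits sent by the receiver do not depend on m.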
So, the computational complexity of the transfer phase of the fastest reported malicious OT extension implementation [94] is higher than that of the corresponding HQOT phase, while their communication complexity is essentially the same. Therefore, by using the HQOT protocol, in principle we do not have to sacrifice efficiency for the sake of security.
However, in this comparison, we are not taking into account the infrastructure that is required in a real implementation to manage precomputed oblivious keys. As discussed further in section XI, a solution assisted with HQOT causes a time overhead when compared to a classical-only implementation mainly due to the oblivious key management system.

C. USE CASE
We now present the scenario used to test and compare both quantum-assisted and classical-only approaches. We start by exploring the complexity analysis and the OT comparison carried out in previous sections. We extend this analysis in the next section with a testbed implementation.
We consider a scenario where three parties (n = 3) each have M SARS-CoV-2 genome sequences (with length s = 32 000) and want to privately compute a phylogenetic tree from them. In the next section we consider a varying number of sequences, but, for now, we set M = 10. Following a standard choice [47], we consider garbled circuit keys with l = 128 bits, a computational security parameter of κ = 128 bits and a statistical security parameter of w = 64 bits. For these parameter values, we can instantiate the expressions deduced in the complexity analysis (section X-A). This information is summarized in Table 2. As expected, the total size of oblivious keys ($L^j_{ok}$) required for a scenario where HQOT is the main OT protocol is three orders of magnitude higher than in the case where HQOT serves as a base OT protocol in KOS15 ($L^j_{bok}$). Also, we note that the total size of symmetric keys required in the protocol ($L^j_{qkd}$) is much smaller than that of oblivious keys ($L^j_{ok}$ and $L^j_{bok}$), which suggests that its management should be less expensive than that of the oblivious keys. This will be discussed further in the next section.
We can also estimate the time required to generate the keys based on their size. If we consider state-of-the-art rates of 10 Mbit/s for both QKD and QOKD systems [112] and a rate of 240 Mbit/s for QRNG (ID Quantique QRNG PCIe cards [113]), we would need around 5 minutes for $L^j_{ok}$, 0.64 s for $L^j_{bok}$, 28 s for $L^j_{QRNG}$ and $1.9 \times 10^{-3}$ s for $L^j_{qkd}$. Note that we can significantly reduce the time of the precomputation phase if we integrate HQOT with the KOS15 OT extension protocol.
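The conversion from key size to generation time is a simple division, sketched below in Python. The function name is hypothetical, and the key size in the example is an illustrative round figure chosen to match "around 5 minutes" at 10 Mbit/s; it is not a value taken from Table 2.

```python
def generation_time_seconds(key_bits: float, rate_bits_per_second: float) -> float:
    """Time needed to generate key_bits at a constant generation rate."""
    if rate_bits_per_second <= 0:
        raise ValueError("rate must be positive")
    return key_bits / rate_bits_per_second


if __name__ == "__main__":
    QOKD_RATE = 10e6  # 10 Mbit/s, the rate assumed in the text
    # Hypothetical 3e9-bit oblivious key, used only as an illustration.
    seconds = generation_time_seconds(3e9, QOKD_RATE)
    print(f"{seconds:.0f} s = {seconds / 60:.1f} min")
```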
Finally, we compare the number of binary operations and bits sent by HQOT and the KOS15 OT extension. Considering the number of OT required for this use case to be $N^j_{OT} = 12.8 \times 10^6$ (Table 2), we get the results summarized in Table 3. Observe that KOS15 requires around four times (4.2) more binary operations than HQOT for this scenario. This points to the conclusion that HQOT has the potential to provide a faster transfer phase execution when compared to KOS15.

XI. PERFORMANCE EVALUATION
In this section, we set out to explore and compare the performance of two implementations of the proposed secure phylogenetic tree computation (A a d ): classical-only and quantum-assisted. The quantum-assisted system replaces the Libscapi base OT implementation (SimpleOT [95]) with the HQOT presented before (Figure 4). It also uses symmetric keys along with the One-Time Pad to encrypt distance values, as described in section VII. More specifically, we benchmark the duration of the main components of our implementation: circuit generation, communication, (internal) computation and SMC operation.
In this work, we do not assess the generation performance of both symmetric keys and oblivious keys. We precompute these keys using a simulator that mimics the structure of the quantum generated keys and we do not include their generation time in the performance analysis. The reason for this is twofold: performance in quantum cryptography is an active field of research with no clear way on how to be compared with classical approaches; quantum generation of both keys (symmetric and oblivious) can be precomputed without depending on the parties' inputs and used later as a resource in the execution of the system.

A. SETUP
We leverage a testbed on a virtual environment composed of three Ubuntu (64-bit) 16.04.3 Virtual Machines (VM) with 3GB of RAM. The virtual environment was created using VirtualBox and the VMs were running on a 2.6 GHz Intel Core i7 processor.
The performance of the implementation was measured on the VMs with the clock type CLOCK_REALTIME from the C++ library time. Although the values might differ for different host machines, this method is adequate for comparing a classical-only and a quantum-assisted system.
We follow the scenario presented in section X-C, where we have three parties (n = 3) owning at most ten sequences each (M ≤ 10) with 32 000 nucleotides. For the sake of comparison, we use the Jukes-Cantor phylogenetic distance along with the PHYLIP implementation of the UPGMA algorithm, i.e. (d, a) = (JC, UPGMA).

1) SEQUENCES PREPROCESSING
The 30 sequences used in this testbed were taken from the GISAID database [114], which collects SARS-CoV-2 genome sequences. These sequences were then aligned using the Clustal Omega API [115]. After alignment, the sequences (4-based) were translated to bits according to the following rule: A → 00, C → 01, G → 10 and T → 11. Note that this alignment procedure is not privacy-preserving and was only used for testing purposes. A privacy-preserving alignment can be easily executed if all parties agree on a public reference sequence and locally align their sequences against this reference.
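The translation rule can be sketched in a few lines of Python. The function name is hypothetical, and the sketch deliberately rejects any symbol other than A, C, G and T, since the handling of gaps or ambiguous bases is not covered by the rule above.

```python
# Two-bit translation rule: A -> 00, C -> 01, G -> 10, T -> 11.
NUCLEOTIDE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_sequence(sequence: str) -> str:
    """Translate an aligned nucleotide sequence into a string of bits."""
    bits = []
    for symbol in sequence.upper():
        if symbol not in NUCLEOTIDE_BITS:
            # Gaps and ambiguity codes are outside the stated rule.
            raise ValueError(f"unsupported symbol: {symbol!r}")
        bits.append(NUCLEOTIDE_BITS[symbol])
    return "".join(bits)


if __name__ == "__main__":
    print(encode_sequence("ACGT"))
```

A sequence of s nucleotides yields 2s bits, which matches the 2s input wires per sequence used in the complexity analysis.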

B. CIRCUIT GENERATION
As mentioned above, the CBMC-GC tool can generate a boolean circuit description of the phylogenetic distance from its corresponding ANSI-C code. In Table 4 we present the generation time of the Jukes-Cantor boolean circuit description for three different minimization time values (CBMC-GC parameter). We note that the generation of the circuit only has to be carried out once. From Table 4 we can see that increasing the minimization time above 100 s does not have a great impact on the minimization of either the number of gates or the circuit depth.

C. SYSTEM EXECUTION TIME
We start by recalling that the proposed secure algorithm is divided into the following parts:
1) Distance Matrix, DM:
a) Pairwise SMC computation of distances, SMC;
b) Pairwise internal computation of distances, IC;
c) Sending/Receiving other sequences, Com;
2) Phylogenetic computation, A.
We join the internal computation of distances and the PHYLIP phylogenetic computation into the same category and assess three different components for both classical and quantum runs: Communication (Com), SMC (SMC) and Computation (IC, A). In Tables 5 and 6 we show the proportion of each component. As expected, in both systems the pairwise SMC computation of distances represents the greatest portion, accounting for more than 95% of the time for all numbers of sequences. However, the weight of SMC in the quantum-assisted system is consistently higher than in the classical-only system. This can be explained by the fact that the quantum-assisted SMC takes longer than the classical-only SMC.
Figure 13 presents the average duration of both systems, with the standard deviation as error bars. Here we see that the quantum-assisted approach has a higher cost than the classical-only implementation. As discussed in section VIII-A, we can either use the HQOT protocol as the main OT in the Libscapi implementation or use it as a base OT in the KOS15 OT extension used by Libscapi. Since we have implemented the latter, our HQOT is competing against the SimpleOT [95] base OT implementation. As analysed by the authors (section 4 of [74]), the HQOT transfer phase is expected to outperform base OT implementations and to have comparable performance to OT extension protocols. However, these analyses only compared cryptographic and computational operations and did not take into account implementation constraints.
In the quantum-assisted implementation, we separate the precomputation phase (generation of symmetric and oblivious keys) from the secure computation phase of the proposed protocol, A a d . For this reason, it is necessary to develop a key management system to save and keep key synchronization between parties. Consequently, the key management system becomes the bottleneck as the number of sequences increases. In particular, the key management system of oblivious keys is responsible for most of the overhead (Figure 14).
There are two reasons why oblivious key management is more expensive than symmetric key management and is the main cause of overhead: the total size of oblivious keys used is three orders of magnitude higher than that of symmetric keys (compare L j qkd and L j bok from Table 2); and oblivious keys are read from files in persistent storage (slower access), whereas symmetric keys are loaded into RAM (faster access). The main reason for managing oblivious keys through a file system is that it allows us to use the Libscapi implementation of the Yao protocol in a modular way, i.e. we only have to change the type of base OT used by the Libscapi implementation without tailoring any other module.
As the management of files is time-sensitive to their size, the proportion of time spent on the overhead due to the oblivious key management system (OKM) increases with the number of shared keys per party. This can be confirmed by Figure 15 which shows the proportion of time spent by the oblivious key management system in the difference between the quantum-assisted and the classical-only system.
Future work is required to develop more efficient oblivious key management systems. Despite this difference, we stress that the quantum-assisted system has a significantly higher degree of security against quantum computer attacks.

XII. CONCLUSION
In this work, we presented a Secure Multiparty Computation protocol assisted with quantum technologies tailored to distance-based algorithms of phylogenetic trees. It is a modular protocol that uses one distance metric taken from four possible evolutionary models (Jukes-Cantor, Kimura 2-parameter, F84 and LogDet) and three different protocols (UPGMA, Neighbour-Joining and Fitch-Margoliash). In total, we can implement twelve different combinations of protocols.
The proposed system is based on ready to use libraries (CBMC-GC, Libscapi and PHYLIP) that are integrated with quantum technologies to provide a full quantum-proof solution. We use the quantum version of primitives that play a central role in the security of the system: oblivious transfer, encryption and random number generation.
We compare the performance of a classical-only and a quantum-assisted system based on simulated symmetric and oblivious keys. Previous analyses on the computation and communication complexity point to a scenario where the quantum-assisted version does not add an extra efficiency cost. This is confirmed by comparing the running times of both approaches without considering the overhead created by the oblivious key management system that increases with the number of shared keys. Further work is required to develop more efficient key management systems. Despite this extra cost, the quantum-assisted version significantly improves the system security when compared with the classical-only as it renders a protocol with enhanced security against Quantum Computers.