Super Resolution Phase Retrieval for Sparse Signals

In a variety of fields, in particular those involving imaging and optics, we often measure signals whose phase is missing or has been irremediably distorted. Phase retrieval attempts to recover the phase information of a signal from the magnitude of its Fourier transform to enable the reconstruction of the original signal. Solving the phase retrieval problem is equivalent to recovering a signal from its auto-correlation function. In this paper, we assume the original signal to be sparse; this is a natural assumption in many applications, such as X-ray crystallography, speckle imaging and blind channel estimation. We propose an algorithm that resolves the phase retrieval problem in three stages: i) we leverage the finite rate of innovation sampling theory to super-resolve the auto-correlation function from a limited number of samples, ii) we design a greedy algorithm that identifies the locations of a sparse solution given the super-resolved auto-correlation function, iii) we recover the amplitudes of the atoms given their locations and the measured auto-correlation function. Unlike traditional approaches that recover a discrete approximation of the underlying signal, our algorithm estimates the signal on a continuous domain, which makes it the first of its kind. Along with the algorithm, we derive its performance bound with a theoretical analysis and propose a set of enhancements to improve its computational complexity and noise resilience. Finally, we demonstrate the benefits of the proposed method via a comparison against Charge Flipping, a notable algorithm in crystallography.


I. INTRODUCTION
Imagine that instead of hearing a song you can only see the absolute value of its Fourier transform (FT) on a graphic equalizer. Can you recover the song from just this visual information? The general answer is "No" as there exist infinitely many signals that fit the curve displayed by the equalizer. However, if we have additional information (or priors) about the song, we may be able to recover it successfully. The reconstruction process is the subject of this paper and is generally known as phase retrieval (PR).
Beside this day-to-day example, PR is of great interest for many real-world scenarios, where it is easier to measure the FT of a signal instead of the signal itself. During the measurement process, it may happen that the phase of the FT is lost or distorted. Phase loss occurs in many scientific disciplines, G. Baechler, M. Kreković and M. Vetterli are with EPFL, Switzerland, J. Ranieri is with Google, Switzerland, A. Chebira is with CSEM, Switzerland, and Y. M. Lu is with Harvard University, USA.
GB, MK, and JR had equal contributions and should be considered first authors. The work was divided as follows: JR, AC, YL and MV designed research, JR devised the support recovery algorithm and its performance bound, GB and MK proposed algorithmic improvements and carried out experiments, and GB, MK and JR wrote the manuscript. particularly those involving optics and communications; a few examples follow.
• X-ray crystallography: we measure the diffraction pattern of a crystallized molecule-that is the magnitude of its FT-and we would like to recover the structure of the molecule itself [1]. • Speckle imaging in astronomy: we measure many images of an astronomic subject and the phase of the images is compromised by the atmospheric distortion. We would like to recover the subject without the resolution downgrade imposed by the atmosphere [2]. • Blind channel estimation of multi-path communication channels: we measure samples of the channel output without knowing the input. We would like to estimate the impulse response of the channel to optimize its capacity [3].

A. Previous work
The field of phase retrieval was born along with X-ray crystallography, when the first structures were solved with trial-and-error methods leveraging crystal symmetries. These initial attempts prepared the ground for more systematic approaches, a first example of which was proposed by Patterson in 1935 [4]. This method is based on locating the peaks of the Patterson function-the auto-correlation function of the electron density-to determine pairwise differences between the locations of the atoms constituting a molecule.
In the 1950s, a rich family of approaches exploiting the unique relationships between intensities and phases of measured diffraction patterns was developed, e.g. Cochran [5], Sayre [6], Karle [7]. These methods operate in the Fourier space and are known as direct methods because they seek to solve the phase problem directly based on the observed intensities.
We would also like to emphasize the relevance of dualspace algorithms, where both spatial and Fourier domains play a fundamental role in reconstructing the signal. While the origin of these methods dates back to 1972 with the work of Gerchberg and Saxton [8], a lot of interest was recently sparked by the introduction of Charge Flipping [9], [10].
This short literature review of phase retrieval algorithms in X-ray crystallography is focused on ab initio methods, that attempt to solve the phase problem with zero or very little prior information about the structure we are trying to infer. Hence, ab initio methods are considered very challenging, given the minimal amount of information they have access to. Successful methods hinge on the design of an abstract data structure that reduces the degrees of freedom of the desired signal and simplifies its reconstruction. For example, direct arXiv:1808.01961v1 [cs.IT] 6 Aug 2018 methods exploit statistical relationships between the phases to reduce the number of unknowns, while Charge Flipping considers a discretization of the electron density.
In this paper, we focus on the PR problem on sparse signals. The sparsity assumption is legitimate and encountered in many applications; for example atoms in crystallography form a sparse structure. We consider the most compact structure one can imagine for a sparse signal: a set of K atoms defined by their locations x k and their amplitudes c k , where f s (x) = K k=1 c k δ(x − x k ) represents the structure, x is a spatial variable defined over R D (with D being the dimensionality of the signal), φ(x) is the scattering function induced by one atom and * is the convolution operator.
Even if the advantage of the compact model defined in (1) looks appealing, the associated algorithmic challenges are often overwhelming. Computer scientists attempted to design a scalable (i.e. with a computational complexity that is polynomial in the number of atoms K) and stable to noise algorithm that could solve all possible instances of this problem without much success; to date, it is not even clear that such an algorithm would exist [11]. In other words, we encounter a nontrivial trade-off between the compactness of such data structures (i.e. the number of unknown variables) and the ease of solving the PR problem using them. For example, Charge Flipping easily solves many PR problems in X-ray crystallography, but it is based on a discrete spatial structure, which is not the most compact representation.
Recently, we observed the emergence of new PR algorithms leveraging the notion of sparsity while assuming a discrete spatial domain. Two notable examples are GrEedy Sparse PhAse Retrieval (GESPAR) [12], based on the 2-opt algorithm [13], and Two-stage Sparse Phase Retrieval (TSPR) [14], where the support is recovered by solving the discrete turnpike problem [15], [16]. Both algorithms differ from our approach in that their models are discrete and the locations are bound to a discrete grid. Even though it was not designed with continuous setups in mind, TSPR can theoretically recover locations on a continuous domain. However, while it handles noise on the measured coefficients, it does not tolerate noise in the support, which makes it impractical for continuous setups.
The major benefit of having a continuous parametric model is that it enables estimation of the locations and amplitudes avoiding any discretization. In such a case, the achievable resolution is theoretically infinite and only limited by the noise corrupting the measurements. This is what we call super resolution phase retrieval.

B. Main contributions and outline
We propose a three-stage framework that precisely determines a sparse signal from the absolute value of its FT. In Section II, we formalize the problem and describe the typical PR measurement pipeline. In Section III, we give a highlevel overview of our modular approach, discuss the main challenges and introduce a few relevant properties. Algorithms to solve these different modules are proposed in Section IV.
We then describe the details of the proposed method to recover the support, which constitutes the critical element of the pipeline: its complexity analysis can be found in Section V, together with a method to reduce its computational cost, while Section VI identifies a theoretical bound (confirmed by numerical simulations) to successfully recover the signal support in a noisy regime. Then in Section VII, we propose a few improvements and variations of the algorithm to make it more robust to noise. In Section VIII, we discuss the influence of the support configuration on the resulting reconstruction. Finally, Section IX compares our PR pipeline with the stateof-the-art.
Throughout this paper, we use bold lower case letters for vectors and bold upper case letters for matrices. Upper case calligraphic letters denote sets, e.g. X . Furthermore, X represents a set with noisy elements and X an estimated set. Subscripts are reserved for indexing elements in lists and vectors. In the primal domain, continuous functions are written in lower case letters and indexed with x, e.g. f (x) and discrete functions are indexed with n, e.g. f n . In the Fourier domain, we use capital letters, that is F (ω) and F m , for continuous and discrete functions, respectively.
II. PROBLEM STATEMENT We consider the FT of the signal defined in (1), where ω is the frequency variable and Φ(ω) is the FT of the known kernel φ(x).
In practice, it is impossible to measure the whole FT (2), hence we sample it. Furthermore, due to limitations of the measurement setup, we are usually only able to measure the absolute values of such samples, that we denote |F m |, where F m = F (mΩ), m = Z D and Ω is the sampling frequency. As previously mentioned, the PR problem has infinite solutions without a priori knowledge of the signal f (x), since we can assign any phase to the measurements and obtain a plausible reconstruction. The role of structures, such as (1), is to constrain the PR problem to a correct and possibly unique solution. Under the sparsity assumption, retrieving the phase is equivalent to retrieving the locations and amplitudes of f (x).
The auto-correlation function (ACF) a(x) of f (x) is given by the inverse FT of |F (ω)| 2 : where F −1 is the inverse FT operator [17]. Interestingly, the ACF structure is completely inherited from the signal (1): ACF sampling scattering Fig. 1. Typical PR measurement pipeline: the signal of interest f s (x) generates the auto-correlation function a s (x), which is first filtered by the scattering function ψ(x) (here an ideal lowpass filter) to yield a(x) and then sampled, resulting in an. Note that the spatial samples an can be obtained via the inverse discrete FT of the Fourier samples Am, when the periodicity in the two domains holds. Darker colors represent higher intensities.
where the kernel ψ(x) is the ACF of φ(x) and a s (x) is the ACF of the sparse structure of the train of Diracs f s (x).
Equivalently, in the Fourier domain, we have The PR acquisition pipeline can be summarized as the filtering of the ACF a s (x) followed by sampling, where the filtering represents the scattering operation, as illustrated in Fig. 1. We now have all the ingredients to state the core problem of this paper.
Problem 1: Given Fourier samples A m = A(mΩ) of the sparse ACF defined in (4), recover the support X = {x k } K k=1 and amplitudes {c k } K k=1 determining the signal f (x). Note that the information we are interested in is hidden behind two walls: the convolution with the kernel ψ(x) that spatially blurs the sparse structure of the ACF and the phase loss of the original sparse signal, f s (x), that usually characterizes any PR problem.

III. A THREE-STAGE APPROACH
We propose to solve Problem 1 in three distinct stages: i) reconstruct the continuous ACF a(x) from a set of its discrete Fourier coefficients, ii) estimate the support X of f (x) given a(x), and iii) estimate its amplitudes {c k } K k=1 . The first step is a classical sampling problem where we would like to fully characterize a continuous signal from a set of discrete measurements. Problem 1.A (Sparse ACF super resolution): Given samples A m of the sparse ACF as defined in (4), recover its continuous version A(ω).
The most well-known sampling result is due to Nyquist-Shannon-Kotelnikov and guarantees perfect recovery for signals that lie in the subspace of bandlimited functions, provided that the sampling rate is high enough.
In our case, f (x) and a(x) are obviously not bandlimited, but we assume that such signals are sparse, as in (1). Sparsity has two antagonistic effects on PR: it makes the problem combinatorial and hence hard to solve, but at the same time enables a divide-and-conquer approach, in which we first recover the support X and then the amplitudes of f (x). We argue that the support contains more information than the amplitudes, hence we choose to estimate it first. As an example, if all the atoms have the same amplitude, then only the support is useful to recover the original signal. On the other hand, if all the atoms have the same location, the problem is trivially solvable. Problem 1.B (Support recovery): Assume we are given the complete set of unlabeled differences D = {d k, } k, = {x k − x } k, , recover the support X of the sparse signal f (x).
In most real-world scenarios, the unlabeled differences of Problem 1.B are corrupted by noise. Hence, we assume additive Gaussian noise affecting d k, , where ν k, ∼ N (0, σI). Furthermore, we denote the set of measured differences as D = { d k, } k, . For simplicity of notation, we convert the pairs of indices (k, ) to n ∈ {1, . . . , N }, where N = K 2 − K + 1, and order them such that d 1 ≤ d 2 ≤ . . . ≤ d N . We do not assume any ordering on the elements of X .
In what follows, we state a few interesting observations related to Problem 1.B. First, when we measure a set of differences, some information is inevitably lost.
Observation 1: A set of points can be reconstructed from their pairwise differences, even when labeled, only up to shifts and reflections.
To show that, we first translate and reflect the set of points X as X = −X +x, where we overload the arithmetic operators on sets to transform each point as x k = −x k +x. Then, the set of differences of the transformed points is equivalent to the original one, where the natural symmetry of D compensates for the negative sign.
Second, while excluding shifts and reflections does not lead to a unique solution in general, we can still prove uniqueness under certain assumptions.
Observation 2: Assume that the points x k are drawn independently at random from a sufficiently smooth distribution, then the solution is unique [18].
Third, we briefly discuss the occurrence of collisions in the ACF. We say that there is a collision in the ACF when two different pairs of distinct points from X map to the same difference in D. Since we consider a continuous domain for the support, it natively prevents the appearance of collisions. 1 Observation 3: If the locations of the points are independently drawn uniformly from a finite interval, then collisions in the ACF occur with probability zero.
Last, we note that the set of differences D contains many valid solutions. In particular, we can construct two solutions from every element of X ; this is a direct consequence of Observation 1.
Observation 4: The set of differences D is a superset of 2K valid solutions X to Problem 1.B and such solutions always contain the point zero, that is 0 ∈ X .
To verify this, we pick an element of the support, e.g. x , and build the following tentative solution, Then, we notice that i) X is a valid solution with the shift fixed as −x , ii) X ∈ D and iii) we have a solution for every element of X . Moreover, we can exploit the symmetry of the ACF to reach the aforementioned 2K solutions. Such an observation can be extended to the noisy case, assuming we allow the solution to be noisy as well. This property is essential to the algorithm for support recovery proposed in the next section.
Once the support X of the solution has been retrieved, it remains to find the amplitudes {c k } K k=1 of the signal f (x). Problem 1.C (Amplitude recovery): Given an ACF a(x) as defined in (4) together with the estimated support X of f (x), find the amplitudes {c k } K k=1 .

IV. ALGORITHMS
In this section, we lay down our solutions to Problems 1.A, 1.B and 1.C, effectively providing an end-to-end framework to solve the sparse PR problem.

A. ACF super resolution
When we look at (4), we notice that a(x) is completely defined by the locations x k −x and the amplitudes c k c . Hence, we can recast Problem 1.A as a parameter estimation problem given the measured samples A m of the FT of the ACF. An effective existing approach is known as finite rate of innovation (FRI) sampling [19], [20]. FRI-based methods are inspired by spectral analysis techniques to estimate the locations x k − x ; in what follows, we review their fundamentals for the sake of completeness. In this subsection, we restrict ourselves to the 1dimensional case for clarity, even though our implementation is generalized to higher dimensions (see [21] for more details).
The essential ingredient in FRI is to represent the signal of interest as a weighted sum of complex exponentials in the following form: This formulation has several similarities with (5); to see this, we define t n = x k − x , substitute α n = c k c and u n = exp(−jΩt n ) and rewrite the sampled ACF A m as follows, We remark that |Φ(mΩ)| 2 does not allow us to express (9) as a sum of complex exponentials yet. However, if we assume that the kernel function φ(x) is an ideal low-pass filter 2 , i.e. a sinc function, its FT becomes a box function. Thus, we can ignore such a kernel for some neighborhood of m around zero, since Φ(mΩ) = 1 for |mΩ| smaller than the bandwidth of the signal.
The locations {d n } N n=1 are fully determined by the exponentials {u n } N n=1 , that is d n = ∠un Ω , with ∠u n being the phase of u n . Recovering u n from (8) is a classical spectral estimation technique and a possible solution is provided by Prony's method [25], [26]. The idea is to identify a filter H m to annihilate A m , which is mathematically defined as Then, the filter H can be estimated by rewriting and solving (10) as a Toeplitz system. As shown in [19], if A m has the form of (8), then the z-transform of H m is where u n are nothing else but the roots of H(z). Our situation differs from usual FRI applications in the sense that the locations of the ACF describe a symmetric structure. As a consequence, all roots u n come in conjugate pairs (except for the one corresponding to the zero location).
Once the locations are known, the amplitudes α n are found by injecting u n in (9) and solving a linear system of equations.

B. Support recovery
For the recovery of the support, we propose a novel greedy algorithm that is initialized with a partial solution X 2 , which contains two locations. At a given iteration k, we generate a partial solution X k+1 composed of k + 1 locations, hence the algorithm has a total of K − 2 iterations indexed from 2 to K − 1.
1) Initialization: From Observation 4, we know that the solution set X is contained in D and 0 ∈ X ; this gives us the first point of the solution, that is x 1 = 0. Next, we identify the element d N in D with the largest norm, so that we maximize the noise resilience of our algorithm. Indeed, assuming that the locations are corrupted by identically distributed noise, picking the largest norm ensures the maximal SNR of our initial solution. Note that the value d N is the noisy difference between two unknown locations of f (x); without loss of generality, we call them x 1 and x 2 . The elements x 1 = 0 Algorithm 1 Support recovery ordered by their norms Output: A set of K points X such that their pairwise differences generate D Reffering again to Observation 4, we know that the set of differences D contains the rest of the points {x k − x 1 + ν k,1 } K k=3 , that should belong to the final solution X = X K . Furthermore, since we do not want to duplicate points in X k , we initialize a set of possible elements of the solution Due to noise, the vector 0 is not in D, so we remove the closest element d 1 .
2) Main algorithm: At each step k, we identify the element in P k that, when added to the partial solution X k , minimizes the error with respect to the measured set of differences D. More precisely, at every iteration k we solve the following optimization problem, Intuitively speaking, we would like to identify the element x k+1 ∈ P k such that the set of pairwise differences of the points in X k+1 = X k ∪ x k+1 is the closest to a subset of the measured D. The main challenge is that we do not know the correct labeling between these two sets. In the noiseless case, we are looking for a set X k+1 , whose pairwise differences form exactly a subset of D. Hence, we can solve the labeling by matching identical elements. In the noisy case, we cannot leverage the definition of a subset. Therefore, we loosen the equality between elements and determine the labeling by searching for the differences in D that are closest in 2 -norm to the pairwise differences of the elements in X k+1 . This procedure is summarized in Algorithm 1 and its application on the ACF a s (x) from Fig. 1 is illustrated in Fig. 2.

C. Amplitude recovery
If we assume that collisions can occur, recovering the amplitudes with a given ACF and support is equivalent to solving a system of quadratic equations. However, if there are no collisions, we suggest a simple but efficient algebraic solution to Problem 1.C, inspired from [18]. Our new approach  Fig. 1. We start by setting x 1 = 0 and identifying x 2 , the point with the largest norm. Points x 3 to x 5 are then selected in a greedy way according to (12). The solution coincides with the initial signal f s (x) displayed in Fig. 1. is different in that it avoids a matrix inversion step and hence, it is both faster and more robust to noise.
Let c = [c 1 , c 2 , . . . , c K ] be a vector made of the amplitudes to be recovered. If we define a matrix C = cc , all the elements outside of the diagonal of such a matrix are the amplitudes of the measured ACF, that is C i,j = c i c j . Notice that we cannot observe the diagonal entries C i,i = c 2 i,i as we just have access to their sum a s 0 = i c 2 i,i , which is the value of the ACF at 0. This is unfortunate since they are precisely the values we are interested in, up to a squaring operator.
We recast Problem 1.C as a matrix completion problem, where we would like to estimate the diagonal entries C i,i under the constraint of C being a rank-one matrix. The first step of our proposed method is to introduce a matrix L such that where i = log(c i ). The sum of the ith row of L is given by where the term j j does not vary between rows. Hence, its value can be obtained from summing all the entries in L, Then, we recover the vector = [ 1 , 2 , . . . , K ] for K > 2 as where 1 is the all-ones vector. 3 Finally, it suffices to compute c i = exp( i ) to retrieve the amplitudes.
Note that this solution assumes that C is symmetric; this might not be the case in a noisy setup, but we enforce it by replacing C with 1 2 (C + C ). In case of collisions, the problem does not have an algebraic solution and a possible convex relaxation is provided in [14]. In practice, this is often not a concern due to Observation 3.
In what follows, we study and propose improvements to the performance of our PR algorithm, focusing our attention on the support recovery step, i.e. Algorithm 1. In fact, the first step-the super-resolution with FRI-is well represented in literature, where theoretical analyses, extensive simulations in noisy scenarios and efficient denoising schemes have been proposed [21], [27], [28]. On the other hand, the amplitude recovery, while being novel, only consists of simple algebraic manipulations that are not computationally costly.
V. COMPLEXITY ANALYSIS Algorithm 1 has K rounds. In each of these rounds, we go through all points in the existing solution set X k , and for each point we compute the difference with all the values in D. Since there are O(K) points in X k and O(K 2 ) elements in D, this is done in O(K 3 ) operations. Furthermore, for each of these computed differences, we need to find the closest element in D, which requires additional O(K 2 ) comparisons. In total, the complexity of our algorithm is O(K 6 ). Even though this is high and limits the field of application to reasonable sizes, it compares favorably to an exhaustive search strategy, which grows exponentially with K.
It is possible to trade time complexity for storage complexity. Indeed, we observe that we compute at each round the following values for every point x i ∈ X k and candidate p j ∈ P k . However, since we are just moving one element from P k to X k+1 at each iteration, we propose to cache the values (17) in a lookup table to reduce the total computational cost. By doing so, we only need to update each d i,j when the corresponding candidate p j is removed from P k to be added to X k+1 . The theoretical complexity when caching d i,j is not trivial to analyze, but in practice we notice a significant improvement, as illustrated in Fig. 3.

VI. PERFORMANCE ANALYSIS
In what follows, we study the expected performance of Algorithm 1 in the presence of noise. More precisely, we estimate the expected mean squared error (MSE) of the support recovery algorithm when the correct solution is obtained in Section VI-A. Then, we approximate the probability of the algorithm to find such a correct solution in Section VI-B. We consider the problem with D = 1 dimension to lighten notation and simplify the discussion. However, all the results can be easily generalized to the multidimensional setup introduced in Problem 1.B.

A. Expected performance
After K − 2 iterations and if the algorithm successfully finds any correct solution as defined in (7), then this solution will be noisy as it is constructed by selecting noisy elements from D, see (6). If we assume Gaussian noise affecting the support, then the MSE of the support recovery solution can be computed as where Q K ∼ χ 2 K follows a chi-squared distribution with K degrees of freedom. Therefore, the expected value of the MSE of any correct solution is

B. Probability of success
We model the probability that Algorithm 1 finds the correct solution as a function of the noise variance σ 2 and the number of elements K to characterize its performance. Such a probability can be factored as K − 2 iterations where P (σ, K) is the probability of success of the support recovery algorithm and P k (σ, K) is the conditional probability of success at iteration k, given that the algorithm was correct until iteration k − 1.
We focus our attention on what happens at iteration k, i.e. we study the probability P k (σ, K). First, we split the set of possible elements of the solutions P k as the union of two disjoint sets: C k containing the elements that when picked by the algorithm generate a correct partial solution at iteration k, and W containing the elements that when picked corrupt the partial solution. Second, we generalize the cost function used in the main optimization problem (12) to a generic set of 1D elements A as, Below, we use g(A, X k ) with both sets and single elements as arguments: in other words, the expression g(a, X k ) is interpreted as g({a}, X k ). Then, we compute the probability that the support recovery algorithm picks an element from C k instead of an element from W, when searching for the solution of (12). This happens if the cost of C k is smaller than the one of W measured via (21), We assume that the events g(c, X k ) ≥ g(W, X k ) are independent for all c ∈ C k and obtain This is a crude simplification, but it enables us to compute an approximation of P k (σ, K) that will not impair the quality of the end result, as we will demonstrate later. With a similar development as (22), we can write Again, we approximate P k (σ, K) assuming the independence of the events g(w, X k ) as Then, we focus our attention on the term P g(c, X k ) g(w, X k ) < 1 . First, we compute the cost of adding an element c from C k to X k+1 , where each x ∈ X k is a shifted noisy version of an element of X . Following a similar reasoning, the newly added element c can be expressed as c = x k+1 − x 1 + ν k+1,1 . By substituting this expression into (25), we further obtain where Q (1) k ∼ χ 2 k , and in (a) we approximate g(c, X k ) by selecting the difference d = x k+1 − x + ν k+1, . We select this specific d as it is likely to be picked, provided that the noise variance σ 2 is small compared to the values x i . This also significantly simplifies (26) by dropping the random variables x 1 , x k+1 , and x .
Second, we analyze the cost of making an error g(w, X k ) at iteration k-that is selecting any element w ∈ W given X k : We express the minimum in (27) as an exhaustive check of all the possible selections of k differences from D. To do so, we define M k as the set containing all the k-permutations of D, and rewrite the probability of selecting a correct location c instead of a wrong one w for any given c and w from (24) as follows, Here, we introduced e(w, π, X k ) as the cost for a given permutation π, where the elements in π and X k are indexed with i. Once more, we assume that all these selections are independent to obtain Finally, we discuss the probabilistic aspects of (28). The terms ω, x i and π i are each made of the difference between two points plus a noise value. Indeed, they have the form for some specific indices i and j. Assuming that the points in X are uniformly distributed between −0.5 and 0.5, and the noise is Gaussian with zero mean and variance σ 2 , we can approximate (28) as where Q 1 2 ). In (a), we approximate the sum by assuming independence between all the random variables and in (b) we approximate the sum of six random variables uniformly distributed on [−0.5, 0.5] with a normal random variable with variance σ 2 = 1 2 . We now have all the ingredients to compute the probability of success at iteration k (24), as where F(x, k 1 , k 2 ) is the cumulative distribution function of an F-distribution with parameters k 1 and k 2 ; it can be calculated using regularized incomplete beta functions. Last, we determine the size of the sets as Note that as the number of points K increases, these exponents grow faster and push any probability that is not 1 to 0; hence, we expect a steep phase transition. Along the path of our analysis, we made a few rough assumptions that we cannot theoretically justify regarding the independence of events, e.g. in (23), (24) and (29). While we would like to be more rigorous, we provide below numerical evidence that such assumptions hold in practice as the algorithm's performance exhibits a phase transition matching closely the derived theoretical bound (31).

C. Numerical simulations
We define the index-based error as a binary metric that is equal to 0 if the solution set X is of the form (7), and 1 otherwise. This error can be used to empirically measure the probability of success of Algorithm 1: we approximate it by running several experiments with different levels of noise σ and numbers of points K. In Fig. 4, we report the results of such an experiment and compare it with our theoretical result obtained in (31). We confirm that P (σ, K) exhibits a sharp phase transition-we can identify pairs (K, σ) for which the algorithm always succeeds and pairs for which it always fails. However, the empirical phase transition is less sharp than the theoretical one and this is probably due to our approximations regarding the independence of events. Nonetheless, the two phase transitions are closely aligned for each value of K.
In the following, we develop some intuition that may explain why these approximations appear to be so tight. We claim that, even though not all events are pairwise independent, most of them are. As an example, when we look at for two different values p 1 and p 2 of p, the respective cost functions g(p 1 , X k ) and g(p 2 , X k ) probably share a few common differences d. However, we observe that at round k, only k out of K 2 − 2K + 1 differences are selected, one for every x ∈ X k . Then, assuming that most pairs g(p 1 , X k ), g(p 2 , X k ) are independent is practically equivalent to assume that we select the differences d uniformly at random within the minimization. Moreover, we believe that the few dependent events ignored by such assumptions are one of the likely causes of the different steepness exhibited by the theoretical and observed phase transition.

VII. IMPROVING NOISE RESILIENCE
We now discuss strategies and variations of our support recovery algorithm aiming at improving the quality of the solution in noisy settings. We chose not to include them in the analysis as they make it more intricate.

A. Deleting solutions from the set of differences
When a new point x k+1 is added to X k , Algorithm 1 ignores some useful information. Assuming that there are no collisions and no noise, we know that the values X k − x k+1 and x k+1 − X k in D cannot belong to the solution X as they are part of W. Thus, as soon as x k+1 is added to the solution set, we can remove all values of the form X k − x k+1 and x k+1 − X k from D.
The same reasoning applies to the noisy case, but we pick the closest values in D as we do not have exact differences. More formally, when we add a new point x k+1 to the solution X k , we dispose of the following 2k elements of D, This approach results in two opposing effects. On one hand, we introduce the risk of erroneously discarding a point d * that belongs to the solution. On the other hand, we are pruning many elements out of D and naturally reduce the risk of picking an erroneous candidate later on in the recovery process. As we will show in Section VII-D, the benefits outweight the risks.

B. Symmetric cost function
Next, we replace the cost function (12) with a symmetric one to leverage the natural symmetry of the ACF.
In Algorithm 1, we search for the vectors in D closest to the computed differences p − X k for each candidate p. We strengthen its noise resilience by jointly searching for the vectors closest to ∓ X k ± p and choosing the candidate p that minimizes the sum of both errors. Specifically, we rewrite the cost function (12) as (34) We stress that this improvement is compatible with the idea of caching introduced in Section V. We can in fact cache the following pairs for each x i ∈ X k and p j ∈ P k and recompute them when p j gets added to the solution X k+1 .

C. Denoising of partial solutions
At each iteration k of Algorithm 1, we have a partial solution X k+1 and, from (12), we identify for each pair x i , x j ∈ X k+1 a difference d i,j that is the closest to x i − x j . In other words, we are simultaneously labeling the differences d i,j using our current partial solution; such a labeling is completed as k reaches the final iteration K − 1.
This partial labeling can be exploited to denoise the set X k+1 as it provides unused additional constraints and mitigates the error propagation between the iterations. More precisely, we propose to find a set of points { x i } k+1 i=1 that minimizes the following cost function The solution to (36) is derived in closed-form by setting its first derivative to 0. As it is based on a least-square-error criterion, it is then optimal when the differences are corrupted by additive Gaussian noise. This leads to a simple and effective strategy: refine the estimate of the solution set at each step by taking the average of the differences related to each point x i ∈ X k+1 as where we recompute all x i as they are used in the k + 1 iteration. To see why this works, we separate the sum as We observe that − 1 k+1 k+1 j=1 x j is the same translation for all points x i . The consequence of this approach is that the total noise is reduced as we average its different realizations over k +1 values. Note that since Algorithm 1 assumes that x 1 = 0 in X k , we also translate back all the points by − x 1 after each update.
Unfortunately, the idea of caching the differences introduced in Section V is not compatible with the denoising of the partial solutions. As at each step we modify the partial solution set X k , the differences between X k and D change accordingly, which makes it impossible to cache them. Hence, there exists a hard trade-off between quality and complexity, and we should pick the right strategy depending on the requirements of each specific practical scenario.

D. Comparison of improvement strategies
Last, we evaluate the significance of our proposed improvements on Algorithm 1. We quantify the results using the indexbased error introduced in Section VI, as well as the 2 error, which we define as the 2 -norm of the difference between the underlying points X and their estimation X . 4 Fig. 6. Influence of the points' locations on the estimation errors. We solve a 1D instance of the problem with K = 4, x 1 = 0, and x 2 = 1. The locations x 3 and x 4 vary along the xand y-axis.
The comparison of the different improvement strategies is illustrated in Fig. 5. In this experiment, we draw K = 6 onedimensional points uniformly at random from the interval [0, 1] and add Gaussian noise N (0, σ 2 ) on their pairwise differences. We run Algorithm 1 and the proposed improvements for different noise levels σ. It is clear that all the proposed strategies enhance the original algorithm, with respect to both the index-based error and the 2 error.
Moreover, we also observe that different strategies combine constructively: for example, the symmetric cost function decreases the 2 error by 5% on average, while deleting solutions from the set of differences improves the results by 27% on average. When combined together, the average error decreases by 59%. Including the denoising further enhances the algorithm, as the average error decreases by 62%. Similarly, for the index-based error there is an evident shift between the phase transitions of the original algorithm with and without improvements.

VIII. INFLUENCE OF THE POINTS LOCATIONS
The algorithm performance around the phase transition in Fig. 4 also seems to indicate that some configurations of points are easier to recover than others. In this section, we run a small experiment to visualize the challenges posed by certain configurations.
We consider a low-complexity setup (K = 4, D = 1), fix the support boundaries, that is x 1 = 0 and x 2 = 1, and study the reconstruction error for various pairs (x 3 , x 4 ) ∈ [0, 1] 2 . We generate several instances of this problem and perturb the differences in D with additive Gaussian noise with zero mean and σ = 0.01. We measure the performance of Algorithm 1 (with all the improvements introduced in Section VII) using both the index-based and the 2 error. The average errors are then shown in Fig. 6, where we observe that there exist some combinations of points that lead to a significantly higher error.
We now develop intuition about a few interesting cases that emerged from the previous experiment. For the sake of simplicity, we consider a noiseless setting where collisions in the ACF or non-uniqueness of the solution are the only causes of challenging configurations. 1) Collision between a difference and a point. When a difference and a point collide, it can happen that the difference is mistaken for the point. This does not influence the 2 error, but causes an index-based error. An example of such a case is when x 3 = x 4 (the main diagonal in Fig. 6): both the difference x 4 − x 3 and x 1 have value 0. As a consequence, the sets X = {x 1 , x 2 , x 3 , x 4 } and X = {x 4 − x 3 , x 2 , x 3 , x 4 } are both equal to X = {0, 1, x 3 , x 3 }, but the latter is not of the form (7). 2) Constant difference 0.5. When x 4 = x 3 ± 0.5, we can actually find more than one set of 4 points that map to a subset of the given differences. In the case x 4 = x 3 +0.5, the differences are D = ±{0, 1, However, X does not lead to a zero 2 error. 3) Collision of differences when adding a new point to the solution set. This is for example the case of The differences 0 and 1 are always selected in the first and the second step. In the third step, we could potentially add 2x 3 to X 2 = {0, 1} and reduce the set of differences to D = ±{x 3 , 1 − x 3 , 1 − 3x 3 }. Next, we select x 3 as a new point. We can verify that the differences of x 3 and the values in X 3 = {0, 1, 2x 3 } exist in D. However, in this verification we use the value x 3 in D twice: once as the difference between x 3 and 0, and once as the difference between x 3 and 2x 3 . The set of pairwise differences of X 4 = {0, 1, 2x 3 , x 3 } is indeed contained in the original D, but neither its 2 error nor its index-based error is zero. Notice that if we swap the third and the fourth step, this confusion would be avoided as x 3 would be removed from the set of differences in the third step. These three cases explain all the segments visible in Fig. 6. Such an analysis also applies to noisy regimes; the main difference is that we move from very localized configurations to blurrier areas where the solution is ambiguous. In fact, we introduced some noise into the experiment in Fig. 6 to enable the visualization of the lines identifying challenging patternsa noiseless setting would have just led to infinitesimally thin lines. Such patterns become blurrier and wider as noise increases. These areas where reconstruction is harder also explain the not-so-sharp phase transition in Fig. 4: when drawing supports of K elements at random, the probability of hitting a challenging pattern significantly grows with the noise. To the limit, these blurred lines cover the entire domain and the probability of success is null.

IX. COMPARISON WITH CHARGE FLIPPING
In this section, we evaluate the performance of the proposed PR algorithm in comparison with other state-of-theart methods. Recall that our algorithm is, to the best of our knowledge, the first to operate in a continuous-support setup, whereas other algorithms assume discrete signals. Indeed, the vast majority of PR methods are simply not designed to work with continuous supports; examples are PhaseLift [29], which recasts the PR problem as a semi-definite program, is sampled and we observe the magnitude of its DFT, Am, which also corresponds to a discrete version of its ACF. These DFT coefficients are directly used by Charge Flipping to recover a discretized support of f (x). Our approach proceeds in two stages: first, using FRI we compute a super-resolved version of the ACF, and then by applying the proposed algorithm, we recover the continuous version of f (x). and GESPAR [12], which linearizes the PR problem with the damped Gauss-Newton method. In general, these approaches assume that the signals of interest are sparse vectors. As seen in Fig. 1, when the locations are not aligned with the sampling grid, the discretized signal contains very few-if any-nonzero entries as the scattering function spreads the sharp continuous locations.
A few algorithms can be adapted to work with continuous supports, but they fall short when the measured support D is noisy. This is the case of TSPR [14], which relies on the triangle equality between locations to recover the support; as soon as the locations are corrupted with noise, these equalities do not hold anymore.
The closest point of comparison to our method is the Charge Flipping algorithm [9], [10]; even though it operates in a discrete domain, our experiments have shown that it is resilient to some noise on the ACF support.

A. Charge Flipping
Charge Flipping is one of the reference algorithms in crystallography. It belongs to the class of dual-space algorithms as it alternatively acts on the spatial and Fourier domains, designated real and reciprocal space in crystallography. After randomly assigning a phase to the observed magnitudes of the discrete Fourier transform (DFT) coefficients, it iteratively performs the following two operations: 1) In the real space, it flips the sign of the values that are below some fixed threshold δ. 2) In the reciprocal space, it enforces that the magnitude of the signal corresponds to the measured magnitude. Charge Flipping directly takes as input the DFT coefficients of the ACF, while our support recovery algorithm operates on a continuous version of the ACF. This is a significant advantage of our algorithm over Charge Flipping as we do not assume that the support of the points is aligned with a grid. To have an adequate comparison between the two, we need to consider the entire pipeline, combining the three algorithms exposed in Section IV; this is illustrated in Fig. 7.

B. Experimental setup and results
We generate DFT coefficients corresponding to a sparse signal as described in (1), discard their phase information, and corrupt them with zero-mean Gaussian noise. Notice that in Sections VI and VII, we assume noise on the support of the points; here, we are dealing with noise that is applied to the DFT coefficients instead. Obviously, these noisy DFT coefficients also lead to a noisy support of the super-resolved ACF, but it is not Gaussian anymore. In fact, as FRI algorithms rely on nonlinear methods, the noise of its output is difficult to characterize.
The discretization of the Fourier domain is equivalent to a periodization of the spatial domain. As a consequence, the squared magnitude of the DFT coefficients corresponds to a circular ACF. While it is certainly possible to adapt Algorithm 1 to handle circular ACFs by testing all the possible 2 D quadrants for every observed d ∈ D, we chose to zero-pad the support of f (x) until its ACF is not circular anymore.
Regarding Charge Flipping, we notice that its performance highly depends on the initial solution as well as the choice of δ. To avoid giving an unfair advantage to our algorithm, we run Charge Flipping 10 times for each experiment and pick the best solution; practical experiments show that the performance gain is marginal when going above such a number of repetitions. Furthermore, best practice [10] suggests to pick δ = bθ, where b is a constant around 1-1.2 and θ is the standard deviation of the measured signal. Our experiments showed that progressively decreasing the value of δ also improves the performance of Charge Flipping. This mimics the behavior of the simulated annealing algorithm, where the temperature is steadily decreased until convergence.
Then, given noisy DFT coefficients as input, we compare the 2 error on the support of the points for both algorithms, as well as a probability of successfully recovering the support. To define the latter, we say that an algorithm fails when the 2 error is higher than a specific threshold. Fig. 8 shows that our FRI super-resolution algorithm surpasses Charge Flipping in terms of both metrics. It is not surprising that our algorithm exhibits a superior performance in a low noise regime-it even achieves exact reconstruction in the absence of noise-since it is not bound to a grid. On the other hand, Charge Flipping always suffers from approximation errors due to the implicit discretization: in the noiseless case and for a grid of size 200, we calculate that the expected 2 error on the support of K = 5 points is about 0.0056, which is in adequacy with the baseline observed in Fig. 8a. Simulations also indicate that our algorithm outperforms Charge Flipping in high noise environments. Indeed, the reconstruction error is consistently lower and its phase transition compares favorably as well.

X. CONCLUSION
We presented a novel approach to solve the phase retrieval problem for sparse signals. While conventional algorithms operate in discretized space and recover the support of the points on a grid, the power of FRI sampling combined with the sparsity assumption on the signal model enables to recover the support of the points in continuous space. We provided a mathematical expression that approximates the probability of success of our support recovery algorithm and confirmed our result via numerical experiments. We observed that while our algorithm runs in polynomial time with respect to the sparsity number of the signal, it remains relatively costly. To alleviate the computational costs without impacting the quality of the reconstruction, we proposed a caching layer to avoid repeating calculations. Furthermore, we introduced several improvements that contribute to enhance the quality of estimation in the presence of noise. Finally, we showed that our super resolution PR algorithm outperforms Charge Flipping, one of the state-of-the-art algorithms, both in terms of average reconstruction error and success rate.