Molecular Surface Estimation by Geometric Coupled Distance Functions

Estimating the surface from given atoms with location and size information is a fundamental task in many fields, such as molecular dynamics and protein analysis. In this paper, we present a novel method for such surface estimation. Our method is based on level set representations, which can efficiently handle complex geometries. The proposed method is analyzed from mathematical point of view and from computation point of view. The method does not require any prior information about the surface. This property is fundamentally important for the surface estimation task. The presented method is evaluated on both synthetic and real data. Several numerical experiments confirm that our method is effective and computationally efficient. Finally, the method is applied on protein surface estimation. This method is suitable for high performance molecular dynamics study, protein surface analysis, etc.


I. INTRODUCTION
Given the location and size information of some atoms, it is important to know the geometric surface that compactly contain these atoms. Such geometric surface is important for its function, such as protein docking and Molecular Hydrophobicity Potential. Therefore, surface estimation from such atoms becomes fundamentally important for related research fields. One example is shown in Fig. 1, where the surface is estimated by our method and the color indicates its mean curvature property.
Before showing the molecular surface estimation, we first introduce a surface reconstruction method from point cloud. Different from the molecular surface, these points exactly live on the surface and do not have a size or radius. We show this method and its accuracy on both synthetic and real data. Then, we extend this method for molecular surface estimation in Section V.

A. POINT CLOUD REPRESENTATION
A point cloud P = { x i ∈ S} is a fundamental representation of a surface S. If an object is small compared to S, it can be treated as a point. For example, nuclei are treated as points to represent cells in the tissue; a small region is treated The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney. as a point during tissue development; a protein is treated as a point on the cell membrane. All points together can represent the geometry where they live on. This idea has been used in super resolution microscopy techniques such as Photo-Activated Localization Microscopy (PALM, invented by Eric Betzig who won Nobel prize because of PALM in 2014) and Stochastic Optical Reconstruction Microscopy (STORM).
However, surface reconstruction from unstructured point clouds is challenging due to the absence of connectivity information between the points, which may lead to topological ambiguities. Even if there is a model to choose VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the topology according to the sample's geometry, this requires prior knowledge about the sample. Moreover, high curvature (sharp corners or edges), irregular sampling, and noise in the point positions often complicate the task. Conceptually, there are two approaches to recovering the surface S from which the points have been sampled: interpolation (find missing data) and model fitting (reduce error). Traditionally, both approaches require predefined basis functions that ideally reflect geometric properties of the true surface, such as connectivity, smoothness, sparsity, and curvature. These implicitly assumed properties constitute the prior knowledge (regularization) about the unknown geometry. Even though their imposition may render the reconstruction problem well-posed, these priors may mask details or patterns in the signal, such as smoothing out texture, or bias the reconstruction result toward the imposed priors.
Regularization-free geometry reconstruction is desirable, specially for biological data, where prior knowledge is scarce or needs to be investigated. If a prior was imposed, it would be difficult to decide whether a property of resulting geometry results from the prior or from the signal. This prior-freeness requirement makes most of state-of-the-art approaches in surface reconstruction fail in this context. We present a regularization-free method for geometry reconstruction from point clouds to tackle this challenge.

B. PREVIOUS WORKS ON SURFACE RECONSTRUCTION
Numerous methods have been proposed for surface reconstruction from unstructured point clouds. This includes approaches based on FFT (e.g., [1]), Poisson surfaces (e.g., [2]), and moving least squares (e.g., [3]). Most of those methods require that the surface normals have been first estimated from the data. From a differential geometry point of view, normals are first-order information about the geometry, while position is zeroth-order information. However, estimating normals from unstructured point sets is challenging because global consistency across the entire surface must be ensured. Traditionally, normals are estimated using Principal Component Analysis (PCA), a local method that can not guarantee global consistency. Several methods, for example level-set methods [4], [5], do not require normal information. Instead, they attempt to minimize the energy [5] where q is a real number, and d(·) is the distance field to the unknown implicit surface S (see Fig. 2(a)), and λ is the regularization coefficient. In this energy-based approach, methods from image segmentation have also been adapted to surface reconstruction [6]. Thanks to the convexity of the energy, fast solvers (e.g., split-Bregman [7], [8] or Primal/ Dual) can be used. The regularization term R(S) in the model, however, tends to smooth the result and remove details from the surface. Moreover, the computational cost depends on proper initialization and stepsize control. Previous works also demonstrated several strategies to save memory and reduce the computational complexity of levelset algorithms. This includes narrow-band formulations [9], multi-scale methods [10], [11], and DT-grids [12]. The need for computationally expensive level-set re-initialization has been overcome by adding an additional penalty (regularization) term in the energy [13].

II. OUR METHOD
Our work is motivated by the symmetry property of a distance field and the antisymmetry of the level set representation [14]. It is inspired by coupled level-set methods [15]- [17], but addresses some of their shortcomings when reconstructing high-curvature regions, while guaranteeing the signed-distance property. The connection of our method with others can be found in section II-D.
We are given an unordered point set P = { x i : x i ∈ R n , i = 1, . . . , N }. Usually, n = 2 for image processing problems, such as estimating a contour from feature points or filling gaps between edge fragments, and n = 3 for computer graphics problems, such as constructing a surface model from a point cloud, or for stereo vision problems. Even though we focus on two-and three-dimensional problems, the method presented here also works in higher dimensions, for example for constructing surfaces in manifold learning.
For surface reconstruction, we present a coupled Signed Distance Functions (cSDF) method. The key idea in cSDF is to use two spatially coupled signed distance functions φ in and φ out to capture the surface S. This geometric constraint is in contrast to coupled level-set methods, which are based on topological constrains [18]. The cSDF idea is inspired by the difference between the reflection symmetry of the distance field d and the reflection antisymmetry of the level-set function φ.
From a variational point of view, the cSDF method attempts to minimize the energy functional where φ max = max (φ in ) 2 , (φ out ) 2 , d 2 . However, as described in the following, we do not need to evolve any Partial Differential Equation (PDE) in order to compute the result. Instead, simple thresholding is enough. The reason for this becomes apparent from the Eikonal equation (Eq. 3), where thresholding is equivalent to wave propagation if the wave speed is constant.
We first illustrate cSDF in 2D and then apply it to synthetic and real-world data in 3D. The method proceeds in three steps, as detailed by Algorithms 1 to 3 below. The corresponding steps are illustrated as in Fig. 2(a). The coupled level set functions φ out and φ in are illustrated in Fig. 2(b) for two examples. A simple example of the intermediate results after every processing step is shown in Fig. 3.

A. STEP 1: DISTANCE FIELD
For a given point cloud P (exemplified in Fig. 3(a)), we compute the distance field d( x) on a predefined Cartesian grid G = m × n of uniform resolution h. This amounts to solving the following Eikonal equation: as the boundary-value formulation of the Hamilton-Jacobi problem associated with Eq. 2. Several methods are available to numerically solve this equation, including the Fast Marching Method [9], the Group Marching Method [19], the Fast Sweeping Method (FSM) [20], the Fast Iteration Method [21], and direct Hamilton-Jacobi solvers [22].
Here, we show three alternative methods to compute the distance field. The first one uses an extended-window FSM restricted to a narrow band of width b: where Eq. 3 is only solved for x ∈ N b . We ensure the distance property by setting v( x) ≡ 1. This method was first published in [23]. The second method directly computes the distance field from the point cloud using a Sparse Voxel Oct-tree (SVO) data structure in the narrow band. The third method solves an inhomogeneous Helmholtz equation using FMM [24], similar to how it is done in a Schrödinger distance transform [25].

1) EXTENDED-WINDOW FAST SWEEPING METHOD
FSM sweeps the grid until convergence, which can be inefficient for points far from the interface. Fast iterative methods relax this by using locks [21]. These locks, however, cause additional serialization. Here, we avoid these locks and accelerate FSM by using a larger window size w > 1 (see Algorithm 1). The min method of FSM [20] is hence extended to account for all points in a w-neighborhood, as illustrated in Fig. 4(a) for w = 2. We initialize the algorithm with: The original FSM [20] is recovered for w = 1.
For w > 1 the present extended-window FSM is more accurate than the original FSM, because the integrated error is reduced when the local window size w gets larger. Moreover, it converges faster since the larger window causes a wave of higher speed (not further elucidated in this paper). The local update cost increases from 2 to w(w + 3)/2 (not (2w + 1) 2 , thanks to the symmetry property).
In our implementation, we further accelerate initialization and fast sweeping by using a Look-Up Table (LUT) for the distances in the local window. Instead of directly computing the distances of each sample point x i to each grid node x i,j within the local window, we subdivide each grid cell into 4 × 4 bins, represented by the green/white checkerboard pattern in Fig. 4(a). We then build a LUT of the distances between each checkerboard bin center and each grid node in the local window. The distance of a sample point to a grid node is then taken to be the distance between the bin center that it resides in and the grid node (black arrows in Fig. 4(a)). This can be used to further accelerate the initialization and sweeping steps as follows: Step: There are two ways to initialize the distance field. The sample point x i , represented by a red dot in Fig. 4(a), can be used to directly compute the distance to its all neighbors on the grid. The other way is to use a LUT. As shown in Fig. 4(a), x i must be in one of the grid cell neighboring x i,j . All distances in the LUT are computed with respect to the center of the checkerboard cell containing x i .
• Sweeping Step: We determine d( x) in the local window using a distance LUT with respect to the center node x i,j . Thanks to the symmetry property of d( x), only w(w + 3)/2 real numbers need to be stored in the LUT. The sweeping process only uses addition and comparison operations, which are fast on modern CPUs. Fig. 3(b) shows an example d( x) computed using FSM with w = 3.

2) DIRECT COMPUTATION
The distance field can alternatively be directly computed using a SVO data structure [26]. SVO has attracted a lot of attention in the computer-graphics literature recently for its nice properties in ray tracing, which essentially also is a distance computation task.
We use a Hilbert curve (illustrated Fig. 4(b)) to encode the narrow band grid points. Then, the so-encoded narrow band is organized in an oct-tree, which is constructed bottom-up. The distance is then directly determined by a nearestneighbor search in the tree. For more details, we refer to Ref. [26].

3) HELMHOLTZ EQUATION
We can embed the distance field d( x) into a homogeneous Helmholtz equation: where ψ( x) = exp(d( x)/τ ), δ ≤ τ < 0 and δ is a negative number close to zero. is the Laplace operator. The solution of this Helmholtz equation automatically also satisfies the Eikonal Eq. 3. For τ → 0(τ < 0), After solving Eq. 6, the distance field can be recovered as: The advantage of this method is that there are several very efficient solvers for Eq. 6, such as FFT(DCT, DST)-based solvers, Multigrid solvers, and Fast Multipole Methods [24].
Here, we use a DCT-based solver in order to directly impose homogeneous Neumann boundary conditions, i.e., the gradient of ψ is zero in the normal direction at the band edge.

B. STEP 2: COUPLED SIGNED-DISTANCE FUNCTIONS
We aim to compute φ( x), the signed-distance function associated with d( x). The key idea in cSDF is to apply distance-preserving shift transformations to the output of Algorithm 1, thus solving the boundary-value problem in Eq. 3 without (pseudo-)time evolution. Specifically, we shift d by an offset T in order to determine the functions φ in bin and φ out bin that indicate whether the shifted level set d − T is inside or outside of S (see shaded areas in Fig. 2(a)). The threshold T defines the separation between the regions to be labeled. Therefore, T > √ 3h.
Algorithm 1 Extended-Window Fast Sweeping Method in 2D 1: INPUT: threshold tol, window size w, S, U , V 2: set w +1 = w + 1 3: initialize d k ( x) using Eq. 5 4: define the loop sets 6: go through the loop sets and do After thus identifying φ in bin and φ out bin , we shift d down by T s , yielding the level sets φ in1 and φ out1 . Then, the function d − T s is shifted up by exactly the same T s , yielding φ in2 and φ out2 . It is clear that T < T s < b in order to keep the two layers separate. As shown later, T s is a scale parameter. The complete procedure is given in Algorithm 2.

C. STEP 3: SURFACE RECONSTRUCTION USING cSDF
After computing φ in2 and φ out2 , a joint estimation of the signed-distance function φ of the reconstructed surface S is computed from φ in2 , φ out2 , and d( x) as described in Algorithm 3.

Algorithm 3 Surface Reconstruction Using cSDF
if d out edge == t then φ = φ out2 8: end for 9: OUTPUT: φ There are only three possible curvature (c) cases for any point x i on S: positive, negative, or zero curvature (as illustrated in Fig. 3(a)), corresponding to the three cases in Algorithm 3: • c > 0: S is convex at x i . Therefore, S is captured by d and φ out2 .
• c < 0: S is concave at x i . Therefore, S is captured by d and φ in2 .
• c = 0: S is planar at x i . Therefore, S is captured by φ in2 and φ out2 . A simple average of the corresponding two fields provides the estimate for φ. Fig. 3(c) shows an example φ computed using this procedure.

D. RELATIONS TO OTHER METHODS
The cSDF method is related to coupled level set methods and to Hilbert-Huang transforms. However, cSDF uses a geometric coupling, while coupled level sets and the Hilbert-Huang transform only provide topology control. This difference is illustrated in Fig. 5. For example, the inner level set in coupled level sets can evolve arbitrarily as long as it stays inside the outer level set (left panel of Fig. 5). This is not possible in cSDF, where the two layers must be geometrically shifted by T s (right panel of Fig. 5).

E. NOISE, OUTLIERS, AND PARAMETER
Since the present cSDF method is free of regularization and intends to reconstruct even minute details of the surface, it is sensitive to noise in the input point set. If the input point positions are noisy, or contain outliers, we hence use a different algorithm for computing the distance field d( x). This robust algorithm is presented below. All downstream processing, in particular the cSDF construction, then remains unaffected.
We also analyze below how the parameter T s controls the scale of the surface details to be recovered. This parameter hence is a scale-space parameter for the cSDF method.

1) ROBUST DISTANCE FIELD
In the presence of outliers or noise in the input point set, we use a local variance-weighted method to compute the distance field, which is robust against noise and outliers. This choice is motivated by the fact that in practice the positions x i are often uncertain due to, e.g., measurement errors. We model this uncertainty as Gaussian noise of mean zero and standard deviation σ i i.i.d. added to the point positions. Usually, this σ i can directly be obtained from the measurement or imaging device that acquired the data.
If σ i is unknown, we estimate it by K n -nearest neighbor clustering or by singular value decomposition. In the clustering method, we first compute the K n -nearest neighbor distances of all points and compute their average. Then, we use the difference between the distance and this average as the weight. When using singular value decomposition, we first compute the variance matrix of the point set and use SVD to compute the distance to the plane defined by the first eigenvectors. In what follows, we use the clustering-based method, which is also called ''inverse distance weighting'' in the literature.
The weighted distance field is then defined as: is a normalization factor, D i = |x − x i | is the distance between x and x i , and x is an arbitrary position on the grid. A comparison between the so-obtained robust distance field and the one obtained using Algorithm 1 is shown in Fig. 6.

2) SCALE PARAMETER T s
The cSDF coupling parameter T s controls the scale space in which the interpolated signal lives. Let be the average (across all data points) distance between K n -nearest neighbors. Then, T s has to satisfy the condition: in order for neighboring points to meet. This defines the lower bound on how small of details can possibly be recovered from the samples by cSDF. An illustration is shown in Fig. 7. The green dot represents a grid point x i,j , the red dots represent the input samples x i . The two red samples within the shaded disk around the green dot are indistinguishable by x i,j . Therefore, T s must be larger than the radius of the shaded disk in order to ensure that it is unnecessary to distinguish between those two samples (lower bound). As mentioned, thresholding with T s is equivalent to wave propagation. Thus, this step turns the explicit discrete-sample representation of the surface into an implicit continuous representation.
Increasing T s , however, is not equivalent to smoothing or to a regularizer. T s only defines the scale space for the interpolation, but does not limit the curvature within that space. As seen in the area highlighted by the green rectangle in Fig. 8, surface details are not lost when increasing T s . However, between closely apposed surfaces, a too coarse scale space may lead to topological problems, as for example shown in the red rectangle in Fig. 8. This issue can be avoided by adaptively changing T s in a standard scale-space approach, or by using a spatially adapted T s ( x).

FIGURE 8.
Changing the scale parameter T s does not introduce surface smoothing (green rectangle), but may lead to topological problems (red rectangle) due to scale-space coarsening.

III. NUMERICAL VALIDATION
We validate cSDF in 2D and show its accuracy by measuring the error under the 1 and 2 norms with uniformly sampled points on a circle. We further perform 3D benchmarks on computer-graphics models.

A. 2D BENCHMARKS
We test the accuracy of cSDF by sampling N points uniformly on a circle of radius R and comparing the reconstructed circle to the ground truth for decreasing N . We use a 200 × 200 grid for all N ∈ [45, 360] × R ∈ [40, 70] and directly compute distances without LUT. We linearly interpolate the resulting φ at each of the original x i ∈ S. The correct value would be φ c = 0 for all x i . We then compute the overall (reconstruction plus interpolation) 1

and 2 errors as [
, respectively. The result is shown in Fig. 9.

B. 3D BENCHMARKS
We benchmark cSDF in 3D by using the vertices of the triangulated surfaces of the well-known computer-graphics models ''Armadillo'' and ''Buddha'' as input point clouds. The number of points for each model, the CPU time for cSDF reconstruction of the implicit surface representation, and the resulting errors against the known ground truth at the vertex positions are given in Table 1. The code is implemented in C and run on a 2GHz Intel Core i7. When the distance LUT is used, the CPU time is further reduced to 12.8s and 15.9s for ''Armadillo'' and ''Buddha'', respectively. The timings compare favorably with the > 200s CPU time of an efficient Bregman code [27] for ''Buddha'' with similar resolution. Figure 10 shows the resulting reconstructions and close-ups ( Fig. 10(b) and (d)) with the input point cloud overlaid to demonstrate the method's capability to represent high-curvature regions without grid refinement (see, for example, the ''Armadillo'' claws).

IV. SURFACE RECONSTRUCTION
In this section, we illustrate the use of cSDF in several applications on real biological data. Since the cSDF is priorfree, the resulted geometry is only based on the point cloud. Therefore, the property of the surface, such as normal or curvature distribution, is only from the geometry itself, without being corrupted by any prior.
The first data set comprises 3D positions of atoms in a protein conformation obtained from molecular-dynamics simulations. 1 This is an example of noise-free data where 1 Data courtesy of Dr. Anton Polyansky, Zagrovic group, MFPL, Vienna. the surface is to be reconstructed as accurately as possible. We use cSDF to reconstruct the molecular surface of the protein and to locally shade it according to the Molecular Hydrophobicity Potential (MHP). The result is shown in Fig. 11(a) and (b).
The second case considers a 2D PALM (photo-activated localization microscopy) super-resolution image. PALM intrinsically produces point clouds, as it detects the centroids of single fluorescent molecules. cSDF can then be used to reconstruct the surface (e.g., the membrane) on which the fluorescent molecules live. The PALM image in Fig. 11(c) and (d) shows fluorescent lamin proteins of the nuclear lamina. 2 We use cSDF to reconstruct the nuclear envelope from these point detections. Due to the large amount of outliers and noise in this data set, we use the robust distance field method presented above.

A. COMPARED WITH TV
In its current implementation, cSDF is about 6 times faster than a highly efficient Bregman code for surface reconstruction [27]. This runtime can be further reduced by parallelizing FIGURE 12. 2D example with sharp corners, compared with TV regularization [6] with parameter µ = 10 −6 (left), 2 × 10 −6 (middle) and 3 × 10 −6 (right) and presented method.
the code on multi-or many-core hardware, such as GPUs or computer clusters. cSDF computes the distance field three times and then estimates the signed distance field. Thus, the key point of cSDF's parallelization is to compute the distance field in parallel, which has been extensively studied in computer graphics [28], [29].
Due to its regularization-free nature, cSDF can capture high curvature without adaptive grid. It also does not bias the reconstruction result toward a prior, nor does it excessively smooth the reconstructed surface. This way, cSDF is for example able to perfectly reconstruct and represent the sharp corners and edges of a cube, whereas regularization-based methods will round them even for the smallest amount of regularization, as shown in Fig. 12.

B. MEAN CURVATURE ESTIMATION
From φ, thanks to the signed-distance property of cSDF, the mean curvature can directly be computed as: Examples are shown in Figs. 13. Based on the prior-free property, the normal and curvature are guaranteed to be features of the surface itself, without being corrupted by any prior.

V. PROTEIN SURFACE ESTIMATION
cSDF can be extended to volume data, where the inner distance field vanishes. The surface of the volume is captured by the outer distance field and the distance field of the point cloud. Since the inner distance field does not exist, the concave region can not be accurately recovered (sharp inside corners get smoothed). The algorithm is summarized in Algorithm 4. The process is very similar with shrinking effect in traditional level set method. cSDF can be further extend to volume data, where a point x i becomes a sphere centered at x i with given radius r i . Algorithm 4 can be used to handle this case by letting d( x) inside the sphere be negative.

A. MEAN CURVATURE DISTRIBUTION PRIOR
Since cSDF is regularization-free, we can use it to obtain prior knowledge about the surfaces. We prepare 17 different proteins from Molecular Dynamics Simulations with 1000 time The mean curvature distribution of (a). (c) Surface reconstruction, normal estimation, and curvature estimation for a part of a human aorta point cloud (data courtesy of Dr. George Bourantas, MPI-CBG). The mean curvature is color coded after curvature histogram equalization for better visualization. (d) Zoom of (c). It worth pointing out that the distribution is only relied on the surface without being corrupted by any prior since cSDF is prior-free.
step. 3 The proteins are summarized in Table 2. These names are the same as they are in the Protein Data Bank. 4 We have five independent runs for ubiquitin (UBQ), and five independent runs for UBM2. In total, we have 25 trajectories, each of which has 1000 time steps. The number of points for each protein is shown in Table 2.
We use Algorithm 4 to construct the 25,000 protein surfaces and estimate their mean curvature. To reduce the resolution effect, instead of studying H , we study H · h 2 , which is called mean curvature half density [30], [31] or weighted curvature [32]. And it is independent on resolution h.
Two distributions of H · h 2 from the examples are shown in Fig. 14. Even though these two surfaces are very different,  We use 1 We compute a distance matrix M between all proteins at each time step. M (j, k) are the χ 2 distances between  p i 1 t 1 and p i 2 t 2 , where j = (i 1 − 1) * 1000 + t 1 and k = (i 2 − 1) * 1000 + t 2 . The result is shown in Fig. 15a. We define the average distribution for each protein as Similarly, we can compute a distance matrix˜ M , wherẽ M (j, k) is the χ 2 distance betweenp j andp k . We use isomap algorithm [33] to reorder the proteins, showing the relationship between each other. The reordered distance matrix is shown in Fig. 16a and b. The rearrangement shows the relationship between proteins.

B. CURVATURE DISTRIBUTION MODELING
Noticing the distributions of H · h 2 can be well approximated by a Gaussian distribution in Fig. 14, we use a Gaussian model to approximate each distribution of H · h 2 for all 25,000 protein surfaces. The Gaussian model is defined as The modeling results are shown in Fig. 17. The parameters a and σ are stable for all tested proteins. The parameter b is stable for each protein during time steps. This stability guarantees that mean curvature distribution can be used as priors.
We can also rearrange the proteins by simply sorting the parameter b. The result is shown in Fig. 18b. The reordered parameter b is shown in Fig. 19b. The similarity between the FIGURE 18. Distance matrix after the rearrangement of these proteins. result from Isomap and the result from sorting parameter b suggests that b is a dominant parameter in this modeling.

VI. SUMMARY
We have presented a regularization-free method for geometry reconstruction from unstructured point clouds. The result is guaranteed to be a signed-distance function, dispensing with the need for re-initialization and regularization. We benchmarked the method on 2D and 3D artificial datasets and showed its accuracy and computational efficiency. We further showed its application in real-world surface reconstruction for protein molecular surfaces and PALM microscopy data.
Thanks to the regularization-free property, the mean curvature distributions from the estimated surfaces can be obtained and modeled as a prior. We show how to compute and model the mean curvature distributions for a Molecular Dynamic Simulation dataset.
Thanks to the computational efficiency, our method can reach real-time performance and can be applied in many fields, such as protein surface estimation, studying the relationship between structure and function, molecular dynamics, etc.