FCSN: Global Context Aware Segmentation by Learning the Fourier Coefficients of Objects in Medical Images

The encoder-decoder model is a commonly used Deep Neural Network (DNN) model for medical image segmentation. Conventional encoder-decoder models make pixel-wise predictions focusing heavily on local patterns around the pixel. This makes it challenging to give segmentation that preserves the object's shape and topology, which often requires an understanding of the global context of the object. In this work, we propose a Fourier Coefficient Segmentation Network~(FCSN) -- a novel DNN-based model that segments an object by learning the complex Fourier coefficients of the object's masks. The Fourier coefficients are calculated by integrating over the whole contour. Therefore, for our model to make a precise estimation of the coefficients, the model is motivated to incorporate the global context of the object, leading to a more accurate segmentation of the object's shape. This global context awareness also makes our model robust to unseen local perturbations during inference, such as additive noise or motion blur that are prevalent in medical images. When FCSN is compared with other state-of-the-art models (UNet+, DeepLabV3+, UNETR) on 3 medical image segmentation tasks (ISIC\_2018, RIM\_CUP, RIM\_DISC), FCSN attains significantly lower Hausdorff scores of 19.14 (6\%), 17.42 (6\%), and 9.16 (14\%) on the 3 tasks, respectively. Moreover, FCSN is lightweight by discarding the decoder module, which incurs significant computational overhead. FCSN only requires 22.2M parameters, 82M and 10M fewer parameters than UNETR and DeepLabV3+. FCSN attains inference and training speeds of 1.6ms/img and 6.3ms/img, that is 8$\times$ and 3$\times$ faster than UNet and UNETR.


I. INTRODUCTION
Over recent years, we have witnessed the increasing popularity of Deep Neural Networks (DNNs) for various medical image segmentation tasks. The encoder-decoder model [1], [2] is currently the most widely adopted DNN approach for segmentation. Given enough training data, encoder-decoder models can extract local patterns from an image that are associated with labels at each spatial coordinate. However, due to its heavy reliance on local patterns, the model often fails to exploit the global context that could help to nullify nuisance local variations.
Specifically, in medical imaging tasks where the risk of misclassification is high, we need a model that is robust to many unpredictable local variations by incorporating the global context. Taking the segmentation of the optic cup in retinopathy as an example, as demonstrated in figure 1, the following problems are difficult to address unless the model learns the global context:
• anatomically, the shape of an optic cup is always like a single filled oval, but current DNNs often give segmentations with multiple components or with holes
• an optic disc has a smooth contour, but current DNNs give contours with sharp corners or unnecessary zigzags
• retinopathy images from different sources are likely to suffer from different degradations, which cause generalization problems for current DNNs.
In this paper, we argue that these problems, which are either ignored or indirectly treated in conventional encoder-decoder segmentation models, can be effectively addressed if we train the DNN to directly predict the shape, size and location of an object.
Fig. 1: Comparison of encoder-decoder model (upper) and Fourier Coefficient Segmentation Network (FCSN) (lower). Unlike the encoder-decoder model, which makes a coordinate-wise prediction of an object, our FCSN predicts the complex Fourier coefficients of the object's masks, which requires the learning of broader contextual information. Moreover, FCSN is more memory-efficient owing to the absence of a decoder.

A. Encoder-decoder Segmentation Model
As shown in the first row of figure 1, modern segmentation models typically adopt an encoder-decoder structure which models the conditional probability of predicting label $y_{hw}$ given an input $x$ at each spatial coordinate $(h, w)$ (i.e. $p(y_{hw} \mid x)$). The model is then optimized to maximize the spatially summed log-likelihood (i.e. $\arg\max_{p} \sum_{h,w} y_{hw} \log p(y_{hw} \mid x)$), assuming spatial independence across the coordinates. Because of the structure of the model and the way in which it is optimized, the existing encoder-decoder model makes predictions that rely mainly on local patterns and often does not utilize the global context of the image at all. This absence of global context can cause inconsistency in segmentation performance, especially for tasks that assume specific global priors.
Most of the existing works on global context learning aim to solve the problem by proposing a more flexible (general) model structure that offers the model an opportunity to capture global patterns [3]-[5]. However, offering the opportunity does not necessarily mean that the model will explore the new aspect of learning. There is a possibility that the model will still focus on finding local shortcut evidence and hence fail to focus on the global evidence [6]. Also, when the network is trained under a data constraint, higher flexibility could negatively impact the model performance. In this regard, we argue that increasing the model flexibility alone is an unstable solution to the global context learning problem.

B. Contribution
We propose a novel segmentation model, the Fourier Coefficient Segmentation Network (FCSN), that lifts segmentation to a shape prediction task, where the shape is represented as Fourier coefficients. As shown in figure 1, FCSN perceives the segmentation mask as a smooth function in the complex domain, which can be accurately approximated by complex Fourier coefficients. We use the Fourier transform to extract the complex Fourier coefficients of the contour of the mask. Hence, FCSN learns the global shape of an object by predicting its Fourier coefficients, and during inference, a contour is retrieved with the inverse Fourier transform.
To motivate how predicting Fourier coefficients helps to learn global context, imagine we want to segment an ellipse-shaped object, which can be precisely described by three complex Fourier coefficients $z_{-1}$, $z_0$, $z_1$. The coefficient $z_0$ describes the center of the ellipse, and $z_{-1}$ and $z_1$ determine the lengths and orientations of the semi-major and semi-minor axes. Thus, for a DNN to make a precise prediction of the three coefficients, the model must learn to perceive the whole ellipse as a single object. This is in contrast to the traditional encoder-decoder model, where the model makes predictions only by looking at the local structure of the object.
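The ellipse example can be checked numerically. The sketch below assumes the contour is written as $\alpha(t) = z_0 + z_1 e^{jt} + z_{-1} e^{-jt}$ for $t \in [0, 2\pi)$; the function name `ellipse_contour` and the coefficient values are hypothetical and chosen only for illustration. Under that assumption the semi-major and semi-minor axes have lengths $|z_1| + |z_{-1}|$ and $\bigl||z_1| - |z_{-1}|\bigr|$.

```python
import cmath
import math


def ellipse_contour(z_m1, z_0, z_1, num_points=360):
    """Sample alpha(t) = z_0 + z_1*e^{jt} + z_{-1}*e^{-jt} on [0, 2*pi)."""
    points = []
    for k in range(num_points):
        t = 2.0 * math.pi * k / num_points
        points.append(z_0 + z_1 * cmath.exp(1j * t) + z_m1 * cmath.exp(-1j * t))
    return points


# Hypothetical coefficients, for illustration only.
z_0 = 5 + 2j                              # centre of the ellipse
z_1 = 3 * cmath.exp(1j * math.pi / 6)     # counter-clockwise component
z_m1 = 1 * cmath.exp(1j * math.pi / 6)    # clockwise component

contour = ellipse_contour(z_m1, z_0, z_1)
radii = [abs(p - z_0) for p in contour]   # distance of each sample from the centre

semi_major = abs(z_1) + abs(z_m1)         # largest distance from the centre
semi_minor = abs(abs(z_1) - abs(z_m1))    # smallest distance from the centre
print(max(radii), semi_major)
print(min(radii), semi_minor)
```

All three numbers describe the curve as a whole, so none of them can be read off a small local patch of the contour.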
We also propose to add a Fourier differentiable spatial to numerical transform (F-DSNT) module [7] to improve the accuracy of Fourier coefficient prediction and to reduce memory consumption. One could view the coefficient prediction as a typical regression problem and introduce fully-connected (FC) layers on top of the spatially flattened feature. However, FC layers have several drawbacks: 1) they are over-parameterized, which affects generalizability, 2) they assume a fixed input shape, and 3) their output range is not bounded. Instead, DSNT drives the encoder module to produce heatmaps that represent the probability distributions of Fourier coefficients. DSNT does not introduce any trainable parameters and works with any input shape.
We evaluate the performance of FCSN on three medical image segmentation tasks: skin lesion, optic disc, and optic cup segmentation. FCSN outperforms state-of-the-art segmentation models such as DeepLab-v3+ and U-Net+ when evaluated with the Hausdorff distance. Furthermore, as our model can attend to global features, its performance does not degrade under local perturbations such as contrast change, additive noise, or motion blur. Lastly, our model is lightweight: it requires less computation because it discards the decoder module, which has been indispensable in modern segmentation models and incurs a considerable memory overhead.

II. RELATED WORK
A. Encoder-decoder Models
FCN [8] and U-Net [1] were among the first DNN models to propose an encoder-decoder structure for semantic segmentation. However, the two approaches often produced noisy predictions that contained holes or non-smooth contours, implying that the models failed to understand the global context. The issue has been addressed broadly in two ways while preserving the encoder-decoder structure: by 1) increasing the receptive field size and 2) introducing a regularizer that penalizes non-smooth predictions.
1) Broader Receptive Field: For a unit in the prediction of a network, the theoretical receptive field (TRF) of this unit refers to the region in the input image that contributes to the prediction of this unit. For convolutional neural networks, the TRF is usually only a fraction of the input image, and it depends on the architecture and filter sizes of the network. To make more globally aware predictions, the TRF must be large enough to cover the whole region that contains information related to the prediction.
In the literature, several methods have been proposed to increase the TRF. In [4], the authors proposed ParseNet, which incorporates a global context feature generated by a global pooling operation in the feature embedding. In [9], [10], the authors proposed non-local U-Nets, which include Transformer modules [11] to extract long-range features. In [2], the authors proposed DeepLab with an Atrous Convolution module that extracts features with varying receptive field sizes using dilated convolution.
As observed in [12], the effective receptive field (ERF) can be very different from the theoretical receptive field. The ERF is defined as the collection of pixels inside the TRF that have a non-negligible impact on the prediction. It is found in [12] that for neural networks prior to training, the ERF is usually smaller than the TRF, and proper training is needed to enlarge the ERF. Therefore, models with a large TRF may not be capable of effectively understanding global context. In [13], the authors proposed the Lovász metric, which is a convex function that approximates the Intersection over Union (IoU) metric. Since IoU is calculated over the whole image, the proposed metric can facilitate global learning.
2) Regularizing Prediction: Another approach to promote smooth segmentations is to adopt regularization on the models or the predicted masks. In [14], the authors proposed ACNN-Seg for predicting high-resolution segmentation masks from low-resolution images. They introduced an extra autoencoder (AE) network to regulate segmentation outputs, such that the AE would produce similar features for both the predicted masks and the ground truths.
More recently, the authors in [15], [16] proposed to add spatial regularization to softmax activation functions in order to minimize total variation of predictions, such that the predicted masks are more robust to various local perturbations in the images.

B. Segmentation via Shape
For image segmentation, most DNNs make per-pixel predictions for segmentation masks. One way to obtain a more regularized prediction is to predict the shape of the segmentation mask, which effectively reduces the output dimensionality and complexity.
In [17], [18], the authors proposed DNNs that learn a parametrization of boundary curves via piecewise Bézier curves. However, the Bézier parametrization does not necessarily converge to the true boundary curve. In [19], the authors proposed to predict polar coordinates of sampled points on boundary curves for instance segmentation.
There are not many DNN approaches that utilize Fourier transforms for segmentation. In [20], the authors used a DNN to learn Fourier coefficients of sampled points on boundary curves for instance segmentation. However, they regarded the x and y coordinates of boundary points as two sequences of real numbers and applied Fourier transforms to them independently. In our approach, we regard the boundary curve as a sequence in the complex domain, and we apply the complex Fourier transform only once to get the Fourier coefficients.

III. PROPOSED METHOD
As shown in figure 2, our DNN model consists of four modules. The first module, CNN$_\theta$, is a feature extraction module that takes an image as its input. Any standard CNN backbone can be adopted.
The Fourier coefficients $\{z_n \in \mathbb{C}\}$ of the boundary curve $\alpha(t)$ are defined, for $n = \ldots, -1, 0, 1, \ldots$, by the integral over the curve in equation (1), where $j$ is the imaginary unit. The original boundary curve $\alpha$ can be fully recovered from the Fourier coefficients $\{z_n\}$ by taking the inverse Fourier transform. Therefore, instead of making a direct prediction of the segmentation mask $Y$, it is possible to predict the Fourier coefficients $\{z_n\}$ and recover the mask $Y$ with the inverse Fourier transform. Predicting Fourier coefficients forces the training of the DNN to utilize global context better. As suggested by equation (1), the Fourier coefficients that we predict are obtained by integrating global information along the boundary curve. This forces DNN models to learn the global context of an image better, which facilitates more spatially consistent segmentation.
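The round trip between a contour and its coefficients can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the curve is sampled at $N$ equally spaced parameter values and approximates the integral by the discrete sum $z_n \approx \frac{1}{N}\sum_k \alpha_k e^{-2\pi j n k / N}$, a normalization chosen here for convenience. The function names and the toy ellipse are hypothetical; the 71 sample points and the range $-10 \le n \le 10$ follow the settings reported in the implementation details.

```python
import cmath
import math


def fourier_coefficients(points, max_order):
    """Approximate z_n for |n| <= max_order from equally spaced contour samples."""
    n_pts = len(points)
    coeffs = {}
    for n in range(-max_order, max_order + 1):
        total = 0j
        for k, p in enumerate(points):
            total += p * cmath.exp(-2j * math.pi * n * k / n_pts)
        coeffs[n] = total / n_pts
    return coeffs


def reconstruct(coeffs, num_points):
    """Inverse transform: alpha(t_k) = sum_n z_n * exp(j*n*t_k)."""
    contour = []
    for k in range(num_points):
        t = 2.0 * math.pi * k / num_points
        contour.append(sum(z * cmath.exp(1j * n * t) for n, z in coeffs.items()))
    return contour


# Toy contour (hypothetical): an ellipse centred at 4+3j with semi-axes 3 and 1.
N = 71
contour = [
    complex(4 + 3 * math.cos(2 * math.pi * k / N), 3 + math.sin(2 * math.pi * k / N))
    for k in range(N)
]

coeffs = fourier_coefficients(contour, max_order=10)
recovered = reconstruct(coeffs, N)
max_error = max(abs(a - b) for a, b in zip(contour, recovered))
print(coeffs[0], coeffs[1], coeffs[-1], max_error)
```

Every coefficient is a sum over all contour samples, so no single coefficient can be estimated from a local fragment of the boundary.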
It is usually sufficient to learn to predict only the lower Fourier coefficients, which encode the location and the general shape of the boundary curve $\alpha$. This is because the coefficients $\{z_n\}$ are concentrated on small absolute values of $n$ when $\alpha$ is smooth: in fact, if $\alpha$ is $k$-times continuously differentiable, then $z_n$ converges to 0 faster than $1/|n|^k$ for large $n$. Discarding higher Fourier coefficients can be regarded as a regularization that smooths ground-truth boundary curves. Figure 3 shows segmentation masks obtained by only taking $z_n$ for $-10 \le n \le 10$.
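The decay rate can be justified by a short standard argument. The derivation below assumes that $\alpha$ is $2\pi$-periodic and that the coefficients are normalized as $z_n = \frac{1}{2\pi}\int_0^{2\pi} \alpha(t)\, e^{-jnt}\, dt$; this normalization is an assumption made for the sketch, and a different convention only changes constants.

```latex
% Integrate by parts k times; the boundary terms vanish because
% \alpha and its first k-1 derivatives are 2\pi-periodic.
\begin{align*}
z_n &= \frac{1}{2\pi}\int_0^{2\pi} \alpha(t)\, e^{-jnt}\, dt
     = \frac{1}{jn}\cdot\frac{1}{2\pi}\int_0^{2\pi} \alpha'(t)\, e^{-jnt}\, dt
     = \cdots
     = \frac{1}{(jn)^k}\cdot\frac{1}{2\pi}\int_0^{2\pi} \alpha^{(k)}(t)\, e^{-jnt}\, dt,
     \qquad n \neq 0, \\[4pt]
|n|^k\, |z_n| &= \left| \frac{1}{2\pi}\int_0^{2\pi} \alpha^{(k)}(t)\, e^{-jnt}\, dt \right|
     \;\longrightarrow\; 0 \quad \text{as } |n| \to \infty .
\end{align*}
% The limit is the Riemann-Lebesgue lemma applied to the continuous
% function \alpha^{(k)}, which gives z_n = o(1/|n|^k).
```

This is why truncating to a small range of $n$ loses little for smooth contours, while contours with corners retain more energy in the discarded coefficients.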
2) UP$_\theta$: Probability Distribution of Coefficients: Given a feature extracted from a raw input using a CNN module, UP$_\theta$ generates heatmaps which represent the discrete PDFs of possible Fourier coefficients (i.e. $\{p(z_n \mid x)\}_{n=-k}^{+k} = (\mathrm{UP}_\theta \circ \mathrm{CNN}_\theta)(x)$). The UP$_\theta$ module consists of a 2D transposed convolution layer with $2k + 1$ kernels, followed by a softmax activation across the spatial axes. The 2D transposed convolution layer projects the input feature to a higher spatial resolution; thus, the generated heatmaps are more granular. We apply softmax to normalize each heatmap such that it is non-negative and sums to one.
3) F-DSNT: Selecting the Most Probable Coefficients: Finding the most probable coefficient from each discrete PDF (i.e. $\hat{z}_n = \arg\max p(z_n \mid x)$) is not differentiable. To make it differentiable, we adopt DSNT [7], which can be viewed as a soft-argmax operation. This is done by calculating the expectations of the PDFs. As shown in figure 2, the expectations are calculated by performing a weighted sum of each discrete PDF with the real and imaginary coordinate values.
For the original implementation of DSNT in [7], the PDFs are assumed to have a fixed spatial range. In our model, we multiply the output of our DSNT module with scaling constants estimated by checking the range of each Fourier coefficient in the training dataset. This is equivalent to increasing the resolution of the PDFs for higher Fourier coefficients, which are usually close to zero.
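The heatmap-to-coefficient step can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: it assumes pixel-centre coordinates normalized to $(-1, 1)$, maps the horizontal axis to the real part and the vertical axis to the imaginary part, and uses a hypothetical `scale` argument to stand in for the per-coefficient scaling constant. Function names are hypothetical.

```python
import math


def spatial_softmax(logits):
    """Softmax over all H*W entries of a 2D list, so the heatmap sums to one."""
    peak = max(max(row) for row in logits)          # subtract the max for stability
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def dsnt_complex(heatmap, scale=1.0):
    """Soft-argmax: expectation of a complex coordinate grid under the heatmap."""
    h, w = len(heatmap), len(heatmap[0])
    real = 0.0
    imag = 0.0
    for i, row in enumerate(heatmap):
        y = (2 * i - (h - 1)) / h                   # assumed normalisation to (-1, 1)
        for j, p in enumerate(row):
            x = (2 * j - (w - 1)) / w
            real += p * x
            imag += p * y
    return scale * complex(real, imag)


# A sharply peaked heatmap: the expectation lands on the peak's coordinates.
logits = [[0.0] * 5 for _ in range(5)]
logits[1][3] = 50.0
heatmap = spatial_softmax(logits)
z_hat = dsnt_complex(heatmap, scale=2.0)            # close to 0.8 - 0.8j
print(z_hat)
```

The expectation is a smooth function of the logits, so gradients flow through it, and the operation has no trainable parameters and no dependence on a fixed heatmap size.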
4) Loss Function: Our loss function is a combination of weighted $L_1$ and $L_2$ losses plus a Jensen-Shannon (JS) divergence regularization. Given a batch of $M$ input images $\{x^{(m)}\}$, the loss is computed from our predicted coefficients $\{\hat{z}_n^{(m)}\}$ and from $p(\hat{z} \mid x)$, the PDF generated by our UP$_\theta$ module. The $w_n$'s are weight constants that we introduce to promote the learning of higher Fourier coefficients, which are much smaller than lower coefficients.
The term $\mathrm{JS}(p(\hat{z} \mid x) \,\|\, \mathcal{N}(\hat{z}, \sigma I_2))$ is the JS divergence between the PDF $p(\hat{z} \mid x)$ and the bivariate normal PDF $\mathcal{N}(\hat{z}, \sigma I_2)$ with the same mean. The covariance scale $\sigma$ of the bivariate normal PDF is a hyperparameter. The JS regularization is minimized when the heatmap matches the Gaussian distribution, which encourages the heatmaps of Fourier coefficients to be unimodal and concentrated around the true locations of the Fourier coefficients.
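The JS term can be illustrated with a discrete sketch. The code below is an assumption-laden illustration, not the paper's loss: it builds a target Gaussian heatmap on a grid with coordinates normalized to $(-1, 1)$, treats `sigma` as a standard deviation (the text only says that $\sigma$ scales the covariance), and uses natural logarithms. All function names are hypothetical.

```python
import math


def gaussian_heatmap(h, w, mean_x, mean_y, sigma):
    """Discrete isotropic Gaussian on a (-1, 1) grid, normalised to sum to one."""
    vals = []
    for i in range(h):
        y = (2 * i - (h - 1)) / h
        row = []
        for j in range(w):
            x = (2 * j - (w - 1)) / w
            row.append(math.exp(-((x - mean_x) ** 2 + (y - mean_y) ** 2) / (2 * sigma ** 2)))
        vals.append(row)
    total = sum(sum(row) for row in vals)
    return [[v / total for v in row] for row in vals]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two heatmaps of equal shape; eps guards against log(0)."""
    return sum(
        pv * math.log((pv + eps) / (qv + eps))
        for p_row, q_row in zip(p, q)
        for pv, qv in zip(p_row, q_row)
        if pv > 0
    )


def js_divergence(p, q):
    """JS(p || q) = 0.5*KL(p || m) + 0.5*KL(q || m), with m the average of p and q."""
    m = [[0.5 * (a + b) for a, b in zip(p_row, q_row)] for p_row, q_row in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


target = gaussian_heatmap(8, 8, 0.0, 0.0, 0.2)
shifted = gaussian_heatmap(8, 8, 0.5, -0.5, 0.2)
js_same = js_divergence(target, target)
js_shifted = js_divergence(target, shifted)
print(js_same, js_shifted)
```

The divergence vanishes only when the predicted heatmap coincides with the Gaussian target, and it is bounded above by $\ln 2$, which keeps the regularizer on a fixed scale.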

IV. EXPERIMENTS
A. Evaluation Metrics
Let $Y$ be a segmentation mask, and let $\hat{Y}$ be a mask predicted by a DNN model. To measure model performance, we use both the Dice metric and the Hausdorff distance defined by
$$H(Y, \hat{Y}) = \max\left\{ \max_{y \in Y} d(y, \hat{Y}),\ \max_{\hat{y} \in \hat{Y}} d(\hat{y}, Y) \right\},$$
where $d(\hat{y}, Y)$ is the Euclidean distance from the point $\hat{y}$ to the nearest point of the target $Y$, and $d(y, \hat{Y})$ is defined similarly. The smaller the Hausdorff distance is, the better $\hat{Y}$ approximates $Y$, and $H(Y, \hat{Y}) = 0$ means that $Y$ and $\hat{Y}$ coincide completely. The Dice metric is widely used in evaluating segmentation models. However, the Dice metric is not sensitive to changes in the shape and topology of the masks. This is demonstrated in figure 4, where (a) is the ground truth, and (b)-(d) are three predictions with the same Dice value of 0.9. It is clear that figure 4(b) gives the best segmentation, while the shape of the segmentation in (c) is wrong, and the topology of the segmentation in (d) is wrong. The Hausdorff distance, on the other hand, is more sensitive to changes in shape and topology, and it successfully picks out the best segmentation.
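The contrast between the two metrics can be reproduced on a toy example. The sketch below computes the symmetric Hausdorff distance (the larger of the two directed distances) over foreground pixel coordinates; whether the evaluation uses full masks or only contours is not stated here, so operating on all foreground pixels is an assumption, as are the function names and the toy masks.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def dice(a, b):
    """Dice coefficient between two sets of foreground pixel coordinates."""
    set_a, set_b = set(a), set(b)
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Ground truth: a 4x4 block. Prediction: the same block plus one stray pixel.
truth = [(i, j) for i in range(2, 6) for j in range(2, 6)]
pred = truth + [(9, 9)]

dice_score = dice(truth, pred)
hausdorff_score = hausdorff(truth, pred)
print(dice_score, hausdorff_score)
```

A single disconnected pixel barely moves the Dice score, yet it sets the Hausdorff distance to the full distance between the stray pixel and the object, which is the sensitivity to topology described above.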
1) ISIC-2018: The ISIC-2018 dataset contains 2,594 and 100 dermoscopic images with ground truth segmentation for training and validation, respectively. The test dataset is not publicly available. Hence, following the convention of other papers using ISIC, we report the final evaluation results using 5-fold cross-validation on the training dataset.
2) RIM-ONE-DL: The RIM-ONE-DL dataset consists of 313 and 172 retinographies from normal and glaucoma patients, respectively. All images include a manual segmentation of the disc and cup that has been assessed by experts. The dataset contains 341 and 149 training and testing samples, respectively. As suggested by the dataset provider, we perform a simple train-test split evaluation.

C. Implementation Details
During training and inference, images are resized to 256×256. For data augmentation, we used ColorJitter, random crop, and random flip for the RIM dataset, and we replaced random crop with resizing and random crop for the ISIC dataset. For all experiments, we trained for 500 epochs with a batch size of 8, and we used the Adam optimizer [23] with a learning rate of 3e-4 without weight decay. To generate Fourier coefficients, we sampled 71 points on boundary curves and used the FFT to get the Fourier coefficients, where the model only learns the 21 lower coefficients (i.e. $\{z_n\}_{n=-10}^{+10}$). All our code will be made publicly available upon acceptance.
As shown in Table I, in all instances FCSN achieves a significantly lower Hausdorff score while maintaining a competitive Dice score, which indicates that the shape of the generated mask closely matches the ground truth. We also observe a greater performance gain on the RIM_CUP and RIM_DISC tasks, which have much smoother contours than ISIC. We note that the performance of FCSN improves when we use a DResNet [26] backbone that produces a higher-resolution output. Using a deeper backbone further improves the performance.
2) Robustness to Perturbations: We test the robustness of models to four types of perturbations at inference: Gaussian noise, Salt & Pepper noise, contrast changes, and motion blur. We chose Gaussian and Salt & Pepper noise because they are the most common additive and impulsive noises, respectively. Contrast change and motion blur are typical degradations in medical images. The results are summarized in figure 7, where the level of perturbation increases along the x-axis. Compared with the DeepLab-v3+ (with Lovász loss) and UNETR models, our method is more robust, especially for the two noises, where our method gives almost consistent predictions regardless of the noise level; on the other hand, the predictions of the DeepLab-v3+ model deteriorate heavily as the noise level increases. The metrics of the UNETR model are either similar to those of DeepLab-v3+ or lie between DeepLab-v3+ and our method.
Figure 8 shows examples of segmentation results for images with perturbations. For images with noise or contrast change, the DeepLab-v3+ method omitted large portions of the target areas, and UNETR failed to correctly segment the RIM cup with Salt & Pepper noise, while our method consistently gives reasonable segmentations in all cases. For the image with motion blur, the DeepLab-v3+ and UNETR methods wrongly included a large portion of the background area. All the predictions of DeepLab-v3+ have either the wrong shape or the wrong topology. On the other hand, our method gives satisfactory segmentation results.
3) Global Context Awareness: Here, we empirically show that the two major strengths of FCSN, precise shape prediction and robustness to perturbations, indeed arise from the model's global context awareness. We propose to use the Effective Receptive Field (ERF), initially proposed by Luo et al. [12], as the method to measure the global context awareness of models. The ERF measures how much each input pixel contributes to the model prediction. Mathematically, this is done by computing the partial derivative of an arbitrary output unit $y_i$ with respect to the input tensor $x$, i.e. $\partial y_i / \partial x$, which measures how much $y_i$ changes as $x$ changes by a small amount. The ERF is therefore a natural measure of the importance of $x$ with respect to $y_i$.
Figure 6 shows the comparison of the ERF for various models. We observe that FCSN attains a visibly larger ERF than the baseline models across all tasks, supporting our global context awareness argument.
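The gradient-based definition of the ERF can be illustrated on a toy model. The sketch below is not the measurement pipeline used for figure 6: it replaces backpropagation with finite differences and uses a hypothetical stand-in model (two stacked 3×3 mean filters with zero padding) so that the result can be checked by hand. All names are hypothetical.

```python
def box_filter(img):
    """3x3 mean filter with zero padding on a 2D list."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        total += img[ii][jj]
            out[i][j] = total / 9.0
    return out


def toy_model(img):
    """Stand-in network: two stacked 3x3 filters (theoretical receptive field 5x5)."""
    return box_filter(box_filter(img))


def erf_map(model, img, out_i, out_j, eps=1e-3):
    """|d y[out_i][out_j] / d x[i][j]| for every input pixel, by finite differences."""
    h, w = len(img), len(img[0])
    base = model(img)[out_i][out_j]
    grads = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            bumped = [row[:] for row in img]
            bumped[i][j] += eps
            grads[i][j] = abs(model(bumped)[out_i][out_j] - base) / eps
    return grads


image = [[0.0] * 9 for _ in range(9)]
erf = erf_map(toy_model, image, 4, 4)
support = sum(1 for row in erf for g in row if g > 1e-9)
print(support, erf[4][4], erf[2][2])
```

Even in this linear toy model the gradient is concentrated at the centre of the receptive field and falls off toward its border, which is the gap between theoretical and effective receptive field that the comparison in this subsection relies on.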
4) Computational Efficiency: We compare the computational efficiency of FCSN against baseline segmentation models. Specifically, we measure each model's floating-point operations (FLOPs), inference time (ms/img), training time (ms/img), and number of parameters (M). We compare FCSN with a ResNet50 backbone against UNet, DeepLab with a ResNet50 backbone, and UNETR with a ViT-B-16 backbone. When measuring FLOPs and inference and training time, we set the input size to 256 × 256. The results in Figure 5 show the computational efficiency of FCSN in all four aspects. Compared with the worst-performing model in each aspect, FCSN requires 58% fewer FLOPs, is 8× faster in training and inference, and has 5× fewer parameters. Note that the computational overheads of the Fourier transform and inverse Fourier transform are small: they are equivalent to two 1D convolution layers with a kernel size of 21 and an input size of 21. Empirically, these two transforms only take 0.05 ms/img.
Our model has high computational efficiency because it does not contain a conventional decoder. Most neural-network segmentation models contain decoders with several layers of 2D convolution and up-sampling operations, which introduce a large number of model parameters and heavy computation. On the other hand, our model only contains the encoder, and the prediction of Fourier coefficients is based on the …
The results in Table II show that for the Dice metric, the DSNT approach consistently gives better results, while for the Hausdorff metric, the DSNT approach gives better results in most of the cases.
2) Impact of JS Divergence: We study the effect of the Jensen-Shannon divergence regularization on our model by removing the regularization or by altering $\sigma$ in the covariance $\sigma I_2$ of the 2D Gaussian PDF. As seen from Table III, the introduction of the regularization greatly improves model performance, but our model is not sensitive to the choice of $\sigma$.

V. LIMITATION AND FUTURE WORKS
1) 3D shape learning: MRI and CT scans are 3D in nature. To apply the current FCSN structure to 3D segmentation tasks, the 3D scan must be interpreted as independent slices. However, the independence assumption across the slices could lead to inconsistent mask predictions. As a solution to this, one can generalize our framework by modifying our 2D F-DSNT module to a 3D version of it.
2) High Variance of Higher Frequency Coefficients: Figure 9 shows that FCSN can give accurate predictions for the first Fourier coefficients, but there is a greater mismatch for the higher frequency coefficients. One way to tackle this problem is to introduce multiple heads with various resolutions, where the high-resolution head can promote the learning of higher Fourier coefficients.
3) Multi-object Segmentation Task: To extend FCSN to multi-instance segmentation cases such as multi-organ segmentation, one could regard FCSN as a segmentation head in Mask R-CNN [27].
4) Learning other transforms: FCSN learns to predict Fourier coefficients for segmentation, and it works well for targets with smooth boundaries. However, if the target boundaries contain sharp corners, one may consider modifying FCSN to learn coefficients from more general transforms, such as wavelet or tight frame transforms. The idea is to use a proper family of basis functions that are more efficient in coding boundary curves.

VI. CLOSING REMARKS
In this paper, we propose FCSN, a novel and lightweight segmentation model that segments an object by predicting the Fourier coefficients of the object's contour. Our model is designed to incorporate the global context of an image, leading to more accurate segmentation that better preserves the shape and topology of the object. Moreover, the global context awareness makes our model robust to unseen local perturbations during inference.
Our approach is the first step towards a systematic study of performing segmentation by predicting coefficients of a mask decomposition. There are many other approaches besides predicting Fourier coefficients. For instance, one can use wavelet or tight frame transforms to obtain a more efficient decomposition for boundary curves with sharp corners.

Fig. 4: (a) Ground truth, (b)-(d) three predictions with the same Dice value of 0.9 but Hausdorff distances (smaller is better) of 11.3, 26.9 and 93.3, respectively. Note that the star shape in (b) is smaller than that in (a).

Fig. 9: Plots of the real parts of Fourier coefficients for a batch of 256 images. Blue lines: predictions by FCSN. Red lines: ground truth. Upper: first Fourier coefficients. Lower: 10th Fourier coefficients.

TABLE I: Dice & Hausdorff comparison between FCSN and baseline encoder-decoder models on 3 tasks. The standard deviation (std) is computed from 5-fold results. The best result is in bold.
There are several future research directions that could make the proposed FCSN more robust.

TABLE II: Dice and Hausdorff metrics of our model with DSNT or FC head.

TABLE III: Dice on ISIC for our model with various regularisation.