Wasserstein HOG: Local Directionality Extraction via Optimal Transport

Directionally sensitive radiomic features including the histogram of oriented gradient (HOG) have been shown to provide objective and quantitative measures for predicting disease outcomes in multiple cancers. However, radiomic features are sensitive to imaging variabilities including acquisition differences, imaging artifacts and noise, making them impractical for using in the clinic to inform patient care. We treat the problem of extracting robust local directionality features by mapping via optimal transport a given local image patch to an iso-intense patch of its mean. We decompose the transport map into sub-work costs each transporting in different directions. To test our approach, we evaluated the ability of the proposed approach to quantify tumor heterogeneity from magnetic resonance imaging (MRI) scans of brain glioblastoma multiforme, computed tomography (CT) scans of head and neck squamous cell carcinoma as well as longitudinal CT scans in lung cancer patients treated with immunotherapy. By considering the entropy difference of the extracted local directionality within tumor regions, we found that patients with higher entropy in their images, had significantly worse overall survival for all three datasets, which indicates that tumors that have images exhibiting flows in many directions may be more malignant. This may seem to reflect high tumor histologic grade or disorganization. Furthermore, by comparing the changes in entropy longitudinally using two imaging time points, we found patients with reduction in entropy from baseline CT are associated with longer overall survival (hazard ratio = 1.95, 95% confidence interval of 1.4-2.8, ${p}$ = 1.65e-5). The proposed method provides a robust, training free approach to quantify the local directionality contained in images.


I. INTRODUCTION
C OMPUTERIZED feature extraction has remained a key topic ever since image data was first digitalized.Local directionality is an important feature in image processing.Over the past several decades, multiple different image feature describers have been proposed to extract such information.Steerable filters are synthesized filters of arbitrary orientations from linear combinations of basis filters [1].Scale-Invariant Feature Transform (SIFT) and Histogram of Orientated Gradient (HOG) are two feature describers based on histogram of gradients [2], [3].Image feature describing directionality are widely used for object detection, object matching and network pretraining [4], [5], [6].
More specifically, many directional related features are used in radiomics analysis.The Gabor filter is a linear filter based on the Gaussian kernel and sinusoidal plane wave.It gives an orientation based representation of the image.Wavelets are time and frequency localized functions that can be used to decompose signals into different scales or resolutions, which may also represent directionality [7].The Gray-Level Co-Occurrence Matrix (GLCM) is a method used to calculate the number of occurrences of neighboring pixels with respect to each gray level value [8].It utilizes two parameters: the distance between the pixels and the angle at which the co-occurrences are counted.
The Wasserstein distance is a powerful metric for distributions, or more generally measures, which is known to be robust compared to other metrics (or divergences) given its property of stability to small perturbations in the input data (via weak continuity).It is naturally defined by the optimal cost of transporting one distribution to another, which was first motivated by the civil engineering problem of relocating a pile of soil to an excavation site by Gaspard Monge in 1781 [9], [10], [11], [12].A relaxation was proposed by the Russian mathematician Leonid Kantorovich in 1942 [13], and so the optimal transport problem is many times called the Monge-Kantorovich problem.It gives a distance of the space of probability distributions.
Optimal mass transport methods are widely used in signal processing, machine learning, computer vision, meteorology, statistical physics, quantum mechanics, and network theory [14], [15], [16], [17], [18], [19].Some examples include Zhu et al. and Pouryahya et al. employed distances derived from the optimal mass transport theory to study multiomics networks for cancer subtype clustering [20], [21].A regularized version of optimal transport is utilized to visualize fluid flows in the glymphatic system [22], as well as an unbalanced version of the Wasserstein distance is utilized to identify high-risk normal tissue regions associated with worse mortality from spillover radiation in radiation treatments [23].
In this work, we make use of the robust property of the Wasserstein distance for image local analysis in order to extract directional information from the optimal transport map instead of gradient.The optimal transport based computation increases robustness to noise, making it preferable for a variety of medical image analysis applications.In the following sections, we sketch some of the background on commonly used optimal transport models.Then we first detail our proposed methodology in terms of the continuous case in general and then introduce the discrete case with an additional multi-scale integration step, which we specifically use for image feature extraction.Quantitative comparison and illustrative examples of our method are included in the Experiments section.We conclude this note with some discussion about future work on this topic.The contributions of this work include: • Novel features characterizing local directionality are developed, using an optimal mass transport methodology, which is robust to local image variations and reconstructions required for reproducible and reliable radiomic analysis.
• Our multi-scale recombination which utilize the unique structure of optimal transport further increases the robustness of the method and is able to extract directional features at different scales.
• The parallelized implementation allows for faster computation of the feature calculations.
• We demonstrate ability of our features to prognosticate patient outcomes using magnetic resonance imaging (MRI) and computed tomography (CT) images for brain, head & neck and lung cancers, respectively.

II. BACKGROUND
Optimal transport was originally formulated by Gaspar Monge in 1781 and given a relaxed formulation by Leonid Kantorovich in 1942, and hence is known as the Monge-Kantorovich problem [9], [10].We will use the Kantorovich version in this paper which may expressed as follows: where µ 0 , µ 1 are two absolutely continuous measures on R n , (µ 0 , µ 1 ) denotes the set of all the couplings between µ 0 and µ 1 (measures on R n × R n whose two marginals are µ 0 and µ 1 : In the present work, we will take c(x, y) = ||x − y||.More generally, one may employ any lower semi-continuous function such that c(x, y) ≥ a(x)+b(y) for all x, y, where a ∈ L 1 (µ 0 ), b ∈ L 1 (µ 1 ).In terms of discrete probability densities, the Kantorovich model may be expressed as a standard linear programming problem, which may be solved very efficiently.The discrete form of the Kantorovich optimal transport may be formulated as follows: where ρ 0 ∈ R M + and ρ 1 ∈ R N + are densities of M/N discrete locations of initial/target distributions and their total mass needs to be preserved as a prerequisite ( M i=1 ρ 0 (i) ) is an all-1 vector of length N (M).

A. Continuous Case
In the Monge-Kantorovich problem, is a generalized transport map between two distributions [9], [10], which specifies the amount of mass that is transported from location x in the initial distribution to location y in the target distribution, from which one derives a transport direction.To extract the local directionality, Kantorovich optimal transport between the original distribution and a smoothed one on a given local image patch X is employed.Namely, we consider subject to: where 1 X is the indicator function, and K denotes any mass preserving smoothing filter on X , In the present work we will employ the following simple filter: By considering the transport distance (3), we can quantitatively measure how different the patch is from its feature removal/smoothed version.Notice that the optimal transport also gives an optimal transport map , from which one derives the probability (x, y) of each point x ∈ X to be transported to any other point y ∈ X .The displacement vector y − x together with the probability gives a natural directionality at each point.By collecting all the sub-transport work in the same direction, we can define a distribution ρ on S n in terms of directionality of the patch X as follows: which is a decomposition of the total transport work into all possible directions.Proposition 1: W ,X (θ) is decomposition of the total transport in (3): Proof: The latter result indicates that W K ,X decomposes the transport distance into different directions.

B. Discrete Case
For application of our directionality distribution, we employ a discrete version of (6) to extract the local directionality of images.We choose to use the same output format as the HOG because they extract the same type of local information.Because of the similarity, we call our method Wasserstein HOG (WHOG).Note that we only borrow the name HOG because of the similar information extracted.Our method does not depend on the gradient.We consider the following discrete version of Kantorovich optimal transport: subject to: ⃗ where X n is a local n × n image patch and ρ 0 ∈ R n 2 + is the flattened image intensity on that patch.An iso-intense patch with the mean value of ρ 0 as intensity of all pixels on that patch is used as taken as the standard pairwise distance matrix.
Local image directions are extracted from the matrix W = C . . .* by regrouping the entries into bins where * is the optimal solution to the optimization (Fig. 1).Each entry of W has a corresponding orientation given by the transport starting point and end point locations.For the (i, j) entry of W (i ̸ = j), the i th location in the density vector corresponds to a coordinate (u 1 , v 1 ) in the original n×n grid and the j it location in the density vector corresponds to another coordinate (u 2 , v 2 ) in the original n×n grid.The directionality of the sub-transport route is given as: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
We evenly divide [0, π] into n b bins and add up all the w i j values in their corresponding bins.So each bin contains all the sub-work of the optimal transport in its corresponding direction.The n b -vector is a representation of local directionality.The sum of n b values in the feature vector coincides with the sum of all w i j , which is the Wasserstein distance (total work) by the same reasoning as Proposition 1.On the other hand, we decompose the distance into components in the n b directions.This gives a similar feature vector that characterizes local directionality as in HOG, but is more robust.From a different perspective, optimal transport has been used in registration [16].The proposed feature may be understood as a summary of the directions of corresponding points of a generalized local registration from each local patch to an isointense patch, where the "registration" based on the optimal transport method ( 8) is not 1-to-1, but 1-to-n.

C. Discrete Case With Multi-Scale Recombination
Multi-scale is a commonly used technique in image processing.In this current work, we find it extremely useful to use different patch sizes.To deal with the problem of aligning patches of various sizes and centered in different locations, we first extract directionality vectors on the level of pixel and then recombine together, making a minor change from the method of the last section (III-B), which has the same form of output as before.But instead of using one fixed patch size n, we use multiple patch sizes and multiple patch starting positions: (n 1 , p 1 ), (n 2 , p 2 ), . . ., (n i , p i ), . . ., (n k , p k ), where n i ∈ Z is a patch size and p i ∈ {1, 2, . . ., n i } × {1, 2, . . ., n i } ⊂ Z 2 is to shift the grid of all patches.The Kantorovich optimal transport problem in the same form of (8) with different patch parameters is solved.Instead of deriving just one n b directionality vector from the entire matrix W for that patch, n 2 i of n b −vectors for each pixel are extracted from each row of the matrix W . Noting that each row of the matrix W specifies the optimal transport plan for a pixel on that patch, the n 2 i vectors represent the directionality on all of the n 2 i pixels of that patch.For that purpose, we further decompose the matrix W in an even finer level.To get the same output format as before, for a given output patch size n, we simply sum all n 2 × k n b −vectors on each output patch.So the final output contains information of a neighborhood in the perspective of multi-scale.

D. Implementation of Discrete Case
Because of the translational invariance of the pairwise distance within a given patch, the cost matrix C is the same for a fixed n i .It is the pairwise distance matrix of all the lattice points, and it is constructed beforehand to save computational cost.In addition, the n b mask matrices M j 's for each directional bin are also constructed.They are also the same for a fixed value of n i .The main step is solving a linear programming problem (2) for each patch from an input image I .The 2D intensity is vectorized and reformulated as a standard linear programming problem, which is solved via the dual-simplex algorithm [24].The output solution of the transport matrix is then decomposed into n b directions using mask matrices, and the pixelwise transport cost is computed by summing each row of the transport matrix, which is stored in the n b matrices (I P V i (:, :, 1), . . ., I P V i (:, :, n b )) of the same size as the original image I .I P V i 's from different patch parameters (n i , p i ) sum up to be the multiscale pixelwise directionality matrices I P V .Finally, the features are output as the sum on each output patch.We implemented our proposed method in MATLAB.For the linear programming part, the built-in linprog function is used.Parallelization was utilized for speed-up by taking advantage of the independence of computations on different scales and different patches (See Algorithm 1).
Algorithm 1 Wasserstein HOG Inputs: Gray scale image I , patch size n, number of bins n b Parameters: Multi-scale parameters (n 1 , p 1 ), . . ., (n k , p k ) Outputs: Directionality vector field i for all transport directions within the range of j'th directionality bin 5: end for 6: for each patch X in I starting from pixel p i with step = n i (in parallel) do 7: Get intensity ρ on patch X 8: Construct mean value patch ρ m 9: Solve (8) via linprog for X using C, ρ, ρ m 10: for 1, …, j,…, n b do 12: Assign j'th pixel-wise directional work portion to each pixel on X : end for 14: end for 15: end for 16:

A. Comparison With Conventional HOG
A brief overview of extracting the standard histogram of oriented gradients (HOG) features is provided for completeness.The first step involves computation of the image gradients, which typically uses the Sobel image filters, described as: to compute the x and y coordinate derivatives (g x , g y ) for each pixel.The directionality of a pixel located at {x, y} is Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.computed using the direction of the gradient vector: Second, patch histograms of oriented gradients are computed by binning the pixels within the patch using θ values, with the number of bins corresponding to the chosen discretization for the gradient θ.The sums of gradient vector norms g = g 2 x + g 2 y present in all bins are then combined to form a feature vector of that patch.Feature vectors of neighboring patches are normalized together to alleviate artifacts caused by light variations.Patch sizes used to construct the HOG features were determined empirically and was set to 8 by 8.
Of note, the computation of directionality (9) using WHOG is very different from HOG (10).Specifically, the directionality of features using HOG is based on image gradient, which is sensitive to image noise.WHOG, on the other hand, considers constituents of all the possible sub-transport routes traversing the grid points between all pairs of pixels to compute the direction, thus making WHOG more robust to image noise than the HOG approach.The robustness of WHOG to noise is illustrated by the toy examples in Fig. 2, wherein both WHOG and HOG perform similarly well without noise.However, when more noise is added (Gaussian noise with mean= 0, σ = {0, 0.01, 0.01, 0.1, 0.15}), WHOG computed directions show very little change.HOG on the other hand, gradually loses directionality until the histogram is evenly distributed to all the bins with the highest noise levels.The qualitative performance of the two methods for extracting the gradient orientations using rose plots within the tumor and at its boundary (yellow box in Fig. 3) shows that WHOG (Fig. 3b) extracts both the large and small differences in edge orientations with stronger resolution of differences compared to the HOG method (Fig. 3a).Furthermore, clear edge orientations are lost with HOG within the tumor, but continue to be extracted for the WHOG, indicating that WHOG may be better at extracting clear directionality even inside tumors.

B. Quantifying Image Tumor Heterogeneity Using WHOG
Imaging tumor heterogeneity has been shown to be associated with biological heterogeneity by multiple radiomics studies [25], [26], [27].In particular, radiomic features constructed with oriented edges from CT as well as MRI scans were shown to be associated with survival and treatment outcomes in solid cancers [28], [29].Hence, we asked whether the imaging tumor heterogeneity quantified as average entropy of the local feature directionality (described in the subsequent subsection) is associated with survival and treatment outcomes in three different solid cancers, namely glioblastoma multiforme (GBM) in the brain, oro/nasopharynx cancer in the head and neck, and non-small cell lung cancer (NSCLC) using three separate datasets.Two of the three datasets were sourced from the publicly available the Cancer Imaging Archive (TCIA) and previously used in radiomic studies for brain [30], head and neck cancer [6].The third dataset in NSCLC patients treated with immunotherapy was previously used for evaluating deep learning segmentation of lung tumors [31].Description of the analyzed datasets and the outcomes are provided in the subsequent subsection.
1) Entropy Formulation: Because we are only interested in the flows within the tumor, we utilized the tumor masks and considered transport problem only within the tumor region: where both the original patch and the iso-intense patch are taken to be within the tumor region.The following is the same Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
set-up as in the section III-C.By considering the sub-transport route direction derived from (9), a vector v p ∈ R n b which represents the directionality of patch q is computed using multi-scale recombination.Notice that which gives a measure of the strength of directionality of a given patch, which may not sum up to be 1.The local directional distribution is defined as the normalized v q : Finally, entropy is defined as the average of all patches within the tumor region (the patch intersects more than 80% of mask area): where Q is the set of all feasible patch q.Finally, CoLlAGe [28], a recently developed radiomic measure quantifying imaging heterogeneity using histogram of oriented gradients together with cooccurrence matrix formulation was evaluated as a benchmarking metric to assess the gains achieved using WHOG as a novel radiomic measure of imaging tumor heterogeneity.
We used n = 8 and n b = 9 for both WHOG and HOG in the following tests, which is the default setting of HOG from MATLAB.For the latent multi-scale parameters, we used (4, (1, 1)), (4, (3, 3)), (8, (1, 1)) and (8, (5, 5)) -four different sets of patches, which are patches starting from origin or the first patch center for both patch sizes equal to 4 or 8.
We used the implementation of CoLlAGe from CERR with all default parameters [32].

C. Patient Datasets Description
1) Brain GBM MRI Dataset: This cohort consisted of 50 patients diagnosed with GBM and arose from a multiinstitutional open-source dataset sourced from 28 different institutions and provided by the TCIA [33] and was used in our prior work on radiomics analysis [30].In order to limit the number of analyzed features, we restricted the analysis to fluid attenuation inversion recovery (FLAIR) images provided by the cohort alone as these depict tumors as well as edema outside of the tumor.Tumor features were extracted within segmentations generated using a semi-automatic method and available from prior publication [30].
In addition to the imaging data, TCIA also provides matched overall survival (OS), and relevant clinical data for these patients, which were used in the univariate and multivariable Cox proportional hazards regression model to measure associations of WHOG and HOG features with OS.
2) Head and Neck CT Dataset: The head and neck dataset consisted of 136 patients diagnosed with head and squamous cell carcinoma treated with radiotherapy and was used in a large radiomics analysis paper by Aerts et al. [6] and made available together with the gross tumor volume segmentations in the TCIA.All CT scans were acquired in the MAASTRO clinic, the Netherlands.
The standardized pre-processing steps were applied as described previously in [34].In order to increase contrast on CT, all image slices were subject to intensity clipping, with the clip range values corresponding to those that covered the mean intensity of the tumor regions.The same clipping upper and lower bounds of 1000 Hounsfield unit (HU) to 1200 HU were applied.The clipping range was set to 200 HU.
Together with the CT scans, volumetric segmentations produced by the radiation oncologist and relevant clinical data such as the disease free survival, OS, biological sex at birth, human papilloma virus (HPV) status, age, and other confounding clinical variables were also provided with the dataset and were used in the univariate and multivariable Cox proportional hazards regression analyses.
3) Lung Immunotherapy Longitudinal CT Dataset: 143 stage III-IV NSCLC patients treated with immunotherapy who underwent serial CT imaging at baseline (pre-treatment) and 9 weeks after treatment initiation were analyzed.This dataset was used in our prior segmentation studies and the segmentation results were utilized in this study [31], [35].
Clinical outcomes included progression free survival (PFS), defined as the time from the date of primary surgery to the date of documented first recurrence on the basis of findings on a CT scan, physical examination results, or death prior to recurrence, and OS defined as the time interval between surgery and the date of death or censure.In addition, clinical variables such as patient age, biological sex at birth, and smoking status were also available for measuring associations of WHOG and HOG features using univariate and multivariable Cox proportional hazards regression.
4) RIDER Lung CT Dataset: The RIDER Lung CT collection consists of 31 patients with non-small cell lung cancer [36].It was constructed as part of a study to evaluate the variability of tumor unidimensional, bidimensional, and volumetric measurements on same-day repeat computed tomographic (CT) scans.Each patient who underwent two chest CT scans within 15 minutes using the same imaging protocol, were included in this study.Three radiologists independently measured the two greatest diameters of each lesion on both scans, and during another session, measured the same tumors of the first scan [37].

D. Comparing Methods
We compared our proposed method with CoLlAGe [28], HOG [3], wavelet [6], Gabor [38] and GLCM [8].We used the implementation HOG of extractHOGFeatures function from the Image Processing and Computer Vision package.CoLlAGe, wavelet, Gabor and GLCM features are extracted via the implementation from CERR [32].We chose the output patch size [8,8] from the same default setting from both HOG and CoLlAGe (the same for WHOG).For wavelet, Gabor and GLCM features, we used the standard feature extraction pipeline which all parameters are set to in accordance with the IBSI guidelines [39].

E. Statistical Analysis
Statistical evaluation of the association of the entropy of WHOG, CoLlAGe, HOG, wavelet, Gabor and GLCM features was performed with the available outcomes for each dataset using univariate Cox proportional hazards regression.Multivariable Cox proportional hazards regression was also done by adjusting for known clinical variables including age, biological sex at birth, smoking status, and HPV status, as appropriate and available for the different datasets.Adjustment for multiple comparisons between the three features was done using Bonferroni correction.Only p-values with p < 0.05 were considered significant.Measurement of association was performed using dichotomized low-risk (≤ median of entropy feature) and high-risk (> median of entropy feature) groups.Kaplan-Meier (KM) survival analysis was performed using the dichotomized patient curves.Population median values were chosen as an unbiased cutoff to dichotomize patients and avoid need for calibrating the best cut-off value from the individual datasets.

A. Brain GBM: Association of Entropy Features With Overall Survival
Results of univariate and multivariable (adjusted for age, tumor volume, biological sex at birth and MGMT methylation) Cox proportional hazards regression using the dichotomized high and low-risk groups determined using entropy of WHOG, CoLlAGe, and HOG features are shown in Table I.
KM analysis showed that entropy of WHOG resulted in a stronger association with OS ( p = 0.027) compared with entropy of CoLlAGe ( p = 0.047) and entropy of HOG ( p = 0.106) features, the latter of which was not associated with OS.Fig. 4a, Fig. 4f and Fig. 4e show the KM curves for the entropy of WHOG, CoLlAGe, and HOG features, respectively.Fig. 6 shows two representative patients randomly selected from low and high-risk groups as determined using the WHOG entropy feature.
1) Robustness of Features to Noise: We tested the robustness of the extracted features to noise and the impact of added noise in measuring the association with OS by adding zero-centered Gaussian noise as well as Rician noise [40], [41].WHOG features showed less variability measured using coefficient of variation for the different noise levels.Importantly, the WHOG features extracted at the different noise levels remained associated with OS for larger amounts of added noise when compared to other features, indicating the stability of WHOG to added image noise (see Table II).On the other hand, CoLlAGe features were not associated with OS with any added noise.Wavelet features gave the smallest p-value of Rician noise σ = 10 −3 , with the high risk group and low risk group reversed from the original case.The result could be caused by random effects (see Table II).
2) Robustness to Scanner Manufacturers: The robustness of the WHOG and other features computed from MRI acquired from different scanner manufacturers was evaluated between GE and non-GE scanners as well as between 1.5 Tesla vs. 3 Tesla magnet scanners.The WHOG features showed no statistical difference (Wilcoxon rank sum test) in the computed values for either the scanner manufacturer (GE vs. non-GE: p = 0.7510) or the magnet strengths (1.5 Tesla vs. 3.0 Tesla: 0.9772) as did the other features.As seen from the results, HOG and wavelet showed a (marginal) dependence on the magnet strengths indicating reduced robustness compared to other methods.

3) Comparison of Patients Assigned to High and Low Entropy
Groups by WHOG and CoLlAGe: We compared patients assigned to different risk groups by WHOG and CoLlAGe methods (see Table IV).Fisher exact test of the patient risk groups showed significant agreement in the assigned risk groups ( p = 7.26e − 06).Spearman rank correlation Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

B. Head & Neck Cancer: Association With Overall Survival
Results of univariate and multivariable (adjusted for age, biological sex at birth, tumor volume and overall HPV p16 status) Cox proportional hazards regression using the    borderline association when using CoLlAGe ( p = 0.046) and no association for HOG entropy based patient dichotomization ( p = 0.386).Fig. 7 and Fig. 9 show KM survival curves for the WHOG and CoLlAGe entropy based patient groups, respectively.Fig. 8 shows two representative patients randomly selected from low and high-risk groups as determined using the WHOG entropy feature.

1) Comparison of Patients Assigned to High and Low Entropy
Groups by WHOG and CoLlAGe: The WHOG entropy and CoLlAGe entropy features were strongly correlated (Spearman rank correlation coefficient of ρ = 0.7143) with a high agreement in the grouping of patients into the high and lowrisk groups.However, WHOG entropy was more strongly associated with OS compared to the CoLlAGe entropy as shown in Table V. Fig. 10 shows two examples where WHOG and CoLlAGe resulted in opposite categorization of patients into low and high-risk groups.In particular, CoLlAGe-based dichotomization grouped the image in Fig. 10a into a low-risk group while WHOG placed this patient in the high-risk group and the patient was associated with a shorter OS of 16 months.grouped this patient in the low-risk group and the patient was associated with a longer OS of 135 months.

C. Lung Cancer Treated With Immunotherapy: Association With Overall Survival Using Longitudinal CT Scans
Different from prior disease sites, we studied whether longitudinal changes in the entropy features extracted from the CT scans using WHOG and other features are able to predict OS.Longitudinal analysis using the change in the radiomic features requires features that are robust to imaging differences in order to capture real changes resulting from image noise and acquisition differences.Change in entropy was computed between the pre-treatment CT scan (S 0 ) and the first CT scan acquired 9 weeks after starting treatment (S 1 ) as the difference in average entropy computed within segmented tumors or S = S 1 − S 0 .
As shown in    change with OS are shown in Fig. 11.Additionally, we studied whether the WHOG entropy features computed from the individual CT (baseline and 9-week) scans showed an association with OS.As shown in Fig. 12a, the features computed from the baseline image did not show an association with OS ( p = 0.444), but those computed after the first treatment scan (Fig. 12b) did ( p = 0.0172).On the other hand, the delta WHOG entropy feature showed a stronger association ( p < 0.0001) compared to the WHOG entropy computed from the 9-week CT.Of note, the high-risk patients were those in whom the delta entropy increased with the WHOG feature, indicating higher imaging heterogeneity within tumor after treatment (median of 0.0492 and inter-quartile range (IQR) of 0.0450).On the other hand, patients grouped as low-risk using the delta WHOG feature had a decrease in the entropy (median of -0.0632 and IQR of 0.1046).We next evaluated the association of patients in the high-risk S W H OG > 0 vs. low-risk S W H OG ≤ 0 groups with disease response, namely, partial or complete response (PR/CR), stable disease (SD), and progression of disease (POD).Fisher exact test showed a significant difference ( p = 3.78×10 −6 ) between the two groups, showing a larger proportion of POD in the group with increased tumor heterogeneity than the group with decreased WHOG entropy (Table VII).On the other hand, patients who had a reduction in WHOG entropy, had a higher prevalence of stable disease.

D. RIDER Lung CT Dataset: Reproducible and Robust Image Feature
Figure 13 shows the feature values (normalized by mean and standard deviation to align different features into the same range) from two CT scans that were taken 15 minutes apart.
Robustness as a Repeatable Feature: The robustness of the proposed WHOG image feature and other features in terms of repeatability was measured by the Pearson correlation coefficients and p-values between the output from two time points.To mitigate the effect of WHOG using multi-scale, single-scaled features of WHOG are extracted compared to HOG and CoLlAGe features with the same scale (patch size).

VI. CONCLUSION AND FUTURE WORK
The present work makes use of optimal transport for the local analysis of general distributions.We explore the expressions under three commonly used optimal transport models.In particular, we utilize the discrete Kantorovich version for image directionality feature extraction.Because of the robust nature of optimal transport, we find our method performs well even in the presence of noise.The main reason for this is that our approach does not use gradients, and the underlying Wasserstein metric is continuous to perturbations in the data.
We reached a similar conclusion about image directionality and tumor heterogeneity as in [28].They performed classifications based on their HOG based feature vectors.In principle, we can replace their HOG extracted feature by our WHOG feature.By doing this, we hope that the classification accuracy may be improved.
One area of improvement for the proposed algorithm concerns the speed of computation.We noticed in our experiments, that the WHOG method took more time to compute than the other tested methodologies.To test feasibility, our implementation only used standard dual-simplex algorithm of linear programming.There are a number of other algorithms for the step of solving the Monge-Kantorovich problem, which should lead to a better performance [42], [43].
We plan to further evaluate ability of our method to study pathology images [44], [45], [46], [47].The microscopic features in these high resolution images contain information for which our proposed method may better capture directionality, and thus distinguish different biological processes occurring in different tissues.Our preliminary results indicate that our WHOG extracted feature may indeed be correlated with tumor stage and recurrence (See an example in Figure 14).
The WHOG feature has the potential to be integrated into complex clinical applications.We have demonstrated its ability to quantify tumor heterogeneity and its applicability to several modalities of medical imaging data.It may be utilized in conjunction with other commonly used image features for patient prognosis prediction.We have indicated that the WHOG directionality feature, extracted from image data, represents an attractive robust alternative to the more standard HOG or CoLlAGe techniques.It is interpretable, carries prognostic information, and may be incorporated into a machine learning framework to enhance performance.
The method can be easily modified to the case of color imagery by looking at the transport map between a local cube and a pure color cube.We also want to test on the 3D extension so that the directionality is within the transverse plane, frontal, and sagittal directions.Finally, we plan to consider certain possible connections between the extracted WHOG features and the fractal dimension of the tumor especially for MRI data where textural information is more clearly exposed.

Fig. 1 .
Fig.1.An illustration of the proposed WHOG method.Local patches of varying sizes and positions are extracted from within the tumor region of an image.For each one, the optimal transport problem of transporting the local patch to a mean value patch is computed.The resulting optimal transport matrix gives a map matrix, each entry of which represents a sub-transport work value with a clearly defined direction.The values of sub-transport work in the matrix are decomposed into an n b binned vector, which represents the directionality of the patch.Multiscale feature recombination is employed to gather information from different scales, and directionality vectors are used to compute entropy on each patch.Finally, the output WHOG feature is the average entropy of all the patches within the tumor region.

Fig. 2 .
Fig. 2. HOG vs WHOG under different noise levels: for both methods, we use 8 by 8 patch and 9 bins.

Fig. 3 .
Fig. 3.An example of a brain GBM slice from a TCIA dataset.(a): Rose plot of the HOG with a zoom-in view of the tumor region, (b): Rose plot of the WHOG with a zoom-in view of the same tumor region.

Fig. 4 .
Fig. 4. Kaplan-Meier plots of survival difference between the high entropy group and the low entropy group of brain GBM patients of method WHOG (a), CoLlAGe (b), HOG (c), Wavelet (d), Gabor (e) and GLCM (f).

Fig. 5 .
Fig. 5. Two examples of WHOG and CoLlAGe with divergent grouping of patients into low vs. high-risk groups: (a) WHOG based categorization was high-risk but CoLlAGe categorization was low-risk, (b) CoLlAGe categorization was high-risk but WHOG categorization was low-risk.

Fig. 7 .
Fig. 7. Kaplan-Meier plots of survival difference between the high entropy group and the low entropy group of head and neck patients of method WHOG.

Fig. 8 .
Fig. 8. Central tumor slices from representative patients chosen from high-risk ((a), (b) and (c)) and low-risk ((d), (e) and (f)) groups as selected using WHOG entropy feature.CT image slices with tumor contour are shown in (a) and (d).Rose plots showing the directionality of image gradients as computed using WHOG are shown in (b) and (e).Finally, heatmaps of voxel-wise WHOG entropy for the two patients are shown in (c) and (f).
Another representative example where the CoLlAGe categorized the image into the high-risk group, although the presence of homogenous sub-regions is shown in Fig 10(b).WHOG

Fig. 9 .
Fig. 9. Kaplan-Meier plots of the high entropy group and the low entropy group from CoLlAGe in head and neck patients.

Fig. 10 .
Fig. 10.Two examples of WHOG and CoLlAGe with divergent grouping of patients into low vs. high-risk groups: (a) a patient with a low OS of 15 months and WHOG correctly classified the patient into high-risk but CoLlAGe classified the patient as low-risk, (b) a patient with a longer OS of 135 months, which was correctly categorized as low-risk using WHOG but incorrectly using CoLlAGe as high-risk.

Fig. 12 .
Fig.12.Kaplan-Meier plots of association with OS computed using dichotomized patients using (a) (low-risk as ≤ S 0 WHOG vs. high-risk as > S 0 WHOG ) and (b) (low-risk as ≤ S 1 WHOG vs. high-risk as > S 1 WHOG ), where the computed features are WHOG entropy measures, 0 corresponds to features extracted within tumor from baseline CT and 1 corresponds to features extracted within tumor from 9-week CT scans, respectively.

Fig. 14 .
Fig. 14.An example of extracting WHOG feature from a pathology image: Centroids of different cell types are extracted.An optimal transport map is used to compute directionality on a patch following a similar pipeline as done for other image types.

TABLE I UNIVARIATE
AND MULTIVARIABLE COX PROPORTIONAL HAZARDS REGRESSION ANALYSIS ON BRAIN GBM MRI DATA.HR: HAZARD RATIO; CI: CONFIDENCE INTERVAL

TABLE II LOG
RANK p-VALUES OF WHOG, CoLlAGe, HOG, WAVELET, GABOR AND GLCM FEATURES MEASURING ASSOCIATIONS WITH OVERALL SURVIVAL AND COMPUTED WITH INCREASING NOISE LEVELS OF GAUSSIAN NOISE(G) AND RICIAN NOISE(R) TABLE III WILCOXON RANK SUM TEST ON DIFFERENCES OF SCANNER MANUFACTURERS AND MAGNETIC STRENGTH

TABLE IV COMPARISON
OF WHOG LOW/HIGH ENTROPY GROUPS VS CoLlAGe LOW/HIGH ENTROPY GROUPS dichotomized high and groups determined using entropy of WHOG, CoLlAGe, and HOG features are shown in TableV.WHOG entropy based patient dichotomization resulted in a significant association with OS ( p = 0.009) compared to a Table VI, only the change in WHOG entropy was associated with OS in both univariate and multivariable (adjusted for age, biological sex at birth, smoking, treatment, tumor volume and tumor volume change) Cox proportional hazards regression analysis.HOG entropy based univariate analysis showed an association with OS, but it was not associated in the multivariable analysis.Furthermore, WHOG entropy resulted in a stronger association, indicated by smaller p-values than HOG entropy.Different from the other two datasets, CoLlAGe showed no association with OS.The KM curves showing the association of the WHOG entropy Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE VI UNIVARIATE
AND MULTIVARIABLE COX PROPORTIONAL HAZARDS REGRESSION ANALYSIS PERFORMED ON THE LUNG CANCER DATASET USING LONGITUDINAL OR DELTA ENTROPY FEATURES.HR: HAZARD RATIO; CI: CONFIDENCE INTERVAL

TABLE VII RESPONSE
COUNTS AND PERCENTAGES OF THE ENTROPY INCREASE GROUP, THE MINOR CHANGE GROUP AND THE ENTROPY DECREASE GROUP FROM THE BASELINE IMAGES TO THE 9-WEEK IMAGES AFTER THE START OF IMMUNOTHERAPY.POD: PROGRESSION OF DISEASE, SD: STABLE DISEASE, PR: PARTIAL RESPONSE, CR: COMPLETE RESPONSE Fig. 13.Scatter plots of normalized features extracted from two CT scans of HOG vs WHOG, CoLlAGe vs WHOG, Wavelet vs WHOG and Gabor vs WHOG.
Table VIII shows that Pearson correlation coefficients and p-values between two feature outputs.We expect robust radiomic features to have similar values in two consecutive CT measures.So higher correlation coefficients of WHOG features than other methods indicates that it is indeed a robust feature.

TABLE VIII PEARSON
CORRELATION COEFFICIENTS AND P-VALUES BETWEEN TWO FEATURE OUTPUTS FROM CT SCANS OF 15 MINUTES APART.THE NUMBERS IN APPRENTICES ARE THE PATCH SIZES USED IN COMPUTATION.IN PARTICULAR, WHOG(4+8) IS THE MULTI-SCALE VERSION OF WHOG FEATURE USED IN EXPERIMENTS OF OTHER DATA SETS.WAVELET, GABOR AND GLCM PARAMETERS ARE DEFINED ACCORDING TO IBSI GUIDELINES