GPU Accelerated Polar Harmonic Transforms for Feature Extraction in ITS Applications

Polar Harmonic Transforms (PHTs) are a family of transforms whose kernels are basic harmonic waves, which makes them well suited to Intelligent Transportation System (ITS) applications. PHTs consist of the Polar Complex Exponential Transform (PCET), the Polar Cosine Transform (PCT) and the Polar Sine Transform (PST). PHTs extract orthogonal and rotation-invariant features and have demonstrated superior performance in various image processing and computer vision applications. For real-time systems and large multimedia databases, execution efficiency is always a significant challenge. Given the widespread use of the Graphics Processing Unit (GPU), this study presents GPU-based PHTs. The proposed methods build on the mathematical properties of PHTs and on GPU optimization techniques. Optimal parameter selection for GPU execution is also discussed. In our experiments, the proposed methods are over 1800 times faster than their CPU counterparts.

Polar Harmonic Transforms (PHTs) consist of the Polar Complex Exponential Transform (PCET), the Polar Cosine Transform (PCT) and the Polar Sine Transform (PST) [15]. PHTs extract orthogonal and rotation-invariant features and have demonstrated superior performance in ITS tasks. Farnoosh and Ali [16] use PCT coefficients to correct uncertain labels and obtain more accurate body reconstruction. Al-asady and Al-amery [17] obtain accurate features for human action detection. Lin et al. [18] and Liu et al. [19] propose region duplication detection schemes based on feature point mapping.
This paper focuses on GPU-based Polar Harmonic Transforms (GPHTs), which consist of the GPU-based Polar Complex Exponential Transform (GPCET), Polar Cosine Transform (GPCT) and Polar Sine Transform (GPST). To utilize the parallel computational capability of the GPU, we implement the proposed methods with the Compute Unified Device Architecture (CUDA) [43], [44]. The mathematical properties of PHTs are also exploited in the CUDA implementation. Different GPU execution configurations lead to different running times, so optimal parameter selection is evaluated as well.
The organization of this paper is as follows. The mathematical definitions of PCET, PCT and PST are given in Section 2. The proposed methods are presented in Section 3, after introducing the GPU memory structure and parallel execution model. In Section 4, the performance of the proposed methods is evaluated on different images; the experimental results illustrate that the proposed methods are effective. Finally, Section 5 concludes this study.

II. POLAR HARMONIC TRANSFORMS
This section introduces PHTs; for further details, refer to [15].

A. POLAR COMPLEX EXPONENTIAL TRANSFORM (PCET)
Given a 2D image function f(x, y), it can be transformed from Cartesian coordinates to polar coordinates f(r, θ) by

r = \sqrt{x^2 + y^2}, \quad \theta = \arctan(y / x),

where r and θ denote the radius and azimuth, respectively. PCET is defined on the unit disk r ≤ 1, where the image can be expanded with respect to the basis functions H_{nl}(r, θ) as

f(r, \theta) = \sum_{n=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} M_{nl} H_{nl}(r, \theta),

where the coefficient is

M_{nl} = \frac{1}{\pi} \int_0^{2\pi} \int_0^1 [H_{nl}(r, \theta)]^{*} f(r, \theta)\, r \,\mathrm{d}r \,\mathrm{d}\theta.

The basis function is given by

H_{nl}(r, \theta) = R_n(r)\, e^{il\theta},

where R_n(r) = e^{i2\pi n r^2}.
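As a concrete illustration, the PCET coefficient definition above can be sketched as a CPU reference implementation; the GPU methods in Section 3 parallelize exactly this computation. The function name and the midpoint discretization over the inscribed unit disk are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def pcet_coeff(f, n, l):
    """PCET coefficient M_{n,l} of a square image f, computed by
    direct numerical integration over the inscribed unit disk."""
    N = f.shape[0]
    # Map pixel centers into [-1, 1] x [-1, 1].
    c = (2.0 * np.arange(N) + 1.0) / N - 1.0
    x, y = np.meshgrid(c, c)
    r2 = x ** 2 + y ** 2
    inside = r2 <= 1.0                      # PCET support: r <= 1
    theta = np.arctan2(y, x)
    # Conjugated basis [H_{n,l}]^* = e^{-i(2*pi*n*r^2 + l*theta)};
    # dx dy = r dr dtheta, so the Jacobian r is already accounted for.
    kernel = np.exp(-1j * (2.0 * np.pi * n * r2 + l * theta))
    pixel_area = (2.0 / N) ** 2
    return np.sum(f[inside] * kernel[inside]) * pixel_area / np.pi
```

For a constant image f ≡ 1, M_{0,0} approaches 1 and the other coefficients approach 0 as the resolution grows, which is a convenient sanity check for any reimplementation.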

III. PROPOSED METHOD
A. GPU ARCHITECTURE
A GPU contains Streaming Multiprocessors (SMs). Parallel threads on an SM are grouped into a thread block, and thread blocks are grouped into a grid that corresponds to a CUDA kernel function call in a GPU program [43]. Each block in a grid has its own block id. The number of threads in a thread block and the number of thread blocks in a grid can be specified, and they also impact computational efficiency [44]. A GPU has several memory spaces, such as registers, local memory, shared memory, global memory, constant memory and texture memory, as shown in FIGURE 1.
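The grid and block sizes described above determine how many threads a kernel launches. A minimal sketch of the launch-configuration arithmetic, assuming one thread per pixel of an N × N image (the helper name is ours):

```python
def launch_config(N, block_dim):
    """Number of thread blocks needed so that N*N threads
    (one per pixel) are covered by blocks of block_dim threads."""
    total_threads = N * N
    grid_dim = (total_threads + block_dim - 1) // block_dim  # ceiling division
    return grid_dim, block_dim

# e.g. a 2048 x 2048 image with 128 threads per block
grid, block = launch_config(2048, 128)
```

The ceiling division guarantees full coverage even when N*N is not a multiple of the block size; surplus threads in the last block simply return early.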

B. DIRECT COMPUTATION METHOD ON GPU
Each GPU thread handles one pixel, so parallel threads significantly accelerate PHTs. The following built-in variables are available on the GPU: gridDim.x is the number of thread blocks in the grid, blockIdx.x is the index of a thread block within the grid, blockDim.x is the number of threads in a thread block, and threadIdx.x is the index of a thread within its block. We define id as the global index of a thread in the grid, calculated as

id = blockIdx.x × blockDim.x + threadIdx.x.

For an image with N × N resolution, each pixel is represented as (x, y). The pixel in the x-th column and y-th row is mapped to GPU thread id by

id = y × N + x, (21)

where 0 ≤ x < N and 0 ≤ y < N. Conversely, x and y can be recovered from id:

x = mod(id, N), y = ⌊id / N⌋, (22)

where mod(·) is the modulo operator. According to Eqs. (21) and (22), parallel GPU threads with different id access different pixels (x, y) to compute PHTs on the GPU directly.
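The mapping of Eqs. (21) and (22) can be sketched and round-trip-checked on the CPU (function names are ours):

```python
def pixel_to_id(x, y, N):
    # Eq. (21): row-major flattening of an N x N image
    return y * N + x

def id_to_pixel(tid, N):
    # Eq. (22): recover column x and row y from the global thread index
    return tid % N, tid // N
```

Because the mapping is a bijection between pixels and the range [0, N²), every thread touches exactly one pixel and no pixel is visited twice.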

C. FAST COMPUTATION METHOD ON GPU
From Eq. (13), we observe that pixels with the same radius r share the same radial term (e.g., cos(πnr²)); only the angular part of the integrand differs among them. The relationships among the sine values at the eight symmetric points on the same radius are summarized in Table 1. Similar relationships also exist for the cosine function and for other values of l. Therefore, the coefficients for the eight symmetric points on the same radius r can be calculated simultaneously.
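The eight-point symmetry can be sketched as follows: for a pixel (x, y) in one octant, the eight symmetric points all share r² = x² + y², so the radial term R_n(r) only needs to be evaluated once per group (the helper name is ours):

```python
def symmetric_points(x, y):
    """The eight points sharing the radius of (x, y): reflections
    across both axes combined with the diagonal swap x <-> y."""
    return [( x,  y), ( y,  x),
            (-x,  y), (-y,  x),
            ( x, -y), ( y, -x),
            (-x, -y), (-y, -x)]
```

The angular factors at these points differ only by sign changes and angle-sum constants, which is exactly what Table 1 tabulates.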
Finally, the GPCET is given by

M_{nl} = \frac{1}{\pi} \iint_{x^2+y^2 \le 1} \big[ \big(\cos(2\pi n(x^2+y^2))\, G_l(x, y) - \sin(2\pi n(x^2+y^2))\, H_l(x, y)\big) - i \big(\sin(2\pi n(x^2+y^2))\, G_l(x, y) + \cos(2\pi n(x^2+y^2))\, H_l(x, y)\big) \big] \,\mathrm{d}x \,\mathrm{d}y, (39)

where G_l(x, y) and H_l(x, y) collect the cosine and sine angular contributions of the eight symmetric pixels sharing the radius of (x, y). By using Eqs. (30), (36) and (39), a group of symmetric pixels can be handled by one GPU thread. In this case, Eqs. (21) and (22) must be reevaluated. As shown in FIGURE 3, we select one pixel from row 0, two pixels from row 1, and so on, then concatenate them for the GPU threads. We have

id = y(y + 1)/2 + x,

where 0 ≤ x ≤ y < n, so the GPU thread id is in the range

0 ≤ id < n(n + 1)/2.

y can be deduced from id as

y = \left\lfloor \frac{\sqrt{8\,id + 1} - 1}{2} \right\rfloor,

and x can be calculated as

x = id - y(y + 1)/2.
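The triangular thread indexing above can be checked with a short sketch (function names are ours):

```python
import math

def group_to_id(x, y):
    # concatenate one group from row 0, two from row 1, ...:
    # id = y(y+1)/2 + x, with 0 <= x <= y
    return y * (y + 1) // 2 + x

def id_to_group(tid):
    # invert via the triangular-number inequality, then recover x
    y = (math.isqrt(8 * tid + 1) - 1) // 2
    x = tid - y * (y + 1) // 2
    return x, y
```

As with the row-major mapping, the point of the bijection is that consecutive thread ids enumerate the triangular domain without gaps, so no thread is idle and no group is processed twice.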

D. UNROLL OPERATION
The unroll operation is an important technique that optimizes GPU execution speed by reducing branch penalties and hiding latencies, including the delay of reading data [44]. In the proposed method, the unroll operation makes each thread calculate more than one group of pixels. Let unrollNum be the number of groups calculated by one thread. The optimal unrollNum depends on the algorithm complexity and the GPU memory limitations.
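A minimal CPU sketch of the idea, assuming each thread processes unrollNum consecutive groups (this assignment scheme and the names are our illustration, not necessarily the paper's exact layout):

```python
def groups_of_thread(tid, unroll_num, total_groups):
    """Indices of the pixel groups handled by thread tid when each
    thread is unrolled over unroll_num consecutive groups."""
    start = tid * unroll_num
    return list(range(start, min(start + unroll_num, total_groups)))
```

With unrolling, the number of launched threads shrinks by a factor of unrollNum while each thread's inner loop grows, which trades scheduling overhead for per-thread work and register pressure.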

IV. EXPERIMENTAL RESULTS
Images with different resolutions and contents are tested to illustrate the feasibility and efficiency of the proposed GPHTs. The experiments are performed under Windows 10 with Visual Studio 2015; the CPU (Intel Core i3-8100) has 4 cores at 3.6 GHz, the GPU (Nvidia GeForce GTX 1060) has 1280 cores at 1594 MHz with 6 GB of graphics memory, and the CUDA version is 9.0.176.

A. SYNTHETIC IMAGES
Synthetic images are generated with a function rand(·) that randomly produces integer pixel values; 1,000 synthetic images are generated. In this experiment, n ranges from 11 to 16 and l ranges from 11 to 15. With different unrollNum and blockDim.x, the running times of GPCT, GPST and GPCET differ, as shown in Tables 2, 3 and 4, respectively. unrollNum is defined as a power of two, such as 2, 4 or 8. For 1,000 images with 2048 × 2048 resolution, FIGUREs 4, 5 and 6 show the running time curves of GPCT, GPST and GPCET for different unrollNum. When log(unrollNum) is 1, GPCT, GPST and GPCET achieve the best performance.
We then evaluate the impact of blockDim.x, which is also defined as a power of two, such as 2, 4 or 8. FIGUREs 7, 8 and 9 show the running time curves of GPCT, GPST and GPCET for the 1,000 synthetic images. For GPCT, the proposed method is fastest when log(blockDim.x) is 7, about 11.39 times faster than when log(blockDim.x) is 0. Similarly, GPST and GPCET achieve the best performance when log(blockDim.x) is 7.
With the optimal unrollNum and blockDim.x, we evaluate GPHTs against PHTs as shown in Table 5. As the image resolution increases, the advantage of the proposed GPHTs becomes more obvious. In our experiment on images with 2048 × 2048 resolution, GPCT runs 1887.5 times faster than PCT on the CPU, GPST achieves a 1795.4 times speedup over PST on the CPU, and GPCET is 1527.1 times faster than PCET on the CPU.
B. REAL IMAGES
Real ITS images are shown in FIGURE 10. 128 image patches with 512 × 512 resolution are selected. In this experiment, n ranges from 1 to 25 and l ranges from 1 to 25.
As shown in FIGURE 11, when log(unrollNum) is 2, GPCT, GPST and GPCET achieve the best performance. We also evaluate the optimal blockDim.x, as shown in FIGURE 12: when log(blockDim.x) is 7, GPCT, GPST and GPCET achieve the best performance.
With the optimal unrollNum and blockDim.x, Table 6 shows the running time comparison of GPHTs and PHTs. In our experiment on real images with 512 × 512 resolution, GPCT runs 961.8 times faster than PCT on the CPU, GPST achieves a 975.3 times speedup over PST on the CPU, and GPCET is 790 times faster than PCET on the CPU.

V. CONCLUSION
In this paper, we propose GPU-based PHTs. By using the symmetry properties of pixels and the mathematical properties of trigonometric functions, parallel GPU threads can manipulate pixels simultaneously. Formulas relating the GPU thread id to the image pixel (x, y) are deduced. For real-time systems and large multimedia databases, the proposed methods can fully unleash the parallel computational capability of the GPU. Comprehensive experiments illustrate the effectiveness of the proposed methods. A wide range of emerging applications that use PHTs can be inspired by this study.
MINGKAI TANG is currently pursuing the bachelor's degree with the Guangdong University of Technology. His main research interests include image processing and computer vision.
ZHUOZHANG LI is currently pursuing the bachelor's degree with the Guangdong University of Technology. His main research interests include image processing and artificial intelligence.