Deep Pipeline Architecture for Fast Fractal Color Image Compression Utilizing Inter-Color Correlation

Fractal compression is a well-known technique that encodes an image by mapping the image onto itself, which requires a massive and repetitive search. The resulting long encoding time is the main problem of the fractal algorithm. To reduce the encoding time, several hardware implementations have been developed. However, they are generally developed for grayscale images, and using them to encode colour images increases the encoding time by at least $3\times$. Therefore, in this paper, a new high-speed hardware architecture is proposed for encoding RGB images in a short time. Unlike the conventional approach of encoding each colour component individually as a grayscale image, the proposed method encodes two of the colour components by mapping them directly to the most correlated component with a searchless encoding scheme, while the third component is encoded with a search-based scheme. This reduces the encoding time and also increases the compression rate. Parallel and deep-pipelining approaches are utilized to improve the processing time significantly. Furthermore, to halve the memory access, the image is partitioned in such a way that half of the matching operations reuse the data fetched for the other half of the matching operations. Consequently, the proposed architecture can encode a $1024\times 1024$ RGB image in 12.2 ms with a compression ratio of 46.5. Accordingly, the proposed architecture is superior to the state-of-the-art architectures.


I. INTRODUCTION
In recent years, images have become ubiquitous and an essential element of our daily lives. Images are produced intensively every single day in various forms such as medical images, personal images, social media images, surveillance images, and graphics images. Storing these images or transferring them through the network places a heavy burden on memory devices and on network bandwidth. Taking this into account, compression is an indispensable tool for archiving images in less storage space and for transferring them in less transmission time.
The associate editor coordinating the review of this manuscript and approving it for publication was Yun Zhang.
Fractal image compression (FIC) is one of the most popular techniques due to its distinctive way of compressing images. This compression technique uses self-similarity features as a means to compress an image [5]. It also offers a high compression ratio (CR), especially when applied to a digital image with a high degree of self-similarity like aerial photography or satellite imagery [7]. Owing to its popularity in digital archiving, FIC has been used in numerous applications such as character recognition [8], super-resolution [9], watermarking [10], [11], and digital signature embedding [12]. FIC also possesses other attractive features like good peak signal-to-noise ratio (PSNR) performance and a simple decoding method [13], [14], [15], [16], [17], [18]. However, FIC suffers from one major drawback arising from the computational complexity of the algorithm. Typically, FIC requires a very large number of searches to find the best or good-enough maps between image blocks. Theoretically, the time complexity of the fractal compression algorithm is approximately $O(n^4)$ for an image of size $n \times n$ [19], [20]. Owing to this, the encoding time required for FIC is generally long and unsuitable for most real-time applications, which require processing 30 frames per second or more.
To date, various methods have been proposed for accelerating the fractal algorithm. These methods can be grouped into two types: software- and hardware-based methods. The software-based methods mainly focus on reducing the size of the range and domain pools, adopting a classification approach [21], restricting the search space [22], [23], combining different coding schemes [24], and applying a feature-vector approach [25]. For instance, Bani-Eqbal [26] arranged the domain blocks in a tree structure to reduce the number of candidate blocks for the matching search. In an earlier study, Saupe and Hamzaoui [27] discarded the domain blocks that have low variance. Tong and Pi [28] presented an adaptive search that excludes the domain blocks that do not satisfy the necessary condition for a better match. Meanwhile, Tong and Wong [29] converted the matching search problem to a nearest-neighbour search problem. Although these schemes are successful in speeding up FIC, the attained encoding times are still too high to meet the real-time requirements of most time-sensitive applications. This problem persists even when using cutting-edge microprocessors and digital signal processors (DSPs), because these processors still share the same concept as older processors, where a static set of instructions is run on a static architecture. On the other hand, the hardware-based methods execute FIC operations in parallel, which shortens the encoding time.
Several works targeting hardware implementation of FIC have been published recently [1], [2], [30], [31], [32], [33]. These works exploit the inherent parallelism in FIC as a means of speed-up. Although they reduce the encoding time significantly, they are developed to encode grayscale images only, whereas most images today are colour images. This shortage of fractal-colour-image-compression (FCIC) hardware implementations can be attributed to the fact that designing dedicated hardware to process colour images is much more complicated than designing custom hardware for processing grayscale images. Due to this, fractal coding designers have been content with the straightforward method (i.e., three-component Separated Fractal Coding (SFC)), which encodes colour images using the grayscale-based methods. In this case, each colour component in a colour image is treated as a single grayscale image. As a result, the correlation among colour components is not exploited, which results in a relatively low compression ratio and PSNR. Furthermore, the encoding time is increased significantly.
To the best of our knowledge, no hardware design has been developed for encoding colour images except the one proposed in [34]. In [34], a soft-core processor designed for Xilinx FPGA is used for coding a 64 × 64 colour image. In the process, the RGB components of the colour image are transformed into YUV components, and then fractal coding is performed in different sampling modes of 4:4:4, 4:2:2, 4:2:0, and 4:1:1 to attain a higher compression rate. The range and domain blocks for each component are classified into a specific number of classes. More specifically, Y's blocks are classified into 72 classes while U's and V's blocks are classified into 24 classes each. Each component is encoded separately without exploiting the inter-colour-component redundancy. Although the encoding time of this implementation is not reported, it can be estimated by considering an almost identical implementation published by the same authors, Son, et al. [35], but applied to grayscale images. Based on this assumption, the encoding time of a colour image is approximately 33 s. This estimated time is large because the design lacks a hardware accelerator unit and because it still encodes the colour components individually.
In this paper, a new high-speed pipelined parallel architecture is proposed for coding a high-resolution colour image of 1024 × 1024 pixels. Since Field-Programmable Gate Arrays (FPGAs) offer dense fine-grained parallelism, where hundreds of instructions can be executed simultaneously, and are also reprogrammable, the proposed architecture is realised on an FPGA. The proposed architecture is designed to exploit the cross-correlation between colour components in order to achieve a higher CR and a lower encoding time. Unlike YUV and other colour spaces, the colour components in RGB space are relatively more correlated and have a higher degree of similarity [36]. Therefore, colour images are encoded in RGB space, and direct mapping between the colour components is used to reduce the search process. By doing so, one component is encoded with a search-based method, while the other two components are encoded with a searchless method. As a result of encoding two of the three colour components using the searchless method, the runtime is nearly one-third of that when using grayscale-based designs. Furthermore, the method provides a relatively higher compression rate and image quality compared to the conventional approach that encodes colour components separately. The two main contributions of this work are: (i) a hardware-friendly fractal RGB image encoding method with two encoding schemes, utilizing the inter-correlation between the colour components in RGB images to improve the encoding speed and the compression rate; and (ii) a high-speed hardware architecture for fractal-based colour image compression with deep pipeline processing and two parallel matching processors.
VOLUME 10, 2022

II. MATHEMATICAL BACKGROUND OF FRACTAL ALGORITHM
In this section, the mathematical background of the fractal algorithm is demonstrated and explained in detail. The encoder and decoder parts of the fractal algorithm are elaborated clearly in the following sub-sections.

A. ENCODING PROCEDURE
The idea of the fractal algorithm is to find a set of transformations that map an image onto itself. In other words, the fractal algorithm attempts to find the local self-similarities in the image and maps them together using an affine transformation. Thus, to encode an image, the image is partitioned into a set of non-overlapping blocks referred to as range blocks, Rs. Additionally, the image is also partitioned into larger blocks called domain blocks, denoted as Ds. Unlike the range blocks, the domain blocks may overlap, and they generally cover 4× the area of the range blocks: if the range blocks have a size of α × α pixels, the domain blocks are generally 2α × 2α.
In order to map the range and domain blocks together, the encoder searches the domain pool for an acceptable or the best-matched domain block for each range block in the image. As said earlier, the affine transformation is applied to the candidate domain blocks in order to maximize the matching degree between the matched blocks. The values of the transformation's coefficients for the best matches are then stored as the fractal codes of the encoded image. Typically, the affine transformation can be simplified to three basic functions. First, the scaling function contracts the domain blocks to the size of the range block; to do so, every 2 × 2 pixels in the respective domain block are replaced by one pixel holding their average value. Second, the geometric function rotates and reflects the domain block by specified degrees to generate several symmetries of the domain block. Last, the massic function adjusts the intensity values of the candidate domain block using two parameters: the contrast scaling δ and the brightness offset ρ. The scaling parameter is typically between −1 and 1 [37].
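As a rough illustration of the scaling and massic functions described above, the following Python sketch contracts a domain block by averaging each 2 × 2 group and then applies δ and ρ. The function names `contract` and `massic` and the use of NumPy floating-point arithmetic are assumptions made for illustration only, and the geometric function is left out.

```python
import numpy as np

def contract(domain):
    """Scale a 2a x 2a block down to a x a by averaging each 2 x 2 group."""
    h, w = domain.shape
    # Axes after reshape: (block row, row in group, block col, col in group).
    return domain.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def massic(block, delta, rho):
    """Apply the contrast scaling delta and the brightness offset rho."""
    return delta * block + rho

# Toy 4 x 4 "domain block" contracted to a 2 x 2 "range-sized" block.
domain = np.arange(16, dtype=float).reshape(4, 4)
small = contract(domain)
print(small)
print(massic(small, 0.5, 10.0))
```

In the encoder, the same contraction would turn a 16 × 16 domain block into an 8 × 8 block before it is compared with a range block.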

B. DECODING PROCEDURE
To decode an image encoded with the fractal algorithm, the decoder performs an iterative reconstruction process. The decoder starts from an arbitrary image and updates the image's partitions as follows:
$$R = \delta \cdot D_j + \rho$$
where $R$ represents the retrieved range block and $D_j$ represents the mapped $j$-th domain block, contracted to the size of the range block. The δ, ρ and $j$ coefficients are the stored fractal codes for the corresponding range block.
Once all the range blocks in the image have been retrieved, the produced image is used as the domain image for the next iteration. This process continues until the encoded image is retrieved. Typically, 10 to 20 iterations are enough to reconstruct an image encoded by the fixed domain block size approach [38]. To measure the quality of the reconstructed RGB images, the PSNR metric is usually used and computed as follows:
$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{255^2}{\frac{1}{3MN} \sum_{c \in \{R,G,B\}} \sum_{x=1}^{M} \sum_{y=1}^{N} \left( \nu_c(x,y) - \hat{\nu}_c(x,y) \right)^2} \right)$$
where $\nu$ and $\hat{\nu}$ are, respectively, the original and retrieved RGB images, $c$ indexes the colour component, $(x, y)$ is the pixel position, and $M \times N$ is the height × width of the encoded image, which is here 1024 × 1024.
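The PSNR computation can be sketched in a few lines of Python. This is a minimal sketch that assumes 8-bit samples (peak value 255) and a mean squared error averaged over all three colour components; the function name `psnr_rgb` is hypothetical.

```python
import numpy as np

def psnr_rgb(original, decoded, peak=255.0):
    """PSNR in dB between two H x W x 3 images (assumes 8-bit samples)."""
    diff = original.astype(np.float64) - decoded.astype(np.float64)
    mse = np.mean(diff ** 2)  # averaged over every pixel and all 3 components
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Every sample off by one grey level gives an MSE of 1.
a = np.zeros((4, 4, 3), dtype=np.uint8)
b = np.ones((4, 4, 3), dtype=np.uint8)
print(psnr_rgb(a, b))
```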

A. PROPOSED FCIC ALGORITHM
In the proposed algorithm, the colour components of an RGB image are processed with two different schemes, i.e., search and searchless-based schemes. The search scheme is used to encode the most-correlated component, while the searchless scheme is used for encoding the remaining two components. From [36], it can be found that R and B components are highly correlated to the G component and consequently the most-correlated component is G. Therefore, G-component is denoted here as the primary component while R and B components are denoted as the secondary components. In this case, the primary component is processed with the search scheme and the secondary components are processed with the searchless scheme. The procedures for both encoding schemes are discussed in the following sub-sections.

1) SEARCH SCHEME
As previously discussed in Section 2.1, the kernel of the fractal method is to pair each range block with an appropriate domain block. Thus, each range block in the G-component, $R^g$, needs to be mapped to one domain block, $D^g$. The selected domain block must have the minimum matching error compared to the others. Searching the entire component for the best-matched block is very time-consuming, as it requires a massive number of matching operations. To reduce the required number of matching operations, the searched blocks are restricted to those that lie close to the corresponding range block. This generally does not cause noticeable degradation in the quality of the decoded image, as the correlation among adjacent blocks is generally high, specifically for images captured from natural scenes [5], [39], [40]. Therefore, in this work, each range block $R^g$ in a 128 × 128 window is compared with every domain block $D^g$ in the same window. With this window size, a 1024 × 1024 G-component contains 64 non-overlapping windows. Accordingly, the G-component is first divided into non-overlapping sub-images (denoted as sub-img) of 128 × 128 pixels each, giving 64 sub-images sub-img$_k$, $k = 0, \ldots, 63$. Then, each sub-image is partitioned into range and domain blocks of 8 × 8 and 16 × 16 pixels, respectively. As a result, there are 64 non-overlapping domain blocks in each sub-image, denoted as $D^g_j$, $j = 0, \ldots, 63$. These 64 domain blocks constitute one search domain pool. There are 256 range blocks in each sub-image, and these blocks are grouped into 64 sets of four adjacent range blocks denoted as $R^g_{i,p}$, $p = 0, 1, 2, 3$, $i = 0, \ldots, 63$. As a result, every four adjacent range blocks $R^g_{i,p=0,1,2,3}$ constitute a domain block $D^g_j$, where $j = i$. Fig. 1 illustrates the overall partitioning procedure for the G-component.
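The block counts in this partitioning can be checked with a short Python sketch. The helper `split_blocks` is a hypothetical name, and the row-major ordering of the blocks is an assumption made for illustration; the text does not fix the ordering used in Fig. 1.

```python
import numpy as np

def split_blocks(img, size):
    """Split a 2-D array into non-overlapping size x size blocks (row-major)."""
    h, w = img.shape
    return (img.reshape(h // size, size, w // size, size)
               .swapaxes(1, 2)            # -> (block row, block col, y, x)
               .reshape(-1, size, size))

g = np.zeros((1024, 1024), dtype=np.uint8)    # stand-in for the G component
sub_images = split_blocks(g, 128)             # search windows
domains = split_blocks(sub_images[0], 16)     # domain pool of one window
ranges = split_blocks(sub_images[0], 8)       # range blocks of one window
group = split_blocks(domains[0], 8)           # 4 range blocks inside D_0

print(len(sub_images), len(domains), len(ranges), len(group))
```

Each 16 × 16 domain block splits into exactly four 8 × 8 range blocks, which is what allows one memory fetch to serve both roles.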
To match $R^g_{i,p}$ with $D^g_j$, the affine transform is used to maximize the matching degree between the matched blocks. For a simple implementation and fewer matching operations, the geometric function involved in the affine transform is discarded. In this case, the affine transform $\psi$ for $D^g_j$ is computed as:
$$\psi(D^g_j) = \delta \cdot \gamma(D^g_j) + \rho \cdot I$$
where $\gamma$ is the contract function returning the contracted domain block of size α × α, $I$ is the α × α matrix whose entries are all ones, and δ and ρ are, respectively, the scaling and offset parameters. To measure the similarity between $R^g_{i,p}$ and the obtained transform of $D^g_j$ (i.e., $\psi(D^g_j)$), the sum-of-absolute-differences (SAD) metric is used. Compared with other common metrics (e.g., the least-squared-error (LSE) and the mean-squared-error (MSE)), SAD is more hardware-friendly as it does not require multiplication operations. The formula for computing SAD is as follows:
$$\mathrm{SAD} = \sum_{x=1}^{N} \left| R^g_{i,p}(x) - \psi(D^g_j)(x) \right|$$
where $R^g_{i,p}(x)$ and $\psi(D^g_j)(x)$ are, respectively, the $x$-th intensity values of the respective range and transformed domain blocks, and $N$ is the number of pixels in $R^g_{i,p}$; i.e., $N = 8 \times 8 = 64$. To minimize distortion and maximize the matching degree, ρ is computed as:
$$\rho = \mu(R^g_{i,p}) - \delta \cdot \mu(D^g_j)$$
where $\mu(R^g_{i,p})$ is the mean value of $R^g_{i,p}$, $\mu(D^g_j)$ is the mean value of $D^g_j$, and δ is the contrast scaling parameter. To gain a high compression rate, δ is chosen to be 2 bits wide and therefore has four possible values (i.e., −0.5, 0.25, 0.5, 1). Each of these values must be tested in order to select the one that produces the least distortion.
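One matching operation of the search scheme can be sketched in Python as follows. This is a floating-point sketch under stated assumptions, not the fixed-point hardware datapath: the names `match` and `contract` are hypothetical, ρ is taken as the range mean minus δ times the domain mean, and ρ is not quantized to its stored bit width.

```python
import numpy as np

DELTAS = (-0.5, 0.25, 0.5, 1.0)  # the four 2-bit contrast values from the text

def contract(domain):
    """Average each 2 x 2 group so a 16 x 16 block becomes 8 x 8."""
    h, w = domain.shape
    return domain.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def match(range_block, domain_block):
    """Return (sad, delta, rho) for the delta that gives the lowest SAD."""
    r = range_block.astype(np.float64)
    d = contract(domain_block.astype(np.float64))
    best = None
    for delta in DELTAS:
        rho = r.mean() - delta * d.mean()          # offset for this delta
        sad = np.abs(r - (delta * d + rho)).sum()  # sum of absolute differences
        if best is None or sad < best[0]:
            best = (sad, delta, rho)
    return best

# A range block built as 0.5 * contracted domain + 20 should be recovered.
rng = np.random.default_rng(0)
dom = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
rb = 0.5 * contract(dom) + 20.0
print(match(rb, dom))
```

Repeating `match` over the 64 domain blocks of a sub-image and keeping the lowest SAD would give the fractal code (δ, ρ, $j$) of one range block.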
For a given range block $R^g_{i,p}$, the encoder searches the domain pool $D^g_j$, $j = 0, \ldots, 63$, for the best-matched domain block that produces the minimum SAD value. The corresponding δ and ρ values, which represent the fractal codes, must be stored together with the spatial information of the matched domain block. In this work, ρ and δ are 7 bits and 2 bits wide, respectively.
From Fig. 1, it is clear that each domain block contains four range blocks. Thus, reading four adjacent range blocks $R^g_{i,p=0,1,2,3}$ from memory is equivalent to reading a domain block. Likewise, reading one domain block implicitly reads four range blocks. Therefore, when a domain block is read from memory for a similarity check, the four implicit range blocks can be processed at the same time, which reduces memory access and, consequently, the compression time.
In the process, each set of four adjacent range blocks $R^g_{i,p=0,1,2,3}$, $i = 0, \ldots, 63$, is compared consecutively with all $D^g_j$, $j = 0, \ldots, 63$, in the same sub-image. At each comparison between $R^g_{i,p=0,1,2,3}$ and $D^g_j$, the implicit blocks (i.e., the domain block $D^g_i$ formed by $R^g_{i,p=0,1,2,3}$ and the four range blocks $R^g_{j,p=0,1,2,3}$ contained in $D^g_j$) are utilised to perform another matching operation. Thus, both matchings, match($R^g_{i,p}$, $D^g_j$) and match($R^g_{j,p}$, $D^g_i$), can be executed in parallel using two matching processors, and the results for the best-matched blocks are stored. To avoid performing the same matching twice, match($R^g_{i,p}$, $D^g_j$) is carried out in one processor for $j \geq i$, while match($R^g_{j,p}$, $D^g_i$) is carried out in the other processor for $j > i$, which covers the pairs whose domain index is smaller than the range index. For example, for $i = 0$, the four range blocks $R^g_{0,p=0,1,2,3}$ are matched one by one with $D^g_j$, $j = 0, \ldots, 63$, in the first processor, while the other processor carries out the matching between $R^g_{j,p}$, $j = 1, \ldots, 63$, and $D^g_0$. Next, for $i = 1$, the next four range blocks $R^g_{1,p=0,1,2,3}$ need to be compared with $D^g_j$, $j = 1, \ldots, 63$ only, because they have already been compared with $D^g_0$ in the first cycle. Meanwhile, the matching between $R^g_{j,p}$, $j = 2, \ldots, 63$, and $D^g_1$ is carried out on the second processor. Therefore, for each value of $i$, the number of domain blocks that need to be matched is $64 - i$, as $j = i, \ldots, 63$. As a result, the number of matching operations decreases as $i$ increases, which leads to a significant reduction in search time.
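The two-processor schedule can be checked with a small Python sketch that enumerates which (range-group, domain) pairs each processor handles. The generator name `schedule` and the tuple layout are assumptions for illustration; the sketch models only the pairing order, not the timing.

```python
def schedule(n=64):
    """Yield (i, j, pu1_pair, pu2_pair) for every fetched domain block D_j.

    Each pair is (range-group index, domain index).
    """
    for i in range(n):
        for j in range(i, n):
            pu1 = (i, j)                      # match(R_i, D_j)
            pu2 = (j, i) if j > i else None   # match(R_j, D_i) on reused data
            yield i, j, pu1, pu2

steps = list(schedule())
pairs = [p for _, _, a, b in steps for p in (a, b) if p is not None]

print(len(steps))   # block fetches per sub-image
print(len(pairs))   # matching pairs covered
```

Under this model every one of the 64 × 64 pairs is covered exactly once, while the number of block fetches is 2080 rather than 4096, which is the roughly halved memory access that the text describes.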

2) SEARCHLESS SCHEME
Within this scheme, the high degree of similarity between the colour components is utilised to encode the R and B components in a significantly shorter time. Since the range blocks of the G component, $R^g$, are strongly correlated with the co-located (fully overlapping) range blocks of the R and B components ($R^r$ and $R^b$), both $R^r$ and $R^b$ are mapped to their overlapping $R^g$. To do so, the R and B components are partitioned in the same way as the G-component, with one exception: unlike the G-component, which is partitioned into range and domain blocks for pairing each range block with a particular domain block, the R and B components are partitioned only into range blocks, as these are paired with the range blocks of the G-component. Accordingly, the R and B components are partitioned into 64 sub-images each, and each sub-image is partitioned into 8 × 8 non-overlapping sets of four range blocks called $R^r_{i,p=0,1,2,3}$ and $R^b_{i,p=0,1,2,3}$, respectively, where $i = 0, \ldots, 63$. For each range block in the secondary components, the affine transform of the mapped block, say $R^g_{i,p}$, is computed as follows:
$$\psi(R^g_{i,p}) = \delta \cdot R^g_{i,p} + \rho \cdot I$$
where the contract function is removed as $R^g_{i,p}$ has the same size as $R^r_{i,p}$ and $R^b_{i,p}$. The two parameters, δ and ρ, are defined in the same fashion as described in Section 3.1.1. The values of these parameters are stored as the fractal codes for each $R^r_{i,p}$ and $R^b_{i,p}$. The flowchart of the proposed encoding method for RGB images is shown in Fig. 2.
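A minimal Python sketch of the searchless scheme is given below, together with the matching single-pass reconstruction of a secondary block. The function names are hypothetical, the arithmetic is floating point rather than the hardware's fixed point, and ρ is not quantized.

```python
import numpy as np

DELTAS = (-0.5, 0.25, 0.5, 1.0)  # same 2-bit contrast set as the search scheme

def encode_searchless(secondary, primary):
    """Map the co-located G range block onto an R or B range block.

    No domain search and no contraction: only delta and rho are chosen.
    """
    s = secondary.astype(np.float64)
    g = primary.astype(np.float64)
    best = None
    for delta in DELTAS:
        rho = s.mean() - delta * g.mean()
        sad = np.abs(s - (delta * g + rho)).sum()
        if best is None or sad < best[0]:
            best = (sad, delta, rho)
    return best[1], best[2]

def decode_searchless(primary_decoded, delta, rho):
    """Rebuild an R or B range block from the decoded G block in one pass."""
    return delta * primary_decoded + rho

rng = np.random.default_rng(1)
g_block = rng.integers(0, 200, size=(8, 8)).astype(np.float64)
r_block = g_block + 12.0                 # toy, perfectly correlated R block
delta, rho = encode_searchless(r_block, g_block)
print(delta, rho)
```

Because each secondary block has a single candidate, the cost per block is four SAD evaluations (one per δ) instead of four per domain block in the pool.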
With respect to the decoding process, the R and B components are decoded differently from the G-component. The G-component is decoded in the conventional manner described in Section 2.2. Unlike the G component, which is reconstructed from an arbitrary image over numerous iterations, the R and B components are reconstructed from the retrieved G component and need only one iteration. Consequently, the decoding time is effectively decreased, which is another advantage of the introduced encoding approach.
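The code size per 8 × 8 RGB block can be worked out from the stated parameter widths. The sketch below assumes that the matched domain block is stored as a 6-bit index (64 candidates per sub-image) and that no other side information is stored; neither assumption is stated explicitly in the text. Under these assumptions the result agrees with the compression ratio of 46.5 reported in the abstract.

```python
BLOCK_PIXELS = 8 * 8      # pixels per range block
BITS_PER_PIXEL = 24       # RGB, 8 bits per component
DELTA_BITS = 2            # contrast scaling width given in the text
RHO_BITS = 7              # brightness offset width given in the text
INDEX_BITS = 6            # assumption: log2(64) bits for the domain index

primary_bits = DELTA_BITS + RHO_BITS + INDEX_BITS   # G block (search scheme)
secondary_bits = DELTA_BITS + RHO_BITS              # R or B block (searchless)
code_bits = primary_bits + 2 * secondary_bits       # one RGB block position

ratio = BLOCK_PIXELS * BITS_PER_PIXEL / code_bits
print(code_bits, round(ratio, 1))
```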

B. FCIC ARCHITECTURE
To achieve high-speed encoding using the proposed FCIC algorithm, a new hardware architecture is proposed as shown in Fig. 3. The proposed architecture is designed to work at a high clock speed and to provide high throughput. To do so, each component of the proposed architecture is designed to have low latency. Parallelism and pipelining are used extensively throughout the system in order to reach optimal performance. The proposed architecture comprises (i) a Memory-Control (MemCtrl) unit, (ii) a Memory-Addresses-Generator (MemAddrG) unit, (iii) a Mean-and-Contraction Computing (MCC) unit, (iv) two matching processors (PU1 and PU2), and several control units. Examples of the developed control units are (i) the Storing-Control-unit (SCtrl), (ii) the MCC-Ctrl unit and (iii) the Fractal Codes and SAD Storing-Control-unit (FC-SAD SCtrl). Each matching processor in the system contains RAMs, registers, an Offset-Computation-Unit (OCU), and a SAD-Computation (SADC) unit with its control unit (SADC-Ctrl).
In operation, the MemAddrG unit generates the addresses required to fetch the image blocks as described in Section 3.1, while the MemCtrl unit controls the main memory. The MCC unit computes the mean values of the fetched range and domain blocks. It is also responsible for contracting the domain blocks to 8 × 8 pixels. The computation of the offset parameter and of the sum of absolute differences for each mapping is performed on the OCU and SADC units, respectively. These units are controlled by the MCC-Ctrl and SADC-Ctrl modules, respectively. Last but not least, the SCtrl unit controls the storing of the fetched range and domain blocks in the internal memory for computing SAD values, while FC-SAD SCtrl stores the best fractal codes. For a clear understanding of the encoding process, the memory organisation and the memory-addresses-generator unit MemAddrG are described first.

1) MEMORY ORGANISATION
A 1024 × 1024 RGB image with 24-bit pixels requires 3 megabytes of memory to be fully stored in the system. With a memory word size of 64 bits, the image requires $3 \times 2^{17}$ memory words (locations). The image's pixels are buffered in row-wise order, with the colour components buffered one after another. Thus, the address bus width is 19 bits. As shown earlier, the image comprises three components, where each component is subdivided into 64 sub-images and each sub-image is further divided into 64 domain and/or 256 range blocks. To read a particular $R_{i,p}$ or $D_j$ block from a certain sub-image sub-img$_k$ in one of the colour components, the block, sub-image and component spatial information (i.e., $i$, $j$ and $k$) are all required to generate the memory addresses. Two bits are required to address the colour components, six bits to address the sub-images, six bits to address the domain/4-adjacent-range blocks, and five bits to address the pixels in the respective image block. Thus, the 19-bit address $Addr_{18 \ldots 0}$ is constructed as follows: $Addr_{18,17}$ is the colour component index (i.e., 00, 01 and 10 for the G, R, and B components, respectively); $Addr_{16 \ldots 14}$ and $Addr_{13 \ldots 11}$ are the three most-significant bits of the sub-image and domain block indices (i.e., $k$ and $j$), respectively; $Addr_{6 \ldots 4}$ and $Addr_{3 \ldots 1}$ are the remaining three least-significant bits of the sub-image and domain block indices, respectively; and $Addr_{10 \ldots 7}$ and $Addr_{0}$ address the pixels of the domain/4-adjacent-range block. Correspondingly, the base addresses of each sub-image and each image block have the forms ''xx-xxx-000-0000-xxx-000-0'' and ''xx-xxx-xxx-0000-xxx-xxx-0'', respectively.
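The address layout can be sketched in Python and cross-checked against plain row-major word addressing. The function names are hypothetical. The sketch assumes that the indices $k$ and $j$ are numbered row-major (upper three bits select the row, lower three bits the column), that `row` is the 4-bit pixel row inside the block, and that `half` selects the left or right 8-pixel word; this is an interpretation of the bit fields, not a statement of how p-Cr is wired.

```python
WORDS_PER_COMPONENT = 1 << 17   # 1024 * 1024 bytes / 8 bytes per word

def build_address(c, k, j, row, half):
    """Pack the fields into the 19-bit layout 2-3-3-4-3-3-1."""
    assert 0 <= c < 3 and 0 <= k < 64 and 0 <= j < 64
    assert 0 <= row < 16 and half in (0, 1)
    return ((c << 17)            # Addr[18:17] colour component
            | ((k >> 3) << 14)   # Addr[16:14] upper bits of sub-image index
            | ((j >> 3) << 11)   # Addr[13:11] upper bits of block index
            | (row << 7)         # Addr[10:7]  pixel row inside the block
            | ((k & 7) << 4)     # Addr[6:4]   lower bits of sub-image index
            | ((j & 7) << 1)     # Addr[3:1]   lower bits of block index
            | half)              # Addr[0]     left or right 8-pixel word

def row_major_address(c, k, j, row, half):
    """Same word located by image row and 8-pixel column (128 words per row)."""
    y = (k >> 3) * 128 + (j >> 3) * 16 + row
    x_word = (k & 7) * 16 + (j & 7) * 2 + half
    return c * WORDS_PER_COMPONENT + y * 128 + x_word

print(format(build_address(1, 9, 10, 0, 0), "019b"))
```

Under these assumptions the packed fields coincide with row-wise buffering, so the address generator needs only counters and wiring rather than multipliers.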

2) MEMORY-ADDRESSES-GENERATOR UNIT
To generate the required addresses for fetching the image blocks as described earlier, we designed a MemAddrG unit as shown in Fig. 4. From this figure, it is clear that the unit is equipped with five counters, a flip-flop, an adder, and several logic gates. The counters are component counter c-Cr, sub-image counter k-Cr, 4-range block counter i-Cr, block counter j-Cr, and block's pixels counter p-Cr. All counters are of 6-bit size except c-Cr and p-Cr which are 2-bit and 5-bit sizes, respectively.
Referring to Fig. 4, p-Cr is enabled whenever the signal MemAddrG_En is in the HIGH state. This generates the addresses required for reading all pixels in the desired domain or 4-range block. Since these blocks are 16 × 16 pixels in size and stored in 32 memory words of 8 bytes, the addresses for the left 16 × 8 pixels are generated first, followed by the addresses of the right 16 × 8 pixels. As a result, the range blocks are read in the right sequence, i.e., $R_{i,0}$, $R_{i,1}$, $R_{i,2}$ and $R_{i,3}$. The counters i-Cr, j-Cr, and k-Cr provide the indices of the block to be read based on the flowchart shown in Fig. 2, while the counter c-Cr selects the colour component to be read (i.e., 00, 01 and 10 for reading the G, R and B components, respectively).
For encoding the image as described in 3.1, the MemAddrG unit first generates the 32 addresses for reading $R^g_{0,p=0,1,2,3}$ by enabling the counter p-Cr, which counts from ''00000'' to ''11111''. Meanwhile, the remaining counters all hold zero. Once p-Cr reaches the final count (p-Cr = ''11111''), c-Cr is enabled to generate the base addresses of the overlapping range blocks in the R and B components. Consequently, the first four range blocks of all colour components are read in the following sequence: $R^g_{0,p=0,1,2,3}$, $R^r_{0,p=0,1,2,3}$, $R^b_{0,p=0,1,2,3}$. In this case, both $R^r_{0,p=0,1,2,3}$ and $R^b_{0,p=0,1,2,3}$ can be encoded with the proposed searchless scheme by mapping them to $R^g_{0,p=0,1,2,3}$. Once the values of the counters j-Cr and p-Cr reach the maximum value, j-Cr is loaded with i-Cr + 1, which leads to fetching the subsequent 4-range block, $R^g_{1,p=0,1,2,3}$. At the same time, the counter i-Cr is also enabled. When the counters i-Cr, j-Cr and p-Cr have all reached their final counts, the current sub-image has been completely read and the MemAddrG unit needs to start generating the addresses for fetching the next sub-image, sub-img$_{k=1}$. Thus, the counter k-Cr is enabled for one clock cycle to generate the base address of the next sub-image. Then the counters c-Cr, i-Cr, j-Cr and p-Cr are enabled in the same manner as for the previous sub-image. This process continues until all sub-images are read and encoded.

3) ENCODING PROCESS DATAFLOW
To start the encoding process, the signal Start needs to be asserted. The MemCtrl unit then enables the MemAddrG unit by asserting the control signal MemAddrG_En. The MemCtrl unit also provides other control signals, such as PG-DataReady and SG-DataReady, which are asserted to indicate that the pixel values read from the primary component or from the secondary components, respectively, are available on the data bus. They thus indicate to the other units when to start processing the data with the corresponding encoding scheme.
The implementation of each encoding scheme requires fetching different blocks as described previously. According to MemAddrG, in every cycle, the four range blocks of every colour component ($R^g_i$, $R^r_i$, $R^b_i$) are read first, followed by the domain blocks from the G component. Since $R^r_i$ and $R^b_i$ need only $R^g_i$ to be encoded with the proposed searchless scheme, the searchless scheme is carried out first. Then, the proposed search scheme is executed for encoding $R^g_i$. As stated previously, $R^g_i$, $R^r_i$ and $R^b_i$ are fetched consecutively for encoding $R^r_i$ and $R^b_i$ with the searchless scheme. While $R^g_{i,p=0,1,2,3}$ is being read, MCC computes the mean value of each range block. Simultaneously, SCtrl stores $R^g_{i,p=0,1,2,3}$ and their computed mean values in the dual-port RAM 4-R and the $\mu(R_{0,1,2,3})$ registers, respectively. As shown in Fig. 3, these memory elements are available in the processor PU1. On the other hand, the $R^r_i$ and $R^b_i$ range blocks and their mean values are stored in the internal memory of PU2, namely the 64 12-bit shift register and the $\mu(R_D)$ register, respectively. For each range block of $R^r_i$ and $R^b_i$, the matching process is performed once its mean value is available in $\mu(R_D)$. As MCC is designed to compute the mean value in three pipeline stages, as shown in Fig. 5, the mean value needs three clock cycles to be computed and one additional cycle to be stored in the respective register $\mu(R_D)$. Thus, four clock cycles after reading $R^r_i$ or $R^b_i$, the processor PU2 starts the matching process by computing the offset value ρ. When computing ρ using the OCU in PU2, the two multiplexers, mux1 and mux2, select the corresponding $\mu(R^g_{i,p})$ value to be fed to the OCU µ(D) input. For each scaling value δ, the OCU produces one value of ρ, which is stored in the corresponding register at the 5th clock cycle. At this stage, PU2 starts computing the SAD value for each set of δ and ρ using its SADC unit. To do so, the corresponding $R^g_{i,p}$ block must be provided to the SADC unit. Since $R^g_{i,p}$ is currently available in the internal dual-port RAM 4-R, it is provided to the SADC unit from this internal RAM. On the other hand, the 12-bit shift register provides the processed range block (i.e., $R^r_{i,p}$ or $R^b_{i,p}$) to SADC at the required time. For each set of δ and ρ, SADC computes the SAD value, and the computed values are then compared to select the minimum SAD, MinSAD. The SADC unit is designed to perform this task in four pipeline stages. The encoding set corresponding to the minimum SAD is stored for decoding purposes. These operations are repeated for each range block of $R^r_i$ and $R^b_i$ in the secondary components. Fig. 6 demonstrates the pipeline processing for encoding the range blocks of the secondary components ($R^r_{i,p}$ and $R^b_{i,p}$) by the proposed architecture. After reading the range blocks $R^r_i$ and $R^b_i$ for encoding them by the searchless scheme as shown in Fig. 6 Fig. 3. Accordingly, the matching operations performed on PU2 do not need any access to the memory and hence the overall memory access is reduced effectively. For each matching, the corresponding processor calculates ρ and SAD for each defined δ value. The coefficients corresponding to the lowest SAD value are stored in FC-SAD RAM for further processing.
In order to match $R$ … performed directly after the searchless encoding stage is completed. To be more specific, the matching process starts once RAM 4-R is ready to provide the required range blocks for the PU1 processor. Since RAM 4-R is still in use by PU2 for twelve clock cycles after the last range block is fetched from the main memory (see Fig. 6), this number of clock cycles is required before starting the matching operation (i.e., computing SAD with the SADC unit) between $R^g_{i,p}$ ($p = 0, 1, 2, 3$) and $D^g_{j=i}$ in PU1. As $\rho$ is required in the matching operation, it is computed one clock cycle earlier (clock cycle #108). Thus, for preparing the next domain block $D^g_{i+1}$ for the next matching operation, $D^g_{i+1}$ needs to be fetched four clock cycles earlier (i.e., clock cycle #105). These clock cycles are required to compute the $\mu(D^g_{i+1})$ and $\rho$ values for the next matching operation, which allows the next matching to be performed immediately after the previous one is finished. In this case, $D^g_{i+1}$ is fetched eight clock cycles after $R^b_i$ has been completely fetched, and the search-based encoding stage is considered to start at this time. As a result, the searchless encoding phase needs these eight clock cycles in addition to the 96 clock cycles required for fetching the range blocks $R^g_i$, $R^r_i$ and $R^b_i$, i.e., 104 clock cycles altogether.
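The cycle budget stated above can be checked with a few lines of arithmetic. The sketch below is only a bookkeeping aid: the variable names are illustrative, and the assumption that each 8 × 8 block takes eight clock cycles to fetch is inferred from the 96-cycle figure for twelve range blocks rather than stated explicitly.

```python
# Assumed: one 8x8 block is fetched in 8 clock cycles (inferred from 96 / 12).
CYCLES_PER_BLOCK = 8
BLOCKS_PER_COMPONENT = 4   # range blocks p = 0..3
COMPONENTS = 3             # G, R and B

# Cycles spent fetching the twelve range blocks.
fetch_cycles = CYCLES_PER_BLOCK * BLOCKS_PER_COMPONENT * COMPONENTS

# RAM 4-R stays busy with PU2 for twelve cycles after the last fetch,
# so the first SAD computation in PU1 starts on the following cycle.
ram_busy_cycles = 12
first_sad_cycle = fetch_cycles + ram_busy_cycles + 1

# rho is computed one cycle before the SAD; the domain block is
# fetched four cycles before the SAD.
rho_cycle = first_sad_cycle - 1
domain_fetch_cycle = first_sad_cycle - 4

# The searchless phase covers every cycle before the domain fetch.
searchless_cycles = domain_fetch_cycle - 1

print(fetch_cycles, rho_cycle, domain_fetch_cycle, searchless_cycles)
```

With these figures the sketch reproduces the values given in the text: 96 fetch cycles, $\rho$ at clock cycle #108, the domain fetch at clock cycle #105 and 104 cycles for the searchless phase.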
At clock cycle #105, the processor PU1 starts encoding $R^g_{i,p}$ ($p = 0, 1, 2, 3$) by the proposed search scheme as shown in … and $\mu(D^g_{j=i})$ are stored in µ(R D) and µ(D R), respectively. All the mean computations are carried out on the MCC unit. However, unlike the situation in PU1, the mean values of the range blocks $\mu(R^g_{i=j,p})$ are calculated after the mean value of the domain block $\mu(D^g_{j=i})$. Therefore, PU2 starts the matching process once $\mu(R^g_{i=j,p})$ is available in the µ(R D) register. In this case, the first matching of PU2 is performed after fetching the first range block $R^g_{i=j,0}$ ($j = i + 1$) and computing its mean value $\mu(R^g_{i+1,0})$. This corresponds to twelve clock cycles from the beginning of the search encoding stage. PU2 then computes $\delta_u$ of $R^g_{i+1,0}$ and $D^g_{j=i}$ using its own OCU unit (i.e., OCU2), and stores the results in their respective registers in the following cycle. Following this, SADC2 is enabled to calculate the SAD values and needs to be fed with the corresponding range and domain blocks, i.e., $R^g_{i+1,0}$ and $D^g_{j=i}$. To do so, $R^g_{i+1,0}$ and the contracted $D^g_{j=i}$ are stored in the internal memories (i.e., the 64 × 12-bit shift register and the dual-port RAM D R of size 8 × 64 bits) to be utilised by SADC2. The outputs from SADC2 are stored in the SAD2 and FC2 registers. The SAD2 value is then compared with the minimum SAD value that has already been obtained from the previously performed matching operations and stored in the FC-SAD RAM. The SAD2 and FC2 values are stored in the FC-SAD RAM if and only if the SAD2 value is smaller than the previously stored SAD value. These operations are repeated for each $R^g_{i=j,p}$, $j = i+1, \ldots, 63$. In conclusion, the range blocks of the secondary components ($R^r_{i,p}$ and $R^b_{i,p}$, $p = 0, 1, 2, 3$) are encoded first, followed by the range blocks $R^g_{i,p}$ of the primary component. 
Once these range blocks have been encoded with the proposed encoding schemes, the next blocks are fetched and processed in the same manner. This process is repeated until the entire image has been encoded.
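The conditional update of the stored minimum in the search scheme (a new result replaces the stored one only if its SAD is strictly smaller) can be modelled as below. This is a minimal software sketch: `keep_minimum`, the tuple layout and the example SAD values are hypothetical and stand in for the FC-SAD RAM contents and the SAD2/FC2 registers.

```python
def keep_minimum(stored, candidate):
    """Model of the FC-SAD update rule.

    `stored` and `candidate` are (sad, fractal_code) tuples; `stored` is None
    before the first matching. The candidate replaces the stored entry only
    if its SAD is strictly smaller.
    """
    if stored is None or candidate[0] < stored[0]:
        return candidate
    return stored


# Hypothetical results of successive matchings for one range block.
# Each fractal code is (delta_index, rho, domain_position); values are made up.
candidates = [
    (420, (1, 17, 0)),
    (310, (2, 9, 1)),
    (310, (0, 33, 2)),   # equal SAD: does not replace the stored entry
    (355, (3, 5, 3)),
]

stored = None
for candidate in candidates:
    stored = keep_minimum(stored, candidate)

print(stored)
```

Because the comparison is strict, a later match with the same SAD leaves the earlier entry in place, which is what the "if and only if smaller" condition in the text implies.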

IV. RESULTS AND DISCUSSION
The proposed FCIC architecture, depicted in Fig. 3, has been synthesised and implemented on an Altera DE4 board. This architecture utilises the inherent correlation between the colour components to compress a 1024 × 1024 RGB image in a short time. The design characteristics are given in Table 1. From this table, it can be seen that the proposed FCIC design utilises 4973 logic elements (LEs), which represents only around 2% of the total available LEs. With respect to the memory requirement, the design needs 13403 bits of memory. Finally, the design can operate successfully at a maximum frequency of 380 MHz.
Meanwhile, the runtime, or encoding time, can also be calculated by computing the time needed to encode each component individually and adding up the results. Since the colour components are processed in two different groups (i.e., the primary and secondary groups), the total encoding time equals the sum of the encoding times of the two groups. Here, the processing time of each group is obtained by computing the total number of clock cycles required to execute the whole process.
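As a rough illustration of this cycle-counting approach, the sketch below works through the secondary group, using the 104 clock cycles per set of eight range blocks from the searchless stage and the 380 MHz maximum frequency reported above. The variable names are illustrative only.

```python
IMAGE_SIDE = 1024        # image is 1024 x 1024 pixels per component
BLOCK_SIDE = 8           # range blocks are 8 x 8 pixels
CLOCK_HZ = 380e6         # maximum operating frequency
CYCLES_PER_SET = 104     # searchless stage, per set of eight range blocks
BLOCKS_PER_SET = 8       # four R-component and four B-component blocks

# Range blocks in one colour component.
blocks_per_component = (IMAGE_SIDE // BLOCK_SIDE) ** 2

# The secondary group holds two components (R and B).
sets = 2 * blocks_per_component // BLOCKS_PER_SET
total_cycles = sets * CYCLES_PER_SET

# Convert cycles to milliseconds at the stated clock frequency.
secondary_ms = total_cycles / CLOCK_HZ * 1e3
per_component_ms = secondary_ms / 2

print(sets, total_cycles, round(secondary_ms, 2), round(per_component_ms, 2))
```

This gives 4096 sets, 425984 cycles and about 1.12 ms for the secondary group, i.e., about 0.56 ms per secondary component.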
Referring to Section 3.2.3, every set of eight $8\times 8$ range blocks in the secondary group (i.e., $R^r_{i,p}$ and $R^b_{i,p}$, $p = 0, 1, 2, 3$) needs 104 clock cycles to be processed by the proposed searchless encoding scheme. As the two colour components of the secondary group have 4096 sets of eight range blocks ($4096 = \frac{2\times 1024\times 1024}{8\times 8\times 8}$), the total number of clock cycles required for encoding these colour components is $4096\times 104 = 425984$ cycles. As a result, the encoding time for the secondary group equals $425984\times \frac{1}{380\ \text{MHz}} \approx 1.12$ ms.
Similarly, the CR can also be calculated based on the sizes of the fractal codes of both the primary and the secondary groups. Since the two groups are encoded with different bit sizes, the compressed size needs to be computed for each group individually. For the primary group, the fractal code size is 15 bits for each range block (i.e., 2 bits for $\delta$, 7 bits for $\rho$ and 6 bits for the spatial information of the matched domain block). In this case, the compressed size of the primary group is $\frac{1024\times 1024}{8\times 8}\times 15 = 245760$ bits. Unlike the primary group, whose range blocks are encoded with three parameters, the secondary group's range blocks are encoded with only two parameters (i.e., $\delta$ and $\rho$). In this case, each $8\times 8$ range block in the secondary group is encoded by 9 bits (i.e., 2 bits for $\delta$ and 7 bits for $\rho$). Thus, the compressed size of the secondary group is equal to the total number of range blocks multiplied by 9, which is $\frac{2\times 1024\times 1024}{8\times 8}\times 9 = 294912$ bits. As a result, the total CR of the image is equal to the ratio between the original image size and the total compressed size of both groups, which is $\frac{3\times 1024\times 1024\times 8}{245760+294912} \approx 46.5$.
To see clearly the advantages of utilising the correlation between the colour components when encoding colour images, as in the presented design, the result from the introduced design needs to be compared with a similar design that uses the conventional approach of encoding each colour component individually. 
To do so, the presented design has been modified to encode each colour component separately with the introduced search scheme (the searchless encoding scheme is disabled, as it cannot be used to encode all the colour components in the image). Thus, all colour components of the RGB image are encoded in the same way as the G component. By doing so, we can fairly compare the proposed design and see the disadvantages of encoding the colour components individually, as is the case for nearly all designs available in the literature. Table 2 shows the comparison of both approaches. It is clear that the design with the proposed approach is nearly three times faster than the same design without utilising the correlation among the colour components. With the proposed approach, the R and B components are encoded in 0.56 ms each, compared to 11.4 ms each when encoded individually by the search scheme. Thus, the proposed approach reduces the compression time to about one third of that of the conventional approach. Furthermore, the CR of the proposed approach is 36% higher than that of the conventional approach, as the spatial information for the mapped blocks does not need to be stored when the searchless scheme is used. Although the CR of the proposed approach is higher, the PSNR is also improved by 0.7 dB on average. The decompressed images for both approaches are shown in Fig. 9 for subjective assessment. For further clarity, the PSNR results for the four tested images are shown in Fig. 10. From the illustrated results, it is clear that the introduced design with the proposed approach significantly outperforms the conventional approach.
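The CR figures quoted for the two approaches can be reproduced from the fractal code sizes alone. The sketch below assumes, as described above, that the conventional variant spends the full 15-bit search code on every range block of all three components; the variable names are illustrative.

```python
PIXELS = 1024 * 1024            # pixels per colour component
BLOCKS = PIXELS // (8 * 8)      # 8 x 8 range blocks per component
SEARCH_BITS = 2 + 7 + 6         # delta, rho, domain position
SEARCHLESS_BITS = 2 + 7         # delta and rho only

# Uncompressed RGB image: three components, 8 bits per pixel.
original_bits = 3 * PIXELS * 8

# Proposed: G uses the search code, R and B use the searchless code.
proposed_bits = BLOCKS * SEARCH_BITS + 2 * BLOCKS * SEARCHLESS_BITS

# Conventional: every component uses the search code.
conventional_bits = 3 * BLOCKS * SEARCH_BITS

cr_proposed = original_bits / proposed_bits
cr_conventional = original_bits / conventional_bits
gain_percent = (cr_proposed / cr_conventional - 1) * 100

print(round(cr_proposed, 1), round(cr_conventional, 1), round(gain_percent))
```

Under these assumptions the proposed scheme reaches a CR of about 46.5 against roughly 34.1 for the conventional variant, an increase of about 36%, which agrees with the figure reported in the text.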
The achieved performance has also been compared with that of the available grayscale-based implementations. The comparison is made with grayscale-based designs rather than colour-based designs because there is no notable hardware implementation in the literature for encoding colour images, while there are many for encoding grayscale images. One reason for this lack of colour hardware implementations is that colour images can also be encoded by a design developed for grayscale images. Thus, several well-known grayscale-based designs and implementations from the literature are reviewed and compared in Table 3. In this comparison, their reported encoding times for grayscale images have been multiplied by three to obtain the encoding times for RGB images, as each of the three colour components is assumed to be encoded separately as a grayscale image.
From the table, it can be seen that the encoding time of the proposed architecture is the lowest, being 2×, 133×, 11250×, 2125× and 116× lower than those of [1], [2], [3], [4], and [6], respectively. If the image size is taken into account, and knowing that the proposed design can encode 512 × 512 × 3 and 256 × 256 × 3 images in 3 ms and 0.75 ms, respectively, the proposed design is 8×, 2133×, 11250×, 8500× and 1866× faster than [1], [2], [3], [4], and [6], respectively. From these figures, it is clear that the proposed architecture is significantly superior to the others. Among the targeted platforms, FPGA shows the lowest running times compared to the CPU and GPU platforms, even though it runs at a relatively lower clock speed. This is because hardware solutions are dedicated to and optimised for a specific application (fractal image compression here) and process the data in a large number of parallel processing units. The high speed of the introduced architecture is attributed to several factors mentioned earlier, including the processing of the image blocks in parallel and in several pipeline stages. The latter strategy allows the clock speed to be increased to 380 MHz, which in turn leads to a high encoding speed.
In terms of the compression ratio, the presented architecture is also superior to the others. One reason is the encoding of two colour components by direct mapping, which saves the bits allocated for the spatial information. Another reason is the encoding of the image blocks at a size of 8 × 8 pixels rather than 4 × 4 pixels. Although this block size leads to a higher compression ratio, it generally causes lower quality in the decompressed images. However, Fig. 9 shows that the quality of the decompressed images encoded by the proposed approach is clearly good, and the PSNRs achieved are above 34 dB for all four tested images.

V. CONCLUSION
As colour images can be processed by the designs developed for grayscale images (where each colour component is treated as a grayscale image and hence is encoded individually), hardware developers have mainly focused on developing hardware solutions for grayscale images. However, encoding colour images in this way multiplies the encoding time n-fold, where n is the number of colour components in the image. To overcome this encoding time problem, this article has developed a high-speed architecture for encoding colour images with the fractal technique. The developed architecture utilises the correlation between the colour components of an RGB image in order to encode the R and B colour components by direct mapping to the G component (searchless scheme). Thus, only the G component is encoded by the time-consuming search-based scheme. The presented architecture comprises two matching processors. A deep pipeline processing technique with 12 stages is also utilised in developing both processors, which significantly improves the throughput and the encoding time. The introduced architecture showed the ability to encode a 1024 × 1024 RGB image in 12 ms, where the R and B components require only 0.56 ms each compared to 11.06 ms for the G component. The proposed design clearly showed the efficiency of utilising inter-colour correlation in developing a specific architecture for encoding colour images by the fractal algorithm. As future work, the inter-frame correlation will be utilised in developing a fractal video coding system.
MOHD ZAID ABDULLAH received the B.App.Sc. degree in electronic from the Universiti Sains Malaysia (USM), in 1986, and the M.Sc. and Ph.D. degrees in instrument design and application from the University of Manchester Institute of Science and Technology, U.K., in 1993. He remained with the University of Manchester Institute of Science and Technology, carrying out research in electrical impedance tomography. He joined Hitachi Semiconductor (Malaysia) as a Test Engineer. He is currently a Professor with the School of Electrical and Electronic Engineering, Universiti Sains Malaysia. He has published numerous research papers in international journals and conference proceedings. His research interests include microwave tomography, digital image processing, computer vision, and ultra-wide band sensing. One of his papers was awarded The Senior Moulton medal for the best article published by the Institute of Chemical Engineering, in 2002.