Disparity estimation provides depth information for a variety of applications, such as free-viewpoint TV systems and robotics. Among disparity estimation algorithms, belief propagation (BP) can deliver high disparity quality. However, BP has a high computational complexity of about O(ID^2T) for dense disparity, where I is the number of pixels in an image, D is the disparity range, and T is the iteration count. Therefore, a fast algorithm or a hardware implementation is needed for real-time applications.
To improve the computational speed, Felzenszwalb and Huttenlocher have proposed hierarchical BP (HBP), which processes message updates from the coarse level to the fine level so that the energy function converges faster. Based on HBP, Brunton et al. implement it on a graphics processing unit (GPU), but its performance is only 1.6 fps for I = 384 × 288 and D = 16. Another GPU implementation is Yang's work, which adopts a fast-converging approach to reduce run time and achieves near real-time performance of 16 fps for I = 384 × 288 and D = 16. In contrast to the GPU implementations, Park et al. propose an array processing architecture for HBP and implement it on two FPGA devices. Their design applies a sliding window scheme to reduce storage cost. However, it still suffers from a large storage cost of 880 KB for I = 320 × 240 and D = 32.
Motivated by the storage cost problem, we focus on the message update processing element (PE), which requires the most storage in BP. We propose four low-memory-cost architectures: post-normalization, shadow buffer, no memory, and no memory+double PE. Compared to Park's design for I = 320 × 240 and D = 64, the lowest-memory-cost architecture, no memory, saves up to 55% of the hardware cost at a processing speed of 16 fps. The best proposed architecture is the no memory+double PE architecture, which not only achieves real-time performance of 30 fps but also reduces the hardware cost by up to 28%.
The rest of this paper is organized as follows. In Section II, we first discuss the storage cost issue in BP, and then we present our proposed architectures. In Section III, we compare the proposed architectures with previous designs. Finally, we conclude this paper in Section IV.
A. Design Issue
In general, the BP algorithm is composed of the following steps: local matching cost computation, message update, belief computation, and disparity selection. Of these steps, the message update has the highest hardware cost because of its iterative processing. The hardware cost of the message update can be divided into three parts: the PE computational circuit, the PE storage, and the node storage. In this paper, we focus on the major hardware costs, the node storage and the PE storage.
The node storage for each node holds one matching cost set and four incoming messages, so its size is 5D unit data. Because of the iterative processing, the messages have high data dependency, and the matching costs are read in each iteration. Therefore, both the messages and the matching costs need to be kept in the node storage, and the total size of the node storage amounts to 5ND, where N is the number of nodes.
The PE storage holds the intermediate data computed during each message update, and its size is analyzed as follows. Fig. 1 shows the pseudo code of the message update. Note that the code describes the message update for only one message direction of one node. To process an entire image, these steps need to be performed 4IT times. In this code, the message update consists of three loops and has a latency of 3D cycles. A direct design of this code requires 2D unit data to pass intermediate results between the three loops. Thus, with this storage replicated for the four message directions, the total size of the PE storage is 8PD, where P is the number of PEs and P is usually smaller than N.
Fig. 1. Pseudo code detailing the message update processing.
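Since the pseudo code of Fig. 1 is not reproduced here, the following Python sketch illustrates one plausible reading of the three-loop flow described above. It is not the original pseudo code. It assumes a linear smoothness model (the `smooth` penalty), assumes that normalization subtracts the minimum message value, and infers the split into loops from the later description of the pipeline stages. The names `update_message`, `mf`, and `mb` are hypothetical.

```python
def update_message(cost, incoming, smooth=1):
    """One-direction message update for one node (illustrative sketch only).

    cost     -- matching costs of the node, one value per disparity (length D)
    incoming -- the three incoming messages used for this direction
    smooth   -- assumed linear smoothness penalty per disparity step
    """
    D = len(cost)
    mf = [0] * D  # stands in for stage storage Mf (D unit data)
    mb = [0] * D  # stands in for stage storage Mb (D unit data)

    # Loop 1: aggregation followed by the forward pass (D iterations).
    for d in range(D):
        h = cost[d] + sum(m[d] for m in incoming)
        mf[d] = h if d == 0 else min(h, mf[d - 1] + smooth)

    # Loop 2: backward pass, reading Mf in reverse order (D iterations).
    # The normalization value is tracked on the fly (assumed: the minimum).
    norm = None
    for d in range(D - 1, -1, -1):
        mb[d] = mf[d] if d == D - 1 else min(mf[d], mb[d + 1] + smooth)
        norm = mb[d] if norm is None else min(norm, mb[d])

    # Loop 3: normalization (D iterations).
    return [mb[d] - norm for d in range(D)]


zeros = [0, 0, 0, 0]
new_msg = update_message([5, 1, 4, 9], [zeros, zeros, zeros])
print(new_msg)
```

In this sketch the two lists `mf` and `mb` account for the 2D unit data of PE storage per direction, and the three sequential loops account for the 3D latency.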
B. Previous Architecture
Park et al. directly map the message update onto the PE architecture illustrated in Fig. 2. The whole operation follows the computing flow in Fig. 1. In Fig. 2, the node storage is drawn outside the message update PE for clarity. The message update PE uses the costs and the messages of iteration t−1 in the node storage to compute the new message of iteration t. In the message update PE, the four processing steps (aggregation, forward, backward, and normalization) are divided into three pipeline stages, each of which takes D cycles. The pipeline stage storages Mf and Mb are stack types, and the stack direction must be flipped every D cycles. To support this pipelined stack access scheme, both storages must be implemented with registers, since one unit data of each storage is written by the previous stage and read by the next stage simultaneously. Implementing storage with registers is not economical, especially for large P.
Using Park's design and the bipartite approach, the storage cost amounts to 5DN + 8DP = 9DN, where P = N/2. Because of this large storage cost, both the node storage and the PE storage must be placed inside the core rather than outside it, since the limited external bandwidth is insufficient for the extremely high data rate.
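The storage total above can be checked with a few lines of Python. This is only a bookkeeping sketch in unit data (not bits or gates). The function name `park_storage_units` and the sample values N = 512 and D = 64 are chosen here for illustration.

```python
def park_storage_units(n_nodes, n_pe, d_range):
    """Storage of the direct design in unit data: node storage + PE storage."""
    node_storage = 5 * n_nodes * d_range  # one cost set + four messages per node
    pe_storage = 8 * n_pe * d_range       # 2D per direction, four directions per PE
    return node_storage + pe_storage


N, D = 512, 64
total = park_storage_units(N, N // 2, D)  # bipartite approach: P = N/2
print(total, 9 * D * N)
```

With P = N/2 the PE storage term 8DP equals 4DN, which is how the total reaches 9DN.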
C. Proposed Architectures
To reduce the large storage cost, we propose four architectures for the message update PE, depicted in Fig. 3.
Fig. 3. Proposed PE architectures: (a) Post-normalization, (b) shadow buffer, (c) no memory.
Fig. 3(a) shows the post-normalization architecture. Referring to Fig. 1, we remove the normalization processing from Loop3 and merge it with the aggregation processing of Loop1. In other words, the normalization for the current iteration t takes place at the beginning of the next iteration t+1. With the post-normalization scheme, the stage storage Mb can be eliminated, which reduces the PE storage by 50%. However, this architecture has some overheads. First, the message width in the node storage needs to be extended by 4 bits. Second, the norm registers move from inside the PE to outside it, so their number grows from 4P to 4N. Third, because the normalization processing is moved, the critical path of the aggregation processing becomes longer.
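The equivalence behind post-normalization can be sketched in Python as follows. The sketch assumes, as an illustration only, a linear smoothness model and a normalization that subtracts the minimum message value; the helper names `min_conv` and `aggregate` and the sample numbers are hypothetical. It compares aggregating already-normalized messages with aggregating raw messages whose stored norm values are subtracted during aggregation.

```python
def min_conv(h, smooth=1):
    """Forward and backward passes without normalization (raw message)."""
    out = list(h)
    for d in range(1, len(out)):
        out[d] = min(out[d], out[d - 1] + smooth)
    for d in range(len(out) - 2, -1, -1):
        out[d] = min(out[d], out[d + 1] + smooth)
    return out


def aggregate(cost, msgs, norms):
    """Aggregation that subtracts each message's stored norm on the fly."""
    return [cost[d] + sum(m[d] - n for m, n in zip(msgs, norms))
            for d in range(len(cost))]


cost = [3, 0, 2, 7]
raw = [min_conv([4, 9, 6, 5]), min_conv([8, 2, 2, 6]), min_conv([1, 5, 9, 3])]
norms = [min(m) for m in raw]  # assumed normalization value per message

# Conventional flow: normalize in iteration t, aggregate in iteration t+1.
normalized = [[v - n for v in m] for m, n in zip(raw, norms)]
conventional = aggregate(cost, normalized, [0, 0, 0])

# Post-normalization: store raw messages plus norms, subtract at aggregation.
post_norm = aggregate(cost, raw, norms)
print(conventional == post_norm)
```

The raw messages are larger than the normalized ones, which is consistent with the wider message words and the per-node norm registers listed as overheads above.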
Shadow Buffer Architecture
Fig. 3(b) shows the shadow buffer architecture, whose stage storage Mf can be implemented with a 2-port memory instead of a group of registers. The inserted shadow buffer holds the data computed by the forward processing for one cycle before it is written. As a result, the same element of Mf is never written and read in the same cycle. This architecture adds only the small overhead of the shadow buffers in exchange for larger cost savings.
No Memory Architecture
Fig. 3(c) shows the no memory architecture. All the stage storage in this architecture is removed, so the intermediate data must be stored in the node storage. The access schedule of the node storage is as follows. First, the aggregation, post-normalization, and forward processing steps are completed, and their results are written into the node storage over D cycles. Then, the backward processing reads the data from the node storage and computes for another D cycles, while the data it produces are written back into the node storage. Since the node storage is read and written concurrently, it needs to be implemented with a 2-port memory. In addition, the computing latency is twice that of the other architectures.
An advantage of this architecture is that the forward and backward processing can be merged into one circuit, named convolution processing, because only one of them works at a time and their circuits are partially identical. However, the merged circuit incurs the cost of a few additional multiplexers, and its critical path is also longer. By eliminating all PE storage and the redundant circuit, this message update PE achieves the largest reduction in hardware cost, with some overhead in the node storage.
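The idea of one shared convolution circuit can be illustrated with a single Python function that serves as both the forward and the backward pass, selected by a direction flag, and that works in place on a buffer standing in for a region of the node storage. This is a behavioral sketch under an assumed linear smoothness model, not a description of the actual circuit. The names `conv_pass` and `node_buf` are hypothetical.

```python
def conv_pass(node_buf, smooth, reverse):
    """One pass of the shared min-convolution, updating node_buf in place.

    reverse=False models the forward pass, reverse=True the backward pass.
    Returns the number of modeled cycles, assuming one element per cycle.
    """
    D = len(node_buf)
    # The flag only changes the traversal order and the neighbor index,
    # which is the part a hardware multiplexer would select.
    order = range(D - 2, -1, -1) if reverse else range(1, D)
    neighbor = 1 if reverse else -1
    for d in order:
        node_buf[d] = min(node_buf[d], node_buf[d + neighbor] + smooth)
    return D


node_buf = [5, 1, 4, 9]  # aggregated values already written to node storage
cycles = conv_pass(node_buf, smooth=1, reverse=False)
cycles += conv_pass(node_buf, smooth=1, reverse=True)
print(node_buf, cycles)
```

Because the two passes run one after the other on the same buffer, the modeled latency is 2D cycles, matching the doubled latency noted for this architecture.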
No Memory+Double PE Architecture
To overcome the doubled latency of the no memory architecture, we double the number of PEs to increase throughput. Although doubling the PEs doubles the PE hardware cost, the overall cost is still reduced because the hardware cost of this PE is very small.
Table I compares the different architectures. The post-normalization architecture requires only half of the PE storage of Park's design because the normalization is relocated. However, its node storage needs a wider data width. The shadow buffer architecture replaces the costly registers in Park's design with 2-port memory, which is more economical. The no memory architecture has no storage inside the PE and combines Fwd and Bwd into Conv. The cost of the convolution processing is about half of that in the other architectures. However, its latency is double that of the others. The no memory+double PE architecture has the same node storage as the no memory architecture, but it doubles the number of PEs to reduce the latency from 2D to D. Doubling the PEs doubles the hardware cost of the PE storage and computational circuit.
TABLE I Comparison of Performance Estimation Among Park's Design and Proposed Architectures
To make specific comparisons, we apply the equivalent gate counts and delay times in Table II to Table I. All the architectures operate at 25 MHz. To achieve a real-time frame rate at this clock rate, the number of PEs, P, must be at least 236 for I = 320 × 240, T = 40, D = 64, and B = 16, where B is the original data width of the matching costs and messages. For simplicity, we choose P = 256 to compare the hardware cost as N increases.
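The PE count quoted above can be reproduced with a simple throughput estimate. The sketch below is an assumption-laden back-of-the-envelope model, not the derivation used for the tables: it assumes a real-time target of 30 fps, that every pixel is updated in each of the T iterations, and that a pipelined PE finishes one node update every D cycles (or every 2D cycles for the no memory architecture, modeled by `passes=2`). The function names are hypothetical.

```python
import math


def min_pe_count(pixels, iterations, d_range, fps, clock_hz, passes=1):
    """Smallest number of PEs that sustains the target frame rate."""
    cycles_per_second = pixels * iterations * d_range * passes * fps
    return math.ceil(cycles_per_second / clock_hz)


def frame_rate(pixels, iterations, d_range, n_pe, clock_hz, passes=1):
    """Frame rate reached by n_pe PEs under the same assumptions."""
    return n_pe * clock_hz / (pixels * iterations * d_range * passes)


I, T, D, CLOCK = 320 * 240, 40, 64, 25e6

p_min = min_pe_count(I, T, D, fps=30, clock_hz=CLOCK)
fps_no_memory = frame_rate(I, T, D, n_pe=256, clock_hz=CLOCK, passes=2)
fps_double_pe = frame_rate(I, T, D, n_pe=512, clock_hz=CLOCK, passes=2)
print(p_min, round(fps_no_memory, 1), round(fps_double_pe, 1))
```

Under these assumptions the estimate gives 236 PEs for 30 fps, about 16 fps for the no memory architecture with P = 256, and above 30 fps once the number of PEs is doubled, which agrees with the frame rates stated in this paper.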
TABLE II Equivalent Gate Counts of Different Operators and Memories Used for Cost Estimation. All Data Are Derived From UMC 90 nm CMOS Technology
Fig. 4 shows the hardware cost of each architecture as N increases from 256 to 2048 in steps of 128. Fig. 4(a) compares only the storage cost with Park's design. Our proposed architectures save storage cost when N is not very large: the advantage holds when N is less than 896 for the post-normalization and shadow buffer architectures, and less than 1280 for the no memory and no memory+double PE architectures. The proposed architectures save the most storage cost when N is equal to P. The maximum reduction in storage cost reaches 26% for the post-normalization architecture, 28% for the shadow buffer architecture, 58% for the no memory architecture, and 34% for the no memory+double PE architecture.
Fig. 4. Comparison of the cost of the different architectures as N increases, for P = 256, D = 64, and B = 16. (a) Storage cost. (b) Overall cost, including storage cost and computational circuit.
Fig. 4(b) compares the overall hardware cost, including both the storage cost and the computational circuit. The cost savings reach 24% for the post-normalization architecture, 26% for the shadow buffer architecture, and 55% for the no memory architecture when N = 256. Although the no memory architecture has the lowest cost, its latency is double that of the others. The no memory+double PE architecture doubles the throughput and reduces Park's hardware cost by 28% when N = 512. From the above cost estimates, we observe that N determines the overhead cost of our proposed architectures. Therefore, under a constrained N, the message update PE can adopt our proposed architectures and gain savings in hardware cost.
In this paper, we propose four architectures for the message update PE and estimate their hardware cost. The best of them is the no memory+double PE architecture, which removes all the storage from the PE and eliminates its redundant circuit. In addition, this architecture doubles the number of PEs to raise its throughput and achieve real-time performance. Compared to Park's design, the proposed architecture reduces the hardware cost by up to 28% and supports real-time video-based applications at 30 fps for 320 × 240 images and 64 disparity levels. Our future work is to take the node allocation into consideration to further reduce the hardware cost of BP.