Low-Latency In Situ Image Analytics With FPGA-Based Quantized Convolutional Neural Network

Real-time in situ image analytics impose stringent latency requirements on intelligent neural network inference operations. While conventional software-based implementations on graphics processing unit (GPU)-accelerated platforms are flexible and have achieved very high inference throughput, they are not suitable for latency-sensitive applications where real-time feedback is needed. Here, we demonstrate that high-performance reconfigurable computing platforms based on field-programmable gate array (FPGA) processing can successfully bridge the gap between low-level hardware processing and high-level intelligent image analytics algorithm deployment within a unified system. The proposed design performs inference operations on a stream of individual images as they are produced and has a deeply pipelined hardware design that allows all layers of a quantized convolutional neural network (QCNN) to compute concurrently with partial image inputs. Using the case of label-free classification of human peripheral blood mononuclear cell (PBMC) subtypes as a proof-of-concept illustration, our system achieves an ultralow classification latency of 34.2 $\mu \text{s}$ with over 95% end-to-end accuracy by using a QCNN, while the cells are imaged at a throughput exceeding 29 200 cells/s. Our QCNN design is modular and is readily adaptable to other QCNNs with different latency and resource requirements.


I. INTRODUCTION
Recent advances in deep neural networks have given image analytics powerful tools to extract an unprecedented amount of information by learning from massive data sets. For offline, batch-mode analysis, a number of general-purpose [1], [2] and domain-specific [3], [4] software frameworks are readily available for both training and inference. Using state-of-the-art accelerators such as graphics processing units (GPUs), these software frameworks have enabled researchers to study and develop even the most complex application-specific neural networks with relative ease and high processing throughput. However, for online applications that require in-situ single-item image analytics for immediate feedback control, real-time implementation of such complex neural network inference operations remains an open challenge. Existing acceleration techniques using parallel computers or GPUs, while able to improve batch-processing throughput, are not designed for real-time applications that have the additional requirements of low processing latency and guaranteed performance. In autonomous vehicle platooning control [5], for instance, complex control decisions must be made by analyzing inputs such as arrays of multi-spectral cameras and LIDAR with low processing latency. Similarly, in many novel cell-based applications that are instrumental in cell biology research and drug discovery, such as high-speed cell sorting and encapsulation, real-time actuation of each individual cell after it is imaged is required [6], [7], [8]. In these application scenarios, the image analytic latency, as measured from the time an image is produced to the time an analytic decision can be made about that object, must be kept as low and predictable as possible to minimize its impact on the overall system design. Consider, for instance, the case of actuating cells that are imaged while flowing in a microfluidic channel at a speed of 1 m/s, such as those demonstrated in [9], [10]. Every d additional microseconds of decision latency leads to a corresponding increase of d micrometers in the length of microfluidic channel that requires precise control. In the system developed in [11], the proposed GPU-accelerated cell classifier using a deep convolutional neural network (CNN) incurred 3.2 ms of inference latency, which translates into a 3.2 cm microfluidic channel that the authors had to tightly control for cell sorting at the end. While the authors developed a hardware-based cell velocity estimator to control the sorting mechanism, it imposed substantial overhead on the already complex system design.
In this work, instead of relying on GPUs for acceleration, we demonstrate that a hardware-only quantized convolutional neural network (QCNN) implementation is superior in providing predictable and ultra-low inference latency suitable for a range of demanding real-time applications. We show that a field-programmable gate array (FPGA)-based reconfigurable computing system excels in these application scenarios as it can efficiently serve both as a low-level hardware integration platform and as a high-level computing device for complex analytics computations.
To illustrate this integrated hardware-processing concept, we adapted this system architecture to a recently developed quantitative label-free imaging flow cytometry technique, coined multi-ATOM, that achieves a high imaging line-scan rate of 11.8 MHz [9]. The entire image processing chain, from interfacing with the laser imaging sources and controlling the two data-acquisition analog-to-digital converters (ADCs), to image formation and cell detection and classification using a quantized convolutional neural network (QCNN), is performed in hardware within the central FPGA processing unit (CFPU). This unified, deeply pipelined processing chain allows the QCNN inference operation to execute in parallel with the series of cell image formation operations, thus ensuring the lowest possible overall classification latency by the time a cell is completely imaged.
The proposed QCNN featured a layer-parallel architecture that was designed to perform inference on a single input image without batch processing. The hardware design of each layer, including the convolutional layers, the pooling layers, and the fully connected layers, was parameterizable to facilitate resource-latency tradeoffs. Furthermore, computations were performed with an 8-bit number representation with quantized weights, which matched our input image data and provided a resource-efficient hardware implementation. Finally, our hardware QCNN design was deeply pipelined to maintain a high processing throughput that was compatible with the input cell imaging system.
With this system, we demonstrated an ultra-low cell classification latency of 34.2 µs, a three-orders-of-magnitude reduction from state-of-the-art GPU-accelerated software designs, with over 95 % end-to-end accuracy when classifying the sub-types of human peripheral blood mononuclear cells at a real-time throughput of 29 200 cells/s. In addition, when compared to software-based solutions, our FPGA-based analytic system is capable of producing results with the fixed and predictable latency that is needed for precise feedback control.
This work thus advances the state of the art in the following areas:
• We demonstrate a first-of-its-kind hardware-only QCNN inference machine co-optimized with the real-time imaging front end to perform real-time in-situ cell classification with low and predictable latency.
• We propose a streaming architecture for mixed-granularity pipelined operations, from image formation to cell classification, which allows a flexible latency-area tradeoff while maintaining high data processing throughput.
• Using the hardware-based QCNN inference implementation, we demonstrate an ultra-low latency in-situ image-based cell classifier with 34.2 µs processing latency and 95 % end-to-end accuracy.
In the next section, a brief review of related work in hardware-based neural network inference will first be presented. An overview of the in-situ cell image analytic platform and the integration between image formation and neural network inference will be given in Section III. Details of our QCNN implementation will be presented in Section IV. Performance evaluation of the system will be shown in Section V before we conclude in Section VI.

II. RELATED WORKS
Efficient hardware implementation of deep neural network inference has been a topic of intense research interest. When compared to typical software-based GPU-accelerated solutions, hardware implementations of neural network inference have demonstrated superior power efficiency, making them particularly useful for power-limited systems such as autonomous vehicles and intelligent edge devices. The use of application-specific architectures, both from a system and a microarchitecture point of view, as well as custom numeric-precision computations, have been key in enabling such advancements.
In one of the earliest works, [12] presented a custom architecture for performing convolutional neural network inference on FPGA. Subsequently, a number of large-scale frameworks have been developed [13], [14], [15], [16], [17], each based on its own custom architecture for accelerating neural network inference, but with a common thread of developing efficient matrix computations for maximum processing throughput.
Another important optimization strategy for hardware inference is to utilize non-standard or reduced-precision arithmetic. As observed in [18], the redundancy present in the vast parameter pools of deep neural networks allows inference to be computed with reduced-precision arithmetic operations without significant effect on model accuracy. This allows researchers to utilize fixed-point or integer arithmetic in place of more hardware-demanding floating-point operations for most of the demanding computations. To that end, a large body of work [19], [20], [21] has already demonstrated neural network inference implementations with 8-bit weights on FPGAs with negligible accuracy drop. [22] presented a series of models using 8-bit weights and activations to provide a trade-off between computation latency and classification accuracy. [23] also demonstrated a CNN with 8-bit weights for object detection while maintaining model accuracy as good as its floating-point counterpart. In [24], Baskin et al. implemented AlexNet inference on FPGA using 2-bit activations and demonstrated superior energy efficiency over GPUs. Pushing the technology further, a number of works have also studied the use of extremely low bit-width arithmetic such as binary [25], [26], [27], [28], [29], [30] and ternary weights [31], [32], [33] in various neural network inference operations, while keeping model performance degradation acceptable.
While the above QCNN designs on FPGA have successfully achieved good throughput performance, they were not particularly optimized for low-latency operation. To address the need for low-latency inference, Venieris et al. [34] proposed a latency-driven design methodology for mapping QCNNs on FPGA. A flexible architecture was designed that can be automatically derived based on the QCNN workload and FPGA resources using a synchronous dataflow computation model. Ma et al. [35] implemented quantized VGG-16 on FPGA for ImageNet classification. By analyzing the loop operations and dataflow in FPGA, a latency of 48 ms was achieved by balancing computation and memory traffic. Finally, Geng et al. [36] designed low-latency binarized neural network inference of AlexNet, VGGNet, and ResNet. Our design also aimed at reducing inference latency, but with the additional requirement of maintaining a deterministic latency. Furthermore, we achieved these goals by employing a customized hardware architecture and by using 8-bit quantized operations that matched our imaging front end. Finally, our QCNN was tightly coupled with the image formation hardware to allow true in-situ inference on cell images concurrently as they were formed. To the best of our knowledge, this is the first work that addresses total end-to-end inference latency with system-level co-optimization between the imaging front end and the hardware neural network inference engine to achieve an ultra-low classification latency of 34.2 µs.

III. SYSTEM ARCHITECTURE
Fig. 1 shows an overview of our in-situ cell image analytic platform. Our target QCNN was implemented inside the central field-programmable gate array (FPGA) that provided a unified reconfigurable computing platform for both low-level image formation and high-level image analytics. This tight integration allowed us to co-optimize the image formation and analytics operations, and to employ an end-to-end stream-based processing model across all tasks that maximized concurrency through careful pipelining at multiple levels of granularity. Most importantly, such tight integration allowed our QCNN classifier to perform image classification in-situ: the QCNN inference computation commenced as soon as the first few lines of an input cell image were produced, and it continued to operate on subsequent image lines in parallel as soon as they were produced.
In the following, the coarse-grained system-level pipeline that included the optical frontend, the image formation process, data filtering, and finally QCNN inference will first be described. Details of our fine-grained QCNN pipeline will be shown in Section IV.

A. Concurrent Image Formation & Image Analytics
At the system level, coarse-grained parallelism in a unified platform allowed our system to perform all sub-tasks of image formation and image analytics concurrently as a pipeline. This allowed analytic operations to commence before the input images were completely formed (Fig. 2a).
In our current implementation with the multi-ATOM imaging front end [9], images of the fast-flowing target cells in a microfluidic channel (at speeds exceeding 1 m/s) were captured by ultrafast laser line-scan illumination that was orthogonal to the cell flow. The distinctive feature of multi-ATOM is its ability to generate multiple raw image (phase-gradient) contrasts of the same cell in each line-scan. It has been demonstrated that these multiple contrasts (four in total in this work) can be harnessed to compute quantitative phase images (QPI), which quantify a multitude of valuable physical and mechanical cellular properties indicative of cell states and functions [37], [38]. At each time step i (moving downward in Fig. 2a), each subtask j of the pipeline (including image alignment, background removal, cell detection, the first convolutional layer of our CNN, etc.) passed its processing result from time i − 1 to the right while it began processing the new partial image data delivered from task j − 1 on the left. Because of this pipelined operation, as illustrated in the last two rows of the figure, the sliding-window operation of the first convolutional layer conv1 (right-most column) was able to commence on the first part of a detected cell while the rest of the cell was still being imaged (left-most column). Although not shown in Fig. 2a, this pipeline pattern continued throughout our CNN classifier until the final classification decision was formed (see Section IV for details).
It is worth noting that because of this pipelined operation, the complete image of a detected cell (Fig. 2c) was never available as a whole at any point in the system during real-time in-situ analytics. In applications where the complete recording of the imaging channel or the detected cell images is needed, they can be selected through the on-chip selection network and transmitted to the data aggregation cluster for further analysis. Furthermore, while each image snapshot shown in Fig. 2a contains 25 lines of the input for illustration purposes, in practice our hardware implementation (Section III-B) operated on a much finer granularity of 16-pixel blocks that were produced by the sampling ADC.
The aforementioned unification of image formation and analytics into one seamless pipeline allows co-optimizations between the two that are otherwise difficult to achieve. To reduce complexity in the current imaging front end, the usual step of quantitative phase image reconstruction in the original multi-ATOM (Fig. 2c) was omitted. Instead, our classifier took the four raw phase-gradient images of each cell and stitched them together into a single image (Fig. 2d) for training and inference by the QCNN. The integrated pipeline also allowed us to adapt to the limited ADC bandwidth by compensating for the reduced cell image resolution with QCNN implementations that achieved similar accuracy at the same data throughput.

B. Fine-grained Reconfigurable Image Processing Hardware Pipeline
In addition to the coarse-grained parallelism at the system level, implementing our image analytics system on FPGA further allowed us to exploit fine-grained parallelism at the gateware implementation level. This was particularly important for the stages closest to the imaging source, where the data bandwidth requirement was most stringent.
To illustrate this need for fine-grained pipeline design, consider our current application that interfaced with the imaging frontend through a pair of high-speed ADCs. On each channel, the ADC produced 8-bit samples at 3.9648 GHz and transmitted 16 samples in parallel to the FPGA at the system clock rate of 247.8 MHz, which resulted in a sustained data input throughput of 3.9648 GB/s. In other words, to process these ADC samples in time, the first few stages of the image processing pipeline must be able to process 16 pixels in parallel every cycle.
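As a quick consistency check of these figures, the following Python sketch reproduces the quoted rates from the stated clock and bus parameters; the variable names are ours and are not part of the actual gateware.

```python
# Sanity check of the ADC-to-FPGA data-rate figures quoted above.
# All values are taken from the text; this is only a consistency check.

adc_sample_rate_hz = 3.9648e9   # 8-bit samples per second, per channel
samples_per_cycle = 16          # parallel samples delivered to the FPGA
system_clock_hz = 247.8e6       # FPGA fabric (system) clock

# 16 samples per cycle at 247.8 MHz must match the ADC sample rate
assert samples_per_cycle * system_clock_hz == adc_sample_rate_hz

# One byte per 8-bit sample gives the sustained input throughput
throughput_gb_per_s = adc_sample_rate_hz / 1e9
print(f"sustained input throughput: {throughput_gb_per_s} GB/s")  # 3.9648 GB/s
```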
Consequently, both the line alignment module (Fig. 3b) and the background removal module (Fig. 3d) employed a similar design strategy that relied on an off-datapath module for estimation while the main datapath maintained the input data rate.
In the case of line alignment, the estimator observed the first usable line of input to determine an offset that allowed the system to center the cells in the images. This was achieved by summing the values of each of the 16-pixel blocks with an adder tree as they were read from the ADC. Each image line contained 21 blocks, and the index of the block with the lowest summed value was determined, which corresponds to the dark color band in Fig. 3a. Subsequent lines of the cell images were aligned with respect to this index. The main datapath dropped all pixels until the correct alignment was obtained, and maintained the input data throughput from then on. Details of the line alignment operation are shown in Algorithm 1.
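To make the estimator concrete, the following Python sketch models its behaviour on a single 336-pixel line; it is a software analogue under our own naming, not the hardware description of Algorithm 1, and the roll-based alignment stands in for the hardware's pixel-dropping datapath.

```python
import numpy as np

def estimate_alignment_offset(first_line: np.ndarray, block: int = 16) -> int:
    """Software model of the line-alignment estimator (Algorithm 1 sketch).

    The hardware sums each 16-pixel block of the first usable line with an
    adder tree and picks the block with the lowest sum, which corresponds
    to the dark band in Fig. 3a.
    """
    assert first_line.size == 336                  # 21 blocks of 16 pixels per line
    block_sums = first_line.reshape(-1, block).sum(axis=1)   # 21 block sums
    return int(np.argmin(block_sums))              # index of the darkest block

def align_line(line: np.ndarray, offset_blocks: int, block: int = 16) -> np.ndarray:
    """Shift a line so the darkest block lands at a fixed position.

    The real datapath instead drops pixels until alignment is reached and
    then sustains the full input rate; np.roll is used here for brevity.
    """
    return np.roll(line, -offset_blocks * block)
```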
Similarly, the background removal subsystem also relied on an estimator that determined the background image by analyzing the incoming image lines in parallel with the main dataflow. Two common statistical methods (mean and median) were considered for estimating the background values along the image columns. The median method was used by our collaborators for offline analysis, since it was more robust against sparse and asymmetric variations of pixel values in the raw data and was able to provide more accurate background estimations than the mean method in some cases. However, the median computation involved data sorting, which introduced substantially higher latency than the simpler hardware pipeline for mean computation. Therefore, to maximize the latency headroom for the remaining image analysis stages, the mean method was adopted. Pixel values were accumulated across 8 rows of image data, then divided by 8 using a simple right shift by 3 bits (Fig. 3d) to obtain the mean values as the background estimation along the image columns of 336 pixels. The resultant background values were then subtracted from the original image data as they streamed through the system to obtain pure cell images (Fig. 3e) for the next processing stage. Algorithm 2 describes this background removal operation in more detail.
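A minimal software model of this stage is sketched below, assuming 8-bit input lines of width 336; the clipping of negative differences to zero is our assumption and may differ from the exact arithmetic of the hardware.

```python
import numpy as np

def remove_background(lines: np.ndarray) -> np.ndarray:
    """Software model of the mean-based background removal stage.

    The hardware accumulates the first 8 aligned lines, divides by 8 with a
    right shift by 3 bits, and subtracts the estimate from every later line
    as it streams through.  `lines` is (n_lines, 336) of 8-bit pixels.
    """
    acc = lines[:8].astype(np.int32).sum(axis=0)   # per-column accumulation
    background = acc >> 3                          # divide by 8 via right shift
    cleaned = lines.astype(np.int32) - background  # streamed subtraction
    return np.clip(cleaned, 0, 255).astype(np.uint8)  # clamp: our assumption
```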
The cell detection module marked the point where the stringent dataflow requirement began to relax, as it filtered out image data that would unlikely be needed for further analysis (Fig. 2b). To facilitate an efficient fine-grained pipeline implementation at high bandwidth, a hardware-optimized stream-based cell detection scheme was developed. The proposed detection module employed a thresholding scheme based on the sum of absolute differences (SAD) of a scan line against the moving average (Fig. 3g).
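As an illustration of the scheme (not the exact gateware of Fig. 3f), a stream-style Python sketch is given below; the moving-average update factor and the function naming are our assumptions.

```python
import numpy as np

def detect_cells(lines: np.ndarray, threshold: float, window: int = 50) -> list:
    """Sketch of the stream-based SAD cell detector.

    Each incoming line is compared against a running-average line; a cell is
    flagged when the sum of absolute differences (SAD) exceeds a threshold.
    Per Fig. 3g, 50 lines before and after the trigger are kept as the object.
    """
    detections = []
    avg = lines[0].astype(np.float64)
    for i, line in enumerate(lines[1:], start=1):
        sad = np.abs(line.astype(np.float64) - avg).sum()
        if sad > threshold:
            detections.append((max(0, i - window), i + window))  # kept span
        avg = 0.9 * avg + 0.1 * line   # moving-average update (assumed decay)
    return detections
```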
IV. QCNN IMPLEMENTATION ON FPGA
Classifying cell images with a convolutional neural network (CNN) was the most computationally demanding subsystem in our processing pipeline. It was the major contributor to the overall classification latency and limited the system's hardware processing throughput. The main challenge of our in-situ QCNN design rested on the need to maintain the high stream-processing throughput determined by the imaging frontend and input ADC sampling speed, while achieving ultra-low latency and high-accuracy classification under tight hardware resource constraints. Our design addressed these challenges by adopting a fully pipelined, parallel design of the CNN that operated on the image stream immediately without buffering the whole image. Furthermore, our neural network operated with reduced-precision arithmetic as a quantized convolutional neural network (QCNN), which reduced hardware resource requirements and computation latency at the same time. Finally, based on this CNN architecture, a system-level design space exploration was performed to trade off between latency, throughput, and classification accuracy under resource constraints.

A. Hardware Architecture
Our CNN consisted of 2 pairs of convolutional-maxpooling layers followed by 2 fully connected layers (Fig. 4b).
Input to the CNN was a 336×336-pixel cell image generated from the cell detection module, while the CNN produced classification results as output. As a continuation of the stream processing paradigm used for image formation, all the layers were designed to operate in parallel as a pipeline (Fig. 4a). Each layer thus relied only on partial results from its previous layer or the image formation subsystem for streaming computation.
The network layout, as well as the choice of metaparameters, from the sizes and stride amounts of the kernels to the number of layers employed, was designed specifically to facilitate efficient hardware implementation. For instance, in the input convolutional layer (conv1), the kernel size was chosen to be 8 × 8 pixels, a relatively large square shape with a power-of-2 number of pixels on each side, to allow efficient balanced adder trees to be implemented in hardware. A large stride value of 4 was also chosen as a means to reduce the amount of hardware resource consumption. These architectural choices allowed efficient 2D convolution operations to be carried out by a pair of streaming convolutional units (SCUs).
Inspired by the idea proposed in [39] for optimizing computation efficiency, our SCU was optimized for reduced computation latency by performing, in parallel every cycle, the partial sums of all kernels that covered the 16 input pixels (Fig. 4c). (The exception was that the last pair of kernels was delayed for computation in the following cycle, when the next 16 pixels were input.) Finally, 4 such hardware structures were instantiated, one for each feature kernel of the conv1 layer (Fig. 4a).
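To clarify the dataflow, the following behavioural Python sketch computes a conv1-style convolution (8 × 8 kernel, stride 4) by updating, for each arriving pixel of a 16-pixel block, the partial sums of every kernel window that covers it; it mirrors the streaming order only, not the cycle-accurate hardware, and all names are ours.

```python
import numpy as np

def streaming_conv(image: np.ndarray, kernel: np.ndarray,
                   k: int = 8, s: int = 4, block: int = 16) -> np.ndarray:
    """Behavioural sketch of a streaming convolutional unit (SCU).

    Rather than buffering the whole image, every arriving pixel immediately
    updates the partial sums of the (k/s)^2 = 4 kernel windows that cover it,
    so each window's output completes as soon as its last pixel arrives.
    """
    H, W = image.shape
    out_h, out_w = (H - k) // s + 1, (W - k) // s + 1
    acc = np.zeros((out_h, out_w))                      # partial accumulations
    for r in range(H):                                  # lines arrive one by one
        for c0 in range(0, W, block):                   # one 16-pixel block per cycle
            for c in range(c0, min(c0 + block, W)):
                px = float(image[r, c])
                # enumerate the kernel windows this pixel falls inside
                for oy in range(max(0, (r - k + s) // s), min(out_h, r // s + 1)):
                    for ox in range(max(0, (c - k + s) // s), min(out_w, c // s + 1)):
                        acc[oy, ox] += px * kernel[r - oy * s, c - ox * s]
    return acc  # equals an ordinary stride-s valid convolution of the image
```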
This streaming processing model continued through the inference operations with careful data rate control until the end of the final fully connected layer (fc2). Such streaming processing not only reduced the amount of on-chip memory required for buffering, but also significantly reduced our classification latency.

B. Continuous Inference
From a data analytic perspective, our streaming architecture was designed to perform the inference operation on each individual cell as it was imaged, without first being buffered to form a data processing batch. When compared to conventional throughput-optimized CNN implementations that often operate on input data batches, single-datum inference avoided the need for batch buffering and was thus particularly suitable for our latency-optimized design. Specifically, let N_b be the batch size and t_img be the time required to form a cell image. The first image needed to wait for the formation of all N_b images in the batch before being sent to the GPU for computation. Similarly, the second image needed to wait for (N_b − 1) t_img, while the last image still needed to wait for the formation of itself. The average additional waiting time for each image due to batch buffering was therefore (0 + 1 + ... + (N_b − 1)) t_img / N_b = (N_b − 1) t_img / 2, which grows linearly with the batch size and vanishes for single-datum inference (N_b = 1).

C. Quantization
While the streaming architecture described above promised to achieve ultra-low classification latency at high throughput, it required a significant amount of the FPGA's configurable hardware resources to physically implement the operations of all layers in parallel. As a result, the size of the implementable CNN was ultimately constrained by the amount of available resources on the FPGA.
To reduce the amount of configurable hardware resources required, a quantized convolutional neural network (QCNN) implementation was adopted [40], [22], [25]. In our implementation, all weights and activations were represented as 8-bit fixed-point numbers during inference. We used the network quantization method proposed in [22] to explore efficient bitwidths for the weights and activations. Floating-point weights were used to accumulate the gradient updates during the backward pass. During the forward pass, weights and activations were quantized using the integer quantization training scheme proposed in [22]. As illustrated in Table I, the classification accuracy of our CNN naturally decreased as the number of bits used to quantize the neuron weights decreased. Yet when compared to the original single-precision floating-point implementation in software, the accuracy of our QCNN remained competitive even at 8-bit quantization. The network training only failed to converge when the weights were quantized to fewer than 5 bits. As a result, to facilitate a simple hardware design that matched the 8-bit pixel input from the ADC, an 8-bit number representation was chosen for our current QCNN implementation.
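The scheme can be summarized by the short Python sketch below, in which a float master copy of the weights receives the gradient updates while the forward pass always sees an 8-bit quantized copy; the symmetric [-1, 1) scaling and all names here are our simplifying assumptions, not the exact scheme of [22].

```python
import numpy as np

def quantize(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Uniform fixed-point quantizer: clip to [-1, 1) and snap to 2^bits levels."""
    scale = 2.0 ** (bits - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1) / scale

rng = np.random.default_rng(0)
w_master = 0.1 * rng.standard_normal(64)    # float weights accumulate updates
for step in range(100):
    w_q = quantize(w_master)                # forward pass sees quantized weights
    grad = 0.01 * rng.standard_normal(64)   # stand-in for a real backprop gradient
    w_master -= 0.01 * grad                 # update applied to the float copy
```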

D. Design Space Exploration
In order to design a neural network model that achieved low inference latency under the constraints set by (i) the high input data rate imposed by the imaging front end, and (ii) the limited reconfigurable resources and I/O bandwidth of the FPGA system, a design space exploration was conducted.
During our exploration, 3 models with different numbers of feature maps, which directly affect their classification accuracy and resource implications, were considered, as shown in Table I. To avoid overfitting, we stopped increasing the width of the network when the training accuracy of the widest model in Table I reached 99.94 %.
Based on these network models, hardware implementations with different resource requirements, subject to the image processing throughput constraint, were explored. As shown in Fig. 4a, hardware resource usage can be adjusted by varying the amount of parallelization employed in the QCNN implementation, at the expense of a reduced pixel processing rate.
Denote the pixel processing rate, p_i, as the number of pixels passed as input to the (i + 1)-th layer of our layer-parallel QCNN classifier per cycle. The activation output rate (p_i) of each pipeline stage i was balanced to avoid pipeline congestion. Let c_i, H_i, W_i be the number, the height, and the width of stage i's output feature maps. It took $c_{i-1}H_{i-1}W_{i-1}/p_{i-1}$ cycles to feed all the input feature maps into stage i. The pipeline congestion-free constraint implied that all the output feature maps were fed into stage i + 1 within the same period after stage i's processing delay. Hence p_{i−1} and p_i satisfied Equation (1):

$$\frac{c_{i-1}H_{i-1}W_{i-1}}{p_{i-1}} = \frac{c_i H_i W_i}{p_i}. \quad (1)$$
Note that both H_i and W_i were reduced to $1/s_i$ of their original sizes H_{i−1} and W_{i−1} after passing through stage i. After rearranging both sides of Equation (1), we obtained Equation (2) to describe the relation between p_{i−1} and p_i:

$$p_i = \frac{c_i}{c_{i-1} s_i^2}\, p_{i-1}. \quad (2)$$
Given a CNN model specification, the data rate of each pipeline stage could be computed recursively using Equation (2); these rates were all proportional to p_0, the input pixel rate to the QCNN. As a result, we used p_0 as a hardware parallelism indicator.
Based on the value of p_0, we can determine the amount of reconfigurable hardware resources needed for a particular QCNN implementation. For example, the number of multipliers for the weight-activation multiplications of convolutional stage i was

$$c_i \left(\frac{k_i}{s_i}\right)^2 p_{i-1}, \quad (3)$$

where k_i and s_i were the kernel size and stride of the convolutional layer. Since p_i was proportional to p_0, the overall hardware resource usage was also proportional to p_0. At the same time, p_0 was closely related to the classification latency. Assuming the image size was N_img pixels, it took N_img/p_0 clock cycles to feed the entire image into the pipeline. The classification latency was therefore N_img/p_0 + PIPELINE_DELAY, which was roughly inversely proportional to p_0. Fig. 4d shows the relation between the number of multipliers required in the QCNN and the classification latency for the models in Table I.
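The following Python sketch ties Equations (2) and (3) and the latency expression together; the stage configurations and the pipeline delay below are illustrative placeholders, not the published model parameters.

```python
def stage_rates_and_multipliers(p0: float, stages, c_in: int = 1):
    """Propagate the pixel rate through the pipeline and count multipliers.

    `stages` is a list of (c_out, kernel k, stride s) tuples for the
    convolutional stages; rates follow Eq. (2), multipliers Eq. (3).
    """
    rates, mults = [p0], []
    c_prev = c_in
    for c, k, s in stages:
        p_prev = rates[-1]
        mults.append(c * (k / s) ** 2 * p_prev)        # Eq. (3)
        rates.append(c / (c_prev * s * s) * p_prev)    # Eq. (2)
        c_prev = c
    return rates, mults

def latency_cycles(n_img: int, p0: float, pipeline_delay: int) -> float:
    """N_img / p0 cycles to feed the image, plus the fixed pipeline delay."""
    return n_img / p0 + pipeline_delay

# Made-up example: conv1 (4 maps, 8x8, stride 4), conv2 (8 maps, 3x3, stride 2).
rates, mults = stage_rates_and_multipliers(p0=16, stages=[(4, 8, 4), (8, 3, 2)])
print(rates, sum(mults), latency_cycles(336 * 336, p0=16, pipeline_delay=500))
```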
In our implementation, DSP blocks on the FPGA were used exclusively for the weight-activation multiplications. As a result, the number of DSP blocks required can be derived directly from Equation (3), and was also proportional to p_0.
In terms of on-chip memory, weights and partial convolution accumulations accounted for most of the usage in our design. The on-chip memory used for weight storage was related to the total number of parameters in each layer. In the case of the convolutional layers, it was proportional to the neural network width, quantified by the number of output feature maps c_i of each stage i. As the data coming into each convolutional stage followed the same line-by-line manner, a convolution could only be completed after k_i lines of input. Partial convolutions of the first line were calculated immediately upon data arrival. The resulting partial accumulations were stored in a first-in-first-out (FIFO) buffer waiting to be merged with the other partial accumulations that required the next k_i − 1 lines of data. Note that each line was associated with k_i/s_i other convolutions due to the convolution window overlapping. For each output feature map of convolution stage i, k_i/s_i FIFOs of size W_i were used for partial convolution accumulation. The total on-chip memory used in stage i for partial convolution accumulations could therefore be described using the following equation:

$$c_i \cdot \frac{k_i}{s_i} \cdot W_i, \quad (4)$$

where W_i was the line width of the output feature maps. As with the weight storage, the on-chip memory usage for partial convolution accumulations was also proportional to the network width.
The DSP and BRAM usage model of our QCNN hardware can be obtained by adding up the resource usage of each stage i. Note that c_i, k_i, s_i, W_i were network configuration parameters, and the BRAM resource model in Equation (4) only contained these parameters. A larger network gave better classification accuracy while consuming more BRAMs. Besides the network configuration parameters, which directly affected the classification accuracy, DSP usage was also affected by the input data rate p_0 through Equations (2) and (3). For the same network configuration, a shorter classification latency did not require more BRAMs, but the QCNN hardware consumed more DSPs. Experimental results on hardware resource consumption that correspond to this analytical model are shown in Section V. From Table II we can see that for the same network, using a larger input data rate p_0 increases the DSP usage while the number of consumed BRAMs remains the same.
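A companion sketch for Equation (4) is given below; the conv1-style numbers are illustrative (4 output maps, 8 × 8 kernels, stride 4, 83-pixel output lines for a 336-pixel input), and the function name is ours.

```python
def partial_sum_memory(c: int, k: int, s: int, w_out: int) -> int:
    """Eq. (4): per-stage FIFO entries for partial convolution accumulations.

    Each of the c output feature maps needs k/s FIFOs of depth w_out.
    """
    return c * (k // s) * w_out

# conv1-style example: (336 - 8) // 4 + 1 = 83 output columns.
# The count is independent of p0, matching the observation above that a
# faster input rate costs extra DSPs but no extra BRAMs.
print(partial_sum_memory(c=4, k=8, s=4, w_out=83))   # 4 * 2 * 83 = 664 entries
```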

V. EVALUATION

A. System Setup
Our system was designed around a Xilinx Virtex-6 XC6VSX475T FPGA housed inside the ROACH-2 platform from the CASPER collaboration [41]. Two external ADCs (e2v EV8AQ160, 5 Gsps, 8-bit) housed on the CASPER ADC1x5000-8 board were used to interface with the multi-ATOM photodetector output. Four CPU nodes were connected to the ROACH-2 through eight 10 Gbps Ethernet connections, over which data were transmitted as custom UDP packets for recording and analysis. Custom software aggregated and demultiplexed data from the 4 data aggregation nodes for downstream processing. Matlab 2012b with ISE System Generator 14.5 was used for block diagram design and FPGA configuration file generation.
The imaging front end of our system was based on the multi-ATOM imaging system [9]. In short, our multi-ATOM system imaged fast-flowing single cells in a microfluidic channel based on an ultrafast line-scanning approach, in which a series of optical line-scans was formed by illuminating broadband time-stretched laser pulses (centre wavelength = 1064 nm, bandwidth = 10 nm, and repetition rate = 11.8 MHz) onto a diffraction grating. Images were subsequently reconstructed by stacking a series of line scans while the targets flowed in the microfluidic imaging channel (> 1 m/s).
The repetition rate of the laser pulses from the multi-ATOM imaging front end served as the master clock source for all remaining hardware processing clock domains. The laser pulses were first converted to electrical pulses by an FPD 310 photodetector from MenloSystems. The converted electrical pulses were then fed into a custom stretch-and-reshape circuit to produce a square-wave master clock signal, which was synchronized to the original laser pulses at the same 11.8 MHz frequency. A Valon 5009 Dual Frequency Synthesizer was used to multiply the frequency of the master square-wave clock signal by 168. Using the resulting 1982.4 MHz clock output from the synthesizer, the ADC further doubled the frequency internally to form a 3964.8 MHz sampling clock. Each laser scan pulse from multi-ATOM was thus sampled exactly 336 times by the ADC to form one image line. An internal clock division circuit inside the ROACH-2 divided the 3964.8 MHz sampling clock into a 247.8 MHz system clock for the remaining hardware processing pipeline.
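The clock chain can be checked end to end with the short sketch below, using only the frequencies and ratios quoted above.

```python
# Consistency check of the clock-generation chain described above.
laser_rep_rate = 11.8e6                  # 11.8 MHz laser repetition rate, in Hz
synth_out = laser_rep_rate * 168         # Valon 5009 multiplies by 168
assert synth_out == 1982.4e6
adc_clock = synth_out * 2                # ADC doubles internally
assert adc_clock == 3964.8e6
samples_per_line = adc_clock / laser_rep_rate
assert samples_per_line == 336           # each line-scan sampled 336 times
system_clock = adc_clock / 16            # ROACH-2 clock-division circuit
assert system_clock == 247.8e6
```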

B. Classification Latency
Fig. 5a shows a plot of the execution time for image formation as well as for the various layers of the CNN to classify one cell. The pipelined operation, in which different parts of the design execute concurrently, can easily be seen from the figure. Fig. 5b further maps the execution time to the corresponding cell image to highlight the spatial-temporal relationship between cell flow and classification latency.
To measure classification latency, recall that the image lines of a cell were formed while the cell flowed in the microfluidic imaging channel of our system. The cell detection module continuously analyzed the latest 50 lines of images to determine if a cell was present. Once a cell was detected, the classification process using the QCNN commenced. This inference operation continued as new image lines of the cell were formed. The CNN input ended after all 336 lines were captured, which translated into a complete 336 × 336 pixel input to the CNN. The inference operation completed when all computation of the final fc2 layer completed.
There are thus two important latency measurements. First, the inference latency is the time measured from the start to the end of the CNN inference operation (t_cs → t_ce). Using an image size of 336 × 336 pixels, our CNN completed inference operations in 7436 cycles, which translated to 30.0 µs when the FPGA system was running at 247.8 MHz. Second, the detection latency must be taken into account, which is measured from the start of an image to the time the CNN operation starts (t_is → t_cs). In the worst case, our cell detection module required 50 lines of input to detect a cell (4.2 µs) before it would trigger the start of CNN inference. As a result, the maximum classification latency of our system is determined to be 34.2 µs (t_is → t_ce).
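These figures follow directly from the clock rates; a short sketch reproducing the arithmetic is given below.

```python
# Reproducing the latency figures quoted above from the clock rates alone.
system_clock_hz = 247.8e6
inference_cycles = 7436
inference_latency_s = inference_cycles / system_clock_hz   # ~30.0 us

line_rate_hz = 11.8e6                                      # one line per laser pulse
detection_latency_s = 50 / line_rate_hz                    # 50 lines, ~4.2 us

total_us = (inference_latency_s + detection_latency_s) * 1e6
print(f"{total_us:.1f} us")                                # ~34.2 us
```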
Finally, combining Fig. 5a and Fig. 5b illustrates the spatial relationship between a cell flowing at 2.3 m/s and its corresponding classification result. It can be observed that with our FPGA classifier, the classification decision is available at most 78.66 µm after a cell passes through the imaging channel.

C. Resource Consumption
Table II summarizes the detailed resource usage of each module in our hardware. We adopted the narrow QCNN design configuration with p_0 = 16, as wider QCNNs failed to meet the clock frequency required by the ADC sampling rate, which is closely coupled with the imaging front end as explained in Section V-A. Fig. 5c shows the overall resource consumption of the chosen classifier and other system components. Currently, the overall design consumes 39 % of the on-chip digital signal processing (DSP) units, 12 % of the slice registers, 17 % of the slice lookup tables (LUTs), and 5 % of the on-chip memory; this includes the QCNN implementation, the image formation hardware, the ADC controllers, the data selection network, and other system support designs. Fig. 5d further breaks down the resource consumption by submodule. Among them, the QCNN, being the most computationally demanding, is responsible for almost all of the DSP block usage, while the eight 10 GbE controllers consume the most on-chip memory for buffering.

D. Object Detection Accuracy
To evaluate the detection performance of our hardware image-based detection module, a continuous full-system capture of both the phase-gradient contrast and fluorescence detection channels was performed on a flow of stained beads. From the full recording of the flow, by correlating against the fluorescence detection signals, a total of 1079 beads were detected in the captured flow. Fig. 6a shows samples of beads detected by the image-based detection module, which were also verified by the presence of a fluorescence signal. Fig. 6b shows samples of false-positive images detected by the image-based module. These false-positive objects are mostly debris in the flow. As shown in the ROC curve (Fig. 6c), the image-based detection module is capable of detecting beads with high confidence when compared to conventional fluorescence detection methods.

E. Cell Classification Accuracy
We used 3 sub-types of peripheral blood mononuclear cells (PBMCs), imaged by multi-ATOM, as the target to evaluate our classification system. Each sub-type was first negatively isolated (purified) from PBMCs using a targeted magnetic separation protocol with minimal stress and perturbation on the isolated cells. Subsequently, the purified sub-types (each of which was labelled with a targeted fluorescence surface marker for validation) were imaged by our system. Flowing and imaging the cells of the purified PBMC sub-types separately allowed us to apply the correct label to the cell images for training, validating, and testing our QCNN. In total, 4485 B cell, 104 975 monocyte, and 21 114 NK cell images were collected.
To balance the number of images in each class, 20% of the captured monocyte images were randomly chosen from the whole capture set and combined with all the captured B and NK cell images to construct the evaluation data set. The evaluation set was further split into 10 shares to perform standard 10-fold cross-validation. Overall, our QCNN achieved an average top-1 classification accuracy of 0.9545 and an average F1-score of 0.9542 in the 10-fold cross-validation.

F. Comparison With GPU-based Cell Classification
Finally, we also performed an extensive comparison between our FPGA-based classifier and the state-of-the-art real-time GPU-based cell classifier demonstrated in [11]. To evaluate classification accuracy, the data set released by [11], which contained 65 534 single platelet, aggregated platelet, and leukocyte images, was used. Since the label for each image was not available, we used the labels generated by the pre-trained model in [11] as a baseline to evaluate the accuracy of our model. According to the classification results from the pre-trained model, there were 18 176 single platelet images, 2205 leukocyte images, and 45 153 aggregated platelet images. Since the number of leukocyte images was too small, they were dropped from our evaluation. Finally, we randomly selected 18 000 images from each of the remaining two cell types to form our evaluation dataset. The 36 000 images were split in a 4:1 ratio for training and testing.
The 5-fold cross-validation results show that our QCNN was able to achieve classification results that are 94.03 % in agreement with the baseline labels produced from [11], while incurring a 3 orders of magnitude lower inference latency. The key factors behind the significantly reduced classification latency of our system are the tight integration between the imaging frontend and the QCNN classifier, as well as the use of a low-latency hardware QCNN for in-situ classification. Compared with the system in [11], where image construction and classification were performed on separate computation nodes connected through 10G Ethernet, our integrated in-situ classifier eliminates most of the system overheads, including the Ethernet connection and the CPU-GPU communication. Our hardware-based, fully pipelined, and parallel implementation, in combination with cut-through processing of the cell images as they are formed, also contributes to the lowered latency. By eliminating most of the variable latency of the network and I/O system, the proposed solution also provides the predictable classification latency that facilitates precise control in ultra-high-speed real-time systems.
VI. CONCLUSIONS
We have developed an integrated reconfigurable FPGA-based cell imaging and analytic platform that demonstrates the real-time ultra-low processing latency needed for applications with real-time feedback requirements. We demonstrated its capability with real-time cell classification using a quantized convolutional neural network (QCNN), where unmatched low latency is achieved through carefully structured QCNN hardware that allows image formation, cell detection, and the computation of the different layers of the QCNN to execute with pipeline parallelism. The success of our system demonstrates the feasibility and benefits of performing in-situ image analytics using FPGA-based reconfigurable hardware, especially in applications with stringent real-time feedback requirements. Most importantly, the platform is flexible and allows easy reconfiguration, providing a foundation on which systems with a similar streaming architecture can be developed for future applications.
Maolin Wang, Kelvin C.M. Lee, Kevin K. Tsia and Hayden K.H. So are with the Department of Electrical & Electronic Engineering, University of Hong Kong. Bob M.F. Chung and Anderson H.C. Shum are with the Department of Mechanical Engineering, University of Hong Kong. Justin S.J. Wong is with Conzeb Ltd. Bogaraju Sharatchandra Varma and Ho-Chueng Ng were with the Department of Electrical & Electronic Engineering, University of Hong Kong during this work and are now with Ulster University, UK and Imperial College London, UK, respectively.

Fig. 1: Field-programmable gate array (FPGA) based reconfigurable computing platform for intelligent in-situ image analytics in real time. The central FPGA processing unit serves as a unified platform for concurrent image formation, cell detection, and classification using a quantized convolutional neural network (QCNN). Fully pipelined stream-processing FPGA gateware results in predictable and ultra-low latency at high throughput.

Fig. 2: Concurrent image formation and analytics in a pipeline (using ultrafast flowing cell imaging by multi-ATOM as an example). (a) Snapshots of the image formation process as a pipeline spanning multiple time steps. Every step operates on a partial image such that analysis of an image may commence before the rest of the image is formed (last row). Note that only one raw image contrast is shown here for simplicity; in practice, four different phase-gradient image contrasts of the same cell are acquired in multi-ATOM (see (d)). (b) The cell detection step selects only relevant cell images to be processed by the CNN. (c) Complete bright-field image and quantitative phase image reconstructed from the raw image. (d) Raw image of the 4 phase-gradient contrasts.

Algorithm 2 (fragment): Background estimation and removal. The background accumulation buffers T_j (j = 1 to 21) are first initialized to zero; the background is then estimated from the average value of the first 8 lines.

Fig. 3: Processing pipeline design. (a) Raw image before line alignment. (b) Line alignment hardware. (c) Image after line alignment. Note the 4 phase-gradient contrasts of the cell samples are now centered. (d) Background computation and removal hardware. (e) Image after background removal. (f) Object detection hardware. (g) Sum of absolute difference (SAD) values of lines around a target cell. 50 lines before and after the SAD exceeding a threshold T are included as part of the detected object.

Fig. 4: QCNN design. (a) Hardware architecture overview. A fully pipelined parallel implementation of the QCNN where all layers operate concurrently. The architecture accepts a 16-byte data block every cycle. (b) Layer configuration of the proposed QCNN. (c) Partial products of all sliding kernels that intersect with the current input data block are computed concurrently in a single cycle. Kernels from conv1 are shown for illustration. (d) Latency and hardware resource usage analysis of the proposed QCNN.

Fig. 5: Latency and resource consumption. (a) Plot showing the timing of image formation and the various layers of CNN inference that execute in a pipeline. CNN computation begins as soon as a cell is detected. Depending on the exact timing when a cell is detected, the worst-case latency for cell detection is t_ce − t_is. (b) Plot showing the relative latency of CNN decision making compared to the cell flow image. A classification decision is made soon after the sample cell is completely imaged (t_ce − t_ie). (c) Resource consumption of the design relative to available resources on the target FPGA. (d) Resource consumption breakdown of each submodule.

Fig. 6: Object detection using hardware image-based detection and fluorescence detection. (a) Objects detected by both fluorescence and image-based detection modules. (b) Objects identified by the image-based module without a corresponding fluorescence signal. (c) ROC curve showing the image-based detection hardware module is capable of producing detection results comparable to fluorescence detection.

TABLE I: Model Design Exploration. (Accuracy figures are the average validation accuracy of 5-fold cross-validation.)

TABLE II: Summary of FPGA Resource Usage. (1) The numbers in parentheses are percentage usage. (2) The QCNN configuration chosen in the complete system is narrow with p_0 = 16.