State-of-the-Art IoT and Edge Embedded Systems for Real-Time Machine Vision Applications

IoT and edge devices dedicated to running machine vision algorithms usually lag a few years behind the state-of-the-art hardware accelerator technologies available at any given time. This is mainly due to the non-negligible delay required to implement and assess the related algorithms. Among the hardware platforms being explored to handle real-time machine vision tasks, multi-core CPU and Graphics Processing Unit (GPU) platforms remain more widely used than Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC)-based platforms. This is mainly due to the availability of powerful and user-friendly software development tools, their lower cost, and their high computation power at a reasonable form factor and power consumption. Nevertheless, the trend is now towards System-on-Chip (SoC) processors which combine ASIC/FPGA accelerators with GPUs/multicore CPUs. This paper presents different state-of-the-art IoT and edge machine vision technologies along with their performance and limitations. It can serve as a reference for researchers designing state-of-the-art IoT embedded systems for machine vision applications.


I. INTRODUCTION
In recent years, processors have gained increasing computation power, which has reached tens of Tera Floating-point Operations Per Second (TFLOPS) at relatively low power. This paved the way for the emergence of several IoT and edge machine vision systems that were not possible a few years ago. These include DNN cameras which can accurately perform complex tasks such as face recognition, environmental monitoring, and smart city applications. For instance, very low-payload vision-based navigation systems now enable drones to navigate in GPS-denied environments, such as indoor environments and areas prone to high electromagnetic interference. Also, in the autonomous car industry, current autopilot systems are very costly as they rely on expensive and bulky equipment such as Light Detection and Ranging (LIDAR) for navigation [1]. With the availability of powerful IoT/edge processors, an autonomous vehicle's machine vision system relying solely on a few cameras and the associated complex algorithms may become quite possible in the near future. According to [2], the global autonomous driving market is expected to reach 173.15 B US$ by the year 2030. Another very recent application of IoT/edge-based machine vision is to establish a defense line against COVID-19 in public places, where mask checking, people counting, social distancing, and temperature checks need to be automated simultaneously using smart cameras. This can be an interesting substitute for current security measures, and it can also accurately detect people with COVID-19 symptoms [3]. These and other potential applications have led to predictions that the global smart camera market, which was valued at USD 4.61 Billion in 2018, will reach USD 9.17 Billion by the year 2026 [2]. It is also reported that by the end of 2021 around 1 Billion smart cameras will be deployed worldwide to handle a wider spectrum of applications.
These include access control, traffic monitoring, retail analytics, asset management, and logistics [3][4][5][6]. Thus, heavy machine vision applications which used to be executed within edge devices, such as DNN-based machine vision tasks, are nowadays seamlessly moved towards IoT devices. This has the merit of drastically reducing network traffic congestion and consequently reducing security threats. A distinguishing property of today's edge computing devices for machine vision applications is their ability to handle tasks received from several IoT devices. For instance, this can be done to allow access to a larger reference database than can be afforded within a single IoT device. This paved the way to an important emerging area tackled by many researchers, namely task offloading to edge devices and cloud servers. This is facilitated by the emerging 5G and 6G networks, which yield higher throughput, lower latency, and lower network jitter for multimedia communication compared to former protocols. However, security threats and network congestion are expected to remain pending challenges as the number of users increases. Power consumption, which used to be one of the major issues a few years ago, is effectively tackled by new energy-harvesting techniques, including low-cost solar-based power supplies for off-grid applications in remote areas, and by the continuous drop in the power consumption of all kinds of processors. This paper addresses the potential of different state-of-the-art processors, such as multicore CPUs, GPUs, ASICs, and FPGAs, to handle machine vision tasks. It also suggests some future directions for tackling pending challenges. In the recent past, other review papers on IoT and edge processors targeting machine vision applications have been reported [7][8][9].
One contribution of this paper is to address the most recent processors that use various technologies, along with their expected performance. For instance, the paper [7] considered NVIDIA's GTX-760 GPU and Altera's Stratix-4 FPGAs for various machine vision applications including stereovision, image registration, and digital image correlation. In this paper, more recent processors, along with their performance in edge machine vision applications, are described and analyzed. These include the very recent Intel Stratix-10 and Arria-10 FPGA chips and Xilinx's Zynq UltraScale for low-power yet highly powerful deep-learning-based machine vision applications. The other contribution of this paper, which to the authors' best knowledge is not yet covered in the literature, is to describe and analyze the main recent edge smart cameras currently available. Unlike previous machine vision systems, which were designed for a specific machine vision algorithm, today's edge hardware is flexible enough to host various machine vision applications using, for instance, DNN algorithms.

II. BACKGROUND
As per the Wikipedia definition, edge computing consists of pushing "the frontier of computing applications, data, and services away from centralized nodes to the logical extremes of a network". Thus, edge computing offers the advantage of lower latency and higher security protection than cloud computing (Figure 1). This is a very valuable feature for real-time machine vision tasks, which are sought to be performed at the edge or on IoT devices. It is facilitated by the continuous increase in processor performance for handling complex machine vision applications, such as real-time object detection and recognition. However, unlike cloud servers, which have virtually unlimited computation power and storage capacity, IoT and edge devices have an upper bound due to their limited power budget and footprint, as they are mainly dedicated to portable applications. Compared to IoT machine vision devices, their edge counterparts can process more than one input video stream. This requires them to support a very high video bus bandwidth, such as Gigabit Ethernet, to carry encoded video streams such as H.264, MPEG-4, and even MPEG-2, and to incorporate real-time video decoding modules to handle several video channels simultaneously. For instance, the latest Advantech AIR-200 edge machine vision system, which is based on Intel's Movidius Myriad X Vision Processing Unit (VPU), can handle up to 120 fps of 1080p video encoding and decoding, corresponding to 4 simultaneous video streams [10]. This also requires using a real-time communication protocol (e.g., the Real-time Transport Protocol (RTP)) over the Internet to reduce network latency even further. In addition, edge machine vision platforms can process several video streams simultaneously (approximately in the range of 10, versus thousands in cloud servers and only one in IoT devices). Figure 2 shows a typical hardware architecture of an embedded vision system, be it an IoT or an edge device.
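As a rough illustration of why encoded streams and a fast video bus matter, the sketch below compares raw versus compressed bandwidth for the four-stream case mentioned above. The 8 Mbps H.264 rate per stream is an assumed typical figure, not a datasheet value.

```python
# Back-of-envelope bandwidth check for a multi-stream edge vision device.
# Assumed figures: 1080p @ 30 fps, 24-bit RGB, and a hypothetical 8 Mbps
# encoded rate per H.264 stream.

def raw_stream_mbps(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bandwidth in megabits per second."""
    return width * height * fps * bits_per_pixel / 1e6

def total_encoded_mbps(n_streams, encoded_mbps_per_stream=8.0):
    """Aggregate encoded bandwidth for n simultaneous streams."""
    return n_streams * encoded_mbps_per_stream

raw = raw_stream_mbps(1920, 1080, 30)   # ~1493 Mbps for ONE raw stream
encoded = total_encoded_mbps(4)         # 32 Mbps for FOUR encoded streams
print(f"raw: {raw:.0f} Mbps, 4 encoded streams: {encoded:.0f} Mbps")
```

The two-orders-of-magnitude gap is why edge devices carry encoded streams over the network and decode on-chip.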
In most cases, the input image needs to be completely stored in the frame memory before the processing task starts, causing a latency of a few ms. The actual image may be dispatched to the PEs' local memories via direct memory access (DMA) channels, while another memory bank of the frame buffer hosts the next image frame. This allows simultaneous capture and processing of images. In some particular cases, such as image convolution, processing the input image on the fly using line memories (e.g., FIFOs), the number of which equals the number of rows in the convolution mask, is quite possible using FPGAs, or even CPU/GPU processors if they have enough input ports to host the input images [11][12]. In terms of the main processing unit, several processors use SIMD or systolic-like architectures where each PE has access to its local memory space and performs the same instruction as the other PEs [13][14]. While this computation model is efficient for low-level convolution-type processing, it is not adequate for higher-level tasks. Thus, the need to significantly boost performance has led today's main processing platforms for embedded IoT/edge machine vision systems to be mainly multicore CPU, GPU, FPGA, or ASIC-based, or a combination of some of them. For instance, some multi-core processors, such as the GAP9, integrate a multicore RISC-V processor together with a DNN hardware accelerator for high-speed machine vision applications [17]. Similarly, several other ASIC- or FPGA-based IoT or edge processors also include a single- or multicore CPU for program flow control and to handle tasks which are not supported by the ASIC or FPGA accelerators. Most of these CPUs include a multiply-accumulate (MAC) unit and zero-overhead hardware loops, which constitute critical computation requirements in DNN and machine vision algorithms.
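The double-buffered (ping-pong) frame memory described above, where one bank is captured into while the other is processed, can be sketched as follows. This is a minimal Python model, not tied to any particular device.

```python
# Minimal sketch of a ping-pong (double-buffered) frame memory: one bank
# is filled by the capture side while the other holds the last complete
# frame for processing, so capture and processing overlap.

class PingPongFrameBuffer:
    def __init__(self):
        self.banks = [None, None]
        self.write_bank = 0           # bank currently being filled

    def capture(self, frame):
        """Store the incoming frame, then swap banks."""
        self.banks[self.write_bank] = frame
        self.write_bank ^= 1          # toggle 0 <-> 1

    def frame_to_process(self):
        """The bank NOT being written holds the last complete frame."""
        return self.banks[self.write_bank ^ 1]

buf = PingPongFrameBuffer()
buf.capture("frame-0")                # frame-0 lands in bank 0
buf.capture("frame-1")                # frame-1 lands in bank 1
print(buf.frame_to_process())         # prints "frame-1"
```

In hardware, the "swap" is simply a multiplexer on the memory bank select lines, driven by the end-of-frame signal.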
As will be discussed in Section III, most systems comprise a communication module that may host new emerging low-power, relatively long-range communication protocols such as LPWAN (Low Power Wide Area Network), in addition to legacy communication protocols such as PCIe, 100 Gigabit Ethernet, and WiFi, since IoT and edge devices can communicate over wired and/or wireless links and require fast inter-chip communication. Video encoders/decoders are also usually embedded in most state-of-the-art IoT and edge devices to handle average-complexity encoding algorithms such as MPEG-4 and H.263; MPEG-2 is an exception, as it is too complex to be handled in IoT devices. Additional hardware modules which can be useful to end users, while requiring only a marginal increase in hardware size, may also be available. These may include single-output sensors such as a GPS (Global Positioning System) receiver and an IMU (Inertial Measurement Unit), which can be useful in the perception systems of moving vehicles such as drones [18].
Unlike communication and video encoding/decoding modules, which usually consume relatively little power and feature a small die size, the main processing unit usually consumes the most power because of the highly complex machine vision and DNN algorithms it must handle in real-time. This is why a Power Management Unit (PMU) is usually available to reduce the power consumption of the device using mainly three techniques: voltage scaling, frequency scaling, or setting the processor in a specific mode of operation (e.g., sleep, deep sleep, and idle modes). Nevertheless, in a conventional system implementation, the standard upper ceiling of around 20-40 W requires a heat sink and either a fan or substantial airflow for cooling, which is fine for edge devices but not for IoT devices, which are usually battery powered. When the core runs at about 7 W, the fan can be removed and there is less need for sophisticated management of hot spots. The video bus used to capture the input video stream is usually designed around a high-speed standard serial communication bus such as a MIPI lane interface (for HD RGB camera interfacing, supporting up to 700 Million pixels/s), USB 3.0, or Gigabit Ethernet, yielding data transfers of up to 5 and 100 Gbps respectively. Another important feature of these devices is their very high memory bandwidth, which can exceed 50 GB/s [19], to access the frame memory, which is usually off-chip and DRAM-based. This is usually facilitated in most machine vision processors by embedding low-power DRAM (LPDRAM) interfaces, such as LPDDR4, to lower the power consumption during off-chip memory access. One more important feature of embedded IoT and edge machine vision systems targeted at outdoor applications is their protection against harsh environments, such as dust and water.
This is indicated by their Ingress Protection (IP) rating, which can reach the most recent IP69K standard for simultaneous high-pressure and high-temperature water (undersea applications) [20]. However, such sealing prevents the use of an on-board cooling system, which consequently limits the computation power of the processing unit.
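The voltage and frequency scaling performed by the PMU exploits the fact that dynamic power grows linearly with frequency and quadratically with voltage (P ≈ C·f·V²). Below is a toy model; the capacitance constant and the two operating points are invented for illustration, not taken from any datasheet.

```python
# Illustrative dynamic-power model behind DVFS (P_dyn ~ C * f * V^2).
# The constant c and the operating points are placeholder values.

def dynamic_power(freq_mhz, voltage, c=1e-3):
    """Dynamic power in watts for an assumed switched capacitance c."""
    return c * freq_mhz * voltage ** 2

nominal = dynamic_power(1000, 1.0)   # 1.00 W at 1 GHz, 1.0 V
scaled  = dynamic_power(500, 0.8)    # 0.32 W at 500 MHz, 0.8 V
print(f"nominal {nominal:.2f} W -> scaled {scaled:.2f} W "
      f"({100 * (1 - scaled / nominal):.0f}% saving)")
```

Halving the frequency alone would only halve the power; lowering the voltage at the same time is what yields the disproportionate (here 68%) saving.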

III. COMPUTATION MODELS IN MACHINE VISION SYSTEMS
Today's embedded IoT or edge machine vision systems are either modular, where different phases of the algorithm are executed in different hardware modules, or end-to-end, an approach largely motivated by the development of artificial intelligence accelerators [21]-[25]. Modular designs allow different research teams with complementary backgrounds to work together. Examples include the DARPA project for building a perception system for autonomous cars [26], the Junior project from Stanford [27], and the Boss project from CMU [28].
Although the end-to-end approach has the potential to reduce the uncertainties facing modular designs, it is still not mature or error-free enough for real-life applications [29].

A. Low, intermediate, and high-level image processing models
Image processing is a natural fit for data-parallel processing, which has encouraged researchers to propose various dedicated parallel hardware architectures. Based on their computational complexity, image processing algorithms can be categorized into three types: low, intermediate, and high-level processing. High-level image processing mainly handles semantic data (e.g., classification) to estimate the semantic class. It operates on comparatively little data but with very complex algorithms, and generally benefits from VLIW or RISC processor architectures. For instance, the results of vehicle tracking are used for target range computation, range rate, and forward collision warning [30][31]. These tasks feature complex program flow, making them a natural fit for a Multiple Instruction Multiple Data (MIMD) hardware architecture, where the PEs execute different threads with marginal inter-PE communication. They are also suitable for implementation on powerful general-purpose CPUs, which explains why most IoT/edge machine vision processors, including GPUs, usually comprise multicore CPUs. Pattern classification, another significant high-level processing task, can lend itself to an SIMD machine; this is the case, for instance, for the Support Vector Machine (SVM). Similarly, intermediate-level processing such as image segmentation is also better tailored to MIMD architectures, where the parallelism is usually performed at the thread level by allocating a region of the image to each processor and then aggregating the data to cover the whole image space. Nevertheless, a few other image segmentation algorithms, such as segment extraction, can be suitable for an SIMD-like architecture where each processor is allocated a given range in the parameter space [13][15].
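The region-per-processor scheme described above for intermediate-level tasks can be sketched with thread-level parallelism in Python; simple binary thresholding stands in for a real segmentation kernel.

```python
# MIMD-style, thread-level parallelism: each worker processes its own
# horizontal band of the image, and the bands are then aggregated,
# mirroring the region-per-processor scheme described in the text.
from concurrent.futures import ThreadPoolExecutor

def threshold_band(band, thresh=128):
    """Binarize one region of the image (one 'processor' per region)."""
    return [[1 if px > thresh else 0 for px in row] for row in band]

def segment(image, n_workers=4):
    rows = len(image)
    step = max(1, rows // n_workers)
    bands = [image[i:i + step] for i in range(0, rows, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(threshold_band, bands))
    return [row for part in parts for row in part]   # aggregate the bands

image = [[10, 200], [130, 50], [255, 0], [90, 140]]
print(segment(image))   # -> [[0, 1], [1, 0], [1, 0], [0, 1]]
```

Real segmentation kernels need a merge step at band borders; this sketch omits it to keep the MIMD structure visible.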
In low-level algorithms such as n x n convolutions, neighborhood pixels are processed using n buffers of n registers each (see Figure 3 in the case of n = 3, which feeds the computation hardware with the elements of the matrix a to i, where each element is multiplied by a constant value and the results are summed). In Figure 3(b), at each clock cycle the FIFO memory outputs a pixel to both the first buffer A, which holds the second row of the image, and the first stage of a shift register. Both Buffer A and Buffer B delay their output by the image width. In this case, the frame is processed on the fly without the need to store it in an input frame grabber, unless additional image processing tasks such as feature extraction or segmentation must be performed. Such a hardware architecture model is mainly suitable for FPGA- or ASIC-based processing units. Each FIFO should have the size of one image row, and their number should be equal to the number of rows in the convolution mask minus one, which has the merit of yielding zero-frame latency. An SIMD architecture is another, more frequently used, alternative (e.g., the GPU architecture) for low-level image processing, where each processing element handles a different portion of the image while executing the same instruction as the other processing elements. While this hardware model, featured in modern GPUs, can efficiently increase the execution throughput, it still leaves room for improvement. For instance, it does not easily exploit the spatial or temporal correlation which may exist between adjacent pixels and adjacent frames, respectively. Stereovision is another very popular low-level image processing task, which exploits the pixel correlation across multiple images simultaneously captured by several cameras, using convolution operations on SIMD-like architectures.
Stereovision is used in several applications, for instance in autopilot driving systems as a tangible alternative to LiDAR. Despite being less accurate (LiDAR systems can provide a resolution down to 2 cm at a depth distance of 100 m), it is much cheaper, since the actual cost of a LiDAR system can easily exceed 50,000 US$. Figure 4 summarizes the hardware architecture suitable for each level of video processing, along with the required processing throughput. Machine vision processors are usually designed to support all these architectures in a single chip.
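The line-buffer convolution scheme of Figure 3 can be mimicked in software: two row-length FIFOs plus a 3 x 3 window of shift registers let each output be produced as pixels stream in, with no frame store. This is a behavioral sketch of the hardware, not production code.

```python
# Behavioral model of streaming 3x3 convolution with two line buffers.
from collections import deque

def stream_convolve3x3(image, kernel):
    """Process a row-major pixel stream; returns the valid output region."""
    h, w = len(image), len(image[0])
    fifo1 = deque([0] * w)              # delays the stream by one row
    fifo2 = deque([0] * w)              # delays it by a second row
    win = [[0] * 3 for _ in range(3)]   # 3x3 window of shift registers
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for r in range(h):
        for c in range(w):
            px = image[r][c]                          # pixel arriving now
            d1 = fifo1.popleft(); fifo1.append(px)    # pixel one row above
            d2 = fifo2.popleft(); fifo2.append(d1)    # pixel two rows above
            for row in win:                           # shift window left
                row.pop(0)
            win[0].append(d2); win[1].append(d1); win[2].append(px)
            if r >= 2 and c >= 2:                     # window fully valid
                out[r - 2][c - 2] = sum(win[i][j] * kernel[i][j]
                                        for i in range(3) for j in range(3))
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(stream_convolve3x3(image, identity))   # -> [[5]]
```

Note how each output needs exactly one new pixel: this is the zero-frame-latency property mentioned above, which FPGA/ASIC pipelines exploit directly.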

B. AI-accelerators for machine vision systems
To overcome the non-scalability of hardware architectures dedicated to intermediate- and high-level image processing when handling a wide range of objects in the image space, AI accelerators implementing DNN or CNN inference are now available for IoT and edge embedded systems. For instance, TFLite reduces the memory requirement and the number of bits per operand using quantization, clustering, and pruning [41] without significant loss of accuracy. In addition, it is platform independent, making it adaptable to any GPU architecture or even mobile phones (both iOS and Android smartphones). It can also be suitable for a fine-grained pipeline architecture [42]. In [43], it was reported that TensorRT outperforms TFLite by around 40% for real-time face recognition, but at the expense of higher power consumption when implemented on an NVIDIA GPU. DNNs feature intensive low-level processing, making them suitable for the highly parallel architectures readily available nowadays, such as multicore CPUs or GPUs using either SIMD or SIMT (Single Instruction Multiple Threads) paradigms. This has led to the emergence of several DNN/CNN models designed for IoT and edge devices with low power consumption, low computation power, and low memory capacity, while still achieving good accuracy. These include the MobileNets family [44][45], the SqueezeNets family [42], the ShuffleNets family [45], and the ESPNets family [41]. Nevertheless, the SqueezeNet architecture remains the smallest and simplest type of DNN, as it requires only 3 x 3 and 1 x 1 convolutions, pooling, and a fully connected layer, which is not required for the feature extraction task [43]. Like large DNNs in general, large CNNs are too complex to run on IoT/edge devices with limited computing power, which has created great interest in designing efficient CNNs. The AlexNet [2] and GoogLeNet [14] CNN models have also become renowned for their exceptional performance in image recognition targeting IoT and edge machine vision tasks.
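The effect of the 8-bit quantization mentioned above can be illustrated with a minimal affine (scale plus zero-point) scheme of the kind TFLite uses. The mapping below is a generic sketch, not TFLite's exact implementation.

```python
# Minimal affine 8-bit quantization: real-valued weights are mapped to
# uint8 with a scale and zero-point, shrinking storage 4x versus float32
# at a small reconstruction error.

def quantize(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0           # guard against a zero range
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

w = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, s, z = quantize(w)
recovered = dequantize(q, s, z)
print(max(abs(a - b) for a, b in zip(w, recovered)))   # small error (<0.01 here)
```

Inference frameworks additionally fold the scale into the MAC pipeline so the bulk of the arithmetic stays in integers, which is what makes 8-bit inference cheap on edge silicon.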
In [41], it was shown that using the squared Euclidean distance as a pruning criterion, 6.7x and 1.5x compression were achieved for the MNIST and AlexNet networks respectively, while losing less than 2.2% of ImageNet accuracy. Another promising design approach for CNNs targeting IoT and edge devices is Neural Architecture Search (NAS), which aims to automate the design of neural networks by jointly optimizing multiple objectives, such as accuracy and efficiency, which is difficult for humans [3]. NAS methods have already achieved state-of-the-art results in image classification and object detection [34], outperforming the previous best manually designed networks.
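As a hedged stand-in for the squared-Euclidean-distance criterion of [41] (whose details are not reproduced here), simple magnitude pruning conveys the same compression idea: the smallest weights are zeroed and only the survivors need to be stored.

```python
# Magnitude pruning sketch: zero out the smallest fraction of weights.

def prune(weights, sparsity=0.5):
    """Zero the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= cutoff else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = prune(w, sparsity=0.5)
print(pruned)                      # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
compression = len(w) / sum(1 for x in pruned if x != 0.0)
print(f"{compression:.1f}x fewer nonzero weights")   # -> 2.0x
```

In practice pruning is followed by fine-tuning, and the surviving weights are stored in a sparse format so that both memory and MAC counts drop.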

IV. IoT and EDGE PROCESSORS FOR EMBEDDED MACHINE VISION SYSTEMS
In the 1990s, FPGAs and DSPs were the technologies of choice for real-time image processing associated with low- and middle-complexity tasks [3]. This was because they host functional units such as multiply-and-accumulate (MAC) units, which was not the case for single-core CPUs. For instance, TI's OMAP3530 DSP processor, which includes a fixed-point 520 MHz C64x core and an ARM 720 MHz Cortex-A8 processor, can run a stereovision algorithm using a 7 x 7 pixel sliding window with a throughput of 8 fps [31], which still remains weak for real-time applications. A higher-caliber TI DSP processor, the TMS320C6678, which features 8 x 1 GHz CPU cores configurable for both fixed-point and floating-point operation, was designed to handle highly complex image processing algorithms such as motion estimation; it also remains weak, however, since the throughput for a 128 x 128 image is only 9 fps.

A. GPU-based IoT and edge machine vision systems
The basic building block of a GPU is the Streaming Multiprocessor (SM), which groups several streaming processor (SP) cores, each of which comprises an ALU that executes integer and floating-point arithmetic and logic operations. The number of cores in an SM is fixed at 32, but the number of SMs ranges from one in low-end GPUs to more than sixteen in high-end GPUs. L1 caches are available in each SM; however, they are much smaller than on a CPU. Additionally, all SMs in the GPU are connected to a shared L2 cache and the off-chip memory (GDDR) via a network-on-chip (NoC), as shown in Figure 5. To maximize the GPU's overall performance, the shared L1 memory within an SM is organized into 32 banks (one bank per SP), so that all threads in a warp can access different memory banks in parallel. However, if two threads in a warp access different items in the same memory bank, a bank conflict occurs and accesses to this bank are serialized, potentially hurting performance. GPUs achieve a very high degree of parallelism by having a single SM host several warps, each holding up to 32 threads that can run in parallel at any point in time.
Threads of a warp start executing at the same program address but have their own private register state and program counters, so each thread is free to branch independently. However, when threads in the same warp follow different execution paths, the threads are serialized by the hardware. Thus, GPUs are based on Single Instruction Multiple Thread (SIMT) architectures and are designed to maximize the amount of parallel processing in the graphics pipeline. Unlike the SIMD architecture, where the processing engines typically execute the same instruction, the SIMT architecture allows different threads to more readily follow divergent execution paths through a given thread program. Hence SIMD processing represents a functional subset of the SIMT processing regime. A CPU (single- or multicore) is frequently provided alongside the GPU to control it and to execute algorithms which are not adequate for SIMT/SIMD architectures. CPU-GPU integration into a single chip is provided in most GPU processors to overcome the limited bandwidth of PCIe, which was used to transfer data between the CPU and GPU in older GPU processors. In addition, both the CPU and GPU share the same main memory, giving them more opportunities for fine-grained cooperation [23]. Among the GPU processors recently used in IoT/edge cameras, one can mention the recent NVIDIA Jetson AGX Xavier, which comprises a 512-core Volta GPU. It also includes 64 Tensor cores, an 8-core ARM v8.2 64-bit CPU, and 32 GB of 256-bit LPDDR4x memory with a bandwidth of up to 137 GB/s, to yield up to 32 TOPS of AI performance. The processor can be configured to consume 10 W, 15 W, or 30 W depending on the target performance. This is the latest NVIDIA GPU processor, following earlier ones almost 20x less powerful, such as NVIDIA's TX2 processor, which comprises a 256-core Pascal GPU.
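The divergence penalty described above can be modeled with a toy cycle count: a warp needs one pass per distinct execution path its threads take. The branch condition below is arbitrary, for illustration only.

```python
# Toy SIMT serialization model: a warp executes in lock-step, so a branch
# that splits the warp costs one pass per distinct path taken.

def warp_cycles(thread_inputs, body_cycles=1):
    """Cycles to execute one branch: one pass per distinct path taken."""
    taken = {x % 2 == 0 for x in thread_inputs}   # per-thread branch condition
    return len(taken) * body_cycles               # divergent paths serialize

uniform   = warp_cycles([2, 4, 6, 8])    # all threads take the same path
divergent = warp_cycles([1, 2, 3, 4])    # warp splits into two paths
print(uniform, divergent)                # prints "1 2"
```

This is why GPU kernels are written so that threads within a warp take the same branch wherever possible, e.g. by sorting work items before dispatch.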
An earlier NVIDIA processor, the TX1, was recently tested in an embedded autonomous car to implement a stereo matching algorithm for 3D mapping, where a performance of 42 frames/s could be achieved [16]. The algorithm yields a mid-level representation of the scene, where the image is divided into vertical regions that are segmented based on their disparity. Table 1 below shows the performance of more recent NVIDIA edge GPUs in running different kinds of DNN inference models suitable for image classification, object detection, pose estimation, and semantic segmentation [13]. At the high end, one such processor comprises up to 80 GB of memory; however, it consumes more than 100 W. ARM's 500 MHz Mali-400 MP2 is another available GPU processor, which comprises 1 to 4 GPU cores delivering up to 500 Mpixels/s. The processor was initially designed for tablets and smartphones to accelerate graphics applications and was then extended to host embedded IoT and edge machine vision tasks, such as real-time face recognition using an additional quad-core i7 processor [19].

B. Multicore CPUs-based IoT and edge machine vision systems
The limitations of multicore superscalar processors become prominent as instruction scheduling grows complex. The intrinsic parallelism in the instruction stream of machine vision algorithms, the scheduling complexity, and the branch instruction issue are addressed by a different instruction set architecture, namely the Very Long Instruction Word (VLIW) machine. In this architecture, several independent operations are grouped together in a single VLIW instruction and are issued in the same clock cycle. This has the advantage of reducing hardware complexity and power consumption, and of simplifying instruction issue, since the compiler takes care of the data dependency checks. One such VLIW-based IoT processor offers an 8-bit CPI (Camera Parallel Interface) and is built in a state-of-the-art 22 nm FD-SOI semiconductor process. It also includes security modules such as AES-128/256 cryptography and a Physically Unclonable Function (PUF) that allows the device to be uniquely and securely identified. The processor is dedicated to processing small images (160 x 160), on which the MobileNet V1 network can run in just 12 ms at a power consumption of 806 mW per frame per second [7]. It is intended for battery-powered IoT cameras, such as people counting and attention awareness, to run for years on a single AA battery using adjustable frequency and voltage domains and automatic clock gating. Its built-in Hardware Convolution Engine (HWCE), a coprocessor dedicated to accelerating CNN computation, can run in one clock cycle a single 5 x 5 or 4 x 7 convolution using 16-bit weights and 16-bit pixels, or up to four simultaneous 5 x 5 or 4 x 7 convolutions using 4-bit weights and 16-bit pixels. It was shown that this processor can perform face recognition with a 1-second latency, using only 1.5 Mb of weights with 16-bit fixed-point computation.
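The HWCE figures above translate into a simple peak-MAC-throughput calculation. The 250 MHz clock below is a placeholder assumption, as the text does not give the accelerator's clock frequency.

```python
# One 5x5 convolution per cycle is 25 MACs/cycle; four simultaneous
# convolutions at 4-bit weights quadruple that. The clock is assumed.

def macs_per_second(mask_h, mask_w, parallel_convs, clock_hz):
    """Peak multiply-accumulates per second for the convolution engine."""
    return mask_h * mask_w * parallel_convs * clock_hz

clock = 250e6                                     # assumed 250 MHz clock
print(macs_per_second(5, 5, 1, clock) / 1e9)      # -> 6.25 GMAC/s (16-bit mode)
print(macs_per_second(5, 5, 4, clock) / 1e9)      # -> 25.0 GMAC/s (4-bit mode)
```

The 4x jump from narrowing the weights to 4 bits illustrates the general precision-for-throughput trade-off exploited by low-power accelerators.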
Intel is another active player in machine vision IoT and edge processors. For instance, the 700 MHz Myriad X VPU, which features a neural compute engine, a dedicated hardware accelerator for deep neural network inference, is designed to handle edge machine vision applications such as autonomous navigation, Virtual/Augmented Reality (VR/AR), and smart cameras [15]. This 16 nm processor, which consumes as little as 2.5 W, features 16 programmable 128-bit VLIW vector processors and a neural compute engine (DNN accelerator) delivering 1 TOPS of DNN inferencing. It also comprises an enhanced vision accelerator, a 4K real-time video encoder (H.264/H.265), 8 MIPI lanes to connect up to 8 high-speed cameras, and an LPDDR4 interface, all attractive attributes for an edge computing device. The vision accelerator is modular and comprises 20 dedicated hardware modules performing tasks such as stereo depth and motion estimation in real time (at a 60 Hz frame rate) by processing the pixels of up to 6 camera inputs (720p) on the fly, without storage. These features allow the processor to achieve an aggregate performance of up to 4 TOPS at 2.5 W power consumption. Very recently, a PCIe board, namely the AD-VEGA-340, which includes 8 Movidius Myriad X VPU processors, was proposed to handle complex AI algorithms at the edge. Another similar board, which also comprises 8 such processors interconnected through 4 PCIe/USB 3.0 interfaces to consume less than 30 W, was suggested in [12] (Figure 6). The board consumes approximately 25 W and comprises a dynamic interconnection engine that interconnects the processors according to various topologies.
Very recently, a high-performance industrial machine vision edge system integrating 16 such processors was made commercially available by AAEON to support up to 200 IP cameras and handle various machine vision applications at the edge, including automated warehouse and logistics processing, production line monitoring, and door security [33]. Most CPUs in the aforementioned processors support a VLIW instruction set for the reasons invoked earlier.
However, this requires complex compilers and increases program size, and it may be prone to unscheduled events such as cache misses, which can stall the entire processor. Texas Instruments is another important IoT/edge machine vision processor player, targeting mainly perception systems for autonomous cars through its very recent 2 GHz Jacinto TDA4x machine vision edge processor, which yields an impressive 10 TOPS/W. The processor can be interfaced with up to 8 MP camera capture streams with real-time processing and analytics. It comprises an array of pre- and post-processing engines for neural networks and is able to accurately identify lane lines of various types and colors with a recognized curvature radius of at least 5 m. It also allows multi-sensor fusion using LiDAR, IMU, and ultrasonic sensor information for high-precision trajectory calculation [14]. The processor comprises a single C7x DSP performing 80 GFLOPS or 256 GOPS, with a neural network accelerator based on matrix multiply accelerators (MMA) performing up to 8 TOPS at 1 GHz, a dual-core C66x DSP, an integrated imaging signal pipe (VPAC) with an image signal processor (ISP), hardware engines for optical flow and stereo disparity (DMPAC), and the latest dual-core ARM Cortex-A72. It also comprises an Imagination 8XE-series GPU. Multiple algorithms, namely motion segmentation, depth estimation, multiple object detection (pedestrian, cyclist, vehicle, etc.), semantic segmentation, parking spot detection, and visual localization, are performed on the DSP cores with MMA. In spite of their high performance for both IoT and edge-based machine vision applications, GPUs and multicore CPUs are still relatively large and consume relatively high power, since they result from scaling down architectures originally dedicated to 3D graphics on desktop computers.
In addition, while GPUs and multicore CPUs excel at dense floating-point computation, researchers have reported higher throughput and lower power consumption, at the expense of a reasonable drop in accuracy, with custom FPGA or ASIC hardware using low-precision fixed-point quantization [15]. The TPU architecture [11] can also be scaled down for IoT and edge machine vision systems because of its low power density. These attributes paved the way for Google's recent Edge TPU IC, which performs inference for machine vision tasks at the edge, yielding 4 TOPS at 0.5 W per TOPS [12] (Figure 7). The accelerator contains a 2D array of processing elements (PEs), where each PE embodies one or more cores, each core having multiple compute lanes with multiple MAC units operating in single-instruction multiple-data (SIMD) style; the PEs can also work in a systolic manner. Each PE has a PE memory shared across its cores, and each core is equipped with a dedicated core memory. The device is available as a USB accelerator that can be plugged into a desktop computer, or embedded in a Linux-based Dev Board [13]. The latter is one of the latest boards comprising Edge TPUs; it pairs the TPU with a quad-core Arm Cortex-A53 processor via PCIe and I2C/GPIO interfaces and LPDDR4 memory, to execute state-of-the-art mobile vision models such as MobileNet v2 at almost 400 fps in a power-efficient manner.
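The systolic operation of such a PE grid can be illustrated with a small simulation. The following is a toy NumPy sketch of an output-stationary systolic matrix multiply; the array size, skewing, and scheduling are illustrative assumptions, not Google's actual microarchitecture:

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.
    Each PE (i, j) holds one accumulator; elements of A stream in from the
    left and elements of B from the top, skewed by one cycle per row/column,
    so every active PE performs one MAC per cycle."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    # Run enough cycles for the last skewed operand to reach the far PE.
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j            # which term of the dot product
                if 0 <= s < k:           # arrives at PE (i, j) at cycle t
                    C[i, j] += A[i, s] * B[s, j]
    return C

A = np.arange(6).reshape(2, 3).astype(float)
B = np.arange(12).reshape(3, 4).astype(float)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point of the skewed schedule is that no PE ever waits on a global memory fetch: operands arrive from neighbouring PEs, which is what makes the dataflow bandwidth-efficient.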

C. ASIC-based embedded IoT/edge machine vision systems
Fig. 7. Block diagram of the systolic array-based architecture of the Google TPU processor: (a) overall architecture and (b) PE architecture.
Table 2 shows the overall performance of the chip for different neural network algorithms trained on an ImageNet dataset with 100 classes. A systolic array claims several advantages, such as a simple and regular design, concurrency, and balanced computation and I/O. However, until the TPU there had been no commercially successful processor based on a systolic array; the TPU is the first, and it is arguably the largest systolic array ever implemented or even conceived. Figure 8 shows the block diagram architecture of another recent edge machine vision processor, the EyeQ5 from Mobileye (an Intel company) [14]. More than 27 car manufacturers have chosen the chip for their assisted-driving technologies based on its performance, as it is able to support fully autonomous (level 5) vehicles. The 7 nm FinFET processor can provide up to 25 DNN TOPS at a maximal power of 10 W. It features several VLIW SIMD vector microcode processors (VMP) to provide hardware support for operations common in computer vision applications. It also features a Programmable Macro Array (PMA), initially introduced in the EyeQ4 and now in its 2nd generation, which enables a computation density nearing that of fixed-function hardware accelerators without sacrificing programmability. Another 8-core CPU engine is available to handle complex tasks such as object tracking using the built-in Lucas-Kanade algorithm [16], where the image inversion step does not fit the SIMD architecture.
Geely Auto Group is planning to use two such chips by the end of 2021 to yield a level 4 autonomous vehicle using seven long-range and four close-range cameras, delivering a 360° surround view to enable a scalable feature bundle supporting hands-free highway driving, navigation-based highway-to-highway and arterial driving, and up to urban hands-free driving. The chip is the fruit of a joint effort between Mobileye and STMicroelectronics (STM), leveraging Mobileye's substantial experience in automotive-grade designs and STM's support in state-of-the-art physical implementation, as well as automotive-grade memories, high-speed interfaces, and system-in-package design. IBM is one of the rare companies that offer ASIC machine vision chips for asynchronous event-based sensors, namely "silicon retinae", which are inspired by biological systems. It very recently introduced the 28 nm TrueNorth all-digital ASIC chip, which performs about 46 billion synaptic operations per second with 70 mW power consumption using 1 million neurons and 256 million synapses implemented with 5.4 billion transistors (Figure 9). The chip, which comprises around 4096 cores (each core comprising 256 input axons, 256 output neurons, and 256 x 256 synapses), was recently successfully used in real-time low-power vision applications. It is the outcome of a decade of research efforts under the DARPA SyNAPSE program. TrueNorth chips can be connected directly together to form larger systems, and a circuit board with 16 chips has been developed (e.g. IBM's NS16e board [9]). Nevertheless, one significant disadvantage of this chip is the weak representation of neural networks trained with floating-point precision, since all communication between neurosynaptic cores is carried by binary spikes. This drop in accuracy is compensated using stochastic computing methods, where the data is statistically represented in both temporal and spatial domains using multiple spikes [10].
Fig. 8. Hardware block diagram of the EyeQ5 processor
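The stochastic spike representation used to offset TrueNorth's binary-spike precision loss amounts to rate coding, in which a graded value is carried by the firing rate of a spike train. A minimal sketch (illustrative only, not IBM's actual encoding scheme):

```python
import numpy as np

def rate_encode(value, n_spikes, rng):
    """Encode a real value in [0, 1] as a binary spike train whose
    average firing rate equals the value."""
    return (rng.random(n_spikes) < value).astype(np.uint8)

def rate_decode(spikes):
    """Recover the value as the observed firing rate."""
    return spikes.mean()

rng = np.random.default_rng(42)
x = 0.63
spikes = rate_encode(x, 4096, rng)
print(rate_decode(spikes))   # close to 0.63; precision grows with spike count
```

The trade-off is exactly the one the text describes: each extra bit of precision costs more spikes (time or neurons), since the estimator's error shrinks only with the square root of the spike count.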

D. FPGA-based IoT and edge machine vision systems
Initially, FPGAs comprised a large number of reprogrammable LUTs interconnected via local and global interconnection networks to implement various sequential and combinational logic circuits running in parallel. Two major companies, namely Altera and Xilinx, emerged by successfully providing different chips and the associated software development tools. The high latency of implementing the large bit-width arithmetic operations required in machine vision algorithms, in addition to the memory access bottleneck, drove the manufacturers to enhance their respective chip architectures by adding hundreds of DSP blocks and local on-chip memories, enabling various parallel architectures such as MIMD, SIMD, and systolic architectures suitable for machine vision algorithms [12]. Nevertheless, this still remains insufficient for DNN algorithms because of the low on-chip capacity to host all the required DNN weights, which entails intensive external memory-FPGA data transfers that significantly increase algorithm latency. Altera was recently acquired by Intel, which may indicate that next-generation multicore CPUs will incorporate reprogrammable devices to accelerate machine vision and image processing tasks. This is the case of the recent Xilinx Zynq UltraScale+ MPSoC FPGA, one of the largest ICs, comprising a million logic cells along with a quad-core 1.5 GHz 64-bit Cortex-A53 processor and a dual-core Arm Cortex-R5F, making it suitable to host Xilinx's deep learning processing unit (DPU), created for researchers working on hardware-accelerated machine learning algorithms [30].
The chip, which supports several CNNs such as VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN, comes with an associated software development kit that performs pruning and quantization to satisfy low-latency requirements (DECENT), maps the neural network to DPU instructions (DNNC), and handles resource allocation and DPU scheduling (N2Cube). Very recently, at the Arm TechCon exhibition, the chip was demonstrated to successfully perform traffic light detection [10]. When running CNN tasks, it achieves 14 images/s/W, which outperforms the Tesla K40 GPU (4 images/s/W). Also, for object tracking tasks, it reaches 60 fps on a live 1080p video stream. In [35], it was demonstrated that a similar FPGA, Intel's Stratix-10, could yield 50% greater throughput than the Titan X GPU using 6-bit fixed-point data, while the GPU performs better for FP32 operands. Intel also offers a similar edge platform, the Mustang-F100 [26], based on Intel's Arria-10 FPGA, which according to the company is more suitable than the 8-VPU Mustang-V100 platform in case additional applications such as speech recognition are required; otherwise, for purely video processing applications, the Mustang-F100 platform is recommended [7].
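The low-precision fixed-point quantization behind such FPGA results can be sketched as follows. This is a minimal symmetric per-tensor scheme in NumPy; the 8-bit width, scaling rule, and tensor sizes are illustrative assumptions, not the exact scheme used in the works cited above:

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric linear quantization of a float tensor to signed integers.
    Returns the integer tensor and the scale needed to recover real values."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8
    scale = np.max(np.abs(x)) / qmax      # one scale per tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Float weights/activations as produced by training (illustrative data)
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
a = rng.standard_normal(256).astype(np.float32)

qw, sw = quantize(w)
qa, sa = quantize(a)

# The multiply-accumulate runs entirely in integer arithmetic (what FPGA
# DSP blocks and ASIC MAC arrays execute), with one float rescale at the end.
acc = np.dot(qw, qa)                      # integer accumulator
approx = acc * sw * sa                    # dequantized result
exact = float(np.dot(w, a))
print(approx, exact)                      # the two agree closely
```

The same pattern extends to narrower widths (such as the 6-bit operands reported for the Stratix-10): only the per-element multipliers shrink, while a wide integer accumulator absorbs the sums, which is precisely what makes the design cheap in FPGA fabric.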

V. EMBEDDED IoT and EDGE HARDWARE VISION SYSTEMS
Taking advantage of the excellent performance of IoT and edge processors dedicated to machine vision applications, various IoT and edge machine vision systems using either GPU, multicore CPU, ASIC, or FPGA are now available for a wide range of applications, from face detection to perception systems for autonomous vehicles. For instance, NVIDIA has very recently released an embedded off-the-grid DNN camera based on its Jetson TX2 processor [38] to support Caffe, TensorFlow, and cuDNN DNN algorithms in real time. However, its cost (2,418 US$) is relatively high for a typical IoT device, which may prevent its dense deployment in public places. The camera, which can be considered one of the existing highest standards, provides all common communication interfaces, such as Gb Ethernet, WiFi 802.11, and Bluetooth, and consumes around 7 to 15 W at 12-42 V for solar or industrial power. The Wahtari neural camera is another Linux-based camera dedicated to real-time machine vision applications using a single Intel Movidius Myriad X VPU processor [9]. The camera can deliver inference at 45 fps and detect more than 7000 license plates per hour using just 18 W of power, and it can host three different neural networks simultaneously. Gorilla is another company that provides edge machine vision cameras, using Intel's 2 GHz quad-core Atom E3950 processor [10]. A similar binocular camera, the iDS-2CD8426G0/F-I, using an undisclosed GPU processor, is supplied by Hikvision Digital Technology to perform face recognition at the edge for up to five simultaneous input video streams (1920 x 1080 pixels) [12]. The camera, which uses mixed DNN and stereovision-based algorithms, is dedicated to operation in public places to perform additional tasks such as perimeter intrusion detection and crowd density calculation, as well as fall and loitering detection.
Dahua Technology is another company that provides a DSP-based camera system using Sony image sensor technology; it can handle up to 10,000 different face images and manage five image libraries [122].

A. Autonomous navigation in moving vehicles
Recently, with the vast improvements in computing technologies (e.g., sensors, computer vision, machine learning, and hardware acceleration) and the wide deployment of communication mechanisms (e.g., dedicated short-range communications (DSRC), cellular vehicle-to-everything (C-V2X), and 5G), autonomous driving research has attracted massive attention from both the academic and automotive communities [13] [14]. Many automotive companies have made enormous investments in this domain, including ArgoAI, Audi, Baidu, Cruise, Mercedes-Benz, Tesla, Uber, and Waymo, to name a few [15]. Nevertheless, cost is one of the essential factors that affect the broad deployment of autonomous vehicles. According to [41], the cost of a level 4 autonomous driving vehicle reaches 300,000 US$, of which the sensors, computing device, and communication device account for almost 200,000 US$. Another fact to pay attention to is that, currently, the field testing of level 2 autonomous driving vehicles mostly happens in places with good weather and light traffic conditions, like Arizona and Florida; the real traffic environment is too complicated for current autonomous driving systems to understand and handle easily. The objectives of level 4 and 5 autonomous driving require colossal improvement of the computing systems for autonomous vehicles. Light offers a low-cost multi-camera array in different sizes to build a 3D map of the surrounding environment using the stereovision technique [43]. The camera, which comprises several lenses with different focal lengths, is dedicated to autonomous cars as a substitute for the high-cost and shorter-range LIDAR systems. The stereovision algorithm is implemented on an undisclosed FPGA to yield 95 million points/s, which according to the company will eventually be replaced by an ASIC chip [24].
This follows other research works where stereovision algorithms were successfully implemented on FPGA-based boards, such as the 2 W Xilinx Spartan-6-based real-time stereo camera [25]. Stereovision outperforms Mobileye's single-camera approach, which relies on inference, performing image segmentation and object detection to measure the object size in pixels and estimate the distance to a target, a rather hit-and-miss method. Even multi-camera systems used by companies like Tesla are challenged, because they use three lenses of varying focal length clustered together. Mobileye is also offering another very recent vision system for autonomous cars based on its proprietary EyeQ5 processor (two processors in the recent system) [26], interfaced with Intel's Atom 3xx4 CPU, which yields 60% more performance at the same 30 W power consumption as NVIDIA's new Jetson Xavier chip. The system provides hands-free highway driving and automated parking, in addition to other legacy autonomous driving features.
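The fixed-window matching at the heart of such FPGA stereo cameras can be sketched as follows. This is a deliberately naive sum-of-absolute-differences (SAD) version in NumPy, with illustrative calibration numbers; real pipelines add rectification, cost aggregation, and sub-pixel refinement:

```python
import numpy as np

def block_match_disparity(left, right, patch=5, max_disp=32):
    """Naive SAD block matching: for each pixel in the left image, find the
    horizontal shift of the best-matching patch in the right image. The
    fixed-window, fixed-search-range structure is what maps so naturally
    onto a hardware pipeline."""
    h, w = left.shape
    r = patch // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(ref - right[y - r:y + r + 1,
                                        x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Depth follows from similar triangles: Z = f * B / d
focal_px, baseline_m = 640.0, 0.125    # illustrative calibration values
d = 40                                  # disparity in pixels
print(focal_px * baseline_m / d)       # depth in metres -> 2.0
```

The depth formula also explains the accuracy argument against single-camera inference: disparity is a direct geometric measurement, whereas pixel-size heuristics depend on knowing the object's true size.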

B. Face recognition and detection for IoT and edge cameras
Several cameras have been suggested for face recognition, and most of them include a pair of stereovision cameras to estimate the 3D position of the face, ensuring easy and accurate detection of the face within the scene. For instance, in [37], a 1.2 GHz quad-core A5 CPU together with ARM's Mali-400 MP GPU were used in a 2 Mpixel binocular face recognition system to recognize up to 20,000 different faces within a database of up to 200,000 faces, at a distance ranging from 0.5 to 3 m and with a latency below 0.5 s, which is still excessively high for some real-time applications. The accuracy of the system is claimed to be 99.9%. AAEON is another supplier that provides an edge solution for face recognition using Gigabit Ethernet and NVIDIA's Jetson TX2 processor [27]. The board has four Intel LAN (i211) ports to support four simultaneous IP cameras at the edge.

VI. PENDING CHALLENGES and PROSPECTS
As was mentioned throughout this paper, one of the pending challenges with IoT and edge-based machine vision devices remains the excessive power consumption, mainly of the processors. As shown in Figure 10, FPGA offers the best speed-up per Watt for DNN algorithms, while GPU offers the highest speed-up but at the expense of higher power consumption. Being power hungry, GPUs are challenged by ASIC alternatives, of which Google's TPU [29], the Graphcore IPU [34], and Amazon Inferentia [35] are just a few examples. To tackle the power consumption issue further, some researchers have suggested purely analog DNN devices for machine vision applications [22]. For instance, in [39] the authors implemented the LeNet-5 CNN (originally designed for character recognition) at very low power using the CIFAR-10 dataset, which contains 60 k 32 x 32 images assigned to 10 different classes. Another alternative being explored is processing in memory (e.g. content-addressable memory (CAM) and resistive RAM (ReRAM)), which however features reduced computation power compared to GPU and FPGA processors [13]. Nevertheless, this concept may be suitable for CNN applications, since they do not require highly complex computations, especially if weight quantization and compression are used. Table 3 below compares GPU and FPGA implementations of some image processing algorithms; it can be noticed that while the FPGA outperforms the GPU for small-kernel image convolution algorithms, NVIDIA's GPU performs better for stereovision [16]. Very recently, IBM unveiled what it claims to be the smallest and most powerful processor technology, using a 2 nm process that can yield 45% higher performance and about 75% lower energy usage than today's most advanced 7 nm chips [28]. The chip is expected to enter production in 2024, paving the way for further accelerated machine vision and AI algorithms.
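The reason FPGAs shine on small-kernel convolution is visible in its dataflow: a 3x3 filter is just nine independent multiply-accumulates per output pixel, which map directly onto parallel DSP blocks fed by line buffers. A minimal NumPy sketch of the direct form (cross-correlation, as used in CNNs):

```python
import numpy as np

def conv3x3(img, k):
    """Direct-form 3x3 filtering ('valid' region only): every output pixel
    is the sum of nine shifted multiply-accumulates. Each of the nine terms
    is independent of the others, so in hardware all nine MACs can run in
    parallel on dedicated DSP blocks, one image row behind a line buffer."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):   # nine shifted MACs, all independent
            out += k[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out
```

For a fixed small kernel the entire loop nest unrolls into a constant-depth pipeline producing one pixel per clock, which is why the crossover in Table 3 favors the FPGA precisely in the small-kernel regime.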
In an effort to let programmers take full advantage of the complex hardware architecture of FPGAs, several researchers have suggested hardware compilers and high-level synthesis (HLS) tools to directly map DNN algorithms onto FPGA platforms. An example of such tools is DNNWeaver, which seamlessly generates a synthesizable hardware accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [28]. In [29], researchers from Intel revealed that, using this tool with either Altera's Stratix V and Arria-10 or Xilinx's Zynq FPGAs, a speed-up could be achieved over the most powerful existing multicore CPUs (i.e. Xeon E3) and manycore GPUs (Tegra K1, GTX 650 Ti, and Tesla K40). The ARM A15 CPU core was also assessed and performed the weakest, with the advantage of very low power.

VII. CONCLUSION
This paper summarizes the main recent findings in hardware architectures dedicated to IoT and edge devices. Four different processor design technologies are emerging: GPU, multicore CPU, FPGA, and ASIC, or a combination of them, which is the present trend. Despite the unprecedented growth in the processing abilities of these processors for machine vision applications, the actual performance is bounded by the low bandwidth, relatively high power consumption, and high latency of data movement between the processors and memory subsystems, especially when they are located in different chips. While FPGA offers the best performance per Watt and GPU is more adequate for applications requiring very high speed-up with relaxed power constraints, FPGAs require hardware skills and usually time-consuming low-level programming to successfully implement a given application. A combined top-down and bottom-up hardware-software co-design approach, simultaneously designing the hardware platform and the machine vision algorithm, can help optimize the hardware solution in terms of area, power, and latency. This approach is most evident for FPGA platforms because of their fine-grained configurability, which can yield immense optimization using a proper high-level synthesis tool. Thus, there is a further need to explore more aggressively some features of machine vision algorithms, such as the data redundancy that is abundant in both image processing algorithms and DNN inference models, as well as dynamic quantization of their weights at different layers. Giant semiconductor companies are continuously increasing their revenues in AI-based accelerators, such as Intel's current 3.5 billion US$ AI-related revenue [16]. Intel and NVIDIA are engaged in a battle over AI chips for autonomous vehicles with their state-of-the-art EyeQ5 and Jetson Xavier AGX processors, respectively.
This will bring great opportunities for researchers to build their own IoT and edge machine vision chips for other applications. Nevertheless, one major issue facing scientists using state-of-the-art commercially available AI accelerators is the lack of detailed documentation that would allow full exploration of the chips. These include the EyeQ5 and Jetson Xavier AGX processors, as well as other chips from companies such as Waymo, Uber, and GM Cruise. Disclosing the datasheets and other associated software documentation of these processors to researchers would help achieve great machine vision systems.