LOW LATENCY DEEP LEARNING INFERENCE MODEL FOR DISTRIBUTED INTELLIGENT IOT EDGE CLUSTERS

Edge computing is a new paradigm that enables intelligent applications for the Internet of Things (IoT) using mobile, low-cost IoT devices embedded with data analytics. Due to the resource limitations of IoT devices, it is essential to use these resources optimally; intelligence therefore needs to be applied through an efficient deep learning model that conserves memory, power, and computational capacity. In addition, intelligent edge computing is essential for real-time applications that require an end-to-end delay or response time within a few seconds. We propose decentralized heterogeneous edge clusters deployed with an optimized pre-trained YOLOv2 model. In our model, the weights are pruned, split into fused layers, and distributed to edge devices for processing; the gateway device then merges the partial results from each edge device to obtain the processed output. We deploy a convolutional neural network (CNN) on resource-constrained IoT devices to make them intelligent and realistic. Evaluation was done by deploying the proposed model on five IoT edge devices and a gateway device equipped with a hardware accelerator. The evaluation of our proposed model shows significant improvement in terms of communication size and inference latency: compared to DeepThings for 5 X 5 fused layer partitioning on five devices, our proposed model reduces communication size by ~14.4% and inference latency by ~16%.

FIGURE 1. Illustration of 'Geriatric care', a time-critical application of IoT deployment. Overview of the edge computing strategy in which the edge server is equipped with pre-trained DL models to identify abnormal sleeping positions (fall detection) with the least latency and high accuracy. The computing is independent of the cloud to avoid the associated transmission delays.
Time-critical applications such as these cannot afford to wait for the cloud to handle such a huge data-processing load in real time. Cameras act as IoT devices that record data, process it, and deliver valuable insights on site rather than sending everything to the cloud, thereby reducing response time and latency. Processing such sensitive and personal data locally at the edge also addresses privacy concerns.
An edge computing market analysis report [3] estimates that the edge computing market will reach USD 43.4 billion by 2027, driven by the growth of numerous applications of technologies such as 5G and data analytics. The unprecedented demand for smart edge devices has driven the industry to innovate and implement intelligent edge architectures for real-time, mission-critical applications that involve large numbers of heterogeneous devices.
A machine is considered intelligent if it mimics human behaviour such as perception, attention, cognition, and decision making. After a few AI winters, machine learning [4] has regained momentum, and the proliferation of intelligent devices is now driven by deep learning [5]. A convolutional neural network (CNN) is a deep learning algorithm widely used in computer vision, augmented reality, and virtual reality applications to process and classify images. An intelligent edge device is capable of handling data analytics via the deep learning algorithms embedded in it. With Edge Intelligence [6], the aim is to push the information-processing load from the traditional cloud to edge devices to make them suitable for real-time applications.
Among deep learning models, the CNN stands as a promising candidate for intelligent video analytics. Though there are many reasons for its massive adoption, its popularity can largely be attributed to its ability to learn features without human intervention and to its computational efficiency. As a result, many frameworks are being studied for the implementation of CNNs. Among object detection algorithms, YOLO is the most preferred framework because it learns a generalized object representation by looking at the image only once [7]. This makes it very fast and able to run in real time; as a fully convolutional network, YOLO can process 45 frames per second.
Many attempts have been made to implement deep learning algorithms for edge-intelligent applications [8]-[14]. The approach is either to train on the edge device [15], in the cloud [16]-[19], or on a local host [20], [21]. Training in the cloud is typically paired with deploying inference models [22], [23] at the edge to provide a speedy and efficient prediction process [24]. However, relatively few studies examine model compression techniques [25]-[28], knowledge distillation, network-science-based knowledge partitioning algorithms [29], Early-Exit [30], and pruning algorithms [31]-[34] to reduce the memory footprint [35]. Table 1 presents the state of the art of DNN implementation at the edge and the performance metrics targeted in recently published articles. A list of important abbreviations in alphabetical order is given in Table 2.
In our extensive survey of DL implementations on edge devices, the following issues surfaced: 1) Deploying deep learning models to edge devices such as mobile and other IoT devices requires enormous computational resources and leaves a vast memory footprint. 2) There is scarce research on inferencing techniques (for classification or prediction) in distributed heterogeneous IoT clusters. 3) Existing approaches are limited to layer-based partitioning (to reduce inference time) while completely neglecting the potential of compression techniques.
Previous works have generally tried to implement deep learning on edge devices using various object detection frameworks such as R-CNN and SSD. YOLO is preferred for object detection as it is faster than R-CNN; however, compared with SSD, it has lower accuracy, localization errors, and a poorer recall rate. Therefore, we chose YOLOv2, which is designed to overcome the limitations mentioned earlier.
Our work differs from the previously published articles mentioned above in many ways; most works have focused on layer partitioning or compression, whereas we apply pruning algorithms even before layer partitioning. In addition, the majority of works have aimed at implementation, while we attempt to optimize the pre-trained model to make it faster and more robust for real-time applications. Our objective in implementing DL on edge devices is threefold:
• We propose an Edge-to-Device edge computing framework that facilitates optimization using the weight-pruning method to deploy the model onto small smart devices devised for time-critical applications. Our vision is to manifest a holistic distributed deep learning framework that orchestrates edge clusters to process heterogeneous local data, take decisions (with the aid of a pre-trained DL model obtained from the cloud), and execute real-time applications with the least latency.
• We formulate latency minimization as a pruning problem and obtain an optimized pre-trained model by weight pruning and fine-tuning. We intend to build a horizontal collaborative CNN inference accelerating system in which the input feature maps are partitioned and distributed among resource-constrained edge devices such that the memory footprint and latency are minimal.
• We evaluate the performance of our proposed framework and optimization techniques on a heterogeneous distributed edge network under a variety of system parameters.
The remainder of the paper is arranged as follows: Section II provides a brief review of edge computing and of the literature on using various DL frameworks for real-time object detection. Section III presents the proposed model, while Section IV presents the details of the experimental setup. Results and discussion are presented in Section V, and the paper concludes in Section VI.

II. BACKGROUND AND LITERATURE REVIEW
Any IoT implementation aims to develop a smarter environment with the least intervention. At IoT's core is its computing framework [41]-[43], which processes the data. The data processing can happen at a cloud server (cloud computing), at an edge server (edge computing), or between them (fog computing). Cloud servers are powerful computing machines [44] with no energy constraints. However, since all data must be transmitted to the cloud, cloud computing finds its limitation in processing IoT data due to its inherently high latency, lack of privacy [45], [46], and demand for high connectivity.
Moreover, IoT data differs from Big data and is characterized by large amounts of streaming data, heterogeneity, time and spatial correlation, and high noise data. Beyond traditional data analytics, IoT data demands fast and streaming data analytics with time sensitivity [47]. Hence, obtaining inference from this kind of data is not a straightforward task. However, recent progress in hardware with superior computing speeds and advanced ML techniques opens the doors for shifting the computing load to the edge.
Several paradigms have been proposed for distributing the data analytics workload among the cloud, edge and fog [48]. One approach is to have edge clusters (servers) pre-process the raw data before it is sent to the cloud for training, and send updates to the cloud after training. In addition, edge clusters can carry out data cleansing, dimensionality reduction (PCA), and noise removal (LDA) [49], and may also employ an autoencoder for extracting standard features. Therefore, having edge clusters benefits in two ways: 1) alleviating the workload of the network, and 2) significantly reducing the network latency.
Edge computing is ideal for time-critical IoT applications as it is near the end-user and is dedicated, unlike cloud computing (shared), thereby avoiding the delays incurred in transmission. In addition, edge computing costs less, requires less bandwidth, provides data privacy, and can be programmed for application-specific tasks [50]. Features such as mobility management, geo-distribution, location awareness, scalability, and ultra-low latency make edge computing [51] more suitable for IoT data analytics. Mission-critical applications (such as driverless cars, fire prediction, and geriatric care) demand latency of less than 1 ms [52] with high reliability. The challenge is how to serve resource-intensive functionalities with resource-constrained [53], [54] edge devices. Edge computing needs to embrace intelligent processes to alleviate the burden of computation, communication, and storage. Incorporating intelligence at the edge refers to specialized control mechanisms for being context-aware and responsive while optimizing latency and energy efficiency. Edge intelligence (the union of EC and AI) strives to mimic human cognition by processing and learning from the data generated among heterogeneous edge devices (and servers) in the proximity. Beyond being fast, secure, and economical, edge intelligence exploits the potential of richer data to provide application-specific optimization [55]. Thus, edge infrastructure will pave a path for the democratization of AI [56]. Whether intelligent processing enables edge architectures or vice versa, end-users will be the beneficiaries, with reduced bottlenecks and enhanced scalability.
Advancements in ML algorithms capable of emulating human reasoning seem to be a step towards AI. DL, a subset of ML, is a state-of-the-art method for uncovering patterns and extracting valuable insights from large amounts of data. As the name suggests, deep learning learns at multiple levels to develop a model embedding complex relations among the data [57]. Neural networks mimic neurons in the brain, and deep neural networks are at the core of machine learning techniques spanning from simple data analytics to natural language processing. A typical deep neural network model contains layers of fully connected nodes (depicting neurons), and the process of passing the raw data forward from the input layer to the relevant category at the output layer is called model inference.
Among all ML techniques, deep learning is the one that is independent of domain-specific training, thereby accelerating inference via pre-trained models. However, DL implementation demands resources for computation, memory (and cache), and power.
Leveraging the potential of DL models for inference on resource-constrained edge devices [58] is the crux of the matter. Hosting Artificial Intelligence on edge devices via DNN computations, i.e. deploying DNN models close to the users for fast real-time execution, has been focused on in the past few years. However, though the idea is excellent, the computational complexity creates a bottleneck for its implementation. As a result, resource-constrained edge devices have to redefine themselves to achieve it.
Transforming edge computing is possible either by hardware acceleration [59] or by software acceleration, as described below:
1) Hardware transformations embrace DNN computation at the hardware design level for DL inference. Beyond the use of accelerators, they include hardware-friendly optimizations such as matrix multiplication factorization, data-path optimization, and parallel operations.
2) Software transformations involve defining novel DNN structures by finding the trade-off between accuracy and computation. The desired accuracy with moderate computation on resource-constrained devices can be achieved by compressing DNN models via pruning; other techniques such as quantization and approximation are also employed.
3) Complementary to the above two approaches, one can focus on run-time management. This includes DL model partitioning and offloading [60] the computational load between the cloud, edge and device to accelerate diverse embedded applications. Run-time management can be applied over and above the two transformations mentioned earlier.
Hardware accelerators find limitations in providing storage space for large pre-trained models. They may also fail to achieve the expected inference rate, which worsens when the accelerator executes other parallel tasks. Therefore, software acceleration must complement hardware accelerators to create an ecosystem for achieving inference models with excellent performance.

TABLE 3. Important definitions redefined for the deep-learning-based edge computing environment.

IoT: Encompasses sensors to sense events in the physical world, processing networks for processing and preparing the data, data analytics to obtain insights from the data, and system monitoring for data sharing between devices and servers or among themselves.
Edge Learning: Edge devices or edge servers perform prior learning at the edge of the network by pre-processing the raw data to reduce network traffic and speed up computation at the data centers; alternatively, edge devices send context-based raw data to the edge server for inference.
Gateway: A device that interfaces edge devices to the internet via various protocols.
Cloud Computing: Users access plentiful computational resources over the internet on demand, with little management effort and minimal interaction with service providers.
Offloading: Sharing the workload among the cloud, edge and fog.
Fog Computing: Many distributed, heterogeneous devices collaboratively cooperate to perform tasks and store the data in a fog or cloud server.
Edge Computing: Computing at the edge without passing anything to the cloud; works in offline mode.
Computing Frameworks: Different computing frameworks are used to process the data collected by the IoT, depending on the application; they are generally categorized as fog computing, edge computing and cloud computing.
Task Parallelization: An approach to maximize resource utilization for a dynamic task by exploiting concurrency, executing the task in parallel using various techniques.
Network Science: Deals with representing complex phenomena as networks and then creating models that can be used to understand and predict the desired phenomena.
Model Inference: A DNN model consists of a series of interconnected layers; the process of passing raw data forward from the input layer to one of the categories at the output layer is called model inference.
Model Compression: A class of techniques that reduce the size and computation of a DNN without losing accuracy, enabling the model to run on tiny devices either through pruning (weight, filter or layer) or through quantization (lower precision, fewer bits per weight).
Pruning: A compression technique used in deep neural networks to reduce the model's parameters through weight or channel pruning, producing a simpler model without losing accuracy.
Quantization: A model compression approach used to reduce the memory footprint; conventional deep learning models are trained with 32-bit floating-point weights and activations, and quantization reduces the number of bits used to represent weights and activation functions.
Knowledge Distillation: Training a significantly smaller student network to mimic a larger teacher model (model compression by replacing the teacher model with the smaller student model).
Layer Partitioning: To distribute work among the edge devices for faster inference and a smaller memory footprint, each layer's input feature map is partitioned and assigned to an edge device.
Fused Layer Partitioning: The stacked layers of a deep neural network are fused and divided vertically, and each partition is assigned to an edge device without any off-chip data movement.
Latency: Measured from the moment a request is initiated to the time the response is received, including transmission, waiting, and processing time.
Network Latency: The time taken to transfer data from the device to the cloud during data transmission.
Service Latency: The time taken by the processing device for computation based on its current workload.
Jitter: The variation in latency; an important measure for any application.
Accuracy: The frequency at which predictions match the labels.
In this regard, many attempts have been reported to optimize DNN models at edge devices [61], [62]. While communication load, communication overhead, cost, memory, processing speed, network bandwidth, jitter, and complexity are among the relevant performance parameters, much of the preliminary research has focused on low-latency and energy-efficient computation.
Tailoring pre-trained DL models to suit specific applications by creating hardware and software accelerators is the need of the hour. Along with hardware accelerators, DL model partitioning and distribution for inference is the key to exploiting the full potential of edge computing. Several frameworks have been proposed to leverage the capabilities of edge infrastructure enabled with hardware accelerators and embedded with DNN models. The aim is to achieve fast multimodal data analytics on smaller-scale platforms. DNN partitioning was conceived to accommodate resource-intensive computations on resource-constrained edge devices: the DNN model is partitioned into multiple parts and shared among devices, which then compute collaboratively for low-latency DNN inference.
Layer partitioning is done in two ways. One is layer-wise parallelization, where each layer is independently parallelized and appropriate techniques are selected for each layer to obtain the best performance. The other is fused-layer parallelization, where the output of one layer is fed as input to the next layer without any off-chip data movement (i.e. without going to memory). This work parallelizes multiple fused layers instead of a single layer at a time because it is scalable, network-bandwidth-efficient, and has a smaller memory footprint.
Table 3 provides the important definitions redefined for the deep-learning-based edge computing environment. Table 4 outlines the architectures, critical performance metrics, enabling approaches, and DL models and frameworks for DNN model inference at the edge.
Our work focuses on reviewing, analysing, and implementing DL model partitioning and distribution to accelerate CNN inference [63]. Model optimization via pruning is employed even before the model partitioning and distribution are performed. We use a Raspberry Pi 3B plugged with an Intel Movidius Neural Compute Stick (NCS) as a hardware accelerator to build a DL inference system. We report the pruning approach, the model partitioning and distribution methods, the experimental results, and the performance improvements achieved. In summary, we intend to build a horizontal collaborative CNN inference accelerating system in which the feature maps are partitioned and distributed among resource-constrained edge devices, such that the memory footprint and latency are minimal.

III. PROPOSED MODEL
As CNNs are resource-intensive, deploying a CNN on resource-constrained IoT devices while keeping the approach realistic for real-time applications is challenging. Therefore, we propose a CNN inference model for distributed heterogeneous edge clusters that minimizes the communication size and inference latency. Table 5 presents a summary of the notation used in this article.

A. SYSTEM MODEL
Our proposed system consists of a set of edge devices, an edge gateway, and a host machine. The edge devices and the edge gateway are interconnected to share information with each other, and the host machine triggers the edge devices and the gateway device to process the input feature map.
All the IoT devices in the network are denoted by $D$, and $C$ represents the communication edges between the devices; $ED$ represents an edge device and $EG$ represents the edge gateway. The overall network is denoted as
$$G = (D, C) \qquad (1)$$
The set of devices in the network is represented as
$$D = \{ED_1, ED_2, \ldots, EG\} \qquad (2)$$
The communication edges of the network are represented by $C$, and the value of $|C|$ depends on the number of IoT devices and edge computing devices.
If $|D| = n$, then the maximum value of $|C|$ (a fully connected cluster) is
$$|C| \le \frac{n(n-1)}{2} \qquad (3)$$
For example, with $n = 6$ devices (five edge devices and one gateway) there are at most 15 communication edges. Several authors have attempted to model the latency in deep-learning-based edge computing environments. To simplify the analysis, let us denote the latency by $\lambda$, which comprises the computational latency ($\lambda_{cl}$) and the network latency ($\lambda_{nl}$) [64]:
$$\lambda = \lambda_{cl} + \lambda_{nl} \qquad (4)$$
For our framework, the network latency $\lambda_{nl} = 0$ because data are not transmitted to the cloud; hence the latency for our framework reduces to $\lambda = \lambda_{cl}$. The computational latency $\lambda_{cl}$ depends mainly on processing the input and, in turn, on the workload of each IoT device's processor. Considering the task-assignment indicator $\alpha$, the computational latency at time $t$ is estimated as
$$\lambda_{cl}(t) = \sum_{i=1}^{n} \alpha_i \, \frac{y_i}{z_i} \qquad (5)$$
where $y_i$ is the computational complexity of the CNN partition assigned to device $i$ and $z_i$ is the computational capability of device $i$.
Hence, the inference latency $\lambda_{il}$ [65] for the IoT edge cluster network is estimated as
$$\lambda_{il} = \lambda_{cl}(t) \qquad (6)$$
We can envision that the proposed framework works well for real-time intelligent device applications such as smart homes, smart agriculture, intelligent surveillance, and automated vehicles, in which devices of varying capabilities need to cooperate to make an inference. Hence, edge devices such as intelligent devices with a wide range of computational capabilities are considered for processing the input and performing collaborative inference tasks.
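To make the latency model concrete, the following minimal C sketch (not part of the authors' implementation; all device numbers are hypothetical placeholders) evaluates Eq. (5) for a five-device cluster under the assumption that network latency is zero.

```c
#include <stdio.h>

/* Illustrative sketch: estimates the computational latency of Eq. (5)
 * for a cluster of n edge devices.
 * workload[i]   - y_i, computational complexity assigned to device i (GFLOPs)
 * capability[i] - z_i, computational capability of device i (GFLOP/s)
 * assigned[i]   - alpha_i, task-assignment indicator (1 if device i gets a tile)
 * The numbers below are made-up placeholders, not measured values. */
int main(void) {
    const int n = 5;
    double workload[]   = {3.2, 3.2, 3.2, 3.2, 3.2};   /* y_i */
    double capability[] = {2.4, 2.4, 3.0, 2.4, 3.0};   /* z_i */
    int    assigned[]   = {1, 1, 1, 1, 1};             /* alpha_i */

    double lambda_cl = 0.0;
    for (int i = 0; i < n; i++)
        if (assigned[i])
            lambda_cl += workload[i] / capability[i];

    /* Network latency lambda_nl = 0 (no cloud transfer), so lambda = lambda_cl. */
    printf("Estimated computational latency: %.2f s\n", lambda_cl);
    return 0;
}
```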
These devices are deployed with the optimized sub-model of the pre-trained YOLOv2 model to make them intelligent and able to make decisions quickly. Figure 2 presents the proposed optimized pre-trained CNN-based YOLOv2 model for edge devices. The proposed model consists of four modules. The first module involves establishing heterogeneous edge clusters for IoT. The second module applies pruning to the pre-trained model to obtain its optimal sub-model while retaining significant accuracy. As the third step, we load the optimized pre-trained CNN-based YOLOv2 model onto the edge clusters. In the fourth step, fused layer partitioning is applied, and the task is distributed to the IoT edge devices and processed. Lastly, the gateway device combines the results from all the edge devices to perform collective inference and reports the total latency for the inference task. Each module is explained in detail in the following subsections.

B. HETEROGENEOUS EDGE CLUSTERS
Setting up the distributed heterogeneous edge clusters involves forming interconnected, distributed IoT devices. IoT devices are resource-constrained in nature; to reflect limited computing resources, we have selected the Raspberry Pi 3B+, Raspberry Pi 3B, Raspberry Pi 4 and a Neural Compute Stick 2 (NCS2) to form the edge clusters. These devices have different computing capabilities and memory sizes. Furthermore, to speed up the inference task and accelerate processing, the gateway device is plugged with the NCS2. The detailed experimental setup is discussed in Section IV.

C. CONVOLUTIONAL NEURAL NETWORK
This paper focuses on the CNN, a deep learning algorithm comprising an input layer, an output layer, and multiple hidden layers. We have chosen the CNN-based YOLOv2 object detection and classification model, shown in Figure 3. The YOLOv2 network consists of 24 convolutional layers followed by a fully connected feed-forward neural network. The convolutional operations extract features from the image. In our experimental setup, we use an image of size 608 X 608 X 3 (width, height and three RGB channels) as input to YOLOv2. As shown in Table 6, layer 1 of the YOLOv2 detection network uses 32 filters, and hence the convolutional layer computes an output of 608 X 608 X 32. Layer 2 (max pooling) performs a downsampling operation that reduces the image size, and hence the amount of computation.
The YOLOv2 detection network uses 32, 64, 128 and more filters in successive layers. Filters are two- or three-dimensional arrays applied across the input data in a sliding-window fashion using element-wise dot products to produce feature maps. The stride represents the number of pixels shifted during the convolution; a stride of one indicates a movement of one pixel at a time.
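As a rough illustration of how the feature-map dimensions evolve, the following C sketch traces the output sizes of alternating 3 X 3 convolution and 2 X 2 max-pooling layers starting from the 608 X 608 X 3 input. This is a simplified sketch only; the actual YOLOv2 configuration interleaves additional convolutional layers between the pooling stages.

```c
#include <stdio.h>

/* Illustrative trace of early feature-map sizes: 3x3 convolutions with
 * stride 1 and "same" padding keep the spatial size, 2x2 max pooling with
 * stride 2 halves it.  Filter counts follow the 32, 64, 128 pattern. */
int main(void) {
    int w = 608, h = 608, c = 3;
    int filters[] = {32, 64, 128};

    for (int i = 0; i < 3; i++) {
        /* 3x3 convolution, stride 1, pad 1: spatial size unchanged */
        c = filters[i];
        printf("conv %d:    %d x %d x %d\n", i + 1, w, h, c);

        /* 2x2 max pooling, stride 2: halves width and height */
        w /= 2; h /= 2;
        printf("maxpool %d: %d x %d x %d\n", i + 1, w, h, c);
    }
    return 0;
}
```

Running this prints the 608 X 608 X 32 output of layer 1 noted above, followed by the progressively smaller maps produced by downsampling.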
For better accuracy, the YOLOv2 architecture's reorg layer combines features from the middle-level and high-level layers. The 28th route layer uses the 27th and 24th layers to append the features of the previous layers and obtains an output of 19 X 19 X 1024. Three channels (R, G, B) are considered; a momentum of 0.9, a learning rate of 0.001, a stride of 1 and the leaky activation function are used.
The role of the fully connected layer is to make the final decision by executing an activation function on the overall sum of the linear combination of the input.

D. PRUNING
In the literature, several compression techniques [29] such as pruning, quantization, and knowledge distillation have been used to optimize the convolutional neural network-based model without sacrificing accuracy.
Pruning is applied to reduce the network parameters, shrinking the model's size and making it suitable for deployment on resource-constrained IoT devices for faster predictions. Pre-trained models are trained with large-scale datasets, and transfer learning allows such a model to be reused for similar tasks.
Owing to the significant results shown by pre-trained models, transfer learning improves performance by adapting an existing model to other related tasks. Pruning the trained model then yields better results for efficient inference. The standard performance metrics used to evaluate the model after pruning are prediction accuracy, model size in bytes or kilobytes, and computation cost in floating-point operations (FLOPs) or memory utilization. The CNN can be optimized by weight pruning, layer pruning, or filter pruning.
The most effective approach is weight pruning. In our work, the pre-trained YOLOv2 model, which is trained on large-scale datasets, is accelerated by removing unimportant weights through weight pruning. With a smaller number of model parameters, the computation time is reduced and the model performs well with high accuracy.
The formulation of pruning as an optimization problem [18] is presented below. The objective of pruning is to generate an optimized sub-model of the pre-trained model. Here, $W_f^{s*}$ represents the optimized pre-trained model of the source model, and $W_f^{s}$ denotes the pre-trained model with $f$ filters optimized for source $s$.
Consider $D_s = \{X_0, X_1, \ldots, X_n\}$ and $D_t = \{Y_0, Y_1, \ldots, Y_n\}$ as the source data and target data, respectively. $\mathcal{L}(\cdot)$ represents the loss function during the optimization of the network. The parameters $W = \{W_0, W_1, W_2, W_3, \ldots, W_n\}$ and $B = \{B_0, B_1, B_2, B_3, \ldots, B_n\}$ are the weights and biases of the pre-trained model; these are optimized to minimize the loss function with respect to the source data. The pruning percentage, denoted $q\%$, is varied from 20% up to 80% on the source data. The threshold is calculated as a percentile over all the weights, given the pruning percentage. An accuracy of 90% is maintained during the pruning process while producing the optimized sub-model of the pre-trained model.
We define the loss function $\mathcal{L}(D_s, W, B)$ as the loss incurred by a model with weights $W$ and biases $B$ on the source data $D_s$.

FIGURE 3. YOLOv2 architecture with input image size 608 X 608. The CNN-based inference workflow for image classification consists of a series of convolution, max pooling, batch normalization and activation operations for feature-map extraction. Layer 1 computes a 608 X 608 X 32 output for 32 filters, and max pooling reduces the image size to lower the computation cost. In the feature extraction stage, the input image is processed to generate hidden features, followed by a fully connected layer for classification. The feature extractor accounts for a large share of the inference latency and is the bottleneck of CNN inference.

While optimizing the pre-trained model, the loss function $\mathcal{L}(\cdot)$ is used to keep the loss of the optimized model on the source data close to that of the original pre-trained model, subject to the accuracy constraint:
$$\min_{W_{f^*}^{s}} \; \big|\, \mathcal{L}(D_s, W_{f^*}^{s}) - \mathcal{L}(D_s, W_{f}^{s}) \,\big|$$
To generate the optimized sub-model $W_f^{s*}$ with minimal loss with respect to the source data $D_s$, pruning is performed with a pruning ratio of $q\%$. Let $f$ denote the number of filters optimized for the source data set and $f^*$ the number of filters retained after pruning. The optimized pre-trained model of the source model, $W_f^{s*}$, is then obtained as
$$W_f^{s*} = \arg\min_{W_{f^*}^{s}} \mathcal{L}(D_s, W_{f^*}^{s})$$
Algorithm 1 provides the steps for pruning the YOLOv2 pre-trained model using weight pruning.
In Algorithm 1, the weights are pruned to lower the size of the YOLOv2 model while maintaining considerable accuracy and without compromising the original task. The model is pruned iteratively based on the weights and the threshold: if a weight is below the threshold, it is set to zero. The model is then retrained based on the loss function. The optimized sub-model of the pre-trained model is obtained once a minimal loss and significant accuracy are achieved.
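A minimal C sketch of the magnitude-based pruning step is shown below. It illustrates the idea in Algorithm 1 rather than the authors' Darknet implementation: the threshold is taken as the q-th percentile of the absolute weight values, and weights below it are zeroed; the subsequent retraining and fine-tuning against the loss function are not shown, and the weight values are toy placeholders.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* Compare two floats by absolute value (ascending), for qsort. */
static int cmp_abs(const void *a, const void *b) {
    double x = fabs(*(const float *)a), y = fabs(*(const float *)b);
    return (x > y) - (x < y);
}

/* Zero out the q% smallest-magnitude weights in place; returns the threshold. */
float prune_weights(float *w, size_t n, double q_percent) {
    float *sorted = malloc(n * sizeof *sorted);
    if (!sorted) return 0.0f;
    memcpy(sorted, w, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_abs);

    size_t idx = (size_t)(n * q_percent / 100.0);   /* percentile index */
    if (idx >= n) idx = n - 1;
    float threshold = fabsf(sorted[idx]);
    free(sorted);

    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) < threshold)
            w[i] = 0.0f;                            /* prune small weights */
    return threshold;
}

int main(void) {
    float w[] = {0.8f, -0.02f, 0.15f, -0.4f, 0.01f, 0.3f, -0.05f, 0.6f};
    size_t n = sizeof w / sizeof w[0];
    float t = prune_weights(w, n, 50.0);            /* prune 50% of weights */

    printf("threshold = %.3f, pruned weights:", t);
    for (size_t i = 0; i < n; i++) printf(" %.2f", w[i]);
    printf("\n");
    return 0;
}
```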

E. PARTITIONING THE FUSED LAYERS
In this work, we employ fused layer partitioning [63] of the CNN inference model across multiple resource-constrained edge devices. The primary purpose of fused layer partitioning is to reduce inference latency and communication size by distributing the input across multiple devices.
A CNN consists of convolutional layers, pooling layers and activation functions for feature extraction, followed by a classifier. In a CNN, the input to a layer depends purely on the output of the previous layer. The optimized pre-trained model parameters are split into multiple fused partitions that are distributed among multiple devices to reduce latency, as shown in Figure 4. Each partition on each device generates output feature maps, and the partial outputs from all devices are finally collated by the gateway device to form the final output. As a single input is distributed across numerous devices, this method minimizes computation latency and communication size, since communication between the devices happens only during the partitioning at the beginning and the merging at the end.
Consider a CNN with M layers, where each layer takes a set of N feature maps as input. Each convolution operation in layer $m = 1, 2, \ldots, M$ has an input of dimension $W_{m-1} \times H_{m-1}$ (the width and height of the previous layer's feature maps). Hence, to optimize the computational resources, the first 16 layers of YOLOv2 are vertically partitioned into tiles. The corresponding tiles from each layer are then fused vertically to form a single unit. Finally, the input feature-map region for a particular tile is loaded from memory, and the intermediate layers subsequently compute the intermediate values.
The process of fused layer partitioning and distribution is shown in Figure 5, which illustrates fused tile partitioning across two layers. The first layer is partitioned into N sets of 5 X 5 tiles, so only 5 X 5 X N input data are brought from memory. Next, the convolution operation is performed on Layer 1 with a 3 X 3 X N filter, producing a 3 X 3 X M intermediate feature map. This intermediate map forms the input to Layer 2, which convolves it with P filters and produces a 1 X 1 X P output feature map.
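The following C sketch (an illustrative toy example with made-up sizes, not the authors' partitioning code) computes, for each tile of a small grid, the input region an edge device would need when two 3 X 3, stride-1 convolutions are fused. Each fused 3 X 3 layer widens the required input by a one-pixel halo on every side of the tile, which is why neighbouring tiles overlap slightly at their borders.

```c
#include <stdio.h>

/* Fused-tile partitioning sketch: for a grid of output tiles, compute the
 * input region each device must receive so both fused 3x3 convolutions can
 * run locally with no off-chip movement of intermediate feature maps.
 * The 10x10 map and 2x2 grid below are toy values. */

typedef struct { int x0, y0, x1, y1; } region;   /* inclusive pixel bounds */

static int clamp(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

int main(void) {
    const int width = 10, height = 10;  /* input feature-map size */
    const int grid  = 2;                /* 2 x 2 output tiles */
    const int fused_layers = 2;         /* two fused 3x3, stride-1 convolutions */
    const int halo  = fused_layers;     /* one-pixel halo per fused 3x3 layer */

    for (int ty = 0; ty < grid; ty++) {
        for (int tx = 0; tx < grid; tx++) {
            region out = { tx * width / grid, ty * height / grid,
                           (tx + 1) * width / grid - 1, (ty + 1) * height / grid - 1 };
            region in  = { clamp(out.x0 - halo, 0, width - 1),
                           clamp(out.y0 - halo, 0, height - 1),
                           clamp(out.x1 + halo, 0, width - 1),
                           clamp(out.y1 + halo, 0, height - 1) };
            printf("tile (%d,%d): output [%d..%d]x[%d..%d] needs input [%d..%d]x[%d..%d]\n",
                   tx, ty, out.x0, out.x1, out.y0, out.y1,
                   in.x0, in.x1, in.y0, in.y1);
        }
    }
    return 0;
}
```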

IV. EXPERIMENTAL SETUP
The proposed system has been evaluated with the well-known CNN-based YOLOv2 model. An optimized sub-model of the pre-trained model is obtained after pruning. The parameters of the optimized model are fed into the framework, and the fused layer partitioning and distribution modules distribute the task to multiple IoT devices for faster predictions. These modules are implemented in C, and the C-based Darknet neural network library is used as the CNN inference engine. As IoT devices are resource-constrained in terms of processing capability, Darknet with NNPACK has been used to accelerate CNN performance on the IoT devices. The experimental setup is shown in Figure 6. TCP/IP socket APIs are used for communication in the distributed edge clusters between the host, the gateway device, and the edge devices. The optimized sub-model of the pre-trained model is deployed on a set of IoT devices. To mimic a real-world scenario, which usually involves resource-constrained IoT devices, we have opted for five Raspberry Pi devices as edge devices, one gateway device accelerated by an NCS2, and a host machine. The edge devices used for evaluation include Raspberry Pi 3 Model B+ boards with 1 GB RAM. The following assumptions are made:
• The input data is an image of any size.
• Heterogeneity of the IoT devices is captured by varying processor speeds and RAM sizes.
• A single input data frame is processed by multiple devices.
• Each device has sufficient memory to load the trained weights and perform its task.

FIGURE 6. Hardware experimental setup of the heterogeneous distributed collaborative IoT clusters, consisting of five Raspberry Pis as edge devices, one hub, and one Raspberry Pi plugged with an NCS2 as the gateway device. We use an input image of size 608 X 608.

The experiment tests the impact of varying the number of devices, from one up to five, on communication size and latency for the distributed heterogeneous edge clusters. Table 8 lists the different heterogeneous configurations considered for the experiment. In this experiment, we have focused on heterogeneous IoT infrastructure to understand the inconsistency and its impact on inference speed. The experiment splits the fused layers into 3 X 3, 4 X 4 and 5 X 5 partitions over the first 16 layers of YOLOv2. We started the evaluation with the host device, the gateway device, and a single edge device, and noted the communication size and inference latency required for a single data frame. Similarly, we evaluated the model by increasing the number of devices for each partition and observed the output.
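As a rough sketch of the gateway's collation step over TCP/IP sockets, a gateway process might gather the partial results as shown below. This is illustrative only: the port number, buffer size, and the element-wise accumulation used as the "merge" are placeholders (the actual system stitches tile outputs produced by Darknet), and same-endian machines are assumed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_EDGE_DEVICES 5      /* five Raspberry Pi edge devices */
#define PARTIAL_FLOATS   1024   /* placeholder size of one partial result */
#define PORT             9000   /* placeholder port */

int main(void) {
    int server = socket(AF_INET, SOCK_STREAM, 0);
    if (server < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(server, (struct sockaddr *)&addr, sizeof addr) < 0) { perror("bind"); return 1; }
    listen(server, NUM_EDGE_DEVICES);

    float merged[PARTIAL_FLOATS] = {0};
    float partial[PARTIAL_FLOATS];

    for (int d = 0; d < NUM_EDGE_DEVICES; d++) {
        int conn = accept(server, NULL, NULL);
        if (conn < 0) { perror("accept"); return 1; }

        /* Read exactly one partial feature-map buffer from this edge device. */
        size_t want = sizeof partial, got = 0;
        while (got < want) {
            ssize_t r = read(conn, (char *)partial + got, want - got);
            if (r <= 0) { perror("read"); close(conn); return 1; }
            got += (size_t)r;
        }
        close(conn);

        for (int i = 0; i < PARTIAL_FLOATS; i++)
            merged[i] += partial[i];     /* placeholder for tile stitching */
        printf("merged partial result from device %d\n", d + 1);
    }
    close(server);
    printf("all %d partial results merged\n", NUM_EDGE_DEVICES);
    return 0;
}
```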
The communication size depends on the number of partitions and on the communication of input and output data. Distributed IoT edge clusters involve communication between the partitioned layers; the proposed system produces better results for finer partitioning, as shown in Figure 7.
We measured the communication size while varying the number of partitions from 3 X 3 to 5 X 5 and increasing the number of devices. Increasing the number of edge devices improves performance, but it also affects the execution time: as more devices are involved in the computation, the processing time is reduced, while the communication size increases slightly. The proposed model achieved an accuracy of 92% with the number of edge devices varying from one to five Raspberry Pi devices.
Figure 8 shows the inference latency for the different partitions. As the number of devices increases, the computation is shared by more devices, which reduces the execution time. The inference latency is compared with DeepThings [36] and MoDNN [37] on a single data frame. Our proposed model achieves lower inference latency for 3 X 3, 4 X 4 and 5 X 5 partitioning due to the reduced computation of the optimized pre-trained model. In MoDNN, the computation time is slightly higher as the number of devices increases, owing to centralized data distribution, layer-wise synchronization, and processing. In DeepThings, the communication size and inference latency are larger because partitioning and processing are performed without optimizing the model.
As we have optimized the model before deploying it onto the IoT devices, the communication size and inference latency are reduced owing to the smaller number of parameters. As a result, the proposed model achieves a communication size of 8.56MB∼9.59MB and an inference latency of 5sec∼7sec for 3 X 3 to 5 X 5 fused layer partitioning with up to five devices.

V. RESULTS AND DISCUSSIONS
To evaluate and validate the proposed system, we have compared the results with a few existing models similar to our setup. Table 8 describes the experimental setup, and Table 9 summarizes the related work and the techniques adopted in recent research. In the comparison, the inference latency is measured for a single data frame as input. DeepThings [36] fuses a few convolutional layers to reduce the memory footprint, while MoDNN [37] distributes layer computation across devices with centralized data distribution and layer-wise synchronization. Compared to these two frameworks, our proposed work partitions and distributes the fused layers of the optimized pre-trained model and obtains a communication size of 8.56MB∼9.59MB and an inference latency of 5sec∼7sec for 3 X 3 to 5 X 5 fused layer partitioning with five devices. The proposed framework of optimized heterogeneous edge clusters enables CNN inference on multiple heterogeneous devices; hence it can be used in applications such as smart homes, smart cities, smart healthcare, intelligent traffic signals, autonomous vehicles, intelligent drones, and computer vision applications wherein deep learning models are deployed on edge devices such as cameras, mobiles or other intelligent devices.
Consider a case study to envision the adoption of the proposed framework for smart home applications. A smart home contains multiple intelligent cameras in various places to capture images. Rather than sending the images to the cloud, the proposed framework performs on-device computation in real time. This framework not only reduces communication overheads but can also provide faster inference even for distributed and heterogeneous devices, sharing the work among the other smart cameras to process the image and reducing the latency in real-time applications.

FIGURE 8. Impact of inference latency for 3 X 3, 4 X 4 and 5 X 5 partitioning. In the case of MoDNN, the communication overhead increases due to centralized data distribution and process synchronization; latency decreases up to three devices and increases thereafter. In the case of DeepThings, which adaptively exploits the available communication bandwidth, latency for finer partitioning is higher up to three devices due to the overlapped partitioning and saturates thereafter. Our proposed model performs best because it does not depend on layer-wise synchronization and uses the optimized pre-trained model.

VI. CONCLUSION AND FUTURE WORK
From our brief review in the background section, we learnt that while focusing on Low latency, we face many demands and challenges. For example, the demand for increased bandwidth can be addressed using mmWave, while parallel and coded computing can address the demand for computing power and task dependency. Furthermore, shifting the computing to edge devices and making them intelligent (capable of machine learning) will reduce propagation delay and prediction delay, respectively. Finally, proactive computing is a promising approach towards reducing propagation delay and making the system energy efficient. In our work, we have shifted the computing to the edge, implemented DNN based ML on the edge devices while parallelizing (partitioning) the task and distributing among the edge clusters to minimize propagation delay, prediction delay and power requirement.
In this paper, we have examined a hardware-based prototype and a software framework to optimize the pre-trained model, and have designed a lightweight, optimized sub-model of the pre-trained CNN-based YOLOv2 model. The optimized sub-model is partitioned, and inference is distributed among multiple resource-constrained edge devices and IoT gateway devices. The initial layers of a CNN contribute most to the overall communication size and inference latency; hence the first 16 layers of the optimized CNN are split into multiple stacks of executable tasks and assigned to multiple IoT devices. The proposed framework obtains the optimized sub-model of the pre-trained model and then splits and distributes the CNN parameters onto multiple heterogeneous IoT devices. In this process, each partition produces a set of partial outputs, which are finally collated by the gateway device to produce the final output. Hence, for a single input, this approach achieves lower inference latency.
To recreate a realistic scenario of clusters of heterogeneous devices (i.e. resource-constrained IoT devices) in the lab, our model is demonstrated by deploying it on five Raspberry Pi boards with different core frequencies for real-time on-device inference. The evaluation results show that our proposed optimized model achieves significant improvement: a minimum communication size of 8.56MB∼9.59MB, reducing the communication size by ∼14.4%, and an inference latency of 5sec∼7sec, a reduction of ∼16% compared to DeepThings for 3 X 3 to 5 X 5 fused layer partitioning with five devices. Thus, our model outperforms DeepThings and MoDNN with improved inference latency while maintaining significant accuracy and minimal communication size for the popular YOLOv2 CNN model.
We further propose to explore federated learning and channel intermittency to achieve reliability and scalability in the proposed work. In the future, we intend to define a hardware-software ecosystem capable of hardware-aware hyperparameter tuning while supporting DNN containerization and fault tolerance. In addition, future researchers can explore the potential advantages of neuromorphic computing, such as in-memory computing and event-based spiking neural networks. This framework is suitable for smart home and healthcare applications; however, processing the various kinds of data generated by the devices, such as audio, video, or sensor readings, poses a hurdle. Furthermore, IoT devices running on different operating systems introduce further research challenges.