I. Introduction
Since convolutional neural networks were first proposed, the field of deep learning has grown rapidly. Research has focused on defining new architectures such as ResNet [5], SqueezeNet [1], VGG16, and VGG19 [2]; developing new optimization techniques and training methodologies; and introducing non-linearities such as ReLU and ELU to combat optimization problems. These developments have substantially aided the training of deeper neural networks for image recognition and object detection challenges. The increase in network depth has also accelerated the development of hardware architectures such as graphics processing units, tensor processing units, and large-scale distributed deep networks [3], which use parallel architectures and multiple computation units. Boards such as the S32V234 and Bluebox 2.0 (embedded systems) by NXP, along with NVIDIA's TITAN, TESLA, and GTX 1080 (GPUs) and Jetson TK1 (embedded system), are widely used for deploying deep convolutional neural networks and accelerating their training.