I. Introduction
Over the past few decades, deep learning-based techniques have drawn remarkable attention due to their outstanding performance in diverse tasks such as surveillance systems [2], [3], autonomous driving [4], [5], real-time computing in edge devices [6], [11], and so on. These complex architectures demand devices with high computational resources because they contain millions of parameters [7]. This requirement limits the applicability of deep learning-based methods to IoT, edge, and mobile devices with low computational resources [8]. To address this limitation, a number of lightweight architectures have been proposed [9]–[11]. However, these lightweight architectures are less reliable in terms of performance compared to heavy and complex architectures. Moreover, numerous domain- and task-oriented learning paradigms, such as distributed learning [12], [13], federated learning [14], [15], decentralized learning [16], [17], etc., have been proposed to broaden the scope of deep learning in edge computing, IoT, microservices, and so on [18], [19]. Although these methods reduce latency and the required computational resources, there is still a demand to minimize execution and response time in real-time applications. To the best of our knowledge, no prior work focuses on reducing the end-to-end sequential execution time of a network by parallelizing it through block-wise dissection.