I. Introduction
In the last few years, there has been growing attention toward learning problems framed as continual or lifelong [1]. Even though many recent approaches exist in this direction, this setting remains extremely challenging. Applications well-suited for continual learning have access to a continuous stream of data, where an artificial agent is not only expected to use the data to make predictions, but also to adapt to changes in the environment, e.g., videos, streams of text, etc. [2], [3]. In the case of neural networks, the most challenging context is one in which a simple online update of the weights is applied at each time instant, given the information from the last received sample [4].
Despite the availability of powerful computational resources for continual-learning-based applications, current algorithmic solutions have not been paired with the development of software libraries designed to speed up computations. In fact, the most common approach is to store and process portions of the streamed data in a batch-like fashion, reusing classic non-continual learning tools; the artificial nature of this workaround is striking. Motivated by the intuitions behind existing libraries for batched data [5] and by approaches that rethink the neural network computational scheme, making it local both in time and along the network architecture [6]–[9], we propose a different approach to pipeline parallelism specifically built for data sequentially streamed over time, in which multiple devices work in parallel to speed up computations. Considering D independent devices, such as D GPUs, the computational time of a feed-forward deep network empowered by our approach is theoretically reduced by a factor of 1/D. We experimentally show that the overheads due to data transfer among different devices are constant with respect to D in certain hardware configurations. On the other hand, the higher throughput obtained by pipeline parallelism comes with a delay, proportional to D, between the forward wave and the backward wave as they propagate through the network; this feature is not critical in applications in which data samples are non-i.i.d. and evolve smoothly over time.
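To make the origin of the 1/D factor and of the O(D) forward/backward delay concrete, the following is a minimal, purely illustrative Python sketch, not the library proposed in this work: it assumes the network is split into D contiguous stages, one per device, each with unit per-sample cost, and simulates the resulting pipeline schedule over the stream; all names in it (pipeline_schedule, device/sample labels) are hypothetical.

# A minimal, purely illustrative sketch (not the library proposed in this
# paper): the network is assumed to be split into D contiguous stages, one
# per device, each with unit per-sample cost.

def pipeline_schedule(num_devices: int, num_samples: int):
    """For each time step, report which stream sample each device processes.

    At step t, device d runs the forward pass of sample t - d, so the
    forward wave reaches the last device with a delay of num_devices - 1
    steps; the backward wave travels back with a comparable delay, hence
    the O(D) lag between forward and backward waves mentioned above.
    """
    schedule = []
    for t in range(num_samples + num_devices - 1):
        step = {}
        for d in range(num_devices):
            s = t - d
            if 0 <= s < num_samples:
                step["device_%d" % d] = "sample_%d" % s
        schedule.append(step)
    return schedule


if __name__ == "__main__":
    D = 4
    for t, step in enumerate(pipeline_schedule(D, 8)):
        print("t=%d:" % t, step)
    # Once the pipeline is full, all D devices are busy at every step: one
    # sample completes per step instead of one every D steps, i.e., the
    # per-sample computational time is reduced by a factor of 1/D
    # (neglecting inter-device data-transfer overheads). The price is that
    # gradients for the earliest stages refer to samples seen up to O(D)
    # steps earlier, which is tolerable when the stream evolves smoothly.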