p-im2col: Simple Yet Efficient Convolution Algorithm with Flexibly Controlled Memory Overhead

Convolution is the most time-consuming operation in modern deep artificial neural networks, so its performance is crucial for fast inference. One of the standard approaches to fast convolution computation is to use GeMM-based convolution algorithms relying on efficient general matrix multiplication (GeMM) from optimized BLAS libraries. However, commonly used GeMM-based algorithms may cause significant memory overhead or avoid it only at the cost of worse performance. In this paper, we propose a novel convolution algorithm, p-im2col, based on a well-known im2col algorithm that avoids memory overhead by splitting a single multiplication of a large matrix into several multiplications of smaller matrices. We theoretically and experimentally compare our algorithm with two other GeMM-based algorithms: im2col, which is widely used as a baseline, and the memory-efficient kn2row-aa. We measure the inference time of these algorithms on central processing units of x86, x86_64, ARM, and MIPS architectures for a large set of convolutional parameters. The proposed algorithm demonstrates a speedup over im2col and kn2row-aa in a number of cases and a significant reduction in additional memory requirements compared to im2col. Based on our experiments, we present a new convolution algorithm selection scheme that considers memory restrictions, CPU architecture, and convolutional parameters and provides a noticeable advantage over each particular algorithm.


I. INTRODUCTION
Convolutional neural networks (CNNs) remain a major tool for pattern recognition, segmentation, object detection and many other tasks [1]-[5]. The most resource-consuming operation in such networks is the tensor convolution in convolutional layers, so its efficiency becomes crucial in actual applications. Efficiency is especially important in real-time recognition systems [6]-[10] that process large amounts of data, Internet of Things (IoT) environments [11], [12] and embedded systems [13]-[15].
In practical applications, it is essential to consider the memory, energy consumption, and computational power. Energy efficiency is important for autonomous systems ranging from smartphones to unmanned vehicles [16], [17]. Moreover, recently, the amount of energy required to solve deep learning problems has also raised concerns [18]-[20]. Low memory consumption improves inference time because high data locality provides fast access and reduces energy costs, as unused memory banks can be placed in low-power mode.
At present, memory and computationally efficient implementation of CNNs is crucial for on-device artificial intelligence systems and can significantly enhance their application domain. For example, it can help to utilize neural networks for visual geolocation of unmanned aerial vehicles [21] or save energy for user equipment in edge computing networks [22] or for other purposes.
CNNs can be executed on a wide range of computing devices: central processing units (CPUs), graphics processing units (GPUs), tensor processing units (TPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). They are usually trained on either GPUs or TPUs because those devices allow for fast computations with high parallelism. However, after training, CNNs may be used in any computing environment. Since all kinds of devices have fundamentally different architectures, efficient algorithms for linear algebra operations (for example, matrix multiplication, which is actively used in CNNs) can be noticeably different and are developed for each of them. For example, CPUs are better suited for sequential calculations and are very sensitive to memory locality, so one should attempt to reduce the number of memory cache misses [23]. GPUs contain many execution units that can simultaneously perform the same operation over a data set, so one should prepare the inputs for a highly parallel system, as well as meet memory limitations [24]. TPUs are similar to GPUs in a number of aspects, as they provide high concurrency for computations but are specialized for machine learning tasks. However, they also strictly constrain memory usage [25]. Unlike all previously discussed devices, FPGAs and ASICs have an unlimited number of execution units as long as the logic gate number and chip area limitations are satisfied; energy consumption is also important for those devices [26]. It is clear that CNN inference algorithms should also vary to reach the full potential of a specific device.
Note that despite the availability of a variety of hardware, practical applications, especially in mobile, edge, and IoT computing often use CPUs due to their versatility and convenience for engineers and software developers. This is why this work focuses on CPU inference of CNNs.
The most computationally challenging aspect of CNNs concerns the convolutional layers. This is why efficient implementation of these layers is extremely important. The basic convolution is illustrated in Fig. 1. For a 3D input image I and a 4D kernel F, the following operation is performed to obtain a 3D output image R:

R(y, x, c_o) = Σ_{Δy=0..K_h−1} Σ_{Δx=0..K_w−1} Σ_{c_i=0..C_in−1} I(y + Δy, x + Δx, c_i) · F(c_i, c_o, Δy, Δx),
y = 0, ..., H_out − 1, x = 0, ..., W_out − 1, c_o = 0, ..., C_out − 1, (1)

where H × W × C_in defines the input image size, C_in × C_out × K_h × K_w defines the kernel size and H_out × W_out × C_out defines the output image size.
State-of-the-art deep CNNs widely utilize convolutions. We present the number of multiply-accumulate operations in convolutional layers and the total number of parameters for some network architectures in Table 1. As those numbers are high, methods for reducing the convolutional layer complexity are actively under development. For example, we may decrease the number of model parameters using pruning [27]-[30] and knowledge distillation [31], [32]. Additionally, we can approximate the tensor convolution operation with more computationally efficient separable convolutions [33], [34] or eliminate multiplication operations by replacing the covariance with the L1 norm [35], [36] or a morphological approximation [37]-[39]. However, such approximations are usually difficult to train. They may also reduce the accuracy of model predictions. Another method is to replace the convolution of floating-point tensors with their quantized approximation. In this way, we not only reduce the inference time of a network but also decrease the memory consumption with a reasonable change in accuracy. There are several widely used frameworks that conduct 8-bit quantized convolution and matrix multiplication, such as Google's gemmlowp [40] and Facebook's QNNPACK [41]. However, all these methods still require tensor convolution.
Many works [47]-[50] show that direct implementation of tensor convolution is inefficient because it is not cache-friendly (it has poor memory locality). The common practice is to transform convolution to matrix multiplication [47] using the im2col algorithm (described in detail in Section II-A), which unrolls the input tensor into a matrix, each row of which represents a single filter application. This allows us to fully exploit general matrix multiplication (GeMM) algorithms from basic linear algebra subprogram (BLAS) libraries. These libraries provide device-specific, highly efficient implementations and regular updates.
When using them, we do not need to change a significant amount of code when moving from one device to another. One drawback of the im2col algorithm is the consumption of a vast amount of memory used to store the unrolled tensor. Several other GeMM-based convolution algorithms were presented in [48]. Some of those algorithms, such as the kn2row-aa algorithm, use significantly less memory. Instead of computing a single convolution with a K h × K w kernel, the algorithm computes K h K w convolutions with 1 × 1 kernels as matrix multiplications and accumulates their results with shift (described in detail in Section II-B). Although this algorithm does not allow for efficient computation of convolutions with nonunit stride, it allows a significant reduction in memory and preserves approximately the same inference time on Intel Core i7 and ARM Cortex A57 processors [48].
Note that the set of efficient tensor convolution algorithms is not limited to GeMM-based convolution. One can attempt to implement a cache-friendly direct convolution algorithm by dynamically compiling convolution microkernels best suited to a specific neural network [49] or replace the intermediate buffer of values (of an unrolled image) with a smaller buffer of pointers called an "indirection buffer" [50]. However, all these algorithms require a vast amount of device-specific code and optimizations.
In this paper, we provide a novel algorithm for tensor convolution and compare it to state-of-the-art GeMM-based algorithms in terms of inference speed and memory efficiency. We propose a simple yet effective modification of the im2col algorithm with flexibly controlled memory overhead that preserves computational efficiency over a wide range of parameters. As a modification of im2col, the proposed method does not impose additional constraints, such as the unit stride required by the kn2row-aa algorithm. Additionally, it is easy to implement, does not depend greatly on the type (as long as the underlying GeMM supports matrix multiplication for this type) or structure of the data and can be adapted to quantized and decomposed convolutions. We provide a theoretical comparison of the memory consumption of the proposed algorithm with those of the original im2col and kn2row-aa and then experimentally measure the inference time on a wide variety of devices. In this way, we show that our algorithm is well suited for CNN implementations.
The main contributions of our work are as follows: • we propose p-im2col, a new GeMM-based CPU-targeted convolution algorithm with flexibly controlled memory overhead; • we provide experimental inference time measurements of the p-im2col and state-of-the-art im2col and kn2row-aa algorithms on a wide range of devices with different CPU architectures; • since the results vary for different sets of convolution parameters, we introduce a comparison method for convolution algorithms; and • we suggest an empirical rule for the selection of the fastest GeMM-based convolution algorithm (of the three under consideration).
The remainder of this paper is outlined as follows. In Section II, we describe the im2col and kn2row-aa GeMM-based convolution algorithms and provide our own p-im2col algorithm. In Section III, we theoretically compare these three algorithms in terms of memory and computational efficiency.
In Section IV, we present the experimentally measured inference time of these algorithms on multiple devices, analyze the results of the experiments and recommend how to choose the best algorithm for specific parameters of the convolution. Finally, in Section V, we conclude our work.

II. GEMM-BASED CONVOLUTION
We presented the idea of convolution in (1); furthermore, state-of-the-art convolutional layers also support stride and dilation parameters. In general, the convolution of an image I of size H × W × C_in and a kernel F of size C_in × C_out × K_h × K_w with a result R of size H_out × W_out × C_out can be written as:

R(y, x, c_o) = Σ_{Δy=0..K_h−1} Σ_{Δx=0..K_w−1} Σ_{c_i=0..C_in−1} I(v, u, c_i) · F(c_i, c_o, Δy, Δx), (2)

where:

v = s_y · y + d_y · Δy, u = s_x · x + d_x · Δx. (3)

Here, (s_x, s_y) is a stride, and (d_x, d_y) is a dilation. To achieve a certain output size after convolution (for example, to preserve the image size), one can extend the input image with zeroes or other values. This technique is called padding. As (2) shows, the direct convolution computation contains 6 nested loops. It has poor memory locality: values of the input image are not loaded in the order they are stored in memory, so the CPU cannot take advantage of prefetching the values via the hierarchical cache system.
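For illustration, the six-loop direct convolution of (2)-(3) can be sketched in NumPy as follows. This is a naive reference implementation (no padding; the function name and tensor layouts follow the notation above), not an optimized routine:

```python
import numpy as np

def direct_conv(image, kernel, stride=(1, 1), dilation=(1, 1)):
    """Naive six-loop convolution. image: H x W x C_in,
    kernel: C_in x C_out x K_h x K_w (layout as in the paper)."""
    H, W, C_in = image.shape
    _, C_out, K_h, K_w = kernel.shape
    s_y, s_x = stride
    d_y, d_x = dilation
    H_out = (H - d_y * (K_h - 1) - 1) // s_y + 1
    W_out = (W - d_x * (K_w - 1) - 1) // s_x + 1
    out = np.zeros((H_out, W_out, C_out))
    for y in range(H_out):
        for x in range(W_out):
            for co in range(C_out):
                acc = 0.0
                for dy in range(K_h):
                    for dx in range(K_w):
                        v = y * s_y + dy * d_y   # eq. (3)
                        u = x * s_x + dx * d_x
                        for ci in range(C_in):
                            acc += image[v, u, ci] * kernel[ci, co, dy, dx]
                out[y, x, co] = acc
    return out
```

The loop nest reads `image[v, u, ci]` in an order unrelated to its memory layout, which is exactly the poor locality discussed above.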
To address this problem, GeMM-based convolution algorithms transform input images and kernels into 2D matrices and then perform matrix multiplication or several matrix multiplications and accumulate their results. In this way, we transfer the responsibility of performing cache-friendly computations to the GeMM function in the BLAS library.
Before we describe details of specific GeMM-based convolutions, let us note some important facts concerning GeMM itself. In most modern BLAS libraries, GeMM is inspired by [23]. These libraries split the left and right matrices into small blocks so one of them can fit in the L1 cache and the other can fit in the L2 cache. They also reorder the values in these blocks so that these values can be loaded continuously when computing the matrix multiplication of a small piece at a time. We will refer to this piece of the result matrix and the function that computes it as a matrix multiplication microkernel. A matrix multiplication microkernel is usually sufficiently small to store all accumulated values in CPU common or SIMD registers. Now that we have a general idea of GeMM implementation, let us describe GeMM-based convolution algorithms.
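The blocking scheme above can be illustrated with a toy GEMM. This is only a structural sketch of cache-level tiling (the block sizes `mc`, `kc`, `nr` and the packing calls stand in for what a real BLAS does with hand-tuned microkernels), not how any particular library is implemented:

```python
import numpy as np

def blocked_gemm(A, B, mc=64, kc=64, nr=4):
    """Illustrative blocked GEMM: process the matrices in small
    packed panels so the innermost update streams over contiguous
    data, mimicking the microkernel structure of BLAS libraries."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for k0 in range(0, K, kc):
        k1 = min(k0 + kc, K)
        for m0 in range(0, M, mc):
            m1 = min(m0 + mc, M)
            # pack a block of A (in a real BLAS, sized for a cache level)
            A_pack = np.ascontiguousarray(A[m0:m1, k0:k1])
            for n0 in range(0, N, nr):
                n1 = min(n0 + nr, N)
                # pack a narrow panel of B for the microkernel
                B_pack = np.ascontiguousarray(B[k0:k1, n0:n1])
                # "microkernel": small C tile updated at a time
                C[m0:m1, n0:n1] += A_pack @ B_pack
    return C
```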

A. IM2COL ALGORITHM
im2col is one of the most commonly used and easy-to-implement algorithms of GeMM-based convolution. It is often used in the literature as a baseline to demonstrate an increase in computation, memory or power efficiency.
In this algorithm, for each application of the convolution kernel, we copy values of the input image and save them in a single column (if the image matrix is the right matrix in matrix multiplication) or in a single row (if it is the left matrix). We will consider the unwrapped image to be the left matrix and call it the im2col buffer, and the kernel tensor represented as a matrix will be the right matrix (see Fig. 2). Let us denote the im2col buffer as B and the transformed filter matrix as G. Then, the im2col transformation can be written as:

B(x + y·W_out, c_i + (Δu + Δv·K_w)·C_in) = I(v, u, c_i),
G(c_i + (Δu + Δv·K_w)·C_in, c_o) = F(c_i, c_o, Δv, Δu),
y = 0, ..., H_out − 1, x = 0, ..., W_out − 1, c_o = 0, ..., C_out − 1, c_i = 0, ..., C_in − 1, Δv = 0, ..., K_h − 1, Δu = 0, ..., K_w − 1,

where u and v are defined by (3).
Then, we can compute the matrix multiplication:

R' = B · G.

The convolution result R is obtained from R' by a simple unrolling:

R(y, x, c_o) = R'(x + y·W_out, c_o).

The height of the im2col buffer equals the size of the output image (H_out × W_out), while its width is C_in·K_h·K_w. The buffer requires a vast amount of memory, and initialization of the im2col buffer from the image takes some time. This is why one may want to use a more computationally and memory-efficient convolution algorithm.
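The im2col transformation and the single GeMM call can be sketched as follows. This minimal NumPy version assumes unit stride, unit dilation and no padding, and follows the index layouts of B and G given above:

```python
import numpy as np

def im2col_conv(image, kernel):
    """im2col convolution: unroll the image into buffer B, reshape
    the kernel into matrix G, then compute the whole result with a
    single matrix multiplication."""
    H, W, C_in = image.shape
    _, C_out, K_h, K_w = kernel.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1
    # im2col buffer: one row per kernel application, with column
    # layout c_i + (du + dv*K_w)*C_in as in the B(...) formula
    B = np.empty((H_out * W_out, C_in * K_h * K_w))
    for y in range(H_out):
        for x in range(W_out):
            # C-order flattening of a K_h x K_w x C_in patch matches
            # the required layout (c_i fastest, then du, then dv)
            B[x + y * W_out] = image[y:y + K_h, x:x + K_w, :].reshape(-1)
    # kernel as matrix G with the matching row layout
    G = kernel.transpose(2, 3, 0, 1).reshape(-1, C_out)
    return (B @ G).reshape(H_out, W_out, C_out)  # single GeMM call
```

Note that B has H_out·W_out rows of width C_in·K_h·K_w, which is exactly the memory overhead discussed above.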

B. KN2ROW-AA ALGORITHM
This algorithm is the most memory-efficient GeMM-based convolution algorithm presented in [48]. Instead of computing the convolution of an H × W × C_in image I with a C_in × C_out × K_h × K_w kernel F, it computes K_h·K_w convolutions of the same image with C_in × C_out × 1 × 1 kernels. This approach assumes convolution with unit stride. Each of these convolutions equals the matrix multiplication of the image represented as the HW × C_in matrix J by the kernel represented as the C_in × C_out matrix Q (we consider unit dilation for simplicity):

J(u + v·W, c_i) = I(v, u, c_i), Q_{Δv,Δu}(c_i, c_o) = F(c_i, c_o, Δv, Δu).

The results of these multiplications are accumulated in one buffer with a shift. We will call this buffer the accumulator A. Using this notation, we can present the kn2row-aa algorithm as follows:

A(u + v·W + σ(Δv, Δu), c_o) += (J · Q_{Δv,Δu})(u + v·W, c_o),

where:

σ(Δv, Δu) = W·(K_h − 1 − Δv) + K_w − 1 − Δu.

The convolution result R can be obtained as:

R(y, x, c_o) = A(x + y·W + W·(K_h − 1) + K_w − 1, c_o),
y = 0, ..., H_out − 1, x = 0, ..., W_out − 1, c_o = 0, ..., C_out − 1.
If the input image is zero-padded to preserve the size of the image during the convolution (which was considered in [48]), then we need to zero out some values of the input image before each multiplication to avoid errors on the boundaries. However, if we do not use padding, then the correct result is already computed in the accumulator, but there are also some excess values (shown in gray in Fig. 3). In this algorithm, we only need to store the "tails" of the accumulator and the zeroed-out values of an image if we use padding or the excess values in the accumulator if we do not. Asymptotically, it requires O((H + W)KC) additional memory, where K is the linear size of the kernel and C is the number of channels in the input or output, which is significantly less than the im2col value of O(HW·K^2·C). Furthermore, while im2col suffers from a large memory overhead, kn2row-aa cannot reach the full GeMM efficiency potential because it computes many multiplications with a smaller depth (C_in instead of C_in·K_h·K_w). kn2row-aa also produces excess values if zero-padding is not used, so one may need to reorder the results after computation. Finally, it does not allow for efficient computation of a convolution with nonunit stride.
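The shift-and-accumulate idea can be sketched as follows. This is an illustrative NumPy reconstruction of the no-padding, unit-stride case described above (the offset σ keeps all accumulator indices nonnegative), not the implementation from [48]:

```python
import numpy as np

def kn2row_aa_conv(image, kernel):
    """kn2row-aa sketch: K_h*K_w 1x1 convolutions computed as GeMMs
    and accumulated into a shifted buffer A; unit stride/dilation,
    no padding."""
    H, W, C_in = image.shape
    _, C_out, K_h, K_w = kernel.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1
    J = image.reshape(H * W, C_in)          # image as HW x C_in matrix
    base = W * (K_h - 1) + K_w - 1          # largest shift, keeps indices >= 0
    A = np.zeros((H * W + base, C_out))     # accumulator with "tails"
    for dv in range(K_h):
        for du in range(K_w):
            Q = kernel[:, :, dv, du]        # C_in x C_out 1x1 kernel
            shift = base - (dv * W + du)    # shift-accumulate the result
            A[shift:shift + H * W] += J @ Q
    # extract valid outputs, discarding the excess values
    out = np.empty((H_out, W_out, C_out))
    for y in range(H_out):
        for x in range(W_out):
            out[y, x] = A[x + y * W + base]
    return out
```

Each of the K_h·K_w GeMMs here has depth C_in only, which is the efficiency drawback noted above.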

C. PROPOSED P-IM2COL ALGORITHM
In practice, we often need an algorithm with a flexible memory/efficiency balance: a variant that is more memory-efficient than the im2col algorithm but free of the constraints of the kn2row-aa algorithm. This is why we propose our easy-to-implement modification of the im2col algorithm, which we call p-im2col.
Matrix multiplication is separable along the rows of the left matrix or the columns of the right matrix, so we do not need the entire large im2col buffer to compute several rows of the output. We can compute only P rows of the im2col buffer at a time and use them to produce P rows of the output. The result of the matrix multiplication has H_out·W_out rows, so we need ⌈H_out·W_out / P⌉ iterations to compute it. In this way, we need to allocate memory for only P·C_in·K_h·K_w values. We call this the left buffer (see Fig. 4). The algorithm is presented below (see Algorithm 1).

FIGURE 4: Proposed algorithm.
Algorithm 1: p-im2col convolution.
Input: image, a 3D tensor with shape in_height × in_width × in_channels; kernel, a 4D convolution kernel with shape out_channels × in_channels × kernel_height × kernel_width, represented as an (in_channels · kernel_height · kernel_width) × out_channels matrix; parameters, the convolution parameters.
Output: output, a 3D tensor with shape out_height × out_width × out_channels, represented as an (out_height · out_width) × out_channels matrix.

Choosing a small value of P provides a significant memory reduction compared to the im2col algorithm. Now, let us consider computational efficiency. If parameter P is a multiple of the size of the matrix multiplication microkernel, then there is no difference in the behavior of this microkernel between the proposed algorithm and im2col: the same values are multiplied and accumulated in the same order. However, there is a slight overhead: each time the matrix multiplication function is called, BLAS needs to allocate some memory to store the reordered values for the microkernel. On the other hand, the smaller the left matrix is, the more cache-friendly this reordering will be.
Therefore, parameter P needs to be 1) a multiple of the multiplication microkernel height, 2) sufficiently small to reduce the memory overhead, and 3) sufficiently large to not increase the inference time due to the number of GeMM function calls. These are general ideas regarding the value of parameter P , but a specific optimal value depends on many things, such as the microkernel code, CPU preloading strategy, and CPU cache sizes, which are difficult to estimate. This is why we experimentally investigate the dependence of the inference time of our convolution algorithm on the value of P on a wide range of devices and report the obtained values in Section IV.
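As a minimal illustration of Algorithm 1 (not the paper's BLAS-backed implementation), the p-im2col loop over P-row chunks can be sketched in NumPy, assuming unit stride and dilation and no padding:

```python
import numpy as np

def p_im2col_conv(image, kernel, P=4):
    """p-im2col sketch: fill only P rows of the im2col buffer at a
    time, so the left buffer holds just P * C_in * K_h * K_w values
    instead of the full unrolled image."""
    H, W, C_in = image.shape
    _, C_out, K_h, K_w = kernel.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1
    n_rows = H_out * W_out
    G = kernel.transpose(2, 3, 0, 1).reshape(-1, C_out)
    left = np.empty((P, C_in * K_h * K_w))   # small reusable left buffer
    out = np.empty((n_rows, C_out))
    for start in range(0, n_rows, P):        # ceil(n_rows / P) GeMM calls
        stop = min(start + P, n_rows)
        for r in range(start, stop):         # refill up to P rows
            y, x = divmod(r, W_out)
            left[r - start] = image[y:y + K_h, x:x + K_w, :].reshape(-1)
        out[start:stop] = left[:stop - start] @ G
    return out.reshape(H_out, W_out, C_out)
```

With P = n_rows this degenerates to plain im2col; smaller P trades GeMM call overhead for a smaller left buffer, which is the balance discussed above.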

III. THEORETICAL COMPARISON
Now that we have considered three GeMM-based convolution algorithms, let us highlight their main differences and similarities. The im2col algorithm computes the convolution using a single call of the GeMM function from the BLAS library but requires the allocation and initialization of a large buffer to do so. Our p-im2col algorithm splits the computation of the result and computes P rows of output H_out·W_out / P times, reinitializing the left matrix before each GeMM call. Furthermore, the kn2row-aa algorithm uses K_h·K_w GeMM calls to compute 1 × 1 convolutions as matrix multiplications with the accumulation of shifted results. Unlike the first two algorithms, the latter does not require any additional initialization before the GeMM calls; however, it requires some additional memory to store the excess values in the output, and it also computes more multiplications than needed.
In Table 2, we show the exact amount of additional memory, the number of GeMM calls and the total number of multiplications in each algorithm for computing the convolution of an H × W × C_in image with a C_in × C_out × K_h × K_w kernel with unit stride, unit dilation and no padding.
To illustrate these formulas, let us consider the convolution of an S × S × C_in image with a C_in × C_out × K × K kernel, with channel counts C ∈ {…, 128} and K ∈ {3, 5}, and assume that values are stored in 32-bit floating-point format. These values are representative of real-world applications of the convolution operation in CNNs. The amount of memory needed for the computation of a discrete convolution using the considered GeMM-based algorithms is shown in Table 3. P is set to 120, which is a viable value (see Section IV). Table 3 shows that on small images, the amount of additional memory is low for all algorithms. However, if we consider large images, im2col requires a vast amount of memory to store the im2col buffer, while neither the proposed method nor kn2row-aa requires as much memory. Usually, kn2row-aa is more memory efficient than our algorithm, but the amount of additional memory required by kn2row-aa still increases as the image size increases, while the requirement of the p-im2col algorithm remains the same.

IV. MULTIPLE DEVICE RESEARCH

A. OVERALL EFFICIENCY EVALUATION
The memory efficiency of the p-im2col algorithm is strictly defined by the value of parameter P, but its computational efficiency does not allow an accurate theoretical prediction. Therefore, our first goal was to experimentally evaluate the algorithm on a wide variety of hardware. To this end, we considered 8 CPUs of x86/x86_64, ARM, and MIPS architectures; the x86/x86_64 devices span several generations. Table 4 presents the list of these devices.

1) Experimental setup
To experimentally evaluate the computational efficiency of the p-im2col algorithm, we measured the inference time for convolutions of various sizes. The main target of the p-im2col algorithm is lightweight neural networks used in embedded recognition, since this is where memory limits are stringent, so we considered quite common parameters for such networks. We tested convolutions over a grid of image sizes, channel counts and kernel sizes, with P ∈ [8, 1500] and P mod 4 = 0. We used 32-bit floating-point data and Eigen [51] as a BLAS library. Eigen uses 4 × 4 microkernels, so the chosen P is always a multiple of the microkernel size. We ran our experiments several times for each set of parameters to ensure that the relative error was less than 0.5%.

2) Efficiency evaluation
The computational efficiency evaluation for the considered algorithms is slightly tricky. The inference time of an algorithm has a rather complicated dependence on the convolution parameters. It depends not only on the number of multiplications but also on the number of additions, memory load/store instructions, cache misses, etc. In our problem, we need to compare the inference time for different parameter sets, so we normalize the inference time by the number of multiplications (see Table 2). Let us denote the set of convolution parameters for our experiment as θ = {S, C_in, C_out, K}, the inference time of p-im2col with parameter value P as T_A(θ, P), the time of the kn2row-aa algorithm as T_B(θ), and the time of the im2col algorithm as T_C(θ). The number of multiplications is:

M(θ) = C_out·C_in·K^2·(S − K + 1)^2. (11)

For the kn2row-aa algorithm, it is slightly larger: C_out·C_in·K^2·S^2. However, we consider the difference as an overhead. Therefore, the inference time per multiplication in the p-im2col algorithm can be measured as

t_A(θ, P) = T_A(θ, P) / M(θ).

The same holds for the kn2row-aa and im2col algorithms:

t_B(θ) = T_B(θ) / M(θ), t_C(θ) = T_C(θ) / M(θ).

To find an optimal value of parameter P, we take the minimum of the function t_A(P) = E_θ[t_A(θ, P)]:

P_opt = argmin_P t_A(P).
Thus, t_A(P) is the average inference time per multiplication over all convolution parameters in our experiment. These functions for all devices are presented in Fig. 5. Note that we do not compare t_A(P) functions for different devices and only show them in the same figure for the reader's convenience.
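The normalization and the selection of P_opt can be sketched as follows. The function names and the `times` dictionary of measured inference times are hypothetical, and M(θ) assumes the no-padding output size (S − K + 1)^2 as in (11):

```python
def useful_muls(theta):
    """M(theta) for theta = (S, C_in, C_out, K), no padding."""
    S, C_in, C_out, K = theta
    return C_out * C_in * K * K * (S - K + 1) ** 2

def optimal_P(times, thetas, P_values):
    """Average the inference time per useful multiplication over all
    tests, then pick the P minimizing that average (P_opt)."""
    def t_bar(P):
        return sum(times[(th, P)] / useful_muls(th) for th in thetas) / len(thetas)
    return min(P_values, key=t_bar)
```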
The optimal values of the parameter P are shown in Table 5. From Fig. 5, we can see that a single P_opt exists for each CPU in our experiment but can vary significantly across different CPUs. Our experiments did not show a strong correlation between the optimal value of P and CPU architecture or convolution parameters. Thus, based on the experimental results, we developed an empirical solution for the suboptimal choice of parameter P: use P = 120 for x86/x86_64 CPUs and P = 64 for others.

3) Efficiency analysis
Now that we have the optimal value of parameter P in the p-im2col algorithm, we can compare its computational efficiency to those of the kn2row-aa and im2col algorithms. The inference times of p-im2col with the optimal P, kn2row-aa, and im2col for each test on each device are presented in Appendix B.
To compare and analyze the methods, we introduce average dependencies for each convolution parameter. Let us once again use the notations T_A(θ, P), T_B(θ) and T_C(θ) for the experimentally measured inference times of the p-im2col, kn2row-aa and im2col algorithms in test θ, respectively. The number of "useful" multiplications in this test is M(θ) (see eq. (11)). Now, using the values from Table 5, we can introduce the functions:

t^X_A(x) = E_{θ: X=x}[T_A(θ, P_opt) / M(θ)], t^X_B(x) = E_{θ: X=x}[T_B(θ) / M(θ)], t^X_C(x) = E_{θ: X=x}[T_C(θ) / M(θ)], (15)

where X is S, C_in or C_out, and the average is taken over all tests with the given value of X. An example of these functions on the Ryzen 9 CPU is presented in Fig. 6. This figure shows that the dependence of the inference time on a convolution parameter varies significantly across the different algorithms. In Appendix A, we also present these functions for other devices. We observe some common tendencies of the normalized inference time (t^X_Y, where Y is an algorithm and X is a parameter): • As C_in increases, the inference time of the kn2row-aa algorithm usually decreases. This is because both p-im2col and im2col have an initialization stage (the im2col transformation), which grows linearly as C_in increases. kn2row-aa, however, does not require any initialization before matrix multiplications are computed, and the efficiency of matrix multiplication improves as the depth increases with increasing C_in (this holds for all three algorithms). • As C_out increases, the normalized inference time of all three algorithms decreases as matrix multiplication becomes more efficient. However, t^{C_out}_C demonstrates the fastest descent, probably because im2col has only one GeMM call. • The dependence of the normalized inference time on S is not very clear and varies across different devices. This parameter mainly affects the amount of allocated memory, especially in the im2col algorithm.

B. CONVOLUTION ALGORITHM SELECTION
We have described the theoretically and experimentally measured inference times of three GeMM-based convolution algorithms: the classic im2col, the memory-efficient kn2row-aa, and the novel p-im2col, which is, in fact, a family of algorithms since we can vary parameter P. Now, it would be reasonable to decide how to choose the best algorithm. Let us consider two scenarios: • when there is a strict memory constraint and • when we use optimal parameter P to obtain a better inference time along with reduced memory usage.

1) Algorithm selection based on memory constraints
If the amount of memory is limited, then we would not recommend using the im2col algorithm because the im2col buffer requires a vast amount of memory. Instead, one should choose between the kn2row-aa and p-im2col algorithms using the formulas from Table 2 to meet the constraints. By decreasing the value of parameter P , we can significantly decrease the memory consumption of the p-im2col algorithm (but at the cost of computational efficiency if P < P opt ), and it would not increase as the image size increases (which is the case for both the im2col and kn2row-aa algorithms).
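For instance, under a hard memory budget, the largest admissible P for the p-im2col left buffer (which stores P · C_in · K_h · K_w values, as stated in Section II-C) can be computed as follows. The function and its defaults (4-byte floats, a 4-row microkernel, as for Eigen) are illustrative:

```python
def max_P(mem_budget_bytes, C_in, K_h, K_w, elem_size=4, microkernel=4):
    """Largest P whose left buffer (P * C_in * K_h * K_w values) fits
    in the budget while staying a multiple of the GeMM microkernel
    height; returns 0 if even the smallest admissible P will not fit."""
    rows = mem_budget_bytes // (C_in * K_h * K_w * elem_size)
    return (rows // microkernel) * microkernel
```

For example, with a 1 MB budget, C_in = 64 and a 3 × 3 kernel, each buffer row takes 64 · 9 · 4 = 2304 bytes, so P is capped at 432.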
However, we believe that this case matches specialized devices only, such as microcontrollers or FPGAs, because the memory requirements for optimal efficiency are not very strong, at least for the considered CPUs.

2) Algorithm selection based on computational efficiency
If no memory constraint is present, then we can use the optimal P for the p-im2col algorithm, which is given in Table 5. Therefore, we compared the inference times of im2col, kn2row-aa, and optimal p-im2col for the set of convolutional parameters defined in Section IV-A1. The comparison is presented in Appendix B. From these results, we can conclude that the fastest convolution algorithm may vary significantly across different devices and different convolutional parameters. However, there are common tendencies: • The im2col algorithm tends to be the fastest for a large number of output channels, especially for a larger kernel. Matrix multiplication is asymptotically more complex than reordering in the im2col transformation (it requires O(HW·C_in·C_out·K^2) operations versus O(HW·C_in·K^2) for initialization of the im2col buffer). Therefore, as C_out increases, the time required for the im2col transformation becomes less impactful than the time for matrix multiplication; a larger K additionally increases the GeMM depth (C_in·K^2), which improves GeMM efficiency. As a result, the single GeMM call in the im2col algorithm works faster than multiple calls in the p-im2col or kn2row-aa algorithms. • The kn2row-aa algorithm is often the fastest if the number of input channels is large and the number of output channels is small. For kn2row-aa, C_in is the depth of matrix multiplication; therefore, as it increases, GeMM becomes more efficient. However, as C_out increases, the amount of additional memory and the number of excess computations in this algorithm also increase. • In other cases, when neither C_in nor C_out is large, the p-im2col algorithm tends to be the fastest, as it has the same computational complexity as im2col but better memory locality. Considering these points, we developed the following five empirical strategies to choose a convolution algorithm: • (i)m2col. Always use the im2col algorithm. • (k)n2row. Always use the kn2row-aa algorithm. • (p)-im2col. Always use the p-im2col algorithm with P = 120 for x86/x86_64 CPUs and P = 64 for others. • (c)ombined.
-Use the im2col algorithm if C out ≥ 128.
-Use the kn2row-aa algorithm if we did not choose im2col and C in ≥ 128. -Use the p-im2col algorithm with the same parameter P as in the p-im2col strategy in all other cases.
• (a)daptive. For each specific device, we can run a set of experiments as we did in Section IV-A and Appendix B and choose the fastest algorithm for particular convolution parameters. In the first three strategies (i, k and p), we simply use one of the algorithms. In strategy c, we combine them based on the empirical rules deduced after analyzing the results presented in Appendix B. Strategy a is likely to achieve the best results but requires running additional tests on a target device. Fortunately, this can be done once during the configuration of a library containing this convolution algorithm, which is a standard case for high-performance libraries. For example, the FFTW library for fast discrete Fourier transform uses such an approach [52].
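The combined strategy (c), together with the empirical rule for P, can be written as a small selector. The function name and interface are illustrative, not part of the paper's implementation:

```python
def choose_algorithm(C_in, C_out, arch):
    """Strategy (c): im2col for many output channels, kn2row-aa for
    many input channels, otherwise p-im2col with the empirically
    suggested P (120 on x86/x86_64, 64 elsewhere)."""
    if C_out >= 128:
        return ("im2col", None)
    if C_in >= 128:
        return ("kn2row-aa", None)
    P = 120 if arch in ("x86", "x86_64") else 64
    return ("p-im2col", P)
```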
Let us compare these 5 strategies. To this end, we used the data obtained in the experiments from Appendix B. Let us assume that on a specific test θ ∈ Θ (Θ is the set of all tests), strategy x achieves an inference time of τ_x(θ), and strategy y achieves an inference time of τ_y(θ). Then, we define the comparison function on a single test as:

f(x, y, θ, α) = 1 if τ_x(θ) < (1 − α)·τ_y(θ); −1 if τ_x(θ) > (1 + α)·τ_y(θ); 0 otherwise,

where α denotes a certain threshold value. For the purpose of our comparison, we used α = 0.05, which means that if the relative difference between τ_x(θ) and τ_y(θ) was less than 5%, we considered that strategies x and y have approximately the same inference time on test θ. Table 6 presents the average value of f(x, y, θ, α) over all tests θ ∈ Θ and over all devices that we used in our experiments. As Table 6 shows, strategy a outperforms all other strategies, but it is more difficult to implement because it requires an additional test for each device. Instead, we can use strategy c or even the simpler strategy p without considerable performance loss. However, strategies i and k show similar results that are noticeably worse than the results of the remaining three strategies. This means that strategies p and c are more universal than i and k. Therefore, we can give the following recommendations for the choice of the convolution algorithm: • If there are no memory limitations and we can run tests on a specific device during the configuration of the algorithm, then one should use strategy a. • If there are no memory limitations but we need an out-of-the-box solution with satisfactory performance, then one should use strategy c. • If we need to keep memory consumption low but the limitation is not strict, then one should use strategy p. • If the memory constraint is strict, then one should choose an algorithm according to this constraint, as we discussed in Section IV-B1. • Recall that we conducted the experiments with Eigen as a BLAS library, so it would be sensible to experimentally find an optimal value of P if another library is used.

V. CONCLUSION
In this paper, we considered two standard GeMM-based convolution algorithms that can be used in CNNs: im2col and kn2row-aa. We proposed the novel algorithm p-im2col, which is an easy yet efficient modification of the im2col algorithm. In this algorithm, we can flexibly control the memory requirements using the parameter value. p-im2col can be used on any CPU architecture and with any BLAS library that implements GeMM operations for this architecture; however, parameter P should be fine-tuned for this combination of a CPU and a library. Although the proposed algorithm does not achieve the memory efficiency of the kn2row-aa algorithm, the computation of kn2row-aa can take more time, especially when the number of input channels is low. We performed extensive experiments on a wide range of x86/x86_64, ARM, and MIPS CPUs using Eigen as a BLAS library. We compared the required amount of memory and computational complexity of the considered algorithms and measured the corresponding inference times. Using these results, we developed several strategies for choosing the convolution algorithm for specific parameters based on memory limitations or minimizing the inference time. The p-im2col algorithm is a more universal solution than kn2row-aa or the classic im2col. It falls slightly behind the adaptive strategy (that optimally chooses p-im2col, kn2row-aa, or im2col for a specific device and specific convolution parameters) in terms of computational efficiency.
Future research on the p-im2col algorithm could take one of the following directions: • one could theoretically deduce or at least find strong empirical patterns for P_opt and its relation to the convolution parameters, since in this paper we experimentally selected a constant value for each device; • one could experimentally obtain memory and inference time measurements with different BLAS libraries and data types of the image and kernel, for example, for quantized neural networks; • or one could investigate the potential of the p-im2col algorithm on non-CPU architectures.

APPENDIX A DEPENDENCE OF THE INFERENCE TIME ON THE CONVOLUTION PARAMETERS
In this section, we present the dependence of the normalized inference time of the p-im2col (t^X_A(x)), kn2row-aa (t^X_B(x)) and im2col (t^X_C(x)) algorithms (according to (15)) on particular convolution parameters for the set of considered devices (the experimental setup is described in Section IV-A).