Tensor Completion Methods for Collaborative Intelligence

In the race to bring Artificial Intelligence (AI) to the edge, collaborative intelligence has emerged as a promising way to lighten the computation load on edge devices that run applications based on Deep Neural Networks (DNNs). Typically, a deep model is split at a given layer into edge and cloud sub-models. The deep feature tensor produced by the edge sub-model is transmitted to the cloud, where the remaining computationally intensive workload is performed by the cloud sub-model. The communication channel between the edge and cloud is imperfect, which will result in missing data in the deep feature tensor received at the cloud side, an issue that has mostly been ignored by existing literature on the topic. In this paper we study four methods for recovering missing data in the deep feature tensor. Three of the studied methods are existing, generic tensor completion methods, and are adapted here to recover deep feature tensor data, while the fourth method is newly developed specifically for deep feature tensor completion. Simulation studies show that the new method is 3–18 times faster than the other three methods, which is an important consideration in collaborative intelligence. For VGG16’s sparse tensors, all methods produce statistically equivalent classification results across all loss levels tested. For ResNet34’s non-sparse tensors, the new method offers statistically better classification accuracy (by 0.25%–6.30%) compared to other methods for matched execution speeds, and second-best accuracy among the four methods when they are allowed to run until convergence.


I. INTRODUCTION
As the Internet of Things (IoT) infrastructure gets deployed, there will be many opportunities for innovative applications that make use of the newly available sensor data. Many of these new applications will rely on DNNs to process the sensor data and produce useful predictions and analytics. One current research direction is towards miniaturization of DNNs, so that they can be implemented at or near the edge sensors, with limited computation and energy resources. While such DNNs may be sufficient for certain applications, resources available in the cloud will always be able to support larger and more sophisticated models than those that could be deployed at the edge. Hence, cloud-based analytics will remain essential even if some of the AI-based processing gets moved to the edge.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
Collaborative Intelligence (CI) [1]-[4] is an AI deployment strategy that leverages both edge-based and cloud-based resources to make DNN computing faster and more efficient. In CI, a deep model is split into an edge sub-model and a cloud sub-model. For example, an edge sub-model may consist of the initial m layers of a DNN, while the cloud sub-model is made up of the remaining DNN layers. When an input signal is captured by an edge sensor, the edge sub-model processes the signal and produces a tensor of deep features, which is then transmitted to the cloud for subsequent processing by the cloud sub-model. Due to the imperfect channel between the edge and the cloud, deep feature tensor data may be damaged or missing, much like data transmitted over the Internet. Hence, error control schemes must be deployed to achieve seamless operation of edge and cloud sub-models in AI.
In this paper, we study four methods for recovery of missing data in a deep feature tensor. Three of these methods come from existing literature: Simple Low Rank Tensor Completion (SiLRTC) [5], High Accuracy Low Rank Tensor Completion (HaLRTC) [5], and the recent Fused Canonical Polyadic (FCP) decomposition [6]. All three are general tensor completion methods based on the low-rank tensor assumption. We explain how these methods can be adapted to deep feature tensor completion. The fourth method we study is simple and tailor-made for recovery of missing data (i.e., imputation) in deep feature tensors. It is adaptive and linear in nature. Specifically, missing data in the tensor are recovered as a linear combination of other, available data, so the tensor rank is not increased. Due to these attributes, we call it Adaptive Linear Tensor Completion (ALTeC).
The paper is organized as follows. Section II provides the background that is necessary to understand the tensor completion algorithms under study. In Section III, we review the SiLRTC, HaLRTC, and FCP algorithms and explain how they are applied to deep feature tensor completion. Section IV introduces ALTeC and describes how its parameters are computed. Section V describes the simulation environment and experiments, and provides statistical analysis of the results. Lastly, Section VI concludes the paper and indicates potential avenues for further work.

II. PRELIMINARIES
In this section we illustrate several tensor and matrix operations that are used later in the paper. In terms of notation, bold calligraphic letters ($\boldsymbol{\mathcal{X}}$) will denote tensors, bold uppercase non-italic letters ($\mathbf{X}$) will denote matrices, bold lowercase non-italic letters ($\mathbf{x}$) will denote vectors, and italic letters ($x$ or $X$) will denote scalars.

A. TENSOR FOLDING & UNFOLDING
Tensor unfolding is a structured mapping from a tensor to a matrix. A tensor can be unfolded along any of its dimensions. For example, consider the 3D tensor
$$\boldsymbol{\mathcal{X}} = \left[\; \begin{bmatrix} 0 & 2 & 4 \\ 6 & 8 & 10 \\ 12 & 14 & 16 \end{bmatrix},\; \begin{bmatrix} 1 & 3 & 5 \\ 7 & 9 & 11 \\ 13 & 15 & 17 \end{bmatrix} \;\right] \tag{1}$$
with two channels, where the left matrix (even integers) is the first channel and the right matrix (odd integers) is the second channel. Unfolding along the x-, y-, and z-axis (axis 0, 1, and 2) produces matrices $\mathbf{X}_{(0)}$, $\mathbf{X}_{(1)}$, and $\mathbf{X}_{(2)}$, respectively.
Once a 3D tensor is unfolded into a 2D matrix, matrix operations such as Singular Value Decomposition (SVD) can be performed. Once matrix processing is done, the folding operation converts the 2D matrix back into a 3D tensor. Folding along a given axis is the inverse of unfolding along the same axis, i.e., fold(unfold(X, i), i) = X.
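As a rough illustration, unfolding and folding can be sketched with NumPy as follows. The row-major column ordering, the function names, and the extra `shape` argument of `fold` are assumptions made for this sketch; the column ordering of the unfolded matrices in the paper may differ.

```python
import numpy as np

def unfold(tensor, axis):
    """Unfold a tensor into a matrix whose rows are indexed by `axis`."""
    # Move the chosen axis to the front, then flatten the remaining axes.
    return np.moveaxis(tensor, axis, 0).reshape(tensor.shape[axis], -1)

def fold(matrix, axis, shape):
    """Inverse of unfold: rebuild a tensor of the given shape."""
    moved_shape = [shape[axis]] + [s for k, s in enumerate(shape) if k != axis]
    return np.moveaxis(matrix.reshape(moved_shape), 0, axis)

# Example tensor: channel 0 holds the even integers, channel 1 the odd ones.
X = np.arange(18).reshape(3, 3, 2)

# Folding undoes unfolding along every axis.
for axis in range(3):
    assert np.array_equal(fold(unfold(X, axis), axis, X.shape), X)
```

With this convention, unfolding along axis 2 gives a 2 × 9 matrix whose first row contains the even integers and whose second row contains the odd integers.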

B. SINGULAR VALUE DECOMPOSITION & SHRINKAGE
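Singular Value Decomposition (SVD) of a given m × n matrix A is given by [7]:
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \tag{3}$$
where U and V are unitary matrices ($\mathbf{U}\mathbf{U}^T = \mathbf{I}$, $\mathbf{V}\mathbf{V}^T = \mathbf{I}$) whose dimensions are m × m and n × n, respectively. The matrix $\boldsymbol{\Sigma}$ is m × n and contains the singular values of A along its main diagonal, $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_i)$.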
In some applications, soft-thresholding (also known as shrinkage) [8] is applied to the singular values in $\boldsymbol{\Sigma}$ in order to arrive at a lower-rank matrix that is still a good approximation to the original matrix A. Specifically, if $\sigma_i$ are the singular values of A and τ is a given threshold, then the corresponding shrinkage operation is defined as
$$\mathcal{D}_\tau(\mathbf{A}) = \mathbf{U}\boldsymbol{\Sigma}_\tau\mathbf{V}^T, \tag{4}$$
where $\boldsymbol{\Sigma}_\tau = \operatorname{diag}(\max(\sigma_i - \tau, 0))$. In words, shrinkage reduces all singular values by τ, clips the negative results to 0, and then re-synthesizes the matrix with the new singular values.
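The shrinkage operation described above can be sketched in a few lines of NumPy. The function name `shrink` is chosen here for illustration only.

```python
import numpy as np

def shrink(A, tau):
    """Soft-threshold the singular values of A by tau and re-synthesize the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # reduce by tau, clip negatives to zero
    return (U * s_shrunk) @ Vt            # equivalent to U @ diag(s_shrunk) @ Vt

# A has singular values 3 and 1; with tau = 2 they become 1 and 0.
A = np.diag([3.0, 1.0])
B = shrink(A, tau=2.0)
```

In this small example the rank drops from 2 to 1, because the smaller singular value falls below the threshold.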
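
C. CANONICAL POLYADIC DECOMPOSITION
An m × n × p tensor $\boldsymbol{\mathcal{X}}$ can be approximated with a low-rank tensor $\hat{\boldsymbol{\mathcal{X}}}$ through Canonical Polyadic Decomposition (CPD) by solving [9]:
$$\min_{\hat{\boldsymbol{\mathcal{X}}}} \left\| \boldsymbol{\mathcal{X}} - \hat{\boldsymbol{\mathcal{X}}} \right\|_F, \tag{5}$$
where $\|\cdot\|_F$ is the Frobenius norm of a tensor, and the low-rank approximation tensor, $\hat{\boldsymbol{\mathcal{X}}}$, is given by:
$$\hat{\boldsymbol{\mathcal{X}}} = \sum_{r=1}^{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r. \tag{6}$$
The dimensions of $\mathbf{u}_r$, $\mathbf{v}_r$, and $\mathbf{w}_r$ are m × 1, n × 1, and p × 1, respectively. The symbol $\circ$ represents the vector outer product, such that $\mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r$ is an m × n × p tensor for any r. The matrices U, V, and W represent the CP factor matrices whose columns are the corresponding vectors $\mathbf{u}_r$, $\mathbf{v}_r$, and $\mathbf{w}_r$, namely, $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})}]$, $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})}]$, and $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})}]$. The above minimization problem can also be rephrased and solved in terms of the unfolded tensors $\mathbf{X}_{(i)}$ for i = 1, 2, 3, each of which is approximated by the product of one factor matrix and the transposed Khatri-Rao product of the other two. Here the symbol $\odot$ represents the Khatri-Rao product defined as $\mathbf{U} \odot \mathbf{V} = [\mathbf{u}_1 \otimes \mathbf{v}_1, \ldots, \mathbf{u}_{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})} \otimes \mathbf{v}_{\operatorname{rank}(\hat{\boldsymbol{\mathcal{X}}})}]$, and $\otimes$ is the Kronecker product of the corresponding column vectors.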
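To make the relationship between the CP factors and the unfolded tensor concrete, the following sketch builds a low-rank tensor from random factor matrices and checks one unfolded identity numerically. The function names, the random test data, and the row-major unfolding are assumptions of this sketch; with a different unfolding convention the order of the factors inside the Khatri-Rao product changes.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: column r is kron(U[:, r], V[:, r])."""
    return np.stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])], axis=1)

def cp_reconstruct(U, V, W):
    """Sum over r of the outer products u_r o v_r o w_r."""
    return np.einsum('ir,jr,kr->ijk', U, V, W)

rng = np.random.default_rng(0)
m, n, p, R = 4, 3, 2, 2
U = rng.normal(size=(m, R))
V = rng.normal(size=(n, R))
W = rng.normal(size=(p, R))

X_hat = cp_reconstruct(U, V, W)
X0 = X_hat.reshape(m, n * p)          # row-major unfolding along axis 0

# Under this convention the unfolded tensor equals U (V khatri-rao W)^T.
assert np.allclose(X0, U @ khatri_rao(V, W).T)
```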

D. TENSOR DATA PACKETIZATION & TRANSMISSION
In collaborative intelligence (CI), tensor data needs to be transferred from the edge to the cloud. This process involves writing tensor values into data packets (which we refer to as packetization) and sending these data packets over the network to the cloud. It is too early to say what kind of tensor packetization schemes will be adopted in CI applications in the future. We note, however, that transmission of another kind of tensor data, namely video, has been around for a while, and video packetization schemes are well-established and tested in practice. We believe that similar schemes will be strong contenders for deep tensor data packetization as well. Therefore, for the purposes of this paper, we adopt a packetization method that is popular in video streaming [10], where in each video frame, rows of macro-blocks are mapped to packets. In a similar manner, we write tensor data into packets row-by-row and then channel-by-channel. For example, the data in tensor X in (1) would generate six packets, each composed of one row of tensor data: [0, 2, 4], [6,8,10], . . . , [13,15,17].
Errors in the communication channel may cause data packets not to arrive at the cloud sub-model. Such packets are referred to as ''lost'' and the probability of their loss is p_loss. The result of packet loss at the cloud side is that the corresponding tensor rows are not available. As an example, if two of the six packets carrying tensor X from (1) are lost, the two corresponding rows are unavailable at the cloud, and their values can be indicated by question marks ('?'). The goal of tensor completion is to recover these missing values so that the cloud sub-model can perform successful inference.
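The row-by-row packetization and independent packet loss can be sketched as follows. The (rows, columns, channels) array layout, the function names, and the zero-filling of lost rows are assumptions of this sketch rather than details taken from the authors' code.

```python
import numpy as np

def packetize(tensor):
    """One packet per row, written row-by-row and then channel-by-channel."""
    h, w, c = tensor.shape
    return [tensor[i, :, ch].copy() for ch in range(c) for i in range(h)]

def simulate_loss(tensor, p_loss, seed):
    """Drop each row packet independently with probability p_loss.

    Returns the received tensor (lost rows set to zero) and a boolean
    mask lost[i, ch] that marks which row packets were dropped.
    """
    h, w, c = tensor.shape
    rng = np.random.default_rng(seed)
    lost = rng.random((h, c)) < p_loss
    received = np.where(lost[:, None, :], 0.0, tensor)
    return received, lost

# The example tensor with even integers in channel 0 and odd ones in channel 1.
X = np.arange(18).reshape(3, 3, 2)
packets = packetize(X)                 # six packets of three values each
received, lost = simulate_loss(X, p_loss=0.3, seed=1)
```

Here the first packet is [0, 2, 4] and the last one is [13, 15, 17], matching the packet order described above.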

III. GENERAL TENSOR COMPLETION ALGORITHMS
Tensor completion has found applications in a number of research areas, including computer vision and data analytics. To our knowledge, however, it has not been studied in the context of recovering missing feature tensor values in collaborative intelligence. Often, the underlying assumption is that tensor data is ''low-rank'', or, more generally, ''smooth'' in some sense. A number of methods [5], [11], [12] have been developed based on the assumption that tensors lie in a low-rank manifold, which leads to iterative procedures for approximating the original tensor by a low-rank tensor. In these cases, it is not crucial to know where the data comes from, so long as the low-rank assumption holds. We refer to these methods as general, meaning that they could be applied to any kind of tensor. We will review three such methods, namely Simple Low Rank Tensor Completion (SiLRTC) [5], High Accuracy Low Rank Tensor Completion (HaLRTC) [5], and Fused Canonical Polyadic (FCP) [6], and adapt them to the problem of recovering missing feature tensor values produced by packet loss in collaborative intelligence. We note that it is not clear whether the low-rank assumption in fact holds for deep feature tensors. The existence of adversarial examples [13] shows that small perturbations in the input of a deep model may cause large changes downstream, which might indicate that the notions of smoothness and low rank are less applicable to deep feature tensors than they might be to the kind of data for which tensor completion has mostly been used so far, such as color images. Nonetheless, it is still important to establish what level of performance existing tensor completion methods can achieve on this new problem.

A. SIMPLE LOW RANK TENSOR COMPLETION (SILRTC)
In this section, we briefly review SiLRTC [5] and show how it can be applied to completion of feature tensors in collaborative intelligence. A summary of SiLRTC is shown in Algorithm 1 in the Appendix. The inputs are the corrupt tensor X (with some of its rows missing), the number of iterations K , and non-negative scaling factors α i , β i , i ∈ {1, 2, 3} where α i 's add up to 1. The scaling factors are chosen randomly [5], and each pair of (α i , β i ) is for one dimension of the tensor. The output is the ''completed'' tensor X .
To reconstruct the missing values, SiLRTC initializes the output tensor as a copy of the corrupt tensor (step 1) and then loops over K iterations. In each iteration, the current estimate is unfolded along each dimension (step 5), shrinkage with threshold τ = α_i/β_i is applied (step 6), and the result is folded back to 3D (step 7). The three folded tensors are added (step 8) and the result is re-scaled (step 10). The above operations change all values in the tensor; however, some of the rows in the corrupt tensor are known (i.e., not missing). Therefore, known rows are restored in the resulting tensor (step 11) and the result is passed to the next iteration.
In essence, SiLRTC attempts to iteratively reduce the rank of the corrupt tensor X by performing shrinkage on the unfolded versions of the tensor and averaging the folded results. The next method we review, HaLRTC, is based on a similar idea, but is more accurate.
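The loop described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the β-weighted sum and the division by the sum of the β_i are assumptions about how the folded tensors are added and re-scaled, and the helper names, test tensor, and parameter values are hypothetical.

```python
import numpy as np

def unfold(t, axis):
    return np.moveaxis(t, axis, 0).reshape(t.shape[axis], -1)

def fold(mat, axis, shape):
    moved = [shape[axis]] + [s for k, s in enumerate(shape) if k != axis]
    return np.moveaxis(mat.reshape(moved), 0, axis)

def shrink(A, tau):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def silrtc(corrupt, known, alpha, beta, K):
    """corrupt: tensor with missing entries set to zero; known: boolean mask of available entries."""
    X = corrupt.astype(float).copy()
    for _ in range(K):
        total = np.zeros_like(X)
        for i in range(X.ndim):
            # Unfold, shrink with threshold alpha_i / beta_i, and fold back.
            M = fold(shrink(unfold(X, i), alpha[i] / beta[i]), i, X.shape)
            total += beta[i] * M          # assumed beta-weighted sum
        X = total / sum(beta)             # assumed re-scaling
        X[known] = corrupt[known]         # restore the known values
    return X

# Small rank-one test tensor with two missing rows (illustrative values only).
rng = np.random.default_rng(0)
truth = np.einsum('i,j,k->ijk', rng.random(6) + 1, rng.random(6) + 1, rng.random(4) + 1)
known = np.ones(truth.shape, dtype=bool)
known[2, :, 1] = False
known[4, :, 3] = False
corrupt = np.where(known, truth, 0.0)
completed = silrtc(corrupt, known, alpha=[1 / 3] * 3, beta=[1.0] * 3, K=5)
```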

B. HIGH ACCURACY LOW RANK TENSOR COMPLETION (HALRTC)
HaLRTC [5] follows the same reasoning as SiLRTC, but using an alternating direction method of multipliers (ADMM) [14] to find the solution. It is more sophisticated than SiLRTC and is expected to produce better tensor completion results. The steps are presented in Algorithm 2 in the Appendix.
The inputs are the corrupt tensor X (with some of its rows missing), the number of iterations K , and non-negative scaling factors ρ, α i , i ∈ {1, 2, 3} where α i 's add up to 1. The scaling factors are chosen randomly [5], and each α i is for one dimension of the tensor. The output is the ''completed'' tensor X .
The algorithm starts by initializing the output tensor X (step 1) and three auxiliary tensors M i , Y i (step 3), one for each dimension. Then, in each of the K iterations, for each tensor dimension, X and Y i are unfolded along the corresponding dimension (steps 7-8), their sum is shrunk (step 9) and the result folded into M i (step 10). Finally, output tensor X , auxiliary tensors Y i , and scaling parameter ρ are updated (steps 12, 15, and 17, respectively).
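A minimal sketch of these steps is given below, under the assumption that the updates take the usual ADMM form for this kind of problem: shrinkage threshold α_i/ρ, averaging of the auxiliary tensors on the missing entries, and a multiplicative increase of ρ. The growth factor `rho_growth`, the helper names, and the test data are hypothetical and are not taken from [5] or from the authors' code.

```python
import numpy as np

def unfold(t, axis):
    return np.moveaxis(t, axis, 0).reshape(t.shape[axis], -1)

def fold(mat, axis, shape):
    moved = [shape[axis]] + [s for k, s in enumerate(shape) if k != axis]
    return np.moveaxis(mat.reshape(moved), 0, axis)

def shrink(A, tau):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def halrtc(corrupt, known, alpha, rho, K, rho_growth=1.1):
    """corrupt: tensor with missing entries set to zero; known: boolean mask of available entries."""
    X = corrupt.astype(float).copy()
    n = X.ndim
    Y = [np.zeros_like(X) for _ in range(n)]
    M = [np.zeros_like(X) for _ in range(n)]
    for _ in range(K):
        for i in range(n):
            # Shrink the unfolded sum of X and the scaled auxiliary tensor Y_i.
            M[i] = fold(shrink(unfold(X + Y[i] / rho, i), alpha[i] / rho), i, X.shape)
        estimate = sum(M[i] - Y[i] / rho for i in range(n)) / n
        X = np.where(known, corrupt, estimate)   # only missing entries are updated
        for i in range(n):
            Y[i] = Y[i] - rho * (M[i] - X)
        rho *= rho_growth                        # assumed update rule for rho
    return X

rng = np.random.default_rng(0)
truth = np.einsum('i,j,k->ijk', rng.random(6) + 1, rng.random(6) + 1, rng.random(4) + 1)
known = np.ones(truth.shape, dtype=bool)
known[2, :, 1] = False
corrupt = np.where(known, truth, 0.0)
completed = halrtc(corrupt, known, alpha=[1 / 3] * 3, rho=1.0, K=5)
```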

C. FUSED CANONICAL POLYADIC (FCP) DECOMPOSITION
The FCP algorithm [6] is somewhat more involved than SiLRTC and HaLRTC. The main steps are presented in Algorithm 3 in the Appendix, but the reader is referred to [6] for full details. The inputs to the algorithm include the corrupt tensor X (with some of its rows missing), the CP factor matrices A i , i ∈ {1, 2, 3} (which represent U, V, W in (6)), regularization matrices L s and L p , the number of iterations K , non-negative scaling factors µ, α i , β i , δ i , ζ i , i ∈ {1, 2, 3}, tensor rank R along with its corresponding increment (R u ) and maximum value (R m ), and stopping criteria variables η and . The output is the ''completed'' tensor X .
The algorithm starts by initializing the output tensor X (step 1). Then it loops (iter) through each of the K iterations and for each tensor dimension, X is unfolded along the corresponding dimension i and matrix S is computed (step 4) as the Khatri-Rao product between a pair of CP factor matrices (depending on i), namely the two factor matrices other than A_i. Then the reciprocal of the step size L and the gradient of regularized error ∇h are computed (steps 5-6). The gradient computation involves the rectified linear function Q(·), defined as
$$Q(x) = \max(x, 0). \tag{10}$$
After that, the CP factor matrices A_i are updated in steps 9-12 until the stopping criteria (computed by the function stopping_criterion(·) in step 8) are met. It should be noted that stopping criteria, as well as several other parameters and scaling factors, are different for sparse and non-sparse tensors [6]. Next, the tensor rank is updated if needed (steps 15-17), an operation is performed on the CP factor matrices and the result is folded back into X_{iter−1} along the first tensor dimension (step 18). Lastly, X_{iter−1} is assigned to X_{iter} (step 19), X_{iter} is updated with known rows in the corrupt tensor (step 20), and the Frobenius norm of the difference between successive iterations of the completed tensor is compared to a tolerance η to determine if the algorithm has converged (steps 21-23).

IV. ADAPTIVE LINEAR TENSOR COMPLETION (ALTEC)
The algorithms presented in Section III manipulate the singular values (SiLRTC and HaLRTC) and CP factors (FCP) of an unfolded tensor to reconstruct the missing elements of the corrupt tensor. While each singular value or CP factor highlights some relevant features in the tensor, it does not capture all of them. In addition, SVD and CPD computations are expensive, and in the above-mentioned algorithms, these need to be performed at each iteration. In this section we present a simple tensor completion method specifically designed to recover missing rows of a deep feature tensor. We refer to it as Adaptive Linear Tensor Completion (ALTeC).
The proposed method assumes an approximate linear relationship between each row of a deep feature tensor and its neighbors. Let $\mathbf{x}_i^{(c)}$ be the i-th row in channel c of tensor X. The focus on rows comes from the specific row-by-row packetization scheme described in Section II-D. If a different packetization scheme is adopted, $\mathbf{x}_i^{(c)}$ and its neighborhood would need to be redefined, but the methodology below would still be applicable. We assume that $\mathbf{x}_i^{(c)}$ can be approximated by a linear combination of its neighbors, namely the co-located rows in other channels and two spatial neighbors in the same channel, one above and one below:
$$\mathbf{x}_i^{(c)} \approx \sum_{j=1}^{n} w_{i,j}^{(c)}\,\mathbf{x}_i^{(j)} + w_{i,n+1}^{(c)}\,\mathbf{x}_{i-1}^{(c)} + w_{i,n+2}^{(c)}\,\mathbf{x}_{i+1}^{(c)}, \tag{11}$$
where n is the number of channels, the $w_{i,j}^{(c)}$'s are the weights, and $w_{i,c}^{(c)}$ (the weight for the row itself in the sum on the right-hand side) is set to zero. When a neighbor is not available (for example, the row above the first row of a channel), it is assumed to be all-zero. The above equation can be written in a matrix-vector form as
$$\mathbf{x}_i^{(c)} \approx \mathbf{X}_i^{(c)}\,\mathbf{w}_i^{(c)}, \tag{12}$$
where the neighbor rows of $\mathbf{x}_i^{(c)}$ have been stacked into matrix $\mathbf{X}_i^{(c)}$ as columns, and the corresponding weights have been placed into the column vector $\mathbf{w}_i^{(c)}$. Finding the optimal weights amounts to solving the following problem:
$$\mathbf{w}_i^{(c)} = \arg\min_{\mathbf{w}} \left\| \mathbf{x}_i^{(c)} - \mathbf{X}_i^{(c)}\,\mathbf{w} \right\|_2^2, \tag{13}$$
which has a well-known solution [15]:
$$\mathbf{w}_i^{(c)} = \left( \mathbf{X}_i^{(c)T}\,\mathbf{X}_i^{(c)} \right)^{-1} \mathbf{X}_i^{(c)T}\,\mathbf{x}_i^{(c)}. \tag{14}$$
For obtaining the weights $\mathbf{w}_i^{(c)}$, we used 5,000 randomly selected images from the validation set of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [16]. This training set was separate from the test set used to compare the tensor completion methods in Section V. Every input image generates a deep feature tensor at a chosen intermediate layer of a given deep model, so that $\mathbf{w}_i^{(c)}$ can be computed from (14) for every row i and every channel c. To reduce storage requirements, we averaged all the weights for different rows i in each channel, to eventually obtain weights $\mathbf{w}^{(c)}$. The resulting weights can now be stored in an (n+2) × n matrix W, where n is the number of channels and the c-th column represents the weights for channel c of the deep feature tensor.
The process of tensor completion is summarized in Algorithm 4. The algorithm takes in the corrupt tensor and the weight matrix W. First, the output tensor is initialized (step 1). Then the algorithm loops through the tensor channels and each missing row is reconstructed as a linear combination of its neighbors using the corresponding weights (step 7). Note that if some of the neighbors of a missing row are also missing, the corresponding column in $\mathbf{X}_i^{(c)}$ is all-zero, so the corresponding term is effectively eliminated from the linear combination in (11).
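The following sketch illustrates both the weight fitting and the row reconstruction on synthetic data. It is not the authors' implementation: the (rows, columns, channels) layout, the function names, and the toy tensors are assumptions, and for brevity the weights are fitted by one pooled least-squares problem per channel instead of computing per-row weights and averaging them as described above.

```python
import numpy as np

def neighbor_matrix(T, i, c):
    """Columns: the n co-located rows, then the rows above and below in channel c."""
    h, w, n = T.shape
    above = T[i - 1, :, c] if i > 0 else np.zeros(w)
    below = T[i + 1, :, c] if i < h - 1 else np.zeros(w)
    return np.column_stack([T[i, :, :], above, below])    # shape (w, n + 2)

def fit_weights(tensors):
    """Pooled least-squares weights per channel (a simplification assumed here)."""
    h, w, n = tensors[0].shape
    W = np.zeros((n + 2, n))
    for c in range(n):
        A = np.vstack([neighbor_matrix(T, i, c) for T in tensors for i in range(h)])
        b = np.concatenate([T[i, :, c] for T in tensors for i in range(h)])
        A[:, c] = 0.0            # exclude the row itself so that its weight stays zero
        W[:, c] = np.linalg.lstsq(A, b, rcond=None)[0]
    return W

def altec_complete(corrupt, lost, W):
    """corrupt: (h, w, n) tensor with lost rows zeroed; lost: (h, n) mask; W: (n + 2, n)."""
    out = corrupt.astype(float).copy()
    for c in range(corrupt.shape[2]):
        for i in np.flatnonzero(lost[:, c]):
            # Missing neighbors are all-zero columns and drop out of the sum.
            out[i, :, c] = neighbor_matrix(corrupt, i, c) @ W[:, c]
    return out

# Toy data: every channel is a scaled copy of the same 5 x 4 random map.
rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 3.0])
make_tensor = lambda: rng.normal(size=(5, 4))[:, :, None] * scales
W = fit_weights([make_tensor() for _ in range(10)])

test_tensor = make_tensor()
lost = np.zeros((5, 3), dtype=bool)
lost[2, 1] = True
corrupt = np.where(lost[:, None, :], 0.0, test_tensor)
completed = altec_complete(corrupt, lost, W)
```

Because the toy channels are exactly linearly related, the lost row is recovered almost perfectly here; real deep feature tensors satisfy the linear model only approximately.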
As seen above, tensor reconstruction in ALTeC is linear in nature, and the linear combination for row reconstruction changes from channel to channel (hence ''adaptive'').
The key feature that makes ALTeC attractive compared to SiLRTC, HaLRTC, and FCP for collaborative intelligence applications, where latency is important, is its speed. ALTeC does not use computationally expensive operations such as shrinkage-based SVD or CPD. Moreover, ALTeC only reconstructs missing rows, whereas SiLRTC, HaLRTC, and FCP update the entire tensor in each iteration. Despite its relative simplicity, ALTeC achieves reconstruction accuracy similar to that of the other three methods, as will be seen in the next section.

V. EXPERIMENTS
A. EXPERIMENTAL SETUP
In this section, we present experiments to compare the performance of the four tensor completion algorithms described earlier (SiLRTC, HaLRTC, FCP, and ALTeC) on two pre-trained deep models for image classification, VGG16 [17] and ResNet34 [18], implemented in Keras (https://keras.io/applications/). Even though they are no longer state-of-the-art models for image classification, we selected these two models because they are well known in the research community and well-studied in the literature under a variety of application scenarios, including collaborative intelligence [1], [2]. The data used in the experiments is a randomly-selected subset of 1,000 images from the ILSVRC [16] validation set, which were different from the 5,000 images on which ALTeC was trained.
Following a common practice in video streaming simulations [10], we consider an independent random packet loss channel with packet loss probability p loss ∈ {5%, 10%, 15%, 20%, 25%, 30%}. It is assumed that at the receiver (cloud sub-model), missing packets are identified via packet sequence numbers provided by a transport-layer protocol such as the Real-time Transport Protocol (RTP) [19].
For each value of p_loss, each image in the test set is input to the edge sub-model and the resulting deep feature tensor is transmitted over the packet loss channel N = 100 times, to obtain statistically meaningful results. For each of these N = 100 channel realizations, a corrupt tensor is received on the receiver side, a specific tensor completion method is executed to obtain the completed tensor, and the completed tensor is input to the cloud sub-model to complete the inference task. Since we focus on image classification models, we measure classification accuracy under three conditions illustrated in Fig. 1:
• no loss (NL), to establish baseline performance;
• no tensor completion (NC), where all missing data is assumed to be zero;
• tensor completion (TC), where a specific tensor completion algorithm is performed on the corrupt tensor.
The average Top-1 classification accuracy for the three cases (µ_NL, µ_NC, and µ_TC) and the standard deviation of Top-1 classification accuracy under NC and TC conditions (σ_NC and σ_TC) are measured. It is important to note that for each combination of packet loss level and tensor completion algorithm, random packet loss is simulated using a random seed corresponding to the trial index (1, 2, . . . , 100), so that each completion algorithm sees the same set of packet loss realizations. This ensures the fairness of the comparison of the four algorithms.
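The seeding scheme described above can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption; the point of the sketch is only that seeding the generator with the trial index gives every completion method the same loss pattern.

```python
import numpy as np

def loss_realization(num_rows, num_channels, p_loss, trial):
    """Boolean mask of lost packets (one packet per row per channel) for a trial."""
    rng = np.random.default_rng(trial)       # the trial index doubles as the seed
    return rng.random((num_rows, num_channels)) < p_loss

def summarize(per_trial_accuracy):
    """Mean and standard deviation of Top-1 accuracy over the channel realizations."""
    acc = np.asarray(per_trial_accuracy, dtype=float)
    return acc.mean(), acc.std()

# Two methods evaluated on trial 7 at 10% loss see identical loss patterns.
mask_for_method_a = loss_realization(14, 512, 0.10, trial=7)
mask_for_method_b = loss_realization(14, 512, 0.10, trial=7)
assert np.array_equal(mask_for_method_a, mask_for_method_b)
```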
Before moving on to the performance comparison, we examine the convergence of SiLRTC, HaLRTC, and FCP. Recall from Section III that these algorithms iteratively update the to-be-completed tensor via unfolding and shrinking (or CPD) for a given number of iterations K. Fig. 2 shows the Frobenius norm ($\|\cdot\|_F$) of the difference between tensors in two consecutive iterations (K and K + 1):
$$\left\| \boldsymbol{\mathcal{X}}_{K+1} - \boldsymbol{\mathcal{X}}_{K} \right\|_F. \tag{15}$$
The curves in Fig. 2 represent the average Frobenius norm of the difference between tensors in two consecutive iterations across the six packet loss probabilities (p_loss). As seen in the figure, all three algorithms have essentially converged and tensor updates have stopped at around K = 50. Hence, we use K = 50 iterations in our experiments for these three algorithms.
All four tensor completion algorithms were implemented and tested on a Linux-based machine.

B. RESULTS ON VGG16
Fig. 3 shows the block diagram of the VGG16 network, and indicates the point where the network is split into the edge sub-model and the cloud sub-model. The split is at the output of the ''block4_pool'' layer. At this point, the feature tensor is of size 14 × 14 × 512, and its total number of elements is less than the number of pixels in the input image (which is 224 × 224 × 3). The volume of data to be transferred from the edge device to the cloud is an important consideration in collaborative intelligence [1], [20] and one wants to choose a split point where the data volume in the feature tensor is less than the data volume of the input. With such a split, the edge sub-model contains 7,635,264 (5.52%) of the total (trainable) parameters of the VGG16 network, and the cloud sub-model contains the remaining 130,722,280 (94.48%) of the total (trainable) parameters. This is a reasonable workload distribution considering the computational resources available at the edge and in the cloud.
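The data-volume and workload figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones stated in the text.

```python
# Number of elements in the feature tensor versus the input image.
feature_elems = 14 * 14 * 512
input_elems = 224 * 224 * 3

# Share of trainable parameters in the edge and cloud sub-models of VGG16.
edge_params, cloud_params = 7_635_264, 130_722_280
total_params = edge_params + cloud_params
edge_share = 100 * edge_params / total_params
cloud_share = 100 * cloud_params / total_params

print(feature_elems, input_elems)                  # 100352 150528
print(round(edge_share, 2), round(cloud_share, 2))  # 5.52 94.48
```

The feature tensor therefore holds two thirds as many elements as the input image.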
Note that in the VGG16 network, each convolutional layer applies the Rectified Linear Unit (ReLU) activation to its output. Hence, the resulting feature tensors already contain many zero elements. Since the missing rows in the received tensor are initially filled in with zeros (the NC case), we can expect that the non-completed tensors will be relatively similar to the completed tensors. To illustrate this point, in Fig. 4 we show the percentage of zero and non-zero elements in the feature tensors for each loss value. As seen in the figure, the percentage of zero elements in the feature tensors under no loss (p_loss = 0%) is already more than 75% and does not increase proportionally to packet loss. Hence, we expect that in this case, the difference in classification between NC and TC cases will be relatively small, as will be confirmed by quantitative results later in this section.

1) EXECUTION SPEED
First we compare the execution speed of the four tensor completion algorithms and present the results in Fig. 5. The solid curves in the figure represent the average tensor completion speed over the test set of 1,000 images, across various packet loss levels. The shaded band around solid curves indicates one standard deviation of the execution speed. As seen in the figure, ALTeC is significantly faster than the other methods, with FCP being the second-fastest (due to its built-in convergence metric), followed by SiLRTC and HaLRTC, as expected.
Part of the reason for the speed advantage of ALTeC over the other three algorithms comes from avoiding the shrinkage-based SVD and CPD at each iteration. Another reason is that, in each iteration, SiLRTC, HaLRTC, and FCP update all elements of the tensor only to replace the non-missing values once the update is done. Meanwhile, ALTeC only spends computation on the tensor elements that are actually missing. ALTeC does require off-line training; however, this can be done at the same time as the main model is trained, so in terms of the overall collaborative system design, it does not add any extra overhead, and yet results in run-time savings upon system deployment.

2) CLASSIFICATION ACCURACY
Table 1 shows the Top-1 classification accuracy and its standard deviation for several cases. The first three numerical columns show the accuracy under no packet loss (µ_NL), the accuracy with no tensor completion (µ_NC) and its standard deviation (σ_NC). Note that due to the large number of zeros produced by the ReLU activation functions in certain VGG16 layers (Fig. 4), µ_NC is fairly close to µ_NL, as predicted earlier. Even under 30% loss, the classification accuracy drops by less than 4%.
The next two columns show the accuracy (µ_TC) and standard deviation (σ_TC) for the four tensor completion algorithms with default settings. By this we mean that SiLRTC, HaLRTC, and FCP are run for K = 50 iterations, as explained earlier. However, this means that their execution time is significantly longer than that of ALTeC (Fig. 5). Therefore, in the last two columns, we report the results for the matched execution speed, where we only let SiLRTC, HaLRTC, and FCP run until their execution time matches that of ALTeC. This means that they were only able to run 1-5 (SiLRTC & HaLRTC) or 5-20 (FCP) iterations, depending on the case.
In the default-settings case, the accuracies of all four methods were similar, with the maximum difference of around 0.25% (between SiLRTC and HaLRTC) under 30% loss. To test for statistical significance of these differences, we applied Welch's t-test for samples with unequal variance [21]. The resulting p-values are shown in the middle three columns of Table 2. In experimental sciences, a p-value of less than 0.05 is usually taken as a sign of statistically significant difference. As seen in Table 2, all pairwise differences between average accuracies in the default-settings case were insignificant, except for the difference between SiLRTC & HaLRTC for 25% & 30% loss and SiLRTC & FCP for 15%, 20%, and 25% loss, and these cases are indicated with green shading in the table. Since no algorithm came out as the clear winner (i.e., provided significantly better results than all alternatives) no accuracy in the corresponding column in Table 1 is indicated in bold.
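For reference, the statistic behind Welch's t-test can be computed as in the sketch below, which uses only the Python standard library. The sample values are made up for illustration and are not results from the paper. The p-value is then obtained from the Student t distribution with the computed degrees of freedom, for example through `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)     # squared standard error of the first sample mean
    vb = variance(b) / len(b)     # squared standard error of the second sample mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Illustrative samples only.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```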
Likewise, in the matched-speed case, there was no clear winner that statistically outperformed all its rivals, as indicated by p-values in Table 2. Thus, again, no accuracy in the corresponding column in Table 1 is indicated in bold. Overall, results obtained from the VGG16 model suggest that ALTeC offers equivalent performance to SiLRTC, HaLRTC, and FCP, both when complexity is not constrained and when the algorithms are constrained to be equally fast.

3) INTERESTING EXAMPLES
Finally, we highlight several interesting examples that were observed during testing of tensor completion algorithms on VGG16. Here, all tensor completion algorithms run in the default (i.e., not speed-matched) configuration. Table 3 shows classification predictions made on two images: #102 ('Sleeping Bag' or 'SB') and #3 ('Bulbul'). Ground truth labels are listed in the row that starts with ''GT Label''. The next two rows show the results under no loss (NL). In other words, these are the outputs obtained from the pre-trained VGG16 model. For image #102, we see that the Top-1 label is wrong ('Cloak') and the model is fairly confident about it (74.83%). After packet loss of 30%, but without tensor completion (NC), the model is still wrong ('Cloak'), but it is less confident than before (49.73%). Finally, after tensor completion (TC), ALTeC produces the correct result ('SB'), while SiLRTC, HaLRTC, and FCP are wrong, but not as confident about it as the original model (around 52-56%).
For image #3, the pre-trained model is correct ('Bulbul'), and also confident about it (95.80%). After packet loss, but without tensor completion (NC), the model produces a wrong result ('Kite') with confidence of 42.24%. All tensor completion algorithms lead to a wrong result ('Kite'), but ALTeC leads to the least confidence about this result (36.93%) while SiLRTC, HaLRTC, and FCP increase their confidence about the wrong decision (to 43-46%) compared to the NC case. Although all tensor completion methods lead to a wrong decision in this case, one could argue that ALTeC is still better than the other three methods because with the ALTeC-completed tensor, the VGG16 model is least confident about its wrong answer.

C. RESULTS ON RESNET34
Fig. 6 shows the architecture of the ResNet34 model. By similar reasoning as in Section V-B, we decided to split the model at layer ''add_7''. With this split, the edge sub-model produces tensors of size 28 × 28 × 128, which contain the same number of elements as in the VGG16 case. Also, the edge sub-model contains 1,351,619 (6.19%) trainable parameters, while the cloud sub-model contains the remaining 20,488,488 (93.81%) trainable parameters. Again, this is a reasonable workload distribution considering the computational resources available at the edge and in the cloud.
FIGURE 6. ResNet34 model split at layer ''add_7'' [18].
ResNet34, like VGG16, uses ReLU activation functions. However, it also makes use of batch normalization [22], which centers the data distribution prior to adding it to the data passed down the residual (skip) connection [18]. Hence, the tensor produced by the edge sub-model in this case does not contain as many zeros as the one produced by the VGG16 edge sub-model. Indeed, Fig. 7 shows that without loss, there are virtually no zeros in the tensor produced by the ResNet34 edge sub-model. As the loss increases, the percentage of zero elements in the received tensor increases proportionally, and we can expect a larger difference between non-completed and completed tensors than we had in the VGG16 case.

1) EXECUTION SPEED
Similarly to the VGG16 case, we tested the execution speed of the four tensor completion algorithms and report the results in Fig. 8. The solid curves represent the average tensor completion speed over the test set of 1,000 images, across various packet loss levels. The shaded band around solid curves indicates one standard deviation of the execution speed. As before, ALTeC is significantly faster than the other three methods, with FCP being the second-fastest, followed by SiLRTC and HaLRTC.
When comparing these results to those in Fig. 5, we note that the execution speeds of SiLRTC and HaLRTC are similar, about 1.2-1.5 seconds/tensor. Likewise, the execution speeds of FCP are similar at about 0.25-0.30 seconds/tensor. This is not surprising considering that in both VGG16 and ResNet34 cases, the tensors (and their unfolded versions) have the same number of elements. However, ALTeC is noticeably faster on ResNet34 tensors (Fig. 8) than on VGG16 tensors (Fig. 5), because ResNet34 tensors have fewer channels than VGG16 tensors, so matrix $X_i^{(c)}$ in step 7 of ALTeC is smaller in the ResNet34 case, which leads to faster matrix-vector multiplication.

2) CLASSIFICATION ACCURACY
As in the VGG16 case, we compute the classification accuracy offered by the four tensor completion methods in the default case (where SiLRTC, HaLRTC, and FCP run K = 50 iterations) and in the speed-matched case, where they are only allowed to run as long as ALTeC. The results are shown in Table 4. First, note that in the case of ResNet34, there is a large difference between the no-loss case (NL) and the case with no tensor completion (NC). Specifically, for 30% loss, the difference in Top-1 classification accuracy is now over 40%, whereas in the case of VGG16 it was less than 4%. As mentioned earlier, this is due to the fact that tensors X produced by the ResNet34 edge sub-model (without loss) contain virtually no zeros, so they are quite different from the corrupted tensors X that have not been completed yet. This also means that tensor completion (TC) has the potential to bring a much higher gain over no completion (NC), compared to the VGG16 case.
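One way to realize the speed-matched setting is to stop an iterative method once a time budget has been used up. The following is a minimal sketch under stated assumptions, not the procedure used in the experiments: the function `run_with_budget`, its parameters, and the use of a wall-clock budget are all hypothetical, and `step` stands for one iteration of any iterative completion method.

```python
import time


def run_with_budget(step, state, budget_s, max_iters=50, clock=time.perf_counter):
    """Apply `step` repeatedly until the time budget or the iteration cap is reached.

    At least one iteration is always executed, so a slow method still
    produces an output even if a single iteration exceeds the budget.
    """
    start = clock()
    iters = 0
    while iters < max_iters:
        state = step(state)
        iters += 1
        if clock() - start >= budget_s:
            break
    return state, iters


# Toy usage: the "state" is a number that is halved at every iteration.
final_state, iterations = run_with_budget(lambda s: 0.5 * s, 1.0, budget_s=0.001)
print(iterations, final_state)
```

The `clock` argument is injectable so that the stopping rule can be checked deterministically, and `max_iters=50` mirrors the default K = 50 used in the unconstrained case.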
In Table 4, the middle two columns labeled $\mu_{TC}$ and $\sigma_{TC}$ correspond to the case with default settings, where SiLRTC, HaLRTC, and FCP are able to execute all K = 50 iterations until convergence. The corresponding statistical significance results using Welch's t-test are shown in the middle three columns in Table 5. As seen in Table 5, there are many more significant differences in accuracy now, compared to the VGG16 case. At the 5% loss level, ALTeC, HaLRTC, and FCP are statistically tied, and all give a higher classification accuracy than SiLRTC. At higher loss levels, HaLRTC statistically outperforms ALTeC, SiLRTC, and FCP, with ALTeC being the next best performing algorithm, followed by FCP and SiLRTC. For this reason, the corresponding accuracies are shown in bold in the table.
The columns labeled "Speed-matched" in both Table 4 and Table 5 correspond to the case when the execution speeds of the four tensor completion algorithms are matched. In this case, SiLRTC and HaLRTC are only able to run 1-3 iterations, while FCP can run 3-8 iterations. Under these conditions, ALTeC statistically outperforms the other three methods at all loss levels. In summary, ResNet34 results show that when complexity is not constrained and execution speed is of no concern, HaLRTC is the best of the four methods, followed by ALTeC in second place. When the execution speeds are matched, ALTeC is superior to the other three methods.
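For reference, the Welch's t-test statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed as in the sketch below. The function name `welch_t` and the sample values are hypothetical and are not taken from Table 4 or Table 5. Only the statistic and the degrees of freedom are computed; obtaining a p-value additionally requires the Student's t distribution, which a statistics library can provide (for example, `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two samples are not assumed to have
    equal variances.
    """
    na, nb = len(a), len(b)
    va = variance(a) / na  # squared standard error of the mean of a
    vb = variance(b) / nb  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof


# Hypothetical per-run accuracies (in %) of two completion methods.
method_a = [61.2, 60.8, 61.5, 60.9, 61.1]
method_b = [59.7, 60.1, 59.4, 60.3, 59.9]
t_stat, dof = welch_t(method_a, method_b)
print(t_stat, dof)
```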

3) INTERESTING EXAMPLES
Again, we highlight several interesting examples that were observed during the experiments on ResNet34 with default configurations of the tensor completion algorithms. Table 6 shows classification predictions made on two images: #4 ('Toyshop') and #7 ('Drake'). Ground truth labels are listed in the row that starts with "GT Label". The next two rows show the results under no loss (NL), obtained from the pre-trained ResNet34 model. In both cases, Top-1 labels are correct, but the model is relatively less confident about image #4 (16.66%). After packet loss of 30%, but without tensor completion (NC), the model makes wrong predictions in both cases (image #4 is classified as 'Abacus' and image #7 as 'Ptarmigan'), though both predictions are made with low confidence (7.89% and 26.57%, respectively).
Finally, after tensor completion (TC), ALTeC produces the correct result in the case of image #4 ('Toyshop'), but with low confidence (6.32%), while SiLRTC, HaLRTC, and FCP are wrong. In the case of image #7, all four methods produce wrong results, but ALTeC is the least confident about it (57.92% vs. 60-75% for other methods). Again, among the four wrong decisions, the best one is arguably the one about which the model is least confident. Note that in the case of image #7, although the final TC labels are wrong for all the methods, they are relatively similar to the ground truth, as both 'Drake' and 'American Coot' are birds that resemble each other.

VI. CONCLUSION AND FUTURE WORK
In this paper, we studied several methods for tensor completion in collaborative intelligence applications. Specifically, we focused on three representative methods from the literature, namely Simple Low Rank Tensor Completion (SiLRTC), High-accuracy Low Rank Tensor Completion (HaLRTC), and Fused Canonical Polyadic (FCP), and on a simple, newly developed method called Adaptive Linear Tensor Completion (ALTeC). These methods were compared on their ability to recover the missing data caused by packet loss in the feature tensors produced by VGG16 and ResNet34 image classification models.

Algorithm 1 SiLRTC [5]
Input: X, K, $\alpha_i, \beta_i \geq 0$, $\sum_{i=1}^{3} \alpha_i = 1$
Output: X
1: X ← X
2: for iter = 1, 2, ..., K do
3:   M ← zeros(X.shape)
4:   for i = 1, 2, 3 do
5:     X ← unfold(X, i)
6:   end for
10:  Replace values in M by known rows in X
12:  X ← M
13: end for

Algorithm 2 HaLRTC [5]
Input: ... shape)
4: end for
5: for iter = 1, 2, ..., K do
6:   for i = 1, 2, 3 do
7:     $Z_\tau$ ← shrink(X + Y, $\tau = \alpha_i / \rho$)
10:    $M_i$ ← fold($Z_\tau$, i)
11:  end for
12:  Replace values in X by known rows in X
14:  for i = 1, 2, 3 do
15:
16:  end for
17:  $\rho$ ← 1.2 · $\rho$
18: end for

Among the four studied methods, ALTeC was the fastest, which is well-suited for collaborative intelligence applications where inference latency is one of the important issues. Regarding reconstruction accuracy, on VGG16 tensors (which tend to be sparse), all four methods were in a statistical tie, both when SiLRTC, HaLRTC, and FCP were allowed to run sufficiently many iterations to converge and when their execution speeds were matched with that of ALTeC by restricting the number of iterations. On ResNet34 tensors (which are non-sparse), HaLRTC showed the best accuracy when it was allowed to converge, followed by ALTeC as the second-best. However, when the execution speeds were matched, ALTeC emerged as the winner.

Algorithm 3 FCP [6]
Input: X, $A_i$, $L_p$, $L_s$, K, $\mu$, $\alpha_i$, $\beta_i$, $\delta_i$, $\zeta_i$, R, $R_u$, $R_m$, $\eta$
Output: X
1: X ← X
2: for iter = 1, 2, ..., K do
3:   for i = 1, 2, 3 do
4:     S, V ← Compute S as in (9), unfold(X, i)
5:     ...

In essence, SiLRTC, HaLRTC, and FCP pay the price for their generality. By not embedding the specifics of tensors they are supposed to complete into their procedures, they need to re-discover low-rank tensor structures anew every time they start the completion procedure. By contrast, ALTeC learns simple linear relations among the rows of tensors it is supposed to complete off-line, so it is able to execute quickly at run-time. While ALTeC requires off-line training, this is quite feasible in collaborative intelligence because the main model (VGG16 or ResNet34 in this case) also requires off-line training, and ALTeC could be trained in parallel with the main model on the same data. Further, note that ALTeC does not require labeled data for training, only input data. Each input sample produces the "ground truth" tensor at a given layer of the model, which is then used to fit the parameters of ALTeC.
Note that both fast methods, as well as slower but more accurate methods, may have their place in CI tensor completion, depending on the application. For example, in a video surveillance application where a subway station is being monitored by several cameras to detect abandoned luggage, speed is of the essence, since the luggage may pose a security threat. Meanwhile, the accuracy of luggage classification (suitcase, backpack, purse) is less relevant. Hence, for such an application, inference latency (i.e., speed) would be more important than classification accuracy. On the other hand, for applications such as satellite-based surveillance of crops, speed is not as important because changes on crop fields are relatively slow. Hence, in this application, accuracy could be preferred over speed.

Algorithm 4 ALTeC
Input: X, W
Output: X
1: X ← X
2: for channel c = 1, 2, ..., n do
3:   for each row i in channel c of X do
4:     if row i is missing then
5:       Collect neighbors of the i-th row into X
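To make the idea of estimating a missing row as a linear function of its neighbors concrete, the sketch below fills each missing row of one channel with a weighted sum of the nearest received rows above and below it. This is a simplified stand-in and not Algorithm 4 itself: the neighbor set, the two fixed weights `w_up` and `w_down`, and the function name `complete_channel` are assumptions made for illustration, whereas ALTeC uses weights W fitted off-line.

```python
def complete_channel(rows, missing, w_up=0.5, w_down=0.5):
    """Fill missing rows of one channel from the nearest received rows.

    rows    : list of rows (each a list of floats); missing rows may hold any values
    missing : set of indices of rows that were lost
    w_up, w_down : hypothetical fixed weights for the rows above and below
    """
    n = len(rows)
    out = [list(r) for r in rows]
    for i in sorted(missing):
        # Nearest received row above and below the missing row, if any.
        up = next((j for j in range(i - 1, -1, -1) if j not in missing), None)
        down = next((j for j in range(i + 1, n) if j not in missing), None)
        if up is not None and down is not None:
            out[i] = [w_up * a + w_down * b for a, b in zip(rows[up], rows[down])]
        elif up is not None:
            out[i] = list(rows[up])    # only a row above was received: copy it
        elif down is not None:
            out[i] = list(rows[down])  # only a row below was received: copy it
        # If no row of the channel was received, the row is left unchanged.
    return out


channel = [[1.0, 2.0], [0.0, 0.0], [3.0, 6.0]]
print(complete_channel(channel, missing={1}))
```

Because each missing row costs only a few multiply-add operations per element and no iterative decomposition, a scheme of this kind runs in a single pass over the tensor, which is consistent with the speed advantage discussed above.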
In the future, we plan to study other tensor completion methods [23] in the context of collaborative intelligence, such as Geometric Conjugate Gradients (GeomCG) [24], Tensor SVD (t-SVD) [25], and Tensor Robust Principal Component Analysis (TRPCA) [26]. Like SiLRTC, HaLRTC, and FCP, most of these completion algorithms rely on computation-heavy procedures such as eigen-decomposition or SVD, so they are unlikely to be faster than ALTeC, but they may offer better accuracy in cases where complexity is less important. The inclusion of specific constraints found in image or video completion algorithms [27], [28] could also be explored. Furthermore, similar studies could be performed on models trained for other collaborative intelligence applications, such as object detection, segmentation, and action recognition. Finally, the completion algorithms should also be evaluated on a burst-loss channel model such as the Gilbert-Elliott model [29], which offers a more realistic representation of packet loss in real networks.
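As an illustration of the burst-loss behavior mentioned above, the following sketch generates a packet loss pattern from a two-state Gilbert-Elliott chain. The function name `gilbert_elliott`, the parameter names, and the example transition probabilities are hypothetical and have not been used in any experiment reported here.

```python
import random


def gilbert_elliott(n, p_gb, p_bg, loss_good=0.0, loss_bad=1.0, seed=None):
    """Generate a loss pattern for n packets from a two-state Markov chain.

    p_gb : probability of moving from the good state to the bad state
    p_bg : probability of moving from the bad state back to the good state
    loss_good, loss_bad : packet loss probability within each state
    Returns a list of booleans, True meaning the packet is lost.
    """
    rng = random.Random(seed)
    bad = False  # the chain starts in the good state
    lost = []
    for _ in range(n):
        lost.append(rng.random() < (loss_bad if bad else loss_good))
        if bad:
            bad = rng.random() >= p_bg  # stay in the bad state unless the channel recovers
        else:
            bad = rng.random() < p_gb
    return lost


pattern = gilbert_elliott(40, p_gb=0.05, p_bg=0.45, seed=0)
print("".join("x" if is_lost else "." for is_lost in pattern))
```

With `loss_good=0` and `loss_bad=1`, the long-run loss rate is p_gb / (p_gb + p_bg) and the mean burst length is 1 / p_bg, so the same average loss rate as in an independent-loss channel can be obtained with losses clustered into bursts.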

APPENDIX TENSOR COMPLETION ALGORITHMS PSEUDO CODE
See Algorithms 1-4.

IVAN V. BAJIĆ (Senior Member, IEEE) is currently a Professor of engineering science and the Co-Director of the Multimedia Lab, Simon Fraser University, Burnaby, BC, Canada. His research interests include signal processing and machine learning with applications to multimedia processing, compression, and collaborative intelligence. He has authored about a dozen and coauthored another ten dozen publications in these fields. Several of his articles have received awards, most recently at ICIP 2019. He is an Elected Member of the IEEE Multimedia Signal Processing Technical Committee and the IEEE Multimedia Systems and Applications Technical Committee. He has served on the organizing and/or program committees for the main conferences in the field, and has received five reviewer awards, most recently at ICASSP 2019. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and the IEEE Signal Processing Magazine. He is also an Area Editor of Signal Processing: Image Communication.