Local Re-encoding for Coded Matrix Multiplication

Matrix multiplication is a fundamental operation in various machine learning algorithms. With the sizes of datasets increasing rapidly, it is now a common practice to distribute the computation over multiple servers. As straggling servers are inevitable in a distributed infrastructure, various coding schemes have been proposed to tolerate potential stragglers. However, as resources are shared with other jobs in a distributed infrastructure and their performance can change dynamically, the optimal way to encode the input matrices is not static. So far, all existing coding schemes require encoding the input matrices in advance, and cannot change the coding schemes or adjust their parameters flexibly. In this paper, we propose a framework that can change the coding schemes and/or their parameters by only locally re-encoding the coded task on each server. We first present this framework for entangled polynomial codes, allowing us to change the coding parameters with marginal overhead and to reduce the job completion time. We then extend the framework to matrices with bounded entries, achieving a higher level of flexibility for local re-encoding while maintaining better numerical stability.


Matrix multiplication is an essential building block in various machine learning algorithms. With the growing sizes of datasets, it is now common that the input matrices are too large for the multiplication to be computed on a single server.
Therefore, it becomes inevitable to run such algorithms on multiple servers in a distributed infrastructure, e.g., in a cloud, where each server executes a task multiplying two smaller submatrices. However, it is well known that servers in a distributed infrastructure may experience temporary performance degradation, due to load imbalance or resource congestion [2]-[4].
Therefore, when distributing computation onto multiple servers, the progress of the algorithm can be significantly affected by the tasks running on such slow or failed servers, which we call stragglers.
In order to tolerate stragglers in distributed matrix multiplication, a naive method is to replicate each task on multiple servers.

For example, with two input matrices A and B, we can split A horizontally into two submatrices A_0 and A_1, and split B vertically into B_0 and B_1. Hence, we can split the job into four tasks A_iB_j, i ∈ [0, 1] and j ∈ [0, 1], and replicate each of these four tasks on multiple servers. This method, however, requires a large number of tasks to tolerate just a small number of stragglers: to tolerate only r stragglers, we need to replicate all tasks r + 1 times.
X. Su and X. Fan are with the Graduate Center of the City University of New York. J. Li is with Queens College and the Graduate Center of the City University of New York. J. Parker is with the Department of Computer Science, Virginia Commonwealth University. X. Zhong is with the School of Software Engineering, East China Jiaotong University.
This paper was presented in part at the 2020 IEEE International Symposium on Information Theory [1].
On the other hand, coding-based approaches for distributed matrix multiplication have been proposed to tolerate stragglers more efficiently [3]-[8], where each server multiplies coded matrices that are encoded from submatrices split from A and/or B.
For example, in Fig. 1, A and B are partitioned and encoded into (A_0 + A_1) and (B_0 + B_1), respectively. Then a coded task can be created that computes (A_0 + A_1)(B_0 + B_1). If we run this coded task with the four original tasks, we can recover the four submatrices in AB once we have the results of any four of the five tasks, i.e., its recovery threshold is 4. Compared to replicating each task on two servers, this coding scheme can tolerate any single straggler with 75% fewer additional tasks.
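To make the recovery concrete, the following Python sketch (with illustrative matrix sizes and entries of our choosing) reproduces this five-task example and recovers the result of a straggling task from the coded task.

```python
import numpy as np

# A minimal sketch of the Fig. 1 example: A is split horizontally into
# A0, A1 and B vertically into B0, B1, so AB consists of the four
# blocks Ai @ Bj. Sizes are illustrative assumptions.
rng = np.random.default_rng(0)
A = rng.integers(0, 10, (4, 6)).astype(float)
B = rng.integers(0, 10, (6, 4)).astype(float)
A0, A1 = np.vsplit(A, 2)
B0, B1 = np.hsplit(B, 2)

# Five tasks: the four original products plus one coded task.
tasks = {
    (0, 0): A0 @ B0, (0, 1): A0 @ B1,
    (1, 0): A1 @ B0, (1, 1): A1 @ B1,
    "coded": (A0 + A1) @ (B0 + B1),
}

# Suppose the worker computing A1 @ B1 straggles: its block can be
# recovered from the coded result and the three remaining products.
recovered = tasks["coded"] - tasks[(0, 0)] - tasks[(0, 1)] - tasks[(1, 0)]
assert np.allclose(recovered, A1 @ B1)
```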
As the performance of servers can be affected by a number of factors, it is typical to consider the performance of different resources when choosing how tasks are encoded for distributed matrix multiplication. For example, if the CPU is the bottleneck, it is desirable to split A and B into more submatrices, i.e., making the recovery threshold higher in exchange for a lower complexity of each task. On the other hand, if the network bandwidth is limited, it becomes desirable to have a lower recovery threshold, making it possible to complete the computation with fewer tasks and lower communication overhead. Unfortunately, the performance of resources in a cloud is subject to change due to their shared nature. Meanwhile, all existing coding schemes for distributed matrix multiplication, to the best of our knowledge, require that the two coded matrices in each task be encoded in advance of the multiplication. With tasks encoded in advance, achieving an optimal tradeoff dynamically between computation and communication overhead becomes a great challenge.
To dynamically change the coding scheme, conventionally we can only encode tasks again from scratch, i.e., by reading the input matrices, splitting them into submatrices in a different way, encoding them into coded tasks with a different coding scheme or with different values of parameters, and finally placing them on different servers. This inevitably consumes a significant amount of time and network bandwidth to compute and distribute the new coded matrices. In this paper, we propose a framework for distributed matrix multiplication that supports changing the coding schemes and/or their parameters by only locally re-encoding the coded matrices on each server, i.e., each coded task only needs to be re-encoded with its local data, without receiving any additional data. We first propose a framework of local re-encoding for entangled polynomial codes [6]. It not only supports changing the values of the coding parameters after local re-encoding, but also allows changing the coding scheme from two other representative codes for matrix multiplication, i.e., polynomial codes [7] and MatDot codes [8], to entangled polynomial codes.
Besides local re-encoding for entangled polynomial codes, we extend the original framework by further mitigating two issues. First, in our original framework the flexibility of re-encoding is limited, since the recovery threshold changes after local re-encoding. Second, all three coding schemes above may suffer from poor numerical stability for the multiplication of matrices with real numbers. The decoded result may differ significantly from the actual result after even small perturbations during decoding, caused by the imprecision of floating-point numbers, as the generator matrix of these three coding schemes is a Vandermonde matrix, which is known to be numerically unstable [9], [10]. Hence, the numerical error after local re-encoding may become significantly higher than the original one. To mitigate these two issues, we further extend our framework to support a variation of the entangled polynomial code for matrices with bounded entries, which maintains better numerical stability [11].
Meanwhile, a more flexible tradeoff between numerical stability and computational overhead can be achieved in the extended framework.
We demonstrate the performance of our framework of local re-encoding through experiments running in both our local cluster and Microsoft Azure. The experiments consistently show that the time of re-encoding can be significantly reduced. We also demonstrate that, for matrices with bounded entries, the numerical error after re-encoding can be kept at the same level as before re-encoding. Besides the tradeoff between computation and communication, our framework can also achieve a tradeoff between numerical error and job completion time.

II. MOTIVATING EXAMPLES
To demonstrate the impact of different resources on the performance of distributed matrix multiplication, we run a job that multiplies two matrices of size 4096 × 4096 in our local cluster. The tasks in the job are encoded with a polynomial code, a MatDot code, or an entangled polynomial code. The job is implemented with Open MPI [12]. The results of tasks are calculated on servers called workers, and all workers have the same hardware configuration. Each worker uploads the result of its task to another server called the master. The number of workers in each job is chosen to tolerate at most 5 stragglers.

When the number of results received by the master is sufficient, the master stops receiving new results and decodes the received ones. Hence, the job completion time includes the time of executing tasks on workers, uploading the results to the master, and decoding the results on the master.
In our experiment, we observe how the performance of the job, in terms of its completion time, changes with network bandwidth. In order to change the network bandwidth, we use iperf to send additional traffic at a fixed throughput of 3 Gbps from another server to the master, which competes for the network bandwidth with all the workers. With additional traffic, the job gets less available bandwidth and needs more time to finish. However, as we run the same job with different coding schemes, different coding schemes can be affected differently by the loss of available bandwidth, and we present two examples in Fig. 2.
In Fig. 2a, we first compare the performance of a polynomial code and an entangled polynomial code. With the traffic described above, the entangled polynomial code completes the job 11.1% slower than the polynomial code. When such traffic is stopped, however, it becomes 34.2% faster. We can observe the same overtaking in Fig. 2b, between a MatDot code and an entangled polynomial code. The MatDot code is similarly 6.7% faster originally, when there is less available bandwidth, but becomes 32.5% slower when there is no such traffic.
From the examples above, we can see that entangled polynomial codes are more easily affected by the available bandwidth than the polynomial code and the MatDot code. This is because, in our experiment, the task encoded by the entangled polynomial code has a lower complexity, but the master also needs to receive results from more workers before decoding. Furthermore, the three coding schemes have different decoding complexities: if the CPU is shared with another job on the master or a worker, their completion times can also be affected differently. As resource availability is subject to frequent changes in the cloud, it is challenging to choose the optimal coding scheme and parameters in advance. Therefore, we propose a local re-encoding framework that allows changing the coding scheme and/or its parameters dynamically with marginal overhead.

III. RELATED WORK
In order to tolerate stragglers in a job of distributed computing, a common method is to relaunch the affected tasks on a replacement server after the straggler is detected [13], [14]. Since relaunching a task on another server may still consume a significant amount of time, we can instead launch each task in the job on multiple different servers at the beginning of the job [7], [13], [15]-[18]. In this way, only one replica of each task needs to finish. However, replication-based methods suffer from high resource consumption: to tolerate any r stragglers, each task needs to be replicated on r + 1 servers.
Compared with replication, the existing literature has proposed a number of coding-based methods that can tolerate the same number of stragglers by adding fewer additional tasks. In the first effort, made by Lee et al. [3], a coding scheme based on MDS codes was proposed for matrix-vector multiplication, where the matrix is horizontally split and encoded to create coded tasks. More coding schemes, such as sparse coding [5] and rateless coding [19], were also proposed for matrix-vector multiplication.
As for matrix-matrix multiplication, the coding schemes need to split and encode both input matrices, as both of them may be large. Coding schemes based on product codes create the coded tasks in two steps [20]-[23]: a job is first encoded into intermediate coded tasks by applying one coding scheme to one matrix, and each intermediate task is encoded again by applying another (or the same) coding scheme to the other matrix. However, since the actual tasks are created from intermediate coded tasks, the patterns of stragglers that can be tolerated become limited, as each intermediate task needs to be decodable. On the other hand, polynomial codes [7] and MatDot codes [8] can directly encode the two input matrices while tolerating any pattern of stragglers, as long as the number of remaining tasks is at least the corresponding recovery threshold. However, in these two coding schemes each input matrix can only be split in one dimension, either vertically or horizontally. More generally, entangled polynomial codes [6] and PolyDot codes [8] support splitting the two input matrices vertically and horizontally at the same time.
Many existing coding schemes for distributed matrix multiplication are constructed based on polynomials [3], [6], [7]. The two input matrices A and B are encoded as the evaluations of two polynomials Ã(x) and B̃(x), where the value of x in each task must be unique. By carefully designing the polynomials, decoding can be done by interpolating the coefficients of C̃(x) = Ã(x)B̃(x), such that all submatrices of AB appear among the coefficients of C̃(x). However, given multi-point evaluations of C̃(x), the interpolation is equivalent to solving a linear system with a Vandermonde matrix, which is known to have a large condition number: a small perturbation in the system, inevitable due to the limited precision of floating-point numbers, may lead to a large error after decoding [10]. Hence, numerically stable codes have been proposed to achieve the same level of tolerance against stragglers with much lower numerical errors after decoding [9], [11].
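The following sketch (with illustrative evaluation points of our choosing) shows how quickly the condition number of a real Vandermonde matrix grows with its size, which is the root cause of this instability.

```python
import numpy as np

# Interpolation over k real evaluation points solves a linear system
# whose matrix is Vandermonde; its condition number grows rapidly with k.
for k in (5, 10, 20, 30):
    points = np.linspace(1, 2, k)            # example evaluation points
    V = np.vander(points, increasing=True)
    print(k, np.linalg.cond(V))
```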
Conventionally, computation is considered the major bottleneck in distributed systems. However, other resources such as network bandwidth may also become the bottleneck, as shown in Sec. II. The tradeoff between multiple resources can make the choice of coding schemes and parameters more difficult, and in many cases the optimal choice changes dynamically over time. To address this problem, re-encoding was first proposed in distributed storage systems, where Maturana and Rashmi proposed convertible codes that allow changing the parameters of MDS codes with the optimal overhead of data transfer [24], [25]. In this paper, we propose, for the first time, a coding framework that supports the re-encoding of tasks for distributed matrix multiplication. We present a framework of local re-encoding where the coding scheme and/or its parameters can be changed with no data transfer. Moreover, our framework can be further extended to maintain numerical stability. With the low overhead of local re-encoding, flexible tradeoffs can easily be achieved, such as that between computation and communication, and that between numerical stability and job completion time.

IV. PRELIMINARIES
In this paper, we demonstrate that coded tasks with polynomial codes [7] or MatDot codes [8] can be locally re-encoded into tasks with entangled polynomial codes [6], or be updated to have different values of their parameters. We first present the preliminary knowledge about these coding schemes in this section; our coding framework in the rest of this paper will be based on this background. We assume that coded tasks are created for the multiplication of two large matrices A and B, i.e., AB.

A. Polynomial Code
Polynomial codes assume that the input matrices A and B can be horizontally and vertically split into m and n submatrices, respectively. In other words, A = [A_0; ...; A_{m−1}] and B = [B_0, ..., B_{n−1}], and the result of the multiplication can be obtained once we have the mn submatrices A_xB_y in AB. A polynomial code encodes A and B as two polynomial functions of δ, i.e., Ã_P(δ) = Σ_{x=0}^{m−1} A_x δ^{nx} and B̃_P(δ) = Σ_{y=0}^{n−1} B_y δ^y, respectively. A coded task then multiplies Ã_P(δ) with B̃_P(δ). Note that the values of δ must be different among all coded tasks. Hence, we have

Ã_P(δ)B̃_P(δ) = Σ_{x=0}^{m−1} Σ_{y=0}^{n−1} A_x B_y δ^{nx+y},   (1)

which is also a polynomial function of δ with a degree of mn − 1, where A_xB_y, x ∈ [0, m−1] and y ∈ [0, n−1], appears as the coefficient of δ^{nx+y}. In other words, if we run tasks of Ã_P(δ)B̃_P(δ) with different values of δ on multiple servers, we can recover all the coefficients from the results of any mn tasks by polynomial interpolation or Reed-Solomon decoding. Hence, its recovery threshold is mn.
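As a concrete illustration, the following Python sketch (with illustrative sizes and evaluation points of our choosing) encodes a polynomial code with m = n = 2, runs mn = 4 tasks, and decodes the blocks of AB by solving the Vandermonde system.

```python
import numpy as np

# A minimal polynomial-code sketch with m = n = 2 (sizes and evaluation
# points are illustrative assumptions). Each "task" multiplies the two
# encoded matrices at its own point delta.
rng = np.random.default_rng(1)
m, n = 2, 2
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 4))
A_blocks = np.vsplit(A, m)          # A_x, x in [0, m-1]
B_blocks = np.hsplit(B, n)          # B_y, y in [0, n-1]

def task(delta):
    A_enc = sum(A_blocks[x] * delta ** (n * x) for x in range(m))
    B_enc = sum(B_blocks[y] * delta ** y for y in range(n))
    return A_enc @ B_enc

# Any mn = 4 results suffice; interpolate the coefficients entrywise.
deltas = np.array([1.0, 2.0, 3.0, 4.0])
results = np.stack([task(d) for d in deltas])            # shape (4, 2, 2)
V = np.vander(deltas, m * n, increasing=True)            # Vandermonde system
coeffs = np.linalg.solve(V, results.reshape(4, -1)).reshape(4, 2, 2)

# The coefficient of delta^(n*x + y) is the block A_x B_y.
assert np.allclose(coeffs[n * 1 + 1], A_blocks[1] @ B_blocks[1])
```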

B. MatDot Code
MatDot codes assume that A and B are split vertically and horizontally into p submatrices, respectively. In other words, A = [A_0, ..., A_{p−1}] and B = [B_0; ...; B_{p−1}], so that AB = Σ_{l=0}^{p−1} A_l B_l. A MatDot code encodes A and B as Ã_MD(δ) = Σ_{x=0}^{p−1} A_x δ^x and B̃_MD(δ) = Σ_{y=0}^{p−1} B_y δ^{p−1−y}. For example, if p = 2, Ã_MD(δ) = A_0 δ^0 + A_1 δ^1 and B̃_MD(δ) = B_1 δ^0 + B_0 δ^1. Then their multiplication equals A_0B_1 + (A_0B_0 + A_1B_1)δ + A_1B_0 δ^2. Hence, the result of AB appears as the coefficient of δ^1.

In general, we have

Ã_MD(δ)B̃_MD(δ) = Σ_{x=0}^{p−1} Σ_{y=0}^{p−1} A_x B_y δ^{x+p−1−y}.   (2)

Observing the coefficients in (2), we can see that the coefficient of δ^t with t = p − 1 is Σ_{l=0}^{p−1} A_l B_l = AB. Hence, after interpolating the coefficients of Ã_MD(δ)B̃_MD(δ), we will be able to obtain the result of AB. Since the degree of Ã_MD(δ)B̃_MD(δ) is 2p − 2, the recovery threshold of MatDot codes is 2p − 1.
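A corresponding sketch for a MatDot code with p = 2, again with illustrative sizes and evaluation points of our choosing, is shown below; note that only the coefficient of δ^{p−1} is useful, while the other 2p − 2 coefficients are discarded.

```python
import numpy as np

# A minimal MatDot sketch with p = 2: AB appears as the coefficient of
# delta^(p-1). Sizes and evaluation points are illustrative assumptions.
rng = np.random.default_rng(2)
p = 2
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 4))
A_blocks = np.hsplit(A, p)          # vertical split of A: A_x
B_blocks = np.vsplit(B, p)          # horizontal split of B: B_y

def task(delta):
    A_enc = sum(A_blocks[x] * delta ** x for x in range(p))
    B_enc = sum(B_blocks[y] * delta ** (p - 1 - y) for y in range(p))
    return A_enc @ B_enc

# The product has degree 2p - 2 = 2, so any 2p - 1 = 3 results decode it.
deltas = np.array([1.0, 2.0, 3.0])
results = np.stack([task(d) for d in deltas])
V = np.vander(deltas, 2 * p - 1, increasing=True)
coeffs = np.linalg.solve(V, results.reshape(3, -1)).reshape(3, 4, 4)
assert np.allclose(coeffs[p - 1], A @ B)    # coefficient of delta^(p-1)
```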

C. Entangled Polynomial Code
Allowing a more general partition of A and B than polynomial codes and MatDot codes, entangled polynomial codes can further reduce the complexity of coded tasks with a more flexible choice of the recovery threshold. Entangled polynomial codes assume that A and B are split both vertically and horizontally into m × p and p × n submatrices, respectively. In other words, A consists of the submatrices A_{x,z}, x ∈ [0, m−1] and z ∈ [0, p−1], and B consists of the submatrices B_{z,y}, z ∈ [0, p−1] and y ∈ [0, n−1].   (3)

With an entangled polynomial code, each server runs a task that calculates Ã_EP(δ)B̃_EP(δ), where Ã_EP(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A_{x,z} δ^{pnx+z} and B̃_EP(δ) = Σ_{y=0}^{n−1} Σ_{z=0}^{p−1} B_{z,y} δ^{py+p−1−z}. Still, Ã_EP(δ)B̃_EP(δ) is a polynomial function of δ whose degree is mnp + p − 2. Hence, we can interpolate its coefficients with any mnp + p − 1 tasks that have different values of δ. In particular, any t ∈ [0, mnp + p − 2] can be uniquely written as

t = pnx + py + s,   (4)

and when s = p − 1, the coefficient of δ^t is Σ_{l=0}^{p−1} A_{x,l} B_{l,y}, i.e., the (x, y)-th submatrix of AB. Thus, we can obtain the mn submatrices in AB from the coefficients of Ã_EP(δ)B̃_EP(δ), and the recovery threshold of the entangled polynomial code is mnp + p − 1.
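The following sketch instantiates an EP code with m = n = p = 2 (illustrative sizes and evaluation points of our choosing) and verifies that the (x, y)-th block of AB appears as the coefficient of δ^{pnx+py+p−1}.

```python
import numpy as np

# A minimal entangled polynomial (EP) code sketch with m = n = p = 2
# (sizes and evaluation points are illustrative assumptions).
rng = np.random.default_rng(3)
m, n, p = 2, 2, 2
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))
# A_{x,z}: m x p grid of blocks; B_{z,y}: p x n grid of blocks.
Ab = [np.hsplit(r, p) for r in np.vsplit(A, m)]
Bb = [np.hsplit(r, n) for r in np.vsplit(B, p)]

def task(delta):
    A_enc = sum(Ab[x][z] * delta ** (p * n * x + z)
                for x in range(m) for z in range(p))
    B_enc = sum(Bb[z][y] * delta ** (p * y + p - 1 - z)
                for z in range(p) for y in range(n))
    return A_enc @ B_enc

# Recovery threshold is mnp + p - 1 = 9 tasks.
k = m * n * p + p - 1
deltas = np.arange(1.0, k + 1)
results = np.stack([task(d) for d in deltas])
V = np.vander(deltas, k, increasing=True)
coeffs = np.linalg.solve(V, results.reshape(k, -1)).reshape(k, 2, 2)
# The coefficient of delta^(pnx + py + p - 1) is the (x, y) block of AB.
x, y = 1, 0
assert np.allclose(coeffs[p * n * x + p * y + p - 1],
                   sum(Ab[x][l] @ Bb[l][y] for l in range(p)))
```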
By comparing (4) with (1) and (2), we can see that polynomial codes and MatDot codes can be considered as special cases of entangled polynomial codes. When p = 1, the corresponding entangled polynomial code becomes a polynomial code whose recovery threshold is mn · 1 + 1 − 1 = mn. When m = n = 1, it becomes a MatDot code whose recovery threshold is 1 · 1 · p + p − 1 = 2p − 1.

V. LOCAL RE-ENCODING FOR ENTANGLED POLYNOMIAL CODES
In this paper, we present a framework that allows not only changing the coding scheme of a task with local re-encoding, but also changing the values of its parameters. Specifically, we propose a framework that achieves the following property:

Theorem 1: A task encoded with an (m, n, p) entangled polynomial (EP) code can be locally re-encoded into a task encoded with a (λ_m m, λ_n n, λ_p p) EP code, where λ_m, λ_n, and λ_p are positive integers.

According to this theorem, if a job is originally encoded with an (m, n, p) EP code, we are able to further split Ã_EP and B̃_EP, and re-encode them directly into a new coded task which is equivalent to one encoded with a (λ_m m, λ_n n, λ_p p) EP code. While conventional re-encoding needs to encode the original matrices A and B again from scratch, local re-encoding requires no additional data from any remote server, leading to marginal overhead. By reducing the complexity of each task by λ_m λ_n λ_p times while increasing the recovery threshold to λ_m m · λ_n n · λ_p p + λ_p p − 1, our framework achieves a flexible tradeoff between computation and communication overhead.

The change of coding schemes can also be supported by this theorem. In fact, polynomial codes can be seen as a special case of EP codes with p = 1, and MatDot codes can be seen as a special case with m = n = 1. From Sec. IV we can easily verify that Ã_P(δ) = Ã_EP(δ) when p = 1, and Ã_MD(δ) = Ã_EP(δ) when m = n = 1; the same equivalence also holds for B̃. Hence, as a special case, Theorem 1 allows changing the coding scheme and the parameters of a polynomial code or a MatDot code. In the rest of this section, we will present the detailed framework, which also proves Theorem 1. We first consider three special cases where only p, m, or n is changed, in Sec. V-A, Sec. V-B, and Sec. V-C, and then present the general case where an (m, n, p) EP code is re-encoded into a (λ_m m, λ_n n, λ_p p) EP code in Sec. V-D. For convenience, we may omit the subscript EP in Ã_EP(δ) and B̃_EP(δ) in the rest of this section when there is no ambiguity, i.e., Ã(δ) = Ã_EP(δ) and B̃(δ) = B̃_EP(δ).

A. Changing p to λ_p p
We first show that a task with an (m, n, p) EP code can be locally re-encoded into a task with an (m, n, λ_p p) EP code. We assume that the two input matrices A and B are originally split into mp and np partitions as in (3), and have been encoded into coded tasks with Ã(δ) and B̃(δ).

In order to re-encode Ã(δ) and B̃(δ), we further split Ã(δ) vertically into λ_p partitions Ã_0(δ), ..., Ã_{λ_p−1}(δ), and B̃(δ) horizontally into λ_p partitions B̃_0(δ), ..., B̃_{λ_p−1}(δ). Since Ã(δ) and B̃(δ) are linear combinations of the submatrices of A and B, we are equivalently splitting each A_{x,z} into λ_p partitions vertically and each B_{z,y} into λ_p partitions horizontally. In other words, we can re-write A and B as grids of m × λ_p p and λ_p p × n submatrices A'_{x,z'} and B'_{z',y},   (5)

where we split A_{x,z} as [A'_{x,λ_p z} ⋯ A'_{x,λ_p z+λ_p−1}] and B_{z,y} as [B'_{λ_p z,y}; ⋯; B'_{λ_p z+λ_p−1,y}]. Then we have

Ã_l(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A'_{x,λ_p z+l} δ^{pnx+z}   (6)

and

B̃_l(δ) = Σ_{y=0}^{n−1} Σ_{z=0}^{p−1} B'_{λ_p z+l,y} δ^{py+p−1−z}.   (7)

Let σ be a value such that δ = σ^{λ_p}. We re-encode the task as

Ã'(σ) = Σ_{l=0}^{λ_p−1} Ã_l(δ) σ^l = Σ_{x=0}^{m−1} Σ_{z'=0}^{λ_p p−1} A'_{x,z'} σ^{λ_p pnx+z'}   (8)

and

B̃'(σ) = Σ_{l=0}^{λ_p−1} B̃_{λ_p−1−l}(δ) σ^l = Σ_{y=0}^{n−1} Σ_{z'=0}^{λ_p p−1} B'_{z',y} σ^{λ_p py+λ_p p−1−z'}.   (9)

From (8) and (9), we can see that the task after re-encoding is equivalent to A and B encoded with an (m, n, λ_p p) EP code and evaluated at σ.
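The following sketch illustrates this re-encoding step for a (2, 2, 1) EP code (i.e., a polynomial code) re-encoded into a (2, 2, 2) EP code with λ_p = 2; the matrix sizes and the evaluation point are illustrative assumptions, and σ is taken as a real λ_p-th root of δ.

```python
import numpy as np

# Local re-encoding from an (m, n, p) = (2, 2, 1) EP code into a
# (2, 2, 2) EP code with lambda_p = 2. The worker only manipulates its
# local coded matrices.
rng = np.random.default_rng(4)
A = rng.standard_normal((4, 8))
B = rng.standard_normal((8, 4))
A0, A1 = np.vsplit(A, 2)                     # A_{x,0}, x in [0, 1]
B0, B1 = np.hsplit(B, 2)                     # B_{0,y}, y in [0, 1]

delta = 2.0                                  # original evaluation point
A_enc = A0 + A1 * delta ** 2                 # (2, 2, 1) EP encoding of A
B_enc = B0 + B1 * delta                      # (2, 2, 1) EP encoding of B

# Local re-encoding: split A_enc vertically and B_enc horizontally into
# lambda_p = 2 parts, and recombine with sigma, a lambda_p-th root of delta.
sigma = delta ** 0.5
Al = np.hsplit(A_enc, 2)
Bl = np.vsplit(B_enc, 2)
A_new = Al[0] + Al[1] * sigma                # sum_l A_l(delta) sigma^l
B_new = Bl[1] + Bl[0] * sigma                # sum_l B_{1-l}(delta) sigma^l

# The result matches encoding A and B from scratch with a (2, 2, 2) EP
# code evaluated at sigma.
Ab = [np.hsplit(r, 2) for r in np.vsplit(A, 2)]
Bb = [np.hsplit(r, 2) for r in np.vsplit(B, 2)]
A_ref = sum(Ab[x][z] * sigma ** (4 * x + z) for x in range(2) for z in range(2))
B_ref = sum(Bb[z][y] * sigma ** (2 * y + 1 - z) for z in range(2) for y in range(2))
assert np.allclose(A_new, A_ref) and np.allclose(B_new, B_ref)
```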

B. Changing m to λ_m m
Now we show that a task with an (m, n, p) EP code can be locally re-encoded into a task with a (λ_m m, n, p) EP code. Since B̃(δ) is not a function of m, we only need to re-encode Ã(δ) when we adjust the value of m.

In this case, we split Ã(δ) into λ_m partitions horizontally, i.e., Ã(δ) = [Ã_0(δ); ⋯; Ã_{λ_m−1}(δ)]. Similar to the case above, A is equivalently split further into λ_m m partitions horizontally, where A_{x,z} = [A'_{λ_m x,z}; ⋯; A'_{λ_m x+λ_m−1,z}]. In other words, we have Ã_l(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A'_{λ_m x+l,z} δ^{pnx+z}. When m is changed to λ_m m, we re-encode Ã(δ) as

Ã'(δ) = Σ_{l=0}^{λ_m−1} Ã_l(δ) δ^{lpmn} = Σ_{l=0}^{λ_m−1} Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A'_{λ_m x+l,z} δ^{pn(lm+x)+z}.

Here, after re-encoding, we obtain Ã'(δ), which is encoded by a (λ_m m, n, p) EP code from a matrix A' whose rows are switched from those in A: the (λ_m x + l)-th row block of A appears as the (lm + x)-th row block of A'.   (10)

Although the order of the rows in A' is switched, this does not change the result of the multiplication after decoding, as we can always switch the rows in the result back to the original order, i.e., the rows of A'B are a permutation of the rows of AB.

C. Changing n to λ_n n

Similar to Sec. V-B, we assume that B̃(δ) is further split into λ_n partitions vertically, i.e., B̃(δ) = [B̃_0(δ) ⋯ B̃_{λ_n−1}(δ)]. It means that B is equivalently split into λ_n n partitions vertically, where B_{z,y} = [B'_{z,λ_n y} ⋯ B'_{z,λ_n y+λ_n−1}]. In other words, we have B̃_l(δ) = Σ_{y=0}^{n−1} Σ_{z=0}^{p−1} B'_{z,λ_n y+l} δ^{py+p−1−z}. When we change n to λ_n n, we only need to re-encode B̃(δ) as

B̃'(δ) = Σ_{l=0}^{λ_n−1} B̃_l(δ) δ^{lpmn}.   (11)

As we will show below, although (11) cannot be directly written as B̃(δ) of an (m, λ_n n, p) EP code, it is equivalent to one, as it achieves the same recovery threshold. The degree of the polynomial Ã(δ)B̃'(δ) is pmn(λ_n − 1) + pn(m − 1) + p(n − 1) + 2p − 2 = pmλ_n n + p − 2, the same as that of an (m, λ_n n, p) EP code. However, after the interpolation, the order of the submatrices of AB will be different from their order in the EP code, and they should be shuffled back.
In order to retrieve the original order of the submatrices in AB, we first consider the order before re-encoding. From (4), we can see that any t ∈ [0, mnp + p − 2] can be uniquely written as t = pnx + py + s, and when s = p − 1, the coefficient of δ^t is Σ_{l'=0}^{p−1} A_{x,l'} B_{l',y}. In (11), instead, we can uniquely rewrite the exponent as t = pmnl + pnx + py + s, and when s = p − 1, the coefficient of δ^t is Σ_{l'=0}^{p−1} A_{x,l'} B'_{l',λ_n y+l}, i.e., the (x, λ_n y + l)-th submatrix of AB. Hence, all submatrices of AB still appear among the coefficients, only in a shuffled order.

D. The General Case

Combining the three cases together, we now consider the general case of re-encoding where (m, n, p) is changed to (λ_m m, λ_n n, λ_p p). In this case, we need to split Ã(δ) and B̃(δ) both vertically and horizontally, i.e., into λ_m × λ_p partitions Ã_{i,l}(δ) and λ_p × λ_n partitions B̃_{l,j}(δ), respectively. Correspondingly, A and B are further partitioned into λ_m m × λ_p p and λ_p p × λ_n n submatrices.   (12)

We can then get Ã_{i,l}(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A'_{λ_m x+i,λ_p z+l} δ^{pnx+z}, and B̃_{l,j}(δ) similarly. To re-encode the task, we first define δ = σ^{λ_p}, and then re-encode Ã(δ) as Σ_{i=0}^{λ_m−1} Σ_{l=0}^{λ_p−1} Ã_{i,l}(δ) δ^{ipmn} σ^l and B̃(δ) as Σ_{j=0}^{λ_n−1} Σ_{l=0}^{λ_p−1} B̃_{λ_p−1−l,j}(δ) δ^{jpλ_m mn} σ^l.   (13)

We can then prove that a task with the two matrices above after re-encoding is equivalent to one with a (λ_m m, λ_n n, λ_p p) EP code.
Similar to (11), the degree of the polynomial in (13) is λ_m mλ_n nλ_p p + λ_p p − 2, so we can interpolate all of its coefficients with any λ_m mλ_n nλ_p p + λ_p p − 1 tasks that have different values of σ. In particular, each exponent t of σ can be uniquely decomposed as in the three special cases above, and the coefficients at the positions with s = λ_p p − 1 are the submatrices Σ_l A'_{λ_m x+i,l} B'_{l,λ_n y+j} of A'B', where A' and B' are the row-shuffled and column-shuffled versions of A and B, respectively. As before, the shuffled order can be restored after decoding.

E. Complexity Analysis
We now discuss the complexity of our framework, especially the complexity of re-encoding, and compare it with the complexity of the encoding and the complexity of the task.We find that the complexity of re-encoding is marginal compared to both of them.
Since the overhead of addition is much cheaper than that of multiplication, we analyze the complexity as the number of multiplications. For convenience, we rewrite the sizes of A and B as M × P and P × N, i.e., M = Λ_m m, N = Λ_n n, and P = Λ_p p. Then the sizes of Ã(δ) and B̃(δ) are (M/m) × (P/p) and (P/p) × (N/n), respectively. When we encode a task with an (m, n, p) EP code, Ã(δ) and B̃(δ) are encoded as linear combinations of the mp submatrices in A and the np submatrices in B, respectively. Therefore, each element in A and B is multiplied with a constant, and the complexity of encoding Ã and B̃ is O(MP) and O(NP), respectively. Moreover, the constants are powers of δ, leading to pn(m−1) + (p−1) additional multiplications; this term can be ignored, as we assume A and B are large matrices.

As a comparison, when we adjust the values of (m, n, p), the complexity of re-encoding is much lower. When p changes to λ_p p, Ã and B̃ are further split into λ_p partitions each and re-encoded into linear combinations of these partitions. Hence, the numbers of multiplications are O(MP/(mp)) and O(NP/(np)), respectively. Similarly, when the value of m or n changes, the complexity is also O(MP/(mp)) or O(NP/(np)). Hence, in the general case where (m, n, p) is changed to (λ_m m, λ_n n, λ_p p), the overall complexity is O(MP/(mp) + NP/(np)). Given the sizes of Ã and B̃, the complexity of the matrix multiplication in a task with an (m, n, p) EP code is MNP/(mnp). After re-encoding, the complexity of the multiplication becomes MNP/(λ_m λ_n λ_p mnp), i.e., it is reduced by λ_m λ_n λ_p times. We can also find that the decoding overhead of the job is not affected by local re-encoding: compared to a job directly encoded with a (λ_m m, λ_n n, λ_p p) EP code, the decoding overhead after re-encoding is the same, since the new code is equivalent to a (λ_m m, λ_n n, λ_p p) EP code.
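As a quick sanity check of this analysis, the following toy cost model counts scalar multiplications for illustrative (and hypothetical) values of the sizes and parameters.

```python
# A toy cost model (counting scalar multiplications, per the analysis
# above) comparing encoding from scratch against local re-encoding; the
# concrete sizes are illustrative assumptions.
M, N, P = 4096, 4096, 4096
m, n, p = 2, 2, 2

encode_cost = M * P + N * P                          # fresh encoding touches all of A and B
reencode_cost = M * P // (m * p) + N * P // (n * p)  # only the local coded blocks
task_cost = (M // m) * (N // n) * (P // p)           # multiplication in one task

print(reencode_cost / encode_cost)                   # 0.25 for these parameters
print(reencode_cost / task_cost)                     # re-encoding is marginal
```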

VI. LOCAL RE-ENCODING FOR TANG-KONSTANTINIDIS-RAMAMOORTHY CODES
In Sec. V, we have proposed a framework of local re-encoding for EP codes, which are constructed based on polynomials, i.e., the coded task is essentially a multiplication of two polynomials. Therefore, decoding requires interpolating a polynomial from multiple evaluation points, which are the results uploaded by workers, as the results of the matrix multiplication are located in the coefficients of this polynomial. Although this is numerically stable for polynomials over finite fields, as in classical coding theory, it is not the case for matrix multiplication with real numbers. The interpolation of a polynomial effectively solves a linear system with a Vandermonde matrix, and Vandermonde matrices are well known to have large condition numbers [10], [26]-[28]. Therefore, small perturbations due to numerical precision errors can lead to large errors [29], [30], especially for polynomials with a large degree.

Compared to re-encoding all tasks from scratch, numerical stability is more easily harmed by local re-encoding. If we re-encode all tasks from scratch, we can freely change the evaluation point of the polynomials in any task. The evaluation points chosen for the original polynomial, however, cannot be changed by local re-encoding, and thus the evaluation points after local re-encoding may lead to arbitrarily high condition numbers in the corresponding Vandermonde matrix. Moreover, after changing the parameter values from (m, n, p) to (λ_m m, λ_n n, λ_p p), the corresponding recovery threshold increases from mnp + p − 1 to λ_m mλ_n nλ_p p + λ_p p − 1. In other words, there must be a sufficient number of workers such that the result of the job can still be decoded after re-encoding; such a large number of workers, however, may not be necessary originally.

In this section, we present an extension of the local re-encoding framework that supports a variation of EP codes which maintains numerical stability for matrices with bounded entries, and allows a flexible tradeoff between numerical stability and computational overhead without changing the recovery threshold. Hence, the error after re-encoding can be significantly lower than that with EP codes, and we can change the complexity of a task without changing the recovery threshold while maintaining numerical stability.

A. Background
We now briefly introduce the code construction and its properties. For convenience, we name them Tang-Konstantinidis-Ramamoorthy (TKR) codes. Interested readers may find more details of TKR codes in [11].

Assume that the input matrices A and B are originally split into m × p and p × n partitions as in (3). In addition, all entries in A and B are non-negative integers. We first demonstrate a special case of TKR codes. Assuming that ζ is a large enough integer, A and B are encoded as

Ã_TKR(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p−1} A_{x,z} ζ^z δ^{nx} and B̃_TKR(δ) = Σ_{y=0}^{n−1} Σ_{z=0}^{p−1} B_{z,y} ζ^{−z} δ^y.

In particular, the value of ζ is the same in all tasks, while the values of δ must be different. Hence, we have

Ã_TKR(δ)B̃_TKR(δ) = Σ_{x=0}^{m−1} Σ_{y=0}^{n−1} ( Σ_{t=−(p−1)}^{p−1} ( Σ_{l=max{0,t}}^{min{p−1,t+p−1}} A_{x,l} B_{l−t,y} ) ζ^t ) δ^{nx+y}.   (14)

Similar to polynomial codes, we can decode (14) with mn tasks, since the values of δ are different. Furthermore, if ζ is large enough such that all entries of the coefficients of the non-negative powers of ζ are less than ζ, the terms with negative powers of ζ only contribute a small fraction to each entry. Therefore, we can recover the coefficient of ζ^0, i.e., Σ_{l=0}^{p−1} A_{x,l} B_{l,y}, by rounding the interpolated coefficient of δ^{nx+y} in (14) to the nearest integer and then computing the remainder upon division by ζ. Although the first step of decoding still involves a Vandermonde matrix, its degree is only mn − 1, and the error can be further mitigated in the second step. Hence, it is shown in [11] that the numerical error of TKR codes is much smaller than that of EP codes.
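The following sketch (with illustrative sizes, a hypothetical choice of ζ, and small integer entries, so that floating-point rounding errors stay well below 1/2) implements this special case with m = n = p = 2 and the two-step decoding described above.

```python
import numpy as np

# A minimal sketch of the TKR special case with m = n = p = 2 (sizes,
# zeta, and evaluation points are illustrative assumptions). Entries are
# non-negative integers, and zeta is large enough that cross terms cannot
# overflow into the zeta^0 "digit".
rng = np.random.default_rng(5)
m, n, p = 2, 2, 2
A = rng.integers(0, 10, (4, 8))
B = rng.integers(0, 10, (8, 4))
Ab = [np.hsplit(r, p) for r in np.vsplit(A, m)]   # A_{x,z}
Bb = [np.hsplit(r, n) for r in np.vsplit(B, p)]   # B_{z,y}
zeta = 10 ** 6

def task(delta):
    A_enc = sum(Ab[x][z] * float(zeta) ** z * delta ** (n * x)
                for x in range(m) for z in range(p))
    B_enc = sum(Bb[z][y] * float(zeta) ** (-z) * delta ** y
                for z in range(p) for y in range(n))
    return A_enc @ B_enc

# Recovery threshold is mn = 4, even though A and B are split into
# mp = np = 4 blocks each.
deltas = np.array([1.0, 2.0, 3.0, 4.0])
results = np.stack([task(d) for d in deltas])
V = np.vander(deltas, m * n, increasing=True)
coeffs = np.linalg.solve(V, results.reshape(4, -1)).reshape(4, 2, 2)

# Round, then reduce mod zeta to strip the zeta^t terms with t != 0.
block = np.mod(np.rint(coeffs[n * 1 + 1]).astype(np.int64), zeta)
assert np.array_equal(block, sum(Ab[1][l] @ Bb[l][1] for l in range(p)))
```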
We can see that the above example of TKR codes has a recovery threshold of mn, although the input matrices are split into mp and np partitions. TKR codes also support a more general recovery threshold by trading off the numerical precision. Assume that p' | p and q = p/p'. The p vertical partitions of A (and the p horizontal partitions of B) are grouped into p' groups of q partitions each: within each group, the submatrices are combined with powers of ζ as in the special case above, while across groups, the EP structure with parameter p' is applied. In other words, we have

Ã_TKR(δ) = Σ_{x=0}^{m−1} Σ_{z=0}^{p'−1} Σ_{u=0}^{q−1} A_{x,qz+u} ζ^u δ^{p'nx+z} and B̃_TKR(δ) = Σ_{y=0}^{n−1} Σ_{z=0}^{p'−1} Σ_{u=0}^{q−1} B_{qz+u,y} ζ^{−u} δ^{p'y+p'−1−z},   (15)

and the special case above corresponds to p' = 1. As a polynomial of δ, (15) has a degree of mnp' + p' − 2 and hence a recovery threshold of mnp' + p' − 1. After interpolation, we again round each coefficient to the nearest integer and compute the remainder upon division by ζ. Eventually, we obtain the sums Σ_{u=0}^{q−1} A_{x,qz+u} B_{qz+u,y}, where x = 0, ..., m − 1, y = 0, ..., n − 1, and z = 0, ..., p' − 1, and we can further add them up over the groups to obtain the (x, y)-th submatrix Σ_{l=0}^{p−1} A_{x,l} B_{l,y} of AB.

In this paper, we show that with a non-trivial extension, our local re-encoding framework can also be applied to TKR codes.

Theorem 2: A task encoded with an (m, n, p, p') TKR code can be locally re-encoded into a task encoded with a (λ_m m, λ_n n, λ_p p, λ'_p p') TKR code, where λ_m, λ_n, and λ_p are positive integers, and λ'_p | λ_p.

We can also see that in TKR codes the recovery threshold does not change when p increases. In other words, with the same values of m, n, and p', TKR codes achieve a tradeoff between numerical stability and the complexity of the task. Tang et al. have reported that the numerical precision decreases, i.e., the error after decoding increases, when the value of p increases [11]. With Theorem 2, we can therefore demonstrate that our framework achieves a flexible tradeoff between numerical precision and the complexity of the task (equivalently, the job completion time) through local re-encoding. We present how to locally re-encode a task with TKR codes in the rest of this section, and thereby prove Theorem 2.
B. Changing (m, n, p, p') to (m, n, λ_p p, p')

As described above, the recovery threshold of TKR codes, i.e., mnp' + p' − 1, does not depend on p. Hence, different from the local re-encoding for EP codes in Sec. V, where the recovery threshold must change after re-encoding, it is possible to re-encode a task encoded with a TKR code without changing the recovery threshold. In other words, we can flexibly achieve a different tradeoff between numerical precision and task complexity.
In order to achieve a different tradeoff, we need to further split Ã_TKR(δ) vertically and B̃_TKR(δ) horizontally into λ_p partitions Ã_l(δ) and B̃_l(δ), l ∈ [0, λ_p − 1], respectively. Equivalently, by splitting Ã(δ) and B̃(δ), we are also splitting A and B as in (5). Similar to (6) and (7), we can express Ã_l(δ) and B̃_l(δ) as linear combinations of the submatrices of A and B. We can now re-encode Ã(δ) and B̃(δ) as

Ã'(δ) = Σ_{l=0}^{λ_p−1} Ã_l(δ) ζ^{ql} and B̃'(δ) = Σ_{l=0}^{λ_p−1} B̃_l(δ) ζ^{−ql}.

Although the task after re-encoding is not exactly the same as an (m, n, λ_p p, p') TKR code, we can show that after the multiplication, we can still decode the result of AB with the same recovery threshold, i.e., it is equivalent to an (m, n, λ_p p, p') TKR code.
Given A partitioned into m × λ_p p submatrices and B partitioned into λ_p p × n submatrices (as in (5)), we can shuffle the columns of A and the rows of B in the same way, such that the result of the matrix multiplication remains unchanged. In particular, any index v ∈ [0, λ_p p − 1] of a column partition of A can be uniquely written as v = q(λ_p z + l) + u, where z ∈ [0, p' − 1], l ∈ [0, λ_p − 1], and u ∈ [0, q − 1], and we map it to the position with group index z and within-group index ql + u. Similarly, we map the row partitions of B. Since we only have the columns of A and the rows of B shuffled in the same way, we have A'B' = AB.

Applying an (m, n, λ_p p, p') TKR code on A', each within-group index u' ∈ [0, λ_p q − 1] can be uniquely written as u' = ql + u_0, where l ∈ [0, λ_p − 1] and u_0 ∈ [0, q − 1], so its coefficient ζ^{u'} = ζ^{ql} ζ^{u_0}, and the resulting terms are exactly those of Ã'(δ) above. Similarly, the terms of the code applied on B' match B̃'(δ). These equations show that the re-encoded task is equivalent to one encoded with an (m, n, λ_p p, p') TKR code. We can also see that after re-encoding, the recovery threshold remains unchanged as mnp' + p' − 1. However, the sizes of Ã and B̃ are reduced by λ_p times, reducing the complexity of the multiplication by λ_p times. As we will demonstrate in the experimental results, the tradeoff between the complexity of the task and the numerical precision can be achieved in this way.
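The shuffling argument itself can be checked in one line: permuting the columns of A and the rows of B by the same permutation leaves AB unchanged, as the following snippet (with illustrative sizes) verifies.

```python
import numpy as np

# Permuting the columns of A and the rows of B identically does not
# change the product AB.
rng = np.random.default_rng(6)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 4))
perm = rng.permutation(6)
assert np.allclose(A[:, perm] @ B[perm, :], A @ B)
```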
C. Changing (m, n, p, p') to (m, n, λ_p p, λ_p p')

Moreover, we can further demonstrate that with local re-encoding, we can change the complexity of the task while flexibly changing the recovery threshold at the same time. We first present a special case: by splitting Ã(δ) and B̃(δ) in the same way as in Sec. VI-B, the recovery threshold can also be changed to mnλ'_p p' + λ'_p p' − 1, where λ'_p is a divisor of λ_p. We will present a more general framework for local re-encoding in Sec. VI-D, where both the complexity of the task and the recovery threshold can be changed more arbitrarily.

Since the value of p is still changed to λ_p p, we will split Ã(δ) and B̃(δ) as in (5). However, this time we will re-encode them in a different way. In this section, we consider the special case where λ'_p = λ_p. Assume that δ = σ^{λ_p}. We re-encode the task as

Ã'(σ) = Σ_{l=0}^{λ_p−1} Ã_l(δ) σ^l and B̃'(σ) = Σ_{l=0}^{λ_p−1} B̃_{λ_p−1−l}(δ) σ^l.

Although the task after re-encoding is not exactly the same as an (m, n, λ_p p, λ_p p') TKR code, we can show that after the multiplication, we can still decode the result of AB with the same recovery threshold, i.e., it is equivalent to an (m, n, λ_p p, λ_p p') TKR code.

The same as in Sec. VI-B, we can create A' and B' by shuffling the columns and rows of A and B, respectively. Applying an (m, n, λ_p p, λ_p p') TKR code on A', each group index z ∈ [0, λ_p p' − 1] can be uniquely written as z = λ_p z_0 + l, where z_0 ∈ [0, p' − 1] and l ∈ [0, λ_p − 1], and the corresponding terms, with δ = σ^{λ_p}, are exactly those of Ã'(σ) above. Similarly, the terms of the code applied on B' match B̃'(σ). Hence, we can see that the task after re-encoding is equivalent to A' and B' encoded by an (m, n, λ_p p, λ_p p') TKR code. Since A'B' = AB, we will get the same result after decoding.
D. Changing (m, n, p, p') to (m, n, λ_p p, λ'_p p')

We now show that the recovery threshold can be changed more flexibly. After splitting Ã(δ) and B̃(δ) into λ_p partitions, we demonstrate that the recovery threshold can be changed to mnλ'_p p' + λ'_p p' − 1, as long as λ'_p | λ_p.

Equivalently, we split A and B as in (12), and obtain Ã_{i,l}(δ) and B̃_{l,j}(δ) accordingly. Similar to (13), we re-encode the task with δ = σ^{λ'_p}, and the product of the two re-encoded matrices is a polynomial of the new variable σ whose coefficients contain all the submatrices of A'B', where A' is constructed by shuffling the rows of A as in (10). Considering this product as a polynomial of σ, its degree is λ_m mλ_n nλ'_p p' + λ'_p p' − 2. Hence, with any λ_m mλ_n nλ'_p p' + λ'_p p' − 1 tasks, we can obtain all of its coefficients, and the result of AB can then be decoded as before.

VII. EXPERIMENTAL RESULTS

A. Job Completion Time

We first evaluate the job completion time with re-encoding in our local cluster, where a polynomial code (m = 2, n = 2) is re-encoded into an entangled polynomial code (m = 2, n = 4, p = 2), and a MatDot code (p = 2) is re-encoded into an entangled polynomial code (m = 2, n = 1, p = 4). Comparing Fig. 3a with the original coding schemes in Fig. 2, we can see that the completion time can be reduced by 14.3% and 18.5%, respectively. As for TKR codes, the sizes of the matrices remain the same, but the entries are now integers. We can observe in Fig. 3b that local re-encoding can also save time, by 69.2% and 71.4%, respectively. The saving of time in the two cases above mainly comes from saving the time for encoding and deploying tasks with the new coding schemes, which can also be validated from the results running in Microsoft Azure below.

We also evaluate the job completion time with re-encoding in Microsoft Azure. We run the master on a virtual machine of type B4ms and all workers on virtual machines of type B1ms. We set the initial parameters of the EP code as (m = 2, n = 2, p = 2) and encode the input matrices of three jobs. The sizes of the input matrices of the three jobs remain the same as in Table I, except that all entries are integers. Similarly, we also encode the three jobs with an (m = 2, n = 2, p = 2, p' = 2) TKR code, so that they are split in the same way as with the (m = 2, n = 2, p = 2) EP code and thus have the same recovery threshold. In each job, we change the parameters with four configurations: λ_m = 4; λ_n = 8; λ_p = 2; and λ_m = λ_n = λ_p = 2 for EP codes. As for TKR codes, we change the parameters similarly: λ_m = 4; λ_n = 8; λ_p = 4 with λ'_p = 2; and λ_m = λ_n = λ_p = 2 with λ'_p = 4.
All other parameters, if not mentioned, remain unchanged. Hence, the recovery thresholds of TKR codes after local re-encoding remain the same as those of the corresponding EP codes.
With each configuration, we repeat every job 50 times and obtain the mean and standard deviation of the results. For local re-encoding, the overhead comes only from re-encoding Ã and B̃ locally. The overhead of global re-encoding, however, comprises the overhead of encoding, which is performed solely at the master, and the overhead of distributing all new coded tasks. Therefore, although the time of global re-encoding with EP codes ranges between 1.90 seconds and 17.95 seconds, local re-encoding needs only 0.22 seconds on average at most, as shown in Fig. 4a. Similarly, Fig. 4b illustrates the overhead of re-encoding with TKR codes. The maximum value of local re-encoding in Fig. 4b is 0.05 seconds, which occurred in Job 1 with the configuration (1, 1, 2). As for global re-encoding, however, the time of re-encoding ranges between 1.61 seconds and 12.09 seconds in Fig. 4b. In general, Fig. 4 shows that the overhead of re-encoding can be reduced by up to 99.37% for EP codes and 99.76% for TKR codes with local re-encoding.
We also evaluate how re-encoding affects the overall job completion time in Fig. 5. Compared to the job completion time, the overhead of local re-encoding in Fig. 4 is marginal. Due to the saved re-encoding overhead, the job completion time can be reduced by up to 88.59% with EP codes in Fig. 5a. Similarly, we observe that the job completion time can be reduced by up to 87.05% with TKR codes in Fig. 5b.

B. Numerical Stability
We now compare the numerical stability of EP codes and TKR codes before and after re-encoding. For a fair comparison, the entries of all matrices are integers. First, we run the three jobs in Table I with (m = 2, n = 2, p = 2) EP codes, and then re-encode such jobs with four configurations: λ_m = 2; λ_n = 2; λ_p = 2; and λ_m = λ_n = λ_p = 2. We compare the error of each job before and after re-encoding. The number of workers is chosen to tolerate 5 stragglers after re-encoding. The error is measured with the Frobenius norm, i.e., e = ‖C − Ĉ‖_F / ‖C‖_F, where C is the precise result of AB, and Ĉ is the result obtained after decoding.
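For reference, this metric can be computed as follows (the function name is our own):

```python
import numpy as np

# The relative error metric used in Tables II and III: a Frobenius-norm
# ratio between the exact product C and the decoded estimate C_hat.
def relative_error(C, C_hat):
    return np.linalg.norm(C - C_hat, "fro") / np.linalg.norm(C, "fro")
```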
The results are shown in Table II, where the errors were obtained as averages over 50 runs of the three jobs on Microsoft Azure. We can see from Table II that before re-encoding, the three jobs all have very low numerical errors. The errors after local re-encoding, however, become significantly higher, as the choices of the δs cannot be changed with local re-encoding, making them less desirable for the new (λ_m m, λ_n n, λ_p p) EP codes.
In Table III, on the other hand, we demonstrate the numerical stability of matrix multiplication with TKR codes before and after local re-encoding. The sizes of the input matrices of the three jobs are still those in Table I. The three jobs are now originally encoded with an (m = 2, n = 2, p = 2, p' = 2) TKR code, so that they are split in the same way as with the (m = 2, n = 2, p = 2) EP code in Table II. In each job, we change the parameters with four configurations: λ_m = 2; λ_n = 2; λ_p = 2; and λ_m = λ_n = λ_p = 2. We can see in Table III that the numerical stability of TKR codes is maintained after local re-encoding: the errors remain 0 for different values of the parameters. Compared to the errors in Table II, the numerical stability is significantly improved by TKR codes, and meanwhile local re-encoding does not hurt the numerical stability as it does with EP codes.

We now demonstrate how local re-encoding for TKR codes helps to achieve a flexible tradeoff between the completion time and the error. This time, we multiply two matrices of sizes 1024 × 7168 and 7168 × 1536, respectively.
We launch two jobs that are encoded with two TKR codes: (m = 2, n = 3, p = 1, p' = 1) and (m = 1, n = 1, p = 3, p' = 3), respectively. Each job runs in a cluster of 9 workers and 1 master on Microsoft Azure. We then locally re-encode the two jobs with λ_p being 2, 3, and 4, respectively. Hence, we expect to see the completion time reduced, due to the lower complexity of each task, and the error increased, as shown in Fig. 6.
In Fig. 6, we can see that the completion time and the errors of the two jobs change as expected. With λ_p increasing, the completion time of the two jobs is eventually reduced by 19.1% and 18.3%, while the errors increase from 0 to 1.6 × 10^{−3} and 5.1 × 10^{−4}, respectively. Hence, with local re-encoding, we do not need to change the number of workers while flexibly choosing between a lower completion time and a better numerical precision in the result.

VIII. CONCLUSION
As resources in a distributed infrastructure are shared by multiple jobs, their performance is subject to change dynamically over time. Although coded matrix multiplication has been demonstrated to tolerate stragglers, existing coding techniques do not support a flexible change of the coding scheme or its parameters without receiving additional data when the performance of some resource changes in the distributed infrastructure. In this paper, we propose a framework of local re-encoding, which allows changing the coding scheme and/or its parameters for distributed matrix multiplication without incurring any additional traffic. Through extensive experiments, we demonstrate that our framework can significantly save the time and communication overhead of completing matrix multiplication with dynamic resources, and flexibly achieve the tradeoff between computation and communication, as well as that between numerical precision and completion time.

Fig. 2: Comparisons of completion time of the coded distributed matrix multiplication with and without additional traffic.





Fig. 3: Job completion time with re-encoding in the local cluster.

Fig. 4: Overhead of re-encoding of EP codes and TKR codes.

Fig. 5: Job completion time with local and global re-encoding in Microsoft Azure.

Fig. 6: The tradeoff between completion time and numerical stability.

Assume that each task is originally encoded with an (m, n, p) EP code. If A and B are of sizes Λ_m m × Λ_p p and Λ_p p × Λ_n n, then each task can be re-encoded into any (λ_m m, λ_n n, λ_p p) EP code if λ_m | Λ_m, λ_n | Λ_n, and λ_p | Λ_p. The more divisors Λ_m, Λ_n, and Λ_p have, the more EP codes we can re-encode to. Moreover, even if λ_m/λ_n/λ_p is not a divisor of Λ_m/Λ_n/Λ_p, we can still add rows or columns of zeros into Ã_i or/and B̃_i so that they become divisible. As Ã_i and B̃_i are linear combinations of submatrices in A and B, respectively, this is equivalent to adding rows or/and columns into A and B, which only adds rows or/and columns of zeros in the result without changing any existing element. The overhead of such padding is at most a factor of (1 + λ_m/Λ_m)(1 + λ_p/Λ_p) for Ã_i and (1 + λ_n/Λ_n)(1 + λ_p/Λ_p) for B̃_i, which is marginal if λ_m ≪ Λ_m, λ_n ≪ Λ_n, and λ_p ≪ Λ_p. Since A and B are supposed to be large matrices, such requirements are easy to satisfy.

TABLE I: Sizes of input matrices in three jobs of matrix multiplication.

TABLE II: Comparison of errors of EP codes before and after local re-encoding.

TABLE III: Comparison of errors of TKR codes before and after local re-encoding.