Improving Random Projections With Extra Vectors to Approximate Inner Products

This research is concerned with increasing the accuracy of random projections used to quickly approximate the inner products of data vectors from a given dataset. The idea is to add extra information: we append extra known vectors to the dataset and store associated quantities. We show how the variance of the estimated inner products is reduced as more vectors are added, how the variance reduction is related to the geometry of the dataset, and how the variance behaves asymptotically as the number of extra vectors goes to infinity. We provide the formulae governing the estimate of the inner products for arbitrarily many extra vectors. Lastly, we demonstrate how to implement the estimates efficiently by showing that most of the computations can reuse pre-computed and stored values. Numerical simulations are conducted to support the analytical results.


I. INTRODUCTION
As a popular and useful dimension-reduction technique in high-dimensional statistical learning, random projections have attracted considerable interest, with applications in machine learning [1], data mining [2], information retrieval [3], classification [4], clustering [5], hypothesis testing [6], regression [7], similarity search [8], etc. Many papers have been devoted to various aspects of random projections; to name a few, [9]–[17].
Particularly in machine learning, distances between pairs of data points are compared against each other; these could be inner products, L_p norms, Euclidean distances, etc. Today, machine learning is used for a variety of complex applications (e.g. image processing), so pairwise distances for large numbers of high-dimensional data points must be compared. The direct computation of such distances would take too much time, so methods are needed to quickly estimate the distances with minimal loss of accuracy. Such estimation can be done with random projections, and this research is concerned with improving the accuracy of random projections. (The associate editor coordinating the review of this manuscript and approving it for publication was Xiao-Sheng Si.)
Suppose that we have q observations with p features, stored in a matrix M ∈ R^{q×p}, and that we are interested in calculating the inner products of all pairs of row vectors of M, i.e. MM^T. With q observations, there are q(q−1)/2 pairs. Let the i-th row of M be denoted by x_i. Since we have p features, computing the inner product Σ_{t=1}^{p} x_{i,t} x_{j,t} between any pair x_i, x_j takes O(p) time, hence the total time is O(q²p). When q and p are large, the direct computation of MM^T is computationally costly. Random projections instead compute V = (1/√k) R M^T, where R ∈ R^{k×p} is a random matrix with components r_{i,j} i.i.d. from N(0, 1). The estimate of MM^T is then

MM^T ≈ V^T V.   (I.1)

To understand why this approximation holds, consider the (i, j)-th entry of V^T V, given by

(V^T V)_{i,j} = (1/k) Σ_{t=1}^{k} (R x_i)_t (R x_j)_t,

whose expectation is x_i^T x_j. Hence each (i, j) entry of V^T V in (I.1) is an estimate of the inner product between x_i and x_j. When k is much smaller than q and p, i.e. k ≪ min(q, p), the saving in computational complexity, from O(q²p) for MM^T down to O(qpk + q²k) for (1/√k) R M^T plus V^T V, is substantial. A second way to think about the computational cost is this: computing V takes O(qpk) time, which is a one-off cost, and V can be stored in O(qk) space. Each subsequent estimate of a pair's inner product then takes O(k) rather than O(p) time, a drastic speedup when k ≪ p, and the values of V^T V can be used in applications such as ANN (approximate nearest neighbors).
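As a minimal sketch of this pipeline (with a synthetic Gaussian matrix standing in for a real dataset; the sizes q, p, k are arbitrary), the estimate V^T V of MM^T can be formed as:

```python
import numpy as np

rng = np.random.default_rng(0)

q, p, k = 50, 1000, 128          # observations, features, projected dimension
M = rng.standard_normal((q, p))  # synthetic stand-in for a real dataset

# Project: V = (1/sqrt(k)) R M^T with R having i.i.d. N(0,1) entries.
R = rng.standard_normal((k, p))
V = (R @ M.T) / np.sqrt(k)       # one-off O(qpk) cost, O(qk) storage

# Each (i, j) entry of V^T V averages k unbiased estimates of x_i^T x_j.
G_hat = V.T @ V                  # O(q^2 k) for all pairs
G = M @ M.T                      # exact Gram matrix, O(q^2 p)

rel_err = np.linalg.norm(G_hat - G) / np.linalg.norm(G)
```

At moderate k the estimate is noisy, especially for near-orthogonal pairs; the rest of the paper is concerned with tightening exactly this estimate.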

A. RELATED LITERATURE
From the above example, we can see that random projections project the data to a lower-dimensional space and then quickly estimate the distances between vectors. There has been research focused on improving the speed of random projections. For example, [18] and [19] looked at changing the distribution of the random projection matrix. Achlioptas [18] considered i.i.d. entries from the Rademacher distribution, while Li et al. [19] considered i.i.d. entries from the sparse Bernoulli distribution. Since the support of the Rademacher distribution is {−1, 1} and the support of the sparse Bernoulli distribution is {−1, 0, 1}, matrix multiplication with the random projection matrix becomes much faster than with ordinary random projections. References [20] and [21] also looked at cases where the entries of the random projection matrix are dependent on each other and can be generated recursively. All of these methods speed up random projections.
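To make these constructions concrete, here is a brief sketch; the function names are ours, and the entry distributions follow the descriptions above (Rademacher for [18], sparse Bernoulli with a sparsity parameter s for [19]):

```python
import numpy as np

rng = np.random.default_rng(1)

def rademacher_matrix(k, p):
    """Achlioptas-style projection matrix: i.i.d. entries +1/-1 with
    probability 1/2 each."""
    return rng.choice([-1.0, 1.0], size=(k, p))

def sparse_bernoulli_matrix(k, p, s=3.0):
    """Li et al.-style sparse projection matrix: entries sqrt(s) * {+1, 0, -1}
    with probabilities 1/(2s), 1 - 1/s, 1/(2s); most entries are zero."""
    probs = [1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)]
    return np.sqrt(s) * rng.choice([-1.0, 0.0, 1.0], size=(k, p), p=probs)

# Both choices keep E[r_ij] = 0 and Var(r_ij) = 1, so (1/sqrt(k)) R x still
# preserves inner products in expectation, while the multiplication is cheaper.
```

With s = 3, roughly two-thirds of the entries of the sparse matrix are exactly zero, which is what makes the multiplication cheap.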
On the other hand, there is also research on reducing the variance of the random projection estimate via various computational statistical techniques. Techniques such as antithetic sampling [22], Bayesian priors [23], control variates [16], [24], and maximum likelihood estimation [10], [17] have all been used to improve the random projection estimate.
Our work mainly focuses on improving the estimate of random projections, and builds upon the ideas in [10], [17].

B. IMPROVING RANDOM PROJECTIONS BY USING MARGIN NORMS
By adding additional information, namely, pre-calculating and storing partial information beforehand, one can improve the accuracy of random projections and obtain a better approximation V^T V of MM^T; some works (e.g. [10], [17]) have specifically addressed this topic. For example, by using a maximum likelihood estimator (MLE) that takes advantage of marginal norms, Li et al. [10] showed that margin-constrained random projections can improve estimation accuracy remarkably.
More specifically, consider the given data matrix M ∈ R^{q×p}. Pre-calculate and store the norm of each row vector of M; these marginal norms come at a cost of O(qp). To approximate MM^T, let x_1^T = (x_{1,1}, ..., x_{1,p}) and x_2^T = (x_{2,1}, ..., x_{2,p}) ∈ R^p represent two arbitrary distinct row vectors of M.¹ Instead of directly computing x_1^T x_2 for all possible pairs x_1, x_2, one can adopt a random matrix R ∈ R^{k×p} with entries r_{i,j} i.i.d. N(0, 1) and generate new random vectors v_1 = (v_{1,1}, ..., v_{1,k})^T and v_2 = (v_{2,1}, ..., v_{2,k})^T by the map T ≡ (1/√k) R, with v_1 = T x_1 and v_2 = T x_2 respectively. Denote by u_j^{(2)} = [v_{1,j}, v_{2,j}]^T the vector comprising the j-th elements of v_1 and v_2, j = 1, ..., k.
Then it was shown in [10] that each u_j^{(2)} is bivariate normal with mean zero and covariance matrix (1/k)[[m_1, a], [a, m_2]], where a := x_1^T x_2 is the quantity to be estimated and m_1, m_2 are the margins defined below; the associated joint likelihood function for the tuples (v_{1,j}, v_{2,j}), j = 1, ..., k, is the product of the corresponding bivariate normal densities (I.10).

¹ By pre-computing and storing the norms of the row vectors of M, the computation of MM^T reduces to the computation of x_1^T x_2, with x_1 ≠ x_2 and x_1^T, x_2^T ranging over all distinct pairs of row vectors of M. This way, we only need to focus on approximating each x_1^T x_2, which is convenient for unfolding the theory.

By writing down the log-likelihood function l(a) from (I.10), the following estimation result is obtained.

Proposition 1 ([10], Lemma 2): Suppose the margins m_1 = x_1^T x_1 and m_2 = x_2^T x_2 are known. The maximum likelihood estimator (MLE), denoted â, is a solution to the cubic equation

a³ − a²(v_1^T v_2) + a(m_2 v_1^T v_1 + m_1 v_2^T v_2 − m_1 m_2) − m_1 m_2 (v_1^T v_2) = 0.

We can see that the approximation of a is just one of the roots of this cubic equation, which can be found without difficulty by a root-finding algorithm such as Newton–Raphson once the coefficients are determined. It was also pointed out that the (asymptotic) variance Var(â) of the MLE in Proposition 1 is smaller than those obtained by other methods (see [10]), making the MLE an ideal candidate for applications.
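The following is a sketch of Proposition 1 on synthetic data; the cubic in the comment is obtained by differentiating the bivariate normal log-likelihood l(a), and it is solved here with np.roots rather than Newton–Raphson:

```python
import numpy as np

rng = np.random.default_rng(2)

p, k = 1000, 100
x1 = rng.standard_normal(p)
x2 = x1 + 0.5 * rng.standard_normal(p)   # correlated pair, where margins help
m1, m2 = x1 @ x1, x2 @ x2                # margins, pre-computed exactly
a_true = x1 @ x2

R = rng.standard_normal((k, p))
v1 = (R @ x1) / np.sqrt(k)
v2 = (R @ x2) / np.sqrt(k)

# Setting dl(a)/da = 0 for the bivariate normal likelihood of (v_{1,j}, v_{2,j})
# yields a cubic in a:
#   a^3 - a^2 (v1.v2) + a (m2|v1|^2 + m1|v2|^2 - m1 m2) - m1 m2 (v1.v2) = 0
coeffs = [1.0,
          -(v1 @ v2),
          m2 * (v1 @ v1) + m1 * (v2 @ v2) - m1 * m2,
          -m1 * m2 * (v1 @ v2)]
roots = np.roots(coeffs)

# A cubic with real coefficients has at least one real root; take the real
# root closest to the naive estimate v1.v2.
real = roots[np.abs(roots.imag) < 1e-8].real
a_mle = real[np.argmin(np.abs(real - v1 @ v2))]
```

The margin information pulls the estimate toward the feasible region |a| ≤ √(m_1 m_2), and the gain over the naive estimate v_1^T v_2 is largest when |a| is close to this bound (see [10]).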

C. FROM ADDING MARGINAL NORM TO ADDING MORE ADDITIONAL INFORMATION (VECTORS)
The result in Proposition 1 suggests that adding more known information may yield a further variance reduction and therefore a better estimate of each inner product x_1^T x_2. So it is natural to ask: can we continue to reduce the variance Var(â) with more additional information? In this work, building on Proposition 1, we investigate the improvement of random projections by adding more information (extra known vectors) in the context of maximum likelihood estimation.
More precisely, for the data matrix M ∈ R^{q×p}, we not only pre-compute and store the norm of each row of M, but also pick n extra known vectors, denoted x_3, ..., x_{n+2} ∈ R^p, and further calculate and store their norms as well as the inner products between each of these n vectors and each row vector of M. The goal is still to approximate x_1^T x_2, where x_1^T, x_2^T are two arbitrary distinct row vectors of the dataset M. With this additional known information, however, we expect to obtain a much smaller variance Var(â) than the one in Proposition 1.
To unfold the work, let us set up the framework and go into the details in the following.

D. OUR RESULTS
Towards the goal of estimating a = x_1^T x_2, denote the n extra vectors by x_3, ..., x_{n+2} ∈ R^p, where n is a non-negative integer, with the convention that we add nothing if n = 0 (leading back to the case in Proposition 1). Under the random projection map T, we accordingly have the additional vectors v_i = T x_i, i = 3, ..., n+2. We similarly let u_j^{(n+2)} = [v_{1,j}, ..., v_{n+2,j}]^T, j = 1, ..., k (I.14), with the covariance matrix given in (I.15), and the corresponding joint likelihood function (I.16) for the tuples (v_{1,j}, ..., v_{n+2,j}). Under this setting, our main results are Theorems 6 and 9, which refine the two parts of Lemma 2 and establish the formulae governing the MLE â and the variance Var(â) separately, thereby generalizing Proposition 1.
We shall discuss analytical questions including how the variance reduction works as we add n extra vectors, whether there exists an optimal n that minimizes the variance Var(â), and the asymptotic behaviour of Var(â) as n goes to infinity. As a byproduct, we also point out how the variance reduction is related to the geometry of the dataset M. In the numerical part, the computational cost of â when adding extra vectors is discussed; we provide simulations for various comparisons to support the analytical results and propose effective algorithms for evaluating the MLE â.

E. NOTATION
Let us first summarize the notation and convention that will be adopted throughout the material.
• Vectors are column vectors by default. Size subscripts on matrices will often be omitted when there is no chance of misunderstanding.
• The matrix M ∈ R^{q×p} represents a dataset; q and p are the numbers of observations and features of the data points, respectively.
• x_1, x_2 will be called the testing vectors and denote an arbitrary pair of distinct column vectors of M^T. Instead of directly estimating MM^T, we can therefore focus on estimating x_1^T x_2, letting x_1, x_2 vary over all possible pairs.
• A denotes the matrix of inner products of the testing and extra vectors, as in (I.17), where A/k is the covariance matrix in (I.15); in particular, a := x_1^T x_2 is the quantity we want to approximate in this work.
• R ∈ R^{k×p} represents the random matrix; its (i, j)-th entry is denoted by r_{i,j}, where the r_{i,j} are i.i.d. N(0, 1).
• det(·) and | · | will be used interchangeably to denote the determinant of a matrix.

II. MAXIMUM LIKELIHOOD ESTIMATION
Now we are in a position to calculate the maximum likelihood estimator â by utilizing the joint likelihood function (I.16) for n added extra vectors. We expect to obtain a generalization of Proposition 1, in which only the marginal norms are added, namely n = 0. As n increases, the pattern for finding â and Var(â) becomes obscure. The first challenge is to find uniform equations involving â and Var(â). We first state the result.
Lemma 2:
1) The MLE â is a solution to equation (II.18), where A_{l,i} is the (l, i)-th entry of adj(A).

2) The asymptotic variance of â is given by (II.20).

The usefulness of this lemma lies in the fact that it confirms the existence of equations governing â and the variance Var(â) for any number n of extra vectors, and that it provides general formulae for finding â and Var(â).
In view of Proposition 1, it is natural to expect that equation (II.18) is a cubic equation in a for arbitrary n and that Var(â) decreases as n increases. Unfortunately, neither statement is straightforward from Lemma 2 at this stage. As n increases, on the one hand, the degree of the polynomial in a on the left-hand side of (II.18) is not immediately clear, so solving for the roots of (II.18) might be intimidating; on the other hand, the right-hand side of (II.20) does not involve n explicitly, which blurs the relationship between Var(â) and the number of extra vectors n. We need to seek alternative equivalent expressions that are more convenient for analysis.

A. FURTHER ANALYSIS OF Â AND VAR(Â)
We intend to refine equations (II.18) and (II.20) separately. For analytical reasons, let us first focus on the analysis of (II.20), and turn to (II.18) afterwards.

B. THE BEHAVIOR OF VAR(Â)
Observe that Var(â) in equation (II.20) does not explicitly depend on n, but does depend on |A|. To build the connection between Var(â) and n, we first investigate the relation between Var(â) and |A|, and then the relation between |A| and n. By combining these two pieces, the relationship between Var(â) and n becomes clear. The main result will be established after several crucial lemmas.
The first lemma produces a nice expression of |A|.

Lemma 3: Partition the matrix A in (I.17) into blocks; then the determinant of A can be expressed, as a function of a, in terms of |D|, |W|, and A_0, where D is a block of this partition, W is the (n + 1) × (n + 1) matrix obtained by deleting the 1st row and the 2nd column of A, and A_0 denotes the value of |A| evaluated at a = 0.
By utilizing this lemma, we can now rewrite Var(â) from Lemma 2 into concise expressions in terms of |A|.
Lemma 4: The variance Var(â) satisfies the two equivalent expressions (II.23) and (II.24), where D and A_0 are defined as in Lemma 3.

The term 2|D|/|A| appears in equation (II.23), the term (|D|/|A|)² a² appears in equation (II.24), and all other terms are non-negative. Therefore, both (II.23) and (II.24) imply that Var(â) decreases as |A| decreases (with |D| fixed, being determined by the n extra vectors). The subtle difference between the two equations is this: if |A| is decreasing and very close to |D|, then we should look at equation (II.23) rather than (II.24), since Var(â) decreases at the rate 2|D|/|A|, which is faster than (|D|/|A|)² a²; if instead |A| is decreasing and very close to 0, then we should look at equation (II.24) rather than (II.23), since Var(â) decreases at the rate (|D|/|A|)² a², which is faster than 2|D|/|A|. Other equivalent expressions for Var(â) can also be derived; they provide insight into the behavior of the variance from different angles and serve as convenient tools for various purposes. Since we now see that Var(â) depends closely on the behavior of |A|, we turn to the relationship between |A| and the number of additional vectors n: namely, how the determinant |A| behaves as n increases, which will eventually connect Var(â) and n.
Lemma 5: Let m be a positive integer and inductively define the sequence of symmetric, positive definite matrices {T_{m+1}} (they have overlapping entries) by (II.25); then the determinant relation stated in the lemma, involving the diagonal entries t_{m,m}, holds.

Based on this lemma, we arrive at the following:

Theorem 6: Suppose we add n extra known vectors x_3, ..., x_{n+2} that are orthogonal to each other and det(A) ≠ 0. Then:
• Suppose each of x_3, ..., x_{n+2} is not perpendicular to at least one of the testing vectors x_1, x_2, and denote s_i as in (II.28).

Then Var(â, n) satisfies (II.29).

Notice that one of the constraints imposed on the extra vectors in Theorem 6 is orthogonality, which can always be arranged by the Gram–Schmidt process.
The first part points out that we can reduce the variance Var(â) for the testing pair x_1, x_2 only when the additional vectors are not all orthogonal to x_1 and x_2, which justifies the assumption in the second part.
Observe from equations (II.28) and (II.29) that s_i essentially measures the angles between the extra vectors x_3, ..., x_{n+2} and the testing vectors x_1, x_2, and that Var(â, n) goes to zero if any s_i is close to 1.
Hence, the second part tells us that there does not actually exist an ''optimal'' n that minimizes the variance Var(â); to lower the variance, we should on the one hand add more vectors, and on the other hand try to make the extra vectors form small angles with all data points (all possible testing vectors x_1, x_2). Since the latter requires knowing geometric information about the dataset M in advance, continuing to add extra vectors seems to be the simplest way to reduce the variance. However, as we will soon see, the computational cost is the real concern that prevents one from choosing n too large.

C. FURTHER ANALYSIS OF Â
In this section, we show that the highest order of a in equation (II.18) is 3 for any n. The following definition provides a convenient tool for describing the coefficients of the cubic equation.
Denote the matrices 1A, 2A, 3A, 4A of size (n + 2) × (n + 2), and further denote the sequences of vectors as in (II.37). Now the first part of Lemma 2 can be restated as follows.

Theorem 9: â is a solution to the cubic equation (II.38).

Theorem 9 confirms that, for each testing vector pair x_1, x_2, equation (II.18) in Lemma 2 is a cubic equation in a for arbitrary n. Since the coefficients of the cubic equation consist of the α's and β's, the main work in solving for â is to determine the α's and β's.
Therefore, as n increases, the computational complexity of the α's and β's deserves attention; moreover, how to make the implementation efficient needs to be addressed, as we discuss in the next section.

III. NUMERICAL COSTS AND SIMULATIONS
In this section, we discuss the numerical costs and the experiments of this work. We examine the main aspects that dominate the computational complexity and run numerical simulations on real datasets to support the analytical results.

A. NUMERICAL COSTS
It is not surprising that adding n extra vectors and storing the associated information results in a trade-off between accuracy and computational cost. The computation for approximating MM^T comes mainly from three phases:

Phase 1. The random projections, namely V = (1/√k) R · [M^T, x_3, ..., x_{n+2}], which take O((q + n)pk). This needs to be done once.

Phase 2. Additional information that needs to be pre-computed and stored for later use in Phase 3, which takes about O((q + n)p + (q + n)pn + n²k) time. This needs to be done once.

Phase 3. Solving the cubic equation (II.38) for each testing pair x_1, x_2 in M^T. Note that α_1, α_2, α_3 and β_1 use values pre-computed in Phase 2, regardless of which pair of inner products we want to estimate, so no additional computation is needed for them, while the calculations of β_2 and β_3 change with each pair.

In Phase 3, when solving for the roots of (II.38), the calculation of the α_i's and β_i's can be quite involved when n is large. This is mainly because, in the expressions for the α_i's and β_i's, many matrix determinants have to be evaluated according to Definition 7. Although different methods can be applied to compute these determinants, such as Gaussian elimination, which is quite general, or the formula for the determinant of a 2 × 2 block matrix, which exploits the structure of the matrices 1A, 2A, 3A, 4A, we find that efficient implementation depends mostly on how many extra vectors are used and how they are chosen.
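The three phases above can be sketched as follows (a toy sketch: the random orthonormal extra vectors here are placeholders, and the α/β bookkeeping of Definition 7 is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

q, p, k, n = 40, 500, 64, 3
M = rng.standard_normal((q, p))

# Placeholder choice of extra vectors: random orthonormal directions
# (the simulations in Section III-B instead use singular vectors of M).
X_extra, _ = np.linalg.qr(rng.standard_normal((p, n)))    # p x n, orthonormal

# Phase 1 (one-off): project the data rows and the extra vectors together.
R = rng.standard_normal((k, p))
V = (R @ np.hstack([M.T, X_extra])) / np.sqrt(k)          # O((q + n) p k)

# Phase 2 (one-off): pre-compute and store the margins and the inner
# products between every data row and every extra vector.
margins = np.einsum('ij,ij->i', M, M)                     # ||x_i||^2, O(qp)
cross = M @ X_extra                                       # x_i^T x_{2+t}, O(qpn)

# Phase 3 (per pair): only the pair-dependent beta-type quantities remain
# to be computed before solving the cubic for each (i, j).
```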
Therefore, as far as Phase 3 is concerned, it is useful to know how quickly the computational cost grows with n, i.e. the exponent in O(n^?).
Let us consider this in the following by comparing n arbitrary vectors against n orthonormal vectors.

Procedure 1: Computing α_1, α_2, α_3. (1) Utilize the formula for the determinant of a block square matrix to evaluate det(3A) and det(4A).

Comparing with the cost for α_1, α_2, α_3, we see that the cost of calculating β_1, β_2, β_3 dominates for all but very small problems. The cost of Procedure 2 grows quartically with n, i.e. O(n⁴), so the computation can become prohibitively large as n increases.
On the other hand, we can reduce the computational cost by choosing the extra vectors to be orthonormal. This makes D in (III.40) the identity matrix and drastically simplifies the calculations, producing concise formulae for the α's and β's in terms of the values pre-computed and stored in Phase 2. In this way, we not only lower the number of arithmetic operations but also exploit the way the data is stored to reduce memory accesses.
We can hence simplify Procedure 1 to take O(n) time, a one-off cost, and Procedure 2 to take O(n²) time per vector pair. The exact expressions for the α's and β's are given in Appendix H. Table 1 compares the computational cost of naive random projections (RP), random projections with control variates [16], RP with marginal norms (the work in [10]), RP with n extra vectors, and sign full random projections [25], showing how the computational complexity increases as additional information is added. Since Newton–Raphson has quadratic convergence and finds the root of the cubic polynomial in about 1-3 steps, we can treat it as having complexity O(1). While our method has an increased up-front cost, this cost is one-off, and estimating the inner product of any (i, j) pair then requires only an additional O(n²) time, which is negligible when the number n of extra vectors is small.

B. SIMULATIONS
We compare numerical experiments approximating inner products using the maximum likelihood estimation method with n = 1, 2, 3 extra vectors against naive random projections, Li's method [10], the method of control variates (see, for example, [24]), and sign full random projections [25]. We denote by RP the ordinary random projection estimator of the inner product, by Li Li's estimate, by CV the bivariate control variate estimate, and by sign RP sign full random projections. MLE-n, n = 1, 2, 3, denotes the maximum likelihood estimation algorithm with n extra vectors.
We perform experiments on two datasets: Arcene and Driv-Face, which are retrieved from the UCI machine learning repository [26], [27].
Before proceeding to our main experiments, we first perform a sanity check on the computational cost of our algorithm. Table 1 gives the computational complexity, but we also want to see how the algorithm performs in practice in terms of speed. We take the first two vectors x_1, x_2 from the DrivFace dataset and measure the average time (over 1000 simulations) to compute the estimate of the inner product x_1^T x_2 for a varying number k of columns of the random projection matrix, k ∈ {10, 20, ..., 100000}, before smoothing the results. Figure 1 shows the results. While our method takes more time than the four baseline methods, it remains within the same order of magnitude for k < 10³. Since we usually examine random projection performance with k < 10², our method is still comparable with the baselines, though at a cost in speed. This is the main limitation of our method.
We next proceed with our main experiments, which check whether our new algorithms outperform those proposed in previous work. Specifically, we compute the average root mean squared error (RMSE) of all pairwise inner products on the dataset and check whether the RMSE of the maximum likelihood estimate is lower than those of the other algorithms. The RMSE is defined as

RMSE = sqrt( (1/C(q, 2)) Σ_{i<j} (â_{i,j} − x_i^T x_j)² ),

where â_{i,j} is the estimated inner product computed by our algorithms, x_i^T x_j represents the true inner product between the i-th and j-th row vectors of the dataset M ∈ R^{q×p}, and C(q, 2) = q(q−1)/2 denotes the total number of pairs x_i, x_j.
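The pairwise RMSE can be computed directly from the matrix of estimates and the exact Gram matrix; a small sketch:

```python
import numpy as np

def pairwise_rmse(A_hat, G):
    """RMSE over all q(q-1)/2 distinct pairs, given the matrix A_hat of
    estimated inner products and the exact Gram matrix G = M M^T."""
    i, j = np.triu_indices(G.shape[0], k=1)   # all pairs with i < j
    return float(np.sqrt(np.mean((A_hat[i, j] - G[i, j]) ** 2)))

# Tiny demo with an ordinary random-projection estimate:
rng = np.random.default_rng(5)
M = rng.standard_normal((30, 400))
V = (rng.standard_normal((64, 400)) @ M.T) / 8.0   # k = 64, 1/sqrt(64) = 1/8
rmse = pairwise_rmse(V.T @ V, M @ M.T)
```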
We choose the test set of Arcene which has 100 observations and 10000 features. The dataset DrivFace has 606 observations and 6400 features. Our datasets are preprocessed by centering and normalization. For each dataset, we use a random projection matrix R, where each entry is i.i.d. N (0, 1), and k ranges from 1 to 100. We repeat the experiments for 1000 simulations with different random seeds, and then we plot the average performance as well as 3 standard deviations over the 1000 experiments.
In the simulations, we choose singular vectors of the dataset M ∈ R^{q×p} as the extra vectors for the MLE method. In particular, we take the singular value decomposition (SVD) M = USV^T, whose right singular vectors are orthonormal, and choose the first n of them as the n extra vectors of MLE-n. The motivation is that the first n singular vectors are equivalent to the first n eigenvectors of the sample covariance matrix of the dataset, and those eigenvectors tend to form small angles with the majority of the vectors in the dataset. Figures 2 and 3 plot the performance of the algorithms using the mean and 3 standard deviations, comparing the RMSE of RP, Li, and CV against the RMSE of MLE-n, n = 1, 2, 3; we use a log-scale plot to visualize the differences.
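A sketch of this choice of extra vectors on a synthetic matrix (centering and row-normalizing as in the preprocessing, then taking the leading right singular vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
q, p, n = 100, 300, 3
M = rng.standard_normal((q, p))

# Preprocess as in the experiments: center, then normalize each row.
M -= M.mean(axis=0)
M /= np.linalg.norm(M, axis=1, keepdims=True)

# Leading right singular vectors of M = U S V^T: orthonormal directions in
# feature space aligned with the bulk of the (centered) dataset.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
X_extra = Vt[:n].T                      # p x n matrix of extra vectors

# They coincide with the leading eigenvectors of the sample covariance
# M^T M / q, hence tend to form small angles with many data rows.
```

By the optimality of truncated SVD, no other orthonormal set of n directions captures more of the dataset's energy, which is the sense in which these vectors align with the data.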
For both datasets, we can see that the more extra vectors we add, the better the performance. Furthermore, we evaluate the precision and recall of the algorithms when estimating the inner products. First we define 39 thresholds from −0.95 to 0.95, with a difference of 0.05 between consecutive thresholds, i.e. −0.95, −0.9, −0.85, ..., 0.95. Then we define the following terms:
• True Positive (TP) = the number of (i, j) pairs where â_{i,j} < threshold and x_i^T x_j < threshold
• True Negative (TN) = the number of (i, j) pairs where â_{i,j} > threshold and x_i^T x_j > threshold
• False Positive (FP) = the number of (i, j) pairs where â_{i,j} < threshold and x_i^T x_j > threshold
• False Negative (FN) = the number of (i, j) pairs where â_{i,j} > threshold and x_i^T x_j < threshold
where â_{i,j} is the estimated inner product between x_i and x_j, and x_i^T x_j is the exact inner product. Based on these we define the two metrics precision = TP/(TP + FP) and recall = TP/(TP + FN). We examine the precision and recall on both datasets over different values of k ∈ {10, 50, 100} in Figures 4 to 11.
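The counting rules above translate directly into code; a sketch, treating ''inner product below threshold'' as the positive class:

```python
import numpy as np

def precision_recall(a_hat, a_exact, threshold):
    """Precision and recall for arrays of estimated and exact inner products,
    matching the TP/FP/FN definitions above."""
    pred_pos = a_hat < threshold
    true_pos = a_exact < threshold
    tp = int(np.sum(pred_pos & true_pos))
    fp = int(np.sum(pred_pos & ~true_pos))
    fn = int(np.sum(~pred_pos & true_pos))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    return precision, recall

# The 39 thresholds used above: -0.95, -0.90, ..., 0.95.
thresholds = np.linspace(-0.95, 0.95, 39)
```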
From the plots we observe that as more vectors are added, we get a higher average precision and recall as well as lower standard deviations.

IV. CONCLUSION AND FUTURE WORK
In this work we have investigated improving random projections using additional vectors. By establishing Theorems 6 and 9, we obtain the specific cubic equation governing the MLE when adding arbitrary n extra vectors, as well as an upper bound on the variance reduction. As we can see, the variance of the MLE is continuously reduced as n increases under certain assumptions; meanwhile, the MLE can be computed at a reasonable cost by choosing the additional vectors to be orthonormal to each other. Theorem 6 also shows that the variance can be dramatically decreased provided the extra vectors form small angles with the testing vectors of the dataset, which suggests that we can pick these vectors by exploiting special characteristics of the dataset (e.g. singular vectors).
This work opens opportunities for studying other aspects of improving random projections using additional information, and can serve as a stepping stone for developing parallels of the results in [10], such as various probability bounds. Last but not least, during this work the authors also found a close connection between the MLE method and the method of control variates for improving random projections, which deserves special attention and will be addressed in separate work.

APPENDIXES
APPENDIX A PROOF OF LEMMA 2
We first prove the first part of Lemma 2.
Proof: 1. Recall the notation A = kΣ in (I.17). Taking the logarithm of the right-hand side of equation (I.16) and utilizing (A.44), together with the identity Σ_{s,t=1}^{p} x_{i,s} x_{l,t} Σ_{j=1}^{k} r_{j,s} r_{j,t} = k v_i^T v_l, we continue to simplify the log-likelihood. To compute E(B), E(B') and E(B''), we first observe, by Laplace's formula, the cofactor expansion of |A|; keeping track of the variable a and differentiating both sides with respect to a yields the required identities. Taking these identities into account and recalling the expression of B in the lemma, we obtain the first claim.

2. To determine c_2, the key observation is that all the terms of |A| involving a must come from either a A_{1,2} or a A_{2,1} (since adj(A) is symmetric, A_{1,2} = A_{2,1}). To see this, denote the entries of A by a_{i,j} = x_i^T x_j for simplicity and expand |A| by definition:

|A| = Σ (−1)^{τ(j_1 j_2 j_3 ··· j_n)} a_{1,j_1} a_{2,j_2} a_{3,j_3} ··· a_{n,j_n}
    = Σ_{j_1 = 2} (−1)^{τ(2 j_2 ··· j_n)} a · a_{2,j_2} ··· a_{n,j_n}
    + Σ_{j_2 = 1} (−1)^{τ(j_1 1 j_3 ··· j_n)} a_{1,j_1} · a · a_{3,j_3} ··· a_{n,j_n}
    + Σ_{j_1 ≠ 2, j_2 ≠ 1} (−1)^{τ(j_1 j_2 j_3 ··· j_n)} a_{1,j_1} a_{2,j_2} a_{3,j_3} ··· a_{n,j_n}.

All the terms of |A| involving a must come from the first two sums, since the third excludes a entirely. Now notice that the first sum equals a A_{1,2} and the second equals a A_{2,1}. Excluding the overlapping terms of these two sums that involve a², we derive the claimed expression for c_2, the symmetry of A being used in the last step.

APPENDIX C PROOF OF LEMMA 4
Proof: 1. Differentiating A_{1,2} (the (1, 2)-th entry of adj(A)) with respect to a, we have A_{1,2}' = −|D|. By virtue of Lemma 3, we calculate the required quantities; substituting them into the second line of equation (C.58) and simplifying, we obtain the expression with α_1, α_2, α_3 as in (II.37). Next, for i = 1, ..., n + 2, the matrices A_i are generated by replacing the i-th row of matrix A with l_i^T. Analyzing each A_i separately, it can be directly verified with the aid of Lemma 3 and Definition 7 that the desired result follows, which completes the whole proof.

APPENDIX F PROOF OF THEOREM 6
Proof: 1. We first establish the first part of the theorem.

APPENDIX G PROOF OF THEOREM 9
Proof: Directly substitute equation (II.36) into equation (II.18) for |A| and B, simplify and expand, and then apply (II.37).

APPENDIX H PROCEDURES
Since the n extra vectors are chosen to be orthonormal, D is an identity matrix in the formula above. By virtue of this simplified formula, the specific expressions for α_1, α_2, α_3 and β_1, β_2, β_3 can be calculated as follows.