### Lecture notes (pdf) - Department of Computer Sciences

CS 395T: Sublinear Algorithms, Fall 2014

Lecture 22 — Nov 11, 2014

Prof. Eric Price · Scribe: Taewan Kim

#### 1 Overview
In the last few lectures we covered:

1. the Fourier transform,
2. the sparse Fourier transform,
3. the Fourier RIP.

In this lecture we cover a new topic, *oblivious subspace embeddings*, focusing on the algorithms introduced by Clarkson and Woodruff [CW13] for regression and low-rank approximation.
#### 2 Applications

An oblivious subspace embedding (OSE) is a tool for faster numerical linear algebra. We describe two problems where an OSE applies: regression and low-rank approximation.
##### 2.1 Regression

**Problem statement.** Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$ with $n \gg d$, find $x \in \mathbb{R}^d$ minimizing $\|Ax - b\|_2$.

Each of the $n$ rows of $A$ is a $1 \times d$ vector of features, and the corresponding entry of $b$ is the outcome observed for that row. A minimizing $x$ gives an approximate linear map from features to outcomes: $Ax \approx b$.

The problem can be relaxed by allowing error:

$$\text{find } x' \text{ s.t. } \|Ax' - b\|_2 \le (1+\epsilon) \min_x \|Ax - b\|_2.$$
An algorithm to find the optimal solution of the regression problem ($\epsilon = 0$) uses the Moore–Penrose pseudoinverse. (Details of the Moore–Penrose pseudoinverse can be found on Wikipedia or in Chapter 4 of Laub, Alan J., *Matrix Analysis for Scientists and Engineers*, SIAM, 2005.)

**Algorithm**

1. $x = A^+ b$, where $A^+$ is the pseudoinverse of $A$.
2. When $\mathrm{rank}(A) = d$ and $d \ll n$, $A^+ = (A^T A)^{-1} A^T$.

Computing $x = A^+ b$ this way takes $O(d^2 n + d^3) = O(d^2 n)$ time when $d \ll n$. To speed this up, the sparsity of $A$ can be exploited: if $\mathrm{nnz}(A)$ denotes the number of nonzero entries of $A$, the time improves to $O(d \cdot \mathrm{nnz}(A) + d^3)$. Concretely: compute $A^T b$, which takes $O(\mathrm{nnz}(A))$; compute $A^T A$, which takes $O(d \cdot \mathrm{nnz}(A))$; invert it in $O(d^3)$; and finally compute $(A^T A)^{-1}(A^T b)$ in $O(d^2)$.
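This cost breakdown is easy to mirror in NumPy/SciPy. A minimal sketch (the dimensions, density, and random data are illustrative placeholders, not from the notes):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, d = 10_000, 50                       # n >> d, as in the problem statement
A = sp.random(n, d, density=0.05, random_state=rng, format="csr")
b = rng.standard_normal(n)

# Normal equations: x = (A^T A)^{-1} (A^T b).
# A^T b costs O(nnz(A)); A^T A costs O(d * nnz(A)); the solve costs O(d^3).
AtA = (A.T @ A).toarray()               # d x d, small enough to densify
Atb = A.T @ b
x = np.linalg.solve(AtA, Atb)           # prefer solve() to forming the inverse

# Compare against a dense least-squares solver.
x_ref, *_ = np.linalg.lstsq(A.toarray(), b, rcond=None)
print(np.linalg.norm(x - x_ref))        # small if A is well-conditioned
```

In practice a QR-based solver is preferred, since forming $A^T A$ squares the condition number; the sketch is only meant to mirror the cost analysis above.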
However, by using the OSE of [CW13] one can achieve:

- $O(\mathrm{nnz}(A)) + \tilde{O}(d^3/\epsilon^2)$, or
- $O(\mathrm{nnz}(A)\log(1/\epsilon)) + \tilde{O}(d^3 \log(1/\epsilon))$.

(Here $\tilde{O}(f) := f \cdot \log^{O(1)}(f)$.)
##### 2.2 Low-Rank Approximation

**Problem statement.** Given a matrix $A \in \mathbb{R}^{n \times n}$, find a matrix $B$ with $\mathrm{rank}(B) = k$ minimizing $\|A - B\|_F^2$.

This low-rank approximation problem in the Frobenius norm can also be relaxed by allowing error:

$$\text{find } B' \text{ s.t. } \|A - B'\|_F^2 \le (1+\epsilon) \min_{\mathrm{rank}(B) = k} \|A - B\|_F^2.$$

When $\epsilon = 0$, the singular value decomposition (SVD) gives the best rank-$k$ approximation of $A$: keep the top $k$ singular values and the corresponding singular vectors. Computing the SVD takes $O(n^3)$ time.
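For concreteness, here is the $\epsilon = 0$ solution via NumPy's SVD (sizes are arbitrary placeholders); the check uses the Eckart–Young fact that the optimal Frobenius error is the tail sum of squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 10
A = rng.standard_normal((n, n))

# Full SVD costs O(n^3). Truncating to the top k singular triples gives
# the best rank-k approximation in Frobenius norm.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k] @ Vt[:k, :]        # rank-k approximation of A

err = np.linalg.norm(A - B, "fro") ** 2
opt = np.sum(s[k:] ** 2)                # Eckart-Young: optimal error
print(np.isclose(err, opt))
```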
Alternatively, one can use the power method / subspace iteration:

- each iteration takes $O(n^2 k)$ time;
- for Frobenius-norm approximation, no bound is known;
- by allowing spectral error instead, $\tilde{O}(n^2 k/\epsilon^2)$ per iteration is possible.

Utilizing an OSE gives a better time bound:

- $O(\mathrm{nnz}(A)) + \tilde{O}(nk^2/\epsilon^4 + k^3/\epsilon^5)$.

For a dense matrix $A$, rank-$k$ matrix approximation using random projections was introduced by [Sarlos06, CW09].
#### 3 Oblivious Subspace Embedding

**Definition 1.** An oblivious subspace embedding (OSE) with parameters $(m, n, d, \epsilon, \delta)$ is a distribution over matrices $S \in \mathbb{R}^{m \times n}$ such that for every $d$-dimensional subspace $U \subseteq \mathbb{R}^n$, with probability $1 - \delta$ over $S$, every $x \in U$ satisfies $\|Sx\|_2 = (1 \pm \epsilon)\|x\|_2$.
##### 3.1 Regression with OSE

With an OSE we can solve the regression problem in a lower dimension. Rather than solving $x^* = \arg\min_x \|Ax - b\|$, solve

$$x' = \arg\min_x \|SAx - Sb\| = \arg\min_x \|S(Ax - b)\|,$$

where $Ax - b$ lies in $\mathrm{Col}(A \circ b)$, the $(d+1)$-dimensional column space of $A$ adjoined with the vector $b$. Applying the OSE guarantee to this subspace, and using that $x'$ minimizes the sketched objective (so $\|S(Ax' - b)\| \le \|S(Ax^* - b)\|$), we get

$$\frac{\|Ax' - b\|}{\|Ax^* - b\|} \le \frac{1+\epsilon}{1-\epsilon} \cdot \frac{\|S(Ax' - b)\|}{\|S(Ax^* - b)\|} \le \frac{1+\epsilon}{1-\epsilon} \lesssim 1 + 3\epsilon. \qquad (1)$$

The computational time is "embedding time $+ \mathrm{Solve}(m, d)$", where $\mathrm{Solve}(m, d)$ is the time to solve the new regression problem on the $m \times d$ matrix $SA$ and the $m \times 1$ vector $Sb$.
One example of an OSE is a Gaussian random matrix,

$$S_{i,j} \sim N(0, 1/m). \qquad (2)$$

For a Gaussian OSE, $m = O(d/\epsilon^2)$ suffices. Embedding then requires $O(mnd) = O(d^2 n/\epsilon^2)$ and $\mathrm{Solve}(m, d)$ requires $O(d^3/\epsilon^2)$ time, for a total of $O(d^2 n/\epsilon^2 + d^3/\epsilon^2)$.
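A minimal sketch-and-solve demo with a Gaussian $S$; the sizes and the untuned choice $m = 4d/\epsilon^2$ are placeholders, not constants from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 20_000, 20, 0.25
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

m = int(4 * d / eps**2)                          # m = O(d / eps^2) rows
S = rng.standard_normal((m, n)) / np.sqrt(m)     # S_ij ~ N(0, 1/m)

# Solve the m x d sketched problem instead of the n x d one.
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

ratio = np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_opt - b)
print(ratio)                                     # roughly at most 1 + O(eps)
```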
##### 3.2 Fast Johnson–Lindenstrauss

We now recall an important lemma, the Johnson–Lindenstrauss (JL) lemma.

**Definition 2** (Johnson–Lindenstrauss lemma). If $m = O((1/\epsilon^2)\log(1/\delta))$, then (for a suitable random $A \in \mathbb{R}^{m \times n}$) each fixed $x$ satisfies $\|Ax\|_2^2 = (1 \pm \epsilon)\|x\|_2^2$ with probability $1 - \delta$.

Think of it this way: given a $d$-dimensional subspace $U$, take an $\epsilon$-net $C$ of its unit sphere with $|C| = (1 + 1/\epsilon)^d$ points. If $m = O((1/\epsilon^2)\log(|C|/\delta))$, then by a union bound all net points are preserved, i.e. $C$ is covered. Any unit vector $x \in U$ can be written as

$$x = x_1 + \epsilon x_2 + \epsilon^2 x_3 + \cdots \quad \text{for } x_1, x_2, \ldots \in C,$$

so

$$\|Ax\|_2 \ge \|Ax_1\| - \epsilon \|Ax_2\| - \epsilon^2 \|Ax_3\| - \cdots \ge (1 - \epsilon) - (1+\epsilon)(\epsilon + \epsilon^2 + \cdots) \ge 1 - 3\epsilon.$$
A faster version of the Johnson–Lindenstrauss embedding was introduced by [KW11]: if $A$ has the RIP of order $k$, then $AD$ has the $(\epsilon, 2^{-k})$ JL property, where $D$ is a diagonal matrix of random signs,

$$D = \begin{pmatrix} \pm 1 & & & \\ & \pm 1 & & \\ & & \ddots & \\ & & & \pm 1 \end{pmatrix}.$$
Last class it was shown that the subsampled Fourier matrix $F_\Omega$, $\Omega \subseteq [n]$, satisfies the $(k, \epsilon)$-RIP if $|\Omega| \ge (1/\epsilon^2) k \log^4 n$. So if $m = |\Omega|$ is at least $(d/\epsilon^2) \log(1/\epsilon) \log^4 n$, we get a subspace embedding with $m = (d/\epsilon^2) \log^5 n$ rows, and the embedding can be applied to a vector in $O(n \log n)$ time via the FFT. Thus with Fast JL, embedding requires $O(dn \log n)$ and $\mathrm{Solve}(m, d)$ requires $O((d^3/\epsilon^2) \log^5 n)$.
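A rough sketch of such an embedding (random signs $D$, FFT, subsample $\Omega$), assuming the unitary complex FFT as the RIP matrix; the sizes and the $\sqrt{n/m}$ rescaling follow the usual conventions and are not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4096, 256

signs = rng.choice([-1.0, 1.0], size=n)       # the diagonal D
rows = rng.choice(n, size=m, replace=False)   # Omega, the sampled rows

def fast_jl(x):
    """Apply sqrt(n/m) * F_Omega * D * x in O(n log n) time."""
    y = np.fft.fft(signs * x) / np.sqrt(n)    # unitary FFT of Dx
    return np.sqrt(n / m) * y[rows]

x = rng.standard_normal(n)
print(np.linalg.norm(fast_jl(x)) / np.linalg.norm(x))  # ~1 up to O(eps)
```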
##### 3.3 [CW13]

To improve the complexity further, [CW13] exploits the sparsity of $A$. In each column of $S$, exactly one entry is nonzero and equals $\pm 1$; its position and sign are chosen by hash functions

$$h : [n] \to [m] \quad \text{(2-wise independent)}, \qquad \sigma : [n] \to \{\pm 1\} \quad \text{(4-wise independent)}.$$

The OSE matrix $S$ is then defined by $S_{h(i),i} = \sigma_i$, with all other entries zero.
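This $S$ is the CountSketch matrix. A small SciPy construction, with fully random `h` and `sigma` standing in for the 2-wise and 4-wise independent families (stronger than what the proof requires):

```python
import numpy as np
import scipy.sparse as sp

def countsketch(m, n, rng):
    """One +/-1 per column: entry sigma(i) at row h(i) of column i."""
    h = rng.integers(0, m, size=n)            # stand-in for 2-wise independent h
    sigma = rng.choice([-1.0, 1.0], size=n)   # stand-in for 4-wise independent sigma
    # S has n nonzeros in total, so S @ A costs O(nnz(A)).
    return sp.csr_matrix((sigma, (h, np.arange(n))), shape=(m, n))

rng = np.random.default_rng(3)
S = countsketch(m=100, n=5000, rng=rng)
a, b = rng.standard_normal(5000), rng.standard_normal(5000)
print((S @ a) @ (S @ b), a @ b)               # <Sa, Sb> ~ <a, b>
```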
Let us prove that this $S$ is an OSE by showing that

$$a, b \in \mathbb{R}^n \;\Rightarrow\; \langle Sa, Sb \rangle \approx \langle a, b \rangle.$$
*Proof.* Denote $\delta_{r,i} = \mathbb{1}_{h(i)=r}$ (indicator function). Then

$$\langle Sa, Sb \rangle = \sum_{r=1}^m \left( \sum_{i=1}^n \delta_{r,i} \sigma_i a_i \right) \left( \sum_{j=1}^n \delta_{r,j} \sigma_j b_j \right) = \sum_{i=1}^n a_i b_i \left( \sum_{r=1}^m \delta_{r,i}^2 \sigma_i^2 \right) + \sum_{r=1}^m \sum_{i \ne j} \delta_{r,i} \delta_{r,j} \sigma_i \sigma_j a_i b_j$$

$$= \langle a, b \rangle + \sum_{r=1}^m \sum_{i \ne j} \delta_{r,i} \delta_{r,j} \sigma_i \sigma_j a_i b_j,$$

since each $i$ is hashed to exactly one row ($\sum_r \delta_{r,i}^2 = 1$) and $\sigma_i^2 = 1$. The cross terms have zero mean because $\mathbb{E}[\sigma_i \sigma_j] = 0$ for $i \ne j$, so

$$\mathbb{E}[\langle Sa, Sb \rangle] = \langle a, b \rangle.$$
Now consider the variance $\mathrm{Var}[\langle Sa, Sb \rangle]$ (for this argument, cf. [NN13]). Squaring the cross-term sum produces pairs of index tuples $(r, i, j)$ and $(r', i', j')$; by the independence properties of $h$ and $\sigma$, such a pair has nonzero expectation only when $r = r'$ and $\{i, j\} = \{i', j'\}$, and otherwise $\mathbb{E}[\,\cdot\,] = 0$. Hence

$$\mathrm{Var}[\langle Sa, Sb \rangle] = \sum_{r=1}^m \sum_{i \ne j} \mathbb{E}[\delta_{r,i} \delta_{r,j}] \left( a_i^2 b_j^2 + a_i b_j a_j b_i \right) = \frac{1}{m} \sum_{i \ne j} \left( a_i^2 b_j^2 + a_i b_j a_j b_i \right),$$

using $\sum_{r=1}^m \mathbb{E}[\delta_{r,i} \delta_{r,j}] = \sum_{r=1}^m \Pr[h(i) = h(j) = r] = m \cdot \frac{1}{m^2} = \frac{1}{m}$ by 2-wise independence of $h$. Since $a_i b_j a_j b_i \le \frac{1}{2}(a_i^2 b_j^2 + a_j^2 b_i^2)$ by AM–GM and the sum over $i \ne j$ is symmetric,

$$\mathrm{Var}[\langle Sa, Sb \rangle] \le \frac{2}{m} \sum_{i \ne j} a_i^2 b_j^2 \le \frac{2}{m} \sum_{i,j} a_i^2 b_j^2 = \frac{2}{m} \|a\|_2^2 \|b\|_2^2. \qquad \square$$
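A quick Monte Carlo check of the two moments just computed, again with fully random hashes standing in for the limited-independence families:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, trials = 50, 200, 20_000
a, b = rng.standard_normal(n), rng.standard_normal(n)

vals = np.empty(trials)
for t in range(trials):
    h = rng.integers(0, m, size=n)
    sigma = rng.choice([-1.0, 1.0], size=n)
    # (Sa)_r = sum over i with h(i)=r of sigma_i * a_i
    Sa = np.bincount(h, weights=sigma * a, minlength=m)
    Sb = np.bincount(h, weights=sigma * b, minlength=m)
    vals[t] = Sa @ Sb

print(vals.mean(), a @ b)                        # E<Sa,Sb> = <a,b>
print(vals.var(), 2 / m * (a @ a) * (b @ b))     # Var <= (2/m)||a||^2 ||b||^2
```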
Now let $U \in \mathbb{R}^{n \times d}$ have orthonormal columns. We want

$$\|SUx\|_2 = (1 \pm \epsilon)\|x\|_2 \;\; \forall x \in \mathbb{R}^d \iff x^T U^T S^T S U x = (1 \pm \epsilon)\, x^T x \;\; \forall x \iff \|U^T S^T S U - I\|_2 \le \epsilon$$

(up to constant factors in $\epsilon$), and since the spectral norm is at most the Frobenius norm, it suffices to show $\|U^T S^T S U - I\|_F^2 \le \epsilon^2$. Note that

$$(U^T S^T S U)_{i,j} = \langle S U_i, S U_j \rangle \quad (U_i: \text{the } i\text{th column of } U), \qquad I_{i,j} = \langle U_i, U_j \rangle.$$

By the mean and variance computations above,

$$\forall i, j: \;\; \mathbb{E}\big[ (U^T S^T S U - I)_{i,j}^2 \big] \le \frac{2}{m} \;\Rightarrow\; \mathbb{E}\big[ \|U^T S^T S U - I\|_F^2 \big] \le \frac{2d^2}{m} \le \epsilon^2 \quad \text{for } m \ge 2d^2/\epsilon^2,$$

so by Markov's inequality, with constant probability $\|U^T S^T S U - I\|_2 \le \|U^T S^T S U - I\|_F \le O(\epsilon)$. This shows $\|SUx\|_2 = (1 \pm \epsilon)\|x\|_2$ for all $x \in \mathbb{R}^d$, i.e. $S$ is an OSE with $m = O(d^2/\epsilon^2)$ rows.
With this setting of $S$, [CW13] achieves total time $O(\mathrm{nnz}(A) + (d^3/\epsilon^2) \log^5(d/\epsilon))$, which is $O(\mathrm{nnz}(A)) + \tilde{O}(d^3/\epsilon^2)$.
The following table compares the running times of the algorithms above on the regression problem ($O(\cdot)$ omitted).

| Method | Embedding | $\mathrm{Solve}$ |
|---|---|---|
| No OSE (exact) | — | $d \cdot \mathrm{nnz}(A) + d^3$ or $d^2 n + d^3$ |
| Gaussian | $mnd = d^2 n/\epsilon^2$ | $d^3/\epsilon^2$ |
| Fast JL | $dn \log n$ | $(d^3/\epsilon^2) \log^5 n$ |
| C-W [CW13] | $\mathrm{nnz}(A)$ | $\mathrm{Solve}(d^2/\epsilon^2, d) \to (d^3/\epsilon^2) \log^5(d/\epsilon)$ |

Table 1: Comparing complexities of various algorithms for regression.
#### References

[CW09] Clarkson, Kenneth L., and David P. Woodruff. Numerical linear algebra in the streaming model. In *Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC)*, ACM, 2009.

[CW13] Clarkson, Kenneth L., and David P. Woodruff. Low rank approximation and regression in input sparsity time. In *Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC)*, ACM, 2013.

[KW11] Krahmer, Felix, and Rachel Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. *SIAM Journal on Mathematical Analysis* 43.3 (2011): 1269–1281.

[NN13] Nelson, Jelani, and Huy L. Nguyễn. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In *Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS)*, IEEE, 2013.

[Sarlos06] Sarlos, Tamas. Improved approximation algorithms for large matrices via random projections. In *Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS)*, IEEE, 2006.