Operators on Inner Product Spaces¶
Orthogonal Complements¶
Theorem for null space and range of \(T^*\)
Suppose \(T\in \mathcal{L}(V, W)\). Then
(i) \(N T^*=(R T)^\perp\).
(ii) \(R T^*=(NT)^\perp\).
\(\square\)
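As a quick numerical illustration of (i), the sketch below (a minimal numpy example with an arbitrary random matrix and tolerance, not part of the proof) checks that the null space of \(A^*\) is orthogonal to the range of \(A\) and that the dimensions add up.

```python
# Numerical sanity check of N(T*) = (R T)^perp for a random real matrix,
# where T* is the transpose.  Illustrative only; shapes and tolerances are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # T : R^3 -> R^5

# Orthonormal basis of range(A) via reduced QR (A has full column rank here).
Q, _ = np.linalg.qr(A)

# Basis of null(A.T) from the SVD of A.T.
_, s, Vt = np.linalg.svd(A.T)
null_basis = Vt[np.sum(s > 1e-12):].T    # columns span null(A.T), shape (5, 2)

# Every null vector of A.T is orthogonal to range(A).
print(np.allclose(Q.T @ null_basis, 0))          # True
# Dimensions add up: dim range(A) + dim null(A.T) = 5.
print(Q.shape[1] + null_basis.shape[1] == 5)     # True
```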
Singular value decomposition¶
For a linear map \(T \in \mathcal{L}(V,W)\), we would like a decomposition analogous to the spectral decomposition available for self-adjoint or normal operators.
Recall the important Riesz representation theorem in inner product spaces.
Riesz representation theorem
Assume \(V\) is finite-dimensional and \(\phi\) is a linear functional on \(V\). Then there exists a unique vector \(v\in V\) such that \(\phi(u)=\langle u, v\rangle\) for every \(u\in V\).
\(\square\)
In functional analysis, there is actually a similar result for infinite-dimensional Hilbert spaces.
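As a finite-dimensional illustration, the representing vector can be written as \(v=\sum_i\overline{\phi(e_i)}\,e_i\) for any orthonormal basis \(e_1,\cdots,e_n\). Below is a minimal numpy sketch of this formula; the functional \(\phi\) and the random vectors are arbitrary choices.

```python
# Finite-dimensional Riesz representation: for phi(u) = <u, v>, a representing
# vector is v = sum_i conj(phi(e_i)) e_i over any orthonormal basis.
# The functional below is an arbitrary example on C^n.
import numpy as np

n = 4
rng = np.random.default_rng(1)
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)
phi = lambda u: w @ u                      # an arbitrary linear functional on C^n

E = np.eye(n)                              # standard orthonormal basis
v = sum(np.conj(phi(E[:, i])) * E[:, i] for i in range(n))

u = rng.standard_normal(n) + 1j * rng.standard_normal(n)
# <u, v> with the convention linear in the first slot is vdot(v, u).
print(np.allclose(phi(u), np.vdot(v, u)))  # True
```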
The following lemma about \(T^*T\) will be needed.
Lemma
Properties of \(T^*T\). Suppose \(T \in \mathcal{L}(V,W)\). Then
(i) \(T^*T\) is a self-adjoint operator on \(V\). We could also check \(TT^*\) is a self-adjoint operator on \(W\).
(ii) \(N T^*T=N T\).
(iii) \(R T^*T = R T^*\).
(iv) Dimensions: \(\text{Dim}\, R T=\text{Dim}\, R T^*=\text{Dim}\, R T^*T\).
(i) By definition: \((T^*T)^*=T^*(T^*)^*=T^*T\).
(ii) \(N T\subset N T^*T\) is apparent. Assume \(v\in N T^*T\), i.e. \(T^*T v=0\); then \(\langle v, T^*Tv\rangle=0\), so \(\langle Tv, Tv\rangle=\|Tv\|^2=0\), which means \(Tv=0\).
(iii) \(R T^*T \subset R T^*\) is apparent. For the other direction, we use (ii) together with the theorem above and the self-adjointness of \(T^*T\): \(R T^*T=(N T^*T)^\perp=(N T)^\perp=R T^*\).
(iv) Apply the fundamental theorem of linear maps together with (ii) and (iii).
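The lemma is easy to check numerically. The sketch below uses an arbitrary rank-deficient real matrix, for which \(T^*\) is just the transpose; it is an illustration, not a proof.

```python
# Numerical check of the lemma for a random rank-deficient real matrix.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5x4, rank 3

AtA = A.T @ A
print(np.allclose(AtA, AtA.T))                       # (i) A^T A is self-adjoint

# (ii) null(A) is contained in null(A^T A), and the ranks agree,
# so the null spaces coincide.
_, s, Vt = np.linalg.svd(A)
null_A = Vt[np.sum(s > 1e-9):].T                     # basis of null(A)
print(np.allclose(AtA @ null_A, 0))

rank = np.linalg.matrix_rank
print(rank(AtA) == rank(A))                          # same rank, hence same null space
print(rank(A.T @ A) == rank(A.T) == rank(A))         # (iii)/(iv) equal dimensions
```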
Definition of singular value
Assume \(T\in \mathcal{L}(V, W)\) is a linear map. The singular values of \(T\) are the nonnegative square roots of the eigenvalues of \(T^*T\), listed in decreasing order.
(SVD) Singular value decomposition
Assume \(T\in \mathcal{L}(V, W)\) is a linear map with positive singular values \(s_1,\cdots, s_r\). Then there exist orthonormal lists \(e_1,\cdots, e_r\) in \(V\) and \(f_1,\cdots, f_r\) in \(W\) such that \(Tv=s_1\langle v, e_1\rangle f_1+\cdots+s_r\langle v, e_r\rangle f_r\) for every \(v\in V\).
Here we assume that \(V\) and \(W\) are finite-dimensional. The proof is constructive, and it also shows how the singular vectors are obtained.
Let \(s_1,\cdots,s_n\) be the singular values of \(T\) (with \(\text{Dim}V=n\)), where \(s_{r+1},\cdots, s_n\) are the zero singular values.
- Apply the spectral theorem to \(T^*T\): there exists an orthonormal basis \(e_1,\cdots, e_n\) of \(V\) such that \(T^*T e_k=s_k^2 e_k\) for \(k=1,\cdots,n\).
- Define \(f_k=\frac{Te_k}{s_k}\) for \(k=1,\cdots, r\).
One can check that \(f_1,\cdots,f_r\) is an orthonormal list in \(W\). The definition is also inspired by the matrix form \(T=W\Sigma V^T\): then \(TV=W\Sigma\), so \(TV\Sigma^{-1}=W\), which relates the chosen basis of \(V\) to one of \(W\).
- Prove the proposition by expressing \(v\) in the constructed orthonormal basis: \(Tv=T\big(\langle v, e_1\rangle e_1+\cdots+\langle v, e_n\rangle e_n\big)=s_1\langle v, e_1\rangle f_1+\cdots+s_r\langle v, e_r\rangle f_r.\)
For \(k> r\), \(Te_k=0\), because \(T^*T e_k =0\cdot e_k=0\) and \(N T^*T=N T\) by part (ii) of the lemma on \(T^*T\).
We can also write down the matrix of \(T\) with respect to the bases \(\{e_k\}_{1\leq k\leq r}\) and \(\{f_k\}_{1\leq k\leq r}\), suitably extended to orthonormal bases of \(V\) and \(W\).
Note we already have \(\{e_k\}_{1\leq k\leq n}\), and from the proof above \(Te_k=s_k f_k\) for \(k\leq r\) and \(Te_k=0\) for \(k >r\). We extend \(\{f_k\}_{1\leq k\leq r}\) to \(\{f_k\}_{1\leq k\leq m}\) (with \(\text{Dim} W=m\)) by taking an orthonormal basis of \(N T^*\). This is because we want an orthonormal basis of \(R(T)^\perp\), which equals \(NT^*\) by the theorem on the null space and range of \(T^*\). (Readers should double-check that the dimension of \(N T^*\) is \(m-r\), since \(\text{Dim}\, RT=r\).)
\(\square\)
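The construction in the proof can be replayed numerically: diagonalize \(A^TA\) to get the \(e_k\) and \(s_k\), set \(f_k=Te_k/s_k\), and check the decomposition. The matrix below is an arbitrary example.

```python
# Verify Tv = sum_k s_k <v, e_k> f_k numerically, building e_k, f_k
# from the spectral decomposition of A^T A as in the proof.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))                  # T : R^3 -> R^4

eigvals, E = np.linalg.eigh(A.T @ A)             # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
s = np.sqrt(np.clip(eigvals[order], 0, None))    # singular values, decreasing
E = E[:, order]

r = int(np.sum(s > 1e-12))
F = (A @ E[:, :r]) / s[:r]                       # f_k = T e_k / s_k
print(np.allclose(F.T @ F, np.eye(r)))           # f_1,...,f_r is orthonormal

v = rng.standard_normal(3)
Tv = sum(s[k] * (v @ E[:, k]) * F[:, k] for k in range(r))
print(np.allclose(A @ v, Tv))                    # Tv = sum_k s_k <v, e_k> f_k
```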
Matrix version of SVD, a compact SVD form
Assume \(A\) is an \(m\)-by-\(n\) matrix of rank \(r\geq 1\). Then there exist an \(m\)-by-\(r\) matrix \(W\) with orthonormal columns, an \(r\)-by-\(r\) diagonal matrix \(\Sigma\) with positive entries on the diagonal, and an \(n\)-by-\(r\) matrix \(V\) with orthonormal columns such that \(A=W\Sigma V^*\).
Let \(T: \mathbb{F}^n\rightarrow \mathbb{F}^m\) be the linear map whose matrix with respect to the standard bases equals \(A\). From the proof of the SVD theorem above, \(\text{Dim}\,RT=r\), \(Te_k=s_k f_k\) for \(k\leq r\), and \(Te_k=0\) for \(k>r\). We make use of this structure. Let
\(W\) to be the \(m\)-by-\(r\) matrix whose columns are \(f_1,\cdots,f_r\),
\(\Sigma\) to be the \(r\)-by-\(r\) diagonal matrix with diagonal entries \(s_1,\cdots,s_r\),
\(V\) to be the \(n\)-by-\(r\) matrix whose columns are \(e_1,\cdots,e_r\).
Choose \(u_k\), the \(k\)th standard basis vector of \(\mathbb{F}^r\); applying the matrices above gives \(AVu_k=Ae_k=s_kf_k=W\Sigma u_k\) for \(k=1,\cdots,r\),
so \(AV=W\Sigma\). Multiplying both sides on the right by \(V^*\) gives \(AVV^*=W\Sigma V^*\), and we would like to conclude \(A=W\Sigma V^*\). But we have to be careful.
Here \(VV^*= I\) does not hold in general, since \(V\) is not square. We argue as follows. If \(k\leq r\), then \(V^*e_k=u_k\), so \(VV^*e_k=e_k\). Thus \(AVV^*v=Av\) for all \(v\in \text{span}(e_1,\cdots,e_r)\). For \(v\in \text{span}(e_1,\cdots,e_r)^\perp=\text{span}(e_{r+1},\cdots,e_n)\), we have \(Av=0\) and \(V^*v=0\), so again \(AVV^*v=Av=0\). Hence \(AVV^*=A\) and \(A=W\Sigma V^*\).
\(\square\)
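With numpy, the compact SVD of the theorem is obtained by truncating `np.linalg.svd` to the positive singular values. A minimal sketch, with an arbitrary rank-deficient matrix:

```python
# Compact SVD: W (m x r) and V (n x r) with orthonormal columns,
# Sigma (r x r) diagonal with the positive singular values.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # m=5, n=4, rank r=2

U, s, Vh = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-12))
W, Sigma, V = U[:, :r], np.diag(s[:r]), Vh[:r].T

print(np.allclose(A, W @ Sigma @ V.T))          # A = W Sigma V^*
print(np.allclose(W.T @ W, np.eye(r)))          # orthonormal columns of W
print(np.allclose(V.T @ V, np.eye(r)))          # orthonormal columns of V
```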
Denote \(S=\text{diag}(s_1,\cdots,s_r)\), \(\Sigma=\left[\begin{array}{cc} S & 0 \\ 0&0\end{array}\right]_{m \times n}\), \(V_1=(e_1,\cdots,e_r)\), \(V_2=(e_{r+1}, \cdots,e_n)\), where the orthonormal columns of \(V_2\) are eigenvectors of \(A^*A\) for the eigenvalue \(0\). Notice \(A^*A V_1=V_1 S^2\);
define \(W_1=A V_1 S^{-1}\); then \(W_1^* W_1=S^{-1}V_1^* A^*A V_1 S^{-1}=S^{-1}S^2S^{-1}=I_r\). As for \(V_2\), we have \(A^*A V_2=V_2\cdot 0=0\), so \(V_2^* A^* A V_2=(AV_2)^*(AV_2)=0\), hence \(AV_2=0\).
Choose \(W_2\) whose columns form an orthonormal basis of the orthogonal complement of \(R W_1\), which can be computed from \(NA^*\) (that is, \(A^* W_2=0\)). So letting \(W=(W_1,W_2)\), we have \(A=W\Sigma V^*\), since \(AV=(AV_1, AV_2)=(W_1S, 0)=W\Sigma\) and \(V=(V_1,V_2)\) is square with \(VV^*=I\).
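The construction above can be followed step by step in code: eigendecompose \(A^*A\) to obtain \(V_1, V_2, S\), set \(W_1=AV_1S^{-1}\), and complete it with an orthonormal basis of \(NA^*\). A sketch for a real rank-deficient matrix (so \(A^*=A^T\)):

```python
# Full SVD built from the eigendecomposition of A^T A, following the text.
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 4
A = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))   # rank r = 2

eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
r = int(np.sum(eigvals > 1e-12))

V1, V2 = V[:, :r], V[:, r:]
S = np.diag(np.sqrt(eigvals[:r]))
W1 = A @ V1 @ np.linalg.inv(S)                    # W1^T W1 = I_r

# Complete W1 with an orthonormal basis of null(A^T) = range(A)^perp.
_, sa, Vh_a = np.linalg.svd(A.T)
W2 = Vh_a[np.sum(sa > 1e-12):].T                  # m x (m - r)
W = np.hstack([W1, W2])

Sigma = np.zeros((m, n))
Sigma[:r, :r] = S
print(np.allclose(W.T @ W, np.eye(m)))            # W is orthogonal
print(np.allclose(A, W @ Sigma @ V.T))            # A = W Sigma V^T
```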
Principal Component Analysis¶
We first discuss total PCA, where all \(n\) components are kept.
Principal Component Analysis
Assume \(X\in\mathbb{R}^n\) is a random vector. Consider the linear map \(T: \mathbb{R}^n\rightarrow \mathbb{R}^n\) defined by \(Y=TX=AX\),
where the matrix of \(T\) with respect to the standard basis, \(A=(\alpha_1,\cdots,\alpha_n)^T\), is orthogonal: \(\alpha_i\in \mathbb{R}^n\) and \(\alpha_i^T \alpha_j=\delta_{ij}\). We will show that there exists \(\alpha_1\) such that, after the transformation, \(y_1\) has the maximum variance; \(y_1\) is called the first principal component.
Firstly, let us recall that \(\pmb{\mu}=(\mathbb{E}x_1,\cdots,\mathbb{E}x_n)^T\) is the mean vector, and the corresponding covariance matrix is \(\Sigma=(\text{cov}(x_i,x_j))_{ij}=\mathbb{E}(X-\pmb{\mu})(X-\pmb{\mu})^T=\mathbb{E}X X^T-\pmb{\mu}\pmb{\mu}^T.\)
After the transformation, we have the following properties by linearity of expectation.
Property of mean and covariance after transformation
(i) \(\pmb{\mu}_y=A\pmb{\mu}\), that is, \(\mathbb{E}y_i=\alpha_i^T \pmb{\mu}\).
(ii) \(\Sigma_y = A\Sigma A^T\), that is, \((\Sigma_y)_{ij}=\text{cov}(y_i,y_j)=\alpha_i^T \Sigma \alpha_j\).
We prove (ii). By definition, \(\Sigma_y=\mathbb{E}(Y-\pmb{\mu}_y)(Y-\pmb{\mu}_y)^T=\mathbb{E}\,A(X-\pmb{\mu})(X-\pmb{\mu})^TA^T=A\Sigma A^T.\)
From the above property, we can explain why we require the matrix \(A\) to be orthogonal: we want the covariance matrix of \(Y\) to be diagonal, i.e. \(y_i\) and \(y_j\) to be uncorrelated for \(i\neq j\).
Theorem for principal component analysis
The variance of \(y_1\) is maximized when \(\alpha_1\) is a unit eigenvector for the maximum eigenvalue \(\lambda_1\) of the matrix \(\Sigma\), and then \(Var(y_1)=\lambda_1\).
Maximizing \(Var(y_1)\) is equivalent to maximizing \(\alpha_1^T \Sigma \alpha_1\) over all \(\alpha_1\in\mathbb{R}^n\) with \(\alpha_1^T\alpha_1=1\). Taking the gradient (with respect to \(\alpha_1\in \mathbb{R}^n\)) of the corresponding Lagrangian gives \(2\Sigma \alpha_1 -2 l \alpha_1= \theta\). So at the maximum \(l\) is an eigenvalue of \(\Sigma\), and the objective equals \(\alpha_1^T\Sigma\alpha_1=l\,\alpha_1^T\alpha_1=l.\)
To reach the maximum, let \(l\) be the maximum eigenvalue of \(\Sigma\), denoted \(\lambda_1\), and choose a corresponding unit eigenvector \(\alpha_1\).
\(\square\)
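A small numerical illustration: for a toy covariance matrix (an arbitrary choice), the top eigenvector attains variance \(\lambda_1\), and no other unit direction does better.

```python
# First principal component: the maximizing direction is the top eigenvector
# of Sigma and the achieved variance is lambda_1.  Toy data, illustrative only.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])        # a symmetric positive-definite covariance

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
lam1 = eigvals[-1]
alpha1 = eigvecs[:, -1]                    # unit eigenvector of the largest eigenvalue

print(np.isclose(alpha1 @ Sigma @ alpha1, lam1))   # Var(y_1) = lambda_1

# Random unit directions never beat alpha_1.
rng = np.random.default_rng(6)
u = rng.standard_normal((1000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', u, Sigma, u) <= lam1 + 1e-9))
```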
If we want \(k\) principal components that are mutually uncorrelated, i.e. \(\text{cov}(y_i,y_j)=0\) for \(i\neq j\), we have the following conclusion.
Theorem for \(k\) principal component analysis
The \(k\) principal components of \(X\) are determined by the transformation \(T\) defined by \(y_i=\alpha_i^T X\) for \(i=1,\cdots,k\),
where \(\alpha_1,\cdots,\alpha_k\) are orthonormal eigenvectors corresponding to the \(k\) largest eigenvalues of \(\Sigma\).
We only prove the case \(k=2\); the general case follows by induction.
We aim to find a vector \(\alpha_2\) maximizing \(\alpha_2^T\Sigma \alpha_2\), subject to the constraints \(\alpha_2^T \alpha_2=1\) and \(\langle\alpha_2,\alpha_1 \rangle=0\). Taking the gradient of the Lagrangian gives \(2\Sigma\alpha_2-2l\alpha_2-l_2\alpha_1=\theta;\)
taking the inner product with \(\alpha_1\) gives \(2\alpha_1^T\Sigma\alpha_2-2l\,\alpha_1^T\alpha_2-l_2\alpha_1^T\alpha_1=0\); since \(\alpha_1^T\Sigma\alpha_2=\lambda_1\alpha_1^T\alpha_2=0\) and \(\alpha_1^T\alpha_1=1\),
so \(l_2=0\), and the gradient equation reduces to \(\Sigma\alpha_2=l\alpha_2\). Then, by the same logic as in the theorem for principal component analysis, \(l\) is an eigenvalue of \(\Sigma\), and maximizing over unit vectors orthogonal to \(\alpha_1\) yields the second largest eigenvalue \(\lambda_2\), attained at a corresponding unit eigenvector \(\alpha_2\).
\(\square\)
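Numerically, stacking the top-\(k\) eigenvectors as the rows of \(A_k\) makes the covariance of \((y_1,\cdots,y_k)\) diagonal with the \(k\) largest eigenvalues; the same toy covariance matrix is reused below as an arbitrary example.

```python
# Top-k principal components from the eigendecomposition of Sigma.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
k = 2

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, alphas = eigvals[order], eigvecs[:, order]

A_k = alphas[:, :k].T                      # rows are alpha_1^T, ..., alpha_k^T
Sigma_y = A_k @ Sigma @ A_k.T              # covariance of (y_1, ..., y_k)

print(np.allclose(Sigma_y, np.diag(lam[:k])))     # uncorrelated, Var(y_i) = lambda_i
print(np.isclose(np.trace(Sigma), lam.sum()))     # total variance is preserved
```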
After the full transformation to \(Y\in \mathbb{R}^n\), we have a pleasant result for the total variance of \(Y\): \(\sum_{i=1}^n Var(y_i)=\sum_{i=1}^n\lambda_i=\text{tr}(\Sigma_y)=\text{tr}(A\Sigma A^T)=\text{tr}(\Sigma A^TA)=\text{tr}(\Sigma)=\sum_{i=1}^n Var(x_i),\) which is obtained by taking the trace of \(\Sigma_y=A\Sigma A^T\) and using the cyclic property of the trace together with \(A^TA=I\).
After choosing the \(n\) principal components, we also want to quantify the relationship between \(y_i\) and \(x_j\).
Factor loading
The factor loading of \(y_i\) with respect to \(x_j\) is defined by \(\rho(y_i,x_j)=\dfrac{\text{cov}(y_i,x_j)}{\sqrt{Var(y_i)Var(x_j)}}=\dfrac{\sqrt{\lambda_i}\,\alpha_{ji}}{\sqrt{\sigma_{jj}}},\)
where \(\alpha_{ji}\) is the \(j\)th component of the vector \(\alpha_i\). This has to be computed element-wise.
Just by definition: \(\text{cov}(y_i,x_j)=\text{cov}(\alpha_i^TX, e_j^TX)=e_j^T\Sigma\alpha_i=\lambda_i\alpha_{ji}\), while \(Var(y_i)=\lambda_i\) and \(Var(x_j)=\sigma_{jj}\).
\(\square\)
Properties of factor loadings
(i) Sum over the original variables: \(\sum_{j=1}^n \sigma_{jj}\,\rho^2(y_i,x_j)=\sum_{j=1}^n\lambda_i\alpha_{ji}^2=\lambda_i\).
(ii) Sum over all principal components: \(\sum_{i=1}^n \rho^2(y_i,x_j)=\dfrac{1}{\sigma_{jj}}\sum_{i=1}^n\lambda_i\alpha_{ji}^2=1\).
We give a proof of (ii) using the outer product formula.
Since \(\Sigma=A^T\Sigma_y A=\sum_{i=1}^n \lambda_i \alpha_i\alpha_i^T\), comparing the \((j,j)\) entries gives \(\sigma_{jj}=\sum_{i=1}^n\lambda_i\alpha_{ji}^2\), and dividing by \(\sigma_{jj}\) yields (ii).
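Both summation properties are easy to verify numerically. The sketch below reuses the toy covariance matrix and computes the full matrix of loadings \(\rho(y_i,x_j)\).

```python
# Factor loadings rho(y_i, x_j) = sqrt(lambda_i) * alpha_ji / sqrt(sigma_jj),
# with the two summation properties checked numerically.  Illustrative only.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
n = Sigma.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, alphas = eigvals[order], eigvecs[:, order]        # alphas[:, i] = alpha_i

sigma_diag = np.diag(Sigma)
# rho[i, j] = sqrt(lam_i) * alpha_{ji} / sqrt(sigma_jj)
rho = np.sqrt(lam)[:, None] * alphas.T / np.sqrt(sigma_diag)[None, :]

print(np.allclose((rho**2 * sigma_diag).sum(axis=1), lam))   # (i) sum over x_j
print(np.allclose((rho**2).sum(axis=0), np.ones(n)))         # (ii) sum over y_i
```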
Normalized version¶
Usually different random variables take values on very different scales, so we have to normalize them if we want to analyse them together.
All of the content above then stays the same except for the following changes.
Changes applied to normalized random vectors
(i) \(\pmb{\mu}^*=\theta\) and \(\Sigma^*=R\), where \(R\) is the correlation coefficient matrix, with \(r_{jj}=1\) on the diagonal.
(ii) Sum of variances after transformation: \(\sum_{i=1}^n \lambda_i^*=n\).
(iii) Factor loadings: \(\rho(y_i,x_j)=\sqrt{\lambda_i^*}\,\alpha_{ji}\).
Truncated principal components¶
In practice, we usually do not keep all \(n\) principal components but rather \(k \ll n\) of them, to achieve compression of the data.
How do we choose these \(k\) components? We rely on the following criterion.
Contribution to variance
The contribution to the total variance of the principal component \(y_i\) is defined by \(\eta_i=\dfrac{\lambda_i}{\sum_{j=1}^n\lambda_j};\)
usually we require \(\sum_{i=1}^k \eta_i\) to be larger than \(70\%\).
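In code, the rule amounts to taking the smallest \(k\) whose cumulative contribution exceeds the threshold. The eigenvalues below are made-up example values.

```python
# Choose k from the cumulative contribution to variance (70% threshold here).
import numpy as np

lam = np.array([5.2, 2.1, 1.0, 0.4, 0.3])          # eigenvalues of Sigma, decreasing
eta = lam / lam.sum()                              # contribution of each component
cumulative = np.cumsum(eta)

k = int(np.searchsorted(cumulative, 0.70)) + 1
print(eta.round(3), cumulative.round(3), k)        # here k = 2 (about 81%)
```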
Sampled PCA¶
In actual experiments, we observe the data \(m\) times independently and replace the mean and covariance matrix with their empirical versions. Assume \(X_1,\cdots,X_m\) are \(m\) mutually independent samples in \(\mathbb{R}^n\). Then the unbiased estimates of the mean and covariance are \(\overline{X}=\frac{1}{m}\sum_{k=1}^m X_k\) and \(S=\frac{1}{m-1}\sum_{k=1}^m (X_k-\overline{X})(X_k-\overline{X})^T\). Apply the same procedure as in PCA to \(S\), and we are done.
For actual calculation, we usually center and scale each sample componentwise, \(X_k'=D^{-1/2}(X_k-\overline{X})\) with \(D=\text{diag}(s_{11},\cdots,s_{nn})\), and compute the singular values \(s_1\geq\cdots\geq s_n\) of the data matrix \(X'=(X_1',\cdots,X_m')_{n\times m}\). The eigenvalues of the normalized empirical covariance then satisfy \(\lambda_i=s_i^2/(m-1)\) for \(i=1,\cdots,n\), the columns of the matrix of left singular vectors of \(X'\) are the eigenvectors \(\alpha_i\), and \(Y=AX\) with \(A\) built from them as before. If we choose \(k\) principal components, we keep the \(k\) leading singular vectors.
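A minimal numpy sketch of sample PCA via the SVD of the centered data matrix, with the \(1/(m-1)\) factor made explicit; the data are randomly generated and purely illustrative.

```python
# Sample PCA via the SVD of the centered data matrix X (shape n x m:
# n variables, m observations, as in the text).
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 200
X = rng.multivariate_normal([0, 0, 0],
                            [[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=m).T   # n x m

Xc = X - X.mean(axis=1, keepdims=True)            # center each variable
S = (Xc @ Xc.T) / (m - 1)                         # empirical covariance

U, sv, _ = np.linalg.svd(Xc, full_matrices=False) # left singular vectors = directions
lam_from_svd = sv**2 / (m - 1)
lam_from_eig = np.sort(np.linalg.eigvalsh(S))[::-1]

print(np.allclose(lam_from_svd, lam_from_eig))    # same eigenvalues
Y = U.T @ Xc                                      # principal component scores
print(np.allclose(np.cov(Y), np.diag(lam_from_svd)))   # uncorrelated scores
```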