Operators on Inner Product Spaces¶
Orthogonal Complements¶
Theorem for null space and range of \(T^*\)
Suppose \(T\in \mathcal{L}(V, W)\). Then
(i) \(N T^*=(R T)^\perp\).
(ii) \(R T^*=(NT)^\perp\).
\(\square\)
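As a quick numerical illustration of (i), the sketch below (a minimal numpy example with an arbitrary random matrix and tolerance, not part of the proof) checks that the null space of \(A^*\) is orthogonal to the range of \(A\) and that the dimensions add up.

```python
# Numerical sanity check of N(T*) = (R T)^perp for a random real matrix,
# where T* is the transpose.  Illustrative only; shapes and tolerances are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # T : R^3 -> R^5

# Orthonormal basis of range(A) via reduced QR (A has full column rank here).
Q, _ = np.linalg.qr(A)

# Basis of null(A.T) from the SVD of A.T.
_, s, Vt = np.linalg.svd(A.T)
null_basis = Vt[np.sum(s > 1e-12):].T    # columns span null(A.T), shape (5, 2)

# Every null vector of A.T is orthogonal to range(A).
print(np.allclose(Q.T @ null_basis, 0))          # True
# Dimensions add up: dim range(A) + dim null(A.T) = 5.
print(Q.shape[1] + null_basis.shape[1] == 5)     # True
```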
Singular value decomposition¶
For a linear map \(T \in \mathcal{L}(V,W)\), we would like a decomposition analogous to the spectral decomposition available for self-adjoint or normal operators.
Recall the important Riesz representation theorem in inner product spaces.
Riesz representation theorem
Assume \(V\) is finite-dimensional and \(\phi\) is a linear functional on \(V\). Then there exists a unique vector \(v\in V\) such that \(\phi(u)=\langle u, v\rangle\) for every \(u\in V\).
\(\square\)
In functional analysis, there is actually a similar result for infinite-dimensional Hilbert spaces.
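As a finite-dimensional illustration, the representing vector can be written as \(v=\sum_i\overline{\phi(e_i)}\,e_i\) for any orthonormal basis \(e_1,\cdots,e_n\). Below is a minimal numpy sketch of this formula; the functional \(\phi\) and the random vectors are arbitrary choices.

```python
# Finite-dimensional Riesz representation: for phi(u) = <u, v>, a representing
# vector is v = sum_i conj(phi(e_i)) e_i over any orthonormal basis.
# The functional below is an arbitrary example on C^n.
import numpy as np

n = 4
rng = np.random.default_rng(1)
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)
phi = lambda u: w @ u                      # an arbitrary linear functional on C^n

E = np.eye(n)                              # standard orthonormal basis
v = sum(np.conj(phi(E[:, i])) * E[:, i] for i in range(n))

u = rng.standard_normal(n) + 1j * rng.standard_normal(n)
# <u, v> with the convention linear in the first slot is vdot(v, u).
print(np.allclose(phi(u), np.vdot(v, u)))  # True
```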
The following lemma about \(T^*T\) will be needed.
Lemma
Properties of \(T^*T\). Suppose \(T \in \mathcal{L}(V,W)\). Then
(i) \(T^*T\) is a self-adjoint operator on \(V\). We could also check \(TT^*\) is a self-adjoint operator on \(W\).
(ii) \(N T^*T=N T\).
(iii) \(R T^*T = R T^*\).
(iv) Dimensions: \(\text{Dim}\, R T=\text{Dim}\, R T^*=\text{Dim}\, R T^*T\).
(i) By definition: \((T^*T)^*=T^*(T^*)^*=T^*T\).
(ii) \(N T\subset N T^*T\) is apparent. Assume \(v\in N T^*T\), i.e. \(T^*T v=0\); then \(\langle v, T^*Tv\rangle=0\), so \(\langle Tv, Tv\rangle=\|Tv\|^2=0\), which means \(Tv=0\).
(iii) \(R T^*T \subset R T^*\) is apparent. For the other direction, we use (ii) together with the theorem above and the self-adjointness of \(T^*T\): \(R T^*T=(N T^*T)^\perp=(N T)^\perp=R T^*\).
(iv) Apply the fundamental theorem of linear maps together with (ii) and (iii).
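The lemma is easy to check numerically. The sketch below uses an arbitrary rank-deficient real matrix, for which \(T^*\) is just the transpose; it is an illustration, not a proof.

```python
# Numerical check of the lemma for a random rank-deficient real matrix.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5x4, rank 3

AtA = A.T @ A
print(np.allclose(AtA, AtA.T))                       # (i) A^T A is self-adjoint

# (ii) null(A) is contained in null(A^T A), and the ranks agree,
# so the null spaces coincide.
_, s, Vt = np.linalg.svd(A)
null_A = Vt[np.sum(s > 1e-9):].T                     # basis of null(A)
print(np.allclose(AtA @ null_A, 0))

rank = np.linalg.matrix_rank
print(rank(AtA) == rank(A))                          # same rank, hence same null space
print(rank(A.T @ A) == rank(A.T) == rank(A))         # (iii)/(iv) equal dimensions
```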
Definition of singular value
Assume \(T\in \mathcal{L}(V, W)\) is a linear map. The singular values of \(T\) are the nonnegative square roots of the eigenvalues of \(T^*T\), listed in decreasing order.
(SVD) Singular value decomposition
Assume \(T\in \mathcal{L}(V, W)\) is a linear map with positive singular values \(s_1,\cdots, s_r\). Then there exist orthonormal lists \(e_1,\cdots, e_r\) in \(V\) and \(f_1,\cdots, f_r\) in \(W\) such that \(Tv=s_1\langle v, e_1\rangle f_1+\cdots+s_r\langle v, e_r\rangle f_r\) for every \(v\in V\).
Here we assume that \(V\) and \(W\) are finite-dimensional. The proof is constructive, and it also shows how the singular vectors are obtained.
Let \(s_1,\cdots,s_n\) be the singular values of \(T\) (with \(\text{Dim}V=n\)), where \(s_{r+1},\cdots, s_n\) are the zero singular values.
- Apply the spectral theorem to \(T^*T\): there exists an orthonormal basis \(e_1,\cdots, e_n\) of \(V\) such that \(T^*T e_k=s_k^2 e_k\) for \(k=1,\cdots,n\).
- Define \(f_k=\frac{Te_k}{s_k}\) for \(k=1,\cdots, r\).
One can check that \(f_1,\cdots,f_r\) is an orthonormal list in \(W\). The definition is also inspired by the matrix form \(T=W\Sigma V^T\): then \(TV=W\Sigma\), so \(TV\Sigma^{-1}=W\), which relates the chosen basis of \(V\) to one of \(W\).
- Prove the proposition by expressing \(v\) in the constructed orthonormal basis: \(Tv=T\big(\langle v, e_1\rangle e_1+\cdots+\langle v, e_n\rangle e_n\big)=s_1\langle v, e_1\rangle f_1+\cdots+s_r\langle v, e_r\rangle f_r.\)
For \(k> r\), \(Te_k=0\), because \(T^*T e_k =0\cdot e_k=0\) and \(N T^*T=N T\) by part (ii) of the lemma on \(T^*T\).
We can also write down the matrix of \(T\) with respect to the bases \(\{e_k\}_{1\leq k\leq r}\) and \(\{f_k\}_{1\leq k\leq r}\), suitably extended to orthonormal bases of \(V\) and \(W\).
Note we already have \(\{e_k\}_{1\leq k\leq n}\), and from the proof above \(Te_k=s_k f_k\) for \(k\leq r\) and \(Te_k=0\) for \(k >r\). We extend \(\{f_k\}_{1\leq k\leq r}\) to \(\{f_k\}_{1\leq k\leq m}\) (with \(\text{Dim} W=m\)) by taking an orthonormal basis of \(N T^*\). This is because we want an orthonormal basis of \(R(T)^\perp\), which equals \(NT^*\) by the theorem on the null space and range of \(T^*\). (Readers should double-check that the dimension of \(N T^*\) is \(m-r\), since \(\text{Dim}\, RT=r\).)
\(\square\)
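The construction in the proof can be replayed numerically: diagonalize \(A^TA\) to get the \(e_k\) and \(s_k\), set \(f_k=Te_k/s_k\), and check the decomposition. The matrix below is an arbitrary example.

```python
# Verify Tv = sum_k s_k <v, e_k> f_k numerically, building e_k, f_k
# from the spectral decomposition of A^T A as in the proof.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))                  # T : R^3 -> R^4

eigvals, E = np.linalg.eigh(A.T @ A)             # ascending eigenvalues
order = np.argsort(eigvals)[::-1]
s = np.sqrt(np.clip(eigvals[order], 0, None))    # singular values, decreasing
E = E[:, order]

r = int(np.sum(s > 1e-12))
F = (A @ E[:, :r]) / s[:r]                       # f_k = T e_k / s_k
print(np.allclose(F.T @ F, np.eye(r)))           # f_1,...,f_r is orthonormal

v = rng.standard_normal(3)
Tv = sum(s[k] * (v @ E[:, k]) * F[:, k] for k in range(r))
print(np.allclose(A @ v, Tv))                    # Tv = sum_k s_k <v, e_k> f_k
```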
Matrix version of SVD, a compact SVD form
Assume \(A\) is an \(m\)-by-\(n\) matrix of rank \(r\geq 1\). Then there exist an \(m\)-by-\(r\) matrix \(W\) with orthonormal columns, an \(r\)-by-\(r\) diagonal matrix \(\Sigma\) with positive entries on the diagonal, and an \(n\)-by-\(r\) matrix \(V\) with orthonormal columns such that \(A=W\Sigma V^*\).
Let \(T: \mathbb{F}^n\rightarrow \mathbb{F}^m\) be the linear map whose matrix with respect to the standard bases equals \(A\). From the proof of the SVD theorem above, \(\text{Dim}\,RT=r\), \(Te_k=s_k f_k\) for \(k\leq r\), and \(Te_k=0\) for \(k>r\). We make use of this structure. Let
\(W\) to be the \(m\)-by-\(r\) matrix whose columns are \(f_1,\cdots,f_r\),
\(\Sigma\) to be the \(r\)-by-\(r\) diagonal matrix with diagonal entries \(s_1,\cdots,s_r\),
\(V\) to be the \(n\)-by-\(r\) matrix whose columns are \(e_1,\cdots,e_r\).
Choose \(u_k\), the \(k\)th standard basis vector of \(\mathbb{F}^r\); applying the matrices above gives \(AVu_k=Ae_k=s_kf_k=W\Sigma u_k\) for \(k=1,\cdots,r\),
so \(AV=W\Sigma\). Multiplying both sides on the right by \(V^*\) gives \(AVV^*=W\Sigma V^*\), and we would like to conclude \(A=W\Sigma V^*\). But we have to be careful.
Here \(VV^*= I\) does not hold in general, since \(V\) is not square. We argue as follows. If \(k\leq r\), then \(V^*e_k=u_k\), so \(VV^*e_k=e_k\). Thus \(AVV^*v=Av\) for all \(v\in \text{span}(e_1,\cdots,e_r)\). For \(v\in \text{span}(e_1,\cdots,e_r)^\perp=\text{span}(e_{r+1},\cdots,e_n)\), we have \(Av=0\) and \(V^*v=0\), so again \(AVV^*v=Av=0\). Hence \(AVV^*=A\) and \(A=W\Sigma V^*\).
\(\square\)
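With numpy, the compact SVD of the theorem is obtained by truncating `np.linalg.svd` to the positive singular values. A minimal sketch, with an arbitrary rank-deficient matrix:

```python
# Compact SVD: W (m x r) and V (n x r) with orthonormal columns,
# Sigma (r x r) diagonal with the positive singular values.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # m=5, n=4, rank r=2

U, s, Vh = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-12))
W, Sigma, V = U[:, :r], np.diag(s[:r]), Vh[:r].T

print(np.allclose(A, W @ Sigma @ V.T))          # A = W Sigma V^*
print(np.allclose(W.T @ W, np.eye(r)))          # orthonormal columns of W
print(np.allclose(V.T @ V, np.eye(r)))          # orthonormal columns of V
```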
Denote \(S=\text{diag}(s_1,\cdots,s_r)\), \(\Sigma=\left[\begin{array}{cc} S & 0 \\ 0&0\end{array}\right]_{m \times n}\), \(V_1=(e_1,\cdots,e_r)\), \(V_2=(e_{r+1}, \cdots,e_n)\), where the orthonormal columns of \(V_2\) are eigenvectors of \(A^*A\) for the eigenvalue \(0\). Notice \(A^*A V_1=V_1 S^2\);
define \(W_1=A V_1 S^{-1}\); then \(W_1^* W_1=S^{-1}V_1^* A^*A V_1 S^{-1}=S^{-1}S^2S^{-1}=I_r\). As for \(V_2\), we have \(A^*A V_2=V_2\cdot 0=0\), so \(V_2^* A^* A V_2=(AV_2)^*(AV_2)=0\), hence \(AV_2=0\).
Choose \(W_2\) whose columns form an orthonormal basis of the orthogonal complement of \(R W_1\), which can be computed from \(NA^*\) (that is, \(A^* W_2=0\)). So letting \(W=(W_1,W_2)\), we have \(A=W\Sigma V^*\), since \(AV=(AV_1, AV_2)=(W_1S, 0)=W\Sigma\) and \(V=(V_1,V_2)\) is square with \(VV^*=I\).
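The construction above can be followed step by step in code: eigendecompose \(A^*A\) to obtain \(V_1, V_2, S\), set \(W_1=AV_1S^{-1}\), and complete it with an orthonormal basis of \(NA^*\). A sketch for a real rank-deficient matrix (so \(A^*=A^T\)):

```python
# Full SVD built from the eigendecomposition of A^T A, following the text.
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 4
A = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))   # rank r = 2

eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
r = int(np.sum(eigvals > 1e-12))

V1, V2 = V[:, :r], V[:, r:]
S = np.diag(np.sqrt(eigvals[:r]))
W1 = A @ V1 @ np.linalg.inv(S)                    # W1^T W1 = I_r

# Complete W1 with an orthonormal basis of null(A^T) = range(A)^perp.
_, sa, Vh_a = np.linalg.svd(A.T)
W2 = Vh_a[np.sum(sa > 1e-12):].T                  # m x (m - r)
W = np.hstack([W1, W2])

Sigma = np.zeros((m, n))
Sigma[:r, :r] = S
print(np.allclose(W.T @ W, np.eye(m)))            # W is orthogonal
print(np.allclose(A, W @ Sigma @ V.T))            # A = W Sigma V^T
```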
Principal Component Analysis¶
We first discuss total PCA, where all \(n\) components are kept.
Principal Component Analysis
Assume \(X\in\mathbb{R}^n\) is a random vector. Consider the linear map \(T: \mathbb{R}^n\rightarrow \mathbb{R}^n\) defined by \(Y=TX=AX\),
where the matrix of \(T\) with respect to the standard basis, \(A=(\alpha_1,\cdots,\alpha_n)^T\), is orthogonal: \(\alpha_i\in \mathbb{R}^n\) and \(\alpha_i^T \alpha_j=\delta_{ij}\). We will show that there exists \(\alpha_1\) such that, after the transformation, \(y_1\) has the maximum variance; \(y_1\) is called the first principal component.
Firstly, let us recall that \(\pmb{\mu}=(\mathbb{E}x_1,\cdots,\mathbb{E}x_n)^T\) is the mean vector, and the corresponding covariance matrix is \(\Sigma=(\text{cov}(x_i,x_j))_{ij}=\mathbb{E}(X-\pmb{\mu})(X-\pmb{\mu})^T=\mathbb{E}X X^T-\pmb{\mu}\pmb{\mu}^T.\)
After the transformation, we have the following properties by linearity of expectation.
Property of mean and covariance after transformation
(i) \(\pmb{\mu}_y=A\pmb{\mu}\), that is, \(\mathbb{E}y_i=\alpha_i^T \pmb{\mu}\).
(ii) \(\Sigma_y = A\Sigma A^T\), that is, \((\Sigma_y)_{ij}=\text{cov}(y_i,y_j)=\alpha_i^T \Sigma \alpha_j\).
We prove (ii). By definition, \(\Sigma_y=\mathbb{E}(Y-\pmb{\mu}_y)(Y-\pmb{\mu}_y)^T=\mathbb{E}\,A(X-\pmb{\mu})(X-\pmb{\mu})^TA^T=A\Sigma A^T.\)
From the above property, we can explain why we require the matrix \(A\) to be orthogonal: we want the covariance matrix of \(Y\) to be diagonal, i.e. \(y_i\) and \(y_j\) to be uncorrelated for \(i\neq j\).
Theorem for principal component analysis
The variance of \(y_1\) is maximized when \(\alpha_1\) is a unit eigenvector for the maximum eigenvalue \(\lambda_1\) of the matrix \(\Sigma\), and then \(Var(y_1)=\lambda_1\).
Maximizing \(Var(y_1)\) is equivalent to maximizing \(\alpha_1^T \Sigma \alpha_1\) over all \(\alpha_1\in\mathbb{R}^n\) with \(\alpha_1^T\alpha_1=1\). Taking the gradient (with respect to \(\alpha_1\in \mathbb{R}^n\)) of the corresponding Lagrangian gives \(2\Sigma \alpha_1 -2 l \alpha_1= \theta\). So at the maximum \(l\) is an eigenvalue of \(\Sigma\), and the objective equals \(\alpha_1^T\Sigma\alpha_1=l\,\alpha_1^T\alpha_1=l.\)
To reach the maximum, let \(l\) be the maximum eigenvalue of \(\Sigma\), denoted \(\lambda_1\), and choose a corresponding unit eigenvector \(\alpha_1\).
\(\square\)
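A small numerical illustration: for a toy covariance matrix (an arbitrary choice), the top eigenvector attains variance \(\lambda_1\), and no other unit direction does better.

```python
# First principal component: the maximizing direction is the top eigenvector
# of Sigma and the achieved variance is lambda_1.  Toy data, illustrative only.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])        # a symmetric positive-definite covariance

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
lam1 = eigvals[-1]
alpha1 = eigvecs[:, -1]                    # unit eigenvector of the largest eigenvalue

print(np.isclose(alpha1 @ Sigma @ alpha1, lam1))   # Var(y_1) = lambda_1

# Random unit directions never beat alpha_1.
rng = np.random.default_rng(6)
u = rng.standard_normal((1000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
print(np.all(np.einsum('ij,jk,ik->i', u, Sigma, u) <= lam1 + 1e-9))
```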
If we want \(k\) principal components that are mutually uncorrelated, i.e. \(\text{cov}(y_i,y_j)=0\) for \(i\neq j\), we have the following conclusion.
Theorem for \(k\) principal component analysis
The \(k\) principal components of \(X\) are determined by the transformation \(T\) defined by \(y_i=\alpha_i^T X\) for \(i=1,\cdots,k\),
where \(\alpha_1,\cdots,\alpha_k\) are orthonormal eigenvectors corresponding to the \(k\) largest eigenvalues of \(\Sigma\).
We only prove the case \(k=2\); the general case follows by induction.
We aim to find a vector \(\alpha_2\) maximizing \(\alpha_2^T\Sigma \alpha_2\), subject to the constraints \(\alpha_2^T \alpha_2=1\) and \(\langle\alpha_2,\alpha_1 \rangle=0\). Taking the gradient of the Lagrangian gives \(2\Sigma\alpha_2-2l\alpha_2-l_2\alpha_1=\theta;\)
taking the inner product with \(\alpha_1\) gives \(2\alpha_1^T\Sigma\alpha_2-2l\,\alpha_1^T\alpha_2-l_2\alpha_1^T\alpha_1=0\); since \(\alpha_1^T\Sigma\alpha_2=\lambda_1\alpha_1^T\alpha_2=0\) and \(\alpha_1^T\alpha_1=1\),
so \(l_2=0\), and the gradient equation reduces to \(\Sigma\alpha_2=l\alpha_2\). Then, by the same logic as in the theorem for principal component analysis, \(l\) is an eigenvalue of \(\Sigma\), and maximizing over unit vectors orthogonal to \(\alpha_1\) yields the second largest eigenvalue \(\lambda_2\), attained at a corresponding unit eigenvector \(\alpha_2\).
\(\square\)
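Numerically, stacking the top-\(k\) eigenvectors as the rows of \(A_k\) makes the covariance of \((y_1,\cdots,y_k)\) diagonal with the \(k\) largest eigenvalues; the same toy covariance matrix is reused below as an arbitrary example.

```python
# Top-k principal components from the eigendecomposition of Sigma.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
k = 2

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, alphas = eigvals[order], eigvecs[:, order]

A_k = alphas[:, :k].T                      # rows are alpha_1^T, ..., alpha_k^T
Sigma_y = A_k @ Sigma @ A_k.T              # covariance of (y_1, ..., y_k)

print(np.allclose(Sigma_y, np.diag(lam[:k])))     # uncorrelated, Var(y_i) = lambda_i
print(np.isclose(np.trace(Sigma), lam.sum()))     # total variance is preserved
```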
After the full transformation to \(Y\in \mathbb{R}^n\), we have a pleasant result for the total variance of \(Y\): \(\sum_{i=1}^n Var(y_i)=\sum_{i=1}^n\lambda_i=\text{tr}(\Sigma_y)=\text{tr}(A\Sigma A^T)=\text{tr}(\Sigma A^TA)=\text{tr}(\Sigma)=\sum_{i=1}^n Var(x_i),\) which is obtained by taking the trace of \(\Sigma_y=A\Sigma A^T\) and using the cyclic property of the trace together with \(A^TA=I\).
After choosing the \(n\) principal components, we also want to quantify the relationship between \(y_i\) and \(x_j\).
Factor loading
The factor loading of \(y_i\) with respect to \(x_j\) is defined by \(\rho(y_i,x_j)=\dfrac{\text{cov}(y_i,x_j)}{\sqrt{Var(y_i)Var(x_j)}}=\dfrac{\sqrt{\lambda_i}\,\alpha_{ji}}{\sqrt{\sigma_{jj}}},\)
where \(\alpha_{ji}\) is the \(j\)th component of the vector \(\alpha_i\). This has to be computed element-wise.
Just by definition: \(\text{cov}(y_i,x_j)=\text{cov}(\alpha_i^TX, e_j^TX)=e_j^T\Sigma\alpha_i=\lambda_i\alpha_{ji}\), while \(Var(y_i)=\lambda_i\) and \(Var(x_j)=\sigma_{jj}\).
\(\square\)
Properties of factor loadings
(i) Sum over the original variables: \(\sum_{j=1}^n \sigma_{jj}\,\rho^2(y_i,x_j)=\sum_{j=1}^n\lambda_i\alpha_{ji}^2=\lambda_i\).
(ii) Sum over all principal components: \(\sum_{i=1}^n \rho^2(y_i,x_j)=\dfrac{1}{\sigma_{jj}}\sum_{i=1}^n\lambda_i\alpha_{ji}^2=1\).
We give a proof of (ii) using the outer product formula.
Since \(\Sigma=A^T\Sigma_y A=\sum_{i=1}^n \lambda_i \alpha_i\alpha_i^T\), comparing the \((j,j)\) entries gives \(\sigma_{jj}=\sum_{i=1}^n\lambda_i\alpha_{ji}^2\), and dividing by \(\sigma_{jj}\) yields (ii).
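Both summation properties are easy to verify numerically. The sketch below reuses the toy covariance matrix and computes the full matrix of loadings \(\rho(y_i,x_j)\).

```python
# Factor loadings rho(y_i, x_j) = sqrt(lambda_i) * alpha_ji / sqrt(sigma_jj),
# with the two summation properties checked numerically.  Illustrative only.
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
n = Sigma.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam, alphas = eigvals[order], eigvecs[:, order]        # alphas[:, i] = alpha_i

sigma_diag = np.diag(Sigma)
# rho[i, j] = sqrt(lam_i) * alpha_{ji} / sqrt(sigma_jj)
rho = np.sqrt(lam)[:, None] * alphas.T / np.sqrt(sigma_diag)[None, :]

print(np.allclose((rho**2 * sigma_diag).sum(axis=1), lam))   # (i) sum over x_j
print(np.allclose((rho**2).sum(axis=0), np.ones(n)))         # (ii) sum over y_i
```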
Normalized version¶
Usually different random variables take values on very different scales, so we have to normalize them if we want to analyse them together.
All of the content above then stays the same except for the following changes.
Changes applied to normalized random vectors
(i) \(\pmb{\mu}^*=\theta\) and \(\Sigma^*=R\), where \(R\) is the correlation coefficient matrix, with \(r_{jj}=1\) on the diagonal.
(ii) Sum of variances after transformation: \(\sum_{i=1}^n \lambda_i^*=n\).
(iii) Factor loadings: \(\rho(y_i,x_j)=\sqrt{\lambda_i^*}\,\alpha_{ji}\).
Truncated principal components¶
In practice, we usually do not keep all \(n\) principal components but rather \(k \ll n\) of them, to achieve compression of the data.
How do we choose these \(k\) components? We rely on the following criterion.
Contribution to variance
The contribution to the total variance of the principal component \(y_i\) is defined by \(\eta_i=\dfrac{\lambda_i}{\sum_{j=1}^n\lambda_j};\)
usually we require \(\sum_{i=1}^k \eta_i\) to be larger than \(70\%\).
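In code, the rule amounts to taking the smallest \(k\) whose cumulative contribution exceeds the threshold. The eigenvalues below are made-up example values.

```python
# Choose k from the cumulative contribution to variance (70% threshold here).
import numpy as np

lam = np.array([5.2, 2.1, 1.0, 0.4, 0.3])          # eigenvalues of Sigma, decreasing
eta = lam / lam.sum()                              # contribution of each component
cumulative = np.cumsum(eta)

k = int(np.searchsorted(cumulative, 0.70)) + 1
print(eta.round(3), cumulative.round(3), k)        # here k = 2 (about 81%)
```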
Sampled PCA¶
In actual experiments, we observe the data \(m\) times independently and replace the mean and covariance matrix with their empirical versions. Assume \(X_1,\cdots,X_m\) are \(m\) mutually independent samples in \(\mathbb{R}^n\). Then the unbiased estimates of the mean and covariance are \(\overline{X}=\frac{1}{m}\sum_{k=1}^m X_k\) and \(S=\frac{1}{m-1}\sum_{k=1}^m (X_k-\overline{X})(X_k-\overline{X})^T\). Apply the same procedure as in PCA to \(S\), and we are done.
For actual calculation, we usually center and scale each sample componentwise, \(X_k'=D^{-1/2}(X_k-\overline{X})\) with \(D=\text{diag}(s_{11},\cdots,s_{nn})\), and compute the singular values \(s_1\geq\cdots\geq s_n\) of the data matrix \(X'=(X_1',\cdots,X_m')_{n\times m}\). The eigenvalues of the normalized empirical covariance then satisfy \(\lambda_i=s_i^2/(m-1)\) for \(i=1,\cdots,n\), the columns of the matrix of left singular vectors of \(X'\) are the eigenvectors \(\alpha_i\), and \(Y=AX\) with \(A\) built from them as before. If we choose \(k\) principal components, we keep the \(k\) leading singular vectors.
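A minimal numpy sketch of sample PCA via the SVD of the centered data matrix, with the \(1/(m-1)\) factor made explicit; the data are randomly generated and purely illustrative.

```python
# Sample PCA via the SVD of the centered data matrix X (shape n x m:
# n variables, m observations, as in the text).
import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 200
X = rng.multivariate_normal([0, 0, 0],
                            [[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=m).T   # n x m

Xc = X - X.mean(axis=1, keepdims=True)            # center each variable
S = (Xc @ Xc.T) / (m - 1)                         # empirical covariance

U, sv, _ = np.linalg.svd(Xc, full_matrices=False) # left singular vectors = directions
lam_from_svd = sv**2 / (m - 1)
lam_from_eig = np.sort(np.linalg.eigvalsh(S))[::-1]

print(np.allclose(lam_from_svd, lam_from_eig))    # same eigenvalues
Y = U.T @ Xc                                      # principal component scores
print(np.allclose(np.cov(Y), np.diag(lam_from_svd)))   # uncorrelated scores
```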