Wrapping PCA into proteoQ

\[\newcommand{\mx}[1]{\mathbf{#1}} \def\pone{\mathbf{\mathbf{\phi}}_{1}} \def\ponet{\mathbf{\mathbf{\phi}}_{1}^{T}} \def\phat{\hat{\phi}_{1}} \def\xi{\mathbf{x}_{i}} \def\xit{\mathbf{x_{i}}^{T}} \def\X{\mathbf{X}} \def\Xt{\mathbf{X}^{T}} \def\Q{Q(\mathbf{\phi}_{1})} \def\A{\mathbf{A}} \def\U{\mathbf{U}} \def\V{\mathbf{V}} \def\Vt{\mathbf{V}^{T}} \def\I{\mathbf{I}} \def\L{\mathbf{\Lambda}} \def\l{\mathbf{\lambda}} \def\v{\mathbf{v}} \def\sumn{\sum_{i=1}^{n}} \def\sumj{\sum_{j=1}^{p}} \def\summ{\sum_{m=1}^{p}} \def\argmax{\mathrm{arg\, max}} \def\argmin{\mathrm{arg\, min}} \def\argminphii{\underset{\left \| \pone \right \|=1}{\argmin}} \def\argmaxphii{\underset{\left \| \pone \right \|=1}{\argmax}} \def\argmaxvphii{\underset{\left \| \V \pone \right \|=1}{\argmax}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% Diagnoal matrix Lambda \def\Lmat{\begin{bmatrix} \l_{1} & & & \\ & \l_{2} & & \\ & & \ddots & \\ & & & \l_{p} \end{bmatrix}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% Transposed phi_1 * v \def\phiv{\begin{bmatrix} \pone \cdot\v_{1} \\ \pone \cdot\v_{2} \\ \vdots \\ \pone \cdot\v_{p} \\ \end{bmatrix}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% phi_1 * v \def\phivc{\begin{bmatrix} \pone\cdot\v_{1} & \pone\cdot\v_{2} & \cdots & \pone\cdot\v_{p} \end{bmatrix}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% lambda * (phi_1 * v) \def\phivl{\begin{bmatrix} \l_{1} \pone \cdot\v_{1} & \l_{2} \pone \cdot\v_{2} & \cdots & \l_{p} \pone \cdot\v_{p} \end{bmatrix}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% lambda * (phi_1 * v)^2 \def\phivls{\l_{1} (\pone \cdot \v_{1})^{2} + \l_{2} (\pone \cdot \v_{2})^{2} + \cdots \l_{p} (\pone \cdot \v_{p})^{2}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% Transposed V \def\Vmatt{\begin{bmatrix} \v_{1}^{T} \\ \v_{2}^{T} \\ \vdots \\ \v_{p}^{T} \\ \end{bmatrix}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% V \def\Vmat{\begin{bmatrix} \v_{1} & \v_{2} & \cdots & \v_{p} \end{bmatrix}}\]

The prnPCA and pepPCA utilities in proteoQ perform regular PCA for protein and peptide data, respectively. They are wrappers around the stats::prcomp function in R and plot PCA findings by either observations or features. Here observations refer to the \(n\) samples, and features stand for the \(p\) peptides or proteins. Viewed as a data matrix, the input is either an \(n \times p\) matrix or its transpose.
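
As a minimal sketch of the calls discussed in this document (assuming a proteoQ working directory has been set up and the protein/peptide tables are available; the argument values mirror those mentioned below):

```r
library(proteoQ)

# PCA of protein data, plotted by observations (samples);
# column-wise centering is passed on to stats::prcomp
prnPCA(type = obs, center = TRUE)

# the peptide counterpart
pepPCA(type = obs, center = TRUE)
```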

PCA involves data projection, abstraction and possible reconstruction through lines (passing through the origin \(\mathbf{0}\)) in a high-dimensional space. Although relatively simple in nature: a mathematical minimization problem without a probability structure imposed on the data,1 it is not an easy task to explain the intuition behind basic PCA without algebraic terms. This already challenging situation might be further aggravated by the following perplexing statement

… we must first subtract the mean of each variable from the dataset to center the data around the origin… Then we must normalize each of the orthogonal eigenvectors to become unit vectors. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data… The proportion of the variance that each eigenvector represents can be calculated… - Wikipedia

More specifically, in the framework of proteoQ, we want to be clear about the procedures that matter for a proper interpretation of PCA results. Taking protein data as an example, it is common practice in proteomics data analysis to align protein expression sample-wise by median values.2 We may wonder: is that data alignment analogous to the data centering in PCA? Alternatively, when calling prnPCA(center = FALSE), we assume no data centering and the same setting is relayed to prcomp. In this case, is it still valid to make statements such as “the proportion of the variance” for the data projected onto each principal axis?

To address questions like these, I want to take a moment to pencil out the calculations behind basic PCA and, in this way, be more mindful when performing PCA through proteoQ.

1 Notation

We need some basic notation to reconstruct PCA. In general, scalars will be in italic lowercase, vectors in bold lowercase and matrices in bold uppercase:

  • \(v\) — scalar
  • \(\mathbf{v}\) — vector
  • \(\mathbf{A}\) — matrix

Vectors are written as column vectors, for example

\[ \mathbf{v}=\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \] and the transpose of \(\mathbf{v}\) is

\[ \mathbf{v}^{T} = \begin{bmatrix} v_{1} & v_{2} & \cdots & v_{n} \end{bmatrix} \]

A matrix \(\mathbf{A}\) made of \(p\) column vectors (\(\mathbf{v} \in \mathbb{R}^{n}\)) will look like

\[ \mathbf{A} = \begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,p} \\ v_{2,1} & v_{2,2} & \cdots & v_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n,1} & v_{n,2} & \cdots & v_{n,p} \end{bmatrix} = \begin{bmatrix} \mathbf{v}_{1} & \mathbf{v}_{2} & \cdots & \mathbf{v}_{p} \end{bmatrix} \] and the corresponding transpose of \(\mathbf{A}\) is

\[ \mathbf{A}^{T}=\begin{bmatrix} v_{1,1} & v_{2,1} & \cdots & v_{n,1} \\ v_{1,2} & v_{2,2} & \cdots & v_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1,p} & v_{2,p} & \cdots & v_{n,p} \end{bmatrix}=\begin{bmatrix} \mathbf{v}_1^{T} \\ \mathbf{v}_2^{T} \\ \vdots \\ \mathbf{v}_p^{T} \end{bmatrix} \]

2 The first principal axis

Suppose there is a data set of \(n\) biological samples, each with \(p\) features (proteins). Our goal, let’s say, is to draw the quantitative information contained in individual samples as separate dots.3 In general, we will index samples with \(i\) in \(1,2,\cdots n\) and proteins with \(j\) in \(1,2,\cdots p\).

2.1 Matrix of input data

The protein expression for sample \(i\) may be presented as a column vector \(\in \mathbb{R}^p\):

\[ \begin{align} \mathbf{x}_{i} &= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix} && \text{where } i=1, 2, \cdots n \end{align} \]

The complete set of data may be displayed in a matrix format

\[ \mathbf{X}=\begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix}=\begin{bmatrix} \mathbf{x}_1^{T} \\ \mathbf{x}_2^{T} \\ \vdots \\ \mathbf{x}_n^{T} \end{bmatrix} \]

Because canonical PCA is a rotation operation through the origin of a vector space, \(\mathbf{X}\) is often put in mean deviation form within each column. More specifically, this means that the data are centered around zero for each protein \(j = 1, 2, \cdots p\). If not, we may later end up talking about the square length of a vector (divided by \(n\)) rather than the variance (optional: see Appendix: Property 9 for an inequality between the two quantities).4

Data centering is performed by default in PCA tools such as prcomp. This means that, with proteoQ::prnPCA wrapping around prcomp, there is no need for us to manually center the input data as follows:

\[ \mathbf{X^{*}}=\begin{bmatrix} x_{1,1}-\bar{x}_1 & x_{1,2}-\bar{x}_2 & \cdots & x_{1,p}-\bar{x}_p \\ x_{2,1}-\bar{x}_1 & x_{2,2}-\bar{x}_2 & \cdots & x_{2,p}-\bar{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1}-\bar{x}_1 & x_{n,2}-\bar{x}_2 & \cdots & x_{n,p}-\bar{x}_p \end{bmatrix} \] Note that the data centering can be turned off by setting center = FALSE. This leads to a different interpretation of PCA results, as we will see later.
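
As a quick numeric sketch (with a small simulated matrix standing in for protein data), letting prcomp center the columns is equivalent to subtracting the column means beforehand:

```r
set.seed(1)
X <- matrix(rnorm(6 * 4), nrow = 6, ncol = 4)   # n = 6 samples, p = 4 proteins

# (a) let prcomp center each column
pc1 <- prcomp(X, center = TRUE, scale. = FALSE)

# (b) center manually, then turn centering off
X_star <- scale(X, center = TRUE, scale = FALSE)
pc2 <- prcomp(X_star, center = FALSE, scale. = FALSE)

all.equal(pc1$rotation, pc2$rotation)   # TRUE
all.equal(pc1$x, pc2$x)                 # TRUE
```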

Intuition

PCA first seeks a unit vector \(\pone \in \mathbb{R}^p\) that minimizes the total squared Euclidean distance between all \(\xi\) (\(i = 1, 2, \cdots n\)) and their respective projections onto the axis \(\pone\) (Trevor Hastie 2009, Section 14.5). The projected vectors are \(\pone\ponet\xi\) for each \(i\) (Appendix: Property 6). In other words, \(\xi\) holds the \(p\) protein expression values for sample \(i\), whereas their projected values on axis \(\pone\) are stored in the vector \(\pone\ponet\xi\). Hence the quantity that we are minimizing is

\[ \begin{align} \Q = \sumn \frac{1}{n}\left \| \xi - \pone \ponet \xi \right \|^{2} && \text{(1)} \\ \end{align} \]

The constant \(1/n\) is kept for convenience, for a reason we will get to momentarily. After some basic matrix algebra, we have the following quantity:

\[ \begin{align} \Q &= \sumn(\frac{\left \| \xi \right \|^{2}}{n} -\frac{\left \| \pone \ponet \xi \right \|^{2}}{n}) && \color{blue}{(\pone \text{ is a unit vector)}} \\ &= \sumn(\frac{\left \| \xi \right \|^{2}}{n} -\frac{\left \| \ponet \xi \right \|^{2}}{n}) && \color{blue}{(\left \| \ponet \xi \right \|^{2} = \left \| \pone \cdot \xi \right \|^{2} = \left \| \xit \pone \right \|^{2})} \\ &= \sumn(\frac{\left \| \xi \right \|^{2}}{n} -\frac{\left \| \xit \pone \right \|^{2}}{n}) && \text{(2)} \\ \end{align} \]

There is nothing out of the ordinary: if \(\xi\) were the hypotenuse of a right triangle, the step from (1) to (2) would be exactly the Pythagorean theorem.
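
A small numeric check of the step from (1) to (2), using an arbitrary unit vector in place of \(\pone\):

```r
set.seed(1)
p   <- 5
x   <- rnorm(p)                                  # one observation x_i
phi <- rnorm(p); phi <- phi / sqrt(sum(phi^2))   # a unit vector phi_1

proj  <- phi * sum(phi * x)    # phi (phi^T x): projection of x onto phi
resid <- x - proj              # the orthogonal remainder

# ||x - phi phi^T x||^2 equals ||x||^2 - ||x^T phi||^2
all.equal(sum(resid^2), sum(x^2) - sum(x * phi)^2)   # TRUE
```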

If \(\xi\) is in mean deviation form, so is \(\xit\pone\) (Property 8). This allows us to re-express the quantity in (2) as

\[ \begin{align} \Q &= \sumn(\frac{\left \| \xi - \mathbf{mean}({\xi})\right \|^{2}}{n} -\frac{\left \| (\xit\pone - \mathbf{mean}(\xit\pone)) \right \|^{2}}{n}) \\ \end{align} \] where both \(\mathbf{mean}({\xi})\) and \(\mathbf{mean}({\xit\pone})\) are \(\mathbf{0}\) for all \(i\). Equivalently in a matrix form, we have

\[ \Q = \mathbf{Var}(\X) - \mathbf{Var}(\X \pone) \]

The first term on the right-hand side is the total sample variance, whereas the second term is the variance of the data projected onto \(\pone\). The two variance values can be used to calculate the proportion of variance explained (under a given principal component). It also becomes clear that such an interpretation is more suitable when the data have been centered.
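
With centered data, the two variance terms can be computed directly and compared against the proportion of variance reported by prcomp (a sketch with simulated numbers):

```r
set.seed(1)
X  <- scale(matrix(rnorm(20 * 6), 20, 6), center = TRUE, scale = FALSE)
pc <- prcomp(X, center = FALSE)

phi1   <- pc$rotation[, 1]        # estimated first principal axis
z1     <- X %*% phi1              # projections x_i^T phi_1
tot    <- sum(X^2)  / nrow(X)     # total sample variance, Var(X)
var_z1 <- sum(z1^2) / nrow(X)     # projected variance, Var(X phi_1)

# the 1/n constants cancel in the ratio; matches prcomp's report (up to rounding)
c(ours   = var_z1 / tot,
  prcomp = unname(summary(pc)$importance["Proportion of Variance", "PC1"]))
```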

Notes on proteoQ

  • The values of the proportion of variance explained are incorporated into the labels of PCA plots when calling prnPCA or pepPCA with center = TRUE, but not with center = FALSE for the reason shown above.

  • For protein analysis by observations and with data centering (prnPCA(type = obs, center = TRUE, ...)), each column in \(\X\) corresponds to a protein in mean deviation form. This contrasts with common proteomic data alignments, e.g. median centering, where the data are centered by samples. The same argument holds for pepPCA.

  • In protein outputs, columns prot_mean_raw, prot_mean_n and prot_mean_z summarize the mean log2FC for individual proteins. Analogously we have columns pep_mean_raw, pep_mean_n and pep_mean_z for peptide data. These columns might be used for the exclusion of data points that are far from mean deviation form. Examples are available in the README.

  • For PCA analysis by features and with data centering (prnPCA(type = feats, center = TRUE, ...)), the mean deviation form in PCA is by samples. In this case, it is similar to the proteomic data alignment.


Denoting the optimized \(\pone\) as \(\phat\), PCA seeks the axis that best approximates the input data in \(\X\), that is

\[ \begin{align} \phat=\argminphii \sumn(\frac{\left \| \xi \right \|^{2}}{n} -\frac{\left \| \xit \pone \right \|^{2}}{n}) && \text{(3)} \\ \end{align} \]

There is a final twist: finding a vector minimizing the quantity in (3) is equivalent to finding a vector maximizing \(\sumn \left \| \xit \pone \right \|^{2}\) (since \(\frac{1}{n}\sumn\left \| \xi \right \|^{2}\) is a fixed quantity irrespective of \(\pone\)). Thus we can re-write (3) as

\[ \begin{align} \phat = \argmaxphii \sumn \left \| \xit \pone \right \|^{2} && \text{(4)} \\ \end{align} \]

or in a matrix form:

\[ \begin{align} \phat &= \argmaxphii \left \| \X \pone \right \|^{2} \\ &= \argmaxphii (\X \pone)^{T} (\X \pone) \\ &= \argmaxphii(\ponet \Xt \X \pone) && && \text{(5)} \\ \end{align} \]

Maximum values of quadratic forms

Note that \(\Xt\X\) is symmetric (Property 3). The spectral theorem, found in every linear algebra textbook, tells us that a symmetric matrix can be decomposed,5 such that

\[ \Xt\X = \V \L \V^{T} = \Vmat \Lmat \Vmatt \]

where the columns of \(\V\) are orthonormal eigenvectors of \(\Xt\X\) and the corresponding eigenvalues are in the diagonal matrix \(\L\). Thus we have

\[ \begin{align} \phat &= \argmaxphii(\ponet \V \L \Vt \pone) = \argmaxphii(\ponet \V) \L (\ponet \V)^{T}\\ &= \argmaxphii(\phivc \Lmat \phiv) \\ \end{align} \]

Continuing with the column-row expansion of the product, we have

\[ \begin{align} \phat &= \argmaxphii (\phivls) \\ &= \argmaxvphii (\phivls) && \color{blue}{(\text{Property 7: }\left \| \pone \right \|=\left \| \V \pone \right \|)} \\ \end{align} \]

Solving the above is a constrained optimization problem subject to \(\left \| \V \pone \right \|=1\). For convenience, we may assume \(\l_1 \ge \l_2 \ge \cdots \ge \l_p\) (by reordering subscripts if necessary). We then have the following inequality:

\[ \begin{align} \phivls &\le \l_1(\phi_1\cdot\v_1)^2 + \l_1(\phi_1\cdot\v_2)^2 + \cdots \l_1(\phi_1\cdot\v_p)^2 \\ &= \l_1 \\ \end{align} \]

Also note that \(\l_1\), the upper bound above, is attained when \(\pone\cdot\v_1=\pm1\) and \((\phi_1\cdot\v_2)^2=\cdots=(\phi_1\cdot\v_p)^2=0\), so \(\l_1\) is indeed the maximum.6 In other words, the vector \(\phat\) that we have been seeking is \(\phat = \pm\v_1\). We may now see why PCA results (scores) are determined only up to a sign flip (Gareth James 2013, Chapter 10).
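
The conclusion can be checked numerically: the leading eigenvector of \(\Xt\X\) and the first loading vector from prcomp agree up to a sign flip (a sketch with simulated, column-centered data):

```r
set.seed(1)
X <- scale(matrix(rnorm(15 * 5), 15, 5), center = TRUE, scale = FALSE)

e  <- eigen(t(X) %*% X, symmetric = TRUE)   # spectral decomposition of X^T X
v1 <- e$vectors[, 1]                        # eigenvector of the largest eigenvalue

phi_hat <- prcomp(X, center = FALSE)$rotation[, 1]   # first principal axis

all.equal(abs(v1), abs(phi_hat))                 # TRUE: identical up to sign
sum((X %*% v1)^2) - sum((X %*% phi_hat)^2)       # ~0: same projected variance
```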

References

David C. Lay, Judi J. McDonald, Steven R. Lay. 2016. Linear Algebra and Its Applications. https://github.com/KvalheimRacing/KubenRobotics/blob/master/Resources/David%20C.%20Lay%2C%20Steven%20R.%20Lay%2C%20Judi%20J.%20McDonald-Linear%20Algebra%20and%20Its%20Applications-Pearson%20Education%20(c2016).pdf.
Gareth James, Trevor Hastie, Daniela Witten. 2013. An Introduction to Statistical Learning with Applications in R (Print 2015). https://github.com/LamaHamadeh/DS-ML-Books/blob/master/An%20Introduction%20to%20Statistical%20Learning%20-%20Gareth%20James.pdf.
George Casella, Roger L Berger. 2002. Statistical Inference.
Trevor Hastie, Jerome Friedman, Robert Tibshirani. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Print 12th, 2017). https://github.com/tpn/pdfs/blob/master/The%20Elements%20of%20Statistical%20Learning%20-%20Data%20Mining%2C%20Inference%20and%20Prediction%20-%202nd%20Edition%20(ESLII_print4).pdf.

APPENDIX A - Properties of vectors and matrices

A.1 Vectors

Property 1:

  1. \(\v_1^T\v_2=\v_1\cdot\v_2\) (the dot product of two vectors)
  2. \(\left \| \v \right \|^2=\v^T\v=\v\cdot\v=(v_1^2+v_2^2+\cdots v_n^2)\) (the square length of a vector)

A.2 Matrices

A.2.1 Rectangular matrix (\(n \times p\))

Property 2:

  1. \(\mathbf{A}(\mathbf{B}\mathbf{C})=(\mathbf{A}\mathbf{B})\mathbf{C}\) (the associative law of matrix multiplication)
  2. \((\mathbf{A}\mathbf{B})^T=\mathbf{B}^T\mathbf{A}^T\) (the law of matrix transpose)

A.2.2 Symmetric matrix

Matrix \(\A^T\A\) is symmetric in that \((\A^T\A)^T=\A^T(\A^T)^T=\A^T\A\).

Property 3:

If \(\A\) is an \(n \times p\) matrix, \(\A^T\A\) is a symmetric \(p \times p\) matrix.

A.2.3 Matrix with orthonormal columns (\(n \times p\))

Vectors (\(\v_1, \v_2, \cdots \v_p \in \mathbb{R}^n\)) are said to be orthonormal to each other iff for any \(i, j = 1, 2, \dots p\), we have

\[ \mathbf{v}_{i} \cdot \mathbf{v}_{j}=\mathbf{v}_{i}^{T}\mathbf{v}_{j}=\left\{\begin{matrix} 1\: (i = j) \\ 0\: (i \neq j) \end{matrix}\right. \] Equivalently in a matrix format, we observe that7

\[ \def\uone{\mathbf{v}_{1}} \def\utwo{\mathbf{v}_{2}} \def\up{\mathbf{v}_{p}} \def\uonet{\mathbf{v}_{1}^{T}} \def\utwot{\mathbf{v}_{2}^{T}} \def\upt{\mathbf{v}_{p}^{T}} \begin{align} \mathbf{V}^{T}\mathbf{V}=\begin{bmatrix} \uonet\\ \utwot\\ \vdots \\ \upt\\ \end{bmatrix}\begin{bmatrix} \uone & \utwo & \cdots & \up \end{bmatrix}=\begin{bmatrix} \uonet\uone & \uonet\utwo & \cdots & \uonet\up\\ \utwot\uone & \utwot\utwo & \cdots & \utwot\up\\ \vdots & \vdots & \ddots & \vdots \\ \upt\uone & \upt\utwo & \cdots & \upt\up \end{bmatrix} = \mathbf{I} \\ \end{align} \]

Property 4:

For matrix \(\V\) with orthonormal columns (\(\V\) a.k.a. orthonormal matrix), we have \(\Vt\V=\mathbf{I}\).

A.2.4 Orthogonal matrix (\(n \times n\))

An orthogonal matrix is a square orthonormal matrix. For an orthogonal matrix \(\V\), if there is a matrix (denoted as \(\V^{-1}\)) that satisfies \(\V^{-1}\V=\mathbf{I}\), matrix \(\V^{-1}\) is termed the inverse of \(\V\) and \(\V\) is said to be invertible. Further note that \(\V^{-1}\V=\V\V^{-1}=\I\) in that:

\[ \begin{align} \V^{-1}\V &= \mathbf{I} \\ \V(\V^{-1}\V) &= \V\mathbf{I} \\ (\V\V^{-1})\V &= \V \\ \V\V^{-1} &= \mathbf{I} \end{align} \]

An orthogonal matrix is a special case of an orthonormal matrix and is necessarily invertible (Property 4). We next show that the inverse is unique. Suppose there is another matrix \(\U\) that is also an inverse of \(\V\):

\[ \begin{align} \U\V &= \I \\ \U\V\V^{-1} &= \I\V^{-1} \\ \U(\V\V^{-1}) &= \V^{-1} \\ \U &= \V^{-1} \\ \end{align} \]

Therefore \(\V^{-1}\) is unique. Given the unique existence of \(\V^{-1}\), together with Property 4, we have

\[\begin{align} \Vt=\V^{-1} \\ \end{align}\]

In summary, we have the following property for an orthogonal matrix

Property 5

For an orthogonal matrix \(\V\), it satisfies \(\V\Vt=\V\V^{-1}=\V^{-1}\V=\Vt\V=\mathbf{I}\)

APPENDIX B - Additional properties

B.1 Orthogonal projections

It can be shown that (David C. Lay 2016 Sections 6.2-6.3):

Property 6

  1. The orthogonal projection of vector \(\mathbf{x}\) to the subspace \(W\) spanned by an orthonormal matrix \(\mathbf{V} \in \mathbb{R}^{n \times p}\) is

\[\begin{align} \mathbf{proj}_W\, \mathbf{x}=\mathbf{V}\mathbf{V}^{T}\mathbf{x} \\ \end{align}\]

  2. and the orthogonal complement is

\[\begin{align} \mathbf{proj}_{W^{\perp}}\, \mathbf{x}=\mathbf{x}-\mathbf{V}\mathbf{V}^{T}\mathbf{x} \\ \end{align}\]


where \(\mathbf{V} = \begin{bmatrix} \mathbf{v}_{1} & \mathbf{v}_{2} & \cdots & \mathbf{v}_{p} \end{bmatrix}\).

Property 6.a may be readily envisioned by noting geometrically that \(\mathbf{v}_{i}^{T}\mathbf{x}\) is the coordinate of \(\mathbf{x}\) along the direction specified by the unit vector \(\mathbf{v}_{i}\) (upon orthogonal projection). Thus \(\mathbf{v}_{i}(\mathbf{v}_{i}^{T}\mathbf{x})\) is the corresponding component vector of \(\mathbf{x}\) along \(\mathbf{v}_{i}\). Property 6.b then follows from vector addition/subtraction.
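
A brief numeric illustration of Property 6 (a sketch; the orthonormal \(\mathbf{V}\) here is obtained from a QR decomposition of a random matrix):

```r
set.seed(1)
n <- 5; p <- 2
V <- qr.Q(qr(matrix(rnorm(n * p), n, p)))   # n x p matrix with orthonormal columns
x <- rnorm(n)

proj <- V %*% t(V) %*% x    # orthogonal projection of x onto W = col(V)
perp <- x - proj            # the orthogonal complement part

max(abs(t(V) %*% perp))          # ~0: the remainder is orthogonal to W
all.equal(sum(proj * perp), 0)   # projection and remainder are perpendicular
```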

B.2 Preserved length under orthonormal transformation

Denote a vector \(\pone\). The length of \(\pone\) is preserved after the orthonormal mapping \(\pone \mapsto \Vt\pone\). For convenience, we show that the square length is preserved:

\[\begin{align} \left \| \Vt \pone \right \| ^{2}=(\Vt \pone)^{T}(\Vt \pone)=\ponet\V\Vt\pone=\ponet\pone = \left \| \pone \right \| ^{2} \\ \end{align}\]

and the expanded form is

\[\begin{align} \left \| \Vt \pone \right \| ^{2} &= \ponet\V\Vt\pone = \begin{bmatrix} \pone \cdot \v_{1} & \pone \cdot \v_{2} & \cdots & \pone \cdot \v_{p} \end{bmatrix} \begin{bmatrix} \pone \cdot \v_{1} \\ \pone \cdot \v_{2} \\ \vdots \\ \pone \cdot \v_{p} \end{bmatrix} \\ &= (\ponet\v_{1})^{2} + (\ponet\v_{2})^{2} + \cdots (\ponet\v_{p})^{2} \end{align}\]

In the special case of unit vector \(\pone\), we have

Property 7

The length of a unit vector \(\pone\) is preserved after the orthonormal mapping of \(\pone \mapsto \Vt\pone\):

\[\begin{align} \left \| \Vt \pone \right \| ^{2} = \left \| \pone \right \| ^{2} = 1\\ \end{align}\]

and the corresponding column-row expansion is

\[\begin{align} \left \| \Vt \pone \right \| ^{2} &= (\ponet\v_{1})^{2} + (\ponet\v_{2})^{2} + \cdots (\ponet\v_{p})^{2} = 1 \\ \end{align}\]
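
A one-line check of Property 7 (with a random orthogonal \(\V\) from a QR decomposition and a random unit vector standing in for \(\pone\)):

```r
set.seed(1)
p   <- 4
V   <- qr.Q(qr(matrix(rnorm(p * p), p, p)))     # p x p orthogonal matrix
phi <- rnorm(p); phi <- phi / sqrt(sum(phi^2))  # unit vector phi_1

sum((t(V) %*% phi)^2)   # equals ||phi_1||^2 = 1
```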

B.3 Mean-deviation form of matrix

Denote the mean of a vector \(\mathbf{x}\) in \(\mathbb{R}^{n}\) as \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\). Then the transformed vector

\[ \mathbf{x}^{*}=\begin{bmatrix} x_1-\bar{x} \\ x_2-\bar{x} \\ \vdots \\ x_n-\bar{x} \end{bmatrix} \]

is said to be in mean-deviation form (mdf). Also note that a vector is in mdf if it has a mean of zero (taking \(\bar{x}=0\) in the above, we have \(\mathbf{x} = \mathbf{x}^{*}\) in mdf) and vice versa (\(\mathbf{x}^{*}\) has a mean of zero since \(\sumn(x_i-\bar{x})=0\)). That is, a vector being in mdf is equivalent to it having a mean of zero.

Extending the above notation to a matrix, we say that \(\mathbf{X}=\begin{bmatrix} \mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_{p} \end{bmatrix}\) is in mdf if each of its column vectors is in mdf.

It can be shown that, with \(\mathbf{X}\) in mdf, the projected vector, \(\mathbf{z}=\mathbf{X}\mathbf{\phi}\), is also in mdf for any vector \(\mathbf{\phi} \in \mathbb{R}^{p}\):

\[ \begin{align} n\bar{z} &= \sumn z_{i} \\ &= \sumn \sumj x_{ij} \varphi_{j} \\ &= \sumj \sumn x_{ij} \varphi_{j} && \color{blue}{(\text{switch the order of summation over } i \text{ and } j)} \\ &= \sumj \varphi_{j} \sumn x_{ij} \\ &= \sumj \varphi_{j} n \bar{x_{j}} && \color{blue}{(\bar{x_{j}}=0, \; \forall j)} \\ &= 0 \end{align} \] That is, the vector \(\mathbf{z}\) is in mdf. Denoting \(\mathbf{Z}=\X\Phi\) where \(\Phi=\begin{bmatrix} \pone & \mathbf{\phi}_2 & \cdots & \mathbf{\phi}_m \end{bmatrix}\), we then have

Property 8

If \(\X \in \mathbb{R}^{n \times p}\) is in mean-deviation form, \(\mathbf{Z}=\X\Phi\) is also in mean-deviation form for any \(\Phi \in \mathbb{R}^{p \times m}\).
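
A quick numeric check of Property 8 (small simulated matrices in place of real data):

```r
set.seed(1)
X   <- scale(matrix(rnorm(8 * 3), 8, 3), center = TRUE, scale = FALSE)  # columns in mdf
Phi <- matrix(rnorm(3 * 2), 3, 2)                                       # arbitrary p x m matrix

Z <- X %*% Phi
colMeans(Z)   # numerically zero: the columns of Z are also in mdf
```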

B.4 Sample variance versus square length

For any real number, \(a\), we have

\[ \begin{align} \sumn(x_i-a)^2 &= \sumn((x_i-\bar{x}) + (\bar{x}-a))^2 \\ &= \sumn(x_i-\bar{x})^2 + 2\sumn(x_i-\bar{x})(\bar{x}-a) + \sumn(\bar{x}-a)^2 \\ &= \sumn(x_i-\bar{x})^2 + 2(\bar{x}-a)\sumn(x_i-\bar{x}) + \sumn(\bar{x}-a)^2 && \color{blue}{((\bar{x}-a) \text{ is a constant)}} \\ &= \sumn(x_i-\bar{x})^2 + \sumn(\bar{x}-a)^2 && \color{blue}{(\sumn(x_i-\bar{x})=0 )} \end{align} \]

The relation above has a bias-variance tradeoff interpretation when \(a\) takes the expectation value of \(\mathbf{x}\) (Gareth James 2013, ch. 2). Here we simply take \(a=0\), divide both sides of the equation by \(n\), and thus have

\[ \begin{align} \frac{1}{n}\sum_{i=1}^{n}x_i^2 &= \frac{1}{n}\sumn(x_i-\bar{x})^2+\frac{1}{n}\sum_{i=1}^{n}\bar{x}^2 \\ \end{align} \] Expressing the above in vector form yields

\[ \frac{1}{n} \left \| \mathbf{x} \right \|^{2} = \mathbf{Var}(\mathbf{x}) + \frac{1}{n} \left \| \mathbf{\bar{x}} \right \|^{2} \] where \(\mathbf{Var}(\mathbf{x}) = \frac{1}{n}\sumn(x_i-\bar{x})^2\) is the sample variance and \(\mathbf{\bar{x}} = \begin{bmatrix} \bar{x} & \bar{x} & \cdots & \bar{x} \end{bmatrix}^{T}\) is an \(n\)-element vector with identical \(\bar{x}\).

Since \(\left \| \mathbf{\bar{x}} \right \|^{2} \ge 0\), we have \(\mathbf{Var}(\mathbf{x}) \le \frac{1}{n} \left \| \mathbf{x} \right \|^{2}\).

Property 9

For any vector \(\mathbf{x} \in \mathbb{R}^{n}\), it satisfies

\[ \mathbf{Var}(\mathbf{x}) \le \frac{1}{n} \left \| \mathbf{x} \right \|^{2} \]

where \(\mathbf{Var}(\mathbf{x}) = \frac{1}{n}\sumn(x_i-\bar{x})^2\) is the sample variance.
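
In R terms (note the population-style \(1/n\) denominator rather than the \(n-1\) used by var()):

```r
set.seed(1)
x <- rnorm(10, mean = 3)
n <- length(x)

var_x  <- sum((x - mean(x))^2) / n   # sample variance with the 1/n denominator
sq_len <- sum(x^2) / n               # square length of x, divided by n

all.equal(sq_len, var_x + mean(x)^2)   # the decomposition above
var_x <= sq_len                        # TRUE, Property 9
```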

B.5 Eigenvalues and eigenvectors

Considering the linear system:

\[\mathbf{A}\textbf{v}=\lambda \textbf{v}\]

or equivalently \[(\mathbf{A}-\lambda \mathbf{I}) \textbf{v}=0\]

where

  • \(\mathbf{A}\in \mathbb{R} ^{p\times p}\) is a real \(p\times p\) square matrix,
  • \(\textbf{v}\) is an eigenvector of \(\mathbf{A}\),
  • \(\lambda\) is the corresponding eigenvalue and
  • \(\mathbf{I}\) is the \(p\times p\) identity matrix.

Requiring a nontrivial \(\textbf{v}\) yields a \(p\)-th order polynomial equation (the characteristic equation)

\[\textbf{det}(\mathbf{A}-\lambda \mathbf{I})=0\]

When solvable, the equation will yield \(p\) solutions: \(\lambda_1, \dots, \lambda_k, \dots, \lambda_p\). For each \(\lambda_k\), there is a corresponding \(\textbf{v}_k\) that satisfies

\[(\mathbf{A}-\lambda_k \mathbf{I}) \textbf{v}_k=0\]
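
In R, eigen() solves this numerically; a brief check that each eigenpair satisfies \(\mathbf{A}\textbf{v}_k=\lambda_k \textbf{v}_k\):

```r
set.seed(1)
A <- crossprod(matrix(rnorm(4 * 3), 4, 3))   # a symmetric 3 x 3 matrix (A = M^T M)

e      <- eigen(A, symmetric = TRUE)
lambda <- e$values
V      <- e$vectors

# A v_k = lambda_k v_k, checked here for k = 1
all.equal(as.vector(A %*% V[, 1]), lambda[1] * V[, 1])   # TRUE
```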


  1. see George Casella (2002), ch. 11, for an interesting discussion of a mathematical versus a statistical solution↩︎

  2. See also the help document ?proteoQ::standPrn.↩︎

  3. It can be the other way around where each protein corresponds to a dot.↩︎

  4. https://stats.stackexchange.com/questions/22329/how-does-centering-the-data-get-rid-of-the-intercept-in-regression-and-pca↩︎

  5. The real challenge is to show that a symmetric matrix is necessarily orthogonally diagonalizable. Following that, the theorem unseals the secret of a symmetric matrix in projecting and stretching any vector that it operates against (onto so-called principal axes).↩︎

  6. In case of degeneracy among \(\l_1, \, \l_2, \, \cdots\), a technical device is to introduce small perturbation to break the ties.↩︎

  7. Note that \(\mathbf{V}^{T}\mathbf{V}\) is a \(p\times p\) matrix. This is not to be confused with \(\mathbf{V}\mathbf{V}^{T}\), which is an \(n \times n\) matrix. For \(n \neq p\), \(\mathbf{V}\mathbf{V}^{T} \neq \mathbf{V}^{T}\mathbf{V}\).↩︎

Qiang Zhang
Instructor of Medicine

My research interests include mass spectrometry-based proteomics and automation in data analysis.