In [1]:
%matplotlib inline
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Principal Component Analysis (PCA): Toy Example

Build the dataset

Build $X \in \mathbb{R}^{m \times n}$: $m$ data points in $\mathbb{R}^n$.

Let $m = 400$ and $n=2$ (i.e., each row of $X$ represents a point in $\mathbb{R}^2$).

  • To build $X$ we first create a matrix $A \in \mathbb{R}^{m \times n}$ such that $\forall i,j,\; A_{i,j}\sim \mathcal{N}(0,1)$
  • Then we transform $A$ by applying a matrix $T$ (rotation and scaling)
In [3]:
m = 400
n = 2
# Build a matrix with m rows and n columns, such that each element follows a standard normal distribution
A = np.random.randn(m,n) 
# Create a matrix that will modify the initial matrix (rotation, scale change)
T = np.array([[-0.5,0.3],
              [1,2]])
# Apply the linear transformation matrix T to the data matrix A
X = np.dot(A,T)
In [4]:
# Plot the 1st column vector of A vs. the 2nd column vector of A
plt.plot(A[:,0],A[:,1],"o",alpha=0.1,label="$A$")
# Plot the 1st column vector of X vs. the 2nd column vector of X
plt.plot(X[:,0],X[:,1],"o",alpha=0.1,label="$X$")
# Set limits to x and y axis
plt.xlim(-15, 15)
plt.ylim(-15, 15)
# Plot the legend
plt.legend()
Out[4]:
<matplotlib.legend.Legend at 0x1217a7278>

Center the dataset

  • Let $E[ X_{.,i} ]$ be the mean of $X$ along column $i$.
  • Let us center each column $i$ of $X$ by computing $X_{.,i} \leftarrow X_{.,i} - E[X_{.,i}]$
In [5]:
# Compute the mean of each column vector of X. Notice that X.mean(axis=1) would compute the mean of each row, and X.mean() the overall mean
col_mean = X.mean(axis=0)
# NumPy broadcasting subtracts the mean vector from each row
Xc = X - col_mean
print(X.mean(axis=0), Xc.mean(axis=0))
[-0.03486125 -0.01405511] [-1.31838984e-17  3.27515792e-17]

Covariance Matrix

  • The covariance between column vectors $X_{.,i}$ and $X_{.,j}$ is defined as:

$cov [X_{.,i},X_{.,j}]= \sum_{k=1}^m \frac{(X_{k,i}-E[X_{.,i}])(X_{k,j}-{E}[X_{.,j}])}{m}$

  • Since we have centered our dataset: $\forall i, \; E[X_{.,i}] = 0$, then:

$cov [X_{.,i},X_{.,j}]= \sum_{k=1}^m \frac{X_{k,i}X_{k,j}}{m}$

  • In matrix notation

$cov [X_{.,i},X_{.,j}] = \frac{1}{m} X_{.,i}^* \cdot X_{.,j}$

  • Therefore the matrix $Q = X^* \cdot X$ is proportional to the covariance matrix: the covariance matrix is equal to $\frac{1}{m} Q$
  • Question 1: Compute the covariance matrix of $X$
In [6]:
# If Xc is a numpy array, you can obtain its transpose by typing Xc.T or Xc.transpose()
Xc_transpose = Xc.T
# Xc_transpose.dot(Xc) computes the dot product between Xc_transpose and Xc
Q = Xc_transpose.dot(Xc)
# Xc.shape is a tuple containing the size of the matrix, the 1st element is the number of rows and the 2nd the nb. of cols.
print(Xc.shape)
# Here we simply multiply by 1/m to get the variances and covariances
Q*1/Xc.shape[0]
(400, 2)
Out[6]:
array([[1.22187866, 1.66809185],
       [1.66809185, 3.64005625]])
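As a quick sanity check (not part of the original questions), the same matrix can be obtained with NumPy's built-in covariance function, using bias=True so that it also divides by $m$ rather than $m-1$:
In [ ]:
# Sanity check: np.cov with rowvar=False treats columns as variables,
# and bias=True divides by m (not m-1), matching Q/m above
np.cov(Xc, rowvar=False, bias=True)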
  • Question 2.A: Compute the eigenvectors $V$ and eigenvalues $\lambda$ of the matrix $Q = X^*\cdot X$

(recall: $Q\cdot V_{.,i} = \lambda_i V_{.,i}$)

In [7]:
# np.linalg.eig returns a tuple with 2 elements: the first is an np.array with the eigenvalues, and the 2nd a matrix with the eigenvectors (in columns)
lamb,V = np.linalg.eig(Q)
In [8]:
# Check the eigendecomposition property: Q.dot(V) should equal V scaled column-wise by lamb
Q.dot(V), V*lamb
Out[8]:
(array([[ -132.10459719,  -816.47488325],
        [   67.40382437, -1600.2072074 ]]),
 array([[ -132.10459719,  -816.47488325],
        [   67.40382437, -1600.2072074 ]]))
  • Question 2.B: The eigenvalues returned by NumPy are not necessarily sorted; sort them in descending order:
In [10]:
# lamb.argsort() returns the indices that would sort lamb in ascending order
idx = lamb.argsort()
# We want them in descending order, so we reverse the index array
idx = idx[::-1]
# Reorder the eigenvalues
lamb = lamb[idx]
# Reorder the eigenvectors accordingly
V = V[:,idx]
print(lamb)
print(V)
[1796.46718356  148.30677712]
[[-0.45448917 -0.89075226]
 [-0.89075226  0.45448917]]

Principal Component Analysis

  • Let $PC_i = X \cdot V_{.,i}$ be the $i$-th Principal Component.

  • Compute the covariance between $PC_j$ and $PC_k$

$\begin{aligned} cov[PC_j,PC_k] &\propto (X \cdot V_{.,j})^{*} \cdot (X \cdot V_{.,k})\\&= V_{.,j}^*\cdot X^*\cdot X\cdot V_{.,k}\\&=V_{.,j}^*\cdot (\lambda_k V_{.,k})\\&=\lambda_k V_{.,j}^{*}\cdot V_{.,k}\end{aligned}$

  • If $j=k$:

$cov[PC_j,PC_k] = cov[PC_k,PC_k] \propto \lambda_k$

  • Otherwise:

$cov[PC_j,PC_k] = 0$
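This derivation can be checked numerically (a small sketch, reusing Xc, V and lamb from the cells above): the covariance matrix of the principal components should be diagonal, with $\lambda_k / m$ on the diagonal.
In [ ]:
# Numerical check: cov(PC) is diagonal, with lamb/m on the diagonal
PC_check = Xc.dot(V)
print(PC_check.T.dot(PC_check) / Xc.shape[0])  # off-diagonal terms ~ 0
print(lamb / Xc.shape[0])                      # matches the diagonal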

  • Question 3.A: Plot the variance explained along each Principal Axis.
In [11]:
# Each eigenvalue equals the sum of squared coordinates along the corresponding principal axis; dividing by the number of rows gives the variance
plt.plot(lamb/Xc.shape[0], "o")
plt.xlabel("Principal Axis",fontsize=15)
plt.ylabel('Variance explained')
Out[11]:
Text(0, 0.5, 'Variance explained')
  • Question 3.B: Plot the percentage of variance explained along each Principal Axis.
In [12]:
# lamb.sum() returns the sum of the lambda values. 
plt.plot(lamb/lamb.sum()*100, "o")
plt.xlabel("Principal Axis",fontsize=15)
plt.ylabel('Variance explained (%)')
Out[12]:
Text(0, 0.5, 'Variance explained (%)')
  • Question 3.C: Explain the role of Principal Component Analysis
In [ ]:
 

Question 3.D

  • Apply the matrix $V$ to $X$ and plot the Principal Components
  • Apply the matrix $V$ to the canonical basis and plot the projected basis vectors
In [13]:
# Compute the principal components
PC = Xc.dot(V)
# Canonical basis vectors (2D identity matrix)
canonical_axes = np.eye(2)
# Project the canonical basis vectors using matrix V
axis_proj = canonical_axes.dot(V)
In [14]:
plt.plot(PC[:,0],PC[:,1],"o",alpha=0.1)
plt.quiver(*axis_proj[:,0],color="y",angles='xy', scale_units='xy', scale=1)
plt.quiver(*axis_proj[:,1],color="k",angles='xy', scale_units='xy', scale=1)
plt.xlim(-5,5)
plt.ylim(-5,5)
Out[14]:
(-5, 5)
In [15]:
plt.plot(Xc[:,0],Xc[:,1],"o",alpha=0.1)
plt.quiver(*V.T[0,:],color="r",angles='xy', scale_units='xy', scale=1)
plt.quiver(*V.T[1,:],color="g",angles='xy', scale_units='xy', scale=1)
plt.xlim(-5,5)
plt.ylim(-5,5)
Out[15]:
(-5, 5)

Question 3.E: Given the structure of the dataset in the new space, and the percentage of variance explained along each principal axis, does it seem necessary to keep a 2D representation?

In [ ]:
 

Principal Component Analysis and SVD

The singular value decomposition (SVD) of $X$ is:

$X = U \cdot \Sigma \cdot V^*$

  • $U$ is an $m\times m$ orthogonal matrix (its columns are the left singular vectors of $X$)
  • $V$ is an $n\times n$ orthogonal matrix (its columns are the right singular vectors of $X$)
  • $\Sigma$ is an $m\times n$ rectangular diagonal matrix with non-negative entries (the singular values of $X$)

Using the SVD, $X^* X$ can be written as:

$\begin{aligned}X^*X &=V \Sigma^* U^* U \Sigma V^*\\&= V \Sigma^*\Sigma V^* \\&= V \hat {\Sigma}^{2}V^*\end{aligned}$

where $\hat{\Sigma}$ is the $n \times n$ diagonal matrix of singular values.

Therefore:

  • The right singular vectors $V$ of $X$ are the eigenvectors of $X^*X$.
  • The singular values of $X$ are the square roots of the eigenvalues of $X^*X$.
$$PC = U\cdot \Sigma = X\cdot V$$

Question 4: Compute the Principal Components of $X$ using the SVD

In [16]:
# Compute the SVD of Xc, this function returns a tuple with 3 elements: matrix U, an array with the singular values, and matrix V transposed
U,s,Vt = np.linalg.svd(Xc)
In [17]:
# Compute the principal components
V = Vt.T
PC = Xc.dot(V)
In [18]:
plt.plot(PC[:,0],PC[:,1],"o", alpha=0.1)
plt.xlim(-5,5)
plt.ylim(-5,5)
Out[18]:
(-5, 5)
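As a check of the relations stated above (a sketch, not part of the original questions): the squared singular values should equal the eigenvalues of $X^* X$ computed earlier, and $U \cdot \Sigma$ should reproduce the principal components.
In [ ]:
# s**2 equals the (sorted) eigenvalues of Xc^T Xc
print(np.allclose(s**2, lamb))
# With full_matrices=True (the default), U is m x m; only its first n columns carry the PCs
n_cols = Xc.shape[1]
print(np.allclose(U[:, :n_cols] * s, PC))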

Principal Component Analysis (PCA): Faces dataset

Load the dataset

In [19]:
# Load the dataset from a library
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data
In [20]:
# If the sklearn library is not installed, you can load the faces dataset (attached to the email) by uncommenting and running the following commands:
#import pandas as pd
#faces = pd.read_csv("faces.csv",header=0,index_col=0)
#faces = faces.values

Each row of the faces dataset represents a human face picture in gray scale. The dataset contains 400 pictures in total.

In practice, the size of a picture is $64 \times 64$ pixels, and in order to fill the matrix, each picture has been flattened into a row vector with 4096 ($=64 \times 64$) elements.

Therefore the faces matrix represents 400 pictures described in a 4096-dimensional space of pixels.

If we transpose the faces matrix, we have 4096 pixels described in a 400-dimensional space of human faces.

In this practical work we are going to use the SVD to find the "Principal Faces" in the dataset.

Question 1: Reshape the first row of the dataset to create a $64\times 64$ image, and plot the first face using plt.imshow

In [21]:
# We get the first row (the first face)
first_face = faces[0,:]
# We reshape it, to have a 64x64 matrix 
reshaped_first_face = first_face.reshape(64,64)
# We plot it
plt.imshow(reshaped_first_face)
Out[21]:
<matplotlib.image.AxesImage at 0x1a25531e80>

Question 2: Transpose the faces matrix (we are going to work with the transpose only) and center each column

In [22]:
 
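One possible sketch (the names faces_t and faces_tc are my own), reusing the faces matrix loaded above:
In [ ]:
# Transpose: rows are now the 4096 pixels, columns the 400 faces
faces_t = faces.T
# Center each column by subtracting its mean
faces_tc = faces_t - faces_t.mean(axis=0)
print(faces_tc.shape, faces_tc.mean(axis=0)[:3])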

Question 3: Compute its SVD

In [23]:
 
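A possible sketch, assuming faces_tc from the sketch above; full_matrices=False keeps the decomposition compact (U is then $4096 \times 400$):
In [ ]:
# Economy-size SVD of the centered, transposed faces matrix
Uf, sf, Vft = np.linalg.svd(faces_tc, full_matrices=False)
print(Uf.shape, sf.shape, Vft.shape)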

Question 4: Plot the percentage of standard deviation explained along the first 20 principal axes

In [ ]:
 
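A possible sketch, assuming sf from the SVD sketch above; the singular values are proportional to the standard deviation along each principal axis:
In [ ]:
# Percentage of standard deviation explained by the first 20 principal axes
plt.plot(sf[:20] / sf.sum() * 100, "o")
plt.xlabel("Principal Axis", fontsize=15)
plt.ylabel("Standard deviation explained (%)")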

Question 5: Compute the principal components

In [ ]:
 
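A possible sketch, assuming faces_tc and Vft from the sketches above (each column of PCf lives in pixel space and can be viewed as a "Principal Face"):
In [ ]:
# Principal components of the transposed faces matrix
Vf = Vft.T
PCf = faces_tc.dot(Vf)
print(PCf.shape)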

Question 6: Plot the first and the last 4 "Principal Faces"

In [ ]:
 
In [ ]:
 
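A possible sketch, assuming PCf from the sketch above; each principal component has 4096 entries and can be reshaped into a $64 \times 64$ image:
In [ ]:
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i in range(4):
    # First 4 principal faces on the top row
    axes[0, i].imshow(PCf[:, i].reshape(64, 64))
    axes[0, i].set_title("PC %d" % (i + 1))
    # Last 4 principal faces on the bottom row
    axes[1, i].imshow(PCf[:, -4 + i].reshape(64, 64))
    axes[1, i].set_title("PC %d" % (PCf.shape[1] - 3 + i))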

Let $V^k$ be a matrix containing the first $k$ columns of matrix $V$.

Let $\hat X = X \cdot V^k \cdot (V^k)^*$ be an approximation of $X$ using only the first $k$ principal axes

Question 7: Compute $\hat X$ for the face dataset (for k = 3 and k = 50) and plot the first 4 faces of the new dataset

In [ ]:
 
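A possible sketch, assuming faces_tc and Vf from the sketches above; note that the columns of $\hat X$ are centered faces (the mean was subtracted), so the gray levels are shifted:
In [ ]:
# Approximate reconstruction using only the first k principal axes
for k in (3, 50):
    Vk = Vf[:, :k]
    X_hat = faces_tc.dot(Vk).dot(Vk.T)
    fig, axes = plt.subplots(1, 4, figsize=(12, 3))
    for i in range(4):
        # Faces are columns of the transposed matrix, so plot the first 4 columns
        axes[i].imshow(X_hat[:, i].reshape(64, 64))
    fig.suptitle("k = %d" % k)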

Question 8: Let $|| X - \hat X ||_2$ be the approximation error. Compute the error for k = 2 to 400

In [31]:
def SSE(X, X_hat):
    # Sum of squared errors (squared Frobenius norm of the difference)
    return ((X - X_hat)**2).sum().sum()
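One possible way to use this function for $k = 2, \dots, 400$ (a sketch, assuming faces_tc and Vf from the sketches above):
In [ ]:
# Approximation error as a function of the number of principal axes kept
errors = []
ks = range(2, faces_tc.shape[1] + 1)
for k in ks:
    Vk = Vf[:, :k]
    X_hat = faces_tc.dot(Vk).dot(Vk.T)
    errors.append(SSE(faces_tc, X_hat))
plt.plot(list(ks), errors)
plt.xlabel("k")
plt.ylabel("Approximation error")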
In [32]: