%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
To read the dataset we are going to use the function `read_csv` from the pandas library. In the following box the dataset is first loaded as a "dataframe" (similar to those from R): each column corresponds to a variable (dimension) and each row to a point.
This dataset consists of $n=9$ physiological and medical variables (columns) measured for $m=768$ patients (rows).
Each column represents the following variables:
import pandas as pd
dataset = pd.read_csv('diabetes.csv',header=0)
To see the first 3 lines of the dataset we use the `head` method with the parameter 3.
dataset.head(3)
For this project, we will consider a reduced dataset with the following 5 columns as the conditions:
Let $X$ be the $m\times 5$ matrix holding the values of each of these condition variables for each patient.
We are going to consider the following column as our observation:
Let $y$ be the vector of observations, with one entry per patient.
Our goal is to find $c$ a vector of parameters such that: $$X\cdot c +r = y \quad \text{and}\quad ||r||_2 \text{ is minimized }$$
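As a minimal sketch of what "minimizing $||r||_2$" means, the box below uses a small random toy problem (hypothetical data, not the diabetes dataset) and NumPy's `np.linalg.lstsq`. The names `X_toy`, `y_toy`, `c_toy` and `c_other` are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: 6 points, 2 condition variables
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 2))
y_toy = rng.normal(size=6)

# np.linalg.lstsq returns the c that minimizes ||X c - y||_2
c_toy, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
r_toy = y_toy - X_toy.dot(c_toy)      # residual, so that X c + r = y

# Any other parameter vector gives a residual that is at least as large
c_other = c_toy + np.array([0.1, -0.2])
r_other = y_toy - X_toy.dot(c_other)

print("||r|| at the minimizer:  ", np.linalg.norm(r_toy))
print("||r|| at a perturbed c:  ", np.linalg.norm(r_other))
```

At the minimizer the residual is orthogonal to the columns of the matrix, which is exactly what the normal equation expresses.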
In the following box we construct our matrix $X$ and vector $y$ as `np.array`, so you do not need to bother understanding the structure of the dataframe.
# Get only the required variables
dataset_X = dataset[["Pregnancies",
"BloodPressure",
"SkinThickness",
"BMI",
"Age"]]
# Get only the observation variable
dataset_y = dataset["Glucose"]
# Get only np.array out of the dataset
X = dataset_X.values
y = dataset_y.values
print("type of X: ",type(X))
print("shape of X: ",X.shape)
print("type of y: ",type(y))
print("shape of y: ",y.shape)
plt.figure(1,figsize=(13,13))
# Plot Glucose vs Pregnancies
plt.subplot(321)
plt.plot(X[:,0], y, "o")
plt.xlabel("Pregnancies", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs BloodPressure
plt.subplot(322)
plt.plot(X[:,1], y, "o")
plt.xlabel("BloodPressure", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs SkinThickness
plt.subplot(323)
plt.plot(X[:,2], y, "o")
plt.xlabel("SkinThickness", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs BMI
plt.subplot(324)
plt.plot(X[:,3], y, "o")
plt.xlabel("BMI", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs Age
plt.subplot(325)
plt.plot(X[:,4], y, "o")
plt.xlabel("Age", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
Q,R = np.linalg.qr(X)
Q.shape,R.shape
R is an upper triangular matrix
plt.imshow(R)
Q has orthonormal columns, so $Q^* Q$ is the identity matrix
plt.imshow(Q.T.dot(Q))
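The two images above can also be checked numerically. The following is a small sketch on a hypothetical random tall matrix `A` (standing in for $X$), assuming the default reduced mode of `np.linalg.qr`.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))        # hypothetical tall matrix standing in for X

Q_a, R_a = np.linalg.qr(A)         # reduced QR: Q_a is 8x3, R_a is 3x3

# R is upper triangular: zeroing the part below the diagonal changes nothing
is_upper = np.allclose(R_a, np.triu(R_a))
# The columns of Q are orthonormal: Q^T Q is the 3x3 identity
has_orthonormal_cols = np.allclose(Q_a.T.dot(Q_a), np.eye(3))
# The factorization reproduces A
reconstructs = np.allclose(Q_a.dot(R_a), A)

print(is_upper, has_orthonormal_cols, reconstructs)
```

Note that in the reduced factorization $Q$ is not square, so $Q Q^*$ is in general not the identity; only $Q^* Q$ is.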
$X^* X c = X^* y$ (normal equation) and $X = QR$ (QR factorization)
Then: $(QR)^*(QR) c = (QR)^* y$
$R^* Q^* Q R c = R^* Q^* y$
$R^* R c = R^* Q^* y$
If $X$ is full rank then $R$ is invertible, so we can multiply both sides on the left by $(R^*)^{-1}$:
$Rc = Q^* y$
$c = R^{-1} Q^* y$
# Solve R c = Q^T y (this avoids forming the inverse of R explicitly)
c = np.linalg.solve(R, Q.T.dot(y))
print(c)
plt.plot(c,"o")
plt.hlines(0,xmin=0,xmax=4)
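As a sanity check of the derivation, the sketch below compares three ways of computing $c$ on a hypothetical random problem (`A` and `b` are illustrative stand-ins for $X$ and $y$): the QR route, the normal equation, and `np.linalg.lstsq`. For a full-rank matrix they should agree up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))       # hypothetical stand-in for X
b = rng.normal(size=10)            # hypothetical stand-in for y

# QR route: solve R c = Q^T b
Q_a, R_a = np.linalg.qr(A)
c_qr = np.linalg.solve(R_a, Q_a.T.dot(b))

# Normal equation: solve (A^T A) c = A^T b
c_ne = np.linalg.solve(A.T.dot(A), A.T.dot(b))

# Reference least-squares solver
c_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(c_qr, c_ls), np.allclose(c_ne, c_ls))
```

The QR route is usually preferred in practice because forming $X^* X$ squares the condition number of the problem.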
What happens if an individual's BMI increases by one unit? The regression predicts that their glucose level increases by 1.866, all other variables being held fixed.
What happens if an individual gets 1 year older? The regression predicts that their glucose level increases by 1.199.
If the skin thickness increases by one unit, the impact on the glucose level is very small (-0.047).
...
Keep in mind that this is not reality: this is just a regression model, and we did not even check the statistical validity of these results (that is out of the scope of this lecture).
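The reading of the coefficients used above follows directly from the linearity of the model: increasing one variable by one unit, with the others fixed, changes the prediction by exactly the corresponding entry of $c$. A minimal sketch with made-up numbers (`c_demo` and `x_patient` are hypothetical, not values from the dataset):

```python
import numpy as np

# Hypothetical coefficients and patient, for illustration only
c_demo = np.array([0.5, -1.0, 2.0])
x_patient = np.array([3.0, 70.0, 25.0])

# Increase the third variable by one unit, keep the others fixed
x_shifted = x_patient.copy()
x_shifted[2] += 1.0

# The change in the prediction equals the third coefficient
change = x_shifted.dot(c_demo) - x_patient.dot(c_demo)
print(change, c_demo[2])
```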
Compute the regression $X\cdot c$
Xc = X.dot(c)
plt.figure(1,figsize=(13,13))
# Plot Glucose vs Pregnancies
plt.subplot(321)
plt.plot(X[:,0], y, "o")
plt.plot(X[:,0], Xc, "o")
plt.xlabel("Pregnancies", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs BloodPressure
plt.subplot(322)
plt.plot(X[:,1], y, "o")
plt.plot(X[:,1], Xc, "o")
plt.xlabel("BloodPressure", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs SkinThickness
plt.subplot(323)
plt.plot(X[:,2], y, "o")
plt.plot(X[:,2], Xc, "o")
plt.xlabel("SkinThickness", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs BMI
plt.subplot(324)
plt.plot(X[:,3], y, "o")
plt.plot(X[:,3], Xc, "o")
plt.xlabel("BMI", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
# Plot Glucose vs Age
plt.subplot(325)
plt.plot(X[:,4], y, "o")
plt.plot(X[:,4], Xc, "o")
plt.xlabel("Age", fontsize=20)
plt.ylabel("Glucose", fontsize=20)
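# Center each column: the principal axes are computed on mean-centered data
X_ = X - X.mean(axis=0)
U,s,Vt = np.linalg.svd(X_,full_matrices=False)
# Note: V holds Vt, so the rows of V are the principal axes
V = Vt
# Check: the column means of the centered data are (numerically) zero
X_.mean(axis=0)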
Compute the variance along each principal axis
variances = s**2/(X.shape[0]-1)
Compute the % of variance carried by each principal axis
variances_percent = variances / variances.sum()
variances_percent*100
plt.plot(variances_percent*100,"o")
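To see why the squared singular values divided by $m-1$ are variances, the sketch below compares them with the eigenvalues of the sample covariance matrix on hypothetical random data (`A` is an illustrative stand-in for $X$). It assumes the data is centered before the SVD.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 50 points, 4 variables with different spreads
A = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# Center each column, then take the SVD
A_centered = A - A.mean(axis=0)
_, s_a, Vt_a = np.linalg.svd(A_centered, full_matrices=False)

# Variance along each principal axis and its share of the total
var_axes = s_a**2 / (A.shape[0] - 1)
var_percent = 100 * var_axes / var_axes.sum()

# Eigenvalues of the sample covariance matrix, sorted in decreasing order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(A_centered, rowvar=False)))[::-1]

print(np.allclose(var_axes, eigvals))
print(var_percent)
```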
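# Project the centered data onto the first two principal axes (the first two rows of V = Vt)
X_principal = (X - X.mean(axis=0)).dot(V.T[:,:2])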
plt.plot(X_principal[:,0],X_principal[:,1],"o")
map_cannonical = np.eye(5).dot(V).T[:,:2]
# Draw one labeled arrow per original variable
names = ["Pregnancies", "BloodPressure", "SkinThickness", "BMI", "Age"]
for name, (dx, dy) in zip(names, map_cannonical):
    plt.arrow(0, 0, dx, dy, head_width=0.05)
    plt.text(x=dx, y=dy, s=name)
plt.xlim([-1,1])
plt.ylim([-1,1])
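The plot above combines two projections onto the plane of the first two principal axes: the data points and the canonical axes (one arrow per original variable). A minimal sketch of both on hypothetical random data follows; the names `W`, `scores` and `arrows` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(30, 3))           # hypothetical stand-in for X
A_centered = A - A.mean(axis=0)

_, _, Vt_a = np.linalg.svd(A_centered, full_matrices=False)
W = Vt_a.T[:, :2]                      # columns = first two principal axes

scores = A_centered.dot(W)             # coordinates of each point in the principal plane
arrows = np.eye(3).dot(W)              # row i = projection of canonical axis i

print(scores.shape, arrows.shape)
# The two principal axes are orthonormal
print(np.allclose(W.T.dot(W), np.eye(2)))
```

Since each arrow is the projection of a unit vector, its length is at most 1, which is why limits of $[-1,1]$ on both axes are enough for the arrows.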