[PYTHON] <Course> Machine learning Chapter 4: Principal component analysis

Machine learning

Chapter 4: Principal Component Analysis

What is principal component analysis?




(Practice 4) Dimensionality reduction of breast cancer diagnostic data using scikit-learn

Mount Google Drive first

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

sys.path settings

The code below assumes that a study_ai_ml folder has been created directly under My Drive in Google Drive.

cancer_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/cancer.csv')
print('cancer df shape: {}'.format(cancer_df.shape))


cancer df shape: (569, 33)
cancer_df.drop('Unnamed: 32', axis=1, inplace=True)

** ・ diagnosis: diagnosis result (B = benign, M = malignant) ・ Classify with logistic regression, using the second column as the objective variable and the third column onward as the explanatory variables **

# Extract the objective variable (1 = malignant, 0 = benign)
y = cancer_df.diagnosis.apply(lambda d: 1 if d == 'M' else 0)
# Extract the explanatory variables (radius_mean and every column after it)
X = cancer_df.loc[:, 'radius_mean':]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize the features; the scaler is fitted on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression on all standardized features
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_scaled, y_train)

print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))


Train score: 0.988
Test score: 0.972
Confusion matrix:
[[89  1]
 [ 3 50]]

** ・ Confirmed that the data can be classified with a test score of about 97% **
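As a quick check, the test score can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch that is not part of the original notebook. The variable names are illustrative, and recall and precision are extra metrics added here only to show what else the same four counts provide.

```python
# Confusion matrix reported above for the model trained on all features.
# scikit-learn layout: rows are true labels, columns are predicted labels,
# with label 0 (benign) first and label 1 (malignant) second.
matrix = [[89, 1],
          [3, 50]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all test samples classified correctly
recall = tp / (tp + fn)        # share of malignant samples that were detected
precision = tp / (tp + fp)     # share of malignant predictions that were correct

print('Accuracy:  {:.3f}'.format(accuracy))
print('Recall:    {:.3f}'.format(recall))
print('Precision: {:.3f}'.format(precision))
```

The accuracy comes out to 139 / 143, which matches the reported test score of 0.972.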

# Fit PCA with all 30 components to see the contribution rate of each one
pca = PCA(n_components=30)
pca.fit(X_train_scaled)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
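The bar chart shows the contribution rate of each component separately. When deciding how many components to keep, a cumulative view can be easier to read. The sketch below is one way to compute it, under some assumptions: scikit-learn's bundled load_breast_cancer data stands in for cancer.csv, and the 0.95 threshold is a hypothetical target chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for cancer.csv
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize first, since PCA is sensitive to the scale of each feature
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit with all components and accumulate the contribution rates
pca = PCA(n_components=X_train_scaled.shape[1]).fit(X_train_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # hypothetical target, chosen only for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print('Components needed for {:.0%} of the variance: {}'.format(threshold, n_needed))
```

PCA also accepts a float between 0 and 1 as n_components, in which case it picks the number of components needed to reach that share of the variance.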
# Compress down to 2 dimensions
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

#Contribution rate
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

# Draw a scatter plot of the two principal components
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o')  # benign samples are drawn as circles
plt.scatter(x=m[0], y=m[1], marker='^')  # malignant samples are drawn as triangles
plt.xlabel('PC 1')  # x-axis: first principal component
plt.ylabel('PC 2')  # y-axis: second principal component
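The two explained variance ratios printed above can be added up to see how much of the total variance survives the compression to 2 dimensions. This is a small arithmetic sketch using only the values shown in the output; the variable names are illustrative.

```python
from itertools import accumulate

# Explained variance ratios printed above for PC 1 and PC 2
ratios = [0.43315126, 0.19586506]

# Running total: variance retained when keeping the first k components
cumulative = list(accumulate(ratios))

for k, value in enumerate(cumulative, start=1):
    print('First {} component(s): {:.1%} of the variance'.format(k, value))
```

Together the two components retain about 63% of the variance of the standardized training data.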
# Apply the PCA fitted on the training data to the test data
X_test_pca = pca.transform(X_test_scaled)

# Train logistic regression on the two principal components
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_pca, y_train)

print('Train score: {:.3f}'.format(logistic.score(X_train_pca, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_pca, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_pca))))


Train score: 0.927
Test score: 0.944
Confusion matrix:
[[87  3]
 [ 5 48]]

** ・ Confirmed that the data can still be classified with a test score of about 94% ** Even with the features reduced from 30 dimensions to 2, the test score fell by only about 3 points, from 97% to 94%, so the number of dimensions was reduced with little loss of accuracy.
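The comparison above can also be written more compactly by chaining the scaler, PCA, and the classifier in a scikit-learn pipeline, so that each step is fitted on the training data and then applied unchanged to the test data. The following is a sketch under several assumptions rather than the original notebook's code: load_breast_cancer stands in for cancer.csv, plain LogisticRegression replaces LogisticRegressionCV to keep the run short, and build_model is a hypothetical helper. The scores it prints may therefore differ from the ones reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for cancer.csv.
# Note that its target is encoded as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def build_model(n_components):
    # Hypothetical helper: scaling, PCA and the classifier in one estimator.
    # Calling fit() fits every step on the training data only.
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )


scores = {}
for n_components in (2, 30):
    model = build_model(n_components).fit(X_train, y_train)
    scores[n_components] = model.score(X_test, y_test)
    print('{:>2} components: test score {:.3f}'.format(n_components, scores[n_components]))
```

Because the pipeline is a single estimator, it can be passed directly to tools such as cross_val_score without refitting the scaler or PCA on held-out data.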

