[PYTHON] <Course> Machine learning Chapter 4: Principal component analysis


Table of contents

- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)

Chapter 4: Principal Component Analysis

What is principal component analysis?

PCA1.jpg

PCA2.jpg

PCA3.jpg
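
The slides above are not reproduced as text. As a rough sketch of the standard PCA computation (not necessarily the exact presentation in the slides): center the data, take the eigenvectors of the sample covariance matrix, and project onto the eigenvectors with the largest eigenvalues. The function name `pca_eig` and the toy data `X_toy` are hypothetical, introduced only for illustration.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix (illustrative sketch)."""
    X_centered = X - X.mean(axis=0)              # center each feature
    cov = np.cov(X_centered, rowvar=False)       # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # columns are the principal axes
    ratio = eigvals[:n_components] / eigvals.sum()  # contribution rate of each kept axis
    return X_centered @ components, components, ratio

# Toy data (an assumption for this sketch): two strongly correlated features, y ≈ 2x
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_toy = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])

Z, W, ratio = pca_eig(X_toy, n_components=1)
print('projected shape:', Z.shape)
print('first principal axis:', W[:, 0])
print('contribution rate:', ratio)
```

Because the two toy features are almost perfectly correlated, a single principal component retains nearly all of the variance, which is the situation in which dimensional compression loses little information.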

(Practice 4) Dimensionality reduction of breast cancer examination data using scikit-learn

Start by mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

sys.path settings

The path below assumes that a study_ai_ml folder has been created directly under My Drive in Google Drive.

cancer_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/cancer.csv')
print('cancer df shape: {}'.format(cancer_df.shape))

result


cancer df shape: (569, 33)
cancer_df
Screenshot 2019-12-13 16.53.21.png
cancer_df.drop('Unnamed: 32', axis=1, inplace=True)
cancer_df
Screenshot 2019-12-13 16.56.55.png

** ・ diagnosis: diagnosis result (benign is B / malignant is M) ・ Using the second column as the objective variable and the third and subsequent columns as explanatory variables, classify with logistic regression **

#Extract the objective variable (1 = malignant, 0 = benign)
y = cancer_df.diagnosis.apply(lambda d: 1 if d == 'M' else 0)
#Extract the explanatory variables (all columns from radius_mean onward)
X = cancer_df.loc[:, 'radius_mean':]
#Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Learning with logistic regression
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_scaled, y_train)

#Verification
print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))

result


Train score: 0.988
Test score: 0.972
Confusion matrix:
[[89  1]
 [ 3 50]]

** ・ Confirmed that the test data can be classified with an accuracy of 97% **

#Fit PCA with all 30 components and plot the contribution rate of each
pca = PCA(n_components=30)
pca.fit(X_train_scaled)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
Screenshot 2019-12-13 17.02.01.png
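
The bar chart shows the contribution rate of each component individually. A common follow-up, sketched below, is to accumulate those ratios and read off how many components are needed to reach a chosen share of the variance. This sketch loads scikit-learn's bundled `load_breast_cancer` data so that it runs on its own; that this copy matches cancer.csv is an assumption, and the 0.95 threshold and the names `X_demo`, `cumulative`, and `n_for_95` are illustrative choices, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for cancer.csv (569 samples, 30 features)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te = train_test_split(X_demo, random_state=0)
X_tr_scaled = StandardScaler().fit_transform(X_tr)

# Keep every component, then accumulate the contribution rates
pca_full = PCA().fit(X_tr_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative contribution reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print('cumulative contribution of the first 2 components: {:.3f}'.format(cumulative[1]))
print('components needed for 95% of the variance: {}'.format(n_for_95))
```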
# PCA
#Compress to 2 dimensions
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

#Contribution rate
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

#Plot on scatter plot
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') #Benign is marked with ○
plt.scatter(x=m[0], y=m[1], marker='^') #Malignant is marked with △
plt.xlabel('PC 1') #x-axis: first principal component
plt.ylabel('PC 2') #y-axis: second principal component
Screenshot 2019-12-13 17.03.18.png
#Project the standardized test data with the PCA fitted on the training data
X_test_pca = pca.transform(X_test_scaled)

#Learning with logistic regression on the 2-dimensional data
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_pca, y_train)

#Verification
print('Train score: {:.3f}'.format(logistic.score(X_train_pca, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_pca, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_pca))))

result


Train score: 0.927
Test score: 0.944
Confusion matrix:
[[87  3]
 [ 5 48]]

** ・ Confirmed that the test data can be classified with an accuracy of 94% **

Even after reducing the 30 dimensions to 2, the test score dropped only from 97% to 94%. The first two principal components account for about 63% of the variance (0.433 + 0.196), yet they retain most of the information needed for classification, so the number of dimensions was reduced substantially at a small cost in accuracy.
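
To see how the trade-off between dimensions and accuracy behaves beyond the single case of 2 components, the comparison can be repeated for several values. The following is a minimal sketch, not the procedure used above: it chains standardization, PCA, and a plain `LogisticRegression` (swapped in for `LogisticRegressionCV` to keep the run short) in a scikit-learn pipeline, and it uses the bundled `load_breast_cancer` data on the assumption that it matches cancer.csv. The component counts and variable names are illustrative, and the scores it prints are not expected to reproduce the figures above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for cancer.csv.
# Note: this copy encodes malignant as 0 and benign as 1, the reverse of the labels above;
# accuracy is unaffected by which class is called 1.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

test_scores = {}
for k in (2, 5, 10, 30):
    # The pipeline fits the scaler and PCA on the training data only,
    # then applies the same transforms to the test data when scoring
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_tr, y_tr)
    test_scores[k] = model.score(X_te, y_te)
    print('n_components={:2d}  test score: {:.3f}'.format(k, test_scores[k]))
```

Wrapping the steps in a pipeline keeps the test data out of the fitting of the scaler and the PCA, which is the same separation the code above achieves by calling `fit_transform` on the training data and `transform` on the test data.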

