Machine learning

Table of contents
Chapter 1: Linear Regression Model
[Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
[Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
[Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
[Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
[Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
[Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)

Chapter 4: Principal Component Analysis

What is principal component analysis?

Compresses the structure of multivariate data into a smaller number of indicators
The aim is to keep the loss of information caused by reducing the number of variables as small as possible
Enables analysis and visualization (in 2D or 3D) using a small number of variables

If the coefficient vector changes, the values after the linear transformation change
Treat the amount of information as the size of the variance
Search for the projection axis that maximizes the variance of the variables after the linear transformation
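To make the idea of a variance-maximizing projection axis concrete, here is a minimal sketch on synthetic data. The data, the random seed, and the helper `projected_variance` are illustrative assumptions and are not part of the original exercise. It compares the variance of the data projected onto random unit vectors with the variance along the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data with correlated features (illustrative only)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]],
                            size=500)
X_centered = X - X.mean(axis=0)

# Sample covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
first_axis = eigvecs[:, -1]  # unit vector belonging to the largest eigenvalue


def projected_variance(data, axis):
    """Sample variance of the data after projecting onto a unit vector."""
    return np.var(data @ axis, ddof=1)


# Variance along the first principal axis equals the largest eigenvalue
var_pc1 = projected_variance(X_centered, first_axis)

# Variance along 1000 random unit directions for comparison
angles = rng.uniform(0.0, 2.0 * np.pi, size=1000)
random_axes = np.column_stack([np.cos(angles), np.sin(angles)])
max_random_var = max(projected_variance(X_centered, a) for a in random_axes)

print('variance along first principal axis:', var_pc1)
print('largest eigenvalue of covariance   :', eigvals[-1])
print('best of 1000 random directions     :', max_random_var)
```

No random direction should exceed the variance along the first principal axis, which is what "searching for the axis that maximizes the variance" means in practice.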

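The projection axis is obtained by solving a constrained optimization problem
The coefficient vector is constrained to have norm 1 (without the constraint, scaling the vector makes the variance arbitrarily large, so the problem has no unique solution)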
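The formulas themselves are not reproduced in this text. As a reference sketch in standard notation (the symbols $\bar{X}$ for the centered data matrix, $\boldsymbol{a}_j$ for the coefficient vector, $\Sigma$ for the covariance matrix, and $\lambda$ for the Lagrange multiplier are assumptions introduced here), the problem and its solution can be written as:

```latex
\begin{aligned}
  \boldsymbol{s}_j &= \bar{X}\boldsymbol{a}_j,
  \qquad
  \mathrm{Var}(\boldsymbol{s}_j)
    = \frac{1}{n}\,\boldsymbol{s}_j^{\top}\boldsymbol{s}_j
    = \boldsymbol{a}_j^{\top}\Sigma\,\boldsymbol{a}_j,
  \qquad
  \Sigma = \frac{1}{n}\,\bar{X}^{\top}\bar{X} \\[4pt]
  &\max_{\boldsymbol{a}_j}\ \boldsymbol{a}_j^{\top}\Sigma\,\boldsymbol{a}_j
  \quad \text{subject to} \quad
  \boldsymbol{a}_j^{\top}\boldsymbol{a}_j = 1 \\[4pt]
  E(\boldsymbol{a}_j)
    &= \boldsymbol{a}_j^{\top}\Sigma\,\boldsymbol{a}_j
       - \lambda\left(\boldsymbol{a}_j^{\top}\boldsymbol{a}_j - 1\right) \\[4pt]
  \frac{\partial E}{\partial \boldsymbol{a}_j}
    &= 2\Sigma\,\boldsymbol{a}_j - 2\lambda\,\boldsymbol{a}_j = 0
  \quad\Longrightarrow\quad
  \Sigma\,\boldsymbol{a}_j = \lambda\,\boldsymbol{a}_j
\end{aligned}
```

Under these assumptions the optimal axes are eigenvectors of the covariance matrix, and the variance along each axis, $\boldsymbol{a}_j^{\top}\Sigma\,\boldsymbol{a}_j = \lambda$, is the corresponding eigenvalue.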

(Exercise 4) Dimensionality reduction of breast cancer test data using scikit-learn

Settings
Build a logistic regression model from breast cancer test data
Reduce the data to a 2D space using principal components
Number of records: 569; number of columns: 33
Task
Check whether the data can still be discriminated well when the 30 explanatory variables are compressed to 2 dimensions.

Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

sys.path settings

The code below assumes that a study_ai_ml folder has been created directly under My Drive in Google Drive.

cancer_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/cancer.csv')
print('cancer df shape: {}'.format(cancer_df.shape))

`result`


cancer df shape: (569, 33)

cancer_df

cancer_df.drop('Unnamed: 32', axis=1, inplace=True)
cancer_df

**・ diagnosis: diagnosis result (B = benign, M = malignant)**
**・ Classify with logistic regression, using the second column (diagnosis) as the objective variable and the columns from the third onward as explanatory variables**

#Extraction of objective variable
y = cancer_df.diagnosis.apply(lambda d: 1 if d == 'M' else 0)
#Extraction of explanatory variables
X = cancer_df.loc[:, 'radius_mean':]
#Separate data for learning and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

#Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Learning with logistic regression
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_scaled, y_train)

#Verification
print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))

`result`


Train score: 0.988
Test score: 0.972
Confusion matrix:
[[89  1]
 [ 3 50]]

**・ Confirmed that the test data can be classified with an accuracy of about 97%**
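The test score can be recomputed by hand from the confusion matrix printed above. The short sketch below does this and also derives recall and precision for the malignant class; those two metrics are an addition for illustration and are not reported in the original exercise.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions,
# ordered as [benign (0), malignant (1)] following scikit-learn's convention
cm = np.array([[89, 1],
               [3, 50]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test samples classified correctly
recall = tp / (tp + fn)           # share of malignant cases that were detected
precision = tp / (tp + fp)        # share of malignant predictions that were correct

print('Accuracy: {:.3f}'.format(accuracy))               # Accuracy: 0.972
print('Recall (malignant): {:.3f}'.format(recall))       # Recall (malignant): 0.943
print('Precision (malignant): {:.3f}'.format(precision)) # Precision (malignant): 0.980
```

The accuracy of 139 / 143 matches the reported test score of 0.972. The 3 false negatives are malignant cases predicted as benign, which is usually the more costly kind of error in a diagnostic setting.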

pca = PCA(n_components=30)
pca.fit(X_train_scaled)
plt.bar([n for n in range(1, len(pca.explained_variance_ratio_)+1)], pca.explained_variance_ratio_)
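Besides the bar chart of individual contribution rates, the cumulative contribution rate is a common way to judge how many components are worth keeping. The following is a minimal sketch under two assumptions: scikit-learn's bundled breast cancer dataset stands in for cancer.csv (the values may therefore differ slightly from the original run), and the 90% cutoff is a hypothetical choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for cancer.csv
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA(n_components=30)
pca.fit(X_train_scaled)

# Running total of the contribution rates of the components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative contribution reaches the cutoff
threshold = 0.90  # hypothetical cutoff, not from the original exercise
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print('cumulative contribution of first 2 components: {:.3f}'.format(cumulative[1]))
print('components needed for {:.0%}: {}'.format(threshold, n_needed))
```

Plotting `cumulative` against the component index gives a curve that flattens as later components add little variance.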

# PCA
#Compress up to 2 dimensions
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

#Contribution rate
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

#Plot on scatter plot
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') #Benign is marked with ○
plt.scatter(x=m[0], y=m[1], marker='^') #Malignant is marked with △
plt.xlabel('PC 1') #X-axis of the first principal component
plt.ylabel('PC 2') #The second principal component is the y-axis

#Project the standardized data onto the 2 principal components fitted above
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

#Learning with logistic regression on the 2 principal components
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_pca, y_train)

#Verification
print('Train score: {:.3f}'.format(logistic.score(X_train_pca, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_pca, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_pca))))

`result`


Train score: 0.927
Test score: 0.944
Confusion matrix:
[[87  3]
 [ 5 48]]

**・ Confirmed that the test data can be classified with an accuracy of about 94%**

Even with the number of dimensions reduced to 2, the test score dropped only slightly, from 97% to 94%, so the dimensionality was reduced while largely maintaining accuracy.
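The standardize, compress, and classify steps can also be bundled into a single scikit-learn pipeline, which ensures that the scaler and the PCA are fitted on the training data only. This is a sketch rather than the original notebook's code: it assumes scikit-learn's bundled breast cancer dataset as a stand-in for cancer.csv, so the scores may not match the ones printed above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled breast cancer data stands in for cancer.csv
X, target = load_breast_cancer(return_X_y=True)
y = 1 - target  # the bundled data codes malignant as 0; flip so malignant is 1

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize, compress to 2 principal components, then classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegressionCV(cv=10, random_state=0),
)
pipe.fit(X_train, y_train)

print('Train score: {:.3f}'.format(pipe.score(X_train, y_train)))
print('Test score: {:.3f}'.format(pipe.score(X_test, y_test)))
```

Changing `n_components` in the pipeline is then enough to compare how the test score behaves for different numbers of retained dimensions.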

Related Sites

Chapter 1: Linear Regression Model
[Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
[Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
[Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
[Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
[Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
[Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)
