Introduction to Python: Machine Learning Basics (Unsupervised Learning / Principal Component Analysis)

Index

- Introduction to Python: Python Basics
- Introduction to Python: Scientific Calculations with Python
- Introduction to Python: Machine Learning Basics (Unsupervised Learning / Principal Component Analysis)

What is Principal Component Analysis (PCA)?

- Summarizes the whole dataset in 1 to 3 dimensions that are easy to understand and give a good overview.
- Big data is multivariate and high-dimensional, so it is hard to grasp as it is. By performing principal component analysis, the overall picture of the data can be visualized in a form that anyone can understand, while losing as little of the information contained in the data as possible.

The following is an excerpt from Wikipedia

Principal component analysis (PCA) is a multivariate analysis method that synthesizes, from a large number of correlated variables, a small number of uncorrelated variables called principal components that best represent the overall variation [1]. It is used to reduce the dimensionality of data.

The transformation that gives the principal components is chosen so as to maximize the variance of the first principal component, and to maximize the variance of each subsequent principal component under the constraint that it is orthogonal to the previously determined principal components. Maximizing the variance of the principal components gives them as much ability as possible to explain the variation in the observed values. The chosen principal components are mutually orthogonal, and a given set of observations can be represented as a linear combination of them. In other words, the principal components form an orthogonal basis for the set of observations. The orthogonality of the principal component vectors follows from the fact that they are the eigenvectors of the covariance matrix (or correlation matrix), which is a real symmetric matrix.
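Since the principal components are the eigenvectors of the covariance matrix, this relationship can be checked directly with NumPy. Below is a minimal sketch under that definition (the dataset and variable names are my own illustrative choices): the eigenvalues are the variances of the principal components and the eigenvectors are their directions.

import numpy as np

# Illustrative two-variable dataset (assumed example, not from this article)
rng = np.random.RandomState(0)
X = rng.randn(100, 2) @ np.array([[2.0, 0.5], [0.5, 1.0]])

# Covariance matrix of the centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigen-decomposition of the (real symmetric) covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by variance in descending order; each column of eigenvectors is one principal component direction
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])        # variances of the principal components
print(eigenvectors[:, order].T)  # rows are the principal component directions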

Try principal component analysis

The following program uses a RandomState object to generate a two-variable dataset, standardizes each variable, and plots the result.

import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Create a RandomState object with the seed (initial value of the random number generator) set to 1
sample = np.random.RandomState(1)

# Generate a two-variable dataset (200 samples) using the rand and randn functions
X = np.dot(sample.rand(2, 2), sample.randn(2, 200)).T

# Standardization
sc = StandardScaler()
X_std = sc.fit_transform(X)

# Calculate the correlation coefficient and plot the data
print('Correlation coefficient: {:.3f}'.format(sp.stats.pearsonr(X_std[:, 0], X_std[:, 1])[0]))
plt.scatter(X_std[:, 0], X_std[:, 1])

The following is the output result

Correlation coefficient: 0.889

(Scatter plot of the standardized two-variable data)

Reference URLs for the standardization part

- scikit-learn fit() / transform() / fit_transform()
- What is standardization

Perform principal component analysis

Principal component analysis can be performed using the PCA class of the sklearn.decomposition module.

When initializing a PCA object, you specify the number of dimensions you want to compress the variables to, that is, the number of principal components you want to extract, as n_components. Normally this is set to a value smaller than the number of original variables (for example, reducing 30 variables to 5).

By executing the fit method, the information needed to extract the principal components is learned (specifically, the eigenvalues and eigenvectors are calculated).

#import
from sklearn.decomposition import PCA

#Principal component analysis
pca = PCA(n_components=2)
pca.fit(X_std)

Output result


PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Checking the learning results

components_ attribute

The components_ attribute holds the eigenvectors and represents the directions of the new feature space axes discovered by principal component analysis.

print(pca.components_)

Output result


[[-0.707 -0.707] #Orientation of the first principal component
 [-0.707  0.707]] #Orientation of the second principal component

explained_variance_ attribute

The explained_variance_ attribute represents the variance of each principal component.

print('Variance of each principal component: {}'.format(pca.explained_variance_))

Output result


Variance of each principal component: [1.899 0.111]

The variances of the two principal components extracted here are 1.899 and 0.111, respectively. It is no coincidence that they sum to (approximately) 2.0: the sum of the variances of the original (standardized) variables is equal to the sum of the variances of the principal components. In other words, the variance (information) is preserved.
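This can be checked directly. The following minimal check, assuming the X_std and pca objects defined above, compares the sum of the variances of the standardized variables with the sum of explained_variance_ (both are computed here as unbiased variances, which is why the sum is slightly above 2.0):

# Sum of the variances of the standardized variables (ddof=1 to match explained_variance_)
print(X_std.var(axis=0, ddof=1).sum())

# Sum of the variances of the principal components
print(pca.explained_variance_.sum())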

explained_variance_ratio_ attribute

The explained_variance_ratio_ attribute gives the ratio of the variance explained by each principal component.

print('Variance ratio of each principal component: {}'.format(pca.explained_variance_ratio_))

Output result


Variance ratio of each principal component: [0.945 0.055]

The first value, 0.945, is obtained as 1.899 / (1.899 + 0.111), which shows that the first principal component retains 94.5% of the information in the original data.
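The same ratio can be recomputed from explained_variance_ as a quick check (this simple division works here because both principal components were kept, so the sum of explained_variance_ equals the total variance of the data):

# Variance of each principal component divided by the total variance
ratio = pca.explained_variance_ / pca.explained_variance_.sum()
print(ratio)                           # should match pca.explained_variance_ratio_
print(pca.explained_variance_ratio_)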

Let's visualize the above results in a diagram.

# Arrow style settings
arrowprops=dict(arrowstyle='->',
                linewidth=2,
                shrinkA=0, shrinkB=0)

#Function for drawing an arrow
def draw_vector(v0, v1):
    plt.gca().annotate('', v1, v0, arrowprops=arrowprops)

#Plot the original data
plt.scatter(X_std[:, 0], X_std[:, 1], alpha=0.2)

#Display the two axes of principal component analysis with arrows
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)

# Use equal scaling so that one unit on the x axis has the same length as one unit on the y axis
plt.axis('equal');

The following is the output result.

(Scatter plot of the standardized data with the two principal component axes drawn as arrows)

The arrows show the directions of the axes of the new feature space obtained by principal component analysis. You can see that the first principal component is determined in the direction of maximum variance and that it is orthogonal to the vector of the second principal component.

As the figure shows, the vector in the direction of maximum variance is the first principal component, and the vector in the direction of the next-largest variance is the second principal component. The first and second principal components are orthogonal to each other (they form an orthonormal basis).
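The orthonormality of the principal component vectors can also be confirmed numerically. A small check, assuming the pca object fitted above:

# Each row of components_ is a unit-length principal component vector,
# so for an orthonormal set the matrix of pairwise dot products is the identity matrix
print(np.dot(pca.components_, pca.components_.T))

# Dot product of the first and second principal components (approximately 0)
print(np.dot(pca.components_[0], pca.components_[1]))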

Example of principal component analysis

From here, we will look concretely at the kinds of situations in which compressing dimensions with principal component analysis is useful.

Breast cancer data can be loaded using the load_breast_cancer function in sklearn.datasets. The following code actually loads the data and visualizes the distribution of each explanatory variable, split by whether the objective variable (cancer.target) is "malignant" or "benign".

#Import to read breast cancer data
from sklearn.datasets import load_breast_cancer

#Acquisition of breast cancer data
cancer = load_breast_cancer()

# Filter the data into malignant and benign
# malignant corresponds to cancer.target == 0
malignant = cancer.data[cancer.target==0]

# benign corresponds to cancer.target == 1
benign = cancer.data[cancer.target==1]

#Histogram with blue for malignant and orange for benign
#Each figure is a histogram showing the relationship between each explanatory variable (mean radius, etc.) and the objective variable.
fig, axes = plt.subplots(6,5,figsize=(20,20))
ax = axes.ravel()
for i in range(30):
    _,bins = np.histogram(cancer.data[:,i], bins=50)
    ax[i].hist(malignant[:,i], bins, alpha=.5)
    ax[i].hist(benign[:,i], bins, alpha=.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
    
#Label settings
ax[0].set_ylabel('Count')
ax[0].legend(['malignant','benign'],loc='best')
fig.tight_layout()

The following is the output result.

(A 6x5 grid of histograms, one per explanatory variable, with the malignant and benign distributions overlaid)

In most of the histograms, the malignant and benign data overlap, and it is difficult to decide where to draw the boundary to distinguish malignant from benign.

Now let's use principal component analysis to reduce the dimensionality of these 30 variables. Specifically, the data used as explanatory variables are standardized and then principal component analysis is performed. The number of principal components to extract (n_components) is 2.

#Standardization
sc = StandardScaler()
X_std = sc.fit_transform(cancer.data)

#Principal component analysis
pca = PCA(n_components=2)
pca.fit(X_std)
X_pca = pca.transform(X_std)

#display
print('X_pca shape:{}'.format(X_pca.shape))
print('Explained variance ratio:{}'.format(pca.explained_variance_ratio_))

Output result


X_pca shape:(569, 2)
Explained variance ratio:[0.443 0.19 ]

Checking the value of the explained_variance_ratio_ attribute above, we can see that although the number of variables has been reduced to two, about 63% (= 0.443 + 0.19) of the information in the original data is condensed into the first and second principal components. The output "X_pca shape: (569, 2)" shows that the data after principal component analysis has 569 rows and 2 columns (2 variables); there are 2 variables because the number of principal components was set to 2.
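When it is not obvious how many principal components to keep, a common approach is to look at the cumulative explained variance ratio. The following is a minimal sketch, assuming the standardized breast cancer data X_std from above (the variable pca_all and the 90% threshold are my own illustrative choices):

# Fit PCA keeping all components and look at the cumulative explained variance ratio
pca_all = PCA()
pca_all.fit(X_std)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
print(cumulative[:5])  # cumulative ratio for the first few components

# Smallest number of components that retains at least 90% of the variance
print(np.argmax(cumulative >= 0.9) + 1)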

Next, let's visualize the dimension-reduced data. First, as preparation for visualization, the objective variable corresponding to each sample is attached to the first and second principal component data, which is then separated into malignant and benign data.

# pandas is used to handle the principal component scores as a DataFrame
import pandas as pd

# Label the columns: the first is the first principal component, the second is the second principal component
X_pca = pd.DataFrame(X_pca, columns=['pc1','pc2'])

# Join the objective variable (cancer.target) horizontally to the data above
X_pca = pd.concat([X_pca, pd.DataFrame(cancer.target, columns=['target'])], axis=1)

#Separate malignant and benign
pca_malignant = X_pca[X_pca['target']==0]
pca_benign = X_pca[X_pca['target']==1]

#Plot malignancy(red)
ax = pca_malignant.plot.scatter(x='pc1', y='pc2', color='red', label='malignant');

#Plot benign(Blue)
pca_benign.plot.scatter(x='pc1', y='pc2', color='blue', label='benign', ax=ax);

The following is the output result.

(Scatter plot of the first and second principal components, with malignant in red and benign in blue)

From the graph above, it can be seen that in this case the classes of the objective variable can be almost completely separated using only two principal components. When there are many variables and it is unclear which of them should be used for analysis, performing principal component analysis in this way and then (1) clarifying the relationship between each principal component and the objective variable, and (2) interpreting the relationship between the original variables and the objective variable from the relationship between each principal component and the original variables, will deepen your understanding of the data.
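For step (2), the components_ attribute can be examined together with the original variable names to see how strongly each original variable contributes to each principal component. A minimal sketch, assuming the pca object fitted on the breast cancer data above (the loadings variable name and DataFrame layout are just for readability):

# Contribution (loading) of each original variable to each principal component
loadings = pd.DataFrame(pca.components_,
                        columns=cancer.feature_names,
                        index=['pc1', 'pc2'])

# Original variables with the largest absolute contribution to the first principal component
print(loadings.loc['pc1'].abs().sort_values(ascending=False).head())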

It is also worth remembering that principal component analysis can be used when you want to reduce the number of variables (dimensionality reduction) before building a prediction model.
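As a sketch of that use, PCA can be placed in front of a classifier in a scikit-learn pipeline. The following example combines standardization, PCA, and logistic regression on the breast cancer data (the choice of two components and of logistic regression is only for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data, then standardize -> PCA -> classify in a single pipeline
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)
print('Test accuracy: {:.3f}'.format(model.score(X_test, y_test)))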

References

- I will explain what principal component analysis is in an easy-to-understand manner with all my strength
- Principal component analysis concept
- Detecting orthonormal basis from "Kujo Karen's pose" by principal component analysis
