Aidemy　2020/10/28

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learned there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary article. Thank you! This post covers unsupervised learning. I hope you find it useful.

This article summarizes what I learned at "Aidemy" in my own words. It may contain mistakes and misunderstandings, so please keep that in mind.

What to learn this time
・ About unsupervised learning
・ Types of unsupervised learning
・ Mathematical prior knowledge

Unsupervised learning

What is unsupervised learning?

・ In supervised learning, the model learns from data that comes with an "answer" called a class label. In unsupervised learning, no such answer is given, and the computer finds structure in the data on its own.
・ This time, we will learn about two kinds of unsupervised learning: __"clustering"__ and __"principal component analysis"__.

Clustering

・ Clustering is a __method that divides data into groups (clusters)__.
・ In the __"k-means method"__, one of the clustering methods, __a person decides the number of clusters__ in advance, and the computer divides the data into that many clusters.
・ The k-means method learns by repeatedly adjusting the positions of points called "centroids" (centers of gravity), and each data point is assigned to the cluster of its nearest centroid.
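
The idea of adjusting centroids can be sketched in a few lines of NumPy. This is only a minimal illustration under some assumptions: the function name `kmeans`, its parameters, and the two made-up blobs of points are hypothetical, and a real project would more likely use a library implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch; the number of clusters `k` is chosen by the user."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it;
        # keep the old position if a cluster ended up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of made-up 2-D points (illustrative data only).
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    data_rng.normal(loc=[10, 10], scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

With `k=2`, the two centroids should settle near the centers of the two blobs, and `labels` tells which cluster each point belongs to.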

Principal component analysis

・ Principal component analysis is a __method that reduces the number of dimensions of data (dimensionality reduction)__ so that the information can be summarized in a single graph.
・ It works by finding the axes (principal components) that best capture the characteristics of the data, that is, the directions along which the data varies the most.
・ For example, from three variables such as "age, height, and weight", new axes can be determined so that each person's data is represented as a point on a two-dimensional graph.
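
As a rough sketch of the "age, height, weight" example, the code below reduces three columns to two with NumPy. The helper name `pca` and the five rows of data are made up for illustration, and the eigen-decomposition of the covariance matrix is just one common way to compute the principal components, not necessarily the one used in the course.

```python
import numpy as np

def pca(X, n_components):
    """PCA sketch via eigen-decomposition of the covariance matrix."""
    # Center each feature so the new axes pass through the mean of the data.
    X_centered = X - X.mean(axis=0)
    # Covariance between features (columns).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the axes with the largest variance.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]  # one principal axis per column
    return X_centered @ components, eigvals[order]

# Made-up "age, height (cm), weight (kg)" rows for five people.
X = np.array([
    [23, 170.0, 60.0],
    [35, 165.0, 72.0],
    [41, 180.0, 85.0],
    [19, 158.0, 50.0],
    [52, 175.0, 78.0],
])
X_2d, variances = pca(X, n_components=2)
print(X_2d.shape)  # (5, 2)
```

Each row of `X_2d` is one person expressed on the two new axes, so it can be drawn on a two-dimensional graph. In practice the columns are often standardized first because they use different units; that step is left out here to keep the sketch short.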

Prior knowledge of unsupervised learning

Euclidean distance

・ The distance between two points $(x_1, x_2)$ and $(y_1, y_2)$ in two-dimensional space is given by $\sqrt{(x_1-y_1)^2+(x_2-y_2)^2}$.
・ Similarly, the distance between two points $(x_1, x_2, \dots, x_n)$ and $(y_1, y_2, \dots, y_n)$ in n-dimensional space is given by $\sqrt{(x_1-y_1)^2+(x_2-y_2)^2+\dots+(x_n-y_n)^2}$. This distance is called the __Euclidean distance__; it is the norm of the difference between the two vectors.

・ With NumPy, the Euclidean distance can be calculated by passing the difference of the two vectors to __np.linalg.norm()__, which by default returns "the square root of the sum of squares" of its argument.

(Screenshot: NumPy code for the Euclidean distance)
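
A minimal sketch of this calculation is shown below. The two three-dimensional points `x` and `y` are made-up values for illustration (not necessarily those in the screenshot), and the distance is computed both with `np.linalg.norm()` and directly from the formula.

```python
import numpy as np

# Two made-up points in 3-dimensional space (illustrative values only).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# np.linalg.norm returns the square root of the sum of squares of its argument,
# so applying it to the difference gives the Euclidean distance.
distance = np.linalg.norm(x - y)

# The same value written out from the formula.
distance_by_formula = np.sqrt(np.sum((x - y) ** 2))

print(distance)             # 5.0
print(distance_by_formula)  # 5.0
```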

Cosine similarity

・ When evaluating how similar two vectors are, we look at the similarity of their __"length" and "direction"__.
・ Focusing on direction, the smaller the angle __"θ"__ between the two vectors, the higher the similarity.
・ Rearranging the inner product formula $\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta$ for $\cos\theta$ gives $\cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$. This value is called the __"cosine similarity"__.
・ Note that __the larger the value of cos θ, the smaller θ is__ (for θ between 0° and 180°), so a larger cosine similarity means the directions are more alike.
・ Cosine similarity can also be applied to n-dimensional data.

・ In code, it can be calculated with NumPy. For one-dimensional arrays, __np.dot()__ returns "the sum of the products of corresponding elements" (for the vectors (1, 2, 3) and (2, 3, 4), that is 1 * 2 + 2 * 3 + 3 * 4).

(Screenshot: NumPy code for the cosine similarity)
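
A minimal sketch of the calculation, using the vectors (1, 2, 3) and (2, 3, 4) implied by the products in the text. The variable names `a`, `b`, `dot`, and `cos_sim` are chosen here for illustration and may differ from the screenshot.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# Inner product: 1*2 + 2*3 + 3*4 = 20
dot = np.dot(a, b)

# Divide by the product of the two lengths to get cos(theta).
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)      # 20
print(cos_sim)  # approximately 0.9926
```

The result is close to 1, which matches the intuition that these two vectors point in almost the same direction.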

Summary

・ Unsupervised learning is a method in which the computer learns on its own without being given correct-answer labels.
・ Unsupervised learning includes "clustering" and "principal component analysis". The former divides data into clusters, and the latter reduces the number of dimensions so that the information can be summarized in a single graph.
・ In unsupervised learning, the similarity of data may be judged by the __"Euclidean distance"__ or the __"cosine similarity"__.

That is all for this time. Thank you for reading to the end.

[PYTHON] Unsupervised learning 1 Basics