Python: Unsupervised Learning: Basics

What is unsupervised learning?

In supervised learning (regression, classification), the answers are known: a model is trained on data consisting of input values and their corresponding output values.

In contrast, unsupervised learning works with unlabeled datasets: the model finds structure in the data on its own.

This article covers two unsupervised learning techniques: "clustering" and "principal component analysis".

Types of unsupervised learning

Clustering

A representative unsupervised learning technique is "clustering": the operation of dividing data into groups called "clusters".

The figures below show how data is grouped by the "k-means method", a typical clustering algorithm.

[Figure: data before clustering (black dots) with centroids (purple dots)]

[Figure: data after clustering by the k-means method]

The black dots show the data before clustering. The purple dots are parameters called "centroids" (centers of gravity). The k-means method learns the optimal positions of these centroids from the data, then assigns each point to a cluster based on the learned centroids.

Clustering methods fall into two types: those that estimate the number of clusters automatically, and those where a human decides it in advance.

The k-means method belongs to the latter type: a human decides the number of clusters beforehand.
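As a minimal sketch of the k-means idea described above (assuming random toy data and k = 2; not a production implementation):

```python
import numpy as np

def kmeans(points, k, n_iter=10, seed=0):
    """Minimal k-means: alternate between assigning points to the
    nearest centroid and moving each centroid to its cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random data points
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated 2D blobs of toy data
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(points, k=2)
print(centroids)
```

With well-separated data like this, the learned centroids end up near the centers of the two blobs.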

Number of clusters

The purpose of unsupervised learning is to capture and analyze the characteristics of the data mechanically. For this reason, some argue that it is better not to have a human decide the number of clusters.

"Hierarchical" clustering is a technique that can estimate the number of clusters automatically. However, hierarchical methods are relatively computationally expensive, so if you have a lot of data, a non-hierarchical approach may be more appropriate.
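A short sketch of hierarchical clustering using SciPy (assuming SciPy is installed; the distance threshold 2.0 is an arbitrary choice for this toy data, not a universal setting):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two toy groups of 2D points
points = np.vstack([rng.normal(0, 0.3, (10, 2)),
                    rng.normal(4, 0.3, (10, 2))])

# Build the cluster hierarchy (dendrogram) with Ward linkage
Z = linkage(points, method="ward")
# Cut the hierarchy at a distance threshold instead of fixing k in advance;
# the number of clusters falls out of the data
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```

Here the number of clusters is not specified by a human; it emerges from where the hierarchy is cut.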

Principal component analysis

"Principal component analysis" is a technique often used to "reduce" data into graphs.

Dimensionality reduction means lowering the number of dimensions used to represent the data. For example, you can create a 2D graph by removing one coordinate axis from 3D data.

Consider a concrete example. Suppose you have a lot of data about your students, such as test scores, the number of questions asked in class, the number of late arrivals, and sleep time. How can you graph the students' characteristics from these data?

You could create one graph per variable, but it is difficult to analyze the tendencies of hundreds or thousands of students across many separate graphs. Principal component analysis combines the different types of data into a single 2D or 3D graph while preserving as much of the information in each variable as possible.

[Figure: multi-variable student data combined into a single graph by principal component analysis]

Principal component analysis performs this kind of conversion, as in the example above. First, the machine learns the axes (principal components) that best capture the characteristics of the data. Re-plotting the data on these learned axes yields a single graph, as in the figure above, that shows all the data while retaining as much information as possible. How these axes are determined is the essence of principal component analysis.
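A minimal sketch of this idea using NumPy (the 4 columns stand in for the student variables above and the data is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 students x 4 variables (e.g. score, questions, lateness, sleep)
X = rng.normal(size=(100, 4))

# 1. Center each variable by subtracting its mean
Xc = X - X.mean(axis=0)
# 2. The principal components (new axes) are the right singular
#    vectors of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# 3. Project onto the first two components -> a 2D representation
X2d = Xc @ Vt[:2].T
print(X2d.shape)
```

The first component captures the most variance, the second the most of what remains, and so on; keeping only the first two gives the 2D graph while discarding as little information as possible.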

Prior knowledge

Euclidean distance

Given two points with coordinates x = (x1, x2) and y = (y1, y2), the distance between them can be obtained from the Pythagorean theorem:

d(x, y) = √((x1 − y1)² + (x2 − y2)²)

More generally, the extension of this formula to two points in n-dimensional space is called the Euclidean distance:

d(x, y) = √((x1 − y1)² + (x2 − y2)² + ⋯ + (xn − yn)²)

"Distance" in a space of n = 4 or more can no longer be imagined by human intuitive spatial recognition, but in mathematical formulas, the expression simply extended as above is defined as distance. .. The Euclidean distance is also sometimes called the norm.

You can also use numpy to find the Euclidean distance.

import numpy as np

vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 3, 4])
# Euclidean distance = norm of the difference vector: √3 ≈ 1.732
print(np.linalg.norm(vec_a - vec_b))

Cosine similarity

Given two-dimensional vectors a = (a1, a2) and b = (b1, b2), we would like to evaluate how similar these two vectors (in practice, two pieces of 2D data) are.

A vector has two defining properties: "length" and "direction". Here we focus on "direction". How similar the directions of two vectors are can be thought of simply as corresponding to the angle between them.

If the angle between the two vectors is θ, then the smaller θ is, the more similar the two data are. The formula for the inner (dot) product of vectors is

a · b = |a||b| cos θ

Rearranging this gives

cos θ = (a · b) / (|a||b|) = (a1·b1 + a2·b2) / (√(a1² + a2²) · √(b1² + b2²))

The smaller θ is, the larger cos θ becomes.

Thus cos θ represents the similarity between the two data, and the cosine of the angle used as a similarity index in this way is called the "cosine similarity".

Like the Euclidean distance, cosine similarity extends to n-dimensional data. Given two n-dimensional vectors a = (a1, a2, ⋯, an) and b = (b1, b2, ⋯, bn), the cosine similarity is

cos θ = (a1·b1 + a2·b2 + ⋯ + an·bn) / (√(a1² + ⋯ + an²) · √(b1² + ⋯ + bn²))

In addition, the cosine similarity can be calculated with the following code.

import numpy as np

vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 3, 4])
# Cosine similarity = dot product divided by the product of the norms
print(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
