This time, I will write about an unsupervised machine learning algorithm: dimensionality reduction with t-SNE. There are several well-known algorithms for dimensionality reduction, but here I would like to focus on t-SNE, a method well suited to high-dimensional data.
Dimensionality reduction is a transformation that reduces the number of explanatory variables in the input data, so that the characteristics of the original data can be described with fewer variables while preserving as much of the original information as possible.
For example, suppose you want to predict Mr. A's age (objective variable y), and the only information you have about him is height (explanatory variable x1) and weight (explanatory variable x2). Since adults are taller and heavier than children, both variables seem to correlate with age. Reducing the dimensionality of these two variables means creating a single variable that retains most of the information they carry. From the relationship between height and weight, we can create one variable, call it "physique", and it seems plausible to predict age from that physique variable alone. Combining several variables into one in this way is dimensionality reduction.
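As a minimal sketch of this idea, the snippet below combines hypothetical height/weight data (the numbers are made up for illustration) into one "physique" score by projecting onto the direction of maximum variance, computed here with NumPy's SVD rather than any particular PCA library:

```python
import numpy as np

# Hypothetical height (cm) / weight (kg) pairs, one row per person
X = np.array([[150., 45.],
              [160., 55.],
              [170., 65.],
              [180., 80.],
              [175., 70.]])

Xc = X - X.mean(axis=0)  # center each variable at zero

# The first right-singular vector is the direction of maximum variance
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project each person onto that direction: one "physique" score each
physique = Xc @ Vt[0]
print(physique.shape)  # (5,)
```

Two correlated variables collapse into one score that still separates small and large physiques, which is exactly the information we would feed into an age prediction.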
It is common to reduce multivariate data, which often has dozens or hundreds of variables, to just two or three.
There are two main reasons to bother reducing dimensions. The first is to visualize multidimensional data. Visualization makes even hard-to-grasp data easier for humans to understand, and the easier it is to see trends in the data, the easier it is to explain them during EDA and when evaluating analysis results. The second is to avoid the "curse of dimensionality". Roughly speaking, when data has too many variables (too many dimensions), the performance of many machine learning algorithms deteriorates. I am not deeply familiar with the curse of dimensionality, but intuitively, if the total information in the data stays the same, an algorithm is likely to perform better with fewer variables that each carry more information than with many variables that each carry little.
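One concrete symptom of the curse of dimensionality is distance concentration: as the number of dimensions grows, the nearest and farthest neighbours of a point become almost equally far away, which hurts any algorithm that relies on distances. A small numeric check (the sample sizes and dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 1000):
    X = rng.random((500, d))                      # 500 uniform random points in d dimensions
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to the rest
    ratios[d] = dists.min() / dists.max()         # close to 1.0 means neighbours are indistinguishable
print(ratios)
```

In 2 dimensions the ratio is small (some points are much closer than others); in 1000 dimensions it moves towards 1, showing that "nearest" loses its meaning.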
Now, let's look at the t-SNE algorithm. Among the several dimensionality reduction methods such as PCA, t-SNE is comparatively easy to use. Its distinguishing feature is that it can compress and visualize data distributed on a manifold (a space that locally behaves like ordinary Euclidean space, so distances between nearby points are meaningful even when the overall structure is high-dimensional and curved). Roughly speaking, PCA is good at reducing data whose structure is globally linear, while t-SNE is good at reducing nonlinear data by working locally, on distances between pairs of points. In other words, t-SNE reduces the data to 2 or 3 dimensions by focusing on pairwise distances, preserving the local multidimensional structure as much as possible.
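In practice, running t-SNE is a one-liner with scikit-learn (assuming scikit-learn is installed; the data here is random and only illustrates the shapes involved):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical high-dimensional data: 100 points with 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Compress to 2 dimensions for plotting. perplexity roughly controls
# how many neighbours each point "pays attention" to (must be < n_samples).
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```

The 2-column result can be passed directly to a scatter plot; note that t-SNE embeddings are for visualization and are not meant to be reused as features on new data.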
Since t-SNE is built on an algorithm called SNE, I will explain SNE first.
SNE
SNE starts from the following formula:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
This is the conditional probability of point j given point i. Although it is written as a probability, what it expresses is the closeness between data points in the high-dimensional space.
The numerator has exactly the same form as the Gaussian distribution density. I understand it as a normal distribution extended to three or more dimensions; please refer to another article for the details.
The important thing is that each x is one data point in the uncompressed data set X.
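The formula above can be checked numerically. The sketch below fixes a single sigma for all points for simplicity (real SNE tunes a separate sigma_i per point via a perplexity parameter), and verifies that each row of conditional probabilities sums to 1:

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """p_{j|i}: Gaussian similarity of point j as seen from point i.

    sigma is fixed here for simplicity; real SNE chooses a sigma_i
    per point so that each row has a target perplexity.
    """
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbour (k != i)
    return P / P.sum(axis=1, keepdims=True)  # normalise each row so p_{j|i} sums to 1

rng = np.random.default_rng(0)
P = conditional_p(rng.normal(size=(6, 4)))
print(P.sum(axis=1))  # each row sums to 1
```

Nearby points get large p values and distant points get values near zero, which is precisely the "closeness as probability" interpretation described above.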