[PYTHON] Understand t-SNE and improve visualization

Introduction

In this post I summarize the dimensionality reduction algorithm **t-SNE** (t-Distributed Stochastic Neighbor Embedding). t-SNE is a **dimensionality reduction algorithm** that maps high-dimensional data to 2D or 3D for visualization, and it was co-developed by Professor Hinton, who is also known as a father of deep learning. (Professor Hinton is amazing.) In this article I work through how t-SNE works and use it to get better visualizations.

References

I referred to the following while studying t-SNE.

- Visualizing Data using t-SNE (the original paper)

Overview of t-SNE

Origins of t-SNE

**t-SNE** is a dimensionality reduction algorithm for projecting high-dimensional data down to 2D or 3D. PCA and MDS are the classic choices for dimensionality reduction, but these linear methods have some problems:

  1. **The algorithms focus on keeping dissimilar data points far apart in the low-dimensional space**, so they are weak at keeping similar data points close together there.
  2. In particular, for data that is non-linear in the high-dimensional space, it is almost impossible to **keep similar data points close together in the low-dimensional space**.

To solve these problems, various **non-linear dimensionality reduction techniques** were created that aim to preserve the local structure of the data (keeping similar data points close together even in the low-dimensional space). t-SNE is an algorithm in that lineage.

Below is an image of non-linear data.

figure_10_original.png

Features of t-SNE

The key points of t-SNE are listed below. The details of the processing, and the reasons behind the advantages and disadvantages, are covered in the algorithm explanation later in this article.

**Key points of the processing**

- Convert the data so that the distance distribution in the high-dimensional space matches the distance distribution in the low-dimensional space as closely as possible.
- Assume that the distance distribution follows a Student's t distribution (SNE assumed a Gaussian distribution; t-SNE improves on this).

Merits

- Captures high-dimensional local structure very well
- Captures the global structure as much as possible

Demerits

- If you change the Perplexity (an internal parameter), completely different clusters can appear (see the sketch just below).
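One quick way to see this is to run scikit-learn's TSNE on the same data with several Perplexity values and compare the plots. This is a minimal sketch of my own (not from the original article), assuming the scikit-learn digits dataset as toy data:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the same data with three different perplexity values and compare the plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(f"perplexity = {perp}")
plt.show()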

t-SNE algorithm

To understand the t-SNE algorithm, we start with its predecessor, SNE. After that, we sort out the problems of SNE and look at how t-SNE was created to solve them.

How SNE works

1. Convert distances between data points to conditional probabilities

SNE starts by converting the distances between data points into conditional probabilities. The similarity of data points $x_{i}$ and $x_{j}$ is expressed as the **conditional probability $p_{j|i}$ that $x_{j}$ would be picked as a neighbor given $x_{i}$**. Furthermore, we assume that $x_{j}$ is picked with a probability proportional to a **normal distribution centered on $x_{i}$**.

Then $p_{j|i}$ can be expressed by the following formula.


p_{j|i} = \frac{\exp(-||x_{i} - x_{j}||^2/2\sigma_{i}^2)}{\sum_{k\neq i}\exp(-||x_{i} - x_{k}||^2/2\sigma_{i}^2)}

The above comes from the probability density function of the normal distribution; the constant factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ cancels between the numerator and the denominator.


\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-(x-\mu)^2/2\sigma^2)

We assume a normal distribution with mean $x_{i}$ and variance $\sigma_{i}^2$; the value of the variance $\sigma_{i}^2$ is adjusted by a parameter described later. Since the assumed distribution changes with the variance $\sigma_{i}^2$, the output after the final dimensionality reduction also changes significantly.

Also, since the purpose here is to express distances between different data points as conditional probabilities, we set the following.


p_{i|i} = 0
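As a reference, here is a small NumPy sketch of this formula (my own illustration, not code from the paper). It assumes the bandwidths $\sigma_{i}$ are already given; how they are chosen is explained in the Perplexity section below.

import numpy as np

def conditional_p(X, sigmas):
    """Row-wise conditional probabilities p_{j|i} for data X (n x d),
    given one standard deviation sigma_i per data point."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)  # normalize each row over k != i
    return P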

2. The distance between data points after dimension reduction is also expressed by conditional probability.

The similarity between the dimensionally reduced data points $y_{i}$ and $y_{j}$ is also expressed as a **conditional probability $q_{j|i}$**, as before. We again assume that $y_{j}$ is picked according to a normal distribution centered on $y_{i}$, but this time **the variance is fixed at $\frac{1}{\sqrt{2}}$**. Fixing it lets the variance cancel out of the earlier formula and simplifies it.

$q_{j|i}$ can be expressed by the following formula.


q_{j|i} = \frac{\exp(-||y_{i} - y_{j}||^2)}{\sum_{k\neq i}\exp(-||y_{i} - y_{k}||^2)}

As before, we set the following.


q_{i|i} = 0

3. Loss function design with KL divergence

Since the distance relationships between data points in the high-dimensional space and those in the low-dimensional space after dimensionality reduction should match as closely as possible, we aim for $p_{j|i} = q_{j|i}$. As the measure of how far apart $p_{j|i}$ and $q_{j|i}$ are, we use the KL divergence (which is not a distance in the strict sense). We take the KL divergence as the loss function and aim to minimize it.

The loss function $C$ is expressed by the following formula.


C = \sum_{i}KL(P_{i}||Q_{i}) = \sum_{i} \sum_{j} p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}

$P_{i}$ represents the set of all conditional probabilities given $x_{i}$, and $Q_{i}$ represents the set of all conditional probabilities given $y_{i}$.

The point here is that **the KL divergence is an asymmetric measure**, so $KL(P_{i}||Q_{i}) \neq KL(Q_{i}||P_{i})$. If a large $p_{j|i}$ is modeled with a small $q_{j|i}$, the loss becomes large, whereas if a small $p_{j|i}$ is modeled with a large $q_{j|i}$, the loss does not grow very much. In other words, **if data points that are close in the high-dimensional space are placed far apart after dimensionality reduction, the loss function becomes very large.** This is why SNE is said to focus on preserving the local structure of the data.
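As a sketch (my own, with hypothetical helper names), the loss can be computed directly from matrices whose rows hold $p_{j|i}$ and $q_{j|i}$:

import numpy as np

def sne_loss(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) for matrices whose row i holds p_{j|i} and q_{j|i}.
    Terms with p_{j|i} = 0 (including the diagonal) contribute nothing to the sum."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))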

4. Perplexity

The remaining parameter is $\sigma_{i}^2$, the variance of the normal distribution centered on the high-dimensional data point $x_{i}$. It is a very important parameter because the assumed distribution depends on the variance $\sigma_{i}^2$.

If the data is dense, a small $\sigma_{i}^2$ is appropriate, and if the data is sparse, a large $\sigma_{i}^2$ is appropriate. A parameter called Perplexity controls which assumption is made about the data.

SNE performs a binary search for the $\sigma_{i}^2$ that produces a $P_{i}$ whose Perplexity matches the fixed value specified by the user. Perplexity is defined as follows.


Perp(P_{i}) = 2^{H(P_{i})}

Here, $H(P_{i})$ is the entropy of $P_{i}$, defined as follows.


H(P_{i}) = -\sum_{j}p_{j|i}\log_{2}p_{j|i}

For example, we fix Perplexity at a value such as $40$ and search for the $\sigma_{i}^2$ that satisfies the equation above. The larger the Perplexity, the larger $\sigma_{i}^2$ becomes, which corresponds to assuming the data is sparse. Conversely, if the Perplexity is small, $\sigma_{i}^2$ is also small, which corresponds to assuming the data is dense.

Perplexity can also be interpreted as **a number that determines how many neighboring points are taken into account around the data point $x_{i}$**.
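Below is a rough sketch of that binary search for a single point (my own illustration; real implementations usually search over the precision $\beta_i = 1/(2\sigma_i^2)$ and vectorize the computation):

import numpy as np

def find_sigma(sq_dists_i, target_perplexity, n_iter=100, tol=1e-5):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches the target.
    sq_dists_i: squared distances from x_i to all other points (x_i itself excluded)."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(n_iter):
        # shift by the minimum distance for numerical stability;
        # the shift cancels out after normalization
        p = np.exp(-(sq_dists_i - sq_dists_i.min()) / (2.0 * sigma ** 2))
        p /= p.sum()
        perp = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))  # Perp(P_i) = 2^{H(P_i)}
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma   # distribution too flat -> make sigma smaller
        else:
            lo = sigma   # distribution too peaked -> make sigma larger
        sigma = (lo + hi) / 2.0
    return sigma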

5. Minimize the loss function with gradient descent

We minimize the loss function with gradient descent. The gradient is obtained by differentiating the loss function with respect to $y_{i}$:


\frac{\delta C}{\delta y_{i}} = 2\sum_{j}(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j})(y_{i}-y_{j})

We gradually move $y_{i}$ using this gradient; the update formula is as follows.


Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t)(Y^{(t-1)} - Y^{(t-2)})

Intuitively, $Y^{(t-1)}$ is moved in the direction of the gradient $\frac{\delta C}{\delta Y}$, scaled by the learning rate $\eta$ ($t$ denotes the iteration number). Finally, the term $\alpha(t)(Y^{(t-1)} - Y^{(t-2)})$ is called the momentum term; its role is to nudge the update slightly away from the pure gradient step so that the optimization does not get stuck in poor local minima and settles into as good a minimum as possible.
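As a sketch, one iteration of this update could look like the following (my own illustration, assuming Y, Y_prev, and grad are NumPy arrays; here the step is taken along the negative gradient so that the loss actually decreases, while the formula above follows the paper's sign convention):

def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.5):
    """One gradient step with momentum for the embedding Y (n x 2).
    grad is dC/dY evaluated at Y; alpha is the momentum coefficient alpha(t)."""
    Y_new = Y - eta * grad + alpha * (Y - Y_prev)
    return Y_new, Y  # the new embedding and the previous one, for the next iteration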

This is the flow of SNE.

Problems with SNE

The above is the flow of SNE, but SNE has the following two problems.

- The loss function is difficult to minimize
- The crowding problem

Difficulty of minimizing the loss function: because the loss function uses the KL divergence, $p_{j|i}$ and $q_{j|i}$ cannot be treated symmetrically, which makes the formula for the gradient of the loss function complicated (the $(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j})$ part).

The crowding problem: when a high-dimensional object is projected into a lower dimension, the number of data points that can be mutually equidistant decreases, so **the points become crowded together when the dimensionality is reduced while trying to preserve the equal distances**.

For example, in 3D, up to 4 data points (the number of dimensions + 1) can be mutually equidistant, but in 2D only 3 data points (again, the number of dimensions + 1) can be. So when the dimensionality is reduced, gaps that should exist between points can get crushed. That is the crowding problem.

t-SNE was created to solve these problems.

How t-SNE works

In order to solve the problems of SNE, the following features have been added to t-SNE.

- Symmetrizing the loss function
- Assuming a Student's t distribution for the distances between low-dimensional data points

Loss function symmetrization

To symmetrize the loss function, the closeness of the points $x_{i}$ and $x_{j}$ is represented by the joint probability $p_{ij}$. (Note that after symmetrization it is a joint probability rather than a conditional probability.)


p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}


p_{j|i} = \frac{\exp(-||x_{i} - x_{j}||^2/2\sigma_{i}^2)}{\sum_{k\neq i}\exp(-||x_{i} - x_{k}||^2/2\sigma_{i}^2)}

$p_{j|i}$ itself is unchanged, but taking the average as $p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$ makes it symmetric.
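In code this symmetrization is essentially a one-liner (a sketch; P_cond is assumed to be the matrix whose row $i$ holds $p_{j|i}$):

import numpy as np

def symmetrize(P_cond):
    """Joint probabilities p_{ij} = (p_{i|j} + p_{j|i}) / (2n) from the conditional matrix."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)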

Assuming Student's t distribution

The closeness of the low-dimensional points $y_{i}$ and $y_{j}$ is expressed by the joint probability $q_{ij}$ as follows.


q_{ij} = \frac{(1+||y_{i} - y_{j}||^2)^{-1}}{\sum_{k\neq l}(1+||y_{k} - y_{l}||^2)^{-1}}

Here, when modeling the distances between points in the low-dimensional space, we assume a Student's t distribution with $1$ degree of freedom. **The t distribution has heavier tails than the normal distribution.** By using this heavy-tailed t distribution, data points that are moderately far apart in the high-dimensional space can be mapped to even larger distances in the low-dimensional space, so **the crowding problem can be avoided for those moderately distant data points.**

download.png

Minimize loss function

t-SNE uses these $p_{ij}$ and $q_{ij}$ to minimize the loss function. The loss function is expressed as follows.


C = KL(P||Q) = \sum_{i} \sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}}

This is also optimized using gradient descent, as in SNE. The gradient is:


\frac{\delta C}{\delta y_{i}} = 4\sum_{j}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i} - y_{j}||^2)^{-1}

The update formula is the same as in SNE.
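To make the low-dimensional side concrete, here is a small NumPy sketch (my own, not the paper's code) that computes $q_{ij}$ under the 1-degree-of-freedom Student's t distribution and the gradient above, assuming P is the symmetric joint-probability matrix from the previous step:

import numpy as np

def tsne_q_and_grad(Y, P):
    """q_{ij} with a Student's t kernel (1 degree of freedom) and the gradient
    dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv, 0.0)     # q_{ii} = 0
    Q = inv / inv.sum()            # normalize over all pairs k != l
    W = (P - Q) * inv
    grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y   # rewrites sum_j W_ij (y_i - y_j)
    return Q, grad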

Visualization example of t-SNE

Here I visualize the data distribution using text data. Dimensionality reduction algorithms are very effective for text, because text data tends to become high-dimensional when vectorized.

Library used

scikit-learn 0.21.3

Dataset

This time I use the "livedoor news corpus" dataset to visualize the data distribution. For details of the dataset and the morphological analysis procedure, please refer to my previously posted article.

Japanese text requires preprocessing that splits sentences into morphemes, so after decomposing all sentences into morphemes, I load them into the following data frame.

スクリーンショット 2020-01-13 21.07.38.png

Visualization of data distribution

After vectorizing the text data with TF-IDF, it is reduced to two dimensions using t-SNE.

import pickle
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# It is assumed that the data frame after morphological analysis has already been pickled
with open('df_wakati.pickle', 'rb') as f:
    df = pickle.load(f)

# Vectorization with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df[3])

# Dimensionality reduction with t-SNE
tsne = TSNE(n_components=2, random_state=0, perplexity=30, n_iter=1000)
X_embedded = tsne.fit_transform(X)

# Attach the 2D coordinates to the original data frame
ddf = pd.concat([df, pd.DataFrame(X_embedded, columns=['col1', 'col2'])], axis=1)

article_list = ddf[1].unique()

# Scatter plot colored by article medium
colors = ["r", "g", "b", "c", "m", "y", "k", "orange", "pink"]
plt.figure(figsize=(30, 30))
for i, v in enumerate(article_list):
    tmp_df = ddf[ddf[1] == v]
    plt.scatter(tmp_df['col1'],
                tmp_df['col2'],
                label=v,
                color=colors[i])

plt.legend(fontsize=30)

The result is shown below. We can see that clusters form for each article medium.

download2.png

Next, I would like to summarize other dimensionality reduction algorithms. Thank you for reading to the end.
