[PYTHON] 100 Language Processing Knock-99 (using pandas): Visualization by t-SNE

This is a record of the 99th task, "Visualization by t-SNE", of Language Processing 100 Knock 2015. The word vectors are reduced to two dimensions with t-SNE (t-distributed Stochastic Neighbor Embedding) and visualized as in the figure below; two or three dimensions is something humans can actually look at.
image.png

Reference links

Link | Remarks
:-- | :--
099.Visualization by t-SNE.ipynb | GitHub link to the answer program
100 amateur language processing knocks: 99 | The site I always rely on for the 100 language processing knocks
The basics of matplotlib I wish I had known earlier, or the story of the artist who can adjust the appearance | I learned a little about the basics of Matplotlib
color example code: colormaps_reference.py | Matplotlib color maps

Environment

type | version | Contents
:-- | :-- | :--
OS | Ubuntu 18.04.01 LTS | Running virtually
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments
Python | 3.6.9 | Using python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. Just install them with a regular pip.

type | version
:-- | :--
matplotlib | 3.1.1
pandas | 0.25.3
scikit-learn | 0.21.3
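
If you want to confirm the versions in your own environment, a minimal check like the following works (the printed values will of course depend on your setup):

```python
# Print the versions of the three packages used in this knock
import matplotlib
import pandas
import sklearn

print(matplotlib.__version__)  # e.g. 3.1.1
print(pandas.__version__)      # e.g. 0.25.3
print(sklearn.__version__)     # e.g. 0.21.3
```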

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

99. Visualization by t-SNE

> Visualize the vector space with t-SNE for the word vectors of problem 96.

Problem supplement (t-SNE)

t-SNE (t-distributed Stochastic Neighbor Embedding) reduces dimensionality down to 2 or 3 dimensions. As a dimensionality-reduction technique it serves the same purpose as PCA (Principal Component Analysis), but unlike PCA it can also handle data with non-linear structure. Having written all that, I do not really understand the mathematics; this is lifted from the article "Cool dimension compression & visualization by t-SNE".
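
To make the difference concrete, here is a minimal toy sketch of my own (not part of the original knock) that reduces the same non-linearly structured data with both PCA and t-SNE; PCA can only project linearly, while t-SNE tries to preserve local neighborhoods:

```python
# Toy comparison: PCA vs. t-SNE on non-linearly structured data (an S-curve)
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, color = make_s_curve(n_samples=300, random_state=0)  # X has 3 dimensions

X_pca = PCA(n_components=2).fit_transform(X)                    # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # non-linear embedding

print(X_pca.shape, X_tsne.shape)  # (300, 2) (300, 2)
```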

Answer

Answer program: [099.Visualization by t-SNE.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/099.t-SNE%E3%81%AB%E3%82%88%E3%82%8B%E5%8F%AF%E8%A6%96%E5%8C%96.ipynb)

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# t-SNE
t_sne = TSNE().fit_transform(country_vec)
print('t_sne shape:', t_sne.shape)

# K-Means clustering
clustered = KMeans(n_clusters=5).fit_predict(country_vec)

fig, ax = plt.subplots(figsize=(22, 22))

# Set Color map
cmap = plt.get_cmap('Dark2')

for i in range(t_sne.shape[0]):
    # Map the cluster label (0-4) onto the colormap's 0-1 range
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()

Answer commentary

Nearly 80% of this is copied from the article "Amateur Language Processing 100 Knock: 99".

Here is the main code this time. scikit-learn's TSNE has several parameters, but I ran it with the defaults. Some blogs mention that scikit-learn's TSNE is not very good, but I will go with it for now.

t_sne = TSNE().fit_transform(country_vec)
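
For reference, here is the same call with the main tunable parameters written out. The values below are the scikit-learn 0.21 defaults as I understand them, so this should behave like the code above; the only addition on my side is random_state, which pins the otherwise random layout:

```python
# Same call with the defaults made explicit; random_state added for reproducibility
t_sne = TSNE(n_components=2,       # target dimensionality
             perplexity=30.0,      # balance between local and global structure
             learning_rate=200.0,
             n_iter=1000,
             random_state=0).fit_transform(country_vec)
```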

Also, K-Means is used for non-hierarchical clustering to decide the colors shown in the scatter plot.

clustered = KMeans(n_clusters=5).fit_predict(country_vec)
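
Since fit_predict simply returns one integer label per row, the makeup of each cluster can be checked directly. A minimal sketch of my own (not in the original answer) that lists which country names fell into each cluster:

```python
# List the country names assigned to each of the 5 clusters
import pandas as pd

clusters = pd.Series(clustered, index=country_vec.index)
for label in range(5):
    print(label, clusters[clusters == label].index.tolist()[:10])  # first 10 names
```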

Finally, matplotlib draws the scatter plot. The display colors are defined with plt.get_cmap; see color example code: colormaps_reference.py for the available color maps. The dots are drawn with `scatter`, and the labels (country names) with `annotate`.

fig, ax = plt.subplots(figsize=(22, 22))

# Set Color map
cmap = plt.get_cmap('Dark2')

for i in range(t_sne.shape[0]):
    # Map the cluster label (0-4) onto the colormap's 0-1 range
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()
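
As a design note, the division clustered[i] / 4 in the loop maps the labels 0-4 onto the 0-1 range that a colormap expects. The same figure can be drawn without the per-point loop for the dots by resampling the colormap to exactly 5 colors and passing the labels directly; this is a variation of my own, not the original code:

```python
# Vectorized variant: one scatter call with a 5-color discrete colormap
fig, ax = plt.subplots(figsize=(22, 22))
cmap5 = plt.get_cmap('Dark2', 5)  # colormap resampled to 5 discrete colors
ax.scatter(t_sne[:, 0], t_sne[:, 1], marker='.', c=clustered, cmap=cmap5)
for i, name in enumerate(country_vec.index):
    ax.annotate(name, xy=(t_sne[i, 0], t_sne[i, 1]), color=cmap5(clustered[i]))
plt.show()
```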

Zooming in on the area around Japan gives the view below. It is easier to interpret than the hierarchical clustering done in the previous knock.
image.png
