This is the record of the 99th exercise, "Visualization by t-SNE", from the 2015 edition of the Language Processing 100 Knocks. The word vectors are reduced to two dimensions with t-SNE (t-distributed Stochastic Neighbor Embedding) and visualized as shown in the figure below, since humans can only perceive 2D and 3D.
Link | Remarks |
---|---|
099. Visualization by t-SNE.ipynb | GitHub link to the answer program |
Amateur Language Processing 100 Knocks: 99 | The 100-knock article series I always rely on |
Basic knowledge of matplotlib that I wanted to know early, or the story of the artist who can adjust the appearance | Where I learned the basics of Matplotlib |
color example code: colormaps_reference.py | Matplotlib color map reference |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages, installed with plain pip.
type | version |
---|---|
matplotlib | 3.1.1 |
pandas | 0.25.3 |
scikit-learn | 0.21.3 |
In Chapter 10, we continue studying the word vectors from the previous chapter.
> Visualize the vector space with t-SNE for the word vectors obtained in exercise 96.
t-SNE (t-distributed Stochastic Neighbor Embedding) reduces dimensions to 2 or 3. As a dimensionality-reduction technique it plays the same role as PCA (Principal Component Analysis), but unlike PCA it can also handle data with a non-linear structure. I have written this confidently, but I do not really understand the mathematics; most of it is secondhand from the article "Cool dimension compression & visualization by t-SNE".
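To make the comparison with PCA concrete, here is a minimal sketch on synthetic random data standing in for the word vectors (the data and shapes are illustrative, not from the original program); both methods produce a 2-dimensional embedding of the same input:

```python
# Both PCA and t-SNE map high-dimensional vectors down to 2 dimensions;
# t-SNE additionally tries to preserve non-linear neighborhood structure.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(100, 50)  # 100 samples, 50-dimensional (stand-in for word vectors)

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

print(pca_2d.shape)   # (100, 2)
print(tsne_2d.shape)  # (100, 2)
```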
```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# Reduce the word vectors to 2 dimensions with t-SNE
t_sne = TSNE().fit_transform(country_vec)
print('t_sne shape:', t_sne.shape)

# KMeans clustering (labels are used only for coloring)
clustered = KMeans(n_clusters=5).fit_predict(country_vec)

fig, ax = plt.subplots(figsize=(22, 22))

# Set color map
cmap = plt.get_cmap('Dark2')
for i in range(t_sne.shape[0]):
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()
```
Nearly 80% of this is copied from the article "Amateur Language Processing 100 Knock: 99".
Here is the main code for this exercise. scikit-learn's TSNE has several parameters, but I ran it with the defaults. Some blogs mention that scikit-learn's TSNE is not very good, but I'll go with it for now.

```python
t_sne = TSNE().fit_transform(country_vec)
```
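For reference, a hypothetical sketch of what tuning those parameters could look like; `perplexity` and `learning_rate` are the ones most often adjusted, and the values and synthetic data below are illustrative, not from the article:

```python
# Tuning TSNE instead of using the defaults. perplexity roughly controls
# how many neighbors each point considers; it must be smaller than the
# number of samples.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(60, 30)  # synthetic stand-in for the country vectors

t_sne = TSNE(n_components=2, perplexity=10, learning_rate=200,
             random_state=0).fit_transform(X)
print(t_sne.shape)  # (60, 2)
```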
K-Means is also used, for non-hierarchical clustering, to decide the colors shown in the scatter plot.

```python
clustered = KMeans(n_clusters=5).fit_predict(country_vec)
```
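As a minimal sketch (with synthetic data in place of `country_vec`), `fit_predict` returns one integer cluster label per row, which the plotting loop then turns into a color:

```python
# fit_predict fits the KMeans model and returns the cluster label
# (0 .. n_clusters-1) of each sample in one call.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(50, 8)  # synthetic stand-in data

labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)
print(labels.shape)  # (50,)
print(sorted(set(labels)))
```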
Finally, matplotlib is used to display the scatter plot. The display colors are defined with `plt.get_cmap`; there is a reference in color example code: colormaps_reference.py. The dots are drawn with `scatter`, and the labels (country names) are drawn with `annotate`.
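The reason for the `clustered[i] / 4` in the plotting loop is that a colormap returned by `plt.get_cmap` expects a float in [0, 1]; dividing the label (0 to 4) by `n_clusters - 1` spreads the five labels across the map. A small self-contained sketch:

```python
# Calling a colormap with a float in [0, 1] returns an RGBA tuple.
import matplotlib
matplotlib.use('Agg')  # no display needed for this sketch
import matplotlib.pyplot as plt

cmap = plt.get_cmap('Dark2')
colors = [cmap(label / 4) for label in range(5)]  # labels 0..4 -> 0.0..1.0
print(len(colors))     # 5
print(len(colors[0]))  # 4  (an RGBA tuple)
```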
I learned how to use matplotlib from the article ["Basic knowledge of matplotlib that I wanted to know early, or the story of the artist who can adjust the appearance"](https://qiita.com/skotaro/items/08dc0b8c5704c94eafb9).

```python
fig, ax = plt.subplots(figsize=(22, 22))

# Set color map
cmap = plt.get_cmap('Dark2')
for i in range(t_sne.shape[0]):
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()
```
A magnified view of the area around Japan looks like this. It is easier to interpret than the hierarchical clustering done in the previous knock.