[PYTHON] 100 Language Processing Knock-99 (using pandas): Visualization by t-SNE

This is a record of the 99th task, "Visualization by t-SNE", of Language Processing 100 Knock 2015. The word vectors are reduced to two dimensions with t-SNE (t-distributed Stochastic Neighbor Embedding) and visualized as in the figure below; two or three dimensions is something humans can actually look at.
image.png

Reference links

Link | Remarks
:-- | :--
099.Visualization by t-SNE.ipynb | GitHub link to the answer program
100 amateur language processing knocks: 99 | The site I always rely on for the 100 language processing knocks
The basics of matplotlib I wish I had known earlier, or the story of the artist who can adjust the appearance | I learned a little about the basics of Matplotlib
color example code: colormaps_reference.py | Matplotlib color maps

Environment

type | version | Contents
:-- | :-- | :--
OS | Ubuntu 18.04.01 LTS | Running virtually
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments
Python | 3.6.9 | Using python 3.6.9 on pyenv; there is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. Just install them with a regular pip.

type | version
:-- | :--
matplotlib | 3.1.1
pandas | 0.25.3
scikit-learn | 0.21.3
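
If you want to confirm the versions in your own environment, a minimal check like the following works (the printed values will of course depend on your setup):

```python
# Print the versions of the three packages used in this knock
import matplotlib
import pandas
import sklearn

print(matplotlib.__version__)  # e.g. 3.1.1
print(pandas.__version__)      # e.g. 0.25.3
print(sklearn.__version__)     # e.g. 0.21.3
```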

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

99. Visualization by t-SNE

> Visualize the vector space with t-SNE for the word vectors of problem 96.

Problem supplement (t-SNE)

t-SNE (t-distributed Stochastic Neighbor Embedding) reduces dimensionality down to 2 or 3 dimensions. As a dimensionality-reduction technique it serves the same purpose as PCA (Principal Component Analysis), but unlike PCA it can also handle data with non-linear structure. Having written all that, I do not really understand the mathematics; this is lifted from the article "Cool dimension compression & visualization by t-SNE".
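
To make the difference concrete, here is a minimal toy sketch of my own (not part of the original knock) that reduces the same non-linearly structured data with both PCA and t-SNE; PCA can only project linearly, while t-SNE tries to preserve local neighborhoods:

```python
# Toy comparison: PCA vs. t-SNE on non-linearly structured data (an S-curve)
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, color = make_s_curve(n_samples=300, random_state=0)  # X has 3 dimensions

X_pca = PCA(n_components=2).fit_transform(X)                    # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # non-linear embedding

print(X_pca.shape, X_tsne.shape)  # (300, 2) (300, 2)
```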

Answer

Answer program: [099.Visualization by t-SNE.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/099.t-SNE%E3%81%AB%E3%82%88%E3%82%8B%E5%8F%AF%E8%A6%96%E5%8C%96.ipynb)

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# t-SNE
t_sne = TSNE().fit_transform(country_vec)
print('t_sne shape:', t_sne.shape)

# K-Means clustering
clustered = KMeans(n_clusters=5).fit_predict(country_vec)

fig, ax = plt.subplots(figsize=(22, 22))

# Set Color map
cmap = plt.get_cmap('Dark2')

for i in range(t_sne.shape[0]):
    # Map the cluster label (0-4) onto the colormap's 0-1 range
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()

Answer commentary

Nearly 80% of this is copied from the article "Amateur Language Processing 100 Knock: 99".

Here is the main code this time. scikit-learn's TSNE has several parameters, but I ran it with the defaults. Some blogs mention that scikit-learn's TSNE is not very good, but I will go with it for now.

t_sne = TSNE().fit_transform(country_vec)
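
For reference, here is the same call with the main tunable parameters written out. The values below are the scikit-learn 0.21 defaults as I understand them, so this should behave like the code above; the only addition on my side is random_state, which pins the otherwise random layout:

```python
# Same call with the defaults made explicit; random_state added for reproducibility
t_sne = TSNE(n_components=2,       # target dimensionality
             perplexity=30.0,      # balance between local and global structure
             learning_rate=200.0,
             n_iter=1000,
             random_state=0).fit_transform(country_vec)
```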

Also, K-Means is used for non-hierarchical clustering to decide the colors shown in the scatter plot.

clustered = KMeans(n_clusters=5).fit_predict(country_vec)
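
Since fit_predict simply returns one integer label per row, the makeup of each cluster can be checked directly. A minimal sketch of my own (not in the original answer) that lists which country names fell into each cluster:

```python
# List the country names assigned to each of the 5 clusters
import pandas as pd

clusters = pd.Series(clustered, index=country_vec.index)
for label in range(5):
    print(label, clusters[clusters == label].index.tolist()[:10])  # first 10 names
```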

Finally, matplotlib draws the scatter plot. The display colors are defined with plt.get_cmap; see color example code: colormaps_reference.py for the available color maps. The dots are drawn with `scatter`, and the labels (country names) with `annotate`.

fig, ax = plt.subplots(figsize=(22, 22))

# Set Color map
cmap = plt.get_cmap('Dark2')

for i in range(t_sne.shape[0]):
    # Map the cluster label (0-4) onto the colormap's 0-1 range
    cval = cmap(clustered[i] / 4)
    ax.scatter(t_sne[i][0], t_sne[i][1], marker='.', color=cval)
    ax.annotate(country_vec.index[i], xy=(t_sne[i][0], t_sne[i][1]), color=cval)
plt.show()
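
As a design note, the division clustered[i] / 4 in the loop maps the labels 0-4 onto the 0-1 range that a colormap expects. The same figure can be drawn without the per-point loop for the dots by resampling the colormap to exactly 5 colors and passing the labels directly; this is a variation of my own, not the original code:

```python
# Vectorized variant: one scatter call with a 5-color discrete colormap
fig, ax = plt.subplots(figsize=(22, 22))
cmap5 = plt.get_cmap('Dark2', 5)  # colormap resampled to 5 discrete colors
ax.scatter(t_sne[:, 0], t_sne[:, 1], marker='.', c=clustered, cmap=cmap5)
for i, name in enumerate(country_vec.index):
    ax.annotate(name, xy=(t_sne[i, 0], t_sne[i, 1]), color=cmap5(clustered[i]))
plt.show()
```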

Zooming in on the area around Japan gives the view below. It is easier to interpret than the hierarchical clustering done in the previous knock.
image.png
