[PYTHON] 100 language processing knock-98 (using pandas): Ward's method clustering

This is the record of knock 98, "Clustering by Ward's method", from the 2015 edition of the 100 Language Processing Knock. Unlike the non-hierarchical clustering of the previous knock, this one performs hierarchical clustering. The result is a dendrogram like the one shown below, in which the clusters form a hierarchy (it is large because there are 238 countries...).

(Dendrogram image)

Reference links

| Link | Remarks |
|------|---------|
| 098. Clustering by Ward method.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 98 | The site I always rely on when doing the 100 language processing knocks |
| Creating a Dendrogram with Python and the elements of linkage | How to implement Ward's method and a dendrogram in Python |

Environment

| Type | Version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|------|---------|
| matplotlib | 3.1.1 |
| pandas | 0.25.3 |
| scipy | 1.4.1 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

98. Ward's method clustering

Perform hierarchical clustering by Ward's method on the word vectors from knock 96. Furthermore, visualize the clustering result as a dendrogram.

Task Supplement (Hierarchical Clustering)

This knock uses hierarchical clustering. The linked article "Creating a Dendrogram with Python and the elements of linkage" explains it very clearly. With hierarchical clustering, the number of clusters can be chosen arbitrarily afterwards by cutting the tree at a higher level, as in the sketch below.
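For reference, a minimal sketch of cutting the hierarchy into a fixed number of clusters with scipy's fcluster; the dummy data and variable names are only for illustration, while the knock itself uses the country-vector DataFrame.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Dummy data purely for illustration; the knock uses the country-vector DataFrame
rng = np.random.RandomState(0)
X = rng.rand(10, 5)

# Ward's-method hierarchy, same call as in the answer program
Z = linkage(X, method='ward')

# Cut the hierarchy so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # cluster id (1-3) assigned to each of the 10 samples
```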

Answer

Answer program [098. Clustering by Ward method.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20%28II%29/098.Ward%E6%B3%95%E3%81%AB%E3%82%88%E3%82%8B%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0.ipynb)

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the country vectors saved in knock 96 (a pickled, compressed DataFrame)
country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# Hierarchical clustering by Ward's method
clustered = linkage(country_vec, method='ward')

# Draw the dendrogram; the cyan background keeps the labels visible when saved as an image
plt.figure(figsize=(20, 50), dpi=100, facecolor='c')
_ = dendrogram(clustered, labels=list(country_vec.index), leaf_font_size=8, orientation='right')
plt.show()
```

Answer commentary

Overall, I relied on the article "Creating a Dendrogram with Python and the elements of linkage". Clustering is done with the linkage function; setting its method parameter to ward performs clustering by Ward's method.

```python
clustered = linkage(country_vec, method='ward')
```
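For reference, the array returned by linkage encodes the merge history: each row records the indices of the two clusters being merged, the distance between them, and the number of original samples in the merged cluster. A minimal sketch with toy data (the points are only for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four toy 2-D points forming two obvious pairs
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0],
              [5.1, 5.0]])

Z = linkage(X, method='ward')
# Each row of Z is one merge step:
# [index of cluster A, index of cluster B, merge distance, sample count in the merged cluster]
print(Z)
```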

After that, the result only needs to be displayed; the dendrogram (tree diagram) at the top of this article is the output. I deliberately set the background color to 'c' (cyan) because with the default background the labels disappeared when I turned the figure into an image file.

```python
plt.figure(figsize=(20, 50), dpi=100, facecolor='c')
_ = dendrogram(clustered, labels=list(country_vec.index), leaf_font_size=8, orientation='right')
plt.show()
```
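As a side note on the background color: savefig writes a white background by default, so if you save the figure to a file (before calling plt.show()), the facecolor has to be passed explicitly to keep the cyan. A minimal sketch; the file name is just an example:

```python
# Call this before plt.show(); 'dendrogram.png' is only an example file name
fig = plt.gcf()
fig.savefig('dendrogram.png', facecolor=fig.get_facecolor(), bbox_inches='tight')
```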

Let's look at just the top part. Unlike K-Means, it is nice that you can see the degree of similarity between individual countries. Maritime nations seem to line up together, and Japan and China end up close to each other; I wonder why. Iraq, Afghanistan, and Israel are physically close to one another, but are they also similar in meaning?

(Excerpt of the dendrogram)
