[PYTHON] 100 language processing knock-97 (using scikit-learn): k-means clustering

This is a record of the 97th problem, "k-means clustering", of Language Processing 100 Knock 2015. Using the word vectors of country names obtained in the previous knock, we classify the countries into 5 clusters. K-Means, the clustering method used here, is one I learned in the Coursera Machine Learning introductory course (week 8: Unsupervised Learning (K-Means and PCA)).

Reference link

| Link | Remarks |
|---|---|
| 097.k-means clustering.ipynb | Answer program GitHub link |
| 100 amateur language processing knocks: 97 | The source I always rely on when working through the 100 language processing knocks |

environment

| type | version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|---|---|
| pandas | 0.25.3 |
| scikit-learn | 0.21.3 |

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

97. k-means clustering

> Run k-means clustering on the word vectors from problem 96 with the number of clusters $k = 5$.

Task Supplement (K-Means)

For K-Means, the article "I tried to visualize the K-means method with D3.js" is easy to follow. You can skip the statistical and mathematical parts and grasp the method intuitively. For those who want more, the free Coursera Machine Learning introductory course is recommended; its content is summarized in the article "Coursera Machine Learning Introductory Online Course Tora no Maki (recommended for liberal-arts working adults)".
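For readers who prefer code to diagrams, the two alternating steps of K-Means (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not what scikit-learn runs internally; the function name `kmeans_sketch` and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated blobs of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

Which blob receives label 0 depends on the random initialisation, so only the grouping is meaningful, not the label numbers themselves.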

Answer

Answer program [097.k-means clustering.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/097.k-means%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0.ipynb)

import pandas as pd
from sklearn.cluster import KMeans

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# K-Means clustering
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)

for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))

Answer commentary

Read the file saved in the previous knock.

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

`info()` prints the following summary of the DataFrame: 238 countries, each with a 300-dimensional vector.

<class 'pandas.core.frame.DataFrame'>
Index: 238 entries, American_Samoa to Zimbabwe
Columns: 300 entries, 0 to 299
dtypes: float64(300)
memory usage: 559.7+ KB
None

With scikit-learn, this single line is all it takes to run K-Means. Conveniently, a DataFrame can be passed in directly.

# K-Means clustering: store the cluster label of each country in a new 'class' column
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)
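K-Means starts from randomly chosen centroids, so cluster numbers and memberships can differ from run to run. The following is a minimal sketch, assuming a small toy DataFrame (`toy_vec`, hypothetical) in place of the country vectors. It fixes `random_state` for repeatable results and records the feature columns before adding the label column, so that re-running the cell does not pass `class` back in as a feature.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the country vectors: 6 rows, 3 feature columns.
toy_vec = pd.DataFrame(
    [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
     [5.0, 5.1, 5.0], [5.1, 5.0, 5.1], [5.0, 5.0, 5.1]],
    index=['A', 'B', 'C', 'D', 'E', 'F'])

# Remember the feature columns before the label column is added.
feature_cols = list(toy_vec.columns)

# random_state fixes the centroid initialisation, making runs repeatable.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
toy_vec['class'] = kmeans.fit_predict(toy_vec[feature_cols])
print(toy_vec['class'])
```

The same pattern should carry over to the real data by replacing `toy_vec` with `country_vec` and `n_clusters=2` with `n_clusters=5`.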

Let's take a quick look at the clustering results.

for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))

Cluster 0 is unusually large at 153 countries; it may be acting as a catch-all "other" group. Cluster 1 looks like so-called maritime nations such as New Zealand, the United Kingdom, and Japan, but it also includes India and China. Cluster 2 is mostly European countries, with Brazil and Argentina mixed in. It is a subtle result, and it is hard to judge whether the clustering succeeded or not.

0 Cluster:Index(['American_Samoa', 'Antigua_and_Barbuda', 'Bosnia_and_Herzegovina',
       'Burkina_Faso', 'Cabo_Verde', 'Cayman_Islands',
       'Central_African_Republic', 'Christmas_Island', 'Keeling_Islands',
       'Cocos_Islands',
       ...
       'Tonga', 'Tunisia', 'Turkmenistan', 'Tuvalu', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=153)
1 Cluster:Index(['New_Zealand', 'United_Kingdom', 'United_States', 'Australia', 'Canada',
       'China', 'India', 'Ireland', 'Israel', 'Japan', 'Pakistan'],
      dtype='object')
2 Cluster:Index(['Argentina', 'Austria', 'Belgium', 'Brazil', 'Bulgaria', 'Denmark',
       'Egypt', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Italy',
       'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Spain',
       'Sweden', 'Switzerland'],
      dtype='object')
3 Cluster:Index(['Guinea', 'Jersey', 'Mexico'], dtype='object')
4 Cluster:Index(['Czech_Republic', 'Hong_Kong', 'People's_Republic_of_China',
       'Puerto_Rico', 'South_Africa', 'Sri_Lanka', 'Great_Britain',
       'Northern_Ireland', 'Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Armenia', 'Azerbaijan', 'Bangladesh', 'Cambodia', 'Chile', 'Colombia',
       'Croatia', 'Cuba', 'Cyprus', 'Ethiopia', 'Fiji', 'Georgia', 'Ghana',
       'Iceland', 'Indonesia', 'Iraq', 'Kenya', 'Latvia', 'Lebanon', 'Libya',
       'Lithuania', 'Malaysia', 'Malta', 'Mongolia', 'Morocco', 'Nepal',
       'Nigeria', 'Panama', 'Peru', 'Philippines', 'Serbia', 'Singapore',
       'Slovakia', 'Sudan', 'Thailand', 'Turkey', 'Uganda', 'Ukraine'],
      dtype='object')
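Cluster sizes such as the 153 members of cluster 0 are easier to read from a count than from the printed `Index` objects. Here is a minimal sketch, assuming a hypothetical `labels` Series that stands in for `country_vec['class']`:

```python
import pandas as pd

# Hypothetical labels standing in for country_vec['class'].
labels = pd.Series([0, 0, 0, 1, 2, 2],
                   index=['Tonga', 'Yemen', 'Zambia', 'Japan', 'France', 'Spain'],
                   name='class')

# Number of members per cluster, ordered by cluster id.
sizes = labels.value_counts().sort_index()
print(sizes)

# Member names per cluster as plain lists, without the Index(...) wrapping.
members = {cluster: list(group.index) for cluster, group in labels.groupby(labels)}
for cluster, names in members.items():
    print(cluster, names)
```

Converting each group to a plain list also avoids the `...` truncation that pandas applies when printing a long `Index`.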
