[PYTHON] 100 amateur language processing knocks: 98

This is a record of my attempt at Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 + Anaconda 4.1.1 (64-bit). Click here for a list of past knocks (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 10: Vector Space Methods (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

98. Ward's method clustering

Apply hierarchical clustering by Ward's method to the word vectors obtained in Problem 96. Furthermore, visualize the clustering result as a dendrogram.

The finished code:

main.py


# coding: utf-8
import pickle

from scipy import io
from scipy.cluster.hierarchy import ward, dendrogram
from matplotlib import pyplot as plt

fname_dict_index_t = 'dict_index_country'
fname_matrix_x300 = 'matrix_x300_country'


# Read the term-to-index dictionary
with open(fname_dict_index_t, 'rb') as data_file:
    dict_index_t = pickle.load(data_file)

# Read the word-vector matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Clustering with Ward's method
# (do not name the result "ward": that would shadow the function)
linkage_matrix = ward(matrix_x300)
print(linkage_matrix)

# Dendrogram display
dendrogram(linkage_matrix, labels=list(dict_index_t.keys()),
           leaf_font_size=8)
plt.show()

Execution result:

(Figure Kobito.MNLwcE.png: dendrogram of the clustering result)

The leaves are packed rather tightly, so the country names are hard to read, but...

The result of Ward's method is printed to the console. At the end, some warnings that look GUI-related appear, but since the dendrogram displays fine anyway, I didn't look into them ^ ^;

Execution result excerpt


[[  3.20000000e+01   5.50000000e+01   3.53776681e-01   2.00000000e+00]
 [  6.00000000e+00   1.36000000e+02   3.87856834e-01   2.00000000e+00]
 [  1.77000000e+02   1.88000000e+02   4.23449123e-01   2.00000000e+00]
 [  1.54000000e+02   2.10000000e+02   4.27902443e-01   3.00000000e+00]
 [  1.57000000e+02   1.59000000e+02   4.59378255e-01   2.00000000e+00]
 [  3.70000000e+01   1.49000000e+02   4.71549159e-01   2.00000000e+00]
 [  2.80000000e+01   2.11000000e+02   4.87752476e-01   3.00000000e+00]
 [  1.58000000e+02   1.93000000e+02   4.97266503e-01   2.00000000e+00]
 [  3.80000000e+01   6.20000000e+01   5.05332444e-01   2.00000000e+00]
 [  1.20000000e+02   1.60000000e+02   5.24018281e-01   2.00000000e+00]
 [  2.09000000e+02   2.12000000e+02   5.26909455e-01   5.00000000e+00]
 [  2.14000000e+02   2.19000000e+02   5.63841737e-01   7.00000000e+00]
 [  1.01000000e+02   1.78000000e+02   5.74498656e-01   2.00000000e+00]
 [  1.18000000e+02   2.15000000e+02   5.88183758e-01   4.00000000e+00]
 [  1.55000000e+02   2.13000000e+02   6.09433424e-01   3.00000000e+00]
 [  1.51000000e+02   2.20000000e+02   6.57637828e-01   8.00000000e+00]
 [  7.40000000e+01   2.22000000e+02   6.69853809e-01   5.00000000e+00]
 [  4.80000000e+01   7.00000000e+01   6.72731044e-01   2.00000000e+00]
 [  2.17000000e+02   2.21000000e+02   6.88767402e-01   4.00000000e+00]
 [  2.16000000e+02   2.18000000e+02   6.89235190e-01   4.00000000e+00]
(Omitted)
 [  3.53000000e+02   3.66000000e+02   7.72813958e+00   3.70000000e+01]
 [  8.80000000e+01   3.93000000e+02   7.91160391e+00   3.00000000e+00]
 [  5.90000000e+01   3.95000000e+02   8.36126980e+00   4.00000000e+00]
 [  3.10000000e+01   3.98000000e+02   8.42501966e+00   4.00000000e+00]
 [  3.71000000e+02   3.97000000e+02   8.67427318e+00   1.10000000e+02]
 [  3.84000000e+02   3.87000000e+02   8.73417227e+00   4.90000000e+01]
 [  9.40000000e+01   3.96000000e+02   8.76123102e+00   3.00000000e+00]
 [  3.88000000e+02   3.92000000e+02   9.29959662e+00   1.20000000e+01]
 [  3.86000000e+02   3.89000000e+02   1.00548308e+01   5.00000000e+00]
 [  3.90000000e+02   3.91000000e+02   1.03513479e+01   1.80000000e+01]
 [  8.40000000e+01   4.06000000e+02   1.08361185e+01   1.90000000e+01]
 [  4.02000000e+02   4.04000000e+02   1.22602262e+01   6.10000000e+01]
 [  4.03000000e+02   4.05000000e+02   1.27024876e+01   8.00000000e+00]
 [  3.99000000e+02   4.07000000e+02   1.36985698e+01   2.30000000e+01]
 [  1.98000000e+02   4.00000000e+02   1.39290496e+01   5.00000000e+00]
 [  4.09000000e+02   4.11000000e+02   1.64166647e+01   1.30000000e+01]
 [  3.94000000e+02   4.08000000e+02   1.65142018e+01   6.30000000e+01]
 [  4.10000000e+02   4.12000000e+02   2.02282024e+01   3.60000000e+01]
 [  4.01000000e+02   4.13000000e+02   2.40955381e+01   1.73000000e+02]
 [  4.14000000e+02   4.15000000e+02   4.18596046e+01   2.09000000e+02]]
GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.

(python:20130): Gtk-WARNING **: GModule (/usr/lib/x86_64-linux-gnu/gtk-2.0/2.10.0/immodules/im-fcitx.so) initialization check failed: GLib version too old (micro mismatch)

(python:20130): Gtk-WARNING **: Loading IM context type 'fcitx' failed

(python:20130): Gtk-WARNING **: GModule (/usr/lib/x86_64-linux-gnu/gtk-2.0/2.10.0/immodules/im-fcitx.so) initialization check failed: GLib version too old (micro mismatch)

(python:20130): Gtk-WARNING **: Loading IM context type 'fcitx' failed

(python:20130): Gtk-WARNING **: GModule (/usr/lib/x86_64-linux-gnu/gtk-2.0/2.10.0/immodules/im-fcitx.so) initialization check failed: GLib version too old (micro mismatch)

(python:20130): Gtk-WARNING **: Loading IM context type 'fcitx' failed

Hierarchical clustering

Hierarchical clustering is a method that merges the most similar items step by step, instead of fixing the number of clusters up front as K-Means did in the previous problem.

First, treat each data point as its own cluster, compute the distances between all clusters, and merge the two closest clusters into one. That completes the first step, leaving one fewer cluster than there are data points. Repeating this reduces the number of clusters one at a time. After clusters are merged, the distances to the other clusters are recomputed, for example using the centroid of the member data as the cluster's representative point.
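The merge-one-pair-per-step process above can be sketched with SciPy on toy data (the points below are hypothetical, not the problem's word vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import ward

# Hypothetical toy data: 4 points in 2-D, forming two obvious pairs
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Each row of the linkage matrix records one merge;
# n points are fully merged in n - 1 steps, so there are n - 1 rows
linkage_matrix = ward(points)
print(linkage_matrix.shape)
```

With 4 points there are exactly 3 merges, so the linkage matrix has 3 rows.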

Ward's method

Hierarchical clustering comes in several variants depending on how the distance between clusters is defined. Ward's method is one of them: at each step it merges the pair of clusters that gives the smallest increase in the sum of squared distances from each point to its cluster centroid. Other variants include the group average method and the single-linkage (shortest distance) method.
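The different distance definitions can be compared directly with scipy.cluster.hierarchy.linkage, which takes the method name as a parameter (the toy points below are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs far apart
points = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])

# Cluster the same data under three cluster-distance definitions
distances = {}
for method in ('ward', 'average', 'single'):
    z = linkage(points, method=method)
    distances[method] = z[-1, 2]  # height of the final merge

# Single linkage uses the closest pair between clusters, so its final
# merge height is the smallest of the three on this data
print(distances)
```

The tree shape is identical here, but the merge heights differ, which is exactly what changes the look of the dendrogram.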

Dendrogram

A dendrogram is a tree diagram. It visualizes the result of hierarchical clustering in a shape like a tournament bracket. Each branch point marks a merge of two clusters, and the merging actually proceeds from bottom to top. The vertical axis is the merge distance, so the higher up two branches join, the more dissimilar the merged clusters are.

In the dendrogram of the execution result, cutting horizontally at just above 25 on the vertical axis yields two clusters. Cutting just above 20 yields three, and just below 20 yields four. In this way you can decide the number of clusters while looking at the dendrogram.

(Figure Kobito.4DlwhF.png: dendrogram with horizontal cut lines for choosing the number of clusters)

Googling turns up plenty of explanations of hierarchical clustering and Ward's method, so I will omit the details. The Data Mining Cluster Analysis explanation on ALBERT's website was easy to understand.
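By the way, the horizontal cut described above can also be done programmatically with scipy.cluster.hierarchy.fcluster. A minimal sketch on hypothetical toy data (the threshold 3.0 is chosen for this data, not for the article's word vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Hypothetical toy data: two tight pairs plus one outlier
points = np.array([[0.0, 0.0], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.1],
                   [10.0, 0.0]])
z = ward(points)

# Cut the dendrogram at height 3.0: only merges below that height
# survive, leaving the two pairs and the outlier as separate clusters
labels = fcluster(z, t=3.0, criterion='distance')
n_clusters = len(set(labels))
print(labels, n_clusters)
```

Raising or lowering `t` moves the cut line up or down, changing the number of flat clusters just as described for the figure.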

Implementation of Ward's method

Clustering with Ward's method is easy with SciPy, which I started using in Problem 84.

Just pass the word vectors to scipy.cluster.hierarchy.ward() and you're done. The resulting linkage matrix has one row per merge, so the number of rows is the number of data points minus 1. In each row, the first two values are the indexes of the two clusters being merged, the next is the merge distance, and the last is the number of data points in the merged cluster.

For example, in the first line of the execution result

[ 3.20000000e+01 5.50000000e+01 3.53776681e-01 2.00000000e+00]


indicates the first merge performed. The clusters with indexes 32 and 55 (initially, individual words) are merged into one cluster, the merge distance is 0.3537... (the value shown on the dendrogram's vertical axis), and two data points belong to the new cluster. The merged cluster is assigned the current maximum index + 1 as its index.

The last line

[  4.14000000e+02   4.15000000e+02   4.18596046e+01   2.09000000e+02]

indicates the final merge: clusters 414 and 415 are merged, the merge distance is 41.8596..., and all 209 data points (the total) belong to the result. This corresponds to the topmost blue join in the resulting dendrogram.
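These row semantics, including the "current maximum index + 1" numbering, can be checked on toy data (hypothetical points, not the article's matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import ward

# Hypothetical toy data: n = 4 points on a line, two pairs
points = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.2, 0.0]])
n = len(points)
z = ward(points)

# Row k merges clusters z[k, 0] and z[k, 1] into a new cluster whose
# index is n + k; the last column is the new cluster's size.
# Here row 0 creates cluster 4 = {0, 1}, row 1 creates cluster 5 = {2, 3},
# and the final row merges clusters 4 and 5 into one of size 4.
first_row, last_row = z[0], z[-1]
print(first_row, last_row)
```

The final row always contains the total number of data points in its last column, matching the article's last line where all 209 points belong to the final cluster.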

Implementation of dendrogram

Displaying the dendrogram is also easy with SciPy: just pass the result of scipy.cluster.hierarchy.ward() to scipy.cluster.hierarchy.dendrogram(). The labels are too small by default, so I adjusted them with leaf_font_size. You can also swap the axes by specifying orientation.
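Those options can be tried on toy data without a GUI by using matplotlib's Agg backend (the points and labels below are hypothetical):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display needed
import numpy as np
from scipy.cluster.hierarchy import ward, dendrogram

# Hypothetical toy data: two pairs of nearby "cities"
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
z = ward(points)

# orientation='right' puts the leaves on the left edge and grows the
# tree rightward; leaf_font_size controls the label size
result = dendrogram(z, labels=['tokyo', 'osaka', 'paris', 'lyon'],
                    leaf_font_size=8, orientation='right')
print(result['ivl'])  # leaf labels in plot order
```

The returned dictionary is also useful on its own: 'ivl' gives the leaf labels in display order, which helps when the plot itself is too dense to read.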

Unfortunately, as with the previous problem, even looking at the result I can't tell from what perspective the words were classified...

That's all for the 99th knock. If you notice any mistakes, I would appreciate it if you could point them out.

