Getting Started with Machine Learning in Python: Clustering, Dimensionality Reduction & Visualization

## Introduction

It's been a while since the previous post, "Data Preprocessing", but this time I'll try clustering Twitter text data.

## Summary in 3 lines

- (Finally) clustered.
- Visualized the clustering result with matplotlib.
- Next time may be a bit of a detour to introduce some visualization tricks.

## Straight to the source code (everything except visualization)

This time I've added the **"clustering"** and **"dimensionality reduction"** steps to last time's "vectorize" implementation. (The "visualization" source is a bit long, so it comes later.)

tw_ml.py (excerpt)


```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import MeCab as mc
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

MECAB_OPT = "-Ochasen -d C:\\tmp\\mecab-ipadic-neologd\\"

NUM_CLUSTER = 3
SAMPLE_DATA = [
    # Omitted (same as last time)
]

def mecab_tokenizer(text):
    # Omitted (same as last time)
    ...

def main():

    # Initialize the vectorizer (same as last time)
    vectorizer = TfidfVectorizer(
        min_df=1, stop_words=[u"Perfume", u"HTTPS"],
        tokenizer=mecab_tokenizer)
    # Vectorize the sample data (same as last time)
    tfidf_weighted_matrix = vectorizer.fit_transform(SAMPLE_DATA)

    # Initialize the K-Means cluster-analysis class
    km_model = KMeans(n_clusters=NUM_CLUSTER)
    # Run the cluster analysis by feeding in the vector data
    km_model.fit(tfidf_weighted_matrix)

    # Initialize the dimensionality-reduction (singular value decomposition) class
    lsa = TruncatedSVD(2)
    # Compress the sample texts to 2-D, then project the cluster centers
    # with the same fitted SVD (transform, not fit_transform) so that
    # texts and centers end up in the same 2-D space
    compressed_text_list = lsa.fit_transform(tfidf_weighted_matrix)
    compressed_center_list = lsa.transform(km_model.cluster_centers_)

if __name__ == '__main__':
    main()
```

Let's take a look at what we are doing.

## Clustering

```python
# Initialize the K-Means cluster-analysis class
km_model = KMeans(n_clusters=NUM_CLUSTER)
# Run the cluster analysis by feeding in the vector data
km_model.fit(tfidf_weighted_matrix)
```

The processing is exactly what the comments say. The one thing to think about is the set of parameters passed when initializing the KMeans class, but if you just want to get something running, almost no tuning is needed. Since we only have 5 data points this time, the number of clusters was changed from the default **8** to **3**; everything else is left at its default. (Details of the other parameters: sklearn.cluster.KMeans)
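One caveat while experimenting: K-Means starts from randomly chosen initial centroids, so cluster labels can vary between runs. If you want reproducible output, you can pin the seed with the `random_state` parameter; a minimal sketch (the seed value itself is arbitrary):

```python
from sklearn.cluster import KMeans

# Same initialization as above, but with a fixed seed so that
# repeated runs assign the same cluster labels
km_model = KMeans(n_clusters=3, random_state=42)
```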

After initialization, just feed the data to fit and the analysis runs. The main results can be checked through the following attributes of km_model:

| Attribute | Contents | Example value |
| --- | --- | --- |
| km_model.cluster_centers_ | Center-point vector of each cluster | [[0, 0, 0.46369322, 0.46369322, 0, 0.46369322, 0, 0, 0, 0, 0, 0, 0, 0.37410477] ... (one per cluster)] |
| km_model.labels_ | Label of each element of the analyzed data (which cluster it belongs to) | [2 1 1 0 1] |
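As a quick sanity check, both attributes can be inspected right after fitting. A minimal sketch, assuming km_model was fitted as above (the example values are from this article's 5-text run):

```python
# Cluster assignment of each of the 5 sample texts
print(km_model.labels_)                  # e.g. [2 1 1 0 1]

# One center vector per cluster, still in the 14-dimensional TF-IDF space
print(km_model.cluster_centers_.shape)   # e.g. (3, 14)
```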

However, a raw list of numbers like this doesn't tell you much on its own, so let's visualize the clusters. That requires the **dimensionality reduction** described next.

## Dimensionality reduction (for visualization)

```python
# Initialize the dimensionality-reduction (singular value decomposition) class
lsa = TruncatedSVD(2)
# Compress the sample texts to 2-D, then project the cluster centers
# with the same fitted SVD so both end up in the same 2-D space
compressed_text_list = lsa.fit_transform(tfidf_weighted_matrix)
compressed_center_list = lsa.transform(km_model.cluster_centers_)
```

When you hear "dimensionality reduction" (or "dimensional compression"), Final Fantasy V and VIII may reflexively spring to mind; it's a phrase that sounds like pure science fiction. The actual purpose is far more modest: high-dimensional information can't be drawn directly, so we **convert it to the two dimensions of x and y (as an approximation)**. Nothing grand about it.

Concretely, the number of dimensions here is the number of distinct words appearing across all the texts, so this sample data is 14-dimensional, like this: [0, 0, 0.46369322, 0.46369322, 0, 0.46369322, 0, 0, 0, 0, 0, 0, 0, 0.37410477]. Illustrating that directly is hard.
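If you want to confirm that dimensionality yourself, the shape of the TF-IDF matrix gives it directly: rows are texts, columns are distinct words. A sketch assuming the objects from the code above (note that the feature-name method depends on your scikit-learn version):

```python
# Rows = number of texts, columns = vocabulary size (the "dimensions")
print(tfidf_weighted_matrix.shape)         # e.g. (5, 14)

# Which word each dimension corresponds to
# (older scikit-learn versions use get_feature_names() instead)
print(vectorizer.get_feature_names_out())
```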

Dimensionality reduction turns it into the following two dimensions, which can happily be plotted on an xy plane: [9.98647967e-01, 0.00000000e+00]

There are various dimensionality-reduction methods, but this time we use **Latent Semantic Analysis (LSA)**. That also sounds intimidating, but as the source code above shows, it's easy with scikit-learn (here via TruncatedSVD; there appear to be other options as well). The implementation above converts both the 14-dimensional sample-data vectors and the cluster-center vectors from the analysis result to 2-D.
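One thing worth checking after the compression is how much of the original information the 2-D approximation actually keeps. TruncatedSVD exposes this as explained_variance_ratio_; a sketch, assuming lsa was fitted as above:

```python
# Fraction of the original variance captured by each of the two axes
print(lsa.explained_variance_ratio_)

# Total variance retained by the 2-D approximation
print(lsa.explained_variance_ratio_.sum())
```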

Now you are ready to visualize.

## Source code (visualization part)

With the implementation so far, we have all the information needed to visualize the clusters, so let's add drawing code using matplotlib.

tw_ml.py (excerpt: visualization part)



```python
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.font_manager as fm

# A Japanese-capable font so the tweet text renders correctly
FP = fm.FontProperties(
    fname=r'C:\WINDOWS\Fonts\YuGothL.ttc',
    size=7)

def draw_km(text_list, km_text_labels,
            compressed_center_list, compressed_text_list):

    # Start drawing
    fig = plt.figure()
    axes = fig.add_subplot(111)
    for label in range(NUM_CLUSTER):

        # Use a different color for each label,
        # picked from the "cool" color map
        color = cm.cool(float(label) / NUM_CLUSTER)

        # Plot the center point of this label's cluster
        xc, yc = compressed_center_list[label]
        axes.plot(xc, yc,
                  color=color,
                  ms=6.0, zorder=3, marker="o")

        # Annotate the center with the cluster label
        axes.annotate(
            label, xy=(xc, yc), fontproperties=FP)

        for text_num, text_label in enumerate(km_text_labels):

            if text_label == label:
                # Plot the texts belonging to this cluster
                x, y = compressed_text_list[text_num]
                axes.plot(x, y,
                          color=color,
                          ms=5.0, zorder=2, marker="x")

                # Annotate the point with the text itself
                axes.annotate(
                    text_list[text_num], xy=(x, y), fontproperties=FP)

                # Draw a dashed line from the cluster center to the text
                axes.plot([x, xc], [y, yc],
                          color=color,
                          linewidth=0.5, zorder=1, linestyle="--")

    plt.axis('tight')
    plt.show()

def main():
    # Omitted up to the dimensionality reduction (same as above)

    # Visualize the data
    # Note: km_model.labels_ holds the cluster label of each element of SAMPLE_DATA
    draw_km(SAMPLE_DATA, km_model.labels_,
            compressed_center_list, compressed_text_list)

if __name__ == '__main__':
    main()
```

The result looks like this! (figure_1.png)

| text | cluster |
| --- | --- |
| Nocchi cute #Perfume https://t.co/xxx | 2 |
| Perfume's production is amazing #prfm #Perfume_um https://t.co/xxx | 1 |
| chocolate disco / Perfume #NowPlaying https://t.co/xxx | 1 |
| I went to Perfume A Gallery Experience in London https://t.co/xxx | 0 |
| The chocolate disco production is cool. I want to go to a live show. #Perfume https://t.co/xxx | 1 |

With so few data points the picture looks a bit sparse, but somehow it does feel like the texts are being classified!

## Finally, clustering works!

That completes a first pass at clustering text data with the K-means method. Feeding km_model new data should give you an answer as to which cluster it belongs to. (In other words, a *text classifier!*)
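In code, that would look roughly like the sketch below (new_text is a hypothetical unseen tweet, not from this article's data). The key point is to reuse the already-fitted vectorizer with transform, not fit_transform, so the new text lands in the same feature space the model was trained on:

```python
# A hypothetical new tweet
new_text = [u"The new Perfume song is great #Perfume https://t.co/xxx"]

# Map it into the SAME TF-IDF space the model was fitted on
new_vec = vectorizer.transform(new_text)

# predict() returns the index of the nearest cluster center
print(km_model.predict(new_vec))  # e.g. [1]
```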

Next time, we'll actually classify new data and see what happens as the amount of data grows. And, as a bit of a detour, I'd also like to work out a few tricks for presenting the results in matplotlib.
