This article is a sequel to my previous one, Try SVM with scikit-learn on Jupyter Notebook.
As before, I am working in the Jupyter Notebook environment set up according to Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita.
In this Jupyter environment, you can access port 8888 in a browser and use Jupyter Notebook. You can open a new notebook via New > Python 3 from the button in the upper right.
I also use the randomly generated CSV file https://github.com/suzuki-navi/sample-data/blob/master/sample-data-2.csv.
This time I try it with sample-data-2.csv.
import pandas as pd
from sklearn import model_selection
df = pd.read_csv("sample-data-2.csv", names=["id", "target", "data1", "data2"])
This is the data.
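Just as a quick check (my own aside, not part of the original walkthrough), you can preview the loaded data first:

# Preview the first few rows of the loaded CSV
print(df.head())
# Number of rows and columns
print(df.shape)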
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(df["data1"], df["data2"])
It's hard to tell from this plot alone, but there are four clusters. I'll go through the scatter plots roughly below so that you can see all four, but to summarize first, they are the following:

A. A small amount of data scattered widely around
B. A large cluster of data near the center
C. A small cluster of data right at the center
D. A very small cluster of data near the center, slightly to the upper right
Below is a color-coded scatter plot.
plt.scatter(df["data1"], df["data2"], c = df["target"])
There is a large amount of data near the center (B green, C blue, D purple) and a small amount of data scattered around (A yellow).
Let's enlarge the area near the center.
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(df["data1"], df["data2"], c = df["target"])
There are two darker clusters (C blue, D purple) inside the green area.
Let's zoom in even further.
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.scatter(df["data1"], df["data2"], c = df["target"])
The center is C blue and the upper right is D purple.
As you can see from the code that generated this data (linked below), it simply superimposes four groups, each sampled from its own normal distribution.
https://github.com/suzuki-navi/sample-data/blob/master/sample-data-2.py
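The script at that link is what actually generated the file. Purely as an illustration of the structure described above (the centers, spreads, and sizes below are my own guesses, not the values used in sample-data-2.py), similar data could be produced in Python like this:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Four groups, each drawn from its own 2D normal distribution.
# Centers, spreads, and sizes are illustrative guesses only.
specs = [
    (0, (0.0, 0.0), 10.0, 100),   # A: widely scattered
    (1, (0.0, 0.0), 2.0, 1000),   # B: large cluster near the center
    (2, (0.0, 0.0), 0.3, 200),    # C: small cluster at the very center
    (3, (1.0, 1.0), 0.2, 50),     # D: very small cluster to the upper right
]

frames = []
for label, (cx, cy), scale, n in specs:
    frames.append(pd.DataFrame({
        "target": label,
        "data1": rng.normal(cx, scale, n),
        "data2": rng.normal(cy, scale, n),
    }))

df_gen = pd.concat(frames, ignore_index=True)
df_gen.insert(0, "id", range(len(df_gen)))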
I want to cluster these four groups in an unsupervised way.
Before the Gaussian mixture model, let's also try K-means for unsupervised clustering. Judging from the distribution of the data, K-means seems unlikely to succeed, but let's try it anyway to see how it is done.
First, prepare the data.
feature = df[["data1", "data2"]]
target = df["target"]
Train the K-means model, specifying four clusters.
from sklearn import cluster
model = cluster.KMeans(n_clusters=4)
model.fit(feature)
Reference: sklearn.cluster.KMeans — scikit-learn 0.21.3 documentation
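Before visualizing, you can also look directly at what K-means learned (my own addition, just a quick check):

# Coordinates of the four cluster centers found by K-means
print(model.cluster_centers_)

# Cluster label assigned to each training row (first ten shown)
print(model.labels_[:10])

# New points can be assigned to clusters the same way
print(model.predict([[0.0, 0.0], [10.0, 10.0]]))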
You can use plotting.plot_decision_regions, included in the mlxtend package, to visualize how the data is classified on a scatter plot. plot_decision_regions needs NumPy arrays rather than pandas objects, so convert them with the to_numpy() method.
from mlxtend import plotting
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)
As you can see, it didn't work at all.
References:
plot_decision_regions - Mlxtend.plotting - mlxtend
pandas.DataFrame.to_numpy — pandas 0.25.3 documentation
Now let's try the Gaussian mixture model, the main subject of this article.
Reference: sklearn.mixture.GaussianMixture — scikit-learn 0.21.3 documentation
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=4, covariance_type='full')
model.fit(feature)
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)
Hmm, not as good as I expected...
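Unlike K-means, the Gaussian mixture model fits a full normal distribution per component, so it can be informative to inspect the estimated parameters (again my own aside, not in the original article):

# Mean vector of each of the four fitted Gaussian components
print(model.means_)

# 2x2 covariance matrix of each component (because covariance_type='full')
print(model.covariances_)

# Mixing weight (relative size) of each component
print(model.weights_)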
I didn't find a way to zoom in with plot_decision_regions, so as before I look at the center of the classification result with plain matplotlib.
pred = model.predict(feature)
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(feature["data1"], df["data2"], c = pred)
The two clusters in the center have been merged into one.
Since there is a random element in the training, I tried it a few times, but it never separates the two clusters in the center, and elsewhere it seems to draw rather arbitrary boundaries.
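Since the fit depends on random initialization, one thing that may be worth trying (an assumption on my part; I have not verified that it separates the central clusters on this data) is fixing the seed and letting GaussianMixture run several initializations and keep the best one:

# Retry with a fixed seed and multiple random initializations;
# GaussianMixture keeps the run with the best lower bound on the likelihood.
model2 = GaussianMixture(n_components=4, covariance_type='full',
                         n_init=10, random_state=0)
model2.fit(feature)
pred2 = model2.predict(feature)

plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(feature["data1"], feature["data2"], c=pred2)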
This is a bit abrupt, but at this point I wrote my own code, based on K-means, that can separate clusters following normal distributions.
I wrote this code in Scala, but I will omit the details here. If I find the time, I will introduce it in another article.
I saved the clustering result from my own code to pred1.csv and look at it in a scatter plot.
pred1 = pd.read_csv("pred1.csv", names=["pred"])
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])
Looks good.
Enlarge the central part.
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])
I was able to cleanly separate it into four.
The motivation for this article was to separate this sample data, which cannot be separated cleanly with K-means, in an unsupervised way. The order of events was actually reversed: I did not know about the Gaussian mixture model at first, so I wrote my own logic earlier. It is an improved version of K-means.
I was impressed that my own logic could separate the data so neatly, but suspecting it was just a reinvention of a known algorithm, I consulted @stkdev, who pointed out that a Gaussian mixture model could be used. I do not yet fully understand the Gaussian mixture model algorithm, so I am not sure whether it is the same as the logic I wrote. It may be the same; at the very least I think it is similar.
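For reference, the standard EM algorithm for a Gaussian mixture is in fact structured much like K-means: a soft assignment step followed by a parameter update step. The sketch below is a generic textbook-style version in NumPy; it is not the Scala logic mentioned above nor scikit-learn's implementation, just an illustration of the idea:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    # Minimal EM for a Gaussian mixture: like K-means, but with
    # soft (probabilistic) assignments and a covariance per cluster.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]  # start from k data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return resp.argmax(axis=1), means, covs, weights

# Example usage on the same feature matrix (hard labels via the most responsible component):
# labels, means, covs, weights = em_gmm(feature.to_numpy(), 4)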
However, I am also not sure whether the fact that the scikit-learn Gaussian mixture model did not produce a clean result here is just a tuning problem.
I will post my own logic separately if I find the time.