[PYTHON] Try clustering with a Gaussian mixture model on Jupyter Notebook

This article is a sequel to my previous one, "Try SVM with scikit-learn on Jupyter Notebook".

As in that article, I am working in the Jupyter Notebook environment prepared according to "Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala)" on Qiita.

In this environment, you can use Jupyter Notebook by accessing port 8888 with a browser. To open a new notebook, choose New > Python 3 from the button at the upper right.

I also use a randomly generated CSV file: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-2.csv

Data confirmation

Try it with sample-data-2.csv.

import pandas as pd
from sklearn import model_selection
df = pd.read_csv("sample-data-2.csv", names=["id", "target", "data1", "data2"])

This is the data.

image.png

%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(df["data1"], df["data2"])

image.png

You can't tell from this plot alone, but there are four clusters. The scatter plots that follow show them step by step, but to summarize first, the four are as follows.

A. A small amount of data scattered widely around
B. A large clump of data near the center
C. A small clump of data at the center
D. A very small clump of data slightly to the upper right of the center

Below is a color-coded scatter plot.

plt.scatter(df["data1"], df["data2"], c = df["target"])

image.png

There is a large amount of data near the center (B green, C blue, D purple) and a small amount of data scattered around (A yellow).

Let's zoom in on the area near the center.

plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(df["data1"], df["data2"], c = df["target"])

image.png

There are two dark lumps (C blue, D purple) in the green area.

Let's zoom in even further.

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.scatter(df["data1"], df["data2"], c = df["target"])

image.png

The center is C blue and the upper right is D purple.

As you can see from the code below that generated this data, it is simply a superposition of four groups of points, one per color, each drawn at random from a normal distribution.

https://github.com/suzuki-navi/sample-data/blob/master/sample-data-2.py
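The idea can be sketched as follows. This is not the linked sample-data-2.py script itself: the centers, standard deviations, and point counts are hypothetical values chosen only to imitate the description of clusters A to D above, and make_sample is a name made up for this sketch.

```python
import numpy as np

# Hypothetical parameters: (label, center, standard deviation, number of points).
# They only imitate the description of clusters A-D and are NOT the values
# used by sample-data-2.py.
CLUSTERS = [
    (0, (0.0, 0.0), 20.0, 100),   # A: few points, scattered widely
    (1, (0.0, 0.0), 2.0, 2000),   # B: large clump near the center
    (2, (0.0, 0.0), 0.3, 500),    # C: small clump at the center
    (3, (1.0, 1.0), 0.1, 100),    # D: very small clump, upper right of center
]

def make_sample(seed=0):
    """Superpose isotropic normal distributions, one per cluster."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, center, std, n in CLUSTERS:
        points.append(rng.normal(loc=center, scale=std, size=(n, 2)))
        labels.append(np.full(n, label))
    return np.vstack(points), np.concatenate(labels)

X, y = make_sample()
print(X.shape, np.bincount(y))
```

Because the clusters overlap and differ greatly in spread and size, the grouping is invisible in an uncolored scatter plot.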

I want to separate these four colors with unsupervised clustering.

Try K-means

Before the Gaussian mixture model, let's try K-means as another unsupervised clustering method. Given how the data is distributed, K-means does not look capable of separating it, but let's try it anyway to learn how it is used.

First, prepare the features and the labels.

feature = df[["data1", "data2"]]
target = df["target"]

Fit a K-means model, specifying four as the number of clusters.

from sklearn import cluster
model = cluster.KMeans(n_clusters=4)
model.fit(feature)

Reference: sklearn.cluster.KMeans - scikit-learn 0.21.3 documentation
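K-means assigns each point to the nearest cluster center, which is why clusters that share a center but differ in spread are hard for it. Here is a small check of that behavior on hypothetical synthetic data (not sample-data-2.csv):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: a wide cluster and a tight cluster sharing one center.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 5.0, size=(500, 2)),  # wide cluster
    rng.normal(0.0, 0.5, size=(500, 2)),  # tight cluster at the same center
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to every fitted center, shape (n_points, n_clusters).
dist = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
nearest = dist.argmin(axis=1)

# predict() agrees with the rule "pick the nearest center".
agreement = np.mean(nearest == model.predict(X))
center_gap = np.linalg.norm(model.cluster_centers_[0] - model.cluster_centers_[1])
print("agreement with nearest-center rule:", agreement)
print("distance between the two fitted centers:", center_gap)
```

Since only the distance to a center matters, K-means has no way to express "same center, different spread"; it places the two centers apart and cuts the plane between them.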

You can use plotting.plot_decision_regions from the mlxtend package to visualize, on top of a scatter plot, how the plane is divided among the clusters. plot_decision_regions needs NumPy arrays rather than pandas objects, so convert them with the to_numpy() method.

from mlxtend import plotting
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)

image.png

As you can see, it didn't work at all.

Reference: plot_decision_regions - Mlxtend.plotting - mlxtend
Reference: pandas.DataFrame.to_numpy - pandas 0.25.3 documentation
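Cluster numbers assigned by an unsupervised model are arbitrary, so they need not match the values in target even when the grouping itself is right. One way to put a number on the visual impression is the adjusted Rand index from sklearn.metrics, which this article does not otherwise use. The label lists below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
# Same grouping as true_labels; only the cluster numbers are permuted.
renamed = [2, 2, 0, 0, 1, 1]
# A grouping that cuts across every true cluster.
mixed = [0, 1, 0, 1, 0, 1]

score_renamed = adjusted_rand_score(true_labels, renamed)
score_mixed = adjusted_rand_score(true_labels, mixed)
print(score_renamed)  # 1.0: identical grouping despite different numbers
print(score_mixed)
```

In the notebook, the equivalent call would be adjusted_rand_score(target, model.predict(feature)).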

Try a Gaussian mixture model

Now let's try the Gaussian mixture model, the main subject of this article.

Reference: sklearn.mixture.GaussianMixture - scikit-learn 0.21.3 documentation

from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=4, covariance_type='full')
model.fit(feature)
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)

image.png

Hmm, not as good as I expected...

I didn't know how to zoom in with plot_decision_regions, so I look at the central part of the classification result with plain matplotlib instead.
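One way to view decision regions in a chosen window without plot_decision_regions is to call predict on a regular grid and draw the result with matplotlib. This is a sketch under assumptions: the data is synthetic, and decision_grid is a hypothetical helper name, not part of mlxtend or scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def decision_grid(model, xlim, ylim, steps=200):
    """Predict a cluster label for every node of a regular grid."""
    xs = np.linspace(xlim[0], xlim[1], steps)
    ys = np.linspace(ylim[0], ylim[1], steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    zz = model.predict(grid).reshape(xx.shape)
    return xx, yy, zz

# Hypothetical stand-in data; in the notebook, use feature.to_numpy().
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 2.0, size=(1000, 2)),
    rng.normal(1.0, 0.2, size=(200, 2)),
])
model = GaussianMixture(n_components=2, random_state=0).fit(X)

# Only the window -5..5 is evaluated, so the picture is already zoomed in.
xx, yy, zz = decision_grid(model, xlim=(-5, 5), ylim=(-5, 5))
print(zz.shape, np.unique(zz))

# In a notebook, draw the regions and overlay the points:
#   plt.contourf(xx, yy, zz, alpha=0.3)
#   plt.scatter(X[:, 0], X[:, 1], c=model.predict(X), s=5)
```

If the model was fitted on a pandas DataFrame, recent scikit-learn versions may warn about missing feature names when predicting on a NumPy grid; fitting on feature.to_numpy() avoids that.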

pred = model.predict(feature)
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(feature["data1"], feature["data2"], c = pred)

image.png

The two clumps in the center have been merged into a single cluster.

Training has a random element, so I ran it several times, but it never separated the two clumps in the center, and everywhere else it seemed to draw boundaries in arbitrary places.
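The random element is the initialization: the fit only reaches a local optimum, so different starts can end in different solutions. GaussianMixture has the parameters n_init (keep the best of several starts) and random_state (make a run reproducible). Below is a minimal sketch on hypothetical synthetic data. Whether these settings separate the two central clumps in sample-data-2.csv is not something I have verified.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data loosely imitating the layout described in this article.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 20.0, size=(100, 2)),
    rng.normal(0.0, 2.0, size=(2000, 2)),
    rng.normal(0.0, 0.3, size=(500, 2)),
    rng.normal(1.0, 0.1, size=(100, 2)),
])

# Single starts with different seeds: the final lower bound can differ.
bounds = []
for seed in range(3):
    gm = GaussianMixture(n_components=4, covariance_type="full",
                         n_init=1, random_state=seed).fit(X)
    bounds.append(gm.lower_bound_)
print("single starts:", bounds)

# Ten starts; scikit-learn keeps the one with the best lower bound.
best = GaussianMixture(n_components=4, covariance_type="full",
                       n_init=10, random_state=0).fit(X)
pred = best.predict(X)
print("best of 10:", best.lower_bound_, "converged:", best.converged_)
print("cluster sizes:", np.bincount(pred, minlength=4))
```

Comparing lower_bound_ across runs gives a rough way to tell whether a poor-looking result is just an unlucky start.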

My own (?) logic

This is abrupt, but I wrote my own code, based on K-means, so that clusters following normal distributions can be separated.

I wrote it in Scala for now, and I will omit the details here. If I have time, I will introduce it in another article.

I saved the clustering result from my own code to pred1.csv, so let's look at it in a scatter plot.

pred1 = pd.read_csv("pred1.csv", names=["pred"])
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])

image.png

Looks good.

Zoom in on the central part.

plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])

image.png

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.scatter(feature["data1"], feature["data2"], c = pred1["pred"])

image.png

I was able to cleanly separate it into four.

Background

The motivation for this article was wanting to separate, without supervision, sample data like this that K-means cannot separate cleanly. The actual order of events differs from the order of this article: at first I did not know about Gaussian mixture models, so I wrote my own logic first. It is an improved version of K-means.

I was impressed that my own logic could separate the data so neatly, but I suspected it was just a reinvention of a known algorithm, and when I consulted @stkdev, he pointed out that a Gaussian mixture model could be used. I don't fully understand the algorithm behind Gaussian mixture models yet, so I'm not sure whether it is the same as the logic I wrote. It may be the same; at the very least I think it is similar.
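For reference, a Gaussian mixture is usually fitted with the expectation-maximization (EM) procedure, which resembles K-means with soft assignments and a covariance per cluster. The following is a minimal sketch of the textbook updates. It is neither my Scala logic nor scikit-learn's implementation, and the names fit_gmm_em and log_gaussian, the random initialization, and the regularization constant are assumptions made for illustration.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (X - mean).T)      # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))

def fit_gmm_em(X, k, n_iter=50, reg=1e-6, seed=0):
    """Textbook EM for a k-component Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        log_p = np.column_stack([
            np.log(weights[j]) + log_gaussian(X, means[j], covs[j])
            for j in range(k)
        ])
        top = log_p.max(axis=1, keepdims=True)
        log_norm = top[:, 0] + np.log(np.exp(log_p - top).sum(axis=1))
        resp = np.exp(log_p - log_norm[:, None])
        history.append(log_norm.sum())           # log-likelihood before this update
        # M-step: re-estimate weights, means, and covariances from soft counts.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs, resp, history

# Hypothetical data: a wide cluster and a tight cluster.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 3.0, size=(300, 2)),
               rng.normal(4.0, 0.5, size=(300, 2))])
weights, means, covs, resp, history = fit_gmm_em(X, k=2)
print(weights, history[0], history[-1])
```

Compared with K-means, each cluster carries its own covariance and points are assigned softly, which is what allows clusters of different spread to overlap or even share a center.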

However, when I tried a Gaussian mixture model with scikit-learn, the result did not come out cleanly, and I'm not sure whether that is a tuning problem.

I will post my own logic separately if I have time.
