It's like this.
Cluster analysis: An analysis that groups similar data together. K-means clustering: One of the cluster analysis methods.
There are various fields of application, but the most obvious example is A program that brings together the points scattered on the coordinates with each other. This is what I will make this time.
According to Wikipedia, this is what it does.
The K-means method is generally implemented in the following flow. Let n be the number of data and K be the number of clusters.
- Randomly allocate clusters for each data x_i (i = 1 \ ... n).
- Calculate the center V_j (j = 1 \ ... K) of each cluster based on the allocated data. The calculation usually uses the arithmetic mean of each element of the assigned data.
- Find the distance between each x_i and each V_j and reassign x_i to the closest central cluster.
- If the allocation of all x_i clusters does not change in the above process, or if the amount of change falls below a certain preset threshold, it is determined that the process has converged and the process ends. If not, recalculate V_j from the newly allocated cluster and repeat the above process.
It's easy to do, but it's difficult. To do
I think this is what it is.
Yes.
This at the beginning. I made this this time.
99 lines in total. It fits within 100 lines. happy. I'll explain it later.
k-means.py
# -*- coding: utf-8 -*-
import wx
import random
import math
class MyMainWindow(wx.Frame):
def __init__(self, parent=None, id=-1, title=None):
#Set panel on frame
wx.Frame.__init__(self, parent, id, title)
self.panel = wx.Panel(self, size=(300, 300))
self.panel.SetBackgroundColour('000000')
self.Fit()
#Event
self.panel.Bind(wx.EVT_PAINT, self.on_paint)
self.panel.Bind(wx.EVT_LEFT_DOWN, self.on_left_click)
self.panel.Bind(wx.EVT_RIGHT_DOWN, self.on_right_click)
#Variable initialization
self.dots = []
self.dc = None
#There are three types of clusters: red, green, and blue
self.cluster_types = ('#FF0000', '#00FF00', '#0000FF')
self.clusters = [(0, 0), (0, 0), (0, 0)]
#Initial set of dots
self.shuffle_dots()
def on_paint(self, event):
u"""Drawing event"""
self.dc = wx.PaintDC(self.panel)
#Write a square
self.dc.SetPen(wx.Pen('#CCCCCC', 1))
for x in range(0, 300, 10):
self.dc.DrawLine(x, 0, x, 300)
for y in range(0, 300, 10):
self.dc.DrawLine(0, y, 300, y)
#Draw a dot
for dot in self.dots:
self.dc.SetPen(wx.Pen(self.cluster_types[dot['cluster']], 5))
self.dc.DrawPoint(dot['x'], dot['y'])
#Draw the center of gravity of the cluster.
self.draw_cluster()
def on_left_click(self, evt):
u"""Left click to recalculate cluster"""
self.change_cluster()
self.Refresh()
def on_right_click(self, evt):
u"""Right click to reset dot"""
self.shuffle_dots()
self.Refresh()
def shuffle_dots(self):
u"""Arrange dots randomly."""
self.dots = []
for i in range(30):
self.dots.append({'x': random.randint(0, 300),
'y': random.randint(0, 300),
'cluster': random.randint(0, len(self.cluster_types) - 1)})
def draw_cluster(self):
u"""Draw a cluster."""
self.clusters = []
for c in range(len(self.cluster_types)):
#Center of gravity of cluster = average of coordinates of dots belonging to cluster
self.dc.SetPen(wx.Pen(self.cluster_types[c], 1))
count = sum(1 for dot in self.dots if dot['cluster'] == c)
x = sum(dot['x'] for dot in self.dots if dot['cluster'] == c) // count if count != 0 else 150
y = sum(dot['y'] for dot in self.dots if dot['cluster'] == c) // count if count != 0 else 150
self.clusters.append({'x': x, 'y': y})
#Draw the cluster with a cross
self.dc.DrawLine(x - 5, y - 5, x + 5, y + 5)
self.dc.DrawLine(x + 5, y - 5, x - 5, y + 5)
#Draw a line from the cluster to each dot.
self.dc.SetPen(wx.Pen(self.cluster_types[c], 0.8))
for dot in self.dots:
if dot['cluster'] == c:
self.dc.DrawLine(x, y, dot['x'], dot['y'])
def change_cluster(self):
u"""Change the affiliation of each dot to the nearest cluster."""
for d in range(len(self.dots)):
near_dist = 99999
#Distance between two points = √( (X1-X2)^2+(Y1-Y2)^2 )
for c in range(len(self.cluster_types)):
dist = math.sqrt(
(self.dots[d]['x'] - self.clusters[c]['x']) ** 2 + (self.dots[d]['y'] - self.clusters[c]['y']) ** 2)
#Change to the nearest cluster
if near_dist >= dist:
self.dots[d]['cluster'] = c
near_dist = dist
if __name__ == '__main__':
app = wx.PySimpleApp()
w = MyMainWindow(title='K-Means Test')
w.Center()
w.Show()
app.MainLoop()
wxPython This time I used a GUI library called wxPython. It works on Mac, Windows, and Linux, and displays it in a way that is familiar to each platform.
class MyMainWindow(wx.Frame):
Create a class that inherits wx.Frame
and
if __name__ == '__main__':
app = wx.PySimpleApp()
w = MyMainWindow(title='K-Means Test')
w.Center()
w.Show()
app.MainLoop()
call. Such a flow.
Yes. The actual logic is explained from here.
shuffle_dots()
This function scatters the points and puts them in an array called dots.
At that time, groups are assigned appropriately.
draw_cluster()
The average of each point belonging to the group is calculated and used as the center of the group.
By the way, draw a line from the center to each point to make it easier to understand.
change_cluster()
Distance between two points = √ ((X1-X2) ^ 2 + (Y1-Y2) ^ 2)
Use this formula to find the distance between each point and the center of the group, and switch to the closest group. There was such a formula. It was useful for the first time since junior high school.
on_paint()
This function is called at the timing of drawing.
Because it was linked with self.panel.Bind (wx.EVT_PAINT, self.on_paint)
at the beginning.
The order of the explanation and the processing has changed, but the processing explained earlier is called in this function.
Every time you click, the flow so far is recalculated. If you press it several times, it will be classified into the completed form. Left-click to reposition.
It's an analysis method that seems to have a high threshold, but When I actually try it, it's relatively simple.
However, this K-means method tends to give a biased result, so There seems to be an improved analysis method.
Even if you type points into the coordinates and divide them into groups, it's just empty. It may be interesting to analyze the similarity of your favorite animals by assigning the X-axis to the cat liking and the Y-axis to the dog liking.
I also like to increase the coordinate system by two and do four-dimensional cluster analysis. There is a sense of the future.
Recommended Posts