Data analysis by clustering with the k-means method (Python) ([High School Information Department "Information II"] teaching materials for teacher training)

Introduction

The k-means method is one of the non-hierarchical clustering methods. The description in "Chapter 3 Information and Data Science, Second Half, Learning 16. Classification by Clustering" of the teaching materials is easy to understand, so I quote it here.

In the k-means method, clustering is performed according to the following procedure.

  1. Decide the number of clusters in advance, and randomly choose the representative points (centroids).
  2. Find the distance between each data point and each representative point, and assign the data point to the cluster of the closest representative point.
  3. Calculate the mean of each cluster and use it as the new representative point.
  4. If the position of a representative point has changed, return to step 2. If there is no change, the classification ends.

Because the representative points are chosen randomly in step 1, the results can differ greatly from run to run, and the resulting clustering may not be appropriate. This can be improved by repeating the analysis several times or by using the k-means++ method, which replaces step 1 as follows.

1') Randomly select one representative point from the data, then select each remaining representative point with a probability proportional to the squared distance from the nearest representative point already chosen.
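The procedure above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used by scikit-learn or by the teaching materials. The function names `kmeans_pp_init` and `kmeans`, the synthetic data, and the choice to keep the old centroid when a cluster becomes empty are all my own assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Step 1': choose k initial centroids in the k-means++ style."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # first centroid: uniformly at random
    for _ in range(1, k):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (assumption: an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data (made up for illustration): two well-separated blobs
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.5, size=(20, 2)),
               data_rng.normal(5.0, 0.5, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Changing `seed` changes the initial centroids, which is a simple way to see how much the result depends on initialization.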

SnapCrab_NoName_2020-9-24_20-10-29_No-00.png

The teaching materials explain clustering in "Chapter 3 Information and Data Science, Second Half, Learning 16. Classification by Clustering", and that section already includes an implementation example in Python. This time, I take the implementation example written in R in "Chapter 5 Exploration of Problem Discovery / Solution Utilizing Information and Information Technology, Activity Example at the End of the Book 3. Utilization of Information Technology for Utilizing Data" and rewrite it in Python to work through data analysis by clustering with the k-means method.

Teaching materials

[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 5 Exploration of Problem Discovery / Solution Utilizing Information and Information Technology, End of Book (PDF: 4.1MB)

Environment

ipython Colaboratory - Google Colab

Parts to be taken up in the teaching materials

Activity example 3 Utilization of information technology to utilize data

Implementation example and result in python

Before doing the analysis

The teaching materials use Japanese text in the graph plots. matplotlib therefore has to be configured in advance so that Japanese can be displayed.

!apt-get -y install fonts-ipafont-gothic
!ls -ll /root/.cache/matplotlib/
:
-rw-r--r-- 1 root root 46443 Sep 18 20:45 fontList.json
-rw-r--r-- 1 root root 29337 Sep 18 20:25 fontlist-v310.json
drwxr-xr-x 2 root root  4096 Sep 18 20:25 tex.cache

Delete the old font cache fontlist-v310.json based on the information of the ls command.

#Delete the cache.
!rm /root/.cache/matplotlib/fontlist-v310.json #Cache to be erased
!ls -ll /root/.cache/matplotlib/

Now restart the Google Colab runtime. Next, set up matplotlib to use the Japanese font.

import matplotlib

#Japanese display
matplotlib.rcParams['font.family'] = "IPAGothic"
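If Japanese labels still render as empty boxes, it can help to check whether matplotlib actually knows the font. The following is a small sketch: `font_available` is a helper name I made up, and it relies on `font_manager.fontManager.ttflist`, the list of font entries matplotlib has registered.

```python
from matplotlib import font_manager

def font_available(name):
    """Return True if matplotlib has registered a font with this family name."""
    return any(f.name == name for f in font_manager.fontManager.ttflist)

# "IPAGothic" is the family name set in rcParams above
print(font_available("IPAGothic"))
```

If this prints False after installing the font, the font cache is probably stale, which is why the cache is deleted and the runtime restarted.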

Preprocessing

Download the following Excel data from the "Survey on the Actual Situation of Computerization of Education in Schools".

["Computer installation status" and "Internet connection status" by prefecture (high school)](https://www.e-stat.go.jp/stat-search/files?page=1&query=%E5%AD%A6%E6%A0%A1%E3%81%AB%E3%81%8A%E3%81%91%E3%82%8B%E6%95%99%E8%82%B2%E3%81%AE%E6%83%85%E5%A0%B1%E5%8C%96%E3%81%AE%E5%AE%9F%E6%85%8B%E7%AD%89%E3%81%AB%E9%96%A2%E3%81%99%E3%82%8B%E8%AA%BF%E6%9F%BB&layout=dataset&stat_infid=000031898768&metadata=1&data=1)

As in the teaching materials, the data is cleaned in Excel before it is first analyzed with Python. The organized and shaped data is as follows.

pc_sjis.csv

The processing performed is as follows.

- Delete unnecessary headers and footers
- Delete unnecessary items
- Remove the thousands-separator commas so that the data can be saved in CSV format
- Change the item names to alphabetic characters to make them easier to work with
- The data items are: pref (prefecture), school (number of schools), student (number of students), room (number of ordinary classrooms), pc (total number of learner PCs), spp (number of students per learner PC), prj (large presentation device maintenance rate in ordinary classrooms), lan (school LAN maintenance rate in ordinary classrooms), wlan (wireless LAN maintenance rate in ordinary classrooms)
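Part of this cleaning could also be done in pandas instead of Excel. The sketch below is not what the article does. It assumes a tiny made-up sample whose header names (`Prefecture`, `Schools`, `Students`) and values are hypothetical, and it uses the `thousands` parameter of `pd.read_csv` to strip the digit-grouping commas.

```python
import io
import pandas as pd

# Hypothetical sample that mimics a raw export with digit-grouping commas
raw = io.StringIO(
    'Prefecture,Schools,Students\n'
    'A,"1,234","56,789"\n'
    'B,98,"7,654"\n'
)

# thousands="," tells pandas to treat the commas inside numbers as separators
df = pd.read_csv(raw, thousands=",")

# Rename the columns to short alphabetic names, as in the list above
df = df.rename(columns={"Prefecture": "pref", "Schools": "school", "Students": "student"})
print(df.dtypes)
```

The numeric columns are then parsed as integers rather than strings, so they can be used directly in later calculations.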

Based on these, the data is read.

import pandas as pd
from IPython.display import display

pc = pd.read_csv('/content/pc_sjis.csv', encoding='shift_jis')
display(pc.head())

SnapCrab_NoName_2020-9-24_20-29-50_No-00.png

The teaching materials are as follows.

SnapCrab_NoName_2020-9-24_20-30-34_No-00.png

The teaching materials appear to contain an error: the total number of educational PCs is read where the total number of learner PCs should be.

Data analysis and visualization

To get a sense of what trends can be read from the data, first display a scatterplot matrix. This time, I use the seaborn module.

import seaborn as sns

pg = sns.pairplot(pc)
print(type(pg))

seaborn_pairplot (1).png

From the teaching materials

Pairs with a clear linear trend, such as the number of students and the number of classrooms, can be analyzed with the correlation coefficient and simple regression analysis learned in "Information I". This time, we look at a pair without a clear linear trend, so let's consider wlan (wireless LAN) and spp (number of students per PC).

Following this, extract the values of wlan (wireless LAN) and spp (number of students per PC) and scale them.

Specifically, the values are standardized.

from sklearn.preprocessing import StandardScaler

#Value extraction(wlan spp)
pc_ws = pc[['wlan', 'spp']]

#Standardization(How to use Standard Scaler)
std_sc = StandardScaler()
std_sc.fit(pc_ws)
pcs = std_sc.transform(pc_ws)
pcs_df = pd.DataFrame(pcs, columns = pc_ws.columns)
display(pcs_df.head())

SnapCrab_NoName_2020-9-24_20-38-2_No-00.png

Because the two variables are different kinds of quantities measured on different scales, they are standardized in the same way as in the teaching materials. For more on standardization, my earlier article may be helpful. https://qiita.com/ereyester/items/b78b22a76a8f50006880
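As a reminder of what standardization does, each value x is replaced by (x - mean) / standard deviation, so every column ends up with mean 0 and standard deviation 1. The sketch below uses made-up numbers. Like StandardScaler, it uses the population standard deviation (`ddof=0`, which is NumPy's default).

```python
import numpy as np

# Made-up values standing in for one column of the data
x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardize: subtract the mean, divide by the population standard deviation
z = (x - x.mean()) / x.std()

# After standardization the mean is 0 and the standard deviation is 1
print(z.mean(), z.std())
```

Without this step, the variable with the larger numeric range would dominate the Euclidean distances used by k-means.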

Next, create and classify the model.

from sklearn.cluster import KMeans

#Creating a model
km = KMeans(init='random', n_clusters=2 , random_state=0)
#Fit the model and assign each row to a cluster
pc_cluster = km.fit_predict(pcs_df)
cluster_df = pd.DataFrame(pc_cluster, columns=['cluster'])

#Value extraction(pref wlan spp cluster)
pcs_cluster_df = pd.concat([pc[['pref', 'wlan', 'spp']], cluster_df], axis=1)
display(pcs_cluster_df.head())

SnapCrab_NoName_2020-9-24_20-40-49_No-00.png
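Since the model is fitted on standardized values, its centroids are also in standardized units. The sketch below shows one way to read them back in the original units with `inverse_transform`, and uses `init="k-means++"` with `n_init=10` so that the fit is repeated from several starting points. The data here is synthetic and only stands in for the wlan and spp columns. It is not the article's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two columns: two loose groups (values are made up)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([20.0, 6.0], [5.0, 0.5], size=(25, 2)),
               rng.normal([70.0, 4.0], [5.0, 0.5], size=(25, 2))])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

# k-means++ initialization, repeated 10 times; the best run (lowest inertia) is kept
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(Xs)

# Convert the centroids from standardized units back to the original units
centers_original = scaler.inverse_transform(km.cluster_centers_)
print(centers_original)
print(km.inertia_)  # sum of squared distances to the closest centroid
```

With the objects from the article, the corresponding call would be `std_sc.inverse_transform(km.cluster_centers_)`.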

I would like to confirm the result with a scatter plot.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

_, ax = plt.subplots(figsize=(5, 5), dpi=200)

sns.scatterplot(data=pcs_cluster_df, x="wlan", y="spp", hue="cluster", ax=ax)

for k, v in pcs_cluster_df.iterrows():
    ax.annotate(v['pref'],xy=(v['wlan'],v['spp']),size=5)

plt.show()

SnapCrab_NoName_2020-9-24_20-41-30_No-00.png

The two clusters appear to be separated mainly by the wireless LAN (wlan) value. Chiba and Saga prefectures also appear to lie far from the centers of their groups.
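Which points lie far from the center of their group can also be checked numerically by measuring the distance from each point to its own centroid. The sketch below uses made-up standardized points and labels, not the prefecture data.

```python
import numpy as np

# Made-up standardized points and their cluster labels
points = np.array([[-1.0, 0.2], [-0.8, -0.1], [-3.0, 2.5], [1.0, 0.0], [1.2, 0.3]])
labels = np.array([0, 0, 0, 1, 1])

# Centroid of each cluster = mean of its member points
centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])

# Euclidean distance from each point to the centroid of its own cluster
dist = np.linalg.norm(points - centroids[labels], axis=1)

# Index of the point that is farthest from its centroid
farthest = int(dist.argmax())
print(farthest, dist.round(2))
```

With the fitted model from the article, `km.transform(pcs_df)` returns the distance from each row to every centroid, which could be used in the same way.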

Further analysis

Next, let's plot the number of students against the number of learner PCs, a pair with a clear positive correlation, color-coded by the clusters obtained above.

#Value extraction(pref student pc cluster)
pcs_cluster2_df = pd.concat([pc[['pref', 'student', 'pc']], cluster_df], axis=1)

_, ax2 = plt.subplots(figsize=(5, 5), dpi=200)

sns.scatterplot(data=pcs_cluster2_df, x="student", y="pc", hue="cluster", ax=ax2)

for k, v in pcs_cluster2_df.iterrows():
    ax2.annotate(v['pref'],xy=(v['student'],v['pc']),size=5)

plt.show()

SnapCrab_NoName_2020-9-24_20-43-21_No-00.png

Prefectures with a high ratio of pc (total number of learner PCs) to student (number of students) tend to belong to the group with a high wlan (wireless LAN maintenance rate in ordinary classrooms), while the others tend to belong to the group with a low wlan. Saga Prefecture has a very high ratio of learner PCs to students, whereas Chiba Prefecture has a very low one, and these characteristics are visible in the plot.
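The tendency read from the plot could be checked by computing the ratio of PCs to students directly and averaging it per cluster. The sketch below uses a made-up table with the same column names as the article. The values and the `pc_per_student` column name are hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the article's columns (not the real survey data)
df = pd.DataFrame({
    "pref": ["A", "B", "C", "D"],
    "student": [1000, 2000, 1500, 3000],
    "pc": [500, 800, 300, 500],
    "cluster": [1, 1, 0, 0],
})

# Learner PCs per student for each prefecture
df["pc_per_student"] = df["pc"] / df["student"]

# Mean ratio within each cluster
summary = df.groupby("cluster")["pc_per_student"].mean()
print(summary)
```

Applied to `pcs_cluster2_df`, the same two lines would give a per-cluster figure to compare against the impression from the scatter plot.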

Source code

https://gist.github.com/ereyester/ce9370e3022f05f4d7548a8ccaed33cc
