Clustering representative schools in summer 2016 with scikit-learn

Introduction

It was summer vacation, so I tried out scikit-learn as a bit of study. Think of it as a summer vacation free-research project. The content is beginner level, so please bear with me.

I really wanted to do something that looks like proper machine learning, but I lack both the knowledge and the data, so I started with what I could manage. The summer high school baseball tournament at Koshien is just heating up (for me personally, at least), so I decided to cluster the representative schools using data from their local tournaments.

Data preparation

Collecting data such as individual player results would allow all kinds of analysis, but as a first step I decided to use basic team data such as batting average and ERA.

Batting results

I created the source data by referring to this site. It summarizes each representative school's batting average, home runs, sacrifice bunts, and stolen bases in its local tournament. One school stands out far above the rest in home runs. Incidentally, if you look closely, the schools are listed in prefecture-code order.

https://github.com/radiocat/study-sklearn/blob/master/hs-bb/batting-2016.csv

Pitcher performance

I created the source data by referring to [this site](http://koshien.site/wp/2016/08/05/%E9%AB%98%E6%A0%A1%E9%87%8E%E7%90%83%E5%A4%8F%E3%81%AE%E7%94%B2%E5%AD%90%E5%9C%92%E5%87%BA%E5%A0%B4%E6%A0%A1%E6%8A%95%E6%89%8B%E6%88%90%E7%B8%BE/). It summarizes the main pitchers, innings pitched, runs allowed, and ERA for each representative school's local tournament. When no single pitcher threw more than 60% of the team's innings, the site appears to base the figures on two or three pitchers. I counted the number of main pitchers and added it as a separate column.

https://github.com/radiocat/study-sklearn/blob/master/hs-bb/pitching-2016.csv

Clustering

I set the number of clusters to 5 for now, with no particular basis. The algorithm is k-means. It is not that I judged it to be the best choice; I simply don't know enough to pick anything else ...
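
Since 5 was picked without any basis, one common way to sanity-check the number of clusters is to compare silhouette scores across several candidate values. The following is only a rough sketch: the helper `silhouette_by_k` is a name I made up for illustration, and the data is synthetic (three obvious blobs), not the school data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, seed=0):
    """Return {k: silhouette score} for k-means fitted with each k (hypothetical helper)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Synthetic stand-in data: 3 well-separated blobs of 20 points each.
# This is NOT the real school data, just something with a known answer.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

scores = silhouette_by_k(demo, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(scores)
print("best k:", best_k)
```

With only 49 schools the scores on the real data may well be close to each other, so I would treat the result as a hint rather than a definitive answer.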

Batting results

#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

dataframe = pd.read_csv('batting-2016.csv')

array = np.array([dataframe['Number of games'].tolist(),
    dataframe['batting average'].tolist(),
    dataframe['Home run'].tolist(),
    dataframe['Sacrifice'].tolist(),
    dataframe['Stolen base'].tolist()
    ], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
array = array.T

predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)
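
One caveat about the script above: k-means works on Euclidean distances, so a column with a large numeric range (stolen bases) can drown out a column with a small range (batting average). A possible remedy is to standardize each column first. The sketch below uses made-up numbers, not the contents of batting-2016.csv, and the choice of `StandardScaler` is my own assumption rather than something the original analysis did.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows: [batting average, stolen bases].
# Rows 0-3 are weak-hitting teams, rows 4-7 are strong-hitting teams.
demo = np.array([
    [0.25, 2], [0.25, 4], [0.25, 5], [0.25, 7],
    [0.40, 2], [0.40, 4], [0.40, 5], [0.40, 7],
], dtype=float)

# Without scaling, the stolen-base column dominates the distance.
raw_labels = KMeans(n_clusters=2, n_init=20, random_state=0).fit_predict(demo)

# With scaling, every column has mean 0 and unit variance before clustering.
scaled_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=20, random_state=0),
)
scaled_labels = scaled_model.fit_predict(demo)

print("raw:   ", raw_labels)
print("scaled:", scaled_labels)
```

In this toy data the unscaled run groups teams by stolen bases alone, while the scaled run separates the two batting-average groups. Passing `random_state` also makes the labels reproducible from run to run.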

Pitcher performance

#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

dataframe = pd.read_csv('pitching-2016.csv')

array = np.array([dataframe['Number of pitchers'].tolist(),
    dataframe['Number of pitches'].tolist(),
    dataframe['Conceded'].tolist(),
    dataframe['Earned run average'].tolist()
    ], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
array = array.T

predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)

Overall results

I also tried combining the batting and pitching data. Note that this relies on both CSV files listing the schools in the same order.

#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

batting_dataframe = pd.read_csv('batting-2016.csv')
pitching_dataframe = pd.read_csv('pitching-2016.csv')

array = np.array([batting_dataframe['Number of games'].tolist(),
    batting_dataframe['batting average'].tolist(),
    batting_dataframe['Home run'].tolist(),
    batting_dataframe['Sacrifice'].tolist(),
    batting_dataframe['Stolen base'].tolist(),
    pitching_dataframe['Number of pitchers'].tolist(),
    pitching_dataframe['Number of pitches'].tolist(),
    pitching_dataframe['Conceded'].tolist(),
    pitching_dataframe['Earned run average'].tolist()
    ], dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
array = array.T

predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)
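
The combined script above lines the two files up purely by row position. If the files ever got out of order, a safer approach might be to join them on the school name. This is a minimal sketch with invented data; the key column name `school` is a placeholder I made up, since the actual header in the CSV files is not shown here.

```python
import pandas as pd

# Invented example rows; "school" is a hypothetical key column name.
batting = pd.DataFrame({
    "school": ["A High", "B High", "C High"],
    "batting average": [0.35, 0.30, 0.41],
})
pitching = pd.DataFrame({
    "school": ["C High", "A High", "B High"],  # deliberately in a different order
    "Earned run average": [1.2, 2.5, 3.1],
})

# validate="one_to_one" raises an error if a school appears twice in either table.
merged = batting.merge(pitching, on="school", how="inner", validate="one_to_one")

# Drop the key and keep only numeric features for clustering.
features = merged.drop(columns="school").to_numpy(dtype=float)
print(merged)
print(features.shape)
```

The resulting `features` array can be passed to `KMeans(...).fit_predict` in the same way as `array` in the script above.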

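Results

| School name | Batting | Pitching | Overall |
|---|---|---|---|
| Clark Memorial International | 1 | 2 | 2 |
| Hokkai | 2 | 0 | 1 |
| Hachinohe Gakuin Kosei | 3 | 2 | 3 |
| Moriokadai Fuzoku | 0 | 1 | 0 |
| Tohoku | 4 | 0 | 1 |
| Omagari | 0 | 2 | 2 |
| Tsuruoka Higashi | 1 | 4 | 0 |
| Seiko Gakuin | 0 | 3 | 0 |
| Joso Gakuin | 0 | 1 | 2 |
| Sakushin Gakuin | 0 | 3 | 0 |
| Maebashi Ikuei | 0 | 0 | 1 |
| Hanasaki Tokuharu | 1 | 1 | 3 |
| Kisarazu Sogo | 4 | 2 | 1 |
| Kanto Daiichi | 1 | 2 | 2 |
| Hachioji | 3 | 3 | 4 |
| Yokohama | 1 | 2 | 2 |
| Chuetsu | 3 | 3 | 4 |
| Toyama Daiichi | 4 | 4 | 3 |
| Seiryo | 1 | 3 | 0 |
| Hokuriku | 0 | 3 | 0 |
| Yamanashi Gakuin | 0 | 4 | 0 |
| Saku Chosei | 1 | 1 | 2 |
| Chukyo | 0 | 4 | 0 |
| Tokoha Kikugawa | 3 | 1 | 3 |
| Toho | 3 | 1 | 4 |
| Inabe Sogo | 0 | 4 | 0 |
| Omi | 0 | 1 | 2 |
| Kyoto Shoei | 0 | 1 | 2 |
| Riseisha | 1 | 0 | 1 |
| Ichi Amagasaki | 4 | 0 | 1 |
| Chiben Gakuen | 1 | 2 | 2 |
| Ichi Wakayama | 4 | 2 | 3 |
| Sakai | 0 | 3 | 0 |
| Izumo | 4 | 1 | 3 |
| Soshi Gakuen | 0 | 1 | 2 |
| Hiroshima Shinjo | 4 | 2 | 3 |
| Takagawa Gakuen | 3 | 0 | 3 |
| Naruto | 1 | 4 | 0 |
| Jinsei Gakuen | 4 | 3 | 3 |
| Matsuyama Seiryo | 4 | 2 | 3 |
| Meitoku Gijuku | 0 | 4 | 0 |
| Kyushu International University High School | 0 | 2 | 2 |
| Karatsu Commercial | 1 | 3 | 2 |
| Nagasaki Commercial | 1 | 2 | 2 |
| Shugakukan | 3 | 4 | 4 |
| Oita | 0 | 3 | 2 |
| Nichinan Gakuen | 0 | 4 | 0 |
| Shonan | 2 | 2 | 1 |
| Kadena | 1 | 2 | 2 |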

Summary

There is probably plenty to nitpick here. A cluster label of 0 or 4 does not mean a school is stronger; the labels only indicate that schools in the same cluster have numerically similar tendencies in the data. In any case, what happens at Koshien can be completely different from the local tournament data, so there is little point in digging too deeply.
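
Because the label numbers themselves carry no meaning, one way to see what each cluster represents is to look at the per-cluster averages of the input columns. Here is a small sketch on invented numbers (not the real results); the column names are borrowed from the scripts above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented rows for illustration only.
demo = pd.DataFrame({
    "batting average": [0.25, 0.26, 0.40, 0.41],
    "Earned run average": [1.0, 1.2, 3.0, 3.2],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo.to_numpy())

# Average of every feature within each cluster; the cluster ids are arbitrary,
# only the profile of each row in this summary is meaningful.
summary = demo.assign(cluster=model.labels_).groupby("cluster").mean()
print(summary)
```

Reading the summary row by row tells you, for example, whether a cluster is the "high batting average, high ERA" group, regardless of whether it happened to be numbered 0 or 1.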

Predicting the winning school is not the goal here, so I won't go further into the results. That said, I suspect different results would come out if the clustering were given more carefully prepared numbers: for example, converting the counts into per-game averages, averaging across pitchers when a school has several main pitchers, or looking at the innings pitched by each individual pitcher. I think this kind of preparation is an important part of data science, but I'll stop here for this time.
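
As a rough idea of what the per-game conversion could look like, here is a sketch using pandas. The helper `add_per_game_rates` is a name I made up, the numbers are invented, and the column names are the ones used in the scripts above.

```python
import pandas as pd


def add_per_game_rates(df, count_columns, games_column="Number of games"):
    """Return a copy of df with '<column> per game' added for each count column (hypothetical helper)."""
    out = df.copy()
    for col in count_columns:
        out[col + " per game"] = out[col] / out[games_column]
    return out


# Invented rows: two teams that played a different number of games.
demo = pd.DataFrame({
    "Number of games": [5, 7],
    "Home run": [5, 7],
    "Stolen base": [10, 7],
})

rates = add_per_game_rates(demo, ["Home run", "Stolen base"])
print(rates)
```

The two invented teams have different home run totals but the same rate per game, which is the kind of difference that raw counts would hide from the clustering.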
