[PYTHON] Machine learning in Delemas (practice)

This is a continuation of the previous post. Using profile data for 183 idols (as of April 2017) from [THE IDOLM@STER CINDERELLA GIRLS](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%83%9E%E3%82%B9%E3%82%BF%E3%83%BC_%E3%82%B7%E3%83%B3%E3%83%87%E3%83%AC%E3%83%A9%E3%82%AC%E3%83%BC%E3%83%AB%E3%82%BA), I will predict which of the three types (Cu, Co, Pa) each idol belongs to.

The following 16 items were acquired for each idol, giving a 183 x 16 matrix. The type is stored in the attribute column in the code. [Type, Name, Age, Birth, Constellation, Blood type, Height, Weight, B, W, H, Handedness, Hometown, Hobbies, CV, Implementation date]

Of these, the following 6 features are used this time to predict the type. [Age, Height, Weight, B, W, H]

Data shaping

Every column of the acquired data has dtype object, so convert the columns to numeric types.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt

def translate(df):
    """Convert the columns used for prediction to numeric types."""

    # Keep only the digits, then convert to float.
    # expand=False makes str.extract return a Series rather than a DataFrame.
    for col in ['age', 'body weight', 'B', 'W', 'H']:
        df[col] = df[col].str.extract('([0-9]+)', expand=False).astype(float)
    df['height'] = df['height'].astype(float)

    # Encode the attribute (type) as an integer label: Cu=0, Co=1, Pa=2
    df['attribute'] = df['attribute'].map({'Cu': 0, 'Co': 1, 'Pa': 2}).astype(int)

    return df

if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('aimasudata.csv')
    df = translate(df)

- Japanese text is sometimes mixed into fields such as age, so str.extract('([0-9]+)') is used to keep only the digits. (For example, "eternally ○ years old" becomes ○. Hooray!)
- The attribute values are converted to integers so that they can be used as labels for the SVM.
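The digit extraction described above can be tried on a few values in isolation. This is a minimal sketch: the sample strings are hypothetical and the real CSV may be formatted differently. It also uses `pd.to_numeric(..., errors="coerce")` instead of `astype(float)`, which is a swapped-in choice so that values without digits become NaN rather than raising an error.

```python
import pandas as pd

# Hypothetical raw values; the real CSV columns may be formatted differently.
raw = pd.DataFrame({
    "age": ["17歳", "永遠の17歳", "25歳", "?"],
    "height": ["149", "146", "171.5", "160"],
})

# expand=False returns a Series, so the result can be assigned to one column.
digits = raw["age"].str.extract(r"([0-9]+)", expand=False)

# errors="coerce" turns anything that cannot be parsed into NaN.
raw["age"] = pd.to_numeric(digits, errors="coerce")
raw["height"] = pd.to_numeric(raw["height"], errors="coerce")

print(raw.dtypes)
print(raw)
```

Rows that end up as NaN are the ones later removed by `dropna` before training.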

Data confirmation

Let's plot the data to see whether the types can really be told apart by machine learning.

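def checkdata(df, index):
    """Plot a histogram of column `index`, split by type."""

    # Get the data for each type (drop missing values before plotting)
    x1 = df.loc[df['attribute'] == 0, index].dropna()
    x2 = df.loc[df['attribute'] == 1, index].dropna()
    x3 = df.loc[df['attribute'] == 2, index].dropna()

    # Generate the histogram, one series per type
    plt.hist([x1, x2, x3], bins=16, label=['Cu', 'Co', 'Pa'])
    plt.legend()

    # Save the image
    plt.savefig("%s_graph.png" % index)

    # Display the image
    plt.show()

if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('row_data.csv')
    df = translate(df)
    checkdata(df, "age")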
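As a numeric complement to the histograms, a per-type summary can be sketched as follows. The rows in `toy` are invented for illustration and are not values from the dataset; with the real data, the same `groupby` would be applied to the DataFrame returned by `translate`.

```python
import pandas as pd

# Toy rows, invented for illustration; they are not values from the dataset.
toy = pd.DataFrame({
    "attribute": [0, 0, 1, 1, 2, 2],
    "height": [150.0, 154.0, 164.0, 168.0, 156.0, 162.0],
})

# Map the numeric codes back to type names for readability.
names = {0: "Cu", 1: "Co", 2: "Pa"}

# Count, mean, and standard deviation of the feature for each type.
summary = (
    toy.assign(type=toy["attribute"].map(names))
       .groupby("type")["height"]
       .agg(["count", "mean", "std"])
)
print(summary)
```

If the group means differ by much less than the standard deviations, the feature alone is unlikely to separate the types.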

Result

Age

[Figure: 年齢_graph.png (age histogram)] Blue: Cu, orange: Co, green: Pa. Co has a higher proportion of older idols.

Height

[Figure: 身長_graph.png (height histogram)] Cu seems to be on the short side and Co on the tall side. This feature shows the clearest difference.

Body weight

[Figure: 体重_graph.png (body weight histogram)] The difference is small, but Co is slightly heavier. Everyone is too light overall.

B

[Figure: B_graph.png]

W

[Figure: W_graph.png]

H

[Figure: H_graph.png] Across the body measurements, Co tends to have higher values overall. Separating Cu from Pa looks difficult.

Grid search

This time, an SVM is used to classify the three classes (Co, Cu, Pa). An SVM needs its parameters to be set, so first use grid search to determine the parameters to apply.

[Parameter optimization by grid search from Scikit learn](http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a)

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def gridsearch(df):
    tuned_parameters = [{'C': [1, 10, 100, 1000, 10000], 'kernel': ['rbf'], 'gamma': [0.01, 0.001, 0.0001]}]
    score = 'f1'
    clf = GridSearchCV(
        SVC(),  # classifier
        tuned_parameters,  # parameter grid to search
        cv=5,  # number of cross-validation folds
        scoring='%s_weighted' % score)  # evaluation metric for the model

    # Delete rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])
    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df["attribute"]

    clf.fit(X, y)

    # grid_scores_ was removed in scikit-learn 0.20; cv_results_ replaces it
    print("mean score for cross-validation:\n")
    results = clf.cv_results_
    for mean, std, params in zip(results['mean_test_score'], results['std_test_score'], results['params']):
        print("{:.3f} (+/- {:.3f}) for {}".format(mean, std / 2, params))
    print(clf.best_params_)
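To get a feel for how much work this grid represents, the candidate combinations can be enumerated by hand. The sketch below uses only the standard library; `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from itertools import product

# Same grid as in gridsearch() above.
tuned_parameters = [{
    "C": [1, 10, 100, 1000, 10000],
    "kernel": ["rbf"],
    "gamma": [0.01, 0.001, 0.0001],
}]
cv = 5

def expand_grid(grids):
    """Yield every parameter combination described by a list of grids."""
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))

candidates = list(expand_grid(tuned_parameters))
n_fits = len(candidates) * cv  # one fit per candidate per fold

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With its default settings, GridSearchCV also refits the best candidate once on all of the data after the search, so the total number of fits is one more than the count printed here.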

Result

[Screenshot 2017-04-07 22.15.29: output of the grid search]

C = 100 and gamma = 0.0001 appear to give the best result.

SVM implementation

Implement SVM using the parameters obtained by grid search.

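from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def dosvm(df):
    # Delete rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])

    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df["attribute"]

    # Hold out 20% of the rows as test data (the split is random on every run)
    data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2)

    # Train with the parameters found by the grid search
    clf = svm.SVC(kernel="rbf", C=100, gamma=0.0001)
    clf.fit(data_train, label_train)
    result = clf.predict(data_test)

    # Rows of the confusion matrix are true labels, columns are predictions
    cmat = confusion_matrix(label_test, result)
    acc = accuracy_score(label_test, result)

    print(cmat)
    print(acc)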

Result

[Screenshot 2017-04-07 22.22.49: output of dosvm]

Over about 100 trials, the classifier reached an accuracy of about 0.45. Looking at the confusion matrix, Pa does not seem to be predicted well.
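One way to read a confusion matrix like the one above is to compute the recall of each class, that is, the fraction of each true type that was predicted correctly. The sketch below uses an invented 3x3 matrix, not the article's actual result, and follows the scikit-learn convention that rows are true labels and columns are predictions.

```python
# Hypothetical 3x3 confusion matrix (rows: true type, columns: predicted type).
# The counts are invented for illustration and are not the article's result.
labels = ["Cu", "Co", "Pa"]
cmat = [
    [8, 3, 1],
    [2, 9, 1],
    [5, 5, 2],
]

def per_class_recall(matrix):
    """Return, for each true class, the fraction of its samples predicted correctly."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

def accuracy(matrix):
    """Overall accuracy: correct predictions (the diagonal) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

for name, r in zip(labels, per_class_recall(cmat)):
    print(f"{name}: recall = {r:.2f}")
print(f"accuracy = {accuracy(cmat):.2f}")
```

A class whose recall is far below the overall accuracy, like the last row in this made-up matrix, is the one the classifier is struggling with.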

Note

- When I started, I wondered whether the types could be identified at all, but the classifier did better than I expected.
- Parameter tuning is required when using SVM. (Without tuning, the accuracy was about 0.3.)
- This time I predicted the type from 6 features, but using height alone still gives an accuracy of about 0.42. Conversely, using the 5 features other than height gives about 0.36. I want to learn how to analyze the causes of results like these.
- I was Co (as expected).
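The comparison between feature subsets mentioned in the notes can be made repeatable by averaging the accuracy over many random splits. This is a minimal sketch under stated assumptions: `mean_accuracy` is a hypothetical helper, and the data is synthetic (one feature that differs by class and one that does not), standing in for the real profile data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mean_accuracy(X, y, n_trials=100, **svc_params):
    """Average test accuracy over n_trials random 80/20 splits."""
    scores = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        clf = SVC(**svc_params).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores))

# Synthetic stand-in data: 3 classes, 2 features, invented for illustration.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 60)
informative = rng.normal(loc=150 + 6 * y, scale=5.0)   # mean differs by class
noise = rng.normal(loc=45, scale=4.0, size=y.size)     # same for all classes
X = np.column_stack([informative, noise])

params = dict(kernel="rbf", C=100, gamma=0.0001)
print("both features   :", mean_accuracy(X, y, n_trials=20, **params))
print("informative only:", mean_accuracy(X[:, [0]], y, n_trials=20, **params))
print("noise only      :", mean_accuracy(X[:, [1]], y, n_trials=20, **params))
```

Fixing `random_state` per trial means every feature subset is evaluated on the same splits, so differences in the averages come from the features rather than from the random split.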
