1. Introduction

As a machine learning tutorial, and as a reminder for myself, I am recording the method I used to predict the species of iris flowers, a problem that almost everyone works through when starting out.

The versions used are as follows.

Python　3.7.6
numpy　1.18.1
pandas　1.0.1
matplotlib　3.1.3
seaborn　0.10.0
scikit-learn　0.22.1
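
To compare your own environment with the versions listed above, something like the following sketch could be used. The helper name installed_versions is made up for this example, and it assumes each library exposes a __version__ attribute, which is the usual convention for these packages.

```python
import importlib
import sys


def installed_versions(module_names):
    """Return {module name: version string}, or None if the module is missing."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Fall back to "unknown" when a module has no __version__ attribute
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


print("Python", sys.version.split()[0])
libraries = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]
for name, version in installed_versions(libraries).items():
    print(name, version or "not installed")
```

Newer versions than the ones listed should mostly work, but some keyword arguments have been renamed between releases, so differences in warnings or results are possible.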

2. What is the classification of irises?

2-1 Outline of iris problem

There are three species of iris in this dataset: "setosa", "versicolor" and "virginica". Each flower is described by four measurements: the length and width of its sepals (Sepal) and of its petals (Petal). The problem this time is to predict which of the three species a flower belongs to from these four features.
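
To make the four features and three classes concrete, here is a minimal sketch that loads the dataset bundled with scikit-learn and prints its shape, feature names, and class names. It only uses attributes of the object returned by load_iris.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is a (samples, features) array; target holds one integer label per sample
print(iris.data.shape)
print(iris.feature_names)
print(iris.target_names)
```

The integer labels in iris.target are indices into iris.target_names, which is where the correspondence between numbers and species names comes from.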

2-2 About the program

Importing the libraries


import numpy as np
import pandas as pd
from pandas import Series,DataFrame

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

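from sklearn.datasets import load_iris

# Load the dataset: x holds the four features, y the species labels (0, 1, 2)
iris = load_iris()
x = iris.data
y = iris.target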

This time, we load numpy, pandas, matplotlib, seaborn, and sklearn. The iris dataset itself comes from sklearn.datasets.

Take a look at the data


iris_data = DataFrame(x, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
iris_data

There are 150 samples. Each row lists the length and width of the sepals and petals, in centimeters.
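
Rather than reading the table by eye, the sample count, the units, and the value ranges can also be checked programmatically. The following is a small sketch of that, rebuilding the same DataFrame so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
iris_data = pd.DataFrame(iris.data, columns=columns)

print(iris_data.shape)           # (number of samples, number of features)
print(iris.feature_names)        # the original names include the unit
print(iris_data.isnull().sum())  # missing values per column
print(iris_data.describe())      # count, mean, std, min, quartiles, max
```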

Next, let's look at the types of flowers.


iris_target = DataFrame(y, columns =['Species'])
iris_target

You can see that the species is already encoded as a number rather than as the name of the flower. The data could be processed as it is, but then you would have to remember which number corresponds to which name, so let's map the numbers to names.


# Define a function that maps each numeric label to its species name
def flower(num):
    if num == 0:
        return 'Setosa'
    elif num == 1:
        return 'Versicolor'
    else:
        return 'Virginica'

iris_target['Species'] = iris_target['Species'].apply(flower)
iris_target

Now that the names are filled in, the table is easier to understand.

Check the correlation for each variable


# Combine features and species into one table, then plot every pair of variables
iris = pd.concat([iris_data, iris_target], axis=1)
sns.pairplot(iris, hue='Species', hue_order=['Virginica', 'Versicolor', 'Setosa'], height=2, palette="husl")

This plots the relationship between every pair of variables. With seaborn's pairplot function it takes a single line. Looking at the plots, Setosa is clearly separated from the other two species. Virginica and Versicolor, on the other hand, overlap in Sepal Length, so it seems difficult to separate them with that feature alone.

Looking at the actual flowers, Setosa is also the species whose flowers are small overall.
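
What the plots show can also be put into numbers. The sketch below computes the mean of each feature per species and the correlation matrix of the features. It rebuilds the table from the dataset so that it runs on its own, and it uses the lowercase species names stored in the dataset rather than the hand-mapped ones.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
frame = pd.DataFrame(data.data, columns=columns)
# Index the array of names with the integer labels to get one name per row
frame['Species'] = data.target_names[data.target]

species_means = frame.groupby('Species')[columns].mean()
correlations = frame[columns].corr()

print(species_means)
print(correlations)
```

In this table the petal measurements are where setosa stands out most clearly, which matches the separation visible in the pair plot.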

2-3 Prediction using logistic regression


# Import logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()

# Hold out 30% of the data as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)
logreg.fit(x_train, y_train)

# Accuracy (computed with accuracy_score)
from sklearn import metrics
y_pred = logreg.predict(x_test)
metrics.accuracy_score(y_test, y_pred)

Accuracy: 0.9777777777777777
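
A single accuracy figure does not show which species get confused with each other. One way to look into that is a confusion matrix, sketched below with the same split as above. Note that max_iter=1000 is an assumption added for this example to avoid possible convergence warnings, so the exact numbers may differ slightly from the result reported above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3)

# max_iter=1000 is an assumption for this sketch, not part of the original code
logreg = LogisticRegression(max_iter=1000)
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

# Rows are the true classes, columns the predicted classes (0, 1, 2)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(accuracy)
```

The diagonal holds the correctly classified samples, so the accuracy equals the sum of the diagonal divided by the total number of test samples.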

This time, we used logistic regression for the analysis. In its basic form, logistic regression is a model whose target variable is binary, 0 or 1. In other words, it is a way to decide between two outcomes, such as "genuine" or "fake", or "benign" or "malignant".
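
The binary case can be sketched in a few lines: a weighted sum of the features is passed through the sigmoid function to get a probability between 0 and 1, which is then thresholded. The function names and the parameter values below are hypothetical and chosen by hand for illustration; they are not fitted to the iris data.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_binary(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    probability = sigmoid(np.dot(features, weights) + bias)
    return probability, int(probability >= threshold)


# Hypothetical, hand-picked parameters for illustration only
weights = np.array([1.5, -0.5])
bias = -1.0

probability, label = predict_binary(np.array([2.0, 1.0]), weights, bias)
print(probability, label)
```

Fitting a logistic regression model amounts to choosing the weights and bias so that these probabilities agree with the training labels as well as possible.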

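In this case, we applied it to a problem with three classes. Logistic regression can be applied to problems with three or more classes. One way to picture this is that the multiclass problem is broken down into separate two-class problems, each deciding whether a sample belongs to one class or not, and the results are then combined. Depending on the version and settings, scikit-learn's LogisticRegression may instead fit a single multinomial model over all classes.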
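
The idea of splitting a multiclass problem into two-class problems can be sketched with scikit-learn's OneVsRestClassifier, which trains one binary classifier per class. This is an illustration of the idea under the assumption of a one-vs-rest scheme, not necessarily what LogisticRegression() does internally with its default settings. max_iter=1000 is again an assumption added for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=3)

# One binary logistic regression per class: "this class" versus "the rest"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(x_train, y_train)
print(len(ovr.estimators_))  # number of binary classifiers

# Each binary classifier scores every test sample; the highest score wins
scores = np.column_stack(
    [estimator.decision_function(x_test) for estimator in ovr.estimators_])
manual_pred = ovr.classes_[scores.argmax(axis=1)]

print(np.array_equal(manual_pred, ovr.predict(x_test)))
```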

In this case, the accuracy on the test data was 97.8%, so this method appears to work well for this problem.
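
Because the accuracy above comes from a single train/test split, it depends to some extent on random_state. One way to get a less split-dependent estimate is cross-validation, sketched below. The choice of 5 folds and max_iter=1000 are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()

# Train and evaluate on 5 different splits; each sample is tested exactly once
scores = cross_val_score(
    LogisticRegression(max_iter=1000), iris.data, iris.target, cv=5)

print(scores)
print(scores.mean())
```

If the mean is close to the single-split result, that supports the conclusion that the accuracy is not an artifact of one lucky split.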

Reference URLs

https://dev.classmethod.jp/machine-learning/logistic-regression-impl/
http://www.msi.co.jp/nuopt/docs/v20/examples/html/02-18-00.html

3. Full program


import numpy as np
import pandas as pd
from pandas import Series,DataFrame

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

%matplotlib inline

from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data
y = iris.target

iris_data = DataFrame(x, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
iris_target = DataFrame(y, columns=['Species'])

def flower(num):
    if num ==0:
        return 'Setosa'
    elif num == 1:
        return 'Versicolor'
    else:
        return 'Virginica'

iris_target['Species'] = iris_target['Species'].apply(flower)

iris = pd.concat([iris_data, iris_target], axis=1)

# Import logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

logreg = LogisticRegression()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=3)

logreg.fit(x_train, y_train)

from sklearn import metrics
y_pred = logreg.predict(x_test)

metrics.accuracy_score(y_test, y_pred)

[PYTHON] Solving the iris problem with scikit-learn ver1.0 (logistic regression)