Introduction

In an earlier article, "[R] Need numeric dependent variable for regression: cause and response", I solved a classification problem with a support vector machine in R. When I ported that code from R to Python, the data that R had handled as character strings (factor type, to be exact) produced the following error in Python unless it was converted to a numeric type.

model.fit(sze.iloc[train, 1:], sze.iloc[train, 0]) 

ValueError: could not convert string to float: 'G'

That would be inconvenient, so I suspected that Python could in fact handle strings in some way I did not know about.

Environment

Google Colaboratory
Python 3.6.9
sklearn 0.22.2

Research

Searching for "scikit-learn SVM string" did not turn up a good answer. I decided that searching in Japanese was not going to work and switched to English. Searching for "scikit-learn SVM string" and "scikit-learn SVM non-integer" led me to the following page: Non-Integer Class Labels Scikit-Learn - stackoverflow

The following code runs without an error.

from sklearn.svm import SVC
clf = SVC()
x = [[1,2,3], [4,5,6]]
y = ['dog', 'cat']
clf.fit(x,y)

yhat = clf.predict([[1,2,5]])
print(yhat[0])

Next, because I want to use rock-paper-scissors data, I used the strings G, C, and P (rock, scissors, and paper) in both the training data and the correct labels. This, however, gives the same error as before.

from sklearn.svm import SVC
clf = SVC()
x = [['G',2,3],['C',5,6]]
y = ['G', 'C']
clf.fit(x,y)

yhat = clf.predict([['P',2,5]])
print(yhat[0])

ValueError: could not convert string to float: 'G'

As in the sample, no error occurred when only the correct labels were non-numeric strings. (The '1' and '4' in the training data below are strings, but they can be parsed as numbers.)

from sklearn.svm import SVC
clf = SVC()
x = [['1',2,3],['4',5,6]]
y = ['G', 'C']
clf.fit(x,y)

yhat = clf.predict([['1',2,5]])
print(yhat[0])

Because an SVM (support vector machine) computes on numeric feature vectors, it apparently cannot be trained when the training data contains character data such as URLs. Reference: The URL string cannot be read with machine learning Value Error.

The training data must therefore be numeric rather than character strings.
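As a minimal sketch of that conclusion, the example below keeps the labels as strings and makes every feature numeric. The feature values are hypothetical toy numbers, not the Sazae-san data.

```python
from sklearn.svm import SVC

# Hypothetical toy features: all numeric, so no conversion error occurs
X = [[0, 2, 3], [1, 5, 6], [2, 8, 9]]
# The labels can stay as strings
y = ['G', 'C', 'P']

clf = SVC()
clf.fit(X, y)

# The classifier keeps the distinct labels, sorted, in classes_
print(clf.classes_)

pred = clf.predict([[1, 5, 6]])
print(pred[0])  # one of 'G', 'C', 'P'
```

The classifier handles the string labels on its own, so only the feature side needs converting.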

Change from string type to numeric type

Use category encoding to change from a string type to a numeric type. There are many kinds of category encoding, such as label encoding and one-hot encoding. scikit-learn provides them in its preprocessing module (sklearn.preprocessing).

LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['C', 'P', 'G', 'C', 'C'])

print(le.classes_)
print(le.transform(['G', 'C', 'P', 'C'])) 
print(le.inverse_transform([1, 0, 2, 0]))

#result
['C' 'G' 'P']
[1 0 2 0]
['G' 'C' 'P' 'C']

Pass the data you want to convert to le.fit(). You can check the assigned values in le.classes_. Numbers are assigned in ascending alphabetical order. Convert to numbers with le.transform(), and convert back to string data with le.inverse_transform().

OneHotEncoder

Meaning and advantages / disadvantages of One-hot vector

If you pass a one-dimensional array as you would to LabelEncoder, you get the error "Expected 2D array, got 1D array instead", so pass a two-dimensional array. If you want to convert a one-dimensional array, use LabelBinarizer, described later. Two-dimensional arrays are more convenient when using pandas DataFrames.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ohe = OneHotEncoder()
ohe.fit([['C'], ['P'], ['G'], ['C'], ['C']])
ct = ColumnTransformer([("category", ohe, [0])], remainder="passthrough")

print(ohe.categories_)
print(ct.fit_transform([['G'], ['C'], ['P'], ['C']])) 
print(ohe.inverse_transform([[0,1,0],[1,0,0],[0,0,1],[1,0,0]]))

#result
[array(['C', 'G', 'P'], dtype=object)]
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
[['G']
 ['C']
 ['P']
 ['C']]

Pass the data you want to convert to ohe.fit(). You can check the assigned values in ohe.categories_. Categories are ordered in ascending alphabetical order. Convert to numbers with ct.fit_transform(), and convert back to string data with ohe.inverse_transform().

The 'categorical_features' keyword that appears in older OneHotEncoder samples was deprecated in version 0.20 and removed in 0.22; using 'ColumnTransformer' is suggested instead. Reference: Try using scikit-learn's ColumnTransformer - Quiet name

You can also use pandas' 'get_dummies' for one-hot encoding. Choose whichever suits your purpose. Reference: Use pandas get_dummies - quiet nomenclature
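For comparison, here is a minimal get_dummies sketch. The DataFrame and its column names ('hand', 'count') are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame: one string column and one numeric column
df = pd.DataFrame({'hand': ['G', 'C', 'P', 'C'], 'count': [2, 5, 8, 5]})

# Encode only the 'hand' column; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df, columns=['hand'], dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so it is better suited to one-off conversions than to transforming new data later.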

OneHotEncoder is really intended for one-hot encoding several categorical columns at once, as shown below. Reference: OneHotEncoder - Taustation

from sklearn.preprocessing import OneHotEncoder

X = [
    ['Tokyo', 'Male'],
    ['Tokyo', 'Female'],
    ['Osaka', 'Male'],
    ['Kyoto', 'Female'],
    ['Osaka', 'Female'],
    ['Osaka', 'Male']
]

ohe = OneHotEncoder(sparse=False)  # this argument is named sparse_output in scikit-learn 1.2 and later
ohe.fit(X)

print(ohe.categories_)
print(ohe.transform(X)) 

#result
[array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object)]
[[0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0.]
 [0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 1. 0. 0. 1.]]

LabelBinarizer

OneHotEncoder had to be given a two-dimensional array, but LabelBinarizer can convert a one-dimensional array to a one-hot representation.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['C', 'P', 'G', 'C', 'C'])

print(lb.classes_)
print(lb.transform(['G', 'C', 'P', 'C']))
print(lb.inverse_transform(np.array([[0,1,0],[1,0,0],[0,0,1],[1,0,0]])))

#result
['C' 'G' 'P']
[[0 1 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]]
['G' 'C' 'P' 'C']

Pass the data you want to convert to lb.fit(). You can check the assigned values in lb.classes_. Classes are ordered in ascending alphabetical order. Convert to numbers with lb.transform(), and convert back to string data with lb.inverse_transform(). Note that inverse_transform() requires an np.array.

Scikit-learn's LabelBinarizer vs. OneHotEncoder - stackoverflow

OrdinalEncoder

Numbers are assigned in ascending alphabetical order, as with LabelEncoder. The difference is that the data is passed as a two-dimensional array, as with OneHotEncoder. Two-dimensional arrays are more convenient when using pandas DataFrames.

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit([['C'], ['P'], ['G'], ['C'], ['C']])

print(oe.categories_)
print(oe.fit_transform([['G'], ['C'], ['P'], ['C']])) 
print(oe.inverse_transform([[1],[0],[2],[0]]))

#result
[array(['C', 'G', 'P'], dtype=object)]
[[1.]
 [0.]
 [2.]
 [0.]]
[['G']
 ['C']
 ['P']
 ['C']]

Pass the data you want to convert to oe.fit(). You can check the assigned values in oe.categories_. Numbers are assigned in ascending alphabetical order. Convert to numbers with oe.fit_transform(), and convert back to string data with oe.inverse_transform().

Because OrdinalEncoder can handle several columns at once, the idea is to use LabelEncoder on the correct-label side and OrdinalEncoder on the training-data side.

from sklearn.preprocessing import OrdinalEncoder

X = [
    ['Tokyo', 'Male'],
    ['Tokyo', 'Female'],
    ['Osaka', 'Male'],
    ['Kyoto', 'Female'],
    ['Osaka', 'Female'],
    ['Osaka', 'Male']
]
 
oe = OrdinalEncoder()
oe.fit(X)

print(oe.categories_)
print(oe.transform(X)) 

#result
[array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object)]
[[2. 1.]
 [2. 0.]
 [1. 1.]
 [0. 0.]
 [1. 0.]
 [1. 1.]]
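Putting the pieces together, the following is a rough sketch of one way to feed a string column to an SVM: one-hot encode it with ColumnTransformer inside a pipeline and leave the labels as strings. The toy data and the column names ('hand', 'a', 'b') are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: one string column and two numeric columns
X = pd.DataFrame({
    'hand': ['G', 'C', 'P', 'G', 'C', 'P'],
    'a': [2, 5, 8, 1, 4, 7],
    'b': [3, 6, 9, 2, 6, 9],
})
y = ['G', 'C', 'P', 'G', 'C', 'P']  # labels stay as strings

# One-hot encode only 'hand'; pass the numeric columns through unchanged
ct = ColumnTransformer([('hand', OneHotEncoder(), ['hand'])],
                       remainder='passthrough')
model = make_pipeline(ct, SVC())
model.fit(X, y)

new = pd.DataFrame({'hand': ['P'], 'a': [2], 'b': [5]})
pred = model.predict(new)
print(pred[0])  # one of 'G', 'C', 'P'
```

Keeping the encoder inside the pipeline means the same conversion is applied automatically at prediction time.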

CustomerRating

In Sazae-san's rock-paper-scissors, statistics show that the hands rank G > C > P in strength, so assigning numbers in ascending alphabetical order, as LabelEncoder and OrdinalEncoder do, gives a different order. If you replaced C with another letter that falls between G and P, you could use LabelEncoder and OrdinalEncoder as they are, but I want to assign the numbers while keeping the symbols.

Here I convert to the numbers I want to assign by using a dictionary with applymap.

import pandas as pd

rating = {'G' : 1, 'C' : 2, 'P' : 3}
df = pd.DataFrame(['C', 'P', 'G', 'C', 'C'])
print(df.applymap(lambda x : rating[x]))

#result
   0
0  2
1  3
2  1
3  2
4  2
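As an alternative sketch, OrdinalEncoder accepts a categories argument that fixes the order explicitly, so the symbols can be kept. Note that the codes start at 0 rather than 1, unlike the dictionary above.

```python
from sklearn.preprocessing import OrdinalEncoder

# Fix the order explicitly: G -> 0, C -> 1, P -> 2
oe = OrdinalEncoder(categories=[['G', 'C', 'P']])
codes = oe.fit_transform([['C'], ['P'], ['G'], ['C'], ['C']])

print(oe.categories_)
print(codes.ravel())  # [1. 2. 0. 1. 1.]
```

If values starting at 1 are needed, as in the rating dictionary, adding 1 to the result gives the same numbers.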

Finally

If different numbers appear in the same column, some models will wrongly assume that the values have an order (0 < 1 < 2). One-hot encoding is used to avoid this problem. Conversely, when a particular order does matter (G > C > P), assigning different numbers within the same column can be the right choice.
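A small sketch can make this concrete. It uses hand-written encodings (not the encoders above) and compares Euclidean distances between the three hands, which is the kind of quantity a distance-based model reacts to.

```python
import numpy as np

# Hand-written encodings for illustration only
ordinal = {'C': np.array([0.0]), 'G': np.array([1.0]), 'P': np.array([2.0])}
one_hot = {
    'C': np.array([1.0, 0.0, 0.0]),
    'G': np.array([0.0, 1.0, 0.0]),
    'P': np.array([0.0, 0.0, 1.0]),
}

def dist(enc, a, b):
    """Euclidean distance between the encodings of hands a and b."""
    return float(np.linalg.norm(enc[a] - enc[b]))

for a, b in [('C', 'G'), ('G', 'P'), ('C', 'P')]:
    print(a, b, dist(ordinal, a, b), round(dist(one_hot, a, b), 3))
```

With the ordinal codes, C and P end up twice as far apart as the other pairs, while with one-hot vectors every pair is the same distance apart.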

I feel I finally understand the point of one-hot encoding. There are inputs (explanatory variables, independent variables) and outputs (objective variables, dependent variables). On the explanatory-variable side, one-hot encoding is the basic choice for both regression and classification, while on the objective-variable side a character string causes no problem.

[Scikit-learn] How to use Label Encoder and One Hot Encoder properly - teratail

[GO] [Python] Use string data with scikit-learn SVM
