[Python] Use string data with scikit-learn SVM

Introduction

In my earlier article, "[R] Need numeric dependent variable for regression: cause and response", I got a classification problem working with a support vector machine in R. This time I tried to port that code from R to Python. In R the data was character string data (to be exact, factor type), but in Python the following error occurs unless the data is converted to a numeric type.

model.fit(sze.iloc[train, 1:], sze.iloc[train, 0]) 

ValueError: could not convert string to float: 'G'

That would be inconvenient, so I suspected that Python could in fact handle string data in some way I did not know about.

Environment

Research

Searching for "scikit-learn SVM string" in Japanese did not turn up a good answer. I decided that searching in Japanese was not going to work and switched to English. Searching for "Scikit-learn SVM String" and "scikit-learn SVM Non-Integer" led me to the following page: "Non-Integer Class Labels Scikit-Learn" (Stack Overflow).

The following code from that page runs without an error.

from sklearn.svm import SVC
clf = SVC()
x = [[1,2,3], [4,5,6]]
y = ['dog', 'cat']
clf.fit(x,y)

yhat = clf.predict([[1,2,5]])
print(yhat[0])

Next, since I want to work with rock-paper-scissors data, I used the strings G, C, and P (rock, scissors, paper) in both the training data and the correct labels. However, this gives the same error as before.

from sklearn.svm import SVC
clf = SVC()
x = [['G',2,3],['C',5,6]]
y = ['G', 'C']
clf.fit(x,y)

yhat = clf.predict([['P',2,5]])
print(yhat[0])

ValueError: could not convert string to float: 'G'

As in the sample, no error occurred when only the correct labels were letters. Note that the features in this example contain the numeric strings '1' and '4' rather than letters such as 'G'.

from sklearn.svm import SVC
clf = SVC()
x = [['1',2,3],['4',5,6]]
y = ['G', 'C']
clf.fit(x,y)

yhat = clf.predict([['1',2,5]])
print(yhat[0])

Since an SVM (support vector machine) is at its core a binary classifier that computes on numeric feature vectors, it seems that it cannot classify when the training data contains character data such as URLs. See "The URL string cannot be read with machine learning ValueError".

The training data must therefore be numeric rather than string data.

Change from string type to numeric type

Use category encoding to convert string data to a numeric type. There are many kinds of category encoding, such as label encoding and one-hot encoding. scikit-learn provides them as preprocessing tools for machine learning in sklearn.preprocessing.
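
As a rough end-to-end illustration of the idea above, the sketch below one-hot encodes a string feature column and leaves the string labels untouched before training an SVC. The toy data, the column names (hand, a, b), and the choice of a Pipeline with ColumnTransformer are assumptions made for illustration, not the setup of the original data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: "hand" is a string feature, "a" and "b" are numeric.
X_train = pd.DataFrame(
    [['G', 2, 3], ['C', 5, 6], ['P', 8, 9], ['G', 1, 2], ['C', 6, 5], ['P', 9, 8]],
    columns=['hand', 'a', 'b'],
)
y_train = ['G', 'C', 'P', 'G', 'C', 'P']  # string labels are passed as they are

# One-hot encode only the string column; pass the numeric columns through.
encode = ColumnTransformer(
    [('category', OneHotEncoder(handle_unknown='ignore'), ['hand'])],
    remainder='passthrough',
)
model = make_pipeline(encode, SVC())
model.fit(X_train, y_train)

# New rows go through the same encoding before prediction.
X_new = pd.DataFrame([['P', 8, 8]], columns=['hand', 'a', 'b'])
pred = model.predict(X_new)
print(pred[0])
```

Keeping the encoder and the classifier in one pipeline means the same encoding is applied at fit time and at predict time.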

LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['C', 'P', 'G', 'C', 'C'])

print(le.classes_)
print(le.transform(['G', 'C', 'P', 'C'])) 
print(le.inverse_transform([1, 0, 2, 0]))

#result
['C' 'G' 'P']
[1 0 2 0]
['G' 'C' 'P' 'C']

Pass the data you want to encode to le.fit(). You can check the assigned values in le.classes_; numbers are assigned in ascending alphabetical order. Convert to numbers with le.transform(), and convert back to string data with le.inverse_transform().
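
One behavior worth knowing when using LabelEncoder this way: a label that was not seen by fit() cannot be transformed. The sketch below, using a made-up unseen label 'X', builds the label-to-number mapping from classes_ and shows the rejection.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['C', 'P', 'G'])

# Map each class to its assigned integer for easy inspection.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'C': 0, 'G': 1, 'P': 2}

# 'X' is a hypothetical label that was not passed to fit().
try:
    le.transform(['X'])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print('unseen label rejected:', err)
```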

OneHotEncoder

See "Meaning and advantages / disadvantages of One-hot vector".

If you pass a one-dimensional array as you would to LabelEncoder, you get the error "Expected 2D array, got 1D array instead", so pass a two-dimensional array. If you want to convert a one-dimensional array, use LabelBinarizer, described later. Two-dimensional arrays are more convenient when using pandas DataFrames.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ohe = OneHotEncoder()
ohe.fit([['C'], ['P'], ['G'], ['C'], ['C']])
ct = ColumnTransformer([("category", ohe, [0])], remainder="passthrough")

print(ohe.categories_)
print(ct.fit_transform([['G'], ['C'], ['P'], ['C']])) 
print(ohe.inverse_transform([[0,1,0],[1,0,0],[0,0,1],[1,0,0]]))

#result
[array(['C', 'G', 'P'], dtype=object)]
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
[['G']
 ['C']
 ['P']
 ['C']]

Pass the data you want to encode to ohe.fit(). You can check the assigned categories in ohe.categories_; one-hot columns are assigned in ascending alphabetical order. Convert to numbers with ct.fit_transform(), and convert back to string data with ohe.inverse_transform().
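
A related point for prediction time: a category that did not appear during fit() can be handled with the handle_unknown='ignore' option, which encodes it as an all-zero row. This is a minimal sketch with a made-up unseen value 'X'; whether silently ignoring unknown values is appropriate depends on the data.

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['C'], ['P'], ['G']])

# 'X' is a hypothetical value that was not present when fitting.
encoded = ohe.transform([['G'], ['X']]).toarray()
print(encoded)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```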

The 'categorical_features' keyword that appears in older OneHotEncoder samples was deprecated in version 0.20 and removed in 0.22; using 'ColumnTransformer' instead is recommended. See "Try using scikit-learn's ColumnTransformer" (Quiet name).

You can also use pandas 'get_dummies' for one-hot encoding. Choose whichever fits your purpose. See "Use pandas get_dummies" (quiet nomenclature).
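
For comparison, a small get_dummies sketch follows. The DataFrame and its column names (hand, value) are made up for illustration. Unlike a fitted OneHotEncoder, get_dummies derives its columns from whatever data it is given, so training and test data have to be encoded consistently by the caller.

```python
import pandas as pd

# Hypothetical data: one string column and one numeric column.
df = pd.DataFrame({'hand': ['G', 'C', 'P', 'C'], 'value': [2, 5, 8, 6]})

# Expand only the "hand" column; dtype=int gives 0/1 integers.
dummies = pd.get_dummies(df, columns=['hand'], dtype=int)
print(dummies)
```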

The intended use of OneHotEncoder is to one-hot encode several categorical columns at once, as shown below. See "OneHotEncoder" (Taustation).

from sklearn.preprocessing import OneHotEncoder

X = [
    ['Tokyo', 'Male'],
    ['Tokyo', 'Female'],
    ['Osaka', 'Male'],
    ['Kyoto', 'Female'],
    ['Osaka', 'Female'],
    ['Osaka', 'Male']
]

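# Keep the default sparse output and convert it with toarray(); the
# sparse=False argument was renamed to sparse_output in newer scikit-learn releases.
ohe = OneHotEncoder()
ohe.fit(X)

print(ohe.categories_)
print(ohe.transform(X).toarray())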

#result
[array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object)]
[[0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0.]
 [0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 1. 0. 0. 1.]]

LabelBinarizer

OneHotEncoder has to be given a two-dimensional array, but LabelBinarizer can convert a one-dimensional array to a one-hot representation.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['C', 'P', 'G', 'C', 'C'])

print(lb.classes_)
print(lb.transform(['G', 'C', 'P', 'C'])) 
print(lb.inverse_transform(np.array([[0,1,0],[1,0,0],[0,0,1],[1,0,0]])))

#result
['C' 'G' 'P']
[[0 1 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]]
['G' 'C' 'P' 'C']

Pass the data you want to encode to lb.fit(). You can check the assigned classes in lb.classes_; one-hot columns are assigned in ascending alphabetical order. Convert with lb.transform(), and convert back to string data with lb.inverse_transform(). Note that inverse_transform() requires an np.array.
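
One detail that can be surprising with LabelBinarizer: when there are only two classes, the output is a single 0/1 column rather than two one-hot columns. The sketch below illustrates this with a made-up two-class list.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
# Only two distinct labels: the result has one column, not two.
two_class = lb.fit_transform(['G', 'C', 'G', 'C'])

print(lb.classes_)      # ['C' 'G']
print(two_class.shape)  # (4, 1)
print(two_class.ravel())  # [1 0 1 0]
```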

See "Scikit-learn's LabelBinarizer vs. OneHotEncoder" (Stack Overflow).

OrdinalEncoder

Numbers are assigned in ascending alphabetical order, as with LabelEncoder. The difference is that the input is passed as a two-dimensional array, as with OneHotEncoder. Two-dimensional arrays are more convenient when using pandas DataFrames.

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
oe.fit([['C'], ['P'], ['G'], ['C'], ['C']])

print(oe.categories_)
print(oe.fit_transform([['G'], ['C'], ['P'], ['C']])) 
print(oe.inverse_transform([[1],[0],[2],[0]]))

#result
[array(['C', 'G', 'P'], dtype=object)]
[[1.]
 [0.]
 [2.]
 [0.]]
[['G']
 ['C']
 ['P']
 ['C']]

Pass the data you want to encode to oe.fit(). You can check the assigned categories in oe.categories_; numbers are assigned in ascending alphabetical order. Convert to numbers with oe.fit_transform() (or oe.transform() once the encoder is fitted), and convert back to string data with oe.inverse_transform().

Because OrdinalEncoder can encode multiple columns at once, the picture is that LabelEncoder is used on the correct-label side and OrdinalEncoder on the training-data side.

from sklearn.preprocessing import OrdinalEncoder

X = [
    ['Tokyo', 'Male'],
    ['Tokyo', 'Female'],
    ['Osaka', 'Male'],
    ['Kyoto', 'Female'],
    ['Osaka', 'Female'],
    ['Osaka', 'Male']
]
 
oe = OrdinalEncoder()
oe.fit(X)

print(oe.categories_)
print(oe.transform(X)) 

#result
[array(['Kyoto', 'Osaka', 'Tokyo'], dtype=object), array(['Female', 'Male'], dtype=object)]
[[2. 1.]
 [2. 0.]
 [1. 1.]
 [0. 0.]
 [1. 0.]
 [1. 1.]]
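
Putting the two encoders together, a sketch of that division of labor might look like the following. The labels attached to the city/gender rows are invented for illustration, and encoding the labels is optional here since SVC also accepts string labels directly.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.svm import SVC

X = [
    ['Tokyo', 'Male'],
    ['Tokyo', 'Female'],
    ['Osaka', 'Male'],
    ['Kyoto', 'Female'],
    ['Osaka', 'Female'],
    ['Osaka', 'Male'],
]
y = ['G', 'C', 'P', 'G', 'C', 'P']  # hypothetical labels for illustration

# Features: OrdinalEncoder handles several columns at once.
oe = OrdinalEncoder()
X_num = oe.fit_transform(X)

# Labels: LabelEncoder works on a one-dimensional list.
le = LabelEncoder()
y_num = le.fit_transform(y)

clf = SVC()
clf.fit(X_num, y_num)

# Encode new data with the fitted encoder, then decode the prediction.
X_new = oe.transform([['Kyoto', 'Male']])
pred = le.inverse_transform(clf.predict(X_new))
print(pred[0])
```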

CustomerRating

In Sazae-san's rock-paper-scissors, the statistics show that the order of strength is G > C > P, so assigning numbers in ascending alphabetical order, as LabelEncoder and OrdinalEncoder do, gives a different order (C, G, P). If you replaced C with a letter that falls between G and P alphabetically, you could use LabelEncoder and OrdinalEncoder as they are, but I want to assign the numbers while keeping the symbols.

So I convert each symbol to the number I want to assign by using a mapping.

import pandas as pd

rating = {'G' : 1, 'C' : 2, 'P' : 3}
df = pd.DataFrame(['C', 'P', 'G', 'C', 'C'])
# Look up each symbol in the rating dict (newer pandas versions rename applymap to DataFrame.map).
print(df.applymap(lambda x: rating[x]))

#result
0  2
1  3
2  1
3  2
4  2
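
As an alternative sketch that stays within scikit-learn, OrdinalEncoder accepts an explicit category order through its categories parameter. The example below assumes the order G, C, P described above and shifts the 0-based codes by one to match the ratings 1, 2, 3.

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit order G, C, P instead of the default alphabetical order.
oe = OrdinalEncoder(categories=[['G', 'C', 'P']])
codes = oe.fit_transform([['C'], ['P'], ['G'], ['C'], ['C']])

# Codes start at 0, so add 1 to get the ratings 1, 2, 3.
ratings = codes.astype(int) + 1
print(ratings.ravel())  # [2 3 1 2 2]
```

Because the order lives in the encoder, oe.inverse_transform() can still recover the original symbols from the 0-based codes.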

Finally

If a single column contains different numbers, some models interpret them as having an order (0 < 1 < 2). One-hot encoding is used to avoid this problem. Conversely, when the categories do have a meaningful order (G > C > P), using different numbers in a single column can be the right choice.
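
The difference can be made concrete by comparing distances between encoded categories, which matter for distance-based models such as an SVC with an RBF kernel. This is only an illustrative sketch with hand-written encodings: with ordinal codes, C and P end up twice as far apart as C and G, while with one-hot vectors every pair is equally distant.

```python
import numpy as np

# Hand-written encodings for illustration (alphabetical order C, G, P).
ordinal = {'C': np.array([0.0]), 'G': np.array([1.0]), 'P': np.array([2.0])}
one_hot = {
    'C': np.array([1.0, 0.0, 0.0]),
    'G': np.array([0.0, 1.0, 0.0]),
    'P': np.array([0.0, 0.0, 1.0]),
}

def distance(codes, a, b):
    """Euclidean distance between the encodings of categories a and b."""
    return float(np.linalg.norm(codes[a] - codes[b]))

for a, b in [('C', 'G'), ('G', 'P'), ('C', 'P')]:
    print(a, b, distance(ordinal, a, b), round(distance(one_hot, a, b), 3))
```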

I feel like I finally understand the point of one-hot encoding. A model has inputs (explanatory variables, independent variables) and outputs (objective variables, dependent variables). On the explanatory-variable side, one-hot encoding is the basic choice for both regression and classification, while on the objective-variable side, at least for classification with SVC as shown above, strings can be used as they are without any problem.

See "[Scikit-learn] How to use Label Encoder and One Hot Encoder properly" (teratail).
