Machine learning with Python without being beaten by categorical variables (dummy variables)

I keep looking up the same things, so I took this opportunity to write them down.

Here, I actually read a CSV file with Python, preprocess it (including dummy variable conversion), and make predictions with an SVM.

Motivation

When you feed your data into a machine learning algorithm, it may contain categorical variables (e.g. gender, country of origin). If you simply replace the categories with numbers (e.g. 1 for Japan, 2 for the United States), the data picks up an unintended meaning, such as an order or a distance between categories, and learning may not go well.

Here, we deal with this by converting the categorical variables into numerical values with a method called dummy variables.
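As a rough illustration of the "unintended meaning" mentioned above, here is a minimal sketch. The three rows and the 1/2/3 mapping are made up for the example and are not part of the Adult data set.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Adult data set.
countries = pd.Series(["Japan", "US", "China"], name="country")

# Naive integer coding (the mapping itself is an arbitrary choice).
codes = countries.map({"Japan": 1, "US": 2, "China": 3})

# The codes now carry an order and distances that the countries do not have:
# China looks twice as far from Japan as the US does.
dist_japan_us = abs(codes[1] - codes[0])
dist_japan_china = abs(codes[2] - codes[0])
print(dist_japan_us, dist_japan_china)
```

A model that treats the column as a number, such as a linear SVM, will use these distances as if they were real.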

What is a dummy variable?

For example, suppose the data has a country column, as in the example below.

In dummy variable conversion, the country column is replaced by three columns: country.Japan, country.US, and country.China. In each row, only the column that matches the original value is set to 1, and the others are set to 0.

Data example before dummy variable conversion

country
Japan

Data example after dummy variable conversion

country.Japan  country.US  country.China
1              0           0
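The conversion above can be sketched with pandas as follows. The four rows are hypothetical. `prefix_sep="."` is used only to get column names in the country.Japan style, and `dtype=int` is my choice to get 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows; only the country names come from the example above.
toy = pd.DataFrame({"country": ["Japan", "US", "China", "Japan"]})

# One 0/1 column per category; the original "country" column is replaced.
dummies = pd.get_dummies(toy, columns=["country"], prefix_sep=".", dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column that matches its original value.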

Implementation

Data acquisition

This time, I used the "Adult Income Data Set", which is often used in experiments on things like anonymization. You can probably find it with a Google search, but this time I got it from R (never mind that this is supposed to be a Python article and I reach for R right away).

This data set has a column called income that takes three values: large, small, and NaN. In this implementation, we want to predict large or small for the rows where income is NaN (missing).

Therefore, the rows where income is not NaN are used as training data, and the rows where income is NaN are used as the data to predict.

library('arules')
data("AdultUCI")
id <- 1:nrow(AdultUCI)
d <- data.frame(id, AdultUCI)
write.csv(d, "AdultDataSet.csv", quote = FALSE, fileEncoding = 'cp932', row.names = FALSE)
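The split described above (rows with a known income for training, rows with NaN for prediction) might look like this on the Python side. This is a sketch on a hypothetical four-row stand-in, not the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real data set.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "income": ["small", np.nan, "large", np.nan],
})

labeled = toy[toy["income"].notnull()]    # rows usable for training
unlabeled = toy[toy["income"].isnull()]   # rows whose income will be predicted
print(len(labeled), len(unlabeled))
```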

Loading the libraries and the CSV file

import numpy as np
import pandas as pd
from sklearn import svm

df = pd.read_csv("AdultDataSet.csv", encoding='cp932', low_memory=False)

Preprocessing (including creating dummy variables)

# Training labels: map income to 1/0 and keep only the rows where it is not missing
Y_train = df['income'].map({"large": 1, "small": 0})
Y_train = Y_train[Y_train.notnull()].values

# Create dummy variables for the categorical variables
X = df.iloc[:, 0:15]  # every column except income
colnames_categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
X_dummy = pd.get_dummies(X[colnames_categorical], drop_first=True)

# Join the dummy variables to the other columns (row index is the key)
X = pd.merge(X, X_dummy, left_index=True, right_index=True)

# Drop the original categorical columns, which the dummies now replace,
# and the columns that are not used as features
X = X.drop(colnames_categorical, axis=1)
X = X.drop(['id', 'education'], axis=1)

# Split into train and test depending on whether income is NaN or not
X_train = X[df['income'].notnull()].values
X_test  = X[df['income'].isnull()].values
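The code above passes `drop_first=True` to `pd.get_dummies`. As a small sketch of what that option does, here is a comparison on a hypothetical three-row sex column (`dtype=int` is again my choice, for 0/1 output):

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["Male", "Female", "Female"]})  # hypothetical rows

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a variable with k categories becomes k-1 columns, and the dropped category is represented by all of them being 0. Nothing is lost, because the dropped column can always be worked out from the remaining ones.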

Learning and prediction

# Training
clf = svm.LinearSVC()  # chosen because it trains quickly; svm.SVC(kernel='rbf') etc. are alternatives
print('start!')
clf.fit(X_train, Y_train)
print('end!')

# Prediction
Y_predict = clf.predict(X_test)

Merging the predictions

# Fill in the predicted values where income is missing
df2 = df.copy()
df2.loc[df2['income'].isnull(), 'income'] = Y_predict
# Convert the predicted 1/0 back to labels; existing labels are kept as they are
df2['income'] = df2['income'].map({1.:"large", 0.:"small", "small":"small", "large":"large"})
df2.head()

Check the result

Properly speaking, I think performance should be evaluated by setting aside some data with known labels and classifying it. This time, though, the goal was to predict the missing values and to try out dummy variable conversion, so for now let's just check that no missing values remain.
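For reference, an evaluation of the kind mentioned above could be sketched like this. Synthetic data stands in for X_train and Y_train, and the 25% hold-out ratio and the variable names are my assumptions, not something done in this article.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / Y_train (assumption: not the Adult data).
rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 5))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)

# Hold out 25% of the labeled rows for evaluation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=0, stratify=y_labeled
)

clf_eval = svm.LinearSVC()
clf_eval.fit(X_fit, y_fit)
acc = accuracy_score(y_val, clf_eval.predict(X_val))
print(f"hold-out accuracy: {acc:.3f}")
```

With the real data, X_train and Y_train from the preprocessing step would take the place of the synthetic arrays.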

# Counts of each income value before prediction
count_before = df['income'].value_counts(dropna=False)
pd.DataFrame(count_before)  # print(count_before) also works
# Counts of each income value after prediction
count_after = df2['income'].value_counts(dropna=False)
pd.DataFrame(count_after)

If NaN is gone after prediction, that is good enough for now.

Result output

df2.to_csv('AfterAdultDataSet.csv', index=False)

In closing

It shouldn't have been that difficult, but I had a hard time figuring out how to use pandas and scikit-learn ... sad ...
