[Python] A "usable" one-hot encoding method for machine learning

Introduction

Whatever I study, I soon forget, so I am posting articles on Qiita as a memo and as practice in writing things up. I would be grateful for comments pointing out mistakes or better approaches.

Idea

I want to do one-hot encoding for machine learning, but I don't know in advance what values the test data will contain. Every site says to use get_dummies for one-hot encoding, but suppose, for example, that **train_df['sex'] contains Male and Female while test_df['sex'] contains only Male**. If you simply call get_dummies on each in that case, the number of columns created differs between the two. That's no good.
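The mismatch can be reproduced with a minimal sketch. The toy frames below are hypothetical stand-ins for the train_df and test_df mentioned above, not real data.

```python
import pandas as pd

# Hypothetical toy data: the test set happens to contain only "Male".
train_df = pd.DataFrame({"sex": ["Male", "Female"]})
test_df = pd.DataFrame({"sex": ["Male", "Male"]})

# Encode each frame independently, as a naive pipeline would.
train_cols = list(pd.get_dummies(train_df).columns)
test_cols = list(pd.get_dummies(test_df).columns)

print(train_cols)  # ['sex_Female', 'sex_Male']
print(test_cols)   # ['sex_Male']
```

A model trained on two columns cannot be fed the single-column test frame as it is.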

After a lot of research, I arrived at the following article.

[Python] Don't use pandas.get_dummies for machine learning

That article drops get_dummies in favor of sklearn's `OneHotEncoder`. However, I wanted to analyze the data as pandas DataFrames and convert to NumPy only at the very end, so I was set on solving this within pandas.
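For comparison, here is a rough sketch of what the `OneHotEncoder` route might look like. It is not taken from the linked article; the toy data and the choice of `handle_unknown="ignore"` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "piyo" and "c" appear only in the test set.
df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(df_train)

# transform returns a sparse matrix by default; toarray() makes it a dense NumPy array.
encoded_test = encoder.transform(df_test).toarray()

print(encoder.categories_)
print(encoded_test)
```

The encoder sorts the categories it learns, so the columns come out as fuga, hoge, a, b, and the second test row is all zeros. The result is a NumPy array rather than a DataFrame, which is why the rest of this article stays with pandas.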

How to do that is explained in a comment on the article above. In this article I break it down until I can digest it in my own way.

Implementation

The implementation ends up using get_dummies.

import pandas as pd

# (i) In df_train, the unique values of A are "hoge" and "fuga"; those of B are "a" and "b"
df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})

# (ii) In df_test, the unique values of A are "hoge" and "piyo"; those of B are "a" and "c"
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# (iii) Use Categorical to fix the categories up front: A is "hoge" or "fuga", B is "a" or "b".
#       Values outside the list ("piyo", "c") become NaN.
df_train["A"] = pd.Categorical(df_train["A"], categories=["hoge", "fuga"])
df_train["B"] = pd.Categorical(df_train["B"], categories=["a", "b"])
df_test["A"] = pd.Categorical(df_test["A"], categories=["hoge", "fuga"])
df_test["B"] = pd.Categorical(df_test["B"], categories=["a", "b"])

# (iv) One-hot encode with get_dummies
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)

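The final one-hot data is as follows. (With pandas 2.0 or later, get_dummies returns booleans by default, so the values print as True/False unless you pass dtype=int.)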

df_train
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       1    0    1
df_test
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       0    0    0

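This way, only the unique values found in train are used as categories, and values that appear only in test ("piyo", "c") are encoded as all zeros. The categories are hard-coded here, but getting them with `unique` instead makes this more flexible.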
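One way the `unique` idea could be written is sketched below. The `categories` dictionary is a name introduced here for illustration, and `dropna()` is an added precaution so that missing values in train do not become a category.

```python
import pandas as pd

df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# Learn the category list for each column from the training data only.
# unique() keeps the order of first appearance.
categories = {col: list(df_train[col].dropna().unique()) for col in ["A", "B"]}

# Apply the same categories to both frames; unseen test values become NaN.
for col, cats in categories.items():
    df_train[col] = pd.Categorical(df_train[col], categories=cats)
    df_test[col] = pd.Categorical(df_test[col], categories=cats)

train_encoded = pd.get_dummies(df_train)
test_encoded = pd.get_dummies(df_test)

print(list(train_encoded.columns))  # ['A_hoge', 'A_fuga', 'B_a', 'B_b']
print(list(test_encoded.columns))   # ['A_hoge', 'A_fuga', 'B_a', 'B_b']
```

Both frames end up with the same columns in the same order, whatever values the test data contains.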

Supplement

df_train is also given fixed categories because otherwise get_dummies orders the columns by sorted value, so hoge and fuga would come out in the reverse order (A_fuga, A_hoge) and no longer line up with df_test.
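The difference in column order can be checked with a small sketch. The variable names `plain_cols` and `fixed_cols` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})

# Without fixed categories, get_dummies orders the columns by sorted value.
plain_cols = list(pd.get_dummies(df).columns)

# With fixed categories, the column order follows the given category order.
fixed = df.copy()
fixed["A"] = pd.Categorical(fixed["A"], categories=["hoge", "fuga"])
fixed["B"] = pd.Categorical(fixed["B"], categories=["a", "b"])
fixed_cols = list(pd.get_dummies(fixed).columns)

print(plain_cols)  # ['A_fuga', 'A_hoge', 'B_a', 'B_b']
print(fixed_cols)  # ['A_hoge', 'A_fuga', 'B_a', 'B_b']
```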
