[PYTHON] Preprocessing in machine learning 4 Data conversion

Aidemy 2020/10/29

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people have read the previous summary articles. Thank you! This is the fourth post on preprocessing in machine learning. I hope you find it useful.

What to learn this time
・ About continuous values
・ Conversion of category data
・ About scale adjustment
・ About vertical data and horizontal data

About continuous values

Categorizing continuous values

- When preprocessing data, you may want to divide continuous values into categories.
- For example, given the age data [10,15,18,20,27,32], you may want to divide it into "teens, 20s, 30s" (categorization of continuous values).

- This is done with the __cut(x, bins=[], labels=[], right=)__ function of pandas.
・ About each argument
- __"x"__ is the data to be passed.
- Pass the bin edges (the delimiter numbers) as an array to __"bins"__.
- Pass the name of each category as an array to __"labels"__.
- For __"right"__, set True if each interval should be "greater than A and at most B", and False if it should be "A or more and less than B". (Default = True, but in Japanese usage the form "A or more and less than B" given by False tends to be easier to understand intuitively.)

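import pandas as pd

x = [10,15,18,20,27,32]
bins = [10,20,30,40] # Bin edges: 10-20, 20-30, 30-40
labels = ['teens','20s','30s']
# Categorize; right=False makes each bin "A or more and less than B"
pd.cut(x,bins=bins,labels=labels,right=False)
# Values: ['teens', 'teens', 'teens', '20s', '20s', '30s']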
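To make the effect of "right" concrete, the following sketch runs the same ages through both settings. The variable names (ages, left_closed, right_closed) are chosen only for this illustration.

```python
import pandas as pd

ages = [10, 15, 18, 20, 27, 32]
bins = [10, 20, 30, 40]
labels = ['teens', '20s', '30s']

# right=False: bins are [10, 20), [20, 30), [30, 40)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True (the default): bins are (10, 20], (20, 30], (30, 40]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(list(left_closed))   # ['teens', 'teens', 'teens', '20s', '20s', '30s']
print(list(right_closed))  # [nan, 'teens', 'teens', 'teens', '20s', '30s']
```

With right=True, the value 10 falls outside the first bin and becomes NaN, while 20 is counted as "teens". If you do keep right=True, pd.cut also accepts include_lowest=True so that the lowest edge is included.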

One-hot encode categorical data

- The data categorized in the previous section can be converted to __"1"__ if a value belongs to a given category and __"0"__ if it does not.
- For example, for the category "teens", the output is [1,1,1,0,0,0].
- Such conversion is called __"one-hot encoding"__.
- For example, when the data {'age': [teens, teens, teens, 20s, 20s, 30s]} is given as a DataFrame (variable df), it can be converted with __pd.get_dummies(df['age'])__.
- If there are columns other than 'age' and you want to convert all of them, pass "df" itself as the argument.
- If you want to keep the name of the original column in the converted columns, pass the column name as the second argument, prefix='age'.

result = pd.cut(x,bins=bins,labels=labels,right=False)
df = pd.DataFrame({'age':result})
# Execute one-hot encoding
# dtype=int gives 0/1 (recent pandas versions return True/False by default)
pd.get_dummies(df['age'],prefix='age',dtype=int)
'''
   age_teens  age_20s  age_30s
0          1        0        0
1          1        0        0
2          1        0        0
3          0        1        0
4          0        1        0
5          0        0        1
'''
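Passing the whole DataFrame, as mentioned above, can be sketched as follows. The 'gender' column and its values are hypothetical and are added only to show a second categorical column; each converted column is automatically prefixed with the name of its source column.

```python
import pandas as pd

# Hypothetical data: 'gender' is an illustrative column, not from the article.
df_all = pd.DataFrame({
    'age': ['teens', '20s', '30s'],
    'gender': ['F', 'M', 'F'],
})

# Passing the DataFrame itself encodes every categorical (string) column.
# dtype=int requests 0/1 instead of True/False.
encoded = pd.get_dummies(df_all, dtype=int)

print(list(encoded.columns))
# ['age_20s', 'age_30s', 'age_teens', 'gender_F', 'gender_M']
```

Because these columns are plain strings, the categories appear in sorted order, unlike the result of pd.cut, which keeps the order of "labels".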

About scale adjustment

- If the numerical data items passed to the model include items whose values are much larger than the others, the learning efficiency may be reduced.
- In such a case, adjusting the values of all numerical data items so that they fit within a common standard is called __"scale adjustment"__.
- The general standard is __"mean 0, variance 1"__. In this case, use the __scale()__ function of the preprocessing module of scikit-learn.

- Code (scale adjustment of 10 normally distributed random numbers with a mean of 50)![Screenshot 2020-10-29 16.24.15.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7c891221-4251-8f4b-e922-6f155fff7af3.png)
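Since the code above is only available as a screenshot, here is a minimal text sketch of the same idea. The sample data (two features, the seed, and the standard deviations) are assumptions made for this illustration and are not taken from the screenshot.

```python
import numpy as np
from sklearn import preprocessing

# Assumed sample data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=50, scale=10, size=10),    # large values (mean about 50)
    rng.normal(loc=0.5, scale=0.1, size=10),  # small values (mean about 0.5)
])

# scale() standardizes each column to mean 0 and variance 1.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # each column mean is approximately 0
print(X_scaled.std(axis=0))   # each column standard deviation is approximately 1
```

After scaling, both features have comparable magnitudes, so neither dominates simply because of its units.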

Box-Cox conversion

- To convert data so that it approaches a __"normal distribution"__, perform what is called a __Box-Cox conversion__ (Box-Cox transformation).
- The normal distribution is a distribution in which the data is most concentrated around the center (the mean) and becomes sparser toward the edges.
- To perform a Box-Cox conversion, use the __boxcox()__ function of the stats module of scipy. The input data must be positive.

- The boxcox function returns two values: the first is the array after conversion, and the second is the parameter "λ" used for the conversion.

- Code (Box-Cox conversion of data following a chi-squared (χ²) distribution with 3 degrees of freedom)![Screenshot 2020-10-29 16.26.31.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/dd6b1340-1eb4-9a14-8011-8b5b63084156.png)

・ Result![Screenshot 2020-10-29 16.26.41.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/4291aee1-314c-7386-274a-dabcec09dcfd.png)
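As the code and result above are screenshots, the following is a minimal text sketch of a Box-Cox conversion. The sample size and seed are assumptions for this illustration; skewness is printed only as a rough indicator of how close the data is to a symmetric, normal-like shape.

```python
import numpy as np
from scipy import stats

# Assumed sample: 1000 draws from a chi-squared distribution with 3 degrees
# of freedom (right-skewed and strictly positive, as boxcox requires).
rng = np.random.default_rng(0)
x = rng.chisquare(df=3, size=1000)

# boxcox returns the converted array and the lambda it estimated.
x_bc, lam = stats.boxcox(x)

print("lambda:", lam)
print("skewness before:", stats.skew(x))
print("skewness after: ", stats.skew(x_bc))
```

A skewness closer to 0 after the conversion suggests that the distribution has become more symmetric.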

Vertical data and horizontal data

What is vertical data and horizontal data?

- Vertical data (long format) has a structure in which rows are added "vertically" as the amount of data grows, while horizontal data (wide format) has a structure in which columns are added "horizontally".
- For example, suppose an accounting report has the data "purchaser", "purchased item", and "price". When the amount of data increases, __vertical data__ keeps these three columns and simply adds rows, while __horizontal data__ expands the purchased items into columns and stores the prices there.
- Vertical data has the advantage of __being easy to adapt to changes in the data structure and logic__ because there is no need to add columns.
- On the other hand, horizontal data has the advantage of being easy to understand intuitively when checking it.
- When passing data to a model, it is necessary to convert it to one of these formats.

Convert vertical data to horizontal data

- Use the __pivot(index, columns, values)__ function of pandas.
- For the arguments, "index" is the column of the original DataFrame whose identical values are gathered into one row, "columns" is the original column whose values become the new column names, and "values" is the original column that supplies the cell values.

・ In the previous example, the same person appears as the "buyer" many times, so it is passed to "index"; the "purchased item" is expanded horizontally, so it is passed to "columns"; and the "price" becomes the cell value, so it is passed to "values".
- The index of the converted df is the column passed as "index", so to return to a normal numeric index, chain __.reset_index()__ after pivot(). The name attribute of the columns also remains at this point, so clear it by applying __.columns.set_names(None)__ to the columns of the converted df.

import pandas as pd

df = pd.DataFrame({
      'buyer': ['Sato','Tanaka','Kato','Tanaka','Kato'],
      'Purchased items': ['pen','paper','beverage','scissors','glue'],
      'price': [500, 250, 250, 600, 200]
})
# Convert vertical data to horizontal data
pivoted_df = df.pivot(index='buyer',columns='Purchased items',values='price').reset_index()
# Clear the leftover name of the columns ('Purchased items')
pivoted_df.columns = pivoted_df.columns.set_names(None)
'''
(pivot sorts the rows and columns; NaN means no purchase)
    buyer  beverage   glue  paper    pen  scissors
0    Kato     250.0  200.0    NaN    NaN       NaN
1    Sato       NaN    NaN    NaN  500.0       NaN
2  Tanaka       NaN    NaN  250.0    NaN     600.0
'''

Convert horizontal data to vertical data

- This can be done with pandas __melt(frame, id_vars, value_vars, var_name, value_name)__.
・ About each argument
- frame: the source df to convert
- id_vars: the key columns (passed as an array)
- value_vars: the columns whose values are gathered into the value column (passed as an array)
- var_name: the name of the new column that holds the names of the columns that were expanded horizontally
- value_name: the name of the new column that holds the converted values

・ The horizontal data created in the previous section can be converted as follows. __pd.melt(pivoted_df, id_vars=['buyer'], value_vars=['pen','paper','beverage','scissors','glue'], var_name='purchase', value_name='price')__
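The round trip from vertical to horizontal and back can be sketched as a self-contained example. The names long_df, wide_df, melted, and restored are chosen for this illustration, and the final dropna step is an addition that the article does not describe.

```python
import pandas as pd

# Vertical (long) data: one row per purchase.
long_df = pd.DataFrame({
    'buyer': ['Sato', 'Tanaka', 'Kato', 'Tanaka', 'Kato'],
    'purchase': ['pen', 'paper', 'beverage', 'scissors', 'glue'],
    'price': [500, 250, 250, 600, 200],
})

# Horizontal (wide) data: one row per buyer, one column per item.
wide_df = long_df.pivot(index='buyer', columns='purchase', values='price').reset_index()
wide_df.columns = wide_df.columns.set_names(None)

# Back to vertical data with melt.
melted = pd.melt(
    wide_df,
    id_vars=['buyer'],
    value_vars=['pen', 'paper', 'beverage', 'scissors', 'glue'],
    var_name='purchase',
    value_name='price',
)

# melt emits one row per (buyer, item) pair, including pairs with no
# purchase (NaN), so drop those to recover the original five purchases.
restored = melted.dropna(subset=['price']).reset_index(drop=True)

print(len(melted))    # 15 rows: 3 buyers x 5 items
print(len(restored))  # 5 rows: the actual purchases
```

Note that the prices come back as floating-point numbers, because the wide table had to hold NaN for missing combinations.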

Summary

- If you want to categorize continuous values, use pandas __cut(x, bins=[], labels=[], right=)__.
- Converting categorized data to "1" if it belongs to a category and "0" if it does not is called "one-hot encoding", and it is done with __pd.get_dummies()__.
- Bringing the values of numerical data items onto a common scale is called "scale adjustment", and it is done with __preprocessing.scale()__.
- Converting data so that it approaches a normal distribution is called "Box-Cox conversion", and it is done with __stats.boxcox()__.
- Use __pivot(index, columns, values)__ when converting from vertical data to horizontal data, and use __melt(frame, id_vars, value_vars, var_name, value_name)__ when doing the opposite.

That is all for this time. Thank you for reading to the end.
