Aidemy　2020/10/29

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learned there, so I am summarizing it on Qiita. I am very happy that so many people have read the previous summary articles. Thank you! This is the fourth post on preprocessing for machine learning. I hope you find it useful.

This article summarizes, in my own words, what I learned at Aidemy. It may contain mistakes and misunderstandings, so please keep that in mind.

What to learn this time
- Continuous values
- Converting categorical data
- Scale adjustment
- Vertical data and horizontal data

About continuous values

Categorizing continuous values

- When preprocessing data, you may want to divide continuous values into categories.
- For example, given the age data [10,15,18,20,27,32], you might want to divide it into "teens, 20s, 30s" (categorization of continuous values).

- This is done with the pandas function __cut(x, bins=[], labels=[], right=)__.
- About each argument:
  - __"x"__ is the data to be categorized.
  - Pass the boundary values as an array to __"bins"__.
  - Pass the name of each category as an array to __"labels"__.
  - For __"right"__, set True if each interval is "greater than A and at most B", and False if it is "A or more and less than B". (The default is True, but to my Japanese way of thinking, the False form "A or more and less than B" is easier to understand intuitively.)

import pandas as pd

x = [10,15,18,20,27,32]
bins = [10,20,30,40] # Boundaries for 10-20, 20-30, 30-40
labels = ['teens','20s','30s']
# Categorize (right=False: "A or more and less than B")
pd.cut(x,bins=bins,labels=labels,right=False)
# ['teens', 'teens', 'teens', '20s', '20s', '30s']
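To make the effect of "right" concrete, here is a small sketch that runs the same data through both settings. The variable names are my own, and the labels are just illustrative. Note that with right=True the value 10 sits exactly on the lowest boundary, so it falls outside every interval and becomes NaN.

```python
import pandas as pd

ages = [10, 15, 18, 20, 27, 32]
bins = [10, 20, 30, 40]
labels = ["teens", "20s", "30s"]

# right=True (default): intervals are (10, 20], (20, 30], (30, 40]
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)
# right=False: intervals are [10, 20), [20, 30), [30, 40)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# Put both results side by side to compare how 10 and 20 are treated
comparison = pd.DataFrame({
    "age": ages,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

In this comparison, 20 is counted as "teens" with right=True but as "20s" with right=False, which is why right=False matches the usual idea of age groups.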

One-hot encoding categorical data

- The data categorized in the previous section can be converted to __"1" if a row belongs to a given category and "0" if it does not__.
- For example, for the "teens" category the output is [1,1,1,0,0,0].
- This conversion is called __"one-hot encoding"__.
- For example, when the data {'age': [teens, teens, teens, 20s, 20s, 30s]} is held in a DataFrame (variable df), it can be converted with __pd.get_dummies(df['age'])__.
- If there are columns other than 'age' and you want to convert all of them, just pass "df" itself as the argument.
- If you want the converted columns to keep the name of the original column, pass the column name as the second argument, as in prefix='age'.

result = pd.cut(x,bins=bins,labels=labels,right=False)
df = pd.DataFrame({'age':result})
# Run one-hot encoding
# (dtype=int gives 0/1; pandas 2.0 and later return True/False by default)
pd.get_dummies(df['age'],prefix='age',dtype=int)
'''
   age_teens  age_20s  age_30s
0          1        0        0
1          1        0        0
2          1        0        0
3          0        1        0
4          0        1        0
5          0        0        1
'''
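As a sketch of the "pass df itself" case mentioned above: the 'city' column and its values are made up for illustration and are not from the original material. When the whole DataFrame is passed, each categorical column is expanded and its own column name is used as the prefix automatically.

```python
import pandas as pd

# Hypothetical data: 'city' is an invented second categorical column
df = pd.DataFrame({
    "age": ["teens", "20s", "30s"],
    "city": ["Tokyo", "Osaka", "Tokyo"],
})

# Passing the whole DataFrame encodes every string/categorical column.
# dtype=int is used so the result is 0/1 rather than True/False.
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```

For plain string columns like these, the new columns appear in sorted order of the category values (for example age_20s before age_teens).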

About scale adjustment

- If the numerical features passed to a model include one with relatively large values, training may become less efficient.
- In that case, adjusting the values of all numerical features so that they fit a common standard is called __"scale adjustment"__.
- The usual standard is __"mean 0, variance 1"__ (standardization). For this, use the __scale()__ function of the preprocessing module of scikit-learn.

- Code (scale adjustment of 10 normally distributed random numbers with a mean of 50 and a specified variance)![Screenshot 2020-10-29 16.24.15.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7c891221-4251-8f4b-e922-6f155fff7af3.png)
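Since the code above is only available as a screenshot, here is a minimal text sketch of the same idea. It is not a transcription of the screenshot: the random seed, the standard deviations, and the second column are my own assumptions, added to show that each column is scaled separately.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Two features on very different scales (the values here are assumptions)
x = np.column_stack([
    rng.normal(loc=50, scale=10, size=10),      # around 50
    rng.normal(loc=5000, scale=300, size=10),   # around 5000
])

# scale() standardizes each column to mean 0 and variance 1
x_scaled = preprocessing.scale(x)

print(x_scaled.mean(axis=0))  # close to 0 for both columns
print(x_scaled.std(axis=0))   # close to 1 for both columns
```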

Box-Cox conversion

- To convert data so that it approaches a __"normal distribution"__, apply what is called a __Box-Cox conversion__.
- A normal distribution is one in which the data are most dense around the center and become sparser toward the edges.
- To perform a Box-Cox conversion, use the __boxcox()__ function of the stats module of scipy.

- The boxcox function returns two values: the first is the converted array, and the second is the parameter "λ" used in the conversion.

- Code (Box-Cox conversion of data drawn from a chi-squared (χ²) distribution with 3 degrees of freedom)![Screenshot 2020-10-29 16.26.31.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/dd6b1340-1eb4-9a14-8011-8b5b63084156.png)

- Result![Screenshot 2020-10-29 16.26.41.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/4291aee1-314c-7386-274a-dabcec09dcfd.png)
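The screenshots cannot be copied as text, so here is a rough sketch of a Box-Cox conversion under my own assumptions (the seed and the sample size of 1000 are not from the original). It compares the skewness before and after, as one simple way to check that the data moved closer to a normal distribution. Note that boxcox() requires all input values to be positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Chi-squared data with 3 degrees of freedom: positive and right-skewed
x = rng.chisquare(df=3, size=1000)

# boxcox() returns the converted array and the lambda it estimated
x_bc, lam = stats.boxcox(x)

print("lambda:", lam)
print("skewness before:", stats.skew(x))
print("skewness after: ", stats.skew(x_bc))
```

A skewness closer to 0 after the conversion suggests the distribution has become more symmetric.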

Vertical data and horizontal data

What is vertical data and horizontal data?

- Vertical data grows "vertically" (rows are added) as the amount of data increases, whereas horizontal data grows "horizontally" (columns are added).
- For example, suppose an accounting report contains "purchaser", "purchased item", and "price". When the data grows, simply appending rows to the __three-column table is vertical data__, while __spreading the purchased items out into columns and filling in the values there is horizontal data__.
- Vertical data has the advantage of __adapting easily to changes in data structure and logic__, because no columns need to be added.
- Horizontal data, on the other hand, has the advantage of being easy to understand intuitively when checking the data.
- When passing data to a model, you need to convert it to one of these formats.

Convert vertical data to horizontal data

- Use the pandas __pivot(index, columns, values)__ function.
- As for the arguments, "index" is the column of the original DataFrame whose identical values are gathered into one row, "columns" is the original column whose values become the new columns, and "values" is the original column that supplies the cell values.

- In the previous example, the same person appears repeatedly as the "buyer", so that column goes in "index"; the "purchased item" is expanded horizontally, so it goes in "columns"; and the "price" becomes the cell value, so it goes in "values".
- The index of the converted df is the column passed as "index", so if you want to go back to ordinary index numbers, append __.reset_index()__ to pivot(). At this point the name attribute of the columns also remains, so apply __.columns.set_names(None)__ to the columns of the converted df.

import pandas as pd

df = pd.DataFrame({
      'buyer': ['Sato','Tanaka','Kato','Tanaka','Kato'],
      'item': ['pen','paper','beverage','scissors','glue'],
      'price': [500, 250, 250, 600, 200]
})
## Convert vertical data to horizontal data
pivoted_df = df.pivot(index='buyer',columns='item',values='price').reset_index()
pivoted_df.columns = pivoted_df.columns.set_names(None)
'''
(rows and columns come out sorted; combinations with no purchase are NaN)
    buyer  beverage   glue  paper    pen  scissors
0    Kato     250.0  200.0    NaN    NaN       NaN
1    Sato       NaN    NaN    NaN  500.0       NaN
2  Tanaka       NaN    NaN  250.0    NaN     600.0
'''

Convert horizontal data to vertical data

- Use the pandas function __melt(frame, id_vars, value_vars, var_name, value_name)__.
- About each argument:
  - frame: the source df
  - id_vars: the key column(s) (passed as an array)
  - value_vars: the columns whose values are gathered (passed as an array)
  - var_name: the name of the new column that holds the names of the horizontally expanded columns
  - value_name: the name of the new column that holds the values

- The horizontal data created in the previous section can be converted back as follows. __pd.melt(pivoted_df, id_vars=['buyer'], value_vars=['pen','paper','beverage','scissors','glue'], var_name='item', value_name='price')__
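Here is a self-contained sketch of the round trip from vertical to horizontal and back. The column name 'item' and the final dropna() step are my own choices for illustration. Because melt() produces one row for every buyer and item combination, the combinations that were never purchased come back as NaN rows, so they are dropped here to recover the original five records.

```python
import pandas as pd

df = pd.DataFrame({
    "buyer": ["Sato", "Tanaka", "Kato", "Tanaka", "Kato"],
    "item": ["pen", "paper", "beverage", "scissors", "glue"],
    "price": [500, 250, 250, 600, 200],
})

# Vertical -> horizontal
wide_df = df.pivot(index="buyer", columns="item", values="price").reset_index()
wide_df.columns = wide_df.columns.set_names(None)

# Horizontal -> vertical: 3 buyers x 5 items = 15 rows, most of them NaN
melted = pd.melt(
    wide_df,
    id_vars=["buyer"],
    value_vars=["pen", "paper", "beverage", "scissors", "glue"],
    var_name="item",
    value_name="price",
)

# Keep only the combinations that actually have a price
long_df = melted.dropna(subset=["price"]).reset_index(drop=True)
print(long_df)
```

The prices come back as floats, because the NaN values introduced by pivot() force the column to a floating-point type.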

Summary

- To categorize continuous values, use the pandas function __cut(x, bins=[], labels=[], right=)__.
- Converting categorized data to "1" if a row belongs to a category and "0" if it does not is called "one-hot encoding", and it is done with __pd.get_dummies()__.
- Bringing numerical values onto a common scale is called "scale adjustment", and it is done with __preprocessing.scale()__.
- Converting data so that it approaches a normal distribution is called "Box-Cox conversion", and it is done with __stats.boxcox()__.
- Use __pivot(index, columns, values)__ to convert vertical data to horizontal data, and __melt(frame, id_vars, value_vars, var_name, value_name)__ to do the opposite.

That's all for this time. Thank you for reading to the end.

[PYTHON] Preprocessing in machine learning 4 Data conversion