Conversion of continuous values and categorical data

Categorizing continuous values

When preprocessing data, you may want to divide continuous values into suitable ranges and treat them as categories.

For example, suppose you have age data such as [20, 23, 30, 35, 42, 45]. When the age group (20s, 30s, and so on) is the important attribute, the last digit of the age does not matter much.

In this case, the data can be split every 10 years so that each value falls into a category such as 20s, 30s, or 40s.

The pandas cut function is used for this.

With the cut function, continuous values are categorized mainly by specifying the following four arguments.

x     : One-dimensional array to categorize
bins  : An array of numbers used as bin edges
labels: An array of strings naming each bin
right : Whether the right edge of each bin is closed (included). Specify True or False

For example, to divide the previous [20, 23, 30, 35, 42, 45] into 20s, 30s, and 40s, specify the arguments as follows.

import pandas as pd

x = [20, 23, 30, 35, 42, 45]
pd.cut(x, bins=[19, 29, 39, 49], labels=["20's", "30s", "Forties"])

In this example, bins = [19, 29, 39, 49]. By default, the cut function treats each interval between consecutive bins values as open on the left and closed on the right. Here the data is therefore divided into three intervals: 19 < x <= 29, 29 < x <= 39, and 39 < x <= 49.

To make the left side the closed end instead, set the right argument of the cut function to False and write the call as follows. In this example, that makes the code more intuitive and easier to read.

x = [20, 23, 30, 35, 42, 45]
pd.cut(x, bins=[20, 30, 40, 50], labels=["20's", "30s", "Forties"], right=False)
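The difference between the two calls is easiest to see by leaving out labels, so that the categories are the intervals themselves. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import pandas as pd

ages = [20, 23, 30, 35, 42, 45]

# Default (right=True): each bin is open on the left, closed on the right.
right_closed = pd.cut(ages, bins=[19, 29, 39, 49])

# right=False: each bin is closed on the left, open on the right.
left_closed = pd.cut(ages, bins=[20, 30, 40, 50], right=False)

# Without labels, the categories are the intervals themselves.
right_intervals = [str(iv) for iv in right_closed.categories]
left_intervals = [str(iv) for iv in left_closed.categories]
print(right_intervals)
print(left_intervals)

# codes holds the position of the bin each value fell into.
right_codes = right_closed.codes.tolist()
left_codes = left_closed.codes.tolist()
print(right_codes)
print(left_codes)
```

For integer ages, both sets of bin edges put every value in the same bin; only the way the boundaries are written changes.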

The cut function returns a Categorical object like the following:

[20's, 20's, 30s, 30s, Forties, Forties]
Categories (3, object): [20's < 30s < Forties]

Elements of a Categorical object can be accessed by index, like an array.

x = [20, 23, 30, 35, 42, 45]
result = pd.cut(x, bins=[20, 30, 40, 50], labels=["20's", "30s", "Forties"], right=False)
result[0]  # the value is "20's"

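As a further example, the following code assigns each value in a list to a range of width 10 (the 150s, 160s, and so on), which is the first step toward counting how many values fall in each range.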

import pandas as pd
x = [191, 184, 173, 162, 175, 183, 151, 160, 170, 182, 190, 192]

# Example of use
pd.cut(x, bins=[150, 160, 170, 180, 190, 200], labels=['150 units', '160 units', '170 units', '180 units', '190 units'], right=False)

# Output result
[190 units, 180 units, 170 units, 160 units, 170 units, ..., 160 units, 170 units, 180 units, 190 units, 190 units]
Length: 12
Categories (5, object): [150 units < 160 units < 170 units < 180 units < 190 units]
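Once the values are binned, counting how many fall into each category takes one more step. The sketch below assumes the numbers are heights and uses shorter labels ("150s" and so on); both are illustrative choices rather than part of the original example.

```python
import pandas as pd

heights = [191, 184, 173, 162, 175, 183, 151, 160, 170, 182, 190, 192]
labels = ["150s", "160s", "170s", "180s", "190s"]

binned = pd.cut(
    heights,
    bins=[150, 160, 170, 180, 190, 200],
    labels=labels,
    right=False,
)

# Wrap the Categorical in a Series to use value_counts;
# sort_index then orders the result by category rather than by count.
counts = pd.Series(binned).value_counts().sort_index()
print(counts)
```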

Converting categorical data to dummy variables

When you have categorical data, you may want to turn each category value into its own variable that takes the value 1 or 0. For example, suppose the age column holds one of the categories 20s, 30s, or 40s for each row, and you want a table with one column per category.

That table has columns for the 20s, 30s, and 40s. If the age of a row is in the 20s, the value in the 20s column is 1 and the values in the 30s and 40s columns are 0.

This conversion is also known as one-hot encoding.

It is easy to do with the pandas get_dummies function.

In the previous example, suppose the source data is the following DataFrame.

df = pd.DataFrame({'age': ["20's", "20's", "30s", "30s", "Forties", "Forties"]})

You can get the converted data in DataFrame format by using the get_dummies function as follows.

pd.get_dummies(df['age'])

In addition, if you pass the DataFrame itself to get_dummies instead of a specific column, as shown below, you get a DataFrame in which all categorical and string columns have been converted to dummy variables.

pd.get_dummies(df)

To make it clear which source column each dummy column came from, you can pass a prefix to get_dummies so that the original column name is kept in the converted column names.

pd.get_dummies(df['age'], prefix='age')

The prefix, followed by an underscore, is then added to each column name in the result (for example, age_30s).

In summary, it looks like this.

import pandas as pd
df = pd.DataFrame({'height': ['190 units', '180 units', '170 units', '160 units', '170 units', '180 units', '150 units']})

pd.get_dummies(df['height'])
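To make the shape of the result concrete, here is a small sketch using an age column. The labels "20s", "30s", and "40s" are illustrative. The dtype=int argument is passed because recent pandas versions return boolean columns by default, while the text above describes values of 1 and 0.

```python
import pandas as pd

df = pd.DataFrame({"age": ["20s", "20s", "30s", "30s", "40s", "40s"]})

# One column per category; exactly one column is 1 in each row.
dummies = pd.get_dummies(df["age"], prefix="age", dtype=int)
print(dummies)

# Passing the whole DataFrame encodes every string/categorical column
# and uses each source column name as the prefix automatically.
encoded = pd.get_dummies(df, dtype=int)
print(list(encoded.columns))
```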

Conversion of data scale and distribution

Data scale conversion

If the data used to train a machine learning model contains items whose values are much larger than those of the other items, training on the data as-is can be dominated by the large-valued items, and the parameters of the model may not be learned efficiently.

In such cases, rescale the data so that all numeric items fall on a common scale.

The most common scale adjustment is to set the mean to 0 and the variance to 1.

For example, if you have the following data

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Adjusting so that the mean is 0 and the variance is 1, the data will be as follows.

scaled_x = [
    -1.5666989,  -1.21854359,  -0.87038828, -0.52223297, -0.17407766,
    0.17407766,   0.52223297,   0.87038828,  1.21854359,  1.5666989 
]

This adjustment is easy to do with the scale function in scikit-learn's preprocessing module.

from sklearn import preprocessing

scaled_x = preprocessing.scale(x)
# scaled_x produced by the scale function is the same array as before.

scaled_x = [
    -1.5666989,  -1.21854359,  -0.87038828, -0.52223297, -0.17407766,
    0.17407766,   0.52223297,   0.87038828,  1.21854359,  1.5666989 
]
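The same numbers can be reproduced by hand, which shows what the adjustment does: subtract the mean, then divide by the standard deviation. The sketch below uses only NumPy and the population standard deviation (ddof=0, NumPy's default), which is the form that matches the values listed above.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)  # [1, 2, ..., 10]

# Standardize: subtract the mean, divide by the population standard deviation.
manual_scaled = (x - x.mean()) / x.std()

print(manual_scaled)
print(manual_scaled.mean(), manual_scaled.var())
```

After the adjustment the mean is 0 and the variance is 1, up to floating-point rounding.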

A usage example follows.

import numpy as np
from sklearn import preprocessing

np.random.seed(0)

# Generate 10 samples from a normal distribution with mean 50 and standard deviation 10
data = np.random.normal(50, 10, 10)

preprocessing.scale(data)

#Output result
array([ 1.06095671, -0.34936741,  0.2489091 ,  1.55402992,  1.16798583,
       -1.7736924 ,  0.21928426, -0.91965619, -0.86987913, -0.33857068])

Box-Cox transformation

When you use an analytical model that assumes normally distributed explanatory variables, such as multiple linear regression, you may want to transform the data so that its distribution is closer to normal.

In such a case, a transformation called the Box-Cox transformation is applied.

The Box-Cox transformation converts each original value x_i to y_i as follows:

y_i = (x_i^λ - 1) / λ    (λ ≠ 0)
y_i = log(x_i)           (λ = 0)

λ is the parameter of the transformation. Its value is either chosen by maximum likelihood estimation so that the transformed data is as close as possible to a normal distribution, or set to an arbitrary value.

As the formula shows, the x to be transformed must be positive, because log(x) cannot be calculated otherwise.
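The standard Box-Cox definition can be written directly in NumPy. The function below is an illustrative sketch for a fixed λ (the name box_cox is hypothetical); it does not estimate λ.

```python
import numpy as np

def box_cox(x, lam):
    """Apply the Box-Cox transformation with a fixed lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # lambda = 0 is defined as the natural logarithm.
        return np.log(x)
    return (x ** lam - 1.0) / lam

values = [1.0, 4.0, 9.0]
log_transformed = box_cox(values, 0)     # log(x)
sqrt_transformed = box_cox(values, 0.5)  # (sqrt(x) - 1) / 0.5
print(log_transformed)
print(sqrt_transformed)
```

The λ = 0 case is the limit of the general formula as λ approaches 0, which is why the two branches fit together smoothly.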

With the Box-Cox transformation, data following a χ² distribution with 2 degrees of freedom, for example, can be brought closer to a normal distribution.

The Box-Cox transformation is easy to apply with the boxcox function in scipy's stats module.

y, lambda_value = stats.boxcox(x)

When λ is not specified, the scipy boxcox function has two return values. In the above code, the transformed array is stored in y and the maximum likelihood estimate of λ is stored in lambda_value.
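The two ways of choosing λ correspond to two ways of calling boxcox. In this sketch the input array is made up for illustration. When the lmbda argument is given, only the transformed array is returned.

```python
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # must be positive

# lmbda omitted: lambda is estimated by maximum likelihood and returned as well.
y_auto, lambda_value = stats.boxcox(x)

# lmbda given: the fixed value is used and only the array is returned.
y_log = stats.boxcox(x, lmbda=0)     # equivalent to np.log(x)
y_sqrt = stats.boxcox(x, lmbda=0.5)  # (x**0.5 - 1) / 0.5

print(lambda_value)
print(y_log)
print(y_sqrt)
```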

In summary, it looks like this.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(0)

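# Draw 20000 samples from a chi-squared distribution (3 degrees of freedom here)
x = np.random.chisquare(3, 20000)

# Example of use: transform, then plot a histogram of the result
y, lambda_value = stats.boxcox(x)
plt.hist(y)
plt.show()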

Vertical data and horizontal data

What are vertical data and horizontal data?

A common data structure conversion in preprocessing is converting between vertical (long) data and horizontal (wide) data.

This is easier to understand with an example, so let's start with one.

Consider a table of test results at a school, with one row per student and one column per subject.

This is horizontal data. It is called horizontal because, when the information grows (for example, a new test subject is added), a column has to be added, as in a database table, and the table extends horizontally.

Horizontal data is often used because it is easy for a person to read in spreadsheet software and easy to tabulate intuitively.

On the other hand, the data structure has to change whenever new information is added (in a database, columns have to be added to the table). If a system is built around a horizontal data structure, a change in the required data is therefore more likely to force table and logic changes, which can be inefficient.

Therefore, in such a case, the following data structure may be used.

This is vertical data. With this structure, adding a test subject does not require changing the structure of the data (no columns need to be added in a database). You only add rows, so the data grows vertically.

At first glance, vertical data may seem harder to read. When implemented in a system, however, it adapts easily to changes in the data, which may reduce maintenance costs.

In data analysis, the data you receive may be vertical or horizontal, depending on how it was used at its source. Depending on the analysis model you use, you may then need to convert the data structure to one format or the other.
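The two layouts can be sketched with small DataFrames. The names and scores below are made up for illustration; the point is only what happens when a new subject is added.

```python
import pandas as pd

# Horizontal (wide): one row per student, one column per subject.
wide = pd.DataFrame({
    "name": ["Ishikawa", "Kawai"],
    "English": [70, 90],
    "Math": [80, 65],
})

# Vertical (long): one row per (student, subject) pair.
long = pd.DataFrame({
    "name": ["Ishikawa", "Ishikawa", "Kawai", "Kawai"],
    "Subject": ["English", "Math", "English", "Math"],
    "Score": [70, 80, 90, 65],
})

# Adding a Science result changes the columns of the wide table,
wide_plus = wide.assign(Science=[75, 85])

# but only appends rows to the long table.
science_rows = pd.DataFrame({
    "name": ["Ishikawa", "Kawai"],
    "Subject": ["Science", "Science"],
    "Score": [75, 85],
})
long_plus = pd.concat([long, science_rows], ignore_index=True)

print(wide_plus.shape, long_plus.shape)
```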

Convert vertical data to horizontal data

To convert vertical data to horizontal data, use the pandas pivot function.

The arguments of the pivot function are as follows.

index  : Column of the original DataFrame whose values identify each row of the result
columns: Column of the original DataFrame whose values become the new column names
values : Column of the original DataFrame whose values fill the new columns

Specifically, given vertical data with name, Subject, and Score columns, calling pivot like this

df.pivot(index='name', columns='Subject', values='Score')

gives a horizontal table with one row per name and one column per subject.

In the converted DataFrame, the column specified as index in the pivot function becomes the row index. The name attribute of the row index is "name" and the name attribute of the column index is "Subject". Often you will want to treat the values used for the index as a regular DataFrame column.

In that case, call reset_index as follows.

pivoted_df = df.pivot(index='name', columns='Subject', values='Score').reset_index()

The index values can then be handled as column data.

In addition, "Subject" remains as the name attribute of the column index. It has no particular meaning here, so it can be cleared by replacing the Index object that the DataFrame holds as its columns, like this:

pivoted_df.columns = pivoted_df.columns.set_names(None)

The final DataFrame is as follows.
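As a minimal sketch of these three steps, the following uses made-up names and scores in place of the test-result data and records the index names at each stage.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ishikawa", "Ishikawa", "Kawai", "Kawai"],
    "Subject": ["English", "Math", "English", "Math"],
    "Score": [70, 80, 90, 65],
})

# Step 1: pivot. "name" becomes the row index, subjects become columns.
pivoted = df.pivot(index="name", columns="Subject", values="Score")
index_name = pivoted.index.name
columns_name = pivoted.columns.name

# Step 2: reset_index turns "name" back into a regular column,
# but the column index is still named "Subject".
flat = pivoted.reset_index()
name_after_reset = flat.columns.name

# Step 3: clear the leftover name on the column index.
flat.columns = flat.columns.set_names(None)

print(index_name, columns_name, name_after_reset, flat.columns.name)
print(flat)
```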

When the contents are summarized, it looks like this.

import pandas as pd
import numpy as np


data = pd.DataFrame({
      'Opportunity': np.append(np.repeat('Opportunity1', 3), np.repeat('Opportunity2', 3)),
      'name': np.tile(['Ishikawa', 'Kawai', 'Kimura'], 2),
      'Degree of contribution': [50, 25, 25, 60, 20, 20]
})

#Example of use
pivoted_data = data.pivot(index='Opportunity', columns='name', values='Degree of contribution')
pivoted_data = pivoted_data.reset_index()
pivoted_data.columns = pivoted_data.columns.set_names(None)
pivoted_data

Before processing

After processing

Convert horizontally held data to vertically held data

To convert horizontal data to vertical data, use the pandas melt function.

The main arguments of the melt function are:

frame     : DataFrame you want to convert
id_vars   : Columns to keep as identifier (key) columns, given as an array
value_vars: Columns whose values are unpivoted into rows, given as an array
var_name  : Name of the new column that holds the original column names
value_name: Name of the new column that holds the values

Given horizontal data with a name column and one column per subject, you can use the melt function like this:

pd.melt(data, id_vars=['name'], value_vars=['National language', 'English', 'Math', 'Science', 'society'], var_name='Subject', value_name='Score')
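A small round trip shows what melt produces and that pivot reverses it. The data here is made up for illustration and uses only two subjects to keep the output short.

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["Ishikawa", "Kawai"],
    "English": [70, 90],
    "Math": [80, 65],
})

# Horizontal -> vertical: one row per (name, subject) pair.
long = pd.melt(
    wide,
    id_vars=["name"],
    value_vars=["English", "Math"],
    var_name="Subject",
    value_name="Score",
)
print(long)

# Vertical -> horizontal again, then tidy the index as described earlier.
restored = long.pivot(index="name", columns="Subject", values="Score").reset_index()
restored.columns = restored.columns.set_names(None)
print(restored)
```

Note that pivot sorts the rows and the new columns by their labels, so the round trip reproduces the original table exactly only when it was already in that order, as it is here.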