[PYTHON] You will be an engineer in 100 days ――Day 79 ――Programming ――About machine learning 4

Previous articles in this series:

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days-Day 24-Python-Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This post continues the series on machine learning.

About the data processing flow of machine learning

The flow of work when incorporating machine learning is as follows.

  1. Determine the purpose
  2. Data acquisition
  3. Data understanding / selection / processing
  4. Data mart (data set) creation
  5. Model creation
  6. Accuracy verification
  7. System implementation

Of these, steps 2 and 3 are called data preprocessing.

This time, as part of that preprocessing work, I would like to create a data mart.

About creating a data mart

The language is Python. The libraries used for data handling are pandas and NumPy, and the libraries used for visualization are seaborn and matplotlib.

** Loading library **

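# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Hide warning messages to keep the notebook output readable
warnings.filterwarnings('ignore')

# Jupyter Notebook magic command: draw plots inline
%matplotlib inline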

** Data details **

The data used this time is the Titanic passenger list.

PassengerId: Passenger ID
Survived: Survival result (0 = died, 1 = survived)
Pclass: Passenger class (1 is the highest class)
Name: Passenger name
Sex: Sex
Age: Age
SibSp: Number of siblings and spouses aboard
Parch: Number of parents and children aboard
Ticket: Ticket number
Fare: Fare paid
Cabin: Cabin number
Embarked: Port of embarkation

Suppose you have a file called titanic_train.csv.

** Read file **

The pandas library provides a family of `read_xxx` functions, one for each file format, so use the matching one to read a file. This time the file is a CSV, so we use `read_csv`.

pandas handles tabular data in a structure called a data frame. Load the file into a data frame.

#Read data from file
file_path = 'data/titanic_train.csv'
train_df = pd.read_csv(file_path,encoding='utf-8')
train_df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.25     NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.925    NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1     C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.05     NaN    S

The data looks like this.
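If you do not have the CSV file at hand, you can still try `read_csv` on a small piece of text. The following is only a minimal sketch: the two sample rows are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample laid out like the Titanic file (not the real data)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # Age is read as float64 because one cell is empty
```

The empty `Age` cell in the second row is loaded as `NaN`, which is how pandas represents a missing value.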

Last time, I looked through the contents of the data to see which columns might be usable. This time, we continue from there and turn the columns that look usable into data for machine learning.

** Check for missing values **

When the data is read, any cell that has no data is treated as a missing value in the data frame.

You can check the number of missing values in each column with `DataFrame.isnull().sum()`.

train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
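To see why this works, here is a small sketch on made-up data (the `toy_df` frame and its values are my own, not the Titanic data). `isnull()` returns True/False for every cell, so `sum()` counts the True values per column and `mean()` gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

missing_counts = toy_df.isnull().sum()   # number of missing values per column
missing_ratio = toy_df.isnull().mean()   # share of missing values per column (0.0 to 1.0)

print(missing_counts)
print(missing_ratio)
```

Looking at the ratio as well as the count can help when deciding whether a column is worth filling in at all.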

Looking at this, only some columns have missing values: `Age`, `Cabin`, and `Embarked`.

Let's display only the part with missing values.

** Extract rows that match the conditions **

DataFrame[condition]

** Extract rows with missing values **

DataFrame[DataFrame['column name'].isnull()]

train_df[train_df['Embarked'].isnull()]
     PassengerId  Survived  Pclass  Name                                       Sex     Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
61   62           1         1       Icard, Miss. Amelie                        female  38   0      0      113572  80    B28    NaN
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)  female  62   0      0      113572  80    B28    NaN

If you look at the value in the `Embarked` column, it is `NaN`. Missing values in a data frame are displayed as `NaN`.
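The row extraction above relies on a condition that is evaluated once per row. As a rough sketch on made-up data (the names `toy_df` and `mask` are just for illustration), the condition can be stored, negated with `~`, and combined with `&`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [22.0, np.nan, 62.0, 38.0],
    'Fare': [7.25, 80.0, 80.0, 8.05],
})

# The condition is a boolean Series: one True/False per row
mask = toy_df['Age'].isnull()

missing_age = toy_df[mask]    # rows where Age is missing
known_age = toy_df[~mask]     # ~ negates the condition
# Combine conditions with & and wrap each comparison in parentheses
expensive_known = toy_df[~mask & (toy_df['Fare'] > 50)]

print(missing_age)
print(expensive_known)
```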

For numeric columns, missing values are often filled with one of the following:

  · the mean
  · the median
  · an arbitrary value

Categorical values such as `Embarked` are not numbers, so they cannot be filled with a mean or a median.

To fill in missing values, use `fillna`.

** Fill missing values with an arbitrary value **

DataFrame.fillna(fill value)

To fill with the mean of a column, first compute that mean.

** Calculate the mean of a column **

DataFrame['column name'].mean()

** Calculate the median of a column **

DataFrame['column name'].median()

print(train_df['Fare'].mean())
print(train_df['Fare'].median())

32.2042079685746
14.4542

# Fill missing ages with the mean age
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
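The choice between mean and median matters when a column contains extreme values, as the gap between the two `Fare` figures above suggests. Here is a small sketch with made-up ages (not the Titanic data) that shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap and one unusually large value
ages = pd.Series([20.0, 30.0, np.nan, 100.0], name='Age')

# mean() and median() skip NaN by default
mean_filled = ages.fillna(ages.mean())      # mean of 20, 30, 100 is 50.0
median_filled = ages.fillna(ages.median())  # median of 20, 30, 100 is 30.0

print(mean_filled)
print(median_filled)
```

`fillna` returns a new Series and leaves `ages` unchanged, which is why the main code assigns the result back to the column.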

** Vectorization of category values **

In machine learning, basically all values used in the calculation must be numeric. Categorical values made up of character strings usually cannot be used as machine learning data as they are, except with some models.

Therefore, we convert the categorical values into numbers as a `one-hot vector`.

** Make the category value a one-hot vector **

A `one-hot vector` is data in which one column is created for each category value, and each cell is set to 1 if the row's value matches that column's category and 0 otherwise.

pd.get_dummies(DataFrame[['column name']])

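# Convert the port of embarkation into category columns
train_df["Embarked"] = train_df["Embarked"].fillna('N')
# dtype=int keeps the output as 0/1 (newer pandas versions default to True/False)
one_hot_df = pd.get_dummies(train_df["Embarked"],prefix='Em',dtype=int)
one_hot_df.head()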
   Em_C  Em_N  Em_Q  Em_S
0  0     0     0     1
1  1     0     0     0
2  0     0     0     1
3  0     0     0     1
4  0     0     0     1

Since the `Embarked` column has missing values, they are replaced with `N` before the column is converted into category values. A new data frame is generated in which each cell is 1 where the category value applies and 0 where it does not.

One column is created for each distinct value. If there are too many distinct values, the data becomes sparse (almost all zeros). It is a good idea to create categorical variables only for columns with a reasonably limited number of distinct values.
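The same idea on a tiny, made-up Series makes the result easy to check by eye. This is only a sketch: the `ports` values are invented, and `dtype=int` is my addition so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical port values with one gap
ports = pd.Series(['S', 'C', None, 'Q', 'S'], name='Embarked')

# Give the missing value its own label before encoding
ports = ports.fillna('N')

# One column per distinct value; dtype=int requests 0/1 instead of True/False
one_hot = pd.get_dummies(ports, prefix='Em', dtype=int)

print(one_hot)
print(ports.nunique())  # number of columns that get_dummies creates
```

Every row contains exactly one 1, which is where the name "one-hot" comes from. Checking `nunique()` first is a quick way to see whether a column would produce too many columns.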

** Conversion of numbers and strings **

We convert string data to numbers, or numbers to strings, so that the data can be used for machine learning.

Since sex (`Sex`) is a character string, it cannot be used for machine learning as it is. We will convert it from a string to a number.

DataFrame['column name'].replace({value: value, value: value ...})

# Convert sex to a number (0 = male, 1 = female)
train_df['Sex2'] = train_df['Sex'].replace({'male':0,'female':1})
train_df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked  Sex2
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.25     NaN    S         0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C         1
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.925    NaN    S         1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1     C123   S         1
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.05     NaN    S         0

A new numeric column for sex, `Sex2`, has been created.
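One thing worth knowing about `replace` is what happens to values that are not in the dictionary. The sketch below uses made-up data with an extra `'unknown'` value and compares `replace` with `map`, which is an alternative I am adding here and not part of the code above.

```python
import pandas as pd

# Hypothetical data; dtype=object keeps the values as plain Python strings
sex = pd.Series(['male', 'female', 'female', 'unknown'], name='Sex', dtype=object)
codes = {'male': 0, 'female': 1}

replaced = sex.replace(codes)  # values not in the dict are kept as they are
mapped = sex.map(codes)        # values not in the dict become NaN

print(replaced)
print(mapped)
```

With `replace`, an unexpected string stays in the column, so the column is still not fully numeric. With `map`, it turns into a missing value that `isnull().sum()` can detect.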

You can also use `replace` when converting from a number to a string. To change the data type of an entire column, use `astype`.

DataFrame['column name'].astype(data type)
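As a small sketch of `astype` in both directions (the frame and the new column names `Pclass_str` and `Fare_num` are made up for illustration):

```python
import pandas as pd

# Hypothetical data: a numeric class and fares stored as text
df = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': ['7.25', '71.2833', '7.925']})

df['Pclass_str'] = df['Pclass'].astype(str)   # integer -> string
df['Fare_num'] = df['Fare'].astype(float)     # string -> float

print(df.dtypes)
```

Converting a numeric code such as `Pclass` to a string is one way to treat it as a category before passing it to `get_dummies`.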

** Categorizing numbers **

Let's compute the age group (decade) from the age. Dividing the age by 10 with floor division gives the decade. You can also create a separate category for rows where the age is missing.


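# Categorize age by decade
train_df['period'] = train_df['Age']//10
train_df['period'] = train_df['period'].fillna('NaN')
# np.str has been removed from NumPy; the built-in str does the same job
train_df['period'] = train_df['period'].astype(str)
# dtype=int keeps the output as 0/1 (newer pandas versions default to True/False)
period_df = pd.get_dummies(train_df["period"],prefix='Pe',dtype=int)
period_df.head()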
   Pe_0.0  Pe_1.0  Pe_2.0  Pe_3.0  Pe_4.0  Pe_5.0  Pe_6.0  Pe_7.0  Pe_8.0
0  0       0       1       0       0       0       0       0       0
1  0       0       0       1       0       0       0       0       0
2  0       0       1       0       0       0       0       0       0
3  0       0       0       1       0       0       0       0       0
4  0       0       0       1       0       0       0       0       0
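In the code above the missing ages were already filled with the mean, so no column for missing ages appears. If you would rather keep "age unknown" as its own category, one possible variation looks like this. It is only a sketch on made-up ages, and the `'unknown'` label is my own choice.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap
ages = pd.Series([22.0, 38.0, np.nan, 4.0, 35.0], name='Age')

decade = ages // 10   # 22 -> 2.0, 38 -> 3.0, NaN stays NaN

# Turn each decade into a label; missing ages get their own label
labels = decade.map(lambda d: 'unknown' if pd.isna(d) else str(int(d)))

decade_df = pd.get_dummies(labels, prefix='Pe', dtype=int)

print(decade_df)
```

Converting to `int` before `str` also gives column names like `Pe_2` instead of `Pe_2.0`.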

** Combine data frames **

Combine the newly created data frames into one. Use `pd.concat` to put them together.

pd.concat([DataFrame, DataFrame], axis=1)


con_df = pd.concat([train_df,period_df],axis=1)
con_df = pd.concat([con_df,one_hot_df],axis=1)
con_df.head(1)
   PassengerId  Survived  Pclass  Name                     Sex   Age  SibSp  Parch  Ticket     Fare  ...  Pe_3.0  Pe_4.0  Pe_5.0  Pe_6.0  Pe_7.0  Pe_8.0  Em_C  Em_N  Em_Q  Em_S
0  1            0         3       Braund, Mr. Owen Harris  male  22   1      0      A/5 21171  7.25  ...  0       0       0       0       0       0       0     0     0     1

You have now concatenated the data frames horizontally.
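The `axis` argument decides the direction, and with `axis=1` the rows are matched by index label, not by position. The following sketch on made-up frames shows both points (all names here are for illustration only):

```python
import pandas as pd

# Hypothetical frames that share the default index 0, 1, 2
left = pd.DataFrame({'Fare': [7.25, 71.28, 7.92]})
right = pd.DataFrame({'Em_C': [0, 1, 0], 'Em_S': [1, 0, 1]})

side_by_side = pd.concat([left, right], axis=1)   # columns are added
stacked = pd.concat([left, left], axis=0)         # rows are added

# axis=1 matches rows by index label, so a shifted index leaves gaps
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(side_by_side.shape, stacked.shape, misaligned.shape)
```

In this post every frame comes from `train_df` and keeps its index, so the rows line up. If you filter or sort one frame before combining, it is worth checking the result for unexpected `NaN` values.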

** Delete unnecessary data **

After combining the data frames, delete the original columns, because the data from before the conversion is no longer needed.

DataFrame.drop(['column name'], axis=1)

data_df = con_df.drop(['PassengerId','Pclass','Name','Age','Ticket','Cabin','Embarked','period','Sex'], axis=1)
data_df.head()
   Survived  SibSp  Parch  Fare     Sex2  Pe_0.0  Pe_1.0  Pe_2.0  Pe_3.0  Pe_4.0  Pe_5.0  Pe_6.0  Pe_7.0  Pe_8.0  Em_C  Em_N  Em_Q  Em_S
0  0         1      0      7.25     0     0       0       1       0       0       0       0       0       0       0     0     0     1
1  1         1      0      71.2833  1     0       0       0       1       0       0       0       0       0       1     0     0     0
2  1         0      0      7.925    1     0       0       1       0       0       0       0       0       0       0     0     0     1
3  1         1      0      53.1     1     0       0       0       1       0       0       0       0       0       0     0     0     1
4  0         0      0      8.05     0     0       0       0       1       0       0       0       0       0       0     0     0     1

All of the data is now numeric. In this form, it can be used as the final data for machine learning.
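Rather than checking by eye, you can ask pandas whether any non-numeric column is left. This is a minimal sketch on a made-up frame; `select_dtypes` is a check I am adding here and is not part of the steps above.

```python
import pandas as pd

# Hypothetical frame with two text columns and two numeric columns
df = pd.DataFrame({
    'Name': ['Braund', 'Cumings'],
    'Sex': ['male', 'female'],
    'Sex2': [0, 1],
    'Fare': [7.25, 71.2833],
})

# drop returns a new frame; df itself is unchanged
features = df.drop(['Name', 'Sex'], axis=1)

# Columns that are still non-numeric (should be empty before training)
non_numeric = features.select_dtypes(exclude='number').columns.tolist()

print(features.columns.tolist())
print(non_numeric)
```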

Summary

Today, I processed the data and created a data mart for machine learning. I have introduced only a few of the available processing methods.

First, learn the rough flow. Once you understand it to a certain extent, it is a good idea to look for ways to improve accuracy or to try new methods.

21 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
