[PYTHON] You will be an engineer in 100 days ――Day 78 ――Programming ――About machine learning 3

Click here for the articles up to yesterday

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days --Day 24 --Python --Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This article continues the series on machine learning.

About the data processing flow of machine learning

The flow of work when incorporating machine learning is as follows.

  1. Determine the purpose
  2. Data acquisition
  3. Data understanding / selection / processing
  4. Data mart (data set) creation
  5. Model creation
  6. Accuracy verification
  7. System implementation

Of these, steps 2 and 3 are called data preprocessing.

This time, I would like to talk about the data understanding part of preprocessing.

About data understanding

Let's go through roughly what data preprocessing in machine learning involves, using some code along the way.

The language is Python. The libraries for machine learning are pandas and NumPy, and the libraries for visualization are seaborn and matplotlib.

** Loading library **

#Loading the library
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

** Data details **

The data used this time is the Titanic passenger list.

PassengerId: Passenger ID
Survived: Survival result (0 = died, 1 = survived)
Pclass: Passenger class (1 seems to be the highest class)
Name: Passenger name
Sex: Sex
Age: Age
SibSp: Number of siblings and spouses
Parch: Number of parents and children
Ticket: Ticket number
Fare: Fare
Cabin: Cabin number
Embarked: Port of embarkation

Suppose you have a file called titanic_train.csv in a directory named data.

** Read file **

pandas provides many reading functions named read_xxx, one for each file format, so use them to read files. This time it's a CSV file, so use read_csv.

pandas is a library for handling tabular data in a structure called a data frame. Load the file into a data frame.

#Read data from file
file_path = 'data/titanic_train.csv'
train_df = pd.read_csv(file_path,encoding='utf-8')
train_df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.25     NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.925    NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1     C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.05     NaN    S

The data looks like this. Data frames allow you to work with data in rows and columns.
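If you do not have the file at hand, the reading step can still be tried out. The following is only a minimal sketch: it feeds read_csv a made-up three-row CSV string (a subset of the columns, with values copied from the rows shown above) through io.StringIO instead of a real file path.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as titanic_train.csv
# (only some columns; values copied from the head() output above)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,26,7.925
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.head(2))  # first two rows only
```

head() takes the number of rows to show as its argument, and shape is a quick way to confirm that the whole file was loaded.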

** Check data **

First, check the data frame. Let's see what kind of columns there are.


print(train_df.columns)
print(len(train_df.columns))

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
12

Next, check the data types. In pandas, each column has a single data type, and the operations you can apply depend on that type.

train_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

`object` is the data type used for values such as character strings; the others are numeric data types.
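As a small illustration of working according to the type, the sketch below splits the columns of a made-up data frame into numeric and non-numeric ones with select_dtypes. The column names follow the Titanic data, but the values are placeholders.

```python
import pandas as pd

# Hypothetical mini data frame with one string column and two numeric ones
df = pd.DataFrame({
    'Name': ['Braund', 'Cumings'],
    'Age': [22.0, 38.0],
    'Pclass': [3, 1],
})

# Columns with a numeric dtype (int64, float64, ...)
numeric_cols = df.select_dtypes(include='number').columns.tolist()
# Everything else (here: the string column)
non_numeric_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Separating the columns like this makes it easier to decide which ones can go straight into numeric summaries and which ones need counting by category instead.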

** Column reference **

You can also refer to the data in a data frame column by column. Specify a column name as a string to get one column, or a list of column names to get several columns.

Data frame['column name']

Data frame[['column name', 'column name']]
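The difference between the two forms is easiest to see from the type that comes back. Here is a minimal sketch on a made-up data frame (the values are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

age = df['Age']                 # one column name -> Series
age_fare = df[['Age', 'Fare']]  # list of column names -> DataFrame

print(type(age).__name__)
print(type(age_fare).__name__)
```

A single name gives a one-dimensional Series, while a list of names gives a data frame, even if the list contains only one name.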

** Basic aggregation of data frames **

The first thing to do after receiving data is basic tabulation. pandas allows you to calculate basic statistics for data frames.

#Basic statistics of numerical data
train_df.describe()
       PassengerId  Survived  Pclass    Age        SibSp     Parch     Fare
count  891          891       891       714        891       891       891
mean   446          0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std    257.353842   0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min    1            0         1         0.42       0         0         0
25%    223.5        0         2         20.125     0         0         7.9104
50%    446          0         3         28         0         0         14.4542
75%    668.5        1         3         38         1         0         31
max    891          1         3         80         8         6         512.3292

describe() returns the basic statistics for the numeric columns. You can see how many values each column has and how they are distributed.
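In the describe() output above, the count for Age is 714 while the other columns have 891. That is because count only includes non-missing values, so Age has 891 - 714 = 177 missing values. The sketch below shows the same effect on a made-up five-row column.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with two missing values
df = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

n_rows = len(df)                      # all rows, missing or not
n_non_null = df['Age'].count()        # count() skips NaN
n_missing = df['Age'].isnull().sum()  # number of NaN values

print(n_rows, n_non_null, n_missing)
```

On the real data, train_df.isnull().sum() would give the number of missing values for every column at once.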

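** Check data categories **

Basic statistics cannot be computed for string-type data, so instead check which categories the column contains and how many rows fall into each. This can be calculated with the value_counts method. It also works on numeric columns with few distinct values, such as Pclass.

Data frame['column name'].value_counts()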

train_df['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64
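Counts are often easier to compare as ratios. As a sketch, the following rebuilds a Pclass column from the three counts shown above (so no file is needed) and passes normalize=True to value_counts.

```python
import numpy as np
import pandas as pd

# Rebuild the Pclass column from the counts shown above (491, 216, 184)
pclass = pd.Series(np.repeat([3, 1, 2], [491, 216, 184]), name='Pclass')

counts = pclass.value_counts()                # absolute counts
ratios = pclass.value_counts(normalize=True)  # share of each class

print(counts)
print(ratios.round(3))
```

With normalize=True the values add up to 1, so you can read them as the share of passengers in each class.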

** Group-by aggregation **

Use groupby when aggregating by one or more columns.

Data frame[['column name', 'column name', 'column name']].groupby(['column name', 'column name']).aggregate function()

#Confirmation of the number of survivors by gender(0:death, 1:Survival)
train_df[['Sex','Survived','PassengerId']].groupby(['Sex','Survived']).count()
                 PassengerId
Sex     Survived
female  0        81
        1        233
male    0        468
        1        109

Group-by aggregation supports various aggregate functions (mean, minimum, maximum, and so on).

Looking at this, there is a considerable difference between the survival rates of women and men. The overwhelming majority of men have Survived = 0, meaning they died in the sinking.
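To put a number on that difference, the mean of the 0/1 Survived column per group gives the survival rate directly. This is a minimal sketch that rebuilds the Sex and Survived columns from the four counts in the table above rather than reading the file.

```python
import numpy as np
import pandas as pd

# Rebuild Sex / Survived from the counts in the table above
group_sizes = [81, 233, 468, 109]
sex = np.repeat(['female', 'female', 'male', 'male'], group_sizes)
survived = np.repeat([0, 1, 0, 1], group_sizes)
df = pd.DataFrame({'Sex': sex, 'Survived': survived})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
survival_rate = df.groupby('Sex')['Survived'].mean()

print(survival_rate.round(3))
```

From these counts the rate is 233 / 314 for women (about 74%) and 109 / 577 for men (about 19%).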

** Cross tabulation **

You can perform cross tabulation (counting rows) over combinations of multiple columns.

pd.crosstab(data frame['column name'], data frame['column name'])

pd.crosstab([data frame['column name'], data frame['column name']], [data frame['column name'], data frame['column name']])

pd.crosstab([train_df['Sex'], train_df['Survived']], train_df['Pclass'])
Pclass           1   2   3
Sex     Survived
female  0        3   6   72
        1        91  70  72
male    0        77  91  300
        1        45  17  47
pd.crosstab([train_df['Sex'], train_df['Survived']], [train_df['Pclass'], train_df['Embarked']],margins=True)
Pclass           1            2            3
Embarked         C   Q  S     C   Q  S     C   Q   S
Sex     Survived
female  0        1   0  2     0   0  6     8   9   55
        1        42  1  46    7   2  61    15  24  33
male    0        25  1  51    8   1  82    33  36  231
        1        17  0  28    2   0  15    10  3   34
All              85  2  127   17  3  164   66  72  353

Summarizing the columns in this way shows what kind of data you have.
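Raw counts are hard to compare when the groups differ in size, so crosstab can also return ratios. The sketch below rebuilds Pclass and Survived from the first cross table above (female and male counts added together) and uses normalize='index' so that each row sums to 1.

```python
import numpy as np
import pandas as pd

# Counts per (Pclass, Survived), summed over Sex from the first cross table:
# class 1: 80 died / 136 survived, class 2: 97 / 87, class 3: 372 / 119
group_sizes = [80, 136, 97, 87, 372, 119]
pclass = np.repeat([1, 1, 2, 2, 3, 3], group_sizes)
survived = np.repeat([0, 1, 0, 1, 0, 1], group_sizes)
df = pd.DataFrame({'Pclass': pclass, 'Survived': survived})

# normalize='index' divides each row by its row total
table = pd.crosstab(df['Pclass'], df['Survived'], normalize='index')

print(table.round(3))
```

Read this way, the share of survivors falls from class 1 (136 / 216) to class 3 (119 / 491).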

** Data visualization **

Data can be visualized with matplotlib.

Charts such as histograms, bar graphs, and scatter plots can be created from the data. Let's draw a histogram.

A histogram is a chart whose horizontal axis is the value and whose vertical axis is the count. The number of bins is the number of intervals into which the range of values is divided.
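What a bin means is easier to see as numbers than as a picture. As a sketch, np.histogram does the same counting that a histogram plot does, without drawing anything. The ages below are made-up values.

```python
import numpy as np

# Hypothetical ages
ages = np.array([2, 22, 26, 35, 35, 38, 54])

# Split the range from min to max into 3 equal-width bins and count each
counts, edges = np.histogram(ages, bins=3)

print(edges)   # 4 edges define 3 bins
print(counts)  # number of values falling in each bin
```

Here the range from 2 to 54 is cut into three intervals of equal width, and the counts per interval are 1, 4 and 2. A larger number of bins gives a finer but noisier picture.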

** Display histogram **

Data frame['column name'].plot(kind='hist', bins=number of bins)

Data frame['column name'].hist(bins=number of bins)

train_df['Fare'].plot(figsize=(16, 5),kind='hist',bins=20)
plt.show()

image.png

train_df['Age'].plot(figsize=(16, 5),kind='hist',bins=20)
plt.show()

image.png

** Condition specification **

By adding extraction conditions to the data frame, you can divide what you want to visualize into layers.

Data frame[conditional expression]['column name']

Condition   Symbol
and         &
or          |
not         ~

#Extract data frames by specifying conditions
train_df[train_df['Survived']==0]['Age'].head()

0    22.0
4    35.0
5     NaN
6    54.0
7     2.0
Name: Age, dtype: float64
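The three symbols can be tried on a small made-up data frame. This is only a sketch with placeholder values; the point to notice is that each comparison needs its own parentheses when written inline.

```python
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

# Each condition is a boolean Series with one True/False per row
is_male = df['Sex'] == 'male'
is_survivor = df['Survived'] == 1

both = df[is_male & is_survivor]    # and
either = df[is_male | is_survivor]  # or
not_male = df[~is_male]             # not

# Written inline, every comparison is wrapped in parentheses
# because & and | bind more tightly than == and <
young_female = df[(df['Sex'] == 'female') & (df['Age'] < 30)]

print(len(both), len(either), len(not_male))
print(young_female)
```

Giving the conditions names, as with is_male above, keeps long filters readable.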

#Draw age histograms separately for survivors and the dead
train_df[train_df['Survived']==0]['Age'].hist(figsize=(16,5),bins=16,color="#F8766D", alpha=0.3)
train_df[train_df['Survived']==1]['Age'].hist(figsize=(16,5),bins=16,color="#5F9BFF", alpha=0.3)
plt.show()

image.png

#For men only, draw age histograms separately for survivors and the dead
train_df[(train_df['Survived']==0)&(train_df['Sex']=='male')]['Age'].hist(figsize=(16,5),bins=10,color="#F8766D", alpha=0.3)
train_df[(train_df['Survived']==1)&(train_df['Sex']=='male')]['Age'].hist(figsize=(16,5),bins=10,color="#5F9BFF", alpha=0.3)
plt.show()

image.png

#For women only, draw age histograms separately for survivors and the dead
train_df[(train_df['Survived']==0)&(train_df['Sex']=='female')]['Age'].hist(figsize=(16,5),bins=10,color="#F8766D", alpha=0.3)
train_df[(train_df['Survived']==1)&(train_df['Sex']=='female')]['Age'].hist(figsize=(16,5),bins=10,color="#5F9BFF", alpha=0.3)
plt.show()

image.png

Visualizing the results of various tabulations makes differences easier to see and makes it easier to find features in the data that are likely to be useful for prediction.

The age distributions of survivors and the dead differ considerably between men and women.

Next, let's display the scatter plot.

** Display scatter plot **

To draw a scatter plot, specify kind='scatter'. Because it visualizes the relationship between two numerical values, specify two numeric columns.

Data frame[['column name', 'column name']].plot(x='horizontal-axis column name', y='vertical-axis column name', kind='scatter')

train_df[['Fare','Age']].plot(x='Fare', y='Age', figsize=(16, 9),kind='scatter')
plt.show()

image.png

I want to color the points by gender, but the data is a character string, so it has to be converted to a number first. Add a new column and assign the values converted to numbers.

** Add columns in data frame **

Data frame['column name'] = value

** Data frame character replacement **

Data frame.replace({'string before replacement': value after replacement})

#Convert gender to a number (0: male, 1: female)
train_df['Sex2'] = train_df['Sex'].replace({'male':0,'female':1})
train_df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked  Sex2
0  1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.25     NaN    S         0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C         1
2  3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.925    NaN    S         1
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1     C123   S         1
4  5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.05     NaN    S         0

The new Sex2 column at the far right holds gender as a number (0 or 1).
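As an aside, Series.map is another common way to do this kind of conversion. It is a different method from the replace used above, shown here only as a sketch for comparison: replace leaves values that are not in the dictionary unchanged, while map turns them into NaN, which makes unexpected categories easy to spot. The 'unknown' value below is made up for the illustration.

```python
import pandas as pd

# Hypothetical column containing one value that is not in the mapping
sex = pd.Series(['male', 'female', 'unknown'])
mapping = {'male': 0, 'female': 1}

# map() looks every value up in the dictionary; misses become NaN
mapped = sex.map(mapping)

print(mapped)
print(mapped.isnull().sum())  # number of values that had no mapping
```

In the Titanic data the Sex column contains only male and female, so both approaches give the same 0/1 column there.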

Now draw the scatter plot colored by gender.

#Coloring the scatter plot(0 men,1 woman)
train_df[['Fare','Age']].plot(x='Fare', y='Age', figsize=(16, 5),kind='scatter',c=train_df['Sex2'],cmap='winter')
plt.show()

image.png

Let's color-code the survivors and the dead.

#Coloring the scatter plot(0:death, 1:Survival)
train_df[['Fare','Age']].plot(x='Fare', y='Age', figsize=(16, 5),kind='scatter',c=train_df['Survived'],cmap='winter')
plt.show()

image.png

When color-coded by gender and by survival, the two plots look quite similar.
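One rough way to check that impression with a number is the correlation between the 0/1 gender column and the 0/1 Survived column. This is only a sketch: it rebuilds the two columns from the gender-by-survival counts in the group-by table earlier in this article instead of reading the file.

```python
import numpy as np
import pandas as pd

# female/died, female/survived, male/died, male/survived
group_sizes = [81, 233, 468, 109]
sex2 = np.repeat([1, 1, 0, 0], group_sizes)      # 0: male, 1: female
survived = np.repeat([0, 1, 0, 1], group_sizes)  # 0: died, 1: survived
df = pd.DataFrame({'Sex2': sex2, 'Survived': survived})

# Pearson correlation between two 0/1 columns
corr = df['Sex2'].corr(df['Survived'])

print(round(corr, 3))
```

The value comes out at roughly 0.54, a fairly strong positive relationship, which fits with the two colorings looking alike.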

** Visualization with seaborn **

seaborn is a library that makes matplotlib charts look a little nicer. The seaborn library also provides visualizations that are cumbersome to produce with matplotlib alone.

pairplot displays scatter plots for every pair of numeric columns together.

sns.pairplot(data frame, hue='string-type column')

#If there are missing values, some cannot be displayed, so fill the missing values with the average value.
for name in ['Age']:
    train_df[name] = train_df[name].fillna(train_df[name].mean())

train_df['Survived2'] = train_df['Survived'].replace({0:'death',1:'survived'})
train_df['Sex2'] = train_df['Sex'].replace({'male':0,'female':1})

# pairplot
#Check the relationships between the columns
#Diagonal cells show the distribution of a single column, and the other cells show a scatter plot of two columns.
sns.pairplot(train_df[['Pclass','Age','Fare','Sex2','Survived2']], hue="Survived2")
plt.show()

image.png
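The fillna step in the code above is worth a closer look, because filling with the mean changes the shape of the distribution. Here is a minimal sketch on a made-up column with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with one missing value
age = pd.Series([22.0, np.nan, 26.0, 36.0])

mean_age = age.mean()          # NaN is skipped: (22 + 26 + 36) / 3 = 28
filled = age.fillna(mean_age)  # the missing value becomes 28

print(filled.tolist())
print(age.std(), filled.std())  # the spread shrinks after filling
```

The mean stays the same, but the standard deviation gets smaller because the filled values all sit exactly at the mean. With 177 missing ages in the Titanic data, this shows up as a spike at the mean age in histograms drawn after filling.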

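factorplot

factorplot draws charts split by categorical factors. Note that factorplot was renamed to catplot in seaborn 0.9 and is not available in recent versions; with catplot, kind='point' reproduces the factorplot default.

catplot(x='horizontal-axis column name', y='vertical-axis column name', hue='column name used for coloring', col='column name used to split the panels', kind='point', data=data frame)

# factorplot (comparison chart by category), written with catplot for current seaborn
sns.catplot(x='Pclass', y='Age', hue='Survived', col='Sex2', kind='point', data=train_df)
plt.show()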

image.png

You can turn it into a box plot by specifying kind='box'.

#catplot is the current name of factorplot (renamed in seaborn 0.9)
sns.catplot(x='Pclass', y='Age', hue='Survived', col='Sex2', kind='box', data=train_df)
plt.show()

image.png

lmplot

lmplot draws a scatter plot of two variables together with a linear regression line.

sns.lmplot(x='horizontal-axis column', y='vertical-axis column', hue='column used for coloring', data=data frame, x_bins=bin values)

#Specify the age bins
generations = [10,20,30,40,50,60,70,80]
#Draw the survival rate by age bin with a regression line for each passenger class
#x and y must be passed as keyword arguments in recent seaborn
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=train_df, hue_order=[1,2,3], x_bins=generations)
plt.show()

image.png
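The points in this chart are survival rates averaged per age bin. To get similar numbers as a table, one option is pd.cut followed by groupby. Note that this is a plain pandas sketch and not what lmplot does internally: as far as I understand, seaborn treats the x_bins values as bin centers, while the code below uses 10-year intervals. The ages and outcomes are made-up values.

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    'Age':      [4, 8, 15, 22, 25, 31, 38, 45, 52, 61],
    'Survived': [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})

# Left-closed 10-year bands: [0, 10), [10, 20), ..., [70, 80)
edges = list(range(0, 90, 10))
df['AgeBand'] = pd.cut(df['Age'], bins=edges, right=False)

# Mean of the 0/1 column per band = survival rate per band
# observed=True drops bands that contain no passengers
rate_by_band = df.groupby('AgeBand', observed=True)['Survived'].mean()

print(rate_by_band)
```

On the real data you would replace df with train_df and, if needed, group by Pclass as well to get one rate per class and age band.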

Summary

To understand the data, first aggregate the whole data set to see what kind of data is available, then stratify it and try various visualizations to see how the data is distributed.

After looking at the data, work out which columns can be used for prediction. From there, you shape the data into a form that can be used for machine learning.

First, get a feel for the overall flow.

22 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
