[PYTHON] Data analysis Titanic 2

Aidemy 2020/10/30

Introduction

Hello, this is Yope! I come from a liberal-arts background, but I became interested in the possibilities of AI, so I enrolled at the AI school "Aidemy" to study. I am summarizing what I learn there on Qiita so I can share it with you. I am very happy that so many people read the previous summary article. Thank you! This is the second post in the "Data Analysis: Titanic" series. Nice to meet you.

What to learn this time
・④ Pattern analysis and data analysis (continued from last time)
・③ Data shaping, creation, and cleansing (revisited) → creation of new features

④ Pattern analysis, data analysis

Analysis of the previous pivot table (correlation)

・When Pclass = 1, the mean of Survived is 0.62, a significant (positive) correlation above 0.5, so Pclass should be used as a model feature.
・Similarly, when Sex = female, the mean of Survived is 0.74, so Sex is also used as a model feature.
・Since neither SibSp nor Parch showed a significant correlation on its own, a new feature combining the two will be created.

Visualize data

Histogram by specifying the age range

・To test the hypothesis that "children have a high survival rate", we specify ranges of the Age data and divide it into bins, then plot the distribution as a histogram.
・Note that in this histogram the horizontal axis is Age, but the vertical axis is the number of records, i.e. the number of passengers.

・A histogram can be created with "df.hist()". The number of bins (how finely the data is divided) is specified with the "bins=" argument, and to see separate counts for Survived = 0 and Survived = 1, pass the grouping column with "by=".
・Also, as we saw in Chapter 1, Age contains missing values, so remove them first with dropna().

・Code (screenshot)
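Since the article's code was shared only as a screenshot, here is a minimal sketch of what it presumably does, using a tiny hypothetical DataFrame in place of the real train.csv:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd

# Toy stand-in for the Titanic training data (hypothetical values)
train_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0, 2.0, 27.0],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Drop rows with missing Age first, then draw one Age histogram per
# Survived value (0 and 1); bins= controls the number of classes
axes = train_df.dropna(subset=["Age"]).hist(column="Age", by="Survived", bins=16)
```

With the real data, the two panels let you compare the age distribution of survivors against non-survivors directly.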

・Result: graph (screenshot)

・Looking at the results, we can see that the survival rate is high for ages 0-5.
・Looking at the total counts, we can also see that the 15-35 age range contains the most passengers.

Correlation of features with "category value and numerical value"

・Next, the per-Age histogram created in the previous section is further broken down by Pclass.
・The code draws histograms of Age for each Survived value, splitting them by Pclass with "by=train_df['Pclass']".

・Code (screenshot) ・Graph: non-survivors (screenshot) ・Graph: survivors (screenshot)
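A hedged reconstruction of the screenshot's approach, again on a toy DataFrame with made-up values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd

# Toy stand-in for the Titanic training data (hypothetical values)
train_df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "Survived": [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 1, 2, 3, 2, 2, 1],
})

# For the dead (Survived == 0) and survivors (Survived == 1) separately,
# draw one grid of Age histograms split by Pclass
for survived, group in train_df.groupby("Survived"):
    axes = group.hist(column="Age", by="Pclass", bins=8)
```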

・From these graphs we can see that deaths are overwhelmingly concentrated in Pclass = 3, that survivors are numerous in Pclass = 1, and that many of the survivors in Pclass = 2 and 3 are aged 0-5. Both observations match our hypotheses.

Correlation of features with "category value"

・Using a pivot table, we compute the relationship between "Survived" and "Pclass / Sex" for each value of "Embarked" (['C', 'Q', 'S']), and plot it with "plt.plot()".

・Create the pivot table in the same way as in Chapter 1. This time, don't forget to restrict Embarked in the filtering condition. Also, since there are two grouping keys, Pclass and Sex, pass them as a list.
・Split the resulting pivot table by sex (male / female) and sort each part by Pclass.
・Plot this with "plt.plot()". The x-axis is Pclass (['1', '2', '3']) and the y-axis is the mean of Survived (the survival rate).

・Code (screenshot; shown for Embarked = 'C', the other two are created the same way; the omitted part is the same as last time)

・Graph (screenshot; shown for Embarked = 'C', the other two are similar)
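The pivot-plus-plot steps described above can be sketched as follows; the DataFrame here is a small hypothetical stand-in for the real training data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the Titanic training data (hypothetical values)
train_df = pd.DataFrame({
    "Embarked": ["S", "C", "C", "C", "C", "C", "C", "S"],
    "Pclass":   [3, 1, 3, 1, 2, 2, 3, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Pivot table: mean survival rate by Pclass and Sex, restricted to Embarked == 'C'
df_c = train_df[train_df["Embarked"] == "C"]
pivot = df_c[["Pclass", "Sex", "Survived"]].groupby(["Pclass", "Sex"], as_index=False).mean()

# One line per sex: survival rate against Pclass
for sex, grp in pivot.groupby("Sex"):
    grp = grp.sort_values("Pclass")
    plt.plot(grp["Pclass"], grp["Survived"], label=sex)
plt.legend()
```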

・From this graph, we can say that the survival rate of women is overwhelmingly high, which also matches the hypothesis. Although not shown here, it also turned out that men with Embarked = 'Q' have a very low survival rate.

Correlation between features with "category value" and features with "numerical value"

・This time, take the previous section's plots and replace the "Pclass" part with "Survived" and the "Survived" part with "Fare", then visualize them in exactly the same way.
・After building the pivot table relating "Fare" to "Survived / Sex", the next step is to split it by whether Survived is 0 or 1. The result is drawn as subplots, created with "plt.subplot()". Each subplot is a bar chart (plt.bar()) with "Sex" on the horizontal axis and "Fare" on the vertical axis; the left panel shows "Survived == 0" and the right panel shows "Survived == 1".
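A minimal sketch of that subplot layout, with hypothetical fares standing in for the real data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the Titanic training data (hypothetical values)
train_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Fare": [10.0, 80.0, 7.0, 50.0, 9.0, 60.0],
    "Survived": [0, 1, 0, 1, 1, 0],
})

fig = plt.figure()
for i, survived in enumerate([0, 1]):
    plt.subplot(1, 2, i + 1)  # left: Survived == 0, right: Survived == 1
    # Mean fare per sex within this Survived group
    means = train_df[train_df["Survived"] == survived].groupby("Sex")["Fare"].mean()
    plt.bar(means.index, means.values)
    plt.title(f"Survived == {survived}")
```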

③ Data shaping, creation, cleansing

・From here on, we convert, create, and complete (impute) features.

Delete data

・First, as decided in Chapter 1, delete "Ticket" and "Cabin", which have many missing values and duplicates.
・Because we are deleting columns, use "drop(axis=1)".
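In code, the column deletion looks like this (toy DataFrames in place of the real train/test sets):

```python
import pandas as pd

# Toy stand-ins for the training and test data (hypothetical values)
train_df = pd.DataFrame({
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
})
test_df = pd.DataFrame({"Ticket": ["330911"], "Cabin": [None]})

# axis=1 drops columns (axis=0 would drop rows)
train_df = train_df.drop(["Ticket", "Cabin"], axis=1)
test_df = test_df.drop(["Ticket", "Cabin"], axis=1)
```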

Creating new features

・Similarly, "Name" and "PassengerId" are clearly uncorrelated with Survived, so they will also be deleted. However, the titles contained in "Name" (Mr, Mrs, Dr, etc.) may well be correlated with Survived, so check that first.
・To extract the title from Name, use a regular expression. To pull the part of a str-type column that matches a regular expression, use "str.extract('regular expression')". The regular expression used here is '([A-Za-z]+)\.'. It is written this way because a title such as "Dr." consists of one or more upper- and lowercase letters immediately before a ".". In addition, "expand=False" is passed as the second argument, which makes the extracted result come back as a Series.
・The extracted titles are stored in a new column (feature) called 'Title'.
・The relationship between 'Title' and 'Sex' is examined with a cross tabulation: it counts how many passengers of each Sex appear for each Title category such as Dr. or Mrs.
・Cross tabulation is performed with the "pd.crosstab()" function. The first argument is the data used for the rows of the result ('Title' here), and the second argument is the data used for the columns ('Sex').

・Code (screenshot)

・Result (partial, screenshot)
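A sketch of the extraction and cross tabulation on a few hypothetical names:

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical names)
train_df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Palsson, Master. Gosta Leonard"],
    "Sex": ["male", "female", "female", "male"],
})

# One or more letters immediately followed by a period, e.g. "Mr.";
# expand=False returns a Series rather than a one-column DataFrame
train_df["Title"] = train_df["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Rows: Title categories, columns: Sex, cells: counts
print(pd.crosstab(train_df["Title"], train_df["Sex"]))
```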

・Among the titles that appear, the infrequent ones are grouped into a catch-all category called "'Rare'". Also, replace 'Mlle' with 'Miss' and 'Mme' with 'Mrs', since they have the same meaning. These replacements can be done with "replace()".
・Once that is done, create a pivot table and check the correlation with Survived.

・Code (screenshot)
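The grouping and replacement steps might look like this; the list of rare titles shown here is an assumption based on common Titanic preprocessing, not taken from the screenshot:

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Mlle", "Mme", "Dr", "Rev", "Master"],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Fold the infrequent titles into a single 'Rare' bucket (assumed list)
rare_titles = ["Dr", "Rev", "Lady", "Countess", "Capt", "Col",
               "Don", "Major", "Sir", "Jonkheer", "Dona"]
train_df["Title"] = train_df["Title"].replace(rare_titles, "Rare")

# 'Mlle' and 'Mme' are French equivalents of 'Miss' and 'Mrs'
train_df["Title"] = train_df["Title"].replace({"Mlle": "Miss", "Mme": "Mrs"})

# Pivot table: mean survival rate per title
print(train_df[["Title", "Survived"]].groupby("Title", as_index=False).mean())
```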

・We want to treat each Title as a number, so convert it according to {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}.
・To do this, prepare the above correspondence table (a dictionary) and apply the "map()" function to dataset['Title'].
・Once this is done, delete "Name" and "PassengerId" as originally planned.

・Code (screenshot)

・Result (screenshot)
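A sketch of the mapping and deletion; the fillna(0) fallback for unmapped titles is an assumption, since map() leaves unmatched values as NaN:

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Title": ["Mr", "Mrs", "Miss"],
})

# Correspondence table from the text
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
train_df["Title"] = train_df["Title"].map(title_mapping)
# map() turns anything not in the dictionary into NaN, so fill with 0 (assumed)
train_df["Title"] = train_df["Title"].fillna(0)

# Delete the columns as originally planned
train_df = train_df.drop(["Name", "PassengerId"], axis=1)
```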

Convert multi-valued data to binary data

・Using the map() function from the previous section, the next step is to convert multi-valued categorical data to binary data.
・Here, convert 'Sex' to male: 0, female: 1.
・If a data type is passed to the astype() method, it returns a new object with the values cast to that type. Since we want everything to be int this time, pass int as the argument.

・Code (screenshot)
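In code, the conversion is a one-liner (toy data in place of the real DataFrame):

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map the category labels to numbers, then force an integer dtype
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1}).astype(int)
```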

Feature completion: Age

・After deleting, converting, and creating features, we perform completion (imputation): estimating values to fill in Null/NaN entries.
・First, complete the continuous numerical feature "Age". There are three possible approaches:
1: Generate random numbers based on the mean
2: Refer to other correlated features
3: Combine 1 and 2; specifically, generate random numbers based on the mean and standard deviation

・This time, method 2 is used. Specifically, two features correlate with "Age": "Sex" and "Pclass". Referring to these two features, an estimate of Age (the median age) is computed for each combination.
・First, prepare an empty 2 × 3 array to hold the Age estimates. When the matrix size is known in advance like this, it is convenient to use "np.zeros()": by passing the shape as an argument, it creates an array filled with zeros (effectively empty).

・Next, compute the Age estimates (median ages). First, extract the 'Age' values (excluding NaN) for every combination of 'Sex' and 'Pclass'. Since there are 2 × 3 = 6 combinations, the array shape is specified as (2, 3). Applying "median()" to each extracted subset gives the median, which serves as the estimated age.
・Store the median for each case in the (2, 3) array, and the preparation is done.

・Code (screenshot)

・Result (screenshot)
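A sketch of the median-based imputation, assuming Sex has already been encoded as 0/1 and using hypothetical ages:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],   # 0 = male, 1 = female (already encoded)
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 3],
    "Age":    [40.0, 30.0, 22.0, 35.0, 28.0, 18.0, np.nan, np.nan],
})

# 2 sexes x 3 classes -> a (2, 3) array of median ages, initialised to zero
guess_ages = np.zeros((2, 3))
for sex in range(2):
    for pclass in range(3):
        # Median Age for this (Sex, Pclass) combination, NaN excluded
        subset = train_df[(train_df["Sex"] == sex) &
                          (train_df["Pclass"] == pclass + 1)]["Age"].dropna()
        guess_ages[sex, pclass] = subset.median()
        # Fill the missing ages for this combination with the median
        mask = (train_df["Age"].isnull() & (train_df["Sex"] == sex) &
                (train_df["Pclass"] == pclass + 1))
        train_df.loc[mask, "Age"] = guess_ages[sex, pclass]
```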

Convert continuous values to discrete values

AgeBand
・After completing Age, convert the continuous Age values to discrete values. As stated in the "Creation" guideline in Chapter 1, this is done to make prediction easier by specifying ranges and dividing the data (= converting it to discrete data).
・Converting to discrete values is called "binning". To do it, use "pd.cut()": pass the data as the first argument and the number of bins as the second. This time the data is split into 5 bins and stored as discrete data in a new feature called "AgeBand".

・We also want to check the correlation between AgeBand and Survived, so create a pivot table as well.

・Code (screenshot)
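A sketch of the binning and pivot table on toy ages:

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "Age": [2, 15, 25, 40, 55, 70, 8, 33],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 1],
})

# Split Age into 5 equal-width bins; each row gets its interval as AgeBand
train_df["AgeBand"] = pd.cut(train_df["Age"], 5)

# Pivot table: mean survival rate per age band
print(train_df[["AgeBand", "Survived"]].groupby("AgeBand", observed=False).mean())
```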

・Next, convert the discretized Age to ordinal data. Specifically, referring to AgeBand, an Age of "0-16" becomes 0, "16-32" becomes 1, "32-48" becomes 2, "48-64" becomes 3, and so on.
・The conversion extracts each 'Age' range with loc[condition, column to convert] and replaces it with the corresponding number.
・Once finished, delete AgeBand with drop().

・Code (screenshot)
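The loc-based replacement might look like this; the final ">64 becomes 4" rule is an assumption that completes the five bins:

```python
import pandas as pd

# Toy ages, one per band (hypothetical values)
train_df = pd.DataFrame({"Age": [5.0, 20.0, 40.0, 50.0, 70.0]})

# Replace each AgeBand range with an ordinal code
train_df.loc[train_df["Age"] <= 16, "Age"] = 0
train_df.loc[(train_df["Age"] > 16) & (train_df["Age"] <= 32), "Age"] = 1
train_df.loc[(train_df["Age"] > 32) & (train_df["Age"] <= 48), "Age"] = 2
train_df.loc[(train_df["Age"] > 48) & (train_df["Age"] <= 64), "Age"] = 3
train_df.loc[train_df["Age"] > 64, "Age"] = 4  # fifth bin (assumed)
```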

Creating new features

FamilySize
・As stated in the "Creation" policy, combine the related features "Parch" and "SibSp" into a new feature called "FamilySize", which represents the number of family members.
・The construction is simply extracting the two columns and adding them together. However, since the passenger themselves counts toward "family size", don't forget to add 1.
・Having created the new feature, take the mean of Survived per FamilySize and check the correlation.

・Code (screenshot)
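In code, the new feature and its correlation check look like this (toy data):

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Survived": [1, 0, 0],
})

# +1 counts the passenger themselves
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1

# Mean survival rate per family size
print(train_df[["FamilySize", "Survived"]].groupby("FamilySize", as_index=False).mean())
```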

IsAlone
・The FamilySize feature above is further abstracted into a binary "alone or with family" classification: if FamilySize is 1, set "IsAlone = 1"; otherwise set "IsAlone = 0".
・To build it, first create a feature called IsAlone with every value set to 0, then set it to 1 only where FamilySize = 1 (using loc[]).
・Once this is done, delete Parch, SibSp, and FamilySize.

・Code (screenshot)
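A minimal sketch of the IsAlone construction and cleanup (toy data):

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "FamilySize": [1, 3, 1, 2],
    "Parch": [0, 2, 0, 1],
    "SibSp": [0, 0, 0, 0],
})

# Start with 0 everywhere, then flag the solo passengers
train_df["IsAlone"] = 0
train_df.loc[train_df["FamilySize"] == 1, "IsAlone"] = 1

# The component features are no longer needed
train_df = train_df.drop(["Parch", "SibSp", "FamilySize"], axis=1)
```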

Age*Class
・Here we create an artificial feature called "Age*Class", which weights "age" by multiplying it by "cabin class". To build it, simply extract Age and Pclass and multiply them.
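The multiplication is straightforward; this assumes Age has already been converted to its ordinal codes:

```python
import pandas as pd

# Toy stand-in: Age is already ordinal-encoded (hypothetical values)
train_df = pd.DataFrame({"Age": [1, 2, 0], "Pclass": [3, 1, 2]})

# Elementwise product of the two columns
train_df["Age*Class"] = train_df["Age"] * train_df["Pclass"]
```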

Completion with the mode: Embarked

・As seen in Chapter 1, Embarked in the training data has two missing values. To fill them, first exclude the missing entries with dropna() and then compute the mode. Incidentally, the mode is "S".
・The mode is obtained with "mode()". Since we only want the value itself (not the index), take the first element with [0].
・Store the obtained mode in the variable freq_port and fill the missing values with "fillna(freq_port)".

・Code (screenshot)
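A sketch of the mode-based fill (toy data with two missing ports):

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# dropna() excludes NaN, mode() finds the most frequent value,
# and [0] picks the value itself rather than the resulting Series
freq_port = train_df["Embarked"].dropna().mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(freq_port)
```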

・Like Sex, Embarked is a categorical value, so convert it to numbers according to {'S': 0, 'C': 1, 'Q': 2}, using the map() function. Also apply astype(int).

・Code (screenshot)
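The conversion mirrors the Sex encoding earlier (toy data):

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Map the port labels to numbers and force an integer dtype
train_df["Embarked"] = train_df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
```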

Completing numerical data: Fare

・Fare in the test data has exactly one missing value. Substitute the median for it.
・Furthermore, like Age, the continuous Fare values are converted to discrete values, stored in a new feature called FareBand, by dividing Fare into four bins.
・For Age the range was divided with "cut()", which is used when you want bins of equal width. For Fare, "qcut()" is used instead, which divides the data so that each bin contains an (approximately) equal number of elements.

・Code (screenshot)
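A sketch of the median fill and the quantile-based binning (toy data):

```python
import pandas as pd

# Toy stand-in for the test data: one missing Fare (hypothetical values)
test_df = pd.DataFrame({"Fare": [7.25, None, 71.28, 8.05]})
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].dropna().median())

# Toy stand-in for the training data (hypothetical values)
train_df = pd.DataFrame({
    "Fare": [5.0, 7.5, 9.0, 13.0, 20.0, 30.0, 60.0, 90.0],
    "Survived": [0, 0, 1, 0, 1, 1, 1, 1],
})

# qcut: quartiles, so each bin holds (roughly) the same number of passengers
train_df["FareBand"] = pd.qcut(train_df["Fare"], 4)
print(train_df[["FareBand", "Survived"]].groupby("FareBand", observed=False).mean())
```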

・Referring to this FareBand, replace Fare with discrete values: fares up to 7.91 become 0, "7.91-14.454" becomes 1, "14.454-31" becomes 2, and above 31 becomes 3.
・The conversion method is the same as for Age: "loc[condition, column to convert]".

・Code (screenshot)
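The loc-based replacement for Fare; the final astype(int) is an assumption to keep the encoded column integer-typed:

```python
import pandas as pd

# Toy fares, one per band (hypothetical values)
train_df = pd.DataFrame({"Fare": [5.0, 10.0, 20.0, 50.0]})

# Replace each FareBand range with an ordinal code
train_df.loc[train_df["Fare"] <= 7.91, "Fare"] = 0
train_df.loc[(train_df["Fare"] > 7.91) & (train_df["Fare"] <= 14.454), "Fare"] = 1
train_df.loc[(train_df["Fare"] > 14.454) & (train_df["Fare"] <= 31), "Fare"] = 2
train_df.loc[train_df["Fare"] > 31, "Fare"] = 3
train_df["Fare"] = train_df["Fare"].astype(int)  # integer dtype (assumed)
```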

Summary

・We investigated the correlation of each variable with "Survived", the prediction target. Variables with a significant correlation were used as-is; for those without, we either extracted part of a variable to create a new feature (like the titles taken from Name) or combined variables into a new feature (like FamilySize from Parch and SibSp).
・In addition, continuous values such as Age and Fare were converted to discrete values. Features containing missing values were completed using the median or the mode.
・Sex and Embarked are categorical data such as [male, female] and [S, C, Q], so they were replaced with numerical data such as [0, 1] and [0, 1, 2].

That's all for this time. Thank you for reading to the end.
