Aidemy　2020/10/30

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-focused school "Aidemy" to study. I want to share what I learned there, so I am summarizing it on Qiita. I am very happy that many people have read my previous summary articles. Thank you! This is the first post in the "Data Analysis Titanic" series. Thank you for reading along.

This article summarizes, in my own words, what I learned at "Aidemy". It may contain mistakes and misunderstandings, so please keep that in mind.

What to learn this time
・ Confirming the flow of the Titanic survivor prediction model
・ ② Acquisition of training data / test data
・ ③ Data shaping, creation, cleansing
・ ④ Pattern analysis, data analysis (part of this will be done next time)

Titanic Survivor Prediction

Forecast flow

① Question you want to clarify, __definition of the problem__
② __Acquisition of training data / test data__
③ __Data shaping, creation, cleansing__
④ __Pattern analysis__, specific and exploratory __data analysis__
⑤ Problem modeling, prediction, solution
⑥ __Visualization__, report on problem-solving steps and final solution

① Questions to be clarified, definition of problems

・ This time, we will build a prediction model for __"Titanic Survivor Prediction"__. The site "Kaggle" hosts __competitions__ in which participants compete on the quality of their models, and this is one of its challenges.
・ Kaggle __provides training data for the task__, and I will use it this time as well.

・ About the definition of the problem
・ The training data contains __passenger data__ labeled with __survival / death__, while the test data does not have this label.
・ By applying the trained model to the test data, we can predict whether each passenger survived.

② Acquisition of training data / test data

・ Data is loaded with __pd.read_csv()__.
・ To decide which features (variables) to use, check the __features of the data with Pandas__. It is enough to look at the __column names of train_df__, so output them with __train_df.columns.values__.
・ To see what kind of data is included, __output the first and last few rows of the data__ with __head() / tail()__.

・ Code (image: Screenshot 2020-10-21 16.45.43.png)

・ Output result (only part) (image: Screenshot 2020-10-21 16.48.21.png)
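Since the code above is only available as a screenshot, here is a minimal sketch of the loading and inspection steps. It assumes nothing about the author's exact code: the rows are made up for illustration, and an in-memory string stands in for the real `train.csv` so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv: a few made-up rows that only
# mimic the column layout of the Kaggle file.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A 1111,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B 2222,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C 3333,7.92,,S
4,0,2,"Moe, Mr. Sam",male,35,0,0,D 4444,13.0,,Q
"""

# With the real data this would be something like pd.read_csv("train.csv")
# (the file path is an assumption).
train_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(train_df.columns.values)  # column (feature) names
print(train_df.head(2))         # first rows
print(train_df.tail(2))         # last rows
```

`head()` and `tail()` return 5 rows by default; the argument `2` is used here only because the sample is tiny.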

・ About each feature
・ Survived: Whether the passenger survived. "0" is No / "1" is Yes
・ Pclass: Ticket class, "1" > "2" > "3" (1 is the highest)
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings / spouses on board
・ Parch: Number of parents / children on board
・ Ticket: Ticket number
・ Fare: Fare paid
・ Cabin: Cabin number
・ Embarked: Port of embarkation

③ Data shaping, creation, cleansing

Types of features

・ Features are either __category values__ or __numerical values__.

Category value

・ A __category value__ is a feature that takes only strings or a fixed set of numerical values. These are called __nominal data__ and __ordinal data__, respectively.
・ Of the features confirmed in the previous section, the nominal data are __"Survived", "Sex", and "Embarked"__. (Survived indicates "Yes / No".)
・ The ordinal data is __"Pclass"__, whose fixed values "1, 2, 3" have an order.

Numerical value

・ Numerical data is divided into __discrete data__ and __continuous data__.
・ In this dataset, the discrete data are __"SibSp" and "Parch"__, and the continuous data are __"Age" and "Fare"__.
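As a rough sketch of the grouping above, the feature types can be written down as a dictionary and compared with how pandas stores each column. The rows are made up for illustration, and the name `feature_types` is my own, not something from the course.

```python
import pandas as pd

# Made-up rows for illustration only (not real passengers).
train_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Grouping of the features as described in the text
# (the dictionary name is hypothetical).
feature_types = {
    "nominal": ["Survived", "Sex", "Embarked"],
    "ordinal": ["Pclass"],
    "discrete": ["SibSp", "Parch"],
    "continuous": ["Age", "Fare"],
}

for kind, cols in feature_types.items():
    print(kind)
    print(train_df[cols].dtypes)     # how pandas stores each column
    print(train_df[cols].nunique())  # number of distinct values per column
```

Note that the pandas dtype alone does not tell nominal from ordinal or discrete from continuous: "Survived" and "Pclass" are both stored as integers, so this classification has to come from knowing what the columns mean.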

Missing value count

・ Next, to handle missing values, check __whether the data contains missing values and which features they are in__.
・ Use __info()__ to check for missing values.
・ In the output, __"RangeIndex"__ shows 891 entries, so there are 891 rows in total. If a feature's non-null count is less than 891, __it contains that many missing values__. (For example, Age has only 714 values, so __177 are missing__.)
・ train_df has missing values in __"Age", "Cabin", and "Embarked"__, and test_df in __"Age", "Fare", and "Cabin"__. These will be filled in as part of ③.

・ Result (only part) ![Screenshot 2020-10-22 12.03.57.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/6cbc8d9a-4d5c-9690-66d5-dc011ea9c77b.png)
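The missing-value check can be sketched as follows. The small DataFrame is made up so the example runs on its own; `isnull().sum()` is an extra step I added to get the missing count per column directly, instead of subtracting by hand from the `info()` output.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; np.nan marks a missing value.
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Prints the RangeIndex size and the non-null count of each column.
train_df.info()

# Missing values per column = total rows - non-null count.
missing_counts = train_df.isnull().sum()
print(missing_counts)
```

On the real training data the same idea gives 891 - 714 = 177 missing values for Age, as described above.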

Check for duplicate data

・ __Duplicated data needs to be deleted__. Check whether there are duplicates with __describe()__.
・ Passing __include=['O']__ as an argument displays information about object-type (string) columns. Specifically, it shows __"count" (number of values), "unique" (number of distinct values), "top" (most frequent value), and "freq" (number of occurrences of top)__.
・ Here we want to know how much data is duplicated, so look at __"unique"__: if it is smaller than "count", some values appear more than once.

・ Result ![Screenshot 2020-10-22 12.00.45.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f76f4fba-d6ae-5533-74a9-f7d167454090.png)
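A minimal sketch of this check, on made-up rows. The columns are created with `dtype=object` on purpose so that `include=["O"]` selects them regardless of the pandas version; the subtraction `count - unique` is my own addition to turn the two numbers into a duplicate count.

```python
import pandas as pd

# Made-up string columns for illustration. dtype=object is set explicitly
# so that include=["O"] picks these columns up.
train_df = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "male"],
        "Ticket": ["A 1111", "B 2222", "A 1111", "A 1111"],
    },
    dtype=object,
)

# count / unique / top / freq for each object column.
summary = train_df.describe(include=["O"])
print(summary)

# Values that repeat an earlier value: count minus unique.
duplicate_counts = summary.loc["count"] - summary.loc["unique"]
print(duplicate_counts)
```

In this sample, "Ticket" has 4 values but only 2 distinct ones, so 2 of them are repeats.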

③ Subsequent policy

・ From here we will build a model using this data, but first __we decide on a policy based on the data we have seen so far__.
・ There are __7__ policies (goals): __"Classification", "Correlation", "Conversion", "Completion", "Modification", "Create", "Graph"__. For each of these, we consider what to do for which feature.

・ __Correlation__: The model predicts whether or not the passenger survived, that is, __"Survived"__. __Examine the correlations__ to analyze how the other features affect Survived.

・ __Completion__: Fill in missing data, __prioritizing the features with the strongest correlation__ (for features with weak correlation, the "Modification" below is the better choice). This time, __"Age" and "Embarked"__ are filled in.

・ __Modification__: __Exclude features that are clearly unlikely to correlate with Survived__. __"PassengerId" and "Name"__ only identify passengers and have nothing to do with whether they survived, so they are excluded. __"Ticket"__ has a high __duplication rate__ and may not correlate with Survived, so it may be excluded. __"Cabin"__ has __many missing values__, so it may also be dropped.

・ __Create__: Create new features by __splitting existing features or extracting from them__. This time, we create a new feature called __"Family Size"__ from the related features "Parch" and "SibSp". Also, since the continuous data "Age" and "Fare" are easier to use for prediction as __discrete data__, create __new features that divide them into specified ranges__.

・ __Classification__: On the Titanic, the __survival rate is thought to be higher for "children", "women", and "upper decks (upper class)"__. Looking at the data based on this hypothesis, passengers with __"Sex = female", "Age < ?", and "Pclass = 1"__ are likely to have __"Survived = 1"__.
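The "Modification", "Completion", and "Create" policies above might look roughly like this in pandas. This is only a sketch on made-up rows, and the concrete choices are my assumptions rather than what the course specifies: filling Age with the median and Embarked with the most frequent value, defining the family size as SibSp + Parch + 1 (the passenger themself), the column names `FamilySize` and `AgeBand`, and the use of 4 equal-width bins.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Moe, Mr. Sam"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 2, 0, 0],
    "Ticket": ["A 1111", "B 2222", "C 3333", "D 4444"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Modification: drop identifiers and columns unlikely to help.
df = train_df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Completion (assumed strategy): median for Age, most frequent value for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Create (assumed definition): family size including the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Create: turn continuous Age into discrete ranges (4 equal-width bins, assumed).
df["AgeBand"] = pd.cut(df["Age"], bins=4)

print(df)
```

`pd.cut` splits by equal-width ranges; `pd.qcut`, which splits by quantiles, would be another option for a skewed column such as Fare.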

④ Pattern analysis, data analysis

Aggregation of features (correlation)

・ Use a __pivot table__ when aggregating and analyzing a large amount of data like this.
・ A pivot table only works on data __without missing values__. It is also best applied to __category values, ordinal data, and discrete data__.
・ As the analysis, we look at the __correlation between features__.

Pivot table creation for Pclass (ordinal data) and Survived

・ First of all, the code (image: Screenshot 2020-10-22 13.18.20.png)

・ __train_df[["Pclass", "Survived"]]__ selects, from the columns of train_df, the __columns used in the table created this time__.
・ __groupby(["Pclass"], as_index=False)__ __groups (aggregates) the rows by the Pclass column__. Pclass takes the values "1, 2, 3", and I want to keep it as a regular column, so it is not turned into the index.
・ __mean()__ calculates the __mean of Survived__ for each group.
・ __sort_values(by="Survived", ascending=False)__ __sorts the Survived means in descending order__.

・ Output result ![Screenshot 2020-10-22 13.34.53.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/acedde96-9196-039e-a1a4-7787402d5c26.png)
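Putting the four pieces described above together gives the following chain. The rows are made up so the example runs on its own, which means the printed rates are not the real Titanic figures (those are in the screenshot); the variable name `pclass_pivot` is my own.

```python
import pandas as pd

# Made-up rows for illustration only.
train_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

pclass_pivot = (
    train_df[["Pclass", "Survived"]]           # keep only the two columns needed
    .groupby(["Pclass"], as_index=False)       # group rows by Pclass, keep it as a column
    .mean()                                    # mean of Survived per class = survival rate
    .sort_values(by="Survived", ascending=False)  # highest survival rate first
)
print(pclass_pivot)
```

Because Survived is 0 or 1, its mean within each group is simply the fraction of passengers in that group who survived.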

・ Do the same for __"Sex", "Parch", and "SibSp"__.
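One way to repeat the same aggregation for several features is to wrap it in a small helper and loop. The helper name `survival_rate_by` is hypothetical, and the rows are again made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
train_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "SibSp": [1, 1, 0, 0, 0, 1],
    "Parch": [0, 0, 0, 0, 2, 2],
    "Survived": [0, 1, 1, 0, 0, 1],
})


def survival_rate_by(df, feature):
    """Mean of Survived for each value of `feature`, highest first."""
    return (
        df[[feature, "Survived"]]
        .groupby([feature], as_index=False)
        .mean()
        .sort_values(by="Survived", ascending=False)
    )


for feature in ["Sex", "Parch", "SibSp"]:
    print(survival_rate_by(train_df, feature))
```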

Summary

・ The first step in creating a model that predicts survival on the Titanic is to acquire the data.
・ Once the data is acquired, check for missing values and duplicates in preparation for shaping the data next.
・ Based on the data so far, decide the policy for the next steps. Start with the "correlation" part: examine each feature's correlation with "Survived", which is the teacher label of the model.

That's all for this time. Thank you for reading to the end.

[PYTHON] Data analysis Titanic 1