[PYTHON] Data analysis Titanic 3

Aidemy 2020/10/31

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learn there, so I am summarizing it on Qiita. I am very happy that so many people read the previous summary articles. Thank you! This is the third post in the "Data Analysis Titanic" series. Thank you for reading.

What to learn this time

・ ⑤ Modeling the problem, making predictions, and solving it

Creating a model

Selection of algorithm to use

- Since the data processing was completed in Chapter 2, from here we actually pass the data to a __model__, make predictions, and carry the problem through to a solution.
- First, we need to decide __which algorithm to build the model with__. Prediction problems come in two kinds: __"classification" and "regression"__. Classification divides the data into classes and predicts which class a given sample belongs to; regression predicts a numerical value for the sample.
- The Titanic problem is a classification problem: predict whether __Survived__ is 0 or 1. The models are built with __"logistic regression", "SVC", "k-NN", "decision tree", and "random forest"__.

Data preparation

- Prepare __"X_train", "y_train", and "X_test"__ to pass to the model. I usually use __train_test_split()__ from sklearn.model_selection, but this time I split the data by hand.
- X_train is train_df without Survived, y_train is the Survived column of train_df alone, and X_test is test_df without __PassengerId__.

・ Code ![Screenshot 2020-10-24 18.32.01.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/57f40d79-e73c-3912-e3be-ea8389d5e6fb.png)
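Since the code above is only available as a screenshot, here is a minimal sketch of the manual split described in this section. The tiny `train_df` and `test_df` below are made-up stand-ins for the preprocessed DataFrames from the previous article (the column names and values are assumptions for illustration only).

```python
import pandas as pd

# Toy stand-ins for the preprocessed DataFrames (NOT the real Titanic data).
train_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Age":      [1, 2, 1, 2, 0, 3],
})
test_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass":      [3, 3, 2],
    "Sex":         [0, 1, 0],
    "Age":         [2, 2, 3],
})

# Explanatory variables: everything in train_df except the objective variable.
X_train = train_df.drop("Survived", axis=1)
# Objective variable: the Survived column only.
y_train = train_df["Survived"]
# Data to predict on: test_df without the ID column, which is not a feature.
X_test = test_df.drop("PassengerId", axis=1).copy()

print(X_train.shape, y_train.shape, X_test.shape)
```

The point to check is that X_train and X_test end up with the same columns in the same order, because the model expects the features it was trained on.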

Logistic regression

- (Review) __Logistic regression__ uses the __sigmoid function__ for binary classification. The sigmoid function __takes values between 0 and 1__, so its output can be read as the probability of belonging to class 1. Create the model with __LogisticRegression()__.
- This time, predict whether the objective variable __Survived__ is 0 or 1 from explanatory variables such as Age and Pclass.

- Note that, for logistic regression only, the __X_train and Y_train__ passed to the model are further divided into training data and test data at a ratio of 8:2.

- Code (including results) ![Screenshot 2020-10-24 19.03.30.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/79db1564-18c1-c397-80b0-e236333cad58.png)
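As a rough sketch of the steps above (not the exact code in the screenshot), the 8:2 split and the logistic regression model could look like this. The data is randomly generated toy data with a made-up survival rule, and the names `X_tr`, `X_val`, `Y_tr`, `Y_val`, and `logreg` are my own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),  # 1, 2 or 3
    "Sex": rng.integers(0, 2, n),     # 0 or 1
    "Age": rng.integers(0, 5, n),     # age band 0..4
})
# Made-up survival rule with 10% label noise, purely for illustration.
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# Hold out 20% of the training data (the 8:2 split mentioned above).
X_tr, X_val, Y_tr, Y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=0
)

logreg = LogisticRegression()
logreg.fit(X_tr, Y_tr)

# Accuracy on the held-out 20%, as a percentage.
acc_val = round(logreg.score(X_val, Y_val) * 100, 2)
# The sigmoid output: probability of Survived == 1 for each held-out sample.
proba = logreg.predict_proba(X_val)[:, 1]

print("validation accuracy (%):", acc_val)
print("first probabilities:", proba[:3])
```

`predict_proba` exposes the values between 0 and 1 that the sigmoid function produces; `predict` simply thresholds them into 0 or 1.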

Visualize which explanatory variables (features) are likely to influence the results

- To find out __"which explanatory variables (features) tend to influence the result"__, look at the __partial regression coefficient__ of each explanatory variable with respect to the objective variable. The larger the absolute value, the more strongly that variable influences the result.
- The partial regression coefficients are available as __"model.coef_"__. Since I want to handle them in a DataFrame, I create a DataFrame whose rows are the train_df column names ("Feature"), add a new column "Partial regression coefficient", and store the coefficients there.
- delete(0) is applied to the column names when creating the DataFrame because it removes the label at position 0, which is Survived (assuming it is the first column of train_df). Survived is the objective variable and has no coefficient, so leaving it in would shift the feature names by one against the coefficients and leave a NaN in the extra row.

- Code ![Screenshot 2020-10-24 19.25.54.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/1cd46828-9fcd-bec9-0851-275745c732b3.png)

・ Result ![Screenshot 2020-10-24 19.26.25.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/efd82e96-a574-36ea-80ac-2bc177a5ed6f.png)
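The coefficient table described above can be sketched roughly as follows. This again uses randomly generated toy data with a made-up survival rule, and it assumes Survived is the first column of `train_df`, so that `columns.delete(0)` drops exactly that label.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the preprocessed train_df (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
train_df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (train_df["Sex"] == 1) | (train_df["Pclass"] == 1)
# Put Survived in position 0, as assumed in the text above.
train_df.insert(0, "Survived", (rule ^ (rng.random(n) < 0.1)).astype(int))

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]

model = LogisticRegression()
model.fit(X_train, Y_train)

# Column names without the label at position 0 (Survived).
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ["Feature"]
# coef_ has shape (1, n_features) for binary classification.
coeff_df["Partial regression coefficient"] = pd.Series(model.coef_[0])

print(coeff_df.sort_values(by="Partial regression coefficient", ascending=False))
```

If `delete(0)` were omitted, the table would have one more row than there are coefficients, so the names would be misaligned and the last row would hold NaN.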

SVM

- (Review) The __support vector machine (SVM)__ is an algorithm for classification. It draws the classification boundary so that it lies as far as possible from the nearest data points of each class, so it __generalizes well__. Because it uses the __kernel method__, which transforms non-linear data so that it can be separated linearly, it __can also handle non-linear data__. Use __LinearSVC()__ for a linear SVM and __SVC()__ for a non-linear SVM.

・ Code (LinearSVC can be created in the same way; its result is "83.84") Screenshot 2020-10-24 19.48.48.png
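A minimal sketch of both SVM variants, assuming the same kind of toy data as a stand-in for X_train and Y_train. Here the accuracy is computed on the training data itself, which is my assumption about how the screenshot computes it; `max_iter=10000` is only there to give LinearSVC room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC, LinearSVC

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# Non-linear SVM (RBF kernel by default).
svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

# Linear SVM.
linear_svc = LinearSVC(max_iter=10000)
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

print("SVC accuracy (%):", acc_svc)
print("LinearSVC accuracy (%):", acc_linear_svc)
```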

k-NN

- (Review) __k-NN__ is an algorithm that __picks the k training samples most similar to the sample being predicted__ and outputs the most common class among them as the prediction. Its characteristics are that __the training cost is essentially zero and the prediction accuracy is high__. Use it with __KNeighborsClassifier()__; the number k of samples to pick is set with the __"n_neighbors"__ argument.

・ Code ![Screenshot 2020-10-24 19.56.44.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/04b20121-5e79-cf72-9d79-4d5c9ce7018d.png)
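For reference, a small k-NN sketch on toy data. The value `n_neighbors=3` is just an example choice, not necessarily the value used in the screenshot, and the accuracy is again measured on the training data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# k = 3: each prediction is a majority vote among the 3 nearest samples.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)  # "training" only stores the data

acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
pred = knn.predict(X_train.head(5))

print("k-NN accuracy (%):", acc_knn)
print("first predictions:", pred)
```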

Decision tree

- (Review) A __decision tree__ is so named because the rules extracted from the data are represented as a tree structure. A rule might be, for example: if the explanatory variable Age is 1 (16 to 32 years old), check Pclass next. Following such rules one after another finally assigns each sample to a class. Use it with __DecisionTreeClassifier()__.
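To make the "rules as a tree" idea concrete, here is a hedged sketch that trains a shallow tree on toy data and prints its rules with `sklearn.tree.export_text`. The depth limit `max_depth=2` and the data are my own choices to keep the printed tree small.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# A shallow tree so that the extracted rules stay readable.
decision_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
decision_tree.fit(X_train, Y_train)

acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
# Text rendering of the learned if/else rules.
rules = export_text(decision_tree, feature_names=list(X_train.columns))

print("decision tree accuracy (%):", acc_decision_tree)
print(rules)
```

Each indented line in the printed output is one condition on a feature, and each leaf shows the class that samples reaching it are assigned to.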

Random forest

- (Review) __Random forest__ is an algorithm that __builds a large number of decision trees__ and outputs the class predicted by the most trees as the final result. Learning with multiple classifiers like this is called ensemble learning. Use it with __RandomForestClassifier()__; the number of decision trees is set with the __"n_estimators"__ argument.
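A short random forest sketch along the same lines, again on toy data. `n_estimators=100` is an example value (it is also scikit-learn's current default), not necessarily what the original code used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# An ensemble of 100 decision trees.
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
random_forest.fit(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

print("number of trees:", len(random_forest.estimators_))
print("random forest accuracy (%):", acc_random_forest)
```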

Evaluation of the model

- List the __accuracy (acc)__ of each model created above in a DataFrame, and use it to decide which model to use.

- Code ![Screenshot 2020-10-24 21.52.03.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/5140731d-b25a-73c6-2b77-0780d617f121.png)

・ Result ![Screenshot 2020-10-24 22.10.51.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/c7ac0fac-5ba4-f9e3-3939-cc0a12fe92c1.png)
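One possible way to build such a comparison table is sketched below. The toy data, the hyperparameters, and the column names "Model" and "Score" are assumptions for illustration, and the scores are training accuracy, so they say nothing about the real Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the preprocessed training data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

# The five algorithms covered in this article.
models = {
    "Logistic Regression": LogisticRegression(),
    "SVC": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy (in %) on the training data.
rows = []
for name, model in models.items():
    model.fit(X_train, Y_train)
    rows.append({"Model": name, "Score": round(model.score(X_train, Y_train) * 100, 2)})

results = pd.DataFrame(rows).sort_values(by="Score", ascending=False)
print(results)
```

Because the scores here come from the training data, tree-based models tend to look best; a held-out split or cross-validation would give a fairer comparison.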

Save model

- As the results in the previous section show, the models using the __decision tree and random forest have the highest accuracy__. This time I decided to use the __random forest__ model, which should generalize better, and to save its predictions as a csv file.
- csv files can be written with __"to_csv"__. The file is created from a DataFrame that has a 'PassengerId' column storing the PassengerId of test_df and a 'Survived' column storing the prediction result 'Y_pred' of the random forest (decision tree).

・ Code (the file path is fictitious) Screenshot 2020-10-24 22.24.47.png
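The saving step could be sketched like this. The data is toy data, the file name `submission.csv` is hypothetical, and a temporary directory is used so that the example leaves nothing behind.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the preprocessed data (NOT the real Titanic data).
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.integers(0, 5, n),
})
rule = (X_train["Sex"] == 1) | (X_train["Pclass"] == 1)
Y_train = (rule ^ (rng.random(n) < 0.1)).astype(int)

test_df = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Pclass": [3, 1, 2, 3],
    "Sex": [0, 1, 0, 1],
    "Age": [2, 1, 3, 0],
})
X_test = test_df.drop("PassengerId", axis=1)

random_forest = RandomForestClassifier(n_estimators=100, random_state=0)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)

# One row per passenger: the ID and the predicted Survived value.
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred,
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")  # hypothetical file name
    submission.to_csv(path, index=False)  # index=False keeps the row index out of the file
    saved = pd.read_csv(path)

print(saved)
```

Note that what gets written is the table of predictions, not the trained model object itself.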

Summary

- Split the data prepared up to the previous article into __X_train, y_train, and X_test__, and create the models from them.
- To find out __"which explanatory variables (features) tend to influence the result"__, it helps to calculate and visualize the __partial regression coefficients__.
- Compare the score of each model to see which has the highest __accuracy (acc)__, and save the predictions of the most accurate model to a csv file.

That's all for this time. Thank you for reading to the end.
