[PYTHON] Day 67 [Introduction to Kaggle] Have you tried using Random Forest?

Kaggle's Titanic competition. Last time, I made an All-Survival Model and a Gender-Based Model (males die / females survive). Day 66 [Introduction to Kaggle] The easiest Titanic prediction

This time I wanted to try actual machine learning, so I used Random Forest. The original reference is the most popular recipe among Kaggle Notebooks: Titanic Data Science Solutions.

Since it is written in English, I skimmed it from top to bottom for now. The takeaway: Random Forest seems to be the easiest to use.

I immediately tried running it, building on the previous Gender-Based Model.


Extract the data from train.csv and test.csv.

This part is the same as last time.

11.py


import pandas as pd
# Read the CSVs with pandas
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Convert Sex to numeric: male -> 0, female -> 1
train_df.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)
test_df.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)

# Keep only the columns we need
train_df = train_df.loc[:, ['PassengerId', 'Survived', 'Sex']]
test_df = test_df.loc[:, ['PassengerId', 'Sex']]

Create a predictive model from the training data.

- The training data train.csv is split column-wise into the explanatory variables (X) and the objective variable (y).
- The rows are then split into pseudo training data (X_train, y_train) and pseudo test data (X_valid, y_valid).

12.py


# Building a baseline model
# Import the data-splitting module
from sklearn.model_selection import train_test_split

# Extract the training data from the DataFrame; .values converts it to a numpy ndarray
X = train_df.iloc[:, 2:].values        # explanatory variables (causes)
y = train_df.iloc[:, 1].values         # objective variable (result)

# Test data
X_test = test_df.iloc[:, 1:].values    # explanatory variables (causes)

# Split the training data to build a prediction model
# Use scikit-learn's train_test_split function
# Shuffle the split; seed value 42 (per The Hitchhiker's Guide to the Galaxy)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
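
As a quick sanity check (my own addition, not in the original recipe), you can confirm the split sizes:

# 70% of the 891 rows go to pseudo training, 30% to pseudo testing
print(X_train.shape, X_valid.shape)   # expected: (623, 1) (268, 1)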

Train on the pseudo training data to create a prediction model.

Then predict the pseudo test data with that model. The closer the two scores are, the better the prediction model. If the training score is too good (overfitting) or too low (underfitting), the model should be reviewed.

13.py


# Create a prediction model with Random Forest
from sklearn.ensemble import RandomForestClassifier

# Fit the pseudo training data to create a prediction model
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Show the score on the pseudo training data (X_train, y_train)
print('Train Score: {}'.format(round(rfc.score(X_train, y_train), 3)))
# Show the score on the pseudo test data (X_valid, y_valid)
print(' Test Score: {}'.format(round(rfc.score(X_valid, y_valid), 3)))

Check the pseudo-training score and the pseudo-test score.

Train Score: 0.785
 Test Score: 0.791

And the result is... is this good enough? I can't tell; it's the human here who is lacking in learning. Anyway, let's assume the model is done and move on to prediction.

Predict the test data with the created prediction model.

14.py


# Predict the test data (X_test) with the trained model (rfc.predict)
y_pred = rfc.predict(X_test)

# Convert the result to a pandas DataFrame
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": y_pred
    })

# Output to CSV
submission.to_csv('titanic1-2.csv', index=False)

Complete!

I'll upload it to Kaggle right away.

Public Score: 0.76555

???

This is exactly the same score as the previous male-death / female-survival model. When I checked the CSV file, the predictions were indeed identical. In the original train.csv data the survival rate is about 75% for women and 18% for men, so I had expected the model to make at least some different predictions.
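
As a quick check (a sketch of my own, not from the original recipe), those survival rates can be computed directly from train.csv:

import pandas as pd

df = pd.read_csv('train.csv')
# Mean of Survived per sex = survival rate per sex
print(df.groupby('Sex')['Survived'].mean())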

The prediction model was trained on the 600-plus rows left after splitting the 891 rows of train.csv 7:3. That may simply not be enough data to predict well. Or maybe Random Forest isn't good at ambiguous predictions, or maybe something is coded wrong somewhere.
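
One way to check whether the identical output is a bug or expected (again my own sketch, not in the original) is to look at what the model predicts for each sex. Since Sex is the only feature, there are only two possible inputs:

# With a single binary feature, the model can produce
# at most two distinct predictions: one per sex
for sex, label in [(0, 'male'), (1, 'female')]:
    print(label, '->', rfc.predict([[sex]])[0])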

I'm not sure about this area, so I'll put it on hold.
