[PYTHON] It's okay to stumble on Titanic! A Kaggle strategy for complete beginners

So you've started Kaggle, but now you're stuck on Titanic!

You may have signed up for Kaggle and followed the Titanic tutorial, yet still not be sure what you are actually doing.

And even if you can get through Titanic to some extent, it is hard to see how to apply what you did there to other analyses.

So this time, I will walk through the basic steps of a Kaggle-style analysis, aimed at people who are still wrestling with Titanic.

In this article, instead of Titanic, we will use the iris dataset bundled with scikit-learn and classify the flower varieties.

Let's start by importing the necessary libraries and data.

Import the required libraries and data

Whether you are working on Titanic or on any future competition, you first need to load the tools you will use.

If you skip this step, your code will raise errors no matter how correct it is, so be careful.

import numpy as np
import pandas as pd
# numpy is used for numerical computation, pandas for data manipulation

from sklearn.datasets import load_iris
# Provides the iris dataset used in place of Titanic
from sklearn.model_selection import train_test_split
# Used to split the data
from sklearn.linear_model import LogisticRegression
# The machine learning model used this time

import matplotlib.pyplot as plt
import seaborn as sns
# Both are used to visualize the data

Here is a quick note on how each library is used.

numpy is used for numerical computation, and pandas for manipulating the data once it is loaded.

scikit-learn (sklearn) ships with free-to-use datasets and machine learning algorithms.

From it, we load data describing the physical characteristics of iris flowers along with their variety labels.

We also import the function needed to split the data into training and test sets.

How to split the data is explained later.

matplotlib and seaborn are used to chart and visualize the data.

Sometimes simply looking at the data gives you clues about things you could not see otherwise, or helps you decide your next move.
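For example, seaborn can draw a scatter plot for every pair of features in one call. A minimal sketch (it loads its own throwaway copy of the data so it runs on its own; the proper loading code comes in the next section):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load a preview copy of the iris data just for this plot
iris_preview = load_iris()
preview = pd.DataFrame(iris_preview.data, columns=iris_preview.feature_names)
preview["species"] = iris_preview.target

# One scatter plot per feature pair, colored by variety
sns.pairplot(preview, hue="species")
plt.show()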

Look at the big picture of the data and check for missing values

Now that the libraries are loaded, let's take a look at the iris data we will use this time.

iris = load_iris()  # Load the iris data
df = pd.DataFrame(iris.data, columns=iris.feature_names)  # Put the data into a DataFrame

df.head()  # Look at just the first few rows
df.describe()  # Get summary statistics for the whole dataset

The beginning of the data

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

The big picture of the data

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Above, we loaded the data and displayed the first few rows along with the summary statistics.

Sepal length is the length of the sepal, sepal width its width, petal length the length of the petal, and petal width its width.

These values characterize each iris variety and are called "features".

The dataset contains 150 samples covering three varieties. For simplicity, however, we will narrow it down to just two varieties.
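For reference, you can list the three variety names like this:

print(iris.target_names)
# ['setosa' 'versicolor' 'virginica']

load_iris orders the rows by variety: the first 50 are setosa, the next 50 versicolor, and the last 50 virginica. Keeping only the first 100 rows therefore leaves exactly two varieties: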

df = df[:100]  # Keep only the first 100 rows, i.e. the first two varieties

Next, let's check whether anything is missing from the data, so-called "missing values".

Imagine a study guide here.

When we read one, we can mentally fill in any typos or illegible spots.

A program, however, takes the input exactly as it is given, so such gaps produce errors.

So we need to confirm up front that there are none.

df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64

Fortunately, there were no missing values here.

If there had been, we would have needed to handle them, for example by filling them with a value such as the column mean or by dropping the affected rows, so keep that in mind.
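As a minimal sketch of what that handling could look like (not needed here, since this dataset is complete):

df_filled = df.fillna(df.mean())  # Fill each missing value with its column's mean
df_dropped = df.dropna()          # Or drop every row that contains a missing value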

Divide the data and prepare it for learning

Now that we have loaded the data and confirmed there are no missing values, let's start by splitting it.

y = pd.Series(data=iris.target)
y = y[:100]
# y holds the variety labels, already encoded as numbers (0 and 1 for the two varieties we kept)
x = df.loc[:, ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, train_size=0.9, shuffle=True)

This time, 10% of the loaded data was randomly split off and set aside for testing.

In a regular competition, the whole "study guide" is handed over as training data, and the "exam questions" are distributed separately as test data.

Here there is no such separate set, so the randomly extracted 10% serves as the exam questions.

x_Train, x_valid, y_Train, y_valid = train_test_split(x_train, y_train, test_size=0.2, train_size=0.8, shuffle=True)

Then 20% is randomly extracted from the remaining data and used as practice problems, i.e. a validation set; the shape check below confirms the sizes.
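As a quick sanity check, the sizes of each split can be printed (a minimal sketch; the counts follow from splitting 100 rows 90/10 and then 80/20):

print(x_Train.shape, x_valid.shape, x_test.shape)
# (72, 4) (18, 4) (10, 4)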

The model is trained only on the data that was never extracted, the portion that remains at the end.

You may be wondering why the training data is divided a second time.

This is so we can check whether the model performs just as well on data it has not seen as on the data it trained on.

When studying for a school test, we sometimes memorized the study guide word for word, yet that did not translate into exam points.

The same thing can happen to a program, and it is called "overfitting."

To see whether it is happening, the model studies the pages that remain at the end, and the practice problems check whether it has merely memorized those specific pages.

If that happens, you need to review how the model learns.

Check the training results

lg = LogisticRegression()
lg.fit(x_Train, y_Train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Here, only the portion that remained at the end is used for training, with logistic regression, one of many machine learning methods.

Next, compare the accuracy on the data used for training against the accuracy on the data extracted for practice.

print('Train Score: {}'.format(round(lg.score(x_Train, y_Train), 3)))
print(' Test Score: {}'.format(round(lg.score(x_valid, y_valid), 3)))

Train Score: 1.0
 Test Score: 1.0

The training and practice scores match, so there seems to be no problem here.

Now, let's sit the real exam with what the model has learned. Here we produce the answers, that is, the predictions.

y_pred = lg.predict(x_test)

y_pred
array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])

y_test
68    1
88    1
35    0
20    0
95    1
7     0
12    0
0     0
76    1
44    0
dtype: int64

Next, let's check the accuracy.

np.mean(y_pred==y_test)

1.0

Scoring it automatically, the model got everything right.
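Incidentally, scikit-learn provides a helper that computes the same thing; a one-line equivalent sketch:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # Same value as np.mean(y_pred == y_test)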

In a regular competition, only the organizers hold the correct answers. Also note that only part of the test data is scored for the public leaderboard.
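On Kaggle itself you submit predictions as a CSV file. A hedged sketch for Titanic, where the required columns are PassengerId and Survived (test_df and predictions are hypothetical names, not from this article's code):

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],  # test_df: hypothetical DataFrame from the competition's test.csv
    "Survived": predictions,                # predictions: hypothetical model output for that test data
})
submission.to_csv("submission.csv", index=False)  # Upload this file on the competition page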

Things you will also do when working on Titanic with Kaggle

sns.heatmap(df.corr(), annot=True, cmap='bwr', linewidths=0.2)  # Correlation matrix of the features
fig = plt.gcf()
fig.set_size_inches(5, 4)
plt.show()

(Figure: correlation heatmap of the four features)

This shows how strongly each pair of features is correlated.

We did not do it this time, but it is also possible to combine several features into a new one, as sketched below.
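For example, here is a minimal sketch of one such combined feature (petal area is an illustrative choice, not something used in this article's model):

df_fe = df.copy()
# Multiply petal length by petal width as a rough proxy for petal area
df_fe["petal area (cm^2)"] = df_fe["petal length (cm)"] * df_fe["petal width (cm)"]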

Beyond that, there is a wide variety of machine learning methods to choose from.

from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(100, 100, 10), random_state=0)
mlpc.fit(x_Train, y_Train)

print('Train Score: {}'.format(round(mlpc.score(x_Train, y_Train), 3)))
print(' Test Score: {}'.format(round(mlpc.score(x_valid, y_valid), 3)))

Train Score: 1.0
 Test Score: 1.0

lg_pred = lg.predict_proba(x_test)
mlpc_pred = mlpc.predict_proba(x_test)

pred_proba = (lg_pred + mlpc_pred) / 2
pred = pred_proba.argmax(axis=1)
# Average the two models' predicted probabilities and take the most likely class

Each learning method has its strengths and weaknesses.

Combining several models to improve accuracy, as we just did by averaging their predicted probabilities, is called "ensemble learning".

np.mean(pred==y_test)

1.0
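Incidentally, scikit-learn also packages this averaging pattern as VotingClassifier; here is a minimal sketch of the same idea (voting="soft" averages the models' predicted probabilities, just like the manual code above):

from sklearn.ensemble import VotingClassifier

vote = VotingClassifier(
    estimators=[
        ("lg", LogisticRegression()),
        ("mlpc", MLPClassifier(hidden_layer_sizes=(100, 100, 10), random_state=0)),
    ],
    voting="soft",  # Average the predicted probabilities of the base models
)
vote.fit(x_Train, y_Train)
vote.score(x_valid, y_valid)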

This time, I used the iris dataset instead of Titanic so that even people who are new to Kaggle and still working through the Titanic analysis can understand the minimum required workflow.

That minimum workflow is: load the necessary libraries and data, look at the big picture of the data and check for missing values, split the data so it can be used for training, and finally train a model and verify its accuracy.

When you move on to a full-scale analysis such as Titanic, you will also need extra ingenuity, such as examining the relationships between features and combining multiple learning methods.

Beyond Titanic, Kaggle also offers forums for discussing what you do not understand, as well as datasets aimed at beginners.

Titanic is called Kaggle's "Hello, World", so work through it steadily, without rushing, until you can analyze all kinds of data.

Finally, here is a summary of the code used this time.

The code used this time

import numpy as np
import pandas as pd
# numpy is used for numerical computation, pandas for data manipulation

from sklearn.datasets import load_iris
# Provides the iris dataset used in place of Titanic
from sklearn.model_selection import train_test_split
# Used to split the data
from sklearn.linear_model import LogisticRegression
# The machine learning model used this time

import matplotlib.pyplot as plt
import seaborn as sns
# Both are used to visualize the data

iris = load_iris()  # Load the iris data
df = pd.DataFrame(iris.data, columns=iris.feature_names)  # Put the data into a DataFrame
df.head()  # Look at just the first few rows
df.describe()  # Get summary statistics for the whole dataset
df = df[:100]  # Narrow the data down to the first two iris varieties
df.isnull().sum()  # Check for missing values

y = pd.Series(data=iris.target)
y = y[:100]
# y holds the variety labels, already encoded as numbers
x = df.loc[:, ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, train_size=0.9, shuffle=True)
x_Train, x_valid, y_Train, y_valid = train_test_split(x_train, y_train, test_size=0.2, train_size=0.8, shuffle=True)
# Split off test data, then split the rest into training and validation (practice) data

lg = LogisticRegression()
lg.fit(x_Train, y_Train)

print('Train Score: {}'.format(round(lg.score(x_Train, y_Train), 3)))
print(' Test Score: {}'.format(round(lg.score(x_valid, y_valid), 3)))
# Compare training and validation accuracy to check for overfitting

y_pred = lg.predict(x_test)
y_pred
y_test
np.mean(y_pred == y_test)
# Evaluate the trained model's accuracy on the test data

sns.heatmap(df.corr(), annot=True, cmap='bwr', linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(5, 4)
plt.show()
# Visualize the correlations between the features

from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(100, 100, 10), random_state=0)
mlpc.fit(x_Train, y_Train)

print('Train Score: {}'.format(round(mlpc.score(x_Train, y_Train), 3)))
print(' Test Score: {}'.format(round(mlpc.score(x_valid, y_valid), 3)))

lg_pred = lg.predict_proba(x_test)
mlpc_pred = mlpc.predict_proba(x_test)

pred_proba = (lg_pred + mlpc_pred) / 2
pred = pred_proba.argmax(axis=1)
np.mean(pred == y_test)
# Ensemble learning: combine the two models by averaging their predicted probabilities
