[PYTHON] Try machine learning with Kaggle

This article is the Day 23 entry of the PONOS Advent Calendar 2019.

Introduction

This article aims to let you try machine learning quickly, without tedious preparation. I will not go into detailed explanations of the methods or into how to solve the problem well.

Register with Kaggle

First, register for Kaggle. Kaggle is a platform where data scientists and machine learning engineers from around the world compete day and night. A Python execution environment is provided on the web, with all the necessary libraries and training data available, so you can try things immediately without setting up a local environment.

Trial using data from the Titanic sinking accident

Kaggle runs competitions all the time, and through them you can access a wide variety of data. This time, rather than joining an active competition, we will use Titanic: Machine Learning from Disaster, which is always open as a tutorial. The goal of this competition is to predict whether passengers whose survival is unknown actually survived, using the Titanic passenger list (name, age, gender, cabin class, etc.) together with each passenger's survival outcome as training data.

Participate in the competition

You can join by pressing Join Competition.

Make a Notebook

You can create one by going to the Notebooks tab and pressing New Notebook. A settings screen appears; the defaults are fine, so just press Create.

View data

First, let's look at the data we will train on. Delete the code that is there by default and write the following code.

cell1


import pandas as pd

You can execute the contents of a cell by pressing Ctrl + Enter or by pressing the play button on its left. Nothing visible happens here, since we are only loading a library. Press b, or the + Code button at the bottom of the cell, to add a new cell, then write the following code.

cell2


train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
passenger_id = test.PassengerId #Save for submission
train.head(3)

If you run it and a table is displayed, it worked. This time we will use the Survived, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked columns.

cell2


train = train.iloc[:, [1, 2, 4, 5, 6, 7, 9, 11]]  # Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
test = test.iloc[:, [1, 3, 4, 5, 6, 8, 10]]       # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked

Format the data

Numerical data is required for training, so we will format the data. First, deal with the missing values: train.Age, train.Embarked, test.Age, and test.Fare all contain gaps, so fill them with reasonable values. This time, Embarked is filled with 'S' and the others with the median.
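
If you want to see for yourself which columns contain missing values, a quick check like the following works (just an optional extra, not one of the article's cells):

# Optional: count missing values per column before filling them
print(train.isnull().sum())
print(test.isnull().sum())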

cell2


train.Age = train.Age.fillna(train.Age.median())
train.Embarked = train.Embarked.fillna('S')
test.Age = test.Age.fillna(test.Age.median())
test.Fare = test.Fare.fillna(test.Fare.median())

Next, convert Sex and Embarked to numbers with one-hot encoding.

cell2


train = pd.get_dummies(train)
test = pd.get_dummies(test)
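
To see what the encoding did, you can list the resulting columns; Sex and Embarked should now appear as dummy columns such as Sex_female, Sex_male and Embarked_C, Embarked_Q, Embarked_S (an optional check):

# Optional: inspect the columns produced by one-hot encoding
print(train.columns.tolist())
print(test.columns.tolist())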

Finally, convert Age and Fare to discrete values. Since this uses NumPy, load that library as well.

cell1


import numpy as np

cell2


train.Age = np.digitize(train.Age, bins=[10, 20, 30, 40, 50])
train.Fare = np.digitize(train.Fare, bins=[10, 20, 30])
test.Age = np.digitize(test.Age, bins=[10, 20, 30, 40, 50])
test.Fare = np.digitize(test.Fare, bins=[10, 20, 30])
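
If np.digitize is unfamiliar, here is a small worked example: each value is replaced by the index of the bin it falls into, so with the Age bins above, ages 5, 25, and 45 become 0, 2, and 4.

# Small illustration of np.digitize (not needed for the pipeline itself)
print(np.digitize([5, 25, 45], bins=[10, 20, 30, 40, 50]))  # [0 2 4]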

Train

This time we will use a random forest. It is a method that trains many slightly different decision trees and averages their predictions. First, load the library (scikit-learn).

cell1


from sklearn.ensemble import RandomForestClassifier 

Separate the Survived column from the training data prepared earlier. Add a new cell and write the following code.

cell3


X = train.iloc[:, 1:]  # features (everything except Survived)
y = train.iloc[:, 0]   # target: Survived
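
As a quick sanity check (optional), X should have one row per training passenger and the same feature columns as test, and y should be a single column:

# Optional: check that features and labels line up
print(X.shape, y.shape)                        # roughly (891, 10) and (891,)
print(list(X.columns) == list(test.columns))   # True if the dummy columns match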

Now that the training data is ready, let's train the model.

cell3


forest = RandomForestClassifier(n_estimators=5, random_state=0)
forest.fit(X, y)
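
Because a random forest is just a collection of decision trees, you can peek at the individual trees and at how well the ensemble fits the training data (an optional check; n_estimators=5 means five trees):

# Optional: inspect the fitted ensemble
print(len(forest.estimators_))   # 5 individual decision trees
print(forest.score(X, y))        # accuracy on the training data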

Now that training is done, let's make predictions on the test data.

cell3


predictions = forest.predict(test)

Finally, save the prediction results to a file.

cell3


submission = pd.DataFrame({ 'PassengerId': passenger_id, 'Survived': predictions })
submission.to_csv('submission.csv', index=False)
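
Before submitting, it is worth confirming that the file has the expected two columns, PassengerId and Survived, with one row per test passenger (an optional check):

# Optional: confirm the submission format
print(submission.shape)    # expect (418, 2) for the Titanic test set
print(submission.head())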

Submit to Kaggle

Press the Commit button and a pop-up window will appear. When it finishes, press the Open Version button. In the Output section of the newly opened screen you will find the submission.csv saved earlier and a Submit to Competition button; press it. When the submission completes, your score is displayed. It should be around 0.76 (the closer to 1, the better).

And on to a real competition ...

As this trial shows, the library handles most of the learning itself. The genuinely hard part was, overwhelmingly, shaping the data (and it only gets harder if you want higher accuracy). If you enjoy that kind of work, you may want to step onto the path of machine learning.
