[Python] [Machine learning] A beginner with no prior knowledge tries machine learning, just to get something running

Premise

Me: a beginner who has never studied or touched machine learning. Machine learning is a technology worth knowing going forward, so I wanted to try using it a little. My stance is to aim for something that simply runs, without digging deep into the details. (I'm keeping this really light, hoping to lower the psychological hurdle to machine learning.)

Environmental preparation

First, make pandas and scikit-learn available in Python. Installing them with pip should be all it takes ...

$ pip install pandas
Traceback (most recent call last):
File "/home/myuser/.local/bin/pip", line 7, in <module>
from pip._internal import main
ImportError: No module named 'pip._internal'

I'm not sure about the details, but nothing can start until pip works. Download get-pip.py from the official site.

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Run it with both python and python3

$ sudo python get-pip.py
$ sudo python3 get-pip.py

Check if the pip command is available

$ pip --version
pip 20.2.4 from /Library/Python/3.7/site-packages/pip (python 3.7)

pip can now be used without problems, so install pandas and scikit-learn.

↓ Confirm that the installation was successful

$ pip show pandas
Name: pandas
Version: 1.1.4
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location:
Requires: python-dateutil, numpy, pytz
Required-by: 

$ pip show scikit-learn
Name: scikit-learn
Version: 0.23.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location:
Requires: joblib, threadpoolctl, scipy, numpy
Required-by: sklearn
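
The same check can also be done from inside Python. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is in the standard library); the helper name `installed_versions` and the package name `surely-not-installed-pkg` are made up for illustration.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# "surely-not-installed-pkg" is a made-up name that shows the "missing" case
report = installed_versions(["pandas", "scikit-learn", "surely-not-installed-pkg"])
for name, version in report.items():
    print(name, version if version is not None else "NOT INSTALLED")
```

If pandas or scikit-learn is reported as not installed, pip probably installed it for a different Python interpreter than the one running the script.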

Rough procedure for machine learning

A quick look reveals that machine learning roughly follows the flow below.

  1. Obtaining data
  2. Data preprocessing
  3. Method selection
  4. Hyperparameter selection
  5. Model learning
  6. Evaluation (→ return to 2, 3, or 4 and repeat by trial and error)
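
To make the six steps concrete, here is a rough sketch on synthetic data. The dataset from `make_classification`, the choice of `LogisticRegression`, and the value `C=1.0` are all assumptions for illustration and are not what is used on the Titanic data later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Obtain data (a synthetic stand-in for a real dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Preprocess: fit the scaler on the training part only, then apply it to both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)

# 3. Select a method, 4. select its hyperparameters (C is the regularization strength)
model = LogisticRegression(C=1.0, max_iter=1000)

# 5. Learn the model
model.fit(X_train_s, y_train)

# 6. Evaluate on data the model has not seen
accuracy = accuracy_score(y_valid, model.predict(X_valid_s))
print(f"validation accuracy: {accuracy:.3f}")
```

If the accuracy is unsatisfactory, you would go back and change the preprocessing, the method, or the hyperparameters, then run steps 5 and 6 again.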

Titanic: Machine Learning from Disaster (Titanic survival prediction)

For now, I will try Kaggle's Titanic survival prediction, which I often see in introductions to machine learning.

Obtaining data

Download the following data (gender_submission.csv, test.csv, train.csv) from the Kaggle site. (You need to register a Kaggle account to download the data.)

When I check the contents, it looks like this

>>> import pandas as pd
>>> gender_submission = pd.read_csv("./Data/gender_submission.csv")
>>> test = pd.read_csv("./Data/test.csv")
>>> train = pd.read_csv("./Data/train.csv")
>>> 
>>> gender_submission.head(5)
   PassengerId  Survived
0          892         0
1          893         1
2          894         0
3          895         0
4          896         1
>>> test.head(5)
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S
>>> train.head(5)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

gender_submission(PassengerId, Survived)
test(PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)
train(PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked)

Check the correlation coefficient between each column

In pandas, the corr() method computes the correlation coefficient between each pair of columns of a data frame. Incidentally, corr() can use one of three calculation methods, chosen with the method argument. I don't know which is appropriate here, so I use the default.

  + 'pearson': Pearson product-moment correlation coefficient ← default
  + 'kendall': Kendall rank correlation coefficient
  + 'spearman': Spearman rank correlation coefficient
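
To get a feel for how the three methods differ, here is a small sketch on made-up data (the `toy` frame is hypothetical, not part of the Titanic data). `y` always increases with `x`, but not along a straight line, so the two rank-based coefficients should report a perfect relationship while Pearson, which measures linear association, should come out lower.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

coefficients = {}
for method in ("pearson", "kendall", "spearman"):
    # corr() returns a matrix; pick the x-versus-y entry
    coefficients[method] = toy.corr(method=method).loc["x", "y"]
    print(method, round(coefficients[method], 4))
```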

>>> train_corr = train.corr()
>>> train_corr
             PassengerId  Survived    Pclass       Age     SibSp     Parch      Fare
PassengerId     1.000000 -0.005007 -0.035144  0.036847 -0.057527 -0.001652  0.012658
Survived       -0.005007  1.000000 -0.338481 -0.077221 -0.035322  0.081629  0.257307
Pclass         -0.035144 -0.338481  1.000000 -0.369226  0.083081  0.018443 -0.549500
Age             0.036847 -0.077221 -0.369226  1.000000 -0.308247 -0.189119  0.096067
SibSp          -0.057527 -0.035322  0.083081 -0.308247  1.000000  0.414838  0.159651
Parch          -0.001652  0.081629  0.018443 -0.189119  0.414838  1.000000  0.216225
Fare            0.012658  0.257307 -0.549500  0.096067  0.159651  0.216225  1.000000

Try to make a heat map

It seems you can easily visualize this as a heat map with a library called seaborn. Let's try it!

>>> import seaborn
>>> import matplotlib as mpl
>>> import matplotlib.pyplot as plt
>>> 
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()

Figure_1.png

I see, this is easier to read. There seems to be a fairly strong (negative) correlation between Pclass and Fare ...? Pclass and Fare also seem to be correlated with Survived, the column being predicted this time, though only weakly ...?
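
A heat map is good for an overview, but for the one column being predicted it can be easier to sort the numbers. The sketch below copies the Survived row of the correlation matrix shown earlier and ranks the columns by the absolute value of the coefficient; the variable names are my own.

```python
import pandas as pd

# Correlation of each numeric column with Survived, copied from the matrix above
corr_with_survived = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Rank by strength regardless of sign, keeping the signed values for display
order = corr_with_survived.abs().sort_values(ascending=False).index
ranked = corr_with_survived.reindex(order)
print(ranked)
```

With the full matrix in hand, the same idea would presumably be `train_corr["Survived"].drop("Survived")` followed by the same sorting.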

Data preprocessing

Fill in NA (missing values) and convert string columns such as Sex, Embarked, and Cabin to numerical values. This time NA is basically filled with the median, but for Embarked it is filled with "S", the most frequent value. Cabin is reduced to its first letter only (probably representing the class of the cabin), and NA is filled with C, the most frequent letter.

>>> train.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
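
Before deciding how to fill in missing values, it helps to count them per column and look up the most frequent value. The sketch below does this on a hypothetical five-row table (the `sample` frame is made up; only the column names match the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with the same kinds of gaps as the Titanic data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, "C123", None],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Most frequent non-missing value, a common choice for filling a categorical column
most_common_port = sample["Embarked"].mode()[0]
print(most_common_port)
```

On the real data, `train.isna().sum()` would show which columns need this treatment.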

↓ Correction function used

def CorrectTitanicData(df):
    # Work on a copy so that the caller's DataFrame is not modified in place
    df = df.copy()
    # Age : NA -> median
    df["Age"] = df["Age"].fillna(df["Age"].median())
    # Sex : male -> 0, female -> 1
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    # Embarked : NA -> S (most frequent), then C -> 0, S -> 1, Q -> 2
    df["Embarked"] = df["Embarked"].fillna("S").map({"C": 0, "S": 1, "Q": 2})
    # Fare : NA -> median
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    # Cabin : NA -> C, keep only the first letter, then A~G -> 0~6, T -> 7
    deck = df["Cabin"].fillna("C").str[0]
    df["Cabin"] = deck.map({"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6, "T": 7})

    return df
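
The Cabin rule (fill NA with C, keep only the first letter, then map A to G and T to the numbers 0 to 7) can be checked in isolation on a few values. In this sketch "C85" and "C123" come from the preview of train.csv above; the multi-cabin string and the lone "T" are illustrative values I added.

```python
import numpy as np
import pandas as pd

# "C85" and "C123" appear in the train.csv preview; the other entries are illustrative
cabins = pd.Series(["C85", "C123", np.nan, "B57 B59 B63 B66", "T"])

# A -> 0, B -> 1, ..., G -> 6, T -> 7
deck_codes = {letter: code for code, letter in enumerate("ABCDEFGT")}

deck = cabins.fillna("C").str[0]  # missing -> "C", then keep the first letter only
encoded = deck.map(deck_codes)
print(encoded.tolist())
```

Note that these integer codes impose an order on the decks (and likewise on Embarked), which some models will treat as meaningful.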

Check the correlation between each column again after preprocessing

>>> train = CorrectTitanicData(train)
>>> train_corr = train.corr()
>>> train_corr
             PassengerId  Survived    Pclass       Sex       Age     SibSp     Parch      Fare     Cabin  Embarked
PassengerId     1.000000 -0.005007 -0.035144 -0.042939  0.034212 -0.057527 -0.001652  0.012658 -0.035748 -0.017443
Survived       -0.005007  1.000000 -0.338481  0.543351 -0.064910 -0.035322  0.081629  0.257307  0.080643 -0.125953
Pclass         -0.035144 -0.338481  1.000000 -0.131900 -0.339898  0.083081  0.018443 -0.549500  0.009851  0.305762
Sex            -0.042939  0.543351 -0.131900  1.000000 -0.081163  0.114631  0.245489  0.182333  0.070780 -0.022521
Age             0.034212 -0.064910 -0.339898 -0.081163  1.000000 -0.233296 -0.172482  0.096688 -0.032105 -0.040166
SibSp          -0.057527 -0.035322  0.083081  0.114631 -0.233296  1.000000  0.414838  0.159651  0.000224  0.030874
Parch          -0.001652  0.081629  0.018443  0.245489 -0.172482  0.414838  1.000000  0.216225  0.018232 -0.035957
Fare            0.012658  0.257307 -0.549500  0.182333  0.096688  0.159651  0.216225  1.000000 -0.098064 -0.268865
Cabin          -0.035748  0.080643  0.009851  0.070780 -0.032105  0.000224  0.018232 -0.098064  1.000000  0.069852
Embarked       -0.017443 -0.125953  0.305762 -0.022521 -0.040166  0.030874 -0.035957 -0.268865  0.069852  1.000000
>>> 
>>> seaborn.heatmap(train_corr,vmax=1, vmin=-1, center=0)
<AxesSubplot:>
>>> plt.show()

Figure_2.png

It became clear that Sex has a stronger correlation with Survived.

Method selection

Eight columns are used as predictors this time: "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", and "Embarked" (PassengerId is not used). The following seven learning methods are compared by cross-validation.

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.svm import SVC, LinearSVC
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.model_selection import cross_val_score
>>> 
>>> predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"]
>>> models = []
>>> models.append(("LogisticRegression",LogisticRegression()))
>>> models.append(("SVC",SVC()))
>>> models.append(("LinearSVC",LinearSVC()))
>>> models.append(("KNeighbors",KNeighborsClassifier()))
>>> models.append(("DecisionTree",DecisionTreeClassifier()))
>>> models.append(("RandomForest",RandomForestClassifier()))
>>> models.append(("MLPClassifier",MLPClassifier(solver='lbfgs', random_state=0)))
>>> 
>>> results = []
>>> names = []
>>> 
>>> for name,model in models:
...     result = cross_val_score(model, train[predictors], train["Survived"],  cv=3)
...     names.append(name)
...     results.append(result)
... 

>>> for i in range(len(names)):
...     print(names[i],results[i].mean())
... 
LogisticRegression 0.7811447811447811
SVC 0.6554433221099888
LinearSVC 0.7317620650953985
KNeighbors 0.7070707070707071
DecisionTree 0.7721661054994389
RandomForest 0.7957351290684623
MLPClassifier 0.7901234567901234

Random Forest seems to have the best score.
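
For reference, here is a sketch of roughly what `cross_val_score(..., cv=3)` does for a classifier: the data is split into three stratified folds, and each fold in turn is held out for scoring while a fresh copy of the model is trained on the other two. The synthetic data and the choice of `DecisionTreeClassifier(random_state=0)` are assumptions made so the comparison is reproducible.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train[predictors] and train["Survived"]
X, y = make_classification(n_samples=150, n_features=5, random_state=0)
model = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, valid_idx in StratifiedKFold(n_splits=3).split(X, y):
    fold_model = clone(model)                   # fresh, untrained copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on two thirds of the data
    manual_scores.append(fold_model.score(X[valid_idx], y[valid_idx]))  # accuracy on the rest

library_scores = cross_val_score(model, X, y, cv=3)
print(np.round(manual_scores, 3), np.round(library_scores, 3))
```

The mean of the three scores is what was printed for each model above. With models that are not given a fixed `random_state`, the numbers will vary somewhat from run to run.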

Save prediction results to CSV file for submission

Train a random forest on the training data and make predictions on the test data. Save the result in CSV format.

>>> test = pd.read_csv("./Data/test.csv")
>>> test = CorrectTitanicData(test)
>>> algorithm = RandomForestClassifier()
>>> algorithm.fit(train[predictors], train["Survived"])
RandomForestClassifier()
>>> predictions = algorithm.predict(test[predictors])
>>> submission = pd.DataFrame({
...     "PassengerId":test["PassengerId"],
...     "Survived":predictions
... })
>>> submission.to_csv("submission.csv", index=False)
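
Before uploading, it may be worth checking that the file has the same shape as gender_submission.csv: exactly the columns PassengerId and Survived, no duplicate IDs, and only 0 or 1 as predictions. The following is a minimal sketch of such a check; the five predictions are hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical predictions for the first five passengers of test.csv
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})

# Write and read back through an in-memory buffer instead of a real file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

problems = []
if list(reloaded.columns) != ["PassengerId", "Survived"]:
    problems.append("unexpected columns")
if reloaded["PassengerId"].duplicated().any():
    problems.append("duplicate PassengerId")
if not reloaded["Survived"].isin([0, 1]).all():
    problems.append("Survived must be 0 or 1")
print(problems or "looks OK")
```

For the real file, `pd.read_csv("submission.csv")` would replace the buffer.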

Submission result

Since I had come this far, I submitted it to Kaggle. The resulting score was 0.74162.

Screenshot 2020-11-07 19.28.55.png

I would like to raise the accuracy by trial and error from here, but this is as far as I go this time. scikit-learn apparently has GridSearchCV, which searches over hyperparameters, so using it is likely to raise the accuracy ...
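
As a starting point, a GridSearchCV run over a random forest might look like the sketch below. The synthetic data and the particular grid values are assumptions for illustration; I have not tried this on the Titanic data, so whether it improves the score there is untested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train[predictors] and train["Survived"]
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination in the grid (2 x 2 = 4 here) is scored by 3-fold cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a model retrained on all the data with the best combination, so it can be used for `predict` directly.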
