[PYTHON] I tried using Random Forest

Purpose of this post

A SIGNATE practice problem: predicting wine varieties.

Main subject

Learning algorithms used

I tried three algorithms; random forest was the most accurate, so I adopted it as the final classifier.

Code

Data reading

wine-learning.py


import pandas as pd

wine_data = pd.read_csv('train.tsv', sep='\t')
wine_test = pd.read_csv('test.tsv', sep='\t')

Last time I used read_table, but this time I tried read_csv as well. read_table feels a little easier for TSV files, since its default separator is already a tab. Both load the data the same way, so neither is more correct than the other.
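Since both go through the same parser, the equivalence is easy to check; a minimal sketch (assuming train.tsv is present):

df_a = pd.read_csv('train.tsv', sep='\t')
df_b = pd.read_table('train.tsv')  # read_table defaults to sep='\t'
assert df_a.equals(df_b)          # identical frames either way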

Separating the features and the labels

wine-learning.py


X = wine_data.loc[:, ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
                      'Magnesium', 'Total phenols', 'Flavanoids',
                      'Nonflavanoid ohenols', 'Proanthocyanins',
                      'Color intensity', 'Hue',
                      'OD280/OD315 of diluted wines', 'Proline']].values
y = wine_data.loc[:, 'Y'].values

I want to do something about this, because it gets long when there are many features; I will look for a better way in the next exercise (one idea is sketched after the test-data extraction below). Incidentally, the test data is extracted the same way:

wine-learning.py


Xt = wine_test.loc[:, ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
                       'Magnesium', 'Total phenols', 'Flavanoids',
                       'Nonflavanoid ohenols', 'Proanthocyanins',
                       'Color intensity', 'Hue',
                       'OD280/OD315 of diluted wines', 'Proline']].values
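One way to shorten these extractions, sketched under the assumption that the frames contain only an id column, the 13 features, and (in the training data) the target Y; the 'id' column name is my assumption, not confirmed from the original:

feature_cols = wine_data.columns.drop(['id', 'Y'])  # everything except id and target

X = wine_data[feature_cols].values
y = wine_data['Y'].values
Xt = wine_test[feature_cols].values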

Splitting into training data and test data

wine-learning.py


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

As before, the data was split at a ratio of 8:2.
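Note that train_test_split shuffles randomly on each run, so the scores below can vary between executions. A minimal sketch of pinning the split for reproducibility (the seed value 0 is arbitrary):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # fixed seed -> reproducible 8:2 split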

Delete missing values

wine-learning.py


import numpy as np

# Drop every column that contains a NaN (the mask is computed per array)
X_train = X_train[:, ~np.isnan(X_train).any(axis=0)]
X_test = X_test[:, ~np.isnan(X_test).any(axis=0)]
Xt = Xt[:, ~np.isnan(Xt).any(axis=0)]

Missing values that did not exist before the split suddenly appeared. I did not understand the cause at the time, but the most likely culprit is the misspelled column name above ('Nonflavanoid ohenols' instead of 'Nonflavanoid phenols'): when a label in a .loc list does not exist, pandas versions before 1.0 silently return an all-NaN column instead of raising a KeyError. This time, I simply deleted the columns containing missing values. One caveat about the code above: each mask is computed from its own array, so if the NaN columns ever differed between the arrays, the feature columns would no longer line up.
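A safer variant is to compute a single column mask from the training features and reuse it, so all three arrays are guaranteed to keep exactly the same columns; a minimal sketch:

valid_cols = ~np.isnan(X_train).any(axis=0)  # mask decided by the training data only
X_train = X_train[:, valid_cols]
X_test = X_test[:, valid_cols]
Xt = Xt[:, valid_cols]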

Model learning

SVC

wine-learning.py


from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
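One characteristic of SVC worth knowing: it is sensitive to feature scale, and the wine features span very different ranges (Proline is orders of magnitude larger than most others), which likely explains its low score below. A sketch of standardizing before fitting, as an extra experiment to try (this pipeline is my addition, not part of the original code):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean / unit variance before the SVC
clf_scaled = make_pipeline(StandardScaler(), svm.SVC())
clf_scaled.fit(X_train, y_train)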
Logistic regression

wine-learning.py


from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
Random forest

wine-learning.py


from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

random_state was set to 0 so the result is reproducible, and n_estimators (the number of decision trees) was set to 500.
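Because each tree is fit on a bootstrap sample, a random forest can also be scored on the rows each tree never saw, which gives a validation estimate without touching X_test. A minimal sketch using the oob_score option (my addition, not the original setup):

clf_oob = RandomForestClassifier(n_estimators=500, random_state=0, oob_score=True)
clf_oob.fit(X_train, y_train)
print('OOB score =', clf_oob.oob_score_)  # out-of-bag accuracy estimate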

Model evaluation

wine-learning.py


from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy = ', accuracy)

As before, accuracy was calculated with the accuracy_score function.

SVC:
Accuracy =  0.6111111111111112
Logistic regression:
Accuracy =  0.8888888888888888
Random forest:
Accuracy =  1.0

Classification

wine-learning.py


X_pred = np.array(Xt)  # Xt is already a NumPy array, so this is just a safeguard
y_pred = clf.predict(X_pred)
print(y_pred)
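To turn these predictions into a SIGNATE submission file, something like the following should work; note that the 'id' column name and the header-less two-column CSV format are my assumptions about this competition, not details from the original post:

# Pair each test id with its predicted class ('id' column name is an assumption)
submission = pd.DataFrame({'id': wine_test['id'], 'pred': y_pred})
submission.to_csv('submit.csv', index=False, header=False)  # one "id,prediction" per row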

Result

(image: screenshot of the prediction result)

I did it! (applause)

Discussion

- Change the learning algorithm depending on the number of features.
- Reduce the number of features (dimensionality reduction) before applying the learning algorithm.
- Understand the characteristics of each algorithm in the first place.
- Investigate the cause of the missing values that appeared when the data was split.
- Adopt a confusion matrix (see the sketch below).
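As a first step toward that last item, a minimal sketch of computing a confusion matrix for the held-out split with scikit-learn:

from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true classes, columns: predicted classes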
