[PYTHON] Signate 2nd _Beginner Limited Competition Review

Introduction

I took part in this beginner-only warm-up competition (https://signate.jp/competitions/293) ahead of the second term of AIQuest, which starts in October 2020. My result in the competition I entered in order to join AIQuest was not great, but I managed to get in. Feeling that I would not get results if things stayed as they were, I made time to take part in this competition, starting from the latter half of September.

This competition counted as cleared once a certain score was exceeded, so the ranking did not mean much, but I pushed to raise my ranking anyway, partly for study and partly to gain confidence. As a result I was lucky enough to take first place, so I would like to introduce what I did this time.

Overview of the competition

The task this time is to predict from blood test data, age, and gender whether a patient has liver disease. The evaluation metric is AUC, and the clearing condition is to exceed AUC = 0.92.

Environment

The environment is Google Colaboratory.

Getting started

Here I will go through the things that are commonly done when analyzing tabular data.

  1. View the data

Check what each column means, what its data type is, and whether there are missing values. After that, visualize how skewed the distributions are.
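
As a minimal sketch of this step (the file names and the target column name "disease" are assumptions for illustration, not taken from the competition):

import pandas as pd

# Load the competition files (file names are assumed)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Data types and missing values
print(train.dtypes)
print(train.isnull().sum())

# Class balance of the target ("disease" is a placeholder column name)
print(train["disease"].value_counts(normalize=True))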

  2. Take a look at feature_importance

Train a model without any special feature engineering and see which features matter. This step seems important for building a mental picture of the data so that hypotheses come more easily, even if it only gives a rough impression. (Depending on how you look at it, the impression you get here can also become a shackle on your thinking.)
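
A rough sketch of this step, reusing the placeholder names from above and CatBoost, which is the library used later in this post:

from catboost import CatBoostClassifier

# "disease" is the placeholder target name from the sketch above
X = train.drop(columns=["disease"])
y = train["disease"]

# String columns (e.g. gender) are passed to CatBoost as categorical features
cat_features = X.select_dtypes(include="object").columns.tolist()

# Train with default settings and no feature engineering
model = CatBoostClassifier(verbose=0, random_seed=0)
model.fit(X, y, cat_features=cat_features)

# See which columns the model relies on most
for name, imp in sorted(zip(X.columns, model.get_feature_importance()),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1f}")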

  3. Train various models and compare scores

As is also done in Kaggle's famous Titanic notebooks, I scored with a variety of models. What I tried this time: Support Vector Machines, KNN, Logistic Regression, Random Forest, Naive Bayes, Perceptron, Stochastic Gradient Descent, Linear SVC, Decision Tree, and CatBoost.

I adopted CatBoost. Since there were no missing values this time, I scored with the various models without any particular preprocessing, and the one with the best result became my first submission. I think the result was around 0.8.
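
A sketch of what the model comparison could look like, using cross-validated AUC since that is the competition metric (the exact models and settings I used are not reproduced here):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier

# Numeric columns only, to keep the comparison simple (an assumption,
# not the exact preprocessing I used)
X_num = X.select_dtypes(exclude="object")

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "catboost": CatBoostClassifier(verbose=0, random_seed=0),
}

for name, m in models.items():
    # roc_auc matches the competition metric
    scores = cross_val_score(m, X_num, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f}")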

The minimum required to clear

At 0.8 the score was still far from the clearing condition, so some ingenuity was needed. In my case, the following two ideas (if they even deserve that name) were enough to meet the clearing condition.

  1. Delete the gender and age columns

At first glance these columns seem related to whether a patient has liver disease (and the correlation was in fact high), but removing them improved the accuracy. Honestly, I cannot pin down the reason; all I can say is that I tried removing them mechanically and it worked. Before reaching this point I also tried target encoding and binning age into teens, 20s, and so on, but neither produced results. At this stage the accuracy rose to about 0.83.
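
A sketch of the column removal and the age binning that was tried (the column names "Age" and "Gender" are assumptions about the dataset):

import pandas as pd

# Drop the two columns ("Age" and "Gender" are assumed names)
X_reduced = X.drop(columns=["Age", "Gender"], errors="ignore")

# Age binning that was also tried (teens, 20s, ...) but did not help here
age_binned = pd.cut(train["Age"], bins=[0, 20, 30, 40, 50, 60, 200],
                    labels=["teens", "20s", "30s", "40s", "50s", "60+"])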

  2. Output the prediction as probabilities

My mental model of inference was stuck on outputting 1 or 0, so it took me a while to come up with this, but it dramatically improved the score.

assessment.py


# Hard 0/1 labels
model.predict(pred)

# Probability of the positive class (what I actually submitted)
model.predict_proba(pred)[:, 1]

This took the score from 0.83 to 0.92 and over the pass line.
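
To see why this matters for AUC, one can compare the two outputs on a hold-out set. This is only a sketch; the split and the validation code are assumptions, not what I actually ran:

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hold-out split on the reduced feature set from the sketch above
X_tr, X_val, y_tr, y_val = train_test_split(
    X_reduced, y, test_size=0.2, random_state=0, stratify=y)

model = CatBoostClassifier(verbose=0, random_seed=0)
model.fit(X_tr, y_tr)

# Hard labels collapse every positive prediction to the same rank, so AUC suffers
print("AUC with 0/1 labels:   ", roc_auc_score(y_val, model.predict(X_val)))
# Probabilities preserve the ranking, which is exactly what AUC measures
print("AUC with probabilities:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))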

What I did to further improve accuracy

Clearing the competition had nothing to do with the ranking, but I was motivated, so I kept trying to improve the accuracy. The following are the changes that actually raised the score.

  1. Add a feature for whether each blood value is within the medically normal range

The raw values carry this information in principle, but because the units all differ it is hard to make a holistic judgment, and I wanted a feature that captures only "is this value within the normal range", so I adopted it. This method turned out to be quite effective; this alone lifted my ranking into the top 10.
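
A sketch of what such a feature could look like. The column names and reference ranges below are placeholders, not the medical values actually used:

# Hypothetical reference ranges per blood-test column: {column: (low, high)}
normal_ranges = {
    "AST": (10, 40),
    "ALT": (5, 45),
    "ALP": (100, 325),
}

# One binary flag per test, plus a count of out-of-range values
for col, (low, high) in normal_ranges.items():
    X_reduced[f"{col}_normal"] = X_reduced[col].between(low, high).astype(int)
X_reduced["n_abnormal"] = len(normal_ranges) - X_reduced[
    [f"{c}_normal" for c in normal_ranges]].sum(axis=1)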

  2. Remove irregular rows from the training data using the knowledge gained above

I removed rows that are "judged healthy even though most of their values are outliers", which cannot be found by simply dropping rows with conspicuously odd values mechanically. With 0/1 predictions this might not have had much effect, but since the submission is a probability, my assumption was that erasing the rows that run exactly counter to the overall trend would let the model push the clear-cut cases as close to 1 or 0 as possible. This was the decisive move that lifted me to 1st place.

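A sketch of the idea, reusing the placeholder names from the earlier sketches; the threshold of 3 out-of-range values is an assumption:

# Put features and target side by side for filtering
train_fe = X_reduced.copy()
train_fe["disease"] = y.values

# Rows labeled "no disease" although most blood values are out of range run
# counter to the overall trend; dropping them lets the model push the
# clear-cut cases closer to 0 or 1 (threshold of 3 is an assumption)
suspicious = (train_fe["n_abnormal"] >= 3) & (train_fe["disease"] == 0)
train_clean = train_fe[~suspicious]
print(f"Removed {int(suspicious.sum())} suspicious rows")
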
Summary

It is a modest achievement, since this was a beginner-only competition and the ranking did not really matter, but I am glad I worked on it.

I feel that this kind of rough-and-ready data analysis would not hold up at a practical level, so I want to keep working to raise my game.
