[Python] Looking back on my first machine learning competition

Introduction

The other day, I participated in my first machine learning competition. I spent many weekends working hard on it. In the end I did not win a prize, but I finished in the top 3%, which gave me some confidence. I don't have much competition experience, so my opinions may change later, but for now I want to look back on the experience and leave these notes.

Motivation for participation

I have been studying machine learning through books and videos, but I wanted an objective measure of how good my skills actually were.

Premise

I can't share the details of the competition, but it was a tabular data competition. I competed in the following environment.

Hardware (1 unit)

Software

Looking back

There are many great books on machine learning techniques, so here I want to list the messy, practical things I actually felt when putting those techniques to use in a competition.

Scripts I created

For the competition, I created versatile command-line scripts for the two main tasks of feature selection and prediction-model creation, and used disposable one-off commands for everything else (visualization, minor preprocessing, and so on). For prediction-model creation, for example, the learning algorithm, random seed, number of cross-validation folds, and so on could all be specified as arguments. New algorithms could also be added with minimal modification. This improved my productivity: I could change what I wanted to run just by changing a command argument, or run several variations as a batch.
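A minimal sketch of what such a script might look like, using `argparse`. All names here (`build_parser`, the `ALGORITHMS` registry, the flag names) are my own illustration, not the author's actual script; registering algorithms in a dict is one way to make adding a new one a one-line change, as described above.

```python
import argparse

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Adding a new algorithm only requires adding an entry here.
ALGORITHMS = {
    "rf": lambda seed: RandomForestClassifier(random_state=seed),
    "logreg": lambda seed: LogisticRegression(max_iter=1000, random_state=seed),
}

def build_parser():
    parser = argparse.ArgumentParser(description="Train a prediction model")
    parser.add_argument("--algo", choices=sorted(ALGORITHMS), default="rf")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--cv-folds", type=int, default=5)
    parser.add_argument("--input", help="path to the training data CSV")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    model = ALGORITHMS[args.algo](args.seed)
    print(f"algo={args.algo} seed={args.seed} folds={args.cv_folds}")
```

With this shape, "change what you run by changing an argument" becomes e.g. `python train.py --algo logreg --seed 7`, and batching is a shell loop over seeds or algorithms.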

Reproducibility

In competitions, when you win a prize and receive prize money, you are usually (I believe) obliged to provide a script that reproduces your model creation and predictions. It can mean a lot of rework if you only start thinking about reproducibility once your score rises and a prize comes into view, so it is worth establishing a way to ensure reproducibility at the earliest possible stage. Algorithms whose results involve randomness usually expose a parameter such as random_state, so I tried to find and set them. Results computed on a GPU, however, seem difficult to make fully deterministic (this time only Keras used the GPU, so I did not investigate deeply).

The following can be considered as targets for fixing the random number seed.

- Feature selection algorithms such as Boruta
- Machine learning algorithms such as Random Forest and XGBoost
- The split step during cross-validation
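A minimal sketch of pinning the seed at each of the levels listed above. The helper name `set_global_seeds` is my own; the Boruta line is left as a comment because its exact setup depends on the version used.

```python
import os
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

SEED = 42

def set_global_seeds(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)       # Python's own RNG
    np.random.seed(seed)    # NumPy, used internally by many libraries

set_global_seeds(SEED)

# Machine learning algorithms: pass random_state explicitly.
model = RandomForestClassifier(n_estimators=100, random_state=SEED)

# Cross-validation splits: shuffling is only reproducible with a fixed seed.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

# Feature selection, e.g. Boruta, also accepts a random_state:
# BorutaPy(model, n_estimators="auto", random_state=SEED)
```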

Logs

I got good results using the output of one feature selection run, but I no longer knew how that first feature selection had been created, so I ended up starting over from scratch. Keeping a log in every script is important to prevent this from happening.

I did not implement all of the following, but looking back now, these are the things that should be logged.

- The executed script itself (because scripts are modified frequently)
- The inputs to the script. For creating a prediction model, for example:
  - Input file path
  - Output file path
  - Prediction algorithm
  - Cross-validation method, number of folds, and evaluation metric
  - Hyperparameter search method, search range, etc.
  - Random seed
- The script's execution status, for example the progress of a grid search. This is also very useful for deciding whether to stop partway through and rethink the parameter search range.

The same applies to other scripts such as feature selection. Note that the log does not have to be a single file; it is enough to create one output folder per execution and write multiple log files, including the prediction results, under it.
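The one-folder-per-run scheme above could be sketched like this. The function name `start_run` and the file names are hypothetical, not from the author's scripts.

```python
import json
import logging
import shutil
import sys
from datetime import datetime
from pathlib import Path

def start_run(settings: dict, out_root: str = "runs") -> Path:
    """Create a timestamped folder for this execution and set up its logs."""
    run_dir = Path(out_root) / datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Keep a copy of the executed script itself (scripts change frequently).
    script = Path(sys.argv[0])
    if script.is_file():
        shutil.copy(script, run_dir / "script_snapshot.py")

    # Record the inputs: file paths, algorithm, CV settings, seed, etc.
    (run_dir / "settings.json").write_text(json.dumps(settings, indent=2))

    # Progress log, e.g. for following a grid search while it runs.
    logging.basicConfig(filename=str(run_dir / "progress.log"),
                        level=logging.INFO,
                        format="%(asctime)s %(message)s")
    return run_dir

run_dir = start_run({"algo": "xgboost", "cv_folds": 5, "seed": 42})
logging.info("grid search step 1/10 started")
```

Prediction results and any other artifacts can then be written under the same `run_dir`.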

Use your time effectively

On the competition data, Boruta feature selection with 500 iterations took a full day, and depending on the result I had to start over several times. If you run Boruta without specifying anything, it can occupy every CPU core, leaving you unable to do any other work. For CPU-hogging processing like this, if the script accepts an argument for the number of CPUs, explicitly limiting it leaves room for other work. That way, other tasks (for example, a hyperparameter search using a different feature selection result) can run even while Boruta is executing, making effective use of time.
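A minimal sketch of leaving CPU headroom, assuming scikit-learn-style estimators: they take an `n_jobs` argument, so exposing it as a script option lets a long job use, say, all cores but two. (Boruta is commonly run through `BorutaPy` wrapping a Random Forest, so the same limit applies via the wrapped estimator.)

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

total = os.cpu_count() or 1
n_jobs = max(1, total - 2)   # leave two cores free for other work

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
model.fit(X, y)

# With Boruta, the limit goes through the wrapped estimator, e.g.:
# BorutaPy(RandomForestClassifier(n_jobs=n_jobs), n_estimators="auto")
```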

Consider the execution speed of the prediction algorithm

This time, I was able to improve accuracy with ensemble learning, building the final prediction model by combining the outputs of multiple prediction algorithms. In ensemble learning, it is said that the more diverse the combined models, the higher the accuracy. Because my general-purpose script made it easy to add prediction algorithms, I combined as many as possible. The speed of each algorithm was roughly as follows.

- LightGBM and deep learning (Keras) were quite fast; even on data without feature selection they finished quickly while other methods struggled.
- XGBoost, CatBoost, and ExtraTrees were also reasonably fast.
- Linear methods (PLS, ElasticNet, etc.) were relatively fast, probably because the algorithms are simple.
- Support vector machines and Random Forest were very slow.

The slower an algorithm is, the longer a large parameter search takes with it. For this reason, I chose which algorithms to run on the data without feature selection and which to run only after feature selection. I believe this made it possible to search efficiently for an accurate model with each prediction algorithm.
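A minimal sketch of the kind of ensemble described above, using averaged predicted probabilities from a few diverse models. The specific models and the plain averaging are my illustration; weighted averaging or stacking are common refinements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Diverse model types: tree ensembles plus a linear model.
models = [
    RandomForestClassifier(random_state=0),
    ExtraTreesClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

# Average each model's class-1 probability, then threshold at 0.5.
probs = np.mean(
    [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models], axis=0
)
ensemble_pred = (probs >= 0.5).astype(int)
```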

Believe in cross-validation

Near the end of the competition, my cross-validation score kept improving but my public test score did not, which troubled me a lot. However, in the public test with all the data released at the end of the competition, the result was almost the same as cross-validation. In the end, once you find an evaluation method you believe you can trust, it is important to trust it and focus on improving accuracy.
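A minimal sketch of such a trustworthy evaluation loop: a fixed, seeded cross-validation that yields one number to track between experiments, instead of chasing the public test score. The setup here is illustrative, not the author's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Fixed folds: every experiment is scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())  # the single number to compare between experiments
```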

Give up

During a competition, not all of your hard work will pay off. For example, this time, feature generation by genetic programming had no effect (although that may be a matter of how I applied it), and I could not identify the cause. In such cases, I think it is important to give up quickly and dig into what has actually been effective (ensemble learning, in this case).

In conclusion

Looking back, I feel I have written only obvious things, but that is probably my current level.
