[Python] Looking back on my first machine learning competition

Introduction

The other day, I participated in my first machine learning competition. I spent many weekends working hard on it. In the end I did not win a prize, but I finished in the top 3%, which gave me some confidence. I don't have much competition experience, so my opinions may change later, but for now I want to look back on the experience and leave these notes.

Motivation for participation

I have been studying machine learning through books and videos, but I wanted an objective measure of how good my skills actually were.

Premise

I can't share the details of the competition, but it was a tabular data competition. I competed in the following environment.

Hardware (1 unit)

Software

Looking back

There are many great books on machine learning techniques, so here I want to list the messy, practical things I actually felt when putting those techniques to use in a competition.

Scripts I created

For the competition, I created versatile command-line scripts for the two main tasks of feature selection and prediction-model creation, and used disposable one-off commands for everything else (visualization, minor preprocessing, and so on). For prediction-model creation, for example, the learning algorithm, random seed, number of cross-validation folds, and so on could all be specified as arguments. New algorithms could also be added with minimal modification. This improved my productivity: I could change what I wanted to run just by changing a command argument, or run several variations as a batch.
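A minimal sketch of what such a script might look like, using `argparse`. All names here (`build_parser`, the `ALGORITHMS` registry, the flag names) are my own illustration, not the author's actual script; registering algorithms in a dict is one way to make adding a new one a one-line change, as described above.

```python
import argparse

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Adding a new algorithm only requires adding an entry here.
ALGORITHMS = {
    "rf": lambda seed: RandomForestClassifier(random_state=seed),
    "logreg": lambda seed: LogisticRegression(max_iter=1000, random_state=seed),
}

def build_parser():
    parser = argparse.ArgumentParser(description="Train a prediction model")
    parser.add_argument("--algo", choices=sorted(ALGORITHMS), default="rf")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--cv-folds", type=int, default=5)
    parser.add_argument("--input", help="path to the training data CSV")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    model = ALGORITHMS[args.algo](args.seed)
    print(f"algo={args.algo} seed={args.seed} folds={args.cv_folds}")
```

With this shape, "change what you run by changing an argument" becomes e.g. `python train.py --algo logreg --seed 7`, and batching is a shell loop over seeds or algorithms.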

Reproducibility

In competitions, when you win a prize and receive prize money, you are usually (I believe) obliged to provide a script that reproduces your model creation and predictions. It can mean a lot of rework if you only start thinking about reproducibility once your score rises and a prize comes into view, so it is worth establishing a way to ensure reproducibility at the earliest possible stage. Algorithms whose results involve randomness usually expose a parameter such as random_state, so I tried to find and set them. Results computed on a GPU, however, seem difficult to make fully deterministic (this time only Keras used the GPU, so I did not investigate deeply).

The following can be considered as targets for fixing the random number seed.

- Feature selection algorithms such as Boruta
- Machine learning algorithms such as Random Forest and XGBoost
- The split step during cross-validation
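A minimal sketch of pinning the seed at each of the levels listed above. The helper name `set_global_seeds` is my own; the Boruta line is left as a comment because its exact setup depends on the version used.

```python
import os
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

SEED = 42

def set_global_seeds(seed: int) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)       # Python's own RNG
    np.random.seed(seed)    # NumPy, used internally by many libraries

set_global_seeds(SEED)

# Machine learning algorithms: pass random_state explicitly.
model = RandomForestClassifier(n_estimators=100, random_state=SEED)

# Cross-validation splits: shuffling is only reproducible with a fixed seed.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

# Feature selection, e.g. Boruta, also accepts a random_state:
# BorutaPy(model, n_estimators="auto", random_state=SEED)
```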

Logs

I got good results using the output of one feature selection run, but I no longer knew how that first feature selection had been created, so I ended up starting over from scratch. Keeping a log in every script is important to prevent this from happening.

I did not implement all of the following, but looking back now, these are the things that should be logged.

- The executed script itself (because scripts are modified frequently)
- The inputs to the script. For creating a prediction model, for example:
  - Input file path
  - Output file path
  - Prediction algorithm
  - Cross-validation method, number of folds, and evaluation metric
  - Hyperparameter search method, search range, etc.
  - Random seed
- The script's execution status, for example the progress of a grid search. This is also very useful for deciding whether to stop partway through and rethink the parameter search range.

The same applies to other scripts such as feature selection. Note that the log does not have to be a single file; it is enough to create one output folder per execution and write multiple log files, including the prediction results, under it.
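The one-folder-per-run scheme above could be sketched like this. The function name `start_run` and the file names are hypothetical, not from the author's scripts.

```python
import json
import logging
import shutil
import sys
from datetime import datetime
from pathlib import Path

def start_run(settings: dict, out_root: str = "runs") -> Path:
    """Create a timestamped folder for this execution and set up its logs."""
    run_dir = Path(out_root) / datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Keep a copy of the executed script itself (scripts change frequently).
    script = Path(sys.argv[0])
    if script.is_file():
        shutil.copy(script, run_dir / "script_snapshot.py")

    # Record the inputs: file paths, algorithm, CV settings, seed, etc.
    (run_dir / "settings.json").write_text(json.dumps(settings, indent=2))

    # Progress log, e.g. for following a grid search while it runs.
    logging.basicConfig(filename=str(run_dir / "progress.log"),
                        level=logging.INFO,
                        format="%(asctime)s %(message)s")
    return run_dir

run_dir = start_run({"algo": "xgboost", "cv_folds": 5, "seed": 42})
logging.info("grid search step 1/10 started")
```

Prediction results and any other artifacts can then be written under the same `run_dir`.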

Use your time effectively

On the competition data, Boruta feature selection with 500 iterations took a full day, and depending on the result I had to start over several times. If you run Boruta without specifying anything, it can occupy every CPU core, leaving you unable to do any other work. For CPU-hogging processing like this, if the script accepts an argument for the number of CPUs, explicitly limiting it leaves room for other work. That way, other tasks (for example, a hyperparameter search using a different feature selection result) can run even while Boruta is executing, making effective use of time.
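A minimal sketch of leaving CPU headroom, assuming scikit-learn-style estimators: they take an `n_jobs` argument, so exposing it as a script option lets a long job use, say, all cores but two. (Boruta is commonly run through `BorutaPy` wrapping a Random Forest, so the same limit applies via the wrapped estimator.)

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

total = os.cpu_count() or 1
n_jobs = max(1, total - 2)   # leave two cores free for other work

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
model.fit(X, y)

# With Boruta, the limit goes through the wrapped estimator, e.g.:
# BorutaPy(RandomForestClassifier(n_jobs=n_jobs), n_estimators="auto")
```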

Consider the execution speed of the prediction algorithm

This time, I was able to improve accuracy with ensemble learning, building the final prediction model by combining the outputs of multiple prediction algorithms. In ensemble learning, it is said that the more diverse the combined models, the higher the accuracy. Because my general-purpose script made it easy to add prediction algorithms, I combined as many as possible. The speed of each algorithm was roughly as follows.

- LightGBM and deep learning (Keras) were quite fast; even on data without feature selection they finished quickly while other methods struggled.
- XGBoost, CatBoost, and ExtraTrees were also reasonably fast.
- Linear methods (PLS, ElasticNet, etc.) were relatively fast, probably because the algorithms are simple.
- Support vector machines and Random Forest were very slow.

The slower an algorithm is, the longer a large parameter search takes with it. For this reason, I chose which algorithms to run on the data without feature selection and which to run only after feature selection. I believe this made it possible to search efficiently for an accurate model with each prediction algorithm.
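A minimal sketch of the kind of ensemble described above, using averaged predicted probabilities from a few diverse models. The specific models and the plain averaging are my illustration; weighted averaging or stacking are common refinements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Diverse model types: tree ensembles plus a linear model.
models = [
    RandomForestClassifier(random_state=0),
    ExtraTreesClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

# Average each model's class-1 probability, then threshold at 0.5.
probs = np.mean(
    [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models], axis=0
)
ensemble_pred = (probs >= 0.5).astype(int)
```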

Believe in cross-validation

Near the end of the competition, my cross-validation score kept improving but my public test score did not, which troubled me a lot. However, in the public test with all the data released at the end of the competition, the result was almost the same as cross-validation. In the end, once you find an evaluation method you believe you can trust, it is important to trust it and focus on improving accuracy.
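A minimal sketch of such a trustworthy evaluation loop: a fixed, seeded cross-validation that yields one number to track between experiments, instead of chasing the public test score. The setup here is illustrative, not the author's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Fixed folds: every experiment is scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())  # the single number to compare between experiments
```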

Give up

During a competition, not all of your hard work will pay off. For example, this time, feature generation by genetic programming had no effect (although that may be a matter of how I applied it), and I could not identify the cause. In such cases, I think it is important to give up quickly and dig into what has actually been effective (ensemble learning, in this case).

In conclusion

Looking back, I feel I have written only obvious things, but that is probably my current level.
