[PYTHON] "Then, how does it compare to other methods?"

** "Then, how does it compare to other methods?" **

I think algorithm developers are quite often scared of this question, especially when you have to deliver a practical system by a deadline. Unless you are writing an academic paper, this is a problem you should avoid if you can. Once this question appears, the amount of work multiplies by (1 + the number of comparison methods). A person who does not realize how much work it is to rerun training and evaluation with the same training data and the same features, swapping out only the learning algorithm, will say, "Then, how does it compare to other methods?", and it sounds like a perfectly reasonable request. (See the notes below.) ** In situations where development resources are limited, it is important to reach an acceptable level within them. ** Be careful not to raise the priority of work on an item that is not the bottleneck of development. In the case of machine learning, encourage the people around you to realize that acquiring data, augmenting data based on evaluation results, preprocessing data and designing features, and using results such as the confusion matrix on unknown data to set the policy for further data acquisition are all more important than which machine learning algorithm to choose. If development resources were not constrained, the developers themselves would be the first to ask, "Then, how does it compare to other methods?"

** Lack of data becomes a bottleneck in the early stages of development **

Evaluate both precision and recall, using something like scikit-learn's evaluation utilities. In the early stages of development there is nowhere near enough data, and both precision and recall are in tatters. If you add training data for a category with low recall, the recall of that category should improve, and so should the precision of the categories that were being confused with it. As long as it behaves that way, keep steadily increasing the training and evaluation data. The datasets created along the way will not go to waste.
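As a minimal sketch of this kind of per-category evaluation (the dataset and classifier here are illustrative placeholders), scikit-learn reports precision and recall per class, and the confusion matrix shows which categories are mistaken for which:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data standing in for your own labeled dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_pred = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).predict(X_test)

# Per-class precision / recall: categories with low recall are the ones
# to prioritize when collecting more training data.
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```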

In machine learning, a model with more degrees of freedom can fit the training data with smaller residuals. That does not mean, however, that it achieves good precision and recall on data that was not used for training. No machine learning algorithm can deliver its full performance when the amount of training and evaluation data is insufficient; the algorithms differ only in how sensitive they are to the lack of data.
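One way to see how sensitive an algorithm is to the amount of data is a learning curve: train on growing subsets and watch the validation score. A minimal sketch with scikit-learn; the estimator and dataset are illustrative placeholders:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Train on 10%, 32%, ..., 100% of the data, cross-validating each time.
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel="rbf", gamma="scale"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for size, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{size:4d} training samples -> mean CV accuracy {score:.3f}")
```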

** Support Vector Machine (SVM) **

Support vector machines (SVMs) are said to give stable learning results even with relatively small amounts of data (compared to other algorithms). Even for SVMs alone, there is documented know-how on preprocessing the data to lead machine learning to success (A Practical Guide to Support Vector Classification, https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).

The PDF linked above recommends, among other things (a scikit-learn sketch of this recipe follows below):

- Normalize the data to the range [-1, 1] or [0, 1]
- Use the RBF kernel as the first choice
- Cross-validate and grid-search the hyperparameters C and γ

There is established know-how on choosing the hyperparameters γ and C when training an SVM; libSVM provides the tool grid.py for that purpose (see also Grid Parameter Search for Regression).
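The same normalize / RBF kernel / grid-search recipe can be followed in scikit-learn instead of grid.py. A minimal sketch; the parameter grids and dataset are illustrative, not taken from the libSVM guide:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: scale features to [0, 1].  Step 2: RBF-kernel SVM.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Cross-validated grid search over C and gamma.
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```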

So, as a first step, I suggest using an SVM to get reasonably good results.

The absence of initial-value dependence in SVMs is another reason I like them as the first method to try. (Depending on the algorithm, there is an initial-value dependence problem: "Is it failing because the initial values happened to be bad? Would it work with different initial values?" With such an algorithm, you may never be able to reach a conclusive judgment.)
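To make the contrast concrete, here is a small illustrative sketch (the estimators and dataset are placeholders): a randomly initialized neural network gives seed-dependent scores, while the SVM, which solves a convex optimization problem, returns the same result on every run:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The MLP's weights are initialized randomly: the score depends on the seed.
for seed in (0, 1, 2):
    mlp = MLPClassifier(max_iter=500, random_state=seed).fit(X_train, y_train)
    print("MLP, seed", seed, "->", mlp.score(X_test, y_test))

# The SVM has no random initialization: identical result on every run.
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("SVM ->", svm.score(X_test, y_test))
```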

The question "How does it compare to other methods?" Means that the number of training data and evaluation data is sufficient, and the current method is appropriate for data preprocessing. In a situation where performance can be brought out. First, let's take an interest in raising the situation of the development team to that situation.

** scikit-learn is a treasure trove of algorithm comparisons **

Once there is enough data, comparing algorithms becomes more meaningful. In that case, use scikit-learn to evaluate the differences between algorithms. It used to be that "even for the same algorithm, different libraries have different interfaces, and of course tools for the same machine learning purpose have different interfaces if the algorithm differs." scikit-learn shattered that long-standing situation. In Face completion with a multi-output estimators, the following loop applies several quite different estimators through one interface. It's a wonderful thing.

```python
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)
```
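To run that pattern end to end, here is a self-contained sketch; the estimator choices and the dataset are illustrative placeholders, not the ones from the Face completion example:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Completely different algorithms behind one and the same fit/predict interface.
ESTIMATORS = {
    "SVM (RBF)": SVC(kernel="rbf", gamma="scale"),
    "k-NN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

y_test_predict = {}
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)
    print(name, "->", estimator.score(X_test, y_test))
```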

** scikit-learn API design is becoming a sample for other libraries **

There are various machine learning libraries, but many of them provide wrappers with APIs similar to scikit-learn's. So there is no loss in getting used to scikit-learn.

Chainer can now be used in a scikit-learn-like way:

scikit-chainer

There are multiple scikit-learn-like implementations. Check the latest status to see which implementation is best maintained.

tensorflow/skflow

If you look at digits.py, iris.py, mnist.py, etc. under tensorflow/tensorflow/examples/skflow/, you can see that TensorFlow can be used through the same interface as scikit-learn.

How to create a scikit-learn-compliant prediction model

When doing various things with machine learning, you may want to build a new prediction model yourself, such as an ensemble that combines several models. You can of course write it from scratch, but a model built that way is a little inconvenient because it cannot use scikit-learn's parameter-optimization modules such as GridSearchCV and RandomizedSearchCV. If you instead define the model following scikit-learn's estimator conventions, everything interoperates and development becomes efficient.
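A minimal sketch of such a scikit-learn-compliant model, assuming a toy majority-vote ensemble (the class name and its voting logic are illustrative): inherit from BaseEstimator and ClassifierMixin, have __init__ only store its parameters, and implement fit/predict.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):
    """Toy ensemble: majority vote over a list of base classifiers."""

    def __init__(self, estimators=None):
        # scikit-learn convention: __init__ only stores parameters as-is,
        # so that get_params() / set_params() and clone() work.
        self.estimators = estimators

    def fit(self, X, y):
        base = self.estimators or [SVC(), DecisionTreeClassifier()]
        self.fitted_ = [clone(est).fit(X, y) for est in base]
        return self  # fit must return self

    def predict(self, X):
        # Majority vote; assumes labels are non-negative integers.
        preds = np.asarray([est.predict(X) for est in self.fitted_])
        return np.apply_along_axis(
            lambda votes: np.bincount(votes).argmax(), 0, preds)
```

Because it follows these conventions, GridSearchCV, cross_val_score, and Pipeline accept this class exactly like a built-in estimator.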

** When you want to improve the algorithm further **

If datasets for training and evaluation are properly collected, you can draw on external resources such as Kaggle to develop more efficient and more accurate algorithms.

** Precautions for interpretation of evaluation after learning with different algorithms **

- Which algorithm is better depends on the balance between the model's degrees of freedom and the amount of data.
- How susceptible a model is to mislabeled training images depends on the algorithm.
- Whether a model is affected by the statistical distribution of the training data also depends on the algorithm.

Please find the best method while considering these things.

Note: Articles pointing out the problem of overfitting: "I tried to classify the voices of voice actors", "I tried to graph the learning process of the neural network".

** Note: Importance of dataset **

Since this description grew long, I split it out into a separate article: Importance of dataset.

Note: When using an SVM, you have a choice of which library to use. I recommend libSVM's Python binding or scikit-learn's SVM, because they can return probabilities in multiclass classification.
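As a minimal sketch (the dataset is an illustrative placeholder): with scikit-learn's SVC, passing probability=True enables predict_proba even for multiclass problems, at the cost of an extra internal calibration step (Platt scaling) during fit.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes SVC fit an extra calibration step so that
# predict_proba is available, also in the multiclass case.
clf = SVC(kernel="rbf", gamma="scale", probability=True).fit(X_train, y_train)

print(clf.classes_)                   # the class labels
print(clf.predict_proba(X_test[:3]))  # one row per sample, rows sum to 1
```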

Search for SVM

I have also created a collection of links to articles on collecting machine learning data: How to collect machine learning data.

** Reference information **

- Qiita: Support Vector Machines and Other Machine Learning Techniques
- Qiita: [Machine learning with Python] Summary of reference websites with support vector machine (SVM) information
- Qiita: chainer can now be used for scikit-learn like
- Qiita: Overview of machine learning techniques learned from scikit-learn

** From SSII2016 (the Image Sensing Symposium) **

"As for object recognition, we have commoditized it to the extent that anyone can build a system to some extent if we can prepare appropriate quantity and quality data to be a problem." [SSII2016 Cutting edge and near future of image recognition] It is written in (https://confit.atlas.jp/guide/event/ssii2016/static/speciallecture). It still remains to prepare the appropriate quantity and quality data to be questioned.

Mr. Yasutomo Kawanishi: "When I devise a method, I think many people ask me, 'Did you compare it with SVM? What happens to the accuracy with Random Forest?' In this tutorial, I explain how the recent spread of machine learning libraries makes it easier to apply various machine learning methods to a given recognition problem and compare their performance, and how to use them well." (https://confit.atlas.jp/guide/event/ssii2016/static/tutorial)

Mr. Yasutomo Kawanishi, SlideShare: Introduction to Machine Learning with Python - From SVM to Deep Learning -

Mr. Yasutomo Kawanishi: Sample code for the tutorial at SSII2016

Note: What is scary is ending up in a situation where, by deciding that reproducing results others have (supposedly) achieved takes priority over your current development, you can no longer do what you were supposed to do.

Postscript "Grand Challenge in Pedestrian Detection"

[Survey paper] Research trends in pedestrian detection using deep learning

Things that have changed since I wrote this article (postscript, 2018.07)

- The source code of algorithms to compare against is increasingly being released.
- Moreover, it is increasingly published on GitHub.
- Often a git clone followed by cmake and make is enough to get a comparison algorithm ready.
- By using Docker, you can avoid situations where the library versions required by each source tree differ and nothing works. Depending on the distributor, Docker configuration files are increasingly included in the distribution.
- The number of public datasets matching the fields and problems of interest is increasing.
- In some cases, the evaluation tools have been standardized, making comparison easier.
