[PYTHON] About testing in the implementation of machine learning models

The examples bundled with many frameworks, and most write-ups about implementing machine learning models, have one thing in common: there are no tests. Yet once a machine learning model is embedded in an application, it is part of the production code. Do you really want to put an untested implementation into your production environment? I suspect that is usually not the case.

kurenai.PNG (Borrowed from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))

It is easy to forget that a machine learning model is at its most accurate at the "moment of release." The reason is that at release time the model has just been trained on all the data available, and from then on more and more unseen data keeps arriving. It is therefore very important to be able to verify the accuracy and validity of the model at any time. This is the same reason we test ordinary code; being a machine learning model does not make it special.

In this article, I will explain how to test such machine learning models. Of course, this is just the method I am currently practicing, and I expect more practical know-how to spread as machine learning finds its way into more applications.

Machine learning model design

First, the model must be designed well enough to be testable. I explained this point in an earlier presentation, so let me quote from it.

Code design to avoid crying with machine learning


The Model is the actual machine learning model (built with scikit-learn, Chainer, TensorFlow, and so on), into which all processing tends to get packed. The point of the slides is to split those responsibilities apart: the Model itself, a Trainer, a DataProcessor, a Resource, and a Model API.

The benefit is that when a problem such as poor accuracy occurs, you can isolate where it comes from: is the model itself at fault, is the training bad, is the model fine but misused from the application side, or is there a mistake in the data preprocessing? Each of these can then be verified and tested separately.

However, compared to a normal program, where input and output can be defined clearly, the output of a machine learning model is indeterminate. The DataProcessor and Resource are much like ordinary code and are easy to test; the problem is the Trainer and the Model API, including the Model itself.
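To make the split concrete, here is a minimal sketch of that kind of separation. The class and method names are only illustrative; they are not the actual API of the repository or the slides.

```python
# A minimal, illustrative sketch of the split described above; the class and
# method names are not the actual API of the repository or the slides.

class Resource:
    """Access to raw data and to stored model files."""
    def load_raw_data(self):
        ...

class DataProcessor:
    """Preprocessing shared by training and serving (normalization, padding, ...)."""
    def format(self, raw_data):
        ...

class Model:
    """The machine learning model itself (scikit-learn / Chainer / TensorFlow)."""
    def forward(self, batch):
        ...

class Trainer:
    """Trains a Model on data prepared by the DataProcessor."""
    def train(self, model, dataset):
        ...

class ModelAPI:
    """Entry point for the application: preprocess, then predict."""
    def __init__(self, model, processor):
        self.model = model
        self.processor = processor

    def predict(self, raw_input):
        return self.model.forward(self.processor.format(raw_input))
```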

I did not go into detail on this point in the material above, so from here on I would like to look at these tests.

Machine learning model test

There are four main things to test in a machine learning model:

- Operation test
- Verification test
- Integration test
- Evaluation test

I would like to go through these tests step by step. For the code shown below, I will quote from a repository I developed recently.

icoxfog417/tensorflow_qrnn

The repository uses TensorFlow, but I think the idea applies to other libraries as well (I used the same design and testing approach with Chainer before). Conversely, there is a point where I got stuck when testing with TensorFlow, so I will also mention how to deal with that.

Operation Test

The operation test checks whether the Model runs from input to output without raising an error. For a neural network model, you could also call it a forward check.

Here is the code I actually used.

tensorflow_qrnn/test_tf_qrnn_forward.py

Since the input can be random, the point is simply to make sure data passes all the way through to the output. The operation test is used frequently during development, when building or restructuring the model, to verify "as lightly and quickly as possible" that it works at all. In that sense, its role is close to compilation.

Note that in TensorFlow, when you run unit tests, multiple tests share the global Graph and unintended errors occur. You therefore need to separate the Graph for each test case.

import unittest
import tensorflow as tf


class TestQRNNForward(unittest.TestCase):

    def test_qrnn_linear_forward(self):
        batch_size = 100
        sentence_length = 5
        word_size = 10
        size = 5
        data = self.create_test_data(batch_size, sentence_length, word_size)

        # build this test case in its own Graph so tests do not share state
        with tf.Graph().as_default() as q_linear:
            qrnn = QRNN(in_size=word_size, size=size, conv_size=1)
            ...

This phenomenon becomes especially chaotic if variable scopes have not been set up. Basically, when using TensorFlow, it is important to declare variables firmly inside a variable_scope (duplicates cannot be checked with name_scope).

import tensorflow as tf


class QRNNLinear:

    def __init__(self, in_size, size):
        self.in_size = in_size
        self.size = size
        self._weight_size = self.size * 3  # z, f, o
        # declare variables inside a variable_scope so duplicate names are detected
        with tf.variable_scope("QRNN/Variable/Linear"):
            initializer = tf.random_normal_initializer()
            self.W = tf.get_variable("W", [self.in_size, self._weight_size], initializer=initializer)
            self.b = tf.get_variable("b", [self._weight_size], initializer=initializer)
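As a quick check of this behavior, here is a small sketch assuming TensorFlow 1.x (as used in the repository). It is not code from the repository; it only illustrates that get_variable detects duplicates through variable_scope.

```python
import tensorflow as tf

# A minimal sketch (TensorFlow 1.x): tf.get_variable checks for duplicates
# only through variable_scope; name_scope is ignored by get_variable.
with tf.variable_scope("QRNN/Variable/Linear"):
    W1 = tf.get_variable("W", shape=[10, 15])

# Creating "W" again in the same variable_scope without reuse would raise a
# ValueError, so accidental re-declaration is caught. With reuse=True the
# existing variable is returned instead.
with tf.variable_scope("QRNN/Variable/Linear", reuse=True):
    W2 = tf.get_variable("W", shape=[10, 15])

assert W1 is W2
```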

For details on scopes, please refer to the article on that topic (I may summarize it separately at some point). In any case, when using TensorFlow, keep the following in mind.

scope.PNG (Borrowed from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))

Verification Test

Once you have a model that passes the operation test, it is still a little hasty to train it on production data right away. Production data tends to be large, and training on it takes time. Unless you are very confident, you should first confirm with smaller data that the model behaves as intended and achieves better accuracy than a baseline. That is the verification test.

Conversely, preparing a dataset for the verification test and a baseline model for it helps the process of improving the machine learning model. The verification dataset is one that is easy to handle and can be trained on in a relatively short time. The baseline model is a basic model that defines the line "if we cannot beat this, it is no good."

Without this, it is easy to get caught up in wishful thinking such as "maybe a bit more data would help" or "maybe accuracy would improve with a bit more training time", and the improvement of the algorithm itself, which is the essential part, tends to get neglected.


The idea that "we can just work with the production data from the start" is quite a trap, because production data can be heavily biased precisely because it is production data (for example, in diagnostic imaging, if 90% of cases are normal, even a model that always predicts "no abnormality" reaches 90% accuracy). That biased data leads to biased judgments is a basic fact of machine learning, but the reassurance of "we are using production data" tends to distract us from exactly that point.
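As a small illustration of this trap (not from the original repository), scikit-learn's DummyClassifier shows how a "predict the majority class" model already scores 90% on 90%-normal data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# With 90% "normal" labels, a model that always predicts "no abnormality"
# already reaches 90% accuracy.
y = np.array([0] * 90 + [1] * 10)   # 0 = normal, 1 = abnormal
X = np.zeros((100, 5))              # features do not matter for this example

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # => 0.9
```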

To avoid the problems above, it is recommended to prepare verification test data of an easy-to-handle size with a balanced label distribution, along with an environment to run it in.

In the implementation below, the test uses the handwritten digits dataset (digits) bundled with scikit-learn. scikit-learn ships with several such datasets, so if one fits your problem you can save yourself the trouble of preparing data.

tensorflow_qrnn/test_tf_qrnn_work.py

If you do have production data, it is better to create a well-balanced sample with respect to the target label than to simply extract records by period. You can then check that the loss decreases and that the accuracy comes out as expected.
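A sketch of carving a small, label-balanced verification set out of production data might look like the following; X and y stand in for the full production features and labels (synthetic here), and 1000 samples per class is just an example size.

```python
import numpy as np

# X and y are stand-ins for the full production feature matrix and labels.
rng = np.random.RandomState(42)
X = rng.rand(100000, 20)
y = (rng.rand(100000) < 0.1).astype(int)   # imbalanced: roughly 10% positive

# Draw the same number of rows per label so the verification set is balanced
# and small enough to train on quickly.
per_class = 1000
indices = np.concatenate([
    rng.choice(np.where(y == label)[0], size=per_class, replace=False)
    for label in np.unique(y)
])
X_verify, y_verify = X[indices], y[indices]   # 50/50 split of the two labels
```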

Comparison against a baseline is another important role of the verification test. It happens all too often that a plain SVM clearly beats the neural network model you worked so hard on (do tune the baseline model properly; the point is not to use a neural network, but to find the model that suits your purpose). Fortunately, scikit-learn ships with a variety of models, which makes it ideal for this kind of verification, and you can compare against a baseline without writing much code.
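As a sketch of what such a baseline comparison can look like on the digits dataset, with an SVM as the baseline; my_model is a placeholder for your own implementation, not code from the repository:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Baseline comparison on the digits dataset bundled with scikit-learn.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.3, random_state=42, stratify=digits.target)

baseline = SVC(gamma="scale")   # tune the baseline properly in practice
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print("baseline accuracy:", baseline_acc)

# my_model_acc = accuracy_score(y_test, my_model.predict(X_test))
# assert my_model_acc > baseline_acc, "does not beat the baseline -> NG"
```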

By putting up this barrier called the verification test, you can avoid wasting time and money (GPU fees) on a bad model.

seido.PNG (Borrowed from [Studio Ghibli Porco Rosso](https://www.amazon.co.jp/dp/B00005R5J6))

That said, it is also true that some models only become accurate after very long training. In such cases, you can instead record the loss/accuracy achieved per unit of training time (a velocity-like measure) and check that.
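A sketch of recording such a "velocity" might look like the following; train_step and evaluate here are dummy stand-ins for your own training loop.

```python
import time
import random

# train_step() and evaluate() are stand-ins for your own training loop.
def train_step():
    return random.random()   # placeholder loss

def evaluate():
    return random.random()   # placeholder accuracy

history = []
start = time.time()
for step in range(1000):
    loss = train_step()
    if step % 100 == 0:
        history.append({"elapsed_sec": time.time() - start,
                        "loss": loss,
                        "accuracy": evaluate()})

# Comparing these curves between models shows how fast each one improves per
# unit of training time, even when training cannot be run to convergence.
```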

Integration Test

The integration test checks whether the model can be called successfully from the application. When using a machine learning model, accuracy is not the only thing to test; preprocessing and the like must be tested as well.

Therefore, the DataProcessor should be tested on its own before the integration test. Then test whether the Model API works properly when used from the application. As for accuracy, it is a good idea to prepare a dataset of an easy-to-verify size, as in the verification test above, and measure accuracy when running the Model API, that is, under the same conditions as the real application, because problems often surface only at this stage.

For this reason, it is advisable to measure accuracy as well, not just whether the calls work.
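A sketch of such an integration test is shown below; ScalingProcessor and ModelAPI are stand-ins in the spirit of the earlier split, and the digits dataset plus an SVC play the role of the small fixture and the trained model.

```python
import unittest

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC


class ScalingProcessor:
    """Stand-in DataProcessor: scales digits pixel values (0..16) into [0, 1]."""
    def format(self, raw_data):
        return np.asarray(raw_data, dtype=float) / 16.0


class ModelAPI:
    """Stand-in Model API: preprocess, then predict."""
    def __init__(self, model, processor):
        self.model = model
        self.processor = processor

    def predict(self, raw_input):
        return self.model.predict(self.processor.format(raw_input))


class TestModelAPI(unittest.TestCase):

    def setUp(self):
        digits = load_digits()
        self.data, self.labels = digits.data, digits.target
        model = SVC(gamma="scale").fit(self.data / 16.0, self.labels)
        self.api = ModelAPI(model, ScalingProcessor())

    def test_preprocessing(self):
        formatted = ScalingProcessor().format(self.data)
        self.assertTrue((formatted >= 0).all() and (formatted <= 1).all())

    def test_api_accuracy(self):
        # not only "does it run", but also "does accuracy come out reasonably"
        predictions = self.api.predict(self.data)
        accuracy = (predictions == self.labels).mean()
        self.assertGreater(accuracy, 0.9)


if __name__ == "__main__":
    unittest.main()
```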

The dataset used to test the accuracy of the Model API is also useful for continuously monitoring the performance of the machine learning model. That makes it possible to decide when to retrain or rebuild, and in that sense it is worth preparing it for the integration test separately from the verification test (its data will be a little closer to production than the verification data).
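A sketch of how that monitoring might look; the threshold and the notification step are placeholders to be decided per service.

```python
import numpy as np

# Run the Model API on a held-out monitoring set and flag retraining when
# accuracy falls below a service-specific threshold.
ACCURACY_THRESHOLD = 0.85

def check_model_health(api, monitoring_data, monitoring_labels):
    predictions = np.asarray(api.predict(monitoring_data))
    accuracy = (predictions == np.asarray(monitoring_labels)).mean()
    if accuracy < ACCURACY_THRESHOLD:
        # e.g. alert the team or kick off a retraining job here
        print("accuracy dropped to {:.3f}: consider retraining".format(accuracy))
    return accuracy
```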

Evaluation Test

We move on to the evaluation test once the model has beaten the baseline in the verification test and the integration test has confirmed that it can be called from the application.

Here, a so-called A/B test is carried out. For this, if necessary, the model is trained thoroughly on an amount of data well beyond what was used in the verification test, and then checked to see whether it has an advantage over the existing model.

The indicators checked at the evaluation test stage are quite different from those checked in the verification test. The verification test looks at indicators of model performance such as accuracy, while the evaluation test looks at KPIs (Key Performance Indicators) of the service, such as user engagement rate.
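For example, comparing a KPI such as engagement rate between the existing model (A) and the new model (B) could look like the following sketch; the counts are placeholders to be replaced with real A/B test logs, and the simple two-proportion z-test is just one way to judge the difference.

```python
from math import sqrt
from scipy.stats import norm

# Placeholder A/B counts: replace with real logs from the experiment.
engaged_a, total_a = 480, 5000   # existing model (A)
engaged_b, total_b = 540, 5000   # new model (B)

p_a, p_b = engaged_a / total_a, engaged_b / total_b
p_pool = (engaged_a + engaged_b) / (total_a + total_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print("engagement A={:.3f} B={:.3f} p-value={:.3f}".format(p_a, p_b, p_value))
```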

Ultimately, the goal is not to build a highly accurate model, but to build a model that contributes to the service, that is, one that adds value for users. The evaluation test checks exactly this point.

That is my testing approach for implementing machine learning models. I am still working it out through trial and error, so if you have opinions of your own, I would love to hear them.

