[PYTHON] One-click data prediction in the field, realized by fully automatic machine learning

I feel that data utilization, including machine learning, works better when it is driven from the field than when it is pushed from the top down.


This article presents an idea, plus an implementation, for practicing field-driven utilization of machine learning. Specifically, it emphasizes the following three points:

- It can be incorporated easily into the business system you use every day
- It is easy to operate
- Even if you make a mistake, you can redo it as many times as you like

This time we used kintone as the business system and implemented the mechanism on top of it. The mechanism and functions are explained below.

This article is based on my presentation at the Cybozu Days 2016 kintone hack. Here I will touch on the technical points I could not cover during the talk.

What it looks like in use: Data prediction with one click

The following is a kintone application for property management. The name of each property and its various characteristics (walking time from the station, age of the building, and so on) are entered here. When you find yourself wondering "how much should the rent be for a property like this?", press the "Predict" button. The predicted rent is then filled in based on the training results.

prediction.PNG

All you need to do is press this button. We developed the mechanism that makes this possible under the name karura. Only three steps are needed to start using karura's prediction function, and of course no consultant is involved in any of them.

How to use: Data utilization in 3 steps

Only the following three steps are needed to set up the one-click data prediction shown above.

Install the plugin in kintone

Install the plugin in the app where you want to use the prediction function.

image

Prepare a field to hold the predicted value

Prepare a field into which the predicted value will be entered. A separate field is used because we assume you may want to compare values entered by a person against the predicted values.

image

Give the field that will receive the predicted value a name ending in "_prediction".

Train the model

From here, you work on the karura application side to train the model. Enter the app number of the property management application you want to add the prediction function to, and load the app information. Then set the "fields used for prediction" and the "field you want to predict".

image

Once these are set, press the train button.

image

When the training is completed, the prediction accuracy and advice for improving the model will be displayed as shown below.

image

That completes the preparation; the prediction function can now be used on the application side. The demo above is a fairly simple value prediction, but classification and non-linear prediction are also supported. Moreover, the user does not have to worry about whether the task is value prediction or classification: karura determines this automatically and switches models internally.

How it is realized: Fully automatic machine learning

This mechanism, karura, takes over all of the troublesome parts of machine learning.

image

As shown in the figure above, the following points are handled automatically:

- Judging whether each field is a quantitative or a categorical variable
- Normalizing each feature and storing the normalization parameters
- Feature selection
- Model selection and parameter tuning

"Automatically" may sound as if something amazing is going on, but each point just applies standard practice in the standard way. The one place that takes a little ingenuity is how the features are defined from the app's fields. Below, I explain each of these points for automation in turn.

Judgment of quantitative / categorical variables

Some of the features, that is, the items of the kintone app, hold numerical values, while others hold categories such as days of the week. It is not appropriate simply to convert categorical items (Monday, Tuesday, Wednesday, and so on) into numbers. For example, if 0 = Monday, 1 = Tuesday, and 2 = Wednesday, is Wednesday twice Tuesday? Is Tuesday + Tuesday = Wednesday? It makes no sense. Each value therefore has to be treated independently. Variables that represent such categories are called categorical variables, and when they are used as features, each value is turned into its own item (Monday = True/False, Tuesday = True/False, and so on). Variables whose values can safely be treated as numbers (temperature, amount, etc.) are, by contrast, called quantitative variables.
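For illustration, here is a minimal sketch of this expansion using pandas.get_dummies; the data is made up, and whether karura uses pandas internally is not something this article shows.

import pandas as pd

# made-up example: a categorical "day" item and a quantitative "rent" item
df = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Tue"],
                   "rent": [80, 75, 90, 70]})

# each category value becomes its own 0/1 item (day_Mon, day_Tue, day_Wed)
# instead of an ordered number like Mon=0, Tue=1, Wed=2
print(pd.get_dummies(df, columns=["day"]))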

Since it would be a burden to make the user think about this, karura estimates whether a field is a quantitative or a categorical variable from the field's type. Specifically, drop-down and radio-button fields are treated as categorical variables. The field type can be obtained with kintone's [Form Design API](https://cybozudev.zendesk.com/hc/ja/articles/201941834-%E3%83%95%E3%82%A9%E3%83%BC%E3%83%A0%E8%A8%AD%E8%A8%88%E6%83%85%E5%A0%B1%E5%8F%96%E5%BE%97), so the user does not have to specify anything.

Similarly, whether the field specified as the prediction target is a categorical or a quantitative variable determines whether the task is treated as a classification problem or a value-prediction problem.
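As a rough sketch, that decision can be as simple as the following; the field-type names follow kintone's API conventions, and the function names are mine for illustration, not karura's actual code.

# assumption: field types as returned by kintone's form design API
CATEGORICAL_TYPES = ("DROP_DOWN", "RADIO_BUTTON", "CHECK_BOX")

def is_categorical(field_type):
    return field_type in CATEGORICAL_TYPES

def problem_type(target_field_type):
    # a categorical target means classification; a quantitative one, value prediction
    return "classification" if is_categorical(target_field_type) else "regression"

print(problem_type("DROP_DOWN"))  # classification
print(problem_type("NUMBER"))     # regression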

However, fields that contain natural language (specifically, text fields such as comments and titles) cannot be handled at the moment. It would be nice if such fields could be turned into features automatically using distributed representations.

Normalization for each feature and storage of its parameters

It is received wisdom that data needs to be normalized, and in addition to normalizing each feature we also save the parameters used for the normalization (the mean and variance). The parameters are saved because the same normalization has to be applied again at prediction time.
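A minimal sketch with scikit-learn's StandardScaler follows; persisting the fitted scaler with joblib is my assumption here, not necessarily how karura actually stores the parameters.

import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

# made-up training features, e.g. (age of building, walk time from station)
X_train = np.array([[10.0, 5.0], [25.0, 12.0], [3.0, 8.0]])

scaler = StandardScaler().fit(X_train)  # learns the mean and variance
joblib.dump(scaler, "scaler.pkl")       # save them for prediction time

# at prediction time, apply the *same* normalization to the new record
scaler = joblib.load("scaler.pkl")
print(scaler.transform(np.array([[15.0, 7.0]])))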

Feature selection

For feature selection, scikit-learn's feature selection utilities are used. The usage is as follows.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

iris = load_iris()
X, y = iris.data, iris.target

# keep the 2 features with the highest chi-squared scores
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2)

In short, each feature is evaluated individually for how well it classifies or predicts the target, which tells us how much each feature contributes. Unnecessary features are then removed to keep the model simple. At the same time, karura internally keeps track of "which item works, and how much", so that this information can be used to advise users.
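The per-feature scores that SelectKBest computes can be read out directly after fitting; here is a minimal sketch of how "which item works, and how much" could be extracted (karura's exact advice logic is not shown in this article).

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# a higher score means the feature contributes more to predicting the target
for name, score in zip(iris.feature_names, selector.scores_):
    print("%s: %.1f" % (name, score))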

However, the maximum number of features and the cutoff threshold below which features are dropped are currently set by rule of thumb (roughly matched to the number of items a typical kintone app has). Adjusting them properly is a future issue.

Model selection / parameter tuning

Scikit-learn's GridSearchCV is used for model selection and parameter tuning. The usage is as follows.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer versions
from sklearn.svm import SVC

digits = load_digits()

candidates = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100]},
              {'kernel': ['linear'], 'C': [1, 10, 100]}]

# try every combination in the grid with 5-fold cross-validation
# (digits is a multiclass task, so a macro-averaged F1 score is used)
clf = GridSearchCV(SVC(), candidates, cv=5, scoring="f1_macro")
clf.fit(digits.data, digits.target)

print(clf.best_estimator_)

# grid_scores_ is gone in newer scikit-learn; cv_results_ holds the same information
means = clf.cv_results_["mean_test_score"]
stds = clf.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

If you pass a model (estimator) and ranges for its parameters, GridSearchCV searches over every combination in those ranges (super convenient). This makes it easy to obtain the most accurate combination of parameters. We do this for each candidate model and finally save the most accurate model together with its parameters.

The candidate models are based on the model selection map provided by scikit-learn. That said, it is nothing complicated: roughly speaking, Elastic Net and SVR are tried for value prediction, and SVM with different kernels for classification.
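Putting this together, the loop over candidate models could look roughly like the sketch below; the candidates and parameter grids here are illustrative assumptions for a value-prediction task, not karura's exact configuration.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# illustrative candidates for a value-prediction (regression) task
candidates = [
    (ElasticNet(max_iter=10000), {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    (SVR(), {"kernel": ["linear", "rbf"], "C": [1, 10]}),
]

# grid-search each candidate and keep the most accurate model overall
best_score, best_model = None, None
for estimator, params in candidates:
    search = GridSearchCV(estimator, params, cv=5)
    search.fit(X, y)
    if best_score is None or search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model)  # this model and its parameters would be saved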

karura thus achieves fully automatic machine learning, but there is no advanced technology anywhere in it. It is all an accumulation of existing know-how and functions, and there is no deep learning element here. It simply and quietly does the tried-and-true work. Even so, we believe that alone can cover most of what is commonly called "data prediction".

The so-called artificial-intelligence scene these days has the flavor of a contest over transcendent (and expensive) ramen made with outrageous ingenuity. I think there is also value in reliably serving the "yeah, this will do nicely" kind of functionality of a neighborhood ramen shop.

The implementation of karura is published on GitHub, and if you have a kintone environment you can try it out (for individual use you currently need to rewrite parts of the plugin and the JavaScript customization; this will be fixed in the future). If you are interested, please give it a try.

icoxfog417/karura

icon.PNG
