I feel that data utilization, including machine learning, works better when it is driven from the field rather than imposed from the top down.
This article presents an idea and an implementation for practicing this field-driven use of machine learning. Specifically, it emphasizes three points: the prediction feature can be incorporated easily into the business system you already use every day, it is easy to operate, and if you make a mistake you can simply redo it as many times as you like. Here we used kintone as the business system and built the feature into it. The mechanism and functions are explained below.
This article is based on my announcement at the Cybozu Days 2016 kintone hack. Here I will also cover technical points that I could not touch on during the presentation.
The following is a kintone application for property management. The name of a property and its various characteristics (walking time from the station, building age, and so on) are entered here. When you wonder, "How much would the rent be for such a property?", you press the "Predict" button, and a predicted rent is filled in based on the training results.
All you need to do is press this button. We developed the mechanism that achieves this under the name karura. Only three steps are required to use karura's prediction function, and of course no consultant is involved in any of them.
The following three steps are all it takes to enable the one-click prediction above.

1. Install the plugin in the app where you want to use the prediction function.
2. Prepare a field to hold the predicted value. This is kept separate from the actual value so that a value entered by a person can be compared with the prediction.
3. Give that field a name ending in "_prediction" (see the sketch after this list).
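This naming convention is all the plugin needs to locate the output field. As a minimal sketch (the field names here are hypothetical examples, not karura's actual code), finding the target field comes down to a suffix check:

```python
# Hypothetical form fields of a kintone app: field name -> field type
fields = {"rent": "NUMBER", "rent_prediction": "NUMBER", "age": "NUMBER"}

# The plugin can identify where to write predictions purely by the suffix
prediction_fields = [name for name in fields if name.endswith("_prediction")]
print(prediction_fields)  # ['rent_prediction']
```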
From here, we move to the karura application side for training. Enter the app number of the property-management application you want to add the prediction function to, and load the application's information. Then set the "fields used for prediction" and the "field you want to predict", and press the learn button.
When training completes, the prediction accuracy and advice for improving the model are displayed as shown below.
That completes the preparation; the prediction function can now be used on the application side. This demo is a fairly simple value prediction, but classification and non-linear prediction are also supported. The user does not even have to know whether the task is value prediction or classification: karura determines this automatically and switches models internally.
This mechanism, karura, takes care of all the troublesome parts of machine learning.
As shown in the figure above, the following points are handled automatically:

* distinguishing quantitative features from categorical ones and converting them appropriately
* normalizing the data (and saving the normalization parameters)
* selecting which features to use
* selecting the model and tuning its parameters

"Automatically" may sound like something awesome is going on, but each of these is just standard practice carried out by the machine. The one piece that takes a little ingenuity is the definition of the features. Below, I explain each of these points, starting with that one.
Some features, that is, items of the kintone app, hold numerical values, while others hold categories such as days of the week. It is not appropriate to simply convert categorical items (Monday, Tuesday, Wednesday, ...) into numbers. For example, if 0 = Monday, 1 = Tuesday, 2 = Wednesday, is Wednesday twice Tuesday? Is Tuesday + Tuesday = Wednesday? It makes no sense. Each value must therefore be treated independently. Variables that represent such categories are called categorical variables, and when they are used as features, each value becomes its own item (Monday = True/False, Tuesday = True/False, and so on). Variables whose values can safely be treated as numbers (temperature, amount, etc.) are called quantitative variables.
Since it would be a burden to make the user think about this, we estimate whether each field is a quantitative or a categorical variable from its field type: drop-down lists and radio buttons are treated as categorical variables. The field types can be obtained with kintone's [Form Design API](https://cybozudev.zendesk.com/hc/ja/articles/201941834-%E3%83%95%E3%82%A9%E3%83%BC%E3%83%A0%E8%A8%AD%E8%A8%88%E6%83%85%E5%A0%B1%E5%8F%96%E5%BE%97), so the user does not have to specify anything.
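As a minimal sketch of this conversion (the type mapping and field names are hypothetical illustrations, not karura's actual code), pandas' `get_dummies` does the one-hot expansion:

```python
import pandas as pd

# Hypothetical mapping: kintone field types treated as categorical
CATEGORICAL_FIELD_TYPES = {"DROP_DOWN", "RADIO_BUTTON"}

def to_features(records: pd.DataFrame, field_types: dict) -> pd.DataFrame:
    """One-hot encode categorical columns; leave quantitative ones as-is."""
    categorical = [name for name, ftype in field_types.items()
                   if ftype in CATEGORICAL_FIELD_TYPES]
    return pd.get_dummies(records, columns=categorical)

# "day" is a drop-down, so it expands to day_Mon / day_Tue columns
records = pd.DataFrame({"age": [5, 12], "day": ["Mon", "Tue"]})
print(to_features(records, {"age": "NUMBER", "day": "DROP_DOWN"}))
```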
Similarly, whether the field specified as the prediction target is a categorical or a quantitative variable determines whether the task is identified as a classification problem or a value-prediction problem.
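Continuing the sketch above (again with a hypothetical type mapping), that task switch is essentially a one-liner:

```python
# Hypothetical: decide the problem type from the target field's type
def problem_type(target_field_type: str) -> str:
    categorical = {"DROP_DOWN", "RADIO_BUTTON"}
    return "classification" if target_field_type in categorical else "regression"

print(problem_type("DROP_DOWN"))  # classification
print(problem_type("NUMBER"))     # regression
```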
However, karura cannot currently handle fields that contain natural language (specifically, free-text fields such as comments and titles). It would be nice if such fields could be turned into features automatically using distributed representations.
Normalizing the data is common sense, but here we also save the parameters used for normalization (the mean and variance). This is because the same normalization must be applied again at prediction time.
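A minimal sketch of this, assuming scikit-learn's `StandardScaler` and `joblib` for persistence (the file name is an arbitrary example):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# fit() learns the per-feature mean and variance used for normalization
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.pkl")  # persist the parameters with the model

# At prediction time, reload and apply the exact same normalization
scaler = joblib.load("scaler.pkl")
print(scaler.transform(np.array([[2.0, 500.0]])))
```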
For feature selection, scikit-learn's feature_selection module is used. The usage is as follows.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best features
print(X_new.shape)  # (150, 2)
```
In short, this scores each feature individually by how well it relates to the prediction target, measuring how much each feature contributes. Features that contribute little are removed to keep the model simple. At the same time, karura keeps "which item works, and how much" internally so that it can be used as advice to users.
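As a minimal sketch of where that advice can come from (continuing the snippet above), the fitted selector exposes the per-feature scores:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# Per-feature scores: the raw material for "which item works, and how much"
for name, score in zip(iris.feature_names, selector.scores_):
    print("%s: %.1f" % (name, score))
```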
However, the maximum number of features and the cut-off threshold are currently set ad hoc (roughly matched to how many items a typical kintone app has). Adjusting them properly is a future issue.
Scikit-learn's GridSearchCV is used for model selection and parameter tuning. The usage is as follows.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
candidates = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100]},
              {'kernel': ['linear'], 'C': [1, 10, 100]}]

clf = GridSearchCV(SVC(C=1), candidates, cv=5, scoring="f1_macro")
clf.fit(digits.data, digits.target)
print(clf.best_estimator_)

# Mean and spread of the cross-validation score for each combination
for mean, std, params in zip(clf.cv_results_["mean_test_score"],
                             clf.cv_results_["std_test_score"],
                             clf.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```
If you pass a model (estimator) and its parameter ranges like this, GridSearchCV searches every combination in those ranges (super convenient), so the most accurate parameter combination is easy to obtain. We do this for each candidate model and finally save the most accurate model together with its parameters.
The candidate models are chosen based on the model selection map provided by scikit-learn. That said, it is not very complicated: roughly, Elastic Net and SVR are tried for value prediction, and SVM with different kernels for classification.
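A minimal sketch of that final step, under the assumption that each candidate is grid-searched and the best cross-validation score wins (this mirrors the idea, not karura's actual code):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Candidate models for value prediction, each with its own parameter grid
candidates = [
    (ElasticNet(), {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    (SVR(), {"kernel": ["rbf", "linear"], "C": [1, 10]}),
]

best_score, best_model = -float("inf"), None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)  # keep the winner and its parameters
```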
Karura thus achieves fully automated machine learning, but there is no advanced technology in it. Everything is an accumulation of existing know-how and functions; there is no deep-learning element here. It just quietly carries out work that could be called tried-and-true. Still, we believe this alone can cover most of what is called "data prediction".
The so-called artificial-intelligence scene these days has a flavor of competing over transcendent (and expensive) ramen made with outlandish ingenuity. I think it is just as important to reliably serve the "yeah, this will do nicely" kind of dish you get at the ramen shop in town.
The implementation of karura is published on GitHub, and if you have kintone you can try it out (for individual use you currently need to rewrite parts of the plugin and the JavaScript customization; this will be fixed in the future). If you are interested, please give it a try.