I feel that data utilization, including machine learning, works better when it is driven from the field rather than imposed from the top down.
This article presents an idea and an implementation for practicing this field-driven use of machine learning. Specifically, it emphasizes three points: the prediction feature can be incorporated easily into the business system you already use every day, it is easy to operate, and if you make a mistake you can simply redo it as many times as you like. Here we used kintone as the business system and built the feature into it. The mechanism and functions are explained below.
This article is based on my announcement at the Cybozu Days 2016 kintone hack. Here I will also cover technical points that I could not touch on during the presentation.
The following is a kintone application for property management. The name of a property and its various characteristics (walking time from the station, building age, and so on) are entered here. When you wonder, "How much would the rent be for such a property?", you press the "Predict" button, and a predicted rent is filled in based on the training results.
All you need to do is press this button. We developed the mechanism that achieves this under the name karura. Only three steps are required to use karura's prediction function, and of course no consultant is involved in any of them.
The following three steps are all it takes to enable the one-click prediction above.

1. Install the plugin in the app where you want to use the prediction function.
2. Prepare a field to hold the predicted value. This is kept separate from the actual value so that a value entered by a person can be compared with the prediction.
3. Give that field a name ending in "_prediction" (see the sketch after this list).
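This naming convention is all the plugin needs to locate the output field. As a minimal sketch (the field names here are hypothetical examples, not karura's actual code), finding the target field comes down to a suffix check:

```python
# Hypothetical form fields of a kintone app: field name -> field type
fields = {"rent": "NUMBER", "rent_prediction": "NUMBER", "age": "NUMBER"}

# The plugin can identify where to write predictions purely by the suffix
prediction_fields = [name for name in fields if name.endswith("_prediction")]
print(prediction_fields)  # ['rent_prediction']
```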
From here, we move to the karura application side for training. Enter the app number of the property-management application you want to add the prediction function to, and load the application's information. Then set the "fields used for prediction" and the "field you want to predict", and press the learn button.
When training completes, the prediction accuracy and advice for improving the model are displayed as shown below.
That completes the preparation; the prediction function can now be used on the application side. This demo is a fairly simple value prediction, but classification and non-linear prediction are also supported. The user does not even have to know whether the task is value prediction or classification: karura determines this automatically and switches models internally.
This mechanism, karura, takes care of all the troublesome parts of machine learning.
As shown in the figure above, the following points are handled automatically:

* distinguishing quantitative features from categorical ones and converting them appropriately
* normalizing the data (and saving the normalization parameters)
* selecting which features to use
* selecting the model and tuning its parameters

"Automatically" may sound like something awesome is going on, but each of these is just standard practice carried out by the machine. The one piece that takes a little ingenuity is the definition of the features. Below, I explain each of these points, starting with that one.
Some features, that is, items of the kintone app, hold numerical values, while others hold categories such as days of the week. It is not appropriate to simply convert categorical items (Monday, Tuesday, Wednesday, ...) into numbers. For example, if 0 = Monday, 1 = Tuesday, 2 = Wednesday, is Wednesday twice Tuesday? Is Tuesday + Tuesday = Wednesday? It makes no sense. Each value must therefore be treated independently. Variables that represent such categories are called categorical variables, and when they are used as features, each value becomes its own item (Monday = True/False, Tuesday = True/False, and so on). Variables whose values can safely be treated as numbers (temperature, amount, etc.) are called quantitative variables.
Since it would be a burden to make the user think about this, we estimate whether each field is a quantitative or a categorical variable from its field type: drop-down lists and radio buttons are treated as categorical variables. The field types can be obtained with kintone's [Form Design API](https://cybozudev.zendesk.com/hc/ja/articles/201941834-%E3%83%95%E3%82%A9%E3%83%BC%E3%83%A0%E8%A8%AD%E8%A8%88%E6%83%85%E5%A0%B1%E5%8F%96%E5%BE%97), so the user does not have to specify anything.
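As a minimal sketch of this conversion (the type mapping and field names are hypothetical illustrations, not karura's actual code), pandas' `get_dummies` does the one-hot expansion:

```python
import pandas as pd

# Hypothetical mapping: kintone field types treated as categorical
CATEGORICAL_FIELD_TYPES = {"DROP_DOWN", "RADIO_BUTTON"}

def to_features(records: pd.DataFrame, field_types: dict) -> pd.DataFrame:
    """One-hot encode categorical columns; leave quantitative ones as-is."""
    categorical = [name for name, ftype in field_types.items()
                   if ftype in CATEGORICAL_FIELD_TYPES]
    return pd.get_dummies(records, columns=categorical)

# "day" is a drop-down, so it expands to day_Mon / day_Tue columns
records = pd.DataFrame({"age": [5, 12], "day": ["Mon", "Tue"]})
print(to_features(records, {"age": "NUMBER", "day": "DROP_DOWN"}))
```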
Similarly, whether the field specified as the prediction target is a categorical or a quantitative variable determines whether the task is identified as a classification problem or a value-prediction problem.
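Continuing the sketch above (again with a hypothetical type mapping), that task switch is essentially a one-liner:

```python
# Hypothetical: decide the problem type from the target field's type
def problem_type(target_field_type: str) -> str:
    categorical = {"DROP_DOWN", "RADIO_BUTTON"}
    return "classification" if target_field_type in categorical else "regression"

print(problem_type("DROP_DOWN"))  # classification
print(problem_type("NUMBER"))     # regression
```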
However, karura cannot currently handle fields that contain natural language (specifically, free-text fields such as comments and titles). It would be nice if such fields could be turned into features automatically using distributed representations.
Normalizing the data is common sense, but here we also save the parameters used for normalization (the mean and variance). This is because the same normalization must be applied again at prediction time.
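A minimal sketch of this, assuming scikit-learn's `StandardScaler` and `joblib` for persistence (the file name is an arbitrary example):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# fit() learns the per-feature mean and variance used for normalization
scaler = StandardScaler().fit(X_train)
joblib.dump(scaler, "scaler.pkl")  # persist the parameters with the model

# At prediction time, reload and apply the exact same normalization
scaler = joblib.load("scaler.pkl")
print(scaler.transform(np.array([[2.0, 500.0]])))
```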
For feature selection, scikit-learn's feature_selection module is used. The usage is as follows.
```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best features
print(X_new.shape)  # (150, 2)
```
In short, this scores each feature individually by how well it relates to the prediction target, measuring how much each feature contributes. Features that contribute little are removed to keep the model simple. At the same time, karura keeps "which item works, and how much" internally so that it can be used as advice to users.
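As a minimal sketch of where that advice can come from (continuing the snippet above), the fitted selector exposes the per-feature scores:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# Per-feature scores: the raw material for "which item works, and how much"
for name, score in zip(iris.feature_names, selector.scores_):
    print("%s: %.1f" % (name, score))
```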
However, the maximum number of features and the cut-off threshold are currently set ad hoc (roughly matched to how many items a typical kintone app has). Adjusting them properly is a future issue.
Scikit-learn's GridSearchCV is used for model selection and parameter tuning. The usage is as follows.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
candidates = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100]},
              {'kernel': ['linear'], 'C': [1, 10, 100]}]

clf = GridSearchCV(SVC(C=1), candidates, cv=5, scoring="f1_macro")
clf.fit(digits.data, digits.target)
print(clf.best_estimator_)

# Mean and spread of the cross-validation score for each combination
for mean, std, params in zip(clf.cv_results_["mean_test_score"],
                             clf.cv_results_["std_test_score"],
                             clf.cv_results_["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```
If you pass a model (estimator) and its parameter ranges like this, GridSearchCV searches every combination in those ranges (super convenient), so the most accurate parameter combination is easy to obtain. We do this for each candidate model and finally save the most accurate model together with its parameters.
The candidate models are chosen based on the model selection map provided by scikit-learn. That said, it is not very complicated: roughly, Elastic Net and SVR are tried for value prediction, and SVM with different kernels for classification.
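A minimal sketch of that final step, under the assumption that each candidate is grid-searched and the best cross-validation score wins (this mirrors the idea, not karura's actual code):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Candidate models for value prediction, each with its own parameter grid
candidates = [
    (ElasticNet(), {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}),
    (SVR(), {"kernel": ["rbf", "linear"], "C": [1, 10]}),
]

best_score, best_model = -float("inf"), None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5).fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(best_model, best_score)  # keep the winner and its parameters
```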
Karura thus achieves fully automated machine learning, but there is no advanced technology in it. Everything is an accumulation of existing know-how and functions; there is no deep-learning element here. It just quietly carries out work that could be called tried-and-true. Still, we believe this alone can cover most of what is called "data prediction".
The so-called artificial-intelligence scene these days has a flavor of competing over transcendent (and expensive) ramen made with outlandish ingenuity. I think it is just as important to reliably serve the "yeah, this will do nicely" kind of dish you get at the ramen shop in town.
The implementation of karura is published on GitHub, and if you have kintone you can try it out (for individual use you currently need to rewrite parts of the plugin and the JavaScript customization; this will be fixed in the future). If you are interested, please give it a try.