Using machine learning involves a lot of work: preprocessing data, choosing the right model, tuning parameters, and so on. However, a good part of this work is routine. In preprocessing, for example, normalizing numerical data and converting categorical variables to 0/1 features (dummy variable conversion) are steps you run first in almost every case. Advanced feature engineering is a different story, of course, but there are many situations where you have some data and just want to get the routine work done quickly and see how accurate a basic model is.

So we built a tool to automate this kind of work. It is called karura.

As shown in the figure, the concept is to automate the series of routine tasks involved in creating a model. karura is built by making full use of scikit-learn's features, which I will also cover in this article.

karura architecture

karura builds an analysis process by stacking units of processing called insights. An insight is a pair of two things: "when" it applies and "what should be done".

For example, NAFrequencyCheckInsight implements the rule "if NA values make up xx% or more of a column, delete the column". Whether to actually drop the column depends on the case, so a confirmation step can be inserted here interactively.

Insights are divided into types such as preprocessing, feature selection, and model construction, and the order in which they are applied depends on the type (insights of the same type are applied in the order in which they were added). This order is defined in the Insight Index (https://github.com/chakki-works/karura/blob/master/karura/core/insight.py#L61).

The final implementation will look like this.

Of course, you can use any insight you like, and you can also create your own (by inheriting from Insight).

Once a process has been built this way, it can be reused for different data. The resulting model can also be converted into a prediction model, so it can be turned into an API right away.
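To make the idea concrete, here is a minimal sketch of the "when" / "what should be done" pairing and of the ordering by type. Every name in it (InsightType, is_applicable, adopt, run, and the three example classes) is hypothetical and chosen for illustration; it is not karura's actual API, and it uses plain Python dictionaries instead of a real data frame.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional, Tuple

# All names below are hypothetical; this is not karura's actual API.
Table = Dict[str, List[Optional[float]]]  # column name -> values, None means NA


class InsightType(IntEnum):
    """Assumed ordering: lower values are applied first."""
    PREPROCESSING = 0
    FEATURE_SELECTION = 1
    MODEL_SELECTION = 2


class Insight:
    insight_type = InsightType.PREPROCESSING

    def is_applicable(self, table: Table) -> bool:  # the "when"
        raise NotImplementedError

    def adopt(self, table: Table) -> Table:  # the "what should be done"
        raise NotImplementedError


@dataclass
class DropSparseColumns(Insight):
    """Drop columns whose NA ratio is at or above `threshold`."""
    threshold: float = 0.5

    def _sparse(self, table: Table) -> List[str]:
        return [name for name, values in table.items()
                if values and sum(v is None for v in values) / len(values) >= self.threshold]

    def is_applicable(self, table: Table) -> bool:
        return bool(self._sparse(table))

    def adopt(self, table: Table) -> Table:
        sparse = set(self._sparse(table))
        return {name: values for name, values in table.items() if name not in sparse}


class FillMissingWithMean(Insight):
    """Replace remaining NA values with the column mean."""

    def is_applicable(self, table: Table) -> bool:
        return any(v is None for values in table.values() for v in values)

    def adopt(self, table: Table) -> Table:
        filled = {}
        for name, values in table.items():
            present = [v for v in values if v is not None]
            if not present:  # nothing to average; leave the column untouched
                filled[name] = list(values)
                continue
            mean = sum(present) / len(present)
            filled[name] = [mean if v is None else v for v in values]
        return filled


class DropConstantColumns(Insight):
    """A crude feature-selection step: drop columns with a single value."""
    insight_type = InsightType.FEATURE_SELECTION

    def is_applicable(self, table: Table) -> bool:
        return any(len(set(values)) <= 1 for values in table.values())

    def adopt(self, table: Table) -> Table:
        return {name: values for name, values in table.items() if len(set(values)) > 1}


def run(insights: List[Insight], table: Table) -> Tuple[Table, List[str]]:
    """Apply insights ordered by type; the stable sort keeps insertion order within a type."""
    applied = []
    for insight in sorted(insights, key=lambda i: i.insight_type):
        if insight.is_applicable(table):
            table = insight.adopt(table)
            applied.append(type(insight).__name__)
    return table, applied


data = {
    "age": [30.0, None, 40.0, 20.0],
    "memo": [None, None, None, 1.0],
    "flag": [1.0, 1.0, 1.0, 1.0],
}
# The feature-selection insight is added first but still runs last.
result, applied = run([DropConstantColumns(), DropSparseColumns(0.5), FillMissingWithMean()], data)
print(result)   # {'age': [30.0, 30.0, 40.0, 20.0]}
print(applied)  # ['DropSparseColumns', 'FillMissingWithMean', 'DropConstantColumns']
```

The stable sort is what gives "same type, in the order added" for free in this sketch; how karura itself implements the ordering may differ.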

We have prepared a simple Jupyter Notebook, so if you want to see how to use it, please refer to it.

karura_notebook_demo

Incorporation into tools

karura is intended to be embedded in business applications and can be integrated with the following platforms.

Cooperation with Slack

You can use karura as a Slack bot.

(By the way, it supports multiple languages; Japanese and English can currently be selected.)

When it runs as a Slack bot, uploading a data file such as a CSV makes karura start the analysis (the analysis process is defined in advance, as described above). Where the analysis needs confirmation, karura asks interactively ([when defining an `Insight`, set `automatic` to False to make it wait for confirmation](https://github.com/chakki-works/karura/blob/master/karura/core/insight.py#L13)).

karura also treats it as an important feature to teach users the points that must be checked when analyzing data like this. This is because knowledge of the data is indispensable for interpreting the predictions of the resulting model.

For that reason, karura uses models whose results can be explained. It also uses models that train quickly, so that the improvement cycle turns as fast as possible. Put simply, it does not use deep models. Since the purpose of karura is to "automate routine work and check results quickly", spending time to build a high-accuracy model is deliberately left out of scope. Building an "easy and quick" model shows that accuracy does not come from the data as it stands, which makes karura a tool for noticing that you need to add data items or fill in blank fields properly.
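The confirmation flow can be pictured with a small sketch. Only the `automatic` flag comes from the description above; the Step class, the run_steps function, and the confirm callback are hypothetical stand-ins, and in a real Slack bot the callback would post a question and wait for the user's reply rather than decide locally.

```python
from typing import Callable, List


class Step:
    """Hypothetical stand-in for an insight that carries an `automatic` flag."""

    def __init__(self, description: str, automatic: bool = True):
        self.description = description
        self.automatic = automatic


def run_steps(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Apply automatic steps directly; ask `confirm` before applying the others."""
    applied = []
    for step in steps:
        if step.automatic or confirm(step.description):
            applied.append(step.description)
    return applied


steps = [
    Step("Normalize numerical columns"),
    Step("Drop column 'memo' (mostly NA)", automatic=False),
    Step("Drop column 'id' (looks unique per row)", automatic=False),
]

# Stand-in for the user's answers: approve only the 'memo' question.
answers = {"Drop column 'memo' (mostly NA)": True}
applied_steps = run_steps(steps, confirm=lambda question: answers.get(question, False))
print(applied_steps)
```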

For the background on why the features were narrowed down this way, please see the following material if you are interested.

Three pillars to utilize machine learning ~ The need for educational machine learning tools ~

Cooperation with kintone

karura can also be built into kintone, a platform that makes it easy to create business applications. You can easily try building a model from the data in an app created with kintone.

Enter the number of the app you want to analyze
Among the items in the app, select the item you want to use for prediction and the item you want to predict
Press the learn button

This completes the analysis. If you install the plugin on the analyzed app, karura's predicted values can also be entered into it. I have run a hands-on session on this before, so if you are interested, please give it a try.

That said, it is rare for data in a business application to be analyzed as-is and work well. In practice the accuracy is usually not very high, and you go on to examine which items are causing problems and which items should be added.

Since this kind of work is hard to do inside a kintone app, karura provides a batch download of actual values and prediction results, as well as retraining by file upload.

This allows you to test various hypotheses about your data.

karura implementation

The contents of karura are created by making full use of scikit-learn.

Preprocessing: sklearn.preprocessing
Feature selection: sklearn.feature_selection
Model selection / optimization: sklearn.model_selection

All of this processing is then bundled into a Pipeline so that it can serve as a prediction API.
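As a rough illustration of how these three scikit-learn modules fit together in a Pipeline, here is a minimal sketch. The synthetic data, the choice of StandardScaler, SelectKBest, and LogisticRegression, and the parameter grid are all assumptions made for the example; they are not taken from karura's code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for an uploaded table (assumption for the example).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),                    # sklearn.preprocessing
    ("select", SelectKBest(score_func=f_classif)),  # sklearn.feature_selection
    ("model", LogisticRegression(max_iter=1000)),   # a simple, explainable model
])

# sklearn.model_selection: search over feature count and regularization strength.
param_grid = {"select__k": [3, 5, 10], "model__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X, y)

# The refitted best pipeline accepts raw rows and applies every step itself.
predictor = search.best_estimator_
predictions = predictor.predict(X[:5])
print(search.best_params_, predictions)
```

Because the scaler and the feature selector live inside the pipeline, the object returned by the search can be called on raw input at prediction time without repeating the preprocessing by hand.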

Machine learning does not end with building an accurate model; the model is meaningless unless it is put to use. To use it, of course, you have to feed it preprocessed data, and because the predicted values are normalized or converted to label indices, you also have to apply the inverse conversion.

In other words, the model only becomes a usable API once it is sandwiched between "all the preprocessing" and "the inverse label conversion". Even when you think "The model is accurate!", it is often daunting to look at the preprocessing that has piled up by then and wonder, "Do I really have to implement all of this before passing data to the model?"

karura implements a step that converts each insight used during analysis into a Transformer for prediction. As a result, once the process for building the model (applying insights sequentially) is decided, the process for prediction (applying Transformers sequentially) is built automatically.
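The "sandwich" can be sketched as follows. PredictionAPI is a hypothetical wrapper written for this example, not a karura class; it assumes scikit-learn's LabelEncoder for the label conversion and a small made-up data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


class PredictionAPI:
    """Hypothetical wrapper: preprocessing + model + inverse label conversion."""

    def __init__(self):
        self.label_encoder = LabelEncoder()
        self.pipeline = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, labels):
        # Labels such as "small"/"large" become indices for training.
        y = self.label_encoder.fit_transform(labels)
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        # Raw input goes in; the pipeline applies the stored preprocessing.
        indices = self.pipeline.predict(X)
        # Indices are converted back to the labels the caller understands.
        return self.label_encoder.inverse_transform(indices)


# Made-up rows of (height, weight) with a size label.
X_train = np.array([[150.0, 45.0], [155.0, 50.0], [180.0, 80.0], [185.0, 90.0]])
labels = ["small", "small", "large", "large"]

api = PredictionAPI().fit(X_train, labels)
print(api.predict(X_train))
```

The caller never sees scaled features or label indices; both conversions stay inside the wrapper.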

NumericalScalingInsight, the insight that normalizes numerical values, implements [get_transformer](https://github.com/chakki-works/karura/blob/master/karura/core/insights/numerical_scaling_insight.py#L48), so that parameters found during analysis, such as the mean and variance used for normalization, are carried over to the Transformer used for prediction (in practice this is handled by StandardScaler / MinMaxScaler).
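The hand-over of fitted parameters can be illustrated with StandardScaler alone. ScalingStep below is a hypothetical stand-in for such an insight, and its method names only mirror the description above; it is not karura's implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


class ScalingStep:
    """Hypothetical insight-like step that learns scaling parameters during analysis."""

    def __init__(self):
        self._scaler = StandardScaler()

    def adopt(self, values: np.ndarray) -> np.ndarray:
        # Analysis time: learn mean and variance, then scale the training data.
        return self._scaler.fit_transform(values)

    def get_transformer(self) -> StandardScaler:
        # Prediction time: hand over the already fitted scaler.
        return self._scaler


train = np.array([[10.0], [20.0], [30.0]])
step = ScalingStep()
step.adopt(train)

transformer = step.get_transformer()
print(transformer.mean_, transformer.var_)  # learned from the training data

# New data is scaled with the training mean and variance, not its own.
new_rows = np.array([[20.0], [30.0]])
scaled = transformer.transform(new_rows)
print(scaled)
```

Here the value 20.0 maps to 0 because 20 is the training mean, regardless of what the new rows look like on their own.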

sklearn.feature_extraction is not used yet, because karura does not currently support natural language or image items. It may be used in the future.

karura is being developed on GitHub, so I hope you will make use of it!

chakki-works/karura (a Star would be very encouraging m(_ _)m)

[PYTHON] Automate routine tasks in machine learning
