[PYTHON] Automate routine tasks in machine learning

Applying machine learning involves a lot of work: preprocessing the data, choosing a suitable model, tuning its parameters, and so on. Quite a bit of this work, however, is routine. In preprocessing, for example, normalizing numerical data and converting categorical variables into 0/1 features (dummy variables) are steps you run almost regardless of the data. Advanced feature engineering is a different story, but there are many situations where you have some data and simply want to get the routine work done quickly to see how accurate a basic model is.
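As a rough illustration of the two routine steps mentioned above, here is a minimal sketch using pandas and scikit-learn directly. The toy data and column names are made up for this example and have nothing to do with karura itself.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one numerical column and one categorical column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "plan": ["free", "paid", "free"],
})

# Normalize the numerical column to zero mean and unit variance.
scaler = StandardScaler()
df["age"] = scaler.fit_transform(df[["age"]]).ravel()

# Convert the categorical column into 0/1 dummy variables.
df = pd.get_dummies(df, columns=["plan"], dtype=int)

print(df)
```

Nothing here is difficult, but it has to be repeated for every dataset, which is exactly the kind of work worth automating.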

So we developed a mechanism to automate this kind of work. It is called karura.

chakki-works/karura

karura_concept.png

As the figure shows, the concept is to automate the series of routine tasks involved in creating a model. karura is built by making full use of scikit-learn's features, and this article also covers that aspect.

karura architecture

karura builds a workflow by stacking processing units called insights. An insight pairs two things: "when" it applies and "what should be done".

insight.png

For example, NAFrequencyCheckInsight implements "when a column's NA ratio is xx% or more" and "delete the column". Whether to actually drop it depends on the case, so it is also possible to ask the user interactively at this point.
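To make the "when" / "what should be done" pairing concrete, here is a minimal sketch. The class name, the method names `is_applicable` and `adopt`, and the `threshold` parameter are assumptions made for illustration; this is not karura's actual implementation. Only the idea of an `automatic` flag comes from the article.

```python
import pandas as pd


class NAFrequencyInsightSketch:
    """Hypothetical stand-in for an insight, not karura's real class."""

    def __init__(self, threshold=0.5, automatic=True):
        self.threshold = threshold  # NA ratio at or above which a column is dropped
        self.automatic = automatic  # False would mean "ask the user before applying"

    def _targets(self, df):
        # Ratio of missing values per column.
        na_ratio = df.isna().mean()
        return list(na_ratio[na_ratio >= self.threshold].index)

    def is_applicable(self, df):
        # "when": at least one column has too many missing values.
        return len(self._targets(df)) > 0

    def adopt(self, df):
        # "what should be done": drop those columns.
        return df.drop(columns=self._targets(df))


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [None, None, None, 1.0]})
insight = NAFrequencyInsightSketch(threshold=0.5)
if insight.is_applicable(df):
    df = insight.adopt(df)

print(list(df.columns))
```

In an interactive setting, the check for `automatic` would sit between `is_applicable` and `adopt`, so that the user can confirm or reject the step.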

Insights are grouped into types such as preprocessing, feature selection, and model building, and the order in which they are applied depends on the type (insights of the same type are applied in the order in which they were added). This order is defined in the Insight Index (https://github.com/chakki-works/karura/blob/master/karura/core/insight.py#L61).
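The ordering rule described here (by type first, then by insertion order within a type) can be sketched with a stable sort. The type names and the `InsightStub` class below are hypothetical; the real order is whatever karura's Insight Index defines.

```python
from dataclasses import dataclass

# Hypothetical type order, used only for this sketch.
TYPE_ORDER = ["preprocessing", "feature_selection", "model_selection"]


@dataclass
class InsightStub:
    name: str
    kind: str


def order_insights(insights):
    # sorted() is stable, so insights of the same type keep the order
    # in which they were added.
    return sorted(insights, key=lambda i: TYPE_ORDER.index(i.kind))


added = [
    InsightStub("model", "model_selection"),
    InsightStub("scaling", "preprocessing"),
    InsightStub("select", "feature_selection"),
    InsightStub("dummy", "preprocessing"),
]
ordered_names = [i.name for i in order_insights(added)]
print(ordered_names)
```

With this rule, the order in which insights are registered only matters among insights of the same type.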

The final implementation looks like this.

stack_insights.png

Of course, you can use any combination of insights, and you can also create your own by inheriting from Insight.

Once a process has been built this way, it can be reused for different data. The resulting model can also be converted into a prediction model, so it can be turned into an API right away.

We have prepared a simple Jupyter Notebook; please refer to it if you want to see how karura is used.

karura_notebook_demo

Incorporation into tools

karura is intended to be embedded in business applications and can be integrated with the following platforms.

Cooperation with Slack

You can use karura as a Slack bot.

karura_as_slackbot.PNG (It also supports multiple languages; Japanese and English can currently be selected.)

When used as a Slack bot, karura starts the analysis when you upload a data file such as a CSV (the analysis process is defined in advance, as described above). Where the analysis needs confirmation, it asks interactively ([when defining an `Insight`, set `automatic` to False to wait for confirmation](https://github.com/chakki-works/karura/blob/master/karura/core/insight.py#L13)).

karura also treats teaching users what must be checked when analyzing data as an important feature, because knowledge of the data is indispensable for interpreting the predictions of the resulting model.

For this reason, karura uses models whose results can be explained. It also uses models that train quickly, so that the improvement cycle turns as fast as possible. Put simply, it does not use deep models. Since karura's purpose is to "automate routine work and check the result quickly", taking time to build a high-accuracy model is out of scope by design. A model built "easily and quickly" will usually not reach the desired accuracy as is; karura is a tool for noticing that you need to add data items or fill in blank values properly.

If you are interested in the background behind narrowing the features down this way, please refer to the following material.

Three pillars to utilize machine learning ~ The need for educational machine learning tools ~

Cooperation with kintone

karura can also be incorporated into kintone, a platform that makes it easy to create business applications. You can readily try building a model from the data in an app created with kintone.

karura_on_kintone.PNG

This completes the analysis. If you install the plugin on the analyzed app, you can also have karura enter predicted values. I have run a hands-on session on this before, so please give it a try if you are interested.

It is rare, though, for the data in a business application to work well when analyzed as is. In practice the accuracy is often not very high, and you go on to examine which items are hurting the result and which items should be added.

Since this kind of work is hard to do inside the kintone app, karura provides a bulk download of actual values and prediction results, as well as retraining by file upload.

image_720.png

This allows you to test various hypotheses about your data.

karura implementation

Internally, karura makes full use of scikit-learn.

All of the processing is then bundled into a Pipeline so that it can serve as a prediction API.

Machine learning does not end with an accurate model; the model is meaningless unless it is put to use. To use it, you naturally have to feed it preprocessed data, and because the predicted values are normalized or index-encoded, you also have to apply the inverse conversion.

In other words, the model becomes a usable API only when it is sandwiched between "all the preprocessing" and "inverse label conversion". Even when you think "the model is accurate!", it is often daunting to look back at the preprocessing accumulated up to that point and ask, "Do I really have to implement all of this before passing data to the model?"
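The "sandwich" can be sketched with plain scikit-learn as follows. This is a minimal illustration under assumed toy data, not karura's code: a Pipeline covers the preprocessing and the model, and a LabelEncoder handles the inverse conversion of the predicted indices. The function name `predict_api` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: one numerical feature, string labels.
X = np.array([[1.0], [2.0], [10.0], [11.0]])
y_raw = np.array(["low", "low", "high", "high"])

# Labels are index-encoded for training.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw)

# Preprocessing and model are bundled so they are always applied together.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)


def predict_api(raw_rows):
    # Raw input goes in; the pipeline applies the fitted preprocessing.
    encoded = pipeline.predict(np.asarray(raw_rows, dtype=float))
    # Inverse label conversion turns indices back into the original labels.
    return label_encoder.inverse_transform(encoded).tolist()


predictions = predict_api([[1.5], [10.5]])
print(predictions)
```

The caller of `predict_api` never needs to know which preprocessing steps were accumulated during analysis.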

karura implements a conversion from each insight used during analysis into a Transformer for prediction. Once the process for building the model (applying insights in sequence) is decided, the process for prediction (applying Transformers in sequence) is determined along with it.

In NumericalScalingInsight, the insight that normalizes numbers, [get_transformer](https://github.com/chakki-works/karura/blob/master/karura/core/insights/numerical_scaling_insight.py#L48) is implemented so that the parameters found during analysis, such as the mean and variance used for normalization, are carried over to the Transformer used for prediction (in practice this is handled by StandardScaler / MinMaxScaler).
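The idea of carrying normalization parameters from analysis over to prediction can be seen with StandardScaler alone. This is a minimal sketch with made-up numbers, not karura's get_transformer: the scaler is fitted once on the analysis data, and the same fitted object is reused on new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data seen during analysis (hypothetical values).
train = np.array([[10.0], [20.0], [30.0]])

# Fitting stores the mean and variance on the scaler itself.
scaler = StandardScaler().fit(train)
print(scaler.mean_, scaler.var_)

# At prediction time the same fitted scaler is reused, so new rows are
# normalized with the parameters found during analysis, not their own.
new_rows = np.array([[20.0], [40.0]])
scaled = scaler.transform(new_rows)
print(scaled)
```

Refitting the scaler on the new rows would shift the inputs relative to what the model was trained on, which is why the fitted parameters have to travel with the model.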

sklearn.feature_extraction is not used yet, because karura does not currently support natural-language or image items. It may be used in the future.

karura is being developed on GitHub, so I hope you will make use of it!

chakki-works/karura (A Star would be very encouraging m(_ _)m)
