[PYTHON] I made a tool that makes it convenient to set parameters for machine learning models.

I made colt, a convenient tool for writing application settings, such as those for machine learning. Briefly, colt is a tool for writing settings in the style of `AllenNLP`. I wrote "machine learning" in the title, but I think it can be used to configure many kinds of applications, not just machine learning.

Sample code: https://github.com/altescy/colt

Introduction

When experimenting with machine learning models, we often see `argparse` or [`Hydra`](https://hydra.cc/) used to manage hyperparameters. The problem I see with many of these existing parameter management tools is that when the model changes significantly, the parameter-loading code also has to change.

For example (a rather extreme example), suppose you intended to use [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn-svm-svc) from scikit-learn and wrote `argparse` settings that read parameters such as `C`, `kernel`, and `class_weight`. If you then switch to [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn-ensemble-randomforestclassifier), do you have to rewrite even the parameter-loading part? And if you want to configure an ensemble model such as [`StackingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html?highlight=stacking#sklearn-ensemble-stackingclassifier), where both the base classifiers and the meta classifier need settings, it is not obvious how to write them.
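To make the problem concrete, here is a minimal sketch of the kind of argparse-based setup I mean (the argument names are just for illustration). Every SVC-specific parameter is wired into the parser, so switching to a different classifier means rewriting this whole block:

import argparse

from sklearn.svm import SVC

# every SVC-specific parameter is hard-coded into the parser
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--kernel", type=str, default="rbf")
parser.add_argument("--class-weight", type=str, default=None)
args = parser.parse_args()

model = SVC(C=args.C, kernel=args.kernel, class_weight=args.class_weight)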

AllenNLP

One way to solve these problems is the **Register function** adopted by `AllenNLP`, a deep learning framework for natural language processing.

Here I will briefly explain this **Register function**. If you already know it, feel free to skip ahead.

`AllenNLP` describes its settings in JSON format. The following is part of the settings for a sentence classification model:

    "model": {
        "type": "basic_classifier",
        "text_field_embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "embedding",
                    "embedding_dim": 10,
                    "trainable": true
                }
            }
        },
        "seq2vec_encoder": {
           "type": "cnn",
           "num_filters": 8,
           "embedding_dim": 10,
           "output_dim": 16
        }
    },

Specify the class you want to use with `type` and set its parameters in the fields at the same level. Let's also look at the code for `basic_classifier` and `cnn`. The setting items correspond to the arguments of the `__init__` method:

@Model.register("basic_classifier")
class BasicClassifier(Model):
    def __init__(
        self,
        ...,
        text_field_embedder: TextFieldEmbedder,
        seq2vec_encoder: Seq2VecEncoder,
        ...,
    ) -> None:
    ...


@Seq2VecEncoder.register("cnn")
class CnnEncoder(Seq2VecEncoder):
    def __init__(self,
                 embedding_dim: int,
                 num_filters: int,
                 ngram_filter_sizes: Tuple[int, ...] = (2, 3, 4, 5),
                 conv_layer_activation: Activation = None,
                 output_dim: Optional[int] = None) -> None:

If you register classes with the `register` decorator, you can refer to those classes from the settings. With `AllenNLP`, you can make a class configurable just by defining it and `register`ing it. Here, I call this feature the **Register function**. Since the Register function associates a class directly with its settings, there is no need to change the setting-loading code when the model changes.

You can easily swap out various components of the model from the settings. To change the `seq2vec_encoder` type from `cnn` to `lstm`, simply rewrite the settings as follows (`lstm` is already provided in `AllenNLP`):

        "seq2vec_encoder": {
           "type": "lstm",
           "num_layers": 1,
           "input_size": 10,
           "hidden_size": 16
        }

Features of colt

colt is a tool that provides the same functionality as `AllenNLP`'s **Register function**. By using colt, you can easily write settings that are flexible and robust against code changes, just like with `AllenNLP`. It also implements some features not found in `AllenNLP` to make it usable in more situations.

Register function

Here is an example of using colt:

import typing as tp
import colt

@colt.register("foo")
class Foo:
    def __init__(self, message: str) -> None:
        self.message = message

@colt.register("bar")
class Bar:
    def __init__(self, foos: tp.List[Foo]) -> None:  # ---- (*)
        self.foos = foos

config = {
    "@type": "bar",  # `@type`Specify the class with
    "foos": [
        {"message": "hello"},  #The type here is(*)Inferred from the type hint of
        {"message": "world"},
    ]
}

bar = colt.build(config)  # build an object from config

assert isinstance(bar, Bar)

print(" ".join(foo.message for foo in bar.foos))  # => "hello world"

Register a class with `colt.register("<class identifier>")`. On the settings side, write it in the format `{"@type": "<class identifier>", <arguments>...}`.

To build an object from the settings, call `colt.build(<settings dict>)`.

If there is no `@type` field in the settings and the corresponding argument has a type hint, the object is created based on that type hint. In the above example, the argument `foos` of `Bar` has the type hint `List[Foo]`, so the contents of `foos` in `config` are converted into objects of the `Foo` class.

Type hints are not required by colt. If you do not use type hints, write `@type` explicitly:


config = {
    "@type": "bar",
    "foos": [
        {"@type": "bar", "message": "hello"},
        {"@type": "bar", "message": "world"},
    ]
}

If there is neither `@type` nor a type hint, the value is simply treated as a dict.
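For example, here is a minimal sketch (the `baz` class name is hypothetical) showing that a nested value with neither `@type` nor a type hint is passed through as a plain dict:

@colt.register("baz")
class Baz:
    def __init__(self, options) -> None:  # no type hint, so `options` stays a dict
        self.options = options

baz = colt.build({"@type": "baz", "options": {"verbose": True}})
assert isinstance(baz.options, dict)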

Import function

You can also use colt with existing models from scikit-learn and other libraries. If the name specified by `@type` is not registered, colt imports it automatically.

The following is an example of using `StackingClassifier` from scikit-learn:

import colt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

config = {
    "@type": "sklearn.ensemble.StackingClassifier",
    "estimators": [
        ("rfc", { "@type": "sklearn.ensemble.RandomForestClassifier",
                  "n_estimators": 10 }),
        ("svc", { "@type": "sklearn.svm.SVC",
                  "gamma": "scale" }),
    ],
    "final_estimator": {
      "@type": "sklearn.linear_model.LogisticRegression",
      "C": 5.0,
    },
    "cv": 5,
}

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

model = colt.build(config)
model.fit(X_train, y_train)

valid_accuracy = model.score(X_valid, y_valid)
print(f"valid_accuracy: {valid_accuracy}")

In the above example, the model described in `config` can be replaced with anything that follows the scikit-learn API. For example, to grid-search `LGBMClassifier` with `GridSearchCV`:

config = {
    "@type": "sklearn.model_selection.GridSearchCV",
    "estimator": {
        "@type": "lightgbm.LGBMClassifier",
        "boosting_type": "gbdt",
        "objective": "multiclass",
    },
    "param_grid": {
        "n_estimators": [10, 50, 100],
        "num_leaves": [16, 32, 64],
        "max_depth": [-1, 4, 16],
    }
}
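Building and fitting this configuration works the same way as before. This is a minimal sketch that assumes `lightgbm` is installed and reuses `X_train`/`y_train` from the earlier snippet:

model = colt.build(config)      # GridSearchCV wrapping an LGBMClassifier
model.fit(X_train, y_train)
print(f"valid_accuracy: {model.score(X_valid, y_valid)}")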

Reading settings from a configuration file

colt does not provide a function for reading settings from a file. If you want to read settings from a file, convert your favorite format such as JSON/Jsonnet or YAML into a dict and pass it to colt.
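For example, a minimal sketch of loading a JSON file (the file name `config.json` is just an example) and passing it to colt:

import json

import colt

# read the settings into a dict, then hand them to colt
with open("config.json") as f:
    config = json.load(f)

obj = colt.build(config)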

Other features

Module import

If you register classes across several files, all of the classes you intend to use must have been imported by the time `colt.build` is called. colt provides `colt.import_modules` to recursively import multiple modules.

For example, consider the following file structure:

.
|-- main.py
`-- models
    |-- __init__.py
    |-- foo.py
    `-- bar.py

Suppose `models/foo.py` and `models/bar.py` register the `Foo` and `Bar` classes respectively, and `main.py` calls `colt.build`. Use `colt.import_modules(["<module name>", ...])` in `main.py` as follows.

main.py


colt.import_modules(["models"])
colt.build(config)

If you pass a list of module names to `colt.import_modules`, each module and everything below it is imported recursively. In the above example, we passed `["models"]` as the argument, so all modules under the `models` package are imported and `Foo` and `Bar` become available.
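For example, `models/foo.py` might look like this (a minimal sketch reusing the `Foo` class from earlier):

# models/foo.py
import colt

@colt.register("foo")
class Foo:
    def __init__(self, message: str) -> None:
        self.message = message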

Positional arguments

To describe positional arguments in the settings, use `*` as the key and pass a list (or tuple) of positional arguments as the value.

@colt.register("foo")
class Foo:
    def __init__(self, x, y):
        ...

config = {"@type": "foo", "*": ["x", "y"]}

Specifying the constructor

By default, colt builds an object by passing the configured arguments to `__init__`. If you want to create objects with a method other than `__init__`, you can specify the constructor:

@colt.register("foo", constructor="build")
class FooWrapper:
    @classmethod
    def build(cls, *args, **kwargs) -> Foo:
        ...

This is convenient when you want to use a class as a wrapper around another class.
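As a minimal sketch of how this is used (assuming `build` forwards its arguments to the `Foo` class from the earlier example), the settings look the same as for any other registered class; "foo" here refers to `FooWrapper` above:

# calls FooWrapper.build(message="hello") and returns the resulting Foo
foo = colt.build({"@type": "foo", "message": "hello"})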

Changing the meta keys

The special keys used by colt, such as `@type` and `*`, can be changed. For example, to change `@type` to `@` and `*` to `+`, pass them as arguments to `colt.build`:

colt.build(config, typekey="@", argskey="+")

If you want to reuse the same settings across multiple builds, use `ColtBuilder`:

builder = colt.ColtBuilder(typekey="@", argskey="+")
builder(config_one)
builder(config_two)
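For example, with these keys the positional-argument config from above would be written like this (a minimal sketch):

builder = colt.ColtBuilder(typekey="@", argskey="+")
foo = builder({"@": "foo", "+": ["x", "y"]})  # same as {"@type": "foo", "*": ["x", "y"]}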

Usage example with the Kaggle Titanic competition

I tried Kaggle's Titanic competition using colt.

https://github.com/altescy/colt/tree/master/examples/titanic

Using pdpipe and scikit-learn, most of the pipeline, from feature creation and modeling through training and evaluation, is driven by the settings. All settings are written as Jsonnet under `configs`. I hope it serves as a useful reference when using colt.

In conclusion

I have introduced colt's features and some usage examples. I hope it helps you when writing settings.

Also, the functionality of colt is based on the great framework [`AllenNLP`](https://allennlp.org/). `AllenNLP` is packed with ideas that are useful for many machine learning tasks, not just natural language processing, so if you are interested, please give it a try.
