What is this

I wanted to plug my own function into sklearn.pipeline, and while researching how to do that I kept running into the question of what exactly a sklearn-compliant model is. Below is a summary of the official documentation [^1]. I hope it helps anyone who is thinking of using sklearn.pipeline, sklearn.model_selection.GridSearchCV, and so on.

1. Object composition

An object that is a sklearn-compliant model needs to be structured as follows. Let's look at each point in order.

--It has fit and set_params methods.
--It has the _estimator_type attribute.

1.1. fit, set_params method

The fit method trains the model on the training data. In sklearn, the training method is uniformly named fit. As a result, Pipeline and GridSearchCV can train any sklearn-compliant model simply by calling fit on the object. The set_params method follows the same idea: it is called when parameters are tuned, for example by GridSearchCV.
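
To make the role of the shared method names concrete, here is a minimal sketch. ThresholdClassifier, its threshold parameter, and the toy data are hypothetical names I made up for illustration; the Mixin is listed before BaseEstimator on the assumption that this ordering is what current sklearn expects.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts 1 when the first feature exceeds threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is really learned here; only the observed labels are recorded.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return (np.asarray(X)[:, 0] > self.threshold).astype(int)


X = np.array([[0.1], [0.4], [0.6], [0.9], [0.2], [0.8], [0.3], [0.7]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# GridSearchCV clones the estimator, sets each candidate value through
# set_params, and then calls fit; it relies only on the shared method names.
search = GridSearchCV(ThresholdClassifier(), {"threshold": [-1.0, 0.5, 2.0]}, cv=2)
search.fit(X, y)
best_threshold = search.best_params_["threshold"]
```

Because set_params comes from BaseEstimator, the class itself only has to define __init__, fit, and predict.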

1.2. _estimator_type attribute

Some of the features provided by sklearn (for example, GridSearchCV and cross_val_score) behave differently depending on the model type. For example, when a classifier is trained, the data is split with stratified sampling. Typical values are shown below.

--Classification: classifier
--Regression: regressor
--Clustering: clusterer

The _estimator_type attribute is set automatically by inheriting a Mixin class (for example, the ClassifierMixin class) from sklearn.base. In addition, sklearn recommends that a sklearn-compliant model inherit both sklearn.base.BaseEstimator and the Mixin class suited to that model.

--BaseEstimator: provides methods such as set_params that would otherwise be boilerplate if implemented from scratch.
--Mixin: provides the methods used by each _estimator_type.
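
As a rough check of how the Mixin determines the model type, the sketch below asks sklearn's own helpers is_classifier and is_regressor instead of reading the attribute directly, since I have not verified how every sklearn release stores the type internally. ToyClassifier and ToyRegressor are hypothetical names.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin
from sklearn.base import is_classifier, is_regressor


class ToyClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier; the type comes from ClassifierMixin."""

    def fit(self, X, y):
        return self


class ToyRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor; the type comes from RegressorMixin."""

    def fit(self, X, y):
        return self


# Each tuple is (is_classifier, is_regressor) for one instance.
clf_flags = (is_classifier(ToyClassifier()), is_regressor(ToyClassifier()))
reg_flags = (is_classifier(ToyRegressor()), is_regressor(ToyRegressor()))
```

Neither class sets the type by hand; inheriting the Mixin is enough for the helpers to tell them apart.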

The code is published on GitHub, so reading it will deepen your understanding [^2].

1.3. Implementation example

Here is code based on the contents of 1.1 and 1.2. Note that set_params is not written out here because the BaseEstimator class already provides it.


from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):  # Mixin goes to the left of BaseEstimator

    def __init__(self):
        pass

    def fit(self, X, y):
        pass

2. __init__

What needs attention when creating an instance is how parameters are received. The rules are listed below.

--All parameters related to learning, such as hyperparameters, should be passed to the __init__ method (only data should be passed to the fit method).
--Every parameter should have a default value.
--Each parameter should be stored in an attribute with the same name as the parameter.
--Do not validate values when receiving parameters (set_params also overwrites parameters, so validation at instantiation time should be avoided).

2.1. Implementation example


from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):  # Mixin goes to the left of BaseEstimator

    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y):
        pass
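
The reason for these rules becomes clearer when get_params, clone, and set_params are exercised together. The following is only a sketch: SketchClassifier, n_rounds, and penalty are hypothetical names, and placing the check inside fit is one possible way to defer validation, not the only one.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class SketchClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator used only to illustrate the __init__ rules."""

    def __init__(self, n_rounds=10, penalty=None):
        # Store each argument under the same name, without validation or conversion.
        self.n_rounds = n_rounds
        self.penalty = penalty

    def fit(self, X, y):
        # Validation is deferred to fit, so values changed via set_params are checked too.
        if self.n_rounds <= 0:
            raise ValueError("n_rounds must be positive")
        return self


model = SketchClassifier(n_rounds=5)
params = model.get_params()      # derived from the __init__ signature
copy = clone(model)              # new unfitted estimator with the same parameters

model.set_params(n_rounds=-1)    # accepted without complaint: nothing is validated here
try:
    model.fit([[0.0], [1.0]], [0, 1])
    fit_error = None
except ValueError as exc:
    fit_error = str(exc)
```

If __init__ renamed or transformed its arguments, get_params and clone could no longer reconstruct the estimator from its attributes.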

3. fit

The points to note for fit are listed below.

--It receives data as arguments, not parameters.
--Even after learning from the data, it does not keep the data itself.
--Even if y (the target data) is not needed, accept it as the second argument in the form y=None (this allows combinations such as unsupervised feature generation followed by supervised learning in a pipeline).
--The return value is self.
--Attributes estimated from the data end with an underscore (e.g. coef_).

3.1. Implementation example


from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):  # Mixin goes to the left of BaseEstimator

    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        print('The process of learning data is described here.')
        return self
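
To illustrate why y=None and return self matter, here is a small sketch of an unsupervised step placed in front of a supervised one. MeanCenterer and the toy data are hypothetical; LogisticRegression is used only as a convenient final step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-column mean."""

    def fit(self, X, y=None):
        # y is accepted but ignored, so the step can sit inside a supervised pipeline.
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)   # learned attribute, hence the trailing underscore
        return self                   # returning self allows method chaining

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

# Chaining works because fit returns self.
centered = MeanCenterer().fit(X).transform(X)

# The pipeline passes y to every step; the transformer simply ignores it.
pipe = Pipeline([("center", MeanCenterer()), ("clf", LogisticRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Note that the transformer keeps only mean_, not the training data itself.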

4. Other

Other points to note are listed below.

--X.shape[0] and y.shape[0] must be the same (check with sklearn.utils.validation.check_X_y).
--set_params takes a dictionary of parameters (passed as keyword arguments, e.g. set_params(**params)) and returns self.
--get_params takes no required arguments (only the optional deep flag).
--For classifiers, keep the list of labels in the classes_ attribute (use sklearn.utils.multiclass.unique_labels).

4.1. Implementation example


from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import unique_labels

class Classifier(ClassifierMixin, BaseEstimator):  # Mixin goes to the left of BaseEstimator

    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        print('The process of learning data is described here.')
        return self

    def get_params(self, deep=True):
        return {"params1": self.params1, "params2": self.params2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
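
The two helpers used in fit above can also be tried on their own. This is a small sketch with made-up data, showing what check_X_y returns and how it reacts when X and y differ in length.

```python
import numpy as np
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y

X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
y = [2, 0, 2]

X_checked, y_checked = check_X_y(X, y)   # plain lists come back as NumPy arrays
labels = unique_labels(y_checked)        # sorted array of the distinct labels

try:
    check_X_y(X, y[:2])                  # 3 rows in X but only 2 targets
    mismatch_error = None
except ValueError as exc:
    mismatch_error = type(exc).__name__
```

Storing the result of unique_labels in classes_ during fit is what the fourth point in the list refers to.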

5. Coding convention

The code basically follows PEP8, but sklearn has some additional coding standards, which I describe here. They are unnecessary if you are not planning to contribute to sklearn.

--Separate words with underscores, except in class names (for example, n_samples).
--Do not write multiple statements on one line (put a line break after if and for statements).
--Import modules inside sklearn with relative paths (in test code, use absolute paths).
--Do not use `import *`.
--Write docstrings in numpy style [^3].
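
As a rough illustration of several of these conventions at once (underscore-separated names, one statement per line, a numpy-style docstring), here is a hypothetical helper. count_samples and min_value are names I made up; they are not part of sklearn.

```python
def count_samples(X, min_value=0.0):
    """Count the rows whose first feature is at least ``min_value``.

    Parameters
    ----------
    X : list of list of float
        Input data with shape (n_samples, n_features).
    min_value : float, default=0.0
        Lower bound applied to the first feature.

    Returns
    -------
    n_selected : int
        Number of rows that satisfy the bound.
    """
    n_selected = 0
    for row in X:
        # The if statement and its body are on separate lines.
        if row[0] >= min_value:
            n_selected += 1
    return n_selected
```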

6. Check if the model is sklearn compliant

sklearn provides the check_estimator function to check whether a model is sklearn-compliant. The tests depend on the _estimator_type attribute, but it appears to run a set of tests to confirm compliance. If you do not implement the fit method, you get the error `AttributeError: 'Classifier' object has no attribute 'fit'`. Also, a template for sklearn-compliant models is available on GitHub, so I think it is best to implement your model with reference to it and run check_estimator once it is complete. An execution example is shown below.

from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y

# With the code implemented so far, an error occurs because the predict method required of a classifier is not defined.
# If you implement it like the Template Estimator, without inheriting ClassifierMixin, no error occurs.
class Estimator(BaseEstimator):

    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        self.is_fitted_ = True
        return self

    def get_params(self, deep=True):
        return {"params1": self.params1, "params2": self.params2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


check_estimator(Estimator())  # recent sklearn versions expect an instance, not the class
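
For comparison, the failing case mentioned above (no fit method) can be reproduced with a sketch like the following. NoFitEstimator is a hypothetical class, and because the exact exception type and message may differ between sklearn versions, the code only records that some error was raised.

```python
from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator


class NoFitEstimator(BaseEstimator):
    """Hypothetical estimator that deliberately omits fit."""

    def __init__(self, params1=0):
        self.params1 = params1


try:
    check_estimator(NoFitEstimator())   # pass an instance, not the class
    failure = None
except Exception as exc:                # exception type may vary by sklearn version
    failure = type(exc).__name__
```

Printing failure (or the exception itself) while developing shows which requirement was not met.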
