[PYTHON] Implement a minimal self-made estimator with scikit-learn

What I want to do

scikit-learn is practically the de facto standard machine learning library for Python. Its strength is not only that it implements many algorithms, but also that it is designed consistently, so different algorithms can be handled through a common interface. If you implement an algorithm that scikit-learn does not have, or wrap an algorithm from another library, so that it behaves like the other scikit-learn estimators, you can cross-validate it, evaluate its performance, and tune its parameters with grid search just as you would a built-in estimator. This article shows a minimal estimator implementation. It covers classifiers and regressors only (not clustering or other unsupervised learning).

Minimal implementation

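from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    # The default values are an assumption (any values work); they let
    # MyEstimator() be called without arguments, as in the examples below.
    def __init__(self, param1=0, param2=1):
        self.param1 = param1
        self.param2 = param2

    def fit(self, x, y):
        # Learn from the training data here; fit must return self.
        return self

    def predict(self, x):
        # Dummy prediction: always 1.0 for every sample.
        return [1.0] * len(x)

    def score(self, x, y):
        # Dummy score: always 1.
        return 1

    def get_params(self, deep=True):
        # Data-independent parameters, keyed by attribute name.
        return {'param1': self.param1, 'param2': self.param2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self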

Define the estimator class by inheriting from sklearn.base.BaseEstimator. Rewrite the body of each method as appropriate.
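As a side note, BaseEstimator itself provides default get_params and set_params methods that read the parameter names from the __init__ signature, so writing them by hand is mainly useful for seeing what they do. The sketch below illustrates the inherited versions; the class name TinyEstimator and its default values are made up for this illustration.

```python
from sklearn.base import BaseEstimator, clone


class TinyEstimator(BaseEstimator):
    """Hypothetical estimator used only to show the inherited parameter API."""

    def __init__(self, param1=0, param2=1):
        # Store each constructor argument under an attribute of the same name;
        # the inherited get_params relies on this convention.
        self.param1 = param1
        self.param2 = param2


est = TinyEstimator(param1=5)
params = est.get_params()      # inherited from BaseEstimator
est.set_params(param2=0.1)     # also inherited; returns the estimator itself
copy = clone(est)              # a new, unfitted object with the same parameters

print(params)
print(copy.get_params())
```

clone is what cross-validation and grid search use internally to create a fresh estimator for each run, which is why get_params and set_params have to round-trip the constructor arguments.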

Execution example

Cross validation:

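# sklearn.cross_validation was removed in scikit-learn 0.20;
# cross_val_score now lives in sklearn.model_selection.
from sklearn.model_selection import cross_val_score

x = [[2,3],[4,5],[6,1],[2,0]]
y = [0.0,9.4,2.1,0.9]

estimator = MyEstimator()
cross_val_score(estimator, x, y, cv=3)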

Result:

array([ 1.,  1.,  1.])

Grid search:

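# sklearn.grid_search was removed in scikit-learn 0.20;
# GridSearchCV now lives in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV

# cv=3 matches the old default; the current default of 5 folds exceeds the 4 samples.
gs = GridSearchCV(estimator, {'param1': [0, 10], 'param2': (1, 1e-1, 1e-2)}, cv=3)
gs.fit(x, y)
gs.best_estimator_, gs.best_params_, gs.best_score_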

Result:

(MyEstimator(), {'param1': 0, 'param2': 1}, 1.0)

cross_validation

To perform cross-validation, you need a fit method that learns from the training data and a score method that takes test data, compares the values estimated from it with the correct values, and returns a score.

fit(self, x, y)

A function that learns so that the output is y for the input x.

predict(self, x)

A function that returns the prediction y_pred for the input x. You do not need predict if you only want to run cross-validation, but in most cases you will call predict inside score. If you inherit sklearn.base.ClassifierMixin or sklearn.base.RegressorMixin through multiple inheritance, you only need to implement predict and can use the score method the mixin provides.

score(self, x, y)

A function that estimates the output y_pred for the input x, compares y_pred with the correct answer y, and returns a score (for example, based on the error or on whether the labels match).

grid_search

To run a grid search, you need to be able to manipulate parameters in addition to the learning and scoring defined above. Implement get_params to get the data-independent parameters and set_params to set them.

get_params(self, deep=True)

Returns a dictionary whose keys are the parameter (attribute) names and whose values are the parameter values.

set_params(self, **parameters)

The parameter setter. It receives the parameters as keyword arguments, using the same names as the keys of the dictionary returned by get_params, and returns self.
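To make the roles of fit, predict, and score more concrete, here is a minimal sketch of an estimator whose score is actually computed from its predictions. The class MeanEstimator, the choice of negative mean absolute error as the score, and the toy data are all assumptions for illustration; the only convention relied on is that a higher score means a better model.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score


class MeanEstimator(BaseEstimator):
    """Hypothetical regressor that always predicts the mean of the training targets."""

    def fit(self, x, y):
        # "Learning" here is just remembering the mean of y.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        # One prediction per input sample.
        return np.full(len(x), self.mean_)

    def score(self, x, y):
        # Negative mean absolute error, so that higher is better.
        y_pred = self.predict(x)
        return -float(np.mean(np.abs(np.asarray(y) - y_pred)))


x = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# With no scoring argument, cross_val_score calls the estimator's own score method.
scores = cross_val_score(MeanEstimator(), x, y, cv=3)
print(scores)
```

Because score calls predict internally, the score for each fold reflects how far the training mean is from that fold's held-out targets.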

About Mixin

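By additionally inheriting sklearn.base.ClassifierMixin for a classification model or sklearn.base.RegressorMixin for a regression model (multiple inheritance), you can use the methods those mixins already implement. If you inherit one of these, you can use its score method, which computes the score from the predict you implemented: mean accuracy for ClassifierMixin and the coefficient of determination (R^2) for RegressorMixin.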
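As a rough sketch of the mixin approach, the estimator below implements only fit and predict and takes score from RegressorMixin. The class name MeanRegressor and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


# The mixin is listed before BaseEstimator, following scikit-learn's convention.
class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor that always predicts the mean of the training targets."""

    def fit(self, x, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        return np.full(len(x), self.mean_)


x = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 2.0, 3.0])

reg = MeanRegressor().fit(x, y)
r2 = reg.score(x, y)   # score comes from RegressorMixin and returns R^2
print(r2)
```

A model that always predicts the mean of the targets it is scored on has R^2 equal to 0, which makes this a convenient sanity check that the inherited score really uses predict.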

For release

You can check whether your estimator is compatible with scikit-learn using sklearn.utils.estimator_checks.check_estimator. Incidentally, the sample shown in this article fails this check with an error saying that the input is not validated. That should not be a problem if you only use the estimator yourself.
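One way to address the input-validation complaint is to run the inputs through scikit-learn's validation helpers. The following is only a sketch under that assumption: the class ValidatedMeanRegressor is hypothetical, and passing check_estimator may require more than what is shown here.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class ValidatedMeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor that validates its inputs before using them."""

    def fit(self, X, y):
        # Converts to arrays and rejects malformed input
        # (for example 1-D X, NaN values, or mismatched lengths).
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self, "mean_")
        X = check_array(X)
        return np.full(X.shape[0], self.mean_)


model = ValidatedMeanRegressor().fit([[2, 3], [4, 5], [6, 1]], [0.0, 9.0, 3.0])
pred = model.predict([[1, 1], [2, 2]])
print(pred)
```

Attributes learned during fit end with an underscore (mean_), which is the scikit-learn naming convention that check_is_fitted relies on.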

Summary

- Create your own estimator class by inheriting sklearn.base.BaseEstimator.
- You need the fit and score methods to do cross-validation.
- To do a grid search, you additionally need the get_params and set_params methods.
- If you inherit ClassifierMixin or RegressorMixin, you can use its score method, which calculates the score using the predict you implemented.

Reference

Most of what I wrote here is based on the API Reference for the sklearn.base module and the information for developers on the official website.
