[PYTHON] How to use cuML SVC as a GridSearchCV classifier

Article summary

If you pass cuML's SVM (SVC) as an estimator to scikit-learn's GridSearchCV, an error occurs. This article records a workaround. This is my first post on Qiita, so I would appreciate it if you could point out any mistakes or anything that is hard to understand.

Conclusion first

Cause of error

- The return value of scikit-learn's sklearn.svm.SVC.predict() is a NumPy array.
- The return value of cuML's cuml.svm.SVC.predict() is a cuDF Series.

scikit-learn's GridSearchCV assumes that estimator.predict() returns a NumPy array. Because cuML's SVC.predict() returns a cuDF Series instead, an error occurs inside GridSearchCV.

Without GridSearchCV, you can work around this by converting the return value to a NumPy array after each call. With GridSearchCV that approach is not available, because you hand the SVC instance itself to GridSearchCV, which then calls predict internally.

Solution

- Create a class that inherits from cuml.svm.SVC.
- Override the predict method so that it converts the return value to a NumPy array before returning it.
- Use an instance of that class as the estimator.
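The steps above can be tried without a GPU. The sketch below is only an illustration under stated assumptions: `DeviceSeries` and `FakeGpuSVC` are hypothetical stand-ins invented here to imitate an estimator whose predict() returns a non-NumPy container, and they are not part of cuML or cuDF. The real cuML version appears later in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class DeviceSeries:
    """Hypothetical stand-in for a cuDF Series: wraps values, is not array-like."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype=np.float64)

    def to_numpy(self):
        return self._values.copy()


class FakeGpuSVC(SVC):
    """Hypothetical stand-in for an SVC whose predict() returns a DeviceSeries."""

    def predict(self, X):
        return DeviceSeries(super().predict(X))


class NumpyFakeGpuSVC(FakeGpuSVC):
    """Steps 1 and 2: inherit, then override predict() to return a NumPy array."""

    def predict(self, X):
        return super().predict(X).to_numpy().astype(int)


# Step 3: use an instance of the subclass as the estimator.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
search = GridSearchCV(NumpyFakeGpuSVC(), {"C": [0.1, 1.0]}, scoring="accuracy", cv=3)
search.fit(X, y)
y_pred = search.best_estimator_.predict(X)
print(type(y_pred), search.best_params_)
```

Because the conversion lives inside predict(), every clone that GridSearchCV creates carries it along, which is the point of subclassing rather than converting by hand.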

Implementation example

As an example, we will use an SVM to classify the digits "5" and "8" in the MNIST dataset, for the following reasons.

- MNIST is easy to obtain and preprocess, and the amount of data is just right.
- cuML's SVC currently supports only two-class classification.
- "5" and "8" appear to be the hardest pair to tell apart (Reference).

Execution environment

Data set creation

First, create the dataset. Extract only the 5s and 8s from MNIST and change the labels to binary (5 → 0, 8 → 1).

dataset_maker.py


import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

def dataset_maker():
    # as_frame=False returns NumPy arrays, so the loop below iterates over
    # images row by row (recent scikit-learn returns a DataFrame by default).
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    data_58 = []
    label_58 = []
    # Targets are strings; keep only "5" (label 0) and "8" (label 1).
    for data, target in zip(mnist.data, mnist.target):
        if target == '5':
            data_58.append(data / 255)  # scale pixel values to [0, 1]
            label_58.append(0)
        elif target == '8':
            data_58.append(data / 255)
            label_58.append(1)

    data_58 = np.array(data_58)
    label_58 = np.array(label_58)
    X_train, X_test, y_train, y_test = train_test_split(data_58, label_58)

    return X_train, X_test, y_train, y_test
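The filtering and relabeling step can also be written with boolean masks instead of a Python loop. The following is a minimal sketch, not the article's code: the helper name `select_two_classes` is hypothetical, and a tiny synthetic array stands in for MNIST so the snippet runs without downloading anything.

```python
import numpy as np


def select_two_classes(data, target, negative, positive):
    """Keep only two classes; relabel them 0 (negative) and 1 (positive)."""
    data = np.asarray(data, dtype=np.float64)
    target = np.asarray(target)
    mask = (target == negative) | (target == positive)
    labels = (target[mask] == positive).astype(int)
    # Scale pixel values from [0, 255] to [0, 1], as in dataset_maker().
    return data[mask] / 255, labels


# Synthetic stand-in for MNIST: four "images" of two pixels each.
pixels = np.array([[0, 255], [51, 102], [255, 0], [153, 204]])
digits = np.array(["5", "3", "8", "5"])  # string targets, like fetch_openml

X, y = select_two_classes(pixels, digits, negative="5", positive="8")
print(X.shape, y)
```

On the full MNIST arrays the masked version avoids building Python lists row by row, which mostly matters for readability and speed of the preprocessing step.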

Difference between sklearn.svm.SVC.predict() and cuml.svm.SVC.predict()

Let's check the difference in the return value of the predict method, which is the cause of the error. As the code below shows, cuML's SVC can be used in the same way as sklearn's SVC, which makes it pleasantly easy to adopt.

sklearn_vs_cuML.py


from sklearn.svm import SVC as skSVC
from cuml.svm import SVC as cuSVC

from dataset_maker import dataset_maker

def classify_sklearn(X_train, X_test, y_train, y_test):
    clf = skSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("skSVC output_type:{}".format(type(y_pred)))
    print("skSVC y_pred:{}".format(y_pred[0:10]))

def classify_cuml(X_train, X_test, y_train, y_test):
    clf = cuSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("cuSVC output_type:{}".format(type(y_pred)))
    print("cuSVC y_pred:{}".format(y_pred[0:10]))


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    classify_sklearn(X_train, X_test, y_train, y_test)
    classify_cuml(X_train, X_test, y_train, y_test)

When you do this, the output will look like this:

skSVC output_type:<class 'numpy.ndarray'>
skSVC y_pred:[0 0 0 1 0 0 0 0 1 0]
cuSVC output_type:<class 'cudf.core.series.Series'>
cuSVC y_pred:0    0.0
1    0.0
2    0.0
3    1.0
4    0.0
5    0.0
6    0.0
7    0.0
8    1.0
9    0.0
dtype: float64

As noted above, the return values differ. In summary:

- The output type is different
  - sklearn: numpy.ndarray
  - cuML: cudf.core.series.Series
- The element type of the output is different
  - sklearn: int
  - cuML: float64

Because of the former, passing the return value of cuml.svm.SVC.predict() directly to one of sklearn's evaluation functions raises ValueError: Expected array-like (array or non-string sequence). [^1]

[^1]: The latter seems to be cast implicitly, so things work once the former alone is fixed. Relying on that feels uncomfortable, though, so the code below casts explicitly to int.

This can be solved simply by converting the result to a NumPy array. When classifying with cuML's SVC, convert the return value of the predict method to a NumPy array with [cudf.core.series.Series.to_array()](https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.core.series.Series.to_array) and then pass it to scikit-learn's evaluation function. [^2]

[^2]: Of course, cuML's own evaluation functions accept cuDF Series, but only a few of them exist at present, so in practice I expect scikit-learn's evaluation functions to be used in most cases.
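The conversion described above can be wrapped in a small helper. This is a sketch under assumptions: the helper name `to_host_int_array` is hypothetical, and the check for a `to_numpy` method assumes that some cuDF versions expose that name instead of `to_array`, which should be verified against the installed version. `LegacySeries` is a made-up stand-in so the snippet runs without a GPU.

```python
import numpy as np


def to_host_int_array(y_pred):
    """Convert a prediction container to a NumPy integer array."""
    if hasattr(y_pred, "to_numpy"):      # assumed name in some cuDF versions
        values = y_pred.to_numpy()
    elif hasattr(y_pred, "to_array"):    # the method used in this article
        values = y_pred.to_array()
    else:                                # already a list or NumPy array
        values = np.asarray(y_pred)
    # Labels arrive as float64 from cuML's SVC; cast them explicitly to int.
    return np.asarray(values).astype(int)


class LegacySeries:
    """Hypothetical stand-in for a cuDF Series that only offers to_array()."""

    def __init__(self, values):
        self._values = list(values)

    def to_array(self):
        return np.array(self._values, dtype=np.float64)


labels = to_host_int_array(LegacySeries([0.0, 1.0, 1.0]))
print(labels, labels.dtype)
```

The result can then be passed to a scikit-learn evaluation function such as sklearn.metrics.accuracy_score together with the true labels.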

Use GridSearchCV with cuML

Now for the main subject. If you want to tune the hyperparameters of SVC by grid search, the first thing that comes to mind is probably scikit-learn's GridSearchCV. First, let's try it with scikit-learn's SVC as the estimator.

classify_sklearn_grid.py


from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC as skSVC

from dataset_maker import dataset_maker

def classify_sklearn_grid(X_train, X_test, y_train, y_test):
    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}

    clf = GridSearchCV(skSVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    # Predict on the held-out test set with the best estimator found.
    y_pred = clf.best_estimator_.predict(X_test)

    return y_pred

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_sk_grid = classify_sklearn_grid(X_train, X_test, y_train, y_test)

It would look something like this.

cuML's SVC also provides the required methods such as .fit() and .predict(), so it satisfies GridSearchCV's requirements for an estimator.

In practice, however, its predict method returns a cuDF Series, which causes an error when the results are scored. Because you hand the SVC instance itself to GridSearchCV, which calls predict internally, you cannot apply the to_array method to the result of each predict call yourself.
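To see that the predict calls really happen inside GridSearchCV, out of reach of the caller, here is a small GPU-free sketch. `CountingSVC` is a hypothetical helper written for this illustration; it subclasses scikit-learn's SVC and only counts how often predict() is invoked during the search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class CountingSVC(SVC):
    """Hypothetical helper: counts predict() calls across all clones."""

    predict_calls = 0  # class-level, because GridSearchCV clones the estimator

    def predict(self, X):
        CountingSVC.predict_calls += 1
        return super().predict(X)


X, y = make_classification(n_samples=100, n_features=6, random_state=0)
search = GridSearchCV(CountingSVC(), {"C": [0.1, 1.0]}, scoring="accuracy", cv=5)
search.fit(X, y)  # the caller never calls predict() here

print("predict() calls made inside GridSearchCV:", CountingSVC.predict_calls)
```

The counter is nonzero even though the calling code never calls predict(), so any conversion of the return value has to live inside the estimator itself.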

To solve this problem, override the predict method so that it returns a NumPy array.

In detail, it is simple: just define a new class like this.

MySVC.py


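from cuml.svm import SVC

class MySVC(SVC):
    def __init__(self, **kwargs):
        # Forward all hyperparameters to cuML's SVC unchanged.
        super().__init__(**kwargs)
    def predict(self, X):
        # cuML returns a cuDF Series of float64 labels; convert it to a
        # NumPy integer array so scikit-learn's scorers can consume it.
        y_pred = super().predict(X).to_array().astype(int)
        return y_pred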

You can pass this MySVC to GridSearchCV instead of cuML's SVC. It hardly needs spelling out, but the usage looks like this:

classify_MySVC.py


from sklearn.model_selection import GridSearchCV

from dataset_maker import dataset_maker
from MySVC import MySVC

def classify_cuml_grid(X_train, X_test, y_train, y_test):

    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}

    clf = GridSearchCV(MySVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    y_pred = clf.best_estimator_.predict(X_test)

    return y_pred

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_cu_grid = classify_cuml_grid(X_train, X_test, y_train, y_test)

That's all. You should now be able to use GridSearchCV with cuML. This turned out long, so thank you for reading to the end!

Bonus

While I'm at it, here is the difference in execution time between scikit-learn and cuML.

- scikit-learn: 1348.87 [s] (about 22.5 minutes)
- cuML: 270.06 [s] (about 4.5 minutes)

Each was measured only once, so treat the numbers as a rough reference, but scikit-learn took about 5 times longer. cuML really is fast!
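For anyone repeating the comparison, a timing helper along these lines may be convenient. It is only a sketch: the article does not say how the times were measured, and the names `time_call` and `speedup` are hypothetical. The two durations plugged in at the end are the ones reported above.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times longer the baseline took than the candidate."""
    return baseline_seconds / candidate_seconds


# Example with a cheap stand-in workload instead of a real grid search.
total, elapsed = time_call(sum, range(1000))
print("elapsed seconds:", elapsed)

# Ratio of the two durations reported in this article.
ratio = speedup(1348.87, 270.06)
print("scikit-learn / cuML:", round(ratio, 2))
```

In practice each grid search function would be passed to time_call in place of sum, ideally several times, since a single run only gives a rough figure.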
