[Python] Making inferences with a trained scikit-learn model in PySpark

Introduction

When doing machine learning with PySpark, Spark's ML library may not cover every algorithm you need, and you may want to use other libraries such as scikit-learn.

Training has to be done separately, because scikit-learn cannot operate on a Spark DataFrame directly, but inference can be run smoothly through a UDF. This post is a note on how to do that.

Method

If you have a trained model (model, a scikit-learn-style estimator), you can do the following. Here data is the Spark DataFrame holding the inference data, and features is the list of explanatory-variable column names.
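Before the UDF below can be used, the trained model has to exist on the driver. A minimal sketch of one such setup, assuming a toy dataset and a `LogisticRegression` classifier (both are illustrative choices, not from the original post; your `model`, `data`, and `features` will differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data on the driver; the columns and model choice here are
# illustrative assumptions, not part of the original post.
train = pd.DataFrame({
    'x1': [0.0, 1.0, 2.0, 3.0],
    'x2': [1.0, 0.0, 1.0, 0.0],
    'y':  [0, 0, 1, 1],
})
features = ['x1', 'x2']  # list of explanatory-variable column names
model = LogisticRegression().fit(train[features], train['y'])

# `data` would be the Spark DataFrame holding the same feature columns,
# e.g. data = spark.createDataFrame(train[features])
```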

The UDF returns the result of model.predict(X), so replace this with your own model's prediction function as appropriate. Likewise, if the return value is a continuous value, change the return type to DoubleType().

Inference using a trained model in PySpark


import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(returnType=IntegerType())
def predict_udf(*cols: pd.Series) -> pd.Series:
    # Each element of `cols` is a pandas Series for one feature column;
    # reassemble them into a single DataFrame for scikit-learn.
    X = pd.concat(cols, axis=1)
    return pd.Series(model.predict(X))

# `features` is a list of column names, so unpack it as positional arguments.
# withColumn returns a new DataFrame, so assign the result.
data = data.withColumn('predict', predict_udf(*features))
