[Python] Making inferences with a trained scikit-learn model in PySpark

Introduction

When doing machine learning with PySpark, Spark's ML library may not cover every algorithm you need, and you may want to use other libraries such as scikit-learn.

Training has to be done separately, because scikit-learn cannot operate on a Spark DataFrame directly, but inference can be run smoothly through a UDF. This post is a note on how to do that.

Method

If you have a trained model (model, a scikit-learn-style estimator), you can do the following. Here data is the Spark DataFrame holding the inference data, and features is the list of explanatory-variable column names.
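Before the UDF below can be used, the trained model has to exist on the driver. A minimal sketch of one such setup, assuming a toy dataset and a `LogisticRegression` classifier (both are illustrative choices, not from the original post; your `model`, `data`, and `features` will differ):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data on the driver; the columns and model choice here are
# illustrative assumptions, not part of the original post.
train = pd.DataFrame({
    'x1': [0.0, 1.0, 2.0, 3.0],
    'x2': [1.0, 0.0, 1.0, 0.0],
    'y':  [0, 0, 1, 1],
})
features = ['x1', 'x2']  # list of explanatory-variable column names
model = LogisticRegression().fit(train[features], train['y'])

# `data` would be the Spark DataFrame holding the same feature columns,
# e.g. data = spark.createDataFrame(train[features])
```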

The UDF returns the result of model.predict(X), so replace this with your own model's prediction function as appropriate. Likewise, if the return value is a continuous value, change the return type to DoubleType().

Inference using a trained model in PySpark


import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

@pandas_udf(returnType=IntegerType())
def predict_udf(*cols: pd.Series) -> pd.Series:
    # Each element of `cols` is a pandas Series for one feature column;
    # reassemble them into a single DataFrame for scikit-learn.
    X = pd.concat(cols, axis=1)
    return pd.Series(model.predict(X))

# `features` is a list of column names, so unpack it as positional arguments.
# withColumn returns a new DataFrame, so assign the result.
data = data.withColumn('predict', predict_udf(*features))
