I tried MLflow on Databricks

Operating environment

If you want to install additional external libraries
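A minimal sketch, assuming you are working in a Databricks notebook: %pip installs packages into the notebook-scoped Python environment (the package names below are only examples; both ship with the ML runtime). Libraries can also be attached to the cluster from its Libraries tab.

%pip install mlflow lightgbm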

Evaluate the model with MLflow Tracking

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
import lightgbm as lgb

import mlflow
import mlflow.lightgbm

def train(learning_rate, colsample_bytree, subsample):

  # Prepare the iris dataset and split it into train/test sets
  iris = datasets.load_iris()
  X = iris.data
  y = iris.target
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # Convert the training data to LightGBM Dataset format
  train_set = lgb.Dataset(X_train, label=y_train)
  
  # Enable automatic logging of LightGBM parameters, metrics, and the trained model
  mlflow.lightgbm.autolog()
  
  with mlflow.start_run():

      # Train the model
      params = {
          "objective": "multiclass",
          "num_class": 3,
          "learning_rate": learning_rate,
          "metric": "multi_logloss",
          "colsample_bytree": colsample_bytree,
          "subsample": subsample,
          "seed": 42,
      }
      model = lgb.train(
          params, train_set, num_boost_round=10, valid_sets=[train_set], valid_names=["train"]
      )

      # Evaluate the model on the held-out test set
      y_proba = model.predict(X_test)
      y_pred = y_proba.argmax(axis=1)
      loss = log_loss(y_test, y_proba)
      acc = accuracy_score(y_test, y_pred)

      # Log the evaluation metrics to the active MLflow run
      mlflow.log_metrics({"log_loss": loss, "accuracy": acc})
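
# Run the training function with three different hyperparameter combinations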
train(0.1, 1.0, 1.0)
train(0.2, 0.8, 0.9)
train(0.4, 0.7, 0.8)

Register the model in the Model Registry
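The model can be registered from the run's page in the UI, or programmatically. A minimal sketch using the MLflow API, assuming the run ID is a placeholder you copy from the MLflow UI or mlflow.search_runs(); the name iris_model matches the serving URL used later:

import mlflow

# Placeholder: the ID of the run whose model you want to register
run_id = "<run_id>"

# mlflow.lightgbm.autolog() logs the trained model under the "model" artifact path
model_uri = f"runs:/{run_id}/model"

# Register the model (creates the registered model "iris_model" or adds a new version)
mlflow.register_model(model_uri, "iris_model")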

Launch an inference API using Model Serving

Change the stage of the model
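A minimal sketch using the MLflow client API, assuming the version to promote is 1 (check the Model Registry for the actual number); the same transition can also be done from the model version's page in the UI:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote the version to Production so that it is served at the
# /model/iris_model/Production/invocations endpoint used below
client.transition_model_version_stage(
    name="iris_model",
    version=1,  # assumption: the version you want to promote
    stage="Production",
)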

Enable Model Serving

Use API from client side

export DATABRICKS_TOKEN={token}

cat <<EOF > ./data.json
 [
   {
     "sepal length(cm)": 4.6,
     "sepal width(cm)": 3.6,
     "petal length(cm)": 1,
     "petal width(cm)": 0.2
   }
 ]
EOF

curl \
  -u token:$DATABRICKS_TOKEN \
  -H "Content-Type: application/json; format=pandas-records" \
  [email protected] \
  https://dbc-xxxxxxxxxxxxx.cloud.databricks.com/model/iris_model/Production/invocations
The response is the predicted probability for each of the three iris classes:

[[0.9877602676352799, 0.006085719008512947, 0.006154013356207185]]
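The same endpoint can also be called from Python. A minimal sketch using requests, where the workspace URL is the placeholder from the curl example and the token is read from the DATABRICKS_TOKEN environment variable:

import os
import requests

# Placeholder workspace URL; replace with your own Databricks host
url = "https://dbc-xxxxxxxxxxxxx.cloud.databricks.com/model/iris_model/Production/invocations"

headers = {"Content-Type": "application/json; format=pandas-records"}
data = [
    {
        "sepal length(cm)": 4.6,
        "sepal width(cm)": 3.6,
        "petal length(cm)": 1,
        "petal width(cm)": 0.2,
    }
]

# Authenticate with a personal access token, as in the curl -u token:... example
response = requests.post(
    url,
    headers=headers,
    json=data,
    auth=("token", os.environ["DATABRICKS_TOKEN"]),
)
print(response.json())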

Finally
