MLflow Tracking is good even for personal use

Introduction

MLOps is the idea of building a foundation for properly operating systems that incorporate machine learning, so that the models do not go stale and the system does not turn into garbage. Reference article: MLOps2020, start small and grow big

MLflow is a tool designed to help with this. I had the opportunity to use one of its components, MLflow Tracking, and after trying it out while researching various things I found it useful, so I am writing up my notes here. There are already many articles on how to use MLflow itself, so I would be happy if this one serves as a seed for ideas on how to keep a log of model building, along the lines of "How about keeping a record of model creation like this?" I used MLflow version 1.8.0. The following articles explain MLflow clearly: "Consider how to use mlflow to streamline the data analysis cycle" and "MLflow 1.0.0 released! Start your machine learning life cycle!"

Data used

Use data from Kaggle's Telco Customer Churn. https://www.kaggle.com/blastchar/telco-customer-churn This is customer data from a telephone company, posed as a binary classification problem whose target variable is whether or not the customer cancels (churns). Each row represents a customer, and each column contains one of the customer's attributes.

Definition of packages and functions to use

Define functions to visualize aggregation results and functions to build models. Since they are not the main subject, the explanation is omitted.

Packages used

# package
import numpy as np
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import xgboost
import xgboost.sklearn as xgb
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_recall_curve
import time
import os
import glob
from tqdm import tqdm
import copy
import mlflow
from mlflow.sklearn import log_model
from mlflow.sklearn import load_model

Defined functions

#Histogram creation
def plot_many_hist(df_qiita,ex_col,ob_col,clip=[0, 99.],defalt_bin=10,png='tmp.png', visual = True):
    fig=plt.figure(figsize=(15,10))
    for i in range(len(ex_col)):
        df_qiita_clip=df_qiita.copy()
        col=ex_col[i]
        #Clip values to the percentile range given by clip (lower, upper)
        lowerbound, upperbound = np.percentile(df_qiita[col].values, clip)
        col_clip = np.clip(df_qiita[col].values, lowerbound, upperbound)
        df_qiita_clip['col_clip']=col_clip
        #Adjusting the number of bins
        if len(df_qiita_clip['col_clip'].unique())<10:
            bins=len(df_qiita_clip['col_clip'].unique())
        else:
            bins=defalt_bin
        #Histogram plot
        ax=plt.subplot(3,3,i+1)
        for u in range(len(df_qiita_clip[ob_col].unique())):
            ln1=ax.hist(df_qiita_clip[df_qiita_clip[ob_col]==u]['col_clip'], bins=bins,label=u, alpha=0.7)
            ax.set_title(col)
        h1, l1 = ax.get_legend_handles_labels()
        ax.legend(loc='upper right')
        ax.grid(True)
    plt.tight_layout()
    fig.suptitle("hist", fontsize=15)
    plt.subplots_adjust(top=0.92)
    plt.savefig(png)
    if visual == True:
        print('Cluster Hist')
        plt.show()
    else:
        plt.close()

#Standardization
def sc_trans(X):
    ss = StandardScaler()
    X_sc = ss.fit_transform(X)
    return X_sc

#kmeans modeling
def km_cluster(X, k):
    km=KMeans(n_clusters=k,\
              init="k-means++",\
              random_state=0)
    y_km=km.fit_predict(X)
    return y_km,km

#Pie chart creation
def pct_abs(pct, raw_data):
    absolute = int(np.sum(raw_data)*(pct/100.))
    return '{:d}\n({:.0f}%)'.format(absolute, pct) if pct > 5 else ''

def plot_chart(y_km, png='tmp.png', visual = True):
    km_label=pd.DataFrame(y_km).rename(columns={0:'cluster'})
    km_label['val']=1
    km_label=km_label.groupby('cluster')[['val']].count().reset_index()
    fig=plt.figure(figsize=(5,5))
    ax=plt.subplot(1,1,1)
    ax.pie(km_label['val'],labels=km_label['cluster'], autopct=lambda p: pct_abs(p, km_label['val']))#, autopct="%1.1f%%")
    ax.axis('equal')
    ax.set_title('Cluster Chart (ALL UU:{})'.format(km_label['val'].sum()),fontsize=14)
    plt.savefig(png)
    if visual == True:
        print('Cluster Structure')
        plt.show()
    else:
        plt.close()

#Table creation
def plot_table(df_qiita, cluster_name, png='tmp.png', visual = True):
    fig, ax = plt.subplots(figsize=(10,10))
    ax.axis('off')
    ax.axis('tight')
    tab=ax.table(cellText=np.round(df_qiita.groupby(cluster_name).mean().reset_index().values, 2),\
                 colLabels=df_qiita.groupby(cluster_name).mean().reset_index().columns,\
                 loc='center',\
                 bbox=[0,0,1,1])
    tab.auto_set_font_size(False)
    tab.set_fontsize(12)
    tab.scale(5,5)
    plt.savefig(png)
    if visual == True:
        print('Cluster Stats Mean')
        plt.show()
    else:
        plt.close()

#XGB model creation
def xgb_model(X_train, y_train, X_test):
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)
    y_pred=model.predict(X_test)
    y_pred_proba=model.predict_proba(X_test)[:, 1]
    y_pred_proba_both=model.predict_proba(X_test)
    return model, y_pred, y_pred_proba, y_pred_proba_both

#Training data and test data creation
def createXy(df, exp_col, ob_col, test_size=0.3, random_state=0, stratify=True):
    dfx=df[exp_col].copy()
    dfy=df[ob_col].copy()
    print('exp_col:',dfx.columns.values)
    print('ob_col:',ob_col)

    if stratify == True:
        X_train, X_test, y_train, y_test = train_test_split(dfx, dfy, test_size=test_size, random_state=random_state, stratify=dfy)
    else:
        X_train, X_test, y_train, y_test = train_test_split(dfx, dfy, test_size=test_size, random_state=random_state)
    print('Original Size is {}'.format(dfx.shape))
    print('TrainX Size is {}'.format(X_train.shape))
    print('TestX Size is {}'.format(X_test.shape))
    print('TrainY Size is {}'.format(y_train.shape))
    print('TestY Size is {}'.format(y_test.shape))
    return X_train, y_train, X_test, y_test

#Compute the classification metrics and save the ROC curve
def eval_list(y_test, y_pred, y_pred_proba, y_pred_proba_both):
    # eval
    log_loss_=log_loss(y_test, y_pred_proba_both)
    accuracy=accuracy_score(y_test, y_pred)
    precision=precision_score(y_test, y_pred)
    recall=recall_score(y_test, y_pred)
    # FPR, TPR, thresholds
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    # AUC
    auc_ = auc(fpr, tpr)
    # roc_curve
    fig, ax = plt.subplots(figsize=(10,10))
    ax.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %.2f)'%auc_)
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.legend()
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.savefig('ROC_curve.png')
    plt.close()
    return log_loss_, accuracy, precision, recall, auc_

#Plot precision and recall against the threshold
def threshold_pre_rec(test, prediction, save_name='threshold_pre_rec.png'):
    precision, recall, threshold = precision_recall_curve(test, prediction)
    thresholds = threshold
    user_cnt=[prediction[prediction>=i].shape[0] for i in thresholds]
    fig=plt.figure(figsize=(10,6))
    ax1 = plt.subplot(1,1,1)
    ax2=ax1.twinx()
    ax1.plot(thresholds, precision[:-1], color=sns.color_palette()[0],marker='+', label="precision")
    ax1.plot(thresholds, recall[:-1], color=sns.color_palette()[2],marker='+', label="recall")
    ax2.plot(thresholds, user_cnt, linestyle='dashed', color=sns.color_palette()[6], label="user_cnt")
    handler1, label1 = ax1.get_legend_handles_labels()
    handler2, label2 = ax2.get_legend_handles_labels()
    ax1.legend(handler1 + handler2, label1 + label2, loc='lower left')
    ax1.set_xlim(-0.05,1.05)
    ax1.set_ylim(-0.05,1.05)
    ax1.set_xlabel('threshold')
    ax1.set_ylabel('%')
    ax2.set_ylabel('user_cnt')
    ax2.grid(False)
    plt.savefig(save_name)
    plt.close()

#Plot the calibration curve (predicted probability vs. observed rate)
def calib_curve(y_tests, y_pred_probas, save_name='calib_curve.png'):
    y_pred_proba_all=y_pred_probas.copy()
    y_tests_all=y_tests.copy()
    proba_check=pd.DataFrame(y_tests_all.values,columns=['real'])
    proba_check['pred']=y_pred_proba_all
    s_cut, bins = pd.cut(proba_check['pred'], list(np.linspace(0,1,11)), right=False, retbins=True)
    labels=bins[:-1]
    s_cut = pd.cut(proba_check['pred'], list(np.linspace(0,1,11)), right=False, labels=labels)
    proba_check['period']=s_cut.values
    proba_check = pd.merge(proba_check.groupby(['period'])[['real']].mean().reset_index().rename(columns={'real':'real_ratio'})\
                            , proba_check.groupby(['period'])[['real']].count().reset_index().rename(columns={'real':'UU'})\
                            , on=['period'], how='left')
    proba_check['period']=proba_check['period'].astype(str)
    proba_check['period']=proba_check['period'].astype(float)
    fig=plt.figure(figsize=(10,6))
    ax1 = plt.subplot(1,1,1)
    ax2=ax1.twinx()
    ax2.bar(proba_check['period'].values, proba_check['UU'].values, color='gray', label="user_cnt", width=0.05, alpha=0.5)
    ax1.plot(proba_check['period'].values, proba_check['real_ratio'].values, color=sns.color_palette()[0],marker='+', label="real_ratio")
    ax1.plot(proba_check['period'].values, proba_check['period'].values, color=sns.color_palette()[2], label="ideal_line")
    handler1, label1 = ax1.get_legend_handles_labels()
    handler2, label2 = ax2.get_legend_handles_labels()
    ax1.legend(handler1 + handler2, label1 + label2, loc='center right')
    ax1.set_xlim(-0.05,1.05)
    ax1.set_ylim(-0.05,1.05)
    ax1.set_xlabel('period')
    ax1.set_ylabel('real_ratio %')
    ax2.set_ylabel('user_cnt')
    ax2.grid(False)
    plt.savefig(save_name)
    plt.close()

#Plot the confusion matrix
def print_cmx(y_true, y_pred, save_name='tmp.png'):
    labels = sorted(list(set(y_true)))
    cmx_data = confusion_matrix(y_true, y_pred, labels=labels)

    df_cmx = pd.DataFrame(cmx_data, index=labels, columns=labels)

    plt.figure(figsize = (10,6))
    sns.heatmap(df_cmx, annot=True, fmt='d', cmap='coolwarm', annot_kws={'fontsize':20},alpha=0.8)
    plt.xlabel('pred', fontsize=18)
    plt.ylabel('real', fontsize=18)
    plt.savefig(save_name)
    plt.close()

Data reading

Since preprocessing is not the main subject, rows with missing values are simply dropped.

#Data read
df=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn=df.copy()
#Replace blank (single-space) values with NaN
churn.loc[churn['TotalCharges']==' ', 'TotalCharges']=np.nan
#Change to float
churn['TotalCharges']=churn['TotalCharges'].astype(float)
#Handling missing values properly is tedious, so just drop those rows
churn=churn.dropna()
print(churn.info())
display(churn.head())

Record clustering results with MLflow

After clustering with k-means or a similar method, a person interprets the result by looking at the characteristics of each cluster. However, it is tedious to build visualizations and tables on every round of trial and error, and it is easy to forget what the previous result looked like. MLflow can help with this. Here, clustering is performed using only the continuous explanatory variables 'tenure', 'MonthlyCharges', and 'TotalCharges', and the result is recorded in MLflow.

####Clustering
exp_col=['tenure','MonthlyCharges','TotalCharges']
df_km=churn.copy()[exp_col]
df_cluster=df_km.copy()
cluster_name = 'My_Cluster'
k=5
ob_col = cluster_name

#Record the clustering result in mlflow
mlflow.set_experiment('My Clustering')#Define the name of the experiment
with mlflow.start_run():#mlflow recording started
    #Standardization
    X=sc_trans(df_cluster)
    #kmeans modeling
    y_km, km=km_cluster(X, k)
    #Record params in mlflow
    mlflow.log_param("method_name",km.__class__.__name__)
    mlflow.log_param("k", k)
    mlflow.log_param("features", df_cluster.columns.values)
    #Save model in mlflow
    log_model(km, "model")
    
    df_cluster[cluster_name]=y_km
    
    #Visualize clustering results
    #Cluster composition ratio
    plot_chart(y_km, png='Cluster_Chart.png', visual = False)#Save the figure in the current directory
    mlflow.log_artifact('Cluster_Chart.png')#Log the saved figure to mlflow as an artifact
    os.remove('Cluster_Chart.png')#Delete the local copy after logging

    #Average value per cluster
    plot_table(df_cluster, ob_col, png='Cluster_Stats_Mean.png', visual = False)#Save the figure in the current directory
    mlflow.log_artifact('Cluster_Stats_Mean.png')#Log the saved figure to mlflow as an artifact
    os.remove('Cluster_Stats_Mean.png')#Delete the local copy after logging

    #Histogram by cluster
    plot_many_hist(df_cluster,exp_col,ob_col,clip=[0, 99.],defalt_bin=20, png='Cluster_Hist.png', visual = False)#Save the figure in the current directory
    mlflow.log_artifact('Cluster_Hist.png')#Log the saved figure to mlflow as an artifact
    os.remove('Cluster_Hist.png')#Delete the local copy after logging

The code above records the algorithm name, the explanatory variable names, the hyperparameter values, and the charts that visualize the characteristics of each cluster. When it runs, a folder called "mlruns" is created in the current directory, and all recorded results are saved in this folder. If you open a terminal in the directory containing the "mlruns" folder and run "mlflow ui", a local server starts on port 5000. If you open localhost:5000 in a browser, you can view the records of the models you created through the rich UI of MLflow.

Figure: the "mlruns" folder

Figure: running "mlflow ui" in the directory containing the "mlruns" folder

Figure: mlflow ui top screen

The clustering results are recorded in an experiment named My Clustering. To see what was recorded, click the link showing the date and time of the run.

Figure: Parameters and Metrics

Figures: Artifacts (three screenshots)

You can confirm that the algorithm name, the explanatory variable names, and the hyperparameter values that were set to be recorded appear under Parameters. Charts and tables are recorded under Artifacts. Recording in this way makes it easy to compare against previous results even after changing the explanatory variables or the value of k.
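
The clustering code above saves each figure to the current directory, logs it, and then deletes it by hand. As a minimal sketch of one alternative, the three steps could be wrapped in a small helper that writes into a temporary directory. The helper name log_via_tempdir and its arguments are my own assumptions for illustration and are not part of MLflow; only mlflow.log_artifact is an MLflow call.

```python
import os
import tempfile


def log_via_tempdir(write_fn, filename, log_fn=None):
    """Write a file into a temporary directory, log it, then clean up.

    write_fn: callable that takes a path and writes the file there
              (for example plt.savefig or fig.savefig).
    filename: name the artifact should have in the run.
    log_fn:   callable that takes a path; defaults to mlflow.log_artifact.
    (Hypothetical helper, not an MLflow API.)
    """
    if log_fn is None:
        # Imported lazily so the helper itself has no hard MLflow dependency
        import mlflow
        log_fn = mlflow.log_artifact
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        write_fn(path)   # create the file
        log_fn(path)     # log it while the file still exists
    # The temporary directory and the file are removed here
    return filename
```

Inside a with mlflow.start_run(): block, a call might look like log_via_tempdir(plt.savefig, 'Cluster_Chart.png') right after drawing the figure. The plotting functions in this article save the file themselves, so they would need a small change before they could be used this way.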

Record prediction results in MLflow

The same kind of record can be kept when building a classification model.

####Build a predictive model
exp_col=['tenure','MonthlyCharges','TotalCharges']
ob_col = 'Churn'
df_pred=churn.copy()
df_pred.loc[df_pred[ob_col]=='Yes', ob_col]=1
df_pred.loc[df_pred[ob_col]=='No', ob_col]=0
df_pred[ob_col]=df_pred[ob_col].astype(int)
df_pred[cluster_name]=y_km
X_tests, y_tests, y_preds, y_pred_probas, y_pred_proba_boths = [],[],[],[],[]

for cluster_num in np.sort(df_pred[cluster_name].unique()):
    #Extract data from one cluster
    df_n=df_pred[df_pred[cluster_name]==cluster_num].copy()
    
    #Training data and test data creation
    X_train, y_train, X_test, y_test=createXy(df_n, exp_col, ob_col, test_size=0.3, random_state=0, stratify=True)
    
    #Modeling
    model, y_pred, y_pred_proba, y_pred_proba_both = xgb_model(X_train, y_train, X_test)

    #Evaluation index calculation
    log_loss_, accuracy, precision, recall, auc_ = eval_list(y_test, y_pred, y_pred_proba, y_pred_proba_both)
    
    #Insert data into empty list
    X_tests.append(X_test)
    y_tests.append(y_test)
    y_preds.append(y_pred)
    y_pred_probas.append(y_pred_proba)
    y_pred_proba_boths.append(y_pred_proba_both)

    #Confusion matrix
    print_cmx(y_test.values, y_pred, save_name='confusion_matrix.png')
    
    # Recall-Precision curve
    threshold_pre_rec(y_test, y_pred_proba, save_name='threshold_pre_rec.png')
    
    #Calibration curve
    calib_curve(y_test,y_pred_proba, save_name='calib_curve.png')

    #Record the prediction result in mlflow
    mlflow.set_experiment('xgb_predict_cluster'+str(cluster_num))#Define the name of the experiment
    with mlflow.start_run():#mlflow recording started
        mlflow.log_param("01_method_name", model.__class__.__name__)
        mlflow.log_param("02_features", exp_col)
        mlflow.log_param("03_objective_col", ob_col)
        mlflow.log_params(model.get_xgb_params())
        mlflow.log_metrics({"01_accuracy": accuracy})
        mlflow.log_metrics({"02_precision": precision})
        mlflow.log_metrics({"03_recall": recall})
        mlflow.log_metrics({"04_log_loss": log_loss_})
        mlflow.log_metrics({"05_auc": auc_})
        mlflow.log_artifact('ROC_curve.png')
        os.remove('ROC_curve.png')
        mlflow.log_artifact('confusion_matrix.png')
        os.remove('confusion_matrix.png')
        mlflow.log_artifact('threshold_pre_rec.png')
        os.remove('threshold_pre_rec.png')
        mlflow.log_artifact('calib_curve.png')
        os.remove('calib_curve.png')
        log_model(model, "model")

#Concatenate the per-cluster results into a single set
y_pred_all=np.hstack((y_preds))
y_pred_proba_all=np.hstack((y_pred_probas))
y_pred_proba_both_all=np.concatenate(y_pred_proba_boths)
y_tests_all=pd.concat(y_tests)
#Evaluation index calculation
log_loss_, accuracy, precision, recall, auc_ = eval_list(y_tests_all.values, y_pred_all, y_pred_proba_all, y_pred_proba_both_all)
#Confusion matrix
print_cmx(y_tests_all.values, y_pred_all, save_name='confusion_matrix.png')
#Calibration curve
calib_curve(y_tests_all, y_pred_proba_all, save_name='calib_curve.png')

#Record the prediction result of all data in mlflow
mlflow.set_experiment('xgb_predict_all')#Define the name of the experiment
with mlflow.start_run():#mlflow recording started
    #Note: model here is the one fitted for the last cluster in the loop above
    mlflow.log_param("01_method_name", model.__class__.__name__)
    mlflow.log_param("02_features", exp_col)
    mlflow.log_param("03_objective_col", ob_col)
    mlflow.log_params(model.get_xgb_params())
    mlflow.log_metrics({"01_accuracy": accuracy})
    mlflow.log_metrics({"02_precision": precision})
    mlflow.log_metrics({"03_recall": recall})
    mlflow.log_metrics({"04_log_loss": log_loss_})
    mlflow.log_metrics({"05_auc": auc_})
    mlflow.log_artifact('ROC_curve.png')
    os.remove('ROC_curve.png')
    mlflow.log_artifact('confusion_matrix.png')
    os.remove('confusion_matrix.png')
    mlflow.log_artifact('calib_curve.png')
    os.remove('calib_curve.png')

When you run the code above, a model is built for each cluster and recorded in MLflow. You can record the algorithm name, the explanatory variable names, the hyperparameter values, the log loss, various classification metrics, the ROC curve, the calibration curve, the precision and recall curves, the confusion matrix, and so on.

Figure: mlflow top screen

Figure: Parameters

Figure: Metrics

Figure: Artifacts

Recording in this way makes it easier to compare against previous results when the explanatory variables are changed, the hyperparameters are tuned, or the algorithm is changed.
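
The param and metric names in the code above carry prefixes such as 01_ and 02_, which keeps them in a fixed order when the names are sorted. A minimal sketch of generating those prefixes automatically, so that all metrics can be passed to mlflow.log_metrics as one dictionary, might look like this. The helper name numbered is hypothetical and not part of MLflow, and it relies on dictionaries keeping insertion order (Python 3.7 or later).

```python
def numbered(values, start=1):
    """Prefix each key with a zero-padded index, in insertion order.

    values: dict of name -> value, e.g. {"accuracy": 0.8}
    Returns a new dict such as {"01_accuracy": 0.8}.
    (Hypothetical helper, not an MLflow API.)
    """
    return {
        "{:02d}_{}".format(i, name): value
        for i, (name, value) in enumerate(values.items(), start=start)
    }


# Example with made-up numbers, only to show the resulting key names
example = numbered({"accuracy": 0.8, "precision": 0.6, "recall": 0.5})
print(example)
```

Inside a run, the five separate calls could then become a single call such as mlflow.log_metrics(numbered({"accuracy": accuracy, "precision": precision, "recall": recall, "log_loss": log_loss_, "auc": auc_})).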

In conclusion

This article doubles as a memorandum for myself. I did not explain MLflow itself at all and mainly showed one way of using it. I hope it gives people who are thinking about using MLflow a concrete picture of how it might be used. If you are interested, I recommend reading other articles and trying it out.

Bonus

Contents of "mlruns" folder

Figure: Click 1

Figure: Click b3fa3eb983044a259e6cae4f149f32c8

Figure: Click artifacts

Figure: The figure is saved

Everything that can be viewed in mlflow ui is stored locally like this. It seems that model records can also be shared with other people by integrating with cloud services. For personal use, local storage should be no problem.
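
Because the records are plain files, they can also be read without the UI. The following is a minimal sketch that collects the logged params of every run. It assumes the local layout mlruns/<experiment id>/<run id>/params/<key>, with one small text file per param, which matches the folders shown above but is an internal detail that may differ between MLflow versions. The function name read_run_params is hypothetical.

```python
import os


def read_run_params(mlruns_dir="mlruns"):
    """Collect logged params per run from a local "mlruns" folder.

    Assumes the layout <mlruns_dir>/<experiment_id>/<run_id>/params/<key>,
    where each file holds one param value as plain text.
    Returns {(experiment_id, run_id): {key: value}}.
    """
    results = {}
    for exp_id in sorted(os.listdir(mlruns_dir)):
        exp_path = os.path.join(mlruns_dir, exp_id)
        # Skip plain files and hidden entries such as ".trash"
        if not os.path.isdir(exp_path) or exp_id.startswith("."):
            continue
        for run_id in sorted(os.listdir(exp_path)):
            params_path = os.path.join(exp_path, run_id, "params")
            if not os.path.isdir(params_path):
                continue  # e.g. meta.yaml sits next to the run folders
            params = {}
            for key in sorted(os.listdir(params_path)):
                with open(os.path.join(params_path, key), encoding="utf-8") as f:
                    params[key] = f.read().strip()
            results[(exp_id, run_id)] = params
    return results
```

Calling read_run_params() in the directory that contains "mlruns" returns a dictionary that can be printed or turned into a pandas DataFrame for a quick comparison of runs.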

That's all!
