MLOps is the idea of building a foundation for operating systems that include machine learning, so that the models do not go stale and the system does not turn into garbage. Reference article: MLOps2020, start small and grow big
MLflow is a tool designed to help with this. I had the opportunity to use one of its features, MLflow Tracking, so I wrote this article while experimenting with it. There are already plenty of articles on how to use MLflow, so please treat this one as a seed of ideas for keeping a log of model building: "how about recording model creation like this?". I used MLflow version 1.8.0. The following articles are easy-to-understand introductions to MLflow: "Consider how to use mlflow to streamline the data analysis cycle" and "MLflow 1.0.0 released! Start your machine learning life cycle!"
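Before diving in, here is the whole pattern in miniature: define an experiment, open a run, and log parameters, metrics, and artifact files inside it. This is just an orientation sketch with placeholder names and values, not code from the analysis below.

# Minimal MLflow Tracking pattern (placeholder names and values)
import mlflow

mlflow.set_experiment('my_experiment')      # create or select an experiment
with mlflow.start_run():                    # open a run; closed automatically on exit
    mlflow.log_param('k', 5)                # record a hyperparameter
    mlflow.log_metric('accuracy', 0.85)     # record an evaluation metric
    mlflow.log_artifact('some_figure.png')  # attach an existing file (figure, table, ...)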
I use the data from Kaggle's Telco Customer Churn: https://www.kaggle.com/blastchar/telco-customer-churn This is data about a telephone company's customers, posed as a binary classification problem whose objective variable is whether or not the customer cancels. Each row represents one customer, and each column holds one of the customer's attributes.
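As a quick sanity check on the dataset, a small hedged snippet: the shape and class balance in the comments are what this Kaggle dataset is commonly reported to contain, assuming the CSV has been downloaded to the working directory.

# Quick look at the dataset shape and class balance (illustrative)
import pandas as pd

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
print(df.shape)                    # roughly (7043, 21)
print(df['Churn'].value_counts())  # roughly 5174 'No' vs. 1869 'Yes'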
First, I create functions to visualize aggregation results and to build models. Since they are not the main subject, their explanation is omitted.
Packages used
# package
import numpy as np
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import xgboost
import xgboost.sklearn as xgb
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_recall_curve
import time
import os
import glob
from tqdm import tqdm
import copy
import mlflow
from mlflow.sklearn import log_model
from mlflow.sklearn import load_model
Defined functions
# Histogram creation
def plot_many_hist(df_qiita, ex_col, ob_col, clip=[0, 99.], default_bin=10, png='tmp.png', visual=True):
    fig = plt.figure(figsize=(15, 10))
    for i in range(len(ex_col)):
        df_qiita_clip = df_qiita.copy()
        col = ex_col[i]
        # Clipping (np.percentile returns the lower and upper percentiles in this order)
        lowerbound, upperbound = np.percentile(df_qiita[col].values, clip)
        col_clip = np.clip(df_qiita[col].values, lowerbound, upperbound)
        df_qiita_clip['col_clip'] = col_clip
        # Adjust the number of bins
        if len(df_qiita_clip['col_clip'].unique()) < 10:
            bins = len(df_qiita_clip['col_clip'].unique())
        else:
            bins = default_bin
        # Histogram plot
        ax = plt.subplot(3, 3, i+1)
        for u in range(len(df_qiita_clip[ob_col].unique())):
            ax.hist(df_qiita_clip[df_qiita_clip[ob_col]==u]['col_clip'], bins=bins, label=u, alpha=0.7)
        ax.set_title(col)
        ax.legend(loc='upper right')
        ax.grid(True)
    plt.tight_layout()
    fig.suptitle("hist", fontsize=15)
    plt.subplots_adjust(top=0.92)
    plt.savefig(png)
    if visual == True:
        print('Cluster Hist')
        plt.show()
    else:
        plt.close()
# Standardization
def sc_trans(X):
    ss = StandardScaler()
    X_sc = ss.fit_transform(X)
    return X_sc

# k-means modeling
def km_cluster(X, k):
    km = KMeans(n_clusters=k, init="k-means++", random_state=0)
    y_km = km.fit_predict(X)
    return y_km, km
# Pie chart creation
def pct_abs(pct, raw_data):
    absolute = int(np.sum(raw_data)*(pct/100.))
    return '{:d}\n({:.0f}%)'.format(absolute, pct) if pct > 5 else ''

def plot_chart(y_km, png='tmp.png', visual=True):
    km_label = pd.DataFrame(y_km).rename(columns={0: 'cluster'})
    km_label['val'] = 1
    km_label = km_label.groupby('cluster')[['val']].count().reset_index()
    fig = plt.figure(figsize=(5, 5))
    ax = plt.subplot(1, 1, 1)
    ax.pie(km_label['val'], labels=km_label['cluster'], autopct=lambda p: pct_abs(p, km_label['val']))
    ax.axis('equal')
    ax.set_title('Cluster Chart (ALL UU:{})'.format(km_label['val'].sum()), fontsize=14)
    plt.savefig(png)
    if visual == True:
        print('Cluster Structure')
        plt.show()
    else:
        plt.close()
# Table creation
def plot_table(df_qiita, cluster_name, png='tmp.png', visual=True):
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.axis('off')
    ax.axis('tight')
    tab = ax.table(cellText=np.round(df_qiita.groupby(cluster_name).mean().reset_index().values, 2),
                   colLabels=df_qiita.groupby(cluster_name).mean().reset_index().columns,
                   loc='center',
                   bbox=[0, 0, 1, 1])
    tab.auto_set_font_size(False)
    tab.set_fontsize(12)
    tab.scale(5, 5)
    plt.savefig(png)
    if visual == True:
        print('Cluster Stats Mean')
        plt.show()
    else:
        plt.close()
# XGBoost model creation
def xgb_model(X_train, y_train, X_test):
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    y_pred_proba_both = model.predict_proba(X_test)
    return model, y_pred, y_pred_proba, y_pred_proba_both
# Training data and test data creation
def createXy(df, exp_col, ob_col, test_size=0.3, random_state=0, stratify=True):
    dfx = df[exp_col].copy()
    dfy = df[ob_col].copy()
    print('exp_col:', dfx.columns.values)
    print('ob_col:', ob_col)
    if stratify == True:
        X_train, X_test, y_train, y_test = train_test_split(dfx, dfy, test_size=test_size, random_state=random_state, stratify=dfy)
    else:
        X_train, X_test, y_train, y_test = train_test_split(dfx, dfy, test_size=test_size, random_state=random_state)
    print('Original Size is {}'.format(dfx.shape))
    print('TrainX Size is {}'.format(X_train.shape))
    print('TestX Size is {}'.format(X_test.shape))
    print('TrainY Size is {}'.format(y_train.shape))
    print('TestY Size is {}'.format(y_test.shape))
    return X_train, y_train, X_test, y_test
# Return the classification evaluation metrics and save the ROC curve
def eval_list(y_test, y_pred, y_pred_proba, y_pred_proba_both):
    # Metrics
    log_loss_ = log_loss(y_test, y_pred_proba_both)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # FPR, TPR, thresholds
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    # AUC
    auc_ = auc(fpr, tpr)
    # ROC curve
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %.2f)' % auc_)
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.legend()
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.savefig('ROC_curve.png')
    plt.close()
    return log_loss_, accuracy, precision, recall, auc_
# Precision-recall vs. threshold curve
def threshold_pre_rec(test, prediction, save_name='threshold_pre_rec.png'):
    precision, recall, thresholds = precision_recall_curve(test, prediction)
    user_cnt = [prediction[prediction >= i].shape[0] for i in thresholds]
    fig = plt.figure(figsize=(10, 6))
    ax1 = plt.subplot(1, 1, 1)
    ax2 = ax1.twinx()
    ax1.plot(thresholds, precision[:-1], color=sns.color_palette()[0], marker='+', label="precision")
    ax1.plot(thresholds, recall[:-1], color=sns.color_palette()[2], marker='+', label="recall")
    ax2.plot(thresholds, user_cnt, linestyle='dashed', color=sns.color_palette()[6], label="user_cnt")
    handler1, label1 = ax1.get_legend_handles_labels()
    handler2, label2 = ax2.get_legend_handles_labels()
    ax1.legend(handler1 + handler2, label1 + label2, loc='lower left')
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 1.05)
    ax1.set_xlabel('threshold')
    ax1.set_ylabel('%')
    ax2.set_ylabel('user_cnt')
    ax2.grid(False)
    plt.savefig(save_name)
    plt.close()
# Calibration curve: predicted probability vs. observed rate
def calib_curve(y_tests, y_pred_probas, save_name='calib_curve.png'):
    y_pred_proba_all = y_pred_probas.copy()
    y_tests_all = y_tests.copy()
    proba_check = pd.DataFrame(y_tests_all.values, columns=['real'])
    proba_check['pred'] = y_pred_proba_all
    s_cut, bins = pd.cut(proba_check['pred'], list(np.linspace(0, 1, 11)), right=False, retbins=True)
    labels = bins[:-1]
    s_cut = pd.cut(proba_check['pred'], list(np.linspace(0, 1, 11)), right=False, labels=labels)
    proba_check['period'] = s_cut.values
    proba_check = pd.merge(
        proba_check.groupby(['period'])[['real']].mean().reset_index().rename(columns={'real': 'real_ratio'}),
        proba_check.groupby(['period'])[['real']].count().reset_index().rename(columns={'real': 'UU'}),
        on=['period'], how='left')
    proba_check['period'] = proba_check['period'].astype(str)
    proba_check['period'] = proba_check['period'].astype(float)
    fig = plt.figure(figsize=(10, 6))
    ax1 = plt.subplot(1, 1, 1)
    ax2 = ax1.twinx()
    ax2.bar(proba_check['period'].values, proba_check['UU'].values, color='gray', label="user_cnt", width=0.05, alpha=0.5)
    ax1.plot(proba_check['period'].values, proba_check['real_ratio'].values, color=sns.color_palette()[0], marker='+', label="real_ratio")
    ax1.plot(proba_check['period'].values, proba_check['period'].values, color=sns.color_palette()[2], label="ideal_line")
    handler1, label1 = ax1.get_legend_handles_labels()
    handler2, label2 = ax2.get_legend_handles_labels()
    ax1.legend(handler1 + handler2, label1 + label2, loc='center right')
    ax1.set_xlim(-0.05, 1.05)
    ax1.set_ylim(-0.05, 1.05)
    ax1.set_xlabel('period')
    ax1.set_ylabel('real_ratio %')
    ax2.set_ylabel('user_cnt')
    ax2.grid(False)
    plt.savefig(save_name)
    plt.close()
# Confusion matrix output
def print_cmx(y_true, y_pred, save_name='tmp.png'):
    labels = sorted(list(set(y_true)))
    cmx_data = confusion_matrix(y_true, y_pred, labels=labels)
    df_cmx = pd.DataFrame(cmx_data, index=labels, columns=labels)
    plt.figure(figsize=(10, 6))
    sns.heatmap(df_cmx, annot=True, fmt='d', cmap='coolwarm', annot_kws={'fontsize': 20}, alpha=0.8)
    plt.xlabel('pred', fontsize=18)
    plt.ylabel('real', fontsize=18)
    plt.savefig(save_name)
    plt.close()
Since preprocessing is not the main subject, rows with missing values are simply dropped.
# Data read
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn = df.copy()
# Replace the single-space strings in TotalCharges with NaN
churn.loc[churn['TotalCharges'] == ' ', 'TotalCharges'] = np.nan
# Cast to float
churn['TotalCharges'] = churn['TotalCharges'].astype(float)
# Drop the rows that contain missing values
churn = churn.dropna()
print(churn.info())
display(churn.head())
After clustering with k-means or the like, a human interprets the result by looking at the characteristics of each cluster. However, visualizing and tabulating those characteristics on every trial-and-error iteration is tedious, and it is easy to forget what a previous result looked like. MLflow can be used to solve this problem. Here, clustering is performed on the three continuous explanatory variables 'tenure', 'MonthlyCharges', and 'TotalCharges', and the result is recorded in MLflow.
#### Clustering
exp_col = ['tenure', 'MonthlyCharges', 'TotalCharges']
df_km = churn.copy()[exp_col]
df_cluster = df_km.copy()
cluster_name = 'My_Cluster'
k = 5
ob_col = cluster_name

# Record the clustering result in MLflow
mlflow.set_experiment('My Clustering')  # Define the experiment name
with mlflow.start_run():  # Start an MLflow run
    # Standardization
    X = sc_trans(df_cluster)
    # k-means modeling
    y_km, km = km_cluster(X, k)
    # Record params in MLflow
    mlflow.log_param("method_name", km.__class__.__name__)
    mlflow.log_param("k", k)
    mlflow.log_param("features", df_cluster.columns.values)
    # Save the model in MLflow
    log_model(km, "model")
    df_cluster[cluster_name] = y_km

    # Visualize the clustering results
    # Cluster composition ratio
    plot_chart(y_km, png='Cluster_Chart.png', visual=False)  # Save the figure in the current directory
    mlflow.log_artifact('Cluster_Chart.png')  # Record the figure in MLflow
    os.remove('Cluster_Chart.png')  # Delete the local figure after recording
    # Mean values per cluster
    plot_table(df_cluster, ob_col, png='Cluster_Stats_Mean.png', visual=False)
    mlflow.log_artifact('Cluster_Stats_Mean.png')
    os.remove('Cluster_Stats_Mean.png')
    # Histogram per cluster
    plot_many_hist(df_cluster, exp_col, ob_col, clip=[0, 99.], default_bin=20, png='Cluster_Hist.png', visual=False)
    mlflow.log_artifact('Cluster_Hist.png')
    os.remove('Cluster_Hist.png')
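Since log_model saved the k-means model inside the run, it can be loaded back later and reused. A minimal sketch, assuming you look up the run ID in the MLflow UI or via mlflow.search_runs (the run ID below is a placeholder):

# Sketch: reload the k-means model logged above and reuse it
run_id = '<run_id copied from the UI>'  # placeholder
km_restored = load_model('runs:/{}/model'.format(run_id))  # "model" matches log_model(km, "model")
new_labels = km_restored.predict(sc_trans(df_km))          # apply the same standardization as training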
Executing the clustering code above, which records the algorithm name, the explanatory variable names, the hyperparameter values, and the charts that visualize the characteristics of each cluster, creates a folder called "mlruns" in the current directory. All recorded results are saved in this folder. If you open a terminal in the directory containing the "mlruns" folder and run "mlflow ui", a tracking server starts on localhost port 5000. Accessing it in a browser lets you browse the recorded models through MLflow's rich UI.
"mlruns" folder Write "mlflow ui" in the directory containing the folder "mlruns" mlflow ui top screen
The clustering results are recorded under the experiment named My Clustering. To see what was recorded, open the run link labeled with the recording date and time.
(Screenshots: the run's Parameters and Metrics; its Artifacts.)
You can confirm that the algorithm name, the explanatory variable names, and the hyperparameter values are recorded under Parameters, and that the charts and tables are recorded under Artifacts. Recording runs this way makes it easy to compare against previous models whenever the explanatory variables or the value of k change.
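The UI is not the only way to compare runs; the tracking data can also be pulled into a DataFrame. A sketch, assuming mlflow.search_runs and mlflow.get_experiment_by_name are available (they should be in 1.x releases around the 1.8.0 used here):

# Sketch: compare logged runs programmatically instead of in the UI
exp = mlflow.get_experiment_by_name('My Clustering')
runs_df = mlflow.search_runs(experiment_ids=[exp.experiment_id])
# params and metrics appear as "params.*" / "metrics.*" columns
print(runs_df[['run_id', 'params.k', 'params.method_name']])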
The same kind of record can be kept when creating a classification model.
#### Build a predictive model
exp_col = ['tenure', 'MonthlyCharges', 'TotalCharges']
ob_col = 'Churn'
df_pred = churn.copy()
df_pred.loc[df_pred[ob_col] == 'Yes', ob_col] = 1
df_pred.loc[df_pred[ob_col] == 'No', ob_col] = 0
df_pred[ob_col] = df_pred[ob_col].astype(int)
df_pred[cluster_name] = y_km

X_tests, y_tests, y_preds, y_pred_probas, y_pred_proba_boths = [], [], [], [], []
for cluster_num in np.sort(df_pred[cluster_name].unique()):
    # Extract the data of one cluster
    df_n = df_pred[df_pred[cluster_name] == cluster_num].copy()
    # Training data and test data creation
    X_train, y_train, X_test, y_test = createXy(df_n, exp_col, ob_col, test_size=0.3, random_state=0, stratify=True)
    # Modeling
    model, y_pred, y_pred_proba, y_pred_proba_both = xgb_model(X_train, y_train, X_test)
    # Evaluation metric calculation
    log_loss_, accuracy, precision, recall, auc_ = eval_list(y_test, y_pred, y_pred_proba, y_pred_proba_both)
    # Collect the per-cluster results
    X_tests.append(X_test)
    y_tests.append(y_test)
    y_preds.append(y_pred)
    y_pred_probas.append(y_pred_proba)
    y_pred_proba_boths.append(y_pred_proba_both)
    # Confusion matrix
    print_cmx(y_test.values, y_pred, save_name='confusion_matrix.png')
    # Precision-recall curve
    threshold_pre_rec(y_test, y_pred_proba, save_name='threshold_pre_rec.png')
    # Calibration curve
    calib_curve(y_test, y_pred_proba, save_name='calib_curve.png')
    # Record the prediction result in MLflow
    mlflow.set_experiment('xgb_predict_cluster' + str(cluster_num))  # Define the experiment name
    with mlflow.start_run():  # Start an MLflow run
        mlflow.log_param("01_method_name", model.__class__.__name__)
        mlflow.log_param("02_features", exp_col)
        mlflow.log_param("03_objective_col", ob_col)
        mlflow.log_params(model.get_xgb_params())
        mlflow.log_metrics({"01_accuracy": accuracy})
        mlflow.log_metrics({"02_precision": precision})
        mlflow.log_metrics({"03_recall": recall})
        mlflow.log_metrics({"04_log_loss": log_loss_})
        mlflow.log_metrics({"05_auc": auc_})
        mlflow.log_artifact('ROC_curve.png')
        os.remove('ROC_curve.png')
        mlflow.log_artifact('confusion_matrix.png')
        os.remove('confusion_matrix.png')
        mlflow.log_artifact('threshold_pre_rec.png')
        os.remove('threshold_pre_rec.png')
        mlflow.log_artifact('calib_curve.png')
        os.remove('calib_curve.png')
        log_model(model, "model")
# Concatenate the per-cluster data to evaluate all clusters together
y_pred_all = np.hstack(y_preds)
y_pred_proba_all = np.hstack(y_pred_probas)
y_pred_proba_both_all = np.concatenate(y_pred_proba_boths)
y_tests_all = pd.concat(y_tests)
# Evaluation metric calculation
log_loss_, accuracy, precision, recall, auc_ = eval_list(y_tests_all.values, y_pred_all, y_pred_proba_all, y_pred_proba_both_all)
# Confusion matrix
print_cmx(y_tests_all.values, y_pred_all, save_name='confusion_matrix.png')
# Calibration curve
calib_curve(y_tests_all, y_pred_proba_all, save_name='calib_curve.png')

# Record the prediction result on all the data in MLflow
mlflow.set_experiment('xgb_predict_all')  # Define the experiment name
with mlflow.start_run():  # Start an MLflow run
    mlflow.log_param("01_method_name", model.__class__.__name__)
    mlflow.log_param("02_features", exp_col)
    mlflow.log_param("03_objective_col", ob_col)
    mlflow.log_params(model.get_xgb_params())
    mlflow.log_metrics({"01_accuracy": accuracy})
    mlflow.log_metrics({"02_precision": precision})
    mlflow.log_metrics({"03_recall": recall})
    mlflow.log_metrics({"04_log_loss": log_loss_})
    mlflow.log_metrics({"05_auc": auc_})
    mlflow.log_artifact('ROC_curve.png')
    os.remove('ROC_curve.png')
    mlflow.log_artifact('confusion_matrix.png')
    os.remove('confusion_matrix.png')
    mlflow.log_artifact('calib_curve.png')
    os.remove('calib_curve.png')
When the code above is executed, a model is created for each cluster and recorded in MLflow: the algorithm name, the explanatory variable names, the hyperparameter values, the loss function, classification accuracy by various metrics, the ROC curve, the calibration curve, the precision-recall curve, the confusion matrix, and so on.
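One small aside about the recording code above: mlflow.log_metrics takes a dict, so the five separate calls can be collapsed into a single one with identical results:

# Equivalent to the five separate log_metrics calls above
mlflow.log_metrics({
    '01_accuracy': accuracy,
    '02_precision': precision,
    '03_recall': recall,
    '04_log_loss': log_loss_,
    '05_auc': auc_,
})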
(Screenshots: the MLflow UI top screen; Parameters; Metrics; Artifacts.)
Recording runs this way makes it easy to compare against previous models when the explanatory variables are changed, the hyperparameters are tuned, or the algorithm is swapped out.
This article doubles as a personal memorandum: rather than explaining MLflow itself, it mainly showed one way of using it. I hope it gives those who are thinking about adopting MLflow an image of how it can be used. If you are interested, I recommend checking out the other articles and trying it yourself.
Contents of "mlruns" folder
Click 1
Click b3fa3eb983044a259e6cae4f149f32c8
Click artifacts
The figure is saved What can be confirmed with mlflow ui is stored in Local like this. It seems that it is possible to share model records with various people in cooperation with cloud services. Is there any problem with Local for personal use?
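If you do eventually want to share records rather than keep them local, MLflow can point its logging at a remote tracking server instead of the local "mlruns" folder. A hedged sketch: the URI below is a placeholder, and a real setup needs a server started separately with the mlflow server command.

# Sketch: log to a shared tracking server instead of the local ./mlruns
mlflow.set_tracking_uri('http://my-tracking-server:5000')  # placeholder URI
mlflow.set_experiment('My Clustering')                     # subsequent runs are recorded remotely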
That's all!