[PYTHON] Machine learning model management to avoid quarreling with the business side

Introduction

This is Miyano (@estie_mynfire), CTO of estie. On the first day I wrote a somewhat niche post (I tried Pandas' SQL upsert), but this time I will write about **estimating the appropriate rent for offices**, which is what we work on at estie.

Unlike housing, asking rents for offices are rarely published (in the city's five central wards, one-third or fewer of the properties publish theirs), and contracted rents are basically unavailable, so **it is difficult to collect ground-truth data**. Under these circumstances, we estimate rents while verifying the model's accuracy jointly with the business side, drawing on their expert eye for office real estate.

Because this involves a lot of interaction with the business side, I will write about what I pay particular attention to and what I have devised.

Challenges

As mentioned above, because we work together with business-side members, the following challenges arise more often than when developing with ML engineers alone.

Challenge 1. Managing past models

Driven by feedback from the business side and the expansion of the source data, we frequently change the logic, for example by removing outlier properties, adding features, or changing the training data, and we improve the model every day to make it more accurate. However, as mentioned above, the accuracy of these models cannot always be evaluated with numerical metrics alone (an expert eye is also required). So even when I believe I have built a better model than before, I often hear:

**"Huh, the model from two months ago gave a good value here, but now the value is strange!"**

~~Most of the time I notice this when I am in a hurry, so I get **on edge**.~~

In such a case, if you can immediately switch back to the state of the past model, you can investigate the cause right away and quickly reflect the fix in the production data.

Challenge 2. Explanation of model estimates

During joint verification, we are often asked to explain the cause, for example "Why is the estimate here this value?". If we can answer that the estimate is strongly influenced by a particular feature, the discussion becomes more meaningful.

Challenge 3. Visualization of output for accuracy verification

As mentioned above, we build several new models every day, export their output, and have the business side check the accuracy quickly. If the output is only in tabular form, intuitive analysis is difficult, and requests such as "show me only the values in this area" can take several round trips.

What we are doing

We take the following measures to address the challenges above.

Response 1. Model version control

Code management

Changes to the logic (including hyperparameters) and to the training data are tracked as issues on GitHub, and a branch is cut for each issue. If the next version to be released is v1.1.0 and the corresponding issue numbers are 4 and 6, the branch name is something like dev/v1.1.0/issue4_6. At release time, the branch is merged into the v1.1.0 branch, and the release is also tagged.

Version control of intermediate generated files

All files used for training and estimation are stored in S3. We prepare a bucket for machine learning in S3 and store intermediate files under the same directory structure as the branch name (dev/v1.1.0/issue4_6).
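The naming convention above can be sketched as a pair of small helpers. This is only an illustration under assumptions: the function names `branch_name` and `s3_uri` are hypothetical, and `hogehoge` is the placeholder bucket name used in the visualization script later in this article.

```python
from typing import Sequence


def branch_name(version: str, issues: Sequence[int]) -> str:
    """Build a branch name such as dev/v1.1.0/issue4_6 (hypothetical helper)."""
    if not issues:
        raise ValueError("at least one issue number is required")
    issue_part = "_".join(str(n) for n in sorted(issues))
    return f"dev/{version}/issue{issue_part}"


def s3_uri(bucket: str, branch: str, filename: str) -> str:
    """Return the S3 URI of an intermediate file stored under the branch name."""
    return f"s3://{bucket}/{branch}/{filename}"


branch = branch_name("v1.1.0", [4, 6])
print(branch)                              # dev/v1.1.0/issue4_6
print(s3_uri("hogehoge", branch, "ld.csv"))  # s3://hogehoge/dev/v1.1.0/issue4_6/ld.csv
```

Deriving the S3 prefix from the branch name in one place means that checking out an old branch also tells you exactly where its intermediate files live.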

Response 2. Explaining causes with SHAP

When asked "Why is the estimate here this value?", SHAP makes it possible to investigate the cause, so we add a shap_value column at estimation time. There are many articles about SHAP, so please refer to them.

Explanation of interpretation of machine learning model using Shap

Simply put, it tells us "how much each feature contributed to the estimated value".
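To make "contribution" concrete, here is a minimal brute-force sketch of the Shapley value idea that SHAP builds on. It is not how the shap library computes its values (shap.TreeExplainer uses a tree-specific algorithm and a background dataset); the toy model `toy_rent`, the single baseline record, and all numbers are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict

Features = Dict[str, float]


def shapley_values(predict: Callable[[Features], float],
                   x: Features, baseline: Features) -> Dict[str, float]:
    """Exact Shapley values of x relative to a single baseline record.

    Averages, over every ordering of the features, the change in the
    prediction when a feature is switched from its baseline value to x's.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)          # start from the baseline record
        previous = predict(current)
        for name in ordering:
            current[name] = x[name]       # reveal one feature of x
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_rent(p: Features) -> float:
    """Hypothetical rent model (yen per tsubo) with an interaction term."""
    bonus = 0.5 * p["area"] if p["age"] < 10 else 0.0
    return 15000 + 10 * p["area"] - 100 * p["age"] + bonus


baseline = {"area": 100, "age": 20}   # hypothetical reference property
target = {"area": 300, "age": 5}      # hypothetical property to explain
phi = shapley_values(toy_rent, target, baseline)
print(phi)
```

In this toy case the area contributes 2,050 and the age 1,600, and the two add up to 3,650, which is exactly the difference between the prediction for the target and the prediction for the baseline. That additivity is what makes the per-feature values easy to discuss with the business side. In practice the values are computed with the shap library rather than by enumerating orderings.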

import pandas as pd
import shap

def calc_shap(df_, feature_list, model, rank_th=5) -> pd.DataFrame:
    '''Add a shap_value column
    Args:
        df_ (pd.DataFrame): Data, with features already added, for which the estimated rent is calculated
        feature_list ([str]): List of feature names
        model: Trained model (tree-based, since shap.TreeExplainer is used)
        rank_th (int): How many of the features with the largest SHAP values to keep. default 5.
    Returns:
        pd.DataFrame: Copy of df_ with a 'shap_value' column added
    '''
    df = df_.copy()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(df[feature_list])
    # SHAP values of every feature in df[feature_list], one row per record
    shap_df = pd.DataFrame(shap_values, columns=feature_list)
    # For each record, rank features by absolute SHAP value (larger means higher contribution)
    shap_rank = shap_df.abs().rank(axis=1, ascending=False, method='min')
    # For each record, the columns whose contribution rank is within rank_th
    main_contri_col = {i: [col for col in r.keys() if r[col] <= rank_th] for i, r in shap_rank.iterrows()}
    # For each record, those columns together with their contributions
    main_contri_val = [shap_df.loc[i, main_contri_col[i]].to_dict() for i in main_contri_col.keys()]
    df['shap_value'] = main_contri_val
    return df

Each value in the shap_value column is a JSON-like dict such as {'Area': 22627, 'Age': 717, 'hoge1': -5409, 'hoge2': 2968, 'hoge3': 3791}. This is useful: for example, you can see that the area contributes an enormous amount, and on investigating you may discover that the area of the estimated record was off by an order of magnitude.
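One way to spot such records without reading every dict by eye is to rank the contributions and flag a feature that dominates. The following is a minimal sketch; the helper names and the 50% threshold are assumptions for illustration, and the input is the example dict shown above.

```python
from typing import Dict, List, Optional, Tuple


def rank_contributions(shap_value: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Sort features by absolute contribution.

    Returns (feature, contribution, share of the total absolute contribution).
    """
    total = sum(abs(v) for v in shap_value.values())
    ranked = sorted(shap_value.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, value, abs(value) / total if total else 0.0)
            for name, value in ranked]


def dominant_feature(shap_value: Dict[str, float],
                     share_th: float = 0.5) -> Optional[str]:
    """Return the top feature if it alone exceeds share_th (hypothetical threshold)."""
    ranked = rank_contributions(shap_value)
    if ranked and ranked[0][2] >= share_th:
        return ranked[0][0]
    return None


example = {'Area': 22627, 'Age': 717, 'hoge1': -5409, 'hoge2': 2968, 'hoge3': 3791}
for name, value, share in rank_contributions(example):
    print(f"{name}: {value:+,} ({share:.0%})")
print(dominant_feature(example))
```

For the example record, Area accounts for roughly 64% of the total absolute contribution, so it is flagged as the feature to check first.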

Response 3. Visualization on a map for accuracy verification

So that business-side members can check the accuracy quickly and on their own, we visualize not only the plain accuracy metrics and the differences from the previous logic but also the training data and the estimates for individual properties. The image below is an old sample: properties with high estimates are plotted in orange and properties with low estimates in light blue. (Normally the training data is also plotted as black circles.)

image.png

The visualization code is below.

'''Visualize estimated rents with folium
Required:
    pandas
    folium
    matplotlib
    s3fs (used by pandas to read s3:// paths)
'''

import subprocess
import pandas as pd
import folium
import matplotlib.colors as cl

def calc_RGB_value(norm_rent: float) -> str:
    '''Return a value normalized to the 0-1 scale as an RGB hex string
    The cheapest property is light blue, the most expensive is orange
    Args:
        norm_rent (float): Value normalized to the 0-1 scale
    Returns:
        str
        ex: #54b0c5
    '''
    R_val = 41 + (255 - 41) * norm_rent
    G_val = 182 + (150 - 182) * norm_rent
    B_val = 246 + (0 - 246) * norm_rent
    return cl.to_hex((R_val / 255, G_val / 255, B_val / 255, 1))

def add_color_col(df_: pd.DataFrame) -> pd.DataFrame:
    '''Add a color column
    Args:
        df_ (pd.DataFrame): Data frame containing an estimated_rent column
    Returns:
        pd.DataFrame: Copy of df_ with a 'color' column added
    '''
    df = df_.copy()
    norm = cl.Normalize(vmin=df['estimated_rent'].min(), vmax=df['estimated_rent'].max())
    norm_rent_ = [norm(v) for v in df['estimated_rent']]  # Normalize estimated rents to the 0-1 scale
    color_ = [calc_RGB_value(norm_rent) for norm_rent in norm_rent_]
    df["color"] = color_
    return df

class Drawer:
    def __init__(self, ld_path, ed_path):
        self.read_ld(ld_path)
        self.read_ed(ed_path)
        self.add_color_col()
        self.init_map()
    def read_ld(self, ld_path):
        '''Read the training data
        '''
        self.ld = pd.read_csv(ld_path)
        assert 'answer_rent' in self.ld.columns
    def read_ed(self, ed_path):
        '''Read the data after estimated rents have been calculated
        '''
        self.ed = pd.read_csv(ed_path)
        assert 'estimated_rent' in self.ed.columns  # check the estimate data, not the training data
    def add_color_col(self):
        self.ed = add_color_col(self.ed)
        self.ld['color'] = '#262626' # black
    def init_map(self):
        '''Initialize the map
        '''
        self.map = folium.Map(
            location=[self.ed.latitude.mean(), self.ed.longitude.mean()],
            zoom_start=6, tiles='cartodbpositron')
    def add_ld_plot(self, size=15):
        '''Plot the training data
        size (int): Radius of the plotted circle. default 15.
        '''
        for i, row in self.ld.iterrows():
            # Resolve the name first so the rent is always appended to the popup
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name +
                '<br/>Correct rent: {:,.0f} yen/tsubo'.format(row['answer_rent']),  # same column as the assert in read_ld
                color=row['color'], fill_color=row['color']).add_to(self.map)
    def add_ed_plot(self, size=5):
        '''Plot the estimated rents
        size (int): Radius of the plotted circle. default 5.
        '''
        for i, row in self.ed.iterrows():  # iterate over the estimates, not the training data
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name +
                '<br/>Estimated rent: {:,.0f} yen/tsubo'.format(row['estimated_rent']),
                color=row['color'], fill_color=row['color']).add_to(self.map)
if __name__ == '__main__':
    drawer = Drawer(
        ld_path='s3://hogehoge/dev/v1.1.0/issue4_6/ld.csv',
        ed_path='s3://hogehoge/dev/v1.1.0/issue4_6/ed.csv')
    drawer.add_ld_plot()
    drawer.add_ed_plot()
    drawer.map.save('map.html')
    subprocess.call(
        ['aws', 's3', 'mv', 'map.html', 's3://hogehoge/dev/v1.1.0/issue4_6/map.html'])
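Alongside the map, the difference from the previous logic can be summarized as a short list of the properties whose estimate moved the most between two versions. The sketch below assumes estimates are available as plain dicts keyed by a property ID; the function name, the IDs, and the rents are all hypothetical.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, float, float, float]  # (property id, previous, new, relative change)


def largest_changes(prev: Dict[str, float], new: Dict[str, float],
                    top_n: int = 3) -> List[Change]:
    """Return the properties whose estimate changed the most between versions.

    Only properties present in both versions with a non-zero previous
    estimate are compared.
    """
    changes = [
        (pid, prev[pid], new[pid], (new[pid] - prev[pid]) / prev[pid])
        for pid in prev.keys() & new.keys()
        if prev[pid]
    ]
    changes.sort(key=lambda c: abs(c[3]), reverse=True)
    return changes[:top_n]


# Hypothetical estimates (yen per tsubo) from two model versions
prev_estimates = {"A": 20000, "B": 15000, "C": 30000}
new_estimates = {"A": 20500, "B": 21000, "C": 29000, "D": 18000}

for pid, before, after, rel in largest_changes(prev_estimates, new_estimates):
    print(f"{pid}: {before:,} -> {after:,} ({rel:+.1%})")
```

Sorting by relative rather than absolute change is a design choice here; it keeps expensive properties from dominating the list, but either ordering would work.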

Conclusion

Future plans

The method is still primitive, but this is how we keep the data and code versions in sync. I am thinking of introducing MLflow to make this easier to manage, and I will write a follow-up once it is introduced.

About estie

At estie, we are always looking for engineers who are enthusiastic about new technologies, as well as full-stack engineers! https://www.wantedly.com/companies/company_6314859/projects

estie -> https://www.estie.jp
estiepro -> https://pro.estie.jp
Company site -> https://www.estie.co.jp
