I want to improve the accuracy of the prediction model

But where should I start? What causes the mispredictions in the first place? Could a decision tree be used to visualize the cause? This article is an attempt at that (there may be some theoretical mistakes). I want to show the whole flow of the analysis, so I also cover data processing. The order is: reading the data, preprocessing, checking the data, building the model and checking its accuracy, and investigating the factors behind mispredictions. It may get long. The main subject is the section "Visualization of unpredictable customer characteristics in decision trees".

Data used

I use Kaggle's Telco Customer Churn data. https://www.kaggle.com/blastchar/telco-customer-churn It describes the customers of a telephone company and poses a binary classification problem whose objective variable is whether the customer cancels. Each row represents a customer, and each column holds one customer attribute. The columns are as follows. (The descriptions are a Google translation of the Kaggle column descriptions.)

customerID: Customer ID
gender: Whether the customer is male or female
SeniorCitizen: Whether the customer is elderly (1, 0)
Partner: Whether the customer has a partner (yes, no)
Dependents: Whether the customer has dependents (yes, no)
tenure: The number of months the customer has stayed at the company
PhoneService: Whether the customer has phone service (yes, no)
MultipleLines: Whether the customer has multiple lines (yes, no, no phone service)
InternetService: Customer's Internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security (yes, no, no internet service)
OnlineBackup: Whether the customer has online backup (yes, no, no internet service)
DeviceProtection: Whether the customer has device protection (yes, no, no internet service)
TechSupport: Whether the customer has tech support (yes, no, no internet service)
StreamingTV: Whether the customer has streaming TV (yes, no, no internet service)
StreamingMovies: Whether the customer has streaming movies (yes, no, no internet service)
Contract: Customer's contract term (monthly, 1 year, 2 years)
PaperlessBilling: Whether the customer uses paperless billing (yes, no)
PaymentMethod: Customer's payment method (electronic check, mailed check, bank transfer (automatic), credit card (automatic))
MonthlyCharges: Amount charged to the customer each month
TotalCharges: Total amount charged to the customer
Churn: Whether the customer canceled (yes or no)

Data overview

Let's take a quick look at the contents of the data.

Basic information

#Package import
import pandas as pd
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn import preprocessing
from sklearn import tree
from io import StringIO  #sklearn.externals.six is no longer available in recent scikit-learn releases
import pydotplus
from IPython.display import Image
from dtreeviz.trees import dtreeviz
import xgboost.sklearn as xgb
plt.style.use('seaborn-darkgrid')  #On matplotlib 3.6 or later this style is named 'seaborn-v0_8-darkgrid'
%matplotlib inline

#Data read
df=pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
churn=df.copy()
print(churn.info())
display(churn.head())

Missing values

Converting blanks to missing values

The TotalCharges column holds numbers, but it is read as a string (object) column because some entries are a single blank space. So convert those blanks to NaN and then convert the column to a numeric type.

#Replace single-space entries with NaN
churn.loc[churn['TotalCharges']==' ', 'TotalCharges']=np.nan
#Convert to float
churn['TotalCharges']=churn['TotalCharges'].astype(float)
print(churn.info())
display(churn.head())

The TotalCharges column is now float.
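
The same conversion can also be written in one step with `pd.to_numeric`. This is only a minimal sketch on a toy column (the values below are made up for illustration, not taken from the Kaggle file); `errors='coerce'` turns every entry that cannot be parsed as a number, including the single-space entries, into NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TotalCharges column (hypothetical values).
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '1889.5', ' ']})

# errors='coerce' replaces unparseable entries with NaN and returns floats.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

print(toy['TotalCharges'].dtype)
print(toy['TotalCharges'].isna().sum())
```

Compared with the two-step version above, this also catches other non-numeric debris, which may or may not be what you want.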

Imputing the missing rows

Next, fill in the NaN values of TotalCharges. Rather than simply filling them with the overall mean of TotalCharges, it seems more useful for prediction to pick the categorical variable whose categories differ most in TotalCharges and fill each NaN with the mean of its own category. So the policy is to fill NaN with the per-category mean of TotalCharges. First, create a data frame with the NaN rows removed, and from it a data frame containing only the categorical variables and TotalCharges.

#Create a data frame with the NaN rows removed
churn2=churn.dropna().copy()
churn2['TotalCharges']=churn2['TotalCharges'].astype(float)

#Creating a data frame of categorical variable + Total Charges
#Extract only data whose dtype is object type
churn2_TotalCharges=churn2.select_dtypes(include=['object']).drop('customerID',axis=1)
#Added Total Charges
churn2_TotalCharges['TotalCharges']=churn2['TotalCharges']
print(churn2_TotalCharges.info())

Removing the NaN rows reduced the number of rows from 7,043 to 7,032. Next, let's visualize which categorical variables appear to be related to TotalCharges.

#Define a list of categorical variable names without TotalCharges
obj_columns=churn2_TotalCharges.columns.values
obj_columns=obj_columns[~(obj_columns == 'TotalCharges')]

#Draw a TotalCharges histogram for each categorical variable
fig=plt.figure(figsize=(12,10))
#Side length of the square subplot grid
grid=round(len(obj_columns)/np.sqrt(len(obj_columns)))
for i in range(len(obj_columns)):
    col=obj_columns[i]
    youso=churn2_TotalCharges[col].unique()  #Categories of this variable
    ax=plt.subplot(grid, grid, i+1)
    for j in range(len(youso)):
        #sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement
        sns.distplot(churn2_TotalCharges[churn2_TotalCharges[col]==youso[j]]['TotalCharges'], bins=30, ax=ax, kde=False, label=youso[j])
    ax.legend(loc='upper right')
    ax.set_ylabel('Freq')
    ax.set_title(col)
plt.tight_layout()
fig.suptitle("TotalCharges hist", fontsize=15)
plt.subplots_adjust(top=0.92)
plt.show()

For 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', and 'StreamingMovies', the distribution of TotalCharges looks quite different from category to category. I could pick the variable for the per-category mean from these by eye, but if possible I would like to judge quantitatively.

So, although it is a detour from the main subject, I will use a decision tree for this judgment as well. CART creates the split on the explanatory variable that best separates the objective variable. For classification it uses Gini impurity (or entropy) and the resulting information gain; for a continuous target such as TotalCharges, a regression tree uses the reduction in squared error instead. For the theory, see ["First pattern recognition"](https://www.amazon.co.jp/dp/4627849710) or the following site. https://dev.classmethod.jp/articles/2017ad_20171211_dt-2/

In other words, if CART is fitted with TotalCharges as the objective variable, the explanatory variable that best separates TotalCharges should appear at the first split. Let's actually try it.

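#Copy to a new data frame
churn2_TotalCharges_trans=churn2_TotalCharges.copy()
#Label-encode every column
#Note: this loop also encodes TotalCharges, so the tree below is fitted on its rank-like integer codes rather than the raw amounts
for column in churn2_TotalCharges_trans.columns: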
    le = preprocessing.LabelEncoder()
    le.fit(churn2_TotalCharges_trans[column])
    churn2_TotalCharges_trans[column] = le.transform(churn2_TotalCharges_trans[column])

display(churn2_TotalCharges_trans)

#Build the decision tree model (depth 1, so only the first split is created)
clf = tree.DecisionTreeRegressor(max_depth=1)
#Explanatory variables: the categorical columns only; objective variable: TotalCharges
clffit = clf.fit(churn2_TotalCharges_trans.drop('TotalCharges',axis=1).values\
                 , churn2_TotalCharges_trans['TotalCharges'].values)

#Visualization of decision tree
dot_data = StringIO()
tree.export_graphviz(clffit, out_file=dot_data,\
                     feature_names=churn2_TotalCharges_trans.drop('TotalCharges',axis=1).columns.values,\
                     class_names=True,\
                     filled=True, rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
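
To see why a depth-1 tree works as a quick "which variable separates the target best" check, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the column names `strong`, `weak`, and `noise` are hypothetical, and the target is generated so that `strong` shifts it far more than the others. The root split can be read directly from the fitted tree without drawing it.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic explanatory variables (hypothetical names).
toy = pd.DataFrame({
    'strong': rng.integers(0, 2, n),  # shifts the target a lot
    'weak': rng.integers(0, 2, n),    # shifts the target a little
    'noise': rng.integers(0, 3, n),   # unrelated to the target
})
toy['target'] = 1000 * toy['strong'] + 50 * toy['weak'] + rng.normal(0, 100, n)

features = ['strong', 'weak', 'noise']
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(toy[features], toy['target'])

# tree_.feature[0] is the column index used by the root split.
root_feature = features[stump.tree_.feature[0]]
print(root_feature)
```

Reading `tree_.feature[0]` gives the same information as the top node of the drawn tree, which is handy when Graphviz is not installed.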

Since 'DeviceProtection' is the variable that best splits TotalCharges, fill NaN with the mean TotalCharges of each DeviceProtection category.

#Calculate the mean TotalCharges for each DeviceProtection category
churn2_TotalCharges_mean=churn2.groupby(['DeviceProtection'])[['TotalCharges']].mean().reset_index().rename(columns={'TotalCharges':'TotalCharges_mean'})
#Merge the per-category mean into the data frame that still contains NaN
churn=pd.merge(churn,churn2_TotalCharges_mean,on=['DeviceProtection'],how='left')
#Fill NaN with the per-category mean
churn.loc[(churn.isnull().any(axis=1)==True), 'TotalCharges']=churn['TotalCharges_mean']
#Drop the helper column of means
churn=churn.drop('TotalCharges_mean',axis=1)
#Confirm that no rows with NaN remain
display(churn[churn.isnull().any(axis=1)==True])

The NaN values have been filled in.
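
The merge-and-assign steps above can also be sketched more compactly with `groupby(...).transform('mean')`. This is an alternative illustration on a toy frame (the rows are made up, not the Kaggle data), not the code used for the results in this article.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing TotalCharges values (hypothetical rows).
toy = pd.DataFrame({
    'DeviceProtection': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    'TotalCharges': [4000.0, 5000.0, 1000.0, 2000.0, np.nan, np.nan],
})

# Per-row mean of the row's own category; NaN values are skipped by mean.
group_mean = toy.groupby('DeviceProtection')['TotalCharges'].transform('mean')

# Fill only the missing entries with their category mean.
toy['TotalCharges'] = toy['TotalCharges'].fillna(group_mean)
print(toy)
```

Because `transform` returns a series aligned with the original rows, no helper column or merge is needed.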

A bird's-eye view of the data

Now that the missing values are gone and the data is complete, let's take a bird's-eye view of the continuous variables and the categorical variables.

#Churn: pie chart of the percentage of Yes (canceled) and No
pie=churn.groupby(['Churn'])[['customerID']].count().reset_index()
fig=plt.figure(figsize=(8,8))
ax=plt.subplot(1,1,1)
#With autopct set, ax.pie returns (wedges, label texts, percentage texts)
patches, texts, autotexts = ax.pie(pie['customerID'], labels=pie['Churn'], counterclock=False, startangle=90, autopct="%1.1f%%")
#Enlarge both the labels and the percentages
for t in texts + autotexts:
    t.set_size(18)
ax.set_title('Churn ratio', fontsize=20)
plt.show()

#Create a data frame by extracting only continuous variables
churn_select=churn.select_dtypes(include=['number']).copy()
#Extract only categorical variables and create a data frame (use later)
churn_select_obj=churn.select_dtypes(include=['object']).copy()
churn_select_obj['SeniorCitizen']=churn_select['SeniorCitizen']
#Add objective variable
churn_select['Churn']=churn['Churn']
churn_select=churn_select.drop('SeniorCitizen',axis=1)
# label encoding
le = preprocessing.LabelEncoder()
le.fit(churn_select['Churn'])
#Churn: Yes=1, No=0
churn_select['Churn'] = le.transform(churn_select['Churn'])
#Continuous variable pairplot
sns.pairplot(data=churn_select, hue='Churn', diag_kind='hist',plot_kws={'marker': '+', 'alpha': 0.5},diag_kws={'bins': 20})
plt.show()

#List of categorical variable names
col=churn_select_obj.columns.values[1:]
col2=col[~(col == 'Churn')]
#Bar plot for each categorical variable: Churn on the horizontal axis, count on the vertical axis
fig=plt.figure(figsize=(10,10))
churn_num=churn.copy()
churn_num['num']=1
for i in range(len(col2)):
    ax=plt.subplot(round(len(col2)/np.sqrt(len(col2))),round(len(col2)/np.sqrt(len(col2))),i+1)
    sns.barplot(data=churn_num.groupby(['Churn',col2[i]])[['num']].count().reset_index(),x='Churn',y='num',hue=col2[i],ax=ax)
    ax.legend(loc='upper right')
    ax.set_ylabel('count')
    ax.set_title(col2[i]+' count')
plt.tight_layout()
plt.show()

My impressions: the data is slightly imbalanced; customers with larger tenure and TotalCharges tend not to cancel; customers on one-year or two-year contracts rarely cancel; and gender seems to be irrelevant.
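
Impressions like "customers on longer contracts rarely cancel" can be checked numerically with a row-normalized cross tabulation. A minimal sketch on made-up rows (the toy frame and its values are assumptions for illustration, not the Kaggle data):

```python
import pandas as pd

# Toy data: contract type and churn flag (hypothetical rows).
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 6 + ['One year'] * 4 + ['Two year'] * 4,
    'Churn': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No',
              'Yes', 'No', 'No', 'No',
              'No', 'No', 'No', 'No'],
})

# normalize='index' makes each row sum to 1, so the 'Yes' column is the churn rate.
rate = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')
print(rate['Yes'])
```

On the real data, the same call with `churn['Contract']` and `churn['Churn']` would give the cancellation rate per contract type.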

Model building

Now that I've seen the data, let's make a model.

#Create churn_encode, a label-encoded copy of the data frame
churn_encode=churn.copy()
columns=list(churn_encode.select_dtypes(include=['object']).columns.values)
for column in columns:
    le = preprocessing.LabelEncoder()
    le.fit(churn_encode[column])
    churn_encode[column] = le.transform(churn_encode[column])
churn_encode=churn_encode.drop('customerID',axis=1)

#Function to create the Train and Test data
def createXy(df, col, target, test_size=0.3, random_state=0):
    #Separation of explanatory variables and objective variables
    dfx=df[col]
    dfy=df[target]
    X_train, X_test, y_train, y_test = train_test_split(dfx, dfy, test_size=test_size, random_state=random_state)
    print('TrainX Size is {}'.format(X_train.shape))
    print('TestX Size is {}'.format(X_test.shape))
    print('TrainY Size is {}'.format(y_train.shape))
    print('TestY Size is {}'.format(y_test.shape))
    return X_train, y_train, X_test, y_test

To keep things simple this time, I skipped feature selection and hyperparameter tuning.

#Explanatory variable name
colx=churn_encode.columns.values[:-1]
#Objective variable name
coly='Churn'
X_train, y_train, X_test, y_test = createXy(churn_encode, colx, coly, test_size=0.3, random_state=0)

#Make a model with XGBoost
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
print('Train accuracy_score',xgb_model.score(X_train, y_train))
y_true=y_test.values
y_pred=xgb_model.predict(X_test)
#Checking the accuracy of Test data
print('Test accuracy_score',accuracy_score(y_true, y_pred))
print('Test precision_score',precision_score(y_true, y_pred))
print('Test recall_score',recall_score(y_true, y_pred))
print('Test f1_score',f1_score(y_true, y_pred))

Result:

Train accuracy_score 0.8267748478701825
Test accuracy_score 0.7946048272598202
Test precision_score 0.6307692307692307
Test recall_score 0.5189873417721519
Test f1_score 0.5694444444444443

#Also check the confusion matrix
result=pd.DataFrame({'y_true':y_true,'y_pred':y_pred})
result['dummy']=1
#Count the rows for each (y_true, y_pred) pair
display(result.pivot_table(values='dummy',index='y_true',columns='y_pred',aggfunc='count').fillna(0))

1: Canceled, 0: Not canceled
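
As a cross-check, the four cells of the confusion matrix can be back-calculated from the reported test scores and the 2,113 test rows. The counts below (TN=1392, FP=168, FN=266, TP=287) are a reconstruction under that assumption, not a copy of the displayed table. The sketch rebuilds label arrays from them and recomputes the scores with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Back-calculated counts (an assumption derived from the reported scores).
tn, fp, fn, tp = 1392, 168, 266, 287

# Rebuild label arrays that have exactly these counts.
y_true_demo = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
print('accuracy ', accuracy_score(y_true_demo, y_pred_demo))
print('precision', precision_score(y_true_demo, y_pred_demo))
print('recall   ', recall_score(y_true_demo, y_pred_demo))
print('f1       ', f1_score(y_true_demo, y_pred_demo))
```

With these counts, precision is 287 / (287 + 168) and recall is 287 / (287 + 266), so the model misses nearly half of the customers who actually cancel.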

That completes the accuracy check of the model. Next, let's look at the causes of the mispredictions.

Visualization of unpredictable customer characteristics in decision trees

Finally, the main subject. What are the characteristics of the customers the model got wrong? I want to see that with a decision tree. As mentioned above, the CART algorithm generates the split on the explanatory variable that best separates the objective variable, based on Gini impurity (or entropy) and information gain. So if a decision tree is fitted with the binary hit-or-miss outcome of the prediction as the objective variable, it should generate splits on the explanatory variables that best distinguish hits from misses. First, create a column indicating whether each prediction was correct.

#Make a copy of X_test and add new columns
churn_test=X_test.copy()
churn_test['true']=y_true
churn_test['pred']=y_pred
churn_test['RightOrWrong']=0
#Set RightOrWrong=1 where y_true and y_pred match
churn_test.loc[(churn_test['true']==churn_test['pred']),'RightOrWrong']=1
display(churn_test.groupby(['RightOrWrong'])[['pred']].count())
display(churn_test)

The model mispredicted 434 of the 2,113 test customers. Let's use a decision tree to visualize what tendencies these 434 customers have.

#Tree depth 3 (chosen arbitrarily)
clf = tree.DecisionTreeClassifier(max_depth=3)
#Explanatory variables: every column except 'true', 'pred', 'RightOrWrong'; objective variable: 'RightOrWrong'
clffit = clf.fit(churn_test.drop(['true','pred','RightOrWrong'],axis=1), churn_test['RightOrWrong'])

#Visualization with dtreeviz
viz = dtreeviz(\
    clffit,\
    churn_test.drop(['true','pred','RightOrWrong'],axis=1),\
    churn_test['RightOrWrong'],\
    target_name='RightOrWrong',\
    feature_names=churn_test.drop(['true','pred','RightOrWrong'],axis=1).columns.values,\
    class_names=[0,1], histtype='bar'\
) 
print('Data Count ',len(churn_test))
display(viz)
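
If dtreeviz is not available, the same idea can be sketched with scikit-learn alone and printed as text. The block below is an illustration on synthetic data, not the churn data: the columns `segment` and `signal` are hypothetical, `LogisticRegression` stands in for XGBoost, and the labels are generated so that one segment is easy to predict and the other is a coin flip.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 4000
X = pd.DataFrame({
    'segment': rng.integers(0, 2, n),  # 0 = predictable, 1 = noisy
    'signal': rng.normal(0, 1, n),
})
# In segment 0 the label follows `signal`; in segment 1 it is a coin flip.
clean = (X['signal'] > 0).astype(int)
coin = pd.Series(rng.integers(0, 2, n))
y = clean.where(X['segment'] == 0, coin)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# 1 = prediction was correct, 0 = prediction was wrong.
hit = (model.predict(X_te) == y_te.values).astype(int)

# Shallow tree whose target is hit/miss rather than the original label.
error_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_te, hit)
print(export_text(error_tree, feature_names=list(X_te.columns)))

root = X_te.columns[error_tree.tree_.feature[0]]
print('root split:', root)
```

Here the tree recovers `segment` as the variable that separates hits from misses, which is the role Contract plays in the churn data.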

Data Count 2113

Contract is at the top of the tree, so it appears to be the explanatory variable that best separates correct from incorrect predictions. Customers with Contract 0 = monthly tend to be mispredicted. Conversely, a considerable proportion of customers with 1 = one-year and 2 = two-year contracts are predicted correctly. Among the monthly customers, those with InternetService 2 = No tend to be predicted correctly, while those with 0 = DSL and 1 = Fiber optic tend to be mispredicted.

Look back at the bar plots shown earlier. Among customers on monthly contracts, both Churn = Yes and Churn = No are numerous, and the same holds for customers who use fiber optic for InternetService. (Customers with Internet service tend to cancel; perhaps they are dissatisfied because the line is slow?) Because the Yes/No tendency of Churn is hard to see for these customers, I expected their predictions to be error-prone, and visualizing with the decision tree confirmed that they do tend to be mispredicted.

TotalCharges also appears in the tree, but the prediction accuracy does not seem to change much once its value exceeds a certain level. For now, based on these results, I add a composite variable and clipping, build the model again, and check the accuracy.

fig=plt.figure(figsize=(10,10))
ax1=plt.subplot(2,2,1)
ax2=plt.subplot(2,2,2)
ax3=plt.subplot(2,2,3)
ax4=plt.subplot(2,2,4)

#Total Charges Histogram
ax1.hist(X_train['TotalCharges'].values,bins=20)
ax1.set_title('X_train TotalCharges')
ax2.hist(X_test['TotalCharges'].values,bins=20)
ax2.set_title('X_test TotalCharges')

#Clip TotalCharges of the training data to its 0th-90th percentile range
lowerbound, upperbound = np.percentile(X_train['TotalCharges'].values, [0, 90])
TotalCharges_train = np.clip(X_train['TotalCharges'].values, lowerbound, upperbound)
ax3.hist(TotalCharges_train,bins=20)#Total Charges Histogram
ax3.set_title('X_train TotalCharges clipping')

#Clip TotalCharges of the test data (the bounds here are computed from the test data itself)
lowerbound, upperbound = np.percentile(X_test['TotalCharges'].values, [0, 90])
TotalCharges_test = np.clip(X_test['TotalCharges'].values, lowerbound, upperbound)
ax4.hist(TotalCharges_test,bins=20)#Total Charges Histogram
ax4.set_title('X_test TotalCharges clipping')
plt.show()

#Make new Train and Test data with the added variables
new_X_train=X_train.copy()
new_X_test=X_test.copy()
#Replace TotalCharges with the clipped values
new_X_train['TotalCharges']=TotalCharges_train
new_X_test['TotalCharges']=TotalCharges_test
#Create a composite variable for Contract and Internet Service
new_X_train['Contract_InternetService']=new_X_train['Contract'].astype(str)+new_X_train['InternetService'].astype(str)
new_X_test['Contract_InternetService']=new_X_test['Contract'].astype(str)+new_X_test['InternetService'].astype(str)
#Label-encode the composite variable of Contract and InternetService
le = preprocessing.LabelEncoder()
le.fit(new_X_train['Contract_InternetService'])
new_X_train['Contract_InternetService'] = le.transform(new_X_train['Contract_InternetService'])
#Reuse the encoder fitted on the training data so that the codes mean the same thing in both sets
new_X_test['Contract_InternetService'] = le.transform(new_X_test['Contract_InternetService'])

#Build a model
xgb_model = xgb.XGBClassifier()
xgb_model.fit(new_X_train, y_train)
print('Train accuracy_score',xgb_model.score(new_X_train, y_train))
y_true=y_test.values
y_pred=xgb_model.predict(new_X_test)
#Accuracy verification
print('')
print('Test accuracy_score',accuracy_score(y_true, y_pred))
print('Test precision_score',precision_score(y_true, y_pred))
print('Test recall_score',recall_score(y_true, y_pred))
print('Test f1_score',f1_score(y_true, y_pred))

#Check the confusion matrix
result=pd.DataFrame({'y_true':y_true,'y_pred':y_pred})
result['dummy']=1
display(result.pivot_table(values='dummy',index='y_true',columns='y_pred',aggfunc='count').fillna(0))

Result:

Train accuracy_score 0.8253549695740365
Test accuracy_score 0.7998106956933271
Test precision_score 0.6406926406926406
Test recall_score 0.5352622061482821
Test f1_score 0.5832512315270936

First result (repost):

Train accuracy_score 0.8267748478701825
Test accuracy_score 0.7946048272598202
Test precision_score 0.6307692307692307
Test recall_score 0.5189873417721519
Test f1_score 0.5694444444444443

Confusion matrices (1: Canceled, 0: Not canceled) for this result and for the first result (repost).

Every metric on the test data improved.
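
One point worth keeping in mind with this kind of feature engineering is that clipping bounds and category codes are usually learned from the training data only and then applied unchanged to the test data. A minimal sketch of that pattern on toy frames follows; the values are made up, and the `_` separator and the `-1` code for unseen combinations are additions for illustration, not part of the code above.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (hypothetical values).
train = pd.DataFrame({'TotalCharges': [10.0, 20.0, 30.0, 40.0, 1000.0],
                      'Contract': [0, 0, 1, 2, 2],
                      'InternetService': [0, 1, 1, 2, 0]})
test = pd.DataFrame({'TotalCharges': [5.0, 5000.0],
                     'Contract': [2, 0],
                     'InternetService': [0, 1]})

# Learn the clipping bounds from the training data only.
lower, upper = np.percentile(train['TotalCharges'], [0, 90])
train['TotalCharges'] = train['TotalCharges'].clip(lower, upper)
test['TotalCharges'] = test['TotalCharges'].clip(lower, upper)

def combine(df):
    # Composite key such as '2_0' (Contract 2, InternetService 0).
    return df['Contract'].astype(str) + '_' + df['InternetService'].astype(str)

# Learn the code of each combination from the training data only.
mapping = {key: code for code, key in enumerate(sorted(combine(train).unique()))}
train['Contract_InternetService'] = combine(train).map(mapping)
# Combinations never seen in training get -1 instead of raising an error.
test['Contract_InternetService'] = combine(test).map(mapping).fillna(-1).astype(int)

print(lower, upper)
print(test)
```

Whether this changes the scores reported above is not something this sketch shows; it only illustrates the pattern.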

Summary

This article covered reading the data, preprocessing, checking the data, building the model and checking its accuracy, and investigating the factors behind mispredictions. The main subject was to try out one idea: can a decision tree be used to investigate what makes predictions go wrong? I apologize if the theory or interpretation is wrong.

I think this could be useful when reporting the current problems of a prediction model to a client, for example in a PoC. Alongside the learning curve, you can show the decision tree result and discuss it with the client: "Maybe there is not enough data," or "The predictions are getting stuck around here, so I think we should handle it like this." In such a conversation, a client with rich domain knowledge may well help with something like, "Come to think of it, we have this kind of data; could it be added as a variable?"

Decision trees are easy to read! Showing SHAP values is fine too, but they may be hard for clients to understand and hard to explain.

Bonus (SHAP)

Let's also use SHAP values to check what tendencies customers have when the prediction is wrong. (I haven't fully understood SHAP yet, so I may have misinterpreted something.) For how to do this, see Koo's blog post "Interact with machine learning models using SHAP".

import shap
#Make a model with XGBoost
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)

#Apparently this is the boilerplate needed to run JavaScript in the notebook
shap.initjs()

#Pass the model and the data you want to interpret.
X_test=X_test.reset_index(drop=True)#Reset the index
explainer = shap.TreeExplainer(model=xgb_model)
shap_values = explainer.shap_values(X=X_test)

# summary_plot
shap.summary_plot(shap_values,X_test)

Look at the summary_plot. Red indicates a high value of the variable, blue indicates a low value, and the horizontal axis is the SHAP value. Each dot is an individual sample, and the features are listed from the top in descending order of influence on the overall prediction, so Contract is the most influential. Contract 0 = monthly has a positive effect on the prediction (that is, it pushes toward cancellation), while 1 = one-year and 2 = two-year have a negative effect (that is, they push toward not canceling). I think the fact that a monthly contract pushes the prediction toward cancellation is probably why those predictions miss in the first place.

Next, for typical customers whose predictions were right and customers whose predictions were wrong, let's visualize which variables pulled the prediction toward its final value.

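#Get the indices of the mispredicted and the correctly predicted customers
make_index=X_test.copy()
#Predict with the model fitted in this section so that the labels match the SHAP values
make_index['y_pred']=xgb_model.predict(X_test)
make_index['y_true']=y_true
make_index['tf']=0
#tf=1 marks a misprediction
make_index.loc[make_index['y_true']!=make_index['y_pred'], 'tf']=1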
miss=make_index[make_index['tf']==1].index.values
nonmiss=make_index[make_index['tf']==0].index.values

#decision_plot for correctly predicted customers
shap.decision_plot(explainer.expected_value\
                   , shap_values[nonmiss[:50]], X_test.iloc[nonmiss[:50],:]\
                   ,link="logit",ignore_warnings = True, feature_order='hclust')

#decision_plot for mispredicted customers
shap.decision_plot(explainer.expected_value\
                   , shap_values[miss[:50]], X_test.iloc[miss[:50],:]\
                   ,link="logit",highlight=range(len(miss[:20])),ignore_warnings = True, feature_order='hclust')

The first plot shows correctly predicted customers and the second shows mispredicted customers. Red indicates cancellation (= 1) and blue indicates no cancellation (= 0). Looking at these, the final prediction is pulled mainly by Contract and tenure, and for the mispredicted customers it is Contract and tenure that tend to pull the prediction in the wrong direction. From these results, it may be worth thinking about how to rework Contract and tenure as features.

That's all!

[PYTHON] What are the factors behind the misprediction of ML? ～ Factor investigation is decided by decision tree ～