[PYTHON] I tried to perform a cluster analysis of customers using purchasing data

Overview

We performed a cluster analysis of customers based on purchasing data and then visualized the characteristics of each segment.

Introduction

When doing machine learning, data analysis is required in many situations. How should you look at the data you have, and how should you process it? I would like to expand my repertoire of such techniques, but there are still not many data analysis teaching materials that use Python. (I'm tired of the iris and Titanic datasets...) So I thought, "Even if a textbook isn't written for Python, couldn't I broaden my toolbox by buying material that teaches data analysis methods and reproducing it in Python in my own way?" The book I picked up this time is the following.

[**R Business Statistics Analysis [Visit Tech]**](https://www.amazon.co.jp/dp/4798149500)

[Purpose of this book] This book summarizes business statistical analysis methods that use R to find the "Big X" directly linked to your company's sales from the large amounts of data accumulated in business settings. It explains the fundamentals and the actual analysis methods as an easy-to-understand set.

Data used

I used the data file that accompanies the book above.

・ Purchase history information [buying.csv]

Actual code

1. Read data

#Library import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import japanize_matplotlib
%matplotlib inline
#Read / display files
buying = pd.read_csv("buying.csv",encoding='cp932')
buying.head()

Data1.jpg
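Incidentally, encoding='cp932' is the Windows code page for Japanese (a Shift_JIS variant), which is typical for CSV files exported on Japanese Windows. A quick sanity check of the loaded frame, not part of the book's code, might look like this:

# Hypothetical sanity checks on the loaded data
print(buying.shape)             # number of purchase records and columns
print(buying.dtypes)            # column types
print(buying['id'].nunique())   # number of distinct customers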

2. Perform cross tabulation

# Create joint purchase data by cross tabulation
buying_mat = pd.crosstab(buying['id'], buying['category'])
buying_mat.head()

Data2.jpg
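For reference, pd.crosstab here counts how many times each customer (row = id) bought each category (column = category). A minimal toy illustration of the resulting shape, using made-up data rather than buying.csv:

# Toy example (hypothetical data) showing the structure of the cross tabulation
toy = pd.DataFrame({'id': [1, 1, 2], 'category': ['baby', 'mens', 'mens']})
print(pd.crosstab(toy['id'], toy['category']))
# category  baby  mens
# id
# 1            1     1
# 2            0     1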

3. Hierarchical cluster analysis after making dummy variables

# Convert the joint purchase data into dummy variables
# "1" if the category was purchased at least once, "0" otherwise
buying_mat1 = buying_mat.copy()
for i in range(len(buying_mat1)):
    for j in range(len(buying_mat1.columns)):
        if buying_mat1.iloc[i, j] > 1:
            buying_mat1.iloc[i, j] = 1

buying_mat1.head()

Data3.jpg
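The double loop above works, but the same binarization can be done in one vectorized step with pandas. An equivalent sketch (not the book's code):

# Equivalent one-liner: True where the purchase count is 1 or more, cast to 0/1
buying_mat1 = (buying_mat > 0).astype(int)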

# Import the cluster analysis functions
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
#Implementation of hierarchical clustering
#Ward's method x Euclidean distance
linkage_result = linkage(buying_mat1, method='ward', metric='euclidean')
#Determine the threshold for clustering
threshold = 0.7 * np.max(linkage_result[:, 2])
#Visualization of hierarchical clustering
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result, labels=buying_mat1.index, color_threshold=threshold)
plt.axhline(7, linestyle='--', color='r')
plt.show()

graph1.png

#Get the value of the clustering result
clustered = fcluster(linkage_result, threshold, criterion='distance')
#Check the clustering result
print(clustered)

Data4.jpg
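fcluster cuts the dendrogram at the given distance threshold and returns one cluster label per customer. As a quick check, you could count the customers per label; this sketch is not from the article, but it should match the groupby in step 5:

# Count customers per cluster label
print(pd.Series(clustered).value_counts().sort_index())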

4. Hierarchical cluster analysis without dummy variables

#Implementation of hierarchical clustering
#Ward's method x Euclidean distance
linkage_result2 = linkage(buying_mat, method='ward', metric='euclidean')
#Determine the threshold for clustering
threshold2 = 0.7 * np.max(linkage_result2[:, 2])
#Visualization of hierarchical clustering
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result2, labels=buying_mat.index, color_threshold=threshold2)
plt.axhline(23, linestyle='--', color='r')
plt.show()

graph2.png
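The article only draws the dendrogram for the version without dummy variables, but the same fcluster call could label it as well, and a cross tabulation of the two label vectors would show how the segmentations overlap. A hedged sketch, not part of the original analysis:

# Label the non-dummy clustering and compare it with the dummy-variable result
clustered2 = fcluster(linkage_result2, threshold2, criterion='distance')
print(pd.crosstab(clustered, clustered2))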

5. Combine the results of hierarchical cluster analysis with the original data

#DataFrame conversion of the results of hierarchical cluster analysis
_class = pd.DataFrame({'class':clustered}, index= buying_mat1.index)
_class.head()

Data5.jpg

#Combine original data and analysis results
buying_mat2 = pd.concat([buying_mat1, _class] ,axis=1)
buying_mat2.head()

Data6.jpg

#Check the number of customers in each segment
buying_mat2.groupby('class').size()

Data7.jpg
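If you want the segment label attached to the raw purchase rows rather than to the cross-tabulated matrix, a merge on the customer id does the job. A sketch, assuming _class keeps the same id index created by the crosstab:

# Attach each customer's segment to the original purchase history
buying_labeled = buying.merge(_class, left_on='id', right_index=True)
buying_labeled.head()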

6. Understand the joint purchase tendencies of each segment

#Calculate the average value of all product categories for each segment
cluster_stats = np.round(buying_mat2.groupby('class', as_index=False).mean() ,2)
cluster_stats.head()

Data8.jpg

# Convert to long format for graph drawing
mat_melt = pd.melt(cluster_stats, id_vars='class', var_name='Category',value_name='Rate')
mat_melt.head()

Data9.jpg

#Graph the characteristics of the segment
fig = plt.figure(figsize =(20,8))
ax1 = fig.add_subplot(1, 5, 1)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 1], ax=ax1)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax2 = fig.add_subplot(1, 5, 2)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 2], ax=ax2)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax3 = fig.add_subplot(1, 5, 3)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 3], ax=ax3)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax4 = fig.add_subplot(1, 5, 4)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 4], ax=ax4)
plt.xticks(rotation=90)
plt.ylim(0, 1)           
    
ax5 = fig.add_subplot(1, 5, 5)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 5], ax=ax5)
plt.xticks(rotation=90)

graph4.png

# Graph the characteristics of the segments (drawing with a for loop)
groups = mat_melt.groupby('class')
fig = plt.figure(figsize =(20,8))

for name, group in groups:
    _ax = fig.add_subplot(1, 5, int(name))
    sns.barplot(x='Category', y='Rate', data=group , ax=_ax)
    plt.title('Class' + str(name))
    plt.xticks(rotation=90)
    plt.ylim(0, 1)

graph4.png
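Since mat_melt is already in long format, the same small multiples could also be drawn with a single seaborn catplot call instead of the explicit subplot loop. A sketch, assuming a seaborn version that provides catplot (0.9 or later):

# One FacetGrid with one bar chart per segment
g = sns.catplot(x='Category', y='Rate', col='class', data=mat_melt,
                kind='bar', col_wrap=5, height=4)
g.set_xticklabels(rotation=90)
g.set(ylim=(0, 1))
plt.show()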

Estimate the purchasing group from the characteristics of each segment.

・ Class 1: high purchase rate of miscellaneous goods ⇒ a group of people who like miscellaneous goods?
・ Class 2: the purchase rate of all items except men's items is evenly high ⇒ a family with children / female customers?
・ Class 3: high purchase rate for baby products, maternity, and men's items ⇒ a family with children / male customers?
・ Class 4: high purchase rate of women's items ⇒ female customers?
・ Class 5: the purchase rate of women's items is high, but the purchasing tendency differs from Class 4 ⇒ female customers (with a different purchasing tendency from Class 4)?
