[PYTHON] I tried to perform a cluster analysis of customers using purchasing data

Overview

We performed a cluster analysis of customers based on purchasing data and then visualized the characteristics of each segment.

Introduction

When doing machine learning, data analysis is required in many situations. How should you look at the data you have, and how should you process it? I would like to expand my repertoire of such techniques, but there are still not many data analysis teaching materials that use Python. (I'm tired of the iris and Titanic datasets...) So I thought, "Even if a textbook isn't written for Python, couldn't I broaden my toolbox by buying material that teaches data analysis methods and reproducing it in Python in my own way?" The book I picked up this time is the following.

[**R Business Statistics Analysis [Visit Tech]**](https://www.amazon.co.jp/dp/4798149500)

[Purpose of this book] This book summarizes business statistical analysis methods that use R to find the "Big X" directly linked to your company's sales from the large amounts of data accumulated in business settings. It explains the fundamentals and the actual analysis methods as an easy-to-understand set.

Data used

I used the data file that accompanies the book above.

・ Purchase history information [buying.csv]

Actual code

1. Read data

#Library import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import japanize_matplotlib
%matplotlib inline
#Read / display files
buying = pd.read_csv("buying.csv",encoding='cp932')
buying.head()

Data1.jpg
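Incidentally, encoding='cp932' is the Windows code page for Japanese (a Shift_JIS variant), which is typical for CSV files exported on Japanese Windows. A quick sanity check of the loaded frame, not part of the book's code, might look like this:

# Hypothetical sanity checks on the loaded data
print(buying.shape)             # number of purchase records and columns
print(buying.dtypes)            # column types
print(buying['id'].nunique())   # number of distinct customers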

2. Perform cross tabulation

# Create joint purchase data by cross tabulation
buying_mat = pd.crosstab(buying['id'], buying['category'])
buying_mat.head()

Data2.jpg
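For reference, pd.crosstab here counts how many times each customer (row = id) bought each category (column = category). A minimal toy illustration of the resulting shape, using made-up data rather than buying.csv:

# Toy example (hypothetical data) showing the structure of the cross tabulation
toy = pd.DataFrame({'id': [1, 1, 2], 'category': ['baby', 'mens', 'mens']})
print(pd.crosstab(toy['id'], toy['category']))
# category  baby  mens
# id
# 1            1     1
# 2            0     1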

3. Hierarchical cluster analysis after making dummy variables

# Convert the joint purchase data into dummy variables
# "1" if the category was purchased at least once, "0" otherwise
buying_mat1 = buying_mat.copy()
for i in range(len(buying_mat1)):
    for j in range(len(buying_mat1.columns)):
        if buying_mat1.iloc[i, j] > 1:
            buying_mat1.iloc[i, j] = 1

buying_mat1.head()

Data3.jpg
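The double loop above works, but the same binarization can be done in one vectorized step with pandas. An equivalent sketch (not the book's code):

# Equivalent one-liner: True where the purchase count is 1 or more, cast to 0/1
buying_mat1 = (buying_mat > 0).astype(int)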

# Import the cluster analysis functions
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
#Implementation of hierarchical clustering
#Ward's method x Euclidean distance
linkage_result = linkage(buying_mat1, method='ward', metric='euclidean')
#Determine the threshold for clustering
threshold = 0.7 * np.max(linkage_result[:, 2])
#Visualization of hierarchical clustering
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result, labels=buying_mat1.index, color_threshold=threshold)
plt.axhline(7, linestyle='--', color='r')
plt.show()

graph1.png

#Get the value of the clustering result
clustered = fcluster(linkage_result, threshold, criterion='distance')
#Check the clustering result
print(clustered)

Data4.jpg
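fcluster cuts the dendrogram at the given distance threshold and returns one cluster label per customer. As a quick check, you could count the customers per label; this sketch is not from the article, but it should match the groupby in step 5:

# Count customers per cluster label
print(pd.Series(clustered).value_counts().sort_index())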

4. Hierarchical cluster analysis without dummy variables

#Implementation of hierarchical clustering
#Ward's method x Euclidean distance
linkage_result2 = linkage(buying_mat, method='ward', metric='euclidean')
#Determine the threshold for clustering
threshold2 = 0.7 * np.max(linkage_result2[:, 2])
#Visualization of hierarchical clustering
plt.figure(num=None, figsize=(16, 9), dpi=200, facecolor='w', edgecolor='k')
dendrogram(linkage_result2, labels=buying_mat.index, color_threshold=threshold2)
plt.axhline(23, linestyle='--', color='r')
plt.show()

graph2.png
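The article only draws the dendrogram for the version without dummy variables, but the same fcluster call could label it as well, and a cross tabulation of the two label vectors would show how the segmentations overlap. A hedged sketch, not part of the original analysis:

# Label the non-dummy clustering and compare it with the dummy-variable result
clustered2 = fcluster(linkage_result2, threshold2, criterion='distance')
print(pd.crosstab(clustered, clustered2))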

5. Combine the results of hierarchical cluster analysis with the original data

#DataFrame conversion of the results of hierarchical cluster analysis
_class = pd.DataFrame({'class':clustered}, index= buying_mat1.index)
_class.head()

Data5.jpg

#Combine original data and analysis results
buying_mat2 = pd.concat([buying_mat1, _class] ,axis=1)
buying_mat2.head()

Data6.jpg

#Check the number of customers in each segment
buying_mat2.groupby('class').size()

Data7.jpg
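If you want the segment label attached to the raw purchase rows rather than to the cross-tabulated matrix, a merge on the customer id does the job. A sketch, assuming _class keeps the same id index created by the crosstab:

# Attach each customer's segment to the original purchase history
buying_labeled = buying.merge(_class, left_on='id', right_index=True)
buying_labeled.head()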

6. Understand the joint purchase tendencies of each segment

#Calculate the average value of all product categories for each segment
cluster_stats = np.round(buying_mat2.groupby('class', as_index=False).mean() ,2)
cluster_stats.head()

Data8.jpg

# Convert to long format for graph drawing
mat_melt = pd.melt(cluster_stats, id_vars='class', var_name='Category',value_name='Rate')
mat_melt.head()

Data9.jpg

#Graph the characteristics of the segment
fig = plt.figure(figsize =(20,8))
ax1 = fig.add_subplot(1, 5, 1)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 1], ax=ax1)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax2 = fig.add_subplot(1, 5, 2)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 2], ax=ax2)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax3 = fig.add_subplot(1, 5, 3)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 3], ax=ax3)
plt.xticks(rotation=90)
plt.ylim(0, 1)

ax4 = fig.add_subplot(1, 5, 4)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 4], ax=ax4)
plt.xticks(rotation=90)
plt.ylim(0, 1)           
    
ax5 = fig.add_subplot(1, 5, 5)
sns.barplot(x='Category', y='Rate', data=mat_melt[mat_melt['class'] == 5], ax=ax5)
plt.xticks(rotation=90)

graph4.png

# Graph the characteristics of the segments (drawing with a for loop)
groups = mat_melt.groupby('class')
fig = plt.figure(figsize =(20,8))

for name, group in groups:
    _ax = fig.add_subplot(1, 5, int(name))
    sns.barplot(x='Category', y='Rate', data=group , ax=_ax)
    plt.title('Class' + str(name))
    plt.xticks(rotation=90)
    plt.ylim(0, 1)

graph4.png
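Since mat_melt is already in long format, the same small multiples could also be drawn with a single seaborn catplot call instead of the explicit subplot loop. A sketch, assuming a seaborn version that provides catplot (0.9 or later):

# One FacetGrid with one bar chart per segment
g = sns.catplot(x='Category', y='Rate', col='class', data=mat_melt,
                kind='bar', col_wrap=5, height=4)
g.set_xticklabels(rotation=90)
g.set(ylim=(0, 1))
plt.show()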

Estimate the purchasing group from the characteristics of each segment.

・ Class 1: high purchase rate of miscellaneous goods ⇒ a group of people who like miscellaneous goods?
・ Class 2: the purchase rate of all items except men's items is evenly high ⇒ a family with children / female customers?
・ Class 3: high purchase rate for baby products, maternity, and men's items ⇒ a family with children / male customers?
・ Class 4: high purchase rate of women's items ⇒ female customers?
・ Class 5: the purchase rate of women's items is high, but the purchasing tendency differs from Class 4 ⇒ female customers (with a different purchasing tendency from Class 4)?
