Marketing analysis with Python ① Customer analysis (decile analysis, RFM analysis)

Introduction

I wanted to learn marketing analysis methods by working with real data in Python, so I wrote this article as a record of that learning. This time we cover two customer analysis methods: decile analysis and RFM analysis.

Usage data

We use the Online Retail Data Set (https://www.kaggle.com/vijayuv/onlineretail) published on Kaggle.

Execution environment

The code runs on Google Colaboratory.

Customer segment analysis overview

By segmenting customers according to certain characteristics, you can identify good customers, target advertising more efficiently, and so on. Segmentation can also use attributes such as gender and age, but here we deal with two analysis methods based on purchase history data.

What is decile analysis?

This is a method of segmenting customers by sorting them in order of purchase amount and dividing them into ten equal groups (deciles).

What is RFM analysis?

This is a method of segmenting customers according to the following three indicators.

- **Recency**: how many days ago the customer last purchased
- **Frequency**: how often the customer purchased during a certain period
- **Monetary**: total amount spent during a certain period

Implementation of decile analysis

Data reading, EDA, preprocessing

#Import base module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Read the data, specifying the data types (and parsing InvoiceDate as a datetime).

#Data read: read_csv does not accept datetime64 in dtype, so InvoiceDate is
#parsed via parse_dates. Quantity exceeds the int8 range in this dataset
#(values go well beyond ±127), so int32 is used instead.
dtypes = {
    'InvoiceNo': 'object',
    'StockCode': 'object',
    'Description': 'object',
    'Quantity': 'int32',
    'UnitPrice': 'float64',
    'CustomerID': 'object',
    'Country': 'object'
}
raw_data = pd.read_csv('./data/OnlineRetail.csv', dtype=dtypes,
                       parse_dates=['InvoiceDate'], engine='python')
#Overview confirmation
print(raw_data.shape)
raw_data.head(20)

>>>
(541909, 8)

(Screenshot: first 20 rows of raw_data)

To briefly explain each column:

- InvoiceNo: order number
- StockCode: item number
- Description: product description
- Quantity: purchased quantity
- InvoiceDate: purchase date and time
- UnitPrice: product unit price
- CustomerID: customer number
- Country: country

Decile analysis uses Quantity, UnitPrice, and CustomerID. Let's look at the data a little more closely.

#Check missing values, etc.
raw_data.info()

>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int32         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  object        
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int32(1), object(5)
memory usage: 31.0+ MB

The CustomerID column has missing values. Also,

#Check statistics of the numeric variables
raw_data.describe()

You can see that some Quantity and UnitPrice values are negative. There appear to be rules such as entering a negative value when a purchase is canceled, but this time we will use only rows where both are 0 or more. We also exclude rows where CustomerID is missing.

#cleansing
data = raw_data.query('Quantity >= 0 & UnitPrice >= 0').dropna(axis=0, subset=['CustomerID'])
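As a quick optional sanity check (my own addition, not in the original flow), we can confirm the filters behaved as intended:

#Verify that no negative values or missing CustomerIDs remain
assert (data['Quantity'] >= 0).all()
assert (data['UnitPrice'] >= 0).all()
assert data['CustomerID'].notna().all()
print(data.shape)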

Decile analysis

Create a total amount (quantity x unit price) column.

#Amount = quantity x unit price
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
#Find the total purchase amount for each customer
decil = data[['CustomerID', 'TotalPrice']].groupby('CustomerID').sum().reset_index()
decil.head()

Using these totals, we split customers into ten 10% bins from the top.

#Find the decile boundaries (quantiles in 10% steps)
parties = decil.quantile(q=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]).to_dict()
parties

>>>
{'TotalPrice': {0.1: 151.812,
  0.2: 240.62800000000001,
  0.3: 340.95799999999997,
  0.4: 474.1039999999999,
  0.5: 656.6899999999999,
  0.6: 905.4079999999998,
  0.7: 1309.4160000000002,
  0.8: 1954.0280000000002,
  0.9: 3488.1160000000036}}
#Define a function that assigns a decile score from the quantiles (1 = lowest spenders, 10 = highest)
def cal_decil(x, col, df):
    if x <= df[col][0.1]:
        return 1
    elif x <= df[col][0.2]:
        return 2
    elif x <= df[col][0.3]: 
        return 3
    elif x <= df[col][0.4]: 
        return 4
    elif x <= df[col][0.5]: 
        return 5
    elif x <= df[col][0.6]: 
        return 6
    elif x <= df[col][0.7]: 
        return 7
    elif x <= df[col][0.8]: 
        return 8
    elif x <= df[col][0.9]: 
        return 9
    else:
        return 10
#Calculate the decile score
decil['Decil_score'] = decil['TotalPrice'].apply(cal_decil, args=('TotalPrice', parties))
decil.head()
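As an aside, pandas' qcut can produce equivalent decile labels in one line. A minimal sketch (the Decil_score_qcut column name is my own; boundary handling differs slightly from the hand-rolled function, so customers sitting exactly on a quantile edge may land in the neighboring bin):

#Equivalent decile labels with pd.qcut (1 = lowest spenders, 10 = highest)
decil['Decil_score_qcut'] = pd.qcut(decil['TotalPrice'], q=10,
                                    labels=range(1, 11)).astype(int)
decil[['Decil_score', 'Decil_score_qcut']].head()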

We were able to divide customers into ranks 1-10. From here, let's look at how much each layer contributes to overall sales. You can use pandas' .cumsum() method to compute the cumulative sum, then divide by the grand total to get the cumulative share.

#Sort in descending order to find the cumulative sum / cumulative share
decil = decil.sort_values('TotalPrice', ascending=False)
decil['Cumsum'] = decil['TotalPrice'].cumsum()
decil['CumRatio'] = decil['Cumsum'] / decil['Cumsum'].max()
decil
#Cumulative share of sales reached by each decile
decil[['Decil_score', 'CumRatio']].groupby('Decil_score').max() \
    .reset_index().sort_values('Decil_score', ascending=False)

Looking at the results, the top 10% of customers by purchase amount account for about 55% of total sales, the top 20% for about 70%, and the top 30% for about 80%.
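To see the same contribution pattern at a glance, here is a small plotting sketch using the columns created above (my own addition):

#Share of total sales contributed by each decile
share = decil.groupby('Decil_score')['TotalPrice'].sum() / decil['TotalPrice'].sum()
share.sort_index(ascending=False).plot(kind='bar')
plt.xlabel('Decile score (10 = top spenders)')
plt.ylabel('Share of total sales')
plt.show()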

Implementation of RFM analysis

RFM analysis uses InvoiceDate, TotalPrice (the column created during the decile analysis), and CustomerID.

Calculation of RFM value

To calculate Recency, you need to fix the current time (the time at which the analysis is performed). Since the last timestamp in the purchase data is '2011-12-09 12:50:00', we set the current time (NOW) to midnight on December 10, 2011.

#Get the latest time in history
data['InvoiceDate'].max()

>>>
Timestamp('2011-12-09 12:50:00')
#Set the current time
import datetime as dt
NOW = dt.datetime(2011,12,10)

Group by CustomerID and use the agg function to compute Recency, Frequency, and Monetary as follows.

#RFM calculation
rfm = data.groupby("CustomerID") \
    .agg({"InvoiceDate": lambda date: (NOW - date.max()).days,
          "InvoiceNo": lambda num: num.nunique(),
          "TotalPrice": lambda price: price.sum()}).reset_index()

#Column name change
rfm.rename(columns={'InvoiceDate': 'recency', 
                    'InvoiceNo': 'frequency', 
                    'TotalPrice': 'monetary'}, inplace=True)

rfm.head()

Segmentation

From here, we segment using the computed indicators. How to split (cut-off values and number of bins) should be adapted to the situation; this time we divide each indicator at its quartiles.

Find the RFM scores from the quartiles

#Find the quartile of each of the three indicators
quantiles = rfm.quantile(q=[0.25,0.5,0.75])
quantiles_dict = quantiles.to_dict()
print(quantiles_dict)

>>>
{'recency': {0.25: 17.0, 0.5: 50.0, 0.75: 142.0},
 'frequency': {0.25: 1.0, 0.5: 2.0, 0.75: 5.0},
 'monetary': {0.25: 300.67499999999995, 0.5: 656.6899999999999, 0.75: 1601.0}}
#Define functions that bin each indicator by quartile (score 1 = best, 4 = worst)
def cal_R(x, col, df):
    if x <= df[col][0.25]:
        return 1
    elif x <= df[col][0.50]:
        return 2
    elif x <= df[col][0.75]: 
        return 3
    else:
        return 4
    
def cal_FM(x, col, df):
    if x <= df[col][0.25]:
        return 4
    elif x <= df[col][0.50]:
        return 3
    elif x <= df[col][0.75]: 
        return 2
    else:
        return 1
#Calculation of various scores
rfm['R_score'] = rfm.recency.apply(cal_R, args=('recency', quantiles_dict))
rfm['F_score'] = rfm.frequency.apply(cal_FM, args=('frequency', quantiles_dict))
rfm['M_score'] = rfm.monetary.apply(cal_FM, args=('monetary', quantiles_dict))

#Combine the three scores into a single RFM score
rfm["RFM_score"] = rfm.R_score.astype(str) + rfm.F_score.astype(str) + rfm.M_score.astype(str)

rfm.head()

(Screenshot: rfm.head() showing the R/F/M scores and combined RFM_score)

Using the obtained scores, we can segment customers, for example by treating those who score 1 (the best bin) on two or more of the three indicators as good customers.
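That rule can be written directly against the score columns. A minimal sketch (the is_good column name is my own):

#Flag customers who score 1 (the best bin) on at least two of the three indicators
rfm['is_good'] = (rfm[['R_score', 'F_score', 'M_score']] == 1).sum(axis=1) >= 2
rfm['is_good'].value_counts()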

Clustering (k-means)

Unsupervised machine learning (clustering) is also sometimes used for segmentation. Let's implement segmentation using the k-means method.

#Standardization
from sklearn.preprocessing import StandardScaler

tmp = rfm[['R_score', 'F_score', 'M_score']]
ss = StandardScaler()
rfm_scaled = ss.fit_transform(tmp)

print(rfm_scaled)

>>>
array([[ 1.34301118,  1.15405079,  1.34141302],
       [-1.33165772, -1.47067031, -1.34162009],
       [ 0.45145488, -0.59576328, -0.44727572],
       ...,
       [-1.33165772,  0.27914376,  1.34141302],
       [-1.33165772, -1.47067031, -1.34162009],
       [-0.44010142, -0.59576328, -1.34162009]])

The number of clusters must be specified by the analyst. One way to choose it is to look at how the inertia (the within-cluster sum of squared errors) changes as the number of clusters changes, the so-called elbow method. For more details, see the reference links at the end of the article.

#Clustering (examination of the number of divisions)
from sklearn.cluster import KMeans

inertias = {}
for k in range(2, 10): 
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(rfm_scaled)
    inertias[k] = km.inertia_

pd.Series(inertias).plot()
plt.xlabel("K (num of clusters)")
plt.ylabel("Inertia Score")

(Plot: inertia versus number of clusters K)

The elbow is a little difficult to judge, but this time we will use four clusters.
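As another sanity check on the choice of k, the silhouette score (higher is better) can be compared across candidate cluster counts; a minimal sketch:

#Silhouette score for candidate cluster counts
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(rfm_scaled)
    print(k, silhouette_score(rfm_scaled, labels))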

#Clustering with k = 4
k = 4
km = KMeans(n_clusters=k, random_state = 1)
rfm["KMeans"] = km.fit_predict(rfm_scaled)
rfm.head()

(Screenshot: rfm.head() with the KMeans cluster labels)

#Visualization
sns.pairplot(rfm[['R_score', 'F_score', 'M_score', 'KMeans']], hue="KMeans", palette='jet_r')
plt.show()

(Screenshot: pairplot of R/F/M scores colored by cluster)

#Check the number of elements in each cluster
rfm['KMeans'].value_counts()

>>>
0    1428
3    1112
1     997
2     782
Name: KMeans, dtype: int64
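To interpret what each cluster represents, it helps to look at the average raw RFM values per cluster; a small sketch (my own addition):

#Average raw RFM values per cluster, to characterize each segment
rfm.groupby('KMeans')[['recency', 'frequency', 'monetary']].mean().round(1)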

Conclusion

I implemented decile analysis and RFM analysis in Python. It was good practice with pandas, and I found it easier to stay motivated when working with meaningful data. Next, I will try product analysis.

References

- https://library.musubu.in/articles/10537#
- https://www.albert2005.co.jp/knowledge/marketing/customer_product_analysis/decyl_rfm
- https://qiita.com/deaikei/items/11a10fde5bb47a2cf2c2
