[PYTHON] Clustering ID-POS data with LDA

Introduction

LDA is said to work for clustering high-dimensional vectorized POS data, so I tried it. I also looked at how users transition between clusters over time.

LDA is the model used, for example, when dividing news articles by topic. It is a dimensionality reduction method suited to document modeling and is known as a topic model. It allows so-called soft clustering, where a data point can belong to multiple clusters (e.g. probability 0.8 of belonging to cluster 1 and 0.2 of belonging to cluster 2). For details, see other articles and books.
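For intuition, here is a minimal, self-contained sketch (a toy bag-of-words matrix, not the POS data used below) showing how scikit-learn's LatentDirichletAllocation returns a membership probability per topic for each document:

# Toy document-term matrix: 4 documents x 5 terms (counts)
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X_toy = np.array([[3, 2, 0, 0, 0],
                  [2, 4, 1, 0, 0],
                  [0, 0, 0, 5, 2],
                  [0, 1, 0, 3, 4]])
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
lda_toy.fit(X_toy)
# Each row sums to 1: the probability that the document belongs to each topic
print(lda_toy.transform(X_toy).round(2))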

Application to POS data

Suppose you want to cluster users based on how much of each product they purchase in POS data: the more product types there are, the more dimensions you have. With a distance-based method such as kmeans, distances blow up as the number of dimensions grows and clustering does not work well. LDA, on the other hand, is normally applied to documents vectorized in bag-of-words (BoW) format, and since the vocabulary is huge, that is also quite high-dimensional data. In other words, topic models such as LDA handle high-dimensional data well, so I thought they might be well suited to clustering POS data.
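As a rough illustration of the distance problem (a minimal sketch with random data, not the actual POS data), the mean pairwise Euclidean distance grows with dimensionality while the relative spread of distances shrinks, so distance-based methods lose discriminative power:

# Pairwise distances between random points in increasing dimensions
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in [2, 100, 1000]:
    X_rand = rng.random((200, dim))   # 200 random points in the unit hypercube
    d = pdist(X_rand)                 # all pairwise Euclidean distances
    # mean distance grows, relative spread (std / mean) shrinks
    print(dim, round(d.mean(), 2), round(d.std() / d.mean(), 3))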

Data used

I use Kaggle's "eCommerce purchase history from electronics store" dataset (source: https://rees46.com/): purchase data from an online store for large home appliances and electronics, covering April 2020 to November 2020.

Clustering execution

LDA performs soft clustering, but this time I treat the result as a hard clustering by assigning each user only to the cluster with the highest membership probability.
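In other words, the soft membership matrix returned by lda.transform is converted to a single label per user by taking the argmax. A minimal sketch with a hypothetical membership matrix:

# Hypothetical soft memberships: one row per user, one column per cluster
import numpy as np

membership = np.array([[0.8, 0.2],
                       [0.1, 0.9]])
# Hard assignment: keep only the cluster with the highest membership probability
hard_labels = membership.argmax(axis=1)
print(hard_labels)   # -> [0 1]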

Preparation

First import the required packages.

#Package import
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import pandas as pd
import datetime as dt
import sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
import time
import os
import glob
import codecs
sns.set()
'''
numpy 1.18.1
matplotlib 3.1.3
seaborn 0.10.0
pandas 1.0.3
sklearn 0.22.1
'''

Next, read, process, and extract the data.

file='kz.csv'
df = pd.read_csv(file, dtype={'user_id':str, 'order_id':str})
df=df[['event_time', 'category_code', 'brand', 'price', 'user_id', 'order_id']]
df=df.dropna()
df['event_time']=df['event_time'].str[:-4]
df['event_time']=pd.to_datetime(df['event_time'])
df=df[df['event_time']>=dt.datetime(2020,1,1)]
df=df.sort_values('event_time')
#Combine brands and categories
df_cat_split=df['category_code'].str.split('.', expand=True)
df_cat_split.loc[(pd.isna(df_cat_split[2])), 2]=df_cat_split[1]
df_cat_split[3]=df_cat_split[1]+'.'+df_cat_split[2]
df['category']=df_cat_split[3].values
df['brand_category']=df['brand']+'.'+df['category']
# Number of unique values in each column
print('order_id', df['order_id'].unique().shape[0])
print('user_id', df['user_id'].unique().shape[0])
print('category_code', df['category_code'].unique().shape[0])
print('brand', df['brand'].unique().shape[0])
print('brand_category', df['brand_category'].unique().shape[0])
'''
order_id 331424
user_id 203235
category_code 123
brand 570
brand_category 1375
'''
display(df)


Divide the data into the first half and the second half.

#Divide the data in half
df_before=df.iloc[:int(len(df)/2.),:]
df_after=df.iloc[int(len(df)/2.):,:]

# Extract user_ids present in both halves
df_target=pd.merge(df_before[['user_id']], df_after[['user_id']], on=['user_id'], how='inner')['user_id'].unique()
df_target=pd.DataFrame(df_target, columns=['user_id'])
# Keep only the rows for user_ids present in both halves
df_before=df_before[df_before['user_id'].isin(df_target['user_id'].values)]
df_after=df_after[df_after['user_id'].isin(df_target['user_id'].values)]
# Show the period of each half and the number of unique user_ids
print('before\n', df_before['event_time'].min())
print('', df_before['event_time'].max())
print('\nafter\n',df_after['event_time'].min())
print('', df_after['event_time'].max())
print('\nUnique User Cnt', len(df_target))
'''
before
 2020-01-05 04:35:21
 2020-08-14 08:58:58

after
 2020-08-14 08:59:17
 2020-11-21 09:59:55

Unique User Cnt 15527
'''

Create a data mart of the purchase amount for each product per user_id. Since high-priced products are bought less readily, I used the purchase amount (price) rather than the purchase count as the value, as a form of weighting.

#Make a mart with a pivot
def df_pivot(df, index, columns, values, aggfunc):
    df_mart=df.pivot_table(index=index, columns=columns, values=values, aggfunc=aggfunc).reset_index()
    df_mart=df_mart.fillna(0)
    return df_mart

#Processing for feeding to LDA
def df_to_np(df_mart):
    df_data=df_mart.copy().iloc[:,1:]
    df_data = df_data.values
    return df_data

row='user_id'
col='brand_category'
val='price'

df_mart=df_pivot(df_before, row, col, val, 'sum')
df_mart2=df_pivot(df_after, row, col, val, 'sum')

# Get column names that appear in only one of df_mart and df_mart2
after=np.hstack((df_mart.columns.values, df_mart2.columns.values))
unique_after, counts_after = np.unique(after, return_counts=True)
non_dep_after=unique_after[counts_after == 1]

# From those non-duplicated columns, extract the ones that are in df_mart but not in df_mart2
before=np.hstack((non_dep_after, df_mart.columns.values))
unique_before, counts_before = np.unique(before, return_counts=True)
dep_before=unique_before[counts_before != 1]

# Add the df_mart-only columns to df_mart2, filled with 0
for col in dep_before:
    df_mart2[col]=0.

#Now df_mart and df_mart2 columns are aligned
df_mart=df_mart[df_mart.columns]
df_mart2=df_mart2[df_mart.columns]
#Processing for feeding to LDA
df_data=df_to_np(df_mart)
df_data2=df_to_np(df_mart2)
display(df_mart)
display(df_mart2)
display(df_data)
display(df_data2)


Clustering with LDA

Fit a model for each number of clusters (topics) from 2 to 51. Both the log-likelihood (larger is better) and the perplexity (smaller is better) kept improving as the number of clusters increased, so I somewhat arbitrarily set the number of clusters to 6. (How do people decide the number of clusters in practice?)

#Functions that model LDA
def model_plot_opt(tfidf_data, topic_list, plot_enabled=True):
    #Definition
    n_topics = list(topic_list.astype(int))
    perplexities=[]
    log_likelyhoods_scores=[]
    models=[]
    search_params = {'n_components': n_topics}
    minmax_1 = MinMaxScaler()
    minmax_2 = MinMaxScaler()
    
    #Create a model for each set number of topics
    for i in n_topics:
        print('topic_cnt:',i)
        lda = LatentDirichletAllocation(n_components=i,random_state=0,
                                        learning_method='batch',
                                        max_iter=25)
        lda.fit(tfidf_data)
        lda_perp = lda.perplexity(tfidf_data)
        log_likelyhoods_score = lda.score(tfidf_data)
        perplexities.append(lda_perp)
        log_likelyhoods_scores.append(log_likelyhoods_score)
        models.append(lda)
    
    # Normalize the log-likelihood and perplexity to [0, 1] so they can share one axis
    log_likelyhoods_scores_std=minmax_1.fit_transform(np.array(log_likelyhoods_scores).reshape(-1, 1))
    perplexities_std=minmax_2.fit_transform(np.array(perplexities).reshape(-1, 1))

    # Plot the normalized log-likelihood and perplexity
    if plot_enabled:
        plt.figure(figsize=(12, 8))
        ax=plt.subplot(1,1,1)
        ax.plot(n_topics, log_likelyhoods_scores_std, marker='o', color='blue', label='log-likelihood score')
        ax.set_title("Choosing Optimal LDA Model")
        ax.set_xlabel("Number of Topics")
        ax.set_ylabel("Log Likelihood Score & Perplexity")
        ax.plot(n_topics, perplexities_std, marker='x', color='red', label='perplexity')
        plt.legend()
        plt.show()

    return models, log_likelyhoods_scores_std, perplexities_std

#Define a list of models and normalized log-likelihood and perplexity
models_list, log_likelyhoods_scores_std, perplexities_std = model_plot_opt(df_data, np.linspace(2,51,50))
# Pick the model with 6 topics (index 4, since the candidate numbers of topics start at 2)
lda=models_list[4]
print('topic_num:', lda.components_.shape[0])
'''
topic_num: 6
'''

(Plot: normalized log-likelihood and perplexity against the number of topics; image omitted)

Let's look at the characteristics of each cluster. First, extract the products with the highest appearance probability in each cluster.

# Function to get the top 20 products for each topic
def component(lda, features):
    df_component=pd.DataFrame()
    for tn in range(lda.components_.shape[0]):
        # lda.components_[tn] holds each product's weight in topic tn (proportional to its appearance probability)
        row = lda.components_[tn]
        words = [features[i] for i in row.argsort()[:-20-1:-1]]
        df_component[tn]=words
    return df_component

# Show the top 5 products for each topic
features = df_mart.iloc[:,1:].columns.values
df_component=component(lda, features)
display(df_component.iloc[:5,:])

(Output shown via Excel; image omitted)

In addition, extract the average purchase amount per product for each cluster.

# Create a df with each user_id's most probable topic added as a column
def create_topic_no(df_mart, df_data, lda):
    df_id_cluster=df_mart[[row]].copy()
    df_topic=pd.DataFrame(lda.transform(df_data))
    # Hard assignment: the topic with the highest membership probability for each user
    topic=df_topic.idxmax(axis=1).values
    df_id_cluster['topic']=topic
    return df_id_cluster

df_id_cluster=create_topic_no(df_mart, df_data, lda)
df_id_cluster2=pd.merge(df_mart, df_id_cluster, on=['user_id'], how='left')
# Average purchase amount per product for each topic
display(df_id_cluster2.groupby(['topic']).mean().T)

(Output shown via Excel; image omitted.) You can see that cluster 0 is Asus PCs, cluster 1 is iPhones, cluster 2 is Lenovo PCs, and so on.

Next, assign a cluster number to each user_id.

# Assign topic numbers to df_mart (first-half data)
df_topic_result=df_mart.copy()
top_price_brand_before=df_mart.iloc[:,1:].idxmax(axis=1).values
# Add each user_id's most probable topic as a column
df_topic_result['topic_before']=create_topic_no(df_mart, df_data, lda)['topic'].values
# Add the brand_category with the highest purchase amount for each user_id as a column
df_topic_result['top_price_brand_before']=top_price_brand_before

# Assign topic numbers to df_mart2 (second-half data)
df_topic_result2=df_mart2.copy()
top_price_brand_after=df_mart2.iloc[:,1:].idxmax(axis=1).values
# Add each user_id's most probable topic as a column
df_topic_result2['topic_after']=create_topic_no(df_mart2, df_data2, lda)['topic'].values
# Add the brand_category with the highest purchase amount for each user_id as a column
df_topic_result2['top_price_brand_after']=top_price_brand_after

# Join the first-half and second-half results
df_topic_result=pd.merge(df_topic_result, df_topic_result2[['user_id','topic_after','top_price_brand_after']], on=['user_id'], how='left')
display(df_topic_result)


Check the cluster composition with a pie chart.

# plot Cluster Chart
def pct_abs(pct, raw_data):
    absolute = int(np.sum(raw_data)*(pct/100.))
    return '{:d}\n({:.0f}%)'.format(absolute, pct) if pct > 5 else ''

def plot_chart(y_km):
    km_label=pd.DataFrame(y_km).rename(columns={0:'cluster'})
    km_label['val']=1
    km_label=km_label.groupby('cluster')[['val']].count().reset_index()
    fig=plt.figure(figsize=(5,5))
    ax=plt.subplot(1,1,1)
    ax.pie(km_label['val'],labels=km_label['cluster'], autopct=lambda p: pct_abs(p, km_label['val']))#, autopct="%1.1f%%")
    ax.axis('equal')
    ax.set_title('Cluster Chart (ALL UU:{})'.format(km_label['val'].sum()),fontsize=14)
    plt.show()

plot_chart(df_topic_result['topic_before'].values)
plot_chart(df_topic_result['topic_after'].values)

Cluster chart of the first-half data and of the second-half data (images omitted).

Each cluster has distinct characteristics and the users are not heavily skewed toward any single cluster, so the split looks reasonable to me.

Checking cluster transitions

Some users move to a different cluster between the first half and the second half, so let's look at the transitions in a cross table.

display(df_topic_result.pivot_table(index='topic_before', columns='topic_after', values='user_id', aggfunc='count'))

(Cross table shown via Excel; image omitted)

For example, many users move from cluster 2 to 0, or from 4 to 0. For the users who moved from 2 to 0, which brands did they buy more of and which less? Let's check. First, create a data frame aggregated by the number of purchases.

row='user_id'
col='brand_category'
val='order_id'

df_mart=df_pivot(df_before, row, col, val, 'count')
df_mart2=df_pivot(df_after, row, col, val, 'count')

# Get column names that appear in only one of df_mart and df_mart2
after=np.hstack((df_mart.columns.values, df_mart2.columns.values))
unique_after, counts_after = np.unique(after, return_counts=True)
non_dep_after=unique_after[counts_after == 1]

# From those non-duplicated columns, extract the ones that are in df_mart but not in df_mart2
before=np.hstack((non_dep_after, df_mart.columns.values))
unique_before, counts_before = np.unique(before, return_counts=True)
dep_before=unique_before[counts_before != 1]

# Add the df_mart-only columns to df_mart2, filled with 0
for col in dep_before:
    df_mart2[col]=0

#Now df_mart and df_mart2 columns are aligned
df_mart=df_mart[df_mart.columns]
df_mart2=df_mart2[df_mart.columns]

display(df_mart)
display(df_mart2)


Extract the users who moved from cluster 2 to cluster 0.

n=2
m=0
user_id_n_m=df_topic_result[(df_topic_result['topic_before']==n)&(df_topic_result['topic_after']==m)]['user_id'].values
df_b_n_m=df_mart[df_mart['user_id'].isin(user_id_n_m)]
df_a_n_m=df_mart2[df_mart2['user_id'].isin(user_id_n_m)]
display(df_b_n_m)
display(df_a_n_m)


Subtract the first-half data from the second-half data to see the increase or decrease in the number of purchases for each brand.

df_diff_n_m=df_a_n_m.iloc[:,1:]-df_b_n_m.iloc[:,1:]
df_diff_n_m.index=df_a_n_m['user_id'].values
df_diff_n_m=df_diff_n_m.T
df_diff_n_m['col']=df_diff_n_m.index
df_diff_n_m['brand']=df_diff_n_m['col'].str.split('.', expand=True).iloc[:,0].values
df_diff_n_m=pd.DataFrame(df_diff_n_m.groupby(['brand']).sum().T.sum()).sort_values(0, ascending=False)

#Plot the increase / decrease in the number of purchases for each brand
fig=plt.figure(figsize=(20,10))
plt.bar(df_diff_n_m.index[:11], df_diff_n_m[0][:11])
plt.bar(df_diff_n_m.index[-10:], df_diff_n_m[0][-10:])
plt.rcParams["font.family"] = "IPAexGothic"
plt.tick_params(labelsize=18)
plt.xticks(rotation=45)
plt.xlabel('# brand', fontsize=18)
plt.ylabel('# frequency of purchasing', fontsize=18)
plt.title('Increase / decrease in the number of purchases for each brand (top 10 and bottom 10)', fontsize=18)
plt.show()

(Chart: increase / decrease in purchase counts by brand for users who moved from cluster 2 to cluster 0; image omitted)

For the group of users who moved from cluster 2 to cluster 0, purchases of Lenovo and Samsung decreased while purchases of Asus and Logitech increased. This matches the cluster characteristics seen above in each cluster's top appearance probabilities. In this way, it may be possible to follow changes in user preferences over time. Although more detailed analysis would be needed to be sure, brand switching may have occurred here, so by digging deeper you might be able to analyze why your own brand or a competitor's brand did or did not sell.
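One way to quantify these shifts, as a minimal sketch assuming the df_topic_result data frame built above, is to row-normalize the cross table into transition probabilities:

# Row-normalized transition matrix: share of each first-half cluster moving to each second-half cluster
trans = df_topic_result.pivot_table(index='topic_before', columns='topic_after',
                                    values='user_id', aggfunc='count').fillna(0)
trans_prob = trans.div(trans.sum(axis=1), axis=0)
display(trans_prob.round(2))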

As described above, clustering with LDA may enable meaningful analysis.

Clustering with kmeans

Now try the same clustering with kmeans and check the resulting pie chart.

row='user_id'
col='brand_category'
val='price'

df_mart=df_pivot(df_before, row, col, val, 'sum')
df_mart2=df_pivot(df_after, row, col, val, 'sum')

# Get column names that appear in only one of df_mart and df_mart2
after=np.hstack((df_mart.columns.values, df_mart2.columns.values))
unique_after, counts_after = np.unique(after, return_counts=True)
non_dep_after=unique_after[counts_after == 1]

# From those non-duplicated columns, extract the ones that are in df_mart but not in df_mart2
before=np.hstack((non_dep_after, df_mart.columns.values))
unique_before, counts_before = np.unique(before, return_counts=True)
dep_before=unique_before[counts_before != 1]

# Add the df_mart-only columns to df_mart2, filled with 0
for col in dep_before:
    df_mart2[col]=0

#Now df_mart and df_mart2 columns are aligned
df_mart=df_mart[df_mart.columns]
df_mart2=df_mart2[df_mart.columns]
# Standardize the data for kmeans
ss=StandardScaler()
df_data=ss.fit_transform(df_mart.iloc[:,1:].values)
ss=StandardScaler()
df_data2=ss.fit_transform(df_mart2.iloc[:,1:].values)

def km_cluster(X, k):
    km=KMeans(n_clusters=k,\
              init="k-means++",\
              random_state=0)
    y_km=km.fit_predict(X)
    return y_km,km

# Cluster with k=6
y_km,km=km_cluster(df_data, 6)
plot_chart(y_km)


The resulting clusters are quite imbalanced.

Looking at the average values for each cluster, Samsung dominates in many of them, so the clusters are skewed.

df_kmeans=df_mart.copy()
df_kmeans['cluster']=y_km
# Average purchase amount per product for each cluster
df_kmeans.groupby(['cluster']).mean().T

(Output shown via Excel; image omitted)

As noted above, kmeans could not cluster this high-dimensional data well.

In conclusion

I tried clustering POS data with LDA. A topic model may give good results when distance-based clustering looks difficult.

That's all!
