Introduction

When building a machine learning model in Python, plenty of sample code is available online for the processing patterns covered by scikit-learn (classification, regression, clustering, and so on), and a model can be implemented easily by referring to it. Outside those patterns, however, choosing a library and finding an implementation sample can be quite difficult. A typical example is **association analysis**, which is often used in marketing analysis. In R, a commonly used library makes it easy to build such a model, but for Python there is surprisingly little sample code. This article covers that gap. A complete walkthrough, including use cases from a more upstream business perspective, the steps for processing the original UCI sample dataset, and explanations of each, is given in section 5.4 of my book "Profitable AI". If you are interested, please refer to the book as well.

Amazon book https://www.amazon.co.jp/dp/4296106961/

Amazon Kindle https://www.amazon.co.jp/dp/B08F9P726T/

Book support page https://github.com/makaishi2/profitable_ai_book_info/blob/master/README.md

Data used

Use the data linked below.

https://github.com/makaishi2/sample-data/blob/master/data/retail-france.csv

This data has already been partially processed from the original UCI dataset. Please refer to the above-mentioned book for the processing steps up to this point.

Implementation code overview

The following is an overview of the association analysis implementation code using this data. The entire notebook is uploaded at:

https://github.com/makaishi2/sample-notebooks/blob/master/profitable-ai/association-sample.ipynb

Common processing

Of the common preprocessing used throughout the book, only the parts relevant to this sample are extracted here.

#Common preprocessing

#Hide extra warnings
import warnings
warnings.filterwarnings('ignore')

#Import of required libraries
import pandas as pd
import numpy as np

#Data frame display function
from IPython.display import display

#Display option adjustment
#Floating-point display precision in pandas
pd.options.display.float_format = '{:.4f}'.format

#Show all items in data frame
pd.set_option("display.max_columns",None)

Data reading

Import the pre-processed CSV data shown above into the data frame.

url = 'https://raw.githubusercontent.com/makaishi2/sample-data/master/data/retail-france.csv'
df = pd.read_csv(url)
display(df[100:110])

The result of the display function should look like this.

[Screenshot: rows 100 to 109 of df]

Data processing

To perform association analysis on the above data, it must first be converted to wide (horizontal) format, with one row per order and one column per item. (For details on what wide format means, please refer to the book.) The implementation code is as follows.

#Aggregate the number of products using the order number and product number as keys
w1 = df.groupby(['order number', 'Item Number'])['Number of products'].sum()

#Check the result
print(w1.head())

The state of w1 at this stage is as follows.

[Screenshot: first rows of w1]

Use the unstack function to move the item number from the row index to the columns.

#Move item number to column(Use of unstack function)
w2 = w1.unstack().reset_index().fillna(0).set_index('order number')

#Check size
print(w2.shape)

#Check the result
display(w2.head())

The result is as follows:

[Screenshot: shape and first rows of w2]
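To make the long-to-wide conversion concrete, here is a minimal sketch on a tiny made-up table. The data and the names `toy`, `toy_long`, and `toy_wide` are hypothetical and only reuse the column names from the code above. The sketch also skips the reset_index / set_index round trip, on the assumption that it does not change the result for this small case.

```python
import pandas as pd

# Hypothetical toy data in the same long format as df (not the real dataset)
toy = pd.DataFrame({
    'order number': ['A1', 'A1', 'A1', 'A2', 'A2'],
    'Item Number': ['P10', 'P10', 'P20', 'P20', 'P30'],
    'Number of products': [2, 1, 5, 4, 6],
})

# Long format: one total per (order, item) pair
toy_long = toy.groupby(['order number', 'Item Number'])['Number of products'].sum()

# Wide format: one row per order, one column per item,
# with 0 where the order does not contain the item
toy_wide = toy_long.unstack().fillna(0)

print(toy_long)
print(toy_wide)
```

Order A1 contains P10 twice in the toy input, so the aggregation step sums those rows to 3 before the item numbers are moved to the columns.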

Finally, use the apply function of the data frame to convert each element from a numerical value to a True/False value.

#Set True/False depending on whether the aggregated result is positive or 0
basket_df = w2.apply(lambda x: x>0)

#Check the result
display(basket_df.head())

The results are as follows. This completes the pre-processing for association analysis.

[Screenshot: first rows of basket_df]
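As a side note, the same True/False conversion can also be written as a direct comparison on the whole data frame. The following is a minimal sketch on a hypothetical wide table (`toy_wide` is made up for illustration) that checks the two forms give the same result.

```python
import pandas as pd

# Hypothetical wide-format table: rows are orders, columns are items
toy_wide = pd.DataFrame(
    {'P10': [3.0, 0.0], 'P20': [5.0, 4.0], 'P30': [0.0, 6.0]},
    index=pd.Index(['A1', 'A2'], name='order number'),
)

# Column-by-column comparison, as in the article's code
via_apply = toy_wide.apply(lambda x: x > 0)

# Direct comparison on the whole data frame
via_compare = toy_wide > 0

print(via_compare)
print(via_apply.equals(via_compare))
```

Either form should work here; the direct comparison is simply shorter.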

Model building

Use mlxtend as the library for association analysis. mlxtend is not as well known as scikit-learn, but it is likewise a machine learning library for Python.

First, install the mlxtend library.

#Install mlxtend
!pip install mlxtend

Next, import the functions `apriori` and `association_rules` to be used in the analysis.

#Loading the library
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

First, a method called Apriori analysis is used to extract combinations of products with a high value of the metric called **support**.

#Apriori analysis
freq_items1 = apriori(basket_df, min_support = 0.06, 
    use_colnames = True)

#Check the result
display(freq_items1.sort_values('support', 
    ascending = False).head(10))

#Check the number of itemset
print(freq_items1.shape[0])

The result is as follows.

[Screenshot: top 10 itemsets by support and the number of itemsets]
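For readers who want a feel for what support measures, here is a minimal pure-Python sketch using the standard definition: the fraction of orders that contain every item in an itemset. The baskets and the helpers `support` and `frequent_itemsets` are hypothetical. The helper enumerates candidates by brute force rather than using the level-by-level pruning of the real Apriori algorithm, so it illustrates the metric only and is not how mlxtend works internally.

```python
from itertools import combinations

# Hypothetical baskets: each set holds the items in one order
baskets = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
    {'bread', 'milk'},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def frequent_itemsets(baskets, min_support, max_len=2):
    """Brute-force enumeration of itemsets up to max_len items.

    Real Apriori prunes candidates level by level; this sketch does not.
    """
    items = sorted(set().union(*baskets))
    result = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), baskets)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result

freq = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

In this toy data, bread and milk appear together in 3 of 5 baskets, so that pair has a support of 0.6, while butter and milk appear together only once and fall below the 0.4 threshold.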

Next, from the list extracted in the previous step, extract the relationships with a high **lift value**.

#Extraction of association rules
a_rules1 = association_rules(freq_items1, metric = "lift",
    min_threshold = 1)

#Sort by lift value
a_rules1 = a_rules1.sort_values('lift',
    ascending = False).reset_index(drop=True)

#Check the result
display(a_rules1.head(10))

#Check the number of rules
print(a_rules1.shape[0])

The following list is the final result.

[Screenshot: top 10 rules by lift and the number of rules]
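As a rough illustration of how lift relates to support, the sketch below computes the metrics for a single rule by hand, using the standard definitions: confidence is the support of both sides together divided by the support of the antecedent, and lift is confidence divided by the support of the consequent. The baskets and the helper `rule_metrics` are hypothetical and exist only for this example.

```python
# Hypothetical baskets: each set holds the items in one order
baskets = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
    {'bread', 'milk'},
]

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Assumes both sides occur in at least one basket (no zero division guard).
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if a <= b)          # baskets with antecedent
    n_c = sum(1 for b in baskets if c <= b)          # baskets with consequent
    n_ac = sum(1 for b in baskets if (a | c) <= b)   # baskets with both
    return {
        'support': n_ac / n,
        'confidence': n_ac / n_a,
        # Same as confidence / support(consequent), computed from counts
        'lift': (n_ac * n) / (n_a * n_c),
    }

print(rule_metrics({'butter'}, {'bread'}, baskets))
print(rule_metrics({'bread'}, {'milk'}, baskets))
```

Here the rule butter -> bread has a lift of 1.25 (above 1, so the two items occur together more often than independence would suggest), while bread -> milk has a lift of 0.9375 and would be dropped by a minimum lift threshold of 1.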

In the book, I also use NetworkX to create a relationship graph based on the above results.

The technical terms **support** and **lift value** mentioned here are explained in the book, so please refer to it.

Association analysis in Python