[PYTHON] Basket analysis with Spark (1)

What is market basket analysis?

Market basket analysis is the classic example of "diapers and beer are bought together on Friday night". It computes three indicators from sales data: support, confidence, and lift. This article focuses on implementing them with PySpark; for the analysis method itself, see other articles, including several on Qiita.

Definition

Support(A⇒B) = P(A∩B) = \frac{Number of baskets containing both A and B}{Total number of baskets}
Confidence(A⇒B) = \frac{P(A∩B)}{P(A)} = \frac{Number of baskets containing both A and B}{Number of baskets containing A}
Expected confidence(A⇒B) = P(B) = \frac{Number of baskets containing B}{Total number of baskets}
Lift(A⇒B) = \frac{P(A∩B)}{P(A)P(B)} = \frac{Confidence}{Expected confidence}
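
To make the definitions concrete, here is a small hypothetical example (my numbers, not from the article): suppose there are 100 baskets in total, 20 contain A, 10 contain B, and 5 contain both.

Support(A⇒B) = 5 / 100 = 5%
Confidence(A⇒B) = 5 / 20 = 25%
Expected confidence(A⇒B) = 10 / 100 = 10%
Lift(A⇒B) = 25% / 10% = 2.5

A lift above 1 means A and B appear together more often than independent purchases would predict.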

Sample data

I will use the Groceries dataset known from the R association-analysis examples. There are many commentary articles and YouTube videos about it, so it is easy to check the calculation results. The file has one basket per line and contains 9835 baskets in total; a line is also called a transaction. The first five lines are as follows.

groceries.csv


citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product

Calculation of support

support.py


# -*- coding: utf-8 -*-
import sys
from itertools import combinations
from pprint import pprint
from pyspark import SparkContext


#Read the data. Trim each item and normalize to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: set([word.strip().lower() for word in row.split(",")]))
).cache()

#Total number of baskets
total = float(baskets.count())
 
result = (
    baskets
    #Assign an ID to each basket
    .zipWithIndex()

    #Build product pairs. Sorting keeps each pair in a stable order.
    .flatMap(lambda items_id: [(tuple(sorted(c)), (items_id[1],)) for c in combinations(items_id[0], 2)])

    #Collect the list of basket IDs per product pair
    .reduceByKey(lambda a, b: a + b)
    #Count the baskets: (pair, count)
    .map(lambda pair_baskets: (pair_baskets[0], len(pair_baskets[1])))

    #Add support: (pair, (count, support%))
    .map(lambda pair_count: (pair_count[0], (pair_count[1], pair_count[1] / total * 100)))

    #Sort in descending order of support
    .sortBy(lambda pair_stats: -pair_stats[1][1])
)

#Show top 10 support
pprint(result.take(10))

Results of support

(other vegetables, whole milk) came out on top, appearing in 736 of the 9835 baskets for a support of 7.48%. The pairs that follow, such as rolls and milk or milk and yogurt, reflect Western shopping habits, so the results look reasonable.

$ spark-submit support.py groceries.csv

[((u'other vegetables', u'whole milk'), (736, 7.483477376715811)),
 ((u'rolls/buns', u'whole milk'), (557, 5.663446873411286)),
 ((u'whole milk', u'yogurt'), (551, 5.602440264361973)),
 ((u'root vegetables', u'whole milk'), (481, 4.89069649211998)),
 ((u'other vegetables', u'root vegetables'), (466, 4.738179969496695)),
 ((u'other vegetables', u'yogurt'), (427, 4.341637010676156)),
 ((u'other vegetables', u'rolls/buns'), (419, 4.260294865277071)),
 ((u'tropical fruit', u'whole milk'), (416, 4.229791560752415)),
 ((u'soda', u'whole milk'), (394, 4.006100660904932)),
 ((u'rolls/buns', u'soda'), (377, 3.833248601931876))]

Next, as a short detour, let's look at the bottom 10. All it takes is flipping the sortBy key to ascending order, as in the snippet below. Mayonnaise and white wine, brandy and candy, chewing gum and red wine, artificial sweetener and dog food, jam and light bulbs, and so on.
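
A one-line sketch of that change, assuming the same (pair, (count, support)) record layout as in support.py:

#Sort in ascending order of support to get the rarest pairs first
.sortBy(lambda pair_stats: pair_stats[1][1])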

[((u'mayonnaise', u'white wine'), (1, 0.010167768174885612)),
 ((u'chewing gum', u'red/blush wine'), (1, 0.010167768174885612)),
 ((u'chicken', u'potato products'), (1, 0.010167768174885612)),
 ((u'brandy', u'candy'), (1, 0.010167768174885612)),
 ((u'chewing gum', u'instant coffee'), (1, 0.010167768174885612)),
 ((u'artif. sweetener', u'dog food'), (1, 0.010167768174885612)),
 ((u'meat spreads', u'uht-milk'), (1, 0.010167768174885612)),
 ((u'baby food', u'rolls/buns'), (1, 0.010167768174885612)),
 ((u'baking powder', u'frozen fruits'), (1, 0.010167768174885612)),
 ((u'jam', u'light bulbs'), (1, 0.010167768174885612))]

Confidence calculation

Since confidence(X⇒Y) and confidence(Y⇒X) are different values, I used permutations instead of combinations to enumerate both directions of every pair.
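
As a quick illustration of the difference (not part of the original script), on a two-item basket:

>>> from itertools import combinations, permutations
>>> list(combinations(["milk", "bread"], 2))
[('milk', 'bread')]
>>> list(permutations(["milk", "bread"], 2))
[('milk', 'bread'), ('bread', 'milk')]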

confidence.py


# -*- coding: utf-8 -*-
import sys
from itertools import permutations
from pprint import pprint
from pyspark import SparkContext


#Read the data. Trim each item and normalize to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: set([word.strip().lower() for word in row.split(",")]))
).cache()

#Total number of baskets
total = float(baskets.count())

#Assign an ID to each basket
baskets_with_id = baskets.zipWithIndex()

#Build (product pair, number of baskets containing it)
pair_count = (
    baskets_with_id
    .flatMap(lambda items_id: [(pair, (items_id[1],)) for pair in permutations(items_id[0], 2)])
    #Collect the list of basket IDs per product pair
    .reduceByKey(lambda a, b: a + b)
    #Count the baskets: (pair, count)
    .map(lambda pair_baskets: (pair_baskets[0], len(pair_baskets[1])))
)

#Number of baskets containing product X
x_count = (
    baskets_with_id
    .flatMap(lambda items_id: [(x, (items_id[1],)) for x in items_id[0]])
    #Collect the list of basket IDs per product X
    .reduceByKey(lambda a, b: a + b)
    #Count the baskets: (x, count)
    .map(lambda x_baskets: (x_baskets[0], len(x_baskets[1])))
)

#Calculate the confidence of X⇒Y
confidence = (
    pair_count
    #Re-key by X so we can join with x_count
    .map(lambda pc: (pc[0][0], (pc[0], pc[1])))
    .join(x_count)

    #Add confidence: (pair, (xy_count, x_count, confidence%))
    .map(lambda e: (e[1][0][0], (e[1][0][1], e[1][1], float(e[1][0][1]) / e[1][1] * 100)))

    #Sort by confidence in descending order
    .sortBy(lambda pair_stats: -pair_stats[1][2])
)

pprint(confidence.take(10))



Confidence result

Each result is a tuple of ((product X, product Y), (number of baskets containing X and Y, number of baskets containing X, confidence in %)). Sorting by confidence puts 100% rules at the top, but these are just rare combinations that appear only once.

$ spark-submit confidence.py groceries.csv

[((u'baby food', u'waffles'), (1, 1, 100.0)),
 ((u'baby food', u'cake bar'), (1, 1, 100.0)),
 ((u'baby food', u'dessert'), (1, 1, 100.0)),
 ((u'baby food', u'brown bread'), (1, 1, 100.0)),
 ((u'baby food', u'rolls/buns'), (1, 1, 100.0)),
 ((u'baby food', u'soups'), (1, 1, 100.0)),
 ((u'baby food', u'chocolate'), (1, 1, 100.0)),
 ((u'baby food', u'whipped/sour cream'), (1, 1, 100.0)),
 ((u'baby food', u'fruit/vegetable juice'), (1, 1, 100.0)),
 ((u'baby food', u'pastry'), (1, 1, 100.0))]
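
A simple way to suppress such one-off rules (my addition, not something the original script does) is to require a minimum number of supporting baskets before sorting:

#Keep only rules whose pair occurs in at least 10 baskets
confidence_filtered = confidence.filter(lambda pair_stats: pair_stats[1][0] >= 10)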

So, when I instead sorted in descending order by (number of baskets containing X, number of baskets containing X and Y), using a sort key like the sketch below, the following results came out. Whole milk is the most purchased product, and given a basket with milk there is an 11% to 29% confidence that vegetables, rolls, yogurt, and so on are bought together.
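
The sort key for that ordering would look like this (a sketch, assuming the (pair, (xy_count, x_count, confidence)) layout produced above):

#Sort by (number of baskets containing X, number of baskets containing XY), descending
.sortBy(lambda pair_stats: (pair_stats[1][1], pair_stats[1][0]), ascending=False)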

[((u'whole milk', u'other vegetables'), (736, 2513, 29.287703939514525)),
 ((u'whole milk', u'rolls/buns'), (557, 2513, 22.16474333465977)),
 ((u'whole milk', u'yogurt'), (551, 2513, 21.92598487863112)),
 ((u'whole milk', u'root vegetables'), (481, 2513, 19.140469558296857)),
 ((u'whole milk', u'tropical fruit'), (416, 2513, 16.55391961798647)),
 ((u'whole milk', u'soda'), (394, 2513, 15.678471945881418)),
 ((u'whole milk', u'bottled water'), (338, 2513, 13.450059689614008)),
 ((u'whole milk', u'pastry'), (327, 2513, 13.01233585356148)),
 ((u'whole milk', u'whipped/sour cream'), (317, 2513, 12.614405093513728)),
 ((u'whole milk', u'citrus fruit'), (300, 2513, 11.937922801432551))]

Lift calculation

The source code has evaporated. I will post it as soon as it turns up.
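
Until it turns up, here is a minimal sketch of how the lift could be computed in the same style as confidence.py. This is my reconstruction from the definition above, not the lost original, so it may not reproduce the exact values printed below.

lift.py (reconstruction)


# -*- coding: utf-8 -*-
import sys
from itertools import permutations
from pprint import pprint
from pyspark import SparkContext


#Read the data. Trim each item and normalize to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: set(word.strip().lower() for word in row.split(",")))
).cache()

#Total number of baskets
total = float(baskets.count())

#Number of baskets containing each product: (x, count)
x_count = (
    baskets
    .flatMap(lambda items: [(x, 1) for x in items])
    .reduceByKey(lambda a, b: a + b)
)

#Number of baskets containing each ordered pair: ((x, y), count)
pair_count = (
    baskets
    .flatMap(lambda items: [(pair, 1) for pair in permutations(items, 2)])
    .reduceByKey(lambda a, b: a + b)
)

#lift(X⇒Y) = P(X∩Y) / (P(X)P(Y)) = xy_count * total / (x_count * y_count)
lift = (
    pair_count
    #Re-key by X and join in the count of X
    .map(lambda pc: (pc[0][0], (pc[0], pc[1])))
    .join(x_count)
    #Now (x, ((pair, xy_count), x_cnt)); re-key by Y and join in the count of Y
    .map(lambda e: (e[1][0][0][1], (e[1][0][0], e[1][0][1], e[1][1])))
    .join(x_count)
    #Now (y, ((pair, xy_count, x_cnt), y_cnt)); compute the lift
    .map(lambda e: (e[1][0][0], e[1][0][1] * total / (e[1][0][2] * e[1][1])))
    #Sort by lift in descending order
    .sortBy(lambda pair_lift: -pair_lift[1])
)

pprint(lift.take(10))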

Lift result

In any case, people who buy snack-like items are far more likely to buy them together with alcohol than on their own (laughs).

[((u'cocoa drinks', u'preservation products'), 22352.27272727273),
 ((u'preservation products', u'cocoa drinks'), 22352.272727272728),
 ((u'finished products', u'baby food'), 15367.1875),
 ((u'baby food', u'finished products'), 15367.1875),
 ((u'baby food', u'soups'), 14679.104477611942),
 ((u'soups', u'baby food'), 14679.10447761194),
 ((u'abrasive cleaner', u'preservation products'), 14050.000000000002),
 ((u'preservation products', u'abrasive cleaner'), 14050.0),
 ((u'cream', u'baby cosmetics'), 12608.97435897436),
 ((u'baby cosmetics', u'cream'), 12608.974358974358)]

Summary

We have conducted a market basket analysis with PySpark.

This article was written a long time ago and left as a draft, so it may not work with current PySpark as-is.

Recommended Posts

Basket analysis with Spark (1)
Principal component analysis with Spark ML
Data analysis with python 2
Dependency analysis with CaboCha
Voice analysis with python
Getting started with Spark
Voice analysis with python
Dynamic analysis with Valgrind
Regression analysis with NumPy
Data analysis with Python
[Co-occurrence analysis] Easy co-occurrence analysis with Python! [Python]
Ensemble learning and basket analysis
Multiple regression analysis with Keras
Sentiment analysis with Python (word2vec)
Texture analysis learned with pyradiomics
Planar skeleton analysis with Python
Japanese morphological analysis with Python
Muscle jerk analysis with Python
[PowerShell] Morphological analysis with SudachiPy
Text sentiment analysis with ML-Ask
3D skeleton structure analysis with Python
Impedance analysis (EIS) with python [impedance.py]
Text mining with Python ① Morphological analysis
Getting Started with Cisco Spark REST-API
Convenient analysis with Pandas + Jupyter notebook
I played with Mecab (morphological analysis)!
Kaggle Summary: Instacart Market Basket Analysis
Spark play with WSL anaconda jupyter (2)
Data analysis starting with python (data visualization 1)
Logistic regression analysis Self-made with python
Data analysis starting with python (data visualization 2)
I tried multiple regression analysis with polynomial regression
The most basic clustering analysis with scikit-learn
Principal Component Analysis with Livedoor News Corpus-Practice-
[In-Database Python Analysis Tutorial with SQL Server 2017]
Marketing analysis with Python ① Customer analysis (decyl analysis, RFM analysis)
Use apache Spark with jupyter notebook (IPython notebook)
Two-dimensional saturated-unsaturated osmotic flow analysis with Python
Machine learning with python (2) Simple regression analysis
2D FEM stress analysis program with Python
I tried factor analysis with Titanic data!
[Voice analysis] Find Cross Similarity with Librosa
Line talk analysis with janome (OSS released)
Sentiment analysis of tweets with deep learning
Tweet analysis with Python, Mecab and CaboCha
Principal component analysis with Power BI + Python
Visualize 2ch threads with WordCloud-Morphological analysis / WordCloud-
Data analysis starting with python (data preprocessing-machine learning)
Two-dimensional unsteady heat conduction analysis with Python
Network Analysis with NetworkX --- Community Detection Volume
Python: Simplified morphological analysis with regular expressions
How about polarity analysis with "order" added?