Rewrite SPSS Modeler reconfigure node in Python. Aggregation by purchased product category

Let's rewrite the reconstruction node that converts vertically held data to horizontally held with SPSS Modeler with Python pandas.

1. 1. Image of processing

From the following POS data with ID, let's aggregate (1) the total purchase amount for each product category and (2) the purchase ratio for each product category for each customer.

■ Before processing We use ID-attached POS data that records who (CUSTID) purchased when (SDATE) and what (PRODUCTID, L_CLASS product major classification, M_CLASS product middle classification) and how much (SUBTOTAL).

image.png

■ After processing For each customer (CUSTID), the total purchase amount for each product category (1) and the purchase ratio for each product category (L_CLASS) are totaled. image.png

There are three major product categories: BAG, COMETICS, and SHOES. The 100001 customer has a purchase of 7243 yen for BAG, 10273 yen for COSMETICS, and 26870 yen for SHOES. If you calculate it as a percentage of the amount, it will be BAG 16.3%, COSMETICS 23.1%, SHOES 60.5%. When such aggregation is performed, the characteristics of the customer become apparent.

2. Settings on the Modeler reconfiguration node (1) Total purchase amount for each product category

First, (1) find up to the total purchase amount for each product category. Reconfigure nodes are commonly used in combinations of data type nodes, record summaries, and even replacement nodes. In addition, this time we will also combine record aggregation and record join nodes to calculate the total purchase price.

image.png

First, the data type node recognizes what category value is in the product classification (L_CLASS). When you perform "Read Value" on the data type node, all the major categories are automatically recognized.

image.png

Next, select the value of the field you want to hold horizontally in the reconfiguration node. In the example below, BAG, COMETICS, SHOES are expanded as columns for the product major classification (L_CLASS), and subtotals (SUBSTOTAL) are set for the values. image.png

A preview of the result of this setting looks like below. You can see that the subtotal (SUBSTOTAL) has been sorted into the columns of each product major classification (L_CLASS). image.png

Next, use the "Record Aggregation" node to aggregate the subtotal values of BAG, COMETICS, and SHOES into one record by the customer (uncheck because the record frequency is unnecessary). image.png

With this, the total purchase amount for each BAG, COMETICS, and SHOES product category for each customer has been aggregated. It was calculated that the customer number 100001 had a purchase of 7243 yen for BAG, 10273 yen for COSMETICS, and 26870 yen for SHOES.

image.png

It's almost complete, but it has some null values. This means that this customer has never made a purchase in this product category. If this is left as it is, it will be difficult to use for calculation, so replace NULL with 0.

It is the replacement node that does that. Select the node with the total value of BAG, COMETICS, SHOES and set it to 0 if it is null.

image.png

Nulls have been converted to 0 as shown below.

image.png

Next, the total purchase amount for each customer is calculated by adding the record aggregation and record combination nodes. image.png

First, the record aggregation node calculates the total subtotal (SUBTOTAL) of each customer (CUSTID), that is, the total purchase amount (again, the record frequency is not required).

image.png

The 100001 customer purchases a total of 44,386 yen. image.png

Combine this aggregation result with the aggregation of the total amount of BAG, COMETICS, and SHOES earlier with the record join node. image.png

Then, the total purchase amount could be combined after the total amount of BAG, COMETICS, and SHOES for each customer. image.png

At the end, I shortened the long column name in the filter node (this work is not mandatory). image.png

The final data looks like this: image.png

2. Settings on the Modeler reconfiguration node (2) Purchase ratio by product category

From here, it is a little applied usage as a reconfiguration node, but I will introduce it because I often want to calculate a "ratio" that is easy to compare with other customers.

Let's take a look at the data we created earlier.

image.png

For the 100001 customer, the purchase amount ratio can be calculated by dividing BAG by 7243 yen ÷ 44,386 yen, COSMETICS by 10273 yen ÷ 44,386 yen, and SHOES by 26870 yen ÷ 44,386 yen. Do this calculation on the field creation node.

image.png

Set multiple modes and select BAG_SUM, COSMETICS_SUM, SHOES_SUM in the field list. Then you can use @FIELD in the CLEM expression to refer to the BAG_SUM, COSMETICS_SUM, and SHOES_SUM columns. here @FIELD/ALL_SUM By specifying Three ratios of BAG_SUM / ALL_SUM, COSMETICS_SUM / ALL_SUM, and SHOES_SUM / ALL_SUM are calculated.

image.png

With this, the purchase ratio for each major product category of BAG, COMETICS, and SHOES for each customer has been aggregated. The purchase ratio of customer No. 100001 is BAG 16.3%, COSMETICS 23.1%, SHOES 60.5%. image.png

3. Settings in pandas ① Total purchase amount for each product category

I will try the same data processing with pandas. Actually (1) Calculation of the total purchase amount for each product category can be done very easily with one method called pivot_table in pandas.

pivot_tran_df= df.pivot_table(
    index=['CUSTID'],columns=['L_CLASS'],values=['SUBTOTAL'],
    aggfunc='sum',
    fill_value=0,
    margins=True, margins_name='ALL')

--index = ['CUSTID'], columns = ['L_CLASS'], values = ['SUBTOTAL'] and expand SUBTOTAL for each major category of BAG, COMETICS, SHOES included in L_CLASS using CUSTID as a key. .. This is the image of the reconfiguration node settings. --Indicates that the total value of SUBTOTAL is calculated with aggfunc ='sum'. Here is an image of the aggregation node. --fill_value = 0 indicates that the part that became NULL is filled with 0. Here is an image of the replacement node. --margins = True, margins_name ='ALL' indicates to add column totals and row totals. This serves as an aggregation node and a record join node.

The purchase amount and total purchase amount of BAG, COMETICS, SHOES can be calculated for each CUSTID as shown below.

image.png

--Reference

The column total column is required, but the row total row is not needed this time, so we will delete it below.

pivot_tran_df=pivot_tran_df[:-1]

--Reference --python — How to remove the last line of data in a pandas data frame https://www.it-swarm.dev/ja/python/pandas%E3%83%87%E3%83%BC%E3%82% BF% E3% 83% 95% E3% 83% AC% E3% 83% BC% E3% 83% A0% E3% 81% AE% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 81% AE% E6% 9C% 80% E5% BE% 8C% E3% 81% AE% E8% A1% 8C% E3% 82% 92% E5% 89% 8A% E9% 99% A4% E3% 81% 99% E3% 82% 8B% E6% 96% B9% E6% B3% 95/1050572731 /

If you use pivot_table, the columns will be hierarchical and multi-column.

image.png image.png

If it is a multi-column, it is difficult to combine and handle it, so convert it to a flat snake format (concatenated with underscore) column name. According to the Modeler naming convention, it is as follows.

pivot_tran_df.columns = [
    pivot_tran_df.columns.names[1]+"_"+levels[1]+"_"+levels[0] 
    for levels in pivot_tran_df.columns]

image.png

image.png

I also used a regular expression here and changed it to a shorter column name like BAG_SUM.

import re
pivot_tran_df=pivot_tran_df.rename(
    columns= lambda str:
    re.sub('L_CLASS_(.+)_SUBTOTAL',r'\1_SUM',str))

image.png

3. Settings in pandas ② Purchase ratio by product category

(2) In Modeler, the purchase ratio for each product category was calculated by @ FIELD / ALL_SUM.

pandas uses the div method. Close to the image of the Replace node in Modeler, all columns are divided by ALL_SUM and replaced. ALL_SUM itself is also divided to 1.0.

pivot_tran_ratio_df=pivot_tran_df.div(pivot_tran_df["ALL_SUM"], axis=0)

image.png

--Reference

Drop ALL_SUM and rename it because it's RATIO (percentage) instead of SUM (total).

import re
pivot_tran_ratio_df=pivot_tran_ratio_df\
    .drop(columns=['ALL_SUM'])\
    .rename(columns= lambda str:re.sub('_SUM','_RATIO',str))

image.png

Finally, (1) Rejoin the DataFrame of the total purchase amount for each product category to complete.

pivot_tran_df=pivot_tran_df.join(pivot_tran_ratio_df)

image.png

4. Sample

The sample is placed below.

stream https://github.com/hkwd/200611Modeler2Python/blob/master/Restructure/Restructure.str?raw=true notebook https://github.com/hkwd/200611Modeler2Python/blob/master/Restructure/restructure.ipynb data https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv

■ Test environment Modeler 18.2.1 Windows 10 64bit Python 3.6.9 pandas 0.24.1

5. Reference information

[Relay serialization] My recommended node --Awesome technician "reconstruction node" that generates features from POS with ID and IoT time series data | IBM Solution Blog https://www.ibm.com/blogs/solutions/jp -ja / spssmodeler-push-node-4 /

Recommended Posts

Rewrite SPSS Modeler reconfigure node in Python. Aggregation by purchased product category
Rewrite SPSS Modeler filter nodes in Python
Rewrite field order nodes in SPSS Modeler with Python.
Rewrite the record addition node of SPSS Modeler with Python.
Change node settings in supernodes with SPSS Modeler Python scripts
Rewrite the sampling node of SPSS Modeler with Python (2): Layered sampling, cluster sampling
Rewrite the sampling node of SPSS Modeler with Python ①: First N cases, random sampling
Rewrite SPSS Modeler reconfigure node in Python. Aggregation by purchased product category
Rewrite the record addition node of SPSS Modeler with Python.
Rewrite SPSS Modeler filter nodes in Python
Rewrite the sampling node of SPSS Modeler with Python ①: First N cases, random sampling
Rewrite field order nodes in SPSS Modeler with Python.
Rewrite the sampling node of SPSS Modeler with Python (2): Layered sampling, cluster sampling
Sort by date in python
How to pass arguments to a Python script in SPSS Modeler Batch
Using Python with SPSS Modeler extension node (2) Model creation using Spark MLlib
[Python] Create parallel coordinate plots color-coded by category in Plotly Express
Rewrite the field creation node of SPSS Modeler with Python. Feature extraction from time series sensor data