Rewrite duplicate record nodes in SPSS Modeler in Python. ① Identify the item you purchased first. (2) Identification of the top-selling item in the product category

Duplicate nodes are what SPSS Modeler puts together duplicate records. I will explain this duplicate node and rewrite it with Python pandas.

Duplicate nodes can also be used to literally delete completely duplicate records, but when used in Modeler, they are often used to get the first record in a group.

Therefore, here we will explain using the POS data with ID in the following two use cases. ① Identify the first purchased item (2) Identification of the top-selling item in the product category

0. raw data

The following POS data with ID is targeted. We use ID-attached POS data that records who (CUSTID) purchased when (SDATE) and what (PRODUCTID, L_CLASS product major classification, M_CLASS product middle classification) and how much (SUBTOTAL).

There are 28,599 cases in 6 fields. image.png

1m. (1) Identification of the first purchased item Modeler version

Sometimes the first purchase will give you an idea of what you're buying afterwards. If you can see the characteristics of the product, such as the person who bought the first product is likely to be a good customer, and the person who bought a certain product is likely to leave without a second purchase, the product that is likely to be a good customer You can think of measures such as prominent shelving allocation and banner placement.

Let's extract the first purchase transaction for each CUSTID from the data of POS with ID.

image.png

Set the mode of the duplicate record node to "Include only the first record in each group". The key field for grouping should be CUSTID only. This will extract only one record for one CUSTID. Specify SDATE in ascending order in "Sort records in group". This will select the oldest transaction for each CUSTID.

image.png

The result is as follows. The first item purchased by the person with CUSTID No. 100001 was BAG01 9937845. image.png

1p. ① Identification of the first purchased item pandas version

Use the sort_values and drop_duplicates functions to combine duplicate lines into one line in pandas. First, sort by ‘CUSTID’ in sort_values in SDATE order. Then drop_duplicates extracts only the first line for each CUSTID. The point is to detect duplicates only with ‘CUSTID’ in the subset.

#Sort by CUSTID and SDATE
df_sorted=df.sort_values(['CUSTID','SDATE'])
#Extract the first line for each CUSTID
df_sorted.drop_duplicates(subset='CUSTID')

image.png

2m. (2) Identification of the top-selling item in the product category Modeler version

The duplicate record node is often used together with the record aggregation node to extract the number one or worst one record in the aggregation result.

Here, we will try to extract the products with the highest sales in the major product categories of BAG, COSMETICS, and SHOES.

image.png

First, calculate the total of SUBTOTAL by grouping by L_CLASS and PRODUCTID in the record aggregation. image.png

The total sales amount for each product is now calculated. image.png

Next, set the following on the duplicate node. Set the mode to "Include only the first record in each group". The key field for grouping should be L_CLASS only. This will extract only one record for one L_CLASS. Specify SUBTOTAL_Sum in ascending order in "Sort records in group". With this, the product with the highest sales is selected for each L_CLASS.

image.png

The result is as follows. The best-selling BAG product was 9900307, which sold 1,781,615 yen.

image.png

2p. (2) Identification of the top-selling item in the product category pandas version

Use the groupby, sort_values, and drop_duplicates functions to extract the number one from the aggregated results with pandas. The process is the same as using the aggregation node and duplicate record node in Modeler.

First, group by L_CLASS and PRODUCTID and calculate the total of SUBTOTAL. Next, sort by'L_CLASS'in sort_values in descending order of'SUBTOTAL'. Then drop_duplicates extracts only the first line for each L_CLASS.

#Total sales calculated by major classification and PRODUCT ID
df_sum=df[['L_CLASS','PRODUCTID','SUBTOTAL']].groupby(['L_CLASS','PRODUCTID'],as_index=False).sum()
#Sort by sales within the major categories
df_sum_sorted=df_sum.sort_values(['L_CLASS','SUBTOTAL'],ascending=[True,False])
#Extract the first line of the major classification
df_sum_sorted.drop_duplicates(['L_CLASS'])

I was able to extract as follows. image.png

3. Sample

The sample is placed below.

stream https://github.com/hkwd/200611Modeler2Python/raw/master/distinct/distinct.str notebook https://github.com/hkwd/200611Modeler2Python/blob/master/distinct/distinct.ipynb data https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv

■ Test environment Modeler 18.2.1 Windows 10 64bit Python 3.6.9 pandas 0.24.1

4. Reference information

Duplicate record node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/distinct_settingstab.html

Recommended Posts

Rewrite duplicate record nodes in SPSS Modeler in Python. ① Identify the item you purchased first. (2) Identification of the top-selling item in the product category
Rewrite SPSS Modeler reconfigure node in Python. Aggregation by purchased product category
Rewrite the record addition node of SPSS Modeler with Python.
Rewrite SPSS Modeler filter nodes in Python
Rewrite the sampling node of SPSS Modeler with Python ①: First N cases, random sampling
Rewrite field order nodes in SPSS Modeler with Python.
Rewrite the sampling node of SPSS Modeler with Python (2): Layered sampling, cluster sampling