The duplicate record node is what SPSS Modeler provides for handling duplicate records. In this article I will explain this node and reimplement its behavior with Python pandas.
The duplicate record node can of course be used to simply delete fully duplicated records, but in Modeler it is more often used to extract the first record in each group.
Here we will explain it using POS data with customer IDs, covering the following two use cases: ① identifying each customer's first purchased item, and ② identifying the top-selling item in each product category.
We target the following POS data with customer IDs: transaction records of who (CUSTID) purchased what (PRODUCTID, with L_CLASS as the product major classification and M_CLASS as the product middle classification), when (SDATE), and for how much (SUBTOTAL).
There are 28,599 records in 6 fields.
A customer's first purchase can sometimes predict what they will buy afterwards. If we can identify product characteristics, such as products whose first-time buyers tend to become good customers, or products whose buyers tend to churn without making a second purchase, we can plan measures such as prominent shelf allocation or banner placement for the products that tend to produce good customers.
Let's extract the first purchase transaction for each CUSTID from the ID-attached POS data.
Set the mode of the duplicate record node to "Include only the first record in each group". Use CUSTID as the only key field for grouping, so that exactly one record is extracted per CUSTID. In "Sort records in group", specify SDATE in ascending order, so that the oldest transaction is selected for each CUSTID.
The result is as follows. The first item purchased by the person with CUSTID No. 100001 was BAG01 9937845.
In pandas, use the sort_values and drop_duplicates functions to reduce duplicate rows to one. First, sort_values sorts by 'CUSTID' and then by 'SDATE'. Then drop_duplicates extracts only the first row for each CUSTID. The point is to detect duplicates using only 'CUSTID' as the subset.
```python
# Sort by CUSTID and SDATE
df_sorted = df.sort_values(['CUSTID', 'SDATE'])
# Extract the first row for each CUSTID
df_sorted.drop_duplicates(subset='CUSTID')
```
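To see how these two calls behave, here is a minimal, runnable sketch using hypothetical mini data (the CUSTID, SDATE, and PRODUCTID values below are made up for illustration and are not the article's dataset):

```python
import pandas as pd

# Hypothetical mini POS data: two customers, two transactions each
df = pd.DataFrame({
    'CUSTID':    [100001, 100001, 100002, 100002],
    'SDATE':     ['2019-03-01', '2019-01-15', '2019-02-10', '2019-02-01'],
    'PRODUCTID': ['A', 'B', 'C', 'D'],
})

# Sort so the oldest transaction comes first within each CUSTID
df_sorted = df.sort_values(['CUSTID', 'SDATE'])

# drop_duplicates keeps the first occurrence per CUSTID by default (keep='first'),
# i.e. the oldest transaction after the sort above
first_purchase = df_sorted.drop_duplicates(subset='CUSTID')
print(first_purchase)
```

After sorting, the oldest row per customer comes first, so drop_duplicates keeps product B for customer 100001 and product D for customer 100002. Passing `keep='last'` instead would select each customer's most recent purchase.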
The duplicate record node is often used together with the record aggregation node to extract the top (or bottom) record from an aggregation result.
Here, we will try to extract the products with the highest sales in the major product categories of BAG, COSMETICS, and SHOES.
First, calculate the total of SUBTOTAL by grouping by L_CLASS and PRODUCTID in the record aggregation.
The total sales amount for each product is now calculated.
Next, configure the duplicate record node as follows. Set the mode to "Include only the first record in each group". Use L_CLASS as the only key field for grouping, so that exactly one record is extracted per L_CLASS. In "Sort records in group", specify SUBTOTAL_Sum in descending order. This selects the product with the highest sales for each L_CLASS.
The result is as follows. The best-selling BAG product was 9900307, which sold 1,781,615 yen.
Use the groupby, sort_values, and drop_duplicates functions to extract the number one from the aggregated results with pandas. The process is the same as using the aggregation node and duplicate record node in Modeler.
First, group by L_CLASS and PRODUCTID and calculate the total of SUBTOTAL. Next, sort_values sorts by 'L_CLASS' in ascending order and, within it, by 'SUBTOTAL' in descending order. Then drop_duplicates extracts only the first row for each L_CLASS.
```python
# Total sales by major classification and PRODUCTID
df_sum = df[['L_CLASS', 'PRODUCTID', 'SUBTOTAL']].groupby(['L_CLASS', 'PRODUCTID'], as_index=False).sum()
# Sort by sales (descending) within each major classification
df_sum_sorted = df_sum.sort_values(['L_CLASS', 'SUBTOTAL'], ascending=[True, False])
# Extract the first row for each major classification
df_sum_sorted.drop_duplicates(['L_CLASS'])
```
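The aggregate-sort-deduplicate pipeline above can be traced end to end with a minimal, runnable sketch on hypothetical mini data (the class names and amounts below are made up for illustration and are not the article's dataset):

```python
import pandas as pd

# Hypothetical mini data: three BAG transactions and two SHOES transactions
df = pd.DataFrame({
    'L_CLASS':   ['BAG', 'BAG', 'BAG', 'SHOES', 'SHOES'],
    'PRODUCTID': ['B1', 'B2', 'B1', 'S1', 'S2'],
    'SUBTOTAL':  [100, 300, 250, 50, 80],
})

# Step 1: total sales per (L_CLASS, PRODUCTID)
df_sum = df.groupby(['L_CLASS', 'PRODUCTID'], as_index=False)['SUBTOTAL'].sum()

# Step 2: within each L_CLASS, put the highest sales first
df_sum_sorted = df_sum.sort_values(['L_CLASS', 'SUBTOTAL'], ascending=[True, False])

# Step 3: keep the first (= top-selling) row per L_CLASS
top_per_class = df_sum_sorted.drop_duplicates('L_CLASS')
print(top_per_class)
```

Here B1 totals 350 (100 + 250) and beats B2's 300 in BAG, while S2's 80 beats S1's 50 in SHOES, so B1 and S2 are extracted. An equivalent idiom for the last step is `df_sum_sorted.groupby('L_CLASS').head(1)`, which also keeps the first row of each group.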
The extraction result is as follows.
The sample is placed below.
stream: https://github.com/hkwd/200611Modeler2Python/raw/master/distinct/distinct.str
notebook: https://github.com/hkwd/200611Modeler2Python/blob/master/distinct/distinct.ipynb
data: https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv
■ Test environment
Modeler 18.2.1
Windows 10 64bit
Python 3.6.9
pandas 0.24.1
Duplicate record node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/distinct_settingstab.html