Rewrite the sampling node of SPSS Modeler with Python (2): Stratified sampling, cluster sampling

SPSS Modeler uses the sampling node for sampling. This article explains the sampling node and rewrites it with Python pandas.

There are two types of sampling: ① simple sampling and ② complex sampling, which reflects the tendencies of the data. The previous article explained ① simple sampling. This article explains ② complex sampling.

① Simple sampling
①-1. First N cases
①-2. Random sampling
② Complex sampling ← This article
②-1. Stratified sampling
②-2. Cluster sampling

0. Raw data

The target is POS data with customer IDs. Each record shows who (CUSTID) purchased what (PRODUCTID, L_CLASS: product major classification, M_CLASS: product middle classification), when (SDATE), and for how much (SUBTOTAL).

There are 28,599 records with 6 fields.

1m. ②-1. Stratified sampling Modeler version

Random sampling can reflect the tendencies of the whole data set if enough records are sampled. However, when the distribution is heavily skewed, some values make up only a small proportion of the data. If the sample is small, it may fail to reflect such values.

For example, look at the distribution of M_CLASS (product middle classification) in this data. SHOES01 was sold 631 times, which is only 2.21% of the total.

After sampling this data at 0.2%, SHOES01 has disappeared from the distribution of M_CLASS (product middle classification). The proportions of the other items also differ from the original distribution.
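How likely is it that SHOES01 disappears? The following is a rough calculation, not something taken from Modeler. It assumes that the 0.2% sample is exactly round(28599 × 0.002) = 57 records drawn without replacement; Modeler's internal sampling mechanism may differ.

```python
# Probability that a simple random sample contains no SHOES01 record at all.
# Assumption: exactly round(N * rate) records are drawn without replacement.
N = 28599      # total records (from the article)
K = 631        # SHOES01 records (from the article)
rate = 0.002   # 0.2% sampling rate
n = round(N * rate)  # number of sampled records

# Hypergeometric probability of drawing zero SHOES01 records:
# every successive draw must come from the N - K other records.
p_missing = 1.0
for i in range(n):
    p_missing *= (N - K - i) / (N - i)

print(n, p_missing)
```

Under these assumptions the probability is roughly 0.28, so losing a category of this size in a 0.2% sample is not a rare accident.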

Ideally the sample size should be increased in such a case, but when only a small sample can be taken, for example for verification data, stratified sampling can be used.

Stratified sampling samples the data separately for each stratum (layer). In this example, the data is sampled within each value of M_CLASS (product middle classification).

Stratified sampling is also performed with the sampling node. Set the sample method to "Complex" and specify the sample size; here, 0.002 (0.2%) is specified. Then click the Cluster and Stratify button to specify the stratification variable; here, M_CLASS (product middle classification) is specified.

Also, the random seed setting is checked so that sampling can be reproduced.

The result has a column called SampleWeight, which holds the weights used internally during sampling. You can see that the value is the same within each M_CLASS. It is normally not needed, so you can remove it with a filter node.

Looking at the distribution of M_CLASS (product middle classification) in the sampled result, SHOES01 is present and the distribution is close to that of the full data.

Note that SQL pushback does not work for stratified sampling. The node turns purple, and the generated SQL seems to look for empty strings in the stratification column, but the sampling itself is not converted to SQL.

1p. ②-1. Stratified sampling pandas version

Use the groupby and sample functions for stratified sampling in pandas. First, group by 'M_CLASS'. Specifying group_keys=False keeps the group keys out of the index, so the result is not multi-indexed.

Then, using a lambda expression, sample draws a 0.2% random sample from the block of data of each M_CLASS.

# Draw a 0.2% random sample within each M_CLASS group (stratum)
Stratified_df = df.groupby('M_CLASS', group_keys=False)\
    .apply(lambda x: x.sample(frac=0.002, random_state=1))

The result is a 0.2% sample drawn from each M_CLASS group.
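As a side note, pandas 1.1 and later (newer than the 0.24.1 used in this article) also provide a sample method directly on the groupby object, which avoids apply and lambda. The following is a minimal sketch on made-up data: df_demo, the group sizes, and the category names BAG01 and COAT01 are hypothetical and only stand in for the POS data.

```python
import pandas as pd

# Hypothetical stand-in for the POS data: three M_CLASS groups of different sizes.
df_demo = pd.DataFrame({
    'M_CLASS': ['SHOES01'] * 1000 + ['BAG01'] * 5000 + ['COAT01'] * 4000,
})
df_demo['SUBTOTAL'] = range(len(df_demo))

# DataFrameGroupBy.sample (pandas >= 1.1) draws the sample within each group.
stratified_demo = df_demo.groupby('M_CLASS').sample(frac=0.01, random_state=1)

# Each group contributes 1% of its own rows: 10, 50 and 40 records.
print(stratified_demo['M_CLASS'].value_counts().sort_index())
```

Because the fraction is applied per group, even the smallest group is guaranteed to appear in the sample as long as the fraction of its size rounds to at least one row.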

Another option is to use scikit-learn's StratifiedShuffleSplit. This is an object that performs stratified sampling when splitting data into training data and test data.

The sample sizes of the training data and test data are set by the train_size and test_size arguments of StratifiedShuffleSplit, and random_state is the random seed. Because the object is designed for splitting training data and test data, both train_size and test_size are specified here.

Calling the split function of the instantiated object (sample) with the DataFrame (df) and the column to stratify by (df['M_CLASS']) returns the row indices (train_, test_) of the training data and test data. A new DataFrame (StratifiedShuffleSplit_df) is then created from the training indices.

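from sklearn.model_selection import StratifiedShuffleSplit

# train_size is the stratified sample (0.2%); the test split is not used here
sample = StratifiedShuffleSplit(n_splits=1, train_size=0.002, test_size=0.01, random_state=1)
for train_, test_ in sample.split(df, df['M_CLASS']):
    # split() returns positional row indices, so select the rows with iloc
    StratifiedShuffleSplit_df = df.iloc[train_]
#    chunk_test = df.iloc[test_]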

Comparing the distribution of M_CLASS among the full data, the stratified samples, and the simple random sample shows that SHOES01 is missing from the simple random sample, which therefore does not reflect the distribution of the full data.
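A comparison like this can be put together with value_counts(normalize=True) and concat. The following is only a sketch on made-up data, not the notebook's code: df_demo, the group sizes, and the category names BAG01 and COAT01 are hypothetical, and groupby(...).sample requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical stand-in data in which SHOES01 is only a small share.
df_demo = pd.DataFrame({
    'M_CLASS': ['SHOES01'] * 200 + ['BAG01'] * 5800 + ['COAT01'] * 4000,
})

# Simple random sample and stratified sample at the same 1% rate.
random_demo = df_demo.sample(frac=0.01, random_state=1)
stratified_demo = df_demo.groupby('M_CLASS').sample(frac=0.01, random_state=1)

# Share of each M_CLASS in the full data and in each sample, side by side.
# A category missing from a sample shows up as 0 after fillna.
comparison = pd.concat(
    {
        'all': df_demo['M_CLASS'].value_counts(normalize=True),
        'random': random_demo['M_CLASS'].value_counts(normalize=True),
        'stratified': stratified_demo['M_CLASS'].value_counts(normalize=True),
    },
    axis=1,
).fillna(0)

print(comparison)
```

In this sketch the stratified column matches the full data exactly because every group size is a multiple of 100; with real data the shares are only approximately equal.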

2m. ②-2. Cluster sampling Modeler version

The data used here consists of purchase transactions. Random sampling from the whole data thins out the items purchased by each customer. The number of purchases and the purchase amount per customer become small, which makes it hard to grasp the purchasing tendencies of, for example, people who "buy SHOES often". You can still analyze which products sell best across all transactions, but the data is not suitable for customer-oriented analysis.

In such a case, use cluster sampling (sampling by aggregated ID), which samples at the customer ID level. Because sampling is done by customer ID, all transactions of the sampled customers are retained, which makes customer-level analysis possible.

Cluster sampling is also performed with the sampling node. Set the sample method to "Complex" and specify the sample size; here, 0.1 (10%) is specified. Then click the Cluster and Stratify button to specify the cluster variable; here, CUSTID is specified.

Also, the random seed setting is checked so that sampling can be reproduced.

10% of all CUSTIDs were randomly sampled, and the transactions of the sampled CUSTIDs were retained, giving 2,652 records. A SampleWeight column has also been added, but I do not think it is needed here.

However, SQL pushback does not work when cluster sampling is performed with this function of the sampling node. It is therefore recommended to create the unique CUSTIDs with record aggregation, sample them with simple random sampling, and then join the result back to the original data.

Create a data set of unique CUSTIDs with record aggregation.

Set the sample method to "Simple" and specify a random percentage of 10%.

Then join the sampled CUSTIDs to the transactions of the original data.

With this method, SQL pushback works. The random sampling is done by RAND(2743707) < 1.0000000000000001e-01, and the transactions are joined by WHERE (T0.CUSTID = T1.CUSTID).

[2020-08-12 12:58:45] Previewing SQL: SELECT T1.SDATE AS SDATE, T1.PRODUCTID AS PRODUCTID, T1."L_CLASS" AS "L_CLASS", T1."M_CLASS" AS "M_CLASS", T1.SUBTOTAL AS SUBTOTAL, T0.CUSTID AS CUSTID FROM (SELECT T0.CUSTID AS CUSTID FROM (SELECT T0.CUSTID AS CUSTID FROM SAMPLETRANDEPT4EN2019S T0 GROUP BY T0.CUSTID) T0 WHERE RAND(2743707) < 1.0000000000000001e-01) (SELECT T0.CUSTID AS CUSTID, T0.SDATE AS SDATE, T0.PRODUCTID AS PRODUCTID, T0."L_CLASS" AS "L_CLASS", T0."M_CLASS" AS "M_CLASS", T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0) T1 WHERE (T0.CUSTID = T1.CUSTID)
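For readers who find Python easier to follow than SQL, the logic of this query can be sketched roughly as follows. This is an illustration on made-up data, not a translation produced by Modeler: df_demo is hypothetical, and NumPy's random generator is not the database's RAND function, so the same seed does not select the same customers.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: customer i has (i % 5) + 1 rows.
df_demo = pd.DataFrame({
    'CUSTID': [i for i in range(1000) for _ in range(i % 5 + 1)],
})
df_demo['SUBTOTAL'] = 100

# Inner query: one row per CUSTID (the GROUP BY T0.CUSTID part).
custids = df_demo[['CUSTID']].drop_duplicates()

# WHERE RAND(...) < 0.1: each customer is kept independently with
# probability 0.1, so the share of sampled customers is only about 10%.
rng = np.random.default_rng(2743707)
sampled_custids = custids[rng.random(len(custids)) < 0.1]

# Outer query: join the sampled customers back to their transactions.
cluster_demo = df_demo.merge(sampled_custids, on='CUSTID', how='inner')

print(len(sampled_custids), len(cluster_demo))
```

Note the design difference from a fixed-size sample: a filter of the form "random number < 0.1" decides for each customer separately, so the number of sampled customers varies from run to run around 10%.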

2p. ②-2. Cluster sampling pandas version

Use the unique, sample, and isin functions for cluster sampling with pandas. The process is the same as combining the aggregation node, sampling node, and record join node in Modeler.

unique creates the set of distinct CUSTIDs, and sample draws a random sample from them. isin then extracts from the original transactions only the rows whose CUSTID was sampled.

# Sample 10% of the unique CUSTIDs
df_custid = pd.Series(df['CUSTID'].unique()).sample(frac=0.1, random_state=1)
# Keep every transaction of the sampled CUSTIDs
df[df['CUSTID'].isin(df_custid)]

The result is a cluster sample in which all transactions of the sampled CUSTIDs are kept.
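To see the difference from row-level random sampling, the number of rows per customer can be compared. The following is a minimal sketch on made-up data; df_demo and its contents are hypothetical stand-ins for the POS data.

```python
import pandas as pd

# Hypothetical transactions: 100 customers, customer i has (i % 5) + 1 rows.
df_demo = pd.DataFrame({
    'CUSTID': [i for i in range(100) for _ in range(i % 5 + 1)],
})

# Cluster sampling: sample 10% of the customers, then keep all of their rows.
custid_sample = pd.Series(df_demo['CUSTID'].unique()).sample(frac=0.1, random_state=1)
cluster_demo = df_demo[df_demo['CUSTID'].isin(custid_sample)]

# Row-level random sampling at the same rate, for comparison.
random_demo = df_demo.sample(frac=0.1, random_state=1)

# Rows per customer in the full data and in each sample (NaN = customer absent).
rows_per_customer = pd.DataFrame({
    'all': df_demo.groupby('CUSTID').size(),
    'cluster': cluster_demo.groupby('CUSTID').size(),
    'random': random_demo.groupby('CUSTID').size(),
})

# Customers in the cluster sample keep every one of their rows.
print(rows_per_customer.dropna(subset=['cluster']))
```

In the cluster column every sampled customer has the same row count as in the full data, while the row-level sample can only keep the same number of rows per customer or fewer.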

3. Sample

The samples are available below.

stream: https://github.com/hkwd/200611Modeler2Python/raw/master/sample/sample.str
notebook: https://github.com/hkwd/200611Modeler2Python/blob/master/sample/sampling.ipynb
data: https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv

■ Test environment
Modeler 18.2.1
Windows 10 64bit
Python 3.6.9
pandas 0.24.1

4. Reference information

Random sampling - Wikipedia
https://en.wikipedia.org/wiki/%E7%84%A1%E4%BD%9C%E7%82%BA%E6%8A%BD%E5%87%BA#%E7%B5%B1%E8%A8%88%E8%AA%BF%E6%9F%BB%E3%81%AB%E3%81%8A%E3%81%91%E3%82%8B%E7%84%A1%E4%BD%9C%E7%82%BA%E6%8A%BD%E5%87%BA%E3%81%AE%E6%89%8B%E6%B3%95
The explanation of the stratified sampling method and the cluster sampling method is easy to understand.

Sampling node
https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/mainwindow_navigationstreamsoutputtab.html
