Rewrite the sampling node of SPSS Modeler with Python ①: First N cases, random sampling

When dealing with data from a large number of records, the analysis process takes time. In such cases, it is common to create a subset (sample) from all the cases, make it a processable size, and then analyze it. Also, even if the final analysis is for all cases, it is often the case that a small subset is used to lighten the process and perform a preliminary analysis. Making a small sample from all the data in this way is called sampling (sampling).

It is the sampling node that does this sampling in SPSS Modeler. I will explain this sampling node and rewrite it with Python pandas.

There are two types of sampling: (1) simple sampling and (2) complex sampling that reflects data trends. I will explain each of them in two articles.

① Simple sampling ← This article
①-1. First N cases
①-2. Random sampling
② Complex sampling
②-1. Layered sampling
②-2. Cluster sampling

0. raw data

The following POS data with ID is targeted. We use ID-attached POS data that records who (CUSTID) purchased when (SDATE) and what (PRODUCTID, L_CLASS product major classification, M_CLASS product middle classification) and how much (SUBTOTAL).

There are 28,599 cases in 6 fields. image.png

1m. ①-1. First N cases Modeler version

Extract 100 items from the beginning. This is a very light process. It is also useful when previewing data that is known to be large. This is the default mode for Modeler sampling.

■ Image after processing The order has not changed, and 100 items have been extracted. image.png

Use the sampling node. image.png

The sample method is "Simple", and the sample method is "First n cases" and 100 cases are specified. image.png

By the way, SQL pushback is effective for the "first n cases" of the sampling node. The following is an example of using SQL pushback for Db2. FETCH FIRST 100 ROWS ONLY is used.

SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM (SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0 FETCH FIRST 100 ROWS ONLY) T0

In the case of file input, the same result will be returned every time the "first n items" are executed repeatedly. However, RDB does not guarantee the order, so the same result may not always be returned. If you want to guarantee the same result, insert a sort node etc. to fix the order and then use the sampling node.

1p. ①-1. First N pandas version

If you want to get the first N items in pandas, use the head function.

df.head(100)

image.png

2m. ①-2. Random sampling Modeler version

Randomly extract n% from all cases. Sampling the first N records is a light method unless the original data tends to be biased by order. However, in general, data is often recorded in the order of occurrence, and the tendency of data often differs depending on the order of occurrence. For example, with one day's worth of data, it is possible that the trends in the data differ between morning and night. Random sampling (simple random sampling) is the most common sampling method that can deal with these problems.

Note that this process is an image of rolling dice for all records, so if the amount of data is large, it will be a heavy process.

After all it is done in the sampling node. image.png

The sample method is "Simple", and the sample method is "Random%", which is 0.2%.

Also, check the random seed setting. By checking this, the same result will be returned when sampling is executed repeatedly. It is usually checked because the analysis becomes unstable when the data changes. If you want to change the sampling result, click the Generate Seed button and change the seed to change the next sampling result. image.png

■ Image after processing 0.2% was randomly selected. 68 cases were extracted from 28,599 cases. Unlike the first N cases, the data of the first 100001 customers is not selected, and the transactions of each customer are also sparsely selected.

image.png

By the way, SQL pushback is effective for "Random%" of sampling node. The following is an example of using SQL pushback for Db2. BERNOULLI (0.200000) REPEATABLE (377028) is used. Different RDBs may use different methods. In particular, I feel that there may be RDBs that do not support seeds (the sampling results are not reproducible).

SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM (SELECT T0.CUSTID AS CUSTID,T0.SDATE AS SDATE,T0.PRODUCTID AS PRODUCTID,T0."L_CLASS" AS "L_CLASS",T0."M_CLASS" AS "M_CLASS",T0.SUBTOTAL AS SUBTOTAL FROM SAMPLETRANDEPT4EN2019S T0 TABLESAMPLE BERNOULLI(0.200000) REPEATABLE(377028)) T0

As pointed out earlier, random sampling can be a heavy process to process all cases, so it is better to cache the sampling results or output them to a file or table for more efficient subsequent analysis. I will.

image.png

You may also want to determine the number of cases and perform random sampling. In that case, calculate and set the ratio by (number of samplings) / (total number) * 100%. If you want to randomly sample 100 items with this data, it will be 100/28599 * 100 = 0.349%.

If you set the random% to 0.35 and specify the maximum number to 100, the number is limited to 100. image.png

■ Image after processing 100 random samplings were made. image.png

2p. ①-2. Random sampling pandas version

If you want to get random sampling with pandas, use the sample function. Set the percentage with frac. It is a decimal, not a fraction (1 is 100%). Here, 0.2% is specified. Specify a random seed with random_state.

df.sample(frac=0.002, random_state = 1)

image.png

If you want to reassign the index, reset_index (drop = True) is performed.

df.sample(frac=0.002, random_state = 1).reset_index(drop=True)

image.png

When specifying the number of cases, specify n instead of frac.

df.sample(n=100, random_state = 1)

image.png

3. Sample

The sample is placed below.

stream https://github.com/hkwd/200611Modeler2Python/raw/master/sample/sample.str notebook https://github.com/hkwd/200611Modeler2Python/blob/master/sample/sampling.ipynb data https://raw.githubusercontent.com/hkwd/200611Modeler2Python/master/data/sampletranDEPT4en2019S.csv

■ Test environment Modeler 18.2.1 Windows 10 64bit Python 3.6.9 pandas 0.24.1

4. Reference information

Sampling node https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.1/modeler_mainhelp_client_ddita/clementine/mainwindow_navigationstreamsoutputtab.html

Recommended Posts

Rewrite the sampling node of SPSS Modeler with Python ①: First N cases, random sampling
Rewrite the sampling node of SPSS Modeler with Python (2): Layered sampling, cluster sampling
Rewrite the record addition node of SPSS Modeler with Python.
Rewrite the field creation node of SPSS Modeler with Python. Feature extraction from time series sensor data
Rewrite field order nodes in SPSS Modeler with Python.
Rewrite duplicate record nodes in SPSS Modeler in Python. ① Identify the item you purchased first. (2) Identification of the top-selling item in the product category
Change node settings in supernodes with SPSS Modeler Python scripts
First Python 3 ~ The beginning of repetition ~
Using Python with SPSS Modeler extension node (2) Model creation using Spark MLlib
Create a compatibility judgment program with the random module of python.
Check the existence of the file with python
[Python3] Rewrite the code object of the function
Rewrite SPSS Modeler filter nodes in Python
Rewrite SPSS Modeler reconfigure node in Python. Aggregation by purchased product category
The result of making the first thing that works with Python (image recognition)
Note: How to get the last day of the month with python (added the first day of the month)
Prepare the execution environment of Python3 with Docker
2016 The University of Tokyo Mathematics Solved with Python
[Note] Export the html of the site with python.
Calculate the total number of combinations with python
Check the date of the flag duty with Python
Rewrite the name of the namespaced tag with lxml
Find the geometric mean of n! Using Python
Convert the character code of the file with Python3
[Python] Determine the type of iris with SVM
The first step of machine learning ~ For those who want to implement with python ~
Second half of the first day of studying Python Try hitting the Twitter API with Bottle
The first algorithm to learn with Python: FizzBuzz problem
Destroy the intermediate expression of the sweep method with Python
Calculate the regression coefficient of simple regression analysis with python
Summary of the basic flow of machine learning with Python
Record of the first machine learning challenge with Keras
[Python] How to rewrite the table style with python-pptx [python-pptx]
Get the operation status of JR West with Python
Why is the first argument of [Python] Class self?
Extract the band information of raster data with python
Version control of Node, Ruby and Python with anyenv
the zen of Python
I tried to find the entropy of the image with python
Try scraping the data of COVID-19 in Tokyo with Python
The story of implementing the popular Facebook Messenger Bot with python
Unify the environment of the Python development team starting with Poetry
Visualize the results of decision trees performed with Python scikit-learn
The first API to make with python Djnago REST framework
Calculate the square root of 2 in millions of digits with python
I wrote the basic grammar of Python with Jupyter Lab
Run the intellisense of your own python library with VScode.
Check the scope of local variables with the Python locals function.
Let's touch the API of Netatmo Weather Station with Python. #Python #Netatmo
Using Python with SPSS Modeler extension nodes ① Setup and visualization
The story of rubyist struggling with python :: Dict data with pycall
[Homology] Count the number of holes in data with Python
Try to automate the operation of network devices with Python
[Python] The first step to making a game with Pyxel
Estimate the attitude of AR markers with Python + OpenCV + drone
Play with the password mechanism of GitHub Webhook and Python
Get the source of the page to load infinitely with python.
I tried to solve the first question of the University of Tokyo 2019 math entrance exam with python sympy