Using Python with SPSS Modeler extension nodes ① Setup and visualization

0. Introduction

SPSS Modeler provides many of the functions commonly used in analysis, but there are cases where you also want to use R or Python. You can run R and Python from SPSS Modeler by using its extension nodes.

This time I will try Python integration. If you want to try R integration, see @kawada2017's "Use R with SPSS Modeler extension nodes".

One thing to keep in mind when using Python with SPSS Modeler extension nodes is that input and output data are handled in Spark DataFrame format. Of course, by converting the Spark DataFrame to a pandas DataFrame inside your code, you can process the data with pandas methods and visualize it with libraries such as matplotlib or seaborn.

■ Test environment
SPSS Modeler 18.2.2
Windows 10
Python 3.7.7

1. Installing Python 3.7.x

SPSS Modeler 18.2.2 ships with Python 3.7.7, but it is also possible to use a Python environment that you have installed separately.

Since we will be installing additional libraries, this time we will use our own Python installation. At the time of this writing, Modeler 18.2.2 supports Python 3.7.x. Please check the manual for the latest support status.
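Because Modeler 18.2.2 is tied to Python 3.7.x, a quick sanity check of the interpreter version can save debugging time later. A minimal sketch (the helper function name is my own, not part of any Modeler API):

```python
import sys

def is_supported_python(required=(3, 7)):
    """Return True if the running interpreter matches the major.minor
    version that Modeler 18.2.2 expects (3.7.x)."""
    return sys.version_info[:2] == required

# Print the running version and whether it matches the requirement
print(sys.version.split()[0], "supported:", is_supported_python())
```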

Script using Python for Spark https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api.html

**1. Download the Python 3.7.x installer**

Download the 3.7.x installer from the link below. https://www.python.org/downloads/

**2. Run the Python 3.7.x installer**

Launch the installer as an administrator and select Customize installation. image.png

Leave the Optional Features at their defaults and proceed to the next screen. image.png

Check Install for all users and press Install. At this time, also make a note of the install location; in this example it is C:\Program Files\Python37. image.png

After the Python installation is complete, check your Python version. Start PowerShell and execute the following commands.

cd <python_install_location>
python -V

Execution example


PS C:\Users\Administrator> cd "C:\Program Files\Python37"
PS C:\Program Files\Python37> python -V
Python 3.7.7

**3. Install additional Python libraries**

Install any Python libraries you need. In this example, numpy, pandas, matplotlib, and seaborn are installed for visualization.

python -m pip install <Library name>

Execution example


PS C:\Program Files\Python37> python -m pip install numpy pandas matplotlib seaborn
Collecting numpy
  Downloading https://files.pythonhosted.org/packages/5f/a5/24db9dd5c4a8b6c8e495289f17c28e55601769798b0e2e5a5aeb2abd247b/numpy-1.19.4-cp37-cp37m-win_amd64.whl (12.9MB)
(Abbreviation)
Successfully installed ~ (abbreviated)
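After installation, you can confirm from Python that the libraries are importable before wiring them into Modeler. A small stdlib-only check (the function name is my own; the library names are the four used in this article):

```python
import importlib.util

def missing_libraries(names):
    """Return the subset of module names that cannot be found by the
    import system, without actually importing them."""
    return [n for n in names if importlib.util.find_spec(n) is None]

libs = ["numpy", "pandas", "matplotlib", "seaborn"]
print("missing:", missing_libraries(libs) or "none")
```

If anything is listed as missing, re-run the `pip install` step for that library before continuing.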

2. Specify your own Python installation for SPSS Modeler

Open options.cfg in a text editor such as Notepad. The default path of options.cfg is: C:\Program Files\IBM\SPSS\Modeler\18.2.2\config\options.cfg

Specify the path to python.exe in eas_pyspark_python_path and save the file.

options.cfg


# Set to the full path to the python executable (including the executable name) to enable use of PySpark.
eas_pyspark_python_path, "<python_install_location>\\python.exe"

Example entry


# Set to the full path to the python executable (including the executable name) to enable use of PySpark.
eas_pyspark_python_path, "C:\\Program Files\\Python37\\python.exe"

Restart the SPSS Modeler 18.2.2 service for the changes to take effect.

image.png

3. Operation check with SPSS Modeler

Download the iris dataset to use for the operation check. https://github.com/mwaskom/seaborn-data/blob/master/iris.csv

Start SPSS Modeler, add a Var. File node, and load iris.csv. image.png

Select the Extension Output node from the Output tab and connect it. image.png

Open the Extension Output node's properties, enter the following code under Python for Spark syntax, and run it. The code uses the seaborn library to output a pair plot (scatter plot matrix). The details of the syntax are explained later. Execution may take a few minutes.

Syntax


import spss.pyspark.runtime

ascontext = spss.pyspark.runtime.getContext()

import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

indf = ascontext.getSparkInputData()
sns.pairplot(indf.toPandas(), hue='species')
plt.show()

image.png

When execution completes, the pair plot (scatter plot matrix) is displayed in a Python window as shown below. You can examine the relationships between all the variables at once. With R, graphs are output to the Graph Output tab of the Extension Output node, but Python does not currently support this, so the graph is displayed in a separate Python window instead. image.png

Example of graph output with the R Extension Output node, quoted from @kawada2017's "Using R in SPSS Modeler extension nodes". image.png

Incidentally, while the graph is displayed in the Python window, the stream remains running as shown below. You need to close the Python window for the stream processing to complete. image.png

I thought it would be nice if Python also supported graph output to the Graph Output tab of the Extension Output node like R, so I posted it as an idea for a new feature. If you agree, I would appreciate it if you could vote via the URL below. Development of the feature is not guaranteed, but it will be considered if it gathers enough votes.

Extension Output Node for Python with Graph Output tab like R https://ibm-data-and-ai.ideas.aha.io/ideas/MDLR-I-329

(Supplement) About the Python for Spark syntax

This section explains the syntax entered in the Extension Output node in more detail.

When using Python with SPSS Modeler extension nodes, input and output data must be handled in Spark DataFrame format. You also need to go through the Analytics Server context wrapper interface provided by SPSS Modeler instead of calling the Spark API directly.

Each part of the full syntax is explained below.

(Repost) Overall syntax


import spss.pyspark.runtime

ascontext = spss.pyspark.runtime.getContext()

import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

indf = ascontext.getSparkInputData()
sns.pairplot(indf.toPandas(), hue='species')
plt.show()

First, the first half. This part must always be included when using Python in an extension node. It imports the spss.pyspark.runtime library for interacting with Analytics Server, and stores an Analytics Server context object in the variable ascontext.

First half of syntax


import spss.pyspark.runtime # Load the library for Analytics Server context processing

ascontext = spss.pyspark.runtime.getContext() # Define the Analytics Server context object

The second half is the process that creates the pair plot. The flow is almost the same as using seaborn in plain Python, but a DataFrame conversion step is added.

First, the data from the previous node is stored in a variable called indf. Here, the data read from iris.csv is stored as a Spark DataFrame.

Next, create the pair plot with seaborn. Since seaborn cannot accept a Spark DataFrame directly, convert it to a pandas DataFrame and pass that instead. The conversion is done with the toPandas() method.
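For reference, the object returned by toPandas() is an ordinary pandas DataFrame, so all the usual pandas methods apply afterwards. A small illustrative stand-in (the values are made up; the column names match iris.csv):

```python
import pandas as pd

# Illustrative stand-in for indf.toPandas(); values are made up
pdf = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.3, 2.7],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

print(pdf.shape)                      # (4, 3)
print(list(pdf["species"].unique()))  # ['setosa', 'virginica']
```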

Finally, plt.show() displays the pair plot.

Second half of the syntax


import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

indf = ascontext.getSparkInputData() # Read the iris.csv data from the previous node
sns.pairplot(indf.toPandas(), hue='species') # Convert to a pandas DataFrame with toPandas() and pass it to seaborn; hue='species' colors the plot by species
plt.show() # Show the figure

Reference

See below for more information on Analytics Server contexts.

Script using Python for Spark https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api.html

Analytic Server Context https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api_context.html
