SPSS Modeler provides many of the functions commonly used in analysis, but there are cases where you also want to use functions from R or Python. You can run R and Python from SPSS Modeler by using its extension nodes.
This time I will try Python integration. If you want to try R integration, see @kawada2017's article, Use R with SPSS Modeler extension node.
One thing to keep in mind when using Python with SPSS Modeler extension nodes is that they handle input/output data in Spark DataFrame format. Of course, by converting the Spark DataFrame to a pandas DataFrame within your processing, you can work with the data using pandas methods and visualize it with libraries such as matplotlib and seaborn.
■ Test environment
SPSS Modeler 18.2.2
Windows 10
Python 3.7.7
SPSS Modeler 18.2.2 ships with Python 3.7.7, but it is also possible to use a Python environment installed separately by the user.
Since additional libraries will be installed, this time we will use our own installed Python environment. At the time of writing, Modeler 18.2.2 supports Python 3.7.x; please check the manual for the latest support status.
Script using Python for Spark https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api.html
** 1. Download Python 3.7.x installer **
Download the 3.7.x installer from the link below. https://www.python.org/downloads/
** 2. Run Python 3.7.x installer **
Launch the installer as an administrator and select Customize installation.
Leave Optional Features at their defaults and proceed to the next screen.
Check Install for all users and press Install. Also make a note of the install location at this point. In this example, it is C:\Program Files\Python37.
After the Python installation is complete, check your Python version. Start PowerShell and execute the following commands.
cd <python_install_location>
python -V
Execution example
PS C:\Users\Administrator> cd "C:\Program Files\Python37"
PS C:\Program Files\Python37> python -V
Python 3.7.7
** 3. Install additional Python libraries **
Install any Python libraries you need. In this example, numpy and pandas are installed, along with matplotlib and seaborn for visualization.
python -m pip install <Library name>
Execution example
PS C:\Program Files\Python37> python -m pip install numpy pandas matplotlib seaborn
Collecting numpy
Downloading https://files.pythonhosted.org/packages/5f/a5/24db9dd5c4a8b6c8e495289f17c28e55601769798b0e2e5a5aeb2abd247b/numpy-1.19.4-cp37-cp37m-win_amd64.whl (12.9MB)
(output omitted)
Successfully installed ~ (output omitted)
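As an optional check that is not part of the original steps, you can confirm that the libraries import correctly in this environment and see which versions were installed. A minimal sketch, run with the python.exe installed above:

import numpy, pandas, matplotlib, seaborn  # should import without errors
print(numpy.__version__, pandas.__version__, matplotlib.__version__, seaborn.__version__)  # print the installed versions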
Open options.cfg in a text editor such as Notepad. The default path of options.cfg is: C:\Program Files\IBM\SPSS\Modeler\18.2.2\config\options.cfg
Specify the path to python.exe in eas_pyspark_python_path and save the file, overwriting the original.
options.cfg
# Set to the full path to the python executable (including the executable name) to enable use of PySpark.
eas_pyspark_python_path, "C:<python_install_location>\\python.exe"
Description example
# Set to the full path to the python executable (including the executable name) to enable use of PySpark.
eas_pyspark_python_path, "C:\\Program Files\\Python37\\python.exe"
Restart the SPSS Modeler 18.2.2 service for the changes to take effect.
Download the iris dataset to use for checking that everything works. https://github.com/mwaskom/seaborn-data/blob/master/iris.csv
Start SPSS Modeler, add a Variable File node, and load iris.csv.
Select the Extension Output node from the Output tab and connect it.
Open the properties of the Extension Output node, enter the following code as Python for Spark syntax, and run it. The code outputs a pair plot (scatter plot matrix) using the seaborn library. The syntax itself is explained in detail later. Execution may take a few minutes to complete.
Syntax
import spss.pyspark.runtime
ascontext = spss.pyspark.runtime.getContext()
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
indf = ascontext.getSparkInputData()
sns.pairplot(indf.toPandas(), hue='species')
plt.show()
When execution completes, the pair plot (scatter plot matrix) is displayed in a Python window, as shown below. You can check the relationships between all the variables at a glance. With R, the graph is output to the Graph output tab of the extension output node, but Python does not currently seem to support this, so the graph is displayed in a Python window instead.
Note that while the graph is displayed in the Python window, the stream remains in the running state, as shown below. You need to close the Python window for the stream processing to complete.
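If you prefer not to leave the stream running, one possible workaround (not covered above; just a sketch under the assumption that matplotlib's non-interactive Agg backend is available in your environment) is to save the figure to an image file instead of opening the Python window:

import spss.pyspark.runtime
ascontext = spss.pyspark.runtime.getContext()

import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, so no window is opened
import seaborn as sns

indf = ascontext.getSparkInputData()
grid = sns.pairplot(indf.toPandas(), hue='species')  # pairplot returns a seaborn PairGrid
grid.savefig('C:/temp/pairplot.png')  # hypothetical output path; change to a writable location

Since no window is opened, the stream should finish without manual intervention; please verify this in your own environment.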
It would be nice if Python also supported graph output to the Graph output tab of the extension output node, as R does, so I have posted this as an idea for a new feature. If you agree, I would appreciate it if you could vote by clicking Vote at the URL below. Development of the feature is not guaranteed, but it will be considered if it receives enough votes.
Extension Output Node for Python with Graph Output tab like R https://ibm-data-and-ai.ideas.aha.io/ideas/MDLR-I-329
Here is some supplementary explanation of the syntax entered in the Extension Output node.
When using Python with SPSS Modeler extension nodes, input/output data must be handled in Spark DataFrame format. You also need to go through the Analytics Server context interface that SPSS Modeler provides as a wrapper, rather than calling the Spark API directly.
Below, I explain each part of the syntax.
(Repost) Overall syntax
import spss.pyspark.runtime
ascontext = spss.pyspark.runtime.getContext()
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
indf = ascontext.getSparkInputData()
sns.pairplot(indf.toPandas(), hue='species')
plt.show()
First, the first half. These lines always need to be included when using Python in an extension node. Import the spss.pyspark.runtime library for interacting with Analytics Server, and store an Analytics Server context object in the variable ascontext.
First half of syntax
import spss.pyspark.runtime # Load the library for Analytics Server context processing
ascontext = spss.pyspark.runtime.getContext() # Define the Analytics Server context object
The second half is the processing that creates the pair plot. The overall flow is almost the same as when using seaborn in ordinary Python, except for a few additions such as the DataFrame conversion.
First, the data from the previous node is stored in a variable called indf. In this case, the data read from iris.csv is stored as a Spark DataFrame.
Next, create the pair plot with seaborn. seaborn cannot accept a Spark DataFrame directly, so convert it to a pandas DataFrame and pass that instead. The conversion from Spark DataFrame to pandas DataFrame is done with the toPandas() method.
Finally, use plt.show() to display the pair plot.
Second half of the syntax
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
indf = ascontext.getSparkInputData() # Read the iris.csv data
sns.pairplot(indf.toPandas(), hue='species') # Convert to a pandas DataFrame with indf.toPandas() and pass it to seaborn; hue='species' colors the plot by species
plt.show() # Show the figure
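For reference, the Extension Output node here only reads input data and does not pass anything downstream. If you use an extension transform node instead, my understanding from the Analytics Server context documentation linked below is that you return a Spark DataFrame through the context; please verify the method names against the manual for your Modeler version. A minimal pass-through sketch:

import spss.pyspark.runtime
ascontext = spss.pyspark.runtime.getContext()

indf = ascontext.getSparkInputData()  # input from the upstream node as a Spark DataFrame
# ... apply Spark DataFrame operations to indf here ...
ascontext.setSparkOutputData(indf)  # pass the (possibly transformed) Spark DataFrame downstream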
See below for more information on Analytics Server contexts.
Script using Python for Spark https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api.html
Analytic Server Context https://www.ibm.com/support/knowledgecenter/ja/SS3RA7_18.2.2/modeler_r_nodes_ddita/clementine/r_pyspark_api_context.html