[In-Database Python Analysis Tutorial with SQL Server 2017] Step 3: Data Exploration and Visualization

From the beginning of the tutorial

In-Database Python Analysis for SQL Developers

Previous step

Step 2: Import data into SQL Server using PowerShell

Next step

Step 4: Feature extraction of data using T-SQL

Step 3: Data exploration and visualization

Data science solution development typically involves intensive data exploration and data visualization. In this step, you will explore the sample data and generate some plots. In addition, you will learn how to serialize graphics objects in Python and how to deserialize and create plots.

[!NOTE] This tutorial only shows the binary classification model. It is also possible to build other models such as regression classification and multinomial classification.

Data confirmation

The original dataset provides the taxi identifier and driving record in separate files, but we have joined with medallion, hack_license, and pickup_datetime as keys to make the sample data easier to use. The records used are sampled at 1% of the original number of records. The sampled dataset has 1,703,957 rows and 23 columns.

** Taxi identifier **

--The medallion column shows the unique ID number of the taxi. --The hack_license column shows the driver's anonymized driver's license number.

** Driving record and fare record **

――Each driving record includes the location and time of boarding and alighting, and the driving distance. --Each fare record contains payment information such as payment type, total payment amount, tip amount and so on. --The last three columns can be used for various machine learning tasks. --The tip_amount column contains consecutive numbers and can be used as a label column (objective variable) for regression analysis. --The tipped column has only yes / no values and can be used as a label column (objective variable) for binary classification. --The tip_class column has multiple ** class labels ** and can be used as a label column (objective variable) for multinomial classification. --The value used as the label column is based on the tip_amount column.

 |Column|rule|
 |:--|:--|
 |tipped|If tip_amount > 0, tipped = 1, otherwise tipped = 0|
 |tip_class|Class 0: tip_amount =$ 0<br/> Class 1: tip_amount >$ 0and tip_amount <=$ 5<br/> Class 2: tip_amount >$ 5and tip_amount <=10 dollars<br/> Class 3: tip_amount >10 dollarsand tip_amount <=$ 20<br/> Class 4: tip_amount >$ 20|

Create plots in Python within T-SQL

Visualization is important for understanding the distribution of data and outliers, and Python provides many packages for data visualization. The matplotlib module includes many features for creating histograms, scatter plots, box plots, and other data exploration graphs.

In this section, you will learn how to work with plots using stored procedures. Here, the plot is treated as varbinary type data.

Store plot as varbinary data type

The ** RevoScalePy ** package of the Python library included in SQL Server 2017 Machine Learning Services is equivalent to the RevoScaleR package of the R library. This example uses rxHistogram to plot a histogram based on the Transact-SQL query result data. Wrap it in a PlotHistogram stored procedure for ease of use.

This stored procedure returns a serialized Python drawing object as a stream of varbinary data. You cannot view the binary data directly, but you can use Python code on the client to deserialize the binary data and save the image file on the client computer.

Define a SerializePlots stored procedure

The stored procedure SerializePlots is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).

  1. In Management Studio's Object Explorer, expand Programming> Stored Procedures.

  2. Right-click SerializePlots and select Modify to open the Transact-SQL script in a new query window.

    
    CREATE PROCEDURE [dbo].[SerializePlots]
    AS
    BEGIN
      SET NOCOUNT ON;
      DECLARE @query nvarchar(max) =
      N'SELECT cast(tipped as int) as tipped, tip_amount, fare_amount FROM [dbo].[nyctaxi_sample]'
      EXECUTE sp_execute_external_script
      @language = N'Python',
      @script = N'
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    import pandas as pd
    import pickle
    
    fig_handle = plt.figure()
    plt.hist(InputDataSet.tipped)
    plt.xlabel("Tipped")
    plt.ylabel("Counts")
    plt.title("Histogram, Tipped")
    plot0 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
    plt.clf()
    
    plt.hist(InputDataSet.tip_amount)
    plt.xlabel("Tip amount ($)")
    plt.ylabel("Counts")
    plt.title("Histogram, Tip amount")
    plot1 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
    plt.clf()
    
    plt.hist(InputDataSet.fare_amount)
    plt.xlabel("Fare amount ($)")
    plt.ylabel("Counts")
    plt.title("Histogram, Fare amount")
    plot2 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
    plt.clf()
    
    plt.scatter( InputDataSet.fare_amount, InputDataSet.tip_amount)
    plt.xlabel("Fare Amount ($)")
    plt.ylabel("Tip Amount ($)")
    plt.title("Tip amount by Fare amount")
    plot3 = pd.DataFrame(data =[pickle.dumps(fig_handle)], columns =["plot"])
    plt.clf()
    
    OutputDataSet = plot0.append(plot1, ignore_index=True).append(plot2, ignore_index=True).append(plot3, ignore_index=True)
    ',
                                     @input_data_1 = @query
      WITH RESULT SETS ((plot varbinary(max)))
    END
    
    GO
    
    

--The variable @ query defines the query text passed as input @ input_data_1 to the Python code block. -** Histograms and scatter plots are created with the figure of the matplotlib library ** and these objects are serialized using the ** pickle library **. --Python drawing objects are serialized into ** pandas ** dataframes for output.

Output varbinary data as an image file

  1. Run the following query in Management Studio.

    EXEC [dbo].[SerializePlots]
    

    sqldev-python-step3-1-gho9o9.png

  2. Change the connection string in the Python script DeserializeSavePlots.py in the downloaded files to suit your environment, and then execute it.

** For SQL Server Authentication **

```python:DeserializeSavePlots
import pyodbc
import pickle
import os
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={SERVER_NAME};DATABASE={DB_NAME};UID={USER_NAME};PWD={PASSOWRD}')
cursor = cnxn.cursor()
cursor.execute("EXECUTE [dbo].[SerializePlots]")
tables = cursor.fetchall()
for i in range(0, len(tables)):
    fig = pickle.loads(tables[i][0])
    fig.savefig(str(i)+'.png')
print("The plots are saved in directory: ",os.getcwd())
```

** For Windows authentication **

```Python:DeserializeSavePlots.py
import pyodbc
import pickle
import os
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={SERVER_NAME};DATABASE={DB_NAME};Trusted_Connection=yes;')
cursor = cnxn.cursor()
cursor.execute("EXECUTE [dbo].[SerializePlots]")
tables = cursor.fetchall()
for i in range(0, len(tables)):
    fig = pickle.loads(tables[i][0])
    fig.savefig(str(i)+'.png')
print("The plots are saved in directory: ",os.getcwd())
```

> [!NOTE]

Match the Python runtime version on the server and client. Also, the version of the Python library such as matplotlib used by the client should match the server or be a higher version.

  1. If the connection is successful, the following result will be displayed.

    sqldev-python-step3-2-gho9o9.png

  2. Four files are created in the Python working directory.

Shows the number of chips obtained and the number not obtained. sqldev-python-step3-3-1-gho9o9.png

Shows the distribution of tip amounts. sqldev-python-step3-3-2-gho9o9.png

Shows the distribution of fares. sqldev-python-step3-3-3-gho9o9.png

It is a scatter plot with the fare on the x-axis and the tip amount on the y-axis. sqldev-python-step3-3-4-gho9o9.png

Link

Next step

Step 4: Feature extraction of data using T-SQL

Previous step

Step 2: Import data into SQL Server using PowerShell

From the beginning of the tutorial

In-Database Python Analysis for SQL Developers

Source

Step 3: Explore and Visualize the Data

Related item

Machine Learning Services with Python

Recommended Posts

[In-Database Python Analysis Tutorial with SQL Server 2017] Step 3: Data Exploration and Visualization
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 1: Download sample data
[In-Database Python Analysis Tutorial with SQL Server 2017]
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 2: Import data to SQL Server using PowerShell
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 6: Using the model
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 4: Feature extraction of data using T-SQL
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 5: Training and saving a model using T-SQL
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
Data analysis with python 2
Data analysis with Python
Beautiful graph drawing with python -seaborn makes data analysis and visualization easier Part 1
Beautiful graph drawing with python -seaborn makes data analysis and visualization easier Part 2
Easy data visualization with Python seaborn.
Data pipeline construction with Python and Luigi
Starbucks Twitter Data Location Visualization and Analysis
Tweet analysis with Python, Mecab and CaboCha
Recommendation of Altair! Data visualization with Python
Data analysis starting with python (data preprocessing-machine learning)
Sensor data acquisition and visualization for plant growth with Intel Edison and Python
Data analysis python
Implement "Data Visualization Design # 3" with pandas and matplotlib
Launch a web server with Python and Flask
[CGI] Run the Python program on the server with Vue.js + axios and get the output data
Quickly create a Python data analysis dashboard with Streamlit and deploy it to AWS
Analysis of financial data by pandas and its visualization (2)
Data analysis for improving POG 1 ~ Web scraping with Python ~
Analysis of financial data by pandas and its visualization (1)
Reading Note: An Introduction to Data Analysis with Python
Data analysis environment construction with Python (IPython notebook + Pandas)
Overview and tips of seaborn with statistical data visualization
Challenge principal component analysis of text data with Python
[Control engineering] Visualization and analysis of PID control and step response
A server that echoes data POSTed with flask / python
[PyTorch Tutorial ⑦] Visualizing Models, Data, And Training With Tensorboard
Investigate Java and python data exchange with Apache Arrow
ODBC access to SQL Server from Linux with Python
Python Data Visualization Libraries
Data analysis using Python 0
Data analysis overview python
Voice analysis with python
Data visualization with pandas
Python data analysis template
[Python tutorial] Data structure
Voice analysis with python
Local server with python
Logistics visualization with Python
Create a decision tree from 0 with Python and understand it (3. Data analysis library Pandas edition)
"Measurement Time Series Analysis of Economic and Finance Data" Solving Chapter End Problems with Python
I tried the same data analysis with kaggle notebook (python) and Power BI at the same time ①
Easy analysis and sharing with re: dash, an open source data visualization tool Part 1-Installation
screen and split screen with python and ssh login to remote server
Perform isocurrent analysis of open channels with Python and matplotlib
Access the host SQL Server with python27 / pyodbc on the container
Using Python with SPSS Modeler extension nodes ① Setup and visualization
20200329_Introduction to Data Analysis with Python Second Edition Personal Summary
Get rid of dirty data with Python and regular expressions
Solve the spiral book (algorithm and data structure) with python!
Try to extract Azure SQL Server data table with pyodbc
Build a Python environment and transfer data to the server
<Python> Build a dedicated server for Jupyter Notebook data analysis