[In-Database Python Analysis Tutorial with SQL Server 2017] Step 4: Feature extraction of data using T-SQL

From the beginning of the tutorial

In-Database Python Analysis for SQL Developers

Previous step

Step 3: Data exploration and visualization

Next step

Step 5: Training and saving a model using T-SQL

Step 4: Feature extraction of data using T-SQL

After exploring the data, we will gather some insights from the data and move on to feature engineering. The process of feature extraction from raw data is an important step in advanced analytical modeling.

In this step, you will learn how to use Transact-SQL functions to extract features from raw data. Then call that function from the stored procedure to create a table containing the feature values.

Function definition

The meter distance values recorded in the original data may not represent geographic or travel distances, so the coordinates available in this dataset are used to direct the distance between the boarding and disembarking positions. To calculate. To do this, use the Haversine Formula (https://en.wikipedia.org/wiki/Haversine_formula) in your custom Transact-SQL function.

The T-SQL function fnCalculateDistance uses a Haversine expression to calculate the distance, and the T-SQL function fnEngineerFeatures creates a table containing all the features.

Calculate the distance traveled using fnCalculateDistance

The T-SQL function fnCalculateDistance is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).

  1. In Management Studio's Object Explorer, expand Programming> Functions> Scalar Value Functions.

  2. Right-click fnCalculateDistance and select Modify to open the Transact-SQL script in a new query window.

    CREATE FUNCTION [dbo].[fnCalculateDistance](@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
    -- User-defined function that calculates the direct distance between two geographical coordinates
    RETURNS float
    AS
    BEGIN
      DECLARE @distance decimal(28, 10)
      -- Convert to radians
      SET @Lat1 = @Lat1 / 57.2958
      SET @Long1 = @Long1 / 57.2958
      SET @Lat2 = @Lat2 / 57.2958
      SET @Long2 = @Long2 / 57.2958
      -- Calculate distance
      SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
      --Convert to miles
      IF @distance <> 0
      BEGIN
        SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
      END
      RETURN @distance
    END
    GO
    

--This function is a scalar value function that returns a single data value of a predefined type. --The latitude and longitude values obtained from the boarding and disembarking positions are used as inputs. The Haversine formula converts positions to radians and uses these values to calculate the direct distance between two locations.

Save feature values using fnEngineerFeatures

The T-SQL function fnEngineerFeatures is defined in SQL Server through Step 2: Import Data into SQL Server Using PowerShell (http://qiita.com/qio9o9/items/98df36982f1fbecdf5e7).

  1. In Management Studio's Object Explorer, expand Programming> Functions> Table Value Functions.

  2. Right-click fnCalculateDistance and select Modify to open the Transact-SQL script in a new query window.

fnEngineerFeatures is a table-valued function that takes multiple columns as input and returns multiple feature-valued columns. The purpose of fnEngineerFeatures is to create a feature value set to use for model building. fnEngineerFeatures calls fnCalculateDistance to get the linear distance between the boarding position and the disembarking position.

```SQL:fnEngineerFeatures
CREATE FUNCTION [dbo].[fnEngineerFeatures](
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @passenger_count AS passenger_count,
    @trip_distance AS trip_distance,
    @trip_time_in_secs AS trip_time_in_secs,
    [dbo].[fnCalculateDistance](@pickup_latitude,@pickup_longitude,@dropoff_latitude,@dropoff_longitude) AS direct_distance
  )
GO
```
  1. As an operation check, let's calculate the geographical distance for the record where the meter distance value is set to 0 even though the boarding position and the getting-off position are different.

        SELECT tipped, fare_amount, passenger_count,(trip_time_in_secs/60) as TripMinutes,
        trip_distance, pickup_datetime, dropoff_datetime,
        dbo.fnCalculateDistance(pickup_latitude, pickup_longitude,  dropoff_latitude, dropoff_longitude) AS direct_distance
        FROM nyctaxi_sample
        WHERE pickup_longitude != dropoff_longitude and pickup_latitude != dropoff_latitude and trip_distance = 0
        ORDER BY trip_time_in_secs DESC
    

    sqldev-python-step4-1-gho9o9.png

As you can see, the distances reported by the meters are not always recorded as an indication of geographical distance. These pretreatments are characteristic engineering is the reason why engineering is important.

In the next step, you'll learn how to use these features to create and train machine learning models using Python.

Link

Next step

Step 5: Training and saving a model using T-SQL

Previous step

Step 3: Data exploration and visualization

From the beginning of the tutorial

In-Database Python Analysis for SQL Developers

Source

Step 4: Create Data Features using T-SQL

Related item

Machine Learning Services with Python

Recommended Posts

[In-Database Python Analysis Tutorial with SQL Server 2017] Step 4: Feature extraction of data using T-SQL
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 2: Import data to SQL Server using PowerShell
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 6: Using the model
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 1: Download sample data
[In-Database Python Analysis Tutorial with SQL Server 2017] Step 5: Training and saving a model using T-SQL
[In-Database Python Analysis Tutorial with SQL Server 2017]
Data analysis with python 2
Data analysis using Python 0
Data analysis with Python
Challenge principal component analysis of text data with Python
Data analysis using python pandas
Practical exercise of data analysis with Python ~ 2016 New Coder Survey Edition ~
Feature extraction by TF method using the result of morphological analysis
Recommendation of data analysis using MessagePack
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
Rewrite the field creation node of SPSS Modeler with Python. Feature extraction from time series sensor data
Recommendation tutorial using association analysis (python implementation)
Recommendation of Altair! Data visualization with Python
Data analysis starting with python (data preprocessing-machine learning)
Python practice data analysis Summary of learning that I hit about 10 with 100 knocks
I tried to make a function to retrieve data from database column by column using sql with sqlite3 of python [sqlite3, sql, pandas]
Python introductory study-output of sales data using tuples-
Static analysis of Python code with GitLab CI
A well-prepared record of data analysis in Python
Summary of statistical data analysis methods using Python that can be used in business
Data analysis python
[python] Around generating XML string without using to_sql () of pandas and updating data using OPENXML function in SQL Server stored procedure
Data analysis for improving POG 1 ~ Web scraping with Python ~
Explanation of the concept of regression analysis using python Part 2
Get Amazon RDS (PostgreSQL) data using SQL with pandas
[OpenCV / Python] I tried image analysis of cells with OpenCV
Try projective transformation of images using OpenCV with Python
Reading Note: An Introduction to Data Analysis with Python
Data analysis environment construction with Python (IPython notebook + Pandas)
Calculate the regression coefficient of simple regression analysis with python
Story of image analysis of PDF file and data extraction
List of Python code used in big data analysis
Explanation of the concept of regression analysis using Python Part 1
A server that echoes data POSTed with flask / python
Planar skeleton analysis with Python (4) Handling of forced displacement
Process csv data with python (count processing using pandas)
Principal component analysis using python from nim with nimpy
Explanation of the concept of regression analysis using Python Extra 1
[Basics of data science] Collecting data from RSS with python
Extract the band information of raster data with python
[Technical book] Introduction to data analysis using Python -1 Chapter Introduction-
ODBC access to SQL Server from Linux with Python
I know? Data analysis using Python or things you want to use when you want with numpy
"Measurement Time Series Analysis of Economic and Finance Data" Solving Chapter End Problems with Python