[PYTHON] Support vector regression and feature selection

Support vector regression

Support vector regression is a machine learning method well suited to multivariate nonlinear regression problems, because it estimates the regression curve without assuming a functional form. In addition, it is robust against multicollinearity, so it tends to remain stable even when you throw in roughly as many explanatory variables as are available.

**Example of support vector regression** (figure_1-1.png)

test_svr.py


import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn import svm

PI = np.pi

# Create 120 equally spaced points on [0, 2π)
X = np.array(range(120))
X = X * 2 * PI / 120
# Compute y = sin(X) and add Gaussian noise
y = np.sin(X)
e = [random.gauss(0, 0.2) for i in range(len(y))]
y += e
# Convert X to a column vector
X = X[:, np.newaxis]

# Fit the SVR
svr = svm.SVR(kernel='rbf')
svr.fit(X, y)

# Draw the regression curve
X_plot = np.linspace(0, 2*PI, 10000)
y_plot = svr.predict(X_plot[:, np.newaxis])

# Plot the data and the regression curve
plt.scatter(X, y)
plt.plot(X_plot, y_plot)
plt.show()

Feature selection problem

In general, the regression curve of support vector regression is obtained through a nonlinear map to a higher-dimensional feature space. Therefore, unlike multiple regression analysis, you cannot simply read off each explanatory variable's contribution from the absolute value of its coefficient. (You can't do that, right?) Instead, it is considered effective to perform a sensitivity analysis: record how the coefficient of determination changes while removing variables in ascending order of sensitivity, and adopt the variable set just before the coefficient of determination drops sharply as the effective feature set.
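As a small illustration of this point (my own sketch, not part of the article's method): with a non-linear kernel, scikit-learn's SVR only exposes dual coefficients for the support vectors, and asking for per-feature weights via coef_ fails, which is exactly why a coefficient-based ranking is unavailable.

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                     # three explanatory variables
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]      # the third variable is irrelevant

svr = SVR(kernel='rbf').fit(X, y)

# Only the dual coefficients of the support vectors are available.
print(svr.dual_coef_.shape)
try:
    print(svr.coef_)   # defined only for kernel='linear'
except AttributeError as err:
    print("coef_ unavailable:", err)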

Method and implementation

I referred to this document.

Evaluation of coefficient of determination

  1. Find the regression curve using all of the features and calculate the coefficient of determination (grid search + cross-validation is recommended; a minimal sketch follows).
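A minimal sketch of this step, using current scikit-learn module paths (GridSearchCV and cross_val_score from sklearn.model_selection) and assuming a scikit-learn version that still ships load_boston; the parameter grid mirrors the one used in select_features.py further down.

from sklearn.datasets import load_boston   # removed in scikit-learn >= 1.2
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_val_score

boston = load_boston()
X, y = boston.data, boston.target
# Standardize the features (zero mean, unit variance) before fitting the SVR.
X = (X - X.mean(axis=0)) / X.std(axis=0)

param_grid = {'kernel': ['rbf'],
              'gamma': [10**i for i in range(-4, 0)],
              'C': [10**i for i in range(1, 4)]}

# Grid search nested inside 10-fold cross-validation; the outer score is R^2.
gsvr = GridSearchCV(SVR(), param_grid, cv=5)
scores = cross_val_score(gsvr, X, y, cv=10, scoring='r2')
print(scores.mean())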

Sensitivity analysis

  1. Divide the data into training data and test data. In the test data, replace the values of every variable other than the one whose sensitivity you want to measure with that variable's mean value.
  2. Train on the training data and obtain predicted values for the test data created in step 1.
  3. Perform a simple regression analysis with the variable of interest as the explanatory variable and the predicted values as the objective variable, and take the absolute value of the slope as the sensitivity (a sketch follows this list).
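Below is a minimal sketch of steps 1-3 for a single train/test split and a single feature; the full k-fold version is calculate_sensitivity in the code at the end, and the split ratio and the choice of LSTAT here are only assumptions for illustration.

import pandas as pd
from sklearn.datasets import load_boston        # removed in scikit-learn >= 1.2
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston["feature_names"])
df = (df - df.mean()) / df.std()                # standardize the features
feature = "LSTAT"                               # feature whose sensitivity we want

X_train, X_test, y_train, _ = train_test_split(df, boston.target, test_size=0.1)

# Freeze every other feature at its mean so that only `feature` varies in the test set.
X_test = X_test.copy()
for col in X_test.columns:
    if col != feature:
        X_test[col] = X_test[col].mean()

# Predict with the SVR, then regress the predictions on the single feature;
# the absolute value of the slope is the sensitivity.
y_pred = SVR(kernel="rbf").fit(X_train, y_train).predict(X_test)
slope = LinearRegression().fit(X_test[[feature]], y_pred).coef_[0]
print(abs(slope))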

Feature selection

  1. After evaluating the coefficient of determination and performing the sensitivity analysis with all remaining features, remove the feature with the lowest sensitivity.
  2. Treat this as one round and repeat the above process.
  3. Check how the coefficient of determination changes as the features are reduced, and cut the variables wherever you see fit.

Test

Let's verify using the Boston house price data provided in scikit-learn.

**Number of rounds of feature reduction and coefficient of determination** (figure)

It can be seen that the coefficient of determination does not drop sharply even when several features are removed. The results are tabulated below.

| Number of rounds | Feature removed | Coefficient of determination |
|---|---|---|
| 0 | - | 0.644 |
| 1 | ZN | 0.649 |
| 2 | INDUS | 0.663 |
| 3 | CHAS | 0.613 |
| 4 | CRIM | 0.629 |
| 5 | RAD | 0.637 |
| 6 | NOX | 0.597 |
| 7 | PTRATIO | 0.492 |
| 8 | B | 0.533 |
| 9 | TAX | 0.445 |
| 10 | DIS | 0.472 |
| 11 | AGE | 0.493 |
| 12 | RM | 0.311 |

The last remaining feature is LSTAT.

The meaning of each feature is roughly as follows. See here for more information.

- **CRIM**: Crime rate per capita
- **ZN**: Percentage of residential land zoned for lots over 25,000 square feet
- **INDUS**: Percentage of non-retail business land
- **CHAS**: Whether the tract borders the Charles River
- **NOX**: Nitrogen oxide concentration
- **RM**: Average number of rooms
- **AGE**: Percentage of homes built before 1940
- **DIS**: Distance to Boston employment centers
- **RAD**: Accessibility to radial highways
- **TAX**: Property tax rate
- **PTRATIO**: Number of students per teacher
- **B**: Proportion of Black residents
- **LSTAT**: Percentage of lower-status population

From this result, we can see the following.

- The eight features NOX, PTRATIO, B, TAX, DIS, AGE, RM, and LSTAT alone provide estimation performance comparable to using all of the features.

- LSTAT and RM are more important than ZN and INDUS for predicting house prices in Boston.

By combining support vector regression with sensitivity analysis in this way, the contribution of each feature can be ranked. Finally, the code used for the feature selection is shown below.

select_features.py


# Standardize every column except the target to zero mean and unit variance.
def standardize(data_table):
    for column in data_table.columns:
        if column in ["target"]:
            continue
        if data_table[column].std() == 0:
            data_table.loc[:, column] = 0
        else:
            data_table.loc[:, column] = ((data_table.loc[:,column] - data_table[column].mean()) 
                                         / data_table[column].std())

    return data_table

# Method that calculates the sensitivity of one feature by k-fold cross-validation
def calculate_sensitivity(data_frame, feature_name, k=10):
    import numpy as np
    import pandas as pd
    from sklearn import svm
    from sklearn import linear_model
    from sklearn.model_selection import GridSearchCV
    
    #Set parameters for grid search
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [10**i for i in range(-4, 0)],
                         'C': [10**i for i in range(1,4)]}]
    
    #A list that stores the slope values.
    slope_list = []
    
    #sample size
    sample_size = len(data_frame.index)
    
    features = list(data_frame.columns)
    features.remove("target")
    
    for number_set in range(k):
        
        #Divide the data for training and testing.
        if number_set < k - 1:
            test_data = data_frame.iloc[number_set*sample_size//k:(number_set+1)*sample_size//k, :]
            learn_data = pd.concat([data_frame.iloc[0:number_set*sample_size//k, :],
                                    data_frame.iloc[(number_set+1)*sample_size//k:, :]])
        else:
            test_data = data_frame.iloc[(k-1)*sample_size//k:, :]
            learn_data = data_frame.iloc[:(k-1)*sample_size//k, :]
        #Divide each into labels and features
        learn_label_data = learn_data["target"]
        learn_feature_data = learn_data.loc[:,features]
        test_label_data = test_data["target"]
        test_feature_data = test_data.loc[:, features]
        
        # In the test data, replace every column except the one whose sensitivity
        # we want to analyze with that column's mean value.
        test_feature_data = test_feature_data.copy()
        for column in test_feature_data.columns:
            if column == feature_name:
                continue
            test_feature_data.loc[:, column] = test_feature_data[column].mean()
        
        # Convert each data set to a numpy array for the SVR.
        X_test = np.array(test_feature_data)
        X_linear_test = np.array(test_feature_data[feature_name])
        X_linear_test = X_linear_test[:, np.newaxis]
        y_test = np.array(test_label_data)
        X_learn = np.array(learn_feature_data)
        y_learn = np.array(learn_label_data)
        
        # Fit the SVR with grid search and get predictions for the test data
        gsvr = GridSearchCV(svm.SVR(), tuned_parameters, cv=5, scoring="neg_mean_squared_error")
        gsvr.fit(X_learn, y_learn)
        y_predicted = gsvr.predict(X_test)
        
        #Performs a linear regression on the output.
        lm = linear_model.LinearRegression()
        lm.fit(X_linear_test, y_predicted)
        
        #Get the slope
        slope_list.append(lm.coef_[0])
    
    return np.array(slope_list).mean()

# Method that calculates the coefficient of determination by k-fold cross-validation
def calculate_R2(data_frame, k=10):
    import numpy as np
    import pandas as pd
    from sklearn import svm
    from sklearn.model_selection import GridSearchCV
    
    #Set parameters for grid search
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [10**i for i in range(-4, 0)],
                         'C': [10**i for i in range(1,4)]}]
    svr = svm.SVR()
    
    #Define a list that stores the value of the coefficient of determination for each round.
    R2_list = []
    
    features = list(data_frame.columns)
    features.remove("target")
    
    #sample size
    sample_size = len(data_frame.index)
    
    for number_set in range(k):
        
        #Divide the data for training and testing.
        if number_set < k - 1:
            test_data = data_frame[number_set*sample_size//k:(number_set+1)*sample_size//k]
            learn_data = pd.concat([data_frame[0:number_set*sample_size//k],data_frame[(number_set+1)*sample_size//k:]])
        else:
            test_data = data_frame[(k-1)*sample_size//k:]
            learn_data = data_frame[:(k-1)*sample_size//k]
        #Divide each into labels and features
        learn_label_data = learn_data["target"]
        learn_feature_data = learn_data.loc[:, features]
        test_label_data = test_data["target"]
        test_feature_data = test_data.loc[:, features]

        # Convert each data set to a numpy array for the SVR.
        X_test = np.array(test_feature_data)
        y_test = np.array(test_label_data)
        X_learn = np.array(learn_feature_data)
        y_learn = np.array(learn_label_data)
        
        # Fit the SVR with grid search and compute R^2 on the test data
        gsvr = GridSearchCV(svr, tuned_parameters, cv=5, scoring="neg_mean_squared_error")
        gsvr.fit(X_learn, y_learn)
        score = gsvr.best_estimator_.score(X_test, y_test)
        R2_list.append(score)
    
    # Return the mean R^2.
    return np.array(R2_list).mean()

if __name__ == "__main__":
    from sklearn.datasets import load_boston
    from sklearn import svm
    import pandas as pd
    import random
    import numpy as np

    # Load the Boston house price data.
    boston = load_boston()
    X_data, y_data = boston.data, boston.target
    df = pd.DataFrame(X_data, columns=boston["feature_names"])
    df['target'] = y_data
    count = 0
    temp_data = standardize(df)
    # Shuffle the rows for cross-validation.
    temp_data = temp_data.reindex(np.random.permutation(temp_data.index)).reset_index(drop=True)
    #Create a dataframe to store the sensitivity and coefficient of determination of the features in each loop.
    result_data_frame = pd.DataFrame(np.zeros((len(df.columns), len(df.columns))), columns=df.columns)
    result_data_frame["Coefficient of determination"] = np.zeros(len(df.columns))
    # Repeat until only the target column remains.
    while len(temp_data.columns) > 1:
        #This is the coefficient of determination when all the remaining features in this round are used.
        result_data_frame.loc[count, "Coefficient of determination"] = calculate_R2(temp_data,k=10)
        #A data frame that stores the sensitivity of each feature in this round.
        temp_features = list(temp_data.columns)
        temp_features.remove('target')
        temp_result = pd.DataFrame(np.zeros(len(temp_features)),
                                   columns=["abs_Sensitivity"], index=temp_features)

        # Loop over each remaining feature.
        for i, feature in enumerate(temp_data.columns):
            if feature == "target":
                continue
            #Perform sensitivity analysis.
            sensitivity = calculate_sensitivity(temp_data, feature)

            result_data_frame.loc[count, feature] = sensitivity
            temp_result.loc[feature, "abs_Sensitivity"] = abs(sensitivity)
            print(feature, sensitivity)

        print(count, result_data_frame.loc[count, "Coefficient of determination"])
        # Remove the feature with the smallest absolute sensitivity.
        ineffective_feature = temp_result["abs_Sensitivity"].idxmin()
        print(ineffective_feature)
        temp_data = temp_data.drop(ineffective_feature, axis=1)


        # Write the transition of the sensitivities and R^2 to a CSV file.
        result_data_frame.to_csv("result.csv")

        count += 1
