[PYTHON] Steps to machine learning and an API server using Kaggle's House Sales in King County, USA dataset

Introduction

Using the House Sales in King County, USA dataset from Kaggle, I trained a model with XGBoost and turned it into an API server with Flask. Building this machine learning API server involves four main steps. First, exploratory data analysis (EDA) is performed to understand the state of the House Sales data. Next, the data is preprocessed so that it can be used for training. Then, machine learning is used to generate a model; this time, I am using XGBoost. Finally, the API server is implemented with Flask.

The environment required to run this program

Anaconda is installed, along with libraries such as XGBoost, joblib, Flask, and flask-cors.
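
As a sketch, these could be installed into a conda environment roughly as follows (the conda-forge channel is an assumption; adjust channels and versions to your setup):

conda install -c conda-forge xgboost joblib flask flask-cors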

This machine learning API server is implemented through the following four processes.

- Understanding the House Sales data (EDA)
- Preprocessing the dataset
- Building a learning model with machine learning
- Creating an API server with Flask

Loading libraries and datasets

First, load the required libraries and the House Sales dataset downloaded from Kaggle. You can limit the number of columns displayed with set_option.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 4)
df = pd.read_csv('house_sales/kc_house_data.csv')
df.head()
id date ... sqft_living15 sqft_lot15
0 7129300520 20141013T000000 ... 1340 5650
1 6414100192 20141209T000000 ... 1690 7639
2 5631500400 20150225T000000 ... 2720 8062
3 2487200875 20141209T000000 ... 1360 5000
4 1954400510 20150218T000000 ... 1800 7503

5 rows × 21 columns

Let's display the histogram of sqft_living.

First, among the features, the histogram of sqft_living is displayed. Looking at it, you can see that a small number of very large values pull the distribution away from the shape of a normal distribution.

plt.figure(figsize = (12,8))
plt.hist(df["sqft_living"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist01.png', bbox_inches='tight')

House_Sales_Explanatory Data Analysis_hist01.png

Let's display a histogram of price.

Let's display a histogram for price, the value to be predicted this time. As before, there are some very large values and outliers, so the mass of the distribution is concentrated on the left with a long right tail.

plt.figure(figsize = (12,8))
plt.hist(df["price"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist02.png', bbox_inches='tight')

House_Sales_Explanatory Data Analysis_hist02.png

Delete the outliers by processing the interquartile range

This function removes outliers using the interquartile range (IQR): for each column, values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are dropped.

def outlier_iqr(df, columns = None):
    if columns is None:
        columns = df.columns
        
    for col in columns:
        # Lower and upper quartiles
        q1 = df[col].describe()['25%']
        q3 = df[col].describe()['75%']
        
        iqr = q3 - q1
        
        # Anything beyond 1.5 * IQR from the quartiles counts as an outlier
        outlier_min = q1 - iqr * 1.5
        outlier_max = q3 + iqr * 1.5
        
        df = df[(df[col] >= outlier_min) & (df[col] <= outlier_max)]
        
    return df
df_1 = outlier_iqr(df, columns = ['price'])
df_1.shape

(20454, 21)

Try to display the price as a histogram again.

Let's display the histogram of price again after the IQR processing. Now that the outliers have been removed, the histogram has a shape close to a normal distribution. Checking the data with shape shows that the number of rows is somewhat smaller than before, but not drastically reduced.

plt.figure(figsize = (12,8))
plt.hist(df_1["price"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist03.png', bbox_inches='tight')

House_Sales_Explanatory Data Analysis_hist03.png

df.shape

(21613, 21)

Let's display all the histograms of numerical data.

For confirmation, let's check the histograms of the other numeric features. You can see that the shapes of the histograms are generally well balanced.

fig, axes = plt.subplots(2,3, figsize = (18, 12))

axes.ravel()[0].hist(df_1["sqft_living"])
axes.ravel()[1].hist(df_1["sqft_above"])
axes.ravel()[2].hist(df_1["sqft_basement"])
axes.ravel()[3].hist(df_1["lat"])
axes.ravel()[4].hist(df_1["long"])
axes.ravel()[5].hist(df_1["sqft_living15"])

axes.ravel()[0].set_title("sqft_living")
axes.ravel()[1].set_title("sqft_above")
axes.ravel()[2].set_title("sqft_basement")
axes.ravel()[3].set_title("lat")
axes.ravel()[4].set_title("long")
axes.ravel()[5].set_title("sqft_living15")

plt.savefig('House_Sales_Explanatory Data Analysis_hist04.png', bbox_inches='tight')

House_Sales_Explanatory Data Analysis_hist04.png

Delete unnecessary features.

Next, we delete the features that are not needed for training: id, date, sqft_lot, sqft_lot15, and zipcode.

df_1 = df_1.drop(columns = ['id', 'date', 'sqft_lot','sqft_lot15','zipcode'])

Convert the year of construction to the age of the building.

The feature yr_built holds the year the building was constructed, which is awkward to use as training data directly, so we add a new feature for the age of the building. Likewise, yr_renovated holds the renovation year, so it is converted to the number of years since renovation; rows that were never renovated (recorded as 0) are first set to 2020 so that their renovated age becomes 0.

df_1["age"] = 2020 - df_1["yr_built"]
df_1.loc[(df_1['yr_renovated'] == 0), 'yr_renovated'] = 2020

Standardize numerical features.

Numerical data is standardized here using StandardScaler.

from sklearn.preprocessing import StandardScaler
num_feature = ['sqft_living', 'sqft_above', 'sqft_basement', 'lat', 'long', 'sqft_living15']

for col in num_feature:
    scaler = StandardScaler()
    df_1[col] = scaler.fit_transform(np.array(df_1[col].values).reshape(-1, 1))
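
Note that a separate scaler is fitted for each column and then discarded, so the API server built later expects inputs that are already standardized. If you wanted the server to accept raw values instead, one option (a sketch, not part of the original flow; the scaler file name is an assumption) is to persist the fitted scalers alongside the model:

# Variant: keep the fitted scalers so raw inputs can be standardized at serving time
import joblib

scalers = {}
for col in num_feature:
    scaler = StandardScaler()
    df_1[col] = scaler.fit_transform(df_1[col].values.reshape(-1, 1))
    scalers[col] = scaler

joblib.dump(scalers, 'house_sales_scalers.pkl')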

Try to display all the histograms of the numerical data again.

As a check, the histograms of the numeric features are displayed again. You can see that the shapes are unchanged from before: standardization rescales the values but does not change the shape of the distributions.

fig, axes = plt.subplots(2,3, figsize = (18, 12))

axes.ravel()[0].hist(df_1["sqft_living"])
axes.ravel()[1].hist(df_1["sqft_above"])
axes.ravel()[2].hist(df_1["sqft_basement"])
axes.ravel()[3].hist(df_1["lat"])
axes.ravel()[4].hist(df_1["long"])
axes.ravel()[5].hist(df_1["sqft_living15"])

axes.ravel()[0].set_title("sqft_living")
axes.ravel()[1].set_title("sqft_above")
axes.ravel()[2].set_title("sqft_basement")
axes.ravel()[3].set_title("lat")
axes.ravel()[4].set_title("long")
axes.ravel()[5].set_title("sqft_living15")

plt.savefig('House_Sales_Explanatory Data Analysis_hist05.png', bbox_inches='tight')

House_Sales_Explanatory Data Analysis_hist05.png

Save the correct answer data as csv

Finally, save the data to CSV for the machine learning step. First, the target (correct answer) variable, price:

df_price = df_1["price"]
df_price.to_csv('House_Sales_Explanatory_Price.csv')
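
One caveat, as a hedge: depending on the pandas version, Series.to_csv may write a header row by default, while this file is later read with header=None; if the round trip misbehaves in your environment, saving without a header keeps the two consistent (a sketch, not the original code):

df_price.to_csv('House_Sales_Explanatory_Price.csv', header=False)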

Save training data as csv

For the training data, the remaining unnecessary columns are dropped, the categorical features are converted to dummy variables with get_dummies, and the result is saved as CSV.

df_1 = df_1.drop(columns = ['price', 'yr_built', 'yr_renovated'])
df_1 = pd.get_dummies(df_1, columns = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'age', 'renovated_age'], drop_first = True)
df_1.to_csv('House_Sales_Explanatory_Preprocessing.csv')

Machine learning is performed using the preprocessed data.

Here, we train a model using the preprocessed data created above. Since this is a separate stage of the pipeline, it is better to keep it as a separate project.

Import the required libraries

Import the libraries required for this stage.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Read the CSV file with pandas

Load the preprocessed CSV file saved earlier with pandas. The saved index shows up as an unnecessary column called 'Unnamed: 0', so it is dropped. set_option limits the display to four columns.

pd.set_option('display.max_columns', 4)
df = pd.read_csv('House_Sales_Explanatory_Preprocessing.csv')
df = df.drop(columns = ['Unnamed: 0'])
df.head()
sqft_living sqft_above ... renovated_age_80 renovated_age_86
0 -1.026685 -0.725963 ... 0 0
1 0.769106 0.635702 ... 0 0
2 -1.556379 -1.289885 ... 0 0
3 -0.018975 -0.904768 ... 0 0
4 -0.380717 -0.038253 ... 0 0

Read the CSV file with pandas (price)

The price column, which is the target (correct answer) data, is read in the same way.

df_price = pd.read_csv('House_Sales_Explanatory_Price.csv', header=None, names=['price'])
df_price.head()
price
0 221900.0
1 538000.0
2 180000.0
3 604000.0
4 510000.0

Create a data frame with only the features to be used.

In preliminary experiments for this model, there was no significant difference in accuracy between a model trained on all of the preprocessed features and one trained with the less important features removed, so I decided to use only the following features. All of the numeric features are kept; of the categorical features, only grade is retained and the rest are dropped. The main reason is that implementing a front end for many categorical inputs was expected to be a hassle when eventually building an application on top of the model.

df = df[["sqft_living","sqft_above","sqft_basement","lat","long","sqft_living15","grade_3","grade_4","grade_5","grade_6","grade_7"
        ,"grade_8","grade_9","grade_10","grade_11","grade_12"]]
sqft_living sqft_above ... grade_11 grade_12
0 -1.026685 -0.725963 ... 0 0
1 0.769106 0.635702 ... 0 0
2 -1.556379 -1.289885 ... 0 0
3 -0.018975 -0.904768 ... 0 0
4 -0.380717 -0.038253 ... 0 0

5 rows × 16 columns

Generate a learning model with XGBoost

Here, we import XGBoost and train the model. The parameters are left almost at their defaults.

import xgboost as xgb
X_train, X_test, y_train, y_test = train_test_split(df, df_price, random_state = 0)
params = {
    'silent' : 1,  # replaced by 'verbosity' in newer XGBoost versions
    'max_depth' : 6,
    'min_child_weight' : 1,
    'eta' : 0.1,
    'tree_method' : 'exact',
    'objective' : 'reg:linear',  # renamed to 'reg:squarederror' in newer XGBoost versions
    'eval_metric' : 'rmse',
    'predictor' : 'cpu_predictor'
}

dtrain = xgb.DMatrix(X_train, label = y_train)
dtest = xgb.DMatrix(X_test, label = y_test)

model = xgb.train(params = params,
                 dtrain = dtrain,
                 num_boost_round = 200,
                 early_stopping_rounds = 10,
                 evals = [(dtest, 'test')])
[0]	test-rmse:471544
Will train until test-rmse hasn't improved in 10 rounds.
[1]	test-rmse:427350
[2]	test-rmse:387757
[3]	test-rmse:352314
[4]	test-rmse:320602
[5]	test-rmse:292132
[6]	test-rmse:266667
[7]	test-rmse:244148
[8]	test-rmse:223983
[9]	test-rmse:206046
[10]	test-rmse:190112
[11]	test-rmse:176111
[12]	test-rmse:163754
[13]	test-rmse:152820
[14]	test-rmse:143269
[15]	test-rmse:134879
[16]	test-rmse:127772
[17]	test-rmse:121362
[18]	test-rmse:115939
[19]	test-rmse:111405
[20]	test-rmse:107280
[21]	test-rmse:103750
[22]	test-rmse:100928
[23]	test-rmse:98446.5
[24]	test-rmse:96280.3
[25]	test-rmse:94419.2
[26]	test-rmse:92933.6
[27]	test-rmse:91644.1
[28]	test-rmse:90581.3
[29]	test-rmse:89422.8
[30]	test-rmse:88575.7
[31]	test-rmse:88038.8
[32]	test-rmse:87254.6
[33]	test-rmse:86857.1
[34]	test-rmse:86527.8
[35]	test-rmse:86238.3
[36]	test-rmse:85950
[37]	test-rmse:85705
[38]	test-rmse:85532.4
[39]	test-rmse:85346.7
[40]	test-rmse:85204.1
[41]	test-rmse:85058.9
[42]	test-rmse:84926.7
[43]	test-rmse:84845.4
[44]	test-rmse:84671.9
[45]	test-rmse:84539.6
[46]	test-rmse:84380.6
[47]	test-rmse:84287.2
[48]	test-rmse:84254.7
[49]	test-rmse:84168.9
[50]	test-rmse:84106.6
[51]	test-rmse:83858.5
[52]	test-rmse:83829.8
[53]	test-rmse:83809.5
[54]	test-rmse:83726
[55]	test-rmse:83704.2
[56]	test-rmse:83650.4
[57]	test-rmse:83422.6
[58]	test-rmse:83405.8
[59]	test-rmse:83281
[60]	test-rmse:83293.6
[61]	test-rmse:83289.4
[62]	test-rmse:83251.9
[63]	test-rmse:83237.5
[64]	test-rmse:83055.6
[65]	test-rmse:83051.9
[66]	test-rmse:82938.8
[67]	test-rmse:82932.7
[68]	test-rmse:82933.2
[69]	test-rmse:82859
[70]	test-rmse:82829.6
[71]	test-rmse:82840.5
[72]	test-rmse:82823
[73]	test-rmse:82827.4
[74]	test-rmse:82834.6
[75]	test-rmse:82845.9
[76]	test-rmse:82839.4
[77]	test-rmse:82828.5
[78]	test-rmse:82829.7
[79]	test-rmse:82651.8
[80]	test-rmse:82660
[81]	test-rmse:82637.3
[82]	test-rmse:82514.6
[83]	test-rmse:82497.6
[84]	test-rmse:82484.7
[85]	test-rmse:82486.3
[86]	test-rmse:82486.8
[87]	test-rmse:82496
[88]	test-rmse:82491.4
[89]	test-rmse:82486.6
[90]	test-rmse:82290.3
[91]	test-rmse:82265.1
[92]	test-rmse:82261.5
[93]	test-rmse:82236.5
[94]	test-rmse:82236.4
[95]	test-rmse:82111.9
[96]	test-rmse:82111.1
[97]	test-rmse:82111.3
[98]	test-rmse:82108
[99]	test-rmse:82097.1
[100]	test-rmse:82077.4
[101]	test-rmse:82041.9
[102]	test-rmse:82040
[103]	test-rmse:82042.6
[104]	test-rmse:82044.2
[105]	test-rmse:82033.7
[106]	test-rmse:82041.1
[107]	test-rmse:82028.4
[108]	test-rmse:82030.7
[109]	test-rmse:82036.4
[110]	test-rmse:82028.6
[111]	test-rmse:82020.3
[112]	test-rmse:82025.5
[113]	test-rmse:82024.9
[114]	test-rmse:82034
[115]	test-rmse:82025.2
[116]	test-rmse:81957.5
[117]	test-rmse:81950.9
[118]	test-rmse:81959.8
[119]	test-rmse:81936.7
[120]	test-rmse:81935.9
[121]	test-rmse:81937
[122]	test-rmse:81945.8
[123]	test-rmse:81894.8
[124]	test-rmse:81885.2
[125]	test-rmse:81899.3
[126]	test-rmse:81877
[127]	test-rmse:81875.7
[128]	test-rmse:81859.6
[129]	test-rmse:81849.7
[130]	test-rmse:81851.2
[131]	test-rmse:81839.4
[132]	test-rmse:81850.8
[133]	test-rmse:81846
[134]	test-rmse:81836.2
[135]	test-rmse:81827.2
[136]	test-rmse:81832.3
[137]	test-rmse:81859.6
[138]	test-rmse:81856.6
[139]	test-rmse:81850
[140]	test-rmse:81847.6
[141]	test-rmse:81842.8
[142]	test-rmse:81794.5
[143]	test-rmse:81803.8
[144]	test-rmse:81829.3
[145]	test-rmse:81815.9
[146]	test-rmse:81813.6
[147]	test-rmse:81741
[148]	test-rmse:81728.8
[149]	test-rmse:81714.4
[150]	test-rmse:81708.6
[151]	test-rmse:81592.3
[152]	test-rmse:81621.7
[153]	test-rmse:81624.8
[154]	test-rmse:81629.3
[155]	test-rmse:81615.7
[156]	test-rmse:81617.7
[157]	test-rmse:81613.9
[158]	test-rmse:81612.9
[159]	test-rmse:81594.9
[160]	test-rmse:81595.1
[161]	test-rmse:81581.7
[162]	test-rmse:81595.3
[163]	test-rmse:81603.8
[164]	test-rmse:81601.2
[165]	test-rmse:81600.5
[166]	test-rmse:81552.3
[167]	test-rmse:81557.6
[168]	test-rmse:81565.5
[169]	test-rmse:81566.6
[170]	test-rmse:81581.9
[171]	test-rmse:81570.5
[172]	test-rmse:81571.8
[173]	test-rmse:81569.4
[174]	test-rmse:81494.3
[175]	test-rmse:81476.3
[176]	test-rmse:81454
[177]	test-rmse:81422.6
[178]	test-rmse:81426.1
[179]	test-rmse:81410.8
[180]	test-rmse:81425.1
[181]	test-rmse:81418.2
[182]	test-rmse:81419.4
[183]	test-rmse:81409.6
[184]	test-rmse:81392.1
[185]	test-rmse:81389.3
[186]	test-rmse:81391.1
[187]	test-rmse:81414.5
[188]	test-rmse:81369.9
[189]	test-rmse:81368.3
[190]	test-rmse:81358.4
[191]	test-rmse:81347.7
[192]	test-rmse:81355.4
[193]	test-rmse:81349.2
[194]	test-rmse:81343
[195]	test-rmse:81346.3
[196]	test-rmse:81345.5
[197]	test-rmse:81374.6
[198]	test-rmse:81358.5
[199]	test-rmse:81359.4

Generate parameters for grid search

Let's run a grid search to see whether the model can be made a little more accurate. First, generate the parameter combinations for the grid search.

gridsearch_params = [
    (max_depth, eta)
    for max_depth in [6, 7, 8]
    for eta in [0.1, 0.05, 0.01]
]
gridsearch_params

[(6, 0.1), (6, 0.05), (6, 0.01), (7, 0.1), (7, 0.05), (7, 0.01), (8, 0.1), (8, 0.05), (8, 0.01)]

Calculate the most accurate parameters

Here, each combination of parameters is evaluated with cross-validation to find the most accurate one. The result is Best params (8, 0.01).

min_rmse = float('Inf')

best_param = []

for max_depth, eta in gridsearch_params:
    print('max_depth = {}, eta = {}'.format(max_depth, eta))
    
    params['max_depth'] = max_depth
    params['eta'] = eta
    
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round = 1000,
        seed = 0,
        nfold = 5,
        metrics = {'rmse'},
        early_stopping_rounds = 5
    )
    
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print('RMSE {} for {} rounds'.format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_param = (max_depth, eta)

print('Best params {}, RMSE {}'.format(best_param, min_rmse))
max_depth = 6, eta = 0.1
RMSE 81689.0296874 for 123 rounds
max_depth = 6, eta = 0.05
RMSE 81545.2953126 for 267 rounds
max_depth = 6, eta = 0.01
RMSE 82118.7765624 for 999 rounds
max_depth = 7, eta = 0.1
RMSE 81372.990625 for 161 rounds
max_depth = 7, eta = 0.05
RMSE 81372.7171876 for 202 rounds
max_depth = 7, eta = 0.01
RMSE 81308.89999979999 for 999 rounds
max_depth = 8, eta = 0.1
RMSE 81277.4515624 for 96 rounds
max_depth = 8, eta = 0.05
RMSE 81155.2687498 for 201 rounds
max_depth = 8, eta = 0.01
RMSE 81080.3156252 for 849 rounds
Best params (8, 0.01), RMSE 81080.3156252

Change the parameters and generate the training model again

Generate the training model again using the best parameters found by the grid search.

params['max_depth'] = 8
params['eta'] = 0.01

model = xgb.train(params = params,
                 dtrain = dtrain,
                 num_boost_round = 1000,
                 early_stopping_rounds = 5,
                 evals = [(dtest, 'test')])
[0]	test-rmse:515961
Will train until test-rmse hasn't improved in 5 rounds.
[1]	test-rmse:511040
[2]	test-rmse:506173
[3]	test-rmse:501356
[4]	test-rmse:496588
[5]	test-rmse:491874
[6]	test-rmse:487210
[7]	test-rmse:482584
[8]	test-rmse:478012
[9]	test-rmse:473483
[10]	test-rmse:468999
[11]	test-rmse:464566
[12]	test-rmse:460175
[13]	test-rmse:455822
[14]	test-rmse:451526
[15]	test-rmse:447271
[16]	test-rmse:443066
[17]	test-rmse:438893
[18]	test-rmse:434766
[19]	test-rmse:430688
[20]	test-rmse:426657
[21]	test-rmse:422664
[22]	test-rmse:418718
[23]	test-rmse:414801
[24]	test-rmse:410941
[25]	test-rmse:407098
[26]	test-rmse:403307
[27]	test-rmse:399547
[28]	test-rmse:395829
[29]	test-rmse:392152
[30]	test-rmse:388512
[31]	test-rmse:384923
[32]	test-rmse:381353
[33]	test-rmse:377832
[34]	test-rmse:374343
[35]	test-rmse:370902
[36]	test-rmse:367481
[37]	test-rmse:364100
[38]	test-rmse:360764
[39]	test-rmse:357461
[40]	test-rmse:354187
[41]	test-rmse:350960
[42]	test-rmse:347750
[43]	test-rmse:344572
[44]	test-rmse:341426
[45]	test-rmse:338314
[46]	test-rmse:335245
[47]	test-rmse:332213
[48]	test-rmse:329190
[49]	test-rmse:326207
[50]	test-rmse:323269
[51]	test-rmse:320358
[52]	test-rmse:317476
[53]	test-rmse:314619
[54]	test-rmse:311801
[55]	test-rmse:309001
[56]	test-rmse:306233
[57]	test-rmse:303498
[58]	test-rmse:300790
[59]	test-rmse:298113
[60]	test-rmse:295470
[61]	test-rmse:292855
[62]	test-rmse:290269
[63]	test-rmse:287704
[64]	test-rmse:285165
[65]	test-rmse:282666
[66]	test-rmse:280184
[67]	test-rmse:277741
[68]	test-rmse:275313
[69]	test-rmse:272907
[70]	test-rmse:270532
[71]	test-rmse:268190
[72]	test-rmse:265866
[73]	test-rmse:263562
[74]	test-rmse:261295
[75]	test-rmse:259046
[76]	test-rmse:256816
[77]	test-rmse:254625
[78]	test-rmse:252448
[79]	test-rmse:250296
[80]	test-rmse:248164
[81]	test-rmse:246063
[82]	test-rmse:243969
[83]	test-rmse:241918
[84]	test-rmse:239880
[85]	test-rmse:237853
[86]	test-rmse:235862
[87]	test-rmse:233881
[88]	test-rmse:231939
[89]	test-rmse:230004
[90]	test-rmse:228095
[91]	test-rmse:226206
[92]	test-rmse:224346
[93]	test-rmse:222499
[94]	test-rmse:220670
[95]	test-rmse:218861
[96]	test-rmse:217075
[97]	test-rmse:215311
[98]	test-rmse:213567
[99]	test-rmse:211827
[100]	test-rmse:210120
[101]	test-rmse:208421
[102]	test-rmse:206737
[103]	test-rmse:205088
[104]	test-rmse:203457
[105]	test-rmse:201827
[106]	test-rmse:200212
[107]	test-rmse:198636
[108]	test-rmse:197085
[109]	test-rmse:195530
[110]	test-rmse:194010
[111]	test-rmse:192494
[112]	test-rmse:191011
[113]	test-rmse:189524
[114]	test-rmse:188077
[115]	test-rmse:186631
[116]	test-rmse:185212
[117]	test-rmse:183809
[118]	test-rmse:182411
[119]	test-rmse:181043
[120]	test-rmse:179675
[121]	test-rmse:178325
[122]	test-rmse:177006
[123]	test-rmse:175698
[124]	test-rmse:174401
[125]	test-rmse:173124
[126]	test-rmse:171857
[127]	test-rmse:170612
[128]	test-rmse:169374
[129]	test-rmse:168161
[130]	test-rmse:166952
[131]	test-rmse:165766
[132]	test-rmse:164596
[133]	test-rmse:163434
[134]	test-rmse:162284
[135]	test-rmse:161156
[136]	test-rmse:160034
[137]	test-rmse:158914
[138]	test-rmse:157821
[139]	test-rmse:156745
[140]	test-rmse:155676
[141]	test-rmse:154628
[142]	test-rmse:153593
[143]	test-rmse:152568
[144]	test-rmse:151559
[145]	test-rmse:150558
[146]	test-rmse:149572
[147]	test-rmse:148603
[148]	test-rmse:147644
[149]	test-rmse:146701
[150]	test-rmse:145766
[151]	test-rmse:144831
[152]	test-rmse:143911
[153]	test-rmse:143000
[154]	test-rmse:142102
[155]	test-rmse:141215
[156]	test-rmse:140345
[157]	test-rmse:139482
[158]	test-rmse:138627
[159]	test-rmse:137799
[160]	test-rmse:136970
[161]	test-rmse:136155
[162]	test-rmse:135347
[163]	test-rmse:134549
[164]	test-rmse:133771
[165]	test-rmse:132997
[166]	test-rmse:132246
[167]	test-rmse:131489
[168]	test-rmse:130746
[169]	test-rmse:130024
[170]	test-rmse:129296
[171]	test-rmse:128587
[172]	test-rmse:127886
[173]	test-rmse:127192
[174]	test-rmse:126505
[175]	test-rmse:125824
[176]	test-rmse:125160
[177]	test-rmse:124501
[178]	test-rmse:123857
[179]	test-rmse:123216
[180]	test-rmse:122583
[181]	test-rmse:121954
[182]	test-rmse:121339
[183]	test-rmse:120737
[184]	test-rmse:120148
[185]	test-rmse:119561
[186]	test-rmse:118982
[187]	test-rmse:118408
[188]	test-rmse:117840
[189]	test-rmse:117286
[190]	test-rmse:116739
[191]	test-rmse:116198
[192]	test-rmse:115670
[193]	test-rmse:115143
[194]	test-rmse:114633
[195]	test-rmse:114128
[196]	test-rmse:113628
[197]	test-rmse:113133
[198]	test-rmse:112648
[199]	test-rmse:112167
[200]	test-rmse:111694
[201]	test-rmse:111232
[202]	test-rmse:110769
[203]	test-rmse:110309
[204]	test-rmse:109870
[205]	test-rmse:109429
[206]	test-rmse:109001
[207]	test-rmse:108584
[208]	test-rmse:108159
[209]	test-rmse:107745
[210]	test-rmse:107338
[211]	test-rmse:106934
[212]	test-rmse:106543
[213]	test-rmse:106161
[214]	test-rmse:105774
[215]	test-rmse:105404
[216]	test-rmse:105032
[217]	test-rmse:104666
[218]	test-rmse:104306
[219]	test-rmse:103951
[220]	test-rmse:103605
[221]	test-rmse:103256
[222]	test-rmse:102918
[223]	test-rmse:102581
[224]	test-rmse:102258
[225]	test-rmse:101929
[226]	test-rmse:101614
[227]	test-rmse:101305
[228]	test-rmse:101001
[229]	test-rmse:100687
[230]	test-rmse:100393
[231]	test-rmse:100106
[232]	test-rmse:99803.7
[233]	test-rmse:99521.5
[234]	test-rmse:99228
[235]	test-rmse:98952.9
[236]	test-rmse:98687
[237]	test-rmse:98407.6
[238]	test-rmse:98145.9
[239]	test-rmse:97895.6
[240]	test-rmse:97630.3
[241]	test-rmse:97373.5
[242]	test-rmse:97131.5
[243]	test-rmse:96879.7
[244]	test-rmse:96638.5
[245]	test-rmse:96409.2
[246]	test-rmse:96174.5
[247]	test-rmse:95950.8
[248]	test-rmse:95724
[249]	test-rmse:95504.3
[250]	test-rmse:95286
[251]	test-rmse:95063.2
[252]	test-rmse:94852.8
[253]	test-rmse:94646.3
[254]	test-rmse:94438.7
[255]	test-rmse:94227.9
[256]	test-rmse:94032
[257]	test-rmse:93828.1
[258]	test-rmse:93637
[259]	test-rmse:93447.4
[260]	test-rmse:93264
[261]	test-rmse:93072.1
[262]	test-rmse:92886.1
[263]	test-rmse:92699.4
[264]	test-rmse:92519.7
[265]	test-rmse:92341.1
[266]	test-rmse:92158.9
[267]	test-rmse:91984.2
[268]	test-rmse:91818.9
[269]	test-rmse:91667.4
[270]	test-rmse:91508.6
[271]	test-rmse:91340.9
[272]	test-rmse:91179.8
[273]	test-rmse:91036.4
[274]	test-rmse:90880.2
[275]	test-rmse:90730.6
[276]	test-rmse:90586.1
[277]	test-rmse:90440.3
[278]	test-rmse:90301.1
[279]	test-rmse:90168.4
[280]	test-rmse:90031.9
[281]	test-rmse:89908.5
[282]	test-rmse:89775.1
[283]	test-rmse:89654.3
[284]	test-rmse:89526.7
[285]	test-rmse:89395.2
[286]	test-rmse:89275.8
[287]	test-rmse:89160.1
[288]	test-rmse:89035.6
[289]	test-rmse:88924.4
[290]	test-rmse:88812.4
[291]	test-rmse:88696.1
[292]	test-rmse:88588.3
[293]	test-rmse:88483.3
[294]	test-rmse:88367.4
[295]	test-rmse:88265.7
[296]	test-rmse:88159
[297]	test-rmse:88060.3
[298]	test-rmse:87956.8
[299]	test-rmse:87859.5
[300]	test-rmse:87763.9
[301]	test-rmse:87660.7
[302]	test-rmse:87573.6
[303]	test-rmse:87475.7
[304]	test-rmse:87378.3
[305]	test-rmse:87287.8
[306]	test-rmse:87194.3
[307]	test-rmse:87113.9
[308]	test-rmse:87024.7
[309]	test-rmse:86936.5
[310]	test-rmse:86847.3
[311]	test-rmse:86761.9
[312]	test-rmse:86679.8
[313]	test-rmse:86612.5
[314]	test-rmse:86528
[315]	test-rmse:86449.6
[316]	test-rmse:86374.7
[317]	test-rmse:86297.3
[318]	test-rmse:86216.9
[319]	test-rmse:86147
[320]	test-rmse:86085.8
[321]	test-rmse:86018.1
[322]	test-rmse:85941.5
[323]	test-rmse:85878.8
[324]	test-rmse:85815.2
[325]	test-rmse:85755.3
[326]	test-rmse:85691.4
[327]	test-rmse:85631.7
[328]	test-rmse:85554.2
[329]	test-rmse:85478.2
[330]	test-rmse:85420.4
[331]	test-rmse:85355
[332]	test-rmse:85282.9
[333]	test-rmse:85212.7
[334]	test-rmse:85156.1
[335]	test-rmse:85089.2
[336]	test-rmse:85042.1
[337]	test-rmse:84977.9
[338]	test-rmse:84916.3
[339]	test-rmse:84865
[340]	test-rmse:84819.4
[341]	test-rmse:84764.9
[342]	test-rmse:84698.7
[343]	test-rmse:84655.8
[344]	test-rmse:84595.1
[345]	test-rmse:84546.1
[346]	test-rmse:84496.5
[347]	test-rmse:84446.9
[348]	test-rmse:84401.1
[349]	test-rmse:84349.7
[350]	test-rmse:84312.6
[351]	test-rmse:84263.9
[352]	test-rmse:84217.4
[353]	test-rmse:84176.9
[354]	test-rmse:84126.8
[355]	test-rmse:84081.5
[356]	test-rmse:84037.6
[357]	test-rmse:84001.1
[358]	test-rmse:83961.8
[359]	test-rmse:83922.8
[360]	test-rmse:83884.8
[361]	test-rmse:83842.4
[362]	test-rmse:83805.7
[363]	test-rmse:83771.6
[364]	test-rmse:83738.9
[365]	test-rmse:83701.5
[366]	test-rmse:83668
[367]	test-rmse:83633.7
[368]	test-rmse:83591.7
[369]	test-rmse:83552.1
[370]	test-rmse:83514.7
[371]	test-rmse:83479.3
[372]	test-rmse:83440.2
[373]	test-rmse:83412.3
[374]	test-rmse:83380.3
[375]	test-rmse:83346.3
[376]	test-rmse:83309.6
[377]	test-rmse:83272.6
[378]	test-rmse:83243.7
[379]	test-rmse:83211.3
[380]	test-rmse:83184.4
[381]	test-rmse:83151.7
[382]	test-rmse:83119.6
[383]	test-rmse:83089.4
[384]	test-rmse:83056.3
[385]	test-rmse:83023.5
[386]	test-rmse:82994.4
[387]	test-rmse:82964.4
[388]	test-rmse:82936.3
[389]	test-rmse:82907.3
[390]	test-rmse:82873.7
[391]	test-rmse:82845.4
[392]	test-rmse:82816.8
[393]	test-rmse:82790.7
[394]	test-rmse:82766
[395]	test-rmse:82740.9
[396]	test-rmse:82719.9
[397]	test-rmse:82695.1
[398]	test-rmse:82672.3
[399]	test-rmse:82647.9
[400]	test-rmse:82629.9
[401]	test-rmse:82602
[402]	test-rmse:82581.2
[403]	test-rmse:82562.3
[404]	test-rmse:82541
[405]	test-rmse:82524.3
[406]	test-rmse:82504
[407]	test-rmse:82490.7
[408]	test-rmse:82472
[409]	test-rmse:82448
[410]	test-rmse:82424.8
[411]	test-rmse:82408.9
[412]	test-rmse:82395.4
[413]	test-rmse:82373.6
[414]	test-rmse:82358.9
[415]	test-rmse:82336.1
[416]	test-rmse:82322.6
[417]	test-rmse:82301.7
[418]	test-rmse:82282.6
[419]	test-rmse:82268.4
[420]	test-rmse:82253.9
[421]	test-rmse:82229.1
[422]	test-rmse:82207
[423]	test-rmse:82188.9
[424]	test-rmse:82176.5
[425]	test-rmse:82170.7
[426]	test-rmse:82157
[427]	test-rmse:82151.2
[428]	test-rmse:82139.1
[429]	test-rmse:82126.4
[430]	test-rmse:82108.7
[431]	test-rmse:82098.1
[432]	test-rmse:82087.3
[433]	test-rmse:82075.5
[434]	test-rmse:82063.7
[435]	test-rmse:82054
[436]	test-rmse:82039.1
[437]	test-rmse:82027.3
[438]	test-rmse:82014.6
[439]	test-rmse:82005.3
[440]	test-rmse:81993.7
[441]	test-rmse:81984.7
[442]	test-rmse:81973.3
[443]	test-rmse:81955.5
[444]	test-rmse:81943.4
[445]	test-rmse:81932.8
[446]	test-rmse:81918.6
[447]	test-rmse:81909.1
[448]	test-rmse:81899.2
[449]	test-rmse:81886.5
[450]	test-rmse:81873.7
[451]	test-rmse:81863
[452]	test-rmse:81854.2
[453]	test-rmse:81842.5
[454]	test-rmse:81831.3
[455]	test-rmse:81821.2
[456]	test-rmse:81811.4
[457]	test-rmse:81804.7
[458]	test-rmse:81789.8
[459]	test-rmse:81784.3
[460]	test-rmse:81779.4
[461]	test-rmse:81771.3
[462]	test-rmse:81756.4
[463]	test-rmse:81751.9
[464]	test-rmse:81739.6
[465]	test-rmse:81730.1
[466]	test-rmse:81719.8
[467]	test-rmse:81710.2
[468]	test-rmse:81701.1
[469]	test-rmse:81689.9
[470]	test-rmse:81685
[471]	test-rmse:81675.6
[472]	test-rmse:81670.9
[473]	test-rmse:81659.8
[474]	test-rmse:81651.6
[475]	test-rmse:81641.8
[476]	test-rmse:81632.5
[477]	test-rmse:81629.2
[478]	test-rmse:81619.2
[479]	test-rmse:81611
[480]	test-rmse:81608
[481]	test-rmse:81599.1
[482]	test-rmse:81588.6
[483]	test-rmse:81578.9
[484]	test-rmse:81573.4
[485]	test-rmse:81570.2
[486]	test-rmse:81558.9
[487]	test-rmse:81554.5
[488]	test-rmse:81544.7
[489]	test-rmse:81533.8
[490]	test-rmse:81526.6
[491]	test-rmse:81518.9
[492]	test-rmse:81512.2
[493]	test-rmse:81498.3
[494]	test-rmse:81495.6
[495]	test-rmse:81488.1
[496]	test-rmse:81478.6
[497]	test-rmse:81469.1
[498]	test-rmse:81463
[499]	test-rmse:81462.4
[500]	test-rmse:81454.6
[501]	test-rmse:81453.7
[502]	test-rmse:81450.8
[503]	test-rmse:81443
[504]	test-rmse:81434
[505]	test-rmse:81430.1
[506]	test-rmse:81427.5
[507]	test-rmse:81421.4
[508]	test-rmse:81421.2
[509]	test-rmse:81415.5
[510]	test-rmse:81413
[511]	test-rmse:81409.5
[512]	test-rmse:81396.6
[513]	test-rmse:81395.1
[514]	test-rmse:81395.3
[515]	test-rmse:81392.2
[516]	test-rmse:81391.4
[517]	test-rmse:81388.2
[518]	test-rmse:81383.8
[519]	test-rmse:81379.3
[520]	test-rmse:81379.8
[521]	test-rmse:81376.9
[522]	test-rmse:81376.9
[523]	test-rmse:81375.1
[524]	test-rmse:81370.2
[525]	test-rmse:81365.3
[526]	test-rmse:81364.3
[527]	test-rmse:81363.6
[528]	test-rmse:81362.5
[529]	test-rmse:81358.9
[530]	test-rmse:81354.8
[531]	test-rmse:81353.9
[532]	test-rmse:81355.2
[533]	test-rmse:81355.2
[534]	test-rmse:81356.4
[535]	test-rmse:81356.3
[536]	test-rmse:81352.6
[537]	test-rmse:81347.5
[538]	test-rmse:81347.6
[539]	test-rmse:81349.5
[540]	test-rmse:81350.3
[541]	test-rmse:81351.2
[542]	test-rmse:81351.6
Stopping. Best iteration:
[537]	test-rmse:81347.5

Display features with high importance.

Let's display which features were most important in the model trained above. Looking at the plot, features such as long, lat, and sqft_living are very important.

fig, ax = plt.subplots(figsize = (12,12))
xgb.plot_importance(model, max_num_features = 12, height = 0.8, ax = ax)
plt.savefig('house_sails_feature_importance03.png', bbox_inches='tight')

house_sails_feature_importance03.png

Let's measure the accuracy of the learned model.

Let's measure the accuracy of the trained model. Using r2_score on the held-out test data, the accuracy is about 0.847.

from sklearn.metrics import r2_score
preds = model.predict(dtest)
r2 = r2_score(y_test, preds)
print(r2)

0.8473346069012444
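
For comparison with the test-rmse values in the training log, the RMSE of the same predictions can also be computed directly (a small sketch using scikit-learn):

from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(rmse)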

Save the learning model

Finally, use joblib to save the trained model as a pkl file. You now have a model trained by machine learning.

# sklearn.externals.joblib was removed in newer scikit-learn releases;
# with recent versions, use "import joblib" instead
from sklearn.externals import joblib
joblib.dump(model, 'house_sales_model.pkl')

['house_sales_model.pkl']

Create an API server in Flask

Here, we turn the model generated by machine learning into an API server. Flask, a Python micro web framework, is used for the API server development. The flow is: build a virtual environment with conda, test a simple API server, and then load the model trained with XGBoost into it.

Build a virtual environment with conda

The virtual environment uses Anaconda's conda. In the terminal, create a folder for application development (housesails_api in this case) and move into it. Then conda create creates the virtual environment and conda activate activates it.

mkdir housesails_api
cd housesails_api
conda create -n housesailsenv
conda activate housesailsenv
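
Note that conda create with no packages listed makes an empty environment; in practice you would include Python when creating it and then install the required libraries into it. A sketch (the Python version is an assumption):

conda create -n housesailsenv python=3.7
conda activate housesailsenv
pip install flask flask-httpauth xgboost pandas scikit-learn joblib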

Develop API in Flask

To develop an API server in Flask, let's first create and test a simple API server. Create the following folders and files inside the folder you created earlier. Write the code below in each file, start the API server, and if you can communicate with it from curl, the simple API server test is successful.

Generate the required folders and files in the terminal.

Create folders and files so that they form the following hierarchy. The touch command is convenient for creating empty files; see the sketch after the tree.

housesails_api
├── api
│   ├── __init__.py
│   └── views
│       └── user.py
├── housesails_app.py
└── house_sales_model.pkl
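
As a sketch, the tree above can be created like this (assuming a POSIX shell, run inside housesails_api; the copy source for the model file is a placeholder to adjust):

mkdir -p api/views
touch api/__init__.py api/views/user.py housesails_app.py
cp /path/to/house_sales_model.pkl .    # copy the model saved by the training step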

Write the code in the created files

Write the code in the files you just created as follows. Three files are needed to test the simple API server: api/views/user.py, api/__init__.py, and housesails_app.py. vim is convenient when writing in the terminal, and an editor such as Atom when working in a GUI.

api/views/user.py


from flask import Blueprint, request, make_response, jsonify

#Routing settings
user_router = Blueprint('user_router', __name__)

#Specify path and HTTP method
@user_router.route('/users', methods=['GET'])
def get_user_list():

  return make_response(jsonify({
    'users': [
       {
         'id': 1,
         'name': 'John'
       }
     ]
  }))

api/__init__.py


from flask import Flask, make_response, jsonify
from .views.user import user_router

def create_app():

  app = Flask(__name__)
  app.register_blueprint(user_router, url_prefix='/api')

  return app

app = create_app()
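
Before wiring in the model, the Blueprint above can be smoke-tested on its own. One way (an assumption; the original article does not show this step) is to point Flask's CLI at the api package and hit the test route with curl:

# Run from the housesails_api folder so the api package is importable
FLASK_APP=api flask run --port 5000

# In another terminal
curl http://localhost:5000/api/users
# Expected: {"users": [{"id": 1, "name": "John"}]}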

housesails_app.py


import json

from flask import Flask
from flask import request
from flask import abort

import pandas as pd
from sklearn.externals import joblib
import xgboost as xgb

model = joblib.load("house_sales_model.pkl")

app = Flask(__name__)

# Get headers for payload
headers = ['sqft_living','sqft_above','sqft_basement','lat','long','sqft_living15','grade_3','grade_4','grade_5','grade_6','grade_7','grade_8','grade_9','grade_10','grade_11','grade_12']

@app.route('/house_sails', methods=['POST'])
def housesails():
    # Reject requests without a JSON body
    if not request.json:
        abort(400)
    # The payload is a comma-separated string of the 16 standardized feature values
    payload = request.json['data']
    values = [float(i) for i in payload.split(',')]
    data1 = pd.DataFrame([values], columns=headers, dtype=float)
    predict = model.predict(xgb.DMatrix(data1))
    return json.dumps(str(predict[0]))


if __name__ == "__main__":
    app.run(debug=True, port=5000)

API communication test with curl

After rewriting the code, start the API server again with python housesails_app.py. Once the server is up, run the communication test with the curl command shown below. If a predicted price is returned for the posted JSON data, the test is successful. You now have the model generated by machine learning running as an API server.

curl http://localhost:5000/house_sails -s -X POST -H "Content-Type: application/json" -d '{"data": "-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0"}'

Add an authentication function

Finally, add an authentication function to the API server. This time, basic authentication is implemented, so the flask-httpauth library must be installed. You can implement it by adding the sections marked # BasicAuth to the housesails_app.py code from earlier.

housesails_app.py


import json

from flask import Flask
from flask import request
from flask import abort
from flask_httpauth import HTTPBasicAuth

import pandas as pd
from sklearn.externals import joblib
import xgboost as xgb

model = joblib.load("house_sales_model.pkl")

app = Flask(__name__)

# BasicAuth
auth = HTTPBasicAuth()

users = {
    "user01": "password01",
    "user02": "password02"
}

@auth.get_password
def get_pw(username):
    if username in users:
        return users.get(username)
    return None

# Get headers for payload
headers = ['sqft_living','sqft_above','sqft_basement','lat','long','sqft_living15','grade_3','grade_4','grade_5','grade_6','grade_7','grade_8','grade_9','grade_10','grade_11','grade_12']

@app.route('/house_sails', methods=['POST'])
# BasicAuth
@auth.login_required
def housesails():
    if not request.json:
        abort(400)
    payload = request.json['data']
    values = [float(i) for i in payload.split(',')]
    data1 = pd.DataFrame([values], columns=headers, dtype=float)
    predict = model.predict(xgb.DMatrix(data1))
    return json.dumps(str(predict[0]))


if __name__ == "__main__":
    app.run(debug=True, port=5000)

API communication test with curl again

Start the API server again with python housesails_app.py. Once it is up, run the communication test with curl as shown below; adding --user user01:password01 authenticates the request. If the request succeeds, basic authentication is working.

curl http://localhost:5000/house_sails --user user01:password01 -s -X POST -H "Content-Type: application/json" -d '{"data": "-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0"}'
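
As a usage sketch, the same authenticated request can be sent from Python with the requests library (an assumption; it is not used elsewhere in this article):

import requests

res = requests.post(
    'http://localhost:5000/house_sails',
    json={'data': '-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0'},
    auth=('user01', 'password01')
)
print(res.text)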
