[PYTHON] Predicting House Prices (Machine Learning: Second Half) ver1.1

Introduction

Continuing from the previous post, this time I combine several machine learning models to make predictions for the House Prices problem used as a tutorial on Kaggle.

Previous post: https://qiita.com/Fumio-eisan/items/061695b2e3b53ac2a750

Libraries for machine learning


import numpy as np  # used throughout for array operations

from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb


About cross validation of models

Cross validation (CV) is a technique for estimating model prediction errors and is widely used in machine learning.

  1. Divide the training data into several parts (each part is called a fold).
  2. Use one fold as validation data, train the model on the remaining folds, and compute the score on the validation fold.
  3. Rotate which fold is used for validation and repeat step 2 as many times as there are folds.
  4. Evaluate the quality of the model by averaging those scores (a minimal illustration follows below).
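
As a side note, the four steps can be written out explicitly with a KFold loop. A minimal, self-contained sketch on synthetic data (not the competition data), using a plain linear model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (100 samples, 3 features)
X = np.random.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.rand(100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # step 1: split into 5 folds
scores = []
for train_index, valid_index in kf.split(X):             # step 3: rotate the validation fold
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])             # step 2: train on the other folds
    pred = model.predict(X[valid_index])
    scores.append(np.sqrt(mean_squared_error(y[valid_index], pred)))  # step 2: score the held-out fold
print(np.mean(scores))                                    # step 4: average the fold scores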

Reference URL https://note.com/naokiwifruit/n/nc90ca48f16e5 https://qiita.com/LicaOka/items/c6725aa8961df9332cc7


n_folds = 5

def rmsle_cv(model):
    # train (features) and y_train (target) come from the first-half preprocessing
    kf = KFold(n_folds, shuffle=True, random_state=42)  # shuffled 5-fold splitter passed as cv
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse

This time, the number of folds is 5.

About the evaluation metric (RMSLE)

RMSLE (Root Mean Squared Logarithmic Error) is used as the evaluation metric this time. It is the square root of the mean squared difference between the logarithm of the measured value and the logarithm of the predicted value.

RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\log(y_{i}+1)-\log(y_{pred,i}+1))^2} \\
n: \text{number of samples} \\
y_{i}: \text{measured value} \\
y_{pred,i}: \text{predicted value}
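
Written as code, the same metric can be computed directly. A minimal sketch (the helper name rmsle_direct is mine, chosen to avoid clashing with the rmsle function defined later; y_true and y_pred are raw prices here):

import numpy as np

def rmsle_direct(y_true, y_pred):
    # square root of the mean squared difference of log(y + 1)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))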

About the combination of models

We build the prediction model by combining a total of six models. The overall structure is shown in the diagram below (012.png).

Elastic Net Regression, Kernel Ridge Regression, LASSO Regression, and Gradient Boosting Regression are combined into a Stacked Averaged Model. This is then combined with XGBoost and LightGBM to form the final ensemble model. There are still many unclear points about how to optimize these combinations, so I will continue to study and summarize them.

Reference URL https://blog.ikedaosushi.com/entry/2018/10/21/204842?t=0

Creating a Stacked Averaged Model


lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

First, instantiate the LASSO Regression, Elastic Net Regression, Kernel Ridge Regression, and Gradient Boosting Regression models. Then compute the cross-validated RMSLE for each model.

score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Lasso score: 0.1115 (0.0074)
ElasticNet score: 0.1116 (0.0074)
Kernel Ridge score: 0.1153 (0.0075)
Gradient Boosting score: 0.1177 (0.0080)

These are the resulting scores. Now let's build a model that combines these base models. First, we simply average their predictions.

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Averaged base models score: 0.1091 (0.0075)

Even a simple average of the four models already gives a better (smaller) score. Next, instead of a plain average, we train a meta-model on the base models' out-of-fold predictions (stacking).


class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Stacking Averaged models score: 0.1085 (0.0074)

The score has improved a little further.

Instantiation of XGBoost model


model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)

Instantiation of LightGBM model

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

def rmsle(y, y_pred):
    # plain RMSE; because the target here is the log-transformed sale price,
    # this corresponds to the RMSLE defined above
    return np.sqrt(mean_squared_error(y, y_pred))
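
Although their scores are not reported in this article, the same rmsle_cv helper can be applied to these two models as well, for example:

score = rmsle_cv(model_xgb)
print("XGBoost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_lgb)
print("LightGBM score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))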

Stacking: Making an Ensemble Model

The final model is an ensemble of the learning models above: the predictions of the individual models are weighted (multiplied by coefficients) and summed.

y_{prediction} = w_{1} \cdot XGB + w_{2} \cdot LGB + w_{3} \cdot StR \\
w_{1} + w_{2} + w_{3} = 1 \\
y_{prediction}: \text{predicted value} \\
w_{1 \sim 3}: \text{coefficients} \\
XGB: \text{prediction by XGBoost} \\
LGB: \text{prediction by LightGBM} \\
StR: \text{prediction by the Stacked Regressor}
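
The code below uses training-set predictions (stacked_train_pred, xgb_train_pred, lgb_train_pred) and back-transformed test-set predictions (stacked_pred, xgb_pred, lgb_pred) whose computation is not shown in this excerpt. A minimal sketch of how they can be produced, assuming train, test, and the log-transformed y_train from the first-half preprocessing:

stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))  # back to price scale

model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))

model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test))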

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

RMSLE score on train data: 0.07530158653663023

The RMSLE on the training data has dropped to a value well below the cross-validation scores above (being an in-sample score, it is naturally on the optimistic side).
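
For reference, the ensemble array holds the back-transformed test-set prices and could be written out as a Kaggle submission. A sketch assuming a hypothetical test_ID Series (the Id column set aside during the first-half preprocessing, not shown here):

import pandas as pd

submission = pd.DataFrame()
submission['Id'] = test_ID            # hypothetical: Id column saved before preprocessing
submission['SalePrice'] = ensemble
submission.to_csv('submission.csv', index=False)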

In closing

This time, data preprocessing was covered in the first half and model training in the second half. On the preprocessing side, the key point was to handle missing values and categorical variables with the training and test data combined. Proper feature handling largely determines the accuracy of a model, and I felt it was important not just to fill in missing values but to pay attention to the causal relationships one expects in the data. On the modeling side, I learned the concept of stacking and the RMSLE evaluation metric. To stack models optimally, the characteristics of each model need to be dug into more deeply.

The full program is here. https://github.com/Fumio-eisan/houseprice20200301
