Continuing from the last time, this time I would like to make a prediction by combining various machine learning models for the problem of predicting the House Price (house price) used as a tutorial in kaggle.
Last time https://qiita.com/Fumio-eisan/items/061695b2e3b53ac2a750
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
Cross validation (CV) is a technique for estimating model prediction errors and is widely used in machine learning.
Reference URL https://note.com/naokiwifruit/n/nc90ca48f16e5 https://qiita.com/LicaOka/items/c6725aa8961df9332cc7
n_folds = 5
def rmsle_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
return(rmse)
This time, the number of folds is 5.
RMSLE (Root Mean Squared Logarithmic Error) is applied as the evaluation index this time. This value is calculated by the square root of the average difference after taking each logarithm of the new land and the predicted value.
RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (log(y_{i}+1)-log(y_{pred}+1))^2} \\
n:number\\
y_{i}:Measured value\\
y_{pred}:Predicted value
Make a prediction model by combining a total of 6 models. The image is below.
Elastic Net Regression, Kernel Ridge Regession, LASSO Regression, and Gradient Boosting Regression are combined to form a Stacked Averaged Model. Then, combine it with XGBoost and Litht GBM to make the final Ensemble Models. There are many unclear points about optimizing these combinations, so I will continue to learn and summarize them.
Reference URL https://blog.ikedaosushi.com/entry/2018/10/21/204842?t=0
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =5)
First, instantiate the LASSO regression, Elastic Net Regression, Kernel Ridge Regression, and Gradient Boosting Regression models. Then calculate each RMSLE value for the model.
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074) ElasticNet score: 0.1116 (0.0074) Kernel Ridge score: 0.1153 (0.0075) Gradient Boosting score: 0.1177 (0.0080)
have become. Now, let's make a model that collects these models. First, simply average.
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, models):
self.models = models
# we define clones of the original models to fit the data in
def fit(self, X, y):
self.models_ = [clone(x) for x in self.models]
# Train cloned base models
for model in self.models_:
model.fit(X, y)
return self
#Now we do the predictions for cloned models and average them
def predict(self, X):
predictions = np.column_stack([
model.predict(X) for model in self.models_
])
return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Averaged base models score: 0.1091 (0.0075) You can see that the value is better even if it is simply averaged (smaller is better this time).
Reference URL https://blog.ikedaosushi.com/entry/2018/10/21/204842?t=0
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, models):
self.models = models
# we define clones of the original models to fit the data in
def fit(self, X, y):
self.models_ = [clone(x) for x in self.models]
# Train cloned base models
for model in self.models_:
model.fit(X, y)
return self
#Now we do the predictions for cloned models and average them
def predict(self, X):
predictions = np.column_stack([
model.predict(X) for model in self.models_
])
return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Averaged base models score: 0.1091 (0.0075)
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, base_models, meta_model, n_folds=5):
self.base_models = base_models
self.meta_model = meta_model
self.n_folds = n_folds
# We again fit the data on clones of the original models
def fit(self, X, y):
self.base_models_ = [list() for x in self.base_models]
self.meta_model_ = clone(self.meta_model)
kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
# Train cloned base models then create out-of-fold predictions
# that are needed to train the cloned meta-model
out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
for i, model in enumerate(self.base_models):
for train_index, holdout_index in kfold.split(X, y):
instance = clone(model)
self.base_models_[i].append(instance)
instance.fit(X[train_index], y[train_index])
y_pred = instance.predict(X[holdout_index])
out_of_fold_predictions[holdout_index, i] = y_pred
# Now train the cloned meta-model using the out-of-fold predictions as new feature
self.meta_model_.fit(out_of_fold_predictions, y)
return self
#Do the predictions of all base models on the test data and use the averaged predictions as
#meta-features for the final prediction which is done by the meta-model
def predict(self, X):
meta_features = np.column_stack([
np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
for base_models in self.base_models_ ])
return self.meta_model_.predict(meta_features)
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
meta_model = lasso)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
Stacking Averaged models score: 0.1085 (0.0074) The value has improved a little more.
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=1,
random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
def rmsle(y, y_pred):
return np.sqrt(mean_squared_error(y, y_pred))
This learning model uses multiple learning models as ensemble learning. This is the sum of the weighted (coefficient-multiplied) versions of each learning model.
y_{prediction}=w_{1}∗XGB+w_{2}∗LGB+w_{3}∗StR\\
w_{1}+w_{2}+w_{3} = 1\\
\\
y_{prediction}:Predicted value\\
w_{1~3}:coefficient\\
XGB:Predicted value by XGB\\
LGB:Predicted value by LightGBM\\
StR:Predicted value by Stacked Regressor\\
print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
xgb_train_pred*0.15 + lgb_train_pred*0.15 ))
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15
RMSLE score on train data: 0.07530158653663023
It has dropped to a very small value.
This time, data preprocessing was performed in the first half, and learning was performed in the second half. Regarding data preprocessing, it was found that the point is to handle missing values and categorical variables in a state where training data and test data are combined. Proper handling of features is a very important way to determine the accuracy of a model. I felt that it was important not only to fill in the missing values, but also to focus on the expected causal relationships. Regarding model learning, I was able to learn the concept of stacking and the evaluation method of RMSLE. For optimal stacking, you need to dig deeper into the characteristics of each model.
The full program is here. https://github.com/Fumio-eisan/houseprice20200301
Recommended Posts