[PYTHON] [SIGNATE] [lightgbm] Competition House price forecast for the American city of Ames Participation record (2/2)

Introduction

This is a record of my participation in the [5th _Beginner Limited Competition] House Price Forecast for Ames, an American City, a competition for beginners held on SIGNATE. This post is the continuation of part 1/2.

What I did 3 (Understanding the data)

3-1 Checking the training / validation data

3-2 Missing value check

LightGBM is said to handle missing values on its own (or does it just avoid raising an error?). I checked just in case, but the number of missing values was 0 in every column, so nothing needed to be done about the data.

print(train_data.isnull().sum())
print(test_data.isnull().sum())

Checking the objective variable

'SalePrice' is the selling price ($) to be predicted. Maximum: 418,000, minimum: 80,000. That is about all there is to it.

sns.displot(train_data['SalePrice'], height=5, aspect=1)

saleprice.PNG

Since the distribution looks roughly normal, it seems it could even be fed to a linear regression model as it is. The data at $250,000 and above looks somewhat out of line, but this is a decision-tree-based model and the number of such rows is reasonable, so I left it alone. (Decision-tree models split on threshold values, so I don't think normality really matters here anyway.)
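As a supplement, the maximum and minimum above (and the amount of data at $250,000 or more) can be checked quickly; a minimal sketch, assuming the same train_data DataFrame as in part 1/2:

# Basic statistics of the objective variable (count, mean, std, min, quartiles, max)
print(train_data['SalePrice'].describe())
print((train_data['SalePrice'] >= 250000).sum(), 'rows at or above $250,000')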

3-3 Checking the important features

I looked at some of the important features directly. I skipped 'Garage Cars' and 'Garage Area' because seaborn froze and could not render them as graphs.

fig, ax = plt.subplots(2, 2, figsize=(10, 10))
sns.histplot(x=train_data['BsmtFin SF 1'], ax=ax[0, 0])
sns.histplot(x=train_data['Bsmt Unf SF'], ax=ax[0, 1])
sns.histplot(x=train_data['Bsmt Full Bath'], ax=ax[1, 1])
sns.histplot(x=train_data['Total Bsmt SF'], ax=ax[1, 0])

somefeature.PNG

Since the test data had the same tendency, there seemed to be no major problems with the features.

The feature below is the one that caught my attention.

bsmtfin.PNG

There are no sale prices below $100,000 once the area exceeds 1,500 square feet (x-axis), and conversely there is a thick band around $250,000 (y-axis), so there does seem to be a correlation. I thought it would be fine to drop the point around $400,000 near 500 square feet as an outlier, but decision trees are robust to outliers, so I did not deal with it. To do this properly, you should keep that single point if you can find a reason why it was priced higher, and delete it if you cannot. (Intuitively, BsmtFin SF 1 is the most important feature, so perhaps the harm this outlier does through BsmtFin SF 1 outweighs whatever benefit it brings through the other features, and it should simply be removed?)
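For reference, a minimal sketch of how a scatter plot like bsmtfin.PNG can be drawn, assuming seaborn (sns) and matplotlib (plt) are imported as in part 1/2:

# Scatter plot of BsmtFin SF 1 against SalePrice to eyeball the correlation and the outlier
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x=train_data['BsmtFin SF 1'], y=train_data['SalePrice'], ax=ax)
plt.show()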

3-4 Checking the data types

In 1-2, the object-type columns were converted to the categorical type all at once so that they could be passed to train(). Those features are strings representing some kind of category, so I think the conversion is appropriate.

For the remaining features, I checked whether they could be used with the current model as they are. In conclusion, the following columns were also converted to categorical data (a conversion sketch follows the table).

Column name | Overview | Reason
MS SubClass | Identifies the type of dwelling involved in the sale. So, the type of dwelling being sold? | Each value is numeric, but it is really a code such as 190 = 2 FAMILY CONVERSION - ALL STYLES AND AGES.
Overall Qual | Rates the overall material and finish of the house. So, a rating of the property. | Each value is numeric, but it is a 10-grade rating such as 10 = Very Excellent.
Overall Cond | Rates the overall condition of the house. So, a rating of the property. | Each value is numeric, but it is a 10-grade rating such as 10 = Very Excellent.
Year Built | Original construction date. The year of construction. | Converted to categorical data because it is a year.
Year Remod/Add | Remodel date (same as construction date if no remodeling or additions). The year of renovation? However, it is a mystery that so many rows have Year Built > Year Remod/Add. | Converted to categorical data because it is a year.
Mo Sold | Month Sold (MM). The month of sale. | Converted to categorical data because it is a month.
Yr Sold | Year Sold (YYYY). The year of sale. | Converted to categorical data because it is a year.
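A minimal sketch of the conversion above, assuming the data is held in pandas DataFrames named train_data and test_data as in part 1/2:

# Columns that are numeric in the raw data but really represent categories (types, ratings, years, months)
categorical_cols = ['MS SubClass', 'Overall Qual', 'Overall Cond',
                    'Year Built', 'Year Remod/Add', 'Mo Sold', 'Yr Sold']

for col in categorical_cols:
    # LightGBM treats pandas 'category' dtype columns as categorical features
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')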

Strictly speaking, I should have checked whether each categorical column contains values other than those listed in data_description.txt, but the data had looked fairly clean so far, so I skipped that check.

3-5 Adding features

The candidate features I came up with, and whether each was adopted, are as follows (a sketch of the adopted ones follows the list).

Year of sale (Yr Sold) - year of construction (Year Built)
Assumption: roughly the age of the house.
Verdict: adopted, as an "age" column.

Year of renovation (Year Remod/Add) - year of construction (Year Built)
Assumption: if there has been no renovation, 'Year Remod/Add' = 'Year Built', so I wanted a feature that expresses that. A Boolean would be enough for that, but I figured the size of the difference would matter anyway, so I take the difference.
Verdict: adopted, as a "sub_year" column.

Garage size (Garage Area) / number of cars (Garage Cars)
Assumption: the size per car can be computed, and I thought a roomy garage might convey a sense of luxury.
Verdict: not adopted. I could not make sense of the rows where Garage Cars is negative, and since lots are large in the United States, such small differences in area probably do not matter.

Inflation rate or price index
Assumption: 'Yr Sold' covers the four years 2006-2010. Assuming, say, 3% annual inflation, a $100,000 property in 2006 would be about $112,000 in 2010 (if my arithmetic is right), so I thought a value such as how much a given year is worth relative to 100 dollars in 2000 would be useful.
Verdict: not adopted. I decided it was too much work to look up and that the effect is already woven into the yearly Yr Sold data. ~~I am quite curious, so someone please give it a try.~~ Tried later in the 2021/01/11 addendum below; somewhat effective.

Lehman shock
Assumption: Yr Sold covers the four years 2006-2010, and since the Lehman shock was in 2008, I thought it was quite likely that housing prices were affected before and after it (the Lehman shock was tied to the bursting of the housing bubble).
Verdict: not adopted. I could not come up with a concrete indicator and decided the effect is already woven into the yearly Yr Sold data. Looking at news around the subprime mortgage crisis, there probably is some good indicator.
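A minimal sketch of the two adopted features, assuming they are computed while the year columns are still numeric (or after casting them back to int as below):

for df in (train_data, test_data):
    # Age of the house at the time of sale
    df['age'] = df['Yr Sold'].astype(int) - df['Year Built'].astype(int)
    # Years between construction and the last remodel (0 if never remodeled)
    df['sub_year'] = df['Year Remod/Add'].astype(int) - df['Year Built'].astype(int)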

Result 3

Ranking after adding the features

RMSE: 26523.4069341 Ranking: 85/552, so top 16%

RMSE improved by only about 30 and the ranking was almost unchanged, so the effect was small. Judging from the feature importances, the added features ranked around the top 40%, so the work may not have been entirely wasted.

What I did 4 (parameter adjustment)

Up to this point LightGBM had been running with almost default parameters, so I set the parameters with the aim of avoiding overfitting. Reference: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html (probably mixed with various other things I googled...).

Before the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'max_depth' : 10
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=1000,
               early_stopping_rounds=50,
               verbose_eval=50,
               nfold=7,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )

After the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth' : 7,
    'num_leaves': 80,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'feature_fraction': 0.9,
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=10000,
               early_stopping_rounds=100,
               verbose_eval=50,
               nfold=10,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )
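As a supplement, a minimal sketch of how the result of cv() above can be inspected and used for prediction. This assumes a LightGBM version around 3.x (the key names of the dict returned by cv() differ between versions) and a test feature matrix called test_X (a placeholder name):

import numpy as np

# Best (final) cross-validation score and the number of boosting rounds actually kept
print('best CV rmse:', model['rmse-mean'][-1])
print('boosting rounds used:', len(model['rmse-mean']))

# With return_cvbooster=True the per-fold boosters come back as a CVBooster;
# predict() returns one array per fold, which can be averaged for a submission
cvbooster = model['cvbooster']
preds = np.mean(cvbooster.predict(test_X), axis=0)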

Changed parameters

Parameter | Before -> after | Intent of the change
learning_rate | 0.1 -> 0.01 | It seemed to be overfitting, so I reduced it to 1/10 to learn more finely. In exchange, num_boost_round is multiplied by 10.
max_depth | 10 -> 7 | I remembered that 5-10 is typical and that larger values overfit more easily, so I lowered it.
num_leaves | ? -> 80 | According to the official documentation, a value somewhat below the maximum of 2^(max_depth) works well.
bagging_fraction, bagging_freq | 1 -> 0.8, 1 -> 1 | The bagging ratio. It is said to help avoid overfitting, so I lowered it a little.
feature_fraction | 1 -> 0.9 | The feature subsampling ratio. It is said to help avoid overfitting, so I lowered it a little.

I do not fully understand the difference from bagging_fraction, but the documentation describes bagging_fraction as "like feature_fraction, but this will randomly select part of data without resampling". https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

Result 4

Learning curve

These are the learning curves from the code that ran the train() function in "What I did 1", with only the parameters changed. The left side is before the change, the right side after the change.

1-学習曲線.PNG

Before the change, the eval score peaked at around the 27th iteration, whereas after the change the 404th iteration is the most accurate. The score is better than before the change, and it degrades only slowly toward the right, which suggests overfitting is being suppressed.
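A minimal sketch of how such learning curves can be drawn with train(), assuming training and validation Datasets named lgb_train and lgb_eval (placeholder names) and the lgbm_params dict from above:

evals_result = {}  # filled in by the record_evaluation callback

gbm = lgb.train(lgbm_params, lgb_train,
                num_boost_round=10000,
                valid_sets=[lgb_train, lgb_eval],
                valid_names=['train', 'eval'],
                callbacks=[lgb.record_evaluation(evals_result),
                           lgb.early_stopping(100)])

# Plot train vs eval RMSE over the boosting rounds (the curves shown above)
lgb.plot_metric(evals_result, metric='rmse')
plt.show()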

Ranking after setting the parameters

RMSE: 26196.7174197 Ranking: 30/553, so the top 6%

It was effective overall. RMSE improved by about 350 and the ranking rose by about 50 places.

What I did 5 (parameter search)

Since there still seemed to be room for improvement in the parameters, I ran a parameter search. Using a library called Optuna is supposed to make this a one-shot job: LightGBM's cv() function can reportedly just be replaced with LightGBMTunerCV(), but...

Before the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth' : 7,
    'num_leaves': 80,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'feature_fraction': 0.9,
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=10000,
               early_stopping_rounds=100,
               verbose_eval=50,
               nfold=10,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )

After the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    # the parameters to be searched have been removed from this dict
}


tuner_cv = lgb.LightGBMTunerCV(
    lgbm_params, lgb_train,
    num_boost_round=10000,
    early_stopping_rounds=100,
    verbose_eval=20,
    nfold=10,
    shuffle=True,
    stratified=False,
    seed=42,
    return_cvbooster=True,
)

tuner_cv.run()

It really did just work. LightGBMTunerCV takes about 7 minutes, whereas cv() takes about 20 seconds; this may be the first time this computer has made full use of its CPU.

print(tuner_cv.best_params)
{'task': 'train', 'boosting_type': 'gbdt', 'objective': 'regression', 'metric': 'rmse', 'feature_pre_filter': False, 'lambda_l1': 0.12711821269550255, 'lambda_l2': 6.733049435313309e-05, 'num_leaves': 62, 'feature_fraction': 0.4, 'bagging_fraction': 1.0, 'bagging_freq': 0, 'min_child_samples': 20}

Note: the import style used in my source code (import optuna ...) seems to be somewhat out of date.
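As a supplement, a minimal sketch of how the tuned result can be used afterwards. get_best_booster() should be available here because return_cvbooster=True was passed; test_X is a placeholder for the test feature matrix:

import numpy as np

print(tuner_cv.best_score)   # best CV RMSE found during the search
print(tuner_cv.best_params)  # the tuned parameters (as printed above)

# CVBooster of the best trial; average the per-fold predictions for a submission
best_cvbooster = tuner_cv.get_best_booster()
preds = np.mean(best_cvbooster.predict(test_X), axis=0)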

Result 5

Ranking after using Optuna

RMSE: 26718.3022355 Ranking: 119/557, so top 21%

The score dropped considerably overall. Is this really better than doing a grid search yourself? It also bothered me that best_params contained no max_depth; maybe it is simply unnecessary because it can be derived from the other parameters. For now, the conclusion is that I have not mastered Optuna.
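If I understand the tuner correctly, LightGBMTunerCV only searches a fixed set of parameters (lambda_l1, lambda_l2, num_leaves, feature_fraction, bagging_fraction, bagging_freq, min_child_samples), which would explain why max_depth does not appear in best_params; parameters set in the base dict are passed through unchanged. A minimal sketch of pinning max_depth (and learning_rate) while letting the rest be tuned:

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,  # fixed values here are kept as-is by the tuner
    'max_depth': 7,
}

tuner_cv = lgb.LightGBMTunerCV(
    lgbm_params, lgb_train,
    num_boost_round=10000,
    early_stopping_rounds=100,
    verbose_eval=20,
    nfold=10,
    shuffle=True,
    stratified=False,
    seed=42,
    return_cvbooster=True,
)
tuner_cv.run()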

Summary 2

After data conversion and parameter setting, I was able to enter the top 6%.

There still seems to be room for improvement through parameter optimization, so if I have time I will search the parameters a bit more.

I read in an article somewhere that CatBoost has excellent default parameters. Maybe it would be quicker to just switch to that algorithm?

2021/01/11 Addendum What I did 4

I tinkered with the following data in my spare time.

Categorizing the Order column
Overview: the observation (order) number. I had overlooked it, but since there are rows with matching values (purchased at the same time?), I made it categorical.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Deleting the step that set negative values of 'Total Bsmt SF', 'Garage Cars' and 'Garage Area' to 0
Overview: in [What I did 3] I had overwritten the strange negative values with 0, but that felt heavy-handed, so I removed the step.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Rounding of 'Garage Cars'
Overview: looking at the data, there were float values such as -0.00199 and 1.998, roughly 0.00199 below an integer. It looked like some kind of rounding error, so I uniformly added 0.002, cast to int and dropped the fractional part (see the sketch after this list).
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE barely changed; the importance dropped.

Rounding of 'Bsmt Full Bath'
Overview: as with Garage Cars, the values had fractions, so I cast to int and dropped the fractional part.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Adding a consumer price index
Overview: I wrote above that I would not do this, but I was curious, so I did. From https://ecodb.net/country/US/imf_cpi.html and https://jp.investing.com/economic-calendar/cpi-733 I roughly worked out monthly values and added the index of the month before the sale as a new column. For example, if the sale date is January 2010, the December 2009 index is added. *1
Effect: effective. RMSE improved by about 30 and the ranking rose by one place. Since the index drops only around September 2008, the data can now take the Lehman shock into account.

Correcting 'Year Remod/Add'
Overview: I wrote above that it was a mystery that there is so much data with 'Year Built' > 'Year Remod/Add'. Looking at the data a little more, the amount of 1950 data is abnormally large. Assuming that 'Year Remod/Add' may have been forced to 1950 when, for example, the field was left blank in the input system, I overwrote 'Year Remod/Add' with the value of 'Year Built' wherever 'Year Built' > 'Year Remod/Add' (see the sketch after this list). A small amount of the 'Year Built' > 'Year Remod/Add' data was not 1950, though...
Effect: effective. RMSE improved by about 60 and the ranking rose by three places. It seems this data had been a sticking point.
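A minimal sketch of the rounding and 'Year Remod/Add' fixes above, assuming they are applied to the train_data / test_data DataFrames before the year columns are converted to categorical (the 0.002 offset follows the description in the list):

for df in (train_data, test_data):
    # Values such as -0.00199 or 1.998 sit about 0.002 below an integer,
    # so shift them up slightly and truncate to int
    df['Garage Cars'] = (df['Garage Cars'] + 0.002).astype(int)
    df['Bsmt Full Bath'] = (df['Bsmt Full Bath'] + 0.002).astype(int)

    # Where 'Year Remod/Add' is older than 'Year Built' (mostly forced to 1950),
    # overwrite it with 'Year Built'
    mask = df['Year Built'] > df['Year Remod/Add']
    df.loc[mask, 'Year Remod/Add'] = df.loc[mask, 'Year Built']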

At this level of the ranking, it seems that improving RMSE by just 10 is enough to overtake one or two people.

Result 4 (final result)

RMSE: 26106.3566493 Rank: 27/582, so it was the top 4.6%.

The 1st place RMSE is 25825.5265928, so it's still quite far away.

Improvement points
