[PYTHON] [SIGNATE] [lightgbm] Competition House price forecast for the American city of Ames Participation record (2/2)

Introduction

This is a record of my participation in the [5th _Beginner Limited Competition] House Price Forecast for Ames, an American City, a competition for beginners held on SIGNATE. This post is the continuation of part 1/2.

What I did 3 (Understanding the data)

3-1 Checking the training / validation data

3-2 Missing value check

LightGBM is said to handle missing values on its own (or does it just avoid raising an error?). I checked just in case, but the number of missing values was 0 in every column, so nothing needed to be done about the data.

print(train_data.isnull().sum())
print(test_data.isnull().sum())

Checking the objective variable

'SalePrice' is the selling price ($) to be predicted. Maximum: 418,000, minimum: 80,000. That is about all there is to it.

sns.displot(train_data['SalePrice'], height=5, aspect=1)

saleprice.PNG

Since the distribution looks roughly normal, it seems it could even be fed to a linear regression model as it is. The data at $250,000 and above looks somewhat out of line, but this is a decision-tree-based model and the number of such rows is reasonable, so I left it alone. (Decision-tree models split on threshold values, so I don't think normality really matters here anyway.)
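As a supplement, the maximum and minimum above (and the amount of data at $250,000 or more) can be checked quickly; a minimal sketch, assuming the same train_data DataFrame as in part 1/2:

# Basic statistics of the objective variable (count, mean, std, min, quartiles, max)
print(train_data['SalePrice'].describe())
print((train_data['SalePrice'] >= 250000).sum(), 'rows at or above $250,000')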

3-3 Checking the important features

I looked at some of the important features directly. I skipped 'Garage Cars' and 'Garage Area' because seaborn froze and could not render them as graphs.

fig, ax = plt.subplots(2, 2, figsize=(10, 10))
sns.histplot(x=train_data['BsmtFin SF 1'], ax=ax[0, 0])
sns.histplot(x=train_data['Bsmt Unf SF'], ax=ax[0, 1])
sns.histplot(x=train_data['Bsmt Full Bath'], ax=ax[1, 1])
sns.histplot(x=train_data['Total Bsmt SF'], ax=ax[1, 0])

somefeature.PNG

Since the test data had the same tendency, there seemed to be no major problems with the features.

The feature below is the one that caught my attention.

bsmtfin.PNG

There are no sale prices below $100,000 once the area exceeds 1,500 square feet (x-axis), and conversely there is a thick band around $250,000 (y-axis), so there does seem to be a correlation. I thought it would be fine to drop the point around $400,000 near 500 square feet as an outlier, but decision trees are robust to outliers, so I did not deal with it. To do this properly, you should keep that single point if you can find a reason why it was priced higher, and delete it if you cannot. (Intuitively, BsmtFin SF 1 is the most important feature, so perhaps the harm this outlier does through BsmtFin SF 1 outweighs whatever benefit it brings through the other features, and it should simply be removed?)
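For reference, a minimal sketch of how a scatter plot like bsmtfin.PNG can be drawn, assuming seaborn (sns) and matplotlib (plt) are imported as in part 1/2:

# Scatter plot of BsmtFin SF 1 against SalePrice to eyeball the correlation and the outlier
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x=train_data['BsmtFin SF 1'], y=train_data['SalePrice'], ax=ax)
plt.show()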

3-4 Checking the data types

In 1-2, the object-type columns were converted to the categorical type all at once so that they could be passed to train(). Those features are strings representing some kind of category, so I think the conversion is appropriate.

For the remaining features, I checked whether they could be used with the current model as they are. In conclusion, the following columns were also converted to categorical data (a conversion sketch follows the table).

Column name | Overview | Reason
MS SubClass | Identifies the type of dwelling involved in the sale. So, the type of dwelling being sold? | Each value is numeric, but it is really a code such as 190 = 2 FAMILY CONVERSION - ALL STYLES AND AGES.
Overall Qual | Rates the overall material and finish of the house. So, a rating of the property. | Each value is numeric, but it is a 10-grade rating such as 10 = Very Excellent.
Overall Cond | Rates the overall condition of the house. So, a rating of the property. | Each value is numeric, but it is a 10-grade rating such as 10 = Very Excellent.
Year Built | Original construction date. The year of construction. | Converted to categorical data because it is a year.
Year Remod/Add | Remodel date (same as construction date if no remodeling or additions). The year of renovation? However, it is a mystery that so many rows have Year Built > Year Remod/Add. | Converted to categorical data because it is a year.
Mo Sold | Month Sold (MM). The month of sale. | Converted to categorical data because it is a month.
Yr Sold | Year Sold (YYYY). The year of sale. | Converted to categorical data because it is a year.
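A minimal sketch of the conversion above, assuming the data is held in pandas DataFrames named train_data and test_data as in part 1/2:

# Columns that are numeric in the raw data but really represent categories (types, ratings, years, months)
categorical_cols = ['MS SubClass', 'Overall Qual', 'Overall Cond',
                    'Year Built', 'Year Remod/Add', 'Mo Sold', 'Yr Sold']

for col in categorical_cols:
    # LightGBM treats pandas 'category' dtype columns as categorical features
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')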

Strictly speaking, I should have checked whether each categorical column contains values other than those listed in data_description.txt, but the data had looked fairly clean so far, so I skipped that check.

3-5 Adding features

The candidate features I came up with, and whether each was adopted, are as follows (a sketch of the adopted ones follows the list).

Year of sale (Yr Sold) - year of construction (Year Built)
Assumption: roughly the age of the house.
Verdict: adopted, as an "age" column.

Year of renovation (Year Remod/Add) - year of construction (Year Built)
Assumption: if there has been no renovation, 'Year Remod/Add' = 'Year Built', so I wanted a feature that expresses that. A Boolean would be enough for that, but I figured the size of the difference would matter anyway, so I take the difference.
Verdict: adopted, as a "sub_year" column.

Garage size (Garage Area) / number of cars (Garage Cars)
Assumption: the size per car can be computed, and I thought a roomy garage might convey a sense of luxury.
Verdict: not adopted. I could not make sense of the rows where Garage Cars is negative, and since lots are large in the United States, such small differences in area probably do not matter.

Inflation rate or price index
Assumption: 'Yr Sold' covers the four years 2006-2010. Assuming, say, 3% annual inflation, a $100,000 property in 2006 would be about $112,000 in 2010 (if my arithmetic is right), so I thought a value such as how much a given year is worth relative to 100 dollars in 2000 would be useful.
Verdict: not adopted. I decided it was too much work to look up and that the effect is already woven into the yearly Yr Sold data. ~~I am quite curious, so someone please give it a try.~~ Tried later in the 2021/01/11 addendum below; somewhat effective.

Lehman shock
Assumption: Yr Sold covers the four years 2006-2010, and since the Lehman shock was in 2008, I thought it was quite likely that housing prices were affected before and after it (the Lehman shock was tied to the bursting of the housing bubble).
Verdict: not adopted. I could not come up with a concrete indicator and decided the effect is already woven into the yearly Yr Sold data. Looking at news around the subprime mortgage crisis, there probably is some good indicator.
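A minimal sketch of the two adopted features, assuming they are computed while the year columns are still numeric (or after casting them back to int as below):

for df in (train_data, test_data):
    # Age of the house at the time of sale
    df['age'] = df['Yr Sold'].astype(int) - df['Year Built'].astype(int)
    # Years between construction and the last remodel (0 if never remodeled)
    df['sub_year'] = df['Year Remod/Add'].astype(int) - df['Year Built'].astype(int)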

Result 3

Ranking after adding the features

RMSE: 26523.4069341 Ranking: 85/552, so top 16%

RMSE improved by only about 30 and the ranking was almost unchanged, so the effect was small. Judging from the feature importances, the added features ranked around the top 40%, so the work may not have been entirely wasted.

What I did 4 (parameter adjustment)

Up to this point LightGBM had been running with almost default parameters, so I set the parameters with the aim of avoiding overfitting. Reference: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html (probably mixed with various other things I googled...).

Before the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'max_depth' : 10
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=1000,
               early_stopping_rounds=50,
               verbose_eval=50,
               nfold=7,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )

After the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth' : 7,
    'num_leaves': 80,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'feature_fraction': 0.9,
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=10000,
               early_stopping_rounds=100,
               verbose_eval=50,
               nfold=10,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )
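As a supplement, a minimal sketch of how the result of cv() above can be inspected and used for prediction. This assumes a LightGBM version around 3.x (the key names of the dict returned by cv() differ between versions) and a test feature matrix called test_X (a placeholder name):

import numpy as np

# Best (final) cross-validation score and the number of boosting rounds actually kept
print('best CV rmse:', model['rmse-mean'][-1])
print('boosting rounds used:', len(model['rmse-mean']))

# With return_cvbooster=True the per-fold boosters come back as a CVBooster;
# predict() returns one array per fold, which can be averaged for a submission
cvbooster = model['cvbooster']
preds = np.mean(cvbooster.predict(test_X), axis=0)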

Changed parameters

Parameter | Before -> after | Intent of the change
learning_rate | 0.1 -> 0.01 | It seemed to be overfitting, so I reduced it to 1/10 to learn more finely. In exchange, num_boost_round is multiplied by 10.
max_depth | 10 -> 7 | I remembered that 5-10 is typical and that larger values overfit more easily, so I lowered it.
num_leaves | ? -> 80 | According to the official documentation, a value somewhat below the maximum of 2^(max_depth) works well.
bagging_fraction, bagging_freq | 1 -> 0.8, 1 -> 1 | The bagging ratio. It is said to help avoid overfitting, so I lowered it a little.
feature_fraction | 1 -> 0.9 | The feature subsampling ratio. It is said to help avoid overfitting, so I lowered it a little.

I do not fully understand the difference from bagging_fraction, but the documentation describes bagging_fraction as "like feature_fraction, but this will randomly select part of data without resampling". https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

Result 4

Learning curve

These are the learning curves from the code that ran the train() function in "What I did 1", with only the parameters changed. The left side is before the change, the right side after the change.

1-学習曲線.PNG

Before the change, the eval score peaked at around the 27th iteration, whereas after the change the 404th iteration is the most accurate. The score is better than before the change, and it degrades only slowly toward the right, which suggests overfitting is being suppressed.
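A minimal sketch of how such learning curves can be drawn with train(), assuming training and validation Datasets named lgb_train and lgb_eval (placeholder names) and the lgbm_params dict from above:

evals_result = {}  # filled in by the record_evaluation callback

gbm = lgb.train(lgbm_params, lgb_train,
                num_boost_round=10000,
                valid_sets=[lgb_train, lgb_eval],
                valid_names=['train', 'eval'],
                callbacks=[lgb.record_evaluation(evals_result),
                           lgb.early_stopping(100)])

# Plot train vs eval RMSE over the boosting rounds (the curves shown above)
lgb.plot_metric(evals_result, metric='rmse')
plt.show()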

Ranking after setting the parameters

RMSE: 26196.7174197 Ranking: 30/553, so the top 6%

It was effective overall. RMSE improved by about 350 and the ranking rose by about 50 places.

What I did 5 (parameter search)

Since there still seemed to be room for improvement in the parameters, I ran a parameter search. Using a library called Optuna is supposed to make this a one-shot job: LightGBM's cv() function can reportedly just be replaced with LightGBMTunerCV(), but...

Before the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'max_depth' : 7,
    'num_leaves': 80,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'feature_fraction': 0.9,
}

model = lgb.cv(lgbm_params, lgb_train,
               num_boost_round=10000,
               early_stopping_rounds=100,
               verbose_eval=50,
               nfold=10,
               shuffle=True,
               stratified=False,
               seed=42,
               return_cvbooster=True,
              )

After the change

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    # the parameters to be searched have been removed from this dict
}


tuner_cv = lgb.LightGBMTunerCV(
    lgbm_params, lgb_train,
    num_boost_round=10000,
    early_stopping_rounds=100,
    verbose_eval=20,
    nfold=10,
    shuffle=True,
    stratified=False,
    seed=42,
    return_cvbooster=True,
)

tuner_cv.run()

It really did just work. LightGBMTunerCV takes about 7 minutes, whereas cv() takes about 20 seconds; this may be the first time this computer has made full use of its CPU.

print(tuner_cv.best_params)
{'task': 'train', 'boosting_type': 'gbdt', 'objective': 'regression', 'metric': 'rmse', 'feature_pre_filter': False, 'lambda_l1': 0.12711821269550255, 'lambda_l2': 6.733049435313309e-05, 'num_leaves': 62, 'feature_fraction': 0.4, 'bagging_fraction': 1.0, 'bagging_freq': 0, 'min_child_samples': 20}

Note: the import style used in my source code (import optuna ...) seems to be somewhat out of date.
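As a supplement, a minimal sketch of how the tuned result can be used afterwards. get_best_booster() should be available here because return_cvbooster=True was passed; test_X is a placeholder for the test feature matrix:

import numpy as np

print(tuner_cv.best_score)   # best CV RMSE found during the search
print(tuner_cv.best_params)  # the tuned parameters (as printed above)

# CVBooster of the best trial; average the per-fold predictions for a submission
best_cvbooster = tuner_cv.get_best_booster()
preds = np.mean(best_cvbooster.predict(test_X), axis=0)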

Result 5

Ranking after using Optuna

RMSE: 26718.3022355 Ranking: 119/557, so top 21%

The score dropped considerably overall. Is this really better than doing a grid search yourself? It also bothered me that best_params contained no max_depth; maybe it is simply unnecessary because it can be derived from the other parameters. For now, the conclusion is that I have not mastered Optuna.
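If I understand the tuner correctly, LightGBMTunerCV only searches a fixed set of parameters (lambda_l1, lambda_l2, num_leaves, feature_fraction, bagging_fraction, bagging_freq, min_child_samples), which would explain why max_depth does not appear in best_params; parameters set in the base dict are passed through unchanged. A minimal sketch of pinning max_depth (and learning_rate) while letting the rest be tuned:

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,  # fixed values here are kept as-is by the tuner
    'max_depth': 7,
}

tuner_cv = lgb.LightGBMTunerCV(
    lgbm_params, lgb_train,
    num_boost_round=10000,
    early_stopping_rounds=100,
    verbose_eval=20,
    nfold=10,
    shuffle=True,
    stratified=False,
    seed=42,
    return_cvbooster=True,
)
tuner_cv.run()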

Summary 2

After data conversion and parameter setting, I was able to enter the top 6%.

There still seems to be room for improvement through parameter optimization, so if I have time I will search the parameters a bit more.

I read in an article somewhere that CatBoost has excellent default parameters. Maybe it would be quicker to just switch to that algorithm?

2021/01/11 Addendum What I did 4

I tinkered with the following data in my spare time.

Categorizing the Order column
Overview: the observation (order) number. I had overlooked it, but since there are rows with matching values (purchased at the same time?), I made it categorical.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Deleting the step that set negative values of 'Total Bsmt SF', 'Garage Cars' and 'Garage Area' to 0
Overview: in [What I did 3] I had overwritten the strange negative values with 0, but that felt heavy-handed, so I removed the step.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Rounding of 'Garage Cars'
Overview: looking at the data, there were float values such as -0.00199 and 1.998, roughly 0.00199 below an integer. It looked like some kind of rounding error, so I uniformly added 0.002, cast to int and dropped the fractional part (see the sketch after this list).
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE barely changed; the importance dropped.

Rounding of 'Bsmt Full Bath'
Overview: as with Garage Cars, the values had fractions, so I cast to int and dropped the fractional part.
Effect: none. It may be because I did it at the same time as the other fixes, but RMSE and importance barely changed.

Adding a consumer price index
Overview: I wrote above that I would not do this, but I was curious, so I did. From https://ecodb.net/country/US/imf_cpi.html and https://jp.investing.com/economic-calendar/cpi-733 I roughly worked out monthly values and added the index of the month before the sale as a new column. For example, if the sale date is January 2010, the December 2009 index is added. *1
Effect: effective. RMSE improved by about 30 and the ranking rose by one place. Since the index drops only around September 2008, the data can now take the Lehman shock into account.

Correcting 'Year Remod/Add'
Overview: I wrote above that it was a mystery that there is so much data with 'Year Built' > 'Year Remod/Add'. Looking at the data a little more, the amount of 1950 data is abnormally large. Assuming that 'Year Remod/Add' may have been forced to 1950 when, for example, the field was left blank in the input system, I overwrote 'Year Remod/Add' with the value of 'Year Built' wherever 'Year Built' > 'Year Remod/Add' (see the sketch after this list). A small amount of the 'Year Built' > 'Year Remod/Add' data was not 1950, though...
Effect: effective. RMSE improved by about 60 and the ranking rose by three places. It seems this data had been a sticking point.
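A minimal sketch of the rounding and 'Year Remod/Add' fixes above, assuming they are applied to the train_data / test_data DataFrames before the year columns are converted to categorical (the 0.002 offset follows the description in the list):

for df in (train_data, test_data):
    # Values such as -0.00199 or 1.998 sit about 0.002 below an integer,
    # so shift them up slightly and truncate to int
    df['Garage Cars'] = (df['Garage Cars'] + 0.002).astype(int)
    df['Bsmt Full Bath'] = (df['Bsmt Full Bath'] + 0.002).astype(int)

    # Where 'Year Remod/Add' is older than 'Year Built' (mostly forced to 1950),
    # overwrite it with 'Year Built'
    mask = df['Year Built'] > df['Year Remod/Add']
    df.loc[mask, 'Year Remod/Add'] = df.loc[mask, 'Year Built']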

At this level of the ranking, it seems that improving RMSE by just 10 is enough to overtake one or two people.

Result 4 (final result)

RMSE: 26106.3566493 Rank: 27/582, so it was the top 4.6%.

The 1st place RMSE is 25825.5265928, so it's still quite far away.

Improvement points
