[PYTHON] Box-Cox transformation and tree-based algorithms

A friend of mine who is studying machine learning is working on Kaggle's House Prices competition, and he mentioned the **Box-Cox transformation**.

He was talking about how it improved his accuracy!

It seems he referred to this article: https://sonaeru-blog.com/kaggle-4/

What even is a Box-Cox transformation?! I wondered, so here are my notes on what I found out.

My friend's take was:

> Maybe it's similar to a log transform, in the sense that it brings the data closer to a normal distribution.

So, what is a Box-Cox transformation?

This article was very helpful: https://gakushukun1.hatenablog.com/entry/2019/04/29/112424

The formula:

y(λ) = (y^λ − 1) / λ  when λ ≠ 0, and y(λ) = log y when λ = 0.

(Figure: the distribution before and after the transformation)

Think of it as a generalization of the **log transform**. In fact, when λ = 0 it reduces to exactly the log transform.
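As a quick sketch (my own example, not code from the referenced articles), SciPy can both fit the λ that makes the data most normal and confirm that λ = 0 reduces to a plain log:

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox as boxcox_at  # elementwise Box-Cox at a given lambda

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, strictly positive

# stats.boxcox estimates the lambda that maximizes normality (via log-likelihood)
transformed, lam = stats.boxcox(skewed)

# At lambda = 0, the Box-Cox transform is exactly the natural log
log_version = boxcox_at(skewed, 0.0)
```

For lognormal data like this, the fitted λ should come out close to 0, which matches the "generalized log transform" intuition.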

The log transform suits distributions like the one above, which peak near 0. If the tail is much heavier than a normal distribution's, the transform can in theory bring it almost all the way to normal.

In the upper graph, λ is closer to 0 than in the lower one, so for a distribution like this I suspect a plain log transform would cause almost no problems for linear regression. (That said, my rule of thumb here is shaky. Could Box-Cox squeeze out extra accuracy?)

**But all of this presupposes a linear-regression-style algorithm in the first place!**

Do tree-based algorithms need Box-Cox?

As a Kaggler who wants to climb the rankings as far as possible, what matters most to me personally are the so-called tree-based models like LightGBM. For feature engineering of the explanatory variables, is it safe to assume Box-Cox is unnecessary? Or is it still worth applying?

https://toukei-lab.com/box-cox%E5%A4%89%E6%8F%9B%E3%82%92%E7%94%A8%E3%81%84%E3%81%A6%E6%AD%A3%E8%A6%8F%E5%88%86%E5%B8%83%E3%81%AB%E5%BE%93%E3%82%8F%E3%81%AA%E3%81%84%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E8%A7%A3%E6%9E%90

> Many of the recently popular machine learning methods are nonparametric models, and most of them make no assumptions about the underlying distribution.

According to this article, it's unnecessary!

I think the reason is that **"numeric features are ultimately judged only by their ordering."**
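A small experiment (my own hypothetical setup) to check the "only the ordering matters" claim: fit the same decision tree on a raw feature and on its log, and compare predictions. Since log is strictly monotone, the ordering of samples is unchanged, so the tree should find identical partitions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 1))                       # skewed positive feature
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # noisy target

# Same tree, same depth; only the feature's scale differs
raw_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
log_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(X), y)

# Splits depend only on the ordering of samples, so the partitions --
# and therefore the predictions -- should be identical
same = bool(np.allclose(raw_tree.predict(X), log_tree.predict(np.log(X))))
```

This is why Box-Cox on explanatory variables is generally wasted effort for tree-based models like LightGBM.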

However, I don't think the same necessarily holds for the objective variable. (In fact, the objective variable is often log-transformed.) My understanding is that the reason is to reduce the model's penalty on a few large outliers.
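The usual pattern for a log-transformed target (a sketch of the common Kaggle practice, not code from this article): train on `log1p(y)` and invert the predictions with `expm1`.

```python
import numpy as np

y = np.array([100.0, 120.0, 90.0, 5000.0])  # one large outlier dominates RMSE

y_log = np.log1p(y)       # fit the model against this instead of y
y_back = np.expm1(y_log)  # invert after predicting to get back to the original scale

# On the log scale, the outlier's error contribution shrinks dramatically,
# so the model is penalized far less for missing it
```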

So, then:

Can the objective variable be Box-Cox transformed?

https://books.google.co.jp/books?id=t1a_DwAAQBAJ&pg=PA222&lpg=PA222&dq=%E7%9B%AE%E7%9A%84%E5%A4%89%E6%95%B0+boxcox&source=bl&ots=L7yjHQ6y6G&sig=ACfU3U3U1ugf0XhDVN_4fKAVnYe9xcFBSQ&hl=ja&sa=X&ved=2ahUKEwi2p_-itoLmAhXZA4gKHUutDmcQ6AEwBXoECAoQAQ#v=onepage&q=%E7%9B%AE%E7%9A%84%E5%A4%89%E6%95%B0%20boxcox&f=false

Apparently it can be done. But doing so is effectively the same as not using RMSE for the cost function, so I think making that choice explicitly is more intuitive.

https://www.sciencedirect.com/science/article/abs/pii/S0031320396000775?via%3Dihub

And with that, I found the abstract of a paper that neatly implements a cost function that Box-Cox-transforms the objective variable.
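If you do Box-Cox the objective variable, you need the inverse transform at prediction time; SciPy provides it directly (again, my own sketch, not from the sources above):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
y = rng.lognormal(size=100)      # strictly positive target (Box-Cox requires y > 0)

y_bc, lam = stats.boxcox(y)      # fit lambda and transform: train the model on y_bc
y_back = inv_boxcox(y_bc, lam)   # map model outputs back to the original scale
```

The round trip recovers the original values, so predictions made on the transformed scale can always be reported on the original one.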

I can't read the full text, though. And I'm getting a bit tired, so I'll stop here.

Learning accumulates

This time, for the first time, I looked up an unfamiliar term and arrived at an answer I was satisfied with. Forming a hypothesis about a new question and then gaining knowledge about it was a very pleasant experience.

At the same time, I felt that each step built on things I had only recently learned:

・I know exponential functions well (thanks to my school days)
・I know about the β function (from a statistics study session)
・Tree-based models don't need the transform (Mr. Watanabe mentioned this yesterday)
・The objective variable may still be transformed (I've run into this a lot lately)
・RMSE assumes the residuals are normally distributed (covered in class)
・Cost functions can be implemented individually (from recent Kaggle work)

I will continue to do my best.

Recommended Posts

Box-Cox transformation and tree-based algorithms
Euclidean algorithm and extended Euclidean algorithm
Rabbit and turtle algorithm