[PYTHON] [Statistics] [R] Try using quantile regression.

This is an introduction to a method called quantile regression.

An ordinary regression line can be interpreted as the conditional expected value (mean) of $y$ given $x$. In quantile regression, we instead draw regression lines based on conditional quantiles, such as the 25% or 95% quantile.

First of all, let's draw some graphs to visualize what the method does.

1. Example with normally distributed errors whose variance depends on the explanatory variable

Here we generate data in which the error variance is small where the explanatory variable $x$ is small and large where $x$ is large.

In quantile regression, a separate $\beta$ is estimated for each quantile, so each quantile line has its own slope.

**Quantile regression result** (figure: Quantile_Reg_line-compressor.png): from the bottom, the 5%, 10%, 25%, 75%, 90%, and 95% quantile regression lines, together with the ordinary regression line.

First, generate data and draw a scatter plot.

```r
# Install the package first if it is not already installed.
# install.packages('quantreg')
library(quantreg)

b <- -5   # slope of the line
a <- 35   # intercept of the line

data_x <- runif(500, 0, 5)    # uniform random numbers between 0 and 5
data_y <- rep(0, length(data_x))
for (i in 1:length(data_x)){
    # A line with intercept a and slope b, plus noise whose standard
    # deviation grows with the explanatory variable
    data_y[i] <- b*data_x[i] + a + rnorm(1, mean=0, sd=1 + data_x[i])
}

# Draw a scatter plot
plot(data_x, data_y, xlab="Random Variable", ylab="Random Variable",
        type="n", cex=.5, xlim=c(0, 5), ylim=c(0, 40))
points(data_x, data_y, cex=.5, col="blue")
```

For the generated data, quantile regression is performed at the 5%, 10%, 25%, 75%, 90%, and 95% quantiles, and the fitted lines are drawn. The ordinary linear regression line is drawn as well for comparison. Because the error distribution in this example is a symmetric normal distribution, the median quantile regression line and the ordinary regression line (which represents the conditional mean) nearly coincide.

```r
taus <- c(.05, .10, .25, .75, .90, .95)    # quantiles to fit
cols <- c(4, 3, 2, 2, 3, 4)                # one color per line
xx <- seq(min(data_x), max(data_x), .1)
f  <- coef(rq(data_y ~ data_x, tau=taus))  # run the quantile regressions
yy <- cbind(1, xx) %*% f

# Draw the quantile regression lines
for(i in 1:length(taus)){
    lines(xx, yy[,i], col=cols[i])
}
abline(lm(data_y ~ data_x), col="red", lty=2, lwd=2)   # ordinary (least-squares) regression line
abline(rq(data_y ~ data_x), col="black", lwd=2)        # median (50%) quantile regression line

legend(0, 8, c("mean (LSE) fit", "median (LAE) fit"), col=c("red", "black"), lty=c(2, 1), lwd=c(1, 2))
```
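Beyond drawing the lines, it is often useful to read off the predicted quantiles at a specific value of $x$. A minimal sketch using `predict()` on the fitted `rq` object, reusing the `data_x` and `data_y` generated above (the value $x = 2.5$ is just for illustration; the column name in `newdata` must match the explanatory variable):

```r
# Predicted 5% and 95% conditional quantiles of y at x = 2.5
fit <- rq(data_y ~ data_x, tau = c(.05, .95))
predict(fit, newdata = data.frame(data_x = 2.5))
```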

2. Example with exponentially distributed errors

A symmetric distribution is not very interesting, so next let's generate data whose errors follow an exponential distribution, which has a long right tail, and apply quantile regression.

This time the median quantile regression line (black) sits in a more plausible position than the least-squares line (red dashed). Because of the long-tailed distribution, the lower quantile lines are packed close together while the upper quantile lines are spread further apart.

(figure: Quantile_Reg_exp-compressor.png)

Generate data with exponentially distributed errors and draw a scatter plot.

```r
# Install the package first if it is not already installed.
# install.packages('quantreg')
library(quantreg)

# b      : slope of the line
# lambda : rate parameter of the exponential distribution
# taus   : vector of quantiles, e.g. c(.05, .95, .10, .90, .25, .75)
quant_reg_exp <- function(b, lambda, taus){
    Dat   <- NULL
    Dat$x <- rnorm(100, mean=50, sd=7)   # explanatory variable: normal, mean 50, sd 7
    Dat$y <- rep(0, 100)

    for (i in 1:length(Dat$x)){
        Dat$y[i] <- b * Dat$x[i] + rexp(1, lambda)    # error drawn from an exponential distribution
    }
    data_x <- Dat$x
    data_y <- Dat$y

    # Draw a scatter plot
    plot(data_x, data_y, xlab="Random Variable", ylab="Random Variable", type="n", cex=.5)
    points(data_x, data_y, cex=.5, col="blue")

    xx <- seq(min(data_x), max(data_x), .1)
    f  <- coef(rq(data_y ~ data_x, tau=taus))    # run the quantile regressions
    yy <- cbind(1, xx) %*% f

    # Draw the quantile regression lines
    cols <- c(2, 2, 3, 3, 4, 4)
    for(i in 1:length(taus)){
        lines(xx, yy[,i], col=cols[i])
    }
    abline(lm(data_y ~ data_x), col="red", lty=2)    # ordinary (least-squares) regression line
    abline(rq(data_y ~ data_x), col="black", lwd=2)  # median (50%) quantile regression line
    legend(min(data_x), max(data_y), c("mean (LSE) fit", "median (LAE) fit"),
                 col=c("red", "black"), lty=c(2, 1), lwd=c(1, 2))
}

# parameters: b, lambda, taus
quant_reg_exp(.5, .5, c(.05, .95, .10, .90, .25, .75))
```

3. A brief look at the theory

This is only a brief look at the theory, focusing on how it differs from the least squares method. See the references listed below for details.

The blue line below plots the function ${1 \over 2} x^2$, the function used to evaluate errors in the least squares method.

The red line is an asymmetric absolute-value function that depends on $\tau$; it is the function used to evaluate errors in quantile regression, and its shape changes with the quantile $\tau$.

Both functions are zero at the origin and measure distance from zero, turning distances in the negative direction into positive penalties. By making this penalty asymmetric, quantile regression targets the specified quantile $\tau$ instead of the center of the distribution.

(figure: err_func_quantile_reg-compressor.gif — the original drawing code for this figure is written in Python.)
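The original figure was drawn in Python, but a minimal R sketch of the two error-evaluation functions (squared error and the asymmetric check loss, here for an illustrative $\tau = 0.25$) might look like this:

```r
# Plot the squared-error loss (blue) against the asymmetric
# absolute-value ("check") loss for tau = 0.25 (red).
psi <- function(r, tau) ifelse(r <= 0, (tau - 1) * r, tau * r)
r <- seq(-3, 3, .01)
plot(r, 0.5 * r^2, type="l", col="blue", xlab="residual r", ylab="loss")
lines(r, psi(r, .25), col="red")
legend("top", c("least squares: r^2 / 2", "check loss (tau = 0.25)"),
       col=c("blue", "red"), lty=1)
```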

The quantile regression fit $\hat{g}_{\tau}$ at quantile $\tau$ can be defined as follows: the problem is to find the $g(\cdot)$ that minimizes the total error as evaluated by the function $\psi_{\tau}(\cdot)$.

```math
\hat{g}_{\tau} = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{n} \psi_{\tau} \{ y_i - g(x_i) \}
```

For the least squares method, the function $\psi_{\tau}(\cdot)$ (the blue line) is

```math
\psi_{\tau}(r) = {1 \over 2} r^2
```

In quantile regression, on the other hand, the following asymmetric absolute-value function (the red line) is used to evaluate the error.

```math
\psi_{\tau}(r) =
\begin{cases}
    (\tau - 1)\, r & {\rm if}\ \ r \le 0 \\
    \tau\, r & {\rm if}\ \ 0 < r
\end{cases}
```

Now, assume a linear function for $g(\cdot)$:

```math
g(x_i) = \alpha + \beta x_i
```

The problem then becomes finding the $\alpha$ and $\beta$ that minimize the error criterion above.
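As a sketch of what this minimization means (this is not how `quantreg` solves it; `rq()` uses specialized linear-programming algorithms), the check-loss criterion can be minimized directly with a general-purpose optimizer, reusing `data_x` and `data_y` from section 1 and an illustrative $\tau = 0.75$:

```r
# Directly minimize the check-loss criterion over (alpha, beta).
# Compare the result with coef(rq(data_y ~ data_x, tau=.75)).
tau <- 0.75
check_loss <- function(par, x, y) {
    r <- y - (par[1] + par[2] * x)
    sum(ifelse(r <= 0, (tau - 1) * r, tau * r))
}
fit <- optim(c(0, 0), check_loss, x = data_x, y = data_y)
fit$par   # estimated (alpha, beta)
```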

Let the estimator of $y$ be $\hat{y} = g(x_i)$.

Also, let the density function and cumulative distribution function of the response be $f(y)$ and $F(y)$; they are illustrated in the figures below.

(figures: gamma_plot2-compressor.png, gamma_pdf_cdf-compressor.png)

The expected loss $L(\cdot)$ is defined as follows.

```math
L(\hat{y}) = E[\psi_{\tau} \{ y - \hat{y} \}]
= (\tau - 1) \int_{-\infty}^{\hat{y}} (y - \hat{y})\ f(y)\ dy
+ \tau \int_{\hat{y}}^{\infty} (y - \hat{y})\ f(y)\ dy
```

Differentiating the loss $L(\cdot)$ with respect to $\hat{y}$ and setting the derivative to 0:

```math
{\partial L \over \partial \hat{y}} =
(\tau - 1) \int_{-\infty}^{\hat{y}} (-1)\ f(y)\ dy
+ \tau \int_{\hat{y}}^{\infty} (-1)\ f(y)\ dy \\
= \int_{-\infty}^{\hat{y}} f(y)\ dy - \tau \left( \int_{-\infty}^{\hat{y}} f(y)\ dy + \int_{\hat{y}}^{\infty} f(y)\ dy \right) \\
= F(\hat{y}) - \tau = 0
```

Therefore,

```math
F(\hat{y}) = \tau
```

holds: $\tau$ equals the value of the cumulative distribution function at $\hat{y}$, so the minimizer $\hat{y}$ is exactly the $\tau$-quantile.
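As a quick numerical check of this result (a sketch with simulated data): for a plain sample, the constant $\hat{y}$ that minimizes the empirical check loss should be close to the sample $\tau$-quantile.

```r
# The minimizer of the empirical check loss is the tau-quantile.
set.seed(1)
y   <- rexp(10000, rate=.5)   # a long-tailed sample
tau <- 0.9
loss  <- function(m) sum(ifelse(y - m <= 0, (tau - 1) * (y - m), tau * (y - m)))
m_hat <- optimize(loss, interval=range(y))$minimum
c(m_hat, quantile(y, tau))    # the two values should nearly agree
```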

4. References

[1] ‘quantreg’ manual   https://cran.r-project.org/web/packages/quantreg/quantreg.pdf

[2] Wikipedia "Quantile Regression"   https://en.wikipedia.org/wiki/Quantile_regression

[3] Study on adaptive quantile regression analysis [Hiroyuki Moriguchi]   http://miuse.mie-u.ac.jp/bitstream/10076/10813/1/M2008288.pdf

[4] Kyoto University 2013 Microeconomics Lecture Note 8 Quantile Regression [Ryo Okui]   http://www.okui.kier.kyoto-u.ac.jp/doc/quantile.pdf

[5] Verification of efficiency wage hypothesis by quantile regression [Kazutoshi Nakamura]   http://reposit.sun.ac.jp/dspace/bitstream/10561/746/1/v43n4p87_nakamura.pdf
