This is an introduction to a method called quantile regression.
An ordinary regression line can be interpreted as the conditional expected value (mean) of $y$ given $x$; in quantile regression, we instead draw regression lines based on quantiles such as the 25% or 95% quantile.
First, let's generate some data and visualize it with a graph.
Here the data are generated so that the error variance is small where the explanatory variable $x$ is small and large where $x$ is large.
In quantile regression, a separate $\beta$ is estimated for each quantile, so each line has a different slope.
**Quantile regression result:** from the bottom, the 5%, 10%, 25%, 75%, 90%, and 95% quantile regression lines, together with the ordinary regression line.
First, generate data and draw a scatter plot.
# Install if not already installed
# install.packages('quantreg')
library(quantreg)

b <- -5  # Slope of the line
a <- 35  # Intercept of the line
data_x <- runif(500, 0, 5)  # Uniform random numbers between 0 and 5
data_y <- rep(0, length(data_x))
for (i in 1:length(data_x)){
  # Linear model with intercept a and slope b, plus noise whose
  # standard deviation grows with the explanatory variable
  data_y[i] <- b*data_x[i] + a + rnorm(1, mean=0, sd=1 + data_x[i])
}

# Draw a scatter plot
plot(data_x, data_y, xlab="Random Variable", ylab="Random Variable",
     type="n", cex=.5, xlim=c(0, 5), ylim=c(0, 40))
points(data_x, data_y, cex=.5, col="blue")
Quantile regression was performed on this generated data at the 5%, 10%, 25%, 75%, 90%, and 95% quantiles, and the corresponding lines were drawn. The result of ordinary linear regression is drawn at the same time for comparison. In this example the errors follow a normal distribution; since that distribution is symmetric, the median quantile regression line and the ordinary regression line representing the mean are almost identical.
taus <- c(.05, .95, .10, .90, .25, .75)  # Quantiles to fit
cols <- c(4, 4, 3, 3, 2, 2)  # One color per symmetric pair of quantiles
xx <- seq(min(data_x), max(data_x), .1)
f <- coef(rq(data_y ~ data_x, tau=taus))  # Fit the quantile regressions; one coefficient column per tau
yy <- cbind(1, xx) %*% f  # Fitted values of each quantile line over the grid xx

# Draw the quantile regression lines
for(i in 1:length(taus)){
  lines(xx, yy[,i], col=cols[i])
}
abline(lm(data_y ~ data_x), col="red", lty=2, lwd=2)  # Ordinary (least squares) regression line
abline(rq(data_y ~ data_x), col="black", lwd=2)  # Median (tau = 0.5) quantile regression line
legend(0, 8, c("mean (LSE) fit", "median (LAE) fit"), col=c("red", "black"), lty=c(2, 1), lwd=c(1, 2))
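To confirm that each quantile really gets its own slope, you can print the coefficient matrix of the fitted model. A minimal sketch, assuming the data_x, data_y, and taus objects defined above:

fit <- rq(data_y ~ data_x, tau=taus)
print(coef(fit))  # 2 x 6 matrix: intercept and slope, one column per tau
# summary(fit)    # per-tau coefficients with confidence information

With the heteroscedastic data above, the slopes should fan out: steeper than $b$ for the low quantiles and shallower for the high ones.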
A symmetric distribution is less interesting, so let's instead generate data whose errors follow an exponential distribution, which has a long right tail, and apply quantile regression.
This time the median quantile regression line (black) is drawn in a more plausible position than the least squares line (red dashed). Because of the long-tailed error distribution, the lower quantile lines are packed close together while the upper ones are spread farther apart.
Generate data with exponentially distributed errors and draw a scatter plot.
# Install if not already installed
# install.packages('quantreg')
library(quantreg)

# b      : slope of the line
# lambda : rate parameter of the exponential distribution
# taus   : vector of quantiles, e.g. c(.05, .95, .10, .90, .25, .75)
quant_reg_exp <- function(b, lambda, taus){
  Dat <- NULL
  Dat$x <- rnorm(100, mean=50, sd=7)  # Explanatory variable: normal with mean 50 and sd 7
  Dat$y <- rep(0, 100)
  for (i in 1:length(Dat$x)){
    Dat$y[i] <- b * Dat$x[i] + rexp(1, lambda)  # Errors follow an exponential distribution
  }
  data_x <- Dat$x
  data_y <- Dat$y

  # Draw a scatter plot
  plot(data_x, data_y, xlab="Random Variable", ylab="Random Variable", type="n", cex=.5)
  points(data_x, data_y, cex=.5, col="blue")

  xx <- seq(min(data_x), max(data_x), .1)
  f <- coef(rq(data_y ~ data_x, tau=taus))  # Fit the quantile regressions
  yy <- cbind(1, xx) %*% f

  # Draw the quantile regression lines
  cols <- c(2, 2, 3, 3, 4, 4)  # One color per symmetric pair of quantiles
  for(i in 1:length(taus)){
    lines(xx, yy[,i], col=cols[i])
  }
  abline(lm(data_y ~ data_x), col="red", lty=2)  # Ordinary (least squares) regression line
  abline(rq(data_y ~ data_x), col="black", lwd=2)  # Median (tau = 0.5) quantile regression line
  legend(min(data_x), max(data_y), c("mean (LSE) fit", "median (LAE) fit"),
         col=c("red", "black"), lty=c(2, 1), lwd=c(1, 2))
}

# Arguments: b, lambda, taus
quant_reg_exp(.5, .5, c(.05, .95, .10, .90, .25, .75))
Finally, let me briefly touch on how quantile regression differs from the least squares method; see the references listed below for details.
The blue line below plots the function $\frac{1}{2} x^2$, the loss used to evaluate errors in the least squares method.
The red line is an asymmetric absolute value function determined by $\tau$, used to evaluate errors in quantile regression; its shape changes with the quantile $\tau$.
Both functions are zero at 0 and measure the distance from 0, flipping distances in the negative direction to positive. By making this loss asymmetric rather than centered, quantile regression targets the specified quantile $\tau$ instead of the middle of the distribution.
(The drawing code is here; it is written in Python.)
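For reference, here is a minimal R sketch of the two loss functions (the drawing code linked above is in Python); the value tau = 0.9 is an arbitrary choice for illustration:

# Squared loss (blue) vs. asymmetric absolute value (check) loss (red)
r <- seq(-3, 3, by=0.01)
tau <- 0.9
plot(r, 0.5 * r^2, type="l", col="blue", ylab="loss")
lines(r, ifelse(r <= 0, (tau - 1) * r, tau * r), col="red")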
The quantile regression estimate $\hat{g}_{\tau}$ at quantile $\tau$ can be defined as follows: the problem is to find the $g(\cdot)$ that minimizes the total error evaluated by the function $\psi_{\tau}(\cdot)$.
\hat{g}_{\tau} = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{n} \psi_{\tau} \{ y_i - g(x_i) \}
For the least squares method, the function $\psi(\cdot)$ (blue line) is
\psi(r) = {1 \over 2} r^2
However, in quantile regression, the following asymmetric absolute value function (red line) is used to evaluate the error.
\psi_{\tau}(r) =
\begin{cases}
(\tau - 1)\ r & {\rm if}\ \ r \le 0\\
\tau\ r & {\rm if}\ \ 0 < r
\end{cases}
Now, assuming a linear function for $g(\cdot)$,
g(x_i) = \alpha + \beta x_i
the problem becomes finding the $\alpha, \beta$ that minimize the error function in the equation above.
Let the estimator of $y$ be $\hat{y} = g(x_i)$.
Also, let the density function and cumulative distribution function of the error be $f(y)$ and $F(y)$, respectively. The figure below illustrates this.
The error function $L(\cdot)$ is defined as follows.
L(\hat{y}) = E[\psi_{\tau} \{ y - \hat{y} \} ]
= (\tau - 1) \int_{-\infty}^{\hat{y}} (y-\hat{y})\ f(y)\ dy
+ \tau \int^{\infty}_{\hat{y}} (y-\hat{y})\ f(y)\ dy
Differentiating the error function $L(\cdot)$ with respect to $\hat{y}$ and setting the result to 0 gives
{\partial L \over \partial \hat{y}} =
(\tau -1) \int_{-\infty}^{\hat{y}} (-1)\ f(y) \ dy
+ \tau \int^{\infty}_{\hat{y}} (-1)\ f(y) \ dy \\
= \int_{-\infty}^{\hat{y}}\ f(y) \ dy - \tau\ \left( \int_{-\infty}^{\hat{y}}\ f(y) \ dy + \int^{\infty}_{\hat{y}}\ f(y) \ dy \right) \\
= F(\hat{y}) - \tau = 0
Therefore,
F(\hat{y}) = \tau
holds, so the minimizer $\hat{y}$ is the point whose cumulative probability $F(\hat{y})$ equals $\tau$, i.e. the $\tau$ quantile.
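As a quick numerical check of this result, here is a minimal R sketch (the exponential sample and tau = 0.9 below are arbitrary illustration choices): minimizing the empirical mean of $\psi_{\tau}$ should recover the sample $\tau$ quantile.

# Check loss: psi_tau(r) = (tau - 1) * r for r <= 0, tau * r for r > 0
psi <- function(r, tau) ifelse(r <= 0, (tau - 1) * r, tau * r)

set.seed(1)
y <- rexp(10000, rate=0.5)  # A skewed sample
tau <- 0.9

# Minimize the empirical mean of psi_tau(y - yhat) over yhat
loss <- function(yhat) mean(psi(y - yhat, tau))
opt <- optimize(loss, range(y))
c(opt$minimum, quantile(y, tau))  # The two values should nearly coincide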
[1] 'quantreg' package manual: https://cran.r-project.org/web/packages/quantreg/quantreg.pdf
[2] Wikipedia, "Quantile Regression": https://en.wikipedia.org/wiki/Quantile_regression
[3] Hiroyuki Moriguchi, "Study on adaptive quantile regression analysis": http://miuse.mie-u.ac.jp/bitstream/10076/10813/1/M2008288.pdf
[4] Ryo Okui, Kyoto University 2013 Microeconomics Lecture Note 8, "Quantile Regression": http://www.okui.kier.kyoto-u.ac.jp/doc/quantile.pdf
[5] Kazutoshi Nakamura, "Verification of the efficiency wage hypothesis by quantile regression": http://reposit.sun.ac.jp/dspace/bitstream/10561/746/1/v43n4p87_nakamura.pdf