This is a continuation of [Statistical test 2nd grade / quasi 1st grade] Regression analysis training in Python (1) - Qiita. This time we actually run a multiple regression analysis with two or more independent variables (explanatory variables) in Python and consider its statistical interpretation. As before, I use libraries such as numpy and pandas but write the calculation part myself from the mathematical formulas.
The theoretical part again refers to the following: The University of Tokyo, Faculty of Liberal Arts, Department of Statistics, "Introduction to Statistics (Basic Statistics I)", University of Tokyo Press (1991), Chapter 13.
I would have liked to reuse the JR route data from last time, but collecting suitable independent variables looked difficult, so I use a different data set this time: Statistics Bureau Homepage / Statistics of Japan 2019 - Chapter 2 Population / Households.
The following data for each prefecture is combined and processed into CSV format for use.
- 2-2 Population by prefecture and population increase / decrease rate
- 2-13 Daytime population by prefecture and number of people working / attending school outside of home
- 2-14 Number of people moving in and out by prefecture
- 2-16 Number of births / deaths and number of marriages / divorces by prefecture
**(2020/01/07) The population and population ratio columns were off by one row, so I have corrected them.**
Prefectures,population,Population sex ratio,Population increase / decrease rate,Daytime / nighttime population ratio,Number of over-migrants,birthrate,Mortality,Natural increase / decrease rate,Marriage rate,Divorce rate
Hokkaido,5320,89.1,-0.455,99.945,-6569,6.4,11.8,-5.4,4.5,1.92
Aomori,1278,88.6,-0.965,99.849,-6075,6.3,13.8,-7.5,4.0,1.64
:
:
Okinawa,1626,88.5,0.582,99.968,-1112,11.3,8.4,3.0,5.7,2.44
- Population (unit: 1,000 people) is the estimated population in 2017.
- The population increase / decrease rate is the average annual rate of change, calculated as ((2015 population / 2010 population)^(1/5) - 1) × 100 (this figure does not appear in the original data).
As before, save this as **population.csv** with UTF-8 character encoding.
The code below has been checked with Python 3.6.8.
import numpy as np
import pandas as pd
df = pd.read_csv("population.csv")
print(df) #Check the contents of the imported data
Using this data, let's think about regression analysis with two or more independent variables, that is, "multiple regression analysis". Many combinations of independent and dependent variables are possible, but to start with, and to keep the method easy to check, it seems best to pick a combination whose relationship should be easy to interpret. Let's try:
- Independent variables: birth rate[^birthrate] and mortality rate[^birthrate]
- Dependent variable: population increase / decrease rate
[^birthrate]: Number of births and deaths per 1,000 population
Consider the regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where $x_1$ is the birth rate, $x_2$ is the mortality rate, and $y$ is the population increase / decrease rate. Just as in the one-variable case, we choose the partial regression coefficients $\beta_0, \beta_1, \beta_2$ so as to minimize the sum of squared residuals over the data points (the residual being the difference between the observed value (realized value) and the theoretical value (predicted value) of the dependent variable). For the data points (observed values) $(x_1^{(1)}, x_2^{(1)}, y^{(1)}), ..., (x_1^{(n)}, x_2^{(n)}, y^{(n)})$, writing the theoretical value of the dependent variable as $\hat{y}^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)}$, the sum of squared residuals $L$ is
\begin{align}
L &= \sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2 \\
&= \sum_{i=1}^n (y^{(i)} - \beta_0 - \beta_1 x_1^{(i)} - \beta_2 x_2^{(i)})^2
\end{align}
Taking the values of $\beta_0, \beta_1, \beta_2$ that set the partial derivatives to zero as the estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2$, we get
\begin{align}
\hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}
\end{align}
(where $^\top$ denotes the transpose of a matrix or vector), with
\begin{align}
\hat{\boldsymbol{\beta}} &= (\hat\beta_0, \hat\beta_1, \hat\beta_2)^\top \\
\boldsymbol{X} &= \begin{pmatrix}
1 & x_1^{(1)} & x_2^{(1)} \\
\vdots & \vdots & \vdots \\
1 & x_1^{(n)} & x_2^{(n)} \\
\end{pmatrix}
\\
\boldsymbol{y} &= (y^{(1)}, ..., y^{(n)})^\top
\end{align}
Here we assume that $\boldsymbol{X}^\top \boldsymbol{X}$ has an inverse matrix. Now let's use this result to compute the partial regression coefficients.
n = len(df)
X = np.c_[np.ones(n), df[["birthrate", "Mortality"]].values]
y = df["Population increase / decrease rate"]
b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(b) # Output: [ 1.00978055 0.10175178 -0.18246347]
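As a sanity check (not part of the original calculation), the same coefficients can be obtained with numpy's built-in least-squares solver, which avoids forming the inverse explicitly; this is just a sketch assuming the X and y defined above.
b_check, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)  # solves the same least-squares problem
print(b_check)  # should agree with b up to floating-point error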
Therefore, as the regression equation for the theoretical value $\hat{y}$ of the population increase / decrease rate, we obtain $\hat{y} = 1.010 + 0.102 x_1 - 0.182 x_2$. That is:
- If the birth rate per 1,000 population increases by 1, the annual population increase / decrease rate increases by 0.102 points.
- If the mortality rate per 1,000 population increases by 1, the annual population increase / decrease rate decreases by 0.182 points.
Computing the coefficient of determination as in the one-variable case:
TSS = ((y - y.mean()) ** 2).sum() #Total variation
ESS = ((X.dot(b) - y.mean()) ** 2).sum() #Regression variation
RSS = ((y - X.dot(b)) ** 2).sum() #Residual variation
R_2 = ESS / TSS #Coefficient of determination
print(R_2) # 0.9014307808651625
we find it to be about 0.901. Is this value large or small? We will try other independent variables later, so let's compare the values then.
The reference book above skips the details, so let's derive the formula for the partial regression coefficients $\beta_0, \beta_1, \beta_2$ ourselves. It could be obtained by taking the partial derivatives directly and solving the resulting simultaneous equations, but the algebra, already tedious with one independent variable, becomes even more tedious, so let's organize it more transparently using "differentiation with respect to a vector".
"Differentiation by vector / differentiation by matrix" official summary --Qiita
Here we generalize to $p$ independent variables and find the formula for the partial regression coefficients $\beta_0, \beta_1, ..., \beta_p$. Define
\begin{align}
\boldsymbol{\beta} &= (\beta_0, \beta_1, ..., \beta_p)^\top \\
\boldsymbol{x}^{(i)} &= (1, x_1^{(i)}, ..., x_p^{(i)})^\top \\
\boldsymbol{X} &= (\boldsymbol{x}^{(1)}, ..., \boldsymbol{x}^{(n)})^\top = \begin{pmatrix}
1 & x_1^{(1)} & \cdots & x_p^{(1)} \\
\vdots & \vdots & & \vdots \\
1 & x_1^{(n)} & \cdots & x_p^{(n)} \\
\end{pmatrix}
\\
\boldsymbol{y} &= (y^{(1)}, ..., y^{(n)})^\top \\
\epsilon^{(i)}
&= y^{(i)} - \hat{y}^{(i)} = y^{(i)} - (\beta_0 + \beta_1 x_1^{(i)} + ... + \beta_p x_p^{(i)} ) \\
&= y^{(i)} - \boldsymbol{\beta}^\top \boldsymbol{x}^{(i)} \\
\boldsymbol{\epsilon} &= (\epsilon^{(1)}, ..., \epsilon^{(n)})^\top \\
&= \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta}\\
\end{align}
Note that, for simplicity, $\boldsymbol{x}^{(i)}$ has been given one extra component corresponding to the constant term. We also write $\boldsymbol{0}$ for the zero vector (the vector whose components are all zero). Setting the partial derivatives with respect to $\beta_0, \beta_1, ..., \beta_p$ to zero is the same as setting $\frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol{0}$.
We find $\boldsymbol{\beta}$ using the formulas from the page above as needed. What follows is essentially the derivation from the next page, written out a little more carefully: Introduction to Statistics - Chapter 7 | Garaku Taton Chenkan > Trivia Room > Trivia Corner
\begin{align}
L &= \sum_{i=1}^n (\epsilon^{(i)})^2 = \boldsymbol{\epsilon}^\top \boldsymbol{\epsilon} \\
&= (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta})^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta}) \\
&= \boldsymbol{y}^\top \boldsymbol{y} - \boldsymbol{y}^\top (\boldsymbol{X} \boldsymbol{\beta}) - (\boldsymbol{X} \boldsymbol{\beta})^\top \boldsymbol{y} + \boldsymbol{\beta}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\beta} \\
&= \boldsymbol{y}^\top \boldsymbol{y} - 2 \boldsymbol{y}^\top \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\beta}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\beta} \\
\frac{\partial L}{\partial \boldsymbol{\beta}} &= \frac{\partial}{\partial \boldsymbol{\beta}} \boldsymbol{y}^\top \boldsymbol{y} - 2 \frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{y}^\top \boldsymbol{X}) \boldsymbol{\beta} + \frac{\partial}{\partial \boldsymbol{\beta}}\boldsymbol{\beta}^\top (\boldsymbol{X}^\top \boldsymbol{X}) \boldsymbol{\beta} \\
&= -2 (\boldsymbol{y}^\top \boldsymbol{X})^\top + (\boldsymbol{X}^\top \boldsymbol{X} + (\boldsymbol{X}^\top \boldsymbol{X})^\top) \boldsymbol{\beta} \\
&= -2 \boldsymbol{X}^\top \boldsymbol{y} + 2\boldsymbol{X}^\top \boldsymbol{X}\boldsymbol{\beta}
\end{align}
Therefore, when $\boldsymbol{X}^\top \boldsymbol{X}$ has an inverse matrix, the solution of $\frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol{0}$, taken as the estimate $\hat{\boldsymbol{\beta}}$, is
\begin{align}
\hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}
\end{align}
which is exactly the formula used above.
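As a quick numerical check (a small sketch, assuming the X, y and b computed earlier are still defined), the gradient $-2 \boldsymbol{X}^\top (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}})$ should vanish at the least-squares solution, i.e. $\boldsymbol{X}^\top (\boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}}) \approx \boldsymbol{0}$.
print(X.T.dot(y - X.dot(b)))  # numerical check of the normal equations; expect values near zero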
Now that we have a regression equation, let's consider statistically how reliable the resulting regression equation is. So far the story has been more linear algebra than statistics.
Until now the idea was to find the single best regression equation from the observed values of the dependent variable. Now, what if we **treat the dependent variable as a random variable following a probability distribution**? In other words, even with the same values of the independent variables $\boldsymbol{X}$, repeating the observation many times would give a different observed $\boldsymbol{y}$ each time. The partial regression coefficients $\hat{\boldsymbol{\beta}}$ obtained from the formula above are then also random variables that change with each observation. **We cannot know the true partial regression coefficients, but we can consider the probability distribution that the estimated coefficients follow.** The idea is the same as interval estimation of a population mean in the Statistical Test Grade 2 syllabus (the true population mean cannot be known, and we instead work with the probability distribution that its estimator follows).
For example, when you actually compute a regression equation, some independent variables may play an important role while others have almost no effect. If $\beta_j = 0$, the corresponding independent variable $x_j$ has no effect on the value of the dependent variable. So, once the regression equation has been obtained from the observed values, we set up the null hypothesis $\beta_j = 0$ against the alternative hypothesis $\beta_j \neq 0$; if the null hypothesis is rejected, we can regard that independent variable as genuinely explaining the dependent variable in the regression equation. For that we need the probability distribution of the $\hat{\beta_j}$ obtained from the observed values.
Here we assume that the errors $\epsilon^{(i)}$ of the individual data points independently follow a normal distribution with mean 0 and variance $\sigma^2$. Then:
- $y^{(i)}$ (as a random variable) independently follows a normal distribution with mean $\hat{y}^{(i)}$ and variance $\sigma^2$.
- Each $\hat\beta_j$ in $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$ is a linear combination of $y^{(1)}, ..., y^{(n)}$, so it also follows a normal distribution.
The mean of this $\hat{\boldsymbol{\beta}}$ coincides with the true partial regression coefficients $\boldsymbol{\beta}$ (proof below). Also, defining $S_{ij}$ as the $(i, j)$ component of $(\boldsymbol{X}^\top \boldsymbol{X})^{-1}$ (with indices starting at 0 to match the numbering of the $\beta_j$, so the upper-left entry is the $(0, 0)$ component), the variance of each $\hat{\beta_j}$ is $\sigma^2 S_{jj}$ (proof also below). Therefore the standardization of $\hat{\beta_j}$ to mean 0 and variance 1,
Z_j = \frac{\hat{\beta_j} - \beta_j}{\sqrt{\sigma^2 S_{jj}}}
follows a standard normal distribution. As usual, however, we cannot know the error variance $\sigma^2$; in practice it has to be estimated from the observed values. To make the estimator unbiased, the denominator is the number of data points minus (the number of independent variables + 1). With $p$ independent variables,
s^2 = \frac{\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2}{n-p-1}
is an unbiased estimator of $\sigma^2$; this is the estimate of the error variance. Replacing $\sigma^2$ with $s^2$,
t_j = \frac{\hat{\beta_j} - \beta_j}{\sqrt{s^2 S_{jj}}}
follows a $t$ distribution with $n-p-1$ degrees of freedom. The reason is the same as for the "t statistic (10.4)" in: [Statistical Test Level 2] Summary of Probability Distributions in Hypothesis Tests for Normal Population - Qiita
Therefore, to test whether $\beta_j = 0$ can be said, we compute
t_j = \frac{\hat{\beta_j} - 0}{\sqrt{s^2 S_{jj}}} = \frac{\hat{\beta_j}}{\sqrt{\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2 S_{jj} / (n-p-1)}}
and look up this value in the $t$ distribution with $n-p-1$ degrees of freedom to obtain the $P$ value. The larger the absolute value of the $t$ statistic, the smaller the $P$ value, meaning the independent variable is more meaningful.
Also, since the estimated error variance $s^2$ obtained above represents the part that the regression equation cannot explain, we can roughly say that the smaller it is (relatively), the better the regression equation.
The preamble has grown long, but now let's finally put this into practice. Using the data above, we compute the $t$ statistics and the $P$ values for the null hypotheses $\beta_j = 0$ on the partial regression coefficients of the two independent variables. In the test itself, a $t$ distribution table is used to find the $P$ value; here we compute it with the scipy.stats module.
from scipy import stats
p = X.shape[1] - 1 #Number of independent variables
S = np.linalg.inv(X.T.dot(X)) # inverse of (X^T X)
s_2 = ((y - X.dot(b)) ** 2).sum() / (n - p - 1) #Estimated value of error variance
t = b / np.sqrt(s_2 * S.diagonal()) #t statistic
P = (1 - stats.t.cdf(np.abs(t), n - p - 1)) * 2 #P value
print(s_2) # 0.015396963025934803
print(t) # [ 3.273055 3.94856867 -13.93114362]
print(P) # [0.00207535 0.00028009 0. ]
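The same figures can also be cross-checked with the statsmodels package (an assumption here: statsmodels is not used anywhere else in this article, so this is only an optional sketch). Its OLS class reports the coefficients, t statistics and P values in a single summary table, much like R's output.
import statsmodels.api as sm
result = sm.OLS(y, X).fit()  # X already contains the column of ones
print(result.summary())      # coefficients, t values and P values in one table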
**In Statistical Test Level 2 there is always one question that asks you to read R output, and that output is produced by exactly this kind of calculation.**
Looking at this result, both $\beta_1 = 0$ and $\beta_2 = 0$ are rejected at the 5% significance level, and we can read off that the mortality rate has a stronger effect than the birth rate. On the other hand, from $s^2 = 0.0154$ the standard error of the population increase / decrease rate [%] is estimated as $s = 0.124$, but with nothing to compare it to I cannot yet say whether this is good or not.
Below, the expected value of a vector means the vector whose components are the expected values of the corresponding components. If each data error $\epsilon^{(i)}$ has mean 0, then
\begin{align}
{\rm E}[\hat{\boldsymbol{\beta}}] &= {\rm E}[(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}] \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top {\rm E}[\boldsymbol{y}] \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top {\rm E}[\boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}] \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\beta} \\
&= \boldsymbol{\beta}
\end{align}
Thus the mean of the estimated partial regression coefficients coincides with the true partial regression coefficients $\boldsymbol{\beta}$ (the estimator is unbiased). There are various methods for estimating partial regression coefficients besides least squares, but giving an unbiased estimator is a great advantage of the least squares method.
The following page was used as a reference, with the intermediate calculations written out more fully: Section 2.4 Partial Regression Coefficient and Its Test | Data Analysis / Basics and Applications
In what follows, ${\rm V}[\hat{\boldsymbol{\beta}}]$ denotes the variance-covariance matrix: its $(i, j)$ component is the covariance of $\hat{\beta_i}$ and $\hat{\beta_j}$ (so the $(i, i)$ component is the variance of $\hat{\beta_i}$), with indices starting at 0 to match the numbering of the $\beta_j$, so the upper-left entry is the $(0, 0)$ component.
\begin{align}
\hat{\boldsymbol{\beta}} &= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top (\boldsymbol{\epsilon} + \boldsymbol{X} \boldsymbol{\beta}) \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} + (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\beta} \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} + \boldsymbol{\beta}
\end{align}
Using this,
\begin{align}
{\rm V}[\hat{\boldsymbol{\beta}}] &= {\rm E}[(\hat{\boldsymbol{\beta}} - {\rm E}[\hat{\boldsymbol{\beta}}])(\hat{\boldsymbol{\beta}} - {\rm E}[\hat{\boldsymbol{\beta}}])^\top] \\
&= {\rm E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top] \\
&= {\rm E}[((\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon})((\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon})^\top] \\
&= {\rm E}[(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} \boldsymbol{\epsilon}^\top \boldsymbol{X} ((\boldsymbol{X}^\top \boldsymbol{X})^{-1})^\top] \\
&= {\rm E}[(\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{\epsilon} \boldsymbol{\epsilon}^\top \boldsymbol{X} (\boldsymbol{X}^\top \boldsymbol{X})^{-1}] \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top {\rm E}[\boldsymbol{\epsilon} \boldsymbol{\epsilon}^\top] \boldsymbol{X} (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \\
&= (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top (\sigma^2 \boldsymbol{I}) \boldsymbol{X} (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \\
&= \sigma^2 (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \\
\end{align}
(where $\boldsymbol{I}$ is the identity matrix). Defining $S_{ij}$ as the $(i, j)$ component of $(\boldsymbol{X}^\top \boldsymbol{X})^{-1}$, we see that the variance of each $\hat{\beta_j}$ is indeed $\sigma^2 S_{jj}$.
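These two facts, ${\rm E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ and ${\rm V}[\hat{\boldsymbol{\beta}}] = \sigma^2 (\boldsymbol{X}^\top \boldsymbol{X})^{-1}$, can also be checked by simulation. The following sketch reuses the X, b, s_2 and n computed above and, purely for illustration, treats b as the "true" coefficients and $\sqrt{s^2}$ as the true error standard deviation; these are assumptions made only for the experiment, not facts about the data.
rng = np.random.RandomState(0)      # fixed seed for reproducibility
sigma = np.sqrt(s_2)                # treat the estimate as the "true" error s.d. (assumption)
S = np.linalg.inv(X.T.dot(X))
n_trials = 10000
betas = np.empty((n_trials, X.shape[1]))
for k in range(n_trials):
    y_sim = X.dot(b) + rng.normal(0.0, sigma, size=n)  # simulate y = X beta + eps
    betas[k] = S.dot(X.T).dot(y_sim)                   # re-estimate the coefficients
print(betas.mean(axis=0))           # close to b: the estimator is unbiased
print(betas.var(axis=0))            # close to sigma^2 * diag((X^T X)^-1)
print(sigma ** 2 * S.diagonal())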
So far we have used two independent variables, the birth rate and the mortality rate, to estimate the population increase / decrease rate. Now let's consider whether this choice of independent variables is really a good one, or whether some other combination would be better. There may well be factors other than the birth rate and the mortality rate that affect the population increase / decrease rate.
If we pick any two independent variables from the columns other than the population increase / decrease rate (and the prefecture name, which is a nominal scale), which combinations work better?
import itertools
y = df["Population increase / decrease rate"]
vars = [x for x in df.columns if x not in ["Prefectures", "Population increase / decrease rate"]]
for exp_vars in itertools.combinations(vars, 2): #Choose two independent variables
    X = np.c_[np.ones(n), df[list(exp_vars)].values]
    b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y) #Calculate the partial regression coefficient
    TSS = ((y - y.mean()) ** 2).sum() #Total variation
    ESS = ((X.dot(b) - y.mean()) ** 2).sum() #Regression variation
    R_2 = ESS / TSS #Coefficient of determination
    p = X.shape[1] - 1 #Number of independent variables
    S = np.linalg.inv(X.T.dot(X)) # inverse of (X^T X)
    s_2 = ((y - X.dot(b)) ** 2).sum() / (n - p - 1) #Estimated value of error variance
    t = b / np.sqrt(s_2 * S.diagonal()) #t statistic
    P = (1 - stats.t.cdf(np.abs(t), n - p - 1)) * 2 #P value
    print("Independent variable: ", exp_vars)
    print("Partial regression coefficient: ", b)
    print("Coefficient of determination: ", R_2)
    print("t statistic: ", t)
    print("P value: ", P)
    print("Estimated value of error variance:", s_2)
    print()
Since the full output is too long, below is an excerpt of the output for combinations with a coefficient of determination of 0.9 or more. **(2020/01/07) The results have been replaced following the data correction.**
Independent variable: ('population', 'Natural increase / decrease rate')
Partial regression coefficient: [1.35396495e-01 2.92466698e-05 1.38354513e-01]
Coefficient of determination: 0.9202813093155153
t statistic: [ 2.50498414 4.06232118 16.12219163]
P value: [0.01602233 0.00019698 0. ]
Estimated value of error variance: 0.012452424232620031
Independent variable: ('Number of over-migrants', 'Natural increase / decrease rate')
Partial regression coefficient: [2.36387964e-01 6.30652813e-06 1.43339245e-01]
Coefficient of determination: 0.9235317925311585
t statistic: [ 6.24974791 4.36740999 18.56543006]
P value: [1.44848286e-07 7.53394644e-05 0.00000000e+00]
Estimated value of error variance: 0.011944683881961054
Independent variable: ('birthrate', 'Mortality')
Partial regression coefficient: [ 1.00978055 0.10175178 -0.18246347]
Coefficient of determination: 0.9014307808651792
t statistic: [ 3.273055 3.94856867 -13.93114362]
P value: [0.00207535 0.00028009 0. ]
Estimated value of error variance: 0.015396963025934803
Independent variable: ('birthrate', 'Natural increase / decrease rate')
Partial regression coefficient: [ 1.02915847 -0.08316346 0.18254551]
Coefficient of determination: 0.9030150527461578
t statistic: [ 3.35453348 -2.39398938 14.07003206]
P value: [0.00164462 0.02099475 0. ]
Estimated value of error variance: 0.01514949250939055
Independent variable: ('Mortality', 'Natural increase / decrease rate')
Partial regression coefficient: [ 1.00487249 -0.08024897 0.10178654]
Coefficient of determination: 0.902445538157104
t statistic: [ 3.30081126 -2.33256484 4.02629601]
P value: [0.00191782 0.02430887 0.0002203 ]
Estimated value of error variance: 0.01523845329397937
Independent variable: ('Natural increase / decrease rate', 'Marriage rate')
Partial regression coefficient: [-0.56327258 0.12603311 0.16092148]
Coefficient of determination: 0.9001397075218234
t statistic: [-1.3456451 7.29460315 2.0734612 ]
P value: [1.85310560e-01 4.23937485e-09 4.40154160e-02]
Estimated value of error variance: 0.015598634589388555
In this case, should we settle on the combination of the number of over-migrants and the natural increase / decrease rate, since it has a large coefficient of determination and a small estimated error variance? However, the number of over-migrants is an absolute count, not a rate, so its impact on the population increase / decrease rate should depend on the size of the original population. Instead, let's divide the number of over-migrants by the population, call it the over-migration rate, and use that as an independent variable.
df["Over-migration rate"] = df["Number of over-migrants"] / (df["population"] * 1000) * 100
n = len(df)
exp_vars = ("Over-migration rate", "Natural increase / decrease rate")
y = df["Population increase / decrease rate"]
X = np.c_[np.ones(n), df[list(exp_vars)].values]
b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
TSS = ((y - y.mean()) ** 2).sum() #Total variation
ESS = ((X.dot(b) - y.mean()) ** 2).sum() #Regression variation
R_2 = ESS / TSS #Coefficient of determination
p = X.shape[1] - 1 #Number of independent variables
S = np.linalg.inv(X.T.dot(X)) # inverse of (X^T X)
s_2 = ((y - X.dot(b)) ** 2).sum() / (n - p - 1) #Estimated value of error variance
t = b / np.sqrt(s_2 * S.diagonal()) #t statistic
P = (1 - stats.t.cdf(np.abs(t), n - p - 1)) * 2 #P value
print("Independent variable: ", exp_vars)
print("Partial regression coefficient: ", b)
print("Coefficient of determination: ", R_2)
print("t statistic: ", t)
print("P value: ", P)
print("Estimated value of error variance:", s_2)
print()
Independent variable: ('Over-migration rate', 'Natural increase / decrease rate')
Partial regression coefficient: [0.20976966 0.75632976 0.10990394]
Coefficient of determination: 0.9580723972439072
t statistic: [ 7.50186534 8.42827532 14.31096232]
P value: [2.11598050e-09 9.87150361e-11 0.00000000e+00]
Estimated value of error variance: 0.006549283387531298
Compared with using the raw number of over-migrants, both the coefficient of determination and the estimated error variance have improved. Note also that both the over-migration rate and the natural increase / decrease rate should act to increase the population, and indeed the corresponding partial regression coefficients are both positive.
What about using three independent variables?
for exp_vars in itertools.combinations(vars, 3): #Choose 3 independent variables
    X = np.c_[np.ones(n), df[list(exp_vars)].values]
    b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y) #Calculate the partial regression coefficient
    TSS = ((y - y.mean()) ** 2).sum() #Total variation
    ESS = ((X.dot(b) - y.mean()) ** 2).sum() #Regression variation
    R_2 = ESS / TSS #Coefficient of determination
    p = X.shape[1] - 1 #Number of independent variables
    S = np.linalg.inv(X.T.dot(X)) # inverse of (X^T X)
    s_2 = ((y - X.dot(b)) ** 2).sum() / (n - p - 1) #Estimated value of error variance
    t = b / np.sqrt(s_2 * S.diagonal()) #t statistic
    P = (1 - stats.t.cdf(np.abs(t), n - p - 1)) * 2 #P value
    print("Independent variable: ", exp_vars)
    print("Partial regression coefficient: ", b)
    print("Coefficient of determination: ", R_2)
    print("t statistic: ", t)
    print("P value: ", P)
    print("Estimated value of error variance:", s_2)
    print()
For example, we can find a combination like the following. (It is not the best one; it is an example chosen because its results are easy to interpret.)
Independent variable: ('Natural increase / decrease rate', 'Marriage rate', 'Divorce rate')
Partial regression coefficient: [-0.30054533 0.12984192 0.19858489 -0.25130329]
Coefficient of determination: 0.9109243128862279
t statistic: [-0.72219064 7.82613187 2.61426129 -2.28169067]
P value: [4.74086687e-01 8.36975600e-10 1.22779282e-02 2.75178580e-02]
Estimated value of error variance: 0.014237611977607154
The higher the natural increase / decrease rate and the marriage rate, the more the population increases, and the higher the divorce rate, the more it decreases; a result that seems quite plausible. The partial regression coefficients (apart from the constant term) are all significant at the 5% level. On the other hand, let's look at a not-so-good example.
Independent variable: ('birthrate', 'Mortality', 'Natural increase / decrease rate')
Partial regression coefficient: [ 1.03883742 -0.2559058 0.17133451 0.35353742]
Coefficient of determination: 0.9034996234109349
t statistic: [ 3.3482056 -0.68534239 0.4646739 0.96013752]
P value: [0.00169918 0.49680614 0.6445094 0.34235404]
Estimated value of error variance: 0.015424353851135318
The strange result here is that the population decreases as the birth rate rises and increases as the mortality rate rises. This happens because of the relationship "natural increase / decrease rate = birth rate − mortality rate" among the independent variables, which causes the troublesome problem of multicollinearity. What is multicollinearity? Weblio Dictionary
If variable selection is not performed, it is better not to include variables that are strongly correlated with each other. If some of them are not linearly independent (for example, if variables A and B and their sum C = A + B are all included), the analysis breaks down. In some cases the sign of the correlation coefficient between an independent variable and the dependent variable does not match the sign of its partial regression coefficient. (omitted) This happens when strongly correlated independent variables are mixed together (one variable over-predicts and another cancels it out). Since this is inconvenient when thinking about causal relationships, it is advisable to look for a multiple regression equation that excludes the independent variables whose signs do not match.
It may be possible to detect to some extent whether multicollinearity is occurring, but a machine cannot understand that "the population failing to grow as the birth rate rises" makes no sense, so automatically finding "a multiple regression equation that excludes the independent variables whose signs do not match" seems quite difficult in practice. Interpreting the final result is where humans still have to think carefully.
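Still, as a rough check (a sketch, not part of the original analysis), one can at least look at the correlations among the candidate independent variables and confirm the near-exact linear relation noted above; the column names below are those of population.csv.
cols = ["birthrate", "Mortality", "Natural increase / decrease rate"]
print(df[cols].corr())  # strong pairwise correlations are a warning sign
diff = df["birthrate"] - df["Mortality"] - df["Natural increase / decrease rate"]
print(diff.abs().max())  # close to 0 (up to rounding): natural rate = birth rate - mortality rate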
That concludes the long discussion of regression analysis for now. Next time I plan to cover Chapter 3, "Principal Component Analysis".