Solving "Statistical Test Level 1 Corresponding Statistics Practice Workbook" with R, Python-Chapter 17 Regression Diagnostic Method-

Japan Statistical Society Officially Certified Statistical Test Level 1 Correspondence Statistics Practice Workbook Is useful not only for preparing statistical tests but also for organizing statistics, so I also tried to implement it in R and Python.

Q17.1

--The data was obtained from the following, but the format is complicated, so I prepared it manually (listed at the end of this article). -Statistics Bureau, Ministry of Internal Affairs and Communications e-Stat Prefectural / Municipalities (Social / Demographic System) --FY2017, "A1101_total population [people]" "B1103_ habitable area [ha]" -Automobile Inspection & Registration Information Association, Number of Vehicles Owned --There is no data as of August 1st year of Reiwa, and data as of February 2nd year of Reiwa was used.

R

df = read.csv('By prefecture_Population area passenger car.csv')
dim(df)
head(df)
47 4
A data.frame: 6 × 4
Total population of prefectures Area of habitable area Passenger car
<fct>	<int>	<int>	<int>
1 Hokkaido 5320000 2237239 2818051
2 Aomori Prefecture 1278000 322970 734341
3 Iwate Prefecture 1255000 371401 746817
4 Miyagi Prefecture 2323000 315488 1305508
5 Akita 996000 320437 592573
6 Yamagata Prefecture 1102000 288480 697137
df2 = df[! df$Prefectures%in% c('Tokyo', 'Osaka', 'Kanagawa Prefecture'), ]
dim(df2)
44 4
df2_1 = data.frame(c.ratio = df2$`Passenger car` / df2$`Total population`, p.density = df2$`Total population` / df2$`Resident area`)
head(df2_1)
A data.frame: 6 × 2
c.ratio	p.density
<dbl>	<dbl>
1	0.5297088	2.377931
2	0.5746017	3.957024
3	0.5950733	3.379097
4	0.5619923	7.363196
5	0.5949528	3.108255
6	0.6326107	3.820022
model = lm(c.ratio ~ sqrt(p.density), data = df2_1)
summary(model)

Call:
lm(formula = c.ratio ~ sqrt(p.density), data = df2_1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.133071 -0.041664 -0.000793  0.051277  0.117539 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.74215    0.03161  23.479  < 2e-16 ***
sqrt(p.density) -0.05147    0.01033  -4.981 1.13e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05825 on 42 degrees of freedom
Multiple R-squared:  0.3713,	Adjusted R-squared:  0.3564 
F-statistic: 24.81 on 1 and 42 DF,  p-value: 1.134e-05
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))

ダウンロード (9).png

――There are various ways to draw the scatter plot and regression curve in the answer, but ggplot2 is easy and clean to use, and it is usually better to show the confidence interval as well.

library(ggplot2)
df_1 = data.frame(c.ratio = df$`Passenger car` / df$`Total population`, p.density = df$`Total population` / df$`Resident area`)
ggplot(df_1, aes(x = p.density, y = c.ratio)) + geom_point() + geom_smooth(method = 'lm')

ダウンロード (1).png

ggplot(df2_1, aes(x = p.density, y = c.ratio)) + geom_point() + geom_smooth(method = 'lm')

ダウンロード.png

――Since the text and data are slightly different, the figure is also different, but this is the purpose.

Python

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from scipy import stats
import matplotlib.pyplot as plt

df = pd.read_csv('By prefecture_Population area passenger car.csv')
df.head()
Prefectures Total population Resident area Passenger car
0 Hokkaido 5320000 2237239
1 Aomori Prefecture 1278000 322970
2 Iwate Prefecture 1255000 371401
3 Miyagi Prefecture 2323000 315488
4 Akita 996000 320437
df2 = df[df['Prefectures'].isin(['Tokyo', 'Osaka', 'Kanagawa Prefecture']) == False]
df2.shape
(44, 4)
df2_1 = pd.DataFrame({'c.ratio': df2['Passenger car'] / df2['Total population'], 'p.density': df2['Total population'] / df2['Resident area']})
df2_1.head()
c.ratio p.density
0 0.529709
1 0.574602
2 0.595073
3 0.561992
4 0.594953
x = np.sqrt(df2['p.density'])
x = sm.add_constant(x)
y = df2['c.ratio']

results = smf.ols('c_ratio ~ np.sqrt(p_density)', data=df2_1).fit()
results.summary()
                  OLS Regression Results
Dep. Variable:    c_ratio       R-squared:          0.371
Model:            OLS           Adj. R-squared:     0.356
Method:           Least Squares F-statistic:        24.81
Date:	Sun, 26 Jul 2020        Prob (F-statistic): 1.13e-05
Time:	          10:36:04      Log-Likelihood:     63.679
No. Observations: 44            AIC:                -123.4
Df Residuals:     42            BIC:                -119.8
Df Model:         1		
Covariance Type:	nonrobust		
                   coef    std err t       P>|t| [0.025	 0.975]
Intercept	   0.7421  0.032   23.479  0.000  0.678	 0.806
np.sqrt(p_density) -0.0515 0.010   -4.981  0.000  -0.072 -0.031
Omnibus:       0.516    Durbin-Watson:    1.078
Prob(Omnibus): 0.772    Jarque-Bera (JB): 0.639
Skew:          -0.104   Prob(JB):         0.726
Kurtosis:      2.447    Cond. No.         12.1


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
ax = plt.subplot()
ax.invert_yaxis()
ax.scatter(result.fittedvalues, result.fittedvalues - df2_1['c_ratio'])

ダウンロード (2).png

stats.probplot(stats.zscore(result.fittedvalues - df2['c_ratio']), dist="norm", plot=plt);

ダウンロード (5).png

--Text, a little different from R ...

plt.scatter(result.fittedvalues, np.sqrt(np.abs(stats.zscore(result.fittedvalues - df2['c_ratio']))))

ダウンロード (6).png

sm.graphics.influence_plot(results, size = 1);

ダウンロード (7).png

--The numbers are off, but the plot looks correct.

--The figure in the answer, seaborn, was used.

import seaborn as sns
df_1 = pd.DataFrame({'c_ratio': df['Passenger car'] / df['Total population'], 'p_density': df['Total population'] / df['Resident area']})
sns.regplot(x="p_density", y="c_ratio", data=df_1)

ダウンロード (3).png

sns.regplot(x="p_density", y="c_ratio", data=df2_1)

ダウンロード (4).png

--By prefecture_Population area Passenger car.csv

Prefectures,Total population,Resident area,Passenger car
Hokkaido,5320000,2237239,2818051
Aomori Prefecture,1278000,322970,734341
Iwate Prefecture,1255000,371401,746817
Miyagi Prefecture,2323000,315488,1305508
Akita,996000,320437,592573
Yamagata Prefecture,1102000,288480,697137
Fukushima Prefecture,1882000,421711,1229236
Ibaraki Prefecture,2892000,397512,2000193
Tochigi Prefecture,1957000,298276,1348806
Gunma Prefecture,1960000,227936,1389173
Saitama,7310000,258464,3230250
Chiba,6246000,355433,2836551
Tokyo,13724000,142143,3152966
Kanagawa Prefecture,9159000,147093,3065153
Yamanashi Prefecture,823000,95438,562220
Niigata Prefecture,2267000,453532,1399511
Toyama Prefecture,1056000,184282,713894
Ishikawa Prefecture,1147000,139182,730058
Nagano Prefecture,2076000,322552,1387924
Fukui prefecture,779000,107729,516016
Gifu Prefecture,2008000,221113,1309297
Shizuoka Prefecture,3675000,274948,2239573
Aichi prefecture,7525000,298821,4222263
Mie Prefecture,1800000,205918,1169327
Shiga Prefecture,1413000,130722,812401
Kyoto,2599000,117382,1010972
Osaka,8823000,133058,2805443
Nara Prefecture,1348000,85553,657131
Wakayama Prefecture,945000,111506,548525
Hyogo prefecture,5503000,278293,2331794
Tottori prefecture,565000,90083,348468
Shimane Prefecture,685000,129888,412318
Okayama Prefecture,1907000,221871,1172093
Hiroshima Prefecture,2829000,231109,1473531
Yamaguchi Prefecture,1383000,170697,827588
Tokushima Prefecture,743000,101035,461360
Kagawa Prefecture,967000,100559,597098
Ehime Prefecture,1364000,167326,753155
Kochi Prefecture,714000,116311,401196
Fukuoka Prefecture,5107000,276153,2634631
Saga Prefecture,824000,133561,512857
Nagasaki Prefecture,1354000,167496,706950
Kumamoto Prefecture,1765000,279626,1047274
Oita Prefecture,1152000,179893,700764
Miyazaki prefecture,1089000,184988,684196
Kagoshima prefecture,1626000,331288,965856
Okinawa Prefecture,1443000,116902,881983

Recommended Posts

Solving "Statistical Test Level 1 Corresponding Statistics Practice Workbook" with R, Python-Chapter 17 Regression Diagnostic Method-
Solving "Statistical Test Level 1 Corresponding Statistics Practice Workbook" with R, Python-Chapter 18 Qualitative Regression-