Overview

Exercise 12.10 asks us to test the correlation coefficient between the LDP (Liberal Democratic Party) vote rate and the home ownership ratio. The data is quite old (the 1983 general election!), but it is interesting that the LDP seems to do better where the home ownership ratio is higher.

So, instead of just solving it, I decided to display the graph using pandas and matplotlib.

Environment

Windows 10 64bit
Python 3.5
Jupyter Notebook

I downloaded the necessary libraries from http://www.lfd.uci.edu/~gohlke/pythonlibs/, because installing them with pip install sometimes failed.

Library

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

Data creation

I entered the data from Table 3.13 (p. 65) into a CSV file by hand. Load the finished CSV file (named table_3_13.csv) into a DataFrame as follows.

df = pd.read_csv('table_3_13.csv', encoding='shift-jis')
# If the result looks garbled, check the encoding.
df

I was able to read it like this.
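
If you are not sure which encoding the CSV was saved in, one option is to try a few candidates in turn. The following is only a minimal sketch: `read_csv_with_fallback` is a hypothetical helper (not part of pandas), the list of candidate encodings is an assumption, and the tiny sample file contains made-up values rather than the Table 3.13 data.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "shift-jis", "cp932")):
    """Try each encoding in turn and return (DataFrame, encoding_used).

    Hypothetical helper for illustration; not part of pandas.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # This encoding cannot decode the file, so try the next one.
            last_error = err
    raise last_error


# Illustrative data only: a tiny Shift-JIS file with made-up values.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_sjis.csv")
with open(sample_path, "w", encoding="shift-jis", newline="") as f:
    f.write("都道府県,自民得票率,持ち家比率\n北海道,40.0,50.0\n沖縄,30.0,60.0\n")

sample_df, used_encoding = read_csv_with_fallback(sample_path)
print(used_encoding)
print(sample_df)
```

The order of the candidates matters: a file can sometimes be decoded without error by the wrong encoding, so it is still worth looking at the result by eye.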

Graph creation

A simple graph is displayed as follows.

d = df[0:47]  # Keep only the rows from Hokkaido to Okinawa.
# Looking at the graph, there seems to be some correlation.
plt.xlabel(d.columns[1])
plt.ylabel(d.columns[2])
plt.scatter(d.iloc[:, 1], d.iloc[:, 2])  # Select columns by position with iloc.
plt.show()

There seems to be a correlation between the home ownership ratio and the LDP vote rate.

Label each point on the graph with its prefecture name so that you can tell which prefecture it is.

# Label each point with its prefecture name.
fig, ax = plt.subplots(figsize=(15, 15))  # The prefecture names are unreadable unless the graph is fairly large.
df.plot(1, 2, kind='scatter', ax=ax)
for k, v in df.iterrows():
    # v.iloc[0] is the prefecture name, v.iloc[1] the LDP vote rate, v.iloc[2] the home ownership ratio.
    ax.annotate(v.iloc[0], xy=(v.iloc[1], v.iloc[2]), size=12)
plt.show()

At a glance, you can see that the ratio of homeowners seems to be higher in rural areas.

Calculate the correlation coefficient

With pandas, the correlation coefficient is easy to calculate using the corr method. The result was as follows.

d.corr()

	Liberal Democratic Vote Rate	Owned house ratio
Liberal Democratic Vote Rate	1.000000	0.638782
Owned house ratio	0.638782	1.000000
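
To see what corr is computing, the Pearson correlation coefficient can also be written out by hand and compared with the pandas result. This is a minimal sketch: `pearson_r` is a hypothetical helper name, and the five data points are made-up illustrative values, not the Table 3.13 data.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not the Table 3.13 data).
sample = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # Deviations from the mean of x
    dy = y - y.mean()  # Deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


r_manual = pearson_r(sample["x"], sample["y"])
r_pandas = sample["x"].corr(sample["y"])  # Series.corr defaults to Pearson
print(r_manual, r_pandas)
```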

How certain is the correlation coefficient?

Next, we test how reliable the obtained correlation coefficient is. Here, we use Fisher's z-transform to test the correlation coefficient. Fisher's z-transform is as follows:

Fisher's z-transform

Suppose the population is bivariate normal, with population correlation coefficient $\rho$ and sample correlation coefficient $r$. Transform them as

$$ z=\frac{1}{2}\log\frac{1+r}{1-r}, \qquad \eta=\frac{1}{2}\log\frac{1+\rho}{1-\rho} $$

Then, when the number of data points $n$ is large, the sampling distribution of $z$ is approximately the normal distribution $N(\eta, 1/(n-3))$. Therefore, $\sqrt{n-3}\,(z-\eta)$ approximately follows the standard normal distribution $N(0,1)$.
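
The transform and the resulting test statistic can be wrapped in a small function. This is only a sketch under stated assumptions: `fisher_z_test` is a hypothetical helper name, the two-sided p-value is an addition that the textbook procedure above does not use, and the inputs in the usage line (r = 0.6, n = 50) are made-up illustrative numbers. It relies on the fact that `np.arctanh(r)` equals $\frac{1}{2}\log\frac{1+r}{1-r}$.

```python
import numpy as np
from scipy import stats


def fisher_z_test(r, n, rho0=0.0):
    """Return (Z, two-sided p-value) for the null hypothesis rho = rho0.

    Hypothetical helper for illustration.
    """
    z = np.arctanh(r)       # Fisher's z-transform of the sample correlation
    eta = np.arctanh(rho0)  # Fisher's z-transform of the hypothesized correlation
    Z = np.sqrt(n - 3) * (z - eta)  # Approximately N(0, 1) under the null hypothesis
    p_value = 2 * stats.norm.sf(abs(Z))  # Two-sided tail probability
    return float(Z), float(p_value)


# Illustrative inputs only.
Z_example, p_example = fisher_z_test(0.6, 50, rho0=0.0)
print("Z=", Z_example, "p=", p_example)
```

Comparing `abs(Z)` with 1.96 and comparing the p-value with 0.05 lead to the same decision at the two-sided 0.05 level.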

I will actually test it.

i) Null hypothesis: population correlation coefficient is 0.0

Set $\rho = 0.0$ and calculate in Python as follows.

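# Number of data points used in this calculation.
# Note: d = df[0:47] holds 47 rows; n = 47 gives a slightly smaller Z and the same conclusion.
n = 48
r = 0.638782  # Sample correlation coefficient from d.corr()
rho = 0.0  # Population correlation coefficient under the null hypothesis

z = 0.5 * np.log((1 + r) / (1 - r))  # Fisher's z-transform of r
eta = 0.5 * np.log((1 + rho) / (1 - rho))  # Fisher's z-transform of rho

Z = np.sqrt(n - 3) * (z - eta)  # Test statistic, approximately N(0, 1)
print("Z=", Z)  # Z= 5.07216324479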

On the other hand, since $Z_{0.025} = 1.96$, we clearly have $Z_{0.025} < Z$, so the null hypothesis is rejected. Therefore, we cannot say that there is no correlation (significance level 0.05).

ii) Null hypothesis: population correlation coefficient is 0.5

Writing it in Python in the same way as i), with $\rho = 0.5$, gives the following.

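n = 48  # Number of data points (same value as in i))
r = 0.638782  # Sample correlation coefficient from d.corr()
rho = 0.5  # Population correlation coefficient under the null hypothesis

z = 0.5 * np.log((1 + r) / (1 - r))  # Fisher's z-transform of r
eta = 0.5 * np.log((1 + rho) / (1 - rho))  # Fisher's z-transform of rho

Z = np.sqrt(n - 3) * (z - eta)  # Test statistic, approximately N(0, 1)
print("Z=", Z)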

For the obtained $Z = 1.39$, the null hypothesis is not rejected because $Z_{0.025} = 1.96 > 1.39$. Therefore, the population correlation coefficient may be 0.5 (significance level 0.05).
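
The two tests can also be summarized as a confidence interval for the population correlation coefficient, by building an interval for $\eta$ and transforming it back with tanh. This goes slightly beyond the exercise and is only a sketch: `fisher_confidence_interval` is a hypothetical helper name, and it uses the same r and n as the calculations above.

```python
import numpy as np
from scipy import stats


def fisher_confidence_interval(r, n, confidence=0.95):
    """Approximate confidence interval for rho based on Fisher's z-transform.

    Hypothetical helper for illustration.
    """
    z = np.arctanh(r)  # Fisher's z-transform of the sample correlation
    # Half-width of the interval on the z scale: critical value / sqrt(n - 3)
    half_width = stats.norm.ppf(1 - (1 - confidence) / 2) / np.sqrt(n - 3)
    # tanh is the inverse of the z-transform, so it maps the interval back to the rho scale.
    return float(np.tanh(z - half_width)), float(np.tanh(z + half_width))


lower, upper = fisher_confidence_interval(0.638782, 48)
print(lower, upper)
```

The interval contains 0.5 but not 0, which agrees with the outcomes of tests i) and ii).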

Task

Actually, I wanted to color a map of Japan using geopandas, but I failed to install it on Windows 10. I will try again once I figure out how.

cf) How to find Z for the standard normal distribution in Python?

The value of $Z$ at which the cumulative distribution function equals $a$ can be obtained with the following function.

stats.norm.ppf(a)

This time, the two-sided significance level is 0.05, so calculate as follows.

stats.norm.ppf(1-0.025) #1.959963984540054

$Z_{0.025} = 1.96$ is well known, and it agrees with the above result.
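
The same idea works for other significance levels, and the reverse direction (from a Z value to a two-sided p-value) uses the survival function `stats.norm.sf`. The helper names `two_sided_critical_value` and `two_sided_p_value` below are hypothetical and only there for illustration.

```python
from scipy import stats


def two_sided_critical_value(alpha):
    """Z value that leaves alpha / 2 in each tail of the standard normal distribution."""
    return float(stats.norm.ppf(1 - alpha / 2))


def two_sided_p_value(z_stat):
    """Probability that a standard normal variable is at least as extreme as z_stat."""
    return float(2 * stats.norm.sf(abs(z_stat)))


for alpha in (0.10, 0.05, 0.01):
    print(alpha, two_sided_critical_value(alpha))

# p-value for the rounded Z value used in test ii)
print(two_sided_p_value(1.39))
```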

[PYTHON] While solving the introductory statistics exercise 12.10, check how to draw a scatter plot in pandas.