[PYTHON] Drawing a scatter plot with pandas while solving introductory statistics exercise 12.10

Overview

Exercise 12.10 asks you to test the correlation coefficient between the LDP vote rate and the home ownership ratio. The data is quite old (the 1983 general election!), but it is interesting that the LDP seems to do better where more people own their homes.

So, instead of just solving it, I decided to display the graph using pandas and matplotlib.

Environment

I downloaded the necessary libraries from http://www.lfd.uci.edu/~gohlke/pythonlibs/. Installing them with pip install sometimes did not work.

Libraries

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

Data creation

I typed the data from Table 3.13 (p. 65) into a CSV file by hand. Load the finished file (named table_3_13.csv) into a DataFrame as follows.

df = pd.read_csv('table_3_13.csv', encoding='shift-jis')
# If the result looks garbled, check the encoding.
df

The file was read in as shown here.

table_3_13.png
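If you are not sure which encoding a CSV file was saved in, one option is to try a few candidates in order. The following is only a minimal sketch and not part of the original notebook: the helper name `read_csv_any_encoding`, the candidate list, and the in-memory sample data are all assumptions made for illustration.

```python
import io

import pandas as pd


def read_csv_any_encoding(source, encodings=("utf-8", "shift-jis")):
    """Try each candidate encoding in turn; return the first DataFrame that decodes.

    `source` is raw bytes so that the example is self-contained. The helper
    name and the candidate list are illustrative assumptions.
    """
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(source), encoding=enc)
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical two-row sample written as Shift-JIS, standing in for table_3_13.csv.
sample_bytes = "都道府県,x,y\n北海道,1.0,2.0\n青森,3.0,4.0\n".encode("shift-jis")
sample_df = read_csv_any_encoding(sample_bytes)
print(sample_df)
```

UTF-8 is tried first because it is strict and fails on bytes it cannot decode. A more permissive encoding may decode the wrong bytes without raising an error, which is how garbled text slips through.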

Graph creation

A simple graph is displayed as follows.

d = df[0:47]  # Keep only the 47 prefectures, Hokkaido through Okinawa.
plt.xlabel(d.columns[1])
plt.ylabel(d.columns[2])
# Select columns by position with iloc; d[[1]] would look for a column labelled 1.
plt.scatter(d.iloc[:, 1], d.iloc[:, 2])
plt.show()  # Looking at the graph, there seems to be some correlation.

scatter_01.png

There seems to be a correlation between the home ownership ratio and the LDP vote rate.

Add a label to each point so that you can tell which prefecture it represents.

# Label each point with its prefecture name
fig, ax = plt.subplots(figsize=(15, 15))  # The prefecture names are unreadable unless the figure is fairly large.
d.plot(1, 2, kind='scatter', ax=ax)
for k, v in d.iterrows():
    # v.iloc[0] is the prefecture name, v.iloc[1] the LDP vote rate, v.iloc[2] the home ownership ratio.
    ax.annotate(v.iloc[0], xy=(v.iloc[1], v.iloc[2]), size=12)
plt.show()

scatter_02.png

At a glance, you can see that the ratio of homeowners seems to be higher in rural areas.
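The labelling step can also be tried on a small stand-in DataFrame. The sketch below is an illustration under stated assumptions: the frame `toy`, its column names, and its values are hypothetical, and the non-interactive `Agg` backend is chosen only so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data: column 0 = label, column 1 = x, column 2 = y.
toy = pd.DataFrame(
    {
        "prefecture": ["A", "B", "C"],
        "ldp_vote_rate": [40.0, 50.0, 60.0],
        "home_ownership": [55.0, 65.0, 75.0],
    }
)

fig, ax = plt.subplots(figsize=(6, 6))
# iloc selects columns by position, whatever the columns are named.
ax.scatter(toy.iloc[:, 1], toy.iloc[:, 2])
ax.set_xlabel(toy.columns[1])
ax.set_ylabel(toy.columns[2])

# itertuples(index=False) yields plain tuples, so row[0], row[1], row[2] are positional.
for row in toy.itertuples(index=False):
    ax.annotate(row[0], xy=(row[1], row[2]), size=12)
```

In a notebook with `%matplotlib inline`, the figure is displayed as usual and the `matplotlib.use("Agg")` line can be dropped.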

Calculate the correlation coefficient

pandas makes this easy with the corr method. The result was as follows.

d.iloc[:, 1:3].corr()  # numeric columns only; the prefecture-name column cannot be correlated

                              Liberal Democratic Vote Rate  Owned house ratio
Liberal Democratic Vote Rate                      1.000000           0.638782
Owned house ratio                                 0.638782           1.000000
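To see what the corr method computes, the Pearson correlation coefficient can be worked out by hand and compared with the pandas result. This is a minimal sketch on hypothetical stand-in data (the frame `toy` and its values are not taken from Table 3.13).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the values from Table 3.13.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 1.0, 4.0, 3.0, 5.0]})

x = toy["x"].to_numpy()
y = toy["y"].to_numpy()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

r_pandas = toy["x"].corr(toy["y"])  # Series.corr uses the Pearson method by default

print(r_manual, r_pandas)
```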

How certain is the correlation coefficient?

Next, test how reliable the obtained correlation coefficient is. The test used here is based on Fisher's z-transform, which is defined as follows:

Fisher's z-transform

When the population is bivariate normal with population correlation coefficient $\rho$ and the sample correlation coefficient is $r$, transform them as

$$ z=\frac{1}{2}\log\frac{1+r}{1-r}, \qquad \eta=\frac{1}{2}\log\frac{1+\rho}{1-\rho} $$

Then, when the number of data points $n$ is large, the sampling distribution of $z$ is approximately the normal distribution $N(\eta, 1/(n-3))$. Therefore $\sqrt{n-3}\,(z-\eta)$ approximately follows the standard normal distribution $N(0,1)$.
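As a small check of the transform, the sketch below writes it out with the standard library only and compares it with the inverse hyperbolic tangent, which is the same function. The helper name `fisher_z` is an assumption for illustration; the value of `r` is the correlation coefficient reported earlier in this article.

```python
import math


def fisher_z(r):
    """Fisher z-transform: 0.5 * log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


r = 0.638782  # sample correlation coefficient reported earlier in this article
z = fisher_z(r)

# The same transform is the inverse hyperbolic tangent, so math.atanh agrees.
z_atanh = math.atanh(r)

print(z, z_atanh)
```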

I will actually test it.

i) Null hypothesis: population correlation coefficient is 0.0

Set $\rho = 0.0$ and calculate with Python as below.

n = 47  # Number of data points (the 47 prefectures in d)
r = 0.638782
rho = 0.0

z = 0.5 * np.log((1 + r) / (1 - r))
eta = 0.5 * np.log((1 + rho) / (1 - rho))

Z = np.sqrt(n - 3) * (z - eta)
print("Z=", Z)  # Z is approximately 5.015

Since $Z_{0.025} = 1.96$, we have $Z_{0.025} < Z$, so the null hypothesis is rejected. Therefore we cannot say that there is no correlation (significance level 0.05).

ii) Null hypothesis: population correlation coefficient is 0.5

Doing the same as in i) with $\rho = 0.5$ gives the following in Python.

n = 47  # Number of data points (the 47 prefectures in d)
r = 0.638782
rho = 0.5

z = 0.5 * np.log((1 + r) / (1 - r))
eta = 0.5 * np.log((1 + rho) / (1 - rho))

Z = np.sqrt(n - 3) * (z - eta)
print("Z=", Z)  # Z is approximately 1.37

Since the obtained $Z = 1.37$ satisfies $Z_{0.025} = 1.96 > 1.37$, the null hypothesis is not rejected. Therefore the population correlation coefficient may be 0.5 (significance level 0.05).
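The two tests share the same steps, so they can be wrapped in one function. The following is a sketch under stated assumptions rather than the article's own code: the function name `fisher_z_test` and its return layout are hypothetical, the two-sided p-value is an addition, and the inputs (r = 0.6, n = 50) are illustrative values, not the data from Table 3.13. It uses the standard library's `statistics.NormalDist` for the normal distribution.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, n, rho0, alpha=0.05):
    """Two-sided test of H0: rho == rho0 using Fisher's z-transform.

    Returns (Z, p_value, reject). The function name and return layout are
    illustrative assumptions.
    """
    z = math.atanh(r)       # same as 0.5 * log((1 + r) / (1 - r))
    eta = math.atanh(rho0)  # transformed value under the null hypothesis
    Z = math.sqrt(n - 3) * (z - eta)
    p_value = 2 * (1 - NormalDist().cdf(abs(Z)))  # two-sided tail probability
    return Z, p_value, p_value < alpha


# Illustrative inputs only: r = 0.6 computed from n = 50 observations.
Z0, p0, reject0 = fisher_z_test(0.6, 50, 0.0)
Z5, p5, reject5 = fisher_z_test(0.6, 50, 0.5)

print(Z0, p0, reject0)
print(Z5, p5, reject5)
```

Comparing the p-value with the significance level is equivalent to comparing $|Z|$ with the critical value 1.96 when the significance level is 0.05.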

To do

Actually, I wanted to color a map of Japan using geopandas, but I failed to install it on Windows 10. I will try again once I figure out how.

cf) How to find Z for the standard normal distribution in Python

The value of $Z$ at which the cumulative distribution function equals $a$ can be obtained with the following function.

stats.norm.ppf(a)

This time the two-sided significance level is 0.05, so calculate as follows.

stats.norm.ppf(1-0.025) #1.959963984540054

$Z_{0.025} = 1.96$ is well known, and it agrees with the result above.
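The same critical value can be cross-checked without SciPy. The sketch below swaps in the standard library's `statistics.NormalDist.inv_cdf` (available from Python 3.8) in place of `stats.norm.ppf`; the helper name `two_sided_critical_value` is an assumption for illustration.

```python
from statistics import NormalDist


def two_sided_critical_value(alpha):
    """Return z such that P(|Z| > z) = alpha for a standard normal Z."""
    # Each tail holds alpha / 2, so look up the point with area 1 - alpha / 2 below it.
    return NormalDist().inv_cdf(1 - alpha / 2)


z_crit = two_sided_critical_value(0.05)
print(z_crit)
```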
