[PYTHON] Data analysis based on the election results of the Tokyo Governor's election (2020)

The Tokyo gubernatorial election has been held. Setting the outcome aside, looking at the vote counts stirred my data-loving blood, so on the spur of the moment I ran a simple data analysis!

This post also serves as a walkthrough of the whole flow: getting data from the web, processing it with pandas, and running a simple analysis.

**Note: the following is purely a matter of personal interest and an exercise in data analysis; it carries no political intent. I also do not guarantee the accuracy or significance of the data used or of the analysis results.**

0. Summary of analysis

Hypothesis to test

**=> "Do election results correlate with educational background?"** Sorry, this is a pretty rough question... (I remember that research on the correlation between parents' annual income and children's academic performance was a hot topic a while ago.)

Data used

- Vote counts by municipality (source: Asahi Shimbun). I could not find the data in csv format, so I typed only the top five candidates into Excel by hand. ~~To be honest, this took the longest...~~
- Number of university graduates by municipality (from the 2010 census; the 2015 census did not include this data, so it is old, but I use it anyway)
- Population by municipality (2020 data; the number of eligible voters would be ideal, but I use population for simplicity)

Analysis flow

The analysis proceeded as follows.

  1. Read the data with pandas and combine it into one DataFrame
  2. For each municipality, compute the share of university graduates and each candidate's votes as a share of the population
  3. Cluster the municipalities on the vote-share data with **k-means**
  4. Build a **linear regression model** that predicts each candidate's vote share, using the university graduation rate as the explanatory variable
  5. Visualize the results

Let's go through these in order. All of the processing below was done in a Google Colab notebook.

1. Read data

Number of votes data

election.py


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Number of votes data (self-made)
path = "~~~/xxxx.xlsx" #Path name in Drive
df = pd.read_excel(path)

It looks something like this. I double-checked it, but since I entered it by hand, please forgive any mistakes in the vote counts... (By the way, the counting data for the previous election was released as open data, so I expect this election's results to become easily available after a while.)

[Image: Screenshot 2020-07-06 21.21.03.png]

Final educational background data (2010)

election.py


path = "~~~/xxxx.csv" #Path name in Drive
edu = pd.read_csv(path,encoding='cp932') #cp932 encoding, because the file contains Japanese text
#Row 7 holds the column names; use it as the header and keep only the rows below it
edu.columns=edu.iloc[7]
edu = edu[8:]
#Keep only the municipality-level total rows
edu = edu[edu["Town chord code"].isnull()]
#Reset the index
edu.reset_index(inplace=True)

election.py


#Combine the number of graduates (population not enrolled) and the number of university graduates (including graduate school)

df2 = pd.concat([df,edu["Graduates"],edu["University / Graduate School 2)"]],axis=1)
#The column names for men and women were the same, so duplicate columns were deleted.
#=>Leave only the total number of men and women in df2
df2 = df2.loc[:,~df2.columns.duplicated()]

By the way, the municipalities of Tokyo are listed in the same order in every source used here, so the rows line up and the tables can be concatenated with axis=1 without worrying about mismatched rows.
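If the ordering ever differed between sources, a positional concat would silently pair the wrong rows. Below is a minimal sketch of a key-based alternative, assuming both tables carry a municipality-name column. The tiny tables, names, and numbers are made up for illustration and are not the article's data.

```python
import pandas as pd

# Hypothetical miniature tables; names and numbers are invented.
votes = pd.DataFrame({"Municipality": ["A-ku", "B-shi", "C-mura"],
                      "votes": [100, 50, 5]})
census = pd.DataFrame({"Municipality": ["A-ku", "B-shi", "C-mura"],
                       "graduates": [1000, 400, 30]})

# Check that both tables list the municipalities in the same order
# before relying on a positional concat with axis=1.
same_order = votes["Municipality"].tolist() == census["Municipality"].tolist()

# Key-based join: validate="one_to_one" raises if a key is duplicated.
# An inner join drops unmatched rows, so compare lengths afterwards.
merged = votes.merge(census, on="Municipality", how="inner",
                     validate="one_to_one")
all_matched = len(merged) == len(votes)

print(same_order, all_matched)
```

When the order check passes, the positional concat used in this article gives the same pairing as the merge.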

Population data (2020)

election.py


path = "https://www.toukei.metro.tokyo.lg.jp/kurasi/2020/csv/ku20rv0440.csv"
population = pd.read_csv(path,encoding='cp932')
#Extract population by city
population = population[8:]["Unnamed: 4"].reset_index()
#Join
df3 = pd.concat([df2,population],axis=1)

Fine-tuning the data

election.py


#Change column name
df3.rename(columns={"Unnamed: 0":"Municipality",
        'Graduates': 'graduates', 
        'University / Graduate School 2)': 'university graduation',
        "Unnamed: 4":"population"},
    inplace=True)
#Clear unnecessary index columns
df3.drop("index",axis=1,inplace=True)
#Since it was str type for some reason, it was converted to int type
df3["population"] = df3["population"].astype(int)
df3["graduates"] = df3["graduates"].astype(int)
df3["university graduation"] = df3["university graduation"].astype(int)
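
A likely reason the columns arrive as str: when a CSV has explanatory lines above the real header and the whole file is read first, the text rows make pandas treat every column as text. Below is a small sketch with a made-up CSV (not the actual census file) that shows this, and how skipping the preamble at read time yields numeric columns directly.

```python
import io
import pandas as pd

# Made-up CSV: two explanatory lines sit above the real header row.
raw = "Census table\nunit: persons\nname,graduates\nA-ku,1000\nB-shi,400\n"

# (1) Read everything, then promote a row to the header, as the article does.
#     names=[...] is only needed because the made-up preamble lines are ragged.
late = pd.read_csv(io.StringIO(raw), header=None, names=["c0", "c1"])
late.columns = late.iloc[2]
late = late[3:].reset_index(drop=True)
print(late["graduates"].tolist())   # values are strings, so astype(int) is needed

# (2) Skip the preamble while reading; pandas then infers integers itself.
early = pd.read_csv(io.StringIO(raw), skiprows=2)
print(early["graduates"].dtype)
```

Either route works. The article's approach just needs the explicit astype(int) afterwards.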

As a result, df3 looks like this:

[Image: Screenshot 2020-07-06 21.57.42.png]

2. Data processing

election.py


data = df3.copy()
#Convert the vote counts (columns 1-5) to votes per capita
#Assigning by column name replaces the int columns instead of writing floats into them in place
vote_cols = data.columns[1:6]
data[vote_cols] = df3[vote_cols].div(df3["population"], axis=0)
#Add a column for the university graduation rate (= number of university graduates / number of graduates)
data["university graduation rate"] = data["university graduation"] / data["graduates"]

[Image: Screenshot 2020-07-06 22.04.10.png]

We now have all the data we need. It's finally time for machine learning.

3. Clustering by k-means method

Use sklearn.

election.py


from sklearn.cluster import KMeans

kmeans = KMeans(init='random', n_clusters=3,random_state=1)

X = data.iloc[:,1:6].values #Vote rate shape=(62,5)
kmeans.fit(X)
y = kmeans.predict(X)  #Cluster number

#Combine clustering results into data
data = pd.concat([data,pd.DataFrame(y,columns=["cluster"])],axis=1)

Now that the municipalities are divided into three clusters, let's look at their characteristics. (I tried other values of n_clusters, but about three seemed right, so I went with 3.)

Let's look at the mean of each column, grouped by cluster.
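
One way to back up the choice of n_clusters is to compare the k-means inertia and the silhouette score over several values. Below is a rough sketch on synthetic data standing in for the vote-share matrix. The centers and noise level are invented, and `silhouette_score` comes from scikit-learn's metrics module.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the vote-share matrix: three loose groups (invented values).
rng = np.random.default_rng(0)
centers = np.array([[0.30, 0.08, 0.06, 0.05, 0.02],
                    [0.35, 0.05, 0.05, 0.03, 0.01],
                    [0.45, 0.03, 0.03, 0.01, 0.01]])
X = np.vstack([c + rng.normal(0, 0.005, size=(20, 5)) for c in centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=1).fit(X)
    # inertia_: within-cluster sum of squares (lower is tighter)
    # silhouette: between -1 and 1 (higher means better separated clusters)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(k, round(scores[k][0], 5), round(scores[k][1], 3))
```

Inertia tends to keep falling as k grows, so the usual reading is to look for the point where the drop levels off, alongside a high silhouette score.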

election.py


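#numeric_only=True skips the text column (Municipality), which recent pandas versions no longer drop automatically
data.groupby("cluster").mean(numeric_only=True)

[Image: Screenshot 2020-07-06 22.15.53.png]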

These are only averages, but you can see that the municipalities were split into groups with different characteristics. When I colored the municipalities in each cluster on a map, the breakdown was:

  0. **The Yamanote Line area and its surroundings**
  1. **Wards on the Chiba Prefecture side, the Tama area, and some islands (Mikurajima Village, Ogasawara Village)**
  2. **Mountain areas and islands**

I was surprised that vote shares alone produced a classification that matches common sense this well.
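
To see which municipalities fall into each cluster before drawing a map, the members can be listed per cluster. Below is a minimal sketch with a made-up stand-in for `data`. Only the column names Municipality and cluster are taken from this article; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the article's `data` table.
data = pd.DataFrame({
    "Municipality": ["A-ku", "B-ku", "C-shi", "D-mura"],
    "cluster": [0, 0, 1, 2],
})

# Collect the municipality names belonging to each cluster label.
members = data.groupby("cluster")["Municipality"].apply(list)
print(members.to_dict())
```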

4. Linear regression analysis

I fit a linear regression with the university graduation rate as the explanatory variable X and each candidate's vote share as the target variable Y. The function below handles everything from fitting through visualization.

election.py


from sklearn.linear_model import LinearRegression

colors=["blue","green","red"] #For color coding of clusters

def graph_show(Jpname,name,sp=False,cluster=True,line=True):
  #Jpname:Candidate's kanji notation
  #name:Romaji notation of candidates (for graphs)

  X = data["university graduation rate"].values.reshape(-1,1)
  Y = data[Jpname].values.reshape(-1,1) 

  model = LinearRegression()
  model.fit(X,Y)

  print("Coefficient of determination (R^2): {}".format(model.score(X,Y)))
  plt.scatter(X,Y)

  #Emphasize specific municipalities in the graph (default is False)
  if sp:
    markup = data[data["Municipality"]==sp]
    plt.scatter(markup["university graduation rate"],markup[Jpname],color="red")

  #Color-code the points by the cluster obtained with k-means
  if cluster:
    for i in range(3):
      data_ = data[data["cluster"]==i]
      X_ = data_["university graduation rate"].values.reshape(-1,1)
      Y_ = data_[Jpname].values.reshape(-1,1) 
      plt.scatter(X_,Y_,color=colors[i])
      
  #Show regression line
  if line:
    plt.plot(X, model.predict(X), color = 'orange')

  plt.title(name)
  plt.xlabel('university graduation rate')
  plt.ylabel('vote')  
  plt.show()
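
A note on the printed score: `model.score` returns the coefficient of determination R^2. For a single explanatory variable with an intercept, this equals the square of the Pearson correlation coefficient, not the coefficient itself. Below is a quick check on synthetic data (values invented for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.5, size=62)              # stand-in for the university graduation rate
y = 0.2 * x + 0.05 + rng.normal(0, 0.01, 62)    # stand-in for one candidate's vote share

model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)   # coefficient of determination
r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient

print(r2, r ** 2)   # the two values agree up to floating-point error
```

Because R^2 drops the sign, the direction of the relationship has to be read from the slope (`model.coef_`) or from r itself.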

5. Visualization

Display the graph for each candidate using the graph_show function defined earlier. (Honorific titles are omitted below.)

Yuriko Koike

[Image: Screenshot 2020-07-07 0.07.28.png]

The coefficient of determination is not high, but you can see that the clustering is working very well.

Kenji Utsunomiya

[Image: Screenshot 2020-07-07 0.07.35.png]

It's a decent correlation.

Taro Yamamoto

[Image: Screenshot 2020-07-07 0.07.41.png]

Taisuke Ono

[Image: Screenshot 2020-07-07 0.07.47.png]

This seems to have a positive correlation...

Makoto Sakurai

[Image: Screenshot 2020-07-07 0.07.55.png]

Summary

This analysis started from the rough question "Are election results related to educational background?", and I would like to close with a conclusion.

Before that, let's review the (potentially) questionable parts of this analysis.

- The vote counts were entered by hand (and may contain errors)
- The educational background data is old (2010 census)
- The educational background data and the population data (2020) are not from the same year
- People who are not eligible to vote are not accounted for
- Turnout is not accounted for

For these reasons, as I wrote at the beginning, I cannot guarantee that this analysis is meaningful. With that in mind, here are the conclusions that this analysis at least does not contradict.

- **Election results reflect regional characteristics**
- **For some candidates, the vote share is correlated (*) with educational background (the share of university graduates)**

(* Correlation does not necessarily imply causation.)

That is about it. Personally, I think these conclusions are not hard to imagine.

There is a lot one could say about individual candidates, but I will leave that out here.


As mentioned above, I am studying data analysis, so I tried a simple analysis on fresh data. Combining it with other data beyond what I used here would probably reveal more.

As a personal impression, entering the vote data by hand was a chore. It does not have to come from the government; I wish that at least the news media would publish the data they compile as csv. (I understand there are various constraints.)
