[PYTHON] Data analysis based on the election results of the Tokyo Governor's election (2020)

The Tokyo gubernatorial election has just been held. Whatever you think of the result, watching the vote counts come in stirred my data-loving blood, so I did a quick data analysis on the spur of the moment!

This post also serves as a walkthrough of the whole flow: getting data from the web, processing it with pandas, and running a simple analysis.

**Note: The following is purely a matter of personal interest and an exercise in data analysis; it carries no political intent. I also do not guarantee the accuracy or significance of the data used or of the analysis results.**

0. Summary of analysis

Hypothesis to test

**=> "Does the election result correlate with educational background?"** Sorry, this is a pretty open-ended question... (I remember that research on the correlation between parents' annual income and children's academic ability was a hot topic a while ago.)

Data used

- Vote counts by municipality, from the Asahi Shimbun. (I couldn't find the data in CSV format, so I typed only the top five candidates into Excel by hand. ~~To be honest, this took the longest...~~)
- Number of university graduates by municipality, from the 2010 census. (This figure was not available in the 2015 census, so the data is old, but I use it anyway.)
- Population by municipality, 2020 data. (The number of eligible voters would be ideal, but I use the population for simplicity.)

Analysis flow

I processed the data in the following steps.

1. Read the data with pandas and combine it into one DataFrame
2. Compute the percentage of university graduates and each candidate's vote rate relative to the population, by municipality
3. Cluster the municipalities on the vote-rate data with **k-means**
4. Build a **linear regression model** that predicts each candidate's vote rate, with the university graduation rate as the explanatory variable
5. Visualization

Let's go through them in order. All of the processing below was done in a Google Colab notebook.

1. Read data

Number of votes data

`election.py`


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Number of votes data (self-made)
path = "~~~/xxxx.xlsx" #Path name in Drive
df = pd.read_excel(path)

The result looks like this. I double-checked it, but since I entered it by hand, please forgive any mistakes in the vote counts... (By the way, the counting data from the previous election was released as open data, so I expect this election's results will become easy to obtain before long.)

(Image: Screenshot 2020-07-06 21.21.03.png)

Final educational background data (2010)

`election.py`


path = "~~~/xxxx.csv" #Path name in Drive
edu = pd.read_csv(path,encoding='cp932') #cp932 is needed to decode the Japanese text in the file
#Row 7 holds the real column names; the data starts on the row after it
edu.columns=edu.iloc[7] 
edu = edu[8:]
#Keep only the municipality-level total rows (rows that have no town code)
edu = edu[edu["Town chord code"].isnull()]
#Reset the index so that rows are numbered 0, 1, 2, ... again
edu.reset_index(inplace=True)

`election.py`


#Append the number of graduates (population no longer enrolled in school) and the number of university graduates (including graduate school)

df2 = pd.concat([df,edu["Graduates"],edu["University / Graduate School 2)"]],axis=1)
#The total, male and female columns share the same name, so the selection above yields duplicate columns.
#=>Keep only the first of each (the combined total for men and women) in df2
df2 = df2.loc[:,~df2.columns.duplicated()]

Incidentally, the municipalities of Tokyo are listed in the same order in every source I used, so you can concatenate with axis=1 without worrying about row alignment.
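
If the row order were ever in doubt, joining on the municipality name would be a safer alternative to positional concatenation. The following is a minimal sketch with made-up data; the frame names, column names and numbers are hypothetical and are not the data used in this post.

```python
import pandas as pd

# Hypothetical toy data: both sources cover the same municipalities,
# but list them in a different order.
votes = pd.DataFrame({
    "Municipality": ["A-ku", "B-shi", "C-mura"],
    "votes": [1000, 500, 20],
})
education = pd.DataFrame({
    "Municipality": ["C-mura", "A-ku", "B-shi"],
    "graduates": [50, 4000, 2500],
})

# A positional concat (axis=1) would pair A-ku with C-mura's figures here.
# Merging on the key column aligns rows by name instead;
# validate="one_to_one" raises an error if a name is duplicated on either side.
merged = votes.merge(education, on="Municipality", how="left", validate="one_to_one")
print(merged)
```

With `how="left"`, a municipality that is missing from the second source shows up as NaN instead of silently shifting the rows below it.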

Population data (2020)

`election.py`


path = "https://www.toukei.metro.tokyo.lg.jp/kurasi/2020/csv/ku20rv0440.csv"
population = pd.read_csv(path,encoding='cp932')
#Extract population by city
population = population[8:]["Unnamed: 4"].reset_index()
#Join
df3 = pd.concat([df2,population],axis=1)

Fine-tuning the data

`election.py`


#Change column name
df3.rename(columns={"Unnamed: 0":"Municipality",
        'Graduates': 'graduates', 
        'University / Graduate School 2)': 'university graduation',
        "Unnamed: 4":"population"},
    inplace=True)
#Clear unnecessary index columns
df3.drop("index",axis=1,inplace=True)
#Since it was str type for some reason, it was converted to int type
df3["population"] = df3["population"].astype(int)
df3["graduates"] = df3["graduates"].astype(int)
df3["university graduation"] = df3["university graduation"].astype(int)
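
As for why these columns arrive as strings: one plausible cause (an assumption, not something I verified against the actual files) is that the title and note rows above the real data are read as data rows, so pandas cannot parse the whole column as numbers and keeps every value as text. The sketch below reproduces that with a hypothetical miniature CSV.

```python
import io
import pandas as pd

# Hypothetical miniature of a statistics CSV with a note row above the data.
raw = io.StringIO(
    "title,\n"
    "note,population\n"
    "A-ku,1200\n"
    "B-shi,800\n"
)
toy = pd.read_csv(raw)

# Drop the note row, as the post does with edu[8:] and population[8:].
body = toy.iloc[1:].reset_index(drop=True)
print(type(body.loc[0, "Unnamed: 1"]))  # the values are still text

# Converting after the slice works because only digit strings remain.
population = body["Unnamed: 1"].astype(int)

# A more defensive option: unparseable cells become NaN instead of raising.
safe = pd.to_numeric(body["Unnamed: 1"], errors="coerce")
print(population.tolist(), safe.tolist())
```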

As a result, df3 looks like this:

(Image: Screenshot 2020-07-06 21.57.42.png)

2. Data processing

`election.py`


data = df3.copy()
#Replace the vote counts of the 5 candidates with vote rates (votes / population)
vote_cols = data.columns[1:6]
data[vote_cols] = df3[vote_cols].div(df3["population"], axis=0)
#Add a university graduation rate column (university graduation rate = number of university graduates / number of graduates)
data["university graduation rate"] = data["university graduation"] / data["graduates"]

We have all the necessary data. It's finally time for machine learning.

3. Clustering by k-means method

I use scikit-learn (sklearn) for this.

`election.py`


from sklearn.cluster import KMeans

kmeans = KMeans(init='random', n_clusters=3,random_state=1)

X = data.iloc[:,1:6].values #Vote rate shape=(62,5)
kmeans.fit(X)
y = kmeans.predict(X)  #Cluster number

#Combine clustering results into data
data = pd.concat([data,pd.DataFrame(y,columns=["cluster"])],axis=1)

Now that the municipalities are split into 3 clusters, let's look at their characteristics. (I tried several values for the number of clusters (n_clusters), and 3 seemed about right, so I went with 3.)
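
The choice of 3 above was made by eye. One common way to make that comparison less subjective is the silhouette score. The sketch below is an assumption on my part, not what was done in this post: it uses synthetic data (three artificial blobs in five dimensions) as a stand-in for the vote-rate matrix, and scikit-learn's default k-means++ initialization instead of init='random'.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the (62, 5) vote-rate matrix: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [1.0] * 5, [2.0] * 5])
X_demo = np.vstack([c + 0.05 * rng.standard_normal((20, 5)) for c in centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = km.fit_predict(X_demo)
    # The silhouette score lies in [-1, 1]; higher means tighter,
    # better-separated clusters.
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real data the scores are rarely this clear-cut, so the number is best read together with a look at what each cluster actually contains.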

Let's look at the mean of each column, grouped by cluster.

`election.py`


#numeric_only=True skips the text column (Municipality), which cannot be averaged
data.groupby("cluster").mean(numeric_only=True)

These are only averages, but you can see that the municipalities were split into groups with different characteristics. When I colored the municipalities of each cluster on a map, the breakdown was:

**0. The Yamanote Line area and its surroundings**
**1. The wards on the Chiba Prefecture side and the Tama area, plus some islands (Mikurajima village, Ogasawara village)**
**2. Mountain areas and islands**

I was surprised that a classification like this, which matches common sense, came out of the vote rates alone.

4. Linear regression analysis

I run a linear regression with the percentage of university graduates as the explanatory variable X and each candidate's vote rate as the objective variable Y. The function below handles everything from fitting the model to drawing the plot.

`election.py`


from sklearn.linear_model import LinearRegression

colors=["blue","green","red"] #For color coding of clusters

def graph_show(Jpname,name,sp=False,cluster=True,line=True):
  #Jpname:Candidate's kanji notation
  #name:Romaji notation of candidates (for graphs)

  X = data["university graduation rate"].values.reshape(-1,1)
  Y = data[Jpname].values.reshape(-1,1) 

  model = LinearRegression()
  model.fit(X,Y)

  #model.score returns the coefficient of determination R^2 (not the correlation coefficient)
  print("Coefficient of determination (R^2): {}".format(model.score(X,Y)))
  plt.scatter(X,Y)

  #Color-code the points by the cluster obtained from k-means
  if cluster:
    for i in range(3):
      data_ = data[data["cluster"]==i]
      X_ = data_["university graduation rate"].values.reshape(-1,1)
      Y_ = data_[Jpname].values.reshape(-1,1) 
      plt.scatter(X_,Y_,color=colors[i])

  #Highlight a specific municipality (sp is its name; the default False highlights nothing).
  #Drawn after the cluster colors so that the highlight is not painted over.
  if sp:
    markup = data[data["Municipality"]==sp]
    plt.scatter(markup["university graduation rate"],markup[Jpname],color="red")
      
  #Show regression line
  if line:
    plt.plot(X, model.predict(X), color = 'orange')

  plt.title(name)
  plt.xlabel('university graduation rate')
  plt.ylabel('vote')  
  plt.show()
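
A note on the value printed by graph_show: LinearRegression.score returns the coefficient of determination R², which is not the same thing as the correlation coefficient. For a regression with a single explanatory variable the two are related by R² = r², where r is Pearson's correlation coefficient, so R² alone does not tell you whether the relationship is positive or negative. A small check on made-up numbers (the arrays below are illustrative, not the election data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data with a negative linear trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.6, size=62)
y = 0.4 - 0.3 * x + 0.02 * rng.standard_normal(62)

model = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared = model.score(x.reshape(-1, 1), y)   # coefficient of determination
r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient

print("R^2:", r_squared)
print("r:", r, " r^2:", r ** 2)
print("slope:", model.coef_[0])
```

The sign of the slope (or of r) is what shows the direction of the relationship; R² only measures how much of the variance the line explains.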

5. Visualization

Display the graph for each candidate using the graph_show function defined earlier. (Honorifics are omitted below.)

- The regression line is drawn only when the coefficient of determination exceeds 0.5.
- The option to highlight a specific municipality is not used here.
- Cluster colors are 0: blue, 1: green, 2: red.
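
The 0.5 rule for the regression line can also be applied automatically instead of by hand. A minimal sketch, assuming a hypothetical helper `fit_and_check` (not part of the code above) and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_check(x, y, threshold=0.5):
    """Fit y ~ x and report whether R^2 exceeds the threshold.

    Hypothetical helper for illustration; returns (model, r2, draw_line).
    """
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    Y = np.asarray(y, dtype=float)
    model = LinearRegression().fit(X, Y)
    r2 = model.score(X, Y)
    return model, r2, bool(r2 > threshold)

rng = np.random.default_rng(1)
x = np.linspace(0.1, 0.6, 62)
strong = 0.5 * x + 0.005 * rng.standard_normal(62)   # clear linear trend
weak = 0.2 + 0.05 * rng.standard_normal(62)          # noise only, no trend

for label, y in [("strong", strong), ("weak", weak)]:
    _, r2, draw_line = fit_and_check(x, y)
    print(label, round(r2, 3), "draw line" if draw_line else "skip line")
```

The resulting flag could then be passed as the `line` argument of graph_show.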

Yuriko Koike

The coefficient of determination is not high, but you can see that the clustering works very well.

Kenji Utsunomiya

(Image: Screenshot 2020-07-07 0.07.35.png)

There is a fairly decent correlation.

Taro Yamamoto

(Image: Screenshot 2020-07-07 0.07.41.png)

Taisuke Ono

(Image: Screenshot 2020-07-07 0.07.47.png)

There seems to be a positive correlation here...

Makoto Sakurai

(Image: Screenshot 2020-07-07 0.07.55.png)

Summary

This analysis started from the open-ended question "Are election results related to educational background?", so let me wrap up with a conclusion.

Before that, let's review the (potentially) problematic aspects of this analysis.

- The vote counts were entered by hand (and may contain mistakes)
- The educational background data is old (2010 census)
- The educational background data (2010) and the population data (2020) are not from the same year
- People who are not eligible to vote are not accounted for
- Turnout is not accounted for

For these reasons, as I wrote at the beginning, I cannot guarantee that this analysis is meaningful. With that in mind, here are the conclusions that this analysis at least does not contradict.

- **Election results reflect regional characteristics**
- **For some candidates, the vote rate correlates\* with educational background (the percentage of university graduates)**

(\* Correlation does not necessarily imply causation.) That is about it. Personally, I think these conclusions are not hard to imagine.

There is a lot one could read off about individual candidates, but I will leave that out here.

As mentioned above, I am studying data analysis, so I tried a simple analysis on fresh data. Combining other kinds of data with the data used here would probably reveal more.

On a personal note, entering the vote data by hand was tedious. The data does not have to come from the government; I just wish the news media would at least publish the data they compile as CSV. (I understand that there are various constraints.)