The Tokyo gubernatorial election was held recently. Whatever you think of the outcome, watching the vote counts come in stirred my data-loving blood, so I ran a quick data analysis on the spur of the moment!
I hope this also works as a summary of the workflow from collecting data online to processing it with pandas and running a simple analysis.
**Note: The following is purely a matter of personal interest and an exercise in data analysis; it carries no political intent. I also make no guarantee about the accuracy or significance of the data used or of the analysis results.**
**=> "Do the election results correlate with educational background?"** Sorry, that is a pretty rough, open-ended question... (I remember that research on the correlation between parents' annual income and children's academic performance made the news a while ago.)
- Vote counts by municipality (source: Asahi Shimbun; I couldn't find the data in CSV format, so I entered the top five candidates into Excel by hand. ~~To be honest, this took the longest...~~)
- Number of university graduates by municipality (from the 2010 census; this data was not available in the 2015 census, so it is old, but I use it anyway)
- Population by municipality (the number of eligible voters would be ideal, but I use population for simplicity; 2020 data)
It was processed according to the following flow.
Let's go through the steps in order. All of the processing below was done in a Google Colab notebook.
election.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Number of votes data (self-made)
path = "~~~/xxxx.xlsx" #Path name in Drive
df = pd.read_excel(path)
It looks something like this. I double-checked it, but since I entered it by hand, please forgive any mistakes in the vote counts... (By the way, the counting data from the previous election was published as open data, so I expect this election's results will become easy to obtain after a while.)
election.py
path = "~~~/xxxx.csv" #Path name in Drive
edu = pd.read_csv(path,encoding='cp932') #cp932 (Shift_JIS) encoding, needed to read this Japanese CSV
#Use row 7 as the column names and keep only the rows below it
edu.columns=edu.iloc[7]
edu = edu[8:]
#Keep only the municipality-level totals (rows without a town code)
edu = edu[edu["Town chord code"].isnull()]
#Reset the index so that rows are numbered from 0 again
edu.reset_index(inplace=True)
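The header handling above is easier to follow on a toy table. The following is a minimal sketch, not the census file itself: the frame `raw`, its column names, and the header position are all made up for illustration. It promotes a row to the header, keeps only the rows without a sub-area code, and resets the index.

```python
import pandas as pd

# Toy stand-in for a census CSV: two preamble rows, then the real header row.
# All names and values here are hypothetical.
raw = pd.DataFrame([
    ["Census table", None, None],
    ["(note)", None, None],
    ["area", "town_code", "graduates"],
    ["Ward A", None, "1000"],
    ["Ward A, block 1", "001", "400"],
    ["Ward B", None, "2000"],
])

header_row = 2  # position of the row that holds the column names
table = raw.iloc[header_row + 1:].copy()
table.columns = raw.iloc[header_row].tolist()

# Keep only municipality totals: rows that have no sub-area code.
totals = table[table["town_code"].isna()].reset_index(drop=True)
print(totals)
```

Passing `drop=True` to `reset_index` discards the old index instead of keeping it as an extra `index` column, which saves a cleanup step later.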
election.py
#Combine the number of graduates (population not enrolled) and the number of university graduates (including graduate school)
df2 = pd.concat([df,edu["Graduates"],edu["University / Graduate School 2)"]],axis=1)
#The column names for men and women were the same, so duplicate columns were deleted.
#=>Leave only the total number of men and women in df2
df2 = df2.loc[:,~df2.columns.duplicated()]
By the way, the municipalities of Tokyo are listed in the same order in every source, so the rows line up and you can concatenate with axis=1 without worrying about mismatches.
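Positional concatenation only works while every source lists the municipalities in the same order. If that could not be relied on, a key-based merge would be a more defensive alternative. This is a sketch on made-up frames, and it assumes the municipality names are spelled identically in both sources, which may not hold for the real files.

```python
import pandas as pd

# Hypothetical vote table.
votes = pd.DataFrame({
    "Municipality": ["Ward A", "Ward B", "Ward C"],
    "candidate_1": [100, 200, 300],
})
# Same municipalities, deliberately listed in a different order.
education = pd.DataFrame({
    "Municipality": ["Ward C", "Ward A", "Ward B"],
    "graduates": [3000, 1000, 2000],
})

# validate="one_to_one" raises an error if a municipality appears twice on either side.
merged = votes.merge(education, on="Municipality", how="left", validate="one_to_one")

# Any NaN here would mean a municipality had no match in the education table.
print(merged["graduates"].isna().sum())
print(merged)
```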
election.py
path = "https://www.toukei.metro.tokyo.lg.jp/kurasi/2020/csv/ku20rv0440.csv"
population = pd.read_csv(path,encoding='cp932')
#Extract population by city
population = population[8:]["Unnamed: 4"].reset_index()
#Join
df3 = pd.concat([df2,population],axis=1)
election.py
#Change column name
df3.rename(columns={"Unnamed: 0":"Municipality",
                    'Graduates': 'graduates',
                    'University / Graduate School 2)': 'university graduation',
                    "Unnamed: 4":"population"},
           inplace=True)
#Clear unnecessary index columns
df3.drop("index",axis=1,inplace=True)
#Since it was str type for some reason, it was converted to int type
df3["population"] = df3["population"].astype(int)
df3["graduates"] = df3["graduates"].astype(int)
df3["university graduation"] = df3["university graduation"].astype(int)
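The columns were probably read as strings because the rows above the real header were loaded as data. `astype(int)` works when every cell is a clean digit string, but it raises an error on thousands separators or placeholder symbols. A more tolerant conversion could look like the sketch below; the sample values are invented, and I have not checked whether the real files contain such cells.

```python
import pandas as pd

# Hypothetical cells as they might come out of a CSV: digit strings, some with
# thousands separators, and one placeholder for a missing value.
s = pd.Series(["1,234", "56,789", "-", "42"])

# Strip separators, then turn anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(s.str.replace(",", "", regex=False), errors="coerce")

n_unparsed = int(cleaned.isna().sum())  # cells that need a manual look
values = cleaned.dropna().astype(int)
print(n_unparsed, values.tolist())
```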
As a result, df3 looks like this:
election.py
data = df3.copy()
#Convert vote counts to vote shares by dividing by each municipality's population
vote_cols = data.columns[1:6]
data[vote_cols] = df3[vote_cols].div(df3["population"], axis=0)
#Add a university graduation rate column (rate = number of university graduates / number of graduates)
data["university graduation rate"] = data["university graduation"] / data["graduates"]
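The row-wise division can be checked on a tiny example. This sketch uses made-up numbers and hypothetical column names; it shows that dividing an (n, k) array by an (n, 1) array through NumPy broadcasting gives the same result as `DataFrame.div` with `axis=0`.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts and populations for two municipalities.
demo = pd.DataFrame({
    "cand_1": [100, 300],
    "cand_2": [50, 100],
    "population": [1000, 2000],
})
cols = ["cand_1", "cand_2"]

# NumPy broadcasting: (n, 2) / (n, 1) divides each row by that row's population.
by_numpy = demo[cols].to_numpy() / demo["population"].to_numpy().reshape(-1, 1)

# pandas equivalent, aligned on the row index; no hard-coded row count needed.
by_pandas = demo[cols].div(demo["population"], axis=0)
print(by_pandas)
```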
We now have all the data we need.
It's finally time for machine learning.
I use scikit-learn's k-means to cluster the municipalities by vote share.
election.py
from sklearn.cluster import KMeans
kmeans = KMeans(init='random', n_clusters=3,random_state=1)
X = data.iloc[:,1:6].values #Vote rate shape=(62,5)
kmeans.fit(X)
y = kmeans.predict(X) #Cluster number
#Combine clustering results into data
data = pd.concat([data,pd.DataFrame(y,columns=["cluster"])],axis=1)
Now that the municipalities are divided into 3 clusters, let's look at their characteristics. (I also tried other values of n_clusters, but 3 seemed to work best, so I went with 3.)
Let's look at the mean of each column, grouped by cluster.
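One way to make the choice of n_clusters less subjective is to compare silhouette scores across several values. This is not what was done above, only a sketch of the idea. The matrix `X_demo` is synthetic data standing in for the (62, 5) vote-share matrix, so the scores it prints say nothing about the real election data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the vote-share matrix: three tight groups of 20 rows.
rng = np.random.default_rng(0)
centers = np.array([
    [0.30, 0.10, 0.05, 0.05, 0.02],
    [0.20, 0.15, 0.10, 0.05, 0.02],
    [0.40, 0.05, 0.03, 0.02, 0.01],
])
X_demo = np.vstack([c + rng.normal(scale=0.005, size=(20, 5)) for c in centers])

# Fit k-means for several cluster counts and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = km.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)  # the k with the highest score
print(scores)
print(best_k)
```

A higher silhouette score means the points sit closer to their own cluster than to the neighboring one. With only 62 municipalities the scores can be noisy, so it is still worth checking the result against a map.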
election.py
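#numeric_only=True skips the string column (Municipality), which recent pandas versions cannot average
data.groupby("cluster").mean(numeric_only=True)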
These are only averages, but you can see that the municipalities were split into groups with different characteristics. When I colored the municipalities belonging to each cluster on a map, the breakdown was as follows:
**0. Yamanote Line area and its surroundings**
I was surprised that the vote shares alone produced a classification that matches common sense so well.
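When reading per-cluster averages, it helps to know how many municipalities each cluster contains, since a mean over two or three rows is much less stable than a mean over thirty. Here is a small sketch on an invented table; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical clustered table with one non-numeric column.
demo = pd.DataFrame({
    "Municipality": ["A", "B", "C", "D"],
    "vote_share": [0.30, 0.32, 0.20, 0.22],
    "university graduation rate": [0.40, 0.42, 0.20, 0.18],
    "cluster": [0, 0, 1, 1],
})

# Per-cluster means of the numeric columns, plus the size of each cluster.
summary = demo.groupby("cluster").mean(numeric_only=True)
summary["n_municipalities"] = demo.groupby("cluster").size()
print(summary)
```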
Next, I run a linear regression with the percentage of university graduates as the explanatory variable X and each candidate's vote share as the objective variable Y. The function below covers everything up to the visualization.
election.py
from sklearn.linear_model import LinearRegression
colors=["blue","green","red"] #One color per cluster
def graph_show(Jpname,name,sp=False,cluster=True,line=True):
    #Jpname: candidate's name in kanji (the column name in data)
    #name: candidate's name in romaji (used as the graph title)
    #sp: name of a municipality to highlight, or False for none
    X = data["university graduation rate"].values.reshape(-1,1)
    Y = data[Jpname].values.reshape(-1,1)
    model = LinearRegression()
    model.fit(X,Y)
    #score() returns the coefficient of determination R^2, not the correlation coefficient
    print("Coefficient of determination (R^2): {}".format(model.score(X,Y)))
    plt.scatter(X,Y)
    #Highlight a specific municipality in the graph (off by default)
    if sp:
        markup = data[data["Municipality"]==sp]
        plt.scatter(markup["university graduation rate"],markup[Jpname],color="red")
    #Color-code the points by the cluster obtained from k-means
    if cluster:
        for i in range(3):
            data_ = data[data["cluster"]==i]
            X_ = data_["university graduation rate"].values.reshape(-1,1)
            Y_ = data_[Jpname].values.reshape(-1,1)
            plt.scatter(X_,Y_,color=colors[i])
    #Show the regression line
    if line:
        plt.plot(X, model.predict(X), color = 'orange')
    plt.title(name)
    plt.xlabel('university graduation rate')
    plt.ylabel('vote')
    plt.show()
Display each candidate's graph using the graph_show function defined above. (Please excuse the abbreviated titles below.)
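A note on the printed value: `LinearRegression.score` returns the coefficient of determination R^2. For a regression with a single explanatory variable and an intercept, R^2 equals the square of the Pearson correlation coefficient r, so the sign of the relationship has to be read from r or from the slope. The sketch below checks this on synthetic data; the arrays `x` and `y` are randomly generated and are not the election data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 62 points with a positive linear trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.5, size=62)              # stand-in for graduation rate
y = 0.3 * x + rng.normal(scale=0.02, size=62)   # stand-in for vote share

model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)  # coefficient of determination
r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient

print(r2, r, r ** 2)
```

Taking the square root of R^2 and applying the sign of `model.coef_` recovers r, if a correlation coefficient is what you want to report.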
This analysis started from the open-ended question "Are election results related to educational background?", and I would like to wrap up with a conclusion.
Before that, let's go over the (potentially) questionable parts of this analysis.
- The vote counts were entered by hand (and may contain mistakes)
- The educational background data is old (2010 census)
- The educational background data and the population data (2020) are not from the same year
- Non-voters are not taken into account
- Turnout is not taken into account
For these reasons, as I wrote at the beginning, I cannot guarantee that this analysis is meaningful. With that in mind, here are the conclusions that this analysis at least does not contradict.
- **Election results reflect regional characteristics**
- **For some candidates, vote share correlates with educational background (the percentage of university graduates)**

(Note that correlation does not necessarily imply causation.) That is about all. Personally, I don't think either conclusion is hard to imagine.
There is a lot more to be learned about individual candidates, but I will leave that out here.
As mentioned above, I am studying data analysis, so I tried a simple analysis on fresh data. Combining other data sources with the ones used here would probably reveal more.
On a personal note, entering the vote data by hand was tedious. It doesn't have to come from the government, but I wish that at least the data compiled by the news media were published as CSV. (I understand that there are various restrictions.)