[Python] I tried to visualize the age groups and rating distribution of AtCoder

Introduction

I visualized the age groups and rating distribution of people who participate in AtCoder (competitive programming) by scraping the ranking pages and processing the data statistically with Python.

Outline

  1. AtCoder age groups
  2. Rating distribution by number of participations
  3. Relationship between age and rating
  4. Source code used
  Reference

1. AtCoder age groups

First, I scraped the birth years that AtCoder participants list in their profiles and tabulated them; the result is shown below. Users who have not entered a birth year in their profile are not counted. As you might expect, young people are the majority, university students in particular (the plotting code marks birth years 1996 to 2001 with a red band).

image.png
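
The tabulation above amounts to counting users per birth year and skipping blank profiles. Here is a minimal sketch, assuming the birth-year column has already been scraped as strings. The `birth_years` list and the `tally_birth_years` helper are made up for illustration and are not real AtCoder data.

```python
from collections import Counter

# Hypothetical sample of scraped birth-year strings; "" or "-" means "not entered".
birth_years = ["1998", "2000", "", "1998", "-", "1975", "2000", "1998"]


def tally_birth_years(values):
    """Count users per birth year, ignoring entries that are not integers."""
    counts = Counter()
    for value in values:
        try:
            counts[int(value)] += 1
        except ValueError:
            continue  # profile without a usable birth year: not counted
    return counts


counts = tally_birth_years(birth_years)
for year in sorted(counts):
    print(year, counts[year])
```

Skipping unparsable values rather than treating them as zero keeps blank profiles out of the totals, which matches the rule stated above.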

2. Rating distribution by number of participations

Visualization with mean and standard deviation

Not surprisingly, the number of contests entered and the rating appear to be correlated. Note that, by the design of AtCoder's rating system, a user with 10 or fewer participations may have a rating significantly lower than their actual ability. See the following page for details: About AtCoder Contest Rating
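
To get a feel for how large this effect is, here is a rough sketch of the correction term for users with few participations. The formula is an assumption: it is how I understand the term described in AtCoder's rating document, so please verify it against the page linked above. The names `F`, `F_INF`, and `correction` are my own.

```python
import math


def F(n):
    """Helper term of the (assumed) correction, summed over the first n contests."""
    numerator = math.sqrt(sum(0.81 ** i for i in range(1, n + 1)))
    denominator = sum(0.9 ** i for i in range(1, n + 1))
    return numerator / denominator


# Limit of F(n) as n grows; both sums are geometric series.
F_INF = math.sqrt(0.81 / (1 - 0.81)) / (0.9 / (1 - 0.9))


def correction(n):
    """Points subtracted from the rating after n rated contests (assumed formula)."""
    return (F(n) - F_INF) / (F(1) - F_INF) * 1200


for n in (1, 5, 10, 30):
    print(n, round(correction(n)))
```

Under this assumption the subtracted amount shrinks quickly as participations accumulate, which is why ratings of users with only a few contests should be read with care.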

I visualized the ratings of active users as the mean and standard deviation for each number of contest participations to date. The blue dots are the means, and the yellow band shows the mean ± one standard deviation. Even after allowing for the rating-system behavior described above, there appears to be a positive correlation between the number of participations and the rating. I had imagined that beyond about 30 participations the rating would level off and the correlation would largely disappear, but that does not seem to be the case.

image.png
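
The grouping behind this plot can be sketched as follows, assuming the scraped data has been reduced to (participation count, rating) pairs. The `users` list is made-up sample data and `mean_std_by_count` is a hypothetical helper, not the code used for the figure.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up (participation count, rating) pairs, for illustration only.
users = [(1, 50), (1, 120), (1, 10), (5, 400), (5, 650), (5, 300), (5, 900)]


def mean_std_by_count(pairs, min_size=2):
    """Return {count: (mean, std)} for groups with at least min_size users."""
    groups = defaultdict(list)
    for count, rating in pairs:
        groups[count].append(rating)
    return {
        count: (mean(ratings), stdev(ratings))
        for count, ratings in sorted(groups.items())
        if len(ratings) >= min_size  # stdev needs at least two values
    }


stats = mean_std_by_count(users)
for count, (avg, sd) in stats.items():
    # The band in the plot spans avg - sd to avg + sd.
    print(count, round(avg, 1), round(avg - sd, 1), round(avg + sd, 1))
```

Because the band is symmetric around the mean, its lower edge can fall below zero when a group's ratings are skewed.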

As an example, here is a histogram of the ratings of people who have participated in exactly five contests so far.

image.png

Median and percentile visualization

The mean is strongly influenced by outliers, such as people with prior competitive programming experience whose ability is unusually high from the start, so I also visualized the data using the median. The blue dots are the medians, and the yellow band spans the 25th to 75th percentiles. Overall, the median is slightly lower than the mean.

image.png
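
A small example shows why the median is less sensitive to a few very strong newcomers than the mean. The `ratings` list is made up for illustration. `statistics.quantiles` with `method="inclusive"` is used here as a standard-library stand-in for `np.percentile`; as far as I know it interpolates the same way as NumPy's default, but treat that as an assumption.

```python
from statistics import mean, median, quantiles

# Made-up ratings for one participation count; 2800 plays the role of an
# experienced newcomer whose rating is far above everyone else's.
ratings = [100, 150, 200, 250, 300, 350, 2800]

# Quartiles: q1 is the 25th percentile, q2 the median, q3 the 75th percentile.
q1, q2, q3 = quantiles(ratings, n=4, method="inclusive")

print("mean  :", round(mean(ratings), 1))
print("median:", median(ratings))
print("25th to 75th percentile band:", q1, "to", q3)
```

The single high value pulls the mean well above the bulk of the group, while the median and the quartile band stay where most of the ratings are.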

3. Relationship between age and rating

A question came up while I was visualizing the age groups and rating distributions. You have probably heard the theory that programmers reach "retirement age" at 35, but is there actually a correlation between age and AtCoder rating? I decided to plot it. As mentioned above, by the design of AtCoder's rating system, a user with 10 or fewer participations may have a rating significantly lower than their actual ability, so the graph below is limited to people who have participated 10 or more times, and it uses the median so that outliers have less influence. Looking at the results, there appears to be almost no correlation between age and rating. There is little data for people in their 40s and those results vary widely, so treat them as a rough reference only.

image.png
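
The filtering and grouping described above can be sketched like this, assuming each user has been reduced to a (birth year, rating, participation count) record. The `records` list is made-up sample data and `median_rating_by_birth_year` is a hypothetical helper; the threshold of 10 is a parameter you can change.

```python
from collections import defaultdict
from statistics import median

# Made-up (birth year, rating, participation count) records.
records = [
    (1985, 900, 25), (1985, 1100, 40), (1985, 300, 3),
    (1998, 1000, 12), (1998, 800, 30), (1998, 50, 1),
]


def median_rating_by_birth_year(rows, min_participations=10):
    """Median rating per birth year, using only users with enough contests."""
    by_year = defaultdict(list)
    for birth_year, rating, participations in rows:
        if participations >= min_participations:
            by_year[birth_year].append(rating)
    return {year: median(vals) for year, vals in sorted(by_year.items())}


result = median_rating_by_birth_year(records)
print(result)
```

Users with only a handful of contests are dropped before the median is taken, so their artificially low ratings do not drag down any birth year.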

4. Source code used

The source code is shown below.


from urllib import request
from bs4 import BeautifulSoup
#By changing the url here, you can limit the number of participation etc.
url = "https://atcoder.jp/ranking/?f.Country=&f.UserScreenName=&f.Affiliation=&f.BirthYearLowerBound=0&f.BirthYearUpperBound=9999&f.RatingLowerBound=0&f.RatingUpperBound=9999&f.HighestRatingLowerBound=0&f.HighestRatingUpperBound=9999&f.CompetitionsLowerBound=1&f.CompetitionsUpperBound=9999&f.WinsLowerBound=0&f.WinsUpperBound=9999&page="
html = request.urlopen(url+"0")
soup = BeautifulSoup(html, "html.parser") #Extract information from html file
ul = soup.find_all("ul") #Can be extracted by specifying the element name and attribute

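a = []
page = 0
for tag in ul:
    #tag.get("class") returns a list of class names, or None if there is no class
    classes = tag.get("class") or []
    if "pagination" in classes:
        a = tag.find_all("a")
        break
for tag in a:
    href = tag.get("href") or ""
    if "ranking" in href:
        try:
            page = max(page, int(tag.string)) #The largest page number is the last page
        except (TypeError, ValueError):
            pass #Skip links whose text is not a page number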
organization = []
rank = []
name = []
for i in range(1,page+1): #page
    html = request.urlopen(url+str(i))
    soup = BeautifulSoup(html, "html.parser")
    td = soup.find_all("span")
    
    for tag in td:
        try:
            string_ = tag.get("class")[0]
        except:
            continue
        try:
            if string_ == "ranking-affiliation":
                organization.append(str(tag.string))
        except:
            pass        
    pp = soup.find_all("a")
    for tag in pp:
        try:
            string_ = tag.get("class")[0]
        except:
            continue
        try:
            if string_ == "username":
                name.append(str(tag.string))
        except:
            pass
information = []
for i in range(1,page+1): #page
    html = request.urlopen(url+str(i))
    soup = BeautifulSoup(html, "html.parser")    
    tbody = soup.find_all("tbody")
    for tr in tbody:
        for td in tr:
            temp = [] 
            for tag in td:
                try:
                    string_ = str(tag.string).strip()
                    if len(string_) > 0:
                        temp.append(string_)
                except:
                    pass
            if len(temp)>0:
                information.append(temp[2:])

information = [[name[i],organization[i]] + (information[i]) for i in range(len(information))]

#%%
import matplotlib.pyplot as plt
year_upper = 2020
rank_dic = {i:[] for i in range(year_upper+1)}
generation = [0 for i in range(year_upper+1)] #One slot per birth year, 0..year_upper

for i in range(len(information)):
    old = information[i][2]
    try:
        rank_dic[int(old)].append(int(information[i][3]))
        generation[int(old)] += 1
    except:
        pass
for i in range(len(rank_dic)-1, -1, -1): #Drop birth years with fewer than 10 people
    if len(rank_dic[int(i)]) < 10:
        del rank_dic[int(i)]
#%%
import numpy as np
from statistics import mean, median,variance,stdev

ave_rank = np.array([[i ,mean(rank_dic[i])] for i in list(rank_dic.keys())], dtype = "float32")
stdev_rank = np.array([[i ,stdev(rank_dic[i])] for i in list(rank_dic.keys())], dtype = "float32")
max_rank = np.array([[i ,max(rank_dic[i])] for i in list(rank_dic.keys())], dtype = "float32")
median_rank = np.array([[i ,median(rank_dic[i])] for i in list(rank_dic.keys())], dtype = "float32")
#Pass a scalar percentile so each row is [year, value] and the array is not ragged
percent25 = np.array([[i,np.percentile(rank_dic[i], 25)] for i in list(rank_dic.keys())], dtype = "float32")
percent75 = np.array([[i,np.percentile(rank_dic[i], 75)] for i in list(rank_dic.keys())], dtype = "float32")

#Mean rating by birth year
plt.fill_between(ave_rank[:,0], ave_rank[:,1]-stdev_rank[:,1], ave_rank[:,1]+stdev_rank[:,1],facecolor='y',alpha=0.5)
plt.scatter(ave_rank[:,0], ave_rank[:,1])
plt.xlim(1970,2010)
plt.ylim(-100,2000)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("ave")
plt.show()
#Median rating by birth year
plt.fill_between(percent25[:,0], percent25[:,1], percent75[:,1],facecolor='y',alpha=0.5)
plt.scatter(median_rank[:,0], median_rank[:,1])
plt.xlim(1970,2010)
plt.ylim(-100,2000)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("med")
plt.show()
#Distribution of participating age groups
plt.plot([1996,1996],[-200,5000],zorder=1,linestyle="dashed",color="red")
plt.plot([2001,2001],[-200,5000],zorder=1,linestyle="dashed",color="red")
plt.fill_between([1996,2001], [-200,-200],[5000,5000],facecolor='red',alpha=0.5)
plt.scatter(range(len(generation)), generation,s=80,c="white",zorder=2,edgecolors="black",linewidths=2)
plt.xlim(1960,2010)
plt.ylim(-100,4500)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("population")
plt.show()

#%%
compe_count = [[] for i in range(201)]
for i in range(len(information)):
    compe_count[int(information[i][5])].append(int(information[i][3]))

ave_rank_count = np.array([[i,mean(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
stdev_rank_count = np.array([[i,stdev(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
max_rank_count = np.array([[i,max(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
min_rank_count = np.array([[i,min(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
med_rank_count = np.array([[i,median(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
#Pass a scalar percentile so each row is [count, value] and the array is not ragged
percent25_count = np.array([[i,np.percentile(X, 25)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
percent75_count = np.array([[i,np.percentile(X, 75)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
#Check the histograms for 1 to 19 participations
for i, X in enumerate(compe_count[1:20], start=1): #start=1 so the title matches the participation count
    plt.hist(X, bins=40)
    plt.title(str(i))
    plt.show()
#Participation count and mean rating
plt.fill_between(ave_rank_count[:,0],ave_rank_count[:,1]-stdev_rank_count[:,1],ave_rank_count[:,1]+stdev_rank_count[:,1],facecolor='y',alpha=0.5)
plt.scatter(ave_rank_count[:,0], ave_rank_count[:,1],zorder=2)
plt.tick_params(labelsize=15)
plt.grid()
plt.ylim(-100,2500)
#plt.title("ave_count")
plt.show()
#Participation count and median rating
plt.fill_between(percent25_count[:,0], percent25_count[:,1], percent75_count[:,1],facecolor='y',alpha=0.5)
plt.scatter(med_rank_count[:,0], med_rank_count[:,1])
plt.tick_params(labelsize=15)
plt.ylim(-100,2500)
plt.grid()
#plt.title("med_count")
plt.show()
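
The comment near the top of the code mentions changing the URL to restrict, for example, the number of participations. Here is a minimal sketch of building that URL from parameters instead of editing the string by hand. The parameter names are copied from the URL above; `ranking_url` and its arguments are hypothetical, and whether the site still accepts these parameters is an assumption.

```python
from urllib.parse import urlencode

BASE = "https://atcoder.jp/ranking/"


def ranking_url(page, min_competitions=1, birth_low=0, birth_high=9999):
    """Build a ranking-page URL with the same query parameters as the URL above."""
    params = {
        "f.Country": "",
        "f.UserScreenName": "",
        "f.Affiliation": "",
        "f.BirthYearLowerBound": birth_low,
        "f.BirthYearUpperBound": birth_high,
        "f.RatingLowerBound": 0,
        "f.RatingUpperBound": 9999,
        "f.HighestRatingLowerBound": 0,
        "f.HighestRatingUpperBound": 9999,
        "f.CompetitionsLowerBound": min_competitions,
        "f.CompetitionsUpperBound": 9999,
        "f.WinsLowerBound": 0,
        "f.WinsUpperBound": 9999,
        "page": page,
    }
    return BASE + "?" + urlencode(params)


# Example: page 1 of users with 10 or more participations.
url = ranking_url(1, min_competitions=10)
print(url)
```

Keeping the filters as function arguments makes it easier to rerun the same analysis with a different participation threshold or birth-year range.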

Reference

I relied heavily on the following articles.

I tried to get the rate distribution of AtCoder by web scraping of Python
I examined the distribution of AtCoder ratings
