[PYTHON] Graph the ratio of topcoder, Codeforces and TOEIC by rating (Pandas + seaborn)

I tried to visualize the ratio of the number of people of various ratings as a stacked bar graph. I'm drawing in Python (matplotlib + Pandas + seaborn).

The first half of this article explains how to draw the graphs. If you only want the results, skip ahead to "Actual drawing example".

Target audience of this article

--I want to draw graphs in Python
--I participate in topcoder / Codeforces and can't help worrying about where I stand
--I took the TOEIC, but I can't grasp what my score means
--I want to set my next goal
--I play beatmania

Motivation and results

--I felt that the bar for Div.1 in Codeforces was higher than in topcoder, so I checked.
  --It was quite high (top 10%)
--I didn't understand what TOEIC scores mean, so I looked it up.
  --600 points was the midpoint
--I heard that you can draw beautiful graphs using seaborn, so I tried it.
  --Easy and clean, recommended

How to draw a graph

I referred to the article "Beautiful graph drawing with python - seaborn makes data analysis and visualization easier, Part 1".

Basically, draw the graph according to the following flow.

  1. Prepare a definition of rank (rating range)
  2. Generate an array from the data source with each user's rating as an element (or manually)
  3. Create a Pandas dataframe
  4. Draw a stacked bar graph (save as an image)

Prepare a definition of rank

Color plays an important role in the rating system of topcoder and Codeforces. Therefore, prepare an array that defines the three elements of "rating range", "name of the rank", and "color" as one rank, and use it for graph drawing.

For example, in topcoder, we have prepared the following definitions.

# (border, name, color)
rank_info = [
    (0-1, 'Gray', '#9D9FA0'),
    (900-1, 'Green', '#69C329'),
    (1200-1, 'Blue', '#616BD5'),
    (1500-1, 'Yellow', '#FCD617'),
    (2200-1, 'Red', '#EF3A3A'),
    (3000-1, 'Target', '#000000'),
]

The first element is the rank's lower limit minus 1. That is, Gray is (-1..899], Green is (899..1199], and so on. The borders are defined this way because pandas.cut, which is used later, treats each bin as open on the left and closed on the right by default. The remaining elements are the rank name and the color.

When drawing the graph, separate arrays of only the rank names and only the colors are needed, so extract each column as appropriate.

# List comprehensions give real lists (map returns an iterator in Python 3)
rank_borders = [x[0] for x in rank_info]
names = [x[1] for x in rank_info]
color = [x[2] for x in rank_info]
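As a small illustration of the interval definition above, the following sketch splits rank_info with zip and looks up the rank name for a single rating using the standard bisect module. The helper rank_name is a hypothetical name introduced here, not part of the original script, and it assumes that ratings are non-negative integers.

```python
from bisect import bisect_left

# (border, name, color): border is the rank's lower limit minus 1
rank_info = [
    (0 - 1, 'Gray', '#9D9FA0'),
    (900 - 1, 'Green', '#69C329'),
    (1200 - 1, 'Blue', '#616BD5'),
    (1500 - 1, 'Yellow', '#FCD617'),
    (2200 - 1, 'Red', '#EF3A3A'),
    (3000 - 1, 'Target', '#000000'),
]

# zip(*...) transposes the list of tuples into three parallel columns
rank_borders, names, color = (list(column) for column in zip(*rank_info))


def rank_name(rating):
    """Return the rank whose interval (border, next border] contains rating.

    Hypothetical helper; assumes rating >= 0.
    """
    # bisect_left counts the borders strictly below rating
    index = bisect_left(rank_borders, rating) - 1
    return names[index]


print(rank_name(899), rank_name(900), rank_name(3200))  # Gray Green Target
```

Because each border is the lower limit minus 1, a rating of exactly 900 falls into Green rather than Gray.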

Generate an array with rating as an element from the data source

Data sources come in a variety of formats, ranging from relatively well-formed ones such as JSON and XML to web pages and tweets. These days data is often obtained through a Web API, so only the Codeforces example is shown here.

import json
import urllib.request

url = 'http://codeforces.com/api/user.ratedList?activeOnly=true'
response = urllib.request.urlopen(url).read()
data = json.loads(response)

if data['status'] == 'OK':
    res = data['result']

    # Keep only the rating of each user
    ratings = [user['rating'] for user in res]

This finally gives us ratings, an array filled with rating values. For other data sources, write a parser, prepare the values manually, or obtain them in whatever way suits each format.
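To try the parsing step without touching the network, the extraction can be separated from the request. The sketch below assumes only the response shape used above (a status field and a result list whose elements have a rating). The function name extract_ratings and the sample handles are made up for illustration.

```python
import json


def extract_ratings(payload):
    """Return the list of ratings from a user.ratedList style response.

    Hypothetical helper; raises ValueError when the status is not OK.
    """
    if payload.get('status') != 'OK':
        raise ValueError('unexpected status: %r' % payload.get('status'))
    return [user['rating'] for user in payload['result']]


# Hand-written sample in the same shape as the response used above
# (the handles and ratings are made up for illustration)
sample = json.loads('''
{"status": "OK",
 "result": [{"handle": "alice", "rating": 1600},
            {"handle": "bob", "rating": 2700},
            {"handle": "carol", "rating": 100}]}
''')

ratings = extract_ratings(sample)
print(ratings)  # [1600, 2700, 100]
```

Raising an error on a non-OK status avoids silently continuing with an undefined ratings variable.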

Create a Pandas dataframe

Put the data in a class called a data frame, which has a lot of analysis and drawing functions. It is very convenient to see the summary of the data with the describe function.

I'd like to create one right away, but the data extracted from the data source this time is an array whose elements are rating values:
[1600, 2700, 100, 1200, ... ]

To draw the target graph, it must be converted to an array that holds the ratio for each rank:
[0.1, 0.15, 0.3, 0.25, ... ] (Rank 0 is 10%, Rank 1 is 15%, ...)

I think there are various methods for conversion, but this time I took the following steps.

  1. [1600, 2700, 100, 1200, ...] (Array of ratings)
  2. [3, 8, 0, 1, ...] (Array of ranks)
  3. {0: 300, 1: 800, 2: 2000, 3: 1200, ...} (Counter by rank)
  4. [300, 800, 2000, 1200, ...] (Number of people per rank)
  5. [0.1, 0.15, 0.3, 0.25, ...] (Ratio by rank)
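The five steps can be followed on a tiny example. This is a pure-Python sketch that uses bisect in place of pandas, with the topcoder borders from earlier and made-up ratings, so it only illustrates the data flow and is not the code used for the graphs.

```python
from bisect import bisect_left
from collections import Counter

rank_borders = [-1, 899, 1199, 1499, 2199, 2999]  # topcoder borders from above
ratings = [1600, 2700, 100, 1200, 950, 3100, 899, 1500]  # made-up sample

# 1 -> 2: rating -> rank index (number of borders strictly below it, minus 1)
ranks = [bisect_left(rank_borders, r) - 1 for r in ratings]

# 2 -> 3: count how many users fall into each rank
counter = Counter(ranks)

# 3 -> 4: counts in rank order, with 0 for ranks that nobody holds
num_list = [counter.get(i, 0) for i in range(len(rank_borders))]

# 4 -> 5: divide by the total to get ratios
ratios = [n / len(ratings) for n in num_list]

print(ranks)     # [3, 4, 0, 2, 1, 5, 0, 3]
print(num_list)  # [2, 1, 1, 2, 1, 1]
print(ratios)
```

Building num_list by index, rather than from the Counter's own iteration order, keeps the counts aligned with the rank order even when a rank is empty.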

Conversion details

Of course, you can write a loop yourself to convert 1->2, but there is a function called pandas.cut that makes this easy.

If you pass an array of rank borders [a, b, c, d] together with the labels 0, 1, 2, it converts a value in (a, b] to 0, a value in (b, c] to 1, and a value in (c, d] to 2.

As a caveat, with the border definition above, people in the highest rank are not counted, so a large value (INT_MAX etc.) must be appended to the borders as a sentinel.
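The effect of the sentinel can be checked directly. The following sketch, with made-up ratings, calls pandas.cut with labels=False so that it returns the bin index of each value, once without and once with sys.maxsize appended to the borders.

```python
import sys

import pandas as pd

rank_borders = [-1, 899, 1199, 1499, 2199, 2999]
ratings = [100, 899, 900, 3500]  # made-up sample

# labels=False makes pd.cut return the bin index of each value
without_sentinel = pd.cut(ratings, rank_borders, labels=False)
with_sentinel = pd.cut(ratings, rank_borders + [sys.maxsize], labels=False)

# Without the sentinel, 3500 lies above the last border and becomes NaN
print(without_sentinel.tolist())
print(with_sentinel.tolist())  # [0, 0, 1, 5]
```

A value that falls outside every bin is returned as NaN instead of raising an error, which is why the missing top rank is easy to overlook.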

The 2->3 and 3->4 conversions can also be done easily with Counter from Python's collections module.

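import sys
from collections import Counter

import pandas as pd

# 1->2->3: map each rating to a rank index, then count per rank
bins = list(rank_borders) + [sys.maxsize]  # sentinel so the top rank is counted
ranks = Counter(pd.cut(ratings, bins, labels=range(len(rank_info))))

# 3->4: counts in rank order (0 for a rank that nobody holds)
num_list = [ranks.get(i, 0) for i in range(len(rank_info))]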

Finally, perform the 4->5 conversion while creating the data frame. Passing the total number of users to div converts each count to a ratio. For convenience when drawing the graph, the data frame is transposed with .T at the end.

df = pd.DataFrame(num_list, columns=[''], index=names).div(len(ratings)).T
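To see what the resulting data frame looks like, here is a minimal sketch with made-up counts. The intermediate names counts and total are introduced here only for readability.

```python
import pandas as pd

names = ['Gray', 'Green', 'Blue', 'Yellow', 'Red', 'Target']
num_list = [2, 1, 1, 2, 1, 1]  # made-up counts per rank
total = sum(num_list)

# One unnamed column of counts, indexed by rank name ...
counts = pd.DataFrame(num_list, columns=[''], index=names)
# ... divided by the total, then transposed into a single row
df = counts.div(total).T

print(df.shape)            # (1, 6)
print(df.loc['', 'Gray'])  # 0.25
```

After the transpose each rank is a column, so a stacked bar plot draws one bar in which the ranks are stacked on top of each other.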

Draw a stacked bar graph (save as an image)

Once you have the ratio for each rank, all that remains is to draw. You can plot directly from the data frame.

df.plot(kind='bar', stacked=True)

If seaborn has been imported, the graph already looks less bland than matplotlib's default (in seaborn 0.8 and later, the seaborn style is applied only after calling sns.set()). However, some parts are still hard to read as they are, so I make the following adjustments.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Call these two before df.plot(...) so that they affect the plot
sns.set_context('talk', 1.2)  # Increase the font size
sns.set_palette(color)        # Use the rank colors

# Reverse the order of the legend (call after df.plot(...))
handles, labels = plt.gca().get_legend_handles_labels()
plt.gca().legend(handles[::-1], labels[::-1], loc='lower left')

# Set the upper limit to 1.0 to eliminate the margin
plt.yticks(np.arange(0.0, 1.1, 0.1))  # Ticks every 0.1; the stop is 1.1 so that 1.0 is included
plt.ylim(0.0, 1.0)  # Draw up to 1.0

Finally, call show() to display the graph on the screen, or savefig() to save it as an image.

plt.show()  # Display on screen
# plt.savefig("image.png")  # Save as an image

By combining the above steps, you can draw a graph.
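Put together, the drawing step might look like the sketch below. It is not the original script: it leaves out seaborn and passes the rank colors directly through the color argument of df.plot, uses made-up counts for three ranks, and selects the off-screen Agg backend so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')  # off-screen backend so that no window is needed
import pandas as pd

names = ['Gray', 'Blue', 'Red']
colors = ['#9D9FA0', '#616BD5', '#EF3A3A']
num_list = [6, 3, 1]  # made-up counts per rank

df = pd.DataFrame(num_list, columns=[''], index=names).div(sum(num_list)).T

# One stacked bar; each column (rank) gets its own color
ax = df.plot(kind='bar', stacked=True, color=colors)

# Reverse the legend so that it reads top-down like the stack
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='lower left')

# Ticks every 0.1 and no margin above 1.0
ax.set_yticks([i / 10 for i in range(11)])
ax.set_ylim(0.0, 1.0)

# Save as PNG; pass a file name such as 'image.png' to write a file instead
buffer = io.BytesIO()
ax.figure.savefig(buffer, format='png')
print(len(buffer.getvalue()) > 0)  # True
```

Working on the Axes object returned by df.plot avoids depending on whichever axes happen to be current.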

Actual drawing example

topcoder

An API is available and comes with thorough samples, but when I accessed the algorithm's Top Ranked Members, a 404 was returned.

  "error": {
    "name": "Not Found",
    "value": 404,
    "description": "The URI requested is invalid or the requested resource does not exist.",
    "details": "No results found"
  }

By the way, if you put an arbitrary string in testType, the error lists the valid candidates. However, half of them return 400.

  "error": {
    "name": "Bad Request",
    "value": 400,
    "description": "The request was invalid. An accompanying message will explain why.",
    "details": "challengeType should be an element of design,development,specification,architecture,bug_hunt,test_suites,assembly,ui_prototypes,conceptualization,ria_build,ria_component,test_scenarios,copilot_posting,content_creation,reporting,marathon_match,first2finish,code,algorithm."
  }

Having no other choice, I extracted the data from the following XML data feeds. http://apps.topcoder.com/wiki/display/tc/Algorithm+Data+Feeds

There are two types of data, active users (who participated in a rated contest within 180 days) and all users.

Graph

--Active users (5469 people) [image: active.png]
--All users (69096 people) [image: all.png]

Consideration of active users

First of all, it bothers me that there are fewer active users than I expected. Looking at the ratio, almost 60% of the total is Div.2 and 40% is Div.1. Because there are few active users and few alternate accounts (registration is a hassle), the distribution seems to come out as the rating formula intends.

Yellow was roughly the top 20% and red coders were the top 3.9%. Target is 0.29%, just thick enough to be visible on the graph. If your eyes are tired, you may miss it. They truly are people above the clouds.

One line summary

The ratio of topcoder is beautiful.

Codeforces

As mentioned earlier, an API is available (http://codeforces.com/api/help). If you request user.ratedList, the list is returned in JSON format. Here too, active users and all users are distinguished, but the condition for being active is having participated in a rated contest within the last month, which is stricter than topcoder's.

Graph

--Active users (9640 people) [image: active.png]
--All users (71606 people) [image: all.png]

Consideration

I get the sense that there are a certain number of alternate accounts, but even excluding them there are many active users. Considering how strict the condition is, this number shows real momentum.

As expected, the bar for Div.1 is considerably higher than in topcoder: it is the top 10% (purple and above). Compared with topcoder, there may be more beginners joining, or the contest times may make it easier to participate from all over the world, which makes the top ranks stand out. Also, only 2.4% of users are above purple, yet that range is divided into 5 ranks. For the motivation of beginners, I think the lower range should be divided a little more finely.

One line summary

The upper ranks are a playground of the gods (are that many ranks really necessary...?)

TOEIC

I obtained the data for the latest test (the 206th) from the following web page. http://www.toeic.or.jp/toeic/about/data/data_avelist/data_dist01_09.html Helpfully, the data already includes the ratios, though I did not use them for graphing.

In addition, each of the three types of listening, reading, and total is graphed.

Graph

--Listening, Reading (94782 people) [image: toeic.png]
--Total (94782 people) [image: total.png]

Consideration

There are a lot of test takers. If competitive programming contests had this many participants every time... the servers would go down! I can only pray for the growth of the competitive programming world.

The graph is in steps of about 50 points, and the distribution is beautiful. I don't know the details of the scoring method, but it seems natural to assume that scores are derived from the distribution rather than the distribution from the scores.

However, Reading appears to be more evenly spread in the middle than Listening, with fewer people at the top: 470 and above is 4.1% for Listening but 1.1% for Reading. Indeed, I often hear that Reading scores tend to be lower. Perhaps those with a good ear can reliably get a perfect Listening score, while a perfect Reading score is hard to achieve because of the time limit?

Looking at the total, 145 points is at about 20%, and the share rises by 10% for every 50 points. 600 points is the 50% position.

I often see "990 points" on books in bookstores, so I wanted to know how many people get a perfect score, but that data is not published. At least, they are within the top 3.6%. That goes without saying, though.

One line summary

The difference of 50 points was surprisingly large.

IIDX SP rank

This section is a bonus.

When it comes to ranks, nothing beats beatmania IIDX (personal opinion). There are prior studies such as the following, so many people seem to care about this topic. http://clickagain.sakura.ne.jp/top/tokusyuu/dani_dd/dani_tokusyuu2.html http://esports-runner.com/beatmaniaiidx/dani_transition2/

So I made graphs. There are two: the current figures for the latest version (Copula) and the final results of the previous version (Pendual). Only SP ranks are covered.

For the latest version, @2500bpm tweets the numbers every day on Twitter, so I used those. https://twitter.com/2500bpm For the previous version, the results are on the official website, so I counted the number of people at each rank. http://p.eagate.573.jp/game/2dx/22/p/ranking/dani.html

Graph

--Latest version (55075 people) [image: iidx.png]
--Previous version (89599 people) [image: pendual.png]

Consideration

As most players know, the lower ranks are strikingly sparse: everything up to 4th dan is squeezed together. There are many people at 8th dan, probably because many have not yet taken the 9th dan or higher exams in this version. This share will likely shrink as the version stays in operation. Compared with the previous version, you can see that Chuden has successfully relieved the cluster at 10th dan.

One line summary

balance…….

SP rank of BMS

While we're at it, let's look at BMS as well. The numbers are published on LR2IR, so I used them. Given the nature of BMS, there may be cheating and alternate accounts, so treat this as reference only.

I created two graphs: all ranks, and insane ranks only.

Graph

--All ranks (58166 people) [image: bms.png]
--Insane ranks only (29584 people) [image: insane.png]

Consideration

Across all ranks, the distribution is surprisingly even. There are players of every level, and the balance between ranks may be good. About half of the players hold an insane rank. I assumed many people start directly with the insane ranks, but the normal ranks seem to be played properly as well.

The insane ranks show a distribution in which the number of people decreases as the difficulty increases. Since ★05 has few people, ★06 may be relatively easy to pass.

The share of the highest rank (^^) is 0.07%, and that of ★★ is 1.2%. I don't recommend it, but if you are interested, try watching a video.

One line summary

(^^)

Comparison between ratings

Let's compare standing across the different fields. The tiers are defined as follows.

--Top 80%: Beginners
--Top 40%: Intermediate
--Top 20%: Advanced
--Top 10%: Super advanced
--Top 5%: Ranker
--Top 1%: God

Since each has different standards, comparing them is not really meaningful, but it may give you a feel for the other ratings. This is only a rough guide and its accuracy is not guaranteed.

Top 80%: Beginners

--topcoder: mid Gray coder
--Codeforces: low Pupil (light green)

Top 40%: Intermediate

--topcoder: low Blue coder
--Codeforces: mid Specialist (dark green)

Top 20%: Advanced

--topcoder: high Blue to low Yellow coder
--Codeforces: mid Expert (blue)

Top 10%: Super Advanced

--topcoder: mid Yellow coder
--Codeforces: high Expert (blue) to low Candidate Master (purple)

Top 5%: Ranker

--topcoder: high Yellow to Red coder
--Codeforces: mid to high Candidate Master (purple)

Top 1%: God

--topcoder: high Red to Target
--Codeforces: Grandmaster (orange) to Legendary Grandmaster (dark red)

Summary

Using Python + matplotlib + Pandas + seaborn, you can easily draw a beautiful graph as described above. Of course, numpy and scipy can be used, so detailed analysis can be performed. If you have any concerns, why don't you make a quick graph?
