[PYTHON] [Qiita API] [Statistics • Machine learning] I tried to summarize and analyze the articles posted so far.

It's been about half a year since I started posting to Qiita, mainly articles on statistics, machine learning, and data analysis. Let's look back on the articles so far using the Qiita API. (Everything below is calculated from the data as of August 10, 2015.)

We'll look at the data first, then the Python code that generated it, including how to use the Qiita API from Python.

1. View the data

Posted articles by number of stocks

The top 5 articles account for 73% of all stocks, so the popular articles are heavily skewed ... I personally like "The meaning of fractional division understood in pizza" at the bottom, but it is hardly stocked at all. :sweat_smile:

| Stock quantity | Percentage(%) | Accumulation(%) | title |
|:---:|:---:|:---:|:---|
| 750 | 28.1 | 28.1 | [Machine learning] I will explain while trying the deep learning framework Chainer. |
| 595 | 22.3 | 50.4 | [Mathematics] Let's visualize what are eigenvalues and eigenvectors |
| 318 | 11.9 | 62.3 | [Statistics] First "standard deviation" (to avoid frustration with statistics) |
| 163 | 6.1 | 68.4 | Get a large amount of Starbucks Twitter data with python and try data analysis Part 1 |
| 124 | 4.6 | 73.1 | [Deep learning] Try Autoencoder with Chainer and visualize the result. |
| 82 | 3.1 | 76.1 | [Update] Explain what the stochastic gradient descent method is by running it in Python. |
| 55 | 2.1 | 78.2 | Get a large amount of Starbucks Twitter data with python and try data analysis Part 2 |
| 52 | 1.9 | 80.1 | Get a large amount of Starbucks Twitter data with python and try data analysis Part 3 |
| 50 | 1.9 | 82.0 | [Statistics] Understand what an ROC curve is by animation. |
| 45 | 1.7 | 83.7 | Starbucks Twitter Data Location Visualization and Analysis |
| 44 | 1.6 | 85.4 | Try rudimentary sentiment analysis on Twitter Stream API data. |
| 44 | 1.6 | 87.0 | Principal component analysis Analyze handwritten numbers using PCA. Part 1 |
| 40 | 1.5 | 88.5 | Understanding the meaning of complex and bizarre normal distribution formulas |
| 31 | 1.2 | 89.7 | Playing handwritten numbers with python Part 1 |
| 28 | 1.0 | 90.7 | [Statistics] Generalized linear mixed model(GLMM)Visualization to understand. |
| 28 | 1.0 | 91.8 | [Statistics] Let's visualize the relationship between the normal distribution and the chi-square distribution. |
| 24 | 0.9 | 92.7 | Explanation of the concept of regression analysis using Python Part 1 |
| 21 | 0.8 | 93.4 | Introduction to Graph Database Neo4j in Python for Beginners(For Mac OS X) |
| 20 | 0.7 | 94.2 | [Statistics] Grasp the image of the central limit theorem with a graph |
| 20 | 0.7 | 94.9 | Visualize the frequency of word occurrences in sentences with Word Cloud.[Python] |
| 20 | 0.7 | 95.7 | [Machine learning] k-nearest neighbor method(k-nearest neighbor method)Write in python by yourself and recognize handwritten numbers |
| 17 | 0.6 | 96.3 | Get the world's 100 most influential tech Twitter user information in python. |
| 16 | 0.6 | 96.9 | [Statistics] [R] Try using quantile regression. |
| 15 | 0.6 | 97.5 | Play handwritten numbers with python Part 2 (identify) |
| 14 | 0.5 | 98.0 | Explanation of the concept of regression analysis using Python Extra 1 |
| 12 | 0.4 | 98.5 | [Statistics] Q-Understand the mechanism of Q-plot with animation. |
| 11 | 0.4 | 98.9 | Explanation of the concept of regression analysis using python Part 2 |
| 11 | 0.4 | 99.3 | Principal component analysis Analyze handwritten numbers using PCA. Part 2 |
| 8 | 0.3 | 99.6 | [python]Random number generation memorandum |
| 6 | 0.2 | 99.8 | Preferences to generate animated GIFs from Python on Mac |
| 5 | 0.2 | 100.0 | The meaning of fractional division understood in pizza |

By category

I write articles in the major categories of "machine learning," "statistics," "mathematics," "data analysis," and "others."

Machine learning
[Machine learning] I will explain while trying the deep learning framework Chainer.
[Deep learning] Try Autoencoder with Chainer and visualize the result.
[Update] Explain what the stochastic gradient descent method is by running it in Python.
Principal component analysis Analyze handwritten numbers using PCA. Part 1
Playing handwritten numbers with python Part 1
[Machine learning] k-nearest neighbor method(k-nearest neighbor method)Write in python by yourself and recognize handwritten numbers
Play handwritten numbers with python Part 2 (identify)
Principal component analysis Analyze handwritten numbers using PCA. Part 2

Statistics
[Statistics] First "standard deviation" (to avoid frustration with statistics)
[Statistics] Understand what an ROC curve is by animation.
Understanding the meaning of complex and bizarre normal distribution formulas
[Statistics] Generalized linear mixed model(GLMM)Visualization to understand.
[Statistics] Let's visualize the relationship between the normal distribution and the chi-square distribution.
Explanation of the concept of regression analysis using Python Part 1
[Statistics] Grasp the image of the central limit theorem with a graph
[Statistics] [R] Try using quantile regression.
Explanation of the concept of regression analysis using Python Extra 1
[Statistics] Q-Understand the mechanism of Q-plot with animation.
Explanation of the concept of regression analysis using python Part 2
[python]Random number generation memorandum

Math
[Mathematics] Let's visualize what are eigenvalues and eigenvectors
The meaning of fractional division understood in pizza

Data analysis
Get a large amount of Starbucks Twitter data with python and try data analysis Part 1
Get a large amount of Starbucks Twitter data with python and try data analysis Part 2
Get a large amount of Starbucks Twitter data with python and try data analysis Part 3
Starbucks Twitter Data Location Visualization and Analysis
Try rudimentary sentiment analysis on Twitter Stream API data.

Other
Introduction to Graph Database Neo4j in Python for Beginners(For Mac OS X)
Visualize the frequency of word occurrences in sentences with Word Cloud.[Python]
Get the world's 100 most influential tech Twitter user information in python.
Preferences to generate animated GIFs from Python on Mac

By tag

Let's look at each tag. Since I basically use Python, the tag with the most articles is Python. Looking at the stock/article ratio, "DeepLearning", "Deep learning", and "Chainer" are overwhelmingly high at 437.0 each. You can see how much excitement there is around deep learning these days.

"Math" and "Machine learning" also have relatively high stock/article ratios (175.7 and 125.2).

| tag | Number of articles | Stock quantity | stock/Article ratio |
|:---|:---:|:---:|:---:|
| Python | 30 | 2664 | 88.8 |
| statistics | 22 | 1589 | 72.2 |
| statistics | 17 | 1274 | 74.9 |
| Machine learning | 9 | 1127 | 125.2 |
| Twitter | 6 | 376 | 62.7 |
| Natural language processing | 6 | 379 | 63.2 |
| Math | 6 | 1054 | 175.7 |
| matplotlib | 5 | 63 | 12.6 |
| MongoDB | 4 | 314 | 78.5 |
| MachineLearning | 4 | 148 | 37.0 |
| DeepLearning | 2 | 874 | 437.0 |
| statistics | 2 | 35 | 17.5 |
| scikit-learn | 2 | 55 | 27.5 |
| Deep learning | 2 | 874 | 437.0 |
| Scraping | 2 | 37 | 18.5 |
| Chainer | 2 | 874 | 437.0 |
| Database | 1 | 21 | 21.0 |
| Data visualization | 1 | 45 | 45.0 |
| Statistical test | 1 | 12 | 12.0 |
| Way of thinking | 1 | 5 | 5.0 |
| Pattern recognition | 1 | 50 | 50.0 |
| Note | 1 | 5 | 5.0 |
| R | 1 | 16 | 16.0 |
| Data analysis | 1 | 40 | 40.0 |
| Visualization | 1 | 20 | 20.0 |
| math | 1 | 82 | 82.0 |
| numpy | 1 | 8 | 8.0 |
| Graph database | 1 | 21 | 21.0 |
| BeautifulSoup | 1 | 17 | 17.0 |
| Statistical modeling | 1 | 28 | 28.0 |
| neo4j | 1 | 21 | 21.0 |
| Introduction to Statistics | 1 | 11 | 11.0 |

Plotted as a graph, it looks like this.

stock_article.png

Users who stocked

I had imagined that the same people would stock many of my articles, but it seems that quite a lot of people stocked only once. The table below shows the regulars who stock often. Thank you :relaxed:

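| Ranking | Stock quantity |
|:---:|:---:|
| 1 | 22 |
| 2 | 18 |
| 3 | 13 |
| 4 | 10 |
| 5 | 10 |
| 6 | 10 |
| 7 | 9 |
| 8 | 9 |
| 9 | 9 |
| 10 | 9 |
| 11 | 8 |
| 12 | 8 |
| 13 | 8 |
| 14 | 8 |
| 15 | 8 |
| 16 | 8 |
| 17 | 7 |
| 18 | 7 |
| 19 | 7 |
| 20 | 7 |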

This is a graph of the top 150 users by number of stocks. There were 1,771 unique users in total.

stock_num.png

This is a histogram of the number of stocks per user. The counts are concentrated in the 1 to 5 range more than I imagined. A low repeat rate ... :weary: From now on, I will do my best to write articles that people come back to stock again!

stock_hist.png

2. Explanation of Python code

Get data from Qiita API

An access token can be issued on Qiita under [Settings] → [Applications] → [Issue new token]. Set the issued token in place of '<Access token>' in the code below.

%matplotlib inline
import requests
import json, sys
from collections import defaultdict
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

key = '<Access token>'
auth_str = 'Bearer %s'%(key)
headers = {'Authorization': auth_str}
cnt = 0
data_list = []
users = defaultdict(int)
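Rather than pasting the token into the source, one option is to read it from an environment variable. The following is only a sketch: `build_headers` and the variable name `QIITA_TOKEN` are hypothetical names chosen here, not part of the original code or of the Qiita API.

```python
import os


def build_headers(token):
    """Return the Authorization header dict for a bearer token."""
    if not token:
        raise ValueError("access token is empty")
    return {'Authorization': 'Bearer {}'.format(token)}


# QIITA_TOKEN is a hypothetical environment variable name used for this sketch.
# If it is not set, the placeholder from the original code is used instead.
key = os.environ.get('QIITA_TOKEN', '<Access token>')
headers = build_headers(key)
```

This keeps the token out of notebooks that might be shared or published.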

Define a get_stockers function that fetches the users who stocked an article and returns the number of stocks.

# -------------------Get the number of stocks for each article-----------------------#
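def get_stockers(_id):
    """Count the users who stocked article _id and tally them in `users`."""
    url = 'https://qiita.com/api/v2/items/{}/stockers'.format(_id)
    page = 0
    _sum = 0
    while True:
        page += 1
        payload = {'page': page, 'per_page': 20}
        res = requests.get(url, params=payload, headers=headers)
        res.raise_for_status()  # stop on HTTP errors instead of parsing an error body
        data = res.json()
        if len(data) == 0:
            # An empty page means every stocker has been read
            break
        for d in data:
            users[d['id']] += 1
        _sum += len(data)

    return _sum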
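The "request page after page until an empty page comes back" pattern used here can be factored into a small generator. This is a sketch rather than the author's code: `iter_pages`, `fetch_page`, and `max_pages` are hypothetical names, and the fetch function is passed in so that the loop can be tried without network access.

```python
def iter_pages(fetch_page, per_page=20, max_pages=100):
    """Yield items page by page until an empty page is returned.

    fetch_page(page, per_page) is a caller-supplied function (hypothetical
    name) returning the list of items on the given 1-based page.
    max_pages is a safety cap chosen for this sketch.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page, per_page)
        if not items:
            break
        for item in items:
            yield item


# Stand-in for the API: 45 made-up stockers served 20 per page.
fake_stockers = [{'id': 'user{}'.format(i)} for i in range(45)]


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return fake_stockers[start:start + per_page]


stockers = list(iter_pages(fake_fetch))
print(len(stockers))  # 45
```

With the real API, `fetch_page` would wrap the `requests.get(url, params=payload, headers=headers)` call and return `res.json()`.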

The loop below fetches the articles you posted, page by page, and keeps them in a list. The stock users for each article are fetched in the following step.

# -------------------Article information acquisition-----------------------#
url = 'https://qiita.com/api/v2/authenticated_user/items'

while True:
    cnt += 1
    sys.stdout.write("{}, ".format(cnt))
    payload = {'page': cnt, 'per_page': 20}
    res = requests.get(url, params=payload, headers=headers)
    data = res.json()
    if len(data) == 0:
        break
    data_list.extend(data)

res = []

Extract necessary information from the acquired data and organize it. Also, private articles (limited shared posts) are excluded.

# -------------------Data formatting-----------------------#
for i, d in enumerate(data_list):
    sys.stdout.write("{}, ".format(i))

    #Exclude private articles
    if d['private']:
        continue
        
    article_info = {}
    for k in ['id', 'title', 'private', 'created_at', 'tags', 'url']:
        article_info[k] = d[k]
    
    article_info['stock'] = get_stockers(d['id'])
    res.append(article_info)

Below, the articles, their stock counts, and the ratios are printed in a form that can be pasted directly as a Markdown table.

sum_of_stocks = np.sum([r['stock'] for r in res]).astype(np.float32)

cum = 0
print("|Stock quantity|Percentage(%)|Accumulation(%)|title|")
print("|:----------:|:----------:|:----------:|:----------|")
# Walk through the articles in descending order of stock count
for i in np.argsort([r['stock'] for r in res])[::-1]:
    r = res[i]
    ratio = r['stock']/sum_of_stocks*100
    cum += ratio
    print("|{0}|{1:.1f}|{2:.1f}|[{3}]({4})|".format(r['stock'], ratio, cum, r['title'], r['url']))
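As a quick check of what this loop computes, the share and cumulative share of the five most-stocked articles can be reproduced by hand. The sketch below uses the stock counts from the table in section 1. The total of 2669 is the sum of the stock column over all 31 articles, and the variable names are chosen here for illustration.

```python
from itertools import accumulate

# Stock counts of the five most-stocked articles (from the table in section 1).
top5 = [750, 595, 318, 163, 124]
total_stocks = 2669  # sum of the stock column over all 31 articles

shares = [s / total_stocks * 100 for s in top5]  # percentage of all stocks
cumulative = list(accumulate(shares))            # running total of the shares

for stock, share, cum in zip(top5, shares, cumulative):
    print("{} {:.1f} {:.1f}".format(stock, share, cum))
```

The last cumulative value rounds to 73.1, which is where the "top 5 are 73%" figure comes from.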

Aggregate by tag.

#Tag aggregation
tag_cnt = defaultdict(int)
for r in res:
    for t in r['tags']:
        tag_cnt[t['name']] += 1

#Number of stocks by tag
tag_stock_cnt = defaultdict(int)
for t in tag_cnt.keys():
    for r in res:
        for _t in r['tags']:
            if t == _t['name']:
                tag_stock_cnt[t] += r['stock']
tag_stock_dict = {}
for t, cnt in tag_stock_cnt.items():
    tag_stock_dict[t] = cnt

#Processed so that it can be placed in a DataFrame
tag_list = []
ind_list = []
for k, t in tag_cnt.items():
    ind_list.append(k)
    tag_list.append((t , tag_stock_dict[k]))

#Data frame generation
tag_list = np.array(tag_list)
df = pd.DataFrame(tag_list, index=ind_list, columns=['cnt', 'stocks'])

n = float(len(tag_cnt))
df['cnt_ratio'] = df['cnt']/n
df['stock_ratio'] = df['stocks']/sum_of_stocks

#Display of stock quantity and stock ratio by tag
df_tag = df.sort_values(by='cnt', ascending=False)

print("|tag|Number of articles|Stock quantity|stock/Article ratio|")
print("|:----------:|:----------:|:----------:|:----------:|")
for tag, row in df_tag.iterrows():
    print("|[{0}](http://qiita.com/tags/{0})|{1}|{2}|{3:.1f}|".format(tag, int(row['cnt']), int(row['stocks']), row['stocks']/row['cnt']))


#graph display
df[['cnt_ratio','stock_ratio']].sort_values(by='cnt_ratio', ascending=False).plot(kind="bar", figsize=(17, 8), alpha=0.7,
                                title="The ratio of articles and stocks for each tag.")
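The article count and stock count per tag can also be collected in a single pass over the articles. This is a sketch of an alternative, not the code used for the tables above. `aggregate_tags` is a hypothetical helper, and the sample articles are made up to match the shape of the entries in `res`.

```python
from collections import defaultdict


def aggregate_tags(articles):
    """Return {tag name: (article count, total stocks)} in one pass."""
    counts = defaultdict(int)
    stocks = defaultdict(int)
    for article in articles:
        for tag in article['tags']:
            counts[tag['name']] += 1
            stocks[tag['name']] += article['stock']
    return {name: (counts[name], stocks[name]) for name in counts}


# Two made-up articles in the same shape as the entries of `res`.
sample = [
    {'stock': 10, 'tags': [{'name': 'Python'}, {'name': 'Chainer'}]},
    {'stock': 4, 'tags': [{'name': 'Python'}]},
]

for name, (n_articles, n_stocks) in sorted(aggregate_tags(sample).items()):
    # The last column is the stock/article ratio shown in the tag table.
    print(name, n_articles, n_stocks, n_stocks / n_articles)
```

The resulting dictionary could be passed to `pd.DataFrame.from_dict(..., orient='index')` to get the same kind of table as above.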

Next, aggregate the stocks by user and display the results.

#User aggregation
df = pd.DataFrame(list(users.items()), columns=["id","cnt"])
df_sorted = df.sort_values(by="cnt", ascending=False)

#Top 20 people display
print("|Ranking|Stock quantity|")
print("|:----------:|:----------:|")
for i, d in enumerate(df_sorted['cnt'][:20]):
    print("| {} | {} |".format(i+1, d))


#Bar chart by user with the most stock
df_sorted[:150].plot(kind="bar", figsize=(17, 8), alpha=0.6, xticks=[],
                     title="The number of stocks from 1 user.", width=1, color="blue")


#Histogram of stock numbers
df['cnt'].plot(kind="hist", figsize=(13, 10), alpha=0.7, color="green", bins=25, xlim=(1,26),
              title="Histogram of stocked users.")
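The numbers behind the ranking and the histogram can also be checked without plotting. The following sketch uses `collections.Counter` on made-up data shaped like the `users` dictionary. The user names and the `repeat_rate` measure are illustrative assumptions, not values from the article.

```python
from collections import Counter

# Made-up stock counts per user, in the same shape as the `users` dict.
users_sample = {'alice': 3, 'bob': 1, 'carol': 1, 'dave': 2, 'erin': 1}

# Top users by number of stocked articles (the ranking table).
top = Counter(users_sample).most_common(2)

# How many users stocked exactly k articles (the data behind the histogram).
distribution = Counter(users_sample.values())

# Share of users who stocked more than one article.
repeaters = sum(1 for c in users_sample.values() if c > 1)
repeat_rate = repeaters / len(users_sample)

print(top)
print(sorted(distribution.items()))
print(repeat_rate)
```

Running the same three steps on the real `users` dictionary gives the top stockers, the histogram counts, and a single number for the repeat rate.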
