This month I enrolled in a school called Protoout Studio (https://protoout.studio/), and one of my assignments was to post a Qiita article. I have blogged on note and Hatena Blog before, but this is my first post on Qiita. I usually aim for about 1000 characters per post, but how long is a typical Qiita article? I decided to find out.
Since I wanted to analyze a variety of things anyway, I also extracted metrics other than the length of the article body:
・ Number of title characters
・ Number of tags
・ Number of characters in the text
・ Number of code insertions (I gave up on counting the characters inside the code because the extraction logic is difficult)
・ Number of sections
・ Number of text characters per section
・ Number of LGTMs
・ Number of comments
I fetched 5000 articles containing the word "node.js" with the Qiita API, shaped them into a CSV file, and visualized and analyzed them in Jupyter Notebook.
・ node v12.18.2
・ Visual Studio Code 1.47.1
・ jupyter-notebook 6.0.3
・ python 3.7.6
// Required packages
const axios = require("axios");
const fs = require("fs");
const csvStr = require("csv-stringify/lib/sync");
// Get data with the Qiita API and output it to CSV
async function getArticle(query) {
  // Rows to be converted to CSV (the first row is the header)
  let outcsv = [];
  let columns = ["title","Number of title characters","Number of tags","Number of characters in the text","Number of code insertions","Number of sections","Number of characters per section","LGTM number","Number of comments"];
  outcsv.push(columns);
  // Search parameters: 50 pages x 100 items = up to 5000 articles
  var PAGE_MAX = 50;
  var PER_PAGE = 100;
  // Declare the loop counters with let so they do not leak into the global scope
  for (let PAGE = 1; PAGE <= PAGE_MAX; PAGE++) {
    // GET request
    let response = await axios.get("https://qiita.com/api/v2/items?page=" + PAGE + "&per_page=" + PER_PAGE+"&query="+encodeURIComponent(query));
    // A page can hold fewer than PER_PAGE items, so loop over what was actually returned
    for (let i = 0; i < response["data"].length; i++) {
      // One record (one article)
      let record = [];
      // Metrics to extract
      var title_name = response["data"][i]["title"];
      var title_len = response["data"][i]["title"].length;
      var tag = response["data"][i]["tags"].length;
      var body = response["data"][i]["rendered_body"].replace(/<("[^"]*"|'[^']*'|[^'">])*>/g,'').length;
      var code = (response["data"][i]["rendered_body"].match(/code-frame/g)||[]).length;
      var section = (response["data"][i]["rendered_body"].match(/<h1>|<h2>|<h3>/g)||[]).length;
      var body_ratio = section == 0 ? body : Math.round(body / section); // articles with section = 0 are treated as section = 1
      var like = response["data"][i]["likes_count"];
      var comment = response["data"][i]["comments_count"];
      // Store the record in the list
      record.push(title_name,title_len,tag,body,code,section,body_ratio,like,comment);
      outcsv.push(record);
    }
  }
  // Output as CSV
  fs.writeFileSync("node.csv", csvStr(outcsv));
}
var query = "node.js";
// Log request or file errors instead of leaving the promise rejection unhandled
getArticle(query).catch(console.error);
var body = response["data"][i]["rendered_body"].replace(/<("[^"]*"|'[^']*'|[^'">])*>/g,'').length;
This line counts the characters in the article body. The body returned in the response (rendered_body) contains HTML tags, so they are removed with a regular expression before counting. (Reference: https://qiita.com/miiitaka/items/793555b4ccb0259a4cb8)
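To experiment with the tag-stripping pattern outside Node, it can be carried over to Python unchanged. This is a minimal sketch: the function name `visible_length` and the sample HTML are made up for illustration, and only the regular expression itself comes from the line above.

```python
import re

# Same pattern as the JavaScript line above: "<", then any mix of
# double-quoted values, single-quoted values, or other characters, then ">".
TAG_PATTERN = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")


def visible_length(rendered_html: str) -> int:
    """Return the number of characters left after removing HTML tags."""
    return len(TAG_PATTERN.sub("", rendered_html))


# Made-up sample; note the ">" inside the quoted href value.
sample = '<h1>Title</h1><p>Hello <a href="https://example.com/?a>b">world</a></p>'
print(visible_length(sample))  # counts "Title" + "Hello " + "world"
```

Because quoted attribute values are matched as a unit, a `>` inside an attribute does not end the tag early. Two caveats: character references such as `&amp;` are left in place and counted as several characters (`html.unescape` could be applied after stripping if that matters), and Python's `len` counts code points while JavaScript's `.length` counts UTF-16 code units, so the two can differ for text containing emoji.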
var code = (response["data"][i]["rendered_body"].match(/code-frame/g)||[]).length;
The number of code insertions is counted by matching the class name in the HTML tag `<div class="code-frame">`.
Extracting the code itself turned out to be too difficult, so I gave up on that this time.
I would like to take it on someday.
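One thing to keep in mind is that matching the bare string `code-frame` also counts places where an article merely mentions that word in its text. A stricter variant, sketched here in Python, only counts `<div>` tags whose `class` attribute contains it. The function name and the sample HTML are hypothetical, and the pattern assumes the class is written with double quotes as in the tag shown above.

```python
import re

# Match an opening <div ...> whose double-quoted class attribute contains code-frame.
CODE_FRAME_DIV = re.compile(r'<div\b[^>]*\bclass="[^"]*\bcode-frame\b[^"]*"')


def count_code_frames(rendered_html: str) -> int:
    """Count <div> tags that carry the code-frame class."""
    return len(CODE_FRAME_DIV.findall(rendered_html))


# Made-up sample: one mention in prose, two real code blocks.
sample = (
    "<p>The class is called code-frame.</p>"
    '<div class="code-frame" data-lang="js"><pre>let a = 1;</pre></div>'
    '<div class="code-frame"><pre>print(1)</pre></div>'
)

print(count_code_frames(sample))   # strict count
print(sample.count("code-frame"))  # plain substring count, as in the script
```

For most articles the two counts should agree; they only diverge when the body itself talks about `code-frame`.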
var section = (response["data"][i]["rendered_body"].match(/<h1>|<h2>|<h3>/g)||[]).length;
var body_ratio = section == 0 ? body : Math.round(body / section); // articles with section = 0 are treated as section = 1
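The section count is obtained by extracting and counting the HTML tags `<h1>`, `<h2>`, and `<h3>`.
```python
import pandas as pd
import seaborn as sns

# Read the data
df = pd.read_csv("node.csv")

# Take out the index-th column and draw a histogram
# (the values below are examples; adjust them for each column)
index = 3                # column position; 3 = number of characters in the text
bins = 30                # number of bars
x_min, x_max = 0, 30000  # left and right edges of the figure
df[df.columns[index]].hist(bins=bins, range=(x_min, x_max))

# Calculate summary statistics
df.describe()

# Calculate correlation coefficients (numeric columns only; the title column is text)
df_corr = df.select_dtypes(include="number").corr()

# Draw a heatmap
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)
```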
That covers some basic analysis in Jupyter Notebook. Next I would like to dive into machine learning. Reference: https://pythondatascience.plavox.info/matplotlib/%E3%83%92%E3%82%B9%E3%83%88%E3%82%B0%E3%83%A9%E3%83%A0
Number of title characters: average 36.3, median 34.0, minimum 4.0, maximum 225.0
The average is about 35 characters. The title of this article, "How long should I write a Qiita article after all?", is 23 characters in the original Japanese, so somewhat longer titles seem to be the norm. A 225-character title is amazing; that is really long...
Number of tags: average 3.4, median 3.0, minimum 1.0, maximum 5.0
About 3 is typical, which matches intuition.
Number of characters in the text: average 5700, median 3540, minimum 15.0, maximum 559479
The upper figure shows the whole distribution, and the lower figure cuts the right end off at 30,000. The average appears to be pulled up by a few very long articles, so the median of around 3540 characters gives a better picture of a typical article. An article with 560,000 characters... how many hours would that take to write?
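The way a single huge article drags the average upward can be reproduced with a few numbers. The sketch below uses only Python's standard library; the five ordinary lengths are hypothetical, and 559479 is the maximum reported above.

```python
from statistics import mean, median

# Hypothetical body lengths for five ordinary articles (not real data)
typical_lengths = [1200, 2500, 3500, 3600, 5200]

# The same list plus one extreme article (559479 is the maximum found above)
with_outlier = typical_lengths + [559479]

print(mean(typical_lengths), median(typical_lengths))
print(mean(with_outlier), median(with_outlier))
```

Adding the extreme article barely moves the median, while the mean grows by more than an order of magnitude. That is why the median is the safer summary for a distribution like this one.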
A character count alone is hard to picture, so here are articles that land exactly on round numbers.
・ Exactly 2000 characters award https://qiita.com/keenjoe007/items/c7068c58c63c17388f39 https://qiita.com/lelouch99v/items/3dc11676bb9c23457d41
・ Exactly 3000 characters award https://qiita.com/Ryusuke-Kawasaki/items/87dd43c176a489aa9fa5
・ Exactly 4000 characters award: none
・ Exactly 5000 characters award https://qiita.com/seki0809/items/5f831d63146e44dc106a
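Articles with an exact character count can be looked up directly from the CSV. This is a minimal sketch that assumes the column names written by the script above ("title" and "Number of characters in the text"). The small DataFrame stands in for `pd.read_csv("node.csv")`, and its rows and the helper name `titles_with_length` are made up.

```python
import pandas as pd

LENGTH_COLUMN = "Number of characters in the text"

# Stand-in for pd.read_csv("node.csv"); these rows are made up.
df = pd.DataFrame({
    "title": ["article A", "article B", "article C", "article D"],
    LENGTH_COLUMN: [1999, 2000, 3000, 2000],
})


def titles_with_length(frame: pd.DataFrame, n_chars: int) -> list:
    """Return the titles of articles whose body is exactly n_chars long."""
    return frame.loc[frame[LENGTH_COLUMN] == n_chars, "title"].tolist()


for target in (2000, 3000, 4000):
    print(target, titles_with_length(df, target))
```

An empty list corresponds to the "none" case in the list above.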
Number of code insertions: average 8.4, median 5.0, minimum 0, maximum 222
As before, the upper figure shows the whole distribution, and the lower figure cuts the right end off at 40. The average seems to be pulled up a little, so about 5 insertions is probably typical. Next time I would like to analyze the number of characters in the code as well.
Number of sections: average 9.8, median 8.0, minimum 0, maximum 132
As before, the upper figure shows the whole distribution, and the lower figure cuts the right end off at 40. Articles with 8 to 9 sections seem to be common. I would like to look at the ratio of h1, h2, and h3 separately.
Number of characters per section: average 738, median 448, minimum 15.0, maximum 207108
In the lower figure, the right end is 6000. A typical section is about 450 characters. I did not expect this much variation.
These numbers did not give me a clear picture either, so here are the round-number articles again (the values are rounded, so they are not exact).
・ 300 characters award https://qiita.com/deren2525/items/43386d5d5872967195d4
・ 400 characters https://qiita.com/mejileben/items/cbe0608ee43aa1fab258
・ 500 characters https://qiita.com/ryokkkke/items/602a35595090e2224fbd
Number of LGTMs: average 13.6, median 2.0, minimum 0.0, maximum 1954
In the lower figure, the right end is 40. A median of 2 means that an article with 3 LGTMs already ranks in the upper half, that is, within the top 2500 of the 5000 articles. It seems better not to expect likes the way you would on social media.
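The "upper half" reasoning can be checked by computing the share of articles that fall below a given LGTM count. The sketch below uses only the standard library; the LGTM counts are made up (apart from 1954, the maximum above) and are chosen so that the median is 2, and `share_below` is a hypothetical helper.

```python
from statistics import median


def share_below(values, threshold):
    """Return the fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Made-up LGTM counts for ten articles, skewed like the real distribution
lgtm_counts = [0, 0, 1, 2, 2, 2, 3, 5, 40, 1954]

print(median(lgtm_counts))          # middle of the distribution
print(share_below(lgtm_counts, 3))  # share of articles beaten by 3 LGTMs
```

Whenever the median is 2, at least half of the articles have 2 LGTMs or fewer, so an article with 3 is guaranteed to sit in the upper half.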
Number of comments: average 0.27, median 0.0, minimum 0.0, maximum 26.0
In the lower figure, the right end is set to 5 (the figure looks sparse because bins = 30). The median is 0, and the 75th percentile is also 0. In other words: don't expect comments!
This is a plot of the correlation coefficients. It is intuitively reasonable that the number of characters in the text, the number of code insertions, the number of sections, and the number of characters per section are slightly correlated with one another. It is also intuitive that the number of LGTMs and the number of comments appear slightly correlated.
I expected the number of LGTMs to correlate with the other metrics, but there is little correlation. This suggests that quantity has little to do with LGTMs; perhaps it comes down to content after all. However, I cannot rule out that the result is affected by the large number of records with 0 LGTMs.
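One way to probe the concern about zero-LGTM records is to compute the correlation twice, once on all rows and once on rows with at least one LGTM. This is a sketch rather than the analysis done for this article: the DataFrame is made up, the column names are the ones written by the script above, and `lgtm_correlation` is a hypothetical helper.

```python
import pandas as pd

LGTM_COLUMN = "LGTM number"
LENGTH_COLUMN = "Number of characters in the text"

# Stand-in for pd.read_csv("node.csv"); these rows are made up.
df = pd.DataFrame({
    LENGTH_COLUMN: [500, 1500, 3000, 4000, 6000, 8000],
    LGTM_COLUMN: [0, 0, 0, 2, 5, 9],
})


def lgtm_correlation(frame: pd.DataFrame, feature: str, drop_zero: bool = False) -> float:
    """Pearson correlation between a feature and LGTMs, optionally ignoring 0-LGTM rows."""
    data = frame[frame[LGTM_COLUMN] > 0] if drop_zero else frame
    return data[feature].corr(data[LGTM_COLUMN])


zero_share = (df[LGTM_COLUMN] == 0).mean()  # how much of the data is zero
print(zero_share)
print(lgtm_correlation(df, LENGTH_COLUMN))
print(lgtm_correlation(df, LENGTH_COLUMN, drop_zero=True))
```

If the two coefficients differ noticeably on the real data, the zero-LGTM records are influencing the result; if they are close, the weak correlation is probably not an artifact of the zeros.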
・ Number of title characters: about 35
・ Number of tags: 3
・ Number of characters in the text: about 3500
・ Number of code insertions: about 5
・ Number of sections: 8 to 9
・ Number of characters per section: about 450
・ Number of LGTMs: be happy if you get 2
・ Number of comments: be happy if you get 1
The typical text length came out at about 3500 characters, but that figure includes code. I am curious how much of it is code; my guess is that code accounts for the larger share.
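One possible way to separate code from prose is to walk the HTML with a parser instead of a regular expression and track whether the current text sits inside a code block. The sketch below uses Python's standard `html.parser`. It assumes that code blocks are wrapped in `<div class="code-frame">` as described earlier, treats inline `<code>` as ordinary text, and has not been checked against real Qiita responses; the class name, function name, and sample HTML are made up.

```python
from html.parser import HTMLParser


class CodeCharCounter(HTMLParser):
    """Count visible characters inside code-frame divs and outside them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_stack = []  # one bool per open <div>: is it a code-frame?
        self.code_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append("code-frame" in classes)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        # Text counts as code if any enclosing <div> is a code-frame.
        if any(self.div_stack):
            self.code_chars += len(data)
        else:
            self.text_chars += len(data)


def count_chars(rendered_html: str):
    """Return (code_chars, text_chars) for one rendered article body."""
    counter = CodeCharCounter()
    counter.feed(rendered_html)
    counter.close()  # flush any buffered text
    return counter.code_chars, counter.text_chars


# Made-up sample with one nested code block
sample = (
    "<p>Intro text</p>"
    '<div class="code-frame" data-lang="js"><div class="highlight">'
    "<pre>let a = 1;</pre></div></div>"
    "<p>Done</p>"
)
code_chars, text_chars = count_chars(sample)
print(code_chars, text_chars, code_chars / (code_chars + text_chars))
```

Keeping a stack of open `<div>` tags is what lets nested markup inside a code block be attributed to that block, which is the part that is awkward to express with a single regular expression.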
This time I searched with node.js as the query, but I would like to try other queries too. What happens with python, or with music? It would be worth checking whether the results differ between queries.
Note that the maximum values here are not the maximums across all Qiita articles, because only 5000 articles matching the node.js query were collected.
・ 225 title characters (It's long because it's in English!) https://qiita.com/PINTO/items/865250ee23a15339d556
・ 559479 characters in the text https://qiita.com/K-Hama/items/5c1d4759fd5cbcf397b2
・ 222 code insertions and 132 sections (I thought it might be number one in character count as well, but it has 182,000 characters. Is that the amount of code?) https://qiita.com/y-bash/items/09575a8e3d85656015bc
・ 1954 LGTMs and 26 comments (simply amazing!) https://qiita.com/akaoni_sohei/items/186121bd9994197aab50
The metrics for this article are as follows:
・ Number of title characters: 23
・ Number of tags: 4
・ Number of characters in the text: 6129
・ Number of code insertions: 8
・ Number of sections: 23
・ Number of characters per section: 266
・ Number of LGTMs: 0
・ Number of comments: 0
For your information!