Background

This month I enrolled in a school called Protoout Studio (https://protoout.studio/), and I was given an assignment to post a Qiita article. I have written blog posts on note and Hatena blog before, but this is my first post on Qiita. I usually set myself a target such as 1000 characters per post, but how long is a Qiita article on average? So I looked it up.

Survey target

I wanted to analyze a range of things, so I extracted several elements besides the number of characters in the article body.
・ Number of title characters
・ Number of tags
・ Number of characters in the text
・ Number of code insertions (I gave up on counting the characters inside the code because the extraction logic is difficult)
・ Number of sections
・ Number of text characters per section
・ Number of LGTMs
・ Number of comments

I fetched 5000 articles matching the query "node.js" with the Qiita API, shaped them into a csv, and visualized and analyzed them with jupyter notebook.

Environment

node v12.18.2
Visual Studio Code 1.47.1
jupyter-notebook 6.0.3
python 3.7.6

Sample code

//Required packages
const axios = require("axios");
const fs = require("fs");
const csvStr = require("csv-stringify/lib/sync");

//Get data with Qiita API and output to csv
async function getArticle(query) {
  
  //list for converting to csv
  let outcsv = [];
  let columns = ["title","Number of title characters","Number of tags","Number of characters in the text","Number of code insertions","Number of sections","Number of characters per section","LGTM number","Number of comments"];
  outcsv.push(columns);

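  //Search parameters: 50 pages x 100 articles per page = 5000 articles
  const PAGE_MAX = 50;
  const PER_PAGE = 100;

  for (let PAGE = 1; PAGE <= PAGE_MAX; PAGE++) {
    //GET request
    let response = await axios.get("https://qiita.com/api/v2/items?page=" + PAGE + "&per_page=" + PER_PAGE+"&query="+encodeURIComponent(query));
    
    //Loop over the articles actually returned (a page may hold fewer than PER_PAGE)
    for (let i = 0; i < response["data"].length; i++) {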
      //Information storage list of one record
      let record = [];

      //Elements you want
      var title_name = response["data"][i]["title"];
      var title_len = response["data"][i]["title"].length;
      var tag = response["data"][i]["tags"].length;
      var body = response["data"][i]["rendered_body"].replace(/<("[^"]*"|'[^']*'|[^'">])*>/g,'').length;
      var code = (response["data"][i]["rendered_body"].match(/code-frame/g)||[]).length;
      var section = (response["data"][i]["rendered_body"].match(/<h1>|<h2>|<h3>/g)||[]).length;
      var body_ratio = section == 0 ? body : Math.round(body / section); //articles with section = 0 are treated as having one section
      var like = response["data"][i]["likes_count"];
      var comment = response["data"][i]["comments_count"];
      
      //Store in list
      record.push(title_name,title_len,tag,body,code,section,body_ratio,like,comment); 
      outcsv.push(record);
    }
  }
  //Output as csv
  fs.writeFileSync("node.csv", csvStr(outcsv));
}

var query = "node.js";
//Report a failed request instead of leaving the promise rejection unhandled
getArticle(query).catch(console.error);

Small pieces

Number of characters in the article body

var body = response["data"][i]["rendered_body"].replace(/<("[^"]*"|'[^']*'|[^'">])*>/g,'').length;

This is where the number of characters in the article body is counted. The article body returned in the response contains html tags, so they are removed with a regular expression before counting. (Reference: https://qiita.com/miiitaka/items/793555b4ccb0259a4cb8)
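For anyone who wants to repeat the count on the notebook side, the same idea can be sketched in Python. This is only an illustration: `count_body_chars` and the sample string are made up for this example, and Python's `len` counts code points while JavaScript's `.length` counts UTF-16 units, so the two can differ for text containing emoji.

```python
import re

# Same pattern as the JavaScript snippet above: one html tag, allowing
# quoted attribute values that may themselves contain ">".
TAG_PATTERN = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")


def count_body_chars(rendered_body: str) -> int:
    """Return the length of rendered_body after removing every html tag."""
    return len(TAG_PATTERN.sub("", rendered_body))


# Hypothetical fragment, not taken from a real article.
sample = '<h1>Title</h1><p class="a>b">Hello</p>'
print(count_body_chars(sample))  # "TitleHello" remains, so this prints 10
```

Note that line breaks and spaces between tags are still counted as characters, just as in the original JavaScript.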

Number of code insertions

var code = (response["data"][i]["rendered_body"].match(/code-frame/g)||[]).length;

The number of code insertions is counted by matching the class name in the html tag `<div class="code-frame">`. The logic to extract the code itself was too difficult, so I gave up this time. I'd like to take it on someday.
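A Python sketch of the same count, assuming the rendered html marks each code block with the class name `code-frame` as described above. `count_code_blocks` and the sample html are hypothetical.

```python
def count_code_blocks(rendered_body: str) -> int:
    """Count how many times the substring "code-frame" appears."""
    return rendered_body.count("code-frame")


# Hypothetical fragment with two code blocks.
sample = (
    '<div class="code-frame"><pre>let a = 1;</pre></div>'
    "<p>some text</p>"
    '<div class="code-frame"><pre>a = 1</pre></div>'
)
print(count_code_blocks(sample))  # 2
```

Because this is a plain substring count, an article that mentions the string code-frame in its own text would be counted one time too many.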

Number of sections

var section = (response["data"][i]["rendered_body"].match(/<h1>|<h2>|<h3>/g)||[]).length;
var body_ratio = section == 0 ? body : Math.round(body / section); //articles with section = 0 are treated as having one section

The number of sections is counted by extracting the html tags `<h1>`, `<h2>`, and `<h3>`. Some articles never use any of them, in which case the number of characters per section would become infinite, so when section = 0 the number of characters in the body is assigned to body_ratio as it is.
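The zero-section guard can be sketched in Python as follows. The function name `chars_per_section` is made up for this example, and `math.floor(x + 0.5)` is used instead of Python's built-in `round` because `round` rounds halves to the nearest even number, while JavaScript's `Math.round` rounds them up.

```python
import math
import re

# Same heading tags as the JavaScript snippet above.
HEADING_PATTERN = re.compile(r"<h1>|<h2>|<h3>")


def chars_per_section(body_chars: int, rendered_body: str) -> int:
    """Divide the body length by the number of headings.

    Articles without any heading are treated as a single section,
    which avoids a division by zero.
    """
    sections = len(HEADING_PATTERN.findall(rendered_body))
    if sections == 0:
        return body_chars
    return math.floor(body_chars / sections + 0.5)


print(chars_per_section(900, "<h1>a</h1><h2>b</h2>"))  # 450
print(chars_per_section(100, "<p>no headings</p>"))    # 100
```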

Analysis

import pandas as pd
import seaborn as sns

# Read the data
df = pd.read_csv("node.csv")

# Take out the index-th column and draw a histogram.
# bins is the number of bars; range is (left edge, right edge) of the figure.
# Example values: column 3 is the number of characters in the text.
index = 3
df[df.columns[index]].hist(bins=30, range=(0, 30000))

# Calculate each statistic
df.describe()

# Calculate the correlation coefficients (numeric columns only, since the title column is text)
df_corr = df.select_dtypes("number").corr()

# Draw the heatmap
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0)

I did some basic analysis with jupyter notebook. Next I'd like to dive into machine learning. Reference: https://pythondatascience.plavox.info/matplotlib/%E3%83%92%E3%82%B9%E3%83%88%E3%82%B0%E3%83%A9%E3%83%A0

Results

Number of title characters

Average: 36.3, median 34.0, minimum 4.0, maximum 225.0

A typical title is about 35 characters long. The title of this article, "How long should I write a Qiita article after all?", is 23 characters, so slightly longer titles seem to be common. A 225-character title is amazingly long ...

Number of tags

Average: 3.4, median 3.0, minimum 1.0, maximum 5.0

About 3 is typical, which matches intuition.

Article Text

Average 5700 characters, median 3540, minimum 15.0, maximum 559479

The upper figure shows the whole distribution, and the lower figure cuts the right end off at 30,000. The mean is pulled up by very long articles, so the median of about 3540 characters is the more typical value. An article with 560,000 characters ... how many hours did that take to write ...?
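The gap between the mean and the median can be reproduced with a toy example. The numbers below are hypothetical and are not the survey data: four ordinary articles and one extremely long one.

```python
from statistics import mean, median

# Hypothetical character counts, not taken from node.csv.
body_chars = [2000, 3000, 3500, 4000, 500000]

# The single long article drags the mean up to 102500,
# while the median stays at 3500.
print(mean(body_chars))
print(median(body_chars))
```

With a long right tail like this, the median describes the typical article far better than the mean does.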

The numbers alone don't give much of a feel for the length, so here are articles that hit each round number exactly.
・ Exactly 2000 characters
　　https://qiita.com/keenjoe007/items/c7068c58c63c17388f39
　　https://qiita.com/lelouch99v/items/3dc11676bb9c23457d41
・ Exactly 3000 characters
　　https://qiita.com/Ryusuke-Kawasaki/items/87dd43c176a489aa9fa5
・ Exactly 4000 characters
　　Not applicable
・ Exactly 5000 characters
　　https://qiita.com/seki0809/items/5f831d63146e44dc106a

Number of code insertions

Average 8.4, median 5.0, minimum 0, maximum 222

As before, the upper figure shows the whole distribution, and the lower figure cuts the right end off at 40. The mean is pulled up a little, so about 5 code insertions seems typical. Next, I would like to analyze the number of characters in the code.

Number of sections

Average 9.8, median 8.0, minimum 0, maximum 132

As before, the upper figure shows the whole distribution, and the lower figure cuts the right end off at 40. Most articles seem to have 8 to 9 sections. I would like to look at the ratio of h1, h2, and h3 separately.

Number of characters per section

Average 738, median 448, minimum 15.0, maximum 207108

In the lower figure, the right end is 6000. One section is about 450 characters. The amount of variation was a little unexpected.

These numbers don't give much of a feel either, so here are articles that hit each round number (the values are rounded, so the match is not strict).
・ 300 characters
　　https://qiita.com/deren2525/items/43386d5d5872967195d4
・ 400 characters
　　https://qiita.com/mejileben/items/cbe0608ee43aa1fab258
・ 500 characters
　　https://qiita.com/ryokkkke/items/602a35595090e2224fbd

LGTM number

Average 13.6, median 2.0, minimum 0.0, maximum 1954

In the lower figure, the right end is 40. A median of 2 means that with 3 LGTMs you are already in the top 2500 of the 5000 articles. It seems better not to expect likes the way you would on social media.

Number of comments

Average 0.27, median 0.0, minimum 0.0, maximum 26.0

In the lower figure, the right end is set to 5 (the figure is sparse because bins = 30). The median is 0, and the 75% point was also 0. In other words: don't expect comments!
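A 75% point of 0 simply means that at least three quarters of the articles have no comments at all. A small sketch with hypothetical counts (not the survey data), using `statistics.quantiles`, which requires Python 3.8 or later:

```python
from statistics import mean, quantiles

# Hypothetical comment counts: 17 articles with none, 3 with a few.
comments = [0] * 17 + [1, 2, 5]

# quantiles(..., n=4) returns the 25%, 50% and 75% points.
q1, q2, q3 = quantiles(comments, n=4)
print(q1, q2, q3)      # all three quartiles are 0
print(mean(comments))  # the mean is still above 0 thanks to the few commented articles
```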

Correlation between variables

This is a plot of the correlation coefficients. It seems intuitively reasonable that the number of characters in the text, the number of code insertions, the number of sections, and the number of characters per section are slightly correlated with one another. A slight correlation between the number of LGTMs and the number of comments is also intuitive.

I was expecting a correlation between the number of LGTMs and the other factors, but there is not much. This suggests that quantity has little to do with LGTMs, so perhaps it is the content that matters after all. However, I cannot rule out that the many records with LGTM number = 0 are affecting the result.
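For reference, the value that pandas' `corr()` reports by default is the Pearson correlation coefficient. A minimal standard-library sketch of that calculation, with a made-up function name `pearson` and toy inputs:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: a perfectly increasing and a perfectly decreasing relationship.
print(pearson([1, 2, 3], [2, 4, 6]))  # close to 1
print(pearson([1, 2, 3], [3, 2, 1]))  # close to -1
```

The coefficient only measures linear association, so a column dominated by zeros, such as the LGTM count, can show a weak value even when a few articles stand out.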

Conclusion

Number of title characters: about 35 characters
Number of tags: 3
Number of characters in the text: about 3500 characters
Number of code insertions: about 5 times
Number of sections: 8 to 9 sections
Number of characters per section: about 450 characters
Number of LGTMs: be happy if you get 2
Number of comments: be happy if you get 1

What I want to do

The typical number of characters in the text came out at about 3500, but that count also includes the code. I'm curious how much of it is code. My guess is that the code accounts for more of the characters.
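One possible way to split the count, sketched with Python's standard `html.parser`. This is an untested idea rather than part of the survey: it assumes every code block sits inside a `<div>` whose class list contains `code-frame`, and the class name `CodeCharCounter` and the sample html are made up.

```python
from html.parser import HTMLParser


class CodeCharCounter(HTMLParser):
    """Count characters inside and outside code-frame divs separately."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # greater than 0 while inside a code-frame div
        self.code_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Track nested divs so the matching closing tag is found.
            if self.depth > 0 or "code-frame" in classes:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.code_chars += len(data)
        else:
            self.text_chars += len(data)


# Hypothetical fragment, not taken from a real article.
sample = (
    "<p>Hello</p>"
    '<div class="code-frame"><div class="highlight"><pre>x = 1</pre></div></div>'
    "<p>Bye</p>"
)
counter = CodeCharCounter()
counter.feed(sample)
counter.close()
print(counter.code_chars, counter.text_chars)  # 5 8
```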

This time I searched for articles with node.js as the query, but I would like to try other queries as well. What happens with python as the query? What about music? It would be worth checking whether the results differ between queries.
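Switching the query only changes the `query` parameter of the request URL used in the sample code. A small Python sketch that builds those URLs without sending any request; `build_url` is a made-up helper, and `quote` escapes a few more characters than JavaScript's `encodeURIComponent` does, which a server decodes the same way.

```python
from urllib.parse import quote

BASE_URL = "https://qiita.com/api/v2/items"


def build_url(query: str, page: int, per_page: int = 100) -> str:
    """Build the search URL for one page of results."""
    return (
        f"{BASE_URL}?page={page}&per_page={per_page}"
        f"&query={quote(query, safe='')}"
    )


# Example queries mentioned above.
for q in ["node.js", "python", "music"]:
    print(build_url(q, page=1))
```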

Articles that hit various maximum values

Please note that these are not the maximum values across all Qiita articles, because only 5000 articles matching the query node.js were fetched.

・ 225 title characters (It's long because it's in English!) 　　https://qiita.com/PINTO/items/865250ee23a15339d556

・ 559479 characters in the text 　　https://qiita.com/K-Hama/items/5c1d4759fd5cbcf397b2

・ Number of code insertions 222 times, number of sections 132 (I expected it to top the character count as well, but it has 182,000 characters. Is that down to the amount of code?) 　　https://qiita.com/y-bash/items/09575a8e3d85656015bc

・ LGTM number 1954 Number of comments 26 (purely amazing ...!) 　　https://qiita.com/akaoni_sohei/items/186121bd9994197aab50

By the way

The parameters of this article are as follows:
Number of title characters: 23 characters
Number of tags: 4
Number of characters in the text: 6129 characters
Number of code insertions: 8 times
Number of sections: 23
Number of characters per section: 266 characters
Number of LGTMs: 0
Number of comments: 0

For your information!

[PYTHON] After all, how much should I write a Qiita article?
