[PYTHON] How to get article data using Qiita API

Introduction

[First post] I wanted to fetch Qiita articles by tag, so I implemented it in Python. I started on this because I had used the Livedoor News Corpus to classify article categories with machine learning, and I was advised to try the same thing with Qiita articles. The code may be a little hard to follow in places; if so, please let me know in the comments.

About Qiita API

It is a Web API provided by Qiita that lets you retrieve various data and post articles. https://qiita.com/api/v2/docs

There is an upper limit on how many articles can be retrieved: page can be at most 100, and per_page (the number of articles returned per page) can also be at most 100, so a maximum of 10,000 articles can be retrieved.

Note, however, that fetching many articles in practice requires user authentication because of the rate limit:

The API accepts up to 1,000 requests per hour per user when authenticated, and up to 60 requests per hour per IP address when unauthenticated. (From the official Qiita API documentation)
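The limits quoted above translate into a minimum average spacing between requests. The following is a small arithmetic sketch, not part of the original code; the function name `min_interval_seconds` is made up for illustration.

```python
# Hypothetical helper: average spacing needed to stay within an hourly limit.
SECONDS_PER_HOUR = 3600


def min_interval_seconds(requests_per_hour):
    """Return the average number of seconds to leave between requests."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return SECONDS_PER_HOUR / requests_per_hour


# Limits quoted from the Qiita API documentation
authenticated = min_interval_seconds(1000)   # 3.6 seconds
unauthenticated = min_interval_seconds(60)   # 60.0 seconds
print(authenticated, unauthenticated)
```

At 60 requests per hour, a job that needs 100 page requests cannot finish within one hour, which is why the access token is worth setting up.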

This time I want 900 articles in total, which is within these limits, so I fetch them page by page using the page and per_page parameters.
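To see how the page and per_page limits interact, here is a minimal sketch that computes how many pages a given per_page value needs. The helper name `pages_needed` and the constants are my own; only the two limits of 100 come from the article.

```python
import math

# Limits stated in this article for the Qiita API
MAX_PAGE = 100
MAX_PER_PAGE = 100


def pages_needed(total_articles, per_page):
    """Return how many pages must be requested to cover total_articles."""
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError("per_page must be between 1 and 100")
    pages = math.ceil(total_articles / per_page)
    if pages > MAX_PAGE:
        raise ValueError("this per_page needs more than 100 pages")
    return pages


print(pages_needed(100, 1))    # 100 pages of 1 article
print(pages_needed(900, 9))    # 100 pages of 9 articles
print(pages_needed(900, 100))  # 9 pages of 100 articles
```

With per_page = 1 the page cap limits one query to 100 articles, so 900 articles need a per_page of at least 9.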

How to get access token for Qiita API

First, get the access token required for user authentication.

・ Select "Application" from "Settings".

・ Under "Personal access token", choose "Issue a new token".

・ This time, check only read_qiita and click "Issue".

・ A token is issued, so copy it.
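Rather than pasting the copied token into the source file, one option is to read it from an environment variable. This is a minimal sketch, assuming the token was exported as `QIITA_TOKEN`; that variable name and the helper `build_headers` are chosen here for illustration and are not defined by Qiita.

```python
import os


def build_headers(env_var="QIITA_TOKEN"):
    """Build the Authorization header from a token stored in the environment.

    QIITA_TOKEN is a hypothetical variable name used only in this example.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(env_var + " is not set")
    return {"Authorization": "Bearer " + token}


# Usage (after exporting the variable in your shell):
# h = build_headers()
```

This keeps the token out of any code you publish, for example in a Qiita article.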

Code example for user authentication of Qiita API

```python
import http.client

# Header required for user authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
# Base path of the article (items) endpoint
url = "/api/v2/items?"
```

Code example to get the article

```python
import io
import os

import pandas as pd

# Specify the tag you want to get
tag_name = "Java"
query = "query=tag%3A" + tag_name

# First request: check the status and how many articles match the tag
connect.request("GET", url + query, headers=h)
res = connect.getresponse()
# The body must be read before the connection can be reused
res.read()
print(res.status, res.reason)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))

# Make sure the output directory exists
out_dir = "./qiita/" + tag_name + "/page/"
os.makedirs(out_dir, exist_ok=True)

# Get 100 articles (100 pages x 1 article) and write each to a txt file
for pg in range(1, 101):
    page = "page=" + str(pg) + "&per_page=1"
    connect.request("GET", url + page + "&" + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    # Parse the JSON response into a pandas DataFrame
    df = pd.read_json(io.StringIO(data))
    filename = out_dir + str(pg) + ".txt"
    # Write only the title and body of the article
    df.to_csv(filename, columns=['title', 'body'], header=False, index=False)
    print(str(pg) + "/100 Done")
```
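The request path above is assembled by string concatenation with a hand-escaped `%3A`. As an alternative sketch, the same path can be built with `urllib.parse.urlencode` from the standard library, which does the escaping. The helper name `build_items_path` is hypothetical.

```python
from urllib.parse import urlencode

BASE_PATH = "/api/v2/items"


def build_items_path(tag_name, page, per_page):
    """Return the request path for one page of articles with the given tag."""
    params = {
        "page": page,
        "per_page": per_page,
        "query": "tag:" + tag_name,  # urlencode escapes ":" as %3A
    }
    return BASE_PATH + "?" + urlencode(params)


print(build_items_path("Java", 1, 1))
# /api/v2/items?page=1&per_page=1&query=tag%3AJava
```

This also handles tag names that contain characters such as `#`, which would otherwise have to be escaped by hand.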

Explanation of the above code

User authentication

For user authentication, the access token must be specified in the Authorization request header, in the form 'Bearer [Obtained access token]'.
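If the header is missing, the request is treated as unauthenticated, so it helps to look at the status code before parsing the body. The sketch below is only an illustration: `fetch_json` is a hypothetical helper, and it treats any status other than 200 as an error rather than relying on the exact status Qiita returns for a bad token.

```python
import json


def fetch_json(connect, path, headers):
    """GET path on an http.client-style connection and return parsed JSON.

    Raises RuntimeError when the status is not 200.
    """
    connect.request("GET", path, headers=headers)
    res = connect.getresponse()
    # Always read the body so the connection can be reused
    body = res.read().decode("utf-8")
    if res.status != 200:
        raise RuntimeError("request failed: %s %s" % (res.status, res.reason))
    return json.loads(body)


# Usage with the connection and header from this article:
# items = fetch_json(connect, "/api/v2/items?page=1&per_page=1", h)
```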

Get json file

In the Qiita API, article data is returned as JSON.
https://qiita.com/api/v2/docs#%E6%8A%95%E7%A8%BF

When getting it, I use the read_json function of the pandas library to convert it to pandas DataFrame format.
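As a small illustration of that conversion, the sketch below parses a made-up JSON array shaped like an article list. The sample values are invented for this example and are not real API output.

```python
import io

import pandas as pd

# Made-up sample in the same shape as an article list (a JSON array of objects)
sample = '''[
  {"title": "First title", "body": "First body", "likes_count": 3},
  {"title": "Second title", "body": "Second body", "likes_count": 5}
]'''

# Wrapping the text in StringIO avoids passing a literal JSON string,
# which recent pandas versions warn about
df = pd.read_json(io.StringIO(sample))

# Keep only the columns that are written to the txt files
print(df[["title", "body"]])
```

Each JSON object becomes one row, and each key becomes a column, so `title` and `body` can be selected by name.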

Code to get 900 articles with the specified tag

Here is the whole code.

```python
# Library import
import http.client
import io
import os

import pandas as pd

# Total number of articles you want to get
TOTAL_ARTICLES = 900
# The API accepts page values from 1 to 100
MAX_PAGE = 100
# Articles per page: 900 / 100 = 9 (the API allows up to 100)
PER_PAGE = TOTAL_ARTICLES // MAX_PAGE

# User authentication
h = {'Authorization': 'Bearer [Obtained access token]'}
connect = http.client.HTTPSConnection("qiita.com")
url = "/api/v2/items?"

# Tag to specify
tag_name = "Java"
query = "query=tag%3A" + tag_name

# First request: check the status and how many articles match the tag
connect.request("GET", url + query, headers=h)
res = connect.getresponse()
# The body must be read before the connection can be reused
res.read()
print(res.status, res.reason)
print("Specified tag: " + tag_name)
total_count = int(res.headers['Total-Count'])
print("total_count: " + str(total_count))

# Make sure the output directory exists
out_dir = "./qiita/" + tag_name + "/"
os.makedirs(out_dir, exist_ok=True)

# Get 900 articles and write each one to a txt file.
# page cannot exceed 100, so 900 articles need 100 pages x 9 articles.
for pg in range(1, MAX_PAGE + 1):
    page = "page=" + str(pg) + "&per_page=" + str(PER_PAGE)
    connect.request("GET", url + page + "&" + query, headers=h)
    res = connect.getresponse()
    data = res.read().decode("utf-8")
    df = pd.read_json(io.StringIO(data))
    # One file per article: page<page number>-<position within the page>.txt
    for n in range(len(df)):
        filename = out_dir + "page" + str(pg) + "-" + str(n + 1) + ".txt"
        df.iloc[[n]].to_csv(filename, columns=['title', 'body'],
                            header=False, index=False)
    print(str(pg) + "/" + str(MAX_PAGE) + " Done")
```

Result

The output is hard to read, but I got 900 article files.
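One quick way to confirm the result is to count the files that were written. This is a minimal sketch, assuming the files sit directly under ./qiita/[tag name]/ as in the code above; `count_saved_articles` is a name made up for this example.

```python
from pathlib import Path


def count_saved_articles(directory):
    """Count the .txt files directly under directory."""
    return sum(1 for p in Path(directory).glob("*.txt") if p.is_file())


# Usage:
# print(count_saved_articles("./qiita/Java"))
```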


Summary

This time I retrieved only the title and body of each article, but you can also get other fields such as the number of likes and the update date. If you want other items, please refer to the official Qiita API documentation. Please give it a try!
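As a sketch of picking other fields, the example below uses only the standard library (json and csv) instead of pandas. The field names `likes_count` and `updated_at` are what I understand the API to use for the number of likes and the update date, so check them against the official documentation; the sample record is made up.

```python
import csv
import io
import json

# Assumed field names; verify them in the official Qiita API documentation
FIELDS = ["title", "likes_count", "updated_at"]


def items_to_csv(json_text, fields=FIELDS):
    """Convert a JSON array of articles to CSV text with the chosen fields."""
    items = json.loads(json_text)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for item in items:
        # Missing keys become empty cells instead of raising KeyError
        writer.writerow([item.get(f, "") for f in fields])
    return buf.getvalue()


# Made-up sample record, not real API output
sample = ('[{"title": "Sample title", "body": "Sample body", '
          '"likes_count": 3, "updated_at": "2020-02-01T19:00:00+09:00"}]')
print(items_to_csv(sample))
```

Changing `FIELDS` is enough to write a different set of items for each article.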

Reference material

・ Qiita API official https://qiita.com/api/v2/docs

・ Get Qiita article information with API and write it to CSV https://qiita.com/arai-qiita/items/94902fc0e686e59cb8c5
