[PYTHON] Get the minutes of the Diet via API

TL;DR

Call the API from Python to collect whichever Diet minutes you need.

1. Official information

You can also search through the GUI of the National Diet Library search system, but there is also a proper API manual (.jp/api.html).
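
For orientation, a request to the speech-search endpoint used by the script in section 2 is a single GET URL. The sketch below only builds that URL and sends nothing. The helper name `build_speech_url` is hypothetical; the endpoint and the parameters (`maximumRecords`, `recordPacking`, `from`, `until`, `any`, `startRecord`) are copied from that script, and whether the endpoint is still served at this address is an assumption this sketch does not check.

```python
import urllib.parse

# Endpoint as written in the script in section 2 (not verified here).
API_PATH = 'http://kokkai.ndl.go.jp/api/1.0/speech?'

def build_speech_url(keyword: str, start: int = 1,
                     startdate: str = '2010-01-01',
                     enddate: str = '2020-01-01') -> str:
    """Hypothetical helper: build one search URL the same way the script does."""
    query = (
        'maximumRecords=100&recordPacking=xml'
        + '&from=' + startdate
        + '&until=' + enddate
        + '&any=' + keyword
        + f'&startRecord={start}'
    )
    # As in the script, the whole query string is percent-encoded,
    # so "=" becomes %3D and "&" becomes %26.
    return API_PATH + urllib.parse.quote(query)

url = build_speech_url('AI')
print(url)
```

Printing the URL before fetching it makes it easy to compare the request against the API manual.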

2. Search & get by specifying keywords

Here we collect the statements made during the 10 years from 2010 to 2019 that contain any of the keywords passed to getSpeech at the end of the script ('Artificial intelligence', 'AI', 'big data', 'Machine learning').

# -*- coding: utf-8 -*-
"""
Created on Thu Dec 26 15:05:04 2019

@author: boomin

pip install untangle
"""

import urllib.parse
import untangle

import re
import pandas as pd
import os

spt = os.sep
pklDir  = "pkl"

def getSpeech(keyword:str):
    start = "1"  #Position (serial number) of the first record to fetch
    apipath = 'http://kokkai.ndl.go.jp/api/1.0/speech?'

    #Regular expression that separates the leading "○<speaker>" label
    #(optionally followed by the honorific 君) from the text of the statement
    p = re.compile(r'^○([^ ]+)君?\s(.+)')

    startdate='2010-01-01'
    enddate= '2020-01-01'

    df = pd.DataFrame()

    while start is not None:
        date = []
        speaker = []
        speech = []
        speakerGroup = []
        speakerPosition = []

        url = apipath+urllib.parse.quote(
            'maximumRecords=100&recordPacking=xml'
            + '&from=' + startdate
            + '&until=' + enddate
            + '&any=' + keyword
            + f'&startRecord={start}'
        )
        #Send the GET request and parse the search results (XML)
        obj = untangle.parse(url)

        for record in obj.data.records.record:
            speechrecord = record.recordData.speechRecord

            speechdata = speechrecord.speech.cdata.replace("\u3000"," ").replace("\n"," ")
            m = p.search(speechdata)
            if m is not None:
                date.append(speechrecord.date.cdata)
                speaker.append(speechrecord.speaker.cdata)
                speech.append(m.group(2))
                speakerGroup.append(speechrecord.speakerGroup.cdata)
                speakerPosition.append(speechrecord.speakerPosition.cdata)

        offset = int(start)-1
        index = [ offset+n for n in list(range(len(date))) ]
        adddf = pd.DataFrame({
            "date":date, 
            "speaker":speaker,
            "speech":speech,
            "speakerGroup":speakerGroup,
            "speakerPosition":speakerPosition,
          }, index=index)
        df = pd.concat([df, adddf ])

        #Only 100 records are returned per request, so move the start position and repeat the GET request.
        try:
            start = obj.data.nextRecordPosition.cdata
            print(f"finished: {start}")
        except AttributeError:
            #No nextRecordPosition element means this was the last page
            break

    df["date"] = pd.to_datetime(df["date"])
    return df

if __name__ == '__main__':
  
    df1 = getSpeech('Artificial intelligence')
    df2 = getSpeech('AI')
    df3 = getSpeech('big data')
    df4 = getSpeech('Machine learning')

    df = pd.concat([df1,df2,df3,df4])
    #Delete duplicate remarks
    df.drop_duplicates(subset=["date","speaker","speech"], inplace=True)
    df.sort_values(by=["date","speaker"],inplace=True)

    df.reset_index(drop=True, inplace=True)

    #Create the output directory if it does not exist yet
    os.makedirs(pklDir, exist_ok=True)
    pd.to_pickle(df, f"{pklDir}{spt}kokkailog.pkl")
    df.to_csv(f"{pklDir}{spt}kokkailog.tsv", sep="\t")
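
The paging in getSpeech can be hard to follow inside the full script, so here is the loop on its own. This is a minimal sketch that runs offline: `fetch_page_stub` is a hypothetical stand-in for one API call, and the total of 250 records is made up. Only the idea is taken from the script: each response carries the position of the next record, and the loop stops when that position is missing.

```python
def fetch_page_stub(start: int, total: int = 250, page_size: int = 100) -> dict:
    """Hypothetical stand-in for one API request (no network access)."""
    last = min(start + page_size - 1, total)
    # The next position is absent (None here) on the final page.
    next_position = last + 1 if last < total else None
    return {"records": list(range(start, last + 1)),
            "nextRecordPosition": next_position}

def collect_all(fetch) -> list:
    """Request page after page until no next position is returned."""
    records = []
    start = 1
    while start is not None:
        page = fetch(start)
        records.extend(page["records"])
        start = page["nextRecordPosition"]
    return records

all_records = collect_all(fetch_page_stub)
print(len(all_records))
```

In the real script the missing position shows up as an exception from untangle rather than as None, but the control flow is the same.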

3. Obtained data

In [4]: df.tail()
Out[4]: 
#           date speaker             ...  speakerGroup  speakerPosition
#4288 2019-12-05 Taku Eto            ...  Liberal Democratic Party, Group of Independents  Minister of Agriculture, Forestry and Fisheries
#4289 2019-12-05 Masayoshi Hamada    ...  Komeito
#4290 2019-12-05 Mitsuko Ishii       ...  Japan Restoration Party
#4291 2019-12-05 Takashi Midorikawa  ...  Constitutional Democratic / National / Social Insurance / Independent Forum
#4292 2019-12-05 Koichi Hagiuda      ...  Liberal Democratic Party, Independent  Minister of Education, Culture, Sports, Science and Technology
#
#[5 rows x 5 columns]
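
A quick way to check the collected data is to count statements per year. The sketch below does this with the standard library only, on a tiny in-memory TSV. The rows are made up for illustration, and the layout (an unnamed index column followed by date, speaker, speech, speakerGroup, speakerPosition, with dates written as YYYY-MM-DD) is an assumption about what df.to_csv produces for this DataFrame. `count_by_year` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed layout of kokkailog.tsv.
sample_tsv = (
    "\tdate\tspeaker\tspeech\tspeakerGroup\tspeakerPosition\n"
    "0\t2010-03-01\tSpeaker A\ttext 1\tGroup X\t\n"
    "1\t2019-12-05\tSpeaker B\ttext 2\tGroup Y\tMinister\n"
    "2\t2019-12-05\tSpeaker C\ttext 3\tGroup Z\t\n"
)

def count_by_year(tsv_file) -> Counter:
    """Count rows per year, taking the year from the first 4 characters of 'date'."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["date"][:4] for row in reader)

counts = count_by_year(io.StringIO(sample_tsv))
print(sorted(counts.items()))
```

To run it on the real output, pass an open file instead of the StringIO object, for example `open("pkl/kokkailog.tsv", encoding="utf-8", newline="")`.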
