[PYTHON] Get the tag search results of Nico Nico Douga in XML format. (No login required)

Overview

This script fetches the tag search results of Nico Nico Douga in XML format. A search API already exists, but it requires login and is awkward to use (see the reference article "Search API specifications for Nico Nico Douga"), so I wrote Python code that works without logging in.

Code

nc2xml.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# nc2xml.py
# Copyright (c) 2014 nezuq
# This software is released under the MIT License.
# http://opensource.org/licenses/mit-license.php

import sys
import time
import datetime
import urllib
import urllib2
from lxml import etree

argvs = sys.argv
argc = len(argvs)

#Search keyword
KEYWORD = u'Kimashitawa'
if 1 < argc:
    KEYWORD = argvs[1].decode('utf-8')

#Sort order
#SORT_TYPE = 'n' #Most recent comment time
#SORT_TYPE = 'v' #View count
#SORT_TYPE = 'm' #Mylist count
#SORT_TYPE = 'r' #Comment count
SORT_TYPE = 'f' #Upload time
#SORT_TYPE = 'l' #Video length
if 2 < argc:
    SORT_TYPE = argvs[2]

#First result page to collect video IDs from
MIN_PAGENUMBER = 1
if 3 < argc:
    MIN_PAGENUMBER = int(argvs[3])

#Last result page to collect video IDs from
MAX_PAGENUMBER = 3
if 4 < argc:
    MAX_PAGENUMBER = int(argvs[4])

#Search method
SEARCH_TYPE = 'tag' #Tag search

def main():
    keyword = urllib2.quote(KEYWORD.encode("utf-8"))
    ids = []
    nrow_bef = -1
    npage = MIN_PAGENUMBER - 1
    while ((nrow_bef != len(ids)) and (npage < MAX_PAGENUMBER)):
        npage += 1
        uri = 'http://www.nicovideo.jp/%s/%s?sort=%s&rss=2.0&page=%d' % (SEARCH_TYPE, keyword, SORT_TYPE, npage)
        print u'Inquiring about video ID:' + uri
        time.sleep(1)
        res = urllib2.urlopen(uri)
        rss = etree.fromstring(res.read())
        nrow_bef = len(ids)
        ids += map((lambda x: x.text.rsplit('/', 1)[1]), rss.findall('./channel/item/link'))
    query = 'http://i.nicovideo.jp/v3/video.array?v=' + ','.join(ids)
    print u'Acquiring video data:' + query
    d = datetime.datetime.today()
    #* If the query takes longer than 20 seconds, it times out and returns 0 results.
    #* Do not issue heavy queries; they may get your IP blocked or the API shut down.
    time.sleep(1)
    urllib.urlretrieve(query, u'%s_%s.xml' % (KEYWORD, d.strftime('%y%m%d%H%M%S')))
    print 'end!'

if __name__ == '__main__':
    main()
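
The ID-extraction step inside main() can be tried on its own without touching the network. The following is only a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree in place of lxml; the sample feed and the helper name extract_video_ids are made up for illustration and are not part of nc2xml.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an RSS 2.0 feed (an assumption for
# illustration; the real feed carries many more fields per item).
SAMPLE_RSS = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>sample</title>
    <item><link>http://www.nicovideo.jp/watch/sm9720246</link></item>
    <item><link>http://www.nicovideo.jp/watch/sm19673281</link></item>
  </channel>
</rss>"""


def extract_video_ids(rss_text):
    """Return the last path segment of every <item><link> in the feed."""
    root = ET.fromstring(rss_text.encode("utf-8"))
    links = root.findall("./channel/item/link")
    # Same idea as the rsplit('/', 1)[1] call in nc2xml.py.
    return [link.text.rsplit("/", 1)[1] for link in links if link.text]


print(extract_video_ids(SAMPLE_RSS))
```

A list comprehension is used instead of map() because map() returns an iterator in Python 3 and cannot be appended to a list with += in the same way the length check in the loop expects.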

Run nc2xml.py.


python nc2xml.py "lily" m 1 3
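
The four positional arguments in the command above map, in order, to the search keyword, sort order, first page, and last page. As a rough sketch of that mapping (assuming Python 3; the helper name parse_args is hypothetical and does not appear in nc2xml.py), omitted arguments fall back to the defaults set in the script:

```python
def parse_args(argv):
    """Map positional arguments to settings, using the script's defaults
    (keyword 'Kimashitawa', sort 'f', pages 1 to 3) for anything omitted."""
    defaults = ["Kimashitawa", "f", "1", "3"]
    given = argv[1:5]
    values = given + defaults[len(given):]
    keyword, sort_type, min_page, max_page = values
    return keyword, sort_type, int(min_page), int(max_page)


print(parse_args(["nc2xml.py", "lily", "m", "1", "3"]))
print(parse_args(["nc2xml.py", "lily"]))
```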

Example output

http://i.nicovideo.jp/v3/video.array?v=sm9720246,sm19673281
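
For reference, the two URLs the script requests can be rebuilt as follows. This is a sketch assuming Python 3, where urllib2.quote becomes urllib.parse.quote; the function names build_search_url and build_array_url are made up for illustration, and the URL patterns are copied from nc2xml.py as written in 2014, with no claim that the endpoints still respond.

```python
from urllib.parse import quote

SEARCH_TYPE = "tag"  # same value as in nc2xml.py


def build_search_url(keyword, sort_type, page):
    """RSS URL for one page of tag search results."""
    # quote() percent-encodes the keyword as UTF-8 by default.
    return "http://www.nicovideo.jp/%s/%s?sort=%s&rss=2.0&page=%d" % (
        SEARCH_TYPE, quote(keyword), sort_type, page)


def build_array_url(ids):
    """URL that requests data for all collected video IDs at once."""
    return "http://i.nicovideo.jp/v3/video.array?v=" + ",".join(ids)


print(build_search_url("lily", "m", 1))
print(build_array_url(["sm9720246", "sm19673281"]))
```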

Impressions

I have to analyze the data for the development of Yuri culture ... (sense of mission)

Reference page

-> Nico Nico Douga search API specifications organized
-> Comparison of i.nicovideo.jp API and getthumbinfo API
-> I searched for Nico Nico Douga search API, so make a note
-> Get the tag search results of Nico Nico Douga by RSS
