[PYTHON] Collect anime song lyrics with Scrapy

# Introduction

I heard that the Python library Scrapy is simple and easy to use, so I immediately tried it out.

# Environment

The stable pyenv + Anaconda (Python 3) setup

# Collected data

I decided to collect the lyrics of the anime songs added between July 31st and November 30th, as listed on the "Latest Additional Songs" page of an anime song lyrics site (the page used as the start URL below).

# Creation procedure

## Installation

It can be installed with pip.

```
$ pip install scrapy
```
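If you want to make sure the installation worked, you can, for example, check the version:

```
$ scrapy version
```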
## Project creation
The project name can be anything you like; this time I adopted the name from the tutorial I was following as-is.

```
$ scrapy startproject aipa_commander
```

## Check the contents

I am brand new to scraping, so I have no idea what the generated files are for. For the time being, I will leave them alone until I can use Scrapy to some extent.
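For reference, the generated project should look roughly like this (the exact files depend on your Scrapy version):

```
aipa_commander/
├── scrapy.cfg            # deployment configuration
└── aipa_commander/       # the project's Python module
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/          # the directory where spider scripts go
        └── __init__.py
```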

## Program creation

The only directory a beginner like me needs to touch is `aipa_commander/spiders/` (`aipa_commander` being the project name chosen earlier). Create a Python script file there. After a lot of trial and error, the code finally ended up like this.

get_kashi.py

```python
# -*- coding: utf-8 -*-

import os

import scrapy


class KashiSpider(scrapy.Spider):
    name = 'kashi'

    # List page of the recently added songs
    start_urls = ['http://www.jtw.zaq.ne.jp/animesong/tuika.html']

    # Wait one second between requests so as not to put load on the site
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        # Follow every link in the second table column;
        # each one points to an individual lyrics page
        for href in response.xpath('//td[2]/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_item)

    def parse_item(self, response):
        # The lyrics live in a <pre> element; the first line is the song title
        kashi = response.xpath('//pre/text()').extract()
        kashi = kashi[0].split('\n')
        # Make sure the output directory exists, then write one file per song
        os.makedirs('./lyrics', exist_ok=True)
        with open('./lyrics/{}.txt'.format(kashi[0]), 'w', encoding='utf-8') as f:
            for line in kashi:
                f.write(line + '\n')
```

Scrapy is amazing: with just a few lines of code you can grab the lyrics of 200 songs in one go.

I wrote it by referring to the code in the official tutorial, so there is not much about the code that I can explain myself. What I struggled with most, having no knowledge of HTML or CSS, was specifying the XPath expressions such as `response.xpath('//td[2]/a/@href')` and `response.xpath('//pre/text()').extract()`.

However, a feature that turned out to be a lifesaver was waiting for me: the Scrapy shell.

```
$ scrapy shell "url"
```

starts the shell, and running

```
>>> sel.xpath('//td[2]/a/@href')
```

gives output like this:

```
[<Selector xpath='//td[2]/a/@href' data='ku/qualidea/brave.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/axxxis.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/gravity.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/yakusoku.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/clever.html'>,
 <Selector xpath='//td[2]/a/@href' data='to/drefes/pleasure.html'>,
 ... (the rest omitted)]
```

The result can be checked easily this way. With the shell, you can experiment with how to get the data you want without rewriting the script every time. It is really convenient, so scraping beginners should definitely take advantage of it.
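For example, while the shell is open on the list page, you can also check how the relative links from the output above resolve into full URLs (this is what `response.urljoin` does in the spider):

```
>>> response.urljoin('ku/qualidea/brave.html')
'http://www.jtw.zaq.ne.jp/animesong/ku/qualidea/brave.html'
```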

I may write about how to construct XPath expressions another time if the opportunity arises. The two used this time work as follows:

- `xpath('//td[2]/a/@href')` extracts only the link URL (the `href` of the `<a>` element) from every second `<td>` cell.
- `xpath('//pre/text()').extract()` extracts only the text inside every `<pre>` element.
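As a minimal sketch of what these two expressions do, here is a standalone example that runs them against a small hand-written HTML fragment (the fragment is hypothetical, not the real page markup):

```python
# Minimal sketch: run the two XPath expressions against a hypothetical HTML fragment.
from scrapy.selector import Selector

html = """
<table>
  <tr>
    <td>QUALIDEA CODE</td>
    <td><a href="ku/qualidea/brave.html">Brave freak out</a></td>
  </tr>
</table>
<pre>Brave freak out
(lyrics would follow here)
</pre>
"""

sel = Selector(text=html)

# The href of the <a> inside every second <td>
print(sel.xpath('//td[2]/a/@href').extract())
# -> ['ku/qualidea/brave.html']

# Only the text inside every <pre>
print(sel.xpath('//pre/text()').extract())
# -> ['Brave freak out\n(lyrics would follow here)\n']
```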

# Execution result

I ran

```
$ scrapy crawl kashi
```

(the `kashi` part is the keyword specified in the spider's `name` attribute), and 200 text files like this were generated.
 ![スクリーンショット 2016-12-14 0.20.23.png](https://qiita-image-store.s3.amazonaws.com/0/125193/b660612a-0d67-1311-5238-ecb093b06b15.png)

The contents of a text file look like this (only part of it, since it is long):
 ![スクリーンショット 2016-12-14 0.23.52.png](https://qiita-image-store.s3.amazonaws.com/0/125193/053df284-92ce-a9fa-9c99-e3c6d0020d97.png)

# In conclusion
I was impressed by how much easier collecting the data was than I had imagined. Next time I would like to try the same thing with images.

