I did some basic crawling and scraping with Scrapy.

What I made

A Scrapy project that collects the title and total score of each 2016 winter anime from the anime information site https://www.anikore.jp/chronicle/2016/winter/

Creation procedure

Project creation

As described in the official tutorial (https://doc.scrapy.org/en/1.2/intro/tutorial.html), a project is created by running the following command:

% scrapy startproject project_name

This time, I used `anime` as the project_name.
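
`scrapy startproject` also generates a `settings.py` file inside the project package, and a few settings are worth reviewing before crawling a real site. The fragment below is a minimal sketch rather than the configuration used for this article; the values are assumptions chosen for illustration.

```python
# settings.py (fragment); the values below are illustrative assumptions.

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Wait about one second between requests to the same site
# so the crawl does not put much load on the server.
DOWNLOAD_DELAY = 1

# Write exported feeds as UTF-8 so Japanese titles stay readable
# instead of being written as \uXXXX escape sequences.
FEED_EXPORT_ENCODING = "utf-8"
```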

Spider creation

Next, create a Python file for the Spider (the scraper) in the project's spiders directory. This time, I named it `anime_spider.py`.

The finished product looks like this:

`anime_spider.py`


import scrapy


class AnimeSpider(scrapy.Spider):
    name = "anime"
    start_urls = [
        'https://www.anikore.jp/chronicle/2016/winter/'
    ]

    def parse(self, response):
        for anime in response.css('div.animeSearchResultBody'):
            yield {
                'title': anime.css('span.animeTitle a::text').extract_first(),
                'score': anime.css('span.totalRank::text').extract_first()
            }

        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Spider name

name = "anime"

This is the Spider's name. To run the scrape, pass the name declared here to the crawl command as follows.

% scrapy crawl anime

Crawling start URL

start_urls = [
    'https://www.anikore.jp/chronicle/2016/winter/'
]

This is the URL where crawling starts. This time I want a list of the 2016 winter anime, so I set it to the top page of the 2016 winter anime list.

Scraping

Both scraping and crawling are done in a method called `parse()`. The scraping part is shown below.

for anime in response.css('div.animeSearchResultBody'):
    yield {
        'title': anime.css('span.animeTitle a::text').extract_first(),
        'score': anime.css('span.totalRank::text').extract_first()
    }

In Scrapy you can select data with either CSS selectors or XPath; this time I used CSS selectors.

On this site, each anime's entry is wrapped in a div tag with the `animeSearchResultBody` class, so the entries for all the anime displayed on the page are obtained as follows.

response.css('div.animeSearchResultBody')

I only want the title and total score from each extracted entry, so I extract them as follows.

yield {
    'title': anime.css('span.animeTitle a::text').extract_first(),
    'score': anime.css('span.totalRank::text').extract_first()
}

`extract_first()` extracts the first matching element.

anime.css('span.animeTitle a::text')[0].extract()

You can also access the first element by index as shown above, but I use `extract_first()` because it returns None instead of raising an IndexError when nothing matches.
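
The difference between the two styles can be sketched in plain Python. This is only an analogy: ordinary lists stand in for the selector results, and `first_or_none` is a hypothetical helper that mimics the behavior described above, not Scrapy's implementation.

```python
def first_or_none(values):
    """Return the first item, or None when the sequence is empty."""
    for value in values:
        return value
    return None


matches = ["Example Title"]   # stand-in for a selector that matched
no_matches = []               # stand-in for a selector that matched nothing

print(first_or_none(matches))     # first item
print(first_or_none(no_matches))  # None, no exception

try:
    no_matches[0]                 # indexing an empty result fails
except IndexError as exc:
    print("IndexError:", exc)
```

In a spider, getting None back means the item is still yielded with an empty field, so one missing element on a page does not stop the whole crawl.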

Crawling

Crawling is handled by the following part.

next_page = response.css('a.next::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

The URL of the next page is built from the href attribute of the Next button. Because each request for the next page uses `parse()` as its callback again, the crawl continues until there are no more pages.
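
`response.urljoin()` resolves the href against the URL of the current page, in the same way as the standard library's `urljoin`. The sketch below uses `urllib.parse.urljoin` directly to show the effect; the href values are hypothetical examples, not taken from the site.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL from the spider).
page_url = "https://www.anikore.jp/chronicle/2016/winter/"

# Hypothetical href values, for illustration only.
relative_href = "/chronicle/2016/winter/?page=2"
absolute_href = "https://www.anikore.jp/chronicle/2016/spring/"

# A relative href is combined with the scheme and host of the page URL.
next_url = urljoin(page_url, relative_href)
print(next_url)

# An href that is already absolute is returned unchanged.
other_url = urljoin(page_url, absolute_href)
print(other_url)
```

This is why the spider can pass whatever the href contains to `response.urljoin()` without checking whether it is relative or absolute.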

Run

Finally, let's run the program. I introduced the run command in "Spider name"; here I add an option to write the output to a file in JSON format. Run the following command in the project directory.

% scrapy crawl anime -o anime.json

This gave me the titles and total scores of the 2016 winter anime.
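
Once `anime.json` exists, it can be processed with the standard `json` module. The following is a minimal sketch that ranks items by score; the sample records are made up, and it assumes that `score` is a numeric string that may be None when nothing matched. `load_items` and `rank_by_score` are hypothetical helper names.

```python
import json


def load_items(path):
    """Read the list of items that `scrapy crawl anime -o anime.json` wrote."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def rank_by_score(items):
    """Return items that have a numeric score, highest score first."""
    scored = []
    for item in items:
        try:
            score = float(item["score"])
        except (KeyError, TypeError, ValueError):
            continue  # skip items whose score is missing or not a number
        scored.append({"title": item.get("title"), "score": score})
    return sorted(scored, key=lambda item: item["score"], reverse=True)


# Made-up sample in the same shape as the spider's output.
sample = json.loads(
    '[{"title": "A", "score": "70.1"},'
    ' {"title": "B", "score": null},'
    ' {"title": "C", "score": "82.5"}]'
)
ranking = rank_by_score(sample)
print([item["title"] for item in ranking])
```

With the real file, `rank_by_score(load_items("anime.json"))` would produce the same kind of list.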

Promotion

We have published a video that briefly explains Scrapy. In it, we scrape using XPath.

"Scrapy: Automatically collects information on web pages !! Crawling & scraping framework" https://www.youtube.com/watch?v=Zfcukqxvia0&t=3s

References

https://doc.scrapy.org/en/1.2/intro/tutorial.html
https://ja.wikipedia.org/wiki/ウェブスクレイピング
