[PYTHON] I tried collecting data from a website with Scrapy

I did some basic crawling and scraping with Scrapy.

What I made

A Scrapy project that collects the title and total score of each winter 2016 anime from the anime information site https://www.anikore.jp/chronicle/2016/winter/

Creation procedure

Project creation

As described in the official tutorial (https://doc.scrapy.org/en/1.2/intro/tutorial.html), run the following command:

% scrapy startproject project_name

Running this command creates the project. This time, I used `anime` as the project_name.

Spider creation

Next, create a Python file for the Spider (the scraper) in the project's spiders directory. This time, I named it `anime_spider.py`.

The finished product looks like this:

anime_spider.py


import scrapy


class AnimeSpider(scrapy.Spider):
    name = "anime"
    start_urls = [
        'https://www.anikore.jp/chronicle/2016/winter/'
    ]

    def parse(self, response):
        for anime in response.css('div.animeSearchResultBody'):
            yield {
                'title': anime.css('span.animeTitle a::text').extract_first(),
                'score': anime.css('span.totalRank::text').extract_first()
            }

        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Spider name

name = "anime"

This is the name of the Spider. To run the scraper, pass the name declared here to the crawl command as follows.

% scrapy crawl anime

Crawling start URL

start_urls = [
    'https://www.anikore.jp/chronicle/2016/winter/'
]

This is the URL where crawling starts. This time I want the list of winter 2016 anime, so I set it to the top page of that list.

Scraping

Both scraping and crawling are done in a method called `parse()`. Below is the scraping part.

for anime in response.css('div.animeSearchResultBody'):
    yield {
        'title': anime.css('span.animeTitle a::text').extract_first(),
        'score': anime.css('span.totalRank::text').extract_first()
    }

In Scrapy you can select data with either CSS selectors or XPath; this time I used CSS.

On this site, each anime entry is wrapped in a div tag with the `animeSearchResultBody` class, so the information for every anime shown on the page is obtained as follows.

response.css('div.animeSearchResultBody')

I only want the title and total score from each extracted entry, so I extract them as follows.

yield {
    'title': anime.css('span.animeTitle a::text').extract_first(),
    'score': anime.css('span.totalRank::text').extract_first()
}

`extract_first()` extracts the first matching element. You can also access it with a subscript, like this:

anime.css('span.animeTitle a::text')[0].extract()

However, I use `extract_first()` because it returns None instead of raising an IndexError when nothing matches.

Crawling

Crawling is done in the following part.

next_page = response.css('a.next::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

The URL of the next page is built from the href attribute of the Next button. Because each request for the next page is handled by `parse()` again, every page is crawled until no Next link remains.
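The `response.urljoin()` call matters because href values are often relative. It is a thin wrapper around the standard library's `urljoin`, so its behavior can be sketched without Scrapy. The href values below are hypothetical, since the real site's link format may differ:

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL from this article).
page_url = 'https://www.anikore.jp/chronicle/2016/winter/'

# Hypothetical href values; the real site's link format may differ.
hrefs = [
    '?page=2',                        # query-only link
    '/chronicle/2016/winter/2/',      # absolute path
    'https://www.anikore.jp/other/',  # already absolute
]

# Resolve each href against the current page URL.
absolute_urls = [urljoin(page_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)
```

In every case the result is a full URL that can be passed to `scrapy.Request`, which is why the spider joins the href before yielding the request.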

Run

Finally, let's run the program. I introduced the run command under "Spider name", but here I add an option to write the results to a file in JSON format. Run the following command inside the project directory.

% scrapy crawl anime -o anime.json

This gave me the title and total score of each winter 2016 anime.
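Once the file exists, it can be loaded with the standard library. The sketch below assumes the output is a JSON array of objects with the `title` and `score` keys yielded by the spider, and that `score` is a numeric string. The helper name `load_ranking` is hypothetical and not part of Scrapy.

```python
import json


def load_ranking(path):
    """Load the spider output and return (title, score) pairs, highest score first."""
    with open(path, encoding='utf-8') as f:
        items = json.load(f)

    ranking = []
    for item in items:
        title, score = item.get('title'), item.get('score')
        if title is None or score is None:
            continue  # extract_first() found nothing for this entry
        try:
            ranking.append((title.strip(), float(score)))
        except ValueError:
            continue  # the score text was not a plain number
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)
```

Calling `load_ranking('anime.json')` from the project directory returns the entries sorted by score, skipping any entry where a field was missing or the score could not be parsed.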

Promotion

We have published a video that briefly explains Scrapy. In it, we scrape using XPath.

"Scrapy: Automatically collects information on web pages !! Crawling & scraping framework" https://www.youtube.com/watch?v=Zfcukqxvia0&t=3s

References

https://doc.scrapy.org/en/1.2/intro/tutorial.html
https://ja.wikipedia.org/wiki/ウェブスクレイピング
