[PYTHON] Easy web scraping with Scrapy

This article explains, step by step, how to use Scrapy, a framework for web scraping. I hope you find it useful as a reference.

Reference: Python - Create a crawler with Scrapy https://qiita.com/naka-j/items/4b2136b7b5a4e2432da8

Time required: about 15 minutes

Contents

  1. Install Scrapy and create a project
  2. About Spider
  3. Let's actually get the web page information!

1. Install Scrapy and create a project

Run the following pip command in the terminal to install Scrapy:

pip install scrapy
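
If the installation succeeds, the scrapy command becomes available. You can check the installed version with the following command (the exact output depends on your environment):

# confirm that Scrapy is installed
scrapy version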

Then move to the directory where you want to create the Scrapy project and run the following:

scrapy startproject sake

Since this article scrapes a website about sake, the project is named "sake". After running the command, the following directory structure is created under the current directory.

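A freshly created Scrapy project typically has the following layout (a sketch; the exact files may vary slightly between Scrapy versions):

sake/
├── scrapy.cfg
└── sake/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py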

2. About Spider

Web scraping is not possible with the files above alone, so run the following command to create a spider file in the spiders directory.

# scrapy genspider <spider name> <URL of the site you want to scrape>
scrapy genspider scrapy_sake https://www.saketime.jp/ranking/

You can then see that a file called "scrapy_sake.py" has been created in the spiders directory. Its contents are as follows.

sake/sake/spiders/scrapy_sake.py


# -*- coding: utf-8 -*-
import scrapy


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['https://www.saketime.jp/ranking/']
    start_urls = ['http://https://www.saketime.jp/ranking/']

    def parse(self, response):
        pass

As explained in detail later, most of the coding goes into this parse() method. Before writing it, let's first check that the page information can actually be fetched. Because a full URL was passed to genspider, the generated allowed_domains and start_urls are not quite right, so fix them as shown below and add a print statement to parse() to see what is retrieved.

sake/sake/spiders/scrapy_sake.py


# -*- coding: utf-8 -*-
import scrapy


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    # allowed_domains should list only the domain name
    allowed_domains = ['www.saketime.jp']
    start_urls = ['https://www.saketime.jp/ranking/']

    def parse(self, response):
        # Replace pass with a print statement to check the response
        print(response)

If you then run the following command, a large amount of output is returned, and you can confirm that the HTML has been retrieved successfully.

Execution command


# scrapy crawl <spider name>
scrapy crawl scrapy_sake

Output



               <li class="brand_review clearfix">
                <div>
                  <p>
Iso's pride, special brewed raw sake

Click here for today's sake, Iso's proud special brewed raw sake!
Rice...                    <br>
                    <span class="brand_review_user">
                      by
                      <span>Sue</span>
                      <span>
                        <span class="review-star">★</span>
                        <span>4.5</span>
                      </span>
                      <span class="reviewtime">
                        <span>March 23, 2020</span>
                      </span>
                    </span>

                  </p>
                </div>
              </li>
                                        </ul>
          </a>
                  </div>
:
:
:
'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 4, 9, 3, 23, 24, 461847)}
2020-04-09 12:23:26 [scrapy.core.engine] INFO: Spider closed (finished)

Next, let's extract only the necessary information from here!

3. Let's actually get the web page information!

Basically, there are only two files you need to implement: items.py and the spider, spiders/scrapy_sake.py.

Let's edit items.py first. When you first open it, it looks like the following:

sake/sake/items.py


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class SakeItem(scrapy.Item):
    pass

In this class, register each piece of information you want to collect as a scrapy.Field():

sake/sake/items.py


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SakeItem(scrapy.Item):
    # <name of the information you want to collect> = scrapy.Field()
    prefecture_maker = scrapy.Field()
    prefecture = scrapy.Field()
    maker = scrapy.Field()
    brand = scrapy.Field()

That's all for items.py. Next, let's move on to coding scrapy_sake.py.
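
Before writing the spider, one quick note: a scrapy.Item behaves much like a dict, and that is exactly how the spider below fills it in. A tiny illustration (the brand name is just a made-up value; run it from the project root so that sake.items can be imported):

from sake.items import SakeItem

item = SakeItem()
item["brand"] = "Dassai"   # set a declared field like a dict key
print(item["brand"])       # -> Dassai
print(dict(item))          # an Item converts cleanly to a plain dict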

The completed spider is shown below. The inside of parse() is much richer than the version seen in section 2 above.

sake/sake/spiders/scrapy_sake.py


# -*- coding: utf-8 -*-
import scrapy
# Don't forget to import items.py
from sake.items import SakeItem

class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    # allowed_domains = ['www.saketime.jp']
    start_urls = ['https://www.saketime.jp/ranking/']

    def parse(self, response):
        items = []
        # The sake information is stored in <li class="clearfix"> elements
        sakes = response.css("li.clearfix")

        # Look at each of the li.clearfix elements on the page
        for sake in sakes:
            # Create a SakeItem object, as defined in items.py
            item = SakeItem()
            item["prefecture_maker"] = sake.css("div.col-center p.brand_info::text").extract_first()

            # For markup like <div class="headline clearfix">, chain the classes with a dot: headline.clearfix
            item["brand"] = sake.css("div.headline.clearfix h2 a span::text").extract_first()

            # Clean up the extracted data
            if (item["prefecture_maker"] is not None) and (item["brand"] is not None):
                # Remove \n and spaces
                item["prefecture_maker"] = item["prefecture_maker"].replace(' ', '').replace('\n', '')
                # Split into prefecture and maker
                item["prefecture"] = item["prefecture_maker"].split('|')[0]
                item["maker"] = item["prefecture_maker"].split('|')[1]
                items.append(item)
        print(items)

        # Handle pagination with recursive requests
        # Get the href of the <a rel="next"> element
        next_page = response.css('a[rel="next"]::attr(href)').extract_first()
        if next_page is not None:
            # Convert a relative URL to an absolute one
            next_page = response.urljoin(next_page)
            # Yield a Request for the next page; parse() then runs again on that page
            yield scrapy.Request(next_page, callback=self.parse)
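
Crawling every ranking page takes a little while, so while testing it can be handy to cap the number of pages. Scrapy's built-in CLOSESPIDER_PAGECOUNT setting can do this from the command line, for example:

# stop automatically after roughly 5 pages have been crawled
scrapy crawl scrapy_sake -s CLOSESPIDER_PAGECOUNT=5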

When this is executed, the output looks like the following.

:
:
:
2020-04-10 16:52:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.saketime.jp/ranking/page:110/> from <GET https://www.saketime.jp/ranking/page:110>
2020-04-10 16:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.saketime.jp/ranking/page:110/> (referer: https://www.saketime.jp/ranking/page:109/)
[{'brand': 'Orochi's tongue',
 'maker': 'Kisuki Brewery',
 'prefecture': 'Shimane',
 'prefecture_maker': 'Shimane|Kisuki Brewery'}, {'brand': '禱 and Minoru',
 'maker': 'Fukumitsuya',
 'prefecture': 'Ishikawa',
 'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Kanazawa beauty',
 'maker': 'Fukumitsuya',
 'prefecture': 'Ishikawa',
 'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Jinkuro',
 'maker': 'Hokusetsu Sake Brewery',
 'prefecture': 'Niigata',
 'prefecture_maker': 'Niigata|Hokusetsu Sake Brewery'}, {'brand': 'Kenroku Sakura',
 'maker': 'Nakamura Sake Brewery',
 'prefecture': 'Ishikawa',
 'prefecture_maker': 'Ishikawa|Nakamura Sake Brewery'}, {'brand': 'birth',
 'maker': 'Tohoku Meijo',
 'prefecture': 'Yamagata',
 'prefecture_maker': 'Yamagata|Tohoku Meijo'}, {'brand': 'SUMMERGODDESS',
 'maker': 'Mana Tsuru Sake Brewery',
 'prefecture': 'Fukui',
:
:
:
 'scheduler/dequeued/memory': 221,
 'scheduler/enqueued': 221,
 'scheduler/enqueued/memory': 221,
 'start_time': datetime.datetime(2020, 4, 10, 7, 51, 13, 756973)}
2020-04-10 16:53:00 [scrapy.core.engine] INFO: Spider closed (finished)

Sake information from 110 pages was collected in a JSON-like format in about 20 seconds. Very convenient.
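
Printing the list is fine for checking, but if each item is yielded from parse() instead of being collected and printed, Scrapy's feed export can write the results straight to a file. A minimal sketch of that change:

    def parse(self, response):
        for sake in response.css("li.clearfix"):
            item = SakeItem()
            # ... fill in and clean the fields as above ...
            yield item   # yield each item instead of print()

With that in place, the following command writes all yielded items to a JSON file:

scrapy crawl scrapy_sake -o sake.json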

Try scraping the sites you are interested in to get information.

Note

This is basic, but as a way to inspect the HTML of the site you want to scrape: in Chrome on macOS you can open the developer tools with cmd + option + i. You can also press cmd + shift + c and then click an element on the page to see where it appears in the HTML code.
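
In the same spirit, Scrapy's interactive shell is convenient for trying CSS selectors against a live page before writing them into parse(). For example:

# open an interactive session against the ranking page
scrapy shell "https://www.saketime.jp/ranking/"

# inside the shell, test selectors against the response object
>>> response.css("li.clearfix")
>>> response.css("div.headline.clearfix h2 a span::text").extract_first()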
