python super beginner tries scraping

What is scraping?

People use the word "scraping" to cover roughly two things: "crawling" and "scraping". I was confused about this, so let me sort it out first: crawling means fetching pages by following links around a site, while scraping means extracting the data you want from those pages.

So, for example, extracting the titles held by my favorite shogi players from the Shogi Federation page: that kind of thing is "scraping".

scrapy

Let's actually do some scraping. Come to think of it, I've only used PHP so far, and I had a hard time extracting the information I wanted from pages using Goutte and the like.

Then I learned that Python, which I recently started using, has a library (framework?) called Scrapy that makes scraping very easy.

So this time I'll use it to collect information about my favorite shogi players from the Shogi Federation page.

Installation

$ pip install scrapy

Complete
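
Just to check that it actually went in (my own sanity check, not a step from the tutorial), the version command should be enough:

$ scrapy version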

tutorial

Well, I'm a super beginner who really doesn't understand Python at all, so I'll try the tutorial step by step to get a feel for it.

There was a tutorial corner in the documentation. https://docs.scrapy.org/en/latest/intro/tutorial.html

It's in English, but it seems manageable.

The order of work described in the tutorial

  1. Create a new Scrapy project
  2. Write a spider to crawl your site and extract the data you need.
  3. Output the extracted information from the command line
  4. Change the spider to follow links (I couldn't quite follow the English here)
  5. Let's use spider arguments

I'd like to do something in this order.

1. Create a new Scrapy project

scrapy startproject tutorial

This should do it.

[vagrant@localhost test]$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/lib64/python3.5/site-packages/scrapy/templates/project', created in:
    /home/vagrant/test/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
    
[vagrant@localhost test]$ ll
total 0
drwxr-xr-x 3 vagrant vagrant 38 Apr 16 04:15 tutorial

A directory called tutorial has been created!

There are various files inside it; according to the documentation, each one has the following role.

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

I didn't understand anything other than the deployment configuration file lol

2. Write a spider to crawl your site and extract the data you need.

Create a file called quotes_spider.py under tutorial/spiders/, since the tutorial has sample code to copy and paste into it.

[vagrant@localhost tutorial]$ vi tutorial/spiders/quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

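    # start_requests() yields the initial Requests to crawl; each
    # response is handed to the callback given below (parse)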
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

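    # parse() is called with the downloaded response for each request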
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

3. Output the extracted information from the command line

scrapy crawl quotes

It seems this is all it takes.

After a stream of log output, quotes-1.html and quotes-2.html had been created.

[vagrant@localhost tutorial]$ ll
total 32
-rw-rw-r-- 1 vagrant vagrant 11053 Apr 16 04:27 quotes-1.html
-rw-rw-r-- 1 vagrant vagrant 13734 Apr 16 04:27 quotes-2.html
-rw-r--r-- 1 vagrant vagrant   260 Apr 16 04:15 scrapy.cfg
drwxr-xr-x 4 vagrant vagrant   129 Apr 16 04:15 tutorial

The step was titled "Output the extracted information from the command line", but looking at the contents of the parse method, it's actually just doing the following:

- Extract the number part from the URL of the crawled page
- Substitute that number into the %s part of quotes-%s.html
- Finally, write the body of the response (TextResponse) to that file and save it

The start_requests method can be written more simply

In the end, this method just returns scrapy.Request objects, and it seems you can achieve the same thing just by defining start_urls.

    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

With this, you don't have to bother defining the start_requests method at all.
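
For reference, the whole spider rewritten with start_urls would look something like this (a sketch, but it should behave the same as before: with start_urls defined, the default start_requests() inherited from scrapy.Spider turns those URLs into requests and sends each response to parse):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # with start_urls defined, the inherited start_requests()
    # generates the Requests for us
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # same as before: save each downloaded page as quotes-<n>.html
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)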

Finally try to extract the data

The tutorial says that the best way to learn how Scrapy actually extracts data is to try selectors in the Scrapy shell.

Let's try it right away.

[vagrant@localhost tutorial]$ scrapy shell 'http://quotes.toscrape.com/page/1/'

...(output omitted)...

2017-04-16 04:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbb13dd0080>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fbb129308d0>
[s]   spider     <DefaultSpider 'default' at 0x7fbb11f14828>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

First, let's try extracting an element using CSS and see what happens.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

Oh, it seems that something like a title element can be extracted.

When you call response.css(xxx), what comes back is a SelectorList, an object wrapping the matched XML or HTML nodes, from which you can extract further data. So let's dig deeper from here. As a test, extract the text of the title.

>>> response.css('title::text').extract()
['Quotes to Scrape']
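
It looks like the same pattern should work for the quotes themselves. Judging from the page's HTML, each quote sits inside a div.quote, so something like the following ought to pull out the first quote's text and author (extract_first() returns just the first match as a string):

>>> quote = response.css('div.quote')[0]
>>> quote.css('span.text::text').extract_first()
>>> quote.css('small.author::text').extract_first()

If I'm reading the page right, those should come back as the quote text and the author's name.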