Scraping with scrapy shell

# Introduction

scrapy has a shell mode that lets you scrape interactively. Used in combination with Chrome, it makes scraping a web page relatively easy, and it is useful for working out what XPath to write before you write the actual program.

# Get XPath

In Scrapy, you specify the data you want to retrieve from a web page with XPath. Writing your own XPath is not difficult on a page whose HTML structure you know, but it is hard to write the XPath for data on a page you did not create. That's where Chrome comes in.

For example, suppose you want to extract the title and link of each comic from the page http://toyokeizai.net/category/diary. Open this page in Chrome, right-click the top title "Engineers can't go home on Premium Friday", and select "Inspect" from the menu. Developer Tools opens and the corresponding tag is highlighted (screenshot: Developer Tools with the title's `<span>` tag selected). Right-click that `<span>` tag and select "Copy" → "Copy XPath" from the menu to copy the tag's XPath to the clipboard. In this example, the XPath is

//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]

In this way, you can get an XPath easily with nothing but Chrome. For more on XPath, see the following references:

- TECHSCORE: Location Path
- XML Path Language (XPath) Version 1.0

# Scraping with Scrapy Shell

## Installation of scrapy

For installation, refer to the official Scrapy site or the article "Install scrapy in python anaconda environment".
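For reference, installation is typically a single command (a sketch; use whichever variant matches your environment):

$ pip install scrapy

or, in an Anaconda environment:

$ conda install -c conda-forge scrapy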

## Load a web page with scrapy shell

First, start the scrapy shell.

$ scrapy shell
2017-03-16 10:44:42 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2017-03-16 10:44:42 [scrapy] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-03-16 10:44:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-03-16 10:44:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-16 10:44:42 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-16 10:44:42 [scrapy] INFO: Enabled item pipelines:
[]
2017-03-16 10:44:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1083d7668>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x108f2cb70>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
In [1]:

Then load the web page with the `fetch()` command.

In [1]: fetch('http://toyokeizai.net/category/diary')
2017-03-16 10:46:30 [scrapy] INFO: Spider opened
2017-03-16 10:46:31 [scrapy] DEBUG: Crawled (200) <GET http://toyokeizai.net/category/diary> (referer: None)

You can also pass the URL when starting the scrapy shell to load it in one step.

$ scrapy shell http://toyokeizai.net/category/diary

The loaded page is stored in the `response` object. You can check whether the target page was loaded with a command such as:

In [3]: view(response)
Out[3]: True

The `view()` command displays the loaded web page in your default browser.
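Besides `view()`, the response can be inspected directly; `status` and `url` are standard attributes of Scrapy's Response objects:

response.status   # HTTP status code of the fetch; 200 means success
response.url      # the URL that was actually fetched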

## Retrieving the desired data

Now let's retrieve the desired data, using the XPath obtained above.

In [4]: response.xpath('//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]/text()').extract()
Out[4]: ["Engineers can't go home on premium Friday"]

You now have the title. The `text()` appended to the XPath copied from Chrome selects the child text nodes of the selected node, and `extract()` pulls the text data out of those nodes. The result is returned as a list.
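When you only need the first match, the selector method `extract_first()` (available alongside `extract()`) returns a single string, or `None` if nothing matched:

response.xpath('//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]/text()').extract_first()
# -> "Engineers can't go home on premium Friday"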

## Extract all titles

Next, let's get all the comic titles listed on the page. The HTML corresponding to the XPath we have used so far is as follows:

<div id="latest-items">
  <div class="article-list">
    <ul class="business">
      <li class="clearfix">
        <div class="ico">…</div>
        <div class="ttl">
          <a href="/articles/-/161892" class="link-box">
            <span class="column-ttl">Will be incorporated as work time</span><br>
            <span class="column-main-ttl">Engineers can't go home on premium Friday</span>
            <span class="date">March 12, 2017</span>
            <span class="summary">From February 24th, there will be a Friday "Premium Friday" where you can leave the office once a month ...</span>
          </a>
        </div>
      </li>
      <li class="clearfix">…</li>
      <li class="clearfix">…</li>
      <li class="clearfix">…</li>
      <li class="clearfix">…</li>
      <li class="clearfix">…</li>
    </ul>
  </div>
</div>

As this structure shows, each `<li class="clearfix">…</li>` contains the information for one comic. In the XPath used earlier,

//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]

the `li[1]` step selects only the first `<li class="clearfix">…</li>`. If you omit this index, every `<li class="clearfix">…</li>` is matched. That is, the XPath becomes

//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]

Let's actually try it:

In [5]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()
Out[5]:
["Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Data in the cloud may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunately common points of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 '"DV" has a different meaning in the engineer community',
 'Amazing evolution of the game over the last 40 years',
 '"Pit" hidden in long autumn night programming',
 'Former Sony engineers are popular at work']
    

That retrieved all the titles. Comparing the HTML and XPath above, it looks as if you could simply target the title tag (`<span class="column-main-ttl">`) directly, but this page is structured like this:

     <div id = "ranking-items" style = "display: none;"> <!-In order of popularity->
      <div class="article-list ranking category">
        <ul class="ranked business">
          <li class="clearfix">
           ...
     <div id = "latest-items"> <!-Latest order->
      <div class="article-list">
        <ul class="business">
          <li class="clearfix">
    

That is, a list in popularity order sits right above the list in latest order with almost the same structure, so extra data gets mixed in if you are not careful. If you actually try it:

In [6]: response.xpath('//span[@class="column-main-ttl"]/text()').extract()
Out[6]:
["Engineers aren't playing with Pokemon GO!",
 "When will Mizuho Bank's system be completed?",
 'Unfortunate commonality of unpopular male engineers',
 "Engineers can't go home on premium Friday",
 'Students who no longer know desktop PCs!',
 'Why former Sony engineers are popular in the workplace',
 'Cloud data may disappear!',
 "Why I don't envy Yahoo 3 days a week",
 'The memory of the first computer I bought is vivid',
 'What is the most profitable programming language',
 'Who is attracted to "Famicom Mini"?',
 'Programming has become a very popular lesson!',
 '"Self-driving cars" do not run automatically',
 "The truth about engineer girls' 'same clothes and staying' suspicions",
 'New employees will learn the basics by "creating minutes"',
 "Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Cloud data may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunate commonality of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 'In the engineer area, "DV" has a different meaning',
 'Amazing evolution of the game in the last 40 years',
 'The "pitfalls" hidden in long autumn night programming',
 'Why former Sony engineers are popular at work']
    

the same data is retrieved twice. In other words, your XPath must point uniquely at the data you need.
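One way to keep the short class-based expression and still stay unique (a sketch, not part of the original session) is to anchor it beneath the `latest-items` element:

response.xpath('//*[@id="latest-items"]//span[@class="column-main-ttl"]/text()').extract()
# the leading id step limits matching to the "latest order" list,
# so the hidden ranking list no longer contributes duplicates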

## Extract the link

Looking at the HTML, the link URL to each comic's page is written in the `href` of the `<a>` tag that is the parent of the title `<span>`. The XPath pointing to it looks like this:

//*[@id="latest-items"]/div/ul/li/div[2]/a/@href
    

The trailing `@href` refers to the `href` attribute of the `<a>` tag. This time we want to extract an attribute value of the `<a>` tag rather than its child text nodes, hence the form above. Running this:

In [7]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()
Out[7]:
['/articles/-/161892',
 '/articles/-/159846',
 '/articles/-/157777',
 '/articles/-/153378',
 '/articles/-/153367',
 '/articles/-/152301',
 '/articles/-/152167',
 '/articles/-/149922',
 '/articles/-/149911',
 '/articles/-/146637',
 '/articles/-/146559',
 '/articles/-/144778',
 '/articles/-/144756',
 '/articles/-/142415',
 '/articles/-/142342']
    

Now that you have XPaths for the title and link of each comic, you can build a scraping program around them to collect the information you need.
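To turn those XPaths into a reusable program, they drop straight into a minimal `scrapy.Spider` (a sketch, not code from this article; the spider name and output keys are made up):

import scrapy


class DiarySpider(scrapy.Spider):
    # hypothetical name; any unique string works
    name = 'diary'
    start_urls = ['http://toyokeizai.net/category/diary']

    def parse(self, response):
        # iterate over one <li> per comic, then use XPaths relative to it
        for li in response.xpath('//*[@id="latest-items"]/div/ul/li'):
            yield {
                'title': li.xpath('div[2]/a/span[2]/text()').extract_first(),
                # the href is site-relative, so build an absolute URL
                'link': response.urljoin(li.xpath('div[2]/a/@href').extract_first()),
            }

Saved as, say, diary_spider.py, it can be run without a full project via `scrapy runspider diary_spider.py -o comics.csv`.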

## Export the acquired data

If you only scrape once, you can simply write out the data you need. First save the scraped data in variables, then write them to a file.

In [8]: titles = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()

In [9]: links = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()

In [10]: f = open('bohebohe.txt', 'w')

In [11]: for title, link in zip(titles, links):
    ...:     f.write(title + ', ' + link + '\n')

In [12]: f.close()
    

This writes the scraping results to the file bohebohe.txt:

$ cat bohebohe.txt
Engineers can't go home on premium Friday, /articles/-/161892
If you can't beat the machine, become a machine!, /articles/-/159846
Data in the cloud may disappear!, /articles/-/157777
What is the unexpectedly large number of paperless offices?, /articles/-/153378
Unfortunately common points of unpopular male engineers, /articles/-/153367
What you need to do when you challenge advanced programming work, /articles/-/152301
New Year's Day 2017 was a second longer than usual, /articles/-/152167
The latest situation of the engineer's advent calendar, /articles/-/149922
There are "unexpected enemies" in the Amazon cloud, /articles/-/149911
When will Mizuho Bank's system be completed?, /articles/-/146637
Do you remember the nostalgic "Konami Command"?, /articles/-/146559
"DV" has a different meaning in the engineer community, /articles/-/144778
Amazing evolution of the game over the last 40 years, /articles/-/144756
"Pit" hidden in long autumn night programming, /articles/-/142415
Former Sony engineers are popular at work, /articles/-/142342
    
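If a title ever contains a comma, the hand-rolled format above becomes ambiguous; Python's csv module handles the quoting for you. A minimal sketch of the same export (assuming the `titles` and `links` variables from the session above):

import csv

# titles and links are the lists extracted in the shell session above
with open('bohebohe.csv', 'w', newline='') as f:
    writer = csv.writer(f)                 # quotes embedded commas automatically
    writer.writerow(['title', 'link'])     # header row
    writer.writerows(zip(titles, links))   # one row per comic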

# In conclusion

Debugging the XPath that pinpoints your data while writing a program is a bit of a hassle, and writing a full program for something that will run only once can be a waste. In such cases the scrapy shell, which lets you experiment interactively and run Python code as-is, is quite convenient. It is useful for all kinds of quick experiments, such as pulling a little data out of a page you made in the past.

# Bonus: XPath

A brief description of the XPath expressions used in this article, taking the following HTML as an example:

     1: <div id="latest-items">
     2:  <div class="article-list">
     3:    <ul class="business">
     4:      <li class="clearfix">
     5:        <div class="ttl">
     6:          <a href="/articles/-/161892" class="link-box">
     7: <span class = "column-ttl"> Incorporated as work time </ span> <br>
     8: <span class = "column-main-ttl"> Engineers can't go home on Premium Friday </ span>
     9: <span class = "date"> March 12, 2017 </ span>
    11:          </a>
    12:        </div>
    13:      </li>
    14:    </ul>
    15:  </div>
    16:</div>
    


| XPath | Function |
| --- | --- |
| `//e` | All nodes matching a path rooted at tag e. `//div` selects every node that starts with a div tag (lines 1, 2, 5). |
| `//e1/e2` | All nodes where tag e2 is a child element of tag e1. `//div/ul` selects the node on line 3; `//div/a/span` selects lines 7, 8, 9. |
| `//e1/e2[1]` | The first of the child elements e2 of tag e1. `//li/div/a/span[1]` selects line 7. |
| `//e[@name="value"]` | Nodes with tag e whose attribute name equals value. `//div[@class="article-list"]` selects line 2. |
| `@name` | Retrieves the name attribute of the selected node. `//div/a/@href` gets the href value on line 6. |
| `text()` | Extracts the text nodes among the children of the selected node. |
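You can experiment with these expressions without fetching anything by feeding an HTML string to Scrapy's `Selector` class (a small sketch using a trimmed version of the HTML above):

from scrapy import Selector

# trimmed version of the sample HTML above
sample_html = '''
<div id="latest-items"><div class="article-list"><ul class="business">
  <li class="clearfix"><div class="ttl">
    <a href="/articles/-/161892" class="link-box">
      <span class="column-ttl">Will be incorporated as work time</span>
      <span class="column-main-ttl">Engineers can't go home on Premium Friday</span>
    </a>
  </div></li>
</ul></div></div>'''

sel = Selector(text=sample_html)
print(sel.xpath('//div/a/span[2]/text()').extract())             # title by positional index
print(sel.xpath('//span[@class="column-ttl"]/text()').extract_first())
print(sel.xpath('//div/a/@href').extract())                      # attribute access with @href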
