Summary of useful techniques for Python Scrapy

Scrapy is a Python framework dedicated to scraping and crawling.

Recently, scraping jobs have been coming up a lot in my work. I used to implement scraping with a PHP library called Simple HTML DOM, but Scrapy appealed to me for the following reasons:

- Automatic form submission
- A dedicated CLI
- It is simply popular

For those reasons, I've recently been doing my scraping work with Python's Scrapy. (PHP is easy, but I also have a personal desire to graduate from PHP.)

Why I think Scrapy is so good

The main reasons why Scrapy is good are as follows:

- You can build complicated scraping flows
- You can experiment easily with the CLI tools

Until now, my scraping simply read pages by URL pattern. Scrapy provides methods for page transitions, so you can, for example, submit forms, while using far less memory than browser automation with *** Selenium ***.

Useful techniques found in Scrapy

Basic

*** Installing scrapy ***

$pip install scrapy

*** Start a scrapy Spider project ***

$scrapy startproject [project_name] [project_dir]

Command line edition

*** List the Spiders in the current project ***

$scrapy list

*** Create a new Spider in the created project ***

#Add domain name
$scrapy genspider [spider_name] mydomain.com
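
For reference, the generated Spider skeleton looks roughly like this (the exact template depends on your Scrapy version; the class and file names follow the spider_name you pass):

import scrapy


class SpiderNameSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        # parsing logic goes here
        pass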

** Specify URLs when running from the command line **

$scrapy crawl -a start_urls="http://example1.com,http://example2.com" [spider_name]
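
For -a to have any effect, the Spider has to accept the argument. A minimal sketch, assuming the spider splits the comma-separated string in its constructor (the spider name is a placeholder):

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'  # placeholder name for this sketch

    def __init__(self, start_urls='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a start_urls="http://example1.com,http://example2.com" arrives as one string,
        # so split it into the list that Scrapy expects
        if start_urls:
            self.start_urls = start_urls.split(',')

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}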

*** Output in CSV ***

$scrapy crawl -o csv_file_name.csv [spider_name]

*** Output as JSON ***

$scrapy crawl -o json_file_name.json [spider_name]
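
In recent Scrapy versions the same export can also be configured on the Spider itself instead of on the command line; a minimal sketch using the FEEDS setting (spider name, URL, and file name are placeholders):

import scrapy


class ExportSpider(scrapy.Spider):
    name = 'export_spider'
    start_urls = ['http://example.com/']

    # equivalent of "scrapy crawl export_spider -o json_file_name.json"
    custom_settings = {
        'FEEDS': {
            'json_file_name.json': {'format': 'json'},
        },
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}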

Shell edition

** Launch the Scrapy shell **

$ scrapy shell [URL]

** Show the entire page **

# response is available in the shell without defining it
response.body

** Get all links **

for link in response.css('a::attr(href)'):
    print(link.get())

Library edition

** Use regular expressions **

# When the href of an a tag matches a specific file name
matched = response.css('a::attr(href)').re(r'detail\.php')
if len(matched) > 0:
    print('matched')

# When the text of an a tag matches a specific string (the original matched a Japanese word)
matched = response.css('a::text').re(u'Summary')
if len(matched) > 0:
    print('matched')

** Get Tag **

#get a tag
response.css('a')

** Get with selector **

# Get a tags with the class "link"
response.css('a.link')

# Get an element with multiple classes, e.g. <li class="page next"></li>
response.css('li.page.next')

** Convert relative path to URL **

for link in response.css('a::attr(href)'):
    print(response.urljoin(link.get()))

** Submit form information **

scrapy.FormRequest.from_response(response, formdata={"username": "login_username", "password": "login_password"})
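
Put into a Spider, a minimal login-flow sketch looks like this (the URL, field names, and the after_login callback are placeholders):

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # fill in the login form found on the page and submit it
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'login_username', 'password': 'login_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # continue scraping as the logged-in user
        yield {'title': response.css('title::text').get()}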

** Iterative processing of child elements of the element acquired by XPath **

# Get the div elements
divs = response.xpath('//div')
# Iterate over the p elements inside the divs
for p in divs.xpath('.//p'):
    print(p.get())

** Transition to another page **

# Pass self.parse as the callback function
yield scrapy.Request(url, callback=self.parse)
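
Put together, a parse callback that follows links to detail pages could look like this sketch (the detail.php pattern and the parse_detail callback are only examples):

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # follow only links that look like detail pages (the pattern is an example)
            if 'detail.php' in href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}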

Others

** Create an Item (edit items.py directly under the project) ** (the example below is from the official Scrapy documentation)

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
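
A Spider callback can then populate and yield the Item; a minimal sketch (the CSS selectors are placeholders for the target page's markup):

    def parse(self, response):
        product = Product()
        # the selectors below are placeholders; adjust them to the actual page
        product['name'] = response.css('h1::text').get()
        product['price'] = response.css('.price::text').get()
        product['stock'] = response.css('.stock::text').get()
        product['tags'] = response.css('.tag::text').getall()
        product['last_updated'] = response.css('.updated::text').get()
        yield product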

Sample

** Follow detail pages until the list runs out of items (this will not run as-is, so place it inside a Spider class) **

    def parse(self, response):
        for a in response.css('a'):
            title = a.css('::text').extract_first()
            link_param = a.css('::attr(href)').extract_first()
            title_match = a.css('::text').re(u'training')
            if len(title_match) > 0:
                item = {
                    "title": title,
                    "url": response.urljoin(link_param)
                }
                # keep only detail-page URLs that match the expected path pattern
                ptn = re.search(r"\/jinzaiikusei\/\w+\/", item["url"])
                if ptn:
                    self.scraping_list.append(item["url"])
        yield scrapy.Request(self.scraping_list[0], callback=self.parse_detail)

    def parse_detail(self, response):
        for item in response.css('a'):
            title = item.css('::text').extract_first()
            url = item.css('::attr(href)').extract_first()
            title_matched = item.css('::text').re(u'training')
            url_matched = item.css('::attr(href)').re(r'jinzaiikusei\/.*\/.*\.html')
            if url_matched:
                item = {
                    "title": title,
                    "url": url
                }
                yield item
        # move on to the next URL in the list, if any remain
        self.current_index = self.current_index + 1
        if self.current_index < len(self.scraping_list):
            yield scrapy.Request(self.scraping_list[self.current_index], callback=self.parse_detail)
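
As the heading says, the two callbacks above rely on state that has to live on the Spider class; a minimal wrapper sketch (the name and start URL are placeholders):

import re

import scrapy


class ListDetailSpider(scrapy.Spider):
    name = 'list_detail'  # placeholder identifiers for this sketch
    start_urls = ['http://example.com/list']

    # state used by parse() and parse_detail() above
    scraping_list = []
    current_index = 0

    # paste parse() and parse_detail() from the sample into this class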

Change log

- 2019/12/06 Newly created
- 2019/12/07 Added library techniques
- 2019/12/09 Added library techniques (form input, etc.)
- 2019/12/16 Added a chapter about Items
- 2019/12/21 Added a command
- 2020/1/20 Added to the shell section
- 2020/2/12 Added urljoin
- 2020/2/13 Added the sample
