XPath summary for extracting data from websites with Python Scrapy

You can use the Python module Scrapy to crawl a sequence of links and automatically retrieve data from each page.

To extract the desired data from a website, you must specify **where the data is located**.

What you specify is called a **selector**. Scrapy supports both CSS and XPath selectors; this article explains how to use XPath.

## Preparation

Install Scrapy with pip.

```shell
$ pip install scrapy
```

## Scrapy shell

Scrapy ships with a tool called the Scrapy shell that lets you interactively test your data extraction.

```shell
$ scrapy shell "http://hogehoge.com/hoge/page1"
```

Running a command like the one above launches an interactive Python shell with an instance named **response** that holds the contents of the specified page. When you actually develop a spider (crawler), you extract data from this same response instance.

## Practice

### The response.xpath() method

Basically, you extract data with the following syntax.

```python
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
```

In this example, the body (`text()`) of every `title` tag (`//title`) in the received HTML is extracted. As shown above, however, the return value is a list of selectors. Use `.extract()` to get the actual strings.

```python
>>> response.xpath('//title/text()').extract()
[u'exsample title']
```

### Converting extracted data to a string

Since the extracted data comes back as a list, index into it to get a single string.

```python
>>> response.xpath('//title/text()').extract()[0]
u'exsample title'
```

By the way, the `u'string'` notation means the string is Unicode; Scrapy handles extracted text as Unicode.

If you are crawling multiple websites, the XPath you specify may not match on every page. In that case, taking the first element with `response.xpath(...).extract()[0]` as above raises an error when the list is empty. To avoid this, write:

```python
>>> item['hoge'] = response.xpath('//title/text()').extract_first()
```

With `extract_first()`, an empty match simply returns `None` instead of raising an `IndexError`.

Also, if you want to concatenate all the elements of an extracted list such as `[u'hoge1', u'hoge2', u'hoge3']` into a single string:

```python
>>> extract_list = [u'hoge1', u'hoge2', u'hoge3']
>>> ''.join(extract_list)
u'hoge1hoge2hoge3'
```


## XPath reference

| XPath | Meaning |
| --- | --- |
| `//div` | All `div` tags |
| `//div[@class='aaa']` | All `div` tags with class `aaa` |
| `//div[@id='aaa']/text()` | All `div` tags with id `aaa` -> their body text |
| `//a[text()='aaa']/@href` | All `a` tags whose text is `aaa` -> their `href` attribute value |
| `//div/tr` | All `div` tags -> their child `tr` tags |
| `//table/tr/th[text()='price']/following-sibling::td[1]/text()` | All tables -> their rows -> the `th` labeled `price` -> the first following `td` -> its body text |

The last table XPath is convenient: it lets you pull a value out of a table on a web page by naming its field (`price` in this case, with the amount as the value). Since `following-sibling::td` alone would keep matching every `td` on the same row, the first one is selected with `td[1]`. Note that it is `[1]`, not `[0]`: XPath positions start at 1.
