[PYTHON] XPath Basics (2) - How to write XPath

In the previous article, I briefly introduced the basic concepts of XPath. This time, I will explain how to write XPath, that is, how to specify and retrieve data from a web page (HTML).

1. Specify by tag (element)

In the HTML sample below, you can see that the text is surrounded by symbols such as <html> and </html>. These symbols are called tags.

**<tag name>Content goes here...</tag name>**

The opening tag is called the "start tag" and the closing tag is called the "end tag". Everything from the start tag to the end tag is called an element.

The part displayed in red in the HTML below is the tag. (It is displayed in blue in Firefox and purple in Chrome.)

[Image: Harry Potter sample HTML]

Below is a summary of the tags you often see in HTML. See this article for more details!

[Image: summary table of common HTML tags]

**The most common way to write XPath is to list the tags separated by a slash "/".**

For example, to get "Harry Potter" from this HTML, specify "html tag -> body tag -> h1 tag" in order from the top of the tree structure, as follows.

/html/body/h1

You can also use "//" to skip the intermediate part of the path.

//h1

If a path matches more than one tag, you can specify the Nth one. In this example, "7,631 yen" is in the second "span" inside the "div", so write as follows.

//div/span[2]

In general form, XPath written with tags (elements) looks like this.

**//tag name/tag name**
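The three tag paths above can be tried from Python. The following is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath and evaluates paths relative to the element it is called on, so /html/body/h1 is written as ./body/h1 from the root and //h1 as .//h1. The sample markup is an assumed reconstruction of the article's page, and the author name and tax text are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample page; the author and tax texts
# are placeholders, not taken from the original HTML.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">J.K. Rowling</span>
      <span>7,631 yen</span>
      <span class="tax_postage">tax included</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <html> element

# /html/body/h1 : the leading /html is implied because we start at the root.
by_full_path = [e.text for e in root.findall("./body/h1")]

# //h1 : ".//" searches all descendants of the root.
by_short_path = [e.text for e in root.findall(".//h1")]

# //div/span[2] : the second <span> child of each <div>.
second_span = [e.text for e in root.findall(".//div/span[2]")]

print(by_full_path, by_short_path, second_span)
```

Both the full path and the shortened path reach the same h1 element; the shortened form is less sensitive to changes in the page layout higher up the tree.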

2. Specify by attribute

An attribute is written inside a tag and gives more detailed information about it. By adding attributes to a tag, you can control the element's behavior or attach specific instructions. An attribute is usually written in the form **id="booksTitle"**. A tag can also have multiple attributes.

**<tag name attribute name="attribute value">**

The most common attributes are href, title, style, src, id, class and so on. Please see this article for details!

**In XPath, attributes are referenced with the "@" symbol.**

For example, if you want to get "Harry Potter", write XPath as follows.

//h1[@id="booksTitle"]

In general form, XPath written with an attribute looks like this:

**//tag name[@attribute name="attribute value"]**

If you want to get all elements with the same attribute, regardless of tag name, write:

**//*[@attribute name="attribute value"]**
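As a rough check of the attribute syntax, here is a small sketch with the standard library's xml.etree.ElementTree (paths are written with a leading "." because ElementTree evaluates them relative to an element). The sample markup is an assumed reconstruction of the article's page, with placeholder author and tax texts.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample page; the author and tax texts
# are placeholders, not taken from the original HTML.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">J.K. Rowling</span>
      <span>7,631 yen</span>
      <span class="tax_postage">tax included</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# //h1[@id="booksTitle"] : an <h1> whose id attribute equals "booksTitle".
title = root.find('.//h1[@id="booksTitle"]')

# //*[@class="tax_postage"] : any element, whatever its tag, with that class value.
any_tag = root.findall('.//*[@class="tax_postage"]')

# The attribute value is compared as a whole string, so "author" alone
# does not match class="author notFaded".
partial = root.findall('.//span[@class="author"]')

print(title.text, [e.tag for e in any_tag], len(partial))
```

The last query returns nothing, which is why the sibling examples later in this article spell out the full value "author notFaded".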

3. Specify by text

The text is enclosed in tags as shown below.

**<tag name>Text goes here...</tag name>**

Retrieving data from a web page usually means retrieving the content or text within the page, so you can also specify the text you want directly.

**In XPath, text is referenced with "text()".**

For example, if you want to get "Harry Potter", specify it by text and write as follows.

//h1[text()="Harry Potter"]

In general form, XPath written with text looks like this:

**//tag name[text()="text to get"]**

If you want to get all elements with the same text, regardless of tag name, write:

**//*[text()="text to get"]**
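The text-based expressions can be checked with a short script. This sketch assumes the third-party lxml package is installed (its xpath() method accepts XPath 1.0 expressions such as text() predicates). The sample markup is an assumed reconstruction of the article's page, with placeholder author and tax texts.

```python
from lxml import html  # third-party package, assumed installed: pip install lxml

# Assumed reconstruction of the sample page; the author and tax texts
# are placeholders, not taken from the original HTML.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">J.K. Rowling</span>
      <span>7,631 yen</span>
      <span class="tax_postage">tax included</span>
    </div>
  </body>
</html>
"""

page = html.fromstring(SAMPLE)

# //h1[text()="Harry Potter"] : an <h1> whose text is exactly "Harry Potter".
title_matches = page.xpath('//h1[text()="Harry Potter"]')

# //*[text()="7,631 yen"] : any element, whatever its tag, with that exact text.
price_matches = page.xpath('//*[text()="7,631 yen"]')

# The comparison is exact, so a partial string matches nothing.
partial_matches = page.xpath('//h1[text()="Harry"]')

print([e.tag for e in title_matches])
print([e.tag for e in price_matches])
print(len(partial_matches))
```

Because the match is exact, surrounding whitespace or line breaks in the real page text will also prevent a match.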

4. Specify by tag relationship

In the HTML tree structure, all elements have a parent-child / sibling relationship.

An element that contains one or more other elements is called a parent element, and the elements it contains are its child elements. A child element has only one parent and sits between the parent's start and end tags. Elements with the same parent are called sibling elements.

Let's also look at a concrete example.

In the sample below, taking the [body] element as the base, [body] is the parent of the [h1] and [div] elements, and [h1] and [div] are children of [body]. The example picks out elements that have a parent-child or sibling relationship and styles each one differently.

The [h1] element and the [div] element are sibling elements because they have the same parent [body] element.

Also, since the [div] element is the parent of the [span] elements and is itself a child of [body], the [span] elements are descendants of the [body] element.

[Image: sample HTML showing parent-child and sibling relationships]

You can select elements by their parent-child or sibling relationship to another element used as the base point. For example, to get "7,631 yen" through its relationship to other tags, you can write any of the following.

**As a child element of the [div] element**

//div/span[2]

**As a descendant element of the [body] element**

//body//span[2]

**As a sibling element of the [span class="author notFaded"] element**

//span[@class="author notFaded"]/following-sibling::span[1]

**As a sibling element of the [span class="tax_postage"] element**

//span[@class="tax_postage"]/preceding-sibling::span[1]
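The four expressions above are meant to reach the same element. The following sketch checks this, assuming the third-party lxml package is installed and using an assumed reconstruction of the sample page (the author and tax texts are placeholders).

```python
from lxml import html  # third-party package, assumed installed: pip install lxml

# Assumed reconstruction of the sample page; the author and tax texts
# are placeholders, not taken from the original HTML.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">J.K. Rowling</span>
      <span>7,631 yen</span>
      <span class="tax_postage">tax included</span>
    </div>
  </body>
</html>
"""

page = html.fromstring(SAMPLE)

expressions = [
    # child of <div>
    '//div/span[2]',
    # descendant of <body>
    '//body//span[2]',
    # first <span> sibling after the author <span>
    '//span[@class="author notFaded"]/following-sibling::span[1]',
    # nearest <span> sibling before the tax/postage <span>
    '//span[@class="tax_postage"]/preceding-sibling::span[1]',
]

# Map each expression to the text of every element it selects.
results = {xp: [e.text for e in page.xpath(xp)] for xp in expressions}

for xp, texts in results.items():
    print(xp, "->", texts)
```

Note that with preceding-sibling the index counts backwards from the base element, so [1] is the closest sibling before it.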

Two axes, "following-sibling::" and "preceding-sibling::", are often used to specify sibling tags.

- **"following-sibling::" selects sibling elements that come after the specified element**
- **"preceding-sibling::" selects sibling elements that come before the specified element**

"following-sibling::" is very useful when specifying table elements. For example, consider the following HTML sample.

[Image: HTML sample of a table]

When this HTML is rendered as a page, it looks like the table below.

[Image: rendered table]

In this example, we want to get the store name "12345". However, there are multiple [td] elements, so **//td[1]** alone cannot single it out. Also, if you want to get tables with the same structure from multiple pages at once, it is recommended to use "following-sibling::" with the fixed label "store name" as the base point. Write as follows.

**//th[text()="store name"]/following-sibling::td[1]**
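To see why the label-based expression is more reliable than //td[1], here is a small sketch, assuming the third-party lxml package is installed. The table markup is hypothetical: only the "store name" label and the value "12345" come from the article, and the other rows are made up for illustration.

```python
from lxml import html  # third-party package, assumed installed: pip install lxml

# Hypothetical table; only "store name" and "12345" come from the article.
TABLE_PAGE = """
<html>
  <body>
    <table>
      <tr><th>store name</th><td>12345</td></tr>
      <tr><th>address</th><td>Tokyo</td></tr>
      <tr><th>phone</th><td>000-0000-0000</td></tr>
    </table>
  </body>
</html>
"""

page = html.fromstring(TABLE_PAGE)

# //td[1] selects the first <td> of every row, so it returns several cells.
first_cells = [td.text for td in page.xpath("//td[1]")]

# Anchoring on the fixed header text selects only the cell next to it.
store = [
    td.text
    for td in page.xpath('//th[text()="store name"]/following-sibling::td[1]')
]

print(first_cells)
print(store)
```

Because the expression depends on the label rather than on the row position, it should keep working on pages where the rows appear in a different order, as long as the label text is the same.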

In general form, XPath written with tag relationships looks like this.

[Image: general syntax for XPath written with tag relationships]

If the above syntax matches more than one element, you can specify the Nth one by adding **[N]**.

What do you think? These are the most commonly used ways of writing XPath. Please give them a try. Next time, I will introduce the functions that are often used in XPath. Stay tuned!

Original article: https://helpcenter.octoparse.jp/hc/ja/articles/360013122059
