[PYTHON] XPath Basics (1) - Basic Concept of XPath

There are two ways to retrieve data from a website automatically. One is to write a web crawler in a programming language such as Python, and the other is to use a web scraping tool such as Octoparse. Either way, XPath plays an important role. If you know how to write XPath, you can extract data more accurately and efficiently.

In this XPath series, I will cover everything in detail, from the basic concepts of XPath to how to write and apply it.

This article briefly introduces the basic concepts of XPath.

1. What is XPath?

XPath (XML Path Language) is a concise syntax (language) for specifying elements and attribute values in an XML / HTML document, which has a tree structure. Since web pages are usually written in HTML, XPath is often used to get information from web pages. When viewing a web page in a browser (Chrome, Firefox, etc.), you can easily open the corresponding HTML document by pressing F12. (image: 1.png)

2. How XPath works

Let's take a look at how XPath works in practice. The image below is part of an HTML document. (image: 2.png)

HTML has different levels, like a tree structure. In this example, level 1 is **bookstore** and level 2 is **book**. **title, author, year, price** are all level 3.
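The levels can be sketched in Python. This is only an illustration: the element values below are placeholders I made up, not the contents of the original image, and the sample is parsed as XML with the standard library's `xml.etree.ElementTree` rather than with an HTML parser. The helper name `levels` is also my own.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: the tag names come from the article,
# the values are placeholders.
DOC = """<bookstore>
  <book>
    <title>Example Title</title>
    <author>Example Author</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""


def levels(element, depth=1):
    """Yield (depth, tag) pairs for an element and all of its descendants."""
    yield depth, element.tag
    for child in element:
        yield from levels(child, depth + 1)


root = ET.fromstring(DOC)
tree_levels = list(levels(root))

# Print one indented line per element to make the hierarchy visible.
for depth, tag in tree_levels:
    print("  " * (depth - 1) + f"level {depth}: {tag}")
```

Running it prints bookstore at level 1, book at level 2, and the four remaining elements at level 3.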

Text enclosed in angle brackets (such as <bookstore>) is called a tag. An HTML element usually consists of a start tag and an end tag, with the content placed between them. It has the following form.

**<○○> (start tag) content goes here ... </○○> (end tag)**
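As a small illustration of this form, the sketch below parses a single element with the standard library's `xml.etree.ElementTree` and reads back the tag name and the content. The element and its text are placeholder values chosen for the example.

```python
import xml.etree.ElementTree as ET

# One element: start tag, content, end tag (placeholder content).
element = ET.fromstring("<author>Example Author</author>")

print(element.tag)   # the name that appears in the start and end tags
print(element.text)  # the content between the two tags
```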

XPath writes out the hierarchy one level at a time, separated by slashes "/", so that you can reach another node starting from a reference node. This is similar to a URL. In this example, if you are looking for the element "author", the XPath would be:

/bookstore/book/author
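To try this path in Python, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. That module supports only a subset of XPath and does not accept an absolute path when searching from an element, so the hypothetical helper `find_absolute` (my own name, not a library function) checks the first step against the root element and passes the remaining steps to `findall`. The document values are placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed sample document with placeholder values.
DOC = """<bookstore>
  <book>
    <title>Example Title</title>
    <author>Example Author</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""


def find_absolute(root, xpath):
    """Resolve a simple absolute path such as /bookstore/book/author.

    Only plain tag names separated by "/" are handled here.
    """
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the first step must name the root element
    if len(steps) == 1:
        return [root]
    # Search the remaining steps relative to the root element.
    return root.findall("./" + "/".join(steps[1:]))


root = ET.fromstring(DOC)
authors = find_absolute(root, "/bookstore/book/author")
print([a.text for a in authors])
```

A full XPath engine, for example the third-party lxml package, can evaluate absolute paths directly; the helper above only exists to keep the example within the standard library.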

To better understand how this works, think about how you find a specific file on your computer. (image: 3.png)

To find the file named "author", the correct file path is **\bookstore\book\author**.

Just as every file on your computer has its own path, so does every element on a web page. XPath is the notation used to write that path.

An XPath that starts at the root element (the top element of the document) and goes through all the elements inside to the target element is called an absolute XPath.

**Example: /html/body/div/div/div/div/div/div/div/div/div/span/span/span…**

An absolute XPath can be long and confusing. To simplify it, you can use "//" to skip the intermediate steps of the path. The result is also known as a short XPath.

For example

**Absolute XPath: /bookstore/book/author**

**Short XPath: //author**
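The two forms can be compared with a short Python sketch. It uses the standard library's `xml.etree.ElementTree`, which requires a leading "." when searching from an element, so the paths are written as `./book/author` and `.//author`. The two-book document and its values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed sample document with two books (placeholder values).
DOC = """<bookstore>
  <book>
    <title>Title A</title>
    <author>Author A</author>
  </book>
  <book>
    <title>Title B</title>
    <author>Author B</author>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)  # root is the <bookstore> element

# Step-by-step path from the root element: bookstore -> book -> author.
full = root.findall("./book/author")

# Short form: "//" matches author elements at any depth below the root.
short = root.findall(".//author")

print([a.text for a in full])
print([a.text for a in short])
```

On this document both searches return the same two elements. On a larger page, the short form would also match any other author elements that sit elsewhere in the tree, which is worth keeping in mind when shortening a path.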

3. How to display and write XPath

[For Google Chrome]

Open the page in Chrome, right-click, and choose [Inspect] to open the developer tools. In the HTML shown on the Elements tab, right-click the element you want. Select [Copy] -> [Copy XPath] from the menu to copy that element's XPath to the clipboard. (image: xpath-chrome2.png)

With the HTML on the Elements tab displayed, press "Ctrl + F" to show the search field. When you enter an XPath, the matching element is highlighted. (image: mceclip0.png)

You can also add an extension called "XPath Helper". Enter an XPath and the matching results are displayed. (Install XPath Helper) (image: 25.png)

[For Firefox]

You can use the "Firebug" extension installed on an older version of Firefox. ([How to install the Firebug & FireXPath extension](https://helpcenter.octoparse.jp/hc/ja/articles/360015765193-Firebug-FireXPath%E6%8B%A1%E5%BC%B5%E6%A9%9F%E8%83%BD%E3%82%92%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95))

Open a web page in Firefox ➡ click the Firebug button ➡ click an element on the page ➡ the XPath of that element is displayed. (image: mceclip3.png)

The above is the basic concept of XPath. Next time, I'll show you how to write XPath, so please look forward to it!

Original article: https://helpcenter.octoparse.jp/hc/ja/articles/360015765513
