This article is the 16th day article of Dwango Advent Calendar 2020. I like the Advent calendar too much, so I will write the last day in addition to this. The content is thin because it is a hurry, but if you are interested, please contact us.
The XPath 1.0 parser introduced in this article was developed for the HTML parser "gammo" to be introduced on the final day of Dwango Advent Calendar 2020.
Since the content of the final day will be specialized for HTML parser, this article will introduce the story of implementing XPath 1.0 as a mechanism for traversing the DOM tree built by gammo.
XPath 1.0
XPath 1.0 was published at the same time as XSLT 1.0, which was recommended in 1999, and has a simple specification with very few features compared to the current latest version, 3.1. I chose XPath 1.0 simply because it seemed very simple and easy to implement, and I thought it was enough to traverse the DOM tree built by gammo. Be done.
As mentioned earlier, the XPath specification itself isn't that big. Furthermore, since it aims to follow XPath 1.0 rather than the latest specifications, the volume related to its implementation is also small.
The detailed XPath 1.0 grammar is defined by w3c, and in principle, AST can be built by implementing according to this content.
Since gammo is implemented by pure Ruby without using native extensions, we took a similar approach for XPath 1.0. In addition, instead of using third-party gems, parsing is performed using racc, which comes standard with the runtime library, and strscan is used for lexical analysis. I decided.
In contrast to HTML, XPath can be expressed by (E) BNF, and the syntax is defined by EBNF in the W3C standard specification. Gammo :: XPath :: Parser corresponds to this specification. Based on the Abstract Syntax Tree constructed in this way, each node of XPath AST (eg Path, for the element that becomes the context node when traversing using XPath (the element node that becomes the root of the target to execute XPath) Axes, Predicate, Function) are evaluated in order, and candidate elements are searched / added or filtered. After going through these processes, the final set of remaining nodes is returned as the evaluation result of the given XPath.
Since gammo is implemented according to the XPath 1.0 specification in principle, we will touch on how XPath works here.
Without the omission, XPath basically builds a set of nodes with the following syntax:
/axis::Namespace:Node test[predicate]/~
In XPath, the search direction is first determined by the axis. This means the direction seen from the current context node, such as child
or ancestor
(ancestor). Then, the search is started in that direction, and the set is assembled by filtering the candidate element groups by the element names and expressions specified in namespace: node test
. For the part that is [predicate]
, it is possible to filter the set assembled by the above procedure using an additional complicated formula.
In XPath, it is also possible to track HTML documents in order from the left, using /
as a delimiter like URI.
Taking these into consideration, for example, for the following HTML, only those with foo
specified in the class
attribute value of the li
element are extracted.
doc = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
<title>hello world</title>
<meta charset="utf8">
</head>
<body>
<ul>
<li class="foo">a</li>
<li class="bar">b</li>
<li class="foo">c</li>
<li class="bar">d</li>
<li class="foo">e</li>
</ul>
</body>
</html>
EOS
p doc.xpath('descendant-or-self::li[@class="foo"]').map(&:inner_text)
#=> ["a", "c", "e"]
I didn't know it existed until I implemented it, but it seems that XPath has the concept of a function. Many functions are defined, such as functions that handle character strings, numerical values, and boolean values, and functions that handle node information.
Here, as an example, we will introduce a function that returns whether an element has a specific character string as inner text.
doc = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
<title>hello world</title>
<meta charset="utf8">
</head>
<body>
</body>
</html>
EOS
#Using the contains function, the text node under the title element
# `hello`Test whether it contains the string.
p doc.xpath('contains(//title/text(), "hel")', result_type: Gammo::XPath::BOOLEAN_TYPE) #=> true
In this way, it is possible to use a function in an XPath expression.
However, since there aren't many situations that I want to use personally, the response is inadequate. If you are interested in Gammo or XPath functions, I would appreciate it if you could send me a patch while looking at the Correspondence Table.
This article was written as a subset of the last day's article, but I hope you'll get an idea of the XPath 1.0 overview and implementation in Ruby.
Recommended Posts