Rewrite relative links in html to absolute links in python (lxml)

Rewrite relative links in html to absolute links in python (lxml)

lxml

It is a convenient library when handling html (xml).

lxml - Processing XML and HTML with Python(http://lxml.de/)

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Using lxml, you can easily write the process of rewriting all relative links in html to absolute links.

Make_links_absolute () is good when rewriting all links to absolute links

Use lxml.html.make_links_absolute (). For example, suppose you have the following html (a.html).

a.html

<html>
  <head>
    <style type="text/css">
      .download {background-image:url(./images/download.png);}
    </style>
    <script src="./js/lib.js" type="text/javascript"/>
    <script src="./js/app.js" type="text/javascript"/>
  </head>
  <body>
    <img src="images/icon.png " alt="image"/>
    <a class="download "href="./download">download</a>
  </body>
</html>

Execute code similar to the following. Please give the base url to base_url.

from lxml import html

with open("./a.html", "r") as rf:
    doc = html.parse(rf).getroot()
    html.make_links_absolute(doc, base_url="http://example.net/foo/bar")
    print html.tostring(doc, pretty_print=True)

It can be rewritten as an absolute link as follows.

<html>
<head>
<style type="text/css">
      .download {background-image:url(http://example.net/foo/images/download.png);}
    </style>
<script src="http://example.net/foo/js/lib.js" type="text/javascript"></script><script src="http://example.net/foo/js/app.js" type="text/javascript"></script>
</head>
<body>
    <img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
  </body>
</html>

It is wonderful that the following three are also interpreted as links.

If you want to do something a little more complicated, use rewrite_links ()

For example, you may want to rewrite all links to absolute links, but leave only the js files as relative links.

Let's take a look at the implementation of make_links_absolute ().

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

class HtmlMixin(object):
#...
    def make_links_absolute(self, base_url=None, resolve_base_href=True):
        """
        Make all links in the document absolute, given the
        ``base_url`` for the document (the full URL where the document
        came from), or if no ``base_url`` is given, then the ``.base_url`` of the document.

        If ``resolve_base_href`` is true, then any ``<base href>``
        tags in the document are used *and* removed from the document.
        If it is false then any such tag is ignored.
        """
        if base_url is None:
            base_url = self.base_url
            if base_url is None:
                raise TypeError(
                    "No base_url given, and the document has no base_url")
        if resolve_base_href:
            self.resolve_base_href()
        def link_repl(href):
            return urljoin(base_url, href)
        self.rewrite_links(link_repl)

rewrite_links () is used. In other words, you should use this.

Suppose you want to change a link other than the path that refers to js in the previous html to an absolute link.

from lxml import html
from urlparse import urljoin
import functools

def repl(base_url, href):
    if href.endswith(".js"):
       return href
    else:
        return urljoin(base_url, href)

with open("./a.html", "r") as rf:
    doc = html.parse(rf).getroot()   
    base_url="http://example.net/foo/bar"
    doc.rewrite_links(functools.partial(repl, base_url))
    print html.tostring(doc, pretty_print=True)

The link to js (src attribute of script tag) remains a relative link.

<html>
<head>
<style type="text/css">
      .download {background-image:url(http://example.net/foo/images/download.png);}
    </style>
<script src="./js/lib.js" type="text/javascript"></script><script src="./js/app.js" type="text/javascript"></script>
</head>
<body>
    <img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
  </body>
</html>

Precautions when rewriting html containing multibyte characters

As mentioned above, using lxml is convenient because you can easily rewrite from a relative link to an absolute link. You need to be a little careful when trying to convert html that contains multibyte characters such as Japanese.

For example, suppose you have the following html (b.html).

b.html

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

Let's convert this with lxml.

# -*- coding:utf-8 -*-
from lxml import html

with open("./b.html", "r") as rf:
    doc = html.parse(rf).getroot()
    print html.tostring(doc, pretty_print=True)

The output result is as follows.

<html>
<head></head>
<body>
    &#26085;&#26412;&#35486;&#12398;&#25991;&#23383;&#21015;
    <p>&#12354;&#12356;&#12358;&#12360;&#12362;</p>
  </body>
</html>

When the generated html is viewed with a browser, it will be displayed as the intended character string. However, when I open this html with an editor etc., it is converted into a list of multiple numbers and symbols. In html, there are two types of character representation, "character entity reference" and "numerical character reference". This is because the character string passed in the former has been converted to the latter when it is passed to lxml. (For details, see http://www.asahi-net.or.jp/~sd5a-ucd/rec-html401j/charset.html#h-5.3.1)

This is exactly the same format as when converting a unicode string to a str type value in python as follows.

print(u"AIUEO".encode("ascii", "xmlcharrefreplace"))
## &#12354;&#12356;&#12358;&#12360;&#12362;

In fact, the help for encode () also mentions that.

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> help("".encode)
Help on built-in function encode:

encode(...)
    S.encode([encoding[,errors]]) -> object
    
    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

The story goes back. If you want to pass html containing multibyte characters to lxml, Please give encoding to tostring.

# -*- coding:utf-8 -*-
from lxml import html

with open("./b.html", "r") as rf:
    doc = html.parse(rf).getroot()
    print html.tostring(doc, pretty_print=True, encoding="utf-8")

It was output in a human-readable format.

<html>
<head></head>
<body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

It will continue for a little longer.

Continued, Precautions when rewriting html containing multibyte characters

As far as I looked around on the net, there are many places where I ended up passing the encoding to lxml.html.tostring. There are a few more traps.

In the previous example, the encoding specification was added to html. But when passed html without it, lxml returns strange output. (Although it can be said that the existence of html without encoding is evil in itself. The principle claim is powerless before reality)

html (d.html) without encoding specified

<html>
  <head>
  </head>
  <body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

The result is as follows.

<html>
<head></head>
<body>
    日本語の文字列
    <p>あいうえお</p>
  </body>
</html>

Let's investigate the cause. Let's take a look at the implementation of lxml.html.parse, lxml.html.tostring. Since I passed the encoding earlier, I can say that the tostring is already supported, so I will look at the parse.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

def parse(filename_or_url, parser=None, base_url=None, **kw):
    """
    Parse a filename, URL, or file-like object into an HTML document
    tree.  Note: this returns a tree, not an element.  Use
    ``parse(...).getroot()`` to get the document root.

    You can override the base URL with the ``base_url`` keyword.  This
    is most useful when parsing from a file-like object.
    """
    if parser is None:
        parser = html_parser
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)

It seems that it receives an argument called parser and is passed a parser called html_parser by default.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

from lxml import etree

# ..snip..

class HTMLParser(etree.HTMLParser):
    """An HTML parser that is configured to return lxml.html Element
    objects.
    """
    def __init__(self, **kwargs):
        super(HTMLParser, self).__init__(**kwargs)
        self.set_element_class_lookup(HtmlElementClassLookup())

# ..snip..

html_parser = HTMLParser()
xhtml_parser = XHTMLParser()

A class that inherits from lxml.etree.HTMLParser is defined as HTMLParser. The default parser seems to be this instance. By the way, etree.HTMLParser seems to be a class written in C.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/etree/__init__.py

def __bootstrap__():
   global __bootstrap__, __loader__, __file__
   import sys, pkg_resources, imp
   __file__ = pkg_resources.resource_filename(__name__,'etree.so')
   __loader__ = None; del __bootstrap__, __loader__
   imp.load_dynamic(__name__,__file__)
__bootstrap__()

Let's take a look at the documentation (lxml.etree.HTMLParse: http://lxml.de/3.1/api/lxml.etree.HTMLParser-class.html)

Method Details 	[hide private]
__init__(self, encoding=None, remove_blank_text=False, remove_comments=False, remove_pis=False, strip_cdata=True, no_network=True, target=None, XMLSchema schema=None, recover=True, compact=True)
(Constructor)
	 
x.__init__(...) initializes x; see help(type(x)) for signature

Overrides: object.__init__ 

HTMLParser also seems to be able to specify encoding. Probably the default is None, and the behavior at this time is like setting the encoding by looking at the meta tag of html. If you don't have that meta tag, it probably sets the default encoding for your system environment.

I will imitate html_parser and create an instance.

# -*- coding:utf-8 -*-
from lxml import html

html_parser = html.HTMLParser(encoding="utf-8")

with open("./d.html", "r") as rf:
    doc = html.parse(rf, parser=html_parser).getroot()
    print html.tostring(doc, pretty_print=True, encoding="utf-8")

It seems that the output was successful.

<html>
<head></head>
<body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

Chardet if you don't know the proper encoding

If you don't know the proper encoding, you can use chardet (cchardet).

# -*- coding:utf-8 -*-
import chardet #if not installed. pip install chardet

print chardet.detect(u"AIUEO".encode("utf-8"))
# {'confidence': 0.9690625, 'encoding': 'utf-8'}

## warning
print chardet.detect(u"abcdefg".encode("utf-8")[:3])
# {'confidence': 1.0, 'encoding': 'ascii'}

def detect(fname, size = 4096 << 2):
    with open(fname, "rb") as rf:
        return chardet.detect(rf.read(size)).get("encoding")

print detect("b.html") # utf-8
print detect("a.html") # ascii

That's it.

Recommended Posts

Rewrite relative links in html to absolute links in python (lxml)
Convert absolute URLs to relative URLs in Python
Convert from Markdown to HTML in Python
Rewrite Python2 code to Python3 (2to3)
Python> List> Convert relative paths to absolute paths> all_filepaths = [datas_path + fp for fp in train_filepaths]
To flush stdout in Python
Login to website in Python
Speech to speech in python [text to speech]
Relative url handling in python
Post to Slack in Python
A story about how to specify a relative path in python.
[Python] How to do PCA in Python
View photos in Python and html
Convert markdown to PDF in Python
How to collect images in Python
How to use SQLite in Python
Output the contents of ~ .xlsx in the folder to HTML with Python
Try to calculate Trace in Python
How to use Mysql in python
[Introduction to Udemy Python3 + Application] 69. Import of absolute path and relative path
How to wrap C in Python
How to use ChemSpider in Python
6 ways to string objects in Python
How to use PubChem in Python
How to handle Japanese in Python
An alternative to `pause` in Python
I made a web application in Python that converts Markdown to HTML
[Introduction to Python] How to use class in Python?
Install Pyaudio to play wave in python
How to access environment variables in Python
HTML email with image to send with python
I tried to implement permutation in Python
Method to build Python environment in Xcode 6
How to dynamically define variables in Python
How to do R chartr () in Python
Pin current directory to script directory in Python
[Itertools.permutations] How to put permutations in Python
PUT gzip directly to S3 in Python
Send email to multiple recipients in Python (Python 3)
Convert psd file to png in Python
Sample script to trap signals in Python
I tried to implement PLSA in Python 2
To set default encoding to utf-8 in python
How to work with BigQuery in Python
Log in to Slack using requests in Python
How to get a stacktrace in python
How to display multiplication table in python
Easy way to use Wikipedia in Python
Rewrite SPSS Modeler filter nodes in Python
I tried to implement ADALINE in Python
[Python] Pandas to fully understand in 10 minutes
Throw Incoming Webhooks to Mattermost in Python
Module to generate word N-gram in Python
To reference environment variables in Python in Blender
I wanted to solve ABC159 in Python
How to switch python versions in cloud9
How to adjust image contrast in Python
How to use __slots__ in Python class
How to dynamically zero pad in Python
Manage python packages to install in containers
To work with timestamp stations in Python