Scraping with Python 3.5's async syntax

For scraping with Python there are existing libraries such as Scrapy and demiurge, but this time I will try rolling my own using the async syntax added in Python 3.5.

I won't explain what async / await themselves are. For how to use the async / await syntax, the article here was helpful.
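As a bare-minimum refresher, a coroutine in the 3.5 syntax is declared with async def and suspends at await. A minimal sketch (the function name is just for illustration):

import asyncio


async def greet(name):
    # await suspends this coroutine without blocking the event loop
    await asyncio.sleep(1)
    print("hello {0}".format(name))

loop = asyncio.get_event_loop()
loop.run_until_complete(greet("world"))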


environment

Python 3.5 (beautifulsoup4 is also used in the scraping section below)

How To

First, the part that downloads the web pages:

import asyncio
import urllib.request


class Downloader:
    def __init__(self, urls):
        self.urls = urls

    def run(self):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(self.fetch())

    async def fetch(self):
        # Schedule one download per URL and wait until all of them finish.
        return await asyncio.wait(
            [self.download(i, url) for i, url in enumerate(self.urls)]
        )

    async def download(self, n, url):
        request = urllib.request.Request(url)
        # urllib is blocking, so hand the call off to the event loop's
        # default thread pool; otherwise each download would block the loop.
        loop = asyncio.get_event_loop()
        html = await loop.run_in_executor(
            None, lambda: urllib.request.urlopen(request).read()
        )
        print("{0} {1} download finish...".format(n, url))
        return html


if __name__ == "__main__":
    downloader = Downloader([
        "https://www.python.org/", 
        "https://www.python.org/about/", 
        "https://www.python.org/downloads/"
    ])

    downloader.run()

result

1 https://www.python.org/about/ download finish...
2 https://www.python.org/downloads/ download finish...
0 https://www.python.org/ download finish...

What's notable about this code is that the download coroutines run concurrently. Since urllib itself is blocking, each call is handed off to the event loop's thread-pool executor via run_in_executor; asyncio.wait then lets the downloads overlap instead of running one by one synchronously, which is why they finish out of order.
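Note that asyncio.wait resolves to a pair of (done, pending) task sets rather than the downloaded pages themselves. A minimal sketch of how you might pull the HTML out of run() (the variable names here are just for illustration):

downloader = Downloader(["https://www.python.org/"])
# asyncio.wait returns the (done, pending) sets of tasks
done, pending = downloader.run()
# each finished task's result is the raw HTML returned by download()
htmls = [task.result() for task in done]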

Scraping

This alone only downloads the HTML, and parsing it separately is a hassle, so let's modify the code to add a parser. This time we will use BeautifulSoup to extract the contents of each site's <title> tag.

import asyncio
import urllib.request

from bs4 import BeautifulSoup


class Scraping:
    def __init__(self, urls):
        self.urls = urls

    def run(self):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(self.fetch())

    async def fetch(self):
        return await asyncio.wait(
            [self.scraping(url) for url in self.urls]
        )

    async def scraping(self, url):
        request = urllib.request.Request(url)
        # Run the blocking download in the thread pool, as before.
        loop = asyncio.get_event_loop()
        html = await loop.run_in_executor(
            None, lambda: urllib.request.urlopen(request).read()
        )
        # Parse the HTML and print the contents of the <title> tag.
        bs = BeautifulSoup(html, "html.parser")
        print(bs.title.string)


if __name__ == "__main__":
    scraping = Scraping([
        "https://www.python.org/", 
        "https://www.python.org/about/", 
        "https://www.python.org/downloads/"
    ])

    scraping.run()

result

Welcome to Python.org
Download Python | Python.org
About Python™ | Python.org

Summary

It's simple, but I was able to implement my own scraping process. From here, adding a crawling function would turn it into a decent little framework (a rough sketch of that step follows below). For crawling, I think the Crawler / Web Scraping Advent Calendar and the like are worth referring to.
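As a rough sketch of that crawling step, you could pull the links out of each downloaded page and feed them back in as the next batch of URLs. The helper below is hypothetical (it is not part of the code above), just to show the idea:

from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(base_url, html):
    # Collect absolute URLs from every <a href="..."> on the page;
    # these could be queued as the next round of downloads.
    bs = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in bs.find_all("a", href=True)]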

Compared to the 3.4 style, the async syntax lets you write this kind of asynchronous processing much more concisely.
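For reference, a minimal sketch of the same coroutine in both styles:

import asyncio


# Python 3.4 style: generator-based coroutine
@asyncio.coroutine
def fetch_34():
    yield from asyncio.sleep(1)


# Python 3.5 style: native coroutine with async / await
async def fetch_35():
    await asyncio.sleep(1)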

All of the code above is published on GitHub, so please refer to it if you are interested.
