Scraping with Python 3.5's async syntax

For scraping with Python there are existing libraries such as Scrapy and demiurge, but this time I will try rolling my own using the async syntax added in Python 3.5.

I won't explain what async / await themselves are. For how to use the async / await syntax, the article here was helpful.
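As a bare-minimum refresher, a coroutine in the 3.5 syntax is declared with async def and suspends at await. A minimal sketch (the function name is just for illustration):

import asyncio


async def greet(name):
    # await suspends this coroutine without blocking the event loop
    await asyncio.sleep(1)
    print("hello {0}".format(name))

loop = asyncio.get_event_loop()
loop.run_until_complete(greet("world"))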


environment

Python 3.5 (beautifulsoup4 is also used in the scraping section below)

How To

First, the part that downloads the web pages:

import asyncio
import urllib.request


class Downloader:
    def __init__(self, urls):
        self.urls = urls

    def run(self):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(self.fetch())

    async def fetch(self):
        # Schedule one download per URL and wait until all of them finish.
        return await asyncio.wait(
            [self.download(i, url) for i, url in enumerate(self.urls)]
        )

    async def download(self, n, url):
        request = urllib.request.Request(url)
        # urllib is blocking, so hand the call off to the event loop's
        # default thread pool; otherwise each download would block the loop.
        loop = asyncio.get_event_loop()
        html = await loop.run_in_executor(
            None, lambda: urllib.request.urlopen(request).read()
        )
        print("{0} {1} download finish...".format(n, url))
        return html


if __name__ == "__main__":
    downloader = Downloader([
        "https://www.python.org/", 
        "https://www.python.org/about/", 
        "https://www.python.org/downloads/"
    ])

    downloader.run()

result

1 https://www.python.org/about/ download finish...
2 https://www.python.org/downloads/ download finish...
0 https://www.python.org/ download finish...

What's notable about this code is that the download coroutines run concurrently. Since urllib itself is blocking, each call is handed off to the event loop's thread-pool executor via run_in_executor; asyncio.wait then lets the downloads overlap instead of running one by one synchronously, which is why they finish out of order.
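Note that asyncio.wait resolves to a pair of (done, pending) task sets rather than the downloaded pages themselves. A minimal sketch of how you might pull the HTML out of run() (the variable names here are just for illustration):

downloader = Downloader(["https://www.python.org/"])
# asyncio.wait returns the (done, pending) sets of tasks
done, pending = downloader.run()
# each finished task's result is the raw HTML returned by download()
htmls = [task.result() for task in done]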

Scraping

This alone only downloads the HTML, and parsing it separately is a hassle, so let's modify the code to add a parser. This time we will use BeautifulSoup to extract the contents of each site's <title> tag.

import asyncio
import urllib.request

from bs4 import BeautifulSoup


class Scraping:
    def __init__(self, urls):
        self.urls = urls

    def run(self):
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(self.fetch())

    async def fetch(self):
        return await asyncio.wait(
            [self.scraping(url) for url in self.urls]
        )

    async def scraping(self, url):
        request = urllib.request.Request(url)
        # Run the blocking download in the thread pool, as before.
        loop = asyncio.get_event_loop()
        html = await loop.run_in_executor(
            None, lambda: urllib.request.urlopen(request).read()
        )
        # Parse the HTML and print the contents of the <title> tag.
        bs = BeautifulSoup(html, "html.parser")
        print(bs.title.string)


if __name__ == "__main__":
    scraping = Scraping([
        "https://www.python.org/", 
        "https://www.python.org/about/", 
        "https://www.python.org/downloads/"
    ])

    scraping.run()

result

Welcome to Python.org
Download Python | Python.org
About Python™ | Python.org

Summary

It's simple, but I was able to implement my own scraping process. From here, adding a crawling function would turn it into a decent little framework (a rough sketch of that step follows below). For crawling, I think the Crawler / Web Scraping Advent Calendar and the like are worth referring to.
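As a rough sketch of that crawling step, you could pull the links out of each downloaded page and feed them back in as the next batch of URLs. The helper below is hypothetical (it is not part of the code above), just to show the idea:

from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(base_url, html):
    # Collect absolute URLs from every <a href="..."> on the page;
    # these could be queued as the next round of downloads.
    bs = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in bs.find_all("a", href=True)]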

Compared to the 3.4 style, the async syntax lets you write this kind of asynchronous processing much more concisely.
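For reference, a minimal sketch of the same coroutine in both styles:

import asyncio


# Python 3.4 style: generator-based coroutine
@asyncio.coroutine
def fetch_34():
    yield from asyncio.sleep(1)


# Python 3.5 style: native coroutine with async / await
async def fetch_35():
    await asyncio.sleep(1)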

All of the code above is published on GitHub, so please refer to it if you are interested.
