[Python] Create an event-driven web crawler using AWS's serverless architecture

Create a serverless, event-driven web crawler

I'm studying ElasticSearch right now and wondered if I could build something with it, so I wrote an event-driven web crawler using Kinesis + Lambda.

Execution environment

  - CentOS 7
  - Python 2.7

Workflow

(Workflow diagram: serverless-crawler.png)

The general flow is as follows.

  1. Extract URLs with Scrapy (on Scrapinghub or AWS Lambda) and put them into a Kinesis stream
  2. Fire AWS Lambda from the Kinesis stream (the shape of the event it receives is sketched below)
  3. Crawl each URL from the Lambda function and send the data to ElasticSearch Service
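
For reference, the event that a Kinesis trigger hands to the Lambda function looks roughly like the following. The values here are made-up placeholders; the important point is that the record data arrives Base64-encoded, which is why the Lambda function later decodes it.


# Rough shape of the event a Kinesis trigger passes to lambda_handler
# (placeholder values; 'data' is Base64 and decodes to http://example.com/)
sample_event = {
    "Records": [
        {
            "eventSource": "aws:kinesis",
            "kinesis": {
                "partitionKey": "scraper",
                "data": "aHR0cDovL2V4YW1wbGUuY29tLw=="
            }
        }
    ]
}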

Creating an IAM user

Since you need permission to use Kinesis and ElasticSearch, create an IAM user and take note of its **access key ID** and **secret access key**.

The user's ARN is also required, so keep it handy as well (arn:aws:iam::********:user/************).
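
If you are not sure what the ARN is, one convenient way to look it up is via STS with boto3. This is just a helper sketch and assumes the IAM user's credentials are already configured locally.


# -*- coding: utf-8 -*-
# Print the ARN of the IAM user whose credentials are configured locally
import boto3

sts = boto3.client('sts')
print(sts.get_caller_identity()['Arn'])   # e.g. arn:aws:iam::************:user/********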

Create a stream on AWS Kinesis

First, create a Kinesis stream.

  1. Create a stream
  2. Enter an appropriate stream name (tentatively scraping_url)
  3. Enter the number of shards (1 for the time being)
  4. Create the stream (a boto3 equivalent is sketched just below)
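
If you prefer to create the stream from code instead of the console, a minimal boto3 sketch could look like this. The stream name, shard count, and region follow the values used in this article; adjust the credential handling to your own setup.


# -*- coding: utf-8 -*-
# Create the Kinesis stream used in this article (1 shard is enough for testing)
import boto3

kinesis = boto3.client('kinesis', region_name='ap-northeast-1')
kinesis.create_stream(StreamName='scraping_url', ShardCount=1)

# Wait until the stream becomes ACTIVE before putting records into it
kinesis.get_waiter('stream_exists').wait(StreamName='scraping_url')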

Create an AWS ElasticSearch Service

Next, create an Elasticsearch domain with Amazon ElasticSearch Service.

Operation on AWS

  1. Create a new domain

  2. Enter an appropriate domain name in Elasticsearch domain name (tentatively web-archives)

  3. Select [5.1] for Elasticsearch version. Press [Next]

  4. Set Instance type to [t2.small] under Configure cluster and press [Next] (a small instance is fine, since this is just for testing)

  5. In Set up access policy, select [Allow or deny access to one or more AWS accounts or IAM users]

  6. Enter the ARN of the user you want to allow in [Account ID or ARN*]

  7. Create with [Confirm and create]

  8. After a while the domain will start up; check its [Endpoint] and keep it handy, since it will be used from Lambda later (a quick connectivity check is sketched below)
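
As an optional sanity check once the domain is active, you can hit the endpoint with a signed request. This sketch reuses requests and requests_aws4auth (installed in the next step); the endpoint and credentials are placeholders.


# -*- coding: utf-8 -*-
# Quick check that the ElasticSearch Service endpoint answers signed requests
import requests
from requests_aws4auth import AWS4Auth

host = 'search-***************.ap-northeast-1.es.amazonaws.com'   # your ES endpoint
awsauth = AWS4Auth('ACCESS_KEY_ID', 'SECRET_ACCESS_KEY', 'ap-northeast-1', 'es')

# GET / returns basic cluster information (name, version, etc.) if access is allowed
r = requests.get('https://' + host + '/', auth=awsauth)
print(r.status_code)
print(r.text)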

Create an index and mapping in ElasticSearch

Create mapping data that stores the URL, title, and contents of each article.

mapping.json


{
    "mappings": {                                            
        "article": {
            "properties" : {
                "url" : {
                    "type": "string",
                    "index" : "not_analyzed"
                },
                "title" : {
                    "type": "string",
                    "index" : "analyzed"
                },
                "contents" : {
                    "type": "string",
                    "index" : "analyzed"
                }
            }
        }   
    }
}

Next, write a script that creates an index from the above mapping data.

Install the following packages locally in advance
$ pip install requests_aws4auth elasticsearch

es-mapping.py


# -*- coding: utf-8 -*-
import elasticsearch
from requests_aws4auth import AWS4Auth
import json

if __name__ == '__main__':
    # Specify the ES endpoint
    host = 'search-***************.ap-northeast-1.es.amazonaws.com'
    awsauth = AWS4Auth(
            # The IAM user's access key ID and secret access key
            'ACCESS_KEY_ID',
            'SECRET_ACCESS_KEY',
            'ap-northeast-1', 'es')

    es = elasticsearch.Elasticsearch(
            hosts=[{'host': host, 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=elasticsearch.connection.RequestsHttpConnection
            )

    with open('mapping.json', 'r') as f:
        mapping = json.load(f)

    es.indices.create(index='website')                    
    es.indices.put_mapping(index='website', doc_type='article', body=mapping['mappings'])

$ python es-mapping.py

When you run the script, the index should be created on AWS ES.
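
If you want to confirm that the index and mapping were actually registered, one option is to append a couple of lines to the end of es-mapping.py, reusing its es client:


# Verify that the index exists and that the article mapping was applied
print(es.indices.exists(index='website'))
print(es.indices.get_mapping(index='website', doc_type='article'))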

Create AWS Lambda

Now that we've created ElasticSearch, let's create a Lambda Function.

Creating a Lambda function

Create a Lambda function locally.
$ mkdir web_crawler
$ cd web_crawler
$ vim lambda_function.py

lambda_function.py



# -*- coding: utf-8 -*-                    
import os
import base64
from readability import Document
import html2text
import requests
import elasticsearch
from elasticsearch import helpers
from requests_aws4auth import AWS4Auth

def lambda_handler(event, context):
    host = os.environ['ES_HOST']
    # Authenticate to the ElasticSearch Service with the IAM user's credentials
    awsauth = AWS4Auth(
            os.environ['ACCESS_ID'],
            os.environ['SECRET_KEY'], 'ap-northeast-1', 'es')

    es = elasticsearch.Elasticsearch(
            hosts=[{'host': host, 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=elasticsearch.connection.RequestsHttpConnection
    )

    articles = []

    # Get the records from the Kinesis stream
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        try:
            response = requests.get(payload)
            if response.ok:
                article = Document(response.content).summary()
                titleText = html2text.html2text(Document(response.content).title())
                contentsText = html2text.html2text(article)
                res = es.search(index="website", body={"query": {"match": {"url": payload}}})
                # Skip URLs that are already registered in ES
                if res['hits']['total'] == 0:
                    doc = {
                        'url': payload,
                        'title': titleText.encode('utf-8'),
                        'contents': contentsText.encode('utf-8')
                    }
                    # Use the 'article' type so the mapping defined earlier is applied
                    articles.append({'_index': 'website', '_type': 'article', '_source': doc})
        except requests.exceptions.HTTPError as err:
            print("HTTPError: " + str(err))
    # Bulk Insert
    helpers.bulk(es, articles)

After creating the Lambda function, install the required libraries into the same directory
$ pip install readability-lxml html2text elasticsearch requests_aws4auth requests -t /path/to/web_crawler

Then zip it up
$ zip -r web_crawler.zip .
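
Before uploading, you can optionally exercise lambda_handler locally with a hand-built Kinesis-style event. This is a rough sketch: it assumes the ES_HOST, ACCESS_ID, and SECRET_KEY environment variables are exported in your shell, and the URL is just an example.


# -*- coding: utf-8 -*-
# Rough local test: feed lambda_handler a fake Kinesis event containing one URL
import base64
from lambda_function import lambda_handler

test_url = 'http://example.com/'
event = {
    'Records': [
        {'kinesis': {'data': base64.b64encode(test_url)}}
    ]
}
lambda_handler(event, None)   # context is unused, so None is fine here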

Deploy a Lambda function on AWS

  1. [Create Lambda function]

  2. Select [Blank Function]

  3. Select the Kinesis stream created earlier in [Trigger Settings].

  4. Set [Batch size] to around 10

  5. Set [Starting position] to [Trim horizon]

  6. Check [Enable trigger]

  7. Enter [Name] in [Function settings] (tentatively WebCrawler here)

  8. Select Python 2.7 for Runtime

  9. Under Code Entry Type, select Upload .ZIP file

  10. Specify the zip file created earlier from [Function Package]

  11. Set three [Environment variables] for accessing ElasticSearch:

  12. the access key ID in ACCESS_ID

  13. the secret access key in SECRET_KEY

  14. the ElasticSearch endpoint in ES_HOST

  15. Leave [Handler] as lambda_function.lambda_handler

  16. Create an execution role as appropriate

  17. Set [Timeout] in [Detailed Settings] to about 2 minutes.

  18. [Create function] (the trigger from steps 3-6 can also be set up programmatically; see the sketch below)
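
If you would rather wire up the Kinesis trigger from code instead of the console, the event source mapping from steps 3-6 could be created with boto3 roughly as follows. The account ID in the stream ARN and the function name are placeholders matching this article's examples.


# -*- coding: utf-8 -*-
# Attach the Kinesis stream to the WebCrawler function (equivalent of steps 3-6)
import boto3

lambda_client = boto3.client('lambda', region_name='ap-northeast-1')
lambda_client.create_event_source_mapping(
        EventSourceArn='arn:aws:kinesis:ap-northeast-1:************:stream/scraping_url',
        FunctionName='WebCrawler',
        StartingPosition='TRIM_HORIZON',   # same as [Trim horizon] in the console
        BatchSize=10,
        Enabled=True)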

Extract URL with Scrapy and send to Kinesis stream

Next, as the final stage, I will use Scrapy to extract URLs from a list page and send them to the Kinesis stream.

For the list page I use Hatena Bookmark's hot entries. It would probably be easier to get the data from the RSS feed, but I deliberately scraped the web page instead. Scrapy is a useful and powerful framework for building advanced web crawlers, so give it a try if you're interested.

Creating a project

First, install Scrapy and create a project.
$ pip install scrapy
$ scrapy startproject hotentry
$ vim hotentry/hotentry/spiders/hotentry.py
Enter the code below.

hotentry.py


# -*- coding: utf-8 -*-
import scrapy
from scrapy.conf import settings
import boto3
import json

kinesis = boto3.client(
        'kinesis',                                                                                                                                                                           
        aws_access_key_id=settings['AWS_ACCESS_KEY_ID'],
        aws_secret_access_key=settings['AWS_SECRET_ACCESS_KEY'],
        region_name='ap-northeast-1')

class HotEntrySpider(scrapy.Spider):
    name = "hotentry"
    allowed_domains = ["b.hatena.ne.jp"]
    start_urls = ['http://b.hatena.ne.jp/hotentry/general']

    def parse(self, response):
        # Extract the link of each hot entry and push it into the Kinesis stream
        for sel in response.css("li.hb-entry-unit-with-favorites"):
            url = sel.css("a.entry-link::attr('href')").extract_first()
            if url is None:
                continue
            kinesis.put_record(
                    StreamName="scraping_url",
                    Data=url,
                    PartitionKey="scraper"
            )

$ vim hotentry/hotentry/settings.py
Add access key ID and secret access key to settings.py

AWS_ACCESS_KEY_ID = 'AKI******************'
AWS_SECRET_ACCESS_KEY = '************************************'

You can now PUT records into your Kinesis stream. Let's try running this code.
$ scrapy crawl hotentry

Data should now flow through "Scrapy -> Kinesis -> AWS Lambda -> ElasticSearch". A quick way to check the results is sketched below.
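
To check whether documents actually arrived in the website index, one option is to query it with the same kind of signed client used earlier. The endpoint and credentials below are placeholders.


# -*- coding: utf-8 -*-
# Count the indexed articles and print a few of their URLs
import elasticsearch
from requests_aws4auth import AWS4Auth

host = 'search-***************.ap-northeast-1.es.amazonaws.com'
awsauth = AWS4Auth('ACCESS_KEY_ID', 'SECRET_ACCESS_KEY', 'ap-northeast-1', 'es')
es = elasticsearch.Elasticsearch(
        hosts=[{'host': host, 'port': 443}],
        http_auth=awsauth, use_ssl=True, verify_certs=True,
        connection_class=elasticsearch.connection.RequestsHttpConnection)

res = es.search(index='website', body={'query': {'match_all': {}}}, size=5)
print(res['hits']['total'])
for hit in res['hits']['hits']:
    print(hit['_source']['url'])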

Deploy Scrapy to Scrapinghub

I was able to extract URLs with Scrapy and send them to Kinesis, but as it stands this is just a local batch job, so let's deploy the Scrapy code to a cloud service called Scrapinghub.

Please see the following article for details on how to set it up.

Everything from user registration to deployment is straightforward, so I'll omit the details here.

Finally

Initially I used SQS and DynamoDB and split the work across multiple Lambda functions, but it became complicated and I got frustrated because I couldn't trace errors. After all, simple is best. I hope Lambda triggers will come to support even more services.

**Since this code was written as an experiment, error handling and the like are not done rigorously. If you use this code, please do so at your own risk.**
