[PYTHON] Web scraping using AWS Lambda

Background

This is a continuation of FizzBuzz using AWS Lambda. This time, I tried scraping data from an external web page.

AWS Architecture

(architecture diagram)

- S3 (data storage)
- AWS Lambda (data processing)
- Amazon EventBridge (periodic execution)

These are the three services used.

Setup

S3

Create a bucket for data storage.

Enter only the bucket name and leave the other settings at their defaults. (Select the region as appropriate.)
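For reference, the same bucket can also be created programmatically with boto3. A minimal sketch; the bucket name and region below are placeholders, not values from this article:

```python
import boto3

# Create an S3 bucket for data storage (equivalent to the console steps above).
# 'my-article-store' and 'ap-northeast-1' are placeholder values.
s3 = boto3.client('s3', region_name='ap-northeast-1')
s3.create_bucket(
    Bucket='my-article-store',
    CreateBucketConfiguration={'LocationConstraint': 'ap-northeast-1'},
)
```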

Bucket creation is complete.

Lambda

Create a Lambda function for data processing.

Instead of creating it from scratch, use the "s3-get-object-python" blueprint under "Use a blueprint".

Enter the function name and role name. Since this time we will upload files to S3, delete the "read-only access" policy template. For the S3 trigger, enter the bucket you created earlier as the bucket name.

After that, enter a string in the **prefix** option. The string must not overlap with the beginning of the file names that the Lambda function itself creates. If you leave it empty, or if it overlaps with the generated file names, every file the function writes will re-trigger the function, and the **infinite Lambda loop will incur a large charge :scream:**, so this is important.

Another workaround is to restrict the event type, for example to copy events only.
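As an extra safety net, the handler itself can skip objects that look like its own output. A minimal sketch; the `article_` prefix is taken from the file names written later in this article, and the rest is a hypothetical guard, not part of the original setup:

```python
# Defensive guard against S3-triggered recursion:
# ignore objects whose names look like this function's own output.
# 'article_' matches the file names this article's function writes.
def lambda_handler(event, context):
    for record in event.get('Records', []):
        key = record['s3']['object']['key']
        if key.startswith('article_'):
            print('Skipping self-generated object: ' + key)
            continue
        # ... process the externally uploaded object here ...
```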

After completing all the entries, press "Create Function".

A template will be created, but if you deploy and test it as is, a permission error will occur.

S3 settings

First, enable the S3 trigger.

If you look at the event notifications in the bucket properties in S3, you can see that the trigger has been added.

Add permissions to a role

Select IAM → Roles to display the role list. Here, select the role name you entered earlier when creating the Lambda.

Press "Attach Policy" without thinking.

Filter by "LambdaFull", select "AWSLambdaFullAccess" and press "Attach Policy".
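The same attachment can also be done with boto3, for those who prefer scripting over the console. A sketch; the role name is a placeholder:

```python
import boto3

# Attach the AWSLambdaFullAccess managed policy to the Lambda execution role.
# 'my-lambda-role' is a placeholder for the role created with the function.
iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='my-lambda-role',
    PolicyArn='arn:aws:iam::aws:policy/AWSLambdaFullAccess',
)
```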

This completes adding permissions.

Basic configuration

Processing failed when the memory was too small, so set the memory to 256 MB and the timeout to 10 seconds.
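If you prefer to script this instead of using the console, boto3 can set the same values. A sketch; the function name is a placeholder:

```python
import boto3

# Raise memory and timeout so the scraping run does not fail.
# 'my-scraper' is a placeholder function name.
client = boto3.client('lambda')
client.update_function_configuration(
    FunctionName='my-scraper',
    MemorySize=256,   # MB
    Timeout=10,       # seconds
)
```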

That's it.

Development (sending a file)

Use the boto3 package to send files to and receive them from S3 buckets. If you selected s3-get-object-python when creating the Lambda, boto3 comes bundled. boto3 itself is large (10 MB or more when packaged), so it is better to use the bundled version than to upload it yourself.

```python
import json
import urllib.parse
import boto3
import datetime

def lambda_handler(event, context):
    try:
        # Write a small test file to the bucket, named with a timestamp
        s3 = boto3.resource('s3')

        bucket = '[Bucket name]'
        key = 'test_{}.txt'.format(datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'))
        file_contents = 'Lambda test'

        obj = s3.Object(bucket, key)
        obj.put(Body=file_contents)

    except Exception as e:
        print(e)
        raise e
```

After deploying and testing, the file is uploaded to the bucket. As for the test event settings, the function runs even with an empty JSON (`{}`).
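You can also trigger the same run from code instead of the console's Test button. A sketch; the function name is a placeholder:

```python
import boto3

# Invoke the function with an empty JSON payload, mirroring the console test.
# 'my-scraper' is a placeholder function name.
client = boto3.client('lambda')
response = client.invoke(
    FunctionName='my-scraper',
    Payload=b'{}',
)
print(response['Payload'].read())
```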

Development (Web scraping)

For scraping I use requests and beautifulsoup4, but packages installed with pip must be uploaded to Lambda.

The method is to install the packages into a folder with pip and zip up that folder. Create the handler file there and copy in the code written in Lambda.

```bash
mkdir packages
cd packages
pip install requests -t ./
pip install beautifulsoup4 -t ./
touch lambda_function.py
```

The packages are now placed in the project. Next, move the folders and files under packages up one level, into articleStore. Then deploy and test to confirm that the file is added to S3.
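If you would rather script the zip-and-upload step than use the console editor, a sketch with boto3; articleStore is the project folder from this article, while the function name is a placeholder:

```python
import boto3
import shutil

# Zip the project folder (handler plus pip-installed packages) and upload it.
# 'my-scraper' is a placeholder function name.
archive = shutil.make_archive('function', 'zip', 'articleStore')

client = boto3.client('lambda')
with open(archive, 'rb') as f:
    client.update_function_code(
        FunctionName='my-scraper',
        ZipFile=f.read(),
    )
```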

All that remains is the web scraping itself. Here, I will try fetching today's Mainichi Shimbun editorials.

```python
import json
import urllib.parse
import boto3
import datetime
from datetime import timedelta, timezone
import random
import os
import requests
from bs4 import BeautifulSoup

print('Loading function')

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    # Today's date in JST, formatted like the dates shown on the editorial page
    JST = timezone(timedelta(hours=+9), 'JST')
    dt_now = datetime.datetime.now(JST)
    date_str = dt_now.strftime('%Y年%m月%d日')

    response = requests.get('https://mainichi.jp/editorial/')

    soup = BeautifulSoup(response.text, 'html.parser')
    pages = soup.find("ul", class_="list-typeD")

    articles = pages.find_all("article")

    # Keep only today's articles and build absolute links
    links = ["https:" + a.a.get("href") for a in articles if date_str in a.time.text]

    for i, link in enumerate(links):
        bucket_name = "[Bucket name]"
        folder_path = "/tmp/"
        filename = 'article_{0}_{1}.txt'.format(dt_now.strftime('%Y-%m-%d'), i + 1)

        try:
            bucket = s3.Bucket(bucket_name)

            # Write the article to /tmp (the only writable path in Lambda), then upload
            with open(folder_path + filename, 'w') as fout:
                fout.write(extract_article(link))

            bucket.upload_file(folder_path + filename, filename)
            os.remove(folder_path + filename)

        except Exception as e:
            print(e)
            raise e

    return {
        "date": dt_now.strftime('%Y-%m-%d %H:%M:%S')
    }

# Extract one editorial: title plus body text
def extract_article(src):
    response = requests.get(src)
    soup = BeautifulSoup(response.text, 'html.parser')

    text_area = soup.find(class_="main-text")
    title = soup.h1.text.strip()
    sentence = "".join([txt.text.strip() for txt in text_area.find_all(class_="txt")])

    return title + "\n" + sentence
```
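As a quick usage check, the handler above can also be run locally with an empty event, the same way the console test does. This snippet is a hypothetical addition, assuming AWS credentials plus requests and beautifulsoup4 are installed on your machine:

```python
# Local smoke test for the handler above (not part of the deployed code).
if __name__ == '__main__':
    print(lambda_handler({}, None))
```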

From Deploy > Test, the extracted articles (two text files on this day) are added to the S3 bucket.

It took a while, but the Lambda setup is now complete.

Amazon EventBridge

This works, but it is really tedious to press the "Test" button every morning, so use Amazon EventBridge to set up periodic execution.

Select Amazon EventBridge → Events → Rules, then press "Create rule".

Enter the rule name and description. Cron expressions are evaluated in UTC, so to run at 7:00 a.m. Japan time, use `0 22 * * ? *`. Select the Lambda function name as the target and create the rule.
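The same rule can be created with boto3. A sketch; the rule name, function name, and account-specific ARNs are placeholders, and the Lambda also needs permission for EventBridge to invoke it:

```python
import boto3

events = boto3.client('events')

# Run every day at 22:00 UTC (7:00 a.m. JST). Names and ARNs are placeholders.
events.put_rule(
    Name='daily-editorial-scrape',
    ScheduleExpression='cron(0 22 * * ? *)',
    State='ENABLED',
)
events.put_targets(
    Rule='daily-editorial-scrape',
    Targets=[{
        'Id': 'scraper-lambda',
        'Arn': 'arn:aws:lambda:ap-northeast-1:123456789012:function:my-scraper',
    }],
)

# Allow EventBridge to invoke the function.
boto3.client('lambda').add_permission(
    FunctionName='my-scraper',
    StatementId='allow-eventbridge',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn='arn:aws:events:ap-northeast-1:123456789012:rule/daily-editorial-scrape',
)
```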

That's it.

Postscript

As a plan for what comes next, I will stock editorials from several newspaper companies for a year and try machine learning on them. It would be nice if all the pages could be fetched with requests alone, but for sites that load the article list dynamically when the page loads (for example, the Asahi Shimbun), you need to control a browser with Selenium.
