[PYTHON] Web scraping using AWS Lambda

Background

This is a continuation of FizzBuzz using AWS Lambda. This time, I tried scraping data from an external web page.

AWS Architecture

(architecture diagram)

- S3 (data storage)
- AWS Lambda (data processing)
- Amazon EventBridge (periodic execution)

These are the three services used.

Setup

S3

Create a bucket for data storage.

Enter only the bucket name and leave the other settings at their defaults. (Select the region as appropriate.)
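For reference, the same bucket can also be created programmatically with boto3. A minimal sketch; the bucket name and region below are placeholders, not values from this article:

```python
import boto3

# Create an S3 bucket for data storage (equivalent to the console steps above).
# 'my-article-store' and 'ap-northeast-1' are placeholder values.
s3 = boto3.client('s3', region_name='ap-northeast-1')
s3.create_bucket(
    Bucket='my-article-store',
    CreateBucketConfiguration={'LocationConstraint': 'ap-northeast-1'},
)
```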

Bucket creation is complete.

Lambda

Create a Lambda function for data processing.

Instead of creating it from scratch, use the "s3-get-object-python" blueprint under "Use a blueprint".

Enter the function name and role name. Since this time we will upload files to S3, delete the "read-only access" policy template. For the S3 trigger, enter the bucket you created earlier as the bucket name.

After that, enter a string in the **prefix** option. The string must not overlap with the beginning of the file names that the Lambda function itself creates. If you leave it empty, or if it overlaps with the generated file names, every file the function writes will re-trigger the function, and the **infinite Lambda loop will incur a large charge :scream:**, so this is important.

Another workaround is to restrict the event type, for example to copy events only.
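As an extra safety net, the handler itself can skip objects that look like its own output. A minimal sketch; the `article_` prefix is taken from the file names written later in this article, and the rest is a hypothetical guard, not part of the original setup:

```python
# Defensive guard against S3-triggered recursion:
# ignore objects whose names look like this function's own output.
# 'article_' matches the file names this article's function writes.
def lambda_handler(event, context):
    for record in event.get('Records', []):
        key = record['s3']['object']['key']
        if key.startswith('article_'):
            print('Skipping self-generated object: ' + key)
            continue
        # ... process the externally uploaded object here ...
```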

After completing all the entries, press "Create Function".

A template will be created, but if you deploy and test it as is, a permission error will occur.

S3 settings

First, enable the S3 trigger.

If you look at the event notifications in the bucket properties in S3, you can see that the trigger has been added.

Add permissions to a role

Select IAM → Roles to display the role list. Here, select the role name you entered earlier when creating the Lambda.

Press "Attach Policy" without thinking.

Filter by "LambdaFull", select "AWSLambdaFullAccess" and press "Attach Policy".
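The same attachment can also be done with boto3, for those who prefer scripting over the console. A sketch; the role name is a placeholder:

```python
import boto3

# Attach the AWSLambdaFullAccess managed policy to the Lambda execution role.
# 'my-lambda-role' is a placeholder for the role created with the function.
iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='my-lambda-role',
    PolicyArn='arn:aws:iam::aws:policy/AWSLambdaFullAccess',
)
```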

This completes adding permissions.

Basic configuration

Processing failed when the memory was too small, so set the memory to 256 MB and the timeout to 10 seconds.
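If you prefer to script this instead of using the console, boto3 can set the same values. A sketch; the function name is a placeholder:

```python
import boto3

# Raise memory and timeout so the scraping run does not fail.
# 'my-scraper' is a placeholder function name.
client = boto3.client('lambda')
client.update_function_configuration(
    FunctionName='my-scraper',
    MemorySize=256,   # MB
    Timeout=10,       # seconds
)
```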

That's it.

Development (sending a file)

Use the boto3 package to send files to and receive them from S3 buckets. If you selected s3-get-object-python when creating the Lambda, boto3 comes bundled. boto3 itself is large (10 MB or more when packaged), so it is better to use the bundled version than to upload it yourself.

```python
import json
import urllib.parse
import boto3
import datetime

def lambda_handler(event, context):
    try:
        # Write a small test file to the bucket, named with a timestamp
        s3 = boto3.resource('s3')

        bucket = '[Bucket name]'
        key = 'test_{}.txt'.format(datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'))
        file_contents = 'Lambda test'

        obj = s3.Object(bucket, key)
        obj.put(Body=file_contents)

    except Exception as e:
        print(e)
        raise e
```

After deploying and testing, the file is uploaded to the bucket. As for the test event settings, the function runs even with an empty JSON (`{}`).
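You can also trigger the same run from code instead of the console's Test button. A sketch; the function name is a placeholder:

```python
import boto3

# Invoke the function with an empty JSON payload, mirroring the console test.
# 'my-scraper' is a placeholder function name.
client = boto3.client('lambda')
response = client.invoke(
    FunctionName='my-scraper',
    Payload=b'{}',
)
print(response['Payload'].read())
```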

Development (Web scraping)

For scraping I use requests and beautifulsoup4, but packages installed with pip must be uploaded to Lambda.

The method is to install the packages into a folder with pip and zip up that folder. Create the handler file there and copy in the code written in Lambda.

```bash
mkdir packages
cd packages
pip install requests -t ./
pip install beautifulsoup4 -t ./
touch lambda_function.py
```

The packages are now placed in the project. Next, move the folders and files under packages up one level, into articleStore. Then deploy and test to confirm that the file is added to S3.
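If you would rather script the zip-and-upload step than use the console editor, a sketch with boto3; articleStore is the project folder from this article, while the function name is a placeholder:

```python
import boto3
import shutil

# Zip the project folder (handler plus pip-installed packages) and upload it.
# 'my-scraper' is a placeholder function name.
archive = shutil.make_archive('function', 'zip', 'articleStore')

client = boto3.client('lambda')
with open(archive, 'rb') as f:
    client.update_function_code(
        FunctionName='my-scraper',
        ZipFile=f.read(),
    )
```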

All that remains is the web scraping itself. Here, I will try fetching today's Mainichi Shimbun editorials.

```python
import json
import urllib.parse
import boto3
import datetime
from datetime import timedelta, timezone
import random
import os
import requests
from bs4 import BeautifulSoup

print('Loading function')

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    # Today's date in JST, formatted like the dates shown on the editorial page
    JST = timezone(timedelta(hours=+9), 'JST')
    dt_now = datetime.datetime.now(JST)
    date_str = dt_now.strftime('%Y年%m月%d日')

    response = requests.get('https://mainichi.jp/editorial/')

    soup = BeautifulSoup(response.text, 'html.parser')
    pages = soup.find("ul", class_="list-typeD")

    articles = pages.find_all("article")

    # Keep only today's articles and build absolute links
    links = ["https:" + a.a.get("href") for a in articles if date_str in a.time.text]

    for i, link in enumerate(links):
        bucket_name = "[Bucket name]"
        folder_path = "/tmp/"
        filename = 'article_{0}_{1}.txt'.format(dt_now.strftime('%Y-%m-%d'), i + 1)

        try:
            bucket = s3.Bucket(bucket_name)

            # Write the article to /tmp (the only writable path in Lambda), then upload
            with open(folder_path + filename, 'w') as fout:
                fout.write(extract_article(link))

            bucket.upload_file(folder_path + filename, filename)
            os.remove(folder_path + filename)

        except Exception as e:
            print(e)
            raise e

    return {
        "date": dt_now.strftime('%Y-%m-%d %H:%M:%S')
    }

# Extract one editorial: title plus body text
def extract_article(src):
    response = requests.get(src)
    soup = BeautifulSoup(response.text, 'html.parser')

    text_area = soup.find(class_="main-text")
    title = soup.h1.text.strip()
    sentence = "".join([txt.text.strip() for txt in text_area.find_all(class_="txt")])

    return title + "\n" + sentence
```
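As a quick usage check, the handler above can also be run locally with an empty event, the same way the console test does. This snippet is a hypothetical addition, assuming AWS credentials plus requests and beautifulsoup4 are installed on your machine:

```python
# Local smoke test for the handler above (not part of the deployed code).
if __name__ == '__main__':
    print(lambda_handler({}, None))
```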

From Deploy > Test, the extracted articles (two text files on this day) are added to the S3 bucket.

It took a while, but the Lambda setup is now complete.

Amazon EventBridge

This works, but it is really tedious to press the "Test" button every morning, so use Amazon EventBridge to set up periodic execution.

Select Amazon EventBridge → Events → Rules, then press "Create rule".

Enter the rule name and description. Cron expressions are evaluated in UTC, so to run at 7:00 a.m. Japan time, use `0 22 * * ? *`. Select the Lambda function name as the target and create the rule.
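The same rule can be created with boto3. A sketch; the rule name, function name, and account-specific ARNs are placeholders, and the Lambda also needs permission for EventBridge to invoke it:

```python
import boto3

events = boto3.client('events')

# Run every day at 22:00 UTC (7:00 a.m. JST). Names and ARNs are placeholders.
events.put_rule(
    Name='daily-editorial-scrape',
    ScheduleExpression='cron(0 22 * * ? *)',
    State='ENABLED',
)
events.put_targets(
    Rule='daily-editorial-scrape',
    Targets=[{
        'Id': 'scraper-lambda',
        'Arn': 'arn:aws:lambda:ap-northeast-1:123456789012:function:my-scraper',
    }],
)

# Allow EventBridge to invoke the function.
boto3.client('lambda').add_permission(
    FunctionName='my-scraper',
    StatementId='allow-eventbridge',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn='arn:aws:events:ap-northeast-1:123456789012:rule/daily-editorial-scrape',
)
```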

That's it.

Postscript

As a plan for what comes next, I will stock editorials from several newspaper companies for a year and try machine learning on them. It would be nice if all the pages could be fetched with requests alone, but for sites that load the article list dynamically when the page loads (for example, the Asahi Shimbun), you need to control a browser with Selenium.
