Background

This is a continuation of FizzBuzz using AWS Lambda. This time, I tried to get data from an external web page by scraping.
AWS Architecture
- S3 (data storage)
- AWS Lambda (data processing)
- Amazon EventBridge (periodic execution)

I am using these three services.
Setting
S3
Create a bucket for data storage.
Enter only the bucket name and leave the other settings at their defaults. (Select the region as appropriate.)
Bucket creation is complete.
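If you prefer to create the bucket from code rather than the console, a minimal boto3 sketch looks like the following. The bucket name and region are placeholders, not values from this article.

```python
import boto3

# Sketch: create the data-storage bucket from code instead of the console.
# The bucket name and region are placeholders - adjust them to your own values.
s3_client = boto3.client('s3', region_name='ap-northeast-1')
s3_client.create_bucket(
    Bucket='my-article-bucket',  # hypothetical bucket name
    CreateBucketConfiguration={'LocationConstraint': 'ap-northeast-1'},
)
```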
Lambda
Create a lambda for data processing.
Instead of creating the function from scratch, use the "s3-get-object-python" blueprint under "Use a blueprint".
Enter the function name and role name. This time we will upload files to S3, so remove the "read-only access" policy template. For the S3 trigger, enter the bucket you created earlier as the bucket name.
After that, enter some string in the prefix option. This string must not match the beginning of the file names that the Lambda itself creates. If you leave the prefix empty or enter a matching one, every file the Lambda writes triggers the Lambda again in an infinite loop, which will incur a large charge, so this is important.
Another workaround is to add a restriction such as limiting the event type to copy only.
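For reference, the same trigger-with-prefix setup can also be expressed in code. This is only a sketch; the bucket name, function ARN, and prefix are assumptions for illustration, and the prefix is chosen so it does not collide with the names this article's function writes.

```python
import boto3

# Sketch: configure the S3 -> Lambda trigger with a prefix filter from code.
# Bucket name, function ARN, and prefix are placeholders for illustration.
# The prefix must NOT match the names the function itself writes
# (here: 'test_...' and 'article_...'), otherwise every upload re-triggers
# the function in the infinite loop described above.
# (The function also needs a resource-based permission allowing S3 to invoke
# it; the console sets that up for you.)
s3_client = boto3.client('s3')
s3_client.put_bucket_notification_configuration(
    Bucket='my-article-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:ap-northeast-1:123456789012:function:my-scraper',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [{'Name': 'prefix', 'Value': 'input/'}]
                    }
                },
            }
        ]
    },
)
```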
After completing all the entries, press "Create Function".
A template will be created, but if you deploy and test it as it is, a permission error will occur.
First, enable the S3 trigger. If you look at the event notifications in the bucket's properties in S3, you can see that the trigger has been added.
Select IAM → Roles to display the role list, and select the role name you specified earlier when creating the Lambda.
Press "Attach Policy" without thinking.
Filter by "LambdaFull", select "AWSLambdaFullAccess" and press "Attach Policy".
This completes adding permissions.
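The same policy attachment can also be done with boto3 if you prefer; this is just a sketch, with the role name as a placeholder.

```python
import boto3

# Sketch: attach the managed policy to the Lambda's execution role from code.
# 'my-lambda-role' is a placeholder for the role name chosen when creating the function.
iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='my-lambda-role',
    PolicyArn='arn:aws:iam::aws:policy/AWSLambdaFullAccess',
)
```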
Processing failed when the memory was too small, so I set the memory to 256 MB and the timeout to 10 seconds.
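If you want to apply the same settings from code instead of the console, a minimal sketch (the function name is a placeholder):

```python
import boto3

# Sketch: set memory and timeout from code instead of the console.
# 'my-scraper' is a placeholder for the function name.
lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='my-scraper',
    MemorySize=256,  # MB
    Timeout=10,      # seconds
)
```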
That's it.
Use the boto3 package to read from and write to S3 buckets. If you select s3-get-object-python when creating the Lambda, it comes bundled. If you were to upload the package yourself, boto3 alone is 10 MB or more, so it is better to use the bundled one.
```python
import json
import urllib.parse
import boto3
import datetime


def lambda_handler(event, context):
    try:
        # Write a small test file to the bucket
        s3 = boto3.resource('s3')
        bucket = '[Bucket name]'
        key = 'test_{}.txt'.format(datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'))
        file_contents = 'Lambda test'
        obj = s3.Object(bucket, key)
        obj.put(Body=file_contents)
    except Exception as e:
        print(e)
        raise e
```
After deploying and testing, the file is uploaded to the bucket. As for the test event, the function runs even with an empty JSON.
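Since the handler ignores the event contents, it can also be smoke-tested locally with an empty event. A minimal sketch, assuming the code above is saved as lambda_function.py and AWS credentials with access to the bucket are configured:

```python
# Minimal local smoke test: the handler ignores the event, so an empty dict works.
# Assumes the code above is saved as lambda_function.py and credentials are set up.
from lambda_function import lambda_handler

lambda_handler({}, None)
```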
For scraping I use the `requests` and `beautifulsoup4` packages, but packages installed with pip have to be uploaded to Lambda.
The way to do this is to install the packages into a folder with pip, create the handler file there, copy in the code written in Lambda, and zip the folder.
```
mkdir packages
cd packages
pip install requests -t ./
pip install beautifulsoup4 -t ./
touch lambda_function.py
```
The packages are now placed in the project. Next, move the folders and files under `packages` up one level into `articleStore`.
Then deploy and test again to confirm that the file is added to S3.
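The "zip the folder" step can also be scripted. A minimal sketch, assuming the project folder is `articleStore` and using a hypothetical function name:

```python
import shutil
import boto3

# Sketch: zip the project folder and upload it as the function code.
# 'articleStore' is the project folder from above; 'my-scraper' is a placeholder name.
archive_path = shutil.make_archive('articleStore', 'zip', 'articleStore')

lambda_client = boto3.client('lambda')
with open(archive_path, 'rb') as f:
    lambda_client.update_function_code(
        FunctionName='my-scraper',
        ZipFile=f.read(),
    )
```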
All that is left is the web scraping. Here, I will try to get the Mainichi Shimbun editorials dated today.
```python
import json
import urllib.parse
import boto3
import datetime
from datetime import timedelta, timezone
import random
import os
import requests
from bs4 import BeautifulSoup

print('Loading function')

s3 = boto3.resource('s3')


def lambda_handler(event, context):
    # Build today's date string in Japanese format to match the article list on the page
    JST = timezone(timedelta(hours=+9), 'JST')
    dt_now = datetime.datetime.now(JST)
    date_str = dt_now.strftime('%Y年%m月%d日')

    # Collect the links to today's editorials
    response = requests.get('https://mainichi.jp/editorial/')
    soup = BeautifulSoup(response.text, 'html.parser')
    pages = soup.find("ul", class_="list-typeD")
    articles = pages.find_all("article")
    links = ["https:" + a.a.get("href") for a in articles if date_str in a.time.text]

    for i, link in enumerate(links):
        bucket_name = "[Bucket name]"
        folder_path = "/tmp/"
        filename = 'article_{0}_{1}.txt'.format(dt_now.strftime('%Y-%m-%d'), i + 1)
        try:
            bucket = s3.Bucket(bucket_name)
            # Write the article to /tmp, upload it to S3, then clean up
            with open(folder_path + filename, 'w') as fout:
                fout.write(extract_article(link))
            bucket.upload_file(folder_path + filename, filename)
            os.remove(folder_path + filename)
        except Exception as e:
            print(e)
            raise e

    return {
        "date": dt_now.strftime('%Y-%m-%d %H:%M:%S')
    }


# Extract the title and body of one editorial
def extract_article(src):
    response = requests.get(src)
    soup = BeautifulSoup(response.text, 'html.parser')
    text_area = soup.find(class_="main-text")
    title = soup.h1.text.strip()
    sentence = "".join([txt.text.strip() for txt in text_area.find_all(class_="txt")])
    return title + "\n" + sentence
```
Now Deploy > Test adds two text files containing the extracted articles to the S3 bucket.
It took a while, but the Lambda setup is now complete.
Amazon EventBridge
The processing itself now works, but it is really tedious to press the "Test" button every morning. So I use Amazon EventBridge to set up periodic execution.
Select Amazon EventBridge → Events → Rules and press "Create rule".
Enter the rule name and description. Since cron expressions are evaluated in UTC, use `0 22 * * ? *` to run at 7:00 a.m. Japan time.
Select the Lambda function created above as the target and create the rule.
That's it.
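The same schedule can also be created with boto3. A sketch, assuming placeholder names and a hypothetical function ARN:

```python
import boto3

# Sketch: create the scheduled rule and point it at the Lambda from code.
# Rule name, function name, and ARN are placeholders for illustration.
events = boto3.client('events')
lambda_client = boto3.client('lambda')

function_arn = 'arn:aws:lambda:ap-northeast-1:123456789012:function:my-scraper'

# 22:00 UTC = 07:00 JST, same expression as in the console
rule = events.put_rule(
    Name='daily-editorial-scrape',
    ScheduleExpression='cron(0 22 * * ? *)',
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='my-scraper',
    StatementId='eventbridge-daily-editorial',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)

events.put_targets(
    Rule='daily-editorial-scrape',
    Targets=[{'Id': 'daily-editorial-target', 'Arn': function_arn}],
)
```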
Postscript
As a next step, I plan to collect editorials from several newspapers for a year and try machine learning on them.
It would be nice if every page could be fetched with `requests`, but some sites (for example, the Asahi Shimbun) load their article list dynamically when the page loads, and for those you need to control a browser with `selenium`.