When building a web scraper with Python, if the target content is rendered by client-side JavaScript, you often cannot get the desired information with a simple urlopen. In that case it is usually fetched with Selenium, or via an API if one is available.
On the other hand, if you want to run the scraper regularly and economically, a good option is on-demand AWS Lambda triggered on a schedule by CloudWatch Events.
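For reference, here is a rough sketch of wiring up such a schedule with the AWS CLI; the rule name, function name, schedule expression, and ARNs are placeholders, so substitute your own.
# Create a rule that fires on a schedule (name and rate are placeholders)
aws events put-rule --name daily-scraper --schedule-expression "rate(1 day)"
# Allow CloudWatch Events to invoke the function (replace the ARNs with your own)
aws lambda add-permission \
  --function-name my-scraper \
  --statement-id daily-scraper-event \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:REGION:ACCOUNT_ID:rule/daily-scraper
# Point the rule at the Lambda function
aws events put-targets --rule daily-scraper \
  --targets "Id"="1","Arn"="arn:aws:lambda:REGION:ACCOUNT_ID:function:my-scraper"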
Let's start
Since AWS Lambda runs on Amazon Linux, preparing the packages from a Windows environment needs the help of **Ubuntu on Windows 10** or some other remote Linux server. This time I chose Ubuntu on Windows 10 as the more economical option.
Please refer to the following for the specific installation method.
Next, install Python.
sudo apt-get update
sudo apt-get install python3.6
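The packaging step later uses pip, so install pip for Python 3 as well (assuming the standard Ubuntu package):
sudo apt-get install python3-pip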
Once the environment is ready, do the following under Ubuntu.
--Move to an easy-to-find place; the following is an example
# Move to the C drive
cd /mnt/c/
mkdir /path/to/folder
cd /path/to/folder
This way you can quickly find the generated files from File Explorer. (This will be convenient when putting them into S3 later.)
--Get headless chrome and chromedriver
--Note that even if you use a new version of headless-chromium, an error will occur unless you pair it with the corresponding version of chromedriver. I used the following two:
stable-headless-chromium-64.0.3282.167-amazonlinux-2017-03.zip
chromedriver_linux64.zip
--Next, unzip them, adjust the permissions (chmod 777), store both binaries in a chrome folder laid out as shown below, and compress it into a single ZIP file (see the shell sketch after the folder layout).
chrome.zip
chrome
├── chromedriver
└── headless-chromium
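Here is a shell sketch of those steps, assuming the two archives named above have already been downloaded into the current folder:
# Unpack both archives into a chrome folder
mkdir chrome
unzip stable-headless-chromium-64.0.3282.167-amazonlinux-2017-03.zip -d chrome
unzip chromedriver_linux64.zip -d chrome
# Make both binaries executable (chmod 777, as above)
chmod 777 chrome/headless-chromium chrome/chromedriver
# Bundle everything into a single ZIP for the layer
zip -r chrome.zip chrome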
--Create a ZIP file of the selenium package on Ubuntu.
mkdir -p python-selenium/python
cd python-selenium
python3 -m pip install --system --target ./python selenium
# Lambda mounts layer contents under /opt, and only /opt/python is added to the
# Python path, so the packages must sit inside a top-level python/ directory
zip -r python-selenium.zip ./python
--Put chrome.zip and python-selenium.zip in S3, make a note of each Object URL, and create the layers from them (a CLI sketch follows).
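As a sketch, the upload and layer creation can also be done with the AWS CLI; the bucket name here is a placeholder:
# Upload both archives to S3
aws s3 cp chrome.zip s3://your-bucket/
aws s3 cp python-selenium.zip s3://your-bucket/
# Create one layer per archive from the uploaded objects
aws lambda publish-layer-version --layer-name chrome \
  --content S3Bucket=your-bucket,S3Key=chrome.zip \
  --compatible-runtimes python3.6
aws lambda publish-layer-version --layer-name python-selenium \
  --content S3Bucket=your-bucket,S3Key=python-selenium.zip \
  --compatible-runtimes python3.6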
--Finally, let's run the sample below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def lambda_handler(event, context):
    options = Options()
    # Enter the file path that matches your layer
    options.binary_location = '/opt/chrome/headless-chromium'
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--single-process')
    options.add_argument('--disable-dev-shm-usage')
    # Enter the file path that matches your layer
    browser = webdriver.Chrome('/opt/chrome/chromedriver', chrome_options=options)
    browser.get('https://www.google.com')
    title = browser.title
    browser.close()
    browser.quit()
    return {"title": title}
Note that in the Lambda function's basic settings, the example needs more than 256 MB of memory to run, and the duration is about 10 seconds (with a 512 MB memory / 20 s timeout setting). It is best to start from values like these and adjust while watching how it behaves (a CLI sketch follows).
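If you prefer setting these values from the command line instead of the console, something like the following works (the function name is a placeholder):
# Raise memory and timeout to the values used above
aws lambda update-function-configuration \
  --function-name my-scraper \
  --memory-size 512 \
  --timeout 20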
Q&A
You may have some questions, so here are the things that confused me when I did this.
Q: Why put the binaries in a chrome folder and then compress it?
A: It isn't strictly necessary, but when you attach a layer to Lambda, the contents of the ZIP file are mounted under the /opt folder; for example, they end up at /opt/chrome/chromedriver and /opt/chrome/headless-chromium.
Q: Isn't it possible to just use the ZIP compression built into Windows?
A: When I actually uploaded such an archive as a layer, I got the error Message: 'chromedriver' executable may have wrong permissions, and on checking, the file permissions had not been preserved properly. So adjust the permissions of the files under Ubuntu and compress them there. (You can verify what got stored as shown below.)
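As a quick sanity check before uploading, zipinfo (from the unzip package) lists the Unix permission bits stored for each entry:
zipinfo chrome.zip
# chrome/chromedriver and chrome/headless-chromium should show executable bits, e.g. -rwxrwxrwx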
If you get any errors, don't panic: it's a good idea to check the error report after running the Lambda test. Surprisingly, the problem is often spelled out clearly there.