When building a web scraper in Python, pages whose content is rendered by client-side JavaScript often cannot be fetched with urlopen alone, so the data is usually obtained with Selenium or an API (if one is available).
On the other hand, if you want to run the scraper regularly and economically, it's a good idea to use on-demand AWS Lambda together with CloudWatch Events.
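As a sketch (the schedule itself is illustrative, not from the original setup), a CloudWatch Events / EventBridge schedule rule that fires the function once a day uses AWS's six-field cron syntax:

```
cron(0 21 * * ? *)   # every day at 21:00 UTC
```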
Let's start
To build packages for AWS Lambda, which runs on Amazon Linux, from a Windows environment, you need **Ubuntu on Windows 10** or some other remote Linux server. This time I chose Ubuntu on Windows 10 as the more economical option.
Please refer to the following for the specific installation method.
Next, install Python.
sudo apt-get update
sudo apt-get install python3.6
Once the environment is ready, do the following under Ubuntu.
--Move to an easy-to-find place; the following is an example.
  # Move to the C drive
  cd /mnt/c/
  mkdir /path/to/folder
  cd /path/to/folder
This way you can quickly find the generated files in File Explorer. (It will also be convenient later when uploading them to S3.)
--Download headless Chrome and ChromeDriver.
--Note that even a recent headless-chromium build will fail with an error unless it is paired with a matching ChromeDriver version. I used the following two:
  stable-headless-chromium-64.0.3282.167-amazonlinux-2017-03.zip
  chromedriver_linux64.zip
--Next, unzip them, adjust the permissions (chmod 777), and finally store them in a chrome folder as shown below, compressing it into a single ZIP file.
  chrome.zip
  chrome
  ├── chromedriver
  └── headless-chromium
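The unzip/chmod/zip steps above can be sketched as follows. To keep the sketch self-contained, placeholder files are created with touch; in practice the two binaries come from unzipping the downloads above.

```shell
# Placeholders standing in for the binaries unzipped from the two downloads
touch headless-chromium chromedriver
# Gather them into a chrome folder and open up the permissions
mkdir -p chrome
mv headless-chromium chromedriver chrome/
chmod 777 chrome/chromedriver chrome/headless-chromium
# Then compress the folder into one archive (requires the zip utility):
# zip -r chrome.zip chrome
```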
--Create a ZIP file of the selenium package on Ubuntu. Note that when a layer is attached, Lambda adds /opt/python to the Python import path, so the folder inside the zip should be named python.
  mkdir python
  
  python3 -m pip install --system --target ./python selenium
  zip -r python-selenium.zip python
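Why the top-level folder name inside the layer zip matters: Lambda extracts each layer under /opt and puts /opt/python on the Python import path. A minimal simulation (mypkg is a dummy package, and a temporary directory stands in for /opt):

```python
import pathlib
import sys
import tempfile

opt = pathlib.Path(tempfile.mkdtemp())   # stands in for /opt
pkg = opt / "python" / "mypkg"           # like /opt/python/selenium
pkg.mkdir(parents=True)
(pkg / "__init__.py").write_text("VERSION = '0.1'\n")

sys.path.insert(0, str(opt / "python"))  # what Lambda does for layer contents
import mypkg                             # now importable, just like selenium

print(mypkg.VERSION)                     # → 0.1
```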
--Upload chrome.zip and python-selenium.zip to S3, note the Object URLs, and create layers from them.
--Finally, let's run the sample.
  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  
  def lambda_handler(event, context):
      options = Options()
      # Path to the headless-chromium binary supplied by the layer
      options.binary_location = '/opt/chrome/headless-chromium'
      options.add_argument('--headless')
      options.add_argument('--no-sandbox')
      options.add_argument('--single-process')
      options.add_argument('--disable-dev-shm-usage')
      # Path to the chromedriver binary supplied by the layer
      browser = webdriver.Chrome('/opt/chrome/chromedriver', chrome_options=options)
      browser.get('https://www.google.com')
      title = browser.title
      browser.close()
      browser.quit()
  
      return {"title": title}
Note that in the Lambda function's basic settings you need more than 256 MB of memory to run this example, and a run takes about 10 seconds with a 512 MB / 20 s timeout configuration. It is best to start from there and adjust while watching how it behaves.
Q&A
You may be wondering about the same things, so here are the points that confused me when I did this.
Q: Why put the files in a chrome folder before compressing?
A: It's not strictly required, but when you attach a layer to Lambda, the contents of the zip are extracted under the /opt folder, e.g. /opt/chrome/chromedriver and /opt/chrome/headless-chromium.
Q: Can't I just use the zip feature built into Windows?

A: When I actually uploaded such a zip to a layer, I got the error Message: 'chromedriver' executable may have wrong permissions. It turns out the file permissions are not preserved correctly, so adjust the permissions under Ubuntu and compress the files there.
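The "wrong permissions" problem can be reproduced in miniature: a zip entry stores Unix permission bits in its external attributes, and an archive produced by Windows tooling typically carries none. A small check with Python's zipfile module (the chromedriver here is a dummy file, not the real binary):

```python
import pathlib
import tempfile
import zipfile

tmp = pathlib.Path(tempfile.mkdtemp())
exe = tmp / "chromedriver"            # dummy stand-in for the real binary
exe.write_bytes(b"#!/bin/sh\n")
exe.chmod(0o777)

archive = tmp / "chrome.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.write(exe, "chrome/chromedriver")   # keeps the mode when run on Linux

with zipfile.ZipFile(archive) as zf:
    info = zf.getinfo("chrome/chromedriver")
    # The Unix mode lives in the top 16 bits of external_attr
    mode = (info.external_attr >> 16) & 0o777
print(oct(mode))                      # → 0o777
```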
- [Python] Run Headless Chrome on AWS Lambda
- Package management using AWS Lambda's "Layer" feature
Don't panic if you get errors; check the error report after running a Lambda test. Surprisingly often, the cause is written there clearly.