[PYTHON] I put Selenium and headless Chrome in AWS Lambda. (Notes for a Windows 10 environment, etc.)

When writing a web scraper in Python, you often cannot get the information you want with urlopen alone if the target content is generated by client-side JavaScript. In that case, it is usually obtained with Selenium or through an API (if one exists).
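To make the limitation concrete, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for what urlopen would return from a page that fills in its content with JavaScript; the element id `price` and the class name `TextById` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page source, standing in for what urlopen(...).read() returns.
# The <div> is empty in the raw HTML; only a browser running the script fills it.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "1,280 yen";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification for this sketch: any end tag closes the target element.
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = TextById("price")
parser.feed(RAW_HTML)
static_text = parser.text.strip()
print(repr(static_text))  # the price never appears without running the script
```

A real browser, such as headless Chrome driven by Selenium, executes the script first, so the same element would contain the text.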

On the other hand, if you want to run something regularly and cheaply, a good option is on-demand AWS Lambda triggered on a schedule by CloudWatch Events.

Let's start

Procedure

Environment

AWS Lambda runs on Amazon Linux, so to prepare packages for it from a Windows environment you need a Linux environment: either **Ubuntu on Windows 10** or a remote Linux server. This time I chose Ubuntu on Windows 10 as the more economical option.

Please refer to the following for the specific installation method.

Using Linux on Windows 10

Next, install Python.

sudo apt-get update
sudo apt-get install python3.6

Next

Once the environment is ready, do the following under Ubuntu.

--Move to a working directory that is easy to find. The following is an example.

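  #Move to the C drive (mounted at /mnt/c under Ubuntu on Windows 10)
  cd /mnt/c/
  #Relative paths, so that the folder is created on the C drive
  mkdir -p path/to/folder
  cd path/to/folder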

This way the files you generate are easy to find in File Explorer, which is convenient when you upload them to S3 later.

--Get headless chrome and chromedriver

--Note that even with a new version of headless-chromium, an error occurs unless the chromedriver version corresponds to it. I was using the following two.

--Next, unzip them, adjust the permissions so that they are executable (chmod 777), place both in a folder named chrome, and compress that folder into a single ZIP file with the layout shown below.

  chrome.zip
  chrome
  ├── chromedriver
  └── headless-chromium
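The packaging step could also be scripted. The following is only a sketch using Python's standard zipfile module, not the method used in this post (which runs chmod and zip under Ubuntu). The function names `add_executable` and `build_chrome_zip` are made up for illustration, the default mode 0o755 is my assumption (the post uses 777), and whether Lambda honors the recorded bits is something to verify on your side.

```python
import os
import zipfile


def add_executable(zf, src_path, arcname, mode=0o755):
    """Add src_path to the archive under arcname, recording Unix permission bits."""
    info = zipfile.ZipInfo(arcname)
    info.create_system = 3  # 3 = Unix, so the mode stored below is meaningful
    info.external_attr = (0o100000 | mode) << 16  # regular file + permissions
    info.compress_type = zipfile.ZIP_DEFLATED
    with open(src_path, "rb") as f:
        zf.writestr(info, f.read())


def build_chrome_zip(chromedriver, headless_chromium, out_path="chrome.zip"):
    """Create a ZIP with the chrome/ layout shown above and return its path."""
    with zipfile.ZipFile(out_path, "w") as zf:
        add_executable(zf, chromedriver, "chrome/chromedriver")
        add_executable(zf, headless_chromium, "chrome/headless-chromium")
    return out_path
```

Because the permission bits are written explicitly into each entry, the result does not depend on the permissions of the source files on disk.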

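--Create a ZIP file of the selenium package on Ubuntu. For a Python layer, Lambda adds the layer's python folder to the import path, so the package is installed into a folder with that name.

  mkdir python
  cd python
  
  python3 -m pip install --system --target ./ selenium
  cd ..
  zip -r python-selenium.zip python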

--Upload chrome.zip and python-selenium.zip to S3, make a note of each Object URL, and create a Lambda layer from each one.
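The post creates the layers by hand from the Object URL. As an alternative, here is a rough sketch of the same step with boto3. `publish_layer_version` is the Lambda client call for creating a layer version; the helper names, the bucket and key in the usage note, and the simple URL parsing (which assumes a virtual-hosted-style URL such as https://BUCKET.s3.REGION.amazonaws.com/KEY) are my own assumptions.

```python
from urllib.parse import unquote, urlparse


def split_object_url(object_url):
    """Split a virtual-hosted-style S3 Object URL into (bucket, key)."""
    parsed = urlparse(object_url)
    bucket = parsed.netloc.split(".s3", 1)[0]
    key = unquote(parsed.path.lstrip("/"))
    return bucket, key


def layer_request(layer_name, object_url, runtime):
    """Build the keyword arguments for publish_layer_version."""
    bucket, key = split_object_url(object_url)
    return {
        "LayerName": layer_name,
        "Content": {"S3Bucket": bucket, "S3Key": key},
        "CompatibleRuntimes": [runtime],
    }


def publish_layer(layer_name, object_url, runtime):
    """Create the layer version; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("lambda")
    return client.publish_layer_version(**layer_request(layer_name, object_url, runtime))
```

For example, `layer_request("chrome", "https://my-bucket.s3.ap-northeast-1.amazonaws.com/layers/chrome.zip", "python3.6")` only builds the request, so you can inspect it before calling `publish_layer`.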

--Finally, run the sample.

  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  
  def lambda_handler(event, context):
      options = Options()

      #Path to headless-chromium in the layer; enter your corresponding file path
      options.binary_location = '/opt/chrome/headless-chromium'
      options.add_argument('--headless')
      options.add_argument('--no-sandbox')
      options.add_argument('--single-process')
      options.add_argument('--disable-dev-shm-usage')

      #Path to chromedriver in the layer; enter your corresponding file path
      #(options= replaces the deprecated chrome_options= keyword)
      browser = webdriver.Chrome('/opt/chrome/chromedriver', options=options)
      try:
          browser.get('https://www.google.com')
          title = browser.title
      finally:
          #quit() closes every window and stops chromedriver, even after an error
          browser.quit()
  
      return {"title": title}
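Before starting the browser, it can help to confirm that both binaries are actually present and executable inside the Lambda environment. This is a small sketch with the standard library only; `check_executable` and `preflight` are hypothetical helper names, and the default paths are the ones used in the sample above.

```python
import os


def check_executable(path):
    """Return a short problem description for path, or None if it looks usable."""
    if not os.path.isfile(path):
        return "missing: " + path
    if not os.access(path, os.X_OK):
        return "not executable: " + path
    return None


def preflight(paths=("/opt/chrome/headless-chromium", "/opt/chrome/chromedriver")):
    """Collect the problems found for every binary the handler depends on."""
    problems = []
    for path in paths:
        problem = check_executable(path)
        if problem:
            problems.append(problem)
    return problems
```

Calling `preflight()` at the top of the handler and returning its result when the list is not empty gives a clearer message than a failure deep inside Selenium.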

Note that the example needs more than 256 MB of memory in the Basic settings of the Lambda function, and one run takes about 10 seconds (measured with 512 MB and a 20 s timeout). It is better to raise these settings in advance and then adjust them while watching how the function behaves.
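To see how close a run comes to the configured limits, the handler can be wrapped so that it prints its own timing. This is a sketch, not part of the original sample; `timed` is a hypothetical decorator name, while `get_remaining_time_in_millis()` and `memory_limit_in_mb` are attributes of the context object that Lambda passes to a Python handler.

```python
import time


def timed(handler):
    """Wrap a Lambda handler and print how long it ran and how much time was left."""

    def wrapper(event, context):
        start = time.monotonic()
        result = handler(event, context)
        elapsed = time.monotonic() - start
        print("elapsed: %.1fs, remaining: %dms, memory limit: %sMB" % (
            elapsed,
            context.get_remaining_time_in_millis(),
            context.memory_limit_in_mb,
        ))
        return result

    return wrapper
```

Applying it as `lambda_handler = timed(lambda_handler)` writes one extra line to the function's log on each run, which is enough to judge whether the timeout has room to spare.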

Q&A

Here are a few things that confused me along the way, in case you are wondering about them too.

Q: Why put the files in a chrome folder before compressing them?

A: It is not strictly necessary, but when you add a layer to a Lambda function, the contents of the ZIP file are extracted under the /opt folder. With this layout the files end up at /opt/chrome/chromedriver and /opt/chrome/headless-chromium.
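The mapping from ZIP entries to paths under /opt can be previewed locally. The sketch below builds a tiny in-memory archive with the same layout as chrome.zip; `layer_paths` is a hypothetical helper name.

```python
import io
import posixpath
import zipfile


def layer_paths(zip_file):
    """List where each file in a layer ZIP appears once it is extracted under /opt."""
    with zipfile.ZipFile(zip_file) as zf:
        return [
            posixpath.join("/opt", name)
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
        ]


# Build a small in-memory archive with the same layout as chrome.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("chrome/chromedriver", b"")
    zf.writestr("chrome/headless-chromium", b"")
buf.seek(0)

paths = layer_paths(buf)
print(paths)  # ['/opt/chrome/chromedriver', '/opt/chrome/headless-chromium']
```

Passing the path of a real chrome.zip to `layer_paths` shows the paths to enter in the handler.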

Q: Can't I just use the ZIP compression built into Windows, as shown in the figure?

image-20201008223558454.png

A: When I uploaded a ZIP file made that way as a layer, I got the error "Message: 'chromedriver' executable may have wrong permissions". When I looked into it, it seemed that the file permissions were not set properly. So adjust the permissions of the files under Ubuntu and compress them there.
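One way to check an archive before uploading it is to read the Unix mode recorded for each entry. This is a sketch under the assumption that the mode sits in the upper 16 bits of `external_attr`, which is how Python's zipfile module exposes it; `executable_report` is a hypothetical helper name.

```python
import zipfile


def executable_report(zip_file):
    """Map each file entry to True if any Unix execute bit is recorded for it."""
    report = {}
    with zipfile.ZipFile(zip_file) as zf:
        for info in zf.infolist():
            if info.filename.endswith("/"):
                continue  # directory entry
            mode = info.external_attr >> 16  # upper 16 bits hold the Unix mode
            report[info.filename] = bool(mode & 0o111)
    return report
```

Running `executable_report("chrome.zip")` on a ZIP made under Ubuntu after chmod should report True for both files. An entry reported as False is a likely cause of the permissions error above.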

References

--[Python] Run Headless Chrome on AWS Lambda
--Use AWS Lambda-like “layer (Layer)” function realization package management

In conclusion

Don't panic if you get an error. After running a test of the Lambda function, check the error report. Surprisingly often, the problem is clearly written there.
