[PYTHON] Serverless scraping using Selenium with [AWS Lambda] -Part 1-

Lambda is a serverless computing service provided by AWS that runs your code on demand. It scales automatically with the number of requests, so you do not have to build an environment, distribute load, or maintain servers.

You pay only for the time your code runs, so you are not charged while it is idle. In other words, there is no cost for keeping a server up. This is a great deal.
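
To make the pay-per-use idea concrete, here is a rough sketch of the arithmetic, assuming the bill is made of a duration charge (memory multiplied by run time) plus a per-request charge. The function name and both rates are hypothetical placeholders for illustration, not current AWS prices.

```python
def estimate_lambda_cost(invocations, avg_seconds, memory_mb,
                         price_per_gb_second, price_per_request):
    """Estimate a bill from a duration charge plus a per-request charge."""
    # Duration is billed in GB-seconds: run time multiplied by allocated memory.
    gb_seconds = invocations * avg_seconds * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second + invocations * price_per_request


# Hypothetical rates for illustration only; check the current AWS price list.
cost = estimate_lambda_cost(
    invocations=1000, avg_seconds=10, memory_mb=256,
    price_per_gb_second=0.00002, price_per_request=0.0000002,
)
print(f"{cost:.4f}")
```

With zero invocations the result is zero, which is the point of the paragraph above.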

First, install the required libraries

Lambda does not give you a server, so you cannot log in and install libraries directly. Instead, you upload libraries built for the Linux environment that Lambda runs on.

The Selenium used in this article is uploaded the same way.

First of all, you need to install Selenium and the WebDriver in some environment. There are two ways to do this:

・ Launch a Python environment with Docker and install there
・ Use Cloud9

This time we will use Cloud9, which is quicker and easier.

Cloud9 is a service that lets you write and execute code from your browser.

Open Cloud9 from the AWS console.

Set the environment name to python_for_lambda and leave everything else at the defaults.


This alone gives you an environment where Python can run.

$ python -V

Python 3.7.9

Install Selenium right away. Specify the directory **python/lib/python3.7/site-packages** as the installation destination.

$ pip install selenium -t python/lib/python3.7/site-packages

Collecting selenium
  Using cached https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl
Collecting urllib3 (from selenium)
  Using cached https://files.pythonhosted.org/packages/f5/71/45d36a8df6861b2c017f3d094538c0fb98fa61d4dc43e69b9/urllib3-1.26.2-py2.py3-none-any.whl
Installing collected packages: urllib3, selenium
Successfully installed selenium-3.141.0 urllib3-1.26.2
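
If you would rather script this step, the same install can be driven from Python. This is only a sketch: `pip_install_command` and `install_into` are hypothetical helper names, and the version pin in the usage note is my assumption, taken from the version shown in the log above.

```python
import subprocess
import sys


def pip_install_command(package, target_dir):
    """Build the argument list for `pip install <package> -t <target_dir>`."""
    # Using `python -m pip` makes sure pip belongs to the running interpreter.
    return [sys.executable, "-m", "pip", "install", package, "-t", target_dir]


def install_into(package, target_dir):
    """Run pip and raise CalledProcessError if the install fails."""
    subprocess.run(pip_install_command(package, target_dir), check=True)
```

For example, `install_into("selenium==3.141.0", "python/lib/python3.7/site-packages")` would pin the version from the log. Pinning may be worth it because later Selenium releases changed how `webdriver.Chrome` takes the chromedriver path.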

Next, install Chrome. Since it will run headless, install chromedriver and headless-chromium.

# Create the directory to save the binaries in
$ mkdir -p headless/python/bin

$ cd headless/python/bin

# Download headless-chromium
$ curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
# Extract the file and delete the zip
$ unzip -o headless-chromium.zip -d .
$ rm headless-chromium.zip

# Download chromedriver
$ curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
# Extract the file and delete the zip
$ unzip -o chromedriver.zip -d .
$ rm chromedriver.zip
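
The download, unzip, and delete steps above could also be written in Python. This is a sketch rather than what the article actually ran: `extract_executables` and `download_and_extract` are hypothetical names, and marking every extracted file as executable is a blunt assumption that suits these single-binary archives. The chmod is there because Python's `zipfile` does not restore file permissions when extracting.

```python
import os
import stat
import urllib.request
import zipfile


def extract_executables(zip_path, dest_dir):
    """Unzip an archive and mark every extracted file as executable."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = [n for n in archive.namelist() if not n.endswith("/")]
    for name in names:
        path = os.path.join(dest_dir, name)
        mode = os.stat(path).st_mode
        # zipfile drops permission bits, so add the execute bits back.
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return names


def download_and_extract(url, dest_dir):
    """Download a zip from `url`, extract it into `dest_dir`, delete the zip."""
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, "download.zip")
    urllib.request.urlretrieve(url, zip_path)
    try:
        return extract_executables(zip_path, dest_dir)
    finally:
        os.remove(zip_path)
```

Calling `download_and_extract(url, "headless/python/bin")` once for each of the two URLs above should leave the same two binaries in place.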


Right-click each folder and select Download to download it as a zip.

・ selenium: everything under **python**
・ chromedriver, headless-chromium: everything under **headless**

`headless`


headless
    ┗  python
        ┗  bin
           ┣ chromedriver
           ┗ headless-chromium

`selenium`


python
    ┗ lib
       ┗  python3.7
             ┗ site-packages
                      ┣ selenium
                      ┣ selenium-3.141.0.dist-info
                      ┣ urllib3
                      ┗ urllib3-1.26.2.dist-info
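
As an alternative to downloading from the Cloud9 file tree, the two zips could be built with Python's standard library. This is a sketch under the assumption that each archive should have `python/` as its top-level entry, which matches the `/opt/python/...` paths used later in the function. `make_layer_zip` is a hypothetical helper name.

```python
import shutil


def make_layer_zip(zip_base_name, parent_dir, top_level="python"):
    """Zip `parent_dir/top_level` so that `top_level/` is the archive root.

    Returns the path of the zip file that was written.
    """
    return shutil.make_archive(
        zip_base_name, "zip", root_dir=parent_dir, base_dir=top_level
    )
```

Run from the directory that contains both trees, `make_layer_zip("python", ".")` would write python.zip for Selenium and `make_layer_zip("headless", "headless")` would write headless.zip for the two binaries.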

If you already use Selenium or chromedriver, they may already exist on your PC, but they will not run on Lambda unless they are **builds compatible with the Linux environment**. We recommend using the ones installed on Cloud9, which is a Linux environment.

I tried using the chrome driver installed on Windows, but it didn't work.

Lambda

Now that the required libraries are installed, upload them to Lambda.

Upload to layer

First, open Lambda.

Select **Layers** from the console.

A layer is an archive of the libraries and other content a function needs, and its contents are available while the code runs. Upload the selenium and headless files that we just downloaded.

Contents	Layer name	File
chromedriver, headless-chromium	headless	headless.zip
selenium	selenium	python.zip

Set the runtime to python3.7.
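
The article uploads the layers through the console, but the same step can be scripted. The sketch below assumes you pass in a client created with `boto3.client("lambda")` and that the zip is small enough for a direct upload (larger archives have to go through S3). `publish_layer` is a hypothetical helper name.

```python
def publish_layer(client, layer_name, zip_path, runtime="python3.7"):
    """Publish a zip file as a new layer version and return its ARN.

    `client` is assumed to be a boto3 Lambda client.
    """
    with open(zip_path, "rb") as f:
        response = client.publish_layer_version(
            LayerName=layer_name,
            Content={"ZipFile": f.read()},
            CompatibleRuntimes=[runtime],
        )
    return response["LayerVersionArn"]
```

Following the table above, this would be called once as `publish_layer(client, "headless", "headless.zip")` and once as `publish_layer(client, "selenium", "python.zip")`.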


Create functions and add layers

Next, create the function. I will write the code later, so for now just create an empty function.

From the dashboard, select Create function.


Name the function lambda_function_for_headless_chrome and create it with the python3.7 runtime.

Now you have an environment to run Python.

Then add the layers you just created to this function.

Click Layers and select Add Layer.

Add the headless and selenium layers one at a time from **Custom Layer**.

Finally, when Layers shows (2), both layers have been added.

Function execution

Now let's run Selenium in Python.

Replace the code that already exists in the function's code editor with the following.

`lambda_function_for_headless_chrome`


# selenium is imported automatically from the layer's python directory
from selenium import webdriver


def lambda_handler(event, context):

    URL = "https://news.yahoo.co.jp/"

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--hide-scrollbars")
    options.add_argument("--single-process")
    options.add_argument("--ignore-certificate-errors")
    # Chrome expects the window size as width,height
    options.add_argument("--window-size=880,996")
    options.add_argument("--no-sandbox")
    options.add_argument("--homedir=/tmp")
    options.binary_location = "/opt/python/bin/headless-chromium"

    # Browser definition (Selenium 3 style: the chromedriver path comes first)
    browser = webdriver.Chrome(
        "/opt/python/bin/chromedriver",
        options=options
    )

    try:
        browser.get(URL)
        title = browser.title
    finally:
        # quit() ends the chromedriver process as well as closing the window
        browser.quit()

    return title

What needs attention here is the **chromedriver path**. It is specified **under /opt** because the contents of a **Lambda layer are extracted under /opt automatically**.

Selenium, on the other hand, needs no path.

Within a Lambda layer, anything placed in either of these locations is **found by import automatically**:

・ **python**
・ **python/lib/python3.x/site-packages** (3.x is the version in use)

That is why the import statement works here without specifying a path.
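
One way to see this for yourself is to check which of those two directories are on `sys.path` inside the function. This is only a small sketch: the helper names are hypothetical, and outside Lambda the second function simply returns an empty list.

```python
import sys


def layer_import_dirs(python_version="3.7"):
    """Return the two layer directories described above, as seen under /opt."""
    return [
        "/opt/python",
        f"/opt/python/lib/python{python_version}/site-packages",
    ]


def layer_dirs_on_sys_path(python_version="3.7"):
    """Return the layer directories that the running interpreter searches."""
    return [d for d in layer_import_dirs(python_version) if d in sys.path]
```

Returning `layer_dirs_on_sys_path()` from a test handler should show where the runtime looks for the selenium package.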

Change basic settings

Finally, adjust the **basic settings** a little.

**Selenium takes longer to run than an ordinary program**, so it is likely to time out with the default settings. Increase the timeout.

Also, **128 MB of memory was not enough, so change it to 256 MB** before running the test.
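
These two settings can also be changed from a script. The sketch below assumes a client created with `boto3.client("lambda")`. `raise_limits` is a hypothetical helper name, and the 60-second timeout is my placeholder since the article does not state the value used; only the 256 MB figure comes from the text above.

```python
def raise_limits(client, function_name, timeout_seconds=60, memory_mb=256):
    """Raise the timeout and memory of an existing function.

    `client` is assumed to be a boto3 Lambda client. The timeout default
    is an illustrative guess, not a value from the article.
    """
    return client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
        MemorySize=memory_mb,
    )
```

For this article's function the call would be `raise_limits(client, "lambda_function_for_headless_chrome")`.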

Run test

Now let's create a test and run the code.

Click Test, enter a name for the test event, and leave everything else at the defaults.

Once created, click Test again to run it!

If, after a short wait, the execution result reports success and shows the page title returned by the function, it worked.
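
The same test can be triggered from a script instead of the console. This sketch assumes a client created with `boto3.client("lambda")` and a function that returns a JSON-serializable value, as the handler above does. `invoke_and_read` is a hypothetical helper name.

```python
import json


def invoke_and_read(client, function_name, event=None):
    """Invoke a function synchronously and decode the value it returned.

    `client` is assumed to be a boto3 Lambda client.
    """
    response = client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(event or {}).encode("utf-8"),
    )
    # The returned payload is a stream holding the JSON-encoded result.
    return json.loads(response["Payload"].read())
```

For the handler in this article, the decoded value would be the page title string.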

In the next part, we will write production code that implements more practical periodic processing.

If it doesn't work

If it doesn't work, see the following.

・ When 'lambda_function': No module named 'selenium' appears

・ When 'chromedriver' executable may have wrong permissions. appears
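
Both errors can be checked for from inside the function before the browser is started. This is a rough diagnostic sketch, assuming the layer layout used in this article. `diagnose` is a hypothetical helper name.

```python
import importlib.util
import os


def diagnose(module_name="selenium",
             binaries=("/opt/python/bin/chromedriver",
                       "/opt/python/bin/headless-chromium")):
    """Return a list of problems that match the two errors above."""
    problems = []
    # Corresponds to the "No module named" error.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module not importable: {module_name}")
    # Corresponds to the "wrong permissions" error.
    for path in binaries:
        if not os.path.isfile(path):
            problems.append(f"file missing: {path}")
        elif not os.access(path, os.X_OK):
            problems.append(f"file not executable: {path}")
    return problems
```

Returning `diagnose()` from a temporary handler shows at a glance whether the layer is missing or whether the binaries lost their execute permission along the way.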

Articles that I used as a reference

[Periodically run Python scraping on AWS Lambda](https://qiita.com/eisu26/items/be7a75edf7a798f17f11)

How to make AWS Lambda Layers when running selenium x chrome on AWS Lambda