[PYTHON] Serverless scraping using selenium with [AWS Lambda] -Part 1-

Lambda is a serverless computing service provided by AWS that executes code for you. It also scales automatically with the number of requests, so there is no need to build an environment, distribute load, or perform maintenance.

You pay only for the time your code runs, so you are not charged while it is idle. In other words, there is no server maintenance cost, which makes it a very good deal.

First, install the required libraries

Lambda has no server you can log in to, so you cannot connect directly and install the required libraries. Instead, you upload libraries built for Lambda's pre-installed Linux environment and use them from there.

The Selenium used this time is uploaded in the same way.

First of all, you need to install Selenium and the WebDriver in some environment. There are two ways to do this:

・ Launch a Python environment with Docker and install there
・ Use Cloud9

This time we will use Cloud9, which is quicker and easier.

Cloud9 is a service that lets you execute code from your browser.

Search for Cloud9 in the AWS console and open it.

Create an environment named python_for_lambda and leave every other setting at its default.

This alone gives you an environment where Python can run.

$ python -V

Python 3.7.9

Install selenium right away, specifying the directory **python/lib/python3.7/site-packages** as the installation destination.

$ pip install selenium -t python/lib/python3.7/site-packages

Collecting selenium
  Using cached https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl
Collecting urllib3 (from selenium)
  Using cached https://files.pythonhosted.org/packages/f5/71/45d36a8df6861b2c017f3d094538c0fb98fa61d4dc43e69b9/urllib3-1.26.2-py2.py3-none-any.whl
Installing collected packages: urllib3, selenium
Successfully installed selenium-3.141.0 urllib3-1.26.2

Next, install Chrome. Because it will run headless, we install chromedriver and headless-chromium.

$ mkdir -p headless/python/bin
# Create the destination directory in advance

$ cd headless/python/bin

$ curl -SL https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-37/stable-headless-chromium-amazonlinux-2017-03.zip > headless-chromium.zip
# Download headless-chromium
$ unzip -o headless-chromium.zip -d .
$ rm headless-chromium.zip
# Extract the archive and delete the zip

$ curl -SL https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip > chromedriver.zip
# Download chromedriver
$ unzip -o chromedriver.zip -d .
$ rm chromedriver.zip
# Extract the archive and delete the zip

Select each one in the Cloud9 file tree and choose Download to download it as a zip.

selenium is **under python**; chromedriver and headless-chromium are **under headless**.

headless


headless
    ┗  python
        ┗  bin
           ┣ chromedriver
           ┗ headless-chromium

selenium


python
    ┗ lib
       ┗  python3.7
             ┗ site-packages
                      ┣ selenium
                      ┣ selenium-3.141.0.dist-info
                      ┣ urllib3
                      ┗ urllib3-1.26.2.dist-info

If you already use Selenium or chromedriver, copies may already exist on your PC, but they **will not run on Lambda unless they are builds compatible with its Linux environment**. We therefore recommend using the ones installed on Cloud9, which is a Linux environment.
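
The layouts above can also be packaged from Python rather than through the file tree. The following is a minimal sketch, not the procedure used in this article: `build_layer_zip` is a hypothetical helper, and it assumes each archive should keep `python/` as its top-level folder so that, once Lambda places the layer under /opt, the files end up at paths such as /opt/python/bin/chromedriver.

```python
import shutil
from pathlib import Path


def build_layer_zip(parent_dir, zip_stem):
    """Zip the `python` folder found in parent_dir into <zip_stem>.zip.

    The archive keeps `python/` as its top-level entry, matching the
    directory trees shown above. Returns the path of the created zip.
    """
    parent = Path(parent_dir)
    if not (parent / "python").is_dir():
        raise FileNotFoundError(f"no 'python' folder under {parent}")
    # root_dir is where archiving starts; base_dir is the folder to include.
    return shutil.make_archive(
        str(zip_stem), "zip", root_dir=str(parent), base_dir="python"
    )
```

For example, `build_layer_zip(".", "python")` would produce python.zip for selenium and `build_layer_zip("headless", "headless")` would produce headless.zip, assuming both calls are made from the directory where the commands above were executed.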

Lambda

Now that we have the required libraries installed, we'll move this up to Lambda.

Upload to layer

First, open the Lambda console.

Select **Layers** from the console.

A layer is an archive of the libraries and other content a function needs, made available to the function when its code runs. Let's upload the selenium and headless files that we just prepared.

| Contents | Layer name | File |
| --- | --- | --- |
| chromedriver, headless-chromium | headless | headless.zip |
| selenium | selenium | python.zip |

Set the runtime of both layers to Python 3.7.
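
The same upload can be scripted. This is only a sketch under assumptions: it uses boto3's `publish_layer_version` call with the zip sent inline, `layer_request` and `publish_layer` are hypothetical helper names, and AWS credentials and a region are assumed to be configured already. Larger archives may need to be uploaded through S3 instead of inline.

```python
from pathlib import Path


def layer_request(layer_name, zip_path, runtime="python3.7"):
    """Build the keyword arguments for Lambda's PublishLayerVersion call."""
    return {
        "LayerName": layer_name,
        "Content": {"ZipFile": Path(zip_path).read_bytes()},
        "CompatibleRuntimes": [runtime],
    }


def publish_layer(layer_name, zip_path, runtime="python3.7"):
    """Upload one zip as a new layer version and return its ARN."""
    import boto3  # imported lazily so layer_request() works without boto3

    client = boto3.client("lambda")
    response = client.publish_layer_version(
        **layer_request(layer_name, zip_path, runtime)
    )
    return response["LayerVersionArn"]
```

With the names from the table, the two calls would be `publish_layer("headless", "headless.zip")` and `publish_layer("selenium", "python.zip")`.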

Create functions and add layers

Next, create the function. The code comes later, so first create just the skeleton of the function.

From the dashboard, select Create function.

Name the function lambda_function_for_headless_chrome and choose Python 3.7 as the runtime.

Now you have an environment to run python.

Then add the layer you just created to this function.

Click Layers and select Add a layer.

Add the headless layer and the selenium layer one at a time from **Custom layers**.

When Layers shows (2), both layers have been added.

Function execution

Now let's run selenium in python.

Replace the code that already exists in the function's code editor with the following.

lambda_function_for_headless_chrome


# selenium is imported automatically from the layer's python directory
from selenium import webdriver


def lambda_handler(event, context):

    URL = "https://news.yahoo.co.jp/"

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--hide-scrollbars")
    options.add_argument("--single-process")
    options.add_argument("--ignore-certificate-errors")
    options.add_argument("--window-size=880,996")  # Chrome expects width,height
    options.add_argument("--no-sandbox")
    options.add_argument("--homedir=/tmp")
    # Layer contents are placed under /opt
    options.binary_location = "/opt/python/bin/headless-chromium"

    # Browser definition: path to chromedriver, then the options
    browser = webdriver.Chrome(
        "/opt/python/bin/chromedriver",
        options=options
    )

    try:
        browser.get(URL)
        title = browser.title
    finally:
        # quit() ends the session and also shuts down chromedriver
        browser.quit()

    return title

The point to watch here is the **path to chromedriver**. It is specified under /opt because **folders uploaded as a Lambda layer are automatically placed under /opt**.

Selenium, on the other hand, needs no path.

Inside a Lambda layer, anything placed in either of the following locations is **loaded automatically**:

・ **directly under python**
・ **under python/lib/python3.x/site-packages** (x matches the version in use)

That is why the import statement works this time without specifying a path.
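
To see which layer directories the runtime actually searches, the import path can be inspected from inside the function. This is a small diagnostic sketch; `layer_dirs_on_path` is a hypothetical helper, not part of Lambda or Selenium.

```python
import sys


def layer_dirs_on_path(search_path=None):
    """Return the entries of the import search path that live under /opt."""
    entries = sys.path if search_path is None else search_path
    return [p for p in entries if p == "/opt" or p.startswith("/opt/")]
```

Printing `layer_dirs_on_path()` at the top of `lambda_handler` shows which /opt entries Python searches, which can help when an import from a layer fails.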

Change basic settings

Finally, adjust the **basic settings** a little.

**Processing with Selenium takes longer than an ordinary program**, so it is very likely to time out with the default setting. Set a longer timeout.

Also, **128 MB of memory was not enough, so change it to 256 MB** before running the test.
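
The same settings can be applied from code. The sketch below assumes boto3's `update_function_configuration` call and configured AWS credentials; `basic_settings` and `apply_basic_settings` are hypothetical helper names, and the 60-second timeout is an illustrative value, since the article only says the timeout should be longer. The 256 MB figure comes from the text above.

```python
def basic_settings(function_name, timeout_seconds=60, memory_mb=256):
    """Build keyword arguments for Lambda's UpdateFunctionConfiguration call."""
    # Lambda accepts timeouts up to 900 seconds and memory from 128 MB up.
    if not 1 <= timeout_seconds <= 900:
        raise ValueError("timeout must be between 1 and 900 seconds")
    if memory_mb < 128:
        raise ValueError("memory must be at least 128 MB")
    return {
        "FunctionName": function_name,
        "Timeout": timeout_seconds,
        "MemorySize": memory_mb,
    }


def apply_basic_settings(function_name, **overrides):
    """Send the settings to Lambda and return the service response."""
    import boto3  # imported lazily so basic_settings() works without boto3

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **basic_settings(function_name, **overrides)
    )
```

For this article's function the call would be `apply_basic_settings("lambda_function_for_headless_chrome")`.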

Run test

Now let's create a test and run the code.

Click Test, enter a name for the test event, and leave everything else at its default.

Once it is created, click Test again to run it!

If, after the waiting screen has been shown for a while, the execution result reports success and shows the page title returned by the function, it worked.
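
The function can also be run without the console. This is a sketch assuming boto3's `invoke` call with an empty JSON event and configured AWS credentials; `decode_result` and `run_function` are hypothetical helper names.

```python
import json


def decode_result(payload_bytes):
    """Decode the JSON payload Lambda returns for a synchronous invoke."""
    return json.loads(payload_bytes.decode("utf-8"))


def run_function(function_name="lambda_function_for_headless_chrome"):
    """Invoke the function synchronously and return its decoded result."""
    import boto3  # imported lazily so decode_result() works without boto3

    client = boto3.client("lambda")
    response = client.invoke(FunctionName=function_name, Payload=b"{}")
    return decode_result(response["Payload"].read())
```

Because `lambda_handler` returns the page title, `run_function()` should return that title as a string when the invocation succeeds.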

In the next part, we will write production code that implements more practical periodic processing.

If it doesn't work

If it doesn't work, please refer to the following.

When "'lambda_function': No module named 'selenium'" appears

When "'chromedriver' executable may have wrong permissions." appears
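
For the permissions error, one way to narrow down the cause is to check from inside the function whether the two binaries exist and are executable. This is a diagnostic sketch only; `check_binaries` is a hypothetical helper, and the default paths are the ones used in the handler above.

```python
import os

DEFAULT_BINARIES = (
    "/opt/python/bin/chromedriver",
    "/opt/python/bin/headless-chromium",
)


def check_binaries(paths=DEFAULT_BINARIES):
    """Report, for each path, whether it is missing, not executable, or ok."""
    report = {}
    for path in paths:
        if not os.path.isfile(path):
            report[path] = "missing"
        elif not os.access(path, os.X_OK):
            report[path] = "not executable"
        else:
            report[path] = "ok"
    return report
```

Returning `check_binaries()` from a temporary version of `lambda_handler` shows whether the layer files landed where the handler expects them and whether their execute permission survived packaging.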

Articles that I used as a reference

[Periodically run Python scraping on AWS Lambda](https://qiita.com/eisu26/items/be7a75edf7a798f17f11)

How to make AWS Lambda Layers when running selenium x chrome on AWS Lambda
