Summary of the pitfalls I ran into running Selenium on AWS Lambda (Python)

Introduction

While scraping with Python I got stuck on authentication and used Selenium to get past it quickly. Then I got stuck again when I tried to run it on Lambda, so I am leaving this as a memo (parts of it may be a bit rough).

Work environment

Pitfall 1

Q. How do I make headless-chromium and chromedriver available so that I can use Selenium?
A. Compress the two together and register them as a layer.

Details

The following project provides a headless browser that can be used on AWS Lambda. Pick matching versions of this chrome build and chromedriver and compress them into a single file. https://github.com/adieuadieu/serverless-chrome

Reference article https://qiita.com/mishimay/items/afd7f247f101fbe25f30

How to register a layer

  1. Click "Create Layer".
  2. Give the layer a name and description you will recognize and upload the zip file (chrome and chromedriver compressed together can easily exceed 10 MB, so you will need to upload the zip to S3 first and register it from there).
  3. Click "Add Layer" in Lambda and select the layer you added.
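The same registration can also be scripted instead of clicked through. The following is only a sketch, assuming boto3 is installed, AWS credentials are configured, and the zip is already in S3. The helper names, the layer name, the bucket and the key are all hypothetical placeholders, not values from this memo.

```python
def build_layer_request(layer_name, bucket, key, runtimes=("python3.7",)):
    """Build the keyword arguments for Lambda's publish_layer_version call.

    layer_name, bucket and key are placeholders chosen by the caller.
    """
    return {
        "LayerName": layer_name,
        "Description": "headless-chromium and chromedriver for Selenium",
        # The zip is referenced by its S3 location rather than uploaded inline.
        "Content": {"S3Bucket": bucket, "S3Key": key},
        "CompatibleRuntimes": list(runtimes),
    }


def publish_layer(request):
    """Publish the layer and return the ARN of the new layer version."""
    import boto3  # imported here so the request can be built without boto3

    client = boto3.client("lambda")
    response = client.publish_layer_version(**request)
    return response["LayerVersionArn"]


if __name__ == "__main__":
    # Hypothetical names; printing only, nothing is sent to AWS here.
    print(build_layer_request("selenium-chrome", "my-bucket", "layers/chrome.zip"))
```

Calling `publish_layer(build_layer_request(...))` would perform the actual registration. The layer still has to be attached to the function afterwards, as in step 3.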

Supplementary explanation

Files registered in a layer are placed under "/opt/xxxx". For example, if you create a chrome directory, put the serverless-chrome binary (headless-chromium) and chromedriver in it, compress it, and register the zip as a layer, the definition looks like this.

Example

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.binary_location = '/opt/chrome/headless-chromium'
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--single-process")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1280,1024")
    options.add_argument("--disable-application-cache")
    options.add_argument("--disable-infobars")
    options.add_argument("--hide-scrollbars")
    options.add_argument("--enable-logging")
    options.add_argument("--log-level=0")
    options.add_argument("--ignore-certificate-errors")

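    # executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
    driver = webdriver.Chrome(
        options=options, executable_path='/opt/chrome/chromedriver')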
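The layer layout used above (a chrome directory that ends up at /opt/chrome) could be zipped with a small script. This is a minimal sketch using only the standard library. The function name `build_layer_zip` and its arguments are hypothetical, and it assumes both binaries sit in one local directory. It sets the executable permission bits explicitly, on the assumption that the binaries need to stay executable after Lambda unpacks the layer.

```python
import os
import stat
import zipfile


def build_layer_zip(src_dir, zip_path, prefix="chrome"):
    """Zip every regular file in src_dir under `prefix/`, marked rwxr-xr-x."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(src_dir)):
            full = os.path.join(src_dir, name)
            if not os.path.isfile(full):
                continue
            info = zipfile.ZipInfo(f"{prefix}/{name}")
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3  # Unix, so the permission bits are honoured
            # The upper 16 bits of external_attr hold the Unix file mode.
            info.external_attr = (stat.S_IFREG | 0o755) << 16
            with open(full, "rb") as fh:
                zf.writestr(info, fh.read())
    return zip_path
```

For example, `build_layer_zip("./bin", "chrome-layer.zip")` would produce entries such as `chrome/headless-chromium` and `chrome/chromedriver`, which matches the `/opt/chrome/...` paths in the example.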

Pitfall 2

Q. Selenium doesn't work on Lambda.
A. With the Python 3.8 runtime it does not work, apparently because that runtime is based on Amazon Linux 2 and some libraries are missing.

I got a status code 127 error at run time (a quick search suggests that a library is missing). There may be a workaround, but the quickest fix is to run it on the python3.7 runtime.
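To check the "missing library" guess, one option is to list the shared libraries a binary cannot resolve. The sketch below assumes the `ldd` command is available wherever you run it (for example in a local container that mimics the Lambda runtime), which this memo did not verify. The function names are hypothetical.

```python
import subprocess


def missing_shared_libs(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Each such line has the form "<library name> => not found".
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on `path` and return the unresolved library names."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_shared_libs(result.stdout)
```

Running `check_binary('/opt/chrome/headless-chromium')` in the target environment would return an empty list if every dependency resolves.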

Pitfall 3

Implementing everything with Selenium is slow

Symptom

With 512 MB of memory, it took 12 minutes to parse a 70-page table of about 2,100 items (30 items per page) and register it in DynamoDB.

Cause

Locating elements through Selenium was very expensive.

Countermeasures

For example, use Selenium only to load the page, then hand the page source to BeautifulSoup and do the parsing there.

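Example

    from bs4 import BeautifulSoup

    driver.get('https://example.com')
    # Hand the rendered HTML to BeautifulSoup once instead of querying Selenium per element
    html = BeautifulSoup(driver.page_source, 'html.parser')
    table = html.select_one('table')
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # todo
    driver.quit()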

As a result, the processing time dropped from 12 minutes to 2 minutes under the same conditions.
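As one way to fill in the per-row processing, the cell text can be collected into plain Python lists before anything is written to DynamoDB. This is a sketch only: `table_to_rows` is a hypothetical helper, and it assumes the page contains a single table whose data cells are `<td>` elements.

```python
from bs4 import BeautifulSoup


def table_to_rows(page_source):
    """Return the first table in the HTML as a list of rows of cell text."""
    soup = BeautifulSoup(page_source, "html.parser")
    table = soup.select_one("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # rows that only contain <th> headers yield no <td> cells
            rows.append(cells)
    return rows
```

With Selenium this would be called as `table_to_rows(driver.page_source)` once per page, so the browser is only asked for the HTML and never for individual elements.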

A small thing I learned during implementation

When Lambda runs Python, it appears to load libraries from "/opt/python/". So if you install the libraries into a directory with something like "pip install -r ./requirements.txt -t ." (run inside a directory named python, so that the zip unpacks to "/opt/python/"), zip the result and register it as a layer, you only need to upload the source file you want to run, and you can edit that source from Lambda's web console.
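The library layer described above could be built with a short script. This is a sketch under assumptions: the function names and the `build` directory are hypothetical, pip is available for the current interpreter, and the packages in requirements.txt are compatible with the Lambda runtime.

```python
import os
import shutil
import subprocess
import sys


def install_requirements(requirements, build_dir):
    """pip-install the requirements into <build_dir>/python."""
    target = os.path.join(build_dir, "python")
    os.makedirs(target, exist_ok=True)
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", requirements, "-t", target],
        check=True,
    )
    return target


def zip_layer(build_dir, out_base):
    """Zip <build_dir>/python so that entries start with 'python/'."""
    # make_archive appends ".zip" to out_base and returns the final path.
    return shutil.make_archive(out_base, "zip", root_dir=build_dir, base_dir="python")
```

Calling `install_requirements("requirements.txt", "build")` followed by `zip_layer("build", "python-libs")` would leave a `python-libs.zip` whose contents unpack under `/opt/python/` once registered as a layer.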

In closing

I didn't think Selenium could run on Lambda, so I had been planning to set up cron in WSL on my local PC, but it turns out I don't need to.

Layers are awesome :laughing: Next I want to be able to deploy all of this with sam, using template.yml and a layer yml file :sweat:
