[PYTHON] Run headless-chrome on a Debian-based image

I want to scrape pages that render their content with JavaScript

A headless browser makes this easy, and PhantomJS still works comfortably for running locally on macOS. I wanted to do the same on Cloud Run, but when I tried it with the official Python image it turned out to be unexpectedly complicated, so here are my notes.

Too much information, and I couldn't make sense of it

PhantomJS is apparently no longer being updated, so I decided to switch to headless Chrome. I didn't want to spend many hours on it, so I skimmed other people's articles, but I couldn't follow them, so I ended up working it out myself.

Conclusion

It's not particularly difficult, and it works easily if the following conditions are met.

Install Chrome itself
Download the ChromeDriver version that matches the installed Chrome
Set the startup options appropriately

Dockerfile

Base image used:

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

When installing Chrome itself, be sure to check which version actually gets installed, because the driver has to match it.

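# Register Google's apt repository and its signing key.
RUN sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
# Update and install in a single layer so a cached, stale package index is never reused.
RUN apt-get update && apt-get install -y google-chrome-stable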

Next, find the driver download that matches your version of Chrome. The downloads are listed at https://chromedriver.storage.googleapis.com/ and the latest driver release for a given Chrome major version can be looked up there. In this case Chrome was version 80, so the lookup URL was https://chromedriver.storage.googleapis.com/LATEST_RELEASE_80
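The lookup above can be sketched in Python. This is only a minimal sketch: the function names are hypothetical, and it assumes that `google-chrome --version` prints something like "Google Chrome 80.0.3987.106". The base URL and the LATEST_RELEASE_80 pattern are the ones from this memo.

```python
import re

# Base URL taken from the memo above; the helper names are hypothetical.
CHROMEDRIVER_BASE = "https://chromedriver.storage.googleapis.com"


def parse_chrome_version(version_output: str) -> str:
    """Extract a dotted version such as '80.0.3987.106'.

    Assumes the input looks like the output of `google-chrome --version`.
    """
    match = re.search(r"\d+(?:\.\d+){3}", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return match.group(0)


def latest_release_url(chrome_version: str) -> str:
    """Build the LATEST_RELEASE_<major> lookup URL for a Chrome version."""
    major = chrome_version.split(".")[0]
    return f"{CHROMEDRIVER_BASE}/LATEST_RELEASE_{major}"


if __name__ == "__main__":
    version = parse_chrome_version("Google Chrome 80.0.3987.106")
    print(latest_release_url(version))
```

Opening the printed URL returns the driver version number to download for that Chrome major version.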

Download and unzip the driver using the version number you found.

RUN wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
RUN unzip chromedriver_linux64.zip -d /usr/bin/

Of course, make sure both are on the PATH at this stage.

which chromedriver
which google-chrome
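The same PATH check can also be done from Python, for example at application startup. This is a minimal sketch using the standard library's `shutil.which`; the helper name `missing_binaries` is hypothetical and not part of the original setup.

```python
import shutil


def missing_binaries(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # An empty list means both executables were found.
    print(missing_binaries(["chromedriver", "google-chrome"]))
```

If the printed list is not empty, Selenium will most likely fail to start the browser, so it is worth checking this before debugging anything else.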

That is all the setup you need. After that you just use it; here is a usage example for now.

app.py


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


URL = "https://example.jp"


def get_trends():
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
        soup = BeautifulSoup(html, "lxml")
        return soup  # extract whatever you need from the parsed page here
    finally:
        driver.quit()  # always shut down the browser process, even on errors
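The three startup options are the part that matters inside a container, so it may help to keep them in one place with a note on why each is there. The following is a minimal sketch; the helper name `build_chrome_args` is hypothetical, and the comments reflect my understanding of the flags rather than anything stated in the original memo.

```python
def build_chrome_args(headless: bool = True) -> list:
    """Return the Chrome startup flags used in this memo as a plain list."""
    args = [
        # Chrome refuses to start as root with its sandbox enabled,
        # and containers usually run as root.
        "--no-sandbox",
        # Write shared memory files to /tmp instead of /dev/shm,
        # which is often very small inside containers.
        "--disable-dev-shm-usage",
    ]
    if headless:
        # No display is available inside the container.
        args.insert(0, "--headless")
    return args


if __name__ == "__main__":
    print(build_chrome_args())
```

With Selenium, each returned string would be passed to `options.add_argument(...)` in a loop before creating the driver.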

That's all.

These are only rough notes.
