[PYTHON] Automatic scraping of reCAPTCHA site every day (6/7: containerization)

  1. Requirement definition / Python environment construction
  2. Create a site scraping mechanism
  3. Process the downloaded file (xls) to create the final product (csv)
  4. Download file from S3 / Create file upload to S3
  5. Implement 2captcha
  6. **Make it runnable in a Docker container**
  7. Register for AWS batch

Make it work on the server

As of the last installment, once the batch is started it runs automatically with no further intervention. With cron it could run every day, but that would mean keeping my PC on all the time. Somehow, I want to make it run on a server instead.

The obstacle here is that **this scraping requires a display**. I also tried Chrome's headless mode, but it didn't work.

So this time I decided to implement it with Xvfb, a mechanism that creates a virtual display. It's somewhat old technology, but it does the job.

Xvfb is an application that runs on Linux, so I decided to build everything in a Linux Docker container and, in the end, run the batch with AWS Batch.

Docker image creation

Verification before building the image

First, create the Docker image to use. I went with CentOS, which came up a lot while I was researching Xvfb.

Start from the official CentOS image and install the required applications by hand to verify the steps.

Mac (host)


docker pull centos          # Pull the CentOS image from Docker Hub
docker run -it -d centos    # Start a container in the background
docker ps                   # Confirm it is running and get the container ID
docker exec -it b7948c7802eb /bin/bash  # Open a shell inside the container

What's needed is Python, the scraping packages, Xvfb, and Firefox, so I install each one in turn.

Trying things out in the container


yum install -y python36  # Install Python
python3 -m pip install --upgrade pip  # Upgrade pip
pip install requests  # Install the required packages one by one
...
yum -y install xorg-x11-server-Xvfb  # Install Xvfb
yum -y install firefox  # Install Firefox
Xvfb :1 -screen 0 1600x1200x16 &  # Launch Xvfb on display :1
export DISPLAY=:1  # Use the display defined as :1
firefox  # Launch Firefox (it renders to the virtual display)

Oh... I can't see the screen, but it looks like Firefox is running? So next, let's run my program here.
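Before moving on, here is a minimal sketch of what running the program under Xvfb amounts to, assuming Selenium with the Firefox driver (the URL is illustrative, and executable_path is the Selenium 3 style call; this is not the actual batch code):

# Sketch: with DISPLAY pointed at Xvfb, Selenium's Firefox driver
# renders to the virtual display instead of a real screen.
import os
from selenium import webdriver

os.environ["DISPLAY"] = ":1"  # the Xvfb display started above

driver = webdriver.Firefox(executable_path="app/drivers/geckodriver_linux")
driver.get("https://example.com")  # illustrative URL
print(driver.title)
driver.quit()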

With that confirmed, I started creating the Dockerfile. (Actually, it might have been more efficient to test by copying files into the running container with the docker cp command.)

Creating a Dockerfile

FROM centos
# (1) Set the time zone to JST (the default is UTC)
ENV TZ JST-9

# Set the home directory
ENV HOME=/home
WORKDIR $HOME

# Copy my app (the app/ directory below) under home
COPY . $HOME/

RUN yum install -y python36
RUN python3 -m pip install --upgrade pip
# (2) Install the required packages collected in requirements.txt
RUN pip install -r app/requirements.txt
RUN yum -y install xorg-x11-server-Xvfb
RUN yum -y install firefox

RUN chmod 744 startup.sh
# (3) Start via a shell script so Xvfb launches at docker run time
CMD ["./startup.sh"]

(1) Apparently the default time zone is UTC. Time matters for this batch, so I change the time zone here.
(2) At first I wrote one line per package, but it was cleaner to group them, so I collected the required packages in requirements.txt. The required packages are the ones output by the pip freeze command.
(3) It turned out that the Xvfb command has to be started when docker run executes, so I wrapped those steps in a shell script.
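As a quick sanity check on (1) (a throwaway sketch, not part of the batch), you can confirm inside the container that Python now reports JST:

# With ENV TZ JST-9, glibc (and therefore Python's time handling)
# reports local time as JST instead of UTC.
import time
from datetime import datetime

print(time.tzname)     # ('JST', 'JST') when TZ=JST-9 is set
print(datetime.now())  # local wall-clock time, now in JST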

startup.sh


#!/usr/bin/env bash
Xvfb :1 -screen 0 1600x1200x16 &  # launch the virtual display in the background
export DISPLAY=:1                 # point applications at display :1
python3 app/source/run.py --run_mode test  # will switch to normal mode at the end

Slight modifications to make it work on Linux

- Downloaded the geckodriver for Linux
- Added a branch that detects the OS to decide which driver to use (see the sketch below)
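A minimal sketch of that branch (the helper name is mine; the file names match the drivers directory in the structure below):

# Sketch: pick the geckodriver binary that matches the current OS.
import platform

def geckodriver_path():
    if platform.system() == "Linux":
        return "app/drivers/geckodriver_linux"  # inside the container
    return "app/drivers/geckodriver"            # macOS development machine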

The final file structure looks like this.

├── Dockerfile
├── README.md
├── app
│   ├── drivers
│   │   ├── geckodriver
│   │   └── geckodriver_linux
│   ├── requirements.txt
│   └── source
│       ├── run.py
│       ├── scraping.py
│       ├── make_outputs.py
│       ├── s3_operator.py
│       └── configs.py
├── startup.sh
└── tmp
    ├── files
    │   ├── download
    │   ├── fromS3
    │   └── toS3
    └── logs
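For context, here is a rough sketch of how the entry point ties these modules together. This is inferred from the file names and the --run_mode flag in startup.sh, not the actual code:

# run.py (rough sketch, assumed wiring)
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_mode", default="normal")  # "test" while verifying
    args = parser.parse_args()
    # scraping.py downloads the xls files, make_outputs.py builds the csv,
    # and s3_operator.py exchanges files with S3 (see earlier installments)
    print("running in", args.run_mode, "mode")

if __name__ == "__main__":
    main()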

Try to start

It can be executed with the following commands.

docker build -t myapp .   # build the image from the Dockerfile
docker run -it myapp      # run the batch in a container

In reality it did not go this smoothly... I must have typed the commands above about 40 times.

But when it finally works, it's a moving moment! I can't see a screen at all, yet the console output appears and the files show up in S3.

All that's left is to run it on AWS... the goal is in sight.
