[PYTHON] Strategy to bring local scraping work to GCP (Part 1)

What I want to do

Currently, on a local PC, I am: ① scraping websites with Selenium in Python, ② accumulating the scraping results in a log file, and ③ running it on a schedule with cron.

(Diagram: クリックツールGCP移行.png) I want to build this in a cloud environment on GCP. (Diagram: クリックツールGCP移行 (1).png) Presumably the pieces map like this: the Python execution environment becomes **Google Cloud Functions (GCF)**, log file storage becomes **Google Cloud Storage** or **Google Drive**, and the scheduled execution becomes **Google Cloud Scheduler**. These should be the replacements!

Background

There are three reasons. ① **I wanted to use GCP.** AWS would have been fine too, but somehow. ② **I'm going abroad to study, so I didn't want to leave my PC running.** The city I live in gets ~~hellishly~~ hot in the summer, so I figured keeping a PC running would be a heavy burden. There's also a lot of lightning, so power outages worry me. And once it's in the cloud, it should be easy to manage and maintain from wherever I am. ③ **I was worried about the electricity bill.** I use an iMac, and [its electricity cost is about 60 yen a day](https://web.waytoearnmoney.org/2015/03/03/imac%E3%82%B9%E3%83%AA%E3%83%BC%E3%83%97%E6%99%82%E3%81%AE%E5%BE%85%E6%A9%9F%E9%9B%BB%E5%8A%9B%E3%81%A8%E9%9B%BB%E6%B0%97%E4%BB%A3%E3%81%AF%E3%81%A9%E3%82%8C%E3%81%8F%E3%82%89%E3%81%84%EF%BC%9F/) (probably). That's 1,800 yen a month. GCF, on the other hand, has a free tier, and scraping once every 5 minutes should fit comfortably within it. In other words, it's free.

Menu

The following three main tasks are required: ① migrating the program to GCF, ② setting up scheduled execution, ③ storing the logs. This time, we go as far as ① "migrating the program to GCF".

About GCF

Among the many Google Cloud services, besides Cloud Functions I also looked into:
・launching a machine in the cloud with Compute Engine
・packing everything into a container and running it with Cloud Run

The reason for adopting Cloud Functions: with Compute Engine it costs money to keep an instance up all the time just to run something once every 5 minutes, and with Cloud Run, is it even worth building a container for this? As I understand it, Cloud Run is meant for temporarily running a more full-fledged application in the first place, so it felt like the wrong fit. If I've misunderstood, someone knowledgeable please correct me.

Migrating the program to GCF

GCP registration and initial GCF setup

There are more articles on this than you can count, so just google it!

From preparing Selenium etc. through deployment and testing

This site was a great help to me. **Many thanks.** First, start Cloud Shell from the leftmost button in the group of buttons at the top right of the screen. (Screenshot: スクリーンショット 2020-05-16 0.34.16.png) Once it starts, run the following commands.

# Clone the godsend repo that bundles useful tools such as the webdriver
git clone https://github.com/ryfeus/gcf-packs.git
# Move into the source directory
cd gcf-packs/selenium_chrome/source
# Unzip the headless Chromium binary
unzip headless-chromium.zip
# Deploy as-is for now (a sample program that accesses a random Wikipedia page and fetches the page title)
gcloud functions deploy handler --runtime python37 --trigger-http --region asia-northeast1 --memory 512MB

See here for details on the deployment options. Partway through the deployment, the prompt

Allow unauthenticated invocations of new function [handler]? (y/N)?

is displayed; enter "y". Anyone who knows the HTTP trigger, i.e. the URL issued after deployment, can invoke the function, even a complete stranger, but since another person gains nothing by running it (at least in the case of my program), there should be no problem (though if someone maliciously invoked it 100 million times, the usage fee would be enormous and I would die).
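As an aside (my own addition, not part of the original setup): if the public URL bothers you, one simple guard is to reject requests that don't carry a pre-shared token. Below is a minimal sketch; X-Scraper-Token and SCRAPER_TOKEN are hypothetical names you would choose yourself. GCF's Python runtime hands the handler a Flask request object, so headers can be read as shown.

import os

def handler(request):
    # Hypothetical guard: compare a request header against an environment
    # variable set on the function; callers without the token get a 403
    if request.headers.get('X-Scraper-Token') != os.environ.get('SCRAPER_TOKEN'):
        return ('Forbidden', 403)
    # ...the actual scraping logic would continue here...
    return 'OK'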

The function then appears on the console screen under the name "handler", like this. (Screenshot: スクリーンショット 2020-05-16 0.46.26.png) Going back to Cloud Shell, the output is:

Deploying function (may take a while - up to 2 minutes)...done.
availableMemoryMb: 256
entryPoint: handler
httpsTrigger:
  url: https://asia-northeast1-************.cloudfunctions.net/handler
ingressSettings: ALLOW_ALL
labels:

Copy the https://... part of the output; running

curl https://asia-northeast1-************.cloudfunctions.net/handler

returns the title of some Wikipedia page. You can also do the same thing without the console: click "handler" on the screen, then "Test" on the page you land on, then "Testing the function". The original source is "main.py" in the same directory. Note that if you use tools other than chromedriver and headless-chromium, you'll have to bring them yourself (anything that can be handled with a Python import should be fine).

All that's left is to rewrite the contents of "main.py" into the code you were running locally. When editing, "Open Editor" on the screen where Cloud Shell is launched is convenient. (Screenshot: スクリーンショット 2020-05-16 1.12.13.png)
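For reference, here is a minimal sketch of what the rewritten main.py might look like. It assumes the file layout of the gcf-packs sample (headless-chromium and chromedriver sitting next to main.py) and Selenium 3; https://example.com is just a placeholder for whatever site you actually scrape, and the function name must match the name you pass to gcloud functions deploy.

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def handler(request):
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    # ...plus the other arguments listed under "Important points" below
    # Point Selenium at the headless Chromium binary unzipped earlier
    chrome_options.binary_location = os.getcwd() + '/headless-chromium'

    driver = webdriver.Chrome(os.getcwd() + '/chromedriver', options=chrome_options)
    try:
        # Replace this part with your own scraping logic
        driver.get('https://example.com')
        result = driver.title
    finally:
        driver.quit()
    return result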

After rewriting the code, run the deploy again:

gcloud functions deploy *** --runtime python37 --trigger-http --region asia-northeast1 --memory 512MB

Note that the *** after deploy must match the function name in main.py, or you'll get an error. Test it, and if there's no problem, you're done!! Thanks for your hard work.

Important points

About memory size

Chrome eats up more memory than you might expect. Click the name of the deployed function to go to its details screen. (Screenshot: Screenshot 2020-05-18 16.38.14.png) You can check memory usage from the "General" pull-down. If the function behaves strangely, increase the memory size.

Things to be careful about when rewriting

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')                   # run Chrome with no UI
chrome_options.add_argument('--disable-gpu')                # disable GPU acceleration
chrome_options.add_argument('--window-size=1280x1696')      # virtual window size
chrome_options.add_argument('--no-sandbox')                 # needed in restricted container environments
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')             # verbose Chrome logging
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')             # run Chrome as a single process
chrome_options.add_argument('--ignore-certificate-errors')  # skip TLS certificate checks

Do not delete any of these; without them, it simply doesn't work. That said, even the official webdriver documentation doesn't explain what each argument means, so if you know a good reference page, please let me know.

Afterword

During testing I get an error like the one in the image, yet mysteriously the log shows the run completing to the end and producing output. Hmmm... (Screenshot: スクリーンショット 2020-05-16 1.23.21.png)

Postscript

I figured out the cause of the error above!! When scraping, to open a link in a new tab I do

key_down(Keys.CONTROL).click().key_up(Keys.CONTROL)

but the modifier was set to Keys.**COMMAND** because my local environment was a Mac. GCF's Python execution environment is Ubuntu, so it has to be Keys.CONTROL.
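For reference, a minimal sketch of making that modifier key portable between a Mac and GCF's Ubuntu runtime (driver and link are placeholders for whatever your scraping code actually uses):

import platform

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# COMMAND opens a link in a new tab on macOS; on Linux (GCF's Ubuntu) it's CONTROL
modifier = Keys.COMMAND if platform.system() == 'Darwin' else Keys.CONTROL

ActionChains(driver).key_down(modifier).click(link).key_up(modifier).perform()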

Next time

Next time: ~~Joey Wheeler dies! Duel standby!~~ setting up scheduled execution, so look forward to it!
