Currently, on my local PC, I am:
① scraping a website with Selenium in Python,
② accumulating the scraping results in a log file, and
③ running it periodically with cron.
I want to rebuild this in the cloud, on GCP. Each piece should probably be replaced as follows:
・Python execution environment: **Google Cloud Functions (GCF)**
・Log file storage: **Google Cloud Storage** or **Google Drive**
・Scheduled execution: **Google Cloud Scheduler**
There are three reasons.
① **I wanted to use GCP.** AWS would have been fine too, but somehow I ended up here.
② **I'm going to study abroad, so I didn't want to leave my PC on.** The city I live in gets ~~damn~~ hot in the summer, so I thought running a PC nonstop would put a heavy load on it. There is also a lot of lightning, so I worry about power outages. Besides, once everything is in the cloud, it should be easy to manage and maintain from anywhere.
③ **I was worried about the electricity bill.** I use an iMac, and [an iMac's electricity bill is about 60 yen a day](https://web.waytoearnmoney.org/2015/03/03/imac%E3%82%B9%E3%83%AA%E3%83%BC%E3%83%97%E6%99%82%E3%81%AE%E5%BE%85%E6%A9%9F%E9%9B%BB%E5%8A%9B%E3%81%A8%E9%9B%BB%E6%B0%97%E4%BB%A3%E3%81%AF%E3%81%A9%E3%82%8C%E3%81%8F%E3%82%89%E3%81%84%EF%BC%9F/) (apparently). That's 1,800 yen a month. GCF, on the other hand, has a free tier. I scrape once every 5 minutes, which probably fits comfortably. In other words, it's free.
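As a rough check on "probably fits comfortably", the arithmetic can be sketched in Python. The free-tier limits used here (2 million invocations and 400,000 GB-seconds per month) and the 30 seconds per scrape are assumptions for illustration, not measurements; check the current pricing page before relying on them. CPU time and network egress are ignored.

```python
# Rough monthly estimate. The free-tier limits and the run time below are
# assumptions for illustration; check the current GCP pricing page.
YEN_PER_DAY_IMAC = 60            # approximate figure from the linked article
DAYS_PER_MONTH = 30

FREE_INVOCATIONS = 2_000_000     # assumed free-tier invocations per month
FREE_GB_SECONDS = 400_000        # assumed free-tier memory-time per month

RUNS_PER_HOUR = 60 // 5          # one scrape every 5 minutes
MEMORY_GB = 512 / 1024           # corresponds to --memory 512MB
ASSUMED_SECONDS_PER_RUN = 30     # hypothetical; measure your own function

imac_yen_per_month = YEN_PER_DAY_IMAC * DAYS_PER_MONTH
invocations = RUNS_PER_HOUR * 24 * DAYS_PER_MONTH
gb_seconds = invocations * ASSUMED_SECONDS_PER_RUN * MEMORY_GB

fits = invocations <= FREE_INVOCATIONS and gb_seconds <= FREE_GB_SECONDS

print(f"iMac electricity: about {imac_yen_per_month} yen/month")
print(f"Invocations: {invocations} of {FREE_INVOCATIONS}")
print(f"Memory-time: {gb_seconds:.0f} of {FREE_GB_SECONDS} GB-seconds")
print("Fits in the assumed free tier:", fits)
```

Under these assumptions both numbers stay well below the limits, so the estimate has plenty of room even if a scrape takes longer than 30 seconds.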
The work breaks down into three main tasks:
① migrating the program to GCF
② setting up scheduled execution
③ storing the logs
This article covers only the first one, "migrating the program to GCF".
Among the many Google Cloud services, I also looked at two options besides Cloud Functions:
・Compute Engine: launch a PC in the cloud
・Cloud Run: put everything in a container and run it
I went with Cloud Functions for the following reasons. With Compute Engine, keeping an instance up all the time just to run something once every 5 minutes costs money. With Cloud Run, I wondered whether this job was really worth building a container for; my understanding is that Cloud Run is meant for running more full-fledged applications on demand, so it didn't seem like the right fit. If I've misunderstood, someone knowledgeable please tell me.
There are more articles on this than anyone could read, so google it!
This site was helpful to me. **Many thanks.** First, start Cloud Shell with the leftmost button in the button group at the top right of the screen. Once it has started, run the following commands.
# Clone the godsend repository that bundles useful tools such as webdriver
git clone https://github.com/ryfeus/gcf-packs.git
# Move into the source directory
cd gcf-packs/selenium_chrome/source
# Unzip the bundled browser
unzip headless-chromium.zip
# Deploy as-is for now (a program that opens a random Wikipedia page and fetches its title)
gcloud functions deploy handler --runtime python37 --trigger-http --region asia-northeast1 --memory 512MB
See here for the deployment options. Partway through the deployment, the prompt
Allow unauthenticated invocations of new function [handler]? (y/N)?
is displayed; enter "y". Anyone who knows the HTTP trigger, that is, the URL issued after deployment, can run the function, even a stranger. In the case of my program, though, nobody else gains anything by running it, so there should be no problem (although if someone maliciously ran it 100 million times, the usage fee would be enormous and I would die).
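If the "100 million times" scenario worries you, one possible mitigation is a shared-secret check at the top of the function. The following is only a sketch: the header name "X-Scraper-Token", the environment variable "SCRAPER_TOKEN", and the helper `is_authorized` are all hypothetical names I made up for illustration. The request is still counted as an invocation, but the function returns before Chrome starts.

```python
import hmac
import os

# Hypothetical names chosen for this sketch only.
TOKEN_HEADER = "X-Scraper-Token"
TOKEN_ENV_VAR = "SCRAPER_TOKEN"


def is_authorized(headers, expected=None):
    """Return True only if the request carries the expected shared token."""
    if expected is None:
        expected = os.environ.get(TOKEN_ENV_VAR, "")
    # No configured token means the check cannot pass: reject everything.
    if not expected:
        return False
    supplied = headers.get(TOKEN_HEADER, "")
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(supplied.encode(), expected.encode())


def handler(request):
    """HTTP entry point; GCF passes in a Flask request object."""
    if not is_authorized(request.headers):
        return ("Forbidden", 403)
    return "scrape would run here"
```

Answering "N" at the prompt instead makes callers authenticate through IAM, which is the stricter option but needs more setup on the calling side.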
After that, the function shows up in the console under the name "handler". Go back to Cloud Shell, where the deployment output looks like this:
Deploying function (may take a while - up to 2 minutes)...done.
availableMemoryMb: 256
entryPoint: handler
httpsTrigger:
  url: https://asia-northeast1-************.cloudfunctions.net/handler
ingressSettings: ALLOW_ALL
labels:
Copy the URL that starts with https from this output and pass it to curl:
curl https://asia-northeast1-************.cloudfunctions.net/handler
This returns the title of some Wikipedia page. You can also do the same without the command line: click "handler" in the console, click "Test" on the page that opens, and then "Test the function". The original code is "main.py" in the same directory. If you use tools other than chromedriver and headless-chromium, you'll have to bring them yourself (anything that can be handled with an import in Python should be fine).
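The same request can also be made from Python instead of curl. This is a minimal sketch using only the standard library; `call_function` and its `opener` parameter are hypothetical names, and the URL you pass in is the trigger URL from your own deployment output.

```python
import urllib.request


def call_function(url, opener=urllib.request.urlopen, timeout=60):
    """GET the function's trigger URL and return the response body as text."""
    # opener is injectable so the helper can be tried without network access.
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `call_function` with your trigger URL should give the same text that curl prints. The generous default timeout is there because starting Chrome inside the function can take a while.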
All that's left is to replace the contents of "main.py" with the code you used locally. When writing code, the "Open Editor" button on the Cloud Shell screen is convenient. Then deploy again:
gcloud functions deploy *** --runtime python37 --trigger-http --region asia-northeast1 --memory 512MB
Note that the name after deploy (shown as *** above) must match the function name in "main.py", or the deployment fails with an error. Test it, and if there are no problems, you're done!! Thank you for your hard work.
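To illustrate the naming rule, here is a minimal sketch of what "main.py" could look like. The function name `scrape` is hypothetical; with this file, the *** in the deploy command would have to be scrape as well (gcloud also has an --entry-point option for cases where the two names should differ).

```python
# main.py (sketch). "scrape" is a hypothetical name: whatever you choose,
# the name given to "gcloud functions deploy" has to be the same.

def scrape(request):
    """HTTP entry point. GCF passes in a Flask request object."""
    # Replace this body with the scraping code you ran locally.
    title = "placeholder title"
    # The returned string becomes the HTTP response body.
    return f"Scraped: {title}"
```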
Chrome eats more memory than you'd expect. Click the name of the deployed function to go to its details screen, where you can check memory usage from the "General" pull-down. If the behavior looks strange, try changing the memory size.
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--ignore-certificate-errors')
Do not remove these; the function stops working if you do. However, even the official WebDriver documentation doesn't say what each argument means, so if you know a suitable page, please let me know.
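For what it's worth, these arguments are Chromium command-line switches rather than WebDriver options, which is probably why the WebDriver documentation is silent about them. Below is a sketch that keeps each flag next to a short description. The descriptions are my own reading of the switches, not official wording, and `apply_flags` is a hypothetical helper.

```python
# Descriptions are an informal reading of the Chromium switches, not
# official wording; treat them as a starting point.
CHROME_FLAGS = {
    "--headless": "run without a visible browser window",
    "--disable-gpu": "turn off GPU hardware acceleration",
    "--window-size=1280x1696": "set the initial window size",
    "--no-sandbox": "disable the process sandbox (often needed in containers)",
    "--hide-scrollbars": "hide scrollbars, e.g. in screenshots",
    "--enable-logging": "enable Chrome's own logging",
    "--log-level=0": "lowest log threshold, so INFO and above are logged",
    "--v=99": "very verbose logging",
    "--single-process": "run browser and renderer in one process",
    "--ignore-certificate-errors": "do not fail on invalid TLS certificates",
}


def apply_flags(chrome_options, flags=CHROME_FLAGS):
    """Add every flag, in order, to an object that has add_argument()."""
    for flag in flags:
        chrome_options.add_argument(flag)
    return chrome_options
```

Passing the `chrome_options` object from "main.py" to `apply_flags` adds the same ten arguments as the lines above, just with the explanations kept in one place.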
During the test I get an error like the one in the image, yet mysteriously the function runs to the end and the log is output properly. Hmm...
I figured out the cause of the error above!! To open a link in a new tab while scraping, I use
key_down(Keys.CONTROL).click().key_up(Keys.CONTROL)
but I had written Keys.**COMMAND** there because my local environment was a Mac. The Python execution environment on GCF is Ubuntu, so it has to be Keys.CONTROL.
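One way to avoid this trap is to pick the modifier key from the platform at run time, so the same code works on both the Mac and GCF. This is a sketch; `new_tab_modifier_name` and `click_in_new_tab` are hypothetical helpers, and I am assuming `driver` and `element` come from your existing Selenium code.

```python
import platform


def new_tab_modifier_name(system=None):
    """Return the name of the selenium Keys attribute for 'open in new tab'."""
    system = system or platform.system()
    # macOS reports "Darwin" and uses Command; other systems use Control.
    return "COMMAND" if system == "Darwin" else "CONTROL"


def click_in_new_tab(driver, element):
    """Modifier-click an element so that its link opens in a new tab."""
    # Imported here so the helper above also works without selenium installed.
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.keys import Keys

    modifier = getattr(Keys, new_tab_modifier_name())
    ActionChains(driver).key_down(modifier).click(element).key_up(modifier).perform()
```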
Next time: ~~Joey Wheeler dies. Duel standby!~~ setting up scheduled execution. Look forward to it!