Beginners use Python for web scraping (4) -3 GCE VM instance creation and scraping on VM

Last time, I created a git repository on GCP and manually ran the scraping program (PGM) on Cloud Shell. This time, we finally set up automatic scraping on GCE.

Roadmap for learning web scraping in Python

(1) Get scraping of the target working locally, for a start.
(2) Write the locally scraped results to a Google Spreadsheet.
(3) Run the scraping automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(4)-1 Put the test PGM on the cloud and run it on Cloud Shell.
(4)-2 Add the scraping PGM to the repository and run it on Cloud Shell.
(4)-3 Create a Compute Engine VM instance and have it run the scraping automatically. ← You are here
(5) Try free, serverless automatic execution on the cloud (maybe Cloud Functions + Cloud Scheduler).

Steps in this article

(1) Create a GCE instance
(2) Enhanced security for GCE instance SSH connection
(3) Install git, anyenv, pyenv, python 3.8.5 on the GCE instance
(3) Clone repository to GCE instance
(4) Crontab settings (test PGM)
(5) Crontab settings (scraping PGM)

(1) Create a GCE instance

Create a Compute Engine instance. The Compute Engine free tier is as follows, so keep the instance within those limits.

Compute Engine
- 1 f1-micro instance per month (US regions only, excluding Northern Virginia [us-east4])
- 30 GB-months of HDD
- 5 GB-months of snapshot storage (some regions)
- 1 GB of network egress per month from North America to all regions (excluding China and Australia)

Quote: Google Cloud Platform Free Tier


The f1-micro free tier limit is based on hours, not on the number of instances. Since 720 hours are free each month, I think that covers 30 days if the instance is always running. Perhaps watch out for small charges on the 31st day of the month...?

The free tier f1-micro instance limit is based on time, not number of instances. All f1-micro instances will be free to use each month until you have used up the equivalent number of hours in the month. Usage is aggregated for all supported regions.

The Google Cloud free tier is also provided for external IP addresses used by VM instances. The external IP address in use can be used at no additional charge until the total number of hours in the month is used up. Usage is the sum of all in-use external IP addresses in all regions. The Google Cloud free tier for your external IP address applies to all instance types, not just f1-micro instances.

Quote: [Google Cloud Free Program](https://cloud.google.com/free/docs/gcp-free-tier?hl=ja)
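As a quick check of the hour arithmetic: 720 hours matches a 30-day month, while the quoted text says the allowance equals the number of hours in the month. A minimal Python sketch (the function name `hours_in_month` is made up for illustration) shows what that number is for 30-day and 31-day months:

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Return the total number of hours in the given calendar month."""
    _, days = calendar.monthrange(year, month)
    return days * 24


# September 2020 has 30 days, October 2020 has 31.
print(hours_in_month(2020, 9))   # 720
print(hours_in_month(2020, 10))  # 744
```

If the quoted wording applies as written, a single always-on f1-micro would use 744 of 744 hours in a 31-day month. This is only arithmetic, so check the current free tier terms before relying on it.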

(2) Enhanced security for GCE instance SSH connection

Limit SSH connections to your instance and change the default SSH port.

There are two ways to restrict SSH connections to an instance: registering SSH keys in the metadata, or using a feature called OS Login. This time, we restrict SSH with OS Login. I think the following article explains the differences clearly: "Use the convenient function 'OS Login' to restrict SSH connection to an instance with IAM"

In addition, OS Login supports 2-step verification, so enable that as well to strengthen security. Setting up OS Login using 2-step verification (https://cloud.google.com/compute/docs/oslogin/setup-two-factor-authentication?hl=ja)

I also disable the default SSH port 22. Logging in then requires an extra parameter, but a port left at 22 gets attacked indiscriminately, so there is no way around it. From the VM instance details, view the network details,

Setting up OS Login using a two-step authentication process

After this setup, logging in to the instance works as follows. First, every account other than the one configured in IAM (in my case, the Google account that is the project owner by default) should be rejected. Then, on the first login, you enter the passphrase of the automatically created "google_compute_engine" SSH key, after which you are offered the 2-step verification options. In my case, I log in with the one-time password from the authenticator app on my smartphone.

bash


hoge@cloudshell:~ (my-hoge-app)$ gcloud compute --project "my-hoge-app" ssh --zone "us-central1-a" "instance-7" --ssh-flag="-p 50050"
Enter passphrase for key '/home/hoge/.ssh/google_compute_engine':
Please choose from the available authentication methods:
1: Google phone prompt
2: Security code from Google Authenticator application
3: Voice or text message verification code

Enter the number for the authentication method to use: 2
Enter your one-time password: xxxxxx
Linux instance-7 4.19.0-10-cloud-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Sep 29 11:38:31 2020 from 35.189.187.53
hoge_gmail_com@instance-7:~$
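
Because the non-default port has to be passed on every login, it can help to assemble the command in one place. The following is a small Python sketch that builds the same `gcloud` arguments as the session above. The helper name `build_ssh_command` is hypothetical, and the project, zone, instance, and port values are simply the ones from the transcript.

```python
import shlex


def build_ssh_command(project: str, zone: str, instance: str, port: int) -> list:
    """Assemble the gcloud argv used to SSH into a VM on a non-default port."""
    return [
        "gcloud", "compute", "--project", project,
        "ssh", "--zone", zone, instance,
        f"--ssh-flag=-p {port}",  # forwarded to the underlying ssh client
    ]


cmd = build_ssh_command("my-hoge-app", "us-central1-a", "instance-7", 50050)
# shlex.join quotes the argument that contains a space.
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` as is, which avoids shell quoting mistakes around the `-p 50050` flag.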

Python3 installation on GCE

From here I install Python, but it is not straightforward. If you simply try to install Python, there is a good chance it will fail because of insufficient swap space.

So I gratefully follow the method from the following site: Building an environment where Python works in GCE's f1-micro environment

I do not fully understand what every command means, but since this is an environment that is perfectly fine to break, I enter the commands as they are. The general flow is git, anyenv, pyenv, and finally Python 3.8.5.

bash


hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ sudo dd if=/dev/zero of=/var/swapfile bs=1M count=1200
1200+0 records in
1200+0 records out
1258291200 bytes (1.3 GB, 1.2 GiB) copied, 11.2339 s, 112 MB/s
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ sudo chmod 600 /var/swapfile
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ sudo mkswap -L swap /var/swapfile
Setting up swapspace version 1, size = 1.2 GiB (1258287104 bytes)
LABEL=swap, UUID=80b8b0ee-3779-4f2d-b9cb-00cccd3f401f
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ sudo swapon /var/swapfile
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ cat /proc/swaps
Filename                                Type            Size    Used    Priority
/var/swapfile                           file            1228796 0       -2
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ echo '/var/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab
/var/swapfile swap swap defaults 0 0
hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$
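
To confirm from a script that the swap file is active, the `/proc/swaps` output can be parsed. The sketch below is only an illustration (the function name `parse_proc_swaps` is made up) and uses the output shown above as sample input. Sizes in `/proc/swaps` are in KiB.

```python
def parse_proc_swaps(text: str) -> list:
    """Parse /proc/swaps content into a list of dicts (sizes are in KiB)."""
    entries = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore malformed rows
        entries.append({
            "filename": fields[0],
            "type": fields[1],
            "size_kib": int(fields[2]),
            "used_kib": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries


sample = """Filename                                Type            Size    Used    Priority
/var/swapfile                           file            1228796 0       -2
"""
swaps = parse_proc_swaps(sample)
print(swaps[0]["filename"], round(swaps[0]["size_kib"] / 1024), "MiB")
```

On the VM itself, the same function could be fed `open("/proc/swaps").read()`. An empty result would mean no swap is active.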

git install

anyenv is installed with git, but git is not available on a fresh instance, so install it first.

bash


hoge_gmail_com@instance-7:~$ sudo apt-get install git-all

The log scrolls by furiously. It took about 5 minutes in my experience and ended like this.

bash


Install emacsen-common for emacs
emacsen-common: Handling install of emacsen flavor emacs
Install git for emacs
Setting up git-el (1:2.20.1-2+deb10u3) ...
Install git for emacs
Install git for emacs
Setting up emacs (1:26.1+1-3.2+deb10u1) ...
Setting up git-all (1:2.20.1-2+deb10u3) ...
Processing triggers for libgdk-pixbuf2.0-0:amd64 (2.38.1+dfsg-1) ...
Processing triggers for libc-bin (2.28-10) ...
hoge_gmail_com@instance-7:~$

anyenv installation

Return to the steps on the reference site and install anyenv.

bash


hoge_gmail_com@instance-7:~$ git clone https://github.com/anyenv/anyenv ~/.anyenv
Cloning into '/home/hoge_gmail_com/.anyenv'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 406 (delta 3), reused 4 (delta 2), pack-reused 392
Receiving objects: 100% (406/406), 70.99 KiB | 3.09 MiB/s, done.
Resolving deltas: 100% (179/179), done.
hoge_gmail_com@instance-7:~$

I run the remaining commands exactly as the procedure describes.

pyenv is also installed by following the procedure. Which version of pyenv did I get?

bash


hoge_gmail_com@instance-7:~$ pyenv --version
pyenv 1.2.20-7-gdd62b0d1

Python 3.8.5 installation

Finally, install Python 3.8.5 with pyenv. (Almost no log output appears, but it takes about 15 minutes.) Then set the global version to 3.8.5.

bash


hoge_gmail_com@instance-7:~$ pyenv install 3.8.5
Downloading Python-3.8.5.tar.xz...
-> https://www.python.org/ftp/python/3.8.5/Python-3.8.5.tar.xz
Installing Python-3.8.5...

Installed Python-3.8.5 to /home/hoge_gmail_com/.anyenv/envs/pyenv/versions/3.8.5

hoge_gmail_com@instance-7:~$
hoge_gmail_com@instance-7:~$ pyenv global 3.8.5
hoge_gmail_com@instance-7:~$ python --version
Python 3.8.5
hoge_gmail_com@instance-7:~$

(3) Clone repository to GCE instance

Let's clone the Cloud Source Repositories repository right away.

bash


instance-7:10/01/20 12:24:54 ~ $ gcloud source repos clone gce-cron-test
ERROR: (gcloud.source.repos.clone) PERMISSION_DENIED: Request had insufficient authentication scopes.

If you are in a compute engine VM, it is likely that the specified scopes during VM creation are not enough to run this command.
See https://cloud.google.com/compute/docs/access/service-accounts#accesscopesiam for more information of access scopes.
See https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#changeserviceaccountandscopes for how to update access scopes of the VM.

The gcloud source repos clone command that worked on Cloud Shell failed here...

Cloud API access scope fix

The Cloud API access scope seems to be the problem, so fix it. Stop the VM instance, then go to [VM Instance Details] > [Edit] and, under "Cloud API Access Scope" at the bottom, change the permission for "Cloud Source Repositories" from disabled to read only.
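
After restarting the VM, the granted scopes can be checked from inside the instance through the metadata server. The sketch below is a hedged illustration rather than an official procedure: the helper names are made up, and the set of scope URLs treated as sufficient for cloning is my assumption. The offline examples at the end use invented scope lists, not output from this article's VM.

```python
import urllib.request

# Metadata server endpoint listing the scopes of the VM's default service account.
SCOPES_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes"
)

# Assumption: any of these scopes should be enough to read Cloud Source Repositories.
SOURCE_SCOPES = {
    "https://www.googleapis.com/auth/source.read_only",
    "https://www.googleapis.com/auth/source.read_write",
    "https://www.googleapis.com/auth/source.full_control",
    "https://www.googleapis.com/auth/cloud-platform",
}


def fetch_scopes(url: str = SCOPES_URL) -> str:
    """Fetch the newline-separated scope list (only works from inside a GCE VM)."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode("utf-8")


def can_clone(scopes_text: str) -> bool:
    """Return True if any granted scope is in the assumed-sufficient set."""
    granted = {line.strip() for line in scopes_text.splitlines() if line.strip()}
    return bool(granted & SOURCE_SCOPES)


# Offline examples with invented scope lists:
print(can_clone("https://www.googleapis.com/auth/devstorage.read_only\n"))  # False
print(can_clone("https://www.googleapis.com/auth/source.read_only\n"))      # True
```

On the VM, `can_clone(fetch_scopes())` would give a quick yes or no before retrying the clone.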

Clone

You should now be able to work with Cloud Source Repositories from your VM.

bash


instance-7:10/01/20 12:51:36 ~ $
instance-7:10/01/20 12:51:36 ~ $ gcloud source repos clone gce-cron-test
Cloning into '/home/hogehoge_gmail_com/gce-cron-test'...
remote: Total 11 (delta 1), reused 11 (delta 1)
Unpacking objects: 100% (11/11), done.
Project [my-gce-app] repository [gce-cron-test] was cloned to [/home/hogehoge_gmail_com/gce-cron-test].

Succeeded. Check the contents of the gce-cron-test directory.

bash


instance-7:10/01/20 12:53:02 ~ $
instance-7:10/01/20 12:53:20 ~ $
instance-7:10/01/20 12:53:20 ~ $ cd gce-cron-test
instance-7:10/01/20 12:53:42 ~/gce-cron-test $ ls -la
total 36
drwxr-xr-x 3 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct  1 12:52 .
drwxr-xr-x 6 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct  1 12:52 ..
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 6148 Oct  1 12:52 .DS_Store
drwxr-xr-x 8 hogehoge_gmail_com hogehoge_gmail_com 4096 Oct  1 12:52 .git
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com  146 Oct  1 12:52 cron-test.py
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 2352 Oct  1 12:52 my-web-scraping-app-6293fbee8c53.json
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com 2763 Oct  1 12:52 requests-test2.py
-rw-r--r-- 1 hogehoge_gmail_com hogehoge_gmail_com  334 Oct  1 12:52 requirements.txt
instance-7:10/01/20 12:53:47 ~/gce-cron-test $

The familiar files are all there, so the clone succeeded. Next, check the Python path.

bash


instance-7:10/01/20 12:53:56 ~/gce-cron-test $
instance-7:10/01/20 12:53:56 ~/gce-cron-test $
instance-7:10/01/20 12:58:06 ~ $
instance-7:10/01/20 12:58:06 ~ $ which python
/home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python
instance-7:10/01/20 12:59:10 ~ $ 

(4) Crontab settings (test PGM)

Now, at last, edit crontab. You are asked to choose an editor first; I played it safe and chose vim.

bash


instance-7:10/01/20 13:05:10 ~/gce-cron-test $ 
instance-7:10/01/20 13:05:36 ~/gce-cron-test $
instance-7:10/01/20 13:05:36 ~/gce-cron-test $ crontab -e
no crontab for hogehoge_gmail_com - using an empty one

Select an editor.  To change later, run 'select-editor'.
  1. /bin/nano        <---- easiest
  2. /usr/bin/vim.basic
  3. /usr/bin/vim.tiny
  4. /usr/bin/emacs

Choose 1-4 [1]: 2
crontab: installing new crontab
instance-7:10/01/20 13:06:47 ~/gce-cron-test $

The message "crontab: installing new crontab" shows that a new crontab was created.

Check the contents with crontab -l. The procedure is the same as last time; only the Python path and the PGM and log directories are adjusted.

bash


instance-7:10/01/20 13:06:52 ~/gce-cron-test $ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
* * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/cron-test.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
instance-7:10/01/20 13:06:58 ~/gce-cron-test $

Here is the result of running the test PGM, before moving on to scraping.

bash


instance-7:10/01/20 13:09:49 ~/gce-cron-test $ cat cron.log
2020/10/01 13:07:02 cron works!
2020/10/01 13:08:01 cron works!
2020/10/01 13:09:01 cron works!
2020/10/01 13:10:01 cron works!

I was able to confirm that crontab works properly on GCE.
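
The contents of cron-test.py are not shown in this article, so the following is only a hypothetical sketch of a script that would produce log lines in the format seen above. The function name `cron_message` is invented for the example.

```python
from datetime import datetime


def cron_message(now: datetime) -> str:
    """Format a log line like '2020/10/01 13:07:02 cron works!'."""
    return now.strftime("%Y/%m/%d %H:%M:%S") + " cron works!"


if __name__ == "__main__":
    # The crontab entry appends stdout to cron.log with '>>', so printing is enough.
    print(cron_message(datetime.now()))
```

Keeping the script to a single print makes it easy to tell cron problems (nothing appended to cron.log) apart from problems in the scraping code itself.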

(5) Crontab settings (scraping PGM)

Scraping library installation

Install the library using requirements.txt.

bash


instance-7:10/02/20 11:54:02 ~/gce-cron-test $ /home/hogehoge_gmail_com/.anyenv/envs/pyenv/versions/3.8.5/bin/python3.8 -m pip install -r requirements.txt

Everything installed successfully.

bash


instance-7:10/02/20 11:57:34 ~/gce-cron-test $ pip list
Package              Version
-------------------- ---------
beautifulsoup4       4.9.1
cachetools           4.1.1
certifi              2020.6.20
chardet              3.0.4
google-auth          1.21.0
google-auth-oauthlib 0.4.1
gspread              3.6.0
httplib2             0.18.1
idna                 2.10
oauth2client         4.1.3
oauthlib             3.1.0
pip                  20.2.3
pyasn1               0.4.8
pyasn1-modules       0.2.8
requests             2.24.0
requests-oauthlib    1.3.0
rsa                  4.6
setuptools           47.1.0
six                  1.15.0
soupsieve            2.0.1
urllib3              1.25.10

Try running the command that will go into crontab directly.

bash


instance-7:10/02/20 11:57:40 ~/gce-cron-test $
instance-7:10/02/20 12:06:15 ~/gce-cron-test $ cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/requests-test2.py
2020/10/02 12:06:44 Finished scraping.
instance-7:10/02/20 12:06:46 ~/gce-cron-test $

It's a success.

Crontab settings (scraping PGM)

Then edit crontab.

bash


instance-7:10/02/20 12:06:57 ~/gce-cron-test $
instance-7:10/02/20 12:06:59 ~/gce-cron-test $ crontab -e
crontab: installing new crontab
instance-7:10/02/20 12:08:52 ~/gce-cron-test $
instance-7:10/02/20 12:08:54 ~/gce-cron-test $
instance-7:10/02/20 12:08:54 ~/gce-cron-test $ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
* * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/cron-test.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
*/3 * * * * cd /home/hogehoge_gmail_com/gce-cron-test; /home/hogehoge_gmail_com/.anyenv/envs/pyenv/shims/python /home/hogehoge_gmail_com/gce-cron-test/requests-test2.py >> /home/hogehoge_gmail_com/gce-cron-test/cron.log 2>&1
instance-7:10/02/20 12:11:42 ~/gce-cron-test $

It ran fine with the every-3-minutes setting! Automatic execution of Python scraping on GCE is a success!

bash


instance-7:10/02/20 12:16:38 ~/gce-cron-test $ cat cron.log
 2020/10/01 13:04:53 cron works!
 2020/10/01 13:07:02 cron works!
 2020/10/01 13:08:01 cron works!
 2020/10/01 13:09:01 cron works!
 2020/10/01 13:10:01 cron works!
 2020/10/02 12:09:21 Scraping has ended.
 2020/10/02 12:12:21 Scraping has ended.
 2020/10/02 12:15:21 Scraping has ended.
instance-7:10/02/20 12:16:48 ~/gce-cron-test $
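
The scraping entries in the log land on minutes 09, 12, and 15, which is what the `*/3` minute field should produce. As a rough illustration, here is a tiny Python sketch that expands only the simplest forms of a cron minute field. It is not a full cron parser, and the function name `expand_minute_field` is made up.

```python
def expand_minute_field(field: str) -> list:
    """Expand a cron minute field of the form '*', '*/N', or a single number."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    return [int(field)]


every_three = expand_minute_field("*/3")
print(every_three[:6])                   # [0, 3, 6, 9, 12, 15]
print({9, 12, 15} <= set(every_three))   # True
```

The timestamps in cron.log are written when the script prints its final message, so they trail the scheduled minute by however long the scraping takes (about 21 seconds here).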


Summary

- crontab works on GCE.
- Scraping runs even within the performance of a free tier GCE instance.
- A GCE VM instance does not have git in its initial state, so it must be installed before use.
- On a GCE VM instance, repository operations fail unless permissions (the Cloud API access scope) are set.
- Installing Python 3 on a free tier GCE instance may fail unless swap space is set up properly first.
