Beginners use Python for web scraping (4)-2: Scraping on Cloud Shell

This time, we add the scraping program to the Cloud Source Repositories repository created last time.

Roadmap for learning web scraping in Python

(1) Get scraping of the target data working locally.
(2) Link the locally scraped results to a Google spreadsheet.
(3) Run it automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(4)-1 Put a test program on the cloud and run it on Cloud Shell.
(4)-2 Add the scraping program to the repository and run it on Cloud Shell. ← This article
(4)-3 Create a Compute Engine VM instance and have it run the scraping automatically.
(5) Try free automatic execution on the cloud without a server (probably Cloud Functions + Cloud Scheduler).

Steps in this article

[1] Add the scraping program to the local repository
[2] Push to master of Cloud Source Repositories
[3] Pull from master into the clone on Cloud Shell
[4] Bulk installation of required modules using requirements.txt
[5] Run the scraping on Cloud Shell

[1] Add the scraping program to the local repository

Add the file to your local repository.

Mac zsh

11:28:14 [~] % cd gce-cron-test
11:28:25 [~/gce-cron-test] % ls -la
total 40
drwxr-xr-x   7 hoge  staff   224  9 26 11:27 .
drwxr-xr-x+ 45 hoge  staff  1440  9 23 16:45 ..
-rw-r--r--@  1 hoge  staff  6148  9 26 11:26 .DS_Store
drwxr-xr-x  13 hoge  staff   416  9 23 16:49 .git
-rw-r--r--   1 hoge  staff   146  9 21 15:29
-rw-r--r--@  1 hoge  staff  2352  9 16 17:54 my-web-hoge-app-hogehoge.json
-rw-r--r--   1 hoge  staff  2763  9 17 13:22

Make sure there are files that need to be committed, then add and commit.

Mac zsh

11:28:28 [~/gce-cron-test] % git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

nothing added to commit but untracked files present (use "git add" to track)
11:28:34 [~/gce-cron-test] % 
11:28:52 [~/gce-cron-test] % 
11:28:53 [~/gce-cron-test] % git add .
11:28:58 [~/gce-cron-test] % 
11:29:38 [~/gce-cron-test] % 
11:29:38 [~/gce-cron-test] % git commit -m "Add requests-test to Cloud Source Repositories" 
[master 44abc4d] Add requests-test to Cloud Source Repositories
 3 files changed, 73 insertions(+)
 create mode 100644 .DS_Store
 create mode 100644 my-web-hoge-app-hogehoge.json
 create mode 100644

[2] Push to master of Cloud Source Repositories

Do a push to master.

Mac zsh

11:30:13 [~/gce-cron-test] % 
11:30:23 [~/gce-cron-test] % 
11:30:23 [~/gce-cron-test] % git push origin master
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 3.48 KiB | 891.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0)
   938ea70..44abc4d  master -> master
11:31:37 [~/gce-cron-test] % 

[3] Pull from master into the clone on Cloud Shell

On Cloud Shell, pull into the repository that was cloned last time.


cloudshell:09/26/20 02:54:33 ~/gce-cron-test $ git pull origin master

Confirm that the files have been added to the repository on Cloud Shell. (requirements.txt had been left out at first, so I added it afterwards.)


cloudshell:09/26/20 02:55:06 ~/gce-cron-test $
cloudshell:09/26/20 02:55:06 ~/gce-cron-test $ ls -la
total 40
drwxr-xr-x  3 hoge hoge 4096 Sep 26 02:52 .
drwxr-xr-x 13 hoge rvm        4096 Sep 23 11:18 ..
-rw-r--r--  1 hoge hoge   80 Sep 23 11:09 cron.log
-rw-r--r--  1 hoge hoge  146 Sep 23 09:03
-rw-r--r--  1 hoge hoge 6148 Sep 26 02:47 .DS_Store
drwxr-xr-x  8 hoge hoge 4096 Sep 26 02:52 .git
-rw-r--r--  1 hoge hoge 2352 Sep 26 02:47 my-web-scraping-app-hogehoge.json
-rw-r--r--  1 hoge hoge 2763 Sep 26 02:47
-rw-r--r--  1 hoge hoge  334 Sep 26 02:52 requirements.txt

[4] Bulk installation of required modules using requirements.txt

Install the required modules in bulk using requirements.txt.


cloudshell:09/26/20 02:55:10 ~/gce-cron-test $ pip install -r requirements.txt

Check the installed packages with pip list. I generated requirements.txt locally on the Mac with "pip freeze > requirements.txt", so all the necessary modules are installed as expected.


cloudshell:09/26/20 02:55:41 ~/gce-cron-test $ pip list
Package              Version
-------------------- ---------
appdirs              1.4.4
beautifulsoup4       4.9.1
cachetools           4.1.1
certifi              2020.6.20
chardet              3.0.4
distlib              0.3.1
filelock             3.0.12
google-auth          1.21.0
google-auth-oauthlib 0.4.1
gspread              3.6.0
httplib2             0.18.1
idna                 2.10
oauth2client         4.1.3
oauthlib             3.1.0
pip                  20.1.1
pyasn1               0.4.8
pyasn1-modules       0.2.8
requests             2.24.0
requests-oauthlib    1.3.0
rsa                  4.6
setuptools           47.1.0
six                  1.15.0
soupsieve            2.0.1
urllib3              1.25.10
virtualenv           20.0.31
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/home/hoge/.pyenv/versions/3.8.5/bin/python3.8 -m pip install --upgrade pip' command.
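
Instead of reading the pip list output by eye, the comparison against requirements.txt can also be done in Python. The following is only a minimal sketch: `parse_pinned` and `find_mismatches` are hypothetical helper names, and it assumes every relevant line has the `name==version` form that pip freeze writes. It uses `importlib.metadata`, which is available from Python 3.8.

```python
from importlib import metadata


def parse_pinned(lines):
    """Parse 'name==version' lines (as written by pip freeze) into a dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blank lines, comments, and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

To use it, open requirements.txt, pass the file object to `parse_pinned`, and pass the result to `find_mismatches`. An empty dict means every pinned package is installed at the pinned version.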

[5] Run the scraping on Cloud Shell

Try running the scraping program.


cloudshell:09/26/20 02:55:49 ~/gce-cron-test $ python
Traceback (most recent call last):
  File "", line 40, in <module>
    sheet = get_gspread_book(secret_key, book_name).worksheet(sheet_name)
  File "", line 20, in get_gspread_book
    credentials = ServiceAccountCredentials.from_json_keyfile_name(secret_key, scope)
  File "/home/hoge/.pyenv/versions/3.8.5/lib/python3.8/site-packages/oauth2client/", line 219, in from_json_keyfile_name
    with open(filename, 'r') as file_obj:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hoge/git-repository/env2/my-web-hoge-app-hogehoge.json'

Oops, the file is not found. Unsurprisingly, the key file was still specified by its full local path on the Mac. Locally I edit with VS Code, but this time I fix it in the Cloud Shell code editor.


cloudshell:09/26/20 02:55:55 ~/gce-cron-test $ pwd
cloudshell:09/26/20 02:56:12 ~/gce-cron-test $ cloudshell open

The "cloudshell open" command opens the code editor, so fix the JSON path there.
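
One way to avoid editing the path on each machine is to build it relative to the script's own location. The following is a minimal sketch, not the code used in this series: `resolve_key_path` is a hypothetical helper, and it assumes the JSON key sits in the same directory as the script, as in the directory listing above.

```python
from pathlib import Path


def resolve_key_path(key_filename, base_dir=None):
    """Return the absolute path of a key file stored next to the script.

    resolve_key_path is a hypothetical helper, not part of the original program.
    base_dir defaults to the directory that contains this file.
    """
    if base_dir is None:
        base_dir = Path(__file__).resolve().parent
    key_path = Path(base_dir) / key_filename
    if not key_path.is_file():
        # Fail early with a clear message instead of deep inside oauth2client.
        raise FileNotFoundError(f"Service account key not found: {key_path}")
    return str(key_path)
```

The returned string can be passed as `secret_key` to `ServiceAccountCredentials.from_json_keyfile_name(secret_key, scope)`, so the same code should run unchanged on the Mac and on Cloud Shell.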

Run it again.


cloudshell:09/26/20 03:00:32 ~/gce-cron-test $
cloudshell:09/26/20 03:00:33 ~/gce-cron-test $ python
2020/09/26 03:01:15 Finished scraping.
cloudshell:09/26/20 03:01:18 ~/gce-cron-test $

The scraping completed successfully. For the full source, see "Beginners use Python for web scraping (2)".

The time on GCP is UTC by default, so timestamps are 9 hours behind Tokyo time.
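
If the log should show Tokyo time regardless of the server's time zone, the conversion can be done in the program itself. This is a minimal sketch under assumptions: `to_jst` and `now_jst_string` are hypothetical helpers, and the format string is only guessed from the "Finished scraping." log line above. A fixed +9 hour offset is used rather than the `zoneinfo` module, because `zoneinfo` requires Python 3.9 and the environment here is 3.8.5.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")


def to_jst(utc_dt):
    """Convert a datetime to Tokyo time; naive datetimes are assumed to be UTC."""
    if utc_dt.tzinfo is None:
        utc_dt = utc_dt.replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(JST)


def now_jst_string():
    """Return the current Tokyo time formatted like the log line above."""
    return datetime.now(timezone.utc).astimezone(JST).strftime("%Y/%m/%d %H:%M:%S")
```

For example, `to_jst(datetime(2020, 9, 26, 3, 1, 15))` gives 12:01:15 on the same day in JST, which matches the time at which the run above happened in Tokyo.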

Next time, I will create a VM on Google Compute Engine, confirm that the scraping works there, and try to run it automatically with cron.

