Beginners use Python for web scraping (4)-2: Scraping on Cloud Shell

This time, we will add the scraping program to the Cloud Source Repositories repository we created last time.

Roadmap for learning web scraping in Python

(1) Succeed in scraping the target locally for the time being.
(2) Send the local scraping results to a Google Spreadsheet.
(3) Run the scraping automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(4)-1 Put a test program on the cloud and run it on Cloud Shell.
(4)-2 Add the scraping program to the repository and run it on Cloud Shell. ← Now here
(4)-3 Create a Compute Engine VM instance and have it run the scraping automatically.
(5) Try free serverless automatic execution in the cloud. (Maybe Cloud Functions + Cloud Scheduler)

This procedure

[1] Add the scraping program to the local repository
[2] Push to master of Cloud Source Repositories
[3] Pull from master into the clone on Cloud Shell
[4] Install the required modules in bulk using requirements.txt
[5] Run the scraping on Cloud Shell

[1] Add the scraping program to the local repository

Add the file to your local repository.

Mac zsh


11:28:14 [~] % cd gce-cron-test
11:28:25 [~/gce-cron-test] % ls -la
total 40
drwxr-xr-x   7 hoge  staff   224  9 26 11:27 .
drwxr-xr-x+ 45 hoge  staff  1440  9 23 16:45 ..
-rw-r--r--@  1 hoge  staff  6148  9 26 11:26 .DS_Store
drwxr-xr-x  13 hoge  staff   416  9 23 16:49 .git
-rw-r--r--   1 hoge  staff   146  9 21 15:29 cron-test.py
-rw-r--r--@  1 hoge  staff  2352  9 16 17:54 my-web-hoge-app-hogehoge.json
-rw-r--r--   1 hoge  staff  2763  9 17 13:22 requests-test2.py

Confirm that there are files that need to be committed, then add and commit them.

Mac zsh


11:28:28 [~/gce-cron-test] % git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .DS_Store
        my-web-hoge-app-hogehoge.json
        requests-test2.py

nothing added to commit but untracked files present (use "git add" to track)
11:28:34 [~/gce-cron-test] % 
11:28:52 [~/gce-cron-test] % 
11:28:53 [~/gce-cron-test] % git add .
11:28:58 [~/gce-cron-test] % 
11:29:38 [~/gce-cron-test] % 
11:29:38 [~/gce-cron-test] % git commit -m "Add requests-test to Cloud Source Repositories" 
[master 44abc4d] Add requests-test to Cloud Source Repositories
 3 files changed, 73 insertions(+)
 create mode 100644 .DS_Store
 create mode 100644 my-web-hoge-app-hogehoge.json
 create mode 100644 requests-test2.py

[2] Push to master of Cloud Source Repositories

Push to master.

Mac zsh


11:30:13 [~/gce-cron-test] % 
11:30:23 [~/gce-cron-test] % 
11:30:23 [~/gce-cron-test] % git push origin master
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 3.48 KiB | 891.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0)
To https://source.developers.google.com/p/my-gce-app/r/gce-cron-test
   938ea70..44abc4d  master -> master
11:31:37 [~/gce-cron-test] % 

[3] Pull from master into the clone on Cloud Shell

Pull into the repository you cloned on Cloud Shell last time.

cloudshell


cloudshell:09/26/20 02:54:33 ~/gce-cron-test $ git pull origin master

Confirm that the files have been added to the Cloud Shell repository. (requirements.txt, which I had initially left out, was added afterwards.)

cloudshell


cloudshell:09/26/20 02:55:06 ~/gce-cron-test $
cloudshell:09/26/20 02:55:06 ~/gce-cron-test $ ls -la
total 40
drwxr-xr-x  3 hoge hoge 4096 Sep 26 02:52 .
drwxr-xr-x 13 hoge rvm        4096 Sep 23 11:18 ..
-rw-r--r--  1 hoge hoge   80 Sep 23 11:09 cron.log
-rw-r--r--  1 hoge hoge  146 Sep 23 09:03 cron-test.py
-rw-r--r--  1 hoge hoge 6148 Sep 26 02:47 .DS_Store
drwxr-xr-x  8 hoge hoge 4096 Sep 26 02:52 .git
-rw-r--r--  1 hoge hoge 2352 Sep 26 02:47 my-web-scraping-app-hogehoge.json
-rw-r--r--  1 hoge hoge 2763 Sep 26 02:47 requests-test2.py
-rw-r--r--  1 hoge hoge  334 Sep 26 02:52 requirements.txt

[4] Install the required modules in bulk using requirements.txt

Install the required modules in bulk using requirements.txt.

cloudshell


cloudshell:09/26/20 02:55:10 ~/gce-cron-test $ pip install -r requirements.txt

Check the installed packages with pip list. I generated requirements.txt locally on the Mac with "pip freeze > requirements.txt", so all the necessary modules are of course there.

cloudshell


cloudshell:09/26/20 02:55:41 ~/gce-cron-test $ pip list
Package              Version
-------------------- ---------
appdirs              1.4.4
beautifulsoup4       4.9.1
cachetools           4.1.1
certifi              2020.6.20
chardet              3.0.4
distlib              0.3.1
filelock             3.0.12
google-auth          1.21.0
google-auth-oauthlib 0.4.1
gspread              3.6.0
httplib2             0.18.1
idna                 2.10
oauth2client         4.1.3
oauthlib             3.1.0
pip                  20.1.1
pyasn1               0.4.8
pyasn1-modules       0.2.8
requests             2.24.0
requests-oauthlib    1.3.0
rsa                  4.6
setuptools           47.1.0
six                  1.15.0
soupsieve            2.0.1
urllib3              1.25.10
virtualenv           20.0.31
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/home/hoge/.pyenv/versions/3.8.5/bin/python3.8 -m pip install --upgrade pip' command.

[5] Run the scraping on Cloud Shell

Try running the scraping program "requests-test2.py".

cloudshell


cloudshell:09/26/20 02:55:49 ~/gce-cron-test $ python requests-test2.py
Traceback (most recent call last):
  File "requests-test2.py", line 40, in <module>
    sheet = get_gspread_book(secret_key, book_name).worksheet(sheet_name)
  File "requests-test2.py", line 20, in get_gspread_book
    credentials = ServiceAccountCredentials.from_json_keyfile_name(secret_key, scope)
  File "/home/hoge/.pyenv/versions/3.8.5/lib/python3.8/site-packages/oauth2client/service_account.py", line 219, in from_json_keyfile_name
    with open(filename, 'r') as file_obj:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hoge/git-repository/env2/my-web-hoge-app-hogehoge.json'

Oops: the file is not found. Unsurprisingly, the full local Mac path was still hard-coded in the script. Locally I edit in VS Code, but here I will fix it in the Cloud Shell code editor.
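The actual edit is only shown in the screenshot below, but one portable way to fix this kind of error (a sketch; the key filename is taken from the directory listing above) is to resolve the key file relative to the script itself instead of hard-coding an absolute path, so the same code runs on the Mac, on Cloud Shell, and later on a VM:

```python
from pathlib import Path

# Directory containing this script, regardless of where it is run from.
SCRIPT_DIR = Path(__file__).resolve().parent

# Service-account key kept next to the script (filename from the listing above).
secret_key = str(SCRIPT_DIR / "my-web-hoge-app-hogehoge.json")
```

This removes the machine-specific `/Users/hoge/...` prefix entirely; only the key file's location relative to the script matters.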

cloudshell


cloudshell:09/26/20 02:55:55 ~/gce-cron-test $ pwd
/home/hoge/gce-cron-test
cloudshell:09/26/20 02:56:12 ~/gce-cron-test $ cloudshell open requests-test2.py

The "cloudshell open" command brings up the code editor, so modify the JSON path there. (Screenshot: 2020-09-26 12.00.21.png)

Run it again.

cloudshell


cloudshell:09/26/20 03:00:32 ~/gce-cron-test $
cloudshell:09/26/20 03:00:33 ~/gce-cron-test $ python requests-test2.py
2020/09/26 03:01:15 Finished scraping.
cloudshell:09/26/20 03:01:18 ~/gce-cron-test $

The scraping ran successfully. For the full source, see "Beginners use Python for web scraping (2)". Note that time on GCP defaults to UTC, which is 9 hours behind Tokyo time. (Screenshot: 2020-09-26 19.15.39.png)

Next time, I will create a VM on Google Compute Engine, confirm that the scraping works there, and try running it automatically with cron.
