[Python] Save the result of web scraping the Mercari product page on Google Colab to Google Sheets and display the product image as well.

This article is based on the content of the following video previously published on YouTube.

[Python learning] Save the result of web scraping the Mercari product page to Google Sheets and display the product image

The content covered in this article is for learning purposes only. In some cases, scraping may lead to a load on the other server, so please do it within the bounds of common sense. </ font>

Python code to run on Google Colab

First of all, I will paste the actual code written in the video below.

Install gspread

!pip install gspread

I had a habit of explicitly typing gspread to install it, but I didn't need it ... Even if I installed it this way, it didn't make sense because it was the same as the version originally installed in Colab.

For more information on this, please refer to the following article posted earlier. https://qiita.com/safa/items/bfa52430f920ac562bec#gspread%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB

A spell to make Google Sheets accessible from Colab

from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread

auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())

Code for accessing Google Sheets on Google Colab. In the middle of myself, it has become not the same.

If you execute this code, the URL will be displayed. If you follow the instructions on the screen, the authentication itself will be completed without any problems. (It is easier if you log in with your own Google account in advance)

Code for saving the result of web scraping the Mercari product page in a Google spreadsheet and displaying the product image

Here is the actual Python code. Please enter the Google Spreadsheet URL that you want to save to.

from bs4 import BeautifulSoup
import requests

workbook_url = "URL of Google Spreadsheet to save to"
workbook = gc.open_by_url(workbook_url)

mercari_url = "https://www.mercari.com"
fetch_path = "/jp/category/967/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4222.0 Safari/537.36'}

# fetch_url: https://www.mercari.com/jp/category/967/
fetch_url = mercari_url + fetch_path

print("Acquisition URL: " + fetch_url)

r = requests.get(fetch_url, headers=headers)

soup = BeautifulSoup(r.text, "lxml")

title = soup.find('title').get_text()

worksheet = workbook.add_worksheet(title=title, rows=100, cols=4)

item_list = soup.find_all("li", class_="sc-bwzfXH")

result_list = []

worksheet.append_row(["ID", "title", "price", "Detailed URL", "image"])

for i, item in enumerate(item_list):
  item_title = item.find("span").get_text()
  item_price = item.find("div", class_="style_thumbnail__N_xAi").get_text()
  item_url = mercari_url + item.find("a")["href"]
  image = "=IMAGE(\"" + item.find("img")["src"] + "\")"
  worksheet.append_row([i, item_title, item_price, item_url, image], value_input_option="USER_ENTERED")

print("Done!")

If the execution is successful, the data will be written with the image displayed in the Google spreadsheet as shown below.

実際にWebスクレイピングを実行した結果のGoogleスプレッドシート

Supplementary explanation of the code

I will give some supplementary explanations about the code I wrote.

Granting User-Agent

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4222.0 Safari/537.36'}

・
・
・

r = requests.get(fetch_url, headers=headers)

The Mercari page seems to return 403 for access without ʻUser-Agent. In requests, it is possible to set ʻUser-Agent as an option in the second argument of get, so access is performed after setting this. The content of ʻUser-Agent is that of Google Chrome Canarywhen creating the video. The value displayed in theNetwork tab of Devtools` is used as it is.

About lxml

soup = BeautifulSoup(r.text, "lxml")

Since lxml is installed from the beginning on Google Colab, writing such code suddenly does not cause an error. If you want to run the same code locally, you also need to have pip installed on lxml.

About the title of the sheet

title = soup.find('title').get_text()

worksheet = workbook.add_worksheet(title=title, rows=100, cols=4)

When I execute the Python script, I try to create a sheet with the title of the acquired page as the sheet name. In the ʻadd_worksheet` function, the height and width of the sheet to be created are specified, but it seems that they will be added without permission even if they are not specified. (For the time being, it is specified in 100 rows and 4 columns)

Display the image using the IMAGE function of Google Spreadsheet

  image = "=IMAGE(\"" + item.find("img")["src"] + "\")"
  worksheet.append_row([i, item_title, item_price, item_url, image], value_input_option="USER_ENTERED")

To display an image in Google Sheets, use the IMAGE function provided by Google Sheets.

IMAGE --Document Editor Help

Looking at the docs, it seemed that some options were offered, but I didn't use them this time. I really wanted to resize the cell to fit the image, but unfortunately that feature wasn't provided, so I'm just passing the image URL.

There is no mode to resize cells to fit the image.

Quoted from the help page above (I wanted to use this feature ...)

Even if you pass the image URL as it is, the append_row function of gspread will be '= IMAGE and will be interpreted as a character string. Therefore, value_input_option =" USER_ENTERED " is specified as an option.

By specifying this, the behavior will be the same as when the user types the character on the screen on Google Sheets. In this case, when the user types in the value = IMAGE ("...") as it is, it will be interpreted in the same way, so the spreadsheet will display the image as intended.

■ Reference document API Reference - gspread(append_row) ValueInputOption | Sheet API

Finally (promotion)

That's it.

I occasionally upload Python related videos to YouTube. Please take a look if you like!

Learn Python with Safa's Colab

Recommended Posts

[Python] Save the result of web scraping the Mercari product page on Google Colab to Google Sheets and display the product image as well.
Save images on the web to Drive with Python (Colab)
Execute the command on the web server and display the result
Save the result of the life game as a gif with python
Display the image of the camera connected to the personal computer on the GUI.
Convert the result of python optparse to dict and utilize it
The story of making a tool to load an image with Python ⇒ save it as another name
[Python] How to specify the window display position and size of matplotlib
[Android] Display images on the web in the info Window of Google Map
Write the result of keyword search with ebaysdk to Google Spread Sheets
[python] Send the image captured from the webcam to the server and save it
GAE --With Python, rotate the image based on the rotation information of EXIF and upload it to Cloud Storage.
Get the size of the image file on the web (Python3, no additional library required)
Let's play with Python Receive and save / display the text of the input form
[Python] How to save images on the Web at once with Beautiful Soup
[Latest] How to use Python library to save Google image search & use Chrome Driver on ubuntu
Save screenshot of [Python] [Windows] screen as an image
Python OpenCV tried to display the image in text.
Call the API of Hatena Blog from Python and save your blog articles individually on your PC
Run the Caffe model on Google Colaboratory to predict the age and gender of the world's supermodels
POST the image selected on the website with multipart / form-data and save it to Amazon S3! !!
[Super easy! ] How to display the contents of dictionaries and lists including Japanese in Python
Return the image data with Flask of Python and draw it to the canvas element of HTML