[PYTHON] Get the image of "Suzu Hirose" by Google image search.

Introduction

In this article, I scrape images of "Suzu Hirose" using Google image search. If you do image processing yourself, you will need image data at some point, and I hope this article is useful as one way to collect it.

Implementation

Getting images from Google image search requires scrolling the page so that more results load. Beautiful Soup alone cannot scroll a page, so I use Selenium for that part.

First, import everything we need.


from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
import requests
import base64
import os
import re
import shutil

You will need ChromeDriver to drive Chrome from Selenium. Download it from "ChromeDriver - WebDriver for Chrome".


#Now open google
driver = webdriver.Chrome("C:\\Users\\chromedriver")#Specify the path where the driver is located.
driver.get("https://www.google.com/")
sleep(2)

Next, locate the search bar. Use the developer tools (Inspect) of the Chrome window that Selenium opened to identify the element. I inspected the page in my regular Chrome installation instead and wrote the locator based on that, which caused an error. It took me about an hour to track down the cause.

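#find_element("name", ...) works in both Selenium 3 and 4; the older find_element_by_name helper is not available in recent Selenium 4 releases.
search_bar = driver.find_element("name", "q")
#Enter keywords in the search bar
search_bar.send_keys("Hirose Suzu")
search_bar.submit()
sleep(2)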

If it goes well, "Hirose Suzu" is typed into the search bar and the search runs.

Then move to the image list.


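#Move to image screen
#The class name "q qs" is what the page used at the time of writing; check it in the developer tools if the element is not found.
img_btn = driver.find_element("xpath", '//a[@class="q qs"]')
img_btn.click()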

This opens the image results page, which is where we will collect the images.

First, get the URLs of the images. I use BeautifulSoup to find each img tag and read the URL from it. Most image URLs are stored in the data-src attribute of the img tag, but some tags have no data-src, and in that case I read the URL from src instead.
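Before the full scrolling code, the data-src / src fallback can be tried on its own. The following is only a minimal sketch: it uses the standard library's HTMLParser instead of Beautiful Soup so that it runs without extra packages, and the `ImgUrlCollector` class and the sample HTML are hypothetical stand-ins for the real page source.

```python
from html.parser import HTMLParser


class ImgUrlCollector(HTMLParser):
    """Collect one URL per <img> tag, preferring data-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Same idea as in the article: try data-src first, then src.
        url = attr_map.get("data-src") or attr_map.get("src")
        if url:
            self.urls.append(url)


# Hypothetical HTML standing in for driver.page_source.
sample_html = """
<img data-src="https://example.com/a.jpg" src="data:image/gif;base64,AAAA">
<img src="https://example.com/b.jpg">
<img alt="no source at all">
"""

collector = ImgUrlCollector()
collector.feed(sample_html)
print(collector.urls)
```

Note that `or` also skips an empty data-src, which is slightly different from the `is None` check used in the article's code.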

#Scroll the screen and collect image URLs.
try:
    #Image URLs are collected here; the list may contain duplicates.
    all_images = []
    #Scroll 5 times
    for i in range(5):
        #Scroll to the bottom of the page.
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        #Wait for the newly revealed images to load before parsing.
        sleep(2)
        #Load the current page source into Beautiful Soup.
        soup = BeautifulSoup(driver.page_source , "html.parser")

        #Append each image URL to all_images
        for image in soup.find_all("img"):
            try:
                #Prefer data-src; fall back to src when it is missing.
                url = image.get("data-src")

                if url is None:
                    url = image.get("src")

                if url is not None:
                    all_images.append(url)
            except Exception:
                print("An error occurred when getting the image URL.")
                print()

except Exception:
    print("An error occurred while scrolling the screen.")
    error_flag = True
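The code above always scrolls exactly 5 times. As a possible variation, the loop can stop early once the page height no longer grows. This is just a sketch under assumptions: `scroll_until_stable` is a hypothetical helper name, and it only relies on the `execute_script` method that the Selenium driver already provides.

```python
from time import sleep


def scroll_until_stable(driver, max_scrolls=5, pause=2.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls actually performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(pause)  # give lazily loaded images time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop scrolling
        last_height = new_height
    return scrolls
```

With a real driver this would be called as `scroll_until_stable(driver)` after moving to the image results, and the page source would then be parsed once at the end.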

As the comment in the scrolling code notes, the image URLs are stored in all_images, but the same URL appears more than once because the whole page is parsed again after every scroll. So we remove the duplicates to make each URL unique.

all_images = list(dict.fromkeys(all_images))
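To see why dict.fromkeys is used here rather than set, a small example with hypothetical URLs may help: dictionary keys keep their insertion order, so the first occurrence of each URL stays in its original position.

```python
# Hypothetical URLs; the first one appears twice.
urls = [
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/a.jpg",
    "https://example.com/c.jpg",
]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
```

A set would also remove the duplicates, but it does not guarantee the order, so the numbering of the saved files could change from run to run.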

The collected URLs come in two forms: ordinary http(s) URLs, and base64-encoded data that embeds the image itself in the attribute. So two download patterns are needed: (1) download over HTTP, and (2) decode from base64. I created one function for each pattern.

#Save the image that an http(s) URL points to.
def img_url_download(url , file_path):
    response = requests.get(url , stream = True)

    #Write the response body to the file.
    with open(file_path , 'wb') as file:
        shutil.copyfileobj(response.raw , file)


#Function to save base64 data
#Pass in the string with the "data:image/jpeg;base64," prefix already removed.
def base64_download(url , file_path):
    img = base64.b64decode(url.encode())
    with open(file_path , "wb") as f:
        f.write(img)
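The base64 function above expects the "data:image/jpeg;base64," prefix to be removed by the caller. As a sketch of a slightly more general approach, the prefix can be split off at the first comma, which also reveals the image type. `decode_data_uri` is a hypothetical helper, not part of the original code, and the payload below is made-up data rather than a real image.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    # Everything between "data:" and ";base64" is treated as the MIME type.
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Build a small hypothetical data URI so the sketch can be run as-is.
payload = base64.b64encode(b"fake image bytes").decode("ascii")
mime, raw = decode_data_uri("data:image/png;base64," + payload)
print(mime, raw)
```

The returned bytes can be written with `open(file_path, "wb")` in the same way as in base64_download.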

After defining the functions, the last step is to save the images to a folder.

#Save the image data to files.

#File path
path = r"C:\Users\suzu_img_files"#Specify the path of an existing folder to save the images in.

#base64 data starts with the prefix "data:image/jpeg;base64,", so remove it before decoding.
base64_string = "data:image/jpeg;base64,"

for index , image_url in enumerate(all_images):
    filename = "suzu_" + str(index) + ".jpg"
    file_path = os.path.join(path , filename)

    #Branch depending on whether the URL is base64 data or not.
    if image_url.startswith(base64_string):
        url = image_url.replace(base64_string , "")#Strip the prefix so only the base64 payload remains.
        base64_download(url , file_path)

    else:
        img_url_download(image_url , file_path)

If all goes well, the images are saved in the folder you specified.

Summary

How was that? Using Selenium widens the range of pages you can scrape. This time the subject was Suzu Hirose, but you can do the same with any people, animals, buildings, and so on that you like. Also, this time I drove Selenium all the way from Google's search page, but if you just want the images, it is faster to start directly from the URL of Suzu Hirose's image results.
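Starting directly from the image results could look something like the sketch below. It assumes that Google's image search is reachable through the `tbm=isch` query parameter; that is my assumption about the URL format, not something taken from the code above, and it may change. `image_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def image_search_url(keyword):
    """Build a Google image search URL for the given keyword."""
    # "tbm=isch" is assumed to select image search; check it in your browser.
    query = urlencode({"q": keyword, "tbm": "isch"})
    return "https://www.google.com/search?" + query


print(image_search_url("Hirose Suzu"))
```

Passing this URL to `driver.get(...)` would replace the steps that type into the search bar and click through to the image screen.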


Reference materials
[Introduction to Python] Scraping images of Kanna Hashimoto. Examples of what Python can do: Download images. Exercises after Progate | Data analysis in Python
Beautiful Soup
ChromeDriver - WebDriver for Chrome
Python-based web scraping (BeautifulSoup, Selenium, Requests)
