[Python] [Selenium] If you can't scrape Google Image Search directly, crawl it and collect the thumbnails instead.

Introduction

Scraping is what you rely on when you need a large amount of image data, for example for machine learning. There are already many articles that collect images from sites such as Google, Yahoo, and Bing; this time I will write about Google Image Search.

Many programs that collect images from Google Image Search have been published, but few of them still work properly, perhaps because of frequent specification changes and anti-scraping measures. As you can see by actually fetching the search results with requests and Beautiful Soup, the returned HTML contains only about 20 images.
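
You can verify this in a few lines (a minimal sketch; the query and the User-Agent header here are just illustrative examples):

from bs4 import BeautifulSoup
import requests

# Fetch the static HTML of an image search result page (no JavaScript is executed)
html = requests.get(
    "https://www.google.com/search?q=cat&tbm=isch",
    headers={"User-Agent": "Mozilla/5.0"},
).text
soup = BeautifulSoup(html, "html.parser")

# Count the <img> tags that actually carry a src attribute
print(len([img for img in soup.find_all("img") if img.get("src")]))  # roughly 20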

If you can't scrape it directly, why not go through an actual browser and crawl it instead?

With that said, here is the main subject: Selenium. When you actually **display** the search result page with Selenium, the HTML is naturally rewritten to include much more image data matching what is shown on screen (even if that puts cause and effect backwards). Once the page is in this state, scraping it again yields many more images. However, Google is quite tough, and only thumbnail images can be obtained this way. That is fair enough: if hundreds of original-size images were embedded in a single page, it would be a huge amount of data, Google or not.
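
You can see this state for yourself: once the page is rendered, the src attribute of most thumbnails is a base64 data URI rather than a normal URL. A quick check (this assumes the driver and the rg_i class name from the full script below):

# Run after driver.get(url) and scrolling, as in the full script below
thumbs = driver.find_elements(By.XPATH, '//img[@class="rg_i Q4LuWd"]')
print(len(thumbs))  # Far more than the ~20 found in the static HTML
print(thumbs[0].get_attribute("src")[:30])  # Usually "data:image/jpeg;base64,..."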

From this state it is possible to click each image and save it at its original size one by one (a rough sketch of that approach appears after the main script), but if you do not need large images in the first place, collecting the thumbnails as they are is good enough. Assuming Selenium is already set up, for example that you have downloaded ChromeDriver, let's go.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import os
import time
import requests
import shutil
import base64

options = Options()
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
options.add_argument('--proxy-server="direct://"')
options.add_argument("--proxy-bypass-list=*")
options.add_argument("--start-maximized")
# In headless mode the page is not actually "displayed", so only around 100 items load.

DRIVER_PATH = "chromedriver.exe"  # Location of chromedriver
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

query = input('Search word? :')
url = ("https://www.google.com/search?hl=jp&q=" + "+".join(query.split()) + "&btnG=Google+Search&tbs=0&safe=off&tbm=isch")
driver.get(url)

# Scroll down to load more thumbnails --
for t in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1.5)
try:
    # Press the "Show more results" button if it appears
    driver.find_element(By.CLASS_NAME, "mye4qd").click()
except Exception:
    pass
for t in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1.5)

srcs = driver.find_elements(By.XPATH, '//img[@class="rg_i Q4LuWd"]')
os.makedirs(query, exist_ok=True)  # Save destination: a folder named after the search term
# --
#--

i = 0  # Counter for numbering the file names

print("Downloading...")
for j, elem in enumerate(srcs):
    # Crude progress bar, updated every 50 images
    if j % 50 == 0 or j == len(srcs) - 1:
        done = 20 * j // max(len(srcs) - 1, 1)
        print("|" + "■" * done + " -" * (20 - done) + "|", f"{100 * j // max(len(srcs) - 1, 1)}%")
    file_name = f"{query}/{'_'.join(query.split())}_{str(i).zfill(3)}.jpg"  # Save path and file name
    src = elem.get_attribute("src")
    if src is not None:
        # Convert src to an image file --
        if "base64," in src:
            # Thumbnails embedded as data URIs can be decoded directly
            with open(file_name, "wb") as f:
                f.write(base64.b64decode(src.split(",")[1]))
        else:
            # Otherwise src is a normal URL, so download it over HTTP
            res = requests.get(src, stream=True)
            with open(file_name, "wb") as f:
                shutil.copyfileobj(res.raw, f)
        # --
        i += 1

driver.quit()  # Close the browser
print(f"Download complete. {i} images were downloaded.")

Result

After running the script, enter a search term and wait a while; about 400 to 1,000 images should be saved in the folder named after the search term.

Summary

Since hundreds to thousands of images can be collected in about two minutes, this should be useful for building machine learning datasets (although I have never created one myself). I have only just started studying scraping and crawling, so I would appreciate any feedback on improvements or mistakes.

Referenced sites, etc.

https://tanuhack.com/selenium/
