[PYTHON] Google Image Download did not work, so here is how I dealt with it

Premise

The command samples were run in a Windows + Anaconda environment. Some parts of the paths, such as account names, are masked, so please adjust them to your own environment.

Conclusion

To state the conclusion first: the official tool's scraping no longer works because the format of Google Image Search results changed. Someone has submitted a patch PR to the official GitHub repository, so if you download the patched source and run it directly, the tool works properly.

Google Image Download doesn't work!

I want to build image recognition with TensorFlow and similar tools, so I set out to collect training images. Referring to pages such as https://qiita.com/too-ai/items/4fad0239b8b3c465fe6d and https://qiita.com/Ikko_Kojima/items/4d943c60ff5e886a0544, I tried to collect images using Google Image Download.

(base) PS C:\Users\*\Downloads\img> googleimagesdownload -k cat

Item no.: 1 --> Item name = cat
Evaluating...
Starting Download...


Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0


Everything downloaded!
Total errors: 0
Total time taken: 0.609447717666626 Seconds

As shown, no error was reported and the directory was created (downloads\KEYWORD under the directory where the command was run), but no files were saved in it.
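A quick way to confirm this symptom from Python is to count the files under the keyword directory. The following is only a minimal sketch: the helper name count_downloaded is hypothetical and not part of the tool, and it assumes the downloads\KEYWORD layout described above.

```python
from pathlib import Path


def count_downloaded(keyword, base_dir="downloads"):
    """Return the number of regular files under base_dir/keyword.

    Hypothetical helper for illustration; not part of google_images_download.
    Returns 0 when the directory does not exist.
    """
    target = Path(base_dir) / keyword
    if not target.is_dir():
        return 0
    # Count only regular files; subdirectories are ignored.
    return sum(1 for p in target.iterdir() if p.is_file())


if __name__ == "__main__":
    # A count of 0 after a run that reported no errors matches the symptom above.
    print(count_downloaded("cat"))
```

Run it from the same directory where googleimagesdownload was executed.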

I didn't know the cause, so I decided to look at the source. Instead of installing the package with

pip install google_images_download

I fetched the source with

git clone https://github.com/hardikvasa/google-images-download.git

and took a look at the contents.

I'm new to Python, but fortunately this tool consists of essentially a single script, so by searching the source for the messages printed on screen I was able to pinpoint the problem right away.

# Lines 714-724
    def _get_next_item(self,s):
        start_line = s.find('rg_meta notranslate')
        if start_line == -1:  # If no links are found then give an error!
            end_quote = 0
            link = "no_links"
            return link, end_quote
        else:
            start_line = s.find('class="rg_meta notranslate">')
            start_object = s.find('{', start_line + 1)
            end_object = s.find('</div>', start_object + 1)
            object_raw = str(s[start_object:end_object])

However, the markers this code searches for, rg_meta notranslate and the { that follows it, no longer appear in the current search results, so the image data cannot be extracted.
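To see why the run can end with zero images and zero errors, it helps to reproduce the marker check in isolation. The sketch below is a simplified stand-in for the first branch of _get_next_item, not the library's code; the function name has_result_marker and both HTML strings are made up for illustration.

```python
MARKER = "rg_meta notranslate"  # marker string the original parser searches for


def has_result_marker(html):
    """Return True if the old-style result marker appears in the page text.

    Simplified illustration of the check in _get_next_item: str.find
    returns -1 when the substring is absent, and the original code then
    returns "no_links" instead of raising an error.
    """
    return html.find(MARKER) != -1


# Made-up page fragments, for illustration only.
old_style = '<div class="rg_meta notranslate">{"example": "value"}</div>'
new_style = '<div class="some-other-class">no embedded metadata here</div>'

if __name__ == "__main__":
    print(has_result_marker(old_style))  # True
    print(has_result_marker(new_style))  # False
```

Since a missing marker is treated as "no more links" rather than as a failure, this would explain why the tool reported 0 downloads together with 0 errors.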

At first I thought I could fix it myself by adjusting these search strings, but I did not know in what form the extracted text had to be handed to the later processing. When I checked the official repository again, I found that someone had already submitted a pull request with a patch.

Patch

https://github.com/Joeclinton1/google-images-download/tree/patch-1

From "Clone or download" on this page, either copy the URL and git clone it or download the ZIP, then run the tool from that source. I used git clone.

C:\Users\*\Downloads\img>git clone https://github.com/Joeclinton1/google-images-download.git gid-joeclinton
Cloning into 'gid-joeclinton'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 621 (delta 7), reused 0 (delta 0), pack-reused 608
Receiving objects: 100% (621/621), 272.98 KiB | 718.00 KiB/s, done.
Resolving deltas: 100% (358/358), done.

Because the upstream source was already in the same folder, I cloned this source into a separate folder named gid-joeclinton. When running the tool, specify the script in that folder:

(base) PS C:\Users\*\Downloads\img> python .\gid-joeclinton\google_images_download\google_images_download.py -k cat

Item no.: 1 --> Item name = cat
Evaluating...
Starting Download...
Completed Image ====> 1.Layer-1704-1920x840.jpg
Invalid or missing image format. Skipping...
Completed Image ====> 2.An_up-close_picture_of_a_curious_male_domestic_shorthair_tabby_cat.jpg
Completed Image ====> 3.Thinking-of-getting-a-cat.png
Completed Image ====> 4.cat-10-e1573844975155.jpg
 (Omitted)
Completed Image ====> 72.15276403_web1_190123-VNE-CatLeash.jpg
Completed Image ====> 73.CatsHaveFacialExpressionsButHardToRead_600.jpg
Completed Image ====> 74.Banner3.jpg


Unfortunately all 100 could not be downloaded because some images were not downloadable. 74 is all we got for this search filter!

Errors: 26

Everything downloaded!
Total errors: 26
Total time taken: 120.21685576438904 Seconds

Those who are used to this would think of it immediately, but I had little experience downloading a patched fork and running it directly, so it did not occur to me right away. I wrote this down as a note to myself, and I hope it helps someone.

Postscript

As noted in the discussion on the PR, this patch apparently cannot fetch more than 100 images, even with the -l option. https://github.com/hardikvasa/google-images-download/pull/298 Specifying a number no greater than 100 works fine, but specifying a number greater than 100 causes an error and the download fails.
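Given that limitation, one option is to cap the requested count at 100 before invoking the patched script. The sketch below only builds the argument list and does not run anything. build_command and MAX_LIMIT are hypothetical names introduced for illustration, and the script path is the one from my environment above. The -k and -l options are the ones used in this article.

```python
import sys

MAX_LIMIT = 100  # assumed cap, based on the limitation reported in the PR discussion


def build_command(script_path, keyword, limit):
    """Build an argument list for the patched script, capping limit at MAX_LIMIT.

    Hypothetical helper for illustration; it does not execute the command.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    capped = min(limit, MAX_LIMIT)
    return [sys.executable, script_path, "-k", keyword, "-l", str(capped)]


if __name__ == "__main__":
    cmd = build_command(
        r".\gid-joeclinton\google_images_download\google_images_download.py",
        "cat",
        250,  # will be capped to 100
    )
    print(cmd)
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when the keyword contains spaces.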
