[PYTHON] Detect folders containing the same image with ImageHash

Purpose

When the sites being scraped build their articles from the same source, the same images can be downloaded again under a different title. I want to detect and delete image folders that have a different title but exactly the same contents.

Environment

Windows 10, Anaconda, Python 3.6.1, Jupyter Notebook

Reference url

Using Python's similar image library ImageHash on Windows

What is Image Hash?

An image hash is a digest computed from the content of an image that deliberately ignores the image's size and subtle differences: images that look the same get the same digest, and similar images get similar digests. ImageHash is a Python library for computing such hashes, so it can judge similarity regardless of an image's file extension or size.
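To make the idea concrete, here is a minimal sketch of a simplified average hash written in plain Python. It is an illustration only and is not how ImageHash implements `phash`; the function names `shrink`, `average_hash` and `hamming` are made up for this example, and the "images" are small grids of grayscale values rather than real files. In ImageHash, subtracting one hash from another gives the same kind of bit-difference count that `hamming` returns here.

```python
# Simplified average hash on a 2D grid of grayscale values (0-255).
# Illustration only: this is NOT ImageHash's phash implementation.

def shrink(pixels, size=8):
    """Downscale a grid to size x size by averaging equal blocks.

    Assumes the grid's height and width are at least `size`.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for r in range(size):
        top, bottom = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            left, right = c * width // size, (c + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels, size=8):
    """Return size*size bits: 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    return [1 if v > mean else 0 for v in cells]


def hamming(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# A 16x16 horizontal gradient, the same picture at 32x32,
# and a different picture (a vertical gradient).
small = [[x * 16 for x in range(16)] for _ in range(16)]
large = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
other = [[y * 16 for _ in range(16)] for y in range(16)]

print(hamming(average_hash(small), average_hash(large)))  # 0: same picture, different size
print(hamming(average_hash(small), average_hash(other)))  # 32: half of the 64 bits differ
```

The resized copy produces an identical hash, while the unrelated picture differs in half of its bits, which is why a small distance threshold can separate "same image" from "different image".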

Module installation

With Anaconda, you only need to install ImageHash; the other packages are already included.


pip install numpy
pip install scipy
pip install Pillow
pip install PyWavelets
pip install ImageHash

Code

compimages.py


from PIL import Image, ImageFile
import imagehash
import os
#Let Pillow load truncated (incomplete) image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

#Return the difference between the perceptual hashes of two images
#(despite the name, this uses phash, not dhash)
def d_hash(img,otherimg):
    img_hash = imagehash.phash(Image.open(img))
    other_hash = imagehash.phash(Image.open(otherimg))
    return img_hash-other_hash
#Return 0 if the first image has fewer pixels than the second, otherwise 1
def minhash(img,otherimg):
    width,height = Image.open(img).size
    other_width,other_height = Image.open(otherimg).size
    if width*height<other_width*other_height: return 0
    else: return 1
    
#Specify working folder
directory_dir = r'C:\Users\hogehoge\images'
#Image extensions to compare
image_ext = ('.jpg','.png')

#Return the paths of the image files directly inside a folder
def list_images(folder):
    return [os.path.join(folder,name) for name in os.listdir(folder) if name.lower().endswith(image_ext)]

#Get folder list and folder path, keeping only folders with more than two images
#so that folder_list, folder_dir and img_dir all share the same indices
folder_list = [i for i in os.listdir(directory_dir) if os.path.isdir(os.path.join(directory_dir,i)) and len(list_images(os.path.join(directory_dir,i))) >2 ]
folder_dir = [os.path.join(directory_dir,i) for i in folder_list]

#Create the image path list for each folder
img_dir = [list_images(i) for i in folder_dir]



i = 0
length = len(img_dir)
delete_file = []

#Compare images folder by folder with d_hash() and minhash()
while i < length:
    #progress
    print('i = {}/{}'.format(i,length))
    for j in range(i+1,length):
        #Flag to break
        switch = 0
        for k in img_dir[j]:
            #If the difference between hash values is less than 10, treat the two as the same image
            if d_hash(img_dir[i][1],k)<10:
                print(os.path.basename(folder_dir[i])+' | vs | '+os.path.basename(folder_dir[j]))
                #Save the path with the smaller image size in the delete list
                if minhash(img_dir[i][1],k) == 0:
                    delete_file.append(folder_dir[i])
                else: delete_file.append(folder_dir[j])
                #Stop comparing this folder once a match is found
                #(i is advanced once, at the end of the while loop)
                switch = 1
                break
        if switch != 0:break
    i += 1

#View the folder path you want to delete
print(delete_file)

#Uncomment to actually delete the listed folders
#(set() removes duplicates, since a folder can be listed more than once)
#import shutil
#for i in set(delete_file):
#   shutil.rmtree(i)

Execution result

The first folder takes the longest, but the number of folders left to compare shrinks as i increases, so by the time processing is halfway through, the number of image comparisons per folder has also dropped by half. Even so, with 100 folders of 10 images each, the worst case (no matches at all) is about 49,500 comparisons: 4,950 folder pairs times 10 images. If this can be parallelized with the threading module or similar, I would like to implement that in the future.
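One way to reduce this cost, sketched below under some assumptions, is to hash every image exactly once up front (optionally in a small thread pool) and then compare only the stored hashes. The names `hash_all` and `find_duplicate_folders` are hypothetical, the comparison uses the first image of each folder as its representative, and integers stand in for image hashes so that the example runs without any image files. Whether threads actually speed up hashing depends on how much of the image decoding releases the GIL, which has not been measured here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def hash_all(folders, hash_func, workers=4):
    """Hash every image exactly once, using a small thread pool.

    folders maps a folder name to the list of image paths inside it.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {name: list(pool.map(hash_func, paths))
                for name, paths in folders.items()}


def find_duplicate_folders(hashes, distance, threshold=10):
    """Return pairs of folder names that share a matching image.

    The first image of the earlier folder is compared with every
    image of the later folder.
    """
    pairs = []
    for a, b in combinations(hashes, 2):
        if any(distance(hashes[a][0], h) < threshold for h in hashes[b]):
            pairs.append((a, b))
    return pairs


def hamming(a, b):
    """Bit difference between two integer stand-in hashes."""
    return bin(a ^ b).count("1")


# Stand-in data: integers play the role of image hashes.
fake_hash = {"a/1.jpg": 0xFFFF, "a/2.jpg": 0x0F0F,
             "b/1.jpg": 0xFFFE, "c/1.jpg": 0x0000}
folders = {"a": ["a/1.jpg", "a/2.jpg"], "b": ["b/1.jpg"], "c": ["c/1.jpg"]}

hashes = hash_all(folders, fake_hash.get)
print(find_duplicate_folders(hashes, hamming))  # [('a', 'b')]
```

With the libraries used in this article, `hash_func` could be `lambda path: imagehash.phash(Image.open(path))` and `distance` could be `lambda a, b: a - b`, since subtracting two ImageHash values gives their difference.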
