Sorting image files with Python (2)


Last time, I got as far as sorting a large number of image files into year/month folders. During that step, a CSV file like the following was generated in each folder as a clue for deleting duplicate files.



From left to right, the columns are the file name, CRC32, and file size. Files whose CRC32 and file size both match are almost certainly the same file, so the extra copies become candidates for deletion. In the file above, the 2nd and 4th lines are duplicates, so I would like to split the entries into two groups as follows.





The idea is simply "delete list ← original list − duplicate list" and "survival list ← original list − delete list", but however much I kneaded it, I could not reach an implementation I was happy with. So this time I settled on the following steps:

  1. Prepare an empty survival list and an empty deletion list.
  2. If the item being checked (its CRC32 and file size) is already on the survival list, add it to the deletion list.
  3. If the item being checked is not on the survival list, add it to the survival list.

This keeps the implementation simple, which I think is enough because what I want to do is not complicated.
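The three steps above can be sketched on an in-memory list before any file handling gets involved. The function name `split_duplicates` and the sample rows are made up for illustration (they are not taken from the real CSV), and the sketch assumes each row is (file name, CRC32, file size).

```python
def split_duplicates(rows):
    """Split (filename, crc32, filesize) rows into survivors and duplicates."""
    survived = []  # step 1: empty survival list
    delete = []    # step 1: empty deletion list
    for filename, crc32, filesize in rows:
        # A row counts as "on the survival list" when both CRC32 and size match
        already_kept = any((crc32, filesize) == (c, s) for _, c, s in survived)
        if already_kept:
            delete.append((filename, crc32, filesize))    # step 2
        else:
            survived.append((filename, crc32, filesize))  # step 3
    return survived, delete


# Hypothetical sample data: the 2nd and 4th rows share CRC32 and file size
rows = [
    ("IMG_0001.JPG", "1a2b3c4d", 1000),
    ("IMG_0002.JPG", "5e6f7a8b", 2000),
    ("IMG_0003.JPG", "9c0d1e2f", 3000),
    ("IMG_0002_copy.JPG", "5e6f7a8b", 2000),
]
survived, delete = split_duplicates(rows)
print([name for name, _, _ in survived])
print([name for name, _, _ in delete])
```

With this sample, the first occurrence of each (CRC32, file size) pair is kept and only the later copy ends up on the deletion list.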

Development environment


import os
import sys
import pandas as pd

def classify(path, target):
    lines = pd.read_csv(os.path.join(path, target), header=None)

    d = {}
    servived_dict = {}      #What survives
    delete_dict = {}        #Target to be deleted

    # Convert each filename,crc32,filesize row into {filename: (crc32, filesize)}
    for i in range(len(lines)):
        (filename, crc32, filesize) = lines.values[i]
        d[filename] = (crc32, filesize)

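    for key, value in d.items():
        if value in servived_dict.values():
            # Same (crc32, filesize) already kept: this file is a duplicate
            delete_dict[key] = value
        else:
            # First time this (crc32, filesize) is seen: keep the file
            servived_dict[key] = value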

    def output(full_path, dic):
        with open(full_path, mode='w') as f:
            for key in dic.keys():
                #I only want the full path of the file to be deleted
                f.write(os.path.join(path, key) + "\n")

    output(os.path.join(path, "servived.txt"), servived_dict)
    output(os.path.join(path, "delete.txt"), delete_dict)

if __name__ == "__main__":
    full_path = sys.argv[1]
    classify(os.path.dirname(full_path), os.path.basename(full_path))

In the first for statement, each row is converted to {filename: (crc32, filesize)} so that it is easy to handle later. Without the tuple, I would have to check separately whether the CRC32 and the file size are each present, which is a bit clumsy. Also, although servived_dict is needed for the classification itself, writing it out to servived.txt serves no purpose, so that output is unnecessary (although it was useful when debugging).
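To illustrate why the tuple is convenient, here is a small sketch with made-up values (the file name, CRC32 string, and sizes are hypothetical). Because tuples compare element by element, a single `in` test checks CRC32 and file size together.

```python
# One surviving entry, stored as {filename: (crc32, filesize)}
survived = {"a.jpg": ("1a2b3c4d", 1000)}

same_file = ("1a2b3c4d", 1000)           # both CRC32 and size match
same_crc_other_size = ("1a2b3c4d", 999)  # CRC32 matches, size does not

print(same_file in survived.values())            # True
print(same_crc_other_size in survived.values())  # False
```

A matching CRC32 with a different size is not treated as a duplicate, which is the behavior the classification relies on.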


I am still a fledgling Pythonista, and I suspect set arithmetic would make short work of this, but this is where I will leave it for now. Next up is the deletion step, which reads the delete.txt generated in each year/month folder (why do I have to write it ...)
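For reference, one way the set-arithmetic idea might look is sketched below. This is not the implementation used in this post; the function name `split_with_sets` and the sample rows are hypothetical, and it assumes file names are unique within a folder.

```python
def split_with_sets(rows):
    """Return (survived, delete) as sets of file names."""
    first_seen = {}  # (crc32, filesize) -> first file name with that pair
    for filename, crc32, filesize in rows:
        # setdefault keeps the first file name and ignores later duplicates
        first_seen.setdefault((crc32, filesize), filename)

    all_names = {filename for filename, _, _ in rows}
    survived = set(first_seen.values())
    delete = all_names - survived  # delete list = original list - survival list
    return survived, delete


# Hypothetical sample data
rows = [
    ("IMG_0001.JPG", "1a2b3c4d", 1000),
    ("IMG_0002.JPG", "5e6f7a8b", 2000),
    ("IMG_0002_copy.JPG", "5e6f7a8b", 2000),
]
survived, delete = split_with_sets(rows)
print(sorted(survived))
print(sorted(delete))
```

Looking up the (CRC32, file size) pair as a dictionary key also avoids scanning every surviving value for each file, which may matter for folders with many images.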
