Sorting image files with Python (2)


Last time, I got as far as sorting a large number of image files into year/month folders. At that point, a CSV file like the following was generated in each folder as a clue for deleting duplicate files.



From the left, the columns are file name, CRC32, and file size. Files whose CRC32 and file size both match are almost certainly the same file, so they are candidates for deletion. In the file above, the 2nd and 4th lines are duplicates, so I would like to split the list into two groups as follows.





The idea is simply "delete list ← original list - duplicate list" and "survival list ← original list - delete list", but although I tried kneading this in various ways, I could not reach a convincing implementation. So this time:

  1. Prepare an empty survival list and an empty deletion list.
  2. If the inspection target (its CRC32 and file size) is already on the survival list, add it to the deletion list.
  3. If the inspection target is not on the survival list, add it to the survival list.

I settled on a simple implementation along these lines (I think this is enough, because what I want to do is not complicated).
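The three steps can be tried out on a few in-memory rows before touching any files. This is only a minimal sketch: the file names, CRC32 values, and sizes are made up for illustration (they are not taken from the original CSV), and `classify_rows` is a hypothetical helper name.

```python
def classify_rows(rows):
    """Split (filename, crc32, filesize) rows into survivors and duplicates."""
    survived = {}  # step 1: empty survival list
    delete = {}    # step 1: empty deletion list
    for filename, crc32, filesize in rows:
        signature = (crc32, filesize)
        if signature in survived.values():
            # step 2: same CRC32 and size as a survivor, so mark for deletion
            delete[filename] = signature
        else:
            # step 3: first time this CRC32/size pair is seen, so keep it
            survived[filename] = signature
    return survived, delete


# Hypothetical sample data: the 2nd and 4th rows are duplicates.
rows = [
    ("a.jpg", "1a2b3c4d", 1024),
    ("b.jpg", "deadbeef", 2048),
    ("c.jpg", "0badf00d", 4096),
    ("d.jpg", "deadbeef", 2048),
]
survived, delete = classify_rows(rows)
print(list(survived))  # ['a.jpg', 'b.jpg', 'c.jpg']
print(list(delete))    # ['d.jpg']
```

The first file seen with a given CRC32 and size is the one that survives, so the result depends on the row order of the CSV.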

Development environment


import os
import sys
import pandas as pd

def classify(path, target):
    lines = pd.read_csv(os.path.join(path, target), header=None)

    d = {}
    servived_dict = {}      #What survives
    delete_dict = {}        #Target to be deleted

    # Convert each row (filename, crc32, filesize) into {filename: (crc32, filesize)}
    for i in range(len(lines)):
        (filename, crc32, filesize) = lines.values[i]
        d[filename] = (crc32, filesize)

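    for key, value in d.items():
        if value in servived_dict.values():
            # Same CRC32 and file size as a survivor: treat as a duplicate
            delete_dict[key] = value
        else:
            # First file seen with this CRC32 and file size: keep it
            servived_dict[key] = value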

    def output(full_path, dic):
        with open(full_path, mode='w') as f:
            for key in dic.keys():
                # Write only the full path of each file, one per line
                f.write(os.path.join(path, key) + "\n")

    output(os.path.join(path, "servived.txt"), servived_dict)
    output(os.path.join(path, "delete.txt"), delete_dict)

if __name__ == "__main__":
    full_path = sys.argv[1]
    classify(os.path.dirname(full_path), os.path.basename(full_path))

In the first for statement, the input is converted to {filename: (crc32, filesize)} so that it can be handled easily later. Without the tuple, the CRC32 and the file size would have to be checked separately, which is a bit of a pain. Also, although servived_dict is needed for the classification itself, writing it out to servived.txt serves no purpose, so that output is unnecessary (although it was useful when debugging).


I am still an inexperienced Pythonista, so I suspect this could be finished off quickly with set arithmetic, but this is it for this time. Next is the deletion process, which refers to the delete.txt generated in each year/month folder (why do I have to write this ...)
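For reference, here is one way the set arithmetic might look. It is a sketch under assumptions, not a tested replacement for the script above: `split_with_sets` is a hypothetical name, the rows are made-up sample data, and it assumes file names are unique within a folder.

```python
def split_with_sets(rows):
    """Return (survivors, to_delete) as sets of file names."""
    first_seen = {}
    for filename, crc32, filesize in rows:
        # Keep only the first file name seen for each (crc32, filesize) pair
        first_seen.setdefault((crc32, filesize), filename)

    survivors = set(first_seen.values())
    all_files = {filename for filename, _, _ in rows}
    # delete list = original list - survival list
    to_delete = all_files - survivors
    return survivors, to_delete


# Hypothetical sample data: b.jpg and d.jpg share CRC32 and size.
rows = [
    ("a.jpg", "1a2b3c4d", 1024),
    ("b.jpg", "deadbeef", 2048),
    ("c.jpg", "0badf00d", 4096),
    ("d.jpg", "deadbeef", 2048),
]
survivors, to_delete = split_with_sets(rows)
print(sorted(survivors))  # ['a.jpg', 'b.jpg', 'c.jpg']
print(sorted(to_delete))  # ['d.jpg']
```

Looking up a tuple as a dictionary key avoids scanning all survivors for every file, which may matter for folders with many images.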

