[PYTHON] Duplicate file removal

Background

Over many years of PC use, I have saved all sorts of files on all sorts of media. Because I wanted to consolidate them, at least loosely, I set up a NAS and have been copying the contents of each medium (USB flash drives, HDDs, MO disks, even floppy disks) onto it one after another; this is still in progress. Most of the old media hold little data yet take up a lot of physical space, so I am discarding them as I go.

The problem is the data that was carried over on HDDs every time I migrated to a new PC. Each PC inherited the files of the previous one, which had inherited those of the one before it, and so on n times, so there are a great many duplicate files. The media I copied files to for carrying around also contain many duplicates, and some of those copies differ in version, so the whole thing is chaos.

For now I simply allocate one directory per medium and copy everything into it, but NAS capacity is finite too, so I wanted to remove at least the files that are exactly identical.

Solution

There seem to be many excellent tools out there, but I did not have the energy to hunt for a good free one. Instead, I wrote scripts for deduplication and handle it partly automatically and partly by checking with my own eyes.

Assumptions: a UNIX-like environment (GNU command-line tools) with rdfind installed.

mergehelper.sh


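#!/bin/bash
# Usage: mergehelper.sh DIR1 DIR2

DIR1=$1
DIR2=$2

# Dry run: nothing is deleted or linked; the report in results.txt is read below.
rdfind -ignoreempty true -dryrun true "${DIR1}" "${DIR2}"

# Drop comment and blank lines, keep only the file name (field 8 onward),
# and print an empty line before each group of identical files.
grep -v -e '^\s*#' -e '^\s*$' results.txt | awk '{if ($1=="DUPTYPE_FIRST_OCCURRENCE") printf("\n"); for (i=8; i<=NF; i++) printf("%s ", $i); printf("\n")}' > duplications.txt

# Wrap each file name in a commented-out rm command; group separators stay empty.
sed -e 's/^/#rm -v "/' -e 's/$/"/' -e 's/#rm -v ""//g' -e 's/ "$/"/g' duplications.txt > rmfile_comout.sh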

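Running it creates a shell script, rmfile_comout.sh, that consists of a huge number of commented-out rm commands, one block per set of identical files. Duplicates are eliminated by uncommenting the rm lines for the copies you do not need and then running the script.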
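To make the grouping step concrete, here is a minimal Python sketch of what the grep/awk part of the pipeline does. It assumes the report layout that the shell script relies on: lines starting with # are comments, the first column is the duplicate type, and the file name starts at column 8. The function name group_duplicates and the sample lines are made up for illustration and are not real rdfind output.

```python
def group_duplicates(report_lines):
    """Group file names from rdfind-style report lines.

    Assumed layout (same as the shell pipeline): '#' lines are comments,
    column 1 is the duplicate type, the file name starts at column 8.
    """
    groups = []
    for raw in report_lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        # Split off the first 7 columns; the remainder is the file name,
        # which may itself contain spaces.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        # A first occurrence opens a new group of identical files.
        if fields[0] == 'DUPTYPE_FIRST_OCCURRENCE' or not groups:
            groups.append([])
        groups[-1].append(fields[7])
    return groups


# Made-up sample in the assumed layout
sample = [
    '# duptype id depth size device inode priority name',
    'DUPTYPE_FIRST_OCCURRENCE 1 0 10 2049 11 1 media1/a b.txt',
    'DUPTYPE_WITHIN_SAME_TREE -1 0 10 2049 12 1 media1/copy.txt',
    'DUPTYPE_FIRST_OCCURRENCE 3 1 20 2049 13 2 media2/c.txt',
    'DUPTYPE_OUTSIDE_TREE -3 1 20 2049 14 1 media1/c.txt',
]
for group in group_duplicates(sample):
    print(group)
```

Unlike the awk version, splitting with a maximum count keeps runs of spaces inside a file name intact.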

It may be a good idea to leave everything except the first entry of each block uncommented from the start, or to customize the output according to the length of the list. In my case, I select the files to delete by keyword, as in the Python script below.
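As one possible reading of the idea of not commenting out everything from the start, the following sketch keeps the first rm line of each block commented (so that copy survives) and activates the rest. It assumes the rmfile_comout.sh layout described above, with blocks separated by empty lines; the function name activate_all_but_first and the sample paths are hypothetical.

```python
def activate_all_but_first(script_lines):
    """Uncomment every rm line except the first one of each block."""
    out = []
    first_in_group = True
    for raw in script_lines:
        line = raw.rstrip('\n')
        if line == '':
            # An empty line separates blocks of identical files.
            out.append('')
            first_in_group = True
        elif first_in_group:
            # Keep the first copy: its rm line stays commented out.
            out.append(line)
            first_in_group = False
        elif line.startswith('#'):
            # Strip only the leading comment marker, not '#' inside a path.
            out.append(line[1:])
        else:
            out.append(line)
    return out


# Made-up sample in the rmfile_comout.sh layout
sample = [
    '',
    '#rm -v "media1/a.txt"',
    '#rm -v "media2/a.txt"',
    '',
    '#rm -v "media1/b#1.txt"',
    '#rm -v "media3/b#1.txt"',
]
print('\n'.join(activate_all_but_first(sample)))
```

Which copy counts as "first" depends on the order in the report, so the result is still worth reviewing before running it.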

selection.py


import sys

# Delete files whose path contains any of the following keywords
rmlist = []
rmlist.append('KEYWORD1')
rmlist.append('KEYWORD2')
rmlist.append('KEYWORD3')
# and so on ...

# Uncomment the rm lines whose path contains a keyword
def process(lines):
    nline = len(lines)
    n = 0
    for line in lines:
        token = line.strip()
        delete = any(keyword in token for keyword in rmlist)

        if delete:
            n = n + 1
            if n < nline:   # Leave the last one in the list of identical files
                # Strip only the leading comment marker, not '#' inside a path
                print(token.replace('#', '', 1))
            else:
                print(token)
        else:
            print(token)

    return 0

# Process standard input, one blank-line-separated block at a time
lines = []
for line in sys.stdin:
    if line != '\n':
        lines.append(line)
    else:
        print("")
        process(lines)
        lines = []

# Flush the last block, which is not followed by a blank line
if lines:
    process(lines)

You can feed the generated script to it through a pipe:

cat rmfile_comout.sh | python3 selection.py > rmfile.sh

Adjust the keywords while checking the result, and run rmfile.sh only once you are satisfied with it. There may also be situations where you want to explicitly select the files to keep instead.
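For the case of explicitly selecting files to keep, a minimal sketch could combine a delete list with a keep list. This is an illustrative variant rather than part of the scripts above: the function name select_group, the two keyword parameters, and the sample paths are all hypothetical. It works on one block of rm lines at a time and always leaves at least one line commented.

```python
def select_group(group, rm_keywords, keep_keywords):
    """Uncomment the rm lines selected for deletion within one block.

    A line is a deletion candidate if it contains an rm keyword and no
    keep keyword. At least one line per block always stays commented.
    """
    candidates = [
        i for i, line in enumerate(group)
        if any(k in line for k in rm_keywords)
        and not any(k in line for k in keep_keywords)
    ]
    # If every copy is a candidate, spare the last one so one copy survives.
    if len(candidates) == len(group):
        candidates = candidates[:-1]
    chosen = set(candidates)
    return [
        line[1:] if i in chosen and line.startswith('#') else line
        for i, line in enumerate(group)
    ]


# Made-up block of identical files
group = [
    '#rm -v "old_pc/photo.jpg"',
    '#rm -v "usb/photo.jpg"',
    '#rm -v "old_pc/master/photo.jpg"',
]
for line in select_group(group, ['old_pc', 'usb'], ['master']):
    print(line)
```

Here the keep keyword takes priority, so the copy under master stays commented even though its path also contains old_pc.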
