SoFunction
Updated on 2024-11-17

Detailed explanation of how to use Python to delete duplicate files

Python Office Automation - Delete Duplicate Files

The idea

The script makes two levels of judgment:

1. First compare file sizes: files with different sizes cannot be duplicates and are kept;

2. For files of the same size, compare their MD5 hashes: if the MD5 values match, the file is a duplicate and is deleted.

Source code with comments

from pathlib import Path
import hashlib


def getmd5(filename):
    # Receive the path of a file and return the file's md5 value
    with open(filename, 'rb') as f:
        data = f.read()
    file_md5 = hashlib.new("md5", data).hexdigest()
    return file_md5
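The getmd5 function above reads the entire file into memory at once, which can be costly for very large files. A minimal chunked variant (the name getmd5_chunked and the 64 KB chunk size are my own choices, not part of the original article) keeps memory usage constant:

```python
import hashlib


def getmd5_chunked(filename, chunk_size=64 * 1024):
    # Hash the file 64 KB at a time instead of reading it all at once
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        # iter() with a sentinel stops when f.read() returns b'' (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

It returns the same hex digest as the original getmd5, so it can be swapped in without changing the rest of the script.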


def main():
    path = r"F:\FileRecv\ Delete File Test"
    all_size = {}
    total_file = 0
    total_delete = 0

    # Get all the filenames under the path (glob gives no guaranteed order)
    all_files = Path(path).glob('*.*')

    # Sort in descending order: among identical files, the shortest name
    # (typically the original, e.g. 'photo.jpg' before 'photo (1).jpg') is kept
    all_files = sorted(all_files, reverse=True)

    # Iterate over all files in the file path
    for file in all_files:
        # Get the file size in bytes; it serves as a key of the data dictionary
        size = file.stat().st_size
        # name_and_md5 stores the file path and (lazily computed) md5 value,
        # and serves as the dictionary value
        name_and_md5 = [file, '']

        # Build the dictionary used to detect duplicates:
        # all_size maps each file size (key) to a name_and_md5 list (value)
        # For files of the same size, call getmd5 to compare md5 values
        # A file whose size is not yet in all_size cannot be a duplicate and is kept
        if size in all_size.keys():
            # Call the getmd5 function to get the md5 value of the file
            new_md5 = getmd5(file)
            if all_size[size][1] == '':
                all_size[size][1] = getmd5(all_size[size][0])
            # If the md5 value already exists, the file is a duplicate: delete it.
            # Otherwise, add the new md5 value to the list.
            if new_md5 in all_size[size]:
                file.unlink()
                total_delete += 1
            else:
                all_size[size].append(new_md5)
        else:
            all_size[size] = name_and_md5
        total_file += 1

    print(f'Total number of files: {total_file}')
    print(f'Number of deletions: {total_delete}')


if __name__ == '__main__':
    main()
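To try the logic without pointing the script at a real folder, the sketch below wraps the same two-level check in a function that takes the path as a parameter (the name remove_duplicates and the temporary-directory demo are my own additions, not part of the original script):

```python
import hashlib
import tempfile
from pathlib import Path


def remove_duplicates(path):
    # Same two-level check as main(), parameterized on the path
    all_size = {}
    deleted = 0
    for file in sorted(Path(path).glob('*.*'), reverse=True):
        size = file.stat().st_size
        if size in all_size:
            # Lazily compute the md5 of the first file seen at this size
            if all_size[size][1] == '':
                with open(all_size[size][0], 'rb') as f:
                    all_size[size][1] = hashlib.md5(f.read()).hexdigest()
            with open(file, 'rb') as f:
                new_md5 = hashlib.md5(f.read()).hexdigest()
            if new_md5 in all_size[size]:
                file.unlink()       # same size and same md5: delete the duplicate
                deleted += 1
            else:
                all_size[size].append(new_md5)
        else:
            all_size[size] = [file, '']
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.txt').write_text('same content')
    Path(tmp, 'b.txt').write_text('same content')   # duplicate of a.txt
    Path(tmp, 'c.txt').write_text('different text')
    print(remove_duplicates(tmp))  # prints 1; a.txt is deleted as a duplicate of b.txt
```

Because the names are sorted in descending order, b.txt is encountered before a.txt, so b.txt is the copy that survives.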

Result: (screenshot of the console output omitted)

Code Note: Special thanks to Mr. Yu Liang for the code!

Knowledge Expansion

pathlib and os: common function correspondences

An introduction to pathlib's common methods:

Path(path).name # Returns the filename including the extension

Path(path).stem # Returns the filename without the final extension

Path(path).suffix # Returns the file extension

Path(path).suffixes # Returns a list of the file's extensions

Path(path).root # Returns the root of the path

Path(path).parts # Returns a tuple of the path's components

Path(path).anchor # Returns the drive plus the root

Path(path).parent # Returns the parent directory

Path(path).parents # Returns an immutable sequence of all parent directories

Path(path).exists() # Determine whether the path exists as a file or folder

Path(path).is_dir() # Determine whether the path is a folder

Path(path).is_file() # Determine whether the path is a file

Path(path).mkdir() # Create a folder

Path(path).rmdir() # Delete a folder; the folder must be empty

Path(path).unlink() # Delete a file
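A quick sanity check of the attributes listed above, using PurePosixPath so the example behaves the same on any operating system (the path itself is made up for illustration):

```python
from pathlib import PurePosixPath

# A pure path never touches the filesystem, so any string works for a demo
p = PurePosixPath('/home/demo/archive.tar.gz')

print(p.name)      # archive.tar.gz
print(p.stem)      # archive.tar  (only the last suffix is stripped)
print(p.suffix)    # .gz
print(p.suffixes)  # ['.tar', '.gz']
print(p.parts)     # ('/', 'home', 'demo', 'archive.tar.gz')
print(p.parent)    # /home/demo
print(p.anchor)    # /
```

Note that stem strips only the final suffix, which is why suffixes (plural) exists for multi-part extensions like .tar.gz.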

This concludes this article on how to use Python to delete duplicate files. For more on deleting duplicate files with Python, please search my previous posts or continue browsing the related articles below. I hope you will support me more in the future!