Python Office Automation - Delete Duplicate Files
Approach
Duplicates are detected in two stages:
1. First compare file sizes. A file whose size differs from all the others cannot be a duplicate, so it is kept.
2. For files of the same size, compare their md5 values. If the md5 values also match, the file is a duplicate and is deleted.
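The two-stage check above can be sketched as a small function that only reports duplicates instead of deleting them. This is a minimal illustration rather than the article's own code: the names `md5_of` and `find_duplicates` are hypothetical, and the temporary files exist only for the demonstration.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of the file at `path` (hypothetical helper)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def find_duplicates(paths):
    """Return the paths whose content repeats an earlier file in `paths`."""
    # Stage 1: group by size. A file with a unique size cannot be a duplicate.
    by_size = defaultdict(list)
    for p in paths:
        by_size[Path(p).stat().st_size].append(Path(p))

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: keep the file without hashing it
        # Stage 2: within one size group, compare md5 digests.
        seen = set()
        for p in same_size:
            digest = md5_of(p)
            if digest in seen:
                duplicates.append(p)  # same size and same md5
            else:
                seen.add(digest)
    return duplicates


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello")
    (root / "b.txt").write_bytes(b"hello")   # same size and content as a.txt
    (root / "c.txt").write_bytes(b"world")   # same size, different content
    (root / "d.txt").write_bytes(b"longer")  # different size
    found = find_duplicates(sorted(root.glob("*.txt")))
    duplicate_names = [p.name for p in found]

print(duplicate_names)  # ['b.txt']
```

Only `b.txt` is reported: `c.txt` shares its size with `a.txt` but has a different md5, and `d.txt` is never hashed at all because its size is unique.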
Annotated source code
from pathlib import Path
import hashlib


def getmd5(filename):
    # Receive the path of a file and return the file's md5 value
    with open(filename, 'rb') as f:
        data = f.read()
    file_md5 = hashlib.new("md5", data).hexdigest()
    return file_md5


def main():
    path = r"F:\FileRecv\ Delete File Test"
    all_size = {}
    total_file = 0
    total_delete = 0
    # Get all the file names under path. Sorted in ascending order, the copy
    # with the newest date and time would be kept among identical files
    all_files = Path(path).glob('*.*')
    # Sort in descending order so that, among identical files, the one with
    # the shortest file name (i.e. the oldest date) is kept
    all_files = sorted(all_files, reverse=True)

    # Iterate over all files in the path
    for file in all_files:
        # The file size in bytes is used as the dictionary key
        size = file.stat().st_size
        # name_and_md5 holds the file's path and md5 value and is the dictionary value
        name_and_md5 = [file, '']
        # all_size maps each size to a list: [first file path, its md5, other md5 values...]
        # md5 is only computed for files whose size has been seen before.
        # A file whose size is not yet in all_size is treated as a different file and kept
        if size in all_size.keys():
            # Get the md5 value of the current file
            new_md5 = getmd5(file)
            # Lazily compute the md5 of the first file recorded for this size
            if all_size[size][1] == '':
                all_size[size][1] = getmd5(all_size[size][0])
            # If the md5 value is already recorded, the file is a duplicate and is deleted;
            # otherwise the new md5 value is added to the list
            if new_md5 in all_size[size]:
                file.unlink()
                total_delete += 1
            else:
                all_size[size].append(new_md5)
        else:
            all_size[size] = name_and_md5
        total_file += 1
    print(f'Total number of files: {total_file}')
    print(f'Number of deletions: {total_delete}')


if __name__ == '__main__':
    main()
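`getmd5` reads the whole file into memory before hashing, which may be costly for very large files. One possible variation, assuming a hypothetical helper named `getmd5_chunked` and an arbitrary 1 MiB chunk size, feeds the file to `hashlib.md5` piece by piece and should yield the same digest:

```python
import hashlib


def getmd5_chunked(filename, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, reading it in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Because the digest is updated incrementally, memory use is bounded by `chunk_size` rather than by the file size.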
Note: Special thanks to Mr. Yu Liang for the code!
Knowledge Expansion
Common function correspondences between pathlib and os
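A few of the more common correspondences can be checked side by side. The sketch below is only illustrative: the file name `report.txt` and the temporary directory are assumptions made for the demonstration, and each pair puts the `os` / `os.path` call first and the `pathlib` equivalent second.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "report.txt"   # pathlib's "/" operator joins path parts
    p.write_text("demo")

    pairs = [
        (os.path.join(tmp, "report.txt"), str(p)),
        (os.path.basename(p), p.name),
        (os.path.dirname(p), str(p.parent)),
        (os.path.splitext(p)[1], p.suffix),
        (os.path.exists(p), p.exists()),
        (os.path.isfile(p), p.is_file()),
        (os.path.isdir(p), p.is_dir()),
        (os.path.getsize(p), p.stat().st_size),
    ]
    # Every os call should agree with its pathlib counterpart.
    all_match = all(a == b for a, b in pairs)

print(all_match)
```

For operations that change the file system, `os.mkdir`, `os.rmdir` and `os.remove` correspond to `Path.mkdir()`, `Path.rmdir()` and `Path.unlink()`.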
Commonly used pathlib attributes and methods:
Path(path).name # Returns the file name including the extension
Path(path).stem # Returns the file name without the extension
Path(path).suffix # Returns the file extension
Path(path).suffixes # Returns a list of the file's extensions
Path(path).root # Returns the root directory
Path(path).parts # Returns a tuple of the path's components
Path(path).anchor # Returns the drive and root combined
Path(path).parent # Returns the parent directory
Path(path).parents # Returns a sequence of all ancestor directories
Path.exists() # Determine whether the path is an existing file or folder
Path.is_dir() # Determine whether the path is a folder
Path.is_file() # Determine whether the path is a file
Path.mkdir() # Create a folder
Path.rmdir() # Delete a folder; the folder must be empty
Path.unlink() # Delete a file
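The attributes and methods listed above can be tried out as follows. This is a small sketch under stated assumptions: the path `/home/user/archive.tar.gz` is made up, `PurePosixPath` is used for the attribute part so that the results are the same on every operating system, and the file system calls run inside a temporary directory.

```python
import tempfile
from pathlib import Path, PurePosixPath

# Pure paths never touch the file system; they only parse the string.
p = PurePosixPath("/home/user/archive.tar.gz")
print(p.name)           # archive.tar.gz
print(p.stem)           # archive.tar
print(p.suffix)         # .gz
print(p.suffixes)       # ['.tar', '.gz']
print(p.root)           # /
print(p.anchor)         # /
print(p.parts)          # ('/', 'home', 'user', 'archive.tar.gz')
print(p.parent)         # /home/user
print(len(p.parents))   # 3  (/home/user, /home, /)

# Concrete paths: create, inspect and remove a folder and a file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "demo"
    folder.mkdir()                       # create the folder
    file = folder / "note.txt"
    file.write_text("hello")
    checks = [folder.exists(), folder.is_dir(), file.is_file()]
    file.unlink()                        # delete the file
    folder.rmdir()                       # works only because the folder is now empty
    checks.append(not folder.exists())

print(checks)           # [True, True, True, True]
```

On Windows paths, `anchor` additionally includes the drive, which is where it differs from `root`.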
That concludes this article on using Python to delete duplicate files.