
Python method to delete duplicate files in a local folder

The last blog post focused on downloading images from the internet: I pulled down an entire joke site, but the downloaded images contained many duplicates, such as other images from the same pages and reposted content. So I looked into a few more Python methods and wrote a script that deletes duplicate images in a specified folder.

I. Method and approach

1. Comparing whether two files are the same: the hashlib library provides a way to compute the MD5 value of a file, so we can use the MD5 value to decide whether two pictures are identical.

2. File operations: the os library provides file-operation functions, for example os.remove() deletes a specified file, and os.listdir() returns the names of all files in a given folder path; both are demonstrated in the short sketch after this list.
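As a quick standalone illustration of these two building blocks (not part of the script below), this sketch hashes every file that os.listdir() reports; the folder "C:\\photos\\" is a made-up example, and the deletion call is left commented out for safety:

import os
import hashlib

# "C:\\photos\\" is a made-up example folder; replace it with a real path
folder = "C:\\photos\\"

for name in os.listdir(folder):
    path = os.path.join(folder, name)
    if os.path.isfile(path):  # skip subfolders
        with open(path, "rb") as f:
            print(path, hashlib.md5(f.read()).hexdigest())
        # os.remove(path) would delete the file; commented out here for safety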

Thoughts: get the names of all files in the specified folder, join them into a list of absolute paths, then compute the MD5 value of each file in a loop; if an MD5 value has already been seen, delete that file.

II. Code Implementation

import os
import hashlib
import logging
import sys

def logger():
    """Get logger"""
    logger = logging.getLogger()
    if not logger.handlers:
        # Specify the logger output format
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')
        # File log (the log file name is a placeholder; use any name you like)
        file_handler = logging.FileHandler("delete_duplicates.log")
        file_handler.setFormatter(formatter)  # the output format can be specified via setFormatter
        # Console log
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.formatter = formatter  # the formatter can also be assigned directly
        # Add both handlers to the logger
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
        # Specify the lowest output level for logs; the default is the WARN level
        logger.setLevel(logging.INFO)
    return logger

def get_md5(filename):
    """Compute the MD5 value of a file"""
    m = hashlib.md5()
    mfile = open(filename, "rb")
    m.update(mfile.read())
    mfile.close()
    return m.hexdigest()

def get_urllist():
    # Just replace this with your own folder path
    base = "F:\\pythonFile\\fryinge.com\\boring chart\\jpg\\"
    file_names = os.listdir(base)
    urlList = []
    for name in file_names:
        urlList.append(base + name)
    return urlList

if __name__ == '__main__':
    log = logger()
    md5List = []
    urlList = get_urllist()
    for a in urlList:
        md5 = get_md5(a)
        if md5 in md5List:
            os.remove(a)
            print("Repeat: %s" % a)
            log.info("Repeat: %s" % a)
        else:
            md5List.append(md5)
            # print(md5List)
    print("A total of %s photos." % len(md5List))
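One design note on the script above: md5List is a plain list, so the md5 in md5List membership test rescans the whole list for every file; for a folder with many images, storing the seen digests in a set instead would make that lookup constant-time.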

Then we can use the log to see which files were duplicates. Note that for some larger files, obtaining the MD5 value this way (reading the whole file into memory at once) can become a problem, but it handles small files just fine. Just replace my path with your own and it will run on your computer.
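If you do need to handle large files, a common remedy (not part of the original script) is to feed the file to hashlib in fixed-size chunks instead of reading it whole; a minimal sketch, where the 4096-byte chunk size and the name get_md5_chunked are my own choices:

import hashlib

def get_md5_chunked(filename, chunk_size=4096):
    """Compute a file's MD5 without loading the whole file into memory."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        # Read the file piece by piece and update the hash incrementally
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()

This function is a drop-in replacement for get_md5() in the script above.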

That's all for this article.