The last blog post focused on downloading images from the internet: I pulled down an entire joke site, but the downloaded images contained a lot of duplicates, such as other images on the same page, duplicate postings, and so on. So I looked into a few more Python methods and wrote a script that deletes duplicate images in a specified folder.
I. Methodology and thinking
1. Comparing whether two files are the same: the hashlib library provides a way to get the MD5 value of a file, so we can use MD5 values to determine whether two pictures are identical.
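For example, hashing the same bytes twice always yields the same digest, while different bytes give a different one (the byte strings here are just stand-ins for real image data):

```python
import hashlib

# Identical content always produces the same MD5 digest
a = hashlib.md5(b"same image bytes").hexdigest()
b = hashlib.md5(b"same image bytes").hexdigest()
c = hashlib.md5(b"different bytes").hexdigest()
print(a == b)  # True
print(a == c)  # False
```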
2. File operations: the os library provides file-operation methods; for example, os.remove() deletes a specified file, and os.listdir() takes a folder path and returns the names of all files in that folder.
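A minimal sketch of those two calls, using a hypothetical temporary folder with two throwaway files:

```python
import os
import tempfile

# Hypothetical setup: a temporary folder containing two files
folder = tempfile.mkdtemp()
open(os.path.join(folder, "a.jpg"), "w").close()
open(os.path.join(folder, "b.tmp"), "w").close()

# os.listdir() returns the names of all entries in the folder
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    if name.endswith(".tmp"):      # example condition for deletion
        os.remove(path)            # os.remove() deletes the file at that path

print(os.listdir(folder))  # ['a.jpg']
```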
Idea: get the names of all files in the specified folder, turn them into a list of absolute paths, then compute each file's MD5 in a loop; if an MD5 value has already been seen, delete that file.
II. Code Implementation
import os
import hashlib
import logging
import sys


def logger():
    """Get logger"""
    logger = logging.getLogger()
    if not logger.handlers:
        # Specify the logger output format
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')

        # File log (the log file name is a placeholder; the original was omitted)
        file_handler = logging.FileHandler("dedupe.log")
        file_handler.setFormatter(formatter)  # specify the output format via setFormatter

        # Console log
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.formatter = formatter  # you can also assign the formatter directly

        # Add the handlers to the logger
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)

        # Specify the lowest output level for logs; the default is WARN level
        logger.setLevel(logging.INFO)
    return logger


def get_md5(filename):
    m = hashlib.md5()
    mfile = open(filename, "rb")
    m.update(mfile.read())
    mfile.close()
    md5_value = m.hexdigest()
    return md5_value


def get_urllist():
    # Just replace this with your own folder path
    base = "F:\\pythonFile\\fryinge.com\\boring chart\\jpg\\"
    names = os.listdir(base)
    urlList = []
    for i in names:
        url = base + i
        urlList.append(url)
    return urlList


if __name__ == '__main__':
    log = logger()
    md5List = []
    urlList = get_urllist()
    for a in urlList:
        md5 = get_md5(a)
        if md5 in md5List:
            os.remove(a)
            print("Repeat: %s" % a)
            log.info("Repeat: %s" % a)
        else:
            md5List.append(md5)
    # print(md5List)
    print("A total of %s photos." % len(md5List))
Then we can use the log to see which files were duplicates. Note that the script reads each file into memory in one go to compute its MD5; that is fine for the small files this generally handles, but for large files the way the MD5 is computed should change. Just replace my path and you can run it on your own computer.
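If you do need to handle large files, the MD5 can be computed in fixed-size chunks instead of one big read, so memory use stays constant. A sketch under my own assumptions (the name get_md5_chunked and the 8192-byte chunk size are choices of mine, not from the original script):

```python
import hashlib


def get_md5_chunked(filename, chunk_size=8192):
    """Compute a file's MD5 by reading it in fixed-size chunks."""
    m = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            m.update(chunk)
    return m.hexdigest()
```

This function is a drop-in replacement for get_md5 in the script above.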
That's all for this article.