SoFunction
Updated on 2024-12-13

De-duplicating images in a directory with Python and OpenCV

Version:

Platform: ubuntu 14 / I5 / 4G RAM

python version: python2.7

opencv version: 2.4.13

Dependency:

If Python is not already installed on your system, install it together with the packages below:

sudo apt-get install python

sudo apt-get install python-dev

sudo apt-get install python-pip

sudo pip install numpy matplotlib

sudo apt-get install libcv-dev

sudo apt-get install python-opencv

Image de-duplication using perceptual hashing algorithms

Principle: iterate over all the files in the directory and de-duplicate them one by one. The more images there are, the slower it runs, but it saves doing the work by hand.
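To make the cost concrete, here is a minimal sketch, assuming every image is compared with every other image exactly once. The helper name `pair_count` is hypothetical and is not part of the original script.

```python
def pair_count(n):
    """Number of unordered image pairs among n images: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# The number of comparisons grows roughly with the square of the image count.
for n in (10, 100, 1000):
    print("%d images -> %d comparisons" % (n, pair_count(n)))
```

Ten times as many images means roughly a hundred times as many comparisons, which is why large directories take noticeably longer.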

How perceptual hashing works:

1. Scale each image to be compared down to 8 * 8 pixels and convert it to grayscale

2. Compare each pixel with the image's average gray value (1 if it is greater than or equal to the average, 0 otherwise) to get a 64-bit fingerprint

3. Calculate the Hamming distance between the fingerprints of two images

4. If the fingerprints differ in 5 or fewer positions, the two images are treated as the same (or very similar) picture
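The steps above can be sketched in plain Python, without OpenCV, on hand-made data. This is only an illustration: the function names `average_hash` and `hamming` and the three 8 * 8 grids are assumptions made for the example, and the grids stand in for images that have already been scaled down and converted to grayscale.

```python
def average_hash(gray):
    """Return a 64-element list of 0/1 bits for an 8x8 grid of grayscale values."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / float(len(pixels))
    # 1 where the pixel is at least as bright as the average, 0 otherwise
    return [1 if v >= mean else 0 for v in pixels]

def hamming(a, b):
    """Count the positions at which two equal-length fingerprints differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical data: a left-to-right gradient, a slightly brightened copy,
# and a mirrored copy.
base = [[col * 32 for col in range(8)] for _ in range(8)]
brighter = [[min(255, v + 10) for v in row] for row in base]
mirrored = [row[::-1] for row in base]

fp_base = average_hash(base)
print(hamming(fp_base, average_hash(brighter)))  # 0  -> treated as duplicates
print(hamming(fp_base, average_hash(mirrored)))  # 64 -> clearly different
```

A uniform brightness change shifts the average along with the pixels, so the fingerprint stays the same, while mirroring the gradient flips every bit.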

#!/usr/bin/python
# -*- coding: UTF-8 -*-
 
import cv2
import numpy as np
import os,sys,types
 
def cmpandremove2(path):
    # Faster variant: hash every image once, then compare the stored fingerprints.
    dirs = sorted(os.listdir(path))
    if len(dirs) <= 0:
        return
    hashes = {}
    for name in dirs:
        prepath = os.path.join(path, name)
        preimg = cv2.imread(prepath)
        if preimg is None:
            # Not a readable image (for example a subdirectory): skip it.
            continue
        preresize = cv2.resize(preimg, (8, 8))
        pregray = cv2.cvtColor(preresize, cv2.COLOR_BGR2GRAY)
        premean = cv2.mean(pregray)[0]
        # Fingerprint: 1 where the pixel is >= the mean gray value, 0 otherwise.
        prearr = (pregray.flatten() >= premean).astype(np.uint8)
        print("get %s" % prepath)
        hashes[name] = prearr
    keys = sorted(hashes.keys())
    index = 0
    while index < len(keys):
        curkey = keys[index]
        print(curkey)
        dellist = []
        # Only later keys need checking; earlier ones were already compared.
        for other in keys[index + 1:]:
            # Hamming distance between the two fingerprints.
            diff = int(np.count_nonzero(hashes[curkey] != hashes[other]))
            if diff <= 5:
                dellist.append(other)
        for name in dellist:
            target = os.path.join(path, name)
            print("remove %s" % target)
            os.remove(target)
            hashes.pop(name)
        # Removed keys all sort after curkey, so index still points at curkey.
        keys = sorted(hashes.keys())
        index = index + 1
def fingerprint(imgpath):
    # Return the 64-element 0/1 fingerprint of an image, or None if it cannot be read.
    img = cv2.imread(imgpath)
    if img is None:
        return None
    gray = cv2.cvtColor(cv2.resize(img, (8, 8)), cv2.COLOR_BGR2GRAY)
    return (gray.flatten() >= cv2.mean(gray)[0]).astype(np.uint8)

def cmpandremove(path):
    # Slower variant: re-hashes every other image for each image it inspects.
    # Returns 1 if any duplicate was found, 0 otherwise.
    index = 0
    flag = 0
    dirs = sorted(os.listdir(path))
    if len(dirs) <= 0:
        return 0
    while index < len(dirs):
        # os.path.join works whether or not path ends with a slash.
        prepath = os.path.join(path, dirs[index])
        print(prepath)
        prearr = fingerprint(prepath)
        if prearr is None:
            index = index + 1
            continue
        removepath = []
        for index2 in range(len(dirs)):
            if index2 == index:
                continue
            curpath = os.path.join(path, dirs[index2])
            #print curpath
            curarr = fingerprint(curpath)
            if curarr is None:
                continue
            # Hamming distance between the two fingerprints.
            diff = int(np.count_nonzero(curarr != prearr))
            if diff <= 5:
                print("the same")
                removepath.append(curpath)
                flag = 1
        index = index + 1
        if len(removepath) > 0:
            for target in removepath:
                print("remove %s" % target)
                os.remove(target)
            # Re-read the directory so deleted files are not visited again.
            dirs = sorted(os.listdir(path))
            if len(dirs) <= 0:
                return 0
            #index = 0
    return flag
  
def main(argv):
    if len(argv) <= 1:
        print("command error")
        return -1
    # The argument must be an existing directory.
    if not os.path.isdir(argv[1]):
        return -1
    path = argv[1]
    '''
    while True:
        if cmpandremove(path) == 0:
            break
    '''
    cmpandremove(path)
    return 0

if __name__ == '__main__':
    main(sys.argv)
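The script above stores each fingerprint as an array of 0/1 values. As an alternative (this is not the author's method), a fingerprint can be packed into a single integer so that the Hamming distance becomes one XOR plus a bit count. A minimal sketch, in which `pack_bits`, `hamming_int`, `find_duplicates` and the sample fingerprints are all hypothetical:

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer, first bit highest."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value

def hamming_int(a, b):
    """Hamming distance between two packed fingerprints."""
    return bin(a ^ b).count("1")

def find_duplicates(fingerprints, threshold=5):
    """Return names whose fingerprint is within `threshold` bits of an earlier kept name."""
    kept, duplicates = [], []
    for name in sorted(fingerprints):
        fp = fingerprints[name]
        if any(hamming_int(fp, fingerprints[k]) <= threshold for k in kept):
            duplicates.append(name)
        else:
            kept.append(name)
    return duplicates

# Made-up 64-bit fingerprints: b.jpg differs from a.jpg in one bit,
# c.jpg differs from a.jpg in all 64 bits.
fingerprints = {
    "a.jpg": pack_bits([0, 0, 0, 0, 1, 1, 1, 1] * 8),
    "b.jpg": pack_bits([0, 0, 0, 1, 1, 1, 1, 1] + [0, 0, 0, 0, 1, 1, 1, 1] * 7),
    "c.jpg": pack_bits([1, 1, 1, 1, 0, 0, 0, 0] * 8),
}
print(find_duplicates(fingerprints))  # ['b.jpg']
```

Because the function only reports names and deletes nothing, it can also serve as a dry run before any files are removed.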

To save manual work, the following shell script walks the directory tree recursively and runs the de-duplication on every subdirectory it finds

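#!/bin/bash
indir=$1
addcount=0
function intest()
{
 
 # quote the variables so paths containing spaces are handled correctly
 for file in "$1"/*
 do
  echo "$file"
  if test -d "$file"
  then
   # run the de-duplication script on this directory, then recurse into it
   ~/ $file/
   intest "$file"
  fi
 done
}

intest "$indir"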
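The same recursive walk can also be written in Python with `os.walk`, which avoids a separate shell script. A minimal sketch, assuming only the list of subdirectories is needed; `list_subdirs` is a hypothetical helper, and the demo builds a throwaway directory tree instead of touching real images.

```python
import os
import tempfile

def list_subdirs(root):
    """Return every directory below `root` (excluding root itself), sorted."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Demo on a temporary tree: root/a, root/a/b, root/c
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))
os.makedirs(os.path.join(root, "c"))
for d in list_subdirs(root):
    print(os.path.relpath(d, root))
```

Each returned path could then be passed to `cmpandremove` in place of the shell loop. Note that `os.walk` does not follow symbolic links by default, whereas `test -d` in the shell version does.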

That is all I have to share about using Python and OpenCV to de-duplicate images in a directory. I hope it serves as a useful reference, and thank you for your support.