Duplicate Image Detector Not Deleting All Duplicate Files - python

I use a double for loop to iterate through the files in my Pictures folder. If the first image is equal to the second image, I delete the second image from the folder and remove it from the list of files I'm iterating through. My program leaves some duplicate images undeleted, but when I run the program again it deletes them. How do I properly iterate through my files so that all duplicate images are deleted in a single run?
I've tried cutting down on unnecessary if/else blocks to make sure no file slips through the program.
path = "Pictures"
directory = os.listdir(path)
for first_file in directory:
for second_file in directory:
if first_file == second_file:
continue
if first_file.endswith(".jpg") and second_file.endswith(".jpg"):
first_file_path = R"Pictures\{}".format(first_file)
second_file_path = R"Pictures\{}".format(second_file)
img1 = cv2.imread(first_file_path, 1)
img2 = cv2.imread(second_file_path, 1)
img1 = cv2.resize(img1, (100,100))
img2 = cv2.resize(img2, (100,100))
difference = cv2.subtract(img1, img2)
b, g, r = cv2.split(difference)
if cv2.countNonZero(b) == 0 and cv2.countNonZero(g) == 0 and cv2.countNonZero(r) == 0:
os.remove(second_file_path)
directory.remove(second_file)
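A likely cause is that directory.remove(second_file) mutates the very list the for loops are iterating over, so the list iterator skips the element that comes right after each removal; those skipped duplicates only get caught on the next run. A minimal sketch of one workaround (same Pictures folder and the same comparison logic as above, just collecting the duplicates first and deleting them after both loops finish):

import os
import cv2

path = "Pictures"
files = [f for f in os.listdir(path) if f.endswith(".jpg")]

def same_image(path_a, path_b):
    # Same check as above: resize both images and require a zero per-channel difference
    a = cv2.resize(cv2.imread(path_a, 1), (100, 100))
    b = cv2.resize(cv2.imread(path_b, 1), (100, 100))
    return all(cv2.countNonZero(c) == 0 for c in cv2.split(cv2.subtract(a, b)))

to_delete = set()
for i, first_file in enumerate(files):
    if first_file in to_delete:
        continue
    for second_file in files[i + 1:]:
        if second_file in to_delete:
            continue
        if same_image(os.path.join(path, first_file), os.path.join(path, second_file)):
            to_delete.add(second_file)

for name in to_delete:
    os.remove(os.path.join(path, name))

Because nothing is removed from files while it is being traversed, every pair is compared exactly once and all duplicates can be deleted in a single run.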

Related

Saving images with different names in a folder

I tried to save images in a folder like this; it saves different images, but each subsequent image gets the names of all the previous images.
db = h5py.File('results/Results.h5', 'r')
dsets = sorted(db['data'].keys())
for k in dsets:
    db = get_data()
    imnames = sorted(db['data'].keys())
    slika = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    cv2.imwrite(f'spremljene_slike/ime_{imnames}.png', slika)
So I tried it like this; it saves different names, but only the last generated picture gets written to the folder, so I end up with different names but the same picture:
NUM_IMG = -1
N = len(imnames)
global NUM_IMG
if NUM_IMG < 0:
    NUM_IMG = N
start_idx, end_idx = 0, N  # min(NUM_IMG, N)
In a different function:
for u in range(start_idx, end_idx):
    imname = imnames[u]
    cv2.imwrite(f'spremljene_slike/ime_{imname}.png', imname)
Can someone help? I can't figure it out.
I have a script that generates images with rendered text and saves them in an .h5 file; from there I want to save these pictures, with their corresponding names, to a different folder.
I don't see how this works at all. On line 1 you define db = h5py.File(), then on line 4 you redefine it as db = get_data(). What is get_data()?
It's hard to write code without the schema. The answer below is my best guess, assuming your images are datasets in db['data'] and you want to use the dataset names (aka keys) as the image names.
with h5py.File('results/Results.h5', 'r') as db:
    dsets = sorted(db['data'].keys())
    for imname in dsets:
        img_arr = db['data'][imname][()]
        slika = cv2.cvtColor(img_arr, cv2.COLOR_BGR2RGB)
        cv2.imwrite(f'spremljene_slike/ime_{imname}.png', slika)
That should be all you need to do. You will get 1 .png for each dataset named ime_{imname}.png (where imname is the matching dataset name).
Also, you can eliminate all of the intermediate variables (dsets, img_arr and slika). Compress the code above into a few lines:
with h5py.File('results/Results.h5', 'r') as db:
    for imname in sorted(db['data'].keys()):
        cv2.imwrite(f'spremljene_slike/ime_{imname}.png',
                    cv2.cvtColor(db['data'][imname][()], cv2.COLOR_BGR2RGB))

Optimized way to read files based on file names in a directory using Python

I have a directory with images I want to read. Each image is named with a number (0.jpg, 1.jpg, ... 10000.jpg). What is an optimized way to read images in a given range and store the output in the respective directories?
One approach is to use a count and read/process accordingly.
img_dir = "/path/to/images"
op_dir_1 = "/path/to/output/dir_1"
op_dir_2 = "/path/to/output/dir_2"
op_dir_3 = "/path/to/output/dir_3"
count = 0
for img_path in tqdm(sorted(glob.glob(os.path.join(img_dir, "*.jpg")))):
if count < 100:
img = cv2.imread(img_path, cv2.COLOR_BGR2RGB)
# do processing and store output in op_dir_1
count += 1
if count > 99 and count < 150:
img = cv2.imread(img_path, cv2.COLOR_BGR2RGB)
# do processing and store output in op_dir_2
count += 1
if count > 149 and count < 200:
img = cv2.imread(img_path, cv2.COLOR_BGR2RGB)
# do processing and store output in op_dir_3
count += 1
Is there an optimized way to do this: iterate over only the images I need (in a range) and read them, using the file name (since the files are numbered) instead of a count?
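Since the files are numbered, one option is to build each path directly from its number instead of scanning the whole directory and keeping a counter; this also avoids the lexicographic ordering that sorted() gives on string names. A rough sketch (the output directories and ranges are placeholders, as in the snippet above):

import os

import cv2

img_dir = "/path/to/images"
# (start, stop, output directory) for each block of images; stop is exclusive
ranges = [(0, 100, "/path/to/output/dir_1"),
          (100, 150, "/path/to/output/dir_2"),
          (150, 200, "/path/to/output/dir_3")]

for start, stop, out_dir in ranges:
    for i in range(start, stop):
        img_path = os.path.join(img_dir, f"{i}.jpg")
        if not os.path.isfile(img_path):
            continue  # skip missing numbers
        img = cv2.imread(img_path)  # note: imread's second argument is a read flag, not a colour-conversion code
        # do processing and store output in out_dir
        cv2.imwrite(os.path.join(out_dir, f"{i}.jpg"), img)

This reads only the files in the requested ranges and never touches the rest of the directory.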

Apply the same code to multiple files in the same directory

I have code that already works, but I need to use it to analyse many files in the same folder. How can I rewrite it to do this? All the files have similar names (e.g. "pos001", "pos002", "pos003").
This is the code at the moment:
pos001 = mpimg.imread('pos001.tif')
coord_pos001 = np.genfromtxt('treat_pos001_fluo__spots.csv', delimiter=",")
Here I label the tif file "pos001" to differentiate separate objects in the same image:
label_im = label(pos001)
regions = regionprops(label_im)
Here I select only the object of interest by setting its pixel values == 1 and all the others == 0 (I'm interested in many objects, I show only one here):
cell1 = np.where(label_im != 1, 0, label_im)
Here I convert the x,y coordinates of the spots in the csv file to a 512x512 image where each spot has value 1:
x = coord_pos001[:,2]
y = coord_pos001[:,1]
coords = np.column_stack((x, y))
img = Image.new("RGB", (512,512), "white")
draw = ImageDraw.Draw(img)
dotSize = 1
for (x,y) in coords:
    draw.rectangle([x, y, x + dotSize - 1, y + dotSize - 1], fill="black")
im_invert = ImageOps.invert(img)
bin_img = im_invert.convert('1')
Here I set the values of the spots of the csv file equal to 1:
bin_img = np.where(bin_img == 255, 1, bin_img)
I convert the arrays from 2d to 1d:
bin_img = bin_img.astype(np.int64)
cell1 = cell1.flatten()
bin_img = bin_img.flatten()
I multiply the arrays to get an array where only the spots overlapping the labelled object have value = 1:
spots_cell1 = []
for num1, num2 in zip(cell1, bin_img):
    spots_cell1.append(num1 * num2)
I count the spots belonging to that object:
spots_cell1 = sum(float(num) == 1 for num in spots_cell1)
print(spots_cell1)
I hope it's clear. Thank you in advance!
You can define a function that takes the .tif file path and the .csv file path and processes the two
def process(tif_file, csv_file):
    pos = mpimg.imread(tif_file)
    coord = np.genfromtxt(csv_file, delimiter=",")
    # Do other processing with pos and coord
To process a single pair of files, you'd do:
process('pos001.tif', 'treat_pos001_fluo__spots.csv')
To list all the files in your tif file directory, you'd use the example in this answer:
import os

tif_file_directory = "/home/username/path/to/tif/files"
csv_file_directory = "/home/username/path/to/csv/files"

all_tif_files = os.listdir(tif_file_directory)
for file in all_tif_files:
    if file.endswith(".tif"):  # Make sure this is a tif file
        fname = file[:-len(".tif")]  # File name without the .tif extension (rstrip(".tif") would also strip trailing t/i/f characters)
        tif_file = f"{tif_file_directory}/{fname}.tif"  # Full path to tif file
        csv_file = f"{csv_file_directory}/treat_{fname}_fluo__spots.csv"  # Full path to csv file
        # Just to keep track of what is processed, print them
        print(f"Processing {tif_file} and {csv_file}")
        process(tif_file, csv_file)
The f"...{variable}..." construct is called an f-string. More information here: https://realpython.com/python-f-strings/

Saving multiple images with Python

For my cancer research I need to turn cancer scans into black and white images and save them. I have this code, which turns the first file in the folder into a black and white picture and copies it 49 times, with names like result1, result2, result3, etc.
wd = os.getcwd()
wd = os.chdir("C:\\Users\\Tije\\Documents\\School\\DeepLearning\\IDC_regular_ps50_idx5\\8863\\test")
for x in range(50):
    for file in os.listdir(wd):
        image_file = Image.open(file)
        image_file = image_file.convert('1')
        print(image_file)
        image_file.save(f"result{x}.png")
I need the code to black-and-white every picture in the folder, not just the first one, of course. I can't seem to understand why it does this.
Any help?
You're looping over the entire directory 50 times, and on each pass every image is saved to the same result{x}.png, so each file overwrites the previous one.
If you want to index for each result, just use enumerate as follows:
for index, file in enumerate(os.listdir(wd)):
    image_file = Image.open(file)
    image_file = image_file.convert('1')
    print(image_file)
    image_file.save(f"result{index}.png")

How to loop through one element of a zip() function twice - Python

So here's my dilemma... I'm writing a script that reads all the .png files from a folder and then converts them to a number of different dimensions, which I have specified in a list. Everything works as it should, except that it quits after handling one image.
Here is my code:
sizeFormats = ["1024x1024", "114x114", "40x40", "58x58", "60x60", "640x1136", "640x960"]

def resizeImages():
    widthList = []
    heightList = []
    resizedHeight = 0
    resizedWidth = 0
    # targetPath is the path to the folder that contains the images
    folderToResizeContents = os.listdir(targetPath)
    # This splits the dimensions into 2 separate lists for width and height
    # (ex: 640x960 adds 640 to widthList and 960 to heightList)
    for index in sizeFormats:
        widthList.append(index.split("x")[0])
        heightList.append(index.split("x")[1])
    # for every image in the folder, apply the dimensions from the populated lists and save
    for image, w, h in zip(folderToResizeContents, widthList, heightList):
        resizedWidth = int(w)
        resizedHeight = int(h)
        sourceFilePath = os.path.join(targetPath, image)
        imageFileToConvert = Image.open(sourceFilePath)
        outputFile = imageFileToConvert.resize((resizedWidth, resizedHeight), Image.ANTIALIAS)
        outputFile.save(sourceFilePath)
The following will be returned if the target folder contains 2 images called image1.png and image2.png (for the sake of visualization I'll add the dimensions that get applied to the image after an underscore):
image1_1024x1024.png,
..............,
image1_640x960.png (returns all 7 different dimensions for image1 fine)
It stops there, when I need it to apply the same transformations to image2. I know this is because widthList and heightList are only 7 elements long, so the loop exits before image2 gets its turn. Is there any way I can loop through widthList and heightList for every image in targetPath?
Why not keep it simple:
for image in folderToResizeContents:
    for fmt in sizeFormats:
        (w, h) = fmt.split('x')
N.B. You are overwriting the files produced, as you are not changing the name of the output path.
Nest your for loops and you can apply all 7 dimensions to each image
for image in folderToResizeContents:
    for w, h in zip(widthList, heightList):
The first for loop ensures it happens for each image, whereas the second for loop ensures that the image is resized to each size.
You need to re-iterate through sizeFormats for every file. zip doesn't do this unless you get even trickier with cyclic iterators for the heights and widths.
Sometimes tools such as zip make for longer, more complicated code when a couple of nested for loops work fine. I think it's more straightforward than splitting the formats into multiple lists and then zipping them back together again.
sizeFormats = ["1024x1024", "114x114", "40x40", "58x58", "60x60", "640x1136", "640x960"]
sizeTuples = [(int(w), int(h)) for w, h in map(lambda wh: wh.split('x'), sizeFormats)]

def resizeImages():
    # for every image in the folder, apply the dimensions from the populated lists and save
    for image in os.listdir(targetPath):
        for resizedWidth, resizedHeight in sizeTuples:
            sourceFilePath = os.path.join(targetPath, image)
            imageFileToConvert = Image.open(sourceFilePath)
            outputFile = imageFileToConvert.resize((resizedWidth, resizedHeight), Image.ANTIALIAS)
            outputFile.save(sourceFilePath)
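For completeness, a small sketch that combines the nested-loop idea with size-suffixed output names, so the resized copies don't overwrite the source file (as the N.B. above warns). targetPath is assumed to be defined as in the question, and Image.LANCZOS is the newer name for the ANTIALIAS filter:

import os
from itertools import product
from PIL import Image

sizeFormats = ["1024x1024", "114x114", "40x40", "58x58", "60x60", "640x1136", "640x960"]

def resizeImagesSuffixed(targetPath):
    for image, fmt in product(os.listdir(targetPath), sizeFormats):
        w, h = map(int, fmt.split("x"))
        name, ext = os.path.splitext(image)
        source = os.path.join(targetPath, image)
        # e.g. image1.png -> image1_1024x1024.png, so nothing is overwritten
        resized = Image.open(source).resize((w, h), Image.LANCZOS)
        resized.save(os.path.join(targetPath, f"{name}_{fmt}{ext}"))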
