I am trying to iterate through a folder and delete any file that is a duplicate image (but with a different name). After running this script, all files get deleted except for one, even though there are at least a dozen unique images out of about 5,000. Any help understanding why this is happening would be appreciated.
import os
import cv2
directory = r'C:\Users\Grid\scratch'
for filename in os.listdir(directory):
    a = directory + '\\' + filename
    n = cv2.imread(a)
    q = 0
    for filename in os.listdir(directory):
        b = directory + '\\' + filename
        m = cv2.imread(b)
        comparison = n == m
        equal_arrays = comparison.all()
        if equal_arrays == True and q == 1:
            os.remove(b)
        q = 1
There are a few issues with your code, and it's surprising that it runs at all without throwing an exception: when the two images have different dimensions, comparison ends up as a plain boolean, and calling comparison.all() on a boolean shouldn't work.
A few pointers: you only need to get the directory contents once, and it would be much simpler to collect MD5 or SHA-1 hashes of the files while iterating over the directory, removing duplicates along the way.
For example:
import hashlib
import os
hashes = set()
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    digest = hashlib.sha1(open(path, 'rb').read()).digest()
    if digest not in hashes:
        hashes.add(digest)
    else:
        os.remove(path)
You can use a more secure hash if you would like, but the chances of encountering a collision are astronomically low.
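If the images are large, a variation of the same idea that hashes each file in chunks (so nothing is read fully into memory) might look like the following sketch; the chunk size, the switch to SHA-256 and the file_digest helper name are just illustrative, and directory is assumed to be defined as in your script:
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    # Hash the file in 1 MiB chunks so large images are never fully loaded into memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.digest()

hashes = set()
for filename in os.listdir(directory):
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        continue  # skip sub-directories
    digest = file_digest(path)
    if digest not in hashes:
        hashes.add(digest)
    else:
        os.remove(path)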
My filenames have patterns like 29_11_2019_17_05_17_1050_R__2.png and 29_11_2019_17_05_17_1550_2.
I want to write a function that separates these files and puts them in different folders.
Please find my code below, but it's not working.
Can you help me with this?
def sort_photos(folder, dir_name):
    for filename in os.listdir(folder):
        wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
        for x in wavelengths:
            if x == "1550_R_":
                if re.match(r'.*x.*', filename):
                    filesrc = os.path.join(folder, filename)
                    shutil.copy(filesrc, dir_name)
                    print("Filename has 'x' in it, do something")
                    print("file 1550_R_ copied")
                    # cv2.imwrite(dir_name,filename)
            else:
                print("filename doesn't have '1550_R_' in it (so maybe it only has 'N' in it)")
In order to construct a regex from a variable, you can use string interpolation.
In Python 3.6+, you can use f-strings to accomplish this. In this case, the condition of your second if statement could be:
if re.match(fr'.*{x}.*', filename) is not None:
instead of:
if re.match(r'.*x.*', filename) is not None:
This would only ever match filenames with a literal 'x' in them, which I think is the immediate (though not necessarily the only) problem in your example code.
Footnote
Earlier versions of Python do string interpolation differently; the oldest (AFAIK) is %-formatting, e.g.:
if re.match(r".*%s.*" % x, filename) is not None:
Read here for more detail.
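Putting that into your function, a minimal sketch of the loop with the f-string pattern might look like this (the wavelengths list and the single dir_name destination are kept as in your code; everything else is simplified for illustration):
import os
import re
import shutil

def sort_photos(folder, dir_name):
    wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
    for filename in os.listdir(folder):
        for x in wavelengths:
            # the wavelength tag is interpolated into the pattern instead of a literal 'x'
            if re.match(fr'.*{x}.*', filename):
                filesrc = os.path.join(folder, filename)
                shutil.copy(filesrc, dir_name)
                print(f"file {x} copied")
                break  # stop at the first tag that matches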
I'm not quite clear about which problem you're encountering.
However, here are two suggestions:
To check for the substring x in the file name, you can just use the in operator:
if x in filename:
    ...
If you only intend to move files, a check that the item really is a file should be added:
if os.path.isfile(name):
    ...
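Combining both suggestions, a small helper along these lines might do what you want (the folder, dir_name and x names mirror the ones in your code; copy_matching itself is just an illustrative name):
import os
import shutil

def copy_matching(folder, dir_name, x):
    # copy every real file in `folder` whose name contains the substring `x` into `dir_name`
    for filename in os.listdir(folder):
        filesrc = os.path.join(folder, filename)
        if os.path.isfile(filesrc) and x in filename:
            shutil.copy(filesrc, dir_name)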
I didn't have much time, so I've edited your function into something that behaves very close to what you wanted. It reads the file names and copies the files into separate directories, one named after each wavelength. It currently cannot differentiate between '1550_' and '1550_R_', since '1550_R_' contains '1550_'; you can fix that with a conditional statement of a few lines. (If you don't, it will create both a '1550_' and a '1550_R_' directory and copy files that are eligible for either into both folders.)
One final note: since I was short on time, I simplified things so that the destination folders are created right where your files are located. You can change that with a few extra lines if you want.
import cv2
import os
import re
import shutil
def sort_photos(folder):
    wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
    for filename in os.listdir(folder):
        for idx, x in enumerate(wavelengths):
            if x in filename:
                filesrc = os.path.join(folder, filename)
                path = './' + x + '/'
                if not os.path.exists(path):
                    os.mkdir(path)
                shutil.copy(filesrc, path + filename)
                # print("Filename has 'x' in it, do something")
                # cv2.imwrite(dir_name, filename)
            # else:
            #     print("filename doesn't have 'A' in it (so maybe it only has 'N' in it)")

########## USAGE: sort_photos(folder). For example, go to the folder where all the files are located and run:
sort_photos('./')
Say I have these files
/home/user/one/two/abc.txt
/home/user/one/three/def.txt
/home/user/one/four/ghi.txt
I'm trying to find ghi.txt recursively using the pathlib module. I tried:
p = '/home/user/'
f = Path(p).rglob('*i.txt')
but the only way I can get the filename is by using a list comprehension:
file = [str(i) for i in f]
which actually only works once. Re-running the command above returns an empty list.
I decided to learn pathlib because apparently it's what is recommended by the community, but isn't:
file = glob.glob(os.path.join(p,'**/*i.txt'),recursive=True)
much more straightforward?
You already have a solution.
Not sure if I read the requirements correctly, but I have posted this answer with the following assumptions:
PathListPrefix is the starting path beneath which you want to search all the files. In your case it might be '/home/user'.
FileName is the name of the file that you are looking for. In your case it is 'ghi.txt'.
You are not expecting more than one match.
Something other than the pathlib module has to be tried.
As far as straightforwardness is concerned, I am not sure about that either. However, the solution below is what I could come up with using the os module.
import os
PathListPrefix = '/home/user/'
FileName = 'ghi.txt'
def Search(StartingPath, FileNameToSearch):
    root, ext = os.path.splitext(StartingPath)
    FileName = os.path.split(root)[1]
    if FileName + ext == FileNameToSearch:
        return True
    return False

for root, dirs, files in os.walk(PathListPrefix):
    Path = root
    for eachfile in files:
        SearchPath = os.path.join(Path, eachfile)
        Found = Search(SearchPath, FileName)
        if Found:
            print(Path, FileName)
            break
This is definitely many more lines of code than yours, so it is not as compact.
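For reference, if you do stay with pathlib, a minimal sketch of the same search (assuming only the first match is needed, and keeping the '/home/user/' starting path) could be:
from pathlib import Path

p = '/home/user/'
# rglob returns a generator, so materialise it (or grab the first hit) before reusing it
matches = list(Path(p).rglob('ghi.txt'))
if matches:
    print(matches[0])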
I'm looking to rename my files from time to time in numerical order, so for example 1.png, 2.png, 3.png, etc.
I wrote this code in an attempt to do so; for now it just prints what the files would be named, so I can check that it's right:
import os
os.chdir('/Users/hasso/Pictures/Digital Art/saved images for vid/1')
for f in os.listdir():
    f_name = 1
    f_ext = '.png'
    print('{}{}'.format(f_name, f_ext))
How would I go about solving this?
You keep on getting 1.png suggested as the new name because you always set f_name = 1 inside the loop. Initialize it with 1 before the loop, and then increment it as you are renaming each file instead.
A few additional points:
You don't need os.chdir: . (the current dir) is the default, but you can also supply the target path directly to os.listdir.
When dealing with user home directories, it's nice if you don't have to hardcode it. os.path.expanduser retrieves this value for you.
When iterating over lists that you possibly want to change, it's best to make a separate list of only the items that you want to change. So, rather than looping over all files and changing some of them, make it easier by first gathering all items that you want to change. In your case, make a list of only .png files and then loop over this list.
(Rather advanced) os.rename will throw an error if you try to rename to an already existing name. What I do below is check if the next name to be used is already in the list, and if it is, increase the f_name number.
import os

yourPath = os.path.expanduser('~') + '/Pictures/Digital Art/saved images for vid/1'

filelist = []
for f in os.listdir(yourPath):
    if f.lower().endswith('.png'):
        filelist.append(f)

f_name = 1
for f in filelist:
    while True:
        next_name = str(f_name) + '.png'
        if next_name not in filelist:
            break
        f_name += 1
    print('Renaming {} to {}'.format(yourPath + '/' + f, next_name))
    # os.rename(yourPath + '/' + f, yourPath + '/' + next_name)
    f_name += 1
I'm not sure why you'd use os.chdir() to change directories when you can just pass the path straight to os.listdir(). To rename files, you can use os.rename(). You also need to increment the counter for the file names, since your current code keeps f_name equal to 1 on every iteration; keep the counter outside the loop and increment it within the loop. This is where you can use enumerate(), since it gives you an index automatically.
Basic version:
from os import listdir
from os import rename
from os.path import join
path = "path_to_images"
for i, f in enumerate(listdir(path), start=1):
    rename(join(path, f), join(path, str(i) + '.png'))
You can get the full paths using os.path.join(), since os.listdir() doesn't include the full path of the file. The above code is also not very robust, as it renames all files and doesn't handle .png files that already exist with the target names.
Advanced version:
from os import listdir
from os import rename
from os.path import join
from os.path import exists
path = "path_to_images"
extension = '.png'
fname = 1
for f in listdir(path):
    if f.endswith(extension):
        while exists(join(path, str(fname) + extension)):
            fname += 1
        rename(join(path, f), join(path, str(fname) + extension))
        fname += 1
This uses os.path.exists() to check whether the target file already exists.
How do I get the data from multiple .txt files placed in a specific folder? I started with the code below but could not fix it. It gives an error like 'No such file or directory: '.idea'' (??)
(Let's say I have a folder A containing x.txt, y.txt, z.txt and so on. I am trying to read and print the information from all of them.)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print data

find_get('filex')
Thanks.
If you just want to print each line:
import glob
import os
def find_get(path):
    for f in glob.glob(os.path.join(path, "*.txt")):
        with open(f) as data:
            for line in data:
                print(line)
glob will find only the .txt files in the specified path.
Your error comes from not joining the path to the filename: unless the file was in the same directory you were running the code from, Python would not be able to find it without the full path. Another issue is that you seem to have a directory named .idea, which would also give you an error when you try to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger, I would avoid reading them all into memory and/or storing the full content.
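As a rough sketch of what that might look like, a generator that yields lines lazily instead of building one big string (the iter_lines name and the 'filex' folder are just carried over for illustration):
import glob
import os

def iter_lines(path):
    # Yield lines one at a time so no file is ever read fully into memory.
    for name in glob.glob(os.path.join(path, "*.txt")):
        with open(name) as data:
            for line in data:
                yield line

for line in iter_lines('filex'):
    print(line)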
First of all, make sure you add the folder name to the file name, so the file can be found relative to where the script is executed.
To do so, you want to use os.path.join, which, as its name suggests, joins paths. So, using a generator:
import os

def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()

# this consumes the generator into a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read())

# this consumes the generator into a dict
files_data = dict(find_get('filex'))
You will now have a mapping from each file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is well suited to this case.
The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        ...  # do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
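For illustration, a rough sketch that addresses those points (the error-handling policy shown, skipping unreadable entries and printing a message, is just one possible choice, and 'filex' is carried over from the question):
import os

def find_get(directory):
    contents = {}
    for filename in os.listdir(directory):
        pathname = os.path.join(directory, filename)
        try:
            # 'with' guarantees the file is closed, even if reading fails partway
            with open(pathname) as f:
                contents[filename] = f.read()
        except (IOError, OSError) as e:
            # directories, unreadable files, files removed since listdir, ...
            print('skipping {}: {}'.format(pathname, e))
    return contents

print(find_get('filex'))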
Full variant:
import os
def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            with open(os.path.join(path, file), "r") as data:
                files[file] = data.read()
    return files

print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
Afterwards you could generate one file from that content, etc.
Key things:
os.listdir returns a list of names without the full path, so you need to join the initial path with each item before operating on it.
A dict is ideal for collecting the results :)
os.listdir returns both files and folders, so you need to check whether each item is really a file.
You should check that the item is actually a file and not a folder, since you can't open folders for reading. Also, you can't just open a relative path, since the file is under a folder; build the correct path with os.path.join. Check below:
import os
def find_get(folder):
    for file in os.listdir(folder):
        if not os.path.isfile(os.path.join(folder, file)):
            continue  # skip other directories
        f = open(os.path.join(folder, file), 'r')
        for line in f:
            print line
I have a script that downloads files (pdfs, docs, etc) from a predetermined list of web pages. I want to edit my script to alter the names of files with a trailing _x if the file name already exists, since it's possible files from different pages will share the same filename but contain different contents, and urlretrieve() appears to automatically overwrite existing files.
So far, I have:
urlfile = 'https://www.foo.com/foo/foo/foo.pdf'
filename = urlfile.split('/')[-1]
# filename is now 'foo.pdf'
if os.path.exists(filename):
    filename = filename.split('.')[0] + '_1.pdf'
That works fine for one occurrence, but it looks like after one foo_1.pdf it will start saving as foo_1_1.pdf, and so on. I would like to save the files as foo_1.pdf, foo_2.pdf, and so on.
Can anybody point me in the right direction on how I can ensure that file names are stored correctly as the script runs?
Thanks.
So what you want is something like this:
curName = "foo_0.pdf"
while os.path.exists(curName):
    num = int(curName.split('.')[0].split('_')[1])
    curName = "foo_{}.pdf".format(str(num + 1))
Here's the general scheme:
Assume you start from the first file name (foo_0.pdf)
Check if that name is taken
If it is, iterate the name by 1
Continue looping until you find a name that isn't taken
One alternative: generate a list of file numbers that are in use, and update it as needed. If it's sorted, you can say name = "foo_{}.pdf".format(flist[-1]+1). This has the advantage that you don't have to run through all the files every time (as the above solution does), but you need to keep the list of numbers in memory. Additionally, this will not fill any gaps in the numbering.
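A minimal sketch of that alternative (flist is the name from the sentence above; the 'foo_N.pdf' pattern and the current directory are just illustrative):
import os
import re

# Collect the numbers already in use, e.g. 3 from foo_3.pdf.
flist = sorted(int(m.group(1))
               for m in (re.match(r'foo_(\d+)\.pdf$', f) for f in os.listdir('.'))
               if m)

next_num = flist[-1] + 1 if flist else 0
name = "foo_{}.pdf".format(next_num)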
Why not just use the tempfile module:
fileobj = tempfile.NamedTemporaryFile(suffix='.pdf', prefix='', delete=False)
Now your filename will be available in fileobj.name and you can manipulate it to your heart's content. As an added benefit, this is cross-platform.
Since you're dealing with multiple pages, this seems more like a "global archive" than a per-page archive. For a per-page archive, I would go with the answer from @wnnmaw.
For a global archive, I would take a different approach...
Create a directory for each filename
Store the file in the directory as "1" + extension
write the current "number" to the directory as "_files.txt"
additional files are written as 2, 3, 4, etc., and the value in _files.txt is incremented each time
The benefits of this:
The directory is the original filename. If you keep turning "Example-1.pdf" into "Example-2.pdf" you run into a possibility where you download a real "Example-2.pdf", and can't associate it to the original filename.
You can grab the number of like-named files either by reading _files.txt or counting the number of files in the directory.
Personally, I'd also suggest storing the files in a tiered bucketing system, so that you don't have too many files/directories in any one directory (hundreds of files makes it annoying as a user; thousands of files can affect OS performance). A bucketing system might turn a filename into a hexdigest, then drop the file into "/%s/%s/%s" % (hex[0:3], hex[3:6], filename). The hexdigest is used to give you a more even distribution of characters.
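A rough sketch of that bucketing idea (the archive root argument, the helper names and the choice of MD5 are just illustrative):
import hashlib
import os
import shutil

def bucket_path(root, filename):
    # Hash the filename so files spread evenly across the bucket directories.
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    return os.path.join(root, digest[0:3], digest[3:6], filename)

def archive_file(root, src, filename):
    dest = bucket_path(root, filename)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy(src, dest)
    return dest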
import os
def uniquify(path, sep=''):
    path = os.path.normpath(path)
    num = 0
    newpath = path
    dirname, basename = os.path.split(path)
    filename, ext = os.path.splitext(basename)
    while os.path.exists(newpath):
        newpath = os.path.join(dirname, '{f}{s}{n:d}{e}'.format(
            f=filename, s=sep, n=num, e=ext))
        num += 1
    return newpath
filename = uniquify('foo.pdf', sep='_')
Possible problems with this include:
If you call uniquify many thousands of times with the same path, each subsequent call may get a bit slower, since the while-loop starts checking from num=0 each time.
uniquify is vulnerable to race conditions whereby a file may not exist at the time os.path.exists is called, but may exist at the time you use the value returned by uniquify. Use tempfile.NamedTemporaryFile to avoid this problem. You won't get incremental numbering, but you will get files with unique names, guaranteed not to already exist. You could use the prefix parameter to specify the original name of the file. For example,
import tempfile
import os
def uniquify(path, sep='_', mode='w'):
    path = os.path.normpath(path)
    if os.path.exists(path):
        dirname, basename = os.path.split(path)
        filename, ext = os.path.splitext(basename)
        return tempfile.NamedTemporaryFile(prefix=filename + sep, suffix=ext,
                                           delete=False, dir=dirname, mode=mode)
    else:
        return open(path, mode)
Which could be used like this:
In [141]: f = uniquify('/tmp/foo.pdf')
In [142]: f.name
Out[142]: '/tmp/foo_34cvy1.pdf'
Note that to prevent a race-condition, the opened filehandle -- not merely the name of the file -- is returned.