Separation of files from one folder to another based on their filenames - Python

My filenames follow a pattern like 29_11_2019_17_05_17_1050_R__2.png and 29_11_2019_17_05_17_1550_2.
I want to write a function which separates these files and puts them in different folders.
My code is below, but it's not working. Can you help me with this?
def sort_photos(folder, dir_name):
    for filename in os.listdir(folder):
        wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
        for x in wavelengths:
            if x == "1550_R_":
                if re.match(r'.*x.*', filename):
                    filesrc = os.path.join(folder, filename)
                    shutil.copy(filesrc, dir_name)
                    print("Filename has 'x' in it, do something")
                    print("file 1550_R_ copied")
                    # cv2.imwrite(dir_name,filename)
                else:
                    print("filename doesn't have '1550_R_' in it (so maybe it only has 'N' in it)")

In order to construct a RegEx using a variable, you can use string-interpolation.
In Python3.6+, you can use f-strings to accomplish this. In this case, the condition for your second if statement could be:
if re.match(fr'.*{x}.*', filename) is not None:
instead of:
if re.match(r'.*x.*', filename) is not None:
which would only ever match filenames with a literal 'x' in them. I think that is the immediate (though not necessarily the only) problem in your example code.
Footnote
Earlier versions of Python do string interpolation differently; the oldest (AFAIK) is %-formatting, e.g.:
if re.match(r".*%s.*" % x, filename) is not None:
See the Python documentation on string formatting for more detail.
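For illustration, here is a minimal sketch of the whole loop with that fix applied (the structure and names are taken from the question; treat it as a sketch, not a drop-in solution):
import os
import re
import shutil

def sort_photos(folder, dir_name):
    wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
    for filename in os.listdir(folder):
        for x in wavelengths:
            # interpolate the current tag into the pattern;
            # re.escape(x) would be safer if a tag ever contained regex metacharacters
            if re.match(fr'.*{x}.*', filename) is not None:
                shutil.copy(os.path.join(folder, filename), dir_name)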

I am not very clear about which problem you are encountering.
However, here are two suggestions:
To detect the character 'x' in a file name, you can just use:
if 'x' in filename:
...
If you only intend to move the files, a file-existence check should be added:
if os.path.isfile(name):
...
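Combining both suggestions, a minimal sketch might look like this (folder and dir_name are assumed from the question):
import os
import shutil

def sort_photos(folder, dir_name):
    for filename in os.listdir(folder):
        src = os.path.join(folder, filename)
        # plain substring test instead of a regex, plus a file check before copying
        if '1550_R_' in filename and os.path.isfile(src):
            shutil.copy(src, dir_name)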

I didn't have much time, so I've edited your function into something that acts very close to what you wanted. It reads the file names and copies each file into a separate directory named after its wavelength tag. Note that it currently cannot differentiate between '1550_' and '1550_R_', since '1550_R_' contains '1550_'; you can add a conditional for that in a few lines (a sketch follows after the code). If you don't, it will create both a '1550_' and a '1550_R_' directory and copy files eligible for either into both folders.
One final note: to keep things simple, the destination folders are created right where your files are located. You can easily change that with a few more lines.
import os
import shutil

def sort_photos(folder):
    wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
    for filename in os.listdir(folder):
        for x in wavelengths:
            if x in filename:
                filesrc = os.path.join(folder, filename)
                path = './' + x + '/'
                if not os.path.exists(path):
                    os.mkdir(path)
                shutil.copy(filesrc, path + filename)

########## USAGE: sort_photos(folder), for example, go to the folder where all the files are located:
sort_photos('./')
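As a sketch of the conditional mentioned above: because the list already puts the more specific tag '1550_R_' before '1550_', breaking out of the inner loop after the first match is enough to keep an 'R' file out of the plain '1550_' folder (this is my addition, not part of the original answer):
        for x in wavelengths:
            if x in filename:
                filesrc = os.path.join(folder, filename)
                path = './' + x + '/'
                if not os.path.exists(path):
                    os.mkdir(path)
                shutil.copy(filesrc, path + filename)
                break  # most specific tag matched; don't also copy to '1550_'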

Related

Python Delete Files in Directory from list in Text file

I've searched through many answers on deleting multiple files based on certain parameters (e.g. all txt files). Unfortunately, I haven't seen anything where one has a longish list of files saved to a .txt (or .csv) file and wants to use that list to delete files from the working directory.
I have my current working directory set to where the .txt file is (text file with list of files for deletion, one on each row) as well as the ~4000 .xlsx files. Of the xlsx files, there are ~3000 I want to delete (listed in the .txt file).
This is what I have done so far:
import os

path = "c:\\Users\\SFMe\\Desktop\\DeleteFolder"
os.chdir(path)

list = open('DeleteFiles.txt')
for f in list:
    os.remove(f)
This gives me the error:
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'Test1.xlsx\n'
I feel like I'm missing something simple. Any help would be greatly appreciated!
Thanks
1. Strip the ending '\n' from each line read from the text file;
2. Make an absolute path by joining path with the file name;
3. Do not overwrite Python types (in your case, list);
4. Close the text file, or use with open('DeleteFiles.txt') as flist.
EDIT: Actually, upon looking at your code, due to os.chdir(path), the second point may not be necessary.
import os

path = "c:\\Users\\SFMe\\Desktop\\DeleteFolder"
os.chdir(path)

flist = open('DeleteFiles.txt')
for f in flist:
    fname = f.rstrip()  # or, depending on the situation: f.rstrip('\n')
    # or, if you get rid of os.chdir(path) above:
    # fname = os.path.join(path, f.rstrip())
    if os.path.isfile(fname):  # this makes the code more robust
        os.remove(fname)

# also, don't forget to close the text file:
flist.close()
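For completeness, here is a sketch of the with variant from the fourth point, which closes the file automatically:
import os

path = "c:\\Users\\SFMe\\Desktop\\DeleteFolder"
os.chdir(path)

with open('DeleteFiles.txt') as flist:
    for f in flist:
        fname = f.rstrip()
        if os.path.isfile(fname):
            os.remove(fname)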
As Henry Yik pointed out in the comments, you need to pass the full path to the os.remove function. Also, the open function just returns the file object; you need to read the lines from the file. And don't forget to close the file. A solution would be:
import os

path = "c:\\Users\\SFMe\\Desktop\\DeleteFolder"
os.chdir(path)

# the "r" argument indicates read-only mode
list_file = open('DeleteFiles.txt', "r")
# the variable is named _list so it does not shadow
# the built-in type list
_list = list_file.read().splitlines()
list_file.close()

for f in _list:
    os.remove(os.path.join(path, f))
A further improvement would be to use a list comprehension instead of the loop, and a with block, which "automagically" closes the file for us:
with open('DeleteFiles.txt', "r") as list_file:
    _list = list_file.read().splitlines()
[os.remove(os.path.join(path, f)) for f in _list]

Changing name of file until it is unique

I have a script that downloads files (pdfs, docs, etc) from a predetermined list of web pages. I want to edit my script to alter the names of files with a trailing _x if the file name already exists, since it's possible files from different pages will share the same filename but contain different contents, and urlretrieve() appears to automatically overwrite existing files.
So far, I have:
urlfile = 'https://www.foo.com/foo/foo/foo.pdf'
filename = urlfile.split('/')[-1]   # filename == 'foo.pdf'
if os.path.exists(filename):
    filename = filename.split('.')[0] + '_' + '1' + '.pdf'
That works fine for one occurrence, but it looks like after one foo_1.pdf it will start saving files as foo_1_1.pdf, and so on. I would like the files saved as foo_1.pdf, foo_2.pdf, and so on.
Can anybody point me in the right direction on how I can ensure that file names are stored in the correct fashion as the script runs?
Thanks.
So what you want is something like this:
curName = "foo_0.pdf"
while os.path.exists(curName):
    num = int(curName.split('.')[0].split('_')[1])
    curName = "foo_{}.pdf".format(num + 1)
Here's the general scheme:
Assume you start from the first file name (foo_0.pdf)
Check if that name is taken
If it is, increment the number by 1
Continue looping until you find a name that isn't taken
One alternative: generate a list of the file numbers that are in use, and update it as needed. If the list is kept sorted, you can say name = "foo_{}.pdf".format(flist[-1] + 1). This has the advantage that you don't have to run through all the files every time (as the above solution does). However, you need to keep the list of numbers in memory, and this will not fill any gaps in the numbers. A sketch of this approach follows.
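Here is a minimal sketch of that alternative, assuming every existing file follows the foo_N.pdf naming (the helper names are mine):
import os
import re

# collect the numbers already in use, e.g. foo_0.pdf, foo_3.pdf -> [0, 3]
flist = sorted(
    int(m.group(1))
    for m in (re.match(r'foo_(\d+)\.pdf$', f) for f in os.listdir('.'))
    if m
)

# next name: one past the highest number in use (gaps are not filled)
next_num = flist[-1] + 1 if flist else 0
name = "foo_{}.pdf".format(next_num)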
Why not just use the tempfile module:
import tempfile

fileobj = tempfile.NamedTemporaryFile(suffix='.pdf', prefix='', delete=False)
Now your filename will be available in fileobj.name, and you can manipulate it to your heart's content. As an added benefit, this is cross-platform.
Since you're dealing with multiple pages, this seems more like a "global archive" than a per-page archive. For a per-page archive, I would go with the answer from @wnnmaw.
For a global archive, I would take a different approach...
Create a directory for each filename
Store the file in that directory as "1" + extension
Write the current "number" to the directory as "_files.txt"
Write additional files as 2, 3, 4, etc., and increment the value in _files.txt
The benefits of this:
The directory is the original filename. If you keep turning "Example-1.pdf" into "Example-2.pdf" you run into a possibility where you download a real "Example-2.pdf", and can't associate it to the original filename.
You can grab the number of like-named files either by reading _files.txt or counting the number of files in the directory.
Personally, I'd also suggest storing the files in a tiered bucketing system, so that you don't have too many files/directories in any one directory (hundreds of files is annoying for a user; thousands of files can affect OS performance). A bucketing system might turn a filename into a hexdigest, then drop the file into "/%s/%s/%s" % (hex[0:3], hex[3:6], filename). The hexdigest is used to give you a more even distribution of characters.
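A minimal sketch of that bucketing scheme, assuming an MD5 hexdigest of the filename (the choice of hash is my assumption):
import hashlib
import os

def bucketed_path(root, filename):
    # hash the filename to get an evenly distributed hex string
    hexdigest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    # tiered layout: root/abc/def/filename
    return os.path.join(root, hexdigest[0:3], hexdigest[3:6], filename)

path = bucketed_path('/archive', 'Example-1.pdf')
os.makedirs(os.path.dirname(path), exist_ok=True)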
import os

def uniquify(path, sep=''):
    path = os.path.normpath(path)
    num = 0
    newpath = path
    dirname, basename = os.path.split(path)
    filename, ext = os.path.splitext(basename)
    while os.path.exists(newpath):
        newpath = os.path.join(
            dirname, '{f}{s}{n:d}{e}'.format(f=filename, s=sep, n=num, e=ext))
        num += 1
    return newpath

filename = uniquify('foo.pdf', sep='_')
Possible problems with this include:
If you call uniquify many thousands of times with the same path, each subsequent call may get a bit slower, since the while-loop starts checking from num=0 each time.
uniquify is vulnerable to race conditions, whereby a file may not exist at the time os.path.exists is called but may exist at the time you use the value returned by uniquify. Use tempfile.NamedTemporaryFile to avoid this problem. You won't get incremental numbering, but you will get files with unique names, guaranteed not to already exist. You could use the prefix parameter to specify the original name of the file. For example,
import tempfile
import os

def uniquify(path, sep='_', mode='w'):
    path = os.path.normpath(path)
    if os.path.exists(path):
        dirname, basename = os.path.split(path)
        filename, ext = os.path.splitext(basename)
        return tempfile.NamedTemporaryFile(prefix=filename + sep, suffix=ext,
                                           delete=False, dir=dirname, mode=mode)
    else:
        return open(path, mode)
Which could be used like this:
In [141]: f = uniquify('/tmp/foo.pdf')
In [142]: f.name
Out[142]: '/tmp/foo_34cvy1.pdf'
Note that to prevent a race-condition, the opened filehandle -- not merely the name of the file -- is returned.

Python: rename all files in a folder using the numbers the files contain

I want to write a little script for managing a bunch of files I've got. Those files have complex and varied names, but they all contain a number somewhere in the name. I want to take that number and place it in front of the file name, so the files can be listed logically in my filesystem.
I got a list of all those files using os.listdir, but I'm struggling to find a way to locate the numbers within those names. I've looked at regular expressions, but I'm unsure whether that's the right way to do this.
example:
import os
files = os.listdir('c:\\folder')
files
['xyz3.txt', '2xyz.txt', 'x1yz.txt']
So basically, what I ultimately want is:
1xyz.txt
2xyz.txt
3xyz.txt
Where I am stuck so far is finding those numbers (1, 2, 3) in the list files.
This (untested) snippet should show the regexp approach. The search method of compiled patterns is used to look for the number. If found, the number is moved to the front of the file name.
import os, re

NUM_RE = re.compile(r'\d+')
for name in os.listdir('.'):
    match = NUM_RE.search(name)
    if match is None or match.start() == 0:
        continue  # no number, or number already at start
    newname = match.group(0) + name[:match.start()] + name[match.end():]
    print('renaming', name, 'to', newname)
    #os.rename(name, newname)
If this code is used in production and not as a homework assignment, a useful improvement would be to parse match.group(0) as an integer and format it with a number of leading zeros. That way foo2.txt would become 02foo.txt and be sorted before 12bar.txt. Implementing this is left as an exercise to the reader, though a possible sketch follows.
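One possible version of that improvement (the pad width of 2 is an assumption):
import os, re

NUM_RE = re.compile(r'\d+')
PAD = 2  # assumed width: 'foo2.txt' -> '02foo.txt'

for name in os.listdir('.'):
    match = NUM_RE.search(name)
    if match is None or match.start() == 0:
        continue
    number = '{:0{width}d}'.format(int(match.group(0)), width=PAD)
    newname = number + name[:match.start()] + name[match.end():]
    #os.rename(name, newname)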
Assuming that the numbers in your file names are integers (untested code):
import os

def rename(dirpath, filename):
    inds = [i for i, char in enumerate(filename) if char in '1234567890']
    ints = filename[min(inds):max(inds) + 1]
    newname = ints + filename[:min(inds)] + filename[max(inds) + 1:]
    os.rename(os.path.join(dirpath, filename), os.path.join(dirpath, newname))

def renameFilesInDir(dirpath):
    """ Apply your renaming scheme to all files in the directory specified by dirpath """
    dirpath, dirnames, filenames = next(os.walk(dirpath))
    for filename in filenames:
        rename(dirpath, filename)
    for dirname in dirnames:
        renameFilesInDir(os.path.join(dirpath, dirname))
Hope this helps

Delete multiple files matching a pattern

I have made an online gallery using Python and Django. I've just started to add editing functionality, starting with a rotation. I use sorl.thumbnail to auto-generate thumbnails on demand.
When I edit the original file, I need to clean up all the thumbnails so new ones are generated. There are three or four of them per image (I have different ones for different occasions).
I could hard-code the file variants... but that's messy, and if I change the way I do things, I'll need to revisit the code.
Ideally I'd like to do a regex-delete. In regex terms, all my originals are named like so:
^(?P<photo_id>\d+)\.jpg$
So I want to delete:
^(?P<photo_id>\d+)[^\d].*jpg$
(Where I replace photo_id with the ID I want to clean.)
Using the glob module:
import glob, os

for f in glob.glob("P*.jpg"):
    os.remove(f)
Alternatively, using pathlib:
from pathlib import Path

for p in Path(".").glob("P*.jpg"):
    p.unlink()
Try something like this:
import os, re

def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
Then you would pass the directory containing the files and the pattern you wish to match.
If you need recursion into several subdirectories, you can use this method:
import os, re

pattern = r"^(?P<photo_id>\d+)[^\d].*jpg$"
mypath = "Photos"
for root, dirs, files in os.walk(mypath):
    for file in filter(lambda x: re.match(pattern, x), files):
        os.remove(os.path.join(root, file))
You can safely remove subdirectories on the fly from dirs, which contains the list of the subdirectories to visit at each node.
Note that if you are in a single directory, you can also get files matching a simple wildcard expression with glob.glob(pattern). In that case you would have to subtract the set of files to keep from the whole set (a sketch follows), so the code above is more efficient.
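A sketch of that set-subtraction idea (the photo_id value, and the assumption that only the bare original should be kept, are mine):
import glob, os

photo_id = "123"  # assumed ID to clean
all_matches = set(glob.glob("%s*.jpg" % photo_id))  # wildcard, so this also matches e.g. '1234.jpg'
keep = {"%s.jpg" % photo_id}                        # the original file itself
for f in all_matches - keep:
    os.remove(f)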
How about this?
import glob, os, multiprocessing
p = multiprocessing.Pool(4)
p.map(os.remove, glob.glob("P*.jpg"))
Mind you this does not do recursion and uses wildcards (not regex).
UPDATE
In Python 3 the built-in map() function returns an iterator rather than a list; note, however, that multiprocessing.Pool.map still returns a list, so the snippet above works as-is. If you want lazy, memory-efficient iteration instead, use p.imap, and remember to consume it so the removals actually run:
...
list(p.imap(os.remove, glob.glob("P*.jpg")))
I agree it's not the most functional way, but it's concise and does the job.
It's not clear to me that you actually want to do any named-group matching -- in the use you describe, the photoid is an input to the deletion function, and named groups' purpose is "output", i.e., extracting certain substrings from the matched string (and accessing them by name in the match object). So, I would recommend a simpler approach:
import re
import os

def delete_thumbnails(photoid, photodirroot):
    matcher = re.compile(r'^%s\d+\D.*jpg$' % photoid)
    numdeleted = 0
    for rootdir, subdirs, filenames in os.walk(photodirroot):
        for name in filenames:
            if not matcher.match(name):
                continue
            path = os.path.join(rootdir, name)
            os.remove(path)
            numdeleted += 1
    return "Deleted %d thumbnails for %r" % (numdeleted, photoid)
You can pass the photoid as a normal string, or as an RE pattern piece if you need to remove several matchable IDs at once (e.g., r'abc[def]' to remove abcd, abce, and abcf in a single call) -- that's the reason I'm inserting it literally in the RE pattern, rather than inserting the string re.escape(photoid) as would be normal practice. Certain parts, such as counting the number of deletions and returning an informative message at the end, are obviously frills which you should remove if they give you no added value in your use case.
Others, such as the "if not ... // continue" pattern, are highly recommended practice in Python (flat is better than nested: bailing out to the next leg of the loop as soon as you determine there is nothing to do on this one is better than nesting the actions to be done within an if), although of course other arrangements of the code would work too.
My recommendation:
import os, re

def purge(dir, pattern, inclusive=True):
    regexObj = re.compile(pattern)
    for root, dirs, files in os.walk(dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if bool(regexObj.search(path)) == bool(inclusive):
                os.remove(path)
        for name in dirs:
            path = os.path.join(root, name)
            if len(os.listdir(path)) == 0:
                os.rmdir(path)
This will recursively remove every file that matches the pattern when inclusive is true (the default), and every file that doesn't match when inclusive is false. It will then remove any empty folders from the directory tree.
import os

def main():
    mypath = "<Path to Root Folder to work within>"
    for root, dirs, files in os.walk(mypath):
        for file in files:
            p = os.path.join(root, file)
            if os.path.isfile(p):
                if p[-4:] == ".jpg":  # or any pattern you want
                    os.remove(p)
I find Popen(["rm " + file_name + "*.ext"], shell=True, stdout=PIPE).communicate() to be a much simpler solution to this problem. Although this is prone to injection attacks, I don't see any issues if your program is using this internally.
import os, re

def recursive_purge(dir, pattern):
    for f in os.listdir(dir):
        if os.path.isdir(os.path.join(dir, f)):
            recursive_purge(os.path.join(dir, f), pattern)
        elif re.search(pattern, os.path.join(dir, f)):
            os.remove(os.path.join(dir, f))

Is this the best way to get unique version of filename w/ Python?

Still 'diving in' to Python, and want to make sure I'm not overlooking something. I wrote a script that extracts files from several zip files, and saves the extracted files together in one directory. To prevent duplicate filenames from being over-written, I wrote this little function - and I'm just wondering if there is a better way to do this?
Thanks!
import os

def unique_filename(file_name):
    counter = 1
    file_name_parts = os.path.splitext(file_name)  # returns ('/path/file', '.ext')
    while os.path.isfile(file_name):
        file_name = file_name_parts[0] + '_' + str(counter) + file_name_parts[1]
        counter += 1
    return file_name
I really do require the files to be in a single directory, and numbering duplicates is definitely acceptable in my case, so I'm not looking for a more robust method (tho' I suppose any pointers are welcome), but just to make sure that what this accomplishes is getting done the right way.
One issue is that there is a race condition in your code above, since there is a gap between testing for existence and creating the file. There may be security implications to this (think of someone maliciously inserting a symlink to a sensitive file which they wouldn't be able to overwrite, but which your program, running with higher privileges, could). Attacks like these are why things like os.tempnam() are deprecated.
To get around it, the best approach is to actually try to create the file in such a way that you'll get an exception if it fails, and on success, return the actually opened file object. This can be done with the lower-level os.open function, by passing both the os.O_CREAT and os.O_EXCL flags. Once opened, return the actual file (and optionally filename) you create. E.g., here's your code modified to use this approach (returning a (file, filename) tuple):
import os

def unique_file(file_name):
    counter = 1
    file_name_parts = os.path.splitext(file_name)  # returns ('/path/file', '.ext')
    while 1:
        try:
            fd = os.open(file_name, os.O_CREAT | os.O_EXCL | os.O_RDWR)
            return os.fdopen(fd), file_name
        except OSError:
            pass
        file_name = file_name_parts[0] + '_' + str(counter) + file_name_parts[1]
        counter += 1
[Edit] Actually, a better way, which will handle the above issues for you, is probably to use the tempfile module, though you may lose some control over the naming. Here's an example of using it (keeping a similar interface):
import os
import tempfile

def unique_file(file_name):
    dirname, filename = os.path.split(file_name)
    prefix, suffix = os.path.splitext(filename)
    fd, filename = tempfile.mkstemp(suffix, prefix + "_", dirname)
    return os.fdopen(fd), filename

>>> f, filename = unique_file('/home/some_dir/foo.txt')
>>> print(filename)
/home/some_dir/foo_z8f_2Z.txt
The only downside with this approach is that you will always get a filename with some random characters in it, as there's no attempt to create an unmodified file (/home/some_dir/foo.txt) first.
You may also want to look at tempfile.TemporaryFile and NamedTemporaryFile, which will do the above and also automatically delete from disk when closed.
Yes, this is a good strategy for readable but unique filenames.
One important change: You should replace os.path.isfile with os.path.lexists! As it is written right now, if there is a directory named /foo/bar.baz, your program will try to overwrite that with the new file (which won't work)... since isfile only checks for files and not directories. lexists checks for directories, symlinks, etc... basically if there's any reason that filename could not be created.
EDIT: @Brian gave a better answer, which is more secure and robust with respect to race conditions.
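Concretely, the suggested change is a one-line swap in the loop from the question:
while os.path.lexists(file_name):  # instead of os.path.isfile(file_name)
    file_name = file_name_parts[0] + '_' + str(counter) + file_name_parts[1]
    counter += 1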
Two small changes...
base_name, ext = os.path.splitext(file_name)
You get two results with distinct meanings; give them distinct names.
file_name = "%s_%d%s" % (base_name, counter, ext)
It isn't faster or significantly shorter. But when you want to change your file-name pattern, the pattern is in one place and slightly easier to work with.
If you want readable names, this looks like a good solution.
There are routines that return unique file names (e.g. for temp files), but they produce long, random-looking names.
If you don't care about readability, uuid.uuid4() is your friend.
import uuid

def unique_filename(prefix=None, suffix=None):
    fn = []
    if prefix:
        fn.extend([prefix, '-'])
    fn.append(str(uuid.uuid4()))
    if suffix:
        fn.extend(['.', suffix.lstrip('.')])
    return ''.join(fn)
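For example (the UUID shown is illustrative; each call produces a different one):
>>> unique_filename('report', '.txt')
'report-3f2504e0-4f89-41d3-9a0c-0305e82c3301.txt'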
How about
def ensure_unique_filename(orig_file_path):
    from time import time
    import os

    if os.path.lexists(orig_file_path):
        name, ext = os.path.splitext(orig_file_path)
        orig_file_path = name + str(time()).replace('.', '') + ext

    return orig_file_path
time() returns the current time in seconds since the epoch, as a float with sub-second precision. Combined with the original filename, the result is fairly unique even in complex multithreaded cases.
