Changing name of file until it is unique - python

I have a script that downloads files (pdfs, docs, etc) from a predetermined list of web pages. I want to edit my script to alter the names of files with a trailing _x if the file name already exists, since it's possible files from different pages will share the same filename but contain different contents, and urlretrieve() appears to automatically overwrite existing files.
So far, I have:
urlfile = 'https://www.foo.com/foo/foo/foo.pdf'
filename = urlfile.split('/')[-1]  # filename is now 'foo.pdf'
if os.path.exists(filename):
    name, ext = os.path.splitext(filename)
    filename = name + '_1' + ext
That works fine for one occurrence, but it looks like after one foo_1.pdf it will start saving as foo_1_1.pdf, and so on. I would like to save the files as foo_1.pdf, foo_2.pdf, and so on.
Can anybody point me in the right direction on how I can ensure that file names are stored in the correct fashion as the script runs?
Thanks.

So what you want is something like this:
curName = "foo_0.pdf"
while os.path.exists(curName):
    num = int(curName.split('.')[0].split('_')[1])
    curName = "foo_{}.pdf".format(num + 1)
Here's the general scheme:
Assume you start from the first file name (foo_0.pdf)
Check if that name is taken
If it is, iterate the name by 1
Continue looping until you find a name that isn't taken
One alternative: Generate a list of file numbers that are in use, and update it as needed. If it's sorted you can say name = "foo_{}.pdf".format(flist[-1]+1). This has the advantage that you don't have to run through all the files every time (as the above solution does). However, you need to keep the list of numbers in memory. Additionally, this will not fill any gaps in the numbers
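The in-memory alternative could be sketched roughly as follows. This is only an illustration: the helper name next_numbered_name and the module-level dictionary are made up here, and it assumes that only this script writes to the download directory and that the directory starts out without clashing names.

```python
import os
from collections import defaultdict

# Hypothetical bookkeeping: next free number for each (stem, extension) pair.
_next_number = defaultdict(int)

def next_numbered_name(filename):
    """Return filename on first use, then stem_1.ext, stem_2.ext, and so on."""
    stem, ext = os.path.splitext(filename)
    n = _next_number[(stem, ext)]
    _next_number[(stem, ext)] = n + 1
    return filename if n == 0 else "{}_{}{}".format(stem, n, ext)
```

Each call is a dictionary lookup, so no directory scan is needed, but the numbering is lost when the script exits.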

Why not just use the tempfile module:
fileobj = tempfile.NamedTemporaryFile(suffix='.pdf', prefix='', delete = False)
Now your filename will be available in fileobj.name and you can manipulate to your heart's content. As an added benefit, this is cross-platform.

Since you're dealing with multiple pages, this seems more like a "global archive" than a per-page archive. For a per-page archive, I would go with the answer from @wnnmaw.
For a global archive, I would take a different approach...
Create a directory for each filename
Store the file in the directory as "1" + extension
write the current "number" to the directory as "_files.txt"
additional files are written as 2,3,4,etc and increment the value in _files.txt
The benefits of this:
The directory is the original filename. If you keep turning "Example-1.pdf" into "Example-2.pdf" you run into a possibility where you download a real "Example-2.pdf", and can't associate it to the original filename.
You can grab the number of like-named files either by reading _files.txt or counting the number of files in the directory.
Personally, I'd also suggest storing the files in a tiered bucketing system, so that you don't have too many files/directories in any one directory (hundreds of files makes it annoying as a user, thousands of files can affect OS performance). A bucketing system might turn a filename into a hexdigest, then drop the file into "/%s/%s/%s" % (hex[0:3], hex[3:6], filename). The hexdigest is used to give you a more even distribution of characters.
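A minimal sketch of that bucketing idea, assuming SHA-256 as the hash and a hypothetical root directory called "archive" (neither is specified in the answer above):

```python
import hashlib
import os

def bucket_path(filename, root="archive"):
    """Build root/abc/def/filename, where abc and def come from a hash of the name."""
    # Hashing the name spreads files evenly over the buckets,
    # even when many names start with the same characters.
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[0:3], digest[3:6], filename)
```

The directories would still need to be created before saving, for example with os.makedirs(os.path.dirname(path), exist_ok=True).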

import os
def uniquify(path, sep=''):
    path = os.path.normpath(path)
    num = 0
    newpath = path
    dirname, basename = os.path.split(path)
    filename, ext = os.path.splitext(basename)
    while os.path.exists(newpath):
        newpath = os.path.join(dirname, '{f}{s}{n:d}{e}'
                               .format(f=filename, s=sep, n=num, e=ext))
        num += 1
    return newpath
filename = uniquify('foo.pdf', sep='_')
Possible problems with this include:
If you call uniquify many thousands of times with the same path, each subsequent call may get a bit slower, since the while-loop starts checking from num=0 each time.
uniquify is vulnerable to race conditions whereby a file may not exist at the time os.path.exists is called, but may exist by the time you use the value returned by uniquify. Use tempfile.NamedTemporaryFile to avoid this problem. You won't get incremental numbering, but you will get files with unique names, guaranteed not to already exist. You could use the prefix parameter to specify the original name of the file. For example,
import tempfile
import os
def uniquify(path, sep='_', mode='w'):
    path = os.path.normpath(path)
    if os.path.exists(path):
        dirname, basename = os.path.split(path)
        filename, ext = os.path.splitext(basename)
        return tempfile.NamedTemporaryFile(prefix=filename+sep, suffix=ext, delete=False,
                                           dir=dirname, mode=mode)
    else:
        return open(path, mode)
Which could be used like this:
In [141]: f = uniquify('/tmp/foo.pdf')
In [142]: f.name
Out[142]: '/tmp/foo_34cvy1.pdf'
Note that to prevent a race-condition, the opened filehandle -- not merely the name of the file -- is returned.
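If incremental numbering still matters, one possible middle ground is to let the open call itself do the existence check by using exclusive-creation mode "x" (available since Python 3.3). The following is a sketch, not part of the answer above; the name open_unique is made up for illustration.

```python
import os

def open_unique(path, sep="_", mode="x"):
    """Open a brand-new file: try path first, then stem_1.ext, stem_2.ext, ...

    Mode "x" raises FileExistsError if the file already exists, so the
    check and the creation happen in a single step.
    """
    stem, ext = os.path.splitext(path)
    candidate = path
    num = 1
    while True:
        try:
            return open(candidate, mode)
        except FileExistsError:
            candidate = "{}{}{}{}".format(stem, sep, num, ext)
            num += 1
```

As with the tempfile version, the open file object is returned and its name is available as f.name. For binary downloads, mode="xb" would be the natural choice.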

Related

How To Rename Files (Using Python OS Module)?

I have a folder which contains +100 songs named this way "Song Name, Singer Name" (e.g. Smooth Criminal, Michael Jackson). I'm trying to rename all the songs to "Song Name (Singer Name)" (e.g. Smooth Criminal (Michael Jackson)).
I tried this code. But, I didn't know what parameters to write.
import os
files = os.getcwd()
os.rename(files, "") # I'm confused because I don't know what to put here as parameters since I want to change only parts of the files' names, and not the files' names entirely.
Any suggestions on the parameters of "os.rename()"?
Unlike the command line program rename, which can rename a batch of files using a pattern in a single command, Python's os.rename() is a thin wrapper around the underlying rename syscall. It can thus rename only a single file at a time.
Assuming all songs are stored in the same directory and end with an extension like '.mp3', one approach is to loop over the return value of os.listdir().
Additionally, it would be wise to check that the current entry is indeed a file and not, say, a directory. This can be done using os.path.isfile() (note that it follows symbolic links and must be given the full path, not just the bare name).
Here is a full example:
import os
TARGET_DIR = "/path/to/some/folder"
for filename in os.listdir(TARGET_DIR):
    # Only consider regular files (isfile needs the full path, not just the name)
    if not os.path.isfile(os.path.join(TARGET_DIR, filename)):
        continue
    # Extract filename and extension
    basename, extension = os.path.splitext(filename)
    if ',' not in basename:
        continue  # Ignore
    # Extract the song and singer names, dropping the space after the comma
    songname, singername = (part.strip() for part in basename.split(',', 1))
    # Build the new name, rename
    target_name = f'{songname} ({singername}){extension}'
    os.rename(
        os.path.join(TARGET_DIR, filename),
        os.path.join(TARGET_DIR, target_name)
    )
Note: If the songs are potentially stored in subfolders, os.walk() will be a better candidate than the lower level os.listdir()
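For the subfolder case, a sketch with os.walk() might look like this. The function name rename_songs and the returned list of new names are choices made here for illustration only.

```python
import os

def rename_songs(top):
    """Rename 'Song, Singer.ext' to 'Song (Singer).ext' in top and all subfolders."""
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for filename in filenames:
            basename, extension = os.path.splitext(filename)
            if "," not in basename:
                continue  # not in the 'Song, Singer' pattern
            song, singer = (part.strip() for part in basename.split(",", 1))
            target = "{} ({}){}".format(song, singer, extension)
            os.rename(os.path.join(dirpath, filename),
                      os.path.join(dirpath, target))
            renamed.append(target)
    return renamed
```

os.walk() only lists regular files in filenames, so no separate isfile check is needed here.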
I would recommend writing a loop using os.
However, with pandas you can replace some name snippets with no problem, so I would recommend choosing pandas for this task!

Separation of files from one folder to another based on their filenames

My filenames have pattern like 29_11_2019_17_05_17_1050_R__2.png and 29_11_2019_17_05_17_1550_2
I want to write a function which separates these files and puts them in different folders.
Please find my code below, but it's not working.
Can you help me with this?
def sort_photos(folder, dir_name):
    for filename in os.listdir(folder):
        wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
        for x in wavelengths:
            if x == "1550_R_":
                if re.match(r'.*x.*', filename):
                    filesrc = os.path.join(folder, filename)
                    shutil.copy(filesrc, dir_name)
                    print("Filename has 'x' in it, do something")
                    print("file 1550_R_ copied")
                    # cv2.imwrite(dir_name,filename)
                else:
                    print("filename doesn't have '1550_R_' in it (so maybe it only has 'N' in it)")
In order to construct a RegEx using a variable, you can use string-interpolation.
In Python3.6+, you can use f-strings to accomplish this. In this case, the condition for your second if statement could be:
if re.match(fr'.*{x}.*', filename) is not None:
instead of:
if re.match(r'.*x.*', filename) is not None:
Which would only ever match filenames with an 'x' in them. I think this is the immediate (though not necessarily the only) problem in your example code.
Footnote
Earlier versions of Python do string interpolation differently, the oldest (AFAIK) is %-formatting, e.g:
if re.match(r".*%s.*" % x, filename) is not None:
Read here for more detail.
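One caveat when interpolating a variable into a pattern: characters such as '.' or '+' in the variable are interpreted as regex syntax. The wavelength strings in the question contain none of these, but in the general case re.escape() makes the value match literally. A small sketch (the helper name contains_token is made up here):

```python
import re

def contains_token(token, filename):
    """Return True if token occurs literally anywhere in filename."""
    # re.escape neutralises regex metacharacters in token;
    # re.search scans the whole string, so no leading '.*' is needed.
    return re.search(re.escape(token), filename) is not None
```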
I am not very clear about which problem you encountered.
However, there are two suggestions:
To detect character x in file name, you can just use:
if('x' in filename):
...
If you only intended to move the files, a file check should be added:
if os.path.isfile(name):
...
I didn't have much time, so I've edited your function into something that acts very close to what you wanted. It reads the file names and copies the files into separate directories named after the wavelengths. Currently it cannot differentiate between '1550_' and '1550_R_', since '1550_R_' includes '1550_'; you can add a conditional statement for that in a few lines. (If you do not, it will create the two directories '1550_' and '1550_R_', and files that contain '1550_R_' will be copied to both folders.)
One final note: to keep things simple, the destination folders are created in the current working directory, which in the usage below is also where your files are located. You can add a destination parameter in a few lines if you want.
import cv2
import os
import re
import shutil
def sort_photos(folder):
    wavelengths = ["1550_R_", "1550_", "1050_R_", "1200_"]
    for filename in os.listdir(folder):
        for x in wavelengths:
            if x in filename:
                filesrc = os.path.join(folder, filename)
                path = './'+x+'/'
                if not os.path.exists(path):
                    os.mkdir(path)
                shutil.copy(filesrc, path+filename)
                # print("Filename has 'x' in it, do something")
                # cv2.imwrite(dir_name,filename)
            # else:
            #     print("filename doesn't have 'A' in it (so maybe it only has 'N' in it)")
########## USAGE: sort_photos(folder), for example, go to the folder where all the files are located:
sort_photos('./')
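One way to tell '1550_' and '1550_R_' apart is to test the longer tags first and stop at the first hit. This is a sketch of that conditional, with a made-up helper name classify, not a tested drop-in for the function above:

```python
def classify(filename, wavelengths):
    """Return the longest wavelength tag contained in filename, or None."""
    # Longest tags first, so '1550_R_' wins over its substring '1550_'.
    for tag in sorted(wavelengths, key=len, reverse=True):
        if tag in filename:
            return tag
    return None
```

Inside sort_photos, the inner loop could then be replaced by a single call to classify, copying the file only to the directory for the returned tag.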

Run only if "if " statement is true.!

So I've a question, Like I'm reading the fits file and then i'm using the information from the header of the fits to define the other files which are related to the original fits file. But for some of the fits file, the other files (blaze_file, bis_file, ccf_table) are not available. And because of that my code gives the pretty obvious error that No Such file or directory.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        e2ds_hdu = fits.open(filename)
        e2ds_header = e2ds_hdu[0].header
        date = e2ds_header['DATE-OBS']
        date2 = date = date[0:19]
        blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
        bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
        ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
        if not all(file in os.listdir(PATH) for file in [blaze_file,bis_file,ccf_table]):
            continue
So what i want to do is like, i want to make my code run only if all the files are available otherwise don't. But the problem is that, i'm defining the other files as variable inside the for loop as i'm using the header information. So how can i define them before the for loop???? and then use something like
So can anyone help me out of this?
The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
    if filename.endswith("_e2ds_A.fits"):
        filepath = os.path.join(PATH, filename)
        e2ds_hdu = fits.open(filepath)
        …
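Let the filenames be ['a', 'b', 'a_e2ds_A.fits', 'b_e2ds_A.fits']. The code now excludes the first two names and then prepends the directory path to the remaining two.
a_e2ds_A.fits becomes home/Desktop/2d_spectra/a_e2ds_A.fits and
b_e2ds_A.fits becomes home/Desktop/2d_spectra/b_e2ds_A.fits.
Now they can be opened even when the current working directory is not 2d_spectra itself. Note that PATH as written is still a relative path (it has no leading separator), so it is resolved against the current working directory; use an absolute path if the script may be started from anywhere.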
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentioned only bites if you start the script from a path outside the said directory. Nevertheless, fixing it will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on information from that first file.
There are several ways to accomplish your goal:
1. Just extend your loop with the proper tests.
Pseudo code:
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if all files exist:
            proceed
or
for file in files:
    if file.endswith("fits"):
        open file
        read date from header
        create file names depending on date
        if not all files exist:
            continue # actual keyword, no pseudo code!
        proceed
2. Put some functionality into functions (variation of 1.)
3. Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
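The generator variant could be sketched as below. To keep it independent of astropy, the step "read the header and derive the related file names" is passed in as a function; the names complete_sets and related_names are hypothetical and not taken from the question.

```python
import os

def complete_sets(directory, suffix, related_names):
    """Yield (primary_path, related_paths) only when every related file exists.

    related_names is a caller-supplied function that maps the primary
    file's path to the names of the files it depends on.
    """
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(suffix):
            continue
        primary = os.path.join(directory, filename)
        related = [os.path.join(directory, name)
                   for name in related_names(primary)]
        if all(os.path.isfile(path) for path in related):
            yield primary, related
```

In the question's setting, related_names would open the FITS file, read DATE-OBS and the blaze entry from the header, and return the three dependent names; the consuming loop then only ever sees complete sets.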
If I am still missing some points or am not detailed enough, please let me know.
Since you have to read the fits file to know the names of the other dependent files, there's no way you can avoid reading the fits file first. The only thing you can do is test for the dependent files' existence before trying to read them, and skip the rest of the loop (using continue) if they are missing.
Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))

Prevent my file from being overwritten - python

i am currently creating a file on run of my application using the simple method
file = open('myfile.dat', 'w+')
However, I have noticed that this is overwriting the file on each run. What I want to do is, if it already exists, create a new file called myfilex.dat, where x is the number of previous copies of the file. Is there a quick and effective way of doing this?
Thanks :)
EDIT: I know how to check whether it already exists using the os.path.exists function, but I am asking, if it does exist, how I can easily append the number of versions on the end. Sorry if that does not make sense.
You could use a timestamp, so that each time you will execute the program it will write to a different file:
import time
file = open('myfile.%d.dat' % time.time(), 'w+')
You can do one of two things: either open the file in append mode, that is, file = open('myfile.dat', 'a'), or check whether the file exists and give the user the option to overwrite it. Python has a number of options for this. You can check this question for enlightenment:
How do I check whether a file exists using Python?
Consider
import os
def build_filename(name, num=0):
    root, ext = os.path.splitext(name)
    return '%s%d%s' % (root, num, ext) if num else name
def find_next_filename(name, max_tries=20):
    if not os.path.exists(name): return name
    else:
        for i in range(max_tries):
            test_name = build_filename(name, i+1)
            if not os.path.exists(test_name): return test_name
        return None
If your filename doesn't exist, it'll return your filename.
If your filename does exist, it'll try rootX.extension, where root and extension are determined by os.path.splitext and X is an integer, starting at 1 and ending at max_tries (I had it default to 20, but you could change the default or pass a different argument).
If no free name can be found, the function returns None.
Note, there are still race conditions present here (a file may be created by another process with a clashing name after your check), but it's what you said you wanted.
# When the file doesn't exist
print(find_next_filename('myfile.dat')) # myfile.dat
# When the file does exist
print(find_next_filename('myfile.dat')) # myfile1.dat
# When the file does exist, as do "1" and "2"
print(find_next_filename('myfile.dat')) # myfile3.dat
Nothing particularly quick, but effective? Sure! I'm used to a backup system where I do:
filename.ext
filename-1.ext # older
filename-2.ext # older still
filename-3.ext # even older
This is slightly harder than what you want to do. You want filename-N.ext to be the NEWEST file! Let's use glob to see how many files match that name, then make a new one!
from glob import glob
import os.path
# Count existing files such as myfile.dat, myfile1.dat, myfile2.dat, ...
num_files = len(glob(os.path.join(root, head, filename + "*" + ext)))
# where, for example:
# root = "C:\\"
# head = r"users\username\My Documents"
# filename = "myfile"
# ext = ".dat"
if num_files == 0:
    num_files = "" # handles the case where file doesn't exist AT ALL yet
with open(os.path.join(root, head, filename + str(num_files) + ext), 'w+') as f:
    do_stuff_to_file(f)  # placeholder for whatever you do with the file
Here are a few solutions for everyone experiencing a similar problem.
Keep YOUR program from overwriting data:
with open('myfile.txt', 'a') as myfile:
myfile.write('data')
Note: a+ (rather than a) opens the file for both appending and reading.
Prevent ALL programs from overwriting your data (by setting it to read-only):
from os import chmod
from stat import S_IREAD
chmod('path_to_file', S_IREAD)
Note: both of these modules are part of Python's standard library (at least as of Python 3.10.4), so there is no need to use pip.
Note 2: Setting it to read-only is not the best idea, as programs can set it back. I would combine this with a hash and/or signature to verify the file has not been tampered with to 'invalidate' the data inside and require the user to re-generate the file (eg, to store any temporary but very important data like decryption keys after generating them before deleting them).
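The hash part of that suggestion could look roughly like this, assuming SHA-256 and a made-up helper name file_digest. Note the limitation: a plain hash only helps if the stored digest itself is kept somewhere the other program cannot modify; otherwise a signature is needed, as mentioned above.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()
```

Store the digest right after writing the file, and compare it against a freshly computed one before trusting the contents.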
Just check to see if your file already exists then?
import os
name = "myfile"
extension = ".dat"
x = 0
fileName = name + extension
# Keep incrementing the number while the candidate name is already taken
while os.path.exists(fileName):
    x = x + 1
    fileName = name + str(x) + extension
file = open(fileName, 'w+')

Is this the best way to get unique version of filename w/ Python?

Still 'diving in' to Python, and want to make sure I'm not overlooking something. I wrote a script that extracts files from several zip files, and saves the extracted files together in one directory. To prevent duplicate filenames from being over-written, I wrote this little function - and I'm just wondering if there is a better way to do this?
Thanks!
def unique_filename(file_name):
    counter = 1
    file_name_parts = os.path.splitext(file_name) # returns ('/path/file', '.ext')
    while os.path.isfile(file_name):
        file_name = file_name_parts[0] + '_' + str(counter) + file_name_parts[1]
        counter += 1
    return file_name
I really do require the files to be in a single directory, and numbering duplicates is definitely acceptable in my case, so I'm not looking for a more robust method (tho' I suppose any pointers are welcome), but just to make sure that what this accomplishes is getting done the right way.
One issue is that there is a race condition in your above code, since there is a gap between testing for existence and creating the file. There may be security implications to this (think about someone maliciously inserting a symlink to a sensitive file which they wouldn't be able to overwrite, but your program running with a higher privilege could). Attacks like these are why things like os.tempnam() are deprecated.
To get around it, the best approach is to actually try to create the file in such a way that you'll get an exception if it already exists and, on success, return the actually opened file object. This can be done with the lower level os.open function, by passing both the os.O_CREAT and os.O_EXCL flags. Once opened, return the actual file (and optionally filename) you create. E.g., here's your code modified to use this approach (returning a (file, filename) tuple):
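import os
def unique_file(file_name):
    counter = 1
    file_name_parts = os.path.splitext(file_name) # returns ('/path/file', '.ext')
    while True:
        try:
            # O_EXCL makes the call fail if the file already exists
            fd = os.open(file_name, os.O_CREAT | os.O_EXCL | os.O_RDWR)
            # 'r+' gives a read/write file object; use 'r+b' for binary data
            return os.fdopen(fd, 'r+'), file_name
        except FileExistsError:  # Python 3.3+; other OSErrors propagate
            pass
        file_name = file_name_parts[0] + '_' + str(counter) + file_name_parts[1]
        counter += 1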
[Edit] Actually, a better way, which will handle the above issues for you, is probably to use the tempfile module, though you may lose some control over the naming. Here's an example of using it (keeping a similar interface):
import os
import tempfile
def unique_file(file_name):
    dirname, filename = os.path.split(file_name)
    prefix, suffix = os.path.splitext(filename)
    fd, filename = tempfile.mkstemp(suffix, prefix+"_", dirname)
    # 'r+' gives a read/write file object; use 'r+b' for binary data
    return os.fdopen(fd, 'r+'), filename
>>> f, filename=unique_file('/home/some_dir/foo.txt')
>>> print(filename)
/home/some_dir/foo_z8f_2Z.txt
The only downside with this approach is that you will always get a filename with some random characters in it, as there's no attempt to create an unmodified file (/home/some_dir/foo.txt) first.
You may also want to look at tempfile.TemporaryFile and NamedTemporaryFile, which will do the above and also automatically delete from disk when closed.
Yes, this is a good strategy for readable but unique filenames.
One important change: You should replace os.path.isfile with os.path.lexists! As it is written right now, if there is a directory named /foo/bar.baz, your program will try to overwrite that with the new file (which won't work)... since isfile only checks for files and not directories. lexists checks for directories, symlinks, etc... basically if there's any reason that filename could not be created.
EDIT: @Brian gave a better answer, which is more secure and robust in terms of race conditions.
Two small changes...
base_name, ext = os.path.splitext(file_name)
You get two results with distinct meaning, give them distinct names.
file_name = "%s_%d%s" % (base_name, counter, ext)
It isn't faster or significantly shorter. But when you want to change your file name pattern, the pattern is in one place and slightly easier to work with.
If you want readable names this looks like a good solution.
There are routines to return unique file names for eg. temp files but they produce long random looking names.
if you don't care about readability, uuid.uuid4() is your friend.
import uuid
def unique_filename(prefix=None, suffix=None):
    fn = []
    if prefix: fn.extend([prefix, '-'])
    fn.append(str(uuid.uuid4()))
    if suffix: fn.extend(['.', suffix.lstrip('.')])
    return ''.join(fn)
How about
def ensure_unique_filename(orig_file_path):
    from time import time
    import os
    if os.path.lexists(orig_file_path):
        name, ext = os.path.splitext(orig_file_path)
        orig_file_path = name + str(time()).replace('.', '') + ext
    return orig_file_path
time() returns the current time in seconds since the epoch as a floating-point number, typically with sub-second precision. Combined with the original filename, the result is fairly unique, although two calls in very quick succession (for example from different threads) could still produce the same name.
