Use of regular expression to exclude characters in file rename,Python? - python

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code but my regular expression doesn't seem to be having any effect. The reason for the walk is because we have to handles large directory trees with many images.
def fix_length():
for root, dirs, files in os.walk(path):
for fn in files:
path2 = os.path.join(root, fn)
filename_zero, extension = os.path.splitext(fn)
re.sub("[^0-9][-]", "", filename_zero)
os.rename(path2, filename_zero + extension)
fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom) not what I wanted)
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
path = 'C:\Archive'
errors = 'C:\Test\errors'
num_files = []
def best_sol():
num_files = []
for root, dirs, files in os.walk(path):
for fn in files:
filename_zero, extension = os.path.splitext(fn)
path2 = os.path.join(root, fn)
ID = re.match('^\d{1,10}', fn).group()
if len(ID) <= 7:
if ID not in num_files:
num_files = []
num_files.append(ID)
suffix = str(len(num_files))
os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
else:
num_files.append(ID)
suffix = str(len(num_files))
os.rename(path2, os.path.join( root, ID + '-' + suffix +extension))
else:
shutil.copy(path2, errors)
os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use lists that store the instances of this ID and use the, length of the list append a suffix. The first file will have a -1, second a -2 etc...
I am only interested (or they should only be) in ID's with a length of 7 but allow to read up to 10 to allow for human error in labelling. All files with ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.

re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub("[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()

new_filename = re.sub(r'^([0-9]+)-([0-9]+)', r'\g1-\g2', filename_zero)
Try using this regular expression instead, I hope this is how regular expressions work in Python, I don't use it often. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.

Related

Search list items (list has path that extracted using OS Walk) in a csv file. If it’s not available, then perform some action

I wanted to check each item of this list (Filepath_list) in a DataFrame (from a csv) that has list of path.
If it’s there than I will skip and check for next item in the list. If it not their than I will do some processing.
The code I am using is:
l = os.listdir(dir_path)
filepath_list = []
for root, dirs, files in os.walk(dir_path):
for f in files:
if 'Apr.log' in f:
filepath_list.append(os.path.join(root,f))
df_path_list_csv = pd.read_csv(r"C:\Users\ABC6\OneDrive - ABB\Files\Folder List.csv")
for g in filepath_list:
if df_path_list_csv['Detail'].str.contains(g).any():
print(g)
The error I am getting is:
incomplete escape \U at position 2
Can anyone please help me with correcting the error in this code.
Is it because of backlash? If yes, how can I correct it. I tried replacing "\" with "/" but still its not working..
The variable Filepath_list has "\\" while df_path_list_csv dataframe has "\"
Filepath_list =
['C:\\Users\\ABC6\\OneDrive - ABB\\Job Files\\SA_R 1_Ret\\13Mar22\\Apr.log',
'C:\\Users\\ABC6\\OneDrive - ABB\\Job Files\\SA_Run 2_Dri\\28Mar22\\Apr.log',
'C:\\Users\\ABC6\\OneDrive - ABB\\Job Files\\SA_Well_Run 2_Dri\\29Mar22\\Apr.log']
df_path_list_csv:
0 C:\Users\ ABC6\OneDrive - ABB\Job Files...
1 C:\Users\ ABC6\OneDrive - ABB\ Job Files...
By default the contains method of pandas assume that the pattern used is a regular expression and so identify a backslash as an escape char. To avoid this behavior and use your pattern as litteral string you need to set the regex parameter to False.
l = os.listdir(dir_path)
filepath_list = []
for root, dirs, files in os.walk(dir_path):
for f in files:
if 'Apr.log' in f:
filepath_list.append(os.path.join(root,f))
# As said in my comment just double the backslash to avoid the error you get
df_path_list_csv = pd.read_csv("C:\\Users\\ABC6\\OneDrive - ABB\\Files\\Folder List.csv")
for g in filepath_list:
# Here you have to set to False the `regex` parameter of the `contains` method
if df_path_list_csv['Detail'].str.contains(g, regex=False).any():
print(g)

How to remove .txt or .docx at end of string in python

I am trying to create a list of all file names from a specific directory. My code is below:
import os
#dir = input('Enter the directory: ')
dir = 'C:/Users/brian/Documents/Moeller'
r = os.listdir(dir)
for fnam in os.listdir(dir):
print(fnam.split())
sep = fnam.split()
My output is:
['50', 'OP', '856101P02.txt']
['856101P02', 'OP', '040.txt']
['856101P02', 'OP', '50.txt']
['OP', '040', '856101P02.txt']
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
Basically, what you do is start splitting from the right with rsplit and then instruct it to split only once.
print "a.b.c.d".rsplit('.',1)[0]
prints a.b.c
You can use os.path.splitext to split a filename to two parts,
keeping only the extension in the right, and everything else on the left.
For example,
a path like some/path/file.tar.gz will be split to some/path/file.tar and .gz:
base, ext = os.path.splitext('path/to/hello.tar.gz')
If you want to get rid of the . in the ext part,
simply use ext[1:].
If the file has no extension, for example path/to/file,
then the ext part will be the empty string.
This is a nice feature,
so that os.path.splitext always returns a tuple of two elements,
and this way the base, ext = ... example above always works.
I am trying to create a list of all file names from a specific directory.
[...]
How would I be able to remove anything to the right of a "." in a string, while keeping the text to the left of the period?
To get the base names (filenames without the extension) of a specific directory somedir, you could use this list comprehension:
basenames = [os.path.splitext(f)[0] for f in os.listdir(somedir)]
From there, find the period and take everything up to that position. In simple steps ...
for fnam in os.listdir(dir):
nam_split = fnam.split() # "sep" is usually the separator character
print(nam_split)
ext_split = nam_split.rsplit('.', 1) # Split at only one dot, from the right
file_no_ext = ext_split[0] # The first part of the split is the file name
print(file_no_ext)

Finding all subfolders that contain two files that end with certain strings

So I have a folder, say D:\Tree, that contains only subfolders (names may contain spaces). These subfolders contain a few files - and they may contain files of the form "D:\Tree\SubfolderName\SubfolderName_One.txt" and "D:\Tree\SubfolderName\SubfolderName_Two.txt" (in other words, the subfolder may contain both of them, one, or neither). I need to find every occurence where a subfolder contains both of these files, and send their absolute paths to a text file (in a format explained in the following example). Consider these three subfolders in D:\Tree:
D:\Tree\Grass contains Grass_One.txt and Grass_Two.txt
D:\Tree\Leaf contains Leaf_One.txt
D:\Tree\Branch contains Branch_One.txt and Branch_Two.txt
Given this structure and the problem mentioned above, I'd to like to be able to write the following lines in myfile.txt:
D:\Tree\Grass\Grass_One.txt D:\Tree\Grass\Grass_Two.txt
D:\Tree\Branch\Branch_One.txt D:\Tree\Branch\Branch_Two.txt
How might this be done? Thanks in advance for any help!
Note: It is very important that "file_One.txt" comes before "file_Two.txt" in myfile.txt
import os
folderPath = r'Your Folder Path'
for (dirPath, allDirNames, allFileNames) in os.walk(folderPath):
for fileName in allFileNames:
if fileName.endswith("One.txt") or fileName.endswith("Two.txt") :
print (os.path.join(dirPath, fileName))
# Or do your task as writing in file as per your need
Hope this helps....
Here is a recursive solution
def findFiles(writable, current_path, ending1, ending2):
'''
:param writable: file to write output to
:param current_path: current path of recursive traversal of sub folders
:param postfix: the postfix which needs to match before
:return: None
'''
# check if current path is a folder or not
try:
flist = os.listdir(current_path)
except NotADirectoryError:
return
# stores files which match given endings
ending1_files = []
ending2_files = []
for dirname in flist:
if dirname.endswith(ending1):
ending1_files.append(dirname)
elif dirname.endswith(ending2):
ending2_files.append(dirname)
findFiles(writable, current_path+ '/' + dirname, ending1, ending2)
# see if exactly 2 files have matching the endings
if len(ending1_files) == 1 and len(ending2_files) == 1:
writable.write(current_path+ '/'+ ending1_files[0] + ' ')
writable.write(current_path + '/'+ ending2_files[0] + '\n')
findFiles(sys.stdout, 'G:/testf', 'one.txt', 'two.txt')

Python - Truncate unknown file names

Let's say I have the following files in a directory:
snackbox_1a.dat
zebrabar_3z.dat
cornrows_00.dat
meatpack_z2.dat
I have SEVERAL of these directories, in which all of the files are of the same format, ie:
snackbox_xx.dat
zebrabar_xx.dat
cornrows_xx.dat
meatpack_xx.dat
So what I KNOW about these files is the first bit (snackbox, zebrabar, cornrows, meatpack). What I don't know is the bit for the file extension (the 'xx'). This changes both within the directory across the files, and across the directories (so another directory might have different xx values, like 12, yy, 2m, 0t, whatever).
Is there a way for me to rename all of these files, or truncate them all (since the xx.dat will always be the same length), for ease of use when attempting to call them? For instance, I'd like to rename them so that I can, in another script, use a simple index to step through and find the file I want (instead of having to go into each directory and pull the file out manually).
In other words, I'd like to change the file names to:
snackbox.dat
zebrabar.dat
cornrows.dat
meatpack.dat
Thanks!
You can use shutil.move to move files. To calculate the new filename, you can use Python's string split method:
original_name = "snackbox_12.dat"
truncated_name = original.split("_")[0] + ".dat"
Try re.sub:
import re
filename = 'snackbox_xx.dat'
filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
You should get 'snackbox.dat' for filename_new
This assumes the two characters after the "_" are either a number or lowercase/uppercase letter, but you could choose to expand the classes included in the regular expression.
EDIT: including moving and recursive search:
import shutil, re, os, fnmatch
directory = 'your_path'
for root, dirnames, filenames in os.walk(directory):
for filename in fnmatch.filter(filenames, '*.dat'):
filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
shutil.move(os.path.join(root, filename), os.path.join(root, filename_new))
This solution renames all files in the current directory that match the pattern in the function call.
What the function does
snackbox_5R.txt >>> snackbox.txt
snackbox_6y.txt >>> snackbox_0.txt
snackbox_a2.txt >>> snackbox_1.txt
snackbox_Tm.txt >>> snackbox_2.txt
Let's look at the functions inputs and some examples.
list_of_files_names This is a list of string. Where each string is the filename without the _?? part.
Examples:
['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt']
['text.dat']
upper_bound=1000 This is an integer. When the ideal filename is already taken, e.g snackbox.dat already exist it will create snackbox_0.dat all the way up to snackbox_9999.dat if need be. You shouldn't have to change the default.
The Code
import re
import os
import os.path
def find_and_rename(dir, list_of_files_names, upper_bound=1000):
"""
:param list_of_files_names: List. A list of string: filname (without the _??) + extension, EX: snackbox.txt
Renames snackbox_R5.dat to snackbox.dat, etc.
"""
# split item in the list_of_file_names into two parts, filename and extension "snackbox.dat" -> "snackbox", "dat"
list_of_files_names = [(prefix.split('.')[0], prefix.split('.')[1]) for prefix in list_of_files_names]
# store the content of the dir in a list
list_of_files_in_dir = os.listdir(dir)
for file_in_dir in list_of_files_in_dir: # list all files and folders in current dir
file_in_dir_full_path = os.path.join(dir, file_in_dir) # we need the full path to rename to use .isfile()
print() # DEBUG
print('Is "{}" a file?: '.format(file_in_dir), end='') # DEBUG
print(os.path.isfile(file_in_dir_full_path)) # DEBUG
if os.path.isfile(file_in_dir_full_path): # filters out the folder, only files are needed
# Filename is a tuple containg the prefix filename and the extenstion
for file_name in list_of_files_names: # check if the file matches on of our renaming prefixes
# match both the file name (e.g "snackbox") and the extension (e.g "dat")
# It find "snackbox_5R.txt" by matching "snackbox" in the front and matching "dat" in the rear
if re.match('{}_\w+\.{}'.format(file_name[0], file_name[1]), file_in_dir):
print('\nOriginal File: ' + file_in_dir) # printing this is not necessary
print('.'.join(file_name))
ideal_new_file_name = '.'.join(file_name) # name might already be taken
# print(ideal_new_file_name)
if os.path.isfile(os.path.join(dir, ideal_new_file_name)): # file already exists
# go up a name, e.g "snackbox.dat" --> "snackbox_1.dat" --> "snackbox_2.dat
for index in range(upper_bound):
# check if this new name already exists as well
next_best_name = file_name[0] + '_' + str(index) + '.' + file_name[1]
# file does not already exist
if os.path.isfile(os.path.join(dir,next_best_name)) == False:
print('Renaming with next best name')
os.rename(file_in_dir_full_path, os.path.join(dir, next_best_name))
break
# this file exist as well, keeping increasing the name
else:
pass
# file with ideal name does not already exist, rename with the ideal name (no _##)
else:
print('Renaming with ideal name')
os.rename(file_in_dir_full_path, os.path.join(dir, ideal_new_file_name))
def find_and_rename_include_sub_dirs(master_dir, list_of_files_names, upper_bound=1000):
for path, subdirs, files in os.walk(master_dir):
print(path) # DEBUG
find_and_rename(path, list_of_files_names, upper_bound)
find_and_rename_include_sub_dirs('C:/Users/Oxen/Documents/test_folder', ['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt'])

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like to them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete python noob, but looking at the docs i'm guessing I could use map to create a new list filtering out only the numerical part, and then sort that list, then iterate that? With over 100K files this could be heavy. Any tips welcome!
import re
thenum = re.compile('^file-(\d+)\.png$')
def bynumber(fn):
mo = thenum.match(fn)
if mo: return int(mo.group(1))
allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"): # loop through each file
newname = make_new_filename(filename) # create a function that does step 2, above
os.rename(filename, newname)
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os
files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
os.rename(filename, str(index).zfill(7)+'.png')
index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two step process. Parse all the files and rename with padded numbers and then run another script that takes those files, which are sorted correctly now, and renames them so they're contiguous?
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
def renamer():
for iname in os.listdir('.'):
first, second = iname.replace(" ", "").split("-")
number, ext = second.split('.')
first, number, ext = first.strip(), number.strip(), ext.strip()
number = '0'*(6-len(number)) + number # pad the number to be 7 digits long
oname = first + "-" + number + '.' + ext
os.rename(iname, oname)
print "Done"
Hope this helps
The simplest method is given below. You can also modify for recursive search this script.
use os module.
get filenames
os.rename
import os
class Renamer:
def __init__(self, pattern, extension):
self.ext = extension
self.pat = pattern
return
def rename(self):
p, e = (self.pat, self.ext)
number = 0
for x in os.listdir(os.getcwd()):
if str(x).endswith(f".{e}") == True:
os.rename(x, f'{p}_{number}.{e}')
number+=1
return
if __name__ == "__main__":
pattern = "myfile"
extension = "txt"
r = Renamer(pattern=pattern, extension=extension)
r.rename()

Categories