batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably they come up as
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list containing only the numerical part, then sort that list and iterate over it? With over 100K files this could be heavy. Any tips welcome!

import os
import re
thenum = re.compile(r'^file-(\d+)\.png$')
def bynumber(fn):
    # sort key: the integer embedded in the filename
    return int(thenum.match(fn).group(1))
# keep only the names that match the pattern, so the key function never sees a non-match
allnames = [fn for fn in os.listdir('.') if thenum.match(fn)]
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination name.
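As a rough sketch of how that loop could be filled in, the renames can be planned first and applied afterwards. The plan_renames helper, the six-digit width, and the file-N.png pattern are assumptions for illustration, not part of the answer above.

```python
import re

# Assumed naming scheme, taken from the example names in the question.
PATTERN = re.compile(r'^file-(\d+)\.png$')

def plan_renames(names, width=6):
    """Return (old, new) pairs mapping numerically sorted names onto 0, 1, 2, ..."""
    numbered = []
    for name in names:
        m = PATTERN.match(name)
        if m:  # silently skip anything that is not file-N.png
            numbered.append((int(m.group(1)), name))
    numbered.sort()  # sorts by the integer, not by the string
    return [(old, 'file-%s.png' % str(i).zfill(width))
            for i, (_, old) in enumerate(numbered)]

# To apply the plan in the current directory (not executed here):
#     import os
#     for old, new in plan_renames(os.listdir('.')):
#         os.rename(old, new)
```

Planning before renaming makes it easy to print and inspect the pairs on a handful of files before touching all 100K of them.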

There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"):  # loop through each file
    newname = make_new_filename(filename)  # create a function that does step 2, above
    os.rename(filename, newname)
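One possible shape for make_new_filename is sketched below. The name comes from the loop above, but the body, the width parameter, and the handling of the directory part are assumptions. glob.glob returns paths that include the folder, so the sketch splits the folder off first and puts it back at the end.

```python
import os

def make_new_filename(path, width=5):
    """Pad the number in a 'stub-N.ext' filename, keeping the directory part."""
    folder, filename = os.path.split(path)
    fnpart, ext = os.path.splitext(filename)   # 'file-338', '.txt'
    stub, num = fnpart.rsplit("-", 1)          # 'file', '338'
    return os.path.join(folder, "%s-%s%s" % (stub, num.zfill(width), ext))

print(make_new_filename("file-338.txt"))  # file-00338.txt
```

This assumes every matched filename contains at least one dash followed by the number; a name without a dash would raise ValueError at the rsplit line.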

Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os
files = os.listdir('.')
natsort(files)
index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7)+'.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/

Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, then run another script that takes those files, which now sort correctly, and renames them so they're contiguous.
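A minimal sketch of the two passes, working on lists of names only so it can be tried without touching any files. The helper names pad_names and make_contiguous, the six-digit width, and the stub-N.ext naming scheme are assumptions for illustration.

```python
def pad_names(names, width=6):
    """Pass 1: 'file-21.png' -> 'file-000021.png' (same number, zero-padded)."""
    padded = []
    for name in names:
        stub, rest = name.rsplit("-", 1)
        number, ext = rest.split(".", 1)
        padded.append("%s-%s.%s" % (stub, number.zfill(width), ext))
    return padded

def make_contiguous(padded_names, width=6):
    """Pass 2: plain sorted() is now numeric order, so renumber 0, 1, 2, ..."""
    pairs = []
    for i, name in enumerate(sorted(padded_names)):
        stub, rest = name.rsplit("-", 1)
        ext = rest.split(".", 1)[1]
        pairs.append((name, "%s-%s.%s" % (stub, str(i).zfill(width), ext)))
    return pairs

padded = pad_names(['file-640.png', 'file-21.png', 'file-3012.png'])
for old, new in make_contiguous(padded):
    print(old, '->', new)
```

The second pass only works because every padded name has the same width, which is what makes string order and numeric order agree.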

1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.

import os
def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = number.zfill(6)  # pad the number to be 6 digits long
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print("Done")
Hope this helps

The simplest method is given below. You can also modify this script to search recursively.
1) Use the os module.
2) Get the filenames.
3) Rename each file with os.rename.
import os
class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern
    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if x.endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1
if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()

Related

How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently created folder. I am using a program that will output a new run## folder each time it is executed (i.e. run01, run02, run03 and so on). Within any one run## folder resides a data file that I want to analyze (file-i-want.txt).
run_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, r'..\data\directory', run_numb, 'file-i-want.txt')
In short I want to skip having to hardcode in run## and just get the directory of a file within the most recently created run## folder.

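You can get the creation date with os.stat. Note that st_birthtime is not available on every platform (Linux, for example, does not provide it); there you can fall back to st_mtime, the last modification time.
import os
path = '/a/b/c'
# newest
newest = max(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime)
# all files sorted, newest first
sorted_files = sorted(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)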
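A portable variant could look like the sketch below. It is only an illustration: the newest_subdir helper and its prefix parameter are hypothetical, and it uses the modification time as a stand-in for the creation time, which is an assumption that holds only if older run folders are not modified later.

```python
import os

def newest_subdir(path, prefix="run"):
    """Return the full path of the most recently modified matching subdirectory, or None."""
    candidates = [
        entry.path
        for entry in os.scandir(path)
        if entry.is_dir() and entry.name.startswith(prefix)
    ]
    if not candidates:
        return None
    # st_mtime is used instead of st_birthtime because it exists on every platform
    return max(candidates, key=os.path.getmtime)

# Example use (the 'data/directory' location is the one from the question):
#     latest = newest_subdir('data/directory')
#     filepath = os.path.join(latest, 'file-i-want.txt')
```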

pathlib is often recommended over os for filesystem-related tasks.
You can try:
from pathlib import Path
filepath = Path(__file__).parent / 'data/directory'
fnames = sorted(filepath.rglob('file-i-want.txt'), key=lambda x: x.stat().st_mtime, reverse=True)
filepath = str(fnames[0])

glob.glob('run*') returns the list of files/directories that match the pattern. The order of that list is not guaranteed, so sort it by name yourself.
So if you want the latest run, your code will be:
import glob
print(sorted(glob.glob('run*'))[-1]) # raises IndexError if there are no runs
IMPORTANT: the names are ordered as strings, so, for example, 'run21' will come AFTER 'run100'. You will need to use enough digits in the run number to avoid this problem, or just count the number of matched files and recreate the name of the folder from that number.
You can use glob to count the files with the same name pattern:
import glob
n = len(glob.glob('run*')) # number of files whose names start with 'run'
new_run_name = 'run' + str(n)
Note: with this code the run numbers start from 0; if you want to start from 1, just add 1 to n.
If you always want a two-digit run number (00, 01, 02), use 'str(n).zfill(2)' instead of 'str(n)'.
Example:
import glob
n = len(glob.glob('run*')) # number of files whose names start with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)
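If the run numbers do not all have the same number of digits, comparing them as integers avoids the ordering problem entirely. The sketch below is one way to do that; the latest_run helper and the 'run' followed by digits naming scheme are assumptions for illustration.

```python
import re

RUN_RE = re.compile(r'^run(\d+)$')  # assumed naming scheme: 'run' followed by digits

def latest_run(names):
    """Return the name with the highest run number, or None if nothing matches."""
    runs = []
    for name in names:
        m = RUN_RE.match(name)
        if m:
            runs.append((int(m.group(1)), name))
    if not runs:
        return None
    return max(runs)[1]  # compares the integers first, so run100 beats run21

print(latest_run(['run100', 'run21', 'run03', 'notes']))  # run100
```

In practice the list of names could come from glob.glob('run*') or os.listdir.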

Python Glob regex file search with for single result from multiple matches

In Python, I am trying to find a specific file in a directory, let's say, 'file3.txt'. The other files in the directory are 'flie1.txt', 'File2.txt', 'file_12.txt', and 'File13.txt'. The number is unique, so I need to search by a user supplied number.
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + f'{file_num}' + '.txt')
Problem is, that returns both 'file3.txt' and 'File13.txt'. If I try lookbehind, I get no files:
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + r'(?<![1-9]*)' + f'{file_num}' + '.txt')
How do I only get 'file3.txt'?

glob accepts Unix wildcards, not regexes. Those are less powerful but what you're asking can still be achieved. This:
glob.glob("/path/to/file/*[!0-9]3.txt")
matches only the files whose names end in 3.txt with a non-digit character immediately before the 3.
For other cases, you can use a list comprehension and regex:
[x for x in glob.glob("/path/to/file/*") if re.match(some_regex,os.path.basename(x))]
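Both approaches can be tried on the file names from the question without touching the disk. This is only a sketch: the two helper functions are hypothetical, fnmatch.fnmatchcase is used because it applies the same wildcard rules as glob to plain strings, and the regex assumes the prefix consists of letters and underscores.

```python
import fnmatch
import re

names = ['file3.txt', 'flie1.txt', 'File2.txt', 'file_12.txt', 'File13.txt']

def find_by_wildcard(names, number):
    """Wildcard form: any prefix, then one non-digit, then the number."""
    pattern = '*[!0-9]%d.txt' % number
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

def find_by_regex(names, number):
    """Regex form: one or more letters/underscores, then exactly the number."""
    pattern = re.compile(r'^[A-Za-z_]+%d\.txt$' % number)
    return [n for n in names if pattern.match(n)]

print(find_by_wildcard(names, 3))  # ['file3.txt']
print(find_by_regex(names, 3))     # ['file3.txt']
```

Note that the wildcard form requires at least one non-digit character before the number, so a file named just 3.txt would not match.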

The problem with glob is that it supports only shell-style wildcards, not regular expressions. For instance, you can't express "[a-z_]+" with glob.
So, it's better to write your own regex, like this:
import re
import os
file_num = 3
file_re = r"[a-z_]+{file_num}\.txt$".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match
work_dir = "C:/Path_to_dir/"
names = list(filter(match_file, os.listdir(work_dir)))

How to organize file names that were named as only numbers [duplicate]

Let's say I have three files in a folder: file9.txt, file10.txt and file11.txt, and I want to read them in this particular order. Can anyone help me with this?
Right now I am using the code
import glob, os
for infile in glob.glob(os.path.join( '*.txt')):
    print("Current File Being Processed is: " + infile)
and it reads first file10.txt, then file11.txt and then file9.txt.
Can someone help me get the right order?

Files on the filesystem are not sorted. You can sort the resulting filenames yourself using the sorted() function:
for infile in sorted(glob.glob('*.txt')):
    print("Current File Being Processed is: " + infile)
Note that the os.path.join call in your code is a no-op; with only one argument it doesn't do anything but return that argument unaltered.
Note that your files will sort in alphabetical ordering, which puts 10 before 9. You can use a custom key function to improve the sorting:
import glob
import re
numbers = re.compile(r'(\d+)')
def numericalSort(value):
    parts = numbers.split(value)
    parts[1::2] = map(int, parts[1::2])
    return parts
for infile in sorted(glob.glob('*.txt'), key=numericalSort):
    print("Current File Being Processed is: " + infile)
The numericalSort function splits out any runs of digits in a filename, turns them into actual numbers, and returns the resulting list for sorting:
>>> files = ['file9.txt', 'file10.txt', 'file11.txt', '32foo9.txt', '32foo10.txt']
>>> sorted(files)
['32foo10.txt', '32foo9.txt', 'file10.txt', 'file11.txt', 'file9.txt']
>>> sorted(files, key=numericalSort)
['32foo9.txt', '32foo10.txt', 'file9.txt', 'file10.txt', 'file11.txt']

You can wrap your glob.glob( ... ) expression inside a sorted( ... ) call and sort the resulting list of files. Example:
for infile in sorted(glob.glob('*.txt')):
    ...
You can use the key= ... argument to give sorted a custom key that is used for sorting (Python 2 also accepted a comparison function, but a key is the better choice).
Example:
There are the following files:
x/blub01.txt
x/blub02.txt
x/blub10.txt
x/blub03.txt
y/blub05.txt
The following code will produce the following output:
for filename in sorted(glob.glob('[xy]/*.txt')):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# x/blub10.txt
# y/blub05.txt
Now with key function:
def key_func(x):
    return os.path.split(x)[-1]
for filename in sorted(glob.glob('[xy]/*.txt'), key=key_func):
    print(filename)
# x/blub01.txt
# x/blub02.txt
# x/blub03.txt
# y/blub05.txt
# x/blub10.txt
EDIT:
Possibly this key function can sort your files:
pat = re.compile(r"(\d+)\D*$")
...
def key_func(x):
    mat = pat.search(os.path.split(x)[-1])  # match last group of digits
    if mat is None:
        return x
    return "{:>10}".format(mat.group(1))  # right-align to a width of 10
It can surely be improved, but I think you get the point. Paths without numbers will be left alone; paths with numbers will be converted to a string that is 10 characters wide and contains only the number.

You need to change the sort from 'ASCIIbetical' to numeric by isolating the number in the filename. You can do that like so:
import re
nondigits = re.compile(r"\D")
def keyFunc(afilename):
    return int(nondigits.sub("", afilename))
filenames = ["file10.txt", "file11.txt", "file9.txt"]
for x in sorted(filenames, key=keyFunc):
    print(x)
You can set filenames to the result of glob.glob("*.txt").
Additionally, the keyFunc function assumes the filename has a number in it, and that the only digits in the filename belong to that number. You can change that function to be as complex as you need to isolate the number you want to sort on.

glob.glob(os.path.join( '*.txt'))
returns a list of strings, so you can easily sort the list using Python's sorted() function.
sorted(glob.glob(os.path.join( '*.txt')))

for fname in ['file9.txt','file10.txt','file11.txt']:
    with open(fname) as f:  # default open mode is for reading
        for line in f:
            pass  # do something with line

Extracting "unsigned files" from a directory

I have a directory with xml files associated with encrypted P7M files, meaning that for every name.xml there is a name.P7M. But there are some exceptions (P7M file is absent) and my goal is to detect them using python.
I'm thinking of something like this code. Can you help with a more elegant version?
import glob
import re
# functions to eliminate the extension name
def is_xml(x):
    a = re.search(r"(\s)(.xml)",x)
    if a:
        return a.group(0)
    else:
        return False
def is_P7M(x):
    a = re.search(r"(\s)(.P7M)", x)
    if a:
        return a.group(0)
    else:
        return False
# putting xml files and P7M files in two sets
setA = set(glob.glob('directory/*.xml'))
setB = set(glob.glob('directory/*.P7M'))
# eliminating extension names
for elt in setA:
    elt = is_xml(elt)
for elt in setB:
    elt = is_P7M(elt)
# difference between two sets. setB is always a larger set
print("unsigned files are:", setB.difference(setA))

A simpler way is to glob for the .xml files, then check using os.path.exists for a .P7M file:
import os, glob
for xmlfile in glob.glob('*.xml'):
    if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M"):
        print(xmlfile, "is unsigned")
This code:
Uses glob.glob to get all the xml files.
Uses str.rsplit to split the filename up into name and extension (e.g. "name.xml" to ["name", "xml"]). The second argument stops str.rsplit splitting more than once.
Takes the name of the file and adds the .P7M extension.
Uses os.path.exists to see if the key file is there. If it isn't, the xmlfile is unsigned, so print it out.
If you need them in a list, you can do:
unsigned = [xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")]
Or a set:
unsigned = {xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")}

My solution would be:
import glob
import os
get_name = lambda fname: os.path.splitext(fname)[0]
xml_names = {get_name(fname) for fname in glob.glob('directory/*.xml')}
p7m_names = {get_name(fname) for fname in glob.glob('directory/*.P7M')}
unsigned = [xml_name + ".xml" for xml_name in
            xml_names.difference(p7m_names)]
print(unsigned)

Put all the XML names in a dict, removing the extension and using the name as the key, with the value initially set to False. If we find a matching P7M, set the value to True. Finally, print all keys with a False value.
import glob
xmls = glob.glob('directory/*.xml')
p7ms = glob.glob('directory/*.P7M')
# use xml file names as keys by removing the extension
d = {k.rsplit(".", 1)[0]: False for k in xmls}
# go over every .P7M, again removing the extension,
# setting the value to True for every match
for k in p7ms:
    name = k.rsplit(".", 1)[0]
    if name in d:
        d[name] = True
# any values that are False mean there is no .P7M match for the xml file
for k, v in d.items():
    if not v:
        print(k)
Or create a set of each and find the difference:
xmls = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.xml')}
p7ms = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.P7M')}
print(xmls - p7ms)

Iterate over glob once and populate a dict of filenames by extension. Finally, compute the difference between 'xml' and 'P7M' sets.
import os, glob, collections
fnames = collections.defaultdict(set)
for fname in glob.glob('*'):
    f, e = os.path.splitext(fname)
    fnames[e].add(f)
print(fnames['.xml'] - fnames['.P7M'])
Note that unlike other suggestions, this makes one single request to the filesystem, which might be important if the FS is slow (e.g. a network mount).
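The grouping step can be checked on a plain list of names before pointing it at a real directory. This is a small sketch of the same idea; the unsigned_xml helper name and the sample names are made up for illustration.

```python
import os
from collections import defaultdict

def unsigned_xml(filenames):
    """Return the sorted base names that have an .xml file but no .P7M file."""
    by_ext = defaultdict(set)
    for fname in filenames:
        base, ext = os.path.splitext(fname)
        by_ext[ext].add(base)
    return sorted(by_ext['.xml'] - by_ext['.P7M'])

print(unsigned_xml(['a.xml', 'a.P7M', 'b.xml', 'c.P7M', 'd.txt']))  # ['b']
```

The extension comparison is case-sensitive, so a file named a.p7m would not count as a signature for a.xml here.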

Use of regular expression to exclude characters in file rename,Python?

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code but my regular expression doesn't seem to be having any effect. The reason for the walk is that we have to handle large directory trees with many images.
def fix_length():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            re.sub("[^0-9][-]", "", filename_zero)
            os.rename(path2, filename_zero + extension)
fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom) not what I wanted)
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
import os, re, shutil
path = r'C:\Archive'
errors = r'C:\Test\errors'
num_files = []
def best_sol():
    num_files = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            filename_zero, extension = os.path.splitext(fn)
            path2 = os.path.join(root, fn)
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                if ID not in num_files:
                    num_files = []
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
                else:
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
            else:
                shutil.copy(path2, errors)
                os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use lists that store the instances of this ID and use the length of the list to append a suffix. The first file will have a -1, the second a -2, etc.
I am only interested in IDs with a length of 7 (or they should only be that long), but I allow reading up to 10 digits to allow for human error in labelling. All files with an ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.

re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub(r"[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()
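The two approaches give the same result on the example name but differ when the trailing text itself contains digits or dashes. A small sketch comparing them follows; the function names strip_extras and keep_id and the second sample name are assumptions for illustration.

```python
import re

def strip_extras(stem):
    """Remove every character that is not a digit or a dash."""
    return re.sub(r'[^\d-]', '', stem)

def keep_id(stem):
    """Keep the leading digits plus an optional '-digit' suffix; None if there are none."""
    m = re.match(r'^\d+(?:-\d)?', stem)
    return m.group() if m else None

print(strip_extras('1234567-1(crop to bottom)'))  # 1234567-1
print(keep_id('1234567-1(crop to bottom)'))       # 1234567-1
print(strip_extras('1234567-1(crop-2)'))          # 1234567-1-2
print(keep_id('1234567-1(crop-2)'))               # 1234567-1
```

Checking for None before calling .group() avoids an AttributeError on names that do not start with a digit.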

new_filename = re.sub(r'^([0-9]+)-([0-9]+).*', r'\g<1>-\g<2>', filename_zero)
Try using this regular expression instead; I hope this is how regular expressions work in Python, as I don't use it often. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.
