Read or open a file using "fuzzy match" filename - Python

Given a list of files in a directory:
import os
os.listdir('system-outputs/newstest2016/ru-en')
[out]:
['newstest2016.AFRL-MITLL-contrast.4524.ru-en',
'newstest2016.AFRL-MITLL-Phrase.4383.ru-en',
'newstest2016.AMU-UEDIN.4458.ru-en',
'newstest2016.NRC.4456.ru-en',
'newstest2016.online-A.0.ru-en',
'newstest2016.online-B.0.ru-en',
'newstest2016.online-F.0.ru-en',
'newstest2016.online-G.0.ru-en',
'newstest2016.PROMT-Rule-based.4277.ru-en',
'newstest2016.uedin-nmt.4309.ru-en']
And then I have the input:
filename, suffix = 'newstest2016.AFRL-MITLL-contrast', 'ru-en'
Using the filename, if I want to do a regex match such that I can read the file newstest2016.AFRL-MITLL-contrast.4524.ru-en, I could do:
import re, os
dirpath = 'system-outputs/newstest2016/ru-en'
fin = open(os.path.join(dirpath, next(_fn for _fn in os.listdir(dirpath) if re.match(filename + r'\..*\.' + suffix, _fn))))
But is there a way to read/open a "fuzzy match" filename? There must be a better way than the crude re.match way above.
It's okay to assume that there should always be one clear match from the os.listdir.

I believe glob might be a better way.
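For instance (a sketch, assuming the directory layout from the question; since there should always be one clear match from the listing, you can take the first element of the result):

```python
from glob import glob

filename, suffix = 'newstest2016.AFRL-MITLL-contrast', 'ru-en'
# the unknown middle segment (the system id, e.g. 4524) is covered by the * wildcard
pattern = f'system-outputs/newstest2016/ru-en/{filename}.*.{suffix}'
matches = glob(pattern)
if matches:
    with open(matches[0]) as fin:
        text = fin.read()
```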

You can use glob as suggested, but it can give several matches. Since the filenames follow a fixed dot-separated pattern, I'd split on the dots instead:
filenames = [
    'newstest2016.AFRL-MITLL-contrast.4524.ru-en',
    # ...
    'newstest2016.PROMT-Rule-based.4277.ru-en',
    'newstest2016.uedin-nmt.4309.ru-en',
]

my_filename, suffix = 'newstest2016.AFRL-MITLL-contrast', 'ru-en'

for filename in filenames:
    *fn, suff = filename.split('.')
    if ('.'.join(fn[:-1]), suff) == (my_filename, suffix):
        break
else:
    filename = None

# `filename` is now set to the real file name (or None if nothing matched)
I use Python 3.x for the nicer syntax, but this is easy to port to Python 2.x.

Using glob to get txt and png files only from folder

Is there a better way to use glob.glob in python to get a list of multiple file types such as .txt, .mdown, and .markdown? Right now I have something like this:
projectFiles1 = glob.glob( os.path.join(projectDir, '*.txt') )
projectFiles2 = glob.glob( os.path.join(projectDir, '*.mdown') )
projectFiles3 = glob.glob( os.path.join(projectDir, '*.markdown') )
Maybe there is a better way, but how about:
import glob

types = ('*.pdf', '*.cpp')  # the tuple of file types
files_grabbed = []
for files in types:
    files_grabbed.extend(glob.glob(files))

# files_grabbed is the list of pdf and cpp files
Perhaps there is another way, so wait and see whether someone else comes up with a better answer.
glob returns a list: why not just run it multiple times and concatenate the results?
from glob import glob
project_files = glob('*.txt') + glob('*.mdown') + glob('*.markdown')
So many answers that suggest globbing as many times as number of extensions, I'd prefer globbing just once instead:
from pathlib import Path
files = (p.resolve() for p in Path(path).glob("**/*") if p.suffix in {".c", ".cc", ".cpp", ".hxx", ".h"})
from glob import glob
files = glob('*.gif')
files.extend(glob('*.png'))
files.extend(glob('*.jpg'))
print(files)
If you need to specify a path, loop over match patterns and keep the join inside the loop for simplicity:
from os.path import join
from glob import glob

files = []
for ext in ('*.gif', '*.png', '*.jpg'):
    files.extend(glob(join("path/to/dir", ext)))
print(files)
Chain the results:
import itertools as it
import glob

def multiple_file_types(*patterns):
    return it.chain.from_iterable(glob.iglob(pattern) for pattern in patterns)
Then:
for filename in multiple_file_types("*.txt", "*.sql", "*.log"):
    ...  # do stuff
For example, for *.mp3 and *.flac on multiple folders, you can do:
mask = r'music/*/*.[mf][pl][3a]*'
glob.glob(mask)
The idea can be extended to more file extensions, but you have to check that the combinations won't match any other unwanted file extension you may have on those folders. So, be careful with this.
To automatically combine an arbitrary list of extensions into a single glob pattern, you can do the following:
def multi_extension_glob_mask(mask_base, *extensions):
    mask_ext = ['[{}]'.format(''.join(set(c))) for c in zip(*extensions)]
    if not mask_ext or len(set(len(e) for e in extensions)) > 1:
        mask_ext.append('*')
    return mask_base + ''.join(mask_ext)

mask = multi_extension_glob_mask('music/*/*.', 'mp3', 'flac', 'wma')
print(mask)  # music/*/*.[mfw][pml][a3]*
With glob it is not possible. You can use only:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any character not in seq
Use os.listdir and a regexp to check patterns:
import os
import re

for x in os.listdir('.'):
    if re.match(r'.*\.txt|.*\.sql', x):
        print(x)
While Python's default glob doesn't really follow Bash's globbing, you can do this with other libraries. For example, we can enable braces in wcmatch's glob:
>>> from wcmatch import glob
>>> glob.glob('*.{md,ini}', flags=glob.BRACE)
['LICENSE.md', 'README.md', 'tox.ini']
You can even use extended glob patterns if that is your preference:
from wcmatch import glob
>>> glob.glob('*.#(md|ini)', flags=glob.EXTGLOB)
['LICENSE.md', 'README.md', 'tox.ini']
Same answer as BPL's (which is computationally efficient), but one that can handle any glob pattern rather than just extensions:
import os
from fnmatch import fnmatch

folder = "path/to/folder/"
patterns = ("*.txt", "*.md", "*.markdown")
# note: fnmatch needs the entry's name string, not the DirEntry object itself
files = [f.path for f in os.scandir(folder) if any(fnmatch(f.name, p) for p in patterns)]
This solution is both efficient and convenient. It also closely matches the behavior of glob (see the documentation).
Note that this is simpler with the built-in package pathlib:
from pathlib import Path
folder = Path("/path/to/folder")
patterns = ("*.txt", "*.md", "*.markdown")
files = [f for f in folder.iterdir() if any(f.match(p) for p in patterns)]
Here is one-line list-comprehension variant of Pat's answer (which also includes that you wanted to glob in a specific project directory):
import os, glob
exts = ['*.txt', '*.mdown', '*.markdown']
files = [f for ext in exts for f in glob.glob(os.path.join(project_dir, ext))]
You loop over the extensions (for ext in exts), and then for each extension you take each file matching the glob pattern (for f in glob.glob(os.path.join(project_dir, ext))).
This solution is short, and without any unnecessary for-loops, nested list-comprehensions, or functions to clutter the code. Just pure, expressive, pythonic Zen.
This solution allows you to have a custom list of exts that can be changed without having to update your code. (This is always a good practice!)
The list-comprehension is the same used in Laurent's solution (which I've voted for). But I would argue that it is usually unnecessary to factor out a single line to a separate function, which is why I'm providing this as an alternative solution.
Bonus:
If you need to search not just a single directory, but also all sub-directories, you can pass recursive=True and use the multi-directory glob symbol ** 1:
files = [f for ext in exts
         for f in glob.glob(os.path.join(project_dir, '**', ext), recursive=True)]
This will invoke glob.glob('<project_dir>/**/*.txt', recursive=True) and so on for each extension.
1 Technically, the ** glob symbol matches any number of characters including the forward slash / (unlike the singular * glob symbol). In practice, you just need to remember that as long as you surround ** with forward slashes (path separators), it matches zero or more directories.
Python 3
We can use pathlib; .glob still doesn't support globbing multiple arguments or within braces (as in POSIX shells) but we can easily filter the result.
For example, where you might ideally like to do:
# NOT VALID
Path(config_dir).glob("*.{ini,toml}")
# NOR IS
Path(config_dir).glob("*.ini", "*.toml")
you can do:
filter(lambda p: p.suffix in {".ini", ".toml"}, Path(config_dir).glob("*"))
which isn't too much worse.
A one-liner, just for the hell of it:
import glob

folder = "C:\\multi_pattern_glob_one_liner"
files = [item for sublist in [glob.glob(folder + ext) for ext in ["/*.txt", "/*.bat"]] for item in sublist]
output:
['C:\\multi_pattern_glob_one_liner\\dummy_txt.txt', 'C:\\multi_pattern_glob_one_liner\\dummy_bat.bat']
files = glob.glob('*.txt')
files.extend(glob.glob('*.dat'))
By the results I've obtained from empirical tests, it turned out that glob.glob isn't the best way to filter out files by their extensions. Some of the reasons are:
The globbing "language" does not allow a perfect specification of multiple extensions.
The former point results in incorrect results, depending on the file extensions involved.
The globbing method is empirically proven to be slower than most other methods.
Strange as it may seem, other filesystem objects can have "extensions" too, folders included.
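To illustrate the first two points, here is a small check (hypothetical filenames) of how a pattern built by stacking character classes for the extensions py and pyc, as the glob-based methods below do, overmatches: the classes match character by character, not whole extensions.

```python
from fnmatch import fnmatch

# pattern built by concatenating '[py]' and '[pyc]'
pattern = '*[py][pyc]'
print(fnmatch('script.py', pattern))   # True, as intended
print(fnmatch('script.pyc', pattern))  # True, as intended
print(fnmatch('backup.yp', pattern))   # also True: an unwanted match
```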
I've tested (for correctness and efficiency in time) the following 4 different methods to filter out files by extensions and put them in a list:
from glob import glob, iglob
from os import walk
from os.path import join as path_join
from re import compile, findall

def glob_with_storage(args):
    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = glob(globs, recursive=True)
    return results

def glob_with_iteration(args):
    elements = ''.join([f'[{i}]' for i in args.extensions])
    globs = f'{args.target}/**/*{elements}'
    results = [i for i in iglob(globs, recursive=True)]
    return results

def walk_with_suffixes(args):
    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            for e in args.extensions:
                if ff.endswith(e):
                    results.append(path_join(r, ff))
                    break
    return results

def walk_with_regs(args):
    reg = compile('|'.join([f'{i}$' for i in args.extensions]))
    results = []
    for r, d, f in walk(args.target):
        for ff in f:
            if len(findall(reg, ff)):
                results.append(path_join(r, ff))
    return results
By running the code above on my laptop I obtained the following auto-explicative results.
Elapsed time for '7 times glob_with_storage()': 0.365023 seconds.
mean : 0.05214614
median : 0.051861
stdev : 0.001492152
min : 0.050864
max : 0.054853
Elapsed time for '7 times glob_with_iteration()': 0.360037 seconds.
mean : 0.05143386
median : 0.050864
stdev : 0.0007847381
min : 0.050864
max : 0.052859
Elapsed time for '7 times walk_with_suffixes()': 0.26529 seconds.
mean : 0.03789857
median : 0.037899
stdev : 0.0005759071
min : 0.036901
max : 0.038896
Elapsed time for '7 times walk_with_regs()': 0.290223 seconds.
mean : 0.04146043
median : 0.040891
stdev : 0.0007846776
min : 0.04089
max : 0.042885
Results sizes:
0 2451
1 2451
2 2446
3 2446
Differences between glob() and walk():
0 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\numpy
1 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Utility\CppSupport.cpp
2 E:\x\y\z\venv\lib\python3.7\site-packages\future\moves\xmlrpc
3 E:\x\y\z\venv\lib\python3.7\site-packages\Cython\Includes\libcpp
4 E:\x\y\z\venv\lib\python3.7\site-packages\future\backports\xmlrpc
Elapsed time for 'main': 1.317424 seconds.
The fastest way to filter out files by extension happens to be the ugliest one: nested for loops and string comparison using the endswith() method.
Moreover, as you can see, the globbing algorithms (with the pattern E:\x\y\z\**/*[py][pyc]), even with only 2 extensions given (py and pyc), also return incorrect results.
I have released Formic which implements multiple includes in a similar way to Apache Ant's FileSet and Globs.
The search can be implemented:
import formic
patterns = ["*.txt", "*.markdown", "*.mdown"]
fileset = formic.FileSet(directory=projectDir, include=patterns)
for file_name in fileset.qualified_files():
    ...  # Do something with file_name
Because the full Ant glob is implemented, you can include different directories with each pattern, so you could choose only those .txt files in one subdirectory, and the .markdown in another, for example:
patterns = [ "/unformatted/**/*.txt", "/formatted/**/*.mdown" ]
I hope this helps.
This is a Python 3.4+ pathlib solution:
import os
import pathlib

exts = ".pdf", ".doc", ".xls", ".csv", ".ppt"
filelist = (str(i) for i in map(pathlib.Path, os.listdir(src))
            if i.suffix.lower() in exts and not i.stem.startswith("~"))
Also it ignores all file names starting with ~.
After coming here for help, I made my own solution and wanted to share it. It's based on user2363986's answer, but I think this is more scalable. Meaning, that if you have 1000 extensions, the code will still look somewhat elegant.
from glob import glob

directoryPath = "C:\\temp\\*."
fileExtensions = ["jpg", "jpeg", "png", "bmp", "gif"]
listOfFiles = []

for extension in fileExtensions:
    listOfFiles.extend(glob(directoryPath + extension))

for file in listOfFiles:
    print(file)  # Or do other stuff
Not glob, but here's another way using a list comprehension:
import os

extensions = 'txt mdown markdown'.split()
projectFiles = [f for f in os.listdir(projectDir)
                if os.path.splitext(f)[1][1:] in extensions]
The following function _glob globs for multiple file extensions.
import glob
import os

def _glob(path, *exts):
    """Glob for multiple file extensions.

    Parameters
    ----------
    path : str
        A file name without extension, or directory name
    exts : tuple
        File extensions to glob for

    Returns
    -------
    files : list
        List of files matching extensions in exts in path
    """
    path = os.path.join(path, "*") if os.path.isdir(path) else path + "*"
    return [f for files in [glob.glob(path + ext) for ext in exts] for f in files]

files = _glob(projectDir, ".txt", ".mdown", ".markdown")
From previous answer
glob('*.jpg') + glob('*.png')
Here is a shorter one,
from glob import glob
extensions = ['jpg', 'png'] # to find these filename extensions
# Method 1: loop one by one and extend to the output list
output = []
[output.extend(glob(f'*.{name}')) for name in extensions]
print(output)
# Method 2: even shorter
# loop filename extension to glob() it and flatten it to a list
output = [p for p2 in [glob(f'*.{name}') for name in extensions] for p in p2]
print(output)
You can try to make a manual list, comparing the extensions of existing files with those you require.
import glob

ext_list = ['gif', 'jpg', 'jpeg', 'png']
file_list = []
for file in glob.glob('*.*'):
    if file.rsplit('.', 1)[1] in ext_list:
        file_list.append(file)
import os
import glob
import operator
from functools import reduce
types = ('*.jpg', '*.png', '*.jpeg')
lazy_paths = (glob.glob(os.path.join('my_path', t)) for t in types)
paths = reduce(operator.add, lazy_paths, [])
https://docs.python.org/3.5/library/functools.html#functools.reduce
https://docs.python.org/3.5/library/operator.html#operator.add
To glob multiple file types, you need to call the glob() function several times, in a loop. Since this function returns a list, you need to concatenate the lists.
For instance, this function does the job:
import glob
import os

def glob_filetypes(root_dir, *patterns):
    return [path
            for pattern in patterns
            for path in glob.glob(os.path.join(root_dir, pattern))]
Simple usage:
project_dir = "path/to/project/dir"
for path in sorted(glob_filetypes(project_dir, '*.txt', '*.mdown', '*.markdown')):
    print(path)
You can also use glob.iglob() to have an iterator:
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
def iglob_filetypes(root_dir, *patterns):
    return (path
            for pattern in patterns
            for path in glob.iglob(os.path.join(root_dir, pattern)))
One glob, many extensions... but imperfect solution (might match other files).
import glob

filetypes = ['tif', 'jpg']
filetypes = zip(*[list(ft) for ft in filetypes])
filetypes = ["".join(ch) for ch in filetypes]
filetypes = ["[%s]" % ch for ch in filetypes]
filetypes = "".join(filetypes) + "*"
print(filetypes)
# => [tj][ip][fg]*
glob.glob("/path/to/*.%s" % filetypes)
I had the same issue and this is what I came up with:
import os, re

# without glob
src_dir = '/mnt/mypics/'
src_pics = []
ext = re.compile(r'.*\.({})$'.format('|'.join(['png', 'jpeg', 'jpg'])))
for root, dirnames, filenames in os.walk(src_dir):
    for filename in filter(lambda name: ext.search(name), filenames):
        src_pics.append(os.path.join(root, filename))
Use a list of extensions and iterate through it:
from os.path import join
from glob import glob

files = []
extensions = ['*.gif', '*.png', '*.jpg']
for ext in extensions:
    files.extend(glob(join("path/to/dir", ext)))
print(files)
You could use filter:
import os
import glob
projectFiles = filter(
    lambda x: os.path.splitext(x)[1] in [".txt", ".mdown", ".markdown"],
    glob.glob(os.path.join(projectDir, "*"))
)
You could also use reduce() like so:
import glob
from functools import reduce

file_types = ['*.txt', '*.mdown', '*.markdown']
project_files = reduce(lambda list1, list2: list1 + list2, (glob.glob(t) for t in file_types))
This creates a list from glob.glob() for each pattern and reduces them to a single list.
Yet another solution (use glob to get paths using multiple match patterns and combine all paths into a single list using reduce and add):
import functools, glob, operator
paths = functools.reduce(operator.add, [glob.glob(pattern) for pattern in [
    "path1/*.ext1",
    "path2/*.ext2"]])
If you use pathlib try this:
import pathlib
extensions = ['.py', '.txt']
root_dir = './test/'
files = filter(lambda p: p.suffix in extensions, pathlib.Path(root_dir).glob('**/*'))
print(list(files))

Python Glob regex file search with for single result from multiple matches

In Python, I am trying to find a specific file in a directory, let's say, 'file3.txt'. The other files in the directory are 'flie1.txt', 'File2.txt', 'file_12.txt', and 'File13.txt'. The number is unique, so I need to search by a user supplied number.
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + f'{file_num}' + '.txt')
Problem is, that returns both 'file3.txt' and 'File13.txt'. If I try lookbehind, I get no files:
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + r'(?<![1-9]*)' + f'{file_num}' + '.txt')
How do I only get 'file3.txt'?
glob accepts Unix wildcards, not regexes. Those are less powerful but what you're asking can still be achieved. This:
glob.glob("/path/to/file/*[!0-9]3.txt")
matches files whose name ends in 3 not preceded by another digit.
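A quick check of that pattern with fnmatch, which implements the same wildcard rules as glob:

```python
from fnmatch import fnmatch

pattern = '*[!0-9]3.txt'
print(fnmatch('file3.txt', pattern))   # True: 'e' before the 3 is not a digit
print(fnmatch('File13.txt', pattern))  # False: '1' before the 3 is a digit
```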
For other cases, you can use a list comprehension and regex:
[x for x in glob.glob("/path/to/file/*") if re.match(some_regex,os.path.basename(x))]
The problem with glob is that its pattern language is very limited compared to regexes. For instance, you can't express "[a-z_]+" with glob.
So it's better to write your own regex, like this:
import re
import os
file_num = 3
file_re = r"[a-z_]+{file_num}\.txt".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match
work_dir = "C:/Path_to_dir/"
names = list(filter(match_file, os.listdir(work_dir)))

Python - Truncate unknown file names

Let's say I have the following files in a directory:
snackbox_1a.dat
zebrabar_3z.dat
cornrows_00.dat
meatpack_z2.dat
I have SEVERAL of these directories, in which all of the files are of the same format, ie:
snackbox_xx.dat
zebrabar_xx.dat
cornrows_xx.dat
meatpack_xx.dat
So what I KNOW about these files is the first bit (snackbox, zebrabar, cornrows, meatpack). What I don't know is the bit for the file extension (the 'xx'). This changes both within the directory across the files, and across the directories (so another directory might have different xx values, like 12, yy, 2m, 0t, whatever).
Is there a way for me to rename all of these files, or truncate them all (since the xx.dat will always be the same length), for ease of use when attempting to call them? For instance, I'd like to rename them so that I can, in another script, use a simple index to step through and find the file I want (instead of having to go into each directory and pull the file out manually).
In other words, I'd like to change the file names to:
snackbox.dat
zebrabar.dat
cornrows.dat
meatpack.dat
Thanks!
You can use shutil.move to move files. To compute the new filename, you can use Python's string split method:
import shutil

original_name = "snackbox_12.dat"
truncated_name = original_name.split("_")[0] + ".dat"
shutil.move(original_name, truncated_name)
Try re.sub:
import re
filename = 'snackbox_xx.dat'
filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
You should get 'snackbox.dat' for filename_new
This assumes the two characters after the "_" are either a number or lowercase/uppercase letter, but you could choose to expand the classes included in the regular expression.
EDIT: including moving and recursive search:
import shutil, re, os, fnmatch

directory = 'your_path'
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.dat'):
        filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
        shutil.move(os.path.join(root, filename), os.path.join(root, filename_new))
This solution renames all files in the current directory that match the pattern in the function call.
What the function does
snackbox_5R.txt >>> snackbox.txt
snackbox_6y.txt >>> snackbox_0.txt
snackbox_a2.txt >>> snackbox_1.txt
snackbox_Tm.txt >>> snackbox_2.txt
Let's look at the function's inputs and some examples.
list_of_files_names This is a list of strings, where each string is the filename without the _?? part.
Examples:
['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt']
['text.dat']
upper_bound=1000 This is an integer. When the ideal filename is already taken, e.g. snackbox.dat already exists, it will create snackbox_0.dat and so on, up to snackbox_999.dat if need be. You shouldn't have to change the default.
The Code
import re
import os
import os.path

def find_and_rename(dir, list_of_files_names, upper_bound=1000):
    """
    :param list_of_files_names: List. A list of strings: filename (without the _??) + extension, EX: snackbox.txt
    Renames snackbox_R5.dat to snackbox.dat, etc.
    """
    # split each item in list_of_files_names into two parts, filename and extension: "snackbox.dat" -> "snackbox", "dat"
    list_of_files_names = [(prefix.split('.')[0], prefix.split('.')[1]) for prefix in list_of_files_names]
    # store the content of the dir in a list
    list_of_files_in_dir = os.listdir(dir)
    for file_in_dir in list_of_files_in_dir:  # list all files and folders in current dir
        file_in_dir_full_path = os.path.join(dir, file_in_dir)  # we need the full path to rename and to use .isfile()
        print()  # DEBUG
        print('Is "{}" a file?: '.format(file_in_dir), end='')  # DEBUG
        print(os.path.isfile(file_in_dir_full_path))  # DEBUG
        if os.path.isfile(file_in_dir_full_path):  # filters out the folders; only files are needed
            # file_name is a tuple containing the prefix filename and the extension
            for file_name in list_of_files_names:  # check if the file matches one of our renaming prefixes
                # match both the file name (e.g. "snackbox") and the extension (e.g. "dat")
                # It finds "snackbox_5R.txt" by matching "snackbox" in the front and "dat" in the rear
                if re.match(r'{}_\w+\.{}'.format(file_name[0], file_name[1]), file_in_dir):
                    print('\nOriginal File: ' + file_in_dir)  # printing this is not necessary
                    print('.'.join(file_name))
                    ideal_new_file_name = '.'.join(file_name)  # this name might already be taken
                    if os.path.isfile(os.path.join(dir, ideal_new_file_name)):  # file already exists
                        # go up a name, e.g. "snackbox.dat" --> "snackbox_0.dat" --> "snackbox_1.dat"
                        for index in range(upper_bound):
                            # check if this new name already exists as well
                            next_best_name = file_name[0] + '_' + str(index) + '.' + file_name[1]
                            if not os.path.isfile(os.path.join(dir, next_best_name)):  # file does not already exist
                                print('Renaming with next best name')
                                os.rename(file_in_dir_full_path, os.path.join(dir, next_best_name))
                                break
                            else:
                                # this name exists as well; keep increasing the index
                                pass
                    else:
                        # file with the ideal name does not already exist, rename with the ideal name (no _##)
                        print('Renaming with ideal name')
                        os.rename(file_in_dir_full_path, os.path.join(dir, ideal_new_file_name))

def find_and_rename_include_sub_dirs(master_dir, list_of_files_names, upper_bound=1000):
    for path, subdirs, files in os.walk(master_dir):
        print(path)  # DEBUG
        find_and_rename(path, list_of_files_names, upper_bound)

find_and_rename_include_sub_dirs('C:/Users/Oxen/Documents/test_folder', ['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt'])

How to extract numbers from filename in Python?

I need to extract just the numbers from file names such as:
GapPoints1.shp
GapPoints23.shp
GapPoints109.shp
How can I extract just the numbers from these files using Python? I'll need to incorporate this into a for loop.
you can use regular expressions:
import re

regex = re.compile(r'\d+')
Then to get the strings that match:
regex.findall(filename)
This will return a list of strings which contain the numbers. If you actually want integers, you could use int:
[int(x) for x in regex.findall(filename)]
If there's only 1 number in each filename, you could use regex.search(filename).group(0) (if you're certain that it will produce a match). If no match is found, that line will produce an AttributeError saying that NoneType has no attribute 'group'.
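Putting those pieces together on one of the example names from the question (a minimal sketch):

```python
import re

regex = re.compile(r'\d+')
filename = 'GapPoints109.shp'
print(regex.findall(filename))                    # ['109']
print([int(x) for x in regex.findall(filename)])  # [109]
```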
So, you haven't left any description of where these files are and how you're getting them, but I assume you'd get the filenames using the os module.
As for getting the numbers out of the names, you'd be best off using regular expressions with re, something like this:
import re

def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)
Then, to include that in a for loop, you'd run that function on each filename:
import os

for filename in os.listdir(myfiledirectory):
    print(get_numbers_from_filename(filename))
or something along those lines.
If there is just one number:
''.join(filter(lambda x: x.isdigit(), filename))
Here is the code I used to bring the published year of a paper to the front of the filename, after the file has been downloaded from Google Scholar.
The files are usually named Author+publishedYear.pdf; after running this code, the filename becomes PublishedYear+Author.pdf.
# Renaming PDFs according to number extraction:
# you want to rename a pdf file so that the digits of the document's published year come first.
# Uses regular expressions.
# As long as you run this file, the new pattern will be applied to your filenames.

import re
import os

# Change the working directory to this folder
address = os.getcwd()
os.chdir(address)

# defining a class with two functions
class file_name:

    def __init__(self, filename):
        self.filename = filename

    # Because we have two patterns, we must define two functions.
    # First function, for patterns such as: schrodinger1990.pdf
    def number_extrction_pattern_non_digits_first(filename):
        pattern = r'(\D+)(\d+)(\.pdf)'
        digits_pattern_non_digits_first = re.search(pattern, filename, re.IGNORECASE).group(2)
        non_digits_pattern_non_digits_first = re.search(pattern, filename, re.IGNORECASE).group(1)
        return digits_pattern_non_digits_first, non_digits_pattern_non_digits_first

    # Second function, for patterns such as: 1993schrodinger.pdf
    def number_extrction_pattern_digits_first(filename):
        pattern = r'(\d+)(\D+)(\.pdf)'
        digits_pattern_digits_first = re.search(pattern, filename, re.IGNORECASE).group(1)
        non_digits_pattern_digits_first = re.search(pattern, filename, re.IGNORECASE).group(2)
        return digits_pattern_digits_first, non_digits_pattern_digits_first

if __name__ == '__main__':
    # Define a pattern to check the filename pattern
    pattern_check1 = r'(\D+)(\d+)(\.pdf)'
    # Go through each file in the directory.
    for filename in os.listdir(address):
        if filename.endswith('.pdf'):
            if re.search(pattern_check1, filename, re.IGNORECASE):
                digits = file_name.number_extrction_pattern_non_digits_first(filename)[0]
                non_digits = file_name.number_extrction_pattern_non_digits_first(filename)[1]
                os.rename(filename, digits + non_digits + '.pdf')
            else:
                # Otherwise the other pattern applies.
                digits = file_name.number_extrction_pattern_digits_first(filename)[0]
                non_digits = file_name.number_extrction_pattern_digits_first(filename)[1]
                os.rename(filename, digits + non_digits + '.pdf')
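For readers who only need the year-first swap, a much shorter sketch of the same idea (the helper name year_first and the sample filenames are made up for illustration):

```python
import re

def year_first(filename):
    # schrodinger1990.pdf -> 1990schrodinger.pdf
    m = re.match(r'(\D+)(\d+)\.pdf$', filename, re.IGNORECASE)
    if m:
        return m.group(2) + m.group(1) + '.pdf'
    return filename  # already year-first, or not a matching pdf name

print(year_first('schrodinger1990.pdf'))  # 1990schrodinger.pdf
```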

batch renaming 100K files with python

I have a folder with over 100,000 files, all numbered with the same stub, but without leading zeros, and the numbers aren't always contiguous (usually they are, but there are gaps) e.g:
file-21.png,
file-22.png,
file-640.png,
file-641.png,
file-642.png,
file-645.png,
file-2130.png,
file-2131.png,
file-3012.png,
etc.
I would like to batch process this to create padded, contiguous files. e.g:
file-000000.png,
file-000001.png,
file-000002.png,
file-000003.png,
When I parse the folder with for filename in os.listdir('.'): the files don't come up in the order I'd like them to. Understandably they come up
file-1,
file-1x,
file-1xx,
file-1xxx,
etc. then
file-2,
file-2x,
file-2xx,
etc. How can I get it to go through in the order of the numeric value? I am a complete Python noob, but looking at the docs I'm guessing I could use map to create a new list filtering out only the numerical part, then sort that list, then iterate over it? With over 100K files this could be heavy. Any tips welcome!
import os
import re

thenum = re.compile(r'^file-(\d+)\.png$')

def bynumber(fn):
    mo = thenum.match(fn)
    if mo:
        return int(mo.group(1))
    return -1  # sort non-matching names first (Python 3 can't compare None with int)

allnames = os.listdir('.')
allnames.sort(key=bynumber)
Now you have the files in the order you want them and can loop
for i, fn in enumerate(allnames):
    ...
using the progressive number i (which will be 0, 1, 2, ...) padded as you wish in the destination-name.
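That renaming loop can be sketched as a dry run that just computes the new names (assuming the six-digit padding from the question):

```python
# assume allnames has already been sorted numerically, as above
allnames = ['file-21.png', 'file-22.png', 'file-640.png']
newnames = ['file-%06d.png' % i for i, _ in enumerate(allnames)]
print(newnames)  # ['file-000000.png', 'file-000001.png', 'file-000002.png']
```

Replacing the list comprehension with os.rename(fn, 'file-%06d.png' % i) inside the loop performs the actual renames.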
There are three steps. The first is getting all the filenames. The second is converting the filenames. The third is renaming them.
If all the files are in the same folder, then glob should work.
import glob
filenames = glob.glob("/path/to/folder/*.txt")
Next, you want to change the name of the file. You can print with padding to do this.
>>> filename = "file-338.txt"
>>> import os
>>> fnpart = os.path.splitext(filename)[0]
>>> fnpart
'file-338'
>>> _, num = fnpart.split("-")
>>> num.rjust(5, "0")
'00338'
>>> newname = "file-%s.txt" % num.rjust(5, "0")
>>> newname
'file-00338.txt'
Now, you need to rename them all. os.rename does just that.
os.rename(filename, newname)
To put it together:
for filename in glob.glob("/path/to/folder/*.txt"):  # loop through each file
    newname = make_new_filename(filename)  # create a function that does step 2, above
    os.rename(filename, newname)
Thank you all for your suggestions, I will try them all to learn the different approaches. The solution I went for is based on using a natural sort on my filelist, and then iterating that to rename. This was one of the suggested answers but for some reason it has disappeared now so I cannot mark it as accepted!
import os

files = os.listdir('.')
natsort(files)

index = 0
for filename in files:
    os.rename(filename, str(index).zfill(7) + '.png')
    index += 1
where natsort is defined in http://code.activestate.com/recipes/285264-natural-string-sorting/
Why don't you do it in a two-step process? Parse all the files and rename them with padded numbers, then run another script that takes those files, which are now sorted correctly, and renames them so they're contiguous.
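A dry-run sketch of that two-step idea (the helper names and sample filenames are hypothetical; each step returns an old-name-to-new-name mapping instead of renaming on disk):

```python
import re

def pad_step(names):
    # step 1: zero-pad the numeric part so lexicographic order == numeric order
    return {n: re.sub(r'-(\d+)\.', lambda m: '-%06d.' % int(m.group(1)), n) for n in names}

def renumber_step(padded_names):
    # step 2: renumber the (now correctly sorted) names contiguously
    return {old: 'file-%06d.png' % i for i, old in enumerate(sorted(padded_names))}

names = ['file-21.png', 'file-640.png', 'file-22.png']
padded = pad_step(names)
print(sorted(padded.values()))  # ['file-000021.png', 'file-000022.png', 'file-000640.png']
```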
1) Take the number in the filename.
2) Left-pad it with zeros
3) Save name.
import os

def renamer():
    for iname in os.listdir('.'):
        first, second = iname.replace(" ", "").split("-")
        number, ext = second.split('.')
        first, number, ext = first.strip(), number.strip(), ext.strip()
        number = '0' * (6 - len(number)) + number  # pad the number to be 6 digits long
        oname = first + "-" + number + '.' + ext
        os.rename(iname, oname)
    print("Done")
Hope this helps
The simplest method is given below; you can also modify this script for a recursive search. It uses the os module: get the filenames with os.listdir, then rename each with os.rename.
import os

class Renamer:
    def __init__(self, pattern, extension):
        self.ext = extension
        self.pat = pattern

    def rename(self):
        p, e = (self.pat, self.ext)
        number = 0
        for x in os.listdir(os.getcwd()):
            if str(x).endswith(f".{e}"):
                os.rename(x, f'{p}_{number}.{e}')
                number += 1

if __name__ == "__main__":
    pattern = "myfile"
    extension = "txt"
    r = Renamer(pattern=pattern, extension=extension)
    r.rename()
