Cut out a sequence of files using glob in python - python

I have a directory with files like img-0001.jpg, img-0005.jpg, img-0006.jpg, ... , img-xxxx.jpg.
What I need to do is to get a list with all files starting at 0238, literally img-0238.jpg. The next existing filename is img-0240.jpg.
Right now I use glob to get all filenames.
list_images = glob.glob(path_images + "*.jpg")
Thanks in advance
Edit:
-> The last filename is img-0315.jpg

glob doesn't support regex filtering, but you can filter the list right after you receive all the matching files.
Here is how it would look using re:
import glob
import re
# The alternatives are grouped so that the trailing \.jpg$ applies to each of them.
list_images = [f for f in glob.glob(path_images + "*.jpg")
               if re.search(r'(?:[1-9]\d{3}|0[3-9]\d{2}|02[4-9]\d|023[89])\.jpg$', f)]
The regular expression verifies that the filename ends with a four-digit number greater than or equal to 0238, followed by .jpg.
You can play around with the regular expression at https://regex101.com/
Basically, we check whether the number:
starts with 1-9, followed by any 3 digits
or starts with 0[3-9], followed by any 2 digits
or starts with 02[4-9], followed by any 1 digit
or starts with 023, followed by either 8 or 9.
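To see the four cases in action, here is a minimal sketch that runs the pattern over a few made-up names (the sample list is hypothetical, not from a real directory). It assumes the alternatives are wrapped in a non-capturing group so that the `\.jpg$` anchor applies to all of them:

```python
import re

# Alternatives are grouped so that "\.jpg$" applies to every one of them.
PATTERN = re.compile(r'(?:[1-9]\d{3}|0[3-9]\d{2}|02[4-9]\d|023[89])\.jpg$')

# Hypothetical sample names for illustration only.
names = ['img-0001.jpg', 'img-0237.jpg', 'img-0238.jpg',
         'img-0240.jpg', 'img-0315.jpg', 'img-1000.jpg']

# Keep only the names whose four-digit number is 0238 or higher.
selected = [n for n in names if PATTERN.search(n)]
print(selected)
```

Note that this pattern only enforces the lower bound; an upper bound would need further alternatives or a numeric comparison.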
But it would probably be easier to do a simple string comparison (this works because the numbers are zero-padded to four digits):
list_images = [f for f in glob.glob(path_images + "*.jpg") \
if f[-8:-4] > "0237" and f[-8:-4] < "0316"]

You can specify multiple repeated wildcards to match all files whose number is 023[89], 02[4-9][0-9], 030[0-9], and so on:
import glob
import os
list_images = []
for pattern in ('023[89]', '02[4-9][0-9]', '030[0-9]', '031[0-5]'):
    list_images.extend(glob.glob(
        os.path.join(path_images, '*{0}.jpg'.format(pattern))))
Or you can just filter out the ones you don't want:
list_images = [x for x in glob.glob(os.path.join(path_images, "*.jpg"))
               if 238 <= int(x[-8:-4]) <= 315]
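The fixed slice `x[-8:-4]` relies on every name ending in exactly four digits plus `.jpg`. As a hedged alternative, the number can be pulled out with a regex so that names of other shapes are simply skipped. The helper name `in_range` and the sample paths below are made up for illustration:

```python
import re

def in_range(path, low, high):
    """Return True if the number right before '.jpg' lies in [low, high]."""
    m = re.search(r'(\d+)\.jpg$', path)
    return m is not None and low <= int(m.group(1)) <= high

# Hypothetical paths; in practice this would be the result of glob.glob(...).
paths = ['dir/img-0005.jpg', 'dir/img-0238.jpg', 'dir/img-0240.jpg',
         'dir/img-0315.jpg', 'dir/img-0316.jpg', 'dir/notes.txt']

list_images = [p for p in paths if in_range(p, 238, 315)]
print(list_images)
```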

For something like this, you could try the wcmatch library. It's a library that aims to enhance file globbing and wildcard matching.
In this example, we enable brace expansion and demonstrate the pattern by filtering a list of files:
from wcmatch import glob
files = []
# Generate list of files from img-0000.jpg to img-0315.jpg
for x in range(316):
    files.append('path/img-{:04d}.jpg'.format(x))
print(glob.globfilter(files, 'path/img-{0238..0315}.jpg', flags=glob.BRACE))
And we get the following output:
['path/img-0238.jpg', 'path/img-0239.jpg', 'path/img-0240.jpg', 'path/img-0241.jpg', 'path/img-0242.jpg', 'path/img-0243.jpg', 'path/img-0244.jpg', 'path/img-0245.jpg', 'path/img-0246.jpg', 'path/img-0247.jpg', 'path/img-0248.jpg', 'path/img-0249.jpg', 'path/img-0250.jpg', 'path/img-0251.jpg', 'path/img-0252.jpg', 'path/img-0253.jpg', 'path/img-0254.jpg', 'path/img-0255.jpg', 'path/img-0256.jpg', 'path/img-0257.jpg', 'path/img-0258.jpg', 'path/img-0259.jpg', 'path/img-0260.jpg', 'path/img-0261.jpg', 'path/img-0262.jpg', 'path/img-0263.jpg', 'path/img-0264.jpg', 'path/img-0265.jpg', 'path/img-0266.jpg', 'path/img-0267.jpg', 'path/img-0268.jpg', 'path/img-0269.jpg', 'path/img-0270.jpg', 'path/img-0271.jpg', 'path/img-0272.jpg', 'path/img-0273.jpg', 'path/img-0274.jpg', 'path/img-0275.jpg', 'path/img-0276.jpg', 'path/img-0277.jpg', 'path/img-0278.jpg', 'path/img-0279.jpg', 'path/img-0280.jpg', 'path/img-0281.jpg', 'path/img-0282.jpg', 'path/img-0283.jpg', 'path/img-0284.jpg', 'path/img-0285.jpg', 'path/img-0286.jpg', 'path/img-0287.jpg', 'path/img-0288.jpg', 'path/img-0289.jpg', 'path/img-0290.jpg', 'path/img-0291.jpg', 'path/img-0292.jpg', 'path/img-0293.jpg', 'path/img-0294.jpg', 'path/img-0295.jpg', 'path/img-0296.jpg', 'path/img-0297.jpg', 'path/img-0298.jpg', 'path/img-0299.jpg', 'path/img-0300.jpg', 'path/img-0301.jpg', 'path/img-0302.jpg', 'path/img-0303.jpg', 'path/img-0304.jpg', 'path/img-0305.jpg', 'path/img-0306.jpg', 'path/img-0307.jpg', 'path/img-0308.jpg', 'path/img-0309.jpg', 'path/img-0310.jpg', 'path/img-0311.jpg', 'path/img-0312.jpg', 'path/img-0313.jpg', 'path/img-0314.jpg', 'path/img-0315.jpg']
So, we could apply this to a file search:
from wcmatch import glob
list_images = glob.glob('path/img-{0238..0315}.jpg', flags=glob.BRACE)
In this example we've hard-coded the path, but in your case make sure path_images has a trailing / so that the pattern is constructed correctly. Others have suggested this might be an issue. Print out your pattern to confirm it is correct.
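If installing an extra library is not an option, roughly the same effect can be sketched with the standard library: build the candidate names for the numeric range and keep those that exist. The helper name `candidate_names` and the `img-{:04d}.jpg` naming scheme are assumptions based on the question:

```python
import os

def candidate_names(directory, low, high):
    """Build the paths img-LOW.jpg .. img-HIGH.jpg (zero-padded to 4 digits)."""
    return [os.path.join(directory, 'img-{:04d}.jpg'.format(n))
            for n in range(low, high + 1)]

candidates = candidate_names('path', 238, 315)

# Keep only the candidates that are actually present on disk.
list_images = [p for p in candidates if os.path.isfile(p)]
```

Using `os.path.join` also sidesteps the trailing-slash question, since the separator is added for you.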

Related

how to count the no of files of each extension in a directory using python?

I'm fairly new to Python and came across this problem:
I want to write a Python script that counts the number of files of each extension in a directory and outputs the following details:
First row shows image count
Second row shows file names in padding format.
Third row shows frame numbers continuity
Example:
files in the directory:-
alpha.txt
file01_0040.rgb
file01_0041.rgb
file01_0042.rgb
file01_0043.rgb
file02_0044.rgb
file02_0045.rgb
file02_0046.rgb
file02_0047.rgb
Output:-
1 alpha.txt
4 file01_%04d.rgb 40-43
4 file02_%04d.rgb 44-47
I'd suggest you have a look at Python's built-in pathlib module (it comes with glob support).
Here's a first idea of how you could do it (this can surely be improved, but it should provide you with the basics):
from pathlib import Path
from itertools import groupby
basepath = Path(r"path-to-the-folder")
# get all filenames that follow the pattern
files = basepath.glob("file*_[0-9][0-9][0-9][0-9].rgb")
# extract the pattern so that we get the 2 numbers separately
# sort first, because groupby only groups adjacent items, and cut off the
# literal "file" prefix, e.g. "file01_0040" -> ["01", "0040"]
patterns = sorted(i.stem[len("file"):].split("_") for i in files)
# group by the first number
groups = groupby(patterns, key=lambda x: x[0])
filenames = list()
# loop through the obtained groups and extract min/max file_ids found
for file_num, group in groups:
    file_ids = [int(i[1]) for i in group]
    filenames.append(f"file{file_num}_%04d.rgb {min(file_ids)} - {max(file_ids)}")
print(*filenames, sep="\n")
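To get closer to the output format asked for in the question (count, padded pattern, frame range), here is a rough sketch that works on a plain list of names. The function name `summarize` is made up, and it assumes the frame number is the run of digits after the last underscore. It reports the minimum and maximum frame, not whether the frames in between are all present:

```python
import re
from collections import defaultdict

def summarize(names):
    """Return lines like '4 file01_%04d.rgb 40-43' for numbered sequences."""
    groups = defaultdict(list)  # padded pattern -> list of frame numbers
    lines = []
    for name in sorted(names):
        m = re.fullmatch(r'(.*_)(\d+)(\.\w+)', name)
        if m is None:
            # Not part of a numbered sequence: report it on its own.
            lines.append('1 {}'.format(name))
            continue
        prefix, digits, ext = m.groups()
        key = '{}%0{}d{}'.format(prefix, len(digits), ext)
        groups[key].append(int(digits))
    for key in sorted(groups):
        frames = groups[key]
        lines.append('{} {} {}-{}'.format(len(frames), key, min(frames), max(frames)))
    return lines

names = ['alpha.txt',
         'file01_0040.rgb', 'file01_0041.rgb', 'file01_0042.rgb', 'file01_0043.rgb',
         'file02_0044.rgb', 'file02_0045.rgb', 'file02_0046.rgb', 'file02_0047.rgb']
print(*summarize(names), sep='\n')
```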
You can use the glob module to search through directories like this:
import glob
glob.glob('*.rgb')
This returns a list with the names of all files ending in .rgb in the current directory, which you can then sort and process.

Python regular expression, ignoring characters until some character is matched a number of times

I'm renaming a batch of files I downloaded from a torrent and wanted to get each episode's name, so I figured regex would do the trick. I'm kinda new to regex, so I'd appreciate the help. This is what I could come up with:
I have a class related to other renaming functions, so the function defined here is within this class, which is initialized with the path to the files directory, the expression to rename to, and the file extension.
I'm using glob to access all files with the extension ".mkv".
For debugging, I printed out all the file names:
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
def strip_ep_name(self):
    for i, f in enumerate(self.files):
        f_list = f.split("\\")
        name, ext = os.path.splitext(f_list[-1])
        ep_name = name.strip(r'(.*?)".720p.WEB-DL.x264-[MULVAcoded]"')
        print(ep_name)
for me, the goal is to get the episode's name, either with or without the episode's number, because i can, later on, give the episode a new name.
and the output is:
r.Robot.S02E01.eps2.0_unm4sk-pt1.t
r.Robot.S02E02.eps2.0_unm4sk-pt2.t
r.Robot.S02E03.eps2.1_k3rnel-pan1c.ks
r.Robot.S02E04.eps2.2_init_1.as
r.Robot.S02E05.eps2.3.logic-b0mb.h
r.Robot.S02E06.eps2.4.m4ster-s1ave.aes
r.Robot.S02E07.eps2.5_h4ndshake.sm
r.Robot.S02E08.eps2.6.succ3ss0r.p1
r.Robot.S02E09.eps2.7_init_5.fv
r.Robot.S02E10.eps2.8_h1dden-pr0cess.a
r.Robot.S02E11.eps2.9_pyth0n-pt1.p7z
r.Robot.S02E12.eps2.9_pyth0n-pt2.p7z
I wanted to strip all the ".eps2.2" parts before the episode's name, but they don't follow a consistent pattern.
Now I don't know how to move on from here. Can anyone help?
Do it all in one step:
\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$
Broken down, this reads:
\.eps\d+\.\d+ # ".eps", followed by digits, a dot and other digits
[-_.] # one of -, _ or .
(.+?) # anything else lazily afterwards
(?:\.720p.+) # until .720p is found (might need some tweaking)
\. # a dot
(\w+)$ # some word characters (aka the file extension) at the end
Replace the match with .\1.\2 to get your desired format in the end.
Everything in Python:
import re
filenames = """
Mr.Robot.S02E01.eps2.0_unm4sk-pt1.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E02.eps2.0_unm4sk-pt2.tc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E03.eps2.1_k3rnel-pan1c.ksd.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E04.eps2.2_init_1.asec.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E06.eps2.4.m4ster-s1ave.aes.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E07.eps2.5_h4ndshake.sme.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E08.eps2.6.succ3ss0r.p12.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E09.eps2.7_init_5.fve.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E10.eps2.8_h1dden-pr0cess.axx.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E11.eps2.9_pyth0n-pt1.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
Mr.Robot.S02E12.eps2.9_pyth0n-pt2.p7z.720p.WEB-DL.x264-[MULVAcoded].mkv
"""
rx = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$', re.M)
filenames = rx.sub(r".\1.\2", filenames)
print(filenames)
Which yields
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv
See a demo on regex101.com.
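When the names come from glob one at a time rather than as one multi-line string, the same substitution can be wrapped in a small helper. This is only a sketch; the function name `short_name` is made up, and the pattern is the one from above without the re.M flag since each call sees a single name:

```python
import re

# Same pattern as above, applied to one filename at a time.
RX = re.compile(r'\.eps\d+\.\d+[-_.](.+?)(?:\.720p.+)\.(\w+)$')

def short_name(filename):
    """Return the shortened name, or the input unchanged if nothing matches."""
    return RX.sub(r'.\1.\2', filename)

print(short_name('Mr.Robot.S02E05.eps2.3.logic-b0mb.hc.720p.WEB-DL.x264-[MULVAcoded].mkv'))
```

Because `sub` leaves non-matching input untouched, files that do not follow the naming scheme pass through unchanged instead of raising an error.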
First, import Python's regex module:
import re
Then use this to strip the prefix from, for example, "r.Robot.S02E01.eps2.0_unm4sk-pt1.t":
ep_name = re.sub(r"eps2\.\d{1,2}(\.|\_)","",episode_name)
Call this inside your loop, passing each episode name in turn as episode_name, and then print ep_name.
The output will look like:
r.Robot.S02E01.unm4sk-pt1.t
I'm not sure I understand correctly; I don't know the series, and hence not the titles either. But do you really need re?
for f in files:
    print(f[23:-35].split('.')[0])
results in
unm4sk-pt1
unm4sk-pt2
k3rnel-pan1c
init_1
logic-b0mb
m4ster-s1ave
h4ndshake
succ3ss0r
init_5
h1dden-pr0cess
pyth0n-pt1
pyth0n-pt2
Edit:
I still don't see an actual target format definition in your post, but just in case @Jan is right, here's the re-less solution for that, too:
for f in files:
    print(f[:16] + '.'.join(f[23:].split('.')[:2]) + '.mkv')
Mr.Robot.S02E01.unm4sk-pt1.tc.mkv
Mr.Robot.S02E02.unm4sk-pt2.tc.mkv
Mr.Robot.S02E03.k3rnel-pan1c.ksd.mkv
Mr.Robot.S02E04.init_1.asec.mkv
Mr.Robot.S02E05.logic-b0mb.hc.mkv
Mr.Robot.S02E06.m4ster-s1ave.aes.mkv
Mr.Robot.S02E07.h4ndshake.sme.mkv
Mr.Robot.S02E08.succ3ss0r.p12.mkv
Mr.Robot.S02E09.init_5.fve.mkv
Mr.Robot.S02E10.h1dden-pr0cess.axx.mkv
Mr.Robot.S02E11.pyth0n-pt1.p7z.mkv
Mr.Robot.S02E12.pyth0n-pt2.p7z.mkv

Python Glob regex file search for single result from multiple matches

In Python, I am trying to find a specific file in a directory, let's say, 'file3.txt'. The other files in the directory are 'flie1.txt', 'File2.txt', 'file_12.txt', and 'File13.txt'. The number is unique, so I need to search by a user supplied number.
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + f'{file_num}' + '.txt')
Problem is, that returns both 'file3.txt' and 'File13.txt'. If I try lookbehind, I get no files:
file_num = 3
my_file = glob.glob('C:/Path_to_dir/' + r'[a-zA-Z_]*' + r'(?<![1-9]*)' + f'{file_num}' + '.txt')
How do I only get 'file3.txt'?
glob accepts Unix wildcards, not regexes. Those are less powerful but what you're asking can still be achieved. This:
glob.glob("/path/to/file/*[!0-9]3.txt")
matches only the files whose name ends in 3.txt with a non-digit character immediately before the 3.
For other cases, you can use a list comprehension and a regex:
[x for x in glob.glob("/path/to/file/*") if re.match(some_regex,os.path.basename(x))]
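As a concrete instance of the list-comprehension approach, here is a small sketch using the file names from the question. The helper name `find_by_number` is made up; it assumes the name consists only of letters or underscores followed by the number and `.txt`:

```python
import re

def find_by_number(names, file_num):
    """Return the names made of letters/underscores, then file_num, then .txt."""
    pattern = re.compile(r'[A-Za-z_]*{}\.txt'.format(int(file_num)))
    # fullmatch requires the whole name to fit, so 'File13.txt' is not
    # accepted when looking for 3: the digit 1 is not a letter or underscore.
    return [n for n in names if pattern.fullmatch(n)]

names = ['flie1.txt', 'File2.txt', 'file3.txt', 'file_12.txt', 'File13.txt']
print(find_by_number(names, 3))
```

With real files, `names` could come from `os.listdir(...)` or from `os.path.basename` applied to the results of `glob.glob(...)`.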
The problem with glob is that it only supports shell-style wildcards, not regular expressions. For instance, you can't express "[a-z_]+" with glob.
So it's better to write your own regex, like this:
import re
import os
file_num = 3
file_re = r"[a-z_]+{file_num}\.txt".format(file_num=file_num)
match_file = re.compile(file_re, flags=re.IGNORECASE).match
work_dir = "C:/Path_to_dir/"
names = list(filter(match_file, os.listdir(work_dir)))

Python script to match a particular text in filename and count the number of such files

In a folder I have files containing file names as the following :
Q1234_ABC_B02_12232.hl7
12313_SDDD_Q03_44545.hl7
Q43434_SAD_B02_2312.hl7
4324_SDSD_W05_344423423.hl7
3123123_DSD_D06_67578.hl7
and many such files
I need to write a Python script to count the number of files whose names begin with "Q" and which have "B02" after the second underscore, which means that I should get an output count of 2. I have tried the following script but have not got the desired result.
import re
import os
resultsDict = {}
myString1 = ""
regex = r'[^_]+_([^_]*)_.*'
for file_name in os.listdir("."):
    m = file_name.split("_")
    if len(m) > 2 :
        myString = m[2]
        if "B02" in myString:
            myString1 = myString
            if myString1 in resultsDict:
                resultsDict[myString1] += 1
            else:
                resultsDict.update({myString1: 1})
    else:
        print "error in the string! there are less then 2 _"
print resultsDict
I am using python 2.6.6. Any help would be useful.
At the time of this writing, there are several answers with a wrong regex.
One of these is probably better:
r'^Q[^_]*_[^_]*_B02_.*'
r'^Q[^_]*_[^_]*_B02.*'
r'^Q[^_]*_[^_]*_B02(_.*|$)'
If you stick with .* instead, the regex might consume some intermediate underscores, so you are no longer able to enforce that B02 comes after the second _.
After that, testing for matching values (re.match) is a simple loop over the file names (os.listdir or glob.glob). Here is an example using a list comprehension:
>>> l = [file for file in os.listdir(".") if re.match(r'^Q[^_]*_[^_]*_B02.*', file)]
>>> l
['Q1234_ABC_B02_12232.hl7', 'Q43434_SAD_B02_2312.hl7']
>>> len(l)
2
For better performance you might wish to compile the regex first (re.compile).
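A minimal sketch of the compiled variant, run against the sample names from the question instead of a real directory (the helper name `count_matches` is made up for illustration):

```python
import re

# Compile once, reuse for every file name.
B02_RE = re.compile(r'^Q[^_]*_[^_]*_B02(_.*|$)')

def count_matches(names):
    """Count names starting with Q whose third '_'-separated field starts with B02."""
    return sum(1 for n in names if B02_RE.match(n))

names = ['Q1234_ABC_B02_12232.hl7', '12313_SDDD_Q03_44545.hl7',
         'Q43434_SAD_B02_2312.hl7', '4324_SDSD_W05_344423423.hl7',
         '3123123_DSD_D06_67578.hl7']
print(count_matches(names))
```

For a real folder, `names` would be replaced by `os.listdir(".")`.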
Since a comment by @camh above makes me think that you may have jumped into Python because you couldn't find a shell-based solution, here is how to do the same thing using only bash:
sh$ shopt -s extglob
sh$ ls Q*([^_])_*([^_])_B02*
Q1234_ABC_B02_12232.hl7 Q43434_SAD_B02_2312.hl7
sh$ ls Q*([^_])_*([^_])_B02* | wc -l
# ^^^^^^^
# This *won't* work if some file names contain '\n' !!!
Use a regular expression:
import os
import re
resultsDict = {}
# Note: ".*" can also match underscores, so this does not strictly
# require B02 to be the third underscore-separated field.
expression = "^Q.*_.*_B02_.*"
p = re.compile(expression)
for file_name in os.listdir("."):
    if p.match(file_name):
        if file_name in resultsDict:
            resultsDict[file_name] = resultsDict[file_name] + 1
        else:
            resultsDict[file_name] = 1
You can try with this regular expression:
'^Q.*_.*_B02_.*'
This code will print all files in the current directory that match your requirements.
import os
import re
regex = r'^Q\w+_\w+_B02'  # \w+ matches one or more word characters between the underscores
for f in os.listdir("."):
    if re.match(regex, f, re.I):
        print(f)
A word character is A-Z, a-z, 0-9 or the underscore, so \w+ can itself run across an underscore.
A solution with list comprehensions instead of regular expressions. First, get all the names that start with Q, and split them on the underscores:
import os
dirs = [d.split('_') for d in os.listdir(".") if d.startswith('Q')]
Now keep only the names with two or more underscores:
dirs = [d for d in dirs if len(d) > 2]
Finally, narrow it down:
dirs = [d for d in dirs if d[2] == 'B02']
You could combine the last two comprehensions into one:
dirs = [d for d in dirs if len(d) > 2 and d[2] == 'B02']
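Put together as a count, the split-based check might look like the sketch below. The helper name `is_wanted` is hypothetical, and the sample names from the question stand in for `os.listdir(".")`:

```python
def is_wanted(name):
    """True if name starts with 'Q' and its third '_'-separated field is 'B02'."""
    parts = name.split('_')
    return name.startswith('Q') and len(parts) > 2 and parts[2] == 'B02'

names = ['Q1234_ABC_B02_12232.hl7', '12313_SDDD_Q03_44545.hl7',
         'Q43434_SAD_B02_2312.hl7', '4324_SDSD_W05_344423423.hl7',
         '3123123_DSD_D06_67578.hl7']

# Booleans sum as 0/1, so this counts the matching names.
count = sum(is_wanted(n) for n in names)
print(count)
```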

Use of regular expression to exclude characters in file rename,Python?

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code, but my regular expression doesn't seem to be having any effect. The reason for the walk is that we have to handle large directory trees with many images.
def fix_length():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            re.sub("[^0-9][-]", "", filename_zero)
            os.rename(path2, filename_zero + extension)
fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom) not what I wanted)
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
import os
import re
import shutil
# raw strings, so the backslashes are not treated as escape sequences
path = r'C:\Archive'
errors = r'C:\Test\errors'
num_files = []
def best_sol():
    num_files = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            filename_zero, extension = os.path.splitext(fn)
            path2 = os.path.join(root, fn)
            # up to the first 10 leading digits form the ID
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                if ID not in num_files:
                    # new ID: restart the counter list
                    num_files = []
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
                else:
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
            else:
                # ID too long: move the file aside for investigation
                shutil.copy(path2, errors)
                os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use a list that stores the instances of this ID and use the length of the list to append a suffix. The first file will have a -1, the second a -2, etc.
I am only interested in IDs with a length of 7 (or they should only be that long), but I read up to 10 digits to allow for human error in labelling. All files with an ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.
re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub(r"[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()
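Here is a hedged sketch of how the "keep what you want" approach could be wired into the rename step. The helper name `cleaned_path` is made up, it allows more than one digit after the dash (an assumption, slightly looser than the pattern above), and it returns None rather than raising when a name has no leading digits. It also builds the target inside the same directory, since renaming to a bare filename would move the file into the current working directory:

```python
import os
import re

def cleaned_path(root, fn):
    """Return the target path '<root>/<digits>[-<digits>]<ext>', or None."""
    base, ext = os.path.splitext(fn)
    m = re.match(r'\d+(?:-\d+)?', base)
    if m is None:
        return None  # no leading ID, leave the file alone
    return os.path.join(root, m.group() + ext)

print(cleaned_path('archive', '1234567-1(crop to bottom).jpg'))
```

Inside an `os.walk` loop, one could then call `os.rename(os.path.join(root, fn), target)` only when `target` is not None and does not already exist.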
new_filename = re.sub(r'^([0-9]+)-([0-9]+).*$', r'\g<1>-\g<2>', filename_zero)
Try using this regular expression instead; the trailing .* consumes the rest of the name so that only the two number groups are kept. I hope this is how regular expressions work in Python, as I don't use it often. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.
