Can I match groups / optional groups using glob in Python? - python

I want to be able to use the glob module in Python 3.9 to match filenames in a directory containing the following file names:
"MM05_awani3_StudentA.py"
"MM05_liu127.py"
Specifically, I want to be able to loop over all the files in a directory that fit a certain pattern. So I want to use a for loop like this:
for file in current_path.glob("string"):
    # do something
The glob pattern "MM05_submissions/MM05_*[a-z0-9]?(_Student[A-Z]).py" seems to work according to DigitalOcean's glob tester tool, but I'm not getting any matches in Python 3.9.
Is the glob used on DigitalOcean's tester different from the one in Python?
Can I match optional groups in Python using round brackets ()?
If not, should I use something like RegEx to loop over files that match a certain pattern in a directory?

You can't use (...) grouping, no. The glob module uses the fnmatch module to do the matching, and fnmatch supports only *, ?, [seq] and [!seq], nothing more.
However, fnmatch uses a simple pattern-to-regex conversion to test filenames. Just do the same yourself with os.scandir():
import re
import os
# Raw string so that \. reaches the regex engine unchanged
pattern = re.compile(r"MM05_[a-z0-9]*(_Student[A-Z])?\.py")
for entry in os.scandir("MM05_submissions"):
    if pattern.fullmatch(entry.name):
        # name matched the pattern
        print(entry.path)
If you need to support arbitrary depth patterns using directory names, you'll have to write something up using os.walk().
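A minimal sketch of the os.walk() approach mentioned above, assuming the same regex should be applied to the file name at any depth below a root directory. The helper name iter_matching_files is hypothetical, not part of any library.

```python
import os
import re

# Same pattern as above: optional "_StudentX" suffix before ".py".
PATTERN = re.compile(r"MM05_[a-z0-9]*(_Student[A-Z])?\.py")


def iter_matching_files(root, pattern=PATTERN):
    """Yield paths under root whose file name fully matches pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only the final path component is tested, not the directories.
            if pattern.fullmatch(name):
                yield os.path.join(dirpath, name)
```

Calling iter_matching_files("MM05_submissions") in a for loop would then visit every matching file in that directory and its subdirectories.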

Related

Match zero or one pattern occurrence

Suppose I have the following files:
/path/to/file/a.png
/path/to/file/a_match_me.png
/path/to/file/a_dont_match_me.png
I want to match 2 files
/path/to/file/a.png
/path/to/file/a_match_me.png
but do not match /path/to/file/a_dont_match_me.png
So I need a pattern, something like /path/to/file/a[zero or one occurrence of _match_me].png
Can I do this using the glob library in Python?
You would have a hard time doing this with the built-in Python glob library, but you could do this with a third party library.
The python library wcmatch can be used for the described case above. Full disclosure, I am the author of the mentioned library.
Below we use the GLOBSTAR flag (G) to match multiple folders with ** and the EXTGLOB flag (E) to enable extended glob patterns such as !(), which excludes a file name pattern. We use the globfilter function, which filters full path names with glob patterns. It is similar to fnmatch's filter, but works on full paths.
from wcmatch import glob
files = [
"/path/to/file/a.png",
"/path/to/file/a_match_me.png",
"/path/to/file/a_dont_match_me.png"
]
print(glob.globfilter(files, '**/a!(_dont_match_me).png', flags=glob.G | glob.E))
Output
['/path/to/file/a.png', '/path/to/file/a_match_me.png']
You could also glob these files directly from the file system:
from wcmatch import glob
glob.glob('**/a!(_dont_match_me).png', flags=glob.G | glob.E)
Hopefully this helps.
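If installing a third-party package is not an option, the zero-or-one requirement can also be sketched with the standard library by applying a regex to each file name. The helper name filter_optional_suffix is hypothetical.

```python
import os
import re

# "a", optionally followed by "_match_me", then ".png".
NAME_RE = re.compile(r"a(_match_me)?\.png")


def filter_optional_suffix(paths):
    """Keep paths whose final component fully matches NAME_RE."""
    return [p for p in paths if NAME_RE.fullmatch(os.path.basename(p))]


files = [
    "/path/to/file/a.png",
    "/path/to/file/a_match_me.png",
    "/path/to/file/a_dont_match_me.png",
]
print(filter_optional_suffix(files))
# ['/path/to/file/a.png', '/path/to/file/a_match_me.png']
```

Unlike the exclusion pattern above, this accepts only the two listed names rather than everything except a_dont_match_me.png.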

How to replace a file path using Regex in Python?

I want to reduce a file path such as "C:\Users\Bob\Documents\file.xlsx" to just "C:\Users\Bob\Documents". What regular expression would replace 'file.xlsx' with an empty string? The file name can be anything, with any extension such as txt, xls, or csv.
I'm unable to write a regex that replaces the right group.
Regex Self Try
You're not looking for regex here, you're looking for os.path.dirname.
import os
....
path = r"C:\Users\Bob\Documents\file.xlsx"
os.path.dirname(path)
Output:
'C:\\Users\\Bob\\Documents'
Extra info: I highly recommend using os.path.join to create cross platform compatible paths, instead of using a string directly.
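One caveat: os.path.dirname treats the backslash as a separator only when running on Windows. A sketch using pathlib.PureWindowsPath, which applies Windows path rules on any platform, is shown here; the file name report.csv is a made-up example.

```python
from pathlib import PureWindowsPath

path = PureWindowsPath(r"C:\Users\Bob\Documents\file.xlsx")

# .parent drops the final component regardless of the host OS.
parent = path.parent
print(parent)  # C:\Users\Bob\Documents

# The "/" operator joins components without hand-written separators.
sibling = parent / "report.csv"  # hypothetical file name
print(sibling)  # C:\Users\Bob\Documents\report.csv
```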

Extract file that contains specific string on filename from ZIP using Python zipfile

I have a ZIP file and I need to extract all the files (normally one) that contain the string "test" in the filename. They are all xlsx files.
I am using Python zipfile for that. This is my code that doesn't work:
zip.extract(r'*\test.*\.xlsx$', './')
The error I get:
KeyError: "There is no item named '*\\\\test.*\\\\.xlsx$' in the archive"
Any ideas?
You have multiple problems here:
The r prefix simply means the string is treated as a raw string; it does not create a regular expression object. In any case, zip.extract() expects an exact member name (or a ZipInfo object), not a pattern.
The * quantifier at the start of the regex has nothing before it to repeat.
You need to manually iterate through the zip file index and match the filenames against your regex:
from zipfile import ZipFile
import re
zip = ZipFile('myzipfile.zip')
for info in zip.infolist():
    if re.match(r'.*test.*\.xlsx$', info.filename):
        print(info.filename)
        zip.extract(info)
You might also consider shell-style glob syntax from the fnmatch module: fnmatch.fnmatchcase(info.filename, '*test*.xlsx'). Behind the scenes it converts the pattern to a regex, but it makes your code slightly simpler.
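A sketch of the fnmatch-based variant, assuming the goal is to extract every member whose name contains "test" and ends in ".xlsx". The helper name extract_matching and its default arguments are hypothetical.

```python
import fnmatch
from zipfile import ZipFile


def extract_matching(archive_path, pattern="*test*.xlsx", dest="."):
    """Extract members whose name matches a shell-style pattern.

    Returns the list of extracted member names.
    """
    extracted = []
    with ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # fnmatchcase is case-sensitive; "*" also matches "/" here,
            # so members inside subfolders are matched as well.
            if fnmatch.fnmatchcase(info.filename, pattern):
                archive.extract(info, dest)
                extracted.append(info.filename)
    return extracted
```

For the archive in the question this would be called as extract_matching('myzipfile.zip').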

apply command to list of files in python

I've a tricky problem. I need to apply a specific command called xRITDecompress to a list of files with extension -C_ and I should do this with Python.
Unfortunately, this command doesn't work with wildcards and I can't do something like:
os.system("xRITDecompress *-C_")
In principle, I could write an auxiliary bash script with a for cycle and call it inside my python program. However, I'd like not to rely on auxiliary files...
What would be the best way to do this within a python program?
You can use glob.glob() to get the list of files on which you want to run the command and then for each file in that list, run the command -
import glob
import os
for f in glob.glob('*-C_'):
    os.system('xRITDecompress {}'.format(f))
From documentation -
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell.
If you meant _ (underscore) to match any single character, use ? instead:
glob.glob('*-C?')
Note that this pattern only searches the current directory, which appears to be what you wanted, judging by your original attempt.
You may also want to look at the subprocess module; it is a more powerful module for running commands (spawning processes). Example:
import subprocess
import glob
for f in glob.glob('*-C_'):
    subprocess.call(['xRITDecompress', f])
You can use glob.glob or glob.iglob to get files that match the given pattern:
import glob
import os
files = glob.iglob('*-C_')
for f in files:
    os.system("xRITDecompress %s" % f)
Just use glob.glob to search and os.system to execute
import os
from glob import glob
for file in glob('*-C_'):
    os.system("xRITDecompress %s" % file)
I hope it satisfies your question
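A sketch combining the answers above with subprocess.run() and check=True, so that a failing invocation raises an exception instead of being silently ignored. It assumes the command takes one file name per call, as in the question; the helper name run_for_each is hypothetical.

```python
import glob
import subprocess


def run_for_each(command, pattern):
    """Run command once per file matching pattern; return the files processed."""
    processed = []
    for path in sorted(glob.glob(pattern)):
        # Passing a list (no shell) avoids quoting problems with odd file names.
        subprocess.run([*command, path], check=True)
        processed.append(path)
    return processed
```

For the case in the question, the call would be run_for_each(["xRITDecompress"], "*-C_"), assuming xRITDecompress is on the PATH.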

Grab part of filename with Python

Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs parts of filenames corresponding to a certain pattern, and outputs it to a textfile.
So in my case, let's say I have four .pdf like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re
# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']  # glob.glob("*.pdf")
for fname in file_list:
    # Raw string, with the dot escaped so it matches a literal "."
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Good luck!
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
If the numbers are variable length, you'll want the regex module "re"
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by one or more digits ("[0-9]+"), followed by a period and then only non-period characters up to the end of the string ("\.[^\.]+$"), which is the file extension.
You can use the os module in python and do a listdir to get a list of filenames present in that path like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:  # skip names that do not contain "ID_"
        print(m.group())
The above snippet prints the part of the filename following ID_. Because \w does not match a period, the .pdf extension is already excluded: for your example it prints 8423, 8852 and so on.
You probably want to use glob, which is a python module for file globbing. From the python help page the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
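Putting the pieces together, here is a sketch that globs the PDF names, extracts the digits after "ID_", and writes them to a text file as the question asks. It assumes the ID is purely numeric; the helper names collect_ids and write_ids, and the output file name used below, are hypothetical.

```python
import re
from pathlib import Path

# Digits between "ID_" and the ".pdf" extension.
ID_RE = re.compile(r"ID_(\d+)\.pdf$")


def collect_ids(directory):
    """Return the ID portion of every matching PDF name in directory."""
    ids = []
    for path in sorted(Path(directory).glob("*.pdf")):
        match = ID_RE.search(path.name)
        if match:  # PDFs without an ID are skipped
            ids.append(match.group(1))
    return ids


def write_ids(directory, output_file):
    """Write one ID per line to output_file and return the IDs."""
    ids = collect_ids(directory)
    Path(output_file).write_text("\n".join(ids) + "\n")
    return ids
```

A call such as write_ids(".", "ids.txt") would process the current directory.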
