match zero or one pattern occurrence - python

Suppose I have the following files:
/path/to/file/a.png
/path/to/file/a_match_me.png
/path/to/file/a_dont_match_me.png
I want to match 2 files
/path/to/file/a.png
/path/to/file/a_match_me.png
but not /path/to/file/a_dont_match_me.png
So I need a pattern, something like /path/to/file/a[zero or one occurrence of _match_me].png
Can I do this using the glob library in Python?

You would have a hard time doing this with the built-in Python glob library, but you could do this with a third party library.
The Python library wcmatch can handle the case described above. Full disclosure, I am the author of the mentioned library.
Below we use the GLOBSTAR flag (G) to match multiple folders with ** and the EXTGLOB flag (E) to use extended glob patterns such as !(), which excludes a file name pattern. We use the globfilter function, which filters full path names with glob patterns. It is similar to fnmatch's filter, but works on full paths.
from wcmatch import glob
files = [
"/path/to/file/a.png",
"/path/to/file/a_match_me.png",
"/path/to/file/a_dont_match_me.png"
]
print(glob.globfilter(files, '**/a!(_dont_match_me).png', flags=glob.G | glob.E))
Output
['/path/to/file/a.png', '/path/to/file/a_match_me.png']
You could also glob these files directly from the file system:
from wcmatch import glob
glob.glob('**/a!(_dont_match_me).png', flags=glob.G | glob.E)
Hopefully this helps.
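For the literal "zero or one occurrence" requirement, a standard-library alternative is to filter the paths with a regular expression, where an optional group such as (?:_match_me)? expresses exactly that. The following is only a minimal sketch: the helper name filter_optional is made up for illustration, and the paths are the ones from the question.

```python
import os
import re

# "a", then zero or one "_match_me", then ".png" (applied to the base name only)
PATTERN = re.compile(r"a(?:_match_me)?\.png")


def filter_optional(paths):
    """Keep paths whose file name matches PATTERN in full (hypothetical helper)."""
    return [p for p in paths if PATTERN.fullmatch(os.path.basename(p))]


files = [
    "/path/to/file/a.png",
    "/path/to/file/a_match_me.png",
    "/path/to/file/a_dont_match_me.png",
]
print(filter_optional(files))
```

Unlike the !() exclusion, this accepts only the two listed names, so any other suffix such as a_other.png would be rejected as well.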

Related

Python Regex to extract file where filename contains and also should not contain specific pattern from a zip folder

I want to extract just one specific single file from the zip folder which has the below 3 files.
Basically it should start with 'kpidata_nfile' and should not contain 'fileheader'
kpidata_nfile_20220919-20220925_fileheader.csv
kpidata_nfile_20220905-20220911.csv
othername_kpidata_nfile_20220905-20220911.csv
Below is the code I have tried:
from zipfile import ZipFile
import re
import os
for x in os.listdir('.'):
    if re.match(r'.*\.(zip)', x):
        with ZipFile(x, 'r') as zip:
            for info in zip.infolist():
                if re.match(r'^kpidata_nfile_', info.filename):
                    zip.extract(info)
Output required - kpidata_nfile_20220905-20220911.csv
This regex does what you require:
^kpidata_nfile(?:(?!fileheader).)*$
See this answer for more about the (?:(?!fileheader).)*$ part.
You can see the regex working on your example filenames here.
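As a quick check, the pattern can be run against the three example names from the question. This is a small sketch; the variable names are illustrative.

```python
import re

# Starts with "kpidata_nfile"; no later position may begin the word "fileheader"
pattern = re.compile(r"^kpidata_nfile(?:(?!fileheader).)*$")

names = [
    "kpidata_nfile_20220919-20220925_fileheader.csv",
    "kpidata_nfile_20220905-20220911.csv",
    "othername_kpidata_nfile_20220905-20220911.csv",
]
matching = [n for n in names if pattern.match(n)]
print(matching)
```

Only kpidata_nfile_20220905-20220911.csv survives: the first name contains fileheader, and the third does not start with kpidata_nfile.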
The regex is not particularly readable, so it might be better to use Python expressions instead of regex. Something like:
fname = info.filename
if fname.startswith('kpidata_nfile') and 'fileheader' not in fname:
    zip.extract(info)
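Put together with the loop from the question, the plain-string check might look like the sketch below. The function name extract_kpi_files and its dest_dir parameter are assumptions made for illustration, not part of the original code.

```python
from zipfile import ZipFile


def extract_kpi_files(zip_path, dest_dir="."):
    """Extract members that start with 'kpidata_nfile' and lack 'fileheader'.

    Hypothetical helper; returns the names of the members it extracted.
    """
    extracted = []
    with ZipFile(zip_path) as archive:
        for info in archive.infolist():
            fname = info.filename
            if fname.startswith("kpidata_nfile") and "fileheader" not in fname:
                archive.extract(info, path=dest_dir)
                extracted.append(fname)
    return extracted
```

Note that info.filename is the full member path inside the archive, so if the files sit in a subfolder of the zip, the startswith test would need to be applied to the base name instead.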

Can I match groups / optional groups using glob in Python?

I want to be able to use the glob module in Python 3.9 to match filenames in a directory containing the following file names:
"MM05_awani3_StudentA.py"
"MM05_liu127.py"
Specifically, I want to be able to loop over all the files in a directory that fit a certain pattern. So I want to use a for loop like this:
for file in current_path.glob("string"):
    pass  # do something
The glob pattern "MM05_submissions/MM05_*[a-z0-9]?(_Student[A-Z]).py" seems to work according to DigitalOcean's glob tester tool, but I'm not getting any matches inside of Python 3.9.
Is the glob used on DigitalOcean's tester different from the one in Python?
Can I match optional groups in Python using round brackets ()?
If not, should I use something like RegEx to loop over files that match a certain pattern in a directory?
You can't use (...) grouping, no. The glob module uses the fnmatch module to do the matching, and it supports *, ?, [seq] and [!seq], nothing more.
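A short experiment with fnmatch illustrates why the pattern finds nothing: the parentheses are treated as ordinary characters. This is only a sketch, and the third file name is invented purely to show the literal match.

```python
import fnmatch

pattern = "MM05_*[a-z0-9]?(_Student[A-Z]).py"

# "(" and ")" are plain characters to fnmatch, so neither real name matches
print(fnmatch.fnmatchcase("MM05_awani3_StudentA.py", pattern))
print(fnmatch.fnmatchcase("MM05_liu127.py", pattern))

# Only a name that literally contains the parentheses can match
print(fnmatch.fnmatchcase("MM05_ab1x(_StudentA).py", pattern))
```

The first two calls print False and the last prints True.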
However, fnmatch uses a simple pattern-to-regex conversion to test filenames. Just do the same yourself with os.scandir():
import re
import os
# Raw string so that \. reaches the regex engine unchanged
pattern = re.compile(r"MM05_[a-z0-9]*(_Student[A-Z])?\.py")
for entry in os.scandir("MM05_submissions"):
    if pattern.fullmatch(entry.name):
        # name matched the pattern
        print(entry.path)
If you need to support arbitrary depth patterns using directory names, you'll have to write something up using os.walk().
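One way the os.walk() variant could look is sketched below. The generator name find_matching is hypothetical, and the regex is applied to file names only, not to directory names.

```python
import os
import re


def find_matching(root, pattern):
    """Yield paths under root whose file name fully matches the compiled pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if pattern.fullmatch(name):
                yield os.path.join(dirpath, name)


pattern = re.compile(r"MM05_[a-z0-9]*(_Student[A-Z])?\.py")
for path in find_matching("MM05_submissions", pattern):
    print(path)
```

Matching on directory names as well would require testing the path relative to root rather than just the file name.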

apply command to list of files in python

I've a tricky problem. I need to apply a specific command called xRITDecompress to a list of files with extension -C_ and I should do this with Python.
Unfortunately, this command doesn't work with wildcards and I can't do something like:
os.system("xRITDecompress *-C_")
In principle, I could write an auxiliary bash script with a for loop and call it inside my Python program. However, I'd rather not rely on auxiliary files...
What would be the best way to do this within a python program?
You can use glob.glob() to get the list of files on which you want to run the command and then for each file in that list, run the command -
import glob
import os
for f in glob.glob('*-C_'):
    os.system('xRITDecompress {}'.format(f))
From documentation -
The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell.
If by _ (underscore) you meant to match any single character, you should use ? instead, like this:
glob.glob('*-C?')
Please note that glob only searches the current directory here, but judging from your original attempt, that seems to be what you want.
You may also want to look at the subprocess module, a more powerful module for running commands (spawning processes). Example:
import subprocess
import glob
for f in glob.glob('*-C_'):
    subprocess.call(['xRITDecompress', f])
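On Python 3.5 or later, subprocess.run is the usual entry point, and check=True makes a failing command raise an exception instead of passing silently. The sketch below wraps this in a helper; the name run_on_matches and its parameters are assumptions made for illustration.

```python
import glob
import subprocess


def run_on_matches(command, pattern):
    """Run `command <file>` for every file matching the glob pattern.

    Hypothetical helper: `command` is a list such as ["xRITDecompress"].
    Returns the list of files that were processed.
    """
    processed = []
    for path in sorted(glob.glob(pattern)):
        # Passing a list bypasses the shell, so spaces in file names are safe
        subprocess.run(command + [path], check=True)
        processed.append(path)
    return processed
```

For the case in the question, the call would be run_on_matches(["xRITDecompress"], "*-C_").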
You can use glob.glob or glob.iglob to get files that match the given pattern:
import glob
import os
files = glob.iglob('*-C_')
for f in files:
    os.system("xRITDecompress %s" % f)
Just use glob.glob to search and os.system to execute
import os
from glob import glob
for file in glob('*-C_'):
    os.system("xRITDecompress %s" % file)
I hope this answers your question.

Grab part of filename with Python

Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs parts of filenames corresponding to a certain pattern, and outputs them to a text file.
So in my case, let's say I have four .pdf files like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me toward the Python modules, and possibly guides, that could help me?
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re
# Example filenames. Use glob as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']  # glob.glob("*.pdf")
for fname in file_list:
    # Raw string; the dot before "pdf" is escaped so it matches a literal period
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
Good luck!
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
If the numbers are variable length, you'll want the regex module "re":
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by one or more digits ("[0-9]+"), followed by a period and the file extension at the end of the string ("\.[^\.]+$")
You can use the os module in python and do a listdir to get a list of filenames present in that path like so:
import os
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:
        print(m.group(0))
The above snippet prints the part of the filename that follows ID_. Since \w does not match a period, the match stops before the extension, so for your example it prints 8423, 8852 and so on, without the .pdf part.
You probably want to use glob, which is a Python module for file globbing. From the Python documentation, the usage is as follows:
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
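Combining the pieces from the answers above, a script that collects the IDs and writes them to a text file could look like the following sketch. The function name collect_ids and both parameters are made up for illustration, and it assumes the IDs consist of digits only, as in the question's examples.

```python
import re
from pathlib import Path

# Digits between "ID_" and the ".pdf" extension at the end of the name
ID_PATTERN = re.compile(r"ID_(\d+)\.pdf$")


def collect_ids(directory, output_file):
    """Write the ID from each matching PDF name in `directory` to `output_file`."""
    ids = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        match = ID_PATTERN.search(pdf.name)
        if match:
            ids.append(match.group(1))
    # One ID per line
    Path(output_file).write_text("\n".join(ids) + "\n")
    return ids
```

Calling collect_ids(".", "ids.txt") would process the PDFs in the current directory and return the list of IDs it wrote.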

search in wildcard folders recursively in python

Hello, I'm trying to do something like
for x in glob.glob('/../../nodes/*/views/assets/js/*.js'):  # 1.
    print(x)
for x in glob.glob('/../../nodes/*/views/assets/js/*/*.js'):  # 2.
    print(x)
Is there anything I can do to search recursively?
I already looked into "Use a Glob() to find files recursively in Python?", but os.walk doesn't accept wildcard folders like the ones above between nodes and views, and the docs at http://docs.python.org/library/glob.html don't help much.
thanks
Caveat: This will also select any files matching the pattern anywhere beneath the root folder which is nodes/.
import os, fnmatch
def locate(pattern, root_path):
    for path, dirs, files in os.walk(os.path.abspath(root_path)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
As os.walk does not accept wildcards we walk the tree and filter what we need.
js_assets = [js for js in locate('*.js', '/../../nodes')]
The locate function is a generator that yields every file matching the pattern.
Alternative solution: You can try the extended glob which adds recursive searching to glob.
Now you can write a much simpler expression like:
fnmatch.filter( glob.glob('/../../nodes/*/views/assets/js/**/*'), '*.js' )
I answered a similar question here: fnmatch and recursive path match with `**`
You could use glob2 or formic, both available via easy_install or pip.
GLOB2
FORMIC
You can find them both mentioned here:
Use a Glob() to find files recursively in Python?
I use glob2 a lot, ex:
import glob2
files = glob2.glob(r'C:\Users\**\iTunes\**\*.mp4')
Why don't you split your wild-carded paths into multiple parts, like:
import glob
import os
parent_path = glob.glob('/../../nodes/*')
for p in parent_path:
    child_paths = glob.glob(os.path.join(p, './views/assets/js/*.js'))
    for c in child_paths:
        pass  # do something with c
You can replace some of the above with a list of child assets that you want to retrieve.
Alternatively, if your environment provides the find command, that provides better support for this kind of task. If you're in Windows, there may be an analogous program.
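On Python 3.5 and later, the standard glob module itself understands ** when called with recursive=True, so a third-party package is not strictly required. Below is a minimal sketch that assumes the directory layout from the question; the function name find_js_assets and its parameter are illustrative.

```python
import glob
import os


def find_js_assets(nodes_root):
    """Return .js files at any depth below <nodes_root>/*/views/assets/js."""
    pattern = os.path.join(nodes_root, "*", "views", "assets", "js", "**", "*.js")
    # With recursive=True, "**" matches zero or more directory levels
    return sorted(glob.glob(pattern, recursive=True))
```

The single * after nodes_root still matches exactly one folder level, which keeps the wildcard between nodes and views that os.walk alone cannot express.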
