Regular expression, glob, Python


I have a folder that contains many files.
One group of them is named pc_0.txt, pc_1.txt, ..., pc_699.txt.
I want to select all files from pc_200.txt through pc_699.txt.
How?
for filename in glob.glob("pc*.txt"):
    global_list.append(filename)

For this specific case, glob already supports what you need (see the fnmatch docs for glob wildcards). You can just do:
for filename in glob.glob("pc_[23456]??.txt"):
If you need to be extra specific that the two trailing characters are digits (some files might have non-numeric characters there), you can replace each ? with [0123456789] (or the equivalent range [0-9]), but otherwise, I find the ? a little less distracting.
In a more complicated scenario, you might be forced to resort to regular expressions, and you could do so here with:
import os
import re
for filename in filter(re.compile(r'^pc_[2-6]\d\d\.txt$').match, os.listdir('.')):
but given that glob-style wildcards work well enough, you don't need to break out the big guns just yet.
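As a rough, self-contained check of the two approaches above, the sketch below runs both patterns against an in-memory list of names instead of a real folder. The extra names pc_2ab.txt and pc_1000.txt are made up for illustration, and fnmatch.filter is used here because glob applies the same wildcard rules to file names.

```python
import fnmatch
import re

# Hypothetical directory listing: pc_0.txt .. pc_699.txt plus two decoys.
names = [f"pc_{i}.txt" for i in range(700)] + ["pc_2ab.txt", "pc_1000.txt"]

# Glob-style pattern: a leading digit 2-6 followed by exactly two digits.
glob_hits = fnmatch.filter(names, "pc_[2-6][0-9][0-9].txt")

# Equivalent regular expression, matched against the whole name.
regex = re.compile(r"pc_[2-6]\d\d\.txt")
regex_hits = [name for name in names if regex.fullmatch(name)]

print(len(glob_hits), glob_hits[0], glob_hits[-1])
print(glob_hits == regex_hits)
```

Both selections should contain only pc_200.txt through pc_699.txt; the decoys are rejected because of the explicit digit classes.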

Related

glob syntax working not as expected( [ ] *)

I have a folder containing 4 files.
Keras_entity_20210223-2138.h5
intent_tokens.pickle
word_entity_set_20210223-2138.pickle
LSTM_history.h5
I used code:
NER_MODEL_FILEPATH = glob.glob("model/[Keras_entity]*.h5")[0]
This works as I expected: NER_MODEL_FILEPATH ends up as the path of the Keras_entity file, and the other .h5 file is not picked up.
But when I use this code:
WORD_ENTITY_SET_FILEPATH = glob.glob("model/[word_entity_set]*.pickle")[0]
It does not work as I expected: rather than picking up only the word_entity_set file,
the list returned by glob contains both pickle files.
Why would this happen?

Simply remove the square brackets: word_entity_set*.pickle
Per the docs:
[seq] matches any character in seq
So word_entity_set_20210223-2138.pickle is matched because it starts with a w, and intent_tokens.pickle is matched because it starts with an i.
To be clear, it is working as expected. Your expectations were incorrect.

Your code selects intent_tokens.pickle and word_entity_set_20210223-2138.pickle because your glob is incorrect. Change the glob to "word_entity_set*.pickle".
When you use [<phrase>]*.pickle, you're telling the globber to match exactly one character out of those in <phrase>, followed by any characters, followed by ".pickle". So "wordwordword.pickle" will match, and so will:
wwww.pickle
w.pickle
But
.pickle
xw.pickle
foobar.pickle
will not, because none of them starts with a character from <phrase>.
There are truly infinite permutations.
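To see the difference between the bracketed and the plain pattern without touching the file system, here is a small sketch that applies both to the four file names from the question. It uses fnmatch.fnmatchcase, which follows the same wildcard rules as glob; the variable names are only illustrative.

```python
import fnmatch

# The four file names listed in the question.
files = [
    "Keras_entity_20210223-2138.h5",
    "intent_tokens.pickle",
    "word_entity_set_20210223-2138.pickle",
    "LSTM_history.h5",
]

# [word_entity_set] is a character set: it matches ONE character from it.
bracketed = [f for f in files if fnmatch.fnmatchcase(f, "[word_entity_set]*.pickle")]

# Without brackets, the text is matched literally as a prefix.
plain = [f for f in files if fnmatch.fnmatchcase(f, "word_entity_set*.pickle")]

print(bracketed)  # both pickle files: 'i' and 'w' are in the set
print(plain)      # only the word_entity_set file
```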

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code getting to my goal, but seems like I'm doing it the hard way.
Is there a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.

You can use Path from pathlib to extract the final path component without its suffix:
from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
output:
'lead'

As @wwii said in a comment, you should use os.path.splitext, which is designed specifically to separate filenames from their extensions, and str.split/str.rsplit, which are designed specifically to cut strings at a character. Using those functions, there are several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
import os
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, either with the maxsplit argument or by selecting the last (or the second) item of the resulting list, depending on which method was used. All of the following lines are equivalent (in terms of functionality at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as the previous line, except that it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # splits only once from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as the previous line, except that it selects the second chunk (which is the last, since only one split occurred)
The best is probably one of the last two solutions, since they do not perform useless splits.
Why is this answer better than the others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a filename from its extension; os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the intention of the code and is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything other than PDF. If one day you use a .txt, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.
Using splitext clearly indicates that you are doing something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.
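Putting the two steps from this answer together, a minimal sketch could look like the following. The function name extract_suffix is made up for illustration; it selects index [-1] rather than [1] so that a name without any underscore returns the whole stem instead of raising IndexError.

```python
import os

def extract_suffix(filename: str) -> str:
    """Return the text after the last underscore, without the extension."""
    # Step 1: drop the extension, whatever it is (.pdf, .txt, ...).
    stem, _extension = os.path.splitext(filename)
    # Step 2: split once from the right and keep the last chunk.
    return stem.rsplit("_", maxsplit=1)[-1]

print(extract_suffix("song1_lyrics.pdf"))
print(extract_suffix("file1_with_extra_underscores_lead.pdf"))
print(extract_suffix("song2_with_extra_underscores_vocals.txt"))
```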

This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
But there is no error checking here. If there is no "_" in the string, rfind returns -1 and the slice silently returns the whole name minus its last four characters instead of raising an error.
You could use a regex as well. That shouldn't be difficult.
Basically, I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for it.

This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

Python regular expression for Windows file path

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to consider 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes that could indicate that the preceding characters are the name of a folder and that what follows isn't the file name itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution.

Here's the expression I got, based on yours, that allows me to get the path on Windows: [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here: https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ segment one after another, so the group only retains the last segment it matched.
Mine wraps the non-capturing repetition (?:[a-zA-Z0-9() ]*\\)* in a single capture group. This allows me to capture the concatenation XXX\YYYY\ZZZ\ as a whole. As such, it allows me to get the full path.
The second change I made is related to the filename: I'll just match any group of characters that does not contain \ (the capture group being greedy). This allows me to take care of strange file names.
Another regex that would work is: [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example: https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? matches in a non-greedy way: thus, .*?\\ matches the bare minimum of text followed by a backslash.
Do not hesitate to ask if you have any questions regarding the expressions.
I'd also encourage you to check how well your expression works using https://regex101.com . It also has a list of the different tokens you can use in your regex.
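The difference between a repeated capture group and a capture group wrapped around a non-capturing repetition can be checked with Python's re module. This is only a sketch on one made-up sample path (the path and variable names are assumptions), not a test of the full expressions above against arbitrary text.

```python
import re

# Hypothetical sample path, used only for illustration.
sample = r"C:\Users\fred\Documents\report.txt"

# Repeated capture group: each repetition overwrites the previous capture.
repeated = re.compile(r"[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*")

# Capture group around a non-capturing repetition: keeps all segments.
wrapped = re.compile(r"[a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*)")

last_only = repeated.match(sample).group(1)
whole = wrapped.match(sample).group(1)

print(last_only)  # only the last folder segment
print(whole)      # every folder segment, concatenated
```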
Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want, and I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How does this work: I split the original string into a list of substrings, based on the backslash. I then remove the first and last parts ([1:-1] keeps the 2nd element through the one before the last) and transform the resulting list back into a string.
This works whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
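Here is the split-and-join command from the edit as a small runnable sketch. The helper name folder_part and the C: drive prefix on the sample strings are assumptions added for illustration. Note that the slice [1:-1] drops the first element (the drive) as well as the last one (the file name, or the empty string left after a trailing backslash).

```python
def folder_part(target: str) -> str:
    """Keep only the folder components between the first and last backslash-separated parts."""
    return "\\".join(target.split("\\")[1:-1])

# Hypothetical inputs: one full file address and one directory path.
file_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe"
dir_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\"

print(folder_part(file_path))
print(folder_part(dir_path))
```

On these two inputs both calls return the same folder chain, which is what the answer means by working for a path as well as for a file address.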

Matching patterns for folder names in a path, excluding a chunk of the path from matching?

Assume an initial (Unix) path [segment] like /var/log. Underneath this path, there might be an entire tree of directories. A user provides a pattern for folder names using Unix shell-style wildcards, e.g. *var*. Folders following the pattern underneath the initial path [segment] shall be matched using a regular expression given a full path as input, i.e. the initial path segment must be excluded from matching.
How would I build a regular expression doing this?
I am working with Python, which offers the fnmatch module as part of its standard library. fnmatch provides a translate method, which translates patterns specified using Unix shell-style wildcards into regular expressions:
>>> fnmatch.translate('*var*')
'(?s:.*var.*)\\Z'
I would like to use this for constructing my regular expressions.
Matching input paths could look like this:
/var/log/foo/var/bar
/var/log/foo/avarb/bar
/var/log/var/
Not matching input paths could look like this:
/var/log
/var/log/foo/bar
The underlying issue is that I have to provide the regular expression to a third-party module, pyinotify, as input. I can not work around this by just stripping the initial path segment and then matching against the remainder ...

You should be able to use a negative lookbehind like so:
(?<!^\/)var
Both positive and negative lookbehinds are really useful when working with regex.
Here is an interactive example so you can get a feel for how it works with visual feedback: https://regex101.com/r/52sZjw/1
Another example: https://regex101.com/r/F023eD/1/
I'm not exactly sure how you can use this with fnmatch. It really looks like you might end up building the strings yourself, at least when the user's input matches part of the path you want to exclude.
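The lookbehind can be tried against the sample paths from the question with Python's re module. This sketch assumes the expression is applied with re.search; how pyinotify applies its filter is not covered here. It also only rejects a "var" that directly follows the leading slash, so it relies on the initial segment being /var/... and is not a general way to exclude an arbitrary prefix.

```python
import re

# "var" that is NOT immediately preceded by the slash at the very start.
pattern = re.compile(r"(?<!^/)var")

# Sample paths taken from the question.
should_match = ["/var/log/foo/var/bar", "/var/log/foo/avarb/bar", "/var/log/var/"]
should_not_match = ["/var/log", "/var/log/foo/bar"]

matched = [p for p in should_match + should_not_match if pattern.search(p)]
print(matched)
```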

split twice in the same expression?

Imagine I have the following:
inFile = "/adda/adas/sdas/hello.txt"
# this instruction gives me hello.txt
Name = inFile.split("/")[-1]
# this one gives me the name I want - just hello
Name1 = Name.split(".")[0]
Is there any way to simplify this and do the same job in just one expression?

You can get what you want in a platform-independent way by using os.path.basename to get the last part of a path and then os.path.splitext to get the filename without its extension.
from os.path import basename, splitext
pathname = "/adda/adas/sdas/hello.txt"
name, extension = splitext(basename(pathname))
print(name) # --> "hello"
Using os.path.basename and os.path.splitext instead of str.split or re.split is more proper (and therefore received more points than any other answer) because it does not break down on other platforms that use different path separators (you would be surprised how varied this can be).
It also carries the most points because it answers your question for "one line" precisely and is aesthetically more pleasing than your example (even though that is debatable, as are all questions of taste).

Answering the question in the topic rather than trying to analyze the example...
You really want to use Florian's solution if you want to split paths, but if you promise not to use this for path parsing...
You can use re.split() to split on several separators by or-ing them with a '|'. Have a look at this:
import re
inFile = "/adda/adas/sdas/hello.txt"
print(re.split(r'\.|/', inFile)[-2])

>>> inFile = "/adda/adas/sdas/hello.txt"
>>> inFile.split('/')[-1]
'hello.txt'
>>> inFile.split('/')[-1].split('.')[0]
'hello'

If it is always going to be a path like the above, you can use os.path.split and os.path.splitext.
The following example will print just hello:
from os.path import split, splitext
path = "/adda/adas/sdas/hello.txt"
print(splitext(split(path)[1])[0])
For more info see https://docs.python.org/library/os.path.html

I'm pretty sure some Regex-Ninja* could give you a more or less sane way to do that (or, as I now see others have posted, ways to write two expressions on one line...)
But I'm wondering why you want to split it with just one expression.
For such a simple split, it's probably faster to do two than to create some advanced either-or logic. If you split twice, it's safer too:
I guess you want to separate the path, the file name and the file extension. If you split on '/' first, you know the filename should be in the last array index; then you can try to split just the last index to see whether you can find the file extension or not. Then you don't need to care whether there are dots in the path names.
*(Any sane users of regular expressions should not be offended. ;)
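To illustrate why splitting in two steps is safer, the sketch below compares the single re.split expression with the os.path approach. The second path, which has a dot in a folder name and no file extension, is a made-up example, and the function names are only illustrative.

```python
import re
from os.path import basename, splitext

def name_via_regex(path: str) -> str:
    # One expression: split on every dot or slash, take the second-to-last piece.
    return re.split(r"\.|/", path)[-2]

def name_via_os_path(path: str) -> str:
    # Two steps: isolate the last path component, then drop its extension.
    return splitext(basename(path))[0]

simple = "/adda/adas/sdas/hello.txt"
tricky = "/adda/v1.2/hello"  # hypothetical: dotted folder, no extension

print(name_via_regex(simple), name_via_os_path(simple))
print(name_via_regex(tricky), name_via_os_path(tricky))
```

On the simple path both functions agree; on the tricky one the single-expression split picks up a fragment of the folder name instead of the file name.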

We Keep Coding
