glob syntax working not as expected( [ ] *) - python

I have a folder containing 4 files.
Keras_entity_20210223-2138.h5
intent_tokens.pickle
word_entity_set_20210223-2138.pickle
LSTM_history.h5
I used code:
NER_MODEL_FILEPATH = glob.glob("model/[Keras_entity]*.h5")[0]
This works as I expect: the list returned by glob.glob contains only the path of the Keras_entity file, and the other .h5 file is not picked up.
But when I use this code:
WORD_ENTITY_SET_FILEPATH = glob.glob("model/[word_entity_set]*.pickle")[0]
It doesn't work as expected: rather than picking up only the word_entity_set file,
the list contains both pickle files.
Why would this happen?

Simply remove the square brackets: word_entity_set*.pickle
Per the docs:
[seq] matches any character in seq
So word_entity_set_20210223-2138.pickle is matched because it starts with a w, and intent_tokens.pickle is matched because it starts with an i.
To be clear, it is working as expected. Your expectations were incorrect.
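A minimal sketch of the behaviour described above. It uses fnmatch.fnmatchcase from the standard library (glob relies on fnmatch for its pattern matching), so no files are needed on disk. The four file names are the ones from the question; the helper name matching is only an illustration.

```python
from fnmatch import fnmatchcase

# The four file names from the question.
names = [
    "Keras_entity_20210223-2138.h5",
    "intent_tokens.pickle",
    "word_entity_set_20210223-2138.pickle",
    "LSTM_history.h5",
]


def matching(pattern):
    """Return the names that match a glob-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# [word_entity_set] is a character class: it matches ONE character
# out of w, o, r, d, _, e, n, t, i, y, s.
bracketed = matching("[word_entity_set]*.pickle")

# Without the brackets, the literal prefix has to match.
literal = matching("word_entity_set*.pickle")

print(bracketed)  # both pickle files
print(literal)    # only the word_entity_set file
```

The Keras pattern only appeared to work because LSTM_history.h5 happens to start with a character that is not in [Keras_entity].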

Your code selects intent_tokens.pickle and word_entity_set_20210223-2138.pickle because your glob is incorrect. Change the glob to "word_entity_set*.pickle"
When you use [<phrase>]*.pickle, you're telling the globber to match exactly one character out of <phrase>, followed by any characters, followed by ".pickle". So "wordwordword.pickle" will match, and so will:
wwww.pickle
w.pickle
But
.pickle
xw.pickle
foobar.pickle
will not, because none of them starts with a character from <phrase>.
There are truly infinite permutations.

Related

Cut string in python by counting characters

So I have a string inside of a file:
C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c
What would be the best way to cut it just by counting backslashes from the end:
So in this case we need to keep everything after the 4th backslash when counting backward:
Folder1\Folder2\Folder3\Module.c
I always need to count backward because I know my folder structure will look like that at the end. I cannot count from the first character, since the number of backslashes "\" will not always be the same when counting from the start.
If your string is always a path, you should be using pathlib.Path to handle it:
import os
from pathlib import Path
path = Path(r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c')
Then we can get the following:
>>> path.parts[-4:]
('Folder1', 'Folder2', 'Folder3', 'Module.c')
>>> os.sep.join(path.parts[-4:])
'Folder1\\Folder2\\Folder3\\Module.c'
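The output above is what Path gives on Windows. On other systems Path is a PosixPath, which treats backslashes as ordinary characters, so parts would hold a single element. A hedged sketch that behaves the same everywhere uses pathlib.PureWindowsPath; the helper name last_components and its count parameter are assumptions made for this example.

```python
from pathlib import PureWindowsPath


def last_components(path_text, count=4):
    """Join the last `count` components of a Windows-style path."""
    # PureWindowsPath parses backslashes on every operating system.
    parts = PureWindowsPath(path_text).parts
    return "\\".join(parts[-count:])


tail = last_components(
    r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'
)
print(tail)
```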
Try this:
'\\'.join(s.split('\\')[-4:])
To read the file mentioned in your comment:
with open('yourfile') as f:
    for s in f:  # usually better than for s in f.readlines()
        # strip the trailing newline so print does not emit blank lines
        print('\\'.join(s.rstrip('\n').split('\\')[-4:]))
readlines() loads the whole file into memory, which can be problematic if the file is huge and exceeds the process memory limits.
Try this:
s = r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'
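'\\'.join(s.split('\\')[:-4])
First the string is split on the backslashes and all components except the last 4 are taken; these are then joined back with a backslash. This yields the leading part, C:\d\folder\project\folder. To keep the last 4 components instead, as the question asks, slice with [-4:].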
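The two slices used in these answers select opposite ends of the path. A short sketch to check the difference, with variable names chosen only for illustration:

```python
s = r'C:\d\folder\project\folder\Folder1\Folder2\Folder3\Module.c'

# A literal backslash must be escaped in a normal string literal.
components = s.split('\\')

# [-4:] keeps the last four components.
tail = '\\'.join(components[-4:])

# [:-4] keeps everything except the last four components.
head = '\\'.join(components[:-4])

print(tail)
print(head)
```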

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code that gets me to my goal, but it seems like I'm doing it the hard way.
Is there a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
You can use Path from pathlib to extract the final path component without its suffix:
from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
Output:
'lead'
As @wwii said in a comment, you should use os.path.splitext, which is designed specifically to separate file names from their extensions, and str.split/str.rsplit, which are designed to cut strings at a character. Using those functions, there are several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
import os
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, optionally with the maxsplit argument, and select the last item (or the item at index 1) of the resulting list, depending on which method was used. All of the following lines are equivalent (in terms of functionality, at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as the previous line, except that it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # splits only once from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as the previous line, except that it selects the second chunk (which is the last, since only one split occurred); raises IndexError if there is no underscore
The best is probably one of the last two solutions, since they will not do useless splits.
Why is this answer better than others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a file name from its extension; os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the intention of the code and is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything other than PDF. If one day you use a .txt, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.
Using splitext clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.
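Put together, the two steps recommended in this answer fit in a small helper. This is only a sketch; the function name extract_suffix is an assumption, not something from the question.

```python
import os


def extract_suffix(filename):
    """Return the text after the last underscore, without the extension."""
    # Step 1: drop the extension, whatever it is.
    stem = os.path.splitext(filename)[0]
    # Step 2: split once from the right and keep the last chunk.
    return stem.rsplit('_', maxsplit=1)[-1]


print(extract_suffix('song1_lyrics.pdf'))
print(extract_suffix('song2_with_extra_underscores_vocals.pdf'))
```

Selecting [-1] rather than [1] means a name with no underscore returns its whole stem instead of raising IndexError.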
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
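But there is no error checking here. If there is no "_" in the string, rfind returns -1, so no exception is raised and the slice silently returns the whole name minus its last four characters.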
You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
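One possible way to add the missing checks to the rfind slice is sketched below. The function name suffix_or_none, the extension parameter, and the choice to return None on bad input are all assumptions made for this example.

```python
def suffix_or_none(filename, extension='.pdf'):
    """Return the text between the last '_' and the extension, or None."""
    underscore = filename.rfind('_')
    # rfind returns -1 when there is no underscore at all.
    if underscore == -1:
        return None
    # Compare case-insensitively so '.pDf' is accepted too.
    if not filename.lower().endswith(extension.lower()):
        return None
    return filename[underscore + 1:-len(extension)]


print(suffix_or_none('file1_with_extra_underscores_lead.pdf'))
print(suffix_or_none('nounderscore.pdf'))
```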
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

Trying to find a way to filter out parts of a string python

I'm trying to filter out parts of the file names that are printed in a for loop:
from os import listdir
from os.path import isfile, join
if search == "List":
    onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
    for i in onlyfiles:
        print(i)
Now it outputs all the file names, as expected and wanted, but I want to filter out the .json at the end of each name as well as a few other elements in the name, so that I just see the file name.
For example: filename-IDENTIFIER.json
I want to filter "-IDENTIFIER.json" out of the for loop's output.
Thanks for any help.
There are a few approaches here, based on how much your data can vary:
So let's try to build a get_filename(f) method
Quick and dirty
If you know that f always ends in exactly the same way, you can simply remove those characters. Here we have to remove the last 16 characters. It's useful to know that in Python a string behaves like an (immutable) sequence of characters, so you can slice it the same way you would slice a list.
def get_filename(f: str) -> str:
    return f[:-16]
This will however fail if the identifier or suffix changes in length.
Varying lengths
If the suffix varies in length, then you should split the string on a fixed delimiter and return the relevant part. In this case you want to split on -.
def get_filename(f: str) -> str:
    return f.split("-")[0]
Note however that this will fail if the file name itself also contains a -.
You can fix that by dropping the last part and rejoining all the earlier pieces, in the following way.
def get_filename(f: str) -> str:
    return "-".join(f.split("-")[:-1])
Using regexes to match the format
The most general approach would be to use python regexes to select the relevant part. These allow you to very specifically target a specific pattern. The exact regex that you'll need will depend on the complexity of your strings.
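As a rough illustration of the regex approach, the sketch below assumes the format is always name, a dash, an identifier without dashes, then .json. The pattern and the fallback of returning the input unchanged are assumptions, not requirements from the question.

```python
import re

# Assumed format: <name>-<identifier without dashes>.json
SUFFIX_PATTERN = re.compile(r"^(?P<name>.+)-[^-]+\.json$")


def get_filename(f: str) -> str:
    """Return the part before the final '-IDENTIFIER.json'."""
    match = SUFFIX_PATTERN.match(f)
    # Fall back to the unchanged input when the format does not match.
    return match.group("name") if match else f


print(get_filename("filename-IDENTIFIER.json"))
print(get_filename("my-file-ABC123.json"))
```

Because the identifier part excludes dashes, dashes inside the name itself are preserved.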
Split the string on "-" and get the first element:
filename = f.split("-")[0]
This will break if the file name itself contains "-", though.
This should work:
i.split('-')[0].split('.')[0]
Case 1: filename-IDENTIFIER.json
It takes the substring before the dash, so output will become filename
Case 2: filename.json
There is no dash in the string, so the first split does nothing (the full string will be in the 0th element), then it takes the substring before the dot. Output will be filename
Case 3: filename
Nothing to split, output will be filename
If it's always .json and -IDENTIFIER, then it's safer to use this:
i.split('-IDENTIFIER')[0].split('.json')[0]
Case 4: filename-blabla.json
If the filename has an extra dash in it, it won't be a problem, output will be filename-blabla
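The four cases above can be checked with a short sketch. The wrapper names strip_generic and strip_fixed are hypothetical and exist only to label the two expressions.

```python
def strip_generic(name):
    # Keep the text before the first dash, then before the first dot.
    return name.split('-')[0].split('.')[0]


def strip_fixed(name):
    # Only cut at the exact '-IDENTIFIER' and '.json' markers.
    return name.split('-IDENTIFIER')[0].split('.json')[0]


for case in ['filename-IDENTIFIER.json', 'filename.json', 'filename']:
    print(case, '->', strip_generic(case))

print('filename-blabla.json', '->', strip_fixed('filename-blabla.json'))
```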

Python regular expression for Windows file path

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
This picks up the start of the path and is designed to match, after the initial drive letter, a sequence of names each followed by a backslash, ending with a file name, an optional dot, and an optional extension.
The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to look at 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allow me to get the path on windows : [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here : https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression repeats the capture group, so it captures each XXX\ segment one after another (e.g. Users\) and only the last one is kept.
Mine wraps the repetition (?:[a-zA-Z0-9() ]*\\)* in a single capture group. This captures the concatenation XXX\YYYY\ZZZ\ as a whole and, as such, gives me the full path.
The second change I made is related to the file name: after the folders, I just match whatever characters remain with a greedy .* instead of \w*.*\w*. This takes care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
Do not hesitate to ask if you have any questions about the expressions.
I'd also encourage you to check how well your expression works using https://regex101.com . It also has a list of the different tokens you can use in your regex.
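The difference between the repeated capture group and the wrapped one can be tried in Python with the re module. This is a minimal sketch on a string that contains only a path; the sample path and variable names are assumptions, and it does not address paths embedded in longer text.

```python
import re

text = r"C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"

# Expression from the question: the capture group itself is repeated.
original = re.search(r"[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*", text)

# Expression from this answer: the repetition sits inside one group.
wrapped = re.search(r"[a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).*", text)

# A repeated group only keeps its last iteration.
last_folder = original.group(1)
# The wrapped group keeps every folder segment.
all_folders = wrapped.group(1)

print(last_folder)
print(all_folders)
```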
Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want, and I managed to do it using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How does this work: I split the original string into a list of substrings on the backslash. I then remove the first and last parts ([1:-1] keeps everything from the 2nd element to the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
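A quick sketch of the split-and-join command on both kinds of input. The helper name strip_drive_and_last and the two sample paths are assumptions chosen for illustration.

```python
def strip_drive_and_last(target):
    """Drop the drive and the final component of a backslash path."""
    return "\\".join(target.split("\\")[1:-1])


# A file path: the last component is the file name.
file_path = r"C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"

# A directory path: the trailing backslash leaves an empty last component.
dir_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\"

from_file = strip_drive_and_last(file_path)
from_dir = strip_drive_and_last(dir_path)

print(from_file)
print(from_dir)
```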

Iterating through python string array gives unexpected output

I was debugging some Python code and, like any beginner, I'm using print statements. I narrowed down the problem to:
paths = ("../somepath") #is this not how you declare an array/list?
for path in paths:
    print(path)
I was expecting the whole string to be printed out, but only . is. Since I planned on expanding it anyway to cover more paths, it appears that
paths = ("../somepath", "../someotherpath")
fixes the problem and correctly prints out both strings.
I'm assuming the initial version treats the string as an array of characters (or maybe that's just the C++ in me talking) and just prints out characters?
I'd still like to know why this happens.
("../somepath")
is nothing but a string wrapped in parentheses. So it is the same as "../somepath". Since Python's for loop can iterate over any iterable, and a string happens to be an iterable, it prints one character at a time.
To create a tuple with one element, add a comma at the end:
("../somepath",)
If you want to create a list, you need to use square brackets, like this
["../somepath"]
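The three spellings discussed above can be compared directly. A small sketch, with variable names chosen only for this example:

```python
just_a_string = ("../somepath")   # parentheses alone do not make a tuple
one_tuple = ("../somepath",)      # the trailing comma makes the tuple
one_list = ["../somepath"]        # square brackets make a list

print(type(just_a_string).__name__)
print(type(one_tuple).__name__)
print(type(one_list).__name__)

# Iterating over the string yields characters, not paths.
first_items = [next(iter(x)) for x in (just_a_string, one_tuple, one_list)]
print(first_items)
```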
paths = ["../somepath", "abc"]
This way you create a list, and your code will work.
paths = ("../somepath", "../someotherpath") worked because it forms a tuple, which is an immutable sequence similar to a list.
I tested the original code and the output is one character per line,
so the whole string is printed one character at a time.
To get what you want you need
paths = ['../somepath']  # a list with one element
for path in paths:
    print(path)
