How to extract string between key substring and "/" with regex? - python

I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
the path/to/file part changes depends on situation, and subject_ID is always there. I try to write a regex that extract only file part of the string. Using ?subject_ID is definite, but I don't know how to safely get the file
My current regex looks like (.*[\/]).*\?_subject_ID
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile('(.*[\/]).*\?_subject_ID')
file_re.search(url)
this will find the right string, but I still can't extract the file name
printing _.group(1) will get me /path/to/. What's the next step that gets me the actual file name?

As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
You may use the non-regex approach here, here is a snippet showing how to leverage urlparse and os.path to parse the URL like yours:
import urlparse
path = urlparse.urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
import os.path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
See the IDEONE demo

It's pretty simple, really. Just match a / before and ?subject_ID after:
([^/?]*)\?subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*)([^/?]*)\?subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around. But can't get the answer fixed to make problem.
This are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values behind the = sign with Python 3.9. regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as result; e.g. profile2.name=. But I want exactly the inverted opposite.
The expected results (what Pythons re.find_all() return) are
['share2', 'share8', 'shareSSH', 'share9']
Try pattern profile\d+\.name=(.*), look at Regex 101 example
import re
re.findall('profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
Try removing the ? quantifier. It will make your capture group match an empty st
regex101

How to get everything after string x in python

I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so even if i have s3://tester/folder/anotherone/test.pdf I am getting the entire path after s3://tester/
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But i get an error saying that it out of index.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
Below re method should work no matter S3 variable, returning all after third / in string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code getting to my goal, but seems like I'm doing it the hard way.
Is there is a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lyrics.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I'm thinking what I'm trying to do is possible with much fewer lines of code, and not having to set HTMLtext multiple times as I'm doing now.
you can use Path from pathlib to extract the final path component, without its suffix:
from path import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
outout:
'lead'
As #wwii said in its comment, you should use os.path.splitext which is especially designed to separate filenames from their extension and str.split/str.rsplit which are especially designed to cut strings at a character. Using thoses functions there is several ways to achieve what you want.
Unlike #wwii, I would start by discarding the extension:
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, with the maxsplit argument or selecting the last (or the second index) of the resulting list (according to what method have been used). Every following line are equivalent (in term of functionality at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as previous line except it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # split only one time from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as previous line except it select the second chunks (which is the last since only one split occured)
The best is probably one of the two last solutions since it will not do useless splits.
Why is this answer better than others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a filename from its extension, os.path.splitext could be more efficient.
Using a slice with rfind works but is does not clearly express the code intention and it is not so readable.
Using endswith('.pdf') is OK if you are sure you will never use anything else than PDF. If one day you use a .txt, you will have to rework your code.
I love regex but in this case it suffers from the same caveheats than the 2 two previously discussed solutions: no clear intention, not very readable and you will have to rework it if one day you use an other extension.
Using splitext clearly indicates that you do something with the extension, and the first item selection is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far more clear than a arbitrary looking slice.
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
But, no error checking in here. Will fail if there is no "_" in the string.
You could use a regex as well. That shouldn't be difficult.
Basically I wouldn't do it this way myself. It's better to do some exception handling unless you are 100% sure that there is no need for exception handling.
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search('_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

Regex to match image extensions with dynamic url in Python

I am creating a regex that matches a web url that ends in a filename with an image extension. The base url, everything before the filename, will be dynamic. Here's what I got:
import re
text = 'google.com/dsadasd/dsd.jpg'
dynamic_url = 'google.com/dsadasd'
regex = '{}/(.*)(.gif|.jpg|.jpeg|.tiff|.png)'.format(dynamic_url)
re.search(regex, text)
This works, but passes, and should be failing, with the following url:
text = 'google.com/dsadasd/.jpg'
It should only match if there is a filename for the image file. Any way to account for this?
If there are any improvements in this approach that you think could make the regular expression capture other edge cases that I missed based on initial requirements def feel free to say so. Additionally, if there are alternative approaches to this that do not leverage regex, those are appreciated as well (maybe a url parse?). The two most important things to me are performance and clarity (speed performance foremost).
You may also directly apply os.path.splitext():
In [1]: import os
In [2]: text = 'google.com/dsadasd/dsd.jpg'
In [3]: _, extension = os.path.splitext(text)
In [4]: extension
Out[4]: '.jpg'
Then, you may check the extension against a set of supported file extensions.
You could try this: (.*)(\w+)(.gif|.jpg|.jpeg|.tiff|.png)'. Just adds a check for something before the ending .whatever.
What you might do is to use anchors to assert the begin ^ and the end $ of the line or use a word boundary \b
To prevent matching for example .jpg right after the forward / slash, you could add a character class and add the characters you want to allow for the filename.
In this example I have added one or more word characters and a hyphen [\w-]+ but you can update that to your requirements
The regex part of your code could look like:
^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$
Test Python

How to split a string on multiple pattern using pythonic way (one liner)?

I am trying to extract file name from file pointer without extension. My file name is as follows:
this site:time.list,this.list,this site:time_sec.list, that site:time_sec.list and so on. Here required file name always precedes either whitespace or dot.
Currently I am doing this to get file from file name preceding white space and dot in file name.
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can i combine above two into one liner kind and pythonic way?
Thanks in advance.
using regex as below,
[ .] will split either on a space or a dot char
re.split('[ .]', os.path.basename(f.name))[0]
If you split on one and splitting on the other still returns something smaller, that's the one you want. If not, what you get is what you got from the first split. You don't need regex for this.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
try this:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. It's much more likely to get unexpected results you didn't think up in advance if you rely on implicit rules like that instead of actually looking at the strings you want to extract and tailor explicit expressions to fit the content.

Categories