Regex to match image extensions with dynamic url in Python - python

I am creating a regex that matches a web url that ends in a filename with an image extension. The base url, everything before the filename, will be dynamic. Here's what I got:
import re
text = 'google.com/dsadasd/dsd.jpg'
dynamic_url = 'google.com/dsadasd'
regex = '{}/(.*)(.gif|.jpg|.jpeg|.tiff|.png)'.format(dynamic_url)
re.search(regex, text)
This works, but it also matches the following URL, which should fail:
text = 'google.com/dsadasd/.jpg'
It should only match if there is a filename for the image file. Is there any way to account for this?
If you can think of improvements that would let the regular expression handle other edge cases I missed, definitely feel free to say so. Alternative approaches that do not use regex are appreciated as well (maybe a URL parser?). The two most important things to me are performance and clarity (speed first).

You may also directly apply os.path.splitext():
In [1]: import os
In [2]: text = 'google.com/dsadasd/dsd.jpg'
In [3]: _, extension = os.path.splitext(text)
In [4]: extension
Out[4]: '.jpg'
Then, you may check the extension against a set of supported file extensions.
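A minimal sketch of that check follows. The helper name has_image_extension and the IMAGE_EXTENSIONS set are illustrative, not from the question. It uses posixpath.splitext, the forward-slash flavor of os.path.splitext, so the behavior should not depend on the platform. splitext treats a leading period in the last path component as part of the name, so a bare .jpg yields an empty extension and is rejected.

```python
import posixpath

# Assumed set of supported extensions, taken from the question's regex.
IMAGE_EXTENSIONS = {'.gif', '.jpg', '.jpeg', '.tiff', '.png'}

def has_image_extension(url):
    """Return True if the last path component has a name and an image extension."""
    _, extension = posixpath.splitext(url)
    return extension.lower() in IMAGE_EXTENSIONS

print(has_image_extension('google.com/dsadasd/dsd.jpg'))  # True
print(has_image_extension('google.com/dsadasd/.jpg'))     # False
```

This does not verify the base URL; if that matters, a text.startswith(dynamic_url + '/') test could be added in front of it.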

You could try this: (.*)(\w+)(\.gif|\.jpg|\.jpeg|\.tiff|\.png). It just adds a check that there is at least one word character before the extension; the dots are escaped so they only match a literal period.

You could use anchors to assert the beginning ^ and the end $ of the line, or use a word boundary \b.
To prevent matching, for example, .jpg right after the forward slash /, you could add a character class containing the characters you want to allow in the filename.
In this example I have used one or more word characters or hyphens, [\w-]+, but you can adjust that to your requirements.
The regex part of your code could look like:
^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$
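If the dynamic part can contain regex metacharacters (the . in google.com already is one), it is probably worth passing it through re.escape before formatting. A sketch under that assumption, with build_image_url_pattern as a made-up helper name:

```python
import re

def build_image_url_pattern(dynamic_url):
    """Compile the anchored pattern for one base URL (hypothetical helper)."""
    # re.escape makes characters such as '.' in the base URL match literally.
    return re.compile(
        r'^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$'.format(re.escape(dynamic_url))
    )

pattern = build_image_url_pattern('google.com/dsadasd')
print(bool(pattern.search('google.com/dsadasd/dsd.jpg')))  # True
print(bool(pattern.search('google.com/dsadasd/.jpg')))     # False
```

Compiling once per base URL and reusing the compiled pattern avoids rebuilding the regex for every URL checked.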

Related

Extracting the suffix of a filename in Python

I'm using Python to create HTML links from a listing of filenames.
The file names are formatted like: song1_lead.pdf, song1_lyrics.pdf.
They could also have names like song2_with_extra_underscores_vocals.pdf. But the common thing is they will all end with _someText.pdf
My goal is to extract just the someText part, after the last underscore, and without the .pdf extension. So song1_lyrics.pdf results with just: lyrics
I have the following Python code that gets me to my goal, but it seems like I'm doing it the hard way.
Is there a more efficient way to do this?
testString = 'file1_with_extra_underscores_lead.pdf'
#Step 1: Separate string using last occurrence of under_score
HTMLtext = testString.rpartition('_')
# Result: ('file1_with_extra_underscores', '_', 'lead.pdf')
#Step 2: Separate the suffix and .pdf extension.
HTMLtext = HTMLtext[2].rpartition('.')
#Result: ('lead', '.', 'pdf')
#Step 3: Use the first item as the end result.
HTMLtext = HTMLtext[0] #Result: lead
I think what I'm trying to do is possible with far fewer lines of code, and without reassigning HTMLtext multiple times as I do now.
You can use Path from pathlib to extract the final path component without its suffix:
from pathlib import Path
Path('file1_with_extra_underscores_lead.pdf').stem.split('_')[-1]
Output:
'lead'
As @wwii said in a comment, you should use os.path.splitext, which is specifically designed to separate filenames from their extensions, and str.split/str.rsplit, which are specifically designed to cut strings at a character. Using those functions, there are several ways to achieve what you want.
Unlike @wwii, I would start by discarding the extension:
import os
test_string = 'file1_with_extra_underscores_lead.pdf'
filename = os.path.splitext(test_string)[0]
print(filename) # 'file1_with_extra_underscores_lead'
Then I would use split or rsplit, either with the maxsplit argument or by selecting the last (or the second) item of the resulting list, depending on which method is used. All of the following lines are equivalent (in terms of functionality at least):
filename.split('_')[-1] # splits at each underscore and selects the last chunk
filename.rsplit('_')[-1] # same as the previous line, except that it splits from the right of the string
filename.rsplit('_', maxsplit=1)[-1] # splits only once from the right of the string and selects the last chunk
filename.rsplit('_', maxsplit=1)[1] # same as the previous line, except that it selects the second chunk (which is the last one, since only one split occurred)
The best is probably one of the last two solutions, since they do not perform useless splits.
Why is this answer better than the others? (in my opinion at least)
Using pathlib is fine but a bit overkill for separating a filename from its extension; os.path.splitext could be more efficient.
Using a slice with rfind works, but it does not clearly express the intention of the code and is not very readable.
Using endswith('.pdf') is OK if you are sure you will never use anything other than PDF. If one day you use a .txt file, you will have to rework your code.
I love regex, but in this case it suffers from the same caveats as the two previously discussed solutions: no clear intention, not very readable, and you will have to rework it if one day you use another extension.
Using splitext clearly indicates that you are doing something with the extension, and selecting the first item is quite explicit. This will still work with any other extension.
Using rsplit('_', maxsplit=1) and selecting the last index is also quite expressive and far clearer than an arbitrary-looking slice.
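To illustrate the point about other extensions, the two steps can be wrapped in one small function. This is only a sketch; the name suffix_after_last_underscore and the .txt sample are made up for the example.

```python
import os

def suffix_after_last_underscore(name):
    """Drop the extension, then keep what follows the last underscore."""
    stem = os.path.splitext(name)[0]
    # [-1] also works when there is no underscore: the whole stem is returned.
    return stem.rsplit('_', maxsplit=1)[-1]

print(suffix_after_last_underscore('file1_with_extra_underscores_lead.pdf'))  # lead
print(suffix_after_last_underscore('song1_lyrics.txt'))                       # lyrics
```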
This should do fine:
testString = 'file1_with_extra_underscores_lead.pdf'
testString[testString.rfind('_') + 1:-4]
But there is no error checking here. It will silently return the wrong result if there is no "_" in the string, because rfind then returns -1.
You could use a regex as well. That shouldn't be difficult.
Basically, I wouldn't do it this way myself. It's better to add some error handling unless you are 100% sure it is not needed.
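For what it's worth, a version of the slice approach with explicit checks might look like the sketch below. The function name suffix_with_check and the choice to raise ValueError are assumptions for illustration, not something from the question.

```python
def suffix_with_check(name, extension='.pdf'):
    """Slice out the text between the last underscore and the extension."""
    if not name.endswith(extension):
        raise ValueError('expected a name ending in {!r}: {!r}'.format(extension, name))
    stem = name[:-len(extension)]
    index = stem.rfind('_')
    if index == -1:
        # rfind signals "not found" with -1, so check before slicing.
        raise ValueError('no underscore in {!r}'.format(name))
    return stem[index + 1:]

print(suffix_with_check('file1_with_extra_underscores_lead.pdf'))  # lead
```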
This will work with "..._lead.pdf" or "..._lead.pDf":
import re
testString = 'file1_with_extra_underscores_lead.pdf'
m = re.search(r'_([^_]+)\.pdf$', testString, flags=re.I)
print(m.group(1) if m else "No match")

How to find a pattern in string & replace in HTML Code

I have HTML Code in a string variable. I want to modify this tag, from this:
3.jpg
to
3.jpg, basically adding download="3.jpg".
I want to do this for all links that end with a .jpg, .png, .gif, .jpeg, or .mp4 extension.
There might be easier ways to accomplish this, but I think one way to start could be using regex. Define a pattern to find all of the file endings. Then fetch the file-name (e.g., 3.jpg) to compile a string that .replace()s the first pattern. Like this:
import re
# all possible formats you mentioned:
html = ['3.jpg',
'3.png',
'3.gif',
'3.jpeg',
'3.mp4']
# regex patterns (everything within parentheses is going to be extracted)
regex1 = re.compile(r'(\.jpg\"|\.png\"|\.gif\"|\.jpeg\"|\.mp4\")')
regex2 = re.compile(r'\/images\/(.*?)\.')
# iterate over the strings
for x in html:
    if regex1.search(x): # if pattern is found:
        # find and extract
        a = regex1.search(x).group(1)
        b = regex2.search(x).group(1)
        # compile new string by replacing a
        new = x.replace(a, f'{a} download="{b + a}')
        print(new)
This gives you:
3.jpg
3.png
3.gif
3.jpeg
3.mp4
If you want to learn more on regex, see the documentation.
Also, note that f-strings (as in f'{a} download="{b + a}') are supported in Python 3.6 and later.
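The same idea can also be written as a single re.sub call. The sketch below assumes the links look like <a href="/images/3.jpg">3.jpg</a>; that tag shape is a guess based on the patterns above, and the names LINK_RE and add_download_attr are made up. For anything beyond simple, regular markup, an HTML parser would likely be more robust than a regex.

```python
import re

# Group 1: the opening of the tag up to the closing quote of href.
# Group 2: the file name (last path component) with one of the listed extensions.
LINK_RE = re.compile(
    r'(<a\s+href="(?:[^"]*/)?([^"/]+\.(?:jpg|jpeg|png|gif|mp4))")',
    re.IGNORECASE,
)

def add_download_attr(markup):
    """Append download="<file name>" after each matching href attribute."""
    return LINK_RE.sub(r'\1 download="\2"', markup)

html = '<a href="/images/3.jpg">3.jpg</a> <a href="/page.html">page</a>'
print(add_download_attr(html))
```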

How to extract string between key substring and "/" with regex?

I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, and _subject_ID is always there. I am trying to write a regex that extracts only the file part of the string. Anchoring on ?_subject_ID is reliable, but I don't know how to safely get the file.
My current regex looks like (.*[\/]).*\?_subject_ID
import re
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile(r'(.*[\/]).*\?_subject_ID')
file_re.search(url)
This finds the right string, but I still can't extract the file name.
Printing _.group(1) gets me /path/to/. What's the next step to get the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then both .group(1) and .group(2) will be captured), but regex is not the most appropriate way to parse URLs in Python.
You may use a non-regex approach here. This snippet shows how to leverage urlparse and os.path to parse a URL like yours:
import urlparse
path = urlparse.urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
import os.path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
See the IDEONE demo
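The snippet above is Python 2. In Python 3 the urlparse function lives in urllib.parse, so an equivalent would be roughly the following. Using posixpath instead of os.path is a choice made here so that the forward-slash handling does not depend on the operating system.

```python
import posixpath
from urllib.parse import urlparse

url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
path = urlparse(url).path          # the part before the '?'
directory, filename = posixpath.split(path)
print(filename)   # file
print(directory)  # /path/to
```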
It's pretty simple, really. Just match everything after the last / and before ?_subject_ID:
([^/?]*)\?_subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*/)([^/?]*)\?_subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.
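As a quick check against the sample string from the question, here is a sketch using the two-group pattern ([^?]*/)([^/?]*)\?_subject_ID, where the first group ends at the last slash and the literal includes the underscore that appears in the sample:

```python
import re

url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
match = re.search(r'([^?]*/)([^/?]*)\?_subject_ID', url)
if match:
    directory, filename = match.groups()
    print(directory)  # /path/to/
    print(filename)   # file
```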

using \b in regex

--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
    #text_content = text_content.lower()
    regex_matches = []
    # Search sanitized text (markup removed) for DLP theme keywords
    for key, value in regex_dictionary.items():
        # Get configuration settings
        min_matches = value.get('min_matches', 1)
        risk = value.get('risk', 1)
        enabled = value.get('enabled', False)
        regex_str = value.get('regex', '')
        # Fast compute True/False hit for each DLP theme word
        if enabled:
            print("Searching for key : %s" % (key))
            my_regex = re.compile(value.get('regex'))
            hits = my_regex.findall(text_content)
            if len(hits) > 0:
                regex_matches.append((key, risk, len(hits), hits))
    # Return array of results (key, risk, number of hits, regex matches)
    return regex_matches
def main():
    #print defaults.test_regex.dlp_regex
    text_content = ""
    for line in open('testData.txt'):
        text_content += line
    for match in regex_content(text_content, test_regex.dlp_regex):
        print("\nFound %s : %s" % (match[0], match[3]))
    print("\n")
if __name__ == '__main__':
    main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I prefix my regex with the 'r' flag, I find the zip codes I'm looking for, but also every other 5-digit number in the document I am searching. My understanding is that this is because it ignores the \b characters. Without the r flag, though, it cannot find any zip codes. It works perfectly fine in Regexr, but not in my code. I haven't had any luck making \b work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What am I misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include boundaries in my regex, using the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
According to the online regex debugger Regexr, my regex should correctly catch 5-digit zip codes, such as 34332. However, I have two problems:
1. This regex does not find any zip codes in my actual code, but it does work when I remove the boundary (\b) characters. The exact text I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this doesn't find the matches either, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
it is now correctly identified. I guess this means that once a string has been matched by one of my regexes, it can no longer be found by any subsequent regex? Why is this?
It is because of the alternation: the first alternative matched, and the engine stopped checking.
Try this regex:
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b). See demo.
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest using this pattern, which solves the problem:
\b\d{5}(?:-\d{1,4})?\b
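The difference is easy to reproduce with re.findall on the sample text from the question. This is just a small sketch; the variable names are illustrative.

```python
import re

text = "Zip:\n----\n98839-0111\n34332\n"

# Original order: the 5-digit branch wins at 98839, so the "-0111" part is never consumed.
original = re.findall(r'\b\d{5}\b|\b\d{5}-\b\d{1,4}\b', text)
print(original)  # ['98839', '34332']

# Optional group: the engine first tries to extend the match with the suffix.
suggested = re.findall(r'\b\d{5}(?:-\d{1,4})?\b', text)
print(suggested)  # ['98839-0111', '34332']
```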
You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
Working demo
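On the 'r' prefix and the ^ and $ anchors mentioned in the edits to the question: in an ordinary Python string literal "\b" is the backspace character, so the regex engine only sees a word boundary when the pattern is a raw string. Likewise, ^ and $ match at every line only with re.MULTILINE. A short sketch with the sample text (variable names are illustrative):

```python
import re

text = "Zip:\n----\n98839-0111\n34332\n"

plain_pattern = "\b\\d{5}\b"   # "\b" here is a backspace character (\x08)
raw_pattern = r"\b\d{5}\b"     # r"\b" reaches the regex engine as a word boundary

print(plain_pattern[0] == "\x08")        # True
print(re.findall(plain_pattern, text))   # []
print(re.findall(raw_pattern, text))     # ['98839', '34332']

# Without re.MULTILINE, ^ only matches at the very start of the text.
line_pattern = r'^\d{5}(?:-\d{1,4})?$'
print(re.findall(line_pattern, text))                      # []
print(re.findall(line_pattern, text, flags=re.MULTILINE))  # ['98839-0111', '34332']
```

With a raw string, every standalone 5-digit number is matched, which is expected: the pattern alone cannot tell a zip code from any other 5-digit number.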

Regex in Django template tag matches only once

I have a template tag like this:
@register.filter(name='bknz')
def bknz(text):
    pattern = re.compile(r"(?P<start>.*)\(bkz: (?P<bkz>.*)\)(?P<end>.*)")
    link = r'\g<start>(bkz: \g<bkz>)\g<end>'
    text = pattern.sub(link, text)
    return mark_safe(text)
It changes (bkz: something) into a linked (bkz: something). It works fine, but only once. When I put several (bkz: sth) entries in my object, it only renders the last one in its changed form. How can I make this run as many times as necessary? Thanks.
Take out the start and end groups. They are not needed; you want to match your (bkz: something), not what's around it.
Use non-greedy matching. A .* in a regex will try to match as much as possible at a time. Use .*? to avoid swallowing later instances of the pattern.
pattern = re.compile(r"\(bkz: (?P<bkz>.*?)\)")
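A standalone sketch of the difference, outside Django. The link markup (an anchor pointing at /search/?q=...) and the bracket replacement are hypothetical, since the real link format is not shown in the question.

```python
import re

text = "first (bkz: foo) then (bkz: bar)"

# Greedy pattern with start/end groups: one match swallows the whole string.
greedy = re.compile(r"(?P<start>.*)\(bkz: (?P<bkz>.*)\)(?P<end>.*)")
_, greedy_count = greedy.subn(r'\g<start>[\g<bkz>]\g<end>', text)
print(greedy_count)  # 1

# Non-greedy pattern without the surrounding groups: each (bkz: ...) is matched separately.
pattern = re.compile(r"\(bkz: (?P<bkz>.*?)\)")
link = r'(bkz: <a href="/search/?q=\g<bkz>">\g<bkz></a>)'  # hypothetical markup
linked, count = pattern.subn(link, text)
print(count)   # 2
print(linked)
```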
This one worked.
@register.filter(name='bknz')
def bknz(text):
    pattern = re.compile(r"(?P<start>.*?)\(bkz: (?P<bkz>[^)]*)\)(?P<end>.*?)")
    link = r'\g<start>(bkz: \g<bkz>)\g<end>'
    text = pattern.sub(link, text)
    return mark_safe(text)
