How to find a pattern in string & replace in HTML Code - python

I have HTML Code in a string variable. I want to modify this tag, from this:
3.jpg
to
3.jpg, basically add download="3.jpg".
I want to do this for all links that end with a .jpg, .png, .gif, .jpeg, or .mp4 extension.

There might be easier ways to accomplish this, but I think one way to start could be using regex. Define a pattern to find all of the file endings. Then fetch the file-name (e.g., 3.jpg) to compile a string that .replace()s the first pattern. Like this:
import re
# all possible formats you mentioned:
html = ['3.jpg',
'3.png',
'3.gif',
'3.jpeg',
'3.mp4']
# regex patterns (everything within parentheses is going to be extracted)
regex1 = re.compile(r'(\.jpg\"|\.png\"|\.gif\"|\.jpeg\"|\.mp4\")')
regex2 = re.compile(r'\/images\/(.*?)\.')
# iterate over the strings
for x in html:
    if regex1.search(x):  # if pattern is found:
        # find and extract
        a = regex1.search(x).group(1)
        b = regex2.search(x).group(1)
        # compile new string by replacing a
        new = x.replace(a, f'{a} download="{b + a}')
        print(new)
This gives you:
3.jpg
3.png
3.gif
3.jpeg
3.mp4
If you want to learn more on regex, see the documentation.
Also, note that f-strings (as in f'{a} download="{b + a}') are supported in Python 3.6 and later.
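The same idea can be sketched with a single re.sub call over the whole HTML string. This is only a sketch: the tag shape `<a href="/images/3.jpg">3.jpg</a>` is an assumption (the original markup is not shown here), and the names LINK_RE and add_download_attribute are made up for illustration. For anything beyond simple, regular markup, an HTML parser is usually the safer choice.

```python
import re

# Assumed link shape: <a href="/images/3.jpg">3.jpg</a>
# Group 1 is the optional directory part, group 2 is the file name.
LINK_RE = re.compile(r'href="([^"]*/)?([^"/]+\.(?:jpg|png|gif|jpeg|mp4))"')

def add_download_attribute(html):
    """Append download="<file name>" after every matching href attribute."""
    return LINK_RE.sub(lambda m: f'{m.group(0)} download="{m.group(2)}"', html)

print(add_download_attribute('<a href="/images/3.jpg">3.jpg</a>'))
```

Links whose href does not end in one of the listed extensions are left untouched.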

Related

What is the equivalent of My.Resources(vb.net) in Python?

I would like to use resources in a Python project with Flask and output their names. I know how it works in VB, but I have no idea what the equivalent of My.Resources.ResourceManager is in Python. Is there similar functionality in Python?
I want to save multiple regex patterns, as shown below, and use them in code by name.
Name Value
Regex1 (?P<nickname>\s*.+?)
Regex2 (?P<address>\s*.+?)
Welcome to SO!
Essentially, you don't need to worry about resource management in python most of the time because it is done automatically for you. So, to save a regex pattern:
import re
# create pattern strings
# raw strings keep backslashes such as \s intact
regex1 = r'(?P<nickname>\s*.+?)'
regex2 = r'(?P<address>\s*.+?)'
test_string = 'nickname jojo rabbit.'
matches = re.search(regex1, test_string)
As you probably noticed, there is nothing special here. Creating and storing these patterns is just like declaring any string or other type of variables.
If you want to save all your patterns more neatly, you can use a dictionary where the names of the patterns are the keys and the pattern strings are the values, like so:
import re
regex_dictionary = {'regex1': r'(?P<nickname>\s*.+?)'}
# to add another regex pattern:
regex_dictionary['regex2'] = r'(?P<address>\s*.+?)'
test_string = 'nickname jojo rabbit.'
# to access and search using a regex pattern:
matches = re.search(regex_dictionary['regex1'], test_string)
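As a small extension of the dictionary idea, the patterns can be compiled once and looked up by name, with the named group used to pull out the value. The pattern names, the expressions, and the helper extract below are hypothetical and chosen only for illustration; they are not the patterns from the question.

```python
import re

# Hypothetical patterns, compiled once and stored under a name.
PATTERNS = {
    'nickname': re.compile(r'nickname\s+(?P<nickname>\w+)'),
    'address': re.compile(r'address\s+(?P<address>\w+)'),
}

def extract(name, text):
    """Return the named group captured by the pattern stored under `name`, or None."""
    match = PATTERNS[name].search(text)
    return match.group(name) if match else None

print(extract('nickname', 'nickname jojo rabbit.'))  # jojo
```

Keeping the dictionary in its own module gives a rough analogue of a named resource table that the rest of the project can import.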
I hope this makes sense!
Read more about python's regex: https://www.w3schools.com/python/python_regex.asp#matchobject
Read more about python's dictionaries: https://www.w3schools.com/python/python_dictionaries.asp
Read more about python's resource management: https://www.drdobbs.com/web-development/resource-management-in-python/184405999

Extract specific word from the string using Python

I have a string Job_Cluster_AK_Alaska_Yakutat_CDP.png
From the string above, I want to extract only the part after the prefix Job_Cluster_AK_Alaska_ and before .png.
In other words, I want everything after the fourth underscore-separated word, up to but not including .png.
I am new to regex.
Finally I want only Yakutat_CDP.
I think what you are asking for is something like this:
import os
# I think you will have different jobs/pngs, so pass these variables from somewhere
jobPrefix = 'Job_Cluster_AK_Alaska_'
pngString = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
# Split filename/extension
pngTitle = os.path.splitext(pngString)[0]
# Get the filename without the jobPrefix
finalTitle = pngTitle[len(jobPrefix):]
Edit
Try to avoid regular expressions here, as they are generally slower than plain string slicing.
You can do it even without regex like so:
s = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
print(s[len('Job_Cluster_AK_Alaska_'):-len('.png')])
In essence here I take the substring starting immediately after Job_Cluster_AK_Alaska_ and ending before .png.
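Still, a regex approach is probably more readable and maintainable:
import re
s = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
# escape the dot so it only matches a literal '.', and pass the string to search
m = re.match(r'Job_Cluster_AK_Alaska_(.*)\.png', s)
print(m[1])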
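If the prefix itself varies but always consists of four underscore-separated words, as the question describes, a split with a maximum count avoids hard-coding it. This is a minimal sketch under that assumption; the function name strip_leading_words is made up for illustration.

```python
import os

def strip_leading_words(filename, n_words=4):
    """Drop the extension and the first n_words underscore-separated words."""
    stem, _ext = os.path.splitext(filename)
    # Split at most n_words times so the remainder keeps its own underscores.
    parts = stem.split('_', n_words)
    return parts[-1] if len(parts) > n_words else ''

print(strip_leading_words('Job_Cluster_AK_Alaska_Yakutat_CDP.png'))  # Yakutat_CDP
```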

Regex to match image extensions with dynamic url in Python

I am creating a regex that matches a web url that ends in a filename with an image extension. The base url, everything before the filename, will be dynamic. Here's what I got:
import re
text = 'google.com/dsadasd/dsd.jpg'
dynamic_url = 'google.com/dsadasd'
regex = '{}/(.*)(.gif|.jpg|.jpeg|.tiff|.png)'.format(dynamic_url)
re.search(regex, text)
This works, but it also matches the following URL, which should fail:
text = 'google.com/dsadasd/.jpg'
It should only match if there is a filename for the image file. Any way to account for this?
If you see improvements that would let the regular expression handle edge cases I missed in the initial requirements, definitely feel free to say so. Alternative approaches that do not use regex are appreciated as well (maybe a URL parser?). The two most important things to me are performance and clarity (speed foremost).
You may also directly apply os.path.splitext():
In [1]: import os
In [2]: text = 'google.com/dsadasd/dsd.jpg'
In [3]: _, extension = os.path.splitext(text)
In [4]: extension
Out[4]: '.jpg'
Then, you may check the extension against a set of supported file extensions.
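A minimal sketch of that check, assuming the base URL is known up front as in the question. The function name is_image_url is hypothetical. Note that os.path.splitext treats a name consisting only of a leading dot and letters, such as .jpg, as having no extension, which is exactly the case that should fail here.

```python
import os

# Extensions taken from the question's regex.
SUPPORTED_EXTENSIONS = {'.gif', '.jpg', '.jpeg', '.tiff', '.png'}

def is_image_url(url, base_url):
    """Return True if url is base_url + '/' + a named file with a supported extension."""
    prefix = base_url + '/'
    if not url.startswith(prefix):
        return False
    # splitext returns an empty extension for a bare '.jpg' file name
    _, extension = os.path.splitext(url[len(prefix):])
    return extension.lower() in SUPPORTED_EXTENSIONS

print(is_image_url('google.com/dsadasd/dsd.jpg', 'google.com/dsadasd'))  # True
print(is_image_url('google.com/dsadasd/.jpg', 'google.com/dsadasd'))     # False
```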
You could try this: (.*)(\w+)(\.gif|\.jpg|\.jpeg|\.tiff|\.png). It just adds a check for at least one word character before the extension, and escapes the dots so that they match a literal period.
You can use anchors to assert the start ^ and the end $ of the string, or use a word boundary \b.
To prevent a match on, for example, .jpg right after the forward slash /, you could add a character class listing the characters you want to allow in the filename.
In this example I have used one or more word characters or hyphens, [\w-]+, but you can adapt that to your requirements.
The regex part of your code could look like:
^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$
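Plugged into Python, the anchored pattern might be used as in the sketch below. The helper name build_image_url_regex is made up for illustration, and re.escape is applied to the base URL (an addition not in the pattern above) so that characters such as the dot in google.com are matched literally.

```python
import re

def build_image_url_regex(base_url):
    """Compile an anchored pattern for base_url followed by a named image file."""
    return re.compile(
        r'^{}/[\w-]+\.(?:gif|jpg|jpeg|tiff|png)$'.format(re.escape(base_url))
    )

pattern = build_image_url_regex('google.com/dsadasd')
print(bool(pattern.search('google.com/dsadasd/dsd.jpg')))  # True
print(bool(pattern.search('google.com/dsadasd/.jpg')))     # False
```

Compiling the pattern once per base URL and reusing it keeps the per-URL cost low, which matters if speed is the main concern.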

How to split a string on multiple pattern using pythonic way (one liner)?

I am trying to extract the file name, without its extension, from a file pointer. My file names look like this:
this site:time.list, this.list, this site:time_sec.list, that site:time_sec.list and so on. The part I need always comes before either a whitespace or a dot.
Currently I do the following to get the part of the file name that precedes the whitespace or the dot:
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can I combine the two lines above into a single, Pythonic one-liner?
Thanks in advance.
Use a regex, as below.
[ .] will split on either a space or a dot character:
re.split('[ .]', os.path.basename(f.name))[0]
You don't need a regex for this. Split on the space first, then split the result on the dot. If the first piece contains no dot, the second split returns it unchanged.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
try this:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. It's much more likely to get unexpected results you didn't think up in advance if you rely on implicit rules like that instead of actually looking at the strings you want to extract and tailor explicit expressions to fit the content.
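The approaches above can be compared on the sample names from the question. The sketch below wraps the re.split variant in a hypothetical helper, first_token, and stops after the first separator since only the leading piece is needed.

```python
import re

def first_token(name):
    """Return the part of name before the first space or dot."""
    return re.split(r'[ .]', name, maxsplit=1)[0]

names = [
    'this site:time.list',
    'this.list',
    'this site:time_sec.list',
    'that site:time_sec.list',
]
print([first_token(n) for n in names])  # ['this', 'this', 'this', 'that']
```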

How to extract string between key substring and "/" with regex?

I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, and subject_ID is always there. I am trying to write a regex that extracts only the file part of the string. I can rely on ?subject_ID being present, but I don't know how to safely get the file.
My current regex looks like (.*[\/]).*\?_subject_ID
import re
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile(r'(.*[\/]).*\?_subject_ID')
file_re.search(url)
This finds the right string, but I still can't extract the file name.
Printing _.group(1) gets me /path/to/. What's the next step to get the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
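A minimal sketch of the two-group version, using the URL from the question. The variable names directory and filename are chosen here for illustration.

```python
import re

url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
# Group 1: everything up to and including the last slash; group 2: the file part.
file_re = re.compile(r'(.*/)(.*)\?_subject_ID')

match = file_re.search(url)
directory, filename = match.group(1), match.group(2)
print(directory)  # /path/to/
print(filename)   # file
```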
You may use a non-regex approach here. The snippet below shows how to use urlparse (from urllib.parse in Python 3; the standalone urlparse module in Python 2) together with os.path to parse a URL like yours:
import os.path
from urllib.parse import urlparse
path = urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
See the IDEONE demo
It's pretty simple, really. Just match a / before and ?subject_ID after:
([^/?]*)\?subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class
If you want to get both the path and the file, you can do much the same thing, but also grab the part before the /:
([^?]*)/([^/?]*)\?subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.
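A sketch of the two-group idea with negated character classes, written with a literal / between the groups so that the first group stops at the last slash instead of swallowing the file name. The helper split_before_key and its key parameter are hypothetical; the key is passed in because the exact text after the question mark differs between the question and this answer.

```python
import re

def split_before_key(url, key):
    """Return (path, file) for the part of url just before '?' + key, or None."""
    pattern = r'([^?]*)/([^/?]*)\?' + re.escape(key)
    match = re.search(pattern, url)
    return match.groups() if match else None

print(split_before_key('/path/to/file?_subject_ID_SOMEOTHERSTRING', '_subject_ID'))
# ('/path/to', 'file')
```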
