Extract specific word from the string using Python

I have a string Job_Cluster_AK_Alaska_Yakutat_CDP.png
From this string, I want to extract only the part that comes after Job_Cluster_AK_Alaska_ and before .png.
In other words, I want everything after the fourth underscore-separated word, up to but not including .png.
I am new to regex.
The result I want is Yakutat_CDP.

I think what you are asking for is something like this:
import os
# I think you will have different jobs/pngs, so pass these variables from somewhere
jobPrefix = 'Job_Cluster_AK_Alaska_'
pngString = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
# Split filename/extension
pngTitle = os.path.splitext(pngString)[0]
# Get the filename without the jobPrefix
finalTitle = pngTitle[len(jobPrefix):]
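Edit
When the prefix and the extension are fixed and known, as they are here, prefer string slicing to a regular expression. Slicing is simpler and is generally faster, because no pattern has to be compiled or matched.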
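The speed claim is easy to check on your own machine. Below is a rough sketch using the standard timeit module. The function names by_slicing and by_regex are made up for illustration, and timings vary by machine and Python version, so no figures are quoted here.

```python
import os
import re
import timeit

JOB_PREFIX = 'Job_Cluster_AK_Alaska_'
PNG_STRING = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
# Compile once so the regex timing measures matching, not compilation.
PATTERN = re.compile(r'Job_Cluster_AK_Alaska_(.*)\.png')

def by_slicing(name=PNG_STRING):
    # Same steps as the answer above: drop the extension, then the prefix.
    title = os.path.splitext(name)[0]
    return title[len(JOB_PREFIX):]

def by_regex(name=PNG_STRING):
    # Capture whatever sits between the prefix and the extension.
    return PATTERN.match(name).group(1)

# Both approaches must agree before their timings are worth comparing.
assert by_slicing() == by_regex() == 'Yakutat_CDP'

slicing_seconds = timeit.timeit(by_slicing, number=100_000)
regex_seconds = timeit.timeit(by_regex, number=100_000)
print(f'slicing: {slicing_seconds:.4f}s  regex: {regex_seconds:.4f}s')
```

For a single file name the difference is unlikely to matter; it only becomes relevant when the extraction runs many times in a loop.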

You can do it even without regex like so:
s = 'Job_Cluster_AK_Alaska_Yakutat_CDP.png'
print(s[len('Job_Cluster_AK_Alaska_'):-len('.png')])
In essence here I take the substring starting immediately after Job_Cluster_AK_Alaska_ and ending before .png.
That said, a regex approach is probably more readable and maintainable:
import re
# The dot before png is escaped so that it matches a literal '.'
m = re.match(r'Job_Cluster_AK_Alaska_(.*)\.png', s)
print(m[1])
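The question also describes the target as "everything after the fourth underscore-separated word". If the prefix is not always the same text but always has four fields, that reading can be sketched with str.split and a split limit. The helper name part_after_fields and its skip parameter are hypothetical, not something from the answers above.

```python
import os

def part_after_fields(filename, skip=4):
    """Return what follows the first `skip` underscore-separated fields, minus the extension."""
    stem, _extension = os.path.splitext(filename)
    # Split at most `skip` times, so the remainder keeps its own underscores.
    parts = stem.split('_', skip)
    if len(parts) <= skip:
        raise ValueError(f'{filename!r} has fewer than {skip + 1} underscore-separated fields')
    return parts[skip]

print(part_after_fields('Job_Cluster_AK_Alaska_Yakutat_CDP.png'))  # Yakutat_CDP
```

Unlike the slicing versions, this does not depend on the exact prefix text, only on the number of fields in it.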

Related

Is there a way to strip the end of a string until a certain character is reached?

I'm working on a side project for myself and have stumbled on an issue that I'm not sure how to solve. I have a URL, for argument's sake let's say https://stackoverflow.com/xyz/abc. I'm attempting to strip the end of the URL so that I am left with only https://stackoverflow.com/xyz/.
Initially I tried to use the strip function and specify a length/position to remove up to, but realized that the other URLs I'm working with are not the same length (i.e. URL 1 = /xyz/abc, URL 2 = /xyz/abcd).
Is there any advice for achieving this? I looked into using the regular expression operations in Python, but was unsure how to apply them to this use case. Ideally I would like to write a function that starts from the end of the string and strips away all characters until the first '/' is reached. Any advice would be appreciated.
Thanks

Why not just use rfind, which starts from the end?
>>> string = 'https://stackoverflow.com/xyz/abc'
>>> string = string[:string.rfind('/')+1]
>>> print(string)
https://stackoverflow.com/xyz/
And if you don't want the character either (the / in this case), simply remove the +1.
Keep in mind however that this only works if the string actually contains the character you are looking for.
If you want to protect against this, you will have to use the following:
string = 'https://stackoverflow.com/xyz/abc'
idx = string.rfind('/')
if idx != -1:
    string = string[:idx+1]
Unless, obviously, you do want to end up with an empty string in case the character is not found.
Then the first example works just fine.
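Since the question asks for a function, the guarded version can be wrapped up as follows. This is a minimal sketch that swaps in str.rpartition for rfind, because rpartition reports whether the separator was found without a separate index check. The name strip_after_last is made up for illustration.

```python
def strip_after_last(text, sep='/'):
    """Drop everything after the last `sep`; return `text` unchanged if `sep` is absent."""
    # rpartition returns ('', '', text) when sep does not occur in text.
    head, found, _tail = text.rpartition(sep)
    return head + found if found else text

print(strip_after_last('https://stackoverflow.com/xyz/abc'))   # https://stackoverflow.com/xyz/
print(strip_after_last('https://stackoverflow.com/xyz/abcd'))  # https://stackoverflow.com/xyz/
print(strip_after_last('no-separator-here'))                   # no-separator-here
```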

If you don't want to use regex, you can combine split() and join():
lol = 'https://stackoverflow.com/xyz/abc'
splt= lol.split('/')[:-1]
'/'.join(splt)
Output (note that the trailing slash is dropped):
'https://stackoverflow.com/xyz'

How to find a pattern in string & replace in HTML Code

I have HTML code in a string variable. I want to modify this tag, from this:
<a href="/images/3.jpg">3.jpg</a>
to
<a href="/images/3.jpg" download="3.jpg">3.jpg</a>
that is, add download="3.jpg" to it.
I want to do this for all links whose target ends in .jpg, .png, .gif, .jpeg, or .mp4.

There might be easier ways to accomplish this, but I think one way to start could be using regex. Define a pattern to find all of the file endings. Then fetch the file-name (e.g., 3.jpg) to compile a string that .replace()s the first pattern. Like this:
import re
# all possible formats you mentioned:
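html = ['<a href="/images/3.jpg">3.jpg</a>',
        '<a href="/images/3.png">3.png</a>',
        '<a href="/images/3.gif">3.gif</a>',
        '<a href="/images/3.jpeg">3.jpeg</a>',
        '<a href="/images/3.mp4">3.mp4</a>']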
# regex patterns (everything within parentheses is going to be extracted)
regex1 = re.compile(r'(\.jpg\"|\.png\"|\.gif\"|\.jpeg\"|\.mp4\")')
regex2 = re.compile(r'\/images\/(.*?)\.')
# iterate over the strings
for x in html:
    if regex1.search(x):  # if pattern is found:
        # find and extract
        a = regex1.search(x).group(1)
        b = regex2.search(x).group(1)
        # compile new string by replacing a (a already ends with the closing quote)
        new = x.replace(a, f'{a} download="{b + a}')
        print(new)
This gives you:
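<a href="/images/3.jpg" download="3.jpg">3.jpg</a>
<a href="/images/3.png" download="3.png">3.png</a>
<a href="/images/3.gif" download="3.gif">3.gif</a>
<a href="/images/3.jpeg" download="3.jpeg">3.jpeg</a>
<a href="/images/3.mp4" download="3.mp4">3.mp4</a>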
If you want to learn more on regex, see the documentation.
Also, note that f-strings (as in f'{a} download="{b + a}') are supported in Python 3.6 and later.
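As the answer says, there may be easier ways. One possibility is a single re.sub pass over the whole HTML string. The sketch below assumes each link is written as href="..." with double quotes and that the download name should be the last path segment. The function name add_download_attribute and the sample markup are illustrative, not taken from the question.

```python
import re

# group(1): optional directory part ending in '/'; group(2): the file name itself.
LINK = re.compile(r'href="([^"]*/)?([^"/]+\.(?:jpg|jpeg|png|gif|mp4))"', re.IGNORECASE)

def add_download_attribute(html):
    """Append download="<file name>" after every href that points at a media file."""
    return LINK.sub(lambda m: f'{m.group(0)} download="{m.group(2)}"', html)

page = '<a href="/images/3.jpg">3.jpg</a> <a href="/about.html">About</a>'
print(add_download_attribute(page))
# <a href="/images/3.jpg" download="3.jpg">3.jpg</a> <a href="/about.html">About</a>
```

Regular expressions are fragile on real-world HTML (single-quoted attributes, an existing download attribute, unusual spacing), so for anything beyond simple, predictable markup an HTML parser is the safer tool.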

How to split a string on multiple pattern using pythonic way (one liner)?

I am trying to extract the file name, without its extension, from a file object. My file names look like this:
this site:time.list, this.list, this site:time_sec.list, that site:time_sec.list and so on. The part I need always comes before either a whitespace or a dot.
Currently I use the following two statements to get the part of the file name before the first space and before the first dot:
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can I combine the two into a single, Pythonic one-liner?
Thanks in advance.

Use re.split with a character class, as below.
[ .] splits on either a space or a dot character:
re.split('[ .]', os.path.basename(f.name))[0]

You don't need regex for this. Split on the space first, then split the result on the dot. If the first piece still contains a dot, the second split shortens it; if it doesn't, the second split returns it unchanged.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]

Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
try this:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. It's much more likely to get unexpected results you didn't think up in advance if you rely on implicit rules like that instead of actually looking at the strings you want to extract and tailor explicit expressions to fit the content.
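The three answers above can be compared side by side on the sample names from the question. This is a small sketch; the helper names are invented for illustration, and plain strings stand in for os.path.basename(f.name).

```python
import re

names = [
    'this site:time.list',
    'this.list',
    'this site:time_sec.list',
    'that site:time_sec.list',
]

def by_re_split(name):
    # Split once on the first space or dot and keep the left part.
    return re.split(r'[ .]', name, maxsplit=1)[0]

def by_chained_split(name):
    # Cut at the first space, then cut what is left at the first dot.
    return name.split(' ')[0].split('.')[0]

def by_word_match(name):
    # \w+ stops at the first character that is not a letter, digit or underscore.
    match = re.match(r'\w+', name)
    return match.group() if match else ''

for name in names:
    print(by_re_split(name), by_chained_split(name), by_word_match(name))
```

All three agree on these names. They differ once a name contains other punctuation: for a hypothetical name such as my-file.list, the two split-based versions return my-file, while the \w+ version returns my.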

Matching regex to set

I am looking for a way to match the beginning of a line to a regex and for the line to be returned afterwards. The set is quite extensive hence why I cannot simply use the method given on Python regular expressions matching within set. I was also wondering if regex is the best solution. I have read the http://docs.python.org/3.3/library/re.html alas, it does not seem to hold the answer. Here is what I have tried so far...
import re
import os
import itertools
f2 = open(file_path)
unilist = []
bases=['A','G','C','N','U']
patterns= set(''.join(per) for per in itertools.product(bases, repeat=5))
#stuff
if re.match(r'.*?(?:patterns)', line):
    print(line)
    unilist.append(next(f2).strip())
print(unilist)
You see, the problem is that I do not know how to refer to my set...
The file I am trying to match it to looks like:
#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50 TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+
hhhhhhhhhhghhghhhhhfhhhhhfffffeee[X]b[d[ed`[Y[^Y

You are going about it the wrong way.
You simply leave the set of characters to the regular expression:
re.search('[AGCNU]{5}', line)
matches any 5-character pattern built from those 5 characters; that matches the same 3125 different combinations you generated with your set line, but doesn't need to build all possible combinations up front.
As for your attempt, the regular expression had no connection to your patterns variable: the pattern r'.*?(?:patterns)' matches zero or more arbitrary characters followed by the literal text 'patterns'.
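The equivalence between the generated set and the character class can be checked directly. The sketch below does that and then filters lines that begin with such a 5-character pattern, using re.match, which only matches at the start of a string. The helper name lines_starting_with_motif and the sample lines are made up for illustration.

```python
import itertools
import re

bases = ['A', 'G', 'C', 'N', 'U']
patterns = set(''.join(per) for per in itertools.product(bases, repeat=5))
motif = re.compile(r'[AGCNU]{5}')

print(len(patterns))                              # 3125
print(all(motif.fullmatch(p) for p in patterns))  # True

def lines_starting_with_motif(lines):
    # re.match anchors at the start of each line; use motif.search to match anywhere.
    return [line for line in lines if motif.match(line)]

sample_lines = ['AUGNA321354354', 'xxAUGNA', 'GGCUANN tail']
print(lines_starting_with_motif(sample_lines))    # ['AUGNA321354354', 'GGCUANN tail']
```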

According to what I've understood from your question, it seems to me that this could fit your need:
import re
sss = '''dfgsdfAUGNA321354354
!=**$=)"nNNUUG54788
=AkjhhUUNGffdffAAGjhff1245GGAUjkjdUU
.....cv GAUNAANNUGGA'''
print(re.findall(r'^(.+?[AGCNU]{5})', sss, re.MULTILINE))

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it has spaces and is not a list. I can match things such as a single letter: if I use just 't', I get a bunch of t's. This tells me that I am matching single letters. However, I want to match whole words and, beyond that, to preserve the wildcard structure.
What I would like is for a user to enter text (including a wildcard) that I can substitute in place of 'th*', with the wildcard still working as it should. That leads to the question: can I just put a variable holding the search text in place of 'th*'? After some investigation I am wondering whether I am supposed to translate the 'th*' somehow, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas"? Also, what is wrong with the second snippet, given that it does not work on the incoming text the way the first snippet works (correctly) on the list?

If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on any whitespace (spaces and newlines) to make a list of words
filtered = fnmatch.filter(file_contents, 'th*')
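To tie this back to the question: fnmatch.filter iterates over whatever it is given, and iterating over a string yields single characters, which is why 't' matched but 'th*' did not. A variable holding the user's wildcard can be passed in directly, and fnmatch.translate is only needed if you want a compiled regex. The sketch below shows both. The function names and the sample sentence are invented for illustration, and punctuation is left attached to words.

```python
import fnmatch
import re

def find_words(text, wildcard):
    """Return the whitespace-separated words of `text` that match a shell-style wildcard."""
    words = text.lower().split()
    return fnmatch.filter(words, wildcard.lower())

def find_words_with_regex(text, wildcard):
    # Same result, via the regex that fnmatch.translate builds from the wildcard.
    regex = re.compile(fnmatch.translate(wildcard.lower()))
    return [word for word in text.lower().split() if regex.match(word)]

user_pattern = 'th*'  # stands in for text typed by the user
sample = 'This is just a test thing'

print(fnmatch.filter('the cat', 't*'))              # ['t', 't']  (a string is iterated by character)
print(find_words(sample, user_pattern))             # ['this', 'thing']
print(find_words_with_regex(sample, user_pattern))  # ['this', 'thing']
```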
