Python Regex Get String Between Two Substrings - python

First off, I know this may seem like a duplicate question, but I could find no working solution to my problem.
I have a string that looks like the following:
string = "api('randomkey123xyz987', 'key', 'text')"
I need to extract randomkey123xyz987, which will always be between api(' and ',.
I was planning on using regex for this, but I seem to be having some trouble.
This is the only progress that I have made:
import re
string = "api('randomkey123xyz987', 'key', 'text')"
match = re.findall(r"\((.*?)\)", string)[0]
print(match)
The code above returns 'randomkey123xyz987', 'key', 'text'
I have tried to use [^'], but my guess is that I am not properly inserting it into the re.findall function.
Everything that I am trying is failing.
Update: My current workaround is slicing with match[2:-4], but I would still like to avoid that.

If the string contains only one instance, use re.search() instead:
>>> import re
>>> s = "api('randomkey123xyz987', 'key', 'text')"
>>> match = re.search(r"api\('([^']*)'", s).group(1)
>>> print(match)
randomkey123xyz987

You want the string between the ( and the comma, but you are capturing everything between the parentheses:
match = re.findall(r"api\((.*?),", string)
print(match)
["'randomkey123xyz987'"]
Or match between the quotes:
match = re.findall(r"api\('(.*?)'", string)
print(match)
['randomkey123xyz987']
If that is how your strings actually look, you can split:
string = "api('randomkey123xyz987', 'key', 'text')"
# Prints 'randomkey123xyz987' with the quotes included; slice with [5:-1] to drop them
print(string.split(",",1)[0][4:])

You should use the following regex:
api\('(.*?)'
This assumes that api(' is a fixed prefix.
It matches api(', then captures everything that follows, up to the next '.
>>> re.findall(r"api\('(.*?)'", "api('randomkey123xyz987', 'key', 'text')")
['randomkey123xyz987']
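The same idea can be generalised to any pair of delimiters. Below is a minimal sketch, assuming a hypothetical helper named between (it is not part of the re module); re.escape is used so that the delimiters are treated as literal text rather than as regex syntax.

```python
import re

def between(text, start, end):
    """Return the first substring found between start and end, or None."""
    # re.escape makes the delimiters literal, so "api('" needs no manual backslashes.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text)
    return match.group(1) if match else None

s = "api('randomkey123xyz987', 'key', 'text')"
print(between(s, "api('", "',"))  # randomkey123xyz987
```

Returning None when nothing matches avoids the AttributeError you would get from calling .group(1) on a failed re.search.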

If you are certain that randomkey123xyz987 will always be between "api('" and "',", then the split() method can get it done in one line, with no regex matching needed. It takes the text between the starting delimiter "api('" and the ending delimiter "',".
>>> string = "api('randomkey123xyz987', 'key', 'text')"
>>> value = (string.split("api('")[1]).split("',")[0]
>>> print(value)
randomkey123xyz987

Related

replacing special characters in string Python

I'm trying to replace special characters in a data frame with unaccented or different ones.
I can replace one with
df['col_name'] = df.col_name.str.replace('?','j')
This turned the '?' into 'j', but I can't figure out how to change more than one character.
I have a list of special characters that I want to change. I've tried using a dictionary, but it doesn't seem to work:
the_reps = {'?','j'}
df1 = df.replace(the_reps, regex = True)
This gave me the error "nothing to replace at position 0".
EDIT:
this is what worked - although it is probably not that pretty:
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')...
for each one ..
import re
s=re.sub("[_list of special characters_]","",_your string goes here_)
print(s)
An example:
import re
text = "Hello$#& Python3$"  # renamed from str to avoid shadowing the built-in
s = re.sub("[$#&]", "", text)
print(s)
# Output: Hello Python3
Explanation:
s = re.sub("[$#&]", "", text)
Pattern to be replaced → “[$#&]”
[] used to indicate a set of characters
[$#&] → will match either $ or # or &
The replacement string is given as an empty string
If these characters are found in the string, they’ll be replaced with an empty string
You can use Series.replace with a dictionary:
# d = {'actual character': 'replacement', ...}
df.columns = df.columns.to_series().replace(d, regex=True)
Try this:
import re
my_str = "hello Fayzan-Bhatti Ho~!w"
# Remove everything except letters, digits, spaces, newlines and dots
my_new_string = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)
print(my_new_string)
Output: hello FayzanBhatti How
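For replacing several characters with different replacements in one pass, a dictionary plus a callback passed to re.sub is one option. This is only a sketch on a plain string: the mapping and the name replace_chars are made up for illustration, so substitute your own characters.

```python
import re

# Hypothetical mapping of special characters to their replacements.
replacements = {"é": "e", "ñ": "n", "ø": "o"}

# One character class built from the keys; re.escape guards against regex metacharacters.
pattern = re.compile("[" + "".join(re.escape(ch) for ch in replacements) + "]")

def replace_chars(text):
    # The callback looks up each matched character in the mapping.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

print(replace_chars("café señor"))  # cafe senor
```

Compared with chaining one replace call per character, the string is scanned only once and the mapping lives in a single place.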

Extract number from a string using a pattern

I have strings like:
's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
From it I would like to obtain a tuple containing the year value and the month value as its first and second elements:
('2019', '5')
For now I did this:
([elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][0], [elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][1])
It isn't very elegant. How could I do better?
Use re.findall with the following regex pattern:
import re
matches = re.findall(r'(?i)/year=(\d+)/month=(\d+)', string)
Result:
# print(matches)
[('2019', '5')]
Regular expressions could do it. I would use them to capture the strings 'year=2019' and 'month=5', then split each one on '=' and take the item at index [-1]:
import re
search_string = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
string1 = re.findall(r'year=\d+', search_string)
string2 = re.findall(r'month=\d+', search_string)
result = (string1[0].split('=')[-1], string2[0].split('=')[-1])
print(result)
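As a variation on the answers above, named groups let a single re.search return both values without any splitting afterwards. A small sketch, where the group names year and month are my own choice:

```python
import re

path = "s3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5"

# Named groups make the result self-describing; \d+ keeps only the digits.
match = re.search(r"/year=(?P<year>\d+)/month=(?P<month>\d+)", path)
if match:
    result = (match.group("year"), match.group("month"))
    print(result)  # ('2019', '5')
```

The if guard matters because re.search returns None when a path has no year/month segment.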

How to get string following some specific letters?

How can I get the string that follows some specific characters? (More specifically, get "test" from "A8 test".)
In this case, "A8" follows a pattern like "[A-Z]+[0-9]+".
So it can also be "C6 test", "X90 test" and so on.
I've tried in Python using "(?<=[A-Z]+[0-9]).+", which throws an Exception:
"sre_constants.error: look-behind requires fixed-width pattern."
It means I should use a fixed-width pattern such as "(?<=[A-Z]{1}[0-9]{1})".
But the prefix is not actually fixed-width. What can I do?
If you mean getting the rest of the string after the pattern "[A-Z]+[0-9]+", you can try this:
import re
s1 = 'A8 test'
s2 = 'C6 123'
s3 = 'X90 test32'
# the parenthesized group is the part you want
p = re.compile(r"[A-Z]+[0-9]+ (\w+)")
print(p.findall(s1))
print(p.findall(s2))
print(p.findall(s3))
output:
['test']
['123']
['test32']
Hope that will help you, and comment if you have further questions. : )
You can use a capture group to get what you need.
>>> import re
>>> regexp = r"[A-Z]+[0-9]+ (.+)"
>>> re.search(regexp, "C6 test")[1]
'test'
>>> re.search(regexp, "X90 test")[1]
'test'
>>> re.search(regexp, "CBF58456 test")[1]
'test'
Note that the current pattern you show would pick up on any number of uppercase letters followed by any number of digits, as long as there's at least one of each. Also note that my example above would require a blank between the first part and the test string to capture.
You could also use re.sub to drop the part of the string you do not need by passing an empty string as the second argument:
import re
text = "X90 test"
t = re.sub(r"[A-Z]+[0-9]+ ", "", text)
print(t)  # test
import re
ex = r"[A-Z]+[0-9]+ (.+)"
print(re.search(ex , "X90 test")[1])
print(re.search(ex , "C6 test")[1])
print(re.search(ex , "CBF58456 test")[1])
Output
test
test
test
You can split the string, then get your string.
>>> re.split(r'([A-Z]+[0-9]+ )(test)', 'A8 test')
['', 'A8 ', 'test', '']
Or you can write a simple function that finds your string without using regex.
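One possible shape for such a regex-free function is sketched below. The name text_after_code is made up, and it assumes the prefix sits at the very start of the string and is followed by exactly one space, which is stricter than the findall-based answers.

```python
def text_after_code(s):
    """Return the text after a leading <uppercase letters><digits><space> prefix, or None."""
    i = 0
    # Consume the run of ASCII uppercase letters.
    while i < len(s) and "A" <= s[i] <= "Z":
        i += 1
    letters_end = i
    # Consume the run of ASCII digits.
    while i < len(s) and "0" <= s[i] <= "9":
        i += 1
    # Require at least one letter, at least one digit, then a single space.
    if letters_end == 0 or i == letters_end or i >= len(s) or s[i] != " ":
        return None
    return s[i + 1:]

print(text_after_code("A8 test"))     # test
print(text_after_code("X90 test32"))  # test32
```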

Regex to retrieve the last few characters of a string

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string that follows id=
The following regex didn't seem to work in Python:
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?
Your regular expression
(.*?)
will not work because the lazy quantifier *? matches as few characters as possible, and with nothing after the group that means zero characters, so the group is always empty. You have the following choices of regex:
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But you don't need regex at all here; simply split the string like this:
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print(data.split("id=", 1)[-1])
Output
com.lima.doodlejump
If you really have to use regex, you can do it like this:
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print(re.search("id=(.*)", data).group(1))
Output
com.lima.doodlejump
I'm surprised that nobody has mentioned urlparse yet...
>>> import urlparse  # Python 2 module; in Python 3 it became urllib.parse
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is robustness: if the URL query string gains more components, the other solutions that rely on a simple str.split can easily break. That won't confuse urlparse, however :).
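In Python 3 the same functions live in urllib.parse. Here is a short sketch of the equivalent call; the extra &hl=en parameter is an assumption added to show what happens when the query string grows.

```python
from urllib.parse import urlparse, parse_qs

# The "&hl=en" suffix is hypothetical, added to show a query with more than one component.
url = "https://play.google.com/store/apps/details?id=com.lima.doodlejump&hl=en"

query = parse_qs(urlparse(url).query)  # {'id': ['com.lima.doodlejump'], 'hl': ['en']}
app_id = query["id"][0]
print(app_id)  # com.lima.doodlejump

# A plain split keeps the trailing parameter as well:
print(url.split("id=", 1)[-1])  # com.lima.doodlejump&hl=en
```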
Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!
This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.
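To see the difference between the lazy and greedy versions side by side, here is a quick comparison on the sample URL (the variable names are mine):

```python
import re

sample_url = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"

# Lazy with nothing after the group: the shortest possible match is the empty string.
lazy = re.search(r"id=(.*?)", sample_url).group(1)
# Lazy but anchored to the end: the group must stretch to reach "$".
lazy_anchored = re.search(r"id=(.*?)$", sample_url).group(1)
# Greedy: the group takes everything that is left.
greedy = re.search(r"id=(.*)", sample_url).group(1)

print(repr(lazy))           # ''
print(repr(lazy_anchored))  # 'com.lima.doodlejump'
print(repr(greedy))         # 'com.lima.doodlejump'
```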

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to the one given below. For this particular string, the "taxid" value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then a number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
for line in lines:
    # [^|]+ captures everything after "taxid:" up to the next "|", e.g. "9606(Human)"
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:
        print(match.groups())
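For a large number of strings, precompiling the pattern and handling lines without a taxid may be worth it. This is a sketch under assumptions: the helper name taxids is made up, and the second and third sample lines are invented to show the edge cases.

```python
import re

TAXID = re.compile(r"taxid:(\d+)")

def taxids(lines):
    """Return the taxid digits for each line, or None where a line has no taxid."""
    result = []
    for line in lines:
        # search (not match) finds "taxid:" anywhere in the line.
        m = TAXID.search(line)
        result.append(m.group(1) if m else None)
    return result

lines = [
    "score:0.86|taxid:9606(Human)|intact:EBI-999900",
    "score:0.40|intact:EBI-000000",  # hypothetical line with no taxid
    "taxid:10090(Mouse)|score:0.9",  # hypothetical line with taxid first
]
print(taxids(lines))  # ['9606', None, '10090']
```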
