How to get a string after keyword - python

I would like to get the string after a specific keyword.
For example:
import re
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
abc = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
if findWholeWord("SeedNumber")(abc):
    dddd = re.search('(?<=ThreepointShooter)(.\w+)', abc)
    mvp = dddd.group()
    print (mvp)
    print ("found")
else:
    print ("not found")
I expect the result to be 'MVP1times'.
Is there a better way to find a specific string after a keyword? The result may be a string, digits, or a mix like the one above.
Thanks for help!

You can use look-arounds to get the string surrounded by > and < (assuming this stays consistent):
>>> s = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
>>> re.search(r'(?<=\>)[^<]+(?=\<)', s).group(0)
'MVP1times'

You can change the regular expression to: (?<=ThreepointShooter['|"]>)(.\w+). See it live on http://pythex.org/
I'm not sure exactly what you're going to do, but you don't even need a lookbehind expression here.
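To illustrate the "no lookbehind needed" point, here is a minimal sketch (Python 3) that uses a plain capture group instead. The helper name text_after_keyword is made up for this example, and it assumes the keyword is always followed by a closing quote and a >, as in the sample string.

```python
import re

s = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"

def text_after_keyword(keyword, text):
    """Hypothetical helper: return the tag text following `keyword`, or None."""
    # re.escape guards against regex metacharacters inside the keyword.
    # ['"]> consumes the closing quote and bracket; ([^<]+) captures
    # everything up to the next opening angle bracket.
    match = re.search(re.escape(keyword) + r"""['"]>([^<]+)""", text)
    return match.group(1) if match else None

print(text_after_keyword("ThreepointShooter", s))  # MVP1times
print(text_after_keyword("SeedNumber", s))         # None
```

Returning None for a missing keyword avoids calling .group() on a failed search, which would otherwise raise AttributeError.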


Regex Expression not matching correctly

I'm tackling a python challenge problem to find a block of text in the format xXXXxXXXx (lower vs upper case, not all X's) in a chunk like this:
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
I have tested the following RegEx and found it correctly matches what I am looking for from this site (http://www.regexr.com/):
'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])'
However, when I try to match this expression to the block of text, it just returns the entire string:
In [1]: import re
In [2]: example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
In [3]: expression = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
In [4]: found = expression.search(example)
In [5]: print found.string
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
Any ideas? Is my expression incorrect? Also, if there is a simpler way to represent that expression, feel free to let me know. I'm fairly new to RegEx.
You need to return the match group instead of the string attribute.
>>> import re
>>> s = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
>>> rgx = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')
>>> found = rgx.search(s).group()
>>> print found
nJDKoJIWh
The string attribute always returns the string passed as input to the match. This is clearly documented:
string
The string passed to match() or search().
The problem has nothing to do with the matching, you're just grabbing the wrong thing from the match object. Use match.group(0) (or match.group()).
If, based on xXXXxXXXx, you want runs of three uppercase letters separated by single lowercase letters, this is the pattern you want:
([a-z])(([A-Z]){3}([a-z]))+
You can also get the matched text straight from your search call with group():
print expression.search(example).group(0)
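A short Python 3 sketch of the difference between the two attributes discussed in these answers, using the string from the question. The variable names are only for illustration.

```python
import re

example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
pattern = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')

match = pattern.search(example)

whole_input = match.string    # the full string that was passed to search()
matched_text = match.group()  # only the part the pattern actually matched
start, end = match.span()     # where that part sits inside the input

print(whole_input)
print(matched_text)           # nJDKoJIWh
print(example[start:end] == matched_text)
```

The capture groups from the question are not needed when only the whole match is wanted, so the pattern here drops them.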

python regular expression : How can I filter only special characters?

I want to check whether given words contain a special character or not.
Below is my Python code.
The literal 'a#bcd' contains '#', so it is matched, which is correct.
But 'a1bcd' has no special character, and it was matched too!!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd') :
    print 'matched!! nich catch!!'
if regexp.search('a1bcd') :
    print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it works like above..someone help me..T_T;;;
thanks~
Move the dash in your regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read with the surrounding characters as )-_ and since it is inside [] it is interpreted as asking to match a range from ) to _. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
re will be relatively slow for this
I'd suggest trying
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)
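Note that word.translate(None, specialchars) is the Python 2 form. A rough Python 3 sketch of the same two ideas follows; the character list is copied from the answer above, and the function names has_special and strip_special are invented for the example.

```python
# Character list copied from the answer above; a set makes lookups cheap
# and ignores the duplicated '#'.
SPECIAL_CHARS = set('''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?''')

def has_special(word):
    # True if at least one character of word is in the special set.
    return not SPECIAL_CHARS.isdisjoint(word)

# In Python 3, str.translate takes a mapping; str.maketrans('', '', chars)
# builds one that deletes every character in chars.
DELETE_SPECIAL = str.maketrans('', '', ''.join(SPECIAL_CHARS))

def strip_special(word):
    # Return word with every special character removed.
    return word.translate(DELETE_SPECIAL)

print(has_special('a#bcd'))    # True
print(has_special('a1bcd'))    # False
print(strip_special('a#b-c'))  # abc
```

No regex is involved here, so the dash needs no special placement.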

Simple python regex, match after colon

I have a simple regex question that's driving me crazy.
I have a variable x = "field1: XXXX field2: YYYY".
I want to retrieve YYYY (note that this is an example value).
My approach was as follows:
values = re.match('field2:\s(.*)', x)
print values.groups()
It's not matching anything. Can I get some help with this? Thanks!
Your regex is good
field2:\s(.*)
Try this code
match = re.search(r"field2:\s(.*)", subject)
if match:
    result = match.group(1)
else:
    result = ""
re.match() only matches at the start of the string. You want to use re.search() instead.
Also, you should use a raw string (the r prefix), so that backslashes such as \s reach the regex engine unchanged:
>>> values = re.search(r'field2:\s(.*)', x)
>>> print values.groups()
('YYYY',)
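For comparison, a small Python 3 sketch that runs both functions on the string from the question. The second input, y, and the tighter pattern are hypothetical additions showing what might be done if another field followed field2.

```python
import re

x = "field1: XXXX field2: YYYY"
pattern = r'field2:\s(.*)'

anchored = re.match(pattern, x)   # only tries at position 0, so this is None
anywhere = re.search(pattern, x)  # scans the whole string for a match
value = anywhere.group(1)

print(anchored)  # None
print(value)     # YYYY

# Hypothetical input with a trailing field: (.*) would swallow the rest of
# the line, so \S+ is used to stop at the first whitespace instead.
y = "field2: YYYY field3: ZZZZ"
first_token = re.search(r'field2:\s*(\S+)', y).group(1)
print(first_token)  # YYYY
```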

Why doesn't this regular expression match in this string?

I want to be able to replace a string in a file using regular expressions. But my function isn't finding a match. So I've mocked up a test to replicate what's happening.
I have defined the string I want to replace as follows:
string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'
I want to replace the "TYPE=PUZZLE&PREFIX=EXPRESS&" part with something else. (NB: the string won't always contain exactly "PUZZLE" and "PREFIX" in the original file, but it will be of that format.)
So first I tried testing that I got the correct match.
obj = re.search(r'TYPE=([\^&]*)\&PREFIX=([\^&]*)\&', string)
if obj:
    print obj.group()
else:
    print "No match!!"
My thinking was that ([\^&]*) would match any number of characters that are NOT an ampersand.
But I always get "No match!!".
However,
obj = re.search(r'TYPE=([\^&]*)', string)
returns me "TYPE="
Why doesn't my first one work?
Since the ^ sign is escaped with \ the following part: ([\^&]*) matches any sequence of these characters: ^, &.
Try replacing it with ([^&]*).
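Here is a quick Python 3 sketch of the difference, using the string from the question. The replacement values a and b are placeholders chosen for the example.

```python
import re

string = 'buf = O_strdup("ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&");'

# [\^&] is a class containing a literal caret and an ampersand, so the
# group matches zero characters right after "TYPE=".
literal = re.search(r'TYPE=([\^&]*)', string)
print(literal.group())   # TYPE=

# [^&] is a negated class: any character except an ampersand.
negated = re.search(r'TYPE=([^&]*)&PREFIX=([^&]*)&', string)
print(negated.groups())  # ('PUZZLE', 'EXPRESS')

# The same pattern can drive the replacement the question is after.
replaced = re.sub(r'TYPE=[^&]*&PREFIX=[^&]*&', 'TYPE=a&PREFIX=b&', string)
print(replaced)
```

The ampersand has no special meaning in a regex, so it does not need the backslash used in the question.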
In my regex tester, this does work: 'TYPE=(.*)\&PREFIX=(.*)\&'
Try this instead
obj = re.search(r'TYPE=(?P<type>[^&]*?)&PREFIX=(?P<prefix>[^&]*?)&', string)
The ?P<some_name> is a named capture group and makes it a little bit easier to access the captured group, obj.group("type") -->> 'PUZZLE'
It might be better to use the functions urlparse.parse_qsl() and urllib.urlencode() instead of regular expressions. The code will be less error-prone:
from urlparse import parse_qsl
from urllib import urlencode
s = "ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&"
a = parse_qsl(s)
d = dict(TYPE="a", PREFIX="b")
print urlencode(list((key, d.get(key, val)) for key, val in a))
# ONE=001&TYPE=a&PREFIX=b
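The imports above are the Python 2 locations. On Python 3 both functions live in urllib.parse, so the same idea could be sketched like this (the name overrides is just for the example):

```python
from urllib.parse import parse_qsl, urlencode

s = "ONE=001&TYPE=PUZZLE&PREFIX=EXPRESS&"

# parse_qsl returns a list of (key, value) pairs in their original order.
pairs = parse_qsl(s)

# Replacement values for the keys we want to change.
overrides = dict(TYPE="a", PREFIX="b")

# Keep each original value unless an override exists for its key.
result = urlencode([(key, overrides.get(key, val)) for key, val in pairs])
print(result)  # ONE=001&TYPE=a&PREFIX=b
```

Note that the trailing & of the input is not reproduced in the output, so it has to be added back if the file format requires it.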

Returning all characters before the first underscore

Using re in Python, I would like to return all of the characters in a string that precede the first appearance of an underscore. In addition, I would like the string that is being returned to be in all uppercase and without any non-alphanumeric characters.
For example:
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
I am pretty sure I know how to return a string in all uppercase using string.upper() but I'm sure there are several ways to remove the . efficiently. Any help would be greatly appreciated. I am still learning regular expressions slowly but surely. Each tip gets added to my notes for future use.
To further clarify, my above examples aren't the actual strings. The actual string would look like:
AG.av08_binloop_v6
With my desired output looking like:
AGAV08
And the next example would be the same. String:
TL.av1_binloopv2
Desired output:
TLAV1
Again, thanks all for the help!
Even without re:
text.split('_', 1)[0].replace('.', '').upper()
Try this:
re.sub("[^A-Z\d]", "", re.search("^[^_]*", str).group(0).upper())
Since everyone is giving their favorite implementation, here's mine that doesn't use re:
>>> for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
...     print ''.join(c for c in s.split('_',1)[0] if c.isalnum()).upper()
...
AGAV08
TLAV1
I put .upper() on the outside of the generator so it is only called once.
You don't have to use re for this. Simple string operations would be enough based on your requirements:
tests = """
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
"""
for t in tests.strip().splitlines():
    print t[:t.find('_')].replace('.', '').upper()
# Returns:
# AGAV08
# TLAV1
Or if you absolutely must use re:
import re
pat = r'([a-zA-Z0-9.]+)_.*'
pat_re = re.compile(pat)
for t in tests.strip().splitlines():
    print re.sub(r'\.', '', pat_re.findall(t)[0]).upper()
# Returns:
# AGAV08
# TLAV1
Hey, just for fun, another option to get the text before the first underscore is:
before_underscore, sep, after_underscore = str.partition('_')
So all in one line could be:
re.sub("[^A-Z\d]", "", str.partition('_')[0].upper())
import re
re.sub("[^A-Z\d]", "", yourstr.split('_',1)[0].upper())
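Pulling the non-re suggestions together, one possible Python 3 sketch wraps the logic in a function. The name prefix_code is invented here, and the third input is a hypothetical case with no underscore at all.

```python
def prefix_code(name):
    """Hypothetical helper: uppercase alphanumerics before the first underscore."""
    # partition returns the whole string as the head when no '_' is present.
    head = name.partition('_')[0]
    # Keep only letters and digits, then uppercase once at the end.
    return ''.join(c for c in head if c.isalnum()).upper()

for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2', 'nounderscore'):
    print(prefix_code(s))
# AGAV08
# TLAV1
# NOUNDERSCORE
```

Using partition (or split('_', 1)) rather than slicing with find('_') matters for the no-underscore case: find returns -1 there, and slicing with [:-1] would silently drop the last character.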
