python regular expression returns empty string - python

Given:
lst = ['(abc): my name is ?123']
I'm trying to return everything from ': ' till the end of lst[0], for that I tried a regex expression:
result = re.search(r': (.*?)', lst[0]).group(1)
It returns an empty string.
How can this be done using regex correctly?
Expected output :
'my name is ?123'
Resources used : Regex wiki

The issue is that you made .* lazy by placing ? after it. A lazy quantifier matches as little as possible while still producing a valid match. Since your pattern has nothing to match after (.*?), the regex engine is satisfied with the empty string. Use the greedy version, (.*), and it will work.
import re
lst = ['(abc): my name is ?123']
result = re.search(r': (.*)', lst[0]).group(1)
print(result)
This prints:
my name is ?123
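To see the difference side by side, here is a minimal sketch using the same lst. The variable names lazy, greedy and anchored are only for illustration. It also shows that the lazy form works once something after the group (here the $ anchor) forces it to extend to the end of the string.

```python
import re

lst = ['(abc): my name is ?123']

# Lazy: nothing follows the group, so matching zero characters is enough
lazy = re.search(r': (.*?)', lst[0]).group(1)

# Greedy: takes as much as it can, up to the end of the string
greedy = re.search(r': (.*)', lst[0]).group(1)

# Lazy again, but $ forces the group to reach the end of the string
anchored = re.search(r': (.*?)$', lst[0]).group(1)

print(repr(lazy), repr(greedy), repr(anchored))
```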

Related

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ only negates inside brackets [], and brackets match single characters rather than words, this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    # group 1 is the text to mask, group 2 is the word to keep
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain: the RE itself (I always use raw strings) matches one or more of any character (.+), but non-greedily (?). This is captured in the first group (the parentheses). That is followed by either "going" or "you" or the end of the string ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. For the first group we only need the length, since we are replacing each character with an underscore. The returned string replaces the text that matched the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you want to exclude whole words, you need to remember that most regex engines (most of them are traditional NFA engines) analyze strings character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed strings as words (using word boundaries). And since sub replaces each match with its replacement string, you can't just pass '_': a part like 'I am' would then be replaced with three underscores (one each for 'I', ' ' and 'am'). Instead, pass a function as the second argument of sub and multiply '_' by the length of the matched string.
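A third option, sketched here on the assumption that the words to keep are known up front: split the string on those words with a capturing group, so the words themselves stay in the result, and mask everything in between. The helper name mask_except is made up for this example. Like the answers above, it keeps "you" even inside a longer word such as "your".

```python
import re

def mask_except(text, keep):
    # A capturing group makes re.split return the separators (the kept words) too
    pattern = "(" + "|".join(re.escape(word) for word in keep) + ")"
    parts = re.split(pattern, text)
    # Even indices are the gaps between kept words, odd indices are the kept words
    return "".join(
        part if index % 2 else "_" * len(part)
        for index, part in enumerate(parts)
    )

s = 'I am going home now, thank you.'
print(mask_except(s, ['going', 'you']))
```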

Can't get re.search() to work in Python

I have a string of type "animal cat dog" and I am trying to extract animal from it.
Following this example, I tried using re.search (and also re.match later), however that didn't produce the result I expected. The if statement would go through but groups() would be empty.
The code I had:
string = "fox cat dog"
regex = "\S+ cat dog\s*"
m = re.search(regex, string)
if m:
    temp = m.group(1)
I tried printing out m and m.groups() and they had the following values:
m: <_sre.SRE_Match object at 0x000000002009A920>
m.groups(): ()
I found a way around the problem by using substrings and .find() but I am very curious what was wrong with my original code.
Any help would be appreciated. Thank you!
You just need to put parentheses around the part you want to capture, like so:
import re
string = "fox cat dog"
regex = r"(\S+) cat dog\s*"
#         ^^^^^ note the parentheses here
m = re.search(regex, string)
if m:
    temp = m.group(1)  # 'fox'
You may want to check the documentation for more information:
(...) Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
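As a quick illustration of that paragraph, here is a small sketch comparing the pattern with and without parentheses on the string from the question. The variable names are only for the example.

```python
import re

string = "fox cat dog"

# Without parentheses the pattern matches, but defines no groups
without_group = re.search(r"\S+ cat dog\s*", string)
no_groups = without_group.groups()      # empty tuple
whole_match = without_group.group(0)    # group 0 is always the entire match

# With parentheses, group 1 holds the captured word
with_group = re.search(r"(\S+) cat dog\s*", string)
animal = with_group.group(1)

print(no_groups, whole_match, animal)
```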

Get particular information from a string

I want to get the value of name from fstr using RegEx in Python. I tried as below, but couldn't find the intended result.
Any help will be highly appreciated.
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever" #",Extra=whatever" this portion is optional
myobj = re.search( r'(.*?),Name(.*?),*(.*)', fstr, re.M|re.I)
print(myobj.group(2))
Believe it or not, the actual problem is the ,* in your regular expression: it makes the comma optional. The second capturing group matches nothing, because .*? means "match zero or more characters, as few as possible". The engine then tries the next item, ,*, which means "match , zero or more times"; it matches zero times, and the last capturing group takes the rest of the string.
If you want to fix your RegEx, you can simply remove the * after the comma, like this
myobj = re.search( r'(.*?),Name(.*?),(.*)', fstr, re.I)
print(myobj.group(2))
# =XYZ
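To make the effect visible, this sketch prints the groups captured by the original pattern and by the fixed one on the same fstr. The names broken and fixed are only for illustration.

```python
import re

fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"

# Original pattern: the optional comma lets the lazy group 2 stay empty
broken = re.search(r'(.*?),Name(.*?),*(.*)', fstr, re.I)

# Fixed pattern: the mandatory comma forces group 2 to extend up to it
fixed = re.search(r'(.*?),Name(.*?),(.*)', fstr, re.I)

print(broken.groups())
print(fixed.groups())
```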
But as the other answer shows, you don't have to create additional capture groups.
BTW, I like to use RegEx only when it is particularly needed. In this case, I would have solved it, without RegEx, like this
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"
d = dict(item.split("=") for item in fstr.split(","))
# {'FCode': '1', 'Extra': 'whatever', 'Name': 'XYZ', 'MCode': '1'}
Now that I have all the information, I can access them like this
print(d["Name"])
# XYZ
Simple, huh? :-)
Edit: If you want to use the same regex for one million records, we can slightly improve the performance by precompiling the RegEx, like this
import re
pattern = re.compile(r"Name=([^,]+)", re.I)
match = pattern.search(data)  # data is one record string, such as fstr above
if match:
    name = match.group(1)
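Applied to several records, that could look roughly like the sketch below. The records list is invented sample data, including one record without the optional Extra part and one without a Name at all.

```python
import re

# Compiled once, reused for every record
NAME_PATTERN = re.compile(r"Name=([^,]+)", re.I)

# Hypothetical sample records, made up for this example
records = [
    "MCode=1,FCode=1,Name=XYZ,Extra=whatever",
    "MCode=2,FCode=7,Name=ABC",
    "MCode=3,FCode=9",
]

names = []
for record in records:
    match = NAME_PATTERN.search(record)
    # Records without a Name field yield None instead of raising
    names.append(match.group(1) if match else None)

print(names)
```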
You can do it as follows:
import re
fstr = "MCode=1,FCode=1,Name=XYZ,Extra=whatever"
myobj = re.search( r'Name=([^,]+)', fstr, re.M|re.I)
print(myobj.group(1))
# XYZ
Try it with a named group:
rule = re.compile(r"Name=(?P<Name>\w*)")  # no trailing comma, so Name may also be the last field
res = rule.search(fstr)
print(res.group("Name"))
# XYZ

python regular expression replace

I'm trying to change a string that contains substrings such as
the</span></p>
<p><span class=font7>currency
to
the currency
The line break is CRLF.
The words before and after the code change. I only want to replace if the second word starts with a lower case letter. The only thing that changes in the code is the digit after 'font'
I tried:
p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)
but this isn't working
How should I fix this?
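Use a lookahead assertion, so the lowercase letter is checked but not consumed. (In your version, ' \1' is not a raw string, so Python reads \1 as the control character \x01 rather than a backreference; r' \1' would also work.)
p = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)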
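Here is a small sketch comparing the three variants on sample input. The data string is built from the snippet in the question, and the variable names are only for illustration.

```python
import re

data = "the</span></p>\r\n<p><span class=font7>currency"

with_group = re.compile(r'</span></p>\r\n<p><span class=font\d>([a-z])')
with_lookahead = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')

# Non-raw replacement: '\1' is the control character \x01, not a backreference
broken = with_group.sub(' \1', data)

# Raw replacement: \1 refers to the captured lowercase letter
fixed = with_group.sub(r' \1', data)

# Lookahead: the letter is never consumed, so a plain space is enough
looked_ahead = with_lookahead.sub(' ', data)

# A capitalized second word is left untouched
capitalized = "The</span></p>\r\n<p><span class=font7>Currency"
unchanged = with_lookahead.sub(' ', capitalized)

print(repr(broken), repr(fixed), repr(looked_ahead))
```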
I think you should use the flag re.DOTALL, which makes . match any character, including line breaks (by default . does not match a newline).
So the first line of your code would become:
p = re.compile(r'</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)
(note the two unescaped dots in place of the line break).
There is also re.MULTILINE, which makes ^ and $ match at the start and end of every line; whenever I have a problem like this, one of those two flags ends up solving it.
Hope it helps.
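For reference, a short sketch of what the two flags change, using a made-up two-line string with a CRLF break:

```python
import re

text = "first\r\nsecond"

# Without DOTALL, . matches \r but not \n, so the two dots cannot span CRLF
no_flag = re.search(r"first..second", text)

# With DOTALL, . matches \n as well
dotall = re.search(r"first..second", text, re.DOTALL)

# MULTILINE does not affect . at all; it makes ^ and $ match at every line
line_starts_default = re.findall(r"^\w+", text)
line_starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

print(no_flag, bool(dotall), line_starts_default, line_starts_multiline)
```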
This :
result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)
Applied to :
the</span></p>
<p><span class=font7>currency
Produces :
the currency
However, I would strongly advise against using regex on XML/HTML/XHTML. This generic regex removes all elements and captures the text before and after them in groups 1 and 2.

Python-Regex, what's going on here?

I've got a book on python recently and it's got a chapter on Regex, there's a section of code which I can't really understand. Can someone explain exactly what's going on here (this section is on Regex groups)?
>>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
>>> addrs = "Zip: 10010 State: NY"
>>> y = re.search(my_regex, addrs)
>>> y.groupdict('zip')
{'zip': 'Zip: 10010'}
>>> y.group(2)
'State: NY'
regex definition:
(?P<zip>...)
Creates a named group "zip"
Zip:\s*
Match "Zip:" and zero or more whitespace characters
\d
Match a digit
\w
Match a word character [A-Za-z0-9_]
y.groupdict('zip')
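The groupdict method returns a dictionary with named groups as keys and their matches as values. Here the only named group is "zip", so its match is returned. (The argument 'zip' is not a key lookup: it is the default value used for named groups that did not take part in the match.)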
y.group(2)
Returns the match for the second group, which is an unnamed group "(...)"
Hope that helps.
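To show what the argument to groupdict actually does, here is a small sketch with a made-up pattern in which the second named group is optional:

```python
import re

# Hypothetical pattern for illustration: the state part may be missing
pattern = r'(?P<zip>\d{5})(?:\s+(?P<state>[A-Z]{2}))?'
m = re.search(pattern, "10010")

# Named groups that took no part in the match default to None...
plain = m.groupdict()

# ...or to whatever value is passed to groupdict
with_default = m.groupdict('unknown')

print(plain)
print(with_default)
```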
The search method will return an object containing the results of your regex pattern.
groupdict returns a dictionary of groups where the keys are the names of the groups defined by (?P<name>...). Here name is the name of the group.
group returns the text matched by the group you ask for. Group 0 is the entire match, group 1 is "Zip: 10010" and group 2 is "State: NY".
This was a relatively simple question by the way. I simply looked up the method documentation on google and found this page. Google is your friend.
import re
# my_regex = r' <= the r prefix makes this a raw string; without it you'd normally need double backslashes
# ( ... ) this groups something
# (?P<zip> ... ) a ? right after an opening bracket starts an extension; here it names the group "zip"
# * this means "zero or more of the previous item, as many as you can find"
# \s is whitespace
# \d is a digit, also works with [0-9]
# \w is an alphanumeric character or underscore
my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
addrs = "Zip: 10010 State: NY"
# Searches the string for the first match of the regex
y = re.search(my_regex, addrs)
The (?P<identifier>match) syntax is Python's way of implementing named capturing groups. That way, you can access what was matched by match using a name instead of just a sequential number.
Since the first set of parentheses is named zip, you can access its match using the match's groupdict method to get an {identifier: match} pair. Or you could use y.group('zip') if you're only interested in the match (which usually makes sense since you already know the identifier). You could also access the same match using its sequential number (1). The next match is unnamed, so the only way to access it is its number.
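Putting those access paths next to each other, as a sketch using the regex and string from the question:

```python
import re

my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
addrs = "Zip: 10010 State: NY"
y = re.search(my_regex, addrs)

entire = y.group(0)         # the whole match
by_name = y.group('zip')    # named group, accessed by name
by_number = y.group(1)      # the same group, accessed by position
state = y.group(2)          # unnamed group, only reachable by number
named_only = y.groupdict()  # contains named groups only
all_groups = y.groups()     # every group, in order

print(entire, by_name, by_number, state, named_only, all_groups, sep="\n")
```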
Adding to previous answers: In my opinion you'd better choose one type of groups (named or unnamed) and stick with it. Normally I use named groups. For example:
>>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(?P<state>State:\s*\w\w)'
>>> addrs = "Zip: 10010 State: NY"
>>> y = re.search(my_regex, addrs)
>>> print(y.groupdict())
{'zip': 'Zip: 10010', 'state': 'State: NY'}
strfriend is your friend:
http://strfriend.com/vis?re=(Zip%3A\s*\d\d\d\d\d)\s*(State%3A\s*\w\w)
EDIT: Why the heck is it making the entire line a link in the actual comment, but not the preview?
