Python-Regex, what's going on here? - python

I've got a book on python recently and it's got a chapter on Regex, there's a section of code which I can't really understand. Can someone explain exactly what's going on here (this section is on Regex groups)?
>>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
>>> addrs = "Zip: 10010 State: NY"
>>> y = re.search(my_regex, addrs)
>>> y.groupdict('zip')
{'zip': 'Zip: 10010'}
>>> y.group(2)
'State: NY'

regex definition:
(?P<zip>...)
Creates a named group "zip"
Zip:\s*
Match "Zip:" and zero or more whitespace characters
\d
Match a digit
\w
Match a word character [A-Za-z0-9_]
y.groupdict('zip')
The groupdict method returns a dictionary with named groups as keys and their matches as values. In this case, the match for the "zip" group gets returned
y.group(2)
Return the match for the second group, which is a unnamed group "(...)"
Hope that helps.

The search method will return an object containing the results of your regex pattern.
groupdict returns a dictionnary of groups where the keys are the name of the groups defined by (?P...). Here name is a name for the group.
group returns a list of groups that are matched. "State: NY" is your third group. The first is the entire string and the second is "Zip: 10010".
This was a relatively simple question by the way. I simply looked up the method documentation on google and found this page. Google is your friend.

# my_regex = r' <= this means that the string is a raw string, normally you'd need to use double backslashes
# ( ... ) this groups something
# ? this means that the previous bit was optional, why it's just after a group bracket I know not
# * this means "as many of as you can find"
# \s is whitespace
# \d is a digit, also works with [0-9]
# \w is an alphanumeric character
my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(State:\s*\w\w)'
addrs = "Zip: 10010 State: NY"
# Runs the grep on the string
y = re.search(my_regex, addrs)

The (?P<identifier>match) syntax is Python's way of implementing named capturing groups. That way, you can access what was matched by match using a name instead of just a sequential number.
Since the first set of parentheses is named zip, you can access its match using the match's groupdict method to get an {identifier: match} pair. Or you could use y.group('zip') if you're only interested in the match (which usually makes sense since you already know the identifier). You could also access the same match using its sequential number (1). The next match is unnamed, so the only way to access it is its number.

Adding to previous answers: In my opinion you'd better choose one type of groups (named or unnamed) and stick with it. Normally I use named groups. For example:
>>> my_regex = r'(?P<zip>Zip:\s*\d\d\d\d\d)\s*(?P<state>State:\s*\w\w)'
>>> addrs = "Zip: 10010 State: NY"
>>> y = re.search(my_regex, addrs)
>>> print y.groupdict()
{'state': 'State: NY', 'zip': 'Zip: 10010'}

strfriend is your friend:
http://strfriend.com/vis?re=(Zip%3A\s*\d\d\d\d\d)\s*(State%3A\s*\w\w)
EDIT: Why the heck is it making the entire line a link in the actual comment, but not the preview?

Related

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.

How to print substring using RegEx in Python?

This is two texts:
1) 'provider:sipoutilp1.ym.ms'
2) 'provider:sipoutqtm.ym.ms'
I would like to print ilp when reaches to the fist line and qtm when reaches to the second line.
This is my solution but it is not working.
RE_PROVIDER = re.compile(r'(?P<provider>\((ilp+|qtm+)')
or in the line below,
182938,DOMINICAN REPUBLIC-MOBILE
to DOMINICAN REPUBLIC , can I use the same approach re.compile?
Thank you for any help.
Your regex is not correct because you have a open parenthesis before your keywords, since there is no such character in your lines.
As a more general way you can get capture the alphabetical character after sipout or provider:sipout.
>>> s1 = 'provider:sipoutilp1.ym.ms'
>>> s2 = 'provider:sipoutqtm.ym.ms'
>>> RE_PROVIDER = re.compile(r'(?P<provider>(?<=sipout)(ilp|qtm))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
(?<=sipout) is a positive look-behind which will makes the regex engine match the patter which is precede with sipout.
After edit:
If you want to match multiple strings with different structure, you have to use a optional preceding patterns for matching your keywords, and due to this point that you cannot use unfixed length patterns within look-behind you cannot use it for this aim. So instead you can use a capture group trick.
You can define the optional preceding patterns within a none capture group and your keyword within a capture group then after match get the second matched gorup (group(1), group(0) is the whole of your match).
>>> RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>(ilp|qtm|[A-Z\s]+))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
>>> s3 = "182938,DOMINICAN REPUBLIC-MOBILE"
>>> RE_PROVIDER.search(s3).groupdict()
{'provider': 'DOMINICAN REPUBLIC'}
Note that gorupdict doesn't works in this case because it will returns

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get all sequence for this...
for example for re.search(reg, 'ab1') I want to get ('ab','1')
Equivalent result I can get with '^(a|ab|1|2)(a|ab|1|2)$' pattern,
but I don't know how many blocks been matched with (pattern)+
Is this possible, and if yes - how?
try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
print i
The ab option has been moved to be first, so it will match ab in favor of just a.
findall method matches your regular expression more times and returns a list of matched groups. In this simple example you'll get back just a list of strings. Each string for one match. If you had more groups you'll get back a list of tuples each containing strings for each group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^$ signs from the pattern because if findall has to find more than one substring, then it should search anywhere in str. Then I've removed + so it matches single occurences of those options in pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
I think you don't need regexpes for this problem,
you need some recursial graph search function

How to make re.split() inclusive

I have the following:
>>> x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> re.split('_\d{4}',x)[0]
'STARSHIP_TROOPERS_INVASION'
How would I get the year included? For example:
STARSHIP_TROOPERS_INVASION_2012
Note there are tens of thousands of titles, and I need to split on the year for each. I can't do a normal python split() here.
A more straightforward solution would be using re.search()/MatchObject.end():
m = re.search('_\d{4}', x)
print x[:m.end(0)]
If you want to stick with split(), you can use a lookbehind:
re.split('(?<=_\d{4}).', x)
(This work even when the year is at the end of the string, because split() returns an array with the original string in case the delimiter isn't found.)
If its always going to be the same pattern, then why not:
>>> x = 'STARSHIP_TROOPERS_INVASION_2012_LOCDE'
>>> x[:x.rfind('_')]
'STARSHIP_TROOPERS_INVASION_2012'
For your original regular expression, since you aren't capturing the matched group, it is not part of your matches:
>>> re.split('_\d{4}',x)
['STARSHIP_TROOPERS_INVASION', '_LOCDE']
>>> re.split('_(\d{4})',x)
['STARSHIP_TROOPERS_INVASION', '2012', '_LOCDE']
The () marks the selection as a captured group:
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use ( or ), or enclose them inside a
character class: [(] [)].
you may use both split() and search() supposing you have a single such date in your string you wish to split at.
import re
x='STARSHIP_TROOPERS_INVASION_2012_LOCDE'
date=re.search('_\d{4}',x).group(0)
print(date)
gives
>>>
_2012
and
print(re.split('_\d{4}',x)[0]+date)
gives
STARSHIP_TROOPERS_INVASION_2012

Regular expression to return all characters between two special characters

How would I go about using regx to return all characters between two brackets.
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
str = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$",str)
print match.group(1)
If you're new to REG(gular) EX(pressions) you learn about them at Python Docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case, is [. If we didn't do that, [ would do something very weird instead.
- (.*): Parenthesis 'groups' whatever is inside it and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a MatchObject which you can store to a temporary variable. You should then call it's group() method and pass 1 as an argument (to see the 'Group 1' we captured using parenthesis earlier). I should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?>=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repititions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match w/o capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
10x for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 ore came up and one got accepted. :( x<
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']

Categories