Numeration of groups in regular expression - python

I am learning regular expressions in Python but couldn't figure out what the numbering in .group() is based on.
Here is my code:
import re
string = 'suzi sabin joe brandon josh'
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(0))
# output : suzi sabin joe brandon josh
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(1))
# output : josh
I am wondering
Why is there only group(1) and not group(1-5)?
Why was josh classified into group(1)?
I am thankful for any advice.

When you call group(0), you get the whole matched text, which is the whole string, since your pattern matches from the beginning of the string to the end.
While the regex matches everything, it only captures one name, because the pattern contains only one pair of capturing parentheses. That capture is group 1: group 0 is reserved for the whole match, and capturing groups are numbered from 1 in the order of their opening parentheses. Because the first .* is greedy (it tries to match as much text as possible), it gobbles up the earlier names, and the captured name is the last one, "josh" (and the last .* matches an empty string). The captured name is what you get when you call group(1).
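A minimal sketch of the greedy behaviour described above, using the string and pattern from the question. The lazy variant (.*?) is not part of the original answer; it is added here only for contrast.

```python
import re

string = 'suzi sabin joe brandon josh'
names = r'\b(suzi|sabin|joe|brandon|josh)\b'

# Greedy leading .* swallows as much as it can, so the single capture
# group ends up holding the last name in the string.
greedy = re.search(r'^.*' + names + r'.*$', string)

# Lazy leading .*? gives up as early as it can, so the same group
# holds the first name instead.
lazy = re.search(r'^.*?' + names + r'.*$', string)

print(greedy.group(0))   # suzi sabin joe brandon josh
print(greedy.group(1))   # josh
print(lazy.group(1))     # suzi
print(greedy.re.groups)  # 1 (one pair of capturing parentheses, one group)
```

Either way there is exactly one numbered group, because the pattern has exactly one pair of capturing parentheses.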
If you want to separately capture each name, you'll need to do things differently. Probably something like this would work:
print(re.findall(r'\b(suzi|sabin|joe|brandon|josh)\b', string))
This will print the list ['suzi', 'sabin', 'joe', 'brandon', 'josh']. Each name will appear in the output in the same order it appears in the input string, which need not be the same order they were in the pattern. This might not do exactly what you want though, since it will skip over any text that isn't one of the names you're looking for (rather than failing to match anything).
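To illustrate the caveat in the last sentence, here is a small sketch. The input 'suzi met bob and josh' and the stricter fullmatch pattern are assumptions made for this example, not something from the question.

```python
import re

names = ['suzi', 'sabin', 'joe', 'brandon', 'josh']
alternation = '|'.join(map(re.escape, names))

mixed = 'suzi met bob and josh'

# findall silently skips the words that are not in the list.
found = re.findall(r'\b(?:' + alternation + r')\b', mixed)
print(found)  # ['suzi', 'josh']

# A stricter (hypothetical) check: the whole string must consist of
# listed names separated by single spaces, otherwise there is no match.
strict = re.compile(r'(?:%s)(?: (?:%s))*' % (alternation, alternation))
rejects_mixed = strict.fullmatch(mixed) is None
accepts_clean = strict.fullmatch('suzi sabin joe brandon josh') is not None
print(rejects_mixed, accepts_clean)  # True True
```

If unknown words should make the whole input invalid, a check like the second one can be run before calling findall.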

Related

Split a string and capture all instances in python regex

Newbie here. I have been trying to learn regex for some time, but sometimes I feel I don't understand how regex handles strings: in the planning phase I seem to work it out, but the implementation doesn't behave as I expect.
Here is my little problem: I have strings that contain one or more names (team names). The problem is that if a string contains more than one name, there is no separator. All names are joined directly.
Some examples :
String                      Contains   Names to be extracted
'RangersIslandersDevils'    3 names    [Rangers, Islanders, Devils]
'49ersRaiders'              2 names    [49ers, Raiders]
'Avalanche'                 1 name     [Avalanche]
'Red Wings'                 1 name     [Red Wings]
I want to capture each name in each string and use them in a loop later on. But I can't seem to implement the pattern I imagine for it.
The pattern implementation in my head for the strings are like this:
1. Start scanning the text, which is expected to start with a capital letter or number.
2. If you see a literal 's' followed by a capital letter (like ...s[A-Z]..), capture the text up to and including the 's'.
3. Repeat step two until you no longer see the (....s[A-Z]..) pattern, and capture the rest of the string as the last name.
4. Optionally, write all names to a list.
Well I tried in vain some code in which the step two captures only one instance and step 3 normally gives another.
re.findall('([A-Z0-9].*s)*([A-Z].*)+', 'RangersIslandersMolsDevil')
That returns only two names:
[('RangersIslandersMols', 'Devil')]
whereas I want four:
[Rangers, Islanders, Mols, Devil]
([A-Z0-9].*s)* will capture as many of any character as it can, so that's causing 'RangersIslandersMols' to get stuck together as one match.
It sounds like the boundary between team names is defined as a lowercase letter (not necessarily an 's', as in 'Avalanche') followed immediately by an uppercase letter or number, so our pattern should look for:
uppercase letters or numbers, followed by
lowercase letters
Because a team name can have multiple words, we'll also look for a space followed by the same pattern as above, for any possible number of words.
Try this pattern:
>>> import re
>>> pattern = r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*'
>>> re.findall(pattern, "RangersIslandersDevils49ersWashington Football TeamAvalancheWarriors")
['Rangers', 'Islanders', 'Devils', '49ers', 'Washington Football Team', 'Avalanche', 'Warriors']
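As a quick check, the same pattern can be run against the strings from the question. This is only a sketch; the names TEAM, samples, and results are made up for the example.

```python
import re

# Pattern from the answer: capitals/digits, then lowercase letters,
# optionally followed by more words separated by single spaces.
TEAM = re.compile(r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*')

samples = ['RangersIslandersDevils', '49ersRaiders', 'Avalanche',
           'Red Wings', 'RangersIslandersMolsDevil']

results = {s: TEAM.findall(s) for s in samples}
for s, names in results.items():
    print(s, '->', names)
```

Note that this relies on every word being capitals or digits followed by lowercase letters. A name with an internal capital (for example 'McDonalds') would be split in two.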

Why does Regex finditer only return the first result

My string is a transcript, I want to capture the speaker, specifically their second name (Which needs to only match when fully capitalised)
Additionally, I want to match their speech until the next speaker begins, I want to loop this process over a huge text file eventually.
The problem is the match only returns one match object, even though there are two different speakers. Also I have tried online regex tester with the python flavor however, they return very different results (not sure why?).
import re
text = 'Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) '
pattern = re.compile(r"(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator)")
for match in re.finditer(pattern, text):
    print(match)
I want 2 match objects, both having a group for their surname and their speech. It's also important to note that I have used regex debuggers online; however, the Python flavor there gives different results from Python in my terminal.
Just change the regex to:
(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator|$)
demo: https://regex101.com/r/gJDaWM/1/
With your current regex, you are enforcing the condition that each match must be followed by Senator through the positive lookahead.
You might actually have to change the positive lookahead into:
(?=Senator|Mr|Dr|$)
if you want to take into account Mr and Dr on top of Senator.
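Here is a minimal sketch of the fix applied to the sample text from the question. Several things here are assumptions made for this example and are not in the original pattern: the named groups, the lazy .*? for the speech, and writing the leading group as (?:...), since the (:? in the question looks like a typo for it.

```python
import re

text = ('Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) '
        'Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) ')

pattern = re.compile(
    r"(?:Senator|Mr|Dr)\s+(?P<surname>[A-Z]{2,})\s*"   # title and capitalised surname
    r"\(.+?\)\s+\(\d{2}:\d{2}\):"                      # (state) and (hh:mm):
    r"(?P<speech>.*?)"                                 # speech, as short as possible
    r"(?=Senator|Mr|Dr|$)"                             # stop before next speaker or at the end
)

matches = list(pattern.finditer(text))
surnames = [m.group("surname") for m in matches]
speeches = [m.group("speech").strip() for m in matches]

print(surnames)  # ['BACK', 'DAY']
print(speeches)
```

Because . does not match a newline by default, a speech that spans several lines would need re.DOTALL, and a lookahead on bare 'Mr' or 'Dr' can stop early if those letters occur inside the speech itself.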

Capturing groups and greediness in Python

Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print(m.group()) # abcdefghi
print(m.groups()) # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print(m.group()) # abc
print(m.groups()) # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print(re.findall(r"(\w{3})+", S)) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print(re.findall(r"(\w{3})+?", S)) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.
You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
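The rules above can be seen in a single match with a short sketch. It uses the string from the question and the two pattern shapes mentioned in this answer (repeat at the end, and repeat followed by \w).

```python
import re

S = "abcdefghi"

# Repeat operator at the end of the pattern: nothing follows it, so
# greedy takes everything and lazy takes the minimum (one repetition).
end_greedy = re.search(r"(\w{3})+", S)
end_lazy = re.search(r"(\w{3})+?", S)
print(end_greedy.group())  # abcdefghi
print(end_lazy.group())    # abc

# Repeat operator followed by \w: the rest of the regex must still match,
# so the greedy version has to give back its last repetition.
greedy = re.search(r"(\w{3})+\w", S)
lazy = re.search(r"(\w{3})+?\w", S)
print(greedy.group(), greedy.groups())  # abcdefg ('def',)
print(lazy.group(), lazy.groups())      # abcd ('abc',)
```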
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+", you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. As with the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the (\w{3}) block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but you can with the third-party regex module from PyPI: its match object has a captures method. With re, you can instead match the chunks directly with re.findall(r'\w{3}', S). In that case, though, you will match every three-word-character chunk in the string, not just consecutive ones. With the regex module, you can get all the consecutive three-character chunks from the beginning of the string with the help of the \G anchor: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.
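A small sketch of how re.findall's return value depends on capturing groups, using only the standard re module. The non-capturing and finditer variants are added here for comparison and are not from the original answers.

```python
import re

S = "abcdefghi"

# With a capturing group, findall returns the group's final value per match.
captured = re.findall(r"(\w{3})+", S)
print(captured)  # ['ghi']

# With a non-capturing group, findall returns the whole match instead.
whole = re.findall(r"(?:\w{3})+", S)
print(whole)  # ['abcdefghi']

# finditer exposes both the whole match and the group for each match.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"(\w{3})+", S)]
print(pairs)  # [('abcdefghi', 'ghi')]

# Matching one chunk at a time collects every three-character piece.
chunks = re.findall(r"\w{3}", S)
print(chunks)  # ['abc', 'def', 'ghi']
```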

Matching an object and a specific regex with Python

Given a text, I need to check for each char whether it has exactly (edited) 3 capital letters on both sides and, if so, add it to a string of such characters that is returned.
I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text)
(let's say text="AAAbAAAcAAA")
I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"
Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet when invoking m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?
Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?
You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:
([A-Z]{3}).[A-Z]{3}
The exception is m.group(0), which will always contain the entire match.
Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like
([0-9]{3})-([0-9]{3}-[0-9]{4})
then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
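A quick sketch of that numbering. The phone number and the surrounding sentence are made-up sample data.

```python
import re

m = re.search(r'([0-9]{3})-([0-9]{3}-[0-9]{4})', 'Call 555-867-5309 today')

print(m.group(0))  # 555-867-5309  (the entire match)
print(m.group(1))  # 555           (area code)
print(m.group(2))  # 867-5309      (local part)
print(m.groups())  # ('555', '867-5309')
```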
What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.
One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
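The points from both answers can be checked on the sample text from the question. This is only a sketch: the final pattern, which wraps everything in a lookahead to get overlapping substrings, is a common technique added here as an illustration rather than something quoted from the answers.

```python
import re

text = "AAAbAAAcAAA"

m = re.match(r"[A-Z]{3}.[A-Z]{3}", text)
print(m.group(0))  # AAAbAAA
print(m.groups())  # () because the pattern has no parentheses

# The lookahead leaves the trailing capitals unconsumed, so they can also
# serve as the leading capitals of the next match.
chars = re.findall(r"[A-Z]{3}([a-z])(?=[A-Z]{3})", text)
print("".join(chars))  # bc

# Wrapping the whole pattern in a lookahead yields the overlapping substrings.
overlapping = re.findall(r"(?=([A-Z]{3}.[A-Z]{3}))", text)
print(overlapping)  # ['AAAbAAA', 'AAAcAAA']
```

None of these patterns enforces "exactly" three capitals: a character with four capitals on one side would still match. Enforcing that would need extra lookbehind and lookahead conditions.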

How do I use regular expressions in Python with placeholder text?

I am doing a project in Python where I require a user to input text. If the text matches a format supported by the program, it will output a response that includes a user's key word (it is a simple chat bot). The format is stored in a text file as a user input format and an answer format.
For example, the text file looks like this, with user input on the left and output on the right:
my name is <-name> | Hi there, <-name>
So if the user writes my name is johnny, I want the program to know that johnny is the <-name> variable, and then to print the response Hi there, johnny.
Some prodding me in the right direction would be great! I have never used regular expressions before and I read an article on how to use them, but unfortunately it didn't really help me since it mainly went over how to match specific words.
Here's an example:
import re
io = [
    (r'my name is (?P<name>\w+)', 'Hi there, {name}'),
]
string = input('> ')
for regex, output in io:
    match = re.match(regex, string)
    if match:
        print(output.format(**match.groupdict()))
        break
I'll take you through it:
'my name is (?P<name>\w+)'
(?P<name>...) stores the following part (\w+) under the name name in the match object which we're going to use later on.
match = re.match(regex, string)
This looks for the regex in the input given. Note that re.match only matches at the beginning of the input; if you don't want that restriction, use re.search here instead.
If it matches:
output.format(**match.groupdict())
match.groupdict returns a dictionary of keys defined by (?P<name>...) and their associated matched values. ** passes those key/values to .format, in this case Python will translate it to output.format(name='matchedname').
To construct the io list from a file, do something like this:
io = []
with open('input.txt') as file_:
    for line in file_:
        # drop the trailing newline, then split on the last ' | '
        key, value = line.rstrip('\n').rsplit(' | ', 1)
        io.append((key, value))
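The file in the question stores placeholders as <-name> rather than as regex syntax, so the lines need converting before they can be used as above. The following is only a sketch under stated assumptions: the helper names compile_rule and respond are hypothetical, every placeholder is assumed to stand for a single word (\w+), and the reply text is assumed to contain no literal braces.

```python
import re

PLACEHOLDER = re.compile(r'<-(\w+)>')

def compile_rule(line):
    """Turn 'input format | answer format' into (compiled regex, template)."""
    user_part, reply_part = line.rstrip('\n').rsplit(' | ', 1)
    # split() with one capturing group alternates literal text and names,
    # e.g. ['my name is ', 'name', ''].
    pieces = PLACEHOLDER.split(user_part)
    regex = ''.join(
        re.escape(piece) if i % 2 == 0 else r'(?P<%s>\w+)' % piece
        for i, piece in enumerate(pieces)
    )
    # '<-name>' in the reply becomes '{name}' for str.format.
    template = PLACEHOLDER.sub(r'{\1}', reply_part)
    return re.compile(regex), template

def respond(text, rules):
    """Return the reply for the first rule matching the whole input, else None."""
    for regex, template in rules:
        match = regex.fullmatch(text)
        if match:
            return template.format(**match.groupdict())
    return None

rules = [compile_rule('my name is <-name> | Hi there, <-name>\n')]
print(respond('my name is johnny', rules))  # Hi there, johnny
print(respond('hello', rules))              # None
```

Using the same placeholder twice in one input format would produce a duplicate group name, which re rejects, so a real implementation would have to handle that case.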
You are going to want to do a group match and then pull out the search groups.
First you would want to import re - re is the python regex module.
Let's say that user_input is the variable holding the input string.
You then want to use the re.sub method to match your string and substitute something else for it.
output = re.sub(input_regex, output_regex, user_input)
So the regex, first you can put the absolute stuff you want:
input_regex = 'my name is '
If you want it to match explicitly from the start of the line, you should precede it with the caret:
input_regex = '^my name is '
You then want a group to match any string .+ (. is anything, + is 1 or more of the preceding item) until the end of the line '$'.
input_regex = '^my name is .+$'
Now you'll want to put that into a named group. Named groups take the form "(?P<name>regex)" - note that those angle brackets are literal.
input_regex = '^my name is (?P<name>.+)$'
You now have a regex that will match and give a match group named "name" with the user's name in it. The output string will need to reference the match group with "\g<name>":
output_regex = r'Hi there, \g<name>'
Putting it all together you can do it in a one liner (and the import):
import re
output = re.sub(r'^my name is (?P<name>.+)$', r'Hi there, \g<name>', user_input)
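A runnable sketch of the one-liner, with two sample inputs that are made up for this example. It also shows one thing to watch for: when the pattern does not match, re.sub returns the input unchanged rather than signalling a failure.

```python
import re

pattern = r'^my name is (?P<name>.+)$'
reply = r'Hi there, \g<name>'

greeting = re.sub(pattern, reply, 'my name is johnny')
print(greeting)  # Hi there, johnny

# When nothing matches, re.sub returns the input unchanged.
unchanged = re.sub(pattern, reply, 'hello')
print(unchanged)  # hello
```

For a chat bot, it may therefore be safer to test for a match first (for example with re.match) and only then build the reply.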
Asking for REGEXP inevitably leads to answers like the ones you're getting right now: demonstrations of basic REGEXP operations: how to split sentences, search for some term combination like 'my' + 'name' + 'is' within, etc.
In fact, you could learn all this from reading existing documentation and open source programs. REGEXP is not exactly easy. Still, you'll need to understand a bit on your own to really know what's going on if you want to change and extend your program. Don't just copy from the recipes here.
But you may even want something more comprehensive. Because you mentioned building a "chat bot", you may want to see how others are approaching that task, way beyond REGEXP. See:
So if the user writes 'my name is johnny', I want the program to know that 'johnny' is the '<-name>' variable, ...
From your question it's unclear how complex this program should become. What if the user types
'Johnny is my name.'
or
'Hey, my name is John X., but call me johnny.'
?
Take a look at the re module and pay attention to capturing groups.
For example, you can assume that name will be a word, so it matches \w+. Then you have to construct a regular expression with \w+ capturing group where the name should be (capturing groups are delimited by parentheses):
r'my name is (\w+)'
and then match it against the input (hint: look for match in the re module docs).
Once you get the match, you have to get the contents of capturing group (in this case at index 1, index 0 is reserved for the whole match) and use it to construct your response.
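A minimal sketch of those steps. The second input is one of the harder sentences quoted earlier in this thread, and the variable names are made up for the example.

```python
import re

pattern = re.compile(r'my name is (\w+)')

m = pattern.match('my name is johnny')
print(m.group(0))  # my name is johnny  (index 0: the whole match)
print(m.group(1))  # johnny             (index 1: the capturing group)
reply = 'Hi there, ' + m.group(1)
print(reply)       # Hi there, johnny

# match() only looks at the start of the input; search() scans the whole string.
sentence = 'Hey, my name is John X., but call me johnny.'
at_start = pattern.match(sentence)
anywhere = pattern.search(sentence)
print(at_start)           # None
print(anywhere.group(1))  # John
```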
