Python Regex sub() - python

agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'
I'm learning Python and need help with the regex above. Please correct me if I'm wrong, but I think '\1' is for capturing the first word. Two questions:
Why are the parentheses needed?
Why doesn't it work when I change the above lines to:
agentNamesRegex = re.compile(r'Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*')
agentNamesRegex.sub(r'\3****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
I guess I did not understand the concept of (\w) and \1 in the first place. Can you please help with this? I didn't have any specific output in mind; I was trying different things in Spyder to learn regex better and to understand the above expression.

Why are the parentheses needed?
Parentheses are used for capturing a group of characters. The \1 returns the first captured group. In the regular expression r'Agent (\w)\w*', the parentheses around (\w) capture the first word character that follows 'Agent ', which is the first letter of the agent's name. That captured letter is then substituted back into the output in place of the \1 for each matched substring.
Why doesn't it work when I change the above lines to:
agentNamesRegex = re.compile(r'Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*')
That regular expression is looking for the word 'Agent', followed by a space, followed by 8 or more word characters. Nothing in your input string matches that pattern. (Your agent names are all too short.)
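To see both points in action, here is a small sketch that runs the patterns side by side. It reuses the text from the question; the three-group pattern and the long_text sample with the name "Alexander" are made up here purely for illustration.

```python
import re

text = ('Agent Alice told Agent Carol that Agent Eve knew '
        'Agent Bob was a double agent.')

# One group: \1 is the first letter of each name.
one_group = re.compile(r'Agent (\w)\w*')
one_result = one_group.sub(r'\1****', text)
print(one_result)    # A**** told C**** that E**** knew B**** was a double agent.

# Three groups: every name has at least 3 letters, so \3 is the third letter.
three_groups = re.compile(r'Agent (\w)(\w)(\w)\w*')
three_result = three_groups.sub(r'\3****', text)
print(three_result)  # i**** told r**** that e**** knew b**** was a double agent.

# Eight groups: no name in the text is 8+ letters long, so nothing is replaced.
eight_groups = re.compile(r'Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*')
eight_result = eight_groups.sub(r'\3****', text)
print(eight_result == text)  # True

# Hypothetical input with a long enough name: now the pattern matches.
long_text = 'Agent Alexander met Agent Bob.'
long_result = eight_groups.sub(r'\3****', long_text)
print(long_result)   # e**** met Agent Bob.
```

Note that sub() leaves the string untouched when nothing matches; it does not raise an error, which is why the eight-group version looks like it "doesn't work".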

Related

Use CR/LF pair to reject a match using Regex

I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence "and", "but", etc. will be dropped.
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. There seems to be a similar thread, "ignoring newline character in regex match", but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to an hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.
Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, start with repeated alphabetical characters, then lookahead for a space and more alphabetical characters, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in a Python regex, either pass the flags= argument or define the flags at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
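For what it's worth, here is a minimal sketch of the two ways of setting the m flag, run against the sample text from the question. It assumes the lines have no trailing spaces, and the variable names are just for illustration.

```python
import re

text = """Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD"""

pattern = r'^[A-Z][a-z]+(?= [A-Z][a-z]+$)'

# Option 1: pass the flag as an argument.
with_flag = re.findall(pattern, text, flags=re.M)
print(with_flag)      # ['Cardoza', 'Jerry']

# Option 2: put the inline flag at the very start of the pattern.
with_inline = re.findall(r'(?m)' + pattern, text)
print(with_inline)    # ['Cardoza', 'Jerry']

# Without the flag, ^ and $ only anchor to the whole string: no matches.
without_flag = re.findall(pattern, text)
print(without_flag)   # []

# \s also matches the newline between two lines; a literal space does not.
print(re.search(r'Fred\sCatto', text) is not None)  # True
print(re.search(r'Fred Catto', text) is not None)   # False
```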

RegEx : Capturing words within Quotation mark

I have a paragraph of text like this:
John went out for a walk. He met Mrs. Edwards and said, 'Hello Mam how are you doing today?'. She replied 'I'm fine. How are you?'.
I would like to capture the words within the single quotes.
I tried this regex
re.findall(r"(?<=([']\b))((?=(\\?))\2.)*?(?=\1))",string)
(from this question: RegEx: Grabbing values between quotation marks)
It returned only single quotes as the output. I don't know what went wrong. Can someone help me?
Python requires capturing groups to be fully closed before any backreferences (\2) to the group.
You can use Positive Lookbehind (?<=[\s,.]) and Positive Lookahead (?=[\s,.]) zero-length assertions to match words inside single quotes, including words such as I'm, i.e.:
re.findall(r"(?<=[\s,.])'.*?'(?=[\s,.])", string)
Full match 56-92 'Hello Mam how are you doing today?'
Full match 106-130 'I'm fine. How are you?'
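As a quick check, the sketch below runs that findall call on the paragraph from the question and also prints the match offsets. Stripping the surrounding quotes with a slice is an extra step added here, in case only the inner text is wanted.

```python
import re

string = ("John went out for a walk. He met Mrs. Edwards and said, "
          "'Hello Mam how are you doing today?'. "
          "She replied 'I'm fine. How are you?'.")

pattern = r"(?<=[\s,.])'.*?'(?=[\s,.])"

quoted = re.findall(pattern, string)
print(quoted)
# ["'Hello Mam how are you doing today?'", "'I'm fine. How are you?'"]

# Start/end offsets of each match.
spans = [m.span() for m in re.finditer(pattern, string)]
print(spans)  # [(56, 92), (106, 130)]

# Drop the enclosing quotes if only the inner text is needed.
inner = [q[1:-1] for q in quoted]
print(inner)
# ['Hello Mam how are you doing today?', "I'm fine. How are you?"]
```

The apostrophe in I'm does not end the match because it is followed by a letter, so the lookahead (?=[\s,.]) fails there and the lazy .*? keeps going.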

Numeration of groups in regular expression

I am learning regular expressions in Python but couldn't find out what the numbering in .group() is based on.
Here is my code:
import re
string = 'suzi sabin joe brandon josh'
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(0))
# output : suzi sabin joe brandon josh
print(re.search(r'^.*\b(suzi|sabin|joe|brandon|josh)\b.*$', string).group(1))
# output : josh
I am wondering
Why is there only group(1) and not group(1-5)?
Why was josh classified into group(1)?
I am thankful for any advice.
When you call group(0), you get the whole matched text, which is the whole string, since your pattern matches from the beginning of the string to the end.
While the regex matches everything, it only captures one name. The number of groups is fixed by the number of capturing parentheses in the pattern, not by the number of alternatives inside them, so this pattern has exactly one group. It is group 1 because group 0 is reserved for the whole match. Because the first .* is greedy (it tries to match as much text as possible), it gobbles up the earlier names, and the captured name is the last one, "josh" (and the last .* matches an empty string). The captured name is what you get when you call group(1).
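Here is a small sketch of that behaviour. The lazy variant with .*? is not something from the question; it is added here only to show that the greediness of the first .* decides which name ends up in group 1.

```python
import re

string = 'suzi sabin joe brandon josh'
names = r'\b(suzi|sabin|joe|brandon|josh)\b'

# Greedy .* grabs as much as it can, so the group captures the last name.
greedy = re.search(r'^.*' + names + r'.*$', string)
print(greedy.group(1))   # josh

# Lazy .*? grabs as little as it can, so the group captures the first name.
lazy = re.search(r'^.*?' + names + r'.*$', string)
print(lazy.group(1))     # suzi

# One pair of capturing parentheses means exactly one group either way.
print(greedy.groups())                        # ('josh',)
print(re.compile(r'^.*' + names + r'.*$').groups)  # 1
```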
If you want to separately capture each name, you'll need to do things differently. Probably something like this would work:
print(re.findall(r'\b(suzi|sabin|joe|brandon|josh)\b', string))
This will print the list ['suzi', 'sabin', 'joe', 'brandon', 'josh']. Each name will appear in the output in the same order it appears in the input string, which need not be the same order they were in the pattern. This might not do exactly what you want though, since it will skip over any text that isn't one of the names you're looking for (rather than failing to match anything).

Combine three regular expressions

Is there a way to combine the following three expressions into one regex?
name = re.sub(r'\s?\(\w+\)', '',name) # John Smith (ii) --> John Smith
name = re.sub(r'\s?(Jr.|Sr.)$','', name, flags=re.I) # John Jr. --> John
name = re.sub(r'".+"\s?', '', name) # Dwayne "The Rock" Johnson --> Dwayne Johnson
You can just use grouping and pipe:
re.sub(r'(\s?\(\w+\))|(\s?(Jr.|Sr.)$)|(".+"\s?)', '', name, flags=re.I)
If you want an efficient pattern (and one that works most of the time), simply separating your patterns with a pipe is a bad idea. You should reconsider what you want the pattern to do and rewrite it from the beginning.
p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)
text = p.sub('', text).rstrip()
This is a good opportunity to be critical about what you have previously written:
Starting a pattern with an optional character such as \s? is slow, because each position in the string must be tested with and without this character. It is better to catch the optional whitespace at the end and to trim the string afterwards. (In all cases you need to trim the result, even if you decide to catch the optional whitespace at the beginning.)
The pattern to find quoted parts is incorrect and inefficient (when it works), because you use a dot with a greedy quantifier: if there are two quoted parts on the same line (note that the dot doesn't match newlines), all the content between them will be matched too. It's better to use a negated character class that doesn't contain the quote: "[^"]*" (note: this can be improved to deal with escaped quotes inside the quotes).
The pattern for Jr. and Sr. is incorrect too: to match a literal . you need to escape it. Aside from that, the pattern is too imprecise because it doesn't check whether there are other word characters before it. It will match, for example, a sentence that ends with "USSR." or any string that ends with "jr." or "sr.". (To be fully rigorous, you must check that there is whitespace or the start of the string before it, but a simple word boundary should suffice most of the time.)
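To make the last two points concrete, here is a rough sketch comparing the patterns from the question with stricter variants. The sample strings and the stricter Jr./Sr. pattern are made up for illustration; they are one possible fix, not the only one.

```python
import re

# Two quoted parts on one line: the greedy dot swallows everything between.
two_quotes = 'Dwayne "The Rock" Johnson aka "DJ" here'
greedy_result = re.sub(r'".+"\s?', '', two_quotes)
negated_result = re.sub(r'"[^"]*"\s?', '', two_quotes)
print(greedy_result)    # Dwayne here
print(negated_result)   # Dwayne Johnson aka here

# No word boundary before the suffix: the end of "USSR." is removed.
sentence = 'Report on the USSR.'
loose = r'\s?(Jr.|Sr.)$'
strict = r'\s?\b(?:Jr|Sr)\.$'   # hypothetical stricter variant
loose_result = re.sub(loose, '', sentence, flags=re.I)
strict_result = re.sub(strict, '', sentence, flags=re.I)
print(loose_result)     # Report on the US
print(strict_result)    # Report on the USSR.

# The stricter variant still handles the intended case.
junior_result = re.sub(strict, '', 'John Jr.', flags=re.I)
print(junior_result)    # John
```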
Now how to build your alternation:
The order can be important, in particular if the subpatterns are not mutually exclusive. For example, if you have the subpatterns a+b and a+ and you write a+|a+b, a b preceded by an a will never be matched, because the first branch always succeeds first. Your example doesn't have this kind of problem.
As an aside, if you know that one of the branches is more likely to succeed, put it first in the alternation.
You know the searched substring starts with one of these characters: ", (, j, s. In that case, why not begin the pattern with ["(js]? That avoids testing each branch of the pattern at every position in the string.
Then, since the first character is already consumed, you only need to check with a lookbehind which of these characters has been matched for each branch.
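The point about branch order is easy to check with the a+ and a+b subpatterns mentioned above. A minimal sketch (the test string is arbitrary):

```python
import re

sample = 'aab ab'

# a+ comes first and always succeeds, so the a+b branch is never reached.
short_first = re.findall(r'a+|a+b', sample)
print(short_first)   # ['aa', 'a']

# With the longer branch first, the trailing b is included.
long_first = re.findall(r'a+b|a+', sample)
print(long_first)    # ['aab', 'ab']
```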
With these small improvements you obtain a much faster pattern.
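For reference, here is a short sketch that runs the rewritten pattern on the three examples from the question, plus a made-up "USSR." sentence to check the word boundary. The clean() helper name is just for illustration.

```python
import re

p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)

def clean(name):
    # Remove the unwanted parts, then trim the whitespace left at the end.
    return p.sub('', name).rstrip()

print(clean('John Smith (ii)'))            # John Smith
print(clean('John Jr.'))                   # John
print(clean('Dwayne "The Rock" Johnson'))  # Dwayne Johnson
print(clean('Report on the USSR.'))        # Report on the USSR.
```

Unlike the original second expression, this pattern does not anchor Jr./Sr. to the end of the string, so a suffix in the middle of a name would be removed as well.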

Python Regex to capture #[123456](John Smith)

I'm trying to capture the id and name in a pattern like this #[123456](John Smith) and use those to create a string like <a href="123456">John Smith</a>.
Here's what I tried, but it's not working.
def format(text):
    def idrepl(match):
        fbid = match.group(1)
        name = match.group(2)
        print(fbid, name)
        return '<a href="{}">{}</a>'.format(fbid, name)
    return re.sub(r'\#\[(\d+)\]\[(\w\s+)\]', idrepl, text)
The part
(\w\s+)
matches exactly one word character followed by 1+ whitespace characters.
Clearly that's not what you want, and it's easy to fix:
([\w\s]+)
"one or more characters each of which is a word or whitespace char".
Whether that's actually what you want, I'm not sure -- it will happily match John Smith, but not e.g. Maureen O'Hara (that apostrophe will impede the match) or John V. Smith (here, it's the dot that will impede the match) or John Smith-Passell (here, it's the dash).
In general, people spell their names with potentially several punctuation characters (as well as word-characters and whitespace) -- apostrophes, dots, dashes, and more. If you don't need to account for this, then, fine!-) If you do, life gets a bit harder (sticking those chars within the square brackets above will mostly do, but precautions are needed -- e.g. a dash, if you need it to be part of the bracketed char set, is safest at the end, just before the close bracket, or escaped with a backslash).
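Putting this together, here is one possible sketch. Two things in it are assumptions rather than something stated above: the name is taken to be wrapped in parentheses, as in the sample #[123456](John Smith), so the pattern uses \( and \) instead of the second pair of square brackets; and the character class is extended with a dot, an apostrophe and a dash. The function name format_mentions is made up to avoid shadowing the built-in format.

```python
import re

# Assumed input shape: #[digits](name), as in the sample from the question.
MENTION = re.compile(r"#\[(\d+)\]\(([\w\s.'-]+)\)")

def format_mentions(text):
    def idrepl(match):
        fbid = match.group(1)
        name = match.group(2)
        return '<a href="{}">{}</a>'.format(fbid, name)
    return MENTION.sub(idrepl, text)

print(format_mentions('#[123456](John Smith)'))
# <a href="123456">John Smith</a>
print(format_mentions("Hi #[7](Maureen O'Hara) and #[8](John Smith-Passell)"))
# Hi <a href="7">Maureen O'Hara</a> and <a href="8">John Smith-Passell</a>
```

If the names come from user input and end up in a real page, they should also be HTML-escaped before being interpolated; that is left out here to keep the sketch short.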
