RegEx : Capturing words within Quotation mark - python

I have a paragraph of text like this:
John went out for a walk. He met Mrs. Edwards and said, 'Hello Mam how are you doing today?'. She replied 'I'm fine. How are you?'.
I would like to capture the words within the single quotes.
I tried this regex
re.findall(r"(?<=([']\b))((?=(\\?))\2.)*?(?=\1))",string)
(from this question: RegEx: Grabbing values between quotation marks)
It returned only single quotes as the output. I don't know what went wrong. Can someone help me?

Python requires capturing groups to be fully closed before any backreferences (\2) to the group.
You can use Positive Lookbehind (?<=[\s,.]) and Positive Lookahead (?=[\s,.]) zero-length assertions to match words inside single quotes, including words such as I'm, i.e.:
re.findall(r"(?<=[\s,.])'.*?'(?=[\s,.])", string)
Full match 56-92 'Hello Mam how are you doing today?'
Full match 106-130 'I'm fine. How are you?'
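As a minimal sketch, the lookaround pattern above can be run against the sample paragraph from the question. The variable names (text, pattern, quotes, inner) are only illustrative, and the slice that strips the quote characters is an addition, not part of the original answer.

```python
import re

text = ("John went out for a walk. He met Mrs. Edwards and said, "
        "'Hello Mam how are you doing today?'. "
        "She replied 'I'm fine. How are you?'.")

# The opening quote must follow whitespace, a comma or a period;
# the closing quote must be followed by one of the same characters.
# The apostrophe in "I'm" is followed by "m", so it cannot close a match.
pattern = re.compile(r"(?<=[\s,.])'.*?'(?=[\s,.])")

quotes = pattern.findall(text)

# Each match still includes its surrounding quotes; slice them off.
inner = [q[1:-1] for q in quotes]
print(inner)
```

Because the pattern has no capturing groups, re.findall returns the full matches rather than group contents.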

Related

Regex negative lookahead in python [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use either of the following:
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (including or excluding line break symbols, depending on the engine) followed by Thumb.
Tom(?!\s+Thumb) is what you search for.
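To see how the two lookahead variants differ, here is a small Python sketch. The sample lines are made up for illustration and the variable names are hypothetical.

```python
import re

# Hypothetical sample lines, chosen to separate the two patterns.
lines = ["Tom Sawyer", "Tom Thumb", "Tom met Thumb later", "Tomcat"]

# Rejects "Tom" only when whitespace + "Thumb" follows immediately.
directly_followed = re.compile(r"Tom(?!\s+Thumb)")

# Rejects the whole word "Tom" when the whole word "Thumb"
# appears anywhere later on the same line.
anywhere_after = re.compile(r"\bTom\b(?!.*\bThumb\b)")

results = {
    line: (bool(directly_followed.search(line)),
           bool(anywhere_after.search(line)))
    for line in lines
}

for line, flags in results.items():
    print(f"{line!r}: {flags}")
```

"Tom met Thumb later" passes the first pattern but not the second, and "Tomcat" passes the first but fails the second because of the trailing \b.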

Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks. To clean it, I'm using regular expressions:
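import re
def remove_duplicate_characters(text):
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace into one space
    text = re.sub(r"\s*\?+", "?", text)   # drop whitespace before "?" and collapse runs of "?"
    text = re.sub(r"\s*\?+", "?", text)   # second pass: merge the "??" left over from "? ?"
    return text
remove_duplicate_characters(msg)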
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it works, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?
To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
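A minimal sketch of the generic pattern in use follows. The function name collapse_repeated_punctuation and its strip_leading_space flag are hypothetical, and the sample string is assembled from fragments of the text in the question.

```python
import re

def collapse_repeated_punctuation(text, strip_leading_space=False):
    """Replace each run of one repeated punctuation char with a single char."""
    # ([^\w\s]|_) captures one punctuation character (underscore included),
    # and \1+ requires at least one more copy of that same character.
    prefix = r"\s*" if strip_leading_space else ""
    return re.sub(prefix + r"([^\w\s]|_)\1+", r"\1", text)

sample = ("Bobette Riner??????????? Senior Power Markets Analyst?????? "
          "I hope-- http://www.tradersnewspower.com")

cleaned = collapse_repeated_punctuation(sample)
print(cleaned)
```

Note that the "//" in the URL is also collapsed to a single "/", since the pattern treats every repeated punctuation character the same way.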
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left that captures all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) and a :// string (commonly seen in URLs); the rest is the original regex with the backreference adjusted to \2, since there are now two capturing groups.
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
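The exception pattern can be sketched as below. Rather than the \1\2 replacement string, this version swaps in a replacement function, which is an assumption made here so that a group that did not take part in the match is explicitly treated as empty. The names KEEP_OR_COLLAPSE and collapse_with_exceptions are hypothetical.

```python
import re

# Group 1: contexts to keep as-is (an exact "..." or "://").
# Group 2: a punctuation character repeated at least once more.
KEEP_OR_COLLAPSE = re.compile(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+")

def collapse_with_exceptions(text):
    # Only one of the two groups participates in any given match;
    # the other one is None, so fall back to an empty string.
    def put_back(match):
        return (match.group(1) or "") + (match.group(2) or "")
    return KEEP_OR_COLLAPSE.sub(put_back, text)

sample = "Wait... what??? see http://a.com!!"
cleaned = collapse_with_exceptions(sample)
print(cleaned)
```

A run of four or more dots does not satisfy the first alternative, so it falls through to the second one and is reduced to a single dot.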

Python Regex Matching - Splitting on punctuation but ignoring certain words

Suppose I have the following sentence,
Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!
I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.
Ideally, the regex should capture the text in between the parentheses:
Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)
Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!
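You can first throw away all your stop words (like "Dr.") and then all letters and digits.
import re
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
# Remove the stop words as whole alternatives; the dots are escaped so they match literally
tmp = re.sub(r'Dr\.|Prof\.', '', text)
# Remove the remaining letters and digits
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))
Would that work?
It would print the following (runs of spaces are shown collapsed):
, . ' - !!
What remains is the text between the parentheses in your question, plus the apostrophe and the hyphen.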
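Since the question mentions a Python list of words to ignore, the same idea can be sketched with the pattern built from such a list. The list contents and the names stop_words, stop_pattern and leftover_punctuation are assumptions for illustration. The stop words need to be joined as alternatives; a character class such as [Dr.|Prof.] would match any single one of those characters instead of the whole words.

```python
import re

# Hypothetical ignore list; re.escape makes the dots match literally.
stop_words = ["Dr.", "Prof."]
stop_pattern = re.compile("|".join(re.escape(word) for word in stop_words))

def leftover_punctuation(text):
    """Drop the stop words, then drop all letters and digits."""
    without_stops = stop_pattern.sub("", text)
    return re.sub(r"[a-zA-Z0-9]+", "", without_stops)

text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
remaining = leftover_punctuation(text)

# Split on whitespace to list the punctuation pieces that survive.
print(remaining.split())
```

The period after "Who" survives while the one in "Dr." does not. The apostrophe and hyphen are still present and would need a further step to remove.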

Python Regex sub()

agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'
So I'm learning Python and need help with the above regex. Please correct me if I'm wrong, but I think '\1' is for capturing the first word. Two questions:
Why are the parentheses needed?
Why doesn't it work when I change the above lines to:
agentNamesRegex = re.compile(r'Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*')
agentNamesRegex.sub(r'\3****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
I guess I did not understand the concept of (\w) and \1 in the first place. Can you please help with this? I didn't have any specific output in mind, but I was trying different things in Spyder to get to know regex better and understand the above expression.
Why is parenthesis needed
Parentheses are used for capturing a group of characters. The \1 returns the first captured group. In the regular expression r'Agent (\w)\w*', the parentheses around (\w) capture the first word character that follows 'Agent ', which is the first letter of the agent's name. That captured letter is then substituted back into the output in place of the \1 for each matched substring.
Why it doesn't work when I change the above lines to:
agentNamesRegex = re.compile(r'Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*')
That regular expression is looking for the word 'Agent', followed by a space, followed by 8 or more word characters. Nothing in your input string matches that pattern. (Your agent names are all too short.)
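The behaviour described above can be checked with a short sketch. The three-group pattern at the end is a hypothetical variation added here, to show \3 working once the names are long enough for the pattern to match.

```python
import re

sentence = ("Agent Alice told Agent Carol that Agent Eve knew "
            "Agent Bob was a double agent.")

# One capturing group: \1 is the first letter of each agent's name.
first_letter = re.compile(r"Agent (\w)\w*")
masked = first_letter.sub(r"\1****", sentence)

# Eight capturing groups need names of at least eight word characters,
# so nothing in the sentence matches and sub() returns it unchanged.
eight_groups = re.compile(r"Agent (\w)(\w)(\w)(\w)(\w)(\w)(\w)(\w)\w*")
unchanged = eight_groups.sub(r"\3****", sentence)

# Hypothetical variation: three groups match every name here,
# so \3 is the third letter of each name.
three_groups = re.compile(r"Agent (\w)(\w)(\w)\w*")
third_letter = three_groups.sub(r"\3****", sentence)

print(masked)
print(unchanged == sentence)
print(third_letter)
```

The lowercase "agent." at the end of the sentence is never touched, because every pattern starts with a capital "Agent" followed by a space.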
