Create Python regex for specific sentence pattern - python

I'm trying to build a regex pattern that can capture the following examples:
pattern1 = '.She is greatThis is annoyingWhy u do this'
pattern2 = '.Weirdly specificThis sentence is longer than the other oneSee this is great'
example = 'He went such dare good mr fact. The small own seven saved man age no offer. Suspicion did mrs nor furniture smallness. Scale whole downs often leave not eat. An expression reasonably cultivated indulgence mr he surrounded instrument. Gentleman eat and consisted are pronounce distrusts.This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
example_pattern_goal = 'This is where the fun startsSummer is really bothersome this yearShe is out of ideas'
Essentially, it's always a dot followed by sentences of various lengths that contain no numbers. I only want to capture these specific sentences, so I tried to capture instances where a dot is immediately followed by a word that starts with an uppercase letter, and where two of the following words have an uppercase letter inside them.
So far, I've only come up with the following regex that doesn't quite work:
'.\b[A-Z]\w+[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+\b\w+[A-Z]\w+\b[\s\w]+'

You can use
\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)
See the regex demo.
Details:
\. - a dot
[A-Z][a-z]* - an ASCII word starting with an uppercase letter
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word
\s+ - one or more whitespaces
[a-zA-Z]+[A-Z][a-z]+ - an ASCII word with an uppercase letter inside it
(?:\s+[A-Za-z]+)* - zero or more sequences of one or more whitespaces and then an ASCII word.
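As a minimal sketch of how the pattern above could be applied with Python's re module (the helper name extract_fused_sentences is only an illustration, not part of the original answer), group 1 holds the text that follows the dot:

```python
import re

# Pattern from the answer above; group 1 is everything after the dot.
PATTERN = re.compile(
    r"\.([A-Z][a-z]*(?:\s+[A-Za-z]+)*\s+[a-zA-Z]+[A-Z][a-z]+(?:\s+[A-Za-z]+)*)"
)


def extract_fused_sentences(text):
    """Return the run of fused sentences that follows a dot, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


pattern1 = ".She is greatThis is annoyingWhy u do this"
result1 = extract_fused_sentences(pattern1)
print(result1)
```

A dot followed by a space (an ordinary sentence end) is skipped, because the pattern requires an uppercase letter immediately after the dot.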

Related

How do I remove the substrings started with capital letters in a Python string?

I have this string which is a mix between a title and a regular sentence (there is no separator separating the two).
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
The title actually ends at the word Vaccines, the Before the pandemic is another sentence completely separate from the title.
How do I remove the substring up to and including the word Vaccines? My idea was to remove everything from "Read more:" through the run of capitalized words that follows, stopping before the last capitalized word (Before). But I don't know what to do when that run contains a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.
I know there is a function title() to convert a string into a title format in Python, but is there any function that can detect if a substring is a title?
I have tried the following using regular expression.
import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res
But it just removed the capital letters themselves (and the whitespace around them) instead.
You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.
^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])
See the regex demo.
Details:
^ - start of string
(?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
\s* - zero or more whitespaces
(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars or one of the words that can stay non-capitalized in an English title
\s+ - one or more whitespaces
(?=[A-Z]) - followed with an uppercase letter.
NOTE: You mentioned your language is not English, so you need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of.
You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you use the PyPI regex library then, as Python's built-in re does not support Unicode category classes.
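A short sketch of how the title pattern above might be used with the built-in re module to split the title from the sentence that follows. The names TITLE_PREFIX, title and body are illustrative only:

```python
import re

# Pattern copied from the answer above, split over several lines for readability.
TITLE_PREFIX = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|from|of)\s+)*"
    r"(?=[A-Z])"
)

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

# The matched prefix is the title; what remains is the regular sentence.
title = TITLE_PREFIX.match(text).group().strip()
body = TITLE_PREFIX.sub("", text)
print(title)
print(body)
```

The lookahead (?=[A-Z]) is what makes the engine backtrack past "Before the": the title has to stop right before a capitalized word, so the last capitalized word of the run is left in the sentence.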
Why don't you just use slicing?
title = text[:44]
print(title)
Read more: Indonesia to Get Moderna Vaccines

Python Regex Matching - Splitting on punctuation but ignoring certain words

Suppose I have the following sentence,
Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!
I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.
Ideally, the regex should capture the text in between the parentheses:
Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)
Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!
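You can first throw away all your stop words (like "Dr.") and then all letters and digits.
import re
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
# Remove the stop words as whole tokens; a character class such as [Dr.|Prof.] would delete single characters instead
tmp = re.sub(r'\b(?:Dr|Prof)\.', '', text)
# Remove every run of letters and digits, leaving punctuation and whitespace
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))
Would that work?
It would print:
,     . '    -   !!
What is left are the characters between the parentheses in your question, plus the apostrophe and the hyphen, which would still need to be filtered out.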
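Another possible approach, sketched here under the assumption that the ignore list looks like ["Dr.", "Prof."] (a hypothetical list, since the question does not show it): treat the ignored words and ordinary words (including apostrophes and hyphens) as tokens, and split on them so that only the separators remain.

```python
import re

ignore_words = ["Dr.", "Prof."]  # hypothetical ignore list
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"

# Ignored words come first in the alternation so that "Dr." wins over plain "Dr".
token = "|".join(re.escape(word) for word in ignore_words) + r"|[\w'-]+"

# Splitting on tokens leaves the separators; drop the empty leading piece.
separators = [piece for piece in re.split(token, text) if piece]
print(separators)
```

This keeps each separator as its own list item, which matches the parenthesized groups in the question more closely than deleting characters does.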

Ignore words containing substring using regular expressions

I am a beginner and have spent a considerable amount of time on this. I was partially able to solve it.
Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded because the doesn't occur inside the word as a contiguous substring.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[t|T]he - the and The
)
\w+ - Other letters are fine
(?<! - Negative look behind
[t|T]he - the or The shouldn't occur before \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
As one can see in regex101, I am able to exclude most of the words except words like atheist--i.e. cases when the or The appear inside words. I searched for this on SO and found some threads such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem only using regex. I am not looking for solutions using python's string manipulation.
The approach is simpler than your original regular expression:
\b(?!\w*[tT]he)\w+\b
We match a word, but make sure that there is no the within the word using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word, as it allowed for no padding after/before the word boundary. Note also that [t|T] is a character class that matches t, T or a literal |, so [tT] is what you want.
(?![tT]he) only checks at the current position, while (?!\w*[tT]he) allows the check to extend from the current position, because the \w* can be used as filler.
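A quick check of the padded lookahead against the input from the question (a small sketch using re.findall; the variable names are illustrative):

```python
import re

words = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# \w* inside the lookahead lets the check scan the whole word for "the"/"The".
kept = re.findall(r"\b(?!\w*[tT]he)\w+\b", words)
print(kept)
```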

Use CR/LF pair to reject a match using Regex

I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. It seems there is similar thread ignoring newline character in regex match, but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to 1 hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.
Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, match a capitalized word at the start of a line, then look ahead for a space and another capitalized word, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in Python regex, either pass the flags= argument (e.g. flags=re.M), or define the flags inline at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
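A small sketch that runs the lookahead pattern with the m flag over the test string from the question, assuming the lines are separated by plain \n (with \r\n line endings, $ would not match before the \r). Both ways of setting the flag are shown:

```python
import re

names = "\n".join([
    "Cardoza Fred", "Catto, Philipa", "Duncan, Jean", "Jerry Smith",
    "and", "but", "and", "Andrew", "Red", "Abcd", "DDDD",
])

pattern = r"^[A-Z][a-z]+(?= [A-Z][a-z]+$)"

# Flag passed as an argument.
with_flag_argument = re.findall(pattern, names, flags=re.MULTILINE)
# Same flag set inline at the start of the pattern.
with_inline_flag = re.findall("(?m)" + pattern, names)
# Without the flag, ^ and $ only anchor to the whole string.
without_flag = re.findall(pattern, names)

print(with_flag_argument)
print(without_flag)
```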

Python RE, is \b ever useful to indicate end of a word

I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end? I'm asking because it seems that it's always necessary to have \s to indicate the end of the word, therefore eliminating the need to have \b. Like in the case below, one with a '\b' to end the inner group, the other without, and they get the same result.
import re
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
\s is just whitespace. You can have word boundaries that aren't whitespace (punctuation, etc.) which is when you need to use \b. If you're only matching words that are delimited by whitespace then you can just use \s; and in that case you don't need the \b.
import re
sentence = 'Non-whitespace delimiters: Commas, semicolons; etc.'
print(re.findall(r'(\b\w+)\s+', sentence))
print(re.findall(r'(\b\w+\b)+', sentence))
Produces:
['whitespace']
['Non', 'whitespace', 'delimiters', 'Commas', 'semicolons', 'etc']
Notice how trying to catch word endings with just \s ends up missing most of them.
Consider wanting to match the word "march":
>>> regex = re.compile(r'\bmarch\b')
It can come at the end of the sentence...
>>> regex.search('I love march')
<_sre.SRE_Match object at 0x10568e4a8>
Or the beginning ...
>>> regex.search('march is a great month')
<_sre.SRE_Match object at 0x10568e440>
But if I don't want to match things like marching, word boundaries are the most convenient:
>>> regex.search('my favorite pass-time is marching')
>>>
You might be thinking "But I can get all of these things using r'\s+march\s+'" and you're kind of right... The difference is in what matches. With the \s+, you also might be including some whitespace in the match (since that's what \s+ means). This can make certain things like search for a word and replace it more difficult because you might have to manage keeping the whitespace consistent with what it was before.
It's not because it's at the end of the word, it's because you know what comes after the word. In your example:
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
...the first \b is necessary to prevent a match starting with the in in begin. The second one is redundant because you're explicitly matching the non-word characters (\s+) that follow the word. Word boundaries are for situations where you don't know what the character on the other side will be, or even if there will be a character there.
Where you should be using another one is at the end of the regex. For example:
m = re.search(r'(\b\w+)\s+\1\b', "Let's go to the theater")
Without the second \b, you would get a false positive for the theater.
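To make the two cases concrete, here is a small sketch (the helper repeated_word and the shortened first sentence are illustrative, not from the question) showing what each \b prevents:

```python
import re


def repeated_word(pattern, text):
    """Return the matched text for a doubled-word pattern, or None."""
    match = re.search(pattern, text)
    return match.group() if match else None


# Leading \b: stops the match from starting at the "in" inside "begin".
no_leading = repeated_word(r"(\w+)\s+\1", "will begin in later March")
leading = repeated_word(r"(\b\w+)\s+\1", "will begin in later March")

# Trailing \b: stops the backreference from matching the "the" in "theater".
no_trailing = repeated_word(r"(\b\w+)\s+\1", "Let's go to the theater")
trailing = repeated_word(r"(\b\w+)\s+\1\b", "Let's go to the theater")

print(no_leading, leading, no_trailing, trailing)
```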
"I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end?"
\b is never required to represent the end, or beginning, of a word. To answer your bigger question, it's only useful during development -- when working with natural language, you'll ultimately need to replace \b with something else. Why?
The \b operator matches a word boundary as you've discovered. But a key concept here is, "What is a word?" The answer is the very narrow set [A-Za-z0-9_] (for str patterns in Python 3, \w also covers Unicode letters and digits unless re.ASCII is set) -- word is not a natural language word but a computer language identifier. The \b operator exists for a formal language's parser.
This means it doesn't handle common natural language situations like:
The word let's becomes two words, 'let' and 's', if \b represents the boundaries of a word. Also, titles like Mr. and Mrs. lose their period.
Similarly, if \b represents the start of a word, then the apostrophe in these cases will be lost: 'twas 'bout 'cause
Hyphenated words suffer at the hand of \b as well, e.g. mother-in-law (unless you want her to suffer.)
Unfortunately, you can't simply augment \b by including it in a character set as it doesn't represent a character. You may be able to combine it with other characters via alternation in a zero-width assertion.
When working with natural language, the \b operator is great for quickly prototyping an idea, but ultimately, probably not what you want. Ditto \w, but, since it represents a character, it's more easily augmented.
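One way to augment the boundary, sketched here as an assumption rather than as the answerer's own method, is to replace \b with lookarounds over a custom word-character class. The class [\w'-] below is an illustrative choice that treats apostrophes and hyphens as part of a word:

```python
import re

text = "Let's visit my mother-in-law"

# Plain \b splits on the apostrophe and the hyphens.
plain = re.findall(r"\b\w+\b", text)

# Custom boundaries: not preceded and not followed by a "word" character,
# where "word" characters also include ' and -.
WORD_CHARS = r"[\w'-]"
custom = re.findall(rf"(?<!{WORD_CHARS}){WORD_CHARS}+(?!{WORD_CHARS})", text)

print(plain)
print(custom)
```

The lookarounds are zero-width, like \b, so nothing outside the word ends up in the match.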
