Multiple negative lookbehind assertions in python regex? - python

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (double spaces are common in English text).
Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z only matches the literal characters A-Z (and remember there may be uppercase letters other than A-Z ...).
Additionally, I think I know why this does not work. Your alternation matches \. [A-Z] if the dot is not preceded by Abs, or if it is not preceded by S. If the dot is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of the alternatives always matches, because a dot cannot be preceded by both Abs and S at once.
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |. Without it, the expression says: not preceded by Abs and not preceded by S. Only if both are true does the matcher go on to try the rest of the pattern at that position.
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative look ahead patterns.
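As a rough sketch of the difference between alternating and chaining the lookbehinds (the sample text and variable names are made up for illustration):

```python
import re

# Hypothetical sample text, not taken from the question.
text = "See Abs. Two for details. Refer to S. Five as well. The end."

# Alternation: a dot passes if EITHER lookbehind succeeds, which is always.
either = re.compile(r"(?:(?<!Abs)|(?<!S))\.\s+[A-Z]")
# Chaining: a dot passes only if BOTH lookbehinds succeed.
both = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

either_hits = either.findall(text)
both_hits = both.findall(text)
print(either_hits)  # ['. T', '. R', '. F', '. T']
print(both_hits)    # ['. R', '. T']
```

The alternation still matches after Abs and after S, while the chained version skips both.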

I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple negative lookbehinds of different lengths is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
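A quick way to check that claim, as a minimal sketch (the list of candidate strings is just the set of examples named above):

```python
import re

pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

candidates = ["example", "2example", "3example",
              "1example", "12example", "123example"]

# Keep only the strings in which the pattern finds a match.
matched = [s for s in candidates if pattern.search(s)]
print(matched)  # ['example', '2example', '3example']
```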

Use the NLTK Punkt tokenizer. It's probably more robust than using a regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))
Input
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']

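If every excluded prefix is a single character, you can use a character class [] inside one lookbehind.
'(?<![123])example'
This would not match 1example, 2example or 3example. Note that commas inside [] are literal characters, not separators, and that the lookbehind goes before the text you want to match.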

Related

How to use regular expressions in python to split articles based on punctuation

I need to divide the article into sentences by punctuation. I use the following regular expression:
re.split(r'[,|.|?|!]', strContent)
It does work, but there is a problem. It will separate the following Latin names that should not be split (such as G. lucidum):
Many studies to date have described the anticancer properties of G. lucidum,
The abbreviation of this Latin name is a capital letter followed by a dot and a space.
So I try to modify the above regular expression as follows:
re.split(r'[,|(?:[^A-Z].)|?|!]', strContent)
However, the following error prompt was received:
re.error: unbalanced parenthesis
How can I modify this regular expression?
You should use a negative lookbehind, and put it before the character set that matches the sentence ending.
The negative lookbehind should match a word that's just a single capital letter. This can be done by matching a word boundary before the letter with \b.
You also don't need | inside the character set. That's used for alternative patterns to match.
re.split(r'(?<!\b[A-Z])[,.?!]', strContent)
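A small sketch of how that split behaves (the sample sentence is invented for illustration; stripping and dropping empty pieces is an extra step not in the answer above):

```python
import re

# Hypothetical sample text containing a Latin abbreviation.
text = "Many studies describe G. lucidum, a fungus. Is it safe? Yes!"

# The lookbehind blocks a split right after a single capital letter.
raw_parts = re.split(r'(?<!\b[A-Z])[,.?!]', text)

# Tidy up: strip whitespace and drop the empty piece after the final "!".
parts = [p.strip() for p in raw_parts if p.strip()]
print(parts)
# ['Many studies describe G. lucidum', 'a fungus', 'Is it safe', 'Yes']
```

The dot after G is left alone because it directly follows a one-letter capitalized word.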
Using pure regex to find complete sentences is difficult, because of edge cases such as abbreviations, which you have been seeing. You should use an NLP library like NLTK instead.
from nltk.tokenize import sent_tokenize
text = "Many studies to date have described the anticancer properties of G. lucidum. The studies are vast."
print(sent_tokenize(text))
# ['Many studies to date have described the anticancer properties of G. lucidum.', 'The studies are vast.']

Use CR/LF pair to reject a match using Regex

I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. It seems there is similar thread ignoring newline character in regex match, but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to 1 hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.
Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, match a capitalized word at the start of a line, then look ahead for a space and another capitalized word followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in Python regex, either pass the flags= argument or set the flags inline at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
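For reference, a minimal sketch that runs the multiline pattern against the test string from the question (the variable names are illustrative only, and the sample assumes plain \n line endings):

```python
import re

sample = """Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r'^[A-Z][a-z]+(?= [A-Z][a-z]+$)', flags=re.MULTILINE)

names = pattern.findall(sample)
print(names)  # ['Cardoza', 'Jerry']
```

Catto and Duncan are skipped because a comma, not a space, follows them; Andrew is skipped because nothing follows it on its line.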

Regular expression add space around all occurrence of a character within parentheses in python

My goal is to put spaces around dashes that appear between parentheses. For example: "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
I want the result to be
"Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective) "
My code is
re.sub(r'(.*)(\(.*)(-)(.*\))(.*)', r'\1\2 \3 \4\5', String)
However, this code seems to separate only the last dash in the last pair of parentheses in the string.
It gives the result "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British - Detective) "
Can anyone help with it? I searched here, but it seems my code should work the way I expected.
This code achieves the task by dividing it into two separate parts instead of relying solely on a single regular expression.
It searches the string target for portions that are enclosed by (...)
It then searches and replaces each - with (SPACE)-(SPACE) in each found (...) using replacement functions
Here we have the solution code:
import re
def expand_dashes(target):
    """
    replace all "-" with " - " when they are within ()
    target [string] - the original string
    return [string] - the replaced string
    * note, this function does not work with nested ()
    """
    # the helper is called once per (...) section found by the regex
    return re.sub(r'(?<=\()(.*?)(?=\))', __helper_func, target)
def __helper_func(match):
    """
    a helper function meant to process individual groups
    """
    return match.group(0).replace('-', ' - ')
Here we have the demo output:
>>> x = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective)"
>>> expand_dashes(x)
'Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)'
Many quantifiers in most regular expression implementations (including Python's) act greedily - that is, they match as much of the input string as possible. Thus, the first .* in your regex matches all of your input string up to the very last set of parentheses - that first .* "eats up" everything it can while still leaving enough for the whole regex to make a successful match. Once inside that set of parentheses, you have another .*, which similarly matches everything it can while still leaving enough for the rest of the regex to succeed - so it swallows all the dashes in that final pair of parentheses except for the last one. Thus, the substitution only inserts spaces around the final dash in the final set of parentheses, because your regex has only a single non-overlapping match: it matches the entire input string, but the part of the regex that singles out a dash between parentheses only captures the final such dash.
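To see the greedy behavior concretely, here is a small sketch that prints what each group of the question's regex captures on the question's input:

```python
import re

s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
m = re.match(r'(.*)(\(.*)(-)(.*\))(.*)', s)

# Group 1 swallows everything up to the LAST opening parenthesis.
print(repr(m.group(1)))  # 'Mr. Queen (The-American-Detective, EQ), Mr. Holmes '
# Group 2 swallows everything up to the LAST dash inside it.
print(repr(m.group(2)))  # '(The-British'
print(repr(m.group(4)))  # 'Detective)'

# There is only one match, so only one dash gets spaces around it.
result = re.sub(r'(.*)(\(.*)(-)(.*\))(.*)', r'\1\2 \3 \4\5', s)
print(result)
```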
To fix this, you may need to reevaluate parts of your approach, because re.sub only substitutes non-overlapping matches. It would be difficult (I'm skeptical it would even be doable) to construct a single regex that matches an arbitrary number of dashes between a given pair of parentheses, with a corresponding replacement that puts spaces around each such dash, while keeping those matches non-overlapping. (With a regex system capable of arbitrary-number group captures, maybe, but as far as I am aware Python's implementation only keeps the last capture of any repeated group, such as (<group>)* or (<group>)+, in a given match.) Checking for parentheses surrounding dashes with a regex means including them in the match, so a regex that matches and replaces a single dash between parentheses will have overlapping matches wherever there are multiple dashes in the same pair of parentheses.
An incremental approach, while a bit more complicated to implement, might be a better way to get the desired behavior. You could use re.split with an appropriate regex to split the string into parenthesized sections and the intervening non-parenthetical sections, then perform a regex replacement on only the parenthetical sections using a simpler regex like r'([^-]*)(-)([^-]*)' to match any dashes*, then reassemble the full sequence with the new parenthetical sections. This breaks the hard problem of 'individually capture all dashes within parentheses' into two easier ones: 'find parenthesized sections' and 'individually capture dashes'.
*Note that this regex suggestion uses the character class [^-] meaning 'any characters that are not -'. This avoids the issue displayed by your current regex of .* including dashes in what it matches and "eating up" all but the last ones, because [^-]* is forced to stop matching when the next character is a -. Simply replacing .* with [^-]* in your current regex won't solve the issue, however, because re.sub won't replace for matches that overlap, like multiple dashes within the same parentheses would in that case.
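A minimal sketch of that split-then-replace idea follows. The function name is hypothetical, the splitting regex is one possible choice that assumes no nested parentheses, and the per-section replacement is simplified to a plain str.replace instead of the regex suggested above:

```python
import re

def space_dashes_in_parens(text):
    """Put spaces around every dash that sits inside (...)."""
    # The capturing group makes re.split keep the parenthesized
    # sections; they end up at the odd indices of the result.
    pieces = re.split(r'(\([^()]*\))', text)
    for i in range(1, len(pieces), 2):
        pieces[i] = pieces[i].replace('-', ' - ')
    return ''.join(pieces)

s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
print(space_dashes_in_parens(s))
```

Dashes outside parentheses are left untouched, because only the odd-indexed pieces are modified.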
Try a simpler way:
import re
s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
s = re.sub(r'(\w+)(\-)(\w+)(\-)(\w+)', '\\1 \\2 \\3 \\4 \\5', s)
print(s)
Outputs:
Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)
Here is how it works:
\w is essentially the same as [a-zA-Z0-9_]; that is, it matches lowercase letters, uppercase letters, digits or the underscore.
\- matches -.
So, this regex matches any string of the form something-anything-anotherthing and replaces it with something - anything - anotherthing. Note that it only handles exactly two dashes and does not check that the text is inside parentheses.

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using python regex, and I'm trying to match all consecutively repeated words, such as the bold in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?
[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
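The three stages can be compared side by side. This is only a sketch: the backreference version uses + instead of * so that an empty "word" cannot match, which is a small deviation from the pattern shown above.

```python
import re

sentence = "I am struggling to to make this this work"

# {2} applies only to the space, so two consecutive spaces are required.
no_group = re.search(r'[A-Za-z0-9]* {2}', sentence)
print(no_group)  # None

# Grouping makes {2} apply to "word + space", but ANY two words match.
any_two = re.search(r'([A-Za-z0-9]* ){2}', sentence).group()
print(repr(any_two))  # 'I am '

# The backreference forces the second word to repeat the first.
repeated = re.search(r'([A-Za-z0-9]+) \1', sentence).group()
print(repr(repeated))  # 'to to'
```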
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.
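To illustrate what the word boundaries and \s change, here is a small comparison sketch (the test strings and the names loose and strict are chosen for illustration; strict uses the \s+ variant mentioned above):

```python
import re

loose = re.compile(r'([A-Za-z0-9]+) \1')   # no boundaries, literal space
strict = re.compile(r'\b(\w+)\s+\1\b')     # boundaries, any whitespace

tests = ["this this", "foo fooo", "foo oo", "to\tto"]
loose_hits = [bool(loose.search(t)) for t in tests]
strict_hits = [bool(strict.search(t)) for t in tests]

print(loose_hits)   # [True, True, True, False]
print(strict_hits)  # [True, False, False, True]
```

The loose pattern reports false positives inside longer words, and misses the tab-separated repeat.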
You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
# ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
# [('to', 'to'), ('this', 'this')]
At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.
This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']
Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'
def find_duplicates_in_string(words):
    """ Takes in a string and returns any duplicate words
    i.e. "this this"
    """
    duplicates = []
    words = words.split()
    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates
print(find_duplicates_in_string(sentence))

Combine three regular expressions

Is there a way to combine the following three expressions into one regex?
name = re.sub(r'\s?\(\w+\)', '',name) # John Smith (ii) --> John Smith
name = re.sub(r'\s?(Jr.|Sr.)$','', name, flags=re.I) # John Jr. --> John
name = re.sub(r'".+"\s?', '', name) # Dwayne "The Rock" Johnson --> Dwayne Johnson
You can just use grouping and the pipe (|) for alternation:
re.sub(r'(\s?\(\w+\))|(\s?(Jr.|Sr.)$)|(".+"\s?)', '', name, flags=re.I)
Demo
If you want an efficient pattern (and one that works most of the time), simply joining your patterns with a pipe is a bad idea. Reconsider what you want the pattern to do and rewrite it from the beginning.
p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)
text = p.sub('', text).rstrip()
This is a good opportunity to be critical about what you have previously written:
Starting a pattern with an optional character such as \s? is slow, because each position in the string must be tested with and without that character. It is better to catch the optional whitespace at the end and trim the string afterwards. (In all cases you need to trim the result, even if you decide to catch the optional whitespace at the beginning.)
The pattern for the quoted parts is wrong and inefficient (when it works), because it uses a dot with a greedy quantifier: if there are two quoted parts on the same line (note that the dot doesn't match newlines), everything between them is matched too. It's better to use a negated character class that excludes the quote: "[^"]*" (note: this can be improved to deal with escaped quotes inside the quotes).
The pattern for Jr. and Sr. is wrong too: to match a literal . you need to escape it. Aside from that, the pattern is too imprecise because it doesn't check whether other word characters come before it. It will match, for example, a sentence that ends with "USSR." or any string that ends in "jr." or "sr.". (To be fully rigorous, you must check for whitespace or the start of the string before it, but a simple word boundary should suffice most of the time.)
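Two of these points are easy to reproduce. The following is a sketch with made-up input strings; the \b variant is one possible fix, not the only one:

```python
import re

# Greedy dot vs. negated character class on a line with two quoted parts.
line = 'Dwayne "The Rock" Johnson and "Stone Cold" Steve Austin'
greedy = re.sub(r'".+"\s?', '', line)
negated = re.sub(r'"[^"]*"\s?', '', line)
print(greedy)   # Dwayne Steve Austin
print(negated)  # Dwayne Johnson and Steve Austin

# Without a word boundary, the suffix pattern also eats the end of "USSR.".
sentence = 'Made in the USSR.'
imprecise = re.sub(r'\s?(Jr.|Sr.)$', '', sentence, flags=re.I)
bounded = re.sub(r'\s?\b(Jr\.|Sr\.)$', '', sentence, flags=re.I)
print(imprecise)  # Made in the US
print(bounded)    # Made in the USSR.
```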
Now how to build your alternation:
The order can be important, in particular if the subpatterns are not mutually exclusive. For example, with the subpatterns a+b and a+, if you write a+|a+b, the b after a run of a's is never matched, because the first branch succeeds first. Your example does not have this kind of problem.
As an aside, if you know that one of the branches is more likely to succeed, put it first in the alternation.
You know the searched substring starts with one of these characters: ", (, j, s. So why not begin the pattern with ["(js]? That avoids testing each branch of the pattern at every position in the string.
Then, since the first character is already consumed, you only need a lookbehind in each branch to check which of these characters was matched.
With these small improvements you obtain a much faster pattern.
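As a quick check, the sketch below wraps the combined pattern in a helper and runs it on the three examples from the question, then shows the effect of branch order. The names NAME_NOISE and clean_name are hypothetical, and no timing claim is tested here.

```python
import re

NAME_NOISE = re.compile(
    r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*',
    re.I,
)

def clean_name(name):
    # Remove the matched part, then trim leftover trailing whitespace.
    return NAME_NOISE.sub('', name).rstrip()

print(clean_name('John Smith (ii)'))            # John Smith
print(clean_name('John Jr.'))                   # John
print(clean_name('Dwayne "The Rock" Johnson'))  # Dwayne Johnson

# Branch order: the first alternative that succeeds wins.
print(re.findall(r'a+|a+b', 'aab'))  # ['aa']
print(re.findall(r'a+b|a+', 'aab'))  # ['aab']
```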
