Is there a way to combine the following three expressions into one regex?
name = re.sub(r'\s?\(\w+\)', '',name) # John Smith (ii) --> John Smith
name = re.sub(r'\s?(Jr.|Sr.)$','', name, flags=re.I) # John Jr. --> John
name = re.sub(r'".+"\s?', '', name) # Dwayne "The Rock" Johnson --> Dwayne Johnson
You can just use grouping and pipe:
re.sub(r'(\s?\(\w+\))|(\s?(Jr\.|Sr\.)$)|(".+"\s?)', '', name, flags=re.I)
If you want an efficient pattern (and one that works most of the time), simply separating your patterns with a pipe is a bad idea. You must reconsider what you want the pattern to do and rewrite it from the beginning.
p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)
text = p.sub('', text).rstrip()
This is a good opportunity to be critical about what you have previously written:
starting a pattern with an optional character \s? is slow because each position in the string must be tested with and without this character. So it is better to catch the optional whitespace at the end and trim the string afterwards. (In all cases you need to trim the result, even if you decide to catch the optional whitespace at the beginning.)
the pattern to find quoted parts is wrong, and inefficient when it does work, because it uses a dot with a greedy quantifier: if there are two quoted parts on the same line (note that the dot doesn't match newlines), all the content between them will be matched too. It's better to use a negated character class that doesn't contain the quote: "[^"]*" (note: this can be improved to deal with escaped quotes inside the quotes)
the pattern for Jr. and Sr. is wrong too: to match a literal . you need to escape it. Aside from that, the pattern is too imprecise because it doesn't check whether there are other word characters before it. It will match, for example, a sentence that ends with "USSR." or any substring that contains "jr." or "sr.". (To be fully rigorous, you must check that there is whitespace or the start of the string before it, but a simple word boundary should suffice most of the time.)
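The two problems above can be reproduced with a short sketch. The sample strings and the stricter `\b(?:Jr|Sr)\.` variant are illustrative assumptions, not part of the original question:

```python
import re

two_quotes = 'Dwayne "The Rock" Johnson and "Stone Cold" Steve'

# Greedy dot: the match runs from the first quote to the last one.
greedy = re.sub(r'".+"\s?', '', two_quotes)
# Negated class: each quoted part is matched on its own.
negated = re.sub(r'"[^"]*"\s?', '', two_quotes)

# Unescaped dot: "." accepts any character, so "Jrx" is treated like "Jr.".
loose_dot = re.sub(r'\s?(Jr.|Sr.)$', '', 'John Jrx', flags=re.I)
# No check on the preceding character: the "SR." inside "USSR." is removed.
loose_word = re.sub(r'\s?(Jr.|Sr.)$', '', 'USSR.', flags=re.I)

# Hypothetical stricter variant: escaped dot plus a word boundary.
strict = re.compile(r'\s?\b(?:Jr|Sr)\.$', re.I)
strict_ussr = strict.sub('', 'USSR.')
strict_jr = strict.sub('', 'John Jr.')

print(greedy)       # Dwayne Steve
print(negated)      # Dwayne Johnson and Steve
print(loose_dot)    # John
print(loose_word)   # US
print(strict_ussr)  # USSR.
print(strict_jr)    # John
```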
Now how to build your alternation:
The order can be important, in particular if the subpatterns are not mutually exclusive. For example, if you have the subpatterns a+b and a+ and you write a+|a+b, a b preceded by an a will never be matched because the first branch always succeeds first. But your example does not have this kind of problem.
As an aside, if you know that one of the branches is more likely to succeed, put it first in the alternation.
You know the searched substring starts with one of these characters: ", (, j, s. In this case, why not begin the pattern with ["(js]? That avoids testing each branch of the pattern at every position in the string.
Then, since the first character is already consumed, you only need to check with a lookbehind which of these characters has been matched for each branch.
With these small improvements you obtain a much faster pattern.
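As a quick check, here is a minimal sketch that runs the compiled pattern from this answer over the three inputs from the question and also shows the branch-order effect described above. The names `NAME_NOISE` and `clean_name` are made up for illustration:

```python
import re

# Pattern from the answer: one leading character class, then lookbehinds
# decide which branch applies.
NAME_NOISE = re.compile(
    r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)

def clean_name(name):
    """Remove the unwanted parts, then trim leftover trailing whitespace."""
    return NAME_NOISE.sub('', name).rstrip()

print(clean_name('John Smith (ii)'))            # John Smith
print(clean_name('John Jr.'))                   # John
print(clean_name('Dwayne "The Rock" Johnson'))  # Dwayne Johnson
print(clean_name('USSR.'))                      # USSR.

# Branch order: with a+ first, the b is never part of a match.
first_branch_wins = re.findall(r'a+|a+b', 'aab')
longer_branch_first = re.findall(r'a+b|a+', 'aab')
print(first_branch_wins)    # ['aa']
print(longer_branch_first)  # ['aab']
```

Note that, unlike the original second expression, this pattern is not anchored with $, so a "Jr." in the middle of a string is removed as well.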
I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. It seems there is similar thread ignoring newline character in regex match, but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to an hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.
Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, start with repeated alphabetical characters, then lookahead for a space and more alphabetical characters, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in a Python regex, either pass the flags= argument or define the flags at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
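For reference, a small sketch applying this pattern to the test string from the question, once with the flags= argument and once with the inline (?m) flag. The sample text is assumed to have one name per line and no blank lines:

```python
import re

text = """Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD"""

pattern = r'^[A-Z][a-z]+(?= [A-Z][a-z]+$)'

# Flag passed as an argument.
with_flag_arg = re.findall(pattern, text, flags=re.MULTILINE)
# Same flag written inline at the start of the pattern.
with_inline_flag = re.findall(r'(?m)' + pattern, text)
# Without the flag, ^ and $ only anchor to the whole string.
without_flag = re.findall(pattern, text)

print(with_flag_arg)     # ['Cardoza', 'Jerry']
print(with_inline_flag)  # ['Cardoza', 'Jerry']
print(without_flag)      # []
```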
My goal is to separate dashes between parentheses. For example: "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
I want the result to be
"Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective) "
My code is
re.sub(r'(.*)(\(.*)(-)(.*\))(.*)', r'\1\2 \3 \4\5', String)
However, this code only seems to separate the last dash in the last pair of parentheses of the string.
It gives the result "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British - Detective) "
Can anyone help with this? I searched here, and it seems my code should work the way I expected.
This code achieves the task by dividing it into two separate parts instead of relying solely on a single regular expression.
It searches the string target for portions that are enclosed by (...)
It then searches and replaces each - with (SPACE)-(SPACE) in each found (...) using replacement functions
Here we have the solution code:
import re

def expand_dashes(target):
    """
    replace all "-" with " - " when they are within ()
    target [string] - the original string
    return [string] - the replaced string
    * note, this function does not work with nested ()
    """
    return re.sub(r'(?<=\()(.*?)(?=\))', __helper_func, target)

def __helper_func(match):
    """
    a helper function meant to process individual groups
    """
    return match.group(0).replace('-', ' - ')
Here we have the demo output:
>>> x = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective)"
>>> expand_dashes(x)
'Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)'
Many specifiers in most regular expression implementations (including Python's) act greedily - that is, they match as much of the input string as possible. Thus, the first .* in your regex is matching all of your input string except for the very last set of parentheses - that first .* "eats up" everything it can while still leaving enough left for the whole regex to make a successful match. Once inside that set of parentheses, you first have another .*, which similarly matches everything it can and still have the rest of the regex have enough for a successful match - so all the dashes in that final pair of parentheses except for the last dash. Thus, the substitution only inserts spaces around the final dash in the final set of parentheses, because your regex only has a single non-overlapping match: it matches the entire input string, it's just that the part of the regex that singles out dash-between-parentheses only includes the final such dash.
To fix this, you may need to reevaluate parts of your approach, because re.sub only substitutes non-overlapping matches. It would be difficult (I'm skeptical it would even be doable) to construct a single regex that matches an arbitrary number of dashes between a given pair of parentheses, with a corresponding replacement that puts spaces around each such dash, while keeping each of those matches non-overlapping. With a regex system capable of arbitrary-number group captures, maybe, but as far as I am aware Python's implementation keeps only the last capture of any repeated group ((<group>)* or (<group>)+ etc.) in a given match. Checking for parentheses around a dash with a regex means including them in the match, so a regex that matches and replaces a single dash-between-parentheses will produce overlapping matches wherever several dashes share the same pair of parentheses.
An incremental approach, while a bit more complicated in implementation, might be a better way to get the desired behavior. You could use re.split with an appropriate regex to split the string into parenthesized sections and the intervening non-parenthetical sections, then perform a regex replacement on only the parenthetical sections using a simpler regex like r'([^-]*)(-)([^-]*)' to match any dashes*, then reassemble the full sequence with the new parenthetical sections. This effectively breaks the 'individually capture all dashes within parentheses' problem which is a bit hard for a single regex to get the captures right for into two problems of 'find parenthesized sections' and 'individually capture dashes', which are easier problems to solve.
*Note that this regex suggestion uses the character class [^-] meaning 'any characters that are not -'. This avoids the issue displayed by your current regex of .* including dashes in what it matches and "eating up" all but the last ones, because [^-]* is forced to stop matching when the next character is a -. Simply replacing .* with [^-]* in your current regex won't solve the issue, however, because re.sub won't replace for matches that overlap, like multiple dashes within the same parentheses would in that case.
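The incremental approach described above could be sketched roughly as follows. The function name `space_dashes_in_parens` and the split pattern are assumptions made for illustration, and a plain `-` replacement is swapped in for the `([^-]*)(-)([^-]*)` regex suggested in the text; nested parentheses are not handled:

```python
import re

def space_dashes_in_parens(text):
    # A capturing group in the split pattern keeps the parenthesized
    # sections in the result, at the odd indices.
    parts = re.split(r'(\([^()]*\))', text)
    for i in range(1, len(parts), 2):
        # Only the parenthesized sections get their dashes spaced out.
        parts[i] = re.sub(r'-', ' - ', parts[i])
    return ''.join(parts)

s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
print(space_dashes_in_parens(s))
# Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)

print(space_dashes_in_parens("well-known (semi-final)"))
# well-known (semi - final)
```

Splitting first means the dash replacement never has to know about parentheses, so overlapping matches stop being an issue.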
Try a simpler way:
import re
s = "Mr. Queen (The-American-Detective, EQ), Mr. Holmes (The-British-Detective) "
s = re.sub(r'(\w+)(\-)(\w+)(\-)(\w+)', '\\1 \\2 \\3 \\4 \\5', s)
print(s)
Outputs:
Mr. Queen (The - American - Detective, EQ), Mr. Holmes (The - British - Detective)
Here is the working:
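\w is essentially the same as [a-zA-Z0-9_]; that is, it matches lowercase letters, uppercase letters, digits or the underscore.
\- matches -.
So this regex matches any string of the form something-anything-anotherthing and replaces it with something - anything - anotherthing. Note that it only handles runs with exactly two dashes, and it does not check that the match is inside parentheses.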
I have a series of text files to parse which may or may not contain any one of a collection of headers, and then lines of data or comment below that header. All header groups are preceded by a double line break.
I am seeking a regular expression that will return an empty string if it sees a header followed immediately by a double line break. I need to differentiate whether a document has that header with no content, or does not have that header at all.
For example, here are portions of two documents:
Dogs
Spaniel
Beagle

Birds
Parrot
and
Dogs

Amphibians
Frogs
Salamanders
I would like a regex that would return Spaniel\nBeagle in the first document, and an empty string for the second.
The closest I have been able to find is (in Python syntax) expr = re.compile("Dogs(.+?|)?\n\n", re.DOTALL). This returns the correct value for the first, but in the second case it returns \n\nAmphibians\nFrogs\nSalamanders. The second question mark and the pipe do not do what I had hoped.
I am handling this by program logic right now, searching for Dogs\n\n and only returning contents if that regex is not found, but it is unsatisfying because nothing beats the feeling of a single regular expression doing the job.
So: is there a regex that will match the second document, and return ""?
Problem
Your Dogs(.+?|)?\n\n pattern matches the word Dogs anywhere in the document, then optionally (because of the empty alternative after |) tries to match 1 or more characters (due to the +? quantifier), but as few as possible (since +? is lazy), up to the first 2 newlines.
That means, the regex either matches Dogs only if there are no double newline symbols somewhere further in the text, or it will grab any text there is up to the first double newline symbols, because the .+? will consume 1 newline, and the \n\n pattern part will not be able to find the 2 newlines after Dogs.
Solution
You may use a *? quantifier instead of +? one to allow matching zero or more characters. The Dogs(.*?)\n\n will find Dogs, any 0+ chars as few as possible, up to the first \n\n, even those that appear right after Dogs.
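A minimal sketch of this fix on the two sample documents, assuming the sections are separated by exactly "\n\n". The helper `dogs_section` and the third document are hypothetical, added to show the "header absent" case:

```python
import re

expr = re.compile(r"Dogs(.*?)\n\n", re.DOTALL)

doc_with_content = "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot"
doc_empty = "Dogs\n\nAmphibians\nFrogs\nSalamanders"
doc_missing = "Birds\nParrot\n\nAmphibians\nFrogs"

def dogs_section(doc):
    """Return the text under Dogs, '' if the header has no content,
    or None if the header (followed by a blank line) is not found."""
    m = expr.search(doc)
    if m is None:
        return None
    # Group 1 starts with the newline that ends the header line.
    return m.group(1).strip("\n")

print(repr(dogs_section(doc_with_content)))  # 'Spaniel\nBeagle'
print(repr(dogs_section(doc_empty)))         # ''
print(repr(dogs_section(doc_missing)))       # None
```

One caveat of this sketch: if Dogs is the last section and no blank line follows it, the pattern finds no match at all.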
Optimization:
If you process very long strings, and if the Dogs appear at the beginning of a line, you may use an unrolled regex since .*? is known to slow regex execution with longer inputs.
Use
expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)
See the regex demo
Basically, it will match
^ - start of a line
Dogs - Dogs substring
(.*(?:\n(?!\n).*)*) - Group 1 capturing:
.* - zero or more chars other than linebreak chars (as the re.DOTALL modifier is not used)
(?:\n(?!\n).*)* - zero or more sequences of:
\n(?!\n) - a newline not followed with another newline
.* - zero or more chars other than linebreak chars
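The unrolled pattern can be tried on the same sample documents. This is only a sketch; the third document is a made-up case where Dogs is the final section:

```python
import re

expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)

doc_with_content = "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot"
doc_empty = "Dogs\n\nAmphibians\nFrogs\nSalamanders"
doc_last_section = "Birds\nParrot\n\nDogs\nPoodle"

# Group 1 holds everything after the header up to the first blank line.
with_content = expr.search(doc_with_content).group(1)
empty = expr.search(doc_empty).group(1)
last_section = expr.search(doc_last_section).group(1)

print(repr(with_content))  # '\nSpaniel\nBeagle'
print(repr(empty))         # ''
print(repr(last_section))  # '\nPoodle'
```

Because the pattern does not require a trailing \n\n, it also matches when Dogs is the last section of the document.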
I'm trying to capture the id and name in a pattern like this #[123456](John Smith) and use those to create a string like <a href="123456">John Smith</a>.
Here's what I tried, but it's not working.
def format(text):
    def idrepl(match):
        fbid = match.group(1)
        name = match.group(2)
        print(fbid, name)
        return '<a href="{}">{}</a>'.format(fbid, name)
    return re.sub(r'\#\[(\d+)\]\[(\w\s+)\]', idrepl, text)
The part
(\w\s+)
matches exactly one word character followed by 1+ whitespace characters.
Clearly that's not what you want, and it's easy to fix:
([\w\s]+)
"one or more characters each of which is a word or whitespace char".
Whether that's actually what you want, I'm not sure -- it will happily match John Smith, but not e.g. Maureen O'Hara (that apostrophe will impede the match) or John V. Smith (here, it's the dot that will impede the match) or John Smith-Passell (here, it's the dash).
In general, people spell their names with potentially several punctuation characters (as well as word-characters and whitespace) -- apostrophes, dots, dashes, and more. If you don't need to account for this, then, fine!-) If you do, life gets a bit harder (sticking those chars within the square brackets above will mostly do, but precautions are needed -- e.g a dash, if you need it to be part of the bracketed char set, must be at the end, just before the close bracket).
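Putting the pieces together, a possible sketch looks like this. It assumes the name is wrapped in parentheses as in the example input #[123456](John Smith), and that apostrophes, dots and dashes are the only extra characters needed; `link_mentions` and `NAME_CHARS` are hypothetical names:

```python
import re

# Word chars, whitespace, apostrophe, dot, and a dash placed last so it
# is taken literally instead of forming a range.
NAME_CHARS = r"[\w\s'.-]+"
MENTION = re.compile(r"#\[(\d+)\]\((" + NAME_CHARS + r")\)")

def link_mentions(text):
    def idrepl(match):
        fbid, name = match.group(1), match.group(2)
        return '<a href="{}">{}</a>'.format(fbid, name)
    return MENTION.sub(idrepl, text)

print(link_mentions("#[123456](John Smith)"))
# <a href="123456">John Smith</a>
print(link_mentions("Hi #[42](John Smith-Passell) and #[7](John V. Smith)"))
# Hi <a href="42">John Smith-Passell</a> and <a href="7">John V. Smith</a>
```

The sketch does no HTML escaping of the captured name, which real code would likely need.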
I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z only matches the literal text A-Z (and remember there may be other uppercase letters than A-Z ...).
Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first pattern matches. If it is preceded by Abs, it is not preceded by S, so the second pattern matches. Either way, one of those patterns will match, since Abs and S are mutually exclusive.
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |. Without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true will the pattern matcher continue to scan the string and find your match.
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative look ahead patterns.
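The difference between alternating and chaining the lookbehinds can be checked with a short sketch. The sample strings are made up for illustration:

```python
import re

# Alternation: at least one lookbehind always succeeds, so nothing is excluded.
alternated = re.compile(r'(?:(?<!Abs)|(?<!S))\.\s+[A-Z]')
# Chaining: both lookbehinds must succeed at the same position.
chained = re.compile(r'(?<!Abs)(?<!S)\.\s+[A-Z]')
# Chained lookbehinds plus the month-name lookahead.
with_months = re.compile(
    r'(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]')

print(bool(alternated.search('Abs. Forth')))        # True
print(bool(chained.search('Abs. Forth')))           # False
print(bool(chained.search('S. Fifth')))             # False
print(bool(chained.search('First. Second')))        # True
print(bool(with_months.search('Second. January')))  # False
print(bool(with_months.search('Second. Third')))    # True
```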
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple differently-lengthed negative lookbehinds is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
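A quick sketch to confirm that claim, and to show why the lookbehinds are chained instead of merged: Python's re module rejects a single lookbehind whose alternatives have different lengths. The list of candidate strings is illustrative:

```python
import re

pattern = re.compile(r'(?<!1)(?<!12)(?<!123)example')

candidates = ['example', '2example', '3example',
              '1example', '12example', '123example']
matched = [c for c in candidates if pattern.search(c)]
print(matched)  # ['example', '2example', '3example']

# A single variable-width lookbehind is not accepted by the re module.
try:
    re.compile(r'(?<!1|12|123)example')
    merged_ok = True
except re.error:
    merged_ok = False
print(merged_ok)  # False
```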
Use nltk punkt tokenizer. It's probably more robust than using regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))
Input
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']
You can use a character set [] inside the lookbehind when the excluded prefixes are single characters:
'(?<![123])example'
This would not match the example in 1example, 2example or 3example.