Use CR/LF pair to reject a match using Regex


I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. There seems to be a similar thread, "ignoring newline character in regex match", but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to 1 hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.

Andre (and not the trailing w) is being matched in your regex because the last token is a negative lookahead for \n, and just before that is an optional space. So Andrew<end of line> fails because it is at the end of the line, and the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, start with repeated alphabetical characters, then lookahead for a space and more alphabetical characters, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in Python regex, either pass the flags= argument, or define the flags at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
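A minimal check of the pattern above against the sample text from the question, assuming the lines are separated by plain \n characters (the variable names are only illustrative):

```python
import re

# Sample text from the question, joined with "\n" line endings (an assumption).
text = "\n".join([
    "Cardoza Fred", "Catto, Philipa", "Duncan, Jean", "Jerry Smith",
    "and", "but", "and", "Andrew", "Red", "Abcd", "DDDD",
])

# (?m) makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)")

matches = pattern.findall(text)
print(matches)  # ['Cardoza', 'Jerry']
```

If the input uses \r\n line endings, $ will not match in front of the \r, so normalising first with text.replace("\r\n", "\n") is one way to keep the pattern unchanged.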

Related

How to allow regular expression to return empty string

I have a series of text files to parse which may or may not contain any one of a collection of headers, and then lines of data or comment below that header. All header groups are preceded by a double line break.
I am seeking a regular expression that will return an empty string if it sees a header followed immediately by a double line break. I need to differentiate whether a document has that header with no content, or does not have that header at all.
For example, here are portions of two documents:
Dogs
Spaniel
Beagle

Birds
Parrot
and
Dogs

Amphibians
Frogs
Salamanders
I would like a regex that would return Spaniel\nBeagle in the first document, and an empty string for the second.
The closest I have been able to find is (in Python syntax) expr = re.compile("Dogs(.+?|)?\n\n", re.DOTALL). This returns the correct value for the first, but in the second case it returns \n\nAmphibians\nFrogs\nSalamanders. The second question mark and the pipe do not do what I had hoped.
I am handling this by program logic right now, searching for Dogs\n\n and only returning contents if that regex is not found, but it is unsatisfying because nothing beats the feeling of a single regular expression doing the job.
So: is there a regex that will match the second document, and return ""?

Problem
Your Dogs(.+?|)?\n\n pattern matches the word Dogs anywhere in the document, then tries to optionally (as there is an empty alternative after |) match any 1 or more (due to the +? quantifier) characters, but as few as possible (since +? is a lazy quantifier), up to the first 2 newlines.
That means, the regex either matches Dogs only if there are no double newline symbols somewhere further in the text, or it will grab any text there is up to the first double newline symbols, because the .+? will consume 1 newline, and the \n\n pattern part will not be able to find the 2 newlines after Dogs.
Solution
You may use a *? quantifier instead of +? one to allow matching zero or more characters. The Dogs(.*?)\n\n will find Dogs, any 0+ chars as few as possible, up to the first \n\n, even those that appear right after Dogs.
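A small sketch of the lazy version, assuming the two sample documents use a blank line (two consecutive \n) between header groups. The helper name dogs_section and the leading-newline strip are illustrative choices, not part of the original answer:

```python
import re

# The two sample documents, assuming a blank line separates header groups.
doc_with_dogs = "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot"
doc_empty_dogs = "Dogs\n\nAmphibians\nFrogs\nSalamanders"

# re.DOTALL lets . match newlines, so .*? can span several lines.
expr = re.compile(r"Dogs(.*?)\n\n", re.DOTALL)

def dogs_section(doc):
    """Return the captured text ('' when the header is empty),
    or None when the pattern does not match at all."""
    m = expr.search(doc)
    return None if m is None else m.group(1).lstrip("\n")

print(repr(dogs_section(doc_with_dogs)))   # 'Spaniel\nBeagle'
print(repr(dogs_section(doc_empty_dogs)))  # ''
```

Returning None separately from '' keeps the distinction the question asks for between a missing header and an empty one.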
Optimization:
If you process very long strings, and if the Dogs appear at the beginning of a line, you may use an unrolled regex since .*? is known to slow regex execution with longer inputs.
Use
expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)
See the regex demo
Basically, it will match
^ - start of a line
Dogs - Dogs substring
(.*(?:\n(?!\n).*)*) - Group 1 capturing:
.* - zero or more chars other than linebreak chars (as the re.DOTALL modifier is not used)
(?:\n(?!\n).*)* - zero or more sequences of:
\n(?!\n) - a newline not followed with another newline
.* - zero or more chars other than linebreak chars
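The unrolled pattern can be tried on the same kind of input. This is a sketch under the assumption that header groups are separated by exactly one blank line; the dictionary labels are made up for the example:

```python
import re

expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)

docs = {
    "with content": "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot",
    "empty header": "Dogs\n\nAmphibians\nFrogs\nSalamanders",
    "no header": "Birds\nParrot",
}

results = {}
for label, doc in docs.items():
    m = expr.search(doc)
    # Group 1 holds everything after Dogs up to (not including) the blank line.
    results[label] = None if m is None else m.group(1)

print(results)
```

Unlike the lazy version, this pattern does not require a trailing \n\n, so it also matches when the Dogs group is the last one in the document.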

Combine three regular expressions

Is there a way to combine the following three expressions into one regex?
name = re.sub(r'\s?\(\w+\)', '',name) # John Smith (ii) --> John Smith
name = re.sub(r'\s?(Jr.|Sr.)$','', name, flags=re.I) # John Jr. --> John
name = re.sub(r'".+"\s?', '', name) # Dwayne "The Rock" Johnson --> Dwayne Johnson

You can just use grouping and pipe:
re.sub(r'(\s?\(\w+\))|(\s?(Jr\.|Sr\.)$)|(".+"\s?)', '', name, flags=re.I)

If you want to obtain an efficient pattern (and one that works most of the time), simply separating your patterns with a pipe is a bad idea. You must reconsider what you want to do with your pattern and rewrite it from the beginning.
p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)
text = p.sub('', text).rstrip()
This is a good opportunity to be critical about what you have previously written:
starting a pattern with an optional character \s? is slow because each position in the string must be tested with and without this character. So it is better to catch the optional whitespace at the end and to trim the string afterwards. (In all cases you need to trim the result, even if you decide to catch the optional whitespace at the beginning.)
the pattern to find quoted parts is wrong and inefficient (when it works), because you use a dot with a greedy quantifier, so if there are two quoted parts in the same line (note that the dot doesn't match newlines) all the content between them will be matched too. It's better to use a negated character class that doesn't contain the quote: "[^"]*" (note: this can be improved to deal with escaped quotes inside the quotes)
the pattern for Jr. and Sr. is wrong too: to match a literal . you need to escape it. Aside from that, the pattern is too imprecise because it doesn't check whether there are other word characters before. It will match, for example, a sentence that ends with "USSR." or any substring that contains "jr." or "sr.". (To be fully rigorous, you must check that there is whitespace or the start of the string before, but a simple word boundary should suffice most of the time.)
Now how to build your alternation:
The order can be important, in particular if the subpatterns are not mutually exclusive. For example, if you have the subpatterns a+b and a+, and you write a+|a+b, a b preceded by an a will never be matched because the first branch always succeeds first. But your example does not have this kind of problem.
As an aside, if you know that one of the branches has more chances to succeed, put it in the first position in the alternation.
You know the searched substring starts with one of these characters: ", (, j, s. In this case, why not begin the pattern with ["(js], which avoids testing each branch of the pattern at every position in the string?
Then, since the first character is already consumed, you only need to check with a lookbehind which of these characters has been matched for each branch.
With these small improvements you obtain a much faster pattern.
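As a quick sanity check, the combined pattern can be run on the three sample names from the question, and the point about branch order can be seen with a+ and a+b. The helper name clean_name is hypothetical; no timing is measured here, so the speed claim is not verified by this sketch:

```python
import re

p = re.compile(r'["(js](?:(?<=\b[js])r\.|(?<=\()\w+\)|(?<=")[^"]*")\s*', re.I)

def clean_name(name):
    # Remove the matched parts, then trim whitespace left at the end.
    return p.sub('', name).rstrip()

samples = ['John Smith (ii)', 'John Jr.', 'Dwayne "The Rock" Johnson']
cleaned = [clean_name(s) for s in samples]
print(cleaned)  # ['John Smith', 'John', 'Dwayne Johnson']

# Branch order matters when the branches overlap:
first_wins = re.findall(r'a+|a+b', 'aab')   # a+ succeeds first, b is never matched
longer_first = re.findall(r'a+b|a+', 'aab') # a+b is tried first
print(first_wins, longer_first)  # ['aa'] ['aab']
```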

Python Regex to capture #[123456](John Smith)

I'm trying to capture the id and name in a pattern like this #[123456](John Smith) and use those to create a string like <a href="123456">John Smith</a>.
Here's what I tried, but it's not working.
import re

def format(text):
    def idrepl(match):
        fbid = match.group(1)
        name = match.group(2)
        print(fbid, name)
        return '<a href="{}">{}</a>'.format(fbid, name)
    return re.sub(r'\#\[(\d+)\]\[(\w\s+)\]', idrepl, text)

The part
(\w\s+)
matches exactly one word character followed by 1+ whitespace characters.
Clearly that's not what you want, and it's easy to fix:
([\w\s]+)
"one or more characters each of which is a word or whitespace char".
Whether that's actually what you want, I'm not sure -- it will happily match John Smith, but not e.g. Maureen O'Hara (that apostrophe will impede the match) or John V. Smith (here, it's the dot that will impede the match) or John Smith-Passell (here, it's the dash).
In general, people spell their names with potentially several punctuation characters (as well as word-characters and whitespace) -- apostrophes, dots, dashes, and more. If you don't need to account for this, then, fine!-) If you do, life gets a bit harder (sticking those chars within the square brackets above will mostly do, but precautions are needed -- e.g a dash, if you need it to be part of the bracketed char set, must be at the end, just before the close bracket).
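A possible end-to-end sketch, assuming the name is wrapped in parentheses as in the sample input #[123456](John Smith), and that apostrophes, dots and dashes are the only extra punctuation to allow. The names LINK and to_anchor are illustrative:

```python
import re

# Assumption: the name sits in parentheses, as in #[123456](John Smith).
# The class allows word chars, whitespace, apostrophe, dot and a dash;
# the dash is placed last so it is taken literally.
LINK = re.compile(r"#\[(\d+)\]\(([\w\s'.-]+)\)")

def to_anchor(text):
    # \1 is the numeric id, \2 is the name.
    return LINK.sub(r'<a href="\1">\2</a>', text)

print(to_anchor("#[123456](John Smith)"))
# <a href="123456">John Smith</a>
```

Names with other punctuation would still be left untouched by this class, so the set of allowed characters is a judgment call.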

Python regex: using or statement

I may not be saying this right (I'm a total regex newbie). Here's the code I currently have:
bugs.append(re.compile("^(\d+)").match(line).group(1))
I'd like to add to the regex so it looks at either '\d+' (starts with digits) or that it starts with 2 capital letters and contains a '-' before the first whitespace. I have the regex for the capital letters:
^[A-Z]{2,}
but I'm not sure how to add the '-' and then make an OR with the \d+. Does this make sense? Thanks!

The way to do an OR in regexps is with the "alternation" or "pipe" operator, |.
For example, to match either one or more digits, or two or more capital letters:
^(\d+|[A-Z]{2,})
Debuggex Demo
You may or may not sometimes need to add/remove/move parentheses to get the precedence right. The way I've written it, you've got one group that captures either the digit string or the capitals. While you're learning the rules (in fact, even after you've learned the rules) it's helpful to look at a regular expression visualizer/debugger like the one I used.
Your rule is slightly more complicated: you want 2 or more capital letters, and a hyphen before the first space. That's a bit hard to write as is, but if you change it to two or more capital letters, zero or more non-space characters, and a hyphen, that's easy:
^(\d+|[A-Z]{2,}\S*?-)
Debuggex Demo
(Notice the \S*?—that means we're going to match as few characters as possible, instead of as many as possible, so we'll only match up to the first hyphen in THIS-IS-A-TEST instead of up to the last. If you want the other one, just drop the ?.)
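To see the difference between the lazy and greedy forms, here is a small sketch with made-up input lines. The helper first_group is hypothetical; it also guards against match() returning None, which the one-liner in the question would turn into an AttributeError:

```python
import re

lazy = re.compile(r"^(\d+|[A-Z]{2,}\S*?-)")
greedy = re.compile(r"^(\d+|[A-Z]{2,}\S*-)")

def first_group(pattern, line):
    # match() returns None when the line does not start with a bug id.
    m = pattern.match(line)
    return m.group(1) if m else None

print(first_group(lazy, "1234 fix crash"))         # 1234
print(first_group(lazy, "THIS-IS-A-TEST case"))    # THIS-
print(first_group(greedy, "THIS-IS-A-TEST case"))  # THIS-IS-A-
print(first_group(lazy, "no id here"))             # None
```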

Write | for "or". For a sequence of zero or more non-whitespace characters, write \S*.
re.compile(r'^(\d+|[A-Z][A-Z]\S*-)')

re.compile(r"""
    ^              # beginning of the line
    (?:            # non-capturing group; do not return this group in .group()
        (\d+)      # one or more digits, captured as a group
      |            # Or
        [A-Z]{2}   # Exactly two uppercase letters
        \S*        # Any number of non-whitespace characters
        -          # the dash you wanted
    )              # end of the non-capturing group
    """,
    re.X)          # enable comments in the regex
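One detail worth checking with this layout: only the digit branch is captured, so group(1) is None when the letter branch matches, while group(0) always holds the whole match. A short sketch using a compact one-line equivalent of the verbose pattern (the names bug_id and bug_prefix are illustrative):

```python
import re

# Compact equivalent of the verbose pattern: digits are captured,
# the two-capitals branch is not.
bug_id = re.compile(r"^(?:(\d+)|[A-Z]{2}\S*-)")

def bug_prefix(line):
    m = bug_id.match(line)
    # group(0) is the full match regardless of which branch succeeded.
    return m.group(0) if m else None

print(bug_prefix("42 crash"))      # 42
print(bug_prefix("AB-123 crash"))  # AB-
print(bug_id.match("AB-123 crash").group(1))  # None
```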

Sentence matching with regex

I have a text that splits into many lines, no particular formats. So I decided to line.strip('\n') for each line. Then I want to split the text into sentences using the sentence end marker . considering:
period . that is followed by a \s (whitespace), \S (like " ') and followed by [A-Z] will split
not to split [0-9]\.[A-Za-z], like 1.stackoverflow real time solution.
My program only solves half of 1 - period (.) that is followed by a \s and [A-Z]. Below is the code:
# -*- coding: utf-8 -*-
import re, sys
source = open(sys.argv[1], 'rb')
dest = open(sys.argv[2], 'wb')
sent = []
for line in source:
    line1 = line.strip('\n')
    k = re.sub(r'\.\s+([A-Z“])'.decode('utf8'), '.\n\g<1>', line1)
    sent.append(k)
for line in sent:
    dest.write(''.join(line))
Pls! I'd like to know which is the best way to master regex. It seems to be confusing.

To include the single quote in the character class, escape it with a \. The regex should be:
\.\s+[A-Z"\']
That's really all you need. You only need to tell a regex what to match, you don't need to specify what you don't want to match. Everything that doesn't fit the pattern won't match.
This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by an number and immediately followed by a letter doesn't meet those criteria, it won't match.
This is assuming that the regex you had was working to split a period followed by whitespace followed by a capital, as you stated. Note, however, that this means that I am Sam. Sam I am. would split into I am Sam and am I am. Is that really what you want? If not, use zero-width assertions to exclude the parts you want to match but also keep. Here are your options, in order of what I think it's most likely you want.
1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:
(?<=\.)\s+(?=[A-Z"\'])
This will split the example above into I am Sam. and Sam I am.
2) Keep the first letter of the next sentence; lose the period and whitespace:
\.\s+(?=[A-Z"\'])
This will split into I am Sam and Sam I am. This presumes that there are more sentences afterward, otherwise the period will stay with the second sentence, because it's not followed by whitespace and a capital letter or quote. If this option is the one you want - the sentences without the periods, then you might want to also match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace will be dropped:
\.(?:\s+(?=[A-Z"\'])|\s*$)
Note the ?:. You need non-capturing parentheses, because if you have capture groups in a split, anything captured by the group is added as an element in the results (e.g. re.split(r'(\+)', 'a+b+c') gives you a list of a + b + c rather than just a b c).
3) Keep everything; whitespace goes with the preceding sentence:
(?<=\.\s+)(?=[A-Z"\'])
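This will give you I am Sam. (with its trailing whitespace) and Sam I am. Note that Python's re module does not accept a variable-width lookbehind such as (?<=\.\s+), so this form needs an engine that supports it, such as the third-party regex module.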
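Options 1 and 2 can be tried directly with re.split in Python. This is a sketch on the example sentence only; the variable names are illustrative, and option 3 is left out because Python's re rejects variable-width lookbehind:

```python
import re

text = "I am Sam. Sam I am."

# Option 1: keep the period, drop the whitespace between sentences.
keep_period = re.split(r'(?<=\.)\s+(?=[A-Z"\'])', text)

# Option 2 (extended form): drop the period and whitespace, including the
# final period. The final match leaves an empty trailing piece, filtered here.
drop_period = [s for s in re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text) if s]

print(keep_period)  # ['I am Sam.', 'Sam I am.']
print(drop_period)  # ['I am Sam', 'Sam I am']
```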
Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html
