Using regex, extract quoted strings that may contain nested quotes - python

I have the following string:
'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'
Now, I wish to extract the following quotes:
1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.
I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?
import re
s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"
for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print('  '+g)

If you really need to return all the results from a single regular expression applied only once, it will be necessary to use lookahead ((?=findme)) so the finding position goes back to the start after each match - see this answer for a more detailed explanation.
To prevent false matches, some clauses are also needed regarding the quotes that add complexity, e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this but the rules I've gone for are:
An opening quote must not be immediately preceded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.
Applying the above rules leads to the following regular expression:
(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
Debuggex Demo
A good quick sanity check test on any possible candidate regular expression is to reverse the quotes. This has been done in this regex101 demo.

EDIT
I modified my regex; it now properly matches even more complicated cases:
(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))
DEMO
It is now even more complicated. The main improvements are not matching directly after certain punctuation characters ([!?.]) and better separation of the quote cases. Verified on a variety of examples.
The sentence will be in the content capture group. Of course it has some restrictions, related to the use of whitespace, etc. But it should work with most properly formatted sentences - or at least it works with the examples.
(?=(?<!\w|[!?.])('|\")(?!\s) - matches a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - matches the sentence, followed by the same char (' or " captured in group 1) as it started with, ignoring other quotes.
It doesn't match the whole sentence directly, but through a capturing group nested in a lookaround construct, so with the global match modifier it will also match sentences inside sentences - because it directly matches only the position before the sentence starts.
About your regex:
I suppose that by [^\1]* you meant any char except the one captured in group 1, but a character class doesn't work this way: it treats \1 as a character in octal notation (the control character with code 1), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matching of THIS and THIS regex.
To achieve what you want, you should use a lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening char, then match every char which is not followed by the captured opening char, then match one more char, which sits directly before the closing char - and you have the whole content between the chars you excluded.
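The two readings of \1 can be checked directly in Python. This is only a minimal sketch: the sample string and the widening of the opening group to (['"]) are assumptions made for illustration, not part of the answer above.

```python
import re

# Inside a character class, \1 is an octal escape for the character with
# code 1; it is not a backreference to group 1.
octal_match = re.fullmatch(r"[\1]", "\x01")

# Tempered alternative: after the opening quote, consume characters that are
# not followed by the captured quote, then one final character.
pattern = re.compile(r"""(['"])((?:.(?!\1))*.)""")
sample = 'say "it\'s fine" now'   # hypothetical sample text
found = pattern.search(sample)

print(octal_match is not None)    # the class matched chr(1)
print(found.group(2))             # text between the double quotes
```

The apostrophe inside the double-quoted part is skipped because the lookahead only looks for the character stored in group 1.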

This is a great question for Python regex because sadly, in my opinion the re module is one of the most underpowered of mainstream regex engines. That's why for any serious regex work in Python, I turn to Matthew Barnett's stellar regex module, which incorporates some terrific features from Perl, PCRE and .NET.
The solution I'll show you can be adapted to work with re, but it is much more readable with regex because it is made modular. Also, consider it as a starting block for more complex nested matching, because regex lets you write recursive regular expressions similar to those found in Perl and PCRE.
Okay, enough talk, here's the code (a mere four lines apart from the import and definitions). Please don't let the long regex scare you: it is long because it is designed to be readable. Explanations follow.
The Code
import regex
quote = regex.compile(r'''(?x)
    (?(DEFINE)
        (?<qmark>["'])          # what we'll consider a quotation mark
        (?<not_qmark>[^'"]+)    # chunk without quotes
        (?<a_quote>(?P<qopen>(?&qmark))(?&not_qmark)(?P=qopen)) # a non-nested quote
    )                           # End DEFINE block
    # Start Match block
    (?&a_quote)
    |
    (?P<open>(?&qmark))
    (?&not_qmark)?
    (?P<quote>(?&a_quote))
    (?&not_qmark)?
    (?P=open)
''')
# named text rather than str, so the built-in str is not shadowed
text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""
for match in quote.finditer(text):
    print(match.group())
    if match.group('quote'):
        print(match.group('quote'))
The Output
'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!'
"How Doth the Little Busy Bee,"
'I will try again.'
How it Works
First, to simplify, note that I have taken the liberty of converting I'll to I will, reducing confusion with quotes. Addressing I'll would be no problem with a negative lookahead, but I wanted to make the regex readable.
In the (?(DEFINE)...) block, we define the three sub-expressions qmark, not_qmark and a_quote, much in the way that you define variables or subroutines to avoid repeating yourself.
After the definition block, we proceed to matching:
(?&a_quote) matches an entire quote,
| or...
(?P<open>(?&qmark)) matches a quotation mark and captures it to the open group,
(?&not_qmark)? matches optional text that is not quotes,
(?P<quote>(?&a_quote)) matches a full quote and captures it to the quote group,
(?&not_qmark)? matches optional text that is not quotes,
(?P=open) matches the same quotation mark that was captured at the opening of the quote.
The Python code then only needs to print the match and the quote capture group if present.
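For readers who cannot install regex, here is one possible adaptation to the standard re module. It is a sketch under stated assumptions rather than the author's own version: it inlines the sub-expressions instead of using (?(DEFINE)...), the group names inner and iopen are invented for this example, and like the pattern above it handles only one level of nesting.

```python
import re

# An outer quote that may contain one inner quote.
QUOTE = re.compile(r'''(?x)
    (?P<open>["'])                 # opening mark of the outer quote
    [^'"]*                         # text without quotation marks
    (?:
        (?P<inner>                 # optional inner quote, marks included
            (?P<iopen>["']) [^'"]+ (?P=iopen)
        )
        [^'"]*                     # text after the inner quote
    )?
    (?P=open)                      # same mark that opened the outer quote
''')

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

# Pair every outer quote with its inner quote (None when there is none).
results = [(m.group(), m.group('inner')) for m in QUOTE.finditer(text)]
for outer, inner in results:
    print(outer)
    if inner:
        print(inner)
```

The price of dropping the named subroutines is that the character classes are repeated, which is the readability cost the answer refers to.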
Can this be refined? You bet. Working with (?(DEFINE)...) in this way, you can build beautiful patterns that you can later re-read and understand.
Adding Recursion
If you want to handle more complex nesting using pure regex, you'll need to turn to recursion.
To add recursion, all you need to do is define a group and refer to it using the subroutine syntax. For instance, to execute the code within Group 1, use (?1). To execute the code within group something, use (?&something). Remember to leave an exit for the engine by either making the recursion optional (?) or one side of an alternation.
References
Pre-defined regex subroutines
Named capture groups

It seems difficult to achieve with just one regex pass, but it could be done with a relatively simple regex and a recursive function:
import re
REGEX = re.compile(r"(['\"])(.*?[!.,])\1", re.S)
S = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.' 'And we may now add "some more 'random test text'.":' "Yes it seems to be a good idea!" 'ok, let's go.'"""
def extract_quotes(string, quotes_list=None):
    # Collect the quotes found at this level (named quotes, not list,
    # so the built-in list is not shadowed)
    quotes = quotes_list or []
    quotes += [found[1] for found in REGEX.findall(string)]
    index = 0
    for quote in quotes[:]:
        index += 1
        # Look for inner quotes and insert them right after their parent
        sub_list = extract_quotes(quote)
        quotes = quotes[:index] + sub_list + quotes[index:]
        index += len(sub_list)
    return quotes
print(extract_quotes(S))
This prints:
['Well, I\'ve tried to say "How Doth the Little Busy Bee," but it all came different!', 'How Doth the Little Busy Bee,', "I'll try again.", 'And we may now add "some more \'random test text\'.":\' "Yes it seems to be a good idea!" \'ok, let\'s go.', "some more 'random test text'.", 'Yes it seems to be a good idea!']
Note that the regex uses punctuation to determine whether a quoted text is a "real quote": in order to be extracted, a quote needs to end with a punctuation character before the closing quote. That is, 'random test text' is not considered an actual quote, while 'ok, let's go.' is.
The regex is pretty simple; I think it does not need explanation.
The extract_quotes function finds all quotes in the given string and stores them in a list. Then it calls itself for each quote found, looking for inner quotes...

Related

How can I get a regular expression to find the correct instance of a word?

I'm trying to write a regular expression in python to identify instances of the phrases "played for" and "plays for" in a text, with the potential for finding instances where words come between the two, for example, "played guitar for". I only want this to find the first instance of the word "for" after "plays" or "played", however, I cannot work out how to write the regular expression.
The code I have at the moment is like this:
import re
def play_finder(doc):
    playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
    if playre.findall(doc):
        for inst in playre.findall(doc):
            playstr = inst
            print(playstr)
mytext = "He played for four hours last night. He plays guitar for the foo pythers. He won an award for his guitar playing."
play_finder(mytext)
I would like my code to be able to pull out two instances from mytext: "played for four" and "plays guitar for the".
Instead, what my code is finding is:
"He played for four hours last night. He plays guitar for the foo pythers. He won an award for".
So it's skipping the first and second for, and only finding the last.
How can I rewrite the regular expression to get it to stop skipping over the first and second instance of "for" in the sentence, and to identify both of them?
Edit: Another problem has become apparent to me after applying a solution I was offered. Given more than one sentence, such as:
"He played an eight hour set. It seemed like he went on for ever."
I don't want the regex to identify "He played an eight hour set. It seemed like he went on for" as matching the pattern. Is there a way to stop it looking for the "for" if it encounters a full stop?
You can try this,
\bplay(?:s|ed).*?for\b
Demo
There are some faults in the regex of your script.
playre = re.compile(r'\bplay[s|e][d]?\b.*\bfor\b\s\b')
[s|e] : does not work as alternation, because [] is a character class and matches exactly one of the characters listed in it (here s, | or e).
.* : the greedy quantifier (*) matches the longest possible string, so the match runs on to the last "for".
Somebody answered that I needed the lazy .*? then deleted their answer. I'm not sure why, because that worked. Hence, the code I'm using now is:
(r'\bplay[s|e][d]?\b.*?\bfor\b\s\b')
@ThmLee I tried your suggestion:
\bplay(s|ed).*?for\b
I'm (clearly) no expert with Regex, but it seemed not to work as well. Instead of outputting the lines "played for" and " plays guitar for" it just outputs "s" and "ed".
You misunderstand the use of square brackets. They create a character class which matches a single character out of the set of characters enumerated between the brackets. So [s|e] matches s or | or e.
Also, the word boundary is simply an assertion. It matches if the previous character was a "word" character and the next one isn't, or vice versa; but it doesn't advance the position within the string. So, for example, \s\bfor\b\s is redundant; we already know that \s matches whitespace (which is non-word) and for consists of word characters. You mean simply \sfor\s because the dropped \b conditions don't change what is being matched.
Try
r'\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+'
The (?:\w+\s+)?? allows for a single optional word before for. The second question mark makes the optional group non-greedy, i.e. it matches the shortest possible string which still allows the expression to match, instead of the longest. You will not want to allow unlimited repetitions (because then you'd match e.g. "played another game before he sat down for") but you might consider replacing the ?? with e.g. {0,3}? to allow for up to three words before "for".
We use (?:...) instead of (...) to make the grouping parentheses non-capturing; otherwise, findall will return a list of the captured submatches rather than the entire match.
The if findall: for findall is a minor inefficiency; you just need for match in findall which will simply iterate zero times if there are no matches.
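The points about capturing groups and greediness can be seen side by side in a short sketch. The first two patterns and mytext come from this thread. The third pattern, which uses [^.] so that a match cannot cross a full stop, is an assumption added here as one way to address the edit to the question.

```python
import re

mytext = ("He played for four hours last night. He plays guitar for the "
          "foo pythers. He won an award for his guitar playing.")

# Capturing group: findall returns only the captured alternatives.
captured = re.findall(r'\bplay(s|ed).*?for\b', mytext)

# Non-capturing groups: findall returns the whole matches.
whole = re.findall(r'\bplay(?:s|ed)?\s+(?:\w+\s+)??for\s+\w+', mytext)

# Assumption for illustration: [^.] instead of . keeps a match inside
# a single sentence.
two_sentences = "He played an eight hour set. It seemed like he went on for ever."
bounded = re.findall(r'\bplay(?:s|ed)\b[^.]*?\bfor\b', two_sentences)

print(captured)
print(whole)
print(bounded)
```

Iterating over the findall result directly, as in `for match in whole:`, simply runs zero times when nothing matched, so no separate `if` is needed.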
More generally, using regex for higher-level grammatical patterns is very often unsatisfactory. A grammatical parser (even some type of shallow parsing) is better at telling you when some words are constituents of an optional attribute or modifier for a noun phrase, or when "play" should be analyzed as a noun. Consider
He played - or rather, tapped his fingers and hummed - for three minutes.
I play another silly but not completely outrageous role for the third time in a year.
She plays what for many is considered offensive gameplay for the Hawks.
Brett plays the oboe although he thinks it's for wimps.
Some plays are for fools.

capture anything but string [duplicate]

I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
Explanation
A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index    0      1      2      3      4      5      6      7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
Note that the solution to does not start with “hede”:
^(?!hede).*$
is generally much more efficient than the solution to does not contain “hede”:
^((?!hede).)*$
The former checks for “hede” only at the input string’s first position, rather than at every position.
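A small Python check of the two patterns. The first four sample lines are the input from the question; the remaining three are added here for illustration only.

```python
import re

lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD", "hedecidedthat", "ahead"]

does_not_contain = re.compile(r'^((?!hede).)*$')
does_not_start_with = re.compile(r'^(?!hede).*$')

# Lines with no "hede" anywhere.
no_hede = [s for s in lines if does_not_contain.search(s)]

# Lines that merely do not begin with "hede".
no_hede_prefix = [s for s in lines if does_not_start_with.search(s)]

print(no_hede)
print(no_hede_prefix)
```

"ABhedeCD" separates the two: it contains hede, so the first pattern rejects it, but it does not start with hede, so the second accepts it.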
If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.
ETA Oh, rereading the question, grep -v is probably what you meant by "tools options".
Answer:
^((?!hede).)*$
Explanation:
^         the beginning of the string,
(         group and capture to \1 (0 or more times (matching the most amount possible)),
  (?!     look ahead to see if there is not,
    hede  your string,
  )       end of look-ahead,
  .       any character except \n,
)*        end of \1 (Note: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$         before an optional \n, and the end of the string
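The note about \1 holding only the last repetition can be observed with Python's re module. This is a minimal sketch and the variable names are made up for the example.

```python
import re

m = re.match(r'^((?!hede).)*$', 'hoho')
whole_match = m.group(0)        # the entire line
last_repetition = m.group(1)    # only the final character the group matched

# With zero repetitions the group never takes part in the match.
empty = re.match(r'^((?!hede).)*$', '')

print(whole_match, last_repetition, empty.group(1))
```

If the captured value is not needed, the non-capturing form (?:(?!hede).)* avoids the bookkeeping altogether.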
The given answers are perfectly fine, just an academic point:
Regular expressions in the sense of theoretical computer science ARE NOT ABLE to do it like this. For them it would have to look something like this:
^([^h].*$)|(h([^e].*$|$))|(he([^h].*$|$))|(heh([^e].*$|$))|(hehe.+$)
This only does a FULL match. Doing it for sub-matches would even be more awkward.
If you want the regex test to only fail if the entire string matches, the following will work:
^(?!hede$).*
e.g. -- If you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use: ^(?!foo$).*
Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e.
myStr !== 'foo'
You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):
!/^[a-f]oo$/i.test(myStr)
The regex solution at the top of this answer may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
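The same comparison in Python, as a quick sketch. The candidate list extends the examples above with an empty string, which is an addition for illustration.

```python
import re

not_exactly_foo = re.compile(r'^(?!foo$).*')
candidates = ["foo", "foofoo", "barfoo", "foobar", ""]

# Everything except the exact string "foo" passes the regex test...
passed = [s for s in candidates if not_exactly_foo.match(s)]

# ...which for these inputs agrees with a plain inequality check.
by_inequality = [s for s in candidates if s != "foo"]

print(passed)
```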
FWIW, since regular languages (aka rational languages) are closed under complementation, it's always possible to find a regular expression (aka rational expression) that negates another expression. But not many tools implement this.
Vcsn supports this operator (which it denotes {c}, postfix).
You first define the type of your expressions: labels are letter (lal_char) to pick from a to z for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true the word is accepted, false, rejected.
In Python:
In [5]: import vcsn
c = vcsn.context('lal_char(a-z), b')
c
Out[5]: {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} → 𝔹
then you enter your expression:
In [6]: e = c.expression('(hede){c}'); e
Out[6]: (hede)^c
convert this expression to an automaton:
In [7]: a = e.automaton(); a
finally, convert this automaton back to a simple expression.
In [8]: print(a.expression())
\e+h(\e+e(\e+d))+([^h]+h([^e]+e([^d]+d([^e]+e[^]))))[^]*
where + is usually denoted |, \e denotes the empty word, and [^] is usually written . (any character). So, with a bit of rewriting ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*.
You can see this example here, and try Vcsn online there.
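As far as I can tell, the rewritten expression describes every word other than hede itself (the complement of the single word, not "does not contain hede"). That reading can be tested by brute force in Python. The helper words and the four-letter test alphabet are assumptions made for this sketch.

```python
import re
from itertools import product

# The rewritten expression from above, used as a full-string pattern.
NOT_HEDE = re.compile(r'()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*')

def words(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# A word should be accepted exactly when it is not "hede".
mismatches = [w for w in words("hedx", 5)
              if bool(NOT_HEDE.fullmatch(w)) != (w != "hede")]
print(mismatches)
```

An empty list means the pattern and the plain inequality agree on every word tried.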
Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.
With negative lookahead, a regular expression can match text that does not contain a specific pattern. This is answered and explained by Bart Kiers. Great explanation!
However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this by letting the lookahead part check the whole text once, to ensure there is no 'hede', and then letting the normal part (.*) consume the whole text in one go.
Here is the improved regex:
/^(?!.*?hede).*$/
Note that the lazy quantifier (*?) in the negative lookahead part is optional; you can use the greedy quantifier (*) instead, depending on your data: if 'hede' is present and in the first half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier can be faster. However, if 'hede' is not present, both are equally slow.
Here is the demo code.
For more information about lookahead, please check out the great article: Mastering Lookahead and Lookbehind.
Also, please check out RegexGen.js, a JavaScript Regular Expression Generator that helps to construct complex regular expressions. With RegexGen.js, you can construct the regex in a more readable way:
var _ = regexGen;
var regex = _(
    _.startOfLine(),
    _.anything().notContains(        // match anything that does not contain:
        _.anything().lazy(), 'hede'  //   zero or more chars followed by 'hede',
                                     //   i.e., anything that contains 'hede'
    ),
    _.endOfLine()
);
Benchmarks
I decided to evaluate some of the presented Options and compare their performance, as well as use some new Features.
Benchmarking on .NET Regex Engine: http://regexhero.net/tester/
Benchmark Text:
The first 7 lines should not match, since they contain the searched Expression, while the lower 7 lines should match!
Regex Hero is a real-time online Silverlight Regular Expression Tester.
XRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex HeroRegex HeroRegex HeroRegex HeroRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her Regex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.Regex Hero
egex Hero egex Hero egex Hero egex Hero egex Hero egex Hero Regex Hero is a real-time online Silverlight Regular Expression Tester.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her
egex Hero
egex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her is a real-time online Silverlight Regular Expression Tester.
Nobody is a real-time online Silverlight Regular Expression Tester.
Regex Her o egex Hero Regex Hero Reg ex Hero is a real-time online Silverlight Regular Expression Tester.
Results:
Results are Iterations per second as the median of 3 runs - Bigger Number = Better
01: ^((?!Regex Hero).)*$                     3.914   // Accepted Answer
02: ^(?:(?!Regex Hero).)*$                   5.034   // With Non-Capturing group
03: ^(?!.*?Regex Hero).*                     7.356   // Lookahead at the beginning, if not found match everything
04: ^(?>[^R]+|R(?!egex Hero))*$              6.137   // Lookahead only on the right first letter
05: ^(?>(?:.*?Regex Hero)?)^.*$              7.426   // Match the word and check if you're still at linestart
06: ^(?(?=.*?Regex Hero)(?#fail)|.*)$        7.371   // Logic Branch: Find Regex Hero? match nothing, else anything
P1: ^(?(?=.*?Regex Hero)(*FAIL)|(*ACCEPT))   ?????   // Logic Branch in Perl - Quick FAIL
P2: .*?Regex Hero(*COMMIT)(*FAIL)|(*ACCEPT)  ?????   // Direct COMMIT & FAIL in Perl
Since .NET doesn't support action Verbs (*FAIL, etc.) I couldn't test the solutions P1 and P2.
Summary:
The overall most readable and performance-wise fastest solution seems to be 03 with a simple negative lookahead. This is also the fastest solution for JavaScript, since JS does not support the more advanced Regex Features for the other solutions.
Not regex, but I've found it logical and useful to use serial greps with pipe to eliminate noise.
eg. search an apache config file without all the comments-
grep -v '\#' /opt/lampp/etc/httpd.conf # this gives all the non-comment lines
and
grep -v '\#' /opt/lampp/etc/httpd.conf | grep -i dir
The logic of serial grep's is (not a comment) and (matches dir)
Since no one else has given a direct answer to the question that was asked, I'll do it.
The answer is that with POSIX grep, it's impossible to literally satisfy this request:
grep "<Regex for 'doesn't contain hede'>" input
The reason is that with no flags, POSIX grep is only required to work with Basic Regular Expressions (BREs), which are simply not powerful enough for accomplishing that task, because of lack of alternation in subexpressions. The only kind of alternation it supports involves providing multiple regular expressions separated by newlines, and that doesn't cover all regular languages, e.g. there's no finite collection of BREs that matches the same regular language as the extended regular expression (ERE) ^(ab|cd)*$.
However, GNU grep implements extensions that allow it. In particular, \| is the alternation operator in GNU's implementation of BREs. If your regular expression engine supports alternation, parentheses and the Kleene star, and is able to anchor to the beginning and end of the string, that's all you need for this approach. Note however that negative sets [^ ... ] are very convenient in addition to those, because otherwise, you need to replace them with an expression of the form (a|b|c| ... ) that lists every character that is not in the set, which is extremely tedious and overly long, even more so if the whole character set is Unicode.
Thanks to formal language theory, we get to see what such an expression looks like. With GNU grep, the answer would be something like:
grep "^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$" input
(found with Grail and some further optimizations made by hand).
You can also use a tool that implements EREs, like egrep, to get rid of the backslashes, or equivalently, pass the -E flag to POSIX grep (although I was under the impression that the question required avoiding any flags to grep whatsoever):
egrep "^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$" input
Here's a script to test it (note it generates a file testinput.txt in the current directory). Several of the expressions presented in other answers fail this test.
#!/bin/bash
REGEX="^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$"
# First four lines as in OP's testcase.
cat > testinput.txt <<EOF
hoho
hihi
haha
hede
h
he
ah
head
ahead
ahed
aheda
ahede
hhede
hehede
hedhede
hehehehehehedehehe
hedecidedthat
EOF
diff -s -u <(grep -v hede testinput.txt) <(grep "$REGEX" testinput.txt)
In my system it prints:
Files /dev/fd/63 and /dev/fd/62 are identical
as expected.
For those interested in the details, the technique employed is to convert the regular expression that matches the word into a finite automaton, then invert the automaton by changing every acceptance state to non-acceptance and vice versa, and then converting the resulting FA back to a regular expression.
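The first two steps of that technique can be sketched in Python without going back to a regular expression: build the automaton that tracks how much of hede has just been read, then flip its acceptance condition. The function names and the brute-force comparison against the ERE above are assumptions of this sketch, not the tooling used by the answer.

```python
import re
from itertools import product

WORD = "hede"

def step(state, ch):
    """Advance the DFA whose state is the length of the longest prefix of
    WORD that ends at the current position."""
    if state == len(WORD):
        return state                      # WORD was seen: absorbing state
    seen = WORD[:state] + ch
    for k in range(min(len(seen), len(WORD)), -1, -1):
        if seen.endswith(WORD[:k]):       # k == 0 always succeeds
            return k

def lacks_word(s):
    """Run the DFA and apply the flipped acceptance condition."""
    state = 0
    for ch in s:
        state = step(state, ch)
    return state != len(WORD)

# The ERE from the answer above, for comparison.
ERE = re.compile(
    r"^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$")

# Every string of length 0..6 over a small alphabet on which the two disagree.
disagreements = [
    "".join(letters)
    for n in range(7)
    for letters in product("hedx", repeat=n)
    if lacks_word("".join(letters)) != bool(ERE.search("".join(letters)))
]
print(disagreements)
```

Converting the flipped automaton back into an expression is the laborious part, which is why the answer relies on Grail or FormalTheory for it.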
As everyone has noted, if your regular expression engine supports negative lookahead, the regular expression is much simpler. For example, with GNU grep:
grep -P '^((?!hede).)*$' input
However, this approach has the disadvantage that it requires a backtracking regular expression engine. This makes it unsuitable in installations that are using secure regular expression engines like RE2, which is one reason to prefer the generated approach in some circumstances.
Using Kendall Hopkins' excellent FormalTheory library, written in PHP, which provides a functionality similar to Grail, and a simplifier written by myself, I've been able to write an online generator of negative regular expressions given an input phrase (only alphanumeric and space characters currently supported, and the length is limited): http://www.formauri.es/personal/pgimeno/misc/non-match-regex/
For hede it outputs:
^([^h]|h(h|e(h|dh))*([^eh]|e([^dh]|d[^eh])))*(h(h|e(h|dh))*(ed?)?)?$
which is equivalent to the above.
With this, you avoid testing a lookahead at each position:
/^(?:[^h]+|h++(?!ede))*+$/
equivalent to (for .net):
^(?>(?:[^h]+|h+(?!ede))*)$
Old answer:
/^(?>[^h]+|h+(?!ede))*$/
Aforementioned (?:(?!hede).)* is great because it can be anchored.
^(?:(?!hede).)*$ # A line without hede
foo(?:(?!hede).)*bar # foo followed by bar, without hede between them
But the following would suffice in this case:
^(?!.*hede) # A line without hede
This simplification is ready to have "AND" clauses added:
^(?!.*hede)(?=.*foo)(?=.*bar) # A line with foo and bar, but without hede
^(?!.*hede)(?=.*foo).*bar # Same
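The "AND" form is easy to try out in Python. The sample strings below are invented for this sketch.

```python
import re

has_foo_and_bar_no_hede = re.compile(r'^(?!.*hede)(?=.*foo)(?=.*bar)')
same_with_trailing_bar = re.compile(r'^(?!.*hede)(?=.*foo).*bar')

samples = ["foo bar", "bar then foo", "foo hede bar", "foo only", "hede"]

# Each lookahead is an independent condition checked from the line start.
first = [s for s in samples if has_foo_and_bar_no_hede.search(s)]
second = [s for s in samples if same_with_trailing_bar.search(s)]

print(first)
print(second)
```

Because every condition starts from the same position, the order of foo and bar in the line does not matter.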
A variant of the top answer that is, in my opinion, more readable:
^(?!.*hede)
Basically, "match at the beginning of the line if and only if it does not have 'hede' in it" - so the requirement translated almost directly into regex.
Of course, it's possible to have multiple failure requirements:
^(?!.*(hede|hodo|hada))
Details: The ^ anchor ensures the regex engine doesn't retry the match at every location in the string, which would match every string.
The ^ anchor at the beginning represents the beginning of the line. The grep tool matches one line at a time; in contexts where you're working with a multiline string, you can use the "m" flag:
/^(?!.*hede)/m # JavaScript syntax
or
(?m)^(?!.*hede) # Inline flag
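Both remarks, on why the anchor is needed and on the multiline flag, can be checked with a short Python sketch. The variable names are illustrative only.

```python
import re

text = "hoho\nhihi\nhaha\nhede"

# Without ^ the engine retries at later positions, where "hede" is no
# longer ahead, so even the string "hede" produces a match.
unanchored = re.search(r'(?!.*hede)', "hede")

# With ^ the only attempt is at the start of the string.
anchored = re.search(r'^(?!.*hede)', "hede")

# (?m) lets ^ and $ match at every line boundary of a multi-line string.
kept = re.findall(r'(?m)^(?!.*hede).*$', text)

print(unanchored.start(), anchored, kept)
```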
Here's how I'd do it:
^[^h]*(h(?!ede)[^h]*)*$
Accurate and more efficient than the other answers. It implements Friedl's "unrolling-the-loop" efficiency technique and requires much less backtracking.
Another option is to add a positive look-ahead that checks whether hede is anywhere in the input line, and then negate that, with an expression similar to:
^(?!(?=.*\bhede\b)).*$
with word boundaries.
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
If you want to negate a whole word in the same way a negated character class negates a single character:
For example, a string:
<?
$str="aaa bbb4 aaa bbb7";
?>
Do not use:
<?
preg_match('/aaa[^bbb]+?bbb7/s', $str, $matches);
?>
Use:
<?
preg_match('/aaa(?:(?!bbb).)+?bbb7/s', $str, $matches);
?>
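The same contrast in Python, on a sample where the two patterns behave differently. The sample string is my own and is not the $str used above.

```python
import re

text = "aaa b-side bbb7"   # hypothetical sample

# [^bbb] is just [^b]: it forbids every single "b", so this cannot match here.
with_class = re.search(r'aaa[^bbb]+?bbb7', text, re.S)

# (?:(?!bbb).)+? forbids only the three-character sequence "bbb".
with_lookahead = re.search(r'aaa(?:(?!bbb).)+?bbb7', text, re.S)

print(with_class)
print(with_lookahead.group())
```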
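Notice that in "(?!bbb)." the assertion is tested at the current position and consumes nothing (think of it as "look-current" rather than looking further ahead or behind), so the token after it starts at that same position, for example: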
"(?=abc)abcde", "(?!abc)abcde"
The OP did not specify or Tag the post to indicate the context (programming language, editor, tool) the Regex will be used within.
For me, I sometimes need to do this while editing a file using Textpad.
Textpad supports some Regex, but does not support lookahead or lookbehind, so it takes a few steps.
If I am looking to retain all lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. Delete all lines that contain the string hede (replacement string is empty):
Search string:<##-unique-##>.*hede.*\n
Replace string:<nothing>
Replace-all
3. At this point, all remaining lines Do NOT contain the string hede. Remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Now you have the original text with all lines containing the string hede removed.
If I am looking to Do Something Else to only lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. For all lines that contain the string hede, remove the unique "Tag":
Search string:<##-unique-##>(.*hede)
Replace string:\1
Replace-all
3. At this point, all lines that begin with the unique "Tag", Do NOT contain the string hede. I can now do my Something Else to only those lines.
4. When I am done, I remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Since the introduction of ruby-2.4.1, we can use the new Absent Operator in Ruby’s Regular Expressions
from the official doc
(?~abc) matches: "", "ab", "aab", "cccc", etc.
It doesn't match: "abc", "aabc", "ccccabc", etc.
Thus, in your case ^(?~hede)$ does the job for you
2.4.1 :016 > ["hoho", "hihi", "haha", "hede"].select{|s| /^(?~hede)$/.match(s)}
=> ["hoho", "hihi", "haha"]
Using the PCRE verbs (*SKIP)(*F):
^hede$(*SKIP)(*F)|^.*$
This completely skips any line that consists of exactly the string hede and matches all the remaining lines.
Execution of the parts:
Let us consider the above regex by splitting it into two parts.
Part before the | symbol. Part shouldn't be matched.
^hede$(*SKIP)(*F)
Part after the | symbol. Part should be matched.
^.*$
PART 1
Regex engine will start its execution from the first part.
^hede$(*SKIP)(*F)
Explanation:
^ Asserts that we are at the start.
hede Matches the string hede
$ Asserts that we are at the line end.
So a line that consists of exactly the string hede is matched. Once the regex engine reaches the following (*SKIP)(*F) verbs (note: (*F) can also be written as (*FAIL)), it skips the text matched so far and forces the match to fail. The | placed after the PCRE verbs is the alternation (logical OR) operator. Because of it, the engine goes on to try the second part of the regex on the rest of the input, that is, on every line except the one that consists of exactly hede.
PART 2
^.*$
Explanation:
^ Asserts that we are at the start of a line, i.e. it matches every line start except the one on the hede line.
.* In multiline mode, . matches any character except newline or carriage return characters, and * repeats the previous token zero or more times. So .* matches the whole line.
Hey, why did you use .* instead of .+ ?
Because .* matches a blank line but .+ does not. We want to match every line except hede, and the input may contain blank lines, so you must use .* instead of .+ (which repeats the previous token one or more times).
$ End of the line anchor is not necessary here.
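Python's built-in re module does not support PCRE backtracking verbs, so the pattern above cannot be pasted into it as-is. As a rough equivalent, the sketch below swaps in a negative lookahead (a different technique, not the (*SKIP)(*F) approach itself) that produces the same set of lines. The sample text is invented for illustration.

```python
import re

text = "hoho\nhede\n\nhedera\nhaha"

# Match every line except a line that is exactly "hede".
# (?!hede$) rejects the line only when "hede" is followed by the line end.
not_exactly_hede = re.compile(r"^(?!hede$).*$", re.MULTILINE)

lines = not_exactly_hede.findall(text)
print(lines)
```

Note that the blank line is kept (because of `.*`) and that `hedera` is kept too, since only the exact line `hede` is excluded.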
The TXR Language supports regex negation.
$ txr -c '#(repeat)
#{nothede /~hede/}
#(do (put-line nothede))
#(end)' Input
A more complicated example: match all lines that start with a and end with z, but do not contain the substring hede:
$ txr -c '#(repeat)
#{nothede /a.*z&~.*hede.*/}
#(do (put-line nothede))
#(end)' -
az <- echoed
az
abcz <- echoed
abcz
abhederz <- not echoed; contains hede
ahedez <- not echoed; contains hede
ace <- not echoed; does not end in z
ahedz <- echoed
ahedz
Regex negation is not particularly useful on its own but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".
It may be more maintainable to use two regexes in your code: one to do the first match, and then, if it matches, a second regex to check for the outlier cases you wish to block, for example ^.*(hede).*. Then put the appropriate logic in your code.
OK, I admit this is not really an answer to the posted question, and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case, this solution should not be overlooked.
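A minimal Python sketch of that two-step check might look like the following. The patterns and the name `accept` are placeholders chosen for illustration; substitute whatever your first match and your blocked cases really are.

```python
import re

first_match = re.compile(r"^h\w+$")  # hypothetical "real" pattern
blocked = re.compile(r"hede")        # outlier case to reject


def accept(line):
    """Run the main regex first, then veto lines that hit the blocked pattern."""
    return bool(first_match.search(line)) and not blocked.search(line)


candidates = ["hoho", "hihi", "haha", "hede", "hedera", "oops"]
accepted = [line for line in candidates if accept(line)]
print(accepted)
```

Keeping the two patterns separate means each one stays short enough to read and test on its own.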
The function below will help you get the desired output:
<?php
function removePrepositions($text) {
    // Words to strip, matched case-insensitively on word boundaries.
    $prepositions = array('/\bfor\b/i', '/\bthe\b/i');
    // preg_replace() accepts an array of patterns and applies each in turn.
    return trim(preg_replace($prepositions, '', trim($text)));
}
?>
I wanted to add another example for if you are trying to match an entire line that contains string X, but does not also contain string Y.
For example, let's say we want to check if our URL / string contains "tasty-treats", so long as it does not also contain "chocolate" anywhere.
This regex pattern would work (works in JavaScript too)
^(?=.*?tasty-treats)((?!chocolate).)*$
(global, multiline flags in example)
Interactive Example: https://regexr.com/53gv4
Matches
(These urls contain "tasty-treats" and also do not contain "chocolate")
example.com/tasty-treats/strawberry-ice-cream
example.com/desserts/tasty-treats/banana-pudding
example.com/tasty-treats-overview
Does Not Match
(These urls either contain "chocolate" somewhere or do not contain "tasty-treats" at all, so they won't match)
example.com/tasty-treats/chocolate-cake
example.com/home-cooking/oven-roasted-chicken
example.com/tasty-treats/banana-chocolate-fudge
example.com/desserts/chocolate/tasty-treats
example.com/chocolate/tasty-treats/desserts
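The same pattern can be checked against the URLs listed above with a short Python sketch. Only the variable names are new; the pattern and the URLs are the ones from this answer.

```python
import re

# Must contain "tasty-treats" (lookahead) and must not contain "chocolate"
# at any position (the repeated negative lookahead).
pattern = re.compile(r"^(?=.*?tasty-treats)((?!chocolate).)*$")

urls = [
    "example.com/tasty-treats/strawberry-ice-cream",
    "example.com/desserts/tasty-treats/banana-pudding",
    "example.com/tasty-treats-overview",
    "example.com/tasty-treats/chocolate-cake",
    "example.com/home-cooking/oven-roasted-chicken",
    "example.com/tasty-treats/banana-chocolate-fudge",
    "example.com/desserts/chocolate/tasty-treats",
    "example.com/chocolate/tasty-treats/desserts",
]

matching = [url for url in urls if pattern.search(url)]
print(matching)
```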
As long as you are dealing with lines, simply mark the negative matches and target the rest.
In fact, I use this trick with sed because it does not appear to support ^((?!hede).)*$.
For the desired output
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to keep only the target and delete the rest (as you want):
s/^🔒.*//g
For a better understanding
Suppose you want to delete the target:
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to delete the target:
s/^[^🔒].*//g
Remove the mark:
s/🔒//g
^((?!hede).)*$ is an elegant solution, but because it consumes characters you won't be able to combine it with other criteria. For instance, say you wanted to check for the absence of "hede" and the presence of "haha". This solution would work because it doesn't consume characters:
^(?!.*\bhede\b)(?=.*\bhaha\b)
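A quick way to see how the two lookaheads combine is to run the pattern over a few strings in Python. The sample strings below are invented for illustration.

```python
import re

# Both assertions are anchored at the start and consume nothing,
# so they can be stacked: "no hede anywhere" AND "haha somewhere".
pattern = re.compile(r"^(?!.*\bhede\b)(?=.*\bhaha\b)")

samples = ["haha hoho", "hede haha", "hoho hihi", "hahaha", "hedera haha"]
results = {s: pattern.match(s) is not None for s in samples}
print(results)
```

Because of the \b word boundaries, "hahaha" does not count as containing the word haha, and "hedera" does not count as containing the word hede.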
How to use PCRE's backtracking control verbs to match a line not containing a word
Here's a method that I haven't seen used before:
/.*hede(*COMMIT)^|/
How it works
First, it tries to find "hede" somewhere in the line. If successful, at this point, (*COMMIT) tells the engine to, not only not backtrack in the event of a failure, but also not to attempt any further matching in that case. Then, we try to match something that cannot possibly match (in this case, ^).
If a line does not contain "hede" then the second alternative, an empty subpattern, successfully matches the subject string.
This method is no more efficient than a negative lookahead, but I figured I'd just throw it on here in case someone finds it nifty and finds a use for it for other, more interesting applications.
Simplest thing that I could find would be
[^(hede)]
Tested at https://regex101.com/
You can also add unit-test cases on that site
A simpler solution is to use the not operator !
Your if statement will need to match "contains" and not match "excludes".
var contains = /abc/;
var excludes =/hede/;
if(string.match(contains) && !(string.match(excludes))){ //proceed...
I believe the designers of RegEx anticipated the use of not operators.

regex- capturing text between matches

In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a PowerShell window for testing .Net regexes (NB: the CLI gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round bracket numbers and dotted numbers.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regexes that start getting horribly complicated.
5. Decide how to exclude that pattern.
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you don't, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X.
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
9. Test it!
10. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
11. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
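For anyone following along in a Python console rather than .Net, the build-up described in the steps above can be replayed roughly like this. Two caveats: the variable names are mine, and Python's built-in re module rejects the variable-width look-behind used in the combined .Net pattern, so this sketch swaps it for a plain match of the number followed by a capturing group around the "(?:(?!X).)*" part.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

# The successive attempts at matching just the numbers.
stages = [
    r"\d\)|\d.",           # unescaped dot: also matches "99"
    r"\d\)|\d\.",          # single digit only: matches "0." instead of "10."
    r"\d+\)|\d+\.",        # multi-digit, but still matches the "6." of "$6.99"
    r"\d+\)|\d+\.(?!\d)",  # negative look-ahead rejects "6.99"
]
for pattern in stages:
    print(pattern, re.findall(pattern, test))

# X is the saved number pattern; (?:(?!X).)* matches text up to the next number.
X = stages[-1]
between = re.compile(rf"(?:{X})((?:(?!{X}).)*)")
texts = [m.group(1) for m in between.finditer(test)]
print(texts)
```

The final pattern finds seven numbers, and `texts` holds seven strings, including the empty and whitespace-only ones from the edge cases in the test string.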
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
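A rough Python counterpart of the PowerShell snippet, using re.split with the same pattern and dropping the first piece (the text before the first number), could look like this. The variable names are mine.

```python
import re

string = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
          "and 3. yet another case 4)5) 6)10.")
pattern = re.compile(r"\d+\)|\d+\.(?!\d)")

# Split on the numbers, then skip the leading "ab " piece.
parts = pattern.split(string)[1:]
print(parts)
```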
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word character (digit, letter or underscore) after it, and then matches zero or more whitespace characters.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
# Raw string so the backslashes reach the regex engine unchanged.
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']

How do I append a list of negative lookbehinds to a python regular expression?

I'm trying to split a paragraph into sentences using regex split and I'm trying to use the second answer posted here:
a Regex for extracting sentence from a paragraph in python
But I have a list of abbreviations that I don't want to end the sentence on even though there's a period. But I don't know how to append it to that regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).
Short answer: You can't, unless all lookbehind assertions are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).
This is a limitation of the current Python regex engine.
Longer answer:
You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternating part of the lookbehind assertion with as many .s as needed to get all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr..
Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.
If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():
(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*
would match a sentence that ends in . (optionally followed by whitespace) and may contain no dots unless preceded by one of the allowed abbreviations.
>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
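Since the question reads the abbreviations from a file, the alternation inside that pattern can be assembled from a list instead of being typed by hand. The sketch below assumes the abbreviations arrive as strings such as "Mr." with a trailing dot; the function name `build_sentence_pattern` is made up for illustration.

```python
import re


def build_sentence_pattern(abbreviations):
    """Build the findall() pattern from a list like ["Mr.", "Dr.", ...]."""
    # Strip the trailing dot, escape each entry, and try longer ones first.
    stems = sorted((a.rstrip(".") for a in abbreviations), key=len, reverse=True)
    alternation = "|".join(re.escape(stem) for stem in stems)
    return re.compile(rf"(?:(?:\b(?:{alternation})\.)|[^.])+\.\s*")


abbreviations = ["Mr.", "Ms.", "Dr.", "Mrs.", "St."]  # e.g. one per line in a file
pattern = build_sentence_pattern(abbreviations)

s = "My name is Mr. T. I pity the fool who's not on the A-Team."
sentences = pattern.findall(s)
print(sentences)
```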
I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.
You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and you are still looking behind from the same position. As long as you don't need an unbounded quantifier (e.g. *, +, {n,}) inside the look-behind, everything should be fine.
So the regex can be constructed like this:
(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+
It is a bit verbose. Anyway, I wrote this post just to demonstrate that it is possible to look behind on a list of fixed strings.
Example run:
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
There is a catch in using look-behind, though. If there is a variable number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there is a way to modify the regex so that it works for that case while keeping the look-behinds. (You can always collapse consecutive spaces into one, but that won't work for more general cases.)
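To actually append the look-behinds from a list, the chain can be generated with a join. This is a small sketch using the word list and test string from the example run; the variable names are mine.

```python
import re

blocked = ["list", "of", "words", "not", "allowed", "to", "precede"]

# One fixed-width negative look-behind per word, each followed by a space.
lookbehinds = "".join(f"(?<!{re.escape(word)} )" for word in blocked)
pattern = re.compile(lookbehinds + r"pattern\w+")

s = ('something patterning of patterned crap patternon not patterner, '
     'not allowed patternes to patternsses, patternet')
found = pattern.findall(s)
print(found)
```

Each look-behind has its own fixed width, so the words in the list may differ in length without triggering Python's fixed-width restriction.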

Python regex with look behind and alternatives

I want a regular expression that finds the text "wrapped" between "HEAD" or "HEADa" and "HEAD". That is, a text may start with HEAD or HEADa as its first word, and all the following "heads" are of type HEAD.
HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
I want only to capture the text that are in between the "heads" therefore I have a regex with look behind and look ahead expressions looking for my "heads". I have the following regex:
var = "HEADa", "HEAD"
my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)
However, when I try to execute this regex, I am getting an error message saying that I cannot have variable length in the look behind expression. What is wrong with this regex?
Currently, the first part of your regex looks like this:
(?<=^\bHEADa|HEAD\b)
You have two alternatives; one matches five characters and the other matches four, and that's why you get the error. Some regex flavors will let you do that even though they say they don't allow variable-length lookbehinds, but not Python. You could break it up into two lookbehinds, like this:
(?:(?<=^HEADa\b)|(?<=\bHEAD\b))
...but you probably don't need lookbehinds for this anyway. Try this instead:
(?:^HEADa|\bHEAD)\b
Whatever gets matched by the (.*?) later on will still be available through group #1. If you really need the whole of the text between the delimiters, you can capture that in group #1, and that other group will become #2 (or you can use named groups, and not have to keep track of the numbers).
Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you're usually better off doing a straight match and extracting the part you want with a capturing group. And that's true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors doesn't mean you should.
BTW, you may have noticed that I redistributed your word boundaries; I think this is what you really intended.
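Put together, a straight match plus a capturing group might look like the sketch below. The sample text and the exact whitespace handling (\s* around the captured text) are my assumptions, not taken from the question. As in the original regex, text after the last HEAD is not captured because nothing follows it for the lookahead to find.

```python
import re

text = "HEADa\n\nfirst block HEAD\n\nsecond block HEAD\n\nthird block"

# Match the opening head directly (no lookbehind), capture what follows in
# group 1, and stop just before the next HEAD.
my_pat = re.compile(
    r"(?:^HEADa|\bHEAD)\b\s*(.*?)(?=\s*\bHEAD\b)",
    re.DOTALL | re.MULTILINE,
)

blocks = my_pat.findall(text)
print(blocks)
```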
