I have an And/Or regex i.e (PatternA|PatternB) in which I only take PatternA if PatternB does not exist (PatternB always comes after PatternA but is more important) so I put a negative lookahead in the PatternA Pipe.
This works on shorter text blocks:
https://regex101.com/r/bU6cU6/5
But times out on longer text blocks:
https://regex101.com/r/bU6cU6/2
What I don't understand is if I put PatternA with the Neg Look ahead alone in the same long text block it takes only 32 steps to reject it:
https://regex101.com/r/bU6cU6/3
and if I put PatternB alone in the same long text block it only takes 18 steps to accept it:
https://regex101.com/r/bU6cU6/4
So I am not sure why it is taking 100,000+/timeout to first reject (32 steps) then accept (18 steps) with the pipes. Is there another/better way to construct it so that it checks PatternA first, then PatternB? Right now it is doing something I don't understand to go from 50 steps to 100k+.
Unanchored lookarounds used with a "global" regex (matching several occurrences) cause too much legwork, and are inefficient. They should be "anchored" to some concrete context. Often, they are executed at the beginning (lookaheads) or end (lookbehinds) of the string.
In your case, you may "anchor" it by placing it after Option1: to ensure it is only executed after Option1: has already matched.
Option1:(?!.*Option2)\*.*?(?P<Capture>Bob|David|Ted|Alice)|\*Option2 (?P<Capture2>Juan)
^^^^^^^^^^^^^
See this regex demo
Some more answers:
What I don't understand is if I put PatternA with the Neg Look ahead alone in the same long text block it takes only 32 steps to reject it
Yes, but you tested it with internal optimizations ON. Disable them and you will see
if I put PatternB alone in the same long text block it only takes 18 steps to accept it:
The match is found as expected, in a very efficient way:
Your main problem is the position of the lookahead. The lookahead has to be tried at every position, and it has to scan all the remaining characters every time. The longer test string is over 3500 characters long; that adds up.
If your regex isn't anchored, you should always try to start it with something concrete that will fail or succeed quickly--literal text is the best. In this case, it's obvious that you can move the lookahead back: Option1:\*(?!.*Option2) instead of (?!.*Option2)Option1:\*. (Notice the lack of trailing .* in the lookahead; you didn't need that.)
But why is PatternA so much quicker when you match it alone? Internal optimizations. When the regex is just (?!.*Option2.*)Option1:\*.*?(?P<Capture>(Bob|David|Ted|Alice)), the regex engine can tell that the match must start with Option1:*, so it goes straight to that position for its first match attempt. The longer regex is too complicated, and the optimization doesn't occur.
You can test that by using the "regex debugger" option at Regex101, then checking DISABLE INTERNAL ENGINE OPTIMIZATIONS. The step count goes back to over 100,000.
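The effect of moving the lookahead can be sketched in Python. The sample text below is made up for illustration (the real test strings live behind the regex101 links), and the re.S flag is an assumption so that .* can cross line breaks. Python's re module does not report step counts, so this only checks that both variants return the same result; the difference is that the second one attempts the lookahead only where Option1: has already matched.

```python
import re

# Hypothetical sample text: Option1 appears early, Option2 much later.
text = "filler " * 20 + "Option1:* picks Bob " + "filler " * 20 + "*Option2 Juan"

NAMES = r"(?P<Capture>Bob|David|Ted|Alice)"
TAIL = r"|\*Option2 (?P<Capture2>Juan)"

# Lookahead first: it is attempted at every position in the text.
unanchored = re.compile(r"(?!.*Option2)Option1:\*.*?" + NAMES + TAIL, re.S)
# Lookahead after the literal: attempted only once "Option1:" has matched.
anchored = re.compile(r"Option1:(?!.*Option2)\*.*?" + NAMES + TAIL, re.S)


def first_capture(pattern, s):
    """Return (Capture, Capture2) of the first match, or None."""
    m = pattern.search(s)
    return (m.group("Capture"), m.group("Capture2")) if m else None


with_option2 = first_capture(anchored, text)
without_option2 = first_capture(anchored, "Option1:* picks Bob")
print(with_option2, without_option2)
```

When Option2 is present, the Option1 branch is rejected and Capture2 is filled; when it is absent, Capture is filled.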
Related
I have a rather complex regex search in place, and I recently had to extend it to a "reversed" pattern. This was easy to implement, but performance dropped roughly by a factor of 10.
I would appreciate any tips on how to improve this problem, but I'm not particular about how. It's ok, for example, to use two or more steps if this is faster than a single, very involved one.
Original pattern
import regex
expression = regex.compile(
fr"""
(?:{CUE}) # cue before targets. non capture (?:)
(?:{PADDING}) # text before the match.
(?:
(?:{ITEM_SEPARATION})? # none or once
(?P<targets>{targets_as_regex})
)+
""",
regex.VERBOSE,
)
Need to know: CUE is a rather short list of options, something like:
CUE = r"""word
|two\swords
|simple\soptions?
"""
This is about 5-15 options, quite simple (a bit simpler for the reversed scenario).
PADDING is literally anything, but no longer than 150. (PADDING = r".{0,150}?"). In the reversed scenario it's a bit more restrictive, only [,A-Za-z\d\s]
ITEM_SEPARATION is simple: optional spaces followed by either comma, bounded " y " and more optional spaces:
ITEM_SEPARATION = r"""
\s* # optional spaces
(?:,|\by\b) # non-capture, either comma, or 'y' (with bounds)
\s* # optional spaces
"""
Finally, the problem is likely in targets_as_regex, which is a big list of bounded names. This can easily be hundreds or thousands of options, although each call will pass a unique list of targets.
The idea is to match something like:
This is a text, with a cue, then a bit more text and a match, other match y final match. And after the "." we stop matching.
And it performs sufficiently well. But then we have the reversed situation:
Reversed pattern
The idea is to match this:
This not a match, because of the stop. This is a match, another match y final match, because now we find a cue. Etc.
I naturally recycled my pattern:
expression = regex.compile(
fr"""
(
(?P<targets>{targets_as_regex})
(?:{ITEM_SEPARATION})?
)+
(?:{SIMPLE_PADDING}) # text before the match
(?:{CUE_AFTER}) # cue after targets
""",
regex.VERBOSE,
)
It works as expected, but performance is very poor when targets (that are many) come before cues (which are few).
What I tried
I added a basic check where we omit the whole thing if the cue is not in the text at all, but it did not help very much. Another idea that I have tested is to search all matches for the targets (regardless of cues and all that) and then pass a shortened list (targets_as_regex) with only these matches to the slow pattern. This actually helps, but not by much (still close to that x10 drop in performance).
Any other ideas that could help?
The ((?P<targets>{targets_as_regex})(?:{ITEM_SEPARATION})?)+ is a well-known catastrophic backtracking prone pattern (analogous to (a+b?)+ which is reduced to (a+)+), and the more to the left of the pattern, the more dangerous.
You need to "unroll" patterns like (a+b?)+ into a+(?:ba+)* making the subsequent subpatterns match only at different positions inside the string.
The pattern you need is
fr"""(?:{targets_as_regex})(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"""
To speed it up, you need a regex trie built from the targets_as_regex, see Speed up millions of regex replacements in Python 3.
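A minimal sketch of the unrolled pattern, using the standard re module instead of regex so that it runs without extra packages. The target list, the cue, and the 150-character limit on SIMPLE_PADDING are hypothetical stand-ins for the real values, and the named group items is added here only to make the matched list easy to inspect.

```python
import re

# Hypothetical stand-ins for the question's building blocks.
targets = ["alpha", "beta", "gamma"]
targets_as_regex = "|".join(rf"\b{re.escape(t)}\b" for t in targets)
ITEM_SEPARATION = r"\s*(?:,|\by\b)\s*"
SIMPLE_PADDING = r"[,A-Za-z\d\s]{0,150}?"
CUE_AFTER = r"\bcue\b"

# Unrolled form: one target, then zero or more (separator + target) pairs.
# Each repetition must start with a separator, so the engine cannot split
# the same run of text in many different ways.
expression = re.compile(
    rf"(?P<items>(?:{targets_as_regex})"
    rf"(?:{ITEM_SEPARATION}(?:{targets_as_regex}))*)"
    rf"(?:{SIMPLE_PADDING})(?:{CUE_AFTER})"
)

text = "This is not a match. alpha, beta y gamma, because now we find a cue."
m = expression.search(text)
items = m.group("items")
found = re.findall(targets_as_regex, items)
print(items, found)

# A full stop is not allowed in the padding, so this one is not linked to the cue.
no_match = expression.search("alpha. Then a cue")
```

The unrolled form no longer accepts a trailing separator after the last target, but in this sketch the padding class already allows commas and spaces, so that text is still consumed before the cue.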
In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day.
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regexes that start getting horribly complicated.
Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you don't, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X.
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
Test it!
Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
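The "match any character unless it starts the next number" trick from the steps above can also be sketched in Python. This is an adaptation rather than the .Net pattern itself: Python's re module does not accept the variable-length lookbehind, so the number is matched normally and the text after it is taken from a capturing group.

```python
import re

test = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
        "and 3. yet another case 4)5) 6)10.")

# X is the number pattern saved earlier in the walkthrough.
X = r"\d+\)|\d+\.(?!\d)"

# Match a number, then capture every following character that does not
# itself start another number.
between = re.compile(rf"(?:{X})((?:(?!{X}).)*)")

pieces = between.findall(test)
print(pieces)
```

With the test string from the walkthrough this yields seven pieces, four of which are empty or whitespace-only, matching the edge cases listed earlier.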
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
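For comparison, a rough Python equivalent of the PowerShell snippet, assuming the same test string and delimiter pattern. re.split keeps the text before the first number as the first element, so it is dropped with a slice, mirroring select-object -skip 1.

```python
import re

string = ("ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one "
          "and 3. yet another case 4)5) 6)10.")
pattern = re.compile(r"\d+\)|\d+\.(?!\d)")

# Split on the numbered markers and skip the text before the first one.
parts = pattern.split(string)[1:]
print(parts)
```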
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
# Raw string so the backslashes reach the regex engine unchanged
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']
My example log file is big and contains below lines.
<6>[16495.700255]
Memory - START UC1
<4>16495.723327 C0 Memory - START UC1
<4>[16495.723327] C0 [ sh] Memory - START UC1
I am looking for Memory - START UC1
The below regular expression gets the first two lines but not the third.
re.compile("(Memory - +(.*)$)")
Use re.MULTILINE as a flag to re.compile or add (?m) to the start of the Regex. The $ only matches the end of the string unless MULTILINE mode is on, when it matches the end of any line.
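A small sketch of the difference the flag makes, assuming the whole log has been read into one string (the sample lines are the ones from the question):

```python
import re

log = (
    "<6>[16495.700255]\n"
    "Memory - START UC1\n"
    "<4>16495.723327 C0 Memory - START UC1\n"
    "<4>[16495.723327] C0 [ sh] Memory - START UC1\n"
)

# Without MULTILINE, $ only matches at the very end of the string
# (or just before a final newline), so only the last line is found.
without_flag = re.findall(r"Memory - +(.*)$", log)

# With MULTILINE, $ also matches at the end of every line.
with_flag = re.findall(r"Memory - +(.*)$", log, re.MULTILINE)

print(without_flag)
print(with_flag)
```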
I copied the original regex from your question - re.compile("(Memory - +(.*)$)") into the code from your follow-up answer, and ran that against the sample text from your question, and got all three matches.
@Smac89's suggestion of re.compile("(.*?Memory - START UC1)") is only necessary if you are calling the regex with event_regex.match(line), which is implicitly anchored to the beginning of the string (^); if you use search(line) or findall(line) then the .*? doesn't do anything except make the regex harder to read: it non-greedily matches zero or more of anything, so if you're not anchored to the start of the string then it will end up matching zero characters anyway.
And I'm afraid that the suggestion of [^.* ]? makes even less sense, unless I'm terribly mistaken (which happens far too often). That says: match zero or one characters from the character group that consists of all characters except a literal ., a literal *, or a space. Which, again, if you're not anchored to the beginning of the string, that part of the regex will end up most likely matching zero characters anyway.
Honestly, if you know that you want to match the exact string Memory - START UC1, then you're probably better off with a simple 'Memory - START UC1' in line check rather than a regex.
But your initial regex contained + (that's 'space plus') - one or more spaces - and if the number of spaces can vary, then yes you do want a regex. You might also consider \s+ in that case, which matches both spaces and tabs (and a few other rarer whitespacey characters). If there's a possibility of trailing spaces, then you should put \s* just before your $ end-of-string anchor. (I actually suspect that trailing space was the reason your initial regex was not matching that third occurrence of your target string.)
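A hedged sketch of the whitespace-tolerant variant described above. The sample lines are invented to show varying and trailing whitespace, and the lazy group is one possible way to keep the trailing spaces out of the capture.

```python
import re

# \s+ tolerates any run of spaces or tabs around the dash;
# \s* before $ swallows trailing whitespace so it is not captured.
pattern = re.compile(r"Memory\s+-\s+(.*?)\s*$")

lines = [
    "Memory - START UC1",
    "<4>[16495.723327] C0 [ sh] Memory -   START UC1   \t",
    "unrelated line",
]

found = []
for line in lines:
    m = pattern.search(line)  # search: at most one match per line is needed
    found.append(m.group(1) if m else None)

print(found)
```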
A couple of other tips:
In your initial regex, "(Memory - +(.*)$)" you have two capture groups (i.e. sets of parentheses) but I suspect that you only actually want one, depending on whether you're interested only in the "START UC1" part or all of "Memory - START UC1".
Also, your if not line: clause never fires, because blank lines still have a linebreak. You could do line.strip() - since you already do a line.strip() later, I would just put a line = line.strip() at the top of the loop and then just use line thereafter, rather than repeating the function call. It's a good thought to early-out, but in this case I'm not sure that it really saves you anything, since it doesn't take the regex engine long to figure out that there's no match on a blank line.
Final thought: It looks like you are only expecting at most one match on a given line. If that's the case, then use search(...) rather than findall(...). No need to keep looking after you've found what you wanted.
Regexes involve a bit of a learning curve, but they are amazingly powerful once you grok them. Keep at it!
Change your compile to:
re.compile("(.*?Memory - START UC1)")
see if that helps
It seems to work on ideone
If you just want to get the word, replace the regex with:
regex = compile(r'([^.* ]?Memory - START UC1)')
I'm trying to find all instances of the keyword "public" in some Java code (with a Python script) that are not in comments or strings, a.k.a. not found following //, in between a /* and a */, and not in between double or single quotes, and which are not part of variable names-- i.e. they must be preceded by a space, tab, or newline, and must be followed by the same.
So here's what I have at the moment--
//.*\spublic\s.*\n
/\*.*\spublic\s.*\*/
".*\spublic\s.*"
'.*\spublic\s.*'
Am I messing this up at all?
But that finds exactly what I'm NOT looking for. How can I turn it around and search the inverse of the sum of those four expressions, as a single regex?
I've figured out this probably uses negative look-ahead and look-behind, but I still can't quite piece it together. Also, for the /**/ regex, I'm concerned that .* doesn't match newlines, so it would fail to recognize that this public is in a comment:
/*
public
*/
Everything below this point is me thinking on paper and can be disregarded. These thoughts are not fully accurate.
Edit:
I daresay (?<!//).*public.* would match anything not in single line comments, so I'm getting the hang of things. I think. But still unsure how to combine everything.
Edit2:
So then-- following that idea, I |ed them all to get--
(?<!//).*public.*|(?<!/\*).*public.\*/(?!\*/)|(?<!").*public.*(?!")|(?<!').*public.*(?!')
But I'm not sure about that. //public will not be matched by the first alternate, but it will be matched by the second. I need to AND the look-aheads and look-behinds, not OR the whole thing.
I'm sorry, but I'll have to break the news to you, that what you are trying to do is impossible. The reason is mostly because Java is not a regular language. As we all know by now, most regex engines provide non-regular features, but Python in particular is lacking something like recursion (PCRE) or balancing groups (.NET) which could do the trick. But let's look into that in more depth.
First of all, why are your patterns not as good as you think they are? (for the task of matching public inside those literals; similar problems will apply to reversing the logic)
As you have already recognized, you will have problems with line breaks (in the case of /*...*/). This can be solved by either using the modifier/option/flag re.S (which changes the behavior of .) or by using [\s\S] instead of . (because the former matches any character).
But there are other problems. You only want to find surrounding occurrences of the string or comment literals. You are not actually making sure that they are specifically wrapped around the public in question. I'm not sure how much you can put onto a single line in Java, but if you had an arbitrary string, then later a public and then another string on a single line, then your regex would match the public because it can find the " before and after it. Even if that is not possible, if you have two block comments in the same input, then any public between those two block comments would cause a match. So you would need to find a way to assert only that your public is really inside "..." or /*...*/ and not just that these literals can be found anywhere to the left or right of it.
Next thing: matches cannot overlap. But your match includes everything from the opening literal until the ending literal. So if you had "public public" that would cause only one match. And capturing cannot help you here. Usually the trick to avoid this is to use lookarounds (which are not included in the match). But (as we will see later) the lookbehind doesn't work as nicely as you would think, because it cannot be of arbitrary length (only in .NET that is possible).
Now the worst of all. What if you have " inside a comment? That shouldn't count, right? What if you have // or /* or */ inside a string? That shouldn't count, right? What about ' inside "-strings and " inside '-strings? Even worse, what about \" inside "-string? So for 100% robustness you would have to do a similar check for your surrounding delimiters as well. And this is usually where regular expressions reach the end of their capabilities and this is why you need a proper parser that walks the input string and builds a whole tree of your code.
But say you never have comment literals inside strings and you never have quotes inside comments (or only matched quotes, because they would constitute a string, and we don't want public inside strings anyway). So we are basically assuming that every one of the literals in question is correctly matched, and they are never nested. In that case you can use a lookahead to check whether you are inside or outside one of the literals (in fact, multiple lookaheads). I'll get to that shortly.
But there is one more thing left. Why does (?<!//).*public.* not work? For this to match it is enough for (?<!//) to match at any single position. E.g., if you just had the input // public, the engine would try out the negative lookbehind right at the start of the string (to the left of the start of the string), would find no //, then use .* to consume // and the space and then match public. What you actually want is (?<!//.*)public. This will start the lookbehind from the starting position of public and look all the way to the left through the current line. But... this is a variable-length lookbehind, which is only supported by .NET.
But let's look into how we can make sure we are really outside of a string. We can use a lookahead to look all the way to the end of the input, and check that there is an even number of quotes on the way.
public(?=[^"]*("[^"]*"[^"]*)*$)
Now if we try really hard we can also ignore escaped quotes when inside of a string:
public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)
So once we encounter a " we will accept either non-quote, non-backslash characters, or a backslash character and whatever follows it (that allows escaping of backslash-characters as well, so that in "a string\\" we won't treat the closing " as being escaped). We can use this with multi-line mode (re.M) to avoid going all the way to the end of the input (because the end of the line is enough):
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)
(re.M is implied for all following patterns)
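To see the per-line, escape-aware lookahead in action, here is a small sketch. The Java snippet is invented for illustration, and line numbers are derived by counting newlines before each match.

```python
import re

# The double-quote lookahead from above, compiled with re.M so that $
# matches at the end of each line.
outside_double_quotes = re.compile(
    r'public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)',
    re.M,
)

# Hypothetical input: line 2 has "public" inside a string, line 3 has a
# real keyword followed by a string containing an escaped quote.
java = (
    'public class A {\n'
    '    String s = "not public here";\n'
    '    public String t = "a \\" b";\n'
    '}\n'
)

lines = [
    java.count("\n", 0, m.start()) + 1
    for m in outside_double_quotes.finditer(java)
]
print(lines)
```

One limitation to keep in mind: the escape handling only applies to strings that open after the keyword. If the keyword sits inside a string and an escaped quote follows it within that same string, the leading [^"\r\n]* steps over the backslash and the count of quotes comes out even, so that occurrence would be reported anyway.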
This is what it looks for single-quoted strings:
public(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)
For block comments it's a bit easier, because we only need to look for /* or the end of the string (this time really the end of the entire string), without ever encountering */ on the way. That is done with a negative lookahead at every single position until the end of the search:
public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
But as I said, we're stumped on the single-line comments for now. But anyway, we can combine the last three regular expressions into one, because lookaheads don't actually advance the position of the regex engine on the target string:
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
Now what about those single-line comments? The trick to emulate variable-length lookbehinds is usually to reverse the string and the pattern - which makes the lookbehind a lookahead:
cilbup(?!.*//)
Of course, that means we have to reverse all other patterns, too. The good news is, if we don't care about escaping, they look exactly the same (because both quotes and block comments are symmetrical). So you could run this pattern on a reversed input:
cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
You can then find the match positions in your actual input by using inputLength -foundMatchPosition - foundMatchLength.
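A minimal sketch of the reversal trick on its own, using only the single-line-comment check. The two-line input is made up; the position formula is the one given above.

```python
import re

source = "public int a; // public in a comment\npublic int b;"

# Reverse the input so "is there a // to the left?" becomes a lookahead.
reversed_source = source[::-1]
pattern = re.compile(r"cilbup(?!.*//)")

# Map each match back: inputLength - foundMatchPosition - foundMatchLength.
positions = sorted(
    len(source) - m.start() - len(m.group())
    for m in pattern.finditer(reversed_source)
)
print(positions)
```

The occurrence inside the comment is dropped because, in the reversed line, a // follows it; the other two map back to the start of each line.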
Now what about escaping? That gets quite annoying now, because we have to skip quotes if they are followed by a backslash. Because of some backtracking issues we need to take care of that in five places. Three times, when consuming non-quote characters (because we need to allow "\ as well now). And twice, when consuming quote characters (using a negative lookahead to make sure there is no backslash after them). Let's look at double quotes:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)
(It looks horrible, but if you compare it with the pattern that disregards escaping, you will notice the few differences.)
So incorporating that into the above pattern:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
So this might actually do it for many cases. But as you can see it's horrible, almost impossible to read, and definitely impossible to maintain.
What were the caveats? No comment literals inside strings, no string literals inside strings of the other type, no string literals inside comments. Plus, we have four independent lookaheads, which will probably take some time (at least I think I have avoided most of the backtracking).
In any case, I believe this is as close as you can get with regular expressions.
EDIT:
I just realised I forgot the condition that public must not be part of a longer literal. You included spaces, but what if it's the first thing in the input? The easiest thing would be to use \b. That matches a position (without including surrounding characters) that is between a word character and a non-word character. However, Java identifiers may contain any Unicode letter or digit, and I'm not sure whether Python's \b is Unicode-aware. Also, Java identifiers may contain $. Which would break that anyway. Lookarounds to the rescue! Instead of asserting that there is a space character on every side, let's assert that there is no non-space character. Because we need negative lookarounds for that, we will get the advantage of not including those characters in the match for free:
(?<!\S)cilbup(?!\S)(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
And because just from scrolling this code snippet to the right one cannot quite grasp how ridiculously huge this regex is, here it is in freespacing mode (re.X) with some annotations:
(?<!\S) # make sure there is no trailing non-whitespace character
cilbup # public
(?!\S) # make sure there is no leading non-whitespace character
(?= # lookahead (effectively lookbehind!) to ensure we are not inside a
# string
(?:[^"\r\n]|"\\)*
# consume everything except for line breaks and quotes, unless the
# quote is followed by a backslash (preceded in the actual input)
(?: # subpattern that matches two (unescaped) quotes
"(?!\\) # a quote that is not followed by a backslash
(?:[^"\r\n]|"\\)*
# we've seen that before
"(?!\\) # a quote that is not followed by a backslash
(?:[^"\r\n]|"\\)*
# we've seen that before
)* # end of subpattern - repeat 0 or more times (ensures even no. of ")
$ # end of line (start of line in actual input)
) # end of double-quote lookahead
(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)
# the same horrible bastard again for single quotes
(?= # lookahead (effectively lookbehind) for block comments
(?: # subgroup to consume anything except */
(?![*]/) # make sure there is no */ coming up
[\s\S] # consume an arbitrary character
)* # repeat
(?:/[*]|\Z)# require to find either /* or the end of the string
) # end of lookahead for block comments
(?!.*//) # make sure there is no // on this line
Have you considered replacing all comments and single and double quoted string literals with null strings using the re sub() method. Then just do a simple search/match/find of the resulting file for the word you're looking for?
That would at least give you the line numbers where the word is located. You may be able to use that information to edit the original file.
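One way this idea could look in Python is sketched below. It deviates slightly from the suggestion: instead of replacing comments and literals with empty strings, it overwrites them with spaces of the same length, so that match offsets (and line numbers) stay valid for the original file. The SKIP pattern, the helper names, and the sample input are all assumptions made for illustration, and the pattern does not cover every Java construct.

```python
import re

# Comments and literals to blank out. The leftmost construct wins, so a
# "//" inside a string is consumed as part of that string.
SKIP = re.compile(
    r"""
      //[^\n]*               # single-line comment
    | /\*[\s\S]*?\*/         # block comment
    | "(?:[^"\\\n]|\\.)*"    # double-quoted string
    | '(?:[^'\\\n]|\\.)*'    # single-quoted literal
    """,
    re.VERBOSE,
)


def mask(source):
    """Replace comments and literals with spaces, keeping newlines."""
    return SKIP.sub(lambda m: re.sub(r"[^\n]", " ", m.group()), source)


def find_keyword(source, keyword="public"):
    """Return offsets of keyword occurrences surrounded by whitespace."""
    masked = mask(source)
    pattern = rf"(?<!\S){re.escape(keyword)}(?!\S)"
    return [m.start() for m in re.finditer(pattern, masked)]


java = (
    'public class A {\n'
    '  // public note\n'
    '  String s = "a public b";\n'
    '  /* public\n'
    '  */ public int x;\n'
    '}\n'
)
print(find_keyword(java))
```

Because the masked text has the same length as the input, the returned offsets can be used directly against the original source.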
You could use pyparsing to find the public keyword outside a comment or a double-quoted string:
from pyparsing import Keyword, javaStyleComment, dblQuotedString
keyword = "public"
expr = Keyword(keyword).ignore(javaStyleComment | dblQuotedString)
Example
for [token], start, end in expr.scanString(r"""{keyword} should match
/*
{keyword} should not match "
*/
// this {keyword} also shouldn't match
"neither this \" {keyword}"
but this {keyword} will
re{keyword} is ignored
'{keyword}' - also match (only double quoted strings are ignored)
""".format(keyword=keyword)):
assert token == keyword and len(keyword) == (end - start)
print("Found at %d" % start)
Output
Found at 0
Found at 146
Found at 187
To also ignore single-quoted strings, you could use quotedString instead of dblQuotedString.
To do it with regexes only, see the regex-negation tag on SO, e.g., "Regular expression to match string not containing a word?" or, using even fewer regex capabilities, "Regex: Matching by exclusion, without look-ahead - is it possible?". The simple way would be to use a positive match and skip the matched comments and quoted strings. Whatever is left is the set of matches you want.
It's finding the opposite because that's just what you're asking for. :)
I don't know a way to match them all in a single regex (though it should be theoretically possible, since the regular languages are closed under complements and intersections). But you could definitely search for all instances of public, and then remove any instances that are matched by one of your "bad" regexes. Try using, for example, set.difference on the values of match.start() and match.end() from re.finditer.
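A rough sketch of the set.difference idea, assuming three simple "bad" regexes (line comments, block comments, double-quoted strings). The helper names word_spans and spans_outside are hypothetical. Note that the bad regexes are tried independently of each other, so a quote inside a comment could still confuse this version.

```python
import re


def word_spans(pattern, text):
    # (start, end) pairs for every match of pattern in text.
    return {(m.start(), m.end()) for m in re.finditer(pattern, text)}


def spans_outside(word, bad_patterns, text):
    word_pattern = r"\b%s\b" % re.escape(word)
    every = word_spans(word_pattern, text)
    bad = set()
    for bad_pattern in bad_patterns:
        for region in re.finditer(bad_pattern, text):
            # Shift hits found inside a bad region back to whole-text offsets.
            for m in re.finditer(word_pattern, region.group()):
                bad.add((region.start() + m.start(), region.start() + m.end()))
    return sorted(every.difference(bad))


BAD_PATTERNS = [
    r"//[^\n]*",               # line comment
    r"/\*[\s\S]*?\*/",         # block comment
    r'"(?:\\.|[^"\\\n])*"',    # double-quoted string
]

text = 'public a; // public b\n/* public c */ "public d" public e'
result = spans_outside("public", BAD_PATTERNS, text)
print(result)  # [(0, 6), (48, 54)]
```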
I tested this regex on several Python web shells, and all of them hit the same issue.
If I use
a=re.findall(r"""<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>""",html)
print a
it's OK,
but if I use
a=re.findall(r"""<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>d""",html)
print a
it blocks the server, which waits forever as if it were dead. I have also tested this in RegexBuddy.
The only difference between the two snippets is that I added the character 'd' at the end of the second regular expression.
Can anyone explain why this occurs?
The expression [\s\S]*? can match any amount of anything. This can potentially cause an enormous amount of backtracking in the case that the match fails. If you are more specific about what you can and can't match then it will allow the match to fail faster.
Also, I'd advise you to use an HTML parser instead of regular expressions for this. Beautiful Soup is an excellent library that is easy to use.
Your regex is suffering from catastrophic backtracking. If it can find a match it's fine, but if it can't, it has to try a virtually infinite number of possibilities before it gives up. Every one of those [\s\S]*? constructs ends up trying to match all the way to the end of the document, and the interaction between them creates a staggering amount of useless work.
Python's re module did not support atomic groups before version 3.11, but here's a little trick you can use to imitate them:
a=re.findall(r"""(?=(<ul>[\s\S]*?<li><a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?<br/>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</li>[\s\S]*?</ul>))\1d""",html)
print a
If the lookahead succeeds, the whole <UL> element is captured in group #1, the match position resets to the beginning of the element, then the \1 backreference consumes the element. But if the next character is not d, it does not go back and muck about with all those [\s\S]*? constructs again, like your regex does.
Instead, the regex engine goes straight back to the beginning of the <UL> element, then bumps ahead one position (so it's between the < and the u) and tries the lookahead again from the beginning. It keeps doing that until it finds another match for the lookahead, or it reaches the end of the document. In this way, it will fail (the expected result) in about the same time your first regex took to succeed.
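To see the "no going back" behaviour in isolation, here is a toy comparison on a three-character string (the input and variable names are made up for the demonstration). The plain pattern succeeds by giving one character back. The lookahead-plus-backreference version cannot give anything back, so it fails.

```python
import re

text = "aaa"

# Plain greedy quantifier: a+ first takes all three letters, then gives one
# back so that the final "a" can match.
plain = re.search(r"a+a", text)

# Lookahead + backreference: group 1 captures the whole run inside the
# lookahead, \1 consumes exactly that text, and the engine never re-enters
# the lookahead to try a shorter capture at the same start position.
atomic_like = re.search(r"(?=(a+))\1a", text)

print(plain is not None)        # True
print(atomic_like is not None)  # False
```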
Note that I'm not presenting this trick as a solution, just trying to answer your question as to why your regex seems to hang. If I were offering a solution, I would say to stop using [\s\S]*? (or [\s\S]*, or .*, or .*?); you're relying on that too much. Try to be as specific as you reasonably can--for example, instead of:
<a href="(?P<link>[\s\S]*?)"[\s\S]*?<img src="(?P<img>[\s\S]*?)"[\s\S]*?
...use:
<a href="(?P<link>[^"]*)"[^>]*><img src="(?P<img>[^"]*)"[^>]*>
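A quick check of the tighter pattern on a made-up fragment (the HTML and the name LINK_AND_IMAGE are assumptions for illustration). Since [^"]* cannot run past the closing quote and [^>]* cannot run past the end of the tag, a failed attempt is abandoned within the current tag rather than at the end of the document.

```python
import re

LINK_AND_IMAGE = re.compile(
    r'<a href="(?P<link>[^"]*)"[^>]*><img src="(?P<img>[^"]*)"[^>]*>'
)

html = ('<ul><li><a href="/item/1" class="x">'
        '<img src="/img/1.png" alt=""><br/></li></ul>')

match = LINK_AND_IMAGE.search(html)
print(match.group("link"), match.group("img"))

# No <img> directly after the link: the search returns None.
no_image = LINK_AND_IMAGE.search('<a href="/item/2">text</a>')
print(no_image)
```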
But even that has serious problems. You should seriously consider using an HTML parser for this job. I love regexes too, but you're asking too much from them.
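For comparison, a minimal parser-based sketch. It uses the standard library's html.parser rather than Beautiful Soup, purely to avoid a third-party dependency. The class name LinkImageCollector and the sample markup are assumptions, and the real page structure may need different handling.

```python
from html.parser import HTMLParser


class LinkImageCollector(HTMLParser):
    """Collect (href, src) pairs for <img> tags nested inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.current_href = None   # href of the <a> we are inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.current_href = attributes.get("href")
        elif tag == "img" and self.current_href is not None:
            self.pairs.append((self.current_href, attributes.get("src")))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


html = ('<ul>'
        '<li><a href="/item/1"><img src="/img/1.png"></a><br/></li>'
        '<li><a href="/item/2"><img src="/img/2.png"></a><br/></li>'
        '</ul>')

collector = LinkImageCollector()
collector.feed(html)
print(collector.pairs)
```

The parser makes a single pass over the document, so there is nothing for it to backtrack over when the markup does not have the expected shape. It simply collects fewer pairs.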