Try to rescan with re.scanner if "False" - python

How can I get re.Scanner to attempt next options?
I have some nested groups. When we return None, we skip over this text.
Is there a way, from within the callback function, to make the scanner retry with the next rules (i.e. have the function hit, but not produce output, and retry matching)?
Since we cannot have negative lookahead I was thinking that it would be useful in the function itself to communicate to the scanner to be allowed to try the next rules.
s = re.Scanner([
    ("a", retry_under_some_conditions),  # I want to try the next rule, "aa"
    ("aa", None)  # essentially means skip over "aa"
])
s.scan("aa") # output should not be from retry function, but "None"; it matches on "aa".
This is not the actual problem, but it is related enough (just don't try to solve this actual problem in a traditional way). Being able to apply next rules would be very powerful.
I'm now thinking it might not be possible. If the regex engine optimization can only match in a straightforward way and does a complete parse, and only afterwards applies the functions, then this is not possible.

How can I get re.Scanner to attempt next options?
Short answer: You can't.
Since we cannot have negative lookahead
I'm a bit puzzled. Because you can:
(r"a(?!a)", retry_under_some_conditions)
That would work, and the scanner would try the next rule. So I'm assuming the reason you can't, has something to do with your project / actual problem.
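As a minimal sketch of that lookahead rule in action: this assumes re.Scanner (which is undocumented) behaves as it does in current CPython, and keep_token is a hypothetical stand-in for retry_under_some_conditions.

```python
import re

def keep_token(scanner, token):
    # Hypothetical callback: re.Scanner calls it with (scanner, matched_text).
    return token

s = re.Scanner([
    (r"a(?!a)", keep_token),  # only an "a" not followed by another "a" gets here
    (r"aa", None),            # None means: match "aa" but emit nothing
])

# scan() returns a tuple: (list_of_results, unmatched_remainder)
lone = s.scan("a")
double = s.scan("aa")
print(lone)    # (['a'], '')
print(double)  # ([], '')
```

For "aa" the first rule fails because of the lookahead, so the second rule matches and the text is skipped without the callback ever being called.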
I have some nested groups. When we return None, we skip over this text.
That's just because of how the scanner works. When the scanner is created, it takes your patterns, in this case a and aa, and combines them into (a)|(aa). Each pattern becoming its own subpattern (group) is the important part, because that is how the scanner knows which rule matched.
If we instead take the negative lookahead into account, the combined pattern becomes (a(?!a))|(aa). Then doing this:
import re

string = "a aa"
pattern = r"(a(?!a))|(aa)"
for match in re.finditer(pattern, string):
    print(match.lastindex, match.span(), match.group())
Would print:
1 (0, 1) a
2 (2, 4) aa
The key here is that the scanner uses match.lastindex to map the match back to the callback or value to return as a result.
I'm now thinking it might not be possible
Without the lookahead, or some other way to make the first subpattern fail so that the next one is tried, this isn't possible, neither with the scanner nor with regex in general. Alternatives are tried in order: once the first subpattern matches, the others are skipped. Call it a limitation of regex if you will, but that's just how regex works.
The solution would then be to make your own scanner and scan each pattern individually.
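A rough sketch of such a hand-rolled scanner, assuming a design of my own rather than anything re.Scanner provides: each rule is tried separately at the current position, and a callback can return a hypothetical RETRY sentinel to decline the match so that the next rule gets a chance. The names RETRY, scan_with_retry and the callback body are illustrative only.

```python
import re

RETRY = object()  # hypothetical sentinel: "pretend this rule did not match"

def scan_with_retry(rules, text):
    """Try each (pattern, action) rule in order at every position."""
    compiled = [(re.compile(pattern), action) for pattern, action in rules]
    results = []
    pos = 0
    while pos < len(text):
        for regex, action in compiled:
            m = regex.match(text, pos)
            if m is None or m.end() == pos:
                continue  # no match (or an empty one): try the next rule
            value = action(m) if callable(action) else action
            if value is RETRY:
                continue  # the callback declined: try the next rule
            if value is not None:
                results.append(value)
            pos = m.end()
            break
        else:
            break  # no rule consumed anything: stop, like re.Scanner does
    return results, text[pos:]

def retry_under_some_conditions(match):
    # Illustrative condition: decline when another "a" follows.
    if match.string[match.end():match.end() + 1] == "a":
        return RETRY
    return "A"

rules = [("a", retry_under_some_conditions), ("aa", None)]
print(scan_with_retry(rules, "aa"))   # ([], '')
print(scan_with_retry(rules, "aaa"))  # (['A'], '')
```

Unlike re.Scanner, this sketch passes the match object to the callback, which gives the callback enough context to decide whether to decline.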

Related

How can I get Regex to remove redundancies and call itself again?

I have a simple function which when given an input like (x,y), it will return {{x},{x,y}}.
In the cases that x=y, it naturally returns {{x},{x,x}}.
I can't figure out how to get Regex to substitute 'x' in place of 'x,x'. But even if I could figure out how to do this, the expression would go from {{x},{x,x}} to {{x},{x}}, which itself would need to be substituted for {{x}}.
The closest I have gotten has been:
re.sub('([0-9]+),([0-9]+)',r'\1',string)
But this function will also turn {{x},{x,y}} into {{x},{x}}, which is not desired. Also you may notice that the function searches for numbers only, which is fine because I really only intend to be using numbers in the place of x and y; however, if there is a way to get it to work with any letter as well (lower case or capital), that would be even better.
Note also that if I give my original function (x,y,z) it will read it as ((x,y),z) and thus return {{{{x},{x,y}}},{{{x},{x,y}},z}}, thus in the case that x=y=z, I would want to be able to have a Regex function call itself repeatedly to reduce this to {{{{x}}},{{{x}},x}} instead of {{{{x},{x,x}}},{{{x},{x,x}},x}}.
If it helps at all, this is essentially an attempt at making a translation (into sets) using the Kuratowski definition of an ordered pair.
Essentially to solve this you need recursion, or more simply, keep applying the regex in a loop until the replacement doesn't change the input string. For example using your regex from https://regex101.com/r/Yl1IJv/4:
import re

s = '{{ab},{ab,ab}}'
while True:
    # Collapse "X,X" into "X"; repeat until nothing changes any more.
    news = re.sub(r'(?P<first>.?(\w+|\d+).?),(?P=first)', r'\g<1>', s)
    if news == s:
        break
    s = news
print(s)
Output
{{ab}}
With
s = '{{{{x},{x,x}}},{{{x},{x,x}},x}}'
The output is
{{{{x}}},{{{x}},x}}
as required.
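If the loop is needed in more than one place, it could be wrapped in a small helper. This is only a sketch; sub_until_stable and DUPLICATE are hypothetical names, and the pattern is the one from the answer above.

```python
import re

def sub_until_stable(pattern, repl, text):
    """Apply re.sub repeatedly until the text stops changing."""
    while True:
        new_text = re.sub(pattern, repl, text)
        if new_text == text:
            return text
        text = new_text

# Same pattern as above: something, a comma, then the same thing again.
DUPLICATE = r'(?P<first>.?(\w+|\d+).?),(?P=first)'

pair = sub_until_stable(DUPLICATE, r'\g<first>', '{{ab},{ab,ab}}')
triple = sub_until_stable(DUPLICATE, r'\g<first>', '{{{{x},{x,x}}},{{{x},{x,x}},x}}')
print(pair)    # {{ab}}
print(triple)  # {{{{x}}},{{{x}},x}}
```

Each successful substitution shortens the string, so the loop always terminates.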

Python regex matching words with repeating consonant

First off, this is homework. (I couldn't use a tag in the title and nothing showed up in the tag list at the bottom for homework, so please let me know if I should EDIT something else regarding this matter).
So I have been reading through the python docs and scavenging SO, finding several solutions that are close to what I want, but not exact.
I have a dictionary which I read in to a string:
a
aa
aabbaa
...
z
We are practicing various regex patterns on this data.
The specific problem here is to return a list of words which match the pattern, NOT tuples with the groups within each match.
For example:
Given a subset of this dictionary like:
someword
sommmmmeword
someworddddd
sooooomeword
I want to return:
['sommmmmeword', 'someworddddd']
NOT:
[('sommmmmeword', 'mmmmm', ...), ...] # or any other variant
EDIT:
My reasoning behind the above example, is that I want to see how I can avoid making a second pass over the results. That is instead of saying:
res = re.match(re.compile(r'pattern'), dictionary)
return [r[0] for r in res]
I specifically want a mechanism where I can just use:
return re.match(re.compile(r'pattern'), dictionary)
I know that may sound silly, but I am doing this to really dig into regex. I mention this at the bottom.
This is what I have tried:
# learned about back refs
r'\b([b-z&&[^eiou]])\1+\b' -> # nothing
# back refs were weird, I want to match something N times
r'\b[b-z&&[^eiou]]{2}\b' -> # nothing
Somewhere in testing I noticed a pattern returning things like '\nsomeword'. I couldn't figure out what it was but if I find the pattern again I will include it here for completeness.
# Maybe the \b word markers don't work how I think?
r'.*[b-z&&[^eiou]]{2}' -> # still nothing
# Okay lets just try to match something in between anything
r'.*[b-z&&[^eiou]].*' -> # nope
# Since its words, maybe I should be more explicit.
r'[a-z]*[b-z&&[^eiou]][a-z]*' -> # still nope
# Decided to go back to grouping.
r'([b-z&&[^eiou]])(\1)' # I realize set difference may be the issue
# I saw someone (on SO) use set difference claiming it works
# but I gave up on it...
# OKAY getting close
r'(([b-df-hj-np-tv-xz])(\2))' -> [('ll', 'l', 'l'), ...]
# Trying the previous ones without set difference
r'\b(.*(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # returned everything (all words)
# Here I realize I need a non-greedy leading pattern (.* -> .*?)
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3}).*)\b' -> # still everything
# Maybe I need the comma in {3,} to get anything 3 or more
r'\b(.*?(?:[b-df-hj-np-tv-xz]{3,}).*)\b' -> # still everything
# okay I'll try a 1 line test just in case
r'\b(.*?([b-df-hj-np-tv-xz])(\2{3,}).*)\b'
# Using 'asdfdffff' -> [('asdfdffff', 'f', 'fff')]
# Using dictionary -> [] # WAIT WHAT?!
How does this last one work? Maybe there are no words with 3+ repeating consonants? I'm using /usr/share/dict/cracklib-small on my school's server, which is about 50,000 words I think.
I am still working on this but any advice would be awesome.
One thing I find curious is that you can not back reference a non-capturing group. If I want to output only the full word, I use (?:...) to avoid capture, but then I can not back reference. Obviously I could leave the captures, loop over the results and filter out the extra stuff, but I absolutely want to figure this out using ONLY regex!
Perhaps there is a way to do the non-capture, but still allow back reference? Or maybe there is an entirely different expression I haven't tested yet.
Here are some points to consider:
Use re.findall to get all the results, not re.match (that only searches for 1 match and only at the string start).
[b-z&&[^eiou]] is Java/ICU regex syntax (character class intersection); it is not supported by Python re. In Python, you can either redefine the ranges to skip the vowels, or use (?![eiou])[b-z].
To avoid "extra" values in tuples with re.findall, do not use capturing groups. If you need backreferences, use re.finditer instead of re.findall and access .group() of each match.
Coming back to the question, how you can use a backreference and still get the whole match, here is a working demo:
import re
s = """someword
sommmmmeword
someworddddd
sooooomeword"""
res = [x.group() for x in re.finditer(r"\w*([b-df-hj-np-tv-xz])\1\w*", s)]
print(res)
# => ['sommmmmeword', 'someworddddd']
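To illustrate two of the points above, here is a small sketch using the lookahead form of the consonant class instead of hand-written ranges, and contrasting re.findall with re.finditer. The variable names are made up for the example, and "y" is excluded in the lookahead so the class matches the same letters as [b-df-hj-np-tv-xz].

```python
import re

words = """someword
sommmmmeword
someworddddd
sooooomeword"""

# A lowercase letter that is not a vowel (and not "y").
consonant = r"(?![aeiouy])[a-z]"
pattern = re.compile(r"\w*(" + consonant + r")\1\w*")

# findall returns the capturing group, not the whole word.
groups_only = pattern.findall(words)
# finditer gives match objects, so .group() is the whole match.
whole_words = [m.group() for m in pattern.finditer(words)]

print(groups_only)  # ['m', 'd']
print(whole_words)  # ['sommmmmeword', 'someworddddd']
```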

Simple regular expression not working

I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple: it is the typical situation where a user enters a range of pages, or single pages. I am reading the string and checking whether it is valid.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And would it be possible to merge both of them into one single regular expression, so that if the user enters either something like 1-2, 7-10 or something like 3,5,6,7 the expression will be recognized as valid?
Simpler is better
Matching the entire input with one regex isn't simple, as the proposed solutions show; at least it is not as simple as it could or should be. Such a regex quickly becomes unmaintainable, and anyone who isn't regex-savvy will probably scrap it in favor of a simpler, more explicit solution the first time they need to modify it.
Simplest
First split the entire string on commas with .split(",") into individual entries and strip surrounding whitespace from each one. You need these individual entries anyway to extract the usable numbers.
Then the test for each entry becomes very simple:
^(\d+)(?:-(\d+))?$
It says that the string must start with one or more digits, optionally followed by a single - and one or more digits, and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you will need the numbers parsed out to actually use them; this way you get them from the match without writing more code to parse the entry again.
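A minimal sketch of this split-then-test approach in Python follows. The names PAGE_ENTRY and parse_pages are hypothetical, and treating a single page as a range from n to n is my own assumption, not something the question asks for.

```python
import re

# One entry: a page number, optionally followed by "-" and a second number.
PAGE_ENTRY = re.compile(r"^(\d+)(?:-(\d+))?$")

def parse_pages(text):
    """Return (list of (start, end) tuples, list of invalid entries)."""
    pages, errors = [], []
    for raw in text.split(","):
        entry = raw.strip()
        m = PAGE_ENTRY.match(entry)
        if m is None:
            errors.append(entry)  # we know exactly which entry is wrong
            continue
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        pages.append((start, end))
    return pages, errors

print(parse_pages("1-3, 5-6, 12-67"))  # ([(1, 3), (5, 6), (12, 67)], [])
print(parse_pages("1,5,x,4.5"))        # ([(1, 1), (5, 5)], ['x', '4.5'])
```

Because invalid entries are collected rather than just rejected, the caller can report back exactly which part of the input was wrong.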
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Testing this -
>>> import re
>>> test_vals = [
...     '1-3, 5-6, 12-67',
...     '1,5,6,9,10,12',
...     '1-3,1,2,4.5',
...     'abcd',
... ]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
...     print(val)
...     if regex.match(val) is None:
...         print("Fail")
...     else:
...         print("Pass")
...
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail

Regular expression how to get middle strings

I want to search for string that occurs between a certain string. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get problem number and subproblem number and strings between which is answer.
I somewhat came up with expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
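The two-step idea could look roughly like this in Python. It is a sketch under assumptions: the sample text is made up, parse_answers is a hypothetical name, and the patterns are variants of the ones above rather than exact copies. In particular, step 1 ends each problem body with a lookahead for the next \problem{ or \end, so that more than one problem can be found.

```python
import re

# Hypothetical sample input in the same shape as the question's file.
text = (
    "\\start\n\n"
    "\\problem{1}\n"
    "\\subproblem{a}\n\nfirst answer\n\n"
    "\\subproblem{b}\n\nsecond answer\n\n"
    "\\problem{2}\n"
    "\\subproblem{a}\nthird\nanswer\n"
    "\\end\n"
)

# Step 1: each problem's number and body (body stops before the next
# \problem{ or before \end).
PROBLEM = re.compile(r"\\problem\{(.*?)\}\n(.+?)(?=\\problem\{|\\end)", re.DOTALL)
# Step 2: each subproblem's number and answer inside one problem body.
SUBPROBLEM = re.compile(r"\\subproblem\{(.*?)\}\s*(.*?)\s*(?=\\subproblem\{|\Z)", re.DOTALL)

def parse_answers(source):
    """Return (problem number, subproblem number, answer) tuples."""
    rows = []
    for problem in PROBLEM.finditer(source):
        number, body = problem.groups()
        for sub in SUBPROBLEM.finditer(body):
            rows.append((number, sub.group(1), sub.group(2)))
    return rows

rows = parse_answers(text)
for row in rows:
    print(row)
```

Since step 2 only ever sees one problem's body, every subproblem is attributed to the right problem without needing an optional group.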
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
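A tiny illustration of what those two flags change, on toy strings rather than the TeX input (the variable names are only for the example):

```python
import re

# (?s) is re.DOTALL: "." also matches a newline.
dot_plain = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")
print(dot_plain)  # None
print(dot_all)    # a match object for 'a\nb'

# (?x) is re.VERBOSE: unescaped whitespace in the pattern is ignored.
space_plain = re.search(r"a b", "ab")
space_verbose = re.search(r"(?x)a b", "ab")
print(space_plain)    # None
print(space_verbose)  # a match object for 'ab'
```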
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)}? \\subproblem{(.*?)} (.*?)) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea), or, even better, doing it in two steps as Regexident does.
If using regex to parse TeX is not a good idea, then what method would you suggest to parse TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.

Apply multiple negative regex to expression in Python

This question is similar to "How to concisely cascade through multiple regex statements in Python" except instead of matching one regular expression and doing something I need to make sure I do not match a bunch of regular expressions, and if no matches are found (aka I have valid data) then do something. I have found one way to do it but am thinking there must be a better way, especially if I end up with many regular expressions.
Basically I am filtering URLs for bad stuff ("", \\", etc.) that occurs when I yank what looks like a valid URL out of an HTML document but it turns out to be part of a JavaScript (and thus needs to be evaluated, and thus the escaping characters). I can't use Beautiful Soup to process these pages since they are far too mangled (actually I use BeautifulSoup, then fall back to my ugly but workable parser).
So far I have found the following works relatively well: I compile a dict of regular expressions outside the main loop (so I only have to compile them once, but benefit from the speed increase every time I use them). I then loop a URL through this dict; if there is a match the URL is bad, if not the URL is good:
regex_bad_url = {"1": re.compile('\"\"'),
                 "2": re.compile('\\\"')}
Followed by:
url_state = "good"
for key, pattern in regex_bad_url.items():
    match = pattern.search(url)
    if match:
        url_state = "bad"
if url_state == "good":
    pass  # do stuff here ...
Now the obvious thought is to use regex "or" ("|"), i.e.:
re.compile('(\"\"|\\\")')
Which reduces the number of compares and whatnot, but makes it much harder to troubleshoot. With one expression per compare I can easily add a print statement like:
print("URL: ", url, " matched by key ", key)
So is there some way to get the best of both worlds (i.e. minimal number of compares) yet still be able to print out which regex is matching the URL, or do I simply need to bite the bullet and have my slower but easier to troubleshoot code when debugging and then squoosh all the regexes together into one line for production (which means one more step of programming and code maintenance and possible problems)?
Update:
Good answer by Dave Webb, so the actual code for this would look like:
match = re.search(r'(?P<double_quotes>\"\")|(?P<slash_quote>\\\")', fullurl)
if match is None:
    pass  # do stuff here ...
else:
    # optional for debugging
    print("url matched by", match.lastgroup)
"Squoosh" all the regexes into one line, but put each in a named group using (?P<name>...), then use MatchObject.lastgroup to find which one matched.
