Regex named conditional lookahead (in Python) - python

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.

Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)

You can use something like that:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.

Related

Simple regular expression not working

I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple, it is the typical situation when an user intruduces a range of pages, or single pages. I am reading the string and checking if it is correct or not.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And, would it be possible to merge both of them in one single regular expression?, In a way that, if the user introduces either something like 1-2, 7-10 or something like 3,5,6,7 the expression will be recogniced as good.
Simpler is better
Matching the entire input isn't simple, as the proposed solutions show, at least it is not as simple as it could/should be. Will become read only very quickly and probably be scrapped by anyone that isn't regex savvy when they need to modify it with a simpler more explicit solution.
Simplest
First parse the entire string and .split(","); into individual data entries, you will need these anyway to process. You have to do this anyway to parse out the useable numbers.
Then the test becomes a very simple, test.
^(\d+)(?:-\(d+))?$
It says, that there the string must start with one or more digits and be followed by optionally a single - and one or more digits and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you are going to need the input parsed out to actually use it anyway, this way you get the numbers if they match without having to add more code to parse them again anyway.
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Demo here
Testing this -
>>> test_vals = [
'1-3, 5-6, 12-67',
'1,5,6,9,10,12',
'1-3,1,2,4',
'abcd',
]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
print val
if regex.match(val) == None:
print "Fail"
else:
print "Pass"
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail

Regular expression how to get middle strings

I want to search for string that occurs between a certain string. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get problem number and subproblem number and strings between which is answer.
I somewhat came up with expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)}? \\subproblem{(.*?)} (.*?)) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea)—-or, even better, doing it in two steps as Regexident does.
if using regex to parse TeX is not good idea, then what method would you suggest to parse TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.

Efficient way to do a large number of search/replaces in Python?

I'm fairly new to Python, and am writing a series of script to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)',line)
for tag in m:
tag_new = re.sub("\*t\([^\)]*\)","",tag)
tag_new = re.sub("\*p\([^\)]*\)","",tag_new)
# do many more searches...
if tag != tag_new:
line = line.replace(tag,tag_new,1) # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by using larger strings (call re.sub once instead of once for each line). Increases memory use, but shouldn't be a problem unless the file is HUGE, but also improves execution time.
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitution should be slightly faster than iterating the string many times, but if a lot of substitutions are perfomed the overhead of calling the function object that computes the substitution may be significant.

regular expression help with converting exp1^exp2 to pow(exp1, exp2)

I am converting some matlab code to C, currently I have some lines that have powers using the ^, which is rather easy to do with something along the lines \(?(\w*)\)?\^\(?(\w*)\)?
works fine for converting (glambda)^(galpha),using the sub routine in python pattern.sub(pow(\g<1>,\g<2>),'(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha));
Unfortunately, regular expressions aren't the right tool for handling nested structures. There are some regular expressions engines (such as .NET) which have some support for recursion, but most — including the Python engine — do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.
I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.
Nested parenthesis cannot be described by a regexp and require a full parser (able to understand a grammar, which is something more powerful than a regexp). I do not think there is a solution.
See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.
An alternative would be to iterate until all ^ have been exhausted. no?.
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
str!.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end

Python regular expressions with more than 100 groups?

Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit.
There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.
I found the easiest way was to
import regex as re
instead of
import re
The default _MAXCACHE for regex is 500 instead of 100 I believe. This is one of the many reasons I find regex to be a better module than re.
If I'm not mistaken, the "new" regex module (currently third-party, but intended to eventually replace the re module in the stdlib) does not have this limit, so you might give that a try.
I'm not sure what you're doing exactly, but try using a single group, with a lot of OR clauses inside... so (this)|(that) becomes (this|that). You can do clever things with the results by passing a function that does something with the particular word that is matched:
newContents, num = cregex.subn(lambda m: replacements[m.string[m.start():m.end()]], contents)
If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.
I doubt you really need to process 100 named groups by next commands or use it in regexp replacement command. It would be quite impractical. If you just need groups to express the rich conditions in regexp you can use non-grouping group.
(?:word1|word2)(?:word3|word4)
etc. Complex scenarios including nesting groups are possible.
There is no limit for non-grouping groups.
First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.
In [python-install-dir]/lib/sre_compile.py simply modify the compile() function by removing the following lines:
# in lib/sre_compile.py
if pattern.groups > 100:
raise AssertionError(
"sorry, but this version only supports 100 named groups"
)
For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.
Funnily enough, in the (Python 2.5) source there is a comment indicating that the 100 group limit is scheduled to be removed in future versions.
I've found that Python 3 doesn't have this limitation, whereas the same code ran in latest 2.7 displays this error.
When I run into this I had a really complex pattern that was actually composed of a bunch of high-level patterns joined by ORs, like this:
pattern_string = u"pattern1|" \
u"pattern2|" \
u"patternN"
pattern = re.compile(pattern_string, re.UNICODE)
for match in pattern.finditer(string_to_search):
pass # Extract data from the groups in the match.
As a workaround, I turned the pattern into a list and I used that list as follows:
pattern_strings = [
u"pattern1",
u"pattern2",
u"patternN",
]
patterns = [re.compile(pattern_string, re.UNICODE) for pattern_string in pattern_strings]
for pattern in patterns:
for match in pattern.finditer(string_to_search):
pass # Extract data from the groups in the match.
string_to_search = pattern.sub(u"", string_to_search)
I would say you could reduce the number of groups by using non-grouping parentheses, but whatever it is that you're doing seems like you want all these groupings.
in my case, i have a dictionary of n words and want to create a single regex that matches all of them.. ie: if my dictionary is
hello
goodbye
my regex would be: (^|\s)hello($|\s)|(^|\s)goodbye($|\s) ... it's the only way to do it, and works fine on small dictionaries, but when you have more tan 50 words, well...
It's very ease to resolve this error:
Open the re class and you'll see this constant _MAXCACHE = 100.
Change the value to 1000, for example, and do a test.

Categories