I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.
For example:
3x-> 3*x
4(x+5) -> 4*(x+5)
sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4)
Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. not complex regexes, not because they're pythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are ok. )
I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.
A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.
To accomplish your goal, use the following statement:
import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)
I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.
As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.
A more in-depth analysis of the regex follows:
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
(?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
\d : Matches any numerical digit once. A regex of \d all its own will match, separately, "1", "2", and "3" of "123".
+ : Matches the previous element one or more times. In this case, the previous element is \d, any number. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
| : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
(?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
[a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
\w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
* : Matches the previous element 0 or more times. While initially seeming unnecessary, the star character effectively makes an element optional. In this case, our modified element is \w, but a function doesn't technically need any more than one character; A() is a valid function name. A would be matched by [a-zA-Z], making \w unnecessary. On the other end of the spectrum, there could be any number of characters following the first letter, which is why we need this modifier.
\( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as parentheses, +, or * with a backslash, it uses it like a normal character. \( matches an opening parenthesis, for the actual function call part of the function.
\w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
\) : Like \(, but matches a closing parenthesis
((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
(?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. However, note that this is in a non-capturing group: this is important, because of the way the OR checks. The OR immediately following this looks at this group as a whole. If this was not grouped, the "last object matched" would be \w*, which would not be sufficient for what we want. It would say: "match one letter followed by more letters OR one letter followed by a parenthesis". Putting this element in a non-capturing group allows us to control what the OR registers.
| : Or character. Matches (?:[a-zA-Z]\w*) or \(.
\( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.
Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"
I think that about sums it up!
Related
This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 in to two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print m.group() # abcdefghi
print m.groups() # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print m.group() # abc
print m.groups() # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print re.findall(r"(\w{3})+", S) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print re.findall(r"(\w{3})+?", S) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.
You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+" you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the w3 block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with PyPi regex, you can. Its match object has a captures method. However, with re, you can just match them with re.findall(r'\w{3}', S). However, in this case, you will match all 3-word character chunks from the string, not just those consecutive ones. With the regex module, you can get all the 3-character consecutive chunks from the beginning of the string with the help of \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.
I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
values = re.search("(.*?)\s(.*)")
x = values.group(1)
y = values.gropu(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail? I tested it in the online regex tester(https://regex101.com/r/bM2nK1/1), and adding the ? in causes the second group to return nothing. Now as far as I know .*? means to take any value unlimited times, as few times as possible, and the .* is just the greedy version of that. What I'm confused about is why the non greedy version.*? takes that definition to mean capturing nothing?
Because it means to match the previous token, the *, as few times as possible, which is 0 times. If you would it to extend to the end of the string, add a $, which matches the end of string. If you would like it to match at least one, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.
#iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.
From Python 3.4.1 docs:
(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.
I'm trying to understand regex in Python. Could you please help me understand the second sentences, especially the bolded words? Any example will be appreciated.
Lookarounds are zero-width assertions. They don't consume any characters on the string.
To touch briefly on the bolded portions of the documentation:
This means that after looking ahead, the regular expression engine is back at the same position on the string from where it started looking. From there, it can start matching again...
The key point:
You can get a zero-width match which is a match that does not consume any characters. It only matches a position in the string. The point of zero-width is the validation to see if a regular expression can or cannot be matched looking ahead or looking back from the current position, without adding them to the overall match.
An answer in an example form. On string "xy":
(?:x) will match "x"
(?:x)x will not match, because there is no another x after x
(?:x)y will match "xy", by advancing over x and then y.
(?=x) will match "" at the start of the string, since x is following.
(?=x)x will match "x" - it recognises that an x follows, and then it advances over it.
(?=x)y will not match, since it affirms there is an x following, but then tries to advance over it using y.
Generally a Regular Expression engine is "consuming" your string character by character as it matches up with your regular expression.
If you use a look-ahead operator, the engine will instead simply look ahead without "consuming" any characters while it looks for a match.
Example
A good example is a regular expression to match a password where it needs to have a single numeric digit as well as be between 6-20 characters long.
You could write two checks (one to check if a digit exists, and one to check if the string length is as required), or use a single regular expression:
(?=.*\d).{6,20}
The first portion (?=.*\d)checks if there is digit anywhere in the string. When it completes we are back at the beginning of the string again (we were only "looking-ahead") and if it passed, we go onto the next portion of the regex.
Now .{6,20} is no longer a lookahead, and begins consuming the string. When the entire string is consumed, a match has been found.
I need to match the following sets of input:
foo_abc_bar
foo_bar
and get "abc" or an empty string as the result.
So this is the regular expression I wrote:
r'foo_(abc|)[_|]bar'
But for some reason, this does not match with the second string that I have given.
On further inspection, I found that [_|] does not match an empty string.
So, how do I solve this problem?
To make abc_ optional, you could use the question mark operator:
(abc_)?
Thus, the entire regex becomes:
r'foo_(abc_)?bar'
With this regex, the second underscore (if present) will become part of the capture group. If you don't want that, you could either remove it post-match with .rstrip('_') or use a slightly more complex regex:
r'foo_(?:(abc)_)?bar'
I found that [_|] does not match an empty string.
That's right. Square brackets denote a character group. The [_|] would match exactly one underscore or exactly one vertical bar, and nothing else. In other words, the vertical bar loses its special meaning when it appears inside a character group.
if you want a string pattern like this
xxx_xxx_xxx
xxx_xxx
then you need
([A-Za-z]{3})((_[A-Za-z]{3})+)?
but this will work also
r'foo(_abc)?_bar'
? means optional (may or may not match).