I am currently implementing a graphical calculator in Python, in a manner where you can type a natural expression and evaluate it. This is not a problem with most functions or operators, but as the factorial function is denoted by a ! after the operand, this is more difficult.
What I have is a string which contains a function, for example: '(2x + 1)!' which needs to be replaced with: 'math.factorial((2x + 1))'
However, the string may also include other terms such as: '2*x + (2x + 1)! - math.sin(x)'
and the factorial term may not necessarily contain brackets: '2!'
I've been trying to find a solution to this problem, but to no avail. I don't think the string.replace() method can do this directly. Is what I seek too ambitious, or is there some method by which I could achieve the desired result?
There are two levels of answer to your question: (1) solve your current problem; (2) solve your general problem.
(1) is pretty easy - the most common, versatile tool in general for doing string pattern matching and replacement is Regular Expressions (RE). You simply define the pattern you're looking for, the pattern you want out, and pass the RE engine your string. Python has a RE module built-in called re. Most languages have something similar. Some languages (eg. Perl) even have it as a core part of the language syntax.
A pattern is a series of either specific characters, or non-specific ("wildcard") characters. So in your case you want the non-specific characters before a specific '!' character. You seem to suggest that "before" in your case means either all the non-whitespace characters, or if the preceding character is a ')', then all the characters between that and the preceding '('. So let's build that pattern, starting with the version without the parentheses:
[\w] - the set of characters which are letters or numbers (we need a set of
characters that doesn't include whitespace or ')' so I'm taking some
liberty to keep the example simple - you could always build your own
more complex set with the '[]' pattern)
+ - at least one of them
! - the '!' character
And then the version with the parentheses:
\( - the literal '(' character, as opposed to the special function that ( has
. - any character
+ - at least one of them
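? - but don't be "greedy", i.e. only take the smallest set of characters that
match the pattern (for simple input this works out to be the closest pair of
parentheses; the match still starts at the first '(' the engine tries, so
nested or adjacent parentheses can defeat it)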
\) - the closing ')' character
! - the '!' character
Then we just need to put it all together. We use | to match the first pattern OR the second pattern. And we use ( and ) to denote the part of the pattern we want to "capture" - it's the bit before the '!' and inside the parentheses that we want to use later. So your pattern becomes:
([\w]+)!|\((.+?)\)!
Don't worry, RE expressions always come out looking like someone has just mashed the keyboard. There are some great tools out there like RegExr which help break down complex RE expressions.
Finally, you just need to take your captures and stick them in "math.factorial". \x means the xth capture group. If the first pattern matches, \2 will be blank and vice-versa (since Python 3.5, an unmatched group is substituted as an empty string), so we can just use both of them at once.
math.factorial(\1\2)
That's it! Here's how you run your RE in Python (note the r before the strings prevents Python from trying to process the \ as an escape sequence):
import re
re.sub(r'([\w]+)!|\((.+?)\)!', r'math.factorial(\1\2)', '2*x + (2x + 1)! - math.sin(x) + 2!')
re.sub takes three parameters (plus some optional ones not used here): the RE pattern, the replacement string, and the input string. This produces:
'2*x + math.factorial(2x + 1) - math.sin(x) + math.factorial(2)'
which is I believe what you're after.
Now, (2) is harder. If your intention really is to implement a calculator that takes strings as input, you're quickly going to drown in regular expressions. There will be so many exceptions and variations between what can be entered and what Python can interpret, that you'll end up with something quite fragile, that will fail on its first contact with a user. If you're not intending on having users you're pretty safe - you can just stick to using the patterns that work. If not, then you'll find the pattern matching method a bit limiting.
In general, the problem you're tackling is known as lexical analysis (or more fully, as the three step process of lexical analysis, syntactic analysis and semantic analysis). The standard way to tackle this is with a technique called recursive descent parsing.
Intriguingly, the Python interpreter performs exactly this process in interpreting the re statement above - compilers and interpreters all undertake the same process to turn a string into tokens that can be processed by a computer.
You'll find lots of tutorials on the web. It's a bit more complex than using RE, but allows significantly more generalisation. You might want to start with the very brief intro here.
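To make the idea concrete, here is a minimal sketch of a recursive descent evaluator that treats ! as a postfix operator. It is only an illustration under stated assumptions: the grammar, the Parser class and the evaluate function are invented for this example, and it handles numbers, + - * /, parentheses and ! only (no variables such as x, no function calls).

```python
import math
import re

# A token is a number (integer or decimal) or any single non-space character.
TOKEN = re.compile(r'\s*(?:(\d+(?:\.\d+)?)|(\S))')


def tokenize(text):
    """Turn the input into a flat list of numbers and one-character symbols."""
    tokens = []
    for number, symbol in TOKEN.findall(text):
        if number:
            tokens.append(float(number) if '.' in number else int(number))
        else:
            tokens.append(symbol)
    return tokens


class Parser:
    """Hypothetical parser: one method per grammar rule."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.take() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := postfix (('*' | '/') postfix)*
        value = self.postfix()
        while self.peek() in ('*', '/'):
            if self.take() == '*':
                value *= self.postfix()
            else:
                value /= self.postfix()
        return value

    def postfix(self):
        # postfix := atom '!'*
        value = self.atom()
        while self.peek() == '!':
            self.take()
            value = math.factorial(value)
        return value

    def atom(self):
        # atom := NUMBER | '(' expr ')'
        token = self.take()
        if isinstance(token, (int, float)):
            return token
        if token == '(':
            value = self.expr()
            if self.take() != ')':
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token: {token!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token: {parser.peek()!r}")
    return value


print(evaluate('2*3 + (2 + 1)! - 2!'))  # 10
```

Because the factorial lives in its own grammar rule, nested cases such as ((1 + 2)! + 1)! need no special handling, which is where the regular expression approach starts to struggle.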
You could remove the brackets and calculate everything in a linear form, such that when parsing the brackets it would evaluate the operand, and then apply the factorial function - in the order written.
Or, you could get the index of the factorial ! in the string, then if the character at the index before is a close bracket ) you know there is a bracketed operand that needs to be calculated prior to applying math.factorial().
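A rough sketch of that second idea: find each !, walk backwards to the matching open bracket (counting nesting depth), and wrap the operand. The function names operand_start and rewrite_factorials are made up for this illustration, and the sketch assumes ! only ever means factorial and is written directly after its operand with no space.

```python
def operand_start(expr, bang):
    """Return the index where the operand of the '!' at index `bang` begins."""
    k = bang - 1
    if k >= 0 and expr[k] == ')':
        # Walk left until the bracket that balances this ')' is found.
        depth = 0
        while k >= 0:
            if expr[k] == ')':
                depth += 1
            elif expr[k] == '(':
                depth -= 1
                if depth == 0:
                    break
            k -= 1
        if k < 0:
            raise ValueError("unbalanced parentheses before '!'")
        k -= 1  # step to the character before the matching '('
    # Extend over a number, a name, or a function name such as math.sin.
    while k >= 0 and (expr[k].isalnum() or expr[k] in '._'):
        k -= 1
    start = k + 1
    if start == bang:
        raise ValueError("'!' has no operand")
    return start


def rewrite_factorials(expr):
    """Replace every postfix '!' with a math.factorial(...) call."""
    bang = expr.find('!')
    while bang != -1:
        start = operand_start(expr, bang)
        replacement = 'math.factorial(' + expr[start:bang] + ')'
        expr = expr[:start] + replacement + expr[bang + 1:]
        # Continue searching after the text that was just inserted.
        bang = expr.find('!', start + len(replacement))
    return expr


print(rewrite_factorials('2*x + (2x + 1)! - math.sin(x) + 2!'))
# 2*x + math.factorial((2x + 1)) - math.sin(x) + math.factorial(2)
```

Unlike the single regular expression, this copes with nested brackets, because it counts them rather than guessing where the operand begins.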
Suppose I have a string which consists of a part of latex file. How can I use python re module to remove any math expression in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character that does not start with what was captured in Group 1
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
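For reference, this is roughly how the revised pattern could be applied from Python. The text is written as a raw string here so that \t in \text is not turned into a tab; the names MATH and ans are just for the example.

```python
import re

# $ or $$, then any characters that do not start the same delimiter, then the delimiter again.
MATH = re.compile(r'(\${1,2})(?:(?!\1)[\s\S])*\1')

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
ans = MATH.sub('', text)
print(ans)
# This is an example . How to remove it? Another random math expression ...
```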
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job—while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code —you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
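As a small illustration of the "parsers can count" point, here is a sketch that tokenizes \begin{...} and \end{...} and keeps a depth counter so nested environments are removed as one unit. The function name strip_environments and the default list of environment names are assumptions for this example; it ignores comments, verbatim blocks and category-code tricks entirely.

```python
import re

# Tokens of interest: \begin{name} and \end{name}.
TOKEN = re.compile(r'\\(begin|end)\{([A-Za-z*]+)\}')


def strip_environments(text, names=("equation", "eqnarray", "align")):
    """Remove the listed environments, including anything nested inside them."""
    kept = []
    depth = 0  # how many listed environments are currently open
    pos = 0    # start of the next stretch of text to keep
    for m in TOKEN.finditer(text):
        kind, name = m.group(1), m.group(2)
        if name.rstrip('*') not in names:
            continue
        if kind == 'begin':
            if depth == 0:
                kept.append(text[pos:m.start()])
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                pos = m.end()
    if depth == 0:
        kept.append(text[pos:])
    return ''.join(kept)


sample = r"A \begin{equation} x \begin{align} y \end{align} \end{equation} B"
print(strip_environments(sample))
```

An unclosed environment swallows the rest of the input in this sketch; a real tool would probably want to report that instead.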
I suck at Python regex and would love to see some solved examples to help me gain understanding. I am experimenting using http://pyregex.com/ which is great but need some 'good' examples to get me started.
I try to create a set of rules like so:
rules = [('name', r'[a-z]+'),
('operator', r'[+\-*]')]
which I have found but not confident enough to create my own regexes for cases like the ones listed below:
1. match only the = or += or *= characters
2. match the + character (i.e the operator as seen above) separately from the ++ characters
3. match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs. [edited - initially had followed which was wrong]
For 1. I have tried [\+=|=], for 2. I know the order in the rules is important and for 3. I am completely lost with the [] and on how I can generalize that case to work not just for int, but for float as well.
Any code examples will be greatly appreciated since I am only just starting with Python and coding!
match only the = or += or *= characters
r'[+*]?='
The [+*]?= consists of an optional atom, a character class [+*] that matches either a + or a *, ? - one or zero times, and a literal = symbol. Why not r'\+=|\*=|='? Not only the optional character class solution is shorter, but also it is more efficient: when you use alternation, you always have more redundant backtracking involved. You also need to be attentive to place the alternatives in a correct order, so that the longest appears first (although that does not always guarantee that the longest will match (depends on the branch subpatterns), or the order does not matter if there are anchors on both sides of the alternation group).
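A quick way to check this pattern is re.findall on a short sample; the variable names and sample string below are only for illustration.

```python
import re

ASSIGN = re.compile(r'[+*]?=')

source = 'a = 1; b += 2; c *= 3'
print(ASSIGN.findall(source))  # ['=', '+=', '*=']
```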
match the + character (i.e the operator as seen above) separately from the ++ characters
r'(?<!\+)\+(?!\+)'
This pattern matches a literal + (as it is escaped) and only in case it is neither preceded with another plus (see the negative lookbehind (?<!\+)) nor followed with another plus (see the negative lookahead (?!\+)). The lookarounds are non-consuming, i.e. the regex index remains right before a plus when it checks for a plus in front of it, and after the plus when it checks for a plus after it. The characters (or start/end of string positions) are not returned as part of the match (that is why they are called zero-width, non-capturing patterns).
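To see the lookarounds at work, the sketch below prints the positions where a lone + is found; the sample string is made up for the example.

```python
import re

LONE_PLUS = re.compile(r'(?<!\+)\+(?!\+)')

source = 'a + b++ + ++c'
positions = [m.start() for m in LONE_PLUS.finditer(source)]
print(positions)  # [2, 8]
```

Only the two standalone operators are reported; neither half of b++ or ++c matches.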
match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs.
r'\bint\b(?=\s+\w+\s+)'
If you read the explanation above, you will recognize another zero-width assertion here: (?=\s+\w+\s+) is a positive lookahead that checks if a whole word int (as \b matches word boundary positions) is followed with 1+ whitespaces, then 1+ word characters, and then again 1+ whitespaces.
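Note that the pattern above matches the keyword itself. If what you want back is the word that follows, one possible variation (my own, not part of the pattern above) is to put the keywords in a non-capturing group and capture the next word; that also generalizes to float.

```python
import re

# keyword, then spaces and/or tabs, then the word we actually want
DECLARED = re.compile(r'\b(?:int|float)[ \t]+(\w+)')

source = 'int count = 0; float\tratio; printf x'
print(DECLARED.findall(source))  # ['count', 'ratio']
```

The \b in front stops the int inside printf from being treated as a keyword.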
The examples provided in the documentation and in the previous answers should get you started on the right path. An additional consideration, since you said you are new to programming and Python, is that regular expressions are an intermediate to advanced topic (depending on what you want to do with them) and should be tackled once you have a better grasp of good programming practices and Python's fundamentals.
In any case more information and examples can be found at:
Python Regular Expressions module.
In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", "yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you.
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go into production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
1. Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
2. Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
3. First get a regex working that just captures the round bracket numbers and the dotted numbers.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
4. Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
5. Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
6. Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
7. Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future, pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
8. Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
9. Test it!
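If you are testing in Python rather than .Net, note that Python's re module rejects variable-width lookbehind, so the combined pattern will not compile there as written. One possible adaptation (an assumption on my part, not part of the .Net solution) is to consume the number pattern normally and put the in-between text in a capturing group:

```python
import re

# The number pattern saved earlier, wrapped so it can be reused safely.
DELIM = r'(?:\d+\)|\d+\.(?!\d))'

# Consume the delimiter, then capture every character that does not start another one.
BETWEEN = re.compile(DELIM + r'((?:(?!' + DELIM + r').)*)')

text = ('ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one '
        'and 3. yet another case 4)5) 6)10.')

pieces = [m.group(1) for m in BETWEEN.finditer(text)]
print(len(pieces))  # 7
print(pieces)
```

The count of 7 equals the number of delimiters in the test string, which is the sanity check suggested above.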
10. Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
11. Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
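A rough Python equivalent of the Powershell snippet, using re.split and dropping the first piece (the text before the first number); the variable names are only for this example.

```python
import re

text = ('ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one '
        'and 3. yet another case 4)5) 6)10.')

# Same delimiter pattern as above: "n)" or "n." not followed by a digit.
delimiter = re.compile(r'\d+\)|\d+\.(?!\d)')

pieces = delimiter.split(text)[1:]  # skip the text before the first number
print(pieces)
```

Empty strings are kept, so the pieces between "4)" and "5)" and after "10." still show up in the result.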
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I suggest a mere
\d+[.)]\B\s*
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile(r"\d\)\s.*?|\s\d\.\D.*?")
print([x for x in regex.split(s) if x])
print(regex.split(s1))
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']
I have a long string which I have parsed through beautifulsoup and I need advice on the best way to extract data from this soup object.
The number I want is contained inside the soup object, inside () after this text.
View All (8)
What is the most efficient way to locate this, and get the number out of it.
In VBA I would have done this.
(1) Find where this text string starts (for example, if the soup is 1000 characters long, the text might be at position 200).
Then I would loop until I found the ending ), grab that text, store it in a variable, and process each character removing everything which is not a number.
So if I have > View All (8) I would end up with 8. The number inside is not known in advance; it could be 100, 110, or 2000.
I have just started learning python, don't yet know how to use regular expression but that seems the way to go?
Sample String
">View All (90)</a>
Expected Result - hopeful
90
Sample String
">View All (8)</a>
Expected Result - hopeful
8
Seeing how my comment provoked some more questions, let me expand it a bit. First, welcome to the wonderful world of regular expressions. Regular expressions can be quite a headache, but mastering them is a very useful skill. A very clear tutorial was written by A.M. Kuchling, one of Python's old hackers from the early days. If memory serves me he wrote the re library, with (as an additional bonus) an undocumented implementation of lex in some 15 odd lines of python. But I digress. You can find the tutorial here. https://docs.python.org/2/howto/regex.html
Let me go over the expression bit by bit:
import re
m = re.compile(r'View All \((\d*?)\)').search(soupstring)
print(m.group(1))
The r in front of the quotation marks makes it a raw string in Python. Python preprocesses normal string literals, so that a backslash is interpreted as a special character. E.g. a '\t' in a string will be replaced by the tab character. Try print('a\tb') to see what I mean. To include a '\' in a string you have to escape it like this: '\\'. This can be a problem, as the backslash is also an escape character for the regular expression engine. If you have to match patterns that contain backslashes, you will soon be writing patterns like this: '\\\\'. Which can be fun . . . If you like 50 shades of grey, give it a try.
Inside the regular expression language: '(' characters are special. They are used to group parts of the match together. Since you are only interested in the digits between the parentheses, I used a group to extract this data. Other special characters are '{', '[', '*', '?', '\' and their matching counterparts. I am sure I have forgotten a few, but you can look them up.
With that information, the '\(' will make more sense. Since I have escaped the '(' it tells the regular expression parser to ignore the special meaning of '(' and instead match it against a literal '(' character.
The sequence '\d' is again special. An escaped '\d' means, do not interpret this as a literal 'd', but interpret it as "any digit character".
The '*' means take the last pattern and match it zero or more times.
The '*?' variant means: use "non-greedy" (lazy) matching. It returns the shortest possible match instead of the longest possible one. In the context of regular expressions greed is usually good. As Sebastian has noted, the '?' is not needed here. However, if you ever need to find html elements or quoted strings, then you can use '<.*?>' or '".*?"'.
Please note that '.' is again special. It means match "any character (except the newline (well most of the time anyway))".
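Putting the pieces together, here is a small sketch that runs against the two sample strings from the question and also shows the difference the '?' makes. The helper name view_all_count is made up, and I have used \d+ (at least one digit) rather than \d*?, which is a slight variation on the expression above.

```python
import re

VIEW_ALL = re.compile(r'View All \((\d+)\)')


def view_all_count(html):
    """Return the number inside 'View All (...)' as an int, or None if absent."""
    m = VIEW_ALL.search(html)
    return int(m.group(1)) if m else None


print(view_all_count('">View All (90)</a>'))  # 90
print(view_all_count('">View All (8)</a>'))   # 8

# '*' takes as much as it can; '*?' stops at the first opportunity.
print(re.findall(r'<.*>', '<a><b>'))   # ['<a><b>']
print(re.findall(r'<.*?>', '<a><b>'))  # ['<a>', '<b>']
```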
Have fun . . .
I am using Ply to interpret a FORTRAN format string. I am having trouble writing a regex to match the 'H' edit descriptor which is of the form
xHccccc ...
where x specifies the number of characters to read in after the 'H'
Ply matches tokens with a single regular expression, but I am having trouble using regular expression to perform the above. I am looking for something like,
(\d+)[Hh].{\1}
where \1 is parsed as an integer and evaluated as part of the regex - however it isn't.
It seems that it is not possible to use matched numbers later in the same regex, is this the case?
Does anyone have any other solutions that might use Ply?
Regex can't do things like that. You can hack it though:
(1[Hh].|2[Hh]..|3[Hh]...|etc...)
Ugly!
This is what comes of thinking that regexps can replace a lexer.
Short version: regular expressions can only deal with that small subset of all possible languages termed "regular" (big surprise, I know). But "regular" is not isomorphic to the human understanding of "simple", so even very simple languages can have non-regular expressions.
Writing a lexer for a simple language is not terribly hard.
The canonical Stack Overflow question for resources on the topic is Learning to write a compiler.
Ah. I seem to have misunderstood the question. Mea Culpa.
I'm not familiar with ply, and it's been a while since I used flex, but I think you would eat any number of following digits, then check in the associated code block if the rules had been obeyed.
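Outside of any particular lexer, that "match the count, then handle the rest in code" idea might look like the sketch below. The function name read_hollerith is hypothetical, and this uses only the standard re module rather than Ply.

```python
import re

# Only the count and the H are matched by the regex; the text is sliced by hand.
HOLLERITH = re.compile(r'(\d+)[Hh]')


def read_hollerith(text, pos=0):
    """Return (characters, next_position) for an nH... descriptor at pos, or None."""
    m = HOLLERITH.match(text, pos)
    if m is None:
        return None
    count = int(m.group(1))
    start = m.end()
    end = start + count
    if end > len(text):
        raise ValueError("H descriptor runs past the end of the input")
    return text[start:end], end


value, next_pos = read_hollerith("22HThis is a test string.This is not in the string")
print(value)  # This is a test string.
```

In a lexer, the same logic would presumably live in the token's action code, which would then have to advance the lexer's position past the characters it consumed; I have not tested that with Ply.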
Pyparsing includes an adaptive expression that is very similar to this, called countedArray. countedArray(expr) parses a leading integer 'n' and then parses 'n' instances of expr, returning the whole array as a single list. The way this works is that countedArray parses a leading integer expression, followed by an uninitialized Forward expression. The leading integer expression has a parse action attached that assigns the following Forward to 'n'*expr. The pyparsing parser then continues on, and parses the following 'n' expr's. So it is sort of a self-modifying parser.
To parse your expression, this would look something like:
from pyparsing import Forward, Word, nums, printables

integer = Word(nums).setParseAction(lambda t:int(t[0]))
following = Forward()
integer.addParseAction(lambda t: following << Word(printables+" ",exact=t[0]))
H_expr = integer + 'H' + following
print(H_expr.parseString("22HThis is a test string.This is not in the string"))
Prints:
[22, 'H', 'This is a test string.']
If Ply has something similar, perhaps you could use this technique.