I want to parse equations and get a list of tuples.
For example, when I enter
2x = 4+3y,
I want to get
[('', '2', 'x', '='), ('','4','',''), ('+','3','y','')]
This is my regex so far:
([+-]*)([0-9]+)([a-z]*)([<=>]*)
It works fine for the above query but it does not capture equations like
2 = x +3y, (where x does not have any coefficient)
How do I capture that?
(\d*)(\w*) *(=) *(\d*)(\w*) *[+\-*/] *(\d*)(\w*)
How about this regex?
It separates the operands on both sides of the = sign and around the operator, and inside each operand it also splits the number from the variable.
For testing a regex I normally use https://regex101.com/, where you can build the pattern and see the matches update live.
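As a rough check of the pattern above, here is a minimal sketch using re.match. The names EQ_RX and split_equation are illustrative only, and the operator class is written as [+\-*/] (inside brackets a | would be matched as a literal character). Note that the arithmetic operator itself is not returned because it is not inside a group, and that the pattern only fits the shape "a = b op c".

```python
import re

# Same structure as the pattern above; the names are hypothetical.
EQ_RX = re.compile(r"(\d*)(\w*) *(=) *(\d*)(\w*) *[+\-*/] *(\d*)(\w*)")

def split_equation(text):
    """Return the captured groups, or None if the text does not fit the shape."""
    m = EQ_RX.match(text)
    return m.groups() if m else None

print(split_equation("2x = 4+3y"))  # ('2', 'x', '=', '4', '', '3', 'y')
print(split_equation("2 = x +3y"))  # ('2', '', '=', '', 'x', '3', 'y')
```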
If you change the quantifier on the coefficient from + (one or more) to * (zero or more) then you should get the result you are after. You will also get an empty string match due to all the quantifiers now being * but you can filter out that match.
>>> import re
>>> e1 = "2x=4+3y"
>>> e2 = "2=x+3y"
>>> re.findall("([+-]*)([0-9]*)([a-z]*)([<=>]*)", e1)
[('', '2', 'x', '='), ('', '4', '', ''), ('+', '3', 'y', ''), ('', '', '', '')]
>>> re.findall("([+-]*)([0-9]*)([a-z]*)([<=>]*)", e2)
[('', '2', '', '='), ('', '', 'x', ''), ('+', '3', 'y', ''), ('', '', '', '')]
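Filtering out the all-empty match could look something like the sketch below. The helper name parse_terms is made up for illustration; it simply drops any tuple in which every group is empty.

```python
import re

TERM_RX = re.compile(r"([+-]*)([0-9]*)([a-z]*)([<=>]*)")

def parse_terms(equation):
    # any(t) is False only when every group in the tuple is the empty string
    return [t for t in TERM_RX.findall(equation) if any(t)]

print(parse_terms("2x=4+3y"))
# [('', '2', 'x', '='), ('', '4', '', ''), ('+', '3', 'y', '')]
print(parse_terms("2=x+3y"))
# [('', '2', '', '='), ('', '', 'x', ''), ('+', '3', 'y', '')]
```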
Note: whilst this solves your direct question this is not a good approach to parsing infix equations.
I know partition() exists, but it only takes one separator; I'm trying to partition around several values.
For example, say I wanted to partition around symbols in a string:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle variable amounts of partitioning.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
You could re.split by \W and use (...) to keep the delimiters, then remove empty parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this will split after every special character; if you want to keep certain groups of special characters together, e.g. == or <=, you should test those first with |.
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
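The same "test the longer operators first" idea also seems to work with re.findall, which avoids having to filter out empty and whitespace-only parts. This is only a sketch: the operator list (==, !=, <=, >=) and the name tokenize are assumptions, so extend the alternation with whatever operators you actually need.

```python
import re

# Multi-character operators come first so they win over the single-character \S.
TOKEN_RX = re.compile(r"==|!=|<=|>=|\w+|\S")

def tokenize(text):
    # Whitespace matches none of the alternatives, so it is skipped.
    return TOKEN_RX.findall(text)

print(tokenize("function(): return foo + 3 == 42"))
# ['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
```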
I'm trying to parse an equation such as
5x>=7-5y+4z
into a list of tuples with python:
[('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
I've managed to write a pattern (pattern = "[+-]?\d*[a-z]?[><=]*") to break the equation into groups, but I have no idea how to make it return tuples.
Any help appreciated...
I think you want this:
import re
pattern = re.compile(r'([+-]?)([0-9]+)([a-z]?)([><]?=?)')
print(re.findall(pattern, '5x>=7-5y+4z'))
# [('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
Each match that re.findall finds is returned as a tuple, with one string per capturing group in the regex (an empty string for a group that matched nothing).
I took some liberties with the interpretation of the actual regex, since I'm not sure what the expected output for other cases would be (for example, would there be a 0x term?).
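If terms without a coefficient (such as a bare x) should be accepted too, one possible variation is to make the digits optional and add a lookahead so that a term can never be completely empty. This is a sketch under that assumption, not the only way to do it; TERM_RX is just an illustrative name.

```python
import re

# (?=[0-9a-z]) demands that a digit or a letter follows the optional sign,
# so a term may omit its coefficient but can never be completely empty.
TERM_RX = re.compile(r"([+-]?)(?=[0-9a-z])([0-9]*)([a-z]?)([><]?=?)")

print(TERM_RX.findall("5x>=7-5y+4z"))
# [('', '5', 'x', '>='), ('', '7', '', ''), ('-', '5', 'y', ''), ('+', '4', 'z', '')]
print(TERM_RX.findall("2=x+3y"))
# [('', '2', '', '='), ('', '', 'x', ''), ('+', '3', 'y', '')]
```

Because of the lookahead, there is no trailing all-empty tuple to filter out afterwards.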
I wrote this regex that splits the expression 'res=3+x_sum*11' into lexemes
import re
print(re.findall(r'(\w+)(=)(\d+)(\*|\+)(\w+)(\*|\+)(\d+)', 'res=3+x_sum*11'))
with my output looking like this:
[('res', '=', '3', '+', 'x_sum', '*', '11')]
but I want re.findall to return a list of the lexemes and their tokens so that each lexeme is in its own group. That output should look like this:
[('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'),
('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
How do I get re.findall to return output like that?
You may tokenize the string using
re.findall(r'(\d+)|([^\W\d]+)|(\W)', s)
See the regex demo. Note that re.findall returns a list of tuples when the pattern contains more than one capturing group. The pattern above contains 3 capturing groups, so each tuple contains 3 elements, exactly one of which is non-empty: 1+ digits, 1+ letters/underscores, or a single non-word char.
More details
(\d+) - Capturing group 1: 1+ digits
| - or
([^\W\d]+) - Capturing group 2: 1+ chars other than non-word and digit chars (letters or underscores)
| - or
(\W) - Capturing group 3: a non-word char.
See Python demo:
import re
rx = r"(\d+)|([^\W\d]+)|(\W)"
s = "res=3+x_sum*11"
print(re.findall(rx, s))
# => [('', 'res', ''), ('', '', '='), ('3', '', ''), ('', '', '+'), ('', 'x_sum', ''), ('', '', '*'), ('11', '', '')]
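If the position inside the tuple is only being used to tell which kind of token matched, named groups together with Match.lastgroup may be easier to read. A minimal sketch follows; the group names NUMBER, NAME and OP and the helper label_tokens are invented for illustration.

```python
import re

# Same three alternatives as above, but each group carries a name.
TOKEN_RX = re.compile(r"(?P<NUMBER>\d+)|(?P<NAME>[^\W\d]+)|(?P<OP>\W)")

def label_tokens(text):
    # m.lastgroup is the name of the group that matched for this token
    return [(m.lastgroup, m.group()) for m in TOKEN_RX.finditer(text)]

print(label_tokens("res=3+x_sum*11"))
# [('NAME', 'res'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'),
#  ('NAME', 'x_sum'), ('OP', '*'), ('NUMBER', '11')]
```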
I have a lot of python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they would be in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have used a lot of different split methods but I think I need to just perform the split where there is a math symbol, such as /,*,+,-. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which is returning ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
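You should split on the character set [-+*/] after removing the whitespace from the string. (Put the - first or escape it: written as [+-/*], the +-/ part is a range that also matches , and .)
>>> import re
>>> def mysplit(mystr):
...     return re.split("([-+*/])", mystr.replace(" ", ""))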
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
>>>
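An alternative sketch lets the pattern itself absorb the whitespace around the operator, so the string does not need to be rewritten first. The function name split_expr is hypothetical, and a leading minus sign (as in "-4+2") would still produce an empty first element that you may want to filter out.

```python
import re

def split_expr(text):
    # \s* on both sides swallows spaces around the captured operator;
    # the - comes first in the class so it is not read as a range.
    return re.split(r"\s*([-+*/])\s*", text.strip())

print(split_expr("A7*4"))     # ['A7', '*', '4']
print(split_expr("B6 / 11"))  # ['B6', '/', '11']
print(split_expr("3.5 + 2"))  # ['3.5', '+', '2']
```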
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex
expr = 'x+13.5*10x-4e1'  # renamed from str so the built-in str() is not shadowed
lexer = shlex.shlex(expr)
tokenList = []
for token in lexer:
    tokenList.append(str(token))
print(tokenList)
But this prints:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Use the re module's split() function to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (as a floating-point number in the expression) then you should use this:
[\d.]+ -- digit or dot characters (although this also accepts malformed numbers such as 13.5.5)
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
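For the "ideal world" case in the question, where 4e1 stays together but 4x1 is split, one option is to describe a number (with optional fraction and exponent) explicitly and try it before anything else. This is a sketch under assumptions: it keeps decimals such as 13.5 as one token, it treats a run of letters as a single token, and the names NUMBER, TOKEN_RX and tokenize are illustrative.

```python
import re

# digits, optional fractional part, optional exponent such as e1 or E-3
NUMBER = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
# Try a number first, then a run of letters, then any other single non-space char.
TOKEN_RX = re.compile(NUMBER + r"|[A-Za-z]+|\S")

def tokenize(expr):
    return TOKEN_RX.findall(expr)

print(tokenize("-4e1"))            # ['-', '4e1']
print(tokenize("-4x1"))            # ['-', '4', 'x', '1']
print(tokenize("x+13.5*10x-4e1"))  # ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']
```

The exponent is only absorbed when digits actually follow the e, so an expression like 4e on its own still splits into '4' and 'e'.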
Another alternative, not suggested here, is to use the nltk.tokenize module.
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full tokenizer. Lex/Yacc is a common approach to this, and not only in Python, so ready-made grammars for a simple arithmetic tokenizer probably already exist (like this one), and you only have to adapt them to your specific needs.