Splitting a math expression string into tokens in Python

I have a lot of Python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split each one into a list such as ["A7", "*", "4"], ["B6", "/", "11"], etc. I have tried several split methods, but I think I just need to split wherever there is a math symbol (/, *, +, -), and also strip out the whitespace.
Currently I am using re.split(r'(\D)', "B6 / 11"), which returns ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].

You should split on the character class [-+*/] after removing the whitespace from the string; the hyphen is placed first so it is treated as a literal - rather than as a range:
>>> import re
>>> def mysplit(mystr):
...     return re.split(r"([-+*/])", mystr.replace(" ", ""))
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
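One pitfall to be aware of: with the hyphen in the middle, as in [+-/*], the +-/ part forms a character range (from '+' to '/') that also matches ',' and '.'. A quick check continuing the session above shows the difference:
>>> re.split(r"([+-/*])", "3,5")   # ',' falls inside the '+'..'/' range
['3', ',', '5']
>>> re.split(r"([-+*/])", "3,5")   # hyphen first: only the four operators split
['3,5']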

There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize

formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
# keep each token's string, skipping the empty NEWLINE/ENDMARKER tokens
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
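If you prefer named attributes over tuple indexing, generate_tokens yields TokenInfo named tuples with .string and .type fields. A small sketch wrapping the same idea in a reusable function (the name expr_tokens is mine, not from the answer above):
from io import StringIO
import tokenize

def expr_tokens(expr):
    # return the non-empty token strings of an expression
    return [tok.string
            for tok in tokenize.generate_tokens(StringIO(expr).readline)
            if tok.string.strip()]

print(expr_tokens("(A7*4) - (Z3+8)"))
# ['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')']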

Related

Split a String with multiple delimiter and get the used delimiter

I want to split a string in Python using multiple delimiters. I also want the delimiters that were used returned in a separate list.
Example:
string = '1000+20-12+123-165-564'
# (some method that splits the string and returns lists of numbers and delimiters)
numbers = ['1000', '20', '12', '123', '165', '564']
delimiter = ['+', '-', '+', '-', '-']
I hope my question is understandable.
You might use re.split for this task in the following way:
import re
string = '1000+20-12+123-165-564'
elements = re.split(r'(\d+)',string) # note capturing group
print(elements) # ['', '1000', '+', '20', '-', '12', '+', '123', '-', '165', '-', '564', '']
numbers = elements[1::2] # step of 2: every 2nd element, starting at index 1
delimiter = elements[2::2] # again every 2nd element, starting at index 2
print(numbers) # ['1000', '20', '12', '123', '165', '564']
print(delimiter) # ['+', '-', '+', '-', '-', ''] (the trailing empty string comes from the end of the input; drop it if unwanted)
Alternatively, just capture (...) the delimiters while splitting with re.split:
import re
s = '1000+20-12+123-165-564'
parts = re.split(r'([+-])', s)
numbers, delims = parts[::2], parts[1::2]
print(numbers, delims)
['1000', '20', '12', '123', '165', '564'] ['+', '-', '+', '-', '-']
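If you want this as a reusable function, here is a hedged sketch combining the two ideas (split_keep_delims is an illustrative name; it assumes the expression begins and ends with a number):
import re

def split_keep_delims(expr):
    # drop any empty edge pieces, then de-interleave numbers and operators
    parts = [p for p in re.split(r'([+-])', expr) if p]
    return parts[::2], parts[1::2]

numbers, delims = split_keep_delims('1000+20-12+123-165-564')
print(numbers)  # ['1000', '20', '12', '123', '165', '564']
print(delims)   # ['+', '-', '+', '-', '-']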

Partitioning a string with multiple delimiters

I know partition() exists, but it only takes one value; I'm trying to partition around various values.
For example, say I wanted to partition around the symbols in a string:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle a variable number of partition points.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
You could re.split by \W and use (...) to keep the delimiters, then remove empty parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this will split after every special character; if you want to keep certain groups of special characters together, e.g. == or <=, you should list those alternatives first, separated by |:
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
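The same precedence trick works with re.findall; a hedged sketch using a precompiled pattern (the operator list here is illustrative, not exhaustive):
import re

# multi-character operators must come before the \w+/\S catch-alls
TOKEN = re.compile(r'==|!=|<=|>=|\w+|\S')

def tokenize_line(line):
    return TOKEN.findall(line)

print(tokenize_line("function(): return foo + 3 == 42"))
# ['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']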

How to split a string that includes sign characters

How can I split a string that includes "sign characters" but no spaces? For example:
aString = '1+20*40-3'
I want the output to be:
['1', '+', '20', '*', '40', '-', '3']
I tried this:
aString.split('+' and '*' and '-')
but that didn't work ('+' and '*' and '-' simply evaluates to '-', so the string is only split on minus signs).
You can use a regular expression for this task in Python:
import re
aString = '1+20*40-3'
print(re.findall(r'[-+*/]|\d+', aString))
output:
['1', '+', '20', '*', '40', '-', '3']
Refer to the re module documentation.
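If the expression may also contain decimal numbers, a hedged extension of the same pattern (assuming no scientific notation):
import re

aString = '1.5+20*40-3'
# match a decimal or integer first, otherwise a single operator
print(re.findall(r'\d+(?:\.\d+)?|[-+*/]', aString))
# ['1.5', '+', '20', '*', '40', '-', '3']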

pyparsing and line breaks

I just started with pyparsing and I have problems with line breaks.
My grammar is:
from pyparsing import *

newline = LineEnd()  # Literal('\n').leaveWhitespace()
minus = Literal('-')
plus = Literal('+')
lparen = Literal('(')
rparen = Literal(')')
ident = Word(alphas)
integer = Word(nums)
arith = Forward()
parenthized = Group(lparen + arith + rparen)
atom = ident | integer | parenthized
factor = ZeroOrMore(minus | plus) + atom
arith << (ZeroOrMore(factor + (minus | plus)) + factor)
statement = arith + newline
program = OneOrMore(statement)
Now when I parse the following:
print (program.parseString ('--1-(-a-3+n)\nx\n') )
The result is as expected:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '\n', 'x', '\n']
But when the second line can be parsed as a continuation of the first line, the first \n is magicked away:
Code:
print (program.parseString ('--1-(-a-3+n)\n-x\n') )
Actual result:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '-', 'x', '\n']
Expected result:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '\n', '-', 'x', '\n']
Actually I don't want the parser to automatically join statements.
1. What am I doing wrong?
2. How can I fix this?
3. What is happening under the hood to cause this behaviour (which surely is sensible, but I just fail to see the point)?
'\n' is normally skipped over as a whitespace character. If you want '\n' to be significant, then you have to call setDefaultWhitespaceChars to remove '\n' as skippable whitespace (you have to do this before defining any of your pyparsing expressions):
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')
What is happening here is that the parser by default ignores any whitespace. You need to add the following line of code before you define any elements:
ParserElement.setDefaultWhitespaceChars(" \t")
The normal default whitespace characters are " \t\r\n", I believe.
Edit: Paul beat me to it. I should have refreshed after getting dinner together. :)
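Applying the fix to a trimmed-down version of the grammar above (a hedged sketch; the grammar is deliberately simplified to keep the example short):
from pyparsing import (ParserElement, LineEnd, Literal, Word,
                       alphas, nums, OneOrMore, ZeroOrMore)

# must be called before any grammar elements are defined
ParserElement.setDefaultWhitespaceChars(' \t')

sign = Literal('-') | Literal('+')
atom = Word(alphas) | Word(nums)
factor = ZeroOrMore(sign) + atom
arith = factor + ZeroOrMore(sign + factor)
statement = arith + LineEnd()
program = OneOrMore(statement)

print(program.parseString('--1-a\n-x\n'))
# ['-', '-', '1', '-', 'a', '\n', '-', 'x', '\n']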

How can I split a string into tokens?

If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex

expr = 'x+13.5*10x-4e1'  # renamed from str to avoid shadowing the built-in
lexer = shlex.shlex(expr)
tokenList = []
for token in lexer:
    tokenList.append(token)  # shlex tokens are already strings
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers and somehow splitting them, but I'm not sure how to do this, or how to merge them back into the list in order afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Use the regular expression module's split() function to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (i.e. you want to keep floating-point numbers in the expression whole), then you should use this:
[\d.]+ -- digit or dot characters (although this also accepts malformed numbers like 13.5.5):
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
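To get the "ideal world" behaviour from the question, where 4e1 stays together but 4x1 splits apart, here is a hedged findall sketch (the pattern is my assumption, not from the answers above):
import re

# numbers (optionally with a fraction and an e/E exponent) first,
# then runs of letters, then any other single non-space character
pattern = r'\d+(?:\.\d+)?(?:[eE][-+]?\d+)?|[a-zA-Z]+|\S'
print(re.findall(pattern, 'x+13.5*10x-4e1'))
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']
print(re.findall(pattern, 'x-4x1'))
# ['x', '-', '4', 'x', '1']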
Another alternative not suggested here is to use the nltk.tokenize module.
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to create a full-fledged tokenizer. Lex-Yacc is a common (not only Python) practice for this, so ready-made grammars for a simple arithmetic tokenizer may already exist (like this one), and you just have to fit them to your specific needs.
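For illustration, a minimal hedged sketch of what such a PLY tokenizer could look like (pip install ply; the token names and regexes are my assumptions, not taken from any existing grammar):
import ply.lex as lex

tokens = ('NUMBER', 'NAME', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_NAME = r'[a-zA-Z]+'
t_NUMBER = r'\d+(\.\d+)?([eE][-+]?\d+)?'  # keeps 13.5 and 4e1 whole
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)  # skip characters the lexer does not recognise

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# ['x', '+', '13.5', '*', '10', 'x', '-', '4e1']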
