I just started with pyparsing and I have problems with line breaks.
My grammar is:
from pyparsing import *
newline = LineEnd () #Literal ('\n').leaveWhitespace ()
minus = Literal ('-')
plus = Literal ('+')
lparen = Literal ('(')
rparen = Literal (')')
ident = Word (alphas)
integer = Word (nums)
arith = Forward ()
parenthized = Group (lparen + arith + rparen)
atom = ident | integer | parenthized
factor = ZeroOrMore (minus | plus) + atom
arith << (ZeroOrMore (factor + (minus | plus) ) + factor)
statement = arith + newline
program = OneOrMore (statement)
Now when I parse the following:
print (program.parseString ('--1-(-a-3+n)\nx\n') )
The result is as expected:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '\n', 'x', '\n']
But when the second line can be parsed as a tail of the first line, the first \n is magicked away.
Code:
print (program.parseString ('--1-(-a-3+n)\n-x\n') )
Actual result:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '-', 'x', '\n']
Expected result:
['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '\n', '-', 'x', '\n']
Actually I don't want the parser to automatically join statements.
1. What am I doing wrong?
2. How can I fix this?
3. What is happening under the hood that causes this behaviour (which surely is sensible, but I just fail to see the point)?
'\n' is normally skipped over as a whitespace character. If you want '\n' to be significant, then you have to call setDefaultWhitespaceChars to remove '\n' as skippable whitespace (you have to do this before defining any of your pyparsing expressions):
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')
What is happening here is that the parser by default ignores any whitespace. You need to add the following line of code before you define any elements:
ParserElement.setDefaultWhitespaceChars(" \t")
The normal default whitespace characters are " \t\r\n", I believe.
Edit: Paul beat me to it. I should have refreshed after getting dinner together. :)
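Putting that suggestion together with the original grammar, a minimal sketch would look like this (the commented output is what I would expect from the docs, not a verified run):
from pyparsing import *

ParserElement.setDefaultWhitespaceChars(' \t')  # must run before any expressions are defined

newline = LineEnd()
minus = Literal('-')
plus = Literal('+')
lparen = Literal('(')
rparen = Literal(')')
ident = Word(alphas)
integer = Word(nums)
arith = Forward()
parenthized = Group(lparen + arith + rparen)
atom = ident | integer | parenthized
factor = ZeroOrMore(minus | plus) + atom
arith << (ZeroOrMore(factor + (minus | plus)) + factor)
statement = arith + newline
program = OneOrMore(statement)

print(program.parseString('--1-(-a-3+n)\n-x\n'))
# With '\n' no longer skippable, the first newline should survive:
# ['-', '-', '1', '-', ['(', '-', 'a', '-', '3', '+', 'n', ')'], '\n', '-', 'x', '\n']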
Related
In Python 3.7, assuming we have no access to functions such as str() and must rely entirely on int.__str__() etc., is it possible to get the character represented by an ASCII integer? E.g.:
>>> a = 97
>>> b = a.__magicfunction__()
>>> b
'a'
def magicfunction(i):
    # Look-up table indexed by ASCII code point; the first 32 entries are control characters.
    ascii = [None] * 32
    ascii += [' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/']
    ascii += list('0123456789:;<=>?@')
    ascii += list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    ascii += list('[\\]^_`')
    ascii += list('abcdefghijklmnopqrstuvwxyz')
    return ascii[i]
print(magicfunction(97))
produces:
a
You never said we should not use bytes:
bytes([97]).decode()
I know partition() exists, but it only takes one value, and I'm trying to partition around various values.
For example, say I wanted to partition around the symbols in a string:
input: "function():"
output: ["function", "(", ")", ":"]
I can't seem to find an efficient way to handle variable amounts of partitioning.
You can use re.findall with an alternation pattern that matches either a word or a non-space character:
re.findall(r'\w+|\S', s)
so that given s = 'function():', this returns:
['function', '(', ')', ':']
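For reference, the same thing as a self-contained snippet (only the standard re module is needed):
import re

s = 'function():'
print(re.findall(r'\w+|\S', s))
# ['function', '(', ')', ':']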
You could re.split by \W and use (...) to keep the delimiters, then remove empty parts.
>>> import re
>>> s = "function(): return foo + 3"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3']
Note that this will split after every special character; if you want to keep certain groups of special characters together, e.g. == or <=, you should list those alternatives first in the pattern with |.
>>> s = "function(): return foo + 3 == 42"
>>> [s for s in re.split(r"(\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '=', '=', '42']
>>> [s for s in re.split(r"(==|!=|<=|\W)", s) if s.strip()]
['function', '(', ')', ':', 'return', 'foo', '+', '3', '==', '42']
I have a lot of python strings such as "A7*4", "Z3+8", "B6 / 11", and I want to split these strings so that they would be in a list, in the format ["A7", "*", "4"], ["B6", "/", "11"], etc. I have used a lot of different split methods but I think I need to just perform the split where there is a math symbol, such as /,*,+,-. I would also need to strip out the whitespace.
Currently I am using the code re.split(r'(\D)', "B6 / 11"), which is returning ['', 'B', '6', ' ', '', '/', '', ' ', '11']. Instead I want to get back ["B6", "/", "11"].
You should split on the character class [-+*/] (hyphen first, so it is taken literally rather than as a range) after removing the whitespace from the string:
>>> import re
>>> def mysplit(mystr):
...     return re.split(r"([-+*/])", mystr.replace(" ", ""))
...
>>> mysplit("A7*4")
['A7', '*', '4']
>>> mysplit("Z3+8")
['Z3', '+', '8']
>>> mysplit("B6 / 11")
['B6', '/', '11']
>>>
There is a way to solve this without regular expressions by using the Python tokenizer. I used a more complex formula to show the capabilities of this solution.
from io import StringIO
import tokenize
formula = "(A7*4) - (Z3+8) - ( B6 / 11)"
print([token[1] for token in tokenize.generate_tokens(StringIO(formula).readline) if token[1]])
Result:
['(', 'A7', '*', '4', ')', '-', '(', 'Z3', '+', '8', ')', '-', '(', 'B6', '/', '11', ')']
If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
import shlex

s = 'x+13.5*10x-4e1'
lexer = shlex.shlex(s)
tokenList = []
for token in lexer:
    tokenList.append(token)
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
Use the regular expression module's split() function, to split at
'\d+' -- digits (number characters) and
'\W+' -- non-word characters:
CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (so that floating-point numbers in the expression stay whole), then you should use this:
[\d.]+ -- digit or dot characters (although this also accepts malformed numbers such as 13.5.5):
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
Another alternative not suggested here is to use the nltk.tokenize module.
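For instance, a quick sketch with wordpunct_tokenize (note that it keeps alphanumeric runs such as '10x' and '4e1' together, so it only approximates the output asked for):
from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize('x+13.5*10x-4e1'))
# roughly: ['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']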
Well, the problem is not quite as simple as it seems. I think a good way to get a robust (but, unfortunately, not so short) solution is to use Python Lex-Yacc (PLY) to build a full-weight tokenizer. Lex-Yacc is a common practice (not only in Python), so ready-made grammars for a simple arithmetic tokenizer already exist (like this one), and you only have to fit them to your specific needs.
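To give a rough idea of the shape of such a tokenizer, here is a minimal PLY sketch; the token names and regexes are my own illustration, not the linked grammar, and a real grammar would add a float/scientific-notation rule so that 13.5 and 4e1 stay whole:
import ply.lex as lex  # pip install ply

tokens = ('NAME', 'NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DOT')

t_NAME = r'[a-zA-Z]+'   # letters only, so '10x' splits into '10' and 'x'
t_NUMBER = r'\d+'
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DOT = r'\.'
t_ignore = ' \t'        # skip whitespace between tokens

def t_error(t):
    # skip any character the rules above do not cover
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('x+13.5*10x-4e1')
print([tok.value for tok in lexer])
# expected: ['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']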
I have a string like this:
string = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
These are the symbols I want to remove from my string:
!, @, #, %, ^, &, *, (, ), _, +, =, `, /
What I have tried is:
listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
exceptionals = set(chr(e) for e in listofsymbols)
string.translate(None,exceptionals)
The error is:
an integer is required
Please help me doing this!
Try this
>>> my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> my_str.translate(None, '!@#%^&*()_+=`/')
This is my text of 2013-02-11, it contained characters like this Exceptional
Also, please refrain from naming variables that are already built-in names or part of the standard library.
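For the record, the two-argument form of str.translate shown above is Python 2 only; in Python 3 you would build a translation table with str.maketrans. A small sketch with the same character set:
my_str = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
# str.maketrans('', '', chars) builds a table that deletes every character in chars
print(my_str.translate(str.maketrans('', '', '!@#%^&*()_+=`/')))
# same result as above, with the listed characters stripped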
How about this? I've also renamed string to s to avoid it getting mixed up with the built-in module string.
>>> s = 'This is my text of 2013-02-11, & it contained characters like this! (Exceptional)'
>>> listofsymbols = ['!', '@', '#', '%', '^', '&', '*', '(', ')', '_', '+', '=', '`', '/']
>>> print ''.join([i for i in s if i not in listofsymbols])
This is my text of 2013-02-11, it contained characters like this Exceptional
Another proposal, easily extended to more complex filter criteria or other input data types:
from itertools import ifilter
def isValid(c): return c not in "!@#%^&*()_+=`/"
print "".join(ifilter(isValid, my_string))