I'm using grako (a PEG parser generator library for python) to parse a simple declarative language where a document can contain one or more protocols.
Originally, I had the root rule for document written as:
document = {protocol}+ ;
This appropriately returns a list of protocols, but only gives helpful errors if a syntax error is in the first protocol. Otherwise, it silently discards the invalid protocol and everything after it.
I have also tried a few variations on:
document = protocol document | $ ;
But this doesn't result in a list if there's only one protocol, and doesn't give helpful error messages either, saying only no available options: (...) document if any of the protocols contains an error.
How do I write a rule that does both of the following?:
Always returns a list, even if there's only one protocol
Displays helpful error messages about the unsuccessful match, instead of just saying it's an invalid document or silently dropping the damaged protocol
This is the solution:
document = {protocol ~ }+ $ ;
If you don't add the $ so that the parser has to see the end of file, the parse will succeed after one or more protocols, even if there is more input left to parse.
Adding the cut expression (~) makes the parser commit to what was parsed in the closest option/choice in the parse (a closure is an option of X = a X|();). Additional cut expressions within what's parsed by protocol will make the error messages be closer to the expected points of failure in the input.
I designed a sqlite3 database with SQLiteStudio and DB Browser for SQLite. Both implement REGEXP through Qt regexp methods, which I use for queries and CHECK constraints.
I then proceeded to write a python script, which has to manually load a regexp extension, for which I experimented with the one in the sqlite3-pcre package and the one given in the sqlite3 source code repository (POSIX compliant apparently).
The latter does not handle negated character classes correctly. For example, regexp('[^a]', 'a') gives a match, even though the only character in the string is an "a". Perhaps I'm missing something about beginning-of-line and end-of-line weirdness, but it looks like a bug to me.
The former would work as expected, but then cannot perform pragma integrity_check; because I have strings which are NULL (which is valid), but the pcre-regexp extension will complain with error: no string (which has been referenced in another comment on SO as a bug as well, which can be circumvented by checking first against NOT NULL).
My question therefore is:
Am I missing something about POSIX compliant regexp that explains why the match returns true while it shouldn't?
Are there other regex extensions that work better, especially with NULL strings?
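For the Python side, one option that avoids loadable extensions altogether is to register a REGEXP function with the standard sqlite3 module. The following is only a minimal sketch under that assumption: the function name regexp and the table names are made up for illustration, and the patterns follow Python's re syntax, which is not identical to Qt's or POSIX's. SQLite evaluates X REGEXP Y by calling regexp(Y, X), so the pattern arrives first. Returning None for NULL input makes the expression evaluate to NULL instead of raising an error.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's `value REGEXP pattern` operator using Python's re module."""
    # NULL in, NULL out: no "no string" style error for NULL columns.
    if pattern is None or value is None:
        return None
    return 1 if re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
# The REGEXP operator looks for a two-argument SQL function named "regexp".
conn.create_function("regexp", 2, regexp)

# Hypothetical table, used only to show the behaviour with NULL rows.
conn.execute("CREATE TABLE names (name TEXT)")
conn.executemany("INSERT INTO names VALUES (?)", [("a",), ("b",), (None,)])

rows = conn.execute(
    "SELECT name FROM names WHERE name REGEXP '[^a]' ORDER BY name"
).fetchall()
negated = conn.execute("SELECT 'a' REGEXP '[^a]'").fetchone()[0]
null_result = conn.execute("SELECT NULL REGEXP '[^a]'").fetchone()[0]
```

A function registered this way exists only on the connection that created it, so it would have to be registered again on every connection that touches the regexp-based queries or constraints.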
I'm trying to get a regex to work for a string of multiline text. Need this to work for python.
Example text:
description : "4.10 TCP Wrappers - not installed"
info : "If some of the services running in /etc/inetd.conf are
required, then it is recommended that TCP Wrappers are installed and configured to limit access to any active TCP and UDP services.
TCP Wrappers allow the administrator to control who has access to various inetd network services via source IP address controls. TCP Wrappers also provide logging information via syslog about both successful and unsuccessful connections.
TCP Wrappers are generally triggered via /etc/inetd.conf, but other options exist for \"wrappering\" non-inetd based software.
The configuration of TCP Wrappers to suit a particular environment is outside the scope of this benchmark; however the following links will provide the necessary documentation to plan an appropriate implementation:
ftp://ftp.porcupine.org/pub/security/index.html
The website contains source code for both IPv4 and IPv6 versions."
expect : "^[\\s]*[A-Za-z0-9]+:[\\s]+[^A][^L][^L]"
required : YES
I have come up with this,
[(a-zA-Z_ \t#)]*[:][ ]*\"[^\"]*.*\"
But the problem is that it stops at the second \", so the rest of the line is not selected.
My objective is to get the entire string starting from info till the end of the double quotes, relating to the info line.
This same regex should also work for the 'expect' line, starting from expect ending at the double quotes relating to the expect string.
Once I get the entire string I will split it on the first ":" because I want to store these strings into a DB with the "description", "info", "expect" as columns then the strings as values in those columns.
Appreciate the help!
One alternative is to use the lexer provided in the shlex module:
>>> import shlex
>>> s = """tester : "this is a long string
that
is multiline, contains \\" double qoutes \\" and .
this line is finished\""""
>>> shlex.split(s[s.find('"'):])[0]
'this is a long string\nthat\nis multiline, contains " double qoutes " and .\nthis line is finished'
It will also remove the backslashes from the double quotes inside the string.
The code finds the first double quote in the string and only looks at the string starting from there. It then uses shlex.split() to tokenize the remainder of the string, and takes the first token from the returned list.
Update 1: I got this to work:
[(a-zA-Z_ \t#)]*[:][ ]*\"([^\"]|(?<=\\\\)[\"])*\"
Update 2: If you cannot modify the file to add escaped quotes where necessary for the expression above, then as long as the lines such as
group : "#GROUP#" || "test"
exist only as single lines, then I think this will grab those along with the longer quoted values:
[(a-zA-Z_ \t#)]*[:][ ]*(?:\"([^\"]|(?<=\\\\)[\"])*\"|.*)(?=(?:\r\n|$))
Try that, and if it works, I'll update again to explain it.
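As a rough alternative to the patterns above, here is a sketch that captures each key and its quoted value in one pass with the re module. It assumes keys consist only of word characters and that every quote inside a value is escaped with a backslash. FIELD_RE, text and fields are illustrative names, and the sample is a shortened version of the text in the question.

```python
import re

# key : "value", where the value may span lines and contain \" escapes.
# [^"\\] consumes ordinary characters (newlines included);
# \\. consumes a backslash together with whatever character follows it.
FIELD_RE = re.compile(
    r'^[ \t]*(\w+)[ \t]*:[ \t]*"((?:[^"\\]|\\.)*)"',
    re.MULTILINE | re.DOTALL,
)

text = r'''description : "4.10 TCP Wrappers - not installed"
info : "first line
second line with \"wrappering\" inside
last line."
expect : "^[\\s]*[A-Za-z0-9]+:[\\s]+[^A][^L][^L]"
required : YES
'''

# Map each key to the raw text between its outer double quotes.
fields = {m.group(1): m.group(2) for m in FIELD_RE.finditer(text)}
```

Because the key and the value come back as separate groups, there is no need to split on the first ":" afterwards. Unquoted lines such as required : YES are skipped by this pattern and would need their own handling.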
I am having trouble with pexpect. I'm trying to grab output from tralics which reads in latex equations and emits the MathML representation, like this:
1 ~/ % tralics --interactivemath
This is tralics 2.14.5, a LaTeX to XML translator, running on tlocal
Copyright INRIA/MIAOU/APICS/MARELLE 2002-2012, Jos\'e Grimm
Licensed under the CeCILL Free Software Licensing Agreement
Starting translation of file texput.tex.
No configuration file.
> $x+y=z$
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>x</mi> <mo>+</mo><mi>y</mi><mo>=</mo><mi>z</mi></mrow></math></formula>
>
So I try to get the formula using pexpect:
import pexpect
c = pexpect.spawn('tralics --interactivemath')
c.expect('>')
c.sendline('$x+y=z$')
s = c.read_nonblocking(size=2000)
print s
The output has the formula, but with the original input at the beginning and some control chars at the end:
"x+y=z$\r\n<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>x</mi><mo>+</mo><mi>y</mi><mo>=</mo><mi>z</mi></mrow></math></formula>\r\n\r> \x1b[K"
I can clean the output string, but I must be missing something basic. Is there a cleaner way to get the MathML?
From what I understand you are trying to get this from pexpect:
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>x</mi> <mo>+</mo><mi>y</mi><mo>=</mo><mi>z</mi></mrow></math></formula>
You can use a regexp instead of ">" for the matching in order to get the expected result. This is the easiest example:
c.expect("<formula.*formula>");
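After that, you can access the matched text through the match attribute of the spawn object, which holds a regular expression match object (the matched text is also stored in c.after):
print c.match.group()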
You might also try a different regexp: the one I posted is greedy, so it can match more than intended and may slow things down if the formulas are big.
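To illustrate the greedy versus non-greedy point without needing tralics, the sketch below applies a lazy pattern to the raw output string from the question using only the re module. FORMULA_RE and extract_formula are names made up for this example. The same pattern string should also be usable as the argument to c.expect(), though that is not exercised here.

```python
import re

# Raw output captured in the question: echoed input, the formula, then prompt debris.
raw = (
    "x+y=z$\r\n<formula type='inline'>"
    "<math xmlns='http://www.w3.org/1998/Math/MathML'>"
    "<mrow><mi>x</mi><mo>+</mo><mi>y</mi><mo>=</mo><mi>z</mi></mrow>"
    "</math></formula>\r\n\r> \x1b[K"
)

# .*? is lazy: it stops at the first closing tag instead of the last one.
FORMULA_RE = re.compile(r"<formula\b.*?</formula>", re.DOTALL)


def extract_formula(buffer):
    """Return the first <formula>...</formula> element in buffer, or None."""
    match = FORMULA_RE.search(buffer)
    return match.group(0) if match else None


formula = extract_formula(raw)
```

With two formulas in the same buffer, the greedy pattern would return everything from the first opening tag to the last closing tag, while the lazy one returns only the first formula.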
The t_error() function is used to handle lexing errors that occur when illegal characters are detected. My question is: How can I use this function to get more specific information on errors? Like error type, in which rule or section the error appears, etc.
In general, there is only very limited information available to the t_error() function. As input, it receives a token object where the value has been set to the remaining input text. Analysis of that text is entirely up to you. You can use the t.lexer.skip(n) function to have the lexer skip ahead by a certain number of characters and that's about it.
There is no notion of an "error type" other than the fact that there is an input character that does not match the regular expression of any known token. Since the lexer is decoupled from the parser, there is no direct way to get any information about the state of the parsing engine or to find out what grammar rule is being parsed. Even if you could get the state (which would simply be the underlying state number of the LALR state machine), interpretation of it would likely be very difficult since the parser could be in the intermediate stages of matching dozens of possible grammar rules looking for reduce actions.
My advice is as follows: If you need additional information in the t_error() function, you should set up some kind of object that is shared between the lexer and parser components of your code. You should explicitly make different parts of your compiler update that object as needed (e.g., it could be updated in specific grammar rules).
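A minimal sketch of that shared-object idea might look like the following. ErrorContext, its fields and the stub lexer are all hypothetical: the stub exists only so the snippet runs without PLY installed. In a real project t_error would live in the lexer module and the grammar actions would assign context.current_rule themselves.

```python
from types import SimpleNamespace


class ErrorContext(object):
    """Hypothetical state shared between lexer and parser code."""

    def __init__(self):
        self.current_rule = None  # grammar actions are expected to update this
        self.errors = []          # one record per illegal character


context = ErrorContext()


def t_error(t):
    # t.value is the remaining input, so its first character is the offender.
    context.errors.append({
        "char": t.value[0],
        "line": t.lineno,
        "pos": t.lexpos,
        "rule": context.current_rule,
    })
    t.lexer.skip(1)  # drop the bad character and carry on


class _StubLexer(object):
    """Stand-in for a PLY lexer; it only counts skipped characters."""

    def __init__(self):
        self.skipped = 0

    def skip(self, n):
        self.skipped += n


# Simulate what a grammar action and the lexer would do.
context.current_rule = "statement"
bad_token = SimpleNamespace(value="@ rest of input", lineno=3, lexpos=17,
                            lexer=_StubLexer())
t_error(bad_token)
```

Since grammar actions only run when a rule is reduced, current_rule here would reflect the most recently completed rule rather than the one in progress, so it is best treated as a hint.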
Just as an aside, there are usually very few courses of action for a bad token. Essentially, you're getting input text that doesn't match any known part of the language alphabet (e.g., no known symbol). As such, there's not even any kind of token value you can give to the parser. Usually, the only course of action is to report the bad input, throw it out, and continue.
As a followup to Raymond's answer, I would also not advise modifying any attribute of the lexer object in t_error().
Ply includes an example ANSI-C style lexer in a file called cpp.py. It has an example of how to extract some information out of t_error():
def t_error(t):
    t.type = t.value[0]
    t.value = t.value[0]
    t.lexer.skip(1)
    return t
In that function, you can also access the lexer's public attributes:
lineno - Current line number
lexpos - Current position in the input string
There are also some other attributes that aren't listed as public but may provide some useful diagnostics:
lexstate - Current lexer state
lexstatestack - Stack of lexer states
lexstateinfo - State information
lexerrorf - Error rule (if any)
There is indeed a way of managing errors in PLY. Take a look at this very interesting presentation:
http://www.slideshare.net/dabeaz/writing-parsers-and-compilers-with-ply
and at chapter 6.8.1 of
http://www.dabeaz.com/ply/ply.html#ply_nn3
How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?
Eg.
Can I do it from within the code?
Is there a static analysis tool that has this feature?
Edit:
I wanted this for an application in Python 2.5, but it turns out this is not really possible because:
2.5 doesn't support unicode_literals
kwargs dictionary keys can't be unicode objects, only strings
So I'm accepting the answer that says it's not possible, even though it's for different reasons :)
You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.
There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.
What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.
If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else, that is, places where you don't care whether Python 3 changes the default string type.
It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.
It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:
import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False
    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)
    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")
which prints out
'This should blow up!' is not unicode!
False
True
Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
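One way to approximate the docstring exception is sketched below. It swaps the parser module for the standard tokenize module and uses a simple heuristic: a string that forms a whole statement by itself is treated as a docstring and skipped, which is slightly broader than real docstrings. The name find_plain_strings and the sample source are invented for the example, and the snippet is written to run on Python 3 so it can be tried without the legacy parser module.

```python
import io
import tokenize

# Comments and non-logical newlines never affect where a statement starts.
IGNORED = (tokenize.COMMENT, tokenize.NL)
STATEMENT_START = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def find_plain_strings(source):
    """Return (line, literal) for string literals lacking a u or b prefix."""
    readline = io.StringIO(source).readline
    tokens = [t for t in tokenize.generate_tokens(readline)
              if t.type not in IGNORED]
    hits = []
    for i, tok in enumerate(tokens):
        if tok.type != tokenize.STRING:
            continue
        # Everything before the first quote character is the literal's prefix.
        quote_at = min(p for p in (tok.string.find("'"), tok.string.find('"'))
                       if p != -1)
        prefix = tok.string[:quote_at].lower()
        if "u" in prefix or "b" in prefix:
            continue
        # Heuristic: a string that is an entire statement counts as a docstring.
        starts_stmt = i == 0 or tokens[i - 1].type in STATEMENT_START
        ends_stmt = tokens[i + 1].type == tokenize.NEWLINE
        if starts_stmt and ends_stmt:
            continue
        hits.append((tok.start[0], tok.string))
    return hits


sample = '''
def foo():
    """Docstring is allowed."""
    a = 'This should be flagged'
    b = u'this is fine'
    c = b'bytes are explicit'
'''
plain = find_plain_strings(sample)
```

Like parser.suite, the tokenizer never executes the code it reads. It should be run with the same Python version as the code being checked, since the set of valid string prefixes differs between versions.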
Good question!
Edit
Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains
from ..obscure.relative.path import whatever
You can
checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
Our SD Source Code Search Engine (SCSE) can provide this result directly.
The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.
It uses lexical information from the source languages as the basis for queries, which are composed of various language keywords and pattern tokens that match varying language elements. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").
So a search:
'for' ... I=ij*
finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)
A trivial search:
S
finds all string literals. This is often a pretty big set :-}
A search
UnicodeStrings
finds all string literals that are lexically defined as Unicode Strings (u"...")
What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:
S-UnicodeStrings
All hits shown will be the strings that aren't unicode strings, your precise question.
The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.