This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of a sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be characters individually allowed to appear in a Python program).
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden,
because the newline character may not appear in a quoted string: so,
if the first newline character does not cause a syntax error, this
means that the quote character starts a quoted string, and then the
second newline character will cause a syntax error.
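Indeed, newline-quote-newline is perfectly legal inside a triple-quoted string, as this minimal demonstration shows (the variable name s is arbitrary):

s = '''
"
'''
print(repr(s))  # prints '\n"\n'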
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
␤'[)␤"[)␤'''[)␤"""[)␤'[)␤"[)␤'''[)␤"""[)
(where ␤ denotes a newline character), because one of the [)'s must necessarily occur outside a quoted string, and the sequence cannot occur in a comment because of the initial ␤.
However, I do not know enough about the fine details of Python syntax to be sure that the above examples are correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where the above sequences of characters would be allowed? Maybe the full details of Python syntax even make it actually impossible to build any buggy sequence such as what I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that says that the only possible quoted contexts are '…', "…", '''…''', """…""" and #… (plus possibly a few more which I may not currently be aware of), or may you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the rest of the program in an arbitrary language completely different from Python…?
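For experimenting with candidate sequences, one can mechanically embed them in various contexts and see whether compile() raises a SyntaxError. A minimal sketch, using the first half of the candidate above; the list of contexts is illustrative only, certainly not exhaustive:

candidate = '\n\'[)\n"[)\n\'\'\'[)\n"""[)'  # a candidate "buggy" sequence

# A few contexts to embed the candidate in; {} marks the insertion point.
contexts = ['{}', "s = '{}'", 's = "{}"', "s = '''{}'''", 's = """{}"""', '# {}']

for ctx in contexts:
    source = ctx.format(candidate)
    try:
        compile(source, '<candidate>', 'exec')
        print('accepted by context: %r' % ctx)
    except SyntaxError:
        print('rejected by context: %r' % ctx)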
(*1) In a first version of this question, I wrote "bytes" instead of "characters", because I did not want to be bothered with bizarre Unicode characters; but that made it possible to turn the question into encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well-known (say, ASCII for simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS: Answers to the same question for other (interpreted) real-life languages are also welcome :-)
Related
I'd appreciate some help on an efficient Pythonic solution for this problem.
Our internal coding standards mandate certain information fields should be in a block comment at the top of the file. In Perl, this was obviously a block of text beginning with '#'.
I'm experimenting with including this information in the module docstring in Python. The problem is I need to access some of this information in the program.
I have surgically extended docstring_parser to recognise the information fields, and create a data structure. This all works.
Except that one of the fields includes the source file location. That's fine on Unix, but we are a cross-platform shop, and Windows uses '\' as a path separator. Python processes the \r and \t in the path as escape sequences (a carriage return and tabs), with weird results.
So the string %workspace%\PythonLib\rr2\tests\test_rr2.py
gets rendered as:
%workspace%\PythonLib
r2 ests est_rr2.py
which isn't exactly readable anymore.
The fix I have attempted is based on repeated applications of str.replace(), but is there a better way?
@user2357112 is correct. The docstring can be made raw by beginning it with r""", and then everything works.
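A minimal sketch of that fix (the function name is made up for illustration):

def test_rr2():
    r"""Module information block.

    Source: %workspace%\PythonLib\rr2\tests\test_rr2.py
    """

# The r prefix keeps \r and \t from being interpreted as escape
# sequences, so the Windows path survives verbatim:
print(test_rr2.__doc__)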
I'm trying to understand why, when we were using pandas to_csv(), the number 3189069486778499 was output as "0.\x103189069486778499". And this is the only such case within a huge amount of data.
When using to_csv(), we have already used encoding='utf8', normally that would solve some unicode problems...
So, I'm trying to understand what is "\x10", so that I may know why...
The whole process was running in a luigi pipeline, and sometimes luigi will generate weird output. I tried the same thing in IPython, with the same version of pandas, and everything works fine…
Because it's the likely answer, even if the details aren't provided in your question:
It's highly likely something in your pipeline is intentionally producing fields with length prefixed text, rather than the raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length prefixed text, the parser finds a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, no quoting or escaping required, no matter what the field contains.
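To illustrate, here is a minimal sketch of one such length-prefixed layout (a hypothetical format: one length byte followed by that many characters), which reproduces the exact bytes from the question:

def write_field(buf, text):
    buf.append(len(text))        # length prefix: 0x10 == 16 for 16 characters
    buf.extend(text.encode('ascii'))

def read_field(buf, pos):
    length = buf[pos]            # read the prefix ...
    start = pos + 1              # ... then slurp exactly that many bytes
    return buf[start:start + length].decode('ascii'), start + length

buf = bytearray()
write_field(buf, "3189069486778499")  # the 16-digit number from the question
print(buf)                            # bytearray(b'\x103189069486778499')
field, _ = read_field(buf, 0)
print(field)                          # '3189069486778499'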
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary where all the keys are proper unicode. The values are strings of unknown encoding. For the keys it wouldn't matter that much: the keys are all very simple, quite unlike the values. The values can (and do) contain a large variety of encodings. There are some dictionaries where some values are in ASCII, others in UTF-16BE, yet others in cp1250.
That totally messes up further processing, which currently consists mainly of printing and concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
import chardet

for key in data.keys():
    # hope they did not choose a funky encoding
    try:
        print key + ":" + data[key]  # this triggers a UnicodeDecodeError on many encodings
        current_data = data[key]
    except UnicodeDecodeError:
        # trying to cope with a funky encoding: detect it per value,
        # because the dictionary sometimes contains multiple encodings
        current_data = data[key].decode(chardet.detect(data[key])['encoding'])
        print key + ":",  # printing without a newline was a workaround, because concatenating didn't work
        print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
This creates a string where the right and left double quotes turn into \x8D and \x8E. I don't know what encoding that is. In Python, after using the above code, it strips them. In Jython it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in separate files to your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
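A minimal sketch of that suggestion (strings.txt is a hypothetical file holding the text, saved as UTF-8):

import codecs

# Read the text with an explicit encoding instead of relying on the
# platform or JVM default.
with codecs.open('strings.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text)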
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.
I am using PyDev with Eclipse and I have some attributes that are only set during runtime. Normally I can fix PyDev's errors like this:
obj.runtime_attr  # @UndefinedVariable
However, since my statement is long and thus, with respect to PEP8, multiline, it looks like this:
some.long.statement.\
    with.multiline(obj.runtime_attr).\
    more()
Now I cannot add # @UndefinedVariable because it breaks line continuation (PEP8 demands there are two spaces before a line-ending comment). However, I cannot put it at the end of the statement either (it just doesn't work):
some.long.statement.\
    with.multiline(obj.runtime_attr).\
    more()  # @UndefinedVariable
Is there any way this could work that I am overlooking? Or is this just a missing feature, with no way to get it right?
First, remember that the most important rule of PEP 8 is:
But most importantly: know when to be inconsistent -- sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best.
And it specifically says to avoid a rule:
When applying the rule would make the code less readable, even for someone who is used to reading code that follows the rules.
That being said, you're already violating the letter and the spirit of PEP 8 just by having these lines of code, unless you can't avoid it without making things worse. As Maximum Line Length says, using backslash continuations is the least preferred way to deal with long lines. On top of that, it specifically says to "Make sure to indent the continued line appropriately", which you aren't doing.
The obvious way to break this up is to use some intermediate variables. This isn't C++; there's no "copy constructor" cost to worry about. In a real-life example (unlike this toy example), there are probably good names that you can come up with that will be much more meaningful than the long expression they replace.
intermediate = some.long.statement
multiline = intermediate.with.multiline(obj.runtime_attr)
more = multiline.more()
If that isn't appropriate, as PEP 8 explicitly says, it's better to rely on parenthetical continuations than backslash continuations. Is that doable here? Sure:
some.long.statement.with.multiline(
    obj.runtime_attr).more()
Or, if worst comes to worst:
(some.long.statement.
    with.multiline(obj.runtime_attr).more())
This sometimes makes things less readable rather than more, in which case you shouldn't do it. But it's always an option. And if you have to go to extraordinary lengths to make backslash continuation work for you, it's probably going to be worse than even the worst excesses of over-parenthesizing.
At any rate, doing things either of these ways means you can put a comment on the end of each line, so your problem never comes up in the first place.
I would like to let my users use regular expressions for some features. I'm curious what the implications are of passing user input to re.compile(). I assume there is no way for a user to give me a string that could let them execute arbitrary code. The dangers I have thought of are:
The user could pass input that raises an exception.
The user could pass input that causes the regex engine to take a long time, or to use a lot of memory.
The solution to 1. is easy: catch exceptions. I'm not sure if there is a good solution to 2. Perhaps just limiting the length of the regex would work.
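(For point 1, a minimal sketch of what I mean by catching exceptions; re.error is what re.compile raises for a malformed pattern:)

import re

def compile_user_pattern(user_input):
    # Malformed patterns raise re.error, which is easy to catch and report.
    try:
        return re.compile(user_input)
    except re.error as exc:
        print('invalid pattern: %s' % exc)
        return None

compile_user_pattern('valid.*pattern')  # compiles fine
compile_user_pattern('(unclosed')       # reports the error, returns None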
Is there anything else I need to worry about?
I have worked on a program that allows users to enter their own regex and you are right - they can (and do) enter regexes that can take a long time to finish - sometimes longer than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will not only hang the thread that is running the regex, but the entire program.
Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21 and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).
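If you want to see this for yourself, here is a minimal sketch that times the failing match as the input grows (timings will obviously vary by machine):

import re
import time

pattern = re.compile(r"(\S+)+x")

# Each extra character roughly doubles the time, because the engine
# backtracks exponentially trying to split the "a"s between the
# repeated groups before failing.
for n in range(18, 24):
    s = "a" * n  # contains no "x", so the match must fail
    start = time.time()
    pattern.match(s)
    print(n, time.time() - start)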
For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.
edit
To get round this we wrote a regex analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to get all of them.
Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so is not portable. We also submitted a patch to release the GIL when matching against python strings, but I don't think it was accepted into the core (python only holds the GIL because regex can be run against mutable buffers).
It's much simpler for casual users to give them a subset language. The shell's globbing rules in fnmatch, for example. The SQL LIKE condition rules are another example.
Translate the user's language into a proper regex for execution at runtime.
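A minimal sketch using fnmatch (the pattern shown is a made-up example of user input):

import fnmatch
import re

user_pattern = 'report_*.txt'  # hypothetical user input in the glob language

# Translate the glob into a regex; the glob language simply cannot
# express the degenerate backtracking cases.
regex = re.compile(fnmatch.translate(user_pattern))

print(bool(regex.match('report_2024.txt')))  # True
print(bool(regex.match('notes.md')))         # False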
Compiling the regular expression should be reasonably safe. Although what it compiles into is not strictly an NFA (backreferences mean it's not quite as clean), the compilation step itself should still be fairly straightforward.
Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.
If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.
edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about multiprocessing. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.
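A minimal sketch of that separate-process approach (the timeout value and function names are made up for illustration):

import multiprocessing
import re

def match_worker(pattern, text, queue):
    # Runs in a separate process, so a runaway regex cannot hang
    # (or hold the GIL of) the main program.
    queue.put(bool(re.match(pattern, text)))

def safe_match(pattern, text, timeout=1.0):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=match_worker,
                                   args=(pattern, text, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # still matching: assume pathological input
        proc.terminate()
        proc.join()
        raise TimeoutError("regex took too long")
    return queue.get()

if __name__ == '__main__':
    print(safe_match(r"a+b", "aaab"))         # True
    print(safe_match(r"(\S+)+x", "a" * 100))  # raises TimeoutError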
It's not necessary to use compile() except when you need to reuse a lot of different regular expressions. The module already caches the last expressions.
Point 2 (at execution time) could be a very difficult one if you allow the user to input any regular expression. You can make a complex regexp with just a few characters, like the famous (x+x+)+y one. I think it's a problem yet to be resolved in a general way.
A workaround could be launching a different thread and monitoring it; if it exceeds the allowed time, kill the thread and return with an error.
I really don't think it is possible to execute code simply by passing it into an re.compile. The way I understand it, re.compile (or any regex system in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with the execution of any code.
You technically don't need to use re.compile() to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.
If you're worried about the word "compile" then avoid it altogether and simply pass the raw expression to match, search, etc. You may wind up improving the performance of your code slightly anyway.
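For example (a trivial one-off search where the module-level function is all you need):

import re

# The module-level functions compile and cache the pattern internally,
# so a one-off search needs no explicit re.compile() step.
m = re.search(r'\d+', 'order 12345 shipped')
if m:
    print(m.group())  # 12345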