Debugging the Python VM

Is there a debugger that can debug the Python virtual machine while it is running Python code, similar to the way that GDB works with C/C++? I have searched online and have come across pdb, but this steps through the code executed by the Python interpreter, not the Python interpreter itself as it's running the program.

The reference implementation of Python, CPython, is written in C. You can use GDB to debug it as you would debug any other program written in C.
That said, Python does have a few little helpers for use in GDB buried under Misc/gdbinit. It's got comments to describe what each command does, but I'll repeat them here for convenience:
pyo: Dump a PyObject *.
pyg: Dump a PyGC_Head *.
pylocals: Print the local variables of the current Python stack frame.
lineno: Get the current Python line number.
pyframe: Print the source file name, line, and function.
pyframev: pyframe + pylocals
printframe: pyframe if within PyEval_EvalFrameEx; built-in frame otherwise
pystack: Print the Python stack trace.
pystackv: Print the Python stack trace with local variables.
pu: Print a Unicode string.
It looks like the Fedora project has also assembled their own collection of commands to assist with debugging which you may want to look at, too.

If you're looking to debug Python at the bytecode level, that's exactly what pdb does.
If you're looking to debug the CPython reference interpreter… as icktoofay's answer says, it's just a C program like any other, so you can debug it the same way as any other C program. (And you can get the source, compile it with extra debugging info, etc. if you want, too.)
You almost certainly want to look at EasierPythonDebugging, which shows how to set up a bunch of GDB helpers (which are Python scripts, of course) to make your life easier. Most importantly: The Python stack is tightly bound to the C stack, but it's a big mess to try to map things manually. With the right helpers, you can get stack traces, frame dumps, etc. in Python terms instead of or in parallel with the C terms with no effort. Another big benefit is the py-print command, which can look up a Python name (in nearly the same way a live interpreter would), call its __repr__, and print out the result (with proper error handling and everything so you don't end up crashing your gdb session trying to walk the PyObject* stuff manually).
If you're looking for some level in between… well, there is no level in between. (Conceptually, there are multiple layers to the interpreter, but it's all just C code, and it all looks alike to gdb.)
If you're looking to debug any Python interpreter, rather than specifically CPython, you might want to look at PyPy. It's written in a Python-like language called RPython, and there are various ways to use pdb to debug the (R)Python interpreter code, although it's not as easy as it could be (unless you use a flat-translated PyPy, which will probably run about 100x too slow to be tolerable). There are also GDB debug hooks and scripts for PyPy just like the ones for CPython, but they're not as complete.

Python execution from bytecode [duplicate]

I am trying to understand how Python works (because I use it all the time!). To my understanding, when you run something like python script.py, the script is converted to bytecode and then the interpreter/VM/CPython–really just a C Program–reads in the python bytecode and executes the program accordingly.
How is this bytecode read in? Is it similar to how a text file is read in C? I am unsure how the Python code is converted to machine code. Is it the case that the Python interpreter (the python command in the CLI) is really just a precompiled C program that is already converted to machine code and then the python bytecode files are just put through that program? In other words, is my Python program never actually converted into machine code? Is the python interpreter already in machine code, so my script never has to be?
Yes, your understanding is correct. There is basically (very basically) a giant switch statement inside the CPython interpreter that says "if the current opcode is so and so, do this and that".
http://hg.python.org/cpython/file/3.3/Python/ceval.c#l790
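To make the "giant switch" idea concrete, here is a toy sketch written in Python rather than C. The opcode names are borrowed from CPython, but the tuple-based instruction format and the run function are made up for illustration; the real loop in ceval.c works on packed bytes and handles far more cases.

```python
def run(program, names):
    """Toy stack machine: dispatch on each opcode, like a tiny switch statement."""
    stack = []
    for opcode, arg in program:
        if opcode == 'LOAD_NAME':
            # Push the value bound to a name.
            stack.append(names[arg])
        elif opcode == 'LOAD_CONST':
            # Push a constant.
            stack.append(arg)
        elif opcode == 'BINARY_TRUE_DIVIDE':
            # Pop two operands, push the quotient.
            right = stack.pop()
            left = stack.pop()
            stack.append(left / right)
        elif opcode == 'RETURN_VALUE':
            # Hand the top of the stack back to the caller.
            return stack.pop()
        else:
            raise ValueError('unknown opcode: %r' % (opcode,))
    raise ValueError('program ended without RETURN_VALUE')


# Hypothetical hand-written "bytecode" for the expression i/3.
program = [
    ('LOAD_NAME', 'i'),
    ('LOAD_CONST', 3),
    ('BINARY_TRUE_DIVIDE', None),
    ('RETURN_VALUE', None),
]
result = run(program, {'i': 6})
print(result)
```

Nothing here is turned into machine code; the only machine code involved is the interpreter loop itself, which is the point of the answer above.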
Other implementations, like PyPy, have JIT compilation, i.e. they translate Python to machine code on the fly.
If you want to see the bytecode of some code (whether source code, a live function object or code object, etc.), the dis module will tell you exactly what you need. For example:
>>> import dis
>>> dis.dis('i/3')
  1           0 LOAD_NAME                0 (i)
              3 LOAD_CONST               0 (3)
              6 BINARY_TRUE_DIVIDE
              7 RETURN_VALUE
The dis docs explain what each bytecode means. For example, LOAD_NAME:
Pushes the value associated with co_names[namei] onto the stack.
To understand this, you have to know that the bytecode interpreter is a virtual stack machine, and what co_names is. The inspect module docs have a nice table showing the most important attributes of the most important internal objects, so you can see that co_names is an attribute of code objects which holds a tuple of the names used by the bytecode (names of local variables live separately, in co_varnames). In other words, LOAD_NAME 0 pushes the value associated with the 0th entry of co_names (and dis helpfully looks this up and sees that the 0th name is 'i').
And that's enough to see that a string of bytecodes isn't enough; the interpreter also needs the other attributes of the code object, and in some cases attributes of the function object (which is also where the locals and globals environments come from).
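A minimal sketch of poking at those other attributes directly. The function f and the global name offset are made up for this example, and the exact bytes in co_code differ between CPython versions.

```python
import dis

# Compile the same expression the dis example uses, as an expression ('eval' mode).
code = compile('i/3', '<example>', 'eval')

dis.dis(code)          # human-readable listing (opcode names vary by version)
print(code.co_names)   # names the bytecode refers to
print(code.co_consts)  # constants the bytecode refers to
print(code.co_code)    # the raw bytecode string, meaningless without the tuples above


def f(a):
    b = a + offset     # 'offset' is never assigned here, so it is looked up by name
    return b


print(f.__code__.co_varnames)  # local variables, including arguments
print(f.__code__.co_names)     # non-local names used by the bytecode

# The code object only becomes a value once it is given an environment.
value = eval(code, {'i': 6})
print(value)
```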
The inspect module also has some tools that can help you further in investigating live code.
This is enough to figure out a lot of interesting stuff. For example, you probably know that Python figures out at compile time whether a variable in a function is local, closure, or global, based on whether you assign to it anywhere in the function body (and on any nonlocal or global statements); if you write three different functions and compare their disassembly (and the relevant other attributes) you can pretty easily figure out exactly what it must be doing.
(The one bit that's tricky here is understanding closure cells. To really get this, you will need to have 3 levels of functions, to see how the one in the middle forwards things along for the innermost one.)
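Here is one way to run that experiment, as a rough sketch. All function and variable names are invented for the example; the interesting part is which tuple of the code object each name lands in.

```python
g = 0


def uses_global():
    return g          # never assigned in the body: looked up by name at run time


def uses_local():
    g = 1             # assigned in the body: compiled as a local variable
    return g


def outer():
    x = 1             # a cell variable: an inner function refers to it

    def middle():
        # middle never touches x itself, but must pass the cell along to inner
        def inner():
            return x
        return inner

    return middle


middle = outer()
inner = middle()

print(uses_global.__code__.co_names)    # ('g',)
print(uses_local.__code__.co_varnames)  # ('g',)
print(outer.__code__.co_cellvars)       # ('x',)
print(middle.__code__.co_freevars)      # ('x',)
print(inner.__code__.co_freevars)       # ('x',)
print(inner.__closure__[0].cell_contents)
```

Running dis.dis on each function shows different load instructions for the three kinds of name; the instruction names themselves vary between CPython versions.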
To understand how the bytecode is interpreted and how the stack machine works (in CPython), you need to look at the ceval.c source code. The answers by thy435 and eyquem already cover this.
Understanding how pyc files are read only takes a bit more information. Ned Batchelder has a great (if slightly out-of-date) blog post called The structure of .pyc files, that covers all of the tricky and not-well-documented parts. (Note that in 3.3, some of the gory code related to importing has been moved from C to Python, which makes it much easier to follow.) But basically, it's just some header info and the module's code object, serialized by marshal.
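A small sketch of "header info plus a marshalled code object". It assumes CPython 3.7 or later, where the header is 16 bytes; older versions use a shorter header, so header_size is a parameter. The file names and the load_pyc_code helper are made up for the example.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types


def load_pyc_code(pyc_path, header_size=16):
    """Return (header bytes, code object) from a .pyc file."""
    with open(pyc_path, 'rb') as f:
        header = f.read(header_size)   # magic number plus metadata
        code = marshal.load(f)         # the module's code object
    return header, code


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'demo.py')
    with open(src, 'w') as f:
        f.write('answer = 6 * 7\n')
    # Compile to an explicit path instead of __pycache__ to keep the demo simple.
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, 'demo.pyc'), doraise=True)
    header, code = load_pyc_code(pyc)

# Executing the module's code object is essentially what importing it does.
namespace = {}
exec(code, namespace)
print(header[:4] == importlib.util.MAGIC_NUMBER, namespace['answer'])
```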
To understand how source gets compiled to bytecode, that's the fun part.
Design of CPython's Compiler explains how everything works. (Some of the other sections of the Python Developer's Guide are also useful.)
For the early stuff—tokenizing and parsing—you can just use the ast module to jump right to the point where it's time to do the actual compiling. Then see compile.c for how that AST gets turned into bytecode.
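For instance, a short sketch of stopping at the AST and then handing it to the compiler yourself (the '<ast-demo>' file name is arbitrary):

```python
import ast

# Tokenizing and parsing happen here; the result is a tree of ast nodes.
tree = ast.parse('result = i / 3')
print(ast.dump(tree))

# compile() accepts an AST as well as source text, so this is the
# "actual compiling" step that compile.c implements.
code = compile(tree, '<ast-demo>', 'exec')

namespace = {'i': 6}
exec(code, namespace)
print(namespace['result'])
```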
The macros can be a bit tough to work through, but once you grasp the idea of how the compiler uses a stack to descend into blocks, and how it uses those compiler_addop and friends to emit bytecodes at the current level, it all makes sense.
One thing that surprises most people at first is the way functions work. The function definition's body is compiled into a code object. Then the function definition itself is compiled into code (inside the enclosing function body, module, etc.) that, when executed, builds a function object from that code object. (Once you think about how closures must work, it's obvious why it works that way. Each instance of the closure is a separate function object with the same code object.)
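You can observe this from Python itself. In the sketch below (make_adder and add are invented names), the inner function's code object is sitting among the outer function's constants, and every call to the outer function builds a new function object around that same code object.

```python
import types


def make_adder(n):
    def add(x):
        return x + n
    return add


add1 = make_adder(1)
add2 = make_adder(2)

# The body of 'add' was compiled once, into a code object stored as a
# constant of make_adder's own code object.
inner_codes = [c for c in make_adder.__code__.co_consts
               if isinstance(c, types.CodeType)]
print(inner_codes)

# Two distinct function objects (different closures), one shared code object.
print(add1 is add2, add1.__code__ is add2.__code__)
print(add1(10), add2(10))
```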
And now you're ready to start patching CPython to add your own statements, right? Well, as Changing CPython's Grammar shows, there's a lot of stuff to get right (and there's even more if you need to create new opcodes). You might find it easier to learn PyPy as well as CPython, and start hacking on PyPy first, and only come back to CPython once you know that what you're doing is sensible and doable.
Having read the answer of thg4535, I am sure you will find interesting the following explanations on ceval.c : Hello, ceval.c!
This article is part of a series written by Yaniv Aknin, of whom I'm sort of a fan: Python's Innards
When we run a Python program: (1) CPython compiles the Python source code to bytecode (the bytecode is stored in a binary .pyc file, serialized with marshal, and is a sequence of instructions for a stack machine, executed by the PVM); (2) the PVM (Python virtual machine, i.e. the Python interpreter) is a stack-based machine (a machine that solves tasks using a stack data structure) that loops over the bytecode instruction by instruction and executes it.
What executes the bytecode?
The bytecode tells the Python interpreter which C code to execute.

Does PyPy translate itself?

Am I getting this straight? Does the PyPy interpreter actually interpret itself and then translate itself?
So here's my current understanding:
RPython's toolchain involves partially executing the program to be translated to get a sort of preprocessed version to annotate and translate.
The PyPy interpreter, running on top of CPython, executes to partially interpret itself, at which point it hands control off to its RPython half, which performs the translation?
If this is true, then this is one of the most mind-bending things I have ever seen.
PyPy's translation process is actually much less conceptually recursive than it sounds.
Really all it is is a Python program that processes Python function/class/other objects (not Python source code) and outputs C code. But of course it doesn't process just any Python objects; it can only handle particular forms, which are what you get if you write your to-be-translated code in RPython.
Since the translation toolchain is a Python program, you can run it on top of any Python interpreter, which obviously includes PyPy's python interpreter. So that's nothing special.
Since it translates RPython objects, you can use it to translate PyPy's python interpreter, which is written in RPython.
But you can't run it on the translation framework itself, which is not RPython. Only PyPy's python interpreter itself is RPython.
Things only get interesting because RPython code is also Python code (but not the reverse), and because RPython doesn't ever "really exist" in source files, but only in memory inside a working Python process that necessarily includes other non-RPython code (there are no "pure-RPython" imports or function definitions, for example, because the translator operates on functions that have already been defined and imported).
Remember that the translation toolchain operates on in-memory Python code objects. Python's execution model means that these don't exist before some Python code has been running. You can imagine that starting the translation process looks a bit like this, if you highly simplify it:
from my_interpreter import main
from pypy import translate
translate(main)
As we all know, just importing main is going to run lots of Python code, including all the other modules my_interpreter imports. But the translation process starts analysing the function object main; it never sees, and doesn't care about, whatever code was executed to come up with main.
One way to think of this is that "programming in RPython" means "writing a Python program which generates an RPython program and then feeds it to the translation process". That's relatively easy to understand and is kind of similar to how many other compilers work (e.g. one way to think of programming in C is that you are essentially writing a C pre-processor program that generates a C program, which is then fed to the C compiler).
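A toy illustration of that split, in plain Python. Nothing here is PyPy code: toy_translate is a made-up stand-in for the real toolchain, and make_main plays the role of the "Python program which generates the RPython program".

```python
def make_main(debug):
    # Ordinary Python, run before "translation" starts. It decides which
    # function object will exist; the translator never sees this choice.
    if debug:
        def main():
            return 'debug build'
    else:
        def main():
            return 'release build'
    return main


def toy_translate(func):
    # Hypothetical stand-in for a translator: it is handed a function
    # object and can only inspect what that object contains.
    code = func.__code__
    return {'name': code.co_name, 'consts': code.co_consts}


main = make_main(debug=False)
info = toy_translate(main)
print(info)
```

The 'debug build' branch was real code in the source file, but it leaves no trace in what the toy translator receives.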
Things only get confusing in the PyPy case because all 3 components (the Python program which generates the RPython program, the RPython program, and the translation process) are loaded into the same Python interpreter. This means it's quite possible to have functions that are RPython when called with some arguments and not when called with other arguments, to call helper functions from the translation framework as part of generating your RPython program, and lots of other weird things. So the situation gets rather blurry around the edges, and you can't necessarily divide your source lines cleanly into "RPython to be translated", "Python generating my RPython program" and "handing the RPython program over to the translation framework".
The PyPy interpreter, running on top of CPython, executes to partially interpret itself
What I think you're alluding to here is PyPy's use of the flow object space during translation, to do abstract interpretation. Even this isn't as crazy and mind-bending as it seems at first. I'm much less informed about this part of PyPy, but as I understand it:
PyPy implements all of the operations of a Python interpreter by delegating them to an "object space", which contains an implementation of all the basic built in operations. But you can plug in different object spaces to get different effects, and so long as they implement the same "object space" interface the interpreter will still be able to "execute" Python code.
The RPython code objects that the PyPy translation toolchain processes are Python code that could be executed by an interpreter. So PyPy re-uses part of its Python interpreter as part of the translation tool-chain, by plugging in the flow object space. When "executing" code with this object space, the interpreter doesn't actually carry out the operations of the code; it instead produces flow graphs, which are analogous to the sorts of intermediate representation used by many other compilers. A flow graph is just a simple machine-manipulable representation of the code, to be further processed. This is how regular (R)Python code objects get turned into the input for the rest of the translation process.
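The idea can be sketched in a few lines, with the loud caveat that this is an invented miniature and not PyPy's actual object space interface: StdSpace, FlowSpace, and interpret are all hypothetical names, and the "interpreter" only knows one hard-coded program.

```python
class StdSpace:
    """Object space that really performs the operations."""
    def wrap(self, value):
        return value

    def add(self, a, b):
        return a + b

    def mul(self, a, b):
        return a * b


class FlowSpace:
    """Object space that records operations instead of performing them."""
    def __init__(self):
        self.ops = []

    def wrap(self, value):
        return repr(value)

    def _record(self, op, a, b):
        result = 'v%d' % len(self.ops)      # name for the operation's result
        self.ops.append((result, op, a, b))
        return result

    def add(self, a, b):
        return self._record('add', a, b)

    def mul(self, a, b):
        return self._record('mul', a, b)


def interpret(space, x):
    # The "interpreter": evaluates x * 2 + 1, delegating every operation.
    return space.add(space.mul(x, space.wrap(2)), space.wrap(1))


print(interpret(StdSpace(), 5))   # actually computes a value

flow = FlowSpace()
interpret(flow, 'x')              # computes nothing, just records
print(flow.ops)
```

The same interpret function drives both runs; only the plugged-in space decides whether it yields a value or a list of recorded operations.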
Since the usual thing that is translated with the translation process is PyPy's Python interpreter, it indeed "interprets itself" with the flow object space. But all that really means is that you have a Python program that is processing Python functions, including the ones doing the processing. In itself it isn't any more mind-bending than applying a decorator to itself, or having a wrapper-class wrap an instance of itself (or wrap the class itself).
Um, that got a bit rambly. I hope it helps, anyway, and I hope I haven't said anything inaccurate; please correct me if I have.
Disclaimer: I'm not an expert on PyPy - in particular, I don't understand the details of the RPython translation, I'm only citing stuff that I've read before. For a more specific post on how RPython translation may work, check out this answer.
The answer is, yes, it can (but only after it was first compiled using CPython).
Longer description:
At first it seems highly mind-bending and paradoxical, but once you understand it, it's easy. Check out the answer on Wikipedia.
Bootstrapping in program development began during the 1950s when each program was constructed on paper in decimal code or in binary code, bit by bit (1s and 0s), because there was no high-level computer language, no compiler, no assembler, and no linker. A tiny assembler program was hand-coded for a new computer (for example the IBM 650) which converted a few instructions into binary or decimal code: A1. This simple assembler program was then rewritten in its just-defined assembly language but with extensions that would enable the use of some additional mnemonics for more complex operation codes.
The process is called software bootstrapping. Basically, you build one tool, say a C++ compiler, in a lower-level language which already exists, say ASM (at one point everything had to be coded in binary). Now that C++ exists, you can write a C++ compiler in C++, then use the ASM C++ compiler to compile your new one. Once your new compiler has been compiled, you can use it to compile itself.
So basically, make the first computer tool ever by hand coding it, use that interpreter to make another slightly better one, and use that one to make a better one, ... And eventually you get all the complex software today! :)
Another interesting case is the CoffeeScript language, which is written in... CoffeeScript. (Although this use case still requires the use of an external interpreter, namely Node.js.)
The PyPy interpreter, running on top of CPython, executes to partially interpret itself, at which point it hands control off to its RPython half, which performs the translation?
You can compile PyPy using an already compiled PyPy interpreter, or you can use CPython to compile it instead. However, since PyPy has a JIT now, it'll be faster to compile PyPy using itself, rather than CPython. (PyPy is now faster than CPython in most cases)

Can You Embed a TCL Script in a Bash Script or Python Script That's Callable by External Programs?

I'm writing a script to extract some useful data about a series of chemical simulations I've been running.
To get this data I need (1) a C-program that calculates the density from a file type called *.pdb. I already have (1). And (2) I need to use a program called vmd to get that pdb. In order to accomplish (2) from the command line, I can submit a tcl script, as vmd has a built-in tcl interpreter.
These functions -- calling the vmd to run the tcl script, then running the compiled c-program -- will be the key activities of my wrapper data extraction script.
I would like to eliminate the superfluous TCL script, reducing my count from 2 scripts (wrapper script + tcl script for vmd) down to 1. But I'm not sure quite how to do this. One potential solution seems to be to embed my TCL script within my wrapper script, if there's a way to make such an embedded script callable from external programs.
Most of my data collection scripts so far have been in BASH, so ideally I would like to stick to a BASH script as I'm very familiar with bash scripting versus having only beginning knowledge of Python/Perl.
Here are my questions:
1. Can you embed a TCL script inside a Bash script?
2. Can you make this script callable by an external program?
e.g. in pseudocode:
#!/bin/bash
....
tclembed extract {
#tcl script
...
}
...
vmd -dispdev text -e extract.tcl >& extract_results.log #where vmd is
#an external program
3. If the answer to #2 is no, can you do this in Python, perhaps with the Minotaur library? I would consider the switch to python, if so...
http://markmail.org/message/6kogjphzqtn4ilch
4. If not, how would you suggest trying to merge these two scripts (a tcl routine and a bash script that calls it) into a single file?
5. If anybody HAS gotten external calls of this nature to work using Minotaur, can you post some explanatory code?
I've thought of one non-embedding solution to #4, which would be to write a function in my Bash script that writes a file with the entire tcl script. That way I would have a single script, but could dump the subscript for use with external programs, later deleting it. I have a feeling this solution is kinda kludgy, though I know for sure that it works, vs. embedded solutions.
There have been several Tcl-Python alloys. As Rafe Kettler's comment above sketches, the place to start is with a standard Python installation. This includes Tkinter, which builds in a full Tcl interpreter, accessible as described in the Wiki page mentioned. So, yes, it is feasible to "do this in Python".
I really don't get what this has to do with vmd, though. vmd builds in a Tcl interpreter already. While I entirely support the aim of "reduction of moving parts", so that you have, for example, one script, rather than two, coding something in Python, when vmd already exposes Tcl, doesn't seem like a step in the direction Jason R. Mick wants to go.
SOMEWHAT LATER: after an exchange of comments with Jason R. Mick, it occurred to me he might find
#!/bin/bash
echo "Here's a bit of bash-iness."
MYSCRIPT='
puts "Here I am, inside Tcl."
puts "See? I can do calculations: [expr 3 + 5]."
exit 0
'
tclsh << HERE
$MYSCRIPT
HERE
suggestive. Its output, of course, is
Here's a bit of bash-iness.
Here I am, inside Tcl.
See? I can do calculations: 8.
I wrote this in terms of tclsh, but, if I'm keeping up, Jason R. Mick will actually want to use vmd. The appropriate homologue for vmd is something like
...
vmd -dispdev text -eofexit << HERE > output.log
$MYSCRIPT
HERE
While I can think of several other ways to meld bash and Tcl, I believe this one is most in the spirit of the original question.
I want to note, too, that, from the little I know of vmd, it would be entirely appropriate to do the same with Python in place of Tcl: vmd is equally adept with either.
Finally, both Python and Tcl are general-purpose languages, with approximately the same power as bash, so yet another direction to take this project would be to write it entirely in Tcl (or Python), rather than bash. Embedding scripts in the way illustrated above is at least as easy in Tcl (or Python) as in bash.
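If the wrapper were written in Python instead, the here-document trick might look roughly like this. It is only a sketch: run_embedded is a made-up helper, the vmd arguments are copied from the command shown earlier and have not been tested against vmd, and the live demonstration feeds a Python one-liner to the Python interpreter simply because that is sure to be installed.

```python
import subprocess
import sys

# The embedded script lives in the wrapper as an ordinary string.
TCL_SCRIPT = '''
puts "Here I am, inside Tcl."
puts "See? I can do calculations: [expr 3 + 5]."
exit 0
'''


def run_embedded(script, command):
    """Feed an embedded script to an interpreter on stdin and return its output."""
    completed = subprocess.run(command, input=script, capture_output=True,
                               text=True, check=True)
    return completed.stdout


# With Tcl or vmd installed, the calls would presumably be:
#   run_embedded(TCL_SCRIPT, ['tclsh'])
#   run_embedded(TCL_SCRIPT, ['vmd', '-dispdev', 'text', '-eofexit'])

# Demonstration with an interpreter that is certainly available.
stdin_demo = run_embedded('print(3 + 5)\n', [sys.executable, '-'])
print(stdin_demo)
```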
1. Can you embed a TCL script inside a Bash script?
Not easily. The best way is to write the script to a temporary file and pass the name of that file to tclsh (or wish if it is a Tcl/Tk program). That should be a "simple matter of programming", i.e., some awkward coding but not fundamentally hard.
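A rough sketch of the temporary-file route, written in Python for the sake of having something runnable. The run_via_tempfile helper is invented for this example; for the vmd case the command would presumably be ['vmd', '-dispdev', 'text', '-e'], following the command line in the question, which I have not tested.

```python
import os
import subprocess
import sys
import tempfile


def run_via_tempfile(script, command, suffix='.tcl'):
    """Write the embedded script to a temp file, run command + [path], clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'w') as handle:
            handle.write(script)
        completed = subprocess.run(list(command) + [path], capture_output=True,
                                   text=True, check=True)
        return completed.stdout
    finally:
        # Delete the script even if the interpreter failed.
        os.remove(path)


# Demonstration with the Python interpreter standing in for tclsh.
tempfile_demo = run_via_tempfile('print(3 + 5)\n', [sys.executable], suffix='.py')
print(tempfile_demo)
```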
2. Can you make this script callable by an external program?
I don't quite understand what you want to do here. You can put a #! line at the start of a Tcl script and mark the file executable. That works well. The best way of all to do that is this:
#!/usr/bin/env tclsh8.5
your tcl script here...
3. If the answer to #2 is no, can you do this in Python?
This wiki page mentions something called Typcl, which is reported to allow doing Tcl from inside Python. I have never tried it.
(I think questions 4 and 5 are largely irrelevant based on my answers above.)

How can I sandbox Python in pure Python?

I'm developing a web game in pure Python, and want some simple scripting available to allow for more dynamic game content. Game content can be added live by privileged users.
It would be nice if the scripting language could be Python. However, it can't run with access to the environment the game runs on since a malicious user could wreak havoc which would be bad. Is it possible to run sandboxed Python in pure Python?
Update: In fact, since true Python support would be way overkill, a simple scripting language with Pythonic syntax would be perfect.
If there aren't any Pythonic script interpreters, are there any other open source script interpreters written in pure Python that I could use? The requirements are support for variables, basic conditionals and function calls (not definitions).
This is really non-trivial.
There are two ways to sandbox Python. One is to create a restricted environment (i.e., very few globals etc.) and exec your code inside this environment. This is what Messa is suggesting. It's nice but there are lots of ways to break out of the sandbox and create trouble. There was a thread about this on Python-dev a year ago or so in which people did things from catching exceptions and poking at internal state to break out to byte code manipulation. This is the way to go if you want a complete language.
The other way is to parse the code and then use the ast module to kick out constructs you don't want (e.g. import statements, function calls etc.) and then to compile the rest. This is the way to go if you want to use Python as a config language etc.
Another way (which might not work for you since you're using GAE), is the PyPy sandbox. While I haven't used it myself, word on the intertubes is that it's the only real sandboxed Python out there.
Based on your description of the requirements (support for variables, basic conditionals and function calls, but not definitions), you might want to evaluate approach 2 and kick out everything else from the code. It's a little tricky but doable.
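A minimal sketch of approach 2, assuming Python 3.8 or later (for ast.Constant). The node whitelist, check_script, and run_script are my own illustration and not a vetted sandbox: anything not explicitly listed is rejected, calls are only allowed to names you pass in, and arithmetic is left out on purpose so that a script cannot request huge computations.

```python
import ast

# Only these syntax nodes may appear in a script.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.If,
    ast.Name, ast.Load, ast.Store, ast.Constant,
    ast.Call, ast.Compare, ast.BoolOp, ast.UnaryOp,
    ast.cmpop, ast.boolop, ast.unaryop,
)


def check_script(source, allowed_functions):
    """Parse source and raise ValueError on any construct not whitelisted."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError('disallowed syntax: ' + type(node).__name__)
        if isinstance(node, ast.Call):
            # Calls must target a plain, explicitly allowed name.
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in allowed_functions):
                raise ValueError('disallowed call')
        if isinstance(node, ast.Name) and node.id.startswith('__'):
            raise ValueError('disallowed name: ' + node.id)
    return tree


def run_script(source, functions, variables=None):
    """Validate, compile, and execute a script against a small environment."""
    tree = check_script(source, functions)
    env = {'__builtins__': {}}
    env.update(functions)
    env.update(variables or {})
    exec(compile(tree, '<script>', 'exec'), env)
    return env


healed = []
run_script("bonus = 5\nif hp < 10:\n    heal(bonus)\n",
           {'heal': healed.append}, {'hp': 3})
print(healed)
```

Imports, attribute access, loops, and definitions all fail the whitelist check before anything is executed.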
Roughly ten years after the original question, Python 3.8.0 comes with auditing. Can it help? Let's limit the discussion to hard-drive writing for simplicity - and see:
from sys import addaudithook
def block_mischief(event, arg):
    if 'WRITE_LOCK' in globals() and ((event == 'open' and arg[1] != 'r')
            or event.split('.')[0] in ['subprocess', 'os', 'shutil', 'winreg']):
        raise IOError('file write forbidden')
addaudithook(block_mischief)
So far exec could easily write to disk:
exec("open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')", dict(locals()))
But we can forbid it at will, so that no wicked user can access the disk from the code supplied to exec(). Python modules like numpy or pickle eventually go through Python's file access, so they are banned from writing to disk, too. External program calls have been explicitly disabled as well.
WRITE_LOCK = True
exec("open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')", dict(locals()))
exec("open('/tmp/FILE','a').write('pwned by l33t h4xx0rz')", dict(locals()))
exec("numpy.savetxt('/tmp/FILE', numpy.eye(3))", dict(locals()))
exec("import subprocess; subprocess.call('echo PWNED >> /tmp/FILE', shell=True)", dict(locals()))
An attempt to remove the lock from within exec() seems to be futile, since the auditing hook uses a different copy of locals that is not accessible to the code run by exec. Please prove me wrong.
exec("print('muhehehe'); del WRITE_LOCK; open('/tmp/FILE','w')", dict(locals()))
...
OSError: file write forbidden
Of course, the top-level code can enable file I/O again.
del WRITE_LOCK
exec("open('/tmp/FILE','w')", dict(locals()))
Sandboxing within CPython has proven extremely hard and many previous attempts have failed. This approach is also not entirely secure, e.g. for public web access:
Hypothetical compiled modules that make direct OS calls perhaps cannot be audited by CPython; whitelisting the safe pure-Python modules is recommended.
There is definitely still the possibility of crashing or overloading the CPython interpreter.
Maybe there even remain some loopholes for writing files to the hard drive, too. But I could not use any of the usual sandbox-evasion tricks to write a single byte. We can say the "attack surface" of the Python ecosystem reduces to a rather narrow list of events to be (dis)allowed: https://docs.python.org/3/library/audit_events.html
I would be thankful to anybody pointing me to the flaws of this approach.
EDIT: So this is not safe either! I am very thankful to @Emu for his clever hack using exception catching and introspection:
#!/usr/bin/python3.8
from sys import addaudithook
def block_mischief(event, arg):
    if 'WRITE_LOCK' in globals() and ((event == 'open' and arg[1] != 'r') or event.split('.')[0] in ['subprocess', 'os', 'shutil', 'winreg']):
        raise IOError('file write forbidden')
addaudithook(block_mischief)
WRITE_LOCK = True
exec("""
import sys
def r(a, b):
    try:
        raise Exception()
    except:
        del sys.exc_info()[2].tb_frame.f_back.f_globals['WRITE_LOCK']
import sys
w = type('evil',(object,),{'__ne__':r})()
sys.audit('open', None, w)
open('/tmp/FILE','w').write('pwned by l33t h4xx0rz')""", dict(locals()))
I guess that auditing+subprocessing is the way to go, but do not use it on production machines:
https://bitbucket.org/fdominec/experimental_sandbox_in_cpython38/src/master/sandbox_experiment.py
AFAIK it is possible to run code in a completely isolated environment:
exec somePythonCode in {'__builtins__': {}}, {}
But in such an environment you can do almost nothing :) (you cannot even import a module, yet a malicious user can still run an infinite recursion or exhaust memory). You would probably want to add some modules that will be the interface to your game engine.
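The exec line above is Python 2 statement syntax; in Python 3, exec is a function that takes the namespace as an argument. A minimal sketch of the same idea follows, where `run_script` and the `say` callback are hypothetical names standing in for a game-engine interface:

```python
def run_script(source, api):
    """Run source with an empty __builtins__; only the names in api are visible."""
    env = {"__builtins__": {}}
    env.update(api)
    exec(source, env)
    return env

# Usage: expose a single callback and read back a variable the script set.
log = []
env = run_script("say('hello')\nhp = 10 + 5", {"say": log.append})
```

With an empty `__builtins__`, an import statement fails with ImportError and a call to open fails with NameError. Keep in mind that attribute access on any reachable object is still available to the script, so emptying the builtins is known to be escapable by itself and should be combined with the other measures discussed here.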
I'm not sure why nobody mentions this, but Zope 2 has a thing called Python Script, which is exactly that - restricted Python executed in a sandbox, without any access to filesystem, with access to other Zope objects controlled by Zope security machinery, with imports limited to a safe subset.
Zope in general is pretty safe, so I would imagine there are no known or obvious ways to break out of the sandbox.
I'm not sure how exactly Python Scripts are implemented, but the feature has been around since about the year 2000.
And here's the magic behind PythonScripts, with detailed documentation: http://pypi.python.org/pypi/RestrictedPython - it even looks like it doesn't have any dependencies on Zope, so can be used standalone.
Note that this is not for safely running arbitrary Python code (most random scripts will fail on the first import or file access), but rather for using Python for limited scripting within a Python application.
This answer is from my comment to a question closed as a duplicate of this one: Python from Python: restricting functionality?
I would look into a two-server approach. The first server is the privileged web server where your code lives. The second server is a very tightly controlled server that only provides a web service or RPC service and runs the untrusted code. You provide your content creator with your custom interface. For example, if you allowed the end user to create items, you would have a lookup that called the second server with the code to execute and the set of parameters.
Here's an abstract example for a healing potion.
{function_id='healing potion', action='use', target='self', inventory_id='1234'}
The response might be something like
{hp='+5', action={destroy_inventory_item, inventory_id='1234'}}
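To make the exchange concrete, here is a minimal sketch of the privileged side, assuming the messages are sent as JSON. The function names, the `type` key inside `action`, the whitelist and the hp range are all assumptions made for illustration; the transport (HTTP, RPC) is left out.

```python
import json

# Hypothetical whitelist of actions the privileged server will carry out.
ALLOWED_ACTIONS = {"destroy_inventory_item"}

def build_request(function_id, action, target, inventory_id):
    """Serialize the call that is sent to the sandbox server."""
    return json.dumps({"function_id": function_id, "action": action,
                       "target": target, "inventory_id": inventory_id})

def apply_response(player, raw_response):
    """Validate the sandbox server's reply before touching privileged state."""
    response = json.loads(raw_response)
    hp_delta = int(response.get("hp", 0))  # int() accepts a signed string like '+5'
    if not -100 <= hp_delta <= 100:
        raise ValueError("hp change out of range")
    action = response.get("action")
    if action is not None and action.get("type") not in ALLOWED_ACTIONS:
        raise ValueError("action not allowed")
    # All checks passed; only now is the player state modified.
    if action is not None:
        player["inventory"].discard(action["inventory_id"])
    player["hp"] += hp_delta
    return player
```

The point of the design is that the untrusted code never runs next to your data: the second server only returns a description of what should happen, and the privileged server decides whether to apply it.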
Hmm. This is a thought experiment, I don't know of it being done:
You could use the compiler package to parse the script. You can then walk this tree, prefixing all identifiers (variables, method names, etc., as well as hasattr/getattr/setattr invocations and so on) with a unique preamble so that they cannot possibly refer to your variables. You could also ensure that the compiler package itself was not invoked, and perhaps other blacklisted things such as opening files. You then emit the Python code for this, and compiler.compile it.
The docs note that the compiler package is not in Python 3.0, but does not mention what the 3.0 alternative is.
In general, this is parallel to how forum software and such try to whitelist 'safe' JavaScript or HTML, etc. And they historically have a bad record of stomping all the escapes. But you might have more luck with Python :)
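In Python 3 the ast module together with the built-in compile() covers what the compiler package did. A minimal sketch of the prefixing idea, using ast.NodeTransformer, could look like this. `PREFIX`, `PrefixNames` and `compile_prefixed` are made-up names, and the sketch simply refuses attribute access and imports instead of rewriting them:

```python
import ast

PREFIX = "user_"  # hypothetical preamble added to every identifier

class PrefixNames(ast.NodeTransformer):
    """Rename every plain identifier so it cannot refer to the host's names."""

    def visit_Name(self, node):
        node.id = PREFIX + node.id
        return node

    def _forbid(self, node):
        raise ValueError(f"{type(node).__name__} is not allowed in this sketch")

    # Attribute access and imports would need their own handling; reject them here.
    visit_Attribute = visit_Import = visit_ImportFrom = _forbid

def compile_prefixed(source):
    """Parse source, prefix its identifiers, and return a code object."""
    tree = PrefixNames().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return compile(tree, "<prefixed>", "exec")
```

A script that says `x = double(4)` then reads and writes `user_x` and `user_double`, so the host decides what exists under the prefix. Function definitions and their arguments are not ast.Name nodes and are not renamed here, which fits the stated requirement that scripts do not define functions.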
I think your best bet is going to be a combination of the replies thus far.
You'll want to parse and sanitise the input - removing any import statements for example.
You can then use Messa's exec sample (or something similar) to allow the code execution against only the builtin variables of your choosing - most likely some sort of API defined by yourself that provides the programmer access to the functionality you deem relevant.
