Is fstat() a safe (sandboxed) operation?

I'm currently writing a Python sandbox using sandboxed PyPy. Basically, the sandbox works by providing a "controller" that maps system library calls to a specified function instead. After following the instructions found at codespeak (which walk through the setup process), I realized that the default controller does not include a replacement for os.fstat(), and therefore crashes when I call open(). Specifically, the included pypy/translator/sandbox/sandlib.py does not contain a definition for do_ll_os__ll_os_fstat.
So far, I've implemented it as:
def do_ll_os__ll_os_fstat(self, fd):
    return os.fstat(fd)
which seems to work fine. Is this safe? Will this create a hole in the sandbox?

The fstat call can reveal certain information which you may or may not want to keep secret. Among other things:
Whether two file descriptors are on the same filesystem
The block size of the underlying filesystem
Numeric UID/GIDs of file owners
Modification/access times of files
However, it will not modify anything, so if you don't mind this (relatively minor) information leak, there's no problem. You could also alter some of the results to mask information you want to hide (set owner UIDs/GIDs to 0, for example).
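A minimal sketch of that masking idea, building on the method from the question (the field indexes are those of the plain 10-tuple view of os.stat_result):

import os

def do_ll_os__ll_os_fstat(self, fd):
    st = os.fstat(fd)
    # os.stat_result is a tuple subclass; rebuild it with the owner
    # fields zeroed out (st_uid is index 4, st_gid is index 5).
    masked = list(st)
    masked[4] = 0  # st_uid
    masked[5] = 0  # st_gid
    return os.stat_result(masked)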

bdonlan's answer is good, but since there is a bounty here, what the heck :-)
You can see for yourself precisely what information fstat provides by reading the POSIX spec for struct stat.
It is definitely a "read-only" operation. And as a rule, Unix file descriptors only provide access to the single object to which they refer. For example, a (readable) file descriptor referencing a directory will allow you to list the files within the directory, but it will not allow you to access files within the directory; for that, you need to open() the file, which will perform a permission check.
Be aware that fstat can be called on non-files like directories or sockets. Here again, though, it will only provide the information you see in struct stat and it will not modify anything. (And for a socket, most of the fields will be meaningless.)
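For a concrete illustration (a hedged sketch assuming a Unix system; the paths are placeholders):

import os
import socket
import stat

# A regular file: fstat reports type, size, owner, timestamps.
fd = os.open("/etc/hostname", os.O_RDONLY)
st = os.fstat(fd)
print(stat.S_ISREG(st.st_mode), st.st_size, st.st_uid)
os.close(fd)

# A directory descriptor works too...
dir_fd = os.open("/tmp", os.O_RDONLY)
print(stat.S_ISDIR(os.fstat(dir_fd).st_mode))
os.close(dir_fd)

# ...as does a socket, though most fields are meaningless for it.
s = socket.socket()
print(stat.S_ISSOCK(os.fstat(s.fileno()).st_mode))
s.close()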

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by generating an AST from various source files, then compiling that to bytecode (though the question could also apply to cases where the bytecode is generated directly, I guess).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST, alongside the filename passed to compile, in order to provide a representation for the debugger's purposes. However, this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project, the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network, or been retrieved from non-disk storage (e.g. database).
And so my Y questions (in the XY-problem sense), which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised[0]?
and just in case, is the assumption of Python only supporting a single contiguous source file correct, or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form
Some of this can be handled by the trepan3k debugger; for other things, various hooks are in place.
First of all, it can debug based on bytecode alone. But of course stepping by line won't be possible if the line number table doesn't exist. For that reason if for no other, I would add a "line number" for each logical stopping point, such as the beginning of each statement. The numbers don't have to be real line numbers; they could just count from 1 or be indexes into some other table. This is more or less how Go's Pos type works.
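A hedged sketch of that idea (the fragment list, filename, and numbering scheme here are made up for illustration): stamp each generated statement with a synthetic line number that indexes your own table of source fragments, then compile under a synthetic filename.

import ast

fragments = ["x = 1", "print(x)"]  # hypothetical source pieces

module = ast.Module(body=[], type_ignores=[])
for i, src in enumerate(fragments, start=1):
    node = ast.parse(src).body[0]
    # Use the fragment index as the "line number" of every node.
    for sub in ast.walk(node):
        sub.lineno = sub.end_lineno = i
        sub.col_offset = sub.end_col_offset = 0
    module.body.append(node)

code = compile(module, filename="<generated:example>", mode="exec")
exec(code)  # tracebacks/debuggers now report <generated:example>:1, :2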
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start any Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a point where the other functions you want to break on have been defined.
And the functions can be specified as methods on existing variables, e.g. self.my_function()
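For example, in a generated module you might emit (the function name here is hypothetical):

# ... generated definitions above ...
def render(ctx):
    return ctx["title"].upper()

from trepan.api import debug; debug()  # `break render` works from here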
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the source context around where you are currently stopped.
Deparsing bytecode, though, is a bit difficult, so the results may vary depending on which kind of bytecode you get.
As for the virtual source problem, that situation is somewhat tolerated in the debugger, since that kind of thing has to go on when there is no source. To facilitate this and remote debugging (where the local and remote file locations can differ), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it has the ability to remap contiguous lines of one file into lines in another file, and I think you could use this over and over again. However, so far there hasn't been a need for this, and that code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger which can step bytecode instructions even when the line number table is empty.

Would allowing user input to python's __import__ be a security risk?

Say I create a simple web server using Flask, allowing people to query certain things that I've modularized in different Python files using the __import__ function. Would doing this with user-supplied information be considered a security risk?
Example:
from flask import Flask
app = Flask(__name__)

@app.route("/<author>/<book>/<chapter>")
def index(author, book, chapter):
    return getattr(__import__(author), book)(chapter)
    # OR
    return getattr(__import__("books." + author), book)(chapter)
I've seen a case like this recently when reviewing code; however, it didn't feel right to me.
It is entirely insecure, and your system is wide open to attack. Your first return line doesn't limit what kind of names can be imported, which means the user can execute any arbitrary callable in any importable Python module.
That includes:
/pickle/loads/<url-encoded pickle data>
A pickle is a stack language that lets you execute arbitrary Python code, and the attacker can take full control of your server.
Even a prefixed __import__ would be insecure if an attacker can also place a file on your file system in the PYTHONPATH; all they need is a books directory earlier in the path. They can then use this route to have the file executed in your Flask process, again letting them take full control.
I would not use __import__ at all here. Just import those modules at the start and use a dictionary mapping author to the already imported module. You can use __import__ still to discover those modules on start-up, but you now remove the option to load arbitrary code from the filesystem.
Allowing untrusted data to direct the calling of arbitrary objects in modules should also be avoided (including via getattr()). Again, an attacker with limited access to the system could exploit this path to widen the crack considerably. Always limit the input to a whitelist of possible options (like the modules you loaded at the start and, per module, what objects can actually be called within).
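A minimal sketch of that whitelist approach (the books package, author modules, and book names are all hypothetical):

from flask import Flask, abort
from books import austen, dickens  # imported once, at start-up

app = Flask(__name__)

# Whitelists: which modules may be used, and which callables per module.
AUTHORS = {"austen": austen, "dickens": dickens}
BOOKS = {"austen": {"emma", "persuasion"}, "dickens": {"bleak_house"}}

@app.route("/<author>/<book>/<chapter>")
def index(author, book, chapter):
    module = AUTHORS.get(author)
    if module is None or book not in BOOKS[author]:
        abort(404)
    return getattr(module, book)(chapter)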
Beyond being a security risk, it is simply a bad idea; e.g. I could easily crash your web app by visiting the URL:
/sys/exit/anything
translating to:
...
getattr(__import__('sys'), 'exit')('anything')
Don't give your users the ability to import/execute just about anything. Restrict the possibilities by using, say, a dictionary of permissible imports, as @MartijnPieters has clearly pointed out.

Is it possible to examine the inner statements of a function?

Working from the command line I wrote a function called go(). When called it receives input asking the user for a directory address in the format drive:\directory. No need for extra slashes or quotes or r literal qualifiers or what have you. Once you've provided a directory, it lists all the non-hidden files and directories under it.
I want to update the function now with a statement that stores this location in a variable, so that I can start browsing my hierarchy without specifying the full address every time.
Unfortunately I don't remember what statements I put in the function in the first place to make it work as it does. I know it's simple and I could just look it up and rebuild it from scratch with not too much effort, but that isn't the point.
As someone who is trying to learn the language, I try to stay at the command line as much as possible, only visiting the browser when I need to learn something NEW. Having to refer to obscure findings attached to vaguely related questions to rediscover how to do things I've already done is very cumbersome.
So my question is, can I see the contents of functions I have written, and how?
Unfortunately, no. Python does not have this level of introspection. The best you can do is see the compiled bytecode.
The inspect module details what information is available at runtime: https://docs.python.org/3.5/library/inspect.html
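For instance, you can disassemble any function you still hold a reference to (a sketch; this go() is a stand-in for the original):

import dis
import os

def go():
    path = input("Directory: ")
    for name in os.listdir(path):
        print(name)

dis.dis(go)  # the REPL source is gone, but the bytecode survives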

Is there a Python module for transparently working with a file's contents as a buffer?

I'm working on a pure Python file parser for event logs, which may range in size from kilobytes to gigabytes. Is there a module that abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:
with FileBackedBuffer('/my/favorite/path', 'rb') as buf:
    header = buf[0:0x10]
    footer = buf[0x10000000:]
The mmap module may fulfill my requirements; however, I have two reservations that I'd appreciate feedback on:
It is important that the module handle files larger than available RAM/swap. I am unsure if mmap can do this well.
The mmap constructors are different depending on OS. This makes me hesitant as I am looking to write nicely cross-platform code, and would rather not muck in OS specifics. I will if I need to, but this set off a warning that I might be looking in the wrong place.
If mmap is the correct module for such as task, how does it handle these two points? If it is not, what is an appropriate module?
mmap can easily handle files larger than RAM/swap. What mmap can't do is handle files larger than the address space, which means that 32-bit systems are limited in how large a file they can use.
What happens with mmap is that the OS will only keep in memory as much of the data as it chooses to, but your program will think it is all there. Be careful with usage patterns, though: if your data DOESN'T fit in RAM and you jump around too randomly, it will swap (discarding pages from your file that you haven't used recently to make room for new pages to be loaded).
If you don't need to specify anything beyond fileno and length, I don't believe you need to worry about the platform-specific arguments for mmap. If you do need the extra arguments, then you will either have to master Windows versus Unix yourself or pass that on to your users. I don't know what your library will be, but it may be nice to provide reasonable defaults on both platforms while also allowing the user to tweak the options. It looks to me like it would be unlikely that you would care about the Windows tagname option; also, if you are cross-platform, just accept the Unix default for prot, since you have no choice on Windows. That only leaves caring about MAP_PRIVATE and MAP_SHARED. The default is MAP_SHARED; I'm not sure if that is the option that most closely matches Windows behavior, but accepting the default is probably fine there.
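In other words, a portable read-only mapping can stick to the common arguments (a minimal sketch mirroring the question's example; length 0 means "map the whole file"):

import mmap

with open('/my/favorite/path', 'rb') as f:
    # ACCESS_READ is accepted on both Windows and Unix.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        header = buf[0:0x10]
        footer = buf[0x10000000:]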

How can I create a ramdisk in Python?

I want to create a ramdisk in Python. I want to be able to do this in a cross-platform way, so it'll work on Windows XP-to-7, Mac, and Linux. I want to be able to read/write to the ramdisk like it's a normal drive, preferably with a drive letter/path.
The reason I want this is to write tests for a script that creates a directory with a certain structure. I want to create the directory completely in the ramdisk so I'll be sure it will be completely deleted after the tests are over. I considered using Python's tempfile, but if the test is stopped in the middle, the directory might not be deleted. I want to be completely sure it's deleted even if someone pulls the plug on the computer in the middle of a test.
How about PyFilesystem?
https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html
https://docs.pyfilesystem.org/en/latest/reference/tempfs.html
The downside is that you have to access the filesystem with PyFilesystem API, but you can also access the real fs with PyFilesystem.
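For example, with MemoryFS (a brief sketch; the paths are made up):

from fs.memoryfs import MemoryFS

# Everything lives in RAM; nothing can be left behind on disk.
with MemoryFS() as mem:
    mem.makedirs("project/src")
    mem.writetext("project/src/main.py", "print('hello')\n")
    print(mem.listdir("project/src"))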
Because file and directory handling is so low-level and OS-dependent, I doubt anything like what you want exists (or is even possible). Your best bet might be to implement a "virtual" file-system-like set of functions, classes, and methods that keep track of the files and directory hierarchy created and their content.
Callables in such an emulation would need to have the same signature and return the same value(s) as their counterparts in the various Python standard built-ins and modules your application uses.
I suspect this might not be as much work as it sounds -- emulating the standard Python file-system interface -- depending on how much of it you're actually using, since you wouldn't necessarily have to imitate all of it. And if written in Pure Python™, it would be portable and easy to maintain and enhance.
One option might be to inject (monkey-patch) modified versions of the methods in the os module, as well as the built-ins open and file, that write to StringIO objects instead of to disk. Obviously this substitution should only occur for the module being tested.
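A minimal sketch of that idea, assuming a hypothetical module under test called mypkg whose code writes through open():

import io
from unittest import mock

class KeepOpenStringIO(io.StringIO):
    def close(self):
        # Ignore close() so the test can still read the contents.
        pass

import mypkg  # hypothetical module under test

buf = KeepOpenStringIO()
with mock.patch("mypkg.open", create=True, return_value=buf):
    mypkg.write_report()  # hypothetical function that writes via open()

assert "expected text" in buf.getvalue()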
Please read this:
http://docs.python.org/library/tempfile.html#tempfile.TemporaryFile
"Return a file-like object that can be
used as a temporary storage area. The
file is created using mkstemp(). It
will be destroyed as soon as it is
closed (including an implicit close
when the object is garbage
collected)."
It's all handled for you. Do nothing and it already works.
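And for the directory-tree use case in the question, the directory counterpart gives the same cleanup guarantee (a small sketch):

import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "a", "b"))
    with open(os.path.join(d, "a", "b", "notes.txt"), "w") as f:
        f.write("hi")
# The whole tree under d is removed here, even if the test fails.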
