Is there a way to call the POSIX mlock function from Python? The effect of mlock is to disable swapping out certain objects.
I know there are still other issues when it comes to securing cryptographic keys, I just want to know how to contain them in RAM.
For CPython, there is no good answer to this that doesn't involve writing a Python C extension, because mlock works on pages, not objects. Even if you used ctypes to retrieve the necessary addresses and mlock-ed them all through ctypes mlock calls, you'd have a hell of a time determining when to mlock and when to munlock. You'd need to know the memory addresses and sizes of all protected data, and since mlock works on pages, you'd have to carefully track how many protected objects currently live in any given page: if you mlock and munlock blindly and there is more than one thing to lock in a page, the first munlock would unlock all of them (mlock/munlock is a boolean flag; it doesn't count the number of locks and unlocks).
Even if you manage that, you still would have a race between data acquisition and mlock during which the data could be written to swap.
You could partially avoid these problems through careful use of the mmap module and memoryviews (mmap gives you pages of memory, memoryview references said memory without copying it, so ctypes could be used to mlock the page), but you'd have to build it all from scratch.
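For illustration, here is a rough sketch of that mmap-plus-ctypes approach (not production-hardened; it assumes a Linux-like system whose libc exposes mlock/munlock): allocate an anonymous page, lock it, and place the secret in it through a memoryview.

import ctypes
import ctypes.util
import mmap
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
PAGE = mmap.PAGESIZE

buf = mmap.mmap(-1, PAGE)                       # anonymous, page-aligned memory
cbuf = (ctypes.c_char * PAGE).from_buffer(buf)  # no-copy view, used only to get the address
addr = ctypes.addressof(cbuf)

if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE)) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

view = memoryview(buf)
view[:6] = b"secret"       # copy key material into the locked page
# ... use the key ...
view[:] = b"\x00" * PAGE   # zero the page before releasing it
view.release()
del cbuf                   # drop the buffer export so the mmap can be closed
libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
buf.close()

Even then, the secret has typically passed through ordinary Python objects (and possibly interpreter-internal caches) before it lands in the locked page, which is exactly the acquisition race described above.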
In short, Python doesn't care about swapping or memory protection in the way you want; it trusts the swap file to be configured to your desired security (e.g. disabled or encrypted), neither providing additional protection nor providing the information you'd need to add it in.
With Deno being the new Node.js rival and all, the memory-safe nature of Rust has been mentioned in a lot of news articles; one particular piece stated that Rust and Go are good because of their memory-safe nature, as are Swift and Kotlin, but that the latter two are not used that widely for systems programming.
Safe Rust is the true Rust programming language. If all you do is write Safe Rust, you will never have to worry about type-safety or memory-safety. You will never endure a dangling pointer, a use-after-free, or any other kind of Undefined Behavior.
This piqued my interest in understanding whether Python can be regarded as memory-safe and, either way, how safe or unsafe it is.
From the outset, the Wikipedia article on memory safety does not even mention Python, and the article on Python only seems to mention memory management.
The closest I've come to finding an answer was this one by Daniel:
The Wikipedia article associates type safety with memory safety, meaning that the same memory area cannot be accessed as, e.g., both an integer and a string. In this way Python is type-safe. You cannot change the type of an object implicitly.
But even this only seems to imply a connection between two aspects (using an association from Wikipedia, which again is debatable) and no definitive answer on whether Python can be regarded as memory-safe.
Wikipedia lists the following examples of memory safety issues:
Access errors: invalid read/write of a pointer
Buffer overflow - out-of-bound writes can corrupt the content of adjacent objects, or internal data (like bookkeeping information for the heap) or return addresses.
Buffer over-read - out-of-bound reads can reveal sensitive data or help attackers bypass address space layout randomization.
Python at least tries to protect against these.
Race condition - concurrent reads/writes to shared memory
That's actually not that hard to do in languages with mutable data structures. (Advocates of functional programming and immutable data structures often use this fact as an argument in their favor).
Invalid page fault - accessing a pointer outside the virtual memory space. A null pointer dereference will often cause an exception or program termination in most environments, but can cause corruption in operating system kernels or systems without memory protection, or when use of the null pointer involves a large or negative offset.
Use after free - dereferencing a dangling pointer storing the address of an object that has been deleted.
Uninitialized variables - a variable that has not been assigned a value is used. It may contain an undesired or, in some languages, a corrupt value.
Null pointer dereference - dereferencing an invalid pointer or a pointer to memory that has not been allocated
Wild pointers arise when a pointer is used prior to initialization to some known state. They show the same erratic behaviour as dangling pointers, though they are less likely to stay undetected.
There's no real way to prevent someone from trying to access a null pointer. In C# and Java, this results in an exception. In C++, this results in undefined behavior.
Memory leak - when memory usage is not tracked or is tracked incorrectly
Stack exhaustion - occurs when a program runs out of stack space, typically because of too deep recursion. A guard page typically halts the program, preventing memory corruption, but functions with large stack frames may bypass the page.
Memory leaks in languages like C#, Java, and Python have different meanings than they do in languages like C and C++ where you manage memory manually. In C or C++, you get a memory leak by failing to deallocate allocated memory. In a language with managed memory, you don't have to explicitly de-allocate memory, but it's still possible to do something quite similar by accidentally maintaining a reference to an object somewhere even after the object is no longer needed.
This is actually quite easy to do with things like event handlers in C# and long-lived collection classes; I've actually worked on projects where there were memory leaks in spite of the fact that we were using managed memory. In one sense, working with an environment that has managed memory can actually make these issues more dangerous because programmers can have a false sense of security. In my experience, even experienced engineers often fail to do memory profiling or write test cases to check for this (likely due to the environment giving them a false sense of security).
Stack exhaustion is quite easy to do in Python too (e.g. with infinite recursion).
Heap exhaustion - the program tries to allocate more memory than the amount available. In some languages, this condition must be checked for manually after each allocation.
Still quite possible - I'm rather embarrassed to admit that I've personally done that in C# (although not in Python yet).
Double free - repeated calls to free may prematurely free a new object at the same address. If the exact address has not been reused, other corruption may occur, especially in allocators that use free lists.
Invalid free - passing an invalid address to free can corrupt the heap.
Mismatched free - when multiple allocators are in use, attempting to free memory with a deallocation function of a different allocator.
Unwanted aliasing - when the same memory location is allocated and modified twice for unrelated purposes.
Unwanted aliasing is actually quite easy to do in Python. There's an example of this in Java (full disclosure: I wrote the accepted answer to that question); you could just as easily do something quite similar in Python, as sketched below. The others are managed by the Python interpreter itself.
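As a minimal sketch of what unwanted aliasing looks like in Python (hypothetical names, same idea as the Java example):

defaults = ["read"]                 # intended as a template of default permissions
admin_permissions = defaults        # oops: this aliases the list rather than copying it
admin_permissions.append("write")   # mutates the shared object

print(defaults)                     # ['read', 'write'] -- the "template" changed too

admin_permissions = list(defaults)  # an explicit copy avoids the shared state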
So, it would seem that memory-safety is relative. Depending on exactly what you consider a "memory-safety issue," it can actually be quite difficult to entirely prevent. High-level languages like Java, C#, and Python can prevent many of the worst of these errors, but there are other issues that are difficult or impossible to completely prevent.
I am doing some experiments with the Python garbage collector. I would like to check whether a memory address is used or not. In the following example, I have dropped the reference to the string ('surely') at ls[2]. If I run the garbage collector, I can still see 'surely' at the original address. I would like to be sure that the address is now writable. Is there a way to check it in Python?
from ctypes import string_at
from sys import getsizeof
import gc
ls = ['This', 'will be', 'surely', 'deleted']
idsurely = id(ls[2])
sizesurely = getsizeof(ls[2])
ls[2] = 'probably'
print(ls)
print(string_at(idsurely,sizesurely))
gc.collect()
# I check there is nothing in the garbage
print(gc.garbage)
print(string_at(idsurely,sizesurely))
I am interested in this mainly from a theoretical point of view, so I am not saying it is something that has practical usage. My goal is to show how memory works for a tutorial. I want to show that the data is still there and that the bytes at that address can now be overwritten. The output of the script so far is as expected; I just want to prove this last step.
Not possible.
There is no central registry of used or unused memory addresses in Python. There isn't even a central registry of all objects (the cyclic GC doesn't know about all of them), and even if you had a registry of all objects, that wouldn't be enough to determine what memory locations are in use. Additionally, you can't just read arbitrary memory addresses, or write to arbitrary deallocated addresses. That'll quickly lead to segfaults or worse.
Finally, I would strongly advise against using this kind of thing in a tutorial even if you did find something to make it work. When you put something in a tutorial, a large fraction of people reading the tutorial are going to think it's something they're supposed to learn. Programming newbies should not be misled into thinking that examining possibly-deallocated memory locations is something they should be doing.
Your experiments are way off base. id (solely as a CPython implementation detail) does get the memory address of the object in question, but we're talking about the Python object itself, not the data it contains. sys.getsizeof returns a number that roughly corresponds to how much memory the object occupies, but there is no guarantee that memory is contiguous.
By sheer coincidence, this almost works on str (though it will perform a buffer overread if the string in question has cached copies of its UTF-8 or wchar_t form, so you're risking crashing your program), but even then your test is flawed; CPython interns string literals that look like legal variable names, so if the string in question appears as a literal anywhere else in your program (including as the name of some class or function in some module you imported), it won't actually go away when you replace it. Similar implicit caches can occur if the literal string appears in any function, anywhere (it ends up being not only interned, but stored in the constants for that function).
Update: On testing, in an actual script, the reference count for 'surely' when you hold onto a copy of it is 3, which drops to 2 when you replace it with 'probably'. Turns out constants are being cached even at global scope. The only reason the interactive interpreter doesn't exhibit this behavior is that it effectively evals each line separately, so the constant cache is discarded when the eval completes.
And even if all that's not a problem, most (almost all) memory managers (including CPython's specialized small object heap and the general heap it's built on) don't actually zero out memory when it's released, so if you do look at the same address shortly after it really was released, it'll probably have pretty similar data in it.
Lastly, your gc.collect() call won't change anything except by coincidence (of whatever happens during gc possibly allocating memory by side-effect). str is not a garbage collected type, as it cannot contain references to other Python objects, so it's impossible for it to be a link in a reference cycle, and the CPython garbage collector is solely concerned with collecting cyclic garbage; CPython is reference counted, so anything that's not part of a reference cycle is cleaned up automatically and immediately when the last reference disappears.
The short answer this all leads up to is: There is no way to determine, within CPython, non-heuristically, if a particular memory address has been released to the free store and made available for reuse. CPython's memory management scheme is pure implementation detail, and exposing APIs at that level of detail would create compatibility concerns when people depended on them.
The closest you're going to get is using something like the tracemalloc module to perform basic snapshotting and compute differences in the snapshot. That's not going to give you a window into whether a specific address is still in use though AFAICT; at best it can tell you where an address that's definitely in use was allocated.
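For reference, the snapshot/diff workflow looks roughly like this (a sketch; it reports where live allocations were made, not whether a given address has been freed):

import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
data = ['x' * 1000 for _ in range(1000)]   # allocate something noticeable
after = tracemalloc.take_snapshot()

# top three sources of new allocations between the two snapshots
for stat in after.compare_to(before, 'lineno')[:3]:
    print(stat)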
The other approach (specific to CPython) you can use is to just check the reference counts before replacing the object; if sys.getrefcount for a given name/attribute reports 2, then del-ing (or rebinding) that name/attribute will release it (assuming no threads might create additional references between the test and the del/rebind). You expect 2, not 1, because calling sys.getrefcount creates a temporary reference to the object in question. If it reports a number greater than 2, del-ing/rebinding could still lead to the object being deleted eventually when the cyclic garbage collector runs, if the object was part of a reference cycle; but for a reference count of 2 (or 1 for something otherwise unnamed, e.g. sys.getrefcount(''.join(('f', '9')))) the behavior will be deterministic.
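A small sketch of that reference-count check (CPython-specific; the string is built at runtime with ''.join so the literal-interning caveats above don't apply):

import sys

s = ''.join(('sure', 'ly'))    # runtime-built, so not interned or cached as a constant
print(sys.getrefcount(s))      # 2: the name 's' plus getrefcount's temporary argument

alias = s
print(sys.getrefcount(s))      # 3

del alias
if sys.getrefcount(s) == 2:
    del s                      # in CPython this deallocates the string immediately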
From the documentation about gc:
... the collector supplements the reference counting already used in Python...
And from gc.is_tracked():
Returns True if the object is currently tracked by the garbage collector, False otherwise. As a general rule, instances of atomic types aren’t tracked and instances of non-atomic types (containers, user-defined objects…) are.
Strings are not tracked by the garbage collector:
In [1]: import gc
In [2]: test = 'surely'
In [3]: gc.is_tracked(test)
Out[3]: False
Looking at the gc documentation, there doesn't seem to be a method there for accessing the reference counts from within the language.
Note that at least for me, using string_at doesn't work from the interactive interpreter. It does work in a script.
By looking at the CPython implementation it seems the return value of a string split() is a list of newly allocated strings. However, since strings are immutable it seems one could have made substrings out of the original string by pointing at the offsets.
Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all its substrings are.
Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.
The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my file the most, and I do that just with fp.read() and then run a regex on the whole file at once rather than going one line at a time.
I want to know about the top 10 most common user agents, so I keep this around in a list.
Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
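Python's buffer-sharing types already show this failure mode today: a tiny memoryview slice keeps its whole parent buffer alive. A scaled-down sketch (10 MB instead of 8 GB, so it actually runs quickly):

import sys

big = bytes(10_000_000)            # stand-in for the whole log file
tiny = memoryview(big)[:20]        # 20-byte "substring" that shares storage, no copy

del big                            # the name is gone...
print(sys.getsizeof(tiny.obj))     # ...but ~10 MB is still pinned by the 20-byte view

tiny = bytes(tiny)                 # copying breaks the link and lets the big buffer be freed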
Java used this sharing technique prior to Java 7, so the same reasoning applies. See "Java 7 String - substring complexity" and JDK-4513622: (str) keeping a substring of a field prevents GC for object.
Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.
A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.
I see little in Kevin's answer which is applicable to this question. The decision had nothing to do with the lack of circular garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement, as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.
Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another type, then Python's normal type dispatcher would automatically use the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive". Python 3.3 (because of PEP 393) has those 4 back-end Unicode types, and decides what to do based on branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.
It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
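A quick way to see that interning at work (CPython behavior; the identity checks are implementation details, not language guarantees):

import sys

a = 'surely'
b = 'surely'
print(a is b)                  # True: identifier-like literals are interned at compile time

c = ''.join(('sure', 'ly'))    # built at runtime, so not interned
print(a is c)                  # False in CPython
print(a is sys.intern(c))      # True once we intern it explicitly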
I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1 MB). Refer to pages 113 and 114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]) provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work: it will raise a TypeError when an object's size cannot be determined and you don't specify the default parameter.
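As a rough sketch of a poor man's "map file" using that approach (assuming getsizeof or an equivalent exists on the target; the sketch itself is written for desktop CPython):

import sys

total = 0
for name in sorted(globals().keys()):
    size = sys.getsizeof(globals()[name], 0)   # default 0 instead of raising TypeError
    print("%-24s %8d bytes" % (name, size))
    total += size
print("approximate total: %d bytes" % total)

Note this only counts the objects themselves, not anything they refer to, so treat the numbers as a lower bound.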
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain once with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted doc strings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]
I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.
At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."
The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.
My code is just
f = open('somelargefile', 'rb')
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
So my question is am I missing something obvious here? Is there a way to get mmap to work portably on large files or should I go back to my naïve file wrapper?
Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration here is a simple class that has a small part of the functionality of mmap.
import os

class Mmap(object):
    def __init__(self, f):
        """Initialise with a file object."""
        self.source = f

    def __getitem__(self, key):
        try:
            # A slice
            self.source.seek(key.start, os.SEEK_SET)
            return self.source.read(key.stop - key.start)
        except AttributeError:
            # single element
            self.source.seek(key, os.SEEK_SET)
            return self.source.read(1)
It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:
map2 = Mmap(f)
print map2[0:10]
print map2[10000000000:10000000010]
except that there are no restrictions on filesize. Not too difficult really...
From IEEE 1003.1:
The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.
It needs all the virtual address space because that's exactly what mmap() does.
The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.
Sorry to answer my own question, but I think the real problem I had was not realising that mmap was a standard POSIX system call with particular characteristics and limitations, and that the Python mmap is just supposed to expose its functionality.
The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!
Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.
A 32-bit program and operating system can only address a maximum of 2^32 bytes of memory, i.e. 4 GB. There are other factors that make the total even smaller; for example, Windows reserves between 0.5 and 2 GB of that address space for hardware access, and of course your program is going to take some space as well.
Edit: The obvious thing you're missing is an understanding of the mechanics of mmap, on any operating system. It allows you to map a portion of a file to a range of memory - once you've done that, any access to that portion of the file happens with the least possible overhead. It's low overhead because the mapping is done once, and doesn't have to change every time you access a different range. The drawback is that you need an open address range sufficient for the portion you're trying to map. If you're mapping the whole file at once, you'll need a hole in the memory map large enough to fit the entire file. If such a hole doesn't exist, or is bigger than your entire address space, it fails.
The mmap module provides all the tools you need to poke around in your large file, but due to the limitations other folks have mentioned, you can't map it all at once. You can map a good-sized chunk at once, do some processing, then unmap that and map another. The key arguments to the mmap class are length and offset, which do exactly what they sound like, allowing you to map length bytes, starting at byte offset in the mapped file. Any time you wish to read a section of memory that is outside the mapped window, you have to map in a new window.
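A sketch of such a sliding window (read-only; the helper and file name here are made up, and mmap offsets must be a multiple of mmap.ALLOCATIONGRANULARITY, so we round down and compensate):

import mmap

def read_window(path, offset, length):
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran        # round down to a legal mapping offset
    delta = offset - aligned
    with open(path, 'rb') as f:
        # length + delta must not run past the end of the file
        m = mmap.mmap(f.fileno(), length + delta, offset=aligned,
                      access=mmap.ACCESS_READ)
        try:
            return m[delta:delta + length]
        finally:
            m.close()

# e.g. ten bytes starting 3 GiB into the file, without mapping the whole thing:
# chunk = read_window('somelargefile', 3 * 1024 ** 3, 10)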
The point you are missing is that mmap is a memory mapping function that maps a file into memory for arbitrary access across the requested data range by any means.
What you are looking for sounds more like some sort of a data window class that presents an api allowing you to look at small windows of a large data structure at anyone time. Access beyond the bounds of this window would not be possible other than by calling the data window's own api.
This is fine, but it is not a memory map, it is something that offers the advantage of a wider data range at the cost of a more restrictive api.
Use a 64-bit computer, with a 64-bit OS and a 64-bit python implementation, or avoid mmap()
mmap() requires CPU hardware support to make sense with files bigger than a few GiB.
It uses the CPU's MMU and interrupt subsystems to expose the data as if it were already-loaded RAM.
The MMU is hardware which will generate an interrupt whenever an address corresponding to data not in physical RAM is accessed, and the OS will handle the interrupt in a way that makes sense at runtime, so the accessing code never knows (or needs to know) that the data doesn't fit in RAM.
This makes your accessing code simple to write. However, to use mmap() this way, everything involved will need to handle 64 bit addresses.
Or else it may be preferable to avoid mmap() altogether and do your own memory management.
You're setting the length parameter to zero, which means map in the entire file. On a 32 bit build, this won't be possible if the file length is more than 2GB (possibly 4GB).
You ask the OS to map the entire file in a memory range. It won't be read until you trigger page faults by reading/writing, but it still needs to make sure the entire range is available to your process, and if that range is too big, there will be difficulties.