[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]
I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.
At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."
The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.
My code is just
import mmap

f = open('somelargefile', 'rb')
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
So my question is am I missing something obvious here? Is there a way to get mmap to work portably on large files or should I go back to my naïve file wrapper?
Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration, here is a simple class that has a small part of the functionality of mmap.
import os

class Mmap(object):
    def __init__(self, f):
        """Initialise with a file object."""
        self.source = f

    def __getitem__(self, key):
        try:
            # A slice
            self.source.seek(key.start, os.SEEK_SET)
            return self.source.read(key.stop - key.start)
        except AttributeError:
            # A single element
            self.source.seek(key, os.SEEK_SET)
            return self.source.read(1)
It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:
map2 = Mmap(f)
print map2[0:10]
print map2[10000000000:10000000010]
except that there are no restrictions on filesize. Not too difficult really...
From IEEE 1003.1:
The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.
It needs all the virtual address space because that's exactly what mmap() does.
The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access it as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4 GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern so that you limit how many times you have to revisit previous windows.
Sorry to answer my own question, but I think the real problem I had was not realising that mmap is a standard POSIX system call with particular characteristics and limitations, and that the Python mmap is supposed simply to expose its functionality.
The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!
Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.
A 32-bit program and operating system can only address a maximum of 2^32 bytes of memory, i.e. 4 GB. There are other factors that make the usable total even smaller; for example, Windows reserves between 0.5 and 2 GB for hardware access, and of course your program is going to take some space as well.
Edit: The obvious thing you're missing is an understanding of the mechanics of mmap, on any operating system. It allows you to map a portion of a file to a range of memory - once you've done that, any access to that portion of the file happens with the least possible overhead. It's low overhead because the mapping is done once, and doesn't have to change every time you access a different range. The drawback is that you need an open address range sufficient for the portion you're trying to map. If you're mapping the whole file at once, you'll need a hole in the memory map large enough to fit the entire file. If such a hole doesn't exist, or is bigger than your entire address space, it fails.
The mmap module provides all the tools you need to poke around in your large file, but due to the limitations other folks have mentioned, you can't map it all at once. You can map a good-sized chunk, do some processing, then unmap it and map another. The key arguments to the mmap class are length and offset, which do exactly what they sound like: they let you map length bytes, starting at byte offset into the file. Any time you want to read a section of the file that lies outside the currently mapped window, you have to map in a new window.
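For illustration, here is a minimal sketch of such a sliding window (the offset keyword and mmap.ALLOCATIONGRANULARITY require Python 2.6+; the file name and the numbers are placeholders, and the file is assumed to be big enough):

import mmap
import os

f = open('somelargefile', 'rb')
size = os.fstat(f.fileno()).st_size

window = 64 * 1024 * 1024    # map 64 MB at a time; small enough for a 32-bit address space
want = 10000000000           # byte we want to read, well past the 4 GB mark

# offset must be a multiple of the allocation granularity
aligned = (want // mmap.ALLOCATIONGRANULARITY) * mmap.ALLOCATIONGRANULARITY
length = min(window, size - aligned)

m = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=aligned)
print m[want - aligned:want - aligned + 10]    # same bytes as map2[10000000000:10000000010]
m.close()
f.close()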
The point you are missing is that mmap is a memory mapping function that maps a file into memory for arbitrary access across the requested data range by any means.
What you are looking for sounds more like a data window class that presents an API allowing you to look at small windows of a large data structure at any one time. Access beyond the bounds of the current window would not be possible other than by calling the data window's own API.
This is fine, but it is not a memory map; it is something that offers the advantage of a wider data range at the cost of a more restrictive API.
Use a 64-bit computer, with a 64-bit OS and a 64-bit Python implementation, or avoid mmap().
mmap() requires CPU hardware support to be practical with files larger than a few GiB.
It uses the CPU's MMU and interrupt subsystems to expose the data as if it were already loaded into RAM.
The MMU is hardware which will generate an interrupt whenever an address corresponding to data not in physical RAM is accessed, and the OS will handle the interrupt in a way that makes sense at runtime, so the accessing code never knows (or needs to know) that the data doesn't fit in RAM.
This makes your accessing code simple to write. However, to use mmap() this way, everything involved will need to handle 64 bit addresses.
Or else it may be preferable to avoid mmap() altogether and do your own memory management.
You're setting the length parameter to zero, which means map in the entire file. On a 32 bit build, this won't be possible if the file length is more than 2GB (possibly 4GB).
You ask the OS to map the entire file in a memory range. It won't be read until you trigger page faults by reading/writing, but it still needs to make sure the entire range is available to your process, and if that range is too big, there will be difficulties.
Related
Is there any performance downside of using the zlib decompressobj function instead of decompress?
I'm asking because a Python app that I work with decompresses files using zlib. For the last few months everything was working fine; however, one type of file grew over the server's memory limit, which made the decompress function fail. Based on the docs I can switch to the decompressobj function, which works on chunks and can handle big files. The thing is that I have more usages of the decompress function, and I'm thinking about changing all of them to decompressobj. Is that OK, or might it make the code slower?
First of all, premature optimization is the root of all evil. Only optimize something once it is actually too inefficient in practice, you have identified the resource hog (e.g. with profiling), and the effect is large enough to be worth the effort and the added complexity (= extra maintenance burden down the line).
Both the zlib.decompress and zlib.decompressobj.decompress implementations live in zlibmodule.c, as zlib_decompress_impl and zlib_Decompress_decompress_impl respectively.
They do not share code but their code is pretty much the same (as expected) and delegates to the same zlib C library functions.
So in terms of raw decompression there is no difference between the two.
There will likely be a tiny overhead with decompressobj, due to the extra logic and the repeated Python calls, but if the data are large, decompression time will dwarf it.
So whether replacing decompress with decompressobj is worth it (or will have any effect at all) depends on whether memory, processor or I/O is the bottleneck in each particular case: a positive effect if memory, a negative effect if processor, no effect if I/O. (See the first paragraph for guidance.)
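If you do decide to switch, a minimal sketch of the chunked approach might look like this (the paths and chunk size are placeholders, not anything from the original app):

import zlib

def decompress_stream(in_path, out_path, chunk_size=64 * 1024):
    # Feed the compressed file to decompressobj one chunk at a time,
    # so memory use stays bounded regardless of file size.
    d = zlib.decompressobj()
    src = open(in_path, 'rb')
    dst = open(out_path, 'wb')
    try:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(d.decompress(chunk))
        dst.write(d.flush())   # emit whatever is still buffered
    finally:
        src.close()
        dst.close()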
Is there a way to call the POSIX mlock function from Python? The effect of mlock is to disable swapping out certain objects.
I know there are still other issues when it comes to securing cryptographic keys, I just want to know how to contain them in RAM.
For CPython, there is no good answer for this that doesn't involve writing a Python C extension, since mlock works on pages, not objects. Even if you used ctypes to retrieve the necessary addresses and mlock-ed them all through ctypes mlock calls, you'd have a hell of a time determining when to mlock and when to munlock. You'd need to know the memory addresses and sizes of all protected data types; since mlock works on pages, you'd have to carefully track how many objects currently live in any given page (because if you just mlock and munlock blindly and there is more than one thing to lock in a page, the first munlock would unlock all of them; mlock/munlock is a boolean flag, it doesn't count the locks and unlocks).
Even if you manage that, you still would have a race between data acquisition and mlock during which the data could be written to swap.
You could partially avoid these problems through careful use of the mmap module and memoryviews (mmap gives you pages of memory, memoryview references said memory without copying it, so ctypes could be used to mlock the page), but you'd have to build it all from scratch.
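As a very rough, Linux-only sketch of that from-scratch approach (Python 2.7+; 'libc.so.6' and the whole flow are assumptions, and this is not hardened code):

import ctypes
import mmap

libc = ctypes.CDLL('libc.so.6', use_errno=True)   # platform assumption

page = mmap.mmap(-1, mmap.PAGESIZE)               # anonymous writable page
addr = ctypes.addressof(ctypes.c_char.from_buffer(page))

# Pin the page so it can never be written to swap
if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE)) != 0:
    raise OSError(ctypes.get_errno(), 'mlock failed')

view = memoryview(page)        # reference the locked memory without copying
view[0:6] = b'secret'          # keep the key material inside the locked page
# ... use the key ...
view[0:6] = b'\x00' * 6        # zero it before unlocking
view.release()
libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(mmap.PAGESIZE))
page.close()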
In short, Python doesn't care about swapping or memory protection in the way you want; it trusts the swap file to be configured to your desired security (e.g. disabled or encrypted), neither providing additional protection nor providing the information you'd need to add it in.
I have a question regarding python memory management. I have the following code
def operation(data):
    # some manipulations on data
    result = something.do(data)
    # some manipulations on result
    return result
Now I am calling this function operation many times (probably more than 200 times). Does Python use the same memory for the result variable every time I call operation?
In C we can use malloc to allocate memory once and reuse the same memory in order to avoid fragmentation.
The whole point of high-level languages like Python is that they free you from having to worry about memory management. If exact control over memory allocation is important to you, you should write C. If not, you can write Python.
As most Python programmers will tell you from their experience, manual memory management isn't nearly as important as you think it is.
No, it does not, but it is not a big deal: once you return from the function, the variable is deleted, so there are no memory-capacity issues involved. In terms of performance it will not matter that much either.
No, it does not.
You can, however, write optimized code in C and use it in Python:
http://docs.python.org/2/extending/extending.html
This will help if you are concerned about performance.
@heisenberg
Your question is valid, and as you anticipated, the code above might create small fragments of free memory chunks. The interesting point to note here is that these free chunks won't be returned to the operating system; rather, Python's memory manager manages its own pool of free memory blocks.
However, these free memory blocks can be reused by Python to satisfy the next allocation request.
There is a good explanation of this at: http://deeplearning.net/software/theano/tutorial/python-memory-management.html
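As a tiny, CPython-specific illustration of that reuse (the repeated address is typical behaviour, not a guarantee):

def make_list():
    result = [0] * 1000
    return id(result)

# On CPython, the block freed when the first list dies is often handed
# straight back out, so both calls frequently print the same address.
print make_list()
print make_list()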
I'm working on a pure Python file parser for event logs, which may range in size from kilobytes to gigabytes. Is there a module that abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:
with FileBackedBuffer('/my/favorite/path', 'rb') as buf:
    header = buf[0:0x10]
    footer = buf[0x10000000:]
The mmap module may fulfill my requirements; however, I have two reservations that I'd appreciate feedback on:
It is important that the module handle files larger than available RAM/swap. I am unsure if mmap can do this well.
The mmap constructors are different depending on OS. This makes me hesitant as I am looking to write nicely cross-platform code, and would rather not muck in OS specifics. I will if I need to, but this set off a warning that I might be looking in the wrong place.
If mmap is the correct module for such as task, how does it handle these two points? If it is not, what is an appropriate module?
mmap can easily handle files larger than RAM/swap. What mmap can't do is handle files larger than the address space, which means that 32bit systems are limited in how large a file they can use.
What happens with mmap is that the OS will only keep in memory as much of the data as it chooses to, but your program will think it is all there. Be careful with your usage patterns, though: if your data doesn't fit in RAM and you jump around too randomly, it will swap (discard pages from your file that you haven't used recently to make room for the new pages to be loaded).
If you don't need to specify anything besides fileno and length, I don't believe you need to worry about the platform-specific arguments to mmap. If you do need the extra arguments, then you will either have to master Windows versus Unix, or pass that on to your users. I don't know what your library will be, but it may be nice to provide reasonable defaults on both platforms while also letting the user tweak the options. It seems unlikely that you would care about the Windows tagname option; and if you are cross-platform, just accept the Unix default for prot, since you have no choice on Windows. That leaves only MAP_PRIVATE and MAP_SHARED. The default is MAP_SHARED; I'm not sure whether that most closely matches Windows behavior, but accepting the default is probably fine there.
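If mmap ends up being the wrong fit, a minimal sketch of the FileBackedBuffer from the question, built on plain seek/read so file size is never limited by address space, might look like this (the class is hypothetical, not a published module):

import os

class FileBackedBuffer(object):
    # Expose a file as a sliceable, read-only, buffer-like object.
    def __init__(self, path, mode='rb'):
        self.f = open(path, mode)
        self.size = os.fstat(self.f.fileno()).st_size

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self.size)
            if step != 1:
                raise ValueError('extended slices not supported in this sketch')
            self.f.seek(start, os.SEEK_SET)
            return self.f.read(max(0, stop - start))
        if key < 0:
            key += self.size
        self.f.seek(key, os.SEEK_SET)
        return self.f.read(1)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.f.close()
        return False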
I have a Telit module which runs [Python 1.5.2+](http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16 KB) and the amount of RAM (~1 MB). Refer to pages 113 and 114 of the document for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object[, default]), provided it is supported by Telit's libs; I don't think they're using a straight implementation of CPython. Even then, this doesn't always work: it will raise a TypeError when an object's size cannot be determined, unless you specify the default parameter.
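For what it's worth, usage would look something like this (assuming sys.getsizeof is available at all on that interpreter; it needs CPython 2.6+):

import sys

print sys.getsizeof([1, 2, 3])    # shallow size of the list object itself, in bytes
print sys.getsizeof('abc', -1)    # the second argument is returned instead of
                                  # raising TypeError when the size is unknown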
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
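Something along these lines, perhaps (dump_symbols is a made-up name, not the recipe's code):

def dump_symbols(namespace):
    # Walk a namespace and print every name bound in it,
    # together with the type of the value it is bound to.
    for name in sorted(namespace):
        print '%-24s %s' % (name, type(namespace[name]).__name__)

dump_symbols(globals())           # or vars(some_module)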
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post reminds me of my pain with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter on the module is not a full-fledged implementation. What I did was to reuse the same local variable names as much as possible. I also deleted doc strings for functions (those count too) and replaced them with #comments.
In the end, I want to say that this module is good for small applications. The Python interpreter does not support threads or interrupts, so your program must be a super loop. As your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.