Is there any performance downside of using the zlib decompressobj function instead of decompress?
I'm asking because a Python app that I work with decompresses files using zlib. For the last few months everything was working fine; however, one type of file grew beyond the server memory limit, which made the decompress function fail. Based on the docs I can switch to the decompressobj function, which works on chunks and can handle big files. The thing is that I have more usages of the decompress function, and I'm thinking about changing all of them to decompressobj. Is that OK, or might it make the code slower?
First of all, premature optimization is the root of all evil. Only optimize once something proves too inefficient in practice, you have identified the resource hog (e.g. with profiling), and the effect is large enough to be worth the effort and the added complexity (= extra maintenance burden down the line).
Both the zlib.decompress and zlib.decompressobj.decompress implementations are in zlibmodule.c, as zlib_decompress_impl and zlib_Decompress_decompress_impl, respectively.
They do not share code but their code is pretty much the same (as expected) and delegates to the same zlib C library functions.
So in terms of raw decompression, there is no difference between the two.
There will likely be a tiny overhead with decompressobj due to the extra logic and repeated Python calls -- but if the data is large, decompression time will dwarf it.
So whether replacing decompress with decompressobj is worth it (or will have any effect at all) depends on whether memory, processor or I/O is the bottleneck in each particular case (positive effect if memory, negative if processor, none if I/O). (So refer back to the first paragraph for guidance.)
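For reference, a minimal sketch of chunked decompression with decompressobj might look like this (the file names and the 64 KiB chunk size are my own placeholders, not from the question):

import zlib

d = zlib.decompressobj()
with open('compressed.bin', 'rb') as src, open('plain.bin', 'wb') as dst:
    while True:
        chunk = src.read(64 * 1024)  # read 64 KiB at a time; tune as needed
        if not chunk:
            break
        dst.write(d.decompress(chunk))
    dst.write(d.flush())  # write whatever is left in the internal buffer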
Related
I'm working on a pure Python file parser for event logs, which may range in size from kilobytes to gigabytes. Is there a module that abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:
with FileBackedBuffer('/my/favorite/path', 'rb') as buf:
    header = buf[0:0x10]
    footer = buf[0x10000000:]
The mmap module may fulfill my requirements; however, I have two reservations that I'd appreciate feedback on:
It is important that the module handle files larger than available RAM/swap. I am unsure if mmap can do this well.
The mmap constructors are different depending on the OS. This makes me hesitant, as I am looking to write nicely cross-platform code and would rather not muck around in OS specifics. I will if I need to, but this set off a warning that I might be looking in the wrong place.
If mmap is the correct module for such a task, how does it handle these two points? If it is not, what is an appropriate module?
mmap can easily handle files larger than RAM/swap. What mmap can't do is handle files larger than the address space, which means that 32-bit systems are limited in how large a file they can use.
What happens with mmap is that the OS will only keep in memory as much of the data as it chooses to, but your program will think it is all there. Be careful with usage patterns, though: if your data doesn't fit in RAM and you jump around too randomly, it will swap (discarding pages from your file that you haven't used recently to make room for the new pages to be loaded).
If you don't need to specify anything besides fileno and length, I don't believe you need to worry about the platform-specific arguments to mmap. If you do need the extra arguments, then you will either have to master Windows versus Unix, or pass that on to your users. I don't know what your library will be, but it may be nice to provide reasonable defaults on both platforms while also allowing the user to tweak the options. It seems unlikely that you would care about the Windows-only tagname option; and if you are cross-platform, just accept the Unix default for prot, since you have no choice on Windows. That only leaves caring about MAP_PRIVATE and MAP_SHARED. The default is MAP_SHARED; I'm not sure whether that is the option that most closely matches Windows behavior, but accepting the default is probably fine there.
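As a hedged illustration of sticking to the portable arguments, a read-only mapping using only fileno, length and access might look like this (the path is a placeholder):

import mmap

with open('/my/favorite/path', 'rb') as f:
    # access=ACCESS_READ works on both Unix and Windows, so no
    # platform-specific prot/flags/tagname arguments are needed
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = buf[0:0x10]
    buf.close()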
I have a Telit module which runs [Python 1.5.2+](http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1MB). Refer to pages 113-114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap usage).
I need something similar to the map file that gcc generates after the linking process, which shows me each constant/variable and symbol, its address, and the size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object[, default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work and will raise a TypeError when an object's size cannot be determined, if you don't specify the default parameter.
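For instance, a quick manual footprint check could look like the sketch below, assuming getsizeof is actually available on the target interpreter (the payload object is made up):

import sys

payload = {'id': 42, 'name': 'sensor'}
# the default argument (-1 here) is returned instead of raising
# TypeError when the size of an object cannot be determined
print(sys.getsizeof(payload, -1))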
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
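If dis does work there, disassembling even a tiny function gives a feel for what the interpreter executes:

import dis

def add(a, b):
    return a + b

dis.dis(add)  # prints the bytecode instructions for add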
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain once with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact back then. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted doc strings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
I thought I once read on SO that Python will compile and run slightly more quickly if commonly called code is placed into methods or separate files. Does putting Python code in methods have an advantage over separate files or vice versa? Could someone explain why this is? I'd assume it has to do with memory allocation and garbage collection or something.
It doesn't matter. Don't structure your program around code speed; structure it around coder speed. If you write something in Python and it's too slow, find the bottleneck with cProfile and speed it up. How do you speed it up? You try things and profile them. In general, function call overhead in critical loops is high. Byte compiling your code takes a very small amount of time and only needs to be done once.
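If you do need to hunt for a bottleneck, a minimal cProfile run might look like this (hot_loop is just a made-up stand-in for your hot code):

import cProfile

def hot_loop():
    return sum(i * i for i in range(10 ** 6))

cProfile.run('hot_loop()')  # prints per-function call counts and times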
No. Regardless of where you put your code, it has to be parsed once and compiled if necessary. Putting code in methods versus separate files might make an insignificant performance difference, but you shouldn't worry about it.
About the only language right now where you have to worry about structuring code "right" is JavaScript, because it has to be downloaded over the network to the client's computer. That's why there are so many compressors and obfuscators for it. Stuff like this isn't done with Python because it isn't needed.
Two things:
Code in separate modules is compiled into bytecode at first runtime and saved as a precompiled .pyc file, so it doesn't have to be recompiled at the next run as long as the source hasn't been modified since. This might result in a small performance advantage, but only at program startup.
Also, Python accesses variables a bit more efficiently if they are placed inside functions instead of at the top level of a file, because locals are looked up by index rather than by dictionary lookup. But I don't think that's what you're referring to here, is it?
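A rough way to observe the difference is to time the same loop over a local versus a global variable; this is only a sketch, and exact numbers will vary:

import timeit

x = 0

def use_global():
    global x
    x = 0
    for i in range(100000):
        x = x + i

def use_local():
    x = 0
    for i in range(100000):
        x = x + i

# locals are typically noticeably faster than globals in CPython
print("global:", timeit.timeit(use_global, number=100))
print("local: ", timeit.timeit(use_local, number=100))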
[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]
I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.
At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."
The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.
My code is just
import mmap

f = open('somelargefile', 'rb')
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
So my question is am I missing something obvious here? Is there a way to get mmap to work portably on large files or should I go back to my naïve file wrapper?
Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration, here is a simple class that has a small part of the functionality of mmap.
import os

class Mmap(object):
    def __init__(self, f):
        """Initialise with a file object."""
        self.source = f

    def __getitem__(self, key):
        try:
            # A slice
            self.source.seek(key.start, os.SEEK_SET)
            return self.source.read(key.stop - key.start)
        except AttributeError:
            # single element
            self.source.seek(key, os.SEEK_SET)
            return self.source.read(1)
It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:
map2 = Mmap(f)
print map2[0:10]
print map2[10000000000:10000000010]
except that there are no restrictions on filesize. Not too difficult really...
From IEEE 1003.1:
The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.
It needs all the virtual address space because that's exactly what mmap() does.
The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.
Sorry to answer my own question, but I think the real problem I had was not realising that mmap is a standard POSIX system call with particular characteristics and limitations, and that the Python mmap is supposed simply to expose its functionality.
The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!
Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.
A 32-bit program and operating system can only address a maximum of 32 bits of memory i.e. 4GB. There are other factors that make the total even smaller; for example, Windows reserves between 0.5 and 2GB for hardware access, and of course your program is going to take some space as well.
Edit: The obvious thing you're missing is an understanding of the mechanics of mmap, on any operating system. It allows you to map a portion of a file to a range of memory - once you've done that, any access to that portion of the file happens with the least possible overhead. It's low overhead because the mapping is done once, and doesn't have to change every time you access a different range. The drawback is that you need an open address range sufficient for the portion you're trying to map. If you're mapping the whole file at once, you'll need a hole in the memory map large enough to fit the entire file. If such a hole doesn't exist, or is bigger than your entire address space, it fails.
The mmap module provides all the tools you need to poke around in your large file, but due to the limitations other folks have mentioned, you can't map it all at once. You can map a good-sized chunk at once, do some processing, and then unmap that and map another. The key arguments to the mmap class are length and offset, which do exactly what they sound like, allowing you to map length bytes starting at byte offset in the mapped file. Any time you wish to read a section of memory that is outside the mapped window, you have to map in a new window.
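A minimal sliding-window sketch along those lines might look like the following; the file name and window size are placeholders, and note that offset must be a multiple of mmap.ALLOCATIONGRANULARITY:

import mmap
import os

# window size kept a multiple of the allocation granularity
WINDOW = mmap.ALLOCATIONGRANULARITY * 256

with open('somelargefile', 'rb') as f:
    size = os.fstat(f.fileno()).st_size
    offset = 0
    while offset < size:
        length = min(WINDOW, size - offset)
        window = mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset)
        first_bytes = window[:16]  # process this window here
        window.close()
        offset += length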
The point you are missing is that mmap is a memory mapping function that maps a file into memory for arbitrary access across the requested data range by any means.
What you are looking for sounds more like some sort of data-window class that presents an API allowing you to look at small windows of a large data structure at any one time. Access beyond the bounds of this window would not be possible other than by calling the data window's own API.
This is fine, but it is not a memory map, it is something that offers the advantage of a wider data range at the cost of a more restrictive api.
Use a 64-bit computer, with a 64-bit OS and a 64-bit Python implementation, or avoid mmap().
mmap() requires CPU hardware support to make sense for files bigger than a few GiB.
It uses the CPU's MMU and interrupt subsystems to expose the data as if it were already loaded into RAM.
The MMU is hardware which will generate an interrupt whenever an address corresponding to data not in physical RAM is accessed, and the OS will handle the interrupt in a way that makes sense at runtime, so the accessing code never knows (or needs to know) that the data doesn't fit in RAM.
This makes your accessing code simple to write. However, to use mmap() this way, everything involved needs to handle 64-bit addresses.
Or else it may be preferable to avoid mmap() altogether and do your own memory management.
You're setting the length parameter to zero, which means map the entire file. On a 32-bit build, this won't be possible if the file length is more than 2GB (possibly 4GB).
You ask the OS to map the entire file in a memory range. It won't be read until you trigger page faults by reading/writing, but it still needs to make sure the entire range is available to your process, and if that range is too big, there will be difficulties.
Why does Python compile the source to bytecode before interpreting?
Why not interpret the source directly?
Nearly no interpreter really interprets code directly, line by line – it's simply too inefficient. Almost all interpreters use some intermediate representation which can be executed easily. Also, small optimizations can be performed on this intermediate code.
Furthermore, Python stores this bytecode on disk, which is a huge advantage the next time the code is executed: Python doesn't have to parse the source anymore, and parsing is the slowest part of the compile process. Thus, a bytecode representation reduces execution overhead quite substantially.
Because you can compile to a .pyc once and interpret from it many times.
So if you're running a script many times you only have the overhead of parsing the source code once.
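You can inspect that cached artifact yourself; here is a small sketch, assuming a hypothetical mymodule.py (on modern Python 3 the .pyc lands in a __pycache__ directory):

import importlib.util
import py_compile

py_compile.compile('mymodule.py')  # compile to bytecode without importing
print(importlib.util.cache_from_source('mymodule.py'))  # path of the cached .pyc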
Because interpreting from bytecode directly is faster. It avoids the need to do lexing, for one thing.
Re-lexing and parsing the source code over and over, rather than doing it just once (most often on the first import), would obviously be a silly and pointless waste of effort.
Although there is a small efficiency aspect to it (you can store the bytecode on disk or in memory), it's mostly engineering: it allows you to separate parsing from interpreting. Parsers can often be nasty creatures, full of edge cases, having to conform to esoteric rules like using just the right amount of lookahead and resolving shift-reduce problems. By contrast, interpreting is really simple: it's just a big switch statement on the bytecode's opcode.
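As a toy illustration of that "big switch on the opcode" idea, here is a made-up two-instruction bytecode and its dispatch loop:

PUSH, ADD = 0, 1  # a made-up instruction set

def run(code):
    stack = []
    for op, arg in code:
        if op == PUSH:                       # push a constant
            stack.append(arg)
        elif op == ADD:                      # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()

print(run([(PUSH, 2), (PUSH, 3), (ADD, None)]))  # -> 5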
I doubt very much that the reason is performance, albeit a nice side effect. I would say that it's only natural to think a VM built around some high-level assembly language would be more practical than find-and-replace over some source code string.
Edit:
Okay, clearly, whoever put a -1 vote on my post without leaving a reasonable comment to explain knows very little about virtual machines (run-time environments).
http://channel9.msdn.com/shows/Going+Deep/Expert-to-Expert-Erik-Meijer-and-Lars-Bak-Inside-V8-A-Javascript-Virtual-Machine/