Faster repetitive uses of bz2.BZ2File for pickling - python

I'm pickling multiple objects repeatedly, though not consecutively, and the pickled output files turned out to be too large (about 256MB each).
So I tried bz2.BZ2File instead of open, and each file shrank to 1.3MB. (Yeah, wow.) The problem is that it now takes far too long (around 95 seconds to pickle one object), and I want to speed it up.
Each object is a dictionary, and most of them have similar structures (or hierarchies, if that describes it better): almost the same set of keys, and the value corresponding to each key normally has some specific structure, and so on. Many of the dictionary values are numpy arrays, and I expect many zeros will appear there.
Can you give me some advice to make it faster?
Thank you!

I ended up using lz4, which is a blazingly fast compression algorithm.
There is a python wrapper, which can be installed easily:
pip install lz4
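For illustration, here's a minimal sketch of swapping bz2 for lz4's file interface around pickle (the file name and example dict are made up; it assumes the lz4.frame module provided by the package above):

import pickle
import lz4.frame  # provided by `pip install lz4`

# Illustrative object: a dict of mostly-zero numeric data
data = {"weights": [0.0] * 1_000_000, "label": "run_42"}

# lz4.frame.open returns a file-like object, so pickle can stream straight into it
with lz4.frame.open("data.pkl.lz4", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with lz4.frame.open("data.pkl.lz4", "rb") as f:
    restored = pickle.load(f)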

Related

Pickle as a performance benchmark, or, speedier containment checks

I'm loading a data structure from a CSV. The resulting structure is several collection classes composed of item objects, each of which has one or two attributes that point to children or parents in the other collections.
It's not very much data. The CSV is 800KB. Loading the data structure from CSV currently takes about 3.5 seconds. Initially, I had some O(n^2) code that was getting clobbered by containment lookups, pushing execution to the full n^2, but I've added a bit of indexing and reduced that considerably.
Now my weak points are still containment lookups, but linear. It's frustrating when the most time-consuming components of your program are repeated executions of code like:
def add_thing(thing):
    if thing not in things:
        things.append(thing)
I may be missing something (I hope I am!) but it appears there's nothing much I can do to speed these up. I don't want the things collection to contain the same thing more than once. To do that, I must check every single item that's already there. Perhaps there is a faster list that can do this for me? A C-based, dupe-free list?
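One common way to get that behaviour (not from the original post, just a sketch with invented names) is to keep a set alongside the list, so membership checks are O(1) on average instead of O(n); it does require the items to be hashable:

class UniqueList:
    """List-like container that skips duplicates, using a set for fast membership tests."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def add(self, thing):
        # Set lookup is O(1) on average, versus O(n) for `thing not in list`
        if thing not in self._seen:
            self._seen.add(thing)
            self._items.append(thing)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)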
Anyway, I thought I would pickle the resulting data structure and see how long it took to load. Admittedly, I have no idea what goes into unpickling, but my thinking is that unpickling must be closer to raw I/O since the data structure is already laid out in the pickle. I was surprised to learn that pickle.load-ing this data structure takes 2.4 seconds, only about 30% faster than my CSV slop. cPickle does the same job in 0.4 seconds, but I'm trying to keep this as apples-to-apples as possible, which I realize I may have already failed at. But I hope you can see my train of thought and criticize its weak points.
So, what am I to gather, if anything, from the exercise?
The performance of pickle.load and my CSV loading code have nothing to do with each other, and I should not be comparing them?
Things like this just take a long time?
I can optimize my containment lookups further?
I've got it all wrong?
(N.B. I'm using cProfile for profiling)

Logging an unknown number of floats in a python C extension

I'm using python to set up a computationally intense simulation, then running it in a custom-built C extension, and finally processing the results in python. During the simulation, I want to store a fixed-length number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass the results back to python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) contiguous array, piecewise-contiguous array, or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is already a computationally heavy task. I expect that a combination of A. buying lots of memory and B. setting up the experiment wisely will allow the entire log to fit in RAM. However, with my current dict-of-lists solution, keeping every variable's log in a contiguous section of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This lets you pass in objects such as lists or array.arrays or anything with an append method. This adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the append method instead of PyList_Append or similar. Using arrays reduces the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
You might want to consider doing this in Cython, instead of as a C extension module. Cython is smart, and lets you do things in a pretty pythonic way, even though it at the same time lets you use C datatypes and python datatypes.
Have you checked out the array module? It allows you to store lots of scalar, homogeneous types in a single collection.
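As a quick illustration of the array module suggestion (the values here are arbitrary), typecode 'd' stores raw C doubles contiguously rather than boxed PyFloatObjects:

from array import array

# 'd' = C double; the array holds raw doubles in one contiguous buffer
log = array("d")

# Appending is amortized O(1) and the underlying buffer auto-expands
for step in range(1000):
    log.append(step * 0.1)

print(len(log), log[0], log[-1])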
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve around storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping in windows into it as needed.
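As a rough sketch of the struct-plus-mmap idea (assuming a hypothetical log.bin file of raw native-format C doubles written by the extension), values can be decoded lazily instead of building PyFloatObjects up front:

import mmap
import struct

# Hypothetical file of raw C doubles written by the simulation / C extension
with open("log.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

n_doubles = len(mm) // 8  # a C double is 8 bytes

def value_at(i):
    # Decode a single double on demand rather than materializing all of them
    return struct.unpack_from("d", mm, i * 8)[0]

print(n_doubles, value_at(0) if n_doubles else None)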
To make your C module do the work, instead of actually returning a list, return a custom object that implements the sequence protocol, so when someone calls, say, foo.__getitem__(i), you convert _array[i] to a PyFloatObject on the fly.
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part; you can just move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).

Efficient ways to write a large NumPy array to a file

I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration being appended to the bottom of the array of the previous results.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory, and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned outright by a function, to keep down transmission costs. Which is fine, except that means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?
Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient is probably tofile (doc), which is intended for quick dumps of an array to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.
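A brief sketch of the first two options, assuming a 30x1500 array like the one in the question (file names are illustrative):

import numpy as np

arr = np.random.rand(30, 1500)

# Raw binary dump: fastest, but stores no shape/dtype metadata,
# so you must remember them yourself when reading back
arr.tofile("results.bin")
restored = np.fromfile("results.bin").reshape(30, 1500)

# .npy format: platform-independent and keeps shape/dtype
np.save("results.npy", arr)
restored = np.load("results.npy")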
I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
Try Joblib - Fast compressed persistence
One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files and load them back via memmapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib
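A small sketch of how that looks in practice (file names and the compress level are illustrative; the memmapping trick requires dumping without compression):

import numpy as np
from joblib import dump, load  # pip install joblib

results = {"run_1": np.random.rand(30, 1500)}

# compress=3 trades a little speed for a much smaller file on disk
dump(results, "results.joblib", compress=3)
restored = load("results.joblib")

# For the memmapping trick mentioned above, dump without compression,
# then pass mmap_mode so large arrays are memory-mapped rather than copied into RAM
dump(results, "results_raw.joblib")
mapped = load("results_raw.joblib", mmap_mode="r")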

Loop through list from specific point?

I was wondering if there was a way to keep extremely large lists in memory and then process those lists from specific points. Since these lists will have almost 400 billion numbers before processing, we need to split them up, but I haven't the slightest idea (since I can't find an example) of where to start when trying to process a list from a specific point in Python. Edit: Right now we are not trying to create multiple dimensions, but if it's easier then I'll certainly do it.
Even if your numbers are bytes, 400GB (or 400TB if you use billion in the long-scale meaning) does not normally fit in RAM. Therefore I guess numpy.memmap or h5py may be what you're looking for.
Further to @lazyr's point, if you use the numpy.memmap method, then my previous discussion on views into numpy arrays might well be useful.
This is also the way you should be thinking if you have stacks of memory and everything actually is in RAM.
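For example, a rough numpy.memmap sketch with made-up sizes and file names; the data lives on disk, and you slice into it from whatever offset you need:

import numpy as np

# Create a memory-mapped array of float64 backed by a file, not by RAM
big = np.memmap("numbers.dat", dtype="float64", mode="w+", shape=(10_000_000,))
big[:] = 0.0
big.flush()  # make sure the data is written through to disk

# Reopen later and process from a specific point without loading the whole file
view = np.memmap("numbers.dat", dtype="float64", mode="r", shape=(10_000_000,))
chunk = view[5_000_000:5_000_100]  # slice starting at a specific offset
print(chunk.shape)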

How can I make a large python data structure more efficient to unpickle?

I have a list of ~1.7 million "token" objects, along with a list of ~130,000 "structure" objects which reference the token objects and group them into, well, structures. It's an ~800MB memory footprint, on a good day.
I'm using __slots__ to keep my memory footprint down, so my __getstate__ returns a tuple of serializable values, which __setstate__ bungs back into place. I'm also not pickling all the instance data, just 5 items for tokens, 7-9 items for structures, all strings or integers.
Of course, I'm using cPickle, and HIGHEST_PROTOCOL, which happens to be 2 (python 2.6). The resulting pickle file is ~120MB.
On my development machine, it takes ~2 minutes to unpickle the pickle. I'd like to make this faster. What methods might be available to me, beyond faster hardware and what I'm already doing?
Pickle is not the best method for storing large amounts of similar data. It can be slow for large data sets, and more importantly, it is very fragile: changing around your source can easily break all existing datasets. (I would recommend reading what pickle at its heart actually is: a bunch of bytecode expressions. It will frighten you into considering other means of data storage/retrieval.)
You should look into using PyTables, which uses HDF5 (cross-platform and everything) to store arbitrarily large amounts of data. You don't even have to load everything from the file into memory at once; you can access it piecewise. The structure you're describing sounds like it would fit very well into a "table" object, which has a set field structure (made up of fixed-length strings, integers, small NumPy arrays, etc.) and can hold large amounts of data very efficiently. For storing metadata, I'd recommend using the ._v_attrs attribute of your tables.
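A minimal PyTables sketch along those lines, with column names and sizes invented for illustration (not the poster's actual schema):

import tables  # pip install tables

# Hypothetical row layout for the "token" objects described above
class Token(tables.IsDescription):
    text = tables.StringCol(32)
    kind = tables.Int32Col()
    position = tables.Int64Col()

with tables.open_file("tokens.h5", mode="w") as h5:
    table = h5.create_table("/", "tokens", Token, title="token objects")
    row = table.row
    for i in range(1000):
        row["text"] = b"example"
        row["kind"] = 1
        row["position"] = i
        row.append()
    table.flush()
    # Metadata can be hung off the table's attribute set
    table.attrs.source = "demo"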
