I'm using Python to set up a computationally intense simulation, then running it in a custom-built C extension and finally processing the results in Python. During the simulation, I want to store a fixed number of floats (C doubles converted to PyFloatObjects) representing my variables at every time step, but I don't know how many time steps there will be in advance. Once the simulation is done, I need to pass the results back to Python in a form where the data logged for each individual variable is available as a list-like object (for example a (wrapper around a) contiguous array, piece-wise contiguous array, or column in a matrix with a fixed stride).
At the moment I'm creating a dictionary mapping the name of each variable to a list containing PyFloatObject objects. This format is perfect for working with in the post-processing stage but I have a feeling the creation stage could be a lot faster.
Time is quite crucial since the simulation is already a computationally heavy task. I expect that a combination of A. buying lots of memory and B. setting up my experiment wisely will allow the entire log to fit in RAM. However, with my current dict-of-lists solution, keeping every variable's log in a contiguous block of memory would require a lot of copying and overhead.
My question is: What is a clever, low-level way of quickly logging gigabytes of doubles in memory with minimal space/time overhead, that still translates to a neat python data structure?
Clarification: when I say "logging", I mean storing until after the simulation. Once that's done a post-processing phase begins and in most cases I'll only store the resulting graphs. So I don't actually need to store the numbers on disk.
Update: In the end, I changed my approach a little and added the log (as a dict mapping variable names to sequence types) to the function parameters. This allows you to pass in objects such as lists or array.arrays or anything that has an append method. It adds a little time overhead because I'm using the PyObject_CallMethodObjArgs function to call the append method instead of PyList_Append or similar. Using arrays lets you reduce the memory load, which appears to be the best I can do short of writing my own expanding storage type. Thanks everyone!
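For illustration, here is roughly what that setup looks like on the Python side (a minimal sketch; run_simulation is a hypothetical pure-Python stand-in for the C-extension entry point, which would call log[name].append(value) via PyObject_CallMethodObjArgs internally):

from array import array

def run_simulation(n_steps, log):
    # Hypothetical stand-in for the C extension: it just appends one double
    # per variable per time step, exactly as the extension would.
    for step in range(n_steps):
        log["position"].append(0.1 * step)
        log["velocity"].append(0.2 * step)

log = {
    "position": array("d"),   # array.array('d') stores compact C doubles
    "velocity": array("d"),
}

run_simulation(1000, log)

# Post-processing: each entry behaves like a flat sequence of floats.
print(len(log["position"]), max(log["velocity"]))

Anything with an append method (a list, an array.array, a custom buffer type) can be dropped into the dict, which is what makes this approach flexible.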
You might want to consider doing this in Cython instead of as a C extension module. Cython is smart and lets you do things in a pretty Pythonic way, while at the same time letting you mix C data types and Python data types.
Have you checked out the array module? It lets you store large numbers of homogeneous scalar values (such as C doubles) compactly in a single collection.
If you're truly "logging" these, and not just returning them to CPython, you might try opening a file and fprintf'ing them.
BTW, realloc might be your friend here, whether you go with a C extension module or Cython.
This is going to be more a huge dump of ideas rather than a consistent answer, because it sounds like that's what you're looking for. If not, I apologize.
The main thing you're trying to avoid here is storing billions of PyFloatObjects in memory. There are a few ways around that, but they all revolve around storing billions of plain C doubles instead, and finding some way to expose them to Python as if they were sequences of PyFloatObjects.
To make Python (or someone else's module) do the work, you can use a numpy array, a standard library array, a simple hand-made wrapper on top of the struct module, or ctypes. (It's a bit odd to use ctypes to deal with an extension module, but there's nothing stopping you from doing it.) If you're using struct or ctypes, you can even go beyond the limits of your memory by creating a huge file and mmapping windows of it into memory as needed.
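A rough sketch of the "let an existing module do the work" route (the buffer contents here are made up; in practice the bytes would come from your extension, e.g. via the buffer protocol):

import struct
import numpy as np

# Pretend this is the block of raw C doubles your simulation produced.
raw = struct.pack("5d", 0.0, 1.5, 3.0, 4.5, 6.0)

view = np.frombuffer(raw, dtype=np.float64)   # zero-copy numpy view
mv = memoryview(raw).cast("d")                # stdlib-only alternative

print(view[2], mv[2])   # float objects are created lazily, only when indexed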
To make your C module do the work, instead of actually returning a list, return a custom object that implements the sequence protocol, so when someone calls, say, foo[i] (i.e. foo.__getitem__(i)), you convert _array[i] to a PyFloatObject on the fly.
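A toy Python analogue of what such a sequence type would do (the real thing would perform the conversion in C, but the idea is the same):

import struct

class DoubleLog:
    # Wraps a packed buffer of C doubles and converts elements on demand.
    def __init__(self, buf):
        self._buf = buf

    def __len__(self):
        return len(self._buf) // 8

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return struct.unpack_from("d", self._buf, i * 8)[0]

log = DoubleLog(struct.pack("3d", 1.0, 2.0, 3.0))
print(list(log))   # [1.0, 2.0, 3.0]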
Another advantage of mmap is that, if you're creating the arrays iteratively, you can create them by just streaming to a file, and then use them by mmapping the resulting file back as a block of memory.
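For example (a minimal sketch; the file name and step count are arbitrary):

import mmap
import struct
import numpy as np

# Phase 1: stream results to disk as raw doubles while the simulation runs.
with open("sim.log", "wb") as f:
    for step in range(100000):
        f.write(struct.pack("d", 0.001 * step))

# Phase 2: map the file back and treat it as one big array, without building
# per-element float objects or reading everything eagerly.
with open("sim.log", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
data = np.frombuffer(buf, dtype=np.float64)
print(data.mean(), data[-1])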
Otherwise, you need to handle the allocations. If you're using the standard array, it takes care of auto-expanding as needed, but otherwise, you're doing it yourself. The code to do a realloc and copy if necessary isn't that difficult, and there's lots of sample code online, but you do have to write it. Or you may want to consider building a strided container that you can expose to Python as if it were contiguous even though it isn't. (You can do this directly via the complex buffer protocol, but personally I've always found that harder than writing my own sequence implementation.) If you can use C++, vector is an auto-expanding array, and deque is a strided container (and if you've got the SGI STL rope, it may be an even better strided container for the kind of thing you're doing).
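For reference, the doubling strategy sketched in Python (a C version would realloc a double* in the same way; this is just to show the shape of the logic):

import struct

class GrowingDoubleBuffer:
    # Amortized-doubling storage for C doubles, kept deliberately minimal.
    def __init__(self):
        self._buf = bytearray(8 * 16)   # capacity for 16 doubles to start
        self._n = 0

    def append(self, value):
        if 8 * self._n == len(self._buf):
            self._buf.extend(bytes(len(self._buf)))   # double the capacity
        struct.pack_into("d", self._buf, 8 * self._n, value)
        self._n += 1

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return struct.unpack_from("d", self._buf, 8 * i)[0]

buf = GrowingDoubleBuffer()
for x in range(100):
    buf.append(float(x))
print(len(buf), buf[99])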
As the other answer pointed out, Cython can help for some of this. Not so much for the "exposing lots of floats to Python" part, but for the rest: you can move pieces of the Python part into Cython, where they'll get compiled into C. If you're lucky, all of the code that needs to deal with the lots of floats will work within the subset of Python that Cython implements, and the only things you'll need to expose to actual interpreted code are higher-level drivers (if even that).
I was reading an article on the benefits of cache oblivious data structures and found myself wondering if the Python implementations (CPython) use this approach? If not, is there a technical limitation preventing it?
I would say this is mostly irrelevant for built-in (standard library) Python data structures.
Creating a new data type in Python means creating a class, which is not a bare-bones wrapper around underlying primitive types or method pointers, but rather a particular kind of struct that carries a lot of additional metadata coming from the Python object data model.
There is no native tree data structure in Python. There are lists, arrays, and array-based hash tables (dict, set), along with some extensions to these in the collections module. Third-party tree / trie / etc. implementations are free to offer a cache-oblivious implementation if it suits the intended usage. This includes CPython C-level implementations, such as custom extension modules or code built with a tool like Cython.
NumPy's ndarray is a contiguous array data structure for which the user may choose the data type (i.e. the user could, in theory, choose a weird data type that is not easily made into a multiple of the machine architecture's cache line size). Perhaps some customization could be improved there for a fixed data type (and maybe the same is true for array.array), but I wonder how many array / linear algebra algorithms would actually benefit from some sort of customized cache obliviousness. Normally these sorts of libraries are written assuming a particular data type, like int32 or float64, chosen with the cache size in mind, and employ dynamic memory reallocation, like doubling, to amortize the cost of certain operations.
For example, your linked article mentions that finding the max over an array is "intrinsically" cache oblivious: because the array is contiguous, you make the maximum possible use of each cache line you read, and you read only the minimal number of cache lines. Perhaps for treating an array like a heap or something you could be clever about rearranging the memory layout to be optimal regardless of cache size, but it wouldn't be the role of a general-purpose array to have its implementation customized like that based on a very specialized use case (an array having the heap property).
In short, I would turn the question around on you and say, given the data structures that are standard in Python, do you see particular trade-offs between dynamic resizing, dynamic typing and (perhaps most importantly) general random access pattern assumptions vs. having a cache oblivious implementation backing them?
I'm pickling multiple objects repeatedly, but not consecutively. As it turned out, the pickled output files were too large (about 256 MB each).
So I tried bz2.BZ2File instead of open, and each file shrank to 1.3 MB. (Yeah, wow.) The problem is that it takes too long (about 95 seconds to pickle one object) and I want to speed it up.
Each object is a dictionary, and most of them have similar structures (or hierarchies, if that describes it better): almost the same set of keys, where the value for each key normally has some specific structure, and so on. Many of the dictionary values are numpy arrays, and I suspect they contain a lot of zeros.
Can you give me some advice to make it faster?
Thank you!
I ended up using lz4, which is a blazingly fast compression algorithm.
There is a python wrapper, which can be installed easily:
pip install lz4
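A minimal sketch of how it fits together (assuming the lz4.frame API of the python-lz4 package):

import pickle
import lz4.frame

def dump_compressed(obj, path):
    # Pickle to bytes, then compress the whole blob with LZ4.
    data = lz4.frame.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    with open(path, "wb") as f:
        f.write(data)

def load_compressed(path):
    with open(path, "rb") as f:
        return pickle.loads(lz4.frame.decompress(f.read()))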
The Julia language's syntax looks very similar to Python's, while its concept of a class (if one should even call it that) is closer to what you use in C. There were many reasons why the creators decided to depart from OOP in this respect. Still, would it have been so hard (compared to creating Julia in the first place, which is impressive) to find some canonical way to translate Python to Julia and thus get hold of all the Python libraries?
Yes. The design of Python makes it fundamentally difficult to optimize at compile time (i.e. before you run the code). It is simply false that Julia is fast BECAUSE of its JIT. Rather, Julia is designed with its type system and multiple dispatch in mind so that the compiler can know all of the necessary details to compile "the same code you would have written in C". That's what makes it fast: the type system. It makes a few trade-offs that allow it, in "type-stable" functions, to fully deduce the type of every variable, know what the memory layout of each type should be (including parametric types, so Vector{Float64} has a memory layout determined by the type and its parameter, which inlines Float64 values like a NumPy array, except this is generalized so that your own struct types get the same efficiency), and compile a version of the code specifically for the types that are seen.
There are many ways in which this is at odds with Python. For example, if the number of fields in a struct could change, then the memory layout could not be determined and these optimizations could not occur at "compile time". Julia was painstakingly designed to ensure type inferrability, and it uses that to generate fully typed code and remove all runtime checks (in type-stable functions; when a function is not type-stable, the types of its variables become dynamic rather than static and it slows down to Python-like speeds). In this sense, Julia actually isn't even heavily optimized yet: all of its performance comes "for free" given the design of its type system. Python/MATLAB/R have to try really hard to optimize at runtime because they lack the capability to make these deductions. In fact, those languages are "better optimized" right now in terms of runtime optimizations, but no one has really worked on runtime optimizations in Julia yet, because in most performance-sensitive cases you can get it all at compile time.
So then, what about Numba? Numba tries to take the route that Julia takes, but with Python code, by limiting what can be done so that it can get type-stable code and compile it efficiently. However, this means a few things. First of all, it's not compatible with all Python code or libraries. More importantly, since Python is not a language built around its type system, the tools for controlling code at the level of types are much reduced. So Numba doesn't have parametric vectors or generic code that auto-specializes via multiple dispatch, because these aren't features of the language. That also means it cannot make full use of this design, which limits how much it can do. It handles the "use only floating-point arrays" case just fine, but you can see it gets limited if you want a single piece of code to produce efficient code for "any number type, even ones I don't know about". By design, Julia does this automatically.
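As a small illustration of the kind of code Numba does handle well (a hedged sketch using numba's @njit decorator):

import numpy as np
from numba import njit

@njit
def running_sum(x):
    # A type-stable numeric loop over a float64 array: Numba can infer every
    # type here and compile the loop to tight machine code.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total

print(running_sum(np.arange(10, dtype=np.float64)))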
So at the core, Julia and Python are extremely different languages. It can be hard to see because Julia's syntax is close to Python's syntax, but they do not work the same at all.
This is a short summary of what I have described in a few blog posts. They go into more detail and show how Julia actually generates efficient code, how it gives you a generic "Python-looking style" while keeping full inferrability all the way down, and what the trade-offs are.
How type-stability plus multiple dispatch gives performance:
http://ucidatascienceinitiative.github.io/IntroToJulia/Html/WhyJulia
http://www.stochasticlifestyle.com/7-julia-gotchas-handle/
How the type system allows for highly performant generic designs:
http://www.stochasticlifestyle.com/type-dispatch-design-post-object-oriented-programming-julia/
I am developing an AI to solve an MDP. I am getting states (just integers in this case) and assigning each a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for deletion) and has very fast get/update operations. Is there something faster than the regular dictionary? I am open to anything, really: native Python, open source, I just need fast lookups.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None] * totalitems
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
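A quick way to measure this yourself (the exact numbers will depend on your machine and Python version):

import timeit

setup = "items = 10000; lst = [0.0] * items; dct = {i: 0.0 for i in range(items)}"
list_time = timeit.timeit("for i in range(items): x = lst[i]", setup=setup, number=1000)
dict_time = timeit.timeit("for i in range(items): x = dct[i]", setup=setup, number=1000)
print("list:", list_time, "dict:", dict_time)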
If you want a rapid prototype, use Python and don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for the linear algebra parts), write it in C or C++ (perhaps only the parts you then call from Python). If fast rather than ultra-fast is enough, you can also use Java or Scala.
I have a data structure which serves as a wrapper to a 2D numpy array in order to use labeled indices and perform statements such as
myMatrix["rowLabel", "colLabel"] = 1.0
Basically this is implemented as
def __setitem__(self, key, value):
    row, col = key
    ...  # Check validity of row/col labels.
    self.__matrixRepresentation[self.__rowMap[row], self.__colMap[col]] = value
I am assigning the values in a database table to this data structure, and it was straightforward to write a loop for this. However, I want to execute this loop 100 million or more times, and iteratively retrieving chunks of values from the database table and moving them to this structure takes more time than I would prefer.
All of the values I retrieve from the database table have different (row,column) pairs.
Therefore, it seems that I could parallelize the above assignment, but I don't know if numpy arrays permit simultaneous assignment using some sort of internal locking mechanism for atomic operations, or if it altogether prohibits any such thought process. If anyone has suggestions or criticisms, I would appreciate it. (If possible, in this case I'd prefer not to resort to cython or PyPy.)
Parallel execution at that level is unlikely here. The global interpreter lock will ruin your day. Besides, you're still going to have to pull each set of values out of the database sequentially, which is quite possibly going to dwarf the in-process map lookups and array assignment. Especially if the database is on a remote machine.
If at all possible, don't store your matrix in that database. Dedicated formats exist for efficiently storing large arrays. HDF5/NetCDF come to mind. There are excellent Python/NumPy libraries for using HDF5 datasets. With no further information on the format or purpose of the database and/or matrix, I can't really give you better storage recommendations.
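For instance, with h5py (the file and dataset names here are made up):

import numpy as np
import h5py

matrix = np.zeros((2000, 500))   # stand-in for the assembled matrix

# Write once, compressed, rather than keeping the values in a database table.
with h5py.File("matrix.h5", "w") as f:
    f.create_dataset("values", data=matrix, compression="gzip")

# Later, read back only the slice you actually need.
with h5py.File("matrix.h5", "r") as f:
    block = f["values"][:100, :]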
If you don't control how that data is stored, you're just going to have to wait for it to get slurped in. Depending on what you're using it for and how often it gets updated, you might just wait for it once, and then updates from the database can be written to it from a separate thread as they become available.
(Unrelated terminology issue: "CPython" is the standard Python interpreter. I assume you meant that you don't want to write an extension for CPython using C, Cython, Boost Python, or whatever.)