My project currently uses NumPy only for memory-efficient arrays (of bool_, uint8, uint16, uint32).
I'd like to get it running on PyPy, which doesn't support NumPy (I failed to install it, at any rate).
So I'm wondering: is there any other memory-efficient way to store arrays of numbers in Python? Anything that is supported by PyPy? Does PyPy have anything of its own?
Note: array.array is not a viable solution, as it uses a lot more memory than NumPy in my testing.
array.array is a memory-efficient array. It packs bytes/words etc. together, so there are only a few bytes of extra overhead for the entire array.
The one place where numpy can use less memory is when you have a sparse array (and are using one of the sparse array implementations).
If you are not using sparse arrays, you simply measured it wrong.
array.array also doesn't have a packed bool type, so you can implement that as a wrapper around an array.array('I') or a bytearray(), or even just use bit masks with a Python long.
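As a rough sketch (the class name and bit layout here are just one possible choice, not anything standard), a packed bool wrapper over a bytearray might look like this:

class BitArray(object):
    """Hypothetical packed-bool array: 1 bit per element on top of bytearray."""

    def __init__(self, size):
        self._bytes = bytearray((size + 7) // 8)
        self._size = size

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        return bool(self._bytes[i >> 3] & (1 << (i & 7)))

    def __setitem__(self, i, value):
        if value:
            self._bytes[i >> 3] |= 1 << (i & 7)
        else:
            self._bytes[i >> 3] &= ~(1 << (i & 7)) & 0xFF

This stores 8 flags per byte, so a million bools costs roughly 125 KB plus the small bytearray overhead.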
I would like to store a non-rectangular array in Python. The array has millions of elements and I will be applying a function to each element in the array, so I am concerned about performance. What data structure should I use? Should I use a Python list or a numpy array of type object? Is there another data structure that would work even better?
You can use the dictionary data structure to store everything. If you have ample memory, dictionaries are a good option, and hash-based lookup keeps access fast.
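For illustration, a minimal sketch (the keys and values here are made up, and my_func stands in for whatever per-element function you apply): a dict keyed by (row, col) only stores the cells that actually exist, so a ragged shape costs nothing extra.

data = {}
data[(0, 0)] = 1.5
data[(0, 1)] = 2.5
data[(7, 3)] = -1.0            # rows can have different lengths

value = data.get((3, 9), 0.0)  # missing cells fall back to a default
result = {k: my_func(v) for k, v in data.items()}  # apply a function to every stored element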
I'd suggest you to use scipy sparse matrices.
UPD. Some elaboration goes below.
I assume that "non-rectangular" implies there will be empty elements in a plain 2D array. With millions of elements, these 'holes' become a real tax on memory usage. Sparse matrices provide a familiar array interface while occupying only the necessary amount of memory.
Though if array-ish indexing is not required, a dictionary is perfectly fine storage to use.
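As a minimal sketch (the shape and values here are made up), a DOK (dictionary-of-keys) sparse matrix gives array-style indexing and can be converted to CSR when you need arithmetic:

import numpy as np
from scipy.sparse import dok_matrix

# Shape is just an upper bound on the ragged data; unset cells cost nothing.
m = dok_matrix((1000, 1000), dtype=np.float64)
m[0, 5] = 1.0
m[999, 2] = 3.5

csr = m.tocsr()            # CSR is better for bulk arithmetic / row slicing
row_sums = csr.sum(axis=1)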
I have to handle sparse matrices that can occasionally be very big, nearing or exceeding RAM capacity. I also need to support mat*vec and mat*mat operations.
Since a csr_matrix is internally three arrays (data, indices and indptr), is it possible to create a CSR matrix from numpy memmaps?
This can partially work, until you try to do much with the array. There's a very good chance the subarrays will be fully read into memory if you subset, or you'll get an error.
An important consideration here is that the underlying code is written assuming the arrays are typical in-memory numpy arrays. The cost of random access is very different for mmapped arrays and in-memory arrays. In fact, much of the code here is (at the time of writing) in Cython, which may not be able to work with more exotic array types.
Also, most of this code can change at any time, as long as the behaviour is the same for in-memory arrays. This has personally bitten me: I learned that some code I worked with was doing exactly this, but with h5py.Datasets as the underlying arrays. It worked surprisingly well, until a bug-fix release of scipy completely broke it.
This works without any problems.
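For concreteness, a minimal sketch of such a construction (the file names are hypothetical, and it assumes the three CSR component arrays were saved earlier with np.save and have compatible dtypes and lengths):

import numpy as np
from scipy.sparse import csr_matrix

data = np.load('data.npy', mmap_mode='r')
indices = np.load('indices.npy', mmap_mode='r')
indptr = np.load('indptr.npy', mmap_mode='r')

n_rows = len(indptr) - 1
n_cols = int(indices.max()) + 1   # or pass the known column count instead

# csr_matrix accepts the (data, indices, indptr) triple directly.
mat = csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))

As noted above, scipy may still pull the arrays into memory when it validates the matrix or performs operations on it.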
This is a follow-up to this question:
What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regard to MEMORY?
I'm interested in understanding the speed implications of using a numpy array vs a list of lists when the array is of type object.
If anyone is interested in the object I'm using:
import gmpy2 as gm
gm.mpfr('0') # <-- this is the object
The biggest usual benefits of numpy, as far as speed goes, come from being able to vectorize operations, which means you replace a Python loop around a Python function call with a C loop around some inlined C (or even custom SIMD assembly) code. There are probably no built-in vectorized operations for arrays of mpfr objects, so that main benefit vanishes.
However, there are some places where you'll still benefit:
Some operations that would require a copy in pure Python are essentially free in numpy—transposing a 2D array, slicing a column or a row, even reshaping the dimensions are all done by wrapping a pointer to the same underlying data with different striding information. Since your initial question specifically asked about A.T, yes, this is essentially free.
Many operations can be performed in-place more easily in numpy than in Python, which can save you some more copies.
Even when a copy is needed, it's faster to bulk-copy a big array of memory and then refcount all of the objects than to iterate through nested lists deep-copying them all the way down.
It's a lot easier to write your own custom Cython code to vectorize an arbitrary operation with numpy than with Python.
You can still get some benefit from using np.vectorize around a normal Python function, pretty much on the same order as the benefit you get from a list comprehension over a for statement (see the small sketch after this list).
Within certain size ranges, if you're careful to use the appropriate striding, numpy can allow you to optimize cache locality (or VM swapping, at larger sizes) relatively easily, while there's really no way to do that at all with lists of lists. This is much less of a win when you're dealing with an array of pointers to objects that could be scattered all over memory than when dealing with values that can be embedded directly in the array, but it's still something.
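A small sketch illustrating the np.vectorize point, plus the free transpose from the first bullet (the array contents here are made up):

import numpy as np
import gmpy2 as gm

A = np.array([[gm.mpfr(i * j) for j in range(3)] for i in range(3)], dtype=object)

# np.vectorize still loops at the Python level, so the gain is only on the
# order of a list comprehension, but it keeps the array-style interface.
square = np.vectorize(lambda x: x * x, otypes=[object])

B = square(A)      # elementwise square of the object array
C = square(A.T)    # A.T itself is free; only the elementwise call costs time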
As for disadvantages… well, one obvious one is that using numpy restricts you to CPython or sometimes PyPy (hopefully in the future that "sometimes" will become "almost always", but it's not quite there as of 2014); if your code would run faster in Jython or IronPython or non-NumPyPy PyPy, that could be a good reason to stick with lists.
I have a Cython function like cdef generate_data(size) where I'd like to:
initialise an array of bytes of length size
call external (C) function to populate the array using array ptr and size
return the array as something understandable by Python (bytearray, bytes, your suggestions)
I have seen many approaches on the internet but I'm looking for the best/recommended way of doing this in my simple case. I want to:
avoid memory reallocations
avoid using numpy
ideally use something that works in Python 3 and 2.7, although a 2.7 solution is good enough.
I'm using Cython 0.20.
For allocating memory, I have you covered.
After that, just take a pointer (possibly to the data attribute, if you use cpython.array.array like I recommend) and pass that along. You can return the cpython.array.array type and it will become a Python array.
I want to implement a 1024x1024 monochromatic grid. I need to read data from any cell and insert rectangles of various dimensions. I have tried making a list of lists (and using it like a 2D array); what I found is that a list of booleans is slower than a list of integers. I have tried a 1D list, and it was slower than the 2D one. numpy is about 10 times slower than a standard Python list. The fastest way I have found is PIL with a monochromatic bitmap accessed through the "load" method, but I want it to run a lot faster, so I tried to compile it with Shed Skin; unfortunately there is no PIL support there. Do you know any way of implementing such a grid faster without rewriting it in C or C++?
Raph's suggestion of using array is good, but it won't help on CPython; in fact, I'd expect it to be 10-15% slower. However, if you use it on PyPy (http://pypy.org/), I'd expect excellent results.
One thing I might suggest is using Python's built-in array class (http://docs.python.org/library/array.html) with a type of 'B'. Coding will be simplest if you use one byte per pixel, but if you want to save memory, you can pack 8 pixels to a byte and access them using your own bit manipulation.
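A minimal sketch of the simpler one-byte-per-pixel variant (the helper names are made up; adapt them to your access pattern):

from array import array

WIDTH = HEIGHT = 1024

# 1 MiB total: one unsigned byte per pixel, flat row-major layout.
grid = array('B', [0]) * (WIDTH * HEIGHT)

def get(x, y):
    return grid[y * WIDTH + x]

def fill_rect(x0, y0, w, h, value=1):
    # Fill a w x h rectangle anchored at (x0, y0), one row slice at a time.
    row_fill = array('B', [value]) * w
    for y in range(y0, y0 + h):
        start = y * WIDTH + x0
        grid[start:start + w] = row_fill

Slice assignment on array lets you insert each rectangle row in one step instead of setting pixels individually, which is usually the main win over a list of lists here.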
I would look into Cython, which translates the Python into C that is readily compiled (or compiled for you if you use distutils). Just compiling your code in Cython will make it faster for something like this, but you can get much greater speed-ups by adding a few cdef statements. If you use it with Numpy, then you can quickly access Numpy arrays. The speed-up can be quite large by using Cython in this manner. However, it would be easier to help you if you provided some example code.