Lazy version of numpy.unpackbits - python

I use numpy.memmap to load only the parts of arrays into memory that I need, instead of loading an entire huge array. I would like to do the same with bool arrays.
Unfortunately, bool memmap arrays aren't stored economically: according to ls, a bool memmap file requires as much space as a uint8 memmap file of the same array shape.
So I bit-pack the arrays to save space and use numpy.unpackbits to read them back. Unfortunately, unpackbits doesn't seem to be lazy: it's slow and can cause a MemoryError, so apparently it loads the whole array from disk into memory instead of providing a "bool view" on the uint8 array.
So if I want to load only certain entries of the bool array from file, I first have to compute which uint8 entries they are part of, then apply numpy.unpackbits to that, and then again index into that.
Isn't there a lazy way to get a "bool view" on the bit-packed memmap file?

Not possible. The memory layout of a bit-packed array is incompatible with what you're looking for. The NumPy shape-and-strides model of array layout does not have sub-byte resolution. Even if you were to create a class that emulated the view you want, trying to use it with normal NumPy operations would require materializing a representation NumPy can work with, at which point you'd have to spend the memory you don't want to spend.
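Since a true view is off the table, the per-element workaround described in the question can at least be kept cheap. A minimal sketch, assuming the bools were written with np.packbits (default big-endian bit order) to a file named packed.bin; both names are illustrative:

    import numpy as np

    # Memory-map the packed bytes; nothing is read until indexed.
    packed = np.memmap("packed.bin", dtype=np.uint8, mode="r")

    def get_bool(i):
        """Element i of the original bool array, touching only one byte on disk."""
        byte = packed[i // 8 : i // 8 + 1]       # one-element uint8 slice
        return bool(np.unpackbits(byte)[i % 8])  # unpack just that byte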

Related

Performing ndimage.convolve on big numpy.memmap: Unable to allocate 56.0 GiB for an array

While trying to run ndimage.convolve on a big numpy.memmap, an exception occurs:
Exception has occurred: _ArrayMemoryError
Unable to allocate 56.0 GiB for an array with shape (3710, 1056, 3838) and data type float32
It seems that convolve creates a regular NumPy array, which won't fit into memory.
Could you tell me please if there is a workaround?
Thank you for any input.
Scipy and Numpy often create new arrays to store the returned output value. This temporary array is stored in RAM even when the input array is stored on a storage device and accessed with memmap. Many functions (including ndimage.convolve) have an output parameter to control that. However, it does not prevent internal in-RAM temporary arrays from being created (though such arrays are not very frequent and often not huge). There is not much more you can do if the output parameter is not present or a big internal temporary is created. The only thing to do then is to write your own implementation that does not allocate huge in-RAM arrays. C modules, Cython and Numba are pretty good for this. Note that doing efficient convolutions is far from simple when the kernel is not trivial, and there are many research papers addressing this problem.
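As a concrete illustration of the output parameter, here is a minimal sketch; the file names and the kernel are assumptions, only the shape is taken from the error message:

    import numpy as np
    from scipy import ndimage

    # Input is read through a memmap; the result is written directly into a
    # second memmap, so no full-size output array has to live in RAM.
    x = np.memmap("input.dat", dtype=np.float32, mode="r",
                  shape=(3710, 1056, 3838))
    out = np.memmap("output.dat", dtype=np.float32, mode="w+", shape=x.shape)

    kernel = np.ones((3, 3, 3), dtype=np.float32) / 27.0
    ndimage.convolve(x, kernel, output=out)
    out.flush()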
Instead of rolling your own implementation, another approach that might work is to use dask's wrapped ndfilters with a dask array created from the memmap. That way, you can delegate the chunking and out-of-memory-calculation parts to Dask.
I haven't actually done this myself, but I see no reason why it wouldn't work!
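A rough sketch of that approach, assuming the dask-image package is installed and the same hypothetical file names as above; the chunk sizes are illustrative:

    import numpy as np
    import dask.array as da
    from dask_image import ndfilters

    x = np.memmap("input.dat", dtype=np.float32, mode="r",
                  shape=(3710, 1056, 3838))
    dx = da.from_array(x, chunks=(256, 256, 256))

    kernel = np.ones((3, 3, 3), dtype=np.float32) / 27.0
    result = ndfilters.convolve(dx, kernel)

    # Stream the result back to disk chunk by chunk instead of materializing it.
    out = np.memmap("output.dat", dtype=np.float32, mode="w+", shape=x.shape)
    da.store(result, out)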

Structure of a vector in python

How can I identify a vector in Python?
Does it have a single dimension, or is it n-dimensional? I got really confused when trying to understand this in NumPy.
Also, what's the difference between static memory allocation and a dynamic one in vectors?
A vector has a single dimension and is created with functions such as numpy.array and numpy.linspace. An n-dimensional array would be a matrix created with functions such as numpy.zeros or numpy.random.uniform(...). You don't have to use NumPy to use vectors in Python; you can simply use the built-in list type.
In Python you usually don't have to worry about memory allocation. Dynamic memory allocation means that elements can be added to or removed from the vector, whereas with static memory allocation there is a fixed number of elements in the vector.
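For example, you can tell vectors and matrices apart with the ndim and shape attributes:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])        # a vector: one dimension
    w = np.linspace(0.0, 1.0, 5)         # also one-dimensional
    print(v.ndim, v.shape)               # 1 (3,)
    print(w.ndim, w.shape)               # 1 (5,)

    m = np.zeros((3, 4))                 # a matrix: two dimensions
    u = np.random.uniform(size=(2, 2))   # another 2-D example
    print(m.ndim, m.shape)               # 2 (3, 4)
    print(u.ndim, u.shape)               # 2 (2, 2)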

How to store numpy two dimensional array onto disk as a kind of binary format which can be read with C++

I have two-dimensional NumPy arrays in Python and I want to write them in a binary format that can be read with C++. As you know, a two-dimensional array in C++ is essentially a one-dimensional array with pointers used to locate the elements. Could you tell me which function in Python can be used to do the job, or any other solution?
This is too long for a comment, but probably not complete enough to run on its own. As Tom mentioned in the comments on your question, using a library that saves and loads to a well-defined format (HDF5, .mat) in Python and C++ is probably the easiest solution. If you don't want to find and set up such a library, read on.
NumPy has the ability to save data using numpy.save (see this).
The format (described here) states there is a header with information on the datatype and the array shape, followed by the data. So, unless you want to write a fully featured parser (you don't), you should ensure Python consistently saves data as float64 (or whichever type you want), in C order (Fortran ordering is the other option).
Then, the C++ code just needs to check that the array data type is float64, the correct ordering is used, and how large the array is. Allocate the appropriate amount of memory, and you can load that number of bytes from the file directly into the allocated memory. To create 2d indexing you'll need to allocate an array of pointers to each 'row' in the allocated memory.
Or just use a library that will handle all of that for you.
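A minimal sketch of the Python side, assuming a hypothetical file name data.npy; np.ascontiguousarray forces the float64 type and C ordering that the C++ reader would then expect:

    import numpy as np

    a = np.random.rand(3, 4)                       # example 2-D array
    a = np.ascontiguousarray(a, dtype=np.float64)  # force float64, C order
    np.save("data.npy", a)                         # writes header + raw bytes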

Extracting specific values from .npy file

I have a .npy file of which I know basically everything (size, number of elements, type of elements, etc.) and I'd like to have a way to retrieve specific values without loading the array. The goal is to use the least amount of memory possible.
I'm looking for something like
    def extract('test.npy', i, j):
        return "test.npy[i,j]"
I kinda know how to do it with a text file (see recent questions), but doing this with an npy array would allow me to do more than line extraction.
Also if you know any way to do this with a scipy sparse matrix that would be really great.
Thank you.
Just use data = np.load(filename, mmap_mode='r') (or one of the other modes, if you need to change specific elements, as well).
This will return a memory-mapped array. The contents of the array won't be loaded into memory and will be on disk, but you can access individual items by indexing the array as you normally would. (Be aware that accessing some slices will take much longer than accessing other slices depending on the shape and order of your array.)
HDF is a more efficient format for this, but the .npy format is designed to allow for memmapped arrays.
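A minimal sketch matching the interface the question asks for; the file name comes from the question, the indices are illustrative:

    import numpy as np

    def extract(filename, i, j):
        # Memory-map the file; data stays on disk until indexed.
        data = np.load(filename, mmap_mode="r")
        return data[i, j]   # reads only the pages holding that element

    value = extract("test.npy", 3, 7)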

Read subset of pickled NumPy array from MongoDB

I have some NumPy arrays that are pickled and stored in MongoDB using the bson module. For instance, if x is a NumPy array, then I set a field of a MongoDB record to:
bson.binary.Binary(x.dumps())
My question is whether it is possible to recover a subset of the array x without reloading the entire array via np.loads(). So, first, how can I get MongoDB to give me back only a chunk of the binary array, and second, how can I turn that chunk into a NumPy array? I should mention that I also have all the NumPy metadata regarding the array, such as its dimensions and datatype.
A concrete example might be that I have a 2-dimensional array of size (100000,10) with datatype np.float64 and I want to retrieve just x[50,10].
I cannot say for sure, but checking the API docs of BSON C++, I get the idea that it was not designed for partial retrieval...
If you can at all, consider using pytables, which is designed for large data and inter-operating nicely with numpy. Mongo is great for certain distributed applications, though, while pytables is not.
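If PyTables is an option, a minimal sketch; the file and node names are assumptions:

    import numpy as np
    import tables

    # Write the array once into an HDF5 file.
    with tables.open_file("arrays.h5", mode="w") as f:
        f.create_array(f.root, "x", np.random.rand(100000, 10))

    # Later, read a single element; only the needed part is fetched from disk.
    with tables.open_file("arrays.h5", mode="r") as f:
        value = f.root.x[50, 9]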
If you store the array directly inside of MongoDB, you can also try using the $slice operator to get a contiguous subset of an array. You could linearize your 2D array into a 1D array, and the $slice operator will get you matrix rows, but if you want to select columns or generally select noncontiguous indices, then you're out of luck.
Background on $slice.
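A rough sketch of the $slice approach with pymongo, assuming the matrix is stored as a plain linearized list of numbers rather than a pickled blob (which is bulkier on disk), and hypothetical database/collection names:

    import numpy as np
    from pymongo import MongoClient

    coll = MongoClient()["testdb"]["arrays"]

    x = np.random.rand(100000, 10)
    coll.insert_one({"_id": "x", "shape": list(x.shape),
                     "data": x.ravel().tolist()})

    # Fetch only row 50: skip 50 * 10 elements, take 10.
    doc = coll.find_one({"_id": "x"}, {"data": {"$slice": [50 * 10, 10]}})
    row50 = np.array(doc["data"])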
