Read subset of pickled NumPy array from MongoDB - python

I have some NumPy arrays that are pickled and stored in MongoDB using the bson module. For instance, if x is a NumPy array, then I set a field of a MongoDB record to:
bson.binary.Binary(x.dumps())
My question is whether it is possible to recover a subset of the array x without loading the entire array via np.loads(). So, first, how can I get MongoDB to give me back only a chunk of the binary data, and second, how can I turn that chunk into a NumPy array? I should mention that I already have all the NumPy metadata for the array, such as its dimensions and datatype.
A concrete example: I have a 2-dimensional array of shape (100000, 10) with datatype np.float64, and I want to retrieve just x[50, 9].
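For illustration, here is a minimal sketch of the "turn a byte chunk back into an array" step, under the assumption that the array were stored as raw bytes via x.tobytes() rather than pickled with x.dumps() (a pickle stream adds a header, so element offsets are not simply index * itemsize):

import numpy as np

x = np.random.rand(100000, 10)                   # known shape and dtype (float64)
raw = x.tobytes()                                # what would be stored as Binary

row, col = 50, 9
offset = (row * x.shape[1] + col) * x.dtype.itemsize
chunk = raw[offset:offset + x.dtype.itemsize]    # the byte range you would ask MongoDB for
value = np.frombuffer(chunk, dtype=np.float64)[0]
assert value == x[row, col]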

I cannot say for sure, but from the BSON C++ API docs I get the impression that it was not designed for partial retrieval...
If you can at all, consider using pytables, which is designed for large data and interoperates nicely with numpy. Mongo is great for certain distributed applications, though, while pytables is not.

If you store the array directly inside MongoDB, you can also try using the $slice operator to get a contiguous subset of an array. You could linearize your 2D array into a 1D array, and the $slice operator will get you matrix rows, but if you want to select columns, or generally select noncontiguous indices, then you're out of luck.
Background on $slice.
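As a rough sketch of the $slice idea with pymongo (the collection and field names are made up for illustration), storing the flattened array as a plain BSON list lets you pull back a single row of the linearized matrix:

import numpy as np
from pymongo import MongoClient

coll = MongoClient().testdb.arrays

x = np.arange(20, dtype=np.float64).reshape(4, 5)
coll.insert_one({"name": "x", "shape": list(x.shape), "data": x.ravel().tolist()})

# Fetch row 2 only: skip 2 * ncols elements, take ncols.
row, ncols = 2, x.shape[1]
doc = coll.find_one({"name": "x"}, {"data": {"$slice": [row * ncols, ncols]}})
print(np.array(doc["data"]))                     # equivalent to x[2, :]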

Related

What is the best way to store a non-rectangular array?

I would like to store a non-rectangular array in Python. The array has millions of elements and I will be applying a function to each element in the array, so I am concerned about performance. What data structure should I use? Should I use a Python list or a numpy array of type object? Is there another data structure that would work even better?
You can use a dictionary to store everything. If you have ample memory, dictionaries are a good option, and hashed lookups make them fast.
I'd suggest using scipy sparse matrices.
Update: some elaboration below.
I assume that "non-rectangular" implies there would be empty elements in a plain 2D array. With millions of elements, those 'holes' become a tax on memory usage. A sparse matrix provides the familiar array interface while occupying only the necessary amount of memory.
If array-style indexing is not required, though, a dictionary is perfectly fine storage.
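A minimal sketch of the sparse-matrix idea, with made-up ragged data, where the missing cells cost essentially no memory:

import numpy as np
from scipy.sparse import lil_matrix

rows = [np.array([1.0, 2.0, 3.0]),               # rows of different lengths
        np.array([4.0]),
        np.array([5.0, 6.0])]

m = lil_matrix((len(rows), max(len(r) for r in rows)))
for i, r in enumerate(rows):
    m[i, :len(r)] = r                            # fill only the cells that exist

csr = m.tocsr()                                  # convert for fast row slicing / arithmetic
print(csr[0].toarray())                          # [[1. 2. 3.]]

Note that a sparse matrix cannot tell a "missing" cell apart from an explicit 0.0, so this only fits if zeros are not meaningful values in your data.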

How to store np arrays into psql database and django

I am developing an application that will be used for running simulation and optimization over graphs (for instance, the travelling salesman problem or various other problems).
Currently I use a 2D numpy array as the graph representation. I always store a list of lists, and after every load/dump from/into the DB I convert back and forth with np.array() and ndarray.tolist() respectively.
Is there a supported way to store a numpy ndarray in psql? Unfortunately, np arrays are not JSON-serializable by default.
I also thought about converting the numpy array into a scipy.sparse matrix, but those are not JSON-serializable either.
json.dumps(np_array.tolist()) is the way to convert a numpy array to JSON; np.array(json.loads(...)) is how you get it back.
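A short round-trip sketch (where you put the JSON string is up to you, e.g. a Django JSONField on your model; the variable names here are assumptions):

import json
import numpy as np

adjacency = np.random.rand(5, 5)                 # example graph matrix

payload = json.dumps(adjacency.tolist())         # what goes into the psql json/jsonb column
restored = np.array(json.loads(payload))         # what you rebuild after loading

assert np.allclose(adjacency, restored)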

Lazy version of numpy.unpackbits

I use numpy.memmap to load only the parts of arrays into memory that I need, instead of loading an entire huge array. I would like to do the same with bool arrays.
Unfortunately, bool memmap arrays aren't stored economically: according to ls, a bool memmap file requires as much space as a uint8 memmap file of the same array shape.
So I use numpy.unpackbits to save space. Unfortunately, it does not seem to be lazy: it's slow and can cause a MemoryError, so apparently it loads the array from disk into memory instead of providing a "bool view" of the uint8 array.
So if I want to load only certain entries of the bool array from file, I first have to work out which uint8 entries they fall into, apply numpy.unpackbits to those, and then index into the result.
Isn't there a lazy way to get a "bool view" on the bit-packed memmap file?
Not possible. The memory layout of a bit-packed array is incompatible with what you're looking for. The NumPy shape-and-strides model of array layout does not have sub-byte resolution. Even if you were to create a class that emulated the view you want, trying to use it with normal NumPy operations would require materializing a representation NumPy can work with, at which point you'd have to spend the memory you don't want to spend.
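For reference, a minimal sketch of the manual byte-level workaround the question describes, assuming a file name and that the bits were packed with numpy.packbits in its default bit order:

import numpy as np

packed = np.memmap("bits.dat", dtype=np.uint8, mode="r")   # assumed file name

def read_bools(start, stop):
    """Return the bool values for flat indices [start, stop)."""
    byte_start, byte_stop = start // 8, (stop + 7) // 8
    chunk = np.unpackbits(np.asarray(packed[byte_start:byte_stop]))
    return chunk[start - byte_start * 8 : stop - byte_start * 8].astype(bool)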

Store Numpy as pickled Pandas, Pickled Numpy or HDF5

I am currently working with 300 float features coming from preprocessing of item information. The items are identified by a UUID (i.e. a string). The current file size is around 200 MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map an item's UUID to a row of the numpy array; for that I use a dictionary (stored as JSON) that maps UUID to row index.
I was tempted to use Pandas and replace that dictionary with a Pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can (a short sketch follows at the end of this answer):
use numpy.savez to save a dictionary of numpy arrays in a binary format
store a pandas DataFrame in HDF5
directly use PyTables to write your numpy arrays to HDF5.
HDF5 is a preferred format for storing scientific data and offers, among other things,
parallel read/write capabilities
on-the-fly compression algorithms
efficient querying
the ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for a small 200 MB dataset is not that critical and is more a matter of convenience.
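A minimal sketch of the first two options, with assumed file names, stand-in UUIDs, and the 300-feature layout from the question (the HDF5 path needs PyTables installed):

import numpy as np
import pandas as pd

uuids = ["a1", "b2", "c3"]                       # stand-in UUIDs
features = np.random.rand(len(uuids), 300)

# Option 1: one named array per UUID in a single .npz file.
np.savez("features.npz", **{u: row for u, row in zip(uuids, features)})
row = np.load("features.npz")["b2"]

# Option 2: a DataFrame indexed by UUID, stored in HDF5.
df = pd.DataFrame(features, index=uuids)
df.to_hdf("features.h5", key="features", mode="w")
row = pd.read_hdf("features.h5", "features").loc["b2"]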

Construct huge numpy array with pytables

I generate feature vectors for examples from a large amount of data, and I would like to store them incrementally while I am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: they require a predetermined size, and I am not quite sure how computationally efficient appending is.
Tables: the column types do not support lists or arrays.
If it is a plain numpy array, you should probably use Extendable Arrays (EArray) http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
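A hedged sketch of the EArray approach, with an assumed file name, vector length, and dtype; the first dimension has size 0 so it can grow as vectors arrive:

import numpy as np
import tables

VEC_LEN = 128
with tables.open_file("features.h5", mode="w") as h5:
    earray = h5.create_earray(h5.root, "features",
                              atom=tables.Float64Atom(),
                              shape=(0, VEC_LEN))           # 0 marks the extendable axis
    for _ in range(1000):                                   # stand-in for reading your data
        vec = np.random.rand(VEC_LEN)
        earray.append(vec[np.newaxis, :])                   # append one row at a time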
Can't you just store them in an array? Your code is presumably a loop that grabs things from the data and generates each example; create an array outside the loop and append each vector to it for storage:
array = []
for row in file:
    # here is your code that creates the vector
    array.append(vector)
Then, after you have gone through the whole file, you have an array with all of your generated vectors. Hopefully that is what you need; you were a bit unclear, so next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned.
