Most memory-efficient way to combine many numpy arrays - python

I have about 200 numpy arrays saved as files, and I would like to combine them into one big array. Currently I do this with a loop, concatenating each array in turn. But I have heard this is memory-inefficient, because each concatenation makes a copy.
Concatenate Numpy arrays without copying
If you know beforehand how many arrays you need, you can instead start
with one big array that you allocate beforehand, and have each of the
small arrays be a view to the big array (e.g. obtained by slicing).
So I am wondering if I should instead load each numpy array individually, sum the row counts of all of them, allocate a new numpy array with that many rows, and then copy each smaller array into it, deleting the small array afterwards. Or is there some aspect of this I am not taking into account?
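A sketch of the two-pass preallocation approach described above (file names are hypothetical stand-ins for your ~200 files; `mmap_mode="r"` lets us read each file's shape without pulling its data into memory):

```python
import numpy as np

# Create a few example .npy files (stand-ins for the ~200 real ones).
rng = np.random.default_rng(0)
parts = [rng.standard_normal((n, 4)) for n in (5, 3, 7)]
filenames = [f"part_{i}.npy" for i in range(len(parts))]
for fn, p in zip(filenames, parts):
    np.save(fn, p)

# First pass: memory-map each file to read its shape without loading the data,
# then allocate the combined array once.
shapes = [np.load(fn, mmap_mode="r").shape for fn in filenames]
total_rows = sum(s[0] for s in shapes)
combined = np.empty((total_rows, shapes[0][1]), dtype=parts[0].dtype)

# Second pass: copy each array into its slice of the preallocated array.
row = 0
for fn in filenames:
    part = np.load(fn)
    combined[row:row + len(part)] = part
    row += len(part)
```

This copies each array exactly once, instead of repeatedly copying the growing result as a concatenate-in-a-loop does.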

Related

Are there dynamic arrays in numpy?

Let's say I create 2 numpy arrays, one of which is an empty array and one which is of size 1000x1000 made up of zeros:
import numpy as np
A1 = np.array([])
A2 = np.zeros([1000,1000])
When I want to change a value in A2, this seems to work fine:
A2[n,m] = 17
The above code would change the value of position [n][m] in A2 to 17.
When I try the above with A1 I get this error:
A1[n,m] = 17
IndexError: index n is out of bounds for axis 0 with size 0
I know why this happens, because there is no defined position [n,m] in A1 and that makes sense, but my question is as follows:
Is there a way to define a dynamic array that grows by adding new rows and columns whenever A[n,m] = somevalue is assigned with n or m (or both) beyond the current bounds of the array A?
It doesn't have to be numpy; any library or method that can grow an array on assignment would be great. If it is a method, I imagine it checking whether [n][m] is out of bounds and resizing accordingly.
I am coming from a MATLAB background where it's easy to do this. I tried to find something about this in the documentation in numpy.array but I've been unsuccessful.
EDIT:
I want to know whether creating such a dynamic array is possible at all in Python, not just with numpy. It appears from this question that it doesn't work with numpy: Creating a dynamic array using numpy in python.
This can't be done in numpy, and technically it can't be done in MATLAB either. What MATLAB does behind the scenes is create an entirely new matrix, copy all the data into it, and delete the old matrix. It is not dynamically resizing; that isn't actually possible because of how arrays/matrices work. This is extremely slow, especially for large arrays, which is why MATLAB nowadays warns you not to do it.
Numpy, like MATLAB, cannot resize arrays in place (actually, unlike MATLAB it technically can with ndarray.resize, but only when no other references to the data exist, so I would advise against it). But to avoid the sort of confusion and slow code this causes in MATLAB, numpy requires that you explicitly create the new array (for example with np.zeros) and copy the data over.
Python, unlike MATLAB, does have a truly resizable data structure: the list. Lists still require an index to exist before you can assign to it (which avoids the silent indexing errors that are hard to catch in MATLAB), but appending to a list has very good performance. You can make an effectively n-dimensional structure by nesting lists of lists. Then, once the list is complete, you can convert it to a numpy array.
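A minimal sketch of the grow-a-list-then-convert pattern described above:

```python
import numpy as np

# Grow a 2-D structure as a list of row lists; appending is amortized O(1).
rows = []
for i in range(4):
    rows.append([i * j for j in range(3)])

# One copy at the end, into a fixed-size numpy array.
A = np.array(rows)
```

After conversion, A supports the usual A[n, m] indexing, but its size is fixed again, so do all the growing on the list side first.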

Store multiple two dimensional arrays in Python

Say, for example, I have 5 numpy arrays of 100 by 1, 4 numpy arrays of 100 by 3, 3 numpy arrays of 100 by 5, and 4 arrays of 100 by 6. What is the most efficient way to store all these matrices? I could keep a separate file per array, but that is not efficient. I cannot stack them into a 3D array, since the matrices have different dimensions. Any suggestion on how to store them efficiently?
Assuming you're speaking of being efficient in storage on disk.
NumPy has a built-in function called savez that saves multiple arrays to a single file. If you're worried about file size, savez_compressed offers a modest further improvement.
If you saved the arrays with pickling enabled, make sure to pass allow_pickle=True when loading the saved .npy or .npz files.
HDF5 is definitely an option but is often used for truly large heterogeneous data collections. From what it appears, you have a handful of homogeneous matrices that can easily be managed using the aforementioned facilities.
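A short sketch of the savez route (the file name and array shapes are illustrative, matching two of the shapes from the question):

```python
import numpy as np

# Arrays of different shapes, saved together into one .npz file;
# the keyword names become the keys inside the archive.
a = np.zeros((100, 1))
b = np.ones((100, 3))
np.savez_compressed("matrices.npz", a=a, b=b)

# np.load returns a lazy, dict-like NpzFile; arrays are read on access.
data = np.load("matrices.npz")
restored = data["b"]
```

Because each array keeps its own key, the differing shapes are no problem, unlike stacking into a single 3D array.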

Construct huge numpy array with pytables

I generate feature vectors for examples from large amount of data, and I would like to store them incrementally while i am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: they require a predetermined size, and I am not sure how computationally efficient appending is.
Tables: the column types do not support lists or arrays.
If it is a plain numpy array, you should probably use an extendable array (EArray): http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
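A minimal EArray sketch (the file name "features.h5" and the vector width of 8 are hypothetical; a 0 in the shape marks the axis that can grow):

```python
import numpy as np
import tables

# Create an extendable array with an unknown number of rows (the 0 axis).
f = tables.open_file("features.h5", mode="w")
earray = f.create_earray(f.root, "features",
                         atom=tables.Float64Atom(),
                         shape=(0, 8))

# Append feature vectors incrementally as they are generated.
rng = np.random.default_rng(0)
for _ in range(5):
    vec = rng.standard_normal((1, 8))
    earray.append(vec)
f.close()

# Later, reopen the file and read the accumulated rows back.
f = tables.open_file("features.h5", mode="r")
stored = f.root.features.read()
f.close()
```

Appends go to disk as you iterate, so the full set of vectors never has to fit in memory at once.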
Can't you just store them in a list? Your code presumably loops over the data to generate each example's vector; create a list outside the loop and append each vector to it for storage:
array = []
for row in file:
    # ...your code that creates `vector` from `row`...
    array.append(vector)
Then, after you have gone through the whole file, you have a list with all of your generated vectors. Hopefully that is what you need; you were a bit unclear, so next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned

Diagonalisation of an array of 2D arrays

I need to diagonalise a very large number of matrices. These matrices are individually quite small (say a x a where a<=10), but due to their sheer number it takes a long time to diagonalise them all using a for loop and the numpy.linalg.eig function. So I wanted to make an array of matrices, i.e., an array of 2D arrays, but unfortunately Python seems to consider this a 3-dimensional array, gets confused and refuses to do the job. So, is there any way to prevent Python from treating this array of 2D arrays as a 3D array?
Thanks,
A Python novice
EDIT: To be clear, I'm not interested in this 3D array per se. Since feeding an array to a function is generally much faster than looping over its elements one by one, I simply tried to put all the matrices I need to diagonalise into one array.
If you have a 3D array like:
a = np.random.normal(size=(20,10,10))
you can then just loop through all 20 of the 10x10 arrays using:
for k in range(a.shape[0]):
    b = np.linalg.eig(a[k, :, :])
where you would save b in a more sophisticated way. This may be what you are already doing. Note that in older NumPy you could not apply np.linalg.eig to a 3D array along a single axis, so you were stuck with the loop; since NumPy 1.8, however, the linalg routines broadcast over leading axes, so you can diagonalise the whole stack in one call.
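Since NumPy 1.8, np.linalg.eig accepts a stack of matrices with shape (..., M, M) and broadcasts over the leading axes, so the loop can be replaced by a single vectorised call; a minimal sketch:

```python
import numpy as np

# A stack of 20 small 10x10 matrices, as in the question.
a = np.random.normal(size=(20, 10, 10))

# One call diagonalises all 20 matrices: eigvals has shape (20, 10),
# eigvecs has shape (20, 10, 10), with eigenvectors in the columns.
eigvals, eigvecs = np.linalg.eig(a)
```

This pushes the loop into compiled code, which is typically much faster than calling eig once per matrix from Python.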

Numpy equivalent of MATLAB's cell array

I want to create a MATLAB-like cell array in Numpy. How can I accomplish this?
Matlab cell arrays are most similar to Python lists, since they can hold any object - but scipy.io.loadmat imports them as numpy object arrays, i.e. arrays with dtype=object.
To be honest, though, you are just as well off using Python lists: if you are holding general objects, you lose almost all of the advantages of numpy arrays (which are designed to hold a sequence of values that each take the same amount of memory).
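Both options described above, side by side (the contents are arbitrary examples):

```python
import numpy as np

# Closest numpy analogue of a cell array: a 1-D array of dtype=object.
cell = np.empty(3, dtype=object)
cell[0] = np.zeros((2, 2))
cell[1] = "a string"
cell[2] = [1, 2, 3]

# A plain Python list holds the same mixed contents with less ceremony.
cell_list = [np.zeros((2, 2)), "a string", [1, 2, 3]]
```

The object array buys you numpy-style indexing (e.g. boolean masks), but element access still goes through Python objects, so there is no speed advantage over the list.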
