Store multiple two-dimensional arrays in Python - python

I am wondering, if for example I have 5 numpy arrays of 100 by 1, 4 numpy arrays of 100 by 3, 3 numpy arrays of 100 by 5, and 4 arrays of 100 by 6, what is the most efficient way to store all these matrices? I could keep a separate numpy array for each, but this is not efficient. I cannot store them in a 3D array since the matrices have different dimensions. Any suggestions on how to store them efficiently?

Assuming you mean efficiency of storage on disk:
NumPy has a built-in function called np.savez that you can use to save multiple arrays to a single file on disk. If you're worried about file size, np.savez_compressed gives a modest further reduction.
If you saved the arrays with pickling enabled, make sure to pass allow_pickle=True when loading the saved .npy or .npz files.
HDF5 is definitely an option, but it is usually reserved for truly large, heterogeneous data collections. From the description, you have a handful of homogeneous matrices that can easily be managed with the facilities above.
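As a rough sketch of that approach (the array shapes mirror the question; the file name matrices.npz and the key names are arbitrary), saving and reloading could look like this:

import numpy as np

# Arrays of different shapes, as in the question
arrays = {f"arr_{i}": a for i, a in enumerate(
    [np.random.rand(100, 1) for _ in range(5)] +
    [np.random.rand(100, 3) for _ in range(4)] +
    [np.random.rand(100, 5) for _ in range(3)] +
    [np.random.rand(100, 6) for _ in range(4)])}

# Save all of them into a single compressed .npz archive;
# the keyword names become the keys inside the archive.
np.savez_compressed("matrices.npz", **arrays)

# Load them back; the .npz file behaves like a dict of arrays.
with np.load("matrices.npz") as npz:
    restored = {name: npz[name] for name in npz.files}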

Related

Most memory-efficient way to combine many numpy arrays

I have about 200 numpy arrays saved as files, and I would like to combine them into one big array. Currently I am doing that with a loop, concatenating each one individually. But I've heard this is memory-inefficient, because concatenation also makes a copy.
Concatenate Numpy arrays without copying
If you know beforehand how many arrays you need, you can instead start with one big array that you allocate beforehand, and have each of the small arrays be a view into the big array (e.g. obtained by slicing).
So I am wondering if I should instead load each numpy array individually, count the total number of rows across all of them, create a new numpy array of that size, copy each smaller array into it, and then delete the smaller array. Or is there some aspect of this I am not taking into account?
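For what it's worth, here is a minimal sketch of that preallocation idea (the file names, dtype, and the assumption that every file has the same number of columns are made up for illustration):

import numpy as np

files = ["part_0.npy", "part_1.npy", "part_2.npy"]   # hypothetical file names

# First pass: collect shapes without loading the data
# (memory mapping only reads the headers here).
shapes = [np.load(f, mmap_mode="r").shape for f in files]
n_rows = sum(s[0] for s in shapes)
n_cols = shapes[0][1]                                # assumes equal column counts

# Allocate the big array once, then copy each file into its slice.
big = np.empty((n_rows, n_cols), dtype=np.float64)
start = 0
for f, s in zip(files, shapes):
    big[start:start + s[0]] = np.load(f)             # copied straight into place
    start += s[0]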

How to efficiently work with large complex numpy arrays?

For my research I am working with large numpy arrays consisting of complex data.
arr = np.empty((15000, 25400), dtype='complex128')
np.save('array.npy', arr)
When stored they are about 3 GB each. Loading these arrays is a time-consuming process, which made me wonder if there are ways to speed it up.
One of the things I was thinking of was splitting the array into its complex and real part:
arr_real = arr.real
arr_im = arr.imag
and saving each part separately. However, this didn't seem to improve processing speed significantly. There is some documentation about working with large arrays, but I haven't found much information on working with complex data. Are there smart(er) ways to work with large complex arrays?
If you only need parts of the array in memory, you can load it using memory mapping:
arr = np.load('array.npy', mmap_mode='r')
From the docs:
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.
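For example, slicing a memory-mapped array only pulls the requested fragment off disk (the slice bounds below are arbitrary):

import numpy as np

arr = np.load('array.npy', mmap_mode='r')   # no data is read yet
block = np.array(arr[1000:2000, :500])      # only this fragment is read into memory
print(block.shape, block.dtype)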

How can I save a super large array into many small files?

In a 64-bit Linux environment, I have very big float64 arrays (a single one will be 500 GB to 1 TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. So I do not want to access the array as segments, file by file. Are there any tools with which I can create a memmap over many different files? Can hdf5 or pytables store a single CArray across many small files? Maybe something similar to fileinput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it like one big raw file. But I do not know whether I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?
Unfortunately, H5P.set_chunk does not work together with H5P.set_external, because set_external only works with the contiguous storage layout, not the chunked layout.
some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
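A small sketch of what that looks like in h5py (the dataset name, shape, and chunk shape are just examples; the chunk shape should be tuned to your access pattern):

import numpy as np
import h5py

with h5py.File("big.h5", "w") as f:
    # Chunked storage: HDF5 reads and writes the data in (1000, 1000) blocks,
    # so slicing out a small region does not touch the whole dataset.
    dset = f.create_dataset("data", shape=(100_000, 10_000),
                            dtype="float64", chunks=(1000, 1000))
    dset[:1000, :1000] = np.random.rand(1000, 1000)

with h5py.File("big.h5", "r") as f:
    piece = f["data"][2000:2100, 500:600]   # only the touched chunks are read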
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents the data stored in many small HDF5 files. dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data for you.
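A hedged sketch of the dask route, assuming each small file holds a one-dimensional dataset named "data" and the files concatenate along the first axis:

import h5py
import dask.array as da

# Hypothetical layout: many small HDF5 files, each with a dataset "data".
files = [h5py.File(f"part_{i}.h5", "r") for i in range(10)]
pieces = [da.from_array(f["data"], chunks=(100_000,)) for f in files]

# One logical array; nothing is read until .compute() is called.
big = da.concatenate(pieces, axis=0)
print(big.shape)
print(big[123_456:123_466].compute())   # dask only reads the relevant chunks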

Working with very large arrays - Numpy

My situation is like this:
I have ~70 million integer values distributed across various files for ~10 categories of data (exact number not known)
I read those files and create a Python object from the data. This obviously involves reading each file line by line and appending to the object. So I'll have an array with 70 million subarrays, with 10 values in each.
I do some statistical processing on that data. This involves appending several values (say, a percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use NumPy for arrays that are more memory-efficient. But then I've heard that appending to NumPy arrays is discouraged, as it's not efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
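A minimal np.memmap sketch (the file name, dtype, and shape are placeholders):

import numpy as np

# Disk-backed array: the data lives in 'data.dat', not in RAM.
mm = np.memmap("data.dat", dtype=np.int64, mode="w+", shape=(70_000_000, 10))

# It can be indexed and updated like a normal ndarray.
mm[0] = np.arange(10)
mm[1_000_000:1_000_010] += 1
mm.flush()   # push any changes back to the file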
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
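And a minimal PyTables sketch of a compressed, disk-based array (the atom, shape, and compression settings are placeholders):

import numpy as np
import tables as tb

filters = tb.Filters(complevel=5, complib="blosc")   # on-disk compression
with tb.open_file("data.h5", "w") as f:
    carr = f.create_carray(f.root, "values", atom=tb.Int64Atom(),
                           shape=(70_000_000, 10), filters=filters)
    carr[:1000] = np.random.randint(0, 100, size=(1000, 10))

with tb.open_file("data.h5", "r") as f:
    print(f.root.values[:1000].sum(axis=0))          # reads only the first rows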

Construct huge numpy array with pytables

I generate feature vectors for examples from a large amount of data, and I would like to store them incrementally while I am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: they require a predetermined size, and I am not quite sure how computationally efficient appending is.
Tables: the column types do not support lists or arrays.
If it is a plain numpy array, you should probably use an Extendable Array (EArray): http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
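A minimal EArray sketch (the vector length of 128 and the loop are hypothetical stand-ins for the real feature-generation code):

import numpy as np
import tables as tb

with tb.open_file("features.h5", "w") as f:
    # shape=(0, 128): extendable along the first axis, rows of 128 floats.
    earr = f.create_earray(f.root, "features",
                           atom=tb.Float64Atom(), shape=(0, 128))
    for _ in range(1000):                    # stand-in for the reading loop
        vector = np.random.rand(128)
        earr.append(vector.reshape(1, -1))   # append one row at a time

with tb.open_file("features.h5", "r") as f:
    print(f.root.features.shape)             # (1000, 128)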
Can't you just store them in a plain Python list? Your code should be a loop that grabs things from the data and generates each example; create a list outside the loop and append each vector to it for storage:
array = []
for row in file:
    # here is your code that creates the vector
    array.append(vector)
Then, after you have gone through the whole file, you have a list with all of your generated vectors! Hopefully that is what you need; you were a bit unclear... next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned.
