For my research I am working with large numpy arrays consisting of complex data.
arr = np.empty((15000, 25400), dtype='complex128')
np.save('array.npy'), arr)
When stored they are about 3 GB each. Loading these arrays is a time consuming process, which made me wonder if there are ways to speed this process up
One of the things I was thinking of was splitting the array into its complex and real part:
arr_real = arr.real
arr_im = arr.imag
and saving each part separately. However, this didn't seem to improve processing speed significantly. There is some documentation about working with large arrays, but I haven't found much information on working with complex data. Are there smart(er) ways to work with large complex arrays?
If you only need parts of the array in memory, you can load it using memory mapping:
arr = np.load('array.npy', mmap_mode='r')
From the docs:
A memory-mapped array is kept on disk. However, it can be accessed and
sliced like any ndarray. Memory mapping is especially useful for
accessing small fragments of large files without reading the entire
file into memory.
Related
My question is simple; and I could not find a resource that answers it. Somewhat similar links are using asarray, on numbers in general, and the most succinct one here.
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how to determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50GB. Is it therefore enough if I use slightly more than 50GB as RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just using 100GBs is not an option.
I am trying to understand the memory allocation in the following operation:
x_batch,images_path,ImageValidStatus = tf_resize_images(path_list, img_type=col_mode, im_size=IMAGE_SIZE)
x_batch=x_batch/255;
x_batch = 1.0-x_batch
x_batch = x_batch.reshape(x_batch.shape[0],IMAGE_SIZE[0]*IMAGE_SIZE[1]*IMAGE_SIZE[2])
what I am interested in is the x_batch, this is a multi-dim numpy array (100x64x64x3)
where 100 is the number of images and 64x64x3 is the dimensions of the image.
what is the maximum number of copies of the images located inside the memory at one point of time.
in other words, how exactly the operations (x_batch/255) , (1-x_batch) and x_batch.reshape are happening from memory perspective.
my main concern is in some cases I am trying to process 500K images at the same time, if I will make multiple copies of these images in the memory, it will be very difficult to fit everything in the memory.
I see "tf" in your code, so I am unsure if you are asking about tensors or arrays. Lets assume you are asking about arrays. In general, arrays are written to memory once and then manipulated. For example,
import numpy as np
data = np.empty((1000,30,30,5)) #This took up 1000*30*30*5*dtype_size bytes (plus epsilon).
data.reshape((1000,30,150)) #Does nothing but update how numpy accesses the array.
data += 1 #Adds one to all the entries in the array.
data = 1-data #Overwrites the array with the data of 1-data.
x = data + 1 #Re-allocates and copies the whole memory.
As long as you don't change the array size (re-allocate the memory), then numpy operates very quickly and efficiently on the data. Not as good as tensorflow, but very very fast. In place adding, functions, operations, are all done without using more memory. Things like appending to the array could cause problems and make python rewrite the array in memory.
There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension in the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in the RAM is not suitable, since it floods up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
IF anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any refs. to examples appreciated!
That's weird. The np.memmap should work. I've been using it with 250Gb data on a 12Gb RAM machine without problems.
Does the system really runs out of memory at the very moment of the creation of the memmap file? Or it happens along the code? If it happens at the file creation I really don't know what the problem would be.
When I started using memmap I've made some mistakes that led me to memory run out. For me, something like the below code should work:
mmapData = np.memmap(mmapFile, mode='w+', shape = (smallarray_size,number_of_arrays), dtype ='float64')
for k in range(number_of_arrays):
smallarray = np.fromfile(list_of_files[k]) # list_of_file is the list with the files name
smallarray = do_something_with_array(smallarray)
mmapData[:,k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
Ps: Be aware that the default dtype value for memmap(int) and fromfile(float) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see docs on stacking.)
My situation is like this:
I have around ~70 million integer values distributed in various files for ~10 categories of data (exact number not known)
I read those several files, and create some python object with that data. This would obviously include reading each file line by line and appending to the python object. So I'll have an array with 70 mil subarrays, with 10 values in each.
I do some statistical processing on that data . This would involve appending several values (say, percentile rank) to each 'row' of data.
I store this object it in a Database
Now I have never worked with data of this scale. My first instinct was to use Numpy for more efficient arrays w.r.t memory. But then I've heard that in Numpy arrays, 'append' is discouraged as it's not as efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
I am new to python. I have a big array, a, with dimensions such as (43200, 4000) and I need to save this, as I need it for future processing. when I try to save it with a np.savetxt, the txt file is too large and my program runs into memory error as I need to process 5 files of same size. Is there any way to save huge arrays so that it will take less memory?
Thanks.
Saving your data to text file is hugely inefficient. Numpy has built-in saving commands save, and savez/savez_compressed which would be much better suited to storing large arrays.
Depending on how you plan to use your data, you should also look into HDF5 format (h5py or pytables), which allows you to store large data sets, without having to load it all in memory.
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow questions that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might consider whether your application can be performed with a parallel algorithm and potentially applied to only subsets of the large arrays rather than needing to consume all of the array before proceeding.