I am using NumPy and trying to create a huge matrix, and while doing this I get a memory error.
Because the matrix itself is not important, I will just show how to easily reproduce the error:
import numpy as np

a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and use a big matrix.
I think I have enough RAM to handle this matrix (I have 24 GB of RAM).
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I have already read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplication and division of numbers, which made me think I had enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe in the meantime some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is save it into PyTables.
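If writing to PyTables is the only requirement, the matrix never has to exist in RAM at all: an extendable on-disk array can be grown chunk by chunk. Below is a minimal sketch, assuming a hypothetical file name nan_matrix.h5 and an arbitrary chunk size; it writes roughly 75 GiB of NaNs to disk while holding only one small block in memory at a time.

import numpy as np
import tables

n_total = 10000000000      # total number of values
chunk = 10000000           # values written per append (~76 MiB of float64)

with tables.open_file("nan_matrix.h5", mode="w") as f:
    earr = f.create_earray(f.root, "data", atom=tables.Float64Atom(), shape=(0,))
    block = np.full(chunk, np.nan)          # one reusable in-memory block
    for _ in range(n_total // chunk):
        earr.append(block)                  # extends the on-disk array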
Assuming each floating point number is 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or roughly 37.25 GiB that you need to store in memory; NumPy's default float64 actually uses 8 bytes per value, which doubles that to about 74.5 GiB. So I don't think 24 GB of RAM is enough.
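As a quick sanity check, the required memory can be computed directly from the element count and the dtype's item size:

import numpy as np

n = 10000000000
for dt in (np.float32, np.float64):
    gib = n * np.dtype(dt).itemsize / 2**30
    print(dt.__name__, round(gib, 2), "GiB")   # float32: ~37.25, float64: ~74.51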
If you can't afford to create such a matrix, but still wish to do some computations, try sparse matrices.
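For example, scipy.sparse stores only the nonzero entries, so a logically enormous matrix can stay tiny in RAM. A minimal sketch (the shape and values are just placeholders):

import numpy as np
from scipy import sparse

# Dense float64 at this shape would need ~7.3 TiB; the sparse matrix
# only stores the handful of nonzero entries.
m = sparse.dok_matrix((1000000, 1000000), dtype=np.float64)
m[0, 0] = 1.0
m[12345, 67890] = 2.5
print(m.nnz, m.shape)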
If you wish to pass it to another Python package that uses duck typing, you can create your own class with __getitem__ implementing dummy access.
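As an illustration of that duck-typing idea, here is a hypothetical class (the name and behaviour are made up for the example) that pretends to be a huge NaN-filled matrix without allocating anything:

class LazyNaNMatrix:
    """Looks like a huge array of NaNs but allocates no storage."""
    def __init__(self, shape):
        self.shape = shape

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, index):
        # Every element "read" simply returns NaN.
        return float('nan')

m = LazyNaNMatrix((10000000000,))
print(len(m), m[42])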
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
Reducing the memory PyCharm reserves for itself in this file (at the cost of some IDE speed) leaves more megabytes for your program. You need to edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, you can change them to -Xms512m and -Xmx1024m, and your program may then work,
but this will affect debugging performance in PyCharm.
Related
My question is simple, and I could not find a resource that answers it. Somewhat similar links are using asarray, on numbers in general, and the most succinct one here.
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how can I determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50GB. Is it therefore enough if I use slightly more than 50GB as RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just using 100GBs is not an option.
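One way to estimate this without trial and error is to read only the array metadata: np.load with mmap_mode maps the file instead of reading it, so shape and dtype are available without pulling the data into RAM. A sketch, assuming a hypothetical .npy file name (any temporary copies made during later processing would come on top of this figure):

import numpy as np

arr = np.load("big_file.npy", mmap_mode="r")      # no data is read yet
needed = arr.size * arr.dtype.itemsize            # same value as arr.nbytes
print(arr.shape, arr.dtype, needed / 2**30, "GiB")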
I have to handle sparse matrices that can occasionally be very big, nearing or exceeding RAM capacity. I also need to support mat*vec and mat*mat operations.
Since a csr_matrix is internally three arrays, data, indices and indptr, is it possible to create a csr matrix from numpy memmaps?
This can partially work, until you try to do much with the array. There's a very good chance the subarrays will be fully read into memory if you subset, or you'll get an error.
An important consideration here is that the underlying code is written assuming the arrays are typical in-memory numpy arrays. Cost of random access is very different for mmapped arrays and in memory arrays. In fact, much of the code here is (at time of writing) in Cython, which may not be able to work with more exotic array types.
Also, most of this code can change at any time, as long as the behaviour stays the same for in-memory arrays. This has personally bitten me when I learned that some code I worked with was doing exactly this, but with h5py.Dataset objects as the underlying arrays. It worked surprisingly well, until a bug-fix release of scipy completely broke it.
Constructing the csr_matrix itself from memmapped arrays does work without any problems; a sketch follows.
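Here is a minimal sketch of that construction, assuming the three component arrays have already been written to hypothetical files on disk (note that scipy may still validate or copy the index arrays during construction):

import numpy as np
from scipy import sparse

# The three CSR components live on disk and are paged in on access.
data = np.memmap("data.dat", dtype=np.float64, mode="r")
indices = np.memmap("indices.dat", dtype=np.int32, mode="r")
indptr = np.memmap("indptr.dat", dtype=np.int32, mode="r")

n_cols = 1000000   # assumed number of columns
mat = sparse.csr_matrix((data, indices, indptr),
                        shape=(len(indptr) - 1, n_cols))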
I am working in python and I have encountered a problem: I have to initialize a huge array (21 x 2000 x 4000 matrix) so that I can copy a submatrix on it.
The problem is that I want it to be really quick since it is for a real-time application, but when I run numpy.ones((21,2000,4000)), it takes about one minute to create this matrix.
When I run numpy.zeros((21,2000,4000)), it is instantaneous, but as soon as I copy the submatrix, it takes one minute, while in the first case the copying part was instantaneous.
Is there a faster way to initialize a huge array?
I guess there's not a faster way. The matrix you're building is quite large (8-byte float64 x 21 x 2000 x 4000 = about 1.25 GiB), and might be using up a large fraction of the physical memory on your system; thus, the one minute that you're waiting might be because the operating system has to page other stuff out to make room. You could check this by watching top or similar (e.g., System Monitor) while you're doing your allocation and watching memory usage and paging.
numpy.zeros seems to be instantaneous when you call it, because memory is allocated lazily by the OS. However, as soon as you try to use it, the OS actually has to fit that data somewhere. See Why the performance difference between numpy.zeros and numpy.zeros_like?
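The lazy-allocation effect is easy to see with a quick timing sketch (exact numbers will depend on the machine):

import time
import numpy as np

t0 = time.perf_counter()
a = np.zeros((21, 2000, 4000))     # ~1.25 GiB, but pages are not touched yet
t1 = time.perf_counter()
a[:] = 1.0                         # the first write faults every page in
t2 = time.perf_counter()
print(f"allocate: {t1 - t0:.3f}s, first write: {t2 - t1:.3f}s")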
Can you restructure your code so that you only create the submatrices that you were intending to copy, without making the big matrix?
I have a list "data_list" that I would like to save so that I can load it in another script.
First of all I converted it to an array, in this way:
data_array = np.array(data_list)
Then I saved it:
np.savez("File", data_array)
Then, in another script I want to access "File", so:
a = np.load("File.npz")
b = a['arr_0']
I used this code until two weeks ago and it worked fine. These days, when I try to run my program, it ends with an error at the line
b = a['arr_0']
"File" is a 300 MB file. The strangest thing is that it has stopped suddenly to work.
Any idea about what can be happened?
Ps: I give you some information. My list contains 180 matrices 511x511. Each matrix contains decimal numbers (I tried to create 180 matrices of zeros, and the error occurs in the same way). If I reduce the number of matrices, the script works fine: in particular down to 130 matrices it is ok, while up to the program doesn't work.
Here I report the error message
b = a['arr_0']
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 241, in __getitem__
    return format.read_array(value)
  File "C:\Python27\lib\site-packages\numpy\lib\format.py", line 459, in read_array
    array = numpy.fromstring(data, dtype=dtype, count=count)
MemoryError
MemoryError is an out-of-memory condition. This explains why it happens with objects of at least a certain size - more and bigger arrays, as you would expect, require more memory. What the maximum size is, and why it seems to have changed, is harder to say. That can be highly specific to your system, especially in regard to considerations like:
How much memory (physical RAM and swap space) exists and is available to the operating system
How much virtual memory the OS gives to Python
How much of that you're already using
The implementation of the C library, especially of its malloc function, which can affect how Python uses the memory it is allocated
And possibly quite a few other things.
Per the comments, it seems the biggest problem here is that you are running a 32 bit build of Python. On Windows, 32 bit processes apparently have an effective maximum memory address space of around 2GB. By my tests, the list of arrays you are using by itself might take around a quarter of that. The fact that your error only comes up when reading the file back in suggests that numpy deserialisation is relatively memory intensive, but I don't know enough about its implementation to be able to say why that would be. In any case, it seems like installing a 64 bit build of Python is your best bet.
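For a rough sense of scale, the raw data alone (ignoring Python list and object overhead, and any temporary copy made while deserialising) comes out well under the ~2 GB limit, which is consistent with the failure only appearing at load time:

import numpy as np

n_matrices, side = 180, 511
raw = n_matrices * side * side * np.dtype(np.float64).itemsize
print(raw / 2**20, "MiB")   # roughly 359 MiB of raw float64 data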
There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each dimension in the array contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in the RAM is not suitable, since it floods up the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any references to examples are appreciated!
That's weird. np.memmap should work. I've been using it with 250 GB of data on a machine with 12 GB of RAM without problems.
Does the system really run out of memory at the very moment the memmap file is created? Or does it happen later in the code? If it happens at file creation I really don't know what the problem would be.
When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:
mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')
for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])   # list_of_files holds the file names
    smallarray = do_something_with_array(smallarray)
    mmapData[:, k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
P.S.: Be aware that the default dtypes for memmap (uint8) and fromfile (float64) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
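As a sketch of the batch-filling pattern with h5py (file name and dimensions are placeholders, not taken from the question), a chunked dataset lets each small processed array be written straight to disk:

import h5py
import numpy as np

n_rows, n_cols = 1000, 1000000          # placeholder dimensions
with h5py.File("big.h5", "w") as f:
    dset = f.create_dataset("data", shape=(n_rows, n_cols), dtype="f8",
                            chunks=(1, n_cols))
    for k in range(n_rows):
        batch = np.random.rand(n_cols)   # stand-in for one processed small array
        dset[k, :] = batch               # only this row is ever in RAM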
There are also out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see the docs on stacking).
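For instance (a hypothetical sketch with made-up file and dataset names), many on-disk pieces can be stacked into one logical array and reduced chunk by chunk:

import dask.array as da
import h5py

files = [h5py.File(f"part_{i}.h5", "r") for i in range(10)]
pieces = [da.from_array(f["data"], chunks=(1000, 1000)) for f in files]
x = da.stack(pieces)            # behaves like one big numpy-style array
print(x.shape)
total = x.sum().compute()       # evaluated lazily, one chunk at a time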