I am working in Python and I have run into a problem: I need to initialize a huge array (a 21 x 2000 x 4000 matrix) so that I can copy a submatrix into it.
The problem is that this has to be really quick, since it is for a real-time application, but when I run numpy.ones((21,2000,4000)), it takes about one minute to create this matrix.
When I run numpy.zeros((21,2000,4000)), it returns instantly, but as soon as I copy the submatrix into it, the copy takes one minute, whereas in the first case the copying part was instantaneous.
Is there a faster way to initialize a huge array?
I guess there's not a faster way. The matrix you're building is quite large (8-byte float64 x 21 x 2000 x 4000 = 1.25 GB), and might be using up a large fraction of the physical memory on your system; the minute you're waiting is probably the operating system paging other things out to make room. You could check this by watching top (or a similar tool such as System Monitor) during the allocation and looking at memory usage and paging activity.
numpy.zeros seems to be instantaneous when you call it, because memory is allocated lazily by the OS. However, as soon as you try to use it, the OS actually has to fit that data somewhere. See Why the performance difference between numpy.zeros and numpy.zeros_like?
Can you restructure your code so that you only create the submatrices that you were intending to copy, without making the big matrix?
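If the big array itself is unavoidable, one option (a sketch of a general technique, not something stated in the question) is to pay the allocation and page-faulting cost once at startup, outside the real-time path, and then reuse the same buffer for every frame. The shape, dtype, and the process_frame helper below are assumptions for illustration:

import numpy as np

# Allocate once, before the real-time loop starts. np.empty does not initialize
# the memory, and filling it here forces the OS to commit every page now rather
# than during the first real-time copy.
big = np.empty((21, 2000, 4000), dtype=np.float64)
big.fill(1.0)  # or 0.0, or whatever background value is needed

def process_frame(sub, k):
    # Hypothetical per-frame work: copy a submatrix into slice k of the
    # preallocated buffer instead of building a fresh array each time.
    big[k, :sub.shape[0], :sub.shape[1]] = sub
    # ... rest of the real-time processing on `big` ...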
Related
I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second time (or third time...), assigning to the same variable name, it does not free the memory for the first array. It tries to allocate more memory, which causes an out of memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem; it appears to need even more memory than allocating a second block before freeing the first... I am back to restricting ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately.
As user2357112 suggests, cupy.random.random() does not appear to support "re-randomizing" an existing ndarray, even though cuRAND does. Writing C to modify an existing cupy array somewhat defeats the point of using Python / cupy in the first place.
Curiously, using an array about a third the size of available memory and increasing the number of loop iterations gives a lower total execution time than larger arrays with fewer loops. I was not able to determine when cupy (or Python or CUDA?) garbage-collects the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses CUDA cores (I presume it does?), it does not appear to materially affect my code's execution time. nvidia-smi reports "P2" GPU usage while my calculations are running, suggesting there are still cores available for cupy / CUDA to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else
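For anyone hitting the same issue, here is a rough sketch of the alternating-buffer idea from Edit 2. The size and loop count are assumptions; cp.get_default_memory_pool() and free_all_blocks() are real CuPy calls, but whether the explicit drain is worth its cost depends on your workload:

import cupy as cp

n = 200_000_000  # assumed: chosen so that TWO float64 buffers of this size fit on the GPU

data = None
for step in range(10):
    # The right-hand side is evaluated before `data` is rebound, so for a moment
    # both the old and the new array exist on the GPU; keeping each under half
    # of the free memory makes that overlap safe.
    data = cp.random.random(n)
    # ... use `data` here ...

# If the cached blocks need to be handed back to the driver at some point,
# CuPy's default memory pool can be drained explicitly (this can be slow):
cp.get_default_memory_pool().free_all_blocks()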
In C there is a classic example of the importance of memory access patterns: naive matrix multiplication using three for loops (i, j, k). One can show that using (i, k, j) is much faster than (i, j, k) because of memory locality. In Python with numpy (again three naive loops, not np.dot), the order of the indexes does not matter. Why is that?
First of all, you need to know why the (i, k, j) loop is faster than (i, j, k) in C. It comes down to how memory is used: the matrix is laid out linearly in memory, so when you iterate in (i, k, j) order you work with that layout, and each innermost step reads from a block of memory that has already been loaded into cache. If you use (i, j, k), you work against it: each step loads a block into cache only to discard it at the next step, because you are jumping across blocks.
numpy's implementation handles this for you, so even if you express the computation in the worst order, numpy arranges the work so it runs fast.
Throwing away the cached block and having to reload it all the time is called a cache miss.
At this link you can find a much better explanation of how the memory is laid out and why some iteration orders are faster.
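To make the C argument concrete, here is the same naive triple loop written in both orders over numpy arrays (a small sketch; the size is arbitrary). In pure Python the per-element interpreter overhead dwarfs the cache effect, which is consistent with the observation in the question that the order barely matters there:

import numpy as np

n = 200
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# (i, j, k): the innermost loop walks down a column of B, striding n elements
# at a time through row-major (C-order) memory.
C1 = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C1[i, j] = s

# (i, k, j): the innermost loop walks along contiguous rows of B and C2,
# which is the cache-friendly order that makes the C version faster.
C2 = np.zeros((n, n))
for i in range(n):
    for k in range(n):
        a_ik = A[i, k]
        for j in range(n):
            C2[i, j] += a_ik * B[k, j]

assert np.allclose(C1, C2)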
I have two questions, but the first takes precedence.
I was doing some timeit testing of some basic numpy operations that will be relevant to me.
I did the following
import numpy as np
from collections import defaultdict

n = 5000
j = defaultdict()
for i in xrange(n):
    print i
    j[i] = np.eye(n)
What happened is that Python's memory use almost immediately shot up to 6 GB, which is over 90% of my memory. However, the numbers printed at a steady pace, about 10-20 per second. While they printed, memory use sporadically bounced down to ~4 GB, back up to 5, down to 4, up to 6, down to 4.5, and so on.
At 1350 iterations I had a segmentation fault.
So my question is, what was actually occurring during this time? Are these matrices actually created one at a time? Why is memory use spiking up and down?
My second question is, I may actually need to do something like this in a program I am working on. I will be doing basic arithmetic and comparisons between many large matrices, in a loop. These matrices will sometimes, but rarely, be dense. They will often be sparse.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory? I don't know what can be done with all the tools and tricks available... Maybe I would just have to store some of them on disk and pull them out in chunks?
Any advice for if I have to loop through many matrices and do basic arithmetic between them?
Thank you.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory?
If they're dense matrices, and you need them all at the same time, not by a long shot. Consider:
5K * 5K = 25M cells
25M * 8B = 200MB (assuming float64)
5K * 200MB = 1TB
The matrices are being created one at a time. As you get near 6GB, what happens depends on your platform. It might start swapping to disk, slowing your system to a crawl. There might be a fixed-size or max-size swap, so eventually it runs out of memory anyway. It may make assumptions about how you're going to use the memory, guessing that there will always be room to fit your actual working set at any given moment into memory, only to segfault when it discovers it can't. But the one thing it isn't going to do is just work efficiently.
You say that most of your matrices are sparse. In that case, use one of the sparse matrix representations. If you know which of the 5000 will be dense, you can mix and match dense and sparse matrices, but if not, just use the same sparse matrix type for everything. If this means your occasional dense matrices take 210MB instead of 200MB, but all the rest of your matrices take 1MB instead of 200MB, that's more than worthwhile as a tradeoff.
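For a sense of scale (a sketch using the sizes from the question), scipy.sparse formats pay only for the nonzero entries:

import numpy as np
from scipy import sparse

n = 5000
dense_eye = np.eye(n)                            # 5000 x 5000 float64: ~200 MB
sparse_eye = sparse.identity(n, format='csr')    # only the 5000 nonzeros: well under 1 MB

print(dense_eye.nbytes)                          # 200000000 bytes
print(sparse_eye.data.nbytes + sparse_eye.indices.nbytes + sparse_eye.indptr.nbytes)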
Also, do you actually need to work on all 5000 matrices at once? If you only need, say, the current matrix and the previous one at each step, you can generate them on the fly (or read from disk on the fly), and you only need 400MB instead of 1TB.
Worst-case scenario, you can effectively swap things manually, with some kind of caching discipline, like least-recently-used. You can easily keep, say, the last 16 matrices in memory. Keep a dirty flag on each so you know whether you have to save it when flushing it to make room for another matrix. That's about as tricky as it's going to get.
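A minimal sketch of that manual-swapping scheme, assuming matrices can be saved and loaded as .npy files (the directory name, the cache size, and the get/put helpers are all made up for illustration):

import os
import numpy as np
from collections import OrderedDict

CACHE_DIR = "matrices"       # assumed location for the on-disk copies
MAX_IN_MEMORY = 16           # how many matrices to keep resident

_cache = OrderedDict()       # key -> [matrix, dirty_flag], oldest entry first

def _path(key):
    return os.path.join(CACHE_DIR, "{}.npy".format(key))

def get(key):
    if key in _cache:
        _cache.move_to_end(key)             # mark as most recently used
        return _cache[key][0]
    m = np.load(_path(key))                 # not resident: read it back from disk
    _put(key, m, dirty=False)
    return m

def _put(key, matrix, dirty):
    _cache[key] = [matrix, dirty]
    _cache.move_to_end(key)
    while len(_cache) > MAX_IN_MEMORY:      # evict the least recently used matrix
        old_key, (old_m, old_dirty) = _cache.popitem(last=False)
        if old_dirty:                       # only write it out if it was modified
            np.save(_path(old_key), old_m)

def put(key, matrix):
    _put(key, matrix, dirty=True)           # caller-modified matrices are dirty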
I am using numpy and trying to create a huge matrix.
While doing this, I receive a memory error
Because the matrix itself is not important, I will just show the easiest way to reproduce the error.
import numpy as np

a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I have already read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplying and dividing numbers, which made me think I had enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some more memory.
Maybe in the meantime some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating point number takes 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or about 37 GiB to hold in memory, and numpy's default float64 takes 8 bytes per element, so it would really be twice that. So I don't think 24 GB of RAM is enough.
If you can't afford creating such a matrix, but still wish to do some computations, try sparse matrices.
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
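A tiny sketch of that duck-typing idea (the class name is made up; a real version would have to mimic much more of numpy's indexing behaviour, depending on what the receiving package actually calls):

import numpy as np

class FakeNaNArray(object):
    # Pretends to be a huge array full of NaNs without ever allocating it.
    def __init__(self, shape):
        self.shape = shape
        self.dtype = np.float64

    def __getitem__(self, index):
        # Dummy access: every element "read" just produces a NaN.
        return np.float64('nan')

data = FakeNaNArray((10000000000,))
print(data[123456789])    # nan, with essentially no memory used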
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
You can reduce the memory reserved for PyCharm in this file, so that more megabytes are left for your program to allocate.
You need to edit these options:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, you can change them to -Xms512m and -Xmx1024m, and then your program should work,
but it will affect debugging performance in PyCharm.
Possible Duplicate:
Python Numpy Very Large Matrices
I tried numpy.zeros((100000, 100000)) and it returned "array is too big".
Response to comments:
1) I could create a 10k x 10k matrix, but not 100k x 100k or 1 million x 1 million.
2) The matrix is not sparse.
We can do simple maths to find out. A 1 million by 1 million matrix has 1,000,000,000,000 elements. If each element takes up 4 bytes, it would require 4,000,000,000,000 bytes of memory. That is, 3.64 terabytes.
There is also a chance that a given implementation of Python uses more than that for a single number. For instance, just the leap from a 4-byte float to an 8-byte double means you'd need 7.28 terabytes instead. (Python might also store each number on the heap, with all you get being a pointer to it, roughly doubling the footprint before even taking metadata into account, but that's slippery ground; I'm always wrong when I talk about Python internals, so let's not dig into it too much.)
I suppose numpy doesn't have a hardcoded limit, but if your system doesn't have that much free memory, there isn't really anything to do.
Does your matrix have a lot of zero entries? I suspect it does, few people do dense problems that large.
You can easily do that with a sparse matrix. SciPy has a good set built in. http://docs.scipy.org/doc/scipy/reference/sparse.html
The space required by a sparse matrix grows with the number of nonzero elements, not the dimensions.
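As a quick sketch (the dimensions come from the question, the number of nonzeros is made up), building a 100,000 x 100,000 sparse matrix is cheap as long as almost all entries are zero:

import numpy as np
from scipy import sparse

n = 100000
nnz = 1000000    # assumed number of nonzero entries

# Dense float64 would need n*n*8 bytes, roughly 80 GB; the sparse version only
# stores the nonzero values and their coordinates.
rows = np.random.randint(0, n, size=nnz)
cols = np.random.randint(0, n, size=nnz)
vals = np.random.rand(nnz)

m = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
print(m.shape, m.nnz)    # (100000, 100000) with about a million stored entries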
Your system probably won't have enough memory to store the matrix in memory, but nowadays you might well have enough terabytes of free disk space. In that case, numpy.memmap would allow you to have the array stored on disk, but appear as if it resides in memory.
However, it's probably best to rethink the problem. Do you really need a matrix this large? Any computations involving it will probably be infeasibly slow, and need to be done blockwise.
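If you do go the memmap route, a minimal sketch looks like this (file name, dtype, and block size are placeholders); note that even then, any operation touching the whole matrix has to stream tens of gigabytes through RAM, which is why the blockwise structure matters:

import numpy as np

n = 100000
# Disk-backed array: only the parts you touch are paged into memory.
# With float32 this file is n*n*4 bytes, about 40 GB on disk.
m = np.memmap("big_matrix.dat", dtype=np.float32, mode="w+", shape=(n, n))

block = 1000
for start in range(0, n, block):
    # Fill (or process) one horizontal stripe at a time, so only a stripe of a
    # few hundred megabytes is in memory at once.
    m[start:start + block, :] = np.random.rand(block, n).astype(np.float32)

m.flush()    # push pending changes back to the file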