I have to handle sparse matrices that can occasionally be very big, nearing or exceeding RAM capacity. I also need to support mat*vec and mat*mat operations.
Since a csr_matrix is internally just three arrays (data, indices and indptr), is it possible to create a csr_matrix from numpy memmaps?
This can partially work, until you try to do much with the array. There's a very good chance the subarrays will be fully read into memory if you subset, or you'll get an error.
An important consideration here is that the underlying code is written assuming the arrays are typical in-memory numpy arrays. The cost of random access is very different for mmapped arrays and in-memory arrays. In fact, much of the code here is (at time of writing) in Cython, which may not be able to work with more exotic array types.
Also, most of this code can change at any time, as long as the behaviour stays the same for in-memory arrays. This has personally bitten me: I learned that some code I worked with was doing exactly this, but with h5py.Dataset objects for the underlying arrays. It worked surprisingly well, until a bug-fix release of scipy completely broke it.
This works without any problems.
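For reference, here is a minimal sketch of what that construction might look like (file names, dtypes and sizes are made up, and whether later operations keep the data on disk depends on the scipy version, as the previous answer notes):

import numpy as np
import scipy.sparse as sp

# Hypothetical files holding the three CSR components, written beforehand
# (e.g. by saving the .data, .indices and .indptr of an existing csr_matrix).
n_rows, n_cols, nnz = 1_000_000, 1_000_000, 50_000_000  # made-up sizes

data = np.memmap('data.dat', dtype=np.float64, mode='r', shape=(nnz,))
indices = np.memmap('indices.dat', dtype=np.int32, mode='r', shape=(nnz,))
indptr = np.memmap('indptr.dat', dtype=np.int32, mode='r', shape=(n_rows + 1,))

# csr_matrix accepts the (data, indices, indptr) triple directly.
A = sp.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))

x = np.ones(n_cols)
y = A @ x  # mat*vec; the result is an ordinary in-memory array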
While trying to run ndimage.convolve on a big numpy.memmap, an exception occurs:
Exception has occurred: _ArrayMemoryError
Unable to allocate 56.0 GiB for an array with shape (3710, 1056, 3838) and data type float32
It seems that convolve creates a regular numpy array, which won't fit into memory.
Could you tell me please if there is a workaround?
Thank you for any input.
Scipy and Numpy often create new arrays to store the returned output value. This temporary array is stored in RAM even when the input array is stored on a storage device and accessed with memmap. There is an output parameter to control this in many functions (including ndimage.convolve). However, this does not prevent internal in-RAM temporary arrays from being created (though such arrays are not very frequent and often not huge). There is not much more you can do if the output parameter is not present or a big internal temporary is created. The only thing to do then is to write your own implementation that does not allocate huge in-RAM arrays. C modules, Cython and Numba are pretty good for this. Note that doing efficient convolutions is far from simple when the kernel is not trivial, and there are many research papers addressing this problem.
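As an illustration of the output parameter, here is a hedged sketch that writes the result of ndimage.convolve into a disk-backed memmap instead of a fresh in-RAM array (the file names, shapes and kernel are placeholders, and internal temporaries may still be allocated):

import numpy as np
from scipy import ndimage

shape = (3710, 1056, 3838)  # shape from the error message
image = np.memmap('input.dat', dtype=np.float32, mode='r', shape=shape)
kernel = np.ones((3, 3, 3), dtype=np.float32) / 27.0  # placeholder kernel

# Pre-allocate the output on disk and let convolve write into it, so the
# 56 GiB result never has to live in RAM as a regular numpy array.
result = np.memmap('output.dat', dtype=np.float32, mode='w+', shape=shape)
ndimage.convolve(image, kernel, output=result)
result.flush()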
Instead of rolling your own implementation, another approach that might work is to use Dask's wrapped ndfilters (from the dask-image package) with a dask array created from the memmap. That way, you can delegate the chunking and out-of-core computation to Dask.
I haven't actually done this myself, but I see no reason why it wouldn't work!
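Something along these lines, assuming dask and dask-image are installed (chunk sizes and file names are guesses and would need tuning):

import numpy as np
import dask.array as da
from dask_image import ndfilters

shape = (3710, 1056, 3838)
image = np.memmap('input.dat', dtype=np.float32, mode='r', shape=shape)
kernel = np.ones((3, 3, 3), dtype=np.float32) / 27.0  # placeholder kernel

# Wrap the memmap in a chunked dask array so the convolution is computed
# block by block instead of on one giant in-memory array.
darr = da.from_array(image, chunks=(256, 256, 256))
result = ndfilters.convolve(darr, kernel)

# Stream the result to another memmap chunk by chunk.
out = np.memmap('output.dat', dtype=np.float32, mode='w+', shape=shape)
da.store(result, out)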
This is a follow-up to this question:
What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to MEMORY?
I'm interested in understanding the speed implications of using a numpy array vs a list of lists when the array is of type object.
If anyone is interested in the object I'm using:
import gmpy2 as gm
gm.mpfr('0') # <-- this is the object
The biggest usual benefits of numpy, as far as speed goes, come from being able to vectorize operations, which means you replace a Python loop around a Python function call with a C loop around some inlined C (or even custom SIMD assembly) code. There are probably no built-in vectorized operations for arrays of mpfr objects, so that main benefit vanishes.
However, there are some places where you'll still benefit:
Some operations that would require a copy in pure Python are essentially free in numpy—transposing a 2D array, slicing a column or a row, even reshaping the dimensions are all done by wrapping a pointer to the same underlying data with different striding information. Since your initial question specifically asked about A.T, yes, this is essentially free.
Many operations can be performed in-place more easily in numpy than in Python, which can save you some more copies.
Even when a copy is needed, it's faster to bulk-copy a big array of memory and then refcount all of the objects than to iterate through nested lists deep-copying them all the way down.
It's a lot easier to write your own custom Cython code to vectorize an arbitrary operation with numpy than with Python.
You can still get some benefit from using np.vectorize around a normal Python function, pretty much on the same order as the benefit you get from a list comprehension over a for statement.
Within certain size ranges, if you're careful to use the appropriate striding, numpy can allow you to optimize cache locality (or VM swapping, at larger sizes) relatively easily, while there's really no way to do that at all with lists of lists. This is much less of a win when you're dealing with an array of pointers to objects that could be scattered all over memory than when dealing with values that can be embedded directly in the array, but it's still something.
As for disadvantages… well, one obvious one is that using numpy restricts you to CPython or sometimes PyPy (hopefully in the future that "sometimes" will become "almost always", but it's not quite there as of 2014); if your code would run faster in Jython or IronPython or non-NumPyPy PyPy, that could be a good reason to stick with lists.
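To make the np.vectorize and A.T points above concrete, here is a small sketch with mpfr objects (gm.sqrt is used only as an example of an element-wise operation):

import numpy as np
import gmpy2 as gm

# A 2D object array of mpfr values (a list of lists would hold the same objects).
A = np.array([[gm.mpfr(i * 10 + j) for j in range(3)] for i in range(4)],
             dtype=object)

# Essentially free: A.T is a view with swapped strides, no mpfr is copied.
At = A.T

# np.vectorize still calls the Python function once per element, so the
# speedup is roughly on the order of a list comprehension vs. a for loop.
vec_sqrt = np.vectorize(gm.sqrt, otypes=[object])
roots = vec_sqrt(A)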
There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each slice along one dimension contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in RAM is not suitable, since it floods the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit, it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the docs or elsewhere that illustrates this common use-case.
If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any references to examples are appreciated!
That's weird. np.memmap should work. I've been using it with 250 GB of data on a 12 GB RAM machine without problems.
Does the system really run out of memory at the very moment the memmap file is created? Or does it happen further along in the code? If it happens at file creation, I really don't know what the problem would be.
When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:
mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')
for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])  # list_of_files is the list of file names
    smallarray = do_something_with_array(smallarray)
    mmapData[:, k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
P.S.: Be aware that the default dtypes for memmap (uint8) and fromfile (float64) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
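A hedged sketch of the h5py route for the use-case above (file names, sizes and the column-per-file layout are assumptions):

import glob
import numpy as np
import h5py

list_of_files = sorted(glob.glob('input_*.bin'))  # placeholder file pattern
small_size = 1_250_000                            # placeholder: elements per small array

with h5py.File('big.h5', 'w') as f:
    # One column per input file; HDF5 keeps the full dataset on disk and
    # only the column being written needs to fit in RAM.
    dset = f.create_dataset('data', shape=(small_size, len(list_of_files)),
                            dtype='float64', chunks=(small_size, 1))
    for k, fname in enumerate(list_of_files):
        small = np.fromfile(fname, dtype='float64')  # one ~10 MB array
        dset[:, k] = small                           # written straight to disk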
There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see docs on stacking.)
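For example, a sketch of the stacking approach with dask.array (the file pattern, per-file shape and output target are assumptions):

import glob
import numpy as np
import dask
import dask.array as da

list_of_files = sorted(glob.glob('input_*.bin'))  # placeholder file pattern
small_shape = (1_250_000,)                        # placeholder per-file shape

@dask.delayed
def load(fname):
    return np.fromfile(fname, dtype='float64')

# Build a lazy 2D array; no file is read until its chunk is actually needed.
columns = [da.from_delayed(load(f), shape=small_shape, dtype='float64')
           for f in list_of_files]
big = da.stack(columns, axis=1)  # shape (small_size, number_of_files)

# Stream the result to an HDF5 dataset chunk by chunk.
big.to_hdf5('big.h5', '/data')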
I am using numpy and trying to create a huge matrix.
While doing this, I receive a memory error.
Because the matrix itself is not important, I will just show how to easily reproduce the error:
import numpy as np
a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplying and dividing numbers, which made me think that I had enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe in the meantime some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is save it into PyTables.
Assuming each floating point number is 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or roughly 37.25 GiB that you need to store in memory (and with NumPy's default float64 it would be twice that, about 74.5 GiB). So I don't think 24 GB of RAM is enough.
If you can't afford to create such a matrix, but still wish to do some computations, try sparse matrices.
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
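For instance, a toy sketch of such a duck-typed class (the class name and the slice-only indexing are invented for illustration):

import numpy as np

class LazyNaNMatrix:
    """Pretends to be a huge matrix of NaNs without ever allocating it."""

    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, index):
        # Only row slices are handled in this sketch; a real class would
        # implement more of the indexing protocol.
        start, stop, step = index.indices(self.shape[0])
        n_rows = len(range(start, stop, step))
        return np.full((n_rows, self.shape[1]), np.nan, dtype=self.dtype)

# A consumer that only asks for row blocks never sees the full array:
m = LazyNaNMatrix((10_000_000_000, 10))
chunk = m[0:1000]  # a (1000, 10) block of NaNs, created on demand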
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
You can reduce the memory that PyCharm allocates to itself in this file (which will slow PyCharm down), so that more memory is left for your program.
You must edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
so you can change them to -Xms512m -Xmx1024m and then your program should work,
but it'll affect the debugging performance in PyCharm.
My project currently uses NumPy, only for memory-efficient arrays (of bool_, uint8, uint16, uint32).
I'd like to get it running on PyPy, which doesn't support NumPy (I failed to install it, at any rate).
So I'm wondering: Is there any other memory-efficient way to store arrays of numbers in Python? Anything that is supported by PyPy? Does PyPy have anything of its own?
Note: array.array is not a viable solution, as it uses a lot more memory than NumPy in my testing.
array.array is a memory-efficient array. It packs bytes/words etc. together, so there are only a few bytes of extra overhead for the entire array.
The one place where numpy can use less memory is when you have a sparse array (and are using one of the sparse array implementations).
If you are not using sparse arrays, you simply measured it wrong.
array.array also doesn't have a packed bool type, so you can implement that as a wrapper around an array.array('I') or a bytearray(), or even just use bit masks with a Python long.
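A rough sketch of such a packed-bool wrapper, here backed by a bytearray (the class name is made up; this runs on CPython and PyPy alike):

class PackedBoolArray:
    """Stores one bit per flag in a bytearray instead of one byte (or more)."""

    def __init__(self, n):
        self.n = n
        self._bytes = bytearray((n + 7) // 8)

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return bool(self._bytes[i >> 3] & (1 << (i & 7)))

    def __setitem__(self, i, value):
        if value:
            self._bytes[i >> 3] |= 1 << (i & 7)
        else:
            self._bytes[i >> 3] &= ~(1 << (i & 7)) & 0xFF

flags = PackedBoolArray(1_000_000)  # ~125 KB instead of ~1 MB of bytes
flags[42] = True
assert flags[42] and not flags[43]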