I am developing a recommendation system with collaborative filtering, using Python, pandas DataFrames and NumPy arrays to create the matrices. The application runs fine with a 1,000-user base, but with 20k+ users it throws a memory error while generating a 20k x 20k matrix. Please help me solve this issue.
from scipy.spatial.distance import pdist, squareform

user_test_level_12 = pd.DataFrame(squareform(pdist(user_test_12.iloc[:, 1:])),
                                  columns=user_test_12.student_id, index=user_test_12.student_id)
A 20K x 20K dense matrix is too big to build all at once in main memory here; that is why you get the MemoryError.
I'd suggest computing the matrix in batches (calculate a small block at a time) and stitching them together, if you really need the whole thing at once.
The second option would be using a sparse matrix. I assume most of your data is sparse as it is a recommender system. A sparse matrix can save you both memory and computational time.
Without seeing the code or knowing your intention that's the best I can think of.
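As an illustration of the batching idea, here is a rough sketch. It assumes the user features sit in a NumPy array and writes the result into a disk-backed .npy file, so only one block is ever held in RAM (the function name, batch size and file name are made up):

import numpy as np
from scipy.spatial.distance import cdist

def pairwise_distances_in_batches(features, batch_size=1000, out_path="distances.npy"):
    # Fill the full n x n distance matrix block by block into a memory-mapped
    # .npy file instead of materialising all of it in RAM at once.
    n = features.shape[0]
    dist = np.lib.format.open_memmap(out_path, mode="w+", dtype=np.float32, shape=(n, n))
    for start in range(0, n, batch_size):
        stop = start + batch_size
        dist[start:stop, :] = cdist(features[start:stop], features).astype(np.float32)
    dist.flush()
    return dist

dist = pairwise_distances_in_batches(np.random.rand(5000, 20).astype(np.float32))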
Related
Calculating an FFT with 16 GB of memory causes the memory to be exhausted.
from scipy import signal   # data, data_size and samp_rate are defined earlier in the asker's script

print(data_size)
freqs, times, spec_arr = signal.spectrogram(data, fs=samp_rate, nfft=1024, return_onesided=False, axis=0, scaling='spectrum', mode='magnitude')
Output as below:
537089518
Killed
How can I calculate the FFT of large data with an existing Python package?
A more general solution is to do it yourself. 1D FFTs can be split into smaller ones thanks to the well-known Cooley–Tukey FFT algorithm and multidimensional decomposition. For more information about this strategy, please read The Design and Implementation of FFTW3. You can run the operation on virtually mapped memory to make this easier. Some libraries/packages like FFTW let you perform fast in-place FFTs relatively easily. You may need to write your own Python package or use Cython so as not to allocate additional memory that is not memory-mapped.
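To illustrate the splitting idea itself (plain NumPy, not FFTW), here is a minimal sketch of the classic "four-step" Cooley–Tukey decomposition, which turns one length-n1*n2 FFT into n2 FFTs of length n1 plus n1 FFTs of length n2; in an out-of-core variant each step would operate on one memory-mapped slab at a time:

import numpy as np

def four_step_fft(x, n1, n2):
    # len(x) must equal n1 * n2
    a = x.reshape(n1, n2)                              # row-major view of the signal
    a = np.fft.fft(a, axis=0)                          # n2 column FFTs of length n1
    k1 = np.arange(n1)[:, None]
    m = np.arange(n2)[None, :]
    a = a * np.exp(-2j * np.pi * k1 * m / (n1 * n2))   # twiddle factors
    a = np.fft.fft(a, axis=1)                          # n1 row FFTs of length n2
    return a.T.reshape(-1)                             # reorder so that index k = k1 + n1*k2

x = np.random.rand(8 * 16)
assert np.allclose(four_step_fft(x, 8, 16), np.fft.fft(x))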
One alternative solution is to save your data in HDF5 (for example using h5py), then use out_of_core_fft, and then read the file back. But be aware that this package is a bit old and appears not to be maintained anymore.
I wrote some code to generate a large dataset of complex numpy matrices for ML applications, which I would like to somehow store on disk.
The most suitable idea seems to be saving the matrices into separate binary files. However, functions such as bytearray() seem to flatten the matrices into 1D arrays, thus losing the information about the matrix shape.
I guess I might need to fill each line independently, maybe using an additional for loop, but this would also require a for loop when loading and re-assembling the matrix.
What would be the correct procedure for storing those matrices in a way that minimizes the amount of space on disk and loading time?
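For reference, a minimal sketch of what is probably the simplest shape-preserving option, NumPy's own .npy/.npz format, which stores the dtype (including complex) and the shape alongside the raw bytes (file and variable names are made up):

import numpy as np

mats = [np.random.rand(64, 64) + 1j * np.random.rand(64, 64) for _ in range(3)]  # stand-in data
np.savez_compressed("matrices.npz", *mats)   # shape and complex dtype are preserved, data is zip-compressed
loaded = np.load("matrices.npz")
m0 = loaded["arr_0"]                         # comes back as a 64x64 complex array, no manual reassembly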
I have some very large matrices (say, on the order of a million rows) that I cannot keep in memory, and I need to access sub-samples of these matrices in a decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here are some more specifications about the kind of data I am dealing with.
The matrices are usually sparse (< 10% or < 25% of the cells are non-zero)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Have you tried PyTables? It can be very useful for very large matrices. Take a look at this SO post.
Your question is lacking a bit of context, but HDF5 compressed block (chunked) storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays off. That seems like an effective and simple solution, and as far as I know there are no sparse matrix formats that can easily be read partially from disk.
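A hedged sketch of that approach with h5py (made-up file name, chunk size and matrix size): the dataset is written block by block, and a rectangular sub-matrix is read back without touching the rest of the file.

import numpy as np
import h5py

n = 20000   # stand-in size; the same pattern scales to much larger matrices
with h5py.File("matrix.h5", "w") as f:
    dset = f.create_dataset("m", shape=(n, n), dtype="f4",
                            chunks=(1000, 1000), compression="gzip")
    for i in range(0, n, 1000):                         # write block by block so the
        dset[i:i + 1000, :] = np.random.rand(1000, n)   # full matrix never lives in RAM

with h5py.File("matrix.h5", "r") as f:
    sub = f["m"][2000:3000, 10000:11000]   # rectangular sub-matrix: only the touched chunks are read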
I am using numpy and trying to create a huge matrix.
While doing this, I get a MemoryError.
Because the matrix itself is not important, I will just show how to reproduce the error easily.
import numpy as np

a = 10000000000
data = np.array([float('nan')] * a)   # builds a 10-billion-element Python list first, then the array
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplication and division of numbers, which made me think that I have enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe in the meantime some genius will come up with an idea of how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating-point number takes 4 bytes, you'd have
(10000000000 * 4) / (2**30) = 37.25290298461914
i.e. roughly 37 GiB to hold in memory (and NumPy's default float64 would double that). So 24 GB of RAM is not enough.
If you can't afford to create such a matrix but still wish to do some computations, try sparse matrices.
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
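A minimal sketch of that duck-typing idea (hypothetical class name, only plain 2D slices supported): the object reports a huge shape but only materialises the slice that is actually requested.

import numpy as np

class ConstantMatrix:
    # Pretends to be a huge matrix filled with a single value (e.g. NaN) without allocating it.
    def __init__(self, shape, fill=float("nan"), dtype=np.float64):
        self.shape = shape
        self.fill = fill
        self.dtype = np.dtype(dtype)

    def __getitem__(self, index):
        rows, cols = index                                   # expects a pair of slices
        n_rows = len(range(*rows.indices(self.shape[0])))
        n_cols = len(range(*cols.indices(self.shape[1])))
        return np.full((n_rows, n_cols), self.fill, dtype=self.dtype)

m = ConstantMatrix((100000, 100000))
print(m[5:8, 0:4])   # only this 3x4 block is ever allocated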
If you use the PyCharm editor for Python, you can change its memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
Lowering PyCharm's own memory allocation in this file leaves more RAM for your program. You need to edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, change them to -Xms512m -Xmx1024m, and your program may then have enough memory to work,
but it will affect debugging performance in PyCharm.
I am a computer engineering student and I have started to work with Python. My assignment is to create a matrix, but at a very large scale. How can I handle it so that it takes less memory? I did some searching and found "memory handler", but I can't be sure whether that can be used for this. Or is there a module in the Python library for it?
Thank you.
You should be looking into numpy and scipy. They are relatively thin layers on top of blocks of memory, and are usually quite efficient for matrix-type calculations. If your matrix is large but sparse (i.e. most elements are 0), have a look at scipy's sparse matrices.
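For instance, a minimal sketch (with made-up coordinates and dimensions) of building a sparse matrix so that only the non-zero entries are stored:

import numpy as np
from scipy import sparse

rows = np.array([0, 3, 999999])        # coordinates of the non-zero cells
cols = np.array([1, 42, 500000])
vals = np.array([2.5, -1.0, 7.0])

# COO format is convenient to build; CSR is better for row slicing and arithmetic.
m = sparse.coo_matrix((vals, (rows, cols)), shape=(1000000, 1000000)).tocsr()
print(m.nnz, m.shape)                  # 3 stored values instead of 10**12 cells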