I have some very large matrices (let's say on the order of a million rows) that I cannot keep in memory, and I need to access sub-samples of them in decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here are some more specifications about the kind of data I am dealing with:
The matrices are usually sparse (< 10%, or at most < 25%, of cells are non-zero)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
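For reference, here is a rough sketch of what the PyTables route could look like; the file name, sizes and chunk shape are placeholders, not anything prescribed by the library.

import numpy as np
import tables

n = 1_000_000
filters = tables.Filters(complevel=5, complib='blosc')  # per-chunk compression

with tables.open_file('matrix.h5', mode='w') as f:
    # Chunked, compressed array on disk; only chunks that are actually
    # written take up space in the file.
    carray = f.create_carray(f.root, 'mat', atom=tables.Float64Atom(),
                             shape=(n, n), chunkshape=(1000, 1000),
                             filters=filters)
    carray[0:1000, 0:1000] = np.random.rand(1000, 1000)  # fill block by block

with tables.open_file('matrix.h5', mode='r') as f:
    block = f.root.mat[0:500, 0:500]  # reads only the chunks covering this window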
Your question is lacking a bit in context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays. That seems like an effective and simple solution; and as far as I know there are no sparse matrix formats which can easily be read partially from disk.
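To make that concrete with h5py as one possible HDF5 interface, here is a minimal sketch; the names, chunk shape and compression settings are assumptions, and the last line is the "cast the view to sparse" step.

import h5py
import numpy as np
from scipy import sparse

with h5py.File('matrix.h5', 'w') as f:
    dset = f.create_dataset('mat', shape=(1_000_000, 1_000_000), dtype='float64',
                            chunks=(1000, 1000), compression='gzip')
    dset[0:1000, 0:1000] = np.random.rand(1000, 1000)  # write one block

with h5py.File('matrix.h5', 'r') as f:
    block = f['mat'][200:800, 200:800]        # rectangular read from disk
    block_sparse = sparse.csr_matrix(block)   # cast to sparse in memory if it pays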
Related
I work with sparse, complex-valued matrices (A) of size up to ~100000 x 100000, with at most 3% non-zero entries. I wish to do things like expm(-At)v, A\v, etc., where v is an arbitrary vector. I used MATLAB, but I believe this is too much for MATLAB's memory, so I decided to switch to Python with HDF5/PyTables.
However, I am unable to find modules in Python/PyTables, etc. that do a sparse matrix multiplication in HDF5 style. I don't want to write the for loop over chunks of the sparse matrix myself, where you grab a chunk of the matrix, multiply it by the corresponding chunk of the vector, and then move on to the next chunk, so that the full, humongous matrix is never stored in RAM; such a for loop seems too time consuming.
My eventual goal is to integrate this hard-disk based multiplication routine into modules like scipy.sparse.linalg.expm_multiply to calculate expm(-At)*v, etc. for such huge sparse matrices.
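For concreteness, this is roughly the kind of chunked product being discussed, sketched with h5py; the file/dataset names and block size are assumptions, and each row block is cast to CSR so that the per-block product uses sparse arithmetic.

import h5py
import numpy as np
from scipy import sparse

def hdf5_matvec(path, dset_name, v, block_rows=1000):
    # y = A @ v computed one row block at a time, never holding all of A in RAM.
    with h5py.File(path, 'r') as f:
        A = f[dset_name]                         # (n, n) dataset on disk
        n = A.shape[0]
        y = np.zeros(n, dtype=np.result_type(A.dtype, v.dtype))
        for start in range(0, n, block_rows):
            stop = min(start + block_rows, n)
            block = sparse.csr_matrix(A[start:stop, :])  # read + sparsify one block
            y[start:stop] = block @ v
    return y

Whether this is fast enough depends mostly on how the dataset is chunked and compressed; the Python-level loop itself only runs n / block_rows times.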
I am developing a recommendation system using collaborative filtering in Python, with pandas DataFrames and NumPy arrays to create the matrices. The application runs fine with a 1000-user base, but with 20k+ users it throws a memory error while generating a 20k x 20k matrix. Please help me solve the issue.
user_test_level_12 = pd.DataFrame(
    squareform(pdist(user_test_12.ix[:, 1:])),
    columns=user_test_12.student_id,
    index=user_test_12.student_id,
)
A 20K x 20K float64 matrix takes 20,000 x 20,000 x 8 bytes, roughly 3.2 GB, and pdist/squareform allocate additional intermediates on top of that, which can easily exceed the memory available to your process. That's why you get a MemoryError.
I'd suggest one of two options. The first is to work in batches: each time calculate a small part of the matrix, and add the parts together if you really need the whole thing at once.
The second option would be using a sparse matrix. I assume most of your data is sparse as it is a recommender system. A sparse matrix can save you both memory and computational time.
Without seeing the code or knowing your intention that's the best I can think of.
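As a rough illustration of the batching idea, the pairwise-distance matrix from the question can be built one row block at a time with cdist, writing into an on-disk float32 memmap so it never has to fit in RAM; user_test_12 is the DataFrame from the question, and I'm assuming the features are everything after the first column, as in its code.

import numpy as np
from scipy.spatial.distance import cdist

features = user_test_12.iloc[:, 1:].to_numpy(dtype=np.float32)
n = features.shape[0]
dist = np.memmap('distances.dat', dtype=np.float32, mode='w+', shape=(n, n))

block = 2000
for i in range(0, n, block):
    dist[i:i + block] = cdist(features[i:i + block], features)  # one row block at a time
dist.flush()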
def tdm_modify(feature_names, tdm):
    # Zero out the columns of the term-document matrix for words we want to ignore.
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm
I want to manually set zero weights for some terms in the tdm matrix. Using the above code I get the warning below, and I don't understand why. Is there a better way to do this?
C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
First, it is not an error; it's a warning. And after the first time you perform this action in a session, it will not warn again.
To me the message is clear:
Changing the sparsity structure of a csr_matrix is expensive.
lil_matrix is more efficient.
tdm is a csr_matrix. Because of the way data is stored in that format, it takes quite a bit of extra computation to set a bunch of the elements to 0 (or, vice versa, to change them from 0). As the warning says, the lil_matrix format is better if you need to make this sort of change frequently.
Try some time tests on sample matrices; tdm.tolil() will convert the matrix to lil format.
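A small sketch of that round trip, reusing the column indexes computed in the question's tdm_modify:

tdm_lil = tdm.tolil()          # cheap to modify structurally
for index in indexes:
    tdm_lil[:, index] = 0      # no SparseEfficiencyWarning here
tdm = tdm_lil.tocsr()          # back to csr for fast arithmetic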
I could get into how the data is stored and why modifying csr is less efficient than lil, but instead I'd suggest reviewing the sparse formats and their respective pros and cons.
A simple way to think about it is this: csr (and csc) are designed for fast numerical calculations, especially matrix multiplication; they were developed for linear algebra problems. coo is a convenient way of defining sparse matrices, and lil is a convenient way of building matrices incrementally.
How are you constructing tdm initially?
In scipy test files (e.g. scipy/sparse/linalg/dsolve/tests/test_linsolve.py) I find code that does
import warnings
from scipy.sparse import (spdiags, SparseEfficiencyWarning, csc_matrix,
                          csr_matrix, isspmatrix, dok_matrix, lil_matrix, bsr_matrix)
warnings.simplefilter('ignore', SparseEfficiencyWarning)
And in scipy/sparse/base.py:
class SparseWarning(Warning):
    pass

class SparseFormatWarning(SparseWarning):
    pass

class SparseEfficiencyWarning(SparseWarning):
    pass
These warnings use the standard Python Warning class, so standard Python methods for controlling their expression apply.
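For example, with tdm and index as in the question's code, you can silence the warning for one block of code, or turn it into an error while debugging:

import warnings
from scipy.sparse import SparseEfficiencyWarning

with warnings.catch_warnings():
    warnings.simplefilter('ignore', SparseEfficiencyWarning)
    tdm[:, index] = 0                    # no warning inside this block

warnings.simplefilter('error', SparseEfficiencyWarning)  # now it raises instead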
I ran into this warning message as well while working on a machine learning problem. The exact application was constructing a document-term matrix from a corpus of text. I agree with the accepted answer; I will add one empirical observation:
My exact task was to build a 25000 x 90000 matrix of uint8.
My desired output was a sparse matrix in compressed sparse row format, i.e. a csr_matrix.
The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix with np.zeros(), build it up, and then call csr_matrix(dense_matrix) once at the end.
The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).
The slowest way was to assemble the csr_matrix element by element.
So to sum up: if you have enough working memory to build a dense matrix, and only need a sparse matrix later for downstream efficiency, it may be faster to build the matrix in dense format and convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building a csr_matrix from the start.
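If you want to reproduce the comparison at toy scale (far smaller than 25000 x 90000, and with made-up random entries), something along these lines works; the exact timings will of course vary.

import time
import numpy as np
from scipy import sparse

rows, cols, n_entries = 2000, 5000, 10_000
rng = np.random.default_rng(0)
r = rng.integers(0, rows, n_entries)
c = rng.integers(0, cols, n_entries)
vals = rng.integers(1, 10, n_entries).astype(np.uint8)

# 1) dense buffer, convert once at the end
t0 = time.time()
dense = np.zeros((rows, cols), dtype=np.uint8)
dense[r, c] = vals
a = sparse.csr_matrix(dense)
print('dense -> csr:', time.time() - t0)

# 2) build a lil_matrix, then convert
t0 = time.time()
lil = sparse.lil_matrix((rows, cols), dtype=np.uint8)
for i, j, v in zip(r, c, vals):
    lil[i, j] = v
b = lil.tocsr()
print('lil -> csr:  ', time.time() - t0)

# 3) assemble the csr_matrix element by element (triggers the warning)
t0 = time.time()
m = sparse.csr_matrix((rows, cols), dtype=np.uint8)
for i, j, v in zip(r, c, vals):
    m[i, j] = v
print('csr in place:', time.time() - t0)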
I am a computer engineering student and I have started to work with Python. My assignment is to create a very large matrix. How can I handle it so that it takes less memory? I did some searching and found "memory handler", but I can't tell whether that can be used for this. Or is there a module in the Python library for it?
Thank you.
You should be looking into numpy and scipy. They are relatively thin layers on top of blocks of memory, and are usually quite efficient for matrix-type calculations. If your matrix is large but sparse (i.e. most elements are 0), have a look at scipy's sparse matrices.
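To get a feel for the footprint: a numpy array is essentially a header plus one contiguous block of memory, so its size is just elements times itemsize, and picking a smaller dtype helps directly (the shape below is only an example).

import numpy as np

a = np.zeros((10_000, 10_000), dtype=np.float32)
print(a.nbytes)   # 400000000 bytes, i.e. ~400 MB; float64 would need ~800 MB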
Possible Duplicate:
Python Numpy Very Large Matrices
I tried numpy.zeros((100000, 100000)) and it returned "array is too big".
Response to comments:
1) I could create a 10k x 10k matrix, but not 100k x 100k or 1mil x 1mil.
2) The matrix is not sparse.
We can do simple maths to find out. A 1 million by 1 million matrix has 1,000,000,000,000 elements. If each element takes up 4 bytes, it would require 4,000,000,000,000 bytes of memory. That is, 3.64 terabytes.
There is also a chance that a given implementation of Python uses more than that for a single number. For instance, just the leap from a float to a double means you'll need 7.28 terabytes instead. (There is also a chance that Python stores the number on the heap and all you get is a pointer to it, approximately doubling the footprint, without even taking metadata into account; but that's slippery ground, I'm always wrong when I talk about Python internals, so let's not dig into it too much.)
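The same arithmetic in Python, in tebibytes:

print(1_000_000 ** 2 * 4 / 2 ** 40)   # ~3.64 for 4-byte floats
print(1_000_000 ** 2 * 8 / 2 ** 40)   # ~7.28 for 8-byte doubles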
I suppose numpy doesn't have a hardcoded limit, but if your system doesn't have that much free memory, there isn't really anything you can do.
Does your matrix have a lot of zero entries? I suspect it does; few people work with dense problems that large.
You can easily do that with a sparse matrix. SciPy has a good set built in. http://docs.scipy.org/doc/scipy/reference/sparse.html
The space required by a sparse matrix grows with the number of nonzero elements, not the dimensions.
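To make that concrete, here is a tiny sketch: a 1,000,000 x 1,000,000 matrix with just four non-zero entries costs only a few dozen bytes of payload in COO format (the indices and values are made up).

import numpy as np
from scipy import sparse

n = 1_000_000
rows = np.array([0, 3, 50_000, 999_999])
cols = np.array([1, 3, 70_000, 999_999])
vals = np.array([2.5, -1.0, 4.0, 7.0])

a = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))
print(a.shape, a.nnz)                               # (1000000, 1000000) 4
print(a.data.nbytes + a.row.nbytes + a.col.nbytes)  # storage grows with nnz, not n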
Your system probably won't have enough memory to store the matrix in memory, but nowadays you might well have enough free disk space. In that case, numpy.memmap would let you keep the array on disk while it appears as if it resides in memory.
However, it's probably best to rethink the problem. Do you really need a matrix this large? Any computations involving it will probably be infeasibly slow, and need to be done blockwise.
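A minimal numpy.memmap sketch of the on-disk option mentioned above; the file name is a placeholder, and the shape is kept at 20,000 x 20,000 (about 3.2 GB on disk) so it is runnable, but the same pattern scales to whatever your disk can hold.

import numpy as np

n = 20_000
m = np.memmap('big_matrix.dat', dtype=np.float64, mode='w+', shape=(n, n))
m[0, :10] = np.arange(10)    # writes go to the file, page by page
m.flush()

m2 = np.memmap('big_matrix.dat', dtype=np.float64, mode='r', shape=(n, n))
print(m2[0, :10])            # reads only touch the pages you access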