I am using Python/Pandas to deal with very large and very sparse single-column data frames, but when I pickle them, there is virtually no benefit. If I try the same thing in Matlab, the difference is colossal, so I am trying to understand what is going on.
Using Pandas:
len(SecondBins)
>> 34300801
dense = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'],index=SecondBins)
sparse = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'],index=SecondBins).to_sparse(fill_value=0)
pickle.dump(dense,open('dense.p','wb'))
pickle.dump(sparse,open('sparse.p','wb'))
Looking at the sizes of the pickled files,
dense = 548.8MB
sparse = 274.4MB
However, when I look at memory usage associated with these variables,
dense.memory_usage()
>>Binary 274406408
>>dtype: int64
sparse.memory_usage()
>>Binary 0
>>dtype: int64
So, for a completely empty sparse vector, there are only slightly more than 50% savings. Perhaps it has something to do with the fact that the variable 'SecondBins' is composed of pd.Timestamp objects, which I use as indices, so I tried a similar procedure using default indices.
dense_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary'])
sparse_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)),columns=['Binary']).to_sparse(fill_value=0)
pickle.dump(dense_defaultindex,open('dense_defaultindex.p','wb'))
pickle.dump(sparse_defaultindex,open('sparse_defaultindex.p','wb'))
But it yields the same sizes on disk.
What is pickle doing under the hood?
If I create a similar zero-filled vector in Matlab, and save it in a .mat file, it's ~180 bytes!?
Please advise.
Regards
Remember that pandas works with labelled data. The column labels and the index labels are essentially specialized arrays, and those arrays take up space. So in practice the index acts as an additional column as far as space usage goes, and the column headings act as an additional row.
In the dense case, you essentially have two columns, the data and the index. In the sparse case, you have essentially one column, the index (since the sparse data column contains close to no data). So from this perspective, you would expect the sparse case to be about half the size of the dense case. And that is what you see in your file sizes.
In the MATLAB case, however, the data is not labeled. Therefore, the sparse case takes up almost no space. The equivalent to the MATLAB case would be a sparse matrix, not a sparse dataframe structure. So if you want to take full advantage of the space savings, you should use scipy.sparse, which provides sparse matrix support similar to what you get in MATLAB.
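For example, a minimal sketch of the scipy.sparse route (the length matches the question; the file name is illustrative, and scipy.sparse.save_npz is used for on-disk storage):
import scipy.sparse as sp

n = 34300801
empty = sp.coo_matrix((n, 1))     # an all-zero n x 1 sparse matrix stores no entries at all
sp.save_npz('empty.npz', empty)   # well under a kilobyte on disk, comparable to the tiny .mat file
dense_back = empty.toarray()      # convert back to a dense numpy array when needed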
Related
I have some very large matrices (let's say on the order of a million rows) that I cannot keep in memory, and I would need to access subsamples of this matrix in decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here are some more specifications about the kind of data I am dealing with.
The matrices are usually sparse (< 10% or < 25% of cells with non-zero values)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
Your question is lacking a bit in context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays. That seems like an effective and simple solution; and as far as I know there are no sparse matrix formats which can easily be read partially from disk.
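For illustration, a minimal PyTables sketch of that idea (the file name, chunk shape, and exact sizes are assumptions, not taken from the question):
import tables

n = 1000000   # on the order of the million rows mentioned above
with tables.open_file('matrix.h5', mode='w') as f:
    filters = tables.Filters(complevel=5, complib='blosc')
    M = f.create_carray(f.root, 'M', atom=tables.Float64Atom(),
                        shape=(n, n), chunkshape=(512, 512), filters=filters)
    # ... fill M block by block; mostly-zero chunks compress very well,
    # and chunks that are never written take no space at all ...

with tables.open_file('matrix.h5', mode='r') as f:
    block = f.root.M[10000:11000, 10000:11000]   # reads only the chunks covering this block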
def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm
I want to manually set zero weights for some terms in the tdm matrix. Using the above code I get the warning below. I don't understand why. Is there a better way to do this?
C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
First, it is not an error. It's a warning. The next time you perform this action (in a session) it will do it without warning.
To me the message is clear:
Changing the sparsity structure of a csr_matrix is expensive.
lil_matrix is more efficient.
tdm is a csr_matrix. The way data is stored in that format, it takes quite a bit of extra computation to set a bunch of the elements to 0 (or, vice versa, to change them from 0). As it says, the lil_matrix format is better if you need to make this sort of change frequently.
Try some time tests on a sample matrix. tdm.tolil() will convert the matrix to lil format.
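For instance, a minimal sketch of that pattern, reusing the names from the question:
tdm_lil = tdm.tolil()        # convert once to LIL, where item assignment is cheap
for index in indexes:
    tdm_lil[:, index] = 0    # no SparseEfficiencyWarning here
tdm = tdm_lil.tocsr()        # convert back to CSR for fast downstream math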
I could get into how the data is stored and why changing a csr matrix is less efficient than changing a lil one.
I'd suggest reviewing the sparse formats, and their respective pros and cons.
A simple way to think about it is: csr (and csc) are designed for fast numerical calculations, especially matrix multiplication. They were developed for linear algebra problems. coo is a convenient format for defining sparse matrices, and lil is a convenient format for building matrices incrementally.
How are you constructing tdm initially?
In scipy test files (e.g. scipy/sparse/linalg/dsolve/tests/test_linsolve.py) I find code that does
import warnings
from scipy.sparse import (spdiags, SparseEfficiencyWarning, csc_matrix,
                          csr_matrix, isspmatrix, dok_matrix, lil_matrix, bsr_matrix)
warnings.simplefilter('ignore',SparseEfficiencyWarning)
scipy/sparse/base.py
class SparseWarning(Warning):
    pass

class SparseFormatWarning(SparseWarning):
    pass

class SparseEfficiencyWarning(SparseWarning):
    pass
These warnings use the standard Python Warning class, so standard Python methods for controlling their expression apply.
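For example, during development you can promote the warning to an error to find exactly where it is triggered, or silence it as the scipy tests above do:
import warnings
from scipy.sparse import SparseEfficiencyWarning

warnings.simplefilter('error', SparseEfficiencyWarning)    # raise it as an exception
# warnings.simplefilter('ignore', SparseEfficiencyWarning) # or suppress it entirely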
I ran into this warning message as well while working on a machine learning problem. The exact application was constructing a document-term matrix from a corpus of text. I agree with the accepted answer. I will add one empirical observation:
My exact task was to build a 25000 x 90000 matrix of uint8.
My desired output was a sparse matrix in compressed row format, i.e. a csr_matrix.
The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix using np.zeros(), build it up, then do csr_matrix(dense_matrix) once at the end.
The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).
The slowest way was to assemble the csr_matrix element by element.
So to sum up: if you have enough working memory to build a dense matrix, and only want to end up with a sparse matrix later on for downstream efficiency, it might be faster to build up the matrix in dense format and then convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building up the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building up a csr_matrix from the start.
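As a rough sketch of the three strategies (the shapes follow the numbers above; the variable names are illustrative and the fill steps are elided):
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

n_docs, n_terms = 25000, 90000

# 1. dense build, one conversion at the end (fastest, most interim memory)
dense = np.zeros((n_docs, n_terms), dtype=np.uint8)
# ... fill dense[i, j] while scanning the corpus ...
dtm = csr_matrix(dense)

# 2. build a lil_matrix, then convert (second fastest)
lil = lil_matrix((n_docs, n_terms), dtype=np.uint8)
# ... fill lil[i, j] ...
dtm = lil.tocsr()

# 3. element-by-element assignment into a csr_matrix (slowest, triggers the warning)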
My situation is like this:
I have around 70 million integer values distributed across various files for ~10 categories of data (the exact number is not known)
I read those several files, and create some python object with that data. This obviously includes reading each file line by line and appending to the python object. So I'll have an array of 70 million subarrays, with 10 values in each.
I do some statistical processing on that data. This involves appending several values (say, a percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use Numpy for more memory-efficient arrays. But then I've heard that appending to Numpy arrays is discouraged because it is inefficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
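A minimal sketch of that route (the file name, dtype, and shape are assumptions based on the numbers above):
import numpy as np

# create a disk-backed array of 70 million rows x 10 columns (~5.6 GB of int64 on disk)
data = np.memmap('data.dat', dtype=np.int64, mode='w+', shape=(70000000, 10))
# fill it incrementally while parsing the input files, e.g. data[start:stop] = block
data.flush()

# in a later session, reopen it read-only and treat it like an ordinary array
data = np.memmap('data.dat', dtype=np.int64, mode='r', shape=(70000000, 10))
col_means = data.mean(axis=0)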
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
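A short h5py sketch of the same idea (the file and dataset names, chunk shape, and compression settings are illustrative):
import h5py

with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('values', shape=(70000000, 10), dtype='int64',
                            chunks=(100000, 10), compression='gzip')
    # write blocks as the input files are parsed, e.g. dset[start:stop] = block

with h5py.File('data.h5', 'r') as f:
    subset = f['values'][1000000:2000000]   # reads and decompresses only the needed chunks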
I generate feature vectors for examples from a large amount of data, and I would like to store them incrementally while I am reading the data. The feature vectors are numpy arrays. I do not know the number of numpy arrays in advance, and I would like to store/retrieve them incrementally.
Looking at pytables, I found two options:
Arrays: They require a predetermined size, and I am not quite sure how computationally efficient appending is.
Tables: The column types do not support lists or arrays.
If it is a plain numpy array, you should probably use Extendable Arrays (EArray) http://pytables.github.io/usersguide/libref/homogenous_storage.html#the-earray-class
If you have a numpy structured array, you should use a Table.
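A minimal EArray sketch (the file name, the 128-element vector length, and the generate_vectors() helper are hypothetical):
import tables

with tables.open_file('features.h5', mode='w') as f:
    earray = f.create_earray(f.root, 'features',
                             atom=tables.Float64Atom(),
                             shape=(0, 128))      # 0 marks the extendable axis
    for vector in generate_vectors():             # hypothetical source of 1-D numpy arrays
        earray.append(vector.reshape(1, -1))      # append rows as they are produced

with tables.open_file('features.h5', mode='r') as f:
    first_batch = f.root.features[:1000]          # read back incrementally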
Can't you just store them in an array (a plain Python list)? Your code presumably has a loop that grabs things from the data and generates each example; create an array outside the loop and append each vector to it for storage!
array = []
for row in file:
    # here is your code that creates the vector
    array.append(vector)
Then, after you have gone through the whole file, you have an array with all of your generated vectors! Hopefully that is what you need; you were a bit unclear... next time please provide some code.
Oh, and you did say you wanted pytables, but I don't think it's necessary, especially because of the limitations you mentioned.
I have a large matrix (approx. 80,000 X 60,000), and I basically want to scramble all the entries (that is, randomly permute both rows and columns independently).
I believe it'll work if I loop over the columns, and use randperm to randomly permute each column. (Or, I could equally well do rows.) Since this involves a loop with 60K iterations, I'm wondering if anyone can suggest a more efficient option?
I've also been working with numpy/scipy, so if you know of a good option in python, that would be great as well.
Thanks!
Susan
Thanks for all the thoughtful answers! Some more info: the rows of the matrix represent documents, and the data in each row is a vector of tf-idf weights for that document. Each column corresponds to one term in the vocabulary. I'm using pdist to calculate cosine similarities between all pairs of papers. And I want to generate a random set of papers to compare to.
I think that just permuting the columns will work, then, because each paper gets assigned a random set of term frequencies. (Permuting the rows just means reordering the papers.) As Jonathan pointed out, this has the advantage of not making a new copy of the whole matrix, and it sounds like the other options all will.
You should be able to reshape the matrix to a 1 × 4800000000 "array", randperm it, and finally reshape it back to a 80000 × 60000 matrix.
This will require copying the 4.8 billion entries 3 times at worst. This might not be efficient.
EDIT: Actually Matlab supports linear indexing automatically, so the first reshape is not needed. Just
reshape(x(randperm(4800000000)), 80000, 60000)
is enough (thus eliminating one unnecessary potential copy).
Note that this assumes you have a dense matrix. If you have a sparse matrix, you could extract the values and then randomly reassign indices to them. If there are N nonzero entries, then only 8N copies are needed at worst (3 numbers are required to describe one entry).
I think it would be better to do this:
import numpy as np
flat = matrix.ravel()    # a view, not a copy, as long as matrix is contiguous
np.random.shuffle(flat)  # shuffles the matrix's entries in place through that view
You are basically flattening the matrix into a 1-D array, shuffling it, and, because ravel returns a view of a contiguous matrix, the original matrix is scrambled in place rather than re-constructed.
Both solutions above are great and will work, but I believe both may involve making a completely new copy of the entire matrix in memory while doing the work (the numpy ravel is only a view if the matrix is contiguous). Since this is a huge matrix, that's pretty painful. In the case of the MATLAB solution, I think you may be creating two extra temporary copies, depending on how reshape works internally.

I think you were on the right track by operating on columns, but the problem is that it will only scramble along columns. However, I believe if you do randperm along rows after that, you'll end up with a fully permuted matrix. This way you'll only be creating temporary variables that are, at worst, 80,000 by 1.

Yes, that's two loops with 60,000 and 80,000 iterations each, but internally that's going to have to happen regardless: the algorithm has to visit each memory location at least twice. You could probably do a more efficient algorithm by writing a C MEX function that operates completely in place, but I assume you'd rather not do that.
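In numpy, that column-then-row idea can be sketched in place, since np.random.shuffle works through views without copying the matrix (matrix is assumed to be the 80,000 x 60,000 array from the question):
import numpy as np

for j in range(matrix.shape[1]):      # permute the entries within each column
    np.random.shuffle(matrix[:, j])
for i in range(matrix.shape[0]):      # then permute the entries within each row
    np.random.shuffle(matrix[i, :])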