When trying to turn a roughly (2,000,000 x 3) array of one-hot encoded values into a DataFrame, I encounter a 'DataFrame constructor not properly called!' error.
I've also explicitly tried wrapping the array in np.asarray(), but then I get a 'Must pass 2-d input' error.
import numpy as np
import pandas as pd
import sklearn.preprocessing as skp

enc = skp.OneHotEncoder()
X_ismale = enc.fit_transform(X.IsMaleBucket.values.reshape(-1, 1))  # returns a scipy sparse matrix
X_ismale = pd.DataFrame(X_ismale, columns=['IsMale_' + str(i) for i in np.sort(X.IsMaleBucket.unique())])
X_ismale has type:
<2256308x3 sparse matrix of type '<class 'numpy.float64'>'
with 2256308 stored elements in Compressed Sparse Row format>
The error is as described above.
I expect an error-free conversion to a DataFrame but can't get one.
Pandas cannot work with sparse matrices, only with dense data. You can use to_array to convert the sparse matrix to a dense array. – jdehesa 9 mins ago
Using to_array worked, although in the current version the method turned out to be named toarray.
Thanks.
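For reference, a minimal sketch of the fix, reusing the column names from the question:

# densify the sparse one-hot matrix before handing it to pandas;
# note that with ~2.3M rows this allocates the full dense array in memory
X_ismale = pd.DataFrame(X_ismale.toarray(),
                        columns=['IsMale_' + str(i) for i in np.sort(X.IsMaleBucket.unique())])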
Related
I am trying to use scipy to perform sparse linear algebra calculations in the dok (dictionary of keys) format.
When I multiply two matrices together, the format changes from dok to csr, which is an inefficient format for the data and subsequent operations.
How can I keep the dok format?
I have looked at the docs:
scipy sparse matrix
dok_matrix
But I cannot see any information about automatic type conversion, or whether and how it can be avoided.
See this example:
from scipy.sparse import dok_matrix

my_mat = dok_matrix([[1, 2], [3, 4]])
print(type(my_mat.dot(my_mat)))
print(type(my_mat @ my_mat))
shows that the format has been changed:
<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
Just convert back:
result = result.todok()
CSR may be an inefficient format for subsequent operations (or maybe not, we can't tell), but it's great for matrix multiplication. Trying to make the matrix multiplication code operate on a DOK result natively would be slower than just converting the result.
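A minimal sketch of that convert-back approach:

from scipy.sparse import dok_matrix

my_mat = dok_matrix([[1, 2], [3, 4]])
result = my_mat @ my_mat  # multiplication returns a csr_matrix
result = result.todok()   # convert the product back to dok
print(type(result))       # dok_matrix again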
As pointed out by @user2357112, csr is good for linear algebra. The cost of conversion is, however, significant. Since dok is not the only format that supports reasonably fast element editing, it is worth checking out the other option, lil. Depending on your use case you may save quite a bit of time:
from scipy.sparse import random
from timeit import timeit

a = random(100, 100, 0.1, format='lil')
b = random(100, 100, 0.1, format='dok')
a
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
# with 1000 stored elements in LInked List format>
b
# <100x100 sparse matrix of type '<class 'numpy.float64'>'
# with 1000 stored elements in Dictionary Of Keys format>
timeit(lambda: (a @ a).tolil(), number=100) * 10
# 1.491789099527523
timeit(lambda: (b @ b).todok(), number=100) * 10
# 4.220661079743877
Note that a @ a (and likewise b @ b) is rather dense in this example; if we choose a sparser case, the difference is less pronounced:
a = random(100,100,0.01,format='lil')
b = random(100,100,0.01,format='dok')
timeit(lambda: (a @ a).tolil(), number=100) * 10
# 0.6880075298249722
timeit(lambda: (b @ b).todok(), number=100) * 10
# 0.7450748200062662
Can anyone please suggest the easiest and fastest way to populate a csr matrix A with values from the columns of another csr matrix B, which is of size 400k x 800k?
My failed attempt:
# x is a list of size 500 containing the column numbers needed from B
A = sparse.csr_matrix((400000, 500))
for i in range(400000):
    for j in range(500):
        A[i, j] = B[i, x[j]]
Also, is there an easy way to split the matrix B in the ratio of 4:1?
It helps to think about the problem as if A and B were both dense arrays first. If I understand your question right, you'd want something like:
A = B[:, x]
It turns out that you can do the same operation with CSR matrices as well, and it's reasonably efficient. The key is to avoid assigning values to an existing sparse matrix (especially if it's in CSR or CSC format). By doing the indexing all at once, scipy is able to use more efficient methods.
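A short sketch, assuming x holds the 500 column indices and that the 4:1 split is meant row-wise (the matrix here is random stand-in data):

import numpy as np
from scipy import sparse

B = sparse.random(400000, 800000, density=1e-6, format='csr')
x = np.random.choice(800000, size=500, replace=False)  # hypothetical column list

A = B[:, x]  # fancy column indexing, done in one shot

split = B.shape[0] * 4 // 5
B_train, B_test = B[:split], B[split:]  # 4:1 row split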
I am creating a co-occurrence matrix of size 1M x 1M integer values.
After the matrix is created, the only operation I will do on it is to get the top N values per row (or per column, as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file and update the co-occurrence counts of index pairs (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M x 1M), but in reality it might be smaller than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # update element
A SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use lil format; I recommended building the input arrays for a coo format.
In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases, I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better.
dok and lil are the best formats for incremental construction, but neither is that great compared to python list and dict methods.
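A minimal sketch of the coo route, assuming the file reader yields (row, col) index pairs:

import numpy as np
from scipy import sparse

rows, cols, vals = [], [], []
for row, col in pairs:  # `pairs` stands in for your file reader
    rows.append(row)
    cols.append(col)
    vals.append(1)

# coo_matrix sums duplicate (row, col) entries on conversion,
# which is exactly the incremental co-occurrence update
S = sparse.coo_matrix((vals, (rows, cols)), shape=(1000000, 1000000)).tocsr()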
As for the top N values of each row, I recall exploring that, but it was some time back, so I can't pull up a good SO question offhand. You probably want a row-oriented format such as lil or csr.
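For instance, a rough sketch of top-N per row working directly on the csr attributes (the function name is illustrative):

import numpy as np

def top_n_per_row(S, n):
    # S is a csr_matrix; returns a list of (column, value) pairs per row
    out = []
    for i in range(S.shape[0]):
        start, stop = S.indptr[i], S.indptr[i + 1]
        data, cols = S.data[start:stop], S.indices[start:stop]
        if len(data) > n:
            keep = np.argpartition(data, -n)[-n:]
            data, cols = data[keep], cols[keep]
        out.append(sorted(zip(cols, data), key=lambda t: -t[1]))
    return out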
As for the question 'do you need to specify the size on creation': yes. But because a sparse matrix, regardless of format, only stores nonzero values, there is little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format matrix that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how values are stored, and play with small matrices.
==================
It looks like all the code for the dok format is Python in
/usr/lib/python3/dist-packages/scipy/sparse/dok.py
Scanning that file, I see that dok does have a resize method
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix to 1M x 1M and resize to 100 x 100 you can do so - it will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is to change the shape parameter.
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any keys that fall outside the new shape, you could change _shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix
(This question relates to "populate a Pandas SparseDataFrame from a SciPy Sparse Matrix". I want to populate a SparseDataFrame from a scipy.sparse.coo_matrix specifically; the question mentioned above is about a different SciPy sparse matrix type (csr)...
So here it goes...)
I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:
return DataFrame(matrix.toarray(), columns=features, index=observations)
Is there a way to create a SparseDataFrame() from a scipy.sparse.coo_matrix()? Converting to dense format kills RAM badly...!
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.
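A minimal usage sketch (SparseSeries was the API through pandas 0.25; later pandas moved this to the Series.sparse accessor):

import pandas as pd
from scipy.sparse import coo_matrix

A = coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
s = pd.SparseSeries.from_coo(A)
# on pandas >= 1.0 the equivalent is: s = pd.Series.sparse.from_coo(A)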
Within scipy.sparse there are methods that convert the data forms to each other. .tocoo, .tocsc, etc. So you can use which ever form is best for a particular operation.
For going the other way, I've answered
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
Your linked answer from 2013 iterates by row - using toarray to make the row dense. I haven't looked at what the pandas from_coo does.
A more recent SO question on pandas sparse
non-NDFFrame object error using pandas.SparseSeries.from_coo() function
From https://github.com/pydata/pandas/blob/master/pandas/sparse/scipy_sparse.py
def _coo_to_sparse_series(A, dense_index=False):
    """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
    Use the defaults given in the SparseSeries constructor. """
    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
    s = s.to_sparse()  # TODO: specify kind?
    # ...
    return s
In effect it takes the same data, i, j arrays used to build a coo matrix, makes a Series, sorts it, and turns it into a sparse Series.
I was working with some scipy.sparse.csr_matrix objects. Honestly, the one I have at hand is from scikit-learn's TfidfVectorizer:
vectorizer = TfidfVectorizer(min_df=0.0005)
textsMet2 = vectorizer.fit_transform(textsMet)
Ok, so this is a matrix:
textsMet2
<999x1632 sparse matrix of type '<class 'numpy.float64'>'
with 5042 stored elements in Compressed Sparse Row format>
Now I want to get only those rows which have any non-zero elements. So obviously I go for simple indexing:
textsMet2[(textsMet2.sum(axis=1)>0),:]
And get an error:
File "D:\Apps\Python\lib\site-packages\scipy\sparse\sputils.py", line 327, in _boolean_index_to_array
raise IndexError('invalid index shape')
IndexError: invalid index shape
If I remove the last part of the indexing, I get something strange:
textsMet2[(textsMet2.sum(axis=1)>0)]
<1x492 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
Why does it show me just a 1-row matrix?
Once again, I want to get all rows of this matrix that have any non-zero element. Does anyone know how to do this?
You need to ravel your mask. Here is a bit of code from the thing I'm working on at the moment:
tr_matrix = pipeline.fit_transform(train_text, y_train, **fit_params)
# remove documents with too few features
to_keep_train = tr_matrix.sum(axis=1) >= config['min_train_features']
to_keep_train = np.ravel(np.array(to_keep_train))
logging.info('%d/%d train documents have enough features',
             sum(to_keep_train), len(y_train))
tr_matrix = tr_matrix[to_keep_train, :]
This is a little inelegant but gets the job done.
This will remove all-zero rows and columns:
X = X[np.array(np.sum(X, axis=1)).ravel() != 0, :]  # drop rows whose sum is zero
X = X[:, np.array(np.sum(X, axis=0)).ravel() != 0]  # drop columns whose sum is zero
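An alternative sketch that skips the dense sums, assuming a scipy version where getnnz accepts an axis argument; it counts stored nonzeros directly, which matches "any non-zero element" even when signed values could cancel in a sum:

X = X[X.getnnz(axis=1) > 0]     # keep rows with at least one stored nonzero
X = X[:, X.getnnz(axis=0) > 0]  # keep columns with at least one stored nonzero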