Check if two scipy.sparse.csr_matrix are equal - python

I want to check if two csr_matrix are equal.
If I do:
x.__eq__(y)
I get:
raise ValueError("The truth value of an array with more than one "
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
This, however, works well:
assert (z in x for z in y)
Is there a better way to do it? maybe using some scipy optimized function instead?
Thanks so much

Can we assume they are the same shape?
In [202]: a=sparse.csr_matrix([[0,1],[1,0]])
In [203]: b=sparse.csr_matrix([[0,1],[1,1]])
In [204]: (a!=b).nnz==0
Out[204]: False
This checks the sparsity of the inequality array.
It will give you an efficiency warning if you try a == b (at least the first time you use it). That's because it has to test all those zeros; it can't take much advantage of the sparsity.
You need a relatively recent scipy version to use logical operators like this. Were you trying to use x.__eq__(y) in some if expression, or did you get the error from just that expression?
In general you probably want to check several parameters first. Same shape, same nnz, same dtype. You need to be careful with floats.
For dense arrays np.allclose is a good way of testing equality, and if the sparse arrays aren't too large, that might be good as well:
np.allclose(a.A, b.A)
allclose uses all(less_equal(abs(x-y), atol + rtol * abs(y))). You can use a-b, but I suspect that this too will give an efficiency warning.
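Putting those pieces together, here is a minimal sketch of the checks discussed above; csr_equal is just an illustrative helper name, not a scipy function:
import numpy as np
from scipy import sparse

def csr_equal(a, b, exact=True):
    # cheap checks first: shape, dtype, number of stored entries
    if a.shape != b.shape or a.dtype != b.dtype or a.nnz != b.nnz:
        return False
    if exact:
        return (a != b).nnz == 0                  # sparse comparison, no dense copy
    return np.allclose(a.toarray(), b.toarray())  # float-tolerant, dense; small matrices only

a = sparse.csr_matrix([[0, 1], [1, 0]])
b = sparse.csr_matrix([[0, 1], [1, 1]])
print(csr_equal(a, a), csr_equal(a, b))   # True False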

SciPy and Numpy Hybrid Method
What worked best for my case was (using a generic code example):
bool_answer = np.array_equal(sparse_matrix_1.todense(), sparse_matrix_2.todense())
You might need to pay attention to the equal_nan parameter in np.array_equal
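For instance, a small sketch with made-up matrices (the equal_nan keyword needs a reasonably recent NumPy, around 1.19+):
import numpy as np
from scipy import sparse

m1 = sparse.csr_matrix([[0.0, 1.5], [np.nan, 0.0]])
m2 = sparse.csr_matrix([[0.0, 1.5], [np.nan, 0.0]])

print(np.array_equal(m1.todense(), m2.todense()))                  # False: NaN != NaN
print(np.array_equal(m1.todense(), m2.todense(), equal_nan=True))  # True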
The following doc references helped me get there:
CSR Sparse Matrix Methods
CSC Sparse Matrix Methods
NumPy array_equal method
SciPy todense method

Related

Slicing a 2D numpy array using vectors for start-stop indices

First post here, so please go easy on me. :)
I want to vectorize the following:
import numpy as np

N = 1000                                  # dummy number of windows
rowStart = np.random.randint(0, 512 - 4, N)   # array of length N (dummy start rows)
rowStop = rowStart + 4
colStart = np.random.randint(0, 512 - 4, N)   # array of length N (dummy start columns)
colStop = colStart + 4
x = np.random.rand(512, 512)              # dummy test array
output = np.zeros([N, 4, 4])
for i in range(N):
    output[i, :, :] = x[rowStart[i]:rowStop[i], colStart[i]:colStop[i]]
What I'd like to be able to do is something like:
output=x[rowStart:rowStop, colStart:colStop ]
where numpy recognizes that the slicing indices are vectors and broadcasts the slicing. I understand that this probably doesn't work because while I know that my slice output is always the same size, numpy doesn't.
I've looked at various approaches, including "fancy" or "advanced" indexing (which seems to work for indexing, not slicing), massive boolean indexing using meshgrids (not practical from a memory standpoint, as my N can get to 50k-100k), and np.take, which just seems to be another way of doing fancy/advanced indexing.
I could see how I could potentially use fancy/advanced indexing if I could get an array that looks like:
[np.arange(rowStart[0], rowStop[0]),
 np.arange(rowStart[1], rowStop[1]),
 ...,
 np.arange(rowStart[N-1], rowStop[N-1])]
and a similar one for columns, but I'm also having trouble figuring out a vectorized approach for creating that.
I'd appreciate any advice you can provide.
Thanks!
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get all the sliding windows and then pick out the ones we need. More info on the use of as_strided-based view_as_windows.
from skimage.util.shape import view_as_windows
BSZ = (4, 4) # block size
w = view_as_windows(x, BSZ)      # all 4x4 sliding windows of x, shape (509, 509, 4, 4)
out = w[rowStart, colStart]      # pick the N windows we want, shape (N, 4, 4)
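If you'd rather not add the scikit-image dependency, here is a hedged sketch of the advanced-indexing route the question hints at: build the (N, 4) row and column index arrays with np.arange and let broadcasting produce the (N, 4, 4) result (sizes are made up for illustration):
import numpy as np

N = 1000
x = np.random.rand(512, 512)
rowStart = np.random.randint(0, 512 - 4, N)
colStart = np.random.randint(0, 512 - 4, N)

rows = rowStart[:, None] + np.arange(4)          # shape (N, 4)
cols = colStart[:, None] + np.arange(4)          # shape (N, 4)
output = x[rows[:, :, None], cols[:, None, :]]   # broadcasts to (N, 4, 4)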

TypeError from hstack on sparse matrices

I have two csr sparse matrices. One contains the transform from a sklearn.feature_extraction.text.TfidfVectorizer and the other converted from a numpy array. I am trying to do a scipy.sparse.hstack on the two to increase my feature matrix but I always get the error:
TypeError: 'coo_matrix' object is not subscriptable
Below is the code:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
import scipy.sparse

vectorizer = TfidfVectorizer(analyzer="char", lowercase=True, ngram_range=(1, 2), strip_accents="unicode")
ngram_features = vectorizer.fit_transform(df["strings"].values.astype(str))
list_other_features = ["entropy", "string_length"]
other_features = csr_matrix(df[list_other_features].values)
joined_features = scipy.sparse.hstack((ngram_features, other_features))
Both feature matrices are scipy.sparse.csr_matrix objects and I have also tried not converting other_features, leaving it as a numpy.array, but it results in the same error.
Python package versions:
numpy == 1.13.3
pandas == 0.22.0
scipy == 1.1.0
I cannot understand why it is talking about a coo_matrix object in this case, especially when I have converted both matrices to csr_matrix. Looking at the scipy code I understand it will not do any conversion if the input matrices are csr_matrix objects.
scipy.sparse.hstack calls bmat in its source code, and bmat converts the blocks into coo_matrix unless one of its fast-path cases applies.
Diagnosis
Looking at the scipy code I understand it will not do any conversion
if the input matrices are csr_matrix objects.
In bmat's source code, there are actually more conditions than just both matrices being csr_matrix that must hold before the blocks avoid being turned into coo_matrix objects. Looking at the source code, one of the following two conditions needs to be met
# check for fast path cases
if (N == 1 and format in (None, 'csr') and all(isinstance(b, csr_matrix)
for b in blocks.flat)):
...
elif (M == 1 and format in (None, 'csc')
and all(isinstance(b, csc_matrix) for b in blocks.flat)):
...
before line 573, A = coo_matrix(blocks[i,j]), is reached.
Suggestion
To resolve the issue, I would suggest you make one more check to see whether you meet the fast-path case for either csr_matrix or csc_matrix (the two conditions listed above). Please see the whole source code of bmat to gain a better understanding. If you do not meet either condition, the matrices will be converted into coo_matrix.
It's a little unclear whether this error occurs in the hstack itself or afterwards, when you use the result.
If it's in the hstack you need to provide a traceback so we can see what's going on.
hstack, using bmat, normally collects the coo attributes of all the inputs and combines them to make a new coo matrix. So regardless of the inputs (except for the special cases), the result will be coo. But hstack also accepts a format parameter.
Or you can add a .tocsr(). There's no extra cost if the matrix is already csr.
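A minimal sketch of both options, with made-up matrices:
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
b = sparse.csr_matrix(np.arange(6.).reshape(3, 2))

joined = sparse.hstack((a, b), format='csr')   # ask hstack for CSR directly
# or take the default (coo) result and convert; tocsr() is free if already CSR
joined = sparse.hstack((a, b)).tocsr()
print(type(joined), joined.shape)              # csr_matrix, (3, 5)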

numpy/pandas NaN difference confusion

I happened onto this when trying to find the means/sums of non-nan elements in rows of a pandas dataframe. It seems that
df.apply(np.mean, axis=1)
works fine.
However, applying np.mean to a numpy array containing nans returns a nan.
Is this all speced out somewhere? I would not want to get burned down the road...
numpy's mean function first checks whether its input has a mean method, as #EdChum explains in this answer.
When you use df.apply, the input passed to the function is a pandas.Series. Since pandas.Series has a mean method, numpy uses that instead of using its own function. And by default, pandas.Series.mean ignores NaN.
You can access the underlying numpy array by the values attribute and pass that to the function:
df.apply(lambda x: np.mean(x.values), axis=1)
This will use numpy's version.
Divakar has correctly suggested using np.nanmean
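A short demonstration of the three behaviours, with made-up data:
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, np.nan])
print(np.mean(arr))             # nan  -- NumPy propagates NaN
print(np.nanmean(arr))          # 1.5  -- NaN-aware NumPy function
print(pd.Series(arr).mean())    # 1.5  -- pandas skips NaN by default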
If I may answer the question still standing, the semantics differ because Numpy supports masked arrays, while Pandas does not.

Efficiently unwrap in multiple dimensions with numpy

Let's assume I have an array of phases (from complex numbers)
A = np.angle(np.random.uniform(-1,1,[10,10,10]) + 1j*np.random.uniform(-1,1,[10,10,10]))
I would now like to unwrap this array in ALL dimensions. In the above 3D case I would do
A_unwrapped = np.unwrap(np.unwrap(np.unwrap(A,axis=0), axis=1),axis=2)
While this is still feasible in the 3D case, in case of higher dimensionality, this approach seems a little cumbersome to me. Is there a more efficient way to do this with numpy?
You could use np.apply_over_axes, which is supposed to apply a function over each dimension of an array in turn. One caveat: apply_over_axes calls the function as func(a, axis) with the axis as the second positional argument, while np.unwrap's second positional argument is discont, so the axis has to be forwarded as a keyword:
np.apply_over_axes(lambda a, axis: np.unwrap(a, axis=axis), A, np.arange(A.ndim))
I believe this should do it.
I'm not sure if there is a way to bypass performing the unwrap operation along each axis. Obviously if it acted on individual elements you could use vectorization, but that doesn't seem to be an option here. What you can do that will at least make the code cleaner is create a loop over the dimensions:
for dim in range(len(A.shape)):
    A = np.unwrap(A, axis=dim)
You could also repeatedly apply a function that takes the dimension on which to operate as a parameter:
reduce(lambda A, axis: np.unwrap(A, axis=axis), range(len(A.shape)), A)
Remember that in Python 3 reduce needs to be imported from functools.
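As a self-contained sketch, the loop and the reduce variant give the same result on the 3D array from the question:
import numpy as np
from functools import reduce   # Python 3

A = np.angle(np.random.uniform(-1, 1, [10, 10, 10])
             + 1j * np.random.uniform(-1, 1, [10, 10, 10]))

A_loop = A
for dim in range(A_loop.ndim):
    A_loop = np.unwrap(A_loop, axis=dim)   # returns a new array each time

A_reduce = reduce(lambda a, axis: np.unwrap(a, axis=axis), range(A.ndim), A)
print(np.allclose(A_loop, A_reduce))       # True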

Cannot cast array data from dtype('O') to dtype('float64')

I am using scipy's curve_fit to fit a function to some data, and receive the following error:
Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
which points me to this line in my code:
popt_r, pcov = curve_fit(
    self.rightFunc, np.array(wavelength)[beg:end][edgeIndex+30:],
    np.dstack(transmitted[:, :, c][edgeIndex+30:])[0][0],
    p0=[self.m_right, self.a_right])
rightFunc is defined as follows:
def rightFunc(self, x, m, const):
    return np.exp(-(m*x + const))
As I understand it, the 'O' type refers to a python object, but I can't see what is causing this error.
Any ideas for what I should investigate to get to the bottom of this?
Just in case it could help someone else, I used numpy.array(wavelength,dtype='float64') to force the conversion of objects in the list to numpy's float64. Works well for me.
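For illustration, a hedged sketch of what that forced conversion does; the Decimal here is a made-up stand-in for whatever Python object crept into the real wavelength list:
import numpy as np
from decimal import Decimal

wavelength = [Decimal('400'), 410.5, 421.0]   # mixed Python objects
print(np.array(wavelength).dtype)             # object
x = np.array(wavelength, dtype='float64')     # force the cast before calling curve_fit
print(x.dtype)                                # float64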
Typically these scipy functions require parameters like:
curvefit( function, initial_values, (aux_values,), ...)
where the tuple of aux_values is passed through to your function along with the current value of the main variable.
Is the dstack expression this aux_values, or a concatenation of several? It may need to be wrapped in a tuple:
(np.dstack(transmitted[:,:,c][edgeIndex+30:])[0][0],)
We may need to know exactly where this error arises, not just which line of your code does it. We need to know what value is being converted. Where is there an array with dtype object?
Just to clarify: I had the same problem and did not see that the right answers were already in the comments before solving it on my own, so I just repeat them here:
I have resolved the issue. I was passing an array with one element to the p0 list, rather than the element itself. Thank you for your help – Jacobadtr Sep 12 at 17:51
An O dtype often results when constructing an array from a list of sublists that differ in size. If np.array(...) can't make a clean n-d array of numbers, it resorts to making an array of objects. – hpaulj Sep 12 at 17:15
That is, make sure that the tuple of parameters you pass to curve_fit can be properly cast to a numpy array.
From here, apparently numpy struggles with index type. The proposed solution is:
One thing you can do is use np.intp as dtype whenever things have to do with indexing or are logically related to indexing/array sizes. This is the natural dtype for it and it will normally also be the fastest one.
Does this help?
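A tiny sketch of that suggestion:
import numpy as np

idx = np.array([0, 2, 5], dtype=np.intp)   # np.intp is the natural dtype for indexing
data = np.linspace(0.0, 9.0, 10)
print(data[idx])                           # [0. 2. 5.]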
