Elementwise comparison between two large vectors, high degree of sparsity - python

I need a function that behaves like numpy.where but doesn't run into memory issues caused by the dense representation of the Boolean array. The function should therefore be able to return an extremely sparse Boolean array.
While the example presented below works fine for small data sets/vectors, it is impossible to use numpy.where once my_sample is, for example, of shape (10,000,000, 1) and my_population is of shape (100,000, 1). From what I have read in other threads, evaluating the expression numpy.where(my_sample == my_population.T) creates a dense Boolean array of shape (10,000,000, 100,000), and this dense array cannot fit into memory on my machine (or most machines).
The resulting array is extremely sparse. In my case, I know it will have at most two 1s per row! Using the specifications from above, the sparsity equals 0.002%. This should definitely fit into memory.
I am trying to create something similar to a model/design matrix for a numerical simulation. The resulting matrix will be used for some linear algebra operations.
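For a rough sense of scale, a back-of-the-envelope sketch (my own estimate; the exact sparse overhead depends on the index dtypes scipy picks) comparing the dense Boolean mask with a CSR matrix holding at most two entries per row:
n_sample, n_population, nnz_per_row = 10_000_000, 100_000, 2
dense_bytes = n_sample * n_population * 1           # dense bool mask, 1 byte per entry
sparse_bytes = (n_sample * nnz_per_row * (1 + 4)    # CSR data (int8) + column indices (int32)
                + (n_sample + 1) * 8)                # CSR indptr (int64)
print(dense_bytes / 1e9)    # ~1000 GB for the dense mask
print(sparse_bytes / 1e9)   # ~0.2 GB for the sparse representation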
Minimal working example: Please note that the positions/coordinates in the vectors are of importance.
# import packages
import numpy as np
# my_sample is the vector of observations
my_sample = ['a', 'b', 'c', 'a']
# my_population is the lookup vector
my_population = ['a', 'b', 'c']
# initialise the matrix (dense matrix for this example)
my_zero = np.zeros((len(my_sample), len(my_population)))
# reshape to arrays
my_sample = np.array(my_sample).reshape(-1, 1)
my_population = np.array((my_population)).reshape(-1, 1)
# THIS STEP CAUSES THE MEMORY ISSUES
my_indices = np.where((my_sample == my_population.T))
# set the matches to equal one
my_zero[my_indices] = 1
# show matrix
my_zero
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]])

First, let's encode this as integers, not strings. Strings suck.
pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)
It's important that pop_levels and sample_levels be identical, but if they are you're pretty much done - pack these into sparse masks:
import scipy.sparse as sps
sample_mask = sps.csr_matrix((np.ones_like(sample_idx), sample_idx, range(len(sample_idx) + 1)))
And we're done:
>>> sample_mask.A
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
You may need to reorder your factor levels so that they're the same between your sample and population, but as long as you can unify those labels this is very simple to do with just matrix assignment.
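If the factor levels don't line up, here is a small sketch (my addition, assuming every sample value actually occurs in my_population) of mapping the sample values directly onto population column indices with searchsorted and packing the mask from that:
import numpy as np
import scipy.sparse as sps

my_sample = np.array(['a', 'b', 'c', 'a'])
my_population = np.array(['a', 'b', 'c'])

order = np.argsort(my_population)                        # sort the lookup vector once
pos = np.searchsorted(my_population[order], my_sample)   # positions in the sorted order
col_idx = order[pos]                                      # back to the original population columns

sample_mask = sps.csr_matrix(
    (np.ones(len(my_sample), dtype=np.int8), col_idx, np.arange(len(my_sample) + 1)),
    shape=(len(my_sample), len(my_population)))
# sample_mask.A reproduces the my_zero matrix from the question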

A more direct route:
In [125]: my_sample = ['a', 'b', 'c', 'a']
...: my_population = ['a', 'b', 'c']
...:
...:
In [126]: np.array(my_sample)[:,None]==np.array(my_population)
Out[126]:
array([[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False]])
This is a boolean dtype. If you want 0/1 integer matrix:
In [128]: (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
Out[128]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
If you have space to make my_zero, you should have space to make this array. If there is still a problem with a large temporary buffer, you could try converting to 'uint8', which takes up less space.
In your version you make two large arrays, my_zero and my_sample == my_population.T. But beware that even if you make it past this step, you may not have space to do anything else with my_zero.
Creating a sparse matrix may save you space, but the sparsity has to be quite high to maintain any sort of speed. Though matrix multiplication is a relatively strong area for scipy.sparse matrices.
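As an illustration of that point, a hedged sketch (with a made-up per-population value vector): multiplying the (n_sample, n_population) indicator by a vector of population-level values gathers each sample's value without ever densifying the mask.
import numpy as np
import scipy.sparse as sparse

sample_mask = sparse.csr_matrix(np.array([[1, 0, 0],
                                          [0, 1, 0],
                                          [0, 0, 1],
                                          [1, 0, 0]]))
population_values = np.array([10.0, 20.0, 30.0])   # hypothetical value per population entry
print(sample_mask @ population_values)             # [10. 20. 30. 10.]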
time tests
In [134]: %%timeit
...: pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
...: sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)
...: sample_mask = sparse.csr_matrix((np.ones_like(sample_idx), sample_idx, range(len(sample_idx) + 1)))
247 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [135]: timeit (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
9.61 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And make a sparse matrix from the dense:
In [136]: timeit sparse.csr_matrix((np.array(my_sample)[:,None]==np.array(my_population)).astype(int))
332 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scaling for large arrays could be quite different.

Related

get variance of matrix without zero values numpy

How can I compute variance without zero elements?
For example
np.var([[1, 1], [1, 2]], axis=1) -> [0, 0.25]
I need:
var([[1, 1, 0], [1, 2, 0]], axis=1) -> [0, 0.25]
Is this what you are looking for? You can filter out the columns where all values are 0 (keeping only the columns where at least one value is not 0).
m = np.array([[1, 1, 0], [1, 2, 0]])
np.var(m[:, np.any(m != 0, axis=0)], axis=1)
# Output
array([0. , 0.25])
V1
You can use a masked array:
data = np.array([[1, 1, 0], [1, 2, 0]])
np.ma.array(data, mask=(data == 0)).var(axis=1)
The result is
masked_array(data=[0. , 0.25],
mask=False,
fill_value=1e+20)
The raw numpy array is the data attribute of the resulting masked array:
>>> np.ma.array(data, mask=(data == 0)).var(axis=1).data
array([0. , 0.25])
V2
Without masked arrays, the operation of removing a variable number of elements in each row is a bit tricky. It would be simpler to implement the variance in terms of the formula sum(x**2) / N - (sum(x) / N)**2 and partial reduction of ufuncs.
First we need to find the split indices and segment lengths. In the general case, that looks like
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
Now you can operate on the raveled masked data:
mdata = data[data != 0]
mdata2 = mdata**2
var = np.add.reduceat(mdata2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
This gives you the same result for var (probably more efficiently than the masked version by the way):
array([0. , 0.25])
V3
The var function appears to use the more traditional formula ((x - x.mean())**2).mean(). You can implement that using the quantities above with just a bit more work:
means = (np.add.reduceat(mdata, inds) / lens).repeat(lens)
var = np.add.reduceat((mdata - means)**2, inds) / lens
Comparison
Here is a quick benchmark for the two approaches:
def nzvar_v1(data):
    return np.ma.array(data, mask=(data == 0)).var(axis=1).data

def nzvar_v2(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat(mdata**2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2

def nzvar_v3(data):
    lens = np.count_nonzero(data, axis=1)
    inds = np.r_[0, lens[:-1].cumsum()]
    mdata = data[data != 0]
    return np.add.reduceat((mdata - (np.add.reduceat(mdata, inds) / lens).repeat(lens))**2, inds) / lens
np.random.seed(100)
data = np.random.randint(10, size=(1000, 1000))
%timeit nzvar_v1(data)
18.3 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v2(data)
5.89 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v3(data)
11.8 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So for a large dataset, the second approach, while requiring a bit more code, appears to be ~3x faster than masked arrays and ~2x faster than using the traditional formulation.
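For completeness, a quick sanity check (a sketch reusing the functions defined above) that all three variants agree on the small example from the question:
data = np.array([[1, 1, 0], [1, 2, 0]])
print(nzvar_v1(data))   # [0.   0.25]
print(nzvar_v2(data))   # [0.   0.25]
print(nzvar_v3(data))   # [0.   0.25]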

Numpy - How to vectorize on Sub-Arrays

How do you apply vectorized functions on sub-arrays? Suppose I have the following:
array = np.array([
[0, 1, 2],
[2],
[],
])
And I wanted to obtain the first element in each subarray, else None.
[0, 2, None]
While simple, is there a way to do this leveraging Numpy's pure vectorization? There don't seem to be native operations for this, and the np.vectorize() function is described in the documentation (and in various other threads) as not being true vectorization.
Is my only option to use np.apply_along_axis()?
When do I know when I cannot solve my problem with numpy's pure vectorization?
You've created an object dtype array - containing lists (not subarrays):
In [2]: array = np.array([
...: [0, 1, 2],
...: [2],
...: [],
...: ])
/usr/local/bin/ipython3:4: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (1.19dev gives warning)
In [3]: array
Out[3]: array([list([0, 1, 2]), list([2]), list([])], dtype=object)
We could use a list comprehension:
In [4]: [a[0] for a in array]
....
IndexError: list index out of range
and correcting for the empty list:
In [5]: [a[0] if a else None for a in array]
Out[5]: [0, 2, None]
Most of the fast compiled code for numpy - the "vectorized" stuff - only works with numeric dtype arrays. For object dtype it has to do something akin to a list comprehension. Even when math works, it's because it was able to delegate the action to the elements.
For example applying list replication to all elements of your array:
In [7]: array*3
Out[7]:
array([list([0, 1, 2, 0, 1, 2, 0, 1, 2]), list([2, 2, 2]), list([])],
dtype=object)
and sum is just list join:
In [8]: array.sum()
Out[8]: [0, 1, 2, 2]
apply_along_axis isn't any faster than np.vectorize, and I can't imagine how it would be used in a case like this; array is 1d.
Sometimes frompyfunc is handy when working with object dtype arrays (but it's not a speed solution):
In [11]: timeit np.frompyfunc(lambda a: a[0] if a else None, 1,1)(array)
3.8 µs ± 9.85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: timeit [a[0] if a else None for a in array]
1.02 µs ± 5.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [14]: timeit np.vectorize(lambda a: a[0] if a else None, otypes=['O'])(array)
18 µs ± 46.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Intersection between two multi-dimensional arrays with tolerance - NumPy / Python

I am stuck on a problem. I have two 2-D numpy arrays filled with x and y coordinates. Those arrays might look like:
array1([[(1.22, 5.64)],
        [(2.31, 7.63)],
        [(4.94, 4.15)]])
array2([[(1.23, 5.63)],
        [(6.31, 10.63)],
        [(2.32, 7.65)]])
Now I have to find "duplicate nodes". However, I also have to consider nodes as equal within a given tolerance of the coordinates, so I can't use solutions like this. Since my arrays are quite big (~200,000 lines each), two simple for loops are not an option either. My final output should look like this:
output([[(1.23, 5.63)],
        [(2.32, 7.65)]])
I would appreciate some hints.
Cheers,
In order to compare two nodes with a given tolerance I recommend using numpy.isclose(), where you can set a relative and absolute tolerance.
numpy.isclose(1.24, 1.25, atol=1e-1)
# [True]
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# [True, True]
Instead of using two for loops, you can use itertools.product() to go through all pairs. The following code does what you want:
import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],
                   [2.31, 7.63],
                   [4.94, 4.15]])
array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])
output = np.empty((0, 2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
        output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23 5.63]
# [ 2.32 7.64]]
Defining an isclose function similar to numpy.isclose, but a bit faster (mostly due to not checking any input and not supporting both relative and absolute tolerances):
import numpy as np
array1 = np.array([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]])
def isclose(x, y, atol):
    return np.abs(x - y) < atol
Now comes the hard part. We need to calculate whether any two values are close within the innermost dimension. For this I reshape the arrays so that the first array has its values along the second dimension (replicated across the first), and the second array has its values along the first dimension (replicated along the second) - note the 1, 3 and 3, 1:
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]:
array([[[ True, True],
[False, False],
[False, False]],
[[False, False],
[False, False],
[False, False]],
[[False, False],
[ True, True],
[False, False]]], dtype=bool)
Now we want all entries where the value is close to any other value (along the same dimension):
In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]:
array([[ True, True],
[ True, True],
[False, False]], dtype=bool)
Then we want only those where both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True, True, False], dtype=bool)
And finally, we can use this to index array1:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]:
array([[[ 1.22, 5.64]],
[[ 2.31, 7.63]]])
If you want to, you can swap the any and all calls. One might be faster than the other in your case.
The 3 in the reshape calls needs to be replaced with the actual length of your data.
This algorithm will have the same bad runtime as the other answer using itertools.product, but at least the actual looping is done implicitly by numpy and is implemented in C. This is visible in the timings:
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Where the pares function is the code defined by @Ferran Parés in another answer, and the arrays are the reshaped versions used there.
And for larger arrays it becomes more obvious:
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)
In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes it almost to 100%. It also takes about 12s:
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are many possible ways to define that tolerance. Since we are talking about XY coordinates, most probably we are talking about Euclidean distances to set that tolerance value. So, we can use the Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both memory-wise and performance-wise. The implementation would look something like this -
from scipy.spatial import cKDTree

# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0]  # k=1 selects the single closest neighbor
    # Check the distances against the given tolerance value and
    # thus filter out rows of b for the final output
    return b[dist <= tol]
Sample step-by-step run -
# Input 2D arrays
In [68]: a
Out[68]:
array([[1.22, 5.64],
[2.31, 7.63],
[4.94, 4.15]])
In [69]: b
Out[69]:
array([[ 1.23, 5.63],
[ 6.31, 10.63],
[ 2.32, 7.65]])
# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]
In [71]: dist
Out[71]: array([0.01414214, 5. , 0.02236068])
# Mask of distances within the given tolerance
In [72]: tol = 1
In [73]: dist <= tol
Out[73]: array([ True, False, True])
# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]:
array([[1.23, 5.63],
[2.32, 7.65]])
Timings on 200,000 pts -
In [20]: N = 200000
...: np.random.seed(0)
...: a = np.random.rand(N,2)
...: b = np.random.rand(N,2)
In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop
As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.
And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.
But you might also want to keep in mind what intersect1d does:
if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]
unique has been enhanced to handle rows (axis parameter), but intersect has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.
Notice that intersect concatenates the unique arrays, sorts, and again finds the duplicates.
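To make the scale-and-round idea concrete, here is a rough sketch (my addition) using np.unique with axis=0 on 2-column data. Like the complex trick below, points that straddle a grid-cell boundary can be missed, and it assumes rows are unique within each array after scaling:
import numpy as np

def intersect_rows_scaled(a, b, scale=10):
    # Scale and truncate so nearby coordinates land in the same grid cell,
    # then intersect exact rows by counting duplicates across both arrays.
    a_i = np.unique((a * scale).astype(int), axis=0)
    b_i = np.unique((b * scale).astype(int), axis=0)
    both = np.concatenate((a_i, b_i), axis=0)
    uniq, counts = np.unique(both, axis=0, return_counts=True)
    return uniq[counts > 1] / scale

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
print(intersect_rows_scaled(a, b))   # [[1.2 5.6]
                                     #  [2.3 7.6]]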
I know you didn't want a loop version, but to promote conceptualization of the problem here's one anyways:
In [581]: a = np.array([(1.22, 5.64),
...: (2.31, 7.63),
...: (4.94, 4.15)])
...:
...: b = np.array([(1.23, 5.63),
...: (6.31, 10.63),
...: (2.32, 7.65)])
...:
I removed a layer of nesting in your arrays.
In [582]: c = []
In [583]: for a1 in a:
...:     for b1 in b:
...:         if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))
or as list comprehension
In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]:
[(array([1.22, 5.64]), array([1.23, 5.63])),
(array([2.31, 7.63]), array([2.32, 7.65]))]
complex approximation
In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]:
array([[12, 56],
[23, 76],
[49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
intersect inspired
Concatenate the arrays, sort them, take difference, and find the small differences:
In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]:
array([[ 4.94, 4.15],
[ 1.23, 5.63],
[ 1.22, 5.64],
[ 2.31, 7.63],
[ 2.32, 7.65],
[ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]:
array([[-3.71, 1.48],
[-0.01, 0.01],
[ 1.09, 1.99],
[ 0.01, 0.02],
[ 3.99, 2.98]])
In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1) # refine with abs
Out[623]: array([False, True, False, True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab[_]
Out[627]:
array([[2.31, 7.63],
[1.23, 5.63]])
Maybe you could try this using pure NumPy and a self-defined function:
import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
print(xDA)
print(yDA)
#Match x to y
def np_matrix(myx, myy, calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)
    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches
alwd_dist=0.1
xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)
shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items)) == 0):
    print("No Matched Items based on given allowed distance:", alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]], yDA[ke[1]])

Find all-zero columns in pandas sparse matrix

For example I have a coo_matrix A :
import scipy.sparse as sp
A = sp.coo_matrix([[3, 0, 3, 0],
                   [0, 0, 2, 0],
                   [2, 5, 1, 0],
                   [0, 0, 0, 0]])
How can I get the result [0,0,0,1], which indicates that the first 3 columns contain non-zero values and only the 4th column is all zeros?
PS: I cannot convert A to another type.
PS2: I tried using np.nonzero, but my implementation was not very elegant.
Approach #1 We could do something like this -
# Get the columns indices of the input sparse matrix
C = sp.find(A)[1]
# Use np.in1d to create a mask of non-zero columns.
# So, we invert it and convert to int dtype for desired output.
out = (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Alternatively, to make the code shorter, we can use subtraction -
out = 1-np.in1d(np.arange(A.shape[1]),C)
Step-by-step run -
1) Input array and sparse matrix from it :
In [137]: arr # Regular dense array
Out[137]:
array([[3, 0, 3, 0],
[0, 0, 2, 0],
[2, 5, 1, 0],
[0, 0, 0, 0]])
In [138]: A = sp.coo_matrix(arr) # Convert to sparse matrix as input here on
2) Get non-zero column indices :
In [139]: C = sp.find(A)[1]
In [140]: C
Out[140]: array([0, 2, 2, 0, 1, 2], dtype=int32)
3) Use np.in1d to get mask of non-zero columns :
In [141]: np.in1d(np.arange(A.shape[1]),C)
Out[141]: array([ True, True, True, False], dtype=bool)
4) Invert it :
In [142]: ~np.in1d(np.arange(A.shape[1]),C)
Out[142]: array([False, False, False, True], dtype=bool)
5) Finally convert to int dtype :
In [143]: (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Out[143]: array([0, 0, 0, 1])
Alternative subtraction approach :
In [145]: 1-np.in1d(np.arange(A.shape[1]),C)
Out[145]: array([0, 0, 0, 1])
Approach #2 Here's another way and possibly a faster one using matrix-multiplication -
out = 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
Runtime test
Let's test out all the posted approaches on a big and really sparse matrix -
In [29]: A = sp.coo_matrix((np.random.rand(4000,4000)>0.998).astype(int))
In [30]: %timeit 1-np.in1d(np.arange(A.shape[1]),sp.find(A)[1])
100 loops, best of 3: 4.12 ms per loop # Approach1
In [31]: %timeit 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
1000 loops, best of 3: 771 µs per loop # Approach2
In [32]: %timeit 1 - (A.col==np.arange(A.shape[1])[:,None]).any(axis=1)
1 loops, best of 3: 236 ms per loop # #hpaulj's soln
In [33]: %timeit (A!=0).sum(axis=0)==0
1000 loops, best of 3: 1.03 ms per loop # #jez's soln
In [34]: %timeit (np.sum(np.absolute(A.toarray()), 0) == 0) * 1
10 loops, best of 3: 86.4 ms per loop # #wwii's soln
The actual logical operation can be performed like this:
b = (A!=0).sum(axis=0)==0
# matrix([[False, False, False, True]], dtype=bool)
Now, to ensure that I'm answering your question exactly, I'd better tell you how you could convert from booleans to integers (although really, for most applications I can think of, you can do a lot more in numpy and friends if you stick with an array of bools):
b = b.astype(int)
#matrix([[0, 0, 0, 1]])
Either way, to then convert from a matrix to a list, you could do this:
c = list(b.flat)
# [0, 0, 0, 1]
...although again, I'm not sure this is the best thing to do: for most applications I can imagine, I would perhaps just convert to a one-dimensional numpy.array with c = b.A.flatten() instead.
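Putting that answer together into one runnable snippet (a sketch, using the example matrix from the question; the comparison is done on a CSR copy since that is where != is defined):
import numpy as np
import scipy.sparse as sp

A = sp.coo_matrix(np.array([[3, 0, 3, 0],
                            [0, 0, 2, 0],
                            [2, 5, 1, 0],
                            [0, 0, 0, 0]]))

b = (A.tocsr() != 0).sum(axis=0) == 0   # boolean matrix of shape (1, 4)
print(b.astype(int))                    # [[0 0 0 1]]
print(list(b.astype(int).flat))         # [0, 0, 0, 1]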
A recent related question, scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize, is similar, except it wants to fill those columns with 1s and normalize them.
I immediately suggested the lil format of the transpose. All-0 columns will be empty lists in this format. But sticking with the coo format I suggested
np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
or for this 1/0 format
1 - (Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1)
which is functionally the same as:
1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
sparse.find converts the matrix to csr to sum duplicates and eliminate explicit zeros, and then back to coo to get the data, row, and col attributes (which it returns).
Mo.nonzero uses A.data != 0 to eliminate 0s before returning the col and row attributes.
The np.ones(A.shape[0],dtype=bool)*A.astype(bool) solution requires converting A to csr format for multiplication.
(A!=0).sum(axis=0) also converts to csr because column (or row) sum is done with a matrix multiplication.
So the no-convert requirement is unrealistic, at least within the bounds of sparse formats.
===============
For Divakar's test case my == version is quite slow; it's OK with small matrices, but creates too large a test array with 1000 columns.
Testing on a matrix that is sparse enough to have a number of 0 columns:
In [183]: Arr=sparse.random(1000,1000,.001)
In [184]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).any()
Out[184]: True
In [185]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).sum()
Out[185]: 367
In [186]: timeit 1-np.ones(Arr.shape[0],dtype=bool)*Arr.astype(bool)
1000 loops, best of 3: 334 µs per loop
In [187]: timeit 1-np.in1d(np.arange(Arr.shape[1]),Arr.col)
1000 loops, best of 3: 323 µs per loop
In [188]: timeit 1-(Arr.col==np.arange(Arr.shape[1])[:,None]).any(axis=1)
100 loops, best of 3: 3.9 ms per loop
In [189]: timeit (Arr!=0).sum(axis=0)==0
1000 loops, best of 3: 820 µs per loop
Convert to an array or dense matrix, sum the absolute value along the first axis, test the result against zero, convert to int
>>> import numpy as np
>>> (np.sum(np.absolute(a.toarray()), 0) == 0) * 1
array([0, 0, 0, 1])
>>> (np.sum(np.absolute(a.todense()), 0) == 0) * 1
matrix([[0, 0, 0, 1]])
>>>
>>> np.asarray((np.sum(np.absolute(a.todense()), 0) == 0), dtype = np.int32)
array([[0, 0, 0, 1]])
>>>
The first is the fastest - 24 µs for your example on my machine.
For a matrix made with np.random.randint(0,3,(1000,1000)), all are right around 25 ms on my machine.

How to convert the output of meshgrid to the corresponding array of points?

I want to create a list of points that would correspond to a grid. So if I want to create a grid of the region from (0, 0) to (1, 1), it would contain the points (0, 0), (0, 1), (1, 0) and (1, 1).
I know that that this can be done with the following code:
g = np.meshgrid([0,1],[0,1])
np.append(g[0].reshape(-1,1),g[1].reshape(-1,1),axis=1)
Yielding the result:
array([[0, 0],
[1, 0],
[0, 1],
[1, 1]])
My question is twofold:
Is there a better way of doing this?
Is there a way of generalizing this to higher dimensions?
I just noticed that the documentation in numpy provides an even faster way to do this:
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
This can easily be generalized to more dimensions using the linked meshgrid2 function and mapping 'ravel' to the resulting grid.
g = meshgrid2(x, y, z)
positions = np.vstack(map(np.ravel, g))
The result is about 35 times faster than the zip method for a 3D array with 1000 ticks on each axis.
Source: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde
To compare the two methods consider the following sections of code:
Create the proverbial tick marks that will help to create the grid.
In [23]: import numpy as np
In [34]: from numpy import asarray
In [35]: x = np.random.rand(100,1)
In [36]: y = np.random.rand(100,1)
In [37]: z = np.random.rand(100,1)
Define the function that mgilson linked to for the meshgrid:
In [38]: def meshgrid2(*arrs):
    ....:     arrs = tuple(reversed(arrs))
    ....:     lens = list(map(len, arrs))
    ....:     dim = len(arrs)
    ....:     sz = 1
    ....:     for s in lens:
    ....:         sz *= s
    ....:     ans = []
    ....:     for i, arr in enumerate(arrs):
    ....:         slc = [1]*dim
    ....:         slc[i] = lens[i]
    ....:         arr2 = asarray(arr).reshape(slc)
    ....:         for j, sz in enumerate(lens):
    ....:             if j != i:
    ....:                 arr2 = arr2.repeat(sz, axis=j)
    ....:         ans.append(arr2)
    ....:     return tuple(ans)
Create the grid and time the two functions.
In [39]: g = meshgrid2(x, y, z)
In [40]: %timeit pos = np.vstack(map(np.ravel, g)).T
100 loops, best of 3: 7.26 ms per loop
In [41]: %timeit zip(*(x.flat for x in g))
1 loops, best of 3: 264 ms per loop
Are your gridpoints always integral? If so, you could use numpy.ndindex
print list(np.ndindex(2,2))
Higher dimensions:
print list(np.ndindex(2,2,2))
Unfortunately, this does not meet the requirements of the OP since the integral assumption (starting with 0) is not met. I'll leave this answer only in case someone else is looking for the same thing where those assumptions are true.
Another way to do this relies on zip:
g = np.meshgrid([0,1],[0,1])
zip(*(x.flat for x in g))
This portion scales nicely to arbitrary dimensions. Unfortunately, np.meshgrid doesn't scale well to multiple dimensions, so that part will need to be worked out, or (assuming it works), you could use this SO answer to create your own ndmeshgrid function.
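As a side note (my addition): current numpy.meshgrid accepts any number of 1-D inputs, so the generalization can also be written directly, for example:
import numpy as np

def grid_points(*axes):
    # Build the meshgrid for any number of axes and stack the flattened
    # coordinates into an (n_points, n_dims) array.
    mesh = np.meshgrid(*axes, indexing='ij')
    return np.stack([m.ravel() for m in mesh], axis=-1)

print(grid_points([0, 1], [0, 1]))
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]]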
Yet another way to do it is:
np.indices((2,2)).T.reshape(-1,2)
Which can be generalized to higher dimensions, e.g.:
In [60]: np.indices((2,2,2)).T.reshape(-1,3)
Out[60]:
array([[0, 0, 0],
[1, 0, 0],
[0, 1, 0],
[1, 1, 0],
[0, 0, 1],
[1, 0, 1],
[0, 1, 1],
[1, 1, 1]])
To get the coordinates of a grid from 0 to 1, a reshape can do the work. Here are examples for 2D and 3D. Also works with floats.
grid_2D = np.mgrid[0:2:1, 0:2:1]
points_2D = grid_2D.reshape(2, -1).T
grid_3D = np.mgrid[0:2:1, 0:2:1, 0:2:1]
points_3D = grid_3D.reshape(3, -1).T
A simple example in 3D (can be extended to N-dimensions I guess, but beware of the final dimension and RAM usage):
import numpy as np
ndim = 3
xmin = 0.
ymin = 0.
zmin = 0.
length_x = 1000.
length_y = 1000.
length_z = 50.
step_x = 1.
step_y = 1.
step_z = 1.
x = np.arange(xmin, length_x, step_x)
y = np.arange(ymin, length_y, step_y)
z = np.arange(zmin, length_z, step_z)
%timeit xyz = np.array(np.meshgrid(x, y, z)).T.reshape(-1, ndim)
in: 2.76 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
which yields:
In [2]: xyz
Out[2]:
array([[ 0., 0., 0.],
[ 0., 1., 0.],
[ 0., 2., 0.],
...,
[999., 997., 49.],
[999., 998., 49.],
[999., 999., 49.]])
In [4]: xyz.shape
Out[4]: (50000000, 3)
Python 3.6.9
Numpy: 1.19.5
I am using the following to convert a meshgrid to an M x 2 array. Also, changing the list of vectors to iterators can make it really fast.
import numpy as np
# Without iterators
x_vecs = [np.linspace(0,1,1000), np.linspace(0,1,1000)]
%timeit np.reshape(np.meshgrid(*x_vecs),(2,-1)).T
6.85 ms ± 93.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# With iterators
x_vecs = iter([np.linspace(0,1,1000), np.linspace(0,1,1000)])
%timeit np.reshape(np.meshgrid(*x_vecs),(2,-1)).T
5.78 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for N-D array using generator
vec_dim = 3
res = 100
# Without iterators
x_vecs = [np.linspace(0,1,res) for i in range(vec_dim)]
>>> %timeit np.reshape(np.meshgrid(*x_vecs),(vec_dim,-1)).T
11 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# With iterators
x_vecs = (np.linspace(0,1,res) for i in range(vec_dim))
>>> %timeit np.reshape(np.meshgrid(*x_vecs),(vec_dim,-1)).T
5.54 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
