Iterate over an array and apply it to a function - python

In Python I want to iterate over every column of P, which is a (4,4) array, applying the function "q" to each column,
like:
P = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0.5, 0], [0, 0, 0.5, 0]])

def q(q_u):
    q = np.array(
        [
            [np.dot(0, 2, q_u)],
            [np.zeros((4, 1), dtype=int)],
            [np.zeros((2, 1), dtype=int)],
        ],
        dtype=object,
    )
    return q
np.apply_along_axis(q, axis=0, arr=P)
Applying the q function to the P array gives me a (3,1,4) array, which is correct. But how can I save the four (3,1) arrays in a dictionary and later call them, so I can pass each one to another function printR which needs a (3,1) array?
printR(60, res, q)
Should I add the four arrays to a dictionary in order to iterate with printR, or is there another method?

Use transpose and zip to create the dictionary.
To get 4 arrays of shape (1,3), simply pass them to dict:
arr = np.apply_along_axis(q, axis=0, arr=P)
d = dict(zip(range(arr.size), arr.T))
Out[259]:
{0: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 1: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 2: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 3: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object)}
In [260]: d[0].shape
Out[260]: (1, 3)
To get 4 arrays of shape (3,1), use a dict comprehension:
d = {k: v.T for k, v in zip(range(arr.size), arr.T)}
Out[269]:
{0: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 1: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 2: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 3: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object)}
In [270]: d[0].shape
Out[270]: (3, 1)
Note: I intentionally use arr.size so that zip trims the pairs based solely on the length of arr.T.
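From here, passing each (3,1) array to printR is just a loop over the dictionary. A minimal sketch, reusing the dict built above and assuming printR is the (undefined here) function from the question, with the printR(60, res, q) signature shown there:
arr = np.apply_along_axis(q, axis=0, arr=P)
d = {k: v.T for k, v in zip(range(arr.size), arr.T)}  # 4 arrays of shape (3,1)
for key, res in d.items():
    printR(60, res, q)  # res is a (3,1) object array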

Correcting the puzzling dot to
[np.dot(0.2, q_u)],
produces the result shown in your other question.
I still wonder why you insist on using apply_along_axis. It doesn't have any speed benefits. Compare these timings:
In [36]: timeit np.apply_along_axis(q, axis=0, arr=P)
141 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [37]: timeit np.stack([q(P[:,i]) for i in range(P.shape[1])], axis=2)
72.1 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: timeit [q(P[:,i]) for i in range(P.shape[1])]
53 µs ± 42.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That dot(0.2, q_u) line just does 0.2*q_u, which, applied to P as a whole, is 0.2*P or 0.2*P.T.
Let's change q to omit the size 1 dimensions, to make a more compact display:
In [49]: def q1(q_u):
...: q = np.array(
...: [
...: np.dot(0.2, q_u),
...: np.zeros((4,), dtype=int),
...: np.zeros((2,), dtype=int),
...: ],
...: dtype=object,
...: )
...: return q
...:
In [50]: np.apply_along_axis(q1, axis=0, arr=P)
Out[50]:
array([[array([0.2, 0. , 0. , 0. ]), array([0. , 0.2, 0. , 0. ]),
array([0. , 0. , 0.1, 0.1]), array([0.2, 0. , 0. , 0. ])],
[array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0]),
array([0, 0, 0, 0])],
[array([0, 0]), array([0, 0]), array([0, 0]), array([0, 0])]],
dtype=object)
In [51]: _.shape
Out[51]: (3, 4)
We can generate the same numbers, arranged slightly differently with:
In [52]: [0.2 * P.T, np.zeros((4,4),int), np.zeros((4,2),int)]
Out[52]:
[array([[0.2, 0. , 0. , 0. ],
[0. , 0.2, 0. , 0. ],
[0. , 0. , 0.1, 0.1],
[0.2, 0. , 0. , 0. ]]),
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]]),
array([[0, 0],
[0, 0],
[0, 0],
[0, 0]])]
You are making 3 2d arrays, each with one row per column of P.
The list comprehension that I timed in [38] produces 4 size (3,) arrays, that is one array per column of P. apply_along_axis obscures that, joining them on a last dimension (as my stack with axis=2 does).
In [53]: [q1(P[:,i]) for i in range(P.shape[1])]
Out[53]:
[array([array([0.2, 0. , 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0. , 0.2, 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0. , 0. , 0.1, 0.1]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0.2, 0. , 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object)]
The list comprehension is not only fast, but it also keeps the q output 'intact', making it easier to pass on to another function.
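If a dictionary keyed by the column index is still wanted, it can be built straight from that comprehension; a minimal sketch using the question's q and P:
d = {i: q(P[:, i]) for i in range(P.shape[1])}  # each d[i] is the (3,1) result for column i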

Related

Numpy: Efficiently create this Matrix (N,3) base values of another list and repeating them

How can I create the matrix
[[a, 0, 0],
[0, a, 0],
[0, 0, a],
[b, 0, 0],
[0, b, 0],
[0, 0, b],
...]
from the vector
[a, b, ...]
efficiently?
There must be a better solution than
np.squeeze(np.reshape(np.tile(np.eye(3), (len(foo), 1, 1)) * np.expand_dims(foo, (1, 2)), (1, -1, 3)))
right?
You can create a zero array in advance, and then quickly assign values by slicing:
def concated_diagonal(ar, col):
    ar = np.asarray(ar).ravel()
    size = ar.size
    ret = np.zeros((col * size, col), ar.dtype)
    for i in range(col):
        ret[i::col, i] = ar
    return ret
Test:
>>> concated_diagonal([1, 2, 3], 3)
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[2, 0, 0],
[0, 2, 0],
[0, 0, 2],
[3, 0, 0],
[0, 3, 0],
[0, 0, 3]])
Note that because the number of columns you require is small, the impact of the relatively slow Python level for loop is acceptable:
%timeit concated_diagonal(np.arange(1_000_000), 3)
17.1 ms ± 84.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update:
A solution with better performance! This is done in one step by clever reshaping and slice assignment:
def concated_diagonal(ar, col):
    ar = np.asarray(ar).reshape(-1, 1)
    size = ar.size
    ret = np.zeros((col * size, col), ar.dtype)
    ret.reshape(size, -1)[:, ::col + 1] = ar
    return ret
Time comparison:
%timeit concated_diagonal(np.arange(1_000_000), 3)
10.7 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
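To see why the reshaped slice assignment hits the right cells: viewing each (col, col) block of the output as a single flat row of length col*col, the block's diagonal sits at offsets 0, col+1, 2*(col+1), .... A quick check that both versions agree (the function names below are mine, for illustration only):
import numpy as np

def concated_diagonal_loop(ar, col):
    ar = np.asarray(ar).ravel()
    ret = np.zeros((col * ar.size, col), ar.dtype)
    for i in range(col):
        ret[i::col, i] = ar
    return ret

def concated_diagonal_reshape(ar, col):
    ar = np.asarray(ar).reshape(-1, 1)
    ret = np.zeros((col * ar.size, col), ar.dtype)
    ret.reshape(ar.size, -1)[:, ::col + 1] = ar  # assign each block's diagonal
    return ret

print(np.array_equal(concated_diagonal_loop([1, 2, 3], 3),
                     concated_diagonal_reshape([1, 2, 3], 3)))  # True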
You can use numpy.tile, numpy.repeat, and numpy.eye.
rep = 3
lst = np.array([1,2,3,4])
res = np.tile(np.eye(rep), (len(lst),1))*np.repeat(lst, rep)[:,None]
print(res)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[2. 0. 0.]
[0. 2. 0.]
[0. 0. 2.]
[3. 0. 0.]
[0. 3. 0.]
[0. 0. 3.]
[4. 0. 0.]
[0. 4. 0.]
[0. 0. 4.]]
Explanation:
>>> np.tile(np.eye(3), (2,1))
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> np.repeat([3,4], 3)[:,None]
array([[3],
[3],
[3],
[4],
[4],
[4]])
>>> np.tile(np.eye(3), (2,1)) * np.repeat([3,4], 3)[:,None]
array([[3., 0., 0.],
[0., 3., 0.],
[0., 0., 3.],
[4., 0., 0.],
[0., 4., 0.],
[0., 0., 4.]])
Benchmark on Colab (because you want an efficient approach)
The variable is len(arr); eye(3) is fixed.
Code of benchmark:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import time

bench = []
for num in np.power(np.arange(10,1500,5),2):
    arr = np.arange(num)

    start = time.time()
    col = 3
    size = arr.size
    ret1 = np.zeros((col * size, col), arr.dtype)
    for i in range(col):
        ret1[i::col, i] = arr
    bench.append({'len_arr':num, 'Method':'Mechanic_Pig', 'Time':time.time() - start})

    start = time.time()
    N = 3
    M = N*len(arr)
    ret2 = np.zeros((M, N), dtype=int)
    idx = np.arange(M)
    ret2[idx, idx%N] = np.repeat(arr, N)
    bench.append({'len_arr':num, 'Method':'mozway', 'Time':time.time() - start})

    start = time.time()
    ret3 = np.tile(np.eye(3), (len(arr),1))*np.repeat(arr, 3)[:,None]
    bench.append({'len_arr':num, 'Method':'Imahdi', 'Time':time.time() - start})

    start = time.time()
    ret4 = np.einsum('j,ik->jki', arr, np.eye(3)).reshape(-1, 3)
    bench.append({'len_arr':num, 'Method':'Michael_Szczesn', 'Time':time.time() - start})

plt.subplots(1,1, figsize=(10,7))
df = pd.DataFrame(bench)
sns.lineplot(data=df, x="len_arr", y="Time", hue="Method", style="Method")
plt.show()
# Check whether the results of the different approaches are equal
print(((ret1 == ret2).all() == (ret1 == ret3).all() == (ret1 == ret4).all() == (ret2 == ret3).all() == (ret2 == ret4).all() == (ret3 == ret4).all()))
# True
Here is a solution by indexing:
a = [1,2,3]
N = 3
M = N*len(a)
out = np.zeros((M, N), dtype=int)
idx = np.arange(M)
out[idx, idx%N] = np.repeat(a, N)
output:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[2, 0, 0],
[0, 2, 0],
[0, 0, 2],
[3, 0, 0],
[0, 3, 0],
[0, 0, 3]])
intermediates:
idx
# array([0, 1, 2, 3, 4, 5, 6, 7, 8])
idx%N
# array([0, 1, 2, 0, 1, 2, 0, 1, 2])
np.repeat(a, N)
# array([1, 1, 1, 2, 2, 2, 3, 3, 3])
An almost one-line solution:
import numpy as np

def concated_diagonal(vals):
    length = len(vals)
    return np.vstack([np.diag(np.full(length, v)) for v in vals])

print(concated_diagonal([1, 2, 3]))
Output
[[1 0 0]
[0 1 0]
[0 0 1]
[2 0 0]
[0 2 0]
[0 0 2]
[3 0 0]
[0 3 0]
[0 0 3]]

Understanding r_ from numpy

import numpy as np

a = np.zeros([4, 4])
b = np.ones([4, 4])
# vertical stacking (row-wise)
print(np.r_[a,b])
print(np.r_[[1,2,3],0,0,[4,5,6]])
# output is
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
[1 2 3 0 0 4 5 6]
But in the second case np.r_ doesn't perform vertical stacking; it does horizontal stacking instead. How does np.r_ work? I would be grateful for any help.
In [324]: a = np.zeros([4, 4],int)
...: b = np.ones([4, 4],int)
In [325]: np.r_[a,b]
Out[325]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]])
This is a row stack; same as vstack. And since the arrays are already 2d, concatenate is enough:
In [326]: np.concatenate((a,b), axis=0)
Out[326]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]])
With the mix of 1d and scalars, r_ is the same as hstack:
In [327]: np.r_[[1,2,3],0,0,[4,5,6]]
Out[327]: array([1, 2, 3, 0, 0, 4, 5, 6])
In [328]: np.hstack([[1,2,3],0,0,[4,5,6]])
Out[328]: array([1, 2, 3, 0, 0, 4, 5, 6])
In [329]: np.concatenate([[1,2,3],0,0,[4,5,6]],axis=0)
...
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)
concatenate fails because of the scalars. The other methods first convert those to 1d arrays.
In both cases, r_ does what its documentation says:
Translates slice objects to concatenation along the first axis.
r_ is actually an instance of a special class, with its own __getitem__ method, that allows us to use [] instead of (). It also means it can take slices as inputs (which are actually rendered as np.arange or np.linspace).
r_ also takes an optional initial string argument; if it consists of up to 3 numbers, it controls the concatenation axis and how inputs are promoted to matching dimensions. See the docs, and the np.lib.index_tricks.py file, for details.
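A few short illustrations of the slice and string-argument behaviour (a sketch of documented r_ features, not part of the examples above):
np.r_[1:4]        # array([1, 2, 3])  -- slice rendered as arange
np.r_[0:1:5j]     # array([0., 0.25, 0.5, 0.75, 1.])  -- imaginary step rendered as linspace
np.r_['0,2', [1,2,3], [4,5,6]]
# array([[1, 2, 3],
#        [4, 5, 6]])  -- '0,2': concatenate on axis 0, force at least 2 dimensions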
In order of importance I think the concatenate functions are:
np.concatenate # base
np.vstack # easy join 1d arrays into 2d
np.stack # generalize np.array
np.hstack # saves specifying axis
np.r_
np.c_
r_ and c_ can do neat things when mixing arrays of different shapes, but it all boils down to using concatenate correctly.

Scipy: Calculation of standardized euclidean via cdist

The formula is available in the docs and pointed to in this answer. However, when I try to apply it I don't get a matching result. I'm sure there's some silly mistake I'm making somewhere, so thanks for bearing with me:
Setup
Say I have 2 matrices:
X: array([[0, 1, 0],
[1, 1, 1]])
X2: array([[1, 1, 0],
[1, 1, 1],
[1, 2, 0]])
Now applying Xans = scipy.spatial.distance.cdist(X, X2, 'seuclidean') gives:
Xans: array([[2.23606798, 2.88675135, 3.16227766],
[1.82574186, 0. , 2.88675135]])
Let's just focus on Xans[0][0] = 2.23606798, which should have been obtained by applying seuclidean(X[0], X2[0]).
Method 1: Using pdist
I tried doing this via pdist but get a NaN:
In [104]: scipy.spatial.distance.pdist([X[0], X2[0]], metric='seuclidean')
Out[104]: array([nan])
Why is this happening?
Method 2: Direct Formula Application
I tried manually using the formula linked in the answer above as follows:
In [107]: (((X[0] - X2[0])**2).sum()/(np.var([X[0], X2[0]])))**0.5
Out[107]: 2.0
As can be seen, this gives 2.0 instead of the expected 2.23606798.
I'm clearly doing something very wrong - What is it?
The standardized Euclidean distance weights each variable with a separate variance. If you don't provide the variances with the V argument, it computes them from the input array. This is mentioned in the pdist docstring in the "Parameters" section under **kwargs, where it shows:
V : ndarray
The variance vector for standardized Euclidean.
Default: var(X, axis=0, ddof=1)
For example:
In [39]: A
Out[39]:
array([[3, 0, 2],
[2, 1, 2],
[0, 0, 1],
[3, 1, 2],
[1, 0, 0]])
In [40]: from scipy.spatial.distance import pdist
In [41]: pdist(A, metric='seuclidean')
Out[41]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
We get the same result if we provide the variances computed as explained in the docstring:
In [42]: pdist(A, metric='seuclidean', V=np.var(A, axis=0, ddof=1))
Out[42]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
Of course, if you provide variances that are all 1, you get the regular Euclidean distance:
In [43]: pdist(A, metric='seuclidean', V=np.ones(A.shape[1]))
Out[43]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
In [44]: pdist(A, metric='euclidean')
Out[44]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
The problem with your "Method 1" is that in your input array of just two points (i.e. [X[0], X2[0]]), the second and third components of the points don't change, so the variance associated with those components is 0:
In [45]: p = np.array([X[0], X2[0]])
In [46]: p
Out[46]:
array([[0, 1, 0],
[1, 1, 0]])
In [47]: np.var(p, axis=0, ddof=1)
Out[47]: array([ 0.5, 0. , 0. ])
When the seuclidean code divides by these variances, the result is either infinity or NaN--the latter if the numerator is also 0, which is the case in the second and third components of the input [X[0], X2[0]].
To work around this, you have to decide how you want to handle the case where the variance of a component is 0, and handle it explicitly. For example, if you want it to act like that variance is 1 in that case (just to avoid dividing by 0) you could do something like the following.
Suppose B is our array of points. The third column of B is all 1s.
In [63]: B
Out[63]:
array([[3, 0, 1],
[2, 1, 1],
[0, 0, 1],
[3, 1, 1],
[1, 0, 1]])
Compute the variances of the columns:
In [64]: V = np.var(B, axis=0, ddof=1)
In [65]: V
Out[65]: array([ 1.7, 0.3, 0. ])
Replace the variances that are 0 with 1:
In [66]: V[V == 0] = 1
In [67]: V
Out[67]: array([ 1.7, 0.3, 1. ])
Use V to compute the standardized Euclidean distances:
In [68]: pdist(B, metric='seuclidean', V=V)
Out[68]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
This has the same effect as simply removing the constant column:
In [69]: pdist(B[:, :2], metric='seuclidean')
Out[69]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
Your "Method 2" is wrong because your formula is wrong. You have to keep the variances for each component. np.var([X[0], X2[0]]) computes the (single) variance of all the values in the input. Instead, you need to use the axis and ddof arguments shown above.

Two-array counting in NumPy [duplicate]

I'm searching for an efficient way to create a matrix of occurrences from two arrays that contain indices: one holds the row indices of this matrix, the other the column indices.
eg. I have:
#matrix will be size 4x3 in this example
#array of rows idxs, with values from 0 to 3
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
#array of columns idxs, with values from 0 to 2
[0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2]
And need to create a matrix of occurrences like:
[[1 0 0]
[0 2 0]
[0 1 2]
[2 1 5]]
I can create an array of one-hot vectors in a simple way, but can't get it to work when there is more than one occurrence:
n_rows = 4
n_columns = 3
#data
rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
#empty matrix
new_matrix = np.zeros([n_rows, n_columns])
#adding 1 for each [row, column] occurrence:
new_matrix[rows, columns] += 1
print(new_matrix)
Which returns:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 1.]
[ 1. 1. 1.]]
It seems that indexing and adding a value like this doesn't work when an index occurs more than once, even though printing the indexed result seems to work just fine:
print(new_matrix[rows, :])
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 1. 1.]
[ 0. 1. 1.]
[ 0. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
So maybe I'm missing something there? Or can't this be done, and I need to look for another way to do it?
Use np.add.at, specifying a tuple of indices:
>>> np.add.at(new_matrix, (rows, columns), 1)
>>> new_matrix
array([[ 1., 0., 0.],
[ 0., 2., 0.],
[ 0., 1., 2.],
[ 2., 1., 5.]])
np.add.at operates on the array in place, adding 1 to each index as many times as it appears in the (rows, columns) tuple.
Approach #1
We can convert those pairs to linear indices and then use np.bincount -
def bincount_app(rows, columns, n_rows, n_columns):
    # Get linear index equivalent
    lidx = (columns.max()+1)*rows + columns
    # Use binned count on the linear indices
    return np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)
Sample run -
In [242]: n_rows = 4
...: n_columns = 3
...:
...: rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
In [243]: bincount_app(rows, columns, n_rows, n_columns)
Out[243]:
array([[1, 0, 0],
[0, 2, 0],
[0, 1, 2],
[2, 1, 5]])
Approach #2
Alternatively, we can sort the linear indices and get the counts using slicing to have our second approach, like so -
def mask_diff_app(rows, columns, n_rows, n_columns):
    lidx = (columns.max()+1)*rows + columns
    lidx.sort()
    mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
    count = np.diff(np.flatnonzero(mask))
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    new_matrix.flat[lidx[mask[:-1]]] = count
    return new_matrix
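No sample run is shown for this approach; a quick check against the first one, using the same rows, columns, n_rows, n_columns as in the sample run above, would look like:
out = mask_diff_app(rows, columns, n_rows, n_columns)
print(np.array_equal(out, bincount_app(rows, columns, n_rows, n_columns)))  # True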
Approach #3
This seems like a straight-forward one with sparse matrix csr_matrix as well, as it does accumulation on its own for repeated indices. The benefit is the memory efficiency, given that it's a sparse matrix, which would be noticeable if you are filling a small number of places in the output and a sparse matrix output is okay.
The implementation would look something like this -
from scipy.sparse import csr_matrix

def sparse_matrix_app(rows, columns, n_rows, n_columns):
    out_shp = (n_rows, n_columns)
    data = np.ones(len(rows),dtype=int)
    return csr_matrix((data, (rows, columns)), shape=out_shp)
If you need a regular/dense array, simply do -
sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Sample output -
In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Out[319]:
array([[1, 0, 0],
[0, 2, 0],
[0, 1, 2],
[2, 1, 5]])
Benchmarking
Other approach(es) -
# #cᴏʟᴅsᴘᴇᴇᴅ's soln
def add_at_app(rows, columns, n_rows, n_columns):
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    np.add.at(new_matrix, (rows, columns), 1)
    return new_matrix
Timings
Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k
In [307]: # Setup
...: n_rows = 1000
...: n_columns = 1000
...: rows = np.random.randint(0,1000,(10000))
...: columns = np.random.randint(0,1000,(10000))
In [308]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 424 µs per loop
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 1.41 ms per loop
Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k
In [309]: # Setup
...: n_rows = 1000
...: n_columns = 1000
...: rows = np.random.randint(0,1000,(100000))
...: columns = np.random.randint(0,1000,(100000))
In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.4 ms per loop
1000 loops, best of 3: 1.27 ms per loop
100 loops, best of 3: 7.44 ms per loop
10 loops, best of 3: 20.4 ms per loop
Case #3 : Sparse-ness in output
As stated earlier, for the sparse method to work better, we would need sparse-ness. Such a case would be like this -
In [314]: # Setup
...: n_rows = 5000
...: n_columns = 5000
...: rows = np.random.randint(0,5000,(1000))
...: columns = np.random.randint(0,5000,(1000))
In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.7 ms per loop
100 loops, best of 3: 11.1 ms per loop
100 loops, best of 3: 11.1 ms per loop
1000 loops, best of 3: 269 µs per loop
If you need a dense array, we lose the memory efficiency and hence performance one as well -
In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
100 loops, best of 3: 11.7 ms per loop

scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize

For a matrix, I want to find the columns that are all zeros, fill them with 1s, and then normalize the matrix by column. I know how to do that with np.arrays:
[[0 0 0 0 0]
[0 0 1 0 0]
[1 0 0 1 0]
[0 0 0 0 1]
[1 0 0 0 0]]
|
V
[[0 1 0 0 0]
[0 1 1 0 0]
[1 1 0 1 0]
[0 1 0 0 1]
[1 1 0 0 0]]
|
V
[[0 0.2 0 0 0]
[0 0.2 1 0 0]
[0.5 0.2 0 1 0]
[0 0.2 0 0 1]
[0.5 0.2 0 0 0]]
But how can I achieve the same thing when the matrix is in scipy.sparse.coo_matrix form, without converting it back to np.arrays?
This will be a lot easier with the lil format, and working with rows rather than columns:
In [1]: from scipy import sparse
In [2]: A=np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
In [3]: A
Out[3]:
array([[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0]])
In [4]: At=A.T # switch to work with rows
In [5]: M=sparse.lil_matrix(At)
Now it is obvious which row is all zeros
In [6]: M.data
Out[6]: array([[1, 1], [], [1], [1], [1]], dtype=object)
In [7]: M.rows
Out[7]: array([[2, 4], [], [1], [2], [3]], dtype=object)
And lil format allows us to fill that row:
In [8]: M.data[1]=[1,1,1,1,1]
In [9]: M.rows[1]=[0,1,2,3,4]
In [10]: M.A
Out[10]:
array([[0, 0, 1, 0, 1],
[1, 1, 1, 1, 1],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]], dtype=int32)
I could have also used M[1,:]=np.ones(5,int)
The coo format is great for creating the array from the data/row/col arrays, but doesn't implement indexing or math. It has to be transformed to csr for that. And csc for column oriented stuff.
The row that I filled isn't so obvious in the csr format:
In [14]: Mc=M.tocsr()
In [15]: Mc.data
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [16]: Mc.indices
Out[16]: array([2, 4, 0, 1, 2, 3, 4, 1, 2, 3], dtype=int32)
In [17]: Mc.indptr
Out[17]: array([ 0, 2, 7, 8, 9, 10], dtype=int32)
On the other hand normalizing is probably easier in this format.
In [18]: Mc.sum(axis=1)
Out[18]:
matrix([[2],
[5],
[1],
[1],
[1]], dtype=int32)
In [19]: Mc/Mc.sum(axis=1)
Out[19]:
matrix([[ 0. , 0. , 0.5, 0. , 0.5],
[ 0.2, 0.2, 0.2, 0.2, 0.2],
[ 0. , 1. , 0. , 0. , 0. ],
[ 0. , 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0. , 1. , 0. ]])
Notice that it's converted the sparse matrix to a dense one. The sum is dense, and math involving sparse and dense usually produces dense.
I have to use a more roundabout calculation to preserve the sparse format:
In [27]: Mc.multiply(sparse.csr_matrix(1/Mc.sum(axis=1)))
Out[27]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
Here's a way of doing this with the csc format (on A)
In [40]: Ms=sparse.csc_matrix(A)
In [41]: Ms.sum(axis=0)
Out[41]: matrix([[2, 0, 1, 1, 1]], dtype=int32)
Use sum to find the all-zeros column. Obviously this could be wrong if the columns have negative values and happen to sum to 0. If that's a concern I can see making a copy of the matrix with all data values replaced by 1.
In [43]: Ms[:,1]=np.ones(5,int)[:,None]
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [44]: Ms.A
Out[44]:
array([[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 1, 0],
[0, 1, 0, 0, 1],
[1, 1, 0, 0, 0]])
The warning matters more if you do this sort of change repeatedly. Notice I have to adjust the dimension of the LHS array. Depending on the number of all-zero columns this action can change the sparsity of the matrix substantially.
==================
I could search the col of coo format for missing values with:
In [69]: Mo=sparse.coo_matrix(A)
In [70]: Mo.col
Out[70]: array([2, 0, 3, 4, 0], dtype=int32)
In [71]: Mo.col==np.arange(Mo.shape[1])[:,None]
Out[71]:
array([[False, True, False, False, True],
[False, False, False, False, False],
[ True, False, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]], dtype=bool)
In [72]: idx = np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
In [73]: idx
Out[73]: array([1], dtype=int32)
I could then add a column of 1s at this idx with:
In [75]: N=Mo.shape[0]
In [76]: data = np.concatenate([Mo.data, np.ones(N,int)])
In [77]: row = np.concatenate([Mo.row, np.arange(N)])
In [78]: col = np.concatenate([Mo.col, np.ones(N,int)*idx])
In [79]: Mo1 = sparse.coo_matrix((data,(row, col)), shape=Mo.shape)
In [80]: Mo1.A
Out[80]:
array([[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 1, 0],
[0, 1, 0, 0, 1],
[1, 1, 0, 0, 0]])
As written it works for just one column, but it could be generalized to several. I also created a new matrix rather than update Mo. But this in-place seems to work as well:
Mo.data,Mo.col,Mo.row = data,col,row
The normalization still requires csr conversion, though I think sparse can hide that for you.
In [87]: Mo1/Mo1.sum(axis=0)
Out[87]:
matrix([[ 0. , 0.2, 0. , 0. , 0. ],
[ 0. , 0.2, 1. , 0. , 0. ],
[ 0.5, 0.2, 0. , 1. , 0. ],
[ 0. , 0.2, 0. , 0. , 1. ],
[ 0.5, 0.2, 0. , 0. , 0. ]])
Even when I take the extra work of maintaining the sparse nature, I still get a csr matrix:
In [89]: Mo1.multiply(sparse.coo_matrix(1/Mo1.sum(axis=0)))
Out[89]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
See
Find all-zero columns in pandas sparse matrix
for more methods of finding the 0 columns. It turns out Mo.col==np.arange(Mo.shape[1])[:,None] is too slow with large Mo. A test using np.in1d is much better.
1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
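Putting the pieces together for the coo case, here is a minimal end-to-end sketch (assuming A is the dense example above, and using np.in1d to find the all-zero columns as suggested):
import numpy as np
from scipy import sparse

A = np.array([[0,0,0,0,0],
              [0,0,1,0,0],
              [1,0,0,1,0],
              [0,0,0,0,1],
              [1,0,0,0,0]])
Mo = sparse.coo_matrix(A)

# columns that never appear in Mo.col are all zero
missing = np.nonzero(~np.in1d(np.arange(Mo.shape[1]), Mo.col))[0]

# append a full column of ones for each missing column
N = Mo.shape[0]
data = np.concatenate([Mo.data] + [np.ones(N, int) for _ in missing])
row  = np.concatenate([Mo.row]  + [np.arange(N) for _ in missing])
col  = np.concatenate([Mo.col]  + [np.full(N, j) for j in missing])
Mo1 = sparse.coo_matrix((data, (row, col)), shape=Mo.shape)

# column-normalize while keeping the result sparse (it comes back in csr format)
out = Mo1.multiply(sparse.coo_matrix(1 / Mo1.sum(axis=0)))
print(out.toarray())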
