In Python I iterate over every column of P, which is a (4,4) array, with the function "q".
like:
P = np.array([[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0.5, 0], [0, 0, 0.5, 0]])
def q(q_u):
    q = np.array(
        [
            [np.dot(0, 2, q_u)],
            [np.zeros((4, 1), dtype=int)],
            [np.zeros((2, 1), dtype=int)],
        ],
        dtype=object,
    )
    return q
np.apply_along_axis(q, axis=0, arr=P)
I get a (3,1,4) array when applying the q function to the P array. This is correct. But how is it possible to save the 4 (3,1) arrays in a dictionary and later call them, to apply to another function printR which needs a (3,1) array?
printR(60, res, q)
Should I add the 4 arrays to a dictionary in order to iterate with printR, or is there another method?
Use transpose and zip to create the dictionary.
To create 4 arrays of shape (1,3), simply pass them to dict:
arr = np.apply_along_axis(q, axis=0, arr=P)
d = dict(zip(range(arr.size), arr.T))
Out[259]:
{0: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 1: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 2: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object), 3: array([[0, array([[0],
[0],
[0],
[0]]),
array([[0],
[0]])]], dtype=object)}
In [260]: d[0].shape
Out[260]: (1, 3)
To create 4 arrays of shape (3,1), use a dict comprehension:
d = {k: v.T for k, v in zip(range(arr.size), arr.T)}
Out[269]:
{0: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 1: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 2: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object), 3: array([[0],
[array([[0],
[0],
[0],
[0]])],
[array([[0],
[0]])]], dtype=object)}
In [270]: d[0].shape
Out[270]: (3, 1)
Note: I intentionally use arr.size to let zip trim the tuples based solely on the length of arr.T.
Correcting the puzzling dot to
[np.dot(0.2, q_u)],
produces the output in your other question.
I still wonder why you insist on using apply_along_axis. It doesn't have any speed benefits. Compare these timings:
In [36]: timeit np.apply_along_axis(q, axis=0, arr=P)
141 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [37]: timeit np.stack([q(P[:,i]) for i in range(P.shape[1])], axis=2)
72.1 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: timeit [q(P[:,i]) for i in range(P.shape[1])]
53 µs ± 42.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That dot(0.2, q_u) line just computes 0.2*q_u, which applied to all of P at once is 0.2*P or 0.2*P.T.
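A quick check of that equivalence (not in the original session):
In [39]: np.dot(0.2, P)
Out[39]:
array([[0.2, 0. , 0. , 0.2],
       [0. , 0.2, 0. , 0. ],
       [0. , 0. , 0.1, 0. ],
       [0. , 0. , 0.1, 0. ]])
which is exactly 0.2*P.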
Let's change q to omit the size 1 dimensions, to make a more compact display:
In [49]: def q1(q_u):
...: q = np.array(
...: [
...: np.dot(0.2, q_u),
...: np.zeros((4,), dtype=int),
...: np.zeros((2,), dtype=int),
...: ],
...: dtype=object,
...: )
...: return q
...:
In [50]: np.apply_along_axis(q1, axis=0, arr=P)
Out[50]:
array([[array([0.2, 0. , 0. , 0. ]), array([0. , 0.2, 0. , 0. ]),
array([0. , 0. , 0.1, 0.1]), array([0.2, 0. , 0. , 0. ])],
[array([0, 0, 0, 0]), array([0, 0, 0, 0]), array([0, 0, 0, 0]),
array([0, 0, 0, 0])],
[array([0, 0]), array([0, 0]), array([0, 0]), array([0, 0])]],
dtype=object)
In [51]: _.shape
Out[51]: (3, 4)
We can generate the same numbers, arranged slightly differently with:
In [52]: [0.2 * P.T, np.zeros((4,4),int), np.zeros((4,2),int)]
Out[52]:
[array([[0.2, 0. , 0. , 0. ],
[0. , 0.2, 0. , 0. ],
[0. , 0. , 0.1, 0.1],
[0.2, 0. , 0. , 0. ]]),
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]]),
array([[0, 0],
[0, 0],
[0, 0],
[0, 0]])]
You are making 3 2d arrays, each with one row per column of P.
The list comprehension that I timed in [38] produces 4 size (3,) arrays, that is one array per column of P. apply_along_axis obscures that, joining them on a last dimension (as my stack with axis=2 does).
In [53]: [q1(P[:,i]) for i in range(P.shape[1])]
Out[53]:
[array([array([0.2, 0. , 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0. , 0.2, 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0. , 0. , 0.1, 0.1]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object),
array([array([0.2, 0. , 0. , 0. ]), array([0, 0, 0, 0]), array([0, 0])],
dtype=object)]
The list comprehension is not only fast, but it also keeps the q output 'intact', making it easier to pass on to another function.
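If the goal is still feeding printR, a minimal sketch that builds the dictionary straight from that list comprehension (printR and its arguments are assumed from the question, with q corrected as above):
d = {i: q(P[:, i]) for i in range(P.shape[1])}  # column index -> (3,1) q output
for i, res in d.items():
    printR(60, res, q)  # printR signature taken from the question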
Related
How can I create the matrix
[[a, 0, 0],
[0, a, 0],
[0, 0, a],
[b, 0, 0],
[0, b, 0],
[0, 0, b],
...]
from the vector
[a, b, ...]
efficiently?
There must be a better solution than
np.squeeze(np.reshape(np.tile(np.eye(3), (len(foo), 1, 1)) * np.expand_dims(foo, (1, 2)), (1, -1, 3)))
right?
You can create a zero array in advance, and then quickly assign values by slicing:
def concated_diagonal(ar, col):
    ar = np.asarray(ar).ravel()
    size = ar.size
    ret = np.zeros((col * size, col), ar.dtype)
    for i in range(col):
        ret[i::col, i] = ar
    return ret
Test:
>>> concated_diagonal([1, 2, 3], 3)
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[2, 0, 0],
[0, 2, 0],
[0, 0, 2],
[3, 0, 0],
[0, 3, 0],
[0, 0, 3]])
Note that because the number of columns you require is small, the impact of the relatively slow Python level for loop is acceptable:
%timeit concated_diagonal(np.arange(1_000_000), 3)
17.1 ms ± 84.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update:
A solution with better performance! This works in one step, using clever reshaping and a strided slice assignment: viewed as a (size, col*col) array, each block's diagonal entries sit at column stride col + 1, so one slice assignment fills them all:
def concated_diagonal(ar, col):
    ar = np.asarray(ar).reshape(-1, 1)
    size = ar.size
    ret = np.zeros((col * size, col), ar.dtype)
    ret.reshape(size, -1)[:, ::col + 1] = ar
    return ret
Time comparison:
%timeit concated_diagonal(np.arange(1_000_000), 3)
10.7 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use numpy.tile, numpy.repeat, and numpy.eye.
rep = 3
lst = np.array([1,2,3,4])
res = np.tile(np.eye(rep), (len(lst),1))*np.repeat(lst, rep)[:,None]
print(res)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[2. 0. 0.]
[0. 2. 0.]
[0. 0. 2.]
[3. 0. 0.]
[0. 3. 0.]
[0. 0. 3.]
[4. 0. 0.]
[0. 4. 0.]
[0. 0. 4.]]
Explanation:
>>> np.tile(np.eye(3), (2,1))
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
>>> np.repeat([3,4], 3)[:,None]
array([[3],
[3],
[3],
[4],
[4],
[4]])
>>> np.tile(np.eye(3), (2,1)) * np.repeat([3,4], 3)[:,None]
array([[3., 0., 0.],
[0., 3., 0.],
[0., 0., 3.],
[4., 0., 0.],
[0., 4., 0.],
[0., 0., 4.]])
Benchmark on Colab (because you want an efficient approach).
The x-axis variable is len(arr), with eye(3) fixed.
Benchmark code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import time

bench = []
for num in np.power(np.arange(10,1500,5),2):
    arr = np.arange(num)
    start = time.time()
    col = 3
    size = arr.size
    ret1 = np.zeros((col * size, col), arr.dtype)
    for i in range(col):
        ret1[i::col, i] = arr
    bench.append({'len_arr':num, 'Method':'Mechanic_Pig', 'Time':time.time() - start})
    start = time.time()
    N = 3
    M = N*len(arr)
    ret2 = np.zeros((M, N), dtype=int)
    idx = np.arange(M)
    ret2[idx, idx%N] = np.repeat(arr, N)
    bench.append({'len_arr':num, 'Method':'mozway', 'Time':time.time() - start})
    start = time.time()
    ret3 = np.tile(np.eye(3), (len(arr),1))*np.repeat(arr, 3)[:,None]
    bench.append({'len_arr':num, 'Method':'Imahdi', 'Time':time.time() - start})
    start = time.time()
    ret4 = np.einsum('j,ik->jki', arr, np.eye(3)).reshape(-1, 3)
    bench.append({'len_arr':num, 'Method':'Michael_Szczesn', 'Time':time.time() - start})
plt.subplots(1,1, figsize=(10,7))
df = pd.DataFrame(bench)
sns.lineplot(data=df, x="len_arr", y="Time", hue="Method", style="Method")
plt.show()
# Check that the results of the different approaches are all equal
print((ret1 == ret2).all() and (ret1 == ret3).all() and (ret1 == ret4).all() and (ret2 == ret3).all() and (ret2 == ret4).all() and (ret3 == ret4).all())
# True
Here is a solution by indexing:
a = [1,2,3]
N = 3
M = N*len(a)
out = np.zeros((M, N), dtype=int)
idx = np.arange(M)
out[idx, idx%N] = np.repeat(a, N)
output:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[2, 0, 0],
[0, 2, 0],
[0, 0, 2],
[3, 0, 0],
[0, 3, 0],
[0, 0, 3]])
intermediates:
idx
# array([0, 1, 2, 3, 4, 5, 6, 7, 8])
idx%N
# array([0, 1, 2, 0, 1, 2, 0, 1, 2])
np.repeat(a, N)
# array([1, 1, 1, 2, 2, 2, 3, 3, 3])
An almost one-line solution:
import numpy as np
def concated_diagonal(vals):
    length = len(vals)
    return np.vstack([np.diag(np.full(length, v)) for v in vals])
print(concated_diagonal([1, 2, 3]))
Output
[[1 0 0]
[0 1 0]
[0 0 1]
[2 0 0]
[0 2 0]
[0 0 2]
[3 0 0]
[0 3 0]
[0 0 3]]
a = np.zeros([4, 4])
b = np.ones([4, 4])
# vertical stacking (row-wise)
print(np.r_[a,b])
print(np.r_[[1,2,3],0,0,[4,5,6]])
# output is
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[1 2 3 0 0 4 5 6]
But here np.r_ doesn't always perform vertical stacking; in the second example it does horizontal stacking. How does np.r_ work? Would be grateful for any help.
In [324]: a = np.zeros([4, 4],int)
...: b = np.ones([4, 4],int)
In [325]: np.r_[a,b]
Out[325]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]])
This is a row stack; same as vstack. And since the arrays are already 2d, concatenate is enough:
In [326]: np.concatenate((a,b), axis=0)
Out[326]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]])
With the mix of 1d and scalars, r_ is the same as hstack:
In [327]: np.r_[[1,2,3],0,0,[4,5,6]]
Out[327]: array([1, 2, 3, 0, 0, 4, 5, 6])
In [328]: np.hstack([[1,2,3],0,0,[4,5,6]])
Out[328]: array([1, 2, 3, 0, 0, 4, 5, 6])
In [329]: np.concatenate([[1,2,3],0,0,[4,5,6]],axis=0)
...
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)
concatenate fails because of the scalars. The other methods first convert those to 1d arrays.
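For instance, wrapping the scalars as length-1 lists makes concatenate work too (a quick check, not in the original session):
In [330]: np.concatenate([[1,2,3], [0], [0], [4,5,6]])
Out[330]: array([1, 2, 3, 0, 0, 4, 5, 6])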
In both cases, r_ does what its docstring says:
Translates slice objects to concatenation along the first axis.
r_ is actually an instance of a special class, with its own __getitem__ method, which allows us to use [] instead of (). It also means it can take slices as inputs (which are rendered as np.arange or np.linspace calls).
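For example, a slice input expands like arange, and an imaginary 'step' like linspace (illustrative session, not from the original):
In [331]: np.r_[0:5]
Out[331]: array([0, 1, 2, 3, 4])
In [332]: np.r_[0:1:5j]
Out[332]: array([0.  , 0.25, 0.5 , 0.75, 1.  ])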
r_ takes an optional initial string argument which, if it consists of up to 3 comma-separated numbers, controls the concatenation axis and how inputs are adjusted to matching dimensions. See the docs, and the np.lib.index_tricks.py file, for more details.
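A quick illustration of the string argument (the documented '0,2' form: concatenate along axis 0, upgrading inputs to at least 2d):
In [333]: np.r_['0,2', [1,2,3], [4,5,6]]
Out[333]:
array([[1, 2, 3],
       [4, 5, 6]])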
In order of importance I think the concatenate functions are:
np.concatenate # base
np.vstack # easily joins 1d arrays into 2d
np.stack # generalizes np.array
np.hstack # saves specifying the axis
np.r_
np.c_
r_ and c_ can do neat things when mixing arrays of different shapes, but it all boils down to using concatenate correctly.
The formula is available in the docs and pointed to in this answer. However, when I try to apply it I don't get a matching answer. I'm sure there's some silly mistake I'm making somewhere, so thanks for bearing with me:
Setup
Say I have 2 matrices:
X: array([[0, 1, 0],
[1, 1, 1]])
X2: array([[1, 1, 0],
[1, 1, 1],
[1, 2, 0]])
Now applying Xans = scipy.spatial.distance.cdist(X, X2, 'seuclidean') gives:
Xans: array([[2.23606798, 2.88675135, 3.16227766],
[1.82574186, 0. , 2.88675135]])
Let's just focus on Xans[0][0] = 2.23606798, which should have been obtained by applying seuclidean(X[0], X2[0]).
Method 1: Using pdist
I tried doing this via pdist but get a NaN:
In [104]: scipy.spatial.distance.pdist([X[0], X2[0]], metric='seuclidean')
Out[104]: array([nan])
Why is this happening?
Method 2: Direct Formula Application
I tried manually using the formula linked in the answer above as follows:
In [107]: (((X[0] - X2[0])**2).sum()/(np.var([X[0], X2[0]])))**0.5
Out[107]: 2.0
As can be seen, this gives 2.0 rather than the expected 2.23606798.
I'm clearly doing something very wrong. What is it?
The standardized Euclidean distance weights each variable with a separate variance. If you don't provide the variances with the V argument, it computes them from the input array. This is mentioned in the pdist docstring in the "Parameters" section under **kwargs, where it shows:
V : ndarray
The variance vector for standardized Euclidean.
Default: var(X, axis=0, ddof=1)
For example:
In [39]: A
Out[39]:
array([[3, 0, 2],
[2, 1, 2],
[0, 0, 1],
[3, 1, 2],
[1, 0, 0]])
In [40]: from scipy.spatial.distance import pdist
In [41]: pdist(A, metric='seuclidean')
Out[41]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
We get the same result if we provide the variances computed as explained in the docstring:
In [42]: pdist(A, metric='seuclidean', V=np.var(A, axis=0, ddof=1))
Out[42]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
Of course, if you provide variances that are all 1, you get the regular Euclidean distance:
In [43]: pdist(A, metric='seuclidean', V=np.ones(A.shape[1]))
Out[43]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
In [44]: pdist(A, metric='euclidean')
Out[44]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
The problem with your "Method 1" is that in your input array of just two points (i.e. [X[0], X2[0]]), the second and third components of the points don't change, so the variance associated with those components is 0:
In [45]: p = np.array([X[0], X2[0]])
In [46]: p
Out[46]:
array([[0, 1, 0],
[1, 1, 0]])
In [47]: np.var(p, axis=0, ddof=1)
Out[47]: array([ 0.5, 0. , 0. ])
When the code for the seuclidean divides by these variances, the result is either infinity or NaN; the latter occurs when the numerator is also 0, which is the case in the third component of the input [X[0], X2[0]].
To work around this, you have to decide how you want to handle the case where the variance of a component is 0, and handle it explicitly. For example, if you want it to act like that variance is 1 in that case (just to avoid dividing by 0) you could do something like the following.
Suppose B is our array of points. The third column of B is all 1s.
In [63]: B
Out[63]:
array([[3, 0, 1],
[2, 1, 1],
[0, 0, 1],
[3, 1, 1],
[1, 0, 1]])
Compute the variances of the columns:
In [64]: V = np.var(B, axis=0, ddof=1)
In [65]: V
Out[65]: array([ 1.7, 0.3, 0. ])
Replace the variances that are 0 with 1:
In [66]: V[V == 0] = 1
In [67]: V
Out[67]: array([ 1.7, 0.3, 1. ])
Use V to compute the standardized Euclidean distances:
In [68]: pdist(B, metric='seuclidean', V=V)
Out[68]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
This has the same effect as simply removing the constant column:
In [69]: pdist(B[:, :2], metric='seuclidean')
Out[69]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
Your "Method 2" is wrong because your formula is wrong. You have to keep the variances for each component. np.var([X[0], X2[0]]) computes the (single) variance of all the values in the input. Instead, you need to use the axis and ddof arguments shown above.
I'm searching for an efficient way to create a matrix of occurrences from two arrays that contain indices: one holds the row indices of this matrix, the other the column indices.
e.g. I have:
#matrix will be size 4x3 in this example
#array of rows idxs, with values from 0 to 3
[0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
#array of columns idxs, with values from 0 to 2
[0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2]
And need to create a matrix of occurrences like:
[[1 0 0]
[0 2 0]
[0 1 2]
[2 1 5]]
I can create an array of one-hot vectors in a simple form, but can't get it to work when there is more than one occurrence:
n_rows = 4
n_columns = 3
#data
rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
#empty matrix
new_matrix = np.zeros([n_rows, n_columns])
#adding 1 for each [row, column] occurrence:
new_matrix[rows, columns] += 1
print(new_matrix)
Which returns:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 1.]
[ 1. 1. 1.]]
It seems like indexing and adding a value like this doesn't work when an index occurs more than once, even though printing the indexed view seems to work just fine:
print(new_matrix[rows, :])
Which returns:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 1. 1.]
[ 0. 1. 1.]
[ 0. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
So maybe I'm missing something there? Or can't this be done, and I need to search for another way to do it?
Use np.add.at, specifying a tuple of indices:
>>> np.add.at(new_matrix, (rows, columns), 1)
>>> new_matrix
array([[ 1., 0., 0.],
[ 0., 2., 0.],
[ 0., 1., 2.],
[ 2., 1., 5.]])
np.add.at operates on the array in place, adding 1 to each index as many times as it appears in the (rows, columns) tuple.
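(Why the question's new_matrix[rows, columns] += 1 missed the duplicates: a fancy-indexed += is evaluated as a gather, a single add, and a scatter, so each repeated index is written only once; np.add.at performs an unbuffered add per occurrence. A small illustrative check:)
>>> x = np.zeros(3)
>>> x[[0, 0, 1]] += 1
>>> x
array([ 1.,  1.,  0.])
>>> np.add.at(x, [0, 0, 1], 1)
>>> x
array([ 3.,  2.,  0.])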
Approach #1
We can convert those pairs to linear indices and then use np.bincount -
def bincount_app(rows, columns, n_rows, n_columns):
    # Get linear index equivalent (stride by n_columns so the
    # reshape below lines up even when trailing columns are empty)
    lidx = n_columns*rows + columns
    # Use binned count on the linear indices
    return np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)
Sample run -
In [242]: n_rows = 4
...: n_columns = 3
...:
...: rows = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])
In [243]: bincount_app(rows, columns, n_rows, n_columns)
Out[243]:
array([[1, 0, 0],
[0, 2, 0],
[0, 1, 2],
[2, 1, 5]])
Approach #2
Alternatively, we can sort the linear indices and get the counts using slicing to have our second approach, like so -
def mask_diff_app(rows, columns, n_rows, n_columns):
    lidx = n_columns*rows + columns  # same linear indices as Approach #1
    lidx.sort()
    mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
    count = np.diff(np.flatnonzero(mask))
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    new_matrix.flat[lidx[mask[:-1]]] = count
    return new_matrix
Approach #3
This seems like a straightforward one with the sparse csr_matrix as well, as it does the accumulation on its own for repeated indices. The benefit is memory efficiency, which would be noticeable if you are filling a small number of places in the output and a sparse matrix output is okay.
The implementation would look something like this -
from scipy.sparse import csr_matrix
def sparse_matrix_app(rows, columns, n_rows, n_columns):
    out_shp = (n_rows, n_columns)
    data = np.ones(len(rows),dtype=int)
    return csr_matrix((data, (rows, columns)), shape=out_shp)
If you need a regular/dense array, simply do -
sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Sample output -
In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Out[319]:
array([[1, 0, 0],
[0, 2, 0],
[0, 1, 2],
[2, 1, 5]])
Benchmarking
Other approach(es) -
# #cᴏʟᴅsᴘᴇᴇᴅ's soln
def add_at_app(rows, columns, n_rows, n_columns):
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    np.add.at(new_matrix, (rows, columns), 1)
    return new_matrix
Timings
Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k
In [307]: # Setup
...: n_rows = 1000
...: n_columns = 1000
...: rows = np.random.randint(0,1000,(10000))
...: columns = np.random.randint(0,1000,(10000))
In [308]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 424 µs per loop
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 1.41 ms per loop
Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k
In [309]: # Setup
...: n_rows = 1000
...: n_columns = 1000
...: rows = np.random.randint(0,1000,(100000))
...: columns = np.random.randint(0,1000,(100000))
In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.4 ms per loop
1000 loops, best of 3: 1.27 ms per loop
100 loops, best of 3: 7.44 ms per loop
10 loops, best of 3: 20.4 ms per loop
Case #3 : Sparse-ness in output
As stated earlier, for the sparse method to work better, we would need sparse-ness. Such a case would be like this -
In [314]: # Setup
...: n_rows = 5000
...: n_columns = 5000
...: rows = np.random.randint(0,5000,(1000))
...: columns = np.random.randint(0,5000,(1000))
In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
...: %timeit bincount_app(rows, columns, n_rows, n_columns)
...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.7 ms per loop
100 loops, best of 3: 11.1 ms per loop
100 loops, best of 3: 11.1 ms per loop
1000 loops, best of 3: 269 µs per loop
If you need a dense array, we lose the memory efficiency, and hence the performance advantage as well -
In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
100 loops, best of 3: 11.7 ms per loop
For a matrix, I want to find columns with all zeros, fill them with 1s, and then normalize the matrix by column. I know how to do that with np.arrays:
[[0 0 0 0 0]
[0 0 1 0 0]
[1 0 0 1 0]
[0 0 0 0 1]
[1 0 0 0 0]]
|
V
[[0 1 0 0 0]
[0 1 1 0 0]
[1 1 0 1 0]
[0 1 0 0 1]
[1 1 0 0 0]]
|
V
[[0 0.2 0 0 0]
[0 0.2 1 0 0]
[0.5 0.2 0 1 0]
[0 0.2 0 0 1]
[0.5 0.2 0 0 0]]
But how can I achieve the same thing when the matrix is in scipy.sparse.coo.coo_matrix form, without converting it back to np.arrays?
This will be a lot easier with the lil format, and working with rows rather than columns:
In [1]: from scipy import sparse
In [2]: A=np.array([[0,0,0,0,0],[0,0,1,0,0],[1,0,0,1,0],[0,0,0,0,1],[1,0,0,0,0]])
In [3]: A
Out[3]:
array([[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 0]])
In [4]: At=A.T # switch to work with rows
In [5]: M=sparse.lil_matrix(At)
Now it is obvious which row is all zeros
In [6]: M.data
Out[6]: array([[1, 1], [], [1], [1], [1]], dtype=object)
In [7]: M.rows
Out[7]: array([[2, 4], [], [1], [2], [3]], dtype=object)
And lil format allows us to fill that row:
In [8]: M.data[1]=[1,1,1,1,1]
In [9]: M.rows[1]=[0,1,2,3,4]
In [10]: M.A
Out[10]:
array([[0, 0, 1, 0, 1],
[1, 1, 1, 1, 1],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0]], dtype=int32)
I could have also used M[1,:]=np.ones(5,int)
The coo format is great for creating the array from the data/row/col arrays, but it doesn't implement indexing or math. It has to be transformed to csr for that, and to csc for column-oriented operations.
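For example (a quick illustration; the exact error text varies with the scipy version):
In [11]: Mo=sparse.coo_matrix(A)
In [12]: Mo[1,2]
TypeError: 'coo_matrix' object is not subscriptable
In [13]: Mo.tocsr()[1,2]
Out[13]: 1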
The row that I filled isn't so obvious in the csr format:
In [14]: Mc=M.tocsr()
In [15]: Mc.data
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [16]: Mc.indices
Out[16]: array([2, 4, 0, 1, 2, 3, 4, 1, 2, 3], dtype=int32)
In [17]: Mc.indptr
Out[17]: array([ 0, 2, 7, 8, 9, 10], dtype=int32)
On the other hand normalizing is probably easier in this format.
In [18]: Mc.sum(axis=1)
Out[18]:
matrix([[2],
[5],
[1],
[1],
[1]], dtype=int32)
In [19]: Mc/Mc.sum(axis=1)
Out[19]:
matrix([[ 0. , 0. , 0.5, 0. , 0.5],
[ 0.2, 0.2, 0.2, 0.2, 0.2],
[ 0. , 1. , 0. , 0. , 0. ],
[ 0. , 0. , 1. , 0. , 0. ],
[ 0. , 0. , 0. , 1. , 0. ]])
Notice that it's converted the sparse matrix to a dense one. The sum is dense, and math involving sparse and dense usually produces dense.
I have to use a more roundabout calculation to preserve the sparse status:
In [27]: Mc.multiply(sparse.csr_matrix(1/Mc.sum(axis=1)))
Out[27]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
Here's a way of doing this with the csc format (on A)
In [40]: Ms=sparse.csc_matrix(A)
In [41]: Ms.sum(axis=0)
Out[41]: matrix([[2, 0, 1, 1, 1]], dtype=int32)
Use sum to find the all-zeros column. Obviously this could be wrong if the columns have negative values that happen to sum to 0. If that's a concern, I can see making a copy of the matrix with all data values replaced by 1.
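A sketch of that idea: copy the matrix and overwrite its stored values with 1s before summing, so signs can't cancel:
Ms1 = Ms.copy()
Ms1.data = np.ones_like(Ms1.data)  # every stored value -> 1
Ms1.sum(axis=0)                    # sign-safe count: matrix([[2, 0, 1, 1, 1]])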
In [43]: Ms[:,1]=np.ones(5,int)[:,None]
/usr/lib/python3/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [44]: Ms.A
Out[44]:
array([[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 1, 0],
[0, 1, 0, 0, 1],
[1, 1, 0, 0, 0]])
The warning matters more if you do this sort of change repeatedly. Notice I have to adjust the dimensions of the array on the right-hand side. Depending on the number of all-zero columns, this action can change the sparsity of the matrix substantially.
==================
I could search the col of coo format for missing values with:
In [69]: Mo=sparse.coo_matrix(A)
In [70]: Mo.col
Out[70]: array([2, 0, 3, 4, 0], dtype=int32)
In [71]: Mo.col==np.arange(Mo.shape[1])[:,None]
Out[71]:
array([[False, True, False, False, True],
[False, False, False, False, False],
[ True, False, False, False, False],
[False, False, True, False, False],
[False, False, False, True, False]], dtype=bool)
In [72]: idx = np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]
In [73]: idx
Out[73]: array([1], dtype=int32)
I could then add a column of 1s at this idx with:
In [75]: N=Mo.shape[0]
In [76]: data = np.concatenate([Mo.data, np.ones(N,int)])
In [77]: row = np.concatenate([Mo.row, np.arange(N)])
In [78]: col = np.concatenate([Mo.col, np.ones(N,int)*idx])
In [79]: Mo1 = sparse.coo_matrix((data,(row, col)), shape=Mo.shape)
In [80]: Mo1.A
Out[80]:
array([[0, 1, 0, 0, 0],
[0, 1, 1, 0, 0],
[1, 1, 0, 1, 0],
[0, 1, 0, 0, 1],
[1, 1, 0, 0, 0]])
As written it works for just one column, but it could be generalized to several. I also created a new matrix rather than updating Mo, but this in-place update seems to work as well:
Mo.data,Mo.col,Mo.row = data,col,row
The normalization still requires csr conversion, though I think sparse can hide that for you.
In [87]: Mo1/Mo1.sum(axis=0)
Out[87]:
matrix([[ 0. , 0.2, 0. , 0. , 0. ],
[ 0. , 0.2, 1. , 0. , 0. ],
[ 0.5, 0.2, 0. , 1. , 0. ],
[ 0. , 0.2, 0. , 0. , 1. ],
[ 0.5, 0.2, 0. , 0. , 0. ]])
Even when I take the extra work of maintaining the sparse nature, I still get a csr matrix:
In [89]: Mo1.multiply(sparse.coo_matrix(1/Mo1.sum(axis=0)))
Out[89]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
See Find all-zero columns in pandas sparse matrix for more methods of finding the 0 columns. It turns out Mo.col==np.arange(Mo.shape[1])[:,None] is too slow with large Mo. A test using np.in1d is much better:
1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)
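For this example it flags the same column (np.in1d is called np.isin in newer numpy):
In [90]: np.nonzero(1 - np.in1d(np.arange(Mo.shape[1]), Mo.col))[0]
Out[90]: array([1])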