How to do sampling based on some conditions in parallel in Python?

Assume I would like to do sampling in parallel based on a condition.
For example, given a matrix A, I want to sample p pairs of indices (i, j) such that A[i][j] != 5.
import numpy as np
import random

A = np.random.randint(10, size=(5000, 5000))  # assume this is fixed
p = 400  # sample 400 index pairs
res = set()
cnt = 0
while cnt < p:
    r, c = random.randint(0, A.shape[0]-1), random.randint(0, A.shape[0]-1)
    if A[r, c] != 5 and (r, c) not in res:
        res.add((r, c))
        cnt += 1
Above is my attempt. However, the matrix A and the number of samples p can be very large. Can we do it in parallel, e.g. with joblib or multiprocessing? Or is there any fast way to obtain the rows and columns?

You can use Numba to speed up this code. Numba can generate fast (parallel) functions at runtime using a just-in-time (JIT) compiler. Using a smaller datatype like np.int8 saves some memory and results in a faster execution time: smaller arrays can be read/written faster from/to RAM, and they are more likely to fit in the CPU cache, speeding up random access. While you can parallelize the random picking, this is quite hard, and the creation of threads can be more expensive than the actual computation for the chosen parameters. Still, Numba can improve the speed by a large margin just by (mostly) removing the overhead of the Python interpreter.
Here is the resulting code:
# Initial conditions

import numba as nb
import numpy as np
import random

@nb.njit('int8[:,:](int_, int_)', parallel=True)
def genArray(n, m):
    res = np.empty((n, m), dtype=np.int8)
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.randint(10)
    return res

p = 400
A = genArray(5000, 5000)

# Actual computing code

@nb.njit('(int8[:,::1], int_)')
def genPosSet(A, p):
    maxi = A.shape[0]-1
    res = set()
    cnt = 0
    while cnt < p:
        r, c = random.randint(0, maxi), random.randint(0, maxi)
        if A[r, c] != 5 and (r, c) not in res:
            res.add((r, c))
            cnt += 1
    return res

res = genPosSet(A, p)
This implementation of genPosSet takes 64 us on my machine while the initial function takes 1350 us. The new implementation is thus 21 times faster.
Note that the time to create/delete threads (1 thread/core) and share the work between them takes usually from 10 us to 1000 us.
Note that if p is not much smaller than A.size * prob, where prob is the probability of finding a value different from 5, then the current algorithm is not very efficient. In this case, it is better to filter the values that are different from 5 before picking random locations. If p is not much smaller than A.size, then the best solution is to shuffle all the possible locations that can be picked and finally extract the first p values of the resulting list.
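For reference, here is a minimal sketch of that filtering idea (my own illustration, not part of the answer above): it materializes the indices of all valid entries with np.argwhere and then draws p of them without replacement. The names valid and picks are placeholders, and it assumes p does not exceed the number of valid entries.

import numpy as np

A = np.random.randint(10, size=(5000, 5000))
p = 400

# Indices of all entries different from 5, shape (K, 2)
valid = np.argwhere(A != 5)
# Draw p distinct rows of `valid` without replacement
picks = valid[np.random.choice(valid.shape[0], size=p, replace=False)]
res = set(map(tuple, picks))

This trades a full pass over the matrix and a (K, 2) index array for a sampling step with no rejection loop, which is only worthwhile when p is large, as the answer notes.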


Is there a better way to search a sorted list if the other list is sorted too?

In the numpy library, one can pass a list into the numpy.searchsorted function, whereby it searches through a different list one element at a time and returns an array of the same size with the indices needed to preserve order. However, it seems to be wasting performance if both lists are sorted. For example:
m=[1,3,5,7,9]
n=[2,4,6,8,10]
numpy.searchsorted(m,n)
would return [1,2,3,4,5], which is the correct answer, but it looks like this would have complexity O(n ln(m)), whereas if one were to simply loop through m and keep some kind of pointer into n, the complexity seems more like O(n+m). Is there some kind of function in NumPy which does this?
AFAIK, it is not possible to do that in linear time with NumPy alone without making additional assumptions on the inputs (e.g. the integers are small and bounded). An alternative solution is to use Numba to do the merge manually:
import numba as nb
import numpy as np

# Note: Numba requires a function signature with well defined array types
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted(a, b):
    i, j = 0, 0
    result = np.empty(b.size, np.int64)
    while i < a.size and j < b.size:
        if a[i] < b[j]:
            i += 1
        else:
            result[j] = i
            j += 1
    for k in range(j, b.size):
        result[k] = i
    return result
a, b = np.cumsum(np.random.randint(0, 100, (2, 1000000)).astype(np.int64), axis=1)
result = search_both_sorted(a, b)
A faster implementation consists in using a branchless approach so as to remove the overhead of branch misprediction (especially on random/unpredictable inputs) when a and b are about the same size. Additionally, the O(n log m) algorithm can be faster when b is small, so using np.searchsorted in that case is very efficient, as pointed out by @MichaelSzczesny. Note that the Numba implementation of np.searchsorted can be a bit slower than the one of NumPy, so it is better to pick the NumPy implementation. Here is the optimized version:
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted_opt_numba(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if sb < sa * 0.15:
        # Use a version with branches because `a[i] < b[j]`
        # should be most of the time true.
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            if a[i] < b[j]:
                i += 1
            else:
                result[j] = i
                j += 1
        for k in range(j, b.size):
            result[k] = i
    else:
        # Use a branchless approach to avoid miss-predictions
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            tmp = a[i] < b[j]
            result[j] = i
            i += tmp
            j += ~tmp
        for k in range(j, b.size):
            result[k] = i
    return result

def search_both_sorted_opt(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if 2 * sb * np.log2(sa) < sa + sb:
        return np.searchsorted(a, b)
    else:
        return search_both_sorted_opt_numba(a, b)
searchsorted: 19.1 ms
snp_search: 11.8 ms
search_both_sorted: 6.5 ms
search_both_sorted_branchless: 4.3 ms
The optimized branchless Numba implementation is about 4.4 times faster than searchsorted which is pretty good considering that the code of searchsorted is already highly optimized. It can be even faster when a and b are huge because of cache locality.
You could use sortednp; unfortunately it does not give much flexibility. In the code snippet below I used its merge with index tracking. It produces three arrays, so about four times more memory than necessary is used, but it is faster than searchsorted.
import numpy as np
import sortednp as snp

a = np.cumsum(np.random.rand(1000000))
b = np.cumsum(np.random.rand(1000000))

def snp_search(a, b):
    m, (ib, ia) = snp.merge(b, a, indices=True)
    return ib - np.arange(len(ib))

assert np.all(snp_search(a, b) == np.searchsorted(a, b))

np.searchsorted(a, b)  # 58 ms
snp_search(a, b)       # 22 ms
np.searchsorted takes this into account already as can be seen from the source code:
/*
 * Updating only one of the indices based on the previous key
 * gives the search a big boost when keys are sorted, but slightly
 * slows down things for purely random ones.
 */
if (cmp(last_key_val, key_val)) {
    max_idx = arr_len;
}
else {
    min_idx = 0;
    max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
Here min_idx, max_idx are used to perform binary search on the array. If last_key_val < key_val then only max_idx is reset to the array length, but min_idx remains at its current value, i.e. binary search starts at the same lower boundary as for the previous key.
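A quick way to observe this effect (my own sketch, not part of the answer above): time np.searchsorted on the same set of keys, once sorted and once shuffled. The array sizes below are arbitrary.

import numpy as np
from timeit import timeit

a = np.sort(np.random.rand(10_000_000))
b_sorted = np.sort(np.random.rand(1_000_000))
b_shuffled = np.random.permutation(b_sorted)

# Sorted keys let searchsorted reuse the previous lower boundary
print(timeit(lambda: np.searchsorted(a, b_sorted), number=10))
print(timeit(lambda: np.searchsorted(a, b_shuffled), number=10))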

Is there any way to speed up the computation time of calling a function multiple times in Python?

import numpy as np
import matplotlib.pyplot as plt
from numpy import random
import time
from collections import Counter

def simulation(N):
    I = 10
    success = 0
    M = 100
    for i in range(I):
        s = allocate(N, M)
        M -= s
        success += s
    return success

def allocate(N, M):
    count = Counter(random.randint(N, size=M))
    success = sum(j for v, j in count.items() if j == 1)
    return success

if __name__ == "__main__":
    start = time.perf_counter()
    SAMPLE_SIZE = 100000
    N = np.linspace(5, 45, 41).astype(int)
    Ps = []
    for n in N:
        ps = []
        for _ in range(SAMPLE_SIZE):
            ps.append(simulation(n) / 100)
        result = np.average(np.array(ps))
        Ps.append(result)
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    plt.scatter(N, Ps)
    plt.show()
Here is my situation. The ultimate goal is to set SAMPLE_SIZE to 10^7. However, when I set it to 10^5, it already requires about 1000sec to run it. Is there any way to make it more efficient and faster? Thanks for giving me suggestions.
First of all, the implementation of allocate is not very efficient: you can use vectorized NumPy functions to do that:
def allocate(N, M):
    success = np.count_nonzero(np.bincount(random.randint(N, size=M)) == 1)
    return success
The thing is, most of the time comes from the overhead of NumPy functions performing some checks and creating some temporary arrays. You can use Numba to fix this problem:
import numba as nb

@nb.njit('int_(int_, int_)')
def allocate(N, M):
    tmp = np.zeros(N, np.int_)
    for i in range(M):
        rnd = np.random.randint(0, N)
        tmp[rnd] += 1
    count = 0
    for i in range(N):
        count += tmp[i] == 1
    return count
Then, you can speed up the code a bit further by applying the Numba decorator @nb.njit('int_(int_)') to the simulation function so as to avoid the overhead of calling Numba functions from the CPython interpreter.
Finally, you can speed up the main loop by running it in parallel with Numba (and also avoid the use of slow lists). You can also recycle the tmp array so as not to cause too many allocations (which are expensive and do not scale with the number of cores). Here is the resulting final code:
import numpy as np
import matplotlib.pyplot as plt
import time
import numba as nb

# Recycle the `tmp` buffer not to do many allocations
@nb.njit('int_(int_, int_, int_[::1])')
def allocate(N, M, tmp):
    tmp.fill(0)
    for i in range(M):
        rnd = np.random.randint(0, N)
        tmp[rnd] += 1
    count = 0
    for i in range(N):
        count += tmp[i] == 1
    return count

@nb.njit('int_(int_)')
def simulation(N):
    I = 10
    success = 0
    M = 100
    tmp = np.zeros(N, np.int_)  # Preallocated buffer
    for i in range(I):
        s = allocate(N, M, tmp)
        M -= s
        success += s
    return success

@nb.njit('float64(int_, int_)', parallel=True)
def compute_ps_avg(n, sample_size):
    ps = np.zeros(sample_size, dtype=np.float64)
    for i in nb.prange(sample_size):
        ps[i] = simulation(n) / 100.0
    # Note that np.average is not yet supported by Numba
    return np.mean(ps)

if __name__ == "__main__":
    start = time.perf_counter()
    SAMPLE_SIZE = 100_000
    N = np.linspace(5, 45, 41).astype(int)
    Ps = [compute_ps_avg(n, SAMPLE_SIZE) for n in N]
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    plt.scatter(N, Ps)
    plt.show()
Here are performance results on my 10-core machine:
Initial code: 670.6 s
Optimized Numba code: 3.9 s
The resulting code is 172 times faster.
More than 80% of the time is spent in the generation of random numbers. Thus, if you want the code to be faster, one solution is to speed up the generation of random numbers using a SIMD-optimized random number generator. Unfortunately, AFAIK, it is not possible to (efficiently) achieve this in Python. You certainly need to use a native language like C or C++ to do that.
I might be missing the point, but it seems like you can replace your simulation with a closed-form calculation.
I believe the problem you are solving is to find the expected number of boxes with exactly 1 ball in them given a random distribution of M balls in N boxes.
Follow the answer here https://math.stackexchange.com/a/66094 to get the closed-form expression (André Nicolas solved it for 10 balls and 5 boxes, but you should be able to extrapolate).
(As a side note I will often also write code to confirm that my probability calculations are correct, if this is what you are doing sorry about stating the obvious :P )
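As a sketch of that closed-form idea (my own illustration, not taken from the linked answer): with M balls thrown uniformly into N boxes, a given box holds exactly one ball with probability (M/N)*(1 - 1/N)^(M-1), so the expected number of singly-occupied boxes is M*(1 - 1/N)^(M-1). This covers a single allocate round; the question chains 10 rounds with a decreasing M, so you would still iterate (or extend the formula) for the full simulation. The helper names below are made up for illustration.

from collections import Counter
import numpy as np

def expected_singletons(N, M):
    # Expected number of boxes containing exactly one ball
    # when M balls are dropped uniformly into N boxes.
    return M * (1.0 - 1.0 / N) ** (M - 1)

def mc_singletons(N, M, trials=10000):
    # Monte Carlo check of the formula
    hits = 0
    for _ in range(trials):
        counts = Counter(np.random.randint(N, size=M))
        hits += sum(1 for c in counts.values() if c == 1)
    return hits / trials

print(expected_singletons(20, 100), mc_singletons(20, 100))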

efficient loop over numpy array

Versions of this question have already been asked but I have not found a satisfactory answer.
Problem: given a large numpy vector, find indices of the vector elements which are duplicated (a variation of that could be comparison with tolerance).
So the problem is ~O(N^2) and memory bound (at least from the current algorithm's point of view). I wonder why, whatever I tried, Python is 100x or more slower than equivalent C code.
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1
dupl = []
print("init done")
counter = 0
for i in range(N):
    for j in range(i+1, N):
        if vect[i] == vect[j]:
            dupl.append(j)
            counter += 1
print("counter =", counter)
print(dupl)
# For simplicity, this code ignores repeated indices
# which can be trimmed later. Ref output is
# counter = 3
# [2500, 5000, 5000]
I tried using numpy iterators but they are even worse (~ x4-5)
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
Using N=10,000 I'm getting 0.1 sec in C, 12 sec in Python (code above), 40 sec in Python using np.nditer, 50 sec in Python using np.ndindex. I pushed it to N=160,000 and the timing scales as N^2 as expected.
Since the answers have stopped coming and none was totally satisfactory, for the record I post my own solution.
It is my understanding that it's the assignment which makes Python slow in this case, not the nested loops as I thought initially. Using a library or compiled code eliminates the need for assignments and performance improves dramatically.
from __future__ import print_function
import numpy as np
from numba import jit

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N/2] = 1
vect[N/4] = 1
dupl = np.zeros(N, dtype=np.int32)
print("init done")

# uncomment to enable compiled function
#@jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i+1, ns):
        # replace if to use approx comparison
        #if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter

counter = 0
for i in xrange(N):
    counter = duplicates(i, counter, dupl, vect)
print("counter =", counter)
print(dupl[0:counter])
Tests
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 10.135 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.480 s
The performance of the compiled version (with @jit uncommented) is close to the C code performance, ~0.1 - 0.2 sec. Perhaps eliminating the last loop could improve the performance even further. The difference in performance is even stronger when using the approximate comparison with eps, while there is very little difference for the compiled version.
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 109.218 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.506 s
This is a ~200x difference. In the real code, I had to put both loops in the function as well as use a function template with variable types, so it was a bit more complex, but not very much.
Python itself is a highly dynamic, slow language. The idea in NumPy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with
a = np.equal.outer(vect, vect)
Now, for example, to find the sum:
>>> np.sum(a)
10006
To find the indices of i that are equal, you can do
np.fill_diagonal(a, 0)
>>> np.nonzero(np.any(a, axis=0))[0]
array([ 1, 2500, 5000])
Timing
def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]
>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop
def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    return dupl

>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop
This solution using the numpy_indexed package has complexity O(n log n) and is fully vectorized; so not terribly different from C performance, in all likelihood.
import numpy_indexed as npi
dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures – by this I mean NumPy arrays are intended to be created inside the NumPy system and then operations sent in to the NumPy subsystem to deliver a result. i.e. NumPy should be a black box into which you throw requests and out come results.
So given the code above I am not at all surprised that NumPy performance is worse than dreadful.
The following should be effectively what you want, I believe, but done the NumPy way:
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1

print([np.where(a == vect)[0] for a in vect][1])
# Delivers [1, 2500, 5000]
Approach #1
You can simulate that iterator dependency criterion for a vectorized solution using a triangular matrix. This is based on this post that dealt with multiplication involving iterator dependency. To perform the elementwise equality of each element in vect against all its elements, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it's supposed to be very efficient for summing over boolean arrays.
So, we would have a solution like so -
mask = np.triu(vect[:,None] == vect,1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]
If you only care about the count counter, we could have two more approaches as listed next.
Approach #2
We can avoid the use of the triangular matrix and simply get the entire count, subtract the contribution from the diagonal elements, and consider just one of either the lower or upper triangular regions by halving the remaining count, as the contributions from either one would be identical.
So, we would have a modified solution like so -
counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2
Approach #3
Here's an entirely different approach that uses the fact that the count of each unique element contributes, through a cumulative sum, to the final total.
So, with that idea in mind, we would have a third approach like so -
count = np.bincount(vect)  # OR np.unique(vect, return_counts=True)[1]
idx = count[count > 1]
id_arr = np.ones(idx.sum(), dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1
counter = np.sum(id_arr.cumsum())
As an alternative to Ami Tavory's answer, you can use a Counter from the collections package to detect duplicates. On my computer it seems to be even faster. See the function below, which can also find different duplicates.
import collections
import numpy as np

def find_duplicates_original(x):
    d = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                d.append(j)
    return d

def find_duplicates_outer(x):
    a = np.equal.outer(x, x)
    np.fill_diagonal(a, 0)
    return np.flatnonzero(np.any(a, axis=0))

def find_duplicates_counter(x):
    counter = collections.Counter(x)
    values = (v for v, c in counter.items() if c > 1)
    return {v: np.flatnonzero(x == v) for v in values}

n = 10000
x = np.arange(float(n))
x[n // 2] = 1
x[n // 4] = 1
>>> find_duplicates_counter(x)
{1.0: array([   1, 2500, 5000], dtype=int64)}
>>> %timeit find_duplicates_original(x)
1 loop, best of 3: 12 s per loop
>>> %timeit find_duplicates_outer(x)
10 loops, best of 3: 84.3 ms per loop
>>> %timeit find_duplicates_counter(x)
1000 loops, best of 3: 1.63 ms per loop
This runs in 8 ms compared to 18 s for your code and doesn't use any strange libraries. It's similar to the approach by @vs0, but I like defaultdict more. It should be approximately O(N).
from collections import defaultdict

dupl = []
counter = 0
indexes = defaultdict(list)
for i, e in enumerate(vect):
    indexes[e].append(i)
    if len(indexes[e]) > 1:
        dupl.append(i)
        counter += 1
I wonder why whatever I tried Python is 100x or more slower than an equivalent C code.
Because Python programs are usually 100x slower than C programs.
You can either implement critical code paths in C and provide Python-C bindings, or change the algorithm. You can write an O(N) version by using a dict that reverses the array from value to index.
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1
dupl = {}
print("init done")
counter = 0
for i in range(N):
    e = dupl.get(vect[i], None)
    if e is None:
        dupl[vect[i]] = [i]
    else:
        e.append(i)
        counter += 1
print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])
Edit:
If you need to test against an eps with abs(vect[i] - vect[j]) < eps you can then normalize the values up to eps
abs(vect[i] - vect[j]) < eps ->
abs(vect[i] - vect[j]) / eps < (eps / eps) ->
abs(vect[i]/eps - vect[j]/eps) < 1
int(abs(vect[i]/eps - vect[j]/eps)) = 0
Like this:
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1
dupl = {}
print("init done")
counter = 0
eps = 0.01
for i in range(N):
    k = int(vect[i] / eps)
    e = dupl.get(k, None)
    if e is None:
        dupl[k] = [i]
    else:
        e.append(i)
        counter += 1
print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])

Efficient Way to Recursively Multiply

I'm creating N_MC paths of simulated stock prices S with n points in each path, excluding the initial point. The algorithm to do so is recursive on the previous value of the stock price, for a given path. Here's what I have now:
import numpy as np
import time

N_MC = 1000
n = 10000
S = np.zeros((N_MC, n+1))
S0 = 1.0
S[:, 0] = S0

start_time_normals = time.clock()
Z = np.exp(np.random.normal(size=(N_MC, n)))
print "generate normals time = ", time.clock() - start_time_normals

start_time_prices = time.clock()
for i in xrange(N_MC):
    for j in xrange(1, n+1):
        S[i, j] = S[i, j-1]*Z[i, j-1]
print "prices time = ", time.clock() - start_time_prices
The times were:
generate normals time = 1.07
prices time = 9.98
Is there a much more efficient way to generate the arrays S, perhaps using Numpy's routines? It would be nice if the normal random variables Z could be generated more quickly, too, but I'm not as hopeful.
It's not necessary to loop over 'paths', because they're independent of each other. So, you can remove the outer loop for i in xrange(N_MC) and just operate on entire columns of S and Z.
For accelerating the recursive computation, let's just consider a single 'path'. Say z is a vector containing the random values at each timestep (all known ahead of time), s is a vector that should contain the output at each timestep, s0 is the initial output at time zero, and j is time.
Your code defines the output recursively:
s[j] = s[j-1]*z[j-1]
Let's expand this:
s[1] = s[0]*z[0]
s[2] = s[1]*z[1]
     = s[0]*z[0]*z[1]
s[3] = s[2]*z[2]
     = s[0]*z[0]*z[1]*z[2]
s[4] = s[3]*z[3]
     = s[0]*z[0]*z[1]*z[2]*z[3]
Each output s[j] is given by s[0] times the product of the random values from 0 to j-1. You can calculate cumulative products like this using numpy.cumprod(), which should be much more efficient than looping:
s = np.concatenate(([s0], s0 * np.cumprod(z[0:-1])))
You can use the axis parameter for operating along one dimension of a matrix (e.g. for doing this in parallel across 'paths').
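Applied to the original arrays from the question, a minimal sketch of that idea (assuming the same N_MC, n, S0 and Z as in the question) could look as follows; np.cumprod along axis=1 replaces the double loop over paths and timesteps.

import numpy as np

N_MC, n, S0 = 1000, 10000, 1.0
Z = np.exp(np.random.normal(size=(N_MC, n)))

# Column 0 is S0 for every path; columns 1..n are S0 times the
# cumulative product of Z along each row (axis=1).
S = np.empty((N_MC, n + 1))
S[:, 0] = S0
S[:, 1:] = S0 * np.cumprod(Z, axis=1)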

Sparse matrix multiplication when results' sparsity is known (in python|scipy|cython)

Suppose we want to compute C=A*B for given sparse matrices A,B but are interested in a very small subset of entries of C, represented by a list of index pairs:
rows=[i1, i2, i3 ... ]
cols=[j1, j2, j3 ... ]
Both A and B are quite large (say 50Kx50K), but very sparse (<1% of entries is non-zero).
How can we compute this subset of the multiplication?
Here's a naive implementation that works really slow:
def naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        v = A.getrow(rows[n]) * B.getcol(cols[n])
        vals.append(v[0, 0])
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))),
                       shape=(A.shape[0], B.shape[1]), dtype=np.float64)
    return R
even for small matrices this is quite bad:
import scipy.sparse as sps
import numpy as np
D = 1000
A = np.random.randn(D, D)
A[np.abs(A) > 0.1] = 0
A = sps.csr_matrix(A)
B = np.random.randn(D, D)
B[np.abs(B) > 0.1] = 0
B = sps.csr_matrix(B)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
naive(A, B, rows, cols)
On my machine, naive() finishes after 1 minute, and most of the effort is spent on structuring the rows/cols (in getrow(), getcol()).
Of course, converting this (very small) example to dense matrices, the computation takes about 100ms:
A0 = np.array(A.todense())
B0 = np.array(B.todense())
X0 = np.array(X.todense())
A0.dot(B0) * X0
Any thoughts on how to efficiently compute such matrix multiplication?
Note: This question is almost identical to the following question:
Subset of a matrix multiplication, fast, and sparse
However, there, A and B are full matrices, and one of the dimensions is very low (say, 10); the proposed solutions seem to benefit from both.
The format of your sparse matrices is important here. You always need a row from A and a column from B. So, store A as a csr and B as a csc matrix to get rid of the getrow/getcol overhead. Unfortunately, this is only a small part of the story.
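A minimal sketch of that storage choice (my own illustration, assuming the A, B, rows, cols from the question, with A converted to CSR and B to CSC): the non-zero pattern and values of a row of A or a column of B can then be read directly from the indptr/indices/data arrays, without the getrow/getcol overhead. The helper name dot_at is made up for illustration.

import numpy as np
import scipy.sparse as sps

def dot_at(A_csr, B_csc, rows, cols):
    # Compute C[i, j] = A[i, :] @ B[:, j] only for the requested (i, j) pairs.
    vals = []
    for i, j in zip(rows, cols):
        a_idx = A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i+1]]
        a_val = A_csr.data[A_csr.indptr[i]:A_csr.indptr[i+1]]
        b_idx = B_csc.indices[B_csc.indptr[j]:B_csc.indptr[j+1]]
        b_val = B_csc.data[B_csc.indptr[j]:B_csc.indptr[j+1]]
        # Match the common inner indices of the two sparse vectors
        common, ia, ib = np.intersect1d(a_idx, b_idx, return_indices=True)
        vals.append(np.dot(a_val[ia], b_val[ib]))
    return np.array(vals)

# e.g. dot_at(sps.csr_matrix(A), sps.csc_matrix(B), rows, cols)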
The best solution depends a lot on the structure of your sparse matrix (a lot of sparse columns/rows, etc.), but you might try one based on dictionaries and sets. For matrix A, the following are kept for each row:
a set with all non-zero column indices on that row
a dictionary with the non-zero indices as keys and the corresponding non-zero values as values
For matrix B similar dicts and sets are kept for each column.
To calculate element (M, N) in the multiplication result, row M of A is multiplied with column N of B. The multiplication:
find the set intersection of the non-zero sets
calculate the sum of multiplications of the non-zero elements (i.e. the intersection above)
In most cases this should be very fast, as in a sparse matrix the set intersection is usually very small.
Some code:
class rowarray():
    def __init__(self, arr):
        self.rows = []
        for row in arr:
            nonzeros = np.nonzero(row)[0]
            nzvalues = { i: row[i] for i in nonzeros }
            self.rows.append((set(nonzeros), nzvalues))

    def __getitem__(self, key):
        return self.rows[key]

    def __len__(self):
        return len(self.rows)

class colarray(rowarray):
    def __init__(self, arr):
        rowarray.__init__(self, arr.T)

def maybe_less_naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        nz1, v1 = A[rows[n]]
        nz2, v2 = B[cols[n]]
        # list of common non-zeros
        nz = nz1.intersection(nz2)
        # sum of non-zeros
        vals.append(sum([v1[i]*v2[i] for i in nz]))
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))),
                       shape=(len(A), len(B)), dtype=np.float64)
    return R
D = 1000
Ap = np.random.randn(D, D)
Ap[np.abs(Ap) > 0.1] = 0
A = rowarray(Ap)
Bp = np.random.randn(D, D)
Bp[np.abs(Bp) > 0.1] = 0
B = colarray(Bp)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
maybe_less_naive(A, B, rows, cols)
This is a bit more efficient: the multiplication takes approximately 2 seconds for the test (80,000 elements). The results seem to be essentially the same.
A few comments on the performance.
There are two operations performed for each output element:
set intersection
multiplication
The complexity of set intersection should be O(min(m,n)) where m and n are the numbers of non-zeros in each operand. This is invariant of the size of the matrix, only the average number of non-zeros per row/column is important.
The number of multiplications (and dict lookups) depends on the number of non-zeros found in the intersection above.
If both matrices have randomly distributed non-zeros with probability (density) p, and the row/column length is n, then:
set intersection: O(np)
dictionary lookup, multiplication: O(np^2)
This shows that with really sparse matrices finding the intersections is the critical point. This can also be verified by profiling; most of the time is spent calculating the intersections.
When this is reflected to the real world, we seem to spend around 20 us per row/column of 80 non-zeros. This is not blindingly fast, and the code can certainly be made faster. Cython may be one solution, but this may be one of those problems where Python is not the best possible solution. A simple linear matching (a merge-sort-type algorithm) for sorted integers, sketched below, should be at least an order of magnitude faster when written in C.
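For illustration, here is a sketch of that merge-style matching in Python (the answer suggests C for real speed), assuming the non-zero indices of each row/column are kept as sorted arrays alongside their values rather than as sets and dicts:

def sorted_intersection_dot(idx1, val1, idx2, val2):
    # Linear merge of two sorted index arrays; multiply values on matches.
    i = j = 0
    total = 0.0
    while i < len(idx1) and j < len(idx2):
        if idx1[i] < idx2[j]:
            i += 1
        elif idx1[i] > idx2[j]:
            j += 1
        else:
            total += val1[i] * val2[j]
            i += 1
            j += 1
    return total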
One important thing to note is that the algorithm can be done in parallel for several elements at a time. There is no need to settle for a single thread, as the calculations are independent as far as one thread handles one output point.
