Apply slicing, conditionals to Sparse Arrays with Parallelization in Python - python

Apply slicing, conditionals to Sparse Arrays with Parallelization
I want to do something like dynamic programming on a sparse array.
Could you check the following example function, which I would like to implement for a sparse array?
(The first example is for numpy.array.)
First, importing the modules:
from numba import jit
import numpy as np
from scipy import sparse as sp
from numba import prange
Then the first example:
@jit(parallel=True, nopython=True)
def mytest_csc(inptmat):
    something = np.zeros(inptmat.shape[1])
    for i in prange(inptmat.shape[1]):
        partmat = inptmat[:, i]  # i-th column
        for j in range(len(partmat)):
            if partmat[j] > 0:
                new_val = partmat[j] / (partmat[j] + something[j])
                # note: `counter` is not defined anywhere in this post
                target = (something[j] + new_val) / (counter + 1)
                something[i] = target
    return something
In the above function, the following are done:
slicing/indexing of the input array
addition and multiplication
nested for-loop
parallelization with Numba's prange
Here is my question: how can I implement this for a sparse array such as scipy.sparse.csc_matrix?
The following is what I have tried.
This function accepts either an np.array or a scipy.sparse.csc_matrix as input, but it cannot be parallelized...
def mytest_csc2(inptmat):
    something = np.zeros(inptmat.shape[1])
    for i in prange(inptmat.shape[1]):
        partmat = inptmat[:, i]
        for j in range(len(partmat)):
            if partmat[j] > 0:
                new_val = partmat[j] / (partmat[j] + something[j])
                target = (something[j] + new_val) / (counter + 1)
                something[i] = target
    return something
Parallelization is a must.
Here are the speeds of the above functions.
In the example I made a 100×100 matrix, but in fact I need to process a significantly bigger matrix, like 100000×100000, so I can't avoid a sparse array...
inptmat=np.zeros((100,100)) #test input matrix,normal numpy array
16.1 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
inptmat=sp.csc_matrix(inptmat) #test input matrix,scipy.sparse.csc_matrix
1.39 s ± 70.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I need to optimize the second test function so that it runs as fast as possible, like the Numba example.
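One common way to loop over a CSC matrix without slicing it column by column is to work on its three underlying 1-D arrays, which scipy.sparse.csc_matrix exposes as .data, .indices and .indptr. The sketch below is only an illustration of that access pattern, not the function from the question: the name weighted_positive_column_sums, the row_weight vector and the small hand-written matrix are assumptions made for the example. The body uses only loops over plain NumPy arrays, which is the kind of code Numba's nopython mode is generally able to compile, and the outer loop writes to a different output slot per column.

```python
import numpy as np

def weighted_positive_column_sums(data, indices, indptr, row_weight):
    """For each CSC column, sum value * row_weight[row] over positive stored values."""
    n_cols = len(indptr) - 1
    out = np.zeros(n_cols)
    for i in range(n_cols):  # columns are independent of each other
        # stored entries of column i live in data[indptr[i]:indptr[i + 1]]
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]   # row index of this stored entry
            v = data[p]      # value of this stored entry
            if v > 0:
                out[i] += v * row_weight[j]
    return out

# CSC layout of [[1, 0, 0], [0, 0, 2], [3, 0, -4]], written out by hand.
data = np.array([1.0, 3.0, 2.0, -4.0])
indices = np.array([0, 2, 1, 2])
indptr = np.array([0, 2, 2, 4])
row_weight = np.array([1.0, 10.0, 100.0])  # hypothetical per-row weights

result = weighted_positive_column_sums(data, indices, indptr, row_weight)
```

Only the stored entries are visited, so the cost scales with the number of non-zeros rather than with the full matrix size. Note that the function in the question reads something[j] while other columns write something[i], so its columns are not independent in the same way.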


Multiplying matrices over the GF4 field takes a very long time

Multiplying large matrices takes a very long time. How can this problem be solved? I use the galois library and numpy, and I think it should still work stably. I tried to implement my own GF4 arithmetic and multiply the matrices using numpy, but it takes even longer. Thank you for your reply.
When r = 2, 3, 4, 5, 6 the multiplication is quick; for larger r it takes a long time. To me, these are not very large matrices. This is just a code snippet: I get the sizes n, k of the matrices of a certain family given r, and I need to multiply matrices with those parameters.
import numpy as np
import galois
def family_Hamming(q, r):
    n = int((q**r-1)/(q-1))
    k = int((q**r-1)/(q-1)-r)
    res = (n, k)
    return res
q = 4
r = 7
n, k = family_Hamming(q, r)
GF = galois.GF(2**2)
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))
c = np.dot(a, b)
I'm not sure if it is actually faster, but np.dot should be used for the dot product of two vectors; for matrix multiplication use A @ B. That's as efficient as you can get with Python as far as I know.
I'm the author of galois. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.
I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.
In [1]: import galois
In [2]: GF = galois.GF(2**2)
In [3]: A = GF.Random((300, 400), seed=1)
In [4]: B = GF.Random((400, 500), seed=2)
# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
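For readers who want to see what the underlying arithmetic looks like, here is a minimal sketch of GF(4) matrix multiplication in plain NumPy using a lookup table. It is not how galois implements it and makes no claim about speed; the names GF4_MUL and gf4_matmul are made up for this example. It assumes the usual integer encoding of GF(4) = GF(2)[x]/(x^2 + x + 1), with 2 standing for x and 3 for x + 1, so that addition is bitwise XOR.

```python
import numpy as np

# Multiplication table of GF(4); entry [u, v] is u * v in the field.
GF4_MUL = np.array([[0, 0, 0, 0],
                    [0, 1, 2, 3],
                    [0, 2, 3, 1],
                    [0, 3, 1, 2]], dtype=np.uint8)

def gf4_matmul(a, b):
    """Multiply two GF(4) matrices stored as uint8 arrays with entries in {0, 1, 2, 3}."""
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.uint8)
    for t in range(a.shape[1]):
        # "outer product" of column t of a with row t of b via table lookup,
        # accumulated with XOR because addition in GF(4) is XOR
        c ^= GF4_MUL[a[:, t][:, None], b[t, :][None, :]]
    return c

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=(5, 6), dtype=np.uint8)
b = rng.integers(0, 4, size=(6, 7), dtype=np.uint8)
c = gf4_matmul(a, b)
```

The Python-level loop runs only over the inner dimension; each step is a vectorized lookup over the whole output matrix, which allocates one temporary of the output's size.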
Try using jax on a CUDA runtime. For example, you can try it out on Google Colab's free GPU. (Open a notebook -> Runtime -> Change runtime type -> GPU).
import jax.numpy as jnp
from jax import device_put
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))
a, b = device_put(a), device_put(b)
c = jnp.dot(a, b)
c = np.asarray(c)
Timing test:
%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How to remove nested for loop?

I have the following nested loop:
sum_tot = 0.0
for i in range(len(N)-1):
    for j in range(len(N)-1):
        sum_tot = sum_tot + N[i]**2*N[j]**2*W[i]*W[j]*x_i[j][-1] / (N[j]**2 - x0**2) *(z_i[i][j] - z_j[i][j])*x_j[i][-1] / (N[i]**2 - x0**2)
It's basically a mathematical function that has a double summation. Each sum goes up to the length of N. I've been trying to figure out if there was a way to write this without using a nested for-loop in order to reduce computational time. I tried using list comprehension, but the computational time is similar if not the same. Is there a way to write this expression as matrices to avoid the loops?
Note that range will stop at len(N)-2 given your current loop: range goes up to but not including its argument. You probably meant to write for i in range(len(N)).
It's also difficult to speed up a summation: the time it takes depends on the number of terms computed, so if you write it a different way that still involves the same number of terms, it will take just as long. However, O(n^2) isn't exactly bad: it looks like the best you can do in this situation unless you find a mathematical simplification of the problem.
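As an illustration of what such a mathematical simplification can look like, here is a small sketch with a toy summand (i/2 + j/3, chosen only for the example, not the expression from the question). When the summand is a sum of a term in i and a term in j, the double sum separates into two single sums, which reduces the work from O(n^2) to O(n). The function names are made up for this sketch.

```python
def double_sum_loops(n):
    # straightforward O(n^2) double loop
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += i / 2 + j / 3
    return total

def double_sum_separated(n):
    # sum_i sum_j (f(i) + g(j)) = n * sum_i f(i) + n * sum_j g(j)
    sum_f = sum(i / 2 for i in range(n))
    sum_g = sum(j / 3 for j in range(n))
    return n * sum_f + n * sum_g

n = 200
slow = double_sum_loops(n)
fast = double_sum_separated(n)
```

The expression in the question does not separate this way, because the factor (z_i[i][j] - z_j[i][j]) couples i and j.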
You might consider checking this post to gather ways to write out the summation in a neater fashion.
@Kraigolas makes valid points. But let's try a few benchmarks on a dummy, doubly nested operation anyway. (Hint: Numba might help you speed things up.)
Note, I would avoid numpy arrays specifically because all of the cross-product between the range is going to be in memory at once. If this is a massive range, you may run out of memory.
Nested for loops
n = 5000
s1 = 0
for i in range(n):
    for j in range(n):
        s1 += (i/2) + (j/3)
#2.26 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension
n = 5000
s2 = 0
s2 = sum([i/2+j/3 for i in range(n) for j in range(n)])
#3.2 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Itertools product
from itertools import product
n = 5000
s3 = 0
for i,j in product(range(n),repeat=2):
    s3 += (i/2) + (j/3)
#2.35 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: When using Numba, you should run the code at least once before timing it, because the first call compiles the code and is therefore slow. The real speedup comes from the second run onwards.
Numba njit (SIMD)
from numba import njit
@njit
def f(n):
    s = 0
    for i in range(n):
        for j in range(n):
            s += (i/2) + (j/3)
    return s
s4 = f(n)
#29.4 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba njit parallel with prange
An excellent suggestion by @Tim, added to the benchmarks
from numba import njit, prange
@njit(parallel=True)
def f(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            s += (i/2) + (j/3)
    return s
s5 = f(n)
#21.8 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Significant boost up with Numba as expected. Maybe try that?
To convert this to matrix calculations, I would suggest combining some terms first.
If these objects are not numpy arrays, it's better to convert them to numpy arrays, as they support element-wise operations.
To convert, simply do
import numpy
N = numpy.array(N)
W = numpy.array(W)
x_i = numpy.array(x_i)
x_j = numpy.array(x_j)
z_i = numpy.array(z_i)
z_j = numpy.array(z_j)
common_terms = N**2*W/(N**2-x0**2)
i_terms = common_terms*x_j[:,-1]
j_terms = common_terms*x_i[:,-1]
i_j_matrix = z_i - z_j
sum_output = (i_terms.reshape((1,-1)) @ i_j_matrix @ j_terms.reshape((-1,1)))[0,0]
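A quick way to gain confidence in a rewrite like this is to compare it against the double loop on small random inputs. The sketch below does that; the array sizes, the random data and the value of x0 are arbitrary assumptions made for the check, and the loop runs over the full range(len(N)) rather than len(N)-1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x0 = 0.5
N = rng.uniform(1.0, 2.0, n)   # kept away from x0 so N**2 - x0**2 is never zero
W = rng.uniform(0.0, 1.0, n)
x_i = rng.normal(size=(n, 3))
x_j = rng.normal(size=(n, 3))
z_i = rng.normal(size=(n, n))
z_j = rng.normal(size=(n, n))

# Reference: the double loop from the question.
loop_sum = 0.0
for i in range(n):
    for j in range(n):
        loop_sum += (N[i]**2 * N[j]**2 * W[i] * W[j]
                     * x_i[j][-1] / (N[j]**2 - x0**2)
                     * (z_i[i][j] - z_j[i][j])
                     * x_j[i][-1] / (N[i]**2 - x0**2))

# Matrix form: row vector of i-terms, matrix of (i, j) terms, column vector of j-terms.
common = N**2 * W / (N**2 - x0**2)
i_terms = common * x_j[:, -1]
j_terms = common * x_i[:, -1]
matrix_sum = i_terms @ (z_i - z_j) @ j_terms
```

With 1-D operands on both sides, the @ chain returns a scalar directly, so no reshape or [0,0] indexing is needed.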

Optimizing while cycle with numba for error tolerance

I have a question about using numba for optimization. I am coding a fixed point iteration to calculate the value of a certain array, named gamma, which satisfies the equation f(gamma)=gamma. I am trying to optimize this function with the Python package Numba. It looks as follows.
def fixed_point(gamma_guess):
    for i in range(17):
        ...  # loop body missing from the post
    return gamma_guess
Numba is capable of optimizing this function well, because it knows how many times it will perform the operation (17 times), and it works fast. But I need to control the error tolerance of my desired gamma; I mean, the difference between one gamma and the next one obtained by the fixed point iteration should be less than some number epsilon=0.01. So I tried
def fixed_point(gamma_guess):
    ...  # while loop missing from the post
    return gamma_guess
It also works and calculates the desired result, but it is much slower than the previous implementation. I think it is because Numba cannot optimize the while loop well, since we do not know when it will stop. Is there a way I can optimize this so that it runs as fast as the previous implementation?
Here is the f that I'm using
from scipy import fftpack as sp
def f(gammaa,z,zal,kappa):
    ...  # setup lines missing from the post
    for i in range(N):
        for j in range(N):
            if (abs(j-i))%2 ==1:
                ...  # body missing from the post
    return gamma0
I always use np.ones(2048)*0.5 as the initial guess, and the other parameters that I pass to my function are z=np.cos(alphas)+1j*(np.sin(alphas)+0.1), zal=-np.sin(alphas)+1j*np.cos(alphas), kappa=np.ones(2048) and alphas=np.arange(0,2*np.pi,2*np.pi/2048).
I made a small test script, to see if I could reproduce your error:
import numba as nb
from IPython import get_ipython
ipython = get_ipython()
@nb.njit
def f(x):
    return (x+1)/x
def fixed_point_for(x):
    for _ in range(17):
        x = f(x)
    return x
@nb.njit
def fixed_point_for_nb(x):
    for _ in range(17):
        x = f(x)
    return x
def fixed_point_while(x):
    x_old = x
    error = 1.0  # any value above the tolerance, so the loop starts
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
@nb.njit
def fixed_point_while_nb(x):
    x_old = x
    error = 1.0  # any value above the tolerance, so the loop starts
    while error>0.01:
        x = f(x)
        error = abs(x_old-x)
        x_old = x
    return x
print("for loop without numba:")
ipython.magic("%timeit fixed_point_for(10)")
print("for loop with numba:")
ipython.magic("%timeit fixed_point_for_nb(10)")
print("while loop without numba:")
ipython.magic("%timeit fixed_point_while(10)")
print("while loop with numba:")
ipython.magic("%timeit fixed_point_while_nb(10)")
As I don't know your f, I just used the simplest stabilizing function that I could think of. I then ran tests with and without numba, each time with for and while loops. The results on my machine are:
for loop without numba:
3.35 µs ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for loop with numba:
282 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop without numba:
1.86 µs ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop with numba:
214 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The following thoughts arise:
It can't be that your function is not optimizable, since your for loop is fast (at least you said so; have you tested without numba?).
It could be that your function takes many more iterations to converge than you might think.
We are using different software versions. My versions are:
numba 0.49.0
numpy 1.18.3
python 3.8.2
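To check the second point, it helps to have the iteration report how many steps it actually took. The following is a minimal pure-Python sketch using the same toy function as the test script; the tol and max_iter parameters and the returned iteration count are additions made for this example, not part of the original code.

```python
def f(x):
    return (x + 1) / x  # its positive fixed point is the golden ratio

def fixed_point(x, tol=0.01, max_iter=1000):
    """Iterate x <- f(x) until successive iterates differ by less than tol."""
    for n_iter in range(1, max_iter + 1):
        x_new = f(x)
        if abs(x_new - x) < tol:
            return x_new, n_iter
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

value, n_iter = fixed_point(10.0)
tight_value, tight_n_iter = fixed_point(10.0, tol=1e-8)
```

Comparing n_iter with the fixed count of 17 shows whether the tolerance-based loop is simply doing more work per call. The max_iter cap also guards against a loop that never reaches the tolerance.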

Logarithm of two dimensional array in Python

I have an array of two dimensional arrays named matrices. Each matrix in there is of dimension 1000 x 1000 and consists of positive values. Now I want to take the log of all values in all the matrices (except for 0). How do I do this easily in python? I have the following code that does what I want, but knowing Python this can be made more brief:
newMatrices = []
for matrix in matrices:
    newMaxtrix = []
    for row in matrix:
        newRow = []
        for value in row:
            if value > 0:
You can convert it into a numpy array and use numpy.log to calculate the values.
For a 0 value, the result will be -Inf. After that you can convert it back to a list and replace the -Inf with 0.
Or you can use where in numpy
res = where(arr != 0, log2(arr), 0)
The result is 0 for all zero elements. Note that log2 is still evaluated on every element first, so NumPy may emit a divide-by-zero warning.
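A related option is the where and out arguments that NumPy ufuncs such as np.log accept: the logarithm is then computed only where the condition holds, and the other positions keep whatever the output array already contained. This is a small sketch on a made-up 2 x 2 array; the input must already be a float array so that the zero-filled output has a float dtype.

```python
import numpy as np

arr = np.array([[0.0, 1.0], [np.e, 4.0]])

# Positions where arr > 0 receive log(arr); all others keep the 0 from `out`.
res = np.log(arr, out=np.zeros_like(arr), where=arr > 0)
```

Because the zero entries are never passed to the logarithm, no divide-by-zero warning is raised.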
While @Amadan's answer is certainly correct (and much shorter/more elegant), it may not be the most efficient in your case (depends a bit on the input, of course), because np.where() will generate an integer index for each matching value. A more efficient approach would be to generate a boolean mask. This has two advantages: (1) it is typically more memory efficient, and (2) the [] operator is typically faster on masks than on integer lists.
To illustrate this, I reimplemented both the np.where()-based and the mask-based solution on a toy input (but with the correct sizes).
I have also included an np.log.at()-based solution, which is also quite inefficient.
import numpy as np
def log_matrices_where(matrices):
    return [np.where(matrix > 0, np.log(matrix), 0) for matrix in matrices]
def log_matrices_mask(matrices):
    arr = np.array(matrices, dtype=float)
    mask = arr > 0
    arr[mask] = np.log(arr[mask])
    arr[~mask] = 0  # if the values are always positive this is not needed
    return [x for x in arr]
def log_matrices_at(matrices):
    arr = np.array(matrices, dtype=float)
    np.log.at(arr, arr > 0)
    arr[~(arr > 0)] = 0  # if the values are always positive this is not needed
    return [x for x in arr]
N = 1000
matrices = [
    np.arange((N * N)).reshape((N, N)) - N
    for _ in range(2)]
(some sanity check to make sure we are doing the same thing)
# check that the result is the same
print(all(np.all(np.isclose(x, y)) for x, y in zip(log_matrices_where(matrices), log_matrices_mask(matrices))))
# True
print(all(np.all(np.isclose(x, y)) for x, y in zip(log_matrices_where(matrices), log_matrices_at(matrices))))
# True
And the timings on my machine:
%timeit log_matrices_where(matrices)
# 33.8 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit log_matrices_mask(matrices)
# 11.9 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit log_matrices_at(matrices)
# 153 ms ± 831 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
EDIT: additionally included the np.log.at() solution and a note on zeroing out the values for which log is not defined
Another alternative using numpy:
arr = np.ndarray((1000,1000))
np.log.at(arr, np.nonzero(arr))
As simple as...
import numpy as np
newMatrices = [np.where(matrix != 0, np.log(matrix), 0) for matrix in matrices]
No need to worry about rows and columns, numpy takes care of it. No need to explicitly iterate over matrices in a for loop when a comprehension is readable enough.
EDIT: I just noticed OP had log, not log2. Not really important for the shape of the solution (though likely very important to not getting a wrong answer :P )
As suggested by @R.yan,
you can try something like this.
import numpy as np
newMatrices = []
for matrix in matrices:
    newMaxtrix = []
    for row in matrix:
        newRow = []
        for value in row:
            if value > 0:
newArray = np.asarray(newMatrices)
logVal = np.log(newArray)

Setting structured array field in Numba

I would like to set an entire field of a NumPy structured scalar from within a Numba compiled nopython function. The desired_fn in the code below is a simple example of what I would like to do, and working_fn is an example of how I can currently accomplish this task.
import numpy as np
import numba as nb
test_numpy_dtype = np.dtype([("blah", np.int64)])
test_numba_dtype = nb.from_dtype(test_numpy_dtype)
def working_fn(thing):
    for j in range(len(thing)):
        thing[j]['blah'] += j
def desired_fn(thing):
    thing['blah'] += np.arange(len(thing))
a = np.zeros(3,test_numpy_dtype)
The error generated from running desired_fn(a) is:
numba.errors.InternalError: unsupported array index type const('blah') in [const('blah')]
[1] During: typing of staticsetitem at /home/sam/PycharmProjects/ChessAI/ (938)
This is needed for extremely performance critical code, and will be run billions of times, so eliminating the need for these types of loops seems to be crucial.
The following works (numba 0.37):
def desired_fn(thing):
    thing.blah[:] += np.arange(len(thing))
    # or
    # thing['blah'][:] += np.arange(len(thing))
If you are operating primarily on columns of your data instead of rows, you might consider using a different data container. A numpy structured array is laid out like a vector of structs rather than a struct of arrays. This means that when you want to update blah, you are moving through non-contiguous memory space as you traverse the array.
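The layout difference can be seen directly from the strides NumPy reports. In the sketch below, the second field "other" is a hypothetical addition (the dtype in the question has only the single field "blah", in which case the field values are contiguous); with more than one field, consecutive "blah" values are separated by the size of the whole record.

```python
import numpy as np

# Array of structs: each 16-byte record holds one "blah" and one "other".
aos = np.zeros(4, dtype=[("blah", np.int64), ("other", np.float64)])
field = aos["blah"]                     # a view into aos, not a copy

# Struct of arrays: "blah" stored on its own as a plain 1-D array.
soa_blah = np.zeros(4, dtype=np.int64)

aos_stride = field.strides[0]           # bytes between consecutive "blah" values
soa_stride = soa_blah.strides[0]
```

Here field steps over a full record each time, while soa_blah steps over a single 8-byte integer, which is what makes column-wise updates on separate arrays more cache-friendly.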
Also, with any code optimization, it's always worth using timeit or some other timing harness (one that excludes the time required to jit the code) to see what the actual performance is. You might find with numba that explicit looping, while more verbose, could actually be faster than your vectorized code.
Without numba, accessing field values is no slower than accessing columns of a 2d array:
In [1]: arr2 = np.zeros((10000), dtype='i,i')
In [2]: arr2.dtype
Out[2]: dtype([('f0', '<i4'), ('f1', '<i4')])
Modifying a field:
In [4]: %%timeit x = arr2.copy()
...: x['f0'] += 1
16.2 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Similar time if I assign the field to a new variable:
In [5]: %%timeit x = arr2.copy()['f0']
...: x += 1
15.2 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much faster if I construct a 1d array of the same size:
In [6]: %%timeit x = np.zeros(arr2.shape, int)
...: x += 1
8.01 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But similar time when accessing the column of a 2d array:
In [7]: %%timeit x = np.zeros((arr2.shape[0],2), int)
...: x[:,0] += 1
17.3 µs ± 23.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
