I am trying to use NumbaPro's cuda extension to multiply large array matrixes. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that would be fed in as a 1D matrix (thus, a.dot(numpy.diagflat(b)) which I have found to be synonymous to a * b). However, I am getting an assertion error that provides no information.
I can only avoid this assertion error if I multiply two 1D array matrixes but that is not what I want to do.
from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np
def generate_input(n):
import numpy as np
A = np.array(np.random.sample((n,n)))
B = np.array(np.random.sample(n) + 10)
return A, B
def product(a, b):
return a * b
def main():
cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)
N = 1000
A, B = generate_input(N)
D = np.empty(A.shape)
stream = cuda.stream()
with stream.auto_synchronize():
dA = cuda.to_device(A, stream)
dB = cuda.to_device(B, stream)
dD = cuda.to_device(D, stream, copy=False)
cu_product(dA, dB, out=dD, stream=stream)
dD.to_host(stream)
if __name__ == '__main__':
main()
This is what my terminal spits out:
Traceback (most recent call last):
File "cuda_vectorize.py", line 32, in <module>
main()
File "cuda_vectorize.py", line 28, in main
cu_product(dA, dB, out=dD, stream=stream)
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError
The problem is you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that it takes a scalar function as input, and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer
def generate_input(n):
A = np.array(np.random.sample((n,n)), dtype=np.float32)
B = np.array(np.random.sample(n), dtype=A.dtype)
return A, B
#cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
startX, startY = cuda.grid(2)
gridX = cuda.gridDim.x * cuda.blockDim.x;
gridY = cuda.gridDim.y * cuda.blockDim.y;
height, width = c.shape
for y in range(startY, height, gridY):
for x in range(startX, width, gridX):
c[y, x] = a[y, x] * b[x]
def main():
N = 1000
A, B = generate_input(N)
D = np.empty(A.shape, dtype=A.dtype)
E = np.zeros(A.shape, dtype=A.dtype)
F = np.empty(A.shape, dtype=A.dtype)
start = timer()
E = np.dot(A, np.diag(B))
numpy_time = timer() - start
blas = cublas.api.Blas()
start = timer()
blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
cublas_time = timer() - start
diff = np.abs(D-E)
print("Maximum CUBLAS error %f" % np.max(diff))
blockdim = (32, 8)
griddim = (16, 16)
start = timer()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dF = cuda.to_device(F, copy=False)
diagproduct[griddim, blockdim](dF, dA, dB)
dF.to_host()
cuda_time = timer() - start
diff = np.abs(F-E)
print("Maximum CUDA error %f" % np.max(diff))
print("Numpy took %f seconds" % numpy_time)
print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))
if __name__ == '__main__':
main()
The kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)), and expands the diagonal into a full matrix. The diagproduct function is smarter. It simply does a single multiply for each matrix element, and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
For very small matrices, however, CUBLAS SGEMM has an optimized path so it is closer to the CUDA performance. Here, N=100
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
Just to bounce back on all those considerations. I also wanted to implement some matrix computations on CUDA, but then heard about the numpy.einsum function.
It turns out that einsum is incredibly fast.
In a case like this, here is the code for it. But it can be applied to many types of computations.
G = np.einsum('ij,j -> ij',A, B)
In terms of speed, here are the results for N = 10000
Numpy took 8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup
Related
I have a python code that scales really badly with n so much so that for n=50 the time is in seconds but for n=1000 is many hours.
I therefore tried to use Numba to speed it up but have only been dealing with a lot of errors. I have managed to rectify the errors so far but suddenly now I have the error of "StopIteration" and that gives no hint on what bug to look for in the code.
Here is the non-Numba version (with the output which is what I am trying to get with Numba):
import numpy as np
import matplotlib.pyplot as plt
from math import *
counter=0
n=100
iter=0
np.random.seed(0)
Zinitial=np.random.normal(0,1,size=(2,n))[0,:]
np.random.seed(1)
Pinitial=np.random.normal(0,1,size=(2,n))[1,:]
SPIN=np.zeros(n,dtype=int);
SPIN[int(n/2):]=0;
SPIN[:int(n/2)]=1;
SP = np.array(sorted(np.array([np.array([i,j,k]) for i, j,k in zip(Zinitial, Pinitial,SPIN)]),
key=lambda x: x[0]))
print("Initial energy : ",(np.sum(SP[:,0]**2)+np.sum(SP[:,1]**2))/2 )
total_time=0
Tmax=10
alf=sqrt(10)
while total_time<Tmax:
T=[]
for j in range(n-1):
b=(SP[j+1,1]-SP[j,1])/(SP[j+1,0]-SP[j,0])
val1=b+sqrt(b**2+2)
val2=b-sqrt(b**2+2)
if val1>0:
T.append(val1)
else:
T.append(val2)
T=np.array(T)
dt=min(T[T>0])
total_time=dt+total_time
indix=list(T).index(dt)
SP0=SP[:,0].copy()
SP[:,0]=SP0*cos(dt)+SP[:,1]*sin(dt)
SP[:,1]=SP[:,1]*cos(dt)-SP0*sin(dt)
prel=(SP[indix,1]-SP[indix+1,1])/2
rcoeff=1/(1+(prel*alf)**2)
SP[[indix,indix+1]]=SP[[indix+1,indix]]
SP=np.array(sorted(SP,key=lambda x:x[0]))
rand_value=np.random.random()
rcoeff=1/(1+(prel*alf)**2)
if rcoeff>rand_value and SP[indix,2]!=SP[indix+1,2]:
counter=counter+1
SP[indix,2],SP[indix+1,2]=SP[indix+1,2],SP[indix,2]
print("total_time = ",total_time)
print("n= ", n)
print("rate = ", 2*counter/(n*(total_time)))
OUTPUT:
Initial energy : 95.56821008840404
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:29: RuntimeWarning: divide by zero encountered in double_scalars
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:31: RuntimeWarning: invalid value encountered in double_scalars
total_time = 10.000079819065235
n= 100
rate = 3.2079743942482555
final energy : 95.5682100884033
Here is the devil Numba version with the error output:
import numpy as np
import matplotlib.pyplot as plt
from math import *
from numba import jit
from numba import types,typed
#jit(nopython=True)
def f(SP, alf,Tmax, n):
counter=0
total_time=0
T=np.empty(0,dtype=np.float64)
while total_time<Tmax:
for j in range(n-1):
b=(SP[j+1,1]-SP[j,1])/(SP[j+1,0]-SP[j,0])
val1=b+sqrt(b**2+2)
val2=b-sqrt(b**2+2)
if val1>0:
np.append(T,val1)
else:
np.append(T,val2)
dt=min(T[T>0])
total_time=dt+total_time
indix,=np.where(T==dt)[0]
SP0=SP[:,0].copy()
SP[:,0]=SP0*cos(dt)+SP[:,1]*sin(dt)
SP[:,1]=SP[:,1]*cos(dt)-SP0*sin(dt)
prel=(SP[indix,1]-SP[indix+1,1])/2;
rcoeff=1/(1+(prel*alf)**2);
for h in range(n-1):
for z in range(SP.shape[1]):
SP[h, z], SP[h + 1, z] = SP[h + 1, z], SP[h, z]
SP=SP[SP[:, 0].argsort()]
rand_value=np.random.random()
rcoeff=1/(1+(prel*alf)**2)
if rcoeff>rand_value and SP[indix,2]!=SP[indix+1,2]:
counter=counter+1
SP[indix,2],SP[indix+1,2]=SP[indix+1,2],SP[indix,2]
rate=2*counter/(n*total_time)
energy=np.sum((SP[:,0]**2+SP[:,1]**2)/2)
return rate,energy
if __name__ == '__main__':
n=100
Tmax=10
alf=sqrt(10)
np.random.seed(0)
Zinitial=np.random.normal(0,1,size=(2,n))[0,:]
np.random.seed(1)
Pinitial=np.random.normal(0,1,size=(2,n))[1,:]
SPIN=np.zeros(n,dtype=int);
SPIN[int(n/2):]=0;
SPIN[:int(n/2)]=1;
SP = np.array(sorted(np.array([np.array([i,j,k]) for i, j,k in zip(Zinitial, Pinitial,SPIN)]),
key=lambda x: x[0]))
print("Initial energy : ",(np.sum(SP[:,0]**2)+np.sum(SP[:,1]**2))/2 )
rate,energy = f(SP, alf, Tmax, n)
print("Rate of collision per particle = ",rate)
print("Final energy : ",energy)
OUTPUT:
Initial energy : 95.56821008840404
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-49-9d8cc152e0a8> in <module>()
64
65 print("Initial energy : ",(np.sum(SP[:,0]**2)+np.sum(SP[:,1]**2))/2 )
---> 66 rate,energy = f(SP, alf, Tmax, n)
67 print("Rate of collision per particle = ",rate)
68 print("Final energy : ",energy)
StopIteration:
Pardon me for dumping huge chunks but I don't know which part has the bug.
There are many problems in the Numba code:
The result of np.append is not assigned so the instruction does nothing: np.append is not an in-place operation. This is the reason why you get a StopIteration: the expression dt=min(T[T>0]) fails because T is empty.
np.append should never be use in a loop since it create a new array for every iteration resulting in a slow quadratic execution. Use a preallocated array (best solution) or a list (less efficient).
There are out-of-bound access causing crashes and undefined behaviours. Please use the flags debug=True and boundscheck=True to track them. For example indix is set to 228 (which seems legit since T.shape is 297) while SP.shape is (100,3) so SP[indix,1] fails. Numba is design to be fast by default so it does not track out-of-bound access by default which can cause surprising behaviour since an out-of-bound is an undefined behaviour and the JIT can make crazy assumption in this case. I also strongly advise you to print the values of the variable so to check deterministic computations (ie. not random-based ones) are the same in Numba.
Note that np.random.seed does not affect the Numba seed (Numba use a separate seed that is not synchronised with Numpy). Thus, random numbers will likely differ between Numba and a pure-Numpy code.
I have a Juypter Notebook where I am working with large matrices (20000x20000). I am running multiple iterations, but I am getting an error saying that I do not have enough RAM after every iteration. If I restart the kernel, I can run the next iteration, so perhaps the Juypter Notebook is running out of RAM because it stores the variables (which aren't needed for the next iteration). Is there a way to free up RAM?
Edit: I don't know if the bold segment is correct. In any case, I am looking to free up RAM, any suggestions are welcome.
## Outputs:
two_moons_n_of_samples = [int(_) for _ in np.repeat(20000, 10)]
for i in range(len(two_moons_n_of_samples)):
# print(f'n: {two_moons_n_of_samples[i]}')
## Generate the data and the graph
X, ground_truth, fid = synthetic_data({'type': 'two_moons', 'n': two_moons_n_of_samples[i], 'fidelity': 60, 'sigma': 0.18})
N = X.shape[0]
dist_mat = sqdist(X.T, X.T)
opt = {
'graph': 'full',
'tau': 0.004,
'type': 's'
}
LS = dense_laplacian(dist_mat, opt)
## Eigenvalues and eigenvectors
tic = time.time() ## Time how long to calculate eigenvalues/eigenvectors
V, E = np.linalg.eigh(LS)
idx = np.argsort(V)
V, E = V[idx], E[:, idx]
V = V / V.max()
decomposition_time = time.time() - tic
## Initialize u0
u0 = np.zeros(N)
for j in range(len(fid[0])):
u0[fid[0][j]] = 1
for j in range(len(fid[1])):
u0[fid[1][j]] = -1
## Initialize parameters
dt = 0.05
gamma = 0.07
max_iter = 100
## Run MAP estimation
tic = time.time()
u_eg, _ = probit_optimization_eig(E, V, u0, dt, gamma, fid, max_iter)
eg_time = time.time() - tic
## Run MAP estimation with CG
tic2 = time.time()
u_cg, _ = probit_optimization_cg(LS, u0, dt, gamma, fid, max_iter)
cg_time = time.time() - tic2
## Write to file:
with open('results2_two_moons_egvscg.txt', 'a') as f:
f.write(f'{i},{two_moons_n_of_samples[i]},{decomposition_time + eg_time},{cg_time}\n')
Error:
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
~\AppData\Local\Temp\2/ipykernel_2344/941022539.py in <module>
11 'type': 's'
12 }
---> 13 LS = dense_laplacian(dist_mat, opt)
14
15 ## Eigenvalues and eigenvectors
C:/Users/\util\graph\dense_laplacian.py in dense_laplacian(dist_mat, opt)
69 D_inv_sqrt = 1.0 / np.sqrt(D)
70 D_inv_sqrt = np.diag(D_inv_sqrt)
---> 71 L = np.eye(W.shape[0]) - D_inv_sqrt # W # D_inv_sqrt
72 # L = 0.5 * (L + L.T)
73 if opt['type'] == 'rw':
MemoryError: Unable to allocate 1.07 GiB for an array with shape (12000, 12000) and data type float64
I faced the same problem, the way I solved it was -
Writing Functions wherever preprocessing is required and returning only preprocessed variables.
Deleting used huge variables just use del x
Clearing Garbage
import gc
gc.collect()
Sometimes clearing garbage doesn't helps and i used to clear the cache as well by using
import ctypes
libc = ctypes.CDLL("libc.so.6") # clearing cache
libc.malloc_trim(0)
I tried to batch my code as far as possible.
I think the best solution for you would be to batch the matrix multiplication. Libraries like TensorFlow and PyTorch does it by default, not sure about NumPy though. Check - https://www.tensorflow.org/api_docs/python/tf/linalg/matmul ( An API for matrix multiplication in batches ). Most of modern-day GPU calculations are possible due to batching !
I would suggest adding more swap space which is really easy and will probably save you more time and headache than redesigning the code to be less wasteful or trying to delete and garbage collect unnecessary objects. It would of course be slower than using ram memory since it will use the disk to simulate the extra memory needed.
Excellent answer on how to do this on ubuntu, link
I want to use cuda streams in order to speed up small calculations on the GPU. My test so far consists of the following:
import cupy as xp
import time
x = xp.random.randn(10003, 20000) + 1j * xp.random.randn(10003, 20000)
y = xp.zeros_like(x)
nStreams = 16
streams = [xp.cuda.stream.Stream() for ii in range(nStreams)]
f = xp.fft.fft(x[:,:200])
t = time.time()
for ii in range(int(x.shape[1]/100)):
ss = streams[ii % nStreams]
with ss:
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
for ii,ss in enumerate(streams):
ss.synchronize()
print(time.time()-t)
t = time.time()
for ii in range(int(x.shape[1]/100)):
y[:,ii*200:(ii+1)*200] = xp.fft.fft(x[:,ii*200:(ii+1)*200], axis=0)
xp.cuda.Stream.null.synchronize()
print(time.time()-t)
produces
[user#pc snippets]$ intelpython3 strm.py
0.019365549087524414
0.018717050552368164
which I have trouble believing that I do everything correctly. Additionally, the situation becomes even more severe when replacing the FFT-calls with calls to xp.sum, which yields
[user#pc snippets]$ intelpython3 strm.py
0.002195596694946289
0.001004934310913086
What is the rationale behind cupy streams? How do I use them to my advantage?
I have a 2-D numpy array like x = array([[ 1., 5.],[ 3., 4.]]), I have to compare each row with every other row in the matrix and create an new array of minimum values from both the rows and take the sum of minimum row and save it in a new matrix. Finally I will get a symmetric matrix.
Eg: I compare array [1,5] with itself. New 2-D array is array([[ 1., 5.],[ 1., 5.]]), I create a minimum array along axis=0 i.e [ 1., 5.] then take the sum of array which will be 6. Similarly I repeat the operation for all the rows and I end up with a 2*2 matrix array([[ 6, 5.],[ 5, 7.]]).
import numpy as np
x=np.array([[1,5],[3,4]])
y=np.zeros((len(x),len(x)))
for i in range(len(x)):
array_a=x[i]
for j in range(len(x)):
array_b=x[j]
array_c=np.array([array_a,array_b])
min_array=np.min(array_c,axis=0)
array_sum=np.sum(min_array)
y[i,j]=array_sum
My 2-D array is very big and performing above mentioned operations are taking lot of time. I am new to python so any suggestion to improve the performance will be really helpful.
The obvious improvement to save roughly half the time is to run only on i>=j indices. For elegance and some saving you can also use less variables.
import numpy as np
import time
x=np.random.randint(0, 10, (500, 500))
y=np.zeros((len(x),len(x)))
# OP version
t0 = time.time()
for i in range(len(x)):
array_a=x[i]
for j in range(len(x)):
array_b=x[j]
array_c=np.array([array_a,array_b])
min_array=np.min(array_c,axis=0)
array_sum=np.sum(min_array)
y[i,j]=array_sum
print(time.time() - t0)
z=np.zeros((len(x),len(x)))
# modified version
t0 = time.time()
for i in range(len(x)):
for j in range(i, len(x)):
z[i, j]=np.sum(np.min([x[i], x[j]], axis=0))
z[j, i] = z[i, j]
print(time.time() - t0)
# verify that the result are the same
print(np.all(z == y))
The results on my machine:
4.2974278926849365
2.746302604675293
True
The obvious way to speed up your code would be to do all the looping in numpy. I had a first solution (f2 in the code below), which would generate a matrix that contained all the combinations that need to be compared and then reduced that matrix into the final result performing the np.min and np.sum commands. Unfortunately that method is quite memory consuming and therefore becomes slow when the matrices are big, because the intermediate matrix is NxNx2xN for a NxN input matrix.
However, I found a different solution that uses one for loop (f3 below) and appears to be reasonably fast. The speed-up to the original posted by the OP is about 4 times for a 1000x1000 matrix. Here the codes with some tests:
import numpy as np
import timeit
def f(x):
y = np.zeros_like(x)
for i in range(x.shape[0]):
a = x[i]
for j in range(x.shape[1]):
b = x[j]
y[i,j] = np.sum(np.min([a,b], axis=0))
return y
def f2(x):
y = np.empty((x.shape[0],1,2,x.shape[0]))
y[:,0,0,:] = x[:,:]
y = np.repeat(y, x.shape[0],axis=1)
y[:,:,1,:] = x[:,:]
return np.sum(np.min(y,axis=2),axis=2)
def f3(x):
y = np.empty_like(x)
for i in range(x.shape[1]):
y[:,i] = np.sum(np.minimum(x[i,:],x[:,:]),axis=1)
return y
##some testing that the functions work
x = np.array([[1,5],[3,4]])
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
x = np.array([[1,7,5],[2,3,8],[5,2,4]])
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
x = np.random.randint(0,10,(100,100))
a=f(x)
b=f2(x)
c=f3(x)
print(np.all(a==b))
print(np.all(a==c))
##some speed testing:
print('-'*50)
print("speed test small")
x = np.random.randint(0,100,(100,100))
print("original")
print(min(timeit.Timer(
'f(x)',
setup = 'from __main__ import f,x',
).repeat(3,10)))
print("using np.repeat")
print(min(timeit.Timer(
'f2(x)',
setup = 'from __main__ import f2,x',
).repeat(3,10)))
print("one for loop")
print(min(timeit.Timer(
'f3(x)',
setup = 'from __main__ import f3,x',
).repeat(3,10)))
print('-'*50)
print("speed test big")
x = np.random.randint(0,100,(1000,1000))
print("original")
print(min(timeit.Timer(
'f(x)',
setup = 'from __main__ import f,x',
).repeat(3,1)))
print("one for loop")
print(min(timeit.Timer(
'f3(x)',
setup = 'from __main__ import f3,x',
).repeat(3,1)))
And here the output:
True
True
True
True
True
True
--------------------------------------------------
speed test small
original
1.3070102719939314
using np.repeat
0.15176948899170384
one for loop
0.029766165011096746
--------------------------------------------------
speed test big
original
17.505746565002482
one for loop
4.437685210024938
With other words, f2 is pretty fast for matrices that don't exhaust your memory, but especially for big matrices, f3 is the fastest that I could find.
EDIT:
Inspired by #Aguy's answer and this post, here still a modification that only computes the lower triangle of the matrix and then copies the results to the upper triangle:
def f4(x):
y = np.empty_like(x)
for i in range(x.shape[1]):
y[i:,i] = np.sum(np.minimum(x[i,:],x[i:,:]),axis=1)
i_upper = np.triu_indices(x.shape[1],1)
y[i_upper] = y.T[i_upper]
return y
The speed test for the 1000x1000 matrix now gives
speed test big
original
18.71281115297461
one for loop over lower triangle
2.0939957330119796
EDIT 2:
Here still a version that uses numba for speed up. According to this post it is better to write the loops explicitly in this case:
import numba as nb
#nb.jit(nopython=True)
def f_nb(x):
res = np.empty_like(x)
for j in range(res.shape[1]):
for i in range(j,res.shape[0]):
res[j,i] = res[i,j] = np.sum(np.minimum(x[i,:], x[j,:]))
return res
And the relevant speed tests give:
0.015975199989043176 for a 100x100 matrix
0.37946902704425156 for a 1000x1000 matrix
467.06363476096885 for a 10000x10000 matrix
The 10000x10000 speed test for f4 didn't seem to want to finish at all, so I left it out. If your matrices get much bigger than that, you might actually run into memory problems -- did you consider this?
Why can't Numba's jit compile a simple Numpy array operation?
Here is a minimal non-working example that reproduces Numba's failure to compile
import numpy as np
from numba import jit
rows = 10
columns = 999999
A = np.empty((rows, columns))
b = np.linspace(0, 1, num=rows)
#jit(nopython=True)
def replicate(A, b):
for i in range(A.shape[1]):
A[:, i] = b
return A #optional
replicate(a, b)
With the following error:
TypingError: Failed at nopython (nopython frontend)
Cannot resolve setitem: array(float64, 1d, C, nonconst)[(slice3_type, int64)] = array(float64, 1d, C, nonconst)
File "<ipython-input-32-db24fbe2922f>", line 12
Am I doing something wrong?
As an aside, I need nopython mode because in my real-life situation I need to perform array addition, multiplication with scalars and array populating with other arrays frequently. and my understanding is that in object mode I won't be able to do loop jitting and thus I won't see any real performance boost on the execution.
Numba doesn't support numpy slicing in nopython mode. Try unrolling the loops explicitly:
rows = 10
columns = 999999
a = np.empty((rows, columns))
b = np.linspace(0, 1, num=rows)
#jit(nopython=True)
def replicate(A, b):
for i in xrange(A.shape[0]):
for j in xrange(A.shape[1]):
A[i, j] = b[i]
return A #optional
replicate(a, b)