I'm completely new to Python. I'm trying to do a very simple thing: evaluate a non-trivial function that takes floats as input on a 2D mesh. The following code does exactly what I want, but it is slow, due to the double for loop.
import numpy as np
from galpy.potential import RazorThinExponentialDiskPotential
R = np.logspace(0., 2., 10)
z=R
#initialize with default values for this example
potfunc=RazorThinExponentialDiskPotential()
pot=np.zeros((R.size, z.size))
for i in range(R.size):
    for j in range(z.size):
        pot[i,j] = potfunc(R[i], z[j])
At the end, the array pot contains all the information I want, but now I want to increase the efficiency. I know that pure Python is slow, especially on loops (like IDL), so I checked np.vectorize, but it's just a Python loop under the hood.
The problem is that potfunc does not seem to accept arrays, only plain scalars.
How can I optimize this simple program?
Many thanks in advance.
The standard way to do that is to use meshgrid:
r, zz = np.meshgrid(R, z)
pot = potfunc(r, zz)
You must avoid looping over numpy arrays, or you will lose all of the vectorization efficiency.
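For a function that does accept arrays, a minimal sketch of the meshgrid approach (using a toy potential in place of the galpy one, since that is the part I cannot vectorize for you) would be:
import numpy as np

R = np.logspace(0., 2., 10)
z = R

def toy_pot(r, z):
    # stand-in for an array-aware potential; any elementwise NumPy expression works
    return -1.0 / np.sqrt(r**2 + z**2 + 1.0)

r, zz = np.meshgrid(R, z, indexing='ij')  # 'ij' matches pot[i, j] = f(R[i], z[j])
pot = toy_pot(r, zz)
print(pot.shape)  # (10, 10)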
In case you cannot vectorize the function by hand (maybe you could subclass the RazorThinExponentialDiskPotential class and rewrite the function), you could use multiprocessing. Instead of my simple worker function, you could use whatever function you like:
from multiprocessing import Pool
import numpy as np

def worker(x):
    ai, bj = x
    return ai + bj

def run_pool():
    a = np.linspace(0, 10, 10)
    b = np.logspace(0, 10, len(a))
    vec = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    p = Pool(processes=4)  # use as many processes as you have cores
    print(p.map(worker, vec))
    p.close()
    p.join()

run_pool()
But before you think about speeding things up, profiling would be good. I am pretty sure that in your case the function itself is the bottleneck. So either you rewrite it in a compiled language, vectorize it, or you use all of your cores.
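A quick sketch of how you could check that with the standard library's cProfile (assuming R, z, and potfunc are defined as in the question):
import cProfile
import pstats

def fill_pot():
    # the original double loop, wrapped in a function so it can be profiled
    pot = np.zeros((R.size, z.size))
    for i in range(R.size):
        for j in range(z.size):
            pot[i, j] = potfunc(R[i], z[j])
    return pot

cProfile.run('fill_pot()', 'pot_stats')
pstats.Stats('pot_stats').sort_stats('cumulative').print_stats(10)  # top 10 entries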
This is a basic example.
@jax.jit
def block(arg1, arg2):
    for x1 in range(cons1):
        for x2 in range(cons2):
            for x3 in range(cons3):
                --do something--
    return result
When the cons values are small, the compile time is around a minute. With larger values it is much higher, on the order of tens of minutes, and I need even larger values. What can be done?
From what I am reading, the loops are the cause. They are unrolled at compile time.
Are there any workarounds? There is also jax.lax.fori_loop, but I don't understand how to use it. There is the jax.experimental.loops module as well, but again I'm not able to understand it.
I am very new to all this. Hence, all help is appreciated.
If you can provide some examples of how to use jax loops, that will be much appreciated.
Also, what is an ok compile time? Is it ok for it to be in minutes?
In one of my examples, the compile time is 262 seconds while the remaining runs take ~0.1-0.2 seconds.
Any gain in runtime is overshadowed by the compile time.
JAX's JIT compiler flattens all Python loops. To see what I mean, take a look at this simple function run through jax.make_jaxpr, which is a way to examine how JAX's tracer interprets Python code (see Understanding Jaxprs for more):
import jax

def f(x):
    for i in range(5):
        x += i
    return x

print(jax.make_jaxpr(f)(0))
# { lambda ; a.
# let b = add a 0
# c = add b 1
# d = add c 2
# e = add d 3
# f = add e 4
# in (f,) }
Notice that the loop is flattened: every step becomes an explicit operation sent to the XLA compiler. The XLA compile time increases as you increase the number of operations in the function, so it makes sense that a triply-nested for-loop would lead to long compile times.
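As a rough illustration of how quickly the operation count grows with nesting (the exact count depends on what the loop body does), compare a triply-nested version:
def g(x):
    for i in range(10):
        for j in range(10):
            for k in range(10):
                x = x + i * j * k
    return x

# Every one of the 1000 iterations becomes its own equation in the jaxpr,
# and each of those is handed to the XLA compiler:
print(len(jax.make_jaxpr(g)(0).jaxpr.eqns))  # on the order of 1000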
So, how to address this? Well, unfortunately the answer depends on what your --do something-- is doing, so I can't guess that.
In general, the best option is to use vectorized array operations rather than loops over the values in those vectors; for example, here is a very slow way of adding two vectors:
import jax.numpy as jnp

def f_slow(x, y):
    z = []
    for xi, yi in zip(x, y):
        z.append(xi + yi)
    return jnp.array(z)
and here is a much faster way to do the same thing:
def f_fast(x, y):
    return x + y
If your operations don't lend themselves to vectorization, another option is to use lax control flow operators in place of the for loops: this will push the loop down into XLA. This can have quite good performance on CPU, but is slower on accelerators when compared to equivalent vectorized array operations.
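Since you asked for an example of the loop constructs, here is a minimal sketch of jax.lax.fori_loop, which keeps the loop inside XLA rather than unrolling it; the body function takes the loop index and a carry value and returns the updated carry:
import jax

@jax.jit
def f_loop(x):
    # equivalent to: for i in range(5): x += i
    return jax.lax.fori_loop(0, 5, lambda i, carry: carry + i, x)

print(f_loop(0))  # 10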
For more discussion on JAX and Python control flow statements (such as for, if, while, etc.), see 🔪 JAX - The Sharp Bits 🔪: Control Flow.
I am not sure if this will be the same as with numba, but it might be a similar case.
When I use the numba.jit compiler and have a big data input, I first compile the function on some small example data, and then use it.
Pseudo-code:
func_being_compiled(small_amount_of_data) # compile-only purpose
func_being_compiled(large_amount_of_data)
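A minimal, concrete sketch of that warm-up pattern with Numba (the summing function is just a placeholder for whatever you are compiling):
import numpy as np
from numba import njit

@njit
def total(arr):
    s = 0.0
    for x in arr:
        s += x
    return s

total(np.ones(10))           # small call: triggers compilation for float64 arrays
total(np.ones(10_000_000))   # large call: reuses the already compiled code
Note that JAX re-traces and re-compiles whenever the input shapes change, so a small warm-up call only helps there if it uses the same shapes as the real one.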
I want to parallelize a piece of code that resembles the following:
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
so I followed the advice here and I rewrote it as follows:
toavg=[]
gal=[]
p = mp.Pool()
def deltaz(params):
    j=params[0] # index of the galaxy
    m=params[1] # indices for which we have sampled redshifts
    gal.append(samples[m][j]-zt[j])
    return np.mean(gal)
j=(np.linspace(0,Ngal-1,Ngal).astype(int))
m=sampind
grid=[j,m]
input=itertools.product(*grid)
results = p.map(deltaz,input)
accuracy=np.mean(results)
p.close()
p.join()
but the results are not the same. In fact, sometimes they are, sometimes they're not. It doesn't seem very deterministic. Is my approach correct? If not, what should I fix? The modules that you will need to reproduce the above examples are:
import numpy as np
import multiprocess as mp
import itertools
Thank you!
The first issue I see is that you are creating a global variable gal which is being accessed by the function deltaz. However, these are not shared between the pool processes; each process gets its own copy. You will have to use shared memory if you want the processes to share this structure. This is probably why you see non-deterministic behavior.
The next issue is that the two variations are not actually computing the same thing. In the first one you take an average of each set of differences (gal) and then average those. The parallel one takes an average of whichever elements happen to have ended up in that list so far. This is nondeterministic because items are assigned to processes as they become available, and this is not necessarily predictable.
I would suggest parallelizing the outer loop over galaxies and keeping the inner loop inside the worker. To do this, you need zt and samples to both be in shared memory because they are accessed by all of the processes. This can get dangerous if you are modifying the data, but since you appear to only be reading it, it should be fine.
import numpy as np
import multiprocessing as mp
import itertools
import ctypes

#Non-parallel code
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]

#Nonparallel
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
print(toavg)

# Parallel function: reads only from the shared arrays sampArr and ztArr
def deltaz(j):
    sampind=[7,16,22,31,45]
    gal = []
    for m in sampind:
        gal.append(sampArr[m][j]-ztArr[j])
    return np.mean(gal)

# Shared array for zt
zt_base = mp.Array(ctypes.c_double, int(len(zt)), lock=False)
ztArr = np.ctypeslib.as_array(zt_base)

#Shared array for samples
sample_base = mp.Array(ctypes.c_double, int(np.prod(samples.shape)), lock=False)
sampArr = np.ctypeslib.as_array(sample_base)
sampArr = sampArr.reshape(samples.shape)

#Copy arrays to shared
sampArr[:,:] = samples[:,:]
ztArr[:] = zt[:]

with mp.Pool() as p:
    result = p.map(deltaz, np.linspace(0,Ngal-1,Ngal).astype(int))
print(result)
Here is an example that produces the same results. You can add more complexity to this as you see fit but I would read about multiprocessing in general and memory types/scopes to get an idea of what will and won't work. You have to take more care when you get into the multiprocessing world. Let me know if this doesn't help and I will try to update it so that it does.
I am learning the ways of Numba and have not figured out how to use, or whether I even need, multiprocessing.Queue to combine all my loop data from the separate processes.
Do I even want to use the multiprocessing module to break up big loops into multiple smaller ones to run in separate processes, or does Numba do this automatically?
The code below is run with the multiprocessing module, which opens multiple processes divided up across your system's core count. So there are many instances of the code running, each looping through a different segment of the overall calculation, and then the result, 0 or 1, is sent back to the parent function.
My guess is that Numba handles this differently on its own, and that I shouldn't be using a Queue or the multiprocessing module at all?
from numba import jit

@jit(nopython=True)
def prime_multiprocess(n, c, q):
    a, b, c = n[0], n[1], c
    for i in range(a, b):
        if c % i == 0:
            return q.put(0)
    return q.put(1)
This error may have been caused by the following argument(s):
- argument 2: cannot determine Numba type of <class 'multiprocessing.queues.Queue'>
I appreciate any explanation or link that explains using numba with parallel loops that speed things up.
I did some testing and it appears that a nested function solved the problem:
I rewrote it to:
def prime_multiprocess(n, c, q):
    a, b, c = n[0], n[1], c

    @jit(nopython=True)
    def speed_comp():
        for i in range(a, b):
            if c % i == 0:
                return 0
        return 1

    q.put(speed_comp())
It is faster!
Edit:
It appears there is a downside: I am limited in the size of the integers I can use. Sigh, why is there always a trade-off :(
I wonder if it's possible to work around this with numpy, and whether that would slow it down. The answer might be here: Numba support for big integers?
The way Numba works is that it converts integers into machine-level integers, which are limited to what your system supports, such as 64 bits. This is what makes it run faster, because there is no arbitrary-precision overhead on top of the calculations. Unfortunately, without that overhead slowing things down, you cannot compute with bigger integers.
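A rough illustration of that trade-off, using a hypothetical function and assuming a 64-bit platform:
from numba import njit

@njit
def square(n):
    return n * n

print(square(10**9))    # 1000000000000000000: still fits in a signed 64-bit integer
print(square(10**10))   # overflows and wraps around, unlike Python's unbounded ints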
I'm trying to speed up a section of my code using parallel processing in Python, but I'm having trouble getting it to work right, or even finding examples that are relevant to me.
The code produces a low-polygon version of an image using Delaunay triangulation, and the part that's slowing me down is finding the mean values of each triangle.
I've been able to get a good speed increase by vectorizing my code, but hope to get more using parallelization:
The code I'm having trouble with is an extremely simple for loop:
for tri in tris:
    lopo[tridex==tri,:] = np.mean(hipo[tridex==tri,:],axis=0)
The variables referenced are as follows.
tris - a Python list of all the unique triangle indices
lopo - a Numpy array of the final low-polygon version of the image
hipo - a Numpy array of the original image
tridex - a Numpy array the same size as the image. Each element represents a pixel and stores the triangle that the pixel lies within
I can't seem to find a good example that uses multiple numpy arrays as input, with one of them shared.
I've tried multiprocessing (with the above snippet wrapped in a function called colorImage):
p = Process(target=colorImage, args=(hipo,lopo,tridex,ppTris))
p.start()
p.join()
But I get a broken pipe error immediately.
So the way that Python's multiprocessing works (for the most part) is that you have to designate the individual threads that you want to run. I made a brief introductory tutorial here: http://will-farmer.com/parallel-python.html
In your case, what I would recommend is splitting tris into a bunch of different parts, each roughly equal in size, where each part represents the work for one "worker". You can split this list with numpy.array_split() (numpy.split() also works if the list divides evenly; documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).
Then, for each of these sub-lists, we use the threading and queue modules to designate 8 workers.
import queue
import threading
import numpy as np

# split the triangle indices into 8 roughly equal chunks, one per worker
tri_lists = np.array_split(tris, 8)

# Queues are thread-safe
return_values = queue.Queue()
threads = []

def color_image(q, tris, hipo, tridex):
    """ This is the function we're parallelizing """
    for tri in tris:
        # store the triangle index together with its mean so we can match them up later
        q.put((tri, np.mean(hipo[tridex==tri,:], axis=0)))

# Now we create and start the jobs
for i in range(8):
    threads.append(threading.Thread(
        target=color_image,
        args=(return_values, tri_lists[i], hipo, tridex)))
    threads[-1].start()

# Wait for all workers to finish
for t in threads:
    t.join()

# Now we have to clean up our results:
# drain the queue and write the means into lopo
while not return_values.empty():
    tri, mean_value = return_values.get()
    lopo[tridex==tri, :] = mean_value
This isn't the cleanest way to do it, and I'm not sure it works since I can't test it, but it's a decent approach. The parallelized part is now the np.mean() calls, while setting the values is not parallelized.
If you want to also parallelize the setting of the values, you'll have to have a shared variable, either using the Queue, or with a global variable.
See this post for a shared global variable: Python Global Variable with thread
I have a costly calculation to do for fitting some experimental data. The fitting function is a sum over eigenmodes, each of which contains a specific surface integral. As it is rather slow if you do it the classical way, I thought about threading it. I'm using Python, btw.
The function I want to calculate is something like
def fit_func(params, Mmin, Mmax):
    values = np.zeros(1000)
    for m in range(Mmin, Mmax):
        # fancy calculation for each mode
        # some calculation with all modes, adding them up into 'values'
    return values
How can I split this up? I did something like
data1 = thread.start_new_thread(fit_func, (params,0,13))
data2 = thread.start_new_thread(fit_func, (params,13,25))
but then the sum of data1 and data2 is not the same as fit_func(params, 0, 25)...
Try out multiprocessing. This will effectively create separate Python processes using a thread-like interface. However, make sure you profile your computation first and confirm that it really is the problem, not something else like IO. Starting processes is very slow, so keep them around for a while if you are planning to use them.
You can also use numpy for those functions. They're written in C code, so they're stupid fast. Check them both out and see what fits best. I would go for the numpy solution myself...
Use a multiprocessing Pool:
import multiprocessing as mp
p = mp.Pool(10)
res = p.map(your_function, range(Mmin, Mmax))
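Since fit_func adds the contribution of every mode into values, each call handed to map would compute a single mode, and the per-mode results get summed afterwards. A hedged sketch, where one_mode is a hypothetical per-mode version of your calculation:
import multiprocessing as mp
import numpy as np
from functools import partial

def one_mode(m, params):
    values = np.zeros(1000)
    # ... the fancy surface-integral calculation for mode m goes here ...
    return values

if __name__ == "__main__":
    params = None           # whatever your fit parameters actually are
    Mmin, Mmax = 0, 25
    with mp.Pool(10) as p:
        per_mode = p.map(partial(one_mode, params=params), range(Mmin, Mmax))
    values = np.sum(per_mode, axis=0)   # add the modes up, like the serial loop does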