I have a costly calculation to do for fitting some experimental data. The fitting function is a sum over eigenmodes, each of which contains a specific surface integral. As it is rather slow if done the straightforward way, I thought about threading it. I'm using Python, by the way.
The function I want to calculate looks something like this:
def fit_func(params, Mmin, Mmax):
    values = np.zeros(1000)
    for m in range(Mmin, Mmax):
        # fancy calculation for each mode m,
        # adding the mode's contribution to 'values'
        pass
    return values
How can I split this up? I tried something like this:
data1 = thread.start_new_thread(fit_func, (params,0,13))
data2 = thread.start_new_thread(fit_func, (params,13,25))
but then the sum of data1 and data2 is not the same as fit_func(params, 0, 25)...
Try out multiprocessing. It effectively creates separate Python processes behind a thread-like interface, so CPU-bound work really runs in parallel (plain threads in CPython are limited by the GIL for this kind of computation). However, make sure that you profile your computation first and confirm that the calculation itself is the problem, not something else like IO. Starting processes is very slow, so keep them around for a while if you are planning to reuse them.
You can also lean on numpy for those calculations. Its vectorised operations are written in C, so they're stupid fast. Check both options out and see what fits best. I would go for the numpy solution myself...
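A minimal sketch of the multiprocessing route (not your actual mode calculation): here fit_func is adapted to take its mode range as a single (Mmin, Mmax) tuple so that Pool.map can hand each worker one chunk, and the chunks' partial sums are simply added up afterwards:
import multiprocessing as mp
import numpy as np
from functools import partial

def fit_func(params, mode_range):
    # partial sum over one chunk of modes
    values = np.zeros(1000)
    for m in range(*mode_range):
        # fancy calculation for mode m, accumulated into 'values'
        pass
    return values

if __name__ == '__main__':
    params = None                      # placeholder for your fit parameters
    chunks = [(0, 13), (13, 25)]       # split of the mode range
    with mp.Pool(processes=2) as pool:
        partials = pool.map(partial(fit_func, params), chunks)
    values = np.sum(partials, axis=0)  # same result as fit_func(params, (0, 25))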
Use a multiprocessing Pool:
import multiprocessing as mp
p = mp.Pool(10)
res = p.map(your_function, range(Mmin, Mmax))
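Here your_function would compute the contribution of a single mode m. If it returns that contribution as an array, the partial results from the pool can then be combined afterwards, for example:
values = np.sum(res, axis=0)  # add up the per-mode contributions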
I want to parallelize a piece of code that resembles the following:
Ngal = 10
sampind = [7, 16, 22, 31, 45]
samples = 0.3 * np.ones((60, Ngal))
zt = [2.15, 7.16, 1.23, 3.05, 4.1, 2.09, 1.324, 3.112, 0.032, 0.2356]

toavg = []
for j in range(Ngal):
    gal = []
    for m in sampind:
        gal.append(samples[m][j] - zt[j])
    toavg.append(np.mean(gal))
accuracy = np.mean(toavg)
so I followed the advice here and I rewrote it as follows:
toavg = []
gal = []
p = mp.Pool()

def deltaz(params):
    j = params[0]  # index of the galaxy
    m = params[1]  # index of a sampled redshift
    gal.append(samples[m][j] - zt[j])
    return np.mean(gal)

j = np.linspace(0, Ngal - 1, Ngal).astype(int)
m = sampind
grid = [j, m]
input = itertools.product(*grid)
results = p.map(deltaz, input)
accuracy = np.mean(results)
p.close()
p.join()
but the results are not the same. In fact, sometimes they are and sometimes they're not; it doesn't seem very deterministic. Is my approach correct? If not, what should I fix? The modules you will need to reproduce the above examples are:
import numpy as np
import multiprocess as mp
import itertools
Thank you!
The first issue I see is that you are creating a global variable gal which is accessed by the function deltaz. However, globals are not shared between the pool processes; each process gets its own copy. You would have to use shared memory if you want the processes to share this structure, and this is probably why you see non-deterministic behavior.
The next issue is that the two versions are not actually computing the same thing. In the serial one you take the average of each galaxy's set of differences (gal) and then average those averages. The parallel one averages whichever elements happen to have accumulated in that process's gal list so far. This is non-deterministic because items are handed out as workers become available, which is not predictable.
I would suggest parallelizing over the galaxies (the outer loop), keeping the inner loop over sampind inside each task. To do this, zt and samples both need to be in shared memory because they are read by all of the processes. This can get dangerous if you are modifying data, but since you appear to only be reading, it should be fine.
import numpy as np
import multiprocessing as mp
import itertools
import ctypes

# Non-parallel code
Ngal = 10
sampind = [7, 16, 22, 31, 45]
samples = 0.3 * np.ones((60, Ngal))
zt = [2.15, 7.16, 1.23, 3.05, 4.1, 2.09, 1.324, 3.112, 0.032, 0.2356]

toavg = []
for j in range(Ngal):
    gal = []
    for m in sampind:
        gal.append(samples[m][j] - zt[j])
    toavg.append(np.mean(gal))
accuracy = np.mean(toavg)
print(toavg)

# Parallel function: one task per galaxy j, reading from the shared arrays
def deltaz(j):
    sampind = [7, 16, 22, 31, 45]
    gal = []
    for m in sampind:
        gal.append(sampArr[m][j] - ztArr[j])
    return np.mean(gal)

# Shared array for zt
zt_base = mp.Array(ctypes.c_double, int(len(zt)), lock=False)
ztArr = np.ctypeslib.as_array(zt_base)

# Shared array for samples
sample_base = mp.Array(ctypes.c_double, int(np.prod(samples.shape)), lock=False)
sampArr = np.ctypeslib.as_array(sample_base)
sampArr = sampArr.reshape(samples.shape)

# Copy the data into the shared arrays
sampArr[:, :] = samples[:, :]
ztArr[:] = zt[:]

with mp.Pool() as p:
    result = p.map(deltaz, np.linspace(0, Ngal - 1, Ngal).astype(int))
print(result)
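The parallel accuracy is then simply the mean of the per-galaxy results,
accuracy = np.mean(result)
which should match the serial accuracy computed above.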
This example produces the same results as the serial version. You can add more complexity to it as you see fit, but I would read about multiprocessing in general, and about memory types and scopes, to get an idea of what will and won't work; you have to take more care once you get into the multiprocessing world. Let me know if this doesn't help and I will try to update it so that it does.
The following simple Spark program takes 4 minutes to run, and I can't figure out what's wrong with this code.
First, I generate a VERY small RDD:
D = spark.sparkContext.parallelize([(0,[1,2,3]),(1,[2,3]),(2,[0,3]),(3,[1])]).cache()
Then I generate a vector
P1 = spark.sparkContext.parallelize(list(zip(list(range(4)),[1/4]*4))).cache()
Then I define a function to do the map step:
def MyFun(x):
    L0 = len(x[2])
    L = []
    for i in x[2]:
        L.append((i, x[1] / L0))
    return L
Then I execute the following code
P0 = P1
D0 = D.join(P1).map(lambda x: [x[0],x[1][1],x[1][0]]).cache()
C0 = D0.flatMap(lambda x: MyFun(x)).cache()
P1 = C0.reduceByKey(lambda x,y:x+y).mapValues(lambda x:x*1.2+3.4).sortByKey().cache()
Diff = P1.join(P0).map(lambda x: abs(x[1][0]-x[1][1])).sum()
Given that my data is so small, I can't figure out why this piece of code runs so slowly...
I have a few suggestions to help you speed up this job.
Cache only when needed
Caching materializes and stores the result of the computation you have built up to that point, so caching every intermediate step can cost more than it saves.
I would suggest caching only P1.
Use DataFrames to allow Spark to help you
Next, I strongly suggest using the DataFrame API: Spark can then do some optimizations for you, such as predicate push-down.
Last but not least, custom functions cost a lot as well. If you are using DataFrames, try to stick to the built-in functions (pyspark.sql.functions in Python, org.apache.spark.sql.functions in Scala), as in the sketch below.
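As a rough illustration only (a sketch, assuming the same SparkSession named spark as in your code), the join/aggregate step could look like this with DataFrames and built-in functions instead of custom lambdas:
from pyspark.sql import functions as F

D_df = spark.createDataFrame([(0, [1, 2, 3]), (1, [2, 3]), (2, [0, 3]), (3, [1])],
                             ["node", "links"])
P1_df = spark.createDataFrame(list(zip(range(4), [1 / 4] * 4)), ["node", "p"])

# spread each node's probability evenly over its links, then re-aggregate
C_df = (D_df.join(P1_df, "node")
            .withColumn("share", F.col("p") / F.size("links"))
            .select(F.explode("links").alias("node"), "share"))
P1_next = (C_df.groupBy("node")
               .agg(F.sum("share").alias("p"))
               .withColumn("p", F.col("p") * 1.2 + 3.4))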
Profile the code with Spark UI
I also suggest profiling your job via the Spark UI: since your data is so small, the problem might not be in your code at all, but with the nodes.
I'm completely new to Python. I'm trying to do a very simple thing: evaluate a non-trivial function that takes floats as input on a 2D mesh. The following code does exactly what I want, but it is slow due to the double for loop.
import numpy as np
from galpy.potential import RazorThinExponentialDiskPotential
R = np.logspace(0., 2., 10)
z=R
#initialize with default values for this example
potfunc=RazorThinExponentialDiskPotential()
pot=np.zeros((R.size, z.size))
for i in range(0, R.size):
    for j in range(0, z.size):
        pot[i, j] = potfunc(R[i], z[j])
At the end, the array pot contains all the information I want, but now I want to increase the efficiency. I know that pure Python is slow, especially on loops (like IDL), so I checked np.vectorize, but it's just a Python loop under the hood.
The problem is that potfunc does not seem to accept arrays, only plain scalars.
How can I optimize this simple program?
Many thanks in advance.
The standard way to do that is to use meshgrid:
rr, zz = np.meshgrid(R, z)
pot = potfunc(rr, zz)
You must avoid looping over numpy arrays, or you will lose all the vectorisation efficiency.
In case you cannot vectorize the function by hand (maybe you could subclass the RazorThinExponentialDiskPotential class and rewrite the function), you could use multiprocessing. Instead of my simple worker function you can plug in whatever function you like:
from multiprocessing import Pool
import numpy as np

def worker(x):
    ai, bj = x
    return ai + bj

def run_pool():
    a = np.linspace(0, 10, 10)
    b = np.logspace(0, 10, len(a))
    vec = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    p = Pool(processes=4)  # as many cores as you have
    print(p.map(worker, vec))
    p.close()
    p.join()

run_pool()
But before you think about speeding things up, profiling would be good; I am pretty sure that in your case the function itself is the bottleneck. So either rewrite it in a compiled language, vectorize it, or use all of your cores.
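To apply that pattern to the original problem, the flat list returned by the pool can be reshaped back onto the (R, z) grid. A sketch (assuming fork-based multiprocessing, so the module-level potfunc is inherited by the workers; with spawn, potfunc would have to be picklable):
from multiprocessing import Pool
import numpy as np
from galpy.potential import RazorThinExponentialDiskPotential

R = np.logspace(0., 2., 10)
z = R
potfunc = RazorThinExponentialDiskPotential()

def eval_point(rz):
    # evaluate the potential at a single (R, z) point
    r, zz = rz
    return potfunc(r, zz)

if __name__ == '__main__':
    points = [(r, zz) for r in R for zz in z]   # row-major order
    with Pool(processes=4) as p:
        flat = p.map(eval_point, points)
    pot = np.array(flat).reshape(R.size, z.size)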
I'm trying to speed up a section of my code using parallel processing in Python, but I'm having trouble getting it to work right, or even finding examples that are relevant to me.
The code produces a low-polygon version of an image using Delaunay triangulation, and the part that's slowing me down is finding the mean values of each triangle.
I've been able to get a good speed increase by vectorizing my code, but hope to get more using parallelization:
The code I'm having trouble with is an extremely simple for loop:
for tri in tris:
    lopo[tridex == tri, :] = np.mean(hipo[tridex == tri, :], axis=0)
The variables referenced are as follows.
tris - a Python list of the unique triangle indices
lopo - a Numpy array of the final low-polygon version of the image
hipo - a Numpy array of the original image
tridex - a Numpy array the same size as the image. Each element represents a pixel and stores the triangle that the pixel lies within
I can't seem to find a good example that uses multiple numpy arrays as input, with one of them shared.
I've tried multiprocessing (with the above snippet wrapped in a function called colorImage):
p = Process(target=colorImage, args=(hipo,lopo,tridex,ppTris))
p.start()
p.join()
But I get a broken pipe error immediately.
The way Python's multiprocessing works (for the most part) is that you have to designate the individual workers that you want to run. I made a brief introductory tutorial here: http://will-farmer.com/parallel-python.html
In your case, what I would recommend is splitting tris into a number of roughly equal-sized parts, one per worker. You can split the list with numpy.array_split() (documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).
Then, for each chunk, we use the threading and queue modules to run 8 workers.
import threading
import queue
import numpy as np

# split the triangle list into 8 roughly equal chunks
tri_lists = np.array_split(tris, 8)

# Queues are thread-safe
return_values = queue.Queue()
threads = []

def color_image(q, tris, hipo, tridex):
    """ This is the function we're parallelizing """
    for tri in tris:
        # store the triangle index together with its mean colour
        q.put((tri, np.mean(hipo[tridex == tri, :], axis=0)))

# Now we start the jobs
for i in range(8):
    threads.append(threading.Thread(
        target=color_image,
        args=(return_values, tri_lists[i], hipo, tridex)))
    threads[-1].start()

# Wait for all of the workers to finish
for t in threads:
    t.join()

# Now we have to clean up our results:
# drain the queue and write each triangle's mean colour into lopo
while not return_values.empty():
    tri, mean_colour = return_values.get()
    lopo[tridex == tri, :] = mean_colour
This isn't the cleanest way to do it, and I can't test it here, but it is a decent starting point. The parallelized part is now the np.mean() calls, while setting the values in lopo is still serial.
If you also want to parallelize setting the values, you'll need a structure shared by the workers, either the queue or a shared variable (see the sketch below).
See this post for a shared global variable: Python Global Variable with thread
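Since threads share the same address space, another variant (again only a sketch, untested, reusing tri_lists, hipo, lopo and tridex from the snippet above) is to let each worker write its triangles' mean values into lopo directly and skip the queue entirely; the chunks are disjoint, so no two threads write to the same pixels:
def color_image_inplace(tris_chunk, hipo, lopo, tridex):
    # each worker handles a disjoint set of triangles, so no lock is needed
    for tri in tris_chunk:
        lopo[tridex == tri, :] = np.mean(hipo[tridex == tri, :], axis=0)

workers = [threading.Thread(target=color_image_inplace,
                            args=(chunk, hipo, lopo, tridex))
           for chunk in tri_lists]
for w in workers:
    w.start()
for w in workers:
    w.join()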
I need to do some intense numerical computations, and fortunately Python offers very simple ways to implement parallelisation. However, the results I got were totally weird, and after some trial and error I stumbled upon the problem.
The following code simply calculates the mean of a random sample of numbers but illustrates my problem:
import multiprocessing
import numpy as np
from numpy.random import random

# Define function to generate random numbers
def get_random(seed):
    dummy = random(1000) * seed
    return np.mean(dummy)

# Input data
input_data = [100, 100, 100, 100]

pool = multiprocessing.Pool(processes=4)
result = pool.map(get_random, input_data)
print(result)

for i in input_data:
    print(get_random(i))
Now the output looks like this:
[51.003368466729405, 51.003368466729405, 51.003368466729405, 51.003368466729405]
for the parallelised version, which is always the same,
and like this for the normal, non-parallelised loop:
50.8581749381
49.2887091049
50.83585841
49.3067281055
As you can see, the parallelised version just returns the same result four times, even though it should have calculated different means, just like the loop does. Sometimes I get only 3 equal numbers, with one being different from the other 3.
I suspect that some state is somehow shared between the sub-processes...
I would love some hints on what is going on here and what a fix would look like. :)
thanks
When you use multiprocessing, you're dealing with distinct processes, and distinct processes mean distinct Python interpreters, each with its own random state. But the workers are forked from the parent, so they all start from a copy of the parent's random state; if you don't seed the random number generator uniquely in each process, every process will produce the same sequence of numbers.
The fix is to set a new random seed in each process. Changing the function to
def get_random(seed):
    np.random.seed()
    dummy = random(1000) * seed
    return np.mean(dummy)
gives the desired results. 😊
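A variant (just a sketch, not from the original answer) that also keeps runs reproducible is to hand each task its own explicit seed and use an independent Generator instead of reseeding the global state:
import multiprocessing
import numpy as np

def get_random(args):
    task_seed, scale = args
    rng = np.random.default_rng(task_seed)  # independent stream per task
    return np.mean(rng.random(1000) * scale)

if __name__ == '__main__':
    input_data = [(i, 100) for i in range(4)]  # (seed, scale) pairs
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(get_random, input_data))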