I am trying to use Pool.starmap_async to run some code that takes multiple parameters as inputs, in order to quickly sweep through a parameter space. The code runs a linalg function that sometimes does not converge, and instead throws a np.linalg.LinAlgError. In this case I'd like my code to return np.nan, and carry on its merry way. I would also, ideally, like to specify a timeout so that the code gives up after a set number of seconds and continues on to a different parameter combination.
import itertools
import pickle
import multiprocessing as mp
from multiprocessing import Pool

import numpy as np

# This is actually some long function that sometimes raises a np.linalg.LinAlgError
def run_solver(A, B):
    return A + B

if __name__ == '__main__':
    # Parameters
    Asearch = np.arange(4, 8, 1)
    Bsearch = np.arange(0.2, 2, 0.2)

    # Search all combinations of Asearch and Bsearch
    AB = np.array(list(itertools.product(Asearch, Bsearch)))
    A = AB[:, 0]
    B = AB[:, 1]

    result = {}
    with Pool(processes=15) as pool:
        def cb(r):
            print("callback")
            result[params] = r

        def ec(r):
            result[params] = np.nan
            print("error callback")
            raise np.linalg.LinAlgError

        try:
            params = zip(A, B)
            r = pool.starmap_async(run_solver, params, callback=cb, error_callback=ec)
            print(r.get(timeout=10))
        except np.linalg.LinAlgError:
            print("parameters did not converge")
        except mp.context.TimeoutError:
            print("Timeout error. Continuing...")

    pickle.dump(result, open("result.p", "wb"))
    print("pickling output:", result)
I have tried to catch the TimeoutError as an exception so that the code will continue, and I'm purposefully raising the LinAlgError because I'm trying to tell apart the case where the code runs out of time from the case where it fails to converge in time -- I realize that's redundant. For one thing, the result dictionary does not end up being what I intended: is there a way to query the current process's parameters and use those as the dictionary keys? Also, if a timeout occurs I would ideally flag those parameters in some way -- what's the best way to do this?
Finally, why is callback only called once in this code? Shouldn't it be called as each process successfully completes? Instead the code produces a dictionary where all of the parameters are crammed into a single key (the zip object) and the value is a list of all of the answers.
I don't think I'm fully understanding the problem here, but what if you simplified it down to something like this, where you catch the LinAlgError inside the calculation function?
Here apply_async is used to get a result object for each task sent to the pool. This allows you to easily apply a timeout to the result objects.
import numpy as np
from multiprocessing import Pool, context

def run_solver(A, B):
    try:
        result = A + B   # stand-in for the long computation that can raise LinAlgError
    except np.linalg.LinAlgError:
        result = np.nan
    return result

# A and B are built from the parameter grid exactly as in the question.
results = []
with Pool(processes=15) as pool:
    params = zip(A, B)
    result_pool = [pool.apply_async(run_solver, args) for args in params]
    for result in result_pool:
        try:
            results.append(result.get(15))   # 15 second timeout per task
        except context.TimeoutError:
            # do desired action on timeout
            results.append(None)
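If you also want the results keyed by their parameters, as asked in the question, one option is to keep each parameter tuple alongside its result object and build a dict as the results come in. This is only a sketch, assuming the same A, B arrays and run_solver as above, with a timed-out combination flagged by a sentinel value:
import numpy as np
from multiprocessing import Pool, context

results_by_params = {}
with Pool(processes=15) as pool:
    # Keep the (A, B) tuple next to its AsyncResult so it can be used as the dict key.
    result_pool = [(args, pool.apply_async(run_solver, args)) for args in zip(A, B)]
    for args, res in result_pool:
        try:
            results_by_params[args] = res.get(15)
        except context.TimeoutError:
            results_by_params[args] = "timeout"   # sentinel flag; use np.nan if you prefer a numeric marker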
I am trying to parallelize methods from a class using Dask on a PBS cluster.
My greatest challenge is that this method should parallelize some computations, then run further parallel computations on the result. Of course, this should be distributed on the cluster to run similar computations on other data...
The cluster is created:
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=4,
                     memory="10GB",
                     interface="ib0",
                     queue=queue,
                     processes=1,
                     nanny=False,
                     walltime="02:00:00",
                     shebang="#!/bin/bash",
                     env_extra=env_extra,
                     python=python_bin)
cluster.scale(8)
client = Client(cluster)
The class I need to distribute has 2 separate steps which have to be run sequentially, since step1 writes a file that is then read at the beginning of the second step.
I have tried the following, putting both steps one after the other in a method:
def computations(params):
my_class(**params).run_step1(run_path)
my_class(**params).run_step2()
chain = []
for p in params_compute:
y = dask.delayed(computations)(p)
chain.append(y)
dask.compute(*chain)
But it does not work because the second step is trying to read the file immediately.
So I need to find a way to stop the execution after step1.
I have tried to force the execution of the first step by adding a compute():
def computations(params):
my_class(**params).run_step1(run_path).compute()
my_class(**params).run_step2()
But that may not be a good idea, because when running dask.compute(*chain) I'd ultimately be doing compute(compute()), which might explain why the second step is not executed?
What would the best approach be?
Should I include a persist() somewhere at the end of step1?
For info, step1 and step2 are below:
def run_step1(self, path_step):
    preprocess_result = dask.delayed(self.run_preprocess)(path_step)
    gpu_result = dask.delayed(self.run_gpu)(preprocess_result)
    post_gpu = dask.delayed(self.run_postgpu)(gpu_result)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(self):
    data_file = rio.open(self.outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = self.process(data_file)
    final_merge = dask.delayed(self.merging)(temp_result1)
    write = dask.delayed(self.write_final)(final_merge)
    return write
This is only a rough suggestion, as I don't have a reproducible example as a starting point, but the key idea is to pass a delayed object to run_step2 to explicitly link it to run_step1. Note I'm not sure how essential it is for you to use a class in this case, but for me it's easier to pass the params explicitly as a dict.
def run_step1(params):
    # params is assumed to be a dict
    # unpack params here if needed (path_step was not explicitly in the `for p in params_compute:`
    # loop, so I assume it can be stored in params)
    preprocess_result = run_preprocess(path_step, params)
    gpu_result = run_gpu(preprocess_result, params)
    post_gpu = run_postgpu(gpu_result, params)  # Write a result file post_gpu.tif
    return post_gpu

def run_step2(post_gpu, params):
    # unpack params here if needed
    data_file = rio.open(outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = process(data_file, params)
    final_merge = merging(temp_result1, params)
    write = write_final(final_merge, params)
    return write

chain = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    z = dask.delayed(run_step2)(y, p)
    chain.append(z)

dask.compute(*chain)
Sultan's answer almost works, but fails because of an issue internal to the library I was provided with.
I have used the following workaround, which works for now (I'll use your solution later). I simply create 2 successive chains and compute them one after the other. Not really elegant, but it works fine...
chain1 = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    chain1.append(y)
dask.compute(chain1)

chain2 = []
for p in params_compute:
    y = dask.delayed(run_step2)(p)
    chain2.append(y)
dask.compute(chain2)
I am experiencing a memory issue when trying to run the problem below.
Consider a function that, for each argument arg_i_j, returns a pandas DataFrame:
def some_fun(arg_i_j):
...
return DF_i_j
Now, I have structured all arguments that I want to test in the following format,
All_lists = [ [arg_0_0,..., arg_0_N], ..., [arg_k_0,..., arg_k_N]],
and I'm trying to execute the following code in the main function
# Version A
results_per_list = []
the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=...,initargs=...)
for a_list in All_lists:
results = the_pool.map(some_fun, a_list)
results_per_list.append(results)
the_pool.close()
the_pool.join()
# then use results_per_list to do operations
and I end up with the error,
...\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
MemoryError
1) Does anyone have any idea how I can resolve this issue?
2) Do you see any problem with creating a "pool" object for each "a_list" in "All_lists", as below?
# Version B
results_per_list = []
for a_list in All_lists:
the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=...,initargs=...)
results = the_pool.map(some_fun, a_list)
results_per_list.append(results)
the_pool.close()
the_pool.join()
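One direction that might help (a sketch, untested against your real data): the traceback points at unpickling data passed back from the workers, and Version A keeps every returned DataFrame alive in results_per_list at once. Streaming the results with imap and consuming each DataFrame as it arrives keeps far fewer of them in memory at the same time. some_fun and All_lists are the names from the question; reduce_or_save is a hypothetical placeholder for whatever per-DataFrame operation you need.
import multiprocessing as mp

if __name__ == "__main__":
    with mp.Pool(processes=mp.cpu_count()) as the_pool:
        for a_list in All_lists:
            # imap yields results one at a time instead of building the full list
            for df in the_pool.imap(some_fun, a_list):
                # consume df immediately (write it to disk, update an aggregate, ...)
                # so it can be garbage-collected before the next result arrives
                reduce_or_save(df)  # placeholder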
I have a function that processes one url at a time:
import http.client
import urllib.request

def sanity(url):
    try:
        if 'media' in url[:10]:
            url = "http://dummy.s3.amazonaws.com" + url
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        ret = urllib.request.urlopen(req)
        allurls.append(url)
        return 1
    except (urllib.request.HTTPError, urllib.request.URLError, http.client.HTTPException, ValueError) as e:
        print(e, url)
        allurls.append(url)
        errors.append(url)
        return 0
In the main function, I have a list of URLs that need to be processed by the above function. I have tried the following, but it doesn't work.
start=0
allurls=[]
errors=[]
#arr=[0,100,200...]
for i in arr:
p=Process(target=sanity,args=(urls[start:i],))
p.start()
p.join()
The above code is supposed to process the URLs in batches of 100, but it doesn't work. I know it's not working because I write the lists allurls and errors to two different files, and they are empty when they should not be; the lists themselves turn out to be empty. I don't understand this behavior.
If I understand you correctly, you want to process chunks of a list at a time, but process those chunks in parallel? Secondly, you want to store the answers in a global variable. The problem is that processes are not threads, so it is much more involved to share memory between them.
So the alternative is to return the answers; the code below helps you do just that. First you need to convert your list into a list of lists, each inner list containing the data you want to process in that chunk. You can then pass that list of lists to a function that processes each of them. The output for each chunk is a list of answers and a list of errors (I'd recommend converting this to a dict to keep track of which input threw an error; see the sketch after the code). Then, after the processes return, you can untangle the list of lists to create your overall list of answers and list of errors.
Here is the code that would achieve the above:
from multiprocessing import Pool

def f(x):
    try:
        return [x*x, None]  # success: [result, None]
    except Exception as e:
        return [None, e]    # failure: [None, error]

def chunk_f(x):
    output = []
    errors = []
    for xi in x:
        ans, err = f(xi)
        if ans is not None:
            output.append(ans)
        if err is not None:
            errors.append(err)
    return [output, errors]

n = 10                   # chunk size
data = list(range(95))   # test data
data.extend(['a', 'b'])  # two values that will raise a TypeError inside f
l = [data[k*n:(k+1)*n] for k in range(int(len(data)/n + 1))]

p = Pool(8)
d = p.map(chunk_f, l)

new_data = []
all_errors = []
for da, de in d:
    new_data.extend(da)
    all_errors.extend(de)
print(new_data)
print(all_errors)
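As a variation on the above (a sketch along the lines of the dict suggestion, reusing the same f and Pool setup), chunk_f can return dicts keyed by the input value, which makes it easy to see exactly which inputs failed:
def chunk_f_dict(x):
    # Same idea as chunk_f, but results and errors are keyed by the input value,
    # so you can tell exactly which element of the chunk failed.
    output = {}
    errors = {}
    for xi in x:
        ans, err = f(xi)
        if err is None:
            output[xi] = ans
        else:
            errors[xi] = err
    return [output, errors]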
You can also look at this stack overflow answer on different methods of chunking your data.
I have a python function that requires the user to enter an array of data; the function then works on the data and produces two arrays which are returned to the main program. In this question I am including a greatly simplified example from which I hope I can solicit some advice or help. I have created a function titled "Test_Function" that requires the programmer to supply an array of data titled "Array", which in this case has a length of 5000. The function works on the data and produces two arrays titled "Result1" and "Result2", which are returned to the main program as the variables "Res1" and "Res2". I would like to thread "Test_Function" so that one thread works on half of the input array and the other thread works on the other half, and then combine the results back together in the main program for both output arrays ("Result1"/"Res1" and "Result2"/"Res2"). I described a scenario with two threads, but I would like to make it generic enough that it could run a user-defined number of threads. How do I do this with the thread functionality?
import numpy as np
def Test_Function(Array):
Result1 = Array*np.pi*(1-Array)
Result2 = Array+478.5 + (1/Array)
return(np.array(Result1,dtype=float), np.array(Result2,dtype=float))
#---------------------------------------------------------------------------
if __name__ == "__main__":
Dependent_Array = np.linspace(1.0,5000.0,num=5000)
Res1, Res2 = Test_Function(Dependent_Array)
# eof
You could use a ThreadPool to which you assign async tasks:
import numpy as np
from multiprocessing.pool import ThreadPool
def sub1(Array):
return Array * np.pi * (1-Array)
def sub2(Array):
return Array + 478.5 + (1/Array)
def Test_Function(Array):
pool = ThreadPool(processes=2)
res1 = pool.apply_async(sub1, (Array,))
res2 = pool.apply_async(sub2, (Array,))
return (np.array(res1.get(),dtype=float), np.array(res2.get(),dtype=float))
if __name__ == "__main__":
Dependent_Array = np.linspace(1.0,5000.0,num=5000)
Res1, Res2 = Test_Function(Dependent_Array)
This module is not very well documented, but, basically, it creates a pool of workers to which you can assign subroutines to execute. The method get() of the result object created by apply_async will return the result only when the corresponding thread has finished its operations.
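The question also asks about running a user-defined number of threads over slices of the input array. Here is a minimal sketch of that variant (my own illustration, assuming the same element-wise Test_Function), using np.array_split to divide the work and np.concatenate to stitch the pieces back together:
import numpy as np
from multiprocessing.pool import ThreadPool

def Test_Function(Array):
    Result1 = Array * np.pi * (1 - Array)
    Result2 = Array + 478.5 + (1 / Array)
    return (np.array(Result1, dtype=float), np.array(Result2, dtype=float))

def threaded_test_function(Array, n_threads=2):
    # Split the input into n_threads roughly equal slices, run Test_Function
    # on each slice in its own thread, then stitch the pieces back together.
    chunks = np.array_split(Array, n_threads)
    with ThreadPool(processes=n_threads) as pool:
        partial_results = pool.map(Test_Function, chunks)
    Res1 = np.concatenate([r1 for r1, _ in partial_results])
    Res2 = np.concatenate([r2 for _, r2 in partial_results])
    return Res1, Res2

if __name__ == "__main__":
    Dependent_Array = np.linspace(1.0, 5000.0, num=5000)
    Res1, Res2 = threaded_test_function(Dependent_Array, n_threads=4)
Because Test_Function operates element-wise, splitting the input and concatenating the partial results gives the same answer as a single call on the full array.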
Inside a network, information (a packet) can be passed to different nodes (hosts); by modifying its content it can carry different meanings. The final packet depends on the hosts' input along its given route through the network.
Now I want to implement a calculating network model that can do small jobs by following different calculation paths.
Prototype:
def a(p): return p + 1
def b(p): return p + 2
def c(p): return p + 3
def d(p): return p + 4
def e(p): return p + 5

def link(p, r):
    p1 = p
    for x in r:
        p1 = x(p1)
    return p1

p = 100
route = [a, c, d]
result = link(p, route)
#========
target_result = 108
if result == target_result:
    # route is OK
    pass
I think finally I need something like this:
p with [init_payload, expected_target, passed_path, actual_calculated_result]
|
\/
[CHAOS of possible of functions networks]
|
\/
px [a,a,b,c,e] # ok this path is ok and match the target
Here are my questions; I hope you can help:
Can p carry (determine) the route(s) by inspecting the functions and the estimated result?
(1.1) For example, if there's a node x() on the route like
def x(p): return p / 0  # I suppose this still compiles
can p somehow know that this path is not good and then avoid selecting it?
(1.2) Another point of confusion: if p is a self-defined class whose payload is essentially a string, and it carries a path [a,c,d], can p know that a() must work with an int and therefore avoid selecting that node?
Same as 1.2: when generating the path, can I avoid an oops like the following?
import random

def a(p): return p + 1
def b(p): return p + 2
def x(p): return p.append(1)
def y(p): return p.append(2)

full_node_list = [a, b, x, y]
path = random.sample(full_node_list, 2)  # oops: x, y will be trouble for an int p, and a, b will be trouble for a list p
Please also consider the case where the path is a list of lambda functions.
PS: as the whole model is not very clear in my mind, any guidance and direction will be appreciated.
THANKS!
You could test each function first with a set of sample data; any function which returns consistently unusable values might then be discarded.
def isGoodFn(f):
testData = [1,2,3,8,38,73,159] # random test input
goodEnough = 0.8 * len(testData) # need 80% pass rate
try:
good = 0
for i in testData:
if type(f(i)) is int:
good += 1
return good >= goodEnough
except:
return False
If you know nothing about what the functions do, you will have to essentially do a full breadth-first tree search with error-checking at each node to discard bad results. If you have more than a few functions this will get very large very quickly. If you can guarantee some of the functions' behavior, you might be able to greatly reduce the search space - but this would be domain-specific, requiring more exact knowledge of the problem.
If you had a heuristic measure for how far each result is from your desired result, you could do a directed search to find good answers much more quickly - but such a heuristic would depend on knowing the overall form of the functions (a distance heuristic for multiplicative functions would be very different than one for additive functions, etc).
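To make the search idea concrete, here is a small sketch (my own illustration, not from the question) that enumerates compositions of the question's a..e functions level by level up to a maximum depth, discards any path that raises on the payload, and returns every path whose result matches the target:
from itertools import product

def a(p): return p + 1
def b(p): return p + 2
def c(p): return p + 3
def d(p): return p + 4
def e(p): return p + 5

def find_routes(p, target, nodes, max_depth=4):
    # Breadth-first enumeration of all function paths up to max_depth,
    # with error-checking: any path that raises on this payload is skipped.
    matches = []
    for depth in range(1, max_depth + 1):
        for route in product(nodes, repeat=depth):
            value = p
            try:
                for fn in route:
                    value = fn(value)
            except Exception:
                continue  # a node failed for this payload: discard the path
            if value == target:
                matches.append(route)
    return matches

print(find_routes(100, 108, [a, b, c, d, e], max_depth=3))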
Your functions can raise TypeError if they are not satisfied with the data types they receive. You can then catch this exception and see whether you are passing an appropriate type. You can also catch any other exception type. But trying to call the functions and catching the exceptions can be quite slow.
You could also organize your functions into different sets depending on the argument type.
functions = { list : [some functions taking a list], int : [some functions taking an int]}
...
x = choose_function(functions[type(p)])
p = x(p)
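To make that dispatch idea concrete, here is a minimal runnable sketch; the registry contents and choose_function (random selection here) are hypothetical stand-ins for your real nodes:
import random

# Hypothetical registry: payload type -> functions known to accept that type.
functions = {
    int:  [lambda p: p + 1, lambda p: p + 2],
    list: [lambda p: p + [1], lambda p: p + [2]],
}

def choose_function(candidates):
    return random.choice(candidates)

p = 100
for _ in range(3):
    x = choose_function(functions[type(p)])  # only consider nodes valid for type(p)
    p = x(p)
print(p)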
I'm somewhat confused as to what you're trying to do, but: p cannot "know about" the functions until it is run through them. By design, Python functions don't specify what type of data they operate on: e.g. a*5 is valid whether a is a string, a list, an integer or a float.
If there are some functions that might not be able to operate on p, then you could catch exceptions, for example in your link function:
def link(p, r):
    try:
        for x in r:
            p = x(p)
    except (ZeroDivisionError, AttributeError):  # list whatever errors you want to catch
        return None
    return p
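Hypothetical usage, reusing the a/c/d functions and the 100 -> 108 target from the question:
route = [a, c, d]
result = link(100, route)
if result is None:
    print("a node on this route failed for this payload")
elif result == 108:
    print("route is OK and matches the target")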