I am experiencing a memory issue when trying to run the problem below.
Consider a function that, for each argument arg_i_j, returns a pandas DataFrame:
def some_fun(arg_i_j):
    ...
    return DF_i_j
Now, I have structured all arguments that I want to test in the following format,
All_lists = [ [arg_0_0,..., arg_0_N], ..., [arg_k_0,..., arg_k_N]],
and I'm trying to execute the following code in the main function
# Version A
results_per_list = []
the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=..., initargs=...)
for a_list in All_lists:
    results = the_pool.map(some_fun, a_list)
    results_per_list.append(results)
the_pool.close()
the_pool.join()
# then use results_per_list to do operations
and I end up with the error,
...\multiprocessing\connection.py", line 251, in recv
return _ForkingPickler.loads(buf.getbuffer())
MemoryError
1) Does anyone have any idea how I can resolve this issue?
2) Do you see any problem with creating a "pool" object for each "a_list" in "All_lists", as below?
# Version B
results_per_list = []
for a_list in All_lists:
    the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=..., initargs=...)
    results = the_pool.map(some_fun, a_list)
    results_per_list.append(results)
    the_pool.close()
    the_pool.join()
I have been trying to get past this error for a while and can't quite find a way to fix it. I'm trying to minimize a function, but whenever I call it I get the error in the title. I've looked at several other posts and have tried several different tactics, but no dice. Here's the snippet in question:
def objective_func(a, x_sum, y_sum):
    alpha_sum = np.sum(a)
    alpha_dot_sum = np.sum(np.dot(a[i], a[j]) for i in range(len(a)) for j in range(len(a)))
    return (1/2) * (x_sum * y_sum * alpha_dot_sum) - alpha_sum

def Dual_SVM(data, c, a):
    inputs = []
    for example in data:
        inputs.append(example[0:5])
    outputs = []
    for example in data:
        outputs.append(example[len(example)-1])
    bound = [(0, c)]
    cons_function = np.sum(a*outputs)
    cons = ({'type': 'eq', 'fun': cons_function})
    # inputs = []
    x_sum = np.sum(np.dot(inputs[i], inputs[j]) for i in range(len(inputs)) for j in range(len(inputs)))
    y_sum = np.sum(np.dot(outputs[i], outputs[j]) for i in range(len(outputs)) for j in range(len(outputs)))
    sol = minimize(objective_func, x0=a, args=(x_sum, y_sum,), method='SLSQP', constraints=cons, bounds=bound)
    return sol
Any feedback on this would be greatly appreciated. I know that the first argument needs to be a function and not just a function call, but I feel like I'm following the proper syntax. Not sure what to do here.
Fixed it. The problem was that cons_function was a float, not a function. A lambda function in its place fixed the problem.
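For anyone hitting the same thing, here is a minimal, self-contained sketch of the fix (the label vector, initial guess, and objective below are stand-ins, not the real SVM pieces):
import numpy as np
from scipy.optimize import minimize

outputs = np.array([1.0, -1.0, 1.0])   # hypothetical label vector
a0 = np.zeros_like(outputs)            # initial guess for the alphas
# Before: 'fun' was np.sum(a0 * outputs), i.e. an already-evaluated float.
# After: a lambda keeps it a callable that minimize can evaluate at each step.
cons = ({'type': 'eq', 'fun': lambda a: np.sum(a * outputs)})
sol = minimize(lambda a: 0.5 * np.sum(a ** 2) - np.sum(a),  # stand-in objective
               x0=a0, method='SLSQP', constraints=cons,
               bounds=[(0, 1)] * len(a0))
print(sol.x)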
I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item? Then, once parallel_request is filled, execute the parallel function, and resume the loop for each item, restoring the previously saved context (local variables).
EDIT: I think one solution would be to use a function instead of a loop, and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset, last=False):
    global index, new_data
    my_index = index               # remember which item this frame handles
    data = dataset[my_index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset) - 1)
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[my_index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will hit the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another option would be to just recompute them, which would require code duplication.
So I suspect that achieving this needs some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(datum, parallel_requests, final_output):
    data = datum
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield  # suspend here until the parallel results are available
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    # new_data is assumed to map each subdata to its parallel result, as in the original loop.
    final_data = merge(subdata, new_data[subdata], data, data2, data3)
    final_output.append(final_data)
    yield  # final yield so the second next() call below does not raise StopIteration

final_output = []
parallel_requests = []
funcs = [process_data(datum, parallel_requests, final_output) for datum in dataset]
[next(f) for f in funcs]
new_data = process_requests(parallel_requests)
[next(f) for f in funcs]
The output list and generator calls are general enough that you can abstract these lines away in a helper function that sets everything up and calls the generators for you, leading to a very clean result, with the code overhead being one line for the function definition and one line to call the helper.
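For example, here is a rough sketch of such a helper (run_serial_parallel is a made-up name; it assumes each generator yields exactly once, right after appending its request, and receives the shared parallel result back through send()):
def run_serial_parallel(dataset, gen_func, parallel_func):
    parallel_requests = []
    final_output = []
    # Phase 1: run every per-item generator up to its yield, collecting requests.
    gens = [gen_func(datum, parallel_requests, final_output) for datum in dataset]
    for g in gens:
        next(g)
    # One parallel call over all collected requests.
    new_data = parallel_func(parallel_requests)
    # Phase 2: resume every generator, handing it the shared results.
    for g in gens:
        try:
            g.send(new_data)
        except StopIteration:
            pass
    return final_output
With this variant the per-item generator would read the results with new_data = yield instead of relying on a module-level new_data.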
I have a dataset of 2.7 million samples that I need to test my ML model on. I have 8 cores on my laptop and want to try parallelizing my testing code to save time. This is the test function:
def testMTGP(x_sample, y_sample, ind, model, likelihood):
    x_sample = x_sample.view(1, -1)
    y_sample = y_sample.view(1, -1)
    model.eval()
    likelihood.eval()
    with torch.no_grad():
        prediction = likelihood(model(x_sample))
        mean = (prediction.mean).detach().numpy()
        prewhiten_error = (y_sample.detach().numpy()) - mean
        cov_matrix = (prediction.covariance_matrix).detach().numpy()
        white_error, matcheck = Whiten(prewhiten_error, cov_matrix)
    return (
        ind,
        {
            "prediction": mean,
            "prewhiten_error": prewhiten_error,
            "white_error": white_error,
            "cov_matrix": cov_matrix,
            "matcheck": matcheck,
        },
    )
I return the index corresponding to the sample I tested and a dictionary of data related to the computations the model does for testing. The function Whiten(prewhiten_error, cov_matrix) is also defined by me and was imported at the beginning of the code file, so it is available globally. It simply takes the inputs, transforms cov_matrix and multiplies it with prewhiten_error and returns the answer, along with a variable that indicates some state information about the cov_matrix.
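For reference, here is a rough sketch of what a Whiten function of that shape might look like; this is only my guess at the idea (Cholesky-based whitening), not the actual implementation, and the meaning of the state flag may differ:
import numpy as np

def Whiten(prewhiten_error, cov_matrix):
    # Whiten the error by solving against the Cholesky factor of the covariance.
    # matcheck reports whether the covariance could actually be factorized.
    try:
        L = np.linalg.cholesky(cov_matrix)
        white_error = np.linalg.solve(L, prewhiten_error.T).T
        matcheck = 0
    except np.linalg.LinAlgError:
        # Covariance was not positive definite; fall back to the raw error.
        white_error = prewhiten_error
        matcheck = 1
    return white_error, matcheck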
For the multiprocessing, the idea is to first divide the entire dataset into roughly equal-sized chunks, then pick each chunk and send one sample to every core for processing. I am using pool.apply_async. This is the code:
test_X = torch.load(test_X_filename)  # torch tensor of shape 2.7M x 3
test_Y = torch.load(test_Y_filename)  # torch tensor of shape 2.7M x 3
cores = mp.cpu_count()
chunk_size = int(test_X.shape[0] / cores)
start_time = time.time()
parent_list = []
for start_ind in range(0, test_X.shape[0], chunk_size):
    pool = mp.Pool(processes=cores)
    proc_data_size = int(chunk_size / cores)
    stop_ind = min(test_X.shape[0], start_ind + chunk_size)
    results = [
        pool.apply_async(
            testMTGP, (test_X[i].detach(), test_Y[i].detach(), i, model, likelihood,)
        )
        for i in range(start_ind, stop_ind)
    ]
    for res in results:
        print("Length of results list= ", len(results))
        print("Data type of res is: ", type(res))
        res_dict = res.get()
        parent_list.append(res_dict)
    pool.close()
test_X[i] and test_Y[i] are both tensors with shape (3,). On executing the code I get:
Traceback (most recent call last):
  File "multiproc_async.py", line 288, in <module>
    res_dict = res.get()  # [1]
  File "/home/aman/anaconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/aman/anaconda3/envs/thesis/lib/python3.8/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/aman/anaconda3/envs/thesis/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/aman/anaconda3/envs/thesis/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'MultitaskGaussianLikelihood.__init__.<locals>.<lambda>'
I am new to multiprocessing and googling for this error did not really help (some of it was not relevant and some was beyond my understanding). Can someone please help me understand what mistake I am making?
Well, this issue is fairly complex: I've never used Torch, and I'm by no means an expert in multiprocessing, but I do have a decent grasp of the concepts here, so I'll do my best to explain what is wrong. You will probably need to come up with the fix yourself, because it will depend on your end goal.
Note: I notice you're just typing python. It looks like this is a Windows Store version of Ubuntu; if that's the case, you may want to run the program using python3. (If you've re-mapped the alias, please ignore this.)
So, that final error in the stack trace, Can't pickle local object 'MultitaskGaussianLikelihood.__init__.<locals>.<lambda>', is referring to pickle, which is a serializer library. If you're unfamiliar with serialization, it's basically a standard format for rebuilding something cross-system. For example, JSON is a very common serializer; it allows you to transfer multiple variables across multiple programming languages. Pickle allows for the serialization of whole objects so they can be transferred to another program. I believe the reason serialization comes into play with res.get() is the limited ways Python processes have of talking to each other, which is evident throughout the multiprocessing documentation.
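To see the same class of error in isolation, you can try to pickle a lambda that was defined inside a function (a "local object"):
import pickle

def make_fn():
    return lambda x: x + 1      # defined inside a function, so it is a "local object"

pickle.dumps(make_fn())
# AttributeError: Can't pickle local object 'make_fn.<locals>.<lambda>'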
The problem is that the class MultitaskGaussianLikelihood appears to use a lambda as one of its parameters, and according to that AttributeError, pickle is not capable of serializing a lambda, which means it can't serialize MultitaskGaussianLikelihood either, as it contains one. I don't have all the code here, so I can't see where the MultitaskGaussianLikelihood object is in your return, but I would say you need to extract all the information you will need from that class and return that data instead of returning the class and extracting it after the fact.
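One common workaround, sketched below under the assumption that the model and likelihood arguments are what drags the lambda into the pickled task (the traceback fails in put(task), i.e. while sending a task to a worker): build those objects once per worker with a Pool initializer, so they never have to be pickled at all. load_model_and_likelihood is a hypothetical placeholder for however you currently construct them.
import multiprocessing as mp

_model = None
_likelihood = None

def init_worker():
    # Runs once inside each worker process; the objects are created locally,
    # so nothing unpicklable ever crosses the process boundary.
    global _model, _likelihood
    _model, _likelihood = load_model_and_likelihood()  # hypothetical loader

def testMTGP_worker(x_sample, y_sample, ind):
    # Same signature as testMTGP, minus the unpicklable arguments.
    return testMTGP(x_sample, y_sample, ind, _model, _likelihood)

pool = mp.Pool(processes=mp.cpu_count(), initializer=init_worker)
# ...then submit testMTGP_worker with only picklable arguments:
# pool.apply_async(testMTGP_worker, (test_X[i].detach(), test_Y[i].detach(), i))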
Hope I explained that well!
Let's simplify your problem down to its root cause. We need a working example for the multiprocessing part; otherwise we don't have a reproducible example to help you with. Then you can patch the actual training of the model back in.
Let's use this dummy function:
def testMTGP(x_sample, y_sample, ind, model, likelihood):
    return (
        ind,
        {
            "prediction": 1,
            "prewhiten_error": 1,
            "white_error": 1,
            "cov_matrix": 1,
            "matcheck": 1,
        },
    )
Then a working and clean example is:
if __name__ == '__main__':
    cores = mp.cpu_count()
    args = [(None, None, i, None, None,) for i in range(0, 5)]
    start_time = time.time()
    with mp.Pool(processes=3) as pool:
        results = pool.starmap(testMTGP, args)
    end_time = time.time()
    print(results)
    print("it took %s" % (end_time - start_time))
Try with this and, little by little, bring in the actual logic you need for training the model. I suggest you start by passing the actual arguments you want each time, and only at the end update the testMTGP function (replacing the dummy one).
When you isolate what makes the code crash, and/or post the stack trace, I can help more.
I am trying to parallelize methods from a class using Dask on a PBS cluster.
My greatest challenge is that this method should parallelize some computations, then run further parallel computations on the result. Of course, this should be distributed on the cluster to run similar computations on other data...
The cluster is created:
cluster = PBSCluster(cores=4,
                     memory="10GB",
                     interface="ib0",
                     queue=queue,
                     processes=1,
                     nanny=False,
                     walltime="02:00:00",
                     shebang="#!/bin/bash",
                     env_extra=env_extra,
                     python=python_bin)
cluster.scale(8)
client = Client(cluster)
The class I need to distribute has 2 separate steps which have to be run separately, since step1 writes a file that is then read at the beginning of the second step.
I have tried the following by putting both steps one after the other in a method:
def computations(params):
    my_class(**params).run_step1(run_path)
    my_class(**params).run_step2()

chain = []
for p in params_compute:
    y = dask.delayed(computations)(p)
    chain.append(y)
dask.compute(*chain)
But it does not work because the second step is trying to read the file immediately.
So I need to find a way to stop the execution after step1.
I have tried to force the execution of the first step by adding a compute():
def computations(params):
    my_class(**params).run_step1(run_path).compute()
    my_class(**params).run_step2()
But it may not be a good idea, because when running dask.compute(*chain) I'd ultimately be doing compute(compute()), which might explain why the second step is not executed?
What would the best approach be?
Should I include a persist() somewhere at the end of step1?
For info, step1 and step2 below:
def run_step1(self, path_step):
    preprocess_result = dask.delayed(self.run_preprocess)(path_step)
    gpu_result = dask.delayed(self.run_gpu)(preprocess_result)
    post_gpu = dask.delayed(self.run_postgpu)(gpu_result)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(self):
    data_file = rio.open(self.outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = self.process(data_file)
    final_merge = dask.delayed(self.merging)(temp_result1)
    write = dask.delayed(self.write_final)(final_merge)
    return write
This is only a rough suggestion, as I don't have a reproducible example as a starting point, but the key idea is to pass a delayed object to run_step2 to explicitly link it to run_step1. Note that I'm not sure how essential it is for you to use a class in this case, but for me it's easier to pass the params explicitly as a dict.
def run_step1(params):
    # params is assumed to be a dict
    # unpack params here if needed (path_step was not explicitly in the `for p in params_compute:` loop, so I assume it can be stored in params)
    preprocess_result = run_preprocess(path_step, params)
    gpu_result = run_gpu(preprocess_result, params)
    post_gpu = run_postgpu(gpu_result, params)  # write a result file post_gpu.tif
    return post_gpu

def run_step2(post_gpu, params):
    # unpack params here if needed
    data_file = rio.open(outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = process(data_file, params)
    final_merge = merging(temp_result1, params)
    write = write_final(final_merge, params)
    return write

chain = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    z = dask.delayed(run_step2)(y, p)
    chain.append(z)
dask.compute(*chain)
Sultan's answer almost works, but fails due to an internal misconception in the library I was provided with.
I have used the following workaround, which works for now (I'll use your solution later). I simply create 2 successive chains and compute them one after the other. It's not really elegant, but it works fine...
chain1 = []
for p in params_compute:
    y = run_step1(p)
    chain1.append(y)
dask.compute(chain1)

chain2 = []
for p in params_compute:
    y = run_step2(p)
    chain2.append(y)
dask.compute(chain2)
I am trying to use Pool.starmap_async to run some code that takes multiple parameters as inputs, in order to quickly sweep through a parameter space. The code runs a linalg function that sometimes does not converge, and instead throws a np.linalg.LinAlgError. In this case I'd like my code to return np.nan, and carry on its merry way. I would also, ideally, like to specify a timeout so that the code gives up after a set number of seconds and continues on to a different parameter combination.
import itertools
import pickle
import multiprocessing as mp
from multiprocessing import Pool
import numpy as np

# This is actually some long function that sometimes returns a linalg error
def run_solver(A, B):
    return A + B

if __name__ == '__main__':
    # Parameters
    Asearch = np.arange(4, 8, 1)
    Bsearch = np.arange(0.2, 2, 0.2)
    # Search all combinations of Asearch and Bsearch
    AB = np.array(list(itertools.product(Asearch, Bsearch)))
    A = AB[:, 0]
    B = AB[:, 1]
    result = {}
    with Pool(processes=15) as pool:
        def cb(r):
            print("callback")
            result[params] = r

        def ec(r):
            result[params] = np.nan
            print("error callback")
            raise np.linalg.LinAlgError

        try:
            params = zip(A, B)
            r = pool.starmap_async(run_solver, params, callback=cb, error_callback=ec)
            print(r.get(timeout=10))
        except np.linalg.LinAlgError:
            print("parameters did not converge")
        except mp.context.TimeoutError:
            print("Timeout error. Continuing...")
    pickle.dump(result, open("result.p", "wb"))
    print("pickling output:", result)
I have tried to catch the TimeoutError as an exception so that the code will continue, and I'm purposefully raising the LinAlgError because I'm trying to pick apart when the code runs out of time vs fails to converge in time -- I realize that's redundant. For one thing, the result dictionary does not end up being how I intended: is there a way to query the current process's parameters and use those as the dictionary keys? Also, if a Timeout error occurs I would ideally flag those parameters in some way -- what's the best way to do this?
Finally, why in this code is the callback only called once? Shouldn't it be called as each process successfully completes? The code returns a dictionary where all of the parameters are crammed into a single key (as a zip object) and all of the answers are a list in that key's value.
I don't think I'm fully understanding the problem here, but what if you simplified it down to something like this, where you catch the LinAlgError in the calculation function?
Here apply_async is used to get a result object for each task sent to the pool. This allows you to easily apply a timeout to the result objects.
import numpy as np
from multiprocessing import Pool, context

def run_solver(A, B):
    try:
        result = A + B
    except np.linalg.LinAlgError:
        result = np.nan
    return result

results = []
with Pool(processes=15) as pool:
    params = zip(A, B)
    result_pool = [pool.apply_async(run_solver, args) for args in params]
    for result in result_pool:
        try:
            # wait up to 15 seconds for each task's result
            results.append(result.get(15))
        except context.TimeoutError:
            # do desired action on timeout
            results.append(None)