multiprocessing on dataframe

multiprocessing on dataframe - python

I have this ipynb script to apply a function from a different python script on my dataframe. The "predict_tautomers.py" file contains a function called "get_taut_data".
my script:
import pandas as pd
import multiprocessing as mp
import numpy as np
import predict_tautomers
train_set=pd.read_csv("data/df_train_set.csv", nrows=3)
def parallelize_function(df):
df["tautomer_dict"] = df["smiles"].apply(predict_tautomers.get_taut_data)
return df
def parallelize_dataframe(df, func):
num_processes = 2
df_split = np.array_split(df, num_processes)
with mp.Pool(num_processes) as p:
df = pd.concat(p.map(func, df_split))
p.map(func, df_split)
return df
train_set = parallelize_dataframe(train_set, parallelize_function)
However, running this code gives me this attribute error.
the error:
Process SpawnPoolWorker-15:
Traceback (most recent call last):
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/pool.py", line 114, in worker
task = get()
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/queues.py", line 367, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'parallelize_function' on <module '__main__' (built-in)>
Process SpawnPoolWorker-16:
Traceback (most recent call last):
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/pool.py", line 114, in worker
task = get()
File "/Users/michel_lim/miniconda3/envs/moltaut2/lib/python3.10/multiprocessing/queues.py", line 367, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'parallelize_function' on <module '__main__' (built-in)>
Edit: The "predict_tautomers" script I am importing contains functions with multiprocessing which looks like this:
def predict_by_smi(smi, fmax, num_confs):
pmol = pybel.readstring("smi", smi)
blocks = gen_confs_set(smi, num_confs)
params = zip(blocks, [fmax for i in range(len(blocks))])
pool = Pool()
score = pool.map(predict_multicore_wrapper, params)
pool.close()
def predict_by_smis(smis, fmax, num_confs):
params = []
for idx, smi in enumerate(smis):
blocks = gen_confs_set(smi, num_confs)
for block in blocks:
params.append([idx, smi, block, fmax])
pool = Pool()
score = pool.map(predict_multicore_wrapper, params)
pool.close()

Related

Use multiprocess in class with python version == 3.9

I am trying to use multiprocessing in a class in the following code:
class test:
def __init__(self):
return
global calc_corr
#staticmethod
def calc_corr(idx, df1, df2):
arr1 = df1.iloc[idx:idx+5, :].values.flatten('F')
arr2 = df2.iloc[idx:idx+5, :].values.flatten('F')
df_tmp = pd.DataFrame([arr1, arr2]).T
df_tmp.dropna(how='any', inplace=True)
corr = df_tmp.corr().iloc[0, 1]
return corr
def aa(self):
df1 = pd.DataFrame(np.random.normal(size=(100, 6)))
df2 = pd.DataFrame(np.random.normal(size=(100, 6)))
with concurrent.futures.ProcessPoolExecutor() as executor:
results = [executor.submit(calc_corr, (i, df1, df2)) for i in range(20)]
for f in concurrent.futures.as_completed(results):
print(f.result())
if __name__ == '__main__':
t = test()
t.aa()
I am using a #staticmethod because it is not related to the class, it's just a computing tool. But using it raises the following error when running the code:
D:\anaconda3\python.exe C:/Users/jonas/Desktop/728_pj/test.py
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "D:\anaconda3\lib\multiprocessing\queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "D:\anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'staticmethod' object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\jonas\Desktop\728_pj\test.py", line 31, in <module>
t.aa()
File "C:\Users\jonas\Desktop\728_pj\test.py", line 26, in aa
print(f.result())
File "D:\anaconda3\lib\concurrent\futures\_base.py", line 438, in result
return self.__get_result()
File "D:\anaconda3\lib\concurrent\futures\_base.py", line 390, in __get_result
raise self._exception
File "D:\anaconda3\lib\multiprocessing\queues.py", line 245, in _feed
obj = _ForkingPickler.dumps(obj)
File "D:\anaconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle 'staticmethod' object
Process finished with exit code 1
Can anyone help me fix this?

I think it is somehow caused by the staticmethod being declared as global. When I tried removing the global calc_corr line and changing
results = [executor.submit(calc_corr, (i, df1, df2)) for i in range(20)] to
results = [executor.submit(self.calc_corr, i, df1, df2) for i in range(20)] it seemed to work fine. I'm not actually sure of the reason what you wrote doesn't work but hopefully this will.
Note: Removing the tuple for the arguments is unrelated to this issue but was causing another issue afterwards.

How to recover the return value of a function passed to multiprocessing.Process?

I have looked at this question to get started and it works just fine How can I recover the return value of a function passed to multiprocessing.Process?
But in my case I would like to write a small tool, that would connect to many computers and gather some statistics, each stat would be gathered within a Process to make it snappy. But as soon as I try to wrap up the multiprocessing command in a class for a machine then it fails.
Here is my code
import multiprocessing
import pprint
def run_task(command):
p = subprocess.Popen(command, stdout = subprocess.PIPE, universal_newlines = True, shell = False)
result = p.communicate()[0]
return result
MACHINE_NAME = "cptr_name"
A_STAT = "some_stats_A"
B_STAT = "some_stats_B"
class MachineStatsGatherer():
def __init__(self, machineName):
self.machineName = machineName
manager = multiprocessing.Manager()
self.localStats = manager.dict() # creating a shared ressource for the sub processes to use
self.localStats[MACHINE_NAME] = machineName
def gatherStats(self):
self.runInParallel(
self.GatherSomeStatsA,
self.GatherSomeStatsB,
)
self.printStats()
def printStats(self):
pprint.pprint(self.localStats)
def runInParallel(self, *fns):
processes = []
for fn in fns:
process = multiprocessing.Process(target=fn, args=(self.localStats))
processes.append(process)
process.start()
for process in processes:
process.join()
def GatherSomeStatsA(self, returnStats):
# do some remote command, simplified here for the sake of debugging
result = "Windows"
returnStats[A_STAT] = result.find("Windows") != -1
def GatherSomeStatsB(self, returnStats):
# do some remote command, simplified here for the sake of debugging
result = "Windows"
returnStats[B_STAT] = result.find("Windows") != -1
def main():
machine = MachineStatsGatherer("SOMEMACHINENAME")
machine.gatherStats()
return
if __name__ == '__main__':
main()
And here is the error message
Traceback (most recent call last):
File "C:\Users\mesirard\AppData\Local\Programs\Python\Python37\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "C:\Users\mesirard\AppData\Local\Programs\Python\Python37\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "d:\workdir\trunks6\Tools\VTKAppTester\Utils\NXMachineMonitorShared.py", line 45, in GatherSomeStatsA
returnStats[A_STAT] = result.find("Windows") != -1
TypeError: 'str' object does not support item assignment
Process Process-3:
Traceback (most recent call last):
File "C:\Users\mesirard\AppData\Local\Programs\Python\Python37\lib\multiprocessing\process.py", line 297, in _bootstrap
self.run()
File "C:\Users\mesirard\AppData\Local\Programs\Python\Python37\lib\multiprocessing\process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "d:\workdir\trunks6\Tools\VTKAppTester\Utils\NXMachineMonitorShared.py", line 50, in GatherSomeStatsB
returnStats[B_STAT] = result.find("Windows") != -1
TypeError: 'str' object does not support item assignment

The issue is coming from this line
process = multiprocessing.Process(target=fn, args=(self.localStats))
it should have a extra comma at the end of args like so
process = multiprocessing.Process(target=fn, args=(self.localStats,))

Memory issue with multiprocessing in Python

I am trying to use my other cores in my python program. And the following is the basic structure/logic of my code:
import multiprocessing as mp
import pandas as pd
import gc
def multiprocess_RUN(param):
result = Analysis_Obj.run(param)
return result
class Analysis_Obj():
def __init__(self, filename):
self.DF = pd.read_csv(filename)
return
def run_Analysis(self, param):
# Multi-core option
pool = mp.Pool(processes=1)
run_result = pool.map(multiprocess_RUN, [self, param])
# Normal option
run_result = self.run(param)
return run_result
def run(self, param):
# Let's say I have written a function to count the frequency of 'param' in the target file
result = count(self.DF, param)
return result
if __name__ == "__main__":
files = ['file1.csv', 'file2.csv']
params = [1,2,3,4]
results = []
for i in range(0,len(files)):
analysis = Analysis_Obj(files[i])
for j in range(0,len(params)):
result = analysis.run_Analysis(params[j])
results.append(result)
del result
del analysis
gc.collect()
If I comment out the 'Multi-core option' and run the 'Normal option' everything runs fine. But even if I run the 'Multi-core option' with processes=1 I get a Memory Error when my for loop starts on the 2nd file. I have deliberately set it up so that I create and delete an Analysis object in each for loop, so that the file that has been processed will be cleared from memory. Clearly this hasn't worked. Advice of how to get around this would be very much appreciated.
Cheers
EDIT:
Here is the error message I have in the terminal:
Exception in thread Thread-7:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
pool._maintain_pool()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
self._repopulate_pool()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
w.start()
File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/usr/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Class variable in multiprocessing - python

Here is my code:
import multiprocessing
import dill
class Some_class():
class_var = 'Foo'
def __init__(self, param):
self.name = param
def print_name(self):
print("we are in object "+self.name)
print(Some_class.class_var)
def run_dill_encoded(what):
fun, args = dill.loads(what)
return fun(*args)
def apply_async(pool, fun, args):
return pool.apply_async(run_dill_encoded, (dill.dumps((fun, args)),))
if __name__ == '__main__':
list_names = [Some_class('object_1'), Some_class('object_2')]
pool = multiprocessing.Pool(processes=4)
results = [apply_async(pool, Some_class.print_name, args=(x,)) for x in list_names]
output = [p.get() for p in results]
print(output)
It returns error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Python34\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\...\temp_obj_output_standard.py", line 18, in run_dill_encoded
return fun(*args)
File "C:/...temp_obj_output_standard.py", line 14, in print_name
print(Some_class.class_var)
NameError: name 'Some_class' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/...temp_obj_output_standard.py", line 31, in <module>
output = [p.get() for p in results]
File "C:/...temp_obj_output_standard.py", line 31, in <listcomp>
output = [p.get() for p in results]
File "C:\Python34\lib\multiprocessing\pool.py", line 599, in get
raise self._value
NameError: name 'Some_class' is not defined
Process finished with exit code 1
The code works fine without line print(Some_class.class_var). What is wrong with accessing class variables, both objects should have it and I don't think processes should conflict about it. Am I missing something?
Any suggestions on how to troubleshoot it? Do not worry about run_dill_encoded and
apply_async, I am using this solution until I compile multiprocess on Python 3.x.
P.S. This is already enough, but stackoverflow wants me to put more details, not really sure what to put.

KeyError: 0 using multiprocessing in python

I have the following code inwhich I try to call a function compute_cluster which do some computations and write the results in a txt file (each process write its results in different txt files independently), however, when I run the following code:
def main():
p = Pool(19)
p.map(compute_cluster, [(l, r) for l in range(6, 25) for r in range(1, 4)])
p.close()
if __name__ == "__main__":
main()
it crashes with the following errors:
File "RMSD_calc.py", line 124, in <module>
main()
File "RMSD_calc.py", line 120, in main
p.map(compute_cluster, [(l, r) for l in range(6, 25) for r in range(1, 4)])
File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 225, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 522, in get
raise self._value
KeyError: 0
and when I searched online for the meaning of "KeyError: 0" i didn't find anything helpful so any suggestions why this error happens is highly appreciated

KeyError happens in compute_cluster() in a child process and p.map() reraises it for you in the parent:
from multiprocessing import Pool
def f(args):
d = {}
d[0] # <-- raises KeyError
if __name__=="__main__":
p = Pool()
p.map(f, [None])
Output
Traceback (most recent call last):
File "raise-exception-in-child.py", line 9, in <module>
p.map(f, [None])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
KeyError: 0
To see the full traceback, catch the exception in the child process:
import logging
from multiprocessing import Pool
def f(args):
d = {}
d[0] # <-- raises KeyError
def f_mp(args):
try:
return f(args)
except Exception:
logging.exception("f(%r) failed" % (args,))
if __name__=="__main__":
p = Pool()
p.map(f_mp, [None])
Output
ERROR:root:f(None) failed
Traceback (most recent call last):
File "raise-exception-in-child.py", line 10, in f_mp
return f(args)
File "raise-exception-in-child.py", line 6, in f
d[0] # <-- raises KeyError
KeyError: 0
It shows that d[0] caused the exception.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

multiprocessing on dataframe - python

Related

Use multiprocess in class with python version == 3.9

How to recover the return value of a function passed to multiprocessing.Process?

Memory issue with multiprocessing in Python

Class variable in multiprocessing - python

KeyError: 0 using multiprocessing in python

Categories

Resources