import modules in python jobs with dispy - python

I'm working on a program that runs in parallel using dispy.
I use dispy to create tasks and distribute them to different CPUs for execution.
I use standard libraries as well as libraries I developed myself (data and connection).
The code looks like this:
import dispy
import sys
import data
import connection
def compute(num):
    # some code that calls data and connection methods and generates a solution
    return solution
def main():
    cluster = dispy.JobCluster(compute)
    jobs = []
    for i in range(10):
        job = cluster.submit(i)
        job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    for job in jobs:
        job()
        print("Result = " + str(job.result))
        print("Exception = " + str(job.exception))
if __name__ == "__main__":
    main()
Everything works fine if I use data and connection inside main, and also if I call compute as an ordinary function instead of going through dispy.
But when compute runs as a dispy job and calls a function from data, it throws an exception saying that data is not defined, and the script prints Exception = None.
Any help? The documentation suggests using setup, but I can't figure out how it works.

Put the import data call inside the compute function.
Dispy ships the function to call along with its arguments to the new process. The new process doesn't have data imported. That's why adding import data inside the function definition should fix this.
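As a rough sketch of what "import inside the function" looks like: the snippet below does not use dispy at all, and json and math are only stand-ins (an assumption for illustration) for the asker's own data and connection modules. The point is just that the function carries its own imports, so it still works when it is executed somewhere the top-level imports never ran.

```python
def compute(num):
    # Imports live inside the function, so they run in whichever process
    # executes the job, not only in the process that defined the function.
    # json and math are stand-ins for the asker's data and connection modules.
    import json
    import math

    solution = {"input": num, "sqrt": math.sqrt(num)}
    return json.dumps(solution, sort_keys=True)

# With dispy this function would be handed to dispy.JobCluster(compute);
# here it is called directly to show that it is self-sufficient.
result = compute(9)
print(result)
```

With dispy, the body of compute would stay the same and only the direct call at the bottom would be replaced by cluster.submit(...).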

JobCluster(compute, depends=[data])
Specify that the compute function depends on whichever modules you need.

If it is a module that you know is installed on all machines, you can simply put import data, connection inside the compute function.
I know it is not elegant, but it works for me. Otherwise there are 2 options:
Get rid of the main function and put its body in the if __name__ == "__main__" block, because main is likely to be executed when the function gets to the cluster.
Define all your module code inside one big function and pass it to the cluster. This is a very simple yet powerful approach:
import dispy
import sys
def compute(num):
    def data_func1(json_):
        # do something to json_
        return json_
    def data_func2(json_):
        # do something different
        return json_
    # some code that calls the data and connection helpers and generates a solution
    return solution
if __name__ == "__main__":
    cluster = dispy.JobCluster(compute)
    jobs = []
    for i in range(10):
        job = cluster.submit(i)
        job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    for job in jobs:
        job()
        print("Result = " + str(job.result))
        print("Exception = " + str(job.exception))
Or define all your functions in the script and pass all of them as depends when creating the job cluster, like this:
import dispy
import sys
def data_func1(json_):
    # do something to json_
    return json_
def data_func2(json_):
    # do something different
    return json_
class DataClass:
    pass
def compute(num):
    # some code that calls the data and connection helpers and generates a solution
    return solution
if __name__ == "__main__":
    cluster = dispy.JobCluster(compute, depends=[data_func1,
                                                 data_func2,
                                                 DataClass])
    jobs = []
    for i in range(10):
        job = cluster.submit(i)
        job.id = i # optionally associate an ID to job (if needed later)
        jobs.append(job)
    for job in jobs:
        job()
        print("Result = " + str(job.result))
        print("Exception = " + str(job.exception))

Related

my python Ray script runs on a single worker only

I am new to Ray and, after reading the documentation, I came up with a script that mimics what I want to do later with Ray. Here is my script:
import ray
import time
import h5py
@ray.remote
class Analysis:
    def __init__(self):
        self._file = h5py.File('./Data/Trajectories/MDANSE/apoferritin.h5')
    def __getstate__(self):
        print('I dump')
        d = self.__dict__.copy()
        del d['_file']
        return d
    def __setstate__(self,state):
        self.__dict__ = state
        self._file = h5py.File('./Data/Trajectories/MDANSE/apoferritin.h5')
    def run_step(self,index):
        time.sleep(5)
        print('I run a step',index)
    def combine(self,index):
        print('I combine',index)
ray.init(num_cpus=4)
a = Analysis.remote()
obj_id = ray.put(a)
for i in range(100):
    output = ray.get(a.run_step.remote(i))
My problem is that when I run this script, it runs on a single worker, as indicated by the Ray output, whereas I would expect 4 workers to be started. Do you know what is wrong with my script?
Quoting from the Ray docs on actors:
Methods called on different actors can execute in parallel, and methods called on the same actor are executed serially in the order that they are called.
Another issue with the above code is that ray.get is a blocking call, so each loop iteration waits for the previous step to finish.
I suggest instantiating multiple actors and spreading the jobs across them, like this:
num_cpus = 4  # same value as passed to ray.init above
actors = [Analysis.remote() for i in range(num_cpus)]
outputs = []
for i in range(100):
    outputs.append(actors[i % num_cpus].run_step.remote(i))
output = ray.get(outputs)

Turn for-loop code into multi-threading code with max number of threads

Background: I'm trying to run hundreds of Dymola simulations with the Python-Dymola interface. I managed to run them in a for-loop. Now I want to run them with multi-threading so that multiple models run in parallel (which will be much faster). Since probably few people use the interface, I wrote some simple code that also shows my problem:
1: Turn a for-loop into a function that is called inside another for-loop, where both the function and the for-loop share the same variable 'i'.
2: Turn a for-loop into a function and use multi-threading to execute it. A for-loop runs the commands one by one. I want to run them in parallel with a maximum of x threads at the same time. The result should be the same as when executing the for-loop.
Example code:
import os
nSim = 100
ndig='{:01d}'
for i in range(nSim):
    os.makedirs(str(ndig.format(i)))
Note that the names of the created directories are just the numbers from the for-loop (this is important). Instead of using the for-loop, I would like to create the directories with multi-threading (this is probably not worthwhile for such short code, but it definitely is when calling and executing hundreds of simulation models).
I started with what I thought was something simple: turning the for-loop body into a function that is then called inside another for-loop. I hoped to get the same result as with the for-loop code above, but got this error:
AttributeError: 'NoneType' object has no attribute 'start'
(note: I just started with this, because I did not use the def-statement before and the thread package is also new. After this I would evolve towards the multi-threading.)
1:
import os
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
for i in range(nSim):
    simulation(i=i).start
After that failed, I tried to evolve to multi-threading (converting the for-loop into something that does the same but with multi-threading and by that running the code parallel instead of one by one and with a maximum number of threads):
2:
import os
import threading
nSim = 100
ndig='{:01d}'
def simulation(i):
    os.makedirs(str(ndig.format(i)))
if __name__ == '__main__':
    i in range(nSim)
    simulation_thread[i] = threading.Thread(target=simulation(i=i))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
Unfortunately that attempt failed as well and now I got the error:
NameError: name 'i' is not defined
Does anybody have suggestions for issues 1 or 2?
Both examples are incomplete. Here's a complete example. Note that target is passed the function itself (target=simulation) and args is passed a tuple of its arguments (args=(i,)). Don't call the function (target=simulation(i=i)), because that passes the function's return value, which is equivalent to target=None in this case.
import threading
nSim = 100
def simulation(i):
    print(f'{threading.current_thread().name}: {i}')
if __name__ == '__main__':
    threads = [threading.Thread(target=simulation,args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Output:
Thread-1: 0
Thread-2: 1
Thread-3: 2
.
.
Thread-98: 97
Thread-99: 98
Thread-100: 99
Note that you usually don't want more threads than CPUs; you can get the CPU count from multiprocessing.cpu_count(). You can create a thread pool and use queue.Queue to post work for the threads to execute. An example is in the Python queue documentation.
You cannot call .start like this
simulation(i=i).start
on a non-threading object. Also, you have to import the threading module.
It seems you forgot to add 'for' and to indent the code in your loop. Change
i in range(nSim)
simulation_thread[i] = threading.Thread(target=simulation(i=i))
simulation_thread[i].daemon = True
simulation_thread[i].start()
to
simulation_thread = {}
for i in range(nSim):
    # pass the function and its arguments separately; do not call it here
    simulation_thread[i] = threading.Thread(target=simulation, args=(i,))
    simulation_thread[i].daemon = True
    simulation_thread[i].start()
If you would like a pool with a maximum number of threads that works through all items in a queue, you can build on @mark-tolonen's answer like this:
import threading
import queue
import time
def main():
    size_of_threads_pool = 10
    num_of_tasks = 30
    task_seconds = 1
    q = queue.Queue()
    def worker():
        while True:
            item = q.get()
            print(my_st)
            print(f'{threading.current_thread().name}: Working on {item}')
            time.sleep(task_seconds)
            print(f'Finished {item}')
            q.task_done()
    my_st = "MY string"
    threads = [threading.Thread(target=worker, daemon=True) for i in range(size_of_threads_pool)]
    for t in threads:
        t.start()
    # send the task requests to the workers
    for item in range(num_of_tasks):
        q.put(item)
    # block until all tasks are done
    q.join()
    print('All work completed')
    # No need for this: the workers loop forever (while True), so they never stop.
    # for t in threads:
    #     t.join()
if __name__ == '__main__':
    main()
This will run 30 tasks of 1 second each, using 10 threads.
So the total time will be about 3 seconds.
$ time python3 q_test.py
...
All work completed
real 0m3.064s
user 0m0.033s
sys 0m0.016s
EDIT: I found another higher-level interface for asynchronously executing callables.
Use concurrent.futures, see the example in the docs:
import concurrent.futures
import urllib.request
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Note max_workers=5, which sets the maximum number of threads, and
the for url in URLS loop, which you can adapt to your own list of tasks.
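For the directory-creation loop from the question, the same idea might look roughly like the sketch below. The names run_all, base_dir and max_threads are made up for illustration, and a temporary directory is used (an assumption, so the example does not litter the current directory); the body of simulation would be replaced by the real simulation call.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def simulation(base_dir, i):
    # Stand-in for one simulation run: create a directory named after the index.
    path = os.path.join(base_dir, '{:01d}'.format(i))
    os.makedirs(path)
    return path

def run_all(base_dir, n_sim, max_threads):
    # At most max_threads calls run at the same time; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        return list(executor.map(lambda i: simulation(base_dir, i), range(n_sim)))

base = tempfile.mkdtemp()  # scratch directory, hypothetical location for the output
created = run_all(base, n_sim=100, max_threads=4)
print(len(created))
```

The result is the same set of directories as the plain for-loop produces, only created by a bounded number of worker threads.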

Run an object method in a daemon thread in python

I am trying to simulate an environment with VMs and to run an object method in a background thread. My code looks like the following.
hyper_v.py file :
import random
from threading import Thread
from virtual_machine import VirtualMachine
class HyperV(object):
    def __init__(self, hyperv_name):
        self.hyperv_name = hyperv_name
        self.vms_created = {}
    def create_vm(self, vm_name):
        if vm_name not in self.vms_created:
            vm1 = VirtualMachine({'vm_name': vm_name})
            self.vms_created[vm_name] = vm1
            vm1.boot()
        else:
            print('VM:', vm_name, 'already exists')
    def get_vm_stats(self, vm_name):
        print('vm stats of ', vm_name)
        print(self.vms_created[vm_name].get_values())
if __name__ == '__main__':
    hv = HyperV('temp')
    vm_name = 'test-vm'
    hv.create_vm(vm_name)
    print('getting vm stats')
    th2 = Thread(name='vm1_stats', target=hv.get_vm_stats(vm_name) )
    th2.start()
virtual_machine.py file in the same directory:
import random, time, uuid, json
from threading import Thread
class VirtualMachine(object):
    def __init__(self, interval = 2, *args, **kwargs):
        self.vm_id = str(uuid.uuid4())
        #self.vm_name = kwargs['vm_name']
        self.cpu_percentage = 0
        self.ram_percentage = 0
        self.disk_percentage = 0
        self.interval = interval
    def boot(self):
        print('Bootingup', self.vm_id)
        th = Thread(name='vm1', target=self.update() )
        th.daemon = True #Setting the thread as daemon thread to run in background
        print(th.isDaemon()) #This prints true
        th.start()
    def update(self):
        # This method needs to run in the background simulating an actual vm with changing values.
        i = 0
        while(i < 5 ): #Added counter for debugging, ideally this would be while(True)
            i+=1
            time.sleep(self.interval)
            print('updating', self.vm_id)
            self.cpu_percentage = round(random.uniform(0,100),2)
            self.ram_percentage = round(random.uniform(0,100),2)
            self.disk_percentage = round(random.uniform(0,100),2)
    def get_values(self):
        return_json = {'cpu_percentage': self.cpu_percentage,
                       'ram_percentage': self.ram_percentage,
                       'disk_percentage': self.disk_percentage}
        return json.dumps(return_json)
The idea is to create a thread that keeps updating the values; on request, we read the values of the VM object by calling vm_obj.get_values(). We would create multiple vm_objects to simulate multiple VMs running in parallel, and we need to get the information from a particular VM on request.
The problem I am facing is that the update() function of the VM does not run in the background (even though the thread is set as a daemon thread).
The method call hv.get_vm_stats(vm_name) waits until vm_object.update() (which is called by vm_object.boot()) has completed and only then prints the stats. I would like to get the stats of the VM on request while vm_object.update() keeps running in the background forever.
Please share your thoughts if I am overlooking anything basic. I tried looking into issues related to the Python threading library but could not come to any conclusion. Any help is greatly appreciated. The next step would be a REST API that calls these functions to get the data of any VM, but I am stuck on this problem.
Thanks in advance,
As pointed out by @Klaus D in the comments, my mistake was using parentheses when specifying the target function in the thread definition, which resulted in the function being called right away.
target=self.update() will call the method right away. Remove the () to hand the method over to the thread without calling it.
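A minimal sketch of the corrected pattern, using a made-up Counter class rather than the VirtualMachine class from the question (the class name, the short interval and the Event used for waiting are all assumptions for illustration):

```python
import threading
import time

class Counter(object):
    def __init__(self, interval=0.01):
        self.interval = interval
        self.value = 0
        self.done = threading.Event()

    def start(self):
        # Pass the bound method itself (no parentheses); the new thread calls it.
        th = threading.Thread(name='updater', target=self.update)
        th.daemon = True
        th.start()
        return th

    def update(self):
        # Runs in the background thread.
        for _ in range(5):
            time.sleep(self.interval)
            self.value += 1
        self.done.set()

counter = Counter()
thread = counter.start()       # returns without waiting for update() to finish
counter.done.wait(timeout=5)   # wait here only so the printed value is deterministic
print(counter.value)
```

With target=self.update() instead, update would run to completion in the calling thread and the Thread would be given its return value (None) as the target.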

Using Multiprocessing with Modules

I am writing a module in which one function should use Pool from the multiprocessing library in Python 3.6. I have done some research on the problem, and it seems that you cannot use if __name__=="__main__" because the code is not being run from main. I have also noticed that the Python pool processes show up in my task manager but are essentially stuck.
So for example:
class myClass():
    ...
    lots of different functions here
    ...
    def multiprocessFunc():
        do stuff in here
    def funcThatCallsMultiprocessFunc():
        array=[array of filenames to be called]
        if __name__=="__main__":
            p = Pool(processes=20)
            p.map_async(multiprocessFunc,array)
I tried to remove the if __name__=="__main__" part, but still no dice. Any help would be appreciated.
It seems to me that you have just missed out a self. in your code. I think this will work:
class myClass():
    ...
    # lots of different functions here
    ...
    def multiprocessFunc(self, file):
        # do stuff in here
    def funcThatCallsMultiprocessFunc(self):
        array = [array of filenames to be called]
        p = Pool(processes=20)
        p.map_async(self.multiprocessFunc, array) #added self. here
Now having done some experiments, I see that map_async could take quite some time to start up (I think because multiprocessing creates processes) and any test code might call funcThatCallsMultiprocessFunc and then quit before the Pool has got started.
In my tests I had to wait for over 10 seconds after funcThatCallsMultiprocessFunc before calls to multiprocessFunc started. But once started, they seemed to run just fine.
This is the actual code I've used:
MyClass.py
from multiprocessing import Pool
import time
import string
class myClass():
    def __init__(self):
        self.result = None
    def multiprocessFunc(self, f):
        time.sleep(1)
        print(f)
        return f
    def funcThatCallsMultiprocessFunc(self):
        array = [c for c in string.ascii_lowercase]
        print(array)
        p = Pool(processes=20)
        p.map_async(self.multiprocessFunc, array, callback=self.done)
        p.close()
    def done(self, arg):
        self.result = 'Done'
        print('done', arg)
Run.py
from MyClass import myClass
import time
def main():
    c = myClass()
    c.funcThatCallsMultiprocessFunc()
    for i in range(30):
        print(i, c.result)
        time.sleep(1)
if __name__=="__main__":
    main()
The if __name__=='__main__' construct is an import guard. Use it to stop multiprocessing from re-running your setup code when the module is imported.
In your case, you can leave this guard out of the class definition. Be sure to guard the entry point in the calling file, like this:
import multiprocessing as mp
# parallel_function, callback_function, x, y and z are defined elsewhere
def apply_async_with_callback():
    pool = mp.Pool(processes=30)
    for i in range(z):
        pool.apply_async(parallel_function, args = (i,x,y, ), callback = callback_function)
    pool.close()
    pool.join()
    print("Multiprocessing done!")
if __name__ == '__main__':
    apply_async_with_callback()

os.chdir between multiple python processes

I have a complex Python pipeline (whose code I can't change) that calls multiple other scripts and executables. It takes ages to run over 8000 directories, doing some scientific analyses. So I wrote a simple wrapper using the multiprocessing module (it might not be the most efficient, but it seems to work).
from os import path, listdir, mkdir, system
from os.path import join as osjoin, exists, isfile
from GffTools import Gene, Element, Transcript
from GffTools import read as gread, write as gwrite, sort as gsort
from re import match
from multiprocessing import JoinableQueue, Process
from sys import argv, exit
# some absolute paths
inbase = "/.../abfgp_in"
outbase = "/.../abfgp_out"
abfgp_cmd = "python /.../abfgp-2.rev/abfgp.py"
refGff = "/.../B0510_manual_reindexed_noSeq.gff"
# the Queue
Q = JoinableQueue()
i = 0
# define number of processes
try: num_p = int(argv[1])
except ValueError: exit("Wrong CPU argument")
# This is the function calling the abfgp.py script, which in its turn calls alot of third party software
def abfgp(id_, pid):
    out = osjoin(outbase, id_)
    if not exists(out): mkdir(out)
    # logfile
    log = osjoin(outbase, "log_process_%s" %(pid))
    try:
        # call the script
        system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))
    except:
        print("ABFGP FAILED")
        return
# parse the output
def extractGff(id_):
    # code not relevant
    pass
# function called by multiple processes, using the Queue
def run(Q, pid):
    while not Q.empty():
        try:
            d = Q.get()
            print("%s\t=>>\t%s" %(str(i-Q.qsize()), d))
            abfgp(d, pid)
            Q.task_done()
        except KeyboardInterrupt:
            exit("Interrupted Child")
# list of directories
genedirs = [d for d in listdir(inbase)]
genes = gread(refGff)
for d in genedirs:
    i += 1
    indir = osjoin(inbase, d)
    outdir = osjoin(outbase, d)
    Q.put(d)
# this loop creates the multiple processes
procs = []
for pid in range(num_p):
    try:
        p = Process(target=run, args=(Q, pid+1))
        p.daemon = True
        procs.append(p)
        p.start()
    except KeyboardInterrupt:
        print("Aborting start of child processes")
        for x in procs:
            x.terminate()
        exit("Interrupted")
try:
    for p in procs:
        p.join()
except:
    print("Terminating child processes")
    for x in procs:
        x.terminate()
    exit("Interrupted")
print("Parsing output...")
for d in genedirs: extractGff(d)
Now the problem is that abfgp.py uses the os.chdir function, which seems to disrupt the parallel processing. I get a lot of errors stating that some (input/output) files or directories cannot be found for reading or writing. This happens even though I call the script through os.system(), which I thought would spawn separate processes and so prevent this.
How can I work around this chdir interference?
Edit: I might change os.system() to subprocess.Popen(cwd="...") with the right directory. I hope this makes a difference.
Thanks.
Edit 2
Do not use os.system(); use subprocess.call() instead.
system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))
would translate to
subprocess.call(abfgp_cmd.split() + ['--dna', osjoin(inbase, id_, id_ +".dna.fa"), '--multifasta', osjoin(inbase, id_, "informants.mfa"), '--target', id_, '-o', out, '-q']) # without log; abfgp_cmd is split because it holds both the interpreter and the script path
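To illustrate the cwd idea from the question's edit, here is a small sketch (not the asker's pipeline): the child process is a throwaway Python one-liner, and run_in_dir and the temporary directory are made-up names for illustration. It shows that a child started with cwd=... begins in that directory, and that a chdir inside the child does not change the parent's working directory.

```python
import os
import subprocess
import sys
import tempfile

def run_in_dir(workdir):
    # The child starts in workdir and then changes directory itself.
    # It is a separate process, so the parent's working directory is untouched.
    child_code = (
        "import os; print(os.getcwd()); "
        "os.chdir(os.sep); print(os.getcwd())"
    )
    out = subprocess.check_output([sys.executable, "-c", child_code], cwd=workdir)
    return out.decode().splitlines()

parent_before = os.getcwd()
workdir = tempfile.mkdtemp()       # stand-in for one of the per-gene directories
child_dirs = run_in_dir(workdir)   # [directory at start, directory after chdir]
parent_after = os.getcwd()
print(parent_before == parent_after)
```

In the wrapper from the question, each call would get its own cwd, so relative paths used by the called script resolve per job.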
Edit 1
I think the problem is that multiprocessing uses module names to serialize functions and classes.
This means that if you do import module, where module is in ./module.py, and then do something like os.chdir('./dir'), you would now need from .. import module.
The child processes inherit the working directory of the parent process. This may be a problem.
Solutions
Make sure that all modules are imported (in the child processes) before you change the directory.
Insert the original os.getcwd() into sys.path to enable imports from the original directory. This must be done before any functions from the local directory are called.
Put all functions that you use inside a directory that can always be imported. site-packages could be such a directory. Then you can do something like import module; module.main() to start what you do.
This is a hack that I use because I know how pickle works. Only use it if the other attempts fail.
The script prints:
serialized # the function runD is serialized
string executed # before the function is loaded the code is executed
loaded # now the function run is deserialized
run # run is called
In your case you would do something like this:
runD = evalBeforeDeserialize('__import__("sys").path.append({})'.format(repr(os.getcwd())), run)
p = Process(target=runD, args=(Q, pid+1))
This is the script:
# functions that you need
class R(object):
    def __init__(self, call, *args):
        self.ret = (call, args)
    def __reduce__(self):
        return self.ret
    def __call__(self, *args, **kw):
        raise NotImplementedError('this should never be called')
class evalBeforeDeserialize(object):
    def __init__(self, string, function):
        self.function = function
        self.string = string
    def __reduce__(self):
        return R(getattr, tuple, '__getitem__'), \
               ((R(eval, self.string), self.function), -1)
# code to show how it works
def printing():
    print('string executed')
def run():
    print('run')
runD = evalBeforeDeserialize('__import__("__main__").printing()', run)
import pickle
s = pickle.dumps(runD)
print('serialized')
run2 = pickle.loads(s)
print('loaded')
run2()
Please report back if these do not work.
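The second solution in the list (adding the original working directory to sys.path) could be sketched as follows. The helper name remember_import_dir is made up for illustration, and the temporary directory only simulates code that calls os.chdir.

```python
import os
import sys
import tempfile

def remember_import_dir():
    # Record the directory the program started in and make it importable,
    # so a later os.chdir() does not hide modules that live there.
    original_dir = os.getcwd()
    if original_dir not in sys.path:
        sys.path.insert(0, original_dir)
    return original_dir

original_dir = remember_import_dir()
os.chdir(tempfile.mkdtemp())             # simulate code that changes directory
still_importable = original_dir in sys.path
os.chdir(original_dir)                   # restore, so the example has no side effects
print(still_importable)
```

This has to run before the directory is changed, since os.getcwd() only reflects the current state.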
You could determine which instance of the os library the unalterable program is using; then create a tailored version of chdir in that library that does what you need -- prevent the directory change, log it, whatever. If the tailored behavior needs to be just for the single program, you can use the inspect module to identify the caller and tailor the behavior in a specific way for just that caller.
Your options are limited if you truly can't alter the existing program; but if you have the option of altering libraries it imports, something like this could be a least-invasive way to skirt the undesired behavior.
Usual caveats apply when altering a standard library.
