I am struggling with multiprocessing. I have some heavy image processing to do and want to make use of multi-core CPU power. I tried a lot of approaches and finally wanted to use the concurrent.futures module because it is quite handy to use. But when I set up my program, it runs and runs and runs and... it does not stop. The basic idea is as follows (not related to image processing, just a dummy setup):
import concurrent.futures as cf
import time
import multiprocessing as mp
def someFunc(seconds, multiplier=1):
    time.sleep(multiplier * seconds)
    return f'Slept for {multiplier*seconds} s, Proc: {mp.current_process()}'

def parallelize(secs):
    factor = 2

    def wrapper(sec):
        return someFunc(sec, factor)

    with cf.ProcessPoolExecutor() as executor:
        results = [executor.submit(wrapper, secs) for _ in range(8)]
        for result in cf.as_completed(results):
            print(result.result())
So, I am running this under Windows 10 in a Jupyter notebook. For this reason, the functions are saved in a separate func.py file which is imported in the notebook and then run under the if __name__ == '__main__' statement:
import func

if __name__ == '__main__':
    func.parallelize('some_int_number')
The reason I do this is that I have to pass two arguments to the parallelize() function, while the submit() method, as far as I can tell, only lets me pass a single argument along with the function. I know one could also make use of the map() method or whatever, but for some reason (overhead?!?) the effect of parallelizing is not very significant (I have been playing with the possibilities for days now). So I wanted to try the submit() method as proposed.
BUT, this does not work (the script runs endlessly) and I don't know why. The other problem is that I have to handle a 'static' argument (factor) which is only known in the scope of the parallelize() function.
If I defined the wrapper function outside the parallelize() function, the script would run as expected, but then I would have the problem of the static factor variable.
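For concreteness, this is roughly what defining the worker outside parallelize with the static factor bound via functools.partial would look like (a sketch only, not tested against the exact Windows/Jupyter setup described above):

# func.py -- sketch: module-level function plus functools.partial instead of a nested wrapper,
# so the callable handed to submit() stays picklable
import concurrent.futures as cf
import functools
import multiprocessing as mp
import time

def someFunc(seconds, multiplier=1):
    time.sleep(multiplier * seconds)
    return f'Slept for {multiplier*seconds} s, Proc: {mp.current_process()}'

def parallelize(secs):
    factor = 2
    wrapper = functools.partial(someFunc, multiplier=factor)  # binds the "static" factor
    with cf.ProcessPoolExecutor() as executor:
        results = [executor.submit(wrapper, secs) for _ in range(8)]
        for result in cf.as_completed(results):
            print(result.result())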
Any ideas?
Related
I'm writing a program that uses the ray package for multiprocessing. In the program, there is a function that is called 5 times at the same time. During the execution, I want to show a progress bar using a PyQt5 QProgressBar to indicate how much work is done. My idea is to let every execution of the function update the progress bar by 20%. So I wrote code like the following:
running_tasks = [myFunction.remote(x,y,z,self.progressBar,QApplication) for x in myList]
Results = list(ray.get(running_tasks))
Inside myFunction, there are lines that update the progress bar that was passed in:
QApplication.processEvents()
progressBar.setValue(progressBar.value() + 20)
But when I run the code, I get the following error:
TypeError: Could not serialize the argument
<PyQt5.QtWidgets.QProgressBar object at 0x000001B787A36B80> for a task
or actor myFile.myFunction. Check
https://docs.ray.io/en/master/serialization.html#troubleshooting for
more information.
I searched the internet (the URL in the error message returns 404) and I understand that this error occurs because multiprocessing in ray doesn't share memory between processes, so sending a class attribute (like self.progressBar) means each process gets its own copy and modifies it locally only. I also tried using the multiprocessing package instead of ray, but it throws a pickling error, and I assume it is due to the same reason. So, can anyone confirm whether I'm right, or provide a further explanation of the error?
Also, how can I achieve my requirement with multiprocessing (i.e. updating the same progress bar from several workers) if multiprocessing doesn't have shared memory between processes?
I am unfamiliar with ray, but you can do this with the multiprocessing library using a multiprocessing.Queue().
The Queue is exactly what its name says: a queue where you can put data for other processes to read. In my case I usually put a dictionary on the Queue with a command (key) and what to do with that command (value).
In one process you call Queue.put() and in the other you call Queue.get(); that is enough if you only want to pass data in one direction. The example below emulates what you may be looking to do.
I usually use a QTimer to check whether there is any data in the queue, but you can also check whenever you like by calling a method to do so (see the QTimer sketch after the code below).
from multiprocessing import Process, Queue

myQueue = Queue()

class FirstProcess():
    ...

    def update_progress_percentage(self, percentage):
        self.progress_percentage = percentage

    def send_data_to_other_process(self):
        # hand the current percentage to whichever process reads the queue
        myQueue.put({"UpdateProgress": self.progress_percentage})

class SecondProcess():
    ...

    def get_data_from_other_process(self):
        # drain everything that arrived since the last check
        while not myQueue.empty():
            queue_dict = myQueue.get()
            for key in queue_dict:
                if key == "UpdateProgress":
                    percentage = queue_dict["UpdateProgress"]
                    progressBar.setValue(percentage)
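To illustrate the QTimer idea on the GUI side, here is a small self-contained sketch (the widget names and the 100 ms interval are my own choices; in a real program the same queue object has to be passed to the worker processes as an argument):

import sys
from multiprocessing import Queue
from PyQt5.QtCore import QTimer
from PyQt5.QtWidgets import QApplication, QProgressBar

# my_queue stands in for the myQueue used above; workers would put
# {"UpdateProgress": percentage} dicts into it
my_queue = Queue()

app = QApplication(sys.argv)
progress_bar = QProgressBar()
progress_bar.show()

def poll_queue():
    while not my_queue.empty():
        queue_dict = my_queue.get()
        if "UpdateProgress" in queue_dict:
            progress_bar.setValue(queue_dict["UpdateProgress"])

timer = QTimer()
timer.timeout.connect(poll_queue)
timer.start(100)  # check for new messages every 100 ms

sys.exit(app.exec_())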
I have 2 separate scripts working with the same variables.
To be more precise, one script edits the variables and the other one uses them (it would be nice if it could edit them too, but that is not absolutely necessary).
This is what I am currently doing:
When script 1 edits a variable, it dumps it into a JSON file.
Script 2 repeatedly opens the JSON file to get the variables.
This method is really not elegant, and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a MIDI controller and sends web requests.
My second script is for LED strips (those run thanks to the same MIDI controller). Both scripts run in a while True loop.
I can't simply put them in the same script since every web request would slow the LEDs down. I am currently just sharing the variables via a JSON file.
If enough people ask for it I will post the whole code, but I have been told not to do this.
Considering the information you provided, namely...
Both scripts run in a while True loop.
I can't simply put them in the same script since every web request would slow the LEDs down.
To me, you have 2 choices:
Use a client/server model. You have 2 machines. One acts as the server and the second as the client. The server runs a script with an infinite loop that constantly updates the data, plus an API that simply reads and exposes the current state of your file/database to the client. The client sits on the other machine and, as I understand it, simply requests the current data and processes it (a sketch follows below).
Make a single multiprocessing script. Each script would run in its own process and manage its own memory. As you also want to share variables between your two programs, you could pass as an argument an object that is shared between both of them. See this resource to help you.
Note that there are more solutions to this. For instance, you are using a JSON file that you constantly open and close (that is probably what takes the most time in your program). You could use a real database that only needs to be opened once and can be read many times while still being updated.
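For the first (client/server) option, a minimal local sketch using multiprocessing.connection could look like this (the port, authkey, and server.py/client.py split are placeholders; the server's data-updating loop is omitted and would normally run in its own thread):

# server.py -- answers requests with the current state
from multiprocessing.connection import Listener

current_state = {'value': 0}  # whatever your MIDI script keeps updating

with Listener(('localhost', 6000), authkey=b'change-me') as listener:
    while True:
        with listener.accept() as conn:
            conn.send(current_state)  # expose the current state to the client

and on the client side:

# client.py -- the LED script asks for the latest state whenever it needs it
from multiprocessing.connection import Client

with Client(('localhost', 6000), authkey=b'change-me') as conn:
    state = conn.recv()
print(state['value'])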
a Manager from multiprocessing lets you do this sort of thing pretty easily
first I simplify your "midi controller and sends web-requests" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random
def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
next we simplify the "LED strip" control down to something that just prints to the screen:
from time import perf_counter
def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
you can then run these functions in separate processes:
import multiprocessing as mp
with mp.Manager() as manager:
    d = manager.dict()
    procs = []
    for fn in [slow_fn, fast_fn]:
        p = mp.Process(target=fn, args=[d])
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
the "fast" output happens regularly with no obvious visual pauses
My classes have become complicated, and I think the way I am instantiating and not closing them may be a problem... sometimes the program just hangs, and code that took 0.2 seconds to run will take 15 seconds. It happens randomly and rebooting seems to fix it. Restarting the program without rebooting does not; I'm using Ctrl+C to stop the program, running on CentOS 7.
Here's my class layout:
def do_stuff(self):
    pool_class = pools()
    sql_pool_insert = pool_class.generate_sql()
    print "pools done"
    sql_HD_insert = hard_drives(pool_class).generate_sql()
    print "hds done"

class pools
class devs

class hard_drives(object):
    def __init__(self, pool_class=pools()):  # note: the default argument creates a pools() instance
        self.drives_table = []
        self.assigned_drive_names = []
        self.unassigned_drives_table = []
        # for each pool, generate the drives list, actively combining
        for pool in pool_class.pool_name:
            self.drives_table = self.drives_table + pool_class.Vdevs(pool).get_drive_table()
        i = 0
Outside of that file, I have my main program that imports the above as one of many packages. The main file calls the functions defined above the classes (because they use the classes), and sometimes the main program uses discrete functions inside the classes. The problem is that the pools and devs __init__ sections are HUGE! It takes 0.7 seconds to run all that code in some cases, so I want to instantiate them correctly and close them effectively. Here is a sample from the main program:
if new_pool == 1:
    result = mypackage.pools().create_pool(pool_to_create)
I really feel like I'm doing two or three things wrong. Perhaps I should just create an instance of the pools class at the top of my main program's while loop, use that everywhere in the loop, and then close it at the end of the loop; the loop runs every 0.25 seconds. I want the pools class to be instantiated at the beginning of the while loop and closed at the end. Is this a good approach? This question applies equally to the MySQL InnoDB cursors I'm using, which may be another factor in the issue: how should these be handled so that instances of cursors and classes don't get out of hand? Heck, maybe I should be instantiating the entire package instead of one class in the package, i.e.:
with package as MyPackage:
    MyPackage.pool()  # do whatever; all inits in the package have run once and you don't
                      # have to worry about them running again, so long as you use the
                      # package object for everything???
Maybe I'm just confused... it comes down to this: how do I control when a class's __init__ function is run? It really only needs to run at the beginning of the loop, so that the loop has fresh data to work with, and the instance then gets used everywhere else... but you see how the pools class is used by the hard_drives class... how would I handle that?
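Here is a sketch of the "create it at the top of the loop, use it everywhere, let it go at the end" approach you describe, using the names from your snippets (it assumes hard_drives is exposed by mypackage the same way pools is; plain Python objects don't need an explicit close, they are freed once nothing references them):

import time
import mypackage

while True:
    pool_class = mypackage.pools()          # __init__ runs exactly once per iteration
    hd = mypackage.hard_drives(pool_class)  # pass the instance in explicitly rather than
                                            # relying on the default argument, which is
                                            # evaluated only once, at import time
    sql_pool_insert = pool_class.generate_sql()
    sql_HD_insert = hd.generate_sql()
    # ... use sql_pool_insert / sql_HD_insert and run the rest of the loop body ...
    time.sleep(0.25)
    # pool_class and hd are simply rebound on the next iteration; CPython frees the
    # old instances once nothing references them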
Joblib parallel computation takes more time for n_jobs>1 (n_jobs=2 finishes in 12.6 s) than for n_jobs=1 (finishes in 1.3 s). I am on Mac OS X 10.9 with 16 GB RAM. Am I making some mistake? Here is a simple demo code:
from joblib import Parallel, delayed
def func():
    for i in range(200):
        for j in range(300):
            yield i, j

def evaluate(x):
    i = x[0]
    j = x[1]
    p = i * j
    return p, i, j

if __name__ == '__main__':
    results = Parallel(n_jobs=3, verbose=2)(delayed(evaluate)(x) for x in func())
    res, i, j = zip(*results)
Short answer: Joblib is a multiprocessing system, and has a fair amount of overhead in booting up a new python process for each of your 3 simultaneous jobs. As a result, your specific code is likely to get even slower if you add more jobs.
There's some documentation about this here.
The workarounds aren't great:
accept the overhead
don't use parallel code
Use multithreading instead of multiprocessing (see the sketch below). Unfortunately, multithreading is rarely an option unless you are using a fully compiled function in place of evaluate, because pure-Python code is effectively single-threaded (see the Python GIL).
That said, for functions that take a long time, multiprocessing is often worth it. Depending on your application, it's really a judgment call. Note that every variable used in the function is copied to each process; copying variables like this is otherwise rare in Python, so it can be a surprise. As a result, the overhead is partly a function of the size of the variables passed either explicitly or implicitly (e.g. via global variables).
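To illustrate the third workaround concretely, joblib can be asked to use threads instead of processes (a sketch assuming a reasonably recent joblib; it only pays off when evaluate releases the GIL or is I/O-bound, which the toy arithmetic here is not):

from joblib import Parallel, delayed

def evaluate(x):
    i, j = x
    return i * j, i, j

if __name__ == '__main__':
    # threads share a single process, so there is no per-worker interpreter
    # start-up and no copying of arguments -- but pure-Python arithmetic like
    # this still runs one thread at a time because of the GIL
    results = Parallel(n_jobs=3, prefer="threads", verbose=2)(
        delayed(evaluate)((i, j)) for i in range(200) for j in range(300)
    )
    res, i, j = zip(*results)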
I'm trying to write a program that executes a piece of code in such a way that the user can stop its execution at any time without stopping the main program. I thought I could do this using threading.Thread, but then I ran the following code in IDLE (Python 3.3):
from threading import *
import math
def f():
    eval("math.factorial(1000000000)")
t = Thread(target = f)
t.start()
The last line doesn't return: I eventually restarted the shell. Is this a consequence of the Global Interpreter Lock, or am I doing something wrong? I didn't see anything specific to this problem in the threading documentation (http://docs.python.org/3/library/threading.html)
I tried to do the same thing using a process:
from multiprocessing import *
import math
def f():
    eval("math.factorial(1000000000)")
p = Process(target = f)
p.start()
p.is_alive()
The last line returns False, even though I ran it only a few seconds after I started the process! Based on my processor usage, I am forced to conclude that the process never started in the first place. Can somebody please explain what I am doing wrong here?
Thread.start() never returns! Could this have something to do with the C implementation of the math library?
As @eryksun pointed out in the comments: math.factorial() is implemented as a C function that doesn't release the GIL, so no other Python code may run until it returns.
Note: the multiprocessing version should work as is: each Python process has its own GIL.
factorial(1000000000) has hundreds of millions of digits. Try import time; time.sleep(10) as a dummy calculation instead (see the sketch below).
If you have issues with multithreaded code in IDLE then try the same code from the command line, to make sure that the error persists.
If p.is_alive() returns False after p.start() has already been called, then it might mean that there is an error in the f() function, e.g., a MemoryError.
On my machine, p.is_alive() returns True and one of the CPUs is at 100% if I paste your code from the question into a Python shell.
Unrelated: remove wildcard imports such as from multiprocessing import *. They may shadow other names in your code so that you can't be sure what a given name means; e.g., threading could define an eval function (it doesn't, but it could) with similar but different semantics that might break your code silently.
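For completeness, here is a sketch of the multiprocessing route with the dummy calculation suggested above, run as a script rather than in IDLE; Process.terminate() is how the main program can stop the worker at any time:

import time
from multiprocessing import Process

def f():
    time.sleep(10)  # dummy "long calculation" instead of math.factorial(1000000000)

if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    print(p.is_alive())  # True while the worker is still running
    time.sleep(1)
    p.terminate()        # the main program can stop the worker at any time
    p.join()
    print(p.is_alive())  # False once it has been terminated and joined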
I want my program to be able to handle ridiculous inputs from the user gracefully
If you pass user input directly to eval() then the user can do anything.
Is there any way to get a process to print, say, an error message without constructing a pipe or other similar structure?
It is ordinary Python code:
print(message) # works
The difference is that if several processes run print() then the output might be garbled. You could use a lock to synchronize print() calls.
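A minimal sketch of that: pass a multiprocessing.Lock to each worker and hold it around the print() call so whole lines come out one at a time:

from multiprocessing import Process, Lock

def worker(lock, n):
    # only one process prints at a time, so lines don't get interleaved
    with lock:
        print(f'message from worker {n}')

if __name__ == '__main__':
    lock = Lock()
    procs = [Process(target=worker, args=(lock, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()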