I'm trying to understand the multiprocessing.Process class. I want to collect data asynchronously, storing it somewhere. After storing the data, it somehow gets lost. Here is my MWE:
from __future__ import print_function
import multiprocessing as mp
def append_test(tgt):
    tgt.append(42)
    print('Appended:', tgt)
l = []
p = mp.Process(target=lambda: append_test(l))
p.run()
print('l is', l)
p.start()
p.join()
print('l is', l)
If I run that snippet, I get
Appended: [42]
l is [42]
Appended: [42, 42]
l is [42]
As you can see, there is a difference between calling run and using start/join. It has nothing to do with the order (using run afterwards); I've tried that. Can someone elaborate on how the second 42 gets lost? It seems to be stored at some point, but at another point it's definitely not.
Just in case that could make any difference: I've tried Python 2.7 and Python 3.4, both with exactly the same result described above.
Update: Apparently only start spawns a new process, in which run is then invoked. My actual problem therefore translates to the following question: how do I pass l to the spawned process such that I can see the actual result?
Solution: The following example shows how to pass shared data safely to a Process:
from __future__ import print_function
import multiprocessing as mp
def append_test(tgt):
    tgt.append(42)
    print('Appended:', tgt)
m = mp.Manager()
l = m.list()
p = mp.Process(target=lambda: append_test(l))
p.start()
p.join()
print('l is', l)
Further reading: Multiprocessing Managers Documentation
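As a side note, the lambda target above works under the fork start method (where nothing needs to be pickled), but under spawn, the default on Windows, the target must be picklable and a lambda is not. A minimal sketch of the same solution, assuming you pass the managed list through args instead of capturing it in a lambda:

from __future__ import print_function
import multiprocessing as mp

def append_test(tgt):
    tgt.append(42)
    print('Appended:', tgt)

if __name__ == '__main__':
    m = mp.Manager()
    l = m.list()
    # plain function plus args is picklable, so this also works under spawn
    p = mp.Process(target=append_test, args=(l,))
    p.start()
    p.join()
    print('l is', l)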
From Python: Essential Reference by Beazley:
p.run(): The method that runs when the process starts. By default, this invokes target that was passed to the Process constructor. ...
p.start(): Starts the process. This launches the subprocess that represents the process and invokes p.run() in that subprocess.
So, they are not meant to do the same thing. It looks to me like, in this case, p.run() is invoked in the current process, while p.start() calls p.run() in a new process with the original target that was passed to the constructor (in which l is still []).
run() executes the callable object that you passed as target, in the calling process. start() arranges for run() to be called in a new process.
From multiprocessing's documentation
run() Method representing the process’s activity.
You may override this method in a subclass. The standard run() method
invokes the callable object passed to the object’s constructor as the
target argument, if any, with sequential and keyword arguments taken
from the args and kwargs arguments, respectively.
start() Start the process’s activity.
This must be called at most once per process object. It arranges for
the object’s run() method to be invoked in a separate process.
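A minimal sketch, hypothetical but illustrating the quoted behaviour: run() executes the target in the calling process, while start() runs it in a new one (compare the printed PIDs).

import multiprocessing as mp
import os

def show_pid():
    print('running in pid', os.getpid())

if __name__ == '__main__':
    print('parent pid is', os.getpid())
    p = mp.Process(target=show_pid)
    p.run()      # executes show_pid right here, in the parent process
    q = mp.Process(target=show_pid)
    q.start()    # executes show_pid in a newly spawned/forked child process
    q.join()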
Related
I have a program that needs to create several graphs, with each one often taking hours. Therefore I want to run these simultaneously on different cores, but cannot seem to get these processes to run with the multiprocessing module. Here is my code:
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()
        p.join()
(full_graph() has been defined earlier in the program, and is simply a function that runs a collection of other functions)
The function normally outputs some graphs and saves the data to a .txt file. All data is saved to the same 2 text files. However, calling the functions using the above code gives no console output, nor any output to the text files. All that happens is a pause of a few seconds, and then the program exits.
I am using the Spyder IDE with WinPython 3.6.3
Without a simple full_graph sample nobody can tell you what's happening. But your code is inherently wrong.
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()
        p.join()  # <- This would block until p is done
See the comment after p.join(). If your processes really take hours to complete, you would run one process for hours, then the second, then the third: serially, and using a single core.
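A minimal sketch of the usual fix, assuming full_graph stays as it is: start all five processes first, and only then join them, so they actually run concurrently.

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()      # start all five before waiting on any of them
    for p in jobs:
        p.join()       # now block until every process has finished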
From the docs: https://docs.python.org/3/library/multiprocessing.html
Process.join: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join
If the optional argument timeout is None (the default), the method blocks until the process whose join() method is called terminates. If timeout is a positive number, it blocks at most timeout seconds. Note that the method returns None if its process terminates or if the method times out. Check the process’s exitcode to determine if it terminated.
If each process does something different, you should also pass some args to full_graph (hint: might that be the missing factor?).
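For example, purely hypothetically, if full_graph took an index identifying which graph to build, you would pass it through args:

p = multiprocessing.Process(target=full_graph, args=(i,))  # hypothetical index argument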
You probably want to use an interface like map from Pool
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
And do (from the docs again)
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it prevents taking advantage of multiple CPU cores on a single machine. So now I'm trying multiprocessing (I don't strictly need separate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time, so multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os

def pri():
    print(os.getpid())

if __name__=='__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())
    processes=[mp.Process(target=pri()) for x in range(1,4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)
Since the subprocesses run as separate processes, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri: you need to pass a callable object, not execute it.
The prints you see are part of your main process's execution. Because you pass pri(), the code is actually executed there. You need to change your code so that the pri function returns a value rather than printing it.
Then you need a queue: all your workers write their results to it, and when they're done, your main process reads them back.
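A minimal sketch of that approach, assuming the goal is simply to collect each worker's PID in the parent:

import multiprocessing as mp
import os

def pri(q):
    q.put(os.getpid())          # send the value back instead of printing it

if __name__ == '__main__':
    q = mp.Queue()
    procs = [mp.Process(target=pri, args=(q,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not q.empty():
        print(q.get())           # the main process collects the results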
A nice feature of the multiprocessing module is the Pool object. It allows you to create a pool of worker processes and then just use it; it's more convenient.
I have tried your code; the thing is that the commands execute too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it works as you expect.
That is true only on Windows. The example below was run on Windows. On Unix-like machines you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os

def pri(x):
    sleep(1)
    return os.getpid()

def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, [_ for _ in range(1,4)])
    p_pool.close()
    p_pool.join()
    return p_results

if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000
I know that multiprocessing uses pickling in order to have the processes run on different CPUs, but I think I am a little confused as to what is being pickled. Let's look at this code.
from multiprocessing import Process

def f(I):
    print('hello world!',I)

if __name__ == '__main__':
    for I in range(1, 3):
        Process(target=f,args=(I,)).start()
I assume what is being pickled is the def f(I) and the argument going in. First, is this assumption correct?
Second, lets say f(I) has a function call within in it like:
def f(I):
    print('hello world!',I)
    randomfunction()
Does the randomfunction's definition get pickled as well, or is it only the function call?
Furthermore, if that function call was located in another file, would the process be able to call it?
In this particular example, what gets pickled is platform dependent. On systems that support os.fork, like Linux, nothing is pickled here. Both the target function and the args you're passing get inherited by the child process via fork.
On platforms that don't support fork, like Windows, the f function and args tuple will both be pickled and sent to the child process. The child process will re-import your __main__ module, and then unpickle the function and its arguments.
In either case, randomfunction is not actually pickled. When you pickle f, all you're really pickling is a reference that the child process can use to re-build the f function object. This is usually little more than a string that tells the child how to re-import f:
>>> import pickle
>>> def f(I):
...     print('hello world!',I)
...     randomfunction()
...
>>> pickle.dumps(f)
'c__main__\nf\np0\n.'
The child process will just re-import f, and then call it. randomfunction will be accessible as long as it was properly imported into the original script to begin with.
Note that in Python 3.4+, you can get the Windows-style behavior on Linux by using contexts:
ctx = multiprocessing.get_context('spawn')
ctx.Process(target=f,args=(I,)).start() # even on Linux, this will use pickle
The descriptions of the contexts are also probably relevant here, since they apply to Python 2.x as well:
spawn
The parent process starts a fresh python interpreter process.
The child process will only inherit those resources necessary to run
the process object's run() method. In particular, unnecessary file
descriptors and handles from the parent process will not be inherited.
Starting a process using this method is rather slow compared to using
fork or forkserver.
Available on Unix and Windows. The default on Windows.
fork
The parent process uses os.fork() to fork the Python interpreter.
The child process, when it begins, is effectively identical to the
parent process. All resources of the parent are inherited by the child
process. Note that safely forking a multithreaded process is
problematic.
Available on Unix only. The default on Unix.
forkserver
When the program starts and selects the forkserver start
method, a server process is started. From then on, whenever a new
process is needed, the parent process connects to the server and
requests that it fork a new process. The fork server process is single
threaded so it is safe for it to use os.fork(). No unnecessary
resources are inherited.
Available on Unix platforms which support passing file descriptors
over Unix pipes.
Note that forkserver is only available in Python 3.4+; there's no way to get that behavior on 2.x, regardless of the platform you're on.
The function is pickled, but possibly not in the way you think of it:
You can look at what's actually in a pickle like this:
import pickle, pickletools
pickletools.dis(pickle.dumps(f))
I get:
0: c GLOBAL '__main__ f'
12: p PUT 0
15: . STOP
You'll note that there is nothing in there corresponding to the code of the function. Instead, it has a reference to __main__ f, which is the module and name of the function. So when this is unpickled, it will always attempt to look up the f function in the __main__ module and use that. When you use the multiprocessing module, that ends up being a copy of the same function as it was in your original program.
This does mean that if you somehow modify which function is located at __main__.f, you'll end up unpickling a different function than the one you pickled.
Multiprocessing brings up a complete copy of your program, complete with all the functions you defined in it, so you can just call functions. The entire function isn't copied over, just its name. The pickle module's assumption is that the function will be the same in both copies of your program, so it can just look up the function by name.
Only the function arguments (I,) and the return value of the function f are pickled. The actual definition of the function f has to be available when loading the module.
The easiest way to see this is through the code:
from multiprocessing import Process

if __name__ == '__main__':
    def f(I):
        print('hello world!',I)
    for I in [1,2,3]:
        Process(target=f,args=(I,)).start()
That returns:
AttributeError: 'module' object has no attribute 'f'
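Moving the definition back to module level fixes it; a minimal sketch:

from multiprocessing import Process

def f(I):                        # defined at module level, so the child can re-import it
    print('hello world!', I)

if __name__ == '__main__':
    for I in [1, 2, 3]:
        Process(target=f, args=(I,)).start()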
How does the flow of apply_async work between calling the iterable (?) function and the callback function?
Setup: I am reading some lines of all the files inside a 2000-file directory, some with millions of lines, some with only a few. Some header/formatting/date data is extracted to characterize each file. This is done on a 16-CPU machine, so it made sense to multiprocess it.
Currently, the expected result is being sent to a list (ahlala) so I can print it out; later, this will be written to *.csv. This is a simplified version of my code, originally based off this extremely helpful post.
import multiprocessing as mp
import os

def dirwalker(directory):
    ahlala = []

    # X() reads files and grabs lines, calls helper function to calculate
    # info, and returns stuff to the callback function
    def X(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # Y() reads other types of files and does the same thing
    def Y(f):
        fileinfo = Z(arr_of_lines)
        return fileinfo

    # results() is the callback function
    def results(r):
        ahlala.extend(r) # or .append, haven't yet decided

    # helper function
    def Z(arr):
        return fileinfo # to X() or Y()!

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if (filetype(f) == filetypeX):
                pool.apply_async(X, args=(f,), callback=results)
            elif (filetype(f) == filetypeY):
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close(); pool.join()
    return ahlala
Note: the code works if I put all of Z(), the helper function, into X(), Y(), or results(), but isn't that either repetitive or slower than it could be? I know that the callback function is called for every function call, but when exactly is the callback function called? Is it after pool.apply_async() finishes all the jobs for the processes? Shouldn't it be faster if these helper functions were called within the scope (?) of the first function pool.apply_async() takes (in this case, X())? If not, should I just put the helper function in results()?
Other related ideas: are daemon processes why nothing shows up? I am also very confused about how to queue things, and whether that is the problem. This seems like a place to start learning it, but can queuing be safely ignored when using apply_async, or only at a noticeable cost in time?
You're asking about a whole bunch of different things here, so I'll try to cover it all as best I can:
The function you pass to callback is executed in the main process (not the worker) as soon as the worker process returns its result. It is executed in a thread that the Pool object creates internally. That thread consumes objects from a result_queue, which is used to receive the results from all the worker processes. After the thread pulls a result off the queue, it executes the callback. While your callback is executing, no other results can be pulled from the queue, so it's important that the callback finishes quickly. With your example, as soon as one of the calls to X or Y you make via apply_async completes, the result is placed into the result_queue by the worker process; the result-handling thread then pulls it off the result_queue, and your callback is executed.
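A minimal sketch illustrating that behaviour (the work and on_result names are just placeholders): the workers return their own PIDs, while the callback always runs in the parent and prints the parent's PID.

import multiprocessing as mp
import os

def work(x):
    return x, os.getpid()                  # the worker's pid travels back with the result

def on_result(r):
    print('callback ran in pid', os.getpid(), 'with result', r)   # parent's pid

if __name__ == '__main__':
    pool = mp.Pool(4)
    for i in range(4):
        pool.apply_async(work, args=(i,), callback=on_result)
    pool.close()
    pool.join()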
Second, I suspect the reason you're not seeing anything happen with your example code is because all of your worker function calls are failing. If a worker function fails, callback will never be executed. The failure won't be reported at all unless you try to fetch the result from the AsyncResult object returned by the call to apply_async. However, since you're not saving any of those objects, you'll never know the failures occurred. If I were you, I'd try using pool.apply while you're testing so that you see errors as soon as they occur.
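One way to surface those errors while still using apply_async, sketched here against the names in your code (only the X branch is shown), is to hold on to the AsyncResult objects your loop currently discards and call get() on them after the pool is joined:

async_results = []
for f in files:
    async_results.append(pool.apply_async(X, args=(f,), callback=results))
pool.close()
pool.join()
for ar in async_results:
    ar.get()   # re-raises here any exception that happened inside the worker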
The reason the workers are probably failing (at least in the example code you provided) is because X and Y are defined as functions inside another function. multiprocessing passes functions and objects to worker processes by pickling them in the main process and unpickling them in the worker processes. Functions defined inside other functions are not picklable, which means multiprocessing won't be able to successfully unpickle them in the worker process. To fix this, define both functions at the top level of your module rather than nested inside the dirwalker function.
You should definitely continue to call Z from X and Y, not in results. That way, Z can be run concurrently across all your worker processes, rather than having to be run one call at a time in your main process. And remember, your callback function is supposed to be as quick as possible, so you don't hold up processing results. Executing Z in there would slow things down.
Here's some simple example code that's similar to what you're doing, that hopefully gives you an idea of what your code should look like:
import multiprocessing as mp
import os

# X() reads files and grabs lines, calls helper function to calculate
# info, and returns stuff to the callback function
def X(f):
    fileinfo = Z(f)
    return fileinfo

# Y() reads other types of files and does the same thing
def Y(f):
    fileinfo = Z(f)
    return fileinfo

# helper function
def Z(arr):
    return arr + "zzz"

def dirwalker(directory):
    ahlala = []

    # results() is the callback function
    def results(r):
        ahlala.append(r) # or .append, haven't yet decided

    for _,_,files in os.walk(directory):
        pool = mp.Pool(mp.cpu_count())
        for f in files:
            if len(f) > 5: # Just an arbitrary thing to split up the list with
                pool.apply_async(X, args=(f,), callback=results) # ,error_callback=handle_error # In Python 3, there's an error_callback you can use to handle errors. It's not available in Python 2.7 though :(
            else:
                pool.apply_async(Y, args=(f,), callback=results)
        pool.close()
        pool.join()
    return ahlala

if __name__ == "__main__":
    print(dirwalker("/usr/bin"))
Output:
['ftpzzz', 'findhyphzzz', 'gcc-nm-4.8zzz', 'google-chromezzz' ... # lots more here ]
Edit:
You can create a dict object that's shared between your parent and child processes using the multiprocessing.Manager class:
pool = mp.Pool(mp.cpu_count())
m = mp.Manager()
helper_dict = m.dict()

for f in files:
    if len(f) > 5:
        pool.apply_async(X, args=(f, helper_dict), callback=results)
    else:
        pool.apply_async(Y, args=(f, helper_dict), callback=results)
Then make X and Y take a second argument called helper_dict (or whatever name you want), and you're all set.
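A sketch of the matching change on the worker side (what you store in the dict is up to you; the value here is just a placeholder):

def X(f, helper_dict):
    helper_dict[f] = 'placeholder value'   # hypothetical use of the shared dict
    fileinfo = Z(f)
    return fileinfo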
The caveat is that this works by creating a server process that contains a normal dict, and all your other processes talk to that one dict via a proxy object. So every time you read from or write to the dict, you're doing IPC. This makes it a lot slower than a real dict.
I am currently scraping data from various sites.
The code for the scrapers is stored in modules (x, y, z, a, b),
where x.dump is a function that uses files to store the scraped data.
The dump function takes a single argument, 'input'.
Note: the dump functions are not all the same.
I am trying to run each of these dump functions in parallel.
The following code runs fine,
but I have noticed that it still follows serial order (x, then y, ...) during execution.
Is this the correct way of going about the problem?
Are multithreading and multiprocessing the only native ways for parallel programming?
from multiprocessing import Process
import x.x as x
import y.y as y
import z.z as z
import a.a as a
import b.b as b

input = ""
f_list = [x.dump, y.dump, z.dump, a.dump, b.dump]

processes = []
for function in f_list:
    processes.append(Process(target=function, args=(input,)))

for process in processes:
    process.run()

for process in processes:
    process.join()
That's because run() is the method that implements the task itself; you're not meant to call it from outside like that. You are supposed to call start(), which spawns a new process; that process then calls run() internally and control returns to you so you can do more work (and later join()).
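A minimal sketch of the change, with everything else in your code left as it is:

for process in processes:
    process.start()   # spawns a new process, which then calls run() internally

for process in processes:
    process.join()    # wait for all of the dump functions to finish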
You should be calling process.start() not process.run()
The start method does the work of starting the extra process and then running the run method in that process.
Python docs