Sharing many queues among processes in Python

I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which links a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes; each process seems to have its own copy of the queue, so no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through
inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
    DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
    manager = mp.Manager()
    pool = mp.Pool()
    DICT2 = manager.dict()

    for key in all_keys:
        DICT2[key] = manager.Queue()
        proc_1 = pool.apply_async(targ1, (DICT1[key],))  # DICT1 is defined in the config file
        proc_2 = pool.apply_async(targ2, (DICT2[key], otherargs,))
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)

It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is copied, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline, but by giving them only the queues they need, there's no need for them to know about any mappings:
import multiprocessing as mp

def stage1(q_in, q_out):
    q_out.put(q_in.get() + "Stage 1 did some work.\n")
    return

def stage2(q_in, q_out):
    q_out.put(q_in.get() + "Stage 2 did some work.\n")
    return

def main():
    pool = mp.Pool()
    manager = mp.Manager()

    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()

    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))

    # Send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")

    # Wait for work to complete
    print(q_s2_to_main.get() + "Main finished the job.")

    pool.close()
    pool.join()
    return

if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResults objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.
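To connect this back to the question, here is a rough sketch (not from the original answer) of storing one managed queue per key in a plain dictionary in main and handing each worker only its own queue. The names all_keys, targ1 and targ2 come from the question; everything else is illustrative:

import multiprocessing as mp

def targ1(q_out):
    # hypothetical first stage: produce a result for this key
    q_out.put("result for this key")

def targ2(q_in):
    # hypothetical second stage: consume the result
    print(q_in.get())

def main(all_keys):
    manager = mp.Manager()
    pool = mp.Pool()

    # ordinary dict in main; only the managed queues are passed to workers
    queues = {key: manager.Queue() for key in all_keys}

    for key in all_keys:
        pool.apply_async(targ1, (queues[key],))
        pool.apply_async(targ2, (queues[key],))

    pool.close()
    pool.join()

if __name__ == "__main__":
    main(["a", "b", "c"])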

Related

How are locks differenciated in multiprocessing?

Let's say that you have two lists created with manager.list(), and two locks created with manager.Lock(). How do you assign each lock to each list?
I was doing like
lock1 = manager.Lock()
lock2 = manager.Lock()
list1 = manager.list()
list2 = manager.list()
and when I wanted to write/read from the list
lock1.acquire()
list1.pop(0)
lock1.release()
lock2.acquire()
list2.pop(0)
lock2.release()
Today I realized that there's nothing that associates lock1 to list1.
Am I misunderstanding these functions?
TL;DR yes, and this might be an XY problem!
If you create a multiprocessing.Manager() and use its methods to create container primitives (.list and .dict), they will already be synchronized and you don't need to deal with the synchronization primitives yourself
from multiprocessing import Manager, Process, freeze_support

def my_function(d, lst):
    lst.append([x**2 for x in d.values()])

def main():
    with Manager() as manager:  # context-managed SyncManager
        normal_dict = {'a': 1, 'b': 2}
        managed_synchronized_dict = manager.dict(normal_dict)
        managed_synchronized_list = manager.list()  # used to store results
        p = Process(
            target=my_function,
            args=(managed_synchronized_dict, managed_synchronized_list)
        )
        p.start()
        p.join()
        print(managed_synchronized_list)

if __name__ == '__main__':
    freeze_support()
    main()
% python3 ./test_so_66603485.py
[[1, 4]]
multiprocessing.Array is also synchronized.
BEWARE: proxy objects are not directly comparable to their Python collection equivalents
Note: The proxy types in multiprocessing do nothing to support comparisons by value. So, for instance, we have:
>>> manager.list([1,2,3]) == [1,2,3]
False
One should just use a copy of the referent instead when making comparisons.
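A minimal sketch of what using "a copy of the referent" looks like in practice:

from multiprocessing import Manager

if __name__ == "__main__":
    manager = Manager()
    proxy = manager.list([1, 2, 3])

    print(proxy == [1, 2, 3])        # False: the proxy itself does not compare by value
    print(list(proxy) == [1, 2, 3])  # True: compare against a local copy of the referent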
Some confusion might come from the section of the multiprocessing docs on Synchronization Primitives, which implies that one should use a Manager to create synchronization primitives, when really the Manager can already do the synchronization for you
Synchronization primitives
Generally synchronization primitives are not as necessary in a multiprocess program as they are in a multithreaded program. See the documentation for threading module.
Note that one can also create synchronization primitives by using a manager object – see Managers.
If you simply use multiprocessing.Manager(), per the docs, it
Returns a started SyncManager object which can be used for sharing objects between processes. The returned manager object corresponds to a spawned child process and has methods which will create shared objects and return corresponding proxies.
From the SyncManager section
Its methods create and return Proxy Objects for a number of commonly used data types to be synchronized across processes. This notably includes shared lists and dictionaries.
This means that you probably have most of what you want already
manager object with methods for building managed types
synchronization via Proxy Objects
Finally, to sum up the thread from the comments, specifically about instances of Lock objects:
there's no inherent way to tell that some named lock is for anything in particular, other than meta-information such as its name, comments, or documentation; conversely, locks are free to be used for whatever synchronization needs you may have
some useful class/container can be made to manage both the lock and whatever it should be synchronizing -- a normal multiprocessing.Manager (SyncManager)'s .list and .dict do this, and a variety of other useful constructs exist, such as Pipe and Queue
one lock can be used to synchronize any number of actions, but using more, finer-grained locks can be a worthwhile trade-off, since a single coarse lock may block access to resources unnecessarily
a variety of synchronization primitives also exist for different purposes
value = my_queue.get()  # already synchronized

if not my_lock1.acquire(timeout=5):  # False if it cannot acquire
    raise CustomException("failed to acquire my_lock1 after 5 seconds")
try:
    with my_lock2:  # blocks until acquired
        some_shared_mutable = some_container_of_mutables[-1]
        some_shared_mutable = some_object_with_mutables.get()
        if foo(value, some_shared_mutable):
            action1(value, some_shared_mutable)
        action2(value, some_other_shared_mutable)
finally:
    my_lock1.release()

Python Using Multiprocessing

I am trying to use multiprocessing in Python 3.6. I have a for loop that runs a method with different arguments. Currently, it is running one at a time, which is taking quite a bit of time, so I am trying to use multiprocessing. Here is what I have:
def test(self):
    for key, value in dict.items():
        pool = Pool(processes=(cpu_count() - 1))
        pool.apply_async(self.thread_process, args=(key, value))
        pool.close()
        pool.join()

def thread_process(self, key, value):
    # self.__init__()
    print("For", key)
I think what my code is doing is using 3 processes to run one method, but I would like to run 1 method per process; I don't know how this is done. I am using 4 cores, btw.
You're making a pool at every iteration of the for loop. Make a pool beforehand, apply the processes you'd like to run in multiprocessing, and then join them:
from multiprocessing import Pool, cpu_count
import time

def t():
    # Make a dummy dictionary
    d = {k: k**2 for k in range(10)}

    pool = Pool(processes=(cpu_count() - 1))

    for key, value in d.items():
        pool.apply_async(thread_process, args=(key, value))

    pool.close()
    pool.join()

def thread_process(key, value):
    time.sleep(0.1)  # Simulate a process taking some time to complete
    print("For", key, value)

if __name__ == '__main__':
    t()
You're not populating your multiprocessing.Pool with data - you're re-initializing the pool on each loop. In your case you can use Pool.map() to do all the heavy work for you:
from multiprocessing import Pool, cpu_count

def thread_process(args):
    print(args)

def test():
    pool = Pool(processes=(cpu_count() - 1))
    pool.map(thread_process, your_dict.items())
    pool.close()

if __name__ == "__main__":  # important guard for cross-platform use
    test()
Also, given all those self arguments I reckon you're snatching this off of a class instance, and if so - don't, unless you know what you're doing. Since multiprocessing in Python essentially works as, well, multi-processing (unlike multi-threading), you don't get to share your memory, which means your data is pickled when exchanged between processes, and anything that cannot be pickled (like instance methods) doesn't get called. You can read more on that problem on this answer.
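As an illustration of that point, here is a minimal sketch (the Worker class and square_item helper are hypothetical, not from the question) that keeps the function handed to the pool at module level and passes only plain picklable data:

from multiprocessing import Pool, cpu_count

def square_item(item):
    # module-level function: easy to pickle, and no instance state is dragged along
    key, value = item
    return key, value ** 2

class Worker:
    def __init__(self, data):
        self.data = data

    def run(self):
        # at least one worker process, even on a single-core machine
        with Pool(processes=max(1, cpu_count() - 1)) as pool:
            # pass plain data to the pool rather than self
            return dict(pool.map(square_item, self.data.items()))

if __name__ == "__main__":
    print(Worker({"a": 1, "b": 2}).run())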
I think what my code is doing is using 3 processes to run one method, but I would like to run 1 method per process; I don't know how this is done. I am using 4 cores, btw.
No, you are in fact using the correct syntax here to utilize 3 cores to run an arbitrary function independently on each. You cannot magically utilize 3 cores to work together on one task without explicitly making that a part of the algorithm itself / coding that yourself, often using threads (which do not work the same in Python as they do outside of the language).
You are, however, re-initializing the pool on every loop; you'll need to do something like this instead to perform this properly:
cpus_to_run_on = cpu_count() - 1
pool = Pool(processes=(cpus_to_run_on))
# don't call a dictionary a dict, you will not be able to use dict() any
# more after that point, that's like calling a variable len or abs, you
# can't use those functions now
pool.map(your_function, your_function_args)
pool.close()
Take a look at the Python multiprocessing docs for more specific information if you'd like to get a better understanding of how it works. Under Python, you cannot utilize threading to do multiprocessing with the default CPython interpreter. This is because of something called the global interpreter lock, which stops concurrent resource access from within Python itself. The GIL doesn't exist in other implementations of the language, and is not something other languages like C and C++ have to deal with (and thus you can actually use threads in parallel to work together on a task, unlike CPython).
Python gets around this issue by simply making multiple interpreter instances when using the multiprocessing module, and any message passing between instances is done via copying data between processes (i.e. the same memory is typically not touched by both interpreter instances). This does not, however, happen in the misleadingly named threading module, which often actually slows processes down because of a process called context switching. Threading today has limited usefulness, but provides an easier way around non-GIL-locked operations like socket and file reads/writes than async Python.
Beyond all this, though, there is a bigger problem with your multiprocessing: you're writing to standard output, so you aren't going to get the gains you want. Think about it. Each of your processes "prints" data, but it's all being displayed in one terminal/output screen. So even if your processes are "printing", they aren't really doing that independently, and the information has to be coalesced back into another process where the text interface lies (i.e. your console). So these processes write whatever they were going to print to some sort of buffer, which then has to be copied (as we learned from how multiprocessing works) to another process, which will then take that buffered data and output it.
Typically, dummy programs use printing as a means of showing that there is no order between the execution of these processes and that they can finish at different times; they aren't meant to demonstrate the performance benefits of multi-core processing.
I have experimented a bit this week with multiprocessing. The fastest way that I discovered to do multiprocessing in python3 is using imap_unordered, at least in my scenario. Here is a script you can experiment with using your scenario to figure out what works best for you:
import multiprocessing

NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MP_FUNCTION = 'imap_unordered'  # 'imap_unordered' or 'starmap' or 'apply_async'

def process_chunk(a_chunk):
    print(f"processing mp chunk {a_chunk}")
    return a_chunk

map_jobs = [1, 2, 3, 4]
result_sum = 0

if MP_FUNCTION == 'imap_unordered':
    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    for i in pool.imap_unordered(process_chunk, map_jobs):
        result_sum += i
elif MP_FUNCTION == 'starmap':
    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    try:
        map_jobs = [(i, ) for i in map_jobs]
        result_sum = pool.starmap(process_chunk, map_jobs)
        result_sum = sum(result_sum)
    finally:
        pool.close()
        pool.join()
elif MP_FUNCTION == 'apply_async':
    with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
        result_sum = [pool.apply_async(process_chunk, [i, ]).get() for i in map_jobs]
        result_sum = sum(result_sum)

print(f"result_sum is {result_sum}")
I found that starmap was not too far behind in performance, in my scenario it used more cpu and ended up being a bit slower. Hope this boilerplate helps.

Can't mutate a variable local to a function in another function

I am trying to manipulate the lists inside the dictionary clean_txt in another function, but it's not working and I end up with empty lists inside the dict.
My understanding is that both lists and dicts are mutable objects, so what is the problem here?
def process_questions(i, question_list, questions, question_list_name):
    ''' Transform questions and display progress '''
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(text_to_wordlist(str(question)))

#timeit
def multi(n_cores, tq, qln):
    procs = []
    clean_txt = {}
    for i in range(n_cores):
        clean_txt[i] = []

    for index in range(n_cores):
        tq_indexed = tq[index*len(tq)//n_cores:(index+1)*len(tq)//n_cores]
        proc = Process(target=process_questions, args=(index, clean_txt[index], tq_indexed, qln, ))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

    print('{} records processed from {}'.format(sum([len(x) for x in clean_txt.values()]), qln))
    print('-'*100)
You are using Processes, not threads.
When the process is created, the memory of your program is copied and each process works on its own set, therefore it is NOT shared.
Here's a question that can help you understand: Multiprocessing vs Threading Python
If you want to share memory between processes you should look into semaphores or use Threads instead. There are also other solutions to share data, like queues or database etc.
You are appending to clean_txt[index] from a different Process. clean_txt[index] belongs to the main Python process, which created it. Since a process can't access or modify another process's memory, you can't append to it. (Not really. See edit below.)
You will need to create shared memory.
You can use Manager to create shared memory, something like this
from multiprocessing import Manager
manager = Manager()
...
clean_txt[i] = manager.list()
Now you can append to this list in the other process.
Edit -
My explanation about clean_txt wasn't clear. Thanks to #Maresh.
When a new Process is created the whole memory is copied. So modifying the list in the new process won't affect the copy in the main process. So you need a shared memory.
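Putting the pieces together, here is a rough sketch of the suggested fix applied to the question's multi function, with manager.list() replacing the plain lists (text_to_wordlist is stubbed out here purely for illustration):

from multiprocessing import Manager, Process

def text_to_wordlist(question):
    # stand-in for the question's real text_to_wordlist
    return str(question).split()

def process_questions(i, question_list, questions, question_list_name):
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(text_to_wordlist(str(question)))

def multi(n_cores, tq, qln):
    manager = Manager()
    # managed lists live in the manager process, so appends from workers are visible
    clean_txt = {i: manager.list() for i in range(n_cores)}
    procs = []

    for index in range(n_cores):
        tq_indexed = tq[index * len(tq) // n_cores:(index + 1) * len(tq) // n_cores]
        proc = Process(target=process_questions,
                       args=(index, clean_txt[index], tq_indexed, qln))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

    print('{} records processed from {}'.format(
        sum(len(x) for x in clean_txt.values()), qln))

if __name__ == '__main__':
    multi(2, ['How are you today?', 'What is all this then?'], 'demo questions')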

Python multi function multithreading with threading.Thread? (variable number of threads)

I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency, I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
functionList = [func1,func2,func3,func4]
outputList = [func1out,func2out,func3out,func4out]
argsList = [func1args,func2args,func3args,func4args]
# number of threads
n = 3
functionSplit = np.array_split(np.array(functionList),n)
outputSplit = np.array_split(np.array(outputList),n)
argSplit = np.array_split(np.array(argsList),n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to the outputList and create a master dict of the outputs from each function. I then will loop through the output dict and create a dataframe with column ID numbers according to the information in each column (already have this part worked out, just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library like below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        d[somekey] = fnList[i](argList, outList)

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns, args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager. Then you can create a dictionary in the server process with manager.dict(). Once all processes join back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try the ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable and then process them in chunks in parallel to compute faster. The iterable is broken into separate chunks. These chunks are passed to the function in separate processes. The results are then put back together.
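A minimal sketch of the concurrent.futures route mentioned above; expensive_function is just a placeholder for one of your real functions:

from concurrent.futures import ProcessPoolExecutor

def expensive_function(x):
    # placeholder for a computationally expensive function
    return x * x

if __name__ == '__main__':
    args = [1, 2, 3, 4]
    with ProcessPoolExecutor(max_workers=3) as executor:
        # distribute the calls across the pool and gather the results in order
        results = list(executor.map(expensive_function, args))
    print(results)  # [1, 4, 9, 16]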

multiprocessing: sharing a large read-only object between processes?

Do child processes spawned via multiprocessing share objects created earlier in the program?
I have the following setup:
import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in file(filename):
        if line.split(',')[0] in big_lookup_object:
            pass  # something here

if __name__ == '__main__':
    big_lookup_object = marshal.load('file.bin')
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))
I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.
My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?
Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split is reading lots of other large files and looking up the items in those large files against the lookup object.
Further update: database is a fine solution, memcached might be a better solution, and file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in memory solution. For the final solution I'll be using hadoop, but I wanted to see if I can have a local in-memory version as well.
Do child processes spawned via multiprocessing share objects created earlier in the program?
No for Python < 3.8, yes for Python ≥ 3.8.
Processes have independent memory space.
Solution 1
To make best use of a large structure with lots of workers, do this.
Write each worker as a "filter" – reads intermediate results from stdin, does work, writes intermediate results on stdout.
Connect all the workers as a pipeline:
process1 <source | process2 | process3 | ... | processn >result
Each process reads, does work and writes.
This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
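A minimal sketch of one such filter stage (saved as, say, process2.py and wired into the shell pipeline above); the transform function is just a placeholder:

import sys

def transform(line):
    # placeholder for the real per-record work
    return line.upper()

if __name__ == '__main__':
    for line in sys.stdin:                  # read intermediate results from stdin
        sys.stdout.write(transform(line))   # write intermediate results to stdout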
Solution 2
In some cases, you have a more complex structure – often a fan-out structure. In this case you have a parent with multiple children.
Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When the parent reaches the end, it closes the pipe. Each child gets end-of-file and finishes normally.
The child parts are pleasant to write because each child simply reads sys.stdin.
The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.
Reading from many named pipes is often done using the select module to see which pipes have pending input.
Solution 3
Shared lookup is the definition of a database.
Solution 3A – load a database. Let the workers process the data in the database.
Solution 3B – create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
Solution 4
Shared filesystem object. Unix OS offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more conventional buffered reads.
You can do this from a Python context in several ways
Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.
Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages, make each page easy to locate via a seek.
Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do their work there.
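A minimal sketch of the page-structured idea, assuming each record has been padded to a known fixed size when the file was written:

RECORD_SIZE = 128  # assumed fixed record size in bytes, chosen when the file is written

def read_record(path, index):
    # any worker can jump straight to the record it needs with a single seek
    with open(path, 'rb') as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).rstrip(b'\x00')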
Do child processes spawned via multiprocessing share objects created earlier in the program?
It depends. For global read-only variables it can often be considered so (apart from the memory consumed); otherwise it should not.
multiprocessing's documentation says:
Better to inherit than pickle/unpickle
On Windows many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
Explicitly pass resources to child processes
On Unix a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process. Apart from making the code (potentially) compatible with Windows this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
Global variables
Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start() was called.
Example
On Windows (single CPU):
#!/usr/bin/env python
import os, sys, time
from multiprocessing import Pool

x = 23000   # replace `23` due to small integers sharing a representation
z = []      # integers are immutable, let's try a mutable object

def printx(y):
    global x
    if y == 3:
        x = -x
    z.append(y)
    print os.getpid(), x, id(x), z, id(z)
    print y
    if len(sys.argv) == 2 and sys.argv[1] == "sleep":
        time.sleep(.1)  # should make the effect more apparent

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(printx, (1,2,3,4))
With sleep:
$ python26 test_share.py sleep
2504 23000 11639492 [1] 10774408
1
2564 23000 11639492 [2] 10774408
2
2504 -23000 11639384 [1, 3] 10774408
3
4084 23000 11639492 [4] 10774408
4
Without sleep:
$ python26 test_share.py
1148 23000 11639492 [1] 10774408
1
1148 23000 11639492 [1, 2] 10774408
2
1148 -23000 11639324 [1, 2, 3] 10774408
3
1148 -23000 11639324 [1, 2, 3, 4] 10774408
4
S.Lott is correct. Python's multiprocessing shortcuts effectively give you a separate, duplicated chunk of memory.
On most *nix systems, using a lower-level call to os.fork() will, in fact, give you copy-on-write memory, which might be what you're thinking. AFAIK, in theory, in the most simplistic of programs possible, you could read from that data without having it duplicated.
However, things aren't quite that simple in the Python interpreter. Object data and meta-data are stored in the same memory segment, so even if the object never changes, something like a reference counter for that object being incremented will cause a memory write, and therefore a copy. Almost any Python program that is doing more than "print 'hello'" will cause reference count increments, so you will likely never realize the benefit of copy-on-write.
Even if someone did manage to hack a shared-memory solution in Python, trying to coordinate garbage collection across processes would probably be pretty painful.
If you're running under Unix, they may share the same object, due to how fork works (i.e., the child processes have separate memory but it's copy-on-write, so it may be shared as long as nobody modifies it). I tried the following:
import multiprocessing

x = 23

def printx(y):
    print x, id(x)
    print y

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(printx, (1,2,3,4))
and got the following output:
$ ./mtest.py
23 22995656
1
23 22995656
2
23 22995656
3
23 22995656
4
Of course this doesn't prove that a copy hasn't been made, but you should be able to verify that in your situation by looking at the output of ps to see how much real memory each subprocess is using.
Different processes have different address space. Like running different instances of the interpreter. That's what IPC (interprocess communication) is for.
You can use either queues or pipes for this purpose. You can also use rpc over tcp if you want to distribute the processes over a network later.
http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes
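For example, a minimal sketch of handing an object from a child process back to its parent over a multiprocessing.Pipe:

from multiprocessing import Process, Pipe

def worker(conn):
    conn.send({'answer': 42})  # the object is pickled and sent through the pipe
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # {'answer': 42}
    p.join()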
Not directly related to multiprocessing per se, but from your example, it would seem you could just use the shelve module or something like that. Does the "big_lookup_object" really have to be completely in memory?
No, but you can load your data in a child process and allow it to share its data with other children. See below.
import time
import multiprocessing

def load_data(queue_load, n_processes):
    # ... load data here into some_variable
    """
    Store multiple copies of the data into
    the data queue. There needs to be enough
    copies available for each process to access.
    """
    for i in range(n_processes):
        queue_load.put(some_variable)

def work_with_data(queue_data, queue_load):
    # Wait for load_data() to complete
    while queue_load.empty():
        time.sleep(1)

    some_variable = queue_load.get()
    """
    ! Tuples can also be used here
    if you have multiple data files
    you wish to keep separate.
    a, b = queue_load.get()
    """

    # ... do some stuff, resulting in new_data

    # store it in the queue
    queue_data.put(new_data)

def start_multiprocess():
    n_processes = 5
    processes = []
    stored_data = []

    # Create two Queues
    queue_load = multiprocessing.Queue()
    queue_data = multiprocessing.Queue()

    for i in range(n_processes):
        if i == 0:
            # Your big data file will be loaded here...
            p = multiprocessing.Process(target=load_data,
                                        args=(queue_load, n_processes))
            processes.append(p)
            p.start()

        # ... and then it will be used here with each process
        p = multiprocessing.Process(target=work_with_data,
                                    args=(queue_data, queue_load))
        processes.append(p)
        p.start()

    for i in range(n_processes):
        new_data = queue_data.get()
        stored_data.append(new_data)

    for p in processes:
        p.join()

    print(processes)
For the Linux/Unix/macOS platforms, forkmap is a quick-and-dirty solution.
