Understanding a key difference between multiprocessing and threading in Python - python

I have written a program to use threads and I create an instance of a custom object with a run(p, q) method. I pass this run() method as the target for the thread, like this:
import threading

class MyClass(object):
    def run(self, p, q):
        # code here
        pass

obj = MyClass()
thrd = threading.Thread(target=obj.run, args=(a, b))
My thread starts by executing the run() method with the passed arguments, a and b. In my case, one of them is an Event that is eventually used to stop the thread. Also, the run() method has access to all the object's instance variables, which include other objects.
As I understand it, this works because a thread shares memory with the program that creates it.
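For reference, a run() method like the one described usually polls the Event to decide when to stop. Here is a minimal sketch of that pattern; the argument names (data, stop_event) and the loop body are illustrative, not the original code:

import threading
import time

class MyClass(object):
    def run(self, data, stop_event):
        # Loop until the main thread sets the event.
        while not stop_event.is_set():
            # ... work with data and self's instance variables ...
            time.sleep(0.1)

stop_event = threading.Event()
obj = MyClass()
thrd = threading.Thread(target=obj.run, args=("payload", stop_event))
thrd.start()
time.sleep(0.5)
stop_event.set()   # tells the thread to finish
thrd.join()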
My question is how this differs from multiprocessing, e.g.
proc = multiprocessing.Process(target=obj.run, args=(a, b))
I believe a process does not share memory, so am I able to do the same? How is the process able to access the whole obj object when it is just being given a reference to one method? Can I pass an Event? If the created process gets a copy of the whole creating program's memory, what happens with things like open database connections? How does it connect with the original Event?
And a final question (thanks for bearing with me). Is it necessary that the whole program is duplicated (with all its imported modules etc.) in a created process? What if I want a minimal process that doesn't need as much as the main program?
Happy to receive any hints at answers, or a pointer to somewhere that describes multiprocessing in this amount of detail.
Thanks so much.
Julian

Each process has its own memory map; two processes do not share their memory with each other.
Threads, by contrast, live inside the memory of the process they come from.
The new process doesn't access the original obj: obj.run, together with the object it is bound to, is copied from this process into the newly spawned one.

Related

Multithreading a method from an object instantiated in the main thread

If I instantiate an object in the main thread and then send one of its member methods to a ThreadPoolExecutor, does Python somehow create a copy-by-value of the object and send it to the subthread, so that the object's member method has access to its own copy of self?
Or is it indeed accessing self from the object in the main thread, meaning that every subthread is modifying / overwriting the same properties (living in the main thread)?
Threads share a memory space. There is no magic going on behind the scenes, so code in different threads accesses the same objects. Thread switches can occur at any time, although most simple Python operations are atomic. It is up to you to avoid race conditions. Normal Python scoping rules apply.
You might want to read about ThreadLocal variables if you want to find out about workarounds to the default behavior.
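A minimal sketch of that thread-local workaround, using threading.local (the ctx and worker names are illustrative):

import threading

# threading.local gives each thread its own independent attribute values,
# even though all threads share the same process memory.
ctx = threading.local()

def worker(name):
    ctx.name = name   # visible only from within this thread
    print(threading.current_thread().name, 'sees', ctx.name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ('a', 'b')]
for t in threads:
    t.start()
for t in threads:
    t.join()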
Processes are quite different. Each process has its own memory space and its own copy of all the objects it references.

Python: Why is the multiprocessing lock shared among processes here?

I am trying to share a lock among processes. I understand that the way to share a lock is to pass it as an argument to the target function. However I found that even the approach below is working. I could not understand the way the processes are sharing this lock. Could anyone please explain?
import multiprocessing as mp
import time

class SampleClass:
    def __init__(self):
        self.lock = mp.Lock()
        self.jobs = []
        self.total_jobs = 10

    def test_run(self):
        for i in range(self.total_jobs):
            p = mp.Process(target=self.run_job, args=(i,))
            p.start()
            self.jobs.append(p)
        for p in self.jobs:
            p.join()

    def run_job(self, i):
        with self.lock:
            print('Sleeping in process {}'.format(i))
            time.sleep(5)

if __name__ == '__main__':
    t = SampleClass()
    t.test_run()
On Windows (which you said you're using), these kinds of things always reduce to details about how multiprocessing plays with pickle, because all Python data crossing process boundaries on Windows is implemented by pickling on the sending end (and unpickling on the receiving end).
My best advice is to avoid doing things that raise such questions to begin with ;-) For example, the code you showed blows up on Windows under Python 2, and also blows up under Python 3 if you use a multiprocessing.Pool method instead of multiprocessing.Process.
It's not just the lock; simply trying to pickle a bound method (like self.run_job) blows up in Python 2. Think about it: you're crossing a process boundary, and there isn't an object corresponding to self on the receiving end. To what object is self.run_job supposed to be bound on the receiving end?
In Python 3, pickling self.run_job also pickles a copy of the self object. So that's the answer: a SampleClass object corresponding to self is created by magic on the receiving end. Clear as mud. t's entire state is pickled, including t.lock. That's why it "works".
See this for more implementation details:
Why can I pass an instance method to multiprocessing.Process, but not a multiprocessing.Pool?
In the long run, you'll suffer the fewest mysteries if you stick to things that were obviously intended to work: pass module-global callable objects (neither, e.g., instance methods nor local functions), and explicitly pass multiprocessing data objects (whether an instance of Lock, Queue, manager.list, etc etc).
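For illustration, here is the question's example reworked into that shape: a module-global function with the Lock passed explicitly as an argument. This is a sketch of the recommended style, not the only way to structure it:

import multiprocessing as mp
import time

def run_job(lock, i):
    # A module-global function: picklable on every platform and start method.
    with lock:
        print('Sleeping in process {}'.format(i))
        time.sleep(1)

if __name__ == '__main__':
    lock = mp.Lock()                # passed explicitly, not hidden inside self
    procs = [mp.Process(target=run_job, args=(lock, i)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()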
On Unix operating systems, new processes are created via the fork primitive.
The fork primitive works by cloning the parent process's memory address space and assigning the copy to the child. The child gets a copy of the parent's memory as well as of its file descriptors and shared objects.
This means that, when you call fork, if the parent has a file open, the child will have it open too. The same applies to shared objects such as pipes, sockets, etc.
In Unix+CPython, locks are realized via the sem_open primitive, which is designed to be shared when a process is forked.
I usually recommend against mixing concurrency (multiprocessing in particular) and OOP, because it frequently leads to these kinds of misunderstandings.
EDIT:
I just saw that you are using Windows. Tim Peters gave the right answer. For the sake of abstraction, Python tries to provide OS-independent behaviour over its API. When you call an instance method, it pickles the object and sends it over a pipe, thus providing behaviour similar to the Unix case.
I'd recommend reading the programming guidelines for multiprocessing. Your issue is addressed in particular by the first point:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.
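A minimal sketch of that queue-based style; the worker function and the None sentinel are illustrative choices:

import multiprocessing as mp

def worker(inbox, outbox):
    # Pull work items until the sentinel arrives; send results back.
    for item in iter(inbox.get, None):
        outbox.put(item * item)

if __name__ == '__main__':
    inbox, outbox = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(inbox, outbox))
    p.start()
    for i in range(5):
        inbox.put(i)
    inbox.put(None)                 # sentinel: tells the worker to exit
    results = [outbox.get() for _ in range(5)]
    p.join()
    print(results)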

How to make spawned process wait forever till an item appears in queue in python

I have two classes that I use to automate gdb from the command line. The main class creates a process to execute gdb commands, then sends commands to the GDB_Engine class (in which I override the run() method; it also holds the gdb-related functions) depending on the user request. The two separate processes communicate through a queue which holds the jobs that should be done. To do this, I thought of this simple plan:
1. Check the queue
2. Wait if the queue is empty; if not, execute the first function in the queue
3. Rewrite the queue
4. Return to 1
But I couldn't find any function in the multiprocessing documentation to make the spawned process stop/sleep if the queue is empty. I'm sure that there's a way to do this, but since I'm still a beginner at Python I can't find my way easily. Things are still a bit confusing at this point.
Thanks in advance, have a good day! (I use python3.4 btw, if that matters)
EDIT: I don't have much going on right now, but I'm still posting my code at grzgrzgrz3's request. The codebase is somewhat large, so I'm only copy/pasting the multiprocessing-related parts.
GDB_Engine class, where I control gdb with pexpect:
from multiprocessing import Process, Queue

class GDB_Engine(Process):
    jobqueue = Queue()

    def __init__(self):
        super(GDB_Engine, self).__init__()
        self.jobqueue = GDB_Engine.jobqueue

    def run(self):
        # empty since I still don't know how to implement that algorithm
        pass
Main of the program
if __name__ == "__main__":
gdbprocess=GDB_Engine()
gdbprocess.start()
I simply put items in queue whenever I need to do a job like this(middle of the code that attaches gdb to the target):
gdbprocess.jobqueue.put("attachgdb")
My main idea is that the spawned process will compare the string in the queue and run the matching function in the GDB_Engine class. To show an example, here's the attach code:
def attachgdb(self, str):
    global p
    p = pexpect.spawnu('sudo gdb')
    p.expect_exact("(gdb) ")
    p.sendline("attach " + str)
    p.expect_exact("(gdb) ")
    p.sendline("c")
    p.expect_exact("Continuing")
I just found out that the get() method blocks the process automatically if the queue is empty, so the answer to my question was very simple. I should have tried the methods more before asking; looks like it was just another silly and unnecessary question of mine.
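Building on that, the run() loop can be as simple as a blocking get() plus a sentinel value to shut it down. A minimal sketch — the "quit" sentinel and moving the queue into __init__ are my choices, not the original code:

from multiprocessing import Process, Queue

class GDB_Engine(Process):
    def __init__(self):
        super(GDB_Engine, self).__init__()
        self.jobqueue = Queue()

    def run(self):
        while True:
            job = self.jobqueue.get()   # blocks until an item is available
            if job == "quit":
                break                   # sentinel stops the loop cleanly
            elif job == "attachgdb":
                pass                    # call the matching gdb helper here

if __name__ == "__main__":
    gdbprocess = GDB_Engine()
    gdbprocess.start()
    gdbprocess.jobqueue.put("attachgdb")
    gdbprocess.jobqueue.put("quit")
    gdbprocess.join()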

Python multiprocessing.Process object behaves like it would hold a reference to an object in another process. Why?

import multiprocessing as mp

def delay_one_second(event):
    print 'in SECONDARY process, preparing to wait for 1 second'
    event.wait(1)
    print 'in the SECONDARY process, preparing to raise the event'
    event.set()

if __name__ == '__main__':
    evt = mp.Event()
    print 'preparing to wait 10 seconds in the PRIMARY process'
    mp.Process(target=delay_one_second, args=(evt,)).start()
    evt.wait(10)
    print 'PRIMARY process, waking up'
This code (run nicely from inside a module with the "python module.py" command inside cmd.exe) yields a surprising result.
The main process apparently waits for only 1 second before waking up. For this to happen, the secondary process must have a reference to an object in the main process.
How can this be? I was expecting to have to use a multiprocessing.Manager(), to share objects between processes, but how is this possible?
I mean the Processes are not threads, they shouldn't use the same memory space. Anyone have any ideas what's going on here?
The short answer is that the shared memory is not managed by a separate process; it's managed by the operating system itself.
You can see how this works if you spend some time browsing through the multiprocessing source. You'll see that an Event object uses a Semaphore and a Condition, both of which rely on the locking behavior provided by the SemLock object. This, in turn, wraps a _multiprocessing.SemLock object, which is implemented in C and depends on either sem_open (POSIX) or CreateSemaphore (Windows).
These are C functions that enable access to shared resources that are managed by the operating system itself -- in this case, named semaphores. They can be shared between threads or processes; the OS takes care of everything. What happens is that when a new semaphore is created, it is given a handle. Then, when a new process that needs access to that semaphore is created, it's given a copy of the handle. It then passes that handle to sem_open or CreateSemaphore, and the operating system gives the new process access to the original semaphore.
So the memory is being shared, but it's being shared as part of the operating system's built-in support for synchronization primitives. In other words, in this case, you don't need to open a new process to manage the shared memory; the operating system takes over that task. But this is only possible because Event doesn't need anything more complex than a semaphore to work.
The documentation says that the multiprocessing module follows the threading API. My guess would be that it uses a mechanism similar to 'fork'. If you fork, your OS creates a copy of the current process. That means it copies the heap and stack, including all your variables and globals, and that's what you're seeing.
You can see it for yourself if you pass the function below to a new process.
def print_globals():
    print globals()
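For instance, it could be tried like this (a self-contained Python 3 variant of the snippet above; note that what you see depends on the start method, since fork copies the parent's globals while spawn re-imports the module):

import multiprocessing as mp

def print_globals():
    # Under fork this shows a copy of the parent's globals; under spawn the
    # child re-imports the module, so the contents can differ.
    print(sorted(globals().keys()))

if __name__ == '__main__':
    mp.Process(target=print_globals).start()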

Invoking a method in a thread

My code spawns a number of threads to manage communications with a number of I/O boards. Generally the threads receive events from the boards and update external data sources as necessary. The threads (1 or more) are invoked as:
phThreadDict[devId] = ifkit(self, phDevId, phIpAddr, phIpPort, phSerial)
phThreadDict[devId].start()
This works fine. However, in some cases I also need the thread to send a message to the boards. The thread contains a method that does the work and I call that method, from the main thread, as: (this example turns on a digital output)
phThreadDict[devId].writeDigitalOutput(digitalOut, True)
This is the method contained in the thread:
def writeDigitalOutput(self, index, state):
    interfaceKit.setOutputState(index, state)
threading.enumerate() produces:
{134997634: <ifkit(Thread-1, started daemon)>, 554878244: <ifkit(Thread-3, started daemon)>, 407897606: <tempsensor(Thread-4, started daemon)>}
and the instance is:
<ifkit(Thread-3, started daemon)>
This works fine if I have only one thread. But, if I have multiple threads, only one is used - the choice appears to be made at random when the program starts.
I suspect that storing the thread identifier in the dict is the problem, but still, it works with one thread.
Instead of storing your threads in a "simple" associative array, maybe you should instantiate a thread pool beforehand (you can find an example implementation here: h**p://code.activestate.com/recipes/577187-python-thread-pool/, or directly use the following lib: http://pypi.python.org/pypi/threadpool).
Also instantiate a "watchdog"; each of your threads will hold a reference to it, so when your threads need to do their callback they'll send the info back to this watchdog (beware of deadlocks; look at http://dabeaz.blogspot.fr/2009/11/python-thread-deadlock-avoidance_20.html).
Note : sorry for the lame "h**p" but SO won't let me post more than 2 links....
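One concrete way to make sure the work actually runs inside a specific board's thread, in the spirit of the watchdog idea, is to give each thread its own command queue keyed by devId; note that calling a method on a Thread object from the main thread executes in the main thread, which is why routing through a queue matters. A minimal sketch — the IfkitWorker class and its names are illustrative, not the original ifkit code:

import threading
import queue

class IfkitWorker(threading.Thread):
    """Illustrative stand-in for the ifkit thread; not the original class."""

    def __init__(self, dev_id):
        super().__init__(daemon=True)
        self.dev_id = dev_id
        self.commands = queue.Queue()

    def run(self):
        while True:
            func, args = self.commands.get()   # blocks until a command arrives
            if func is None:
                break                          # sentinel stops the thread
            func(*args)

    def write_digital_output(self, index, state):
        # The real code would call interfaceKit.setOutputState(index, state).
        print(self.dev_id, 'output', index, '->', state)

workers = {dev_id: IfkitWorker(dev_id) for dev_id in (1, 2)}
for w in workers.values():
    w.start()

# The main thread hands the command to the specific board's thread,
# so the work runs inside that thread rather than in the main thread.
target = workers[2]
target.commands.put((target.write_digital_output, (3, True)))

for w in workers.values():
    w.commands.put((None, ()))
for w in workers.values():
    w.join()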
