If a software project supports a version of Python that multiprocessing has been backported to, is there any reason to use threading.Lock over multiprocessing.Lock? Would a multiprocessing lock not be thread safe as well?
For that matter, is there a reason to use any synchronization primitives from threading that are also in multiprocessing?
The threading module's synchronization primitive are lighter and faster than multiprocessing, due to the lack of dealing with shared semaphores, etc. If you are using threads; use threading's locks. Processes should use multiprocessing's locks.
I would expect the multi-threading synchronization primitives to be quite faster as they can use shared memory area easily. But I suppose you will have to perform speed test to be sure of it. Also, you might have side-effects that are quite unwanted (and unspecified in the doc).
For example, a process-wise lock could very well block all threads of the process. And if it doesn't, releasing a lock might not wake up the threads of the process.
In short, if you want your code to work for sure, you should use the thread-synchronization primitives if you are using threads and the process-synchronization primitives if you are using processes. Otherwise, it might work on your platform only, or even just with your specific version of Python.
multiprocessing and threading packages have slightly different aims, though both are concurrency related. threading coordinates threads within one process, while multiprocessing provide thread-like interface for coordinating multiple processes.
If your application doesn't spawn new processes which require data synchronization, multiprocessing is a bit more heavy weight, and threading package should be better suited.
Related
What are best practices or work-arounds for using both multiprocessing and user threads in the same python application in Linux with respect to Issue 6721, Locks in python standard library should be sanitized on fork?
Why do I need both? I use child processes to do heavy computation that produce data structure results that are much too large to return through a queue -- rather they must be immediately stored to disk. It seemed efficient to have each of these child processes monitored by a separate thread, so that when finished, the thread could handle the IO of reading the large (eg multi GB) data back into the process where the result was needed for further computation in combination with the results of other child processes.
The children processes would intermittently hang, which I just (after much head pounding) found was 'caused' by using the logging module. Others have documented the problem here:
https://twiki.cern.ch/twiki/bin/view/Main/PythonLoggingThreadingMultiprocessingIntermixedStudy
which points to this apparently unsolved python issue: Locks in python standard library should be sanitized on fork; http://bugs.python.org/issue6721
Alarmed at the difficulty I had tracking this down, I answered:
Are there any reasons not to mix Multiprocessing and Threading module in Python
with the rather unhelpful suggestion to 'Be careful' and links to the above.
But the lengthy discussion re: Issue 6721 suggests that it is a 'bug' to use both multiprocessing (or os.fork) and user threads in the same application. With my limited understanding of the problem, I find too much disagreement in the discussion to conclude what are the work-arounds or strategies for using both multiprocessing and threading in the same application. My immediate problem was solved by disabling logging, but I create a small handful of other (explicit) locks in both parent and child processes, and suspect I am setting myself up for further intermittent deadlocks.
Can you give practical recommendations to avoid deadlocks while using locks and/or the logging module while using threading and multiprocessing in a python (2.7,3.2,3.3) application?
You will be safe if you fork off additional processes while you still have only one thread in your program (that is, fork from main thread, before spawning worker threads).
Your use case looks like you don't even need multiprocessing module; you can use subprocess (or even simpler os.system-like calls).
See also Is it safe to fork from within a thread?
I have been trying to write a simple python application to implement a worker queue
every webpage I found about threading has some random guy commenting on it, you shouldn't use python threading because this or that, can someone help me out? what is up with Python threading, can I use it or not? if yes which lib? the standard one is good enough?
Python's threads are perfectly viable and useful for many tasks. Since they're implemented with native OS threads, they allow executing blocking system calls and keep "running" simultaneously - by calling the blocking syscall in a separate thread. This is very useful for programs that have to do multiple things at the same time (i.e. GUIs and other event loops) and can even improve performance for IO bound tasks (such as web-scraping).
However, due to the Global Interpreter Lock, which precludes the Python interpreter of actually running more than a single thread simultaneously, if you expect to distribute CPU-intensive code over several CPU cores with threads and improve performance this way, you're out of luck. You can do it with the multiprocessing module, however, which provides an interface similar to threading and distributes work using processes rather than threads.
I should also add that C extensions are not required to be bound by the GIL and many do release it, so C extensions can employ multiple cores by using threads.
So, it all depends on what exactly you need to do.
You shouldn't need to use
threading. 95% of code does not need
threads.
Yes, Python threading is
perfectly valid, it's implemented
through the operating system's native
threads.
Use the standard library
threading module, it's excellent.
GIL should provide you some information on that topic.
The principal challenge of
multi-threaded applications is
coordinating threads that share data
or other resources. To that end, the
threading module provides a number of
synchronization primitives including
locks, events, condition variables,
and semaphores.
While those tools are powerful, minor
design errors can result in problems
that are difficult to reproduce. So,
the preferred approach to task
coordination is to concentrate all
access to a resource in a single
thread and then use the Queue module
to feed that thread with requests from
other threads. Applications using
Queue.Queue objects for inter-thread
communication and coordination are
easier to design, more readable, and
more reliable.
It, basically, states to use Queue.Queue for inter-thread communication and coordination, instead of the powerful tools such as semaphores, locks, etc.
My question is, what's the drawback of the suggested method? When should one use the more "powerful tools" instead, and why?
Edit
To be clear, I know what semaphores are. I was just wondering why the Python documentation suggests to use the Queue.Queue method instead of the "powerful tools" -- I'm simply using the documentation's own verbiage, not coming up with my own.
I'm not sure I'd consider semaphores and locks "more powerful methods", as you suggest.
Queues are generally a higher-order abstraction. In other words, you could use semaphores and locks to build thread-safe queues.
Which you'd use where depends on your application. Queues are good for passing "work" between threads and processes, and semaphores/locks are good for protecting critical sections or shared resources, so only one thread can access at a time.
Take a look at the source code for Python's thread-safe queue. The queue class builds a useful abstraction from 3 Conditions and a Lock, correctly.
I wouldn't say coordination is the hardest problem. In shared-state multithreading the hardest thing is preventing threads from "sharing". You always have to look out for non-deterministic behaviour due to threads accidentally sharing and stomping on each other's data.
So, I recommend you don't use threads at all. You should use the lower-level tools when you feel you haven't spent enough time tracking down heisenbugs, but if there's any way you can get away with using a simple queue, go for it.
In my script, I have a function foo which basically uses pynotify to notify user about something repeatedly after a time interval say 15 minutes.
def foo:
while True:
"""Does something"""
time.sleep(900)
My main script has to interact with user & does all other things so I just cant call the foo() function. directly.
Whats the better way of doing it and why?
Using fork or threads?
I won't tell you which one to use, but here are some of the advantages of each:
Threads can start more quickly than processes, and threads use fewer operating system resources than processes, including memory, file handles, etc. Threads also give you the option of communicating through shared variables (although many would say this is more of a disadvantage than an advantage - See below).
Processes each have their own separate memory and variables, which means that processes generally communicate by sending messages to each other. This is much easier to do correctly than having threads communicate via shared memory. Processes can also run truly concurrently, so that if you have multiple CPU cores, you can keep all of them busy using processes. In Python*, the global interpreter lock prevents threads from making much use of more than a single core.
* - That is, CPython, which the implementation of Python that you get if you go to http://python.org and download Python. Other Python implementations (such as Jython) do not necessarily prohibit Python from running threads on multiple CPUs simultaneously. Thanks to #EOL for the clarification.
For these kinds of problems, neither threads nor forked processes seem the right approach. If all you want to do is to once every 15 minutes notify the user of something, why not use an event loop like GLib's or Twisted's reactor ? This allows you to schedule operations that should run once in a while, and get on with the rest of your program.
Using multiple processes lets you exploit multiple CPU cores at the same time, while, in CPython, using threads doesn't (threads take turns using a single CPU core) -- so, if you have CPU intensive work and absolutely want to use threads, you should consider Jython or IronPython; with CPython, this consideration is often enough to sway the choice towards the multiprocessing module and away from the threading one (they offer pretty similar interfaces, because multiprocessing was designed to be easily put in place in lieu of threading).
Net of this crucial consideration, threads might often be a better choice (performance-wise) on Windows (where making a new process is a heavy task), but less often on Unix variants (Linux, BSD versions, OpenSolaris, MacOSX, ...), since making a new process is faster there (but if you're using IronPython or Jython, you should check, on the platforms you care about, that this still applies in the virtual machines in question -- CLR with either .NET or Mono for IronPython, your JVM of choice for Jython).
Processes are much simpler. Just turn them loose and let the OS handle it.
Also, processes are often much more efficient. Processes do not share a common pool of I/O resources; they are completely independent.
Python's subprocess.Popen handles everything.
If by fork you mean os.fork then I would avoid using that. It is not cross platform and too low level - you would need to implement communication between the processes yourself.
If you want to use a separate process then use either the subprocess module or if you are on Python 2.6 or later the new multiprocessing module. This has a very similar API to the threading module, so you could start off using threads and then easily switch to processes, or vice-versa.
For what you want to do I think I would use threads, unless """does something""" is CPU intensive and you want to take advantage of multiple cores, which I doubt in this particular case.
Are the locks from the threading module interchangeable with those from the multiprocessing module?
You can typically use the two interchangeably, but you need to cognizant of the differences. For example, multiprocessing.Event is backed by a named semaphore, which is sensitive to the platform under the application.
Multiprocessing.Lock is backed by Multiprocessing.SemLock - so it needs named semaphores. In essence, you can use them interchangeably, but using multiprocessing's locks introduces some platform requirements on the application (namely, it doesn't run on BSD :))
I don't think so. Threading locks are within the same process, while the multiprocessing lock would likely be in shared memory.
Last time I checked, multiprocessing doesn't allow you to share the lock in a Queue, which is a threading lock.
Yes, you can use locks from the multiprocessing module as normal in your one-process application, but if you're using multiprocessing, you should use its locks.