I have a Python app that needs to fetch data from 2-3 different sources (SQL Server, MongoDB etc..) and it can be done in parallel, as I simply need all of the data together later, and each request does not rely on the others.
I couldn't figure which is better for this case - threads, processes or async await?
I read that differences are mostly in CPU usage and I/O. But what if I simply wish to make multiple requests simultaneously (and not sequentially)? Of course, no CPU usage here at all.
I'd suggest you have a look at python's builtin Threading module...
... a quote from the doc:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
in short... multithreading has the problem of GIL (e.g. global interpreter lock) but as you can read here...
...the GIL is always released when doing I/O.
Related
Is it possible launch multiple threads on all available CPUs rather than one? An example code would be great.
Alternatively, can I span multiple processes and then create multi threading within each process?
I am using multithreading which works fine for IO side of my script. However, my script is also computation expensive so I would like to launch multiple threads on multiple CPUs.
My code flow:
def worker(url):
extract url (io bound)
process url content (cpu bound)
What should be the efficient way to deal with such type of worker?
Is it possible launch multiple threads on all available CPUs rather than one?
In general, threads run on whatever CPU is available. Unless you have specified a thread/process to run on a specific CPU. (How this is done varies per operating system)
However, if you are using the Python implementation from python.org ("CPython"), it doesn't matter. CPython has a "Global Interpreter Lock" that enforces that only one thread at a time is executing Python bytecode. So using threads will not yield improved processing performance with CPython.
So for computationally expensive tasks, you should probably use the multiprocessing module to do it in different processes. If the same work is done on a lot of data, using a multiprocessing.Pool is generally a good idea.
CPython has a Global Interpreter Lock (GIL).
So, multiple threads cannot concurrently run Python bytecodes.
What then is the use and relevance of the threading package in CPython ?
During I/O the GIL is released to other threads can run.
Also some extensions (like numpy) can release the GIL when doing calculations.
So an important purpose is to improve performance on not CPU-bound programs. From the Python documentation for the threading module:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Another benefit of threading is to do long-running calculations in a GUI program without having to chop up your calculations in small enough pieces to make them fit in timeout functions.
Also keep in mind that while CPython has a GIL now, that might not always be the case in the future.
When python runs some code, the code is compiled in "atomic" commands (= small instructions). Every few hundred atomic instructions python will switch to the next thread and execute the the instructions for that thread. This allows running code pseudo-parallel.
Lets assume you have this code:
def f1():
while True:
# wait for incomming connections and serve a website to them
def f2():
while True:
# get new tweets and process them
And you want to execute f1() and f2() at the same time. In this case, you can simpy use threading and dont need to worry about breaking the loops every now and then to execute the other function. This is also way easier than asynchronous programming.
Simple said: It makes writing scripts which needs to do multiple things easier.
Also, like #roland-smith said, Python releases the GIL during I/O and some other low-level c-code.
i just recently read an article about the GIL (Global Interpreter Lock) in python.
Which seems to be some big issue when it comes to Python performance. So i was wondering myself
what would be the best practice to archive more performance. Would it be threading or
either multiprocessing? Because i hear everybody say something different, it would be
nice to have one clear answer. Or at least to know the pros and contras of multithreading
against multiprocessing.
Kind regards,
Dirk
It depends on the application, and on the python implementation that you are using.
In CPython (the reference implementation) and pypy the GIL only allows one thread at a time to execute Python bytecode. Other threads might be doing I/O or running extensions written in C.
It is worth noting that some other implementations like IronPython and JPython don't have a GIL.
A characteristic of threading is that all threads share the same interpreter and all the live objects. So threads can share global data almost without extra effort. You need to use locking to serialize access to data, though! Imagine what would happen if two threads would try to modifiy the same list.
Multiprocessing actually runs in different processes. That sidesteps the GIL, but if large amounts of data need to be shared between processes that data has to be pickled and transported to another process via IPC where it has to be unpickled again. The multiprocessing module can take care of the messy details for you, but it still adds overhead.
So if your program wants to run Python code in parallel but doesn't need to share huge amounts of data between instances (e.g. just filenames of files that need to be processed), multiprocessing is a good choice.
Currently multiprocessing is the only way that I'm aware of in the standard library to use all the cores of your CPU at the same time.
On the other hand if your tasks need to share a lot of data and most of the processing is done in extension or is I/O, threading would be a good choice.
I'm confused about Twisted threading.
I've heard and read more than a few articles, books, and sat through a few presentations on the subject of threading vs processes in Python. It just seems to me that unless one is doing lots of IO or wanting to utilize shared memory across jobs, then the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses Threads (pThreads from the python threading module). And Twisted seems to perform really really well in processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long in depth explanation, the GIL is what allows only one thread at a time to execute python code. What this means in effect is you will not be able to take advantage of multiple cores with a single multi-threaded twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the python code associated is still effectively single-threaded.
What twisted's threading is mainly useful for is using it along with blocking I/O based modules. A prime example of this is database API's, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thusly, to use PostgreSQL for example from a twisted app, one has to either block or use something like twisted.enterprise.adbapi which is a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other stuff is going on. This can allow other python code to run because the socket module (among most others involving operating system I/O) will release the GIL while in a system call.
That said, you can use twisted to write a network application talking to many twisted (or non-twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of twisted's asynchronous primitives. For example you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferred's complete. (thus allowing you to do your map call) If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol, and can be used very trivially to implement a network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that
you lose out on simple process management, and
the subprocesses can't share memory as if they would if they were forked on a unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses. And problem 1) can be solved by using a process management system like supervisord
Edit For more on python and the GIL, you should watch Dave Beazley's talks on the subject ( website , video, slides )
In my script, I have a function foo which basically uses pynotify to notify user about something repeatedly after a time interval say 15 minutes.
def foo:
while True:
"""Does something"""
time.sleep(900)
My main script has to interact with user & does all other things so I just cant call the foo() function. directly.
Whats the better way of doing it and why?
Using fork or threads?
I won't tell you which one to use, but here are some of the advantages of each:
Threads can start more quickly than processes, and threads use fewer operating system resources than processes, including memory, file handles, etc. Threads also give you the option of communicating through shared variables (although many would say this is more of a disadvantage than an advantage - See below).
Processes each have their own separate memory and variables, which means that processes generally communicate by sending messages to each other. This is much easier to do correctly than having threads communicate via shared memory. Processes can also run truly concurrently, so that if you have multiple CPU cores, you can keep all of them busy using processes. In Python*, the global interpreter lock prevents threads from making much use of more than a single core.
* - That is, CPython, which the implementation of Python that you get if you go to http://python.org and download Python. Other Python implementations (such as Jython) do not necessarily prohibit Python from running threads on multiple CPUs simultaneously. Thanks to #EOL for the clarification.
For these kinds of problems, neither threads nor forked processes seem the right approach. If all you want to do is to once every 15 minutes notify the user of something, why not use an event loop like GLib's or Twisted's reactor ? This allows you to schedule operations that should run once in a while, and get on with the rest of your program.
Using multiple processes lets you exploit multiple CPU cores at the same time, while, in CPython, using threads doesn't (threads take turns using a single CPU core) -- so, if you have CPU intensive work and absolutely want to use threads, you should consider Jython or IronPython; with CPython, this consideration is often enough to sway the choice towards the multiprocessing module and away from the threading one (they offer pretty similar interfaces, because multiprocessing was designed to be easily put in place in lieu of threading).
Net of this crucial consideration, threads might often be a better choice (performance-wise) on Windows (where making a new process is a heavy task), but less often on Unix variants (Linux, BSD versions, OpenSolaris, MacOSX, ...), since making a new process is faster there (but if you're using IronPython or Jython, you should check, on the platforms you care about, that this still applies in the virtual machines in question -- CLR with either .NET or Mono for IronPython, your JVM of choice for Jython).
Processes are much simpler. Just turn them loose and let the OS handle it.
Also, processes are often much more efficient. Processes do not share a common pool of I/O resources; they are completely independent.
Python's subprocess.Popen handles everything.
If by fork you mean os.fork then I would avoid using that. It is not cross platform and too low level - you would need to implement communication between the processes yourself.
If you want to use a separate process then use either the subprocess module or if you are on Python 2.6 or later the new multiprocessing module. This has a very similar API to the threading module, so you could start off using threads and then easily switch to processes, or vice-versa.
For what you want to do I think I would use threads, unless """does something""" is CPU intensive and you want to take advantage of multiple cores, which I doubt in this particular case.