Python threading, queuing, asyncing... What does it all mean?

Python threading, queuing, asyncing... What does it all mean? - python

I've been experimenting with threading recently in Python and was curious when to use what.
For example, when should I use multithreading over multiprocessing? What would be a scenario when I should be using asynchronous IO rather than threading?
I mostly understand what each does (I think) but I can't see any benefits/downsides of using one over the other.
What should I use if I was creating a small HTTP server?
What should I use if I was creating a small HTTP client?
This baffles me...

What you want talk about is not specific to python only it's about multiprocessing vs threading in general i think you can find in google lot of argument from the two side the ones that prefer threading and the others that prefer multiprocessing.
But for python multi-threading is limited (if you're using CPython) by the GIL (Global Interpreter Lock), so most python programmer prefer using the multiprocessing over the threading (it's Guido recommendation)
Nevertheless, you re right the GIL is
not as bad as you would initially
think: you just have to undo the
brainwashing you got from Windows and
Java proponents who seem to consider
threads as the only way to approach
concurrent activities, Guido van Rossum.
you can find here some more info

Python multiprocessing makes sense when you have a machine with multiple cores and/or CPUs. The main difference between using threads and processes is that processes do not share an address space, and thus one process cannot easily access the data of another process. That is why the multiprocessing module provides managers and queues and stuff like that.
The issue with threading is Pythons Global Interpreter Lock, which seriously messes with multithreaded applications.
Asynchronous IO is useful when you have long running IO operations (read large file, wait for response from network) and do not want your application to block. Many operating systems offer built-in implementations of that.
So, for your server you would probably use multiprocessing or multithreading, and for your client async IO is more fitting.

Related

Python asyncio or multithread for fetching data from different sources?

I have a Python app that needs to fetch data from 2-3 different sources (SQL Server, MongoDB etc..) and it can be done in parallel, as I simply need all of the data together later, and each request does not rely on the others.
I couldn't figure which is better for this case - threads, processes or async await?
I read that differences are mostly in CPU usage and I/O. But what if I simply wish to make multiple requests simultaneously (and not sequentially)? Of course, no CPU usage here at all.

I'd suggest you have a look at python's builtin Threading module...
... a quote from the doc:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
in short... multithreading has the problem of GIL (e.g. global interpreter lock) but as you can read here...
...the GIL is always released when doing I/O.

Python and truly concurrent threads

I've been reading for hours now and I can completely figure out how python multi threading is faster than a single thread.
The question really stems from GIL. If there is GIL, and only one thread is really running at any single time, how can multi threading be faster than a single thread?
I read that with some operations GIL is released (like writing to file). Is that what makes multi threading faster?
And about greenlets. How do those help with concurrency at all? So far all the purpose I see for them is easy switching between functions and less complicated yield functions.
EDIT: And how in the world a server like Tornado can deal with thousands of simultaneous connections?

You are correct - when python is waiting on C code execution the GIL is released, and that is how you can get some speedup. But only one line of python can be executed at a time. Note that this is a CPython (implementation) detail, and not strictly speaking part of the language python itself. For example, Jython and IronPython have no GIL and can fully exploit multiprocessor systems.
If you need truly concurrent programming in CPython, you should be looking at multiprocessing rather than threading.

Python multi threading Yay or nay?

I have been trying to write a simple python application to implement a worker queue
every webpage I found about threading has some random guy commenting on it, you shouldn't use python threading because this or that, can someone help me out? what is up with Python threading, can I use it or not? if yes which lib? the standard one is good enough?

Python's threads are perfectly viable and useful for many tasks. Since they're implemented with native OS threads, they allow executing blocking system calls and keep "running" simultaneously - by calling the blocking syscall in a separate thread. This is very useful for programs that have to do multiple things at the same time (i.e. GUIs and other event loops) and can even improve performance for IO bound tasks (such as web-scraping).
However, due to the Global Interpreter Lock, which precludes the Python interpreter of actually running more than a single thread simultaneously, if you expect to distribute CPU-intensive code over several CPU cores with threads and improve performance this way, you're out of luck. You can do it with the multiprocessing module, however, which provides an interface similar to threading and distributes work using processes rather than threads.
I should also add that C extensions are not required to be bound by the GIL and many do release it, so C extensions can employ multiple cores by using threads.
So, it all depends on what exactly you need to do.

You shouldn't need to use
threading. 95% of code does not need
threads.
Yes, Python threading is
perfectly valid, it's implemented
through the operating system's native
threads.
Use the standard library
threading module, it's excellent.

GIL should provide you some information on that topic.

Twisted Threading + MapReduce on a single node/server?

I'm confused about Twisted threading.
I've heard and read more than a few articles, books, and sat through a few presentations on the subject of threading vs processes in Python. It just seems to me that unless one is doing lots of IO or wanting to utilize shared memory across jobs, then the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses Threads (pThreads from the python threading module). And Twisted seems to perform really really well in processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?

The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long in depth explanation, the GIL is what allows only one thread at a time to execute python code. What this means in effect is you will not be able to take advantage of multiple cores with a single multi-threaded twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the python code associated is still effectively single-threaded.
What twisted's threading is mainly useful for is using it along with blocking I/O based modules. A prime example of this is database API's, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thusly, to use PostgreSQL for example from a twisted app, one has to either block or use something like twisted.enterprise.adbapi which is a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other stuff is going on. This can allow other python code to run because the socket module (among most others involving operating system I/O) will release the GIL while in a system call.
That said, you can use twisted to write a network application talking to many twisted (or non-twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of twisted's asynchronous primitives. For example you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferred's complete. (thus allowing you to do your map call) If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol, and can be used very trivially to implement a network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that
you lose out on simple process management, and
the subprocesses can't share memory as if they would if they were forked on a unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses. And problem 1) can be solved by using a process management system like supervisord
Edit For more on python and the GIL, you should watch Dave Beazley's talks on the subject ( website , video, slides )

What threading module should I use to prevent disk IO from blocking network IO?

I have a Python application that, to be brief, receives data from a remote server, processes it, responds to the server, and occasionally saves the processed data to disk. The problem I've encountered is that there is a lot of data to write, and the save process can take upwards of half a minute. This is apparently a blocking operation, so the network IO is stalled during this time. I'd like to be able to make the save operation take place in the background, so-to-speak, so that the application can continue to communicate with the server reasonably quickly.
I know that I probably need some kind of threading module to accomplish this, but I can't tell what the differences are between thread, threading, multiprocessing, and the various other options. Does anybody know what I'm looking for?

Since you're I/O bound, then use the threading module.
You should almost never need to use thread, it's a low-level interface; the threading module is a high-level interface wrapper for thread.
The multiprocessing module is different from the threading module, multiprocessing uses multiple subprocesses to execute a task; multiprocessing just happens to use the same interface as threading to reduce learning curve. multiprocessing is typically used when you have CPU bound calculation, and need to avoid the GIL (Global Interpreter Lock) in a multicore CPU.
A somewhat more esoteric alternative to multi-threading is asynchronous I/O using asyncore module. Another options includes Stackless Python and Twisted.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.