Multithreading With Very Large Number of Threads

Multithreading With Very Large Number of Threads - python

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.

For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.

To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.

Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/

Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

Related

Python multithreading or multiprocessing?

Since this is a design question I don't think code would do it justice.
I'm making a program to process and log a high bandwidth stream of data and concurrently trying to observe that data live. (like a production line and an inspector watching the production line)
I want to distribute the load between cores on my computer to leave room for future functionality but I can't figure out if I can use multiprocessing for this. It seems most examples of multiprocessing all have initial data sets and outputs and don't need to be actively communicating throughout their lifetime.
Am I able to use multiprocessing to actively communicate between processes or is that a bad idea and I should stick with multithreading?

If you expect the computation needed to process the full load of signal processing exceeds the capacity of a single core, you should consider spreading the load over multiple cores using multi-processing.
If you expect the computation needed to fit on a single core, but you expect many slow or blocking I/O operations to hold up the work, you should consider multi-threading.
If overall performance is an issue, you should reconsider writing the actual solution in pure regular Python and instead look for an implementation in another language, or in a version of Python that gets you a solution closer to the hardware. You can of course still come up with a result that would be usable from a regular Python program.

Multiprocessing and multithreading are two different way of executing code.
In Multiprocessing, you essentially just throw more raw computing power at it (hence the name) as your using more CPUs.
https://en.wikipedia.org/wiki/Multiprocessing#:~:text=Multiprocessing%20is%20the%20use%20of,to%20allocate%20tasks%20between%20them.
Multithreading is when the CPU basically breaks the code up into more "threads" of code to perform multiple task simultaneously.
https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)#:~:text=In%20computer%20architecture%2C%20multithreading%20is,supported%20by%20the%20operating%20system.
For Python, which is normal linear (i.e. it runs from line 1 down to the end of the code in order), threading is better suited for network bound or multiprocessing if its CPU bound.
https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/
So that's the difference, how the code turns out and how the system will operate will dictate whether you use multiprocessing or multithreading.

Concurrency and race condition [duplicate]

Does the presence of python GIL imply that in python multi threading the same operation is not so different from repeating it in a single thread?.
For example, If I need to upload two files, what is the advantage of doing them in two threads instead of uploading them one after another?.
I tried a big math operation in both ways. But they seem to take almost equal time to complete.
This seems to be unclear to me. Can someone help me on this?.
Thanks.

Python's threads get a slightly worse rap than they deserve. There are three (well, 2.5) cases where they actually get you benefits:
If non-Python code (e.g. a C library, the kernel, etc.) is running, other Python threads can continue executing. It's only pure Python code that can't run in two threads at once. So if you're doing disk or network I/O, threads can indeed buy you something, as most of the time is spent outside of Python itself.
The GIL is not actually part of Python, it's an implementation detail of CPython (the "reference" implementation that the core Python devs work on, and that you usually get if you just run "python" on your Linux box or something.
Jython, IronPython, and any other reimplementations of Python generally do not have a GIL, and multiple pure-Python threads can execute simultaneously.
The 0.5 case: Even if you're entirely pure-Python and see little or no performance benefit from threading, some problems are really convenient in terms of developer time and difficulty to solve with threads. This depends in part on the developer, too, of course.

It really depends on the library you're using. The GIL is meant to prevent Python objects and its internal data structures to be changed at the same time. If you're doing an upload, the library you use to do the actual upload might release the GIL while it's waiting for the actual HTTP request to complete (I would assume that is the case with the HTTP modules in the standard library, but I didn't check).
As a side note, if you really want to have things running in parallel, just use multiple processes. It will save you a lot of trouble and you'll end up with better code (more robust, more scalable, and most probably better structured).

It depends on the native code module that's executing. Native modules can release the GIL and then go off and do their own thing allowing another thread to lock the GIL. The GIL is normally held while code, both python and native, are operating on python objects. If you want more detail you'll probably need to go and read quite a bit about it. :)
See:
What is a global interpreter lock (GIL)? and Thread State and the Global Interpreter Lock

Multithreading is a concept where two are more tasks need be completed simultaneously, for example, I have word processor in this application there are N numbers of a parallel task have to work. Like listening to keyboard, formatting input text, sending a formatted text to display unit. In this context with sequential processing, it is time-consuming and one task has to wait till the next task completion. So we put these tasks in threads and simultaneously complete the task. Three threads are always up and waiting for the inputs to arrive, then take that input and produce the output simultaneously.
So multi-threading works faster if we have multi-core and processors. But in reality with single processors, threads will work one after the other, but we feel it's executing with greater speed, Actually, one instruction executes at a time and a processor can execute billions of instructions at a time. So the computer creates illusion that multi-task or thread working parallel. It just an illusion.

Why do we blame GIL if CPU can execute one process (light weight) at a time? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Fastest way to handle threading with callback in Python

I'm playing around with some measurement instruments through PyVisa (Python wrapper for VISA). Specifically I need to read the measurement value from four instruments, similar to this:
current1 = instrument1.ask("READ?")
current2 = instrument2.ask("READ?")
current3 = instrument3.ask("READ?")
current4 = instrument4.ask("READ?")
For my application, speed is a must. Individually, I can get between 50 and 200 measurements per second from the four instruments, but unfortunately my current code evaluates the four instruments serially.
From what I've read, there's some options with Threading and Multiprocessing in Python, but it's not obvious what the best, and fastest, option for me.
Best case scenario I will be spawning ~4x50 threads per second, so overhead is a bit of a concern.
The task is in no way CPU intensive, it is simply a matter of waiting for the readout from the instrument.
Any advice on what the proper course of action might be?
Many thanks in advance!

Try using the multiprocessing module in python. Because of the way the python interpreter is written, the threading module for python actually slows down your application when using multiple threads on a computer with multiple processors. This is due to something called the Global Interpreter Lock (GIL) which only allows one python thread per process to be run at a time. This makes things like memory management and concurrency easier for the interpreter.
I would suggest trying to create a process for each instrument (producer process) and send the results to a shared queue that gets processed by a single process (consumer process).
I would also limit the number of processes that you create to the number of processors on your computer. Creating too many processes introduces a lot of overhead.
If you really feel like you need the 200 threads you suggested in your post, I would program this in Java or C++.

I'm not sure if this fits your requirements I'm not too familiar with pyVisa, however have you looked into pypy's stackless examples the speed increase is pretty impressive, and you're able to spawn almost infinite amount of 'threadlets' without much hindrance to your system / program.
pypy is kinda limited in terms of supporting various packages but looks like pyVisa isn't one of them.
https://pypi.python.org/pypi/PyVISA

What would I use Stackless Python for?

There are many questions related to Stackless Python. But none answering this my question, I think (correct me if wrong - please!). There's some buzz about it all the time so I curious to know. What would I use Stackless for? How is it better than CPython?
Yes it has green threads (stackless) that allow quickly create many lightweight threads as long as no operations are blocking (something like Ruby's threads?). What is this great for? What other features it has I want to use over CPython?

It allows you to work with massive amounts of concurrency. Nobody sane would create one hundred thousand system threads, but you can do this using stackless.
This article tests doing just that, creating one hundred thousand tasklets in both Python and Google Go (a new programming language): http://dalkescientific.com/writings/diary/archive/2009/11/15/100000_tasklets.html
Surprisingly, even if Google Go is compiled to native code, and they tout their co-routines implementation, Python still wins.
Stackless would be good for implementing a map/reduce algorithm, where you can have a very large number of reducers depending on your input data.

Stackless Python's main benefit is the support for very lightweight coroutines. CPython doesn't support coroutines natively (although I expect someone to post a generator-based hack in the comments) so Stackless is a clear improvement on CPython when you have a problem that benefits from coroutines.
I think the main area where they excel are when you have many concurrent tasks running within your program. Examples might be game entities that run a looping script for their AI, or a web server that is servicing many clients with pages that are slow to create.
You still have many of the typical problems with concurrency correctness however regarding shared data, but the deterministic task switching makes it easier to write safe code since you know exactly where control will be transferred and therefore know the exact points at which the shared state must be up to date.

Thirler already mentioned that stackless was used in Eve Online. Keep in mind, that:
(..) stackless adds a further twist to this by allowing tasks to be separated into smaller tasks, Tasklets, which can then be split off the main program to execute on their own. This can be used for fire-and-forget tasks, like sending off an email, or dispatching an event, or for IO operations, e.g. sending and receiving network packets. One tasklet waits for a packet from the network while others continue running the game loop.
It is in some ways like threads, but is non-preemptive and explicitly scheduled, so there are fewer issues with synchronization. Also, switching between tasklets is much faster than thread switching, and you can have a huge number of active tasklets whereas the number of threads is severely limited by the computer hardware.
(got this citation from here)
At PyCon 2009 there was given a very interesting talk, describing why and how Stackless is used at CCP Games.
Also, there is a very good introductory material, which describes why stackless is a good solution for Your applications. (it may be somewhat old, but I think that it is worth reading).

EVEOnline is largely programmed in Stackless Python. They have several dev blogs on the use of it. It seems it is very useful for high performance computing.

While I've not used Stackless itself, I have used Greenlet for implementing highly-concurrent network applications. Some of the use cases Linden Lab has put it towards are: high-performance smart proxies, a fast system for distributing commands over huge numbers of machines, and an application that does a ton of database writes and reads (at a ratio of about 1:2, which is very write-heavy, so it's spending most of its time waiting for the database to return), and a web-crawler-type-thing for internal web data. Basically any app that's expecting to have to do a lot of network I/O will benefit from being able to create a bajillion lightweight threads. 10,000 connected clients doesn't seem like a huge deal to me.
Stackless or Greenlet aren't really a complete solution, though. They are very low-level and you're going to have to do a lot of monkeywork to build an application with them that uses them to their fullest. I know this because I maintain a library that provides a networking and scheduling layer on top of Greenlet, specifically because writing apps is so much easier with it. There are a bunch of these now; I maintain Eventlet, but also there is Concurrence, Chiral, and probably a few more that I don't know about.
If the sort of app you want to write sounds like what I wrote about, consider one of these libraries. The choice of Stackless vs Greenlet is somewhat less important than deciding what library best suits the needs of what you want to do.

The basic usefulness for green threads, the way I see it, is to implement a system in which you have a large amount of objects that do high latency operations. A concrete example would be communicating with other machines:
def Run():
# Do stuff
request_information() # This call might block
# Proceed doing more stuff
Threads let you write the above code naturally, but if the number of objects is large enough, threads just cannot perform adequately. But you can use green threads even for in really large amounts. The request_information() above could switch out to some scheduler where other work is waiting and return later. You get all the benefits of being able to call "blocking" functions as if they return immediately without using threads.
This is obviously very useful for any kind of distributed computing if you want to write code in a straightforward way.
It is also interesting for multiple cores to mitigate waiting for locks:
def Run():
# Do some calculations
green_lock(the_foo)
# Do some more calculations
The green_lock function would basically attempt to acquire the lock and just switch out to a main scheduler if it fails due to other cores using the object.
Again, green threads are being used to mitigate blocking, allowing code to be written naturally and still perform well.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.